
MapR's Hadoop Distribution

Who am I?

http://www.mapr.com/company/events/speaking/pdb-10-16-12
Keys Botzum
kbotzum@maprtech.com
Senior Principal Technologist, MapR Technologies
MapR Federal and Eastern Region
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
Hadoop in 15 minutes
How to Scale?
Big Data has Big Problems
Petabytes of data
MTBF on 1000s of nodes is < 1 day
Something is always broken
There are limits to scaling Big Iron
Sequential and random access just don't scale
Example: Update 1% of 1TB
Data consists of 10^10 records, each 100 bytes
Task: Update 1% of these records
Approach 1: Just Do It
Each update involves read, modify and write
t = 1 seek + 2 disk rotations = 20 ms
1% x 10^10 x 20 ms = 2 mega-seconds = 23 days
Total time dominated by seek and rotation times
Approach 2: The Hard Way
Copy the entire database 1GB at a time
Update records on the fly
t = 2 x 1 GB / 100 MB/s + 20 ms = 20 s
10^3 x 20 s = 20,000 s = 5.6 hours
100x faster to do 100x more work!
Moral: Read data sequentially even if you only want 1% of it
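The arithmetic above can be checked in a few lines, using the figures the slides assume (20 ms per seek plus rotation, 100 MB/s sequential throughput):

```python
# Back-of-the-envelope check of the two approaches, with the slide's
# assumed figures: 20 ms per random access, 100 MB/s sequential.
records = 10**10              # 100-byte records -> 1 TB total
seek_time = 0.020             # seconds per random read-modify-write

# Approach 1: update 1% of the records in place, one seek each.
random_total = 0.01 * records * seek_time

# Approach 2: stream-copy all 1000 chunks of 1 GB, updating on the fly.
sequential_total = 1000 * (2 * 1e9 / 100e6 + 0.020)

print(round(random_total / 86400, 1), "days")      # 23.1 days
print(round(sequential_total / 3600, 1), "hours")  # 5.6 hours
```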
MapReduce: A Paradigm Shift
Distributed computing platform
Large clusters
Commodity hardware
Pioneered at Google
BigTable, MapReduce and Google File System
Commercially available as Hadoop
Hadoop
Commodity hardware: thousands of nodes
Handles Big Data: petabytes and more
Sequential file access: each spindle provides data as fast as possible
Sharding
Data distributed evenly across the cluster
More spindles and CPUs working on different parts of the data set
Reliability: self-healing (mostly), self-balancing
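As a toy illustration of the sharding idea (not MapR's actual placement algorithm), hashing each record's key spreads the data evenly, so every spindle and CPU works on a different slice:

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size for the illustration

def shard_for(key: bytes) -> int:
    # Hash the record key and map it to a node; a uniform hash gives an
    # even spread, which is what lets all nodes work in parallel.
    return int(hashlib.md5(key).hexdigest(), 16) % NODES

placement = Counter(shard_for(str(i).encode()) for i in range(10_000))
print(dict(placement))  # roughly 2,500 records per node
```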
MapReduce
Parallel computing framework
Function shipping
Moves the computation to the data, rather than the typical reverse of moving data to the computation
Takes into account sharding
Hides most of complexity from developers
Inside MapReduce

Word count over a snippet of "The Walrus and the Carpenter":

Input:
  "The time has come," the Walrus said,
  "To talk of many things:
  Of shoes and ships and sealing-wax

Map emits (word, 1) pairs: (the, 1), (time, 1), (has, 1), (come, 1), ...
Shuffle and sort group the values by key: (come, [3,2,1]), (the, [1,2,1]), (time, [10,1,3]), ...
Reduce sums each group: (come, 6), (the, 4), (has, 8), (time, 14)

Input -> Map -> Shuffle and sort -> Reduce -> Output
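The flow above can be sketched in a few lines; this is a minimal single-process stand-in for what Hadoop distributes across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle and sort: bring all values for the same key together
    # (in Hadoop, the framework does this between mappers and reducers).
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, [v for _, v in group])

def reduce_phase(grouped):
    # Reduce: sum the grouped counts for each word.
    for key, values in grouped:
        yield (key, sum(values))

lines = ['the time has come the walrus said',
         'to talk of many things']
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(counts['the'], counts['time'])  # 2 1
```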
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
The MapR Distribution for Apache Hadoop

Commercial Hadoop Distribution


Open, enterprise-grade distribution
Primarily leveraging open source components
Carefully targeted enhancements to make Hadoop more
open and enterprise-grade

Growing fast and a recognized leader


MapR in the Cloud

Available as a service with Amazon Elastic MapReduce (EMR)
http://aws.amazon.com/elasticmapreduce/mapr


Available as a service with Google Compute Engine

MapR Partners
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
MapR's Complete Distribution
for Apache Hadoop
MapR Control System: Heatmap; LDAP/NIS integration; quotas, CLI, REST API; alerts and alarms

Integrated, tested, hardened and supported open source components: Hive, Pig, Oozie, Sqoop, HBase, Whirr, Mahout, Cascading, Nagios, Ganglia, Flume, ZooKeeper; integrated with Accumulo

Runs on commodity hardware

Open source with standards-based extensions for:
Security
File-based access: Direct Access NFS, real-time streaming
Volumes, snapshots, mirrors, data placement
SQL-based access

MapR's storage services: no NameNode architecture, high performance direct shuffle, stateful failover and self-healing

High availability
Best performance
Easiest integration
Easy Management at Scale

Health monitoring
Cluster administration
Application resource provisioning

Same information and tasks available via command line and REST
MapR: Lights Out Data Center Ready

Reliable Compute:
Automated stateful failover
Automated re-replication
Self-healing from HW and SW failures
Load balancing
Rolling upgrades
No lost jobs or data
99999s of uptime

Dependable Storage:
Business continuity with snapshots and mirrors
Recover to a point in time
End-to-end checksumming
Strong consistency
Built-in compression
Mirror across sites to meet Recovery Time Objectives
Storage Architecture

How does MapR manage storage, and how is this different from generic Hadoop?
What is a Volume?

Like a sub-directory: groups related dirs/files together
Contains the file metadata for this volume
Mounted to form a global namespace
Logical unit of policy

Volumes help you manage data
MapR Technologies - Confidential
Typical Volume Layout

/binaries /hbase /projects /users /var/mapr

/build /test /mjones /jsmith local...

Create lots of volumes, 100K volumes OK!


Volumes Let You Manage Data

Replication factor
Quotas
Load balancing
Snapshots
Mirrors
Data placement
Made of containers; the container is the sharding unit (16-32 GB)


Storage Architecture

Nodes
Disks
Storage Pools
Containers: distributed across the cluster, 16-32 GB each
Volumes



No NameNode Architecture

Other Hadoop distributions:
The NameNode (often backed by a NAS appliance) holds all file metadata
HA requires specialized hardware and/or software
File scalability hampered by the NameNode bottleneck
Metadata must fit in memory

MapR:
File metadata is distributed across the cluster; there is no NameNode
HA with automatic failover and re-replication
Up to 1T files (> 5000x advantage)
Higher performance
100% commodity hardware
Metadata is persisted to disk


MapR Snapshots

Snapshots without data duplication: redirect-on-write saves space by sharing blocks
Lightning fast
Zero performance loss on writing to the original
Scheduled, or on-demand
Easy recovery by the user
Snapshots are visible to Hadoop, HBase and NFS applications via the MapR storage services
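A toy model of the redirect-on-write idea (illustrative only, not MapR's on-disk format): a snapshot freezes the block map, and later writes allocate fresh blocks, so all unchanged blocks are shared:

```python
# Toy redirect-on-write snapshot. A snapshot is just a frozen copy of a
# file's block map; a later write allocates a new block and redirects the
# live file to it, so the snapshot shares every unchanged block for free.
blocks = {1: b'A', 2: b'B', 3: b'C'}   # block id -> contents
live = {'f': [1, 2, 3]}                # file 'f' -> ordered block ids

snapshot = {name: ids[:] for name, ids in live.items()}  # freeze the map

blocks[4] = b'X'       # the write goes to a freshly allocated block...
live['f'][0] = 4       # ...and the live file is redirected to it

print([blocks[i] for i in snapshot['f']])  # [b'A', b'B', b'C'] (unchanged)
print([blocks[i] for i in live['f']])      # [b'X', b'B', b'C']
```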



MapR Mirroring/COOP Requirements

Business continuity across production datacenters, research clusters and the cloud (WAN mirroring, remote seeding)
Efficient design: only differential deltas are transferred, compressed and check-summed
Easy to manage: scheduled or on-demand
Consistent point-in-time mirrors
Thought Questions
Consider a cluster with
Petabytes of data
Hundreds or thousands of jobs running each day, creating new data
Many users and teams all using this cluster
How do I back this up? ("user oops" protection)
How do I replicate data from one cluster to another in support of disaster recovery? (protection from power outages, floods, fire, etc.)
Designed for Performance and Scale

Benchmarks (MapR vs. Apache/CDH):

Terasort w/ 1x replication (no compression)
  Total      24 min 34 sec   vs   49 min 33 sec
  Map        9 min 54 sec    vs   28 min 12 sec
  Shuffle    9 min 8 sec     vs   27 min 0 sec

Terasort w/ 3x replication (no compression)
  Total      47 min 4 sec    vs   73 min 42 sec
  Map        11 min 2 sec    vs   30 min 8 sec
  Shuffle    9 min 17 sec    vs   28 min 40 sec

DFSIO/local write
  Throughput/node   870 MB/s   vs   240 MB/s

YCSB (HBase benchmark, 50% read, 50% update)
  Throughput     33,102 ops/sec      vs   7,904 ops/sec
  Latency (r/u)  2.9-4 ms / 0.4 ms   vs   7-30 ms / 0-5 ms

YCSB (HBase benchmark, 95% read, 5% update)
  Throughput     18K ops/sec           vs   8,500 ops/sec
  Latency (r/u)  5.5-5.7 ms / 0.6 ms   vs   12-30 ms / 1 ms

HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2 TB disks, 32 GB RAM

In production (large Web 2.0 company):
  1.4 PB user data
  900-1200 MapReduce jobs per day
  16 TB/day average IO through each server
  85-90% storage utilization (with snapshots)
  Very low-end hardware (consumer drives)
  6B files on a single cluster (with 3x replication); 2000 servers targeted
  No degradation during hardware failures
  Heavy read/write/delete workload: 1.7K creates/sec/node
  Response time (write/read/delete): atomic workload 7.8/4.5/8.7 ms, mixed workload 6.6/4.9/9.1 ms


Customer Support

24x7x365 Follow-The-Sun coverage
Critical customer issues are worked on around the clock
Dedicated team of Hadoop engineering experts

Contacting MapR support:
Email: support@mapr.com (automatically opens a case)
Phone: 1.855.669.6277
Self-service: http://answers.mapr.com/
Web portal: http://mapr.com/support
Two MapR Editions: M3 and M5

M3 (free):
Control System
NFS Access
Performance
Unlimited Nodes

M5 (annual subscription):
Control System
NFS Access
Performance
High Availability
Snapshots & Mirroring
24x7 Support

Also available through: Google Compute Engine
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
Not All Applications Use the Hadoop APIs

Applications and libraries that use files and/or SQL: 30 years, 100,000s of applications, 10,000s of libraries, 10s of programming languages
These are not legacy applications, they are valuable applications

Applications and libraries that use the Hadoop APIs
Hadoop Needs Industry-Standard Interfaces

Hadoop API: MapReduce and HBase applications; mostly custom-built
NFS: file-based applications; supported by most operating systems
ODBC: SQL-based tools; supported by most BI applications and query builders
NFS

Your Data is Important

HDFS-based Hadoop distributions do not (cannot) properly support NFS
Your data is important, it drives your business: make sure you can access it
Why store your data in a system which cannot be accessed by 95% of the world's applications and libraries?

Direct Access NFS

File browsers: access directly, drag & drop
Standard Linux commands & tools: grep, sed, sort, tar
Random read and random write
Applications log directly into the cluster
The NFS Protocol

RFC 1813: a very simple protocol with random reads and writes.
Read: read count bytes from offset offset of file file
Write: write buffer data to offset offset of file file

    READ3res NFSPROC3_READ(READ3args) = 6;

    struct READ3args {
        nfs_fh3  file;
        offset3  offset;
        count3   count;
    };

    WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;

    struct WRITE3args {
        nfs_fh3    file;
        offset3    offset;
        count3     count;
        stable_how stable;
        opaque     data<>;
    };

HDFS does not support random writes, so it cannot support NFS.
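The random-write semantics NFS requires are exactly what an ordinary POSIX file offers; a small sketch (writing into the middle of an existing file, which append-only HDFS cannot do):

```python
import os
import tempfile

# Random write: overwrite bytes at an arbitrary offset of an existing
# file. NFS clients expect this; HDFS only supports append.
fd, path = tempfile.mkstemp()
os.write(fd, b'hello world')      # initial contents
os.pwrite(fd, b'HELLO', 0)        # rewrite 5 bytes at offset 0 in place
os.close(fd)

with open(path, 'rb') as f:
    data = f.read()
os.unlink(path)
print(data)  # b'HELLO world'
```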
Hadoop Was Designed to Support Multiple Storage Layers

MapReduce runs against the o.a.h.fs.FileSystem interface, with pluggable implementations:
HDFS: o.a.h.hdfs.DistributedFileSystem
Local file system: o.a.h.fs.LocalFileSystem
S3: o.a.h.fs.s3native.NativeS3FileSystem
FTP: o.a.h.fs.ftp.FTPFileSystem
MapR storage layer: com.mapr.fs.MapRFileSystem

The MapR storage layer additionally exposes an NFS interface.
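A toy version of this pluggable design, echoing o.a.h.fs.FileSystem: jobs code against one interface and the URI scheme selects the backend. Both backends here are hypothetical in-memory stand-ins, not real HDFS/S3/MapR-FS clients:

```python
import io
from abc import ABC, abstractmethod

class FileSystem(ABC):
    # The single interface jobs code against.
    @abstractmethod
    def open(self, path: str) -> io.BytesIO: ...

class MemFS(FileSystem):
    # Hypothetical in-memory backend standing in for a real storage layer.
    def __init__(self, files: dict):
        self.files = files

    def open(self, path: str) -> io.BytesIO:
        return io.BytesIO(self.files[path])

# Scheme -> implementation registry (Hadoop resolves this from its config).
SCHEMES = {
    'mema': MemFS({'/data': b'from backend A'}),
    'memb': MemFS({'/data': b'from backend B'}),
}

def read(uri: str) -> bytes:
    scheme, path = uri.split('://', 1)
    return SCHEMES[scheme].open('/' + path).read()

print(read('mema://data'))  # b'from backend A'
```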
One NFS Gateway

What about scalability and high availability?


MulEple NFS Gateways

MulEple NFS Gateways with Load Balancing

MulEple NFS Gateways with NFS HA (VIPs)

Customer Examples: Import/Export Data
Network security vendor
Network packet captures from switches are streamed into the cluster
New pattern definitions are loaded into online IPS via NFS

Online measurement company


Clickstreams from application servers are streamed into the cluster

SaaS company
Exporting a database to Hadoop over NFS

Ad exchange
Bids and transactions are streamed into the cluster
Customer Examples: Productivity and Operations

Retailer
Operational scripts are easier with NFS than HDFS + MapReduce
chmod/chown, file system searches/greps, perl, awk, tab-complete
Consolidate object store with analytics

Credit card company


User and project home directories on Linux gateways
Local files, scripts, source code,
Administrators manage quotas, snapshots/backups,

Large Internet company recommendation system


Web servers serve MapReduce results (item relationships) directly from the cluster

Email marketing company


Object store with HBase and NFS
Apache Drill
Interactive Analysis of Large-Scale Datasets
Latency Matters

Ad-hoc analysis with interactive tools

Real-time dashboards

Event/trend detection and analysis


Network intrusion analysis on the fly
Fraud
Failure detection and analysis
Big Data Processing

                     Batch processing    Interactive analysis     Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes  Never-ending
Data volume          TBs to PBs          GBs to PBs               Continuous stream
Programming model    MapReduce           Queries                  DAG
Users                Developers          Analysts and developers  Developers
Google project       MapReduce           Dremel
Open source project  Hadoop MapReduce                             Storm and S4

Introducing Apache Drill


Innovations
MapReduce
Scalable IO and compute trumps efficiency with today's commodity hardware
With large datasets, schemas and indexes are too limiting
Flexibility is more important than efficiency
An easy to use scalable, fault tolerant execution framework is key for large
clusters
Dremel
Columnar storage provides significant performance benefits at scale
Columnar storage with nesting preserves structure and can be very efficient
Avoiding final record assembly as long as possible improves efficiency
Optimizing for the query use case can avoid the full generality of MR and thus
significantly reduce latency. No need to start JVMs, just push compact queries to
running agents.
Apache Drill
Open source project based upon Dremel's ideas
More flexibility and openness
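A tiny illustration of Dremel's columnar idea (toy data, not Drill's actual storage format): transposing records into per-field arrays lets a query like SUM(clicks) scan one compact column instead of whole records:

```python
# Toy columnar storage: one array per field. Real Dremel/Drill add
# nesting, encoding and compression on top of this basic transpose.
rows = [
    {'user': 'a', 'clicks': 3, 'country': 'US'},
    {'user': 'b', 'clicks': 7, 'country': 'DE'},
    {'user': 'c', 'clicks': 1, 'country': 'US'},
]

# Record "shredding" in its simplest form: transpose rows into columns.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# SUM(clicks) now reads one compact array, never the full records.
total = sum(columns['clicks'])
print(total)  # 11
```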
More Reading on Apache Drill
MapR and Apache Drill
http://www.mapr.com/drill
Apache Drill project page
http://incubator.apache.org/projects/drill.html
Google's Dremel
http://research.google.com/pubs/pub36632.html
Google's BigQuery
https://developers.google.com/bigquery/docs/query-reference
MIT's C-Store, a columnar database
http://db.csail.mit.edu/projects/cstore/
Microsoft's Dryad
Distributed execution engine
http://research.microsoft.com/en-us/projects/dryad/
Google's Protobufs
https://developers.google.com/protocol-buffers/docs/proto
