Who am I?
http://www.mapr.com/company/events/speaking/pdb-10-16-12
Keys Botzum
kbotzum@maprtech.com
Senior Principal Technologist, MapR Technologies
MapR Federal and Eastern Region
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
Hadoop in 15 minutes
How to Scale?
Big Data has Big Problems
Petabytes of data
MTBF on 1000s of nodes is < 1 day
Something is always broken
There are limits to scaling Big Iron
Sequential and random access just don't scale
Example: Update 1% of 1TB
Data consists of 10^10 records, each 100 bytes
Task: Update 1% of these records
Approach 1: Just Do It
Each update involves read, modify and write
t = 1 seek + 2 disk rotations = 20ms
1% x 10^10 x 20 ms = 2 mega-seconds = 23 days
Total time dominated by seek and rotation times
Approach 2: The Hard Way
Copy the entire database 1GB at a time
Update records on the fly
t = 2 x 1GB / 100MB/s + 20ms = 20s
10^3 x 20s = 20,000s = 5.6 hours
100x faster to do 100x more work!
Moral: Read data sequentially even if you only want 1% of it
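The arithmetic behind the two approaches can be checked with a short sketch. The constants are the ones quoted on the slide (20 ms per random update, 100 MB/s transfer, 1 GB chunks); the function and constant names are illustrative only.

```python
# Back-of-the-envelope timings for updating 1% of a 1 TB data set.
RECORDS = 10**10                    # number of records
RECORD_BYTES = 100                  # bytes per record (10^12 bytes = 1 TB in total)
UPDATE_PERCENT = 1                  # share of records to update
SEEK_AND_ROTATE_S = 0.020           # 1 seek + 2 disk rotations
CHUNK_BYTES = 10**9                 # copy the database 1 GB at a time
TRANSFER_BYTES_PER_S = 100 * 10**6  # 100 MB/s sequential transfer


def random_update_seconds():
    """Approach 1: one seek-dominated read-modify-write per updated record."""
    updates = RECORDS * UPDATE_PERCENT // 100
    return updates * SEEK_AND_ROTATE_S


def sequential_copy_seconds():
    """Approach 2: read and rewrite every chunk, updating records on the fly."""
    chunks = RECORDS * RECORD_BYTES // CHUNK_BYTES
    per_chunk = 2 * CHUNK_BYTES / TRANSFER_BYTES_PER_S + SEEK_AND_ROTATE_S
    return chunks * per_chunk


if __name__ == "__main__":
    slow = random_update_seconds()
    fast = sequential_copy_seconds()
    print(f"Approach 1: {slow / 86400:.1f} days")
    print(f"Approach 2: {fast / 3600:.1f} hours")
    print(f"Speedup: about {slow / fast:.0f}x")
```

The sequential figure comes out slightly above the slide's 20,000s because the sketch keeps the 20 ms seek per chunk instead of rounding it away.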
MapReduce: A Paradigm Shift
Distributed computing platform
Large clusters
Commodity hardware
Pioneered at Google
BigTable, MapReduce and Google File System
Commercially available as Hadoop
Hadoop
Commodity hardware: thousands of nodes
Handles Big Data: petabytes and more
Sequential file access: each spindle provides data as fast as possible
Sharding
Data distributed evenly across cluster
More spindles and CPUs working on different parts of data set
Reliability: self-healing (mostly), self-balancing
MapReduce
Parallel computing framework
Function shipping
Moves the computation to the data rather than the typical reverse
Takes into account sharding
Hides most of complexity from developers
Inside Map-Reduce
[Figure: word count over the text "The time has come," the Walrus said, "To talk of many things: Of shoes and ships and sealing-wax"]
Stages: Input, Map, Shuffle and sort, Reduce, Output
Map emits: the, 1; time, 1; has, 1; come, 1
Shuffle and sort groups: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]
Reduce outputs: come, 6; has, 8; the, 4; time, 14
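The Input, Map, Shuffle and sort, Reduce, Output stages of the word-count figure can be imitated in a single process. This is only a sketch of the data flow, not Hadoop's API; the function names are made up for illustration, and a real job would run map and reduce on many nodes in parallel.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))


if __name__ == "__main__":
    text = ['"The time has come," the Walrus said,', '"To talk of many things:']
    print(word_count(text))
```

In the figure each reducer receives lists such as come, [3,2,1] because several mappers have already produced partial counts; the sketch starts from raw 1s instead.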
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
The MapR Distribution for Apache Hadoop
Available as a service with Google Compute Engine
MapR Partners
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
MapR's Complete Distribution for Apache Hadoop
Integrated, tested, hardened and supported
Runs on commodity hardware
Open source with standards-based extensions for:
Security
File-based access
Most SQL-based access
Easiest integration
High availability
Best performance
MapR Control System: MapR Heatmap, LDAP/NIS Integration, Quotas, CLI, REST API, Alerts, Alarms
Integrated with Accumulo
Components: Hive, Pig, Oozie, Sqoop, HBase, Whirr, Accumulo, Mahout, Cascading, Nagios Integration, Ganglia Integration, Flume, ZooKeeper
MapR's Storage Services
2.7
Direct Access NFS, Real-Time Streaming, Volumes, Mirrors, Snapshots, Data Placement
No NameNode Architecture, High Performance Direct Shuffle, Stateful Failover and Self Healing
Easy Management at Scale
Health Monitoring
Cluster Administration
Application Resource Provisioning
Same information and tasks available via command line and REST
MapR: Lights Out Data Center Ready
Reliable Compute
Dependable Storage
Volumes
Like a sub-directory: related dirs/files together
Contains file metadata for this volume
Mounted to form global namespace
Logical unit of policy: Replication factor, Quotas, Load balancing, Snapshots, Mirrors, Data placement
Made of containers
Container is sharding unit, 16-32 GB
Containers: distributed across cluster, 16-32 GB
[Figure labels: Nodes, Disks, Storage Pools, Containers]
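To get a feel for what a 16-32 GB sharding unit means, the sketch below estimates how many containers a volume of a given size would need. It only does the division; the function name is hypothetical, treating GB as 2^30 bytes is an assumption, and how MapR actually sizes and places containers is not described in the slides.

```python
GIB = 2**30  # assumption: the slide's "GB" is treated as 2^30 bytes


def container_count_range(volume_bytes, min_container=16 * GIB, max_container=32 * GIB):
    """Return (fewest, most) containers needed to hold volume_bytes.

    fewest assumes every container is filled to the 32 GB upper bound,
    most assumes every container stops at the 16 GB lower bound.
    """
    fewest = -(-volume_bytes // max_container)  # integer ceiling division
    most = -(-volume_bytes // min_container)
    return fewest, most


if __name__ == "__main__":
    fewest, most = container_count_range(1024 * GIB)
    print(f"A 1 TiB volume spans roughly {fewest} to {most} containers")
```

Even a modest volume therefore splits into dozens of containers, which is what lets its data spread across many nodes and disks.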
[Figure: blocks A-F spread across NameNodes and DataNodes; applications accessing the cluster through Hadoop, HBase and NFS]
Snapshots without data duplication
[Figure label: READ / WRITE]
Business Continuity and Efficiency
Efficient design
Differential deltas are updated
Compressed and check-summed
Easy to manage
Scheduled or on-demand
WAN, Remote Seeding
Consistent point-in-time
[Figure labels: Production, Research, WAN, Datacenter 1, Cloud, Compute Engine]
Thought Questions
Consider a cluster with
Petabytes of data
Hundreds or thousands of jobs running each day, creating new data
Many users and teams all using this cluster
How do I back this up?
User oops protection
How do I replicate data from one cluster to another in support of disaster recovery?
Protection from power outages, floods, fire, etc.
Designed for Performance and Scale

Benchmark                                      MapR                  Apache/CDH
Terasort w/ 1x replication (no compression)
  Total                                        24 min 34 sec         49 min 33 sec
  Map                                          9 min 54 sec          28 min 12 sec
  Shuffle                                      9 min 8 sec           27 min 0 sec
Terasort w/ 3x replication (no compression)
  Total                                        47 min 4 sec          73 min 42 sec
  Map                                          11 min 2 sec          30 min 8 sec
  Shuffle                                      9 min 17 sec          28 min 40 sec
DFSIO/local write
  Throughput/node                              870 MB/s              240 MB/s
YCSB (HBase benchmark, 50% read, 50% update)
  Throughput                                   33102 ops/sec         7904 ops/sec
  Latency (r/u)                                2.9-4 ms/0.4 ms       7-30 ms/0-5 ms
YCSB (HBase benchmark, 95% read, 5% update)
  Throughput                                   18K ops/sec           8500 ops/sec
  Latency (r/u)                                5.5-5.7 ms/0.6 ms     12-30 ms/1 ms

1.4 PB user data
900-1200 MapReduce jobs per day
16 TB/day average IO through each server
85-90% storage utilization (with snapshots)
Very low-end hardware (consumer drives)

Large Web 2.0 company
6B files on a single cluster (+ 3x replication)
2000 servers targeted
No degradation during hardware failures
Heavy read/write/delete workload
1.7K creates/sec/node
Response time (write/read/delete)
Atomic workload: 7.8/4.5/8.7 ms
Mixed workload: 6.6/4.9/9.1 ms
Agenda
What's a Hadoop?
What's MapR?
Enterprise Grade Hadoop
Making Hadoop More Open
Not All Applications Use the Hadoop APIs
Applications and libraries that use files and/or SQL
These are not legacy applications, they are valuable applications
30 years
100,000s applications
10,000s libraries
10s programming languages
Applications and libraries that use the Hadoop APIs
MapR Technologies
Hadoop Needs Industry-Standard Interfaces
File-based applications: NFS
Supported by most operating systems
SQL-based tools: ODBC
Supported by most BI applications and query builders
NFS
Your Data is Important
Direct Access NFS
File Browsers: Access Directly, Drag & Drop
Standard Linux Commands & Tools: grep, sed, sort, tar
Applications: Random Read, Random Write, Log directly
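Because the cluster is reached through an ordinary NFS mount, plain file I/O is enough for the operations named on this slide: logging directly, random read, random write, and grep-style searches. The sketch below uses only Python's standard file calls; the helper names are illustrative, and a temporary directory stands in for the mount point, whose real path depends on how the cluster is mounted.

```python
import os
import tempfile


def append_log(path, message):
    """Log directly: append one line to a file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")


def overwrite_at(path, offset, data):
    """Random write: replace bytes in place at a given offset."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Random read: fetch length bytes starting at offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def grep(path, needle):
    """Return the lines of a text file that contain needle."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line]


def demo(root=None):
    """Exercise the helpers; root would be the NFS mount (a temp dir here)."""
    root = root or tempfile.mkdtemp()
    log = os.path.join(root, "app.log")
    append_log(log, "bid accepted")
    append_log(log, "bid rejected")
    overwrite_at(log, 0, b"BID")
    return grep(log, "accepted"), read_at(log, 0, 12)


if __name__ == "__main__":
    print(demo())
```

Nothing in the code is specific to Hadoop, which is the point of the slide: existing file-based applications and tools need no changes.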
The NFS Protocol
Hadoop Was Designed to Support Multiple
o.a.h.fs.FileSystem Interface (FileSystem API)
HDFS: o.a.h.hdfs.DistributedFileSystem
o.a.h.fs.LocalFileSystem
FTP: o.a.h.fs.ftp.FTPFileSystem
MapR storage layer: com.mapr.fs.MapRFileSystem
[Figure labels: Hadoop, MapReduce, NFS interface]
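The FileSystem interface on this slide works as a plug-in point: Hadoop picks an implementation class from the scheme of the path's URI. The lookup below is a rough illustration of that idea, not Hadoop code. The class names are the ones on the slide written out in full; the table and function names are hypothetical, and in a real cluster the scheme-to-class mapping comes from Hadoop configuration rather than a hard-coded dictionary.

```python
from urllib.parse import urlparse

# Assumed scheme-to-class table, for illustration only.
FILESYSTEM_IMPLS = {
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "file": "org.apache.hadoop.fs.LocalFileSystem",
    "ftp": "org.apache.hadoop.fs.ftp.FTPFileSystem",
    "maprfs": "com.mapr.fs.MapRFileSystem",
}


def implementation_for(uri, default_scheme="file"):
    """Return the FileSystem class name that would serve this URI."""
    scheme = urlparse(uri).scheme or default_scheme
    try:
        return FILESYSTEM_IMPLS[scheme]
    except KeyError:
        raise ValueError(f"no FileSystem registered for scheme {scheme!r}") from None


if __name__ == "__main__":
    for uri in ("hdfs://namenode:8020/logs", "maprfs:///user/data", "/tmp/local.txt"):
        print(uri, "->", implementation_for(uri))
```

Since MapReduce is written against the interface and not against HDFS itself, swapping the storage layer does not require changing job code.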
One NFS Gateway
Multiple NFS Gateways with Load Balancing
Multiple NFS Gateways with NFS HA (VIPs)
Customer Examples: Import/Export Data
Network security vendor
Network packet captures from switches are streamed into the cluster
New pattern definitions are loaded into online IPS via NFS
SaaS company
Exporting a database to Hadoop over NFS
Ad exchange
Bids and transactions are streamed into the cluster
Customer Examples: Productivity and Operations
Retailer
Operational scripts are easier with NFS than HDFS + MapReduce
chmod/chown, file system searches/greps, perl, awk, tab-complete
Consolidate object store with analytics
Real-time dashboards