IndicThreads-Pune12-Comparing Hadoop Data Storage

Comparing Hadoop Data Storage
(HDFS, HBase, Hive and Pig)
Rakesh Jadhav
SAS
Agenda
- Hadoop Ecosystem
- HDFS
- HBase
- Hive
- Pig
Hadoop Ecosystem
Hadoop Ecosystem Components

- HDFS: Hadoop Distributed File System
- HBase: Hadoop Column Oriented Database for Random Access Read/Write of Smaller Data
- Hive: Hadoop Petabyte Scalable Data Warehousing Infrastructure
- Pig: Hadoop Data Flow/Analysis Infrastructure
- Zookeeper: Hadoop Co-ordination Service, Configuration Service Infrastructure
- MapReduce: Hadoop Distributed Programming Paradigm
- Chukwa: Hadoop Monitoring Service
- Avro: Hadoop Data Serialization/De-Serialization Infrastructure
- Mahout: Hadoop Scalable Machine Learning Library
HDFS (Data Storage)

Design Features
- Failure is the norm, not the exception
- Designed for large datasets rather than small ones
- Designed for batch processing rather than interactive use
- Supports a write-once, read-many access model
- Provides interfaces to move processing closer to the data
HDFS
APPLICATION AREAS
- Large log processing
- Web search indexing
LIMITATIONS
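- Small-files problem (many small files are handled inefficiently)
- Single point of failure (the NameNode)
- No random access
- No support for updating data in place (write once, read many)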
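The small-files limitation can be made concrete with a rough estimate. The Python sketch below assumes a 64 MB block size and the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object. Both figures are assumptions that do not come from the slides, and `namenode_objects` and `namenode_heap_bytes` are hypothetical helpers.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # assumed HDFS block size (64 MB)
BYTES_PER_OBJECT = 150          # assumed rule-of-thumb heap cost per file/block object


def namenode_objects(num_files: int, file_size: int) -> int:
    """Count the file and block objects the NameNode has to track."""
    blocks_per_file = math.ceil(file_size / BLOCK_SIZE)
    return num_files * (1 + blocks_per_file)


def namenode_heap_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode heap use under the assumptions above."""
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


MB = 1024 ** 2
GB = 1024 ** 3

# The same 1 TB of data stored two ways.
small = namenode_heap_bytes(num_files=1024 * 1024, file_size=MB)  # about a million 1 MB files
large = namenode_heap_bytes(num_files=1024, file_size=GB)         # 1024 files of 1 GB each

print(f"small files: {small / 1e6:.1f} MB of NameNode heap")
print(f"large files: {large / 1e6:.1f} MB of NameNode heap")
```

Under these assumptions the same volume of data costs the NameNode over a hundred times more memory when it is stored as small files, which is why HDFS favours large files.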
HBase (Data Storage)

Design Features
- Key-value store (like a map)
- Semi-structured data
- Column family, timestamp
- Key = RowKey.ColumnFamily.ColumnName.TimeStamp
- De-normalized data
- Faster data retrieval using column families
- Static column families, dynamic columns
RDBMS v/s HBase: Example

RDBMS

ID  Name  Age  BirthPlace  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai      Married         Pune      76      XYZ
2   Bob   56   Chicago     Married         New York  79      PQR

HBase (Row Key 1)

Personal Information (Column Family)
  Name:            T1=Sam
  Age:             T2=35, T1=25
  Birth-Place:     T1=Mumbai
  Marital Status:  T2=Married, T1=Unmarried
  Weight:          T2=76, T1=65

Other Information (Column Family)
  Location:        T2=Pune, T1=Mumbai
  Employer:        T1=XYZ
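The addressing scheme (row key, column family, column name, timestamp) can be modelled as a nested map. The Python sketch below is an in-memory illustration built from the example row, not HBase client code. The `put` and `get` helpers, the short family names, and the integer timestamps 1 and 2 (standing in for T1 and T2) are assumptions made for illustration.

```python
from collections import defaultdict

# table[row_key][(family, column)] -> {timestamp: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row, family, column, timestamp, value):
    """Store one cell version; earlier versions are kept, not overwritten."""
    table[row][(family, column)][timestamp] = value


def get(row, family, column, timestamp=None):
    """Return the newest version, or the newest at or before `timestamp`."""
    versions = table.get(row, {}).get((family, column), {})
    candidates = [t for t in versions if timestamp is None or t <= timestamp]
    return versions[max(candidates)] if candidates else None


# Row key "1" from the example, with two versions of Age and Location.
put("1", "personal", "Age", 1, 25)
put("1", "personal", "Age", 2, 35)
put("1", "other", "Location", 1, "Mumbai")
put("1", "other", "Location", 2, "Pune")
put("1", "other", "Employer", 1, "XYZ")

print(get("1", "personal", "Age"))               # newest version
print(get("1", "personal", "Age", timestamp=1))  # version as of T1
print(get("1", "other", "Employer"))
```

A column that was never written simply has no entry, which is how the model accommodates dynamic columns inside a fixed set of column families.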
HBase: Application Areas

- Applications that need to store, access and search data by key
- Applications that need fast random access and updates to scalable structured data
- Applications that need a flexible table schema
- Applications that need range searches, which key ordering makes possible
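Range search follows from key ordering: when rows are kept sorted by key, every key in a range sits in one contiguous run. The Python sketch below shows that idea with `bisect` over a sorted list. It does not use the HBase API, and the row keys, values and `scan` helper are hypothetical.

```python
import bisect

# Rows kept sorted by key (hypothetical keys of the form "<user>#<date>").
rows = sorted([
    ("bob#2012-01-15", "login"),
    ("sam#2012-01-03", "login"),
    ("sam#2012-02-10", "purchase"),
    ("sam#2012-03-22", "logout"),
    ("tom#2012-01-09", "login"),
])
keys = [key for key, _ in rows]


def scan(start_key, stop_key):
    """Return rows with start_key <= key < stop_key using two binary searches."""
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi]


# All rows for user "sam": "$" is the character right after "#".
sam_rows = scan("sam#", "sam$")

# Rows for "sam" in February and March 2012 only.
sam_feb_mar = scan("sam#2012-02", "sam#2012-04")

print(sam_rows)
print(sam_feb_mar)
```

The design consequence is that the row key should put the most commonly searched attribute first, because only a key prefix can be turned into a range like this.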
HBase: Limitations
- Expensive full row read
- No secondary keys
- No SQL support
- Not efficient for big cell values
Hive (Data Access)

Design Features
- Scalable data warehouse on top of Hadoop, developed by Facebook
- SQL-like query language (HiveQL)
- Limited JDBC support
- Support for rich data types
- Ability to insert custom map-reduce jobs
Hive: Application Areas

- Ad hoc analysis of huge structured datasets with no low-latency requirement
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g. Google Analytics)
- Predictive modeling, hypothesis testing
Hive: Limitations
- No support for updating data
- Only bulk load support
- Not efficient for small data
Hive: Example
CREATE TABLE employee (id bigint, name string, age int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;

INSERT OVERWRITE TABLE oldest_employee
SELECT * FROM employee SORT BY age DESC LIMIT 100;
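To make the intent of the final statement concrete, here is a rough Python equivalent that picks the oldest employees from tab-delimited rows. It sketches the query's intended logic only and says nothing about how Hive executes it. The sample rows and the `oldest_employees` helper are made up for illustration.

```python
import csv
import heapq
import io

# Tab-delimited rows in the layout of the employee table: id, name, age.
SAMPLE = "1\tSam\t35\n2\tBob\t56\n3\tAnn\t41\n"


def oldest_employees(text, limit=100):
    """Parse tab-delimited rows and return up to `limit` rows, oldest first."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [(int(emp_id), name, int(age)) for emp_id, name, age in reader]
    return heapq.nlargest(limit, rows, key=lambda row: row[2])


for row in oldest_employees(SAMPLE, limit=2):
    print(row)
```

On a single machine this is a few lines of code. Hive earns its place when the input is too large for one machine and the same declarative query has to run as distributed jobs.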
Pig (Data Access)
- Pig Latin: a high-level data flow language
- Client-side library, no server-side deployment needed
- Batch processing of large unstructured data
- Procedural language
- Runtime schema creation, checkpointing, support for splits in the pipeline
- Custom code support
- Rich data types
- Support for joins
Pig: Application Areas

- Extract, Transform, Load (ETL)
- Unstructured data analysis
Pig: Limitations
Not efficient for processing small datasets
Pig: Example
Load employee data from a text file, filter it by age and joining year, then group it by joining year and find the maximum age in each group.
records = LOAD 'sas/input/files/employee.txt' AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012.
filtered_records = FILTER records BY age > 30 AND (joiningYear >= 2000 AND joiningYear <= 2012);
grouped_records = GROUP filtered_records BY joiningYear;
max_age = FOREACH grouped_records GENERATE group, MAX(filtered_records.age);
DUMP max_age;
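The same data flow can be traced step by step in plain Python, which shows what each Pig Latin statement contributes. This is a sketch on made-up records, and it assumes the filter is meant to keep joining years from 2000 through 2012.

```python
from collections import defaultdict

# (joiningYear, employeeId, age), the schema used in the LOAD statement.
records = [
    (2001, 1, 35),
    (2001, 2, 48),
    (2005, 3, 29),
    (2005, 4, 52),
    (1998, 5, 60),
]

# FILTER: keep employees older than 30 who joined from 2000 through 2012.
filtered_records = [r for r in records if r[2] > 30 and 2000 <= r[0] <= 2012]

# GROUP ... BY joiningYear: collect the ages seen for each joining year.
grouped_records = defaultdict(list)
for joining_year, employee_id, age in filtered_records:
    grouped_records[joining_year].append(age)

# FOREACH ... GENERATE group, MAX(age): one result per group.
max_age = {year: max(ages) for year, ages in grouped_records.items()}

# DUMP
print(max_age)
```

Each intermediate name corresponds to a relation in the Pig script, which reflects the procedural, step-at-a-time style of Pig Latin.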
Conclusion
Organizations should:
- Revisit their data strategy
- Evaluate the Hadoop ecosystem
- Build economical, scalable solutions for Big Data problems
Thank You