
Comparing Hadoop Data Storage

(HDFS, HBase, Hive and Pig)

Rakesh Jadhav
SAS

Agenda
Hadoop Ecosystem
HDFS
HBase
Hive
Pig

Hadoop Ecosystem

Hadoop Ecosystem Components


HDFS: Hadoop Distributed File System
HBase: Hadoop Column Oriented Database for Random Access Read/Write of Smaller Data
Hive: Hadoop Petabyte Scalable Data Warehousing Infrastructure
Pig: Hadoop Data Flow/Analysis Infrastructure
Zookeeper: Hadoop Co-ordination Service, Configuration Service Infrastructure
MapReduce: Hadoop Distributed Programming Paradigm
Chukwa: Hadoop Monitoring Service
Avro: Hadoop Data Serialization/De-Serialization Infrastructure
Mahout: Hadoop Scalable Machine Learning Library

HDFS (Data Storage)


Design Features

Failure is the norm
Designed for large datasets rather than small ones
Designed for batch processing rather than interactive use
Supports write-once, read-many access
Provides interfaces to move processing closer to the data

HDFS: Application Areas
Large log processing
Web search indexing

HDFS: Limitations
Small files problem
Single point of failure (NameNode)
No random access
No support for updating data once written
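
The small files limitation can be made concrete with rough arithmetic. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block entry and a 64 MB block size. Both figures are assumptions rather than values from these slides, and `namenode_bytes` is a hypothetical helper.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb per file/block entry
BLOCK_SIZE = 64 * 1024 * 1024   # assumed HDFS block size (64 MB)

def namenode_bytes(num_files, file_size):
    """Rough NameNode memory needed to track num_files files of file_size bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One entry for the file itself plus one entry per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

total = 1024 ** 4  # the same 1 TiB of data, stored two different ways
small = namenode_bytes(total // (1024 ** 2), 1024 ** 2)  # many 1 MiB files
large = namenode_bytes(total // (1024 ** 3), 1024 ** 3)  # few 1 GiB files
print(small, large)
```

Under these assumptions, storing the data as 1 MiB files costs over a hundred times more NameNode memory than storing it as 1 GiB files, which is why HDFS favors large files.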

HBase (Data Storage)


Design Features
Key-Value Store (Like Map)
Semi-structured data
Column family, timestamp
Key = RowKey.ColumnFamily.ColumnName.TimeStamp
De-normalized data
Faster data retrieval using column families
Static column families, dynamic columns

RDBMS v/s HBase: Example


RDBMS
ID  Name  Age  BirthPlace  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai      Married         Pune      76      XYZ
2   Bob   56   Chicago     Married         New York  79      PQR

HBase (Row Key 1)
Personal Information (Column Family)
  Name: T1=Sam
  Age: T2=35, T1=25
  Birth-Place: T1=Mumbai
  Marital Status: T2=Married, T1=Unmarried
  Weight: T2=76, T1=65
Other Information (Column Family)
  Location: T2=Pune, T1=Mumbai
  Employer: T1=XYZ
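
The versioned cells in the example above can be modelled as a map keyed by row key, column family, and column, with one value per timestamp. The following is a minimal in-memory sketch of that idea only. `VersionedTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict

class VersionedTable:
    """Toy in-memory model of HBase-style versioned cells (not the HBase API)."""

    def __init__(self):
        # (row_key, column_family, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        self._cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        """Return the newest value, or the newest at or before `timestamp`."""
        versions = self._cells.get((row, family, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

table = VersionedTable()
table.put(1, "personal", "Age", 1, 25)
table.put(1, "personal", "Age", 2, 35)
table.put(1, "other", "Location", 1, "Mumbai")
table.put(1, "other", "Location", 2, "Pune")

latest_age = table.get(1, "personal", "Age")                    # newest version
old_location = table.get(1, "other", "Location", timestamp=1)   # version at T1
```

A read without a timestamp returns the most recent version, while passing a timestamp retrieves the value as it stood at that point.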

HBase: Application Areas


Applications that need to store, access, or search data by key
Applications that need fast random access and updates to scalable structured data
Applications needing a flexible table schema
Applications needing range-search capabilities supported by key ordering
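
Range search by key ordering can be illustrated with a store that keeps its row keys sorted, so a scan between two keys is a contiguous slice. This is a simplified sketch. `SortedKeyStore` and the `user#...` key format are hypothetical and not part of HBase.

```python
import bisect

class SortedKeyStore:
    """Toy store that keeps row keys sorted so range scans are cheap."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, stop):
        """Yield (key, row) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

store = SortedKeyStore()
for key in ["user#0042", "user#0007", "order#0001", "user#0100"]:
    store.put(key, {"id": key})

# "$" sorts immediately after "#", so this range covers every "user#" key
user_keys = [k for k, _ in store.scan("user#", "user$")]
```

Because related rows share a key prefix, designing the row key well is what makes this kind of scan useful.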

HBase: Limitations
Expensive full-row reads
No secondary keys
No SQL support
Not efficient for big cell values

Hive (Data Access)


Design Features

Scalable data warehouse on top of Hadoop, developed by Facebook
SQL-like query language (HiveQL)
Limited JDBC support
Support for rich data types
Ability to insert custom map-reduce jobs

Hive: Application Areas


Ad hoc analysis on huge structured data with no low-latency requirement
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g. Google Analytics)
Predictive modeling, hypothesis testing

Hive: Limitations
No support for updating data
Only bulk load support
Not efficient for small data

Hive: Example
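-- Table over tab-delimited text files
CREATE TABLE employee (id BIGINT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Bulk-load a local file, replacing any existing table data
LOAD DATA LOCAL INPATH '/sas/employee.txt'
OVERWRITE INTO TABLE employee;

-- Keep the 100 oldest employees (oldest_employee must already exist)
INSERT OVERWRITE TABLE oldest_employee
SELECT * FROM employee SORT BY age DESC LIMIT 100;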
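
For readers unfamiliar with HiveQL, the effect of the final query above can be sketched on a single machine in Python. This only illustrates the result the query computes, not how Hive executes it. `oldest_employees` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import heapq
import io

def oldest_employees(tsv_text, limit=100):
    """Parse tab-delimited (id, name, age) rows and return the `limit` oldest."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(i), name, int(age)) for i, name, age in reader]
    # Same effect as sorting by age descending and keeping the first `limit` rows
    return heapq.nlargest(limit, rows, key=lambda r: r[2])

sample = "1\tSam\t35\n2\tBob\t56\n3\tAnn\t41\n"   # illustrative data only
top_two = oldest_employees(sample, limit=2)
```

Hive expresses the same logic declaratively and runs it as map-reduce jobs over data that would not fit on one machine.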

Pig (Data Access)
Pig Latin: high-level data flow language
Client-side library; no server-side deployment needed
Batch processing of large unstructured data
Procedural language
Runtime schema creation, checkpointing, split-pipeline support
Custom code support
Rich data types
Support for joins

Pig: Application Areas


Extract, Transform, Load (ETL)
Unstructured data analysis

Pig: Limitations
Not efficient for processing small datasets

Pig: Example
Load employee data from a text file, filter it by age and joining year, and group it by joining year.
-- Load tab-delimited records; joiningYear is read as int so it can be compared numerically
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012
filtered_records = FILTER records BY age > 30 AND (joiningYear >= 2000 AND joiningYear <= 2012);
grouped_records = GROUP filtered_records BY joiningYear;
-- Maximum age within each joining year
max_age = FOREACH grouped_records GENERATE group, MAX(filtered_records.age);
DUMP max_age;
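
The same load, filter, group, and aggregate flow can be sketched in plain Python to show what each Pig step produces. This sketch assumes the year condition is meant as a range from 2000 to 2012. `max_age_by_joining_year` is a hypothetical helper, and the sample records are made up.

```python
from collections import defaultdict

def max_age_by_joining_year(records, min_age=30, first_year=2000, last_year=2012):
    """records: iterable of (joining_year, employee_id, age) tuples."""
    # FILTER: employees older than min_age who joined within the year range
    filtered = [r for r in records
                if r[2] > min_age and first_year <= r[0] <= last_year]
    # GROUP ... BY joiningYear
    grouped = defaultdict(list)
    for joining_year, _employee_id, age in filtered:
        grouped[joining_year].append(age)
    # FOREACH ... GENERATE group, MAX(age)
    return {year: max(ages) for year, ages in grouped.items()}

sample = [(2001, 1, 45), (2001, 2, 52), (2005, 3, 29), (2010, 4, 38), (1998, 5, 60)]
result = max_age_by_joining_year(sample)
```

Each Python step maps to one Pig Latin statement, which is what makes Pig read as a procedural data flow rather than a single declarative query.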

Conclusion
Organizations should:
Revisit data strategy
Evaluate the Hadoop Ecosystem
Build economical, scalable solutions for Big Data problems


Thank You
