Rakesh Jadhav
SAS
Agenda
- Hadoop Ecosystem
- HDFS
- HBase
- Hive
- Pig
Hadoop Ecosystem
[Figure: Hadoop ecosystem diagram; labelled components include Chukwa, Avro, and Mahout]
HDFS
- Failure is the norm
- Designed for large datasets rather than small ones
- Designed for batch processing rather than interactive use
- Supports write-once, read-many access
- Provides interfaces to move processing closer to the data
APPLICATION AREAS
- Large log processing
- Web search indexing
LIMITATIONS
- Small files problem
- Single point of failure (the NameNode)
- No random access
- No support for modifying files once written
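The small files limitation can be made concrete with rough arithmetic. The sketch below assumes the commonly quoted rule of thumb that the NameNode spends about 150 bytes of heap per file and per block, and a 128 MB block size. Both figures are assumptions for illustration and do not come from these slides.

```python
import math

# Assumed rule of thumb: ~150 bytes of NameNode heap per namespace object
# (one object per file plus one per block). Illustrative, not from the slides.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size, in bytes


def namenode_objects(file_size: int, file_count: int) -> int:
    """Count the file and block objects the NameNode must track."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_size: int, file_count: int) -> int:
    """Estimate NameNode heap used for file_count files of file_size bytes."""
    return namenode_objects(file_size, file_count) * BYTES_PER_OBJECT


ONE_MB = 1024 ** 2
ONE_GB = 1024 ** 3

# The same 1 GB of data stored two ways.
one_large_file = namenode_bytes(ONE_GB, 1)       # 1 file + 8 blocks
many_small_files = namenode_bytes(ONE_MB, 1024)  # 1024 files + 1024 blocks

print(one_large_file, many_small_files)
```

Under these assumptions the same gigabyte costs the NameNode a couple of hundred times more memory when it is stored as 1 MB files, which is why HDFS favours a small number of large files.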
HBase: Example
Column family: Personal Information
    Birth-Place:     T1 = Mumbai
    Marital Status:  T1 = Unmarried, T2 = Married
    Weight:          T1 = 65, T2 = 76
Column family: Other Information
    Location:        T1 = Mumbai, T2 = Pune
    Employer:        T1 = XYZ
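The row above stores several timestamped versions of each cell, grouped by column family. A minimal Python sketch of that layout follows. The class `VersionedRow` and its methods are hypothetical names for illustration, not the HBase client API, and the sketch assumes T2 is later than T1 (modelled as timestamps 2 and 1).

```python
class VersionedRow:
    """Toy model of one row: family -> qualifier -> {timestamp: value}."""

    def __init__(self):
        self._cells = {}

    def put(self, family, qualifier, timestamp, value):
        """Store one version of a cell."""
        versions = self._cells.setdefault(family, {}).setdefault(qualifier, {})
        versions[timestamp] = value

    def get(self, family, qualifier, timestamp=None):
        """Return the newest version, or the newest at or before timestamp."""
        versions = self._cells.get(family, {}).get(qualifier, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


T1, T2 = 1, 2  # assumption: T2 is the later timestamp

row = VersionedRow()
row.put("Personal Information", "Birth-Place", T1, "Mumbai")
row.put("Personal Information", "Marital Status", T1, "Unmarried")
row.put("Personal Information", "Marital Status", T2, "Married")
row.put("Personal Information", "Weight", T1, 65)
row.put("Personal Information", "Weight", T2, 76)
row.put("Other Information", "Location", T1, "Mumbai")
row.put("Other Information", "Location", T2, "Pune")
row.put("Other Information", "Employer", T1, "XYZ")

print(row.get("Personal Information", "Marital Status"))      # latest version
print(row.get("Personal Information", "Marital Status", T1))  # version as of T1
```

Reading a cell without a timestamp returns the most recent version, while older versions remain available by asking for an earlier point in time.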
HBase: Limitations
- Expensive full-row reads
- No secondary keys
- No SQL support
- Not efficient for big cell values
- Scalable data warehouse on top of Hadoop, developed by Facebook
- SQL-like query language (HiveQL)
- Limited JDBC support
- Support for rich data types
- Ability to insert custom map-reduce jobs
Hive: Limitations
- No support for updating data
- Only bulk load supported
- Not efficient for small datasets
Hive: Example
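-- Employee table backed by a tab-delimited text file
CREATE TABLE employee (id BIGINT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Bulk-load a local file, replacing any existing rows
LOAD DATA LOCAL INPATH '/sas/employee.txt'
OVERWRITE INTO TABLE employee;

-- Assumption: the target table is not defined on the slide, so create it here
CREATE TABLE oldest_employee LIKE employee;

-- Keep the 100 oldest employees
INSERT OVERWRITE TABLE oldest_employee
SELECT * FROM employee SORT BY age DESC LIMIT 100;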
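To show what the query computes, here is a plain Python sketch of the same steps: parse tab-delimited rows, sort by age in descending order, and keep the first N. The sample rows and the function names `load_employees` and `oldest` are made up for illustration.

```python
import csv
import io


def load_employees(text):
    """Parse tab-delimited (id, name, age) rows, as in the employee table."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(emp_id), name, int(age)) for emp_id, name, age in reader]


def oldest(employees, limit=100):
    """Return up to `limit` employees, oldest first."""
    return sorted(employees, key=lambda e: e[2], reverse=True)[:limit]


# Hypothetical sample data standing in for /sas/employee.txt
SAMPLE = "1\tAsha\t41\n2\tRavi\t29\n3\tMeera\t55\n"

employees = load_employees(SAMPLE)
top_two = oldest(employees, limit=2)
print(top_two)
```

The sketch performs one global sort. In Hive, SORT BY orders rows within each reducer, whereas ORDER BY produces a single total order, so the choice between them matters once more than one reducer is involved.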
Pig (Data Access)
- Pig Latin: high-level data flow language
- Client-side library; no server-side deployment needed
- Batch processing of large unstructured data
- Procedural language
Pig: Limitations
Not efficient for processing small datasets
Pig: Example
Load employee data from a text file, filter it by age and joining year, group it by joining year, and report the maximum age in each group.
-- Load employee records (tab-delimited by default)
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- Keep employees older than 30 who joined between 2000 and 2012
filtered_records = FILTER records BY age > 30
    AND (joiningYear >= 2000 AND joiningYear <= 2012);
-- Group by joining year and find the maximum age in each group
grouped_records = GROUP filtered_records BY joiningYear;
max_age = FOREACH grouped_records GENERATE group, MAX(filtered_records.age);
DUMP max_age;
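The same data flow can be sketched in plain Python to make each step explicit: filter, group, then aggregate. The sketch assumes the year condition is meant as the range 2000 to 2012 inclusive. The sample records and the function name `max_age_by_joining_year` are hypothetical.

```python
from collections import defaultdict


def max_age_by_joining_year(records, min_age=30, first_year=2000, last_year=2012):
    """Filter (joining_year, employee_id, age) records, then take max age per year."""
    # FILTER: age above the threshold and joining year inside the assumed range
    filtered = [
        r for r in records
        if r[2] > min_age and first_year <= r[0] <= last_year
    ]
    # GROUP BY joining year
    grouped = defaultdict(list)
    for joining_year, _employee_id, age in filtered:
        grouped[joining_year].append(age)
    # FOREACH group GENERATE group, MAX(age)
    return {year: max(ages) for year, ages in grouped.items()}


# Hypothetical sample records: (joining_year, employee_id, age)
SAMPLE_RECORDS = [
    (2005, 1, 45),
    (2005, 2, 38),
    (2010, 3, 29),  # dropped: not older than 30
    (2010, 4, 52),
    (1998, 5, 60),  # dropped: joined before 2000
]

result = max_age_by_joining_year(SAMPLE_RECORDS)
print(result)
```

Each Python statement corresponds to one Pig Latin relation, which reflects the step-by-step, procedural style of a Pig script.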
Conclusion
Organizations should:
- Revisit their data strategy
- Evaluate the Hadoop ecosystem
- Build economical, scalable solutions for Big Data problems
Thank You