Objectives
Hive
• Hive is an open-source Apache project that was originally
developed at Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (a high-level abstraction on top of MapReduce).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
• This lesson covers Hive and Pig at a high level.
Hive: Data Units
Databases
Tables
Partitions
The Hive Metastore Database
Hive Framework
• External interfaces: CLI, JDBC, Hue
• HiveServer2
Creating a Hive Database
1. Start hive.
Data Manipulation in Hive
Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z AS ID FROM (
    SELECT a, sum(b*b) AS timesTwo
    FROM myTable
    GROUP BY a) mt
JOIN otherTable
ON otherTable.z = mt.a
GROUP BY mt.a, otherTable.z
[Figure: Job # 1 (map tasks, then reduce tasks) writes a temporary result; Job # 2 (map tasks, then reduce tasks) reads it and produces the output.]
Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
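The two-job flow described in the notes can be imitated in plain Python. This is only a sketch under stated assumptions: the sample rows, the run_job helper, and the mapper and reducer names are hypothetical, and the real plan Hive generates depends on its version and optimizer settings. Job 1 evaluates the inner query and writes a temporary result; Job 2 joins that result with otherTable.

```python
from collections import defaultdict

# Hypothetical sample rows; the slide gives no data for myTable or otherTable.
my_table = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 2, "b": 4}]
other_table = [{"z": 1}, {"z": 2}, {"z": 3}]


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


# Job 1: the inner query (SELECT a, sum(b*b) AS timesTwo ... GROUP BY a).
def inner_map(row):
    yield row["a"], row["b"] * row["b"]


def inner_reduce(a, squares):
    yield {"a": a, "timesTwo": sum(squares)}


temporary_result = run_job(my_table, inner_map, inner_reduce)


# Job 2: the outer query joins the temporary result with otherTable on z = a.
# Rows are tagged with their source so the reducer can pair them up.
def outer_map(tagged):
    source, row = tagged
    key = row["a"] if source == "mt" else row["z"]
    yield key, (source, row)


def outer_reduce(key, tagged_rows):
    mt_rows = [row for source, row in tagged_rows if source == "mt"]
    other_rows = [row for source, row in tagged_rows if source == "other"]
    # With this sample data each (a, z) pair is unique, so the outer
    # GROUP BY does not merge any rows.
    for mt in mt_rows:
        for other in other_rows:
            yield {"a": mt["a"], "timesTwo": mt["timesTwo"], "ID": other["z"]}


job2_input = [("mt", row) for row in temporary_result]
job2_input += [("other", row) for row in other_table]
output = run_job(job2_input, outer_map, outer_reduce)
print(output)
```

Job 2 cannot start until temporary_result exists, which is the sequential dependency the notes describe.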
Steps in a Hive Query
Map step: if face_card: emit(suit, card)
Reduce step: emit(suit, count(suit))
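The face-card pseudocode above can be made concrete with a small Python simulation. This is a minimal sketch, not Hive's actual execution path: the card data, the FACE_RANKS set, and the function names are assumptions introduced for illustration. The map step emits (suit, card) only for face cards, and the reduce step counts the cards collected for each suit.

```python
from collections import defaultdict

# Assumption: face cards are jacks, queens, and kings.
FACE_RANKS = {"J", "Q", "K"}

# Hypothetical input: (rank, suit) pairs.
cards = [
    ("K", "hearts"),
    ("7", "hearts"),
    ("Q", "spades"),
    ("J", "hearts"),
    ("2", "clubs"),
    ("K", "spades"),
]


def map_card(card):
    """Map step: emit (suit, card) only when the card is a face card."""
    rank, suit = card
    if rank in FACE_RANKS:
        yield suit, card


def reduce_suit(suit, suit_cards):
    """Reduce step: emit (suit, number of face cards seen for that suit)."""
    return suit, len(suit_cards)


def count_face_cards(all_cards):
    # Shuffle phase: group the mapper output by key (the suit).
    shuffled = defaultdict(list)
    for card in all_cards:
        for suit, value in map_card(card):
            shuffled[suit].append(value)
    return dict(reduce_suit(suit, values) for suit, values in shuffled.items())


face_counts = count_face_cards(cards)
print(face_counts)
```

Suits with no face cards never reach the reduce step, so they do not appear in the result.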
Hive-Based Applications
• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling
Hive: Limitations
Pig: Overview
Pig:
• Is an open-source high-level data flow system
• Provides a simple language called Pig Latin for queries
and data manipulation, which is compiled into MapReduce
jobs that run on Hadoop
Pig Latin
Pig Applications
Running Pig Latin Statements
Pig Latin: Features
Ease of programming
Optimization opportunities
Extensibility
Working with Pig
Summary
End Of Topic