
Overview of Hive and Pig

Objectives

After completing this lesson, you should be able to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

6-2
Hive
• Hive is an open source Apache project and was originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (high level abstraction on top of MapReduce).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
• This lesson covers Hive and Pig at a high level.
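As a simple illustration (the table and column names here are placeholders, not part of this lesson), a familiar SQL-style statement such as the following is compiled by Hive into a MapReduce job rather than executed by a traditional database engine:

SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;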

6-3
Hive: Data Units

Databases

Tables

Partitions
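A sketch of how these data units nest, using illustrative names: a table is created inside a database, and the table can be divided into partitions.

CREATE DATABASE sales_db;
USE sales_db;
CREATE TABLE orders (order_id INT, amount DOUBLE)
PARTITIONED BY (order_date STRING);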

6-4
The Hive Metastore Database

• Contains metadata regarding databases, tables, and partitions
• Contains information about how the rows and columns are delimited in the HDFS files that are used in the queries
• Is an RDBMS, such as MySQL, in which Hive persists table schemas and other system metadata
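The metadata recorded in the metastore can be inspected from Hive itself; for example, DESCRIBE FORMATTED lists the columns, storage location, and row format delimiters stored for a table (the table name below is illustrative).

DESCRIBE FORMATTED sales_db.orders;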

6-5
Hive Framework

[Diagram: external interfaces such as the CLI, JDBC clients, and Hue connect to Hive through HiveServer2.]
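As one example of the JDBC path, the beeline client can connect to HiveServer2; the host, port, and user below are placeholders rather than values from this lesson.

$ beeline -u jdbc:hive2://localhost:10000 -n hiveuser
0: jdbc:hive2://localhost:10000> SHOW DATABASES;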

6-6
Creating a Hive Database
1. Start hive.

2. Create the database.

3. Verify the database creation.
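A minimal command-line session following these three steps might look like this (the database name is illustrative):

$ hive
hive> CREATE DATABASE retail;
hive> SHOW DATABASES;
default
retail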

6-7
Data Manipulation in Hive

Hive SELECT with a WHERE clause:

SELECT a, sum(b)
FROM myTable
WHERE a < 100
GROUP BY a

[Diagram: map tasks read myTable and apply the WHERE filter; their output is aggregated by reduce tasks to produce the result.]

Note: Aggregations such as GROUP BY are handled by reduce tasks.

6-8
Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (
    SELECT a, sum(b*b) AS timesTwo
    FROM myTable
    GROUP BY a) mt
JOIN otherTable
    ON otherTable.z = mt.a
GROUP BY mt.a, otherTable.z

[Diagram: Job #1 runs the inner subquery (map tasks followed by reduce tasks) and writes a temporary result; Job #2 joins that temporary result with otherTable and produces the final output.]

Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
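One way to see how Hive splits such a statement into jobs is to prefix it with the EXPLAIN keyword, which prints the planned stages without running them (the query reuses the illustrative tables above):

EXPLAIN
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (SELECT a, sum(b*b) AS timesTwo FROM myTable GROUP BY a) mt
JOIN otherTable ON otherTable.z = mt.a
GROUP BY mt.a, otherTable.z;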

6-9
Steps in a Hive Query

HiveQL:

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

[Diagram: the map task emits (suit, card) for each face card; the shuffle groups the pairs by suit; the reduce task emits (suit, count(suit)). The job runs on the Hadoop cluster under the JobTracker or Resource Manager.]

6 - 10
Hive-Based Applications

• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling

6 - 11
Hive: Limitations

• No support for materialized views
• No transaction-level support
• Not ideal for ad hoc work
• Limited subquery support

6 - 12
Pig: Overview

Pig:
• Is an open source, high-level data flow system
• Provides a simple language called Pig Latin for queries and data manipulation, which is compiled into MapReduce jobs that run on Hadoop
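A short Pig Latin sketch of such a data flow (the file and field names are illustrative): each statement defines a relation, and the script as a whole is compiled into MapReduce jobs.

students = LOAD 'students.txt' USING PigStorage(',')
           AS (name:chararray, gpa:float);
honors = FILTER students BY gpa >= 3.5;
DUMP honors;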

6 - 13
Pig Latin

• Is a high-level data flow language
• Provides common operations like join, group, sort, and so on
• Works on files in HDFS
• Was developed by Yahoo
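The common operations named above correspond directly to Pig Latin keywords; a minimal sketch with illustrative relations and fields:

orders    = LOAD 'orders.txt' AS (id:int, cust:int, amount:double);
customers = LOAD 'customers.txt' AS (cust:int, name:chararray);
joined    = JOIN orders BY cust, customers BY cust;
grouped   = GROUP joined BY name;
sorted    = ORDER orders BY amount DESC;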

6 - 14
Pig Applications

• Rapid prototyping of algorithms for processing large data sets
• Log analysis
• Ad hoc queries across large data sets
• Analytics and sampling
• PigMix: A set of performance and scalability benchmarks

6 - 15
Running Pig Latin Statements

You can execute Pig Latin statements:
• Using the grunt shell or command line
• In MapReduce mode or local mode
• Either interactively or in batch
Pig processes Pig Latin statements as follows:
1. It validates the syntax and semantics of all statements.
2. If Pig encounters a DUMP or STORE statement, it executes the statements defined up to that point.
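A sketch of this behavior in local mode, with an illustrative input file: the LOAD and FILTER statements are only validated when entered, and nothing runs until the DUMP is reached.

$ pig -x local
grunt> logs = LOAD 'access_log' AS (ip:chararray, url:chararray);
grunt> errors = FILTER logs BY url MATCHES '.*error.*';
grunt> DUMP errors;   -- execution of the whole pipeline starts here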

6 - 16
Pig Latin: Features

Ease of programming

Optimization opportunities

Extensibility

Structure that accommodates substantial parallelization

6 - 17
Working with Pig

1. Open a terminal window, type pig, and press Enter.

2. To execute scripts, use the grunt shell prompt.
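A minimal grunt session following these steps (the script name is illustrative); exec runs a script and then returns to the prompt:

$ pig
grunt> exec myscript.pig
grunt> quit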

6 - 18
Summary

In this lesson, you should have learned how to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

6 - 19
End Of Topic

6 - 20
