
Overview of Hive and Pig

Objectives

After completing this lesson, you should be able to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

6-2
Hive
• Hive is an open source Apache project and was originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (high level abstraction on top of MapReduce).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
• This lesson covers Hive and Pig at a high level.
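As a simple illustration (the table and column names here are placeholders, not part of this lesson), a familiar SQL-style statement such as the following is compiled by Hive into a MapReduce job rather than executed by a traditional database engine:

SELECT dept, COUNT(*)
FROM employees
GROUP BY dept;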

6-3
Hive: Data Units

Databases

Tables

Partitions
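A sketch of how these data units nest, using illustrative names: a table is created inside a database, and the table can be divided into partitions.

CREATE DATABASE sales_db;
USE sales_db;
CREATE TABLE orders (order_id INT, amount DOUBLE)
PARTITIONED BY (order_date STRING);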

6-4
The Hive Metastore Database

• Contains metadata regarding databases, tables, and partitions
• Contains information about how the rows and columns are delimited in the HDFS files that are used in the queries
• Is an RDBMS, such as MySQL, in which Hive persists table schemas and other system metadata
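The metadata recorded in the metastore can be inspected from Hive itself; for example, DESCRIBE FORMATTED lists the columns, storage location, and row format delimiters stored for a table (the table name below is illustrative).

DESCRIBE FORMATTED sales_db.orders;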

6-5
Hive Framework

[Diagram: external interfaces such as the CLI, JDBC clients, and Hue connect to Hive through HiveServer2.]
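As one example of the JDBC path, the beeline client can connect to HiveServer2; the host, port, and user below are placeholders rather than values from this lesson.

$ beeline -u jdbc:hive2://localhost:10000 -n hiveuser
0: jdbc:hive2://localhost:10000> SHOW DATABASES;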

6-6
Creating a Hive Database
1. Start hive.

2. Create the database.

3. Verify the database creation.
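A minimal command-line session following these three steps might look like this (the database name is illustrative):

$ hive
hive> CREATE DATABASE retail;
hive> SHOW DATABASES;
default
retail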

6-7
Data Manipulation in Hive

Hive SELECT with a WHERE clause:

SELECT a, sum(b)
FROM myTable
WHERE a < 100
GROUP BY a

[Diagram: map tasks read myTable and apply the WHERE filter; their output is aggregated by reduce tasks to produce the result.]

Note: Aggregations such as GROUP BY are handled by reduce tasks.

6-8
Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (
    SELECT a, sum(b*b) AS timesTwo
    FROM myTable
    GROUP BY a) mt
JOIN otherTable
    ON otherTable.z = mt.a
GROUP BY mt.a, otherTable.z

[Diagram: Job #1 runs the inner subquery (map tasks followed by reduce tasks) and writes a temporary result; Job #2 joins that temporary result with otherTable and produces the final output.]

Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
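One way to see how Hive splits such a statement into jobs is to prefix it with the EXPLAIN keyword, which prints the planned stages without running them (the query reuses the illustrative tables above):

EXPLAIN
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (SELECT a, sum(b*b) AS timesTwo FROM myTable GROUP BY a) mt
JOIN otherTable ON otherTable.z = mt.a
GROUP BY mt.a, otherTable.z;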

6-9
Steps in a Hive Query

HiveQL:

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

[Diagram: the map task emits (suit, card) for each face card; the shuffle groups the pairs by suit; the reduce task emits (suit, count(suit)). The job runs on the Hadoop cluster under the JobTracker or Resource Manager.]

6 - 10
Hive-Based Applications

• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling

6 - 11
Hive: Limitations

• No support for materialized views
• No transaction-level support
• Not ideal for ad hoc work
• Limited subquery support

6 - 12
Pig: Overview

Pig:
• Is an open source, high-level data flow system
• Provides a simple language called Pig Latin for queries and data manipulation, which is compiled into MapReduce jobs that run on Hadoop
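A short Pig Latin sketch of such a data flow (the file and field names are illustrative): each statement defines a relation, and the script as a whole is compiled into MapReduce jobs.

students = LOAD 'students.txt' USING PigStorage(',')
           AS (name:chararray, gpa:float);
honors = FILTER students BY gpa >= 3.5;
DUMP honors;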

6 - 13
Pig Latin

• Is a high-level data flow language
• Provides common operations like join, group, sort, and so on
• Works on files in HDFS
• Was developed by Yahoo
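The common operations named above correspond directly to Pig Latin keywords; a minimal sketch with illustrative relations and fields:

orders    = LOAD 'orders.txt' AS (id:int, cust:int, amount:double);
customers = LOAD 'customers.txt' AS (cust:int, name:chararray);
joined    = JOIN orders BY cust, customers BY cust;
grouped   = GROUP joined BY name;
sorted    = ORDER orders BY amount DESC;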

6 - 14
Pig Applications

• Rapid prototyping of algorithms for processing large data sets
• Log analysis
• Ad hoc queries across large data sets
• Analytics and sampling
• PigMix: A set of performance and scalability benchmarks

6 - 15
Running Pig Latin Statements

You can execute Pig Latin statements:
• Using the grunt shell or command line
• In MapReduce mode or local mode
• Either interactively or in batch
Pig processes Pig Latin statements as follows:
1. It validates the syntax and semantics of all statements.
2. If Pig encounters a DUMP or STORE statement, it executes the statements defined up to that point.
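A sketch of this behavior in local mode, with an illustrative input file: the LOAD and FILTER statements are only validated when entered, and nothing runs until the DUMP is reached.

$ pig -x local
grunt> logs = LOAD 'access_log' AS (ip:chararray, url:chararray);
grunt> errors = FILTER logs BY url MATCHES '.*error.*';
grunt> DUMP errors;   -- execution of the whole pipeline starts here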

6 - 16
Pig Latin: Features

Ease of programming

Optimization opportunities

Extensibility

Structure that accommodates substantial parallelization

6 - 17
Working with Pig

1. Open a terminal window, type pig, and press Enter.

2. To execute scripts, use the grunt shell prompt.
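A minimal grunt session following these steps (the script name is illustrative); exec runs a script and then returns to the prompt:

$ pig
grunt> exec myscript.pig
grunt> quit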

6 - 18
Summary

In this lesson, you should have learned how to:
• Define Hive
• Describe the Hive data flow
• Create a Hive database
• Define Pig
• List the features of Pig

6 - 19
End Of Topic

6 - 20
