
BIGDATA

PIG
What is Pig?

 It is a Hadoop framework aimed at developers who do not write Java.
 Scripts are written in the Pig Latin language.
 Pig Latin is a data-flow language.
 In level of abstraction, it sits between hand-written Java MapReduce and Hive.
 Pig translates each script into MapReduce programs under the hood.
 It was originally developed at Yahoo! Research around 2006 and moved to the Apache Software Foundation in 2007.
 "Pigs eat anything": Pig can handle structured, semi-structured, and unstructured data.

Why Pig?

 Writing MapReduce jobs directly requires experienced Java programmers.
 Pig needs far less code for the same task.
 No Java knowledge is required.
 Development time is much shorter.
 It can process any kind of data (structured, semi-structured, unstructured).
 It is well suited to ad hoc queries.
 It is extensible through user-defined functions (UDFs) written in Java, Python, JavaScript, or Ruby.

1 sairavi.bigdata@gmail.com
99520 29030
Use case

 Suppose you have user data in one file and website data in another, and you need to find the five most visited pages among users aged 18 to 25.
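The use case above can be sketched in Pig Latin roughly as follows. This is a minimal sketch, not a definitive implementation: the file names (users.txt, pages.txt), the comma delimiter, and the field names are assumptions made for illustration.

```pig
-- Assumed inputs (hypothetical): users.txt = name,age   pages.txt = user,url
users  = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
pages  = LOAD 'pages.txt' USING PigStorage(',') AS (user:chararray, url:chararray);

-- Keep only users aged 18 to 25
young  = FILTER users BY age >= 18 AND age <= 25;

-- Attach each page visit to the matching user
joined = JOIN young BY name, pages BY user;

-- Count visits per page
by_url = GROUP joined BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(joined) AS visits;

-- Most visited first, then keep the first five
sorted = ORDER counts BY visits DESC;
top5   = LIMIT sorted 5;

STORE top5 INTO 'top5_pages';
```

Each statement names an intermediate relation, which is what makes Pig Latin a data-flow language: the script reads as a pipeline from LOAD to STORE.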

ETL
 Processing large amounts of log data.
 Cleaning bad data.
Research on raw data:
 User audit logs.
 Schema may be unknown or inconsistent.


Pig Data Types

 Atom - A single value (a field), such as a string or a number.

 Example: Sai

 Tuple - An ordered set of fields. A tuple is written in parentheses.

 Example: (Sai,20)

 Bag - A collection of tuples. A bag is written in curly braces.

 Example: {(1,2),(3,4)}

 Map - A set of key-value pairs. A map is written in square brackets, and # separates each key from its value.

 Example: ['name'#'Ravi', 'age'#30]
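The four types above can appear together in a LOAD schema. The sketch below assumes a hypothetical tab-separated file, students.txt, whose lines look like `Sai	20	{(math,90),(physics,85)}	[city#Chennai]`; the file name, field names, and the 'city' key are illustrative only.

```pig
-- name and age are atoms, marks is a bag of tuples, info is a map
students = LOAD 'students.txt'
           AS (name:chararray,
               age:int,
               marks:bag{t:(subject:chararray, score:int)},
               info:map[]);

-- Print the schema of the relation
DESCRIBE students;

-- Look up one key in the map with the # operator
cities = FOREACH students GENERATE name, info#'city';
DUMP cities;
```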

Pig Program Structure

Script

 Pig can run a script file that contains Pig commands. Example: pig pig1.pig

Grunt

 Grunt is an interactive shell for running Pig commands.
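A short Grunt session might look like the following sketch. The file users.txt and its fields are assumptions for illustration; the `grunt>` prefix is the shell prompt, not part of the statements.

```pig
grunt> users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> DESCRIBE users;
grunt> DUMP users;
grunt> quit
```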

Embedded

 Pig programs can also be run from within a Java program.

Pig Execution Modes

Local mode

 Executes in a single JVM.

 Works exclusively on the local file system.

 Does not require a Hadoop cluster or HDFS.

 Generally used for testing.

 Example: pig -x local Sample_script.pig

Map/Reduce Mode

 In this mode, executing Pig Latin statements launches MapReduce jobs in the back end, which operate on data stored in HDFS. This is the default mode.

 Example: pig -x mapreduce Sample_script.pig


Pig Architecture

Pig Latin Relational Operators

Loading and Storing

 LOAD

 STORE

 DUMP
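A minimal sketch of the three operators above, assuming a hypothetical comma-separated file users.txt with name and age fields:

```pig
-- LOAD reads a file into a relation and (optionally) assigns a schema
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- DUMP prints the relation to the console, useful while debugging
DUMP users;

-- STORE writes the relation to an output directory, here tab-separated
STORE users INTO 'users_out' USING PigStorage('\t');
```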

Filtering

 FILTER

 DISTINCT

 FOREACH...GENERATE

 STREAM
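The filtering operators above could be combined as in the sketch below. The input file access_log.txt, its fields, and the availability of the `tr` command on the machines running the job are all assumptions.

```pig
logs  = LOAD 'access_log.txt' USING PigStorage('\t')
        AS (user:chararray, url:chararray, status:int);

-- FILTER keeps only the rows that satisfy a condition
ok    = FILTER logs BY status == 200;

-- FOREACH...GENERATE projects (or transforms) fields of every row
pairs = FOREACH ok GENERATE user, url;

-- DISTINCT removes duplicate rows
uniq  = DISTINCT pairs;

-- STREAM pipes every row through an external program
upper = STREAM uniq THROUGH `tr a-z A-Z`;

DUMP upper;
```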

Grouping and Joining

 JOIN

 COGROUP (groups the data in two or more relations)

 GROUP (groups the data in a single relation)

 CROSS - Creates the cross product of two or more relations
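The grouping and joining operators above are sketched below on two hypothetical inputs, users.txt (name, age) and visits.txt (user, url); the names are illustrative only.

```pig
users  = LOAD 'users.txt'  USING PigStorage(',') AS (name:chararray, age:int);
visits = LOAD 'visits.txt' USING PigStorage(',') AS (user:chararray, url:chararray);

-- GROUP: one row per age, holding a bag of the users with that age
by_age   = GROUP users BY age;

-- JOIN: one flat row per matching (user, visit) pair
joined   = JOIN users BY name, visits BY user;

-- COGROUP: one row per key, holding one bag per input relation
together = COGROUP users BY name, visits BY user;

-- CROSS: every user paired with every visit (expensive on large inputs)
pairs    = CROSS users, visits;

DUMP together;
```

The difference between JOIN and COGROUP is the shape of the result: JOIN flattens the matches into rows, while COGROUP keeps them nested in bags.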

Sorting

 ORDER

 LIMIT
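ORDER and LIMIT are often used together to take the top rows of a relation. A minimal sketch, again assuming a hypothetical users.txt file:

```pig
users   = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- ORDER sorts by one or more fields, ascending unless DESC is given
sorted  = ORDER users BY age DESC;

-- LIMIT keeps at most the given number of rows
oldest3 = LIMIT sorted 3;

DUMP oldest3;
```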

Combining and Splitting

 UNION (Combines two or more relations into one)

 SPLIT (Splits a relation into two or more relations)
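A short sketch of both operators, assuming two hypothetical input files that share the same schema:

```pig
a = LOAD 'users_a.txt' USING PigStorage(',') AS (name:chararray, age:int);
b = LOAD 'users_b.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- UNION concatenates the two relations (duplicates are kept)
all_users = UNION a, b;

-- SPLIT routes each row to every relation whose condition it satisfies
SPLIT all_users INTO minors IF age < 18, adults IF age >= 18;

DUMP adults;
```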
