
Apache Pig

Pittsburgh Hadoop User Group


11/3/2009

Dmitriy Ryaboy
Ashutosh Chauhan
In This Talk
• What is Pig and why it’s needed
– This part is going to be brief.
– It’s a Hadoop User Group, after all.
• Examples
– As well as some extra motivation
• Advanced Features
• Improvements currently in development
• Interesting research problems
– Want to get involved?
What is Pig
or, “duality of Pig”

Pig compiles data analysis tasks into Map-Reduce jobs and runs them on Hadoop.

Pig Latin is a language for expressing data transformation flows.

Pig can be made to understand other languages, too.


There is an SQL prototype. It is just a question of compiling a
language into the internal operator tree.
See: Smith, Agent. The Matrix Trilogy, Warner Bros.
Who Uses Pig
• Yahoo
– “40% of all Hadoop jobs are run with Pig”
• Twitter
– “Some Java-based MapReduce, some
Hadoop Streaming”
– “Most analysis, and most interesting
analysis, done in Pig”
• LinkedIn, AOL, CoolIris, Ning, eBuddy …
Why use Pig
• Express data transformation tasks in a few lines.
• “Reach out and touch a Petabyte.”

Image credit: Michelangelo, with apologies.


Pig Latin Example
Why not plain Map/Reduce?

Brevity.
Corresponding Pig Script
Why Not SQL
• Pig Latin allows expressing transformations as a
sequence of steps. SQL describes desired
outcome.
• Writing down a sequence of steps is intuitive to
developers.
– Allows expressing more complex data flows
– But still no loops or conditionals.
• Support for nested structures (arrays, maps)
• Much easier to work with groups
– Let me show you…
Top 5 scores for each player
• Data: < playerId, score, date >
• For each player
• Best 5 scores
• Date each of these scores was achieved

• Common task; this particular one lifted from http://stackoverflow.com/questions/1467898

• Classically painful in SQL
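
For reference, the task itself is small when written procedurally. The following is a minimal Python sketch, not the Pig Latin script from the talk; the function name `top_scores` and the sample records are made up for illustration.

```python
import heapq
from collections import defaultdict

def top_scores(records, n=5):
    """Return {player_id: [(score, date), ...]} holding each player's n best scores."""
    by_player = defaultdict(list)
    for player_id, score, date in records:
        by_player[player_id].append((score, date))
    # Keep the n highest scores per player, best first, with their dates.
    return {
        player: heapq.nlargest(n, rows, key=lambda row: row[0])
        for player, rows in by_player.items()
    }

# Hypothetical sample data in the < playerId, score, date > layout.
records = [
    ("alice", 90, "2009-01-03"),
    ("alice", 85, "2009-01-04"),
    ("alice", 70, "2009-02-01"),
    ("alice", 99, "2009-02-10"),
    ("alice", 60, "2009-03-05"),
    ("alice", 75, "2009-03-09"),
    ("bob", 40, "2009-01-03"),
    ("bob", 55, "2009-01-09"),
]
best = top_scores(records)
```

The group-then-limit shape here (group by player, order within the group, keep five) is the part that is awkward to express in a single SQL statement.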


SQL

• Self-join
• Data explosion for same values of tblAbc.aa
• Readability?
• I like SQL. But this is far from straightforward.
SQL

We can fix these problems by going procedural.


There goes the brevity and declarativity.
Pig Latin
Diving Deeper

http://www.flickr.com/photos/lollaping/2573664394/
User-Defined Functions
• Java Interfaces
– Reading / Writing Data
• Allows reading from DBs, HBase, custom file formats
• Significant API overhaul underway.
– Transformation / Evaluation of Tuples
– Group Aggregation
• See Piggybank for examples
http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/
• Support for other languages in the works
https://issues.apache.org/jira/browse/PIG-928
Streaming
Multiquery Optimization
• IO cost of scanning the data dominates most jobs
• Share data scans
• This example requires separate pipelines for state and demographics
• Or does it?

[Diagram: load users -> filter out bots -> two branches: (group by state -> apply UDFs -> store into ‘bystate’) and (group by demographic -> apply UDFs -> store into ‘bydemo’)]

Slide credit: Alan Gates et al, VLDB 2009


Multiquery Optimization
• Tag each record with the pipeline it belongs to
• Regular Hadoop grouping on [pipeline #, key]
• Multiplexer in reduce stage sends records to appropriate pipeline
• Use a single script to compute many things from shared sources.

[Diagram: map stage: filter -> split -> two local rearrange operators; reduce stage: multiplex -> two package operators, each followed by a foreach]

Slide credit: Alan Gates et al, VLDB 2009
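
The tagging and multiplexing steps can be sketched in plain Python. This is an illustrative simulation, not Pig's implementation: the `PIPELINES` table, the field names (`state`, `age_group`, `is_bot`) and the use of a simple count in place of the UDFs are all assumptions.

```python
from collections import defaultdict

# Hypothetical pipelines: (pipeline number, output name, grouping-key extractor).
PIPELINES = [
    (0, "bystate", lambda user: user["state"]),
    (1, "bydemo", lambda user: user["age_group"]),
]

def mq_map(users):
    """Single scan: filter bots once, then tag each record for every pipeline."""
    for user in users:
        if user["is_bot"]:
            continue
        for pipeline_no, _name, key_of in PIPELINES:
            yield (pipeline_no, key_of(user)), user

def mq_shuffle(pairs):
    """Stand-in for Hadoop grouping on the composite [pipeline #, key]."""
    groups = defaultdict(list)
    for composite_key, record in pairs:
        groups[composite_key].append(record)
    return groups

def mq_reduce(groups):
    """Multiplexer: route each group to the pipeline named by its tag."""
    names = {no: name for no, name, _key_of in PIPELINES}
    outputs = {name: {} for name in names.values()}
    for pipeline_no, key in sorted(groups):
        # A count stands in for the per-pipeline UDFs.
        outputs[names[pipeline_no]][key] = len(groups[(pipeline_no, key)])
    return outputs
```

Both outputs come from one pass over the input, which is the point of sharing the scan.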


Hash Join

• Default algorithm
• Mapper emits ([key, relid], tuple)
• Reducer sees all tuples from each relation with the same
key, performs join
– easy, due to sorted order

• Optimization: entries in the last relation can be streamed through (not accumulated in memory for each key)

• Can join multiple tables, supports outer joins


• If one relation has many duplicate keys, put it last!
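
A minimal Python sketch of this reduce-side join, for two relations and an inner join only. It simulates the shuffle with an in-memory sort; the function names and the convention that the join key is field 0 are assumptions made for the example.

```python
from itertools import groupby

def hj_map(relations):
    """Emit ((key, relation index), tuple) for every tuple; the key is field 0."""
    for rel_id, relation in enumerate(relations):
        for row in relation:
            yield (row[0], rel_id), row

def hj_reduce(pairs):
    """Inner join of two relations from pairs sorted by (key, relation index)."""
    ordered = sorted(pairs, key=lambda pair: pair[0])  # stand-in for the shuffle sort
    for _key, group in groupby(ordered, key=lambda pair: pair[0][0]):
        buffered = []  # tuples of the first relation for this key, held in memory
        for (_k, rel_id), row in group:
            if rel_id == 0:
                buffered.append(row)
            else:
                # Last relation: each tuple is streamed through, never accumulated.
                for left in buffered:
                    yield left + row
```

Because only the earlier relation is buffered per key, the relation with many duplicate keys costs the least memory when it comes last.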
Fragment-Replicate Join

• Join large table A with 1 or more small tables (B, C)
• Fragment large table across mappers
• Replicate smaller tables in entirety to all mappers
• Supports multiple tables
• Use when all small tables together can fit in memory on a single map task

[Diagram: table A fragmented across map tasks; tables B and C replicated to every map task]
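
The idea can be sketched in Python as a map-only join. This is an illustrative simulation under stated assumptions: the join key is field 0, the join is an inner join, and fragmenting is done by simple striding; none of these details come from Pig itself.

```python
from collections import defaultdict

def build_lookup(small_table):
    """Hash a small table on its join key (field 0); done once per map task."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[0]].append(row)
    return lookup

def fr_map_task(fragment, small_tables):
    """Join one fragment of the large table against in-memory copies of the small tables."""
    lookups = [build_lookup(table) for table in small_tables]
    for row in fragment:
        joined = [row]
        for lookup in lookups:
            joined = [left + right for left in joined for right in lookup.get(row[0], [])]
        yield from joined

def fragment_replicate_join(large_table, small_tables, num_mappers=3):
    """Fragment the large table, replicate the small ones, and collect the map output."""
    fragments = [large_table[i::num_mappers] for i in range(num_mappers)]
    out = []
    for fragment in fragments:
        out.extend(fr_map_task(fragment, small_tables))
    return out
```

No reduce stage is needed, which is why the small tables must fit in a single map task's memory.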
Merge Join

• 2 inputs, both sorted on join key


• Build sparse index into B
• Partition A across mappers
• Use index to read B from corresponding block on each mapper
• Bounded buffering on both sides
– least memory intensive
• Significant speed gains, especially on skewed data

[Diagram: index into B; A partitioned across map tasks]
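
A rough Python sketch of the index-then-merge idea. It assumes both inputs are lists sorted on field 0 and uses list offsets where a real system would use file or block offsets; the function names and the `stride` parameter are invented for the example.

```python
from bisect import bisect_left

def build_sparse_index(b_rows, stride=2):
    """Record (key, offset) for every stride-th row of the sorted input B."""
    return [(b_rows[i][0], i) for i in range(0, len(b_rows), stride)]

def start_offset(index, first_key):
    """Offset in B from which a mapper whose split of A begins at first_key should read."""
    keys = [key for key, _offset in index]
    # Last index entry with a key strictly below first_key, so that no
    # duplicate of first_key earlier in B is skipped.
    pos = bisect_left(keys, first_key) - 1
    return index[pos][1] if pos >= 0 else 0

def merge_split(a_split, b_rows, index):
    """Merge-join one sorted split of A with B, starting at the indexed offset."""
    if not a_split:
        return
    j = start_offset(index, a_split[0][0])
    for a_row in a_split:
        # Skip B rows whose key is below the current A key.
        while j < len(b_rows) and b_rows[j][0] < a_row[0]:
            j += 1
        # Emit every B row that shares the current A key.
        k = j
        while k < len(b_rows) and b_rows[k][0] == a_row[0]:
            yield a_row + b_rows[k]
            k += 1
```

Each mapper only reads B from the indexed position onward, and neither side is ever loaded whole per key.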
Skewed Join

• 1+ keys with more tuples per key than fit in reducer memory, in both relations
• Build histogram based on sample
• Round-robin skewed keys into multiple reducers
• Stream other table, sending each record to all reducers responsible for key

http://wiki.apache.org/pig/PigSkewedJoinSpec
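
The partitioning scheme can be sketched in Python as follows. This is a simplified simulation, not the algorithm in the spec linked above: it spreads every skewed key over all reducers, and the `threshold` parameter, the CRC-based default partitioner and the function names are assumptions made for the example.

```python
import zlib
from collections import Counter, defaultdict
from itertools import cycle

def default_reducer(key, num_reducers):
    """Deterministic stand-in for the regular hash partitioner."""
    return zlib.crc32(repr(key).encode()) % num_reducers

def find_skewed_keys(sample_keys, threshold):
    """Histogram over a sample; keys seen more than `threshold` times count as skewed."""
    histogram = Counter(sample_keys)
    return {key for key, count in histogram.items() if count > threshold}

def partition(skewed_table, other_table, skewed_keys, num_reducers):
    """Return {reducer: (rows of the skewed table, rows of the other table)}."""
    inputs = defaultdict(lambda: ([], []))
    round_robin = {key: cycle(range(num_reducers)) for key in skewed_keys}
    for row in skewed_table:
        key = row[0]
        if key in skewed_keys:
            reducer = next(round_robin[key])  # spread the heavy key across reducers
        else:
            reducer = default_reducer(key, num_reducers)
        inputs[reducer][0].append(row)
    for row in other_table:
        key = row[0]
        if key in skewed_keys:
            targets = range(num_reducers)  # every reducer responsible for this key
        else:
            targets = [default_reducer(key, num_reducers)]
        for reducer in targets:
            inputs[reducer][1].append(row)
    return inputs

def skewed_join(skewed_table, other_table, sample_keys, num_reducers=3, threshold=2):
    """Inner join; each reducer joins only the rows routed to it."""
    skewed_keys = find_skewed_keys(sample_keys, threshold)
    inputs = partition(skewed_table, other_table, skewed_keys, num_reducers)
    out = []
    for left_rows, right_rows in inputs.values():
        for left in left_rows:
            for right in right_rows:
                if left[0] == right[0]:
                    out.append(left + right)
    return out
```

Each row of the skewed table lands on exactly one reducer, so replicating the other table's rows produces every matching pair exactly once.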
In Development
• Release 0.5: Hadoop 0.20 compatibility
– Or apply a patch, or get Cloudera distro
– Imminent
• Release 0.6 (or later)
– Load/Store redesign. See PIG-966
– UDFs in other languages, see PIG-928
– Columnar Storage, see Zebra in /contrib
– Stored Schemas, see PIG-760
– Metadata service (“Owl”), see PIG-823
– SQL compiler, see PIG-824
– Auto selection of joins when stats are known, see us
Research Problems
aka, when pigs fly

• Extending the language


– Functions
– Conditional logic
– Loops
• Smarter data storage
– Learn from workload
• auto-selected column families
• different sort orders
• overlapping projections
– Go Faster
• pushing queries to storage
• lazy decompression
Research Problems
aka, when pigs fly

• Adaptive optimization
– Change plan mid-flight based on observation of current
environment
– Can we do this without waiting for MR stage to finish?
• Continuous Queries, Approximate Early Results
– Initial work at Berkeley: Map-Reduce Online *
• Cost-Based Optimization
– Appropriate cost model?
– balance of disk IO, network traffic, task overhead, memory
limitations on individual tasks…
– Different from, but similar to, distributed DBs

* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html
Further Reading
• Gates et al, “Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience”
VLDB 2009
http://infolab.stanford.edu/~olston/publications/vldb09.pdf
• Olston et al, “Generating example data for dataflow programs”
SIGMOD 2009 (best paper)
http://infolab.stanford.edu/~olston/publications/sigmod09.pdf
• Olston et al, “Pig Latin: A not-so-foreign language for data processing”
SIGMOD 2008
http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
• Olston et al, “Automatic optimization of parallel dataflow programs”
USENIX 2008
http://infolab.stanford.edu/~olston/publications/usenix08.pdf

• More: http://wiki.apache.org/pig/PigTalksPapers
Performance
average over 12 benchmark queries

Pig speed as factor of Hadoop time (average runtime)

Date          Factor
alpha         7.6
11/21/2008    2.5
1/20/2009     1.8
2/23/2009     1.6
5/28/2009     1.5
7/28/2009     1.4
8/27/2009     1.22
10/18/2009    1


Performance
weighted average over 12 benchmark queries

Pig speed as factor of Hadoop time (weighted average)

Date          Factor
alpha         11.20
11/21/2008    3.26
1/20/2009     2.20
2/23/2009     1.97
5/28/2009     1.83
7/28/2009     1.68
8/27/2009     1.53
10/18/2009    1.04


Queries?
dvryaboy@cmu.edu
achauha1@cs.cmu.edu

http://hadoop.apache.org/pig
pig-user@apache.org

@squarecog
