Dmitriy Ryaboy
Ashutosh Chauhan
In This Talk
• What is Pig and why it’s needed
– This part is going to be brief.
– It’s a Hadoop User Group, after all.
• Examples
– As well as some extra motivation
• Advanced Features
• Improvements currently in development
• Interesting research problems
– Want to get involved?
What is Pig
or, “duality of Pig”
Brevity.
Corresponding Pig Script
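The script on this slide was an image and did not survive extraction. As an illustrative stand-in (file paths, field names, and the task itself are assumptions, not the slide's actual script), a short Pig script of the kind contrasted with its multi-page Java Map-Reduce equivalent:

```pig
-- Assumed example: top 10 most-visited URLs
visits = LOAD '/data/visits' AS (user:chararray, url:chararray, time:long);
by_url = GROUP visits BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS n;
ranked = ORDER counts BY n DESC;
top10  = LIMIT ranked 10;
STORE top10 INTO '/out/top_sites';
```

Six lines of Pig Latin versus a few hundred lines of hand-written Map-Reduce Java is the "brevity" point the slide makes.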
Why Not SQL
• Pig Latin expresses transformations as a sequence of steps; SQL describes the desired outcome.
• Writing down a sequence of steps is intuitive to
developers.
– Allows expressing more complex data flows
– But still no loops or conditionals.
• Support for nested structures (arrays, maps)
• Much easier to work with groups
– Let me show you…
Top 5 scores for each player
• Data: < playerId, score, date >
• For each player:
– Best 5 scores
– Date each of these scores was achieved
• Self-join
• Data explosion for same values of tblAbc.aa
• Readability?
• I like SQL. But this is far from straightforward.
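For contrast, the same top-5-per-player task in Pig Latin is a nested FOREACH over the grouped relation (input path and types are assumptions based on the slide's schema):

```pig
scores    = LOAD 'scores' AS (playerId:chararray, score:long, date:chararray);
by_player = GROUP scores BY playerId;
top5      = FOREACH by_player {
              -- order each player's bag of scores, keep the best 5
              sorted = ORDER scores BY score DESC;
              best   = LIMIT sorted 5;
              GENERATE FLATTEN(best);
            };
STORE top5 INTO 'top5_per_player';
```

No self-join and no data explosion: each player's scores are sorted and truncated within that player's group.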
SQL
User-Defined Functions
• Java Interfaces
– Reading / Writing Data
• Allows reading from DBs, HBase, custom file formats
• Significant API overhaul underway.
– Transformation / Evaluation of Tuples
– Group Aggregation
• See Piggybank for examples
http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/
• Support for other languages in the works
https://issues.apache.org/jira/browse/PIG-928
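Using a Piggybank UDF from a script is a REGISTER plus a fully qualified (or DEFINEd) class name; a sketch, assuming the piggybank jar is on the local path and using its string Reverse UDF:

```pig
REGISTER piggybank.jar;
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

users    = LOAD 'users' AS (name:chararray);
reversed = FOREACH users GENERATE Reverse(name);
```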
Streaming
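Streaming pipes each tuple through an external program via the STREAM operator; a minimal sketch (file names and the grep command are assumptions):

```pig
raw      = LOAD 'logs' AS (line:chararray);
-- each line is written to the command's stdin; its stdout becomes tuples
filtered = STREAM raw THROUGH `grep -v bot` AS (line:chararray);
STORE filtered INTO 'clean_logs';
```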
Multiquery Optimization
• IO cost of scanning the data dominates most jobs
• Share data scans
• This example requires separate pipelines for state and demographics
• Or does it?
• A multiplexer in the reduce stage sends records to the appropriate pipeline
• Use a single script to compute many things from shared sources
[Diagram: load users → filter out bots → group by state / group by demographic → apply UDFs → store into ‘bystate’ / ‘bydemo’; in the reduce stage: package → multiplex → package, then foreach / foreach]
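The flow on this slide can be written as one script with two STOREs, letting the multiquery optimizer share the load and filter across both branches (field names and the bot filter are assumptions):

```pig
users    = LOAD 'users' AS (name:chararray, state:chararray, demographic:chararray);
real     = FILTER users BY name IS NOT NULL;  -- stand-in for the "filter out bots" step
by_state = GROUP real BY state;
by_demo  = GROUP real BY demographic;
st       = FOREACH by_state GENERATE group, COUNT(real);
dm       = FOREACH by_demo  GENERATE group, COUNT(real);
-- two stores in one script: the scan of 'users' happens once
STORE st INTO 'bystate';
STORE dm INTO 'bydemo';
```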
• Default algorithm
• Mapper emits ([key, relid], tuple)
• Reducer sees all tuples from each relation with the same key, performs join
– easy, due to sorted order
Fragment-Replicate Join
• Join large table A with 1 or more small tables (B, C)
• Fragment large table A across mappers
• Replicate smaller tables in their entirety to all mappers
[Diagram: each mapper reads one fragment of A plus complete copies of B and C]
Merge Join
[Diagram: pre-sorted inputs joined directly in the mapper]
• Bounded buffering on both sides
– least memory intensive
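Both specialized joins are requested with the USING clause on JOIN; a sketch with assumed aliases:

```pig
-- Fragment-replicate join: 'small' is loaded in its entirety on every mapper
j1 = JOIN big BY key, small BY key USING 'replicated';

-- Merge join: both inputs must already be sorted on the join key
j2 = JOIN left_sorted BY key, right_sorted BY key USING 'merge';
```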
• Adaptive optimization
– Change plan mid-flight based on observation of current
environment
– Can we do this without waiting for MR stage to finish?
• Continuous Queries, Approximate Early Results
– Initial work at Berkeley: Map-Reduce Online *
• Cost-Based Optimization
– Appropriate cost model?
– balance of disk IO, network traffic, task overhead, memory
limitations on individual tasks…
– Different from, but similar to, distributed DBs
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html
Further Reading
• Gates et al., “Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience,” VLDB 2009
http://infolab.stanford.edu/~olston/publications/vldb09.pdf
• Olston et al., “Generating example data for dataflow programs,” SIGMOD 2009 (best paper)
http://infolab.stanford.edu/~olston/publications/sigmod09.pdf
• Olston et al., “Pig Latin: A not-so-foreign language for data processing,” SIGMOD 2008
http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
• Olston et al., “Automatic optimization of parallel dataflow programs,” USENIX 2008
http://infolab.stanford.edu/~olston/publications/usenix08.pdf
• More: http://wiki.apache.org/pig/PigTalksPapers
Performance
average over 12 benchmark queries
[Bar chart, axis labels lost in extraction: two series of per-release values, 7.6 / 2.5 / 1.8 / 1.6 / 1.5 / 1.4 / 1.22 / 1 and 11.20 / 3.26 / 2.20 / 1.97 / 1.83 / 1.68 / 1.53 / 1.04, both declining toward 1 across successive releases]
http://hadoop.apache.org/pig
pig-user@apache.org
@squarecog