Dmitriy Ryaboy
Ashutosh Chauhan
In This Talk
• What is Pig and why it’s needed
– This part is going to be brief.
– It’s a Hadoop User Group, after all.
• Examples
– As well as some extra motivation
• Advanced Features
• Improvements currently in development
• Interesting research problems
– Want to get involved?
What is Pig
or, “duality of Pig”
Brevity.
Corresponding Pig Script
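The script on this slide was an image and did not survive extraction. As an illustrative stand-in (file paths, field names, and the task itself are assumptions, not the slide's actual script), a short Pig script of the kind contrasted with its multi-page Java Map-Reduce equivalent:

```pig
-- Assumed example: top 10 most-visited URLs
visits = LOAD '/data/visits' AS (user:chararray, url:chararray, time:long);
by_url = GROUP visits BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS n;
ranked = ORDER counts BY n DESC;
top10  = LIMIT ranked 10;
STORE top10 INTO '/out/top_sites';
```

Six lines of Pig Latin versus a few hundred lines of hand-written Map-Reduce Java is the "brevity" point the slide makes.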
Why Not SQL
• Pig Latin expresses transformations as a sequence of steps; SQL describes the desired outcome.
• Writing down a sequence of steps is intuitive to
developers.
– Allows expressing more complex data flows
– But still no loops or conditionals.
• Support for nested structures (arrays, maps)
• Much easier to work with groups
– Let me show you…
Top 5 scores for each player
• Data: < playerId, score, date >
• For each player:
– Best 5 scores
– Date each of these scores was achieved
• Self-join
• Data explosion for same values of tblAbc.aa
• Readability?
• I like SQL. But this is far from straightforward.
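For contrast, the same top-5-per-player task in Pig Latin is a nested FOREACH over the grouped relation (input path and types are assumptions based on the slide's schema):

```pig
scores    = LOAD 'scores' AS (playerId:chararray, score:long, date:chararray);
by_player = GROUP scores BY playerId;
top5      = FOREACH by_player {
              -- order each player's bag of scores, keep the best 5
              sorted = ORDER scores BY score DESC;
              best   = LIMIT sorted 5;
              GENERATE FLATTEN(best);
            };
STORE top5 INTO 'top5_per_player';
```

No self-join and no data explosion: each player's scores are sorted and truncated within that player's group.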
SQL
User-Defined Functions
• Java Interfaces
– Reading / Writing Data
• Allows reading from DBs, HBase, custom file formats
• Significant API overhaul underway.
– Transformation / Evaluation of Tuples
– Group Aggregation
• See Piggybank for examples
http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/
• Support for other languages in the works
https://issues.apache.org/jira/browse/PIG-928
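Using a Piggybank UDF from a script is a REGISTER plus a fully qualified (or DEFINEd) class name; a sketch, assuming the piggybank jar is on the local path and using its string Reverse UDF:

```pig
REGISTER piggybank.jar;
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

users    = LOAD 'users' AS (name:chararray);
reversed = FOREACH users GENERATE Reverse(name);
```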
Streaming
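Streaming pipes each tuple through an external program via the STREAM operator; a minimal sketch (file names and the grep command are assumptions):

```pig
raw      = LOAD 'logs' AS (line:chararray);
-- each line is written to the command's stdin; its stdout becomes tuples
filtered = STREAM raw THROUGH `grep -v bot` AS (line:chararray);
STORE filtered INTO 'clean_logs';
```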
Multiquery Optimization
• IO cost of scanning the data dominates most jobs
• Share data scans
• This example requires separate pipelines for state and demographics
• Or does it?
• A multiplexer in the reduce stage sends records to the appropriate pipeline
• Use a single script to compute many things from shared sources
[Diagram: load users → filter out bots → group by state / group by demographic → apply UDFs → store into ‘bystate’ / ‘bydemo’; in the reduce stage: package → multiplex → package, then foreach / foreach]
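The flow on this slide can be written as one script with two STOREs, letting the multiquery optimizer share the load and filter across both branches (field names and the bot filter are assumptions):

```pig
users    = LOAD 'users' AS (name:chararray, state:chararray, demographic:chararray);
real     = FILTER users BY name IS NOT NULL;  -- stand-in for the "filter out bots" step
by_state = GROUP real BY state;
by_demo  = GROUP real BY demographic;
st       = FOREACH by_state GENERATE group, COUNT(real);
dm       = FOREACH by_demo  GENERATE group, COUNT(real);
-- two stores in one script: the scan of 'users' happens once
STORE st INTO 'bystate';
STORE dm INTO 'bydemo';
```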
• Default algorithm
• Mapper emits ([key, relid], tuple)
• Reducer sees all tuples from each relation with the same key, performs join
– easy, due to sorted order
Fragment-Replicate Join
• Join large table A with 1 or more small tables (B, C)
• Fragment large table A across mappers
• Replicate smaller tables in their entirety to all mappers
[Diagram: each mapper reads one fragment of A plus complete copies of B and C]
Merge Join
[Diagram: pre-sorted inputs joined directly in the mapper]
• Bounded buffering on both sides
– least memory intensive
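Both specialized joins are requested with the USING clause on JOIN; a sketch with assumed aliases:

```pig
-- Fragment-replicate join: 'small' is loaded in its entirety on every mapper
j1 = JOIN big BY key, small BY key USING 'replicated';

-- Merge join: both inputs must already be sorted on the join key
j2 = JOIN left_sorted BY key, right_sorted BY key USING 'merge';
```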
• Adaptive optimization
– Change plan mid-flight based on observation of current
environment
– Can we do this without waiting for MR stage to finish?
• Continuous Queries, Approximate Early Results
– Initial work at Berkeley: Map-Reduce Online *
• Cost-Based Optimization
– Appropriate cost model?
– balance of disk IO, network traffic, task overhead, memory
limitations on individual tasks…
– Different from, but similar to, distributed DBs
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html
Further Reading
• Gates et al., “Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience,” VLDB 2009
http://infolab.stanford.edu/~olston/publications/vldb09.pdf
• Olston et al., “Generating example data for dataflow programs,” SIGMOD 2009 (best paper)
http://infolab.stanford.edu/~olston/publications/sigmod09.pdf
• Olston et al., “Pig Latin: A not-so-foreign language for data processing,” SIGMOD 2008
http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
• Olston et al., “Automatic optimization of parallel dataflow programs,” USENIX 2008
http://infolab.stanford.edu/~olston/publications/usenix08.pdf
• More: http://wiki.apache.org/pig/PigTalksPapers
Performance
average over 12 benchmark queries
[Bar chart, axis labels lost in extraction: two series of per-release values, 7.6 / 2.5 / 1.8 / 1.6 / 1.5 / 1.4 / 1.22 / 1 and 11.20 / 3.26 / 2.20 / 1.97 / 1.83 / 1.68 / 1.53 / 1.04, both declining toward 1 across successive releases]
http://hadoop.apache.org/pig
pig-user@apache.org
@squarecog