
Apache Crunch

Rahul Sharma Apache

Agenda :

Issues with MapReduce pipelines


Solving with Apache Crunch

Data Model & Operations


System Workflow

Examples

Questions & Answers

Issues with MapReduce Pipelines

Unit testing a pipeline? You must be joking!

Can someone tell me where the business logic is?

Chain performance?

Learn Latin (Pig) first!

Apache Crunch

Is a Java library
Contains collections that can execute parallel operations
Collections are evaluated lazily at runtime

Operations are merged at runtime into efficient chains


Available at http://incubator.apache.org/crunch/
Based on the Google FlumeJava paper
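The lazy evaluation and operation merging listed on this slide can be illustrated with a small plain-Java sketch. It is not the Crunch API: `LazyChainSketch`, `map`, `materialize` and the `passes` counter are hypothetical names invented for this illustration. Each `map` call only records a function, and the recorded functions are fused so that the data is traversed once when a result is finally requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a lazily evaluated collection; NOT a Crunch class.
final class LazyChainSketch<T> {
    // Counts full passes over the source data, to make the laziness visible.
    static int passes = 0;

    private final List<?> source;
    private final Function<Object, T> fused; // all recorded operations, composed

    private LazyChainSketch(List<?> source, Function<Object, T> fused) {
        this.source = source;
        this.fused = fused;
    }

    static <S> LazyChainSketch<S> of(List<S> source) {
        @SuppressWarnings("unchecked")
        Function<Object, S> identity = x -> (S) x;
        return new LazyChainSketch<>(source, identity);
    }

    // Records the operation by composing it with the earlier ones; no data is read here.
    <R> LazyChainSketch<R> map(Function<? super T, ? extends R> fn) {
        Function<Object, R> next = x -> fn.apply(fused.apply(x));
        return new LazyChainSketch<>(source, next);
    }

    // Runs the fused chain in a single pass over the source.
    List<T> materialize() {
        passes++;
        List<T> out = new ArrayList<>();
        for (Object item : source) {
            out.add(fused.apply(item));
        }
        return out;
    }

    public static void main(String[] args) {
        LazyChainSketch<String> chain = LazyChainSketch.of(Arrays.asList(1, 2, 3))
                .map(n -> n * 10)
                .map(n -> "v" + n);
        System.out.println("passes before materialize: " + passes);
        System.out.println(chain.materialize());
        System.out.println("passes after materialize: " + passes);
    }
}
```

Two `map` calls cost one pass over the data here. Crunch applies the same idea at a larger scale when it plans MapReduce jobs, though its planner is far more involved than this sketch.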

Apache Crunch

Supports Hadoop versions 1 and 2-alpha
Supports HBase, JDBC, etc.
Works with Writables, Avro, Thrift and Protocol Buffers

A Scala variant also exists

Integration with R and Clojure is in progress
A Maven archetype exists for creating a sample project

Apache Crunch : Data Model

Pipeline
MRPipeline
MemPipeline

PCollection<T>
PTable<K,V>
PGroupedTable<K,V>

Source<T>
Target<T>

Emitter<T>
PType<T>

Apache Crunch : Operations

DoFn<S,T>
CombineFn<S,T>
FilterFn<T>
Joins
Cartesian

Sort
SecondarySort
PObject<T>
BloomFilters
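To show how DoFn-style and FilterFn-style operations relate, here is a minimal plain-Java sketch. The interfaces `SketchEmitter`, `SketchDoFn` and `SketchFilterFn` are simplified, hypothetical stand-ins modeled on the general shape of Crunch's `Emitter<T>`, `DoFn<S,T>` and `FilterFn<T>`. They are not the library classes, and the helper names `parallelDo`, `filter` and `tokenize` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DoFnSketch {
    // Hypothetical stand-ins; the real Crunch types carry more behavior.
    interface SketchEmitter<T> { void emit(T value); }
    interface SketchDoFn<S, T> { void process(S input, SketchEmitter<T> emitter); }
    interface SketchFilterFn<T> { boolean accept(T input); }

    // Applies fn to every element; fn may emit zero, one, or many outputs per input.
    static <S, T> List<T> parallelDo(List<S> input, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S item : input) {
            fn.process(item, out::add);
        }
        return out;
    }

    // A filter is a special case of a DoFn: emit the input only if it is accepted.
    static <T> List<T> filter(List<T> input, SketchFilterFn<T> fn) {
        return DoFnSketch.<T, T>parallelDo(input, (item, emitter) -> {
            if (fn.accept(item)) {
                emitter.emit(item);
            }
        });
    }

    // A one-to-many DoFn: split each line into words.
    static List<String> tokenize(List<String> lines) {
        SketchDoFn<String, String> splitWords = (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
        return parallelDo(lines, splitWords);
    }

    public static void main(String[] args) {
        List<String> words = tokenize(Arrays.asList("Apache Crunch", "crunch is a Java library"));
        System.out.println(words);
        System.out.println(filter(words, w -> w.length() > 2));
    }
}
```

The emitter is what lets one input produce any number of outputs, which is why a filter can be written as a DoFn that emits either once or not at all.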

Apache Crunch : System Workflow


Construct a pipeline

Pipeline.done()

[Diagram: Map, Map, Map; GBK + Reduce; GBK + Reduce; Output]
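One Map, GBK (group by key), Reduce stage from the workflow above can be sketched in plain Java as a word count. This is an in-memory illustration only and does not use Hadoop or Crunch; the class and method names (`MapGbkReduceSketch`, `mapPhase`, `groupByKey`, `reducePhase`, `run`) are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapGbkReduceSketch {
    // Map: turn each line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // GBK: collect all values that share a key (sorted by key for stable output).
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: fold each key's values into a single count.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> run(List<String> lines) {
        return reducePhase(groupByKey(mapPhase(lines)));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real pipeline the three phases run as distributed tasks, and consecutive map-side operations can be merged into one Map stage before the group-by-key boundary.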

Apache Crunch : Examples

WordCount example
Avro example
Sorting example
SecondarySort
Join example
BloomFilters

Write to me: rsharma@apache.org
Example source: http://github.com/rahul0208
Blog: devlearnings.wordpress.com
