Agenda:
Chain performance??
Apache Crunch
Is a Java library
Contains collections that can execute parallel operations
Lazy evaluation of collections at runtime
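Lazy evaluation can be illustrated without Crunch itself. The sketch below uses `java.util.stream` from the JDK as an analogy (an assumption made for illustration, not Crunch's API): declaring a transformation does no work, and processing starts only when a terminal operation asks for results, much as a Crunch pipeline is planned and executed only when it is run. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: a JDK stream stands in for a lazily evaluated collection.
class LazyEvaluationSketch {

    // Returns {calls made before the terminal operation, calls made after it}.
    static int[] countCalls(List<String> input) {
        AtomicInteger calls = new AtomicInteger();

        // Declaring the transformation does not run it.
        Stream<String> upper = input.stream().map(s -> {
            calls.incrementAndGet();
            return s.toUpperCase();
        });
        int before = calls.get();

        // The terminal operation triggers the actual work.
        List<String> result = upper.collect(Collectors.toList());
        int after = calls.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] c = countCalls(Arrays.asList("a", "b", "c"));
        System.out.println("calls before: " + c[0] + ", after: " + c[1]);
    }
}
```

Deferring the work this way is what gives a planner the chance to look at the whole chain of operations before running any of them.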
Apache Crunch
Supports Hadoop versions 1 and 2-alpha
Supports HBase, JDBC, etc.
Works with Writables, Avro, Thrift, and Protocol Buffers
Pipeline
MRPipeline
MemPipeline
PCollection<T>
PTable<K,V>
PGroupedTable<K,V>
Source<T>
Target<T>
Emitter<T>
PType<T>
DoFn<S,T>
CombineFn<S,T>
FilterFn<T>
Joins
Cartesian
Sort
SecondarySort
PObject<T>
BloomFilters
Pipeline.done()
[Diagram: execution plan with stages Map, Map, Map, GBK + Reduce, GBK + Reduce, Output]
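The Map, GBK (group by key), and Reduce stages named here can be sketched in plain Java. This is a minimal stand-in for the stages of a single MapReduce job, assuming a word count as the workload; it does not use the Crunch API, and the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the stages of one MapReduce job; not Crunch code.
class MapGbkReduceSketch {

    // Map stage: split each line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Long>> mapStage(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1L));
                }
            }
        }
        return pairs;
    }

    // GBK stage: collect all values that share a key.
    static Map<String, List<Long>> groupByKey(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the grouped values for each key.
    static Map<String, Long> reduceStage(Map<String, List<Long>> grouped) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Chains the three stages in the order Map, GBK, Reduce.
    static Map<String, Long> wordCount(List<String> lines) {
        return reduceStage(groupByKey(mapStage(lines)));
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

In a distributed run the grouping step is where records move between machines, so each GBK would typically mark the boundary between a map phase and a reduce phase.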
WordCount example
Avro example
Sorting example
SecondarySort
Join example
BloomFilters