Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Highly scalable
Has bindings for non-Java programming languages
Applicable to many computational problems
Task process
Runs an individual map or reduce fragment for a given job
Forks from the TaskTracker
Application Overview
Launching Program
Creates a JobConf to define a job.
Mapper
Is given a stream of (key1, value1) pairs
Generates a stream of (key2, value2) pairs
Reducer
Is given a key2 and a stream of value2s
Generates a stream of (key3, value3) pairs
JobConf.setOutputFormat()
Many, many more (Facade pattern)
An onslaught of terminology
We'll explain these terms, each of which plays a role in any non-trivial map/reduce job:
InputFormat, OutputFormat, FileInputFormat, ...
JobClient and JobConf
JobTracker and TaskTracker
TaskRunner, MapTaskRunner, MapRunner, ...
InputSplit, RecordReader, LineRecordReader, ...
Writable, WritableComparable, IntWritable, ...
InputFormat
Splits the input to determine the input to each map task
Defines a RecordReader that reads (key, value) pairs that are passed to the map task
OutputFormat
Given the (key, value) pairs and a filename, writes the results to the output file
Example
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    // Input/output paths and job submission, as in the standard WordCount example
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
JobClient:
Determines proper division of input into InputSplits
Sends job data to master JobTracker server
Mapper
Override function map():
void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
Emit (k2, v2) with output.collect(k2, v2)
Example
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Tokenize the line and emit (word, 1) for each token,
        // as in the standard WordCount example
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
What is Writable?
Hadoop defines its own box classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
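As a quick illustration (a minimal sketch, not from the slides), the box classes wrap plain Java values and know how to serialize themselves:

Text word = new Text("hello");            // boxes a String
IntWritable count = new IntWritable(42);  // boxes an int
int n = count.get();            // unbox back to a Java int
String s = word.toString();     // unbox back to a Java String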
Reading data
Data sets are specified by InputFormats
Defines input data (e.g., a directory)
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records from the input source
Record Readers
Without a RecordReader, Hadoop would be forced to divide input on byte boundaries
Each InputFormat provides its own RecordReader implementation
Provides capability multiplexing
RecordReaders receive file, offset, and length of chunk
Custom InputFormat implementations may override split size, e.g., NeverChunkFile
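A minimal sketch of how a NeverChunkFile-style InputFormat might suppress splitting (the class body is an assumption; only the isSplitable() hook comes from the Hadoop API):

public static class NeverChunkFile extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;  // never split: each file goes whole to a single map task
    }
}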
WritableComparator
Compares WritableComparable data
Will call WritableComparable.compare()
Can provide fast path for serialized data
Explicitly stated in JobConf setup:
JobConf.setOutputValueGroupingComparator()
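As a hypothetical sketch (the grouping rule and class name are invented for illustration), a custom comparator and its registration might look like:

public static class FirstWordGroupingComparator extends WritableComparator {
    protected FirstWordGroupingComparator() {
        super(Text.class, true);  // true: deserialize keys before compare()
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group keys that share the same first word
        return a.toString().split(" ")[0].compareTo(b.toString().split(" ")[0]);
    }
}

conf.setOutputValueGroupingComparator(FirstWordGroupingComparator.class);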
Partitioner
int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
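A minimal sketch of a hash-style Partitioner (this mirrors what Hadoop's default HashPartitioner does; the class name is ours):

public static class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf conf) { }  // no setup needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit, then spread keys evenly across partitions
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}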
Reducer
reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key: earlier keys are reduced and output before later keys
Example
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Sum the counts for this word and emit the total,
        // as in the standard WordCount example
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
OutputFormat
Analogous to InputFormat
TextOutputFormat: writes "key val\n" strings to the output file
SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
NullOutputFormat: discards output
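Selecting one is a single call on the JobConf (a minimal sketch using the SequenceFileOutputFormat named above):

conf.setOutputFormat(SequenceFileOutputFormat.class);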
Requirements
Input:
a beginning word/phrase
n-gram size (bigram, trigram, ...)
the minimum number of occurrences (frequency)
whether letter case matters
Follow along
The N-grams implementation exists and is ready for your perusal.
Grab it:
if you use Git revision control:
git clone git://git.qnan.org/pmw/hadoop-ngram
Follow along
Start Hadoop
bin/start-all.sh
A new RecordReader
Ours must implement RecordReader<K, V>
It must provide certain functions: createKey(), createValue(), getPos(), getProgress(), next()
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    Text linevalue = new Text();
    boolean appended, gotsomething;
    boolean retval;
    byte space[] = {' '};

    value.clear();
    gotsomething = false;
    do {
        appended = false;
        // Delegate to the wrapped LineRecordReader for the next line
        retval = lrr.next(key, linevalue);
        if (retval) {
            if (linevalue.toString().length() > 0) {
                // Non-empty line: append it (plus a space) to the paragraph
                byte[] rawline = linevalue.getBytes();
                int rawlinelen = linevalue.getLength();
                value.append(rawline, 0, rawlinelen);
                value.append(space, 0, 1);
                appended = true;
            }
            gotsomething = true;
        }
        // An empty line (or EOF) ends the paragraph
    } while (appended);
    return gotsomething;
}
A new InputFormat
Given to the JobTracker during execution
getRecordReader() method
This is why we need an InputFormat
Must return our ParagraphRecordReader
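A minimal sketch of what that might look like (the constructor arguments for ParagraphRecordReader are assumptions; only the getRecordReader() signature comes from the Hadoop API):

public static class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> {
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        // Hand each split to our paragraph-oriented reader
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}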
First stage: Find Mapper
Define the start word/phrase at startup
Each time map() is called, we parse an entire paragraph and output matching N-grams
Tell the Reporter how far along we are, to track progress
Importance of output.collect()
Remember Hadoop's data type model:
map: (K1, V1) → list(K2, V2)
This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all. Example:
output.collect(ngram, new IntWritable(1));
output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));
Find Mapper
Our mapper must have a configure() method
public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}
Find Reducer
Like the WordCount example
Sum all the numbers matching our N-Gram
Second stage: Prune Mapper
Parse line from previous output and divide into Key/Value pairs
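One plausible shape for that mapper (the line format and method body are assumptions based on the JobConf settings shown later; the real code is in the repository):

public static class PruneJob_MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Lines from the find stage look like "<ngram>\t<count>"
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            output.collect(new Text(parts[0]),
                           new IntWritable(Integer.parseInt(parts[1])));
        }
    }
}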
PruneReducer
This way we can sort our elements by frequency
If this N-Gram occurs fewer times than our minimum, trim it out
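A hedged sketch of the pruning logic (the class name matches the JobConf below; the body is our assumption):

public static class PruneJob_ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    private int minFreq;

    public void configure(JobConf conf) {
        // Threshold set via ngram_prune_conf.setInt("reducer.min-freq", ...)
        minFreq = conf.getInt("reducer.min-freq", 1);
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        if (sum >= minFreq) {
            output.collect(key, new IntWritable(sum));  // keep only frequent N-grams
        }
    }
}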
Counters
The N-Gram generator has one programmer-defined counter: the number of partial/incomplete N-grams. These occur when a paragraph ends before we can read N-1 subsequent words.
We can add as many counters as we want.
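Counters are typically declared as an enum and bumped through the Reporter (the enum and field names here are our assumptions):

public enum NGramCounter { PARTIAL_NGRAMS }

// Inside map(), when a paragraph ends before N-1 further words exist:
reporter.incrCounter(NGramCounter.PARTIAL_NGRAMS, 1);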
JobConf
We need to set everything up
2 Jobs executing in series: Find and Prune
JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
Find JobConf
Now we can plug everything in:
ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);
Prune JobConf
Perform set up as before
ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);
Execute Jobs
Run as blocking process with runJob
Batch processing is done in series
JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
(There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)
Mapper and Reducer classes
Do the Key (WritableComparable) and Value (Writable) classes exist?