Copyright 2012 Cloudwick Technologies

Counters are a useful channel for gathering statistics about a job: for quality control or for application-level statistics.
They are also useful for problem diagnosis.
Built-in Counters:


User Defined Java Counters:


MapReduce allows user code to define a set of counters, which are then incremented as desired in
the mapper or reducer. Counters are defined by a Java enum, which serves to group related counters.
The name of the enum is the group name, and the enum's fields are the counter names.
Counters are global: the MapReduce framework aggregates them across all maps and reduces to
produce a grand total at the end of the job.
public class MaxTemperatureWithCounters extends Configured implements Tool {
  enum Temperature {
    MISSING,
    MALFORMED
  }
  // ... run() and main() omitted
}
Counters can be set and incremented via the Reporter method
incrCounter(group, name, amount);
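A minimal sketch of a mapper (assumed to be nested inside MaxTemperatureWithCounters so the enum is in scope) that increments these counters using the old mapred API; the record column offsets and the "+9999" missing-value sentinel are assumptions for illustration:

// Requires imports of java.io.IOException, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
static class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);          // assumed year columns
    String temp = line.substring(87, 92);          // assumed temperature columns
    if (temp.equals("+9999")) {                    // assumed sentinel for a missing reading
      reporter.incrCounter(Temperature.MISSING, 1);
    } else {
      try {
        output.collect(new Text(year),
            new IntWritable(Integer.parseInt(temp.replace("+", ""))));
      } catch (NumberFormatException e) {          // record that could not be parsed
        reporter.incrCounter(Temperature.MALFORMED, 1);
      }
    }
  }
}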
Sample Output:
09/04/20 12:33:36 INFO mapred.JobClient: Air Temperature Records
09/04/20 12:33:36 INFO mapred.JobClient: Malformed=3
09/04/20 12:33:36 INFO mapred.JobClient: Missing=66136856


Often, Mappers produce large amounts of intermediate data
That data must be passed to the Reducers
This can result in a lot of network traffic
It is often possible to specify a Combiner
Like a mini-Reducer
Runs locally on a single Mapper's output
Output from the Combiner is sent to the Reducers
Combiner and Reducer code are often identical
Technically, this is possible only if the operation performed is commutative and associative
In this case, the input and output data types for the Combiner/Reducer must be identical (see the sketch below)
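A minimal sketch of wiring the same class in as both Combiner and Reducer with the old mapred API; WordCountMapper and WordCountReducer are assumed to exist elsewhere (a word-count style job), so only the driver wiring matters here:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount-with-combiner");
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class); // mini-reduce run on each mapper's output
    conf.setReducerClass(WordCountReducer.class);  // same class: summing is commutative and associative
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}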


The Partitioner divides up the keyspace


Controls which Reducer each intermediate key and its associated values go to
Often, the default behavior is fine
The default is the HashPartitioner:
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Implement a custom Partitioner to send particular keys to a particular reducer.

Demo on Custom Partitioner:
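As an illustrative sketch only (the routing rule, by first letter of the key, is made up and is not the class demo), a custom Partitioner with the old mapred API looks like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route keys starting with A-M to reducer 0 and the rest to reducer 1 (when there are 2 reducers).
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0 || key.getLength() == 0) {
      return 0;                                    // map-only job or empty key
    }
    char first = Character.toUpperCase(key.toString().charAt(0));
    int bucket = (first <= 'M') ? 0 : 1;
    return bucket % numReduceTasks;                // stay in range for any reducer count
  }
}
// Driver wiring: conf.setPartitionerClass(FirstLetterPartitioner.class);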


Although sorting is done in the shuffle and sort phase, there are different ways to achieve and control it.
Partial Sort:
By default, MapReduce sorts records by key, so each reducer's output is sorted. If there are 30 reducers, 30 sorted files are generated. These files cannot simply be concatenated to produce a globally sorted file.
Total Sort:
Use only one reducer. But this is very inefficient for large files.
Alternatively, use a Partitioner that respects the total order of the output. For example, with four partitions we could put keys for temperatures less than -10°C in the first partition, those between -10°C and 0°C in the second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.
Secondary Sort:
For any particular key, the values are not sorted.
Use a composite key of key and value, and use a Partitioner that partitions on the key part of the composite key (see the sketch below).
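A minimal sketch of the partition-by-natural-key idea, using a composite Text key of the form "naturalKey#value" (the separator and key layout are assumptions; a complete secondary sort also needs sort and grouping comparators):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partition only on the natural-key part so all values for one key reach the same reducer.
public class NaturalKeyPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  public int getPartition(Text compositeKey, IntWritable value, int numReduceTasks) {
    String naturalKey = compositeKey.toString().split("#", 2)[0];
    return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}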


MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is
fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level
framework such as Pig, Hive, or Cascading, in which join operations are a core part of the
implementation.
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the
reducer it is called a reduce-side join.
A map-side join between large inputs works by performing the join before the data reaches the map
function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular
way. Each input dataset must be divided into the same number of partitions, and it must be sorted by
the same key (the join key) in each source. All the records for a particular key must reside in the same
partition.
Use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to run a map-side
join.
A reduce-side join is less efficient because both datasets have to go through the MapReduce shuffle. The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that records with the same key are brought together in the reducer (see the sketch below).
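A minimal sketch of the tagging step for a reduce-side join with the old mapred API; the comma-separated record layout, the "CUST#" tag, and the class names are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper for the "customers" side of the join: emit the join key (assumed to be the first
// comma-separated field) and tag the value with its source. A similar mapper tags "orders".
public class CustomerJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split(",", 2);
    if (fields.length < 2) {
      return;                                      // skip records without a payload
    }
    output.collect(new Text(fields[0]), new Text("CUST#" + fields[1]));
  }
}
// The reducer receives all tagged records for one join key together and combines them.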


Side data can be defined as extra read-only data needed by a job (map or reduce tasks) to
process the main dataset. The challenge is to make side data available to all the map or
reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
Example of side data:
Lookup tables
Dictionaries
Standard configuration values
It is possible to cache side data in memory in a static field, so that tasks of the same job that
run in succession on the same tasktracker can share the data.
You can also set arbitrary key-value pairs in the job configuration using the various setter methods
on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass
a small piece of metadata to your tasks, but it does not scale to large amounts of side data (see the sketch below).
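A minimal sketch of passing a small value through the job configuration and reading it back in a task with the old mapred API; the property name myjob.length.threshold is made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Driver side: conf.set("myjob.length.threshold", "42");
public class ThresholdMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private int threshold;

  public void configure(JobConf job) {
    threshold = job.getInt("myjob.length.threshold", 0);   // read the side value back
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    if (value.getLength() > threshold) {
      output.collect(new Text("long-line"), new IntWritable(1));
    }
  }
}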


Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop's distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when they
run. To save network bandwidth, files are normally copied to any particular node once
per job.
Transfer happens behind the scenes before any task is executed
Note: DistributedCache is read-only
Files in the DistributedCache are automatically deleted from slave nodes when the job
finishes
Implementation:
Place the files into HDFS
Configure the DistributedCache in your driver code
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/tmp/lookup.txt"), job);
DistributedCache.addFileToClassPath(new Path("/tmp/abc.jar"), job);
DistributedCache.addCacheArchive(new URI("/tmp/xyz.zip"), job);
or
$ hadoop jar myjar.jar MyDriver -files file1,file2,file3,...
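A minimal sketch of reading a cached file inside a task with the old mapred API; the lookup-file format (one tab-separated key-value pair per line) is an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Load the cached lookup file into memory once per task, then use it to enrich records.
public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);   // local copies on this node
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);                     // assumed tab-separated format
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("Failed to load cached lookup file", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String extra = lookup.get(value.toString());                  // enrich via the lookup table
    if (extra != null) {
      output.collect(value, new Text(extra));
    }
  }
}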



Retrieve a FileSystem instance in order to use the API:


Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
A file in HDFS is represented by a Path:
Path p = new Path("/path/to/my/file");
Some useful API methods:
FSDataOutputStream create(...)
Provides methods for writing primitives, raw bytes, etc.
FSDataInputStream open(...)
Provides methods for reading primitives, raw bytes, etc.
boolean delete(...)
boolean mkdirs(...)
void copyFromLocalFile(...)
void copyToLocalFile(...)
FileStatus[] listStatus(...)
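A minimal sketch of writing and reading a small file with this API (the path and the text written are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/fs-example.txt");    // illustrative path

    FSDataOutputStream out = fs.create(p, true); // overwrite if the file already exists
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());            // prints: hello hdfs
    in.close();

    fs.delete(p, false);                         // non-recursive delete
  }
}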

