Prashant Sharma
Table of Contents
1. INTRODUCTION
   1.1 What is distributed computing?
   1.2 What is Hadoop? (Name of a toy elephant, actually)
   1.3 How does Hadoop eliminate complexities?
   1.4 What is map-reduce?
   1.5 What is HDFS?
   1.6 What is Namenode?
   1.7 What is a datanode?
   1.8 What is a jobtracker and tasktracker?
2. HOW MAP-REDUCE WORKS?
   2.1 Introduction
   2.2 Map-reduce is the answer
   2.3 An example program which puts inverted index in action using Hadoop 0.20.203 API
   2.4 How Hadoop runs Map-reduce?
       2.4.1 Submit Job
       2.4.2 Job Initialization
       2.4.3 Task Assignment
       2.4.4 Task Execution
3. HADOOP STREAMING
   3.1 A simple example run
   3.2 How it works?
   3.3 Features
4. HADOOP DISTRIBUTED FILE SYSTEM
   4.1 Introduction
   4.2 What HDFS can not do?
   4.3 Anatomy of HDFS
       4.3.1 Filesystem Metadata
       4.3.2 Anatomy of write
       4.3.3 Anatomy of a read
   4.4 Accessibility
       4.4.1 DFS shell
       4.4.2 DFS Admin
       4.4.3 Browser Interface
       4.4.4 Mountable HDFS
5. SERIALIZATION
   5.1 Introduction
   5.2 Write your own composite writable
   5.3 An example explained on serialization and having custom Writables from hadoop repos
   5.4 Why Java Object Serialization is not so efficient compared to other Serialization frameworks?
       5.4.1 Java Serialization does not meet the criteria of Serialization format
       5.4.2 Java Serialization is not compact
       5.4.3 Java Serialization is not fast
       5.4.4 Java Serialization is not extensible
       5.4.5 Java Serialization is not interoperable
       5.4.6 Serialization IDL
6. DISTRIBUTED CACHE
   6.1 Introduction
   6.2 An example usage
7. SECURING THE ELEPHANT
   7.1 Kerberos tickets
   7.2 Example of using Kerberos
   7.3 Delegation tokens
   7.4 Further securing the elephant
8. HADOOP JOB SCHEDULING
   8.1 Three schedulers
       Default scheduler
       Capacity Scheduler
       Fair Scheduler
APPENDIX 1A: AVRO SERIALIZATION
   Apache Avro
   Avro is a data serialization system
   Avro relies on JSON schemas
   Avro Serialization
   Avro Serialization is fast
   Avro Serialization is interoperable
APPENDIX 1B
   1. QMC pi estimator
   Grep example
   WordCount
1. Introduction
1.1 What is distributed computing?
Multiple autonomous systems appear as one, interacting via a message-passing interface, with no single point of failure.
2.1 Introduction.
These are the basic steps of a typical map-reduce program as described by the Google Map-reduce paper. We will understand them by taking an inverted index as the example. An inverted index is the same as the one that appears at the back of a book, where each word is listed along with the locations where it occurs. Its main usage is to build indexes for search engines.

Suppose you were to build a search engine with an inverted index as the index. The conventional way is to build the inverted index in a large map (data structure) and update the map by reading the documents one after another.

Limitation of this approach: if the number of documents is large, disk I/O becomes a bottleneck. And what if the data is in petabytes? Will it scale?
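The conventional approach described above can be sketched in plain Java. This is only an illustration under assumptions: the class name InMemoryInvertedIndex and its methods are hypothetical and are not part of Hadoop. The point is that every document has to pass through one process and one in-memory map, which is exactly what stops scaling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeSet;

// Hypothetical single-machine inverted index: the whole index lives in one in-memory map.
class InMemoryInvertedIndex {
    // word -> sorted set of document names that contain the word
    private final Map<String, TreeSet<String>> index = new HashMap<>();

    // Reads one document and updates the shared map.
    void addDocument(String docName, String contents) {
        StringTokenizer itr = new StringTokenizer(contents, "\",. \t\n");
        while (itr.hasMoreTokens()) {
            index.computeIfAbsent(itr.nextToken(), w -> new TreeSet<>()).add(docName);
        }
    }

    // Returns the documents containing the word, or an empty set if it was never seen.
    TreeSet<String> lookup(String word) {
        return index.getOrDefault(word, new TreeSet<>());
    }

    public static void main(String[] args) {
        InMemoryInvertedIndex idx = new InMemoryInvertedIndex();
        idx.addDocument("a.txt", "hadoop runs map reduce");
        idx.addDocument("b.txt", "map tasks, reduce tasks");
        System.out.println(idx.lookup("map"));
    }
}
```

Both the reading of documents and the map itself are confined to a single machine here, so disk I/O and memory grow with the corpus.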
package testhadoop;

import java.io.IOException;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Here we find the inverted index of a corpus. You can use the Wikipedia
 * corpus. Read the Hadoop quickstart guide for installation instructions.
 * Map: emits word as key and filename as value.
 * Reduce: emits word and its occurrences per filename.
 *
 * @author Prashant
 */
public class InvertedIndex {

    /**
     * Mapper is the base class which needs to be extended to write a mapper.
     * We specify the input key and value format and output key and value
     * formats in <InKey, InVal, OutKey, OutVal>. So in the mapper we chose
     * <Object, Text, Text, Text>. Remember we can only use classes that
     * implement Writable for key and value pairs (serialization issues are
     * discussed later). Emits (Word, Filename in which it occurs).
     *
     * @author Prashant
     */
    public static class WFMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Text word = new Text();
            // Tokenize each line on the basis of ",. \t\n. You should add more
            // symbols if you are using an HTML or XML corpus.
            StringTokenizer itr = new StringTokenizer(value.toString(), "\",. \t\n");
            // Here we use the context object to retrieve the name of the file
            // this map is working on.
            String file = ((FileSplit) context.getInputSplit()).getPath().toString();
            Text FN = new Text(file);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                // Emits intermediate key and value pairs.
                context.write(word, FN);
            }
        }
    }

    /**
     * Almost the same concept for the reducer as for the mapper (read the
     * WFMapper documentation). Emits (Word, "<Filename, count> <Filename, count> ...").
     * Here we store the file names in a hashtable and increment the count to
     * augment the index.
     */
    public static class WFReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Hashtable<String, Long> table = new Hashtable<String, Long>();
            for (Text val : values) {
                if (table.containsKey(val.toString())) {
                    Long temp = table.get(val.toString());
                    temp = temp.longValue() + 1;
                    table.put(val.toString(), temp);
                } else {
                    table.put(val.toString(), Long.valueOf(1));
                }
            }
            String result = "";
            Enumeration<String> e = table.keys();
            while (e.hasMoreElements()) {
                String tempkey = e.nextElement();
                Long tempvalue = table.get(tempkey);
                result = result + "< " + tempkey + ", " + tempvalue.toString() + " > ";
            }
            context.write(key, new Text(result));
        }
    }

    public static void main(String[] args) throws Exception {
        /**
         * Load the configuration into the Configuration object (from the XML
         * files that you set up while installing Hadoop).
         */
        Configuration conf = new Configuration();
        // Pass the arguments to the Hadoop utility for options parsing.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: invertedIndex <in> <out>");
            System.exit(2);
        }
        // Create the Job object from the configuration.
        Job job = new Job(conf, "Inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(WFMapper.class);
        /**
         * Why use a combiner when it is optional? A combiner reduces the
         * output at the mapper end itself, so the bandwidth load over the
         * network is reduced and the reducer has less work to do.
         * A combiner must emit the same kind of values the reducer consumes.
         * WFReducer emits formatted "< file, count >" strings rather than
         * bare file names, so reusing it as the combiner would make the
         * reducer count those strings. The call is therefore left disabled;
         * enable it only with a separate combiner whose output the reducer
         * can parse.
         */
        // job.setCombinerClass(WFReducer.class);
        job.setReducerClass(WFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The diagram above is self-explanatory and shows how map-reduce works.
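For readers without the diagram at hand, the flow can be imitated on one machine in plain Java. This is a sketch under assumptions, not Hadoop code: the class MapReduceFlowSketch and its method names are hypothetical, and the "shuffle" here is just an in-memory grouping by key, whereas Hadoop performs it across the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map -> shuffle -> reduce flow for the inverted index.
class MapReduceFlowSketch {
    // Map phase: emit one (word, fileName) pair per token of the input.
    static List<String[]> map(String fileName, String contents) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, fileName});
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values by key, with keys kept in sorted order.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce phase: for one word, count how often it occurred in each file.
    static Map<String, Long> reduce(List<String> fileNames) {
        Map<String, Long> counts = new TreeMap<>();
        for (String fileName : fileNames) {
            counts.merge(fileName, 1L, Long::sum);
        }
        return counts;
    }

    // Runs the three phases in order over a fileName -> contents map.
    static Map<String, Map<String, Long>> run(Map<String, String> files) {
        List<String[]> intermediate = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            intermediate.addAll(map(file.getKey(), file.getValue()));
        }
        Map<String, Map<String, Long>> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : shuffle(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }
}
```

In the real framework each map call can run on a different node and the grouped values arrive at the reducer as an Iterable, but the order of the three phases is the same.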
2.4.1 Submit Job
• Asks the JobTracker for a new job ID.
• Checks the output spec of the job. Checks the output directory; if it already exists, an error is thrown and the job is not submitted.
• Pass the JobConf to JobClient.runJob() or submitJob().
• runJob() blocks, submitJob() does not (the synchronous and asynchronous ways of submitting a job, respectively).
• JobClient: determines the proper division of input into InputSplits.
• Submits the job to the JobTracker.
• Computes the input splits for the job. If the splits cannot be computed (the input does not exist), an error is thrown and the job is not submitted.
• Copies the resources needed to run the job to HDFS, in a directory specified by "mapred.system.dir".
• Job jar file: copied with a high replication factor, which can be set by the "mapred.submit.replication" property.
• Tells the JobTracker that the job is ready.
2.4.2 Job Initialization
• Puts the job in an internal queue.
• The job scheduler will pick it up and initialize it:
  - Creates a job object representing the job being run
  - Encapsulates its tasks
  - Keeps bookkeeping info to track task status and progress
• Creates the list of tasks to run.
• Retrieves the number of input splits computed by the JobClient from the shared filesystem.
• Creates one map task for each split.
• The scheduler creates the reduce tasks and assigns them to TaskTrackers.
  - The number of reduce tasks is determined by mapred.reduce.tasks.
• Task IDs are given to each task.
2.4.3 Task Assignment
• TaskTrackers send heartbeats to the JobTracker via RPC.
• A TaskTracker indicates readiness for a new task.
• The JobTracker will allocate a task.
• The JobTracker communicates the task in the response to a heartbeat.
• Choosing a task for a TaskTracker:
  - The JobTracker must choose a task for the TaskTracker
  - Uses the scheduler to choose a task
  - Job scheduling algorithms: the default one is based on priority and FIFO
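The heartbeat-driven assignment with a priority-then-FIFO default can be sketched as follows. This is a simplified model under assumptions: HeartbeatAssignmentSketch, PendingTask and the integer priority are hypothetical names introduced for illustration and do not mirror the JobTracker's real classes.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical model: tasks wait in a queue ordered by priority, then by submission order (FIFO).
class HeartbeatAssignmentSketch {
    static final class PendingTask {
        final String id;
        final int priority;      // larger value means more urgent (assumption)
        final long submitOrder;  // increasing counter used for FIFO tie-breaking

        PendingTask(String id, int priority, long submitOrder) {
            this.id = id;
            this.priority = priority;
            this.submitOrder = submitOrder;
        }
    }

    private final PriorityQueue<PendingTask> queue = new PriorityQueue<>(
            Comparator.comparingInt((PendingTask t) -> t.priority).reversed()
                    .thenComparingLong(t -> t.submitOrder));
    private long nextOrder = 0;

    void submit(String id, int priority) {
        queue.add(new PendingTask(id, priority, nextOrder++));
    }

    // Models one heartbeat: the return value plays the role of the heartbeat response.
    // Returns the id of the assigned task, or null when nothing is assigned.
    String heartbeat(boolean readyForNewTask) {
        if (!readyForNewTask || queue.isEmpty()) {
            return null;
        }
        return queue.poll().id;
    }
}
```

Note that the tracker never asks for work in a separate call; the assignment simply rides back on the heartbeat response.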
2.4.4 Task Execution
• The TaskTracker has been assigned the task.
• The next step is to run the task.
• Localizes the job by copying the jar file from "mapred.system.dir" to job-specific directories, and copies any other files required.
• Creates a local working directory for the task and un-jars the contents of the jar into this directory.
• Creates an instance of TaskRunner to run the task.
• The TaskRunner launches a new JVM to run each task:
  - To keep the TaskTracker from failing if there are bugs in the MapReduce tasks
  - Only the child JVM exits in case of a problem
• TaskTracker.Child.main():
  - Sets up the child TaskInProgress attempt
  - Reads the XML configuration
  - Connects back to the necessary MapReduce components via RPC
  - Uses TaskRunner to launch the user process
3. Hadoop Streaming.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
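As a rough illustration of what such an executable does, here is the core loop of a word-emitting mapper in Java. It assumes the usual streaming convention that the mapper receives input lines on standard input and writes one key, a tab, and a value per line on standard output. The class name StreamingWordMapper and the method mapStream are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical streaming-style mapper: lines in, "word<TAB>1" records out.
class StreamingWordMapper {
    // Reads lines until end of input and emits one tab-separated record per word.
    static void mapStream(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.print(word + "\t1\n");
                    }
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To use a loop like this as an actual mapper, a main method would wrap System.in in a BufferedReader and System.out in a PrintWriter and pass them to mapStream; any language that can read and write standard streams could do the same job.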
3.3 Features.

You can specify an internal class as the mapper instead of an executable, like this:
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \

Input and output format classes can be specified like this:
    -inputformat JavaClassName
    -outputformat JavaClassName
    -partitioner JavaClassName
    -combiner JavaClassName

You can specify JobConf parameters:
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer /bin/wc \
        -jobconf mapred.reduce.tasks=2
• Streaming data access. Write-once, read-many-times architecture. Since files are large, the time to read the whole file is a more significant parameter than the seek time to the first record.
• Commodity hardware. It is designed to run on commodity hardware, which may fail. HDFS is capable of handling this.
4.3.2 Anatomy of a write.
• DFSOutputStream splits data into packets.
• Writes them into an internal queue.
• DataStreamer asks the namenode for a list of datanodes and consumes the internal data queue.
• The namenode gives a list of datanodes for the pipeline.
• Maintains an internal queue of packets waiting to be acknowledged.
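The two queues in the steps above can be modelled in a few lines of Java. This is only a sketch under assumptions: WritePipelineSketch and its method names are hypothetical, the actual network transfer to the datanodes is left out, and the packet size is passed in rather than taken from any HDFS setting.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical model of the write path: a data queue feeding an acknowledgement queue.
class WritePipelineSketch {
    final Deque<byte[]> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    final Deque<byte[]> ackQueue = new ArrayDeque<>();  // packets sent but not yet acknowledged

    // Splits the data into packets of at most packetSize bytes and queues them.
    void write(byte[] data, int packetSize) {
        for (int offset = 0; offset < data.length; offset += packetSize) {
            int end = Math.min(data.length, offset + packetSize);
            dataQueue.addLast(Arrays.copyOfRange(data, offset, end));
        }
    }

    // Streamer step: takes the next packet and parks it until it is acknowledged.
    // Sending the packet down the datanode pipeline is omitted here.
    boolean sendNext() {
        byte[] packet = dataQueue.pollFirst();
        if (packet == null) {
            return false;
        }
        ackQueue.addLast(packet);
        return true;
    }

    // Called when the pipeline has acknowledged the oldest outstanding packet.
    boolean acknowledge() {
        return ackQueue.pollFirst() != null;
    }
}
```

Keeping unacknowledged packets in their own queue is what would allow them to be resent if a datanode in the pipeline fails.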
4.3.3 Anatomy of a read.
• The namenode returns the locations of the first few blocks of the file.
• The datanode list is sorted according to proximity to the client.
• FSDataInputStream wraps DFSInputStream, which manages datanode and namenode I/O.
• Read is called repeatedly on the datanode until the end of the block is reached.
• Finds the next datanode for the next data block.
• All of this happens transparently to the client.
• Calls close after finishing reading the data.
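The block-by-block choice of a datanode can be sketched like this. The names ReadPathSketch, Replica and the integer distance are hypothetical; in particular, the distance is a stand-in for whatever proximity measure the cluster uses, which the text above does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the read path: pick the closest replica for each block in file order.
class ReadPathSketch {
    static final class Replica {
        final String dataNode;
        final int distance; // assumed proximity score: smaller means closer to the client

        Replica(String dataNode, int distance) {
            this.dataNode = dataNode;
            this.distance = distance;
        }
    }

    // Sorts the replicas of one block by proximity and returns the closest datanode.
    static String closest(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((Replica r) -> r.distance));
        return sorted.get(0).dataNode;
    }

    // Walks the blocks in file order and records which datanode each block is read from.
    static List<String> readPlan(List<List<Replica>> blockLocations) {
        List<String> plan = new ArrayList<>();
        for (List<Replica> replicas : blockLocations) {
            plan.add(closest(replicas));
        }
        return plan;
    }
}
```

The client code only sees a single stream; the switch from one datanode to the next happens inside the stream implementation.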
4.4 Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance, or HDFS can be mounted as a Unix filesystem.

4.4.1 DFS shell
HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called DFSShell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
Action                                                   Command
Create a directory named /foodir                         bin/hadoop dfs -mkdir /foodir
View the contents of a file named /foodir/myfile.txt     bin/hadoop dfs -cat /foodir/myfile.txt
DFSShell is targeted at applications that need a scripting language to interact with the stored data.

4.4.2 DFS Admin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                    Command
Put the cluster in safe mode              bin/hadoop dfsadmin -safemode enter
Generate a report on the datanodes        bin/hadoop dfsadmin -report
Decommission datanode datanodename        bin/hadoop dfsadmin -decommission datanodename
4.4.3 Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

4.4.4 Mountable HDFS
Please visit the MountableHDFS wiki page for more details.
5. Serialization.

5.1 Introduction.

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Expectations from a serialization interface:
  - Compact. To utilize bandwidth efficiently.
  - Fast. Reduced processing overhead of serializing and deserializing.
  - Extensible. Easily enhanceable protocols.
  - Interoperable. Support for different languages.
Hadoop has the Writable interface, which has all of those features except interoperability, which is implemented in Avro. The following predefined implementations of WritableComparable are available:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable. Variable size, stores as much as needed: 1-9 bytes of storage.
5. VIntWritable. Less used, as it is pretty much covered by VLongWritable.
6. BooleanWritable
7. FloatWritable
8. BytesWritable
9. NullWritable. This does not store anything and may be used when we do not want to give anything as key or value. It also has one important usage: for example, if we want to write a sequence file and do not want it to be stored as key and value pairs, we can give the key as a NullWritable object, and since it stores nothing, all values will be merged by the reduce method into one single instance.
10. MD5Hash
11. ObjectWritable
12. GenericWritable

Apart from the above there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
5.3 An example explained on serialization and having custom Writables from hadoop repos. (See comments)
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package testhadoop;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
// The imports below are needed by the code but were missing from the listing.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads the text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort
 * <i>in-dir</i> <i>out-dir</i>
 */
/*
 * This example demonstrates the use of a composite key. Since there is no
 * default implementation of a composite key, we had to override methods
 * from WritableComparable.
 *
 * MapClass simply reads the line from input and emits the pair
 * IntPair(left, right) as key and right as value. The reducer class emits
 * each value together with the first integer of its key.
 */
public class SecondarySort2 {

    /**
     * Define a pair of integers that are writable.
     * They are serialized in a byte comparable format.
     */
    public static class IntPair implements WritableComparable<IntPair> {
        private int first = 0;
        private int second = 0;

        /**
         * Set the left and right values.
         */
        public void set(int left, int right) {
            first = left;
            second = right;
        }

        public int getFirst() {
            return first;
        }

        public int getSecond() {
            return second;
        }

        /**
         * Read the two integers.
         * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE -> -1
         */
        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt() + Integer.MIN_VALUE;
            second = in.readInt() + Integer.MIN_VALUE;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first - Integer.MIN_VALUE);
            out.writeInt(second - Integer.MIN_VALUE);
        }

        @Override
        public int hashCode() {
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right) {
            if (right instanceof IntPair) {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            } else {
                return false;
            }
        }

        /** A Comparator that compares serialized IntPair. */
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(IntPair.class);
            }

            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                return compareBytes(b1, s1, l1, b2, s2, l2);
            }
        }

        static {
            // register this comparator
            WritableComparator.define(IntPair.class, new Comparator());
        }

        /** Compare on the basis of first, then second. */
        @Override
        public int compareTo(IntPair o) {
            if (first != o.first) {
                return first < o.first ? -1 : 1;
            } else if (second != o.second) {
                return second < o.second ? -1 : 1;
            } else {
                return 0;
            }
        }
    }

    /**
     * Partition based on the first part of the pair. We need to override the
     * partitioner because we cannot use the default HashPartitioner: we have
     * our own implementation of the key and its hash function.
     */
    /* Partition function: (first * 127) MOD (number of partitions). */
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions) {
            return Math.abs(key.getFirst() * 127) % numPartitions;
        }
    }

    /**
     * Read two integers from each line and generate a key, value pair
     * as ((left, right), right).
     */
    /* MapClass simply reads the line from input and emits the pair
       IntPair(left, right) as key and right as value. */
    public static class MapClass extends Mapper<LongWritable, Text, IntPair, IntWritable> {
        private final IntPair key = new IntPair();
        private final IntWritable value = new IntWritable();

        @Override
        public void map(LongWritable inKey, Text inValue, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(inValue.toString());
            int left = 0;
            int right = 0;
            if (itr.hasMoreTokens()) {
                left = Integer.parseInt(itr.nextToken());
                if (itr.hasMoreTokens()) {
                    right = Integer.parseInt(itr.nextToken());
                }
                key.set(left, right);
                value.set(right);
                context.write(key, value);
            }
        }
    }

    /**
     * A reducer class that emits each input value keyed by the first
     * integer of the pair.
     */
    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
        private final Text first = new Text();

        @Override
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            first.set(Integer.toString(key.getFirst()));
            for (IntWritable value : values) {
                context.write(first, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: secondarysort2 <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "secondary sort");
        job.setJarByClass(SecondarySort2.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        // partition by the first int in the pair
        job.setPartitionerClass(FirstPartitioner.class);
        // the map output is IntPair, IntWritable
        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        // the reduce output is Text, IntWritable
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
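The "byte comparable" encoding used by IntPair (subtracting Integer.MIN_VALUE before writing) is worth checking in isolation. The sketch below uses only the JDK; the class ByteComparableSketch and its methods are hypothetical, and ByteBuffer is used as a stand-in because its default big-endian putInt produces the same four bytes as DataOutput.writeInt. The comparison treats bytes as unsigned, which is the assumption that makes the trick work.

```java
import java.nio.ByteBuffer;

// Hypothetical check of the byte-comparable encoding: shifted ints compare correctly byte by byte.
class ByteComparableSketch {
    // Encodes the pair the same way IntPair.write does: each int shifted by MIN_VALUE, big-endian.
    static byte[] encode(int first, int second) {
        return ByteBuffer.allocate(8)
                .putInt(first - Integer.MIN_VALUE)
                .putInt(second - Integer.MIN_VALUE)
                .array();
    }

    // Lexicographic comparison of the raw bytes, treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // A negative first value must sort before a positive one even on raw bytes.
        System.out.println(compareBytes(encode(-5, 0), encode(3, 0)) < 0);
    }
}
```

Without the shift, a negative int would start with a byte of 0x80 or above and sort after every non-negative int in an unsigned byte comparison; the shift moves MIN_VALUE to 0 so that byte order and numeric order agree, and the comparator never has to deserialize the keys.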
5.4 Why Java Object Serialization is not so efficient compared to other Serialization frameworks?

5.4.1 Java Serialization does not meet the criteria of a Serialization format.
1) compact
2) fast
3) extensible
4) interoperable
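The compactness criterion is easy to observe with the JDK alone. The sketch below, with the hypothetical class SerializationSizeSketch and a hypothetical two-int Pair, serializes the same object once with Java Object Serialization and once by writing just the two fields, and compares the byte counts. It does not use Hadoop's Writable classes; writing raw fields through DataOutputStream is only meant to approximate what a Writable does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical size comparison: Java Object Serialization versus writing the raw fields.
class SerializationSizeSketch {
    static final class Pair implements Serializable {
        private static final long serialVersionUID = 1L;
        final int first;
        final int second;

        Pair(int first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Size in bytes when the object is written with ObjectOutputStream.
    static int javaSerializedSize(Pair pair) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(pair);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size in bytes when only the two int fields are written.
    static int rawFieldSize(Pair pair) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(pair.first);
            out.writeInt(pair.second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Pair pair = new Pair(1, 2);
        System.out.println("java serialization: " + javaSerializedSize(pair) + " bytes");
        System.out.println("raw fields: " + rawFieldSize(pair) + " bytes");
    }
}
```

The raw form is exactly eight bytes, while the Java-serialized form also carries a stream header and a description of the class, which is overhead repeated in every stream that contains the type.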
6. Distributed Cache.

6.1 Introduction.

A facility provided by the map-reduce framework to distribute files explicitly specified with the -files option across the cluster, where they are kept in cache for processing. Generally, all extra files needed by map-reduce tasks should be distributed this way to save network bandwidth.
Delegation tokens are used by Hadoop in the background so that the user does not have to authenticate on every command by contacting the KDC.
8.1 Three schedulers:

Default scheduler:
• Single priority-based queue of jobs.
• Scheduling tries to balance map and reduce load on all tasktrackers in the cluster.
Capacity Scheduler:
• Yahoo!'s scheduler.

The Capacity Scheduler supports the following features:
• Support for multiple queues, where a job is submitted to a queue.
• Queues are allocated a fraction of the capacity of the grid, in the sense that a certain capacity of resources will be at their disposal. All jobs submitted to a queue will have access to the capacity allocated to the queue.
• Free resources can be allocated to any queue beyond its capacity. When there is demand for these resources from queues running below capacity at a future point in time, as tasks scheduled on these resources complete, they will be assigned to jobs on queues running below capacity.
• Queues optionally support job priorities (disabled by default).
• Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority. However, once a job is running, it will not be preempted for a higher-priority job, though new tasks from the higher-priority job will be preferentially scheduled.
• In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.
• Support for memory-intensive jobs, wherein a job can optionally specify higher memory requirements than the default, and the tasks of the job will only be run on TaskTrackers that have enough memory to spare.
Appendix B
Documented examples from the latest Apache Hadoop distribution, in the new Hadoop 0.21 trunk version. They can be ported to the Hadoop 0.20.203 API and used.
1. QMC pi estimator.
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.examples;

import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/*
 * First, note that this is the only MapReduce example here that prints its
 * output to the screen instead of to a file.
 *
 * HaltonSequence:
 *   The constructor populates the arrays with the Halton sequence. In the
 *   first iteration (i = 0, base 2) q is 1/2, 1/4, 1/8, ... (63 numbers);
 *   in the second iteration (i = 1, base 3) q is 1/3, 1/9, 1/27, ...
 *   (40 numbers). The variable k is reduced modulo the base to extract the
 *   digits d of the index, and x is the sum of the series d * q.
 *
 * nextPoint():
 *   Increments the digits prepared by the constructor, with carry, to
 *   obtain the next point.
 *
 * QmcMapper:
 *   Creates size samples and checks whether each one is inside or outside
 *   the circle: it subtracts 0.5 (the coordinates of the circle's center)
 *   from x and y and tests x^2 + y^2 > r^2. It then emits two values,
 *   numInside and numOutside, keyed by true and false respectively.
 *
 * QmcReducer:
 *   The reducer does not emit anything. It iterates over the values for the
 *   keys true and false, sums the numbers of points, and updates the
 *   variables numInside and numOutside. The overridden cleanup() method
 *   writes the result to the file reduce-out; this is safe only because
 *   there is a single reduce task (otherwise there would be concurrency
 *   issues).
 *
 * estimatePi:
 *   Runs the specified number of mappers and one reducer, then reads the
 *   output from the file written by the reducer.
 */

/**
 * A map/reduce program that estimates the value of Pi
 * using a quasi-Monte Carlo (qMC) method.
 * Arbitrary integrals can be approximated numerically by qMC methods.
 * In this example,
 * we use a qMC method to approximate the integral $I = \int_S f(x) dx$,
 * where $S=[0,1)^2$ is a unit square,
 * $x=(x_1,x_2)$ is a 2-dimensional point,
 * and $f$ is a function describing the inscribed circle of the square $S$,
 * $f(x)=1$ if $(2x_1-1)^2+(2x_2-1)^2 <= 1$ and $f(x)=0$, otherwise.
 * It is easy to see that Pi is equal to $4I$.
 * So an approximation of Pi is obtained once $I$ is evaluated numerically.
 *
 * There are better methods for computing Pi.
 * We emphasize numerical approximation of arbitrary integrals in this example.
 * For computing many digits of Pi, consider using bbp.
 *
 * The implementation is discussed below.
 *
 * Mapper:
 *   Generate points in a unit square
 *   and then count points inside/outside of the inscribed circle of the square.
 *
 * Reducer:
 *   Accumulate points inside/outside results from the mappers.
 *
 * Let numTotal = numInside + numOutside.
 * The fraction numInside/numTotal is a rational approximation of
 * the value (Area of the circle)/(Area of the square) = $I$,
 * where the area of the inscribed circle is Pi/4
 * and the area of unit square is 1.
 * Finally, the estimated value of Pi is 4(numInside/numTotal).
 */
public class QuasiMonteCarlo extends Configured implements Tool {
  static final String DESCRIPTION
      = "A map/reduce program that estimates Pi using a quasi-Monte Carlo method.";
  /** tmp directory for input/output */
  static private final Path TMP_DIR = new Path(
      QuasiMonteCarlo.class.getSimpleName() + "_TMP_3_141592654");

  /** 2-dimensional Halton sequence {H(i)},
   * where H(i) is a 2-dimensional point and i >= 1 is the index.
   * Halton sequence is used to generate sample points for Pi estimation.
   */
  private static class HaltonSequence {
    /** Bases */
    static final int[] P = {2, 3};
    /** Maximum number of digits allowed */
    static final int[] K = {63, 40};

    private long index;
    private double[] x;
    private double[][] q;
    private int[][] d;

    /** Initialize to H(startindex),
     * so the sequence begins with H(startindex+1).
     */
    HaltonSequence(long startindex) {
      index = startindex;
      x = new double[K.length];
      q = new double[K.length][];
      d = new int[K.length][];
      for(int i = 0; i < K.length; i++) {
        q[i] = new double[K[i]];
        d[i] = new int[K[i]];
      }

      for(int i = 0; i < K.length; i++) {
        long k = index;
        x[i] = 0;

        for(int j = 0; j < K[i]; j++) {
          q[i][j] = (j == 0? 1.0: q[i][j-1])/P[i];
          d[i][j] = (int)(k % P[i]);
          k = (k - d[i][j])/P[i];
          x[i] += d[i][j] * q[i][j];
        }
      }
    }

    /** Compute next point.
     * Assume the current point is H(index).
     * Compute H(index+1).
     *
     * @return a 2-dimensional point with coordinates in [0,1)^2
     */
    double[] nextPoint() {
      index++;
      for(int i = 0; i < K.length; i++) {
        for(int j = 0; j < K[i]; j++) {
          d[i][j]++;
          x[i] += q[i][j];
          if (d[i][j] < P[i]) {
            break;
          }
          d[i][j] = 0;
          x[i] -= (j == 0? 1.0: q[i][j-1]);
        }
      }
      return x;
    }
  }

  /**
   * Mapper class for Pi estimation.
   * Generate points in a unit square
   * and then count points inside/outside of the inscribed circle of the square.
   */
  public static class QmcMapper extends
      Mapper<LongWritable, LongWritable, BooleanWritable, LongWritable> {

    /** Map method.
     * @param offset samples starting from the (offset+1)th sample.
     * @param size the number of samples for this map
     * @param context output {true->numInside, false->numOutside}
     */
    public void map(LongWritable offset,
                    LongWritable size,
                    Context context)
        throws IOException, InterruptedException {

      final HaltonSequence haltonsequence = new HaltonSequence(offset.get());
      long numInside = 0L;
      long numOutside = 0L;

      for(long i = 0; i < size.get(); ) {
        //generate points in a unit square
        final double[] point = haltonsequence.nextPoint();

        //count points inside/outside of the inscribed circle of the square
        final double x = point[0] - 0.5;
        final double y = point[1] - 0.5;
        if (x*x + y*y > 0.25) {
          numOutside++;
        } else {
          numInside++;
        }

        //report status
        i++;
        if (i % 1000 == 0) {
          context.setStatus("Generated " + i + " samples.");
        }
      }

      //output map results
      context.write(new BooleanWritable(true), new LongWritable(numInside));
      context.write(new BooleanWritable(false), new LongWritable(numOutside));
    }
  }

  /**
   * Reducer class for Pi estimation.
   * Accumulate points inside/outside results from the mappers.
   */
  public static class QmcReducer extends
      Reducer<BooleanWritable, LongWritable, WritableComparable<?>, Writable> {

    private long numInside = 0;
    private long numOutside = 0;

    /**
     * Accumulate number of points inside/outside results from the mappers.
     * @param isInside Is the points inside?
     * @param values An iterator to a list of point counts
     * @param context dummy, not used here.
     */
    public void reduce(BooleanWritable isInside,
        Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      if (isInside.get()) {
        for (LongWritable val : values) {
          numInside += val.get();
        }
      } else {
        for (LongWritable val : values) {
          numOutside += val.get();
        }
      }
    }

    /**
     * Reduce task done, write output to a file.
     */
    @Override
    public void cleanup(Context context) throws IOException {
      //write output to a file
      Path outDir = new Path(TMP_DIR, "out");
      Path outFile = new Path(outDir, "reduce-out");
      Configuration conf = context.getConfiguration();
      FileSystem fileSys = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, conf,
          outFile, LongWritable.class, LongWritable.class,
          CompressionType.NONE);
      writer.append(new LongWritable(numInside), new LongWritable(numOutside));
      writer.close();
    }
  }

  /**
   * Run a map/reduce job for estimating Pi.
   *
   * @return the estimated value of Pi
   */
  public static BigDecimal estimatePi(int numMaps, long numPoints,
      Configuration conf
      ) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = new Job(conf);
    //setup job conf
    job.setJobName(QuasiMonteCarlo.class.getSimpleName());
    job.setJarByClass(QuasiMonteCarlo.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);

    job.setOutputKeyClass(BooleanWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(QmcMapper.class);

    job.setReducerClass(QmcReducer.class);
    job.setNumReduceTasks(1);

    // turn off speculative execution, because DFS doesn't handle
    // multiple writers to the same file.
    job.setSpeculativeExecution(false);

    //setup input/output directories
    final Path inDir = new Path(TMP_DIR, "in");
    final Path outDir = new Path(TMP_DIR, "out");
    FileInputFormat.setInputPaths(job, inDir);
    FileOutputFormat.setOutputPath(job, outDir);

    final FileSystem fs = FileSystem.get(conf);
    if (fs.exists(TMP_DIR)) {
      throw new IOException("Tmp directory " + fs.makeQualified(TMP_DIR)
          + " already exists.  Please remove it first.");
    }
    if (!fs.mkdirs(inDir)) {
      throw new IOException("Cannot create input directory " + inDir);
    }

    try {
      //generate an input file for each map task
      for(int i=0; i < numMaps; ++i) {
        final Path file = new Path(inDir, "part"+i);
        final LongWritable offset = new LongWritable(i * numPoints);
        final LongWritable size = new LongWritable(numPoints);
        final SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, file,
            LongWritable.class, LongWritable.class, CompressionType.NONE);
        try {
          writer.append(offset, size);
        } finally {
          writer.close();
        }
        System.out.println("Wrote input for Map #"+i);
      }

      //start a map/reduce job
      System.out.println("Starting Job");
      final long startTime = System.currentTimeMillis();
      job.waitForCompletion(true);
      final double duration = (System.currentTimeMillis() - startTime)/1000.0;
      System.out.println("Job Finished in " + duration + " seconds");

      //read outputs
      Path inFile = new Path(outDir, "reduce-out");
      LongWritable numInside = new LongWritable();
      LongWritable numOutside = new LongWritable();
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, inFile, conf);
      try {
        reader.next(numInside, numOutside);
      } finally {
        reader.close();
      }

      //compute estimated value
      final BigDecimal numTotal
          = BigDecimal.valueOf(numMaps).multiply(BigDecimal.valueOf(numPoints));
      return BigDecimal.valueOf(4).setScale(20)
          .multiply(BigDecimal.valueOf(numInside.get()))
          .divide(numTotal, RoundingMode.HALF_UP);
    } finally {
      fs.delete(TMP_DIR, true);
    }
  }

  /**
   * Parse arguments and then runs a map/reduce job.
   * Print output in standard out.
   *
   * @return a non-zero if there is an error.  Otherwise, return 0.
   */
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: "+getClass().getName()+" <nMaps> <nSamples>");
      ToolRunner.printGenericCommandUsage(System.err);
      return 2;
    }

    final int nMaps = Integer.parseInt(args[0]);
    final long nSamples = Long.parseLong(args[1]);

    System.out.println("Number of Maps  = " + nMaps);
    System.out.println("Samples per Map = " + nSamples);

    System.out.println("Estimated value of Pi is "
        + estimatePi(nMaps, nSamples, getConf()));
    return 0;
  }

  /**
   * main method for running it as a stand alone command.
   */
  public static void main(String[] argv) throws Exception {
    System.exit(ToolRunner.run(null, new QuasiMonteCarlo(), argv));
  }
}
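The sampling idea behind the listing above can be tried without a cluster. The following is a minimal single-process sketch, assuming the same bases (2 and 3) and the same inside/outside test. It computes each Halton coordinate directly as a radical inverse instead of updating digit arrays incrementally, so it is a simplification and not the example's own code. The class name LocalQmcPi is hypothetical.

```java
final class LocalQmcPi {

  /** Radical inverse of index in the given base: one Halton coordinate in [0,1). */
  static double radicalInverse(long index, int base) {
    double result = 0.0;
    double fraction = 1.0 / base;
    while (index > 0) {
      result += (index % base) * fraction;  // next digit times base^-(position)
      index /= base;
      fraction /= base;
    }
    return result;
  }

  /** Estimates Pi from the first numPoints Halton points (bases 2 and 3). */
  static double estimatePi(long numPoints) {
    long numInside = 0;
    for (long i = 1; i <= numPoints; i++) {
      // Shift so that the circle of radius 0.5 is centered at the origin.
      double x = radicalInverse(i, 2) - 0.5;
      double y = radicalInverse(i, 3) - 0.5;
      if (x * x + y * y <= 0.25) {
        numInside++;
      }
    }
    // numInside / numPoints approximates Pi / 4.
    return 4.0 * numInside / numPoints;
  }

  public static void main(String[] args) {
    System.out.println("Estimated value of Pi is " + estimatePi(1_000_000L));
  }
}
```

In the MapReduce version each mapper covers its own slice of indices (offset + 1 through offset + size), and the single reducer adds up the per-mapper counts before the same final multiplication by 4.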
Grep example:
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/*
 * Grep search uses RegexMapper to read the parts of the input that match
 * the regex, and emits each matching string as the key with its count as
 * the value.
 *
 * LongSumReducer simply sums all the long values (the counts emitted by
 * all the mappers) for a particular key.
 *
 * The first job searches by the above procedure. Since its output is sorted
 * on the matched strings, a second job uses the InverseMapper class to swap
 * keys and values so that the output is sorted on frequencies.
 */

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}                               // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return 2;
    }

    Path tempDir =
      new Path("grep-temp-"+
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    Configuration conf = getConf();
    conf.set(RegexMapper.PATTERN, args[2]);
    if (args.length == 4)
      conf.set(RegexMapper.GROUP, args[3]);

    Job grepJob = new Job(conf);

    try {

      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      grepJob.waitForCompletion(true);

      Job sortJob = new Job(conf);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormatClass(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);                 // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setSortComparatorClass(          // sort by decreasing freq
        LongWritable.DecreasingComparator.class);

      sortJob.waitForCompletion(true);
    }
    finally {
      FileSystem.get(conf).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }

}
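The two phases of the Grep example (count the matches, then order them by decreasing frequency) can be mimicked in memory to see what the final output looks like. This is a plain-Java sketch with no Hadoop dependency. The class LocalGrep and its method are hypothetical, and it collapses the two jobs into one method for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocalGrep {

  /** Counts every regex match, then returns the matches by decreasing count. */
  static List<Map.Entry<String, Long>> grep(List<String> lines, String regex) {
    Pattern pattern = Pattern.compile(regex);

    // Phase 1 ("grep-search"): emit (match, 1) and sum per match.
    Map<String, Long> counts = new HashMap<>();
    for (String line : lines) {
      Matcher matcher = pattern.matcher(line);
      while (matcher.find()) {
        counts.merge(matcher.group(), 1L, Long::sum);
      }
    }

    // Phase 2 ("grep-sort"): order by count, largest first.
    List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
    sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
    return sorted;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("dfs.a dfs.b", "dfs.a other", "dfs.a dfs.b dfs.c");
    for (Map.Entry<String, Long> entry : grep(lines, "dfs\\.[a-z]+")) {
      System.out.println(entry.getValue() + "\t" + entry.getKey());
    }
  }
}
```

The cluster version needs two jobs because the shuffle sorts by key: the first job's keys are the matched strings, so a second pass with the count as the key is required to get a frequency ordering.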
WordCount
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/*
 * Word count documentation.
 * TokenizerMapper: takes each line of the input as its input value,
 *   tokenizes it, and emits each word as the key with the integer one as
 *   the value.
 * IntSumReducer: accepts each word as a key and aggregates all of its
 *   values, i.e. the ones the mapper has emitted. Iterating over all the
 *   values and adding them gives the count of that word.
 */

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
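To check what the WordCount job should produce for a small input, the same tokenize-and-sum logic can be run in memory. The sketch below is plain Java with no Hadoop dependency. The class LocalWordCount is hypothetical, and it merges the map and reduce steps into one loop, which a real job cannot do across machines.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class LocalWordCount {

  /** Returns each whitespace-separated word with its number of occurrences. */
  static Map<String, Integer> count(Iterable<String> lines) {
    // TreeMap keeps the words sorted, similar to the sorted reducer output.
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      // "Map" step: split the line into words, as TokenizerMapper does.
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        // "Reduce" step: add the emitted one to the running sum for the word.
        counts.merge(itr.nextToken(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = count(Arrays.asList("hello world", "hello hadoop"));
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}
```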