Lesson 2 Notes
Introduction
In this lesson we'll take a deeper look at the two key parts of Hadoop: how it stores the data, and how it processes it. Let's start by seeing how data is stored.
HDFS
Files are stored in something called the Hadoop Distributed File System, which everyone just refers to as HDFS. As a developer, this looks very much like a regular file system, the kind you're used to working with on a standard machine. But it's helpful to understand what's going on behind the scenes, so that's what we're going to talk about here.
When a file is loaded into HDFS, it's split into chunks, which we call blocks. Each block is pretty big: the default is 64 megabytes. So, imagine we're going to store a file called mydata.txt, which is 150 megabytes in size. As it's uploaded to the cluster, it's split into 64 megabyte blocks, and each block will be stored on one node in the cluster. Each block is given a unique name by the system: it's actually just blk, then an underscore, then a large number. In this case, the file will be split into three blocks: the first will be 64 megabytes, the second will be 64 megabytes, and the third will be the remaining 22 megabytes.
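The split described above is easy to sketch in a few lines of Python. This is a toy illustration of the arithmetic only, not how HDFS itself is implemented:

```python
# Toy sketch of how a file is carved into HDFS-style blocks.
# Illustration only; HDFS does this splitting internally.

BLOCK_SIZE_MB = 64  # the default block size described in this lesson

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file would be split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

# A 150-megabyte mydata.txt becomes two full blocks plus a 22 MB remainder.
print(split_into_blocks(150))  # [64, 64, 22]
```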
There's a daemon, or piece of software, running on each of these cluster nodes called the DataNode, which takes care of storing the blocks.

Copyright 2014 Udacity, Inc. All Rights Reserved.
Now, clearly we need to know which blocks make up the file. That's handled by a separate machine, running a daemon called the NameNode. The information stored on the NameNode is known as the metadata.
QUIZ:
Which of the following do you think are problems for HDFS as described so far?

[ ] Network failure between nodes
[ ] Disk failure on DataNode
[ ] Not all DataNodes are used
[ ] Block sizes are different
[ ] Disk failure on NameNode

Answer:
Some nodes not being used is not a problem, since they can be used for different files; neither are different block sizes. Network and disk failures are certainly a problem. Let's look into this in more detail.
Data Redundancy
The problem with things right now is that if one of our nodes fails, we're left with missing data for the file. If this node goes away, for example, we've got a 64 megabyte hole in the middle of mydata.txt. And, of course, similar problems with any other files which have blocks stored on that node.
To solve this problem, Hadoop replicates each block three times as it's stored in HDFS. So blk_1 doesn't just live here, it's also stored perhaps here and here. blk_2 isn't just here, but also maybe here and here. And similarly for blk_3. Hadoop just picks three random nodes, and puts one copy of the block on each of the three. Well, actually, it's not a totally random choice, but it's close enough for us right now.
Now, if a single node fails, it's not a problem because we have two other copies of the block on other nodes. And the NameNode is smart enough that if it sees that any of the blocks are under-replicated, it will arrange to have those blocks re-replicated on the cluster so we're back to having three copies of them again.
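Replication and re-replication can be sketched as a small simulation. Everything here (the node names, the truly random placement) is a simplification for illustration; real HDFS placement is rack-aware, as the lesson hints:

```python
import random

REPLICATION = 3  # HDFS keeps three copies of each block by default

def place_replicas(blocks, nodes, rng):
    """Assign each block to three distinct nodes.
    Simplified: truly random, unlike Hadoop's real rack-aware placement."""
    return {blk: set(rng.sample(nodes, REPLICATION)) for blk in blocks}

def rereplicate(placement, failed_node, nodes, rng):
    """After a node fails, bring every under-replicated block back up to
    three copies, the way the NameNode arranges re-replication."""
    for blk, holders in placement.items():
        holders.discard(failed_node)
        while len(holders) < REPLICATION:
            candidates = [n for n in nodes
                          if n != failed_node and n not in holders]
            holders.add(rng.choice(candidates))
    return placement

rng = random.Random(42)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes, rng)
placement = rereplicate(placement, "node1", nodes, rng)

# Every block is back to three copies, none of them on the failed node.
print({blk: sorted(holders) for blk, holders in placement.items()})
```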
QUIZ:
If there is a problem with the single NameNode (its network connection or its disk fails), what could happen?

[ ] data on HDFS may be inaccessible
[ ] data on HDFS may be lost forever
[ ] there is no problem

Answer:
If there is a network failure, the data will be temporarily inaccessible. If the disk of the single NameNode were to fail, the data on HDFS would be lost permanently.
DEMO of HDFS
So, here I have a directory on my local machine, which contains a couple of files, and I want to put one of them into HDFS. All of the commands which interact with the Hadoop file system start with hadoop fs. So first of all, let's see what we have in HDFS to start with. I do that by saying hadoop fs -ls. That gives me a listing of what's in my home directory on the Hadoop cluster. Because I'm logged into the local machine as a user called training, my home directory in HDFS is /user/training. And as you can see, there's nothing there. So now, let's upload our purchases.txt file. We do that with hadoop fs -put purchases.txt. hadoop fs -put takes a local file and places it into HDFS.

Since I'm not specifying a destination filename, it'll be uploaded with the same filename. So, it takes a few seconds to upload. And now I can do another hadoop fs -ls, and we can see that that file is now in HDFS.
I can take a look at the last few lines of the file by saying hadoop fs -tail, and then the filename, and that just displays the last few lines on the screen for me.

There's also a hadoop fs -cat, which will display the entire contents of the file, and we'll use that later. There are plenty of other hadoop fs commands and, as you'll probably have started to realize, they closely mirror standard UNIX file system commands. So, if I want to rename the file, for example, I can say hadoop fs -mv, which moves purchases.txt, in this case, to newname.txt.
If I want to delete a file, hadoop fs -rm will remove that file for me. So, let's get rid of newname.txt from HDFS.

I create a directory in HDFS by saying hadoop fs -mkdir and then the directory name, and now let's upload purchases.txt and place it in the myinput directory so that it's ready for processing. Once I've done that, hadoop fs -ls myinput will show me the contents of that directory. And just as I expected, there's the file.
MapReduce
Thanks, Ian. OK, now we've seen how files are stored in HDFS, let's discuss how that data is processed with MapReduce. Say I had a large file. Processing that serially from the top to the bottom could take a long time. Instead, MapReduce is designed to be a very parallelized way of processing data, meaning that your input data is split into many pieces, and each piece is processed simultaneously. To explain, let's take a real world scenario.
Let's imagine we run a retailer with thousands of stores around the world. And we have a ledger which contains all the sales from all the stores, organized by date. We've been asked to calculate the total sales generated by each store over the last year.

Now, one way to do that would be just to start at the beginning of the ledger and, for each entry, write the store name and the amount next to it. For the next entry, I need to see if I've already got that store written down: if I have, I can add the amount to that store's total. If not, I write down a new store name and that first purchase. And so on, and so on.
Hash Tables

Typically, this is how we'd solve things in a traditional computing environment: we'd create some kind of associative array or hash table for the stores, then process the input file one entry at a time.
What problems do you see with such an approach, if you run this on 1 TB of data?

[ ] it will not work
[ ] you might run out of memory
[ ] it will take a long time
[ ] the end result might be incorrect
Answer:
First of all, you've got millions and millions of sales to process, so it's going to take an awfully long time for your computer to first read the file from disk and then to process it. Also, the more stores you have, the longer it takes to check the total sheet, find the right store, and add the new value to the running total for that store. And you may even run out of memory to hold your array if you really do have a huge number of stores. So instead, let's see how you would do this as a MapReduce job.
Here's what a Mapper will do. They will take the first record in their chunk of the ledger, and on an index card they'll write the store name as the heading. Underneath, they'll write the sale amount for that record. Then they'll take the next record, and do the same thing. As they're writing the index cards, they'll pile them up so that all the cards for one particular store go on the same pile. By the end, each Mapper will have a pile of cards per store.
Once the Mappers have finished, the Reducers can collect their sets of cards. We tell each Reducer which stores they're responsible for. The Reducers go to all the Mappers and retrieve the piles of cards for their own stores. It's fast for them to do, because each Mapper has separated the cards into a pile per store already. Once the Reducers have retrieved all their data, they collect all the small piles per store and create a large pile per store. Then they start going through the piles, one at a time. All they have to do at this point is add up all the amounts on all the cards in a pile, and that gives them the total sales for that store, which they can write on their final total sheet. And to keep things organized, each Reducer goes through his or her set of piles of cards in alphabetical order.
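The index-card process maps directly onto code. Here is a minimal single-machine sketch of the three steps (the store names and amounts are made up for illustration; a real job would run the mappers and reducers in parallel on the cluster):

```python
from collections import defaultdict

def mapper(record):
    """Write an 'index card': (store name, sale amount) for one ledger entry."""
    store, amount = record
    return (store, amount)

def shuffle(intermediate):
    """Group the cards into one pile per store, as the Reducers collect them."""
    piles = defaultdict(list)
    for store, amount in intermediate:
        piles[store].append(amount)
    return piles

def reducer(store, amounts):
    """Add up every card in a pile to get the store's total sales."""
    return (store, sum(amounts))

ledger = [("London", 20.0), ("Miami", 12.5), ("London", 8.0), ("NYC", 30.0)]
piles = shuffle(mapper(rec) for rec in ledger)
totals = dict(reducer(store, amounts) for store, amounts in sorted(piles.items()))
print(totals)  # {'London': 28.0, 'Miami': 12.5, 'NYC': 30.0}
```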
MapReduce
And that's MapReduce! The Mappers are programs which each deal with a relatively small amount of data, and they all work in parallel. The Mappers output what we call intermediate records, which in this case were our index cards. Hadoop deals with all data in the form of records, and records are key-value pairs. In this example, the key was the store name, and the value was the sale total for that particular piece of input. Once the Mappers have finished, a phase of MapReduce called the Shuffle and Sort takes place. The shuffle is the movement of the intermediate data from the Mappers to the Reducers and the combination of all the small sets of records together, and the sort is the fact that the Reducers will organize the sets of records (the piles of index cards in our example) into order. Finally, the Reduce phase works on one set of records (one pile of cards) at a time: it gets the key, and then a list of all the values, it processes those values in some way (adding them up in our case), and then it writes out its final data for that key.
QUIZ:
If you want the job's final output as a single set of results, what can you do?

[ ] can't be done
[ ] have only one Reducer
[ ] merge the result files after the job

Answer:
You could either have a single Reducer, or merge the result files after the job.
QUIZ:
Say the Mappers produced four keys (Apple, Banana, Carrot, and Grape) and the job has two Reducers. Which keys will go to the first Reducer?

[ ] Apple and Banana
[ ] Apple and Carrot
[ ] Carrot and Grape
[ ] Apple and Grape
[ ] We don't know, but two will go to each Reducer
[ ] We don't know, and it's possible that one Reducer will not get any of the keys

Answer:
Since there is no guarantee that each Reducer will get the same number of keys, it might be that one of them will get none. For more information on how this works, see the links in the Instructor Notes.
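To see why the distribution is unpredictable: Hadoop's default partitioner sends each key to a Reducer based on the key's hash, roughly hash(key) mod numReducers. The toy hash below is invented for illustration (the real one is Java's hashCode()), but it shows how easily the split can come out uneven:

```python
def toy_hash(key):
    """Deterministic stand-in for Java's key.hashCode(); illustration only."""
    return sum(ord(c) for c in key)

def partition(key, num_reducers):
    """Hadoop's default HashPartitioner works like hash(key) % numReducers."""
    return toy_hash(key) % num_reducers

keys = ["Apple", "Banana", "Carrot", "Grape"]
assignments = {k: partition(k, 2) for k in keys}

# With this toy hash, Reducer 0 happens to get one key and Reducer 1 gets
# three; nothing guarantees an even two-and-two split.
print(assignments)  # {'Apple': 0, 'Banana': 1, 'Carrot': 1, 'Grape': 1}
```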
Daemons of MapReduce
So we've seen conceptually how MapReduce works. In the next lesson, we'll talk about how to actually write code to perform MapReduce jobs on the cluster, but before we do that it's useful to know where the code will actually run.

Just as with HDFS, there are a set of daemons (basically just pieces of code which run all the time) that control MapReduce on the cluster. When you run a MapReduce job, you submit the code to a machine called the JobTracker. That splits the work into Mappers and Reducers, and those Mappers and Reducers will run on the cluster nodes. Running the actual Map tasks and Reduce tasks is handled by a daemon called the TaskTracker, which runs on each of the slave nodes in the cluster. Notice that since the TaskTrackers run on the same machines as the DataNodes, the Hadoop framework will be able to have Map tasks work on pieces of data that are stored on the same machine, which will save a lot of network traffic.
As we saw, each Mapper processes a portion of the input data known as an InputSplit, and by default Hadoop will use an HDFS block as the InputSplit for each Mapper. It will try to make sure that a Mapper works on data which is on the same machine as the block itself, so in an ideal world, the Mapper which processes a block will run on one of the machines which actually stores that block. If block 2 needs processing, for example, it will ideally be processed on this machine, this machine, or this machine. That won't always be possible, because the TaskTrackers on all three machines may already be busy, in which case the data will be streamed to another node for processing, but it should happen the majority of the time.

The Mappers read the input data, and produce intermediate data which the Hadoop framework then passes to the Reducers; that's the shuffle and sort. The Reducers process that data, and write their final output back to HDFS.
So let's have Ian run a job on our cluster.
Running a Job
It's often the case that MapReduce code is written in Java. However, to make things a little easier for us, we've actually written our mapper and reducer in Python instead. And we can do that thanks to a feature called Hadoop Streaming, which allows you to write your code in pretty much any language you'd like. So first of all, let's double-check that we have our input data in HDFS. So, if I hadoop fs -ls, then there's my input directory. And if I look at that directory, then yes, there's purchases.txt in there. And in my local directory, I have mapper.py and reducer.py; that's the code for the mapper and reducer, written in Python. We'll look at the actual code in the next lesson.
Okay, to submit the job we have to give this rather cumbersome command. We say hadoop jar, a path to a jar, then I specify the mapper, I specify the reducer, I need to say -file for both the mapper and the reducer code. I specify the input directory in HDFS and I specify the output directory to which the reducers will write their output data. And we're calling that joboutput.
I hit Enter and off we go. Hadoop's pretty verbose, as you can see. As the job runs, you'll see a bunch of output which shows us how far along the job is. It turns out that for this job Hadoop will be running four mappers, and our virtual machine here can only run two at a time, so the job is going to take longer than it would on a larger cluster. Actually, that's worth mentioning here. With the size of the data we have for this example, which is only 200 megs, realistically we could probably have solved this problem faster by just importing the data into a relational database and querying it from there. And that's often the case when we're developing and testing code. Because the test data sets are pretty small, Hadoop isn't necessarily the optimal tool for the job. But when we're done testing and we need to process our full production data, that's when Hadoop really comes into its own. So, as you can see the job is now nearly complete, and when the job has finished we'll see that the last line tells me that the output directory is called joboutput.
Let's take a look at what we've got in there. hadoop fs -ls shows me that yes, I do have a joboutput directory. And if we look at the joboutput directory, you'll see that it contains three things. It contains a file called _SUCCESS, which just tells me that the job has successfully completed. It contains a directory called _logs, which contains some log information about what happened during the job's run. And then, it contains a file called part-00000. That file is the output from the one reducer that we had for this job.

Let's take a look at that by saying hadoop fs -cat part-00000, and we'll pipe that to less on our local machine.

That's the contents of that file, which is the output from our reducer. It's the sum total sales broken down by store, exactly as we want it.
Incidentally, if you want to retrieve data from HDFS and put it onto your local disk, you can do that with hadoop fs -get. hadoop fs -get is the opposite of hadoop fs -put. It just pulls data from HDFS and puts it on the local disk. So as you can see, now I have my local file on my local disk, and I can manipulate that however I'd like.
That Hadoop job command we typed was pretty painful to have to remember. So to save you time, we've created an alias in the demo virtual machine that you'll be downloading. You can just type hs followed by four arguments: the mapper script, the reducer script, the input directory, and the output directory.
Here's one important thing to note, though. When you're running a Hadoop job, the output directory must not already exist. As you can see, if we try and run the command with an existing directory (in this case, joboutput), Hadoop refuses to run the job.

This is actually a feature of Hadoop. It's designed to stop you inadvertently deleting or overwriting data that's already in the cluster. But as you can see, if we specify a different directory, which doesn't already exist, then the job will begin just fine.
Processing Logs
The example we just talked about was calculating the total sales per store. And there are lots of other things we can do with MapReduce that are actually quite similar, conceptually, to that. For example, log processing is really quite similar. Imagine you have a set of log files from a Web server which look like this, and you want to know how many times each page has been hit. Well, it's really similar to the sales per store. Your Mapper will read a line of the log file at a time, and will extract the name of the page (index.html, for example).

Its intermediate data will have the name of the page as the key, and a 1 as the value, because you've found one hit to the page at that position in the log. When all the Mappers are done, the Reducers will get the keys, and a list of all the values for each particular key. They can then just add all the 1s up for a key and that will tell them the total number of hits to that page on the Web site. Simple, but far more efficient than writing a standalone program to go through all the logs from start to finish if you have hundreds of gigabytes to process.
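A page-hit mapper and reducer can be sketched like this. The log format and field positions below are invented for illustration (a real access-log parser would be more careful), and in a Streaming job the two functions would live in separate scripts reading stdin and writing tab-separated lines:

```python
from collections import defaultdict

def map_hits(log_lines):
    """Mapper: emit (page, 1) for every request line in the log."""
    for line in log_lines:
        fields = line.split()
        page = fields[2]  # assume the requested path is the third field
        yield (page, 1)

def reduce_hits(pairs):
    """Reducer: add up the 1s for each page to get its total hit count."""
    totals = defaultdict(int)
    for page, count in pairs:
        totals[page] += count
    return dict(totals)

log = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /index.html",
]
print(reduce_hits(map_hits(log)))  # {'/index.html': 2, '/about.html': 1}
```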
And that's just the start of what you can do with MapReduce. Things like fraud detection, recommender systems, item classification: there are many, many applications of MapReduce, but they all start with those simple concepts. And they all share some basic characteristics: there's a lot of data to be processed, and the work can be parallelized, so you don't have to just start at the beginning and slog through to the end.
Perhaps the hardest thing to learn when you're new to Hadoop is how to solve problems by thinking in terms of MapReduce. It's a very different way of processing data compared to how you're probably used to working and, honestly, the best way to learn is by practice. In the next lesson we'll write the code to solve our sales-by-store problem, and you'll start to work on other MapReduce problems.

Once you've downloaded and started up the VM, we'd like you to try uploading a data set into HDFS and running a MapReduce job yourself. The exercise instructions document in the Instructor Notes section gives you step-by-step instructions on what to do (Instructions document). Have fun!
Conclusion
So, that's the end of the lesson. You learned about how Hadoop uses HDFS to store data, and the basic principles behind MapReduce. In the next lesson, we'll look at the MapReduce code itself; by the end of the lesson you'll be ready to write your own programs to analyze data.
Number of Reducers
One thing worthy of note is that you, as a developer, specify how many Reducers you want for your job. The default is to have a single Reducer, but for large amounts of data it often makes sense to have many more than one. Otherwise, that one Reducer will end up having to process a huge amount of data from the Mappers. The Hadoop framework decides which keys get sent to each Reducer, and there's no guarantee that each Reducer will get the same number of keys. The keys which go to a particular Reducer are sorted, but each Reducer writes its own file into HDFS. So if, for example, we had four keys (a, b, c, and d) and two Reducers, then one Reducer might get keys a and d, the other might get b and c. So the results would be sorted within each Reducer's output, but just joining the files together wouldn't produce completely sorted output.
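The four-key example can be checked concretely. The totals below are made-up values; the point is only the ordering of the part files:

```python
def reducer_output(keys_for_reducer, totals):
    """Each Reducer writes its own keys, sorted, to its own part file."""
    return [(k, totals[k]) for k in sorted(keys_for_reducer)]

totals = {"a": 10, "b": 20, "c": 30, "d": 40}
part_00000 = reducer_output(["a", "d"], totals)  # one Reducer got a and d
part_00001 = reducer_output(["b", "c"], totals)  # the other got b and c

# Each part file is sorted internally, but simply concatenating them
# gives a, d, b, c: not globally sorted output.
combined = part_00000 + part_00001
keys = [k for k, _ in combined]
print(keys)  # ['a', 'd', 'b', 'c']
```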
QUIZ:
Before we move on, though, which of the following types of problem do you think are good candidates to solve with MapReduce?

[ ] Detecting anomalous behavior from a log file
[ ] Calculating returns from a large number of stock portfolios
[ ] Very large matrix inversion
[ ] (something else)

Answer: The answer is that all but matrix inversion are good candidates to solve with MapReduce. The reason matrix inversion is not, is that matrix manipulation tends to require holding the entire contents of both matrices in memory at once, rather than processing individual portions. You can do it with MapReduce, but it turns out to be quite difficult.