
12/1/2015

A survey of open source tools for machine learning with big data in the Hadoop ecosystem - Springer

Survey paper
Journal of Big Data
December 2015, 2:24
First online: 05 November 2015

A survey of open source tools for machine learning with big data in the Hadoop ecosystem
Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, Tawfiq Hasanin
DOI: 10.1186/s40537-015-0032-1

Abstract
With an ever-increasing amount of options, the task of selecting machine
learning tools for big data can be difficult. The available tools have advantages
and drawbacks, and many have overlapping uses. The world's data is growing
rapidly, and traditional tools for machine learning are becoming insufficient as
we move towards distributed and real-time processing. This paper is intended
to aid the researcher or professional who understands machine learning but is
inexperienced with big data. In order to evaluate tools, one should have a
thorough understanding of what to look for. To that end, this paper provides a
list of criteria for making selections along with an analysis of the advantages and
drawbacks of each. We do this by starting from the beginning, and looking at
what exactly the term "big data" means. From there, we go on to the Hadoop
ecosystem for a look at many of the projects that are part of a typical machine
learning architecture and an understanding of how everything might fit
together. We discuss the advantages and disadvantages of three different
processing paradigms along with a comparison of engines that implement them,
including MapReduce, Spark, Flink, Storm, and H2O. We then look at machine

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

learning libraries and frameworks including Mahout, MLlib, SAMOA, and
evaluate them based on criteria such as scalability, ease of use, and
extensibility. There is no single toolkit that truly embodies a one-size-fits-all
solution, so this paper aims to help make decisions smoother by providing as
much information as possible and quantifying what the tradeoffs will be.
Additionally, throughout this paper, we review recent research in the field
using these tools and talk about possible future directions for toolkit-based
learning.

Keywords
Machine learning, Big data, Hadoop, Mahout, MLlib, SAMOA, H2O,
Spark, Flink, Storm

Background
As the price of data storage has gone down and high performance computers
have become more widely accessible, we have seen an expansion of machine
learning (ML) into a host of industries including finance, law enforcement,
entertainment, commerce, and healthcare. As theoretical research is leveraged
into practical tasks, machine learning tools are increasingly seen as not just
useful, but integral to many business operations.
The goal of machine learning is to enable a system to learn from the past or
present and use that knowledge to make predictions or decisions regarding
unknown future events. In the most general terms, the workflow for a
supervised machine learning task consists of three phases: build the model,
evaluate and tune the model, and then put the model into production. An
example of this workflow is in Fig. 1.

Fig. 1
Supervised machine learning workflow
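
The three phases of the supervised workflow (build, evaluate and tune, put into production) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the survey: the toy data, the nearest-neighbour model, and every function name below are hypothetical, and "production" is reduced to serializing the model and scoring new input.

```python
import json
import random


def build_model(train, k):
    """Phase 1: build a k-nearest-neighbour model (it simply stores the data)."""
    return {"k": k, "points": train}


def predict(model, x):
    """Label x by majority vote among its k nearest stored points."""
    nearest = sorted(model["points"], key=lambda p: abs(p[0] - x))[: model["k"]]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def evaluate(model, holdout):
    """Phase 2a: accuracy on data the model has not seen."""
    hits = sum(predict(model, x) == y for x, y in holdout)
    return hits / len(holdout)


def tune(train, holdout, candidates=(1, 3, 5)):
    """Phase 2b: keep the hyperparameter with the best holdout accuracy."""
    return max(candidates, key=lambda k: evaluate(build_model(train, k), holdout))


def make_data(n, seed):
    """Hypothetical toy data: one feature, label 1 when the feature exceeds 0.5."""
    rng = random.Random(seed)
    return [(x, int(x > 0.5)) for x in (rng.random() for _ in range(n))]


train, holdout = make_data(200, seed=0), make_data(50, seed=1)
best_k = tune(train, holdout)
model = build_model(train, best_k)

# Phase 3: "production" here means a serialized model that scores new input.
deployed = json.loads(json.dumps(model))
```

On big data each of these phases becomes a distributed job, which is where the processing engines and libraries reviewed in this paper come in.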
At the heart of machine learning is the data that powers the models, and the
new era of Big Data is catapulting machine learning to the forefront of research
and industry applications. The meaning of the term "big data" is still the subject
of some disagreement, but it generally refers to data that is too big or too
complex to process on a single machine. We live in an age where data is growing
orders of magnitude faster than ever before. According to International Data
Corporation's annual Digital Universe study [1], the amount of data on our
planet is set to reach 44 zettabytes (4.4 × 10^22 bytes) by 2020, which would be
ten times larger than it was in 2013. While no single entity is working with data
at this magnitude, many industries are still generating data too large to be
processed efficiently using traditional techniques. Ancestry.com, for example,
stores billions of records totaling about 10 petabytes of data [2]. With such a
growth rate in data production, the challenge faced by the machine learning
community is how best to process and learn from big data efficiently. Popular
machine learning toolkits such as R [3] or Weka [4] were not built for these
kinds of workloads. Although Weka has distributed implementations of some
algorithms available, it is not on the same level as tools that were initially
designed and built for terabyte scale. Hadoop [5], a popular framework for
working with big data, helps to solve this scalability problem by offering
distributed storage and processing solutions. While Hadoop is just a framework
for processing data, it provides a very extensible platform that allows for many
machine learning projects and applications. The focus of this paper is to present
those tools.
The proliferation of big data has forced us to rethink not just data processing
frameworks, but implementations of machine learning algorithms as well.
Choosing the appropriate tools for a particular task or environment can be
daunting for two reasons. First, the increasing complexity of machine learning
project requirements as well as of the data itself may require different types of
solutions. Second, often developers will find the selection of tools available to
be unsatisfactory, but instead of contributing to existing open source projects,
they begin one of their own. This has led to a great deal of fragmentation among
existing big data platforms. Both of these issues can contribute to the difficulty
of building a learning environment, as many options have overlapping use
cases, but diverge in important areas. Because there is no single tool or
framework that covers all or even the majority of common tasks, one must
consider the tradeoffs that exist between usability, performance, and
algorithm selection when examining different solutions. There is a lack of
comprehensive research on many of them, despite being widely employed on
an enterprise level, and there is no current industry standard.
The goal of this paper is to facilitate these decisions by providing a
comprehensive review of the current state of the art in open source scalable
tools for machine learning. Recommendations are offered for criteria with
which to evaluate the various options, and comparisons are provided between
various open source data processing engines as well as ML libraries and
frameworks. This paper presumes that the reader has a basic knowledge of
machine learning concepts and workflows. It is intended for people who have
experience with machine learning and want information on the different tools
available for learning from big data. The paper will be useful to anyone
interested in big data and machine learning, whether a researcher, engineer,
scientist, or software product manager.
The remainder of this paper will be organized as follows: The section titled
"Understanding big data" provides background on the problems that may arise
when working with big data, and the "Hadoop ecosystem" section serves as an
explanation and overview of the Hadoop ecosystem with a focus on tools that
can help solve big data problems. The "Data processing engines" section
examines different data processing paradigms and outlines criteria for
evaluation. "Machine learning toolkits" discusses criteria for evaluation of
machine learning tools and libraries, and "Evaluation of machine learning
tools" provides an in-depth analysis of specific frameworks that can be used
with the processing platforms. The "Suggestions for future work" section
contains a discussion of key elements missing among the major toolkits and the
final section presents conclusions from this survey.

Understanding big data
The term "big data" has become a buzzword and as such, it is often overused
and misunderstood. While the frameworks we discuss in this paper are able to
effectively process data of varying sizes and complexities, they were designed
with very large data in mind and may not be the best choice for certain smaller
projects. For this reason, the first step in choosing between big data frameworks
is to determine if they are needed. In order to do this, it is important to have an
understanding of what constitutes big data. This section provides definitions of
big data and discusses the challenges associated with it.
There is no universally agreed upon definition of big data, but the more widely
accepted explanations tend to describe it in terms of the challenges it presents.
This is sometimes referred to as the "big data problem". In 2001, Laney [6]
described three dimensions of data management challenges. This
characterization, which addresses volume, velocity, and variety, is frequently
documented in scientific literature. These three dimensions (commonly
referred to as the 3 Vs) can be understood as follows:
Volume is the most obvious of the three, referring to the size of the
data. The massive volumes of data that we are currently dealing with
have required scientists to rethink storage and processing paradigms
in order to develop the tools needed to properly analyze it.
Velocity addresses the speed at which data can be received as well
as analyzed. In the "Data processing engines" section, we discuss
the differences between batch processing, which works on historical
data, and stream processing, which analyzes the data in real time as
it is generated. This also refers to the rate of change of data, which is
especially relevant in the area of stream processing.
Variety refers to the issue of disparate and incompatible data
formats. Data can come in from many different sources and take on
many different forms, and just preparing it for analysis takes a
significant amount of time and effort.
In the years since Laney's paper was published, numerous people have
proposed additions to this list and many refer to four or five Vs, adding in
Value or Veracity [7]. However, we are skeptical that these additions add to an
overall understanding of big data, so we focus our discussion here on the
original three.
In 1997, Cox and Ellsworth [8] were among the first authors in scientific
literature to discuss big data in the context of modern computing. Their work
focused on data visualization, but their observations about the big data
problem can easily be extrapolated to general data analytics and machine
learning. The big data problem, according to them, consists of two distinct
issues:
Big data collections are aggregates of multiple datasets that are
individually manageable, but as a group are too large to fit on disk.
The datasets in these collections typically come from different
sources, are in disparate formats, and are stored in separate
physical sites and in different types of repositories.
Big data objects are individual datasets that by themselves are too
large to be processed by standard algorithms on available hardware.
Unlike collections, they typically come from a single source.
Today, the problem of big data collections is often solved through distributed
storage systems, which are designed to carefully control access and
management in a fault-tolerant manner. One solution for the problem of big
data objects in machine learning is through parallelization of algorithms. This is
typically accomplished in one of two ways [9]: data parallelism, in which the
data is divided into more manageable pieces and each subset is computed
simultaneously, or task parallelism, in which the algorithm is divided into steps
that can be performed concurrently.
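
The difference between the two forms of parallelism can be made concrete with a small Python sketch. It is an illustration only: the function names are hypothetical, the input is assumed to be a non-empty list, and a local thread pool stands in for the nodes of a cluster (CPython threads show the structure of the computation rather than a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into contiguous pieces of near-equal size."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def data_parallel_mean(data, n_workers=4):
    """Data parallelism: the same step (a partial sum) runs on every subset."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: (sum(c), len(c)), chunked(data, n_workers)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def task_parallel_summary(data):
    """Task parallelism: different, independent steps run concurrently."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        lowest = pool.submit(min, data)
        highest = pool.submit(max, data)
        count = pool.submit(len, data)
        return {"min": lowest.result(), "max": highest.result(), "count": count.result()}
```

In the data-parallel case the partial results must be combinable (here, sums and counts), which is the property that decides whether a learning algorithm parallelizes easily in this style.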
It is not uncommon to encounter big collections of big objects as data grows
and becomes more widely available. This, coupled with unprecedented access
to computing power through more affordable high performance machines as
well as cloud services, is opening up many new opportunities for machine
learning research. Many of these new directions utilize increasingly complex
workflows which require systems built using a combination of state-of-the-art
tools and techniques. One option for such a system is to use projects from the
Hadoop Ecosystem. The remainder of this paper provides detailed information
about these projects and discusses how they can be utilized together to build an
architecture capable of efficiently learning from data of this magnitude.

Hadoop ecosystem
Many people consider the terms Hadoop and MapReduce to be
interchangeable, but this is not entirely accurate. Hadoop was initially
introduced in 2007 as an open source implementation of the MapReduce
processing engine linked with a distributed file system [10], but it has since
evolved into a vast web of projects related to every step of a big data workflow,
including data collection, storage, processing, and much more. The number of
projects that have been developed to either complement or replace these
original elements has made the current definition of Hadoop unclear. For this
reason, we often hear reference to the Hadoop Ecosystem instead, which
encompasses these related projects and products. To fully understand Hadoop,
one must look at both the project itself and the ecosystem that surrounds it.
The Hadoop project itself currently consists of four modules [10]:
Hadoop distributed file system (HDFS): A file system designed to
store large amounts of data across multiple nodes of commodity
hardware. HDFS has a master-slave architecture made up of data
nodes which each store blocks of the data, retrieve data on demand,
and report back to the name node with inventory. The name node
keeps records of this inventory (references to file locations and
metadata) and directs traffic to the data nodes upon client requests.
This system has built-in fault tolerance, typically keeping three or
more copies of each data block in case of disk failure. Additionally,
there are controls in case of name node failure as well, in which a
system will either have a secondary name node, or will write
backups of metadata to multiple file systems.
MapReduce: Data processing engine. A MapReduce job consists of
two parts, a map phase, which takes raw data and organizes it into
key/value pairs, and a reduce phase which processes data in
parallel. A detailed discussion of this processing approach can be
found in the following section.
YARN (Yet Another Resource Negotiator) [11]: Prior to the addition
of YARN to the Hadoop project in version 2.0, Hadoop and
MapReduce were tightly coupled, with MapReduce responsible for
both cluster resource management and data processing. YARN has
now taken over the resource management duties, allowing a
separation between that infrastructure and the programming model.
With YARN, if an application wants to run, its client has to request
the launch of an application master process from the resource
manager, which then finds a node manager. The node manager then
launches a container which executes the application process. For
any readers who are familiar with previous versions of Hadoop, the
job tracker responsibilities from MapReduce are now handled by
YARN, split between the resource manager, application master, and
timeline server (which stores application history), while the old task
tracker responsibilities are handled by the node managers. This
change has improved upon many of the deficiencies present in the
old MapReduce. YARN is able to run on larger clusters, more than
doubling the number of jobs and tasks it can handle before running
into bottlenecks [10]. Finally, YARN allows for a more generalized
Hadoop which makes MapReduce just one type of YARN
application. This means it can be left out altogether in favor of a
different processing engine.
Common [12]: A set of common utilities needed by the other Hadoop
modules. It has native shared libraries that include Java
implementations for compression codecs, I/O utilities, and error
detection. Also included are interfaces and tools for configuration of
rack awareness, authorization of proxy users, authentication,
service-level authorization, data confidentiality, and the Hadoop
Key Management Server (KMS).
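
The division of labour between the name node and the data nodes described under HDFS can be illustrated with a toy model. The sketch below is not HDFS code and does not use any Hadoop API: the class `ToyNameNode`, its methods, the round-robin placement policy, and the 4-byte block size are all assumptions made for illustration. It shows only the idea that the name node holds metadata while replicated blocks live on the data nodes.

```python
import itertools


class ToyNameNode:
    """Holds metadata only: which data nodes store which block of which file."""

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("need at least as many data nodes as replicas")
        self.replication = replication
        self.nodes = {name: {} for name in data_nodes}  # node -> {block_id: bytes}
        self.inventory = {}                             # file -> [(block_id, [nodes])]
        self._next_node = itertools.cycle(data_nodes)   # assumed placement policy

    def write(self, filename, payload, block_size=4):
        """Split the payload into blocks and store each block on several nodes."""
        blocks = []
        for start in range(0, len(payload), block_size):
            block_id = f"{filename}#{start // block_size}"
            targets = []
            while len(targets) < self.replication:
                node = next(self._next_node)
                if node not in targets:
                    targets.append(node)
            for node in targets:
                self.nodes[node][block_id] = payload[start:start + block_size]
            blocks.append((block_id, targets))
        self.inventory[filename] = blocks

    def read(self, filename, failed=()):
        """Reassemble a file, skipping data nodes that are marked as failed."""
        out = b""
        for block_id, holders in self.inventory[filename]:
            alive = [node for node in holders if node not in failed]
            if not alive:
                raise IOError(f"all replicas of {block_id} are lost")
            out += self.nodes[alive[0]][block_id]
        return out
```

With three replicas spread over four nodes, a read still succeeds after any two nodes fail, which is the property the replication factor is meant to buy.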
The Hadoop ecosystem is made up of a vast array of projects built on top of and
around the core modules described above. These projects have been designed
to aid researchers and practitioners in all aspects of a typical data analysis or
machine learning workflow. Several companies such as Cloudera [13],
Hortonworks [14], and MapR [15] offer distributions of Hadoop which bundle a
number of these projects. Free and enterprise versions of the software bundles
are available.
The general structure of the ecosystem can be described in terms of three
layers: storage, processing, and management. While the primary focus of this
paper is on tools that reside in the processing layer, it is important to
understand the context of how they can be used in a workflow by looking at the
makeup of the ecosystem as a whole. An example of how tools for different tasks
may fit together as part of an analytical stack is shown in Fig. 2. The specific
projects listed inside this diagram and discussed in this section are examples of
commonly used tools, but since the ecosystem is made up of well over 100
projects, it is not meant to be a comprehensive list. Readers who wish to learn
more about the tools not discussed in this paper are encouraged to refer to the
Hadoop website or [10] for more information.


Fig. 2
The Hadoop ecosystem

Storage layer
The storage layer resides at the lowest level of this stack, and by default it
includes the HDFS described previously. There are also a variety of other
options for distributed data storage which either run on top of the HDFS or
work as standalone systems. The HDFS is not a database, but a file storage
system designed for a specific purpose, and it doesn't include all of the
functionality that some of the other data storage solutions have. HDFS is known
for its scalability and fault tolerance, and is a good option for historical data
that does not need to be edited or accessed frequently, but there are several
limitations that may impact Hadoop users, in particular those for whom fast
random reads or writes are a priority. Tools do exist to support SQL queries,
and they will be discussed in the next subsection. HDFS operates on a
write-once, read-many paradigm, so if changes are needed on even a single data
point, the entire file must be rewritten. For these reasons, many choose to add
one or more storage solutions to their architecture.
Non-relational databases, collectively referred to as NoSQL (Not only SQL), can
be suitable for machine learning tasks, because they support nested,
semi-structured, and unstructured data. Databases in this category typically use
one of four basic types of data models and the choice of database will ultimately
depend on the data being stored as well as the demands of the project for which
it is being used. The four types of databases include:

1. Key-value stores: This is the simplest of the four models,
implemented as what is essentially a large hash table. Each data item
has a unique key pointing to it. They are fast and highly scalable.
Some examples of databases built on this model include Voldemort
[16] and Redis [17].
2. Document stores: These can be thought of as nested key-value
stores, where a key points to a collection of key-value stores rather
than simply a value. Examples include CouchDB [18] and MongoDB
[19].
3. Column-oriented: Data is stored in columns rather than the
typical row/column structure. Columns are grouped into column
families. HBase [20] and Cassandra [21] are both examples of
column-oriented data stores.
4. Graph-based models: Designed for data that can be represented
as a graph and can be used for tasks such as network analysis. They
are more flexible than the other models, with no tables or rows.
Examples include Titan [22], Neo4J [23], and OrientDB [24].
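
The four data models listed above can be contrasted by writing the same small record in each of them. The sketch uses plain Python structures as stand-ins; it does not use the API of any of the databases named, and the record contents and helper names are hypothetical.

```python
import json

# 1. Key-value store: an opaque value behind a unique key.
key_value = {"user:42": b'{"name": "Ada", "city": "London"}'}

# 2. Document store: the value is itself a nested structure that can be queried.
documents = {
    "user:42": {
        "name": "Ada",
        "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    }
}

# 3. Column-oriented: row key -> column family -> column -> value; rows may be sparse.
column_families = {
    "user:42": {"profile": {"name": "Ada", "city": "London"}, "stats": {"logins": 17}},
    "user:43": {"profile": {"name": "Bob"}},
}

# 4. Graph: labeled edges between nodes, with no tables or rows.
edges = [("ada", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "cy"), ("ada", "BOUGHT", "A1")]


def total_quantity(document):
    """Document model: reach into the nested order list."""
    return sum(order["qty"] for order in document["orders"])


def column(family, name):
    """Column model: read one column across all rows that define it."""
    return {
        row: families[family][name]
        for row, families in column_families.items()
        if name in families.get(family, {})
    }


def neighbors(node, label):
    """Graph model: follow edges of one type out of a node."""
    return [dst for src, lab, dst in edges if src == node and lab == label]
```

The key-value form is the fastest to look up but tells the store nothing about the value, while the other three trade some of that simplicity for the ability to query inside the data.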

Processing layer
The processing layer is where the actual analysis takes place. The foundation of
this layer is YARN, which allows one or more processing engines to run on a
Hadoop cluster. Processing engines will be discussed in detail in the next
section. In addition to the processing engines, this layer includes a number of
different tools that can be used for machine learning and data analysis. ML
libraries and frameworks will be discussed in the "Machine learning toolkits"
and "Evaluation of machine learning tools" sections.
In addition to processing frameworks and libraries, this layer includes tools for
data movement and interaction. Examples of this are data integration tools
such as Flume [25], Kafka [26], and Sqoop [27]. Flume handles collection,
aggregation, and movement of log data into HDFS. Kafka is a distributed
publish-subscribe messaging system on top of HDFS, and Sqoop transfers bulk
data between the HDFS and relational databases. On the interaction side, we
find query engines such as Hive [28] and Drill [29]. Hive queries data stored in
the HDFS and NoSQL databases using HiveQL, an extension of ANSI SQL which
is similar to MySQL. Metadata for tables and partitions is kept in the Hive
Metastore. Drill performs queries using ANSI SQL and supports self-describing
data, in which schema is discovered dynamically on read, eliminating the need
for data transformation, which is a time-consuming process. It also offers
plugins for configuration with Hive, allowing for use of the metastore and any
user-defined functions that were previously built.
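
The idea of discovering a schema on read, rather than declaring it before loading, can be shown with a short sketch over self-describing JSON records. This is only an illustration of the concept and says nothing about how Drill implements it; the sample records and both function names are hypothetical.

```python
import json

# Hypothetical self-describing input: each record carries its own field names.
RAW_LINES = [
    '{"id": 1, "name": "Ada", "score": 9.5}',
    '{"id": 2, "name": "Bob"}',
    '{"id": 3, "tags": ["x", "y"]}',
]


def discover_schema(lines):
    """Infer field names and observed value types while reading the records."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def select(lines, fields):
    """Project the requested fields; a record missing a field yields None for it."""
    return [tuple(record.get(f) for f in fields) for record in map(json.loads, lines)]
```

No table definition or transformation step precedes the query: records with different shapes are accepted as they are, and missing fields simply come back empty.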
One of the main drawbacks to using MapReduce is that many algorithms do not
translate easily into this pattern [30]. Cascading [31] and Pig [32] aim to
address this by offering high-level abstractions which hide some of the
complexity inherent to MapReduce jobs, thereby simplifying the programming
process. Pig offers an execution framework and a data flow scripting language
called Pig Latin. It supports user-defined functions written in
Python, Java, JavaScript, and Ruby, which are then translated to MapReduce
jobs. In addition to not requiring the user to think in terms of map and reduce,
it offers multi-query execution to batch statements, significantly cutting down
on the amount of code the programmer has to write. It can be used for machine
learning tasks, most notably by Twitter, whose engineers note that it
allows learning tasks to be executed and tuned using only a few lines of code
[33]. Pig runs on MapReduce and Tez [34], which is an abstraction of
MapReduce that represents data flow in the form of a directed acyclic graph.
A similar project, Cascading, offers an Application Programming Interface
(API) that abstracts the traditional keys and values into tuples with field names
and offers a number of operations on the tuples that help developers build
complex applications more easily and in less time. Its developers have also
announced that upcoming releases will offer support for Spark, Storm, and Tez.
Cascading primarily supports programming in Java, but also offers APIs for
ANSI SQL, Predictive Model Markup Language (PMML), Scala, Clojure, JRuby,
and Python. It also supports easy integration of a large number of different
data sources.

Management layer
The management layer includes tools for user interaction and high-level
organization. These include scheduling, monitoring, coordination, and user
interface. Oozie [35], a workflow scheduler, manages jobs for many of the tools
in the processing layer, including processing engines, Pig, Sqoop, and Hive,
among others. For complex workflows which require multiple jobs and tools, it
specifies a sequence of actions and coordinates between them to complete the
tasks. It also facilitates scheduling of jobs which need to run at regular
intervals.
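
What it means to specify a sequence of actions and coordinate between them can be sketched as a dependency graph that is run in a valid order. The example is a simplification for illustration: it is not Oozie's workflow format or API, the action names are hypothetical, and it relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "train_model": {"clean_with_pig"},
    "load_hive_table": {"clean_with_pig"},
    "publish_report": {"train_model", "load_hive_table"},
}


def run_workflow(workflow, actions):
    """Run every action only after all of its dependencies have finished."""
    completed = []
    for name in TopologicalSorter(workflow).static_order():
        actions[name]()
        completed.append(name)
    return completed


# Stand-in actions that just record that they ran.
log = []
ACTIONS = {name: (lambda n=name: log.append(n)) for name in WORKFLOW}
order = run_workflow(WORKFLOW, ACTIONS)
```

A real scheduler adds what this sketch leaves out, such as retries, time-based triggers, and running independent actions (here the training and Hive-loading steps) in parallel.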
Zookeeper [36] is a service for coordination and synchronization of distributed
systems. It provides tools to handle coordination of data and protocols and is
able to handle partial network failures, which are commonplace in distributed
systems. It includes APIs for Java and C, and also has bindings for Perl, Python,
and REST clients.
Hue [37], a web interface for Hadoop projects, supports many of the more
widely used components of the Hadoop ecosystem. It features file browsers for
HDFS and HBase and a job browser for MapReduce/YARN. It can be used to
manage interactions with Hive, Pig, Sqoop, Zookeeper, and Oozie, and in
addition also offers tools for data visualization. It is compatible with any
version of Hadoop and is available in all of the major Hadoop distributions.

Machine learning without Hadoop
While Hadoop is ubiquitous as a big data framework, there are a number of
other open source options for machine learning that do not use it at all. MOA
(Massive Online Analysis) [38] is a project related to Weka, which offers online
stream analysis on a number of Weka algorithms and with the same user
interface.¹ MADlib is a collection of SQL-based algorithms designed to run at
scale within the database rather than porting data between multiple runtime
environments. It includes clustering, classification, regression, and topic
models as well as tools for validation [39]. Dato, formerly GraphLab, is a
standalone product that can be connected with Hadoop for graph analysis and
ML tasks. It was fully open source, but in late 2014 it transitioned into a
commercial product. Its C++ processing engine Dato Core [40] has been
released to the community on GitHub along with its interprocess
communication library (for translating between C++ and Python) and graph
analytics implementations. Its machine learning libraries are unavailable
outside of the enterprise packages. Distributed processing on Hadoop enables
large-scale learning, and the goal of this paper is to profile tools that can do
exactly that. Non-distributed tools for machine learning are widely available,
and are thus more mature for use in projects that do not handle Big Data. Using
Hadoop for smaller-scale workloads would not be advised, as there is overhead
to distributed processing, and there are fewer algorithm and implementation
choices. This paper aims to profile tools that can effectively handle Big Data;
therefore, projects that do not run on Hadoop are outside the scope of this
paper and will not be discussed in further detail.

Data processing engines
When MapReduce was introduced in 2004 by Google engineers [41], it had
some early critics [42], but was considered by many to be revolutionary.
Regardless of the differing opinions on the value of this idea, it paved the road
for Hadoop, which has played a significant role in ushering in the big data era.
In more recent years, MapReduce has begun to fall out of favor, particularly in
the machine learning community, due to its high overhead costs, lack of speed,
and the fact that many machine learning tasks do not easily fit into the
MapReduce paradigm. In 2014, Google announced that MapReduce was being
phased out in favor of other projects [43]. Since MapReduce has been decoupled
from Hadoop through YARN, it is now a lot easier to work with a new engine on
an existing cluster, and over the course of the past few years, a number of
projects have been introduced that attempt to solve the issues inherent in
MapReduce.
The processing models used for many of these may be categorized as either
batch or streaming. A third model, known as bulk synchronous parallel (BSP),
is used for iterative graphing tasks, but will not be discussed in detail in this
paper. While graph algorithms are related to ML, they are used more for
traditional analytics and the focus of this paper is on other types of learning
tasks. Examples of tools which employ the BSP model are Apache Giraph [44]
and Apache Hama [45]. It should be noted though that Giraph has not had a
commit since mid-2013. Both projects are open source implementations of
Google's Pregel [46].
The remainder of this section will discuss some of the more widely used
projects which leverage the batch and streaming paradigms. A high-level
overview of these projects is in Table 1. In addition to the underlying
processing approach used, here are several important considerations for
evaluation of these tools:

Table 1
Data processing engines for Hadoop

Engine      Current stable release   Execution model    Supported languages      Associated ML tools
            (as of June 1, 2015)
MapReduce   2.7.0                    Batch              Java                     Mahout
Spark       1.3.1                    Batch, streaming   Java, Python, R, Scala   MLlib, Mahout, H2O
Flink       0.8.1                    Batch, streaming   Java, Scala              FlinkML, SAMOA
Storm       0.9.4                    Streaming          Any                      SAMOA
H2O         3.0.0.12                 Batch              Java, Python, R, Scala   H2O, Mahout, MLlib

[The table also has an "In-memory processing" column; its entries are not recoverable in this copy.]

1. Latency: This refers to the amount of time between starting a job
and getting initial results. Speed may not be important for every
project. If a project is not time-sensitive, a batch system may be
preferred for its simplicity, but for projects that require real-time
or near-real-time results, a streaming platform would be advised.
2. Throughput: Throughput measures the amount of work done
over a given time period. This can be thought of as a measure of
efficiency.
3. Fault tolerance: All of the platforms discussed in this paper are
fault tolerant but the methods which they use to achieve that may
vary. We look at the mechanisms that are in place to detect failures,
as well as how the platform is able to recover after such a failure
occurs.
4. Usability: Despite the interfaces and abstractions discussed in the
previous section and libraries for machine learning that will be
discussed later in this paper, the typical user will spend a good deal of
time interacting with the engine itself. With this in mind we ask, how
difficult is it to install and configure? What interface language(s) does
it use? How difficult is it to program for?
5. Resource expense: In this paper, we consider expense mostly in
terms of time involved from setting up a cluster to deploying the
model and maintaining it after the fact. Most of this is covered in
usability. While we don't examine financial costs in this paper, they
are important to consider as well. This will depend on whether the
user has access to a high performance computing cluster. Purchasing
the necessary equipment is not trivial, and the resources needed by
processors can vary, so one decision may affect the other.
Alternatively, clusters can be set up on cloud services such as
Amazon EC2 [47] or Microsoft Azure [48], which charge on-demand
prices based on compute time and storage space used.
6. Scalability: All of the processing engines discussed in this paper
were designed to be scalable, but the different methods employed
have varying degrees of success. For this reason, it is important to
examine whether there are bottlenecks when data input or cluster
sizes grow. Additionally, these engines were designed for very large
data, but many real-world use cases involve at least some processing
of smaller datasets. We look at how these are handled as well.
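
The first two criteria, latency and throughput, are easy to confuse, so a small worked calculation may help. The cost model below is a deliberately crude assumption made for illustration (a fixed startup cost for a batch job, a fixed per-record overhead for a stream); the function names and the numbers in the usage note are hypothetical and are not measurements of any engine.

```python
def batch_metrics(n_records, per_record_s, startup_s):
    """Batch: no result is available until the whole job has finished."""
    total_s = startup_s + n_records * per_record_s
    return {"latency_s": total_s, "throughput_rps": n_records / total_s}


def stream_metrics(n_records, per_record_s, per_record_overhead_s):
    """Streaming: the first result appears once a single record is processed."""
    cost_s = per_record_s + per_record_overhead_s
    return {"latency_s": cost_s, "throughput_rps": n_records / (n_records * cost_s)}
```

For one million records at 10 microseconds each, a batch job with a 30 second startup has a latency of 40 seconds and a throughput of 25,000 records per second. A stream that adds 40 microseconds of overhead per record answers after 50 microseconds but sustains roughly 20,000 records per second. Under these assumed costs the stream wins on latency and the batch job on throughput, which is the tradeoff the two criteria are meant to expose.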

The approaches to processing differ in terms of throughput and resource
expense, and there are additional platform-dependent features that should also
be used for evaluation of these projects. To provide a comprehensive
comparison, fault tolerance methods, scalability, efficiency, interface language,
and usability are covered below.

MapReduce
The MapReduce approach to machine learning performs batch learning, in
which the training data set is read in its entirety to build a learning model. The
biggest drawback to this batch model is a lack of efficiency in terms of speed and
computational resources. In a typical batch-oriented workflow, the set of
training data is read from the HDFS to the mapper as a set of key-value pairs.
The output, a list of keys and their associated values, is written to disk. In a
classification task, for example, the initial key-value pair might be a filename
and a list of instances, and the intermediate output from the mapper would be a
list of each instance with its associated class. This intermediate data is then read
into one or more reducers to train a model based on this list. The final model is
then once again written to disk. This process is illustrated in Fig. 3a.
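
The batch workflow just described can be traced in a single-machine sketch. It is a toy under stated assumptions, not Hadoop code: the local file system stands in for HDFS, the "model" is a per-class centroid chosen only because it reduces cleanly by key, and the names `mapper`, `reducer`, and `run_job` are hypothetical. The point is the two trips to disk, one for the intermediate pairs and one for the final model.

```python
import json
import os
import tempfile
from collections import defaultdict


def mapper(filename, instances):
    """Map: (filename, instances) -> one (class_label, features) pair per instance."""
    for features, label in instances:
        yield label, features


def reducer(label, feature_rows):
    """Reduce: all feature rows of one class -> that class's centroid."""
    n = len(feature_rows)
    return label, [sum(column) / n for column in zip(*feature_rows)]


def run_job(inputs, workdir):
    # Map phase: the intermediate key-value pairs are written to disk.
    intermediate_path = os.path.join(workdir, "intermediate.jsonl")
    with open(intermediate_path, "w") as out:
        for filename, instances in inputs.items():
            for label, features in mapper(filename, instances):
                out.write(json.dumps([label, features]) + "\n")

    # Shuffle: the intermediate data is read back and grouped by key.
    groups = defaultdict(list)
    with open(intermediate_path) as src:
        for line in src:
            label, features = json.loads(line)
            groups[label].append(features)

    # Reduce phase: one reducer call per key, then the model is written to disk.
    model = dict(reducer(label, rows) for label, rows in groups.items())
    model_path = os.path.join(workdir, "model.json")
    with open(model_path, "w") as out:
        json.dump(model, out)
    return model_path


# Hypothetical input: two "files", each a list of (features, class) instances.
inputs = {
    "part-0": [([0.0, 0.0], "neg"), ([4.0, 4.0], "pos")],
    "part-1": [([2.0, 0.0], "neg"), ([6.0, 4.0], "pos")],
}
with tempfile.TemporaryDirectory() as workdir:
    with open(run_job(inputs, workdir)) as f:
        model = json.load(f)
```

Even in this tiny example every record is serialized, written, and parsed again between the two phases, which is the I/O cost the following paragraph attributes to the batch model.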
These frequent I/O operations can become very expensive in terms of time,
computational resources, and network bandwidth. Any model parameters that
need to be tuned after the initial evaluation stage further add to the costs. These
issues become more apparent in cases where it is necessary to update models
with changing data, which is often the case in real-world ML production
environments. While this approach may be suitable for certain projects such as
analyzing past events, it becomes problematic when data evolves, as the full
process must be repeated each time a model requires updating. Data must be in
its final form before beginning a MapReduce job, as the mechanism does not
have the ability to wait for new data to be generated. MapReduce is compatible
with the Mahout library for ML, and the programming interfaces discussed in
the previous section can be used as well. We mentioned Twitter, who built their
analytics stack around Pig and perform ML tasks using Pig's user-defined
functions [33]. There is also a framework called Conjecture [49], which was
developed by Etsy.com engineers for parallelized online learning in Scalding, a
Scala wrapper for Cascading.
Thefaulttolerancemechanismemploy edby MapReduceisachiev edthrough

data replication, which can affect scalability by increasing the size of data even
further. The need for data replication has been found to be responsible for 90%
of the running time of machine learning tasks in MapReduce [50] and is perhaps
the biggest impediment to fast data processing. Another deficiency of
MapReduce is that it does not easily allow for iterative processing, making it
unsuitable for many machine learning projects. While it is possible to achieve
iterative computation in MapReduce, this must be programmed manually
through multiple MapReduce jobs requiring careful orchestration of execution
[51]. This process is complicated and unable to address any of the previously
discussed issues relating to computational resources.
HaLoop [51], developed at the University of Washington, was an early project
aimed at addressing these concerns via a programming interface which handles
loop control and task scheduling. However, it lacks ongoing development, and
is only compatible with older versions of Hadoop [52].

Spark
Spark [53], which was initially developed at the University of California,
Berkeley [54] and is now an Apache top-level project, is based on MapReduce
but addresses a number of the deficiencies described above. Like HaLoop, it
supports iterative computation and it improves on speed and resource issues
by utilizing in-memory computation. Spark's approach to processing has seen
widespread adoption in both research and industry. The main abstractions used
in this project are called Resilient Distributed Datasets (RDD), which store data
in memory and provide fault tolerance without replication [50]. RDDs can be
understood as read-only distributed shared memory [55]. This model,
illustrated in Fig. 3b, streamlines the learning process through in-memory
caching of intermediate results, significantly cutting down on the number of
read and write operations necessary.
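
To make the idea of in-memory caching with fault tolerance but no replication
more concrete, the following is a minimal single-process sketch. ToyRDD and its
methods are hypothetical names invented for this illustration and are not
Spark's API. The sketch assumes that a lost in-memory copy is rebuilt by
replaying the recorded transformations (the lineage) from the source data.

```python
class ToyRDD:
    """A read-only dataset described by its lineage (parent + function)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn
        self._cache = None       # in-memory copy, filled by cache()
        self.computations = 0    # how many times this node was computed

    def map(self, f):
        # Transformations are lazy: they only record lineage.
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._cache = self.collect()
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)      # served from memory, no recompute
        self.computations += 1
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

    def lose_memory(self):
        """Simulate a node failure that drops the cached copy."""
        self._cache = None

base = ToyRDD(source=range(10))
squares = base.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
first = evens.collect()       # reuses the cached squares
squares.lose_memory()
second = evens.collect()      # squares is rebuilt from its lineage
```

The second collect returns the same result as the first even though the cached
intermediate data was dropped; the cost is one extra recomputation instead of a
permanently stored replica.
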

Fig. 3
Comparison of processing models for various processing engines: a
MapReduce, b Spark & Flink, c Storm, d H2O, and e H2O with Sparkling Water
The RDD API was extended in 2015 to include DataFrames, which allow users to
group a distributed collection of data by column, similar to a table in a
relational database. They can be thought of as RDDs with schema [56]. For
example, an RDD of key-value pairs can be converted into a DataFrame which is
represented as a table with one column each for key and value. For users
familiar with R or Python, the implementations are similar. DataFrames can be
created from an existing RDD, Hive table, HDFS or a number of other data
sources.
Spark's speed was demonstrated in October 2014, when it won the Daytona
GraySort Benchmark Contest [57]. The previous record was held by
Hadoop/MapReduce, for sorting 102.5 TB on 2100 nodes in 72 min. Spark
sorted 100 TB on 206 nodes in only 23 min, three times faster with one tenth
the number of machines. It was then used to sort a petabyte in 234 min on 190
nodes (though this wasn't an official part of the contest and was not posted with
the winners) [58]. Additionally, it has been noted that Spark is easier to
program [50, 59], in part because it can be coded
in Java, R, Python, or Scala. For machine learning tasks, Spark ships with the
MLlib [60] and GraphX [61] libraries and the latest version of the Mahout [62]
library offers a number of Spark implementations as well.
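
The sort benchmark figures quoted above can be put on a common footing with a
short calculation. The helper name tb_per_node_minute is ours, and the
per-node normalization is an illustrative assumption that ignores hardware
differences between the two clusters.

```python
def tb_per_node_minute(terabytes, nodes, minutes):
    """Data sorted per node per minute of wall-clock time."""
    return terabytes / (nodes * minutes)

# Figures as reported in the text for the two record runs.
hadoop = tb_per_node_minute(102.5, 2100, 72)
spark = tb_per_node_minute(100.0, 206, 23)

time_ratio = 72 / 23              # wall-clock speedup
node_ratio = 2100 / 206           # reduction in cluster size
per_node_ratio = spark / hadoop   # combined effect, normalized by data size
```

The first two ratios correspond to the "three times faster" and "one tenth the
number of machines" statements; the third combines them, and comes out at
roughly a thirtyfold difference in data sorted per node per minute.
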
In [59], Spark's performance was tested against three other machine learning
platforms, SimSQL, GraphLab, and Giraph. They were run through an extensive
set of tests in which they each trained five complex models on clusters of
increasing size. The study compared running times on each platform for each
cluster size, as well as how much code was necessary for each implementation.
The results of the experiments varied, but generally showed Spark to be slower
than the graphing implementations but faster than SimSQL. Though
implementation was slower, it required far less code than the graphing
platforms on all experiments. Additionally, for Spark, they examined the
running time in both Java and Python for comparison. While they found the
Python implementation to be easy to use and the code to be clean and succinct,
they noted that it was significantly slower than Java on most tests. The
exception to this was one problem with 100-dimensional linear algebra, in
which Java was eight times slower than Python, which was presumably an issue
with Java, rather than with Spark's runtime. The authors noted that Spark
required a good deal of tuning and experimentation to get large or complicated
problems working, and they were unable to figure this out for all problems,
causing failures on several tests. They could not agree on the reason for these
problems, but one theory put forth attributed these failures to heavy reliance
on techniques like lazy evaluation for speed and job scheduling. It should be
noted, however, that Spark version 0.7.3 was used in these experiments and
many improvements have been made to the platform (which is now in version
1.3.1) since then. Many of Spark's issues can be reasonably attributed to the fact
that it is still young. There is a large team of contributors working on it all the
time, so issues are often resolved even before studies are published.
Other concerns about Spark's approach deal with the distribution of data across
nodes. Data transfers take place throughout the network, and because of the job
isolation mechanism present, only one driver can serve requests to all of its
RDDs, potentially leading to a bottleneck within the network when there are
multiple requests to multiple nodes [52, 63, 64]. However, a 2015 study by
Ousterhout et al. [65] used blocked time analysis to identify performance
bottlenecks in Spark and found that improving network performance had only
a minimal effect on job completion time, while the real bottlenecks were
actually occurring on the CPU rather than I/O as previously thought.
While the iterative batch approach to data processing improves on many of the
deficiencies of the MapReduce paradigm, it still does not offer the ability to
process data in real time. Online data processing may be useful for projects
such as clickstream analysis or event detection. Spark offers Spark Streaming,
which uses micro-batching, a technique that may be thought of as a simulation
of real-time processing. In this approach, an incoming stream is packaged into
sequences of small chunks of data, which can then be processed by a batch
system [66]. While this may be adequate for many projects, it is not a true
real-time system. It is noted in [67] that this approach makes load balancing easier
and is more robust to node failures. Additionally, the authors mention that
while this model is slower than true streaming, the latency can be minimized
enough for most real-world projects. Spark also offers integration of its
streaming and batch options for more powerful interactive applications.
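
A minimal sketch of micro-batching follows. It is plain Python rather than
Spark Streaming code; the function names, the fixed time windows, and the toy
event stream are assumptions made for illustration, and events are assumed to
arrive in timestamp order.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of
    `interval` seconds; each yielded list is one small batch."""
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            window_end = ts + interval
        while ts >= window_end:      # close finished (possibly empty) windows
            yield batch
            batch, window_end = [], window_end + interval
        batch.append(value)
    if batch:
        yield batch

def batch_job(batch):
    """Any ordinary batch computation; here, a count of events."""
    return len(batch)

# Hypothetical stream of (seconds, payload) pairs.
stream = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (1.2, "d"), (3.5, "e")]
counts = [batch_job(b) for b in micro_batches(stream, interval=1.0)]
```

An event is only processed once its window closes, so the window length sets a
lower bound on latency; this is the sense in which micro-batching simulates,
rather than provides, real-time processing.
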

Storm
Storm [68] is used for processing data in real time and was initially conceived
to overcome deficiencies of other processors in collecting and analyzing social
media streams [69]. Development on Storm began at BackType, a social media
analytics company, and continued at Twitter after a 2011 acquisition. The
project was open-sourced and became an Apache top-level project in
September 2014 [70]. The machine learning community has been placing
growing importance on real-time processing [71], and as a result, Storm is
seeing increased adoption both in production and in research environments.
The Storm architecture consists of spouts and bolts. A spout is the input stream
(e.g. Twitter streaming API), while bolts contain most of the computation logic,
processing data in the form of tuples from either the spout or other bolts.
Networks of spouts and bolts, which are represented as directed graphs, are
known as topologies. An example of this is in Fig. 3c. The project is primarily
implemented in Clojure, but initially used Java for all APIs to encourage more
widespread adoption. It now includes Thrift [72], a framework for
cross-language development, which allows topologies to be defined and submitted
using any programming language [73]. Storm uses real-time streaming, but also
offers micro-batch via its Trident API.
Fault tolerance is achieved by way of the topology: spouts will keep messages in
their output queues until the bolts acknowledge them. Messages will continue
to be sent out until they are acknowledged, at which time they will be dropped
out of the queue. A master node, known as Nimbus because it runs the Nimbus
daemon, tracks the heartbeats of worker nodes. If a worker node dies, then
Nimbus will reassign the workers to another node. Nimbus also handles the
responsibility of assigning tasks to workers, similar to the JobTracker in
MapReduce. The biggest difference is that if the JobTracker dies, all running
jobs are lost, but if Nimbus dies, it is automatically restarted [74].
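
The acknowledgement scheme described above can be sketched as follows. This
is not Storm's API: ToySpout, FlakyBolt, and run are hypothetical names, the
failure is simulated, and the sketch leaves out timeouts, multi-bolt tuple
trees, and Nimbus entirely.

```python
from collections import OrderedDict

class ToySpout:
    """Keeps every emitted message pending until it is acknowledged."""

    def __init__(self, messages):
        self.pending = OrderedDict(enumerate(messages))  # msg_id -> payload

    def emit_pending(self):
        # Re-send everything that has not been acknowledged yet.
        return list(self.pending.items())

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # acknowledged: drop from the queue

class FlakyBolt:
    """Processes tuples but fails on the first attempt for chosen ids."""

    def __init__(self, fail_once):
        self.fail_once = set(fail_once)
        self.processed = []

    def process(self, msg_id, payload):
        if msg_id in self.fail_once:
            self.fail_once.discard(msg_id)   # simulate a lost tuple
            return False
        self.processed.append(payload.upper())
        return True

def run(spout, bolt, max_rounds=10):
    rounds = 0
    while spout.pending and rounds < max_rounds:
        for msg_id, payload in spout.emit_pending():
            if bolt.process(msg_id, payload):
                spout.ack(msg_id)
        rounds += 1
    return rounds

spout = ToySpout(["x", "y", "z"])
bolt = FlakyBolt(fail_once=[1])
rounds = run(spout, bolt)
```

The message that failed on its first attempt is replayed in the next round and
processed then, so nothing is lost, although the output order changes.
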
Storm was built as a standalone system independent from Hadoop, but since
Hadoop moved to YARN, work has been done to integrate the two projects.
Hortonworks added Storm to their Hadoop distribution beginning in version
2.1 and Yahoo! is working on an integration as well [75]. The principal
developer of Storm, Nathan Marz, coined the term Lambda Architecture in
[76], describing a generalized approach to combine multiple paradigms into
one system by breaking down processing into three layers: batch, serving, and
speed. The batch layer stores the master dataset and computes views which are
sent to the serving layer for indexing and keeping track of the most current
results. The speed layer looks at new data only, as it arrives, and makes updates
in real time. New data is sent to both the batch layer and the speed layer for
computation and results from each are merged when the system is queried. In
terms of the processing engines we have discussed so far, the lambda
architecture can be seen as a way to quickly run jobs on MapReduce and Storm
simultaneously and combine the results. This unifies the processing of both
real-time and historical data.
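
As an illustration of how the three layers interact, the sketch below keeps a
running count per key. The class ToyLambda and its methods are hypothetical; a
real deployment would place a batch engine and a stream engine behind the same
roles, and the choice to clear the speed view after each batch run is a
simplifying assumption.

```python
from collections import Counter

class ToyLambda:
    """Batch, serving, and speed layers for a running count per key."""

    def __init__(self):
        self.master = []             # batch layer: append-only master dataset
        self.batch_view = Counter()  # serving layer: last precomputed view
        self.speed_view = Counter()  # speed layer: data since last batch run

    def ingest(self, key):
        # New data goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Recompute the view from the full master dataset, then discard
        # the real-time increments that the new view now covers.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Results from both layers are merged at query time.
        return self.batch_view[key] + self.speed_view[key]

system = ToyLambda()
for k in ["click", "click", "view"]:
    system.ingest(k)
before = system.query("click")   # served by the speed layer only
system.run_batch()
system.ingest("click")
after = system.query("click")    # batch view plus one real-time update
```
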
Storm does not ship with a machine learning library, but SAMOA, a platform for
mining big data streams, currently has implementations for classification and
clustering algorithms running on Storm. H2O [77] has also offered a way to link
the two projects [78]. Others have created their own implementations of
various learning algorithms. For example, Wasson and Sales [79] describe the
implementation of a particle learning algorithm built on Storm. Trident-ML
[80] offers a library of learning algorithms built on Storm, but has not been
updated since early 2014.

Flink
Flink [81] was developed at the Technical University of Berlin under the name
Stratosphere [82]. It graduated from the Apache incubation stage in January 2015
and is now a top-level project. It offers capability for both batch and stream
processing, thus allowing for the implementation of a Lambda Architecture as
described above. It is a scalable, in-memory option that has APIs for both Java
and Scala. It has its own runtime, rather than being built on top of MapReduce.
As such, it can be integrated with HDFS and YARN, or run completely
independent from the Hadoop ecosystem. Flink's processing model applies
transformations to parallel data collections [83, 84]. Such transformations
generalize map and reduce functions, as well as functions such as join, group,
and iterate. Also included is a cost-based optimizer which automatically selects
the best execution strategy for each job. Flink is also fully compatible with
MapReduce, meaning it can run legacy code with no modifications [81].
Like Spark, Flink also offers iterative batch as well as streaming options, though
its streaming API is based on individual events, rather than the micro-batch
approach that Spark uses. This is the same model that Storm uses for true
real-time processing. Connectors are offered which allow for processing data
streams from Kafka, RabbitMQ (a platform-independent messaging system),
Flume, Twitter, and user-defined data sources.
The project is still in its infancy but machine learning tools are in development.
FlinkML [85], a machine learning library, was introduced in April 2015.
Additionally, an adapter is available for the SAMOA library, which offers
learning algorithms for stream processing. Ni [55] performed a comprehensive
comparison of the Flink and Spark platforms and examined differences from a
theoretical perspective as well as a practical one. In general, Spark was found to
be superior in the areas of fault tolerance and handling of iterative algorithms,
while Flink's advantages were the presence of optimization mechanisms and
better integration with other projects. In terms of practical execution, Flink
used more resources but was able to finish jobs in less time. Flink has
undergone major changes since this study was published and more updated
comparisons are needed. The Flink team published benchmark results using
Grep and PageRank, and Flink's execution was significantly faster than that of
Spark [86], but independent tests are needed to verify these claims.

H2O
H2O is an open source framework that provides a parallel processing engine,
analytics, math, and machine learning libraries, along with data preprocessing
and evaluation tools. Additionally, it offers a web-based user interface, making
learning tasks more accessible to analysts and statisticians who may not have
strong programming backgrounds. For those who wish to tweak the
implementations, it offers support for Java, R, Python, and Scala. In addition to
its native processing engine, which is illustrated in Fig. 3d, its developers have
also released a project called Sparkling Water, shown in Fig. 3e, which integrates
Spark and Spark Streaming into their platform. This is only supported in version
3.0. Additional efforts have been made towards integration with Storm for
real-time streaming. H2O's engine processes data completely in memory using
multiple execution methods, depending on what is best for the algorithm used.
The general approach used is Distributed Fork/Join, a divide-and-conquer
technique, which is reliable and suitable for massively parallel tasks. This is a
method which breaks up a job into smaller jobs which run in parallel, resulting
in dynamic fine-grain load balancing for MapReduce jobs as well as graphs and
streams. Its developers claim it to be the fastest execution engine, but as of the
time of this writing, no academic studies have been published which verify or
refute these claims and further research is needed in this area.
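
The general fork/join pattern, in which a job is split into smaller jobs that
run in parallel and whose partial results are then combined, can be sketched on
a single machine as follows. This is not H2O code: the function names are
ours, a thread pool stands in for a cluster, and the example task (a mean) is
an assumption chosen for brevity. Threads in CPython illustrate the structure
only and do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Fork: cut the job into roughly equal, independent pieces."""
    size = max(1, -(-len(data) // n_chunks))   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_stats(chunk):
    """Work done on one piece: (count, sum) of the values."""
    return len(chunk), sum(chunk)

def join(a, b):
    """Join: combine two partial results into one."""
    return a[0] + b[0], a[1] + b[1]

def parallel_mean(data, n_chunks=4):
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(partial_stats, split(data, n_chunks)))
    total = (0, 0)
    for part in parts:
        total = join(total, part)
    count, s = total
    return s / count

mean = parallel_mean(list(range(1, 101)))
```

Because each partial result is a small (count, sum) pair, the join step is
cheap regardless of how many pieces the job was split into.
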

Machine learning toolkits
To perform machine learning tasks in Hadoop, one does not need a special
platform or library. A person with programming skills may interact directly
with any of the above platforms to roll their own code and many choose to go
this route. A variety of machine learning toolkits have been created to facilitate
the learning process but many researchers and practitioners reject them for
various reasons, most often because they lack needed features or are difficult to
integrate into an existing environment. One issue is that machine learning is a
broad field of study and many of the available toolkits lack important
functionality. Another problem is that without true expertise in the areas of
programming and system architecture, many people lack a full understanding
of what the various platforms are capable of. This is exacerbated by the fact that
there has been little comprehensive research into many popular frameworks.
Research moves much slower than development, so often by the time
information becomes abundant, the community has already moved on to
different tools. However, distributed learning algorithms are not trivial to
implement, and those who do not wish to reinvent the wheel may find that they
can save themselves significant effort by using or extending existing
implementations. Table 2 provides an overview of four of the more
comprehensive machine learning packages that run on Hadoop. Only
distributed algorithms are listed in this table. Mahout includes several
implementations that are not distributed and therefore not included, but they
are discussed in the "Evaluation of machine learning tools" section.
Table 2
Overview of machine learning toolkits

Toolkit | Mahout | MLlib | H2O | SAMOA
Interface language | Java | Java, Python, Scala | Java, Python, R, Scala | Java
Associated platform | MapReduce, Spark (H2O and Flink in progress) | Spark, H2O | H2O, Spark, MapReduce | Storm, S4, Samza
Current version (as of June 1, 2015) | 0.10.1 | 1.3.1 | 3.0.0.12 | 0.2.0

Remaining rows, by group (the per-toolkit availability marks are not preserved in this copy):
Graphical user interface
Classification and regression algorithms: Decision tree; Logistic regression; Naïve Bayes; Support vector machine; Gradient boosted trees; Random forest; Adaptive model rules; Generalized linear model; Linear regression
Clustering algorithms: k-Means; Fuzzy k-means; Streaming k-means; Power iteration; Spectral clustering; CluStream
Collaborative filtering (CF) algorithms: User-based CF; Item-based CF; Alternating least squares
Dimensionality reduction and feature selection tools: Principal component analysis; QR decomposition; Singular value decomposition; Chi-squared
Additional algorithms: Association rule learning; Deep learning; Topic modeling

a Real-time streaming implementation
b Single machine, trained using stochastic gradient descent

Selection criteria
In a research or production environment, the choice of machine learning
packages or specific algorithms will come down to a variety of different factors,
mostly dependent on the needs of the specific group or project. A number of
authors have tackled this subject, including [87–89]. Based in part on these
studies, we offer a list of important considerations for evaluation of machine
learning tools. These are presented in no particular order, since the
prioritization of these factors will be dependent on particular use cases.
Scalability: This should be considered with regard to both the size
and complexity of the data. One should consider what their data
looks like now, as well as what data they might be working with in
the future, in order to determine if a particular toolkit will be
appropriate. Scalability should be looked at in both directions, as
some of the best tools for big data perform poorly on small data, and
vice versa. This is also true for other data characteristics, such as
dimensionality.
Speed: The biggest factor affecting speed is which processing
platform the library or algorithm is running on rather than the
library or algorithm itself. However, some libraries are tied to
specific platforms, so this is still an important consideration when
selecting ML tools. As noted in the previous section, speed may not
be important for every project. If models do not require frequent
updating, a batch system may be preferred for its simplicity, but for
models that are updated often, this may be a crucial concern.
Coverage: This refers to the range of options contained in the toolkit
in terms of different classes of machine learning as well as variety of
implementations in each class. None of the available tools for big
data provide a selection as comprehensive as some nondistributed
frameworks such as Weka, but their scope may range from only a
few algorithms to around two dozen. As many of the tools are
difficult to set up and learn, it is important to consider future needs
as well as current.
Usability: In this case, one must weigh their project goals against the
skills and expertise of their group. Usability may be considered in
terms of initial setup, ongoing maintenance, programming
languages available, user interface available, amount of
documentation, or availability of a knowledgeable user community.
If there is an existing analytics workflow, one should consider how
well the tool can be integrated into this.
Extensibility: Machine learning tasks are rarely one-size-fits-all.
Whether it's something as simple as setting the value of k in a
k-means clustering task [90], or building an ensemble of learners [91],
most jobs will require some amount of parameter tuning before a
model is deployed. The implementations included in the various
tools are often used as building blocks towards new platforms or
systems, so it is important to evaluate them in terms of how well
they are able to fulfill this role.

Evaluation of machine learning tools
This section provides an in-depth look at the strengths and weaknesses of the
various machine learning tools for Hadoop. Published literature using related
tools is reviewed here if it is available. For a complete look at how the tools and
engines fit together, Fig. 4 illustrates the relationships between processing
engines, machine learning frameworks, and the algorithms they implement.

Fig. 4
Machine learning frameworks and their associated processing engines and
learning algorithms

Mahout
Mahout is one of the more well-known tools for ML. It is known for having a
wide selection of robust algorithms, but with inefficient runtimes due to the
slow MapReduce engine. In April 2015, Mahout 0.9 was updated to 0.10.0,
marking something of a shift in the project's goals [92]. With this release, the
focus is now on a math environment called Samsara, which includes linear
algebra, statistical operations, and data structures. The goal of the Mahout
Samsara project is to help users build their own distributed algorithms, rather
than simply to provide a library of already-written implementations. The project
still offers a comprehensive suite of algorithms for MapReduce and many have
been optimized for Spark as well. Integrations with H2O and Flink are currently in
development. This version is very new, so there is no published literature on it
at the time of this writing other than the initial announcement by the developer
team introducing the new features. Because most of the old algorithm
implementations are still included, the rest of this section focuses on versions
0.9 and earlier.
Among the more commonly cited complaints about Mahout is that it is difficult
to set up on an existing Hadoop cluster [93–95]. Additionally, while a lot of
documentation exists for Mahout, much of it is outdated and irrelevant to
people using the current version. The lack of up-to-date documentation, a problem
common to many machine learning tools, is partially alleviated by an active
user community willing and able to help with many issues [96, 97]. One
problem with using Mahout in production is that development has moved very
slowly; version 0.10.0 was released nearly seven and a half years after the
project was initially introduced. The number of active committers is very low,
with only a handful of developers making regular commits.
The algorithms included in Mahout focus primarily on classification, clustering
and collaborative filtering, and have been shown to scale well as the size of the
data increases [98]. Additional tools include topic modeling, dimensionality
reduction, text vectorization, similarity measures, a math library, and more.
One of Mahout's most commonly cited assets is its extensibility and many have
achieved good results by building off of the baseline algorithms [87, 99, 100].
However, in order to take advantage of this flexibility, strong proficiency in
Java programming is required [87, 95, 101]. Committer Ted Dunning noted, "It's
not a product. It's not a package. It's not a service. Batteries are not included"
[102]. Some researchers have cited difficulty with configuration or with
integrating it into an existing environment [33, 93–95]. On the other hand, a
number of companies have reported success using Mahout in production.
Notable examples include Mendeley [103], LinkedIn [104], and Overstock.com
[102], who all use its recommendation tools as part of their big data
ecosystems. Overstock even replaced a commercial system with it, saving a
significant amount of money in the process.

Classification
Classification algorithms currently offered in Mahout are Logistic Regression,
Naïve Bayes, Random Forest, Hidden Markov Models, and Multilayer
Perceptron. There has been some additional work towards implementing
support vector machines, but this approach is not currently available. Of these
algorithms, only Naïve Bayes and Random Forest are parallelized. There are no
studies that we are aware of that have utilized Mahout's implementations of
Hidden Markov Models or Multilayer Perceptron. This is most likely due to the
fact that they are not parallelized.
Logistic Regression is trained via Stochastic Gradient Descent (SGD), negating
the need for parallelization [105]. This implementation is favorable due to
speed and robustness in the face of new data, but the lack of parallelization may
be the reason its use seems to be sporadic. In benchmark tests among different
tools [106], Mahout's implementation of Logistic Regression with SGD was left
out of that test entirely. The authors stated that its implementation is too
communication-intensive and to include it in the comparison would not be fair
to Mahout. A different study [107] found Mahout's implementation to be
particularly slow in processing sparse data, and it did not fare much better on
dense data. Peng et al. [108] compared different approaches to Logistic
Regression and noted that Mahout's had poor precision, particularly on
imbalanced data, but that it has good scalability and is an example of a
sequential algorithm that can train massive data in acceptable time. They
recommend Mahout for situations when only a single machine with limited
memory is available.
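
To show why SGD training is inherently sequential and needs little memory, the
following is a minimal sketch of logistic regression trained one example at a
time. It is not Mahout's implementation: the function names, learning rate,
epoch count, and toy data are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid overflow in exp() for extreme inputs.
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(examples, n_features, lr=0.1, epochs=200, seed=0):
    """Sequential SGD: weights are updated after every single example."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y                  # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy, linearly separable data: the label is 1 when the feature is large.
train = [([0.0], 0), ([1.0], 0), ([2.0], 0), ([4.0], 1), ([5.0], 1), ([6.0], 1)]
w, b = train_sgd(train, n_features=1)
```

Only the current example and the weight vector are held at any moment, which is
consistent with the recommendation above for a single machine with limited
memory; each update depends on the previous one, which is what makes the
procedure hard to parallelize.
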
Mahout's implementation of Naïve Bayes is based on [109]. Though this
algorithm's (often inaccurate) assumptions of independence may seem
counterintuitive, it generally performs quite well in real-world situations, but
the performance begins to decline when working with data that is highly
imbalanced or dependent [110]. For this reason, Mahout also includes an
implementation for Complementary Naïve Bayes, which bases the predictions
for class C on the samples belonging to the complement of C (all classes other
than the one we are predicting). Both versions have some overhead for training,
due to their parallel execution, so they are most effective when using very large
data and not recommended for smaller data sets [111]. Due to its simplicity,
reliability, and ease of use, Mahout's implementation of Naïve Bayes is often a
go-to learner for those looking to demonstrate or test other parts of their system. For
example, it was used by [112–114] to test and demonstrate the capability of new
data processing engines.
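
The complement idea can be illustrated with a small text classifier. This is a
simplified sketch and not Mahout's code: the function names, the smoothing
parameter alpha, and the toy documents are assumptions, and refinements such as
weight normalization or TF-IDF transforms are left out.

```python
import math
from collections import Counter, defaultdict

def train_complement_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns per-class log weights
    estimated from the documents that do NOT belong to that class."""
    vocab = {t for tokens, _ in docs for t in tokens}
    per_class = defaultdict(Counter)
    for tokens, label in docs:
        per_class[label].update(tokens)
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    weights = {}
    for label, counts in per_class.items():
        comp = total - counts                  # counts over all other classes
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[label] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights

def predict(weights, tokens):
    # Pick the class whose complement explains the document worst.
    def score(label):
        return sum(weights[label].get(t, 0.0) for t in tokens)
    return min(weights, key=score)

# Hypothetical, deliberately imbalanced training set (one spam, four ham).
docs = [
    (["cheap", "pills", "buy"], "spam"),
    (["meeting", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
    (["lunch", "today"], "ham"),
    (["notes", "project"], "ham"),
]
weights = train_complement_nb(docs)
```

For the minority class, the complement contains far more training data than
the class itself, which is the intuition behind using it on imbalanced data.
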
Random Forest is also popular due to its high accuracy and time efficiency
[115]. In terms of scalability, Racette et al. [116] compared it with R to predict
global conflicts. R crashed when attempting to build more than 250 trees but
they were able to use Mahout to build 10,000 on a single node without
problems. It should be noted that this is an unfair comparison since R is not
built for computation of big data. More research is needed to compare Mahout
with similar tools to itself, such as the ones discussed in this paper. Mahout's
Random Forest implementation has been applied across many different
application domains, but has been used particularly often in healthcare-related
studies. It has been used to predict the severity of motor neuron disease [117],
identify high-risk patients [118], and predict the risk of readmission for
congestive heart failure patients [119]. It has also been used as part of a general
predictive healthcare analytics framework.
Many studies have described building frameworks for practical machine
learning using Mahout. One example is at Honeywell [120], where they built a
cloud computing platform combining HBase, Mahout, other analytics tools, and
a web interface. In it, they utilized Mahout's Random Forest and Naïve Bayes
algorithms to predict events and failures in auxiliary power units before they
cause operational interrupts. This system was able to increase predictive
accuracy for auto-shutdown events by a factor of more than three.

Clustering
Mahout's library includes several variations of the popular k-Means Clustering
algorithm, including the traditional k-Means, Fuzzy k-Means, and Streaming
k-Means. Spectral Clustering is also supported. Esteves et al. [101] looked at the
performance of Mahout's implementation of k-Means on various sizes of input
files and confirmed that Mahout scales very well to larger data sets, but
probably would not be a good choice for smaller ones. Esteves and Rong [121]
compared the speed and quality of the traditional and fuzzy implementations of
this algorithm by clustering Wikipedia articles. Most experiments were
performed 10 times each and they noticed large variations in execution time
for the different runs. On average, fuzzy c-means converged faster with fewer
iterations, but k-means was faster on some runs, showing how dependent these
measures are on the initial seeding of the centroids. This issue of unpredictable
results due to the initial centroid placement has also been noted by [122]. While
they initially expected better results from fuzzy k-means due to the inclusion of
overlaps, their observed results showed the opposite: centroids were too close
together, so the majority of the samples had membership in all clusters. This
was attributed to the high dimensionality of the feature vectors, because when
they normalized the features during preprocessing it failed to show many of
the characteristics that could have better divided the data. In general, k-means
produced much more meaningful results than fuzzy k-means, leading to the
conclusion that fuzzy k-means is not advised for data sets with high levels of
noise.
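
The effect of centroids lying too close together can be seen directly in the
standard fuzzy c-means membership formula. The sketch below uses our own
function name and toy coordinates and takes the common fuzzifier value m = 2 as
an assumption; it is not taken from Mahout or from the cited study.

```python
import math

def memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership of `point` in each cluster:
    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1))."""
    d = [math.dist(point, c) for c in centroids]
    if any(x == 0.0 for x in d):             # point sits exactly on a centroid
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        return [h / sum(hits) for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** p for dk in d) for dj in d]

point = (1.0, 0.0)
well_separated = memberships(point, [(0.0, 0.0), (10.0, 0.0)])
close_together = memberships(point, [(4.9, 0.0), (5.1, 0.0)])
```

With well-separated centroids the point belongs almost entirely to the nearer
cluster, whereas with nearly coincident centroids its membership is split
almost evenly, so the clustering carries little information.
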
To the best of our knowledge, no research on Mahout's streaming k-means
algorithm has been published. However, Dan Filimon, the designer of Mahout's
implementation, presented it at Berlin Buzzwords 2013 [122], and reported his
own results from comparisons of k-means and streaming k-means. He found the
quality of the clusters formed by each to be comparable, but the streaming
implementation was significantly faster. In creating 20 clusters, he found the
streaming implementation to be 2.4 times faster, and after going up to 1000
clusters, it finished 8 times faster. To the best of our knowledge this has not yet
been independently verified.
In a comparison study of different Spectral Clustering algorithms, Gao et al.
[123] noted that Mahout's standard implementation did not scale to data sets
larger than 2^15 instances. However, after some modifications and the addition
of locality-sensitive hashing as a preprocessing step, they were able to achieve
processing times more than an order of magnitude faster than any other
implementation.

Collaborative filtering
Mahout is probably the best-known framework for collaborative filtering tools,
also known as recommendation engines. The selection of tools it offers in this
area is far more robust than what is offered in any of the other toolkits.
Currently, Mahout offers implementations for user-based recommendations,
item-based recommendations, and several variations of matrix
factorization. Mahout also offers a number of tools to compute the similarity
measures of any of the above recommenders. These include Pearson
correlation, Euclidean distance, Cosine similarity, Tanimoto coefficient,
log-likelihood, and others. Evaluation of recommenders is tricky, due to the nature
of the task, so if testing a system using feedback from actual users is not
possible, Mahout provides a hold-out test, in which a portion of the training
data is set aside to use for testing [111].
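
Several of the similarity measures named above, along with a simple hold-out
split, can be written in a few lines. These are textbook definitions in plain
Python with our own function names and toy ratings; they do not use Mahout's
classes, and the unshuffled split is a simplifying assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Cosine similarity of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def tanimoto(items_a, items_b):
    """Overlap of two item sets, ignoring rating values."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def holdout_split(ratings, test_fraction=0.25):
    """Set aside the last part of the data for testing (no shuffling)."""
    cut = int(len(ratings) * (1 - test_fraction))
    return ratings[:cut], ratings[cut:]

alice = [5, 3, 4, 4]
bob = [3, 1, 2, 2]      # same preferences as alice, two stars lower
```

Pearson correlation treats alice and bob as identical in taste because it
removes each user's average rating, while cosine similarity on the raw ratings
does not; which behavior is preferable depends on the application.
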
In a comparison between MLI (an API for distributed machine learning built on
Spark), GraphLab, Mahout, and MATLAB of collaborative filtering with
alternating least squares [106], it was observed that Mahout's recommenders
are fairly easy to set up on an existing cluster but significant tuning is required
to run them effectively on large data sets while ensuring good performance.
Mahout's implementation had significantly more lines of code, was slower on all
data sets, and didn't scale as well as any of the others. There is some
disagreement on scalability though, as "proven scalability" is cited by [124] as
one of Mahout's strengths in this area. However, the same study criticized its
lack of built-in hybrid recommenders, support for groups, or context
awareness.
Some of these issues are likely indicative of a problem with MapReduce rather
than the Mahout code. Iterative algorithms are known not to run well on
Hadoop/MapReduce because of the need to access data and variables from
disks [125]. However, to the best of our knowledge, Mahout's implementation
has not been formally benchmarked on a memory-based platform such as Spark
or H2O, and until this happens, it is difficult to properly judge the scalability of
the actual code. Also not explored in these studies is the fact that Mahout's
recommenders are designed to be highly extensible, so good or bad
performance on a baseline implementation may not be indicative of what the
tools are capable of when tuned properly for a specific application or dataset. A
study designed to compare the baseline algorithms offered out-of-box to
various customized versions found significant improvement in performance
when employing new techniques such as significance weighting and
mean-centered prediction [98]. Mendeley, an application used by students and
researchers to organize and share reference material, leverages Mahout's
collaborative filtering algorithms to help users discover new articles and papers
[103]. When evaluating the item-based recommender, they determined it was
efficient but could be improved upon. By allocating additional mappers and
reducers and using an appropriate partitioner, they were able to reduce
processing time by 63 %.
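The two customizations named above, significance weighting and mean-centered prediction, can be sketched in a few lines. This is a generic textbook formulation and not the code evaluated in [98]; the function names, the cutoff of 50 co-rated items, and the example numbers are assumptions chosen for illustration.

```python
def significance_weight(similarity, n_common, cutoff=50):
    """Shrink a similarity that rests on only a few co-rated items.

    The cutoff value is an assumed default, not one taken from the paper.
    """
    return similarity * min(n_common, cutoff) / cutoff


def predict_mean_centered(user_mean, neighbors):
    """Predict a rating as the user's mean plus weighted neighbor deviations.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples.
    """
    numerator = sum(s * (r - m) for s, r, m in neighbors)
    denominator = sum(abs(s) for s, _, _ in neighbors)
    if denominator == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean
    return user_mean + numerator / denominator


# Hypothetical example: a user with mean rating 3.5 and two neighbors.
prediction = predict_mean_centered(3.5, [(0.9, 5.0, 4.0), (0.3, 2.0, 3.0)])
```

Centering on each neighbor's mean keeps a habitually generous rater from inflating the prediction, and the significance weight keeps a neighbor who shares only a handful of items from dominating it.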
Said and Bellogín [126] stress the importance of having clear guidelines in
comparison of recommender systems to allow for reproducibility and
comparison of results. They note that a true comparison of results from
different recommender systems is difficult due to the many different designs
available, as well as myriad differences in implementation strategies from
modification and tuning. With that in mind, they looked at three popular
frameworks (Mahout, LensKit, and MyMediaLite) and compared
various research papers based on the reproducibility of the results. For each
framework, a controlled evaluation was performed as well as one using the
framework's internal evaluation methods. The results of the controlled
evaluation showed Mahout's performance to be consistent and fast for all
algorithms, but its performance was still slightly slower than LensKit. The
authors note that these results are specific to the datasets used and may differ
when running on datasets with different characteristics. When considering
root-mean-squared error, Mahout was the worst of the three. This study only used
non-distributed implementations from Mahout for fair comparisons, so their
observations should not be extrapolated to the distributed implementations
offered.
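As a reminder of what the error metric in such comparisons measures, the following sketch computes root-mean-squared error on a held-out portion of the ratings. It is a minimal illustration, not the evaluation protocol of [126]; the helper names, the 20 % test fraction, and the fixed seed are assumptions.

```python
import math
import random


def holdout_split(ratings, test_fraction=0.2, seed=0):
    """Shuffle a list of ratings and set a fraction aside for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root-mean-squared error between true and predicted ratings."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Because the split depends on the seed and the fraction, two studies can report different RMSE values for the same algorithm and dataset, which is part of the reproducibility problem the authors describe.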

MLlib
MLlib covers the same range of learning categories as Mahout, and also adds
regression models, which Mahout lacks. They also have algorithms for topic
modeling and frequent pattern mining. Additional tools include dimensionality
reduction, feature extraction and transformation, optimization, and basic
statistics. In general, MLlib's reliance on Spark's iterative batch and streaming
approaches, as well as its use of in-memory computation, enable jobs to run
significantly faster than those using Mahout [88, 127]. However, the fact that it
is tied to Spark may present a problem for those who perform machine learning
on multiple platforms [128].
MLlib is still relatively young compared to Mahout. As such, there is not
currently an abundance of published case studies that have used this library,
and there is very little research providing meaningful evaluation. The research
that has been published indicates it is considered to be a relatively easy library
to set up and run [129], helped in large part by the fact that it ships as part of its
processing engine, thus avoiding some of the configuration issues people have
reported with Mahout. Zheng and Dagnino [127] note that the underlying
optimization primitives make it easy to extend existing algorithms or write new
parallel implementations. There have been questions raised about performance
and reliability of their algorithms and more research needs to be done in this
area [127]. Efficiency issues have also been noted due to slow convergence
requiring a large number of iterations as well as high communication costs
[130, 131]. Other studies [129, 131, 132] have discussed problems in how the
implemented tools handle less-than-ideal data such as very low or very high
dimensional vectors.
The documentation is thorough, but the user community is not nearly as active
as the community developing for it. This issue is expected to improve as more
people are migrating from MapReduce to Spark. The large and active group of
developers means that many complaints are fixed before they are even
published. Notable examples of companies that use MLlib in production are
OpenTable and Spotify [133], both for their recommendation engines.

Algorithms
For most of its lifetime, much of the effort that has gone into this project has
focused on developing the Spark platform rather than expanding libraries. The
focus has been on developing a few widely understood algorithms well rather
than the grab-bag approach taken by Mahout. Recently, however, they have
upped their efforts in this area and expanded their offerings. One would expect
them to stay on this track given the increasingly widespread adoption of the
Spark platform. For classification, they have Support Vector Machines, Logistic
Regression, Naïve Bayes, Decision Trees, Random Forest, and Gradient-Boosted
Trees. Clustering algorithms include k-Means, Gaussian Mixture, and Power
Iteration Clustering. They offer implementations for Linear Regression and
Isotonic Regression, and one collaborative filtering algorithm using Alternating
Least Squares.
For online learning, streaming versions of Logistic Regression, Linear
Regression, and k-Means Clustering are included. For all other algorithms,
models can be learned offline using historic data and applied online to new
streaming data. MLlib includes APIs for development in Scala, Java and
Python, but not every tool is available in all languages.
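The difference between learning offline and updating a model as data arrives can be illustrated with a sequential k-means update. The sketch below is a simplified single-machine variant written for illustration; it is not MLlib's streaming k-Means (which, as we understand it, updates on mini-batches), and the class and method names are assumptions.

```python
class OnlineKMeans:
    """Sequential k-means: each arriving point nudges its nearest center."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]
        self.counts = [0] * len(self.centers)

    def predict(self, point):
        """Index of the center closest to the point (squared distance)."""
        distances = [
            sum((a - b) ** 2 for a, b in zip(point, center))
            for center in self.centers
        ]
        return distances.index(min(distances))

    def update(self, point):
        """Assign the point, then move that center toward it."""
        k = self.predict(point)
        self.counts[k] += 1
        step = 1.0 / self.counts[k]  # running-mean step size
        self.centers[k] = [
            c + step * (a - c) for a, c in zip(point, self.centers[k])
        ]
        return k


# Hypothetical stream of 2-D points processed one at a time.
model = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
for point in [[1.0, 1.0], [9.0, 9.0], [3.0, 3.0], [11.0, 11.0]]:
    model.update(point)
```

A model trained offline would instead call `predict` only, which corresponds to the "learn on historic data, apply online" mode described above.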

ML pipelines
As we have discussed throughout this paper, building machine learning
pipelines can be a difficult task, particularly when working with a combination
of disparate tools. Spark ML, a set of uniform APIs for creation and tuning of
pipelines, was introduced in version 1.2 to address these issues, making it easier
to combine multiple algorithms into one workflow. This package includes tools
for dataset transformations and combining algorithms. It works by
representing a pipeline as a sequence of dataset transformations. An easy
example of this is to think of a learner which transforms a DataFrame with
features into one with predictions. This package is designed to handle all steps
of the learning process, starting with importing data from a source, to
extracting features, and training and evaluating models.
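The idea of a pipeline as a sequence of dataset transformations can be sketched without Spark. In the plain-Python illustration below a dataset is a list of dictionaries, a transformer adds a column, and an estimator's `fit` returns a model that is itself a transformer. All class names here are hypothetical stand-ins and are not the Spark ML API.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Fitted model: a transformer that adds a 'prediction' column."""

    def __init__(self, counts):
        self.counts = counts

    def transform(self, rows):
        out = []
        for r in rows:
            votes = {}
            for word in r["words"]:
                for label, c in self.counts.get(word, {}).items():
                    votes[label] = votes.get(label, 0) + c
            prediction = max(votes, key=votes.get) if votes else None
            out.append(dict(r, prediction=prediction))
        return out


class KeywordClassifier:
    """Estimator: counts how often each word occurs with each label."""

    def fit(self, rows):
        counts = {}
        for r in rows:
            for word in r["words"]:
                per_label = counts.setdefault(word, {})
                per_label[r["label"]] = per_label.get(r["label"], 0) + 1
        return KeywordModel(counts)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "spark is fast", "label": "positive"},
    {"text": "hadoop is slow", "label": "negative"},
]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(train)
result = model.transform([{"text": "Spark feels fast"}])
```

The fitted pipeline can be applied to new data with a single call, so the feature extraction used at training time cannot drift from the one used at prediction time.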

MLbase
Though it is not currently available, there has been ongoing research and
development at Berkeley's AMPLab on a platform called MLbase, which wraps
MLlib, Spark, and other projects to make machine learning on datasets of all
sizes accessible to a broader range of users [134–136]. In addition to MLlib and
Spark, the other core components are MLI, an API for feature extraction and
algorithm development, and ML Optimizer, which automates the tuning of
hyperparameters.
A new component called TuPAQ (Training-supported Predictive Analytic Query
Planner) [137] was recently introduced, which builds on the initial idea of ML
Optimizer. TuPAQ serves as a query interface that allows a user to input
high-level queries in a declarative language and then selects for them the best model
and parameters. One of the goals in the development of MLbase was to make
machine learning accessible for the non-expert. TuPAQ is an important step
toward this goal as it pushes hyperparameter tuning, feature selection and
algorithm selection down into the system. Another stated goal was to also make
this system valuable for the experts. In this architecture, they are able to work
directly with MLI and MLlib to develop their own implementations or tune
existing ones, effectively skipping the PAQ step. They have also recently
introduced a new tool, GHOSTFACE, which aims to automatically perform
model selection for the user [138].

H2O
Out of all of the tools discussed in this paper, H2O is the only one that can be
considered a product, rather than a project. While they offer an enterprise
edition with two tiers of support, nearly all of their offerings are available open
source as well and can be used without the purchase of a license. The most
notable features of this product are that it provides a graphical user interface
(GUI), and numerous tools for deep neural networks. Deep learning has shown
enormous promise for many areas of machine learning, making it an important
feature of H2O [139]. There is another company offering open source
implementations for deep learning, Deeplearning4j [140], but it is targeted
towards business instead of research, whereas H2O targets both. Additionally,
Deeplearning4j's singular focus is on deep learning, so it doesn't offer any of the
other types of ML tools that are in H2O's library. There are also other options
for tools with a GUI, such as Weka, KNIME [141], or RapidMiner [142], but none
of them offer a comprehensive open source machine learning toolkit that is
suitable for big data.
Programming in H2O is possible with Java, Python, R and Scala. Users without
programming expertise can still utilize this tool via the web-based UI. Because
H2O comes as a package with many of the configurations already tuned, setup
is easy, requiring less of a learning curve than most other free options. While
H2O maintains their own processing engine, they also offer integrations that
allow use of their models on Spark and Storm.
As of May 2015, the machine learning tools offered cover a range of tasks,
including classification, clustering, generalized linear models, statistical
analysis, ensembles, optimization tools, data preprocessing options and deep
neural networks [143]. On their roadmap for future implementation are
additional algorithms and tools from these categories as well as
recommendation and time-series. Additionally, they offer seamless integration
with R and RStudio, as well as Sparkling Water for integration with Spark and
MLlib. An integration with Mahout is currently in the works as well. They offer
thorough documentation and their staff is very communicative, quickly
answering questions in their user group and around the web.
To the best of our knowledge, only a single paper has been published so far that
used this product in a study [144]. Their work found H2O to be an effective and
reliable tool for analyzing sensor data and predicting missing measurements.
This study looked at Gradient Boosted Model (GBM) and Generalized Linear
Model (GLM) and found them to be comparable when used for classification,
though GBM produced better accuracy when used for regression. More
independent research is needed to properly evaluate the speed, performance,
and reliability of this tool. One notable example of a company using H2O in
production is ShareThis, who uses predictive modeling to maximize advertising
ROI (return on investment).

SAMOA
SAMOA, a platform for machine learning from streaming data, was originally
developed at Yahoo! Labs in Barcelona in 2013 and has been part of the Apache
incubator since late 2014. Its name stands for Scalable Advanced Massive
Online Analysis. It is a flexible framework that can be run locally or on one of a
few stream processing engines, including Storm, S4, and Samza. This is done
through a minimal API designed for a general distributed stream processing
engine which allows users to easily write bindings to port SAMOA to new stream
processors [145].
Though they currently offer far fewer algorithms, they like to call themselves
"Mahout for streaming." SAMOA's algorithms are represented as directed
graphs, referred to as topologies (borrowing terminology from Storm). The
algorithms implemented so far can be used for classification, clustering,
regression, and frequent pattern mining, along with boosting and bagging for
ensemble creation. Additionally, there is a common platform provided for their
implementations, as well as a framework for the user to write their own
distributed streaming algorithms. It does not yet have an active community,
but it offers thorough documentation.
This platform is meant for users with very big data that is constantly being
updated. Streaming models are for projects aimed at finding out what is
happening right now, and feedback occurs in real-time. SAMOA was designed
with the goals of flexibility in updating the library (developing new
implementations as well as reusing existing ones from other frameworks),
scalability in its handling of an increasing amount of data, and extensibility in
terms of the APIs described above [146]. Internal tests have resulted in high
speed and accuracy. Severien [147] evaluated the flexibility of the platform by
implementing CluStream, a stream clustering algorithm, using SAMOA's API for
creating data flow topologies. One of the benefits to this API is that it allows the
user to implement new algorithms that will automatically be able to run on any
processing engine that is able to plug into SAMOA. While more research is
needed to fully evaluate these tools, results from this study indicated that it is a
flexible platform with the ability to scale up to larger workloads and shows good
potential for use in a large-scale distributed environment. This implementation
is now part of SAMOA's library. Romsaiyud [148] implemented a topic
modeling algorithm and compared experimental results using both SAMOA and
MOA. Results indicated significantly higher throughput on SAMOA and showed
the framework to be robust and stable.
So far, there are only a few learning tools implemented in SAMOA, but they
cover many of the common ML tasks. For clustering, they now offer
CluStream, and for classification there is the Vertical Hoeffding Tree, which
utilizes vertical parallelism on top of the Very Fast Decision Tree, or Hoeffding
Tree. This is the standard decision tree algorithm for streaming classification
tasks. Regression can be accomplished through the Adaptive Model Rules
Regressor, which includes implementations for both vertical and horizontal
parallelism. The library also includes Distributed Stream Frequent Itemset
Mining, which is based on PARMA [149]. Prequential Evaluation is available as
well, which enables measurement of model accuracy, either from the start or
based on a sliding window of recent instances. Bagging, Adaptive Bagging, and
Boosting can be used to create ensembles of classifiers. For additional learning
algorithms, there is a plugin available called SAMOA-MOA [150] which allows
the use of MOA classifiers and clustering algorithms inside the SAMOA
platform. However, it is important to keep in mind that this does not change the
underlying implementations of MOA's algorithms, which are not distributed.
SAMOA is a very young project and new tools are continually being developed
to expand the library [151].
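Prequential ("test-then-train") evaluation, mentioned above, can be sketched as follows: each arriving instance is first used to test the current model and only then to update it. The code is a minimal illustration and not SAMOA's implementation; the learner, the function name, and the `window` parameter are assumptions.

```python
from collections import deque


class MajorityClassLearner:
    """Trivial incremental learner: predicts the most frequent label so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, features):
        if not self.counts:
            return None  # nothing seen yet
        return max(self.counts, key=self.counts.get)

    def learn(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1


def prequential_accuracy(stream, learner, window=None):
    """Return the running accuracy after each instance of the stream.

    window=None measures accuracy from the start of the stream; an integer
    restricts it to a sliding window of the most recent instances.
    """
    hits = deque(maxlen=window)
    history = []
    for features, label in stream:
        hits.append(1 if learner.predict(features) == label else 0)  # test
        learner.learn(features, label)                               # then train
        history.append(sum(hits) / len(hits))
    return history


# Hypothetical stream of (features, label) pairs; features are unused here.
stream = [(None, "a"), (None, "a"), (None, "b"), (None, "a")]
from_start = prequential_accuracy(stream, MajorityClassLearner())
windowed = prequential_accuracy(stream, MajorityClassLearner(), window=2)
```

The sliding-window variant forgets old mistakes, which makes it more responsive when the data distribution changes over time.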
There is not a great deal of independent research on this platform, though that
is likely to change as SAMOA becomes more well-known and online learning
becomes more widely used. Rahnama [152] used SAMOA and Storm to perform
sentiment analysis on real-time Twitter streams. To round out the project, he
developed an open source Java library called Sentinel [153] which includes
tools for stream reading, data preprocessing, query response, feature selection
and frequent itemset mining tailored to sentiment analysis tasks on Twitter
data. Analysis was carried out through an ensemble of Vertical Hoeffding Trees
using Adaptive Bagging. Other projects have leveraged SAMOA as part of
frameworks for efficiently finding top-k items [154] and for recognition of
internet traffic patterns [155].

Comparison of major machine learning frameworks
Using the evaluation standards that were discussed in the previous section, we
have assigned a rating to each of the four major frameworks based on the
available literature (see footnote 2). A comparison of the frameworks is shown in Fig. 5. The
figure is meant to be viewed as a comparative ranking of the tools displayed in
the figure, with each tool ranked according to the selection criteria that have
been outlined in this paper. These are relative ratings based on comprehensive
literature and online documentation review, not our own experimental results.
Future work will include quantitative comparisons of these tools based on
formally defined criteria, but for this survey, we provide qualitative rankings
based on our exposure to each tool and related works. While a number of other
tools are available, some of which will be discussed below, there is not yet
enough literature in order to properly evaluate them.

Fig. 5
Comparison of machine learning frameworks
As stated previously, the choice of ML framework will be largely application-
and user-specific. MLlib and H2O are very good options for general needs; each
is fast, scalable to different dataset sizes, and has a fairly diverse selection of
algorithms. MLlib offers a better selection for most areas of machine learning,
but H2O is the only tool that has solutions for deep learning. In terms of
usability, both have APIs for programming in multiple languages, and H2O also
offers a GUI, making it easier to use for those without a high level of
programming expertise.
SAMOA and Mahout (through the new Mahout Samsara) both focus on offering
platforms through which the user can create his or her own implementations of
learning algorithms, so if none of the libraries contain needed algorithms or if
future extensibility is an important factor, one of the two would be advisable.
SAMOA allows the user to create implementations of streaming algorithms,
while Mahout Samsara can be used for batch implementations. SAMOA is the
only one which is designed for true real-time streaming, making it the fastest
and most scalable option.

Additional tools
The learning frameworks discussed in this section have been chosen due to
their widespread use or versatility with respect to implemented tools on a
range of applications. This is by no means a comprehensive list of all open
source learning tools for Hadoop and there are additional frameworks which
show promise for large-scale machine learning tasks. This section provides a
brief overview of several such frameworks that were left out of the main
discussion either because they aren't very versatile or there is a lack of
published research.
FlinkML is a machine learning library currently in development for the Flink
platform. It supports implementations of Logistic Regression, k-Means
Clustering, and Alternating Least Squares (ALS) for recommendation. It also
supports Mahout's DSL (Domain Specific Language) for linear algebra which can
be used for optimization of learning algorithms, and plans are underway to
implement pre- and post-processing tools. For graph processing, they have the
Gelly API which provides methods for vertex-centric iterations.
Weka began including wrappers for distributed processing on Hadoop in
version 3.7 [156]. distributedWekaBase is a package that provides map and
reduce tasks that are not tied to a specific platform, and could potentially be
used to add new platforms in the future. distributedWekaHadoop provides
utilities specific to Hadoop, including loaders and savers for HDFS and
configuration for Hadoop tasks. This is only available for Hadoop version 1.1.2,
which was pre-YARN, meaning it will only run on MapReduce. One reason for
Weka's appeal is the vast number of tools in their library. While their classifiers
and regressors are able to be used on Hadoop, most of them are not able to be
parallelized. These algorithms are essentially trained as ensembles, in which
models are trained individually on smaller data subsets, but instead of being merged into a
final model during the reduce phase, they are combined using voting
techniques. This integration was introduced in 2013, but has not seen
widespread adoption. This may be due to the lack of parallel algorithms. A
similar package for Spark, distributedWekaSpark, was introduced in March
2015. Currently it can only accept input from .csv files.
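The train-on-subsets-then-vote strategy described above can be sketched as follows. This is a single-process illustration and not the distributedWeka code; the base learner (a one-dimensional threshold rule), the function names, and the toy partitions are assumptions.

```python
from collections import Counter


def train_threshold_model(rows):
    """Hypothetical base learner for (x, label) rows with labels 0 and 1.

    Places a threshold midway between the two class means. Each partition
    is assumed to contain at least one example of both classes.
    """
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 >= mean0 else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def train_on_partitions(partitions):
    """'Map' step: one independent model per data subset."""
    return [train_threshold_model(p) for p in partitions]


def vote(models, x):
    """'Reduce' step: keep all models and combine them by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


partitions = [
    [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)],   # threshold 5.0
    [(0.0, 0), (3.0, 0), (7.0, 1), (10.0, 1)],  # threshold 5.0
    [(2.0, 0), (6.0, 0), (7.0, 1), (9.0, 1)],   # threshold 6.0
]
models = train_on_partitions(partitions)
```

No model ever sees the full dataset, which is why this approach works with learners that cannot be parallelized, at the cost of each model being trained on less data.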
Oryx (formerly Myrrix) [157] does not offer a broad selection of algorithms, but
still covers the major areas of classification, clustering, and collaborative
filtering for real-time large-scale ML. Its architecture consists of a computation
layer which builds models and a serving layer which exposes a REST API that
can be accessed from a browser or any tool that is able to make HTTP requests.
It offers implementations for k-Means++, Random Forests, and Matrix
Factorization based on a variant of Alternating Least Squares for Collaborative
Filtering. Oryx implements a lambda architecture so models can be updated
and queried in real time. While documentation is very limited, it is said to be
easy to configure and get running [93]. It currently runs on MapReduce, but
version 2 [158], which is in development, is built on Spark and Kafka.
Vowpal Wabbit [159] is a fast online learner originally developed at Yahoo!
Research Labs and currently sponsored by Microsoft Research. It is designed
for terafeature datasets and offers support for classification, matrix
factorization, topic modeling, and optimization. It is extremely fast and
efficient due to its use of feature hashing. However, documentation is lacking,
and while it is popular among researchers, it has been said to be difficult to
integrate into a production environment [111].
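Feature hashing, credited above for much of Vowpal Wabbit's speed, maps each feature name directly to an index in a fixed-size vector so that no feature dictionary has to be built or stored. The sketch below illustrates the idea only; it uses CRC32 from the Python standard library as a stand-in hash, and the function name and bucket count are assumptions rather than Vowpal Wabbit's actual scheme.

```python
import zlib


def hash_features(tokens, n_buckets=16):
    """Map an arbitrary set of string features into a fixed-length vector."""
    vector = [0.0] * n_buckets
    for token in tokens:
        # CRC32 is an assumed stand-in; any stable hash function would do.
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[index] += 1.0  # colliding features simply share a bucket
    return vector


example = hash_features(["user=42", "country=NZ", "device=mobile"])
```

Memory use is fixed by `n_buckets` regardless of how many distinct features appear in the stream; the trade-off is that unrelated features may collide in the same bucket.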

Suggestions for future work
Thus far, most of the research in the area of machine learning for big data has
focused on processing paradigms and algorithm implementation and
optimization. Largely ignored in the research is the development of tools for
the data itself, specifically for preprocessing techniques. We argue that while
each of the above tools has its advantages and drawbacks, all of them could
be improved with easier-to-use and more efficient tools for dealing with
problems inherent to big data. Some of these issues include:
Mislabeled data: As data grows, the likelihood of having mislabeled
instances grows as well. When dealing with millions of instances, it is
not possible to efficiently check whether all of the training data is
properly labeled, and training models on incorrect data will lead to
lower accuracy [160]. While some big data algorithms include
mechanisms to handle this problem, it may be helpful to have tools
that offer solutions before classification begins.
Missing values: Similar to mislabeled data, missing values also lead
to less robust, inaccurate models, particularly with clustering and
collaborative filtering algorithms which depend on similarity
computations. This issue is generally solved either through
imputation techniques or by removing the example completely
[161].
Noise: Noisy data refers to data that is irrelevant or meaningless.
This can lead to models that suffer from overfitting. Clustering or
similarity measures can help identify noisy data points, but toolkits
lack algorithms which are specifically optimized for this task [162].
High dimensionality: This occurs when the feature-to-instance ratio
is very large and is a common characteristic of big data. Algorithms
for dimensionality reduction, most commonly Principal Component
Analysis (PCA), are included in most toolkits. Feature selection is a
well-known method to handle high dimensionality [163], and PCA is
just one option of many that could be included.
Imbalance: In classification problems, imbalanced training data
(when there are many more instances in one or more classes than in
others) can lead to weak learners. This is typically alleviated by
using data sampling techniques [164].
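Two of the remedies named in the list above, imputation for missing values and data sampling for class imbalance, are simple enough to sketch. The functions below are single-machine illustrations in plain Python, not part of any toolkit discussed here; their names, the use of `None` to mark a missing value, and the fixed seed are assumptions.

```python
import random


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in column]


def random_undersample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


# Hypothetical dataset: eight negative rows and two positive rows.
rows = [(i, "neg") for i in range(8)] + [(i, "pos") for i in range(2)]
balanced = random_undersample(rows, label_of=lambda r: r[1])
```

Both techniques are cheap per record, but at big data scale they require a pass over the full dataset (to compute column means or class counts), which is one reason they benefit from being provided inside the distributed toolkit itself.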
A number of studies have suggested that many of the learning algorithms
implemented in these tools do not stand up well to these kinds of problems
[121, 165, 166]. There are well-used and often simple techniques to combat
these issues. Some of them are implemented in various machine learning
packages, but many are not. And when they are included, they can be difficult
to find or use. Most libraries have tools addressing dimensionality reduction,
but solutions to the other problems listed have not been widely implemented.
There is a huge need for effective tools, particularly in production
environments [167], making this a valuable problem to address. There are a
number of studies which examine solutions to these issues for the platforms
discussed in this paper [99, 168], but none within the context of a machine
learning toolkit. Any of the platforms discussed in this paper could become
much more robust with the addition of some of these tools.

Conclusion
Current trends in technology, such as increased adoption of wearable
computers and other Internet of Things (IoT) devices, are allowing for
unprecedented access to massive amounts of heterogeneous data. Efficient
learning from this data often requires complex architectures that utilize a
combination of tools and techniques for collection, storage, processing, and
analysis [169]. Putting together such an architecture would be extremely
difficult even if there were limited options from which to choose. The open
source data science community is prolific, resulting in many options and many
more decisions to be made.
As machine learning concepts are being increasingly adopted in research and
production settings, the need for tools to facilitate learning tasks is becoming
more important. Many of these tools are very young, and more research is
needed to properly benchmark and evaluate all of the different options. Areas
where this research is insufficient have been noted throughout this paper. We
discussed the Hadoop ecosystem and a number of tools that are a part of it in
order to provide context to how machine learning fits into an analytics
environment. Three major approaches to processing (batch, iterative batch,
and real-time streaming) were described and projects using each of them were
presented and compared. Additionally, a list of criteria for evaluation and
selection of machine learning frameworks was presented along with an in-depth
look at both widely used and up-and-coming projects, with a discussion of their
advantages and drawbacks.
We chose to focus the bulk of our research on processing engines and machine
learning frameworks because those are the two most important types of tools in
an ML pipeline. We chose projects in the Hadoop ecosystem for a number of
reasons. First, they are among the most innovative we have seen. Additionally,
there are few end-to-end services out there for machine learning and Hadoop
projects tend to be designed with the intention of connecting with ones that
already exist in this group. Finally, Apache projects tend to draw large numbers
of active users who are helpful when problems arise.
The choice of tools will largely depend on the applications they are being used
for as well as user preferences. For example, Mahout, MLlib, FlinkML, and
Oryx include options for recommendations, so if the intended application is an
e-commerce site or social network, one may wish to choose from them for
features such as item or user suggestions. Social media or IoT data may require
real-time results, necessitating the use of Storm or Flink along with their
associated ML libraries. Other domains such as healthcare often produce
disparate datasets that may require a mix of batch and streaming processing, in
which case Flink, Oryx, or Spark would be the best choice.
In this paper, we examined five processing platforms. MapReduce, once the de
facto standard for big data projects, is becoming outmoded in the machine
learning community and is not recommended for the majority of applications
due to its slowness and lack of support for iterative algorithms. Spark is seen by
many as a natural successor. It is based on MapReduce so the transition is not
difficult, but it offers support for iterative tasks and is able to support all of the
machine learning libraries that MapReduce does plus others. However, if
real-time solutions are of importance, one may wish to consider Storm or Flink
instead, since they offer true stream processing while Spark's use of micro-batch
streaming may have a small lag associated with it in receiving results. Flink
offers the best of both worlds in this regard, with a combination of batch and
true stream processing, but it is a very young project and needs more research
into its viability. Additionally, it does not currently support nearly as many
machine learning solutions as the other platforms. H2O is the only end-to-end
system discussed in this paper and offers two features not present in other
systems: a graphical user interface and support for deep learning.
Additionally, it supports at least as many machine learning tools as any of
the other engines we studied. Like Flink, there is very little research on H2O, so
more is needed for a proper evaluation.
No distributed ML libraries have the same number of options as some of the
non-distributed tools such as Weka because not every algorithm lends itself
well to parallelization. Mahout and MLlib are the most well-rounded big data
libraries in terms of algorithm coverage and both work with Spark and H2O.
MLlib has a wider overall selection of algorithms and a larger and more
dedicated team working on it, but is young and largely unproven. Mahout
includes the most options for recommendation and has more maturity than the
others. Mahout was falling out of favor due to its reliance on MapReduce,
but this may change due to modifications made in the newest version. Now that
it is focused more on the math needed for users to create their own algorithm
implementations, we can conceive of situations in which a user may wish to be
familiar with and utilize both Mahout and MLlib. SAMOA, like the new version
of Mahout, has a focus on giving users the necessary tools to create their own
implementations, and the small amount of research in this area suggests it to be a
viable option for stream processing. Real-time learning is increasing in
popularity and we expect the number of options available for it to increase as
well.

Footnotes
1 A related project, SAMOA, does use Hadoop and is discussed in this paper.
2 There is not yet any literature available on the newest version of Mahout, so
these ratings reflect the results of studies using versions 0.9 and earlier.

Authors' contributions
SL performed the primary literature review and analysis for this work, and also
drafted the manuscript. ANR and TH worked with SL to provide additional
analysis and develop the paper's framework and focus. TMK introduced this
topic to SL, ANR, and TH, and coordinated the other authors to complete and
finalize this work. All authors read and approved the final manuscript.

Acknowledgements
The authors gratefully acknowledge partial support by the National Science
Foundation, under grant number CNS-1427536. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science
Foundation.

Competing interests
The authors declare that they have no competing interests.

References
1. International Data Corporation. Digital Universe Study. 2014. http://www.emc.com/leadership/digital-universe/index.htm. Accessed 1 Jun 2015.
2. Ancestry.com Fact Sheet. http://corporate.ancestry.com/press/company-facts/. Accessed 1 Jun 2015.
3. The R Project for Statistical Computing. http://www.r-project.org/.
4. Weka. http://www.cs.waikato.ac.nz/ml/weka/.
5. Apache Hadoop. https://hadoop.apache.org/.
6. Laney D. 3D data management: controlling data volume, velocity and variety. META Group 2001.
7. Demchenko Y, Grosso P, de Laat C, Membrey P. Addressing big data issues in scientific data infrastructure. In: 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, 2013. IEEE, pp 48–55.
8. Cox M, Ellsworth D. Managing big data for scientific visualization. In: ACM Siggraph '97 course #4 exploring gigabyte datasets in real-time: algorithms, data management, and time-critical design, August, 1997.
9. Bekkerman R, Bilenko M, Langford J. Scaling up machine learning: parallel and distributed approaches. Cambridge: Cambridge University Press 2011. http://dx.doi.org/10.1017/CBO9781139042918
10. White T. Hadoop: The Definitive Guide, 3rd edn. Sebastopol, CA: O'Reilly Media, Inc. 2012.
11. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O'Malley O, Radia S, Reed B, Baldeschwieler E. Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing 2013.
12. Apache Hadoop 2.7.0 Documentation. http://hadoop.apache.org/docs/current/. Accessed 5 Jan 2015.
13. Cloudera. http://www.cloudera.com/.
14. Hortonworks. http://hortonworks.com/.
15. MapR. https://www.mapr.com.
16. Project Voldemort. http://www.project-voldemort.com/voldemort/.
17. Redis. http://redis.io/.
18. Apache CouchDB. http://couchdb.apache.org/.
19. MongoDB. https://www.mongodb.org/.
20. Apache HBase. http://hbase.apache.org/.
21. Apache Cassandra. http://cassandra.apache.org/.
22. Titan Distributed Graph Database. http://thinkaurelius.github.io/titan/.
23. Neo4j. http://neo4j.com/.
24. OrientDB. http://orientdb.com/orientdb/.
25. Apache Flume. https://flume.apache.org/.
26. Apache Kafka. http://kafka.apache.org/.
27. Apache Sqoop. http://sqoop.apache.org/.
28. Apache Hive. http://hive.apache.org/.
29. Apache Drill. http://drill.apache.org/.
30. Fernández A, del Río S, López V, Bawakid A, del Jesus MJ, Benítez JM, Herrera F. Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdiscip Rev Data Min Knowl Discov. 2014;4(5):380–409. http://dx.doi.org/10.1002/widm.1134
31. Cascading. http://www.cascading.org/.
32. Apache Pig. http://pig.apache.org/.
33. Lin J, Kolcz A. Large-scale machine learning at twitter. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data 2012. pp. 793–804.
34. Apache Tez. http://tez.apache.org/.
35.

A pa ch eOozieWor kflow Sch edu ler for Ha doop.http://


o ozie.
a pache.
o rg/

(http://oozie.apache.org/) .
36.

A pa ch eZookeeper .https://
z ookeeper.
a pache.
o rg/
(https://zookeeper.apache.org/) .

37 .

Hu e.http://
g ethue.
c om/
(http://gethue.com/) .

3 8.

MOA (Ma ssiv eOn lin eA n a ly sis).http://


m oa.
c s.
w aikato.
a c.
nz/

(http://moa.cs.w aikato.ac.nz/) .
39.

Heller stein JM,Sch oppm a n n F,Wa n g DZ,Fr a tkin E,Welton C.Th eMA DlibA n a ly tics

Libr a r y or MA DSkills,th eSQL.In :V LDBEn dow m en t2 01 2 .pp.1 7 001 1 .


4 0.

Da toCor e.https://
g ithub.
c om/
d atocode/
DatoCore (https://github.com/dato

code/DatoCore) .

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

42/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

41 .

Dea n J,Gh em a w a tS.Ma pRedu ce:Sim plifiedDa ta Pr ocessin g on La r g eClu ster s.In :

Pr oceedin g softh e6 th Sy m posiu m on Oper a tin g Sy stem sDesig n a n dIm plem en ta tion
2 004 .
42.

Dew ittD,Ston ebr a ker M(2 008 )Ma pRedu ce:a m a jor stepba ckw a r ds.Da ta ba se

Colu m n .
43.

DeMich illieG.Reim a g in in g dev eloper pr odu ctiv ity a n dda ta a n a ly ticsin th eclou d

n ew sfr om Goog leIO.2 01 4 .http://


g ooglecloudplatf
o rm.
b logspot.
c om/
2014/
06/
reimagining
developerproductivityanddataanalyticsinthecloudnew sfromgoogleio.
h tml
(http://googlecloudplatform.blogspot.com/2014/06/reimaginingdeveloperproductivityand
dataanalyticsinthecloudnew sfromgoogleio.html) .A ccessed5 Ja n 2 01 5 .
44.

A pa ch eGir a ph .http://
g iraph.
a pache.
o rg/
(http://giraph.apache.org/) .

45.

A pa ch eHa m a .https://
h ama.
a pache.
o rg/
(https://hama.apache.org/) .

46.

Ma lew iczG,A u ster n MH,BikA JC,Deh n er tJC,Hor n I,Leiser N,a n dCza jkow skiG.

Pr eg el:A Sy stem for La r g eSca leGr a ph Pr ocessin g .In :Pr oceedin g softh e2 01 0A CM
SIGMODIn ter n a tion a lCon fer en ceon Ma n a g em en tofda ta 2 01 0.pp.1 3 5 4 5 .
47 .

A m a zon EC2 .http://


a w s.
a mazon.
c om/
e c2/
(http://aw s.amazon.com/ec2/) .

48.

Micr osoftA zu r e.http://


a zure.
m icrosoft.
c om/
(http://azure.microsoft.com/) .

49.

A tten ber g J.Con jectu r e:Sca la bleMa ch in eLea r n in g in Ha doopw ith Sca ldin g .2 01 4 .

https://
c odeascraft.
c om/
2014/
06/
18/
c onjecturescalablemachinelearninginhadoopw ith
scalding/
(https://codeascraft.com/2014/06/18/conjecturescalablemachinelearningin
hadoopw ithscalding/) .A ccessed1 Ju n 2 01 5 .
5 0.

Za h a r ia M,Ch ow dh u r y M,Da sT,Da v eA .Fa sta n din ter a ctiv ea n a ly ticsov er

Ha doopda ta w ith Spa r k.USENIXLog in .2 01 2 3 7 (4 ):4 5 5 1 .


51.

Bu Y,How eB,Ba la zin ska M,Er n stMD.Ha Loop:efficien tIter a tiv eDa ta Pr ocessin g on

La r g eClu ster s.Pr oceedin g sV LDBEn dow m en t.2 01 03 (1 ):2 8 5 9 6 .


CrossRef
52.

(http://dx.doi.org/10.14778/1920841.1920881)

Ja kov itsP,Sr ir a m a SN.Ev a lu a tin g Ma pRedu cefr a m ew or ksfor iter a tiv eScien tific

Com pu tin g a pplica tion s.In :2 01 4 In ter n a tion a lCon fer en ceon Hig h Per for m a n ce
Com pu tin g &Sim u la tion 2 01 4 .pp.2 2 6 3 3 .
53.

Spa r k.https://
s park.
a pache.
o rg/
(https://spark.apache.org/) .

54.

Za h a r ia M,Ch ow dh u r y M,Fr a n klin MJ,Sh en ker S,Stoica I.Spa r k:Clu ster

Com pu tin g w ith Wor kin g Sets.In :Pr oceedin g softh e2 n dUSENIXcon fer en ceon h ottopics
in clou dcom pu tin g 2 01 0.
55.

NiZ.Com pa r a tiv eEv a lu a tion ofSpa r ka n dStr a tosph er e.Th esis,KTHRoy a l

In stitu teofTech n olog y 2 01 3 .


56.

Xin R.Da ta Fr a m esfor La r g eSca leDa ta Scien ce.Da ta br icksTech Ta lk.https://


w w w .

youtube.
c om/
w atch?
v =
Hvke1f10dL0 (https://w w w .youtube.com/w atch?v=Hvke1f10dL0)
(2 01 5 ).
57.

Sor tBen ch m a r kHom ePa g e.http://


s ortbenchmark.
o rg/
(http://sortbenchmark.org/)

.A ccessed1 Ju n 2 01 5 .
5 8.

Xin R.Spa r kofficia lly setsa n ew r ecor din la r g esca lesor tin g .2 01 4 .http://

databricks.
c om/
b log/
2014/
11/
05/
s parkofficiallysetsanew recordinlargescalesorting.
html (http://databricks.com/blog/2014/11/05/sparkofficiallysetsanew recordinlarge
scalesorting.html) .A ccessed01 Ju n 2 01 5 .
59.

Ca iZ,Ga oJ,Lu oS,Per ezLL,V a g en a Z,Jer m a in eC.A com pa r ison ofpla tfor m sfor

im plem en tin g a n dr u n n in g v er y la r g esca lem a ch in elea r n in g a lg or ith m s.In :


Pr oceedin g softh e2 01 4 A CMSIGMODin ter n a tion a lcon fer en ceon Ma n a g em en tofda ta
(SIGMOD1 4 )2 01 4 .pp.1 3 7 1 8 2 .

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

43/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

(SIGMOD1 4 )2 01 4 .pp.1 3 7 1 8 2 .
6 0.

MLlib.https://
s park.
a pache.
o rg/
m llib/
(https://spark.apache.org/mllib/) .

61 .

Gr a ph X.https://
s park.
a pache.
o rg/
g raphx/
(https://spark.apache.org/graphx/) .

62.

Ma h ou t.http://
m ahout.
a pache.
o rg/
(http://mahout.apache.org/) .

63.

Zh a n g H,Tu dor BM,Ch en G,OoiBC.Efficien tin m em or y da ta m a n a g em en t:a n

A n a ly sis.Pr oceedin g sV LDBEn dow m en t.2 01 4 7 (1 0):6 9 .


64.

Sin g h J.Big Da ta A n a ly tica n dMin in g w ith Ma ch in eLea r n in g A lg or ith m .In tJ

In for m Com pu tTech n ol.2 01 4 4 (1 ):3 3 4 0.


65.

Ou ster h ou tK,Ra stiR,Ra tn a sa m y S,Sh en ker S,Ch u n B.Ma kin g Sen seof

Per for m a n cein Da ta A n a ly ticsFr a m ew or ks.In :Pr oceedin g softh e1 2 th USENIX


Sy m posiu m .On Netw or kedSy stem sDesig n a n dIm plem en ta tion (NSDI1 5 )2 01 5 .
66.

Sh a h r iv a r iS,Ja liliS.Bey on dba tch pr ocessin g :tow a r dsr ea ltim ea n dstr ea m in g

big da ta .Com pu ter s.2 01 4 3 (4 ):1 1 7 2 9 .


CrossRef
67 .

(http://dx.doi.org/10.3390/computers3040117)

Za h a r ia M,Da sT,LiH,Hu n ter T,Sh en ker S,Stoica I.Discr etizedStr ea m s:A Fa u lt

Toler a n tModelfor Sca la bleStr ea m Pr ocessin g .Un iv er sity ofCa lifor n ia a tBer keley
Tech n ica lRepor tNo.UCB/EECS2 01 2 2 5 9 2 01 2 .
68.

A pa ch eStor m .https://
s torm.
a pache.
o rg/
(https://storm.apache.org/) .

69.

Ma r zN.Histor y ofA pa ch eStor m a n dlesson slea r n ed.2 01 4 .http://


nathanmarz.
c om/

blog/
h istoryofapachestormandlessonslearned.
h tml
(http://nathanmarz.com/blog/historyofapachestormandlessonslearned.html) .A ccessed
1 2 A pr 2 01 5 .
7 0.

Kh u da ir iS.Th eA pa ch eSoftw a r eFou n da tion A n n ou n cesA pa ch eStor m a sa Top

Lev elPr oject.2 01 4 .https://


b logs.
a pache.
o rg/
foundation/
e ntry/
the_
a pache_
s oftw are_
foundation_
a nnounces64
(https://blogs.apache.org/foundation/entry/the_apache_softw are_foundation_announces64)
.
71.

Lor ica B.A r ea ltim epr ocessin g r ev iv a l.Ra da r .2 01 5 .http://


radar.
o reilly.
c om/
2015/

04/
a realtimeprocessingrevival.
h tml (http://radar.oreilly.com/2015/04/arealtime
processingrevival.html) .
7 2.

A pa ch eTh r ift.http://
thrift.
a pache.
o rg/
(http://thrift.apache.org/) .

7 3.

Tosh n iw a lA ,Don h a m J,Bh a g a tN,Mitta lS,Ry a boy D,Ta n eja S,Sh u kla A ,

Ra m a sa m y K,Pa telJM,Ku lka r n iS,Ja ckson J,Ga deK,Fu M.Stor m @Tw itter .In :
Pr oceedin g softh e2 01 4 A CMSIGMODin ter n a tion a lcon fer en ceon Ma n a g em en tofda ta
(SIGMOD1 4 )2 01 4 .pp.1 4 7 5 6 .
7 4.

Gr a dv oh lA LS,Sen g er H,A r a n tesL,Sen sP.Com pa r in g Distr ibu tedOn lin eStr ea m

Pr ocessin g Sy stem sCon sider in g Fa u ltToler a n ceIssu es.JEm er g Tech n olWebIn tell.
2 01 4 6 (2 ):1 7 4 9 .
75.

Fen g A ,Ev a n sR,Da g itD,Rober tsN.Stor m y a r n .https://


g ithub.
c om/
y ahoo/
s torm

yarn (https://github.com/yahoo/stormyarn) .
7 6.

Ma r zN,Wa r r en J.Big da ta :pr in ciplesa n dbestpr a cticesofsca la bler ea ltim eda ta

sy stem s.Ma n n in g Pu blica tion s2 01 5 .


77.

H 2O.http://
h 2o.
a i/
(http://h2o.ai/) .

7 8.

Rea ltim ePr ediction sw ith H2 Oon Stor m .https://


g ithub.
c om/
h 2oai/
h 2otraining/

blob/
m aster/
tutorials/
s treaming/
s torm/
README.
m d#realtimepredictionsw ithh2oon
storm (https://github.com/h2oai/h2o
training/blob/master/tutorials/streaming/storm/README.md%23realtimepredictionsw ith
h2oonstorm) .

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

44/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

h2oonstorm) .
7 9.

Wa sson T,Sa lesA P.A pplica tion A g n osticStr ea m in g Ba y esia n In fer en cev ia A pa ch e

Stor m .In :Th e2 01 4 In ter n a tion a lCon fer en ceon Big Da ta A n a ly tics2 01 4 .
8 0.

Mer ien n eP,Tr iden tm l.http://


g ithub.
c om/
p merienne/
tridentml

(http://github.com/pmerienne/tridentml) .
81 .

A pa ch eFlin k.https://
flink.
a pache.
o rg/
(https://flink.apache.org/) .

82 .

A lex a n dr ov A ,Ber g m a n n R,Ew en S,Fr ey ta g JC,Hu eskeF,HeiseA ,Ka oO,Leich M,

Leser U,Ma r klV ,Na u m a n n F,Peter sM,Rh ein l n der A ,Sa x MJ,Sch elter S,Hg er M,
Tzou m a sK,Wa r n ekeD.Th eStr a tosph er epla tfor m for big da ta a n a ly tics.V LDBJIn tJ
V er y La r g eDa ta Ba ses.2 01 4 2 3 (6 ):9 3 9 6 4 .
CrossRef
83 .

(http://dx.doi.org/10.1007/s007780140357y)

Ew en S,Sch elter S,Tzou m a sK,Wa r n ekeD,Ma r klV .Iter a tiv ePa r a llelDa ta

Pr ocessin g w ith Str a tosph er e:A n In sideLook.In :Pr oceedin g softh e2 01 3 In ter n a tion a l
Con fer en ceon Ma n a g em en tofDa ta (SIGMOD1 3 )2 01 3 .pp.1 05 3 6 .
84.

Leich M,A da m ekJ,Sch u botzM,HeiseA ,Rh ein l n der A ,Ma r klV .A pply in g

Str a tosph er efor Big Da ta A n a ly tics.In :1 5 th Con fer en ceon Da ta ba seSy stem sfor Bu sin ess,
Tech n olog y a n dWeb(BTW2 01 3 )2 01 3 .pp.5 07 1 0.
85 .

Flin kML.https://
g ithub.
c om/
a pache/
flink/
tree/
m aster/
flinkstaging/
flinkml

(https://github.com/apache/flink/tree/master/flinkstaging/flinkml) .
86.

Metzg er R,CelebiU.In tr odu cin g A pa ch eFlin kA n ew a ppr oa ch todistr ibu tedda ta

pr ocessin g .In :Silicon V a lley Ha n dsOn Pr og r a m m in g Ev en ts2 01 4 .


87 .

Ch a lm er sS,Both or elC,PicotClem en teR.Big Da ta Sta teofth eA r t.Tech n ica l

Repor t,Telecom Br eta g n e,Tech n ica lRepor t2 01 3 .


88.

Sin g h D,Reddy CK.A su r v ey on pla tfor m sfor big da ta a n a ly tics.JBig Da ta .

2 01 4 1 :8 .
89.

Collier K,Ca r ey B,Sa u tter D,Ma r ja n iem iC.A m eth odolog y for ev a lu a tin g a n d

selectin g da ta m in in g softw a r e.In :Pr oceedin g softh e3 2 n dA n n u a lHa w a iiIn ter n a tion a l
Con fer en ceon Sy stem ssScien ces,Ma u i,HI1 9 9 9 .IEEE,pp.1 1 .
9 0.

Zh on g S,Kh osh g ofta a r TM,Seliy a N.Clu ster in g ba sedn etw or kin tr u sion detection .

In tJRelia bQu a lSa fEn g .2 007 1 4 (02 ):1 6 9 8 7 .


CrossRef
91 .

(http://dx.doi.org/10.1142/S0218539307002568)

Kh osh g ofta a r TM,Dittm a n DJ,Wa ldR,A w a da W. A r ev iew ofen sem ble

cla ssifica tion for dn a m icr oa r r a y sda ta , in Toolsw ith A r tificia lIn tellig en ce(ICTA I),
2 01 3 IEEE2 5 th In ter n a tion a lCon fer en ceon .IEEE,2 01 3 ,pp.3 8 1 9 .
92.

A pr 2 01 5 A pa ch eMa h ou tsn ex tg en er a tion v er sion 0.1 0.0r elea sed.http://


m ahout.

apache.
o rg/
(http://mahout.apache.org/) .A ccessed1 6 A pr 2 01 5 .
93.

Miller J.Recom m en der Sy stem for A n im a tedV ideo.Issu esIn for m Sy st.

2 01 4 1 5 (2 ):3 2 1 7 .
94.

Weg en er D,MockM,A dr a n a leD,Wr obelS.ToolkitBa sedHig h Per for m a n ceDa ta

Min in g ofLa r g eDa ta on Ma pRedu ceClu ster s.In :2 009 IEEEIn ter n a tion a lCon fer en ceon
Da ta Min in g Wor ksh ops2 009 .pp.2 9 6 3 01 .
95.

Zen g C,Jia n g Y,Zh en g L,LiJ,LiL,LiH,Sh en C,Zh ou W,LiT,Du a n B,LeiM,a n d

Wa n g P.FIUMin er :A Fa st,In teg r a ted,a n dUser Fr ien dly Sy stem for Da ta Min in g in
Distr ibu tedEn v ir on m en t.In :Pr oceedin g softh e1 9 th A CMSIGKDDin ter n a tion a l
con fer en ceon Kn ow ledg ediscov er y a n dda ta m in in g 2 01 3 .pp.1 5 06 9 .
96.

Gen g X,Ya n g Z.Da ta Min in g in Clou dCom pu tin g .In :Pr oceedin g softh e2 01 3

In ter n a tion a lCon fer en ceon In for m a tion Scien cea n dCom pu ter A pplica tion s(ISCA 2 01 3 )
2 01 3 .
97 .

DeSou za RG,Ch iky R,A ou lZK.Open Sou r ceRecom m en da tion Sy stem sfor Mobile

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

45/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

97 .

DeSou za RG,Ch iky R,A ou lZK.Open Sou r ceRecom m en da tion Sy stem sfor Mobile

A pplica tion .In :Wor ksh opon th ePr a ctica lUseofRecom m en der Sy stem s,A lg or ith m sa n d
Tech n olog ies(PRSA T2 01 0)2 01 0.pp.5 5 8 .
98.

Sem in a r ioCE,Wilson DC.Ca seStu dy Ev a lu a tion ofMa h ou ta sa Recom m en der

Pla tfor m .In :6 th A CMcon fer en ceon r ecom m en der en g in es(RecSy s2 01 2 )2 01 2 .pp.4 5
5 0.
99.

Lem n a r u C,Cu ibu sM,Bon a A ,A licA ,Potolea R.A Distr ibu tedMeth odolog y for

Im ba la n cedCla ssifica tion Pr oblem s.In :2 01 2 1 1 th In ter n a tion a lSy m posiu m on Pa r a llel
a n dDistr ibu tedCom pu tin g (ISPDC)2 01 2 .pp.1 6 4 7 1 .
1 00.

Ha m m on dK,V a r deA S.Clou dba sedpr edictiv ea n a ly tics:tex tcla ssifica tion ,

r ecom m en der sy stem sa n ddecision su ppor t.In :2 01 3 IEEE1 3 th In ter n a tion a lCon fer en ce
on Da ta Min in g Wor ksh opsDa lla s,TX,2 01 3 ,pp.6 07 1 2 .
1 01 .

Estev esRM,Pa isR,Ron g C.Km ea n sClu ster in g in th eClou dA Ma h ou tTest.In :

2 01 1 IEEEWor ksh opsofIn ter n a tion a lCon fer en ceon A dv a n cedIn for m a tion Netw or kin g
a n dA pplica tion s2 01 1 .pp.5 1 4 9 .
1 02 .

MetzC.Ma h ou t,Th er eItIs!Open Sou r ceA lg or ith m sRem a keOv er stock.com .Wir ed

Ma g a zin e.2 01 2 .http://


w w w .
w ired.
c om/
2012/
12/
m ahout/
(http://w w w .w ired.com/2012/12/mahout/) .A ccessed1 8 Dec2 01 4 .
1 03 .

Ja ckK.Ma h ou tbecom esa r esea r ch er :La r g eSca leRecom m en da tion sa tMen deley .

In :Big Da ta Week,Ha doopUser Gr ou pUK2 01 2 .


1 04 .

Su m ba ly R,Kr epsJ,Sh a h S.Th ebig da ta ecosy stem a tLin kedIn .In :Pr oceedin g sof

th e2 01 3 A CMSIGMODIn ter n a tion a lCon fer en ceon Ma n a g em en tofDa ta (SIGMOD1 3 )


2 01 3 .pp.1 1 2 5 3 4 .
1 05 .

In g er sollG.A pa ch eMa h ou t:Sca la blem a ch in elea r n in g for ev er y on e.IBM

Cor por a tion 2 01 1 .


1 06 .

Spa r ksER,Ta lw a lka r A ,Sm ith V ,Kotta la m J,Pa n X,Gon za lezJ,Fr a n klin MJ,

Jor da n MI,Kr a ska T.MLI:A n A PIfor Distr ibu tedMa ch in eLea r n in g .In :2 01 3 IEEE1 3 th
In ter n a tion a lCon fer en ceon Da ta Min in g 2 01 3 .pp.1 1 8 7 9 2 .
1 07 .

Zh a oH.Hig h Per for m a n ceMa ch in eLea r n in g th r ou g h Codesig n a n dRooflin in g .

Disser ta tion ,Un iv er sity ofCa lifor n ia a tBer keley 2 01 4 .


1 08 .

Pen g H,Lia n g D,Ch oiC.Ev a lu a tin g Pa r a llelLog isticReg r ession Models.In :2 01 3

IEEEIn ter n a tion a lCon fer en ceon Big Da ta 2 01 3 .pp.1 1 9 2 6 .


1 09 .

Ren n ieJDM,Sh ih L,Teev a n J,Ka r g er DR.Ta cklin g th ePoor A ssu m ption sofNa iv e

Ba y esTex tCla ssifier s.In :Pr oceedin g softh eTw en tieth In ter n a tion a lCon fer en ceon
Ma ch in eLea r n in g (ICML2 003 )2 003 .
1 1 0.

In g er sollG.In tr odu cin g A pa ch eMa h ou t:Sca la ble,com m er ica lfr ien dly m a ch in e

lea r n in g for bu ildin g in tellig en ta pplica tion s,IBMCor por a tion 2 009 .
111.

Ow en S,A n ilR,Du n n in g T,Fr iedm a n E.Ma h ou tin A ction .Sh elter Isla n d,

NY2 01 1 .
112.

Wa n g Y,WeiJ,Sr iv a tsa M,Du a n Y,Du W.In teg r ity MR:In teg r ity a ssu r a n ce

fr a m ew or kfor big da ta a n a ly ticsa n dm a n a g em en ta pplica tion s.In :2 01 3 IEEE


In ter n a tion a lCon fer en ceon Big Da ta 2 01 3 .pp.3 3 4 0.
113.

V er m a A ,Ch er ka sov a L,Ca m pbellRH.Pla y ItA g a in ,Sim MR!In :2 01 1 IEEE

In ter n a tion a lCon fer en ceon Clu ster Com pu tin g 2 01 1 .pp.2 5 3 6 1 .
1 1 4.

Ja n eja V P,A za r iA ,Na m a y a n ja JM,Heilig B.BdIDS:Min in g A n om a liesin a Big

distr ibu tedIn tr u sion Detection Sy stem .In :2 01 4 IEEEIn ter n a tion a lCon fer en ceon Big
Da ta 2 01 4 .pp3 2 4 .
115.

Sin g h K,Gu n tu ku SC,Th a ku r A ,Hota C.Big Da ta A n a ly ticsfr a m ew or kfor Peer to

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

46/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

115.

Sin g h K,Gu n tu ku SC,Th a ku r A ,Hota C.Big Da ta A n a ly ticsfr a m ew or kfor Peer to

Peer Botn etdetection u sin g Ra n dom For ests.In fSci.2 01 4 2 7 8 :4 8 8 9 7 .


CrossRef
1 1 6.

(http://dx.doi.org/10.1016/j.ins.2014.03.066)

Ra cetteMP,Sm ith CT,Cu n n in g h a m MP,Heekin TA ,Lem ley JP,Ma th ieu RS.

Im pr ov in g situ a tion a la w a r en essfor h u m a n ita r ia n log isticsth r ou g h pr edictiv em odelin g .


In :Sy stem sa n dIn for m a tion En g in eer in g Desig n Sy m posiu m (SIEDS)2 01 4 .pp.3 3 4 9 .
117.

KoKD,ElGh a za w iT,Kim D,Mor izon oH.Pr edictin g th esev er ity ofm otor n eu r on

disea sepr og r ession u sin g electr on ich ea lth r ecor dda ta w ith a clou dcom pu tin g Big Da ta
a ppr oa ch .In :2 01 4 IEEECon fer en ceon Com pu ta tion a lIn tellig en cein Bioin for m a ticsa n d
Com pu ta tion a lBiolog y 2 01 4 .
1 1 8.

LiL,Ba g h er iS,GooteH,Ha sa n A ,Ha za r dG.RiskA dju stm en tofPa tien t

Ex pen ditu r es:A Big Da ta A n a ly ticsA ppr oa ch .In :2 01 3 IEEEIn ter n a tion a lCon fer en ceon
Big Da ta 2 01 3 .pp.1 2 4 .
1 1 9.

Zolfa g h a r K,Mea dem N,Ter edesa iA ,Roy SB,Ch in S,Mu ckia n B.Big da ta solu tion s

for pr edictin g r iskofr ea dm ission for con g estiv eh ea r tfa ilu r epa tien ts.In :2 01 3 IEEE
In ter n a tion a lCon fer en ceon Big Da ta 2 01 3 .pp.6 4 7 1 .
1 2 0.

My la r a sw a m y D,Xu B,Dietr ich P,Mu r u g a n A .Ca seStu dies:Big Da ta A n a ly ticsfor

Sy stem Hea lth Mon itor in g .In :2 01 4 In ter n a tion a lCon fer en ceon A r tificia lIn tellig en ce
(ICA I1 4 )2 01 4 .
121.

Estev esRMRon g C.Usin g Ma h ou tfor Clu ster in g Wikipedia sLa testA r ticles:A

Com pa r ison betw een Km ea n sa n dFu zzy Cm ea n sin th eClou d.In :2 01 1 IEEETh ir d
In ter n a tion a lCon fer en ceon Clou dCom pu tin g Tech n olog y a n dScien ce2 01 1 .pp.5 6 5 9 .
122.

Filim on D.Clu ster in g ofRea ltim eDa ta a tSca le.In :Ber lin Bu zzw or ds2 01 3 .

123.

Ga oF,A bdA lm a g eedW,Hefeeda M.Distr ibu teda ppr ox im a tespectr a lclu ster in g

for la r g esca leda ta sets.In :Pr oceedin g softh e2 1 stin ter n a tion a lsy m posiu m on Hig h
Per for m a n cePa r a llela n dDistr ibu tedCom pu tin g (HPDC1 2 )2 01 2 .pp.2 2 3 3 4 .
1 24.

Hu ssein T,Lin der T,Ga u lkeW,Zieg ler J.Hy br eed:a softw a r efr a m ew or kfor

dev elopin g con tex ta w a r eh y br idr ecom m en der sy stem s.User ModelUser A da pIn ter .
2 01 4 2 4 (1 2 ):1 2 1 7 4 .
CrossRef
125.

(http://dx.doi.org/10.1007/s112570129134z)

Yu H,Hsieh C,SiS,Dh illon IS.Pa r a llelm a tr ix fa ctor iza tion for r ecom m en der

sy stem s.Kn ow lIn fSy st.2 01 3 4 1 (3 ):7 9 3 8 1 9 .


CrossRef
1 26.

(http://dx.doi.org/10.1007/s1011501306822)

Sa idA ,Bellog n A .Com pa r a tiv eRecom m en der Sy stem Ev a lu a tion :Ben ch m a r kin g

Recom m en da tion Fr a m ew or ks.In :Pr oceedin g softh e8 th A CMCon fer en ceon


Recom m en der sy stem s(RecSy s1 4 )2 01 4 .pp.1 2 9 3 6 .
127 .

Zh en g J,Da g n in oA .A n in itia lstu dy ofpr edictiv em a ch in elea r n in g a n a ly ticson

la r g ev olu m esofh istor ica lda ta for pow er sy stem a pplica tion s.In :2 01 4 IEEE
In ter n a tion a lCon fer en ceon Big Da ta 2 01 4 .pp.9 5 2 5 9 .
1 2 8.

Ka tsipou la kisNR,Tia n Y,Rein w a ldB,Pir a h esh H.A Gen er icSolu tion toIn teg r a te

SQLa n dA n a ly ticsfor Big Da ta .In :1 8 th In ter n a tion a lCon fer en ceon Ex ten din g Da ta ba se
Tech n olog y (EDBT)2 01 5 .pp.6 7 1 6 .
1 29.

A lber M.Big Da ta a n dMa ch in eLea r n in g :A Ca seStu dy w ith Bu m pBoost.Th esis,

Fr eeUn iv er sity ofBer lin 2 01 4 .


1 3 0.

Lin CY,Tsa iCH,LeeCP,Lin CJ.La r g esca lelog isticr eg r ession a n dlin ea r su ppor t

v ector m a ch in esu sin g spa r k.In :2 01 4 IEEEIn ter n a tion a lCon fer en ceon Big Da ta 2 01 4 .
pp.5 1 9 2 8 .
131.

Zh a n g C.Dim m Witted:A Stu dy ofMa in Mem or y Sta tistica lA n a ly tics.2 01 4 .a r Xiv

Pr epr in t,.a r Xiv :1 4 03 .7 5 5 0.

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

47/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

Pr epr in t,.a r Xiv :1 4 03 .7 5 5 0.


132.

Kou tsou m pa kisG.Spa r kba sedA pplica tion for A bn or m a lLog Detection .Th esis,

Uppsa la Un iv er sity 2 01 4 .
133.

Pow er edBy Spa r k.https://


c w iki.
a pache.
o rg/
c onfluence/
d isplay/
S PARK/

Pow ered+By+Spark
(https://cw iki.apache.org/confluence/display/SPARK/Pow ered%2bBy%2bSpark) .A ccessed
1 5 Dec2 01 4 .
1 34.

Ta lw a lka r A ,Kr a ska T,Gr iffith R,Du ch iJ,Gon za lezJ,Br itzD,Pa n X,Sm ith V ,

Spa r ksE,Wibison oA ,Fr a n klin MJ,Jor da n MI.MLba se:A Distr ibu tedMa ch in eLea r n in g
Wr a pper .In :NIPSBig Lea r n in g Wor ksh op2 01 2 .
135.

Kr a ska T,Ta lw a lka r A ,Du ch iJ,Gr iffith R,Fr a n klin MJ,Jor da n M.MLba se:A

Distr ibu tedMa ch in elea r n in g Sy stem .In :6 th Bien n ia lCon fer en ceon In n ov a tiv eDa ta
Sy stem sResea r ch 2 01 3 .
1 36.

Pa n X,Spa r ksER,Wibison oA .MLba se:Distr ibu tedMa ch in eLea r n in g Ma deEa sy .

Un iv er sity ofCa lifor n ia Ber keley Tech n ica lRepor t2 01 3 .


137 .

Spa r ksER,Ta lw a lka r A ,Fr a n klin MJ,Jor da n MI,Kr a ska T.Tu PA Q:a n efficien t

pla n n er for la r g esca lepr edictiv ea n a ly ticqu er ies.2 01 5 .(a rXiv Preprin t
a rXiv :1502.00068).
1 3 8.

Spa r ksE.Sca la bleA u tom a tedModelSea r ch .Un iv er sity ofCa lifor n ia a tBer keley

Tech n ica lRepor tUCB/EECS2 01 4 1 2 2 2 01 4 .


1 39.

Na ja fa ba diMM,V illa n u str eF,Kh osh g ofta a r TM,Seliy a N,Wa ldR,Mu h a r em a g icE.

Deeplea r n in g a pplica tion sa n dch a llen g esin big da ta a n a ly tics.JBig Da ta .2 01 5 2 (1 ):1


21.
CrossRef
1 4 0.

(http://dx.doi.org/10.1186/s4053701400077)

Deeplea r n in g 4 j.http://
w w w .
s kymind.
io/
d eeplearning4j/

(http://w w w .skymind.io/deeplearning4j/) .
1 41 .

KNIME.http://
w w w .
k nime.
o rg/
(http://w w w .knime.org/) .

1 42.

Ra pidMin er .https://
rapidminer.
c om/
(https://rapidminer.com/) .

1 43.

H2 O(2 01 5 )A lg or ith m sRoa dm a p.

1 44.

Kejela G,Estev esRM,Ron g C.Pr edictiv eA n a ly ticsofSen sor Da ta Usin g Distr ibu ted

Ma ch in eLea r n in g Tech n iqu es.In :2 01 4 IEEE6 th In ter n a tion a lCon fer en ceon Clou d
Com pu tin g Tech n olog y a n dScien ce2 01 4 .pp.6 2 6 3 1 .
1 45.

Mor a lesGDF,BifetA .SA MOA :Sca la bleA dv a n cedMa ssiv eOn lin eA n a ly sis.JMa ch

Lea r n Res.2 01 5 1 6 :1 4 9 5 3 .
1 46.

BifetA ,Mor a lesGDF.Big Da ta Str ea m Lea r n in g w ith SA MOA .In :2 01 4 IEEE

In ter n a tion a lCon fer en ceon Da ta Min in g Wor ksh op(ICDMW)2 01 4 .pp.1 1 9 9 2 02 .
1 47 .

Sev er ien A L.Sca la bleDistr ibu tedRea lTim eClu ster in g for Big Da ta Str ea m s.

Th esis,Poly tech n icUn iv er sity ofCa ta lon ia 2 01 3 .


1 48.

Rom sa iy u dW.A u tom a ticEx tr a ction ofTopicson Big Da ta Str ea m sth r ou g h

Sca la bleA dv a n cedA n a ly sis.In :2 01 4 In ter n a tion a lCom pu ter Scien cea n dEn g in eer in g
Con fer en ce(ICSEC)2 01 4 .pp.2 5 5 6 0.
1 49.

Rion da toM,DeBr a ba n tJA ,Fon seca R,Upfa lE.PA RMA :a pa r a llelr a n dom ized

a lg or ith m for a ppr ox im a tea ssocia tion r u lesm in in g in Ma pRedu ce.In :Pr oceedin g softh e
2 1 stA CMIn ter n a tion a lCon fer en ceon In for m a tion a n dKn ow ledg eMa n a g em en t
(CIKM1 2 )2 01 2 .pp.8 5 9 4 .
1 5 0.

SA MOA MOA .https://


g ithub.
c om/
s amoamoa/
s amoamoa

(https://github.com/samoamoa/samoamoa) .
151.

Kou r tellisN,Mor a lesGDF,Bon ch iF.Sca la bleOn lin eBetw een n essCen tr a lity in

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

48/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

151.

Kou r tellisN,Mor a lesGDF,Bon ch iF.Sca la bleOn lin eBetw een n essCen tr a lity in

Ev olv in g Gr a ph s2 01 5 .a r Xiv Pr epr in t:1 4 01 .6 9 8 1 .


152.

Ra h n a m a A HA .Rea ltim eSen tim en tA n a ly sisofTw itter Pu blicStr ea m .Th esis,

Un iv er sity ofJy v sky l 2 01 5 .


153.

Ra h n a m a A HA ,Sen tin el.http://


a mbodi.
g ithub.
io/
s entinel/

(http://ambodi.github.io/sentinel/) .
1 54.

QiB,Ma G,Sh iZ,Wa n g W.Efficien tly Fin din g TopKItem sfr om Ev olv in g

Distr ibu tedDa ta Str ea m s.In :2 01 4 1 0th In ter n a tion a lCon fer en ceon Sem a n tics,
Kn ow ledg ea n dGr ids(SKG)2 01 4 .
155.

DiMa u r oM,DiSa r n oC.A fr a m ew or kfor In ter n etda ta r ea ltim epr ocessin g :A

m a ch in elea r n in g a ppr oa ch .In :2 01 4 In ter n a tion a lCa r n a h a n Con fer en ceon Secu r ity
Tech n olog y (ICCST)2 01 4 .
1 56.

Distr ibu tedWeka .http://


w w w .
c s.
w aikato.
a c.
nz/
m l/
w eka/
b igdata.
h tml

(http://w w w .cs.w aikato.ac.nz/ml/w eka/bigdata.html) .


157.

Or y x .https://
g ithub.
c om/
c loudera/
o ryx (https://github.com/cloudera/oryx) .

1 5 8.

Or y x 2 .https://
g ithub.
c om/
O ryxProject/
o ryx (https://github.com/OryxProject/oryx) .

1 59.

V ow pa lWa bbit.https://
g ithub.
c om/
JohnLangford/
v ow pal_
w abbit

(https://github.com/JohnLangford/vow pal_w abbit) .


1 6 0.

V a n Hu lseJ,Kh osh g ofta a r T.Kn ow ledg ediscov er y fr om im ba la n ceda n dn oisy

da ta .Da ta Kn ow lEn g .2 009 6 8 (1 2 ):1 5 1 3 4 2 .


CrossRef
1 61 .

(http://dx.doi.org/10.1016/j.datak.2009.08.005)

Kh osh g ofta a r TM,Hu lseJV .Im pu ta tion tech n iqu esfor m u ltiv a r ia tem issin g n essin

softw a r em ea su r em en tda ta .Softw a r eQu a lity J.1 6 (4 ):5 6 3 6 002 008 .[On lin e].http://
dx.
d oi.
o rg/
10.
1007/
s 1121900890547 (http://dx.doi.org/10.1007/s1121900890547) .
1 62.

Kh osh g ofta a r TM,V a n Hu lseJ,Na polita n oA .Com pa r in g boostin g a n dba g g in g

tech n iqu esw ith n oisy a n dim ba la n cedda ta .Sy stMa n Cy ber n Pa r tA Sy stHu m IEEE
Tr a n s.2 01 1 4 1 (3 ):5 5 2 6 8 .
CrossRef
1 63.

(http://dx.doi.org/10.1109/TSMCA.2010.2084081)

V a n Hu lseJ,Kh osh g ofta a r TM,Na polita n oA ,Wa ldR.Fea tu r eselection w ith h ig h

dim en sion a lim ba la n cedda ta .In :IEEEIn ter n a tion a lCon fer en ceon Da ta Min in g
Wor ksh ops(ICDMW09 )2 009 .pp.5 07 1 4 .
1 64.

V a n Hu lseJ,Kh osh g ofta a r TM,Na polita n oA .Ex per im en ta lper spectiv eson

lea r n in g fr om im ba la n cedda ta .In :Pr oceedin g softh e2 4 th In ter n a tion a lCon fer en ceon
Ma ch in eLea r n in g 2 007 .pp.9 3 5 4 2 .
1 65.

Hog a n JM,Peu tT.La r g eSca leRea dCla ssifica tion for Nex tGen er a tion Sequ en cin g .

Pr ocedia Com pu tSci.2 01 4 2 9 :2 003 1 2 .


CrossRef
1 66.

(http://dx.doi.org/10.1016/j.procs.2014.05.184)

Su n K,Mia oW,Zh a n g X,Ra oR.A n Im pr ov em en ttoFea tu r eSelection ofRa n dom

For estson Spa r k.In :2 01 4 IEEE1 7 th In ter n a tion a lCon fer en ceon Com pu ta tion a lScien ce
a n dEn g in eer in g (CSE)2 01 4 .pp.7 7 4 9 .
1 67 .

Ka n delS,Pa epckeA ,Heller stein JM,Heer J.En ter pr iseda ta a n a ly sisa n d

v isu a liza tion :a n in ter v iew stu dy .IEEETr a n sV isu a lCom pu tGr a ph ics.
2 01 2 1 8 (1 2 ):2 9 1 7 2 6 .
CrossRef
1 68.

(http://dx.doi.org/10.1109/TVCG.2012.219)

Kelley I,Blu m en stockJ.Com pu ta tion a lCh a llen g esin th eA n a ly sisofLa r g e,Spa r se,

Spa tiotem por a lDa ta .In :Pr oceedin g softh esix th in ter n a tion a lw or ksh opon Da ta
in ten siv edistr ibu tedcom pu tin g 2 01 4 .pp.4 1 5 .
1 69.

Ga n zF,Pu sch m a n n D,Ba r n a g h iP,Ca r r ezF.A pr a ctica lev a lu a tion ofin for m a tion

pr ocessin g a n da bstr a ction tech n iqu esfor th ein ter n etofth in g s.IEEEIn ter n etTh in g sJ

http://link.springer.com/article/10.1186/s4053701500321/fulltext.html

49/50

12/1/2015

AsurveyofopensourcetoolsformachinelearningwithbigdataintheHadoopecosystemSpringer

pr ocessin g a n da bstr a ction tech n iqu esfor th ein ter n etofth in g s.IEEEIn ter n etTh in g sJ
2 01 5 (preprin t ).

Copyright information
Landset et al. 2015
Open Access
This article is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license, and indicate if changes were
made.
