http://www.kdnuggets.com/2015/10/datasciencemachine.html 1/6
10/12/2016 TheDataScienceMachine,orHowToEngineerFeatureEngineering
The Data Science Machine, or How To Engineer Feature Engineering
Tags: Automated, Data Science, Feature Engineering, Feature Extraction, MIT
MIT researchers have developed what they refer to as the Data Science Machine, which combines feature engineering and an end-to-end data science pipeline into a system that beats nearly 70% of humans in competitions. Is this game changing?
By Matthew Mayo, KDnuggets.
Recent research by MIT Master's student Max Kanter has led to the implementation of what he refers to as the 'Data Science Machine.' A paper on the Data Science Machine (DSM) and its underlying innovation, the Deep Feature Synthesis algorithm, by Kanter and Kalyan Veeramachaneni, his thesis supervisor at CSAIL, is set to be presented at the IEEE International Conference on Data Science and Advanced Analytics next week. Their paper 'Deep Feature Synthesis: Towards Automating Data Science Endeavors' is available online now.
The DSM is concisely described by Kanter & Veeramachaneni as "an automated system for generating predictive models from raw data," which combines the authors' innovative feature engineering approach with an end-to-end data science pipeline. The DSM has, thus far, managed to beat 68.9% of teams in the data science competitions it has entered. Perhaps most noteworthy, the submissions that attain this success rate are generally completed in under 12 hours, as opposed to the months that teams of humans can labor for.
The DSM is premised on the observation that data science competition problems generally have the following properties in common: they are structured and relational, they model human interaction with a complex system, and they attempt to predict some aspect of human behavior.
Deep Feature Synthesis
As with any data science problem, features must first be identified from existing variables, or created by leveraging existing variables. While conceding that feature engineering has made significant recent advancements for non-relational data such as text and images, Kanter & Veeramachaneni note that it is still the task in the data science pipeline that relies most on human intervention, and that it can be difficult and time consuming even for seasoned data scientists. It is also the task that must most closely replicate the efficiency of a human being if it is to be truly automated.
Deep Feature Synthesis (DFS), the DSM's feature engineering algorithm, is strictly for relational datasets, and is used to automate the identification and generation of insight-eliciting features. DFS takes relational tables as input, and is able to process the various types of data held within such a data structure. To be successful, DFS aims to think like a data scientist, looking to turn insightful questions into input features.
The DFS algorithm walks the relationships between tables and applies feature functions as it does so, creating a final feature step by step. As it performs this walk, DFS stacks the calculations of the mathematical functions to a particular depth, and this is where the name Deep Feature Synthesis is derived from.
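The idea of stacking calculations to a depth can be illustrated with a small sketch. This is not the authors' implementation: the tables, column names, and the `aggregate` helper below are hypothetical, chosen only to show one aggregate (a per-order sum) being fed into another (a per-customer mean) while walking items -> orders -> customers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables (not from the paper): items -> orders -> customers.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
]


def aggregate(child_rows, key, value, func):
    """Group child rows by a foreign key and reduce one column with func."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: walk items -> orders and compute SUM(items.price) per order.
order_total = aggregate(items, "order_id", "price", sum)

# Attach the depth-1 feature to each order so it can be aggregated again.
orders_with_total = [
    dict(o, total=order_total.get(o["order_id"], 0.0)) for o in orders
]

# Depth 2: walk orders -> customers and compute MEAN(SUM(items.price)).
customer_mean_order_total = aggregate(
    orders_with_total, "customer_id", "total", mean
)

print(customer_mean_order_total)
```

Each additional relationship walked adds one more level of function composition, which is one way to read the "depth" described above.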
Depending on the input data types, a number of mathematical functions are applied at 2 distinct levels in the DSM: entity and relational. Entity-level features focus on conversion and translation functions, such as changing data representations, rounding numbers, and breaking existing generalized attributes out into more numerous and more concise attributes. Relational-level features are concerned with the relationships between entities in tables (think about your primary and foreign keys). These feature functions are then able to extract related data from other tables to associate with a given feature (for example, finding the max item price or item count associated with an order), data which could potentially be exploited as a useful feature to feed into a model.
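A minimal sketch of the two levels, under stated assumptions: the `orders` and `items` tables and the function names `entity_features` and `relational_features` are invented for illustration and do not come from the paper. The entity-level function only looks at one row; the relational-level function follows the `order_id` key into a second table.

```python
from datetime import date

# Hypothetical tables; column names are illustrative only.
orders = [
    {"order_id": 1, "placed_on": "2015-10-12", "shipping_cost": 4.99},
    {"order_id": 2, "placed_on": "2015-10-17", "shipping_cost": 12.49},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
]


def entity_features(order):
    """Entity level: transform columns of a single row."""
    placed = date.fromisoformat(order["placed_on"])
    return {
        "month": placed.month,                  # split a date into parts
        "is_weekend": placed.weekday() >= 5,    # Saturday=5, Sunday=6
        "shipping_cost_rounded": round(order["shipping_cost"]),
    }


def relational_features(order, item_rows):
    """Relational level: aggregate rows of a related table via the key."""
    prices = [i["price"] for i in item_rows if i["order_id"] == order["order_id"]]
    return {
        "max_item_price": max(prices, default=None),
        "item_count": len(prices),
    }


feature_rows = [
    {**entity_features(o), **relational_features(o, items)} for o in orders
]
for row in feature_rows:
    print(row)
```

The resulting dictionaries are flat feature rows, one per order, of the kind that could be handed to a model.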
Machine Learning Pathway
To start off the DSM's machine learning pathway, one of the input features is chosen to model, which is referred to as the target value, and which is used to form the prediction problem. Appropriate features, known as predictors, are selected via metadata to help in the prediction process. The DSM then creates a pathway for data preprocessing, feature selection, dimensionality reduction, modeling, and evaluation, all of which is parameterized and available for reuse if necessary.
Parameter optimization is accomplished using a Copula Process, and an attempt is made to reduce the number of features by observing correlation. The reduced set of features is then tested on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield. By its use of autotuning, which the authors argue is absolutely critical to its performance, the DSM was able to increase its score at all three of its competitions.
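The article does not spell out how correlation is used to reduce the feature set, so the following is only one plausible reading, sketched under that assumption: keep a feature unless it is highly correlated with one already kept. The `pearson` and `prune_correlated` helpers, the 0.95 threshold, and the feature values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (sx * sy)


def prune_correlated(features, threshold=0.95):
    """Greedily keep features whose |r| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate features, one value per training row.
features = {
    "item_count": [1, 2, 3, 4, 5],
    "item_count_x2": [2, 4, 6, 8, 10],  # redundant: a rescaling of item_count
    "max_item_price": [9.0, 3.0, 7.0, 1.0, 8.0],
}

kept = prune_correlated(features)
print(kept)
```

A greedy pass like this depends on feature order; it is shown because it is short, not because the DSM is known to work this way.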
Discussion
What this all seems to suggest, essentially, is this: The DSM uses intelligent relational database relation-walking to help build and establish candidate features, narrows this feature set down by looking for correlated values, and uses combinatorics in what amounts to brute-force feature engineering, applying iterative feature subsets to sample data while recombining them for optimization until the best possible solution is found.
To measure the DSM's performance, it was entered in competitions at KDD Cup 2014, IJCAI, and KDD Cup 2015, where, as mentioned, it outperformed more than 2/3 of the human competitors. Kanter & Veeramachaneni claim that even during its worst performance (IJCAI), the DSM still managed to frame the prediction problem in similar terms to human competitors, evidenced by the fact that it proceeded in the task by pursuing similar avenues of data modeling. In this same competition, it finished with an AUC within approximately 0.04 of the contest winner's, suggesting that the DSM captured what could be considered the major aspects of the competition dataset.
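For readers less familiar with the metric, AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; the data are toy values and have nothing to do with the competitions mentioned above.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 0, 1, 0]
perfect_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]    # every positive outranks every negative
imperfect_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]  # one negative outranks one positive

gap = auc(labels, perfect_scores) - auc(labels, imperfect_scores)
print(auc(labels, perfect_scores), auc(labels, imperfect_scores), gap)
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations rather than full competition datasets.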
Kanter & Veeramachaneni argue that, while it cannot currently compete with the highest-performing human scientists, the DSM nevertheless has a role alongside them. Even though a number of humans beat out the DSM in each of its competitions, it was able to outperform the majority of them with considerably less effort (less than 12 hours versus months, in some cases). They suggest that, in light of this, it can be used for setting benchmarks as well as for fostering creativity. Front-loading feature engineering and generating sets of potentially top-performing features could allow humans to move on to rethinking the problem within hours, effectively starting with the DSM solution and moving forward from that point.
It should be noted that, while the DSM is impressive, it's hardly the first system aiming to automate machine learning. Other examples include the many systems that automatically build models to bid on advertising, and KXEN Model Factory (now part of SAP), which already offered Automated Model Building in 2010. Also, it is clear that the DSM is not useful for all types of data: it is a system focused solely on the exploitation of relational datasets. It is also yet to be shown that it can be effective on relational datasets that do not conform to the previously identified data science competition problem pattern.
The DSM has already been spun off into a startup called FeatureLab, touted as "Insights with an Interface," with Kanter as its CEO. The website states "Do more with your data, without more data scientists," and claims that it is the "best solution for companies looking to increase their data science resources." These are both bold claims, especially in light of the fact that none of the individual pieces of the DSM can really be considered breakthroughs.
It is entirely possible that FeatureLab gets lost in a cloud of "business intelligence" service platforms. But Big Data is not going away, and feature engineering has been one of the hottest topics in machine learning over the past 12 months. It just may be that the DSM's particular combination of technologies, at what may end up being the right time, leads to a new way of thinking about data science.
Margo Seltzer, a Harvard computer science professor, has stated in reference to the DSM, "I think what they've done is going to become the standard quickly, very quickly." If this is the case, FeatureLab stands to be well positioned.
You can read more about Kanter & Veeramachaneni's Data Science Machine here.
Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.
Related:
3 Things About Data Science You Won't Find In Books
Seven Techniques for Data Dimensionality Reduction
Aug 2015 Analytics, Big Data, Data Mining, Data Science Acquisitions, Startup Roundup