Practical Machine Learning Tools and Techniques
Slides for Chapter 7 of Data Mining by I. H. Witten, E. Frank and M. A. Hall
Data transformations

Attribute selection
  Scheme-independent, scheme-specific
Attribute discretization
  Unsupervised, supervised, error- vs. entropy-based, converse of discretization
Projections
  Principal component analysis, random projections, partial least squares, text, time series
Sampling
  Reservoir sampling
Dirty data
  Data cleansing, robust regression, anomaly detection
Transforming multiple classes to binary ones
  Simple approaches, error-correcting codes, ensembles of nested dichotomies
Calibrating class probabilities
Just apply a learner? NO!

Scheme/parameter selection:
  treat the selection process as part of the learning process
Modifying the input:
  data engineering to make learning possible or easier
Modifying the output:
  recalibrating probability estimates
Attribute selection

Adding a random (i.e., irrelevant) attribute can significantly degrade C4.5's performance
  Problem: attribute selection based on smaller and smaller amounts of data
IBL is very susceptible to irrelevant attributes
  The number of training instances required increases exponentially with the number of irrelevant attributes
Naïve Bayes doesn't have this problem
Relevant attributes can also be harmful
Scheme-independent attribute selection

Filter approach: assess attributes based on general characteristics of the data
One method: find the smallest subset of attributes that separates the data
Another method: use a different learning scheme
  e.g., use the attributes selected by C4.5 and 1R, or the coefficients of a linear model, possibly applied recursively (recursive feature elimination)
IBL-based attribute weighting techniques:
  can't find redundant attributes (but a fix has been suggested)
Correlation-based Feature Selection (CFS):
  correlation between attributes measured by symmetric uncertainty:
    $U(A,B) = 2\,\frac{H(A) + H(B) - H(A,B)}{H(A) + H(B)} \in [0,1]$
  goodness of a subset of attributes measured by (breaking ties in favor of smaller subsets):
    $\sum_j U(A_j, C) \Big/ \sqrt{\sum_i \sum_j U(A_i, A_j)}$
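A minimal sketch of symmetric uncertainty and the CFS merit in Python (an illustration, not WEKA's implementation; it assumes numpy and integer-coded nominal attributes, and the function names are made up here):

```python
import numpy as np

def entropy(x):
    """Entropy H(X), in bits, of an integer-coded discrete array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(a, b):
    """U(A,B) = 2 (H(A) + H(B) - H(A,B)) / (H(A) + H(B)), in [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    # Joint entropy H(A,B): encode each (a, b) pair as a single symbol
    h_ab = entropy(a * (b.max() + 1) + b)
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)

def cfs_merit(X, y, subset):
    """CFS goodness: sum_j U(A_j, C) / sqrt(sum_i sum_j U(A_i, A_j))."""
    relevance = sum(symmetric_uncertainty(X[:, j], y) for j in subset)
    redundancy = sum(symmetric_uncertainty(X[:, i], X[:, j])
                     for i in subset for j in subset)
    return relevance / np.sqrt(redundancy)
```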
Attribute subsets for weather data

(Figure: the space of attribute subsets for the weather data)
Searching the attribute space

The number of attribute subsets is exponential in the number of attributes
Common greedy approaches:
  forward selection
  backward elimination
More sophisticated strategies:
  bidirectional search
  best-first search: can find the optimum solution
  beam search: an approximation to best-first search
  genetic algorithms
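A sketch of greedy forward selection (illustrative; `evaluate` is any subset-scoring function, e.g. the CFS merit above or a wrapper criterion, and higher scores are better):

```python
def forward_selection(n_attributes, evaluate):
    """Greedily grow an attribute subset while the score improves."""
    selected, best_score = [], float("-inf")
    while True:
        remaining = [a for a in range(n_attributes) if a not in selected]
        if not remaining:
            return selected
        # Try each remaining attribute; keep the best single addition
        score, attr = max((evaluate(selected + [a]), a) for a in remaining)
        if score <= best_score:   # no improvement: stop searching
            return selected
        selected.append(attr)
        best_score = score
```

Backward elimination is the mirror image: start with all attributes and greedily remove the one whose removal helps most.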
Scheme-specific selection

Wrapper approach to attribute selection:
  implement a wrapper around the learning scheme
  evaluation criterion: cross-validation performance
Time consuming:
  greedy approach with k attributes takes k² time
  prior ranking of attributes makes it linear in k
Can use a significance test to stop cross-validation for a subset early if it is unlikely to "win" (race search)
  can be used with forward or backward selection, prior ranking, or special-purpose schemata search
Learning decision tables: scheme-specific attribute selection essential
  Efficient for decision tables and Naïve Bayes
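A sketch of the wrapper criterion (an assumption: scikit-learn supplies the learner and the cross-validation here, though the slides do not prescribe any library). It plugs directly into the `forward_selection` sketch above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def make_wrapper_evaluator(X, y, estimator=GaussianNB(), folds=5):
    """Score an attribute subset by the cross-validated accuracy of the
    chosen learning scheme, restricted to those attributes."""
    def evaluate(subset):
        if not subset:
            return float("-inf")
        return cross_val_score(estimator, X[:, subset], y, cv=folds).mean()
    return evaluate

# best = forward_selection(X.shape[1], make_wrapper_evaluator(X, y))
```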
Attribute discretization

Avoids the normality assumption in Naïve Bayes and clustering
1R: uses a simple discretization scheme
C4.5 performs local discretization
Global discretization can be advantageous because it's based on more data
Apply the learner to
  the k-valued discretized attribute, or to
  k − 1 binary attributes that code the cut points
Discretization: unsupervised

Determine intervals without knowing the class labels
  When clustering, the only possible way!
Two strategies:
  Equal-interval binning
  Equal-frequency binning (also called histogram equalization)
Normally inferior to supervised schemes in classification tasks
  But equal-frequency binning works well with naïve Bayes if the number of intervals is set to the square root of the size of the dataset (proportional k-interval discretization)
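A minimal numpy sketch of both strategies (illustrative; each function returns a bin index in 0..k−1 for every value):

```python
import numpy as np

def equal_interval_bins(x, k):
    """Equal-interval binning: k equal-width intervals over the range of x."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.digitize(x, edges[1:-1])     # interior cut points only

def equal_frequency_bins(x, k):
    """Equal-frequency binning: each bin receives roughly len(x)/k values."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.digitize(x, edges[1:-1])

# Proportional k-interval discretization for naive Bayes:
# k = int(round(np.sqrt(len(x))))
```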
Discretization: supervised

Entropy-based method:
  Build a decision tree with pre-pruning on the attribute being discretized
  Use entropy as the splitting criterion
  Use the minimum description length (MDL) principle as the stopping criterion
Works well: the state of the art
To apply the MDL principle, the "theory" is
  the splitting point ($\log_2(N-1)$ bits)
  plus the class distribution in each subset
Compare description lengths before/after adding the split
Example: temperature attribute

Temperature | 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play        | Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
Formula for MDLP

N instances
Original set: k classes, entropy E
First subset: k1 classes, entropy E1
Second subset: k2 classes, entropy E2

A split is accepted only if

$\text{gain} > \frac{\log_2(N-1)}{N} + \frac{\log_2(3^k - 2) - kE + k_1 E_1 + k_2 E_2}{N}$

Results in no discretization intervals for the temperature attribute
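A sketch of this stopping test in Python (illustrative; it reuses the `entropy` helper from the CFS sketch, with `y`, `y1`, `y2` the integer-coded class labels of the full set and its two subsets):

```python
import numpy as np

def mdl_split_accepted(y, y1, y2):
    """Accept a binary split only if the information gain exceeds the
    MDL cost of encoding the split point and per-subset class counts."""
    n = len(y)
    e, e1, e2 = entropy(y), entropy(y1), entropy(y2)
    gain = e - (len(y1) * e1 + len(y2) * e2) / n
    k, k1, k2 = len(np.unique(y)), len(np.unique(y1)), len(np.unique(y2))
    delta = np.log2(3**k - 2) - k * e + k1 * e1 + k2 * e2
    return gain > (np.log2(n - 1) + delta) / n
```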
Supervised discretization: other methods

Can replace the top-down procedure by a bottom-up method
Can replace MDLP by a chi-squared test
Can use dynamic programming to find the optimum k-way split for a given additive criterion
  Requires time quadratic in the number of instances
  But can be done in linear time if the error rate is used instead of entropy
Error-based vs. entropy-based

Question: could the best discretization ever have two adjacent intervals with the same class?
Wrong answer: No. For if so,
  collapse the two,
  free up an interval,
  use it somewhere else
  (This is what error-based discretization will do)
Right answer: Surprisingly, yes
  (and entropy-based discretization can do it)
Error-based vs. entropy-based

(Figure: a 2-class, 2-attribute problem)
Entropy-based discretization can detect changes of class distribution
The converse of discretization

Make nominal values into numeric ones
1. Indicator attributes (used by IB1)
  Makes no use of potential ordering information
2. Code an ordered nominal attribute into binary ones (used by M5')
  Can be used for any ordered attribute
  Better than coding the ordering into an integer (which implies a metric)
In general: code a subset of attribute values as binary
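A sketch of the ordered coding in Python (illustrative; bit i answers "does the value come after the i-th value in the ordering?", giving k − 1 binary attributes for a k-valued attribute):

```python
def encode_ordered(values, ordering):
    """Code each value of a k-valued ordered attribute as k-1 booleans."""
    rank = {v: i for i, v in enumerate(ordering)}
    k = len(ordering)
    return [[rank[v] > i for i in range(k - 1)] for v in values]

# encode_ordered(['cool', 'mild', 'hot'], ['cool', 'mild', 'hot'])
# -> [[False, False], [True, False], [True, True]]
```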
Projections

Simple transformations can often make a large difference in performance
Example transformations (not necessarily for performance improvement):
  Difference of two date attributes
  Ratio of two numeric (ratio-scale) attributes
  Concatenating the values of nominal attributes
  Encoding cluster membership
  Adding noise to data
  Removing data randomly or selectively
  Obfuscating the data
Principal component analysis

Method for identifying the important "directions" in the data
Can rotate data into a (reduced) coordinate system that is given by those directions
Algorithm:
1. Find the direction (axis) of greatest variance
2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat
Implementation: find the eigenvectors of the covariance matrix by diagonalization
  Eigenvectors (sorted by eigenvalue) are the directions
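A minimal numpy sketch of this algorithm (illustrative; it standardizes first, which the next slide notes is customary):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]          # largest variance first
    components = eigvecs[:, order[:n_components]]
    return Z @ components, components
```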
Example: 10-dimensional data

(Figure: variance accounted for by each principal component)
Can transform the data into the space given by the components
Data is normally standardized for PCA
Could also apply this recursively in a tree learner
Random projections

PCA is nice but expensive: cubic in the number of attributes
Alternative: use random directions (projections) instead of principal components
Surprisingly, random projections preserve distance relationships quite well (on average)
  Can use them to apply kD-trees to high-dimensional data
  Can improve stability by using an ensemble of models based on different projections
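A sketch of a Gaussian random projection in numpy (illustrative; the 1/sqrt(n_components) scaling keeps expected squared distances roughly unchanged):

```python
import numpy as np

def random_projection(X, n_components, seed=0):
    """Project X onto n_components random Gaussian directions."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_components))
    return X @ R / np.sqrt(n_components)
```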
Partial least squares regression

PCA is often a preprocessing step before applying a learning algorithm
  When linear regression is applied, the resulting model is known as principal components regression
  Output can be re-expressed in terms of the original attributes
Partial least squares differs from PCA in that it takes the class attribute into account
  Finds directions that have high variance and are strongly correlated with the class
Algorithm

1. Start with standardized input attributes
2. Attribute coefficients of the first PLS direction:
   Compute the dot product between each attribute vector and the class vector in turn
3. Coefficients for the next PLS direction:
   Original attribute values are first replaced by the difference (residual) between the attribute's value and the prediction from a simple univariate regression that uses the previous PLS direction as a predictor of that attribute
   Compute the dot product between each attribute's residual vector and the class vector in turn
4. Repeat from 3
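A numpy sketch of this procedure (illustrative; it assumes X is standardized and y centered, and returns the direction scores as new features):

```python
import numpy as np

def pls_scores(X, y, n_directions):
    """Compute PLS direction scores following the steps above."""
    X = X.astype(float).copy()
    scores = []
    for _ in range(n_directions):
        w = X.T @ y                 # steps 2/3: dot product with the class
        t = X @ w                   # score vector of this PLS direction
        scores.append(t)
        # Replace each attribute by its residual from a univariate
        # regression that uses t as the predictor of that attribute
        b = (X.T @ t) / (t @ t)     # one regression slope per attribute
        X = X - np.outer(t, b)
    return np.column_stack(scores)
```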
Text to attribute vectors

Many data mining applications involve textual data (e.g., string attributes in ARFF)
Standard transformation: convert the string into a bag of words by tokenization
Attribute values are binary, word frequencies ($f_{ij}$), $\log(1 + f_{ij})$, or TF-IDF:

$f_{ij} \times \log \frac{\#\,\text{documents}}{\#\,\text{documents that include word } i}$

Practical issues:
  Only retain alphabetic sequences?
  What should be used as delimiters?
  Should words be converted to lowercase?
  Should stopwords be ignored?
  Should hapax legomena be included? Or even just the k most frequent words?
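A pure-Python sketch of the bag-of-words and TF-IDF transformation (illustrative; it lowercases and keeps only alphabetic sequences, two of the choices listed above):

```python
import math
import re
from collections import Counter

def tfidf_vectors(documents):
    """Map each document to {word: f_ij * log(#docs / #docs with word)}."""
    bags = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    n = len(documents)
    doc_freq = Counter(word for bag in bags for word in bag)
    return [{w: f * math.log(n / doc_freq[w]) for w, f in bag.items()}
            for bag in bags]

# tfidf_vectors(["the cat sat", "the dog sat", "a cat and a dog"])
```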
Time series

In time series data, each instance represents a different time step
Some simple transformations:
  Shift values from the past/future
  Compute the difference (delta) between instances (i.e., the derivative)
In some datasets, samples are not regular but time is given by a timestamp attribute
  Need to normalize by the step size when transforming
Transformations need to be adapted if attributes represent different time steps
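A numpy sketch of the shift and delta transformations (illustrative; `delta` divides by the timestamp differences to handle irregular sampling):

```python
import numpy as np

def shifted(x, lag=1):
    """Value from `lag` steps in the past (NaN where no past exists)."""
    return np.concatenate([np.full(lag, np.nan), x[:-lag]])

def delta(x, t=None):
    """Difference between consecutive instances; with timestamps t,
    normalize by step size (a discrete derivative)."""
    d = np.diff(x)
    return d / np.diff(t) if t is not None else d
```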
Sampling

Sampling is typically a simple procedure
But what if the training instances arrive one by one and we don't know the total number in advance?
  Or perhaps there are so many that it is impractical to store them all before sampling?
Is it possible to produce a uniformly random sample of a fixed size? Yes: reservoir sampling
  Fill the reservoir, of size r, with the first r instances to arrive
  Subsequent instances replace a randomly selected reservoir element with probability r/i, where i is the number of instances seen so far
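A sketch of reservoir sampling in Python (illustrative; `stream` is any iterable, so the total number of instances need never be known):

```python
import random

def reservoir_sample(stream, r, seed=None):
    """Uniformly random sample of r items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= r:
            reservoir.append(item)              # fill the reservoir first
        elif rng.random() < r / i:              # keep item with probability r/i
            reservoir[rng.randrange(r)] = item  # evict a random element
    return reservoir
```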
Automatic data cleansing

To improve a decision tree:
  Remove misclassified instances, then re-learn!
Better (of course!):
  Have a human expert check the misclassified instances
Attribute noise vs. class noise:
  Attribute noise should be left in the training set (don't train on a clean set and test on a dirty one)
  Systematic class noise (e.g., one class substituted for another): leave in the training set
  Unsystematic class noise: eliminate from the training set, if possible
Robust regression

"Robust" statistical method: one that addresses the problem of outliers
To make regression more robust:
  Minimize absolute error, not squared error
  Remove outliers (e.g., the 10% of points farthest from the regression plane)
  Minimize the median instead of the mean of the squares (copes with outliers in the x and y directions)
    Finds the narrowest strip covering half the observations
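A sketch of least median of squares for a simple line fit (illustrative; it samples candidate lines through random pairs of points and keeps the one whose median squared residual is smallest):

```python
import random

def least_median_of_squares(points, trials=1000, seed=0):
    """Fit y = a*x + b by minimizing the median squared residual."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                         # vertical pair: skip
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = sorted((y - (a * x + b)) ** 2 for x, y in points)
        median = residuals[len(residuals) // 2]
        if best is None or median < best[0]:
            best = (median, a, b)
    return best[1], best[2]                  # slope, intercept
```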
Example: least median of squares

(Figure: number of international phone calls from Belgium, 1950-1973)
Detecting anomalies

Visualization can help to detect anomalies
Automatic approach: use a committee of different learning schemes, e.g.,
  a decision tree
  a nearest-neighbor learner
  a linear discriminant function
Conservative approach: delete instances incorrectly classified by all of them
Problem: might sacrifice instances of small classes
One-class learning

Usually training data is available for all classes
But some problems exhibit only a single class at training time: one-class classification
  Test instances may belong to this class or to a new class not present at training time
  Predict either "target" or "unknown"
Some problems can be reformulated into two-class ones
Other applications truly don't have negative data, e.g., password hardening
Outlier detection

One-class classification is often called outlier or novelty detection
Generic approach: identify outliers as instances that lie beyond a distance d from a percentage p of the training data
Alternatively, estimate the density of the target class and mark low-probability test instances as outliers
  The threshold can be adjusted to obtain a suitable rate of outliers
Generating artificial data

Another possibility is to generate artificial data for the outlier class
  Can then apply any off-the-shelf classifier
  Can tune the rejection-rate threshold if the classifier produces probability estimates
Generate uniformly random data?
  Too much will overwhelm the target class!
  Can be avoided if accurate probabilities are learned rather than minimizing classification error
  Curse of dimensionality: as the number of attributes increases, it becomes infeasible to generate enough data to get good coverage of the space
Generating artificial data

Generate data that is close to the target class
  No longer uniformly distributed, so this distribution must be taken into account when computing membership scores for the one-class model
T: the target class, A: the artificial class. We want Pr[X | T] for any instance X; we know Pr[X | A]
Combine some amount of A with the instances of T and use a class probability estimator to estimate Pr[T | X]; then, by Bayes' rule:

$\Pr[X \mid T] = \frac{(1 - \Pr[T]) \, \Pr[T \mid X]}{\Pr[T] \, (1 - \Pr[T \mid X])} \Pr[X \mid A]$

For classification, choose a threshold to tune the rejection rate
How to choose Pr[X | A]? Apply a density estimator to the target class and use the resulting function to model the artificial class
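The correction as a small helper (illustrative; the arguments come straight from the formula above):

```python
def target_density(p_t_given_x, p_t, p_x_given_a):
    """Recover Pr[X|T] from the estimator's Pr[T|X], the target
    proportion Pr[T], and the artificial density Pr[X|A]."""
    return ((1 - p_t) * p_t_given_x) / (p_t * (1 - p_t_given_x)) * p_x_given_a
```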
Transforming multiple classes to binary ones

Some learning algorithms only work with two-class problems
  Sophisticated multi-class variants exist in many cases but can be very slow or difficult to implement
A common alternative is to transform multi-class problems into multiple two-class ones
Simple methods:
  Discriminate each class against the union of the others: one-vs.-rest
  Build a classifier for every pair of classes: pairwise classification
Error-correcting output codes

Multiclass problem ⇒ binary problems
Simple one-vs.-rest scheme: one-per-class coding

  class   class vector
  a       1000
  b       0100
  c       0010
  d       0001

Idea: use error-correcting codes instead
  Base classifiers predict 1011111; true class = ??
Use code words that have a large Hamming distance between any pair

  class   class vector
  a       1111111
  b       0000111
  c       0011001
  d       0101010

Can correct up to (d − 1)/2 single-bit errors
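A sketch of ECOC decoding (illustrative; predict the class whose code word is nearest, in Hamming distance, to the bits output by the base classifiers):

```python
def ecoc_decode(bits, code_words):
    """bits: predicted bit string; code_words: {class: code word}."""
    def hamming(u, v):
        return sum(x != y for x, y in zip(u, v))
    return min(code_words, key=lambda c: hamming(bits, code_words[c]))

codes = {'a': '1111111', 'b': '0000111', 'c': '0011001', 'd': '0101010'}
# ecoc_decode('1011111', codes) -> 'a'   (one bit flip away from a's word)
```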
More on ECOCs

Two criteria:
  Row separation: minimum distance between rows
  Column separation: minimum distance between columns (and their complements)
    Why? Because if columns are identical, the base classifiers will likely make the same errors
    Error correction is weakened if errors are correlated
3 classes ⇒ only 2³ possible columns, and 4 out of the 8 are complements
  Cannot achieve both row and column separation
  ECOCs only work for problems with more than 3 classes
Exhaustive ECOCs

Exhaustive code for k classes:
  Columns comprise every possible k-bit string, except for complements and the all-zero/all-one strings
  Each code word contains 2^(k−1) − 1 bits

Exhaustive code for k = 4:

  class   class vector
  a       1111111
  b       0000111
  c       0011001
  d       0101010

Construction:
  Class 1: code word is all ones
  Class 2: 2^(k−2) zeroes followed by 2^(k−2) − 1 ones
  Class i: alternating runs of 2^(k−i) 0s and 1s, with the last run one short
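The construction rule as a sketch (illustrative; it reproduces the k = 4 table above):

```python
def exhaustive_code(k):
    """Exhaustive ECOC code words for k classes, each 2^(k-1) - 1 bits."""
    width = 2 ** (k - 1) - 1
    words = ['1' * width]                    # class 1: all ones
    for i in range(2, k + 1):
        run = 2 ** (k - i)                   # run length for class i
        pattern = ('0' * run + '1' * run) * 2 ** (i - 2)
        words.append(pattern[:width])        # last run ends one bit short
    return words

# exhaustive_code(4) -> ['1111111', '0000111', '0011001', '0101010']
```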
More on ECOCs

More classes ⇒ exhaustive codes become infeasible
  The number of columns increases exponentially
Random code words have good error-correcting properties on average!
There are sophisticated methods for generating ECOCs with just a few columns
ECOCs don't work with NN classifiers
  But they do work if different attribute subsets are used to predict each output bit
Ensembles of nested dichotomies

ECOCs produce classifications, but what if we want class probability estimates as well?
  e.g., for cost-sensitive classification via minimum expected cost
Nested dichotomies:
  Decompose the multiclass problem into binary ones
  Work with two-class classifiers that can produce class probability estimates
  Recursively split the full set of classes into smaller and smaller subsets, while splitting the full dataset of instances into subsets corresponding to these subsets of classes
  This yields a binary tree of classes, called a nested dichotomy
Example

Full set of classes: [a, b, c, d]
Two disjoint subsets: [a, b] and [c, d]
Then split again: [a], [b] and [c], [d]
The nested dichotomy as a code matrix:

  class   class vector
  a       0 0 X
  b       0 1 X
  c       1 X 0
  d       1 X 1

(X = the class does not participate in that binary problem)
Probability estimation

Suppose we want to compute Pr[a | x]:
  Learn two-class models for each of the three internal nodes
  From the two-class model at the root: Pr[{a, b} | x]
  From the left-hand child of the root: Pr[{a} | x, {a, b}]
  Using the chain rule: Pr[{a} | x] = Pr[{a} | {a, b}, x] × Pr[{a, b} | x]
Issues:
  Estimation errors for deep hierarchies
  How to decide on the hierarchical decomposition of the classes?
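A tiny numeric sketch of the chain rule over the dichotomy above (the probabilities are made-up outputs of the three binary models for one instance x):

```python
p_ab = 0.7           # root model:        Pr[{a,b} | x]
p_a_given_ab = 0.6   # left-child model:  Pr[{a} | x, {a,b}]
p_c_given_cd = 0.8   # right-child model: Pr[{c} | x, {c,d}]

probs = {
    'a': p_ab * p_a_given_ab,               # 0.42
    'b': p_ab * (1 - p_a_given_ab),         # 0.28
    'c': (1 - p_ab) * p_c_given_cd,         # 0.24
    'd': (1 - p_ab) * (1 - p_c_given_cd),   # 0.06
}
# The four estimates sum to 1 by construction.
```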
Ensembles of nested dichotomies

If there is no a priori reason to prefer any particular decomposition, then use them all
  Impractical for any non-trivial number of classes
  Instead, consider a subset by taking a random sample of the possible tree structures
    Cache models, since a given two-class problem may occur in multiple trees
    Average the probability estimates over the trees
Experiments show that this approach yields accurate multiclass classifiers
  It can even improve the performance of methods that can already handle multiclass problems!
Calibrating class probabilities

Class probability estimation is harder than classification:
  Classification error is minimized as long as the correct class is predicted with maximum probability
  Estimates that yield correct classifications may be quite poor with respect to quadratic or informational loss
But it is often important to have accurate class probabilities
  e.g., for cost-sensitive prediction using the minimum expected cost method
Calibrating class probabilities

Consider a two-class problem. Probabilities that are correct for classification may be:
  Too optimistic: too close to either 0 or 1
  Too pessimistic: not close enough to 0 or 1

(Figure: reliability diagram showing over-optimistic probability estimation for a two-class problem)
Calibrating class probabilities

The reliability diagram is generated by collecting predicted probabilities and relative frequencies from a 10-fold cross-validation
  Predicted probabilities are discretized into 20 ranges via equal-frequency discretization
Correct the bias by using post-hoc calibration to map the observed curve to the diagonal
  A rough approach can use the data from the reliability diagram directly
  Discretization-based calibration is fast...
  ...but determining the appropriate number of discretization intervals is not easy
Calibrating class probabilities

View calibration as a function estimation problem:
  one input (the estimated class probability) and one output (the calibrated probability)
Assuming the function is piecewise constant and monotonically increasing:
  Isotonic regression minimizes the squared error between the observed class "probabilities" (0/1) and the resulting calibrated class probabilities
Alternatively, use logistic regression to estimate the calibration function
  Must use the log-odds of the estimated class probabilities as input
  Multiclass logistic regression can be used for calibration in the multiclass case
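A sketch of both routes with scikit-learn (an assumption: the slides do not name a library; `p_raw` holds cross-validated probability estimates and `y` the observed 0/1 outcomes):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def calibrate_isotonic(p_raw, y):
    """Piecewise-constant, monotonically increasing calibration map."""
    iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y)
    return iso.predict

def calibrate_logistic(p_raw, y):
    """Logistic calibration fitted on the log-odds of the raw estimates."""
    logit = lambda p: np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression().fit(logit(np.clip(p_raw, 1e-6, 1 - 1e-6)), y)
    return lambda p: lr.predict_proba(
        logit(np.clip(p, 1e-6, 1 - 1e-6)))[:, 1]
```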