1)
Data Preprocessing in WEKA

This exercise illustrates some of the basic data preprocessing operations that can be performed using WEKA. The sample data set used for this example is the "bankdata" data, available in comma-separated format (bankdata.csv).

The data contains the following fields:

id          a unique identification number
age         age of customer in years (numeric)
sex         MALE/FEMALE
region      inner_city/rural/suburban/town
income      income of customer (numeric)
married     is the customer married (YES/NO)
children    number of children (numeric)
car         does the customer own a car (YES/NO)
save_acct   does the customer have a saving account (YES/NO)
Loading the Data

In addition to the native ARFF data file format, WEKA has the capability to read in ".csv" format files. This is fortunate, since many databases and spreadsheet applications can save or export data into flat files in this format. As can be seen in the sample data file, the first row contains the attribute names (separated by commas), followed by each data row with attribute values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the data set can be saved into ARFF format.

In this example, we load the data set into WEKA and perform a series of operations using WEKA's preprocessing filters. While all of these operations can be performed from the command line, we use the GUI interface of the WEKA Explorer.

Initially (in the Preprocess tab), click "open" and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.
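To make the relationship between the two file formats concrete, the following is a minimal Python sketch of a CSV-to-ARFF conversion. It is not WEKA's own loader: the function names, the sample rows, and the rule "a column is numeric if every value parses as a number" are assumptions made for illustration, and the order in which WEKA lists nominal values may differ from the sorted order used here.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds attribute names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            # Every value is a number: declare the attribute numeric.
            lines.append("@attribute %s numeric" % name)
        else:
            # Otherwise declare a nominal attribute listing its distinct values.
            values = ",".join(sorted(set(column)))
            lines.append("@attribute %s {%s}" % (name, values))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical rows, not taken from the real data file.
sample_csv = "id,age,sex,married\nC1,48,FEMALE,NO\nC2,40,MALE,YES\n"
arff_text = csv_to_arff(sample_csv, "bankdata")
print(arff_text)
```

The header section (the "@attribute" lines) is what the later steps of this exercise edit by hand in a text editor.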
Figure p1

Figure p2

Clicking on any attribute in the left panel will show the basic statistics on that attribute. For categorical attributes, the frequency for each attribute value is shown, while for continuous
Figure p3

Figure p4

Note that the visualization in the bottom right panel is a form of cross-tabulation across two attributes. For example, in Figure p4 above, the default visualization panel cross-tabulates "married" with the "pep" attribute (by default the second attribute is the last column of the data file). You can select another attribute using the drop-down list.
Selecting or Filtering Attributes

In our sample data file, each record is uniquely identified by a customer id (the "id" attribute). We need to remove this attribute before the data mining step. We can do this by (1) simply selecting the attribute and clicking on the Remove button as shown in Figure p5 (WEKA 3.6.2) or

Figure p5

Figure p6

Figure p7

Figure p8

Figure p9
button in the top panel. Here, as shown in the "save" dialog box (see Figure p10), we will save the new relation in the file "bankdataR1.arff".

Figure p10

Figure p11 shows the top portion of the newly generated ARFF file (in a text editor).

Figure p11
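The effect of removing the "id" attribute can be sketched outside WEKA as dropping one column from the header and from every record. This is only an illustration in plain Python; the function name and the sample rows are hypothetical, and WEKA performs the equivalent step through its Remove button or filter.

```python
def remove_attribute(header, rows, name):
    """Return a copy of the header and rows without the named attribute."""
    idx = header.index(name)
    new_header = header[:idx] + header[idx + 1:]
    new_rows = [row[:idx] + row[idx + 1:] for row in rows]
    return new_header, new_rows


# Hypothetical records, not taken from the real data file.
header = ["id", "age", "sex"]
rows = [["C1", "48", "FEMALE"], ["C2", "40", "MALE"]]

new_header, new_rows = remove_attribute(header, rows, "id")
print(new_header)
print(new_rows)
```

Because the identifier is different for every record, it carries no pattern a mining algorithm could use, which is why it is dropped before the data mining step.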
Discretization

Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. There are 3 such attributes in this data set: "age", "income", and "children". In the case of the "children" attribute, the only possible values are 0, 1, 2, and 3. In this case, we have opted for keeping all of these values in the data. This means we can discretize simply by removing the keyword "numeric" as the type for the "children" attribute in the ARFF file and replacing it with the set of discrete values. We do this directly in our text editor, as seen in Figure p12. In this case, we have saved the resulting relation in a separate file, "bank data2.arff".

Figure p12
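The manual edit to the "children" declaration can also be expressed as a single text substitution. The sketch below assumes the header line is spelled exactly "@attribute children numeric" in the saved ARFF file; the function name and the sample header are hypothetical.

```python
def make_nominal(arff_text, attribute, values):
    """Replace a numeric attribute declaration with a nominal value set."""
    old = "@attribute %s numeric" % attribute
    new = "@attribute %s {%s}" % (attribute, ",".join(values))
    if old not in arff_text:
        raise ValueError("declaration not found: " + old)
    # Only the declaration changes; the data rows already hold 0, 1, 2 or 3.
    return arff_text.replace(old, new, 1)


# Hypothetical header fragment.
sample = "@attribute married {NO,YES}\n@attribute children numeric\n@data\nNO,1"
edited = make_nominal(sample, "children", ["0", "1", "2", "3"])
print(edited)
```

No data row needs to change here, because the values themselves are kept; only their declared type is different.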
Figure p13

If we select the "children" attribute in this new data set, we see that it is now a categorical attribute with four possible discrete values. This is depicted in Figure p14.

Figure p15

Figure p16

Figure p17

Figure p18
Let us now examine the new data set using our text editor. The top portion of the data is shown in Figure p18. You can observe that WEKA has assigned its own labels to each of the value ranges for the discretized attribute. For example, the lower range in the "age" attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in the data records where the original age value was in the corresponding range.

Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, WEKA automatically performs the binning and replaces the values in the "income" column with the appropriate automatically generated labels. We save the new file into "bank data3.arff", replacing the older version.

Clearly, the WEKA labels, while readable, leave much to be desired as far as naming conventions go. We will thus use the global search/replace functions in the text editor to replace these labels with more succinct and readable ones.

Replace all of the WEKA-assigned labels of the age and income attributes. Note that the attribute section (the top part) of the ARFF file must be adjusted accordingly.

Figure p19 shows the final result of the transformation and the newly assigned labels for these attribute values.
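The range labels quoted above are consistent with equal-width binning into 3 bins. As a rough sketch of how such cut points and labels arise, the Python code below splits the span between the smallest and largest value into equal intervals. The age bounds 18 and 67 are an assumption inferred from the labels, not stated in this exercise, and the number formatting only approximates what WEKA writes.

```python
def equal_width_cut_points(values, bins):
    """Return the bins - 1 cut points that split [min, max] into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_label(value, cuts):
    """Return a range label of the form (a-b] for the bin holding value."""
    lower = "-inf"
    for cut in cuts:
        if value <= cut:
            return "(%s-%.6f]" % (lower, cut)
        lower = "%.6f" % cut
    # Values above the last cut point fall into the open-ended top bin.
    return "(%s-inf)" % lower


# Assumed extremes of the "age" attribute; only min and max affect the cuts.
ages = [18, 30, 40, 60, 67]
cuts = equal_width_cut_points(ages, 3)
labels = [bin_label(age, cuts) for age in ages]
print(cuts)
print(labels)
```

Equal-width bins are simple to compute but can hold very different numbers of records, which is worth keeping in mind for a skewed attribute such as income.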
Figure p19
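The global search/replace step can be sketched as a mapping from each generated label to a shorter one, applied to the whole file so that the attribute section and the data records stay consistent. The replacement names (age_low, age_mid, age_high) and the sample text are hypothetical, and the exact quoting of the generated labels is an assumption about how they appear in the saved ARFF file.

```python
def relabel(arff_text, mapping):
    """Replace every occurrence of each old label, in header and data alike."""
    for old, new in mapping.items():
        arff_text = arff_text.replace(old, new)
    return arff_text


# Generated labels as assumed to appear in the file (quoted and escaped).
mapping = {
    r"'\'(-inf-34.333333]\''": "age_low",
    r"'\'(34.333333-50.666667]\''": "age_mid",
    r"'\'(50.666667-inf)\''": "age_high",
}

sample = "\n".join([
    "@attribute age {" + ",".join(mapping) + "}",
    "@data",
    r"'\'(-inf-34.333333]\'',FEMALE",
    r"'\'(50.666667-inf)\'',MALE",
])

cleaned = relabel(sample, mapping)
print(cleaned)
```

Replacing across the whole text at once is what keeps the declared value set and the records in agreement; editing only one of the two would leave a file that no longer loads correctly.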
We now also change the relation name in the ARFF file to "bankdatafinal" and save the file
Missing Values

1. Open the file bankdata.arff.
2. Check whether there are any missing values in any attribute.
5. Make a note of the label that has the maximum count in the "region" attribute and of the mean of the "children" attribute.
6. Choose the ReplaceMissingValues filter (weka.filters.unsupervised.attribute.ReplaceMissingValues). Then click on the Apply button.
7. Look into the data. How did those missing values get replaced?
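As a point of comparison for step 7, the sketch below fills missing values the way the ReplaceMissingValues filter is documented to: the most frequent label (mode) for a nominal attribute and the mean for a numeric one. This is a plain-Python illustration, not WEKA's code; the function name and the sample columns are hypothetical, and "?" is the ARFF marker for a missing value.

```python
from collections import Counter

MISSING = "?"  # ARFF marker for a missing value


def replace_missing(column, numeric):
    """Fill missing entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v != MISSING]
    if numeric:
        numbers = [float(v) for v in present]
        mean = sum(numbers) / len(numbers)
        return [mean if v == MISSING else float(v) for v in column]
    # Nominal attribute: use the label with the highest count.
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in column]


# Hypothetical columns with one missing entry each.
region = ["town", "?", "town", "rural"]
children = ["1", "?", "3"]

filled_region = replace_missing(region, numeric=False)
filled_children = replace_missing(children, numeric=True)
print(filled_region)
print(filled_children)
```

Comparing the filled-in values against the label and mean noted in step 5 shows whether the filter behaved as expected on the real data.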