
(Ref: http://maya.cs.depaul.edu/classes/ect584/WEKA/preprocess.html, WEKA 3.4.1)

Data Preprocessing in WEKA
This exercise illustrates some of the basic data preprocessing operations that can be performed using WEKA. The sample data set used for this example is the "bank data" available in comma-separated format (bankdata.csv).

The data contains the following fields:

id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a saving account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv" format files. This is fortunate since many databases or spreadsheet applications can save or export data into flat files in this format. As can be seen in the sample data file, the first row contains the attribute names (separated by commas) followed by each data row with attribute values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the data set can be saved into ARFF format.

In this example, we load the data set into WEKA and perform a series of operations using WEKA's preprocessing filters. While all of these operations can be performed from the command line, we use the GUI of the WEKA Explorer.

Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.

Figure p1
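To make the file layout concrete, here is a minimal Python sketch (outside WEKA) that reads comma-separated text whose first row holds the attribute names. The two data rows are made up for illustration and are not taken from bankdata.csv; the function name read_csv_table is our own.

```python
import csv
import io

# Made-up rows in the layout described above; NOT taken from bankdata.csv.
SAMPLE_CSV = """id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
ID00001,40,FEMALE,town,25000.0,YES,2,NO,YES,YES,NO,NO
ID00002,31,MALE,rural,18000.5,NO,0,YES,NO,YES,YES,YES
"""

def read_csv_table(text):
    """Return (attribute_names, rows) from comma-separated text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                  # first row: attribute names
    rows = [row for row in reader if row]  # remaining rows: values in header order
    return header, rows

header, rows = read_csv_table(SAMPLE_CSV)
print(header)
print(len(rows), "data rows")
```

Every value is still a string at this point; deciding which columns are numeric and which are categorical is a separate step.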

Once the data is loaded, WEKA will recognize the attributes and, during the scan of the data, will compute some basic statistics on each attribute. The left panel in Figure p2 shows the list of recognized attributes, while the top panels indicate the names of the base relation (or table) and the current working relation (which are the same initially).

Figure p2

Clicking on any attribute in the left panel will show the basic statistics on that attribute. For categorical attributes, the frequency for each attribute value is shown, while for continuous attributes we can obtain min, max, mean, standard deviation, etc. As an example, see Figures p3 and p4 below, which show the results of selecting the "age" and "married" attributes, respectively.

Figure p3

Figure p4
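The kind of summary shown in these panels can be reproduced with a few lines of Python. This is only a sketch on made-up values; the function names are our own, and the use of the sample (n-1) standard deviation is an assumption rather than a statement about how WEKA computes it.

```python
from collections import Counter
import statistics

def summarize_nominal(values):
    """Frequency of each distinct value of a categorical attribute."""
    return Counter(values)

def summarize_numeric(values):
    """Basic statistics of a continuous attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumption)
    }

ages = [23, 35, 35, 47, 60]                 # made-up values
married = ["YES", "NO", "YES", "YES", "NO"]  # made-up values

age_stats = summarize_numeric(ages)
married_counts = summarize_nominal(married)
print(age_stats)
print(married_counts)
```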

Note that the visualization in the bottom right panel is a form of cross-tabulation across two attributes. For example, in Figure p4 above, the default visualization panel cross-tabulates "married" with the "pep" attribute (by default the second attribute is the last column of the data file). You can select another attribute using the drop-down list.

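The cross-tabulation described above simply counts how often each pair of values occurs together. A minimal Python sketch, using made-up records and a hypothetical helper named cross_tabulate:

```python
from collections import Counter

def cross_tabulate(records, first, second):
    """Count how often each (first, second) value pair occurs."""
    return Counter((r[first], r[second]) for r in records)

# Made-up records; only the two attributes being compared are included.
records = [
    {"married": "YES", "pep": "NO"},
    {"married": "YES", "pep": "YES"},
    {"married": "NO", "pep": "YES"},
    {"married": "YES", "pep": "NO"},
]

table = cross_tabulate(records, "married", "pep")
for (married, pep), count in sorted(table.items()):
    print(f"married={married} pep={pep}: {count}")
```
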
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id" attribute). We need to remove this attribute before the data mining step. We can do this in one of two ways:

(1) Select the attribute and click the "Remove" button, as shown in Figure p5 (WEKA 3.6.2).

Figure p5

(2) Use the attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a popup window with a list of available filters. Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p6.

Figure p6

Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog box enter the index of the attribute to be filtered out (this can be a range or a list separated by commas). In this case, we enter 1, which is the index of the "id" attribute (see the left panel). Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1 will be filtered out). Then click "OK" (see Figure p7). Now, in the filter box you will see "Remove -R 1" (see Figure p8).

Figure p7

Figure p8

Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and create a new working relation (whose name now includes the details of the filter that was applied). The result is depicted in Figure p9:

Figure p9
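The effect of the Remove filter and its "invertSelection" option can be mimicked in a few lines of Python. This is only a sketch of the behavior described above, not WEKA's implementation; the function names and the sample row are hypothetical, and only plain 1-based indices and ranges such as "1,3-5" are handled.

```python
def parse_index_range(spec, num_attributes):
    """Expand a 1-based spec such as "1" or "1,3-5" into 0-based positions."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
        else:
            start = end = int(part)
        if not 1 <= start <= end <= num_attributes:
            raise ValueError(f"index out of range: {part}")
        selected.update(range(start - 1, end))
    return selected

def remove_attributes(header, rows, spec, invert_selection=False):
    """Drop the selected columns, or keep only them if invert_selection is True."""
    selected = parse_index_range(spec, len(header))
    if invert_selection:
        keep = sorted(selected)
    else:
        keep = [i for i in range(len(header)) if i not in selected]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

header = ["id", "age", "sex"]
rows = [["ID00001", "40", "FEMALE"]]  # made-up row

print(remove_attributes(header, rows, "1"))
print(remove_attributes(header, rows, "1", invert_selection=True))
```

The second call shows why "invertSelection" must stay false here: with it set to true, only the "id" column survives.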

It is possible now to apply additional filters to the new working relation. In this example, however, we will save our intermediate results as separate data files and treat each step as a separate WEKA session. To save the new working relation as an ARFF file, click on the "save" button in the top panel. Here, as shown in the "save" dialog box (see Figure p10), we will save the new relation in the file "bankdataR1.arff".

Figure p10

Figure p11 shows the top portion of the newly generated ARFF file (in a text editor).

Figure p11

Note that in the new data set, the "id" attribute and all the corresponding values in the records have been removed. Also, note that WEKA has automatically determined the correct types and values associated with the attributes, as listed in the attributes section of the ARFF file.
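The step from a table of strings to an ARFF file can be sketched as follows. The type rule used here (numeric if every value parses as a number, nominal otherwise) and the listing of nominal values in order of first appearance are simplifying assumptions, not a description of WEKA's loader; the rows are made up and the function names are our own.

```python
def is_number(text):
    """True if text parses as a floating-point number."""
    try:
        float(text)
    except ValueError:
        return False
    return True

def to_arff(relation, header, rows):
    """Render a table of strings as ARFF text with inferred attribute types."""
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in rows]
        if all(is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            distinct = list(dict.fromkeys(values))  # unique values, first-seen order
            lines.append("@attribute " + name + " {" + ",".join(distinct) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"

header = ["age", "sex", "pep"]
rows = [["40", "FEMALE", "NO"], ["31", "MALE", "YES"]]  # made-up rows
arff_text = to_arff("bankdata", header, rows)
print(arff_text)
```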

Discretization
Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. There are 3 such attributes in this data set: "age", "income", and "children". In the case of the "children" attribute, the only possible values are 0, 1, 2, and 3. In this case, we have opted for keeping all of these values in the data. This means we can simply discretize by removing the keyword "numeric" as the type for the "children" attribute in the ARFF file, and replacing it with the set of discrete values. We do this directly in our text editor as seen in Figure p12. In this case, we have saved the resulting relation in a separate file "bankdata2.arff".

Figure p12
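The text-editor change amounts to rewriting one line of the ARFF header. A minimal Python sketch, assuming the declaration reads "@attribute children numeric" (the exact spelling in the file may differ) and using a hypothetical helper named make_nominal:

```python
def make_nominal(arff_text, attribute, values):
    """Replace '@attribute <name> numeric' with a declaration listing discrete values."""
    target = f"@attribute {attribute} numeric".lower()
    replacement = "@attribute " + attribute + " {" + ",".join(values) + "}"
    out, replaced = [], False
    for line in arff_text.splitlines():
        # Compare with whitespace collapsed and case ignored.
        if " ".join(line.split()).lower() == target:
            out.append(replacement)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise ValueError(f"no numeric declaration found for {attribute!r}")
    return "\n".join(out) + "\n"

# Made-up header fragment for illustration.
before = "@relation bankdata\n@attribute age numeric\n@attribute children numeric\n@data\n40,2\n"
after = make_nominal(before, "children", ["0", "1", "2", "3"])
print(after)
```

Only the header line changes; the data rows already contain the values 0 to 3 and are left untouched.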

We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can divide the ranges blindly, or use various statistical techniques to automatically determine the best way of partitioning the data. In this case, we will perform simple binning.

First we will load our filtered data set into WEKA by opening the file "bankdata2.arff". The "open" dialog box is depicted in Figure p13.

Figure p13

If we select the "children" attribute in this new data set, we see that it is now a categorical attribute with four possible discrete values. This is depicted in Figure p14.

Figure p14

Now, once again we activate the Filter dialog box, but this time, we will select "weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p15).

Figure p15

Next, to change the defaults for this filter, click on the box immediately to the right of the "Choose" button. This will open the Discretize filter dialog box. We enter the index for the attributes to be discretized. In this case we enter 1, corresponding to attribute "age". We also enter 3 as the number of bins (note that it is possible to discretize more than one attribute at the same time by using a list of attribute indices). Since we are doing simple binning, all of the other available options are set to "false". The dialog box is depicted in Figure p16. Clicking on "More" will give you details of each parameter.

Figure p16
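Simple (equal-width) binning, as configured above, can be sketched in Python as follows. The ages are made up but span 18 to 67, a range chosen so that three bins give the cut points 34.333333 and 50.666667 that appear in WEKA's labels later in this exercise; treating each interval as closed on the right is likewise inferred from those labels. The function names are our own.

```python
def equal_width_cut_points(values, bins):
    """Return the bins-1 boundaries that split [min, max] into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]

def bin_index(value, cut_points):
    """Index of the interval containing value; intervals are closed on the right."""
    for i, cut in enumerate(cut_points):
        if value <= cut:
            return i
    return len(cut_points)

ages = [18, 25, 40, 51, 67]  # made-up ages spanning 18 to 67
cuts = equal_width_cut_points(ages, 3)
print([round(c, 6) for c in cuts])
print([bin_index(a, cuts) for a in ages])
```

Note that the boundaries depend only on the minimum and maximum, so the bins may hold very different numbers of records.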

Click "Apply" in the Filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins (see Figure p17). To examine the results, we save the new working relation in the file "bankdata3.arff" as depicted in Figure p18.

Figure p17

Figure p18

Let us now examine the new data set using our text editor. The top portion of the data is shown in Figure p18. You can observe that WEKA has assigned its own labels to each of the value ranges for the discretized attribute. For example, the lower range in the "age" attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in the data records where the original age value was in the corresponding range.

Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, WEKA automatically performs the binning and replaces the values in the "income" column with the appropriate automatically generated labels. We save the new file into "bankdata3.arff", replacing the older version.

Clearly, the WEKA labels, while readable, leave much to be desired as far as naming conventions go. We will thus use the global search/replace functions in a text editor to replace these labels with more succinct and readable ones. Replace all of the WEKA-assigned labels of the age and income attributes. Note that the attribute section (the top part) of the ARFF file must be adjusted accordingly.

Figure p19 shows the final result of the transformation and the newly assigned labels for these attribute values.

Figure p19
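The global search/replace can also be scripted. The sketch below rebuilds labels in the style quoted above and swaps them for shorter names. Several things here are assumptions: the six-decimal formatting is inferred from the two labels quoted in the text, the on-disk quoting of a label is assumed to be a single-quoted string with escaped inner quotes, and the replacement names (age_low, age_mid, age_high) are hypothetical and are not the ones shown in Figure p19.

```python
def weka_style_labels(cut_points):
    """Build interval labels such as (-inf-34.333333] from a list of cut points."""
    bounds = ["-inf"] + [f"{c:.6f}" for c in cut_points] + ["inf"]
    labels = []
    for i in range(len(bounds) - 1):
        closing = ")" if i == len(bounds) - 2 else "]"  # last interval is open-ended
        labels.append(f"({bounds[i]}-{bounds[i + 1]}{closing}")
    return labels

def quote_label(label):
    # Assumed on-disk form: single quotes around the label, inner quotes escaped.
    return "'\\'" + label + "\\''"

def rename_labels(arff_text, mapping):
    """Replace every quoted label in the header and the data rows."""
    for old, new in mapping.items():
        arff_text = arff_text.replace(quote_label(old), new)
    return arff_text

labels = weka_style_labels([34.333333, 50.666667])
new_names = ["age_low", "age_mid", "age_high"]  # hypothetical replacement names
mapping = dict(zip(labels, new_names))

# Made-up fragment: one header line and one data row.
sample = ("@attribute age {" + ",".join(quote_label(l) for l in labels) + "}\n"
          + quote_label(labels[0]) + ",FEMALE\n")
renamed = rename_labels(sample, mapping)
print(renamed)
```

Because the same replacement runs over the whole text, the attribute section and the data rows stay consistent with each other.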

We now also change the relation name in the ARFF file to "bankdatafinal" and save the file as "bankdatafinal.arff".

You may try different numbers of bins. There is also a parameter for equal-frequency binning; try it out.
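For comparison with simple binning, equal-frequency binning chooses boundaries so that each bin holds roughly the same number of records. A minimal Python sketch on made-up values follows; placing each boundary halfway between neighboring sorted values is our own simplification and may not match how WEKA positions its cut points, and tied values can make the bins uneven.

```python
def equal_frequency_cut_points(values, bins):
    """Return bins-1 boundaries so each bin holds about len(values)/bins items."""
    ordered = sorted(values)
    n = len(ordered)
    if n < bins:
        raise ValueError("need at least as many values as bins")
    cuts = []
    for i in range(1, bins):
        pos = i * n // bins  # index of the first value in the next bin
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2)
    return cuts

ages = [18, 20, 25, 30, 40, 45, 50, 60, 67]  # made-up values
freq_cuts = equal_frequency_cut_points(ages, 3)
print(freq_cuts)
```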

Missing Values
1. Open the file bankdata.arff.
2. Check whether there are any missing values in any attribute.
3. Edit the data to create some missing values.
4. Delete some data in the region (nominal) and children (numeric) attributes. Click the OK button when finished.


5. Make a note of the label that has the maximum count in the region attribute and of the mean of the children attribute.
6. Choose the ReplaceMissingValues filter (weka.filters.unsupervised.attribute.ReplaceMissingValues). Then click the Apply button.
7. Look into the data. How did those missing values get replaced?


8. Edit bankdata.arff with a text editor. Make some data missing by replacing values with ? (try both nominal and numeric attributes). Save to bankdatamissing.arff.
9. Load bankdatamissing.arff into WEKA and observe the data and attribute information.
10. Replace the missing values by the same procedure as before.
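The behavior to look for in steps 7 and 10 can be sketched in Python. ReplaceMissingValues is documented as replacing missing nominal values with the mode and missing numeric values with the mean; the code below imitates that on made-up columns and is not WEKA's implementation. The function names are our own.

```python
from collections import Counter

MISSING = "?"  # ARFF marker for a missing value

def fill_nominal(column):
    """Replace missing entries with the most frequent label."""
    present = [v for v in column if v != MISSING]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in column]

def fill_numeric(column):
    """Replace missing entries with the mean of the observed values."""
    present = [float(v) for v in column if v != MISSING]
    mean = sum(present) / len(present)
    return [mean if v == MISSING else float(v) for v in column]

region = ["town", "rural", "?", "town", "inner_city"]  # made-up column
children = ["0", "2", "?", "1", "1"]                    # made-up column

filled_region = fill_nominal(region)
filled_children = fill_numeric(children)
print(filled_region)
print(filled_children)
```

Comparing the filled values with the label and mean noted in step 5 is a quick way to check what the filter did.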
