
This example illustrates the use of the C4.5 (J48) classifier in WEKA. The sample data set used for
this example, unless otherwise indicated, is the bank data available in comma-separated format
(bank-data.csv). This document assumes that appropriate data preprocessing has been performed.
In this case, the ID field has been removed. Since the C4.5 algorithm can handle numeric attributes,
there is no need to discretize any of the attributes. For the purposes of this example, however, the
"Children" attribute has been converted into a categorical attribute with values "YES" or "NO".
WEKA has implementations of numerous classification and prediction algorithms. The basic
ideas behind using all of these are similar. In this example we will use the modified version of
the bank data to classify new instances using the C4.5 algorithm (note that C4.5 is
implemented in WEKA by the classifier class weka.classifiers.trees.J48). The modified
(and smaller) version of the bank data can be found in the file "bank.arff" and the new
unclassified instances are in the file "bank-new.arff".
As usual, we begin by loading the data into WEKA.



Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier, as
depicted in Figures 21a and 21b. Note that J48 (an implementation of the C4.5 algorithm) does not
require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 has
evolved.


Now we can specify the various parameters. These can be specified by clicking in the text box to the
right of the "Choose" button, as depicted in Figure 22. In this example we accept the default values.
The default version does perform some pruning (using the subtree raising approach), but does not
perform error pruning. The selected parameters are depicted in that figure.








Under "Test options" in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable
idea of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII
version of the tree, as well as the evaluation statistics, will appear in the right panel when the
model construction is completed (see Figure 23).
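
To make the idea of 10-fold cross-validation concrete, here is a minimal Python sketch of how a
data set can be partitioned so that every instance is held out exactly once. It is only an
illustration: the function name k_fold_indices is hypothetical, and the sketch omits the shuffling
and stratification that a real evaluation would normally add.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists; each instance lands in exactly one test fold."""
    indices = list(range(n_instances))
    # Deal the indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Example with a hypothetical data set of 25 instances.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # The model would be built on `train` and evaluated on `test`.
    assert not set(train) & set(test)
print(len(splits), "folds")
```

The reported accuracy is then aggregated over the k held-out folds, which is why it gives a more
reasonable estimate than testing on the training data itself.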


We can view this information in a separate window by right-clicking the last result set (inside the
"Result list" panel on the left) and selecting "View in separate window" from the pop-up menu. These
steps and the resulting window containing the classification results are depicted in Figures 24a and
24b.













Note that the classification accuracy of our model is only about 69%. This may indicate that we
need to do more work (either in preprocessing or in selecting the correct parameters for
classification) before building another model. In this example, however, we will continue with
this model despite its inaccuracy.
WEKA also lets us view a graphical rendition of the classification tree. This can be done by
right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
The tree for this example is depicted in Figure 25. Note that by resizing the window and
selecting various menu items from inside the tree view (using the right mouse button), we can
adjust the tree view to make it more readable.


We will now use our model to classify the new instances. A portion of the new instances ARFF file is
depicted in Figure 26. Note that the attribute section is identical to that of the training data (the
bank data we used for building our model). However, in the data section, the value of the "pep"
attribute is "?" (or unknown).





In the main panel, under "Test options", click the "Supplied test set" radio button, and then click
the "Set" button. This will pop up a window which allows you to open the file containing test
instances, as in Figure …



In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the
"Start" button. This once again generates the model from our training data, but this time it applies
the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value
of the "pep" attribute. The result is depicted in Figure 28. Note that the summary of the results in
the right panel does not show any statistics. This is because in our test instances the value of the
class attribute ("pep") was left as "?", thus WEKA has no actual values to which it can compare the
predicted values of new instances.
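
As a quick sanity check before running the classifier, one can confirm that every class value in
the test file really is "?". The following Python sketch does this for a simple ARFF file. It
assumes that the class attribute is the last field of each data row and that no field contains a
quoted comma; the sample text is a hypothetical two-attribute miniature, not the real
"bank-new.arff".

```python
def class_values(arff_text):
    """Return the last (class) field of every row in the @data section."""
    values = []
    in_data = False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            values.append(line.split(",")[-1].strip())
    return values


# Hypothetical miniature file, much smaller than bank-new.arff.
SAMPLE = """@relation bank-new-sketch
@attribute age numeric
@attribute pep {YES,NO}
@data
23,?
30,?
"""

labels = class_values(SAMPLE)
unlabeled = all(value == "?" for value in labels)
print(labels, unlabeled)
```

If unlabeled is true, the absence of evaluation statistics is expected rather than a sign of a
problem with the model.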

Of course, in this example we are interested in knowing how our model managed to classify the
new instances. To do so we need to create a file containing all the new instances along with their
predicted class value resulting from the application of the model. Doing this is much simpler
using the command line version of the WEKA classifier application. However, it is possible to do so
in the GUI version using an "indirect" approach, as follows.
First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up
window select the menu item "Visualize classifier errors". This brings up a separate window
containing a two-dimensional graph. These steps and the resulting window are shown in Figures
28 and 29.


For now, we are not interested in what this graph represents. Rather, we would like to save the
classification results from which the graph is generated. In the new window, we click on the "Save"
button and save the result as the file "bank-predicted.arff", as shown in Figure 30.



This file contains a copy of the new instances along with an additional column for the predicted
value of "pep". The top portion of the file can be seen in Figure 31.

Note that two attributes have been added to the original new instances data: "instance_number" and
"predictedpep". These correspond to new columns in the data portion. The "predictedpep" value for
each new instance is the last value before "?", which is the actual "pep" class value. For example,
the predicted value of the "pep" attribute for instance 0 is "YES" according to our model, while the
predicted class value for instance 4 is "NO".

Using the Command Line (Recommended)
While the GUI version of WEKA is nice for visualizing the results and setting the parameters
using forms, when it comes to building a classification (or prediction) model and then applying
it to new instances, the most direct and flexible approach is to use the command line. In fact, you
can use the GUI to create the list of parameters (for example, in the case of the J48 class) and then
use those parameters in the command line.
In the main WEKA interface, click the "Simple CLI" button to start the command line interface. The
main command for generating the classification model as we did above is:
java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model
The options -C 0.25 and -M 2 in the above command are the same options that we selected for the
J48 classifier in the previous GUI example (see Figure 22). The -t option in the command
specifies that the next string is the full directory path to the training file (in this case "bank.arff").
In the above command, directory-path should be replaced with the full directory path where the
training file resides. Finally, the -d option specifies the name (and location) where the model will
be stored. After executing this command inside the "Simple CLI" interface, you should see the
tree and stats about the model in the top window (see Figure 32).
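
When the same command has to be issued for several directories, it can help to assemble the
argument list programmatically rather than retyping the paths. The Python sketch below only builds
the list of arguments shown above; the function name j48_train_command and the directory C:\data
are hypothetical, and actually running the command outside the "Simple CLI" would additionally
require WEKA to be on the Java classpath.

```python
from pathlib import PureWindowsPath


def j48_train_command(directory, confidence=0.25, min_instances=2):
    """Build the J48 training command, filling in directory-path for -t and -d."""
    base = PureWindowsPath(directory)
    return [
        "java", "weka.classifiers.trees.J48",
        "-C", str(confidence),         # same -C value as in the command above
        "-M", str(min_instances),      # same -M value as in the command above
        "-t", str(base / "bank.arff"),   # training file
        "-d", str(base / "bank.model"),  # where the model will be stored
    ]


# C:\data is a placeholder for the real directory path.
command = j48_train_command("C:\\data")
print(" ".join(command))
```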


Based on the above command, our classification model has been stored in the file "bank.model"
and placed in the directory we specified. We can now apply this model to the new instances. The
advantage of building a model and storing it is that it can be applied at any time to different sets
of unclassified instances. The command for doing so is:
java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff
In the above command, the option -p 9 indicates that we want to predict a value for attribute
number 9 (which is "pep"). The -l option specifies the directory path and name of the model file
(this is what was created in the previous step). Finally, the -T option specifies the name (and
path) of the test data. In our example, the test data is our new instances file, "bank-new.arff".
This command results in a 4-column output similar to the following:
0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
The first column is the instance number assigned to the new instances in "bank-new.arff" by
WEKA. The 2nd column is the predicted value of the "pep" attribute for the new instance. The
3rd column is the confidence (prediction accuracy) for that instance. Finally, the 4th column is
the actual "pep" value in the test data (in this case, we did not have a value for "pep" in
"bank-new.arff", thus this value is "?"). For example, in the above output, the predicted value of
"pep" in instance 2 is "YES" with a confidence of 95%. A portion of the final result is depicted in
Figure 33.


The above output is preferable to the output derived from the GUI version of WEKA. First, this is a
more direct approach which allows us to save the classification model. This model can be applied to
new instances later without having to regenerate the model. Secondly (and more importantly), in
contrast to the final output of the GUI version, in this case we have independent confidence
(accuracy) values for each of the new instances. This means that we can focus only on those
predictions with which we are more confident. For example, in the above output, we could filter out
any instance whose predicted value has an accuracy of less than 83%.
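
The filtering step described here is easy to script once the 4-column output has been saved as
text. The following Python sketch parses the five sample rows from the output above and keeps only
the predictions at or above a chosen confidence. The function names and the whitespace-separated
layout are assumptions made for illustration; the exact output format may differ between WEKA
versions.

```python
def parse_predictions(output):
    """Parse rows of 'instance predicted confidence actual' into tuples."""
    rows = []
    for line in output.strip().splitlines():
        instance, predicted, confidence, actual = line.split()
        rows.append((int(instance), predicted, float(confidence), actual))
    return rows


def confident_only(rows, threshold):
    """Keep only the rows whose confidence is at least the threshold."""
    return [row for row in rows if row[2] >= threshold]


# The five sample rows from the 4-column output shown earlier.
SAMPLE_OUTPUT = """\
0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
"""

rows = parse_predictions(SAMPLE_OUTPUT)
kept = confident_only(rows, threshold=0.83)
for instance, predicted, confidence, _ in kept:
    print(instance, predicted, round(confidence, 3))
```

With a threshold of 0.83, instances 0 and 1 are dropped and instances 2, 3 and 4 are kept; raising
or lowering the threshold trades coverage against confidence.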
