
DATA MINING FOR BUSINESS ANALYTICS
Section 99: Spring 2015

Homework #4
Due: Saturday, April 18, 2015, 9am

_______________
(put your name above)

Total grade: _______ out of ______ points


This assignment will give you hands-on experience in building text classification models, using the
application of email spam filtering. You will use Weka to convert the textual data (emails) into
feature vectors and build text mining models for automatic spam filtering. The email messages you
will use were delivered to a particular server between 8 Apr 2007 and 6 Jul 2007. The target
variable indicates whether an email is spam or ham (non-spam). Follow the directions and
answer the questions. Your report should not be verbose, but should present your results clearly
and professionally.
[Caveat: Using large sets of words, building the models and evaluating them with cross-validation
requires a lot of memory. The assignment asks you to save your data files before each Weka run.
In the in-class lab at the beginning of the semester, you should have increased your heap size. If for
some reason you get memory/heap errors anyway, try restarting and avoid having any other
applications open, except where indicated. If you have problems for more than 5 or 10 minutes,
contact the TA or me; don't spend time being frustrated with that.]
1. On Blackboard, you will find a file called spam_data_Text.arff, which contains all the emails in
our dataset and is in a format ready for input into Weka. If you were to open this file in WordPad or
some other capable text editor, you would see that each instance is just a string of the email text
(between single quotes), with the target variable at the end. Thus the .arff file has two variables for
each instance: the text string, and the target variable (called @class@), which takes on two values,
spam and ham.
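For orientation only, the top of such a file looks roughly like the sketch below. The relation and
attribute names and the two email texts here are made-up placeholders, not the actual contents of
the file you were given:

    @relation spam_data
    @attribute email_text string
    @attribute '@class@' {spam,ham}
    @data
    'Congratulations, you have won a free prize ...',spam
    'Hi Maria, attached are the notes from Monday ...',ham
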
To be able to build spam-filtering predictive models, we need to convert the email texts into feature
vectors and engineer the features. You can do this by following the steps below (an optional,
purely illustrative scripted sketch of the same pipeline appears after the list):
1) Load spam_data_Text.arff into Weka.

2) Convert the email text to word features (in Filter: filters --> unsupervised --> attribute -->
   StringToWordVector). Change only two parameter settings, as follows:

   a. In the tokenizer parameter box, replace the default delimiters with the set I am
      providing below. You need to copy the whole thing below, including the (). You can cut
      and paste the string of delimiters from this document (on Blackboard).

      (.,:;'"?!@#$%^&*{}|[]\<>/`~1234567890-=_+)

   b. Set "useStoplist" to true.

   Click OK, and then don't forget to click Apply back on the main Preprocess page!
   This step splits the string into words/terms, using the delimiters to mark where words
   start and end. After clicking Apply, it should run for a little while, and then you'll see a large
   set of words in the Attributes list.

3) Remove non-word (noise) attributes. If you get many weird words or a mixture of words and
   symbols, you must have the wrong delimiters. Correctly applying the delimiters provided should
   give you fewer than 20 noisy features at the bottom of the list. Manually remove those
   features [check the boxes in Attributes and click Remove].

4) Binarize the features (meaning, change them from numeric features to 1 for "the word is
   present in this email" or 0 for "the word is absent"). (In Filter: filters --> unsupervised -->
   attribute --> NumericToBinary) [Click Apply!] [If the last attribute, zip, remains unbinarized,
   just remove it.]

5) Randomize the instances. (In Filter: filters --> unsupervised --> instance --> Randomize)
   [Click Apply!]

6) Save the data you just engineered as spam_data_Occurrence.arff by clicking the Save
   button at the upper right.
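
[Optional, purely illustrative: if you prefer to script the preprocessing rather than click through the
GUI, here is a minimal sketch using the Weka Java API. The class name and file paths are made up,
noise-attribute removal (step 3) is still easiest to do by hand in the Explorer, and setUseStoplist is
the setting found in older Weka releases (newer ones replace it with a stopwords handler).]

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.tokenizers.WordTokenizer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToBinary;
    import weka.filters.unsupervised.attribute.StringToWordVector;
    import weka.filters.unsupervised.instance.Randomize;

    public class BuildOccurrenceData {
        public static void main(String[] args) throws Exception {
            // Step 1: load the raw text ARFF (path is an assumption).
            Instances raw = DataSource.read("spam_data_Text.arff");

            // Step 2: split each email into word features.
            StringToWordVector s2wv = new StringToWordVector();
            WordTokenizer tok = new WordTokenizer();
            // Delimiter set as given in the handout (copy-paste from the document to be safe).
            tok.setDelimiters("(.,:;'\"?!@#$%^&*{}|[]\\<>/`~1234567890-=_+)");
            s2wv.setTokenizer(tok);
            s2wv.setUseStoplist(true);            // same two settings as in the GUI
            s2wv.setInputFormat(raw);
            Instances words = Filter.useFilter(raw, s2wv);

            // Step 3 (removing the few noisy attributes) is easiest done by hand in the GUI.

            // Step 4: binarize word counts to presence/absence.
            NumericToBinary bin = new NumericToBinary();
            bin.setInputFormat(words);
            Instances binary = Filter.useFilter(words, bin);

            // Step 5: randomize the instance order.
            Randomize rand = new Randomize();
            rand.setInputFormat(binary);
            Instances shuffled = Filter.useFilter(binary, rand);

            // Step 6: save the engineered data.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(shuffled);
            saver.setFile(new File("spam_data_Occurrence.arff"));
            saver.writeBatch();
        }
    }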

Now you are just about ready to build text classification models. Go to the Classify tab. First
select @class@ as the target variable from the pull-down list in the rectangle under the More
options button. Then choose as your classifier Naïve Bayes (classifiers --> bayes --> NaiveBayes). If
NaiveBayes is greyed out, make sure you selected the right variable as the target variable (in the
box below More options).
Task. Report the evaluation results of your model using 10-fold cross-validation. Consider it both
as a classification model and as a model that ranks cases by the likelihood of class membership.
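[Optional sanity check: a minimal scripted sketch of the same evaluation is below. It is illustrative
only; the class name is made up, the file name assumes you saved the data as in step 6, and the
class index passed to areaUnderROC is an assumption about which class Weka lists first.]

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("spam_data_Occurrence.arff");
            // The handout says the target attribute is named @class@; look it up by name.
            data.setClassIndex(data.attribute("@class@").index());

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            // Classification view: accuracy, per-class statistics, confusion matrix.
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());

            // Ranking view: area under the ROC curve (0 assumes which class is listed first).
            System.out.println("AUC = " + eval.areaUnderROC(0));
        }
    }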
Report your results below on this page. Feel free to add more pages if one page is not
enough for your results.

2. The spam-filtering model you just built is based on word occurrence (presence or absence) in a
document. Now let's use the frequency of each word in the document instead. Repeat the steps
above, starting from the original data file (step 1). In step 2, in addition to changing the two
parameters, set outputWordCounts to true. Then apply steps 3 and 5, skipping steps 4 and 6. Save
your data as spam_data_Count.arff.
Now build your prediction model using NaiveBayesMultinomial, a version of NaiveBayes that takes
multiple occurrences of a word into account (classifiers --> bayes --> NaiveBayesMultinomial).
Report the 10-fold cross-validation results and compare them with the occurrence-based results
from the previous question.
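[Optional, illustrative: if you scripted the earlier steps, the only changes for this question are
setOutputWordCounts(true) on the StringToWordVector filter and the choice of classifier. The sketch
below simply evaluates the count data you saved from the GUI; the class name, file name, and
random seed are assumptions.]

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateMultinomial {
        public static void main(String[] args) throws Exception {
            // Load the count-based data saved from the GUI.
            Instances counts = DataSource.read("spam_data_Count.arff");
            // The handout says the target attribute is named @class@; look it up by name.
            counts.setClassIndex(counts.attribute("@class@").index());

            Evaluation eval = new Evaluation(counts);
            eval.crossValidateModel(new NaiveBayesMultinomial(), counts, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }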
Report your results below on this page. Feel free to add more pages if one page is not
enough for your results.

3. As you know, error costs for spam filtering are asymmetric. Mistakenly classifying a good email
as spam (a false positive) is a lot more costly than mistakenly classifying a spam email as good
(a false negative). Let's assume each false positive costs $5 and each false negative costs a nickel ($0.05).
Calculate the total cost and expected cost (per email) based on the confusion matrix you obtained
in question 1. Copy the confusion matrix and present the formulas you used to get the results. [If
this seems foreign to you, go back and reread Chapter 6.] [Be careful with the dimensions of the
confusion matrix: which are the actuals and which are the predictions?]
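[For reference, letting FP denote the number of ham emails predicted as spam and FN the number of
spam emails predicted as ham (read these off your confusion matrix, minding which dimension holds
the actuals and which the predictions), the calculation is simply:

    Total cost = $5 x FP + $0.05 x FN
    Expected cost per email = Total cost / N, where N is the total number of emails evaluated.]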
Report your results below on this page. Feel free to add more pages if one page is not
enough for your results.

4. The models you generated so far use the full set of features/words. But not all of them are
necessarily good features. Let's investigate which features are the most discriminative; recall the
discussion on selecting the most informative attributes from early in the semester (and Chapter 3).
After loading the file spam_data_Occurrence.arff, click the Select Attributes tab. Select @class@
as the target variable in the usual pull-down box. In the Attribute Evaluator section, choose
InfoGainAttributeEval (this selects the attributes with the greatest Information Gain; see Ch. 3).
A small window will pop up asking whether to use the Ranker search method; click Yes. Then click
the Ranker box and set numToSelect to 50. In the Attribute Selection Mode section, select "Use full
training set". Start the attribute selection and you will get a ranked list of features. Look through
this list and see whether they seem to make sense as words that might separate spam from non-spam
emails. Here in the Select Attributes tab, you can change the number of attributes and rerun without
changing the underlying data.
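[For the curious, the same ranking can be produced with the Weka Java API. This is an optional,
illustrative sketch; the class name and file name are assumptions.]

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankWordsByInfoGain {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("spam_data_Occurrence.arff");
            data.setClassIndex(data.attribute("@class@").index());  // target per the handout

            // Rank all attributes by Information Gain and keep the top 50.
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(50);
            selector.setSearch(ranker);
            selector.SelectAttributes(data);     // evaluates on the full training set

            System.out.println(selector.toResultsString());
        }
    }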
Now, let's combine feature selection with modeling. Move back to the Preprocess tab. This time we
will change the data to include only these best 50 attributes. Choose Filter --> supervised -->
attribute --> AttributeSelection. Click on the parameter box showing the AttributeSelection settings.
Choose InfoGainAttributeEval as the evaluator. Choose Ranker as the search. Click on the
'Ranker...' parameter box and set numToSelect to 50. [Click OK, then OK again.] Make sure that
@class@ is the target variable (in the box next to Visualize All). Then click Apply. The result
should be that only those top-50 words are left as Attributes, along with the target variable. Save
this as spam_data_Occurrence50.arff.
Build a new Naïve Bayes model using only these 50 attributes. [Don't forget to pick the right target
variable.] Compare the results of evaluating this new model with the one you generated in question
1 using the full set of features. Analyze the results and report your findings.
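[Here too, an optional, illustrative scripted sketch of the filter-then-model step; the GUI route above
is all that is required, and the class name, file names, and random seed are assumptions.]

    import java.io.File;
    import java.util.Random;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    public class Top50ThenNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("spam_data_Occurrence.arff");
            data.setClassIndex(data.attribute("@class@").index());   // target per the handout

            // Supervised AttributeSelection filter: keep the 50 highest-information-gain words.
            AttributeSelection filter = new AttributeSelection();
            filter.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(50);
            filter.setSearch(ranker);
            filter.setInputFormat(data);
            Instances top50 = Filter.useFilter(data, filter);

            // Save the reduced dataset, matching the GUI step.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(top50);
            saver.setFile(new File("spam_data_Occurrence50.arff"));
            saver.writeBatch();

            // Evaluate Naive Bayes on the 50-attribute data with 10-fold cross-validation.
            Evaluation eval = new Evaluation(top50);
            eval.crossValidateModel(new NaiveBayes(), top50, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }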
Report your results below on this page. Feel free to add more pages if one page is not
enough for your results.
