
Spam Email Classification 3 - Austin Kinion


# I certify that I have acknowledged any code that I used from any other person
# in the class, from Piazza or any Web site or book or other source.
# Any other work is my own.
print(load(url("http://eeyore.ucdavis.edu/stat141/Data/trainVariables.rda")))
Data = trainVariables

Pulling out some of the uninformative variables:


Data$numLinesInBody= NULL
Data$bodyCharacterCount=NULL
Data$subjectQuestCount=NULL
Data$percentSubjectBlanks=NULL
Data$messageIdHasNoHostname=NULL
Data$percentForwards=NULL
Data$isDear=NULL
Data$averageWordLength=NULL
Data$percentHTMLTags=NULL
Data$numAttachments=NULL

Create Data2 as the data with the uninformative variables and isSpam removed, then scale it.

Data2 = as.matrix(Data[, 1:19])
Dat_scaled = scale(Data2)
# Compute different types of distance matrices, from discussion section:
man_dist = as.matrix(dist(Dat_scaled, method = "manhattan"))
euc_dist = as.matrix(dist(Dat_scaled))
mink_dist = as.matrix(dist(Dat_scaled, method = "minkowski"))
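As a quick sanity check on what these metrics compute, here is a small sketch (a toy example, not part of the assignment code) comparing the three distances on two 2-D points; note that with `dist`'s default p = 2, the Minkowski distance coincides with the Euclidean distance:

```r
# Toy sketch (names assumed): two 2-D points to compare the metrics on
toy = rbind(a = c(0, 0), b = c(3, 4))

dist(toy, method = "manhattan")   # |3| + |4| = 7
dist(toy)                         # sqrt(3^2 + 4^2) = 5 (Euclidean)
dist(toy, method = "minkowski")   # default p = 2, same as Euclidean: 5
```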

Voting method: most of this function was obtained from Nick Ulle's OH's.

vote = function(email, dist.matrix = euc_dist, Train = Data, k) {
  # Find the k nearest neighbors of this email
  neighbors = names(sort(dist.matrix[email, ])[1:k])
  # Look up the isSpam labels of those neighbors
  prediction = Train[rownames(Train) %in% neighbors, 'isSpam']
  pred_mean = mean(prediction)
  # Majority vote: classify as spam if at least half of the neighbors are spam
  if (pred_mean >= 0.5) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
# Example call for a single email, e.g. row 1 with k = 5:
# vote_result = vote(1, dist.matrix = euc_dist, Train = Data, k = 5)

How accurate voting is (help from Charles was given for this):

acc = function(email, pred_mean = vote_result, Train = Data, k) {
  # Proportion of emails whose true label matches the prediction
  accur = mean(Train[email, 'isSpam'] == pred_mean)
  return(cbind(accur, k))
}

knn and CV for getting best k:


First shuffle:

set.seed(5779616)
rand.rowx = sample(1:nrow(Dat_scaled))

Then split into 5 groups for 5-fold CV:

n = 5
group = function(rand.rowx, n) split(rand.rowx, factor(sort(rank(rand.rowx) %% n)))
group(rand.rowx, n)

These next two lines are from Nick Ulle:

store = matrix(NA, ncol = 2, nrow = 20)
result = matrix(NA, ncol = 5, nrow = 20)
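As a small illustration of the fold-splitting trick (a toy sketch, not part of the assignment code), splitting ten shuffled indices with the same `rank(...) %% n` idiom yields five folds of two indices each:

```r
# Toy sketch: split 10 shuffled indices into 5 folds of equal size
rand = sample(1:10)
folds = split(rand, factor(sort(rank(rand) %% 5)))
length(folds)           # 5 folds
sapply(folds, length)   # each fold contains 2 indices
```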

Write a loop where each group 1:5 is used as the test set (help from Charles Arnold
was given to write this function):

for (i in 1:5) {
  # Rows in fold i are the test set; drop their columns from the distance
  # matrix so neighbors are found only among the other four folds:
  test_dat = unlist(group(rand.rowx, n)[i])
  train_dat = euc_dist[, -test_dat]
  # 20 as the number of k's to try was suggested by Nick.
  # Loop over k up to 20 to see which is best, using vote and acc:
  for (k in 1:20) {
    x = sapply(test_dat, vote, dist.matrix = train_dat, Train = Data, k = k)
    res = acc(test_dat, pred_mean = x, Train = Data, k = k)
    # Use the store and result matrices above to fill in k and its error rate:
    store[k, ] = c(k, 1 - res[1])
  }
  colnames(store) = c('k', 'Error Rate')
  result[, i] = store[, 2]
}

To find the average error rate for each k, I used a stackoverflow function and modified it:

err_rate = c()
for (i in 1:20) {
  err_rate[i] = mean(result[i, ])
}

# Help from a girl in Charles' OH's was given for this next part:
# Show the result and make a data.frame:
show_result = cbind(result, err_rate)
colnames(show_result) = c("1", "2", "3", "4", "5", "Error Rate")
res_dat.fram = data.frame(cbind(1:20, err_rate))

As learned in Nick's section, order the data frame to show best to worst:

order_df = res_dat.fram[order(res_dat.fram$err_rate), ]
Plot the k's vs. error rate:

plot(1:20, err_rate, sub = "k vs. Error Rate", ylab = "Error Rate", xlab = "k", type = "o")

Finally, create the confusion matrix.

Used lecture notes and OH notes and modified them for the confusion matrix:

# best.k is the k with the lowest average error rate from the CV above:
best.k = which.min(err_rate)

prediction = sapply(rand.rowx, vote, dist.matrix = euc_dist, Train = Data, k = best.k)

# Find the truth:
t = function(email, Train = Data, k) {
  true = Train[email, 'isSpam']
  return(true)
}
true = sapply(rand.rowx, t, Train = Data, k = best.k)
table(true, prediction)
        prediction
true    FALSE  TRUE
  FALSE  4697   173
  TRUE    212  1459
So the type 1 error rate for the knn is: 173/(173+4697) = 3.5%
And the type 2 error rate for the knn is: 212/(212+1459) = 12.6%
And the total misclassification rate is (173+212)/6541 = 5.8%
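These rates can also be recomputed directly from the confusion matrix counts; a small sketch (the `conf` object here is my own name, not part of the assignment code):

```r
# Recompute the knn error rates from the confusion matrix counts
conf = matrix(c(4697, 212, 173, 1459), nrow = 2,
              dimnames = list(true = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

type1 = conf["FALSE", "TRUE"] / sum(conf["FALSE", ])                 # 173/4870 ≈ 3.5%
type2 = conf["TRUE", "FALSE"] / sum(conf["TRUE", ])                  # 212/1671 ≈ 12.7%
total = (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)  # 385/6541 ≈ 5.9%
```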


I used several methods to explore the misclassified observations, but found the
most interesting to be subjectSpamWords, which I will talk about in this report.
It is no surprise to me that this is a good classifier, as I can usually tell the
difference between Spam and non-spam just by reading the subject.

library(lattice)
densityplot(~ subjectSpamWords, Data, groups = isSpam, col = c("green", "blue"),
            main = "Ham (green) and Spam (blue) for subjectSpamWords")

On the plot below, we can see a significant difference between the densities for
Ham and Spam emails for the variable subjectSpamWords. Ham clearly has a higher
density when there are no spam words in the subject, and Spam clearly has a
higher density when there are spam words in the subject.


For the classification tree, I used code from class and modified it:

library(rpart)
ct = rpart(factor(isSpam) ~ ., Data)
# Makes a much nicer tree; found on r-project.org
library(rpart.plot)
prp(ct)


To get the confusion matrix from rpart:

prediction = predict(ct, Data, type = "class")

idx = 1:nrow(Data)
true = sapply(idx, t, k = best.k, Train = Data)
table(true, prediction)
        prediction
true    FALSE  TRUE
  FALSE  4628   235
  TRUE    361  1321
So the type 1 error rate for the rpart is: 235/(235+4628) = 4.8%
And the type 2 error rate from the rpart is: 361/(361+1321) = 21.4%
And the total misclassification rate is (235+361)/6541 = 9.1%
So it is obvious that the knn and cross validation method for predicting Spam is
much better than the rpart() method, with the knn misclassification rate of 5.8%
and the rpart() misclassification rate of 9.1%.

I used several methods to explore the misclassified observations, but found the
most interesting to be isWrote, which I will talk about in this report.

library(lattice)
densityplot(~ isWrote, Data, groups = isSpam, col = c("green", "blue"),
            main = "Ham (green) and Spam (blue) for isWrote")

As we can see here, there is a significant difference between emails with and
without isWrote. The density when isWrote is TRUE is much greater for Spam than
for Ham, and the opposite is true when isWrote is FALSE.

Compare the test and training data sets to see if they have similar
characteristics. I will use the two models fit with the training data to
predict the values for the test set, and then examine the confusion matrices.
# First, combine the training and test data with rbind:
# Using the original trainVariables instead of Data2 so that the columns match
# (testVariables is assumed to have been loaded the same way as trainVariables):
emails = rbind(trainVariables, testVariables)

# Get the distance matrix from the scaled data, with isSpam removed again:
euc_dist = as.matrix(dist(scale(emails[, 1:29])))

# Get the test indices
idx = 1:nrow(testVariables)

# Distances from each test email (rows) to each training email (columns):
train = euc_dist[(nrow(trainVariables) + 1):nrow(emails), 1:nrow(trainVariables)]

prediction = sapply(idx, vote, dist.matrix = train, Train = trainVariables, k = best.k)

# Get the confusion matrix
true = sapply(idx, t, Train = testVariables, k = best.k)
table(true, prediction)

        prediction
true    FALSE  TRUE
  FALSE  1447    67
  TRUE     87   398
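For comparison with the training results, the same error rates can be read off this test-set table (a short sketch, assuming the counts above are correct):

```r
# Test-set error rates for knn, from the confusion matrix above
type1 = 67 / (67 + 1447)    # ≈ 4.4%
type2 = 87 / (87 + 398)     # ≈ 17.9%
total = (67 + 87) / 1999    # ≈ 7.7%
```

The test-set misclassification rate (7.7%) is somewhat higher than the training CV rate of 5.8%, as expected for held-out data.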

#For rpart on test data:

library(rpart)
library(rpart.plot)
ct = rpart(factor(isSpam) ~ ., testVariables)
prp(ct)

# Get the confusion matrix from rpart:

prediction = predict(ct, testVariables, type = "class")
idx = 1:nrow(testVariables)
true = sapply(idx, t, k = best.k, Train = testVariables)
table(true, prediction)
        prediction
true    FALSE  TRUE
  FALSE  1420    90
  TRUE    155   335
