Create Data2 as new data with the nonsense variables and isSpam taken out, then scale:

Data2 = as.matrix(Data[, 1:19])
Dat_scaled = scale(Data2)
# Compute different types of distance matrices, from discussion section:
man_dist = as.matrix(dist(Dat_scaled, method = "manhattan"))
euc_dist = as.matrix(dist(Dat_scaled))
mink_dist = as.matrix(dist(Dat_scaled, method = "minkowski"))
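As a quick sanity check on the three metrics (toy data, not the project's email set), note that `dist()`'s Minkowski method defaults to p = 2, which is identical to the Euclidean distance:

```r
# Hypothetical 3x2 matrix just to illustrate the three distance methods
m = matrix(c(0, 0,
             3, 4,
             6, 8), nrow = 3, byrow = TRUE)

man  = as.matrix(dist(m, method = "manhattan"))  # |dx| + |dy|
euc  = as.matrix(dist(m))                        # sqrt(dx^2 + dy^2)
mink = as.matrix(dist(m, method = "minkowski"))  # default p = 2, same as Euclidean

man[1, 2]             # 3 + 4 = 7
euc[1, 2]             # sqrt(9 + 16) = 5
all.equal(euc, mink)  # TRUE: Minkowski with p = 2 is Euclidean
```

So mink_dist above duplicates euc_dist unless a p argument is passed.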
Voting method: most of this function was obtained from Nick Ulle's office hours.
vote = function(email, dist.matrix = euc_dist, Train = Dat_scaled, k) {
  neighbors = names(sort(dist.matrix[email, ])[1:k])
  prediction = Train[rownames(Train) %in% neighbors, "isSpam"]
  pred_mean = mean(prediction)
  if (pred_mean >= 0.5) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
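A minimal self-contained check of the voting logic, restating the function without the global defaults (toy distance matrix and labels, not the real email data; here Train is a small data frame that still contains isSpam, which the function indexes by name):

```r
# Hypothetical 4-observation example with row names "1".."4"
Train = data.frame(isSpam = c(TRUE, TRUE, FALSE, FALSE),
                   row.names = c("1", "2", "3", "4"))
# Symmetric toy distance matrix with matching dimnames
d = matrix(c(0, 1, 2, 9,
             1, 0, 3, 9,
             2, 3, 0, 9,
             9, 9, 9, 0), nrow = 4,
           dimnames = list(rownames(Train), rownames(Train)))

vote = function(email, dist.matrix, Train, k) {
  neighbors = names(sort(dist.matrix[email, ])[1:k])   # k nearest row names
  prediction = Train[rownames(Train) %in% neighbors, "isSpam"]
  mean(prediction) >= 0.5                              # majority vote
}

vote("1", d, Train, k = 3)  # nearest 3 are "1","2","3": 2 spam of 3 -> TRUE
```

Note that with this distance matrix the email votes on itself as one of its own neighbors; on the real data the test rows are kept out of the training columns instead.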
vote_result = vote(email, dist.matrix = euc_dist, Train = Dat_scaled, k)
How accurate voting is: help from Charles was given for this:
acc = function(email, pred_mean = vote_result, Train = Dat_scaled) {
  for (i in 1:5) {
    test_dat = group(rand.rowx, n)[i]
    train_dat = euc_dist[, unlist(group(rand.rowx, n)[i])]
    # 20 as the number of k's to plot was suggested by Nick.
    # Loop over k up to 20 to see which is best, using vote and vote_result:
    for (k in 1:20) {
      x = sapply(test_dat, vote, dist.matrix = train_dat, Train = Dat_scaled, k = k)
      res = vote_result(test_dat, pred_mean = x, Train = Dat_scaled, k)
      # Use the store and result matrices above to fill in:
      store[k, ] = 1 - res
    }
  }
}
order_df = res_dat.fram[order(res_dat.fram$error.rate), ]
Plot the k's vs. error rate:
plot(1:20, result[,6], sub="k vs. Error Rate", ylab = "Error Rate", xlab = "k", type="o")
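Assuming store ends up as a 20-row matrix of error rates (one row per k, one column per fold; the labels here are my own, not from the original code), the best k can be read off before plotting:

```r
# Hypothetical per-fold error rates for k = 1..20 over 5 folds
set.seed(1)
store = matrix(runif(20 * 5, 0.05, 0.20), nrow = 20, ncol = 5)

error.rate = rowMeans(store)    # average error across folds for each k
best.k = which.min(error.rate)  # k with the lowest cross-validated error

plot(1:20, error.rate, sub = "k vs. Error Rate",
     ylab = "Error Rate", xlab = "k", type = "o")
```

best.k is then what gets passed to the final prediction step on the test set.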
I used several methods to explore the misclassified observations, but found the most interesting to be subjectSpamWords, which I will talk about in this report. It is no surprise to me that this is a good classifier, as I can usually tell the difference between spam and non-spam just by reading the subject.
library(lattice)
densityplot(~ subjectSpamWords, Data2, groups = isSpam, col = c("green", "blue"),
            main = "Ham (green) and Spam (blue) for subjectSpamWords")
On the plot below, we can see that there is a significant difference in the densities for ham and spam emails for the variable subjectSpamWords. Ham clearly has a higher density when there are no spam words in the subject, and spam clearly has a higher density when there are spam words in the subject.
For the classification tree, I used code from class and modified it:

library(rpart)
ct = rpart(factor(isSpam) ~ ., Dat_scaled)
# Makes a much nicer tree; found on r-project.org
library(rpart.plot)
prp(ct)
To get the confusion matrix from rpart:
Compare the test and training data sets to see if they have similar
characteristics.
I am going to use the two models I fit with the training data to predict the values for the test set.
Examine the confusion matrix:
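A sketch of how a confusion matrix comes out of rpart via predict() (using the kyphosis data that ships with rpart as a stand-in, since the email objects aren't reproduced here):

```r
library(rpart)

# Stand-in data: kyphosis is bundled with rpart
fit = rpart(Kyphosis ~ ., kyphosis)

# type = "class" returns predicted class labels rather than probabilities
pred = predict(fit, kyphosis, type = "class")

conf = table(true = kyphosis$Kyphosis, prediction = pred)
conf
sum(diag(conf)) / sum(conf)  # overall accuracy
```

The same pattern (predict with type = "class", then table against the true labels) applies to the email tree fit above.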
# First, combine training and test data with rbind.
# Using original trainVariables instead of Data2 so that the two match:
emails = rbind(testVariables, trainVariables)
# Get distance matrix and scale it, with isSpam removed again:
euc_dist = as.matrix(dist(scale(emails[, 1:29])))
# Get the prediction
idx = 1:nrow(testVariables)
# Get train matrix
train = euc_dist[6542:8541, 1:6541]
prediction = sapply(idx, vote, dist.matrix = train, Train = trainVariables, k = best.k)
# Get the confusion matrix, comparing the true test labels to the predictions:
true = testVariables[idx, "isSpam"]
table(true, prediction)
        prediction
true    FALSE TRUE
  FALSE  1447   67
  TRUE     87  398
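From the counts above, the overall accuracy and error rate work out as:

```r
# Confusion-matrix counts from the table above (filled column-wise)
conf = matrix(c(1447, 87, 67, 398), nrow = 2,
              dimnames = list(true = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

accuracy = sum(diag(conf)) / sum(conf)  # (1447 + 398) / 1999, about 0.923
error.rate = 1 - accuracy               # about 0.077
```

So the k-NN voter misclassifies roughly 8% of the test emails, with false positives (67 ham flagged as spam) somewhat rarer than false negatives (87 spam missed).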
# For rpart on test data:
library(rpart)
library(rpart.plot)
ct = rpart(factor(isSpam) ~ ., testVariables)
prp(ct)