
Project 1

MCIS6173

Spam and Ham

As you know, spam email is a big problem in today's exchange of messages between people. It is a
major problem and it poses a security threat to many users. In this project you will try to
classify email messages as spam or ham using simple metrics. This type of work usually involves
machine learning techniques, but we are not asking you to go that route.

Objectives:

1- Understanding the danger that spam emails pose.
2- Getting a glimpse of what spam messages can look like.
3- Identifying what constitutes a spam email.
4- Applying some techniques that are used to separate spam from ham.
5- Extracting information that supports the decision to classify an email as spam or ham.
6- Working with a real dataset that is used in industry and research on this topic.

Guidelines:

You will be given 51 email messages. Some of these are spam and some are ham. The labels for these
messages are given in the labels.txt file, where 0 means spam and 1 means ham.

Your code should perform two tasks:

a. An email message will be given to your code. Your code has to report whether the message
is ham or spam. In addition, the code should report the evidence for that decision. The
evidence takes the form of the words found in the message that led you to decide it is
ham or spam. The spam words are given in the file spamwords.txt.
The file groups the spam words into categories, so you also need to report the category of
a spam email based on the ones you see in the file. The categories appear under the ----- line.
b. Your code should also report the number of spam messages and the number of ham messages
in the dataset, based on analyzing each email message and not on the labels.
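Task (a) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the required solution: the handout does not show the layout of spamwords.txt, so the sketch assumes that each line of dashes is followed by a category name and then one spam word per line. The names parse_spam_words, classify, SAMPLE_SPAMWORDS and the threshold parameter are hypothetical.

```python
import re

# ASSUMED layout of spamwords.txt (the handout does not show the file):
# a line of dashes, then the category name, then one spam word per line.
SAMPLE_SPAMWORDS = """\
-----
commerce
buying
order
shipped
-----
finance
loan
credit
"""


def parse_spam_words(text):
    """Return {category: [words]} from text in the assumed layout."""
    categories = {}
    current = None
    expect_name = False
    for raw in text.splitlines():
        line = raw.strip().lower()
        if not line:
            continue
        if set(line) == set("-"):      # separator line such as -----
            expect_name = True
        elif expect_name:              # first line under the separator
            current = line
            categories.setdefault(current, [])
            expect_name = False
        elif current is not None:
            categories[current].append(line)
    return categories


def classify(message, categories, threshold=1):
    """Return (label, category, evidence) for one message body.

    threshold is a hypothetical tuning knob: the minimum number of
    distinct spam words needed before a message is called spam.
    Only single-word entries are matched; phrases would need extra work.
    """
    tokens = set(re.findall(r"[a-z0-9']+", message.lower()))
    best_category, best_hits = None, []
    for category, words in categories.items():
        hits = [w for w in words if w in tokens]
        if len(hits) > len(best_hits):
            best_category, best_hits = category, hits
    if len(best_hits) >= threshold:
        return "spam", best_category, best_hits
    return "ham", None, []
```

Here the category with the most matching words is reported, and the matching words themselves serve as the evidence. Raising threshold makes the filter stricter, which may reduce false alarms on ham messages that happen to contain a single spam word.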

Notes:

1- The project is to be done in groups of three or fewer. Group members have to be from the same
section in case we have several sections.
a. Forming a group, if you want one, is the responsibility of the students.
i. Therefore, not finding a group is not an excuse for skipping the project; you can
still do it on your own.
2- It is preferred that the project be developed under the Linux OS without the need to install any
special packages or libraries except the default compilers and libraries.
3- The only languages allowed are Python and Java.
4- Name the solution file spam.py or spam.java.
5- Only one code file should be submitted per group.
a. Do not submit other files, such as screenshots of the program.
6- Your code should start with a comment block.
a. This comment block has:
i. Students' names and IDs
7- You have to make sure that your code compiles and runs without errors.
a. We will not debug or fix any errors. A very low score should be expected in this case.
8- Be careful with path names.
a. Always assume the current folder/directory.
9- The command to run your code would be similar to:
python2.6 spam.py dataset/ abc.eml
java spam dataset/ abc.eml
10- An example of the output should be:
abc.eml is commerce spam
evidence: buying, order, shipped
number of spam messages is: 30 and ham is: 21
11- Copying and cheating will have serious consequences, so avoid them.
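
The run command and output format in notes 9 and 10 could be wired together along the following lines. This is a minimal sketch under several assumptions: messages in the dataset folder end in .eml, the message named on the command line lives inside that folder, and a message counts as spam as soon as one spam word is found. SPAM_WORDS is a placeholder standing in for the contents of spamwords.txt, and all function names are hypothetical. The sketch avoids newer syntax such as f-strings so that it should be adaptable to the older interpreter named in the run command.

```python
import io
import os
import re
import sys

# Placeholder word list for illustration only; a real solution would
# load the categories and words from spamwords.txt instead.
SPAM_WORDS = {"commerce": ["buying", "order", "shipped"]}


def read_text(path):
    # latin-1 maps every byte to a character, so odd encodings cannot raise.
    with io.open(path, encoding="latin-1") as handle:
        return handle.read()


def find_evidence(text):
    """Return (category, hits) for the category with the most matches."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    best = (None, [])
    for category in sorted(SPAM_WORDS):
        hits = [w for w in SPAM_WORDS[category] if w in tokens]
        if len(hits) > len(best[1]):
            best = (category, hits)
    return best


def count_dataset(dataset_dir):
    """Count spam and ham by analyzing each message, not by labels.txt."""
    n_spam = n_ham = 0
    for name in sorted(os.listdir(dataset_dir)):
        if not name.endswith(".eml"):   # ASSUMPTION: messages are *.eml
            continue
        _, hits = find_evidence(read_text(os.path.join(dataset_dir, name)))
        if hits:
            n_spam += 1
        else:
            n_ham += 1
    return n_spam, n_ham


def build_report(dataset_dir, email_name):
    # ASSUMPTION: the message named on the command line is in dataset_dir.
    path = os.path.join(dataset_dir, email_name)
    category, hits = find_evidence(read_text(path))
    n_spam, n_ham = count_dataset(dataset_dir)
    lines = []
    if hits:
        lines.append("%s is %s spam" % (email_name, category))
        lines.append("evidence: %s" % ", ".join(hits))
    else:
        lines.append("%s is ham" % email_name)
        lines.append("evidence: no spam words found")
    lines.append("number of spam messages is: %d and ham is: %d"
                 % (n_spam, n_ham))
    return "\n".join(lines)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        print(build_report(sys.argv[1], sys.argv[2]))
    else:
        print("usage: python spam.py dataset/ abc.eml")
```

Building paths with os.path.join from the dataset argument keeps everything relative to the current folder, as note 8 asks. The wording of the ham case is a guess, since the handout only shows the output for a spam message.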

Good Luck!
