
Program-5: Write a program to implement the naïve Bayesian classifier for a
sample training data set stored as a .CSV file. Compute the accuracy of the
classifier, considering a few test data sets.

Bayesian Theorem:

Bayes' rule relates the posterior probability of a class Cj given a case X to the likelihood and the prior:

P(Cj | X) = P(X | Cj) P(Cj) / P(X)

Naive Bayes: For the Bayesian rule above, we have to extend it so that the likelihood P(X | Cj) is tractable when X = (x1, x2, ..., xn) has several attributes.

Since Naive Bayes assumes that the attributes are conditionally independent given the class, we can decompose the likelihood into a product of terms:

P(X | Cj) = P(x1 | Cj) P(x2 | Cj) ... P(xn | Cj)

and rewrite the posterior as:

P(Cj | X) ∝ P(Cj) P(x1 | Cj) P(x2 | Cj) ... P(xn | Cj)

Using Bayes' rule above, we label a new case X with the class label Cj that achieves
the highest posterior probability.
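The classifier described above can be sketched in Python using only the standard library. Note that the weather-style CSV content, column names, and the 8/2 train/test split below are illustrative assumptions, not the actual data set of the exercise; in practice the CSV would be read from a file.

```python
import csv
import io
import math
from collections import defaultdict, Counter

# Illustrative stand-in for the training .CSV file (last column is the class).
CSV_DATA = """outlook,temperature,humidity,windy,play
sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes
sunny,mild,high,false,no
sunny,cool,normal,false,yes
rainy,mild,normal,false,yes
"""

def train(rows):
    """Learn class counts (for priors P(Cj)) and per-attribute value counts."""
    priors = Counter(r[-1] for r in rows)
    counts = defaultdict(Counter)          # (attr index, class) -> value counts
    for r in rows:
        for k, v in enumerate(r[:-1]):
            counts[(k, r[-1])][v] += 1
    return priors, counts, len(rows)

def classify(x, priors, counts, n):
    """Pick the class with the highest log-posterior, with add-1 smoothing."""
    best, best_score = None, -math.inf
    for c, pc in priors.items():
        score = math.log(pc / n)           # log P(Cj)
        for k, v in enumerate(x):          # + sum of log P(xk | Cj)
            vals = counts[(k, c)]
            score += math.log((vals[v] + 1) / (pc + len(vals) + 1))
        if score > best_score:
            best, best_score = c, score
    return best

rows = list(csv.reader(io.StringIO(CSV_DATA)))[1:]
train_rows, test_rows = rows[:8], rows[8:]
priors, counts, n = train(train_rows)
preds = [classify(r[:-1], priors, counts, n) for r in test_rows]
accuracy = sum(p == r[-1] for p, r in zip(preds, test_rows)) / len(test_rows)
print(f"Accuracy on held-out rows: {accuracy:.2f}")
```

Working in log space avoids underflow when many small probabilities are multiplied, and the add-1 (Laplace) smoothing prevents a zero probability for attribute values unseen in a class.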

Program-6: Assuming a set of documents that need to be classified, use the naïve
Bayesian Classifier model to perform this task. Built-in Java classes/API can be
used to write the program. Calculate the accuracy, precision, and recall for your
data set.
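Once the classifier has produced predictions, the three required metrics can be computed by comparing predicted labels against true labels. A minimal sketch, treating one class as "positive" for precision and recall (the label values below are illustrative, not taken from any real data set):

```python
def metrics(y_true, y_pred, positive):
    """Overall accuracy, plus precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many found
    return accuracy, precision, recall

# Illustrative labels:
y_true = ["sci.space", "sci.space", "rec.autos", "sci.space", "rec.autos"]
y_pred = ["sci.space", "rec.autos", "rec.autos", "sci.space", "sci.space"]
acc, prec, rec = metrics(y_true, y_pred, positive="sci.space")
```

For a multi-class data set such as 20 newsgroups, precision and recall are computed per class in this way and then averaged.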

Loading the 20 newsgroups dataset: The dataset is called “Twenty Newsgroups”.


Here is the official description, quoted from the website:
http://qwone.com/~jason/20Newsgroups/

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup
documents, partitioned (nearly) evenly across 20 different newsgroups. To the best
of our knowledge, it was originally collected by Ken Lang, probably for his paper
“Newsweeder: Learning to filter netnews,” though he does not explicitly mention
this collection. The 20 newsgroups collection has become a popular data set for
experiments in text applications of machine learning techniques, such as text
classification and text clustering. [Do not write this; it is just background
information about the data set.]

Algorithm:

Learning To Classify Text: Preliminaries

• Examples is a set of text documents along with their target values. V is the set of
all possible target values. This function learns the probability terms P(wk | vj),
describing the probability that a randomly drawn word from a document in class vj
will be the English word wk. It also learns the class prior probabilities P(vj).

Learning to Classify Text: Algorithm

S1: LEARN_NAIVE_BAYES_TEXT (Examples, V)

• Vocabulary ← the set of all distinct words and tokens occurring in any document in Examples
• For each target value vj in V:
  – docsj ← the subset of documents in Examples whose target value is vj
  – P(vj) ← |docsj| / |Examples|
  – Textj ← a single document created by concatenating all members of docsj
  – n ← total number of word positions in Textj
  – For each word wk in Vocabulary:
    – nk ← number of times word wk occurs in Textj
    – P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)

S2: CLASSIFY_NAIVE_BAYES_TEXT (Doc)

• positions ← all word positions in Doc that contain tokens found in Vocabulary

• Return vNB, where

  vNB = argmax over vj in V of P(vj) ∏ (i in positions) P(ai | vj)

  and ai denotes the word found in the ith position of Doc.
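The two procedures above can be sketched directly in Python. The four-document toy corpus below is an illustrative assumption standing in for Examples; a real run would use the full 20 newsgroups training documents.

```python
import math
from collections import Counter

# Toy corpus standing in for Examples; documents and labels are illustrative.
examples = [
    ("the rocket launch was delayed", "sci.space"),
    ("nasa announced a new space mission", "sci.space"),
    ("the team won the hockey game", "rec.sport.hockey"),
    ("a great goal in the final period", "rec.sport.hockey"),
]

def learn_naive_bayes_text(examples):
    """S1: estimate P(vj) and smoothed P(wk | vj) from labelled documents."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, likelihoods = {}, {}
    for vj in {v for _, v in examples}:
        docs_j = [doc for doc, v in examples if v == vj]
        priors[vj] = len(docs_j) / len(examples)          # P(vj)
        text_j = " ".join(docs_j).split()                 # Textj as word list
        counts, n = Counter(text_j), len(text_j)
        likelihoods[vj] = {                               # (nk + 1) / (n + |Vocabulary|)
            wk: (counts[wk] + 1) / (n + len(vocabulary)) for wk in vocabulary
        }
    return vocabulary, priors, likelihoods

def classify_naive_bayes_text(doc, vocabulary, priors, likelihoods):
    """S2: return vNB, the class maximising the log-posterior of doc."""
    positions = [w for w in doc.split() if w in vocabulary]
    scores = {
        vj: math.log(priors[vj])
            + sum(math.log(likelihoods[vj][w]) for w in positions)
        for vj in priors
    }
    return max(scores, key=scores.get)

vocab, priors, likelihoods = learn_naive_bayes_text(examples)
label = classify_naive_bayes_text("the space launch", vocab, priors, likelihoods)
```

The argmax of the product in S2 is computed here as an argmax of a sum of logarithms, which is equivalent and numerically safer for long documents.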

Twenty News Groups


Given 1000 training documents from each group, learn to classify new documents
according to the newsgroup they came from.

comp.graphics              misc.forsale          alt.atheism                sci.space
comp.os.ms-windows.misc    rec.autos             soc.religion.christian     sci.crypt
comp.sys.ibm.pc.hardware   rec.motorcycles       talk.religion.misc         sci.electronics
comp.sys.mac.hardware      rec.sport.baseball    talk.politics.mideast      sci.med
comp.windows.x             rec.sport.hockey      talk.politics.misc
                                                 talk.politics.guns

• Naive Bayes: 89% classification accuracy
