[Figure: a test example placed among Professional documents and Culture documents. What class should it be assigned? Panels show the decision using its 1, 2, 3, and 8 nearest neighbors.]
Controlling COMPLEXITY in k-NN
Ingredient   Sweetness   Crunchiness   Food type
apple        10          9             fruit
bacon        1           4             protein
banana       10          1             fruit
carrot       7           10            vegetable
celery       3           10            vegetable
cheese       1           1             protein
Measuring similarity with distance
Locating the tomato's nearest neighbors requires a distance function: a formula that measures the similarity between two instances.
There are many different ways to calculate distance. Traditionally, the k-NN
algorithm uses Euclidean distance, which is the distance one would measure
if it were possible to use a ruler to connect two points, illustrated in the previous
figure by the dotted lines connecting the tomato to its neighbors.
Euclidean distance
Euclidean distance is specified by the following formula, where p and q are the examples to be compared, each having n features. The term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q:

$$\mathrm{dist}(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}$$
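To make the formula concrete, here is a minimal Python sketch that computes Euclidean distance over the sweetness/crunchiness table above and classifies a new ingredient by majority vote among its k nearest neighbors. The tomato's feature values (sweetness 6, crunchiness 4) are assumed for illustration; they are not given in the table.

```python
import math

# (sweetness, crunchiness) -> food type, from the table above
foods = {
    "apple":  ((10, 9),  "fruit"),
    "bacon":  ((1, 4),   "protein"),
    "banana": ((10, 1),  "fruit"),
    "carrot": ((7, 10),  "vegetable"),
    "celery": ((3, 10),  "vegetable"),
    "cheese": ((1, 1),   "protein"),
}

def euclidean(p, q):
    # dist(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(x, k):
    # Take the k labelled examples closest to x and return the majority label.
    nearest = sorted(foods.values(), key=lambda fv: euclidean(x, fv[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Tomato feature values are assumed here purely for illustration.
print(knn_classify((6, 4), k=3))
```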
Applications of k-NN
Note: The probabilities of all possible outcomes of a trial must always sum to 1.
Understanding probability (cont.)
For example, given the value P(spam) = 0.20, we can calculate
P(ham) = 1 – 0.20 = 0.80
[Figure: of all e-mails, 20% are spam and 80% are ham; 5% contain the word Lottery.]
Understanding joint probability
[Venn diagram: Lottery appearing in spam; Lottery appearing in ham; Lottery without appearing in spam.]
Estimate the probability that both spam and Lottery occur, which can be written as P(spam ∩ Lottery). The notation A ∩ B refers to the event in which both A and B occur.
Calculating P(spam ∩ Lottery) depends on the joint probability of the two events, that is, on how the probability of one event is related to the probability of the other.
If the two events are totally unrelated, they are called independent events. Because 20 percent of all messages are spam and 5 percent of all e-mails contain the word Lottery, we could assume that 1 percent of all messages are spam containing the term Lottery: P(spam ∩ Lottery) = 0.20 × 0.05 = 0.01.
In general, for independent events:

$$P(X_1, X_2) = P(X_1)\,P(X_2)$$

Class conditional independence assumption: the features need not be independent overall,

$$P(X_1, X_2) \neq P(X_1)\,P(X_2)$$

but they are assumed to be independent given the class:

$$P(X_1, X_2 \mid C) = P(X_1 \mid C)\,P(X_2 \mid C)$$
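As a quick numeric check of the independence product from the spam example above (a minimal sketch using the slide's values):

```python
p_spam = 0.20     # P(spam)
p_lottery = 0.05  # P(Lottery)

# For independent events, the joint probability is the product of the marginals.
p_joint = p_spam * p_lottery
print(p_joint)  # 0.01, i.e. 1% of all messages are spam containing "Lottery"
```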
Naïve Bayes Classifier
Conditional Independence among variables given Classes!
$$P(C \mid X_1, X_2, \ldots, X_D) = \frac{P(C)\,P(X_1, X_2, \ldots, X_D \mid C)}{\sum_{C'} P(C')\,P(X_1, X_2, \ldots, X_D \mid C')} = \frac{P(C)\prod_{d=1}^{D} P(X_d \mid C)}{\sum_{C'} P(C')\prod_{d=1}^{D} P(X_d \mid C')}$$
Simplifying assumption
$$\log P(C \mid X_1, X_2, \ldots, X_D) \propto \log P(C) + \sum_{d=1}^{D} \log P(X_d \mid C)$$
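Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow. A minimal sketch of this log-posterior score (function name and inputs are illustrative, not from the slides):

```python
import math

def log_posterior_score(prior, likelihoods):
    """Unnormalised log P(C | X_1, ..., X_D):
    log P(C) plus the sum over d of log P(X_d | C)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Pick the class with the highest score; the normalising constant cancels.
```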
Naïve Bayes Classifier for Categorical-Valued Variables
Let’s Naïve Bayes!
$$\log P(C \mid X_1, X_2, \ldots, X_D) \propto \log P(C) + \sum_{d=1}^{D} \log P(X_d \mid C)$$
Class prior parameters:

P(Like = Y) = ???
P(Like = N) = ???

Class-conditional likelihoods:

P(Color = Red | Like = Y) = ???
P(Color = Red | Like = N) = ???
...
P(Shape = Triangle | Like = N) = ???

#EXMPLS   COLOR   SHAPE      LIKE
20        Red     Square     Y
10        Red     Circle     Y
10        Red     Triangle   N
10        Green   Square     N
5         Green   Circle     Y
5         Green   Triangle   N
10        Blue    Square     N
10        Blue    Circle     N
20        Blue    Triangle   Y
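A sketch of how those parameters can be estimated from the count table by maximum likelihood; the fractions, e.g. P(Like = Y) = 55/100, follow directly from the counts, and the variable names are illustrative:

```python
from collections import defaultdict

# (count, color, shape, like) rows from the table above
rows = [
    (20, "Red", "Square", "Y"), (10, "Red", "Circle", "Y"),
    (10, "Red", "Triangle", "N"), (10, "Green", "Square", "N"),
    (5, "Green", "Circle", "Y"), (5, "Green", "Triangle", "N"),
    (10, "Blue", "Square", "N"), (10, "Blue", "Circle", "N"),
    (20, "Blue", "Triangle", "Y"),
]

total = sum(n for n, *_ in rows)   # 100 examples in all
class_count = defaultdict(int)     # like -> count
color_count = defaultdict(int)     # (color, like) -> count
shape_count = defaultdict(int)     # (shape, like) -> count
for n, color, shape, like in rows:
    class_count[like] += n
    color_count[(color, like)] += n
    shape_count[(shape, like)] += n

# Class priors: P(Like = Y) = 55/100 = 0.55, P(Like = N) = 45/100 = 0.45
prior = {c: class_count[c] / total for c in class_count}

# Class-conditional likelihoods, e.g. P(Color = Red | Like = Y) = 30/55
def p_color(color, like):
    return color_count[(color, like)] / class_count[like]

def p_shape(shape, like):
    return shape_count[(shape, like)] / class_count[like]

print(prior["Y"], p_color("Red", "Y"), p_shape("Triangle", "N"))
```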
Naïve Bayes Classifier for Text Classification
Text Classification Example
[Figure: example Promotions e-mails scored by the classifier, e.g. P(spam | doc2) = 0.94.]
Bag-of-Words representation:
Ignore the sequential order of words
Represent each document as a weighted set: the term frequency of each term
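A minimal bag-of-words sketch; the tokeniser here is an assumption, and the slide's counts for doc1 imply that "shirts" has already been normalised to "shirt":

```python
from collections import Counter
import re

def bag_of_words(text):
    # Lowercase, split into word tokens, and count term frequencies.
    # Sequential order is discarded; only the weighted set of terms remains.
    return Counter(re.findall(r"[a-z]+", text.lower()))

print(bag_of_words("buy two shirt get one shirt half off"))
# Counter({'shirt': 2, 'buy': 1, 'two': 1, 'get': 1, 'one': 1, 'half': 1, 'off': 1})
```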
Naïve Bayes Classifier with BoW
$$P(\textit{doc}_1 \mid \text{promo}) = P(\text{buy}{:}1,\ \text{two}{:}1,\ \text{shirt}{:}2,\ \text{get}{:}1,\ \text{one}{:}1,\ \text{half}{:}1,\ \text{off}{:}1 \mid \text{promo})$$

The count for a word such as "free" in the promo class is the sum of its term frequencies over all promo documents $doc_1, doc_2, doc_3, \ldots, doc_n$:

$$N(\text{free}, \text{promo}) = \sum_{doc \,\in\, \text{promo}} \mathrm{tf}(\text{free}, doc)$$
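A sketch of a multinomial Naïve Bayes trainer over such term-frequency bags. Laplace (add-alpha) smoothing is an assumption added here so unseen words do not zero out a class; the slides only define the raw counts N(w, class):

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_class, alpha=1.0):
    """docs_by_class maps class -> list of term-frequency Counters.
    Returns log-priors and smoothed log-likelihoods log P(w | class)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    log_prior, log_lik = {}, {}
    for c, docs in docs_by_class.items():
        log_prior[c] = math.log(len(docs) / total_docs)
        counts = Counter()
        for d in docs:
            counts.update(d)   # N(w, c): sum of tf(w, doc) over docs in class c
        denom = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def classify(bag, log_prior, log_lik, vocab):
    # Score each class as log P(C) + sum_w tf(w) * log P(w | C); pick the best.
    scores = {c: log_prior[c] + sum(tf * log_lik[c][w]
                                    for w, tf in bag.items() if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```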
Bayesian Classifier
Multivariate real-valued data
Bayes Rule
$$P(c \mid \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{P(\mathbf{x})}, \qquad \mathbf{x} \in \mathbb{R}^D$$
Simple Bayesian Classifier
$$P(c \mid \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{P(\mathbf{x})}$$

$$P(\mathbf{x} \mid c) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_c, \Sigma_c) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_c|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_c)^{T}\,\Sigma_c^{-1}\,(\mathbf{x} - \boldsymbol{\mu}_c)\right)$$
Sum: $\displaystyle\int_{\mathbf{x} \in \mathbb{R}^D} P(\mathbf{x} \mid c)\, d\mathbf{x} = 1$

Mean: $\displaystyle\int_{\mathbf{x} \in \mathbb{R}^D} \mathbf{x}\, P(\mathbf{x} \mid c)\, d\mathbf{x} = \boldsymbol{\mu}_c$

Covariance: $\displaystyle\int_{\mathbf{x} \in \mathbb{R}^D} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{T}\, P(\mathbf{x} \mid c)\, d\mathbf{x} = \Sigma_c$
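A compact NumPy sketch of this Gaussian Bayes classifier: fit a per-class prior, mean, and covariance, then score a point with the log of the density above. It assumes each class has enough samples for an invertible covariance matrix:

```python
import numpy as np

class GaussianBayes:
    """One multivariate Gaussian N(mu_c, Sigma_c) per class, combined by Bayes' rule."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mu, self.sigma = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.prior[c] = len(Xc) / len(X)           # P(c)
            self.mu[c] = Xc.mean(axis=0)               # mu_c
            self.sigma[c] = np.cov(Xc, rowvar=False)   # Sigma_c
        return self

    def _log_density(self, x, c):
        # log N(x | mu_c, Sigma_c), from the density formula above
        D = len(x)
        diff = x - self.mu[c]
        _, logdet = np.linalg.slogdet(self.sigma[c])
        return -0.5 * (D * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.inv(self.sigma[c]) @ diff)

    def predict_one(self, x):
        # argmax_c log P(c) + log P(x | c); P(x) is the same for every class
        scores = {c: np.log(self.prior[c]) + self._log_density(x, c)
                  for c in self.classes}
        return max(scores, key=scores.get)
```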
Controlling COMPLEXITY