
K-nearest neighbor methods

William Cohen
10-601 April 2008

But first….

[Figure: scatterplot of Age in Years (y-axis, 0–50) against Number of Publications (x-axis, 0–160), with the fitted line $\hat{y} = \frac{1}{7}x + 26$]
Onward: multivariate linear regression
Univariate (a single feature per example):
$x = (x_1, \ldots, x_n)$, $\quad y = (y_1, \ldots, y_n)$
$\hat{w} = (x^T x)^{-1} x^T y$

Multivariate (each row of $X$ is an example, each column is a feature):
$X = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$
$\hat{y} = \hat{w}_1 x_1 + \ldots + \hat{w}_k x_k$
$\hat{w} = (X^T X)^{-1} X^T y$
Equivalently, $\hat{w} = \arg\min_w \sum_i [\hat{\varepsilon}_i(w)]^2$ where $\hat{\varepsilon}_i(w) = y_i - w^T x_i$.
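To make the closed-form solution concrete, here is a minimal numpy sketch; the function names (fit_ols, predict) are mine, not from the slides.

import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: w_hat = (X^T X)^{-1} X^T y.
    X is n-by-k (rows are examples, columns are features); y has length n."""
    # Solving the normal equations is preferred to explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, w_hat):
    # y_hat = w_hat_1 * x_1 + ... + w_hat_k * x_k for each row of X
    return X @ w_hat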
ACM Computing Surveys 2002

Review of K-NN methods (so far)

Kernel regression
• aka locally weighted regression, locally linear
regression, LOESS, …
What does making the kernel wider
do to bias and variance?
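A minimal sketch of the kernel-weighted-average form (Nadaraya-Watson) for 1-D inputs; the names (kernel_regression, h, x_train, ...) are mine, and locally weighted linear regression would fit a weighted linear model rather than a plain weighted average. Widening the kernel (larger h) averages over more neighbors, which generally increases bias and decreases variance.

import numpy as np

def kernel_regression(x_train, y_train, x_query, h=1.0):
    """Predict y at x_query as a weighted average of training y's,
    where each weight is a Gaussian kernel of the distance to x_query
    and h is the kernel width (bandwidth)."""
    w = np.exp(-((x_train - x_query) ** 2) / (2 * h ** 2))  # kernel weights
    return np.sum(w * y_train) / np.sum(w)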

BellCore’s MovieRecommender
• Participants sent email to videos@bellcore.com
• System replied with a list of 500 movies to rate on a
1-10 scale (250 random, 250 popular)
– Only a subset needed to be rated
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a
random sample of) previous users
• Most similar users are used to predict scores for
unrated movies (more later)
• System returns recommendations in an email
message.

Suggested Videos for: John A. Jamus.
Your must-see list with predicted ratings:
•7.0 "Alien (1979)"
•6.5 "Blade Runner"
•6.2 "Close Encounters Of The Third Kind (1977)"
Your video categories with average ratings:
•6.7 "Action/Adventure"
•6.5 "Science Fiction/Fantasy"
•6.3 "Children/Family"
•6.0 "Mystery/Suspense"
•5.9 "Comedy"
•5.8 "Drama"

The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar.

Correlation with target viewer:
• 0.59 viewer-130 (unlisted@merl.com)
• 0.55 bullert, jane r (bullert@cc.bellcore.com)
• 0.51 jan_arst (jan_arst@khdld.decnet.philips.nl)
• 0.46 Ken Cross (moose@denali.EE.CORNELL.EDU)
• 0.42 rskt (rskt@cc.bellcore.com)
• 0.41 kkgg (kkgg@Athena.MIT.EDU)
• 0.41 bnn (bnn@cc.bellcore.com)

By category, their joint ratings recommend:
• Action/Adventure:
  • "Excalibur" 8.0, 4 viewers
  • "Apocalypse Now" 7.2, 4 viewers
  • "Platoon" 8.3, 3 viewers
• Science Fiction/Fantasy:
  • "Total Recall" 7.2, 5 viewers
• Children/Family:
  • "Wizard Of Oz, The" 8.5, 4 viewers
  • "Mary Poppins" 7.7, 3 viewers
• Mystery/Suspense:
  • "Silence Of The Lambs, The" 9.3, 3 viewers
• Comedy:
  • "National Lampoon's Animal House" 7.5, 4 viewers
  • "Driving Miss Daisy" 7.5, 4 viewers
  • "Hannah and Her Sisters" 8.0, 3 viewers
• Drama:
  • "It's A Wonderful Life" 8.0, 5 viewers
  • "Dead Poets Society" 7.0, 5 viewers
  • "Rain Man" 7.5, 4 viewers

Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.
Algorithms for Collaborative Filtering 1:
Memory-Based Algorithms (Breese et al, UAI98)
• $v_{i,j}$ = vote of user i on item j
• $I_i$ = items for which user i has voted
• Mean vote for i is $\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$
• Predicted vote for “active user” a is a weighted sum
  $p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$
  where $\kappa$ is a normalizer and the $w(a,i)$ are the weights of the n most similar users.
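A minimal Python sketch of this memory-based prediction, assuming a dense users-by-items vote matrix with NaN where a user has not voted; the names (votes, predict_vote, weights) are mine, not from the paper.

import numpy as np

def predict_vote(votes, a, j, weights):
    """Predict user a's vote on item j as the active user's mean vote plus a
    normalized weighted sum of the other users' mean-centered votes.
    votes: users-by-items array with np.nan where a user has not voted.
    weights: weights[i] = similarity of user i to the active user a."""
    means = np.nanmean(votes, axis=1)              # mean vote of each user
    contrib, total_w = 0.0, 0.0
    for i in range(votes.shape[0]):
        if i == a or np.isnan(votes[i, j]):
            continue
        contrib += weights[i] * (votes[i, j] - means[i])
        total_w += abs(weights[i])
    kappa = 1.0 / total_w if total_w > 0 else 0.0  # normalizer
    return means[a] + kappa * contrib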

Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Example:
http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/
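A minimal sketch of exactly this procedure, using Euclidean distance; the names (knn_predict, X_train, ...) are mine.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the most frequent class among the k training examples
    closest to the test example x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training example
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]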

What is the decision boundary?
Voronoi diagram

Convergence of 1-NN
[Figure: a test point x and its nearest neighbors x1, x2, each with its own conditional class distribution P(Y|x), P(Y|x1), P(Y|x2)]

Let $y^* = \arg\max_y \Pr(y \mid x)$ and assume the nearest neighbor $x_1$ is close enough that $\Pr(Y \mid x_1) \approx \Pr(Y \mid x)$. Then

$\Pr(\text{1-NN error}) = 1 - \Pr(y = y_1)$
$\approx 1 - \sum_{y'} \Pr(Y = y' \mid x)^2$
$= 1 - \Pr(y^* \mid x)^2 - \sum_{y' \neq y^*} \Pr(Y = y' \mid x)^2$
$\leq \cdots$
$\leq 2\,(1 - \Pr(y^* \mid x))$
$= 2 \times (\text{Bayes optimal error rate})$

Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set
quickly

K-NN and irrelevant features

[Figure: training examples labeled + and o spread along a single relevant feature, with the query ? among them]

K-NN and irrelevant features
[Figure: the same + and o examples plotted in two dimensions, with the query ? among them]

K-NN and irrelevant features

[Figure: another 2-D arrangement of the + and o examples around the query ?]

Ways of rescaling for KNN
Normalized L1 distance:

Scale by information gain (IG):

Modified value difference metric (MVDM):
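The formulas themselves did not survive the export, but here is one plausible reading of the first two ideas (range-normalized L1 distance, and per-feature weights from information gain); the function names and the exact normalization are my assumptions, not necessarily the slide's.

import numpy as np

def normalized_l1(x, x2, feat_min, feat_max):
    """L1 distance with each feature rescaled by its observed range,
    so no single feature dominates just because of its units."""
    span = np.maximum(feat_max - feat_min, 1e-12)  # avoid divide-by-zero
    return np.sum(np.abs(x - x2) / span)

def ig_weighted_l1(x, x2, ig):
    """L1 distance with each feature weighted by its information gain,
    so uninformative (irrelevant) features contribute little."""
    return np.sum(ig * np.abs(x - x2))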

Ways of rescaling for KNN
Dot product:

Cosine distance:

TFIDF weights for text: for doc j, feature i: $x_i = tf_{i,j} \cdot idf_i$, where $tf_{i,j}$ is the number of occurrences of term i in doc j, and $idf_i$ is computed from the number of docs in the corpus and the number of docs in the corpus that contain term i.
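A small sketch of TF-IDF weighting plus cosine similarity over a toy corpus of token lists; the log form of idf and all names (tfidf_vector, cosine_sim) are my assumptions of the usual definitions, not taken verbatim from the slide.

import math
from collections import Counter

def tfidf_vector(doc, docs):
    """Weight term t in doc as tf * idf, with idf = log(#docs / #docs containing t)."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency of each term
    tf = Counter(doc)                              # occurrences of each term in this doc
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def cosine_sim(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["apple", "pie"], ["apple", "juice"], ["car", "engine"]]
vecs = [tfidf_vector(d, docs) for d in docs]
print(cosine_sim(vecs[0], vecs[1]))  # docs sharing "apple" get a nonzero similarity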
Combining distances to neighbors
Standard KNN: $\hat{y} = \arg\max_y C(y, \text{Neighbors}(x))$, where
$C(y, D') = |\{(x', y') \in D' : y' = y\}|$

Distance-weighted KNN: $\hat{y} = \arg\max_y C(y, \text{Neighbors}(x))$, where either
$C(y, D') = \sum_{\{(x', y') \in D' : y' = y\}} \mathrm{SIM}(x, x')$
or
$C(y, D') = 1 - \prod_{\{(x', y') \in D' : y' = y\}} \bigl(1 - \mathrm{SIM}(x, x')\bigr)$
with $\mathrm{SIM}(x, x') = 1 - \Delta(x, x')$ for a normalized distance $\Delta$.
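A hedged sketch of the sum-of-similarities variant, taking SIM(x, x') as 1 minus a distance scaled into [0, 1]; the names and the scaling choice are mine.

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5):
    """Distance-weighted KNN: each of the k nearest neighbors votes for
    its class with weight SIM(x, x') = 1 - normalized distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    d = d / (d.max() + 1e-12)                  # scale distances into [0, 1]
    nearest = np.argsort(d)[:k]
    scores = {}
    for i in nearest:                          # C(y, D') = sum of SIM over neighbors with label y
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + (1.0 - d[i])
    return max(scores, key=scores.get)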


William W. Cohen & Haym Hirsh (1998): Joins that
Generalize: Text Classification Using WHIRL in KDD
1998: 169-173.


Vitor Carvalho and William W. Cohen (2008): Ranking Users for Intelligent Message Addressing, in ECIR-2008; and current work with Vitor, me, and Ramnath Balasubramanyan.
Computing KNN: pros and cons
• Storage: all training examples are saved in memory
– A decision tree or linear classifier is much smaller
• Time: to classify x, you need to loop over all training
examples (x’,y’) to compute distance between x and
x’.
– However, you get predictions for every class y
• KNN is nice when there are many many classes
– Actually, there are some tricks to speed this up…especially
when data is sparse (e.g., text)

Efficiently implementing KNN (for text)

IDF is nice
computationally
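The slide's diagram did not survive the export; as an illustration of why sparse, IDF-weighted text makes KNN cheap, here is a rough sketch using an inverted index, so only documents sharing at least one term with the query are ever scored. All names (build_index, knn_scores) are mine.

from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of (doc_id, weight) for the docs containing it.
    docs: list of {term: tfidf_weight} dicts."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, w in doc.items():
            index[term].append((doc_id, w))
    return index

def knn_scores(query, index):
    """Accumulate dot-product scores only over docs that share a term with the query."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += qw * w
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best-scoring docs first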

Tricks with fast KNN
K-means using r-NN
1. Pick k points c1=x1,….,ck=xk as centers
2. For each center ci, find Di = Neighborhood(ci) using r-NN search
3. For each center, let ci = mean(Di)
4. Go to step 2….
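A rough sketch of the loop above, reading Neighborhood(ci) as the points within radius r of the center (other readings, e.g. the r nearest points, would work the same way); that interpretation and all names are mine.

import numpy as np

def kmeans_rnn(X, k, r, n_iter=10):
    """K-means-style updates: each center moves to the mean of the points
    found in its radius-r neighborhood (a range nearest-neighbor query)."""
    X = np.asarray(X, dtype=float)
    centers = X[:k].copy()                     # step 1: use the first k points as centers
    for _ in range(n_iter):                    # step 4: repeat
        for i in range(k):
            d = np.linalg.norm(X - centers[i], axis=1)
            D_i = X[d <= r]                    # step 2: neighborhood of center i
            if len(D_i) > 0:
                centers[i] = D_i.mean(axis=0)  # step 3: move the center to the mean
    return centers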

Efficiently implementing KNN

[Figure: a test document compared against nearby training documents dj2, dj3, dj4]

Selective classification: given a training set and a test set, find the N test cases that you can most confidently classify.

Train once and select 100 test cases to classify
