
K-nearest neighbor methods

William Cohen
10-601 April 2008

But first….

[Figure: scatterplot of Age in Years (y-axis, 0–50) against Number of Publications (x-axis, 0–160), with the fitted line $\hat{y} = \frac{1}{7}x + 26$]
Onward: multivariate linear regression
Univariate (a single feature per example):
$x = (x_1, \ldots, x_n)$, $\quad y = (y_1, \ldots, y_n)$
$\hat{w} = (x^T x)^{-1} x^T y$

Multivariate (each row of $X$ is an example, each column is a feature):
$X = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$
$\hat{y} = \hat{w}_1 x_1 + \ldots + \hat{w}_k x_k$
$\hat{w} = (X^T X)^{-1} X^T y$
Equivalently, $\hat{w} = \arg\min_w \sum_i [\hat{\varepsilon}_i(w)]^2$ where $\hat{\varepsilon}_i(w) = y_i - w^T x_i$.
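To make the closed-form solution concrete, here is a minimal numpy sketch; the function names (fit_ols, predict) are mine, not from the slides.

import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: w_hat = (X^T X)^{-1} X^T y.
    X is n-by-k (rows are examples, columns are features); y has length n."""
    # Solving the normal equations is preferred to explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, w_hat):
    # y_hat = w_hat_1 * x_1 + ... + w_hat_k * x_k for each row of X
    return X @ w_hat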
ACM Computing Surveys 2002

Review of K-NN methods (so far)

Kernel regression
• aka locally weighted regression, locally linear
regression, LOESS, …
What does making the kernel wider
do to bias and variance?
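A minimal sketch of the kernel-weighted-average form (Nadaraya-Watson) for 1-D inputs; the names (kernel_regression, h, x_train, ...) are mine, and locally weighted linear regression would fit a weighted linear model rather than a plain weighted average. Widening the kernel (larger h) averages over more neighbors, which generally increases bias and decreases variance.

import numpy as np

def kernel_regression(x_train, y_train, x_query, h=1.0):
    """Predict y at x_query as a weighted average of training y's,
    where each weight is a Gaussian kernel of the distance to x_query
    and h is the kernel width (bandwidth)."""
    w = np.exp(-((x_train - x_query) ** 2) / (2 * h ** 2))  # kernel weights
    return np.sum(w * y_train) / np.sum(w)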

BellCore’s MovieRecommender
• Participants sent email to videos@bellcore.com
• System replied with a list of 500 movies to rate on a
1-10 scale (250 random, 250 popular)
– Only a subset needed to be rated
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a
random sample of) previous users
• Most similar users are used to predict scores for
unrated movies (more later)
• System returns recommendations in an email
message.

Suggested Videos for: John A. Jamus.
Your must-see list with predicted ratings:
•7.0 "Alien (1979)"
•6.5 "Blade Runner"
•6.2 "Close Encounters Of The Third Kind (1977)"
Your video categories with average ratings:
•6.7 "Action/Adventure"
•6.5 "Science Fiction/Fantasy"
•6.3 "Children/Family"
•6.0 "Mystery/Suspense"
•5.9 "Comedy"
•5.8 "Drama"

The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar.

Correlation with target viewer:
• 0.59 viewer-130 (unlisted@merl.com)
• 0.55 bullert, jane r (bullert@cc.bellcore.com)
• 0.51 jan_arst (jan_arst@khdld.decnet.philips.nl)
• 0.46 Ken Cross (moose@denali.EE.CORNELL.EDU)
• 0.42 rskt (rskt@cc.bellcore.com)
• 0.41 kkgg (kkgg@Athena.MIT.EDU)
• 0.41 bnn (bnn@cc.bellcore.com)

By category, their joint ratings recommend:
• Action/Adventure:
  • "Excalibur" 8.0, 4 viewers
  • "Apocalypse Now" 7.2, 4 viewers
  • "Platoon" 8.3, 3 viewers
• Science Fiction/Fantasy:
  • "Total Recall" 7.2, 5 viewers
• Children/Family:
  • "Wizard Of Oz, The" 8.5, 4 viewers
  • "Mary Poppins" 7.7, 3 viewers
• Mystery/Suspense:
  • "Silence Of The Lambs, The" 9.3, 3 viewers
• Comedy:
  • "National Lampoon's Animal House" 7.5, 4 viewers
  • "Driving Miss Daisy" 7.5, 4 viewers
  • "Hannah and Her Sisters" 8.0, 3 viewers
• Drama:
  • "It's A Wonderful Life" 8.0, 5 viewers
  • "Dead Poets Society" 7.0, 5 viewers
  • "Rain Man" 7.5, 4 viewers

Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.
Algorithms for Collaborative Filtering 1:
Memory-Based Algorithms (Breese et al, UAI98)
• $v_{i,j}$ = vote of user i on item j
• $I_i$ = items for which user i has voted
• Mean vote for i is $\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$
• Predicted vote for “active user” a is a weighted sum
  $p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$
  where $\kappa$ is a normalizer and the $w(a,i)$ are the weights of the n most similar users.
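A minimal Python sketch of this memory-based prediction, assuming a dense users-by-items vote matrix with NaN where a user has not voted; the names (votes, predict_vote, weights) are mine, not from the paper.

import numpy as np

def predict_vote(votes, a, j, weights):
    """Predict user a's vote on item j as the active user's mean vote plus a
    normalized weighted sum of the other users' mean-centered votes.
    votes: users-by-items array with np.nan where a user has not voted.
    weights: weights[i] = similarity of user i to the active user a."""
    means = np.nanmean(votes, axis=1)              # mean vote of each user
    contrib, total_w = 0.0, 0.0
    for i in range(votes.shape[0]):
        if i == a or np.isnan(votes[i, j]):
            continue
        contrib += weights[i] * (votes[i, j] - means[i])
        total_w += abs(weights[i])
    kappa = 1.0 / total_w if total_w > 0 else 0.0  # normalizer
    return means[a] + kappa * contrib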

Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Example:
http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/
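A minimal sketch of exactly this procedure, using Euclidean distance; the names (knn_predict, X_train, ...) are mine.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the most frequent class among the k training examples
    closest to the test example x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training example
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]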

What is the decision boundary?
Voronoi diagram

Convergence of 1-NN
[Figure: a test point x and its nearest neighbors x1, x2, each with its own conditional class distribution P(Y|x), P(Y|x1), P(Y|x2)]

Let $y^* = \arg\max_y \Pr(y \mid x)$ and assume the nearest neighbor $x_1$ is close enough that $\Pr(Y \mid x_1) \approx \Pr(Y \mid x)$. Then

$\Pr(\text{1-NN error}) = 1 - \Pr(y = y_1)$
$\approx 1 - \sum_{y'} \Pr(Y = y' \mid x)^2$
$= 1 - \Pr(y^* \mid x)^2 - \sum_{y' \neq y^*} \Pr(Y = y' \mid x)^2$
$\leq \cdots$
$\leq 2\,(1 - \Pr(y^* \mid x))$
$= 2 \times (\text{Bayes optimal error rate})$

Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set
quickly

K-NN and irrelevant features

[Figure: training examples labeled + and o spread along a single relevant feature, with the query ? among them]

K-NN and irrelevant features
[Figure: the same + and o examples plotted in two dimensions, with the query ? among them]

K-NN and irrelevant features

[Figure: another 2-D arrangement of the + and o examples around the query ?]

Ways of rescaling for KNN
Normalized L1 distance:

Scale by information gain (IG):

Modified value difference metric (MVDM):
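The formulas themselves did not survive the export, but here is one plausible reading of the first two ideas (range-normalized L1 distance, and per-feature weights from information gain); the function names and the exact normalization are my assumptions, not necessarily the slide's.

import numpy as np

def normalized_l1(x, x2, feat_min, feat_max):
    """L1 distance with each feature rescaled by its observed range,
    so no single feature dominates just because of its units."""
    span = np.maximum(feat_max - feat_min, 1e-12)  # avoid divide-by-zero
    return np.sum(np.abs(x - x2) / span)

def ig_weighted_l1(x, x2, ig):
    """L1 distance with each feature weighted by its information gain,
    so uninformative (irrelevant) features contribute little."""
    return np.sum(ig * np.abs(x - x2))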

Ways of rescaling for KNN
Dot product:

Cosine distance:

TFIDF weights for text: for doc j, feature i: $x_i = tf_{i,j} \cdot idf_i$, where $tf_{i,j}$ is the number of occurrences of term i in doc j, and $idf_i$ is computed from the number of docs in the corpus and the number of docs in the corpus that contain term i.
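A small sketch of TF-IDF weighting plus cosine similarity over a toy corpus of token lists; the log form of idf and all names (tfidf_vector, cosine_sim) are my assumptions of the usual definitions, not taken verbatim from the slide.

import math
from collections import Counter

def tfidf_vector(doc, docs):
    """Weight term t in doc as tf * idf, with idf = log(#docs / #docs containing t)."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency of each term
    tf = Counter(doc)                              # occurrences of each term in this doc
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def cosine_sim(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["apple", "pie"], ["apple", "juice"], ["car", "engine"]]
vecs = [tfidf_vector(d, docs) for d in docs]
print(cosine_sim(vecs[0], vecs[1]))  # docs sharing "apple" get a nonzero similarity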
Combining distances to neighbors
Standard KNN: $\hat{y} = \arg\max_y C(y, \text{Neighbors}(x))$, where
$C(y, D') = |\{(x', y') \in D' : y' = y\}|$

Distance-weighted KNN: $\hat{y} = \arg\max_y C(y, \text{Neighbors}(x))$, where either
$C(y, D') = \sum_{\{(x', y') \in D' : y' = y\}} \mathrm{SIM}(x, x')$
or
$C(y, D') = 1 - \prod_{\{(x', y') \in D' : y' = y\}} \bigl(1 - \mathrm{SIM}(x, x')\bigr)$
with $\mathrm{SIM}(x, x') = 1 - \Delta(x, x')$ for a normalized distance $\Delta$.
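A hedged sketch of the sum-of-similarities variant, taking SIM(x, x') as 1 minus a distance scaled into [0, 1]; the names and the scaling choice are mine.

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5):
    """Distance-weighted KNN: each of the k nearest neighbors votes for
    its class with weight SIM(x, x') = 1 - normalized distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    d = d / (d.max() + 1e-12)                  # scale distances into [0, 1]
    nearest = np.argsort(d)[:k]
    scores = {}
    for i in nearest:                          # C(y, D') = sum of SIM over neighbors with label y
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + (1.0 - d[i])
    return max(scores, key=scores.get)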


William W. Cohen & Haym Hirsh (1998): Joins that
Generalize: Text Classification Using WHIRL in KDD
1998: 169-173.


Vitor Carvalho and William W. Cohen (2008): Ranking Users for Intelligent Message Addressing, in ECIR-2008; and current work with Vitor, me, and Ramnath Balasubramanyan.
Computing KNN: pros and cons
• Storage: all training examples are saved in memory
– A decision tree or linear classifier is much smaller
• Time: to classify x, you need to loop over all training
examples (x’,y’) to compute distance between x and
x’.
– However, you get predictions for every class y
• KNN is nice when there are many many classes
– Actually, there are some tricks to speed this up…especially
when data is sparse (e.g., text)

Efficiently implementing KNN (for text)

IDF is nice
computationally
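The slide's diagram did not survive the export; as an illustration of why sparse, IDF-weighted text makes KNN cheap, here is a rough sketch using an inverted index, so only documents sharing at least one term with the query are ever scored. All names (build_index, knn_scores) are mine.

from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of (doc_id, weight) for the docs containing it.
    docs: list of {term: tfidf_weight} dicts."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, w in doc.items():
            index[term].append((doc_id, w))
    return index

def knn_scores(query, index):
    """Accumulate dot-product scores only over docs that share a term with the query."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += qw * w
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best-scoring docs first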

Tricks with fast KNN
K-means using r-NN
1. Pick k points c1=x1,….,ck=xk as centers
2. For each center ci, find Di = Neighborhood(ci) using r-NN search
3. For each center, let ci = mean(Di)
4. Go to step 2….
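A rough sketch of the loop above, reading Neighborhood(ci) as the points within radius r of the center (other readings, e.g. the r nearest points, would work the same way); that interpretation and all names are mine.

import numpy as np

def kmeans_rnn(X, k, r, n_iter=10):
    """K-means-style updates: each center moves to the mean of the points
    found in its radius-r neighborhood (a range nearest-neighbor query)."""
    X = np.asarray(X, dtype=float)
    centers = X[:k].copy()                     # step 1: use the first k points as centers
    for _ in range(n_iter):                    # step 4: repeat
        for i in range(k):
            d = np.linalg.norm(X - centers[i], axis=1)
            D_i = X[d <= r]                    # step 2: neighborhood of center i
            if len(D_i) > 0:
                centers[i] = D_i.mean(axis=0)  # step 3: move the center to the mean
    return centers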

Efficiently implementing KNN

[Figure: a test document compared against nearby training documents dj2, dj3, dj4]

Selective classification: given a training set and a test set, find the N test cases that you can most confidently classify.

Train once and select 100 test cases to classify
