University of Washington
CSS 581 - Introduction to Machine Learning
Instructor: J Jeffry Howbert
Lecture 2: Math Essentials
Probability
Linear algebra
Linear algebra applications
Linear Regression
Regression Trees
w1 = cov[ x, y ] / var[ x ]
w0 = ȳ - w1 · x̄    ( x̄, ȳ = means of training x, y )
ŷt = w0 + w1 · xt    for test sample xt
Jeff Howbert Introduction to Machine Learning Winter 2014 #
Least squares linear fit to data
Multiple dimensions
To simplify notation and derivation, add a new feature
x0 = 1 to feature vector x:
ŷ = w0 · 1 + Σ_{i=1..d} wi · xi = wᵀ x
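The least-squares fit above can be sketched in Python; the training data here are made-up illustrative numbers, not from the lecture:

```python
import numpy as np

# Hypothetical 1-D training data (not from the lecture)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Simple least-squares fit: w1 = cov[x, y] / var[x], w0 = mean(y) - w1 * mean(x)
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()

# Same fit with the added feature x0 = 1, so y_hat = w . x
X = np.column_stack([np.ones_like(x), x])  # prepend x0 = 1 to every sample
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||Xw - y||^2

# Predict for a test sample xt: y_hat_t = w0 + w1 * xt
xt = 5.0
print(w0 + w1 * xt)
```

Both routes recover the same weight vector, which is why the x0 = 1 trick simplifies the multi-dimensional notation.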
PROCESS:
1. users provide ratings on items they have experienced
2. Take all < user, item, rating > data and build a predictive
model
3. For a user who hasn't experienced a particular item, use the
model to predict how well they will like it (i.e., predict the
rating)
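A minimal illustration of steps 2-3, in the spirit of the "global effects" baseline (global mean plus per-user and per-item offsets); all names and ratings below are hypothetical:

```python
from collections import defaultdict

# Step 1: hypothetical (user, item, rating) triples
ratings = [
    ("alice", "m1", 4), ("alice", "m2", 5),
    ("bob",   "m1", 2), ("bob",   "m3", 3),
    ("carol", "m2", 4), ("carol", "m3", 5),
]

# Step 2: build a predictive model: global mean plus per-user and per-item offsets
mu = sum(r for _, _, r in ratings) / len(ratings)

def offsets(index):
    acc = defaultdict(list)
    for triple in ratings:
        acc[triple[index]].append(triple[2] - mu)
    return {k: sum(v) / len(v) for k, v in acc.items()}

user_off, item_off = offsets(0), offsets(1)

# Step 3: predict a rating for an item the user hasn't experienced
def predict(user, item):
    return mu + user_off.get(user, 0.0) + item_off.get(item, 0.0)

print(predict("alice", "m3"))  # alice has never rated m3
```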
Roles of Recommender Systems
Help users deal with paradox of choice
content-based approach
explicit characteristics of users and items
PARTICIPATION:
51,051 contestants on 41,305 teams from 186 different
countries
44,014 valid submissions from 5,169 different teams
The Netflix Prize Data
Netflix released three datasets
480,189 users (anonymous)
17,770 movies
ratings on integer scale 1 to 5
Training set: 99,072,112 < user, movie > pairs with ratings
Probe set: 1,408,395 < user, movie > pairs with ratings
Qualifying set of 2,817,131 < user, movie > pairs with no
ratings
Model Building and Submission Process
[Diagram: model building and submission workflow]
Training set (99,072,112 ratings, known) is used to build the MODEL; the probe set (1,408,395 ratings, known) is used for tuning and validation.
The model then makes predictions on the qualifying set (ratings unknown), which is split into:
quiz set (1,408,342 pairs): RMSE posted on the public leaderboard
test set (1,408,789 pairs): RMSE kept secret for final scoring
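Submissions were scored by root-mean-square error (RMSE). A minimal sketch of the metric, on made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error: sqrt of the mean squared prediction error."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical predictions vs. true ratings
print(rmse([3.5, 4.1, 2.0], [4, 4, 3]))
```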
Why the Netflix Prize Was Hard
[Figure: the 480,189-user × 17,770-movie ratings matrix, shown mostly empty]
Massive dataset
Very sparse matrix: only 1.2% occupied
Extreme variation in number of ratings per user
Statistical properties of qualifying and probe sets different from training set
Dealing with Size of the Data
MEMORY:
2 GB bare minimum for common algorithms
4+ GB required for some algorithms
need 64-bit machine with 4+ GB RAM if serious
SPEED:
Program in languages that compile to fast machine code
64-bit processor
Exploit low-level parallelism in code (SIMD on Intel x86/x64)
Common Types of Algorithms
Global effects
Nearest neighbors
Matrix factorization
Restricted Boltzmann machine
Clustering
Etc.
Nearest Neighbors in Action
[Figure: ratings matrix with a target user's row highlighted; a neighbor with identical preferences gets strong weight, a neighbor with similar preferences gets moderate weight]
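The neighbor-weighting idea can be sketched as user-based collaborative filtering. Cosine similarity over co-rated movies is just one plausible weighting choice, and the users and ratings below are made up:

```python
import math

# Hypothetical sparse ratings: user -> {movie: rating}
ratings = {
    "u1": {"m1": 2, "m2": 3, "m3": 3, "m4": 4},
    "u2": {"m1": 2, "m2": 3, "m3": 3, "m4": 4, "m5": 5},  # identical overlap: strong weight
    "u3": {"m1": 3, "m2": 2, "m3": 4, "m5": 1},           # similar overlap: moderate weight
}

def similarity(a, b):
    """Cosine similarity over co-rated movies (one of many possible weightings)."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    na = math.sqrt(sum(ratings[a][m] ** 2 for m in common))
    nb = math.sqrt(sum(ratings[b][m] ** 2 for m in common))
    return dot / (na * nb)

def predict(user, movie):
    """Similarity-weighted average of neighbors' ratings for the target movie."""
    pairs = [(similarity(user, v), ratings[v][movie])
             for v in ratings if v != user and movie in ratings[v]]
    total = sum(w for w, _ in pairs)
    return sum(w * r for w, r in pairs) / total if total else None

print(predict("u1", "m5"))  # u2's rating counts more because of its stronger weight
```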
Matrix Factorization in Action
[Figure: the sparse ratings matrix is approximated by a reduced-rank singular value decomposition (sort of): a user × factor matrix (< a bunch of numbers >) times a factor × movie matrix (< a bunch of numbers >), here with 5 factors]
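On a small dense stand-in matrix (the real ratings matrix is huge, sparse, and needs the "sort of" caveat, since SVD is not defined on missing entries), a reduced-rank SVD looks like:

```python
import numpy as np

# Small dense stand-in for the ratings matrix (the real one is huge and sparse)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

F = 2  # number of factors to keep
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = (U[:, :F] * s[:F]) @ Vt[:F, :]  # (users x factors) times (factors x movies)

print(np.round(R_hat, 2))  # rank-F approximation of R
```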
Matrix Factorization in Action
[Figure: multiplying the user × factor matrix by the factor × movie matrix fills in the entire ratings matrix, yielding a predicted rating for every < user, movie > pair]
Netflix Prize Progress: Major Milestones
[Figure: RMS error on quiz set (axis 0.85 to 1.05) at major milestones: trivial algorithm, Cinematch, 2007 Progress Prize (8.43% improvement), 2008 Progress Prize (9.44%), Grand Prize (10.09%, against the 10.00% required); author's entry point: June 2008]
Money and fame make people crazy, in both good ways and bad
Stochastic (definition):
1. involving a random variable
2. involving chance or probability; probabilistic
[Figure: factorization as a training process: the known ratings (training data) are used to learn a user × factor matrix (< a bunch of numbers >) and a factor × movie matrix (< a bunch of numbers >)]
[Figure: after training, multiplying the two factor matrices reconstructs a full ratings matrix, giving predictions for all < user, movie > pairs]
Notation
Number of users = I
Number of items = J
Number of factors per user / item = F
User of interest = i
Item of interest = j
Factor index = f
Prediction: r̂ij = Σ_{f=1..F} Uif · Vjf
Error: e = rij - r̂ij
Squared loss: L( rij, r̂ij ) = e²
Gradients, for f = 1 to F:
∂L / ∂Uif = -2 · e · Vjf
∂L / ∂Vjf = -2 · e · Uif
Stochastic gradient descent updates, with learning rate η and regularization λ, for f = 1 to F:
Uif ← Uif + 2η · ( e · Vjf - λ · Uif )
Vjf ← Vjf + 2η · ( e · Uif - λ · Vjf )
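The update rules above can be sketched as plain-Python stochastic gradient descent; the training triples, learning rate, regularization constant, and epoch count below are made-up illustrative values:

```python
import random

# Hypothetical tiny training set of (i, j, rij) triples; the real data had ~10^8
train = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
         (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
I, J, F = 3, 3, 2                   # users, items, factors per user / item
eta, lam, epochs = 0.01, 0.02, 500  # learning rate, regularization, passes (made up)

random.seed(0)
U = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(I)]  # user factors
V = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(J)]  # item factors

for _ in range(epochs):
    for i, j, r in train:
        e = r - sum(U[i][f] * V[j][f] for f in range(F))  # prediction error
        for f in range(F):                                # for f = 1 to F
            uif = U[i][f]                                 # save before overwriting
            U[i][f] += 2 * eta * (e * V[j][f] - lam * uif)
            V[j][f] += 2 * eta * (e * uif - lam * V[j][f])

# After training, the dot product U[i] . V[j] approximates rij on observed cells
print(sum(U[0][f] * V[0][f] for f in range(F)))
```

Saving `uif` before the in-place update keeps the two update rules using the same pre-update value, matching the simultaneous updates in the equations.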
Dimensionality Reduction
Some slides thanks to Xiaoli Fern (CS534, Oregon State Univ., 2011).
Some figures taken from "An Introduction to Statistical Learning, with applications in R" (Springer,
2013) with permission of the authors, G. James, D. Witten, T. Hastie and R. Tibshirani.
Curse of dimensionality
Approaches to dimensionality reduction
Feature selection
Select subset of existing features (without
modification)
Lecture 5 and Project 1
Model regularization
L2 reduces effective dimensionality
L1 reduces actual dimensionality
Feature extraction
Combine (map) existing features into smaller
number of new features
Linear combination (projection)
Nonlinear combination
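A linear combination (projection) of features can be sketched with an SVD-based, PCA-style projection; the 3-D data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples in 3-D whose variance lies mostly along one direction
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                  # center each feature
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                        # map 3 old features -> 2 new features

print(X.shape, "->", Z.shape)
```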
Linear dimensionality reduction