University of Washington
CSS 581 - Introduction to Machine Learning
Instructor: J Jeffry Howbert
Lecture 2: Math Essentials
Probability
Linear algebra
Linear algebra applications
Linear Regression
Regression Trees
w1 = cov[ x, y ] / var[ x ]
w0 = ȳ - w1 · x̄    ( x̄, ȳ = means of training x, y )
ŷt = w0 + w1 · xt    for test sample xt
Jeff Howbert Introduction to Machine Learning Winter 2014 #
Least squares linear fit to data
Multiple dimensions
To simplify notation and derivation, add a new feature
x0 = 1 to feature vector x:
ŷ = w0 · 1 + Σ_{i=1..d} wi · xi = wᵀ x
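The least-squares fit above can be sketched in Python; the training data here are made-up illustrative numbers, not from the lecture:

```python
import numpy as np

# Hypothetical 1-D training data (not from the lecture)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Simple least-squares fit: w1 = cov[x, y] / var[x], w0 = mean(y) - w1 * mean(x)
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()

# Same fit with the added feature x0 = 1, so y_hat = w . x
X = np.column_stack([np.ones_like(x), x])  # prepend x0 = 1 to every sample
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||Xw - y||^2

# Predict for a test sample xt: y_hat_t = w0 + w1 * xt
xt = 5.0
print(w0 + w1 * xt)
```

Both routes recover the same weight vector, which is why the x0 = 1 trick simplifies the multi-dimensional notation.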
PROCESS:
1. users provide ratings on items they have experienced
2. Take all < user, item, rating > data and build a predictive
model
3. For a user who hasn't experienced a particular item, use the
model to predict how well they will like it (i.e., predict the
rating)
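A minimal illustration of steps 2-3, in the spirit of the "global effects" baseline (global mean plus per-user and per-item offsets); all names and ratings below are hypothetical:

```python
from collections import defaultdict

# Step 1: hypothetical (user, item, rating) triples
ratings = [
    ("alice", "m1", 4), ("alice", "m2", 5),
    ("bob",   "m1", 2), ("bob",   "m3", 3),
    ("carol", "m2", 4), ("carol", "m3", 5),
]

# Step 2: build a predictive model: global mean plus per-user and per-item offsets
mu = sum(r for _, _, r in ratings) / len(ratings)

def offsets(index):
    acc = defaultdict(list)
    for triple in ratings:
        acc[triple[index]].append(triple[2] - mu)
    return {k: sum(v) / len(v) for k, v in acc.items()}

user_off, item_off = offsets(0), offsets(1)

# Step 3: predict a rating for an item the user hasn't experienced
def predict(user, item):
    return mu + user_off.get(user, 0.0) + item_off.get(item, 0.0)

print(predict("alice", "m3"))  # alice has never rated m3
```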
Roles of Recommender Systems
Help users deal with paradox of choice
content-based approach
explicit characteristics of users and items
PARTICIPATION:
51,051 contestants on 41,305 teams from 186 different
countries
44,014 valid submissions from 5,169 different teams
The Netflix Prize Data
Netflix released three datasets
480,189 users (anonymous)
17,770 movies
ratings on integer scale 1 to 5
Training set: 99,072,112 < user, movie > pairs with ratings
Probe set: 1,408,395 < user, movie > pairs with ratings
Qualifying set of 2,817,131 < user, movie > pairs with no
ratings
Model Building and Submission Process
[Diagram: model building and submission workflow]
Training set (99,072,112 ratings, known) is used to build the MODEL; the probe set (1,408,395 ratings, known) is used for tuning and validation.
The model then makes predictions on the qualifying set (ratings unknown), which is split into:
quiz set (1,408,342 pairs): RMSE posted on the public leaderboard
test set (1,408,789 pairs): RMSE kept secret for final scoring
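Submissions were scored by root-mean-square error (RMSE). A minimal sketch of the metric, on made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error: sqrt of the mean squared prediction error."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical predictions vs. true ratings
print(rmse([3.5, 4.1, 2.0], [4, 4, 3]))
```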
Why the Netflix Prize Was Hard
[Figure: the 480,189-user × 17,770-movie ratings matrix, shown mostly empty]
Massive dataset
Very sparse matrix: only 1.2% occupied
Extreme variation in number of ratings per user
Statistical properties of qualifying and probe sets different from training set
Dealing with Size of the Data
MEMORY:
2 GB bare minimum for common algorithms
4+ GB required for some algorithms
need 64-bit machine with 4+ GB RAM if serious
SPEED:
Program in languages that compile to fast machine code
64-bit processor
Exploit low-level parallelism in code (SIMD on Intel x86/x64)
Common Types of Algorithms
Global effects
Nearest neighbors
Matrix factorization
Restricted Boltzmann machine
Clustering
Etc.
Nearest Neighbors in Action
[Figure: ratings matrix with a target user's row highlighted; a neighbor with identical preferences gets strong weight, a neighbor with similar preferences gets moderate weight]
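The neighbor-weighting idea can be sketched as user-based collaborative filtering. Cosine similarity over co-rated movies is just one plausible weighting choice, and the users and ratings below are made up:

```python
import math

# Hypothetical sparse ratings: user -> {movie: rating}
ratings = {
    "u1": {"m1": 2, "m2": 3, "m3": 3, "m4": 4},
    "u2": {"m1": 2, "m2": 3, "m3": 3, "m4": 4, "m5": 5},  # identical overlap: strong weight
    "u3": {"m1": 3, "m2": 2, "m3": 4, "m5": 1},           # similar overlap: moderate weight
}

def similarity(a, b):
    """Cosine similarity over co-rated movies (one of many possible weightings)."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    na = math.sqrt(sum(ratings[a][m] ** 2 for m in common))
    nb = math.sqrt(sum(ratings[b][m] ** 2 for m in common))
    return dot / (na * nb)

def predict(user, movie):
    """Similarity-weighted average of neighbors' ratings for the target movie."""
    pairs = [(similarity(user, v), ratings[v][movie])
             for v in ratings if v != user and movie in ratings[v]]
    total = sum(w for w, _ in pairs)
    return sum(w * r for w, r in pairs) / total if total else None

print(predict("u1", "m5"))  # u2's rating counts more because of its stronger weight
```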
Matrix Factorization in Action
[Figure: the sparse ratings matrix is approximated by a reduced-rank singular value decomposition (sort of): a user × factor matrix (< a bunch of numbers >) times a factor × movie matrix (< a bunch of numbers >), here with 5 factors]
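On a small dense stand-in matrix (the real ratings matrix is huge, sparse, and needs the "sort of" caveat, since SVD is not defined on missing entries), a reduced-rank SVD looks like:

```python
import numpy as np

# Small dense stand-in for the ratings matrix (the real one is huge and sparse)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

F = 2  # number of factors to keep
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = (U[:, :F] * s[:F]) @ Vt[:F, :]  # (users x factors) times (factors x movies)

print(np.round(R_hat, 2))  # rank-F approximation of R
```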
Matrix Factorization in Action
[Figure: multiplying the user × factor matrix by the factor × movie matrix fills in the entire ratings matrix, yielding a predicted rating for every < user, movie > pair]
Netflix Prize Progress: Major Milestones
[Figure: RMS error on quiz set (axis 0.85 to 1.05) at major milestones: trivial algorithm, Cinematch, 2007 Progress Prize (8.43% improvement), 2008 Progress Prize (9.44%), Grand Prize (10.09%, against the 10.00% required); author's entry point: June 2008]
Money and fame make people crazy, in both good ways and bad
Stochastic (definition):
1. involving a random variable
2. involving chance or probability; probabilistic
[Figure: factorization as a training process: the known ratings (training data) are used to learn a user × factor matrix (< a bunch of numbers >) and a factor × movie matrix (< a bunch of numbers >)]
[Figure: after training, multiplying the two factor matrices reconstructs a full ratings matrix, giving predictions for all < user, movie > pairs]
Notation
Number of users = I
Number of items = J
Number of factors per user / item = F
User of interest = i
Item of interest = j
Factor index = f
Prediction: r̂ij = Σ_{f=1..F} Uif · Vjf
Error: e = rij - r̂ij
Squared loss: L( rij, r̂ij ) = e²
Gradients, for f = 1 to F:
∂L / ∂Uif = -2 · e · Vjf
∂L / ∂Vjf = -2 · e · Uif
Stochastic gradient descent updates, with learning rate η and regularization λ, for f = 1 to F:
Uif ← Uif + 2η · ( e · Vjf - λ · Uif )
Vjf ← Vjf + 2η · ( e · Uif - λ · Vjf )
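The update rules above can be sketched as plain-Python stochastic gradient descent; the training triples, learning rate, regularization constant, and epoch count below are made-up illustrative values:

```python
import random

# Hypothetical tiny training set of (i, j, rij) triples; the real data had ~10^8
train = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
         (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
I, J, F = 3, 3, 2                   # users, items, factors per user / item
eta, lam, epochs = 0.01, 0.02, 500  # learning rate, regularization, passes (made up)

random.seed(0)
U = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(I)]  # user factors
V = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(J)]  # item factors

for _ in range(epochs):
    for i, j, r in train:
        e = r - sum(U[i][f] * V[j][f] for f in range(F))  # prediction error
        for f in range(F):                                # for f = 1 to F
            uif = U[i][f]                                 # save before overwriting
            U[i][f] += 2 * eta * (e * V[j][f] - lam * uif)
            V[j][f] += 2 * eta * (e * uif - lam * V[j][f])

# After training, the dot product U[i] . V[j] approximates rij on observed cells
print(sum(U[0][f] * V[0][f] for f in range(F)))
```

Saving `uif` before the in-place update keeps the two update rules using the same pre-update value, matching the simultaneous updates in the equations.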
Dimensionality Reduction
Some slides thanks to Xiaoli Fern (CS534, Oregon State Univ., 2011).
Some figures taken from "An Introduction to Statistical Learning, with applications in R" (Springer,
2013) with permission of the authors, G. James, D. Witten, T. Hastie and R. Tibshirani.
Curse of dimensionality
Approaches to dimensionality reduction
Feature selection
Select subset of existing features (without
modification)
Lecture 5 and Project 1
Model regularization
L2 reduces effective dimensionality
L1 reduces actual dimensionality
Feature extraction
Combine (map) existing features into smaller
number of new features
Linear combination (projection)
Nonlinear combination
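A linear combination (projection) of features can be sketched with an SVD-based, PCA-style projection; the 3-D data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples in 3-D whose variance lies mostly along one direction
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                  # center each feature
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                        # map 3 old features -> 2 new features

print(X.shape, "->", Z.shape)
```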
Linear dimensionality reduction