
Lecture No:

Recommender Systems (overview),


Information Retrieval (IR)

Presented by: Dr Mustansar Ali Ghazanfar


Mustansar.ali@uettaxila.edu.pk
UET Taxila, Pakistan

08/11/13 1
Agenda

Recommender Systems (RS) Overview


Collaborative Filtering
Content-Based Filtering
Information Retrieval (IR)
Information Extraction Steps
Vector space model
Conclusion
Recommender Systems
Systems for recommending items (e.g. books,
movies, CDs, web pages, newsgroup messages) to
users based on examples of their preferences.
Many on-line stores provide recommendations (e.g.
Amazon, CDNow).
Recommenders have been shown to substantially
increase sales at on-line stores.
Two basic approaches to recommending:
Collaborative Filtering (a.k.a. social filtering)
Content-based
Book Recommender
[Figure: a machine-learning book recommender builds a user profile and recommends among titles such as Red Mars, Foundation, Jurassic Park, Lost World, 2001, 2010, Neuromancer, and Difference Engine.]
Personalization
Recommenders are instances of personalization
software.
Personalization concerns adapting to the individual
needs, interests, and preferences of each user.

Machine Learning and
Personalization
Machine Learning can allow learning a user
model or profile of a particular user based on:
Sample interaction
Rated examples
This model or profile can then be used to:
Recommend items
Filter information
Predict behavior

Collaborative Filtering
Maintain a database of many users' ratings of a
variety of items.
For a given user, find other similar users whose
ratings strongly correlate with the current user.
Recommend items rated highly by these similar users,
but not rated by the current user.
Almost all existing commercial recommenders use
this approach (e.g. Amazon).

Collaborative Filtering

[Figure: the user database holds each user's ratings for items A through Z; the active user's ratings are correlation-matched against the database to find similar users, from whom recommendations are extracted.]
Collaborative Filtering Method
Weight all users with respect to similarity with
the active user.
Select a subset of the users (neighbors) to use
as predictors.
Normalize ratings and compute a prediction
from a weighted combination of the selected
neighbors' ratings.
Present items with highest predicted ratings as
recommendations.
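The four steps above can be sketched as a minimal user-based CF predictor. This is only an illustration with made-up ratings (user and item names are hypothetical), using Pearson correlation as the similarity weight and mean-centering as the normalization:

```python
# Minimal user-based CF sketch: weight users by Pearson similarity to the
# active user, then predict as a similarity-weighted average of neighbors'
# mean-centered ratings. Toy data; not the lecture's exact dataset.
from math import sqrt

def pearson(u, v):
    """Pearson correlation over the items rated by both users."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = sqrt(sum((u[i] - mu) ** 2 for i in common)) * \
          sqrt(sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Weighted combination of the neighbors' normalized ratings."""
    mean_a = sum(active.values()) / len(active)
    num = den = 0.0
    for v in others:
        if item not in v:
            continue
        w = pearson(active, v)
        mean_v = sum(v.values()) / len(v)
        num += w * (v[item] - mean_v)
        den += abs(w)
    return mean_a + num / den if den else mean_a

active = {"A": 9, "B": 3, "C": 7}
others = [{"A": 9, "B": 3, "C": 8, "Z": 1},
          {"A": 5, "B": 3, "C": 9, "Z": 10}]
print(round(predict(active, others, "Z"), 2))
```

Items with the highest predicted ratings would then be presented as recommendations.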
Types of Collaborative Filtering (CF)
Memory based approaches (covered)
User-based CF
Item-based CF
Model-based approaches (will cover later)
Clustering

Advantages with Collaborative Filtering
Simple algorithm
Uses the wisdom of the crowd
Can recommend "out of the box" (serendipitous items)
High-quality recommendations (high accuracy)

Problems with Collaborative Filtering
Cold Start: There needs to be enough other users already in
the system to find a match.
Sparsity: If there are many items to be recommended, even if
there are many users, the user/ratings matrix is sparse, and it is
hard to find users that have rated the same items.
First Rater: Cannot recommend an item that has not been
previously rated.
New items
Popularity Bias: Cannot recommend items to someone with
unique tastes.
Gray-Sheep Users.

Content-Based Recommending
(CBR)
Recommendations are based on information about the
content of items rather than on other users' opinions.
Uses a machine learning algorithm to induce a profile
of the user's preferences from examples based on a
featural description of content.
Some previous applications:
Newsweeder (Lang, 1995)

Advantages of CBR
No cold-start or sparsity problems.
Able to recommend to users with unique tastes.
Able to recommend new and unpopular items
No first-rater problem.
Can provide explanations of recommended items by
listing content-features that caused an item to be
recommended.

Disadvantages of CBR
Requires content that can be encoded as meaningful
features (problem for multimedia data).
Users' tastes must be represented as a learnable
function of these content features.
Low-quality recommendations (low accuracy).
Unable to exploit quality judgments of other users.
Cannot recommend "out of the box".

Content-Based Recommending
vs. Collaborative Filtering
Content-Based Filtering:
Info about the content of items
Features & keywords about items
Train machine-learning classifiers and predict
Collaborative Filtering:
Info about similar users (or items)
Ratings & past history of users
Use (social) collective intelligence to predict
Presentations
Will start from Nov 18, 2013
Keep an eye on email
10% of grades
Who is crying?

*Not part of course (only for presentations)
LIBRA System [1]
[Figure: an information-extraction module turns Amazon pages into the LIBRA database; a machine-learning learner builds a user profile from rated examples, and a predictor produces a ranked list of recommendations.]
Combining Content and
Collaboration [1]
Content-based and collaborative methods have
complementary strengths and weaknesses.
Combine methods to obtain the best of both.
Various hybrid approaches:
Apply both methods and combine recommendations.
Use collaborative data as content.
Use content-based predictor as another collaborator.
Use content-based predictor to complete collaborative
data.

Content-Boosted Collaborative
Filtering [1]
[Figure: a web crawler augments the EachMovie ratings with IMDb content in a movie content database; a content-based predictor turns the sparse user-ratings matrix into a full user-ratings matrix, on which collaborative filtering produces recommendations for the active user.]
Content-Boosted CF - I
[Figure: the content-based predictor, trained on the user-rated items, maps each user-ratings vector (user-rated items plus unrated items) to a pseudo user-ratings vector in which unrated items carry predicted ratings.]
Content-Boosted CF - II

[Figure: the User Ratings Matrix is passed through the Content-Based Predictor to yield the Pseudo User Ratings Matrix.]
Compute the pseudo user-ratings matrix
The full matrix approximates the actual full user-ratings matrix
Perform CF using Pearson correlation between pseudo user-rating vectors
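The construction of the pseudo user-ratings matrix can be sketched in a few lines. This is a toy illustration (the function names, the rating scale, and the stand-in content predictor are all made up, not the paper's implementation):

```python
# Sketch of building the pseudo user-ratings matrix: every entry a user has
# not rated is filled in by a content-based predictor, so standard CF can
# then run on a dense matrix.
def pseudo_matrix(ratings, content_predict):
    """ratings: {user: {item: rating or None}}, None marking unrated items."""
    full = {}
    for user, row in ratings.items():
        full[user] = {item: (r if r is not None else content_predict(user, item))
                      for item, r in row.items()}
    return full

# Stand-in content predictor: always guesses the midpoint of a 1-5 scale.
sparse = {"u1": {"m1": 5, "m2": None}, "u2": {"m1": None, "m2": 2}}
dense = pseudo_matrix(sparse, lambda u, i: 3)
print(dense["u1"]["m2"])  # filled-in pseudo rating
```

Pearson correlation is then computed between these now-dense pseudo user-rating vectors, exactly as in pure CF.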
Experimental Method
Used subset of EachMovie (7,893 users; 299,997
ratings)
Test set: 10% of the users selected at random.
Test users that rated at least 40 movies.
Train on the remaining users.
Hold-out set: 25% items for each test user.
Predict rating of each item in the hold-out set.
Compared CBCF to other prediction approaches:
Pure CF
Pure Content-based
Naïve hybrid (averages CF and content-based predictions)
Metrics
Mean Absolute Error (MAE)
Compares numerical predictions with user ratings

ROC sensitivity
How well predictions help users select high-quality items
Ratings ≥ 4 considered good; < 4 considered bad

Results - I
[Bar chart: MAE for CF, Content, Naïve, and CBCF; y-axis from 0.9 to 1.06.]
CBCF is significantly better (4% improvement over CF)
Results - II
[Bar chart: ROC-4 sensitivity for CF, Content, Naïve, and CBCF; y-axis from 0.58 to 0.68.]
CBCF outperforms the rest (5% improvement over CF)
Information Retrieval (IR)

Information Retrieval (IR)
The indexing and retrieval of textual documents.
Web search is the "killer app" of IR.
Concerned firstly with retrieving relevant documents to
a query.
Concerned secondly with retrieving from large sets of
documents efficiently.

Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to
the query.

IR System

[Figure: a query string and a document corpus are fed to the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance
Relevance is a subjective judgment and may
include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her
intended use of the information (information
need).

Keyword Search
Simplest notion of relevance is that the query
string appears verbatim in the document.
Slightly less strict notion is that the words in
the query appear frequently in the document,
in any order (bag of words).

Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
restaurant vs. café
PRC vs. China
May retrieve irrelevant documents that
include ambiguous terms.
bat (baseball vs. mammal)
Apple (company vs. fruit)
bit (unit of data vs. act of eating)

Beyond Keywords
We will cover the basics of keyword-based IR,
focusing on its core capabilities.

Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect
feedback.
Taking into account the authority of the source
(the loophole that Google cracked!).

Web Search
Application of IR to HTML documents on the
World Wide Web.
Differences:
Must assemble document corpus by spidering
the web.
Documents change uncontrollably.
Can exploit the link structure of the web.

Web Search System
[Figure: a web spider assembles the document corpus; a query string is fed to the IR system, which returns a ranked list of pages (1. Page1, 2. Page2, 3. Page3, ...).]
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
Example
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software
for PC-Based Voice Mail systems. Experienced in C Programming. Must be
familiar with communicating with and controlling voice cards; preferable Dialogic,
however, experience with others such as Rhetorix and Natural Microsystems is
okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years.
Need to find a Senior level person who can come on board and pick up code
with very little training. Present Operating System is DOS. May go to OS-2 or
UNIX in future [1].

Steps involved [1]

Tokenization
Stop word removal
Stemming
Indexing
Weighting schemes
Similarity Measure (For Content-Based Filtering)
Dot Product
Vector Space Model

Tokenization
Sequence of discrete tokens (words).
Punctuation (e-mail), numbers (2009), and case
(MobileVCE vs. mobileVCE) can be a meaningful
part of a token.
Keywords
Simplest approach
Ignore all numbers
Ignore all punctuation
Ignore case

Stop Words
It is typical to exclude high-frequency words (e.g.
function words: a, the, in, to; pronouns: I, he,
she, it).
Language dependent.
Google stop words (
www.ranks.nl/resources/stopwords.html)
Customizable domain dependent lists

Stemming
Reduce tokens to root form of words to recognize
morphological variation.
computer, computational, computation are all reduced to
the same token: compute
Correct morphological analysis is language specific
and can be complex.
Stemming strips off known affixes (prefixes and
suffixes) in an iterative fashion.
Example: Porter stemmer algorithm
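The three preprocessing steps covered so far (tokenization, stop-word removal, stemming) can be sketched as a toy pipeline. The stop-word list and the suffix-stripping "stemmer" here are deliberately crude stand-ins, not the Porter algorithm:

```python
# Toy text-preprocessing pipeline: tokenize (ignoring numbers, punctuation,
# and case), drop stop words, then strip a few common suffixes. A real
# system would use a proper stemmer such as Porter's.
import re

STOP_WORDS = {"a", "the", "in", "to", "i", "he", "she", "it", "is", "of", "are"}

def tokenize(text):
    """Lowercase and keep only alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive stemming: strip one known suffix (not Porter's algorithm)."""
    for suffix in ("ational", "ation", "ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The computers are computing a computation."))
```

All three morphological variants collapse to the same root token, which is exactly what indexing needs.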

Indexing

Index file: each index term stores its document frequency (df) and a postings list of (Dj, tfj) pairs:
computer  df 3  ->  (D7, 4), ...
database  df 2  ->  (D1, 3), ...
science   df 4  ->  (D2, 4), ...
system    df 1  ->  (D5, 2)
Term Weights: Term Frequency

More frequent terms in a document are more important, i.e.
more indicative of the topic.
fij = frequency of term i in document j
Normalize term frequency (tf) by dividing by the frequency
of the most common term in the document:
tfij = fij / maxi{fij}

Term Weights:
Inverse Document Frequency
Terms that appear in many different documents are less
indicative of overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
An indication of a term's discrimination power.
Log used to dampen the effect relative to tf.

TF-IDF Weighting
A typical combined term importance indicator is tf-idf
weighting:
wij = tfij · idfi = tfij · log2(N / dfi)

A term occurring frequently in the document but rarely in
the rest of the collection is given high weight.
Many other ways of determining term weights have
been proposed.
Experimentally, tf-idf has been found to work well.

Similarity Measure
A similarity measure is a function that computes the
degree of similarity between two vectors.

Similarity measure between the query and each document
can help:
to rank the retrieved documents in the order of
presumed relevance.

The Vector-Space Model
Assume t distinct terms remain after preprocessing; call them
index terms or the vocabulary.
These orthogonal terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a real-valued
weight, wij.
Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, ..., wtj)

Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the 3-D space spanned by axes T1, T2, T3.]
Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the weight of a
term in the document.

      T1   T2  ...  Tt
D1    w11  w21 ...  wt1
D2    w12  w22 ...  wt2
:     :    :        :
Dn    w1n  w2n ...  wtn
TF-IDF Weighting
A typical combined term importance
indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)

Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
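The arithmetic on this slide can be reproduced in a few lines of Python (same frequencies, same collection of 10,000 documents):

```python
# Reproduces the slide's worked TF-IDF example: tf normalized by the most
# frequent term in the document, idf = log2(N / df).
from math import log2

freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection
N = 10_000
max_f = max(freqs.values())

weights = {}
for term in freqs:
    tf = freqs[term] / max_f
    idf = log2(N / dfs[term])
    weights[term] = round(tf * idf, 1)
    print(term, round(tf, 2), round(idf, 1), weights[term])
```

Running it yields the tf-idf values on the slide: 7.6 for A, 2.0 for B, 1.8 for C.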

Query Vector
Query vector is typically treated as a document and
also tf-idf weighted.
Alternative is for the user to supply weights for the
given query terms.

Similarity Measure - Inner Product
Similarity between vectors for the document di and query q can
be computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj · q = Σ (i = 1 to t) wij · wiq
where wij is the weight of term i in document j and wiq is the
weight of term i in the query.
Inner Product -- Examples
Binary:
(vocabulary: retrieval, database, architecture, computer, text, management, information)
D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
Size of vector = size of vocabulary = 7; 0 means the corresponding term is not found in the document or query.
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2

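Both examples above reduce to the same dot product, as a short sketch shows:

```python
# Inner-product similarity: the same dot product covers both the binary
# and the weighted examples from the slide.
def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary vectors over the 7-term vocabulary
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner(D, Q))  # 3

# Weighted vectors over T1..T3
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner(D1, Qw), inner(D2, Qw))  # 10 2
```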
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between
two vectors.
Inner product normalized by the vector lengths:
CosSim(dj, q) = (dj · q) / (|dj| |q|)
             = Σ (i = 1 to t) wij·wiq / ( sqrt(Σ wij²) · sqrt(Σ wiq²) )
[Figure: vectors D1, D2, and Q in term space; the angle θ1 between D1 and Q is smaller than the angle θ2 between D2 and Q.]
D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

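The cosine computation above, reproduced in code (same D1, D2, and Q as on the slide):

```python
# Cosine similarity: inner product normalized by the two vector lengths.
from math import sqrt

def cos_sim(d, q):
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
```

Unlike the raw inner product, the normalization keeps long documents from dominating purely by length.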
Naïve Implementation (Pseudocode)
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.

Time complexity: O(|V||D|). Bad for large V and D!
|V| = 10,000; |D| = 100,000; |V||D| = 1,000,000,000

08/11/13 59
Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence
frequencies.
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious
weaknesses.

Problems with Vector Space Model
Missing semantic information (e.g. word sense).
Missing syntactic information (e.g. phrase structure, word
order, proximity information).
Assumption of term independence (e.g. ignores synonymy).
Given a two-term query A B, may prefer a document
containing A frequently but not B, over a document that
contains both A and B, but both less frequently.

References
1. https://www.cs.utexas.edu/~ml/publications/area/119/learning_for_recommender_systems/abstracts/
2. http://www.cs.utexas.edu/~mooney/ir-course/

