
Lecture No:

Recommender Systems (overview),

Information Retrieval (IR)

Presented by: Dr Mustansar Ali Ghazanfar
UET Taxila, Pakistan

08/11/13 1

Recommender Systems (RS) Overview

Collaborative Filtering
Content-Based Filtering
Information Retrieval (IR)
Information Extraction Steps
Vector space model
Recommender Systems
Systems for recommending items (e.g. books,
movies, CDs, web pages, newsgroup messages) to
users based on examples of their preferences.
Many on-line stores provide recommendations (e.g.
Amazon, CDNow).
Recommenders have been shown to substantially
increase sales at on-line stores.
Two basic approaches to recommending:
Collaborative Filtering (a.k.a. social filtering)
Content-based filtering
Book Recommender
[Figure: book recommender diagram. Recoverable labels: Machine Learning, User Profile.]
Recommenders are instances of personalization
Personalization concerns adapting to the individual
needs, interests, and preferences of each user.

Machine Learning and Personalization
Machine Learning can allow learning a user
model or profile of a particular user based on:
Sample interaction
Rated examples
This model or profile can then be used to:
Recommend items
Filter information
Predict behavior

Collaborative Filtering
Maintain a database of many users' ratings of a
variety of items.
For a given user, find other similar users whose
ratings strongly correlate with the current user.
Recommend items rated highly by these similar users,
but not rated by the current user.
Almost all existing commercial recommenders use
this approach (e.g. Amazon).

Collaborative Filtering
[Figure: collaborative filtering pipeline. The Active User's ratings (A 9, B 3, Z 5; C unrated) are correlation-matched against the User Database; the matched user (A 10, B 4, C 8, Z 1) supplies item C in the Extract Recommendations step.]
Collaborative Filtering Method
Weight all users with respect to similarity with
the active user.
Select a subset of the users (neighbors) to use
as predictors.
Normalize ratings and compute a prediction
from a weighted combination of the selected
neighbors' ratings.
Present items with highest predicted ratings as
recommendations.
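The four steps of the collaborative filtering method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the names `pearson` and `predict`, the neighborhood size `k`, the choice to keep only positively correlated neighbors, and mean-centering as the normalization are all choices made for this example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items that both users have rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item, k=2):
    """Predict the active user's rating of `item` from the k most similar users."""
    mean_active = sum(active.values()) / len(active)
    # Step 1: weight every user who rated the item by similarity to the active user.
    weighted = [(pearson(active, r), r) for r in others if item in r]
    # Step 2: keep the k neighbors with the largest positive weight (an assumption).
    neighbors = sorted((w for w in weighted if w[0] > 0),
                       key=lambda w: w[0], reverse=True)[:k]
    if not neighbors:
        return mean_active
    # Step 3: mean-center each neighbor's rating and combine by weight.
    num = sum(w * (r[item] - sum(r.values()) / len(r)) for w, r in neighbors)
    den = sum(w for w, _ in neighbors)
    return mean_active + num / den

# Toy ratings (hypothetical data).
active = {"A": 9, "B": 3, "Z": 5}
others = [{"A": 10, "B": 4, "C": 8, "Z": 1},
          {"A": 3, "B": 9, "C": 2, "Z": 5}]
prediction_for_c = predict(active, others, "C")
```

Step 4 would repeat `predict` for every item the active user has not rated and present the items with the highest predictions.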
Types of Collaborative Filtering (CF)
Memory based approaches (covered)
User-based CF
Item-based CF
Model-based approaches (will cover later)

Advantages with Collaborative Filtering
Simple algorithm
Uses the wisdom of the crowd
Can recommend "out of the box" (items outside the user's usual profile)
High quality recommendations (high accuracy)

Problems with Collaborative Filtering
Cold Start: There needs to be enough other users already in
the system to find a match.
Sparsity: If there are many items to be recommended, even if
there are many users, the user/ratings matrix is sparse, and it is
hard to find users that have rated the same items.
First Rater: Cannot recommend an item that has not been
previously rated.
New items
Popularity Bias: Cannot recommend items to someone with
unique tastes.
Gray-Sheep Users.

Content-Based Recommending
Recommendations are based on information on the
content of items rather than on other users' opinions.
Uses a machine learning algorithm to induce a profile
of the user's preferences from examples based on a
featural description of content.
Some previous applications:
Newsweeder (Lang, 1995)

Advantages of CBR
No cold-start or sparsity problems.
Able to recommend to users with unique tastes.
Able to recommend new and unpopular items
No first-rater problem.
Can provide explanations of recommended items by
listing content-features that caused an item to be
recommended.
Disadvantages of CBR
Requires content that can be encoded as meaningful
features (problem for multimedia data).
Users' tastes must be represented as a learnable
function of these content features.
Low quality recommendation (low accuracy)
Unable to exploit quality judgments of other users.
Cannot recommend out of the box
Content-Based Recommending vs. Collaborative Filtering

Content-Based Filtering                         Collaborative Filtering
Info about content of items                     Info about the similar users (or items)
Features & keywords about item                  Ratings & past history of user
Train machine learning classifiers and predict  Use (social) collective intelligence to predict
Will start from Nov
18, 2013
Keep an eye on
10% grades
Who is crying?

*Not part of course (only for presentations)

LIBRA System [1]
[Figure: LIBRA architecture. Recoverable labels: Amazon Pages, Information, Database, Examples, Machine Learning, User Profile, Predictor.]
*Not part of course (only for presentations)
Combining Content and Collaboration [1]
Content-based and collaborative methods have
complementary strengths and weaknesses.
Combine methods to obtain the best of both.
Various hybrid approaches:
Apply both methods and combine recommendations.
Use collaborative data as content.
Use content-based predictor as another collaborator.
Use content-based predictor to complete collaborative
data.

*Not part of course (only for presentations)

Content-Boosted Collaborative Filtering [1]
[Figure: content-boosted CF architecture. Recoverable labels: EachMovie, Web Crawler, IMDb, User Ratings Matrix (Sparse), Full User Ratings Matrix, Active User Ratings, Collaborative Filtering.]
*Not part of course (only for presentations)

Content-Boosted CF - I
[Figure: recoverable labels: User-ratings Vector, Training Examples, Pseudo User-ratings Vector, Unrated Items, Items with Predicted Ratings.]
*Not part of course (only for presentations)

Content-Boosted CF - II

User Ratings Matrix -> Content-Based Predictor -> Pseudo User Ratings Matrix

Compute pseudo user ratings matrix
Full matrix approximates actual full user ratings matrix
Perform CF
Using Pearson correlation between pseudo user-rating vectors

*Not part of course (only for presentations)

Experimental Method
Used subset of EachMovie (7,893 users; 299,997 ratings)
Test set: 10% of the users selected at random.
Test users that rated at least 40 movies.
Train on the remainder sets.
Hold-out set: 25% items for each test user.
Predict rating of each item in the hold-out set.
Compared CBCF to other prediction approaches:
Pure CF
Pure Content-based
Naïve hybrid (averages CF and content-based
predictions)
*Not part of course (only for presentations)
Mean Absolute Error (MAE)
Compares numerical predictions with user ratings

ROC sensitivity
How well predictions help users select high-quality items
Ratings ≥ 4 considered good; < 4 considered bad
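As a small illustration of the MAE metric, the sketch below averages the absolute differences between predicted and actual ratings. The function name `mean_absolute_error` and the sample ratings are assumptions made for the example.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical hold-out ratings: errors of 1, 0 and 2 average to 1.0.
mae = mean_absolute_error([4, 2, 5], [5, 2, 3])
```

A lower MAE means the numerical predictions are closer to the ratings users actually gave.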

*Not part of course (only for presentations)

Results - I
[Figure: bar chart comparing CF, Content, Naïve, and CBCF; visible axis ticks run from 0.94 to 1.]
CBCF is significantly better (4% over CF)
*Not part of course (only for presentations)

Results - II
[Figure: ROC sensitivity bar chart; recoverable entries: CF, Naïve; visible axis ticks 0.62 and 0.64.]
CBCF outperforms rest (5% improvement over CF)
*Not part of course (only for presentations)

Information Retrieval (IR)

Information Retrieval (IR)
The indexing and retrieval of textual documents.
Searching the World Wide Web is its "killer app."
Concerned firstly with retrieving relevant documents to
a query.
Concerned secondly with retrieving from large sets of
documents efficiently.

Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to
the query.
IR System
[Figure: Query String -> IR System -> Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance is a subjective judgment and may
include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her
intended use of the information (information
need).
Keyword Search
Simplest notion of relevance is that the query
string appears verbatim in the document.
Slightly less strict notion is that the words in
the query appear frequently in the document,
in any order (bag of words).
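The two notions of relevance above can be contrasted with a short sketch. The function names, the whitespace-based splitting, and the lower-casing are assumptions made for the example, not part of the slides.

```python
def verbatim_match(query, document):
    """Strict notion: the query string appears as-is in the document."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: count how often each query word occurs, in any order."""
    doc_words = document.lower().split()
    return sum(doc_words.count(word) for word in query.lower().split())

doc = "the cat sat on the mat with another cat"
strict_hit = verbatim_match("cat sat", doc)       # words in the same order
strict_miss = verbatim_match("sat cat", doc)      # same words, reversed
loose_score = bag_of_words_score("sat cat", doc)  # order no longer matters
```

Reversing the query words defeats the verbatim test but leaves the bag-of-words score unchanged.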

Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
restaurant vs. café
PRC vs. China
May retrieve irrelevant documents that
include ambiguous terms.
bat (baseball vs. mammal)
Apple (company vs. fruit)
bit (unit of data vs. act of eating)

Beyond Keywords
We will cover the basics of keyword-based
IR, but
We will focus on basic capabilities of IR

Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect
feedback.
Taking into account the authority of the source
(the gap that Google exploited!).
Web Search
Application of IR to HTML documents on the
World Wide Web.
Must assemble document corpus by spidering
the web.
Documents change uncontrollably.
Can exploit the link structure of the web.

Web Search System
[Figure: a Web Spider builds the Document corpus; Query String -> IR System -> Documents (1. Page1, 2. Page2, 3. Page3, ...).]
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$>


Position available for Software Programmer experienced in generating software
for PC-Based Voice Mail systems. Experienced in C Programming. Must be
familiar with communicating with and controlling voice cards; preferable Dialogic,
however, experience with others such as Rhetorix and Natural Microsystems is
okay. Prefer 5 years or more experience with PC Based Voice Mail, but will
consider as little as 2 years. Need to find a Senior level person who can come
on board and pick up code with very little training. Present Operating System
is DOS. May go to OS-2 or UNIX in future [1].

Steps involved [1]

Tokenization
Stop word removal
Stemming
Weighting schemes
Similarity Measure (For Content-Based Filtering)
Dot Product
Vector Space Model

Tokenization
Analyze text into a sequence of discrete tokens (words).
Punctuation (e-mail), numbers (2009), and case
(MobileVCE vs. mobileVCE) can be a meaningful
part of a token.
Simplest approach
Ignore all numbers
Ignore all punctuation
Ignore case
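The "simplest approach" above can be sketched with a single regular expression. The function name `tokenize` and the restriction to the letters a-z are assumptions made for this example; text in other alphabets would need a different pattern.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only runs of letters.

    Digits and punctuation are dropped, so they act as token separators.
    """
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("E-mail MobileVCE in 2009!")
# -> ['e', 'mail', 'mobilevce', 'in']
```

The output shows the cost of the simple approach: "e-mail" is split in two, the year disappears, and the distinctive capitalization of "MobileVCE" is lost.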

Stop Words
It is typical to exclude high-frequency words (e.g.
function words: a, the, in, to; pronouns: I, he,
she, it).
Language dependent.
Google stop words (
Customizable domain dependent lists
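Stop word removal is a simple filter over the token list. In the sketch below the stop list contains only the example words named on the slide, and the names `STOP_WORDS` and `remove_stop_words` are assumptions; a real list would be longer and language or domain specific.

```python
# Tiny illustrative stop list taken from the examples above (not a complete list).
STOP_WORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop high-frequency function words and pronouns from a token list."""
    return [t for t in tokens if t not in stop_words]

kept = remove_stop_words(["she", "went", "to", "the", "lab"])
# -> ['went', 'lab']
```

Passing a different `stop_words` set is one way to plug in a customized, domain-dependent list.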

Stemming
Reduce tokens to root form of words to recognize
morphological variation.
computer, computational, computation all reduced to same
token compute
Correct morphological analysis is language specific
and can be complex.
Stemming strips off known affixes (prefixes and
suffixes) in an iterative fashion.
Example: Porter stemmer algorithm
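To make the idea of iterative suffix stripping concrete, here is a toy stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions chosen so that the slide's example words collapse to one token.

```python
# Hypothetical suffix list, longest alternatives first; NOT the Porter rule set.
SUFFIXES = ("ational", "ation", "ers", "er", "ing", "s")

def toy_stem(token):
    """Strip known suffixes repeatedly while at least 3 characters remain."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                changed = True
                break
    return token

stems = {toy_stem(w) for w in ["computer", "computational", "computation"]}
# -> {'comput'}
```

Note that the shared stem here is "comput" rather than a dictionary word; a stem only needs to be consistent across variants, not readable.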

Index terms    df    Postings (Dj, tfj)
computer       3     (D7, 4), ...
database       2     (D1, 3), ...
science        4     (D2, 4), ...
system         1     (D5, 2)

Index file representation
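An index file of this shape (term, document frequency, and a list of (Dj, tfj) postings) can be built with a dictionary. The names `build_index` and `document_frequency` and the sample documents are assumptions made for the sketch.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to its postings list of (document id, term frequency).

    `docs` maps a document id to that document's list of tokens.
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    return index

def document_frequency(index, term):
    """df = number of documents whose postings mention the term."""
    return len(index.get(term, []))

# Hypothetical two-document collection.
docs = {"D1": ["database", "system", "database"],
        "D2": ["computer", "database"]}
index = build_index(docs)
```

Here `index["database"]` holds one posting per document, so its length is the term's df.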

Term Weights: Term Frequency

More frequent terms in a document are more
important, i.e. more indicative of the topic.
$f_{ij}$ = frequency of term $i$ in document $j$
Normalize term frequency (tf) by dividing by the
frequency of the most common term in the
document:
$tf_{ij} = f_{ij} / \max_i \{ f_{ij} \}$
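The normalization can be sketched directly from the formula: count each term, then divide by the largest count in the document. The function name `normalized_tf` is an assumption, and the sketch expects a non-empty token list.

```python
from collections import Counter

def normalized_tf(tokens):
    """tf = raw count divided by the count of the document's most common term."""
    counts = Counter(tokens)
    max_count = max(counts.values())  # raises ValueError on an empty document
    return {term: count / max_count for term, count in counts.items()}

tf = normalized_tf(["a", "a", "a", "b", "b", "c"])
# the most common term gets 1.0; the others get 2/3 and 1/3
```

Dividing by the maximum keeps every tf value in the range (0, 1], so long documents do not dominate simply because they repeat words more often.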

Term Weights:
Inverse Document Frequency
Terms that appear in many different documents are less
indicative of overall topic.
$df_i$ = document frequency of term $i$
= number of documents containing term $i$
$idf_i$ = inverse document frequency of term $i$
= $\log_2 (N / df_i)$
($N$: total number of documents)
An indication of a term's discrimination power.
Log used to dampen the effect relative to tf.
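A one-line sketch of the idf formula follows; the function name `idf` and the sample counts are assumptions for illustration.

```python
from math import log2

def idf(n_docs, df):
    """Inverse document frequency: log2(N / df)."""
    return log2(n_docs / df)

rare = idf(8, 2)      # term in 2 of 8 documents
common = idf(8, 8)    # term in every document
```

A term that appears in every document gets an idf of zero, which matches the intuition that it cannot discriminate between documents.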

TF-IDF Weighting
A typical combined term importance indicator is tf-idf
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2 (N / df_i)$
A term occurring frequently in the document but
rarely in the rest of the collection is given high weight.
Many other ways of determining term weights have
been proposed.
Experimentally, tf-idf has been found to work well.

Similarity Measure
A similarity measure is a function that computes the
degree of similarity between two vectors.

Similarity measure between the query and each
document can help:
to rank the retrieved documents in the order of
presumed relevance.
The Vector-Space Model
Assume t distinct terms remain after preprocessing; call them
index terms or the vocabulary.
These orthogonal terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a real-valued
weight, wij.
Both documents and queries are expressed as
t-dimensional vectors:
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
Graphic Representation
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors on the axes T1, T2, T3.]
Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle?
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the weight of a
term in the document.

      T1     T2     ...    Tt
D1    w11    w21    ...    wt1
D2    w12    w22    ...    wt2
:     :      :             :
:     :      :             :
Dn    w1n    w2n    ...    wtn
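A term-document matrix like this can be sketched as a list of rows, one per document, with a zero wherever a term does not occur. The function name `term_document_matrix` and the sample weights are assumptions made for the example.

```python
def term_document_matrix(doc_weights, vocabulary):
    """Build one row per document, one column per vocabulary term.

    `doc_weights` is a list of {term: weight} dictionaries; missing terms get 0.0.
    """
    return [[weights.get(term, 0.0) for term in vocabulary]
            for weights in doc_weights]

vocabulary = ["T1", "T2", "T3"]
matrix = term_document_matrix(
    [{"T1": 2, "T2": 3, "T3": 5},   # D1
     {"T1": 3, "T2": 7, "T3": 1},   # D2
     {"T3": 2}],                    # a sparse third row
    vocabulary,
)
```

In a realistic collection most entries are zero, which is why practical systems store postings lists rather than the full matrix.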

TF-IDF Weighting
A typical combined term importance
indicator is tf-idf weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2 (N / df_i)$
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
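The arithmetic in this example can be checked with a few lines of Python. The variable names are assumptions; the frequencies, document frequencies, and collection size are the ones given above.

```python
from math import log2

N = 10_000                              # documents in the collection
freq = {"A": 3, "B": 2, "C": 1}         # term frequencies in the document
df = {"A": 50, "B": 1300, "C": 250}     # document frequencies

max_freq = max(freq.values())
weights = {t: (freq[t] / max_freq) * log2(N / df[t]) for t in freq}
rounded = {t: round(w, 1) for t, w in weights.items()}
# -> {'A': 7.6, 'B': 2.0, 'C': 1.8}
```

Term A ends up with by far the highest weight because it is both the most frequent term in the document and the rarest in the collection.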

*Not part of the course

Query Vector
Query vector is typically treated as a document and
also tf-idf weighted.
Alternative is for the user to supply weights for the
given query terms.

Similarity Measure - Inner Product
Similarity between vectors for the document $d_j$ and query $q$ can
be computed as the vector inner product (a.k.a. dot product):

$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} w_{iq}$

where $w_{ij}$ is the weight of term $i$ in document $j$ and $w_{iq}$ is the weight of
term $i$ in the query
Inner Product -- Examples
Binary:
Vocabulary: retrieval, database, architecture, computer, text, management, information
D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
Size of vector = size of vocabulary = 7
0 means corresponding term not found in document or query
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
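The binary and weighted examples can be reproduced with a short function. The name `inner_product` is an assumption; the vectors are the ones from the slide.

```python
def inner_product(d, q):
    """Sum of the products of matching term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example over the 7-term vocabulary.
binary_sim = inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1])

# Weighted example.
q = [0, 0, 2]
sim_d1 = inner_product([2, 3, 5], q)
sim_d2 = inner_product([3, 7, 1], q)
```

For binary vectors the inner product is simply the number of query terms that also occur in the document.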

Cosine Similarity Measure
Cosine similarity measures the cosine of
the angle between two vectors.
Inner product normalized by the vector
lengths.

$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$

[Figure: D1, D2, and Q drawn as vectors on the axes t1, t2, t3.]
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
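The two cosine values can be verified with the sketch below. The function name `cos_sim` and the decision to return 0.0 for a zero-length vector are assumptions made for the example.

```python
from math import sqrt

def cos_sim(d, q):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0  # assumption: empty vectors score 0

q = [0, 0, 2]
cos_d1 = round(cos_sim([2, 3, 5], q), 2)   # -> 0.81
cos_d2 = round(cos_sim([3, 7, 1], q), 2)   # -> 0.13
```

Because of the length normalization, doubling every weight in a document leaves its cosine score unchanged.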

Naïve Implementation (Pseudo code)
1. Convert all documents in collection D to tf-idf weighted
vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
   Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.

Time complexity: O(|V||D|). Bad for large V & D!

|V| = 10,000; |D| = 100,000; |V||D| = 1,000,000,000
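The scoring, sorting, and presentation steps of the pseudo code can be sketched in Python as follows. The sketch assumes the documents and the query have already been converted to weighted vectors of equal length; the names `cos_sim`, `rank`, and `top_k` are hypothetical.

```python
from math import sqrt

def cos_sim(d, q):
    """Cosine similarity between two dense weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

def rank(doc_vectors, query_vector, top_k=10):
    """Score every document against the query and return the best ids."""
    # One O(|V|) similarity per document gives O(|V||D|) work overall.
    scores = [(cos_sim(d, query_vector), doc_id)
              for doc_id, d in doc_vectors.items()]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [doc_id for _, doc_id in scores[:top_k]]

ranking = rank({"D2": [3, 7, 1], "D1": [2, 3, 5]}, [0, 0, 2])
# -> ['D1', 'D2']
```

The loop touches every dimension of every document, which is the cost the slide warns about; an inverted index avoids this by visiting only documents that share a term with the query.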

Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious
weaknesses.
Problems with Vector Space Model
Missing semantic information (e.g. word sense).
Missing syntactic information (e.g. phrase structure, word
order, proximity information).
Assumption of term independence (e.g. ignores synonymy).
Given a two-term query A B, may prefer a document
containing A frequently but not B, over a document that
contains both A and B, but both less frequently.
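The last point can be illustrated with a toy calculation using the inner product as the similarity measure. The weights below are hypothetical and chosen only to make the effect visible.

```python
def inner_product(d, q):
    """Sum of the products of matching term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

query = [1, 1]      # two-term query "A B"
only_a = [10, 0]    # document with A very often, B never
both = [2, 2]       # document with both terms, less often

score_only_a = inner_product(only_a, query)   # 10
score_both = inner_product(both, query)       # 4
```

The document that never mentions B still wins under the inner product. With cosine normalization the ordering in this particular toy case would be reversed, so how strongly the problem shows up depends on the weighting and the similarity measure used.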
