
Lecture No:

Recommender Systems (overview),


Information Retrieval (IR)

Presented by: Dr Mustansar Ali Ghazanfar


Mustansar.ali@uettaxila.edu.pk
UET Taxila, Pakistan

08/11/13 1
Agenda

Recommender Systems (RS) Overview


Collaborative Filtering
Content-Based Filtering
Information Retrieval (IR)
Information Extraction Steps
Vector space model
Conclusion
Recommender Systems
Systems for recommending items (e.g. books,
movies, CDs, web pages, newsgroup messages) to
users based on examples of their preferences.
Many on-line stores provide recommendations (e.g.
Amazon, CDNow).
Recommenders have been shown to substantially
increase sales at on-line stores.
Two basic approaches to recommending:
Collaborative Filtering (a.k.a. social filtering)
Content-based
Book Recommender
[Figure: a machine-learning book recommender builds a user profile and recommends among titles such as Red Mars, Foundation, Jurassic Park, Lost World, 2001, 2010, Neuromancer, and Difference Engine.]
Personalization
Recommenders are instances of personalization
software.
Personalization concerns adapting to the individual
needs, interests, and preferences of each user.

Machine Learning and
Personalization
Machine Learning can allow learning a user
model or profile of a particular user based on:
Sample interaction
Rated examples
This model or profile can then be used to:
Recommend items
Filter information
Predict behavior

Collaborative Filtering
Maintain a database of many users' ratings of a
variety of items.
For a given user, find other similar users whose
ratings strongly correlate with the current user.
Recommend items rated highly by these similar users,
but not rated by the current user.
Almost all existing commercial recommenders use
this approach (e.g. Amazon).

Collaborative Filtering

[Figure: the user database holds each user's ratings for items A through Z; the active user's ratings are correlation-matched against the database to find similar users, from whom recommendations are extracted.]
Collaborative Filtering Method
Weight all users with respect to similarity with
the active user.
Select a subset of the users (neighbors) to use
as predictors.
Normalize ratings and compute a prediction
from a weighted combination of the selected
neighbors' ratings.
Present items with highest predicted ratings as
recommendations.
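The four steps above can be sketched as a minimal user-based CF predictor. This is only an illustration with made-up ratings (user and item names are hypothetical), using Pearson correlation as the similarity weight and mean-centering as the normalization:

```python
# Minimal user-based CF sketch: weight users by Pearson similarity to the
# active user, then predict as a similarity-weighted average of neighbors'
# mean-centered ratings. Toy data; not the lecture's exact dataset.
from math import sqrt

def pearson(u, v):
    """Pearson correlation over the items rated by both users."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = sqrt(sum((u[i] - mu) ** 2 for i in common)) * \
          sqrt(sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Weighted combination of the neighbors' normalized ratings."""
    mean_a = sum(active.values()) / len(active)
    num = den = 0.0
    for v in others:
        if item not in v:
            continue
        w = pearson(active, v)
        mean_v = sum(v.values()) / len(v)
        num += w * (v[item] - mean_v)
        den += abs(w)
    return mean_a + num / den if den else mean_a

active = {"A": 9, "B": 3, "C": 7}
others = [{"A": 9, "B": 3, "C": 8, "Z": 1},
          {"A": 5, "B": 3, "C": 9, "Z": 10}]
print(round(predict(active, others, "Z"), 2))
```

Items with the highest predicted ratings would then be presented as recommendations.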
Types of Collaborative Filtering (CF)
Memory based approaches (covered)
User-based CF
Item-based CF
Model-based approaches (will cover later)
Clustering

Advantages with Collaborative Filtering
Simple algorithm
Uses the wisdom of the crowd
Can recommend "out of the box" (serendipitous items)
High-quality recommendations (high accuracy)

Problems with Collaborative Filtering
Cold Start: There needs to be enough other users already in
the system to find a match.
Sparsity: If there are many items to be recommended, even if
there are many users, the user/ratings matrix is sparse, and it is
hard to find users that have rated the same items.
First Rater: Cannot recommend an item that has not been
previously rated.
New items
Popularity Bias: Cannot recommend items to someone with
unique tastes.
Gray-Sheep Users.

Content-Based Recommending
(CBR)
Recommendations are based on information about the
content of items rather than on other users' opinions.
Uses a machine learning algorithm to induce a profile
of the user's preferences from examples based on a
featural description of content.
Some previous applications:
Newsweeder (Lang, 1995)

Advantages of CBR
No cold-start or sparsity problems.
Able to recommend to users with unique tastes.
Able to recommend new and unpopular items
No first-rater problem.
Can provide explanations of recommended items by
listing content-features that caused an item to be
recommended.

Disadvantages of CBR
Requires content that can be encoded as meaningful
features (problem for multimedia data).
Users' tastes must be represented as a learnable
function of these content features.
Low-quality recommendations (low accuracy).
Unable to exploit quality judgments of other users.
Cannot recommend "out of the box".

Content-Based Recommending
vs. Collaborative Filtering
Content-Based Filtering:
Info about the content of items
Features & keywords about items
Train machine-learning classifiers and predict
Collaborative Filtering:
Info about similar users (or items)
Ratings & past history of users
Use (social) collective intelligence to predict
Presentations
Will start from Nov 18, 2013
Keep an eye on email
10% of grades
Who is crying?

*Not part of course (only for presentations)
LIBRA System [1]
[Figure: an information-extraction module turns Amazon pages into the LIBRA database; a machine-learning learner builds a user profile from rated examples, and a predictor produces a ranked list of recommendations.]
Combining Content and
Collaboration [1]
Content-based and collaborative methods have
complementary strengths and weaknesses.
Combine methods to obtain the best of both.
Various hybrid approaches:
Apply both methods and combine recommendations.
Use collaborative data as content.
Use content-based predictor as another collaborator.
Use content-based predictor to complete collaborative
data.

Content-Boosted Collaborative
Filtering [1]
[Figure: a web crawler augments the EachMovie ratings with IMDb content in a movie content database; a content-based predictor turns the sparse user-ratings matrix into a full user-ratings matrix, on which collaborative filtering produces recommendations for the active user.]
Content-Boosted CF - I
[Figure: the content-based predictor, trained on the user-rated items, maps each user-ratings vector (user-rated items plus unrated items) to a pseudo user-ratings vector in which unrated items carry predicted ratings.]
Content-Boosted CF - II

[Figure: the User Ratings Matrix is passed through the Content-Based Predictor to yield the Pseudo User Ratings Matrix.]
Compute the pseudo user-ratings matrix
The full matrix approximates the actual full user-ratings matrix
Perform CF using Pearson correlation between pseudo user-rating vectors
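The construction of the pseudo user-ratings matrix can be sketched in a few lines. This is a toy illustration (the function names, the rating scale, and the stand-in content predictor are all made up, not the paper's implementation):

```python
# Sketch of building the pseudo user-ratings matrix: every entry a user has
# not rated is filled in by a content-based predictor, so standard CF can
# then run on a dense matrix.
def pseudo_matrix(ratings, content_predict):
    """ratings: {user: {item: rating or None}}, None marking unrated items."""
    full = {}
    for user, row in ratings.items():
        full[user] = {item: (r if r is not None else content_predict(user, item))
                      for item, r in row.items()}
    return full

# Stand-in content predictor: always guesses the midpoint of a 1-5 scale.
sparse = {"u1": {"m1": 5, "m2": None}, "u2": {"m1": None, "m2": 2}}
dense = pseudo_matrix(sparse, lambda u, i: 3)
print(dense["u1"]["m2"])  # filled-in pseudo rating
```

Pearson correlation is then computed between these now-dense pseudo user-rating vectors, exactly as in pure CF.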
Experimental Method
Used subset of EachMovie (7,893 users; 299,997
ratings)
Test set: 10% of the users selected at random.
Test users that rated at least 40 movies.
Train on the remaining users.
Hold-out set: 25% items for each test user.
Predict rating of each item in the hold-out set.
Compared CBCF to other prediction approaches:
Pure CF
Pure Content-based
Naïve hybrid (averages CF and content-based predictions)
Metrics
Mean Absolute Error (MAE)
Compares numerical predictions with user ratings

ROC sensitivity
How well predictions help users select high-quality items
Ratings ≥ 4 considered good; < 4 considered bad

Results - I
[Bar chart: MAE for CF, Content, Naïve, and CBCF; y-axis from 0.9 to 1.06.]
CBCF is significantly better (4% improvement over CF)
Results - II
[Bar chart: ROC-4 sensitivity for CF, Content, Naïve, and CBCF; y-axis from 0.58 to 0.68.]
CBCF outperforms the rest (5% improvement over CF)
Information Retrieval (IR)

Information Retrieval (IR)
The indexing and retrieval of textual documents.
Web search is the "killer app" of IR.
Concerned firstly with retrieving relevant documents to
a query.
Concerned secondly with retrieving from large sets of
documents efficiently.

Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to
the query.

IR System

[Figure: a query string and a document corpus are fed to the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance
Relevance is a subjective judgment and may
include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her
intended use of the information (information
need).

Keyword Search
Simplest notion of relevance is that the query
string appears verbatim in the document.
Slightly less strict notion is that the words in
the query appear frequently in the document,
in any order (bag of words).

Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
restaurant vs. café
PRC vs. China
May retrieve irrelevant documents that
include ambiguous terms.
bat (baseball vs. mammal)
Apple (company vs. fruit)
bit (unit of data vs. act of eating)

Beyond Keywords
We will cover the basics of keyword-based IR,
focusing on its core capabilities.

Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect
feedback.
Taking into account the authority of the source
(the loophole that Google cracked!).

Web Search
Application of IR to HTML documents on the
World Wide Web.
Differences:
Must assemble document corpus by spidering
the web.
Documents change uncontrollably.
Can exploit the link structure of the web.

Web Search System
[Figure: a web spider assembles the document corpus; a query string is fed to the IR system, which returns a ranked list of pages (1. Page1, 2. Page2, 3. Page3, ...).]
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
Example
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software
for PC-Based Voice Mail systems. Experienced in C Programming. Must be
familiar with communicating with and controlling voice cards; preferable Dialogic,
however, experience with others such as Rhetorix and Natural Microsystems is
okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years.
Need to find a Senior level person who can come on board and pick up code
with very little training. Present Operating System is DOS. May go to OS-2 or
UNIX in future [1].

Steps involved [1]

Tokenization
Stop word removal
Stemming
Indexing
Weighting schemes
Similarity Measure (For Content-Based Filtering)
Dot Product
Vector Space Model

Tokenization
Sequence of discrete tokens (words).
Punctuation (e-mail), numbers (2009), and case
(MobileVCE vs. mobileVCE) can be a meaningful
part of a token.
Keywords
Simplest approach
Ignore all numbers
Ignore all punctuation
Ignore case

Stop Words
It is typical to exclude high-frequency words (e.g.
function words: a, the, in, to; pronouns: I, he,
she, it).
Language dependent.
Google stop words (
www.ranks.nl/resources/stopwords.html)
Customizable domain dependent lists

Stemming
Reduce tokens to root form of words to recognize
morphological variation.
computer, computational, computation are all reduced to
the same token: compute
Correct morphological analysis is language specific
and can be complex.
Stemming strips off known affixes (prefixes and
suffixes) in an iterative fashion.
Example: Porter stemmer algorithm
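The three preprocessing steps covered so far (tokenization, stop-word removal, stemming) can be sketched as a toy pipeline. The stop-word list and the suffix-stripping "stemmer" here are deliberately crude stand-ins, not the Porter algorithm:

```python
# Toy text-preprocessing pipeline: tokenize (ignoring numbers, punctuation,
# and case), drop stop words, then strip a few common suffixes. A real
# system would use a proper stemmer such as Porter's.
import re

STOP_WORDS = {"a", "the", "in", "to", "i", "he", "she", "it", "is", "of", "are"}

def tokenize(text):
    """Lowercase and keep only alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive stemming: strip one known suffix (not Porter's algorithm)."""
    for suffix in ("ational", "ation", "ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The computers are computing a computation."))
```

All three morphological variants collapse to the same root token, which is exactly what indexing needs.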

Indexing

Index file: each index term stores its document frequency (df) and a postings list of (Dj, tfj) pairs:
computer  df 3  ->  (D7, 4), ...
database  df 2  ->  (D1, 3), ...
science   df 4  ->  (D2, 4), ...
system    df 1  ->  (D5, 2)
Term Weights: Term Frequency

More frequent terms in a document are more important, i.e.
more indicative of the topic.
fij = frequency of term i in document j
Normalize term frequency (tf) by dividing by the frequency
of the most common term in the document:
tfij = fij / maxi{fij}

Term Weights:
Inverse Document Frequency
Terms that appear in many different documents are less
indicative of overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
An indication of a term's discrimination power.
Log used to dampen the effect relative to tf.

TF-IDF Weighting
A typical combined term importance indicator is tf-idf
weighting:
wij = tfij · idfi = tfij · log2(N / dfi)

A term occurring frequently in the document but rarely in
the rest of the collection is given high weight.
Many other ways of determining term weights have
been proposed.
Experimentally, tf-idf has been found to work well.

Similarity Measure
A similarity measure is a function that computes the
degree of similarity between two vectors.

Similarity measure between the query and each document
can help:
to rank the retrieved documents in the order of
presumed relevance.

The Vector-Space Model
Assume t distinct terms remain after preprocessing; call them
index terms or the vocabulary.
These orthogonal terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a real-valued
weight, wij.
Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, ..., wtj)

Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the 3-D space spanned by axes T1, T2, T3.]
Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the weight of a
term in the document.

      T1   T2  ...  Tt
D1    w11  w21 ...  wt1
D2    w12  w22 ...  wt2
:     :    :        :
Dn    w1n  w2n ...  wtn
TF-IDF Weighting
A typical combined term importance
indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)

Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
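The arithmetic on this slide can be reproduced in a few lines of Python (same frequencies, same collection of 10,000 documents):

```python
# Reproduces the slide's worked TF-IDF example: tf normalized by the most
# frequent term in the document, idf = log2(N / df).
from math import log2

freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection
N = 10_000
max_f = max(freqs.values())

weights = {}
for term in freqs:
    tf = freqs[term] / max_f
    idf = log2(N / dfs[term])
    weights[term] = round(tf * idf, 1)
    print(term, round(tf, 2), round(idf, 1), weights[term])
```

Running it yields the tf-idf values on the slide: 7.6 for A, 2.0 for B, 1.8 for C.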

Query Vector
Query vector is typically treated as a document and
also tf-idf weighted.
Alternative is for the user to supply weights for the
given query terms.

Similarity Measure - Inner Product
Similarity between vectors for the document di and query q can
be computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj · q = Σ (i = 1 to t) wij · wiq
where wij is the weight of term i in document j and wiq is the
weight of term i in the query.
Inner Product -- Examples
Binary:
(vocabulary: retrieval, database, architecture, computer, text, management, information)
D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
Size of vector = size of vocabulary = 7; 0 means the corresponding term is not found in the document or query.
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2

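Both examples above reduce to the same dot product, as a short sketch shows:

```python
# Inner-product similarity: the same dot product covers both the binary
# and the weighted examples from the slide.
def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary vectors over the 7-term vocabulary
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner(D, Q))  # 3

# Weighted vectors over T1..T3
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner(D1, Qw), inner(D2, Qw))  # 10 2
```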
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between
two vectors.
Inner product normalized by the vector lengths:
CosSim(dj, q) = (dj · q) / (|dj| |q|)
             = Σ (i = 1 to t) wij·wiq / ( sqrt(Σ wij²) · sqrt(Σ wiq²) )
[Figure: vectors D1, D2, and Q in term space; the angle θ1 between D1 and Q is smaller than the angle θ2 between D2 and Q.]
D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

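The cosine computation above, reproduced in code (same D1, D2, and Q as on the slide):

```python
# Cosine similarity: inner product normalized by the two vector lengths.
from math import sqrt

def cos_sim(d, q):
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
```

Unlike the raw inner product, the normalization keeps long documents from dominating purely by length.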
Naïve Implementation (Pseudocode)
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.

Time complexity: O(|V||D|). Bad for large V and D!
|V| = 10,000; |D| = 100,000; |V||D| = 1,000,000,000

08/11/13 59
Comments on Vector Space Models
Simple, mathematically based approach.
Considers both local (tf) and global (idf) word occurrence
frequencies.
Provides partial matching and ranked results.
Tends to work quite well in practice despite obvious
weaknesses.

Problems with Vector Space Model
Missing semantic information (e.g. word sense).
Missing syntactic information (e.g. phrase structure, word
order, proximity information).
Assumption of term independence (e.g. ignores synonymy).
Given a two-term query A B, may prefer a document
containing A frequently but not B, over a document that
contains both A and B, but both less frequently.

References
1. https://www.cs.utexas.edu/~ml/publications/area/119/learning_for_recommender_systems/abstracts/
2. http://www.cs.utexas.edu/~mooney/ir-course/

