
IR Models

J. H. Wang
Mar. 11, 2008
The Retrieval Process
(Figure: the retrieval process. The user need enters through the user interface; text operations produce the logical views of the documents and of the query; query operations generate the system query while the indexing module, through the DB manager, builds the inverted file; searching the index returns retrieved documents, which are ranked and presented to the user; user feedback may be used to refine the query. The documents themselves reside in the text database.)
Introduction
• Traditional information retrieval systems
usually adopt index terms to index and retrieve
documents
– An index term is a keyword (or group of related
words) which has some meaning of its own (usually a
noun)
• Advantages
– Simple
– The semantics of the documents and of the user
information need can be naturally expressed through
sets of index terms
(Figure: documents are represented by index terms and the information need by a query; retrieval matches the query against the document representations and ranks the results.)
IR Models

Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A Taxonomy of Information Retrieval Models

(Figure: taxonomy of IR models, organized by user task.)
• Retrieval: Ad hoc, Filtering
  – Classic models: Boolean, Vector, Probabilistic
  – Set theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Inference Network, Belief Network
  – Structured models: Non-overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
             Index Terms               Full Text                 Full Text + Structure
Retrieval    Classic                   Classic                   Structured
             Set Theoretic             Set Theoretic
             Algebraic                 Algebraic
             Probabilistic             Probabilistic
Browsing     Flat                      Flat                      Structure Guided
                                       Hypertext                 Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval: Ad hoc and Filtering
• Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system
• Routing (Filtering): The queries remain relatively static while new documents come into the system
Retrieval: Ad Hoc x Filtering
• Ad hoc retrieval:
(Figure: queries Q1–Q5 are issued against a collection of roughly "fixed size".)
Retrieval: Ad Hoc x Filtering
• Filtering:
(Figure: a stream of incoming documents is matched against stored user profiles; each user receives the documents filtered for his or her profile, e.g., docs for User 1 and docs filtered for User 2.)
A Formal Characterization of IR Models
• D: A set composed of logical views (or representations) of the documents in the collection
• Q: A set composed of logical views (or representations) of the user information needs (queries)
• F: A framework for modeling document representations, queries, and their relationships
• R(qi, dj): A ranking function which defines an ordering among the documents with regard to the query qi
Definition
• ki: A generic index term
• K: The set of all index terms {k1, …, kt}
• wi,j: A weight associated with index term ki of a document dj
• gi: A function that returns the weight associated with ki in any t-dimensional vector, i.e., gi(dj) = wi,j
Classic IR Model
• Basic concepts: Each document is described by a set of representative keywords called index terms
• Numerical weights are assigned to index terms to reflect how relevant each term is to a document
• Three classic models: Boolean, vector, probabilistic
Boolean Model
• Binary decision criterion
– Either relevant or nonrelevant (no partial match)
• Data retrieval model
• Advantage
– Clean formalism, simplicity
• Disadvantage
– It is not simple to translate an information need into a
Boolean expression
– Exact matching may lead to retrieval of too few or too
many documents
Example
(Figure: Venn diagram over the index terms Ka, Kb, Kc; the shaded regions correspond to the conjunctive components (1,1,1), (1,1,0), and (1,0,0).)
• The query q = qa ∧ (qb ∨ ¬qc) can be represented as a disjunction of conjunctive vectors (in DNF)
  – q = qa ∧ (qb ∨ ¬qc) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
• Formal definition
  – For the Boolean model, the index term weights are all binary, i.e., wi,j ∈ {0,1}
  – A query is a conventional Boolean expression, which can be transformed into a disjunctive normal form qdnf (qcc: a conjunctive component)

  $$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \in q_{dnf} \text{ such that } \forall k_i,\; w_{i,j} = g_i(q_{cc}) \\ 0 & \text{otherwise} \end{cases}$$
Vector Model [Salton, 1968]
• Assign non-binary weights to index terms in queries and in documents => TF×IDF
• Compute the similarity between documents and query => sim(dj, q)
• More precise than the Boolean model
The IR Problem ≡ A Clustering
Problem
• We think of the documents as a collection C of
objects and think of the user query as a
specification of a set A of objects

• Intra-cluster similarity
– What are the features which better describe the
objects in the set A?
• Inter-cluster similarity
– What are the features which better distinguish the
objects in the set A from the remaining objects in the collection C?
Idea for TF×IDF
• TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj
  – term frequency (the tf factor) provides one measure of how well that term describes the document contents
• IDF: inter-cluster similarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
  – inverse document frequency (the idf factor)
Vector Model (1/4)
• Index terms are assigned positive and non-binary weights
• The index terms in the query are also weighted

  $$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \qquad\qquad \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

• Term weights are used to compute the degree of similarity between documents and the user query
• Then, retrieved documents are sorted in decreasing order of similarity
Vector Model (2/4)
• Degree of similarity

  $$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Figure 2.4 The cosine of θ is adopted as sim(dj, q)
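A small sketch of the cosine formula above in plain Python (no external libraries; the names are illustrative):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between a document weight vector d and a query
    weight vector q, as in Figure 2.4."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(wd * wd for wd in d))
    norm_q = math.sqrt(sum(wq * wq for wq in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([2, 0, 1], [1, 2, 3]))  # ~0.60
```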
Vector Model (3/4)
• Definition
  – normalized term frequency:
    $$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
  – inverse document frequency:
    $$idf_i = \log \frac{N}{n_i}$$
  – term-weighting scheme:
    $$w_{i,j} = f_{i,j} \times idf_i$$
  – query-term weights:
    $$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$$
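A minimal sketch of these weighting formulas, assuming raw term frequencies and document frequencies are already available (the helper names and the sample numbers are illustrative):

```python
import math

def doc_weights(freqs, df, N):
    """tf-idf weights w_{i,j} = f_{i,j} * idf_i for one document, where freqs
    holds the raw frequencies freq_{i,j}, df[i] is the document frequency n_i,
    and N is the number of documents in the collection."""
    max_freq = max(freqs) or 1
    return [(f / max_freq) * math.log(N / df[i]) if df[i] else 0.0
            for i, f in enumerate(freqs)]

def query_weights(freqs, df, N):
    """Query-term weights w_{i,q} = (0.5 + 0.5*freq/max_freq) * log(N/n_i)."""
    max_freq = max(freqs) or 1
    return [(0.5 + 0.5 * f / max_freq) * math.log(N / df[i]) if df[i] else 0.0
            for i, f in enumerate(freqs)]

# Hypothetical collection of N=100 documents, 3 index terms
print(doc_weights([2, 0, 1], df=[10, 50, 25], N=100))
print(query_weights([1, 2, 3], df=[10, 50, 25], N=100))
```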
Vector Model (4/4)
• Advantages
– Its term-weighting scheme improves retrieval
performance
– Its partial matching strategy allows retrieval of
documents that approximate the query conditions
– Its cosine ranking formula sorts the documents
according to their degree of similarity to the query
• Disadvantage
– The assumption of mutual independence between
index terms
The Vector Model: Example I
(Figure: documents d1–d7 represented as vectors in the space of index terms k1, k2, k3.)

       k1  k2  k3   q · dj
  d1    1   0   1      2
  d2    1   0   0      1
  d3    0   1   1      2
  d4    1   0   0      1
  d5    1   1   1      3
  d6    1   1   0      2
  d7    0   1   0      1
  q     1   1   1
The Vector Model: Example II
(Figure: documents d1–d7 represented as vectors in the space of index terms k1, k2, k3.)

       k1  k2  k3   q · dj
  d1    1   0   1      4
  d2    1   0   0      1
  d3    0   1   1      5
  d4    1   0   0      1
  d5    1   1   1      6
  d6    1   1   0      3
  d7    0   1   0      2
  q     1   2   3
The Vector Model: Example III
(Figure: documents d1–d7 represented as vectors in the space of index terms k1, k2, k3.)

       k1  k2  k3   q · dj
  d1    2   0   1      5
  d2    1   0   0      1
  d3    0   1   3     11
  d4    2   0   0      2
  d5    1   2   4     17
  d6    1   2   0      5
  d7    0   5   0     10
  q     1   2   3
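The q · dj column of Example III can be reproduced in a few lines (a sketch; as in the table, only the unnormalized dot product is computed here, not the full cosine):

```python
docs = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
q = [1, 2, 3]

# Unnormalized dot products q . dj, as in the table for Example III
scores = {name: sum(w * wq for w, wq in zip(d, q)) for name, d in docs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, s)  # d5 17, d3 11, d7 10, d1 5, d6 5, d4 2, d2 1
```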
Probabilistic Model (1/6)
• Introduced by Robertson and Sparck Jones, 1976
– Binary independence retrieval (BIR) model
• Idea: Given a user query q, and the ideal answer set R
of the relevant documents, the problem is to specify the
properties for this set
– Assumption (probabilistic principle): the probability of relevance
depends on the query and document representations only; ideal answer
set R should maximize the overall probability of relevance
– The probabilistic model tries to estimate the probability that the
user will find the document dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)
Probabilistic Model (2/6)
• Definition
  – All index term weights are binary, i.e., wi,j ∈ {0,1}
  – Let R be the set of documents known to be relevant to query q
  – Let R̄ be the complement of R (the set of nonrelevant documents)
  – Let P(R | dj) be the probability that the document dj is relevant to the query q
  – Let P(R̄ | dj) be the probability that the document dj is nonrelevant to the query q
Probabilistic Model (3/6)
• The similarity sim(dj, q) of the document dj to the query q is defined as the ratio

  $$sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}$$

• Using Bayes' rule,

  $$sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \times P(R)}{P(\vec{d_j} \mid \bar{R}) \times P(\bar{R})}$$

  – P(R) stands for the probability that a document randomly selected from the entire collection is relevant
  – P(dj | R) stands for the probability of randomly selecting the document dj from the set R of relevant documents
Probabilistic Model (4/6)

  $$sim(d_j, q) \sim \log \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})} + \log \frac{P(R)}{P(\bar{R})}$$

• Assuming independence of index terms, and writing dj = (d1, d2, …, dt) for its vector of binary term weights,

  $$P(\vec{d_j} \mid R) = \prod_{i=1}^{t} P(k_i = d_i \mid R) \qquad\qquad P(\vec{d_j} \mid \bar{R}) = \prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})$$

• Since log P(R)/P(R̄) is constant for all documents, it can be dropped:

  $$sim(d_j, q) \sim \log \frac{\prod_{i=1}^{t} P(k_i = d_i \mid R)}{\prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})}$$
Probabilistic Model (5/6)
• P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R
• P(k̄i | R) stands for the probability that the index term ki is not present in a document randomly selected from the set R
Probabilistic Model (6/6)
• Separating the terms that occur in dj from those that do not,

  $$sim(d_j, q) \sim \log \frac{\prod_{g_i(d_j)=1} P(k_i \mid R) \times \prod_{g_i(d_j)=0} P(\bar{k_i} \mid R)}{\prod_{g_i(d_j)=1} P(k_i \mid \bar{R}) \times \prod_{g_i(d_j)=0} P(\bar{k_i} \mid \bar{R})}$$

• Using P(ki | R) + P(k̄i | R) = 1 and dropping factors that are constant for all documents,

  $$sim(d_j, q) \sim \sum_{i=1}^{t} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$

• Weighting by the query and document term weights,

  $$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
Estimation of Term Relevance
• In the very beginning:

  $$P(k_i \mid R) = 0.5 \qquad\qquad P(k_i \mid \bar{R}) = \frac{df_i}{N}$$

• Next, the ranking can be improved as follows. Let V be a subset of the documents initially retrieved (e.g., the top-ranked ones), and let Vi be the subset of V containing ki:

  $$P(k_i \mid R) = \frac{V_i}{V} \qquad\qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V}$$

• For small values of V and Vi, add a small adjustment factor:

  $$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1} \qquad\qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$$

  or, alternatively,

  $$P(k_i \mid R) = \frac{V_i + V_i/V}{V + 1} \qquad\qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + V_i/V}{N - V + 1}$$
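A hedged sketch of this estimation procedure together with the ranking sum from the previous slide (binary document vectors; all names and the toy statistics are illustrative):

```python
import math

def bir_term_weight(p_rel, p_nonrel):
    """Log-odds weight of a single term, as in the sim(dj, q) formula above."""
    return (math.log(p_rel / (1.0 - p_rel)) +
            math.log((1.0 - p_nonrel) / p_nonrel))

def bir_rank(docs, query_terms, df, N, V_docs=None):
    """Score binary document vectors against a set of query term indices.
    Initially P(ki|R) = 0.5 and P(ki|R_bar) = df_i / N; if a feedback set
    V_docs is given, re-estimate with the smoothed (Vi + 0.5) / (V + 1) forms."""
    scores = []
    for d in docs:
        s = 0.0
        for i in query_terms:
            if V_docs is None:
                p_rel, p_nonrel = 0.5, df[i] / N
            else:
                V = len(V_docs)
                Vi = sum(v[i] for v in V_docs)
                p_rel = (Vi + 0.5) / (V + 1)
                p_nonrel = (df[i] - Vi + 0.5) / (N - V + 1)
            if d[i]:  # only terms present in both query and document contribute
                s += bir_term_weight(p_rel, p_nonrel)
        scores.append(s)
    return scores

# docs is a small candidate set; df and N describe a hypothetical full collection
docs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
print(bir_rank(docs, query_terms=[0, 2], df=[10, 40, 20], N=100))
# Second pass: pretend the two top-scored documents form the feedback set V
print(bir_rank(docs, query_terms=[0, 2], df=[10, 40, 20], N=100,
               V_docs=[[1, 0, 1], [1, 1, 1]]))
```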
• Advantage
– Documents are ranked in decreasing order of
their probability of being relevant
• Disadvantage
– The need to guess the initial relevant and
nonrelevant sets
– Term frequency is not considered
– Independence assumption for index terms
Brief Comparison of Classic Models
• Boolean model is the weakest
– Not able to recognize partial matches
• Controversy between probabilistic and
vector models
– The vector model is expected to outperform
the probabilistic model with general
collections
Alternative Set Theoretic
Models
• Fuzzy Set Model
• Extended Boolean Model
Fuzzy Theory
• A fuzzy subset A of a universe U is characterized by a membership function μA: U → [0,1] which associates with each element u ∈ U a number μA(u)
• Let A and B be two fuzzy subsets of U. Then

  $$\mu_{\bar{A}} = 1 - \mu_A \qquad \mu_{A \cup B} = \max(\mu_A, \mu_B) \qquad \mu_{A \cap B} = \min(\mu_A, \mu_B)$$
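A tiny sketch of these definitions over a shared universe of elements (the membership values are made up for illustration):

```python
# Fuzzy complement, union, and intersection over a shared universe of elements.
A = {"x": 0.75, "y": 0.25, "z": 0.0}
B = {"x": 0.5, "y": 1.0, "z": 0.25}

complement_A = {u: 1 - m for u, m in A.items()}
union = {u: max(A[u], B[u]) for u in A}
intersection = {u: min(A[u], B[u]) for u in A}

print(complement_A)   # {'x': 0.25, 'y': 0.75, 'z': 1.0}
print(union)          # {'x': 0.75, 'y': 1.0, 'z': 0.25}
print(intersection)   # {'x': 0.5, 'y': 0.25, 'z': 0.0}
```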
Fuzzy Information Retrieval
• Use a term-term correlation matrix

  $$c_{u,v} = \frac{df_{u,v}}{df_u + df_v - df_{u,v}}$$

• Define a fuzzy set associated to each index term ki; the membership of document dj is

  $$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

  – If some term kl in dj is strongly related to ki, i.e., ci,l ≈ 1, then μi(dj) ≈ 1
  – If every term kl in dj is only loosely related to ki, i.e., ci,l ≈ 0, then μi(dj) ≈ 0
Example
(Figure: Venn diagram over the index terms Ka, Kb, Kc showing the three conjunctive components cc1, cc2, cc3 of the query.)
• Query q = ka ∧ (kb ∨ ¬kc) in disjunctive normal form:

  $$q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \bar{k_c}) \vee (k_a \wedge \bar{k_b} \wedge \bar{k_c})$$

• The query membership of document dj combines the three conjunctive components cc1, cc2, cc3:

  $$\mu_q(d_j) = \mu_{cc_1 \vee cc_2 \vee cc_3}(d_j) = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i}(d_j))$$
  $$= 1 - (1 - \mu_{a,b,c}(d_j)) \times (1 - \mu_{a,b,\bar{c}}(d_j)) \times (1 - \mu_{a,\bar{b},\bar{c}}(d_j))$$

  where

  $$\mu_{a,b,c}(d_j) = \mu_a(d_j)\,\mu_b(d_j)\,\mu_c(d_j) = \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}$$
  $$\mu_{a,b,\bar{c}}(d_j) = \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})$$
  $$\mu_{a,\bar{b},\bar{c}}(d_j) = \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j})$$
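A short sketch of evaluating this query membership with algebraic sums and products (the term memberships μa,j, μb,j, μc,j passed in are made-up values):

```python
def mu_q(mu_a, mu_b, mu_c):
    """Membership of a document in the fuzzy set of q = ka AND (kb OR NOT kc),
    computed over its three conjunctive components with algebraic sum/product."""
    cc1 = mu_a * mu_b * mu_c                # ka AND kb AND kc
    cc2 = mu_a * mu_b * (1 - mu_c)          # ka AND kb AND NOT kc
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # ka AND NOT kb AND NOT kc
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(mu_q(0.9, 0.8, 0.1))  # high membership: ka and kb strongly present
print(mu_q(0.1, 0.8, 0.9))  # low membership: ka nearly absent
```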
Algebraic Sum and Product
• The degree of membership in a disjunctive
fuzzy set is computed using an algebraic
sum, instead of max function
• The degree of membership in a
conjunctive fuzzy set is computed using
an algebraic product, instead of min
function
• Smoother than the max and min functions
Alternative Algebraic Models
• Generalized Vector Space Model
• Latent Semantic Indexing Model
• Neural Network Model
Sparse Matrix Problem
• Consider a term-document matrix of dimensions 1M × 1M
  – Most of the entries will be 0 → sparse matrix
  – A waste of storage and computation
  – How to reduce the dimensions?
Latent Semantic Indexing (1/5)
• Let M = (Mij) be a term-document association matrix with t rows and N columns
• Latent semantic indexing decomposes M using the singular value decomposition

  $$M = K S D^t$$

  – K is the matrix of eigenvectors derived from the term-to-term correlation matrix (MMt)
  – D is the matrix of eigenvectors derived from the document-to-document matrix (MtM); Dt is its transpose
  – S is an r×r diagonal matrix of singular values, where r = min(t, N) is the rank of M
Latent Semantic Indexing (2/5)
• Consider now only the s largest singular values of S, and their corresponding columns in K and Dt
  – (The remaining singular values of S are deleted)

  $$M_s = K_s S_s D_s^t$$

• The resultant matrix Ms (of rank s) is closest to the original matrix M in the least-squares sense
• s < r is the dimensionality of a reduced concept space
Latent Semantic Indexing (3/5)
• The selection of s attempts to balance two
opposing effects
– s should be large enough to allow fitting all
the structure in the real data
– s should be small enough to allow filtering
out all the non-relevant representational
details
Latent Semantic Indexing (4/5)
• Consider the relationship between any two documents

  $$M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t) = D_s S_s K_s^t K_s S_s D_s^t = D_s S_s S_s D_s^t = (D_s S_s)(D_s S_s)^t$$
Latent Semantic Indexing (5/5)
• To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
  – Assume the query is modeled as the document with number k
  – Then the kth row in the matrix Ms^t Ms provides the ranks of all documents with respect to this query
Computing an Example
• Let (Mij) be given by the following matrix (reusing the weights and the query from Example III)

       k1  k2  k3   q · dj
  d1    2   0   1      5
  d2    1   0   0      1
  d3    0   1   3     11
  d4    2   0   0      2
  d5    1   2   4     17
  d6    1   2   0      5
  d7    0   5   0     10
  q     1   2   3

  – Compute the matrices (K), (S), and (D)t
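A possible way to carry out this computation with NumPy (a sketch; numpy.linalg.svd returns K, the singular values of S, and D^t directly):

```python
import numpy as np

# Term-document matrix M (t = 3 terms as rows, N = 7 documents d1..d7 as columns),
# using the weights from the table above.
M = np.array([
    [2, 1, 0, 2, 1, 1, 0],   # k1
    [0, 0, 1, 0, 2, 2, 5],   # k2
    [1, 0, 3, 0, 4, 0, 0],   # k3
], dtype=float)

# Full SVD: M = K S D^t
K, sing, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(sing)

# Keep only the s = 2 largest singular values -> reduced concept space
s = 2
Ms = K[:, :s] @ S[:s, :s] @ Dt[:s, :]

# Document-to-document relationships in the reduced space: Ms^t Ms = (Ds Ss)(Ds Ss)^t
doc_rel = Ms.T @ Ms
print(np.round(sing, 2))      # singular values of M
print(np.round(doc_rel, 2))   # 7x7 matrix of document relationships
```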


• Latent Semantic Indexing transforms the
occurrence matrix into a relation between
the terms and concepts, and a relation
between the concepts and the documents
– Indirect relation between terms and
documents through some hidden (or latent)
concepts
(Figure: a query about "Taiwan" may not directly match a document that mentions only "Taipei", but the two can be related indirectly through a shared latent concept.)
Alternative Probabilistic Models
• Bayesian Networks
• Inference Network Model
• Belief Network Model
Bayesian Network
• Let xi be a node in a Bayesian network G and Γxi be the set of parent nodes of xi
• The influence of Γxi on xi can be specified by any set of functions Fi(xi, Γxi) that satisfy

  $$\sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1 \qquad\qquad 0 \le F_i(x_i, \Gamma_{x_i}) \le 1$$

• Example (for a network in which x1 is the parent of x2 and x3, x2 and x3 are the parents of x4, and x3 is the parent of x5):

  $$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$
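A small sketch of evaluating this factorization for binary variables; only the factorization itself comes from the slide, while the conditional probability tables below are made-up values:

```python
# Hypothetical CPTs for binary variables x1..x5; each dictionary maps the
# conditioning values to P(variable = 1 | parents).
p_x1 = 0.3
p_x2 = {0: 0.2, 1: 0.7}                                       # P(x2=1 | x1)
p_x3 = {0: 0.5, 1: 0.9}                                       # P(x3=1 | x1)
p_x4 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.8}   # P(x4=1 | x2, x3)
p_x5 = {0: 0.3, 1: 0.6}                                       # P(x5=1 | x3)

def bern(p, value):
    """P(variable = value) for a Bernoulli variable with P(=1) = p."""
    return p if value == 1 else 1 - p

def joint(x1, x2, x3, x4, x5):
    """P(x1,...,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x1], x3) *
            bern(p_x4[(x2, x3)], x4) * bern(p_x5[x3], x5))

print(joint(1, 1, 1, 1, 0))   # probability of one full assignment
```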
Belief Network Model (1/6)
• The probability space
  – The set K = {k1, k2, …, kt} of all index terms is the universe. To each subset u of K is associated a vector k such that gi(k) = 1 ⇔ ki ∈ u
• Random variables
  – To each index term ki is associated a binary random variable
Belief Network Model (2/6)
• Concept space
  – A document dj is represented as a concept composed of the terms used to index dj
  – A user query q is also represented as a concept composed of the terms used to index q
  – Both user query and document are modeled as subsets of index terms
• Probability distribution P over K

  $$P(c) = \sum_{u} P(c \mid u) \times P(u) \qquad\qquad P(u) = \left(\frac{1}{2}\right)^t$$

  – P(c) is the degree of coverage of the space K by the concept c
Belief Network Model (3/6)
• A query q is modeled as a network node
– This random variable is set to 1 whenever q
completely covers the concept space K
– P(q) computes the degree of coverage of the space K
by q
• A document dj is modeled as a network node
– This random variable is 1 to indicate that dj
completely covers the concept space K
– P(dj) computes the degree of coverage of the space K
by dj
Belief Network Model (4/6)
Belief Network Model (5/6)
• Assumption
  – P(dj | q) is adopted as the rank of the document dj with respect to the query q

  $$P(d_j \mid q) = P(d_j \wedge q) / P(q) \sim P(d_j \wedge q) \sim \sum_{\forall u} P(d_j \wedge q \mid u) \times P(u)$$
  $$\sim \sum_{\forall u} P(d_j \mid u) \times P(q \mid u) \times P(u) \sim \sum_{\forall \vec{k}} P(d_j \mid \vec{k}) \times P(q \mid \vec{k}) \times P(\vec{k})$$
Belief Network Model (6/6)
• Specify the conditional probabilities as follows

  $$P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k_i} \wedge g_i(d_j) = 1 \\ 0 & \text{otherwise} \end{cases}$$

  $$P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k_i} \wedge g_i(q) = 1 \\ 0 & \text{otherwise} \end{cases}$$

• Thus, the belief network model can be tuned to subsume the vector model
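With these conditional probabilities, and P(k) constant, the ranking sum from slide (5/6) reduces to a cosine-like product of normalized weights; a sketch under that reading (names illustrative):

```python
import math

def belief_rank(d_weights, q_weights):
    """Rank of dj w.r.t. q in the belief network model: sum over index terms of
    P(dj|k) * P(q|k), with the conditional probabilities defined above and the
    constant factor P(k) dropped."""
    norm_d = math.sqrt(sum(w * w for w in d_weights))
    norm_q = math.sqrt(sum(w * w for w in q_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return sum((wd / norm_d) * (wq / norm_q)
               for wd, wq in zip(d_weights, q_weights)
               if wd > 0 and wq > 0)

# Same weights as vector-model Example III, document d5 and query q
print(belief_rank([1, 2, 4], [1, 2, 3]))  # equals the cosine of d5 and q
```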
Comparison
• Belief network model
  – It is based on a set-theoretic view
  – It provides a separation between the document and the query
  – It is able to reproduce any ranking strategy generated by the inference network model
• Inference network model
  – It takes a purely epistemological view, which is more difficult to grasp
