
IR Models

J. H. Wang
Mar. 11, 2008
The Retrieval Process
[Figure: the retrieval process. Labeled components: user need, text, text operations, logical view, query operations, DB manager module, user feedback, query, index (inverted file), searching, retrieved docs, ranking, text database, ranked docs.]
• Traditional information retrieval systems usually adopt index terms to index and retrieve documents
– An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun)
• Advantages
– Simple
– The semantics of the documents and of the user information need can be naturally expressed through sets of index terms
[Figure: docs, index terms, and the information need.]

IR Models

Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A Taxonomy of Information Retrieval Models
• User task: retrieval (ad hoc, filtering) or browsing
• Classic Models
– Boolean
– Vector
– Probabilistic
• Set Theoretic
– Fuzzy
– Extended Boolean
• Algebraic
– Generalized Vector
– Lat. Semantic Index
– Neural Networks
• Probabilistic
– Inference Network
– Belief Network
• Structured Models
– Non-overlapping lists
– Proximal Nodes
• Browsing
– Flat
– Structure Guided
User task    Index Terms      Full Text        Full Text + Structure
Retrieval    Classic          Classic          Structured
             Set Theoretic    Set Theoretic
             Algebraic        Algebraic
             Probabilistic    Probabilistic
Browsing     Flat             Flat             Structure Guided
                              Hypertext        Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval : Ad hoc and Filtering

• Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system
• Routing (Filtering): The queries remain relatively static while new documents come into the system
Retrieval: Ad Hoc x Filtering
• Ad hoc retrieval:
[Figure: ad hoc retrieval against a collection of “Fixed Size”.]

Retrieval: Ad Hoc x Filtering
• Filtering:
[Figure: a documents stream is compared against the profiles of User 1 and User 2, producing the docs filtered for each user.]
A Formal Characterization of IR
• D : A set composed of logical views (or representations) for the documents in the collection
• Q : A set composed of logical views (or representations) for the user information needs
• F : A framework for modeling document representations, queries, and their relationships
• R(qi, dj) : A ranking function which defines an ordering among the documents with regard to the query
• ki : A generic index term
• K : The set of all index terms {k1,…,kt}
• wi,j : A weight associated with index term ki of a document dj
• gi : A function that returns the weight associated with ki in any t-dimensional vector ( gi(dj)=wi,j )
Classic IR Model

• Basic concepts: Each document is described by a set of representative keywords called index terms
• Assign numerical weights to index terms to capture the distinct relevance of each term
• Three classic models: Boolean, vector, probabilistic
Boolean Model
• Binary decision criterion
– Either relevant or nonrelevant (no partial match)
• Data retrieval model
• Advantage
– Clean formalism, simplicity
• Disadvantage
– It is not simple to translate an information need into a
Boolean expression
– Exact matching may lead to retrieval of too few or too
many documents
Example
[Figure: Venn diagram of the index-term sets (Ka, Kb), with the component (1,0,0) marked.]
• Can be represented as a disjunction of conjunctive vectors (in DNF)
– q = ka∧(kb∨¬kc) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
• Formal definition
– For the Boolean model, the index term weights are all binary, i.e. wi,j ∈ {0,1}
– A query is a conventional Boolean expression, which can be transformed to a disjunctive normal form qdnf (qcc: conjunctive component)
$$ sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} : (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ w_{i,j} = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases} $$
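The matching rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `dnf_components` and `boolean_sim` are made up here, and document vectors are assumed to be restricted to the three index terms (ka, kb, kc) used by the query.

```python
from itertools import product

def dnf_components(query, num_terms):
    """Enumerate the conjunctive components (binary vectors) that satisfy `query`."""
    return {bits for bits in product((0, 1), repeat=num_terms) if query(*bits)}

def boolean_sim(doc_vector, components):
    """Return 1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in components else 0

# q = ka AND (kb OR NOT kc), evaluated over the index terms (ka, kb, kc)
q_dnf = dnf_components(lambda a, b, c: bool(a and (b or not c)), 3)
```

With this query, `q_dnf` holds the three components listed on the example slide, so a document indexed only by ka matches while a document indexed only by kb does not.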
Vector Model [Salton, 1968]

• Assign non-binary weights to index terms in queries and in documents
• Compute the similarity between documents and query => Sim(Dj, Q)
• More precise than Boolean model

The IR Problem ≡ A Clustering Problem
• We think of the documents as a collection C of
objects and think of the user query as a
specification of a set A of objects

• Intra-cluster similarity
– What are the features which better describe the
objects in the set A?
• Inter-cluster similarity
– What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?
Idea for TFxIDF

• TF: intra-clustering similarity is quantified by measuring the raw frequency of a term ki inside a document dj
– term frequency (the tf factor) provides one measure of how well that term describes the document contents
• IDF: inter-clustering similarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
– inverse document frequency (the idf factor)
Vector Model (1/4)
• Index terms are assigned positive and non-binary weights
• The index terms in the query are also weighted
$$ \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) $$
$$ \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) $$
• Term weights are used to compute the degree of similarity between documents and the user query
• Then, retrieved documents are sorted in decreasing order of similarity
Vector Model (2/4)
• Degree of similarity
$$ sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} $$
Figure 2.4 The cosine of θ is adopted as sim(dj,q)
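As a minimal sketch, the cosine formula can be written directly in Python. The function name `cosine_sim` and the convention of returning 0 for an all-zero vector are assumptions made for this illustration, not part of the slides.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q (equal length)."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_d * norm_q)
```

Because the weights are non-negative, the result lies between 0 and 1, and a document that shares only some of the query terms still receives a non-zero score (partial matching).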
Vector Model (3/4)
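• Definition
– normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
– inverse document frequency: $idf_i = \log \frac{N}{n_i}$
– term-weighting scheme: $w_{i,j} = f_{i,j} \times idf_i$
– query-term weights: $w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$
– Here N is the total number of documents in the collection and ni is the number of documents in which ki appears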
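The weighting definitions above might be computed as follows. This is a sketch under stated assumptions: the function names and the dictionary-based inputs are hypothetical, the document weight uses the normalized frequency, and the natural logarithm is used (a different base only rescales all weights).

```python
import math

def tf_idf_weights(freqs, doc_freq, num_docs):
    """Document weights: normalized term frequency times log(N / n_i).

    freqs:    {term: raw frequency of the term in this document}
    doc_freq: {term: n_i, the number of documents containing the term}
    """
    max_freq = max(freqs.values())
    return {
        term: (freq / max_freq) * math.log(num_docs / doc_freq[term])
        for term, freq in freqs.items()
    }

def query_weights(freqs, doc_freq, num_docs):
    """Query weights: (0.5 + 0.5 * normalized frequency) times log(N / n_i)."""
    max_freq = max(freqs.values())
    return {
        term: (0.5 + 0.5 * freq / max_freq) * math.log(num_docs / doc_freq[term])
        for term, freq in freqs.items()
    }
```

A term that appears in every document gets idf = log(1) = 0, so it contributes nothing to the ranking however often it occurs.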
Vector Model (4/4)
• Advantages
– Its term-weighting scheme improves retrieval performance
– Its partial matching strategy allows retrieval of
documents that approximate the query conditions
– Its cosine ranking formula sorts the documents
according to their degree of similarity to the query
• Disadvantage
– The assumption of mutual independence between
index terms
The Vector Model: Example I
[Figure: documents d1 to d7 placed in a Venn diagram of the index terms k1, k2, k3.]

      k1  k2  k3  q•dj
d1    1   0   1   2
d2    1   0   0   1
d3    0   1   1   2
d4    1   0   0   1
d5    1   1   1   3
d6    1   1   0   2
d7    0   1   0   1

q     1   1   1
The Vector Model: Example II
[Figure: documents d1 to d7 placed in a Venn diagram of the index terms k1, k2, k3.]

      k1  k2  k3  q•dj
d1    1   0   1   4
d2    1   0   0   1
d3    0   1   1   5
d4    1   0   0   1
d5    1   1   1   6
d6    1   1   0   3
d7    0   1   0   2

q     1   2   3
The Vector Model: Example III
[Figure: documents d1 to d7 placed in a Venn diagram of the index terms k1, k2, k3.]

      k1  k2  k3  q•dj
d1    2   0   1   5
d2    1   0   0   1
d3    0   1   3   11
d4    2   0   0   2
d5    1   2   4   17
d6    1   2   0   5
d7    0   5   0   10

q     1   2   3
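The q•dj column of Example III can be recomputed with a short script. The variable names are chosen for this illustration only, and the ranking here uses the raw inner product shown in the table, without cosine normalization.

```python
# Term weights (k1, k2, k3) for each document in Example III
docs = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (1, 2, 0),
    "d7": (0, 5, 0),
}
q = (1, 2, 3)

# Inner product q . dj for every document
scores = {name: sum(w * w_q for w, w_q in zip(vec, q)) for name, vec in docs.items()}

# Documents in decreasing order of score (ties keep their original order)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under this scoring d5 comes first, followed by d3 and d7, which matches the values in the table.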
Probabilistic Model (1/6)
• Introduced by Robertson and Sparck Jones, 1976
– Binary independence retrieval (BIR) model
• Idea: Given a user query q, and the ideal answer set R
of the relevant documents, the problem is to specify the
properties for this set
– Assumption (probabilistic principle): the probability of relevance
depends on the query and document representations only; ideal answer
set R should maximize the overall probability of relevance
– The probabilistic model tries to estimate the probability that the
user will find the document dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)
Probabilistic Model (2/6)
• Definition
– All index term weights are binary, i.e., wi,j ∈ {0,1}
– Let R be the set of documents known to be relevant to query q
– Let $\bar{R}$ be the complement of R
– Let $P(R \mid \vec{d}_j)$ be the probability that the document dj is relevant to the query q
– Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document dj is nonrelevant to query q
Probabilistic Model (3/6)
• The similarity sim(dj,q) of the document dj to the query q is defined as the ratio
$$ sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)} $$
• Using Bayes’ rule,
$$ sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})} $$
– $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant
– $P(\vec{d}_j \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents
Probabilistic Model (4/6)
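$$ sim(d_j, q) \approx \log \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} + \log \frac{P(R)}{P(\bar{R})} $$
• Assuming independence of index terms and given dj = (d1, d2, …, dt),
$$ P(\vec{d}_j \mid R) = \prod_{i=1}^{t} P(k_i = d_i \mid R) $$
$$ P(\vec{d}_j \mid \bar{R}) = \prod_{i=1}^{t} P(k_i = d_i \mid \bar{R}) $$
• The term $\log \frac{P(R)}{P(\bar{R})}$ is the same for every document, so it can be dropped for ranking purposes
$$ sim(d_j, q) \approx \log \frac{\prod_{i=1}^{t} P(k_i = d_i \mid R)}{\prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})} $$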
Probabilistic Model (5/6)

– $P(k_i \mid R)$ stands for the probability that the index term ki is present in a document randomly selected from the set R
– $P(\bar{k}_i \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R
Probabilistic Model (6/6)

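$$ sim(d_j, q) \approx \log \frac{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \times \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \times \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})} $$
$$ P(k_i \mid R) + P(\bar{k}_i \mid R) = 1 $$
Ignoring factors that are constant for all documents, the sum runs over the terms present in dj:
$$ sim(d_j, q) \approx \sum_{i:\, g_i(\vec{d}_j)=1} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$
Restricting the sum to the terms that also appear in the query, via the binary weights:
$$ sim(d_j, q) \approx \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$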
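The final ranking formula can be sketched as a small function. The name `bir_sim` and the list-based inputs are assumptions for this illustration; the probabilities must lie strictly between 0 and 1 for the logarithms to be defined.

```python
import math

def bir_sim(doc, query, p_rel, p_nonrel):
    """Binary independence score of a document for a query.

    doc, query: binary term vectors (w_ij and w_iq)
    p_rel[i]:    P(k_i | R), term present in a relevant document
    p_nonrel[i]: P(k_i | not R), term present in a nonrelevant document
    """
    score = 0.0
    for w_d, w_q, p, u in zip(doc, query, p_rel, p_nonrel):
        if w_d and w_q:  # only terms present in both document and query count
            score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score
```

For example, with P(ki|R) = 0.5 the first logarithm vanishes, so a matching term contributes only the second one, which is large when the term is rare among nonrelevant documents.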
Estimation of Term Relevance
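In the very beginning:
$$ P(k_i \mid R) = 0.5 \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N} $$
Next, the ranking can be improved as follows. Let V be a subset of the documents initially retrieved, and let Vi be the subset of V containing ki (V and Vi also denote the sizes of these sets):
$$ P(k_i \mid R) = \frac{V_i}{V} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V} $$
For small values of V and Vi:
$$ P(k_i \mid R) = \frac{V_i + 0.5}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1} $$
or alternatively
$$ P(k_i \mid R) = \frac{V_i + \frac{V_i}{V}}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + \frac{V_i}{V}}{N - V + 1} $$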
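The initial guess and the 0.5-adjusted re-estimate can be sketched as follows. The function names are hypothetical, and only the 0.5 adjustment is shown; the counts are passed in as plain integers.

```python
def initial_estimates(df_i, n_docs):
    """Starting point: P(k_i|R) = 0.5 and P(k_i|not R) = df_i / N."""
    return 0.5, df_i / n_docs

def smoothed_estimates(v_i, v, df_i, n_docs):
    """Re-estimate from the top-V retrieved documents, V_i of which contain k_i.

    The 0.5 and +1 adjustments keep the estimates away from 0 and 1
    when V and V_i are small.
    """
    p_rel = (v_i + 0.5) / (v + 1)
    p_nonrel = (df_i - v_i + 0.5) / (n_docs - v + 1)
    return p_rel, p_nonrel
```

Note that with V = 0 and Vi = 0 the adjusted estimate of P(ki|R) falls back to 0.5, the same value used in the very beginning.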
• Advantage
– Documents are ranked in decreasing order of
their probability of being relevant
• Disadvantage
– The need to guess the initial relevant and
nonrelevant sets
– Term frequency is not considered
– Independence assumption for index terms
Brief Comparison of Classic Models
• Boolean model is the weakest
– Not able to recognize partial matches
• Controversy between probabilistic and
vector models
– The vector model is expected to outperform the probabilistic model with general collections
Alternative Set Theoretic Models
• Fuzzy Set Model
• Extended Boolean Model
Fuzzy Theory
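• A fuzzy subset A of a universe U is characterized by a membership function μA: U→[0,1] which associates with each element u∈U a number μA(u) in the interval [0,1]
• Let A and B be two fuzzy subsets of U, and let $\bar{A}$ be the complement of A
$$ \mu_{\bar{A}}(u) = 1 - \mu_A(u) $$
$$ \mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u)) $$
$$ \mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u)) $$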
Fuzzy Information Retrieval
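• Using a term-term correlation matrix, where dfu and dfv are the numbers of documents containing ku and kv, and dfu,v is the number of documents containing both
$$ c_{u,v} = \frac{df_{u,v}}{df_u + df_v - df_{u,v}} $$
• Define a fuzzy set associated to each index term ki
$$ \mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l}) $$
– If some term kl of dj is strongly related to ki, that is ci,l ~ 1, then μi(dj) ~ 1
– If every term kl of dj is only loosely related to ki, that is ci,l ~ 0, then μi(dj) ~ 0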
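A minimal sketch of the two formulas above follows. The names `correlation` and `membership` are assumptions for this illustration, and the correlation lookup is passed in as a callable so that any storage scheme for the matrix can be used.

```python
def correlation(df_u, df_v, df_uv):
    """Term-term correlation c_{u,v} from document frequencies."""
    return df_uv / (df_u + df_v - df_uv)

def membership(term, doc_terms, corr):
    """Degree of membership of a document in the fuzzy set of `term`.

    doc_terms: the index terms k_l of the document
    corr:      callable returning c_{i,l} for a pair of terms
    """
    product = 1.0
    for other in doc_terms:
        product *= 1.0 - corr(term, other)
    return 1.0 - product
```

A single strongly correlated term drives the product towards 0 and the membership towards 1, which is the behaviour described in the two sub-bullets.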
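Example
[Figure: Venn diagram of the index-term sets (Ka, Kb), with the conjunctive component cc3 marked.]
$$ q = k_a \wedge (k_b \vee \neg k_c) $$
• Disjunctive Normal Form
$$ q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c) $$
$$ \mu_q(d_j) = \mu_{cc_1 \vee cc_2 \vee cc_3}(d_j) = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i}(d_j)) $$
$$ = 1 - (1 - \mu_{a,b,c}(d_j)) \times (1 - \mu_{a,b,\bar{c}}(d_j)) \times (1 - \mu_{a,\bar{b},\bar{c}}(d_j)) $$
$$ \mu_{a,b,c}(d_j) = \mu_a(d_j)\, \mu_b(d_j)\, \mu_c(d_j) = \mu_{a,j}\, \mu_{b,j}\, \mu_{c,j} $$
$$ \mu_{a,b,\bar{c}}(d_j) = \mu_{a,j}\, \mu_{b,j}\, (1 - \mu_{c,j}) $$
$$ \mu_{a,\bar{b},\bar{c}}(d_j) = \mu_{a,j}\, (1 - \mu_{b,j})\, (1 - \mu_{c,j}) $$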
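The membership of a document in the fuzzy set of this example query can be computed as below. The function name is made up for this sketch, and the three arguments are the memberships of the document in the fuzzy sets of ka, kb, and kc.

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in q = ka AND (kb OR NOT kc).

    Each conjunctive component uses the algebraic product; the
    disjunction of the three components uses the algebraic sum.
    """
    cc1 = mu_a * mu_b * mu_c                # ka, kb, kc
    cc2 = mu_a * mu_b * (1 - mu_c)          # ka, kb, not kc
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # ka, not kb, not kc
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)
```

With crisp memberships (0 or 1) the function reduces to the Boolean result: (1, 0, 0) satisfies the query while (1, 0, 1) does not.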
Algebraic Sum and Product
• The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function
• The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function
• Smoother than the max and min functions
Alternative Algebraic Models

• Generalized Vector Space Model
• Latent Semantic Model
• Neural Network Model
Sparse Matrix Problem
• Consider a term-doc matrix of dimensions 1M × 1M
– Most of the entries will be 0 (a sparse matrix)
– A waste of storage and computation
– How to reduce the dimensions?
Latent Semantic Indexing (1/5)
• Let M=(Mij) be a term-document association matrix with t rows and N columns
• Latent semantic indexing decomposes M using singular value decomposition
$$ M = K S D^t $$
– K is the matrix of eigenvectors derived from the term-to-term correlation matrix ($M M^t$)
– $D^t$ is the matrix of eigenvectors derived from the transpose of the document-to-document matrix ($M^t M$)
– S is an r×r diagonal matrix of singular values, where r=min(t,N) is the rank of M
Latent Semantic Indexing (2/5)
• Consider now only the s largest singular values of S, and their corresponding columns in K and rows in $D^t$
– (The remaining singular values of S are deleted)
$$ M_s = K_s S_s D_s^t $$
• The resultant matrix Ms (rank s) is closest to the original matrix M in the least square sense
• s<r is the dimensionality of a reduced concept space
Latent Semantic Indexing (3/5)
• The selection of s attempts to balance two opposing effects
– s should be large enough to allow fitting all the structure in the real data
– s should be small enough to allow filtering out all the non-relevant representational details
Latent Semantic Indexing (4/5)
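• Consider the relationship between any two documents
$$ \begin{aligned} M_s^t M_s &= (K_s S_s D_s^t)^t K_s S_s D_s^t \\ &= D_s S_s K_s^t K_s S_s D_s^t \\ &= D_s S_s S_s D_s^t \\ &= (D_s S_s)(D_s S_s)^t \end{aligned} $$
– The third line uses $K_s^t K_s = I$, since the columns of Ks are orthonormal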
Latent Semantic Indexing (5/5)
• To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
– Assume the query is modeled as the document with number k
– Then the kth row in the matrix $M_s^t M_s$ provides the ranks of all documents with respect to this query
Computing an Example
• Let (Mij) be given by the matrix

      k1  k2  k3  q•dj
d1    2   0   1   5
d2    1   0   0   1
d3    0   1   3   11
d4    2   0   0   2
d5    1   2   4   17
d6    1   2   0   5
d7    0   5   0   10

q     1   2   3

– Compute the matrices K, S, and $D^t$
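One way to carry out this computation is with NumPy's SVD routine. This is a sketch, not the lecture's own solution: it assumes NumPy is installed, stores M with terms as rows and documents as columns (the transpose of the table layout), and picks s = 2 arbitrarily for the reduced space.

```python
import numpy as np

# Term-document matrix M: t = 3 terms (rows) by N = 7 documents (columns)
M = np.array([
    [2, 1, 0, 2, 1, 1, 0],  # k1
    [0, 0, 1, 0, 2, 2, 5],  # k2
    [1, 0, 3, 0, 4, 0, 0],  # k3
], dtype=float)

# Thin SVD: M = K S D^t, singular values returned in decreasing order
K, singular_values, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(singular_values)

# Keep only the s largest singular values (s = 2 is an arbitrary choice here)
s = 2
M_s = K[:, :s] @ S[:s, :s] @ Dt[:s, :]

# Document-to-document relationships in the reduced concept space
doc_doc = M_s.T @ M_s
```

Row k of `doc_doc` then gives the scores of all documents with respect to document k, which is how a query modeled as a pseudo-document would be ranked.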

• Latent Semantic Indexing transforms the occurrence matrix into a relation between the terms and concepts, and a relation between the concepts and the documents
– Indirect relation between terms and documents through some hidden (or latent) concepts
[Figure: the terms “Taipei” and “Taiwan” are connected to a doc through (latent) concepts.]

Alternative Probabilistic Models
• Bayesian Networks
• Inference Network Model
• Belief Network Model
Bayesian Network
• Let xi be a node in a Bayesian network G and Γxi be the set of parent nodes of xi
• The influence of Γxi on xi can be specified by any set of functions Fi(xi, Γxi) that satisfy:
$$ \sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1 $$
$$ 0 \le F_i(x_i, \Gamma_{x_i}) \le 1 $$
• P(x1,x2,x3,x4,x5)=P(x1)P(x2|x1)P(x3|x1)P(x4|x2,x3)P(x5|x3)
Belief Network Model (1/6)
• The probability space
– The set K={k1, k2, …, kt} of all index terms is the universe. To each subset u ⊆ K is associated a vector $\vec{k}$ such that $g_i(\vec{k}) = 1 \Leftrightarrow k_i \in u$
• Random variables
– To each index term ki is associated a binary random variable
Belief Network Model (2/6)
• Concept space
– A document dj is represented as a concept
composed of the terms used to index dj
– A user query q is also represented as a concept
composed of the terms used to index q
– Both user query and document are modeled
as subsets of index terms
• Probability distribution P over K
– Degree of coverage of K by c:
$$ P(c) = \sum_{u} P(c \mid u) \times P(u) $$
$$ P(u) = \left(\frac{1}{2}\right)^t $$
Belief Network Model (3/6)
• A query q is modeled as a network node
– This random variable is set to 1 whenever q
completely covers the concept space K
– P(q) computes the degree of coverage of the space K
by q
• A document dj is modeled as a network node
– This random variable is 1 to indicate that dj
completely covers the concept space K
– P(dj) computes the degree of coverage of the space K
by dj
Belief Network Model (4/6)
Belief Network Model (5/6)
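• Assumption
– P(dj |q) is adopted as the rank of the document dj with respect to the query q
$$ \begin{aligned} P(d_j \mid q) &= P(d_j \wedge q) / P(q) \\ &\approx P(d_j \wedge q) \approx \sum_{\forall u} P(d_j \wedge q \mid u) \times P(u) \\ &\approx \sum_{\forall u} P(d_j \mid u) \times P(q \mid u) \times P(u) \\ &\approx \sum_{\forall \vec{k}} P(d_j \mid \vec{k}) \times P(q \mid \vec{k}) \times P(\vec{k}) \end{aligned} $$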
Belief Network Model (6/6)
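• Specify the conditional probabilities as follows
$$ P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{d}_j) = 1 \\ 0 & \text{otherwise} \end{cases} $$
$$ P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{otherwise} \end{cases} $$
• Thus, the belief network model can be tuned to subsume the vector model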
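The claim that this choice of conditional probabilities reproduces the vector model can be checked numerically. The sketch below makes several assumptions: the name `belief_rank` is hypothetical, the normalizers are taken to be the Euclidean norms of the weight vectors, and every state has the uniform prior (1/2)^t. Only the t single-term states contribute, since the conditionals are 0 everywhere else.

```python
import math

def belief_rank(d, q):
    """Sum over single-term states k_i of P(d_j|k) * P(q|k) * P(k)."""
    t = len(d)
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    p_k = 0.5 ** t  # assumed uniform prior over the 2^t states
    total = 0.0
    for w_d, w_q in zip(d, q):
        if w_d > 0 and w_q > 0:  # g_i(d_j) = 1 and g_i(q) = 1
            total += (w_d / norm_d) * (w_q / norm_q) * p_k
    return total
```

The result equals the cosine similarity multiplied by the constant (1/2)^t, so both models order the documents identically.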
• Belief network model
– Is based on a set-theoretic view
– It provides a separation between the
document and the query
– It is able to reproduce any ranking strategy
generated by the inference network model
• Inference network model
– Takes a purely epistemological view which is
more difficult to grasp
