J. H. Wang
Mar. 11, 2008
The Retrieval Process
[Figure: architecture of the retrieval process — the user interface accepts the user need; text operations produce the logical view of documents and queries; the query operations module builds the query while the indexing module (through the DB manager) builds an inverted-file index over the text database; searching retrieves documents from the index, ranking orders the retrieved docs, and user feedback can refine the query.]
Introduction
• Traditional information retrieval systems
usually adopt index terms to index and retrieve
documents
– An index term is a keyword (or group of related
words) which has some meaning of its own (usually a
noun)
• Advantages
– Simple
– The semantics of the documents and of the user
information need can be naturally expressed through
sets of index terms
[Figure: documents are represented by index terms; the user's information need is expressed as a query, and ranking is produced by matching the query against the document representations.]
IR Models
[Figure: taxonomy of IR models — browsing models (flat, structure guided, hypertext) and retrieval models (including the probabilistic family: inference network, belief network), over text indexed by index terms, full text, or full text + structure.]

Retrieval: Ad Hoc x Filtering
• Ad hoc:
[Figure: successive queries q2, q3, q4, q5, … posed against a relatively static, "fixed size" collection.]
• Filtering:
[Figure: a stream of incoming documents is matched against stored user profiles, e.g. the profile for User 2 selects the docs filtered for User 2.]
A Formal Characterization of IR
Models
• D : A set composed of logical views (or
representations) for the documents in the
collection
• Q : A set composed of logical views (or
representations) for the user information needs
(queries)
• F : A framework for modeling document
representations, queries, and their
relationships
• R(qi, dj) : A ranking function which defines an
ordering among the documents with regard to
the query
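The quadruple [D, Q, F, R] can be sketched in code. The following is a hypothetical toy model, not from the source: it assumes set-of-index-terms logical views, and the class and method names are illustrative only.

```python
from typing import Protocol

# A minimal sketch of the [D, Q, F, R] quadruple in Python; the names
# IRModel and TermOverlapModel are invented for illustration.
class IRModel(Protocol):
    def rank(self, query: str, doc: str) -> float:
        """R(qi, dj): a score inducing an ordering of documents for a query."""
        ...

class TermOverlapModel:
    """Toy framework F: logical views are sets of index terms."""
    def rank(self, query: str, doc: str) -> float:
        q_terms, d_terms = set(query.split()), set(doc.split())
        return len(q_terms & d_terms) / (len(q_terms) or 1)

model: IRModel = TermOverlapModel()
ranked = sorted(["dog ran", "cat sat", "cat dog sat"],
                key=lambda d: model.rank("cat sat", d), reverse=True)
```

Any concrete model in this chapter (Boolean, vector, probabilistic) is one choice of representation F and ranking function R in this scheme.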
Definition
• ki : A generic index term
• K : The set of all index terms {k1,…,kt}
• wi,j : A weight associated with index term
ki of a document dj
Example
[Figure: documents represented as binary term-weight vectors, e.g. (1,0,0), (1,1,0), (1,1,1).]
• Intra-cluster similarity
– What are the features which better describe the
objects in the set A?
• Inter-cluster similarity
– What are the features which better distinguish the
objects in the set A from the remaining objects in the
collection?
Idea for TFxIDF

    sim(dj, q) = cos θ = ( ∑_{i=1}^{t} wi,j × wi,q ) / ( √(∑_{i=1}^{t} wi,j²) × √(∑_{i=1}^{t} wi,q²) )

Figure 2.4 The cosine of θ is adopted as sim(dj,q)
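The cosine similarity can be computed directly from the two weight vectors. A minimal sketch, with made-up weight vectors:

```python
import math

def cosine_sim(d, q):
    """sim(dj, q) = (sum_i w_ij * w_iq) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # empty document or query: no similarity
    return dot / (norm_d * norm_q)

# Illustrative vectors over three index terms.
s = cosine_sim([1, 1, 0], [1, 1, 1])
```

Because the norms only scale the score, cosine similarity depends on the direction of the vectors, not their length, which is why document length is normalized away.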
Vector Model (3/4)
• Definition
– normalized frequency: fi,j = freqi,j / max_l freql,j
– inverse document frequency: idfi = log( N / ni )
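These two definitions can be sketched over a toy collection. The documents below are invented for illustration; the usual tf-idf weight wi,j = fi,j × idfi combines the two factors:

```python
import math

# Toy collection (assumed data, for illustration only).
docs = [["cat", "sat", "cat"], ["dog", "ran"], ["cat", "dog"]]
N = len(docs)

def norm_tf(term, doc):
    """fi,j = freqi,j / max_l freql,j (frequency normalized by the most frequent term)."""
    freq = doc.count(term)
    max_freq = max(doc.count(t) for t in set(doc))
    return freq / max_freq

def idf(term):
    """idfi = log(N / ni), where ni = number of docs containing the term."""
    n_i = sum(1 for d in docs if term in d)
    return math.log(N / n_i)  # undefined for terms absent from every doc

# wi,j = fi,j * idfi
w = norm_tf("cat", docs[0]) * idf("cat")
```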
The Vector Model: Example I
[Figure: documents d1–d7 plotted in the term space spanned by k1, k2, k3.]

     k1  k2  k3 | q • dj
d1    1   0   1 |   2
d2    1   0   0 |   1
d3    0   1   1 |   2
d4    1   0   0 |   1
d5    1   1   1 |   3
d6    1   1   0 |   2
d7    0   1   0 |   1
q     1   1   1 |
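The q • dj column above is a plain dot product of the query vector with each document vector, which the following sketch reproduces:

```python
# Term-weight vectors of Example I, one row per document (k1, k2, k3).
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
q = (1, 1, 1)

def dot(d, q):
    """q • dj = sum over terms of w_ij * w_iq."""
    return sum(wd * wq for wd, wq in zip(d, q))

scores = {name: dot(vec, q) for name, vec in docs.items()}
```

The same computation with the query and document weights of Examples II and III yields the q • dj columns of those tables.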
The Vector Model: Example II
[Figure: documents d1–d7 plotted in the term space spanned by k1, k2, k3.]

     k1  k2  k3 | q • dj
d1    1   0   1 |   4
d2    1   0   0 |   1
d3    0   1   1 |   5
d4    1   0   0 |   1
d5    1   1   1 |   6
d6    1   1   0 |   3
d7    0   1   0 |   2
q     1   2   3 |
The Vector Model: Example III
[Figure: documents d1–d7 plotted in the term space spanned by k1, k2, k3.]

     k1  k2  k3 | q • dj
d1    2   0   1 |   5
d2    1   0   0 |   1
d3    0   1   3 |  11
d4    2   0   0 |   2
d5    1   2   4 |  17
d6    1   2   0 |   5
d7    0   5   0 |  10
q     1   2   3 |
Probabilistic Model (1/6)
• Introduced by Robertson and Sparck Jones, 1976
– Binary independence retrieval (BIR) model
• Idea: Given a user query q, and the ideal answer set R
of the relevant documents, the problem is to specify the
properties for this set
– Assumption (probabilistic principle): the probability of relevance
depends on the query and document representations only; ideal answer
set R should maximize the overall probability of relevance
– The probabilistic model tries to estimate the probability that the
user will find the document dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)
Probabilistic Model (2/6)
• Definition
– All index term weights are binary, i.e., wi,j ∈ {0,1}
– Let R be the set of documents known to be relevant to query q
– Let R̄ be the complement of R (the nonrelevant documents)
– Let P(R | dj) be the probability that the document dj is relevant to the query q
– Let P(R̄ | dj) be the probability that the document dj is nonrelevant to query q
Probabilistic Model (3/6)
• The similarity sim(dj,q) of the document dj to the query q is
defined as the ratio

    sim(dj, q) = Pr(R | dj) / Pr(R̄ | dj)

• Applying Bayes' rule and assuming independence of index terms:

    sim(dj, q) ≈ log [ ∏_{i=1}^{t} Pr(ki = di | R) / ∏_{i=1}^{t} Pr(ki = di | R̄) ]
Probabilistic Model (5/6)

    sim(dj, q) ≈ [ ∏_{gi(dj)=1} Pr(ki | R) × ∏_{gi(dj)=0} Pr(k̄i | R) ]
               / [ ∏_{gi(dj)=1} Pr(ki | R̄) × ∏_{gi(dj)=0} Pr(k̄i | R̄) ]

• Since Pr(ki | R) + Pr(k̄i | R) = 1, taking logarithms and ignoring
factors that are constant for all documents:

    sim(dj, q) ≈ ∑_{i=1}^{t} [ log( P(ki | R) / (1 − P(ki | R)) )
                             + log( (1 − P(ki | R̄)) / P(ki | R̄) ) ]

• With term weights:

    sim(dj, q) ≈ ∑_{i=1}^{t} wi,q × wi,j × [ log( P(ki | R) / (1 − P(ki | R)) )
                                          + log( (1 − P(ki | R̄)) / P(ki | R̄) ) ]
Estimation of Term Relevance
In the very beginning:
    Pr(ki | R) = 0.5
    Pr(ki | R̄) = dfi / N
Next, the ranking can be improved as follows. Let V be a subset of the
documents initially retrieved, and Vi the subset of V containing ki:
    Pr(ki | R) = Vi / V
    Pr(ki | R̄) = (dfi − Vi) / (N − V)
For small values of V and Vi, smoothed estimates are used instead:
    Pr(ki | R) = (Vi + 0.5) / (V + 1)               Pr(ki | R) = (Vi + dfi/N) / (V + 1)
    Pr(ki | R̄) = (dfi − Vi + 0.5) / (N − V + 1)     Pr(ki | R̄) = (dfi − Vi + dfi/N) / (N − V + 1)
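The whole loop — initial guess, log-odds ranking, re-estimation — fits in a few lines. A minimal sketch, assuming an invented toy index (N, the df counts, and the V, Vi values below are all made up):

```python
import math

# Toy statistics (assumed): collection size and document frequencies.
N = 100
df = {"k1": 10, "k2": 50}

# Initial estimates: P(ki|R) = 0.5, P(ki|R̄) = dfi / N.
p_rel = {k: 0.5 for k in df}
p_nonrel = {k: df[k] / N for k in df}

def sim(doc_terms, query_terms):
    """Sum of log-odds over query terms that occur in the document."""
    s = 0.0
    for k in query_terms & doc_terms:
        s += math.log(p_rel[k] / (1 - p_rel[k]))
        s += math.log((1 - p_nonrel[k]) / p_nonrel[k])
    return s

def reestimate(k, V, V_i):
    """Smoothed re-estimation after inspecting the top V retrieved docs,
    of which V_i contain term k."""
    p_rel[k] = (V_i + 0.5) / (V + 1)
    p_nonrel[k] = (df[k] - V_i + 0.5) / (N - V + 1)

reestimate("k1", V=10, V_i=8)  # k1 looks like a good relevance indicator
```

Each round of re-estimation replaces the guessed probabilities with counts from the retrieved set, which is the recursive improvement the slide describes.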
• Advantage
– Documents are ranked in decreasing order of
their probability of being relevant
• Disadvantage
– The need to guess the initial relevant and
nonrelevant sets
– Term frequency is not considered
– Independence assumption for index terms
Brief Comparison of Classic Models
• Boolean model is the weakest
– Not able to recognize partial matches
• Controversy between probabilistic and
vector models
– The vector model is expected to outperform
the probabilistic model with general
collections
Alternative Set Theoretic
Models
• Fuzzy Set Model
• Extended Boolean Model
Fuzzy Theory
• A fuzzy subset A of a universe U is
characterized by a membership function
µA: U → [0,1] which associates with each
element u ∈ U a number µA(u)
• Let A and B be two fuzzy subsets of U:
    µĀ = 1 − µA
    µA∪B = max( µA , µB )
    µA∩B = min( µA , µB )
Fuzzy Information Retrieval
• Using a term-term correlation matrix
    cu,v = dfu,v / ( dfu + dfv − dfu,v )
• Define a fuzzy set associated to each index term ki
    µi(dj) = 1 − ∏_{kl ∈ dj} (1 − ci,l)
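Both formulas can be sketched directly. The df counts below are invented for illustration:

```python
# Assumed toy statistics: docs containing each term / each term pair.
df = {"a": 30, "b": 20}
df_pair = {("a", "b"): 10}

def corr(u, v):
    """c_{u,v} = df_{u,v} / (df_u + df_v - df_{u,v})."""
    if u == v:
        return 1.0  # a term is fully correlated with itself
    duv = df_pair.get((u, v)) or df_pair.get((v, u), 0)
    return duv / (df[u] + df[v] - duv)

def membership(i, doc_terms):
    """mu_i(dj) = 1 - prod over terms k_l in dj of (1 - c_{i,l})."""
    prod = 1.0
    for l in doc_terms:
        prod *= 1 - corr(i, l)
    return 1 - prod

m = membership("a", ["b"])
```

The algebraic-complement form means a document belongs to the fuzzy set of ki as soon as any of its terms correlates with ki, and the degree grows with each additional correlated term.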
Example
[Figure: Venn diagram of Ka, Kb, Kc with the conjunctive components cc1, cc2, cc3.]
• Disjunctive Normal Form
    q = ka ∧ ( kb ∨ ¬kc )
    qdnf = ( ka ∧ kb ∧ kc ) ∨ ( ka ∧ kb ∧ ¬kc ) ∨ ( ka ∧ ¬kb ∧ ¬kc )

    µq(dj) = µcc1∨cc2∨cc3(dj) = 1 − ∏_{i=1}^{3} (1 − µcci(dj))
           = 1 − (1 − µa(dj) µb(dj) µc(dj))
               × (1 − µa(dj) µb(dj) (1 − µc(dj)))
               × (1 − µa(dj) (1 − µb(dj)) (1 − µc(dj)))
    0 ≤ Fi(xi, Γxi) ≤ 1
• P(x1,x2,x3,x4,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)
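The factorization can be checked numerically: multiplying the local conditional probabilities along the network recovers a proper joint distribution. A minimal sketch for binary variables, where every CPT value below is invented for illustration:

```python
# Invented conditional probability tables (probability the variable is 1).
p_x1 = 0.6
p_x2 = {(1,): 0.7, (0,): 0.2}                                  # P(x2=1 | x1)
p_x3 = {(1,): 0.5, (0,): 0.1}                                  # P(x3=1 | x1)
p_x4 = {(1, 1): 0.9, (1, 0): 0.4, (0, 1): 0.3, (0, 0): 0.05}   # P(x4=1 | x2,x3)
p_x5 = {(1,): 0.8, (0,): 0.25}                                 # P(x5=1 | x3)

def bern(p, x):
    """P(X = x) for a binary variable with P(X=1) = p."""
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4, x5):
    """P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bern(p_x1, x1)
            * bern(p_x2[(x1,)], x2)
            * bern(p_x3[(x1,)], x3)
            * bern(p_x4[(x2, x3)], x4)
            * bern(p_x5[(x3,)], x5))

total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
```

Summing the product over all 32 assignments gives exactly 1, so the network's local tables define a consistent global distribution.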
Belief Network Model (1/6)
• The probability space
– The set K = {k1, k2, …, kt} of all index terms is the universe.
To each subset u ⊆ K is associated a vector k such that gi(k) = 1 ⇔ ki ∈ u
• Random variables
– To each index term ki is associated a binary
random variable
Belief Network Model (2/6)
• Concept space
– A document dj is represented as a concept
composed of the terms used to index dj
– A user query q is also represented as a concept
composed of the terms used to index q
– Both user query and document are modeled
as subsets of index terms
• Probability distribution P over K
    P(c) = ∑_u P(c | u) × P(u)      (degree of coverage of K by c)
    P(u) = (1/2)^t
Belief Network Model (3/6)
• A query q is modeled as a network node
– This random variable is set to 1 whenever q
completely covers the concept space K
– P(q) computes the degree of coverage of the space K
by q
• A document dj is modeled as a network node
– This random variable is 1 to indicate that dj
completely covers the concept space K
– P(dj) computes the degree of coverage of the space K
by dj
Belief Network Model (4/6)
[Figure: the belief network topology — the query q and each document dj are modeled as nodes conditioned on the index-term nodes ki.]
Belief Network Model (5/6)
• Assumption
– P(dj | q) is adopted as the rank of the
document dj with respect to the query q

    P(dj | q) = P(dj ∧ q) / P(q)
              ≈ P(dj ∧ q)
              ≈ ∑_{∀u} P(dj ∧ q | u) × P(u)
              ≈ ∑_{∀u} P(dj | u) × P(q | u) × P(u)
              ≈ ∑_{∀k} P(dj | k) × P(q | k) × P(k)
Belief Network Model (6/6)
• Specify the conditional probabilities as follows

    P(dj | k) = wi,j / √( ∑_{i=1}^{t} wi,j² )    if k = ki ∧ gi(dj) = 1
              = 0                                otherwise

    P(q | k)  = wi,q / √( ∑_{i=1}^{t} wi,q² )    if k = ki ∧ gi(q) = 1
              = 0                                otherwise
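With these conditionals, the rank ∑_k P(dj | k) P(q | k) P(k) reduces (up to the constant P(k)) to a cosine-like score over the states with a single active term. A minimal sketch with invented weights:

```python
import math

# Illustrative term weights for one document dj and one query q (assumed).
w_doc = [0.5, 0.0, 0.8]   # wi,j
w_q   = [1.0, 1.0, 1.0]   # wi,q

norm_d = math.sqrt(sum(w * w for w in w_doc))
norm_q = math.sqrt(sum(w * w for w in w_q))

# Sum over single-term states k = ki: terms with wi,j = 0 or wi,q = 0
# contribute nothing, since P(dj|k) or P(q|k) is then 0.
rank = sum((wd / norm_d) * (wq / norm_q)
           for wd, wq in zip(w_doc, w_q)
           if wd > 0 and wq > 0)
```

With a uniform prior P(k), this recovers the vector model's cosine ranking inside the belief-network formalism, which is the point of this choice of conditionals.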