Apr 2011
Original Paper
Diversifying Search Results
Rakesh Agrawal (Search Labs, Microsoft Research)
Sreenivas Gollapudi (Search Labs, Microsoft Research)
Ideally the result set should properly
account for the interests of the overall user
population.
Introduction: Basic
Assumption #1
A taxonomy of information exists at the
topical level.
A document can belong to one or more
categories, and so can a query.
Introduction: Basic
Assumption #2
Usage statistics are available for user
intents.
Example: Among users searching for 'FLASH',
65% intended to find 'Flash Player',
15% were looking for 'Recent flash floods',
and 5% were looking for 'Flash Gordon'.
Introduction: Defining the
objective
To maximize the relevance of a result
document set based on individual
relevance of the members and their
diversity.
Formalization of Notation
The set of categories to which a document
d belongs is denoted by C(d).
The set of categories to which a query
belongs is denoted by C(q).
Formalization of Notation
Example:
− C(q='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon' }
− C(d='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon', 'Flash village' }
Formalization of Notation
C(d) ∩C(q) may be empty.
Formalization of Notation
The probability of a given query q
belonging to a category c is denoted by
P(c|q).
It is called user intent for query q and
category c.
Formalization of Notation
Assumption: Our knowledge is complete.
∑c∈C(q) P(c|q) = 1
Informally this means that given a query
q, we have an exhaustive list of all the
categories to which the query could
belong.
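A minimal sketch of the completeness assumption, in Python. The shares quoted for 'FLASH' (65%, 15%, 5%) cover only 85% of users, and the slides do not say how the remainder is treated. Rescaling the listed shares so they sum to 1 is an assumption of this sketch, and the name normalize_intents is hypothetical.

```python
def normalize_intents(raw_shares):
    """Rescale observed intent shares so they sum to 1 over C(q).

    Assumption of this sketch: the listed categories are treated as
    exhaustive, so any unlisted share is redistributed proportionally.
    """
    total = sum(raw_shares.values())
    if total <= 0:
        raise ValueError("at least one intent must have a positive share")
    return {category: share / total for category, share in raw_shares.items()}


# Shares taken from the 'FLASH' example; they add up to 0.85, not 1.
flash_shares = {"Flash Player": 0.65, "flash floods": 0.15, "Flash Gordon": 0.05}
p_c_given_q = normalize_intents(flash_shares)
```

After this step, p_c_given_q can stand in for P(c|q) wherever the distribution is required to sum to 1.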
Formalization of Notation
V(d|q,c) is defined as the relevance value
of document d for query q, when the
intended category of q is c.
If we constrain V to [0,1], it represents the
probability of document d satisfying user
query q that has intended category c.
Formalization of Notation
V can be obtained by multiplying query-
document similarity by the probability that
the document d belongs to category c.
− V(d|q,c) = Similarity(d,q) * P(c|d)
− where P(c|d) can be estimated by a
classifier, e.g. as its confidence
value.
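The product above can be sketched in a few lines of Python. The function name relevance_value and all numbers are hypothetical, and both inputs are assumed to already lie in [0,1] so that V does too.

```python
def relevance_value(similarity, category_confidence):
    """V(d|q,c) = Similarity(d,q) * P(c|d), with both inputs assumed in [0, 1]."""
    if not (0.0 <= similarity <= 1.0 and 0.0 <= category_confidence <= 1.0):
        raise ValueError("similarity and category confidence must lie in [0, 1]")
    return similarity * category_confidence


# Hypothetical classifier confidences P(c|d) for one document d.
doc_category_confidence = {"Flash Player": 0.9, "Flash Gordon": 0.1}
# Hypothetical query-document similarity Similarity(d, q).
similarity_to_query = 0.8

# One relevance value per candidate intent category.
v = {
    category: relevance_value(similarity_to_query, confidence)
    for category, confidence in doc_category_confidence.items()
}
```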
Formalization of Notation
Assumption: Given a query q and a
category of intent c, the relevance values of
any two documents are independent.
− V(d1|q,c) and V(d2|q,c) are independent.
Formalizing the objective fn
Suppose users only consider the top k
results of a search engine.
– We can rephrase the objective:
“To maximize the relevance of a result document
set based on individual relevance of the
members and their diversity”
– as:
The objective is to maximize the
probability that an average user
finds at least one relevant result
among the top k.
Formalizing the objective fn
Formally, given
− a query q,
− a set of documents D,
− a distribution of category of intent P(c|q),
− the relevance V(d|q,c) of each document d ∈ D,
we want to find the set S of top k results
(S ⊆ D, |S| = k) such that
Formalizing the objective fn
P(S|q) = ∑c P(c|q) · (1 - ∏d∈S (1 - V(d|q,c)))
is maximized.
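A minimal sketch of evaluating this objective, in Python. The exhaustive search over all size-k subsets is only an illustration for tiny document sets and is not presented here as the authors' method. The names satisfaction_probability and best_subset, the document ids, and every number are hypothetical. Missing (document, category) pairs are assumed to have relevance 0.

```python
from itertools import combinations
from math import prod


def satisfaction_probability(selected, intent, relevance):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c)))."""
    total = 0.0
    for category, p_c in intent.items():
        # Probability that no selected document satisfies intent `category`.
        miss = prod(1.0 - relevance.get((doc, category), 0.0) for doc in selected)
        total += p_c * (1.0 - miss)
    return total


def best_subset(documents, intent, relevance, k):
    """Try every size-k subset of D and keep the one with the highest P(S|q)."""
    return max(
        combinations(documents, k),
        key=lambda subset: satisfaction_probability(subset, intent, relevance),
    )


# Hypothetical intent distribution P(c|q) and relevance values V(d|q,c).
intent = {"player": 0.7, "floods": 0.3}
relevance = {
    ("d1", "player"): 0.9,
    ("d2", "player"): 0.8,
    ("d3", "floods"): 0.6,
}
docs = ["d1", "d2", "d3"]

top2 = best_subset(docs, intent, relevance, k=2)
```

With these numbers the two most relevant documents, d1 and d2, both serve the same intent and score 0.686, while the pair (d1, d3) covers both intents and scores 0.81, so the diversified pair is selected.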
Origin of the objective function
For each d ∈ S, V(d|q,c) is the probability
that document d from our result subset S
satisfies a user query q having intended
category c.
Origin of the objective function
For each d ∈ S, 1 - V(d|q,c) is the probability
that document d, from our result subset S,
does not satisfy a user query q, having
intended category c.
Origin of the objective function
By the independence assumption,
∏d∈S (1 - V(d|q,c)) is the probability that no
document d, from our result subset S,
satisfies user query q, having intended
category c.
Origin of the objective function
1 - ∏d∈S (1 - V(d|q,c)) is the probability
that at least one document d, from our
result subset S, satisfies user query q,
having intended category c.
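A short worked instance of this step may help. The values are hypothetical: a result subset S with two documents whose relevance values for the intended category c are 0.5 and 0.4.

```latex
% Hypothetical values: V(d_1|q,c) = 0.5 and V(d_2|q,c) = 0.4
\begin{align*}
\prod_{d \in S} \bigl(1 - V(d \mid q, c)\bigr) &= (1 - 0.5)(1 - 0.4) = 0.30 \\
1 - \prod_{d \in S} \bigl(1 - V(d \mid q, c)\bigr) &= 1 - 0.30 = 0.70
\end{align*}
```

So under these assumed values, a user with intent c has a 70% chance of being satisfied by at least one of the two documents, higher than either document achieves alone.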
Origin of the objective function
Therefore,
P(c|q) · (1 - ∏d∈S (1 - V(d|q,c)))
gives the probability that query q has
intended category c, and it is satisfied by
at least one document from our result
subset S.
Origin of the objective function
If we sum up
P(c|q) · (1 - ∏d∈S (1 - V(d|q,c)))
over the categories {c1, c2, ..., cr},
we obtain the probability that a user query
belonging to any of these categories is
satisfied by at least one document from
our result subset S.
Origin of the objective function
Therefore by defining
P(S|q) = ∑c P(c|q) · (1 - ∏d∈S (1 - V(d|q,c))),
FOR d ∈ D do