
Term Paper

Web Search Result Diversification

Farhan Ahmad
Apr 2011
Original Paper

Diversifying Search Results
Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong
Search Labs, Microsoft Research

Second ACM International Conference on Web Search and Data Mining (WSDM 2009)
Barcelona, Spain, February 9-12, 2009
Abstract

Query terms given by users are often ambiguous.

Search engines should diversify the search results to minimize the average user's risk of dissatisfaction.

The authors present a systematic approach to measuring the diversity of search results.

They also present an algorithm that selects a maximally diverse subset of the search results.
Introduction: Ambiguous queries

Consider the search term 'FLASH'.

It can have several interpretations:
− Flash player
− Flash floods
− Flash Gordon (an adventure hero)
Introduction: Ambiguous queries

Suppose Flash player is searched for most often.

Most of the top results returned for the query 'FLASH' will then belong to this category.

This is because search engines rank on the basis of similarity and make no explicit attempt to diversify the documents.
Introduction: Relevance of search results

The basic premise is: "The relevance of a set of documents depends not only on the individual relevance of its members, but also on how they relate to each other."
Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 1998.

Ideally the result set should properly account for the interests of the overall user population.
Introduction: Basic Assumption #1

A taxonomy of information exists at the topical level.

A document can belong to one or more categories, and so can a query.
Introduction: Basic Assumption #2

Usage statistics are available for user intents.

Example: When searching for 'FLASH', 65% of users intended to find 'Flash player', 15% were looking for recent flash floods, and 5% were looking for 'Flash Gordon'.
Introduction: Defining the objective

To maximize the relevance of a result document set based on the individual relevance of its members and their diversity.
Formalization of Notation


The set of categories to which a document
d belongs is denoted by C(d).

The set of categories to which a query
belongs is denoted by C(q).
Formalization of Notation

Example:
− C(q='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon' }
− C(d='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon', 'Flash village' }
Formalization of Notation


C(d) ∩ C(q) may be empty.
Formalization of Notation


The probability of a given query q
belonging to a category c is denoted by
P(c|q).

It is called the user intent for query q and category c.
Formalization of Notation


Assumption: Our knowledge is complete:
∑c∈C(q) P(c|q) = 1

Informally this means that given a query
q, we have an exhaustive list of all the
categories to which the query could
belong.
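
As a sketch of what completeness means in practice, here is a hypothetical intent distribution for q = 'FLASH' (the category names and numbers below are illustrative, not taken from the paper):

# Hypothetical intent distribution P(c|q) for q = 'FLASH'.
intent = {
    'Flash player': 0.65,
    'flash floods': 0.20,
    'Flash Gordon': 0.15,
}

# Completeness: summing P(c|q) over all c in C(q) gives 1.
assert abs(sum(intent.values()) - 1.0) < 1e-9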
Formalization of Notation

V(d|q,c) is defined as the relevance value
of document d for query q, when the
intended category is c.

If we constrain V to [0,1], it represents the
probability of document d satisfying user
query q that has intended category c.
Formalization of Notation

V can be obtained by multiplying the query-document similarity by the probability that the document d belongs to category c:
− V(d|q,c) = Similarity(d,q) * P(c|d)
− where P(c|d) can be computed by the classification algorithm, e.g., as a confidence value.
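
A minimal sketch of this factorization (the similarity score and classifier confidence below are hypothetical placeholders):

def relevance(similarity_dq, p_c_given_d):
    # V(d|q,c) = Similarity(d,q) * P(c|d); with both factors in [0,1],
    # the product also stays in [0,1].
    return similarity_dq * p_c_given_d

# E.g. a document with similarity 0.9 to the query, classified into
# category c with confidence 0.7:
v = relevance(0.9, 0.7)  # 0.63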
Formalization of Notation

Assumption: Given a query q and a category of intent c, the relevance values of any two documents are independent:
− V(d1|q,c) and V(d2|q,c) are independent.
Formalizing the objective fn

Suppose users only consider the top k results of a search engine.

We can then rephrase the objective
"To maximize the relevance of a result document set based on the individual relevance of its members and their diversity"
as:
maximize the probability that an average user finds at least one relevant result among the top k.
Formalizing the objective fn

Formally, given
− a query q,
− a set of documents D,
− a distribution of categories of intent P(c|q),
− the relevance V(d|q,c) of each document d ∈ D,
we want to find the set S of top k results (|S| = k), S ⊆ D, such that
Formalizing the objective fn

P(S|q) = ∑c P(c|q) · (1 − ∏d∈S (1 − V(d|q,c)))

is maximized.
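
The objective can be transcribed into Python directly; the following sketch (the data structures are my own choice, not the paper's) is reused in later checks:

def p_satisfied(S, p_intent, V):
    # P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c)))
    # S        : iterable of document ids
    # p_intent : dict mapping category c to P(c|q)
    # V        : dict mapping (doc, category) to V(d|q,c); missing pairs count as 0
    total = 0.0
    for c, p_c in p_intent.items():
        prob_all_fail = 1.0
        for d in S:
            prob_all_fail *= 1.0 - V.get((d, c), 0.0)
        total += p_c * (1.0 - prob_all_fail)
    return total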
Origin of the objective function

For a document d ∈ S, V(d|q,c) is the probability that d satisfies a user query q having intended category c.
Origin of the objective function

Likewise, 1 − V(d|q,c) is the probability that document d from our result subset S does not satisfy a user query q having intended category c.
Origin of the objective function

∏d∈S (1 − V(d|q,c)) is the probability that no document from our result subset S satisfies user query q having intended category c.
Origin of the objective function

1 − ∏d∈S (1 − V(d|q,c)) is the probability that at least one document from our result subset S satisfies user query q having intended category c.
Origin of the objective function

Therefore,
P(c|q) · (1 − ∏d∈S (1 − V(d|q,c)))
gives the probability that query q has intended category c and is satisfied by at least one document from our result subset S.
Origin of the objective function

If we sum up
P(c|q) · (1 − ∏d∈S (1 − V(d|q,c)))
over the different categories {c1, c2, ..., cr}, we get the probability that a user query belonging to any of these categories is satisfied by at least one document from our result subset S.
Origin of the objective function

Therefore, by defining
P(S|q) = ∑c P(c|q) · (1 − ∏d∈S (1 − V(d|q,c)))
and maximizing P(S|q), we maximize the chance that an average user is satisfied.

In this sense, P(S|q) measures the diversity of result subset S for a query q.
Formalizing the problem

Given a result set D, find a set S ⊆ D, |S| = k, whose diversity

P(S|q) = ∑c P(c|q) · (1 − ∏d∈S (1 − V(d|q,c)))

is maximum over all such possible S.

This problem is formally written as Diversify(k).
Caveat

Diversify(k) does not try to cover all the categories:
− While maximizing ∑c P(c|q) · (1 − ∏d∈S (1 − V(d|q,c))) subject to |S| = k, we might need to exclude all documents from some category cr.
Caveat

It might be that by taking all k documents from only a single category c3, we maximize P(S|q).

All other categories are left out in such a case.
Problems with Diversify(k)

Diversify(k) does not say anything about the ordering of the result subset S.

Diversify(k) is NP-hard (by a reduction from the max-coverage problem).
A greedy algorithm for Diversify(k)

The authors have proposed a greedy
algorithm for Diversify(k) that uses the
concept of marginal utility to diversify as
well as re-rank the search results.

This algorithm maximizes P(S|q) when
every document can belong to just one
category, and otherwise it optimizes P(S|q)
with bounded error.
Notation

U(c|q,S) is the probability that a query q
belongs to the category c, given that all
documents in the set S fail to satisfy the
user.
− Initially S = ∅,
− and we define U(c|q,∅) = P(c|q).
Notation

We define the marginal utility of a document d as the product of its relevance value V with the conditional distribution of categories U, summed over the categories of d:
g(d|q,c,S) = ∑c∈C(d) U(c|q,S) · V(d|q,c)
− It is the probability that document d satisfies the user when all documents that come before it fail to do so.
Greedy algo IA-Select

Inputs: k, q, C(q), D, C(d), P(c|q), V(d|q,c)
Output: S ⊆ D, |S| = k

1: S = ∅
2: ∀c, U(c|q,S) = P(c|q)
3: WHILE |S| < k
4:   FOR d ∈ D DO
5:     g(d|q,c,S) = ∑c∈C(d) U(c|q,S) · V(d|q,c)
6:   ENDFOR
7:   d* = argmax d∈D g(d|q,c,S)
8:   ∀c∈C(d*), U(c|q,S) = (1 − V(d*|q,c)) · U(c|q,S)
9:   S = S ∪ {d*}
10:  D = D − {d*}
11: ENDWHILE
12: RETURN S
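
Below is a runnable Python sketch of IA-Select (the function name and data structures are my own choices, not from the paper; the step comments refer to the pseudocode above):

def ia_select(k, p_intent, docs, V):
    # k        : number of results to select
    # p_intent : dict mapping category c to P(c|q)
    # docs     : dict mapping document id to its category set C(d)
    # V        : dict mapping (doc, category) to V(d|q,c)
    remaining = dict(docs)                     # candidates; step 10 removes from it
    U = dict(p_intent)                         # step 2: U(c|q,∅) = P(c|q)
    S = []                                     # step 1
    while remaining and len(S) < k:            # step 3
        # steps 4-5: marginal utility g(d|q,c,S) of each candidate
        def g(d):
            return sum(U.get(c, 0.0) * V.get((d, c), 0.0) for c in remaining[d])
        d_star = max(remaining, key=g)         # step 7
        for c in docs[d_star]:                 # step 8: discount the covered intents
            U[c] = (1.0 - V.get((d_star, c), 0.0)) * U.get(c, 0.0)
        S.append(d_star)                       # step 9
        del remaining[d_star]                  # step 10
    return S                                   # step 12

Because documents are appended in decreasing order of marginal utility, the returned list doubles as the re-ranked ordering that plain Diversify(k) leaves unspecified.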
Proofs
• Earlier we claimed that
– IA-Select maximizes P(S|q), the diversity, if every document belongs to exactly one category.
– IA-Select optimizes P(S|q) with bounded error otherwise.
Basis for proofs
• To prove our claims, we need to
understand the concept of submodularity.
• We first define submodularity.
• Then prove that P(S|q) is submodular
• Then prove our claims.
Submodularity
• It is known as the principle of diminishing
marginal utilities in economics.
• In our context, the marginal benefit of
adding a document to a larger
collection is less than that of adding the
same document to a smaller collection.
• Formal definition follows.
Submodularity
• If N is a set and f is a set function
f : 2^N → ℝ,
then f is submodular if and only if
for all S ⊆ T ⊆ N
and d ∈ N with d ∉ T (hence d ∉ S),
f(S ∪ {d}) − f(S) ≥ f(T ∪ {d}) − f(T)
Submodularity
• We have chosen two subsets of the domain N: S and T, with S contained in T.
• Now take a new element d of N that has not yet been added to either S or T.
Submodularity
We evaluate the change in the value of f due to the addition of d, for both S and T.
• f(S ∪ {d}) − f(S) is the marginal utility gained from adding d to the smaller set S.
• f(T ∪ {d}) − f(T) is the marginal utility gained from adding d to the larger set T.
Submodularity
• If the inequality
f(S ∪ {d}) − f(S) ≥ f(T ∪ {d}) − f(T)
holds for all such S, T and d, f is said to be submodular.
P(S|q) is submodular
• Let S, T with S ⊆ T ⊆ D be two sets of documents.
• And let e ∈ D be a document such that e ∉ T.
• Let S' = S ∪ {e} and T' = T ∪ {e}.
P(S|q) is submodular
• P(S'|q) − P(S|q)
= ∑c P(c|q) · [ (1 − ∏d∈S' (1 − V(d|q,c))) − (1 − ∏d∈S (1 − V(d|q,c))) ]
= ∑c P(c|q) · [ ∏d∈S (1 − V(d|q,c)) − ∏d∈S' (1 − V(d|q,c)) ]
= ∑c P(c|q) · [ ∏d∈S (1 − V(d|q,c)) − ∏d∈S (1 − V(d|q,c)) · (1 − V(e|q,c)) ]
= ∑c P(c|q) · ∏d∈S (1 − V(d|q,c)) · V(e|q,c)
P(S|q) is submodular
• Similarly,
P(T'|q) − P(T|q)
= ∑c P(c|q) · ∏d∈T (1 − V(d|q,c)) · V(e|q,c)
P(S|q) is submodular
• Because S ⊆ T,
∏d∈S (1 − V(d|q,c)) ≥ ∏d∈T (1 − V(d|q,c)),
since both sides are products of factors in [0,1], and the right-hand product contains every factor of the left-hand product, possibly times additional factors.
P(S|q) is submodular
• Therefore
P(S'|q) − P(S|q) ≥ P(T'|q) − P(T|q).
• Hence P(S|q) is submodular.
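
The inequality can also be sanity-checked numerically, reusing the p_satisfied sketch from earlier (the instance below is made up):

import random

random.seed(0)
cats = ['c1', 'c2', 'c3']
docs = ['d1', 'd2', 'd3', 'd4', 'e']
p_intent = {'c1': 0.5, 'c2': 0.3, 'c3': 0.2}
V = {(d, c): random.random() for d in docs for c in cats}

S = ['d1']
T = ['d1', 'd2', 'd3']   # S ⊆ T, and 'e' belongs to neither
gain_S = p_satisfied(S + ['e'], p_intent, V) - p_satisfied(S, p_intent, V)
gain_T = p_satisfied(T + ['e'], p_intent, V) - p_satisfied(T, p_intent, V)
assert gain_S >= gain_T  # diminishing marginal gain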
Proof #1
• Formally, we have to prove that
– IA-Select maximizes P(S|q) if ∀d∈D, |C(d)| = 1.
Proof #1
• In this case step 5 in our algo becomes
– g(d|q,c,S) = U(c|q,S) · V(d|q,c)
instead of
g(d|q,c,S) = ∑c∈C(d) U(c|q,S) · V(d|q,c)
Proof #1
• Step 8 in our algo becomes
– for the single category c = C(d*): U(c|q,S) = (1 − V(d*|q,c)) · U(c|q,S)
instead of
∀c∈C(d*), U(c|q,S) = (1 − V(d*|q,c)) · U(c|q,S)
Proof #1
• Because U(c|q,S) = P(c|q) at the outset, as documents {d1, d2, ...} (all belonging to the same category c) get added to S, U is updated in step 8 as
– for c = C(d1):
U(c|q,S) = (1 − V(d1|q,c)) · U(c|q,∅)
         = (1 − V(d1|q,c)) · P(c|q)
– for c = C(d2):
U(c|q,S) = (1 − V(d2|q,c)) · (1 − V(d1|q,c)) · U(c|q,∅)
         = (1 − V(d2|q,c)) · (1 − V(d1|q,c)) · P(c|q)
Proof #1
• Therefore, in the kth iteration of IA-Select, g is computed in steps 4 and 5 as

FOR d ∈ D DO
g(d|q,c,S) = (∏d'∈S (1 − V(d'|q,c)) · P(c|q)) · V(d|q,c)
           = P(c|q) · ∏d'∈S (1 − V(d'|q,c)) · V(d|q,c)
Proof #1
• From the marginal-gain computation in the submodularity proof of P(S|q), we know that for every document d
∑c g(d|q,c,S) = P(S∪{d} | q) − P(S|q)
– Because g(d|q,c,S) is non-zero for exactly one category c, the sigma can be removed:
g(d|q,c,S) = P(S∪{d} | q) − P(S|q)
Proof #1
• At the start S = ∅. When we have added the required k documents to S:
g(d1|q,c,∅) = P({d1} | q) − P(∅|q)
g(d2|q,c,{d1}) = P({d1,d2} | q) − P({d1} | q)
. . .
g(dk|q,c,{d1,d2,...,dk−1}) = P(S | q) − P({d1,d2,...,dk−1} | q)
Proof #1
• The sum telescopes:
g(d1|q,c,∅) + g(d2|q,c,{d1}) + ... + g(dk|q,c,{d1,d2,...,dk−1})
= P(S|q) − P(∅|q)
= P(S|q)
Proof #1
• Because {d1,d2,...,dk} = S are the documents greedily selected from D with the highest marginal-utility values g(d|q,c,S) at each step, the sum on the LHS is at least as large as that for any other k-element subset of D.

• This proves that selecting documents by their marginal utility indeed maximizes the diversity P(S|q) of our result subset.
Proof #2
• Formally, we have to prove that if documents can belong to many categories, selecting them based on g(d|q,c,S) still optimizes P(S|q), with an error that is bounded.
Proof #2
• Theorem (Nemhauser, Wolsey and Fisher, 1978): For a monotone submodular set function f,
– let S* be the set of k elements that maximizes f, and
– let S' be the k-element set constructed by greedily selecting elements one by one, each giving the largest marginal increase to f; then

f(S') ≥ (1 − 1/e) · f(S*)
Proof #2
• Since our objective P(S|q) is monotone submodular, and our algorithm IA-Select is greedy in the sense mentioned above,
– IA-Select is a (1 − 1/e)-approximation to Diversify(k).
– That is, the value IA-Select achieves is never less than (1 − 1/e) times the maximum value obtainable from Diversify(k).
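
On a tiny made-up instance, this bound can be checked by comparing ia_select against brute force (both helpers come from the earlier sketches):

import itertools, math, random

random.seed(1)
cats = ['c1', 'c2', 'c3']
docs = {f'd{i}': set(random.sample(cats, 2)) for i in range(6)}
p_intent = {'c1': 0.5, 'c2': 0.3, 'c3': 0.2}
V = {(d, c): random.random() for d in docs for c in docs[d]}

k = 3
greedy = p_satisfied(ia_select(k, p_intent, docs, V), p_intent, V)
best = max(p_satisfied(S, p_intent, V) for S in itertools.combinations(docs, k))
assert greedy >= (1 - 1 / math.e) * best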
Evaluating IA-Select
• To measure the result set diversity of
search engines and compare it with the
diversity of results obtained using IA-
Select, the authors first defined intent-
aware counterparts of traditional IR
metrics.
Evaluating IA-Select
• One such metric is reciprocal rank (RR): the inverse of the first position at which a relevant document is found in a list.
• If there is a rank threshold T, RR is zero if no relevant document is found among the first T documents.
• The mean reciprocal rank (MRR) of a query set is the average RR of the queries in the set.
Evaluating IA-Select
• Because IA-Select diversifies the top k results, the authors defined an intent-aware MRR.
• For a result set D, rank threshold k, and query q:
– MRR-IA(D,k) = ∑c P(c|q) · MRR(D,k|c)
where MRR(D,k|c) gives the average RR for a query set belonging to category c.
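
A minimal sketch of MRR-IA for a single query (the per-category relevance judgments are hypothetical inputs):

def rr(ranking, relevant, k):
    # Reciprocal rank within threshold k; 0 if nothing relevant in the top k.
    for pos, d in enumerate(ranking[:k], start=1):
        if d in relevant:
            return 1.0 / pos
    return 0.0

def mrr_ia(ranking, p_intent, relevant_by_cat, k):
    # MRR-IA(D,k) = sum_c P(c|q) * RR(D,k|c) for one query;
    # MRR averages this quantity over a query set.
    return sum(p_c * rr(ranking, relevant_by_cat.get(c, set()), k)
               for c, p_c in p_intent.items())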
Evaluating IA-Select
• Shown below are the results obtained using three commercial search engines and IA-Select.
[Results figure from the original slides; not recoverable from this extraction.]
