Operational Research PHD Thesis

Plachouras, Vasileios (2006) Selective web information retrieval. PhD thesis.
http://theses.gla.ac.uk/1945/
Copyright and moral rights for this thesis are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the Author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the Author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given
Glasgow Theses Service http://theses.gla.ac.uk/ theses@gla.ac.uk
Selective
Web Information Retrieval
L3
C R
/{
uii.
\\
UNIVERSITY
of GLASGOW
Vasileios Plachouras
Department of Computing Science Faculty of Information and Mathematical Sciences
University of Glasgow
\ thesis submitted
for
t he
degree
of
Doctor of Philosophy
\-a-ileics Placho iiras. 2006
Abstract
One of the main challenges in Web information
retrieval is the number of
different retrieval approaches that can be used for ranking Web documents. In addition to the textual content of Web documents, evidence from the structure of Web documents, or the analysis of the hyperlink structure of the Web, can be used to enhance the retrieval effectiveness. However, not all the queries benefit equally from applying the same retrieval approach. An additional challenge is posed by the fact that the Web enables users to seek information by searching and browsing. Therefore, users do not only perform typical informational search tasks, but also navigational search tasks, where the aim is to locate a particular Web document, which has been before, already or which is expected to exist. visited In order to alleviate these challenges, this thesis proposes selective Web information framework formulated in deciterms of statistical retrieval, a sion theory, with the aim to apply an appropriate retrieval approach on a framework decision is basis. The the a mechmain component of per-query basis. that approach on a per-query appropriate retrieval selects an anism The selection of a particular retrieval approach is based on the outcome of an experiment, final before is the performed ranking of the rewhich trieved documents. The experiment is a process that extracts features from documents. This investigates thesis three the of set retrieved a sample of broad types of experiments. The first one counts the occurrences of query terms in the retrieved documents, indicating the extent to which the query topic is covered in the document collection. The second type of experiinformation ments considers from the distribution of retrieved documents in larger aggregates of related Web documents, such as whole Web sites, or directories within Web sites. The third type of experiments estimates the
usefulness of the hyperlink structure among a sample of the set of retrieved Web documents. The proposed experiments are evaluated in the context of both informational and navigational search tasks with an optimal Bayesian exists. decision mechanism, where it is assumed that relevance information This thesis further investigates the implications information
Web selective of applying
retrieval in an operational setting, where the tuning of a deinformation existing relevance and input is retrieval system's a stream of queries related to and navigational First, tasks. the experiments search and testing query sets, as well as a
cision mechanism is based on limited the information informational mixed
are evaluated using different training in order to approximate
mixture of different types of queries. Second, query sampling is introduced, the queries that a retrieval system receives, and to tune an ad-hoc decision mechanism with a broad set of automatically sampled queries. The main contributions
Web information periments. of retrieval automatic mechanism. In addition, approaches sampling
introduction the this thesis are of

framework definition the and
of the selective
of a range of ex-
retrieval
this thesis presents a thorough for Web information retrieval,
evaluation
of a set the
investigates and
of queries in order to perform
the training
of a decision
Overall, selective Web information
is a promising approach, which retrieval
The in the lead improvements to of evaluation effectiveness. retrieval can decision mechanism and the experiments shows that it can be successfully for a particular employed types of queries. type of queries, as well as a mixture of different
Acknowledgements
I would like to thank the following people:

My supervisors, Iadh Ounis and Keith your support van Rijsbergen. and feedback Keith, Iadh, thank you for through the journey
being a great supervisor;
of my Ph. D have been unsurpassed.
I am grateful
for your advice
discussions had; have the and we you always pointed directions of thinking and research.
interesting to me very
Gianni Amati for many helpful discussions and ideas related to research. Mark Baillie and Tassos Tombros for reading parts of this thesis, and giving discussing feedback, earlier papers related to this as as well me very useful thesis. I would also like to thank Craig Macdonald, Ben He, and Christina Lioma for reading parts of this thesis, and for their support in the last stages of writing up. All the members of the Information it to place work. a great making All those with whom I had a wonderful time in the past and at present. Those include Areti, Anna, Allan, Harold, Fr. Eirinaios and Alexandra. My brother, Diamantis, for encouraging me to do a Ph. D. Finally, I would like to thank my parents, Pavlos and Niki, for their unconditional support. Their care and love made it possible for me to complete this work. Retrieval Group, past and present, for
PAGE NUMBERING AS ORIGINAL
Contents
1 Introduction 1.1 Introduction ................... ............... 1 1
1.2 1.3 1.4 2 Basic 2.1 2.2 2.3
................... Thesis statement ................ Thesis outline .................. Concepts Introduction Indexing Matching 2.3.1 2.3.2 2.3.3 of Information ............................ Retrieval
Motivation
............... ............... ...............
2 3 4 6 ...... ...... ...... ...... ...... 6 7 8 10 12 13 19 21 22 ... 22 22 23 24 26 27 28 i '??
..............................
............................. Best Match weighting models .............. Language modelling ................... Divergence From Randomness framework .......
...... ......
2.4 2.5 3
Evaluation
............................. About Web information retrieval Retrieval
................
......
Web Information 3.1 3.2 Introduction
............................... Differences between classical and Web information 3.2.1 Hypertext document model ..................
retrieval
....
... ... ... ... ... ... ...
3.2.2 3.2.3 3.2.4 3.3
Structure of the Web ...................... Quality of information on the Web .............. Background of Web users ...................
Web-specific sources of evidence .................... 3.3.1 Document and Web site structure ..............
CONTENTS
3.3.2 3.3.3 3.4
Hyperlink
structure analysis ..................... User interaction evidence ...................... for Web information of evidence
hyperlink
29 ) :3. .......... 35 36
37
Combination 3.4.1
3.4.2
retrieval
Extending Hyperlink
Implicit
analysis algorithms ..............

............
analysis with anchor text
3.4.3 3.4.4 3.5
Network-based models ........................ Combination of different retrieval techniques and representations ...................................
evaluation in Text REtrieval Conference ...... Experimental
38 39 43
13 .
Evaluation
3.5.1
3.5.2 3.6
Search engine evaluation ....................... Query classification and performance prediction 3.6.1
3.6.2
.15 ..............
combination of evidence
46 16 .
48 50
Identifying
Predicting
intentions user goals and

query performance
................
and dynamic
3.7
Summary
...................................
Retrieval
4.1 4.2 4.3
Approaches
for Selective
Web Information
Retrieval
51
51 52 .
Introduction Experimental Document
.................................. ............................. for Web information representations setting
retrieval
.........
5-1
4.3.1 4.3.2 4.3.3 4.3.4
Representing Web documents Parameter setting Evaluation results
....................
55 .56 61
..........................
.......................... Impact of query terms with high frequency on the Poisson-based models ................................. Discussion and Conclusions
64 ..................... 66 67 G
........ I)
4.3.5 4.4
Combining document fields 4.4.1

4.4.2
......................... Weighting models for field retrieval .................

Parameter setting for field-based weighting models
4.4.3 4.4.4 4.5
Evaluation of field-based weighting models ............. Discussion and conclusions ..................... evidence ......................... URLs of Web documents ...................... Hyperlink structure analysis .....................
72 73 74 ( 77
Query-independent 4.5.1 4.5.2
ii
CONTENTS
4.5.3 4.5.4 4.6
Evaluation
field of retrieval with query-independent
evidence
Summary and conclusions ...................... Obtaining a realistic parameter setting ................... Using mixed tasks Conclusions 4.6.2 4.6.3 .......................... Using mixed tasks and restricted optimisation
92 92 ........... retrieval 95 9 .... 97 99
4.6.1
4.7 4.8 5A
.............................. Potential improvements from selective Web information Summary ................................... for Selective Web Information Retrieval
framework 5.1 5.2 Introduction
103 ... .... .... .... .... .... .... .... ....... .... .... .... ...... ..... .... .... .... .... 103 104 108 109 110 110 112 115 116 117 122 122 125 126 128 130 ...... ...... ...... ...... ...... ...... ...... 130 131 131 132 135 136 136
............................... Selective retrieval as a statistical decision problem
5.3
........ 5.2.1 Selective Web information retrieval and related work ... 5.2.2 Decision mechanism with known states of nature ..... Retrieval score-independent experiments .............. 5.3.1 Document-level experiments ................. 5.3.2 Aggregate-level experiments ................. Retrieval score-dependent experiments ............... 5.4.1 Divergence between probability distributions 5.4.2 Usefulness of hyperlink structure ..............
5.4
5.5
Bayesian decision mechanism 5.5.1 5.5.2 5.5.3 Definition Application
.................... decision Bayesian the mechanism of decision Bayesian the mechanism of ......................
Density estimation
5.6 6
Summary
............................... Web Information Retrieval
Evaluation 6.1 6.2
of Selective
Introduction
............................ Evaluation methodology ..................... 6.2.1 Effectiveness of experiments E............. .................... 6.2.3 Presentation and analysis of results ........... Evaluation of score-independent experiments ......... 6.3.1 Document-level experiments ............... 6.2.2 Evaluation setting
6.3
iii
CONTENTS
6.4
Aggregate-level experiments ..................... 6.3.3 Conclusions .............................. Evaluation of score-dependent experiments ................ 6.3.2 6.4.1 6.4.2 Setting the score distribution Evaluation .................. based results of experiments on the usefulness of hySn
141 151 151 132
6.4.3
(S,, ) L U, perlink structure ..................... Evaluation results of experiments based on the usefulness of hyperlink structure L(S. , Ute) .....................
154
157
6.4.4 6.4.5 6.4.6 6.5
Example of the usefulness of hyperlink structure experiments Discussion Conclusions ............................... ..............................
..
159 161 162 163 16.1 . 165
Document sampling 6.5.1 6.5.2 6.5.3
............................. Revisiting the definition of experiments E............. Document sampling for score-independent document-level experiments .................................
Description of experimental setting and presentation of results
166
6.5.4
Document sampling for score-independent aggregate-level exper................................. Document sampling for score-dependent experiments ....... Discussion ............................... Conclusions .............................. based the on same weighting model ..... approaches retrieval iments 167 173 175 178 179 181 185 188 Infor190 .................................. information relevance ........................ .............. setting for limited relevance information 190 191 1!)1 ...... ..... 193 195
6.5.5 6.5.6 6.5.7 6.6 6.7 6.8 6.9 7 Using
Decision mechanism with more than two retrieval approaches ...... Discussion ................................... Summary ................................... Web Information Retrieval with Limited Relevance
Selective mation 7.1 7.2
Introduction Limited 7.2.1 7.2.2
Modelling limited relevance information Experimental
7.3
Evaluation
of experiments F_with limited relevance information
IV
CONTENTS
7.3.1 7.3.2 7.3.3
Score-independent experiments with limited relevance information Score-dependent experiments with limited relevance information Discussion and conclusions ............... ......
B. -) 196 198
7.4
Ad-hoc decision mechanism and query sampling ........

7.4.1 7.4.2 7.4.3 7.4.4 7.4.5 Ad-hoc decision mechanism ............... Query sampling ...................... Evaluation of query sampling .............. Evaluation of ad-hoc decision mechanism Conclusions ........................
......
...... ...... ......
199
200 201 204 210 214 215 217 217
217 218 222
.......
...... ...... ......
7.5 8
Summary
............................. and Future Work
Conclusions 8.1
Contributions
8.1.1 8.1.2
and conclusions ........................

............................. ..............................
Contributions Conclusions work
8.2
Future
..................................
A Parameter B Evaluation
settings
and evaluation E
of retrieval
approaches
225 236 280
of experiments
References
Al'
List of Figures
2.1 3.1 4.1 The architecture of a basic information Hubs and authorities bipartite as a retrieval system. ......... i 32
graph ..................
The obtained mean average precision (MAP) for different c values tested during the two-step optimisation of full text retrieval with PL2 for the topic sets tr2001, td2004, hp2004 and np2004 ................ The obtained mean average precision (MAP) for different c values tested
during the two-step optimisation of anchor text retrieval with PL2 for
58
4.2
the topic sets tr2001, td2004, hp2004 and np2004. 4.3 The monotonically decreasing transformation
............ for the URL path length,
60
4.4 4.5 4.6
for k,,, = 1,10 and 100 ............................. The Markov Chain representing the Web graph ............... The extended Markov Chain including the clone states. ......... The monotonically increasing transformation of the hyperlink structure analysis scores, for kL = 1,10 and 100....................
76 80 8-1
88
5.1
Selective application
for three states of nature of retrieval approaches
different loss The three s1, s2, s3 and retrieval approaches al, a2, a3. associated with applying retrieval approach ai when the true state of nature is sj is denoted by l (ai, sj) .......................... The hyperlink graphs of the ranked documents, corresponding to the first three cases described in the Example 6 ................. Example of a Bayesian decision mechanism with 3 available actions, the likelihoods loss the corresponding posterior and ..............
106
5.2
121
5.3
12.1
vi
LIST OF FIGURES
5.4
Box-and-whisker
document-level the plots of score-independent experi127,
for the task td2003 ment outcome values .................. 6.1 Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval %' from `+/of Table 6.2 ................. approach column Posterior likelihoods of the experiments E3(b) and 4(b) for the topic set hp2004 6.3 ..................................... Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual
approach 6.4 Histogram decision from column summarising mechanism `+/-
139
6.2
140
retrieval
1-14 of the
%' of Table 6.3 ................. the relative difference between the MAP of the most effective individual
and that
retrieval
6.5
%' from `+/Table 6.4 of column approach ................. Posterior likelihoods of the score-independent aggregate-level experiments
hp2004, for E3b, the -3b, the topic one of where set and avg(dir)i avg(dom) for is DLHFA PB2FU each selectively applied or retrieval approaches query. The posterior likelihoods for the domain directory the and based
147
6.6
diagram, 149 bottom the top, respectively. and aggregates are presented on . Density estimates of the usefulness of the hyperlink structure experidefault the to parameter setor whether an optimised ments, according ting is used ................................... Histogram summarising the relative differences between the MAP of the decision mechanism and that of the most effective individual retrieval %' 6.5 Table `+/from of column approach ................. Histogram summarising the relative differences between the MAP of the decision mechanism and that of the most effective individual retrieval %' 6.6 Table from +/of column approach ................. Posterior likelihoods of the score-dependent experiments for the topic DLHFP I(ne)C2FU the or retrieval approaches set td2003, where one of is selectively applied on a per-query basis. ............ .... 153
6.7
157
6.8
159
6.9
160
vii
LIST
OF FIGURES
6.10 Histogram summarising the relative differences between the MAP of the decision mechanism and that of the most effective individual retrieval .......................... 6.11 Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual approach from Table 6.8. retrieval 171 from Table 6.7. approach 168
.......................... 6.12 Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual approach from Table 6.9. retrieval
.......................... 6.13 Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual
approach 6.14 Histogram decision approach 6.15 Histogram from Table 6.10 summarising mechanism ........................... the relative difference
173
retrieval
175
between the MAP individual
of the
and that
of the most effective
retrieval 177
from Table 6.11 summarising
........................... the relative difference
between the MAP
of the
decision mechanism and that of the most effective individual %' Table 6.13 from `+/of column approach ................
retrieval 185
viii
List
Tables of
The formulae of the weighting models PL2, PB2, I(ne)C2, DLH, and BM25, respectively ............................... The search tasks and the corresponding topic sets from the TREC Web ...................................... The average length of documents for the different document representations in WT10g and GOV test collections. The document length corresponds to the number of indexed tokens for each document, after tracks 53
2.1
19
4.1
4.2
4.3
removing stop words .............................. The average length of relevant documents for the different topic sets, and for the different document representations. The document length corresponds to the number of indexed tokens for each document, after removing stop words ..............................
)9
61
with the weighting mod-
4.4
Evaluation
of different
document
representations
4.5
els PL2, PB2, I(ne)C2, DLH and BM25 ................... Mean Average Precision (MAP) for full text retrieval with the weighting models PL2 and PB2, when query terms with A>1 for are employed assigning weights to documents, or they are treated as stop words.
63
4.6
Evaluation
... best Web the to the tracks of official submitted runs results
66
from TREC-9 to TREC 2004 4.7 Evaluation BM25F 4.8
......................... I(ne)C2F, DLHF PL2F, PB2F, the and of weighting models field-based of retrieval with the
67
..................................... Evaluation results of the combinations
72
query-independent
length, from PageRank. URL the and evidence path ............................ :10
the Absorbing Model.
ix
LIST
OF TABLES
4.9
The evaluation of the field retrieval weighting models and their combination with the query-independent sets, and for the query-type for the mixed-type evidence query specific topic subsets. The task mg2003'
first 50 the topics to consists of a subset of mq2003, corresponds which for each type of task .............................. 4.10 The evaluation of the field retrieval weighting models and their combination with the query-independent for the mixed-type evidence query 94
sets and for the query-type specific topic subsets, with restricted optimisation.
4.11 Potential application trieval
The task mq2003' corresponds to a subset of mq2003, which 96

for improvements of two retrieval in retrieval approaches from the selective basis. The re-
for first 50 topics the each type of task ............. consists of

effectiveness on a per-query optimisation,
approaches
based on a restricted are
as reported that .. 100
in Table 4.10. The table displays improvements highest in the result
the pairs of retrieval in MAP
approaches
for the tested topic sets.
5.1 6.1
Notation
for the aggregate-level experiments.. examples
.........
114
The pairs of retrieval
decision Bayesian by the approaches employed E. The in the the columns experiments proposed evaluation of mechanism `First approach' and `Second approach' show the employed retrieval ap-
The brackets. for MAP task the their within corresponding proaches and by be MAP that `MAX' the seobtained can maximum shows column lectively basis 6.2 applying one of the two retrieval approaches on a per-query 134
6.3
....................................... Evaluation of score-independent document-level experiments F, 3(f) and Ey(f) for combination of fields f, which result in at least one decision boundary for each tested topic set ...................... Evaluation of score-independent aggregate-level experiments with dofor boundary decision tested least in each one at mains, which result
topic set. ...................................
138
143
6.4
Evaluation
tories,
direcof score-independent aggregate-level experiments with

result in at least one decision boundary for each tested 146
which
topic set.
...................................
LIST
OF TABLES
6.5
Evaluation
of score-dependent experiments based on estimating the use-
fulness of the hyperlink structure L(S,, Un), which result in at least one 6.6 decision boundary for each tested topic set ................. Evaluation of score-dependent experiments based on estimating the usefulness of the hyperlink structure L(S,, Uh), which result in at least one decision boundary for each tested topic set ................. The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corThe decision mechanism decision boundaries. of responding number 1T9 1-06
6.7
document level document experiments with employs score-independent (p15000 PL2F documents 5000 500 top and with ranked and sampling of I(ne)C2F pl500), and setting. 6.8 (in5000 and in500), using the default parameter 167
.................................... The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the correThe decision mechanism emboundaries. decision sponding number of
documents 500 5000 document top with ranked and sampling of ploys PL2F (p15000 and pl500), and I(ne)C2F (iri5000 and in500), using the default parameter setting. The experiments compute the average domain directory aggregate sizes.......................... or The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corredecision The boundaries. decision mechanism emsponding number of documents 500 5000 top document with ranked and sampling of ploys PL2F (p15000 and p1500), and I(ne)C2F (in5000 and in500), using the devithe The default parameter setting. standard experiments compute directory domain the aggregate sizes. ............. or ation of 1i'? 170
6.9
xi
LIST
OF TABLES
6.10 The relative difference between the MAP of a decision mechanism and
that of the most effective individual retrieval approach, and the correThe decision mechanism emsponding number of decision boundaries.
documents document 5000 500 top with and ranked sampling of ploys PL2F (p15000 and p1500), and I(ne)C2F (in5000 and in500), using the default parameter setting. The experiments compute the number of large domain or directory aggregates ........................ 6.11 The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corThe decision mechanism decision boundaries. responding number of 11-4
document the sampling of score-dependent experiments and employs 5000 and 500 top ranked documents with PL2F (p15000 and p1500), 176 default (in5000 in500), the I(ne)C2F setting. parameter and using and . 6.12 The number of times for which there is at least one decision bound('B>0'), ary ('+'), in improvements retrieval effectiveness or when the 180 Bayesian decision mechanism selectively applies retrieval
approaches,
field-based the weighting model. ............. same which use 6.13 Evaluation of the decision mechanism, which employs the retrieval apfor that BM25FP, the I(ne)C2FU, PL2FA, experiments and proaches identify at least one decision boundary for all the tested tasks, and refor least tested three in improvements in at retrieval effectiveness sult tasks.
7.1
.....................................
decision of a mechanism with known states of nature for
184
Evaluation mixed-type
queries ...............................
191
7.2
Evaluation
7.3
document-level the and aggregate-level score-independent of information limited relevance experiments with ............... inforlimited Evaluation of score-dependent experiments with relevance mation. .................................... Average and standard deviation for the length of the TREC 2003 and 2004 Web track queries ............................
197
198
7.4
206
xii
LIST
OF TABLES
7.5
Symmetric experiment
Jensen-Shannon outcome
(J-S) divergence between the distribution queries with
of
values for the generated
STS, MTS.
ATS and and the TREC mq2004).
2003 and 2004 Web track queries (mq2003 and EV(at), EV(at), F-V(b), are and ) avg(dom)" std(dom. deviation of the query length distribution in 207
The experiments
The mean and standard
MTS are denoted by p and a.........................
7.6
Evaluation of the ad-hoc decision mechanism with the experiments 'Ey(at)1

*Ed(at), avg(dom) ,
and y(b), std(dom)
........................
212
A. 1 Parameter values for retrieval from the full text, title,
headings, and
anchor text of documents, with the DFR weighting models PL2, PB2 and I(ne)C2, and the weighting model BM25 ................ A. 2 The values of the c parameters and the weights of the fields for the weighting models PL2F, PB2F and I(ne)C2 ................. A. 3 The weights of the anchor text and title fields for the weighting model DLHF ...................................... A. 4 The values of the parameters for the weighting model BM25F. ..... A. 5 Precision at 10 retrieved documents (P10) for field retrieval and combi................. A. 6 Mean reciprocal rank of the first retrieved relevant document (MRR1)
for field retrieval and combination with query-independent evidence. ..
226
227
228 228
nation with query-independent
evidence.
230
231
A. 7 Number of retrieved relevant documents for field retrieval and combination with query-independent evidence .................... 232 A. 8 The parameter values for the combination of the weighting models with ...................... A. 9 The values of the parameters and the weights of the fields for the weighting models PL2F, PB2F, I(ne)C2, DLHF and BM25F for training and different The for tasks. the evaluating with mixed parameter values used for 234 their corresponding subsets of tasks. tasks the mixed are ones used . the query-independent evidence. 233
XIII
LIST
OF TABLES
A. 10 The values of the parameters for the combination of each field retrieval weighting model and the query-independent for training evidence and for different The the tasks. evaluating with parameter values used mixed for their corresponding subsets of tasks. tasks the are ones used mixed The task mq2003' corresponds to a subset of mq2003, which consists of .................. A. 11 The values of the parameters and the weights of the fields for the weighting models PL2F, PB2F, I(ne)C2, DLHF and BM25F for training and The tasks, parameter and restricted optimisation. evaluating with mixed for for their corresponding the the tasks ones used are mixed values used to The task tasks. a subset of mq2003, corresponds mq2003' subsets of for first 50 task.. type topics the of each which consists of
A. 12 The values of the parameters weighting evaluating model for the combination and the query-independent
the first 50 topics for each type of task.
234
........
and
235
field each retrieval of
for training evidence optimisation.
with mixed tasks, and restricted
The parameter
for their for the tasks the corresponding ones used are mixed values used subsets of tasks. The task mq2003' corresponds to a subset of mq2003, ........ 235 239 240 242
for first task.. type 50 topics the of each which consists of B. 1 Evaluation B. 2 B. 3 B. 4 B. 5 B. 6
P7 ..
of experiments of experiments of experiments of experiments of experiments of experiments

of Pxnerimpnts -, _ -r---::
Evaluation Evaluation Evaluation Evaluation Evaluation

Pir 1natinn .. ....,,,..,..
&y(f) 1-3(f) and .................. EV(f), F-3(f), and avg(dom).. ....... avg(dom) EV(f), E3(f) and std(dom). ........ std(dom) E3(f), 1rg(do, ) and EV(f), lrg(darn).
........
24-1
245 2-17
249
E3(f),
F-V(f), and avg(dir). ......... avg(dir) EV(f), E3(f), and std(dir) .......... std(dir)
F7,: ENio> and \ ,_, Ilj 1,uTylu&i_') --- VU 1,., yk.,
..........
B. 8 Evaluation of experiments E3(f), L(su)pi and Ev(f), L(su)pj .......... B. 9 Evaluation of experiments F-3(f),L(sU);,, and F-y( f), L(su);,,.. ........ Ey(f), Evaluation B. 10 L(sUI)pj ......... of experiments'3(f), L(sU')p, and Eye F-3(f), Evaluation B. 11 f), L(su'); n. ........ of experiments L(sv');,, and
250 252 25`l 255
xiv
LIST
OF TABLES
B. 12 Evaluation
document-level the of score-independent and aggregate-level The table displays the
limited information. experiments with relevance
decision is trained and evaluevaluation results of a mechanism, which different ated with mixed tasks ........................ B. 13 Evaluation of the score-dependent experiments with limited relevance information. The table displays the evaluation results of a decision mech258 different is trained mixed tasks. .... and evaluated with anism, which 257
xv
Chapter
Introduction
1.1 Introduction
This thesis investigates the selective application of different approaches for information (IR) documents from (Web). World Wide Web The main argument of the retrieval with the thesis is that selective Web IR, a technique by means of which appropriate retrieval approaches are applied on a per-query basis, can lead to improvements in retrieval effectiveness. Two main issues are addressed. First, a range of retrieval approaches is evaluated for different test collections and search tasks, in order to establish the potential for improvements from selective Web IR. Second, a decision theoretical framework for selective Web IR is introduced and evaluated in both an optimal and a realist is limited setting, with relevance information. The advent of the Web and the resulting wide use of Web search engines has redevelopments in to combine and enhance classical IR techniques wit h suited a range of Web-specific evidence. Most of the proposed approaches in the literature uniform combination investigate it of evidence, which is applied for all queries. Recent works have
focused also on predicting the query difficulty, and proposing measures, which correlatee statistical features of the retrieved documents for a particular query with the perforThis thesis is focused on selectively applying the most effective
mance of a system.
retrieval approach on a per-query basis, in order to improve the retrieval effect ivel The evaluation of selective Web IR is performed with different search tasks, as defined in the TREC 2003 and 2004 Web tracks (Craswell & Hawking, 2004; Craswell et al., 2003).
1.2 Motivation
The remainder of the introduction
describes the motivation
for the work in this
thesis, presents the statement of its aims and contributions, of the structure for the remainder of the thesis.
and closes with an overview
1.2
Motivation
IR has been an active field of research for more than 30 years, starting as a need to search and locate information in the ever-growing body of scientific literature. While IR systems have always been useful in libraries, the advent of the Web made IR systems for tool an essential a wide range of people. Indeed, the Web was conceived as a virtiial information space, which would facilitate sharing of information At among scientists. the beginning, finding information on the Web was a matter of keeping a set of pointers
to interesting Web documents. However, as the number of Web documents grew rapidlv, this became impractical. The first IR systems for searching the Web, also known as (McBryan, 1994 search engines, appeared as early as Classical IR systems have been primarily information ments are rarely updated, ment, where new information become or unavailable. intentionally misleading. 1994). Today, there are several l. large general purpose search engines as well as a number of specialised search enginc. docuin used controlled settings where is considered to be reliable, and users are experts may be modified
in the field of search. In contrast, the Web is a highly diverse and dynamic environis published and existing information the available information In addition, may be erroneous or
The users that access the Web have a wide range of back-
have it impossible interests, that they to experience on assume making grounds and the topic they search for, or on how to use a search engine effectively. They tend to for(Jansen & Pooch, 2001; top the ranked results rnulate short queries and examine only Silverstein et al., 1999). Furthermore, the queries are not always about finding out in(2002) identified Broder topic. to a taxonomy of three main types a related finding information First, informational tasks Web tasks. are about search search of locating Second, tasks topic. are about search navigational and useful pointers about a formation a particular Web document, that a user has visited in the past. Third, buying products. resources, or
(visited on 17th Oc-
transactional
search tasks are about accessing particular
'A extensive list can be found at http: //www. searchenginewatch. com/links/ tober, 2005).
1.3 Thesis
statement
In addition, effectiveness documents'
the Web offers a range of evidence retrieval techniques, which
that
be can used to enhance the of the
of classical textual
are based on the analysis
content.
A key element to directly
of the Web is the hypertext
document
model, which enables documents These hyperlinks pointers analysis to other
documents reference other aids within Similarly
hyperlinks. with or as
can serve as navigational related Web documents. (Garfield,
a set of Web documents,
to work in the field of citation pointed to by many such
for scientific
journals
1972), a Web document as popular,
Web documents other as the popularity
be may considered
or authoritative.
Evidence
or authority by first
documents of retrieving
can be used to improve documents
the effectiveness Thus,
of a Web IR system,
relevant
higher of quality.
relevance is not replaced,
but only complemented.
Generally, the non-textual

are applied relevance them of documents (Croft,
evidence have been used in a static manner, where they

However, their weaker nature alternative in indicating t lie 2000) suggests that to the context incorporate to ways lead to can
for each query uniformly.
dynamically,
according
of documents
and queries,
improvements for estimating an IR system. properties
in retrieval the query Estimators
effectiveness. difficulty,
This thesis is also related to recent techniques predicting the performance of
and consequently
of the query difficulty (He & Ounis, (Yom-Tov
have been based on the statistical of query
of the query terms
2004), or on the co-occurrence et al., 2005).
terms in the retrieved
documents
1.3
Thesis statement
is this thesis that the retrieval of an appropriate retrieval effectiveness approach of an IR system can be basis. This
The statement enhanced
by applying
on a per-query
is investigated mechanism decision
in the context
framework of a retrieval
for selective Web IR, where a decision to apply on a per-query basis. The
selects appropriate performs
approaches
mechanism
an experiment
E, which extracts
features from a sample of E, it applies an in identifying is expected application the to of a
of the set of retrieved appropriate most result retrieval
documents, approach.
and according If the experiment then
to the outcome E is successful
appropriate in improved
retrieval retrieval
approaches, effectiveness,
the decision
mechanism
compared
to the uniform
single retrieval
approach.
1.4 Thesis
outline
The main contributions
following. this thesis the of are
A decision theoretical
framework for selective Web IR is introduced. information where relevance
The framework is evaluated in a setting,
is assumed to exist, and it is shown that it is possible to
from improvements in the selective application of different obtain retrieval effectiveness framework is The in the also evaluation of proposed performed retrieval approaches. limited information a setting where relevance techniques are introduced In this context, query sampling exists. in their to effectiveness setting and evaluated with respect
decision Moreover, a thorough evaluation of several retrieval mechanism. up an ad-hoc for Web IR different is test collections and search tasks. performed on approaches
1.4
Thesis
outline
The remainder of the thesis is organised in the following way. describes IR. It brief Chapter 2 the a main concepts of overview of provides a " in the including IR thesis, this those as experimental as well used models, series of evaluation of IR systems. IR. It Web detail in Chapter to 3 provides an overview of work related presents " hyperlink documents, for Web the document hypertext the and of model used be Web features how discusses the It Web. can used of particular structure of the issues It Web IR also reviews to enhance the retrieval performance of systems. identification IR Web the to the as systems, as well evaluation of related the prediction of query performance. and user goals, in for improvements Chapter 4 investigates the effectiveness retrieval potential " from selective Web IR. First, it examines the effectiveness of performing retrieval docuWeb different from of representations models of weighting with a range incoming headings, text the body, the title, the of anchor and ments, such as the fields, different discusses it Next, the hyperlinks. which correspond combination of (HT. \11. ). Language Markup HyperText the to the text within particular tags of The proposed approaches consider both the length normalisation and the weight further fields is documents fields. The enhanced with ing of the with retrieval of is The studapproach retrieval of each effectiveness evidence. query-independent for types its of search to ied with respect several optimal retrieval effectiveness of the
1.4 Thesis
outline
tasks. This chapter
also considers a realistic
setting,
where a restricted approach.
optimisaFinally, the from
tion for mixed search tasks is performed chapter applying establishes the potential
for each retrieval
for improvements
in retrieval
effectiveness
Web IR. selective
Web IR. First, it for framework Chapter introduces 5 the proselective applying " decision in description terms the statistical of selection mechanism of vides a theory. Then, the chapter defines a range of experiments, which aid the decision Finally, basis. to to apply on a per-query select a retrieval approach mechanism the chapter closes with the definition of an optimal Bayesian decision mechanism for the evaluation of the proposed experiments.
Chapter " IR. First, 6 presents the evaluation it employs the retrieval experiments of the proposed framework for selective Web 4, and evalu-
described approaches
in Chapter
ates the proposed described as
in the context
decision Bayesian mechanisin, of a Second, the charthe
in Chapter
5, with several types of search tasks.
ter investigates experiments.
the use of small samples of documents
in order to compute
be Web IR how 7 Chapter system a retrieval when applied can selective explores " has only limited relevance information available. This corresponds to training and The different tasks. decision automatic sets of mixed testing a mechanism with facilitate training in the to investigated, is order generation of query samples also decision mechanism. of an ad-hoc the Chapter 8 thesis this contributions with closes " lie for future t directions extending from this work, as well as possible work of IR. Web for framework selective proposed drawn the conclusions and
Chapter
Basic Concepts Retrieval

2.1 Introduction
Retrieval
Information of
Information
(IR) deals with the efficient storage and access of information items (Baeza-Yates & Ribeiro-Neto, 1999). The information items can be text doc-
uments, images, video, etc. A common scenario of the use of an IR system is the following: while performing a task, a user needs to locate information in a repository documents. of form in the need of a query, which bag to usually corresponds a of keywords. The user is only interested in the documents need. The ideal goal of an IR system is to return The user expresses an information
that are relevant to his information
all the relevant documents, while not retrieving any non-relevant ones. Furthermore, the retrieved documents should be ranked from the most relevant to the least relevant. The above process is iterative in the sense that a user can refine the initial query, or feedback to the system, which leads to the retrieval process being performed provide focused from This is thesis text and Web documents. again. on retrieval Automatically deciding whether a document is relevant to the information need, as well as the ambiguity of information need of a user is not a straight-forward a query for an information task, because of the inherent ambiguity in formulating in documents. where
This is a main difference between Information (Van Rijsbergen, verified
Retrieval and Data Retrieval,
the items to be retrieved must clearly satisfy a set of conditions, which can be easily 1979). The current chapter provides an overview of basic (Section in IR documents indexing 2.2), the matching of the regarding concepts of
2.2 Indexing
documents and queries (Section 2.3), and the evaluation IR (Section 2.4). of systems
2.2
Indexing
In order for an IR system to process queries from users, it is required to extract and store in an efficient way a representative for the documents to be searched. Creating the document representatives, or the document index, takes place in the indexing IR component of an system, as shown in Figure 2.1.
IR system
Documents
Indexing
L
Query
Data structures
Matching
Retrieved Documents
Figure 2.1: The architecture of a basic information
retrieval system.
The simplest approach is to represent a document by its composing terms. However,

document in the terms all a not document. the of carry the same amount information of about the topic a document Luhn (1958) proposed that the frequency of a term within its significance frequently very in the document. In addition, without
be indicate to can used of terms that particular appear
there is a number being related t o, a
in many documents,
topic.
Such terms are called stop words and they can be discarded A benefit document from ignoring during stop words indexing
during t lie
the indexing
process.
is that
size of the generated
index is reduced.
Another common type of lexical processing of terms during indexing is stemming. The purpose of stemming is to replace a term by its stem, so that different grammatical forms of words are represented in the same way. For example, if the terms -retriever'. `retrieval', and `retrieving' appear in the text, they can be represented by the common stem `retriev'. However, once these terms have been stemmed, any difference in their
2.3 Matching
lost. is meaning
A widely used stemming algorithm (1980). by Porter proposed indexing the which
for the English language was
Instead of indexing single terms, more complicated strategies can be adopted, in units are combinations of consecutive terms. For example, an information. IR system can index pairs of consecutive words, also known as bigrams (Manning & Schutze, 1999, ch. 6). The document index may also contain additional in in document, the terms terms the appear particular or a whether positions of such as fields of documents. For the purpose of this thesis, the documents are indexed using field information. frequencies their terms, and single The output of the indexing process is a set of data structures that enables the effidata The document the structure used most commonly representatives. cient access of is the inverted file (Frakes & Baeza-Yates, 1992), which stores the document identifiers Gendocuments. indexed from the the term that contain a particular vocabulary of document file is the inverted to that the collection. of comparable erally, the size of However, it can be reduced by using appropriate compression techniques, based on frequenidentifiers document term the integers the that and represent encoding the for Elias the bits. fewer The gamma encoding commonly used encodings are cies with identifiers, document the between differences unal-y and the a sequence of compressing These (Witten 1994). frequencies for term encodings al., et compressing encoding bit level, but and require many operations operate on a achieve very good compression, bytes in Other techniques decompressing. on operate for compressing and compression & (Williams Zobel, bytes handle hardware to order to exploit the optimised capacity of 1999).
2.3
Matching
is 2.1, Figure the in IR matching The second main component of an system, as shown A for documents query a submits user query. given that a of retrieves a set component Several the documents to models query. relevant to an IR system, which aims to retrieve (Below. Boolean The model have been developed for matching documents to queries. An Boolean exexpression. 2000). which is the oldest IR model, treats the query as a The (NOT AND AND is storage). Boolean information search query ample of such a documents that the contains for this all Boolean model retrieve query would particular
2.3 Matching
but do search, not contain the term storage. The documents are presented to the user as a set, without any particular ranking. This lack of ranking of the results has been one of the main points of criticism for the Boolean (Salton model et al., 1983). A different class of models is based on computing the similarity between the quer`documents. One such model is the vector space model (Salton & McGill, the and 1986), where both the queries and the documents are represented as in the saine vectors space. The number of the dimensions of the vector space corresponds to the size of the document index, or in other words, the number of distinct terms in the vocabulary of the documents. The retrieved documents are ranked according to their similarity to the query, which corresponds to the distance between points in the vector space. Several distance functions can be defined and used to measure the similarity (Van Rijsbergen, 1979). Another classical retrieval model is the probabilistic model (Robertson & Sparck Jones, 1976). This model is based on estimating the probability of relevance for a document, It given a query. assumes that there is some knowledge of the distribution of terms in the relevant documents and this distribution is refined through the iterative interaction (1979) Van Rijsbergen the with user. presents a decision theoretic interpretation probabilistic document is retrieved if the probability retrieval models, where a relevant to a given query is greater than the probability of Hie of being
the terms information,
of the document being non-
relevant. Through the definition of a loss function for the possible actions of retrieving or not retrieving an appropriate (Harter, model documents. a document, the number of retrieved documents can be adjusted in way.
A series of simple and effective IR models have been based on the 2-Poisson indexing 1975), which aims to assign a set of specialty or useful index terms to specialty The set of elite documents, which are indexed with a particular
term, would be the answer to a query consisting of that specialty term. The specialty terms are identified by means of their different distributions in the documents that do not have the eliteness property. different Poisson distributions. two as modelled 2-Poisson model with the probabilistic in the elite documents, and The two distributions are
Robertson et al. (1981) combined the
Section for described in 2.3.1. model retrieval, as families of IR models.
The remainder of the current section describes particular
Section 2.3.1 describes the family of Best Match (BM) models, which combines the
2.3 Matching
model with the 2-Poisson model. A different family of IR models. based on language modelling, is briefly discussed in Section 2.3.2. Section 2.3.3 presents the Divergence From Randomness (DFR) framework of IR models, which is based on a probabilistic generalisation of the 2-Poisson indexing model.
2.3.1
Best Match
weighting
models
(Robertson & Sparck Jones, 1976), the weight model of a term t in a document, assuming that the terms appear in documents independent Iv from each other, is computed as follows: (r + 0.5)/(R -r+0.5) log w(1) = (n-r+0.5)/(N-n-R+r+0.5) (2.1)
Starting from a basic probabilistic
R is the number of relevant documents, r is the number of relevant documents where that contain the query term t, N is the number of documents in the collection and n is the document frequency of the term t, or in other words the number of documents that contain the term t. When there is no relevance information, becomes (Croft & Harper, 1988): log N-n+0.5 n+0.5 which is similar to the inverse document frequency (idf): log the above weight w(l)
(2.2)
The above equations do not incorporate the within-document frequency of teriii . Robertson et al. (1981) modelled the within-document term frequencies with two Poisson distributions: one distribution for modelling the occurrences of the term t in the for modelling the occurrences of the term of a substanrelevant documents, and another distribution
t in the non-relevant documents. This approach leads to the introduction tial number of parameters that cannot be set in a straight-forward formula, has that a similar shape and properties. with a simple
For t his manner. They identified folir
(1994) & Robertson frequencies Walker the term approximated above model of reason, (b) (a) frequency be is the term the the zero, properties: weight should zero when (c) frequency increases, increase the term the weight weight should monotonically as should increase to an asymptotic sponds to the weight w(l). (d) this asymptotic maximum, and maximum correA formula that satisfies these properties is the following;: tf w(l) ki+ tf (2.3)
10
2.3 Matching
kl is where a parameter that controls the saturation of the term frequency tf. By incorporating the frequency qtf of a term t in the query, and a correction for t he length 1 of a document, Robertson and Walker derived the formula BM15 for computing the weight Wd, of a document for query: q wd,q Ew
tEq
_ -
q(k
tEq
tf
+tf i
qtf
k3+gtf"log
N-n+0.5
n-{-0.5
l-l )+k2"nq"l+l
(2.4)
where k3 controls the saturation of the term frequency in the query, l is the average document length in the collection, k2 is the weight of the document length correction, and nq is the number of terms in the query. In addition, they introduced BM11, a different version of the formula that normalises the term frequency with respect to the document length: wd,q
Further
. tEq
tf
l/l + tf
qtf
k3 + qtf
log
N-n+0.5
n+0.5
)+ k2 nq. " 1+1

formula, which
l-l
(2.5)
research
led to the introduction
of the BM25
is a corn-
bination
BM11 of and BM15, with the addition (k3 + 1) (Robertson et al., 1994): (kl+1)tf Wd, q -E(. (k3+1)gtf log
(k1 factors the + 1) and of scaling
N-r+0.5
(kl(1 -b)+bI)+tf tEq

then the formula
k3 + qtf
BM15
n+0.5
)+ k2 nq , ' 1+l
l-l
(2.6)
Indeed, BM11
if b=0, is obtained.
is obtained,
while if b=1,
then the formula length adjust-
In most of the reported
experiments,
the document
ment k2 " nq " + has been ignored by setting k2 = 0. In addition, when k3 is very large, k3+1)gtf then the component qtf. k3+qtf In the Formulae (2.4), (2.5), and (2.6), when the document frequency n> N/2, the resulting introduced log weight of a particular a modified so that document in term a query is negative. Fang et al. (2001) Nn+ 5.5 is log where replaced with In the remainder of this
formula, BM25 the version of weights
the computed is employed
are always positive. documents
thesis, when BM25 document frequency
for ranking
and a term with negative weight
a very high
in appears a query, any resulting to the weight of the document
is ignored aiici
it does not contribute
for the query.
11
2.3 Matching
2.3.2
Language modelling
based indexing 2-Poisson the on model make either explicit or models documents. However, Ponte in terms of This view led to the
The retrieval implicit
distribution the assumptions about
Croft (1998) suggested that it is preferable to use the available data, instead of making any parametric application distribution the assumptions about of terms. is data for IR. In language this generated model approach, a modelling of For a given query, the documents are ranked according to the that the corresponding document model generates the query. Ponte S;
for each document.
probability Croft (1998) treated the queries as a set of words with binary weights. The probability language document to the from the product corresponds model a query of generating of the probabilities probability
the the times terms the of product query each of of generating in do the that the terms query. appear not of not generating (1998) modelled the queries as a sequence of terms and computed the
Hiemstra
the to the of the of probabilities product according query generating of probability it is In language this document from the terms approach, model. generating the query in the do the that terms the and resulting to query, appear not consider not necessary (1998). & Croft Ponte is that than of simpler model In all language modelling approaches, there is the issue of assigning probabilities (1998) & Croft that. Ponte document. in do suggested to the terms that a not appear document. in does that term to a not appear it is harsh to assign a zero probability a language distribution the for the of For this reason, smoothing techniques probability (1998) the to Croft & Ponte probability been use have proposed employed. models document. in does it document a not appear collection, when that a term occurs in the interpolation linear based the (1998) on Hiemstra employed a smoothing approach A the document of from study model. the collection and model the probabilities of IR in for language 'as modelling the effectiveness of different smoothing techniques (2001). & Lafferty Zhai by conducted The ranking of documents according to the probability ideal is implies it there (2002). that because one only been criticised by Robertson has the query of generating
be to it relevance model is the used to not could document that and query, relevant introduction led has to the language of feedback. Further work with the modelling Lavrenko the to models. retrieval that probabilistic similar more are approaches
12
2.3 Matching
Croft
(2001) introduced
a language
modelling
approach,
where relevance is explicitly need of the user is
modelled.
The basic underlying
assumption
is that the information
described by a relevance language model. the probability
Then, the documents
are ranked according to In addition, Lafferty
that they generate the relevance language model.
& Zhai (2003) argued that the classical probabilistic equivalent from a probabilistic
model and the language models are estimation:
differ but in terms of statistical point of view, for a model relevant a model for relevant documents, queries,
the probabilistic while language
model estimates models estimate
based on a query,
based on a document. and introduced theoretic
Moreover, a language divergence
Lafferty
& Zhai (2001) employed framework, in which
Bayesian decision theory they estimated
modelling
the information
between the document
language models and the query language models.
2.3.3
Amati
Divergence
From
Randomness
framework
(2003) introduced the Divergence From
& Van Rijsbergen (DFR) models.
(2002) and Amati
Randomness IR weighting informative
framework A central
as a generalisation
for ing 2-Poisson the general model of is that a term is more model that predicts a
framework DFR the concept of does not fit the probabilistic
distribution its when
function is document in t The term of term. a the a a of weight of occurrence random two probabilities: w= (1 (t Et)) "(Probe loge Probl (t Collection)) (2.7)
defined is documents, for Et as the which In the above equation elite set of stands is tf the t term the documents observed within-document that and the set of contain frequency of t. 2.3.3.1 Randomness models in Equation (2.7) corresponds to the infor-
The component (- loge Probl (tfjCollection))
document in frequency tf a term that the with appears a probability of content mative terni If that the a probability by chance, according to a given model of randomness. high, is (t Collection)) the Probt loge low, and is document then in tf times a occurs that There can be informative. models randomness to is several are term considered be used to compute the probability Probl.
13
2.3 Matching
according to a binomial model, then the probability of observing tf occurrences of a term in a document is given by the probability of tf successes in a sequence of F Bernoulli trials with N possible outcomes:
Probet ( Collection .) (FF = tf tf F-tf P4' (2. )
If the occurrences of a term are distributed
F is frequency the where of a term in a collection of N documents, p=1 N and q= 1- p. The informative content of this probability corresponds to (tu loge Probl Collection). If the maximum likelihood A=N of the frequency of a term in t he can be used In this case, the informative
estimator
collection is low, or in other words FN, to approximate the binomial content of Probl is given as follows: (t A (, lo Prob, g2 )= tf " 10 92 A+ g2
then the Poisson distribution
described model above.
(A
-0"
1092e+0.5
(27r fl 1092 t" " "
(2.9)
The Poisson model is denoted by P. Another approximation binomial the of model is obtained by using the information factorials. In this case, theoretic divergence D and Stirling's formula of approximating the informative follows: (Probe (t, Collection) = F. -1og2 (D(cb, (27r. (1 0))) 0. +0.51092 p) (2.10)
having tf occurrences of a term in a document is given as content of
(O, 0=F, D Kullback-Leibler is the p) where p=N, and loge p. This model is denoted by D. Starting informative from the geometric distribution, content of the probability a tf-idf
divergence of 0 from p:
is model generated, where the
that there are tf occurrences of a term in it
document is given by: (tfjCollection) Probl -1092 loge tf = " N+1 n+0.5 (2.11)
document This frequency in document is term the the the collection. model of where n is denoted by I(n). Alternatively, the document frequency n can be replaced with the follows: binomial law, by frequency is document the as given rye, which expected ne = N1(F)pOqF_O) N_ =N1- 1F (2.12)
14
2.3 Matching
In this case, the informative is given by:
content of having tf occurrences of a term in a document N +01 loge tf = " 5 n, (2.13)
(tACollection) loge Probi This model is denoted by I(ne). 2.3.3.2 Aftereffect of sampling
In the basic Equation
(2.7) of the DFR framework, the component 1-
corresponds to the information
Prob2(t l ,) gain obtained by considering a term to be informative for this document and the probability that t his
for a document. If a term appears with a high frequency in a document, then it is almost is informative this that term certain term occurs more times in the same document is high. At the same time, when a term document, information frequently in the associated a appears the component document. 1Prob2(tiEt) lower. is Therefore, gain adjusts the importance of a term with respect to a One model for computing Prob2 is the Laplace model (denoted by L), probability having one more occurrence of a of
which corresponds to the conditional
term in a document, where the term appears tf times already: 1- Prob2(tlEt)

Another for computing model
tf =1-1+ tf
1 1+ tf
model (denoted
(2.14)
by B), which is
Prob2 is the Bernoulli
defined as the ratio of two binomial distributions: 1- Prob2(tfl Et) =F (tf + 1) n, (2.15)
2.3.3.3
Document
length
normalisation
(2.7), Equation document the term in term the weight of a with a frequency tf can be normalised with respect to the length of the document. The lengt h (2003) Amati indexed tokens. the document to number of simply corresponds of the frequency to function term density decreasing the respect with of normalised assumed a Before computing is formula, following derived length document the called normalisation which the and
:
On = tf " log2(l +c.
/l)
(2.16)
15
2.3 Matching
is frequency. 1 document length, is l is the average tfn the term the where normalised document length in the document collection and c is a hyper-parameter. Equation (2.16) becomes: tfra = tf " 1og2(1 + l/l) is it called normalisation and 1. c has an impact on the retrieval effectiveness of 2, and it is collection-dependent. In (2.17) If c=1, then
The setting of the hyper-parameter
the DFR weighting models that use normalisation
(2003) & Ounis defined He dependency, the to tackle the of collection problem order frequency hyper-parameters function to the term the related of normalisation effect as a normalisation. The normalisation for a particular effect corresponding to the optimal setting of the depends task only on the type of the task and search
hyper-parameters
the type of the queries (i. e. short or long queries). Then, for a similar type of search task and for a similar type of queries, the hyper-parameters are set so that they restalt in the same normalisation
presented in (He & Ounis, the hyper-parameters
effect.
2005b).
A refinement of the normalisation

In addition, by measuring
been has effect

to so of
He & Ounis (2005a) proposed
frequency term of
normalisation
the correlation
frequencies. lengths term document the the and normalised In the remainder of this thesis, the hyper-parameter of normalisation 2, as well as directly is the that optimised. effectiveness retrieval so set are any other parameters, 56. Section 4.3.2, in discussed page The details of this optimisation process are 2.3.3.4
A DFR
Divergence
document
From
Randomness
weighting
models
of a randomness
weighting
model is generated
from a combination
(2.7), Equation in (t Collection) Probi loge model aftereffect an for computing model For frequency exammodel. for computing 1- Prob2(tMEt), normalisation and a term is (P), information distribution the Poisson is gain the if the model randomness ple, frequencies (L), term the according Laplace adjusted are the and model computed with PL2. The is DFR of wd, weight 2, called the then model to normalisation resulting y the the quel v the of to of each for weights d of document sum corresponds query q terms. The formula by is PL2 combining given of Equations (2.7), (2.9). (2.14). and
(2.16):
Wd,q 1:
tEq
qtfn -1 tfn +1
(t fn
loge "
tfn
+ ( - tfn) " 1092 e+0.5
log2(27r tfn)) "
(2.1
16
2.3 Matching
is the frequency of the term t in the query, where gtfr = qtf is he and t qtf,,,, , 9ttmax maximum frequency of any term in the query. If the Poisson randomness model (P) for computing Probl(tACollection) is combined with the Bernoulli the following: Wd, q= 1:
tEq
(B) for computing model
Prob2(tflEt)
and normalisation
for term frequency normalisation,
then the resulting model is PB2 and its formula is
gtfn .F+1 (tfn n-
1)
(tfn
loge "
tfn
+ (A - tfn) " loge e+0.5
(27r 1092 tfn)) " . (2.19)
models can be generated from different combinations of basic models. The DFR model I(ne)B2 is generated from the inverse expected document frequency I(ne) for (tCollection), Prob, model computing Prob2(tflEt), and normalisation wd,q = A modification 1: tEq the Bernoulli model (B) for computing 2. The formula of the model I(ne)B2 is the following: F+ gtfn " 1 (tfn loge " N+1) rye+0.5 (2.20)
Additional
n-(tfn+1)
I(ne)B2 is of generated if natural logarithms are used instead of logarithms base 2 in Equation (2.20). The resulting model is denoted by I(ne)C2 and its formula is the following: Wd, q -E gtff "1 n/ (tfne
itfrye+
+051 -Inn
e
tEq
1)
(2.21)
(l/l)). ln(1 tfrce tf +cC. where = " The four DFR models that are shown above, PL2, PB2, I(ne)B2 and I(ne)C2, employ normalisation 2, which introduces the only hyper-parameter This hyper-parameter by using relevance information. be to required set can be set either by measuring by directly effect, or op-
a collection independent quantity, timising hyper-geometric the generated with document length normalisation
such as the normalisation
the retrieval effectiveness. Another interesting DFR weighting model can be randomness model, which naturally incorporates a 2 is not needed, component. In this case, normalisation
and all the variables of the weighting model are computed from the collection statistics.
17
2.3 Matching
This model is denoted by DLH and its formula is the following:
Wd, q-
1 gtf n-tf+0.5
tEq
(log2(tf.
) + F,
+(l-tf)log2(1-
l)+
)) (2.22)
+0.51092 (27rtf(1 -l
Overall, the DFR framework provides an elegant and general way to generate IR from basic probabilistic models. Similarly to the generation models of the retrieval models, the DFR framework can be used to introduce weighting models for performing automatic query expansion, as discussed in Section 7.4.2.2, page 202. Amati & Van Rijsbergen (2002) described a theoretically motivated derivation of BM25 within the DFR framework, where the resulting formula has an additional cornponent compared to the original one. Regarding the relationship between the DFR framework (2006) argued that the DFR weighting model DLH and language modelling are generated from the same probability space, but Amati represent a frequentist and a Bayesian approach to the IR inference problem, respectively. Moreover, recent large-scale evaluations of several DFR weighting models and language modelling have shown that they result in similar retrieval effectiveness (Clarke et al., 2004). For these reasons, the employed weighting models in the remainder of the thesis are mostly based on the DFR framework, and not on language modelling. The evaluation of selective Web IR in the subsequent chapters, can be performed with any retrieval model. For the purpose of this thesis, five weighting models are used. More specifically, the employed models are the DFR weighting models PL2, I(ne)C2, PB2 and DLH, as well as the classical BM25. For ease of reference, the corresponding formulae are given in Table 2.1. These weighting models have been selected for several reasons. The weighting models PL2 and I(ne)C2 are robust and perform well across a (Plachouras & Ounis, 2004; Plachouras, He & Ounis. 2004). The tasks range of search PB2 is in weighting model selected order to test the combination Poisson the of ranand language modelling,
domness model with the Bernoulli model for the after-effect. The weighting model DLH is particularly The weighting literature. interesting, because it does not have any associated hyper-parameter. model BM25 is employed, because it has been frequently used in the independent, as it will be
The employed weighting models are statistically
by Chapter in 4. the evaluation confirmed results
18
2.4 Evaluation
PL2 PB2
1(ne)Ci2
t7l (tfn >tE (A loge loge 1092(2-7r tfn) + tfn)) Wd, gtfn e+0.5 = "t " " q f" 1 a n+i (tfn EtEq (A 10 1092 log2(27r tfn) tfn)) Wd,q = qty 92+ e+0.5 " " nt In tfne
wd, q = tEq gtfn "ntn i-1 "
DLH
BM25
wd ,q=
Wd, q =
tEq >tEq
gtfn
(1og2(. L-l
tf+o. i+l ((1-b)+b 5 tk +tf
n e+0.5
F) + (1 t 1og2(1 -)+0.51og2
log i-9 ft Nn+0.5 5) + k2 " nq " 1-1
(27rtf(1
1))
(kl
Table 2.1: The formulae of the weighting models PL2, PB2, I(ne)C2, DLH, and BM25, respectively.
2.4
Evaluation
There are several IR models, based on different assumptions, or on combinations of theory and experimental data, as discussed in the previous section. A natural question that arises is how to evaluate and compare the different IR models. This has been an important issue that has attracted the interest of researchers from the early stages of
IR.
The evaluation of the retrieval effectiveness of IR models has been based on nwadocuments, test a collections, which consist of a set of suring precision and recall on is defined Precision the topics as number of and a set of relevance assessments. set of for documents documents total the a parnumber of retrieved over retrieved relevant documents the defined is Recall the over titular topic. relevant number of retrieved as total number of relevant documents for a particular topic. The relevance assessments documents are relevant to a particular specify which topic. This approach was introduced in the Cranfield experiments, where the size of the test collections allowed the (Cleverdon, However, 1997). for document topics the all complete assessment of each increased, handle the is to IR documents that expected system an as the number of for 1, impractical became t documents e, and other approaches complete assessment of all generation of relevance assessments were needed. (TREC), Conference the relevance assessments REtrieval Text In the context of the Sparkfrom idea developed (Harman, 1993), technique of an based a on pooling are Jones & Van Rijsbergen (1976). The output of a set of IR systems is used to generate from documents top by for taking topic, documents ranked of a number each a pool of it is IR the In the systems, of to effectiveness retrieval compare order each system. the instead in documents all the of assessing the generated pool, sufficient to assess t(,,, t the (2001) has the that Blair However, of size for as documents pointed relevance.
19
2.4 Evaluation
collections increases, the computed recall with pooling does not correspond to the real one, because only a small fraction of the documents are examined for relevance. In such a case, the recall maybe artificially
ing (Harman, against 1993).
boosted. TRECs were ad-hoc search and routmatching an unknown set of topics a known various task involves matching TRECs introduced
The evaluated search tasks in the initial

a known documents, set of a stream
The ad-hoc search involves
while the routing
set of topics
against
of documents.
Subsequent
tracks in order to evaluate lection (VLC)
different
search tasks. from ran
For example,
the Very Large Colto the
Web tracks, and
which
1997 to 2004, were dedicated
evaluation Hawking TREC TREC
of IR systems with & Craswell
Web test collections a comprehensive
for ad-hoc and Web search tasks'. presentation both tracks until of of the Web track in
(2005) provide
2003, and Craswell
& Hawking
(2004) give an overview
2004. More details about the tracks are provided
in Section 3.5.1, page 43.
There are several different measures that can be used to evaluate IR models with Precision to respect precision and recall. and recall are complementary fixed is if the reported at recall points. reported, or precision Average precision corresponds to the average of the precision after each relevant document is retrieved. For example, if an IR system retrieves three relevant documents for a topic, at ranks 2,4 and 10, the average precision of the system for this particular -) (2 defined is R-Precision 0.4333. the precision topic is computed as +2+l as = 3 R have been R documents corresponds to the number of relevant retrieved, where after documents for a particular query. For high precision search tasks, where there are few relevant documents, and it is to retrieve a relevant document at the top ranks, an evaluation measure that is commonly used is the reciprocal rank of the first retrieved relevant document. important If there is only one relevant document for a query, then this measure is equivalent to (Pn), documents Another is where precision at n retrieved measure average precision. fixed is a number. n This measure depends only on the number of relevant retrieved documents among the top n retrieved documents, and not on their ranking. The concepts; (lie different IR be if both made comparison of models can only precision and recall are
by is the the topics of mean employing performed of over a set of systems comparison (MAP), leading described to mean mean average precision evaluation measures, above (MRR1), first document the and mean precision at retrieved relevant reciprocal rank of
20
2.5 About
Web information
retrieval
(Sn) documents documents. In correspond addition, success at n retrieved n retrieved to the percentage of topics, for which a system retrieves at least one relevant document among the top n ranked ones. Van Rijsbergen (1979) provided a comprehensive discussion on the evaluation of IR be More details that the will evaluation measures about systems and related measures. Section 61. be in 4.3.3, in thesis this the page given will subsequent chapters of used
2.5
About
Web information
retrieval
While classical information
dealt have test sized colreasonably with systems retrieval lections, and a variety of search tasks, involving ad-hoc and routing (Harman, 1993), as information, Web (Lewis, 1996), the filtering the of repository as a vast advent of well as larger is Web, is the One the has posed new challenges. which size of such challenge MoreIR in been has that document test collection experiments. used than any ever hypertext the over, of evidence that can document model used for Web documents offers several sources be exploited to enhance the retrieval effectiveness of IR systeiiis.
in detail in discussed the issues, next These challenges, as well as other related are chapter.
21
Chapter
Web Information
3.1 Introduction
Retrieval
The Web can be considered as a large-scale document collection, for which classical text retrieval techniques can be applied. However, its unique features and structure offer new sources of evidence that can be used to enhance the effectiveness of IR systems. Generally, Web IR examines the combination of evidence from both the textual content documents of and the structure of the Web, as well as the search behaviour of users, and issues related to the evaluation of retrieval effectiveness. This chapter presents an overview of Web IR. It discusses the differences between (Section IR Web IR 3.2), a range of Web specific sources of evidence classical and (Section 3.3), and the combination of evidence in the context of Web IR (Section 3A). This chapter also provides a brief overview of work on the evaluation of Web IR systems (Section 3.5), as well as on query classification and performance prediction (Section 3.6).
3.2
Differences trieval
between
information Web classical and
re-
Classical IR systems have been often developed and used in the context of a controlled environment, library, document a specific group of and such as with a users a collection different environment for
However, Web the of moderate size. represents a substantially
IR systems. These differences are discussed with respect to three aspects: the hypertext document model (Section 3.2.1), the size and structure of the Web (Section 3.2.2), the information of quality (Section 3.2.4). on the Web (Section 3.2.3). and the background of Web users
22
3.2 Differences
between
classical
and Web information
retrieval
3.2.1
Hypertext
document
model
The Web is based on a hypertext document model, where the documents are connected directed hyperlinks. This results in a virtual network of documents. Hypertext with (1945) by Bush was envisioned as a more natural way to organise, store and search for information, similar to the associative way in which the human mind works. A reader approaches a text by reading and understanding small sections of it, while discovering the connections between the exposed concepts in the text. The hypertext aids this by process making the connections between parts of the text explicit (Levy, 1995). In facilitates it the reading of texts in non-linear ways, similarly addition, (Belew, books indices in 2000). the table the such as of contents, or
The hyperlinks noted the importance in hypertext of making systems can have explicit the type of links explicit. types.
to structures,
(1983) first hypertext
Trigg
In his proposed
model, the links are divided commentary extensive of links, links.
in two broad classes: internal divided
substance links, and external in subclasses, leading to an two main types type of while the
These two classes are further of link types. Similarly,
taxonomy namely
Baron
(1996) identified links.
the organisational
and the content-based
The former documents,
links was used to organise and help navigation latter type was used for pointing in scientific references a set of typed links. to documents publications, The HyperText
hypertext among on similar topics.
However, as with bibsystems do not usually (HTML) (Raggett for
liographic come with
some hypertext Markup
Language
1999), et al., which defining typed links,
is used to write
Web documents, is optional
provides
some functionality
but this mechanism
it is and not used consistently. task, because it requires underdocuments. Differently from
The automatic standing identifying
inference
difficult is link type the a of
the context
destination both the source and of Allan linking
the type of hyperlinks, documents. After
(1996) investigated
the automatic the similarity
typed linkof which links. A
ing of related
documents, all pairs of
exceeds a threshold,
the resulting
is simplified graph links, according
by iteratively
merging
type is assigned to the resulting
to a predefined
taxonomy.
by to navigate a user allowing search process hyperlinks Navigating following hyperlinks. through by document through the space Hypertext alters the information However, documents. hypertext the for be number as of small collections sufficient may heterogeneous is increases, documents sets of allowed across or when navigation of
23
3.2 Differences
between
classical
and Web information
retrieval
hypertext
documents, users may not be able to locate information by following merely links, but instead, they may find themselves lost in hyperspace (Bruza, 1990; Guinan One aspect of this problem can be addressed by applying IR techniques to search for information, or locate starting points for browsing in hypertext documents. & Smeaton, 1992; Halasz, 1987).
3.2.2
Structure
of the Web
of information, the size of which is increasing continuously. 1997 to be that the
The Web is a vast repository Bharat & Broder
(1998) estimated
the size of the static Web in November Lawrence 800 million that & Giles documents (1999) reported in February
approximately indexable recently,
200 million
documents.
part of the Web was about Gulli & Signorini
1999. More
(2005) reported
the indexable
Web has more than available part of the (2001) Web
11.5 billion
documents.
All these estimates
refer to the publicly
Web, which is indexed by search engines. However, Raghavan estimated that even more information the hidden
& Garcia-Molina
is stored in databases or access-restricted
sites, composing
Web, which cannot be easily indexed by search engines.
In order to study its topology, the Web can be seen as a directed graph 9(V, E), documents, E Web V the the the and set of edges represents set of vertices where links documents. between Web The hypertext links' that the of number represents links from d, d document is iridegree the that the to start number of of while a point document d is the outdegree of d. The sum of the iradegree and the outdegree of a document d is called the degree of d. Generally, complex interconnected systems can he Initially, the research area of random graphs was explored graphs. as random modelled by Erds & Renyi (1959), who proposed the random graph model Sm,t. This model describes graphs with m nodes, where a link exists between two randomly selected nodes between Ili distance have Random t. vertices. a short average graphs with probability follow Poisson indegrees the a vertices of random graphs and outdegrees of addition, the distribution. the Web seems to be chaotic and to lack structure, because there is is its that information, to topology the to of similar available organise no single entity Recent display in self-organising principles. nature, which many other complex systems Although does interconnected has the topology that systems item of many complex shown research
'Hereafter, the terms hypertext link, hyperlink, and link will be used interchangeably.
24
3.2 Differences
between
classical
and Web information
retrieval
fit the one predicted by the random graph model 9,,,,, Therefore, new graph models t. have been proposed to study these networks. First, Watts & Strogatz (1998) proposed free a model with one parameter. The Watts-Strogatz from an ordered finite-dimensional (WS) model is based on starting one lattice and changing with a given probability
of the vertices connected by each edge. The WS model produces a range of graphs, between the extremes of an ordered finite-dimensional lattice and a random graph. It likely the to of are captures properties small world social networks, where people more know their neighbours than a random person that lives far away. The graphs general vd by the WS model have a short average path length between vertices, similarly to random graphs. In addition, they have a high clustering triplets coefficient, which corresponds to
the fraction
of transitive
of vertices, compared to random graphs generated
between (1999) length Albert by Sm,, that the any average path estimated et al. t. two documents on the Web is 19 links (the estimated size of which was 800 million documents at the time). Adamic (2001) reported that the clustering coefficient of the higher has Web different hyperlinks by a significantly sites across using graph generated (0.081 0.00105). than that vs. of graphs random clustering coefficient
The degree distribution the random graphs of graphs generated additional by the WS model is similar evidence obtained to that of However, 9,,,,,, t. from the analysis of & Albert, hyperlinks that
the Web showed that the degree distribution 1999). In other words, the probability is proportional the distribution exponents
follows a power law (Barabsi having k incoming
of a Web document
to k-7, where -y is a positive constant.
Broder et al. (2000) reported
documents Web indegrees and outdegrees of of In addition,
follow power laws with the number of pages
2.72, 2.10 respectively. and -you, 'yi,,, = t = of visitors
in a site, as well as the number 2001). routers Faloutsos follow ) (n. d. et al.
to a site follow similar that
(Adarnic, laws power between within Internet
identified also
the connections
law. a power documents of
Pennock
(2002) that observed al. et
some online
communities deviate may
distribution Web, the the on and roughly
of indegrees and outdegrees distribution.
from a power-law
follow a log-normal
have been observed in many highly complex networks aris(1999) Albert the & Barabsi origins human attributed ing in nature and communities. Power law distributions attachto two in laws and preferential growth mechanisms: networks complex of power introduction the of First, with grow continuously networks complex most real ment.
25
3.2 Differences
between
classical
and Web information
retrieval
Second, in most real networks, the likelihood of connecting to a node denew nodes. pends on the degree of the node. The nodes, which are linked to by many other nodes. are more likely to get a higher number of new links. Albert & Barabsi (2002) provided an extensive survey on complex networks, and Barabsi (2002) presented the historical background of studying complex networks, as well as related applications. Broder et al. (2000) have studied the distribution of connected components of the Web, identifying that the Web graph consists mainly of four parts. The first is a large strongly connected component of Web sites, where it is possible to navigate between any two Web sites. The second part consists of documents that point to the lai g connected component, while the third part consists of documents that are pointed by documents in the connected component. The last part consists of the rest of the other documents on the Web. Overall, the estimation of the size of the Web and the analysis of its structure, are very interesting issues for two main reasons. Search engines have to collect, or crawl the documents from the Web by following hyperlinks (Brin & Page, 1998; Heydon & Najork, 1999), differently from classical IR systems, where the documents are often readily Both the size and the structure of the Web may also be
Web Therefore, the the can enhance the effectiveness properties of provided. studying of crawling Web documents. Section be described in 3.3. it to enhance retrieval effectiveness, as will used
3.2.3
Quality
information of
Web the on
Classical IR systems have been often used in controlled environments, where documents different Web However, is the information that a quite rarely changes. contain reliable documents. Web be the quality of made about environment, where no assumption can The information available on the Web is very different from the information contained in either libraries or classical IR collections. A large amount of information on (Bharat different is duplicated, Web is the sites often mirrored across many and content & Broder, 1999; Shivakumar & Garcia-Molina, the information 1998). This redundancy ensures that
is always available, even when some of the mirrors are out of service. However, search engines and IR systems need to take into account the duplication of Web documents, in order to reduce the required resources for crawling Web documents Mirin documents to duplicate Web the users. results presented and to avoid returning documents. duplicates the in does documents Web of as exact not always result roring
26
3.2 Differences
between
classical
and Web information
retrieval
formatting
may change, or dates on the pages may be updated.
In such cases, it i5
(Bernstein Scalable detect duplicate duplicate to techniques or near pages. necessary & Zobel, 2004; Bharat & Broder, 1999) to do this are based on fingerprinting documents, parts of small fingerprint between the the generated overlap and comparing s. fingerprii1ts the the generated selection of and
The differences between the proposed techniques are mainly due to the selection of the documents fingerprint, to the parts of to compare between documents. the contents of Web pages are not guaranteed to be due information, false inaccurate to Web Indeed, unor contain pages may accurate. intentional errors by their authors, or due to intentional efforts to mislead users in information duplication issues Both the of and quality of visiting a particular website. In addition to duplication, hypertext Web in in the than the the syscase of classical case of are more significant tems (Spertus, 1997), or other types of document corpora, such as newswire articles and scientific publications.
3.2.4
Background
Web users of
be Therefore, to can The Web is an open system accessible no assumption anyone. (1993) Hsieh-Yee literacy. made about the users' expertise, experience or computer in behaviour a in searchers differences the experienced and of novice search reported Studies of query logs from Web search engines showed that the documents browse top the and ranked only the queries, short users provide majority of (2001) & Pooch Jansen (Silverstein 1999). do not reformulate the original query et al., behaviour studies. search of user review a comprehensive provided (2002) has identified taxonBroder types. a Users perform search tasks of varying IR setting. classical Web: informational, and the navigational, tasks types three on of search main omy of for information looking partica about informational tasks, In are users transactional. document Web they in interested a tasks, In viewing are topic. users navigational ular do to location, they its navigate do want not but or have seen before, not remember ion in transact interested making a back to that page. In transactional tasks, users are (2004) extended Broder's taxonoiiiy & Levinson Rose or obtaining an online resource. tasks'. transactional, search informational or resource, by providing sub-types of and
27
3.3 Web-specific
sources of evidence
3.3
Web-specific
sources of evidence
Web IR can exploit a range of sources of evidence, in addition to the textual content documents. For example, evidence from the document structure, or the structure of of the hyperlinks among documents, can be used to enhance retrieval effectiveness. This section presents an overview of the different sources of evidence that can be used for Web IR.
3.3.1
Document
and Web site structure

in the sense that HTML offers basic structuring Web In addition, bold
Web documents are semi-structured capabilities
to the authors, even though the use of such capabilities is optional.
documents may have titles and headings for improved readability. Evidence about the formatting
be document. italic in typefaces to the or can used order emphasise specific parts of has document been by the the used and structure of
indication importance the of the text that appears of commercial search engines as an (1998) described & Brin Page For that changes example, with additional visual cues. in the relative size, or the colour of the text, were stored in the index of an early version Google the search engine. of The hypertext document model and the Web encourage authors to organise doc iWeb in Web different Documents in the sites, where are grouped on ways. several ments Within topics. documents topic, the or a series of related cover either a specific most of Web sites, documents are usually organised in a hierarchical directory structure. There have been several efforts towards the automatic identification of aggregates of hypertext, or Web documents. Botafogo & Shneiderman (1991) employed a graph theoretical & McCurley Eiron documents. hypertext in identify in aggregates order to approach (2003 a), and Li et al. (2000), defined heuristics based on observations of the structure been has domain documents their Grouping to Web employed also according sites. of (Kvwok from documents limit in order to the redundancy of retrieving many a given site IR the In 2002). corresponds commonly most unit retrieval systems, classical et al., However, item. this document, or a news to a whole such as a scientific publication is not necessarily the case for hypertext, where a document may correspond to several hypertext (1998) For Tajima statically et al. example, nodes. identified retrieval units that the those to all contain subgraphs retrieve as connected subgraphs. and proposed
28
3.3 Web-specific
sources of evidence
query terms.
(1999) et al. also identified the retrieval units from the set of retrieved documents for a query. Tajima The URL of Web documents can be effectively used to detect documents that are likely to be home pages of Web sites. Westerveld et al. (2001) and Kraaij (2002) et al. identified four types of URLs for Web pages: domain " root: a name. For example, such a URL would be http: //ir. dcs.gla. ac.uk/. domain " subroot: a name followed by a single directory. For example, such a UR.L //ir. be http: dcs. would gla. ac.uk/terrier/. " path: a domain name followed by a directory path of arbitrary //ir. URL be http: dcs. such a ple, would gla. ac. uk/terrier/doc/. depth. For exaiii-
file: domain followed file by default file into than the " a name a path a other dex. html. Such a URL would be http: //ir. dcs. gla. ac. uk/terrier/people. html.
Westerveld et al. (2001) found that the Web documents with root and subroot URLs Web likely home to to sites. correspond pages of are more of whether a particular Web document is a home page of a Web site is given by the length of its URL (Savoy & In addition to the type of URLs, another indication Rasolofo, 2001). Because the URL of Web documents is likely to reflect the hierarchical directory structure of Web sites, documents that are higher in the hierarchy have shorter URLs. Savoy & Rasolofo (2001) defined the length as the number of "/" in the URL. Kamps et al. (2004a) considered the number of characters in a URL, and they also "/" in domain ". " in the the the the of path name, and number number of counted (2004) length & Ounis (2003) Plachouras the Plachouras URL. used and et al. of a in characters of the URL path. Using evidence from the URLs of Web documents is further discussed in Section 4.5.1.
3.3.2
Hyperlink
structure
analysis
The analysis of the hyperlink structure of the Web has been based on citation analysis. For example, the impact factor of a journal, can be estimated by counting the number (Garfield, Instead just 1972). journals in it is counting of times other papers or cited of depend journal (1976) influence & Narin that the Pinski should of a suggested citations, it. journals influence that the the cite of on Therefore, the influence of a journal is
29
3.3 Web-specific
sources of evidence
defined in a recursive manner. Geller (1977) provided additional the computation Markov influence the chains. of with
insight by modelling
Similarly to citations, hyperlinks between Web documents can be exploited in order to estimate the importance of Web pages. From the perspective of a Web search engine, for each query there may be far more relevant documents than a user is willing to browse. Bar-Yossef et al. (2004) also suggested that the quality and the freshness of documents vary significantly. For this reason, when the number of retrieved documents documents that for a query is large, a search engine should try to detect important from the more authoritative, originate or trusted sources.
There have been various proposed ways to find the important documents within hypertext or Web documents. In an early study, Botafogo et al. (1992) investigatccf hypertext in documents importance the to of estimate various structure-based measures hypertext, in the introduced They to a centrality of nodes quantify measures systems. linear the to the ordering of a set of compactness and as well as measures related hypertext nodes. Moreover, Pirolli et al. (1996) analysed both the content and the link structure of Web documents within a single site in order to detect the most useful documents. (1997) were among the first to use evidence from the link in They documents. Web in Web, a method to proposed the rank order of structure h is the is wit extended set to result obtained engine, a search sent which after a query Carriere & Kazman Then, initial in documents the to set. result documents that are pointed to, or point hyperlinks incoming the the of is to of number the extended result set sorted according for hyperlink in the area of structure analysis documents. Two of the seminal works PageRank documents the Web algorithm are ranking following in the sections. are presented 3.3.2.1 PageRank HITS the algorithm, which and
for authority PageRank, a global (1998) computing Brin & Page algorithm an proposed is to links perform the While expected document. of for number counting each score PageRank wary 2000), (Amento sophisticated a more provides in et al., some cases well (1976) & Narin Pinski by to rank Web documents, similar to the approaches proposed depends document Web PageRank The a (1977) of in score Geller analysis. citation and high Documents it. to a documents with the PageRank pointing scores of all on the
30
3.3 Web-specific
sources of evidence
PageRank are either pointed by many documents, or they are pointed by important documents. A simplified version of PageRank is defined as follows:
PR(i)
dj --
PR(j)
outdegreej dti
(3.1)
PR is Nx1 the where outdegreej
vector, which contains the PageRank values for each document,
is the outdegree of a document dj, and N is the number of documents in The above equation can be expressed in terms of matrices. Let .ato
the collection.
be aNxN matrix with the rows and columns corresponding to Web documents and [aj, i] = 1/outdegreej if dj -+ di, otherwise [aj, (3.1) Then, Equation 0. be can i] = PR PR, AT PR is that written as cAT so an eigenvector of = with eigenvalue c. " The simplified definition of PageRank in Equation (3.1) overestimates the PageRank
for documents values without any outgoing hyperlinks, documents for or sets of that
only link to each other. These problems are eliminated with the introduction E called rank source. Thus, Equation (3.1) can be expressed as follows:
PR(i) (1 = - prdf) E(j) " PR(j) + prdf "E outdegreej dj -+d;
of a vector
(3.2)
[0,1] E(j) f damping is E a constant called actor, and where prdf
is the score assigned
to document j by the rank source E. The vector PR is the principal eigenvector of the
matrix:
A=(1-prdf)"E+prdf
M is the matrix with elements: where
mij outdegree;
"MT
(3.3)
ifd2i-- d7 j
otherwise of visiting a particular
(3,4)
PageRank scores correspond to the probability
node in a
Markov chain for the whole Web graph, where the states represent Web documents, PageRank Alternatively, hyperlinks. between the transitions can states represent and be seen as a model of a random surfer, who browses Web documents and navigates by following hyperlinks. The random surfer chooses to browse a random Web document with some probability 1- prdf, instead of following a hyperlink. The introduction of this jump to a random Web document makes PageRank stable to small perturbations of (2002) PageRank Diligenti Moreover, 2001). (Ng Weh the modified et al. et al., graph
31
3.3 Web-specific
sources of evidence
in order to refine the random surfer model. They included more specific user act ion,,.
following forwards, backwards, link a or jumping to a random document. such as

The PageRank algorithm does not depend on the content of documents, nor on the documents based hyperfor Instead, it the score only on computes a queries of users. links. The computation PageRank indexing time ensures that the computational at of Section 3.4.1 discusses the extensions
it is in time applying at query minimal. overhead
for hyperlink Section 4.5.2 introduces bias. PageRank topical a novel model with of hyperlink Absorbing Model, the the and combination of content and evaluates analysis, analysis. 3.3.2.2 Hubs Authorities and
Kleinberg (1998) proposed a more sophisticated algorithm for finding authoritative doc(HITS), Search Topic Hyperlink-Induced The time. algorithm, called at query uments is based on the spectral analysis of the adjacency matrix of the documents returned for a query. Documents on the Web may be authorities, or hubs. Authorities contain information topic. hubs to topic, on a specific authorities point and about a specific is is expressed Between hubs and authorities, there a mutual reinforcing relation, which hubs hubs; to by linked point follows: good and many good good authorities are as bito A this a corresponds structure of representation graph many good authorities. hand left the hubs Figure 3.1, side, in on are presented where partite graph, as shown and authorities hand the side. right are shown on
o
Hubs Authorities
Figure 3.1: Hubs and authorities
bipartite a as
graph.
the is to follows: and engine a search sent a query The HITS algorithm works as is This documents. initial set form root documents of set root 200 an top retrieved by to the that to, documents pointed that are base or point to with set a expanded documents. already retrieved One imposed restriction is that a document in the root
32
3.3 Web-specific
sources of evidence
bring set can only at most 50 documents in the base set. The generation of the base set documents is performed in a similar way to the methodology proposed by Carriere k of Kazman (1997). Each document d in the base set is associated with a hub value h (d) . and an authority value a(d) . If there is a hyperlink from document di to document dj. then this is represented by di -+ dj. The values h(di) and a(di) for document di are iteratively updated, using the following formulae:
h(dz) =E
di -+ dj
a(ds)
(3.5)
a(di) =
Kleinberg
1:
dj -)di
h(dj)
(3.6)
values a(d) the
(1998) showed that the hub values h(d) and the authority
converge to the values h*(d) and a*(d), respectively, after iteratively above computations.
performing
If A is the adjacency matrix for the documents in the expanded set, then the Equations (3.5) and (3.6) can be written as a- AT h and h F- Aa, where h, a are vectors hub the of and authority of the matrix A. AT matrix Bharat & Henzinger (1998) extended the original HITS algorithm in the following between hubs In diminish the to the reinforcement relation effect of mutual order way. link is links from to there each given site another site, a are many and authorities, when links between inversely the two sites. the to of number proportional a weight Lempel & Moran (2000) introduced the Stochastic Approach for Link-Structure Analysis (SALSA), an algorithm that computes hub and authority for Web scores AAT documents. h* is the principal eigenvector The scores of vector
the vector a* corresponds to the principal eigenvector of the and ,
documents, differently
from the HITS algorithm.
In SALSA, the scores are computed (a) a randomly selected incoming (b) a randomly selected outgoing
from a two-step random walk, where alternately: hyperlink document of d is traversed backwards;
hyperlink of document d is traversed forwards. The authority scores correspond to the (a). first from Markov distribution the step and performing chain resulting of stationary
then step (b). The hub scores correspond chain resulting from performing to the stationary distribution Markov the of (2001) et al. first step (b), and then step (a). Borodin
hority HITS. is the SALSA truncated that of alit where a one-step version of suggested
33
3.3 Web-specific
sources of evidence
a document depends only on its popularity authority
in its immediate neighbourhood,
while the
of a document in HITS depends on the global link structure. Cohn & Chang (2000) introduced Probabilistic HITS (PHITS) for calculating authority and hub scores for Web documents. PHITS is equivalent to the Probabilistic
Latent Semantic Indexing (PLSI) proposed by Hofmann (1999) and it can be described follows: the probability of the generation of a document d is P(d). The probability of as factor, a or a topic z to be associated with document d is P(zld). Given the associated topic z, the probability that there is a link to a document d is P(dlz). The advantage of this model is that, apart from a measure of authority which is represented by the P(djz), probability other interesting measures can be extracted, such as the probability P(zld) that a document d is about topic z. Borodin et al. (2001) provided heuristical refinements to the HITS and the SALSA
First, introduced they algorithms. a hub-averaging version of HITS, where the authority documents HITS, in in but the hub scores the scores of are computed same way as documents of correspond to the average of the authority by the hub. pointed In a second modification scores of all the documents the authority Third, HITS the algorithm, of
higher hub depend hubs the than the average score, and scores only on with a score the hub scores are computed only from the scores of the top ranked authorities. they extended the SALSA algorithm Bayesian algorithm, authority by allowing the authority document. of each documents in a broader neighbourhood scores to depend on a
They also introduced
from hyperlink hub having to an the of a a prior probability where is defined with respect to: (a) a parameter that represents the tendency of (b) a parameter that represents the level of authority. There, from hub hyperlink having to an authority a of a is conditioned
hubs to link to authorities; the prior probability
data. The the proposed algorithms are refinements of existing algorithiris, on observed but they have not been evaluated in a large-scale experiment. docits HITS to of expanded set retrieved an and extensions of by is implicitly documents for the that considered content of a query, means uments The application the algorithm. However, the associated computational cost of the algorithm at query in an operational setting rather difficult. Another problem
time makes its application
drift, is the HITS to topic most prominent group of occurs when which related with documents in the result set is not about the query topic, but dominates the results bedescribes Section 3.4.1 (Bharat & Henzinger, 1998). densely is it connected more cause
34
3.4 Combination
of evidence
for Web information
retrieval
the extensions of the HITS algorithm, explicitly.
which employ the content of documents more
3.3.3
User interaction
evidence
from the Web documents, the visiting patterns another source of users in Web In
In addition of evidence
to evidence that can be obtained is the information obtained
from
sites, or the click-through a study that
data obtained
from the result
pages of search engines.
of the usage patterns
observed on Web sites, Huberman
(1998) found et al. according to
the number
links followed of distribution,
by a user in a Web site is distributed
Gaussian inverse an
which means that most of the users follow few links, Pirolli &
while there is a small number of users that will follow a high number of links. Pitkow (1999) suggested that characterising structure hyperlinks. analysis the visiting patterns
of users can be used
to enhance hyperlink process of traversing
algorithms,
which
based are on a stochastic
data from the logs of a metasearch engine, in order to adapt its retrieval function with a Support Vector Machine (SVM) classifier. The results from a controlled experiment with users suggested that the users viewed function documents the retrieval with respect to the after adapting more retrieved (2005) Jiang data the et al. combined clickthrough of metasearch engine. clickthrough data with associations between documents in order to alleviate the problem that a large amount of clickthrough data is required to improve the retrieval effectiveness. Joachims data can be (2005) that the clickthrough study and suggested a user et al. performed documents. for indication the retrieved of relevance used as a relative
Joachims (2002) employed clickthrough
3.4
Combination
information Web for of evidence
retrieval
The combination
improves the different retrieval ef 6cgenerally sources of evidence of tiveness (Croft, 2000). The sources of evidence can be either different query represen-
Hyperlink techniques. different document tations, representations, or various retrieval documents. the complementof quality, or usefulness analysis provides an estimation of ing the concept of relevance. The combination for for IR, and specifically of evidence Web IR, has been extensively investigated, with many different approaches proposed.
35
3.4 Combination
of evidence
for Web information
retrieval
3.4.1
Extending
hyperlink
analysis
algorithms
One approach to the combination of evidence from the content and hyperlink structure analysis is to refine already proposed hyperlink analysis algorithms with evidence from the content analysis, or the users' queries.
Extensions cific topic. PageRank of focus more on biasing the PageRank (2002) proposed by an intelligent a modified surfer, scores towards a spePageRank algorithm, Richardson & Domingos
where the random jumps to documents
surfer is replaced according
who traverses links and to the query. A similar
to the similarity
latter the of
extension
to PageRank
has been proposed by Haveliwala specific topics is computed.
biased towards ank scores profile determines
(2002), where a set of PageR. Then at query time, the user of the differ-
the weight of each individual
topic in the combination
ent PageRank PageRank query time.
A drawback scores.
of these approaches
is that a range of precomputed combined at
scores for various
topics is required
in order to be efficiently
There have been several proposed extensions to the HITS algorithm,

incorporate Chakrabarti similarity more evidence from the textual content documents of (1998) et al. weighted the hyperlinks between documents
aiming to
in the algorithm. according to the Bharat
between the query and a window (1998) extended from the expanded the original
of text surrounding HITS algorithm
the hyperlink.
& Henzinger documents documents
by eliminating
non-relevant of
documents, set of
by and regulating
the influence
according
to their relevance scores. The relevance scores correspond
to the
cosine similarity
between the documents
in the expanded set and a broad query, resultin the expanded improvements in dif-
ing from the concatenation set. They performed
from document first 1000 the each of words and reported algorithm. considerable
a user experiment, HITS
precision
with respect to the original measures. Chakrabarti microhubs,
Li et al. (2002) investigated Object
ferent similarity (DOM)
(2001) Document the et al. used which correspond
Model
in order to detect document. a HITS
to focused hubs on a specific the topic
topic within drift
The proposed approach algorithm, by identifying
was successful at reducing more relevant
of the original
hubs for a query.
Achlioptas et al. (2001) introduced a model for searching, where both the generation distribution between links the and pages of They terms assumed a are considered. of of which results to every possible latent basic concepts, the combination number of
36
3.4 Combination
of evidence
for Web information
retrieval
topic.
For each document,
there is a vector for its authority
on each topic and another of the hub vector of one
for its vector quality document
as a hub on each topic.
The inner product document
with the authority
vector of another document.
gives the expected number of of
links from the former
to the latter
There are two associated distributions the distribution
terms with each document. for the authoritative frequencies use of latent where both PHITS factors,
The first one determines
of term frequencies of term The
terms, while the second one determines e.g. the anchor text
the distribution with
for the hub terms, topics content
associated
hyperlinks. (Section
is more evident and link analysis
in the continuation are integrated
PHITS of
3.3.2.2), PLSI and
by linearly
combining
(Cohn & Hofmann, resulting
2001). Both algorithms integration
share the same space of latent topic link and analysis.
in a principled
of content
3.4.2
Implicit
hyperlink
analysis
with
anchor
text
employ
The algorithms
HITS and PageRank, along with their extensions, explicitly
the hyperlinks between Web documents, in order to find high quality, or authoritative Web documents. A form of implicit use of the hyperlinks in combination with content analysis is to use the anchor text associated with the incoming hyperlinks of document 5. Web documents can be represented by an anchor text surrogate, which is formed from document'. hyperlinks to the the the text pointing collecting anchor associated with The anchor text of the incoming hyperlinks provides a concise description for a Web document. The used terms in the anchor text may be different from the ones that occur in the document itself, because the author of the anchor text is not necessarily the author of the document. Eiron & McCurley (2003b) found similarities in the distribution documents Web to between the text the terms queries submitted an and of anchor of intranet search engine by users. For these reasons, Web documents can be indexed with the anchor text of their incoming hyperlinks, in addition to their textual content. This (Brin McBryan, & Page, 1994). 1998: Web has been in search engines used approach Craswell et al. (2001) showed that anchor text is very effective for navigational Upstill Web home for finding tasks of sites. ct pages and more specifically search from document hyperlinks incoming (2003) the text that the of anchor suggested al. finding for home the page retrieval effectiveness outside the collection can enhance
'In the remainder of the thesis, the anchor text surrogate of a document twill be referred to as anchor text, unless otherwise stated.
37
3.4 Combination
of evidence
for Web information
retrieval
in general Web collections. In the context of enterprise search. Hawking, Craswell, Crimmins & Upstill (2004) indicated that external link and anchor text evidence are less effective. of terms in the anchor text has different characteristics from the distribution of terms in the body of Web documents. For example, the home page of a Web site may have several thousands of incoming hyperlinks with the same anchor text. As a consequence the terms of the anchor text would have a high term frequency, very that should not be penalised by the term frequency normalisation component of the used document weighting model (Hawking, Upstill & Craswell, 2004). Instead, the anchor text should be normalised differently This approach is further from the text in the body of documents. discussed in Section 3.4.4.2, and in Section 4.4, where the and The distribution
DFR framework is extended in order to allow the term frequency normalisation different document fields. weighting of
3.4.3
Network-based
models
Frei & Stieger (1995) used activation spreading of the retrieval scores along the semantic hyperlinks in a hypertext. They defined the semantic hyperlinks as hyperlinks that point to documents with similar, more detailed, or additional information. In the context of hypertext documents, Savoy (1996) suggested that constraints, such as avoiding to activate a document for which the number of links exceeds a given threshold, can be also used. Savoy & Picard (2001) employed a spreading activation
assumption about status that hypertext After links between documents documents a set of document
based mechanism, on the

some information
may contain
relevance.
retrieving
for a given query, the retrieval following in the as shown
RSVZ of a retrieved value
di is updated
equation:
RSV := RSV + ,c*1:

di -+dj
RSVP
(3.7)
from
is a weighting where r, document
parameter, dj.
denotes hyperlink di dj is that there -+ a and experiments, they considered
di to document
In their
links from or to are
the top 50 ranked documents relevant.
based on the assumption only,
that these documents
38
3.4 Combination
of evidence
for Web information
retrieval
Jin & Dumais (2001) employed a method similar to spreading activation. The combined score for document di depends on the score of document di and on a score based on the link structure. The latter score is computed by considering all document s {dk} that point to, or that are pointed by document di. For each such document dk, its contribution to the combined score of document di depends on dk's authority score, its similarity to the query, and also its similarity to document di. Ribeiro-Neto & Muntz (1996) proposed a belief network model,
which was extended
to consider hyperlinks between Web documents (Silva et al., 2000). In the belief network model, the queries, the documents and the terms are treated as nodes in a network. For each document, the evidence associated with a document being either a hub, or an are represented by two additional nodes in the network. From a theoretical point of view, the belief network model is more general than the Bayesian inference network proposed by Turtle & Croft (1991). However, both models are very similar for authority, practical purposes. 3.4.4 Combination tions This section describes the combination perspectives: the combination of evidence for Web IR from three different of retrieval systems in metasearchiiig; different of retrieval techniques and representa-
of the output
the combination
different document of representations; and the combination of query-
dependent and query- independent evidence. 3.4.4.1 Metasearching
Metasearching refers to the combination of the output of several IR systems. Saracevic & Kantor (1988) noted that the odds of a document being judged relevant increase document in the the of systems number retrieval which appears too monotonically with be relevant. Lee (1997) indicated that different systems may retrieve similar sets of documents. documents but different relevant sets of non-relevant the improvements As a consequence, in effectiveness may result from the detection of the non-relevant
documents from the different systems. Metasearching is performed by combining either the ranks or the scores of the be harming documents documents. The used without retrieval scores of may retrieved the retrieval effectiveness, when they are distributed similarly (Lee, 1997). When this
39
3.4 Combination
of evidence
for Web information
retrieval
does hold, the ranks of documents should be preferred, in order to remove condition not the bias introduced by the different score distributions. Aslam & Montague (2001) proposed a method for fusing ranked lists of documents from obtained search engines, by looking at the problem of fusion as a voting problem. They used Borda Count, where each voter ranks a fixed set of c candidates in order of preference and has at his disposal Ei-1 i votes. The top ranked candidate is given If the voter does not votes, etc. >? then i votes are divided between the the rank some candidates, remaining of _1 the unranked candidates. Then, the candidates are ranked in order of total votes. Lebanon & Lafferty (2002) proposed a model for obtaining a probability distribution (2003) documents. Moreover, Fagin the over rankings of et al. performed a combination features of several using the ranks of documents. They employed various features, including the content of documents, the anchor text of incoming hyperlinks, PageRank, the length and depth of URLs, as well as the occurrence of query terms in the URL of Web documents. The combination of ranked lists is particularly useful when combining the output c votes, the second ranked candidate is given c-1
do Web not usually return the retrieval scores search engines, of commercial which & (Meng Craswell, Robertson, Zaragoza Taylor However, documents 2002). et al., of (2005) suggested that using the scores is potentially contain more information because the scores effective, more than the ranks. Indeed, the ranks can be obtained from the
from but is it the ranks. the to original scores obtain scores, not possible Bartell (1994) investigated the automatic et al. combination of multiple retrieval techniques. They modelled the combination of evidence from different retrieval systems for For the linear the the example, system. of each scores of retrieval combination as for RSVe, d, document the of systems m a combination overall score query q and q(d) is given by Ei"_' Ei (q, d), where Ei (q, d) is the score assigned by the i-th system t 1 A drawback i-th Oi is for the d document the the of system. weight of query q, and )j. this approach is the need to calculate the m parameters E Shaw & Fox (1994) reported that the combination of scores performed better when the combined systems were related to different retrieval paradigms. They also suggested that the linear combination of scores was more effective than selecting one retrieval score from the available ones for each document.
40
3.4 Combination
of evidence
for Web information
retrieval
Manmatha
(2001) introduced a methodology, which is based on the observaet al.

of relevant documents documents fits a Gaussian distribution, distribution. while If the intersect fits an exponential
tion that the score distribution the score distribution
of non-relevant
Gaussian distribution, the mean of from far each other, are in separating different the relevant
and the point where the two distributions system is expected the non-relevant
then the retrieval documents from
to be more successful ones. When combining to
retrieval
techniques,
the above described of each search engine. of the parameters expensive.
methodology However,
could be applied a disadvantage
automatically approach distributions results,
set the weights the estimation
of the
is that
for the Gaussian
and exponential in t he
is computationally
In addition, approach
there is some variability employed
due to the Expectation-Maximisation
for the paramet er
estimation. 3.4.4.2 Combination of representations
Westerveld et al. (2001) and Kraaij et al. (2002) employed a mixture of language models incoming the text documents Web the for the content of of corresponding anchor and hyperlinks. Similarly, Ogilvie & Callan (2003) investigated the use of a mixture of one low found They documents. that for combining language model each representation of They improve does suggested also performance. always not performing representations low is language performing representations are that the mixture of robust when models incorporated better performing ones. among (Turtle & inference Bayesian (2004) network model Tsikrika & Lalmas employed a formal in documents Web a Croft, 1991) in order to combine several representations of in improvements early precision. framework, and the obtained results showed difthe linear (2004) of scores combination of Robertson et al. suggested that the linear because the be combination of ferent retrieval techniques may not appropriate, by introduced models, weighting various the does non-linearities not consider scores difficult it is and to interpret the resulting scores. BM25 the model (2004) weighting of Zaragoza et al. version a modified proposed text to the fields, different within from handle correspond terms which in order to terms to BM25 using The weights assigns HTML tags. of extended version specific For field for separately. each term frequency normalisation parameters and weights independently. frequencies field, weighted. term and the different normalised are each
41
3.4 Combination
of evidence
for Web information
retrieval
The term frequency that is used in the BM25 formula is the sum of the normalised and weighted frequencies. Section 4.4 provides more details about this extension of BM25, and introduces an extension to the DFR framework for performing per-field normalisation 3.4.4.3 and weighting. of query-dependent and independent evidence
Combination
As indicated above, the combination of query-dependent evidence from the textual content of documents, and query-independent evidence, such as the hyperlink structure of Web documents, can be performed by aggregating ranked lists of documents. Alternatively, the combination by first retrieving of query-dependent and independent evidence can be achieved documents using query-dependent evidence, and then reranking the
retrieved documents according to the query-independent evidence. Upstill et al. (2003) employed this approach to obtain an ideal evaluation of various query-independent sources of evidence for home page finding search topics. This was performed as follows. First retrieval from the content or the anchor text of documents Then the rank k of the correct answer for a topic was located. Finally, the top k ranked documents were reordered according to a query-independent source PageRank, of evidence, such as or the type of the URL. In a more realistic case, the was performed. documents top to reorder is specified as a percentage of the total number of ranked for documents In number of retrieved a query. addition, they suggested reordering the documents with a higher score than a percentage of the highest score assigned to a document for a query. The results indicated that the latter approach is more effective than the former one. Westerveld et al. (2001) and Kraaij et al. (2002) employed the indegree of docuURLs, define in the prior probability type to the order of ments, as well as being a home page, in the context of language modelling. defined the document priors by considering both the age and popularity documelit of a Hauff & Azzopardi (2005) docuWeb of
for by estimating the number of a preferential attachment model ments, as estimated links to a Web document. Amati et al. (2003) proposed the Dynamic Absorbing retrieval score of a particular Model, which employs the of starting a randoiii motivated
document as the prior probability
Absorbing Model. This in in theoretically the and a results principled walk combination hyperlink and analysis. of content
42
3.5 Evaluation
Craswell, Robertson, Zaragoza & Taylor (2005) introduced a methodology, inspired by the work of Singhal et al. (1996) on pivoted document length normalisation. for finding appropriate tent retrieval. functional forms to combine query-independent evidence with conThen, they transformed the independent evidence into scores, which can be linearly combined with the analysis of query-dependent evidence. The combination of query-dependent and query-independent sources of evidence is further discussed in Section 4.5 of the next chapter.
3.5
Evaluation
performed in the context of the
The evaluation of Web IR systems has been primarily
TREC Very Large Collection and the Web tracks, which ran for eight consecutive years. Several studies have also been conducted in order to estimate the retrieval effectiveness of commercial search engines.
3.5.1
Experimental
evaluation
(VLC)
in Text REtrieval
Conference
The Very Large Collection
track, followed by the Web track, have been ded-
icated to the evaluation of IR systems with Web test collections from TREC-6 until TREC 2004 (Craswell & Hawking, 2004; Hawking & Craswell, 2005). The definitions from have tasks the standard ad-hoc search tasks with evolved of evaluated search Web documents to Web-specific informational and navigational search tasks, similar to the search tasks specified by Broder (2002). The evaluation measures primarily used first the the retrieved relevant mean reciprocal rank of were mean average precision, document, precision at n retrieved documents, and success at n retrieved documents. In the VLC track of TREC-6 (Hawking & Thistlewaite, 1997) the used test collection documents. In Web 7.5 the VLC the and non-Web million collection, a set of was VLC track of TREC-7 (Hawking et al., 1998b), the used test collection was the VLC2 Archives. from Internet The documents Web the 18.5 crawled million collection, a set of focused in tasks, the tasks and mainly on spirit of ad-hoc search were evaluated search the scalability of the existing prototype IR systems.
In the Web tracks of TREC-8 (Hawking et al., 1999) and TREC-9 (Hawking, 2000),
More it. VLC2 two the specifithe used test collections were subsets of collection and
'http: //www. archive. org
43
3.5 Evaluation
in the Large Web task of TREC-8 the VLC2 collection was used to test whether tally, the existing prototype IR systems would scale up to that amount of data. In addition, in the Small Web task of TREC-8, the WT2g collection, which corresponds to a subset of 250,000 Web documents and 2GB of data from the VLC2 collection, was used to perform ad-hoc retrieval. The results showed that hyperlink analysis-based approaches were not as effective as standard IR techniques for an ad-hoc search task (Hawking et In Web 1999). the track of TREC-9, a subset of 1.69 million Web documents and al., 10GB of data from the VLC2 collection, the WT10g collection, was used to perform ad-hoc retrieval. The results showed that standard IR techniques performed well for the ad-hoc search tasks (Hawking, 2000). In the Web track of TREC 2001 (Hawking & Craswell, 2001), a homepage findhig task was introduced, in addition to the ad-hoc search task with the WT1Og collectioii. In this navigational task, the topics are about finding the homepage of a Web site, the name of which corresponds to the query. The results from this search task showed that both anchor text and the type of URL of Web documents improved the retrieval effectiveness. For the Web tracks of TREC 2002 (Craswell & Hawking, 2002), TREC 2003 (Craswell (Craswell lie 2003), TREC 2004 test 2003), the t collection al., used was et and et al., GOV collection, a partial crawl of the gov domain from 2002. This collection consists . Web TREC 2002, lie data. In 18GB Web documents the track t 1.24 of and of of million topic distillation task was introduced, where relevant documents are supposed to be definition due However, the to topic. the of what constiquery useful resources about tuted a relevant document, the results from the evaluation of the task were similar to an distillation task, there In to topic the task. was a named page addition ad-hoc retrieval finding task, where the query topics were about finding a particular Web documeia, For the text, homepage. task, this is and anchor navigational not necessarily a which (Craswell & Hawking, 2002). document the structure were effective sources of evidence In TREC 2003, the definition distillation the topic of task was refined to specify The Web be homepages documents navigational sites. that relevant of relevant can only finding Web 2003 TREC track and task of the consisted of a mixture of named page homepage finding topics. In both evaluated search tasks, the document structure and in lie improvements in important hyperlinks incoming t the anchor text of resulted (Craswell 2003). et al., retrieval effectiveness
44
3.5 Evaluation
The Web track in TREC 2004 consisted of a mixed query task, where topic distillation, named page finding, and homepage finding queries, are mixed in a single stream of queries. The IR systems are not aware of the query type during retrieval. This task is closer to the operational setting of a search engine, where users submit queries, without giving explicit evidence of the query type. The results showed that effective retrieval could be performed without classifying the mixed queries into the corresponding query types (Craswell & Hawking, 2004). Summarising the findings from the VLC and the Web tracks, Hawking & Craswell (2005) suggested that the nature of the search task is very important in determining what sources of evidence will result in effective retrieval. Indeed, Web-specific evidence improved the retrieval effectiveness for informational not for typical ad-hoc search tasks. and navigational search tasks, but
The last Web track was run in TREC 2004. The evaluation of IR systems with Web data has also been performed in the Terabyte track of the TREC 2004 (Clarke et (Clarke 2004) TREC 2005 al., and et al., 2005), which employed the GOV2 collection, a crawl of 25 million Web documents and 426GB of data from the gov domain. The Enterprise Track in TREC 2005 (Craswell, de Vries & Soboroff, 2005) focused on email and expert search tasks. In the remainder of this thesis, the experimental setting is based on the Main Web task of TREC-9 and the Web tracks from TREC 2001 to TREC 2004. More details are presented in Section 4.2
3.5.2
Search engine
evaluation
Gordon & Pathak (1999) proposed a list of features that should be considered in comparative evaluation studies of commercial search engines. They performed a comparison of seven commercial search engines and one subject directory, using genuine information needs and expert searchers. The relevance assessments were performed by tine formulated users who the information needs. The evaluation was based on precision that documents that had been retrieved by orte and recall, as well as the likelihood
found by The that, overall, the absolute engine, were retrieved others, as well. authors low. is They also noted that different engines retrieve precision of search engines quite different relevant documents, suggesting that metasearching could potentially the retrieval effectiveness. improve
45
3.6 Query
classification
and performance
prediction
In another study, Hawking et al. (2001) extended the list of features for comparative 20 of commercial search engines, of search and an evaluation studies engines, performed including directories. metasearch engines and subject They used genuine queries obtained from search engine query logs. Therefore, the relevance assessments were not information by the the original users with performed The evaluation measures need. first the the retrieved relevant mean reciprocal rank of were mean average precision, document, precision at n retrieved documents, as well as the average precision from beinter-correlations The 5. 1 there that to are significant results showed rank rank tween the different evaluation measures, but that there are no statistically significant difference among the top performing (2001) Hawking also suget al. search engines. IR lower than that the of search engines was gested that the retrieval effectiveness of Web TREC-8. Large in task the of systems evaluated Some of the proposed methodologies for the evaluation of IR systems, and Web Chowdinvestigated have in the automatic evaluation of systems. general, search engines hury & Soboroff (2002) compared IR systems by pooling randomly selected retrieved documents. Can et al. (2004) computed the similarity between an information need and documents The from documents most similar the top ranked a set of search engines. For information be to the engine, precisearch to each need. relevant were considered IR Even the though systems, of evaluation automatic sion and recall were calculated. it is to doubtful is it interesting is topic, achieve possible whether an or search engines, human based the to assessors. on evaluation results similar
3.6
Query
classification
and performance
prediction
tasks different types performed there search of As discussed in the previous sections, are be that different techniques can retrieval by users of Web search engines, as well as identification focused has the For on Web IR. for these research current reasons, applied for IR the systems. prediction the performance and users' goals of 3.6.1 Identifying user goals and intentions
by log, the from matching (2003) identified a query Beitzel et al. navigational queries (2004) & Levinson Rose in taxonomies. retitles the edited to of categories queries Broder by Web fined the taxonomy of user goals, or search tasks, originally proposed
46
3.6 Query
classification
and performance
prediction
(2002).
They added sub-types
of user goals for the informational identified
search tasks and the
transactional
search tasks. They manually the submitted
the user goals from search engine documents: the doc-
logs, by using: query
queries; the sets of retrieved
uments that users clicked on; and other subsequent they suggested that the successful identification different Bomhoff relevance ranking algorithms
actions of the users. Additionally,
of the user goal may result in applying depending queries, on the user goal. the
for different
(2005) also proposed to identify et al. navigational, including: appear
the intentions informational,
of users by examining and transactional
In logs. identify to order query they looked features at several
queries, the
the terms in the URLs
and the length of documents
of queries;
fraction
of the query terms that
the users clicked the about
on; information
browser Web the of users; part-of-speech about for the query.
information
query terms; and a timestamp
In the context of TREC-style type classification method. a different combination
(2003) & Kim Kang proposed a query experiments,
The query types correspond to different search tasks, and
for They is type. considered each query applied of evidence finding homepage the task the search two search tasks, namely and ad-hoc search (Hawking & Craswell, The 2001). two tasks TREC 2001 Web from track the task of differ considerably, since the first one is an ad-hoc search task, while the other is a that terms they type, For identifying the task. are employed query navigational search terns, information the in homepages, likely query about to part-of-speech appear more information Web hyperlinks incoming pages, and co-occurrence of anchor text of the for the query terms. Web in introduced the task Section 3.5.1, the mixed query As described in was IR in the 2004 to systems when a stream track of TREC performance of evaluate order knowing the type is each of different explicitly types without available, of queries of better in type significantly that The resulted classification query showed results query. & (Craswell Hawking, help did retrieval effectiveness than random accuracy, but it not between type differences discusses classification and Section 5.2.1 the query 2004). in thesis. is this IR, Web proposed which selective
47
3.6 Query
classification
and performance
prediction
3.6.2
Predicting idence
query
performance
dynamic and
combination
of ev-
In addition
to identifying
the user goals, some methodologies have been proposed to (Voorhees, Robust TREC 2003. the track of
dynamic This has the the of performance queries, and combination of evidence. predict been partly motivated by the introduction 2004), where IR systems are required to provide a measure of confidence in the quality for basis. thus their the each query, performance on a per-query of results predicting Cronen-Townsend et al. (2002) introduced the clarity score and measured the ambidocuments, divergence language top the the the of model of ranked guity of a query as from the language model of the whole document collection. The clarity score was shown to be correlated with the query's average precision. Amati et al. (2004) introduced query difficulty predictors, based on measuring the
divergence between the query terms' distribution When the two distributions in the top ranked documents, have a high divergence, perform correlate well. with Their and the then it is more experimental
whole collection. likely results precision that
the query
is easy and the system will the query difficulty predictors
suggest that
the mean average the effectiveness
first-pass the of applying
retrieval,
but they cannot be used to predict
of automatically
query expansion.
He & Ounis (2004) defined and evaluated five pre-retrieval query performance predictors. Unlike the above two approaches, which depend on the assigned scores of the depend the documents, collection statisthe on only pre-retrieval predictors retrieved The before be performing retrieval. tics of the query terms, and they can computed he deviation t length the include: the of the standard of query; proposed predictors idf idf the the values idf the minimum over terms' of maximum ratio values; query documents; the to and a for the query terms; a predictor related number of retrieved likelihood based is that estithe on maximum score clarity query simplified version of found It the that score clarity instead query simplified was scores. of retrieval mates for the stailthe while is more effective at predicting short queries, query performance dard deviation of the query terms' idf values correlates well with the performance of longer queries. Plachouras, He & Ounis (2004) introduced an additional pre-retrieval inverse term to the collection average query performance predictor, which corresponds frequency.
48
3.6 Query
classification
and performance
prediction
Instead of modelling the query ambiguity, types of topics.
Evans et al. (2002) distinguished three topics, where the retrieved doctopics, that may contain
The first type refers to monolithic
uments are similar. several relatively
The second type refers to structured subtopics.
monolithic
The third type refers to diffuse topics, which
highly dissimilar documents. The topic structure can be quantified by eimay retrieve ther considering the stability and the number of generated groups of documents using between samples of retrieved documents and Similarly, Yom-Tov et al. (2005) estimated the because some of their asin this approach, as by measuring the similarity clustering, or the rest of the retrieved documents. difficulty query
based on the number of documents that contain subsets of the query
terms. In this way, they identified queries that are difficult,
difficulty dominate The the aim of query predicting results. pects
is disable described in the to the automatic query expansion ones, previously as well for difficult queries, which usually leads to a degradation of precision. In addition to performance prediction for retrieval, there has been work on predicting the quality of evidence from the hyperlink structure of the Web. In the context of Web document classification, Fisher & Everson (2003) focused on the relation between the effectiveness of hyperlink analysis and the density of hyperlinks in a test collectiou. Gurrin & Smeaton (2003) studied the effectiveness of hyperlink the size of a document collection increases. Other proposed approaches include dynamically analysis algorithms basis. on a per-query Amitay (2003) set the weight of additional structure analysis as
hyperlink the of weight setting Amitay (2002) et al. and et al. analysis algorithms, such
hyperlink from evidence
the text the numSALSA, and anchor HITS as such evidence, of sources or other or as documents. features the to hyperlinks, set of retrieved ber of incoming of according Plachouras & Ounis (2005) employed Dempster-Shafer theory to combine content and The basis, to the query. hyperlink analysis on a per-query specificity of each according SecModel. Absorbing PageRank the hyperlink and were algorithms analysis employed different the of weights setting in is latter The IR. the Web approach proposed selective and combinations of evidence dynamically between differences discusses 5.2.1 the tion this thesis.
49
3.7 Summary
3.7
Summary
This chapter has presented an overview of Web IR. It discussed the differences between IR classical and Web IR, with respect to the hypertext document model, the structure of the Web, the quality of information on the Web, and the Web users' background. Next, a range of Web-specific sources of evidence, as well as different methodologies to combine them for effective retrieval, were presented. The chapter continued wit h covering the evaluation of Web IR systems in an experimental setting, as well as the evaluation of Web search engines, and closed with a review of user goals prediction Overall, most of the discussed techniques either and query performance prediction. retrieval approach for all queries, or they estimate the difficulty identify they the query expansion or not, or of a query, in order to apply automatic
apply a particular
goal of the user. There has not been any extensive investigation and evaluation of the selective application of different retrieval techniques on a per-query basis, according to
the appropriateness of each retrieval approach. The remainder of this thesis presents selective Web IR, a novel framework, which
aims to apply appropriate gates the potential decision theoretical a framework retrieval approaches on a per-query obtained basis. Chapter 4 investi5 presents improvements framework from selective Web IR. Chapter
for selective Web IR. The evaluation 6. Chapter 7 investigates
of the proposed of selective
is presented in Chapter where limited
the application
Web IR in a setting
information relevance
is available.
50
Chapter
Retrieval Selective Retrieval

4.1 Introduction
Approaches for Web Information
The aim of this thesis is to investigate the effectiveness of selective Web IR. However, before introducing framework for Web IR, it is necessary to examine the a selective in improving the retrieval effectiveness. This chapter aims potential of such an approach to establish this potential, by examining and evaluating a range of retrieval approaches. This chapter starts with describing the experimental setting and the used search tasks from various TREC Web tracks in Section 4.2. Section 4.3 evaluates the effectivefull documents, document from text the other representations. of and ness of retrieval These document representations correspond to the text of the title, and the heading HTML tags, as well as to the anchor text of the incoming hyperlinks. a range of statistically For each docuindependent weighting models is evaluated.
ment representation,
These models include the DFR weighting models PL2, PB2, I(ne)C2, and DLH, as wtiv11 (Table BM25 19). 2.1 on page as Next, Section 4.4 presents a new extension of the DFR framework, in order to allow the combination fields, document different and to perform per-field normalisation of of for evidence
the term frequencies. Section 4.5 investigates the use of query-independent
Web IR, including the URLs of Web documents, PageRank, and the Absorbing N1o(1(I for hyperlink structure analysis. a novel model
51
4.2 Experimental
setting
The introduced retrieval approaches are separately evaluated for the different of ad-hoc, topic distillation, home page finding, and named page finding search of the TREC Web tracks. The associated hyper-parameters are set in order to
types t asks opt i-
mise precision for each evaluated task, and each weighting model. This allows for the comparison of the retrieval approaches on the basis of their optimal performance. Section 4.6 evaluates the proposed retrieval approaches in a different setting, in order to reduce any overfitting effects from the optimisation process. First, the hyperparameters of the retrieval approaches are set in order to optimise precision for different sets of mixed tasks. Second, the optimisation process is stopped early, before converging to the optimal setting of the hyper-parameters.
This chapter closes with establishing the potential for improvements in retrieval effectiveness from selective Web IR in Section 4.7. The results from this chapter provide further a motivation for the introduction of the decision theoretical framework for
Web IR in the next chapter. selective
4.2
Experimental
setting
The retrieval TREC tracks.
approaches presented in this chapter are evaluated using the standard and the associated search tasks from the TREC Web
Web test collections,
Table 4.1 presents an overview of the tasks used in the TREC Web tracks
from GOV WT10g The Very Large the the test tasks the and collections. earlier with tracks are not used, because their aim was primarily to test whether IR (Hawking & Craswell, data large 2005). to systems would scale amount of process a Moreover, the WT2g test collection is not employed, because it corresponds to a small WT10g the collection, subset of The WTlOg and it has been used only once in TREC-8 for arg (Hawking task et al., 1999). ad-hoc search test collection documents, Web 10GB 1,692,096 and of consists of data (Bailey et al., 2003). The topic relevance tasks tr2000 and tr2001, used in TREC9 (Hawking, 2000) and TREC 2001 (Hawking & Craswell, 2001). respectively, coneWT Og is l hp2001, the The task tasks. to associated with which spond ad-hoc search (Hawking finding TREC k 2001 home task to the test collection, corresponds of page Craswell, 2001). where the topics are about finding the home page of a Web site, the to the query. corresponds name of which Collection
52
4.2 Experimental
setting
Name tr2000 tr2001 hp2001 td2002 np2002 td2003 ki2003 mq2004 hp2003 np2003 mq2003 td2004 hp2004 np2004
Used in TREC-9 TREC 2001 TREC 2001 TREC 2002 TREC 2002 TREC 2003 TREC 2003 TREC 2004 TREC TREC TREC TREC TREC TREC 2003 2003 2003 2004 2004 2004
Task Topic relevance Topic relevance Home page finding Topic distillation Named page finding Topic distillation Known-item finding Mixed query Home page finding Named page finding Mixed query Topic distillation Home page finding Named page finding
Collection WT10g WT1Og WT10g GOV GOV GOV GOV GOV GOV GOV . GOV GOV GOV GOV
Topics 451-500 (50 topics) 501-550 (50 topics) EP1-EP145 (145 topics) 551-600 (50 topics) NP1-NP150 (150 topics) TD1-TD50 (50 topics) NP151-NP450 (300 topics) WT04-1-WT04-225 (225 topics) (150 ki2003 topics) subset of (150 ki2003 topics) of subset td2003 and ki2003 (350 topics) (75 topics) subset of mq2004 (75 topics) mq2004 subset of (75 topics) subset of mq2004
Web from TREC the topic 4.1: The the Table sets corresponding search tasks and tracks. data. 18GB documents, Web 1,247,753 GOV of and The test collection consists of (td2002) distillation for been GOV have topic the and The associated topics with used 2002), (Craswell & Hawking, the 2002 TREC (np2002) finding tasks of named page (Cras%ti'ell 2003 (ki2003) TREC finding tasks of topic distillation (td2003) and known-item (Craswell & 2004 (mq2004) TREC & Hawking, 2002), and the mixed query task of home to the hp2003 page and Hawking, 2002). The tasks and np2003 correspond ki2003, finding item task known 2003 TREC respecfinding the topics of named page ki2003. The from td2003 topics and tively. The task mq2003 corresponds to the set of findhome distillation, the topic to page hp2004, correspond tasks td2004, np2004 and Web 2004 TREC task mq2004 of ing and named page finding topics of the mixed query track, respectively. be 4.5 4.3,4.4, Sections in evaluated will and The proposed retrieval approaches hp2001, td2002, tr2001, np2002, tr2000, tasks: different types for the of separately 4.7 4.6 Sections also will hp2004, and td2004, hp2003, np2004. and td2003, np2003, Web-specific focusing search on tasks mq2004, mq2003 and consider the mixed search
tasks. Savoy & Picard (2001) highlighted that removing stop words and applying stemming
ad-hoc the has a positive effect on precision of a retrieval (199"Sa) Hawking hand, On al. et the WT2g other the collection. retrieval task with during be retrieval, be indexed applied can stemming and that can words stop suggested
TREC-7 for the system
53
4.3 Document
representations
for Web information
retrieval
if necessary. This is more similar to the indexing approach taken by commercial Web search engines, where stop words are usually indexed, and weak stemming may be applied. For Web specific tasks, such as topic distillation, named page and home page finding tasks, there has not been any clear indication that removing stop words and applying stemming harm the retrieval effectiveness (Craswell & Hawking, 2004).
The WT10g is visible algorithm are applied document and GOV test collections application. are indexed by processing Stop words are removed, indexing. In addition, the text, which on a Web browser of Porter
and the stemming certain restrictions
(1980) is applied
during
in order to reduce the number index. First,
of non-informative
terms in the generated Next,
tokens which are longer than 20 characters are discarded. characters,
tokens that numerical text
contain
more than three same consecutive This restriction documents. of other
or more than four the anchor
digits, are discarded. hyperlinks describe
is not applied when indexing Indeed, the anchor Therefore, text
incoming of
is intentionally to be
used to concisely informative. more informative generated
Web documents. restrictions
it is considered
The applied
indexing
aim to reduce the number of nonin reducing the size of the of the inverted
terms in the document data structures, similarly
index,
and also result
to the approach
of static pruning and retrieval
index (Carmel with the Terrier
et al., 2001). In this thesis, indexing IR platform (Ounis et al., 2005).
have been performed
4.3
Document trieval
representations
for
Web
information
re-
This section investigates the retrieval effectiveness of different document representations for Web documents. The employed document representations include the full text of Web documents, the title, the headings, and the anchor text of incoming hyperlinks. For each document representation, five a range of weighting models is tested. The first four models are derived from the Divergence From Randomness (DFR) framework (Section 2.3.3, page 13): PL2, PB2, I(ne)C2, and DLH. The fifth weighting model is BM25 (Section 2.3.1, page 10). The formulae of the five weighting models are given in Table 2.1 on page 19. As discussed in Chapter 2, these weighting model', have beeri PL2 I(ne)C2 for The and are robust and several reasons. weighting models selected (Plachouras & Ounis, 2004: Plachouras. He tasks perform well across a range of search
54
4.3 Document
representations
for Web information
retrieval
& Ounis, 2004). The weighting model PB2 is selected in order to test the combination of the Poisson randomness model with the Bernoulli model for the after-effect. The weighting model DLH is particularly interesting, because it does not have any associhyper-parameter. ated The weighting model BM25 is employed, because it has been of TREC. frequently used by many participants
process is employed BM25. and the precision
In order to compare the effectiveness of each weighting model, a two-step optirnisation to set the hyper-parameters of the weighting PL2, models are set in ormodel is PB2, I(ne)C2, der to optimise DLH For each tested task, the hyper-parameters of the weighting models.
Note that the weighting and therefore,
does not have any associated
hyper-parameter,
no optimisation
required.
4.3.1
Representing
Web documents
The analysis of the textual content of documents is necessary for matching documents to the users' queries. There are several different representations for Web documei1 s. The first representation corresponds to the full text of Web documents. In addition, particular features of HTML can be employed to define other document representations. HTML is a markup language that is used for authoring Web documents (Raggett et documents, for Web It 1999). tags the structure of as well specifying al., provides a set of by Web browser be the they application. a should rendered as way vey information The HTML tags conbe documents, improve to textual the can of used content which about the retrieval effectiveness in navigational, and informational search tasks (Craswell & Hawking, 2002,2004; Craswell et al., 2003; Hawking & Craswell, 2001). For example, the text within document. the tags (TITLE) (/TITLE) and Web the title the to of corresponds Jin et al. (2002) observed that the user queries are more similar to the titles both documents, the that documents they to the than queries suggested and actual of different The descriptions information. the text titles the within of provide concise and heading tags (for example (H1) and (/H1)) document's sections. The anchor text, which appears within the tags (A) and (/A) in the source docdocument. Eiron description brief functions hyperlinks, incoming of a as a uments of McCurley fir , (2003b) suggested that the anchor text exhibits similarities and grammatical to the user level. In order to provide a concise description Web titles to the of a usually corresponds
queries on a statistical
55
4.3 Document
representations
for Web information
retrieval
Web documents, of anchor text tends to be short and may contain abbreviated terms and acronyms. Compared to the titles of Web documents, Eiron & McCurley pointed that there are as many anchor texts as the incoming hyperlinks of a document, while there can be only one title for a document. The anchor text has been shown to be effective for navigational tasks, such as named page finding (Craswell & Hawking, 2002. 2004; Craswell et al., 2003) and home page finding (Craswell et al., 2001). as well as for informational tasks, such as topic distillation, when there is a bias towards the home (Craswell Web & Hawking, 2004; Craswell et al., 2003). pages of sites
In order documents to establish are represented the retrieval effectiveness of the different the corresponding HTML tags, t he only by the text within the other tags. In addition correspond to
to the full text
representation,
three document
representations
to: the text within (H6) and (/HG));
the title tags; the text within
the heading tags ((H1) and (/H1) hyperlinks.
and the anchor text of the incoming
4.3.2
Parameter
setting
The evaluation of the different document representations is performed with a range of However, the retrieval effectiveness of the weighting models depends weighting models. on the setting of any associated hyper-parameters. ment representations, the hyper-parameters In order to compare the docliare set in order to optimise the retrieval
effectiveness of the weighting models. This allows for the comparison of the weighting basis the models on of their optimal performance. The employed weighting models include four DFR weighting models: PL2, PB2,
I(ne)C2, and DLH. The weighting model BM25 is used as well. Their formulae are of All in Table 2.1, 19. the employed given page DLH, have associated hyper-parameters I(ne)C2 PL2, PB2, models and to the normalisation 2 from
weighting
models, with the exception
that need to be estimated.
The DFR weighting c, which is related takes real
have one associated hyper-parameter, Equation
(2.16) on page 15. This parameter parameters for the weighting
The than considered greater zero. values b, which is related factor for the term to the term frequency frequency.
model BM25 are
normalisation,
k1, is and a saturation which is related to a correction is related
The parameters lengths
k2, which
of the weights
due to the different of the term
of documents, in a query,
and k3. which
to the importance respectively
frequency
0 1000, to and set equal are
(Robertson
et al., 1994).
56
4.3 Document
representations
for Web information
retrieval
The values of the parameter c for the models PL2, PB2, and I(ne)C2, and the parameters b and ki for the model BM25, are independently set for every tes1; ed task, after performing a one-dimensional optimisation for the DFR models, and a twodimensional optimisation (MAP). precision The direct optimisation for BM25. Each optimisation maximises the mean average
MAP is preferred over more classical optimisation techof niques, such as maximum likelihood estimation, for two main reasons suggested 1)yMetzler & Croft (2005). First, the training data, which corresponds to the available is a very small sample of the event space of documents and Therefore, the maximum likelihood estimation is less likely to result in a good queries. estimate of the parameters. Second, the maximisation likelihood for the of generating the training data does not necessarily mean that a metric, such as MAP, is optimiticd. Therefore, it may be more useful to optimise a particular retrieval effectiveness metric, such as MAP. The direct optimisation
effectiveness. each task. training stopped
relevance information,
MAP for of each tested task results in optimal retrieval

problem is the overfitting of the weighting models to in Section 4.6 is performed types. The optimisation different with process is also
However, a potential
For this reason, the optimisation tasks of mixed query
and testing
after a given number
of iterations.
The optimisation gorithm
involves two steps. In the first step, a simulated annealing alIts output is used as a starting point for the
(Press et al., 1992) is applied.
based is the on a combination of second step, where applied optimisation algorithm heuristics to avoid local maxima (Yuret, 1994). The optimisation is performed at least twice for each of the tested topic sets, in order to increase the chances of finding a global The for MAP, the optimal and most effective parameter values are selected. maximum b I(ne)C2F, PL2F, PB2F for DFR the the as well as and parameters models c values Appendix A. A. The DLH 1 for in Table k BM25, of weighting model are shown and does not have any associated hyper-parameter, tion naturally Section 2.3.3. Figure 4.1 shows the tested values of c during the optimisation of full text retrieval because the hypergeometric distribuincorporates term frequency normalisation in the model, as discussed in
is The hp2004, for to PL2 td2004, tr2001, tasks the set parameter and np2004. c with higher values for the topic-relevance topics, than for the topic distillation, or any of i lie
57
4.3 Document
representations
for Web information
retrieval
optimization for full text retrieval with PL2 and tr2001

U."l"i 0.21
V. I
optimiwtioo for fu0 text retrieval with PL2 and td2004
0.09
0.2
0.09
0.19 0.18 0.17 0.16 0.15 0.14 0.13

0.03 0.05 0.07
0.06
0.04
0.12 nii 0.1 1

nm
10 cc
100
1000
0.01
0.1
10
100
IM)
optiminatioo for full text retrieval with PL2 and hp2004 0.24 0.22 0.45
U. J
optimisation for full text retrieval with PL2 and np2004
0.2 0.18 0.16 0.14 0.12

0.1
0.4
0.35
0.3
0.25
0.08
n7
nnc 0.1 I 10 CC 100
1000
0.1
10
Rw
uw
different (MAP) for tested c values Figure 4.1: The obtained mean average precision for tr2001, PL2 topic the full sets text during the two-step optimisation of retrieval with td2004, hp2004 and np2004.
58
4.3 Document
representations
for Web information
retrieval
navigational
search topics. This is related to both the average length of documents in
the collection (Table 4.2), and the average length of the relevant documents (Table 4.3). The document length corresponds to the number of indexed tokens for a particular document representation. Regarding the WT10g collection and the ad-hoc task tr2001. the average length of documents (394.87 from Table 4.2) is lower than the average length is (1689.98 from Table 4.3), documents the relatively optimal c value and of relevant high (c = 12.3985 from the top left diagram in Figure 4.1). For the GOV collection distillation the topic and task td2004, where there is a bias towards the home pages of Web sites, the average length of documents (726.71 from Table 4.2) is higher than the (494.28 In from 4.3). Table documents the this length optimal case, of relevant average Figure 4.1). in (c diagram from low 0.1536 is top the right relatively = value c The above dependence between the length of the relevant documents, the avenig e document length, and the parameter c is explained with respect to the formula of normalisation 2 from Equation (2.16): (1 loge tf +c" tfn = " (l/l))
hyper-parameter, is length, document l is length, document the 1 is a c the average where low When frequency. term is c value the normalised a tf is the term frequency, and tfn (1 loge f= +c" is used and l/1 > 1, then tf n/t ([/')) (1 1/1 f= loge +c" 1, tf then < n/t and (171)) - 1. When a low c value is used favour low Thus, 1. short < c values
1/1 for 1, is high < When either used, documents, and penalise longer ones. c value a high to (1/0) Therefore, 1. (1 correspond 1092 > c values +c" 1/1 1, tf= tf > nl or f>1 the because tf frequencies, of regardless term n/t the of normalisation a weaker l/l. ratio
Average document length WT10g Document representation 394.87 Full-text 4.42 Title 25.41 Heading 13.50 Anchor text GOV . 726.71 4.12 10.14 21.41
document different for documents the representations length Table 4.2: The average of to the length document The number. corresponds GOV test collections. in WT10g and . document, for words. stop indexed removing tokens after each of
59
4.3 Document
representations
for Web information
retrieval
optimisation 0.042
for anchor text retrieval with PL2 and tr2001 ++

U. I Y
optimisation
for anchor text retrieval with PL2 and d2004
0.04 ++ +
Ftep 1+ tep 2x
0.12
0.038
++ +
0.036 a 0.034
0.1 ++ +
0.08 a
0.032
-01
0.06
0.03
0 04 0,02
c=1.030
0.1 1 cc optimisation for anchor text retrieval with PL2 and hp2004
V. JO
0.028
(1
001
10
100
n 0.001
0.01
0.1
10
100
1004)
optimisation
for anchor text rctncval
with PL2 and np2004 step 1+ step 2
0.65 0.6 ++ step I+ step 2x
0.55 054 0.53 0.52
0.55 0.5 0.45

04
0.51
0.35 03 0.25 02 nl(
0.5 0.49 0.48 =81.1350 =10.7286 . .11
0.1
CC
00
0.1
Figure 4.2: The obtained mean average precision (MAP) for different c values tested during the two-step optimisation of anchor text retrieval with PL2 for the topic sets tr2001, td2004, hp2004 and np2004.
60
4.3 Document
representations
for Web information
retrieval
Topic relevance Avg. Length Task tr2000 tr2001 2016.76 1689.98
tr2000 tr2001
5.72 4.18
tr2000 tr2001
78.78 56.25
tr2000 tr2001
14.47 11.42
Topic distillation Home page finding Avg. Length Task Avg. Length Task Full-text 1315.33 hp2001 204.70 td2002 539.66 hp2003 266.12 td2003 hp2004 494.28 357.10 td2004 Title 4.88 hp2001 4.05 td2002 4.48 hp2003 4.88 td2003 4.73 hp2004 4.69 td2004 Headings 14.69 hp2001 13.63 td2002 8.99 hp2003 3.31 td2003 2.82 18.42 hp2004 td2004 Anchor text 84.82 hp2001 1264.65 td2002 300.62 hp2003 3258.32 td2003 902.20 346.49 hp2004 td2004
Named page finding Avg. Length Task np2002 np2003 np2004 np2002 np2003 np2004 np2002 np2003 np2004 np2002 np2003 np2004 782.37 834.71 923.39 5.49 4.60 4.79 40.12 11.67 14.71 63.41 79.62 218.30
Table 4.3: The average length of relevant documents for the different topic sets, ai d for the different document representations. The document length corresponds to Hie for document, indexed tokens after removing stop words. each number of Figure 4.2 displays the range of tested c values for the optimisation document text representation, anchor In particular np2004. of PL2 with the
for the topic sets tr2001, td2004, hp2004, and
for the task td2004, where there is a bias towards home pages.
the optimal c value is 526.7705 (top right diagram in Figure 4.2). For the tasks td2004, (Table 4.3), is 346.49 documents length the text while the average anchor relevant of (Table 4.2). GOV is 21.41 in length text the average anchor The optimal c values obtained for the full text and the anchor text representations documents, of regarding the task td2004, indicate that different representations of documents require different normalisation settings. This is in agreement with Hawking, Upstill & Craswell (2004), who suggested applying different length normalisation for The documents. Web text the full representation of text representation and the anchor hyperlinks, high documents is the from that of benefit number with a such an approach document by text, are not penalised and consequently, a significant amount of anchor length normalisation.
4.3.3
Evaluation
results
different the the representations This section presents evaluation results obtained with documents the various used weighting models. and of
61
4.3 Document
representations
for Web information
retrieval
Several measures have been employed for the evaluation of IR systems in the TREC
Web tracks. Mean average precision has been used to evaluate ad-hoc tasks. The first the mean reciprocal rank of retrieved relevant document (MRR1) has been used to
evaluate navigational tasks. For the evaluation of topic distillation tasks, precision at 10 (P10), mean average precision (MAP) and R-precision (R-Prec) have been employed. In have to order a consistent setting, the evaluation in this thesis is performed using mean average precision. The mean reciprocal rank of the first retrieved relevant document is equivalent to mean average precision when there is only one relevant document for In each query. addition, precision at 10 is expected to correlate with average precision. for the though even optimal setting average precision does not necessarily correspond to the optimal setting for precision at 10. Table 4.4 shows the mean average precision (MAP) for all the tested topics and Each column the different representations of documents. Each row shows the achieved MAP by the five tested weighting models for a task and a document representation. MAP by the a particular achieved shows document representations. for all the tested tasks and weighting model
The entries in bold show the weighting model that results
in the highest MAP for each task and representation of documents.

Row Task PL2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 0.2038 0.2107 0.1997 0.1245 0.0901 0.3355 0.2528 0.2300 0.5651 0.5185 0.4744 0.0281 0.0214 0.0512 0.0759 0.0641 Mean Average Precision (MAP) DLH I(ne)C2 PB2 Full text 0.1606 0.2073 0.1923 0.1746 0.2132 0.2032 0.1738 0.1983 0.1909 0.1091 0.1167 0.1108 0.0856 0.0927 0.0844 0.3331 0.3524 0.3280 0.2608 0.2624 0.2190 0.1956 0.2074 0.2335 0.5034 0.5785 0.5432 0.5095 0.5237 0.4850 0.4029 0.4614 0.4508
Title
BM25 0.2102 0.2132 0.1989 0.1234 0.0956 0.3552 0.2893 0.2276 0.5771 0.5309 0.4853 0.0297 0.0224 0.0514 0.0789 0.0650
0.0264 0.0208 0.0537 0.0759 0.0640
0.0282 0.0208 0.0528 0.0758 0.0640
0.0284 0.0175 0.0501 0.0661 0.0571
continued on next page
62
4.3 Document
representations
for Web information
retrieval
R, ow
Task
continued from previous page Mean Average Precision (MAP) PL2 PB2 I(ne)C2 DLH BM25 Title 0.3288 0.2796 0.3026 0.4014 0.4147 0.4282 0.0501 0.0527 0.0422 0.0684 0.0397 0.1555 0.1174 0.1027 0.1928 0.2432 0.3419 0.0328 0.0417 0.0663 0.1433 0.1271 0.5219 0.6675 0.6025 0.4476 0.4939 0.5498 0.3194 0.2765 0.3095 0.4000 0.4136 0.4287 0.0463 0.0554 0.0420 0.0680 0.0383 0.1506 0.1116 0.0994 0.1855 0.2330 0.3194 0.0222 0.0352 0.0563 0.1216 0.1149 0.4828 0.6317 0.5159 0.3297 0.4187 0.4298 0.3230 0.2860 0.3020 0.3958 0.4148 0.4288 Headings 0.0480 0.0578 0.0432 0.0682 0.0393 0.1607 0.1113 0.0995 0.1946 0.2341 0.3209 Anchor text 0.0244 0.0378 0.0652 0.1239 0.1126 0.5265 0.6423 0.5711 0.4008 0.4797 0.4544 0.3066 0.2726 0.3009 0.3974 0.3975 0.4267 0.0474 0.0527 0.0401 0.0415 0.0336 0.1549 0.1084 0.1037 0.1882 0.2362 0.3204 0.0274 0.0267 0.0581 0.1216 0.1013 0.4337 0.4365 0.4329 0.4287 0.4885 0.5176 0.3287 0.2974 0.3130 0.3996 0.4115 0.4276 0.0511 0.0578 0.0425 0.0676 0.0379 0.1633 0.1173 0.1060 0.1952 0.2510 0.3389 0.0402 0.0436 0.0669 0.1437 0.1261 0.5383 0.6655 0.6043 0.4630 0.5060 0.5225
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
Table 4.4: Evaluation of different document representations with the DFR models PL2, PB2, I(ne)C2, DLH and the weighting model BM25. The bold entries correspond to the MAP. The highest the in that the of models values parameter results weighting model A. 226. 1 in Table on page are given For the topic relevance tasks, tr2000 and tr2001, it can be seen that the full t xt headings, text the title, is than anchor or repof any more appropriate representation (rows 12-13,23-24, documents 1-2 vs. rows resentations of In particular, in Table 4.4). 34-35 and the achieved MAP with the anchor text and title representations is less than 0.0500 (rows 12-13, and 34-35 in Table 4.4). The topic distillation task td2002 is tasks tr2000 the topic task, to and relevance as such retrieval an ad-hoc more similar is documents full the Therefore, text the most effective oiw tr2001. of representation
63
4.3 Document
representations
for Web information
retrieval
(row 3 vs. rows 14,25, and 36 in Table 4.4). For the topic distillation tasks td2003 and td2004, the most effective document representation is the anchor text (rows 37-38 vs. rows 4-5,15-16, and 26-27 in Table 4.4), due to the fact that for those particular tasks, the relevant documents are restricted to be the home pages of Web sites about the query topic. For the same reason, the anchor of documents is the most effective for the home page finding topic (rows 39-41, vs. rows 6-8,17-19, and 28-30 in Table 4.4). For the home sets finding page tasks, the title representation results in similar levels of MAP as the retrieval from the text representation full text representation (rows 17-19 vs. rows 6-8 in Table 4.4). This indicates that t lie title is an adequate description for the name of a Web site's home page, even though its size is limited, and the frequencies of its terms are distributed almost uniformly.
For the named page finding representations documents of tasks np2002, np2003 and np2004, the most effective are the full text and the anchor However, text representations
(rows 9-11, and 42-44 from Table 4.4, respectively). representation that outperforms
there is no document
the other ones consistently.
With respect to the different weighting models that are evaluated, Table 4.4 shows
that in most of the cases the weighting weighting distribution different PB2 DLH. models and of term frequencies models PL2, I(ne)C2 BM25 and outperform the For the title and headings representations, are more uniform he t where the
than the content representation,
weighting
have models very similar
performance.
It should be noted that t he to the parameter estimation
small differences in retrieval process.
be effectiveness may attributed
Before closing with a discussion and some conclusions from the evaluation of the different document representations, the next section investigates an implication of using the Poisson randomness model in the DFR weighting models PL2 and PB2, when the high frequency in have terms the test collection. query extremely
4.3.4
Impact models
of query terms with
high frequency
on the Poisson-based
Considering the weighting models PL2 and PB2 (Table 2.1 on page 19), the Poisson distribution is an approximation estimator distribution Bernoulli the of A=N and is the maxof the distribution's mean and variance, where F is the
imum likelihood
frequency of a term in the document collection, and N is the number of documents in
64
4.3 Document
representations
for Web information
retrieval
the collection.
When A1,
or equivalently
FN,
then the Poisson distribution This is the case for terms
provides a good approximation
Bernoulli distribution. the of
frequency low in large document a collection. with a When the term frequency F is comparable to the size N of the document collection, or equivalently approximation A Poisson does is 1, to then the close not provide a good when This situation is more likely to appear in
distribution. Bernoulli the of
the context of the GOV collection, which is a domain specific collection of documents from governmental organisations. Therefore, terms such as national or federal are distribution because likely times, their to occur many very reflects the topics of the and feder,
documents. The application of stemming transforms these terms to nation frequency. further in increase their of a respectively, and results Example
1 The query NP167 from the known item finding task ki2003 of the TREC 2003 Web track is: Federal Deposit Insurance Corporation, and it corresponds
to a home page finding stemmed to feder, documents of in query for the GOV test collection. The term Federal is GOV in 1,465,491 the times collection. which appears GOV is 1,247,753. Therefore, =F=1,465,491 N-1,247,753 ) 1. The number El
Thereduring be retrieval. considered as stop words can fore, these terms can be ignored when assigning weights to documents. Table -1.5 full for PB2 PL2 text the and weighting models shows the retrieval effectiveness of The terms for which A>1 irrespectively the for terms, the of value query all assigned are scores when retrieval, In A>1. in for terms the each \, result which assigned are not and when scores of in described MAP, to been have as respect with the optimised models weighting case, Section 4.3.2. The c values shown in Table 4.5 correspond to cases where scores are A>I. for terms the with not assigned from MAP in differences resulting For all the tested topic sets, there are only small This ignoring terms. these for terms the or with query .\>1, either assigning weights do have A>1 for t documents term not to when a the that assigned weights suggests lie t thesis, this For MAP. the when of the important remainder resulting on effect an be to terms the PB2 assign used PL2 will query all employed, are or weighting models A the irrespectively documents, value. associated of to weights
65
4.3 Document
representations
for Web information
retrieval
Task tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
Assign scores for .1>1 MAP PL2 PB2 0.2038 0.1923 0.2107 0.2032 0.1997 0.1909 0.1245 0.1108 0.0901 0.0844 0.3355 0.3280 0.2190 0.2528 0.2300 0.2074 0.5432 0.5651 0.4850 0.5185 0.4508 0.4744
Do not assign scores for \>1 MAP c PL2 PB2 PL2 PB2 0.2029 0.1950 12.0603 53.8243 0.2103 0.2054 11.9829 10.7955 1.0132 0.2030 0.1938 1.2796 0.4133 0.2614 0.1245 0.1108 0.1424 0.2086 0.0909 0.0868 0.3400 0.3663 0.3328 0.3278 0.2200 0.3128 0.2446 0.2045 0.5988 0.7904 0.2295 0.2136 2.0209 1.4632 0.5636 0.5403 1.1433 1.4065 0.5193 0.4800 1.9617 2.8387 0.4644 0.4342
Table 4.5: Mean Average Precision (MAP) for full text retrieval with the weighting for A>1 PB2, PL2 terms assigning weight s are employed with and when query models to documents, or they are treated as stop words.
4.3.5
Discussion
Conclusions and
Web TREC tracks, in the the In order to put the obtained results various context of Table 4.6 presents the official measure of evaluation of the best official submitted runs Wherever the for topic tested Web the TREC track sets. of each to the corresponding in (MAP), if it is is available and precision average mean not measure official evaluation in to the it is then TREC official evaluation measure. addition the reported proceedings, full have text the that td2002 shown The evaluation results for tr2000, tr2001, and full For text for tasks. is documents example, ad-hoc very effective representation of best the PL2 submitted run performing the outperforms weighting model retrieval with (0.1997 from 3 Web 2002 track TREC in row the to the topic distillation task td2002 distillation task td2004, For topic 4.6). the Table in 3 from 0.1571 in Table 4.4 vs. row MAP lower in than that PL2 documents results with the anchor text representation of from in 5 0.1791 4.4 in Table 38 (0.1271 from row vs. best row the runs performing of document different four the finding task np2004, all 'T'able 4.6). For the named page (rows 11, TREC in best MAP lower the in than runs performing representations result 4.6). from Table 11 4.4 from Table 44 22,33, and vs. row introduction the first towards of effechas the Overall, this section step presented document four has It IR. for Web representations examined tive retrieval approaches documews. Web text headings: the of anchor and including: the full text: the title; the The hyper-parameters have BM23 I(ne)C2, PB2. PL2, and of the weighting models
66
4.4 Combining
document
fields
Row 1 2 3 4 5 6 7 8 9 10 11 12
Tasks tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 mq2004
Run name j2cbt9wcsl fub0lbe2 thutd5 csiro03tdO3 uogWebCAU150 tnoutlOepCAU csiro03ki01 MSRC04C12 thunp3 LmrEq MSRC04B2S MSRC04B2S
Official Evaluation
Measure
N1AP (if available and not official evaluation measure) -j %, IAP=0.1571 MAP=0.1543 MAP=0.7351 MAP=0.7232 MAP=0.5389
MAP=0.2011 MAP=0.2226 P10=0.2510 R-Prec=0.1636 MAP=0.1791 MRR1=0.774 MRR1=0.815 MRR1=0.749 MRR1=0.719 MRR1=0.688 MRR1=0.731 Avg=0.546
Table 4.6: Evaluation results of the best official submitted runs to the Web tracks from TREC-9 to TREC 2004. For the mixed query task mq2004, Avg stands for the average home for MRR1 the distillation for MAP the the topic named page and page and of finding tasks. been set in order to directly optimise mean average precision. Note that the weighting does it Section 2.3.3.4, discussed in is DLH not require and as parameter-free, model is full have text that The very retrieval shown results evaluation any optimisation. in for improvements is However, there for retrieval tasks. room search ad-hoc effective has the that In this for Web weighttasks. shown section addition, specific effectiveness for terms PB2 with very PL2 query ing models weights assigning when are robust and high collection frequencies. In order to improve the effectiveness of the employed retrieval approaches, the in fields, introduced the is section. next document or representations, combination of
4.4
Combining
document
fields
been has document evaluated separately so The effectiveness of each representation from the in improvements coiiminvestigates the effectiveness far. This section retrieval documents. for better in representation bination of different fields, order to obtain a in by the models evaluated weighting The combination of fields is achieved extending documents; text Web the body fields anchor the of Section 4.3. The employed are: defined field is body the The documents. Web as links; title the incoming of and of docfull text Compared to the text between the HTML tags <BODY>and </BODY>. ument representation. Web but title the headings. of not the body field includes the
67
4.4 Combining
document
fields
documents.
4.4.1
Weighting
models
for field retrieval
This section extends the DFR framework with a new normalisation method, which takes into account the fields of Web documents, that is the terms that appear within HTML tags. This new normalisation method applies term frequency norparticular malisation and weighting for a number of different fields. The per-field normalisation has been similarly applied in (Zaragoza et al., 2004) using the BM25 formula. In this thesis, a different document length normalisation Per-field normalisation
be to need combined. ferent fields of documents. There
formula is used.
the information retrieval from dif-
is useful in a Web context, where different document fields

are several ways to combine One approach involves performing independently
from each field and then, merging combination different the of
the ranked lists of results
(Fagin et al., 2003). The of rele-
fields can be achieved as the linear combination representations
for document the each of vance scores 2003; Savoy et al., 2003; Tomiyama the combination linear combination fields, of
(Gao et al., 2001; Kamps et al., language of modelling, is achieved with a
et al., 2003). In the context document
different or
representations,
of language
models computed 2003).
for each of the fields or document
representations
(Ogilvie
& Callan,
Plachouras et al. (2003,2002)
documents extended with the anchor text of their In addition,
incoming hyperlinks and treated the anchor text as a field of the document, effectively body from frequencies the terms the and the anchor text. of adding Plachouras, He & Ounis (2004) re-weighted the documents according to the importance by documents' terln fields increased the query when a percentage a certain scores of and field. in a particular appeared Robertson et al. (2004) suggested that it is more appropriate to weight and combine the frequencies of terms from different fields in a pseudo-frequency, before applying a (2004) & Craswell Upstill Hawking, term weighting model. suggested that terms in the body and the anchor text of Web documents are distributed very differently. For document', because document, in the times term of a may occur many example, a in hand, text On times the term the of a anchor many appearing a other verbosity. document's incoming hyperlinks represents votes for this document. Thus, performing normalisation independently weighting and for the various fields allows to take into
68
4.4 Combining
document
fields
account the different characteristics
of the fields, and to achieve their most effective 2 from Equation (2.16) (Amcorresponds
combination.
The per-field normalisation 2F extends normalisation & Van Rijsbergen, 2002), so that the normalised term frequency tfn ati to the weighted sum of the normalised term frequencies tff tfn=Z f
for each used field f: (4.1)
(Wf
tff"log2(1+cf"fl
lf
(cf>0)
t -f is field f, the is the average length of field f in the collection. where wf weight of 1f is the length of field f in a particular document, and cf is a hyper-parameter for field f. Note that normalisation each After defining 2 is a special case of normalisation 2F, when the document is field, entire considered as one with weight 1. (Table PL2 2.1 on normalisation model (4.1) be PL2F by from following Equation 19) in to tfm the can extended page replacing 2F, the DFR weighting formula: wd9_E gtfn tEgtfn+1` (tfn"1092tfn-I-(A-tfn)"logee+0.5"loge(2,7r A "tfn))
for d A=N document is to the query q, score of relevance where wd,q corresponds the mean and variance of a Poisson distribution, F is the total term frequency in the fn In is documents in N is the the the addition, qt collection. number of collection, and qtf f is by fn= frequency, the term term qt query where given qt query normalised qtfx,
frequency, fma,; is the maximum and qt frequency term query among the query terni. s.
The weighting models PB2 and I(ne)C2 (Table 2.1 on page 19) are extended in a denoted (4.1). The from Equation by tfn are models extended replacing similar way
by PB2F, and I(ne)C2F.

The weighting model DLH is extended by replacing the frequency tf of a term t field f: in frequencies t tf the the each with weighted sum of f of
>ii., r" tff

f and it is denoted by DLHF.
(4.2)
Zaragoza et al. (2004) proposed BM25F, an extension of BM25 with per-field noi-malisation. The formula of BM25F is given below: N-n+0.5
tfn
k1 + tEq tfn
log
where tfn =EWfn+0.5 (1 + bf(lf/lf) j
tff
(4.3)
69
4.4 Combining
document
fields
field-dependent bf is b to the a normalisation parameter, similar of parameter where: BM25 (Equation (2.6) on page 11); kl is a parameter that controls the saturation of tfn, he kl field f BM25; 1f is length in the the to the t of of average similar parameter document. The parameter In order document collection; and lf is the length of f in a particular
field f. is the the weight of wf In Equation (4.3), the frequency of a term in the query qt f is ignored.
to make the comparison of the employed weighting models more fair, the following formula is used for the conducted evaluation of BM25F, where the original query terra frequency component from BM25 (Equation (2.6) on page 11) is added:
wd, 9 _-
1:
tEq
tfn k1 + tfn
(k3+1)gtf k3 + qtf
log
N-n+0.5 n+0.5
where tfn =
1: f
wf
tff (1 +bf(lf/lf)
(4".1)
The value of k3 is set to 1000, in the same way as described in Section 4.3. This value k f. k3 that qt essentially means of 3+q ff documents; Web body in fields the thesis: different this There are three of considered field body The documents. Web hyperlinks; the title incoming of the anchor text of and It includes the HTML </BODY>. tags <BODY> the text to the and within corresponds discuss the The documents. Web of but setting the title headings, sections next of not (Section 4.4.2), the field-based for evaluation hyper-parameters the present the models (Section 4.4.4). discussion (Section 4.4.3), conclusions and a provide and results
4.4.2
Parameter
setting
for field-based
weighting
models
Webb different for combining This section focuses on the parameter setting of the models The Section 4.4.1. in normalisation fields, per-field document presented which were introduction in fields the of arg the frequencies, the result of term weighting and of he in For t in the hyper-parameters example, models. weighting of additional number kl b if hyper-parameters, while considered, are two and BM25, only there are case of The I(ne)C2. PB2 PL2, DFR in the and models there is only one hyper-parameter body, text If the hyper-parameter. have does anchor DLH any not weighting model has BM25F parameters: seven the then fields model weighting title are considered, and for the bt; ba, bb, the of each Wt kl Wa, wb, weights and the parameter ; the parameters PB2F PL2F. The fields, and weighting models body, anchor text, and title respectively. the wb. v'u. hyper-parameters: the pct weights and have ct ca, ch. I(ne)C2F parameters six
70
4.4 Combining
document
fields
for each of the body, anchor text and title fields, respectively. The weighting model DLHF has only 3 hyper-parameters wb, wa,,wt related to the weights of the fields.
The values for these parameters (2004). al. For the case of BM25F, are set experimentally, a two-dimensional The weight as suggested by Zaragoza et for the parameters
optimisation
bf and kl is performed and the weights optimised
for each field f.
field f is set equal to 1, the of With this first step, the
of the other
fields are set equal to zero. Next, the weights
value for bf is set. optimisation
of the fields are set equal to 1. and kl is performed, using the already process involves
a one-dimensional optimised setting
for the parameter
values for bf. wf
The third
last and step of the optimisation The weight is performed of the body
the weights
of the fields. optimisation
wb is set equal to 1,
and a two-dimensional wt. During
in order to set the weights wa and
the optimisation into account
of the weights
kl is the value of adjusted wo, and wt, due to the field
by taking weights
the difference
in the average term frequencies
(Robertson
et al., 2004) :
1 -1 The hyper-parameters
frequency term weighted average frequency term unweighted average
(4.5)
PB2F I(ne)C2F PL2F, the are set and of weighting models in a similar way. First, the parameter cf is set for each field separately, by setting the weight of f equal to 1 and the weights of the other fields equal to 0. Next, the ire is is body 1, two-dimensional the optimisation performed a and wb set equal weight of fields, for the text title the to the and respectively. anchor order set weights wa, and wt Regarding the weighting model DLHF, the weight of the body wb is set equal to 1 and there is only a two-dimensional
for the parameters for the parameter
optimisation
to set wa, and wt. opand 1 one-dimensional
Overall, setting the hyper-parameters

timisations optimisation isation
for BM25F involves 4 two-dimensional

w f,
bf and the field weights
k1. In the case of the DFR models that employ norrnul3 one-dimensional optimisations optimisation for the term frefor the weights olp-
2F, it is necessary to perform parameters the weighting optimising
quency normalisation wf. Furthermore,
cf and 1 two-dimensional model DLHF the DFR requires
only one two-dimensional
timisation. demanding optimisations.
Therefore.
weighting
is less computationally models of two-dimensional provalues
than optimising
BM25F,
because of the lower number have been performed
All the optimisations in Section 4.3.2.
with the same two-step the parameter
described cess
Table A. 2 on page 227 displays
71
4.4 Combining
document
fields
for the weighting models PL2F, PB2F and I(ne)C2F. Tables A. 3 on page 228 and A. 4 on page 228 display the parameter values for the weighting models DLHF and BM25F. respectively.
4.4.3
Evaluation
field-based of
weighting
models
This section presents the evaluation results for the field-based document models. Table 4.7 presents the evaluation of the weighting models PL2F. PB2F, I(ne)C2F, DLHF BM25F, and where documents with the body, anchor text and title fields are considered. The bold entries correspond to the weighting model that achieved the highest MAP for a particular task. The Tables A. 5, A. 6, and A. 7 in the Appendix A contain the precision at 10, the mean reciprocal rank of the first retrieved relevant document, and the number of retrieved relevant documents for the evaluated retrieval approaclics.
Row 1 2 3 4 5 6 7 8 9 10 11 Task tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 PL2F 0.2047 0.2144 0.2155 0.1745 0.1483 0.6450 0.7281 0.6559 0.7174 0.7657 0.7437 Mean Average Precision PB2F I(ne)C2F DLHF 0.1927 0.2066 0.1699 0.2083 0.2199 0.1833 0.2115 0.2020 0.1764 0.1650 0.1577 0.1571 0.1316 0.1400 0.1343 0.6252 0.6787 0.5534 0.6743 0.7201 0.6244 0.5889 0.6519 0.5770 0.6888 0.7302 0.5829 0.7199 0.7068 0.5963 0.7189 0.7048 0.5354 BN125F 0.2097 0.2231 0.2133 0.1876 0.1497 0.6874 0.7446 0.6731 0.7277 0.7138 0.7163
Table 4.7: Evaluation BM25F. The weighting
PL2F, PB2F, I(ne)C2F, the of weighting models
DLHF
and
PL2F, I(ne)C2F models
for BM25F the tasks and perform well With distopic to the respect
tr2000, tr2001, and td2002 (rows 1-3 in Table 4.7). tillation
tasks td2003 and td2004, the weighting model BM25F outperforms the other
four weighting models (rows 4-5 in Table 4.7). For the same tasks, PL2F is the most For home findDFR the the page effective weighting model among weighting models. ing topic sets, the most effective weighting model is BM25F (rows 6-8 in Table 4.7). Regarding the named page finding tasks, the best performing models are I(ne)C2F f'u>r
(rows for (row 10-11 PL2F Table 4.7), in 9 tasks the and and np2003 np2004 np2002
in Table -1.7). The weighting model DLHF is outperformed by the other four weighting (rows home finding finding for both 6-11 tasks the the page and named page models
72
4.4 Combining
document
fields
from Table 4.7). Overall, the results confirm that the evaluated weighting models are statistically independent, since they are based on different probabilistic models, and they result in different performance.
4.4.4
Discussion
and conclusions
The combination of content retrieval with information from different fields of document s results in very good performance and improvements in retrieval effectiveness for Web specific search tasks, compared to the results obtained with retrieval from each document representation separately (Table 4.4). For the ad-hoc retrieval tasks, employing field-specific term frequency normalisation and weighting of the different fields result in improvements small of retrieval effectiveness. For example, full text retrieval with the PL2 for (row MAP in Table 0.2107 2 in 4.-1), the task tr2001 weighting model resulted field-based the while weighting model PL2F for the same task resulted in MAP 0.214-1. from using field-based weighting models are greater for the honte finding finding For from tasks. the anchor text and example, retrieval named page page The improvements document representation (row for PL2 in MAP 0.5498 the task np2004 with resulted 44 row in Table 4.4). The field-based weighting model PL2F for the same task np2004 (row improvement 35% Table MAP 0.7437 in 4.7), in 11 of represents an which resulted in MAP. (Craswell TREC2004 k, Web in track to the runs of performance respect Hawking, 2004), PL2F achieves higher MAP for the task np2004 (0.7437 from row 11 With in Table 4.7), than the most effective submitted run (0.7232 from row 11 in Table 4.6). However, there is still room for improvements regarding the home page finding tasks. The MAP of the field-based weighting model I(ne)C2F for the task hp2004 is 0.6519 (row 8 in Table 4.7), while the most effective submitted run in the same task of TREC 2004 achieved 0.7351 (row 8 in Table 4.6). Overall, per-field normalisation has been shown to be particularly effective. The field-based have the that weighting models most effective shown evaluation results BM25F A BM25F. the I(ne)C2F, PL2F, aiid weighting model comparison of and are the DFR weighting models, which employ normalisation 2F, shows that none of the A drawback for tasks tested the consistently. all models outperforms the other ones in hyper-parameters the is introduction the the of additional of per-field normalisation frequency for With introduced term the to the parameters respect weighting models.
73
4.5 Query-independent
evidence
normalisation
and the weighting of fields, the DFR weighting models with normalisation 2F have an advantage of fewer hyper-parameters, compared to BM25F.
4.5
Content
retrieval
with
query-independent
evidence
The previous sections have focused on employing query-dependent evidence in order to retrieve and rank documents. The assigned weight to the retrieved Web documents depends on the distribution of the query terms in the body, as well as in the title and the anchor text of incoming hyperlinks. In addition to the query-dependent evidence,
discussed Sections in 3.3 and 3.4, the ranking of Web documents can be further and as by enhanced using other query-independent sources of evidence, such as the URL of Web documents (Section 4.5.1), or the analysis of the hyperlink structure of the Web (Section 4.5.2). Section 4.5.3 presents the evaluation results from combining field-based weighting models and the employed query-independent Finally, sources of evidence. Section 4.5.4 closes with a summary and some conclusions.
4.5.1
In order
URLs
documents Web of
browse and a certain it. This Web document, it is necessary Resource
to be able to locate
to have a way to uniquely Locators (URL) (Berners-Lee
identify
is achieved with
the Uniform
URL for The 1994). an availgeneral syntax of a et al., is <scheme> : <scheme specific part>, where <scheme>
able resource on a network
specifies the scheme, or the protocol specific part> is specified which
to use for accessing the resource, and <scheme protocol. For example, the URL for a / of
by the particular
Web document,
is called news. html
directory in is it the root stored and
In //www. 80/news. html. this is dcs. host http: dc the gla. ac. uk: s. gla . ac . uk, www. URL, news html corresponds to the path of the URL, www. dc s. gl a. ac. uk corresponds . Web hosts http the that fully the and to the page, network server qualified name of corresponds to the HyperText Transfer Protocol (HTTP) that is used for requesting to the staiiin t he
and transferring dard port URL. that
Web documents. the HTTP
In addition,
the number
80 corresponds
listening is server
to and it is usually
included not
In the context of Web Information be used as a query-independent
Retrieval,
the URL of a Web document can document Web of a
indication
functionality the of
74
evidence
a group of related Web documents, which form a Web site. This is based on two observations. First, two common conventional filenames for home pages are index. html or default html. Second, due to common practice in the organisation of . Web documents in Web sites, the entry point or home page of a Web site is more likely within to be in the root directory of a Web site. Westerveld et al. (2001) and Kraaij et al. (2002) considered both observations, in order to compute the prior probability that a Web document with a certain type of URL is the home page of a Web site. They identified four types of URLs and found that the Web pages with a root URL, such as http: //domain/, are highly likely to be home pages. Then, they used these prior probabilities in a language modelling approach for the home page finding task of the TREC 2001 Web track. Tomlinson (2005) assigned a distinct term for each type of URLs. During indexing, the term that corresponded to the type of the document's URL was added to the index. Then, during retrieval, the idf of the terms, which corresponded to the types of URLs, for documents. the were used weighting The second observation has been used in order to employ evidence from the length (2001) & For Savoy Rasolofo length URL. URL, the the of parts of example, or of a (2004b) defined URL length URL. Kamps `/' in the the also et al. a counted number of in terms of the number of characters, and the number of `.' in the domain name of the URL. In this thesis, the query-independent Web documents from URL the of evidence
is based on the length of the URL path (Plachouras & Ounis, 2004; Plachouras et URL length For & Ounis, the the He 2004). Plachouras, 2003; of path example, al., http: //www. dcs. gla. ac. uk/news. html corresponds to the length in characters of the fact by justified is This that is the 9 html, employchoice characters. string news. which ing the fully qualified domain name may bias the resulting scores towards the Web sites
that have been present for a longer period of time, and had an advantage in registering does domain length However, the domain the name not provide ally of names. shorter indication home Web to the document Web the page. site corresponds of about which The combination of the URL path length with query-dependent evidence requires
the URL path length to be transformed into an appropriate score. More specifically, URL longer documents for Web be lower length the URL a path, the with score should likely L RLs. for Web documents higher be that the it are more with short should and
75
evidence
to correspond to home pages. An appropriate transformation formula (Zaragoza et al., 2004):

k
is given by the followhig
URL(d) = k,, + URLpathlen(d)

URL-related URL(d) is the where score assigned to document d, URLpathlen(d)
(4.6)
corre-
d k,,, is document in URL length the to the a parameter. and of characters of path sponds When length. URL URL(d) to the the path saturation of with respect which controls URL length k,,, the the to takes the parameter path of of small values with respect documents, the documents with short URL paths are more favoured. For the higher k, values of the effect of the URL path length is smoothed and the resulting score is less biased towards the documents with shorter URL paths. This is shown in Figure 4.3 for three different values of k,,,.
1 0.9 0.8 0.7 1/(1+URLpathlen(d)) 10/(10+URLpathlen(d)) ------100/(100+URLpathlen(d))
0.6
b
0.5 0.4 0.3 0.2 0.1 0

0 20 40 60 80 100 ----------------------------------------------
URLpathlen(d)
Figure 4.3: The monotonically ku = 1,10 and 100. The URL-related
decreasing transformation
for the URL path length, for
URL(d) score
lie linearly t d is document for the combined with
follows: corresponding content analysis score as

Wd, 9, URL = Wd, q +w
URL(d)
(4.7)
76
evidence
is the document where Wd, d, to the content analysis score assigned using any of the q retrieval approaches described in Sections 4.3 and 4.4, Wd, URL is the combined score q, for document d, and wu is the weight of the URL-related score in the linear combination. Plachouras & Ounis (2004) have also experimented with multiplying the content URL-related the follows: analysis scores Wd, with score as q 1 wd'q'URL - Wd, q ' 1og2(1 + URLpathlen(d)) (4.8)
This approach has been particularly effective (Craswell & Hawking, 2004), but it has to be applied for the top ranked documents only, because it alters the score distribution
significantly. robust, On the other hand, the linear combination the original used in Equation distribution (4.7) is more because it does not alter significantly
of scores. Hence,
it can be applied for all retrieved documents.
4.5.2
Hyperlink
structure
analysis
This section focuses on the effectiveness of combining content analysis with queryindependent evidence from the analysis of the hyperlink structure. The hyperlinks that exist between Web documents can be considered as an indication that the author Web document believes destination document Web is related to the the the of source source one, or it is worth viewing. When a particular number of incoming hyperlinks Web document has a significant document. from other Web documents, this suggests that it is for Web documents, iii a score
Web document, it is or an authoritative either a popular In order to compute a popularity, or authority
be Markov Web the chain. The graph can modelled as a query-independent way, for Markov in the popularity the chain stands probability of entering a particular state or the authority-based PageRank documents. For Web the example, scores of score of of visiting the state that represents the in order to define a with which any
Web documents correspond to the probability
Web document, in a Markov chain for the whole Web (Page et al., 1998). However, the hyperlink structure is Web the not necessarily appropriate of Markov chain. For this reason, PageRank introduces a transformation Web document, even the ones without finite probability. This transformation
be hyperlinks, incoming can visited with a any E, the to rank source which was corresponds
described in Section 3.3.2.1.
77
evidence
This section also introduces the Absorbing Model, a novel hyperlink structure analysis model, which employs a different transformation of the Web graph in order to define a Markov chain. Instead of adding a small but finite probability to the probability of visiting any state in the Markov chain, the Absorbing Model introduces the states that have a one-to-one correspondence with the states, of the original Web documents. clones, a set of virtual The remainder of the section is organised as follows. Sections 4.5.2.1 and 4.5.2.2
basic definitions for Markov chains. Section 4.5.2.3 discusses the transforthe present Web the graph that are required to define a Markov chain. The Absorbing mations of Model and its instantiation 4.5.2.5, respectively. with static priors are introduced in Sections 4.5.2.4 and The introduced notation and terminology for Markov chains are (1957). by Feller Finally, PageRank to that the used and the combination of similar Absorbing Model with the field-based weighting models is discussed in Section 4.5.2.6. 4.5.2.1 Markov chains
Each document is considered as a possible outcome of the retrieval process. Therefore, the documents are orthogonal, or alternative states dk, which have a prior probability
be to retrieved. Pk The prior probability by defined is the system. Pk Each pair of
documents (di, dj) has an associated transition
I di) of reaching p(dj probability pik = the document dj from the document di. This conditional probability p(djl di) can he as the probability dj document having the the outcome, as when of
also interpreted
document di is the evidence. Both prior and transition space, which are: EPk
k
probabilities
must satisfy the conditions of a probability (4.9)
=1
>Pij
Condition (4.10) imposes that each state d1 must have access to at least one state dj for arbitrary
for some j, where it is possible that i=J. In order to obtain a more compact representation of probabilities the it is to prior probabilities express useful sequences of states, P and as a row vector
78
evidence
the transition
probabilities
as a row-by-column
P=[Pk]
M, matrix as follows:
(4.11)
M= Then, let M'
1 Pig M of with itself n-times:
(4.12)
be the matrix product rows-into-columns

Mr''_ [P13 I
(4.13)
In a first order Markov chain, the probability of any walk from a state d1 too a state dj depends only on the probability of the last visited state. In other words, the probability of any sequence of states (dl, ... d, ) is given by the relation: ,
n-1
p(di,...,
where pi is the prior probability
dn) =Pi fl P(dz+lIdz)

i=l
(4.14
document dl. It is possible to define Markov chains of
higher of order, where the probability
depends of a walk on more of the visited states than just the last one. In this thesis, only first-order chains are considered for the In terms of matrices, the element p of the product M'
purpose of hyperlink structure analysis. corresponds to the probability p(di, dj) of reaching the state dj from di by any random walk, or sequence of ... , (di, dj) made up of exactly n states. states ... , If p>0 for some n, then the state dj is reachable from the state di. A set of {di} C C= be if inside is to can reach all and only all other states said closed any state C. inside states The states in a closed set are called persistent or recurrent states, from the state di and terminating at state dj, can be
since a random walk, starting
from definition Indeed, di the to through of the closed set, ever extended again. pass the probability pj >0 for some m. If a single state forms a closed set, then it is called
absorbing, since a random walk that reaches this state cannot visit any other states. :A
least it is is in transient and must reach at one state, which not any closed set, called
from in Thus, is the transient state there a random walk, starting state a closed set. di, that cannot be ever extended to pass through di again.
79
evidence
A useful property of Markov chains is the decomposition characterisation. It can

be shown that all Markov chains can be decomposed in a unique manner into nonCl, C2i C,,, and a set T that contains all and only all t lie overlapping closed sets ... , transient states of the Markov chain (Feller, 1957). If this decomposition results in a C, Markov is then the closed set single chain called irreducible.
Figure 4.4: The Markov Chain representing the Web graph.
Example
2 Figure 4.4 provides an illustration
in different types the a of states of

between the states in chains, states
Markov chain. The directed graph may be seen as a Markov Chain consisting of the
1,2,3,4 states the Markov and 5. The arcs represent According the possible transitions to the terminology for Markov given above
chain.
State is 2 transient form they state. a 1,3,4,5 states. are persistent a closed set and in decomposed be it is irreducible, Markov a non-empty Therefore, this can as chain from 5 If to the state arc states. set of transient states and a single set of persistent becomes 5 itself, from then 5 to by state. an absorbing is 3 state arc an replaced state
O
4.5.2.2 Classification of states (4.14), the probability (di, from initial dj the any state of reaching
According to Equation
by any random walk w= state
dj) is given below: ... ,

00
Epi Z]
iwi n=1 i
(13)
Epij
00
(4.1)
80
evidence
However, in a Markov chain, the limit

be infinite. The limit
lim
n->oo
does not always exist, or it can
is n a multiple Periodic periodic, pkj. With
does not exist when there is a state di such that p ini =0 unless fixed integer t>1. In this case, the state di is called periodic. of some if t is the largest integer which makes the state di pkt as new transition probabilities
states are easily handled: then it is sufficient the new transition
to use the probabilities probabilities,
states dj will become aperiodic. chain are aperiodic (Feller,
be 0 than the ptn will greater and periodic ii Hence, it may be assumed that all states in a Markov
1957).
Recurrent states in a finite Markov chain have the limit of p greater than 0 if the dj is from di, for reachable state while all transient states this limit is 0:
n
l im P=0
if di is transient
(4.16)
l im p>0 In an irreducible
if dj is persistent and dj is reachable from di
(4.17)
finite Markov chain, all nodes are persistent and the probability node of the graph is positive. In other words,
from them of reaching an arbitrary um pj>0 irreducible such that: uj =E

i
for k. Due i )n p= to this property, an and all uj gym Markov chain possesses an invariant distribution, that is a distribution eck p lim u3 = n-+oo (4.18) does not
p2 um and =
uiPij
and
In the case of irreducible affect the unconditional
Markov chains, the vector P of prior probabilities probability of entering an arbitrary
state, since all rows are
identical in the limit matrix of Mn. Indeed: lim

7t-*00
E
ii2
Pipi
lim = n-aoo
lim Pipkj = n-+oo Pkjn
EPi
= uj
Pi
= uj
(4.19)
Because of this property, the probability is called invariant or stationary lim71, > If the distribution
distribution
uj in a irreducible Markov chain
distribution. pip is taken to assign weights to the nodes, then it
is equivalent to the invariant distribution uj in the case of an irreducible Markov chain. More generally, if the Markov chain is not irreducible or does not possess an invariant distribution, E1 ptp can be still used to define the distribution -4x distribution depend it However, the pi. on prior will node weights. then lim, of the
81
evidence
4.5.2.3
Modelling
the hyperlinks
of the Web
Markov chains can be applied to model the hyperlinks between documents on the Web. Let R be the binary accessibility relation between the documents. More set of it is R(dd, dj) =1 if there is a hyperlink from document d, to document dj, specifically, and 0 otherwise. Let o(i) be the number of documents dj which are accessible from di: l{dj R(i, j) = 1}1 o(i) _ : This is equal to the outdegree of a Web page. The probability document di to document dj is defined as follows: R(i, 3) pik of a transition (4.20) from
(4.21)
The above definition of pik assumes that there is an equal probability to make a transition from document di to any of the documents pointed to by di, irrespectively of their
content, or the type of the hyperlink.
There are two main implications

Equation Web documents do not contain text files that
from using the transition

any hyperlinks
probabilities
chain. First,
defined in
there are
(4.21) in order to model the Web graph as a Markov that
to other documents. any HTML markup.
Such docuIn this case, in Equa-
be can plain ments the Equation tion
do not contain
(4.10) is not satisfied be used in order
and the transition to define a Markov
probabilities
defined
(4.21) cannot
from the Web graph. chain Equathe and
Even if all the Web documents tion (4.10) is satisfied,
have hyperlinks
to other Web documents Markov
all the transient
states in the resulting
have chain would
hyperlinks. from incoming Therefore 0, independently their the of number = this limit cannot be used as a score, since only persistent states would have a significant l im probability of being visited during a random walk. First, all the in a suitable way, such that There are two possible ways to overcome the above two implications. by be linked assigning a new probability states can p0
jpz -i zj I<E. In this way all states become persistent. In other words the Web graph is transformed into a single irreducible closed set, namely the set of all states. Therefore, all states receive a positive probability the Markov chain. that they will be visited in a random walk in This approach is used in PageRank, where the assumed randoiii
82
evidence
surfer may randomly jump with a finite probability to any Web document. Second. the original graph G can be extended to a new graph G*. The new states of the extended graph G* are all and only all the persistent states of the graph G*. The scores of all t he in the original graph, irrespectively states of whether they are transient or persistent, be uniquely associated to the scores of these persistent states in the new graph. will The latter is the approach that is used to define the Absorbing Model. 4.5.2.4 The Absorbing Model of the Web graph. The
The Absorbing
Model is based on a simple transformation
G is projected onto a new graph G*, the decomposition of which is made original graph {Ci, } T=G C, transient in up of a set of states and a set of absorbing states, ... , other words a set of singular closed sets. The state Ci is called the clone of state di of the original graph G. Any state in G has direct access only to its corresponding clone, but not to other clones. Since the clones are absorbing states, they do not have direct access to any state except to themselves. The Absorbing Model is formally introduced as follows: Definition 1 Let G= (V, R) be the graph consisting of the set V= {di} of N doc-
R(di, dj) binary the accessibility relation uments and =1
if there is a hyperlink frone N additional states
di to dj and 0 otherwise. The graph G is extended by introducing dN+i, i=1,...
N, called the clone nodes. These additional nodes are denoted as: , dN+i = di* and the accessibility relation R is extended in the following way:
R(d2 d) = R(d, d! ) = 0, d , R(di, di) =1 R(di, di)=1

The transition probability
d2 ,i=1,
for: except ... ,N
dj is: d2 from to state state pij
R(di, dj) Pij - Ijdj R(d,, dj)=1}I :

denominator the where di. from transitions for the the state possible number of stands Ow to the graph according of
The following example illustrates the transformation definition Absorbing Model. the of
83
evidence
Figure 4.5: The extended Markov Chain including the clone states. Example 3 Figure 4.4 shows a graph that represents a part of the Web. Figure 4.5
Absorbing Model. definition the the the transformed to shows according of same graph, In this case, the states 1 to 5 become transient and the only persistent states are the introduced newly introduced The 1* 5*. transformation to states in results removing Web from the graph, as there are no closed sets consisting original any absorbing states of any of the original states. With the introduction
11
dj, N j=1, the the states original clone nodes, all of .... become transient, while all the clone states dj*, j=1, the only persistent are ... ,N Markov in In the the chain state original probabilities of visiting a states. other words, become: (4.22)
Pjk
is: for it the clone states while
Pik -4 Ujk,
k=N+1,
... ,
2N
(4.23)
84
evidence
where Ujk stands for the probability
that a random walk starting
from state dj will
dk. The Absorbing Model score s(dk) of a state dk is given by the through pass state unconditional probability of reaching its clone state d*:
s(dk) _
k* where =k+N
Intuitively,
"absorbed" This
PjUjk*
(4.24)
k= N. 1, and ... ,
the Absorbing Model score measures the probability
he is browsing while the incoming other documents
beiiig of a user
in its vicinity. If a if hyperlinks.
by a Web document, depends
probability
on both links,
and the outgoing Model
document
has many outgoing links,
then its Absorbing that
score is low, while
it has few outgoing higher. Additionally,
it is more probable with
its Absorbing number
Model score will he links, have a have a lower as
documents
a significant
of incoming links
high Absorbing score. Therefore,
Model
score, while
documents
without
incoming
the higher values of the Absorbing for documents. Model has two main qualitative
Model score can be considered
evidence of authority The Absorbing
differences
from PageRank.
First,
links PageRank depend incoming in the the the on of scores mainly quality of while by its Absorbing Model document's is document, in the the score outgoing affected a links. The second difference is that PageRank scores correspond to the stationary distribution from Web Markov the the graph after adding chain resulting of probability Absorbing Model On hand, documents. between link the the other every pair of a does not possess a stationary distribution, Absorbing Model the therefore, scores and Depending on the way the prior For documents. the of
depend on the prior probabilities probabilities
introduced. different defined, the to model maybe extensions are
in the the the results a prior probabilities content retrieval scores as example, use of (Amati link dynamically to analysis et combine content and simple and principled way Absorbing Model. Dynamic 2003), the al., called On the other hand, if the prior probabilities are independent of the content retrieval, Abbe in it defined, be the Model the Static Absorbing the next section. seen as will can PageRank. This be the to Model case of computed offline, similarly scores can sorbing flexibility its Model Absorbing the application enables of way. in either a query-dependeiit,
or a query-independent
85
evidence
4.5.2.5
Definition
of the Static Absorbing
Model
From the possible ways to define the prior probabilities independently of the queries. such as the document's length, or its URL type, one option is to assume that they are This approach reflects the concept that all the documents have being an equal chance of retrieved, without taking into account any of their specific As characteristics. a consequence, the prior probabilities are defined as follows: uniformly 2 (Static mode priors) The prior probability trieved is uniformly distributed over all the documents: Definition Pk =Nk=1,..., N that a document dk is redistributed.
(4.25)
N the where number refers to the total number of states in the original graph, that is the total number of documents. The prior probability to zero. When the static mode priors are employed, the Absorbing Model score s (dj) of a document dj is given from Equations (4.24) and (4.25) as follows:
s(dj) = >PitLij* i _
for the clone nodes is set equal
Nuij*
Euij*
i
(4.26)
In other words, the Absorbing Model score s(dj) for a document dj is the probability from by dj* its any state, clone node performing a random walk, starting of accessing with equal probability. The interpretation description derived is in this a straight-forward score of Model in Section 4.5.2.4: a from intuitive the manner Absorbing the of
document has a high Absorbing
Model score if there are many paths leading to it.
As a result, a random user would be absorbed by the document, while browsing the documents in its vicinity. 4.5.2.6 Combination Model of field retrieval with PageRank or the Absorbing
It is necessary to combine the hyperlink documents, similarly
Web the of content analysis analysis with
Web document URLs from the the to s of case of using evidence in Section 4.5.1. In the case of combining the scores, a transformation of the hypcrhyperlink because is the content and analysis scores required, st ructlire
link structure
86
evidence
analysis scores follow different distributions. the content analysis score distribution distributions: exponential a Gaussian distribution distribution
Indeed, Manmatha et al. (2001) modelled of the retrieved documents as a mixture of two
for the scores of the relevant documents, and a for the scores of the non-relevant documents. On the other
hand, Pandurangan et al. (2002) suggested that the values of PageRank follow a power law. Therefore, there are only few Web documents with a high PageRank score, while Web documents have a low score. the most of Plachouras et al. (2005) experimented with a Cobb-Douglas utility the content and hyperlink analysis scores are multiplied:
Wd, q, L = Wd, q
function, where
LS(d) "
(4.27)
In order to address the difference in the score distributions, perlink analysis scores in the following way: log2(shi ft" LS(d)) Wd, Wd, q,L q"
they transformed the hy-
(=1.28)
is LS a parameter and where shift corresponds to the score computed by a hyperlink (PR), Static Absorbing PageRank Model the structure analysis method, such as or (SAM). The transformation in better Equato resulted retrieval effectiveness compared is this that multiplying of approach the content analtion (4.27). However, a limitation
document hyperlink the the transformed scores greatly changes analysis ysis scores and documents boosts to the top ranks of the results. ranking and non-relevant Craswell, Robertson, Zaragoza & Taylor (2005) proposed that the scores comput cal by the hyperlink structure analysis methods, are transformed with a saturating function form: following the of
L(d) =
kL is the saturating where the transformation parameter.
LS(d)
kL+
d s()
(4.29)
kL in parameter
The effect of the saturating
is shown in Figure 4.6. For the low values of kL, the score L(d) is higher LS(d). For hyperlink the inversely the to score analysis effectively proportional hyperlink LS(d) L(d) between kL, the the the score analysis and score relation values of is almost linear. Differently monotonically
to the hyperlink
from the URL-based scores, where the URL-based score is a

increasing function.
decreasing function of the URL path length, the applied transformation

structure is scores a monotonically
87
evidence
0.9 -
0.7 0.6 r. J 0.5

0.4 0.3 0.2 0.1 0 0 20 40 LS(d) 60 80 100 ; LS(d)/(1+LS(d)) LS(d)/(10+LS(d)) ------LS(d)/(100+LS(d)) --------
Figure 4.6: The monotonically increasing transformation analysis scores, for kL = 1,10 and 100.
hyperlink the of
structure
Similarly to Section 4.5.1, the hyperlink analysis score is linearly combined with the follows: content analysis score, as
Wd,
q,
L = Wd,
+ WLL(d) q
(4.30)
L. hyperlink is the the score analysis structure weight of where wL
4.5.3
Evaluation
field retrieval of
with
query-independent
evidence
different field three the The current section evaluates retrieval with combination of is length, URL first is The the which path one query-independent sources of evidence. PageRank, (4.6). is The Equation to where second one transformed to a score according (Brin The & Page, 1998). third 0.85 factor is damping source of querythe prdf = independent structure Absorbing is the evidence Model with static priors, a novel hyperlink described in Sections 4.5.2.4 and 4.5.2.5. The evaluation field-based weighting of each model with one source
analysis algorithm
is performed
for combinations
of query-independent
hyperfurther increase the in to number of order not evidence,
for each retrieval approach. parameters
88
evidence
Table A. 8 in Appendix A presents the values of the parameters wu, ku, wy,., kp,., want for the combination of the URL and ka, length, PageRank and the Absorbing path m Model with the field-weighting models PL2F, PB2F, I(ne)C2F. DLHF and BM25F, respectively. The parameter values are set in order to optimise MAP for each task. The setting of the parameters is based on two-dimensional a optimisation of the pairs of w and k for each source of query-independent evidence. The optimisation is based on the same techniques that have been used to set the parameters of the weighting models, as described in Section 4.3.2. The combination of a field-based weighting model with one of the URL path length, PageRank, or the Absorbing Model, is denoted by appending the letter U, P, A, rc spectively, to the name of the weighting model. For example BM25FU denotes the combination of the field-based weighting model BM25F with the URL path length, and PL2FA denotes the combination of the field-based weighting model PL2F with t he Absorbing Model. The field-based weighting models employ the body, anchor text, and title fields of Web documents. Table 4.8 contains the evaluation results of combining the weighting models PL2F, PB2F, I(ne)C2, DLHF and BM25F with the evidence from the URL of documents (rows 12-22), PageRank (rows 23-33), and the Absorbing Model (rows 34-44). The entries in bold show the most effective combination of a weighting model with a queryindependent source of evidence for a particular topic set. The baselines correspond to the field-based weighting models, which do not employ query-independent evidence. Their evaluation results are copied from Table 4.7 in the rows 1-11 of Table 4.8. Tables A. 5, A. 6, and A. 7 in the Appendix A contain the precision at 10, the mean reciprocal rank of the first retrieved relevant document, and the number of retrieved relevant documents, respectively, for the evaluated retrieval approaches.
Row 1 2 3 4 5 6 Task tr2000 tr2001 td2002 td2003 td2004 hp2001 PL2F 0.2047 0.2144 0.2155 0.1745 0.1483 0.6450 Mean Average Precision PB2F I(ne)C2F DLHF 0.1927 0.2066 0.1699 0.2083 0.2199 0.1833 0.1764 0.2115 0.2020 0.1650 0.1577 0.1571 0.1400 0.1343 0.1316 0.6787 0.5534 0.6252 BM25F 0.2097 0.2231 0.2133 0.1876 0.1497 0.6874
89
evidence
Row 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
Task hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
continued from previous page Mean Average Precision PL2F PB2F I(ne)C2F DLHF 0.7281 0.6743 0.7201 0.6244 0.6559 0.5889 0.6519 0.5770 0.7174 0.6888 0.7302 0.5829 0.7657 0.7199 0.7068 0.5963 0.7437 0.7189 0.7048 0.5354 PL2FU PB2FU I(ne)C2FU DLHFU 0.2047 0.1927 0.2076 0.1705 0.2144 0.2083 0.2197 0.1848 0.2157 0.2119 0.2020 0.1767 0.2174 0.2036 0.2087 0.1793 0.1869 0.1735 0.2011 0.1961 0.7946 0.7501 0.8148 0.7151 0.7803 0.7330 0.7958 0.7070 0.7032 0.6483 0.7141 0.6438 0.7174 0.6904 0.7302 0.5829 0.7657 0.7201 0.7068 0.5986 0.7458 0.7331 0.7139 0.5380 PL2FP PB2FP I(ne)C2FP DLHFP 0.2047 0.1927 0.2069 0.1704 0.2144 0.2083 0.2199 0.1868 0.2160 0.2116 0.2006 0.1780 0.1875 0.1700 0.1808 0.1642 0.1525 0.1357 0.1541 0.1594 0.6452 0.6237 0.6839 0.5626 0.7068 0.7403 0.7746 0.7141 0.6197 0.6763 0.7554 0.6245 0.7214 0.6901 0.5865 0.7400 0.7430 0.7629 0.6301 0.7976 0.7284 0.7263 0.5365 0.7552 I(ne)C2FA DLHFA PL2FA PB2FA 0.1736 0.1927 0.2066 0.1998 0.2225 0.1848 0.2115 0.2186 0.2022 0.1767 0.2116 0.2155 0.1573 0.1660 0.1668 0.1804 0.1454 0.1367 0.1326 0.1506 0.5480 0.6253 0.6827 0.6435 0.6705 0.7006 0.7461 0.7363 0.5882 0.6972 0.5965 0.6681 0.5830 0.6897 0.7348 0.7206 0.6068 0.7159 0.7273 0.7809 0.5354 0.7215 0.7310 0.7612
B.%125F 0.7446 0.6731 0.7277 0.7138 0.7163 B-N125FU 0.2122 0.2231 0.2133 0.2338 0.1981 0.8187 0.8190 0.7100 0.7279 0.7138 0.7304 BM25FP 0.2102 0.2231 0.2138 0.1966 0.1549 0.6877 0.8044 0.7461 0.7355 0.7916 0.7373 BN125FA 0.2096 0.2231 0.2137 0.1871 0.1508 0.6872 0.7791 0.7015 0.7309 0.7522 0.7496
Table 4.8: Evaluation results of the combinations of field-based retrieval with the qucer\'independent evidence from the URL path length, PageRank, and the Absorbing Model. The evaluation results of the baselines, which correspond to the field-based weighting from Table 4.7. BM25F, DLHF, I(ne)C2F, PB2F, PL2F. are copied and models Regarding the different sources of query-independent evidence, when the relevant
90
evidence
documents are restricted distillation
to be the home pages of relevant Web sites for the topic tasks td2003 and td2004, the evidence from the URL of the Web document s improvements over employing only field retrieval (rows PageRank (rows 26-27) or the Absorbing Model (rows 37with
(rows 15-16) results in important 4-5), or a combination
38). For example, the MAP achieved by BM25FU for td2003 is 0.2338 (row 15), while the MAP achieved by BM25FP for the same task is 0.1966 (row 26).
Regarding documents the home page finding tasks, using evidence from the URLs Web of (rows, of (rows 17-19) is more effective than employing with the Absorbing Model
field-based only retrieval
6-8), or its combinations the field-based weighting
(rows 39-41). The combination the URL path length
PL2F models
and PB2F with
is more models
effective than their combination I(ne)2F, DLHF, BM25F and
PageRank. with
However, when the weighting are less marked.
are employed,
the differences
For the named page finding tasks, there is no particular restriction to the type of the relevant documents. In this case, the combination of the most effective fieldbased weighting models (PL2F, I(ne)C2F, and BM25F) with PageRank (rows 31-33), Absorbing (rows 42-44), outperforms the corresponding combination the or with model (rows from URL Web documents 20-22). Both PageRank and the with evidence of model result in comparable retrieval effectiveness for the named page finding topic sets. For example, the MAP of PL2FP for the tasks np2003 and np200-1 is 0.7976, and 0.7552, respectively (rows 32-33). The MAP of PL2FA for the same tasks is 0.7809, and 0.7612, respectively (rows 43-44). The combination any important of query-independent field does retrieval evidence with not yield improvements in retrieval effectiveness for the ad-hoc search tasks. For for field-based PL2F by the task the achieved weighting model the Absorbing
example, the MAP
tr2001 is 0.2144, while the MAP of the combination of PL2F with the Absorbing Model is 0.2186 (Table 4.8). This is due to the fact that the query-independent evidence are i, ut (Craswvll documents in for identifying tasks ad-hoc search relevant necessarily useful & Hawking, 2002).
4.5.4
Summary
and conclusions
sourc:ces
The combination
field-based the weighting models with query-independent of
best better to than the the submitted official runs as well, or of evidence, performs from BM25F Web MAP for For TREC tracks. the obtained example, corresponding
91
4.6 Obtaining
a realistic
parameter
setting
the task td2003 is 0.2338 (row 15 in Table 4.8), while the MAP of the best performing (row Web TREC 2003 is in In 0.1543 4 Table 4.6). to the track addition, run submitted PL2FA achieves MAP of 0.7612 for the task np2004 (row 44 in Table 4.8), while the best performing in Table 4.6). Overall, this section has investigated the use of query-independent sources of evirun submitted to the TREC 2004 Web track achieved 0.7232 (row 11
dence for Web IR. Three sources of evidence have been employed: the URL path length; PageRank; and the Absorbing Model, a novel hyperlink structure analysis algorithm. The evaluation have results shown that the employed query-independent sources of field-based be in to retrieval. order enhance used effectively evidence can The URL path length has been shown to be particularly distillation tasks. for the topic effective For the home page finding tasks, both the URL path length, and PageRank result in considerable improvements in retrieval effectiveness. Regarding the finding tasks, the most effective query-independent named page PageRank and the Absorbing Model. sources of evidence are
The next section investigates the performance of the described retrieval approaches in a setting which aims to reduce any overfitting effect of the applied optimisation process.
4.6
Obtaining
a realistic
parameter
setting
In Sections 4.3 to 4.5, each retrieval approach has been optimised, and evaluated with for This task. from allows approach optimisation search a particular a set of queries basis their the perretrieval optimal of the on the comparison of approaches retrieval The task. in the formance. However, it may also result overfitting of a particular for the introduce is optimisation to setting realistic more a the current section aim of the involves the This and optimisation setting of the proposed retrieval approaches. tasks, different types a as well as of mixed the with retrieval approaches evaluation of is (Section terminated 4.3.2), early. which the optimisation process restriction of
4.6.1
Using
mixed
tasks
for a mixThe current section investigates the effectiveness of the retrieval approaches Two finding topics. finding home sos distillation, page and named page ture of topic
92
4.6 Obtaining
a realistic
parameter
setting
of mixed tasks are used, as described in Section 4.2. The first one, denoted by mq2003, is a set of 350 topics from the TREC 2003 Web track. The second set of topics, denoted by mq2004, corresponds to 225 topics from the TREC 2004 Web track mixed Due lack task. to query a of test collections with various Web search tasks, only one test collection is used here, namely the GOV collection. However, the tested tasks indistillation, home finding, topic volve page and named page finding tasks. These three different types of tasks are specific to Web search, which is the focus of this thesis. The mean average precision (MAP) of the employed retrieval approaches is optimised evaluate for one of the mixed the retrieval approach tasks, and the obtained with a different parameter tasks. values are used to When the mixed
set of mixed
task mq2003
is employed
as a training
first the set,
50 queries for each type of task by mq2003'. This choice is made
are used, and this smaller
set of queries is denoted towards a particular
in order not to bias the training consists of 50 topic distillation finding page queries).
type of task (note that mq2003 queries, and 150 named
queries, 150 home page finding
The employed field-based retrieval models are PL2F, PB2F, I(ne)C2F, DLHF, and BM25F (Section 4.4). The employed query-independent sources of evidence are the URL path length, PageRank, and the Absorbing Model (Section 4.5). The parameter values for the field-based retrieval models, and their combination with the queryindependent sources of evidence are shown in Tables A. 9 and A. 10 of Appendix A, respectively. The evaluation of the field-based weighting models and their combination with the Table The in 4.9. is for the shown mixed-type query sets query-independent evidence bold entries correspond to the most effective retrieval approach for each row of the table. In the column `Task (train)', the task in brackets corresponds to the training task. For for field-based the the 1 the models weighting of results evaluation example, row shows for MAP task the their task mq2004. mixed mq2003, after optimising mixed Regarding the mixed tasks mq2003 and mq2004, it is interesting to note that for field BM25F, I(ne)C2F the of retrieval with the weighting models combination and PageRank (rows 17-18 in Table 4.9) is more effective than the combination field of by be This (rows 4.9). in Table 9-10 length URL the can explained path retrieval with the fact that the combination of field retrieval with PageRank improves the retrieval length from URL The for the tasks. types three path evidence of search all effectiveness
93
4.6 Obtaining
a realistic
parameter
setting
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Task (train) mq2003 (mq2004) mq2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) np2004 (mq2003') mq2003 (mq2004) mg2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) np2004 (mq2003') (mq2004) mq2003 mq2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) (mq2003') np2004 mq2003 (mq2004) mq2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mg2003') np2003 (mq2004) (mq2003') np2004
M ean Average Precision (MAP) PL2F PB2F I(ne)C2F 0.6132 0.5695 0.6174 0.4916 0.4638 0.4667 0.1423 0.1366 0.1427 0.1454 0.1307 0.1249 0.7032 0.6395 0.6889 0.6410 0.5893 0.5810 0.6801 0.6437 0.7041 0.6885 0.6716 0.6940 PL2FU PB2FU I(ne)C2FU 0.6317 0.5916 0.6461 0.5363 0.4954 0.5082 0.1942 0.1810 0.1762 0.2015 0.1577 0.1790 0.7732 0.7287 0.7632 0.7124 0.6450 0.6616 0.6361 0.5913 0.6856 0.6950 0.6834 0.6840 PL2FP PB2FP I(ne)C2FP 0.6308 0.5887 0.0.6644 0.4809 0.4639 0.5232 0.1599 0.1558 0.1791 0.1527 0.1307 0.1436 0.7300 0.6933 0.7657 0.6367 0.5895 0.7100 0.6886 0.6285 0.7249 0.6534 0.6716 0.7160 PL2FA PB2FA I(ne)C2FA 0.6225 0.5851 0.6334 0.4876 0.5016 0.4638 0.1446 0.1330 0.1449 0.1308 0.1307 0.1448 0.7385 0.6793 0.7212 0.6115 0.5893 0.6598 0.6911 0.6417 0.6831 0.6716 0.7206 0.7002
DLHF 0.5037 0.3914 0.1357 0.1177 0.6120 0.5602 0.5182 0.4963 DLHFU 0.5302 0.4321 0.1586 0.1630 0.6880 0.6325 0.4962 0.5007 DLHFP 0.5387 0.4155 0.1524 0.1418 0.6734 0.6123 0.5330 0.4925 DLHFA 0.5116 0.3947 0.1336 0.1232 0.6300 0.5697 0.5192 0.4913
BN125F 0.6351 0.4874 0.1449 0.1401 0.7296 0.6404 0.7041 0.6817 BM25FU 0.6596 0.5302 0.1907 0.1875 0.7918 0.7172 0.6837 0.6858 BM25FP 0.6694 0.5322 0.1777 0.1656 0.7780 0.7487 0.7246 0.6821 BM25FA 0.6443 0.4979 0.1449 0.1442 0.7591 0.6675 0.6959 0.6821
Table 4.9: The evaluation of the field retrieval weighting models and their combination for for the querythe the mixed-type query sets, and query-independent evidence with type specific topic subsets. The task mq2003' corresponds to a subset of mq2003, which for first 50 task. type topics the of each consists of is mostly beneficial for the topic distillation On the other hand, the combination finding home tasks, where the page and
Web home documents sites. pages of are relevant PB2F, PL2F field-based the and models of both of which employ a Poisson randomness model, with evidence from the URL pat h length for the tasks mg2003 and mq2004 (rows 9-10) seems to be more effective thaii
94
4.6 Obtaining
a realistic
parameter
setting
their combination 25-26).
with either Pagerank (rows 17-18), or the Absorbing Model (rows
Overall, the training and evaluation of the retrieval approaches with different mixed tasks has a negative impact on MAP, compared to the results obtained from Table 4.8. This is explained in terms of the reduced effect of overfitting the data. However, sonne of the evaluated retrieval approaches still perform well compared to the best performing in the corresponding TREC Web tracks. For example, the MAP of the retrieval runs PL2FU for the task mq2004 is 0.5363 (row 10 in Table 4.9), while the highest approach MAP achieved by the submitted runs to the TREC 2004 is 0.5389 (row 12 in Table 4.6, 67). The MAP page of the same retrieval approach for the task td2004 is 0.2015 (row 12 in Table 4.9), while the highest MAP achieved for this task in TREC 2004 was 0.1791 (row 5 in Table 4.6).
4.6.2
Using
mixed
tasks and restricted
optimisation
In addition
to the evaluation of the retrieval approaches with mixed types of tasks, process is terminated after 20
this section considers a setting, where the optimisation iterations.
The parameters are set to the values that resulted in the best retrieval iterations 20 effectiveness after of the optimisation process. This setting aims to further reduce any overfitting Appendix of effect of the optimisation A. A. Tables 11 12 process. and A display the corresponding parameter values for the field-based weightwith the query-independent sources of evidence,
ing models, and their combination respectively.
Table 4.10 shows the evaluation of the weighting models and their combination with sources of evidence when a restricted optimisation bold entries correspond to the most effective retrieval approach for each tested topic (train)', brackets in `Task In the task the corresponds to the training task. set. column query-independent Compared to the retrieval effectiveness obtained from the full optimisation (Table be it 4.9), that, the types seen generally, restricted can of queries set of mixed because is lower MAP. This in the optimisation proce' expected optimisation results is performed. The
over a
is stopped early.
In particular, the restricted optimisation has a negative effect on the retrieval effectiveness of BM25F. For example, the MAP of BM25F for the mixed task mq2003 with full optimisation is 0.6351 (row 1 in Table 4.9). However, in the case of the restricted
95
4.6 Obtaining
a realistic
parameter
setting
M ean Average
Precision
(MIAP)
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Task (train) mq2003 (mq2004) mq2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') (mq2004) np2003 np2004 (mq2003') mq2003 (mq2004) mq2004 (mq2003') td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) (mq2003') np2004 mq2003 (mq2004) (mq2003') mq2004 td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) (mq2003') np2004 (mq2004) mq2003 (mq2003') mq2004 td2003 (mq2004) td2004 (mq2003') hp2003 (mq2004) hp2004 (mq2003') np2003 (mq2004) (mq2003') np2004
PL2F 0.6089 0.4444 0.1474 0.1299 0.7005 0.4893 0.6713 0.7141 PL2FU 0.6206 0.5254 0.1939 0.2092 0.7435 0.6674 0.6399 0.6997 PL2FP 0.6238 0.4853 0.1606 0.1459 0.7174 0.6192 0.6846 0.6908 PL2FA 0.6164 0.4717 0.1558 0.1201 0.7229 0.5780 0.6633 0.7169
PB2F 0.5558 0.4114 0.1401 0.1137 0.5927 0.4262 0.6575 0.6944 PB2FU 0.5809 0.4723 0.1520 0.1404 0.6480 0.5523 0.6568 0.7241 PB2FP 0.5873 0.4723 0.1445 0.1402 0.6589 0.5677 0.6634 0.7090 PB2FA 0.5772 0.4462 0.1417 0.1204 0.6248 0.4965 0.6748 0.7218
I(ne)C2F 0.6071 0.4273 0.1089 0.1150 0.6872 0.4826 0.6930 0.6843 I(ne)C2FU 0.6258 0.4946 0.1446 0.1763 0.7343 0.6370 0.6778 0.6706 I(ne)C2FP 0.6453 0.4983 0.1542 0.1307 0.7603 0.6632 0.6940 0.7010 I(ne)C2FA 0.6210 0.4509 0.1283 0.0889 0.7227 0.5815 0.6836 0.6814
DLHF 0.4890 0.3792 0.1410 0.1138 0.5814 0.5170 0.5127 0.5069 DLHFU 0.5216 0.4273 0.1600 0.1565 0.6660 0.6278 0.4978 0.4978 DLHFP 0.5319 0.4156 0.1455 0.1371 0.6710 0.6149 0.5216 0.4947 DLHFA 0.5019 0.3959 0.1284 0.1167 0.6042 0.5555 0.5241 0.5154
BM25F 0.5533 0.4327 0.1179 0.1136 0.5629 0.5138 0.6889 0.6707 BM25FU 0.6237 0.4883 0.1857 0.1851 0.7365 0.6479 0.6570 0.6319 B\125FP 0.6502 0.4955 0.1640 0.1377 0.7516 0.6469 0.7108 0.7021 BM25FA 0.5894 0.4680 0.1290 0.1169 0.6498 0.5975 0.6825 0.6896
Table 4.10: The evaluation of the field retrieval weighting models and their combination for for the the the sets and qucr-yquery mixed-type evidence with query-independent The task mq2003' corresponds type specific topic subsets, with restricted optimisation. to a subset of mq2003, which consists of the first 50 topics for each type of task.
optimisation, optimisation restriction
it drops to 0.5533 (row 1 in Table 4.10). This is explained because the higher BM25F involves number of two-dimensional optimisations, a of the
from further the optimum. in away of which results a setting
The DFR field-based weighting models are more robust. in the sense that they are less affected by the restricted optimisation. For example, the MAP of PL2F for mg2UU3
96
4.7 Potential
improvements
from
selective
Web information
retrieval
drops from 0.6132 (row 1 in Table 4.9) to 0.6089 (row 1 in Table 4.10).
It is worth noting that, the field-based weighting despite the restricted optimisation, the combination of model PL2F with the URL path length (PL2FU) achieve" comparable MAP to that of the best performing run submitted to the TREC 2004 Web track (0.5254 from Table 4.10 with respect to 0.5389 from Table 4.6). This suggests that the retrieval approach PL2FU is robust with respect to setting its hyper-parameters. 4.6.3 Conclusions
from two perspectives in process order to obtain a realistic setting for the hyper-parameters of the proposed retrieval of precision, and the evaluation of the retrieval has been performed with different sets of mixed tasks. The mixed tasks approaches include topic distillation, home page finding, and named page finding tasks. Second, approaches. the two-step optimisation process described in Section 4.3.2 has been modified in order to terminate after 20 iterations. The obtained parameter setting for the retrieval approaches does not always result in optimal retrieval performance. However, it represents a realistic setting, where the The setting, which employs the will be employed in the next section, First, the optimisation
This section has revisited the employed optimisation
most effective parameter values are approximated. mixed tasks and the restricted optimisation, in order to establish the potential effectiveness.
for Web IR improvements in retrieval of selective
4.7
improvements Potential tion retrieval
from
selective
Web informa-
The aim of this section is to investigate the potential improvements in retrieval effectiveness from selective Web IR. This investigation is performed in a setting where it is basis. is that the assumed most effective approach a2 applied on a per-query The methodology to establish the potential for improvements from selective Web IR is considered. It is assumed that A set of retrieval approaches al, a2, ... ), which can identify and apply the most effective there is a mechanism MAX(ai, a2, ... is the following. basis. The the of mechanism effectiveness retrieval a on per-query retrieval approach
97
4.7 Potential
improvements
from
selective
Web information
retrieval
MAX
corresponds to the maximum
retrieval
effectiveness that can be obtained lby-
selectively applying any of the approaches al, a2, basis. on a per-query ... The employed retrieval approaches involve the field-based weighting models, and their combinations with query-independent evidence from the URLs Web documents. of PageRank, or the Absorbing Model. The parameters of the retrieval approaches have been set after a restricted optimisation with mixed tasks, as described in Section 4.6. The evaluation of the retrieval approaches has been shown in Table 4.10. The described methodology is applied for pairs of retrieval approaches. Table 4.11 displays the pairs of retrieval approaches, for which the mechanism MAX results in the highest improvements over the most effective retrieval approach of the pair. The symbol * denotes that the difference between the MAP of the mechanism MAX and that of the most effective retrieval approach is statistically significant at p=0.05 according to Wilcoxon's signed rank test. Rows 1-6 display the potential for improvements in retrieval effectiveness from the selective application of retrieval approaches that use t he field-based weighting model PL2F. Row 1 of Table 4.11 refers to the following case. When the retrieval approach PL2F is applied for all queries of the task td2003, the MAP is 0.1474. When the retrieval approach PL2FP is applied for all queries achieved of the task td2003, the achieved MAP is 0.1606. When the mechanism MAX selects t lie most effective approach between PL2F and PL2FP for each query of the task td2003, the achieved MAP is 0.1726, which represents a relative improvement of +7.47% over the MAP of PL2FP (0.1606). According to Wilcoxon's signed rank test, the difference between the MAP of the decision mechanism MAX and that of PL2FP is statistically significant, denoted * by in the table. as For all the cases reported in Table 4.11, it can be seen that the improvements in MAP obtained by the mechanism MAX are statistically significant. When the employed pairs of retrieval approaches use the same field-based weighting (rows for highest improvements in Table in 4.11), 1-30 then the model potential retrieval (rows field-based I(ne)C2F is is 13-18 the weighting model effectiveness obtained when in Table 4.11). The lowest potential for improvements in retrieval effectiveness are for fieldthe the task the employed retrieval approaches use obtained np2003, when based weighting models PL2F or PB2F (+2.19% from rows 5 and 11 in Table 4.11). If the available retrieval approaches employ different field-based weighting models (rows 31-36 in Table 4.11), the potential for improvements in retrieval effectiveness
98
4.8 Summary
increases considerably. application Table 4.11).
For example, the maximum MAP achieved by the selective of either PB2FU, or DLHFA, for the task hp2004, is 0.7025 (row 34 in This corresponds to a relative increase of +26.46% from the MAP of
DLHFA (0.5555).
MAP achieved from the selective application of the pairs of retrieval approaches displayed in Table 4.11 is higher than the MAP of the best run to the corresponding TREC Web track. For example, when PB2F I(ne)C2FA either or are applied on a per-query basis for the task np2004, the performing submitted MAX mechanism results in higher MAP than that of the best performing run in t he (0.8019 TREC Web 2004 from row 36 in Table 4.11 vs. 0.7232 from task track same of row 11 in Table 4.6, page 67). It is worth noting that the pairs of retrieval approaches that result in the highe t. potential for improvements, as shown in Table 4.11, do not necessarily involve the For most effective retrieval approaches. example, the maximum MAP obtained by t lie selective application PL2F for However, PL2FP is 0.1726. td2003 the task the of and PL2FU, the most effective retrieval approach applying MAP obtained by uniformly In some cases, the maximum
(row field-based from 4.10). is 11 Table 0.1939 the employing weighting model, Overall, this section has shown that there is an important potential for statistically Web IR. The from in improvements selective potential effectiveness significant retrieval for improvements is higher when the applied retrieval approaches employ different fieldbased weighting models. Furthermore, there are important improvements from the selective application of retrieval approaches, even when the retrieval approaches are not the best performing ones.
4.8
Summary
This chapter has established the potential for improvements in selective Web IR, after a thorough evaluation of different retrieval approaches with a range of weighting models. The experimental setting has been described in Section 4.2. The evaluation of the retrieval approaches has been performed in three steps, where the mean average precision tested task. been to has each respect with optimised of each retrieval approach First, Section 4.3 has examined the effectiveness of full text retrieval and retrieval from particular document representations, such as the title, the headings, and the
99
4.8 Summary
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004
Mean Average Precision First approach (0.1474) PL2F (0.1299) PL2F (0.7435) PL2FU (0.6674) PL2FU (0.6713) PL2F (0.7141) PL2F PB2F PB2FU PB2FU PB2FU PB2F PB2FU I(ne)C2F I(ne)C2F I(ne)C2FU I(ne)C2FU I(ne)C2F I(ne)C2F DLHF DLHF DLHFU DLHFU DLHFP DLHFU BM25FU BM25F BM25FU BM25FU BM25F BM25F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F (0.1401) (0.1404) (0.6480) (0.5523) (0.6575) (0.7241) (0.1089) (0.1150) (0.7343) (0.6370) (0.6930) (0.6843) (0.1410) (0.1138) (0.6660) (0.6278) (0.5216) (0.4978) (0.1857) (0.1136) (0.7365) (0.6479) (0.6889) (0.6707) (0.1446) (0.1299) (0.6660) (0.5523) (0.6846) (0.6944) Second approach (0.1606) PL2FP (0.1201) PL2FA (0.7229) PL2FA (0.6192) PL2FP (0.6633) PL2FA (0.7169) PL2FA PB2FA PB2FP PB2FP PB2FP PB2FP PB2FP I(ne)C2FA I(ne)C2FP I(ne)C2FA I(ne)C2FP I(ne)C2FP I(ne)C2FA DLHFP DLHFP DLHFP DLHFP DLHFA DLHFP BM25FP BM25FA BM25FP BM25FP BM25FP BM25FU DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA (0.1417) (0.1402) (0.6589) (0.5677) (0.6634) (0.7090) (0.1283) (0.1307) (0.7227) (0.6632) (0.6940) (0.6814) (0.1455) (0.1371) (0.6710) (0.6149) (0.5241) (0.4947) (0.1640) (0.1169) (0.7516) (0.6469) (0.7108) (0.6319) (0.1455) (0.1307) (0.6498) (0.5555) (0.6836) (0.6814) MAX 0.1726 (+ 7.47%)' 0.1464 (+12.70%)' 0.7633 (+ 2.66%)' 0.7311 (+ 9.54%)' 0.6860 (+ 2.19%) 0.7797 (+ 8.76%) 0.1490 0.1614 0.6798 0.6340 0.6779 0.7821 0.1574 0.1524 0.7600 0.7385 0.7262 0.7546 0.1578 0.1492 0.7018 0.6733 0.5556 0.5427 0.2135 0.1255 0.8031 0.7062 0.7656 0.6966 0.1926 0.1615 0.7658 0.7025 0.7827 0.8019 (+ 5.15%) (+14.96%)* (+ 3.17%)' (+11.68%)* (+ 2.19%) (+ 8.01%) (+22.68%)' (+16.60%)' (+ 3.50%)' (+11.35%)' (+ 4.64%)* (+10.27%)' (+ (+ (+ (+ (+ (+ 8.45%)' 8.83%)` 4.59%)' 7.25%)' 6.01%)' 9.02%)*
(+14.97%)' (+ 7.35%) (+ 6.85%) (+ 9.00%)" (+ 7.71%) (+ 3.862) (+32.37%)" (+23.57%)' (+14.98%)" (+26.46%)* (+14.33%)' (+16.52%)'
from the in selective Potential Table 4.11: retrieval effectiveness The basis. approaches retrieval two a per-query on retrieval approaches application of lie displays The 4.10. table Table t in based on a restricted optimisation, as reported are for MAP in improvements the highest in the that result pairs of retrieval approaches MAX MAP between difference the denotes of * the that The topic tested symbol sets. is according significant, statistically the approach that retrieval most effective of and to Wilcoxon's signed rank test. for improvements been has It that documents. Web hyperlinks shown incoming the of text of anchor For 63). (Table 4.4 task field depends the page on search on the effectiveness of each
100
4.8 Summary
the ad-hoc retrieval tasks, the full text of documents is the most effective document representation. For the tasks, where there is a bias towards the home pages of Web sites, the anchor text representation is more effective than the full text of Web documents. The title representation representation of documents is less effective, but outperforms the headings for Web specific tasks. per-field normalisation, a new normalisation
Second, Section 4.4 has introduced
technique for the DFR framework, which allows the term frequency normalisation and different document fields. The employed document fields are the body, the of weighting anchor text of incoming hyperlinks, in important weighting models result to retrieval from the individual and the title of Web documents. The field-based improvements in retrieval effectiveness, compared document representations, particularly for the narr d
(Table finding 4.7 on page 72). tasks page Third,

independent
Section 4.5 has enhanced the field-based weighting

sources of evidence. are the length In particular, in characters the considered URL the of
models with queryquery-independent
sources of evidence PageRank structure (Brin analysis
Web documents, path of Model, a novel hyperlink
& Page, 1998), as well as the Absorbing algorithm. The evaluation
have results shown that the combination provides important improve-
of field-based MAP in ments
retrieval
with the URLs of Web documents for the topic distillation
field over retrieval,
finding home tasks and page the home im-
(Table 4.8 on page 90). This is due to the fact that there is a bias towards for Web sites such search tasks. pages of provements that in retrieval effectiveness When employing PageRank,
moderate
are obtained
for all the Web specific search tasks, finding and named page tasks. tasks. The
is the topic Model
distillation, is particularly
home page finding
Absorbing
finding for the named page effective
Overall, the presented retrieval approaches achieve higher, or similar performance TREC the tasks to the the of corresponding as most effective official runs submitted Web tracks. Section 4.6 revisits the parameter setting of the retrieval approaches, by performing optimisation and evaluation on different sets of mixed tasks, as well as This allows for the reduction of the process, and shows that the retrieval
by terminating overfitting
the optimisation
process early.
by the optimisation effects caused
(Table 96). 4.10 on page approaches are robust This last setting is employed in Section 4.7, in order to demonstrate the potential for improvements in retrieval effectiveness from selective Web IR. The results show that
101
4.8 Summary
statistically of applying established IR, Chapter
significant a retrieval the potential 5 will will
improvements approach
can be obtained
ivencs., the to effect with respect After having
uniformly
for all queries (Table 4.11). in retrieval
for improvements a framework both
from Web selective effectiveness Web IR. The evaluation of
introduce
for selective
the framework operational
be performed with limited
in an optimal
setting (Chapter
(Chapter 7).
6), and in an
setting
102
Chapter
A framework for Selective Information Retrieval

5.1 Introduction
Web
The previous chapter introduced
a wide range of retrieval approaches for Web IR,
and established that selective Web IR has the potential to result in improved retrieval effectiveness. This chapter proposes a novel framework for selective Web IR. A central concept in this framework is the decision mechanism, which selects an appropriate retrieval approach to apply on a per-query basis. The selection of a retrieval E, feature from by is approach aided an experiment which extracts a a sample of the set following documents. by An This is informational the of retrieved motivated example. query about a very broad topic may benefit from applying hyperlink structure analysis in order to detect the most authoritative Web sites and resources (Kleinberg, 1998). On the other hand, applying hyperlink analysis to an informational query about a topic document in in is the collection, may result a topic which not extensively represented drift (Bharat & Henzinger, 1998). In these two examples, the retrieval effectiveness of hyperlink analysis is related to the broadness of the topic. In terms of the characteristics broadness for documents the topic the of a can a query, particular of set of retrieved for being documents indexed that the are retrieved a of from URL if for On hand, the topics, the evidence navigational other particular query. document for is documents Web topics, then with a a relevant all uniformly used of be seen as the proportion relatively long URL would be penalised. Section 5.2 describes i lie The remainder of this chapter is organised as follows.
103
5.2 Selective
retrieval
as a statistical
decision
problem
framework
for selective Web IR in terms of statistical
decision theory, and discuses
the differences between the selective Web IR and related work. The next two sections introduce a range of experiments E. First, Section 5.3 defines the score-independent E, which are based on counting the occurrences of query terms in the The
experiments
documents, as well as in aggregates of related Web documents. set of retrieved Web documents that are stored in the same directory.
defined documents Web the that have the same domain name, or the as aggregates are Second, Section 5.4 introduces the score-dependent experiments E, which employ evidence from the retrieval scores of Web documents, and from the hyperlink structure of the retrieved documents, in order to estimate the usefulness of the hyperlink structure. Finally, Section 5.5 introduces a Bayesian decision mechanism for the evaluation of selective Web IR.
5.2
Selective
retrieval
as a statistical
decision
problem
Selective Web IR can be seen as a statistical decision problem with a number of available 1, function loss different for experiment an and s, a nature of states set of a a actions
E (Lindgren, correspond 1971; Wald, to the retrieval uncertainty approach 1950). In the context that of selective Web IR, the actions for a given query. a be applied can process, a retrieval Due
approaches
to the inherent which retrieval retrieval
of the retrieval
system can only guess
is most appropriate
for a given query. The knowledge of which with the states of nature: approach when the
approach
is most appropriate
is modelled
true state of nature formulation This ai. nature and actions. words,
is si, the most appropriate of the problem In this setting,
retrieval
for a given query is between states of
in a one-to-one results a decision mechanism
mapping
guesses the state of nature, retrieval approach for a
in other or query.
it aims to identify
the most appropriate
The consequences
of applying
a retrieval
approach
ai when the state of nature
is
in loss the elch l(aj, utility function of loss by expresses which sj), a sj are modelled depending different in defined be oil function loss ways. The can possible situation. in the defined be terms it effectiveof For factors can that affect utility. example, the ness of the retrieval approaches. It is also reasonable to consider the computational d setting where a retriev in operational an especially the approaches, retrieval cost of For the of cost in timely example, manner. the a queries users' process must system
104
5.2 Selective
retrieval
as a statistical
decision
problem
applying
1998) at query time is significantly higher than using PageRank (Page et al., 1998), because the hub and authority scores of the HITS algorithm need to be computed for each particular In query. contrast, PageRaiik scores are computed during indexing time. Hence, the overhead for combining them with content analysis scores is marginal. retrieval setting. This thesis is focused on a TREC-like bath It is also worth noting that the investigated retrieval approaches in Chapter 4 do not introduce any considerable overhead in the retrieval process. Thereand the loss of applying a retrieval approach a when
the HITS algorithm
(Kleinberg,
fore, it is assumed that the utility
the state of nature is s depends only on the retrieval effectiveness of a, and not on its computational cost. In the context of selective Web IR, the loss l (ai, sj) of applying a retrieval approach h ai when the true state of nature is sj can be defined with respect to a preference relationship Definition among the retrieval approaches, as follows. 3 Suppose that there is a decision problem with n retrieval approaches', The retrieval effectiveness of the retrieval approach ai for the
and n states of nature.
is denoted by state s m(ai, s). The n retrieval approaches are ranked in decreasing In this way, the rank of the most effective their order of retrieval effectiveness m(ai, s). is is 2, 1, the the second most effective retrieval approach rank of retrieval approach is least The the effective retrieval approach n. and so on. rank of If the rank of the retrieval approach az is denoted by r(a2, s), then the loss function is defined as follows:
r(ai, s) -1 l(ai, s) = n-1 (5.1)
The definition of the loss function in Equation (5.1) does not consider the magnitude but their difference in the the only approaches, retrieval among of retrieval effectiveness function loss the dividing the Moreover, values of only normalises with n-1 ranking. in the range [0,1], and it does not affect any further computations. Before continuing, illustrate in is to order given an example the formulation Q of
definition decision the IR in Web terms problem, as well as of a selective function. Example
loss the of
4 Figure 5.1 describes selective Web IR as a decision problem with 3 states
(al, When the 3 (sl, state and a3). a2, approaches and retrieval 83) and '2, of nature '
105
5.2 Selective
retrieval
as a statistical
decision
problem
is of nature sj, then the loss associated with applying retrieval approach a1 is denoted by 1(ai, sj). The loss 1(a2, sj) can be specified as follows.
States of nature
sl = retrieval approach al is appropriate
Apply retrieval approach al Apply retrieval approach a2 Apply retrieval approach a3
l(ai, si)
l(a2, l(a3, Si) S1)
82 = retrieval approach a2 is appropriate l(al, s2)

1(a2, S2) 1(a3,82)
83 = retrieval approach a3 is appropriate I(aI.. s )
1(a2,4;3)
1(a3,83)
Selective application of retrieval approaches for three states of nature 81,82,83 and three different retrieval approaches al, a2, a3. The loss associated with applying retrieval approach ai when the true state of nature is sj is denoted by l (ai,,, j).
When the state of nature is s1, suppose that m(al, si) > m(a3, Si) > m(a2, Si). In
Figure 5.1:
this case, r(al, associated

for the
si) = 1, r(a3,81)
with
(5.1). 2, 3. From Equation loss the and r(a2, sl) = = 1-1 is is l(al, the 0. The loss true state of nature al when sl 81) = 3-1 = approaches a2 and a3 is 1(a2, S l) = 3_ i=1, and 1(a3, SO = 3-1 = 0-5,
retrieval
respectively.
The loss function 1(ai, sj) can be specified in the same way for all the in the same way for any number of retrieval approaches and states
decision The of approaches and states of possible pairs retrieval ai nature sj. problem can be formulated decision In the of a problem with two retrieval approaches, the output of nature. case function i. 0. loss is binary, 1 the e. or of
11
In order to identify the true state of nature, and to decide which retrieval approach to use, an experiment F-is performed on a sample Retq of the set of retrieved documents for a query q. The sample Retq can be restricted to a number of top-ranked documents, ordered by a specific retrieval technique. This retrieval technique may correspond to Chapter The 4. in technique that the retrieval presented retrieval approaches any of document final for is the Retq the of ranking S. not necessarily used sample generates In other words, the experiment E does not depend on the retrieval approaches that For basis. that decision suppose an the example, mechanism applies on a per-query least in term documents that E one query contain at counts the number of experiment in to this A decision select on a order their title. experiment employ mechanism can Section The 4.4. described in field-based basis the weighting models one of per-query fact that the experiment employs only the documents with at least one query tcrni field does not mean that the retrieval approaches cannot employ ails' for This fields, text. document the more allows approach anchor such as available other in a particular
106
5.2 Selective
retrieval
as a statistical
decision
problem
in defining the experiment E, as well as in selecting the retrieval approachc". In the remainder of this thesis, the defined experiments will be independent of the retrieval approaches used for the final ranking of documents. The experiment F, extracts a feature related to the query. This feature can be related to the statistical characteristics of the query terms, or the characteristics of the documents that are retrieved for this particular query. For example, the query performance pre-retrieval predictors (He & Ounis, 2004) can be seen as experiments E that use evidence only from the collection statistics of the query terms. A differelit E experiment may employ evidence from the hyperlink structure among the retrieved Web documents for a query. The experiment F- returns an outcome o, which can be either a categorical, or a numerical value. In the case of categorical values, the outcome o of an experiment, which detects how difficult the queries are, could be either `Query is difficult' `Query or , is easy'. In the case of numerical values, the outcome o of an experiment, which estimates the density of hyperlinks in the set of retrieved Web documents, could be a between 0 and 1. The decision mechanism needs to map the range of the real number possible outcome values of the employed experiment to particular retrieval approaches. According to the outcome of the experiment for a query, the decision mechanism selects an appropriate retrieval approach to apply. it provides the decision mechanism with evidence to guess the Fthe of outcome of an experiment when the state distribution When the outcome of the experiment E for a query predicts the true state of nature with some probability, true state of nature, and to apply an appropriate retrieval approach for the given query. Ideally, the probability distribution from different for be is the probability a set of queries, should of nature sl E would identify
flexibility
for different In is Fthe the a set of queries. such state of nature 82 of outcome of when a case, the experiment the true state of nature, without distribution any error. Section 5.5 describes how the probability E the of outcome of is empirically
decision Bayesian in from the training of a context mechanism, a set of queries obtained from loss the to applying a retrieval approach. expected minimise which aims For the remainder of this thesis, a particular experiment is referred to as E, whei-e by is Retq feature for the that the the experiment. quantified of sample x stands the concept of an experiment E is further illustrated with an example. Next,
107
5.2 Selective
retrieval
as a statistical
decision
problem
Example
5 If the broadness of a topic is associated with the number of retrieved documents, then one experiment how that broad is, topic estimates a t road could be described as "Count the number of documents that contain least at one quei'Y term". The outcome of this experiment corresponds to the cardinality documents of retrieved containing at least one query term. of the set R( f, Q
In the context of Web retrieval, other types of experiments can exploit evidence from the hyperlink structure of the sample Retq of retrieved documents, or combinations of the hyperlink structure and retrieval from the text of documents. This thesis is focused defining on a range of experiments E, and not on the definition of the loss function. The remainder of the current section is organised as follows. Section 5.2.1 discusses the differences between selective Web IR and related work. the terminology of the proposed framework, in Section 4.7 to establish the potential Web IR. selective Section 5.2.2 illustrates by describing the setting already used
improvements in retrieval effectiveness from
5.2.1
Selective
Web information
retrieval
and related
work
Selective Web IR is a different approach to optimising from a system query type classification. retrieval
the retrieval effectiveness of & by Kang as performed
Indeed, the selective application of different
differs from query type classification, approaches
Kim (2003), where the aim has been to identify whether a query is informational or for that then to particular approach retrieval apply an appropriate navigational, and knowing does The type. not require selective application of retrieval approaches query for Instead, it to the type of a query. apply each query. selects a retrieval approach irrespectively its type. of Therefore, the decision mechanism and the experiments 6 in the case of a new type of queries.
would not necessarily require modifications There is another difference between selective Web IR and the query-biased setting of (2002) Amitay for the adjusted et al. combination of evidence. weights and parameters the contribution hyperlink the of structure basis, according per-query analysis on a & Plachouras hyperlink documents' structure. to the characteristics of the retrieved Ounis (2005) adjusted the weights of content and hyperlink structure analysis with a Oil the to query. Dempster-Shafer combination mechanism, according specificity of a the other hand, selective Web IR applies a particular from a set of retrieval approach
108
5.2 Selective
retrieval
as a statistical
decision
problem
available ones. In this context, the retrieval approach corresponds to a fixed combination of retrieval techniques. In other words, the retrieval approach corresponds to a description of all the steps followed in order to form the final ranking of retrieved docTherefore, improvements in retrieval effectiveness from selective the uments. potential Web IR come from the relative difference in retrieval effectiveness between the different from the change in the weight of each source of evidence. approaches, and not Selective Web IR is similar to query performance prediction,
tion 3.6.2, because it aims to predict approach. However, how appropriate prediction is primarily effectiveness
as discussed in Secretrieval focused on estimatiilg
it is to apply a particular
query performance with
the correlation approach. effective
of a predictor
the retrieval
of a particular aims to predict
retrieval the most
On the other retrieval approach
hand, selective
Web IR explicitly
from a set of at least two available retrieval
approaches.
5.2.2
Decision
mechanism
with
known
states of nature
This section introduces a decision mechanism and an experiment E in order to describe the setting used for establishing the potential improvements in retrieval effectiveness from selective Web IR in Section 4.7. Suppose that the decision retrieval of one apply mechanism can a approaches basis, and that the true state of nature si is known.
al, a 2, - .., an on a per-query In other words, it is assumed that the most effective retrieval approach ai among basis. This be identified corsetting on a per-query certainty with al, a2i .... an can Emax, its that design is it to so experiment an to possible where a situation responds MAX, Therefore, the is i mechanism is true the si. of nature state when outcome described in Section 4.7, corresponds to a decision mechanism that would employ the In the retrieval emax, the a case, such ai. approach retrieval and select outcome of be is that the decision can maximum the mechanism corresponding effectiveness of for each an the a2i al, by approaches retrieval of selectively applying one .... obtained
query. The remainder of the chapter introduces a set of score-independent experiment
(Section 5.3), and a set of score-dependent experiments Web for be how it selective applied can describes a Bayesian decision mechanism, and IR.
(Section 5.4).
Section 5.5
109
5.3 Retrieval
score-independent
experiments
5.3
Retrieval
score-independent
experiments
A wide range of experiments E can be defined, depending on the aim of the experiment Web IR, In the the the purpose of context of selective employed sources of evidence. and the experiment is to identify the queries for which a particular retrieval approach is more from Since different than the approaches may use evidence other approaches. effective the textual content of documents, as well as their structure and the hyperlink struct lire E. defining in Web, it is the to the experiments reasonable consider similar evidence of A first distinction E is for defining the whether scores or experiments of possibilities In documents the this scores context, or not. used are weights associated with ranking by IR documents the to the to such as models, weighting scores assigned either refer field-based weighting models described in Section 4.4. If such scores are not used, then E tertiis is define to to query whether consider an experiment a straight-forward way defining in latter investigates the documents. The in approach current section occur E, document-level Section introduces E. 5.3.1 count which experiments experiments Section 5.3.2 terms. the documents query the number of containing all or some of from information the structural presents a refined set of experiments, where additional distribution is in documents considered. aggregates of
5.3.1
Document-level
experiments
It documents. in terms based occur Document-level experiments are on whether query Therefore, in the broader collection. topics are more widely covered is assumed that the least terms. the query documents of that one contain either all, or at there will be more documents Web URL the may of For these topics, evidence from hyperlink analysis, or Web home documents, pages of relevant or be more useful in detecting high quality sites. document-level experiments For a given query, the outcome of the score-independent Several experiis related to the number of documents that satisfy a given condition. that the For condv(d) different for condition defined be example, conditions. ments can documents all contain Retq should the d in of retrieved the set document of sample a the terms of the query q is written follows: as
Vt dd(d) EqtEddE : coan,
Retq
(5.2)
110
5.3 Retrieval
score-independent
experiments
If at least one term of the query q is required to occur in the document d. then the (d) is written as follows: condition condi
cond3(d): Iteq
tEd
dERetq
(5.3)
A range of more refined conditions can be defined when the fields of documeiits
are considered. For example, term a possible condition field. is that document a should contain with fields, the least one query at above conditions the occurrences appear in its title In the case of documents are rewritten
condv(d)
and candi(d)
in order to distinguish
between
of the same term in different
fields. If f (d) denotes the terms of d that for checking whether all the query
in a particular
field f, then the condition
f, terms appear in any of the fields fi, ...,
of d is written
as follows:
fl,..., condv(d,
f,
1):
`dtEq
tE fl(d)V...
VtE fn(d)
dERetq
(5.4)
The condition for checking whether at least one query term appears in any of the fields
fl, ... , f, of d is written as follows:
(d, fl,..., cond2
fT,) : 3t EqtEf,
(d) V VtE fw (d) ...
dE Retq
(5.5)
The outcome o of the score-independent document-level experiments is computed (d) is for documents true: the a condition cored-, which as number of (d) J{d true}l o= : cord,, = 3. V for or either where x stands When documents with fields fl, is computed as follows: ) {d (d, f,,, fi, trice}1 = : cored,, o=I ... , (5.7) (5.6)
f, are considered, the output of the experiments ... ,
For the rest of this thesis, the experiments that count the number of documents field f, be in least them will in Retq, with all the query terms, or at a specific one of For example, the experiment that counts the (anchor), in text the terms documents or the anchor either query all with number of Ev(as) is EV(at). The by the denoted is (title) fields experiment outcome o of the title denoted by EV(f) or E3(f), respectively. follows: computed as o=I {d : condy(d, anchor, title) = true}I
111
5.3 Retrieval
score-independent
experiments
where condv(d, anchor, title) Vt EqtE : anchor(d) Vtc title(d) dE Retq
The experiment that counts the number of documents least with at one query term in their body (body) is denoted by e3(b), and its outcome o is computed as follows: {d (d, body) = true} o=I cond3 : where
(d, body) Vt EqtE condi :

The outcome number o of the proposed experiments
body(d)
dE Retq
ranging from 0 to t he
is an integer,
of documents
N in the collection. values with min(
Plachouras,
Ounis & Cacheda (2004) norstudy primarily for computing investigates the outcome
malised the outcome the effectiveness
N, 1). The current
different the of
fields and the conditions
of the experiments.
experiments
Therefore, the outcome of the score-independent document-level

given by Equation (5.6), without further any normalisation.
is directly
Amitay
(2003) introduced et al.
document frcthe a similar measure, expected
quency, which estimates the number of documents that contain all the query terms, by multiplying the probabilities of the query terms occurring in the collection. The underby the number of lying assumption is that the query terms are independent. In order to weaken the effect frequency document this the of assumption, expected was multiplied the query terms. The described experiments in this section compute the exact numallowing to consider or ignore t he document-level the cost of is available during retrieval.
ber of documents that satisfy a certain condition, information is low, the since required experiments
dependencies between the query terms. The computational
5.3.2
Aggregate-level
experiments
document-level be by refined coliexperiments can Web documents in of
The proposed score-independent sidering additional structural hypertext Indeed, the aggregates.
information
from the distribution
facilitate Web the and
the organisation of related
documents into aggregates. For example, in the case of the Web, the documents that belong to the same domain are likely to be about a particular topic, or a series of related lie This investigates be t Therefore, they topics. section can considered as an aggregate.
112
5.3 Retrieval
score-independent
experiments
of Web documents in aggregates to define a range of experiments E. This section introduces the aggregate-level experiments using abstract aggregates, and then it specifies how the aggregates are generated. The underlying assumption for these experiments is that the distribution of documents in aggregates shows whether there exist large aggregates containing relatced documents, or whether the documents related to the topic are dispersed in different and unrelated aggregates. For example, evidence from the URL of Web documents or the hyperlink structure analysis may enhance the retrieval effectiveness, by identifying the entry points of large aggregates of documents. The definition of the experiments E is based on the conditions introduced in S(, c"tion 5.3.1. Indeed, by modifying the condition from Equation (5.3), the condition that at least one query term is required to appear in a document d from aggregate ag can
use of information
from the distribution
be written
follows: as
(d, cond3 ag) : 3t EqdE
ag tEddE
Retq
(5.8)
The conditions (5.2), (5.3), (5.4), and (5.5) can be rewritten in the same way. The size of the aggregate ag is defined as follows:
{d 11 (d, true agI =I ag) = : cord,, (5.9)
I to where x corresponds either or V. Differently

are required aggregate
from the document-level

to generate an outcome
experiments, the aggregate-level experiments

from the characteristics characteristics of the distribution of the distribution of of
This sizes. work utilises
three different
aggregate sizes to generate the outcome of the experiments. correspond aggregate distribution with to the average aggregate distribution, size is the number
The first two characteristics deviation he t magi of
jag and the standard size The third characteristic
respectively.
of the aggregate size to the aggregates
large aggregates, of
which
corresponds
size greater
than jagI + 2UIa91
This work looks at two approaches to define aggregates. on comparing documents with the same domain name. This definition
The first one is based
the domain name of the URL of Web documents and aggregating the in results relatively broad
from documents Web be it to aggregate sites that appropriate may not aggregates, and
113
5.3 Retrieval
score-independent
experiments
Experiment E3(f), std(dorn) v(f), avg(dom) E3 i, dOT1 ,
Aggregate domain domain domain
Type
Distribution Qlagl IagI
feature 3t EqdE Vt EqdE 3t EqdE
Condition ag ag ag tEf tEf tEf (d) (d) (d) dE lip tq dER tq dE Reto
{ag : IagI > IagI + 2Q, }I aq,
Table 5.1: Notation examples for the aggregate-level experiments. diverse contain very content. One such example is http: //www. geocities. com/1,
free which provides a service for hosting Web sites. In this case, the fact that two Web documents appear in the same domain does not mean that they are about the saiiie, or even a similar topic. is name more appropriate On the other hand, aggregating Web documents by domain different domain a when name is assigned to divisions, or
departments of large organisations. The second way to aggregate documents considers the directory under which the
Web pages are stored. the URLs http: //a. In this way, two Web documents, b/d/e/y. html and http which are accessible through . html, respectively, will : //a . b/d/e/z be assigned to the same aggregate, but http: //a b/d/x. html . by Web sites such as http: the partly overcomes problem posed but it may result in large numbers of small aggregates. pendent generated Web the sites are organised. on way content by scripts
will not. This approach //www. geocities. com/,
This approach is also more de-
For example, Web sites with dynamically structure related to the
directory have a useful may not
topics covered by documents.
Even though there are many ways to define aggregates of Web documents, such definition introduced the two of aggregates. a simple approaches provide as clustering, They also have the advantage that they identify aggregates by simply matching t he by URL during URLs. This document take the a accessing querying, place can string of database for Web documents, or during indexing, by assigning an aggregate identifier to The document. associated computational each level experiments is thus very low. Table 5.1 summarises the notation that will be used for the aggregate-level experithe that For the thesis. in this average the counts experiment example, rest of ments field f is in terms the documents that domain contain all query aggregates of size of in the table. the denoted by y(f), of second row shown as ' avg(dom)
l Visited on 11th August 2005.
cost of the score-independent aggregate-
114
5.4 Retrieval
score-dependent
experiments
5.4
Retrieval
score-dependent
document-level
experiments
Both the score-independent
and aggregate-level experiments depend solely on the occurrence of query terms in documents. Therefore, they are independent
of any retrieval approach, or any score that is assigned to documents. However, not all the documents that contain a query term are relevant to a query. In addition, t he outcome of the experiments that depend only on the occurrence of the query terms niav be biased by frequent terms. The current section introduces experiments that employ the scores assigned to documents by a retrieval approach. This retrieval approach is not necessarily used for obtaining the final ranking of documents, as discussed in Section 5.2. The introduced experiments employ a score distribution transform it into a new score distribution, documents to assigned and
after a one-step propagation of the scores
through the incoming hyperlinks of the documents, in order to favour the documents that point to other highly scored documents. The main underlying assumption of the experiments is that the difference between the two tested score distributions to the usefulness of evidence from the hyperlink from hyperlink the evidence structure is related For of Web documents.
distributions', difference between is the two tested there score a great example, when employing additional structure analysis or the URL of documents may be more effective for retrieval. has been in order to predict the effectiveness used of retrieval scores (2001) Manmatha modelled the retrieval scores as a mixture et al. of a retrieval system. distribution for for documents, Gaussian distribution and an exponential relevant of a Gaussian distribution, between difference the documents. The the mean of non-relevant The distribution distributions the two the and point where intersect indicates how well a system is exCronen-Townsend documents. from distinguish the the to non-relevant relevant pected divergence information theoretic (2002) the the clarity of a query as et al. modelled between the query language model and the collection language model. When the two language models are different, then the retrieval is expected to be effective. The defined experiments in this section are related to the approach of Cronen-Townsend et al. in the sense that both the clarity of a query and the introduced score-dependent exdistribution,,. between divergence difference two the probability or periments measure Indeed, the introduced score-dependent experiments in this thesis focus on estimating
115
5.4 Retrieval
score-dependent
experiments
the difference between two score distributions
(Section 5.4.1), after performing a one-
documents document incoming hyperlinks the through the of scores step propagation of (Section 5.4.2).
5.4.1
Divergence
between probability
distributions
There are several different ways to estimate the divergence between probability distributions. A commonly used definition of information theoretic divergence between two probability I (P, Q) (Kullback, distributions 1959): I (P, Q) _A
i
P=
{pi}
Q= and
{qi} is the Kullback-Leibler P2 1082qi
divergence
(5.10)
It easy to verify from Equation the Kullback-Leibler Kullback-Leibler
(5.10) that I(P, Q)
I (Q, P), or in other words, that Following from this, the symmetric
divergence is not symmetric.
divergence J(P, Q) is defined as the sum of the divergences I(P, Q) 1959): J(PIQ) =
_
(Kullback, P) I(Q, and
I(PIQ)
(pi
+I(Q,
P)
A
loge - 4'i) "
(5.11)
qz
Q) I(P, >0 be (5.11), that it (5.10) and Equations From the above seen can and if if the Q) to Q) J(P, I(P, Both only and zero Q) 0, equal J(P, are > and respectively. for bound is there However, that no upper distributions P and Q are equivalent. note Q). J(P, Q) I(P, the values of and (Lin, 1991), divergence Jensen-Shannon by which corresponds the This is addressed from P distribution the average divergence the probability of to the Kullback-Leibler follows: Q, P distributions as and the probability of
K(P, Q)
=I
_
(P,
.P+2"
Q)
(5.12)
Pi 1o 1i b2 2. I. pi + - qi
The symmetric
follows: defined is divergence Jensen-Shannon as
L(P, Q)
K(P. Q) + K(Q, P)
1092 Pi t2' Pi 11+ 4i pi +2 qi 92 g2 10 1
(5-13)
pi +2' 9i
116
5.4 Retrieval
score-dependent
experiments
One of the properties of this measure of divergence is that there exists an upper bound for its value, L(P, Q) <2 (Lin, 1991). The symmetric Jensen-Shannon divergence is known also as total divergence from the average (Pirolli & Pitkow, 1999), and it i a special case of the weighted Information Radius (Jardine & Sibson, 1971, page 13):
i z (wp+ pz w9) loge + Wp + Wq WpPi + Wgqi WpPj (wp + w9) Wgqi qi loge " Wp + Wq wpPi + wqq1 (5.14)
In the above formula, wp and Wq are the weights of the probability distributions { pz} {qi}, and respectively. Indeed, the information radius of two probability distributions with the same weights is equal to half their symmetric Jensen-Shannon divergence.
5.4.2
Usefulness
of hyperlink
structure
The current section defines an experiment based on measuring the divergence between the score distribution of a retrieval approach, and a modified score distribution, obtained after a one-step propagation of scores through the incoming hyperlinks of Web documents. The underlying if is that there are non-random patterns of assumption diverthe then query, of document scores will be hyperlinks among the retrieved Web documents for a particular distributions between the the original and modified gence
higher. This suggests that the hyperlink structure is more useful, or, in other words, that the use of structural evidence may be more effective for retrieval. The usefulness of the hyperlink Retq of the set of retrieved structure of a sample theoretic divergence between two probability S of the scores sc2 for the documents documents is defined as the information distributions.
The first one is the distribution
di E Retq. The scores scz can be the relevance scores assigned to documents by any Chapter 4, in introduced the or a query-independent source of retrieval approaches of he In Model. thesis, Absorbing this t the PageRank the remainder of or evidence, such as by the to a particular retrieval approach. relevance scores assigned scores sci correspond is constructed so as to favour the highly scored documents t hat desired is This Retq. in the documents highly property under to a scored other point highly documeiit, browsing is for is it scored a a user, who more useful assumption that The second distribution following hyperlinks. The by documents highly be new to to scored other access able distribution U of scores u; is defined as follows. The score uz for document di E Rh'tq
117
5.4 Retrieval
score-dependent
experiments
depends on the score sci, as well as on the scores sch of all the retrieved documents dj. which are pointed to by di:
Ui = sci + 1: di -+dj sch di, dj E Retq (5.15)
di dj that hyperlink there from di to dj. where means exists a -+ The measures of divergence introduced in the previous section estimate the difference between two probability distributions. However, the score distributions S and U do not necessarily correspond to probability distributions. Indeed, the Divergence From Randomness (DFR) weighting models and their field-based extensions rank documents divergence to the according of the occurrences of a term in a document from a random distribution. The resulting scores are in the range (0, +oo). The scores assigned to documents by the weighting model BM25 and its extension BM25F also fall within the same range. Therefore, it is necessary to normalise the retrieval scores from (0, +oo) to (0,1]. Nottelmann & Fuhr (2003) compared linear and sigmoid functions to transform of relevance. For simplicity, and in order to the document scores into probabilities
reduce the number of the introduced parameters, the scores are normalised by dividing them with their sum. In this case, the normalised scores are in the range (0,1] anti their sum is equal to 1:
sCi Ui
sni=
djERetq
'SC7
cn_ l
(5.16)
dj ERetq Ui
{ui } has been defined in order to favour the highly scored doc(5.15), According Equation documents. highly to di to that point scored other uments the score ui > sci > 0. Therefore, highly scored documents, which do not point to favour documents In high have those documents, to only order score. a still any other The distribution U= that point to other highly scored documents, a new distribution follows: as u1._
d1-rd
U' = {ui}
is defined
sch
and uni =Z, Edj
ERetq
(5.17)
'. {ui} in the sense that the dependence ui > scj is removed. For the distribution {uz}. it is easy to verify that if a document di does not have outgoing links, then u= = 0. If (l, Therefore, 0<u; be low it that the < documents scj. to case scores, may with points
The normalised distribution
Jun 'j} is denoted by U;
The distribution
{uz} differs from
118
5.4 Retrieval
score-dependent
experiments
the dependence of {ui} on {sci} is stronger than the dependence of {ui} on {sc=}, and the hyperlink structure among the documents in Retq is expected to have a greater impact on the distribution {ui}. {uni}, Un and = the usefulness of the hyperlink structure is estimated as the symmetric Jensen-Shannon divergence between the normalised distributions S and Un: L (Sn Ute) _
diERetQ
Having defined the score distributions
Sn = {snZ}, Un = {uni},
uni log, u+
uni
S7L
'2+
diERetq
loge sni u2 +2
(5.1s)
or as the symmetric Sn and Un

L(Sn, Un) -E
Jensen-Shannon divergence between the normalised distributions
uni
1092 un,
2+
uni
32
E
d; ERetq
srti loge
sni
di ERetq
un,
2+2
sni
(5.1 J)
The usefulness of the hyperlink
structure
is defined using the symmetric Jensen-
Shannon divergence, instead of the symmetric Kullback-Leibler divergence, because the values of the former are in the range [0,2]. An additional reason for employing the symmetric Jensen-Shannon divergence is that the two probability This means that the Kullback-Leibler distributions do be have to mutually not Leibler divergence. for probability distributions for Kullbackit is the the case absolutely continuous, as divergence is defined only
for which sni = 0, for all i, for which uni = 0, and vice because S, Un, is distributions In this the the satisfied, condition and case of versa. the definition of the distribution U from Equation (5.15) suggests that 'i > scz > 0, Kullback-Leib ler However, for di Retq. the E >0 >0 all and sni and consequently uni divergence cannot be defined for the distributions S., and Un, because uni can be 0 even if sni > 0. Therefore, the symmetric Jensen-Shannon divergence is more appropriate to use in the context of selective Web IR. Note that, the Jensen-Shannon divergence has been used in the context of pattern recognition to measure the distance between (Wong & You, 1985). random graphs Before continuing, an example is provided in order to illustrate the introduced S, U, distributions between divergence how the the to and or show experiments, and U, depends on the scores and the hyperlink structure of the retrieved documents. 6 Suppose that for a particular query, six documents. numbered from 1 to 6, }, {0.9.0.4,0.3,0.2,0.2,0.1 {si} have been retrieved and ranked according to the scores = Example
119
5.4 Retrieval
score-dependent
experiments
i. e., the score of document 1 is 0.9, the score of document 2 is 0.4. and so on. The documents six are connected with hyperlinks, as shown in Figure 5.2. The divergences L(S, U,,,) and L(S, U7) are computed for the three graphs of hyperlinks shown in the figure, as well as for a fourth case of a complete graph, where there is a hyperlink between any ordered pair of documents. The first graph, shown in Figure 5.2(a), corresponds to a case, where there is no apparent pattern in the way the hyperlinks are distributed. After the distributions Un ) = 0.0728 and L(S, U;, ) = 0.6875. The second graph U,,, hyperlinks, in Figure 5.2(b), corresponds to a case, where the top three ranked of documents are strongly connected. In the same way as before, it is easy to compute ) = 0.1226 and L(S,, Un) = 0.4273. For the third graph, shown in that L(S,, UT, Figure 5.2(c), there is a group of documents that are strongly connected, without 1) = all of them being highly ranked. In this case, L(Sn, U, ) = 0.2167 and L(Sn, U, 0.9386. For the last case, suppose that the graph of the example is complete, in that it contains one hyperlink between each and every ordered pair of documents. In this case, L(Sn, Un) = 0.1675 and L(Sn, U1) = 0.2485. The divergence L(Sn, Un) has the lowest value when there is no apparent structure in the way hyperlinks are distributed, and increases its value when there is a connected from documents higher is if documents. The increase the the connected group group of is information In it documents. list lower in that the this the case, assumed of are ranked from the hyperlink structure is more useful. The computed values of L (S", U,") are ), because the distribution U1 is less dependent on S, higher than those of L(S,,,, Un, ti ) for U, divergence L(S, The the discussed U, distribution than the graph above. as (a) is higher than L(S,,,, Un,) for the graph (b), while the divergence L(S" U7) for the , L(S", (b). Uri) fact indicates This for Uh) that (a) L(S, lower is the than graph graph L(s, and U7) can be used to define two experiments E, which are not equivalent. Moreover, the fact that the complete graph does not result in the highest divergence Sn and are computed, L(S,
n) indicates that the usefulness of the hyperlink structure for both L(SU, U.,,) and L(Sn, U, 0 does not depend only on the number of hyperlinks. by defined has been hyperlink The usefulness of the using a one-step structure An documents. hyperlinks incoming the n-step proppof propagation of scores through distributions between dependence the in score of either a weaker agation would result
120
5.4 Retrieval
score-dependent
experiments
1 _. 63 54 , 2 a
I, `r
0
5(4 &54
Figure 5.2: The hyperlink graphs of the ranked documents, corresponding to the first three cases described in the Example 6. U,,, or Ute, and the initial computing distribution S. In addition, the computational
overhead of
U.,, and Un would increase with every step.
In the remainder of this thesis, the experiments E that employ either L(Sn, Un) or L(S,,,, U7), when considering the documents with all the query terms in a combination of Eye In fields f, are denoted by Eye this and respectively. notation. f), f), L(SU),,, L(SUI)wm, , distribution S documents. for that the to technique assigns score a scoring wm stands This scoring technique can be any of the retrieval approaches described in Chapter 4. When considering documents with at least one query term in a particular combination denoted by E3(f), E2(f), fields, the are of experiments L(SU)wmand L(SU')wm, respectiv('ly.
121
5.5 Bayesian decision mechanism
5.5
Bayesian decision mechanism
A range of experiments E. has been defined in the Sections 5.3 and 5.4, using different sources of evidence. In the context of a decision mechanism, the effectiveness of three experiments depends on how successful they are in detecting the true state of nature, hence, in identifying the most appropriate retrieval approach to use for each given and query. The current section defines a Bayesian decision mechanism (Section 5.5.1), and discusses how it can be applied for selective Web IR (Sections 5.5.2 and 5.5.3).
5.5.1
Definition
of the Bayesian
decision
mechanism
Suppose that there are k For each state of
The Bayesian decision mechanism is defined as follows.
available retrieval approaches and r states of nature, where k=r.
nature si, the retrieval approaches are ordered according to an evaluation measure 'rr, as described in Section 5.2. In this way, the most effective retrieval approach ai for the state of nature si (m(ai, si) > m(aj, si)) corresponds to a loss in utility equal to l(ai, si) = 0, while the other retrieval approaches have a higher loss of utility l(aj,. ';j) The least effective retrieval approach corresponds to a loss of utility equal to 1. Each has a prior probability state of nature si a particular query. This prior probability for which m(ai) > m(ad), i#j. P(si) P(si) for being the true state of nature of is defined as the number of queries of a stete
In other words, the prior probability
for the depends the corresponding retrieval which queries of on number of nature si decision in k the is the the used retrieval approaches most effective among approach ai mechanism. The Bayesian decision mechanism employs the experiment E in order to The conditional probability E. the the experiment outcome of o as that si is the true
informed the true state of nature. about guess make an P(olsi) denotes the probability of obtaining
is the si. state of nature when According to the Bayes decision rule, the posterior probability depends the on prior probability state of nature follows: LI, the as experiment outcome o of
P(si) of si and the evidence from the
P(Si) - P(Ol si) P(si o) = P(o) where: k
(5.20)
P(o) _
i=1
P(Si) " P(olsi)
(5.21)
122
5.5 Bayesian
decision
mechanism
Then, for each action ai, the expected loss E[l(ai)] given by':
E[l(ai)] k 1: _ j=1 l(ai, sj) " P(sjI
for all the states of nature i
o)
(5.22)
The Bayesian decision mechanism selects the retrieval approach ai with the minimum expected loss E[l(ai)]. This mechanism is optimal in the sense that it minimises
the average classification error (Duda & Hart, 1973), or, in other words, the expected loss. In the case of selective Web IR, this is desirable in order to evaluate the effectiveness of the employed an appropriate experiment approach. E in identifying the true state of nature, and applying
retrieval
to note that the denominator P(o) in Equation (5.20) is a constant, and it is used as a normalisation factor in order to obtain probabilities. In the context Web IR, the objective of the decision mechanism is to select a retrieval of selective Therefore, denominator be P(o) ignored, without affecting to the approach apply. can the selection of the retrieval approach to apply for a particular Bayesian decision mechanism is illustrated Example query. The use of the in the following example.
It is important
7 Suppose that a decision mechanism selects one retrieval approach from
three available ones: al, a2, and a3. The decision mechanism performs an experiment, E, for which the posterior likelihoods P(sj (o) are shown in the upper diagram of Figure 5.3. The loss l (ai, sj) associated with applying ai when the state of nature is sj is specified in the matrix [lid], where lid =l (ai, sj) :
[lid ]=1.0
0.0
0.5
1.0 0.5 0.0 1.0 0.5 0.0
The lower diagram in Figure 5.3 shows the expected loss from applying each of t lie (5.22). In from Equation this three retrieval approaches al, a2, and a3, as computed diagram, the intersections of the loss curves define the decision boundaries, that is the for E, the thresholds of the one selecting as serve which experiment outcome values of Fin if For the outcome o< of, results experiment example, retrieval approaches ai. loss lowest in it because the is expected results then the retrieval approach a3 applied,
'Note that l(a;, sj) denotes the loss from selecting the action a; when the true state of natura is loss from )] denotes the E[l(a; action a;. the selecting expected s;, while
123
5.5 Bayesian
decision
mechanism
0.2
0.15
b 0 0.1
0.05
0 0 10 20 30 40 50 outcome values of E 0.25
0.2
0.15 0 0.1
0.05
00
10
20 outcome
30 values of
40
50
Figure 5.3: Example of a Bayesian Decision mechanism with 3 available retrieval apdiagram densities The the three shows upper estimated proaches and states of nature. lower diagram for The likelihoods the shows the coreach state of nature. of posterior The for E[l(an)] loss approach. outcome values ol, 02, each retrieval responding curves loss e intersection the to the curves represent the points of 03, and 04 of corresponding decision boundaries of the decision mechanism.
124
5.5 Bayesian
decision
mechanism
E[l(a3)].
is lower than both E[l(a2)] and E[l(a3)]. Therefore, the retrieval approach al is applied. In a similar way, the retrieval is approach a3 applied when 02 <0< 03, the retrieval approach a2 is applied when 03 <0< 04, and the retrieval approach a3 is applied when 04 < o. In this way, the decision mechanism selects a particular of the experiment E. retrieval approach for every possible outconme
If of <o<
o2, then the expected loss E[l(al)]
A decision mechanism that selects one out of two retrieval approaches is a special case, where the above description can be simplified. Indeed, in the case of two retrieval approaches, or equivalently binary:
(az, sj) 0i=j 1i:? j
two states of nature, the output
loss function the of
is
Therefore, E[l(al) P(silo).
l(al, P(s2lo) P(s21o) E[l(a2)] l(a2, P(s1Jo) 82) and sl) " = = = = " The decision mechanism applies retrieval approach al when P(s21o) < P(siIo), of sl is greater than that of s2.
or, in other words, when the posterior likelihood k-1 of decisions between two retrieval approaches.
Generally, selecting one out of k retrieval approaches can always be mapped to a series
5.5.2
Application
of the Bayesian
decision
mechanism
This section discusses the application
in Bayesian decision the order to mechanism of
how It describes Web IR. the the to of quantities required estimate selective perform decision mechanism with respect to a set of training data. The application of the Bayesian decision mechanism requires the estimation of three
quantities:
The prior probability "
that a state of nature is the true state of nature. This corresponds to the prior probability that the retrieval approach a2 is t he decision Bayesian IR Web the When with performing selective most effective. is the P(si) to the the of numproportion set equal prior probability mechanism, P(si) is the for the most effective. a2 approach retrieval which queries,
ber of training
) the (aj, l loss the The approach aj of wheni retrieval application associated with s; " is 1(),, the When is training available. 5 queries the true state of nature a set of st. l(aj, s=) is defined as the difference between the retrieval effectiveness of ai w id
125
5.5 Bayesian
decision
mechanism
that of aj, for the subset of training retrieval approach. " The probability
queries for which ai is the most effective
P(olsi) that the outcome of an experiment e is o when the state of nature is si. This probability is computed by estimating the density of the outcome values of the experiment E, for the subset of training queries, for which the retrieval approach ai is the most effective. A more detailed discussion about the density estimation is given in Section 5.5.3
5.5.3
Density
estimation
The last point of discussion with respect to the Bayesian decision mechanism is relat ed to the density estimation of P(olsj) from the outcome o of the experiment E for a number of queries. Bishop (1995, Chapter 2) identifies three main types of density estechniques: parametric methods that assume a certain functional form for the estimated density, non-parametric methods that allow the available data to completely specify the estimated density, and semi-parametric methods such as mixture models. A disadvantage of the parametric methods is that it may be difficult to find an approfunctional priate form for the estimated density. Although non-parametric methods disadvantage, this the complexity alleviate of the estimated density depends on the timation
data do Mixture number of available points. models not assume a particular functional form for the estimated density, and result in less complex models, but they are computationally In particular, density P(olsj) In thesis, the this estimation of expensive. is performed data. because the training of relatively small amount of methods, technique, with automatic
using non-parametric
density Gaussian kernel-based estimation a (Silverman, is 1986), bandwidth the employed'. setting of
Due to the limited amount of training data, it is necessary to pay special attention to the existence of outliers in the available experiment outcome values. Figure 3.4 for (Chambers 1983) the box-and-whisker the the of outcome values et al., plots shows score-independent document-level for In the task td2003. each experiment, computed distribufirst third the box to the the of and quartiles correspond ends of each plot, tion of experiment outcome values. The bold line corresponds to the median of the distribution. The whiskers extend to the farthest points that are within 3/2 times the
'The density estimation was performed with the software package R: A language and environment for statistical computing (R Development Core Team, 2005).
126
5.5 Bayesian
decision
mechanism
E3(6) outcome values
8V(b) outcome values
Irr
Oe+00
111
III
2e+05
4e+05
6e+05
10000
30000
F-3(at) outcome values
eV(at) outcome values
Uo
0 20000 40000 60000
uI COD 0
1 0 500 1000
-71500
Figure 5.4: Box-and-whisker plots of the score-independent document-level experiment outcome values for the task td2003. first the range of and third quartiles. Any points that are farther than the be to whiskers are considered outliers, and they are denoted with a circle. The top left box-and-whisker plot shows that there is one outlier among the outcome E3(b) for the task td2003, corresponding to the query TD39: the values of experiment national public tv radio. This is due to the very high document frequency of the in the GOV test collection, resulting in 634,053 query terms national interquartile
and public
The outcome values for llie experiments EV(b), 3(at), and 3(at) are lower than those of E3(6), but there exist outliers in all cases. More specifically, the experiments 3(at) and EV(at) result in more outliers that the experiments 4(b) and E3(b), because the obtained outcome values depend on the distribution of hyperlinks with the query terms in the associated anchor text: there The density estimation is performed for range are many distinct anchor texts associated with few hyperlinks, while there are only fi,w hyperlinks. texts anchor associated with many the range of obtained outcome values that lie within 3/2 times the interquartile of the first and third quartiles.
documents with at least one query term in their body.
127
5.6 Summary
In the next chapter, the Bayesian decision mechanism will be employed to evaluate the proposed score-independent (Section 5.3), and score-dependent experimems (Section 5.4) in a setting where relevance information is assumed to exist.
5.6
Summary
a novel framework for selective Web IR. The framework is formulated in terms of statistical decision theory (Section 5.2). One of its main concepts is the decision mechanism, which selects one retrieval approach from a set of available ones on a per-query basis. The selection of the applied retrieval approach is aided by the experiment E, which extracts a feature from a sample of the set of documents. retrieved The introduced framework for selective Web IR is different from the related work in
(Section several aspects 5.2.1). First, it differs from query-type classification, because the aim is to apply an appropriate particular adjustment retrieval approach basis, instead of it
This chapter has introduced
retrieval
approach on a per-query
for each query-type.
Second, it differs from the dynamic approach
of the weights of each source of evidence, because each retrieval Third, the introduced framework
is assumed to be fixed. performance prediction,
is more general than queryof a predictor with
which primarily approach.
estimates
the correlation
the effectiveness in the retrieval
of a retrieval effectiveness
Selective Web IR aims to predict the difference approaches.
between several retrieval
Several experiments E have been defined. Section 5.3 introduces a range of experiments based on counting the occurrences of query terms in documents, or in particular fields of documents. These documents are called score-independent, because they do documents. document-level The to score-independent consider any score assigned not experiments least documents the of at one, or all query terns. count number with informat ion documents. documents in The aggregates of related aggregate of
The score-independent aggregate-level experiments consider the structural from the distribution
domain, documents belong documents the to the the that that to same or correspond directory. in the same are stored Section 5.4 has presented a range of experiments based on estimating the usefulness hyperlink the of documents. These for the experimeiits a sample of retrieved structure
128
5.6 Summary
are called score-dependent, to documents by a particular
because they compute the information
theoretical
diver-
between two score distributions. gence
The first one is the score distribution
assigned
retrieval approach, such as a field-based weighting model. The second score distribution is obtained after a one-step propagation document the of scores through their incoming hyperlinks.
The Bayesian decision mechanism defined in Section 5.5 provides a means for the
evaluation of the proposed experiments E, by applying a retrieval approach with the loss. The Bayesian decision mechanism can be used to select one minimum expected from approach retrieval any number of available ones. The estimation of the likelihoods that a particular retrieval approach is appropriate is carefully performed, by considering the fact that there may be outliers in the obtained outcomes of an experiment E. Overall, the introduced
proach to the problem
framework for selective Web IR represents a general al)appropriate retrieval approaches to apply on a. of this thesis focuses on evaluating the most appropriate 6 evaluates the proposed
of identifying
per-query
basis. The remainder experiments
the effectiveness of approaches to
the proposed apply
in identifying Chapter
retrieval
on a per-query
basis.
experiments
in a setthe
ting, where it is assumed that relevance information evaluation that limited of the proposed experiments
exists. Chapter setting,
7 investigates
in a more realistic
where it is assumed
exists.
129
Chapter
Evaluation Information
6.1 Introduction
Selective of Retrieval
Web
The potential for improvements in retrieval effectiveness from selective Web IR has been established in Chapter 4. Furthermore, Chapter 5 has proposed a new framework 1'()i selective Web IR, which employs a range of experiments &. The current chapter aims to evaluate the proposed framework, and to establish the effectiveness of the introducc(I experiments E in an setting, where relevance information is assumed to exist. This chapter starts with Section 6.2, which introduces the evaluation methodology for the experiments & Each experiment E is evaluated in the context of a Bayesian decision mechanism, which selectively applies two retrieval approaches on a per-query basis, assuming that there exists relevance information. The two retrieval approaches are chosen according to their potential for improvements from selective Web IR, and field-based described Chapter different An in 4. employ weighting models, as example of a Bayesian decision mechanism, which selectively applies three retrieval approaches on a per-query basis is also provided later in this chapter. Section 6.3 discusses the evaluation of the score-independent experiments E, which Section both document-level, include in 5.3. These the experiments were proposed as F. Section Next, directory domain, 6.1 the and aggregate-level experiments well as presents the evaluation of the score-dependent experiments, which estimate the usefulhyperlink the ness of structure.
130
6.2 Evaluation
methodology
The chapter continues with Section 6.5, where the proposed experiments are computed from small samples of documents, in order to reduce the associated computational overhead, and to assess whether highly scored documents are more useful for computing the outcome of experiments. Section 6.6 discusses the evaluation of the experiments E, decision the when mechanism selects between retrieval approaches, which employ the field-based same weighting models. Section 6.7 investigates an example of a Bayesian decision mechanism, which uses more than two retrieval approaches. The chapter closes with a discussion of the findings in Section 6.8.
6.2
Evaluation
methodology
The aim of this section is to introduce the evaluation methodology that will be used for the remainder of the chapter. First, it describes how the effectiveness of an experiment E will be evaluated. Next, it defines the experimental setting, in which a Bayesian decision mechanism, as discussed in Section 5.5, employs the proposed score-independent and score-dependent experiments to perform selective retrieval. This section closes with a brief description of the presentation of the results in the remainder of the chapter.
6.2.1
Effectiveness
of experiments
This section discusses issues related to the evaluation of the proposed experiments E. The effectiveness of an experiment E is evaluated with respect to the number of decision boundaries used in a decision mechanism, the achieved mean average precision (MAP) by the decision mechanism, and whether the correct decision is made for a statistically significant number of queries. As discussed in Section 5.5, in the context of a Bayesian decision mechanism, which 8, boundaries decision intersection to the the correspond points employs an experiment An for loss the the employed approaches. each of retrieval effective of curves of expected experiment retrieval E should result in a different distribution loss for the expected of each between he t points In such a case, the number of intersection for loss the each retrieval expected of
approach.
E does if likely be low. However, loss is the to experiment not result curves of expected in a different distribution approach over the between E, intersection is the the curves points number of range of outcome values of between intersection If be high. the curves of expected there to points are no expected
131
6.2 Evaluation
methodology
loss, because the loss of one retrieval approach is always lower than that of the other retrieval approaches, then the decision mechanism cannot selectively apply differ mlt retrieval approaches on a per-query basis. In such a case, the experiment E is considered to be less effective for selective Web IR. The same discussion applies to the case of a Bayesian decision mechanism, which employs two retrieval approaches, and selects the higher likelihood the to be the most effective retrieval approach. as one with posterior described in Section 5.5. retrieval approach by a decision mechanism on a per-query basis should have a positive impact on MAP, compared to the MAP of the individual retrieval approaches. Therefore, the effectiveness of an experibe ment should reflected on the resulting MAP of the decision mechanism. The raust effective experiments should result in improvements in MAP similar to that obtained by the hypothetical experiment which always applies the most effective retrieval (Section basis 5.2.2). The Wilcoxon's signed rank test is used approach on a per-query to indicate whether the difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach is statistically significant. The resulting MAP is not the only indication of the experiment's effectiveness. If the employed retrieval approaches have similar performance for a query, then applying the most effective retrieval approach for that particular query is not expected to have In decision to take important impact the the order mechanism. of effectiveness on an & Castellan, (Hoel, Siegel 1988) is 1984; to into issue test the this used sign account, denote whether the most appropriate retrieval approach is applied for a statistically significant number of queries. The application of the most appropriate
6.2.2
Evaluation
setting
It briefly describes the
This section provides an overview of the evaluation setting.
been introit has their corresponding notations, as employed retrieval approaches, and decision Bayesian the describes it Next, Chapter 4. the duced in mechanism, setting of E. for the is the proposed experiments of evaluation used which 6.2.2.1 Description of retrieval approaches setting approach. In this thesis, t.hc
Selective Web IR can be performed with any retrieval employed retrieval
field-based the to weighting either one of approaches correspond
132
6.2 Evaluation
methodology
(PL2F, models their combination
PB2F, I(ne)C2F,
DLHF,
with query-independent
(Section 4.4 on page 67), or sources of evidence (Section 4.5 on page 4).
and BM25F)
The employed fields are: the body; the anchor text incoming hyperlinks; and r he of title. Compared to the original weighting models, the field-based weighting models are preferred, because they provide important gains in retrieval effectiveness for Web specific search tasks, as shown in Chapter 4.
The employed query-independent tion 4.5.1 on page 74), PageRank with static priors (Section sources of evidence are the URL path length (Se(c. (Brin & Page, 1998), and the novel Absorbing Model
4.5.2.4 on page 83, and Section 4.5.2.5 on page 86, respe(-field-based a with weighting model is denoted by appending model's name. For
tively).
Their
combination
the letters example,
U, P, and A, respectively, the combination
at the end of the weighting
PL2F PageRank is denoted by PL2FP, of with and the comb 1is denoted by I(ne)C2FA. model Each field-basvdl in evidence, order approaches, as
I(ne)C2F nation of weighting
Absorbing the with with
model is combined
one source of query-independent
not to further described
increase the number of hyper-parameters
in the retrieval
in Section 4.5.3.
have been dethe set, as of employed retrieval approaches (MAP) Section 92. The in 4.6, of each retrieval scribed mean average precision page The hyper-parameters for The directly is task. optimisation a mixed optimised approach process is terminated do hyper-parameters iterations, to their 20 that the not necessarily converge opso after timal values. The obtained hyper-parameter values from the above training process are the training tasks than in the to other with approach same retrieval evaluate order used for is the task For mq2004. and mixed optimised approach ones. example, a retrieval The hp2003, for it is tasks td2003, the evaluation results of then or np2003. evaluated 4.10, 96. in Table displayed page the employed retrieval approaches are 6.2.2.2 Description of Bayesian decision mechanism setting
The Bayesian decision mechanism, which has been described in Section 5.5, is used E. The tasks the are: employed the to perform proposed experiments evaluation of td2003; td2004; hp2003; hp2004; np2003; and np2004. The training decision the of E the task. the same with the performed are experiments evaluation of mechanism, and This setting has been chosen in order to reduce any effect on the evaluation of the experiments from the differences among the employed tasks. Chapter 7 discusses Ole
133
6.2 Evaluation
methodology
evaluation of the experiments E in a setting with limited relevance information. where different mixed tasks are employed for the training of the Bayesian decision mechanism, and the evaluation of the experiments. In order to obtain a clear indication about the effectiveness of the evaluated exE, the employed retrieval approaches by the Bayesian decision mechanism periments correspond to the ones with the highest potential for improvements in retrieval ef'ecctiveness, as discussed in Section 4.7, and presented in Table 4.11. More specifically, the Bayesian decision mechanism employs pairs of retrieval approaches, which use different field-based weighting models. For ease of reference, Table 6.1 presents the evaluation of the selected pairs of retrieval approaches. This setting is chosen in order to provide a clear indication of the effectiveness of the proposed experiments E.
Mean Average Precision Row 1 2 3 4 5 6 Task td2003 td2004 hp2003 hp2004 np2003 np2004 First approach I(ne)C2FU (0.1446) (0.1299) PL2F (0.6660) DLHFU (0.5523) PB2FU (0.6846) PL2FP (0.6944) PB2F Second approach (0.1455) DLHFP I(ne)C2FP (0.1307) (0.6498) BM25FA (0.5555) DLHFA I(ne)C2FA (0.6836) I(ne)C2FA (0.6814) MAX 0.1926 (+32.37%) 0.1615 (+23.57%)' 0.7658 (+14.98%)' 0.7025 (+26.46%)' 0.7827 (+14.33%)' 0.8019 (+16.52%)'
Table 6.1: The pairs of retrieval approaches employed by the Bayesian decision mechE. The in the the columns `First approach' proposed experiments anism evaluation of MAP for their `Second the the employed retrieval approaches and approach' show and MAP `MAX' brackets. The the task shows maximum column corresponding within that can be obtained by selectively applying one of the two retrieval approaches on a from MAP in increase brackets is The basis. the the relative value within per-query difference * indicates The that the individual symbol retrieval approach. most effective in MAP between the mechanism MAX and the most effective retrieval approach is The Wilcoxon's test. to rank results are copied signed statistically significant, according from Table 4.11. The outcome of the evaluated experiments is computed from a sample Retq of the formed Retq is This Section 5.2. in discussed documents, with sample as set of retrieved documents that contain at least one query term in either their body, or their title. For in text, documents terms their Retq anchor with query contains example, the sample documents However, body, title. their in their least term or either one query and at Rety. in included the in text their sample are not anchor that only contain query terms disfields documents. different for defined E been have the as of The experiments document the three From 5.4. the 5.3 Sections in of combinations possible all and cussed
134
6.2 Evaluation
methodology
fields (body, anchor text, and title), the evaluated experiments employ either the body field (b), or a combination of the anchor text and title fields (at). The body field is because it is similar to the full text of documents, while the combination selected, of the anchor text and the title corresponds to fields that provide a concise description of the documents. experiments have shown that other combinations of the body, anchor text, and title fields perform either similarly to the body field, or similarly to the combination of the anchor text and title fields. Initial
6.2.3
Presentation
and analysis
of results
Here, a brief description
of the presentation and the analysis of the results is given,
before proceeding to the evaluation of the proposed experiments E. In the subsequent Sections 6.3 and 6.4, each row in the tables shows the following a row identifier for ease of reference ('Row'); the employed task ('Task'); the employed pair of retrieval approaches ('Retrieval approaches') and the mean av('Baseline'); by the the the erage precision of effective one experiment employed most information: decision mechanism ('MAP'); mechanism the achieved mean average precision by the Bayesian decision the relative difference between the MAP of the most effective ret, (`+/-%'); decision by MAP the trieval approach and the achieved a which mechanism decision times the the that mechanism applies the correct retrieval signifies number of *, level 0.05 test; to the is a sign which according statistically significant at approach decision MAP between that difference the the that the and of mechanism of signifies level 0.05 is the most effective retrieval approach according statistically significant at to Wilcoxon's decision boundaries in decision the the test; and number of singed rank ('Bnd'). mechanism The tables report the evaluation results for the experiments E, which identify at least focus in is This to for tasks. tested boundary the decision order choice made each of one (topic distillation, for tasks types three the E that are effective of all on the experiments home page finding, and named page finding). The comparison of the effectiveness of for the tested their to E is all performance the experiments performed with respect for that focus a range the analysis on the experiments perform well tasks, in order to impact from the discussed two The perspectives: different tasks. mainly are results of of the particular fields used to compute the experiments E. i. e., the body field, or a
135
6.3 Evaluation
of score-independent
experiments
of the anchor text and the title fields; and the particular of each experiment F-. combination
characteristics
The remainder of the chapter is organised as follows. Sections 6.3 and 6.4 present the evaluation of the score-independent and the score-dependent experiments, in the described setting. The experimental setting and the presentation of the results are rein Sections 6.5,6.6, and 6.7. More specifically, Section 6.5 introduces documcmt visited in sampling order to reduce the computational overhead of the experiments, and to assess their effectiveness when using only highly scored documents. Section 6.6 discusses the effectiveness of a decision mechanism when the retrieval approaches employ the same field-based weighting model. Section 6.7 describes the results from an example of Bayesian decision a mechanism, which employs three retrieval approaches.
6.3
Evaluation
experiments
This section evaluates the effectiveness of the score-independent document-level and aggregate-level experiments, which were introduced in Sections 5.3.1 and 5.3.2, respectively. First, the evaluation of the document-level experiments is presented in Section 6.3.1. The evaluation of the domain and directory aggregate-level experiments Sections in 6.3.2.1 and 6.3.2.2, respectively. For each type of experiment, are presented the evaluation results are followed by an illustrative Bayesian decision mechanism. for example a particular task. The examples are intended to provide insight in using the experiments in the context of the Section 6.3.3 provides some concluding remarks about
the evaluation of the score-independent experiments.
6.3.1
Document-level
experiments
The current section presents the evaluation of the document-level experiments. The based the document-level on counting number of are experiments score-independent documents, which contain query terms in particular fields. The considered documents field. The in least them, terms, the a particular one of or at query may contain all the text title. body; the fields the and anchor of and a combination are: considered The evaluation of the score-independent document-level experiments is perforiiied in the context of a Bayesian decision mechanism, which employs a particular pair
136
6.3 Evaluation
experiments
of retrieval
approaches for each of the tested tasks (Table 6.1). As described in Sec-
tion 6.2.2.2, the tested tasks are: td2003; td2004; hp2003; hp2004; np2003; and np2004. The same task is used for training the Bayesian decision mechanism, and for evaluating each experiment. This setting has been chosen in order to reduce any effect from the differences between the employed tasks on the evaluation of the experiments. 6.3.1.1 Evaluation results for document-level
document-level the of
experiments
experiments, which identify at
Table 6.2 presents the evaluation least one decision `+/boundary
for each of the tested tasks'. in a histogram in Figure
The results from column 6.1. Row 1 in the table of documents the t; Isk of
%' of Table 6.2 are presented
shows the evaluation that contain
of the experiment term
E3(b), which counts the number in their body.
at least one query
For each query from
td2003,
the Bayesian
decision mechanism I(ne)C2F model or the combination
selectively
applies either the combination
the field-based documents with
weighting
from URL length the of with evidence path of the field-based weighting model DLHF is
(I(ne)C2FU), (DLHFP). represents individual
PageRank which
The achieved MAP an improvement approach (0.1455).
of the Bayesian decision mechanism baseline the over MAP
0.1483,
of +1.92% Moreover,
of the applies
most effective
the decision mechanism significant
the most effective denoted by t.
retrieval
approach
for a statistically
number of topics, as
From Table 6.2, it can be seen that the experiments Ey(b) (rows 7-12) and (rows 13-18) result, on average, in a lower number of decision boundaries, than the experiment e-3(b) (rows 1-6). For all the tested cases, the Bayesian decision mechanism results in improved reW. `+/in differences by the indicated the column trieval effectiveness, as positive The most notable case is shown in row 4, where there is an improvement of 11.65% in MAP for the task hp2004. For the task np2004, the obtained MAP when the decision MAP lie higher is the than F-V(b) is 0.7341, t of which mechanism uses the experiment (0.7232 Web 2004 TREC track the best performing run in the corresponding task of from row 11 in Table 4.6, page 67).
'The evaluation results of the experiments, which do not identify at least one decision boundary for each of the tested tasks, are given in Table B. 1 (page 239) of Appendix B.
137
6.3 Evaluation
experiments
The sign test shows that the decision mechanism has applied the most appropriate retrieval approach for a significant number of queries in 3 cases for the experimeiit X3(b) (rows 1,3, and 4), in 1 case for the experiment Ey(b) (row 9), and in 2 cases for (rows 13,17). The Wilcoxon's the experiment EV(a, t) signed rank test shows that the decision mechanism results in statistically significant improvements in MAP compared to the most effective retrieval approach, in 1 case for the experiment EV(b) (row 9), and in 1 case for the experiment EV(a, (row 13). t)
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 Retrieval approaches I(ne)C2FU DLHFP PL2F I(ne)C2FP DLHFU BM25FA PB2FU DLHFA PL2FP I(ne)C2FA PB2F I(ne)C2FA I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 E
E3(b) E3(b) E3(b) E3(b) E3 (b)
E3(b) F-V(b) EV(b) Ey(b) F-V(b) F-V(b) Ey b Ey(Qt) EV(Qt) EV(at) EV(at) Ey(ag) Ey ate
MAP 0.1483 0.1313 0.6849 0.6202 0.7007 0.7220 0.1476 0.1402 0.6942 0.5635 0.6940 0.7341 0.1568 0.1322 0.6803 0.5871 0.7091 0.7150
+/- % Bnd 1 + 1.92 2 + 0.46 + 2.841 3 1 +11.651 1 + 2.35 2 + 3.97 + + + + + + + + + + + + 1.44 7.27 4.231' 1.44 1.37 5.72 7.77t* 1.15 2.15 5.69 3.581 2.97 2 2 1 1 1 1 1 1 1 2 1 1
Table 6.2: Evaluation of score-independent document-level experiments E3(fl and Ey(fl for combination of fields f, which result in at least one decision boundary for each t decision denotes the The the that tested topic set. applies most mechanism symbol for a statistically significant number of queries, according appropriate retrieval approach MAP between difference denotes the * the the The that of to the sign test. symbol decision mechanism and that of the most effective retrieval approach is statistically `Bnd' The the Wilcoxon's test. to column reports rank signed significant, according for boundaries decision each case. number of
6.3.1.2
Example
for document-level
experiments
f'or decision Bayesian boundaries the decision mechanism the of (row in 4 TaX3(b) hp2004, the performs very well the topic set experiment where lie t decision of ble 6.2). The mechanism selectively applies either a combination Figure 6.2 illustrates field-based weighting (PB2FU). length URL PB2F the path with model or a coin-
138
6.3 Evaluation
experiments
12 10 8 6 4 2 0 td2003 0 td2004 hp2003 hp2004 np2003 np2004
F, 3 (b)
dd(b)
EV(at)
Figure 6.1: Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach from %' `+/Table 6.2. of column bination (DLHFA). The Absorbing the model with likelihoods density the figure to in the the posterior of estimated correspond curves (top E3(b) for P(DLHFA)"P(EIDLHFA) the P(PB2FU)"P(EIPB2FU) experiment and DLHF field-based the model of
diagram), and for the experiment EV(b) (bottom diagram), respectively. field be it the Figure 6.2, that diagram in of From the top combination seen can (PB2FU) is documents URL from the when more effective of retrieval and evidence least documents E3(b), one with at which considers the outcome of the experiment When the 300551. the lower is than body, experiment of in outcome their term query DLHF field-based the weighting model F-3(b)is higher than 300551, the combination of bottom hand, On the the (DLHFA) is Model other Absorbing effective. more the with the the is DLHFA experiment of outcome diagram indicates that more effective when Ev(b) the is PB2FU of outcome d(b) is lower than 3782.337, while more effective when type the the for this of that, This example, 3782.337. particular higher is than suggests least documents one of or at all is, contain the that considered whether experiment, for the the which values important has outcome of range on terms, effect the query an is effective. a retrieval approach
139
6.3 Evaluation
exoerimPntc
1.1e-06 1.Oe-06 9.0e-07 .0e-07 7.0e-07 6.0e-07 5.0e-07 4.0e-07 3.0e-07 2.0e-07 I. Oe-07 300551 O. Oe+00 Oe+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05 8e+05 .. .. . P(PB2FU). P(3(b)jPB2FU) P(DLHFA) "P(Eg(b) JDLHFA) _, S. '
outcomes of E3(b)
1.20e-04 r
/ / 7 \ \
P(PB2FU). P(Ey(b)IPB2FU) P(DLHFA). P(Ev(b)IDLHFA)

1
l t
1.00e-04
8.00e-05
6.00e-05
4.00e-05
2.00e-05
\ ` \
3782.33 n nnP, nn v VV4 Oe+00
'
1e+03
2e+03
3e+03
4e+03
5e+03
6e+03
7e+03
8e+03
9e+03
FV(b) outcomes of
Figure 6.2: Posterior likelihoods of the experiments 3(b) and -y(b) for the topic set hp2004, where one of the retrieval approaches PB2FU or DLHFA is selected to be for applied each query.
140
6.3 Evaluation
experiments
6.3.1.3
Discussion
The evaluation results for the score-independent document-level have experiments shown that the experiments 4(b) and EV(a, result, on average, in a lower number of decision t) boundaries than E3(b) (Table 6.2). For example, the experiment ev(at) in results one decision boundary for all the tested tasks, apart from hp2004, for which there are two
decision boundaries.
The experiments &(b) and 4(at) count only the documents, in which all the query terms appear in the body, or in a combination of the anchor text and title fields. This provides a strong indication that the documents are related to the query. Therefore the used evidence by the experiments provides a better indication of how broad or specific When is. documents in there the terms a query are many all query a particular with field or a combination fields, from hyperlink then the of evidence structure or the URL Web documents be documents detect documents higher to of can of used quality, or that are likely to be home pages of relevant Web sites. On the other hand, it is not documents that there are containing all the terms of a specific query. expected many For this reason, the document-level experiments, which consider documents with all the query terms, are more appropriate for selective Web IR.
6.3.2
Aggregate-level
experiments
This section discusses the evaluation of the decision mechanism that employs the scoreindependent aggregate-level experiments, described in Section 5.3.2. The aggregates are defined as the documents that belong to the same domain, or the documents that are first identify the The directory. in all aggregates the considered experiments same stored feature Next, documents. they Retq of extract a in the sample of the set of retrieved Section 5.3.2, in As discussed the distribution employed the the size aggregates. of deviation (avg); the of the features are: the average of the aggregates' size standard (lrg). The large (std); aggregate-level the aggregates of number and aggregates' size lie in least terms the t documents query or all one, at either with experiments consider documents. title the text of body, or in a combination of the anchor and domain for the describes the evaluation results <<iid Sections 6.3.2.1 and 6.3.2.2 ilbist 6.3.2.3 Section an napresents directory aggregate-level experiments, respectively. Finally, for task. a particular tive example of applying the aggregate-level experiments
141
6.3 Evaluation
experiments
Section 6.3.2.4 discusses the evaluation of the aggregate-level experiments. 6.3.2.1 Evaluation results for domain aggregate-level experiments
of the experiments E that employ domain aggi-cgates, and result in at least one decision boundary for all the tested tasks'. Figure 6.3 summarises the results from column `+/- %' of Table 6.3 in a histogram. The results indicate that only the experiments (rows '3(at), (rows 1-6), 13t (b), ) avg(do,,,,, avg(dom) (rows 19-24) and F-3(b), 18), 4(b), (rows 31-36), in improvements in result lrg(dom) std(dom) MAP for all the tested tasks. On the other hand, the experiments V(b), (rows ag(dom) 7-12), t 3(at), (rows 25-30), and EV(b), (rows 37-42), in decrease in result a lrg(dom) std(dom) MAP for some of the tested tasks. std(do1n) (rows 19-24) results in a decision mechanism with only 1 decision boundary for all identifies decision boundary the tested tasks. The experiment F, one a(at),std(dol) also for four out of the six tested tasks (rows 25-30). The rest of the experiments shown in Table 6.3 result in a variable number of decision boundaries. experiment For example, the IEV(b), decision boundaries for in least two the tested at all results avg(dom) topic sets (rows 7-12), while the experiment'3(b), lrg(dom) results in either one, two, or four decision boundaries (rows 31-36).
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 Retrieval approaches DLHFP I(ne)C2FU I(ne)C2FP PL2F BM25FA DLHFU DLHFA PB2FU I(ne)C2FA PL2FP I(ne)C2FA PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 F-3(b), avg(dom)
F-3(b), avg(dom) E3(b), avg(dom) F-3(b), avg(dom)
Table 6.3 presents the evaluation
Regarding
the number of decision boundaries, only the experiment
4(b),
F-3(b),
avg(dom) E3 b dom ,a EV(b), avg(dom) EV(b), avg(dom) y(b),avg(dom) y(b),avg(dom) EV(b), avg(dom) Ey b avg(dom)
,
MAP 0.1482 0.1347 0.6732 0.6202 0.6929 0.7187 0.1429 0.1386 0.6593 0.6054 0.7031 0.7005 0.1464 0.1316 0.6895 0.6215
+/- % + 1.86 + 3.06 + 1.08 +11.651 + 1.21 + 3.50 + + + + 1.79 6.041 1.01 8.98 2.70 0.88
Bnd 1 3 3 1 1 2 3 3 2 2 2 2 2 1 4 2
3(at),
3(at),
avg(dom)
avg(dom)
3(at), avg(dom) E3 dom avg at ,
+ 0.62 + 0.69 + 3.531 +11.881'
continuea on next
decision boundary least identify do one at not which results of the experiments, evaluation (page B. 244) (page 242), 4 B. (page 3 240), 2 Tables B. in and of for all the tested tasks, are given B. Appendix 'The
142
6.3 Evaluation
experiments
rnntinn,
crn..
r,,....
..
----
Row 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Task np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004
Retrieval approaches PL2FP I(ne)C2FA PB2F I(ne)C2FA I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA
Baseline 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
E E3(at), avg(dom) E3 at ,av dom Ey(b),std(dom) V(b), std(dom) EV(b), std(dom) Ey(b),std(dom) y(b),std(dom) y b std dom
,
MAP 0.6943 0.7298 0.1525 0.1353 0.6682 0.5622 0.7230 0.7184 0.1426 0.1347 0.6746 0.5871 0.7055 0.7088 0.1463 0.1399 0.6719 0.6064 0.6895 0.7125 0.1534 0.1378 0.6881 0.5483 0.6880 0.6959
+1- % Bnd 1 +1.42 3 +5.10 +4.81 +3.52 +0.33 +1.21 +5.61 +3.46 -2.00 +3.06 +1.29 +5.69 +3.051 +2.07 +0.55 +7.041 +0.89 +9.161 +0.72 +2.61 +5.43 +5.43 +3.32t1 -1.30 +0.50 +0.22 1 1 1 1 1 1 1 1 2 2 1 1 4 2 2 1 1 2 2 5 1 1 2 2
3(at), atd(dom)
E3(at), std(dom)
3(at), std(dom) 3(at), std(dom) E3(at), 8td(dom) 3 dom std at , E3(b),Irg(dom) 3(b),lrg(dom)
3(b), E3(b), E3(b),
Irg(dom)
lrg(dom) lrg(dom)
3 6 lr dom , Ev(b),Irg(dom) y(b),Irg(dom) y(b),lrg(dom) d(b),Irg(dom) Ey(b),Irg(dom) Ey b Ir dom ,
Table 6.3: Evaluation of score-independent aggregate-level experiments with domains, t for The boundary least decision in each tested topic set. symbol one at which result denotes that the decision mechanism applies the most appropriate retrieval approach for a statistically significant number of queries, according to the sign test. The symbol * denotes that the difference between the MAP of the decision mechanism and that of Wilcoxon's is to the most effective retrieval approach statistically significant, according signed rank test. is obtained for the task hp2004, where the '3(at), the decision Bayesian selectively and experiment mechanism employs , avg(dom) As by denoted *, (row 6.3). in Table 16 this DLHFA PB2FU, or applies either The highest improvement in MAP improvement in MAP is statistically Wilcoxon's to significant according signed rank decision that the MAP the same For the employs mechanism the task test. of np2004, in best by the higher is that than is run 0.7298, performing obtained which experiment 4.6, 67). (0.7232 Table 11 in Web 2004 track TREC of page row the same task of the
143
6.3 Evaluation
experiments
15
10
0
' V(b), avg(dom) -2(b), R(at), avg(dom) avg(dom) 4(b), std(dom) 3(at), R(b), std(dom)
lrg(dom) `d(b), lrg(dom)
Figure 6.3: Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach from %' `+/column of Table 6.3. 6.3.2.2 Evaluation for directory
results
aggregate-level
experiments
The aggregates can also be defined in terms of documents that are stored in the same directory, as described in Section 5.3.2. Table 6.4 displays the evaluation results for the directory aggregate-level experiments, which identify at least on decision boundary for each tested task'. For example, row 1 in the table gives the evaluation results obtained decision Bayesian the mechanism selectively applies either the combination of when the field-based weighting model I(ne)C2F with the URL path length (I(ne)C2FU), or the combination (DLHFP), PageRank DLHF field-based the with weighting model of for the task td2003. The resulting MAP is 0.1483, which corresponds to a relative im(0.1455). individual MAP the approach of most effective provement of +1.92% over the Furthermore, a statistically the decision mechanism applies the most effective retrieval approach for t. Figure by indicated from 6.4 td2003, as significant number of queries
%' form in Table 6.4 from `+/the the of of a column results provides an overview of histogram.
'The evaluation results of the directory aggregate-level experiments, which do not identify at least (page (page B. 247). 245), 6 B. 5 Tables in for decision boundary tasks, and tested the are given all one B. 7 (page 249) of Appendix B.
144
6.3 Evaluation
experiments
The evaluation results show that only the experiments, which compute the average size of aggregates, result in improvements for all the tested tasks (rows 1-18). The directory aggregate-level experiments, which compute either the standard deviation of the aggregates' sizes, or the number of large aggregates, do not always result in in retrieval effectiveness for all the tested tasks (rows 19-48). The column `Bnd' of Table 6.4 shows that the experiments, which estimate 1he average size of the directory aggregate size (rows 1-18), identify a variable number of decision boundaries for each task. For example, row 15 shows that the decision mechanism, which employs the experiment Ev(at), to select either DLHFU or BM25FA avg(dir) on a per-query basis, has seven decision boundaries. The experiments EV(b), (row", std(dir) 25-30), and F-3(at), (rows 31-36), result in either one or two decision boundaries in std(dir) most of the tested cases. This suggests that the standard deviation is a robust feature of the aggregate size distribution, and it is in agreement with the obtained results for the domain aggregate-level experiments, as discussed in Section 6.3.2.1. The most notable improvements consistent improvements
in MAP are shown in row 4, where the MAP obtained by the Bayesian decision mechanism for the task hp2004 represents an irriprovement of 15.93% over the MAP of the most effective retrieval approach. This improvement is statistically significant according to Wilcoxon's signed rank test, as denoted by *. In the case of the task td2003, when the experiment 4(at), is emavg(dir) ployed in order to select between I(ne)C2FU, or DLHFP (row 13), the resulting MAP is 0.1613 (+10.86% relative improvement compared to the MAP of the most effective retrieval approach). Regarding the task np2004, the MAP achieved by the decision is 0.7261 (row 30 in Table 6.4),
y(b),std(dir)i best is higher in that the than the corresponding task of performing run which slightly (0.7232 from in 4.6 11 Table Web TREC 2004 the track row on page 67). of
Row 1 2 3 4 5 6 7 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 Retrieval approaches DLHFP I(ne)C2FU I(ne)C2FP PL2F BM25FA DLHFU DLHFA PB2FU I(ne)C2FA PL2FP PB2F I(ne)C2FA I(ne)C2FU DLHFP Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 E
E3(b), avg(dir) E3(b), avg(dir) E3(b), avg(dir) E3(b), avg(dir)
mechanism, which employs the experiment
E3(b),
E3 b
avg(dir)
, at' dir
MAP 0.1483 0.1336 0.6742 0.6440 0.7045 0.6975 0.1497
+/- % + 1.92 + 2.22 + 1.23 +15.93t' + 2.91 + 0.45 + 2.89
Bnd 1 2 4 2 4 2 1
E3
at ,at dir
145
6.3 Evaluation
experiments
continued
from previous
Row 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
page
Task td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004
Retrieval approaches PL2F I(ne)C2FP DLHFU BM25FA PB2FU DLHFA PL2FP I(ne)C2FA PB2F I(ne)C2FA I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA
Baseline 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
E E3(at), avg(dir) E3(at),

avg(dir)
E3(at),avg(dir) E3(at),avg(dir) E3(at),avg dir) EV(at),avg(dir) Ey(at),avg(dir) y(at), avg(dir) EY(at),avg(dir) EV(at),avg(dir) Ey at ,av dir 3(b),
std(dir) E3(b), E3(b), E3(b),
E3(b),
MAP 0.1411 0.6855 0.5903 0.7100 0.7216 0.1613 0.1338 0.6836 0.6279 0.6861 0.7027 0.1422 0.1318 0.6699 0.6040 0.6934 0.7104 0.1565 0.1359 0.6710 0.5517 0.7070 0.7261 0.1284 0.1321 0.6849 0.5898 0.7111 0.7040 0.1547 0.1295 0.6712 0.6042 0.6872 0.7132 0.1469 0.1300 0.6768 0.5550 0.6859 0.6910
+/- % Bnd 3 + 7.96 + 2.931 4 4 + 6.26 3 + 3.711 4 + 3.92 +10.86 + 2.37 + 2.64 +13.03t' + 0.22 + 1.20 + + + + + 2.30 0.84 0.591 8.73 1.29 2.30 2 1 7 3 5 1 1 2 3 2 2 4 2 3 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 2 3 1 6 1 1 1
std(dir) std(dir) std(dir)

std(dir)
b std dir ,
EV(b), std(dir) EY(b), std(dir) EV(b), std(dir) EV(b), std(dir) V(b),std(dir)

Ey b std dir ,
+ 7.56 + 3.98 + 0.75 0.68 + 3.27 + 4.57 -12.00 + 1.07 + 2.84 + 6.17 + 3.871 + 1.38 + 6.32 0.92 + 0.78 + 8.77 + 0.38 + 2.71 + 0.96 0.54 + 1.62 0.09 + 0.19 0.49 -
E3(at), std(dir) 3(at), std(dir) E3(at),std(dir) C3(at),std(dir) E3(at), std(dir) E3 dir std at ,
E3(b), F-3(b), F-3(b), F-3(b), F-3(b), lrg(dir) Irg(dir) Irg(dir) Irg(dir) Irg(dir)
E3 b Ir dir , E3(at),irg(dir) E3(at),Irg(dir) E3(at),lrg(dir) E3(at),Irg(dir) 3(at), lrg(dir) E3 dir at ,Ir
directoEvaluation Table 6.4: of score-independent aggregate-level experiments with The for boundary topic tested decision least in set. symbol each one at ries, which result t denotes that the decision mechanism applies the most appropriate retrieval approach The test. the to symbol for a statistically significant number of queries, according sign * denotes that the difference between the MAP of the decision mechanism and that of Wilcoxmn's to is statistically significant. according the most effective retrieval approach signed rank test.
146
6.3 Evaluation
experiments
25 20 15 10 5 0 -5 -10
6B(at), -3(b), avg(dir) avg(dir) 4(at), 3(b), avg(dir) std(dir) 4(b), E3(at), std(dir) std(dir) 6B(b), 3(at), lrg(dir) lrg(dir)
td2003 L1 td2004 hp2003 hp2004 np2003 np2004 LL Q0,
Figure 6.4: Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach from %' `+/Table 6.4. column of
6.3.2.3
Example
for domain
and directory
aggregate-level
experiments
This section provides an illustrative anism with domain and directory
example of applying the Bayesian decision mechaggregate-level experiments. The example is based
have 3(b), E3(b), in the considerable resulted and which on experiments ) avg(dir), avg(da,,,, improvements in MAP for the task hp2004, with either one or two decision boundaries (see row 4 in Table 6.3, and row 4 in Table 6.4, respectively). IPB2FU) (top Figure 6.5 displays the posterior likelihoods P(PB2FU)"P(E3b, avg(dorn) jDLHFA) (bottom diagram), which have been diagram) and P(DLHFA)"P(-3b, avg(dir) for decision Bayesian the topic during the the set mechanism of application estimated The decision mechanism selectively applies either PB2FU, or DLHFA on a field-based denotes the PB2FU basis. the of weighting model combination per-query PB2F with evidence from the URL path length. DLHFA denotes the combinatioii of hp2004.
Model. Absorbing DLHF the field-based the with weighting model

The top diagram shows that the retrieval approach PB2FU is more effective for for is DLHFA E3b, lower the more effective the while experiment outcomes of avg(dmn), diagraiii boundary, decision is the 3b, There higher and the one outcomes of ). avg(do,,,
147
6.3 Evaluation
experiments
suggests that there is a clear separation between the curves corresponding to the two posterior likelihoods. The outcome values of the experiment'3b, approximately avg(dm,. L) fall between 0 and 120.
When the aggregates are based on directories, are two decision boundaries, and the separation the bottom diagram shows that there likelihoods is
between the two posterior
E3(b), However, the outcome values avg(dom). E3(b), lower than those of the experiment of the experiment are considerably avg(dir) -3b, they fall between 1.5 6. This and is approximately two orders and range avg(dom), of magnitude smaller than the range of outcome values of the experiment the estimated densities of the posterior likelihoods As a consequence, to overlap, resulting E3(b), avg(do771) ' likely are more
less clear than in the case of the experiment
in a higher number experiments.
of decision boundaries In this particular
than the one resulting example, it is preferable
from domain
aggregate-level
to employ the domain aggregates, because the corresponding experiment results in a lower number of decision boundaries. 6.3.2.4 Discussion
This section provides a discussion related to the evaluation of the domain and directory have Sections been in 6.3.2.1 and 6.3.2.2, evaluated aggregate-level experiments, which respectively. There are three main points of discussion related to: the differences between the domain and directory aggregates; the effectiveness of the three employed features of the aggregate size distribution (average size, standard deviation, and number fields large the used to compute the aggregate-level experiments. of aggregates); Domain directory aggregates The evaluation results for the domain ag-
versus
least decision identify indicate the that one at seven experiments, which out of gregates boundary for all the tested tasks, there are four experiments that result in improve(rows 1-6,13-18,19-24, for tasks the tested ments all the directory Regarding in Table 6.3). 31-36 and improvements in in that three there result experiments are aggregates, MAP for all the tested tasks (rows 1-6,7-12, and 13-18 in Table 6.4), out of the eight for This boundary decision tasks. the tested least identify that all one at experiments better indication domain of aggregates are more robust, and provide a suggests that the directory hand, On the the in documents distribution other the aggregates. of related distribution be their to smaller, and aggregates are expected depends on the particular
148
6.3 Evaluation
experiments
0.008
0.007
0.006
0.005
0.004
0.003
0.002
0.001
011WIIIi 0
20
40
60
80 avg(dom)
100
120 )
outcomes of'3(b),
0.3
IPB2FU) P(PB2FU)"P(E3(b), avg(dir) P(DLHFA)"P(E3(b), avy(dir)IDLHFA)
------
0.25
i'
S `
0.2
i i
0.15
i 0.1
. ,
0.05 3. 4.575 3 3.5 4 4.5 5 5.5 6
01.5
Figure
2.5
1a(b), outcomes of avg(dir)

6.5: Posterior likelihoods of the score-independent aggregate-level experiments hp2004, the for E topic the '2b, of retrieval apone where set and b,avg(dir), avg(dom) likefor The is DLHFA PB2FU posterior each query. selectively applied or proaches lihoods for the domain and the directory based aggregates are presented on top, and the bottom diagram, respectively.
149
6.3 Evaluation
of score- independent
experiments
structure
of Web sites. Therefore, the domain aggregates provide a better indicat ion than the directory aggregates that a query is broad, and that the retrieval effectiveness may be enhanced by employing evidence from the hyperlink structure of documents.
or the URLs.
Features which of the aggregate size distribution The aggregate-level size distribution experiments. result in a agthe
compute
the standard
deviation
of the aggregate
relatively gregates, experiment
low number
of decision boundaries
(rows 19-30 in Table 6.3 for domain aggregates). In particular,
and rows 25-36 in Table ''y(b),
6.4 for directory
is the only aggregate-level experiment, identifies only which std(dom) one decision boundary and results in improvements in MAP for each of the tested tasks (rows 19-24 in Table 6.3). This suggests that the standard distribution size retrieval is effective in separating deviation of the aggregate
the queries for which each of the employed that coin-
approaches
is more effective.
On the other hand, the experiments large identify to tend aggregates of for the different
pute the average size, or the number more variable
higher a and
decision boundaries number of
topic sets (see rows 7-12 in the standard devia-
Table 6.3, and rows 43-48 in Table 6.4). For this reason, estimating tion of the aggregate size distribution approach is more effective, better indication a provides
about which retrieval E
hence it is more appropriate and
for defining experiments
for selective Web IR.
Document
fields for
aggregate-level
experiments
Regarding the domain aggregate-
the evaluation results from Table 6.3 show that improvements in (rows MAP for all the tested topics are obtained with the experiments E3(b), avg(dom) (rows E3(b), (rows 3119-24), EV(b), (rows 13-18), E](at), 1-6), and lrg(dom) std(dom) avg(dom) for documents E3(at), the Among 36). which these experiments, only considers avg(dom) be field. This title the text, in can explained or the query terms appear either the anchor is likely body in term their documents because the number of containing a particular level experiments, in their term the documents that either higher be same the than contain to number of documents documents body Employing the more provides title. their of text or anchor from which to generate the domain aggregates, and, therefore, a more representative Ey(b), the experiment distribution only std(do7n) documents Considering terms. the documents with that query all contain considers Similarly, domain aggregate sizes. of
150
6.4 Evaluation
of score-dependent
experiments
all the query terms in a particular
field may result in a less representative distribution domain of aggregate sizes. The results for the directory aggregate-level experiments (Table 6.4) do not exhibit any particular trend regarding the document fields.
6.3.3
Overall,
Conclusions
this section has evaluated experiments, the proposed in the context which of a Bayesian decision mechanism The
the score-independent results suggest that
have been introduced experiments retrieval
in Section 5.3.
score-independent appropriate
E allow the decision on a per-query
mechanism basis.
to distinguish
and apply
approaches
Both the document-level level experiments,
(Section 6.3.1), as well as the aggregateexperiments (Secdeviation the the aggregate sizes which compute standard of
tions 6.3.2.1 and 6.3.2.2), result in a low number of decision boundaries. This suggests that they can capture a simple relation between the effectiveness of the different retrieval approaches. documents the terms that query all with consider experiments tend to result in a lower number of decision boundaries or thresholds, because the document in terms the provides stronger a particular part of a query occurrence of all The document-level Therefore, the document. the the the topic of experiment outcome of evidence about is computed from a more cohesive set of documents (Section 6.3.1.3) The domain aggregates are more stable than the directory aggregates, because the Web dependent is the directory of structure distribution on the more aggregates of size (Section 6.3.2.4). sites
6.4
Evaluation
of score-dependent
experiments
that the e5tiscore-dependent experiments This section focuses on the evaluation of between divergence by the hyperlink computing the structure, the of usefulness mate distribution, S,,, first The Section 5.4. in described score two score distributions, as The by documents second model. a weighting the to assigned scores of corresponds highly to documents that favour other in point formed to is distribution order score U, distribution tested: for the definitions are different Two second documents. a, scored its to document by score; original documents to added are a the pointed scores of where
151
6.4 Evaluation
of score-dependent
experiments
U7, and where the sum of the scores of documents pointed to by a document replaces its original score. The scores in both distributions are normalised between 0 and 1. The score distribution S, can be defined with respect to any of the retrieval apdescribed in Chapter 4. In the context of the evaluation of the experiments, proaches two field-based weighting models are employed, namely PL2F and I(ne)C2F. in order to test the impact of different weighting models on the effectiveness of the experiments. These two field-based models are statistically independent, as shown in Chapter 4. The weighting models PL2F and I(ne)C2F are used independently of the retrieval at>for final document ranking. In this way. the definition of the the employed proaches does depend experiments not on which retrieval approaches are considered by the (I(,cision mechanism. When the weighting PL2F is employed, then the score-dependent expui imodel Un), are denoted define the usefulness of the hyperlink structure as L(S, ments, which
F-v(f), by F-](f), depending documents least at one or and on whether with L(su)p1 L(su)pl, In field f in the same way, when terms the the are considered, respectively. all query the weighting model I(ne)C2F is used to define S, the score-dependent experiments, U; denoted by L(S.,, hyperlink define the the are structure as usefulness of which "), 3(f), L(sU%, and F-y(f),L(sU')in n After describing the setting of the distribution Sn,, this section presents the evaldiscussion for the and score-dependent experiments, and closes with a uation results some concluding remarks.
6.4.1
Setting
the score distribution
S7,,
PL2F, field-based by defined S, the is models The score distribution weighting using E frfehyper-parameters: has term the Each I(ne)C2F. the six weighting models of and body, for title text, the and anchor quency normalisation parameters cb, ca,, and ct, field three fields, respectively, and the weights wb, IQ, and wt. corresponding In order to define the score distribution S, independently of the retrieval approaches following in the hyper-parameters documents, the final set are for the of ranking used length the to normalisation the both For related parameters models, weighting way. fields body title the The 1. are and fields of weights the Ct Ca = are set cb = = of 0. field is to text the The wa 1. set equal = anchor to of weight Wt = wb = set equal both to the the text of score contribute Indeed, if wa > 0, then the anchor would
152
6.4 Evaluation
of score-dependent
experiments
source and the destination documents. Therefore, the estimated distribution U, would ' incorporate the effect of the anchor text twice. For the evaluation of the score-dependent experiments, the described setting of the parameters will be referred to as the default setting. The remainder of this section investigates the impact of the parameter setting on the distribution of the experiment outcome values.
5.56+00 5.Oeti00 3.5e+00 4.Se+00 4.0e+00 3.5e+00 3.Oe+W 2.Su00 2.0e+00 1.5e+00 I. 0e+00 5.00-01 OC+00 O. 3a0I l. O"A1 0.00+W le01 b 3.Oe+00 2.5e+00 2.Oe+00 I. 5e+00 I. Oe+00 4.oa+oo
4C-0I
4e-U 1
5"
5e. 01
6"1
66-01
7a01
76-01
or-UI
it-ui
outcomes of E3(b), L(SU)

6.Oe+00 optimised setting --^default setting -----5.Oa+00 3.0e+00 4.Oa+00 2.Se+OO 4.Oe+00 3.5eu00
outcomes of EY(b), L(SU)
3.Oe+00 I
I 9
2.Oe+00 I. 5e+00
2.Oe+00 I. Oe+00 1.00+00 S.Oe-01 Oe+oo O.
0.00+00 ,
01 5o-01 Se-02 10-01 2o-01 2c-01 2e-01 30-01 3e-01 4e-01 40-01 5e. outcomes of E3(,; ), t(su)
Oe+OO
zbol outcomes
3o-0l 4. -01 of Ev(a*),L(su)
ia01
6c.01
Figure 6.6: Density estimates of the usefulnessof the hyperlink structure experiments, is The default the shown to setting used. parameter whether an optimised or according density estimates are obtained for the topic set td2003 and the distribution S,ti is based PL2F. the on weighting model Figure 6.6 displays the density estimates for the experiment outcome values, which S,, is the distribution ), L(S,,, Ute, based the weighting generated with score when on are the default the described or of parameters, PL2F, the setting above and either model Section in 4.6.2 discussed td2003, task for the as with retrieval used setting parameter (first row of Table A. 11 on page 235). The differences in the estimated density curves suggest that the parameter setting does not have a strong impact on the obtained outcomes, when the experiment considers
153
6.4 Evaluation
of score-dependent
experiments
the documents with at least one (top left diagram), or all the query terms in their body (top right diagram). On the other hand, the difference is greater when the considered documents contain the query terms in their anchor text, or their title fields (bottom left diagrams). and right This is explained by the fact that the optimised setting weights the anchor text of documents, while the default setting uses a zero weight for the anchor
The bottom diagrams in Figure 6.6 show that the outcome values of the experiments setting of the parameters, in the
text.
E3(at), L(SU)pt and F-y(at), lower for the optimised are L(SU)PI than the ones obtained setting, S, for the default setting. This
is due to the fact that, incorporated
optimised distribution computing
the effect of the hyperlinks setting
is already
in the score
justifies it and
the weight of the anchor text equal to zero f'o structure. Similar results are obtained when t"),
the usefulness of the hyperlink structure
the usefulness of the hyperlink
is represented
divergence L(S, the with
distribution the as well as when score Therefore, the default setting
S,,, is based on the weighting
model I(ne)C2F. weighting
hyper-parameters the of
field-based the of
is models appropriate
for computing
the outcome of the score-dependent
experiments.
6.4.2
Evaluation hyperlink
results structure
of experiments L(S, U, )
based
on the
usefulness
of
This section presents the evaluation results of the experiments 3(f), L(su) and Ey(f), L(.yv) , ). distriUn, L(S, The hyperlink the the score structure usefulness of which estimate bution S,,, is formed by using either the field-based model PL2F or I(ne)C2F, with the default parameter setting described in Section 6.4.1. The score distribution U, is generdocument its by documents to to by the original score a pointed scores of ated adding (Equations (5.15) and (5.16) on page 118). As in the case of the score-independent field body for the (Section 6.3), the experiments are evaluated of either experiments (at). fields For title text (b), the example, documents and anchor or a combination of form I(ne)C2F field-based to the model weighting the experiment -3(b), employs L(su)ti in least documents terni S,,,, distribution query one with at the score and considers L(S,, Un). divergence Jensen-Shannon their body, in order to compute the symmetric define I(ne)C2F 5',; (using PL2F to All the different combinations of options or either fields (b) title text body the the 3 `d: and anchor or either using and or using either (at)) result in eight different experiments.
154
6.4 Evaluation
of score-dependent
experiments
results in Table 6.5 show that all the eight experiments, which estimate the usefulness of the hyperlink structure L(S,, Un), identify at least one deboundary for all the tested tasks. However, the cision number of identified decision boundaries varies. For example, the experiment 'y(at), identifies decision one L(SU)pj boundary for the tasks hp2004, np2003, and (rows 22-24), and at least three np2004 decision boundaries for the tasks td2003, td2004, and hp2003 rows (19-21). Regarding the obtained MAP by the decision mechanism, it can be seen that only the experiments E3(at),L(SU)p1(rows 13-18) and E3(at), (37-42) in improveresult L(SU)jn for ments all the tested tasks. Furthermore, when the decision mechanism selectivcly applies either PB2F or I(ne)C2FA for the task np2004 (row 48), the obtained MAP is 0.7468, which is higher than the MAP of the best performing run in the corresponding task of the TREC 2004 Web track (0.7232 from row 11 in Table 4.6 on page 67).
In this case, the decision for a statistically mechanism number applies the most appropriate t. denoted by of queries, as retrieval approach When the decision
The evaluation
significant
Ey(b), EV(at), for the the task employs mechanism experiments or np20011, L(SU)pl L(SU)p, for hp2003, difference between MAP task the the the or the experiment Ey(b), of L(sU)in the decision mechanism and that of the most effective baseline is statistically significant, (rows denoted * by 12,24, and 33 of Table 6.5, respectively). as Figure 6.7 provides an overview of the differences between the MAP of the decision mechanism and that %' in `+/Table 6.5. the column of approach, as reported of most effective retrieval
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 Retrieval approaches DLHFP I(ne)C2FU I(ne)C2FP PL2F BM25FA DLHFU DLHFA PB2FU I(ne)C2FA PL2FP I(ne)C2FA PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA
continuea
Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660
on next p
E3(6),L(SU)p, E3(6), c,(SU)pj 3(6), L(SU)p,

3(6), L(Su)p,
3(b),L(SU)pl E3(b), L(su) , EV(b), L(su)pj Ey(b),L(su)p, y(b), L(su)p, Ee(b),L(su)p, EV(b), L(su)p, EV(b), L(SU) , 3(at), L(SU)pl E3(at), L(su)pz 3(at), L(su) i
MAP 0.1432 0.1433 0.6670 0.5612 0.6899 0.7312 0.1607 0.1287 0.6943 0.6132 0.7213 0.7373 0.1484 0.1337 0.6796
+/- % Bnd 2 1.58 + 9.641 3 3 + 0.15 3 + 1.03 1 + 0.77 2 + 5.30 +10.45 1.50 + 4.251 +10.39 + 5.361 + 6.18' + 1.99 + 2.30 + 2.04 3 1 5 3 4 3 3 3 5
155
6.4 Evaluation
of score-dependent
experiments
continued
from nrevinns
narre
Row 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
Task hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 W2004 hp2003 hp2004 np2003 np2004
Retrieval approaches PB2FU DLHFA PL2FP I(ne)C2FA PB2F I(ne)C2FA I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA
Baseline 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
vE
E3(at),L(SU) 1 p 3(at) L(SU) , , p E3(at),L(sv) , y(at) L(SV) , , p y(at) L(SU) 1 , p Ey(at) L(SU) , , p Ey(at) L(SV) i p , d(at), t(SV) p, V(at),L(SU) , 3(b),L(SU)i E3(b),L(su);,, L(SU);,, E3(b), L(SU);,, E3(b), L(SU),,, F-3(b), L(SU)in y(b),L(sv);,, y(b),L(SU);,, V(b),L(sv);,, EV(b), L(su);,, EY(b), L(SU);,, EV b su ;n ,L E3(at), L(SU);,, E3(at),L(SU);,, E3(at),L(SU),,, 3(at), L(SU);,, 3(at), L(SU);,, E3 at ,L SU ; y(at), L(SU); n EV(at),L(SU);,, V(at),L(SU);,, EV(at),L(SU);,, EV(at),L(SU);,, EV(at),L(SU)in
E3(b),
MAP 0.5877 0.6946 0.7382 0.1571 0.1322 0.6589 0.5707 0.7013 0.7460 0.1404 0.1405 0.6600 0.5628 0.6944 0.7432 0.1561 0.1289 0.7038 0.6038 0.7149 0.7368 0.1500 0.1367 0.6772 0.5933 0.7017 0.7231 0.1445 0.1350 0.6608 0.5607 0.6904 0 7468 .
+5.80 +1.46 +6.31 +7.97 +1.15 -1.10 +2.74 +2.44 +7.43t' -3.51 +7.50 -0.90 +1.31 +1.431 +7.03 +7.29 -1.40 +5.68t' +8.69 +4.431 +6.11 +3.09 +4.591 +1.68 +6.80 +2.50 +4.13 -0.69 +3.29 -0.78 +0.94 +0.85 +7.551
Bnd 1 3 2 4 5 3 1 1 1 2 2 1 3 3 4 1 1 5 4 5 3 3 2 4 2 3 3 2 3 4 1 3 1
Table 6.5: Evaluation of score-dependent experiments based on estimating the usefulboundary decision least in Un), L(S,, hyperlink one at which result structure ness of the t denotes that the decision mechanism applies for each tested topic set. The symbol the most appropriate retrieval approach for a statistically significant number of queries, MAP between difference denotes the * the that The of test. the to symbol sign according is the decision statistically that approach retrieval effective the most of and mechanism test. Wilcoxon's to rank signed significant, according
156
6.4 Evaluation
of score-dependent
experiments
20 15 10
+
0
5
F-V(b), 3(b), L(SU)pi L(SU 4(at), pi 3(at), L(SU)pl L(S21(b), p, 6d(b), L(SU)in L(SU in 3(at), L(SU)in EV(at), L(SC-)i
Figure 6.7: Histogram summarising the relative differences between the MAP of the decision mechanism and that of the most effective individual retrieval approach froin %' Table 6.5. '+/of column
6.4.3
Evaluation hyperlink
results structure
of experiments L(Sn, Un)
based
on the
usefulness
of
The current section presents the evaluation results for the score-dependent experiment s ', ). distribuU, The L(S,, hyperlink the score that compute the usefulness of structure field-based by documents the to weighting S, to the tion assigned scores corresponds ti document Un, distribution the According the to I(ne)C2F. of a PL2F score or models (5.17) (Equation it documents to the the points the to of scores of sum corresponds depend that the on the 6.6 Table experiments of 118). evaluation presents on page the divergence L(Sn, U,, )'. differences histogram the 6.8 Figure relative of presents a between the MAP of the decision mechanism and the most effective retrieval approach. %' 6.6. Table `+/in of column as reported in that is a consistently results The evaluation results show that there no experiment 6.6). in Table (column `Bnd' different tasks for the low number of decision boundaries E3(at), there the decision experiment For example, when the L(SP); n . mechanism employs
boundary decision least identify do one at 'The evaluation results of the experiments, which not B. Appendix (page 255) B. 11 (page 254), of and for all the tested tasks, are given in Tables B. 10
157
6.4 Evaluation
of score-dependent
experiments
for the named page finding tasks (rows 23-24), but there are five decision boundaries for the hp2004 task (row 22). Regarding the achieved MAP by the decision mechanism, the experiments Ev(b), L(su')p1 (rows 1-6), EV(at), (rows 7-12), Ev(b), (rows in improve13-18) and result L(SU'), j L(su')i, n for is all the tested tasks. In particular, when the experiment Ev(b), ments L(su')p1 used to selectively apply either I(ne)C2FU or DLHFP, for the td2003 task, the obtained MAP is 0.1655 (row 1). The obtained MAP corresponds to a relative increase of (0.1455). increase This approach according to Wilcoxon's by denoted *. test, signed rank as 13.75% over the MAP of the most effective individual is statistically significant
is one decision boundary
The same decision mechanism also applies the most appropriate retrieval approach for t. denoted by, a statistically significant number of queries, according to the sign test, as When the decision mechanism uses the experiment E2(at),L(SU'); the obtained M \P n, for the np2004 task is 0.7269 (row 24). This is slightly higher than the MAP of the best performing run in the corresponding task of TREC 2004 Web track (0.7232 from 4.6 67). in Table 11 on page row
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 Retrieval approaches DLHFP I(ne)C2FU I(ne)C2FP PL2F BM25FA DLHFU DLHFA PB2FU I(ne)C2FA PL2FP I(ne)C2FA PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F I(ne)C2FU PL2F DLHFU PB2FU DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 E EY(b), L(SU')p, y(b),L(SU')pi EV(b), L(SU)p, y(b),L(SU')p, EV(b), L(SU)p, y(b),L(SU'), MAP 0.1655 0.1343 0.6829 0.5939 0.6914 0.7173 0.1470 0.1355 0.6806 0.5751 0.6925 0.7155 0.1570 0.1327 0.6816 0.6001 0.7059 0.7217 0.1437 0.1357 0.6807 0.5949 +/- % Bnd +1 3-75t* 3 3 + 2.75 1 + 2.54 3 + 6.91 2 + 0.99 2 + 3.30 + + + + + + + + + + + + + + + 1.03 3.67 2.19 3.53 1.15 3.04 7.90 1.53 2.34 8.03 3.11 3.93 1.20 3.83 2.21 7.091 1 2 4 1 2 1 3 2 1 3 4 2 1 3 4 5
V(at),L(SU')p, EV(at),L(SUI)p, y(at), L(SU')p, Ey(at), r,(SU')P, y(at), L(SU')P, EV(at),L(SU') , EV(b), 0.1455 DLHFP L(SUI);,, y(b),L(SU');,, 0.1307 I(ne)C2FP Ed(b),L(SU');,, 0.6660 BM25FA EV(b), 0.5555 DLHFA L(SU');,, Ey(b),L(SU');,, 0.6846 I(ne)C2FA Ed b ,L SU' 0.6944 I(ne)C2FA E3(at),t, (SU');,, 0.1455 DLHFP E3(at), L(SU');,, 0.1307 I(ne)C2FP E3(Qt),c(SUI);,, 0.6660 BM25FA E3 0.5555 DLHFA at ,z SUS continued on next page
158
6.4 Evaluation
of score-dependent
experiments
continued
from nrevini. c naap
Row 23 24 25 26 27 28 29 30
Task np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004
Retrieval approaches PL2FP I(ne)C2FA PB2F I(ne)C2FA I(ne)C2FU PL2F DLHFU PB2FU PL2FP PB2F DLHFP I(ne)C2FP BM25FA DLHFA I(ne)C2FA I(ne)C2FA
Baseline 0.6846 0.6944 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
v E E3(at) L(SU')i,, , E3 at ,L Svc ;,, Ey(at) L(SUl);,, , y(at) L(SUl);,, , Ey(at) L(SUl)j , y(at) L(SUf)t, , V(at) L(SU1);,, , Ey(at) L(SU1);,, ,
MAP 0.7131 0.7269 0.1503 0.1376 0.6719 0.5640 0.7009 0.6914
+/- % Bnd 1 +4.16 1 +4.68 +3.30 +5.28 +0.89 +1.53 +2.38 -0.43 1 4 2 1 2 1
Table 6.6: Evaluation of score-dependent experiments based on estimating the usefiilhyperlink ), which result in at least one decision boundary the ness of structure L(S,, U,,, for each tested topic set. The symbol t denotes that the decision mechanism applies the most appropriate retrieval approach for a statistically significant number of queries, according to the sign test. The symbol * denotes that the difference between the MAP of the decision mechanism and that of the most effective retrieval approach is statistically significant, according to Wilcoxon's signed rank test.
15
10
+
td2003 0 td2004 hp2003 hp2004 np2003 np2004
0
EV(at), Eb'(b), L(SU')pi L(SU')Pi Etl(b), L(SU')in \at)L(SU)in `d(at), L(SU' )in
MAP between differences the the the of Figure 6.8: Histogram summarising relative fron individual retrieval approach decision mechanism and that of the most effective %' 6.6. Table `+/of column
6.4.4
Example
the usefulness of
hyperlink of
structure
experiments
based the the on illustrative experiments This section presents an example of using likelihoods displays 6.9 Figure the hyperlink posterior the structure. usefulness of
159
6.4 Evaluation
of score-dependent
experiments
P(I(ne)C2FU)"P(8jP(I(ne)C2FU)
experiments'y(b),
and P(DLHFP)"P(FIP(DLHFP)
Ev(b), (top right L(sv, )pi
for the score-dependent

diagram), Ev(b), L(SU)i,,
(top left diagram), L(SU)p1
(bottom left diagram), and F-v(b), (bottom diagram). The right employed task )=, L(SU, n is td2003, where all four experiments achieved an improvement of at least +7.29% over the baseline (rows 7 and 31 from Table 6.5, and rows 1 and 13 from Table 6.6).
25
P(I(w)CSPU)"P(er, u5u). lr(w)C2PU)P(DL}IFP)"P(%p, t(sv)rIDLHFP) ---2
1.6 1.4 1.2 1 0.8 0.6 0.4
P(I(s.)c2Fu)"P(E &.us(lII(n. )C2FU) P(DLHFP)-P )A( ).,IDLtFP) ------
1.5
0.5 0.2 0.371 0L 0.15 0.2

V. ci u. i V. J] V. 4 Aa V. 3 U. aj U. 0 U. 03
__-----
O.'s
1.091
1.298
n 0.4
1.4
0.5
0.6
outcomesOf Ev(b), L(su),,

2.5
0.7 0.8 0.9 1 1.1 outcomes of Ey(6),t(sv'),,
1.2
1.3
1.4
P(i(a)CZFU)'P(Ey('), I1(w)C2FU) L(SU),. P(DLHFP)'PIEvP), Ksu)ajDLHFP)
P(n. )c2FU)'P(GJ). l1(n. )caFU) Msm), IDLHPP) P(DLI P)'PlE b), +(, (p). -----1.2
1.5
0.e
0.6
iI
11 %
0.4 0.5 0.2

-------- 0.694 1.118 ,1 1.302
A X0.2
0.25
0.3
0.35 outcomes
0.4 0.45 of Ey(6), L(gp);.
0.5
0.55
0.6
0.4
0.5
0.6
0.7 0.8 0.9 1.1 1 outcomes of FV(p),L(SU%.
1.2
1.3
1.4
Figure 6.9: Posterior likelihoods of the score-dependent experiments for the topic set td2003, where one of the retrieval approaches I(ne)C2FU or DLHFP is selectively apfor likelihoods that the The basis. estimate experiments posterior plied on a per-query Un) left L(ST, Un) L(S, the hyperlink the are shown on the usefulness of and structure S,, distribution is The figure, hand hand and the right the score respectively. side of (bottom diagrams). I(ne)C2F (top diagrams), PL2F or generated with either For this particular diagrams suggests bottom diagrams between the top the and example, a comparison PL2F, I(ne)C2F, different or that using a weighting model, such as
has a small effect on the outcome values of the experiments. The shapes of the pos(left hand Ey(b), Ey(b), of side and terior likelihoods for the experiments L(su');,, L(su')p, for likelihoods the The the experiments Figure 6.9) are very similar. posterior shapes of intersection However, the Ey(b), of 8V(6), number also similar. are and L(5U);,, L(SU)pl
160
6.4 Evaluation
of score-dependent
experiments
points, or in other words, decision boundaries, is different (3 decision boundaries for the experiment 4(b), L(su)P1 vs. 1 for the experiment 6y(b), This variability L(su). n). is explained by the fact that the estimated posterior likelihoods for the lower and the higher outcome values of the experiments are relatively low and less reliable, because there are only few training queries that result in such outcome values. It is also worth noting that the divergence values obtained with L(S,,,, Ute) (right hand side of Figure 6.9) are considerably higher than those obtained with L(S, U71) (left hand side of Figure 6.9). This confirms that the distribution U; is less dependent ' S, because the original score of a document in Sn is replaced in U, by the sung on of the scores of the documents it points to, as discussed in Section 5.4.2 (page 117). On the other hand, the distribution U, is more similar to S, because the score of a document in U, depends on its original score in S.
6.4.5
Discussion
This section provides a discussion of issues related to the score-dependent experiment s, discussion focused Sections The is have been in 6.4.2 6.4.3. on t1Ie evaluated and which document S.,,, different definitions the the and effectiveness of using of perspectives of fields. in least terms the query particular one, or all with at Defining Sn with either PL2F I(ne)C2F or The score distribution Sn can be
defined in several different ways. In the context of the evaluation of the experiments, field-based defined been S, has to the distribution weighting models the with respect PL2F and I(ne)C2F. sion mechanism. symmetric The two employed weighting models affect the Bayesian deciFor example, when the experiment 4(b), L(sU)p1, which employs the
Jensen-Shannon divergence L(S,,,, Un), is used to selectively apply PB2FU
decision boundaries MAP the hp2004, for the are DLHFA of task the number and or tEy(b), When 6.5). the (row in Table 10 experiment 0.6132 and 3, respectively L(sv);,, 4 decisioii MAP is 0.6038. there are and is used in the same setting, the obtained to However, the some (row in 6.5). Table 34 are consistent boundaries obtained results (rows from Table 6.5) 13-18 For the experiments'E3(at), L(SU)p, example, only extent. for MAP in in improvements 6.5) from Table (rows all 37-42 F-a(at), result and L(SU);n hyperlink the the striic-usefulness of the tested tasks. The experiments that estimate EV(b), Lv(b), both For ) example, L(SU')%f L(su)p, and ture L(S,, U, follow similar trends.
161
6.4 Evaluation
of score-dependent
experiments
(rows 1-6 and 13-18 from Table 6.6, respectively) result in improvements in MAP for all the tested tasks. Therefore, the score-dependent experiments are robust with respect to the different weighting models that can be used to define the score distribution 5', Using documents with all, or at least one of the query terms Table 6.5
shows that all the experiments, which estimate the usefulness of the hyperlink structure L(Sn, Un), identify at least one decision boundary for all the tested tasks. This is not the case for the experiments, which estimate the usefulness of the hyperlink structure As shown in Table 6.6, out of the five different experiments that identify . least at one decision boundary for each of the tested tasks, four of them consider only the documents with all the query terms in a particular field, or a combination (rows fields 1-6,7-12,13-18, of that the score distribution and 25-30 from Table 6.6). This is due to the fact U,. Un is less dependent on Sn, than the score distribution L (Sn, U7,)
Therefore, considering only the documents with all the query terms allows to complete the divergence L(ST, Un) from a more cohesive set of documents, which are more likely to be about the topic of the query.
6.4.6
Conclusions
Overall, this section has presented the evaluation of the document score-dependent divergence hyperlink the the the structure as usefulness of experiments, which compute between two score distributions (Section 5.4 on page 115). The first score distribution S, corresponds to the scores assigned to documents by a retrieval approach. In the PL2F field-based two the the models, weighting experiments, evaluation of context of I(ne)C2F, and S,. form to are employed The second score distribution is defined in documents: highly-scored favour documents to that in two ways, other point order to its document to U, distribution the original score corresponds the score of a where Un, distribution the it the to; documents score that where the and points of scores plus it The documents that to. is the document the points scores of sum of equal to of a the to hyperlink symmetric the corresponds structure of usefulness 7. U, Sn between Ute,, S,,, between divergence and or and Jensen-Shannon
divergence the that have the The evaluation results employ experiments shown that L(Sn, Un) result in identifying at least one decision boundary for all the tested topic IR. Web The for Therefore, (Section they 6.4.2). applying selective are very robust sets
162
6.5 Document
sampling
a) are robust when the documents experiments that use the divergence L(S7z, U, with all the query terms are considered (Section 6.4.5). However, both the experiments that use either L(S,,,, Un) or L(S, hyperlink the ness of U, ) result in a variable number of decision boundaries. The outcome values of the score-dependent experiments that estimate the useful-
) or L(S, U7) depend on the definition of the structure L(SS, U,,,, S7z. However, the effectiveness of the experiments. distribution define S7. in which terms of two statistically independent field-based weighting models, namely PL2F and is consistent to an extent (Section 6.4.5).
I(ne)C2F,
6.5
Document
sampling
The evaluation methodology, which has been described in Section 6.2 (page 131), has is from that the the sample of retrieved outcome of an experiment stated computed documents Retq, which contain at least one query term in either their body, or their title fields. For example, Retq contains documents for which the query terms occur in the anchor text and the body or the title of the document, but it does not contain documents for which the query terms occur only in the anchor text. Depending on i he document frequency of the query terms, the size of Retq can be anything between few documents to a large proportion document the collection. of The aim of the current, from is their is the to computed outcome proposed experiments, when evaluate section Retq. documents TopRetq fixed C number of small samples of a The advantage of using a subset of the set of retrieved documents is mainly the E. Employing the for the small samexperiment outcome of computing reduced time documents indicate the documents fixed whether potentially can of number of a ples for the higher of experiments, outcome computing useful more are ranks retrieved at documents that is used. should there of number optimal an and whether documents by is TopRetq with respect documents The sample of ranking obtained to the score assigned by a retrieval approach. the 67), their query'(Section 4.4 with combination based weighting models or page on In be this 74), (Section 4.5 section, used. can independent sources of evidence on page independent two is statistically the evaluation of document sampling performed with different The two I(ne)C2F. use of field-based weighting models, namely PL2F and For this purpose, any of the fielcl-
163
6.5 Document
sampling
weighting models allows for evaluating the robustness of document sampling. The default setting described in Section 6.4 is used to set the associated hyper-parameters:
Cb=Ca =Ct=1, wb=Wt =1,
andwa=0.
In the context of evaluating document sampling, the effectiveness of computing t lie outcome of an experiment is tested with two sizes of samples. First. the top 5000 documents ranked are used to form a sample of moderate size. Second, the top 500 documents ranked are used to form a sample of small size. Regarding the queries, which retrieve several tens of thousands of documents, both 5000 and 500 document samples are relatively small. The remainder of this section is organised as follows. First, the definition of the E is experiments revisited in order to employ the sample of documents TopRetq. Next, description brief a is the the of given. experimental setting and presentation of results The current section continues with the evaluation of the score-independent and scoredependent experiments, and closes with a discussion, and some concluding remarks.
6.5.1
Revisiting
the definition
of experiments
This section revisits the definition
(Section 110) 5.3 the page on of score-independent
(Section docin 115), 5.4 to the order use on page and score-dependent experiments documents from top their ranked a sample of outcome ument sampling, and compute T opRetq. The definitions of the score-independent experiments are updated by replacing Rctq (5.2): (5.8). Equation For (5.2)-(5.5), Equations in TopRetq the example, and with
Vt EqtEddE condv(d) :
is rewritten follows: as Vt EqtEddE condv(d) :
Retq
TopRetq
(6.1)
hyperthe the that usefulness of For the score-dependent experiments compute (5.15) (Equation {uz} U= distributions on link structure, the definitions of the score follows: 118), (5.17) (Equation {uj} as updated U' are 118) page on and page = ui U. =
di-idj
scj +Z
dt-*dj
sch
di E TopRetq, dj E Retq
(6.2) (6.3)
sch
di E TopRetq, dj E Retq
164
6.5 Document
sampling
so that all the outgoing hyperlinks
from the documents in TopRetq to the documents
in Retq are used. In this way, the number of employed hyperlinks is greater than in the case where only the hyperlinks within the set TopRetq would be used. Therefore, more information from the hyperlink structure is employed.
6.5.2
Description
of experimental
setting
and presentation
of results
This section provides a brief description of the experimental setting and the presentation of the results for the evaluation of the experiments E with document sampling. As described in the previous sections, the Bayesian decision mechanism selectiv'el. y applies one retrieval approach from the pair of retrieval approaches that results in the highest potential for improvements in retrieval effectiveness. The evaluation is 1)(, i-formed for six different tasks: td2003; td2004; hp2003; hp2004; np2003; and np2004. The employed pairs of retrieval approaches for each task are given in Table 6.1 (page 134). Each of the tables used for the evaluation of document sampling provides the following information: precision relative obtained (MAP) difference a row identifier ('Row'); the tested task ('Task'); retrieval approach the mean average ('Baseline'); the individual
of the most effective in MAP
from the baseline, and the number without document sampling
boundaries decision of (column denoted wit h
by the decision mechanism of an experiment,
the symbol documents
i. e., F-V(b)for the experiment
that counts the number of
body); in their terms the query with all and the number sampling
the relative difference in MAP from obtained by the decision mechaìn500', for where,
the baseline, nism with example,
boundaries decision of (columns
document
`p15000', `p1500', ìn5000',
p15000 corresponds
documents by 5000 formed top the TopRetq to the sample to the sample TopRetq formed by the top 500 t denotes that the decision mechanism The symbol for a statistically approach The symbol significant number of betwei ni approach
PL2F, by ranked documents ranked
in500 corresponds and by I(ne)C2F).
applies the most appropriate queries, according the MAP
retrieval
to the sign test. mechanism according
* denotes that the difference
decision the of significant,
and that of the most effective retrieval to Wilcoxon's signed rank test.
is statistically
for the score-independent exthe The subsequent tables present evaluation results for topic the tested boundary decision so", all least identify one at periments, which for both Retq, or for and pl: p15000 i00, the and set when the outcomes are computed
165
6.5 Document
sampling
both in5000 and in500. The reported evaluation results for the score-dependent experiments correspond to the experiments, which identify at least one decision boundary for all the tested topic sets, when the outcomes are computed for the set Retq, and for all four samples p15000, p1500, in5000, and in500. This is choice made in order to focus the evaluation on the experiments that can effectively be for 'Ob used selective NN IR, and to avoid situations that result in the application of one retrieval approach f'()r The cases for which the decision mechanism does not have at least one decision boundary for a particular task are denoted by ìn the tables. -' all queries. 6.5.3 Document periments Table 6.7 shows the evaluation results of the document-level experiments ey(b) and ey(, t) for sampling with either PL2F or I(ne)C2F. The results from Table 6.7 are summarised in the form of a histogram in Figure 6.10. When the experiment EV(b) is used with a (rows documents, 5000 decision boundaries low identified is the sample of number of 1-6 in columns 'p15000' and ìn5000'). The obtained mean average precision (MAP) is higher than that of the baseline, with the exception of sampling 5000 documents with PL2F for the task hp2004 (row 4). Compared to the MAP obtained from the decision mechanism without document sampling (column ÈV(b)'), there are some fluctuations from document the resulting Ev(b) by MAP the For the experiment without using achieved example, sampling. document sampling is +7.27% above the baseline for the task td2004 (row 2 of colunin However, when sampling the top 5000 ranked documents with PL2F, the drops decision baseline between to MAP the difference in the mechanism and relative +2.52% (row 2 and column 'pl5000'). Èy(b)'). When sampling the top 5000 ranked documents with either PL2F or I(ne)C2F, the MAP decision in improvements the in the of retaining well performs experiment V(at) for boundary decision 1 identifying each of only and sampling, mechanism without decision the the For mechanism without the tested tasks. performance of example, (row the for td2003 task documents the 5000 same remains sampling sampling, or with 7 and columns ÈV(at)', `p15000', and 'in5000'). inig the documents, the then 500 is of weight to choice When the sample reduced the the has of performance for on effect to considerable a more sampling use model sampling for score-independent document-level ex-
166
6.5 Document
sampling
decision mechanism.
For example, sampling the top 500 documents with PL2F. and using the experiment FV(at) for the task np2003, does not result in identifying ans' decision boundary (row 11 of column `p1500' in Table 6.7). However, sampling with results in identifying one decision boundary (row 11 and column 'M500' ill This is related to the fact that for small sample sizes, the top ranked documents are more likely to depend on the employed weighting model. From the obtained results, it can be seen that the experiment EV(a, in t) results improvements over the baseline for all tasks when used with a sample of the top 5000 documents, ranked with either PL2F or I(ne)C2F (rows 7-12 and columns p15000 and Table 6-7).
I(ne)C2F
in5000).
Row 1 2 3 4 5 6 Row 7 8 9 10 11 12 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Ed b +1.44 +7.27 +4.23t" +1.44 +1.37 +5.72 Ey at +7.77 " 2 2 1 1 1 1 p15000 +6.67 1 +2.52 1 +2.69 1 1 -0.63 +2.40 1 +1.14 1 p15000 +7.77 ' 1 +1.22 1 +1.34 1 +4.79 1 +2.40 1 +0.75 1 p1500 +2.68 +0.15 +0.47 +0.34 +2.82 +3.11 p1500 +6.74 +5.43t' +1.50 +3.91 +1.21 in5000 +7.29 +1.22 +2.70 +1.78 +1.34 +1.93 in5000 +7-77* +1.76 +1.38 +1.57 +2.40 +0.75 in500 +10.10 3 2 -1.80 +2.82 3 +6.41 2 in500 +7.49t* +4.28 +1.26 +5.11 +2.381 +3.05 1 1 1 2 1 1
1 2 1 1 1 1 1 2 1 2 1
1 1 1 2 1 1 1 1 1 1 1 1
1 +1.15 1 +2.15 1 +5.69 2 +3.581 1 +2.97 1
Table 6.7: The relative difference between the MAP of a decision mechanism and t1l;It individual the the and corresponding number of approach, retrieval of most effective decision boundaries. The decision mechanism employs score-independent documelit level experiments with document sampling of 5000 and 500 top ranked documents default (in5000 in500), the I(ne)C2F (p15000 PL2F using and and p1500), and with t decision the denotes the that The applies nwst mechanism symbol parameter setting. for a statistically significant number of queries, according appropriate retrieval approach MAP between difference the the denotes * the that The of test. to the sign symbol is the decision mechanism and that of statistically most effective retrieval approach test. Wilcoxon's to rank signed significant, according
6.5.4
Document periments
sampling
for score-independent
aggregate-level
ex-
This section is focused on the evaluation of the score-independent aggregate-level exthe Each the application in evaluates It is parts three of parts. organised periments.
167
6.5 Document
sampling
12
8 4
td2003 0 td2004M
hp2003 hp2004
np2003 np2004
0
k-I A- ----I----- -yr Wv oivvvv
12 8
\ +4
0
N,
L-1
L-1
L-IL
nMnnn
1)1sOO
4 Fnnn
insnn
Figure 6.10: Histogram summarising the relative differences between the MAP of the decision mechanism and that of the most effective individual retrieval approach from Table 6.7. di-the the that the aggregate size average of compute experiments of sampling with tribution, its standard deviation, and the number of large aggregates, respectively, as described in Section 5.3.2 (page 112).
6.5.4.1
Average of the aggregate size distribution
Table 6.8 displays the results from document sampling with the experiments that coiiithe 6.11 Figure of results overview an presents the aggregates. of size average pute from Table 6.8 in the form of a histogram. documents, then is E all of When the experiment sampling with used (b), avg(dom) (rows boundaries 1decision high in of number a relatively result methods sampling 4 I(ne)C2F. there documents 5000 are with 6). For example, when sampling the top hp2004, hp2003. td2004, for tasks the and np2003 identified decision boundaries each of (rows 2-5).
168
6.5 Document
sampling
In the case of the experiment Ev(b), documents 500 I(ne)C2F sampling ), with avg(dorn, results in improvements in retrieval effectiveness and two decision boundaries for each of the tested tasks (rows 7-12 and column 'in500'). based aggregates, the experiment E3(at), performs avg(dir) well when used with document sampling, and it results in a relatively low number (rows thresholds 19-24 in Table 6.8). In particular, sampling 500 documents with of I(ne)C2F results in one decision boundary and improvements in retrieval effectiveness for all tested topic sets (rows 19-24 and column 'in500'). When the decision mechanism employs the experiment EV(at), sampling With avg(dir), either the weighting models PL2F or I(ne)C2F produces a variable number of deciboundaries sion and has a mixed effect in the retrieval effectiveness of the decision (rows 25-30). mechanism 6.5.4.2 Standard deviation of the aggregate size distribution In the case of directory
This section discusses the effect of document sampling on the performance of the aggregate-level experiments that compute the standard deviation of the aggregate size distribution. Table 6.9 displays the obtained results, and Figure 6.12 provides an form histogram. in the the of a overview of results EV(b), document sampling results in a variable with std(dom) (rows In last 4 in Table 6.9). in boundaries 1-6 decision the columns particnumber of for boundary decision detect does the task td2004 decision the any not mechanism ular, Using the experiment (row 2 and column `p15000'), and the named page finding tasks (rows 5-6 in column `pl500'). EV(b), E3(b), the and experiments aggregates, std(dir)" std(dir), (rows 7-12 in5000', baseline in improvements the E3(at), and column over result std(dir) from `p15000', 19-24 `p1500', 13-18 respectively. and column rows and column rows field-based depend the However, Table 6.9). the results weighting model employed on In the case of directory improvement in EV(b), of for sampling. For example, the experiment an results std(dir) PL2F. The documents 5000 top the same with +1.86% for td2003, when sampling improvement in an experiments results the top 5000 documents with I(ne)C2F. for task, the +9.14% when sampling same of
169
6.5 Document
samDline
Row 1 2 3 4 5 6 Row 7 8 9 10 11 12 Row 13 14 15 16 17 18 Row 19 20 21 22 23 24 Row 25 26 27 28 29 30
Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004
Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
E3 b ,a +1.86 +3.08 +1.08 +11.651 +1.21 +3.50 Ed b ,a -1.79 +6.041
dom
1 3 3 1 1 2
dom
p15000 +0.82 1 +10.181 4 +3.36 2 +6.23 2 +0.22 3 +1.41 2
p1500 +3.44 3 +0.31 3 +3.17 3 +5.76 3 +1.31 3 3 -1.40
in5000 -2.80 +7.57 +1.55 +4.46 +0.67 +3.31
in500 --6-3f--2-1 T-;,; 4 +1.91 2 4 +2.06 1 4 +5.72 3 4 -0.28 1 3 +0.95 1 in500 +3.37 2 +2.14 2 3.181 2 +9.85 2 +2.22 2 +2.64 2 in500 +3.16 1 5.74 5 +2.091 2 6.16t 3 +2.42 2 +0.53 3
-1.01 +8.98 +2.70 +0.88 Baseline 3 6 in5000 p15000 p1500 dir a , - 1 0.1455 +1.92t +13.06 '2 +5.70 2 +1.58 1 0.1307 2 +2.22 +7.501 4 +0.77 2 +7.88 4 0.6660 4 +1.23 +2.39 4 +0.41 2 -1.131 3 0.5555 +15.931' 2 +6.34 3 +10.621 2 +3.89 2 0.6846 4 +2.91 +3.801 5 +3.171 4 +5.571' 3 0.6944 2 +0.45 +1.56 4 +0.85 2 +5.721' 2 E3 (at), Baseline in5000 in500 p15000 p1500 avg(dir) 0.1455 1 +2.89 +2.13 1 +5.29 2 +1.51 3 +3.99 1 0.1307 3 +7.961 +3.06 2 -1.80 3 +1.91 2 +1.07 1 0.6660 4 +2.931 +1.34 2 -1.80 2 3 1.981 1 -0.27 0.5555 4 +6.26 +4.37 2 +3.20 2 +5.99 1 +1.76 1 0.6846 3 +3.711 +0.66 1 +1.50 2 +1.31 2 +0.53 1 0.6944 4 +3.92 +4.65 2 +4.94 2 2 1 +1.21 -1.20 Baseline Ey in5000 in500 p15000 p1500 dir at ,av 0.1455 2 +10.86 +2.47 2 +6.80 2 +4.26 2 +5.50 2 0.1307 1 0.00 2 +2.37 +0.46 3 0.00 2 0.00 2 0.6660 7 4 3 +2.64 3 4 -1.70 -0.03 -0.29 -0.92 0.5555 +13.031' 3 +10.661' 1 +9.451 3 +10.661' 1 +9.451 3 0.6846 5 +0.22 +1.99 1 +1.74 1 +1.49 1 +0.74 1 1 0.6944 +0.49 1 +1.97 3 +0.49 1 +2.85 1 +1.20
3 3 2 2 2 2
p15000 p1500 +2.34 3 +15.53 3 +4.82 1 -2.40 2 +2.211 5 +4.831 2 +6.08 2 +9.38 4 +3.72 1 +2.06 2 4 2 +3.96 -1.60
in5000 +1.92 2 +5.74 1 +0.80 5 +7.18 4 +5.02 1 2 -0.58
Table 6.8: The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corresponding number of decision boundaries. The decision mechanism employs document sampling of 5000 (p15000 (in5000 I(ne)C2F PL2F documents 500 top and p1500), and and with ranked The default the in500), the compute average experiments and setting. parameter using domain or directory aggregate sizes. The symbol t denotes that the decision mechanism for the a statistically significant number of applies most appropriate retrieval approach between difference * denotes that the The test. to the symbol queries, according sign the MAP of the decision mechanism and that of the most effective retrieval approach is statistically significant, according to Wilcoxon's signed rank test.
170
6.5 Document
sampling
td2003 o
td2004
15 10 5 0
hp2003
hp2004
np2003
np2004
&M-.
Marti, .. .,...
-AJZI
...... ra .. y...
El EEM,
,
a-_a----.
ID
LMA
7)i5nnn
rims)
Fnnn
_., ... ..,
inSnn
15 -1110 5 +0
DI 5000
15 -R 1o
+0 5
1)1500
5nnn i., n.
in500
lj
E (b), avg(dir)
IL
P15000
&p1500
in5000
in500
15 10 S + 0
rl
L_... _.
C=EM
Ear
15
\1o
/-
v15000
P1500
in5000
in500
+0
5 -5
F-V(at), P15000 avg(dir)
ii,
ii
pI5UU
zn5000
mbUU
MAP the between the difference of Figure 6.11: Histogram summarising the relative from individual approach retrieval the that effective decision mechanism and most of Table 6.8. 6.5.4.3 Number large aggregates of
the decision by employs which mechanism, a Table 6.10 displays the results obtained Figdocument sampling. large with aggregates the of that number count experiments histograill. form in 6.10 the Table from a of the results of 6.13 overview an provides ure least identifies E3(b), at the 1rg(do experiment n) When domain aggregates are used, 5000 both and of sizes sample topic and for all the tested sets, boundary decision one However, 6.10). Table in 4 last (rows 1-6 columns and 500 documents, respectively hand, On the when task. for other each boundaries decision varies the number of documents 5000 either with then EV(b), sampling is the employed experiment lrg(dom)-, decision less of in number variable I(ne)C2F a PL2F results or the weighting models
171
6.5 Document
sampling
Row 1 2 3 4 5 6 Row 7 8 9 10 11 12 Row 13 14 15 16 17 18 Row 19 20 21 22 23 24
Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004
Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
Ey
+4.81 +3.52 +0.33 +1.21 +5.61 +3.46

E3(6),
b std dom ,
p15000 -5.09 +1.98 +11.951 +4.831 +1.31 p15000 2 -2.27 +8.721 4 +0.831 3 +6.97 3 +2.16 2 +1.87 4 pl5000 +1.86 +3.60 0.00 +11.38 +3.16 +0.23 2 1 1 3 1 1 1 3 4 1 1
1 1 1 1 1 1
p1500 +10.38 2 +2.68 1 +2.91 1 +5.11 2 -
in5000 -5.09 -1.84 +3.59 +7.81 +5.841 +1.11 in5000 +12.44 3 +7.881 2 +0.21 2 +9.04 1 +2.25 1 +0.84 3 in5000 +9.14 2 +3.52 2 2 -1.90 +2.21 1 2 -0.83 +2.07 1 in5000 +4.54 2 +2.07 2 1 -0.20 +4.07 2 +1.55 1 +6.06 3 1 1 3 4 1 1
in500 +1.79 2 +0.84 1 1 -1.23 +5.58 2 +2.86 1 +2.68 1 1 2 1 2 1 -0.13 +6.741' 3 in500 +4.12 +2.30 +3.211 +6.21 1 3 1 2 in500 +2.68 +0.23 +2.241 +4.77
std(dir)
-2.30 +0.84 +0.591 +8.73 +1.29 +2.30

Ey
1 2 3 2 2 4 2 3 1 1 1 1 1 1 2 2 2 1
p1500 +6.12 1 2 -0.61 +0.62 1 +8.01 2 +4.311 2 4 -1.90 p1500 +7.77 1 +1.61 1 +2.03 1 +5.51 2 +1.80 1 +2.89 2 p1500 +4.67 2 1 -1.30 1 -0.83 +5.831 4 +1.87 2 +4.62 1
+7.56 +3.98 +0.75
b std dir ,
-0.68 +3.27 +4.57 E3(at), a td -12.00 +1.07 +2.84 +6.17 +3.871 +1.38
dir)
p15000 +2.68 3 +7.19 2 +1.05 1 +4.55 2 +3.07 2 +6.01 1
+4.33 1 in500 +5.91 2 -0.20 +4.46 +1.68 +3.60 2 1 1 2
Table 6.9: The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corresponding number of decision boundaries. The decision mechanism employs document sampling of 5000 and (in5000 I(ne)C2F (p15000 PL2F documents 500 top ranked and and p1500), and with in500), using the default parameter setting. The experiments compute the standard t denotes that the deviation of the domain or directory aggregate sizes. The symbol decision mechanism applies the most appropriate retrieval approach for a statistically * denotes The that the test. to the symbol sign significant number of queries, according difference between the MAP of the decision mechanism and that of the most effective Wilcoxon's test to is signed rank according statistically significant, retrieval approach . boundaries respectively). E3(at), the the experiment Regarding the directory aggregates, crg(dir) combination of in low PL2F to documents a 5000 results according top the ranked sampling of ìn5000'. `p15000', (rows 7-12 tasks for each of the tested and columns
and
Howin Table 6.10). 'p15000' (rows 19-24 boundaries decision column and of number (row baseline to the for td2003, task the in compared it precision reduced results ever,
172
6.5 Document
sampling
td2003 = W2004=
15 10 5 0 + -5 -10
hp2003 hp2004
np2003 np2004
F-17i
OAL
6V(b), std(dom) p15000
==. "rjlL= - L-rall

p1500 in5000 in500
15 10 5 0 + -5 -10
113
JL El
p15000 p1500
F
in5000
M m ,.
in500
F-3(b), std(dir)
15 10 5 0 + -5 -10
IIE,..,
E-V(b), std(dir)
=EALp15000 p1500
I,
M=
irt500
in5000
15 10 5 +0 -5 -10
iI1
F-3(at), std(dir)
p15000
p1500
irt5000
irt500
MAP between difference the the the Histogram of Figure 6.12: relative summarising from individual decision mechanism and that of the most effective retrieval approach Table 6.9. 19 and column 'p15000').
6.5.5
Document
sampling
for score-dependent
experiments
the decision exthe employs which mechanism, Table 6.11 presents the evaluation of ) (rows Ute, L(S, 1-12) hyperlink the structure the of that usefulness estimate periments form in is the the An also provided (rows 13-24). results Uh) of L(S, overview and be EXat), E3(at), can The and 6.14. Figure in L(SU); histogram experiments L(SU)p, of a n hp2003 task the the and of document exception with sampling, with used effectively In from Table 6.11). (row 3 'pl500' PL2F column documents and 500 with sampling boundaries. decision in than one decision more results the mechanism most of the cases,
173
6.5 Document
sampling
R ow 1 2 3 4 5 6 R ow 7 8 9 10 11 12 Row 13 14 15 16 17 18 Row 19 20 21 22 23 24
T ask td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004
B aseli ne 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
83
+0.55 4 +7.041 2 +0.89 2 +9.161 1 +0.72 1 +2.61 2 Ey b 1, do,,, , 2 +5.43 +5.43 5 1 +3.321` 1 -1.30 2 +0.50 2 +0.22 Ir g (dir) 2 +6.32 2 -0.92 2 +0.78 1 +8.77 1 +0.38 2 +2.71
E3 E3(6),
b lrg dom ,
p15000 +9.28 2 3 -1.10 0.00 2 +7.27 2 +2.21 2 3 -0.99 p15000 +3.23 1 +5.051 1 +2.03 1 +4.81 3 1 -0.20 1 -0.69 p15000 +6.12 2 +10.021 3 +0.36 2 +6.26 4 2 -0.16 +4.59 3 p15000 -5.70 +0.31 +1.10 +4.34 +2.95 +1.57 1 1 1 2 1 1
pl500 +9.62 2 +2.60 2 2 -1.40 +5.29 1 +1.911 2 +0.91 2 p1500 +7.01 +2.07 +0.74 +4.57 +0.79 1 1 5 4 3 1
in5000 +9.90 3 +0.38 2 3 -0.59 +2.07 2 +0.72 4 +1.07 2 in500 +6.12 +5.66 +2.631 +3.01 +1.93 +0.86 in5000 +16-36" +6.89 +0.57 +12.101' +1.88 +10.211' in5000 -0.62 +0.69 +3.27 +0.94 +0.76 +1.92 1 1 4 2 3 2 3 3 2 3 2 2 1 1 2 2 2 1
in500 +2.27 +1.61 +2.15 +4-77 +1.33 +0.12 in500 +0.55 +1.61 +2.87 +5.71 +4.46 +4.95
3 1 2 1 2 2 1 2 1 5 1 2
-2.40 p1500 +7.29 2 +3.14 1 1 -0.41 +4.19 2 +0.34 1 +0.26 2 p1500 +14.50 3 +1.37 2 +2.821 2 2 -0.99 +5.991' 2
in500 +5.57 2 1 -0.23 +2.13 1 +2.90 2 1 -0.66 +6.05' 2 in500 1 -2.50 +1.30 2 +4.191' 1 +3.92 2 +5.62 2 +3.07 2
tr diT , 3 +0.96 1 -0.54 6 +1.62 1 -0.09 1 +6.67 1 +4.97

at
The relative difference between the MAP of a decision mechanism and that of the most effective individual retrieval approach, and the corresponding number The decision mechanism employs document sampling of 5000 of decision boundaries. (p15000 (in5000 I(ne)C2F documents PL2F 500 top and pl500), and with and ranked The default in500), the experiments compute the nurnand parameter setting. using t denotes The decision directory ber of large domain or that the symbol aggregates. for the a statistically significant mechanism applies most appropriate retrieval approach denotes * difference The that the to the test. symbol sign number of queries, according between the MAP of the decision mechanism and that of the most effective retriev, 11 Wilcoxon's test.. to is signed rank according significant, approach statistically
Table
6.10:
(Section 6.4.2 154) is to the on page used similarly case where no sampling in improvements cEy(b), The experiments F-y(b), also result over and L(su');,, L(sv'),, In 6.11, in Table (rows 19-24 baseline 13-18 particular, the the respectively). and I(ne)C2F, the PL2F, to documents 500 compute outcoiiie or with either sampling of for boundary decision in 4(b), the F-y(6), one results of L(su')tin, respectively, L(sv, )p, and (rows in 13-16 finding home tasks distillation the column 'pl500', and topic page and
174
6.5 Document
sampling
td2003 =h td2004=
15 02 10 5 0 -5
nIL171 E_IL
p2003 =n hp2004
p
p2003 np2004
63(b), 1rg(dom)
p15000
p1500
in5000
in500
+0
15105-5
EH>
\'_/""
L----I--
nMnnn
inn ni
nnn
inn 7
15 10
\5 +
0 5
EL. It=
-k -I), I ZI k-I
n15000
v1500
i.n,Moo
in50()
15 10 +5 0 5
-3(at), 1rg(dir)
p15000
p1500
irt5000
in500
MAP between difference the the the of Histogram Figure 6.13: relative summarising from individual the decision mechanism and that of retrieval approach most effective Table 6.10. ìn500', in 19-22 respectively). column rows
6.5.6
Discussion
from the document perspectives to discussion sampling This section presents a related documelit the document E of size sampling; with of. the effectiveness of experiments different approaches. retrieval document with samples samples; and generating Effective experiments document with sampling The score-independent docuiiientdocument sampling, effective with
4(b) and EV(at) are particularly The 6.7. Table experiin score-dependent 6.5.3 Section in shown discussed and as level experiments
175
6.5 Document
sampling
Row 1 2 3 4 5 6 Row 7 8 9 10 11 12 Row 13 14 15 16 17 18

Row
Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004 Task td2003 td2004 hp2003 hp2004 np2003 np2004
Task
Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944 Baseline 0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
Baseline
E3(at),
L(SU) , +1.99 3 3 +2.30 5 +2.04 +5.80 1 3 +1.46 2 +6.31
p15000 +7.90 "2 +1.15 2 +0.24 2 +2.72 3 +2.06 3 +6.98t' 5 in5000 +3.99 +2.91 +1.971 +6.91 +4.731 +4.57 2 2 2 2 3 2
p1500 2 +2.75 +5.891 2 3 -0.301 +5.18 3 +1.17 2 +2.36 2 in500 +5.70t* +10.33t' +2.571 +4.18 +3.18 +0.75 p1500 +7.22 +1.071 +3.321 +3.01 +2.41 +4.49
in500
E3
at ,z, su 3 +3.09 2 +4.591 4 +1.68 2 +6.80 3 +2.50 3 +4.13 Ey(b),L(sU')
2 3 1 2 3 2 1 1 1 1 2 2 1 1 1 1 3 2
+13.75 '3 3 +2.75 1 +2.54 3 +6.91* 2 +0.99 2 +3.30 +7.90 +1.53 +2.34 +8.03 +3.11 +3.93 3 2 1 3 4 2
p15000 +10.45t* 3 +1.61 2 +1.44 1 +5.33 2 +4.461 3 +1.47 2

in5000
19 20 21 22 23 24
td2003 td2004 hp2003 hp2004 np2003 np2004
0.1455 0.1307 0.6660 0.5555 0.6846 0.6944
Ey b sv' in ,L
+11-681* +2.07 +3.021 +4.82 +2.07 +4.75
3 1 1 2 3 2
+4.26 +2.071 +1.77 +2.29 +1.80 +1.71
Table 6.11: The relative difference between the MAP of a decision mechanism and that the individual the corresponding number of and approach, retrieval of most effective decision boundaries. The decision mechanism employs the score-dependent experiments (p15000 PL2F documents 500 top 5000 document with and ranked and and sampling of default The (in5000 the in500), I(ne)C2F parameter setting. using and p1500), and t the decision denotes most appropriate retrieval that the applies mechanism symbol to the test. sign for according of queries, approach a statistically significant number decision MAP between the the difference mechanism of The symbol * denotes that the is significant, according statistically the approach that retrieval most effective of and to Wilcoxon's signed rank test. U, L(S7z, L(S, Uh) hyperlink or the structure the of that usefulness ') estimate ments Table 6.11). (Section A 6.5.5 document and sampling with applied are also effectively lw following: the is i facts of the out either cmiie when these that can explain reason i5 the experiments score-dependent or document-level experiments, score-independent information then documents, of amount similar a from the of same number computed between the outcome values of the comparison is considered for each query, making
176
6.5 Document
sampling
td2003 = td2004 =
15 *10 +5 0
hp2003 ho2004
np2003 nn7nn4 v
3(at), L(SU)l
15 10 +5 0
Tf LTTN Rfl
p15000
v1500
oivvvvv
r, tin to IOtUVV
15 1o +5
0 V(b), L(SU')p1
15 910 +5 0
\ T (QTTI ) H(, " '-'"-
p15000
p1500
dKdMK-7KA
t
ix x
Incnnn
""'-, '-,
mrInn
Figure 6.14: Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach from Table 6.11. independent have The scoreaggregate-level experiments easier. not been shown to be particularly effective with document sampling (Section 6.5.4 and Tables 6.8,6.9, and 6.10). This may be explained by the fact that since the outcome of the experiments the aggregate-level experiments is based on the distribution of aggregates, and not on the distribution of documents, then more documents are required in order to obtaiii a representative Size distribution of aggregates. In the evaluation of the experiments E with docu-
document of
samples
documents. The 5000 500 document the or scoresamples consisted of ment sampling, independent document-level experiments, which count the number of documents in
177
6.5 Document
sampling
which the query terms occur, resulted in improvements in retrieval effectiveness, and low a number of thresholds for the samples of 5000 documents. However, their performance was harmed for the samples of 500 documents (Section 6.5.3). This is because smaller document samples reduce the available information to compute the outcome of the experiments. On the other hand, the score-dependent experiments have been be to shown robust for samples of 500 documents (Section 6.5.5). This is because the score-dependent experiments, which estimate the usefulness of the hyperlink structure, employ all the outgoing links from the sample of documents to the whole set of retrieved documents, as described in Section 6.5.1. Therefore, even for small samples of
documents, the experiments consider more information from the hyperlink structiin,
Generating
document
samples
with
different
retrieval
approaches
The doe. -
ument samples have been generated with two different field-based weighting models, PL2F namely and I(ne)C2F. The experiments that exhibited a weak dependence on the particular weighting model used for sampling were the score-independent document(Section 6.5.3). This is explained, because these experiments level F-y(6) and EV(a, t) On documents in the terms the other occur. query which number of simply count hand, the score-dependent experiments that compute the usefulness of the hyperlink (Section Consequently, 6.5.5). documents their the score of structure explicitly employ for dependent is the small employed weighting model, especially on more performance document samples.
6.5.7
Conclusions
8 their are cornhas the outcomes when This section experiments proposed evaluated document have that The documents. shown from results evaluation of sets small puted the the to experiments, of cost be computational reduce effectively used sampling can in improvements the effectiveness. retrieval retaining while still doctinentthe score-independent Document sampling is used more effectively with either that the compute the (Section 6.5.3), experiments score-dependent level experiments or The (Section 6.5.5). aggregatescore-independent hyperlink the structure usefulness of (Section 6.5M). is used level experiments do not perform as well as when no sampling document-level then the experis document reduced, When the considerably sample but exhibit also well, perform The less experiments iment are score-dependent effective.
178
6.6 Using
retrieval
approaches
based on the same weighting
model
a stronger dependence on the particular documents (Section 6.5.6).
method used for performing the sampling of
6.6
Using retrieval ing model
approaches
based on the same weight-
The evaluation
results presented in Sections 6.3 and 6.4 refer to a Bayesian decision
mechanism that selectively applies retrieval approaches, which use different field-based weighting models. This section discusses the evaluation of the experiments E, when the decision mechanism employs retrieval approaches that use the same field-based weighting models. For example, the Bayesian decision mechanism can selectively apply the field-based weighting model PL2F, or its combination with PageRank (PL2FP) Hence, the employed experiment F_is required to identify the most effective retrieval approach, based on differences due to the used sources of queryindependent evidence. The remainder of this section aims to identify which of the field-based weighting models can be used more effectively in the context of a Bayesia1n decision mechanism, which employs the proposed experiments F_to selectively apply field-based combinations of a particular weighting model and query-independent sources basis. of evidence on a per-query The experimental A decision following. Bayesian is the mechanism employs setting field-based to the same use weightpairs of retrieval approaches, which are restricted ing model. For each of the tested tasks (td2003, td2004, hp2003, hp2004, np2003, (PL2F, PB2F, I(ne)C2F, field-based the weighting models and np2004), and each of DLHF, and BM25F), the decision mechanism employs the pairs of retrieval approaches (rows in for improvements highest in the retrieval effectiveness that result potential 1-30 in Table 4.11, page 100). For example, in the case of the task td2003 and the decision Bayesian BM25F, the applies either selectively mechanism weighting model (BM25FU), length URL from the or path the combination of BM25F with evidence Overall (row in Table 25 4.11). (BM25FP) PageRank the combination of BM25F with aggregate-level experiments, which compute 4 large deviation, the and scoreaggregates; of number the and the average, standard ) U;, L(S,, by U, L(S, divergences and dependent experiments which compute the ti, ti) there are 11 different experiments: directory domain independent and score1 score-independent document-level experiment: 6 for the task td2003.
179
6.6 Using retrieval
approaches based on the same weighting
model
setting the distribution
Sn with either PL2F or I(ne)C2F. By considering the 6 tested tasks, the body (b) and the combination of the anchor text fields (at). title and and the conditions 3 and `d, each type of experiment has 24 different configurations. The total number of configurations for the 11 different experiments is 24.11 = 264. Table 6.12 provides an overview of the evaluation results, with respect to: the number of times a particular type of experiments identifies least at one decision boundary (column 'B>O'); and the number of times a particular type of experiments results in improvements in mean average precision, compared to the most effective individual retrieval approach (column `+'). These numbers are given for each of the five field-based weighting models. For each type of experiments, the column `Table' indicates the table in Appendix B, which contains the evaluation results for the corresponding experiment E. The row `Total' of the table displays the sum of the corresponding columns for all the experiments E. The row `Ratio +/ B>O' corresponds to the ratio of the number of
times when there is an improvement in MAP from selective Web IR, over the number of times when there is at least one decision boundary.
Experiment
3(f), y(f)
t/(f), avg(dom) std(dom), , V(f), std(dom) avg(dom)
I(ne)C2F PL2F PB2F B>0 + Table B>0 + B>0 + Score-independen t experiments

B. 1 18 11 20 13 12 7
DLHF B>0 +
14 6
BM25F B>0 +
20 10
3(f),
3(f),
B. 2 B. 3
B. 4 B. 5 B"6
18 17
16 17 18
10 8
11 13 12
19 17
17 20 19
15 11
8 11 14
16 15
12 16 16
12 8
10 9 11
16 18
13 22 20
11 13
8 15 14
24 23
19 24 23
14 13
11 11 12
3(f), lrg(dom), V(f), lrg(dom) V(f), 3(f), avg(dir) avg(dir), V(f), 3(f), std(dir) std(dir),
3 f
lrg
dir
b f Ir dir) , ,
B. 7
15
17
11 21 20 15 15
7 19 18 14 10
15 21 17 17 19
7 15 11 10 13
21 24 24 23 23
12 20 18 19 19
e(f), L(SU)P, 3(f), L(su) p,, V(f), 3(f), L(SU); L(SU)1 n, v(f), 3(f), L(SU') L(SU')P, p,, y 3 ),L SU' in SU, , in ,L Total Ratio + /B>0
Score-dependent experiments 22 16 19 14 B. 8 22 15 18 13 B. 9 22 18 18 14 B. 10 23 18 15 10 B. 11 189 124 0.66 218 148 0.68
169 125 0.74
192 123 0.64
248 159 0.64
decision boundary least is for there one at Table 6.12: The number of times which Bayesian decision ('+'), the when ('B>O'), or improvements in retrieval effectiveness field-based 1 the same use which approaches, retrieval applies mechanism selectively weighting model. decision Bayesian the mechanism The results from Table 6.12 indicate that when lotst is BM25F. there at the model weighting use which employs retrieval approaches,
180
6.7 Decision
mechanism
with
more than
two retrieval
approaches
decision boundary one improvements
identified for 248 out of the 264 experiment configurations, and in retrieval effectiveness for 159 experiment configurations (row `Total').
On the other hand, when the decision mechanism employs retrieval approaches, which use the weighting model I(ne)C2F, there are only 169 out of 264 configurations of the experiments, which result in at least one decision boundary (row `Total'). The 125 out of these 169 configurations (0.74%) result in improvements in retrieval effectiveness (rows `Total' and `+ / B>0'). Therefore, the field-based weighting model BM25F is to be used in selective Web IR than I(ne)C2F. This can be explained by the fact that the restricted optimisation, which has been described in Section 4.6.2, more appropriate harmed the retrieval effectiveness of BM25F more than that of the Divergence From Randomness (DFR) field-based weighting models. Therefore, the benefit from selective Web IR is greater for the less robust field-based weighting model BM25F. Table 6.12 also suggests that the score-dependent experiments are particularly robust when they are used to selectively apply retrieval approaches based on the
(rows PB2F BM25F È3(f), d(f), è3(f), to weighting models and L(sU)Pl, L(sU)pl' L(sU);,,,
4(f)>L(SUI )in')
Overall, experiments,
this section has provided an overview of the evaluation of the proposed Bayesian decision the mechanism selectively applies retrieval ajpwhen in retrieval effectiveness in most of the cases. When both the applied
The the same weighting model. results suggest that there are proaches, which employ improvements BM25F, is least field-based there the at model one weighting retrieval approaches use identified decision boundary for most of the tested cases (row `Total' in Table 6.12). The score-dependent experiments are also robust, and they result in improvements in for tested the cases. of most effectiveness retrieval
6.7
Decision proaches
mechanism
with
two than retrieval more
ap-
far, been I has, in this perfornw, so chapter The evaluation of the proposed experiments However, two decision approaches. retrieval Bayesian uses which mechanism, with a from any number of retrieval the Bayesian decision mechanism can selectively apply (page In 122). Section 5.5 7 Example in described such been has it of approaches, as
181
6.7 Decision
mechanism
with
more than
two retrieval
approaches
a case, the decisions depend on the expected loss of each retrieval approach, instead of only the posterior likelihood that a given retrieval approach is the most effective. The current section presents an illustrative example of a Bayesian decision mechanism, which employs 3 retrieval approaches. Chapter 4 has described and evaluated 20 different retrieval approaches (5 field-based weighting models, and their combinations different 3 with sources of query-independent evidence), there are 20 19 18 = 68-10 " ways to select a set of three distinct retrieval approaches. This section presents the evaluation of a particular set of retrieval approaches, which have been selected for being diverse, and using all three different sources of query-independent evidence. The selected approaches are: the combination of the field-based weighting model PL2F with the Absorbing Model (PL2FA); the combination of I(ne)C2F with evidence from the URL path length (I(ne)C2FU); and the combination of BM25F with PageRank (BM25FP). The evaluation is performed for each of the tasks: td2003; td2004; hp2003;
decision the of mechanism that employs the above is
hp2004; np2003; and np2004.

Table 6.13 displays the evaluation mentioned identified retrieval approaches, for the cases when at least one decision boundary in retrieval
for all tasks, and there are improvements
effectiveness for at least
three of the tested tasks. This choice is made in order to focus the analysis on the most effective `+/experiments. Figure 6.15 provides an overview of the results from column
%' of Table 6.13 in the form of a histogram. lead improvements to small can the MAP
The results suggest that the decision effectiveness over the baseline in he ( with
mechanism
in retrieval
some of the cases. For example, experiment
achieved by the decision mechanism
Ev(b), L(su)p1 is 0.1726 (row 19 in Table 6.13). This represents an improveindividual MAP the the of most effective over When the decision mechanism significant retrieval approach for the E3(b), std(dir) in MAP, and (lie number of
ment of +5.24% task td2003
(0.1640).
uses the experiment improvement
for the task hp2004, most appropriate
there is a statistically approach
retrieval
is applied
for a statistically
significant
in improveis that However, there 6.13). (row Table in 16 results experiment no queries For Table 6.13). (column in `+/-%' for tasks example, MAP tested the in all ments The for td2004. the task improvements in number of the results experiments none of (column tasks for the tested experiments and decision boundaries also varies each of `Bnd' in Table 6.13). It is worth noting that 6 out of the 9 experiments that identify at for at least t hree
least one decision
boundary
for all tasks, and result in improvements
182
6.7 Decision
mechanism
with
more than
two retrieval
approaches
of the tested tasks, are score-dependent experiments, which estimate the usefulness of the hyperlink structure (rows 19-54), while there are only 3 score-independent director Y (rows 1-18). The results indicate that the score-dependent aggregate-level experiments experiments are more robust than the score-independent experiments in the described setting. The unstable performance setting can be attributed of the Bayesian decision mechanism in the employed to the fact that the higher number of retrieval approaches that a particular retrieval approach is effective, of the likelihoods, of
require more queries for training the decision mechanism. As described in Section 5.5.2, the estimation the estimation obtaining of the prior probability loss function, the of
density the and estimation
a particular
experiment outcome, are performed from subsets of the traiilretrieval
ing queries.
These subsets correspond to the queries for which a particular
approach is the most effective one. Therefore, as the number of retrieval approaches increases, the size of the training decreases, less evidence subsets of queries providing for setting the Bayesian decision mechanism.
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Task td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 Retrieval Approaches PL2FA I(ne)C2FU BM25FP PL2FA I(ne)C2FU BM25FP PL2FA I(ne)C2FU BM25FP PL2FA I(ne)C2FU BM25FP PL2FA I(ne)C2FU BM25FP PL2FA I(ne)C2FU BM25FP PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA PL2FA I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU I(ne)C2FU BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP BM25FP Baseline 0.1640 0.1763 0.7516 0.6469 0.7108 0.7169 0.1640 0.1763 0.7516 0.6469 0.7108 0.7169 0.1640 0.1763 0.7516 0.6469 0.7108 0.7169 0.1640 0.1763 0.7516 0.6469 E 3(at), avg(dir) E3(at),avg(dir) E3(at),avg(dir) 3(at), avg(dir) E3(at),avg(dir)
3 at , avg dir
MAP 0.1623 0.1704 0.7574 0.6500 0.7180 0.7024 0.1558 0.1714 0.7662 0.6539 0.7079 0.7393 0.1558 0.1711 0.7517 0.6713 0.7063 0.7316 0.1726 0.1676 0.7597 0.6261
+/- % Bud 3 -1.04 4 -3.35 1 +0.77 1 +0.48 1 +1.01 2 -2.02 -5.00 -2.78 +1.94 +1.08 -0.41 +3.12 -5.00 -2.94 +0.01 +3.77t' -0.63 +2.05 +5.24 -4.93 +1.08 -3.22 2 1 3 7 3 1 3 1 1 1 1 4 3 4 2 2
v(at), avg(dir) y(at), avg(dir) EV(at),avg(dir) V(at),avg(dir) y(at), avg(dir) Ed at ,av dir E3(b),
std(dir) E3(b),
E3(b), std(dir)
std(dir)
E3(b), E3(b),
std(dir) std(dir)
E3
a ,std dir
EV(b), BM25FP L(SU)p, EV(b), BM25FP L(SU)p1 Ev(b),L(su)P, BM25FP Ev(b),L(SU) BM25FP , continued on next page
183
6.7 Decision
mechanism
with
more than
two retrieval
approaches
Row 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
Task np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004 td2003 td2004 hp2003 hp2004 np2003 np2004
continued from previous page Retrieval Approaches Baseline PL2FA I(ne)C2FU BM25FP 0.7108 Ey(b) L(su) , , p PL2FA I(ne)C2FU BM25FP 0.7169 Ey(b),L(SU) t PL2FA I(ne)C2FU BM25FP 0.1640 3(at) L(SU) , , p PL2FA I(ne)C2FU BM25FP 0.1763 E3(at) L(SU) , , p PL2FA I(ne)C2FU BM25FP 0.7516 E3(at) L(SU) , , p PL2FA I(ne)C2FU BM25FP 0.6469 E3(at) L(SU) , , P PL2FA I(ne)C2FU BM25FP 0.7108 E3(at),L(SU) , P PL2FA I(ne)C2FU BM25FP 0.7169 E3(at),L(SU) , PL2FA I(ne)C2FU BM25FP 0.1640 Ey(b),L(SU);, , PL2FA I(ne)C2FU BM25FP 0.1763 Ey(b),L(SU)in PL2FA I(ne)C2FU BM25FP EV(b), 0.7516 L(su)1n PL2FA I(ne)C2FU BM25FP Ev(b),L(su);,, 0.6469 PL2FA I(ne)C2FU BM25FP Ey(b),L(su)in 0.7108 PL2FA I(ne)C2FU BM25FP Ey b L su 0.7169 . , PL2FA I(ne)C2FU BM25FP Ev(b) L(su') 0.1640 , , p PL2FA I(ne)C2FU BM25FP Ey(b),L(SU')P, 0.1763 PL2FA I(ne)C2FU BM25FP EV(b), 0.7516 L(SU1)p, PL2FA I(ne)C2FU BM25FP Ey(b),L(su')p, 0.6469 PL2FA I(ne)C2FU BM25FP Ev(b),L(SU')P, 0.7108 PL2FA I(ne)C2FU BM25FP EV(b), 0.7169 L(su; ) , EV(b), PL2FA I(ne)C2FU BM25FP 0.1640 L(SU');, EV(b), PL2FA I(ne)C2FU BM25FP 0.1763 L(SUl); n Ey(b),L(su')1n PL2FA I(ne)C2FU BM25FP 0.7516 EV(b), 0.6469 PL2FA I(ne)C2FU BM25FP L(su');,, EV(b), PL2FA I(ne)C2FU BM25FP 0.7108 L(SU');,, Ey b su' 0.7169 PL2FA I(ne)C2FU BM25FP ;,, ,L E3(at),L(SU');,, 0.1640 PL2FA I(ne)C2FU BM25FP E3(at),L(SU'); 0.1763 PL2FA I(ne)C2FU BM25FP n E3(at),L(SU');,, 0.7516 PL2FA I(ne)C2FU BM25FP E3(at),L(su');,, 0.6469 PL2FA I(ne)C2FU BM25FP E3(at),L(SUl)j 0.7108 PL2FA I(ne)C2FU BM25FP E3(at),L(SU'); 0.7169 PL2FA I(ne)C2FU BM25FP n
MAP 0.7131 0.7082 0.1614 0.1699 0.7558 0.6347 0.7140 0.7314 0.1674 0.1693 0.7563 0.6338 0.7183 0.7148 0.1558 0.1645 0.7637 0.6508 0.7125 0.7319 0.1570 0.1629 0.7639 0.6295 0.7125 0.7253 0.1578 0.1630 0.7524 0.6487 0.7022 0.7346
+/-V(, +0.32 -1.21 -1.59 -3.63 +0.56 -1.89 +0.45 +2.02 +2.07 -3.97 +0.63 -2.03 +1.06 -0.29 -5.00 -6.69 +1.61 +0.60 +0.24 +2.09 -4.27 -7.60 +1.64 -2.69 +0.24 +1.17 -3.78 -7.54 +0.11 +0.28 -1.21 +2.47
Bnd 2 3 3 1 1 4 1 2 1 3 2 2 3 4 1 3 1 1 1 1 2 3 1 2 1 1 1 4 1 4 2 1
Table 6.13: Evaluation of the decision mechanism, which employs the retrieval apleast identify for that BM25FP, the I(ne)C2FU, at PL2FA, experiments and proaches in in improvements for tasks, tested boundary retrieval the decision and result all one t decision denotes that the The least tasks. for tested three symbol at effectiveness for significant statistically a approach the retrieval appropriate most mechanism applies difference denotes * the that The test. the to symbol sign according number of queries, the that decision effective retrieval most of MAP and the between the mechanism of Wilcoxon's test. to rank signed is statistically significant, according approach decision Bayesian mechaiiisin, Overall, this section has presented an example of a deciBayesian In the this example, different approaches. three retrieval which employs However, lie t in improvements effectiveness. lead retrieval to small sion mechanism can higher training queries of number a requires increased number of retrieval approaches
184
6.8 Discussion
td2003 0
6 4 2 +-2
td2nn4
hn-nnd
hp2003
-)nnA Tr, n, .. r
np2003
1771
L-F-8 (at), avg(dir) 6 4 2 00 + -2 -4 -6 -8 4(at), avg(dir)
EF-0
F-3(b), std(dir)
xxq 21
EV(b), L(SU)pi 3(at), L(SU)pi 4(b), L(SU)in
6 4 2
Uo -f7-71 L 00
00
+-2 -4 -6 -8
-V(b),
L(SUI)pj
' V(b), L(SU')tn
R(at),
L(SU')in,
Figure 6.15: Histogram summarising the relative difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach from %' `+/column of Table 6.13. in order to reliably set the Bayesian decision mechanism. In order to alleviate the need for a higher number of training queries, a different approach can be taken. For example, the problem of selecting one retrieval approach among k available ones can always be transformed to a series of k-1 selections of one among two retrieval approaches.
6.8
Discussion
Bayesian in the the Overall, the evaluation of the proposed experiments context of Web IR for is framework introduced has that the selective decision mechanism shown This in lead improvements to effectiveness. retrieval can which approach, a promising from discusses further the perspectives. additional of a range obtained results section
185
6.8 Discussion
Range
of
experiment
outcome
values
The outcome of the proposed experi-
fall ments within
different ranges. For example, the outcome of the score-independei it document-level experiments can be any number within the range [0, N], where N is the number of documents in the employed collection. On the hand, the outcome other that compute the usefulness of the hyperlink structure fall within the range of the symmetric Jensen-Shannon divergence values [0,21 (Section 5.4.1). In addition, the average size of domain aggregates is expected to be higher than the average size of the directory aggregates, as discussed in Section 6.3.2.3 (page 147). The illustrative examples in Sections 6.3.1.2 (page 138) for the document-level experi(page 6.3.2.3 147) for the aggregate-level experiments, and 6.4.4 (page 159) for ments, of the experiments
the score-dependent experiments, suggest that the estimated posterior likelihoods are higher when the outcome of an experiment falls within a smaller range of values. The higher posterior likelihoods correspond to stronger evidence for the appropriateness of a particular retrieval approach. However, the smaller range of outcome values of an experiment E is likely to result to overlapping densities for the posterior likelihoods, and hence, a higher number of decision boundaries. Therefore, there is a tradeoff between the range of the outcome values of an experiment and the expected number of decision This tradeoff explains the fact that all the score-dependent experiments, ) U, least divergence L(5,,, identify Jensen-Shannon the at symmetric which compute (Table 156). 6.5 for boundary decision tasks tested on page all one boundaries. Applying retrieval approaches and improvements in retrieval
appropriate
E by is the the The shown numexperiments proposed effectiveness of effectiveness ber of decision boundaries, the improvements in retrieval effectiveness, and the nurnber of topics for which the Bayesian decision mechanism applies the most appropriate Bayesian deFor the 131). (Section 6.2.1 when example, page on approach retrieval for DLHFA PB2FU IEB(b), to or either apply selectively uses cision mechanism avg(dir) is boundaries, there decision a statistically significant the task hp2004, there are two is the approach retrieval apmost appropriate improvement in MAP of +13.03%, and (row Table 6.4, 1-16). in 16 page for of queries number significant a statistically plied the decision most apprcothe applies mechanism However, in some of the tested settings, but there are for of queries, number significant a statistically priate retrieval approach in Table 21 6.4 For in retrieval effectiveness. row example, improvements important no
186
6.8 Discussion
shows that the decision mechanism results in a relative improvement of only 0.59`/C in MAP for the task hp2003, even though the most appropriate retrieval approach is applied for a statistically significant number of queries. This fact can be explained in the following way. The decision mechanism may apply the most appropriate retrieval approach for a statistically significant number of queries, but a small number of wrong decisions may cancel the positive effect in the overall retrieval effectiveness. Indeed, when the decision mechanism is trained, it does not consider the magnitude of the differin ence retrieval effectiveness between the most effective and the less effective retrieval approaches. Future work can address this issue by investigating different definitions for the loss function described in Section 5.2, which would consider the magnitude of
differences in retrieval effectiveness. Potential IR for improvements improvements from Web
and obtained
selective
The evaluation of the proposed experiments in this chapter has been performed These retrieval
in the context of a Bayesian decision mechanism, which selectively applies a pair of have been basis the approaches selected on of retrieval approaches. their potential for improvements in retrieval effectiveness. Table 6.1 (page 134) shows the potential for improvements in retrieval effectiveness for each pair of retrieval apfor the task td2003, the maximum mean average This corresponds to a relaFor the in tested tasks. the applying retrieval selectively example, when proaches used approaches I(ne)C2FU precision (MAP) DLHFP and (row 6.1). Table 1 in be 0.1926 can
tive improvement (0.1455).
MAP from the 32.37% the most effective retrieval approach of of When employing the Bayesian decision mechanism, the highest obtained
(row from Table 6.6 1 is 4(b), MAP is 0.1655, when the experiment L(sU')p1 employed from 0.1455, 13.75% improvement which to This 159). of a relative corresponds on page is statistically significant. in two ways. for improvements and the obtained improveFirst, this difference is due to the fact that decision mechaa setting, where Second, the difference between t lie The difference between the potential be interpreted ments can the maximum MAP
is obtained in a hypothetical
for the decisions queries. all perfect makes nism
for is there that room more suggests the effectiveness obtained retrieval maximum and Web IR. for E selective improvements by introducing more effective experiments
187
6.9 Summary
Generalising
findings
from
selective
Web IR
The evaluation of the proposed
E has been performed in a particular experiments
setting, with a broad set of retrieval approaches and a range of different tasks, as described in Section 6.2. The obtained depend factors. One such factor is the particular field-based results on several weighting models and their combination with the query-independent evidence. The improvements in retrieval effectiveness from selective Web IR can be higher if less robust retrieval approaches are employed. For example, the field-based weighting model BM25F has been shown to be more appropriate for selective Web IR, because it is less robust than the DFR field-based weighting models with respect to the setting of the hyperdiscussed Section in 6.6. Another factor that affects the results is parameters, as whether the employed tasks are good representatives of particular types of tasks. If they are good representatives of a type of tasks, then the Bayesian decision mechanism for have the tasks of the the and a similar performance proposed experiments should same type. In order to alleviate the effect on the obtained results from the above mentionc(l factors, the evaluation different evaluation has focused on the experiments E, which are effective across Therefore, it is expected that the main findings from the settings. FurE would hold for different experimental
types of tasks.
of the experiments Chapter
E different in the 7 a thermore, experiments will perform an evaluation of if information, limited there any. relevance exists setting, where
6.9
Summary
Web IR, for framework the which selective This chapter has presented the evaluation of in been has the The Chapter 5. of context in performed been has evaluation proposed to apply approaches retrieval decision of pairs Bayesian employs mechanism, which a on a per-query basis. These pairs of retrieval been have selected with approaches they in employ and for improvements effectiveness, retrieval respect to their potential focus the In to (Section 6.2). evaluation order different field-based weighting models the testing E, training the of and the experiments the employed of effectiveness on that there task, the assuming same decision mechanism have been performed with information. exists relevance
188
6.9 Summary
experiments perform well when their outcome is computed from the set of documents that contain all the query terms in their anchor text. This is because these documents are more likely to be about the query topic, and therefore, the resulting set of documents is more cohesive (Section 6.3.1). aggregate-level experiments perform well when the considdocuments ered contain at least one or all the query terms in their body, because a larger set of documents is required in order to obtain a representative distribution aggregate sizes (Section 6.3.2.4). of The score-independent
The score-independent
document-level
The score-dependent experiments are robust, and they result in improvements for most of the tested settings, when the usefulness of the hyperlink structure is estimated by the symmetric Jensen-Shannon divergence L(S, U,,, ) (Section 6.4.2). The current chapter has also investigated document sampling, in order to reduce the computational cost of the introduced experiments E, as well as to test whether the experiments by a weighting F, are more effective with documents, which have been highly scored model. The results show that document sampling can be effectively document-level experiment EV(at) (Section 6.5.3),
for the score-independent employed
EV(at), for E3(at), E(at), the as well as score-dependent experiments L(SU)pji L(SU);,,, L(SU')p, (Section 6.5.5). and'y(at), L(SU")i, When the Bayesian decision mechanism selectively applies retrieval approaches, be introduced the the can also experiments used same weighting model, which employ (Section 6.6). individual the improve to the retrieval effectiveness of approaches retrieval An example of a Bayesian decision mechanism, which selectively applies three retrieval illustrated lie In Section 6.7. in the t been has example, presented also approaches, higher because is the decision Bayesian number the unstable, mechanism of performance decision the to training set appropriately queries more requires approaches of retrieval mechanism. lead improvements IR Web to has that can Overall, the evaluation selective shown in the tested cases, of some significant in retrieval effectiveness, which are statistically decision to Bayesian the apply mechanism introduced the allow that experiments and an appropriate for a statistically retrieval approach Tlie of queries. significant number the all well across focused the perform which has experiments, on primarily evaluation IR in Web a setting, where following The selective evaluate will tasks. chapter tested limited only information relevance exists.
189
Chapter
Web IR with information relevance

7.1 Introduction
Selective
limited
The framework for selective Web IR and the proposed experiments F- have been so far Bayesian decision evaluated with a mechanism, which was trained and tested with the focused Chapter has been In in 6. this the task, way, evaluation as presented same E identify to the the appropriate retrieval proposed experiments on effectiveness of information that relevance approaches, assuming does exist. The objective of this decision the is investigate the the to and experiments mechanism of effectiveness chapter E in a more realistic and operational setting, where a decision mechanism is trained This setting is represented by training the decision different the known evaluation with a set of queries, and performing mechanism with a different Moreover, search of set a mixed represents of queries set each of queries. set information. limited relevance with home page finding, and named page finding. be decision that used when can This chapter also proposes an ad-hoc mechanism decision This approxiinformation mechanism limited ad-hoc exists. relevance only tasks, such as topic distillation, by boundaries decision the automatically mates The automatically queries. generating samples of representative be single-term queries, or more realgenerated queries can by either applying istic multiple term queries. The multiple term queries are generated from the text term, anchor sampling to or seed a random expansion automatic query document collection. limited how defines 7.2 Section follows. The remainder of this chapter is organised as
190
7.2 Limited
relevance
information
is modelled, and describes the experimental setting that is used in the remainder of this chapter. Next, Section 7.3 evaluates the proposed experiments E in the experimental setting of this chapter. It evaluates both the score-independent document and aggregate-level experiments, as well as the score-dependent experiment;. have been introduced in Chapter 5. Section 7.4 proposes an ad-hoc decision which mechanism, which can be applied, when the available relevance information is limited. The ad-hoc decision mechanism sets its decision boundaries by using novel techniques to automatically generate samples of queries. These samples of queries correspond to single term queries, or queries with multiple terms, which are generated by applying automatic query expansion, or by sampling the anchor text of documents.
7.2
Limited
relevance
information
The proposed retrieval
approaches in Chapter 4 have been optimised and evaluated
with different sets of mixed tasks, in order to obtain a realistic setting of the hyperparameters. However, the evaluation of selective Web IR and the proposed experiments does exist. This choice E in Chapter 6 has been performed by training and testing a Bayesian decision mechainformation that the task, relevance assuming same nism with focus the in the the to of experiments, and effectiveness on evaluation order was made to reduce any effect from using different training and testing tasks. The current chapter has decision IR in Web only mechanism a setting, where a aims to evaluate selective This section explains the concept of limited relevance limited relevance information. information in the context of selective Web IR, and it describes the experimental setting E in this the for chapter. the of remainder experiments proposed evaluating used
7.2.1
Modelling
limited
relevance
information
for information the have A retrieval system will almost certainly not complete relevance be information limited However, may it relevance some that processes. search requests In the been have of context selective for that processed. already the queries available (a) defined the is to: information respect limited with Web IR, the concept of relevance (b) the decision training by the mechanism; type of the queries, which are processed different of queries. decision sets with the mechanism and evaluation of
191
7.2 Limited
relevance
information
Type of queries
In an operational
queries, which are submitted
setting, a retrieval system processes a stream of by users. The queries are not associated with explicit
evidence about the aim of the user. For example, the retrieval system is not aware whether a particular query is an informational, or a navigational query, unless further analysis of the queries is performed (Beitzel et al., 2004; Bomhoff et al., 2005; Rose & Levinson, 2004). This means that the type of the relevant documents, or the number of relevant documents is unknown. For example, if a system is not aware whether a query is related to a navigational or an informational task, it does not know whether there is one or few relevant documents. Similarly, a system does not know that the relevant document for a home page finding task is indeed a home page of a Web site. Craswell & Hawking (2004) suggested that effective retrieval can be performed without knowing the type of the queries. However, in the context of selective Web IR, queries from mixed tasks may have an impact on the training of the decision mechanism. This is because different types of queries are likely to result in different distributions for an experiment. to retrieve more documents from a particular of outcome valia's For example, a query related to a home page finding task is likely Web site, resulting in a small number of
large domain aggregates. On the other hand, a query related to a topic distillation task is likely to retrieve many documents from several Web sites, resulting in a high number finding A task to domain large page a named related specific query very aggregates. of is likely to retrieve few documents with all the query terms. Therefore, using mixed from is decision the the affected mechanism tasks intends to test whether setting of different types of queries. processing from In to processing queries training Using different addition and testing different is trained of sets with evaluated and tasks, usually system a retrieval mixed to is a set of respect with optimised The system a of retrieval effectiveness queries. both is to lirlpreviously process Then, the required system training queries. retrieval tasks If for the been training. have set of used which seen queries, and possibly queries, the then task, type perforsearch of is particular training queries representative of a during training. be that to likely is to obtained close the system retrieval mance of for training different evaluand IR, Web sets query using In the context of selective decision the apply can mechanism test to decision whether aims the mechanism ating for queries. unseen previously approaches appropriate retrieval
192
7.2 Limited
relevance
information
7.2.2
Experimental
setting
for limited
relevance information
This section describes the experimental setting, which will be employed in the remainder of this chapter to evaluate selective Web IR when limited relevance information is available. The experimental setting is defined as follows.
1. As described in Section 4.6, two mixed tasks are selected to be used for training and testing the decision mechanism, respectively. Both mixed tasks correspond to a mix of queries from three different tasks: topic distillation; home page finding; and named page finding. The first mixed task is denoted by mq2003, and corresponds to the queries from the tasks td2003, hp2003, and np2003. When
task mq2003 is employed as a training set, the first 50 topics for each The
the mixed
type of task are used, and this smaller set of queries is denoted by mq2003'. mixed task mq2003' is used for training
in order not to bias the results towards to the queries used & Hawking, 2004).
a particular in the mixed When
type of task. The second mixed task corresponds query task of TREC task mq2003' 2004 Web track (Craswell
the mixed
is used for training,
the mixed
task mq2004 is for
employed training,
for the evaluation, the mixed
is task the employed and when mixed mq2004 for the evaluation. Details
task mq2003 is employed
about
the employed
(page Section been in 4.2 have 52). tasks given also mixed
2. As described in Section 4.6.2, the hyper-parameters
of the employed retrieval
for the training in to average precision optimise mean order set approaches are In order not to overfit the training mixed task, the optimisation (see The Section 4.6.2 95). iterations 20 is on page terminated after process field-based the to weighting models employed retrieval approaches correspond mixed task. PL2F, PB2F, I(ne)C2F, to their combinations DLHF, and BM25F (Section 4.4 on page 67), as well as PageRank, length, URL from the and path with evidence
fields document The 74)1. the (Section 4.5 Model are Absorbing on page the documents. title hyperlinks, the incoming of and body, the anchor text of E the and proposed experiments 3. The Bayesian decision mechanism employs one of i. tasks, for training or the mg2003' e., boundaries mixed decision of the one sets
displayed field-based are 'The values of the hyper-parameters associated with the weighting models the hyper-parameters quei-ywith A. The associated Appendix the (page 235) of A. values 11 in Table of A. Appendix (page 235) A. 12 Table displayed in of independent sources of evidence are
193
7.2 Limited
relevance
information
mq2004. Then, the Bayesian decision mechanism is tested with the corresponding evaluation mixed task. For example, if it has been trained with mg2003', then the evaluation employs the mixed task mq2004. In each case, the Bayesian decision mechanism selectively applies the two retrieval approaches, which have the highest potential for improvements from selective Web IR. For each of the evaluation mixed tasks, the potential for improvements in retrieval effectiveness is shown in Table 7.1. It is computed by assuming that a decision mechanism MAX employs two retrieval approaches and selectively applies the most appropriate one on a per-query basis, as described in Sections 4.7 and 5.2.2. The resulting retrieval effectiveness is the maximum that can be obtained from selectively applying the two retrieval approaches, and it is statistically significantly higher than that of the most effective individual retrieval approach, as denoted by *. The Bayesian decision mechanism employs the pairs of retrieval approaches that result in the highest potential for improvements in retrieval effectiveness. The employed pairs of retrieval approaches correspond to: DLHFP and BM25F for the evaluation task mq2003 (row 11); DLHFP and PB2F for t lie evaluation
Row 1 2 3 4 5 6 7 8 9 10 11 12
task mq2004 (row 12).

Task mq2003 mg2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 Mean Average Precision First approach (0.6206) PL2FU (0.4444) PL2F (0.5809) PB2FU (0.4723) PB2FU I(ne)C2FU (0.6258) I(ne)C2FU (0.4946) (0.5216) DLHFU (0.4273) DLHFU (0.6237) BM25FU (0.4883) BM25FU (0.5319) DLHFP (0.4156) DLHFP Second approach (0.6238) PL2FP (0.4717) PL2FA (0.5873) PB2FP (0.4723) PB2FP I(ne)C2FA (0.6210) I(ne)C2FP (0.4983) (0.5319) DLHFP (0.4156) DLHFP (0.6502) BM25FP (0.4680) BM25FA (0.5533) BM25F (0.4114) PB2F MAX 0.6529 (+ 4.66%)* 0.5094 (+ 7.99%)* 0.6029 (+ 2.66%)* 0.5258 (+11.33%)' 0.6511 (+ 4.04%)' 0.5561 (+11.60%)' 0.5577 (+ 4.85%)' 0.4618 (+ 8.07%)* 0.6921 (+ 6.44%)' 0.5284 (+ 8.21%)' 0.6582 (+18.96%)' 0.5304 (+27.62%)'
MAX, the decision applies selectively most which mechanism of a based The basis. approaches are retrieval effective retrieval approach on a per-query (page displays The 96). 4.10 table in Table on a restricted optimisation, as reported MAP in for improvements highest in the the pairs of retrieval approaches that result difference denotes * the that The symbol the tested mixed tasks mq2003 and mq2004. MAX the that decision most effective of MAP the and between the mechanism of Wilcoxon's test. to is signed rank according significant, statistically approach retrieval Table 7.1: Evaluation
194
7.3 Evaluation
of experiments
e with
limited
relevance
information
Overall, the described experimental setting allows to investigate the effect ivene ' of the proposed framework for Web IR in a setting where the decision mechanism is trained and evaluated with different sets of mixed tasks. The next section evaluates the proposed experiments e in the described experimental setting.
7.3
Evaluation information
of experiments
E with
limited
relevance
This section presents the evaluation of the proposed experiments E with limited releinformation, vance as described in Section 7.2. The evaluated experiments are the scoreindependent document-level and aggregate-level experiments (Section 5.3 on page 110), as well as the score-dependent experiments that estimate the usefulness of the hyperlink (Section 5.4 on page 115). This section closes with a discussion structure and conclusions from the evaluation of the Bayesian decision mechanism and the experiments with limited relevance information (Section 7.3.3).
7.3.1
Score-independent tion
experiments
with
limited
relevance
informa-
Table 7.2 displays the evaluation results for those score-independent experiments, which result in improvements in MAP, compared to the most effective retrieval approach, for both tasks mq2003 and mg20041. This choice is made in order to focus the analysis on the experiments that allow the decision mechanism to obtain improved retrieval effectiveness. For example, row 1 in Table 7.2 corresponds to a decision mechanism, which has been trained for the mixed task mq2004 and it is evaluated for the task mq2003. This decision mechanism, selectively applies either the combination of the field-based (DLHFP), PageRank DLHF with weighting model BM25F, on a per-query field-based the or weighting model basis. The employed experiment is &y(at), which counts the
MAP in The text. the documents terms the the anchor of all query with of number decision mechanism is 0.5775, which represents a relative improvement of +4.37% over the MAP individual the most effective of significant (0.5533). approach This improvement in Wilcoxon's to according denoted test, as signed rank
MAP is statistically
by *, and the corresponding

'The
decision mechanism applies the most appropriate retrieval

experiments in the same setting appear in
for all the score-independent results evaluation Table B. 12 (page 257) of Appendix B.
195
7.3 Evaluation
of experiments
8 with
limited
relevance
information
for approach a statistically denoted by t.
significant number of queries according to the sign test. as
From the results, it can be seen that the document-level experiment EV(at) results in improved MAP over the baseline, and it identifies decision boundary for bot li 1 only tasks mq2003 and mq2004 (rows 1-2). Moreover, when the decision mechanism uses Ey(at) it applies the most appropriate retrieval for approach a significant number of , queries, and there is only one decision boundary.
Regarding the domain E3(at), the aggregate-level aggregates experiments, there are four experiments namely that employ 4(b), std(do7TL) (rows 3-10).
Three of these experiments,
F-V(at), deviation of the domain aggrethe and compute standard std(dorn), std(dom) distribution. The experiment EV(b), in highest improvement the gate size results std(don) over the most effective retrieval approach for the task mq2003 (+4.41% from row 5 of Table 7.2). task mq2004 The experiment (+7.12% EV(at), in the highest improvement results avg(dom) from row 4 in Table 7.2). The directory aggregate-level in retrieval for t lie experi-
ments achieve lower improvements mechanism, significant which number applies
effectiveness, but they result in a decision retrieval approach for a statistically
(rows from Table 7.2). 11-13 15-16 and of queries
It is worth noting from Table 7.2 that six out of the eight experiments computc their outcome from the documents that contain all the query terms in their anchor text (rows 1-4 and 9-16). In the context of processing queries from mixed tasks for selective Web IR, this can be explained by the fact that the terms of either broad queries or likely in Web to home the appear site, are more page of a particular queries about the anchor text of documents. Therefore, the experiments, which count the number of decision the in text, their terms mechanism documents with all the query aid anchor be home likely documents to for the pages, are to identify the queries relevant which hyperlink from the structure. to therefore, evidence more apply and
7.3.2
Score-dependent tion
experiments
with
limited
relevance
informa-
for those which experiments, score-dependent 7.3 the Table evaluation results presents individual to the effective most improvements, compared in retrieval effectiveness result
196
7.3 Evaluation
of experiments
E with
limited
relevance
information
Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Task mq2003 mg2004 mq2003 mg2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004
Retrieval approaches DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F
Baseline 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156
y(at) y(at) y(at), avg(dom) y dom at ,a V(b), std(dom)

y
E3 at , std
6 std dom E3(at), ,

std(dom) dom
Ey(at), std(dom) Ey dom std at , V(at),avg(dir) y at ,a dir Ey(at), std(dir) y dir std at , y(at) lrg(dir) , y at ,Ir dir
MAP 0.5775 0.4381 0.5626 0.4452 0.5777 0.4212 0.5554 0.4265 0.5622 0.4233 0.5626 0.4395 0.5648 0.4374 0.5613 0.4421
+/- % Bnd 1 +4.37t* 1 +5.411 2 +1.68 2 +7.12t 1 +4.41 2 +1.34 3 +0.38 1 +2.62 2 +1.61 2 +1.85 1 +1.68 2 +5.751 2 +2.08t* 1 +5.25 3 +1.45 1 +6.371
Table 7.2: Evaluation of the score-independent document-level and aggregate-level exdisplays limited information. The the evaluation results table relevance periments with different The is decision tasks. trained mixed and evaluated with of a mechanism, which t decision denotes that the mechanism applies the most appropriate retrieval symbol for to the test. sign according of queries, a statistically significant number approach The symbol * denotes that the difference between the MAP of the decision mechanism is the that statistically significant, according of most effective retrieval approach and to Wilcoxon's signed rank test.
Overall, there for both tasks seven tested are and mg20041. mq2003 retrieval approach, All for both tasks. tested improved in mixed performance result experiments, which documents the from that the all contain their outcome the seven experiments compute in their 9-12), the (rows anchor 1-2,5-6, of body combination in or their and terms query documents indicates This that 13-14). all with (rows 3-4,7-8, fields and title text and E. in to experiment an compute order evidence terms useful the query provide more I(nr)C2F field-based the model weighting employs The experiment 1Ey(at), which L(su)Z,,, in improvements highest in efretrieval the S, distribution results to assign the score from 7 (+3.80% +2.60% and rows and fectiveness for both tasks mq2003 and mq2004 field-based the weighting EV(at), employs which 8, respectively). The experiment L(su)pj' improvements lower (rows 3-4), S,, distribution achieves the PL2F to score assign model F-V(at), than the experiment L(SU)1 both for tasks tested boundary decision in one results The experiment F-V(at), L(SU')t,ti
experiments for the score-dependent all evaluation results B. Appendix (page 258) of Table B. 13 'The in the same setting
appear in
197
7.3 Evaluation
of experiments
E with
limited
relevance
information
when the Bayesian decision mechanism employs the experi. the ment Ey(b), most appropriate retrieval approach is applied for a statistically L(suF)pl, significant number of queries for both tasks mq2003 and mq2004 (rows 9-10). When the Bayesian decision mechanism employs the experiments 4(at), L(sv); or EV(at), L(SU');,, n for the task mq2003, there is a statistically significant improvement in MAP, and the retrieval approach is applied for a statistically significant number of queries. However, there is no particular experiment that results in both statistically significant improvements in MAP over the baseline, as well as in applying the most most appropriate appropriate retrieval approach for a statistically significant number of queries, for both
(rows 13-14). Furthermore,
mq2003 and mq2004.

Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Task mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mg2004 Retrieval approaches DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F BM25F DLHFP DLHFP PB2F DLHFP BM25F PB2F DLHFP BM25F DLHFP PB2F DLHFP Baseline 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 E Ey(b),L(sv) pj Ey(b),L(sv) l y(at) L(SU) , p, EV(at),L(su) , y(b),L(SU)in y b sv ; ,L Ev(at), L(SU);,, EV(at su ;,, ,t y(b),L(SUI)p, EV(b), L(su') j y(b),L(su'); n eV(b),L(SUI)in y(at), L(sv');,, y at ,L sui ;,, MAP 0.5539 0.4215 0.5685 0.4215 0.5702 0.4207 0.5743 0.4264 0.5698 0.4201 0.5742 0.4194 0.5733 0.4157 +/- % Bnd 4 +0.11 2 +1.421 2 +2.75 1 +1.42 4 +3.05 2 +1.231 2 +3.80t* 2 +2.60 3 +2.98 1 +1.081 3 +3.78 1 +0.91 +3.611* 1 1 +0.02
Table 7.3: Evaluation of score-dependent experiments with limited relevance infordecision displays The the table mechanism. which of a evaluation results mation. t denotes The that the different tasks. is trained and evaluated with symbol mixed for decision mechanism applies the most appropriate retrieval approach a statistically * denotes The that the test. the to symbol sign significant number of queries, according difference between the MAP of the decision mechanism and that of the most effective Wilcoxon's test. to signed is rank statistically significant, according retrieval approach
7.3.3
Discussion
and conclusions
liinin decision Bayesian setting a with mechanism This section has evaluated the lie training to I and evaluating This setting corresponds ited relevance information. The tasks. different evaluation res-tilts of mixed sets decision Bayesian mechanism with for both the score-independent (Section 7.3.1) and the score-dependent experiments
198
7.4 Ad-hoc
decision mechanism
and query sampling
(Section 7.3.2) show that the proposed decision mechanism and the experiments E rein improved sult retrieval effectiveness, even when limited relevance information exist s. This suggests that selective Web IR can be effectively applied in a realistic setting. document-level experiment EV(a, performs for both he t well t) tested mixed tasks (rows 1-2 from Table 7.2). The domain aggregate-level experiments 4(at), EV(b), for and the tasks also perform well mq2004 and mq2003, avg(dom) std(dom) respectively (rows 4 and 5 from Table 7.2, respectively). Moreover, four out of the seven aggregate-level experiments, which are shown in Table 7.2, estimate the standard deviation of the aggregates' size. This indicates that estimating the standard deviation of the aggregates' size results in robust experiments, and it is in agreement with i he from Sections 6.3.2.1 and 6.3.2.2. results The score-dependent experiments, which estimate the usefulness of the hyperlink also result in improvements in retrieval effectiveness. However, no particular trend has been observed from the results in Table 7.3. Moreover, the score-independent structure experiments outperform the score-dependent ones with respect to the obtained improvedecision in MAP by the mechanism. ments An observation related to both the score-independent (Section 7.3.1) and the scoredependent experiments (Section 7.3.2) is that the documents that contain all the query terms in the anchor text or the title fields provide more robust evidence to compute the outcome of the experiments. This can be explained because a high number of documents with the all the query terms in the anchor text or the title fields indicates hyperlink is Hence, Web the to analysis applying that there are query. sites related Web home those detect the in to sites. likely to be effective pages of order The score-independent
7.4
Ad-hoc
decision mechanism
and query sampling
(page Section 5.5.2 125), in described decision When setting the Bayesian mechanism, as in employed are assessments the relevance corresponding training and set of queries a that the a particular to probability prior First, they set following used are the way. loss the to Second, associated they estimate used is are effective. retrieval approach to Third, they estimate employed are approach. with applying a particular retrieval is the most effective approach that retrieval likelihood a particular the density of the is drawback this that A process of values. the outcome for experiment a range of one
199
7.4 Ad-hoc
decision mechanism
and query sampling
the setting of the decision mechanism primarily depends on the availability of training
queries. The current section aims to reduce the dependence on the training queries by introducing a simple ad-hoc decision mechanism, which employs the distribution of the outcome values of an experiment E to set its decision boundary. The distribution of the outcome values of E is obtained from a sample of queries. which is automatically generated. generation of query samples is performed with three different techfirst The niques. one involves the random sampling of single terms from the vocabulary of the collection. The second technique applies automatic query expansion to randomly from terms the vocabulary and generates queries with more than one terms. selected The third one is a novel technique, which samples the anchor text of documents and generates queries with more than one terms. The query sampling is evaluated with respect to the similarity of the outcome values of an experiment F_for the sampled queries, and the outcome values of the same experiment 8 for the TREC Web track queries. The ad-hoc decision mechanism is evaluated from Web in improvements to the selective retrieval effectiveness obtained with respect IR. For both the evaluation of query sampling and the ad-hoc decision mechanism, have been the employed experiments are EV(at), 'd(at), avg(dom),and Ey(b), which std(dom), (rows is 1-2,4, information limited be arid available relevance effective when shown to 5 in Table 7.2). The remainder of this section is organised as follows. The ad-hoc decision mechaintroduces the three query sampling 7.4.2 Section 7.4.1. Section in is introduced nism to the the the 7.4.3 queries Section queries sampled of similarity techniques. evaluates Craswell & Hawking, 2004; (Craswell Web 2004 tracks et 2003 TREC in the and used decision the mechanism. the ad-hoc 7.4.4 Next, of 2003). evaluation presents al., The automatic
7.4.1
Ad-hoc
decision
mechanism
the training decision Bayesian on mechanism the dependence of In order to alleviate the decision mechanism ad-hoc new a assessments. the relevance corresponding queries and for is applied set and how this mechanism describes ad-hoc This is proposed. section IR. Web selective
200
7.4 Ad-hoc
decision mechanism
and query sampling
The ad-hoc decision mechanism is used to select one out of two retrieval approaches al and a2. It is set in two steps. First, it estimates the distribution of the outcome values of an experiment E. Second, it sets a decision boundary Bnd, so that the outcome o of the experiment F_is lower than Bnd with a given probability P(o < Bnd). During retrieval, if the outcome of the experiment E is o< Bnd, then the decision mechanism applies the retrieval approach al. Otherwise, it applies the retrieval approach a2. The probability retrieval P (o < Bnd) can be set equal to the prior probability that the
approach al is more effective than the retrieval approach a2. In this case, the prior probability that the retrieval approach a2 is more effective than the retrieval approach al would be 1- P(o < Bnd). If the prior probabilities cannot be determined, then a uniform prior probability P(o < Bnd) can be assigned to the retrieval approaches by settiiig P(o < Bnd) can be set equal to any The Generally, 0.5. the probability = the range [0,1].
value within
The defined ad-hoc decision mechanism employs two retrieval approaches. k be by k-1 selection of one out of retrieval approaches can always modelled
ad-hoc
decision mechanisms, which select one out of two retrieval approaches. The remainder of this chapter will illustrate the ad-hoc decision mechanism with two retrieval approaches. The next section focuses on automatically estimating the distribution of outcome values of the experiment E, by generating samples of queries.
7.4.2
Query
sampling
In this thesis, query sampling can be seen as the automatic generation of a sample of It to document from queries and provide real approximate aims collection. a queries information in formed to have an need. satisfy order that could users meaningful queries is to the IR, Web approximate used sampling In the context of selective query decision bound<-y E, in the to set order distribution of outcome values of an experiment if the In this queries are sampled not even decision case, mechanism. of an ad-hoc meaningful, distribution in least a similar they should at result of outcome values to the one obtained from real queries. for The techniques sampling. three query The remainder of this section introduces from the terms vocabulary by sampling randomly first one generates single-term queries than with more queries one technique The generates document second collection. a of
201
7.4 Ad-hoc
decision mechanism
and query sampling
terms, by applying vocabulary.
automatic
query expansion to randomly selected terms from the
The third one is a novel query sampling technique, which generates queries from the anchor text of Web documents. This section is followed by the evaluation of the three query sampling techniques with respect to the TREC 2003 and 2004 Web track queries. 7.4.2.1
Sampling
Single-term
query
sampling
has been employed in the context (Callan databases of remote query sampling distributed of 2001).
of terms and their statistics a representation
IR, in order to obtain Cronen-Townsend distribution a
& Connell,
et al. (2002) employed single-term
in order to obtain & Oullis
of the clarity
scores of the terms in a collection. to obtain distribution a
Plachouras
(2004) used single-term hyperlink the of structure.
sampling
for the usefulness of values
Here, in order to sample meaningful terms and avoid generating queries with either in their to frequent the the terms terms, according are ranked vocabulary or rare very frequency in the collection. Then, terms with a rank in the range [rjo, rho], are nandomly sampled. respectively. The thresholds rjo and rhi correspond to the low and the high rank, This technique is referred to as Single-Term Sampling (STS). sampling is only a simple approximation bethe process, querying of
Single-term
(Jansen & Pooch, In 2001). term have than likely to one more cause real queries are (Sildocuments in the to be to tend co-occur terms and the correlated query addition, following techniques two the sampling For query these 1999). reasons, verstein et al., terms. than one generate queries with more 7.4.2.2 Multiple term query sampling
(DFR) frameRandomness From Divergence the based is using on The second technique from the terms informative find set of retrieved 2002) to Rijsbergen, & Van (Amati work from the top terms informative The of set 2005b). Ounis, & documents for a query (He form the queries. to sampled documents used are ranked The the terms following of vocLIbin the way. The sampled queries are generated des( in in the frequency as ribed collection, term their to ulary are ranked according he to [rlo, is selected randomly rhi] in the range A term rank 7.4.2.1. with Section low lie to the t and correspond rh= and The thresholds ri0 query. single-term a as used
202
7.4 Ad-hoc
decision
mechanism
and query
samnline
high rank, respectively.
From the top ranked documents, the most informative term is extracted and it is used as an intermediate seed term to perform retrieval again. This intermediate seed term is employed in order to reduce the effect of initially selecting a random term. From the new set of retrieved documents, the most informative terms are extracted from the top-ranked documents in order to form a sampled query. The number of extracted terms depends on the required length of the sampled queries. He & Ounis (2005b) used a uniform query length distribution. In this thesis, this technique is refined by using a Gaussian distribution with mean p and standard deviation o for the query length distribution of the sampled queries. This technique is referred to as Multiple
The advantage
Term Sampling (MTS).

is that it can be applied to any document collection, irre-
of MTS
spectively feature
of the type of documents, However,
because it does not make use of any particular
of Web documents.
in an appropriate
the length of the generated queries should be set way. This issue is further investigated in Section 7.4.3.
7.4.2.3
Anchor
text
query
sampling
This section introduces a novel technique for sampling queries. It is based on the observation that user queries are similar to the anchor text of Web documents (Eiron & McCurley, 2003b). The user queries have similar length and term frequency distributions with the anchor text of the incoming hyperlinks of a Web document, which can be seen as a concise textual description of that document. Therefore, it is reasonable to employ the anchor text for query sampling. The sampled queries are generated as follows. The frequency of the anchor text is in text the by times the appears anchor each collection. of number computed counting The anchor texts with a frequency less than the threshold a fl, are discarded. This towards bias texts that the the anchor to queries sampled of reduce restriction aims is texts, From few the randomly one selected as times. anchor remaining appear very (ATS). Sampling Anchor Text is to as referred the sampled query. This technique in is that the terms the technique of number The advantage of this query sampling In length these terms the is to addition, of queries. hyperlinks similar anchor text of Web documents. in the text likely to highly of co-occur they are and are correlated
203
7.4 Ad-hoc
decision mechanism and query sampling
7.4.3
Evaluation
of query sampling
and 7.4.2.3 will
The query sampling techniques described in Sections 7.4.2.1.7.4.2.2,
be used in the context of an ad-hoc decision mechanism to approximate the outcome distribution of an experiment E for real queries. In order to do so, it is necessary to assess the quality of the query sampling techniques. The evaluation of the query sampling techniques is based on measuring the difference between the distribution of t lie experiment outcomes for the sampled and the TREC 2003 and 2004 Web track queries (tasks mq2003 and mq2004). The difference between the two distributions is estimated in terms of the symmetric Jensen-Shannon divergence (Equation (5.13) on page 116). When the divergence between the distributions is low, the sampled queries and the TREC Web track queries are considered to be similar with regard to the outcome real The employed experiments are EV(at), EV(at), and ), avg(dan. Ed(b), have been be limited information to which shown effective when relevance std(dom)i is available (Section 7.3.3). values of the experiment. The remainder of this section introduces the experimental setting for the evaluat ion of the query sampling techniques, and presents the evaluation results. 7.4.3.1 Experimental setting for evaluation of query sampling
This section describes the employed experimental setting for the evaluation of the query sampling techniques. For each of the sampling techniques STS, MTS, and ATS, 500 queries are generated. Regarding the used thresholds rlo and rhi to randomly select single terms for the STS 20000, 20,200,2000 for and and MTS the are rl0 techniques, values employed and Because 20000. < 200,2000 for r10 rhi, only some and the employed values rhi are [200,2000], [20,20000], [20,2000], [20,200], combinations of threshold values are used: for the thresholds r10 and rhi are different The [2000,20000]. [200,20000] and values (low frequency high terms with the selecting in test to randomly of effect order selected for ATS fj0 technique the The threshold (and higher a frequency lower rank). or rank), in is to test the This effect, order made 50, choice 5,20, is set equal to respectively. and For from the example, wheri process. texts sampling infrequent discarding the anchor of least in lie 5 times t that texts at appear the ATS anchor 5, only then considers afro = collection to generate queries.
204
7.4 Ad-hoc
decision mechanism and query sampling
the intermediate query and the final sampled query are generated by using the field-based weighting model PL2F, where the hyperparameters cb, ca, ct are set equal to 1.0 and the weights Wb,wa, wt are set equal to 1.0, 0.0, and 1.0, respectively. of terms is estimated using the term weighting model Bol from the Divergence From Randomness framework (Amati, 2003). The terms from the top x retrieved documents for a query are ranked according to the weight: WM =tfx"loge
I+ Pn
For the MTS technique in particular,
The informativeness
Pn
+1og2(1+Pn)
(7.1)
f, is the frequency of the term t in the top x documents and P" t F is where =N. the frequency of a term in the collection, and N is the number of documents in the The parameter x is set equal to 3. The number of extracted terms for the final sampled query depends on its length. collection. The MTS technique has two parameters p and or related to the average and the deviation standard of the length of the generated queries. These parameters are set in order to match the average and the standard deviation of the length of the TREC 2003 (Craswell & Craswell Hawking, 2004; 2003). Table Web 7.1 2004 track et al., and queries displays the average and the standard deviation of the query length distribution of the TREC 2003 and 2004 Web track queries, after removing stop words. From the table, it (row home 1) distillation the be than topic that the page are shorter queries can seen (rows The length 2 3, finding the and respectively). queries named page or distribution of all the queries from the TREC 2003 and 2004 Web tracks is close to that finding, (rows 4, finding 5 This finding the home and respectively). named page and page of the finding finding home queries, is partly because there are more and named page page MTS the technique, For two the in of distillation evaluation than topic mq2003. queries first The tested. one corresponds to different settings of the parameters and or are from (p 2.1 distillation tasks and a=0.78 = the query length distribution of the topic length distribut ion the to query The 7.4). Table corresponds in 1 setting second row from in 1.31 (p 5 3.2 Web tasks row track and or 2004 = 2003 = TREC the and of all Table 7.4). in to described order evaluate setting the experimental The next section employs techniques. the three proposed query sampling
205
7.4 Ad-hoc
decision
mechanism
and query
sampling
Row 1 2
Task td2003 & td2004 hp2003 & hp2004 hp2003, np2004, hp2003 & np2004 mq2003 & mq2004
3
4 5
Query length Average Standard Dev. 2.1 0.78 3.5 1.23
np2003 & np2004
3.6
3.5 3.2
1.30
1.26 1.31
Table 7.4: Average and standard deviation for the length of the TREC 2003 and 2004 Web track queries. 7.4.3.2
This
Evaluation
evaluates
results
for query
sampling
query
techniques
techniques by measuring in the setting the similarity
section
the three proposed The evaluation
sampling
described
in Section
7.4.3.1.
is performed
between the outcome
values of three experiments queries.
for the sampled queries and the TREC before, the employed experiments well
2003 and 2004 Web track EV(at)i 4(at), in a setting
As mentioned
EV(b), have been selected because they perform and avg(dom), std(dom), limited (Section 7.3.3). with relevance information corresponds of outcome to the symmetric Jensen-Shannon divergence
The similarity the distributions
between
values of the experiments
for the sampled and the TREC
queries. The range of values of the symmetric Jensen-Shannon divergence is [0,2]. Byycause the divergence measures dissimilarity, the higher values of the symmetric Jensendifferences between the sampled Shannon divergence suggest that there are important
TREC the and real queries, with respect to the outcomes of the employed experiment E. When the divergence approaches zero, then the distributions of the experiment are very similar. Jensen-Shannon divergence between the sets of of the outcome values
Table 7.5 displays the symmetric
from the techniques, three the the and queries sampling sampled queries with any of TREC 2003 and 2004 Web track mixed tasks. The bold values indicate the set of
query samples that results in the lowest divergence for each experiment length query longer or E and each distributions For the MTS technique, to test whether distributions two different shorter
sampling
technique.
are evaluated effective. queries
in order
generating
queries is more of the
These length from
length distribution the to match are chosen
the tasks td2003
and td2004
(row 1 of Table 7.4). as well as that of the
from Table (row 7.4). 5 from tasks the mq2004 and mq2003 queries
206
7.4 Ad-hoc
decision
mechanism
and query
sampling
The sampled queries with STS are more similar to the real TREC queries, when the ranks of the sampled terms are within the range [20,20000] (rows 3, and 5-6 in Table 7.5). This is because, the experiment EV(,, is expected to be sensitive to t he t) frequency of the sampled terms. When the ranks of the sampled terms are in the range [200,20000], the lowest divergence is obtained between the sampled queries and the TREC queries for the outcome values of the experiment ey(at) (row 5). Regarding the F-V(at), Ev(b), lowest divergence value is obtained the experiments and avg(dom), std(dom), when the ranks of the sampled terms are in the range [20,20000] (row 3). When the ranks of the sampled terms are very low (row 1), the resulting divergence value is high. This suggests that the sampled terms are very frequent and the outcome of the experiments is very different from that of the TREC queries.
Symmetric
Row
J-S divergence
between query samples and mg2003 & mq2004

EY(at), av (dom EV b , std dom
Ev(at
1 2 3 4 5 6 7 8
9
rio 20 20 20 200 200 2000 rjo 20 20

20
rhi 200 2000 20000 2000 20000 20000 rhi 200 2000
20000
STS 1.9813 1.8176 0.4912 1.7964 0.3488 0.5842 0.5014 0.1299

0.0815
1.0894 0.5347 0.0836 0.4662 0.0869 0.2003 MTS =2.1 0.0705 0.0393
0.0471
1.9753 1.2302 0.0431 1.2671 0.0574 0.1158 or = 0.78 0.3492 0.1804

0.0604
10 11 12
13
200 200 2000

rto 20
2000 20000 20000

rhi 200
0.1186 0.1425 0.0957

0.2482
0.0309 0.0577 0.0793

MTS a=3.2 0.1515 0=1.31
0.1921 0.1114 0.0619

0.1309
14 15 16 17 18 19 20 21
20 20 200 200 2000
2000 20000 2000 20000 20000 aft,, 5 20 50
0.5850 0.9428 0.6409 0.9452 0.9287 0.0105 0.3926 0.6961
0.3403 0.5085 0.3054 0.5610 0.4902 ATS 0.1747 0.0329 0.0467
0.0490 0.1052 0.0070 0.0943 0.0887 0.0477 0.2842 0.4147
distribution between the of Jensen-Shannon Symmetric Table 7.5: ATS MTS, STS, and aiici with for the queries generated values outcome experiment The (mq2003 experiments mq2004). and Web track 2004 queries the TREC 2003 and deviation lie t The of standard and 4(b),, mean 4(at), EV(at), and td(dom) " are avg(dom), by denoted MTS t and Q. in distribution length are query (J-S) divergence
207
7.4 Ad-hoc
decision
mechanism
and query
sampling
Table 7.5 displays the divergence between the sampled queries generated with NITS and the real TREC queries in rows 7-12 for the short queries ( = 2.1 and a=0.78). in for 13-18 the longer queries (M = 3.2 and or = 1.31), respectively. Regarding and rows the shorter queries, sampling a random term with rank within [20,20000] provides the lowest divergence for the experiments CV(at) and (row in 9 Table 7.5). The y(b),std(dom) lowest divergence for the experiment 6V(at), the computes which average size avg(dom) i of domain aggregates, is obtained when the random term has rank within [200,2000] (row 10). It should be noted that the divergence for the experiment values obtained EV(at), lower than 0.08, regardless of the range of ranks. This suggests that are avg(dom) this experiment is robust and the distribution of its outcome values is not affected by the rank of the random term, which is used during the first step of the MTS technique. Regarding the longer queries generated with MTS (rows 13-18 from Table 7.5), the divergence values are lower when the randomly sampled terms have low ranks, or in other words, when the randomly sampled terms have high frequency. For example, t he EV(at), divergence between the outcome values of the experiments EV(o, for and t) avg(dom) the sampled and the TREC queries is the lowest when the rank of the randomly sampled term is within the range [20,200] (row 13 in Table 7.5). The obtained divergence value deviation for the experiment F-V(b), the size of the of standard estimates which std(dom) i [200,2000] (0.0070 from for lowest is domain the the the n)w range of ranks aggregates, 16 in Table 7.5). ATS disthe technique for the The evaluation results are generated queries with EV(b), EV(at) Regarding 7.5. the in Table and in 19-21 experiments rows played std(dor, t), flo less than texts, the is appear a which =5 anchor the lowest divergence obtained when divergence increases (row The 19). during the sampling process as times are discarded (rows Sampling 20-21). fl,, increases queries accordingly the value of the threshold a EV(at), for the better the when experiment ATS technique the performs avg(dorn) with 7.5). Table (row in 20 flo 20 threshold a = 7.4.3.3
Overall, distribution queries. process.
Discussion
query sampling is an effective method for generating queries with a similar
of outcome Query sampling
for an experiment values
to the one obtained
from real TREC generation
be seen as an approximation can
of the query
208
7.4 Ad-hoc
decision
mechanism
and query
sampling
The first query sampling technique, of real queries through are generated with
STS, provides only a rough approximation sampling of single terms. Queries with more than one tern is
either MTS, which is based on extracting the most informative terms from a set of documents, or ATS, which samples the anchor text of hyperlinks between Web documents. The former uses only statistical evidence, while the latter takes advantage of the similarity McCurley, 2003b). between real queries and the anchor text (Eiron &
The results from Table 7.5 show that the MTS and ATS techniques are more effective than STS in generating queries with a distribution of outcome values for the tested experiments similar to that of the TREC Web track queries (rows 19,10, and 16 in Table 7.5 for the experiments EV(at), EV(at), EV(b), and respectively). avg(dom), std(dom),
Regarding periment, the anchor pling the MTS technique, the results with suggest that when employing an exof which text considers and title documents all the query shorter terms in a combination
fields, sampling
is queries more effective
than saiii-
longer queries (rows 7-12 vs. rows 13-18 in Table 7.5). However, the experiment
F-V(b), in body, benefits documents the terms their all query considers with which std(dom), from sampling longer queries. This indicates that the type of the experiment should be considered queries. The most effective number 19 from documents of Table 7.5). sampling technique for the experiment EV(at), which counts the when setting the characteristics distribution length the of for the sampled
(row ATS is in text title, their terms the or anchor query with all This fact confirms the correspondence between queries and the (2003b). An advantage
documents Web text of anchor
& McCurley Eiron by suggested
is MTS ATS that sampling over of length distribution.
the anchor text provides queries with a representative hand, the MTS technique requires to specify t lie
On the other
distribution
length. the query of
Overall, query sampling can be employed to automatically
generate queries, in order
for TRE(' the real to approximate the distribution of outcome values of an experiment in is the employed context Web track queries. In the next section, query sampling decision boundary. the to in of a value set decision order mechanism, of an ad-hoc data. training decision the on dependence mechanism of the This will allow to reduce IR in Web a setting, where relevance facilitates selective the of it application Therefore, information is hardly used to set the decision mechanism.
209
7.4 Ad-hoc
decision mechanism
and query sampling
7.4.4
Evaluation
of ad-hoc decision mechanism
This section investigates the effectiveness of an ad-hoc decision mechanism for selective Web IR with limited relevance information, as described in Section 7.4.1. The ad-hoc decision mechanism employs query sampling in order to set a decision boundary.
The ad-hoc decision mechanism employs two retrieval approaches al and a2, aiid
it has only one decision boundary Bnd. The decision boundary Bnd is set so that the outcome o of an experiment E is lower than Bnd with a given probability This probability is obtained from the distribution P(o < Bnd).
of the outcome o of the experiment
E using the query sampling techniques described in Section 7.4.2. When the outcome of an experiment E for a query is lower than the decision boundary Bnd, then the is is the applied. approach al applied, otherwise, approach a2 retrieval retrieval The remainder of this section describes the experimental setting for the evaluation decision the of ad-hoc mechanism, presents the evaluation results, and closes with a discussion of the findings. 7.4.4.1 Experimental setting for the ad-hoc decision mechanism
This section briefly describes the used experimental decision mechanism. ad-hoc
for the evaluation of the setting
Bayesian E that the The employed experiments performed well when a ones are information: limited in been has decision mechanism relevance a setting with employed 195). (Section 7.3 Ey(b), on page F-V(at) EV(at), and std(dorn) avg(dom), 7 Bnd boundary decision the E, the of ad-hoc For each of the employed experiments decision In 0.5. the Bnd) P(o other words, < decision mechanism is set so that = decision boundary E less is the than boundary Bnd is set so that the outcome o of distribution to the is Bnd of set with respect for 50% of the queries. The value of E for technique each experiment sampling query the effective most o obtained with (Section 7.4.3.3 and Table 7.5). tasks, and mq2003 mg2004. two namely mixed is The evaluation performed with the two approaches retrieval decision employs mechanism For each task, the ad-hoc described in in as effectiveness, improvements for retrieval highest the potential with fieldthe the BM25F of combination and model Table 7.1: the field-based weighting for (DLHFP) mg2003 with evaluating PageRank DLHF with based weighting model
210
7.4 Ad-hoc
decision
mechanism
and query
sampling
(row 11 in Table 7.1); the field-based weighting model PB2F and DLHFP for evaluating (row 12 in Table 7.1). with mq2004 of the field-based weighting model DLHF with PageRank is apfor the queries which result in an experiment plied outcome o> Bnd, in order to favour the broader queries, which are likely to retrieve many documents with all more the query terms, or many aggregates of documents. When the decision mechanism is used for the mixed task mq2003, if the outcome o of an experiment F_is lower than the decision boundary Bnd, then BM25F is applied, otherwise DLHFP is Simapplied. ilarly, when the decision mechanism is evaluated for the mixed task mq2004, if the outcome o of an experiment applied, otherwise DLHFP 7.4.4.2 Evaluation F_is lower than the decision boundary Bnd, then PB2F is is applied. for the ad-hoc decision mechanism The combination
results
Table 7.6 displays the evaluation results for the ad-hoc decision mechanism that employs (rows 3-4), and 4(b), (rows 5-6). the experiments F-V(at)(rows 1-2), F-V(at), avg(dom) std(dom) The rows preceding the evaluation results in the table describe the setting of the decision mechanism, that is the query sampling technique used to obtain the distribution of for each experiment, outcome values explained in Section 7.4.4.1. (MAP) average precision decision boundary Bnd, the the as and value of
The column `Baseline' in the table displays the mean retrieval approach, and the
individual the of most effective
Bayesian decision by MAP displays the `Bayesian' the mechanism, obtained column in information the limited is same setting, as presented relevance applied with which in Table 7.2. in is decision the indicate the effective considered that The results mechanism ad-hoc (rows 1-2), EV(a, the obtained mean average When the experiment employing t) setting. is 0.5903. decision by the for which (MAP) mechanism task the mq2003 precision (0.5533). the 6.69% approach retrieval effective most improvement over of represents an Bayesian better the than corresponding The ad-hoc decision mechanism also performs in information the limited same is relevance with decision mechanism, which applied ). 'Bayesian', 1 -MAP' respectivel), and from 0.5775 columns (0.5903 and row vs. setting the task to mq2004, and applied when well decision performs The ad-hoc mechanism (row 2). For approach retrieval the effective improvement most in over +5.65% results the effective most retrieval decision applies the mechanism tasks, ad-hoc both tested
211
7.4 Ad-hoc
decision mechanism
and query samnline
significant number of queries, as indicated by t. In particular for the task mq2003, the improvement in MAP is statistically significant according to Wilcoxon's signed rank, as denoted by *.
Row Retrieval approaches Baseline Bayesian MAP +/ATS with a fl,, = 5, P(o < Bnd) = 0.5, and Bnd = 11.9956 1 mq2003 BM25F DLHFP 0.5533 V(at) 0.5775 0.5903 +6.69t' 2 mq2004 PB2F DLHFP 0.4145 0.4381 Y(at) 0.4391 +5.651 MTS with rjo = 200, rhi = 2000, p=2.1, a= 0.78, P(o < Bnd) = 0.5, and Bnd = 1.5129 3 mq2003 BM25F DLHFP 0.5533 0.5626 EV(at),avg(dom) 0.5728 +3.52 ' 4 PB2F DLHFP 0.4145 mq2004 0.4452 Ed(at), dom 0.4431 +6.62 a MTS with rio = 200, rhi = 2000, p= 3.2, a= 1.31, P(o < Bnd) = 0.5, and Bnd = 24.6724 5 mq2003 BM25F DLHFP 0.5533 0.5777 EV(b), 0.5799 8td(dom) 6 mq2004 PB2F DLHFP 0.4145 Ed a std dom 0.4212 0.4349 +4.641 , Task
approach for a statistically
Table 7.6: Evaluation of the ad-hoc decision mechanism with the experiments F_V(a, ), 4(at), F-V(b), The table displays the evaluation task ('Task'), the and avg(dom) , std(dorn) . ('Retrieval employed retrieval approaches approaches'), the mean average precision of the most effective retrieval approach ('Baseline'), the mean average precision obtained from a Bayesian decision mechanism, applied with limited relevance information for the ('Bayesian'), the employed experiment the mean average precision same setting ('MAP'), decision the of ad-hoc mechanism and the relative improvement over the baseline ('+/- %'). The symbol t denotes that the decision mechanism applies the for a statistically significant number of queries, most appropriate retrieval approach denotes difference between * The that the the test. to the symbol sign according MAP of the decision mechanism and that of the most effective retrieval approach is The Wilcoxon's test. to signed rank rows preceeding statistically significant, according the results describe the setting of the ad-hoc decision mechanism. (rows V(at), the decision The ad-hoc experiment mechanism, which employs avg(dom) for both baseline tested tasks the improvements in mq2003 and 3-4), also results over In it 0.4145, 0.4431 0.5533 (0.5728 addition, applie's respectively). vs. and vs. mq2004 for a statistically significant number of queries the most effective retrieval approach
from the mixed task mq2003, it outperforms and the corresponding Bayesian decision decision When 3). the the from employs mechanism 0.5626 (0.5728 row vs. mechanism higher that the is than MAP of corresponding the y(b), obtained experiment std(dom), 0.4349 0.4212 0.5777 (0.5799 and vs. vs. decision mechanism for both tested tasks from rows 5-6, respectively). effective retrieval approach The decision mechanism for a statistically also selectively number applies the mot of queries for both
significant
is t. there a statistically significant In by the mq2003, of indicated case tested tasks, as denoted by *. test, Wilcoxon's as rank signed to MAP, in according improvement
212
7.4 Ad-hoc
decision mechanism
and query sampling
7.4.4.3 Overall,
Discussion
it has been shown that query sampling can be effectively used to set the decision boundary of the proposed ad-hoc decision mechanism in the tested setting (Table 7.6). The ad-hoc decision mechanism performs as well as the corresponding Bayesian decision mechanism with limited relevance information. This section further
discusses two issues related to the ad-hoc decision mechanism

Selecting the retrieval approach to apply when o< Bnd In the case of the
task mq2004, the evaluated decision mechanism in Section 7.4.4.2 applies the fieldbased weighting model PB2F when the outcome o of an experiment is lower than the decision boundary weighting model DLHF Bnd, otherwise, it applies the combination (DLHFP). PageRank with field-based the of This setting has been based on expecting that employing PageRank performs better for the most broad queries, which However, if some training data
result in higher outcome values for the experiments.
is available, then they can be used to suggest which retrieval approach to apply when o< Bnd. For example, the training particular data can be used to estimate the likelihood This is approach effective. a retrieval outcome values when
of obtaining likelihood
is he indicate be to to a approach expected retrieval whether employed can hence, high for low to the the experiment, and of an select outcome values or effective is the the to of employed experiment outcome o apply when which retrieval approach lower than the decision boundary Bnd.
Setting the decision boundary In other Bnd boundary Bnd In the described experiments in Section 7.4.4.2. P(o < E
the decision Bnd) 0.5. = than
decision the mechanism ad-hoc of words, the probability that
has been set so that
the outcome
of an experiment probability
is lower that
is 0.5.
If further is effective,
evidence exist to suggest the prior then the probability P(o < Bnd)
a retrieval
approach
be could set
is decision to the task the applied mechanism For ad-hoc when example, accordingly. for DLHFP 122 3; )() the BM25F out of outperforms the approach retrieval mq2003, both for 115 BM25F while retrieval approaches queries, DLHFP outperforms queries. BM25F the that In this prior probability for case, 113 MAP queries. in the same result Bnd) P(o is 0.5 for < that appropriate = 0.515, suggests is which is effective = ff 12222 +1
the tested setting.
213
7.4 Ad-hoc
decision mechanism
and query samDline
7.4.5
Conclusions
an ad-hoc decision mechanism and novel techniques for samples of queries. The ad-hoc decision mechanism aims
This section has introduced automatically generating
to reduce the dependence of applying selective Web IR on training data, by setting its decision boundary with respect to the distribution of the outcome values of an experiment (Section 7.4.1). This distribution of outcome values is obtained from a sample of automatically generated queries (Section 7.4.2). Three techniques have been proposed for automatically generating queries. In the first one, STS, a generated query corresponds to a randomly sampled term from the vocabulary of the collection (Section 7.4.2.1). In the second technique, MTS, a generated query corresponds to a number of the most informative terms from a set of documeril s, length the and of the generated query follows a Gaussian distribution (Section 7.4.2.2). The third one, ATS, is a novel technique, where a generated query corresponds to a (Section from 7.4.2.3). The evaluation text the randomly sampled anchor collection of the three proposed techniques has shown that generating queries with either MTS ATS STS in is than or approximating more effective values of an experiment in approximating the distribution the distribution of the outcome (Section 7.4.3). Moreover, ATS is more effective than MTS EV(at). On the the experiment of of outcome values
distribution ATS in the MTS is hand, than the other approximating more effective Sampling Ey(b), Ev(at), the and experiments of the outcome values of std(dom). avg(dom) distribution for the is MTS of outcome approximating effective more short queries with for MTS is long EV(at), effective more with queries sampling while of values avg(dom), (Section 7.4.3.3). EV(b), distribution the of values of outcome approximating std(dom) The ad-hoc decision mechanism has employed the most effective query sampling (Section 7.4.4.1). for boundary decision a given experiment technique in order to set its the indicated have that 7.4.4.2 Section in ad-hoc mechanism can The evaluation results iii been to have E, perform well shown which be effectively used with the experiments information. limited decision relevance Bayesian with mechanism the context of a in decision statistically significant mechanism results the ad-hoc In particular, for approach a retrieval it the MAP, appropriate in most applies improvements and statistically significant task the in the mq2003. of case number of queries
214
7.5 Summary
Overall, this section has proposed an alternative way to using the Bayesian decision mechanism, in order to apply selective Web IR and reduce the dependence on available information. relevance
7.5
Summary
This chapter has investigated the application of selective Web IR in a setting, where limited relevance information exists in order to set a decision mechanism. The concept of limited relevance information corresponds to processing queries from mixed tasks, as well as to training and evaluating the Bayesian decision mechanism with different (Section 7.2). This definition of limited relevance information provides sets of queries a realistic setting for evaluating the effectiveness of selective Web IR. The evaluation of the Bayesian decision mechanism with limited relevance inform<<tion has shown that selective Web IR can be effectively applied (Section 7.3). Both in improvements For
the score-independent in retrieval example, query
and the score-dependent (Table
experiments
resulted
effectiveness the experiment
7.2 on page 197, and Table 7.3 on page 198).
terms
EV(p, which counts the number of documents with all the t), (rows in the anchor text, has been shown to be particularly effective 7.2). Indeed, retrieval the decision approach mechanism that employs EV(at) applies the
1-2 in Table
most appropriate particular significant
for a statistically
significant
In of queries. number
for the task mq2003, the Bayesian decision mechanism results in statistically improvements in MAP. decision mechanism ad-hoc on training has further reduced the
The introduction dependence
of a simple
Bayesian the of
decision mechanism
deciThe ad-hoc queries. of outcome
sion mechanism
boundary decision its sets This generate
distribution the to with respect is obtained with
for a given experiment. values sampling techniques, which
distribution either
three novel query 7.4.2.1), or
single-term
queries
(Section The latter
queries with distribution Web track
term than one more of outcome queries
(Sections
7.4.2.2 and 7.4.2.3).
generate a
those to corresponding closer values, which are 7.4.3.3). The evaluation effectively
to real TREC mecha-
(Section
decision the ad-hoc of
has that shown nism that perform
it can be applied
in conjunction
with the experiment" with limited rele-
in the context well (Section
Bayesian of a Indeed,
decision mechanism
information vance
7.4.4.2).
the ad-hoc decision mechanism
can lead
215
7.5 Summary
to statistically
improvements in in significant retrieval effectiveness, as well as applying for approach a statistically retrieval significant number of queries
(Table 7.6 on page 212). Therefore, the ad-hoc decision mechanism can be useful in the context of an operational Web retrieval setting. Overall, this chapter has shown that selective Web IR can be applied when limited information relevance in Chapter 6. IR in Web This the a evaluation of selective complements exists. discussed is information has been it that available, as assumed relevance setting where
216
Chapter
Conclusions
8.1 Contributions
Future and
and conclusions
Work
This thesis has investigated selective Web information of which appropriate discusses the contributions 8.1.1 Contributions following: the thesis this are of
retrieval, a technique by means
retrieval approaches are applied on a per-query basis. This section and conclusions of this thesis.
The main contributions
(IR) has been Web information for framework retrieval proselective "A general defined. E been has posed, and a range of experiments in been the have thoroughly The evaluated context of a experiments proposed " Web tasks decision of specific search and aL Bayesian a range with mechanism has been in The tWeb test performed TREC a evaluation 5(, collection. standard in is is to information as exist, as well a more realist assumed ting, where relevance and operational information limited relevance setting, where only is available.
including t. samples, of a query novel for c(11the generation Techniques automatic " have documents. been introduced text the of aiid anchor based sampling on nique boundary decision the of a proposed ad-hoc in setting of the context evaluated decision mechanism. for Web IR have been introduced and different approaches retrieval .A range of different including types tasks, for of search of number a evaluated thoroughly
2f1
8.1 Contributions
and conclusions
both ad-hoc and Web specific search tasks from standard TREC Web test collections. The introduced retrieval approaches include an extension of the Divergence From Randomness framework to perform per-field normalisation. as well as the
Absorbing Model, a novel hyperlink structure analysis algorithm. 8.1.2 Conclusions
This section discusses the achievements and conclusions of this work. Effectiveness of selective Web information retrieval The work in this thesis
by the wealth of different retrieval approaches that can be used for Web information retrieval (IR), as well as the fact that there are different types of search tasks performed by users. Most of the related works in the literature have described retrieval approaches that are applied for all the queries uniformly. Other have related works considered the prediction of the performance of a retrieval approach, or the classification of the query type. The aim of this thesis has been to introduce a general framework for selective NA b (, IR, in order to apply an effective retrieval approach on a per-query basis. Selective W(,b IR is different from the related work, in the sense that: different retrieval approaches the be task; type the to of and selection of the most same can applied queries of least two the effectiveness of at retrieval retrieval effective retrieval approach considers approaches. The obtained experimental significant Web IR that selective suggest results can improvements in retrieval effectiveness, and that the significant number
has been motivated
lead to statistically
most effective retrieval of queries. Potential information for
for be a statistically applied approach can
improvements retrieval
in retrieval
effectiveness
from
selective
Web
Chapter 4 has established the potential
for improvements
The investigating by approaches. several retrieval proposed from selective Web IR, frefield-based term from models, which weighting perform range approaches retrieval field (Section independently document 4.4), each of weighting and quency normalisation field-based with models query-independent eviclcnce weighting the to combinations of Absorbing Model, PageRank, the documents. and a novel al Web p URL from the of (Section 4.5). The docunwnts hyperlink employed structure the of for the analysis rithm
218
8.1 Contributions
and conclusions
fields are the body, the anchor text of the incoming hyperlinks, and the title of `Web documents. First the proposed retrieval approaches have been compared on the basis of their optimal performance, by setting their hyper-parameters in order to optimise the retrieval effectiveness for each tested task independently (Section 4.3.2). A more realistic setting of the hyper-parameters the optimisation with mixed types of tasks, and terminating (Section 4.6). This setting has been employed in order process early involved training
for improvements when the most effective retrieval approach is applied on a per-query basis (Section 4.7). The obtained results have shown that to establish the potential selective Web IR has the potential for statistically effectiveness. Decision theoretical framework for selective Web information Chapsignificant improvements in retrieval
retrieval
ter 5 has introduced a new framework for selective Web IR, based on statistical decision theory. One of the main concepts of this framework is the decision mechanism. which E in order to guess the state of nature, or in other words, in order to select the most effective retrieval approach to apply on a per-query basis (Secemploys an experiment tion 5.2). The consequences of applying a particular retrieval approach are modelled by the loss function, which corresponds to a preference relationship among the retrieval approaches with respect to their retrieval effectiveness. A range of score-independent and score-dependent experiments E has been defined. The score-independent experiments consider the occurrence of query terms in docThe score-independent document-level experiments count the in term least a particular combination documents query one at or all, with number of The (Section 5.3.1). aggregate-level experdocument score-independent fields of a of domain, from directory, the documents same or iments consider aggregates of related (Section 5.3.2). Three distribution feathe size features aggregate of and estimate the the distribution average size; standar(l considered: are the tures of aggregate size (Section 5.4). uments deviation large aggregates. the of number the size; and of
experiments E estimate the usefulness of the hyperlink patterns of hyperlinks structhere are non-random among the st ructuire
The score-dependent ture, an indication retrieved documents
of whether
hyperlink The 5.4). (Section the of usefulness for a query
219
8.1 Contributions
and conclusions
is defined in terms of the information tions. The first distribution weighting
theoretic divergence between two score distribuS, corresponds to the scores assigned to documents by a
is formed in order to favour the documents that point to other highly scored documents. The score of each document corresponds to either the sum of the original score of the document and the scores of all the documents that it points to (U,, ), or only the sum of the scores of all the documents that document (U,, ). The usefulness of the hyperlink to a points structure is defined as t lie symmetric Jensen- Shannon divergence L (Sn, U,, ) or L (S, Ute) . Selective Web IR has been primarily investigated in the context of a Bayesian decision mechanism, which has been introduced in Section 5.5. The Bayesian decision mechanism is trained with a set of queries. According to the outcome of an experiment for a given query, the Bayesian decision mechanism selects the retrieval approach, which results in the lowest expected loss. Overall, the proposed framework for selective Web IR is general in the sense that it does not depend on any particular retrieval approach, and it can be applied to select one out of any number of retrieval approaches. Evaluation tion of Selective Web information retrieval with relevance informa-
model. The second distribution
in Chapter 6 has been performed with different types of Web search tasks, including the topic distillation, home page finding, and named page finding tasks from the TREC 2003 and 2004 Web tracks (Craswell & Hawking, 2004; Craswell has 2003). The been performed by trainthe et al., evaluation of proposed experiments ing and testing the Bayesian decision mechanism with the same task, in order to foci. " the analysis on the effectiveness of the experiments, and reduce the effect of using different training decision Bayesian The tasks the testing mechanism. evaluation on and decision Bayesian the have the that can selectively apply most mechanism shown results for basis a statistically significant number of effective retrieval approach on a per-query for both the improve the score-dependent effectiveness achieved retrieval queries, and (Section 6.3.3) and the score-independent experiments (Section 6.4.6). The score-independent document-level experiments, as well as the score-independent deviation the the aggregate of standard estimate which experiments, aggregate-level for tasks boundaries decision the tested low in distribution, all number of a result size fc>r identifying in the (Section 6.3.3). This suggests that they are more effective queries document-level The is alo experiments appropriate. approach more retrieval a which
The evaluation
220
8.1 Contributions
and conclusions
perform well when the considered documents contain all the query terms in a particular field, or a combination of fields. This is explained by the fact that the outcome of the experiment is computed from a cohesive set of documents, which are likely to be about the query (Section 6.3.1.3). The aggregate-level experiments perform well when the considered documents contain the query terms in their body, because the distribution of aggregate sizes is generated from a higher number of documents. compared to when the experiments consider only the documents with query terms in their anchor text or title fields (Section 6.3.2.4). The domain aggregates are also more effective than the directory aggregates, because they provide a better indication that there are group', of documents (Section 6.3.2.4). As discussed in Section 5.3.2, this related about a query may depend on the characteristics of the Web sites that appear in the collection. The score-dependent experiments that estimate the usefulness of the hyperlink ) L(S7zi U,, have been shown to be robust in identifying at least one decistructure (Section boundary in 6.4.2). The experiments that compute the tested sion all cases the usefulness of the hyperlink structure L(S,,,, U7) have been more effective when conin documents the terms all a particular query sidering with (Section fields 6.4.5). of Evaluation information Selective of Web information retrieval with limited relevance field, or in a combination
Selective Web IR has been also evaluated in a setting, where it is limi The does information t limited of ed concept exist. that relevance only assumed for different training the tasks to information mixed employing corresponds relevance (Section 7.2). decision the mechanism and the evaluation of has in in this decision Bayesian resulted setting The application of the mechanism document The (Section 7.3). score-independent improvements in retrieval effectiveness terms documents the query all the with of level experiment EV(a, number counts which t), The for both tasks. tested mixed in their anchor text, has performed particularly well be trained decision with one Bayesian can the mechanism that obtained results suggest to previously other approaches retrieval appropriate apply selectively and queries, set of (Section 7.3.3). unseen queries Ad-hoc decision mechanism and query sampling techniques In order to re-
data. training an ad-hoc decision Bayesian on dependence the mechanism duce the of
221
8.2 Future
work
decision mechanism has been introduced nism sets its decision boundary experiment E. This distribution
in Section 7.4. The ad-hoc decision mechaaccording to the distribution of outcome values of an is obtained using three techniques for the automatic
queries by randomly
generation of queries.
The first vocabulary queries with technique, STS, generates single-term (Section 7.4.2.1). sampling the
of the collection more than (Section
The second technique, the most informative
MTS, generates terms from a
one term 7.4.2.2).
by extracting The third
documents set of the anchor text the observation McCurley, with
technique,
ATS, generates queries from This is based on k,
documents the of that the anchor
in the collection text
(Section 7.4.2.3).
of Web documents
(Eiron resembles queries
2003b).
The three proposed query sampling distribution the of
techniques have been evaluated values of an experiment
respect
to the similarity from the sampled
of outcome
E obtained queries.
queries and the real TREC
2003 and 2004 Web track techniques suggest tlimlt, better than
The evaluation
results of the proposed query sampling
MTS using
ATS to sample queries with or
more than one term performs
using STS to sample single-term
(Section 7.4.3.3). queries
The evaluation of the ad-hoc decision mechanism, which uses query sampling to set its decision boundary, has shown that it can be effectively applied and achieve similar, better or performance relevance information than that of the Bayesian decision mechanism, when limit cd (Section 7.4.4.2). exists
8.2
Future
work
from future for to, directions discusses related or stemming work This section several this thesis. than two retrieval approaches, more retrieval with Selective Web information IR Web in has been this thesis The E mainly of selective evaluation or experiments E to selectively apply one out of the experiments proposed focused on using one of this work may consider using combinations An of extension two retrieval approaches. to the appropriate select evidence to retrieval more in obtain order experiments of from both score-independent For evidence and using example, to apply. approach E may lead to improvements in the performance of i he experiments score-dependent
222
8.2 Future
work
decision mechanism, because the different experiments may capture diverse and different features of the set of retrieved documents.
more than two retrieval approaches to select from provides another direction for future work. One issue related to employing several retrieval approaches is the investigation of how the Bayesian decision mechanism can be effectively trained in order to apply an appropriate retrieval approach out of several ones. Another interesting direction for future work is the investigation of automatic techniques to select t he retrieval approaches that can be effectively applied for selective Web IR. Definition of the loss function and Garnes against nature This thesis has Considering
focused on the evaluation of selective Web IR from the perspective of retrieval effcctiveness. The loss associated with the application of a retrieval approach has bcveen defined in terms of the retrieval effectiveness, as described in Section 5.2. However, other factors can also be incorporated in defining the loss of a retrieval approach. The computational overhead and the efficiency of each retrieval approach can be considered in defining the loss function. For example, a retrieval approach, which is very be but has effective, appropriate also a significant computational overhead, may not for a Web search task, where fast response times are required from an IR system. Information from a user profile can also be used to define the loss associated with loss function For the application of the retrieval approaches. can bias the example, a decision mechanism towards retrieval approaches that are effective in finding eilt ry Such icular the to of a part user. preferences according in for this thesis the lead approaches proposed the to application of a technique can IR. IR in context, or adaptive performing formulation Web IR, different in of selective to interesting a It is also consider (1957) showed that a statistical decision & Raiffa Luce terms of games against nature. is the into nature, where reasoning against made game be a transformed problem can instead for the to decision of actions perform each mechanism, in terms of selecting a has different loss function, decision a mechanisms, which the Each of state of nature. Regarding Web IR. tasks. types different selective search of such a to may correspond by its a retrieval approach, the weighting of retrieval selection to be aid used can setting in the different tasks. effectiveness detailed information, points, or
223
8.2 Future
work
Updating
the decision
mechanism
with
relevance
feedback
The investigation
of selective Web IR has been focused on a TREC-like experimental setting, where all the queries are processed in a batch mode. An interesting direction of future work is related to employing selective Web IR in an interactive setting, where the decision mechanism is updated while processing queries. The updating of the decision mechanism can be performed with explicit, or implicit feedback from users. For example, clickthrough data can be used as an indication of the relevance of documents for a given query (Joachims In 2005). this way, the decision mechanism can be refined with more accurate et al., information, and it can adapt the loss function to the search behaviour of the users.
distribution of experiments with E The proposed query sampling techniques in order to set its is related This in the
Sampling
have been used in conjunction decision boundary. Another
an ad-hoc decision mechanism, of the query sampling of an experiment
application
techniques
to assigning probability obtained
a probability
to the outcome
E for a given query.
may be used in order to adjust from the experiment evidence outcome for an experiment
the belief of a decision mechanism
E. For example, the low probability E may provide a stronger indication and hence, about applying
of obtaining about the a particular
a particular structure retrieval
of the corresponding approach.
documents, set of
224
Appendix
Parameter evaluation
settings and of retrieval approaches
This appendix presents the parameter settings used for the evaluation of the retrieval approaches described in Chapter 4. In addition, it presents the precision at 10 retricvedi documents (MRR1), (P10), the mean reciprocal rank of the first retrieved relevant document and the number of retrieved relevant documents for the field-based weighting with the query-independent sources of evidence.
PB2 I(ne)C2 b 0.4424
0.4221
models, and their combination

Task tr2000
tr2001
PL2 c 11.9420
12.3985
BM25 ki 0.47
0.98
c c Full text 6.0645 52.9390

10.6837 11.4277
td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004
1.2712 0.4134 0.1536 0.3456 0.4973 0.2642 2.0354 0.9420 1.7393 99.3880 7.1904 3.1646 1.1635 1.3719
1.1485 0.2613 0.1417 0.3410 0.5064 0.2881 1.4713 1.0364 1.4253

Title
0.9024 0.1040 0.1059 0.5781 0.4320 0.2650 2.4378 0.9790 2.0450 142.2381 12.2877 4.6198 1.7706 4.1331
0.6788 0.8827 0.9524 0.8349 0.9233 0.8433 0.8072 0.6975 0.5548 0.0610 0.7527 0.4519 0.8069 0.5006
3.29 11.70 15.05 2.67 1.01 11.36 2.99 1.13 3.13 0.47 0.12 0.98 0.86 1.16
193.2049 6.1322 5.4616 1.2124 1.3898
225
continued
from previous
page
PL2 Task hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 c 2.0843 2.9547 10.7558 5.4129 8.0238 16.1657 23.4745 7.8859 10.6788 0.9540 10.7798 10.2345 3.0215 1.1069 9.0942 5.8931 4.4467 72.3341 1.0301 5.1115 1.1242 526.7705 805.5639 322.8153 81.1350 14.6921 13.4140 10.7286
PB2 c Title 2.4594 4.8307 1.2479 6.0953 9.1423 12.1322
I(ne)C2 c 4.2034 2.6783 4.0749 3.9854 9.1528 10.5189
BN125 b k1 0.5787 0.9991 0.5039 0.2840 0.6329 0.4488 0.4134 0.3931 0.2905 0.8588 0.2486 0.4189 0.5268 0.5410 0.9039 0.9131 0.6609 0.9859 0.8271 0.7752 0.3439 0.1076 0.0175 0.0093 0.0174 0.6172 0.6643 0.5743 0.92 0.34 2.71 0.87 0.37 0.73 0.42 0.17 0.71 0.77 0.64 0.42 0.80 1.11 0.09 0.09 0.52 0.12 0.26 2.40 2.09 1.38 0.15 0.99 0.35 0.50 0.36 0.59
Headings 21.0648 18.4154 5.7164 8.3176 10.0058 7.8139 0.9499 0.9435 13.1460 14.1948 4.3676 5.6029 3.6548 4.8310 1.0772 1.0981 12.4019 17.3411 5.7167 10.9804 3.4802 2.8013 Anchor text 0.6915 0.7031 2.7026 1.1675 1.2275 1.4353 0.9769 1.0912 83.1691 605.4442 930.4702 915.9571 452.3235 931.4237 94.2461 71.2218 11.4034 2.0410 35.6099 9.5270 209.9415 6.3414
headings, full title, from text, anchor the and for Parameter retrieval Table A. 1: values he I(ne)C2, PB2 PL2, t and and DFR the models weighting text of documents, with BM25. weighting model
226
Task
cb
Ca PL2F
Ct
Wa
tOt
tr2000 tr2001 td2002

td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
4.0855 10.7273 1.3457

0.3360 0.1293 0.7721 0.4573 0.4129 1.4210 1.0380 1.2840
6.7723 2.6661 1.0278

4.0263 324.2448 908.9378 306.6216 87.0932 27.9542 25.7228 15.9861
142.5767 5.8433 26.6474

4.2368 4.5425 5.2074 26.9822 100.4131 16.7559 10.9078 7.2411
0.3284 6.5407 2.3710

2.2810 0.3886 0.5038 2.2132 6.1152 1.4820 0.4325 3.5909
0.1014 0.7304 0.8517

2.1520 0.7046 15.2169 9.6546 7.9996 3.8587 3.0119 33.0245
PB2F
tr2000 tr2001
td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
16.0796 8.9545
0.9980 0.2794 0.1032 0.3838 0.5326 0.2873 1.1296 0.5159 1.0491 3.4647 8.9296 0.7487 0.0750 0.0556 0.5287 0.2289 0.1933 0.4348 0.5690 1.3135
5.9412 1.7225
3.5936 5.3132 49.5182 324.6800 43.1579 64.3357 5.1085 7.6558 5.9074
42.1003 2.3878
20.0154 10.4665 7.0637 3.4474 2.7982 42.3266 35.5495 46.3645 4.2198
0.0791 2.2264
0.6134 3.5192 2.3072 0.3079 5.6396 30.5284 0.6452 0.7124 3.8822 0.2040 5.8078 0.6465 1.0876 0.1441 0.4801 0.3949 0.3274 0.6905 0.9375 2.4891
0.0166 1.7090
2.7686 3.7206 6.9674 8.0844 31.8623 31.3110 3.2041 2.1435 15.7021 0.5960 5.2829 0.5532 0.6437 0.6937 6.9953 2.1620 0.6562 2.5353 6.7246 22.7825
I(ne)C2F 21.2702 9.1803 1.7729 1.6934 8.1612 1.0332 6.0006 1.2520 1.6926 13.7654 2.8953 291.4738 3.4725 936.6813 51.9629 971.5727 4.9082 14.4389 8.9510 10.8766 9.4366 28.2666
for fields the the weighting the of weights and Table A. 2: The values of the c parameters I(ne)C2. PB2F PL2F, and models
227
DLHF Task Wm wt
tr2000 tr2001
td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004
0.0058 0.0428
0.1757 6.3925 1.7329 2.1965 54.2242 6.4182 0.9546 1.1574 59.8354
0.2247 0.0057
0.7122 1.2955 0.2073 0.4854 9.4886 3.0871 0.7899 0.3274 2.8585
Table A. 3: The weights of the anchor text and title fields for the weighting model DLHF.
Task tr2000 tr2001
bb 0.2850 0.3605
ba 0.9984 0.8214
BM25F bt 0.1926 0.4723
k 0.52 0.60
w, 0.2648 1.8536
Wt 0.0416 0.2763
td2002 td2003 td2004

hp2001
0.6836 0.9198 0.9402

0.8474
0.8437 0.3766 0.0499

0.0079
0.5245 0.9910 0.4612

0.5912
3.89 21.00 36.69

2.83
1.7451 7.1014 3.2380

4.7198
2.7512 14.4419 15.8051

20.3408
hp2003 hp2004
np2002 np2003 np2004
0.9493 0.8808
0.8315 0.8831 0.6246
0.0185 0.0031
0.4384 0.6641 0.6236
0.7335 0.8621
0.4660 0.3600 0.6737
3.23 13.13
5.04 1.46 5.20
11.1898 28.1010 24.5700 38.9098

6.0283 4.7054 7.6192 4.5403 9.7318 19.2977
BM25F. for the model weighting the The A. 4: parameters of Table values
228
Task tr2000
PL2F 0.2620
PB2F 0.2540
P10 I(ne)C2F 0.2640
tr2001 td2002
td2003 td2004 hp2001 hp2003 hp2004
DLHF 0.2260
0.3620 0.2680
0.1320 0.1960 0.1207 0.0987 0.0853
BM25F 0.2740
0.3440 0.2700
0.1200 0.1627 0.1179 0.0933 0.0800
0.3800 0.2440
0.1320 0.1813 0.1248 0.0980 0.0893
0.3220 0.2280
0.1200 0.1893 0.1069 0.0920 0.0787
0.3760 0.2640
0.1500 0.2053 0.1234 0.1000 0.0893
np2002
np2003 np2004 Task tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004
0.0993
0.0953 0.0960 PL2FU 0.2640 0.3620 0.2680 0.1960 0.2413 0.1255 0.1073 0.0893
0.0973
0.0913 0.0960 PB2FU 0.2540 0.3440 0.2720 0.1620 0.2187 0.1241 0.1053 0.0880
0.0987
0.0947 0.0907 I(ne)C2FU 0.2640 0.3820 0.2440 0.1780 0.2733 0.1303 0.1100 0.0947
0.0853
0.0840 0.0800 DLHFU 0.2280 0.3200 0.2400 0.1520 0.2787 0.1145 0.0973 0.0853
0.0993
0.0940 0.0933 BM25FU 0.2760 0.3780 0.2640 0.2000 0.2613 0.1297 0.1113 0.0960
np2002
np2003
0.0993
0.0953
0.0980
0.0913
0.0987
0.0947
0.0853
0.0840
0.1000
0.0940
np2004
Task tr2000 tr2001 td2002 td2003 td2004 hp2001
0.0960
PL2FP 0.2640 0.3620 0.2680 0.1380 0.2000 0.1207
0.0960
PB2FP 0.2540 0.3440 0.2720 0.1200 0.1773 0.1172
0.0920
I(n, )C2FP 0.2640 0.3800 0.2440 0.1320 0.2200 0.1248
0.0800
DLHFP 0.2280 0.3300 0.2340 0.1340 0.2293 0.1069
0.0920
BM25FP 0.2740 0.3740 0.2640 0.1360 0.2120 0.1248
hp2003
hp2004
0.1033
0.0880
0.1000
0.0840
0.1047
0.0960
0.0967
0.0867
0.1040
0.0960
np2002 np2003 np2004

Task tr2000 tr2001 td2002 td2003 td2004 hp2001
0.0993 0.0980 0.0973

PL2FA 0.2620 0.3500 0.2680 0.1340 0.1987 0.1207
0.0973 0.0933 0.0973

PB2FA 0.2540 0.3420 0.2720 0.1200 0.1733 0.1193
0.0993 0.0967 0.0920

I(ne)C2FA 0.2640 0.3720 0.2420 0.1340 0.1987 0.1241
0.0887 0.0853 0.0800

DLHFA 0.2120 0.3200 0.2300 0.1220 0.1960 0.1021
0.1000 0.0960 0.0960

BM25FA 0.2700 0.3760 0.2620 0.1480 0.2080 0.1228
hp2003 hp2004 np2002 np2003 np2004
0.1033 0.0893 0.0987 0.0980 0.0960
0.1013 0.0953 0.0933 0.0840 0.0980 0.0973 0.0947 0.0953 0.0907 0.0960 continued on next page
0.0933 0.0800 0.0853 0.0820 0.0800
0.1033 0.0907 0.1000 0.0960 0.0947
229
Task
PL2FA
continued from previous page Plo PB2FA I(ne)C2FA DLHFA
BM25FA
Table A. 5: Precision at 10 retrieved documents (P10) for field retrieval and combination with query-independent evidence.
MRR1
Task tr2000 tr2001 td2002 td2003 td2004 PL2F 0.5524 0.7107 0.5711 0.3907 0.4511 PB2F 0.4800 0.6581 0.5413 0.3915 0.4184 I(ne)C2F 0.5130 0.6855 0.5479 0.3874 0.3858 DLHF 0.4658 0.5565 0.5098 0.3670 0.4132 BM25F 0.5136 0.6753 0.5423 0.4252 0.4420
hp2001
hp2003 hp2004 np2002 np2003 np2004 Task tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003 np2004 Task tr2000 tr2001 td2002 td2003 td2004
0.6797
0.7879 0.6711 0.7289 0.7669 0.7531 PL2FU 0.5523 0.7107 0.5709 0.4662 0.6156 0.8270 0.8273 0.7206 0.7289 0.7669 0.7561 PL2FP 0.5524 0.7107 0.5859 0.4240 0.4717
0.6626
0.7285 0.6033 0.6971 0.7200 0.7285 PB2FU 0.4800 0.6581 0.5412 0.4694 0.6021 0.7890 0.7774 0.6648 0.6971 0.7202 0.7455 PB2FP 0.4800 0.6581 0.5413 0.4033 0.4527
0.7138
0.7746 0.6666 0.7368 0.7083 0.7137 I(n,, )C2FU 0.5470 0.6736 0.5479 0.4434 0.5917 0.8363 0.8348 0.7298 0.7368 0.7083 0.7220 I(n, )C2FP 0.5127 0.6855 0.5693 0.4835 0.4667
0.5895
0.6771 0.5909 0.5893 0.5961 0.5453 DLHFU 0.4693 0.5596 0.5054 0.4461 0.5931 0.7522 0.7666 0.6574 0.5893 0.5984 0.5480 DLHFP 0.4662 0.5695 0.5071 0.3936 0.4612
0.7231
0.7940 0.6868 0.7333 0.7134 0.7245 BM25FU 0.5462 0.6755 0.5423 0.5306 0.6388 0.8415 0.8522 0.7311 0.7336 0.7134 0.7396 BM25FP 0.5136 0.6822 0.5425 0.4576 0.4839
hp2001 hp2003
hp2004
0.6777 0.7965
0.6943
0.6591 0.7535
0.6387
0.7176 0.8240
0.7812
0.5988 0.7758
0.6474
0.7146 0.8474
0.7671
np2002 np2003 np2004

Task
0.7324 0.8011 0.7697

PL2FA
0.6989 0.7447 0.7415

PB2FA
0.7492 0.7687 0.7372

I(n. )C2FA
0.5909 0.6306 0.5464

DLHFA
0.7439 0.7925 0.7479

BM25FA
tr2000
0.5424
0.4800
0.5122
0.4601
0.5121
230
continued from previous page
MRR1
Task tr2001 td2002 td2003 td2004 hp2001 hp2003 PL2FA 0.7200 0.5711 0.4092 0.4602 0.6763 0.7902 PB2FA 0.6775 0.5414 0.3955 0.4421 0.6592 0.7564 I(ne)C2FA 0.6800 0.5364 0.4176 0.4187 0.7191 0.7966 DLHFA 0.5547 0.4919 0.3677 0.4165 0.5851 0.7292 BM25FA 0.6753 0.5381 0.4340 0.4558 0.7195 0.8273
hp2004 np2002 np2003 np2004
0.6858 0.7321 0.7826 0.7707
0.6146 0.6999 0.7290 0.7439
0.7107 0.7439 0.7178 0.7302
0.6085 0.5894 0.6072 0.5453
0.7141 0.7370 0.7523 0.7574
Table A. 6: Mean reciprocal rank of the first retrieved relevant document (MRR1) for field retrieval and combination with query-independent evidence.
Task tr2000 Relevant docs. 2590 PL2F 1538 Retrieved relevant documents PB2F I(n,, )C2F DLHF 1412 1562 1458 BM25F 1641
tr2001
td2002 td2003 td2004 hp2001 hp2003 hp2004
3363
1574 516 1600 252 194 83
2466
1197 403 1133 246 191 81
2423
1137 383 979 241 189 80
2409
1160 403 1151 249 191 82
2308
1093 401 1094 238 182 82
2400
1183 398 1138 249 192 82
np2002
np2003
170
158
169
157
169
157
168
157
167
155
169
157
np2004
tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003 hp2004 np2002 np2003
80
2590 3363 1574 516 1600 252 194 83 170 158
80
PL2FU 1536 2466 1196 424 1208 243 191 81 169 157
80
PB2FU 1413 2423 1145 402 1100 242 190 81 169 157
80
I(ne)C2FU 1565 2409 1160 422 1214 249 194 82 168 157
77
DLHFU 1458 2312 1100 417 1224 245 188 82 167 155
80
BM25FU 1637 2399 1183 410 1182 249 194 82 169 157
np2004 tr2000 tr2001

td2002
80 2590 3363
1574
80
PL2FPR
80
PB2FPR
80
I(ne)C2FPR
77
DLHFPR
80
BM25FPR
1538 2466
1203
1412 2423
1141
1560 2409
1169
1451 2323
1113
1641 2399
1186
td2003
516
411
392
403
404
409
231
Task td2004 hp2001
hp2003
hp2004 np2002 np2003 np2004 tr2000 tr2001 td2002 td2003 td2004 hp2001 hp2003
continued from previous page Relevant docs. Retrieved relevant documents PL2FP PB2FP I(ne)C2FP DLHFP 1600 1145 1020 1193 1191 252 246 241 249 240
194
83 170 158 80
BM25FP 1173 249
191
82 169 157 80 PL2FA 1540 2465 1197 403 1133 246 192
190
82 169 157 80 PB2FA 1412 2414 1141 382 993 241 191
192
83 168 157 80 I(n, )C2FA 1562 2416 1160 404 1165 249 194
187
82 166 155 77 DLHFA 1431 2303 1104 400 1122 239 185
193
83 169 157 80 BM25FA 1641 2400 1184 404 1142 249 193
2590 3363 1574 516 1600 252 194
hp2004
np2002 np2003
83
170 158
82
169 157
80
169 157
83
168 157
82
167 155
82
169 157
np2004
80
80
80
80
77
80
Table A. 7: Number of retrieved relevant documents for field retrieval and combination with query-independent evidence.
wu ku Wpr kpr wam kam
Task tr2000
PL2FU 0.3093 1.5751
PL2FP 0.0375 23.9525
PL2FA 0.0375 23.9525
tr2001
td2002 td2003
0.0929
1.5717 7.8659
0.5403
1.8468 8.0463
0.0002
3.3667 5.1552
0.7134
47.1483 10.3136
0.0002
3.3667 5.1552
0.7134
47.1483 10.3136
td2004
hp2001 hp2003 hp2004 np2002 np2003 np2004 Task
7.6821
12.4129
1.8550
0.1420
1.8550
0.1420
15.7055 19.3671 14.3775 18.3087 14.6394 9.0759 0.1658 0.1013 5.5921 0.0046 19.4505 1.9834 PB2FU
1.3039 35.1833 1.0129 6.2034 0.8826 5.5842 0.1168 0.9174 0.1685 4.1131 0.8798 5.7268 PB2FP
1.3039 35.1833 1.0129 6.2034 0.8826 5.5842 0.1168 0.9174 0.1685 4.1131 0.8798 5.7268 PB2FA
tr2000 tr2001
td2002 td2003
0.0635 0.9795
13.9630 29.5918
4.9097 0.1521
0.3353 5.1464
0.0009 0.0089
0.5162 9.7372
1.2607 4.0126
6.0564 3.7462
0.0009 0.0089
0.5162 9.7372
1.2607 4.0126
6.0564 3.7462
td2004
hp2001 hp2003
34.5830
25.5420 23.5538
9.7946
3.9264 14.0256
14.4232
0.6829 19.2171
0.7003
0.4400 4.0532
14.4232
0.6829 19.2171
0.7003
0.4400 4.0532
hp2004
50.8372
21.0273
31.6911
1.0226
31.6911
1.0226
232
w"

k" Wpr kpr Wam kam
Task np2002 np2003 np2004 Task tr2000 tr2001
PB2FU 2.9455 6.8139 0.9358 5.8195 14.1126 20.0929 I(ne)C2FU 0.4539 52.8343 0.4432 0.2414
PB2FP 1.07 44 0.5002 16.8391 0.3043 9.9938 1.0405 I(ne)C2FP 0.2380 40.1671 0.1169 322.0717
td2002 td2003
td2004
PB2FA 1.0794 0.5002 16.8391 0.3043 9.9938 1.0405 I(ne)C2FA 0.2380 40.1671 0.1169 322.0717
0.2309 13.3819
9.0207
0.1100 12.7960
14.1381
0.6187 9.6733
5.4772
3.8294 1.6143
0.9171
0.6187 9.6733
5.4772
3.8294 1.6143
0.9171
hp2001 hp2003
hp2004 np2002 np2003
5.7255 4.9946
2.1851 0.2132 0.0177
12.3012 12.6104
6.8189 0.0152 0.1168
1.1897 2.6812
9.1281 1.9003 4.0721
160.9115 11.0089
0.5099 0.0842 0.2500
1.1897 2.6812
9.1281 1.9003 4.0721
160.9115 11.0089
0.5099 0.0842 0.2500
np2004
Task tr2000 tr2001 td2002 td2003 td2004
0.7658
3.7596
0.5165
0.8709
0.5165
0.8709
DLHFU 0.4859 23.1299 1.2604 0.7566 2.4051 6.7482 7.4755 6.8915 9.1663 11.9887
DLHFP 1.5792 18.0661 0.5686 1.0517 1.4771 4.8690 1.8798 0.5452 4.2494 0.6156
DLHFA 1.5792 18.0661 0.5686 1.0517 1.4771 4.8690 1.8798 0.5452 4.2494 0.6156
hp2001
hp2003 hp2004 np2002 np2003 np2004 Task tr2000 tr2001 td2002
17.6257
28.4453
3.3613
1.1164
3.3613
1.1164
16.4713 12.0029 7.5731 10.0293 1.0460 0.0585 0.9707 1.2028 45.0101 0.7591 BM25FU 86.3178 0.8037 0.1774 0.4702 0.0343 0.1301
13.9537 5.2911 9.3553 1.0605 0.1457 1.8805 0.4976 3.5503 2.5503 1.2275 BM25FP 10.0952 0.0189 53.1777 0.2245 10.3742 0.0677
13.9537 5.2911 9.3553 1.0605 1.8805 0.1457 3.5503 0.4976 2.5503 1.2275 BM25FA 0.0189 10.0952 0.2245 53.1777 0.0677 10.3742
td2003
td2004 hp2001 hp2003 hp2004 np2002
6.8095
6.1798 3.8562 2.0919 1.6581 0.1761
16.3036
37.1125 14.5958 13.4126 5.4128 3.9992
2.3923
1.7851 0.3967 1.7147 7.3362 2.0513
4.7923
3.5860 77.0965 7.9865 0.4661 0.0784
2.3923
1.7851 0.3967 1.7147 7.3362 2.0513
4.7923
3.5860 77.0965 7.9865 0.4661 0.0784
np2003
np2004
0.0125
2.1986
13.2079
30.4512
2.7349
2.3428
0.0784
2.2427
2.7349
2.3428
0.0784
2.2427
the the for the models with weighting of combination Table A. 8: The parameter values query-independent evidence.
233
Task (train)
nb
Ca
Ct
wQ
wt
mg2003 (mg2004) mq2004 (mq2003')

mq2003 (mg2004) mq2004 (mq2003') mg2003 (mg2004) mq2004 (mq2003') mq2003 (m42004) mg2004 (mq2003')
0.9319 0.6986
0.6296 0.4608 0.2628 0.4487
78.1036 73.2827
62.0220 21.7563 158.0428 12.4619 -
PL2F 8.4729 26.7106 PB2F

22.0046 7.0632 I(ne)C2F 9.2125 3.2772
10.2889 37.6963 1.0094 2.8912

1.1658 1.4040 0.4006 1.5102 96.0313 14.2996 12.3601 6.6747 3.3339 7.5873 37.3645 3.6735
DLHF
-
Task (train)
mq2003 (mq2004) mq2004 (mq2003')
bb
0.8211 0.8896
ba
0.0093 0.0487
BM25F be k
0.4580 0.8453 4.39 5.62
wa
8.6475 8.1673
wt
33.7770 20.0292
Table A. 9: The values of the parameters and the weights of the fields for the weighting models PL2F, PB2F, I(ne)C2, DLHF and BM25F for training and evaluating with different mixed tasks. The parameter values used for the mixed tasks are the ones used for their corresponding subsets of tasks.
wu
ku
Wpr
kpr
Wam
kam
Task (train) mq2003 (mg2004) (mq2003') mq2004 (mq2004) mq2003 (mq2003') mq2004
PL2FU 9.4083 11.0740 28.6998 9.7826
PL2FP 9.1928 0.2024 4.7340 35.7684
PL2FA 2.4205 0.4538 1.8934 0.6473
PB2FU
15.4118 21.6786 8.2846 8.0324 I(ne)C2FU
PB2FP
0.5128 17.9247 5.3236 0.0562 I(ne)C2FP
PB2FA
2.3209 5.0749 0.0036 8.5315 I(ne)C2FA
(mq2004) mq2003 (mq2003') mq2004

(mq2004) mq2003 (mq2003') mq2004 (mq2004) mq2003 (mg2003') mq2004
6.5864 1.5936 7.6559 1.9588 DLHFU

8.3892 15.9754 2.8241 1.7968 13.8659 71.0661 79.7968 12.7021
0.2262 9.3379 0.2680 7.1016 DLHFP

14.7347 6.7616 5.7691 2.1615 0.2093 1.1263 0.1833 2.0311
1.3728 1.1061 1.1468 17.5904 DLHFA

3.0739 3.4603 0.6453 1.1634 14.1576 34.2182 1.0578 20.6925
BM25FU
BM25FP
BM25FA
field for the retrieval combination of each Table A. 10: The values of the parameters for training evaluating with and the evidence query-independent weighting model and for the tasks the ones used are The mixed different mixed tasks. used values parameter to The a subset of task tasks. corresponds mq2003' for their corresponding subsets of for task. type first 50 topics of the each of consists which mq2003,
234
't'ask (train)
cb
ca
ct
wo
Wt
mg2003 (mg2004) mq2004 (mq2003') mq2003 (mg2004) mq2004 (mq2003') mg2003 (mg2004) mq2004 (mq2003')
mq2003 (mg2004) mq2004 (ma2003')
0.9572 0.6607 0.9905 1.0161 0.8462 1.1734
64.9514 1.2557 4.9990 1.3790 113.3989 1.0854 -
PL2F 8.0774 1.2172 PB2F 15.4419 3.0255 I(ne)C2F 1.2494 2.7481 DLHF -
5.3941 5.2971 8.2748 2.9781 1.3378 12.5489 9.3857

A Al QA z"V-VZ
8.8938 7.1962 13.7280 5.0802 5.3631 26.0581 9.8862 r '70, )n

V. 1 Vf/. 7
Task (train) mq2003 (mq2004) mg2004 (mq2003')
bb 0.5804 0.4866
ba 0.4794 0.5291
BM25F bt k 0.5462 2.92 0.5663 2.20
Wa 13.2098 18.2071
Wt 13.9637 9.4071
Table A. 11: The values of the parameters and the weights of the fields for the weighting models PL2F, PB2F, I(ne)C2, DLHF and BM25F for training and evaluating with The parameter values used for the mixed mixed tasks, and restricted optimisation. tasks are the ones used for their corresponding subsets of tasks. The task mq2003' corresponds to a subset of mq2003, which consists of the first 50 topics for each type of task.
Wu ku )pr kpr Wam Cam
Task (train) (mq2004) mq2003 mq2004 (mq2003') (mq2004) mq2003 (mq2003') mq2004 (mq2004) mq2003 (mq2003') mq2004 (mq2004) mq2003 mq2004 (mq2003')
PL2FU 6.8737 8.0400 9.0689 8.7801 PB2FU 5.8988 3.3266 10.1833 5.2613 I(ne)C2FU 4.5683 0.1295 2.4396 2.0438 DLHFU 2.7803 7.7638 4.2361 6.9989
PL2FP 5.5400 0.1651 6.5624 18.2335 PB2FP 14.9628 46.2044 0.8000 20.4207 I(ne)C2FP 1.5623 16.6362 0.4796 3.6810 DLHFP 8.4433 7.7489 0.4846 10.9063
PL2FA 2.2822 2.7487 5.7708 1.6061 PB2FA 4.9506 3.3182 8.1035 2.2948 I(ne)C2FA 5.1615 1.0130 10.7205 0.0155 DLHFA 5.6169 0.0993 0.8618 3.0052
(mq2004) mq2003 (mq2003') mq2004
BM25FU 3.1348 3.0927 3.7348 3.5144
BM25FP 3.7313 2.7073 2.8207 1.2952
BM25FA 1.4702 41.2509 0.9682 13.3225
field for the retrieval A. 12: The values of the parameters Table combination of each for training evaluating with and the evidence query-independent and model weighting for the The mixed used values tasks, parameter optimisation. and restricted mixed The task tasks. for mq2003' their of the subsets tasks are corresponding used ones for first 50 type topics the each of to a subset of mq2003, which consists corresponds of task.
235
Appendix
Evaluation
This appendix the context
of experiments
results from all the introduced
E
experiments E, in
presents the evaluation
of a Bayesian decision mechanism, which employs two retrieval approaches based on PL2F, PB2F, I(ne)C2F, DLHF, BM25F, or two different weighting models, respectively. Tables B. 1 to B. 11 present nism, which is trained In these tables, retrieval the evaluation results for a Bayesian decision mecha-
and tested with the same search task, as described in Chapter 6. column displays the name of the tested topic set, the two and the mean average preresults for
the first
approaches
employed
by the decision mechanism,
cision of the most effective one. The second column displays the evaluation the experiments combination experiments tion of fields. experiment that consider documents Similarly,
least one query term in a particular with at for the results combina-
of fields.
the third column displays the evaluation
that consider documents For each evaluated the obtained
with all the query terms in a particular the tables report ('MAP'),
decision mechanism,
the employed difference
mean average precision
the relative
between the MAP of the most effective retrieval
approach and the obtained
MAP by I lie
('B'). decision boundaries The decision mechanism ('+/-%'), the symbol of and number t denotes that the Bayesian decision mechanism applies the most appropriate retrieval approach difference individual rank test. for a statistically between retrieval significant The * symbol number of queries. (lenotes that the and that of the most effective according to \Vilcoxon's sing HH for
the MAP approach
decision the mechanism of is statistically significant
If an experiment
e does not identify likelihood
decision least one at of one retrieval
boundar\
a particular
task, because the posterior
approach
is always
236
higher than the posterior likelihood of the other retrieval approach, then this is denoted by - in the tables. For example, when the Bayesian decision mechanism employs the experiment F-3(at) in order to selectively apply either PL2FA or I(ne)C2FA for the task np2003, there is no decision boundary identified (Table B. 1). For this reason, the results of the experiment e3(at) are only reported in Table B. 1, and not in Table 6.2 (page 138). Tables B. 12 and B. 13 present the evaluation results for a Bayesian decision mechanism, which is trained and evaluated with different sets of mixed tasks, as described in Chapter 7. The columns of these tables display: the name of the evaluation task ('Task'); the two retrieval approaches employed by the decision mechanism ('Retrieval approaches'); the mean average precision of the most effective individual retrieval approach ('Baseline'); the employed experiment (È'); the obtained mean average precision by the Bayesian decision mechanism ('MAP'); the relative difference between the MAP of the most effective retrieval approach and the obtained MAP by the decision mecha('+/-%'); nism and the number of decision boundaries ('B'). The symbol t denotes that the Bayesian decision mechanism applies the most appropriate retrieval approach for a statistically significant number of queries. The symbol * denotes that the difference between the MAP of the decision mechanism and that of the most effective individual retrieval approach is statistically significant according to Wilcoxon's singed rank test.
Setting E
E3 (b)
MAP
-
+/-%
-
B
-
E
F-V(b)
MAP
-
+/-%
td2003 PL2F 0.1606 PL2FP td2004 PL2F

0.1299 PL2FA
E3 at E3(b)
E3(at)
0.1520 0.1355
0.1343
5.401 1 + 4.31 1
+ 3.391 2
EV(at) 0.1619 F-V(b) 0.1358

EY(at) 0.1331 0.7380
-
+ 0.811 1 + 4.54 3
+ 2.46 0.74 -
hp2003 PL2FU
0.7435 PL2FA hp2004 PL2FU 0.6674 PL2FP np2003 PL2F
E3(b)
E3(at)
E3(b)
0.7435
0.7409
-
0.00
0.35t -
1
1
-
F-V(b)
EY() at
F-V(b)
F-3(at)
E3 (b)
0.6674
-
0.00
-
1
-
Y(at)
F-V(b)
0.6816 0.6742
+ 2.13 + 0.43
1 1
0.6713 PL2FA
np2004 PL2F 0.7169 PL2FA
83(at)
E3 (b)
0.6687 0.7480
0.7501
0.39 + 4.34 2.00 1.70 -
2 2 2
2
y 0.6692 at F-V(b) 0.7174

&V(at) Ey 0.7296 0.1377
0.31 1 + 0.07 1
+ 1.77 1
F-3(at)
E3(b)
+ 4.631 2
td2003 PB2F
0.1417 PB2FA
0.1389
0.1393
F-V(b) 0.1366
at
E-3(at)
3.60 1 2.80 1
td2004 PB2FU 0.1404 PB2FP
E3(b) E3(at)
0.1451 + 3.35 ' 1 F-V(b) 0.1417 0.1441 0.1440 + 2.56 1 Ey at contin ued on next page
+ 0.93 1 + 2.64' 1
237
Setting
hp2003 PB2FU 0.6589 PB2FP

hp2004 PB2FU
3(b)
continued from previous p age MAP B +/-% E 0.6621 + 0.49 1 v(b)

)-
MAP 0.6696 0.6658 0.5762
+/-% + 1.62 + 1.05 + 1.50
B 1 1 1
83(at
E3(b)
Ey
EY(b)
at
0.5677 PB2FP
np2003 PB2F 0.6634 PB2FP
83(at )E3(b)
np2004 PB2FU
0.7241 PB2FP
83(at )
E3(b)
0.6728 0.6662
+ 1.42 + 0.42
Ewa)
eV(b) Ey at at
0.5766
0.6692
+ 1.571 2
+ 0.871 1
0.7091
0.7189
F-3(at ) F-3(at )-
2.10 0.72 -
1
1
V(b)
Ev
0.7318
0.7174 0.1341 0.1271 0.7220
+ 1.06
0.93 -
1
1
td2003I(ne)C2F
0.1283 I(ne)C2FA
E3(b)
0.7320 0.7323
EV(b)
Cy(at)
+ 4.521 1 2.801 1 1.70 1 -
td2004I(ne)C2F
0.1307 I(ne)C2FP hp2003I(ne)C2FU 0.7343 I(ne)C2FA
E3(b)
F-3(at )3(b)
0.31 0.27 -
1 1
EV(b)
EV(at) Ev(b) Ey at
E3
hp2004I(ne)C2FU
0.6632 I(ne)C2FP np2003I(ne)C2F 0.6940 I(ne)C2FP np2004 I(ne)C2F
E3(b)
F-3(at )E3(b)
at
0.6978 0.7022 0.6923
+ 0.55 + 1.18 + 1.17
1 1 1
EV(b)
y Qt Ev(b) Ey at EY(b)
0.6939
0.7031 -
+ 4.63
-
1
-
+ 6.021 3
E3(b)
E3(at)
0.6843 I(ne)C2FA td2003 DLHF

0.1455 DLHFP
83(at) 3(b)
E3(at
E3(b)
0.6819 0.1495
0.1453
-
0.35 + 2.75
0.14 -
1 2
2
-
V(at) 8V(b)
EV(at)
EV(b)
0.7079 0.1466
0.1434
-
+ 3.45 + 0.76
1.40 -
1 2
1
-
td2004 DLHF
0.1371 DLHFP
F-3(at)
E3(b)
hp2003 DLHFU
0.6710 DLHFP hp2004 DLHFU
0.6747 -
+ 0.55 -
2 -
EY(at)
83(at) 3(b)
EY(b) Ev at EV(b)
0.1312 0.6644 0.6135
0.6278 DLHFP
np2003 0.5241 np2004 0.4978 DLHFP DLHFA DLHFU DLHFP
E3(at )
E3(b) e3(at)
E3(b)
0.6173
0.4973
1.70 0.10 -
2
1 -
Cy(at)
Ev(b) Ev at EV(b) Ey at
0.6399
0.5377 0.5128 0.4871
+ 1.931 1
+ 2.591' + 3.01 2.101 1 1 2
4.30 0.98 2.30 -
1 1 3
E3(4t)-
td2003 BM25FU
0.1857 BM25FP td2004 BM25F 0.1169 BM25FA
E3(b)
F-3(at)
E3(b)
0.1173 0.1119
+ 0.34 4.30 -
3 1
EY(b)
&1(at)
0.1861
0.1701 0.1166 0.1170
+ 0.22
8.40 0.26 + 0.09
2
2 3 1
EV(b)
EY(at)
F-3(at)
E3(b)
hp2003 BM25FU 0.7516 BM25FP hp2004 BM25FU

0.6479 BM25FP
0.7516 F-3(at) 0.7602 E3(b) 0.6681

E3
E3
0.00 + 1.14 + 3.12

+ 2.69 1.10 0.73 -
1 2 1
2 1
E(b) v(b)
EY(at)
Cy() at
0.7481 0.7516 0.6635

0.6823 0.7182
0.47 0.00 + 2.41
1 2 1
at
BM25F np2003 0.7108 BM25FP np2004 BM25F 0.6707 BM25FU td2003 I(ne)C2FU
0.1455 DLHFP
(b)
0.6653 0.7031 0.6658
Ey(b)
t' 3 5.31 + + 1.04 1
F-3(at) 0.7068
_ E3(b)
0.56 -
1
2
EY(at)
EV(b)
0.1476 0.1568
F-3(at)
3(b)
0.6794
+ 1.301 2
EY(at) 0.6698
2 0.131 + 1.44 2 + 7.77f' 1
F-3(at)
v(b) 0.1483 + 1.92 1 EY(at) 1 9.35 0.1319 continued on next page
238
Setting td2004 PL2F 0.1307 I(ne)C2FP hp2003 DLHFU
0.6660 BM25FA
hp2004 PB2FU 0.5555 DLHFA np2003 PL2FP
continued from previous page E MAP B +/-% 3(b) 0.1313 y(b) + 0.46 2 C3(at 0.1330 y + 1.76 2 ) at 3(b) 0.6849 + 2.84 3 d(b)
F-3(at 3(b) )
MAP 0.1402 0.1322 0.6942
+/-%
+ 7.27 2 + 1.15 1 + 4.23 ' 1
0.6809
0.6202 0.5935 0.7007 0.7220 0.7154
+ 2.24 3
+11.65 1 + 6.84 1 + 2.35 1 + 3.97 2 + 3.02 1
83(at )
3(b)
EY(b) Y(at) y(b)
Y( at
0.6803
0.5635 0.5871 0.6940
+ 2.15
+ 1.44 + 5.69 + 1.37 + 5.72 + 2.97
1
1 2 1 1 1
0.6846 I(ne)C2FA
np2004 PB2F 0.6944 I(ne)C2FA
83(at)3(b)
V(at)
y(b) FV(at)
0.7091
0.7341 0.7150
+ 3.581 1
at
Table B. 1: Evaluation of score-independent document-level experiments F-3(f) and Evm.

Setting
td2003 PL2F 0.1606 PL2FP
F-3(b),
E
avg(dom)
MAP
0.1522 0.1614 0.1355 0.1316 0.7435 0.7443
-
+/-%
+ + + + 5.20 0.501 4.31 1.31 0.00 0.111
-
B
1 1 1 1 1 1
-
E
EV(b), avg(dom) EY at ,av dom EV(b), avg(dom) EY at av dom Y(b), avg(dom)
Y at ,av dom EY(b),
avg(dom)
MAP
-
+/-%
E3
td2004 PL2F
0.1299 PL2FA hp2003 PL2FU 0.7435 PL2FA
at ,av dom F-3(b),

avg(dom)
E3
hp2004 PL2FU
0.6674 PL2FP
at av dom F-3(b) avg(dom) , 3 at ,av dom
0.1599 0.1326 0.1334

-
0.441 1 + 2.08 1 + 2.69 1

-
F-3(b), avg(dom)
PL2F np2003
0.6713 PL2FA np2004 PL2F 0.7169 PL2FA td2003 PB2F 0.1417 PB2FA
F-3(b), avg(dom)
at av dom
td2004 PB2FU
0.1404 PB2FP
hp2003 PB2FU 0.6589 PB2FP hp2004 PB2FU 0.5677 PB2FP

np2003 PB2F
E3 at ,av dom E3(b),avg(dom) E3 at ,av dom 3(b), avg(dom) 3 at av dom 3(b), avg(dom) 3 at ,av dom F-3(b), avg(dom) 3
at , av dom F-3(b), avg(dom)
0.6713 0.6681 0.6703
+ 0.581 1 3 0.48 2 0.15 -
EY
0.7555
0.7370 0.1402 0.1399 0.1427 0.6649
-
+ 5.38T* 2
EY(b), avg(dom) Y at ,av dom
at ,av dom
0.6704 0.6664
0.13 3 0.73 1 + 1.67 1 + 0.14 1 + 0.35 1

-
Y(b), avg(dom)
0.7160
0.7289 0.1419 0.1422
-
+ 2.801 3 2 1.06 2 1.27 + 1.64 1 + 0.91

-
0.13 2 -
at av dom EY(b), avg(dom) Ey at ,av dom EY(b), avg(dom)
` 1
-
EY at ,av dom EV(b),

avg(dom) V at ,av dom
EY(b), avg(dom)
0.1416
0.6708 0.6667 0.5704 0.5908 0.6686
+ 0.85 1
+ + + + + 1.81 1.181 0.48 4.071 0.78 2 1 2 1 2
E3
0.6634 PB2FP
np2004 PB2FU 0.7241 PB2FP
td2003 0.1283 td2004 0.1307
I(n,, )C2F I(ne)C2FA I(ne)C2F I(ne)C2FP
E3(b), avg(dorn)
E3
F-3(b),
at ,av dom F-3(b) ,avg(dom) E3 at ,av dom F-3(b), avg(dom) E3 at ,av dom
at ,av dom
0.6650
+ 0.24
0.6686
0.7090 0.7247
-
+ 0.78
2.09 + 0.08
-
2
1 3
" -
EV at ,av dom EV(b), avg(dom)
Ey
at ,av dom EY(b),

avg(dom) V(b), EY
0.6694
0.7246 0.7227 0.1349
-
+ 0.901 1
+ 0.07 2 0.19 2 + 5.14 2
f -
EY
at ,av dom
avg(dom)
avg(dom)
at ,av dom Y(b), avg(dom)
hp2003 I(n,, )C2FU
at ,av dom E3(b),

avg(dom)
EV
0.7343 I(ne)C2FA
E3
at ,av dom
0.7320 0.7423
0.31 + 1.09
3 1
EY(b), avg(dom) Y at ,av dom
at ,au dom
0.1270
-
83t I 2 - .
-
0.7355
+ 0.16 1
239
Setting
continued
from previous
page
MAP
-
+/-%a
-
B
F-V(b),
avg(dom)
hp2004I(ne)C2FU
0.6632 I(ne)C2FP np2003 I(ne)C2F 0.6940 I(ne)C2FP np2004 I(ne)C2F 0.6843 I(ne)C2FA td2003 DLHF 0.1455 DLHFP td2004 DLHF
0.1371 DLHFP
MAP
E3(b), avg(dom)
E3 at ,av dom 3(b), avg(dom) E3 dom at av 3(b), , avg(dom) 3 at ,av dom
3(b), E3 at E3(b), 3
+/-%
0.6811 0.6993 0.7086 0.6846 0.7004 0.1485 0.1464

-
0.6919
0.6807 0.6920 0.6962 0.6755 0.7019 0.1461 0.1468
-
+ + + + +
2.701 0.76 2.10 0.04 2.35
2 2 1 1 2 2 2
-
avg(dom) dom av ,
avg(dom) dom
+ 2.06 + 0.62
-
y at ,av do m F-V(b), avg(dom) y at ,av dom F-V(b), avg(dom) y at ,av dom F-V(b), avg(dom) y dom at av F-V(b), ,
avg(dom)
+ 4.33
+ 2 641 . 0.29 + 0.32 1.23 + 2.57 + 0.41 + 0.89
-
1
4 1 1 1 1 2 2
-
hp2003 0.6710 hp2004 0.6278
DLHFU DLHFP DLHFU DLHFP
at , av 3(b), E3 at E3(b),
EV
avg(dom) dom av ,
avg(dom)
0.6721 0.6736
-
+ 0.16 2 + 0.391 3
-
E3
np2003 DLHFP
0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP td2003 BM25FU 0.1857 BM25FP
E3(b), avg(dom) E3
E3(b), avg(dom) 3 at ,av dom 3(b), avg(dom) E3 at ,av dom
at ,av dom
at ,av dom
0.6271
0.11 -
0.5242
-
+ 0.02
-
2 2
-
F-V(b), avg(dom) y dom at av y(b), avg(dom) Ey at ,av dom dom av , F-V(b), avg(dom) y at ,av dom F-V(b), avg(dom) Ey at ,av dom
at
at ,av
dom
0.6710 0.6209 0.6524 0.5301

-
0.00
1.10 1 + 3.921 1
F-V(b), 0.5310 avg(dom) d

0.5076 0.1920 0.1696
+ 1.32
-
1
-
+ 1.141 2 + 1.971 2 + 3.39 5 8.67 2 -
0.4973 0.4905 0.1889 0.1795
td2004 BM25F
0.1169 BM25FA
E3(b), avg(dom)
0.10 1.47 + 1.72 3.34 -
1 4 2 1
0.1148
0.1193
E3 3
hp2003 BM25FU
0.7516 BM25FP hp2004 BM25FU 0.6479 BM25FP
3(b), avg(dom)
at ,av dom
+ 2.051 4
1.80 -
4 1
EV(b), avg(dom)
Ey
0.1203
0.1156
+ 2.91 0.53 -
1 2
1 1 1
0.7523
0.7607 0.6681 0.6815
+ 0.09
y(b), avg(dom)
at av dom
0.7476
0.7516 0.6717 0.6737
1.111 3 0.00 + 3.67 + 3.98
np2003 BM25F
0.7108 BM25FP np2004 BM25F 0.6707 BM25FU td2003 I(ne)C2FU
dom at av E3(b), , avg(dom) 3 dom at av 3(b), ,

avg(do, n)
+ 1.211 4 + 3.12 2 + 5.19t' 2 0.49 0.24 + 0.92
0.7090
0.7073 0.6691 0.6769 0.1482
0.1455 DLHFP
td2004 PL2F 0.1307 I(ne)C2FP hp2003 DLHFU
dom av , avg(dom) E3 at ,av dom E3(b), avg(dom)

at 3(b), avg(dom)
E3
0.25 -
1
1 2 2 1
y at av dom F-V(b), avg(dom) y at ,av dom
F-V(b), avg(dom)
0.7148
0.7154 0.6733 0.6563 0.1429
+ 0.56
+ 0.65 + 0.39 2.15 1.79 + 6.04 1.01 + 8.98
1
1 2 2 3 3 2
+ 1.86
at ,av dom E3(b),

E3 at ,av dom E3(b), avg(dom)
0.1464
0.1347 0.1316 0.6732
+ 0.621 2 + 3.06 3
+ 0.69 + 1.08 +11.65 1 3 1
Ey at ,av dom F-V(b), avg(dom) y at ,av dom y(b), avg(dom) Ey at ,av dom
0.1573
0.1386 0.6593
+ 8.111 2
F-V(b), avg(dom)
0.6660 BM25FA
hp2004 PB2FU
dom at av E3(b), , 3
avg(dom)
avg(dom)
0.6895
0.6202
+ 3.531 4 +11.88t' 2 + 1.21 1

+ 1.42 1
Ev at ,av dom F-V(b), avg(dom)
Ey
at ,av dom E(b),

avg(dom)
0.6592
0.6054
1.02 -
4
2
0.5555 DLHFA PL2FP np2003

0.6846 I(ne)C2FA
dom av at E3(b), ,
at ,av dom E3(b),
avg(dom)
0.6215 0.6929
0.6943
y at av dom y(b), avg(dom)

Ey y
0.6279 0.7031
0.6972
+13.03t' 1 + 2.70 2
+ 1.84 1
E3
np2004 PB2F
0.7187
0.7298
+ 3.50
+ 5.10
2
3
y(b), avg(dom)
at av dom
0.7005
0.7040
+ 0.88
+ 1.38
2
3
0.6944 I(n. )C2FA
E3
at ,av dom
at ,av dom
Table B. 2:
E 3(f), avg(dom)
Evaluation
and 'y(f),
domain experiments aggregate-level of score-independent
avg(dom) .
240
Settin g td2003 PL2F
std(dom) 3 at ,atd dom 3

F -3(b),
0.1606 PL2FP td2004 PL2F

0.1299 PL2FA
MAP 0.1523 0.1515
3(b), atd(dom)
0.1325
0.1324 0.7435 0.7463
-
+/-% B 5.17 1 t 5.67 1 -
E v(b) atd(dom) , Ey at ,atd dom
MAP 0.1606
+/-%
0.001 1
+ 2.00
hp2003 PL2FU
0.7435 PL2FA hp2004 PL2FU
dom at atd 3(b), , atd(dom) E3 dom at std F-3(b), ,

E3 at
+ 1.92 4 0.00 1 + 0.381 1

-
EY(b) atd(dom)
0.1302
0.1299
-
0.6674 PL2FP
np2003 0.6713 np2004 0.7169 td2003 0.1417 td2004 0.1404 hp2003 PL2F PL2FA PL2F PL2FA PB2F PB2FA PB2FU PB2FP PB2FU
std(dom)
, std dom
Ev dom std at , EV(b), std(d)m) Ev at ,std dom

EV(b),
at v
+ 0.23
0.00
-
1
2
-
atd(dom)
, atd dom
E3(b),
0.6589 PB2FP
hp2004 PB2FU
std(dom) E3 dom at std F-3(b), , std(dom) E3 at ,atd dom F-3(b), std(dom) 3 dom atd , at E3(b), atd(dom) 3 dom at atd 3(b), , std(dom)
0.6638 0.6678 0.7348 0.7341 0.1404

-
+ +
1.12 0.52 2.50 2.40

-
1 2 2 3 3
-
0.92 -
Ev
Ey(b), atd(dom) Ev at ,atd dom v(b), atd(dom) Ev dom std at , EY(b), atd(dom)
at std dom
0.6827 0.6620 0.6725 0.6998

-
+ 2.29 1.39 + 0.181 2.39

-
1 3 1 2 3 2
-
0.1395 0.1440
-
1.55 + 1.62
-
0.1395 0.1455 0.6635
0.64 1 + 3.631 1 + 0.70 1
E3
3
at F-3(b),
dom atd ,
0.6622
-
+ 0.50
-
2
2 4 1 4
-
EV(b)atd(dom) , Ev std dom at EY(b), atd(dom)
0.1402 0.6706
-
0.14 + 1.78
-
2 3
-
Ev
0.5677 PB2FP
np2003 PB2F 0.6634 PB2FP np2004 PB2FU 0.7241 PB2FP td2003 I(ne)C2F
std(dom)
at , atd dom 3(b),
EY(b), std(dom) v EV(b), std(dom) Ev at ,std dom EV(b), std(dom) Ey dom std at
v(b), std(dom) std dom std(dom) Ev at EY(b), at , std dom
at
dom std ,
0.5762
-
+ 1.50
-
F-3(b),
atd(dom) 3 at ,std dom F-3(b), std(dom) E3 at ,std dom

std(dom) dom std , atd(dom)
0.6597 0.6716 0.7076 0.7353

-
+ +
0.56 1.24t' 2.28 1.551

-
0.6682 0.6722 0.73

-
+ 0.72 1 + 1.33t' 3 + 0.81

-
1
-
0.1283 I(n,, )C2FA

td2004 I(ne)C2F 0.1307 I(ne)C2FP hp2003I(ne)C2FU 0.7343 I(ne)C2FA hp2004I(na)C2FU
E3
at E3(b),
0.1283
-
0.001 2
-
0.6940 I(n. )C2FP )C2F np2004I(n. td2003 DLHF
0.6632 I(ne)C2FP np2003 I(ne)C2F
E3 ,std dom at 3(b), std(dom)

E3
E3 F-3(b),
3 at std dom F-3(b) ,std(dom) E3 at ,std dom F-3(b), std(dom)
0.1333 0.7336 0.7482 0.6914
+ 1.991 1 0.10 1 + 1.891 3 + 4.251 2
0.6955
0.7028
+ 0.22
+ 1.27
4
1
Ev at std dom EY(b), atd(dom) v of ,std dom E'(b), std(dom) Ev dom atd at
0.1305 0.7343 0.6876 0.7053
0.151 2 0.00 2 + 3.68 1 + 6.35t' 3
EV(b), atd(dom)
Ev Ev
0.6921
-
0.27 -
1
-
at std dom
std(dom)
0.6839
0.6814
0.06 0.42 -
1
2
Ey(b), atd(dom)
at
at
dom atd , dom atd ,
0.7155
-
+ 4.56
-
2
-
0.6843 I(ne)C2FA 0.1455 DLHFP td2004 DLHF 0.1371 DLHFP hp2003 DLHFU 0.6710 DLHFP hp2004 DLHFU 0.6278 DLHFP DLHFP np2003 0.5241 DLHFA
np2004 DLHFU
3(b), atd(dom) E3 at ,atd dom 3(b), std(dom) 3 std(dom) , E3 at ,atd dom F-3(b), std(dom) E3
atd dom at F-3(b)
at , std dom 3(b), std(dom) 3 at , std dom
0.1495 0.1463
-
+ 2.75 + 0.55
-
2 2
-
0.1467 EV(b), atd(dom) 0.1416 v dom atd at v(b), , atd(dom)
+ 0.82 2.68 -
1 1
-
0.1396 0.6718
0.6744
+ 1.821 1 + 0.12 2
+ 0.51 2
Ev , atd dom at v(b), atd(dom)

v
atd dom
0.6255 0.6358 0.5242

-
0.37 + 1.27 + 0.02

-
1 2 2
-
EY(b), std(dom) Ev atd dom at v(b), atd(dom)

v at
at atd dom
0.6709
0.02 -
0.6342 0.6471 0.54

0.5408
at 3(b),
std dom
0.4978 DLHFP
std(dom) E3 at atd dom
0.4963 -
0.30 -
1 -
EV(b), std(dom) Ev at std dom
0.4932 0.5118
+ 1.02 + 3.07' + 3.03 + 3.19 ?' 0.92 + 2.81
1 7 1
1
3 3
contin ued on next page
241
Iw
Settin g
--l-vu
Fum previous
it
MAP
page
+/-%
E atd(dom) , Ev at atd dom F-V(b), std(dom) Ev at ,etd dom Ev(b) atd(dom) , Ev at ,atd dom F-V(b), Ev std(dom) Ev dom atd at , F-V(b ),std(dom) eV at ,std dom Ev(b),atd(dom)
Ev
MAP 0.1902 0.1808 0.1165 0.1148 0.7532 + 2 42 . 2.64 0.34 1.80 + 0.21
td2003 B M25FU
0.1857 td2004 0.1169 hp2003 0.7516 hp2004 BM25FP BM25F BM25FA BM25FU BM25FP BM25FU
3(b), etd(dom)
E3 3(b), at atd dom
atd(dom)
B 3 1 1 2 4
0.1977
0.1965 0.1148 0.1193 0.7499 0.7627 0 6607 . 0 6539 . 0.7050 0.7108 0.6680
-
+ 6.46
0.6479 BM25FP
np2003 BM25F 0.7108 BM25FP
np2004 BM25F 0.6707 BM25FU

td2003 I(ne)C2FU
3 3
E3
3 at ,atd dom 3(b), etd(dom) 3 at ,etd dom 3 (b), atd(dom) 3 at ,std dom E3(b), std(dom) E3 dom atd , at e3(b), atd(dom)
+ 5.82 t 2 1.80 2 2 05 2 . 0.23 1 t 1.48 + 4 + 1.98 2 + 0.93 1 0.82 2 0.00 3 0.40 2 -
Ev(b)
11 11
0.7529
0.6623 0.6822 0 7137 . 0.7023 0.6709 0.6459 0.1525 0.1517
+ 0.17
+ 2.22 + 5.29t' + 0 41 . 1.20 + 0.03 3.70 + 4.81 + 4.26
1
1 3 1 1 2 1 1 1
E3(b),
at
dom atd ,
0.1455 DLHFP
td2004 0.1307 hp2003 0.6660 hp2004 PL2F I(ne)C2FP DLHFU BM25FA PB2FU
atd(dom)
atd(dom)
0.1404
dom at std E3(b), ,

dom atd ,
0.1426
-
3.51 -
1
-
2.001 1 + 3.06 + 2.33 + 1.29 +14.17 ' 1 3 2 2
at EV(b),
v(b), atd(dom)
dom std , atd(dom)
at ,atd dom
0.1353
0.1357 0.6682 0.6707 0.5622 0.5869 0.7230 0.7131 0.7184 -
at 3(b),
3 dom atd at , 3(b),atd(dom)

at E3(b), 3 std dom
atd(dom)
0.1347 0.6815 0.6746 0.6342
0.5555 DLHFA
np2003 PL2FP 0.6846 I(ne)C2FA np2004 PB2F 0.6944 I(ne)C2FA
Ev dom std at , v(b), atd(dom) Ev dom std at , v(b),atd(dom)
+ 3.52
+ + + + + + + + 3.83 0.33 0.71 1.21 5.65 5.61 4.161 3.46 -
1
2 1 1 1 2 1 1 1 -
0.5871
0.7052
+ 5.69
+ 3.01 + 1.58 + 2.07
2
1 2 1
v Ev
E3 dom atd at , E3(b), std(dom) E3 at ,std dom
atd(dom)
0.7055
0.7054 0.7088
+ 3.051 1
EY(b), atd(dom) F-V(b), std(dom) Ev at ,std dom
at
atd dom
at
dom std ,
Table B. 3:
' 3(f), std(dom)
Evaluation of score-independent domain aggregate-level experiments

and'y(f), std(dom)
Setting td2003 PL2F
E E3(b),Irg(dom) lrg(dom) 3 at ,lr dom

E3(b), Irg(dom)
MAP
dom
+/-%
-
B
F-V(b),
Irg(dom)
0.1606 PL2FP td2004 PL2F

0.1299 PL2FA
at E3(b),
E3
lr
0.1359 0.1335
-
+ 4.62 + 2.77
-
1 1
-
hp2003 PL2FU 0.7435 PL2FA

hp2004 PL2FU 0.6674 PL2FP np2003 PL2F
F-V(b), Irg(dom) v lr dom at F-V(b), ,

Irg(dom)
at ,Ir
dom
MAP 0.1608 0.1609 0.1386 0.1318

-
+/-% + 0.12 + 0.191 + 6.70t* + 1.46

-
B 1 1 3 2
-
E3
Irg(dom) E3 dom Ir at E3(b), , Irg(dom)
Ir dom at E3(b), ,
0.7444
0.6609 0.6752 0.6702
+ 0.121 1
0.97 + 1.17 0.16 1 2 1
F-V(b), Irg(dom) Ev at ,Ir dom F-V(b), Irg(dom)
at ,Ir dom
0.6524 0.6712 0.7252 0.1385 0.1393 0.1403
0.6713 PL2FA np2004 PL2F

0.7169 PL2FA td2003 PB2F 0.1417 PB2FA
td2004 PB2FU 0.1404 PB2FP
at ,Ir dom E3(b) Irg(dom) , E3 at Ir dom E3(b), lrg(dom)
3 at ,Ir dom E3(b), Irg(dom)
0.6674 0.7563
0.7310 0.1412 0.1399 0.1471
0.58 + 5.50
+ 1.97
5 3
1
v at ,Ir dom F-V(b), Irg(dom)

at ,Ir dom Ev(b),Irg(dom) v at Ir dem F-V(b), Irg(dom) v
2.25 0.02 + 1.16 2.26 1.69 0.07 -
3 0.35 3 1.27 + 4.77t* 1
2 2 2
at ,Ir dom
0.1482 + 5.56t' 1 Ev at Ir dom , continued on next page
0.1469
+ 4.63t' 3
242
Setting hp2003 PB2FU

E3(b),
continued from previous page MAP B +/-%
MAP 0.6580
+/-% 0.14 -
B 1
0.6589 PB2FP hp2004 PB2FU

0.5677 PB2FP np2003 PB2F 0.6634 PB2FP np2004 PB2FU 0.7241 PB2FP td2003 I(ne)C2F 0.1283 I(ne)C2FA td2004 I(ne)C2F
0.1307 I(ne)C2FP
E3
lrg(dom)
lrg(dom) ,! r
0.6649
+ 0.91
1r dom at E3(b), , E3 at E3(b),

3(b), dom
0.6615
-
+ 0.39
-
v(b),Irg(dom)
2
-
Ir dom at EY(b) ,
lrg(dom) v
Ev
0.6677
----
+ 1.34f 1
lrg(dom) E3 at ,lr dom

lrg(dom)
0.6705 0.6608
-
+ 1.07 0.39 -
1 1
-
F-V(b), Irg(dom) Ey Ir dom at V(b), ,

lrg(dom)
at ,Ir
dom
0.6714 --0.7172 --0.1347 0.1331 0.1314

--
+ 1.21 2 0.95 2 + 4.99 1 + 3.741 1 + 0.54 1 1.32 1 + 0.78 1
E3
E3
at ,!r dom 3(b),lrg(dom)
0.7094
-
2.03 -
1
-
at ,!r dom F-V(b), Irg(dom)

y
Ev
Ev
dom at , lr E3(b), lrg(dom)
at ,Ir F-V(b),
dom
dom
Irg(dom)
E3
hp2003I(ne)C2FU
0.7343 I(ne)C2FA hp2004I(ne)C2FU 0.6632 I(ne)C2FP np2003 I(ne)C2F 0.6940 I(ne)C2FP
np2004I(ne)C2F
E3(b), lrg(dom)
lr dom at E3(b), ,
lrg(dom)
E3 at
at ,Ir
dom
at ,Ir F-V(b),
Irg(dom)
E3
0.7371
-
+ 0.38
-
1
-
Ev
Ir dom at V(b), ,
Irg(dom)
0.7246 0.74
--"
3(b),
lrg(dom)
lr dom ,
0.6790
0.7091 0.7049
-
+ 2.381 2
+ 2.18 + 1.57
-
Ev
lr dom at E3(b), ,
lrg(dom)
E3
2 2
-
EV(b), Irg(dom) v dom Ir at F-V(b),

Irg(dom)
at lr dom
0.6632
0.7049
--
0.00 2
+ 1.57 1
--
0.6843 I(ne)C2FA td2003 0.1455 td2004 0.1371 DLHF DLHFP DLHF DLHFP
at ,Ir dom 3(b) Irg(dom) , E3 at ,lr dom E3(b),lrg(dom)

3 at ,Ir E3(b), dom
E3
0.6930 0.1454 0.1471

-
+ 1.271 1 0.07 + 1.10

-
3 2
-
F-V(b), v
at ,Ir dom F-V(b) Irg(dom) , Ey at ,!r dom

Irg(dom) at ,lr dom
--0.1405 0.1457
-----
3.44 1 + 0.141 2
hp2003 DLHFU 0.6710 DLHFP

hp2004 DLHFU 0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP
Irg(dom)
Irg(dom)
0.6784
+ 1.10
EV(b), Irg(dom)
0.6771
+ 0.91 4 + 3.27 2 2.05 2 -
3
3
dom Ir at E3(b), ,
at ,Ir dom E3(b), Irg(dom)
3
3
0.6781
-
+ 1.06
-
2
-
Ev
0.6292 0.5301
-
+ 0.22 + 1.14
-
2 3
-
lrg(dom) Ev at ,!r dom F-V(b), Irg(dom)

v dom at ,lr EY(b), Irg(dom)
Ir dom at EY(b), ,
0.6656
0.6483 0.6149
-------
3 0.80 -
at ,Ir E3(b),
at E3(b),
dom
lrg(dom)
!r dom
lrg(dom)
0.4961 0.1937
W2003 BM25FU 0.1857 BM25FP

td2004 BM25F 0.1169 BM25FA
1 0.341 + 4.31 3 + 0.34 + 0.771 0.56 + 2.35t' + 2.73 + 3.69 0.48 -
at ,Ir dom F-V(b), Irg(dom)
-0.1990
+ 7.16 1
E3
E3
dom Ir at 3(b), ,
lrg(dom)
0.1759
0.1173 0.1178 0.7474 0.7693 0.6656
5.28 -
3
2 4 2 2 2 3 1
-
hp2003 BM25FU 0.7516 BM25FP

hp2004 BM25FU
at ,1r dom E3(b),

lrg(dom)
E3 at ,Ir dom 3(b),Irg(dom)
EV(b), Irg(dom) Ey at ,Ir dome EV(b), Irg(dom) Ev at ,Ir dom EY(b), Irg(dom)
at ,Ir dom
0.1976
0.1172 0.1155 0.7492 0.7478 ---
+ 6.41 1
+ 0.26 1.20 0.32 0.51 3 2 4 1
0.6479 BM25FP BM25F np2003 0.7108 BM25FP BM25F np2004 0.6707 BM25FU td2003 I(n)C2FU 0.1455 DLHFP
Ir dom at 3(b),
lrg(dom) at E3(b), E3 Irg(dom)
E3
0.6852
0.7370 0.7074
-
+ 5.76t' 3
at ,1r dom EV(b),

trg(dom) Ev at ,Ir dom
Ev(b), Irg(dom)
0.6469
0.7117 -----
+ 0.13
0.15 2 -
lr ,
dom
at ,Ir dom E3(b), Irg(dom) dom E3 at ,Ir
E3
0.6687 0.1463
-
2 0.301 + 0.55 4
-
Ev
v
at ,r dom v(b), Irg(dom)

at
--0.1534
0.1536
+ 5.43 2
+ 5.57 1
dom cr ,
243
Setting td2004 PL2F

E3(b), E3 at E3(b),
E
lrg(dom)
continued from previous page MAP B +/-% 0.1399 0.1353 0.6719 + 7.04 2 + 3.52 2 + 0.89 2
MAP 0.1378 0.1377 0.6881
+/-%
0.1307 I(ne)C2FP hp2003 DLHFU 0.6660 BM25FA

hp2004 PB2FU
E3
1r dom , lrg(dom) lrg(dom)

lrg(dom)
lr dom at E3(b), , tr dom at E3(b), ,

1r dom at E3(b), ,
1rg(dom)
0.6758
0.6064
+ 1.47 4
+ 9.16F 1 + + + +
d(b),lrg(dom) y at ,1r dom y(b),lrg(dom)
+ 5.43 5 + 5.36 7 + 3.32 ' 1
Ey
EV(b)Irg(dom)
at ,Ir dom
, dom Irg(dom)
0.6855
0.5483
-
+ 2.93' 3
1.30 + 0.50
-
0.5555 DLHFA np2003 PL2FP

0.6846 I(ne)C2FA np2004 PB2F 0.6944 I(ne)C2FA
1 2 2
E3
E3
0.5550
0.6895 0.7028 0.7125 0.7352
0.09 1 0.72 2.66 2.61 5.88t 1 2 2 1
Ey
y
at ,Ir EY(b),
at ,lr
0.6880
-
dom
E3
at ,tr dom
EY(b), Irg(dom) Ey at ,lr dom
0.6959 -
+ 0.22 -
Table B. 4:
EB(f), 1r9(dom)
Evaluation of score-independent domain aggregate-level experiments

and y(f), Irg(dom).
Setting td2003 PL2F 0.1606 PL2FP td2004 PL2F 0.1299 PL2FA hp2003 PL2FU
3(b)
MAP
+/-%
MAP
+/-%
--
3
3
,avg(dir)
0.1515
0.1616
at ,av dir F-3(b),

avg(dir) at E3(b),
+ 0.621 1
5.67 1 -
EV(b), avg(dir)
y
dir av ,
avg(dir) dir
0.1357 0.1303
-
+ 4.46 1 + 0.31 3
-
Ev(b), avg(dir) Ey dir at av EV(b), ,

avg(dir)
at av dir
--
0.1337 0.1331
-
+ 2.93 + 2.46
-
1 1
0.7435 PL2FA
hp2004 PL2FU 0.6674 PL2FP np2003 PL2F 0.6713 PL2FA np2004 PL2F 0.7169 PL2FA td2003 PB2F 0.1417 PB2FA
at av , F-3(b),
avg(dir) 3 at ,av dir F-3(b), avg(dir) 3 at ,av dir E3(b), avg(dir) 3 at av dir avg(dir) E3 at ,av dir
F-3(b) 3 ,avg(dir) at F-3(b), F-3(b),
0.7488 0.6674 0.7060 0.6713 0.6774 0.7510 0.7265 0.1399 0.1421 0.1421 0.1402
td2004 PB2FU
0.1404 hp2003 0.6589 hp2004 0.5677 PB2FP PB2FU PB2FP PB2FU PB2FP PB2F np2003
dir av ,
+ 0.711 3 0.00 1 + 5.78 1 0.00 2 + 0.911 3 + 4.76 1 + 1.341 1 3 1.27 + 0.281 3 + 1.21 1 2 0.14 + 1.651 3
avg(dir)
avg(dir)
0.6648
0.6698
+ 0.90 3 + 2.29 1
1 0.87 -
Ey at ,ao dir y(b),avg(dir) Ey at ,av dir y(b), avg(dir) y at ,av dir EY(b), avg(dir) Ey at av dir EV(b), avg(dir) y at ao dir EY(b), avg(dir) y at ,av dir
---
0.6678 0.6745 0.72 0.7234 0.1417 0.1443
+ + +
-0.52 0.48 0.43 0.91
3 2 2 1 1 1
0.00 + 1.83
0.1390
0.1417 0.6708 0.6587 0.5973 0.6664
y(b), avg(dir)
Ev
at ,av dir F-3(b), at ,av dir E3(b), avg(dir)
0.5807
0.6576
v(b), avg(dir)
Ev at ,av dir EY(b), avg(dir)
at ,av dir
+ 0.931 2 + 1.81 2 0.03 1 + 5.21t 2 + 0.45 2 + 0 541 1
1.00 -
--
E3
0.6634 PB2FP PB2FU np2004

0.7241 PB2FP
0 1283 .
3 at ,av dir E3(b), avg(dir)

E3
3
E3
E F-3(b),
0.6628 0.7018
0.7146
-
3 0.09 2 3.08 -
y at ,av dir EV(b), avg(dir)

Ey
v
Ed E
EV(b),
0.6670
-
at ,av dir
avg(dir)
1 1.31 -
at ,av dir
avg(dir)
td2003 I(na)C2F
I(ne)C2FA
td2004 I(n. )C2F 0.1307 I(ne)C2FP hp2003 I(ne)C2FU 0.7343 I(n.,,)C2FA
at ,av dir F-3(b),

at ,av dir
avg(dir)
avg(dir)
0.1327
-
+ 1.53 1
1 0.52 + 1.24 4
on next vii;
EV(b), avg(dir)
ae ,av
at ,av air
dir
3(b),
E3
at ,ao dir
0 7305 . 0.7434
r-nnfinupl1
y(b),avg(dir) Ev dir at ,av
0 7322 . 0.7283
29 -0 . 0.82 -
2 2
244
Setting hp2004I(ne)C2FU 0.6632 I(ne)C2FP np2003 I(ne)C2F 0.6940 I(ne)C2FP
np2004I(ne)C2F
0.6843 I(ne)C2FA td2003 DLHF
continued from previous p age MAP B +/-% E F-3(b), 0.6632 0.00 1 y(b), avg(dir) avg(dir) E3 0.6774 y + 2.14 2 at ,av dir di at av r , F-3(b), 0.6918 0.32 EY(b), 2 avg(dir) avg(dir) E3 0.6955 Ey + 0.22 2 at ,avg(dir) at ,av dir E3(b),
avg(dir) -
MAP 0.6758 0.6945 0.7001 0.7007 0.6980 0.6756 0.1521 0.1476 0.1330 0.6778 0.6672 0.6249 0.6326 0.5250 0.5360 0.5071 0.5006 0.1768
+/-% 1.90 4 721 . 0.88 0.97 2.00 1.27 + 4.54 + 1.44 2.99 + + + + + + + + + + + 1.01 0.57 0.46 0.76f 0.17 2.271 1.87 0.56
B 3 2 3 5 2 1 2 2 2 2 1 3 2 2 5 2 4 3
E3
V(b),
0.1455 DLHFP W2004 DLHF

0.1371 hp2003 0.6710 hp2004 0.6278 np2003 0.5241 np2004 0.4978 DLHFP DLHFU DLHFP DLHFU DLHFP DLHFP DLHFA DLHFU DLHFP
at ,av dir 3(b), avg(dir) E3 at ,av dir F-3(b), avg(dir) E3 dir at av F-3(b), , avg(dir) E3 dir at av F-3(b), , avg(dir) E3 dir at av E3(b), ,
avg(dir)
0.6828 0.1486 0.1378 0.1355 0.6775 0.6740 0.6158 0.6547 0.5260 0.5262 0.4947 0.5011 0.1934
0.22 + 2.13 5.29 1.17 + + + + + + 0.97 0.45 1.91 4.28t' 0.36 0.401 0.62 0.66
1 2 1 1 2 1 1 2 1 4 1 1 2
td2003 BM25FU 0.1857 BM25FP

td2004 BM25F 0.1169 BM25FA
3 at ,av dir F-3(b), avg(dir) E3 at ,av dir E3(b), avg(dir)
+ 4.15
at ,av dir EY(b), avg(dir) Ey at ,av dir EY(b) avg(dir) , v at ,av dir Ev(b), avg(dir) Ey at av dir V(b), avg(dir) Ey at ,av dir V(b), avg(dir) Ey at ,av dir EY(b), avg(dir) Ev at ,av dir V(b), avg(dir)
avg(dir)
E3
dir at av F-3(b), ,
avg(dir) 3 dir at av F-3(b),
0.1873
0.1145 0.1172
+ 0.861 1
2.05 2 + 0.261 4
Ey
Ey
4.79 + 1.97 1.11 + + + +
at ,av dir EV(b),

avg(dir)
0.1765
0.1192 0.1156
4.95 0.12 -
2
2 3
hp2003 BM25FU
0.7516 BM25FP hp2004 BM25FU 0.6479 BM25FP BM25F np2003 0.7108 BM25FP np2004 BM25F 0.6707 BM25FU
E3 at ,av dir F-3(b) avg(dir) , 3 dir at av 3(b), , avg(dir) E3 at ,av dir F-3(b), avg(dir) 3 at ,av dir
E3(b), avg(dir)
avg(dir)
0.7653
0.7498 0.6494 0.6951 0.7093 0.7076 0.6606 0.6702 0.1483 0.1497 0.1336 0.1411 0.6742 0.6855
+ 1.82
+ + + + + + + + + + + + + 0.24 0.23 7.29t' 0.21 0.45 1.51 0.081 1.92 2.891 2.22 7.961 1.23 2.931 6.26 2.91 3.711 0.45 3.92
4
3 2 2 1 1 2 2 1 1 2 3 4 4 4 4 3 2 4
y(b), avg(dir)
Ev at av dir v(b), avg(dir) v at ,av dir v(b), avg(dir) Ey at ,av dir EV(b), avg(d{r) Ey at av dir v(b),avg(dir) Ey at ,av dir v(b), avg(dir) Ey at ,av dir y(b), avg(dir) v at ,av dir v at ,av dir EY(b), avg(dir) v ae ,av dir EV(b), avg(dir) v at av dir
at av dir
0.7507
0.7622 0.6544 0.6852 0.7182 0.7070 0.6538 0.6275
-
3
2 2 10 1 3 3 1
-
1.41 1.00 5.761 1.04 0.53 2.52 6.44 -
td2003 I(ne)C2FU
0.1455 td2004 0.1307 hp2003 0.6660 hp2004 0.5555 DLHFP PL2F I(ne)C2FP DLHFU BM25FA PB2FU DLHFA PL2FP np2003 0.6846 I(ne)C2FA
E3 3
at ,av dir 3(b),

avg(dir)
at ,av dir E3(b), avg(dir) E3 at ,av dir

E3(b)
0.1613 0.1387 0.1338 0.7021 0.6836
+10.86 + 6.12 + 2.37 + 5.42 + 2.64
2 2 1 4 7
0.6944 I(ne)C2FA
PB2F np2004
,avg(dir) E3 at ,av dir F-3(b), avg(dir) 3 at ,av dir

F-3(b), avg(dir)
0.6440
0.5903 0.7045 0.71 0.6975 0.7216
+15.93* 2
EY(b), avg(dir)
0.6164
0.6279 0.7195 0.6861 0.7295 0.7027
+10.96
+13.03t' + 5.10' + 0.22 + 5.05 + 1.20
2
3 2 5 2 1
E3
at ,av dir
Table B. 5:
83(f), avg(dir)
directory experiments aggregate-level Evaluation of score-independent

and F-V(f), avg(dir)
245
Settin g td2003 PL2F

0.1606 PL2FP
MAP
F -3(b), std(dir)
+/-%
-
B
Ev , at
Ev(b) std(dir)
std dir
0.1599
-
MAP
0.1693 0.1603 0.1295 0.1316
-
0.44 2 + 3.46 2 + 2.77 2

-
+/-%
+ 5.42 0.191 0.31 + 1.311
-
B
2 1 2 2
td2004 PL2F
0.1299 PL2FA
E3(b),
at
dir atd ,

hp2004 PL2FU 0.6674 PL2FP
std(dir) E3 dir at atd 3(b), ,

std(dir)
0.1344 0.1335
-
V(b), std(dir) Ev dir std , at Ev(b) v dir std at , EY(b), std(dir)

Ev ,
std(dir)
E3
E3
3
at F-3(b),
at 3(b),
at F-3(b),
dir std ,
std(dir)
dir std , atd(dir)
0.7463 0.6747
-
+ 0.381 1 + 1.09 1
-
0.6769
-
+ 1.42
-
1
-
np2003 PL2F
0.6713 PL2FA np2004 PL2F 0.7169 PL2FA td2003 PB2F
0.6684
0.6662
dir std ,
std(dir)
0.7447
0.7548 0.1418
0.76 -
0.43 2 1
v(b), std(dir)
v Ev at
ae , std dir dir atd , dir std ,
+ 3.88 3
+ 5.291 2 + 0.07 2 + 1.71 1
-
E3
V(b), atd(dir)
of Ey(b), atd(dir)
F-3(b),
at ,std dir
std(dir)
0.6694 0.6757 0.7388 0.7195 0.1410 0.1448 0.1453 0.1426 0.6720

-
+ + + + + + +
0.28 0.661 3.05 0.36 0.49 2.19t' 3.491* 1.571 1.99

-
4 5 2 1 1 2 4 1 1
-
0.1417 PB2FA
td2004 0.1404 hp2003 0.6589 hp2004 0.5677 np2003 0.6634 np2004 0.7241 PB2FU PB2FP PB2FU PB2FP PB2FU PB2FP PB2F PB2FP PB2FU PB2FP
E3
E3
dir at atd F-3(b), ,

std(dir)
dir atd , std(dir)
0.1403
0.1428
-
0.99 3 -
at F-3(b), E3 3
v(b), std(dir) Ev EV(b), std(dir) Ev dir std , at EV(b),

std(dir) at , std dir
at
, std
dir
dir at std F-3(b), ,

std(dir)
0.6633 0.6599
-
+ 0.67 3 + 0.151 2
-
dir at std F-3(b),

std(dir)
at ,atd dir F-3(b),

std(dir) E3 at
0.5835 0.6575 0.6636 0.7385
+ 2.781 0.89 + 0.03 + 1.99 + 3.12

-
2 1 1 4 1
-
Ev dir atd at , EY(b), std(dir) Ev dir atd at , EY(b), atd(dir)
0.5956 0.6621 0.6658 0.7173

-
td2003 I(ne)C2F 0.1283 I(ne)C2FA
F-3(b), std(dir) E3
at , std dir F-3(b) , std(dir) 3 dir atd at , F-3(b), std(dir)
E3 at
dir atd ,
0.7287
0.1323
-
+ 0.64 1
Ev
v
+ 4.911 1 0.20 2 + 0.361 3 0.94 1 -
at
std dir
v(b),std(dir)
at EV(b), Ed std dir std(dir)
0.1220
0.1327
0.1319 0.1319 -
+ 3.43f 1
+ 0.92 + 0.92f + 2.86 3 1 -
4.91 -
td2004 I(ne)C2F
0.1307 I(ne)C2FP hp2003 I(n, )C2FU 0.7343 I(ne)C2FA hp2004I(ne)C2FU 0.6632 I(ne)C2FP np2003 I(ne)C2F 0.6940 I(ne)C2FP np2004 I(ne)C2F 0.6843 I(ne)C2FA td2003 DLHF 0.1455 DLHFP td2004 DLHF
dir std , F-3(b), std(dir) 3 dir std at F-3(b)std(dir) , 3 at ,std dir E3(b),
std(dir)
0.7302 0.7303 -
0.56 1 3 0.54 -
EY(b), std(dir) v dir std of , EV(b), std(dir) Ev

at std dir
at , std dir
0.6822
0.6883
0.6904 0.6951 0.6894
-
+ 3.78
0.52 + 0.16 + 0.75
-
1
2 1 1
-
E3 E3
E3
F-3(b),
at ,std dir
0.6997 0.6929 0.6941 0.6957
+ 0.82 0.16 + 1.43 + 1.67
1 2 2 1
v(b),std(dir) Ev dir std of , v(b),std(dir) Ev dir atd at ,
std(dir)
atd(dir)
0.15
0.1418
-
+ 3.09 4
3 2.54 -
EV(b), atd(dir)
v at Ev(b),
dir std , std(dir)
at ,std dir E3(b),

at , atd dir E3(b),
0.1515 0.1483
-
+ 4.12 + 1.92
-
1 2
-
0.1371 DLHFP hp2003 DLHFU 0.6710 DLHFP hp2004 DLHFU 0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP I
dir std at F-3(b), ,

std(dir)
E3
atd(dir)
0.1364 0.6727
1 0.511 + 0.25 4
Ev dir atd at , EY(b), std(dir)
0.6758
0.6155
+ 0.72 2
1 1.96 -
EV(b), atd(dir)
at
dir std ,
E3 3
3
F-3(b),
at
dir std ,
std(dir)
std dir std(dir) std dir
at F-3(b),
at
+ 0.19 3 Ev of ,std dir 0.00 1 EY(b), sed(dir) + 2.041 1 Ev at std dir std(dir) + 0.16 2 v(b), + 0.78 1 Ev at std dir continued on next page 0.6290 0.5241 0.5348 0.4986 0.5017
0.6785 0.6769 0.6274 0.6169 0.5333

-
+ 1.12 + 0.88 0.06 1.74 + 1.76

-
2 1 3 2 1
-
0.5041
+ 1.27
2
4
0.5004
+ 0.52
246
Setting
continued from previous page MAP i_er_ n0 -L --,, , ""
ivirir
dir)
t/-
0
1.88
is
td2003 BM25FU
0.1857 BM25FP
td2004 0.1169 hp2003 0.7516 hp2004 0.6479 np2003 0.7108 np2004 0.6707 BM25F BM25FA BM25FU BM25FP BM25FU BM25FP BM25F BM25FP BM25F BM25FU
E3
E3(b), atd(dir) E3(b),std(dir) E3 dir std at , E3(b),

std(dir) at , std dir
0.1790
0.1538
0.1196 0.1134 0.7645 0.7516 0.6606 0.6605 0.7189 0.7117 0.6626 0.6792 0.1422 0.1284 0.1318
3.61 4 + 2.31 2 2.99 2 + 1.72 3 0.00 3 + 1.96 2 + 1.94 1 + 1.14 2 + 0.13 2 1.21 3 + 1.271 2 2.27 1 1 -12.00 + 0.84 2
-17.18
Ev
v 6 std () ,
0.1612
0.1822
-13.19
-
2
1
E3 dir std at , E3(b), $td(dir) 83(a t ,atd dir 3(b), std(dir) E3 at ,std dir E3(b), std(dir) E3 dir etd at ,
E3(b),
td2003 I(ne)C2FU 0.1455 DLHFP td2004 PL2F
0.1307 I(ne)C2FP
hp2003 DLHFU 0.6660 BM25FA hp2004 PB2FU
std(dir) E3 at std(dir) E3(b), std(dir)
E3
EY(b), std(dir) Ey dir std at , F-V(b), std(dir) Ev dir std at , F-V(b), std(dir) v dir std at , F-V(b), std(dir) v dir std at , F-V(b), std(dir) v of ,std dir F-V(b), std(dir) v dir std at , F-V(b), std(dir)
at , std dir
0.1146 0.1175 0.7432 0.7624 0.6565 0.6690 0.7119

-
+ + + + +
1.97 0.51 1.12 1.441 1.33 3.261 0.15

-
1 2 2 2 2 6 3
-
0.6554 0.6460 0.1565 0.1511 0.1359

-
2.28 2 3.68 1 + 7.56 2 + 3.85 2 + 3.98 3

-
E3(b), std(dir) 3 at ,std dir 3(b),std(dir)
at
std dir
0.1321
0.6699 0.6849 0.6040
+ 1.07 1
+ 0.59 3 + 2.84 2 + 8.73 2
F-V(b), std(dir) y at ,atd dir F-V(b), std(dir)
ae
std dir
0.6710 0.6936 0.5517
0.5555 DLHFA
np2003 PL2FP 0.6846 I(ne)C2FA np2004 PB2F 0.6944 I(ne)C2FA
E3
E3(b), std(dir) E3 dir at etd E3(b), ,

std(dir)
at
dir std ,
0.5898
0.6934 0.7111 0.7104 0.7040
+ 6.17 2
+ + + + 1.29 3.871 2.30 1.38 2 2 4 1
E3
at ,std dir
F-V(b), std(dir) Ey at std dir EY(b), std(dir) v dir std of ,
at
std dir
0.5940
0.7070 0.7128 0.7261 0.6827
+ 6.931 1
+ 3.27 1 + 4.12 1 + 4.57 1 1 1.68 -
+ 0.75 1 + 4.14' 2 0.68 1 -
Table B. 6:
83(f), std(dir)
Evaluation
and F-V(f),
directory
aggregate-level experiments
std(dir) "
Setting
td2003 0.1606 td2004 0.1299 hp2003 0.7435 PL2F PL2FP PL2F PL2FA PL2FU PL2FA
E3(b), Irg(dir)
MAP
at ,Ir d;r E3(b), Irg(dir) E3 at ,Ir dir
Irg(dir)
+/-%
+ + + + 4.79 5.351 1.92 5.471 0.73 0.341
-
B
1 1 1 1 1 1
-
F-V(b), Irg(dir) Ev at ,Ir dir V(b), Irg(dir) v at ,Ir dir F-V(b), Irg(dir) v at Ir dir
F-V(b),
MAP
0.1619 0.1327 -
+/- oB
+ 0.81f 1 + 2.16 3 ----
E3(b),
E3 E3
hp2004 PL2FU
0.6674 PL2FP
E3(b),
at Ir dir
Irg(d{r)
0.1529 0.1520 0.1324 0.1370 0.7489 0.7460

-
PL2F np2003 0.6713 PL2FA PL2F np2004 0.7169 PL2FA

td2003 PB2F 0.1417 PB2FA
at ,Ir dir 3(b),

Irg(dir)
0.7062
-
+ 5.811 1
-
Irg(dir)
0.6672
0.6685
2 0.03 -0.42 1
E3 at Ir dir E3(b), Irg(dir) E3 at ,Ir dir E3(b),

Irg(dir)
0.6703 0.7276 0.7163 0.1411
0.15 + 1.49 0.08 0.42
1 3 2 4
E3
td2004 PB2FU 0.1404 PB2FP I
E3(b)
at Ir dir
, Irg(dir)
E3
of Ir dir
v 0.1402 at Ir dir Ev(b),Irg(dir) 0.1450 Ev dir Ir ac continued on next page 1 1.06 + 3.28 2
v at ,Ir dir F-V(b), Irg(dir) v at ,Ir dir F-V(b), Irg(dir)
at ,Ir dir F-V(b), Irg(dir)
0.7169 0.1386
0.1381 0.1409 0.1431
-0.00 2 -2.19 1 -
2.54 1 + 0.36 2 + 1.92 2
247
Settin g
h p 2003 PB2FU
E 3(b), trg(dir)
.... _: _ u. 1. vu. previous .,.. ...
page
MAP 0.6621
f/-% + 0.49 + 2.75

-
B 1 1
-
E
lrg(dir) v(b) , v at !r dir
MAP 0.6651 + 0.94 -

0.5677 PB2FP np2003 PB2F
!r dir at E3(b),
3
0.6636
0.5833
-
+ 0.71t 2
0.6648
-
lrg(dir)
0.6634 PB2FP
np2004 PB2FU 0.7241 PB2FP td2003 I(ne)C2F 0.1283 I(ne)C2FA td2004 I(ne)C2F 0.1307 I(ne)C2FP hp2003I(ne)C2FU 0.7343 I(ne)C2FA hp2004I(ne)C2FU 0.6632 I(ne)C2FP
at , 1r dir 3(b),lrg(dir) 3 at lr dir E3(b), ,
F-V(b) ,Irg(dir)
Ev at , ir dir
+ 0.901 1
0.45 -
trg(dir) E3 at ,Ir dir

E3(b), lrg(dir)
, lr dir
0.6657 0.6579 0.7188 0.7092

-
+ 0.35 0.83 0.73 2.06 -
1 2 2 1
-
F-V(b), Irg(dir) y at ,1r dir EY(b), 1rg(dir) Ev at 1r dir

EV(b),
V at
0.6604 -
E3
trg(dir)
, lr dir
0.1259
-
1.87 -
1
-
at 3(b),
lrg(dir)
F-V(b),
Irg(dir)
3 at ,lr dir E3(b),lrg(dir) E3 1r dir at E3(b), ,

lrg(dir)
0.7343 0.7422
-
0.00 1 + 1.08t' 3
-
v at tr dir V(b), lrg(dir) Ev !r dir at F-V(b),

Irg(dir)
0.7398 0.7359
-
+ 0.75 + 0.22
-
1 1
-
E3
np2003 I(ne)C2F 0.6940 I(ne)C2FP

np2004I(ne)C2F 0.6843 I(ne)C2FA td2003 0.1455 td2004 0.1371 DLHF DLHFP DLHF DLHFP
E3(b),trg(dir) E3 tr dir at , 3(b), lrg(dir) E3 at ,1r dir

E3(b), lrg(dir)
3(b), lrg(dir) E3 at !r dir
at ,tr dir
0.6712
+ 1.211 2
Ey
0.7019 0.7070
0.6993
-
+ 1.14 + 1.87
+ 2.19
-
1 3
2
-
F-V(b), Irg(dir) Ev at !r dir

v
at ,1r dir
0.6632
0.00
0.6934 -
F-V(b), Irg(dir)
at , Ir dir
0.09 -
2 -
0.1454 0.1457
-
0.07 + 0.14
-
2 3
-
F-V(b), Irg(dir) v lr dir at F-V(b), ,

Irg(dir) V
0.1420 0.1453
-
2.41 0.14 -
1 1
-
hp2003 DLHFU
0.6710 DLHFP
E3 , tr dir at E3(b), lrg(dir)

tr dir at 3(b),
lrg(dir)
0.6729
0.6796 0.6222 0.6289
+ 0.28
E3 3
hp2004 DLHFU
0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP td2003 0.1857 td2004 0.1169 hp2003 BM25FU BM25FP BM25F BM25FA BM25FU
E3(b),
at tr dir
+ 1.281 2 0.89 2 + 0.18 2
at ,tr E3(b),
E3
1rg(dir)
dir
1rg(dir)
0.5241
-
0.00
-
1
-
F-V(b), Irg(dir) v
at , tr dir EV(b), trg(dir)
v ae 1r dir F-V(b), Irg(dir) v at ,tr dir
F-V(b), Irg(dir)
at , lr
dir
0.6708
0.6713 0.6086 0.6228
+ 0.05 1 3.06 3 0.801 3 -
0.03 -
0.5286
-
+ 0.86
-
1
-
at ,1r dir E3(b),lrg(dir) E3 dir lr at , 3(b), E3 at ,Ir dir E3(b),trg(dir)
E3
0.5060 0.1810 0.1865
+ 1.651 1 2.53 + 0.43 3 1
1rg(dir)
0.1170
0.1162 0.7459
+ 0.09
at ,tr dir F-V(b), Irg(dir) v at ,Ir dir v at ,tr di, F-V(b), Irg(dir)
0.2000 0.1796
+ 7.70 3.28 0.43 0.64
1 3
0.7516 BM25FP
hp2004 BM25FU 0.6479 BM25FP
E3
E3 E3
dir tr at E3(b), ,
trg(dir)
0.7701 0.6496
0.6532 0.7197 0.7009
-
+ 2.46t' 2 + 0.26 3
+ 0.82 + 1.25 1.39 -
3 0.601 2 0.76 -
F-V(b), Irg(dir)
0.1130
0.1164 0.7468
3.34 -
2
1 2
BM25F np2003
0.7108 BM25FP
1r dir at E3(b),
lrg(dir)
BM25F np2004 0.6707 BM25FU td2003 I(ne)C2FU 0.1455 DLHFP

I
dir tr at 3(b), ,
lrg(dir)
4 2 1
-
Ev ae ,tr di, F-V(b), Irg(dir) v 1r dir ae F-V(b), ,
Ev at tr dir F-V(b), Irg(dir)
0.7574 0.6561
0.6655 0.7124 -
+ 0.771 1 + 1.27 1
+ 2.72 + 0.23 1 1
at ,lr dir E3(b), trg(dir) 3 at tr dir
E3
0.6727 0.1547 0.1469

convinuea
+ 0.301 4 + 6.32 2 + 0.96 3

on aux, pew.
Irg(dir) v at ,Ir dir Ey(b), 1rg(dir) Ev at tr dir
0.6393 0.1527 0.1552
4.68 + 4.95 t 6 67 + .
1 3 1
248
Setting td2004 PL2F

3(b), E3 at
0.1307 I(ne)C2FP hp2003 DLHFU

0.6660 BM25FA hp2004 PB2FU
continued from previous page E MAP B +/-% 0.1295 0.92 y(b), 2 trg(dir) lrg(dir)
MAP 0.1384 0.1372
+/-`Xo + 5.89 + 4.97 + 3.18
B 1 1 1
lr dir , E3(b), lrg(dir)

3 lr dir at 3(b), ,
0.13 0.6712
0.54 1 + 0.78 2
0.6768
+ 1.62 6
Ey lr dir at , EY(b), lrg(dir)

y at ,lr dir
0.6994
0.6872
+ 5.02t' 3 + 2.86 3
0.5555 DLHFA
np2003 PL2FP 0.6846 I(ne)C2FA
at 3(b),
3
lrg(dir)
, lrg(dir) lr dir
0.6042 0.5550 0.6872

0.6859 0.7132 0.6910
+ 8.77 1 0.09 1 + 0.38 1

+ 0.19 1 + 2.71 2 0.49 1 -
Ey(b), lrg(dir) Ey lr dir at , d(b), lrg(dir)

y ae ,lr dir y(b) lrg(dir) , y at ,lr dir
0.5714 ----
np2004 PB2F
0.6944 I(ne)C2FA
lr dir at 3(b), lrg(dir) 3 at ,lr dir
-----
---
Table B. 7: Evaluation of score-independent '2(f), 1rg(dir) and EV(f), lrg(dir)

Setting td2003 PL2F 0.1606 PL2FP td2004 PL2F 0.1299 PL2FA hp2003 PL2FU
E3(b),
directory
aggregate-level experiments
c(sU)p, E3(at),L(SU) ,
E3(b),
MAP 0.1512 0.1522
B +/-% 5.85 2 5.231 1 -
EY(b), L(SU)o, v(at), L(su),
MAP 0.1601 0.1607
B +/-% 0.31 1 + 0.061 I
E3(b),
3(at), L(sU)
L(sU)p, L(sv) E3(at), 3(b),
L(sU)p,
,
,
0.1370
0.1352
-
+ 5.47
+ 4.08
-
4
2
-
EY(b), L(SU)o,
EV(at),L(SU),
EV(b), L(SU)p, L(SU), V(at),
0.1314
0.1323
-
+ 1.15
+ 1.85f
2
2
0.7435 PL2FA
hp2004 PL2FU 0.6674 PL2FP PL2F np2003
L(SU)p, E3(at),L(sU), E3(b), r,(sv)P,
0.6787 0.6680 0.6727
+ 1.69 + 0.09 + 0.21
2 3 2
EY(b), L(SU)P, EV(at),L(SU) , F-V(b), L(SU)pl
0.6674 0.6779
0.00 + 0.98 + 2.48 + 3.10 + + + + + +
2 3
0.6713 PL2FA
np2004 PL2F 0.7169 PL2FA
E3(at), L(SU) ,
r,(sU)p, E3(at),L(SU),
E3(b), E3(b),
0.6719
0.7440 0.7348
+ 0.09
EV(at), L(SU),
v(b),c(sv)p, V(at),L(sv),
0.6687
0.7347 0.7391
0.39 -
1
2 1
+ 3.78 1 + 2.501 3
td2003 PB2F
0.1417 PB2FA td2004 PB2FU 0.1404 PB2FP hp2003 PB2FU 0.6589 PB2FP hp2004 PB2FU 0.5677 PB2FP
E3(at),L(sU),
3(b),
r,(sU)P,
0.1402
0.1430 0.1444 0.1443 0.6648 0.6705 0.5699 0.6570
1.06 + + + + + +
2
1 2 1 1 1 1 1
L(SU)p, E3(at),t(sU) , 3(b),L(sU)p, 3(at), t(SU), L(sU)P, 3(at), L(sU), E3(b), L(sU)p,
3(b),
3(b),
0.92 2.8577 2.78t 0.90 1.761 0.39
np2003 PB2F 0.6634 PB2FP

PB2FU np2004 0.7241 PB2FP
E3(at), t(sU),
L(sU)p, 3(at), L(sU)
E3(b),
0.6570
0.7221 0.7298
0.28 + 0.79
0.96 -
0.96 -
v(ae L(su , , EY(b), L(sU)p, v(at), L(su) , F-V(b), L(SU)pl EY(at),L(SU), F-V(b), L(SU)pl v(at), L(su) , v(b),L(SU)p,
v(b), L(SU)y,
0.1403
0.1459 0.1431 0.1414 0.6644 0.6671 0.5694 0.6683
0.99 -
2
4 2 1 4 2 2 2
2.96t' 1.92 0.71f 0.83 1.241 0.30
+ 0.74
1
2 2
v(at),L(SU) ,
E1(b),L(SU)o, v(at), L(SU) , Ev(at), L(SU), V(b),L(SU),, v(at), L(SU),
0.6675
0.7466 0.7060 0.1335 0.1302 0.1322
+ 0.62
+ 3.11' 2.50 -t + 4.05t 0.381 + 1.15t
1
5 1 3 1 I
td2003 I(ne)C2F 0.1283 I(n. )C2FA td2004 I(ne)C2F 0.1307 I(n. )C2FP hp2003I(ne)C2FU
0.7343 I(ne)C2FA
3(b),
L(SU)p, E3(at), L(sU) ,

L(SU)P,
0.1383
0.1348 0.1342 0.1318
+ 7.79
F-V(b), L(SU)pl
3(at),L(sv) , 3(b), L(SU) pt

E3(at), r,(su) i
+ 5.071 2 + 2.68 2 + 0.841 1 + 0.65
0.7328
0.7391
0.20 -
2
2
F-V(b), L(SU)pl
EV(at),L(SU),
0.7457
0.7397
+ 1.55
+ 0.74
3
2
249
Settin g hp2004 I(n )C2FU e 0.6632 I(ne)C2FP np 2003 I (n e) C2F 0.6940 I(ne)C2FP
continued
L(su) P, E3(at) L(SU) , , E3(b), L(sv) p, E3(at), L(SU) ,

E 3(b),
MAP 0.6880 0.7158 0.6960
from previous page
+/-% + 3 74 . + 3.14t* + 0.291

+ 1.20
B 2 2 2
2
Ev(b), L(su) D , Ev( at), L(su), EV(b)L ( su ), , A y (at) L(sv >,

L(SU)
MAP 0.6897 0 7060 . 0 6990 . 0.7181 0.7236 0.1522 0.1479
+/-%
np2004 I(n e)C2F

td2003 DLH F 0.1455 DLHFP
3(b),
L(SU)PI
0.6843 I(ne)C2FA
0.6925
0.7223 0.1482 0.1462
-
y(b),
td2004 DLHF
0.1371 DLHFP
E3(b),
L(SU) , 3(b), L(sv)P, E3(at), L(SU) i

L(SU), i
E3(at),
+ 5.55t' 2 + 1.86 + 0.48

-
,,
+ 4.001 4 + 1 73 2 . + 0 72 1 . 4.941* 2
+
2 2
-
F-Y(b),
V(at),L(su) , Ey(b),L(SU) , , EV(at),L(SU),

L(SU) , ,
+ 5 74t' 1 . + 4.60 3 + 1.65 2
3(at),
3(b),
0.1351
-

hp2004 DLHFU 0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP W2003 BM25FU
L(SU),
0.1355
1.171 1 -
L(su)P, E3(at),L(SU), 3(b), L(su)p, E3(at),L(SU), E3(b), L(su)p, E3(at),L(SU),

3(b),
3(b),
L(su)p, E3(at), L(SU) ,
0.6763 0.6649
0.6555 0.6228 0.5279 0.5249 0.5178 0.5082 0.1964
+ 0.79 0.91 + 4.41 0.80 + 0.73 + 0.151 + 4.02 + 2.09 + 5.76
2 1
3 2 4 2 2 3 3
E. (b),L(su),, EY(at),L(sv), Ev(b) L(sU) , ,, y(at), L(SU), Ey(b),L(su),, EY(at),L(SU), d(b),L(su) , P y(at), L(SU), Ed(b), L(su) p, EV(at),L(SU), EV(b), L(SU),, EV(at),L(SU), Ev(b),L(SU),, EV(at),L(SU), Ev(b),L(sv),, Ev at ,L su EV(b), L(sv)y, Ey(at),L(SU),
EY(b), L(sv) , , EY(at), L(SU),
EV(at),L(SU) ,
1.46 -
1 2 2
0.6795 0.6702
0.6294 0.6317 0.5369 0.5137 0.4956 0.1830
+ 0.25 3 + 0.621 3 + 2.441 + 3.19 0.44 1.45 2.65 0.34 1.49 0.411 2.05 6.47t' 0.56 1.22 + 2.04 + 1.541 + + + + + + + 1 4 1 2 2 2 5 2 2 2 4 1 1 2
+ 1.27 0.12 -
L(sU)p,
0.1857 BM25FP
W2004 0.1169 hp2003 0.7516 hp2004 0.6479 np2003 0.7108 np2004 0.6707 BM25F BM25FA BM25FU BM25FP BM25FU BM25FP BM25F BM25FP BM25F BM25FU
3(at),L(SU) ,
L(su)P, 3(at), L(SU), 3(b), L(SU)p, E3(at),L(SU) , E3(b), L(SU)P, 3(at), L(SU) , 3(b),L(SU)p, E3(at),L(SU) , E3(b), L(SU)p, E3(at),L(SU),
E3(b), 3(b),
0.1944
0.12 0.1136 0.7544 0.7624 0.6844 0.6552 0.7224 0.7126 0.6790 0.6807
+ 4.68
+ 2.65 2.82 + 0.37 + 1.441 + 5.63' + 1.13 + 1.63 + 0.25 + 1.24 + 1.491
3
4 1 2 3 5 2 2 1 2 2
0.1692
0.12 0.1173 0.7628 0.7547 0.6612 0.6898 0.7148 0.7021 0.6844 0.6810
8.891 2 -
td2003 I(ne)C2FU
0.1455 DLHFP td2004 PL2F 0.1307 I(ne)C2FP hp2003 DLHFU 0.6660 BM25FA hp2004 PB2FU 0.5555 DLHFA
np2003 PL2FP
0.6846 I(n. )C2FA np2004 PB2F
E3(b),
, 3(b), L(SU)P, E3(at), L(SU) , E3(b), L(SU)P, E3(at),L(SU) , 3(b), L(SU)p, 3(at), L(SU) ,
3(at), L(SU)
L(SU) P,
0.1432
0.1484 0.1433 0.1337 0.6670 0.6796 0.5612 0.5877
1.58 + + + + + + +
2
3 3 3 3 5 3 1
1.991 9.64 2.30 0.15 2.04 1.03 5.80
EV(at),L(SU) , EV(b), L(SU)p, Ev(at),L(su) , EY(b), L(SU)p, EY(at),L(SU) , EV(b), L(SU),, EY(at),L(SU),
EV(b), L(SU) ,,
0.1607
0.1571 0.1287 0.1322 0.6943 0.6589 0.6132 0.5707
+10.45
+ 7.971 1.53 + 1.15 + 4.25 1.07 +1 0.39 + 2.74
3
4 1 5 5 3 3 1
E3(at), L(SU) , 3(b),L(SU)P,

3(at),
L(SU)p,
0.6899
0.6946 0.7312
+ 0.77
+ 1.46 + 5.30
1 3
2
EY(b), L(SU),,
Ey(b), L(SU),, EY(at), L SU)
0.7213
0.7013 0.7373
+ 5.36
+ 2.44 + 6.18'
4
1 3
0.6944 I(ne)C2FA
L(SU),
0.7382
+ 6.31
EY(at), L(SU),
0.7460
+ 7.431 1
8d(f), E3(f), Table B. 8: Evaluation of score-dependent experiments t(su),,, L(sv)P, and
250
Settin g td2003 PL2F 0.1606 PL2FP td2004 PL2F

0.1299 PL2FA
MAP
0.1510
+/-%
5.98 -
B
2 2 Ev(b) c(su) ;n , Ey at ,c SU " Ev(b>L(su) ;r ,
i. at , L(SU) L(SU); , ,,
)
MAP
0.1611 0.1310
3(b),L(SU);,,
+/-9e
+ 0.31 + 0.85
-
B
1
SU at ; 63(b), ,L
F-3(b), L(SU)in
0.1617
0.1339
+ 0.68t 3
+ 3.08

hp2004 PL2FU 0.6674 PL2FP np2003 PL2F
L(SU);,, 3at, LSVi
0.1317
-
+ 1.39
-
4
-
0.1315
-
EY(b)
+ 1.231 2
-
sv at " F-3(b), ,L
0.6713 PL2FA
np2004 PL2F 0.7169 PL2FA td2003 0.1417 td2004 0.1404 hp2003 0.6589 hp2004 0.5677 np2003 0.6634 np2004 0.7241 PB2F PB2FA PB2FU PB2FP PB2FU PB2FP PB2FU PB2FP PB2F PB2FP PB2FU PB2FP
L(SU)in 3 at ,L Sv E3(b), L(SU); 3 at SU ; ,L E3(b), L(sU); F-3(at), L(SU)in

F-3(b),
0.6782 0.6663 0.67 0.7407 0.7242 0.1406 0.1426 0.1450 0.1476 0.6660 0.6604 0.5699 0.5862 0.6570 0.6570 0.7425 0.7295 0.1427 0.1353
-
+ 1.62 + + + + + + + + + + + 0.74 0.19 3.32 1.02
Ev
3 2 1 1
at ,L F-3(b), L(SU)in E3(at),L(SU)jn E3(b),L(SU); E3 (SU)in L at F-3(b), L(SU)in

E3 at ,L (SU)in F-3(b),
L(SU)in
E3
L(SU)in
(SU)in
SU at " ,L F-3(b), L(SU)in

F-3(b),
F-3(at), L(SU)jn
0.78 2 0.64 1 3.28T*28 ' 2 . 5.131 1 1.08 1 0.231 2 0.39 1 3.261 2 0.96 1 0.96 1 2.54 4 0.75 2
F-V(at),L(SU)in
L(SU);,, EV(at),L(SU)j n y(b),L(su); ,, F-V(at), L(SU)in EY(b), L(SU);,, Ey at ,t Su " E(b), L(SU)i Ev at ,L SU " EV(b), L(SU);., F-V(at), L(SU)in v(b),L(SU), n
at ,c SU " V(b),
0.6796 0.6807 0.6664 0.6687 0.7364 0.7376 0.1412 0.1408 0.1423 0.1406 0.6693 0.5735 0.6670 0.6574 0.7461 0.7226 0.1302 0.1322 0.74 0.7358 -
+ 1.83 + 1.991 0.73 0.39 + 2.72 + 2.89 + + 0.35 0.64 1.35 0.14
2 1 5 1 4 1 1 5 2 1
d(b),L(su);,, FV(at),L(SU)in Y(b), L(sv);,, e-V(at), L(SU)in y(b),L(SU); F-V(at), L(SU)in d(b),L(SU);,,
V(b), L(SU); n
+ 1.581 2 + 1.02 1 + + 0.54 0.90 3.04 0.21 0.38 1.15f 0.78 0.20 + + + + 1.02 1.20 4.62 ' 5.74t' 2 2 3 1 1 1 1 2 2 1 2 1 1 1 1 -
td2003 I(ne)C2F
0.1283 I(ne)C2FA td2004 I(n, )C2F 0.1307 I(ne)C2FP hp2003 I(ne)C2FU 0.7343 I(ne)C2FA hp2004I(ne)C2FU
E3(b),
L(SU)in E3 at), L SU
L(SU); n
2 +11.22 + 5.461 3
-
F-V(at),L(SU)in
F-3(at), L(SU)in
F-3(at),
F-3(b)
E3(b),L(su);
L(SU)in
0.6632 I(ne)C2FP
np2003 0.6940 np2004 0.6843 I(n, )C2F I(ne)C2FP I(ne)C2F I(n)C2FA
F-3(at), L(SU)in
L(SU)i n , E3 at L SU 3(b), L(SU); n E3 at ,L Su L(su); n
F-3(b)
L(SU)i n ,
0.1318 0.7340 0.7353 0.6706
+ 0.841 0.04 + 0.14 + 1.12
1 2 1 3
Ed at ,L su E(b),L(SU); y at ,L SU Ey(b),L(SU){.,
+ + +
0.6654
0.7088 0.6962 0.7123 0.7237 0.1451
+ 0.33
+ + + + 2.13 0.321 4.09 5.76t'
1
2 2 4 3 4
Ey
td2003 DLHF
E3(b),
0.1455 DLHFP
td2004 DLHF
F-3(b), L(SU)in
F3(at), 3(b), L(SU)jn
F-3(at), L(SU)jn
0.1465
-
+ 0.69
-
0.27 -
EY(b), L(SU);,, Ed at ,L SU EV(b), L(SU);r Ey at L SU Ey(b),L(su);,,
at L su
0.6986
0.7011 0.7023 0.7159 0.7236 0.1529 0.1471 0.1351 -
+ 5.341 4
2
-
0.1371 DLHFP
v(b),L(SU); n F-V(at), L(SU)in FV(at),L(SU)dn
at ,L SU
+ 5.09' + 1.10 1.46' -
hp2004 DLHFU 0.6278 DLHFP DLHFP np2003

0.5241 DLHFA
at ,L SU F-3(b)
F-3(b),
L(sv);,,
0.6743
0.6689
+ 0.49
0.31 -
2
3
EV(b), L(SU)i
0.6752
0.6689
+ 0.63
0.31 -
3
1
L(SU)in , F-3(at),
L(SU)in
0.6481
0.6254
+ 3.23
0.38 -
2
2
L(SU)in
Ev(b), L(su);n FV(at), v(b), L(sv);, v at su ,L
0.6563
0.6209
+ 4.54
1.10' -
4
3
L(SU)im
0.5319
-
+ 1.49
-
5
-
DLHFU np2004 0.4978 DLHFP
at ,L su " 3(b),
L(SU) ;n
0.5364
0.5104
F-3(at), L(SU)in
0.5086 + 2.171 3 FV(a0. L(SU)in continued on next page
v(b),c(su);,,
-f + 2.35t I
+ 2.53
251

MAP /_Q1 uo,.... -ivirir t1-/o t3
td2 003 BM25FU

0.1857 BM25FP
td2004 BM25F
0.1169 BM25FA
3(b), L(SU);,, 3 at SU ;,, 3(b), ,L

L(SU)in E3 at ,L Su " F-3(b), L(SU)in E3 sv at { 3(b), ,L L(su); n F-3(at), L(SU)in
E3(b), L(SU)jn
0.1954 0.1753
0.1193 0.1135 0.7515 0.7636 0.6509 0.6693 0.7094 0.7151 0.6779
hp2003 BM25FU
0.7516 BM25FP hp2004 BM25FU 0.6479 BM25FP np2003 BM25F 0.7108 BM25FP
np2004 BM25F 0.6707 BM25FU

td2003 I(ne)C2FU
at ,L SU " F-3(b),
L(SU)in
+ 2.05 2.91 0.01 + 1.60 + 0.46 + 3.301 0.20 + 0.60 + 1.07
+ 5.22t 3 5.60t 4 2 3 2 3 4 1 2 1 3
E3
E3
F-3(b),
at ,L SU
L(SU)in
L(SU)in
L(SU){n
0.6773
0.1404 0.15
+ 0.981 3
3.51 2 + 3.091 3
i , Ed at ,L Su EY(b)L(su)i , * Ey at ,L SU y(b), L(su); , Ey at ,L SU " EV(b), L(su); ,, EY(at),L(SU)in y(b),L(SU);,, Ey at ,L Su y(b),L(SU)i,. y(b),L(su)in d at ,L sU
Y(b)L(SU)
0.1730 0.1908
0.1217 0.1190 0.7620 0.7511 0.6563 0.6832 0.7192 0.7184 0.6844
6.84 1 + 2.75t 4 -* -24.11 +

+ 1.80 + 1.38 0.07 + 1.30 + 5.45t" + 1.18 + 1.071 + 2.04 + 7.29 0.69 4 3 1 2 2 4 3 1 1 2
at ,L sv
0.6810
0.1561 0.1445
+ 1.541 2
0.1455 DLHFP
td2004 PL2F 0.1307 I(ne)C2FP
3(b),
at
0.1405
0.1367 0.6600 0.6772 0.5628
+ 7.50 2
+ 4.591 0.90 + 1.68 + 1.31 + + + + 1.43 2.48 7.03' 4.13 2 1 4 3 3 3 4 3
hp2003 DLHFU
0.6660 BM25FA hp2004 PB2FU
F-3(b),
F-3(at), L(SU)in
L(SU)in
SU at " ,L F-3(b),
L(SU)in F-3(at), F-3(b),
E3
V(at)1L(SU)1 EY(b), L(sv),,,

V(at)L(SU)
Ey(b), L(su)i,
0.1289
0.1350 0.7038 0.6608 0.6038
0.5555 DLHFA
np2003 0.6846 np2004 0.6944 PL2FP I(ne)C2FA PB2F I(ne)C2FA
L(SU)in
L(SU)in
0.5933
0.6944 0.7017 0.7432 0.7231
+ 6.80 2
d(b),L(5U){, t
+ 3.29 3 + 5.68 ' S 0.78 4 + 8.69 4
1.38 -
F-3(at), L(SU)jn 3(b),L(SU); E3 at ,L SU "
EV(b), L(SU)i,, F-V(at), L(SU)in y(b),L(sv)r Ey at ,L SU
of L Sv
0.5607
0.7149 0.6904 0.7368 0.7468
+ 0.94
+ + + + 4.43 0.85 6.11 7.551
1
5 3 3 1
Table B. 9: Evaluation of score-dependentexperiments E3(f),L(su);,, and Ed(f),t(su);,,

Setting E
3(b), L(SU')p,
MAP
-
+/-%
-
B
Ev(b),
E L(SU1)p,
MAP -
+/-% + 2.00
B 2
td2003 PL2F 0.1606 PL2FP

td2004 PL2F
3(at),L(SU'),
E3(b),L(SU')p,
0.16
0.1320
+ 1.62
-
1 0.371 2
EY(at), L(SU'),
Ev(b),L(SUI)p,
Ev(b),
0.1609
0.1325
+ 0.191 2 + 1.541 2
0.72 4.081 0.74 0.37 1.33 0.46 1 1 1 1 3 1 2
0.1299 PL2FA
3(at),L(su') , 3(b),
L(su')p,
0.1318
-
+ 1.46
+ 5.15 + 1.501 0.77 -
2
-
Ey(at), L(sus) ,
L(SU')p,
0.1319
-
E3(at), L(SUI)
3(b),
hp2004 PL2FU
0.6674 PL2FP
PL2F np2003 0.6713 PL2FA PL2F np2004 0.7169 PL2FA td2003 PB2F
0.1417 PB2FA td2004 PB2FU
L(su')p, 3(at), L(SU') , E3(b), L(su )p,
0.7018 0.6774 0.6661
1 2 1
v(at), L(SUI), EV(b), L(SU')p, EV(at),L(SU'), El(b), L(SU')p,
3(at),L(su') , 3(b),
L(SUI)p,
0.6736
0.7518 0.7353
+ 0.34t 1
+ 4.87 4 + 2.571 2
Ev(at), L(SU),
EY(b), L(SU')p, EY(at),L(SU') ,
E3(at), L(suI),
3(b),
0.6722 0.6946 0.6663 0.6688 0.7264 0.7202
+ + + +
E3(at),L(SU'), E3(a),L(su )p,
L(su')p,
0.1456
0.1449
+ 2.75
+ 2.26
2
5
Ey(b), L(sus)p,
0.1465
0.1436
+ 3.39t* 3
+ 1.34
E3(at), L(SU') , 3 b), L(SU')
0.1456 + 3.70T* 4 Ev(b), L(su')p, t 0.1475 + 5.06 2 EV(at), L(SU') , 0.6604 + 0.23 1 EV(b), L SU') , contin ued on next page
EY(at),L(sw) ,
0.1413
0.1385
+ 0.64
1.35 -
1
2
0.6649
+ 0.91
252
Settin g 0.6589 PB2FP hp2004 PB2FU 0.5677 PB2FP
np2003 PB2F 0.6634 PB2FP

np2004 PB2FU 0.7241 PB2FP
continued from previous page E MAP +/-% B E3(at),L(SU') 0.6713 + 1.88 1 EV(at), , L(SU'), E3(b),L(su')p, 0.5744 + 1.18 1 -F-V(b)L(SUI) , p 3(at), 0.5824 + 2.591 1 EY(at), L(SU') i L(SU'), 3(b),
E3(b),
MAP 0.6716 0.5744 0.5895
+/-% + 1.93 + 1.18 + 3.841
B 2 1 2
L(sU')p, 3(at),L(SU') t
0.6616 -
L(SU')P, E3(at), L(SU'),

E3(b), L(SU') P,
0.7086 0.7518
-
0.27 -
2.14 3 + 3.831 2
-
td2003 I(ne)C2F 0.1283 I(ne)C2FA

td2004 0.1307 hp2003 0.7343 I(ne)C2F I(ne)C2FP I(ne)C2FU I(ne)C2FA
EY(b)L(SU') , , p v(at), L(SU')

v(b) , L(SU' )y,
v(b)L(su, )n, , V(ot), L(SU'),
0.6676
0.7308 0.6986
-
+ 0.63
+ 0.93 3.52 -
1
4 1
-
E3(at), L(SU'),
E3(b),L(SU')pj E3(at),L(SU')
E3(b),
0.1283
0.7359 0.7399 0.6690 0.7040 0.7014 0.6874 0.6883 0.1439 0.1464
-
0.00t 2
-
EY(at), L(sU, ) , V(b),

L(SU')p,
hp2004I(ne)C2FU
0.6632 I(ne)C2FP
np2003 I(ne)C2F
0.6940 I(ne)C2FP np2004 I(ne)C2F 0.6843 I(n. )C2FA
L(sU')p, 3(at), L(SU') , 3(b), L(sU')P, 3(at), L(SU'), E3(b), L(SU')p, E3(at),L(SU'), E3(b), L(SU')P, 3(at), L(SU') , L(su')P, 3(at), L(SU') , 3(b),
L(SU')y, 3(b), E3(b),
+ 0.22 + 0.76 + 0.87 + + + + 1.44 1.07f 0.45 0.58
2 2 2 2 3 3 2
td2003 DLHF
0.1455 DLHFP td2004 DLHF 0.1371 DLHFP
3 1.10 + 0.621 3
-
EV(at),L(SU'), v(b),L(su')p, EY(at),L(su'), v(b),L(su')p, v(at), L(SU') , Ev(b),L(su')p, v(at), L(SU'), v(b),L(su')p, EV(at),L(SU') , v(b),L(su')P, V(at),L(SU'),
V(b), L(SU')p,
0.7365 0.7435 0.67 0.6853 0.7043 0.6978 0.7067 0.1436 0.1504

-
+ + + + + +
0.30 1.25 1.03 3.33f 1.48 0.55
1 1 2 4 5 1
+ 3.271 1 1.31 + 3.37

-
2 4
-
3(at), L(SU')
hp2003 DLHFU
0.6710 DLHFP hp2004 DLHFU 0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA
np2004 DLHFU
0.4978 DLHFP W2003 BM25FU 0.1857 BM25FP
L(SU)p, E3(at),L(SU') , E3(b), L(SU')p, 3(at), L(SU'), 3(b), L(sv')p, 3(at), L(sU') , 3(b), L(SU')P, 3(at), L(SU') ,
E3(b),
0.6743 0.6788 0.6279 0.6166 0.5360 0.4941 0.5081 0.1911 0.1925
+ + + +
0.49 1.161 0.02 1.78 2.27
1 4 2 2 5 2 3 1 3
0.74 + 2.07 + 2.91 + 3.66
v(at), L(SU') , v(b),L(su')p, EV(at),L(SU'), Ev(b),L(SU')p, EY(at),L(SU'), EY(b), L(SU')p, v(at), L(SU') , v(b),L(SU)p, v(at), L(SU'), EY(b), L(SU')pt EV(at),L(SU),
0.6769 0.6683 0.6297 0.5241 0.5302 0.4966 0.1836 0.1829
+ 0.88 0.40 + 0.311 0.00 + 1.16t' 0.24 1.13 1.51 -
1 2 2 1 1 1 3 3
L(su )P, 3(at), L(SU'),
td2004 BM25F 0.1169 BM25FA hp2003 BM25FU

0.7516 BM25FP
3(b), L(sv'),, E3(at), L(SU')

E3(b),
,
,
E3(at),L(SU')
E3(b),
L(sv')p,
0.1191 0.1167 0.7569

0.7712 0.6696
+ 1.88 0.17 + 0.71

+ 2.61 + 3.35
3 1 2
4 2
EY(b), L(su')y, v(at),L(SU') , v(b), L(SU')p,

EY(at),L(SU'), EV(b), L(SU')p,
0.1185 0.1209 0.7540

0.7572 0.6643
+ 1.37 1 + 3.421 3 + 0.32 3

+ 0.751 1 + 2.53 2 1 + 5.43 2
hp2004 BM25FU 0.6479 BM25FP np2003 BM25F 0.7108 BM25FP BM25F np2004 0.6707 BM25FU W2003 I(n. )C2FU 0.1455 DLHFP W2004 PL2F
L(SU')P,
E3(at), L(SU'), 3(b), L(SU')p,

E3(at),L(SU')
E3(b),
0.6718 0.7160
0.7152
+ 3.69t' 2 + 0.73 3
+ 0.62 4
EV(at), L(SU'), EV(b), L(SU')p,

v(at), L(SU')
0.6831 0.7127
0.7026
+ 0.27
1.15 -
3
1
L(SU')p, E3(at sv') , ,L E3(b), L(su')p, E3(at),L(SU') t 3(b), L(SU) ,
0.6851 0.6745 0.1377
+ 2.15 + 0.57
2 2 -
+ 5.36
v(b), L(SU')p, v at ,L su') , Ev(b), ' L(SU)p, Ev(at), L(su') v(b), L(SU') ,
0.6718 0.1655 0.1470 0.1343
+ 0.16 +13.75 ' + 1.03 + 2.75
2 3 1 3
253
continued
Setting 0.1307 I(ne)C2FP hp2003 DLHFU 0.6660 BM25FA hp2004 PB2FU 0.5555 DLHFA
E L(su') , E3(b), L(suI)P, 3(at), L(su') E3(b), L(suf)p, E3(at),L(SUI),

3(b), E3(at),
from previous page
MAP 0.1316 0.6544 0.6775 0.5998 0.5739
+/-% + 0.69 1.74 + 1.73 + 7.97 + 3.31
B 2 1 6 4 3
np2003 PL2FP
0.6846 I(ne)C2FA np2004 PB2F 0.6944 I(ne)C2FA
E3(at),L(SU'), E3(b), L(su')P, 3(at), L(SU'),
L(sU')p,
0.6935
0.7266 0.7407 0.7280
EV(at),L(SU'), Ev(b),L(sus)P, EV(at),t SU') , Ev(b),L(sus)p, EY(at),L(SU'),
MAP 0.1355 0.6829 0.6806 0.5939 0.5751
+/-% + 3.67 + 2.54 + 2.19 + 6.91' + 3.53
B 2 1 4 3 1
+ 1.30
+ 6.13t' 3 + 6.67 4 + 4.84 1
ti(b), L(5U')n,
0.6914
0.6925 0.7173 0.7155
+ 0.99 2
+ 1.15 2 + 3.30 2 + 3.04 1
EY(at),L(su') , Ey(b),L(su')P, y(at), L(SU'),
Table B. 10: Evaluation of score-dependentexperiments E3(f), F-d(f), L(su!)p, and L(sv')p,

Setting td2003 PL2F 0.1606 PL2FP td2004 PL2F 0.1299 PL2FA hp2003 PL2FU E
3(b), L(suI)t
MAP
--
B
v(e>,
L(su>;,,
MAP
-
+/-%
-
F-3(at), L(SUI)in E3(b), L(SU'){n F-3(at), L(SUI)in

E3(b), L(su');,,
0.1333 0.1348
-
-+ 2.62 2 + 3.771 2
--
v at ,t sui v(b),L(sU'),,, F-V((kt), L(SU')in

F-V(b), L(SU')in
0.1599 0.1329 0.1290

-
0.441 2 + 2.31 2 0.691 2 -
0.7435 PL2FA
hp2004 PL2FU
0.6674 PL2FP np2003 PL2F

0.6713 PL2FA np2004 PL2F 0.7169 PL2FA td2003 0.1417 td2004 0.1404
E3(at), L(SUI)in E3(b), L(sul)i 3 SUS at E3(b), ,L

E3
0.6807 0.6774
0.6711 0.7472 0.7477 0.1453 0.1456 0.1436 0.1480
-+ 1.99 1 + 1.501 2
-1 0.03 + 4.23 2 + 4.301 2 + + + + 2.54 2.75 2.28 5.411 2 1 2 2
L(su');,,
v at ,L SUS EV(b), L(SU');,, Ev at ,t SU' F-V(b),
0.6722
0.6640 0.6688 0.7283 0.7180 0.1452 0.1387 0.1406 0.1426
+ 0.72 1 + + 1.09 0.37 1.59 0.15 1 1 2 1
at ,L (SU')in E3(b),
PB2F PB2FA PB2FU PB2FP
L(su');,, 3 at ,L sui
E3(b),
L(su');,, E3 at ,r, sui 3(b), L(su'){,

F-3(at),L(SU')in
E3(b),

L(su'); E3 SUS at E3(b), ,L L(su');,,

E3(b), L(SU');,,
0.6606 0.6604 0.5744

0.5831 0.6649
+ 0.26 2 1 2 0.23 + + 1.18 1

1 1 2.71 + + 0.23 3
L(SU')in y at ,L sUI v(b),L(SU');,, v at ,L sui EV(b), L(sU')i,, v at), L(SUI)in EV(b), L(SU')t,, Ev at t sui
+ 2.47 2.12 + 0.14 + 1.57
3 3 1 2
Ev(b), L(su');,, F-V(ctt), L(SU')in EV(b), L(SU')4

Ev at ,L sui E1(b),L(sU');,,
0.6718 0.5743
0.5884 0.6576
1 1.96 2 + + 1.16 1
+ 3.65f 2 2 0.87 -
0.7241 PB2FP
PB2F np2003 0.6634 PB2FP np2004 PB2FU td2003 I(ng)C2F

0.1283 I(ne)C2FA td2004 I(ne)C2F 0.1307 I(ne)C2FP
F-3(at)L(SU')in ,
L(SU);,, E-3(at), L(SU')in

E3(b), L(SU'),,,
L Sui at), E3(b),
E3
0.6569
0.7112 0.7264
-
2 1.78 + 0.32 2
--
1 0.98 -
EY(b), L(SU')i,, F-V(at), L(SU')in

F-V(b), L(SU')in L sU%
E-V(at), L(SU')in
0.6642
0.7255 0.7220
-
+ 0.12 1
+ 0.19 2 0.29 1 -
F-3(at), L(SU')in
E3(b),
L(SU');,,
hp2003 I(ne)C2FU 0.7343 I(ne)C2FA

hp2004 I(n. )C2FU
0.6632 I(n. )C2FP
L(SU');n E3 SUS at E3(b), ,L L(SU');,, E3 at ,L SU'
E3(b),
3 at c SU'
0.1336
+ 2.221 1
--
F-V(b), L(SU')in
Ev
0.1313
-
+ 0.46 1
-
0.7304 0.7371 -
2 0.53 + 0.38 3 --
v(b), L(sU');,, Ev ae,t sui 9v(b), L(su);,, Ev ac ,[, sU'
at L sui
0.7221 0.7406 0.6593 0 6853 .
+ +
1.66 3 0.86 2 0.59 2 3 3312 .
254
Setting np2003 I(ne)C2F 0.6940 I (ne)C2FP
np2004I(ne)C2F
0.6843 I(ne)C2FA
td2003 DLHF
0.1455 W2004 0.1371 hp2003 0.6710 DLHFP DLHF DLHFP DLHFU DLHFP
hp2004 DLHFU
0.6278 DLHFP np2003 DLHFP 0.5241 DLHFA np2004 DLHFU 0.4978 DLHFP td2003 BM25FU
0.1857 BM25FP
td2004 BM25F
continued from previous page MAP B E 3(b), 0.7073 + 1.92 2 d(b) L(SV L(su'); n ,. ) 3 at . ,,, 0.7124 2.65t' 5 + Ey r, Su ) jn , L SU' at , 3(b), 0.6842 - 0.02 1 v(b) L(SU');,, )t L(SU' 3 , 0.7030 + 2.73 2 y at ,z Sui t SU' at " E3(b), 0.1432 1.58 Ev(b), 3 L(SU'){n L(SUl); 3 n 0.1455 0.00 F-V(at 2 ) Sui at jn L(SU') ,z 3(b), , 0.1373 + 0.15 1 Ed(b)L(su' );,, L(su');,, , 3 0.1393 1.601 Ey + 3 SUl r, at , sU' at ;,, ,L 3(b),L(su'); 0.6723 + 0.19 3 V(b),c(SU1)i n F-3(at), 0.6719 + 0.13 4 y L(SUI)jn su' at " E3(b), ,t 0.6282 F-V(b), + 0.06 5 L(sul);,, L(SU')in 3 0.6468 F-V(at ), at + 3.03 3 SUS r, L(SU')in , E3(b),L(su'),,, 0.5314 + 1.39 3 F-V(b), L(SU')in F-3(at), 0.5239 0.04 F-V(at 1 L(SU')in ), L(SU')in E3(b),L(su'); 0.5017 + 0.78 1 F-V(b), L(SU')in n F-3(at), 0.5099 + 2.43 2 F-V(cLt L(SU')in ), L(SU')jn E3(b), 0.1873 F-V(b), 0.86 1 + L(SUI);,, L(SU')in
MAP 0.6908 0.6994 0.7067 0.1434 0.1537
+/-% 0.46 + 0.78
B 2 1
+ 3.271 1 1.44 2 + 5.64' 3
0.6807 0.6706 0.6213 0.5244 0.5301 0.1809
+ 1.45 1 0.061 2 1.041 + 0.06 + 1.14t' 2.58 2 1 1 3
E3
[, SUS at E3(b), ,
0.1783
0.1169 BM25FA hp2003 BM25FU

0.7516 BM25FP
F-3(at), L(SU')jn E3(b),

L(su');,, E3 r, SUS at 3(b), , L(SU'); n
F-3(at), L(SU')in
3(b),
L(SU'); n
0.1196
+ 2.31 + + + +
3.98 -
y y
4 4 4 2 1 3
EV(b), c(SU')ir
at t SU' "
0.1861
0.1217
+ 0.22
+ 4.11 + + + + + 0.47 0.17 1.51 6.48t' 0.48
3
2 3 2 1 2 3
0.1185
0.7518 0.7758 0.6668 0.6746 0.7081
+ 1.371 4
0.03 3.221 2.92 4.121 0.38 -
at c SU' " F-V(b),

L(SU')in
0.1223
0.7551 0.7529 0.6577 0.6899 0.7142
+ 4.62t' 3
hp2004 BM25FU
0.6479 BM25FP
BM25F np2003 0.7108 BM25FP BM25F np2004 0.6707 BM25FU

td2003 I(ne)C2FU
L(su'); L(SU'); n sui ,L

r
FV(at), L(SU')in F-V(b), L(SU')in FV(at), L(SU')in d(b),L(SU');,,
3
3
L SU' " at E3(b),

at
0.7216
0.6897 0.6825
-
+ 1.521 2
+ 2.83 4 + 1.761 2
-
Ey
EY(b), L(SU'){n EV(at), L(SU)in '

F-V(b),
at L SU% .
0.6989
0.6720 0.1570
+ 0.19
1.67 -
1
2 3
E3(b), L(SU');
0.1455 DLHFP td2004 PL2F 0.1307 I(n, )C2FP

hp2003 DLHFU
3 at ,L Sui 3(b), L(SU');n F-3(at), L(SU')jn

E3(b),
0.1437 0.1343 0.1357

0.6590
1.24 + 2.75 + 3.83

1.05 -
1 3 3
2
L(su');
Ey at c 5U% EY(b), L(5U');w Ed at ,c SUS Ed(b), Ey at ,t 5U' Ev(b), L(sv');,, FV(at), L(SU')in
F-V(b) L(SU')in , y at ,L Su, EV(b)L(SU');,, , FV(at), L(SU)in L(SU');n
L(SU')in
+ 7.90
0.1503 0.1327 0.1376

0.6816
+ 3.30 + 1.53 + 5.28

+ 2.34
1 2 4
1
0.6944 I(ne)C2FA
0.6660 BM25FA hp2004 PB2FU 0.5555 DLHFA np2003 PL2FP 0.6846 I(ne)C2FA np2004 PB2F
F, 3(at),L(SU')in 3(b), L(SU'); F-3(a0, L(SU')in

E3(b), E3 L(SU ); n at E3(b),
0.6807 0.5962 0.5949

0.6943 0.7131 0.6893 0.7269
+ 2.21 4 + 7.33 4 + 7.091 5

+ 1.42 + 4.16 0.73 + 4.68 2 1 3 1
0.6719 0.6001 0.5640

0.7059 0.7009 0.7217 0.6914
+ 0.89 + 8.03 + 1.53

+ 3.11 + 2.38 + 3.93 0.43 -
2 3 1
4 2 2 1
F-3(at), L(SU')in
SUS ,t L(sU );
Ev(f), E3(f), L(su'),,, Table B. 11: Evaluation of score-dependentexperiments L(su');,, and
255
Baseline
MAP
E3 (b) F-3(6) EY(b)
mq2003
mq2004 mq2003 DLHFP DLHFP PB2F BM25F
0.5533
0.4156
+1- % B
0.5455
-
-1.41 +0.43
3 1
0.5533
0.4156 0.5533 0.4156 0.5533 0.4156
mq2004 mq2003
mq2004 mq2003 mq2004 mq2004 mq2003 mq2004
DLHFP DLHFP
DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP
PB2F BM25F
PB2F BM25F PB2F PB2F BM25F PB2F
0.5557
0.5486 0.5775 0.4381
dom
&1(b) E3(at) F-3(at) v(at) Ev at

E3(b),
3 -0.85 +4.37t* 1 +5.411 1
0.5533
0.4156
3 y
0.5490 avg(dom)
0.7590 0.5537 0.4152 0.5626 0.4452 0.5474 0.5777 0.4212 0.5554 0.4265 0.5622 0.4233 0.5561 0.5398 0.4013 0.5455
-0.78
+4.64t* +0.07 -0.09 +1.68 +7.121 -1.07 +4.411* +1.34 +0.38 +2.62 +1.61 +1.85 +0.51 -2.44 -3.44' -1.41
3
1 2 2 2 2 1 1 2 3 1 2 2 1 1 1 3
0.5533
0.4156
y(b),avg(dom) 3(at),
3
y
b av ,
mq2003 mg2004 mq2003

mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004
DLHFP DLHFP DLHFP

DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP
BM25F PB2F BM25F

PB2F BM25F PB2F BM25F PB2F BM25F PB2F BM25F PB2F
0.5533
0.4156 0.5533
b ,av
dom
avg(dom
0.4156 0.5533 0.4156
at ,av dom Ey(at), avg(dom

at ,av dom
3(b), std(dom) 3
0.5533 0.4156
0.5533
0.4156
d(b), std(dom)
y
b std dom ,
3(at),
3
b std dom ,
std(dom;
0.5533 0.4156 0.5533 0.4156
mq2003 mq2004 mq2003

mq2004
DLHFP DLHFP DLHFP

DLHFP
BM25F PB2F BM25F

PB2F
0.5533
0.4156
v(at), atd(dom; y dom' std at , 3(b),Irg(dom) E3 6 Ir dom ,
at ,std dom'
y(b), Irg(dom)
mq2003
mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004
DLHFP
DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP
BM25F
PB2F BM25F PB2F BM25F PB2F BM25F PB2F
0.5533
0.4156
3(at), Irg(dom;
3 y
y b Ir dom ,
0.5533
0.4156 0.5533 0.4156
v(at),Irg(dom
at ,Ir dom 3(b),
avg(dir)
at ,Ir dom
0.5541
0.5512 0.4115
+0.14
-0.38 -0.99
3
3 1
E3 b
0.5533
0.4156
y(b), avg(dir)
dir av ,
0.5745
0.4147 0.5499 0.4108
+184"
-0.22 -0.61 -1.15
1
1 2 1
mq2003
mq2004
DLHFP
DLHFP
BM25F
PB2F
0.5533
0.4156
d b dir av 3(at), ,
E3 at ,av
avg(dir)
dir
mq2003
mq2004
DLHFP
DLHFP
BM25F
PB2F
0.5533 0.4156
y(at), avg(dir) EV at ,av dir
0.5626 0.4395
0.5371 0.4190 0.5685
+1.68 1 +5.75k 2
-2.93 +0.82 +2.75 4 2 3
mq2003 mq2004 mq2003 mq2004 mq2003

mq2004
DLHFP DLHFP DLHFP DLHFP DLHFP

DLHFP
BM25F PB2F BM25F PB2F BM25F

PB2F
0.5533
0.4156
3(b), std(dir) E3 y(b), std(dir)

V b dir std ,
b std dir ,
0.5533
0.4156
0.5533 0.4156
0.5533
mq2003
mq2004
DLHFP
DLHFP
BM25F
PB2F
v(at),
V
3(at), std(dir) 3 ,std dir at

std(dir)
at ,std dir
0.5619
0.5648 0.4374
+1.55
0.4156
+2.08 ' 2 1 +5.25
256
continued
Task mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004
I Retrieval aDnrnai1 cl DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F
from previous page
cDl;T, o 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156

3(b),
r. 3 6 Ir dir , d(b),Irg(dir) y 6 Ir di, , 3(at),Irg(dir) 3 at I, dir V(at),Irg(dir) y at ,Ir dir
a"aaa
A.,
TI-
1W ., /o L
Irg(dir)
0.5513
0.4235 0.5554 -
+1.961 2 +0.38 1
3 -0.36
0.5516
0.4053 0.5613 0.4421
1 -2.48 +1.45 3 +6.371 1
4 -0.31
Table B. 12: Evaluation of the score-independent document-level and aggregate-level experiments with limited relevance information. The table displays the evaluation results of a decision mechanism, which is trained and evaluated with different mixed tasks.
Task mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 mq2003 mq2004 Retrieval approaches DLHFP BM25F DLHFP PB2F DLHFP BM25F DLHFP PB2F BM25F DLHFP DLHFP PB2F BM25F DLHFP PB2F DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP DLHFP BM25F PB2F BM25F PB2F BM25F PB2F BM25F PB2F BM25F PB2F BM25F PB2F BM25F PB2F Baseline 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156 0.5533 0.4156
E3(b), 3(b),
L(SU),
L(SU),
y(b),L(su) p, y(b),L(SU), E3(at),L(sU)D, E3(at),L(SU) , EV(at),L(SU)p, y((&t), L(SU),

E3(b),
MAP 0.5412 0.4144 0.5539 0.4215 0.5462 0.4218 0.5685 0.4215 0.5481 0.4038 0.5702 0.4207 0.5331 0.4243 0.5743 0.4264 0.5505
+/- % B 4 -2.19 2 -0.29 4 +0.11 +1.421 2 3 -1.28 2 +1.49 2 +2.75 1 +1.42 3 -0.94 2 -2.84 4 +3.05 +1.231 2 3 -3.65 2 +2.09 +3.80T* 2 2 +2.60 4 -0.51
L(5U)in E3(b),L(SU)sn Ey(b),L(SU)i Ey 6 su ,L E3(at),L(su);,, E3(at),L(SU)in EV(at),L(SU);,, Ed at ,L SU

E3(b), E3(b),
L(sU')p,
Ec(b),L(SU')P, EV(b), L(SU') , E3(at),L(SU')P, E3(at),L(SU'),
L(SU'),
0.4154
0.5698 0.4201 0.5471 0.4178
-0.04
+2.98 +1.081 -1.12 +0.53
1
3 1 3 1
mq2003
mq2004
DLHFP
DLHFP
BM25F
PB2F
0.5533
0.4156
EY(at), L(SU')p,
EY(at),L(SU') , E3(b), L(SU'); n E3(b), L(SU')in
0.5663
0.4156 0.5285 0.4154
+2.35
0.00 -4.48 -0.05
3
1 2 1
mq2003 mq2004 mq2003 mq2004

mq2003
DLHFP DLHFP DLHFP DLHFP

DLHFP
BM25F PB2F BM25F PB2F

BM25F
0.5533 0.4156 0.5533 0.4156

0.5533
EV(b), L(SU');,, y 6 sui ,L 3

at L SUS
0.5742 0.4194
0.5464
+3.78 +0.91
-1.25
3 1
3
257
Task mq2004 mg2003 mq2004
continued from previous page Retrieval approaches Baseline E DLHFP PB2F 3 0.4156 at ,L sui DLHFP BM25F Ey(at), L(SUI)t 0.5533 DLHFP PB2F Ed 0.4156 at ,L sui
MAP --0.5733 0.4157
+/- %B +3.61t' +0.02 1 1
Table B. 13: Evaluation of the score-dependent experiments with limited relevance information. The table displays the evaluation results of a decision mechanism, which is trained and evaluated with different mixed tasks.
258
Bibliography
Achlioptas, A. & McSherry, F. (2001). Web search via hub synthesis. in `Proceedings of the 42nd Annual Symposium of Foundations of Computer Science'. pp. 611-618.3.4.1 D., Fiat, A., Karlin,
Adamic, L. (2001). `Network Dynamics: The World Wide ì'ch. PhD Thesis. Stanford University'. 3.2.2
Albert, R. & Barabsi, (2002). Statistical -L. Physics 74(1), 47-97.3.2.2 A. mechanics of complex networks. Rer<< <<,
of Modern
Albert,
R., Jeong, H. & Barabsi, A. (1999). Diameter of the World `Vide `Veh. -L. Nature 401,130-131.3.2.2
J. (1996). Automatic hypertext ACM link typing. in `Proceedings of the 7th ACM
Allan,
conference
Hypertext'. on
Press. pp. 42-52.3.2.1
Amati,
G. (2003). Probability
Models for Information
Retrieval based on Divergence Science, University of
from Randomness. PhD thesis. Department Glasgow. UK.

Amati, G. (2006).
Computing of
2.3.3,2.3.3.3,7.4.3.1
Frequentist Bayesian and approach to Information Retrieval Retric'vva11.in (ECIR'06).
`Proceedings To appear'.
of the 28th European 2.3.3.4
Conference on Information
Amati, G. & Van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring divergence from randomness. ACM Transactions on Information Systems 20(4), 357-389.2.3.3,2.3.3.4.4.4.1,7.4.2.2
259
BIBLIOGRAPHY
Amati, G., Carpineto, C. & Romano, G. (2004). Query Difficulty, Robustness, and
Selective Application of Query Expansion. in `Proceedings of the 26th European Conference in Information Retrieval (ECIR'04)'. pp. 127-137.3.6.2 G., Ounis, I. & Plachouras, V. (2003). The dynamic absorbing model for the web. Technical Report TR-2003-137. Department of Computing Science, University
Amati,
Glasgow. 3.4.4.3,4.5.2.4 of
Amento, B., Terveen, L. & Hill, W. (2000). Does authority mean quality? predict-
ing expert quality international Retrieval'.
ratings of web documents. in `Proceedings of the 23rd annual
ACM SIGIR conference on Research and Development in Information ACM Press. pp. 296-303.3.3.2.1
Amitay, E., Carmel, D., Darlow, A., Herscovici, M., Lempel, R., Soffer, A., Kraft, R. & Zien, J. (2003). Juru at TREC 2003 - Topic Distillation Using Query-Sensitive Tuning Text Twelfth The Publication 500-255: Special `KIST Filtering. in Cohesiveness and
REtrieval Conference (TREC 2003)'. pp. 255-262.3.6.2,5.3.1

Distillation (2002). Topic A. & Soffer, R. A., Lempel, Amitay, E., Carmel, D., Darlow, Text Eleventh The 500-251: Publication Special `NIST Agents. in Knowledge with Retrieval Conference (TREC 2002)'. 3.6.2,5.2.1 `Proceedings the in for Models of (2001). M. metasearch. Aslam, J. A. & Montague, in Development Research SIGIR ACM and on international conference 24th annual
276-284.3.4.4.1 Press. ACM Retrieval'. pp. Information

Baeza-Yates, R. & Ribeiro-Neto, Wesley. 2.1 B. (1999). Modern Information Retrieval. Addison
test Engineering (2003). multi-purpose D. a & Hawking, N. Bailey, P., Craswell, & Management Processing Information for experiments. retrieval web collection 39(6), 853-871.4.2 telac: Sic (2004). transit A. gloria & Tomkins, R. Kumar, A., Bar-Yossef, Z., Broder, international 13th the `Proceedings in of decay. the web's towards an understanding of 328-337.3.3.2 Press. ACM (WWW13)'. pp. Web Wide World conference on
260
BIBLIOGRAPHY
Barabsi, 3.2.2
A.
(2002). Linked: -L.
The New Science of Networks. Perseus Publishing.
Barabsi, A. L. & Albert, R. (1999). Emergence Scaling in Random Networks. Science of 286,509-512.3.2.2 Baron, L. (1996). Labeled, Typed links as Cues when Reading Hypertext Documents. Journal of the American Association for Information Science 47(12), 896-908.3.2.1 Bartell, B. T., Cottrell, G. W. & Belew, It. K. (1994). Automatic combination of
multiple ACM
ranked retrieval systems. in `Proceedings of the 17th annual international SIGIR conference on Research and Development in Information Retrieval'.
pp. 173-181.3.4.4.1 Beitzel, S., Jensen, E., Chowdhury, A. & Grossman, D. (2003). Using titles and category
from names editor-driven taxonomies for automatic evaluation. in `Proceedingsof the

12th international Conference on Information and Knowledge Management (CIKM)'.
ACM Press. pp. 17-23.3.6.1

Beitzel, S., Jensen, E., Chowdhury, A., Grossman, D. & Flieder, 0. (2004). Hourly
log. `Proceedings in the large topically of query categorized web analysis of a very 27th annual international ACM SIGIR Conference on Research and Development in Information Retrieval'. ACM Press. pp. 321-328.7.2.1
Engine Search TechPerspective A Cognitive About: Out (2000). Finding on Belew, R. UK. Cambridge, Press, University Cambridge WWW. the and nology Berners-Lee, T., Masinter, L. & McCahill, 2.3,3.2.1
M. (1994). RFC 1738 - Uniform Resource
Locators (URL). RFC 1738. IETF.
4.5.1
docufor identifying A co-derivative (2004). J. & Zobel, Y. scalable system Bernstein, Retrieval Information Processing String and Symposium Eleventh `The on in ments.
(SPIRE'04)'.
3.2.3
the overlap and for size A relative (1998). technique A. measuring Bharat, K. & Broder, international on 7th conference the `Proceedings in of engines. of public web search 379-388.3.2.2 V.. B. Publishers Science pp. World Wide Web (WWW7)'. Elsevier
261
BIBLIOGRAPHY
Bharat, K. & Broder, A. (1999). Mirror, mirror on the Web: A study of Host Pairs with Replicated Content. in `Proceedings of the 8th international conference on World Wide Web (WWW8)'. 3.2.3 K. & Henzinger, M. R. (1998). Improved algorithms for topic distillation in hyperlinked a environment. in `Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 104-111.3.3.2.2,3.4.1,5.1
Bharat,
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press. 5.5.3 Blair, D. (2001). Some thoughts on the reported results of trec. Information Processing
& Management 38(3), 445-451.2.4

Bomhoff, M., Huibers, T. & Van der Vet, P. (2005). User Intentions in Information
Retrieval. in `Proceedings of the 5th Dutch Information Retrieval Workshop (DIR'5)'. pp. 47-54.3.6.1,7.2.1 Borodin, A., Roberts, G., Rosenthal, J. & Tsaparas, P. (2001). Finding authorities
from link hubs `Proceedings in the of the 10th and structures on world wide web. international conference on World Wide Web (WWW10)'. ACM Press. pp. 415-429. 3.3.2.2 Botafogo, R. A. & Shneiderman, B. (1991). Identifying in hypertext strucaggregates ACM Hypertext'. ACM `Proceedings conference on tures. in of the third annual
Press. pp. 63-74.3.3.1

Botafogo, R., Rivlin, identifying hypertexts: (1992). Structural B. & Shneiderman, E. analysis of Sys-
hierarchies and useful metrics. ACM Transactions on Information
tems 10(2), 142-180.3.3.2

Web hypertextual large-scale search enBrin, S. & Page, L. (1998). The anatomy of a 107-117.3.2.2,3.3.1,3.3.2.1, Systems 30(1-7), ISDN Networks Computer and gine. 3.4.2,4.5.3,4.8,6.2.2.1
3-10.1.2,3.2.4, 36(2), Forum SIGIR Broder, A. (2002). A taxonomy of web search. 3.5.1,3.6.1
262
BIBLIOGRAPHY
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata. R., Tomkins, A. & Wiener, J. (2000). Graph Structure in the Web. in `Proceedings of the 9th international conference on World Wide Web (WWW9)'. 3.2.2 Bruza, P. (1990). Hyperindices: A novel aid for searching in hypermedia. in 'Proceedings of the ACM European Conference Hypertext '90 (ECHT '90)'. pp. 109-122. on 3.2.1
Bush, V. (1945). As we may think. The Atlantic Monthly 176(1), 101-108.3.2.1 Callan, J. & Connell, M. (2001). Query-based sampling of text databases. ACM Dunsactions on Information Systems 19(2), 97-130.7.4.2.1 Can, F., Nuray, R. & Sevdik, A. B. (2004). Automatic search engines. Information performance evaluation of web Processing & Management 40(3), 495-514.3.5.2
Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y. & Soffer, A. (2001). Static index pruning for information retrieval systems. in `Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval'. ACM Press. pp. 43-50.4.2 Carriere, J. & Kazman, Searching and visualizing the Web in `Proceedings of the 6th internaltional conference on World R. (1997). WebQuery:
through connectivity.
Wide Web (WWW6)'.

Chakrabarti,
3.3.2,3.3.2.2
S., Dom, B., Gibson, D., Kleinberg, J., Raghavan, P. & Rajagopalan,
hyperlink by list and structure analyzing compilation resource Wide World international 7th the on `Proceedings conference in text. of associated 3.4.1 Web (WWW7)'. S. (1998). Automatic distillation text, (2001). Enhanced topic V. using & Tawde, M. Chakrabarti, S., Joshi, ACM international 24th the `Proceedings in annual hyperlinks. of tags, and markup ACM Retrieval'. Information in Development Research SIGIR conference on and Press. pp. 208-216.3.4.1 data for Graphical (1983). analysis. P. methods & Tukey, W. Cleveland, Chambers, J., Duxbury Press. 5.5.3
263
BIBLIOGRAPHY
Chowdhury,
A. & Soboroff, I. (2002). Automatic evaluation of world wide web search services. in `Proceedings of the 25th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. 421-422. pp. 3.5.2
Clarke, C., Craswell, N. & Soboroff, I. (2004). Overview of the TREC 2004 Terabyte track. in `NIST Special Publication 500-261: The 13th Text REtrieval Conference
(TREC 2004)'. pp. 80-88.2.3.3.4,3.5.1 Clarke, C., Scholer, F. & Soboroff, I. (2005). The TREC 2005 Thrabyte Track. in
`Proceedings of the 14th Text REtrieval Conference (TREC 2005)'. 3.5.1
Cleverdon, C. (1997). The cranfield tests on index language devices. in `Readings in information retrieval'. Morgan Kaufmann Publishers Inc.. pp. 47-59.2.4 Cohn, D. & Chang, H. (2000). Learning to probabilistically uments. in `Proceedings of the 17th International identify authoritative doc-
Conference on Machine Learning'.
ACM Press. 3.3.2.2

Cohn, D. & Hofmann, T. (2001). The missing link -a probabilistic model of document & V. Tresp, G. Dietterich K. Leen, T. T. in hypertext eds, connectivity. content and Àdvances in Neural Information Processing Systems 13'. MIT Press. pp. 430-436.
3.4.1
`NIST Web in Track. TREC-2002 (2002). Overview the D. Craswell, N. & Hawking, of (TREC 2002)'. Conference REtrieval Text Eleventh The Special Publication 500-251:
86-93.3.5.1,4.2,4.3.1,4.5.3 pp. `NIST in Web Track. 2004 TREC Overview the (2004). of Craswell, N. & Hawking, D. (TREC Conference REtrieval Text Thirteenth The 500-261: Special Publication
2004)'. 1.1,2.4,3.5.1,3.6.1,4.2,4.3.1,4.4.4,4.5.1,7.2.1,1,7.4,7.4.3.1,8.1.2
Enterprise 2005 TREC Overview the (2005). I. of & Soboroff, A. Vries, Craswell, N., de 3.5.1 (TREC 2005)'. Conference REtrieval Text 14th the Track. in `Proceedings of
link finding Effective (2001). using S. site & Robertson, D. Craswell, N., Hawking, SIGIR ACM international 24th the annual `Proceedings in of information. anchor
264
BIBLIOGRAPHY
conference on Research and Development in Information Retrieval'. ACM Press. pp. 250-257.3.4.2,4.3.1
Craswell, N., Hawking, D., Wilkinson, R. & Wu, M. (2003). Overview of the TREC2003 Web Track. in `NIST Special Publication 500-255: The Twelfth Text REtrieval Conference (TREC 2003)'. pp. 220-236.1.1,3.5.1,4.3.1,7.4,7.4.3.1,8.1.2
Craswell, N., Robertson, S., Zaragoza, H. & Taylor, M. (2005). Relevance weighting for query independent evidence. in `Proceedingsof the 28th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 416-423.3.4.4.1,3.4.4.3,4.5.2.6
Croft, W. B. (2000). Combining approaches to information retrieval. in W. B. Croft, Àdvances in Information Retrieval from the Center for Intelligent Information ed., Retrieval'. Kluwer Academic. pp. 1-36.1.2,3.4 models of information retrieval
Croft, W. B. & Harper, D. (1988). Using probabilistic without information. relevance
Journal of Documentation 35,285-295.2.3.1
Cronen-Townsend, S., Zhou, Y. & Croft, W. B. (2002). Predicting query performance.

in `Proceedings of the 25th annual international ACM SIGIR conference on Research 299-306.3.6.2,5.4, ACM Press. Retrieval'. Information in Development pp. and
7.4.2.1
for horizontal (2002). Web M. Maggini, & M. Diligenti, M., Gori, page scoring systems World international 11th `Proceedings the on in conference of and vertical search.
Wide Web (WWW11)'. ACM Press. pp. 508-516.3.3.2.1 & Wiley John Analysis. Scene Duda, R. & Hart, P. (1973). Pattern Classification and Sons, New York, USA. 5.5.1 in documents the web. on Untangling (2003a). K. compound & McCurley, N. Eiron, ACM Hypermedia'. Hypertext ACM and fourteenth conferenceon `Proceedings of the Press. pp. 85-94.3.3.1 `Proin for text search. Analysis web (2003b). S. anchor K. of Eiron, N. & McCurley, Research and SIGIR ACM on conference international 26th the annual ceedings of
265
BIBLIOGRAPHY Development in Information Retrieval'. ACM Press.pp. 459-460.3.4.2,4.3.1,7.4.2.3, 7.4.3.3,8.1.2

Erds, P. & Renyi, A. (1959). On random graphs. Publicationes Mathematicae 6,290 297.3.2.2
Evans, D. A., Shanahan, J. G. & Sheftel, V. (2002). Topic structure modeling. in `Proceedings of the 25th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 417-418.3.6.2
Fagin, R., Kumar, R., McCurley, K. S., Novak, J., Sivakumar, D., Zbmlin, J. A. & Williamson, D. P. (2003). Searching the workplace web. in `Proceedings of the 12th international conference on World Wide Web (WWW12)'. ACM Press. pp. 366-375.
3.4.4.1,4.4.1 Faloutsos, M., Faloutsos, P. & Faloutsos, C. (n.d. ). On Power-law Relationships of the
Internet Topology. Computer Communications Review 29,251-262.3.2.2 Fang, H., Tao, T. & Zhai, C. (2004). A formal study of information retrieval heuristics.
in `Proceedings of the 27th annual international ACM SIGIR conferenceon Research

Development in Information and Feller, W. (1957). An Introduction Retrieval'. ACM Press. pp. 49-56.2.3.1 to Probability Theory and its Applications, volume
1,2nd edition. John Wiley and Sons. 4.5.2,4.5.2.1,4.5.2.2

Classiin Text Experiments Useful? Are Links (2003). When R. & Everson, Fisher, M. fication. Retrieval Information Conference European 25th in `Proceedings of the on
(ECIR'03)'. Springer-Verlag. pp. 41-56.3.6.2

Frakes, W. & Baeza-Yates, R. (1992). Information Retrieval Data Structures 8 Algo-
2.2 Jersey. New Cliffs, Englewood Hall, Prentice rithms. Information Hypertext in Links Semantic Use of Frei, H. -P. & Stieger, S. (1995). The 31,1-13.3.4.3 Management 4 Processing Retrieval. Information
J. Nie, M., Zhang, H., G., He, Cao, Gao, J., -Y., 500-250: Publication Special `KIST A. in TREC-10 Web Track Experiments at MSR. Walker, S. & Robertson, S. (2001).
384-392.4.4.1 2001)'. (TREC pp. The Tenth Text REtrieval Conference
266
BIBLIOGRAPHY
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science 178,471479.1.2,3.3.2 Geller, N. (1977). On the Citation Influence Methodology of Pinski and Narin. Information Processing & Management 14,93-95.3.3.2,3.3.2.1 Gordon, M. & Pathak, P. (1999). Finding information on the world wide web: the retrieval effectiveness of search engines. Information Processing & Management 35(2), 141-180.3.5.2 from hypertext using dyretrieval namically planned guided tours. in `Proceedings of the ACM conference on Hypertext
Guinan, C. & Smeaton, A. F. (1992). Information
(ECHT'92)'. ACM Press. pp. 122-130.3.2.1

Gulli, A. & Signorini, A. (2005). The indexable web is more than 11.5 billion pages. in `Special interest tracks and posters of the 14th international conference on World
Wide Web (WWW14)'. ACM Press. pp. 902-903.3.2.2

Gurrin, C. & Smeaton, A. F. (2003). Improving the Evaluation of Web Search Sys-
tems. in `Proceedings of the 25th European Conference on Information Retrieval (ECIR'03 )' Springer-Verlag. pp. 25-40.3.6.2 . Halasz, F. G. (1987). Reflections on notecards: seven issuesfor the next generation of ACM Proceeding '87: the `HYPERTEXT in conferenceon hypermedia systems. of Hypertext'. ACM Press. pp. 345-365.3.2.1
Harman, (TREC-1). Conference REtrieval Text First D. (1993). Overview of the in
(TRECConference REtrieval Text First `NIST Special Publication 500-207: The

1)'. pp. 1-20.2.4,2.5
American Journal the indexing. for of An (1975). S. P. probabilistic Harter, algorithm 280-289.2.3 26(4), Science Information for Society Structure Link in Priors Document Dependent Age (2005). L. Azzopardi, Hauff, C. & Retrieval Information Conference European on 27th the `Proceedings Analysis. in of (ECIR'05)'. pp. 552-554.3.4.4.3
267
BIBLIOGRAPHY
Haveliwala, ternational 3.4.1
T. H. (2002). Topic-sensitive
pagerank. in `Proceedings of the 11th 1nconference on World Wide Web (WWW11)'. ACM Press. pp. 517-526.
Hawking, D. (2000). Overview of the TREC-9 Web Track. in `NIST Special Publication 500-249: The Ninth Text REtrieval Conference (TREC 9)'. pp. 87-102.3.5.1,4.2 Hawking, D. & Craswell, N. (2001). Overview of the TREC-2001 Web Track. in 'NISI Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001)'. pp. 61-67.3.5.1,3.6.1,4.2,4.3.1 Hawking, D. & Craswell, N. (2005). The Very Large Collection Web Tracks. in and `TREC: Experiment and Evaluation in Information Retrieval, editors E. M. Voorhees
D. and K. Harman'. MIT Press. pp. 199-232.2.4,3.5.1,4.2

Hawking, P. (1997). Overview of TREC-6 Very Large Collection Track. in `NIST Special Publication 500-240: The 6th Text REtrieval Conference D. & Thistlewaite,
(TREC6)'. pp. 93-106.3.5.1

Hawking, D., Craswell, N. & Thistlewaite, `NIST Special Publication P. (1998a). ACSys TREC-7 Experiments. in
500-242: The Seventh Text REtrieval Conference (TREC
7)'. pp. 299-313.4.2

Hawking, D., Craswell, N. & Thistlewaite, P. (1998b). Overview of trec-7 very large
REtrieval ConText The 7th Publication 500-242: Special `NIST in collection track.
ference (TREC7)'. pp. 91-104.3.5.1 Measuring (2001). K. & Griffihs, P. Bailey, N., Craswell, search engine D., Hawking, 33-59.3.5.2 4(1), Retrieval Informantion quality.
Hawking, D., Craswell, N., Crimmins, F. & Upstill, T. (2004). How valuable is ex-
15th `Proceedings the in of ternal link evidence when searching enterprise webs?. 3.4.2 (ADC'04)'. Conference Database Australasian in better (2004). Toward anchors. weighting of Hawking, D., Upstill, T. & Craswei, N. Research SIGIR ACM on international conference 27th `Proceedings of the annual 512-513.3.4.2,4.3.2, Press. ACM Retrieval'. pp. Information in Development and
4.4.1
268
BIBLIOGRAPHY
Hawking, D., Voorhees, E., Craswell, N. & Bailey, P. (1999). Overview of the TREC-8 Web Track. in `NIST Special Publication 500-246: The 8th Text REtrieval Confer(TREC8)'. ence pp. 131-150.3.5.1,4.2
He, B. & Ounis, I. (2003). A study of parameter tuning for term frequency normalization. in `Proceedings of the 12th international Conference on Information and
Knowledge Management (CIKM)'. ACM Press. pp. 10-16.2.3.3.3

He, B. & Ounis, I. (2004). Inferring Query Performance Using Pre-retrieval PredicRetrieval
tors. in `The Eleventh Symposium on String Processing Information and
(SPIRE'04)'.
1.2,3.6.2,5.2
He, B. & Ounis, I. (2005a). A study of the dirichlet priors for term frequency normalisation. in `Proceedings of the 28th annual international ACM SIGIR, conference on Research and Development in Information Retrieval'. ACM Press. pp. 465-471.
2.3.3.3
He, B. & Ounis, I. (2005b). Term Frequency Normalisation funning for BM25 and DFR
Conference Information Retrieval European `Proceedings 27th in the on of models. (ECIR'05)'. pp. 200-214.2.3.3.3,7.4.2.2 Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible web crawler. World Wide Web 2(4), 219-229.3.2.2
Hiemstra, retrieval. D. (1998). A linguistically in ÈCDL information of motivated probabilistic model '98: Proceedings of the Second European Conference on ReLibraries'. Springer-Verlag. London,
for Digital Technology Advanced search and
UK. pp. 569-584.2.3.2 6.2.1 Wiley. Ed.. Statistics, 5th Mathematical Hoel, P. G. (1984). Introduction to the `Proceedings in indexing. of Hofmann, T. (1999). Probabilistic latent semantic in Development Research SIGIR and ACM on international conference 22nd annual 50-57.3.3.2.2 Press. ACM Retrieval'. pp. Information the knowledge search on subject Hsieh-Yee, I. (1993). Effects of search experience and for Society American the Journal of tactics of novice and experienced searchers. Information Science 44,161-174.3.2.4
269
BIBLIOGRAPHY
Huberman, regularities
B. A., Pirolli,
P. L. T., Pitkow, J. E. & Lukose, R. M. (1998). Strong in World Wide Web surfing. Science 280,95-97.3.3.3
Jansen, B. J. & Pooch, U. W. (2001). A review of web searching studies and a framework for future research. Journal of the American Society of Information Science 52(3), 235-246.1.2,3.2.4,7.4.2.1 Jardine, K. & Sibson, R. (1971). Mathematical taxonomy. Wiley. 5.4.1
Jiang, X. Song, W. & Zeng, (2005). H. Applying Associative Relationship -M., -G. on -J. the Clickthrough Data to Improve Web Search. in `Proceedings of the 27th European
Conference on Information Retrieval (ECIR'05)'. pp. 475-486.3.3.3

Jin, R. & Dumais, S. (2001). Probabilistic ceedings of the 24th annual international Development in Information Jin, R., Hauptmann, links. in `Procombination of content and ACM SIGIR conference on Research and
Retrieval'. ACM Press. pp. 402-403.3.4.3
A. G. & Zhai, C. X. (2002). Title language model for information
ACM SIGIR international 25th `Proceedings in the conference on annual of retrieval. Research and Development in Information Retrieval'. ACM Press. pp. 42-48.4.3.1 data. in `Proceedings (2002). Optimizing T. Joachims, search engines using clickthrough Data Discovery Knowledge international SIGKDD and ACM 8th on conference the of
Mining'. ACM Press. pp. 133-142.3.3.3

in(2005). Accurately G. & Gay, H. Hembrooke, B., Joachims, T., Granka, L., Pan, 28th the `Proceedings in feedback. annual implicit of data as terpreting clickthrough Information in Development Research and international ACM SIGIR conference on
Retrieval'. ACM Press. pp. 154-161.3.3.3,8.2 in for Language (2004a). searching M. models Rijke, & de G. Kamps, J., Mishne, REtrieval Text Thirteenth The 500-261: Publication Special `NIST in web corpora. Conference (TREC 2004)'. 3.3.1 in Searching for Models Language (2004b). M. Rijke, & de Kamps, J., Mishne, G. Retrieval Text Thirteenth The 500-261: Publication Special Web Corpora. in `NIST
(TREC 2004)'. conference 4.5.1
270
BIBLIOGRAPHY
Kamps, J., Monz, C., de R.ijke, M. & Sigurbjornsson, B. (2003). Approaches to Robust and Web Retrieval. in `NIST Special Publication 500-255: The Twelfth Text Retrieval Conference (TREC 2003)'. pp. 594-599.4.4.1 Kang, I. & Kim, (2003). G. Query type classification for -H. web document retrieval. in `Proceedings of the 26th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 64-71.3.6.1,5.2.1 sources in a hyperlinked environment. in 'Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms'. ACM J. M. (1998). Authoritative
Kleinberg,
Press. pp. 668-677.3.3.2.2,3.3.2.2,5.1,5.2

Kraaij, W., Westerveld, T. & Hiemstra, D. (2002). The importance of prior probabilities for entry page search. in `Proceedings of the 25th annual international ACM SIGIR conference on Research and Development in Information 27-34.3.3.1,3.4.4.2,3.4.4.3,4.5.1 pp. Kullback, USA. S. (1959). Information 5.4.1,5.4.1 Theory and Statistics. Jown Wiley & Sons, New York, Retrieval'. ACM Press.
Kwok, K. L., Deng, P., Dinstl, N. & Chan, M. (2002). TREC2002 Web, Novelty and Filtering Track Experiments Special Publication `KIST PIRCS. in using 500-251The Eleventh Text Retrieval Conference (TREC)'. pp. 520-528.3.3.1 Lafferty, J. & Zhai, C. (2001). Document language models, query models, and risk mininternational 24th the in `Proceedings annual for information of imization retrieval. ACM Information in Development Research SIGIR conference on and Retrieval'.
ACM Press. pp. 111-119.2.3.2 document based and on Lafferty, J. & Zhai, C. (2003). Probabilistic relevance models InterKluwer Retrieval, Information Modelling `Language in and query generation. 2.3.2 13'. Retrieval, Information Series vol on national
`Proceedings in language based (2001). Relevance models. W. B. & Croft, V. Lavrenko, DevelopResearch and SIGIR ACM on conference international 24th the annual of
Information in ment Retrieval'. ACM Press. pp. 120-127.2.3.2
271
BIBLIOGRAPHY
Lawrence, S. & Giles, C. L. (1999). Accessibility of information on the Web. Nature 400,107-109.3.2.2 Lebanon, G. & Lafferty, J. (2002). Cranking: probability models on permutations. ference on Machine Learning'. 3.4.4.1 Combining rankings using conditional in `Proceedings of the 19th International Con-
Lee, J. H. (1997). Analyses of multiple evidence combination. in `Proceedings of the 20th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 267-276.3.4.4.1
Lempel, R. & Moran, S. (2000). The stochastic for link-structure analysis approach (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999)
33(1-6), 387-401.3.3.2.2 Levy, P. (1995). Que est ce que le virtuel. Editions La Decouverte. 3.2.1 Lewis, D. (1996). The TREC-4 Filtering Track. in `NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4)'. pp. 165-180.2.5
Li, L., Shang, Y. & Zhang, W. (2002). Improvement of hits-based algorithms on web
documents. in `Proceedings of the 11th international conferenceon World Wide Web (WWW11)'. ACM. 3.4.1
domains in logical (2000). Defining H. & Q. Takano, 0., Vu, a web Kolak, Li, W. -S., ACM hypermedia'. Hypertext ACM the and `Proceedings in on eleventh of site.
Press. pp. 123-132.3.3.1

Transactions IEEE Entropy. Shannon Based the Lin, J. (1991). Divergence Measures on
Information on
Theory 37,145-151.5.4.1,5.4.1
York. New Company, Macmillan The Theory. Decision (1971). Elements B. Lindgren, of 5.2 York. New Sons, Wiley John Decisions. and Games (1957). H. & Raiffa, R. and Luce, 8.2 Journal IBM Abstracts. of Literature Creation of Luhn, H. (1958). The Automatic Research and Development 2,159-165.2.2
272
BIBLIOGRAPHY Manmatha, R., Rath, T. & Feng, F. (2001). Modeling score distributions for combining the outputs of search engines, in `Proceedings of the 24th annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 267-275.3.4.4.1,4.5.2.6,5.4
Manning, C. & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. 2.2 0. A. (1994). GENVL and WWWW: Tools for taming the Web. in `Pro1.2,
McBryan,
ceedings of the 1st international 3.4.2
conference on World Wide Web (WWW1)'.
Meng, W., Yu, C. & Liu, K. (2002). Building Efficient Effective Metasearch and -L. Engines. ACM Computing Surveys 34,48-89.3.4.4.1 Metzler, D. & Croft, W. B. (2005). A markov random field model for term dependencies. in `Proceedings of the 28th annual international ACM SIGIR conference on Research
and Development in Information Retrieval'. ACM Press. pp. 472-479.4.3.2 Ng, A. Y., Zheng, A. X. & Jordan, M. I. (2001). Link Analysis, Eigenvectors and Stability. in ÌJCAI'. pp. 903-910.3.3.2.1
Nottelmann, H. & Fuhr, N. (2003). From Uncertain Inference to Probability of Relein `Proceedings of the 25th European Confer-
Applications. IR Advanced for vance
(ECIR'03)'. 235-250.5.4.2 Research IR pp. ence on

known-item for document (2003). Combining Ogilvie, P. & Callan, J. representations SIGIR ACM international 26th conference the `Proceedings in annual of search. Press. 143-150. ACM Retrieval'. Information in pp. Development Research and on
3.4.4.2,4.4.1 (2005). D. Johnson, C. & Macdonald, B., He, V., G., Plachouras, Amati, Ounis, I., ConEuropean 27th the `Proceedings in Platform. of Retrieval Terrier Information 517-519.4.2 (ECIR 2005)'. Research IR pp. ference on
RankCitation PageRank The (1998). T. & Winograd, R. S., Motwani, Page, L., Brin, Stanford, University, Stanford ing: Bringing Order to the Web. Technical report.
CA.
4.5.2,5.2
273
BIBLIOGRAPHY
Pandurangan, G., Raghavan, P. & Upfal, E. (2002). Using PageRank to Characterize Web Structure. in `8th Annual International Computing and Combinatorics Confer(COCOON)'. 4.5.2.6 ence
Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J. & Giles, C. L. (2002). Winners Don't Take All: Characterizing the Competition for Links on the Web. Proceedings of the National Academy Sciences 99(8), 5207-5211.3.2.2 of Pinski, influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Pro& Management pp. 297-312.3.3.2,3.3.2.1 cessing P. & Pitkow, J. E. (1999). Distributions characterizations. of surfers' paths through the world World Wide Web 2(1-2), 29-45.3.3.3,5.4.1 G. & Narin, F. (1976). Citation
Pirolli,
wide web: Empirical Pirolli, P., Pitkow,
J. & Rao, R. (1996). Silk from a sow's ear: extracting usable structures from the web. in `Conference proceedings on Human factors in computing
ACM Press. systems'. pp. 118-125.3.3.2

Plachouras, V. & Ounis, I. (2004). Usefulness of hyperlink structure for query-biased in `Proceedings of the 27th annual international ACM SIGIR topic distillation. Information in Developement Research and conference on 448-455.2.3.3.4,3.3.1,4.3,4.5.1,4.5.1,7.4.2.1 pp. for (2005). Dempster-Shafer theory I. & Ounis, V. a query-biased combiPlachouras, Information Web. the nation of evidence on Retrieval 8,197-218.3.6.2,5.2.1 Retrieval'. ACM Press.
University (2003). C. J. Rijsbergen, Van & of I. Ounis, Plachouras, V., Cacheda, F., the Hyperlink Application using Dynamic analysis of Glasgow at the Web track: Retrieval Text Twelfth The 500-251: Query Scope. in `NIST Special Publication Conference (TREC 2003)'. 3.3.1,4.4.1,4.5.1
ExTREC2004: Glasgow University (2004). at I. Ounis, of & B. He, V., Plachouras, the `Proceeddings in Terrier. of Terabyte tracks with Robust Web, in and periments 2.3.3.4,3.6.2,4.3,4.4.1,4.5.1 2004)'. (TREC Conference 13th Text REtrieval for Model the Absorbing Static (2005). The G. Amati, & Ounis, I. V., Plachouras, Web. Journal of Web Engineering 4,165-186.4.5.2.6
274
BIBLIOGRAPHY
Plachouras, V., Ounis, I. & Cacheda, F. (2004). Selective Combination Evidence for of Topic Distillation using Document and Aggregate-level Information. in `Proceedings of RIAO 2004 (Recherche d'Information Assistee par Ordinateur - Computer assisted information retrieval)'. pp. 610-622.5.3.1 Plachouras, G. & Van Rijsbergen, C. J. (2002). University of Glasgow at the Web track of TREC 2002. in `KIST Special Publication 500-251: The Eleventh Text Retrieval Conference (TREC)'. pp. 645-651.4.4.1 V., Ounis, I., Amati,
Ponte, J. M. & Croft, W. B. (1998). A language modeling approach to information retrieval. in `Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval'. ACM Press. pp. 275-281. 2.3.2 Porter, M. F. (1980). An algorithm for suffix stripping. Program 14(3), 130-137.2.2, 4.2 Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes:
The Art of Scientific Computing. 2nd edn. Cambridge University Press. Cambridge (UK) and New York. 4.3.2 R Development Core Team (2005). R: A language and environment for statistical comAustria. for Statistical Computing. Vienna, ISBN R Foundation 3-900051puting. 07-0.1
Raggett, D., Le Hors, D. & Jacobs, I. (1999). `Html 4.01 specification'. 3.2.1,4.3.1
Raghavan, S. & Garcia-Molina, the 27th International Kaufmann H. (2001). Crawling the hidden web. in `Proceedings of Conference on Very Large Data Bases (VLDB'01)'. Morgan
Publishers Inc.. pp. 129-138.3.2.2
IR. for `Proin Model Network (1996). A Belief R. Ribeiro-Neto, B. & Muntz, R. Research SIGIR ACM and international on conference 19th the annual ceedings of Developement in Information Retrieval'. pp. 253-260.3.4.3 ComProbabilistic Surfer: Intelligent Richardson, M. & Domingos, P. (2002). The
bination Information Content Link of and Neural in Àdvances in PageRank. in
Information Processing Systems 14'. 3.4.1
275
BIBLIOGRAPHY
Robertson, S. (2002). Òn Bayesian Models and Event Spacesin Information Retrieval. Workshop on Mathematical/Formal Methods in Information Retrieval, ACM SIGIR Conference'. 2.3.2
Robertson, S. & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Informmation Science 27,129-146.2.3,2.3.1 Robertson, S. & Walker, S. (1994). Some simple effective approximations to the 2poisson model for probabilistic weighted retrieval. in `Proceedings of the 17th annual international ACM-SIGIR conference on Research Development in Information and Retrieval'. Springer-Verlag New York, Inc.. pp. 232-241.2.3.1
Robertson, S., Van Rijsbergen, C. J. & Porter, M. F. (1981). Probabilistic models indexing of and searching. in `Proceedings of the 3rd annual ACM conferenceon
Development in Information Retrieval'. Butterworth and Research & Co.. Kent, UK,
UK. pp. 35-56.2.3,2.3.1

Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. & Gatford, M. (1994).
Okapi at TREC-3. in `NIST Special Publication 500-225: Overview of the Third

Text REtrieval Robertson, Conference (TREC-3)'. pp. 109-126.2.3.1,4.3.2
S., Zaragoza, H. & Taylor, M. (2004). Simple BM25 extension to multiple
Information Conference ACM 13th `Proceedings the fields. in and on of weighted Knowledge Management (CIKM'04)'. ACM Press. pp. 42-49.3.4.4.2,4.4.1,4.4.2
`Proin in (2004). Understanding D. user goals web search. Rose, D. E. & Levinson, ACM (WWW13)'. Web Wide World international 13th on the conference of ceedings Press. pp. 13-19.3.2.4,3.6.1,7.2.1 Salton, G. & McGill, Retrieval. Information Modern to M. (1986). Introduction
McGraw-Hill, Inc. 2.3

Cominformation boolean retrieval. Extended (1983). H. Wu, & A. E. G., Fox, Salton,
1022-1036.2.3 ACM 26(11), mun. M. retrieving. information and A seeking (1988). study of Saracevic, T. & Kantor, P. Science Information for Society American the Journal Searchers, searches,overlap. of 39(3), 197-216.3.4.4.1
276
BIBLIOGRAPHY
Savoy, J. (1996). An extended vector-prcessing scheme for searching information in Hypertext systems. Information Processing & Management 32(2), 155-170.3.4.3 Savoy, J. & Picard, J. (2001). Retrieval effectiveness on the web. Information Processing & Management 37(4), 543-569.3.4.3,4.2 Savoy, J. & Rasolofo, Y. (2001). Report on trec-10 experiment: Distributed collections and entrypage searching. in `NIST Special Publication 500-250: The Tenth Text Retrieval Conference (TREC 2001)'. 3.3.1,4.5.1 Savoy, J., Rasolofo, Y. & Perret, L. (2003). Report on the TREC 2003 Experiment: Genomic and Web Searches. tin `NIST Special Publication 500-255: The Twelfth Text
Retrieval Conference (TREC 2003)'. pp. 739-750.4.4.1

Shaw, J. & Fox, E. (1994). Combination of multiple searches. in `NIST Special Publication 500-226: Overview of the Third Text REtrieval Conference (TREC-3)'. pp. 105-
109.3.4.4.1
Shivakumar, N. & Garcia-Molina, H. (1998). Finding near-replicas of documents on the Workshop on the World Wide Web and
Web. in `Proceedings of the International
Databases (WebDB'98)'.
3.2.3
Siegel, S. & Castellan, J. J. (1988). Nonparametric statistics for the Behavioral Sciences,
2nd Ed.. McGraw-Hill.

Silva, I., Ribeiro-Neto,
6.2.1
B., Calado, P., Moura, E. & Ziviani, N. (2000). Link-based and information in a belief network model. in `Proceedings of ACM SIGIR conference on Research and Development
content-based evidential the 23rd annual international in Information Retrieval'.
ACM Press. pp. 96-103.3.4.3
5.5.3 London. & Hall, Chapman Silverman, B. (1986). Density Estimation. large Analysis (1999). M. & Moricz, a very of M. Henzinger, H., Marais, C., Silverstein, 6-12.1.2,3.2.4,7.4.2.1 33(1), Forum SIGIR log. web search engine query
length document Pivoted (1996). normalization. M. & Mitra, C. A., Buckley, Singhal, Research Conference SIGIR ACM on international 19th in `Proceedings of the annual 21-29.3.4.4.3 Press. ACM Retrieval'. pp. Information in Development and
277
BIBLIOGRAPHY
Spark-Jones, K. & Van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of documentation 32,59-75.2.4
Spertus, E. (1997). ParaSite: Mining Structural Information on the Web. in 'Proceedings of the 6th international conference on World Wide Web (WWW6)'. 3.2.3 Tajima, K., Hatano, K., Matsukura, T., Sano, R. & Tanaka, K. (1999). Discovery and retrieval of logical information units in web. in `Proceedings of the 1999 ACM Digital Libraries Workshop on Organizing Web Space'. 3.3.1 Tajima, K., Mizuuchi, Y., Kitagawa, M. & Tanaka, K. (1998). Cut as a querying unit for WWW, Netnews, and E-mail. in `Proceedings ACM Hypertext '98'. pp. 235-244. of 3.3.1 Tomiyama, T., Karoji, K., Kondo, T., Kakuta, Y., Takagi, T., Aizawa, A. & Kanazawa, T. (2003). Meiji University Web, Novelty and Genomics Brack Experiments. in `NIST Special Publication 500-261: The Thirteenth Text Retrieval Conference (TREC
2004)'. 4.4.1
Tomlinson, S. (2005). European web retrieval experiments with hummingbird searchWorkshop'. 4.5.1 CLEF Notes for 2005 `Working 2005. in the servertm at clef
Trigg, R. (1983). À Network-Based Approach to Text Handling for the Online Scientific Community. PhD Thesis, Dept. of Computer Science, University of Maryland'. 3.2.1
inferfor the (2004). Combining web retrieval using Tsikrika, T. & Lalmas, M. evidence f1 Management Processing Information ence network model: an experimental study.
40(5), 751-772.3.4.4.2
inference Evaluation (1991). network-based retrieval B. W. of an Turtle, H. & Croft, 187-222.3.4.3,3.4.4.2 9(3), Systems Information ACM Transactions on model. in home Query-independent (2003). evidence D. & Hawking, N. Craswell, Upstill, T., 21,286-313.3.4.2,3.4.4.3 Systems Information ACM Transactions finding. on page
Van Rijsbergen, C. J. (1979). Information

Heinemann. 2.1,2.3,2.4
Butterworth2nd Retrieval, edition.
278
BIBLIOGRAPHY Voorhees, E. (2003). Overview of the TREC 2003 Robust Retrieval Track. in 'NISI Special Publication 500-255: The Twelfth Text REtrieval Conference(TREC 2003)'. pp. 69-77.3.6.2 Voorhees, E. (2004). Overview the TREC 2004 Robust of Retrieval Track. in `Proceedings of the 13th Text REtrieval Conference (TREC 2004)'. 3.6.2
Wald, A. (1950). Statistical Decision Functions. Wiley. 5.2
Watts, D. J. & Strogatz, S. H. (1998). Collective dynamics of small-world networks.
Nature 393,440-442.3.2.2
Westerveld, T., Hiemstra, D. & Kraaij, W. (2001). Retrieving Web Pages Using Content, Links, URLs and Anchors. in `NIST Special Publication: Text REtrieval Williams, 500-250 The Tenth Conference (TREC 2001)'. pp. 663-673.3.3.1,3.4.4.2,3.4.4.3,4.5.1
H. & Zobel, J. (1999). Compressing Integers for Fast File Access. Computer
Journal 42,193-201.2.2 Witten, I., Moffat, A. & Bell, T. (1994). Managing Gigabytes: Compressing and Index-
ing Documents and Images. Van Nostrand Reinhold, New York. 2.2
Wong, A. & You, M. (1985). Entropy and distance of random graphs with application to Machine Analysis PAttern IEEE Transactions and on structural pattern recognition.
Intelligence PAMI-7,599-609.5.4.2
Yom-'Ibv, (2005). Learning to estimate query A. & Darlow, D. Carmel, S., E., Fine,
distributed indetection to and including applications difficulty: missing content SIGIR ACM international 28th formation retrieval. in `Proceedings of the annual Press. ACM Retrieval'. Information in Development Research and conference on
512-519.1.2,3.6.2 pp.
Yuret, D. (1994). `From Genetic Algorithms To Efficient Optimization. Master Thesis,
MIT, A. I. Technical Report No. 1569.'. 4.3.2

Microsoft (2004). S. & Robertson, S. Saria, M., Taylor, N., Craswell, H., Zaragoza, 500Publication Special `KIST in Cambridge at TREC-13: Web and HARD tracks. 3.4.4.2,4.4.1,4.4.1, 2004)'. (TREC Conference Retrieval Text Thirteenth The 261:
4.4.2,4.5.1
279
BIBLIOGRAPHY
Zhai, C. & Lafferty, J. (2001). A study of smoothing methods for language models apinformation hoc to ad plied ACM international 24th `Proceedings in the annual of retrieval. SIGIR conference on Research and Development in Information Retrieval'.
ACM Press. pp. 334-342.2.3.2
280

Operational Research PHD Thesis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Operational Research PHD Thesis

Uploaded by

Copyright:

Available Formats

Plachouras, Vasileios (2006) Selective web information retrieval. PhD thesis.

Glasgow Theses Service http://theses.gla.ac.uk/ theses@gla.ac.uk

Web Information Retrieval

One of the main challenges in Web information

retrieval is the number of

Web selective of applying

cision mechanism is based on limited the information informational mixed

are evaluated using different training in order to approximate

introduction the this thesis are of

this thesis presents a thorough for Web information retrieval,

of queries in order to perform

Overall, selective Web information

is a promising approach, which retrieval

I would like to thank the following people:

being a great supervisor;

of my Ph. D have been unsurpassed.

for your advice

PAGE NUMBERING AS ORIGINAL

1.2 1.3 1.4 2 Basic 2.1 2.2 2.3

............... ............... ...............

2 3 4 6 ...... ...... ...... ...... ...... 6 7 8 10 12 13 19 21 22 ... 22 22 23 24 26 27 28 i '??

............................. About Web information retrieval Retrieval

Web Information 3.1 3.2 Introduction

... ... ... ... ... ... ...

3.2.2 3.2.3 3.2.4 3.3

3.3.2 3.3.3 3.4

analysis algorithms ..............

analysis with anchor text

3.4.3 3.4.4 3.5

intentions user goals and

Introduction Experimental Document

.................................. ............................. for Web information representations setting

4.3.1 4.3.2 4.3.3 4.3.4

Representing Web documents Parameter setting Evaluation results

Combining document fields 4.4.1

......................... Weighting models for field retrieval .................

4.4.3 4.4.4 4.5

Query-independent 4.5.1 4.5.2

4.5.3 4.5.4 4.6

field of retrieval with query-independent

92 92 ........... retrieval 95 9 .... 97 99

framework 5.1 5.2 Introduction

............................... Selective retrieval as a statistical decision problem

Bayesian decision mechanism 5.5.1 5.5.2 5.5.3 Definition Application

............................... Web Information Retrieval

Evaluation 6.1 6.2

141 151 151 132

6.4.4 6.4.5 6.4.6 6.5

159 161 162 163 16.1 . 165

Document sampling 6.5.1 6.5.2 6.5.3

Description of experimental setting and presentation of results

6.5.5 6.5.6 6.5.7 6.6 6.7 6.8 6.9 7 Using

Selective mation 7.1 7.2

Introduction Limited 7.2.1 7.2.2

Modelling limited relevance information Experimental

of experiments F_with limited relevance information

7.3.1 7.3.2 7.3.3

Ad-hoc decision mechanism and query sampling ........

...... ...... ......

............................. and Future Work

and conclusions ........................

Contributions Conclusions work

225 236 280