
SIGIR 2007 Proceedings Poster

Improving Weak Ad-Hoc Queries using Wikipedia as


External Corpus

Y. Li, R.W.P. Luk, E.K.S. Ho and F.L. Chung


Department of Computing
The Hong Kong Polytechnic University
Kowloon, Hong Kong SAR
{csyhli, csrluk, csksho, cskchung}@comp.polyu.edu.hk

ABSTRACT
In an ad-hoc retrieval task, the query is usually short and the user expects to find the relevant documents in the first several result pages. We explored the possibility of using Wikipedia's articles as an external corpus to expand ad-hoc queries. Results show promising improvements on measures that emphasize weak queries.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance Feedback, Retrieval Models

General Terms
Algorithms, Experimentation

Keywords
pseudo-relevance feedback, external corpus, Wikipedia

1. INTRODUCTION
In web retrieval tasks, the number of terms in a query is usually small (two to three on average) [1]. According to [2], if the terms cannot provide enough information about the user's need, the retrieval result may be poor. These are known as weak queries [3]. Also, the relevant documents are likely to be scattered along the retrieval list. In this case, the user may give up after inspecting the first one or two result pages without finding a relevant document. Pseudo Relevance Feedback (PRF) is a well-known method for improving retrieval effectiveness. However, it is based on the assumption that the top retrieved documents are relevant, and thus may actually harm performance when the initial retrieval's top-ranked documents are irrelevant.

Some previous work has been done to address the issue of weak queries in ad-hoc retrieval. The web assistance and data fusion method [3] probes a web search engine (e.g. Google) to form new queries, and then combines the corresponding retrieval lists. Our experiments, however, use a local repository of Wikipedia [4] articles as the external corpus. New queries are formed by analyzing Wikipedia articles, and a second retrieval on the target corpus is then performed. Results show that retrieval effectiveness, especially for weak queries, is improved.

2. WEAK QUERIES
The TREC Robust Track [5] was started in 2003 to focus on poorly performing queries. Several new measures were introduced to evaluate effectiveness on weak queries. Among them, '#0p5', '#0p10', '#0p15', and '#0p20' are the numbers of queries that have zero precision at the top 5, 10, 15, and 20 retrieved documents, respectively. 'Area' is a weighted sum of the average precision of the 25% worst-performing queries (e.g. 12 out of 50). The weight is Σ_{k=r}^{x} 1/k, where r is the query rank (the weakest has r = 1) and x is the size of the set. 'Area' measures the overall performance of the weakest queries in a set and weights the weaker ones more heavily.

Since 2004, another new measure, Geometric MAP (GMAP) [6], was introduced as an alternative to the mean average precision (MAP). GMAP takes the geometric mean of the average precisions of all the queries instead of their arithmetic mean, in order to emphasize scores close to 0.

The following table shows a comparison between our initial (BM term weights [7]) and PRF results on the TREC Robust 2005 Track. 40 terms are picked from the top 20 documents for PRF query expansion. As can be seen from the figures, although PRF improves MAP significantly, most other measures favoring weak queries are lower.

       #0p5  #0p10  #0p15  #0p20  MAP    GMAP   Area
init   11    6      5      4      .2106  .1366  .0315
PRF    15    10     8      6      .2733  .1436  .0297

The performance of certain queries is significantly harmed by PRF. Since PRF is based on the assumption that the top N documents are relevant, when the initial retrieval is unsatisfactory, the top N documents are likely to be irrelevant and may produce ineffective PRF terms.

3. WIKIPEDIA AS EXTERNAL CORPUS
Wikipedia is a multilingual, web-based, free-content encyclopedia. Its articles are contributed by users worldwide via the Internet. Therefore, unlike traditional encyclopedias, the articles in Wikipedia are constantly being updated and new entries are created every day. Wikipedia also has detailed guidelines governing its content quality. These characteristics make it an ideal repository of background knowledge.

All the articles in Wikipedia are available for download through its weekly data dumps. We keep a local copy of such data for faster access. We indexed the articles using the Indri search engine [8]. Retrieval is performed using the Markov Random Field model for term dependencies [9], following the query formulation in [10]. 100 articles are retrieved for each query.

Copyright is held by the author/owner(s).
SIGIR'07, July 23–27, 2007, Amsterdam, The Netherlands.
ACM 978-1-59593-597-7/07/0007.
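The weak-query measures above (#0pN, Area, GMAP) are simple to compute from per-query results. The sketch below is our own illustration, not the track's official evaluation code; the epsilon guard in `gmap` is an assumption (the paper does not say how zero average precisions are handled), and the input structures are hypothetical.

```python
import math

def gmap(average_precisions, eps=1e-5):
    """Geometric mean of per-query average precision.
    `eps` guards against zero APs (an assumption; the paper
    does not state how zeros are handled)."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in average_precisions)
                    / len(average_precisions))

def area(average_precisions, fraction=0.25):
    """Weighted sum of AP over the worst-performing `fraction` of
    queries. The query at rank r (weakest has r = 1) gets weight
    sum_{k=r}^{x} 1/k, where x is the size of the worst set, so
    weaker queries are weighted more heavily."""
    x = int(len(average_precisions) * fraction)   # e.g. 12 out of 50
    worst = sorted(average_precisions)[:x]        # weakest first: r = 1..x
    return sum(ap * sum(1.0 / k for k in range(r, x + 1))
               for r, ap in enumerate(worst, start=1))

def zero_precision_count(rankings, relevant, n):
    """#0pN: number of queries with zero precision in the top n.
    `rankings` maps query id -> ranked doc ids; `relevant` maps
    query id -> set of relevant doc ids (hypothetical inputs)."""
    return sum(1 for q, ranked in rankings.items()
               if not any(d in relevant[q] for d in ranked[:n]))
```

Because GMAP multiplies the per-query scores, a single near-zero query drags the whole figure down, which is exactly why it rewards methods that help the weakest queries.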


The Wikipedia articles are stored in their own markup language, called Wikitext, which preserves features such as categories, hyperlinks, subsections, tables, pictures, etc. We utilized the categorical information in each article to re-rank the retrieval list. Each category c is given a weight W_c, which equals the number of articles in the retrieval list that belong to c. Each article d is then given a weight W_d = Σ_{c : d ∈ c} W_c. A new document score S'_d is then given:

S'_d = α × (S_d − min S_d) / (max S_d − min S_d) + (1 − α) × (W_d − min W_d) / (max W_d − min W_d)

where S_d is the original score and α is a constant equal to 0.8. The 100 articles are re-ranked according to S'_d, and 40 terms are picked from the top 20 articles to expand the query. A new retrieval is performed and shown as Wiki below.

        init    PRF     Wiki
P@5     0.4840  0.4800  0.5640
P@10    0.4720  0.4860  0.5200
P@15    0.4320  0.4707  0.4947
P@20    0.4030  0.4600  0.4720
#0p5    11      15      4
#0p10   6       10      4
#0p15   5       8       4
#0p20   4       6       3
Area    0.0315  0.0297  0.0465
MAP     0.2106  0.2733  0.2555
GMAP    0.1366  0.1436  0.1735

Although PRF outperforms Wiki in MAP, Wiki beats PRF in most other measures favoring weak queries. A look into the per-query results shows that, among the 15 queries that are hurt by PRF, 14 are improved by Wiki. For the other 35 queries, PRF and Wiki both produce better results than the initial retrieval, though Wiki is not as effective as PRF. This is probably because of the differences in language and context between Wikipedia (an up-to-date general-knowledge encyclopedia) and the target corpus (a news archive within a time range). Also, since the size of Wikipedia is limited, some topics may not be covered well.

        15 PRF−                   35 PRF+
        init   PRF    Wiki        init   PRF    Wiki
MAP     .1436  .1166  .1813       .2396  .3413  .2868
GMAP    .0761  .0256  .1042       .1756  .2645  .2157

For the 15 queries hurt by PRF, a Wilcoxon signed-ranks test with p < 0.05 shows significant increases of Wiki over PRF. For the rest, the test shows that both have significant increases over the initial retrieval.

Similarly, there are also 15 queries hurt by Wiki. The difference from those hurt by PRF, however, is that these queries performed well in the initial retrieval, with MAP = 0.2094 and GMAP = 0.1279. Manually inspecting the Wikipedia articles reveals that some queries are not well covered, while for the other queries, term selection failed to pick a good set of terms for expansion.

The above results show two possible ways for improvement. The first is to identify topics that are not well covered in Wikipedia. This could be done by analyzing the top N retrieved Wiki articles. The second is to integrate the Wiki and PRF methods, hopefully to eliminate queries hurt by either.

4. RELATED WORKS
Many other works have used an external corpus to improve retrieval. However, many of them use a large corpus, based on the belief that the likelihood of finding good expansion terms increases as the database size increases. Some probe the web, which has billions of documents. Diaz and Metzler [11] stated that the quality of the corpus also matters, as they were able to significantly improve on the TREC Robust 2005 track [10] with the BIGNEWS corpus (6.4 million documents), which consists of similar content (both are news articles). Wikipedia, with 1.6 million documents at the moment, is much smaller than BIGNEWS. However, with all the policies and guidelines regulating Wikipedia, as well as the constant addition of articles, we believe it can be a high-quality corpus for general-purpose term expansion. Also, with most Wikitext markup features (e.g. links, sections, references) remaining unexplored, it certainly has a lot of potential. The challenge is how to make use of such features to find the information accurately and select terms more effectively.

5. CONCLUSION
We explored Wikipedia as an external corpus to expand the query. Wikipedia is especially useful for improving weak queries which PRF is unable to improve. Despite its comparatively smaller size, we believe the content quality and the evolving nature make it a good resource. In the future, we will look for more effective ways to find information and integrate it with existing systems.

6. REFERENCES
[1] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. From E-Sex to E-Commerce: Web search changes. IEEE Computer, 35(3):107–109, 2002.
[2] C. Buckley. Why current IR engines fail. In SIGIR, pages 584–585, 2004.
[3] K. L. Kwok, L. Grunfeld, and P. Deng. Improving weak ad-hoc retrieval by web assistance and data fusion. In AIRS, pages 17–30, 2005.
[4] Wikipedia, the free encyclopedia. http://www.wikipedia.org.
[5] E. M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[6] S. E. Robertson. On GMAP: and other transformations. In CIKM, pages 78–83, 2006.
[7] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994.
[8] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language-model based search engine for complex queries. In International Conference on Intelligence Analysis, 2005.
[9] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR, pages 472–479, 2005.
[10] D. Metzler, F. Diaz, T. Strohman, and W. B. Croft. UMass robust 2005: Using mixture of relevance models for query expansion, 2005.
[11] F. Diaz and D. Metzler. Improving the estimation of relevance models using large external corpora. In SIGIR, pages 154–161, 2006.
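The category-based re-ranking of Section 3 (category weights W_c, article weights W_d, and the α-interpolated score S'_d with α = 0.8) can be sketched as below. This is a minimal illustration, not the authors' implementation: the dictionary inputs `scores` and `categories` are hypothetical stand-ins for Indri retrieval scores and categories parsed from Wikitext.

```python
ALPHA = 0.8  # interpolation constant used in the paper

def rerank(scores, categories, alpha=ALPHA):
    """Re-rank retrieved Wikipedia articles by interpolating the
    min-max-normalized original score S_d with a normalized
    category weight W_d. `scores` maps article id -> original
    score; `categories` maps article id -> list of categories."""
    # W_c: number of retrieved articles belonging to category c
    cat_weight = {}
    for cats in categories.values():
        for c in cats:
            cat_weight[c] = cat_weight.get(c, 0) + 1

    # W_d: sum of W_c over the article's own categories
    w = {d: sum(cat_weight[c] for c in categories[d]) for d in scores}

    def norm(x, lo, hi):
        # min-max normalization; degenerate ranges map to 0
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    s_lo, s_hi = min(scores.values()), max(scores.values())
    w_lo, w_hi = min(w.values()), max(w.values())
    new = {d: alpha * norm(scores[d], s_lo, s_hi)
              + (1 - alpha) * norm(w[d], w_lo, w_hi)
           for d in scores}
    return sorted(new, key=new.get, reverse=True)
```

With α = 0.8 the original retrieval score dominates, and the category signal acts as a tiebreaker that pulls up articles sharing categories with many other retrieved articles; expansion terms would then be picked from the top of the re-ranked list.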
