SIGIR 2007 Proceedings Poster
The Wikipedia articles are stored in their own markup language called Wikitext, which preserves features such as categories, hyperlinks, subsections, tables, pictures, etc. We utilized the categorical information in each article to re-rank the retrieval list. Each category c is given a weight W_c equal to the number of articles in the retrieval list that belong to c. Each article d is then given a weight W_d = Σ_{c: d ∈ c} W_c. A new document score S'_d is then computed:

    S'_d = α × (S_d − min S_d) / (max S_d − min S_d)
         + (1 − α) × (W_d − min W_d) / (max W_d − min W_d)

where S_d is the original score and α is a constant set to 0.8. The 100 articles are re-ranked according to S'_d, and 40 terms are picked from the top 20 articles to expand the query. A new retrieval is then performed, shown as Wiki below.

            init     PRF      Wiki
    P@5     0.4840   0.4800   0.5640
    P@10    0.4720   0.4860   0.5200
    P@15    0.4320   0.4707   0.4947
    P@20    0.4030   0.4600   0.4720
    #0p5    11       15       4
    #0p10   6        10       4
    #0p15   5        8        4
    #0p20   4        6        3
    area    0.0315   0.0297   0.0465
    MAP     0.2106   0.2733   0.2555
    GMAP    0.1366   0.1436   0.1735

Although PRF outperforms Wiki in MAP, Wiki beats PRF in most other measures favoring weak queries. A look into the per-query results shows that, among the 15 queries that are hurt by PRF, 14 are improved by Wiki. For the other 35 queries, PRF and Wiki both produced better results than the initial retrieval, though Wiki is not as effective as PRF. This is probably because of the differences in language and context between Wikipedia (an up-to-date general-knowledge encyclopedia) and the target corpus (a news archive within a time range). Also, since the size of Wikipedia is limited, some topics may not be covered well.

            15 PRF−                    35 PRF+
            init    PRF     Wiki       init    PRF     Wiki
    MAP     .1436   .1166   .1813      .2396   .3413   .2868
    GMAP    .0761   .0256   .1042      .1756   .2645   .2157

For the 15 queries hurt by PRF, a Wilcoxon signed-ranks test at p < 0.05 shows significant increases of Wiki over PRF. For the rest, the test shows that both have significant increases over the initial retrieval.

Similarly, there are also 15 queries hurt by Wiki. The difference from those hurt by PRF, however, is that these queries performed well in the initial retrieval, with MAP = 0.2094 and GMAP = 0.1279. Manually inspecting the Wikipedia articles reveals that some queries are not well covered, while for the other queries term selection failed to pick a good set of expansion terms.

The above results suggest two possible ways for improvement. The first is to identify topics that are not well covered in Wikipedia; this could be done by analyzing the top N retrieved Wiki articles. The second is to integrate the Wiki and PRF methods, hopefully eliminating queries hurt by either.

4. RELATED WORK
Many other works have used external corpora to improve retrieval. However, many of them use a large corpus, based on the belief that the likelihood of finding good expansion terms increases with database size. Some probe the web, which has billions of documents. Diaz and Metzler [11] stated that the quality of the corpus also matters, as they were able to significantly improve on the TREC Robust 2005 track [10] with the BIGNEWS corpus (6.4 million documents), whose content is similar to the target corpus (both are news articles). Wikipedia, with 1.6 million documents at the moment, is much smaller than BIGNEWS. However, with all the policies and guidelines regulating Wikipedia, as well as the constant addition of articles, we believe it can be a high-quality corpus for general-purpose term expansion. Also, with most Wikitext markup features (e.g. links, sections, references) remaining unexplored, it certainly has a lot of potential. The challenge is how to make use of such features to find information accurately and select terms more effectively.

5. CONCLUSION
We explored Wikipedia as an external corpus for query expansion. Wikipedia is especially useful for improving weak queries which PRF is unable to improve. Despite its comparatively smaller size, we believe its content quality and evolving nature make it a good resource. In the future, we will look for more effective ways to find information and to integrate our method with existing systems.

6. REFERENCES
[1] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. From E-Sex to E-Commerce: Web search changes. IEEE Computer, 35(3):107–109, 2002.
[2] C. Buckley. Why current IR engines fail. In SIGIR, pages 584–585, 2004.
[3] K. L. Kwok, L. Grunfeld, and P. Deng. Improving weak ad-hoc retrieval by web assistance and data fusion. In AIRS, pages 17–30, 2005.
[4] Wikipedia, the free encyclopedia. http://www.wikipedia.org.
[5] E. M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[6] S. E. Robertson. On GMAP: and other transformations. In CIKM, pages 78–83, 2006.
[7] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241, 1994.
[8] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language-model based search engine for complex queries. In International Conference on Intelligence Analysis, 2005.
[9] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR, pages 472–479, 2005.
[10] D. Metzler, F. Diaz, T. Strohman, and W. B. Croft. UMass robust 2005: Using mixture of relevance models for query expansion, 2005.
[11] F. Diaz and D. Metzler. Improving the estimation of relevance models using large external corpora. In SIGIR, pages 154–161, 2006.