Lucene

Brian Nisonger, Feb 08, 2006

What is it?
Doug Cutting's grandmother's middle name
An open source set of Java classes
  Search Engine/Document Classifier/Indexer
  http://lucene.sourceforge.net/talks/pisa/
Developed by Doug Cutting, 1996
  Xerox/Apple/Excite/Nutch
  Wrote several papers in IR

What is it - Nuts and Bolts
Modules for IR
  Analysis
    Tokenization
    Where tokens are indexed
  Document
    Where the Document ID is created
    Date of Document is extracted
    Title of Document is extracted

Nuts and Bolts - II
Modules-Cont
  Index
    Provides access to indexes
    Maintains indexes
  Query Parser
    Where the magic of query happens
  Search
    Searches across indexes

Nuts and Bolts - III
Modules-Cont
  Search Spans
    Spans K+/- words
    Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
  Store/Util
    Store the indexes and other housekeeping
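
The span example on this slide (two names within K words of each other, plus a required term) can be sketched in plain Python. This is only an illustration of the idea, not Lucene's span API: the function names `positions`, `within_k`, and `matches` are hypothetical, and tokenization is a simple lowercase word split.

```python
import re

def positions(tokens, phrase):
    """Return the start positions where the phrase (a list of tokens) occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within_k(tokens, phrase_a, phrase_b, k):
    """True if some occurrence of phrase_a starts within k tokens of phrase_b."""
    starts_a = positions(tokens, phrase_a)
    starts_b = positions(tokens, phrase_b)
    return any(abs(a - b) <= k for a in starts_a for b in starts_b)

def matches(text, k=100):
    """Both names within k words of each other, and the term 'cooking' present."""
    tokens = re.findall(r"\w+", text.lower())
    return (within_k(tokens, ["rachael", "ray"], ["alton", "brown"], k)
            and "cooking" in tokens)
```

A document mentioning both names and the word "cooking" matches with the default window of 100 words; shrinking `k` makes the proximity condition stricter.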

Theory
Space Optimization for Total Ranking
  Cutting et al 1996
  RIAO (Computer Assisted IR) 1997
  http://lucene.sf.net/papers/riao97.ps
Lucene lecture at Pisa
  Doug Cutting
  Slides from Lecture at University of Pisa 2004
  See previous link

Vector
Vectors give a mathematical distance between terms
Uses a cosine distance to determine how close terms/documents are (a higher value means closer)
This distance can then be used for WSD/Clustering/IR
Example:
  Bass,fishing: .6506
  Bass,guitar: .000423
  This tells us the document is about fishing, not about guitars
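
As a rough illustration of the cosine measure, the sketch below compares hand-made term vectors. The vectors are hypothetical co-occurrence counts invented for this example; they are not the data behind the .6506 and .000423 figures on the slide.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts over the contexts [lake, boat, amp, chord].
bass = [4, 3, 0, 0]
fishing = [5, 2, 0, 0]
guitar = [0, 0, 6, 4]

bass_fishing = cosine_similarity(bass, fishing)
bass_guitar = cosine_similarity(bass, guitar)
```

With these made-up counts, `bass_fishing` comes out much larger than `bass_guitar`, which is the same kind of evidence the slide uses to decide the document is about fishing.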

Vectors-IR
Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.
  http://www.perl.com/pub/a/2003/02/19/engine.html
Intro to Comp Ling and its applications to IR
  Nisonger 2005 :P
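
The term-space idea quoted above can be sketched as follows: one dimension per unique word in the collection, and each document becomes a vector of word counts. The three sample documents and the helper names are made up for illustration, and raw counts stand in for whatever weighting a real engine would use.

```python
from collections import Counter
import math

def build_vocabulary(docs):
    """One dimension per unique word in the whole collection."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def to_vector(doc, vocab):
    """Represent a document as word counts along each vocabulary dimension."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices fell sharply",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
```

The first two documents share most of their words and end up close together in this space, while the third shares none and ends up far from both.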

Inverted Index
Term/Doc Id/Weight
Term
  A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stopword elimination, stemming, filtering, term normalization, or language translation -- has been applied.
  http://www.javaworld.com/javaworld/jw-09-2000/jw-09
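
To make the token definition concrete, here is a minimal sketch of an analysis step: lowercase, split into words, drop stopwords, then stem. The stopword list and the suffix-stripping `naive_stem` are deliberately crude stand-ins chosen for this example; they are not Lucene's analyzers or a real stemming algorithm.

```python
import re

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def analyze(text):
    """Turn raw text into the tokens that would be indexed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(word) for word in words if word not in STOPWORDS]
```

For example, `analyze("The indexing of documents is fast")` yields the tokens `index`, `document`, and `fast`.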

Inverted Index Cont
Doc Id
  A unique key that identifies each document
Weight
  Binary
  Freq Count
  Weighting Algorithm
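
A small sketch of the Term / Doc Id / Weight layout, assuming an in-memory dictionary rather than Lucene's on-disk format. The `weighting` parameter is a hypothetical switch between the binary and frequency-count options listed above; a more elaborate weighting algorithm could be substituted at the same place.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs, weighting="freq"):
    """Map each term to {doc_id: weight}.

    docs maps a unique doc id to its text. weighting is "binary"
    (1 if the term occurs) or "freq" (number of occurrences).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for term, count in counts.items():
            index[term][doc_id] = 1 if weighting == "binary" else count
    return dict(index)

docs = {1: "bass fishing bass", 2: "bass guitar"}
freq_index = build_inverted_index(docs, weighting="freq")
binary_index = build_inverted_index(docs, weighting="binary")
```

Looking up a term then returns every document that contains it along with its weight, without scanning the documents themselves.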

Index Merge
Basic/Basket/Basketball
  Only keeps track of the differences between words
Periodically merges indexes
  Allows new documents to be added easily
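
One way to read "only keeps track of the differences between words" is front coding: each term in a sorted list is stored as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below shows that idea and a simple merge of two small indexes. Both are illustrations under stated assumptions (in-memory dictionaries, disjoint doc ids across the two indexes), not a description of Lucene's actual file format or merge policy.

```python
def front_encode(sorted_terms):
    """Store each term as (shared prefix length, remaining suffix)."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded

def front_decode(encoded):
    """Rebuild the original sorted term list."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms

def merge_indexes(a, b):
    """Merge two {term: {doc_id: weight}} indexes (doc ids assumed disjoint)."""
    merged = {term: dict(postings) for term, postings in a.items()}
    for term, postings in b.items():
        merged.setdefault(term, {}).update(postings)
    return merged

encoded = front_encode(["basic", "basket", "basketball"])
```

Under this reading, new documents go into a small separate index that is cheap to build and is folded into the larger one later, which is why periodic merging makes additions easy.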

Query
Boolean Search
  Only searches documents with at least 1 term in query
  Boolean Search Engine
Parallel Search
  Each term in query is searched in parallel
  Partial scores added to queue of docs
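
The two points above can be sketched with per-document accumulators: only documents that appear in the postings of at least one query term are ever touched, and each term contributes a partial score. For simplicity this sketch walks the terms one after another rather than in parallel, and the index layout `{term: {doc_id: weight}}` is an assumption of the example.

```python
from collections import defaultdict

def score_query(index, query_terms):
    """Sum per-term partial scores for every document containing at least one query term."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight  # partial score from this term
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = {
    "bass": {1: 2, 2: 1},
    "fishing": {1: 1},
    "guitar": {2: 1, 3: 1},
}
ranked = score_query(index, ["bass", "fishing"])
```

Document 3 contains neither query term, so it never receives an accumulator and does not appear in `ranked`.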

Query-II
Threshold
  If partial score is too low and will not be part of N-best then the document is ignored even before search is complete
  Example
    Potential New Doc [0,0,0,0,0,0,i]
    Document ranked 14 [233,202,109,100,i]
    Potential New Doc is ignored
Small loss of recall greatly increases speed of search
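
The threshold idea can be sketched as a pruning rule on new candidates. The specific rule used here is an assumption made for illustration, not necessarily the one Lucene or the cited paper uses: before each query term is processed, the score of the current N-th best document is taken as the threshold, and a document seen for the first time is admitted only if its partial score reaches that threshold.

```python
import heapq

def search_n_best(index, query_terms, n_best):
    """Return the top n_best (doc_id, score) pairs and the set of ignored docs."""
    scores = {}
    ignored = set()
    for term in query_terms:
        # Score of the current N-th best document, or 0.0 while fewer than N are known.
        if len(scores) >= n_best:
            threshold = heapq.nlargest(n_best, scores.values())[-1]
        else:
            threshold = 0.0
        for doc_id, weight in index.get(term, {}).items():
            if doc_id in scores:
                scores[doc_id] += weight
            elif doc_id not in ignored and weight >= threshold:
                scores[doc_id] = weight
            else:
                # Too low to plausibly reach the N-best: drop it early.
                ignored.add(doc_id)
    top = heapq.nlargest(n_best, scores.items(), key=lambda item: item[1])
    return top, ignored

index = {
    "lucene": {1: 3.0, 2: 2.5},
    "search": {1: 1.0, 2: 1.0, 3: 0.2},
}
top, ignored = search_n_best(index, ["lucene", "search"], n_best=2)
```

Document 3 first appears with a partial score of 0.2, below the current second-best score of 2.5, so it is dropped without further scoring. This is where the small loss of recall comes from: a dropped document might have caught up on later terms.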

Evaluation of Lucene
Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  Tellex et al, MIT AI Lab 2003
Compared Prise to Lucene for question and answer tasks
Question & Answer
  <Who is the president?> <George W. Bush .76>

Evaluation-II
Prise
  An IR system developed by NIST that, according to the paper, uses modern search engine techniques
Findings
  Found Prise was better than Lucene, since Boolean query engines are considered old school and its answers to questions were better

Eval-III
Lucene
  Found that although Prise had better correct answers, Lucene found more documents containing relevant information

Eval-Conclusion
External Knowledge Sources for Question Answering
  Katz et al, MIT Lab 2005
  http://people.csail.mit.edu/gremio/publications
  http://people.csail.mit.edu/gremio/publication
MIT used Lucene in their 2005 TREC submission, not Prise

Users
Lucene is used widely
  TREC
  Document Retrieval
  Enterprise Systems
  Part of Database/Web engine
  Part of Nutch
Used by academics for large projects
  MIT AI Lab
  Know-It-All Project (UW)

Conclusions
Lucene is a good set of classes
Designed to allow customization without having to reinvent the wheel
Robust
Fast
Large development groups
Used widely in Academia and Industry

Questions?
Feel free to ask questions, make comments, tell jokes.

That's ALL Folks!!!!!
