Lucene

Brian Nisonger, Feb 08, 2006

What is it?
  An open source set of Java classes
    Search Engine/Document Classifier/Indexer
  Developed by Doug Cutting, 1996
    Xerox/Apple/Excite/Nutch
    Wrote several papers in IR
  The name: Doug Cutting's grandmother's middle name
  http://lucene.sourceforge.net/talks/pisa/

What is it-Nuts and Bolts
  Modules
    Analysis
      Tokenization
      Where tokens are indexed for IR
    Document
      Where the Document ID is created
      Date of Document is extracted
      Title of document is extracted
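
As a rough illustration of the Analysis and Document modules above, the sketch below tokenizes text and wraps it in a document record with an ID, date, and title. It is plain Python, not Lucene's Java API; `analyze`, `make_document`, and the stopword list are hypothetical names invented for this example.

```python
import itertools
import re

# Hypothetical stopword list; a real analyzer would use a larger one.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

def analyze(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

_next_id = itertools.count(1)  # source of sequential document IDs

def make_document(title, date, body):
    """Assign a fresh ID and attach the title, date, and analyzed tokens."""
    return {
        "id": next(_next_id),
        "title": title,
        "date": date,
        "tokens": analyze(body),
    }

doc = make_document("Lucene talk", "2006-02-08", "The magic of the query")
print(doc["id"], doc["tokens"])
```

Here the date and title are passed in directly; how they would be extracted from a raw file is left open.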

Nuts and Bolts-II
  Modules-Cont
    Index
      Provides access to indexes
      Maintains indexes
    Query Parser
      Where the magic of query happens
    Search
      Searches across indexes

Nuts and Bolts-III
  Modules-Cont
    Search Spans
      Spans K+/- words
      Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
    Store/Util
      Store the indexes and other housekeeping
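
The span example above can be sketched in a few lines of plain Python. This is only an illustration of the "within K words" idea and does not use Lucene's span classes; `phrase_positions`, `within_k`, and `matches` are hypothetical helpers, and whitespace splitting stands in for real analysis.

```python
def phrase_positions(tokens, phrase):
    """Return the start index of every occurrence of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within_k(tokens, phrase_a, phrase_b, k):
    """True if some occurrence of phrase_a starts within k words of phrase_b."""
    starts_a = phrase_positions(tokens, phrase_a)
    starts_b = phrase_positions(tokens, phrase_b)
    return any(abs(a - b) <= k for a in starts_a for b in starts_b)

def matches(text, k=100):
    """Both names within k words of each other, plus the term 'cooking'."""
    tokens = text.lower().split()
    return (within_k(tokens, ["rachael", "ray"], ["alton", "brown"], k)
            and "cooking" in tokens)

near = "rachael ray joined alton brown for a cooking show"
far = "rachael ray " + "filler " * 200 + "alton brown cooking"
print(matches(near), matches(far))
```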

Theory
  Space Optimization for Total Ranking
    Cutting et al 1996, RIAO (Computer Assisted IR) 1997
    http://lucene.sf.net/papers/riao97.ps
  Lucene lecture at Pisa
    Doug Cutting Slides from Lecture at University of Pisa 2004
    See previous link

Vector
  Vectors give a mathematical distance between terms
  Uses a cosine distance to determine how close terms/documents are
  This distance can then be used for WSD/Clustering/IR
  Example:
    Bass,fishing: .6506
    Bass,guitar: .000423
    This tells us the document is about fishing not about guitars
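
For reference, the standard cosine measure between two vectors is written out below. The symbols a and b are notation introduced here, not taken from the slides. The slide calls this a distance, but the example values behave like a similarity: a higher score means the terms are closer.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2}} \, \sqrt{\sum_{i} b_i^{2}}}
```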

Vectors-IR
  Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.
  http://www.perl.com/pub/a/2003/02/19/engine.html
  Intro to Comp Ling and its applications to IR, Nisonger 2005 :P
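
A minimal sketch of the term-space idea, assuming raw word counts as vector components and whitespace tokenization. `term_vector` and `cosine` are hypothetical helper names, and the three sample texts are made up for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Map each word in the text to its count (a sparse term-space vector)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

fishing = term_vector("bass fishing lake boat bass")
angling = term_vector("fishing for bass on the lake")
guitar = term_vector("electric guitar amplifier strings")

# Documents sharing many words score higher than documents sharing none.
print(cosine(fishing, angling) > cosine(fishing, guitar))
```

Only the words that actually occur are stored, so the high dimensionality of the term space costs nothing for words a document does not contain.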

Inverted Index
  Term/Doc Id/Weight
  Term
    A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stopword elimination, stemming, filtering, term normalization, or language translation -- has been applied.
    http://www.javaworld.com/javaworld/jw-09-2000/jw-09

Inverted Index Cont
  Doc Id
    A unique key that identifies each document
  Weight
    Binary
    Freq Count
    Weighting Algorithm
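
The Term/Doc Id/Weight layout can be sketched as a dictionary of postings. This is an illustrative in-memory structure, not Lucene's storage format; `build_index` is a hypothetical name, and only the binary and frequency-count weights from the slide are shown.

```python
from collections import Counter, defaultdict

def build_index(docs, binary=False):
    """Map each term to a {doc_id: weight} postings dict.

    The weight is 1 when binary=True, otherwise the term's frequency count.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = 1 if binary else count
    return index

docs = {1: "bass fishing bass", 2: "bass guitar"}
index = build_index(docs)
print(index["bass"])
```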

Index Merge
  Basic/Basket/Basketball
    Only keeps track of the differences between words
  Periodically merges indexes
    Allows new documents to be added easily
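
One common way to keep only the differences between adjacent sorted words is front coding, sketched below alongside a simple merge of two small indexes. Whether this matches Lucene's on-disk format in detail is not stated in the slides; `front_code` and `merge_indexes` are hypothetical names, and the index layout (term mapped to a doc-id/weight dict) is an assumption.

```python
import os

def front_code(sorted_terms):
    """Encode each term as (shared prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([prev, term]))
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded

def merge_indexes(a, b):
    """Merge two {term: {doc_id: weight}} indexes into a new one."""
    merged = {term: dict(postings) for term, postings in a.items()}
    for term, postings in b.items():
        merged.setdefault(term, {}).update(postings)
    return merged

print(front_code(["basic", "basket", "basketball"]))
# prints [(0, 'basic'), (3, 'ket'), (6, 'ball')]
```

New documents can go into a small fresh index first and be folded into the larger one later with a merge like this, which is why adding documents stays cheap.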

Query
  Boolean Search
    Only searches documents with at least 1 term in query
    Boolean Search Engine
  Parallel Search
    Each term in query is searched in parallel
    Partial scores added to queue of docs
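
A small sketch of the two ideas on this slide, assuming an index that maps each term to a doc-id/weight dict. Only documents that contain at least one query term are ever touched, and each term contributes a partial score. The terms are processed one after another here for simplicity, not in parallel; `search` is a hypothetical name.

```python
from collections import defaultdict

def search(index, query_terms):
    """Score only documents that contain at least one query term.

    index maps term -> {doc_id: weight}; each term adds a partial score.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight  # accumulate the partial score
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = {"bass": {1: 2, 2: 1}, "fishing": {1: 1}, "guitar": {2: 1, 3: 1}}
print(search(index, ["bass", "fishing"]))
# prints [(1, 3.0), (2, 1.0)]
```

Document 3 never appears in the result because it contains neither query term.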

Query-II
  Threshold
    If partial score is too low and will not be part of N-best then the document is ignored even before search is complete
  Example
    Potential New Doc [0,0,0,0,0,0,i]
    Document ranked 14 [233,202,109,100,i]
    Potential New Doc is ignored
  Small loss of recall greatly increases speed of search
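
The threshold idea can be sketched as early termination against the current N-best list. This is not the algorithm from the slides or the paper: it uses a safe upper bound per term, so it loses no recall, and a more aggressive threshold would trade some recall for speed as the slide describes. `top_n` and the index layout are assumptions made for this example.

```python
import heapq

def top_n(index, query_terms, n):
    """Return the n best (score, doc_id) pairs, skipping hopeless documents.

    A document is abandoned as soon as its partial score plus the most it
    could still gain from the remaining terms cannot beat the n-th best score.
    """
    # Upper bound on what each term can contribute to any one document.
    max_gain = [max(index.get(t, {}).values(), default=0) for t in query_terms]
    candidates = {d for t in query_terms for d in index.get(t, {})}
    best = []  # min-heap of (score, doc_id); best[0] is the current n-th best
    for doc_id in sorted(candidates):
        partial = 0
        for i, term in enumerate(query_terms):
            partial += index.get(term, {}).get(doc_id, 0)
            remaining = sum(max_gain[i + 1:])
            if len(best) == n and partial + remaining <= best[0][0]:
                break  # cannot enter the n-best list; stop scoring early
        else:
            if len(best) < n:
                heapq.heappush(best, (partial, doc_id))
            else:
                heapq.heappushpop(best, (partial, doc_id))
    return sorted(best, reverse=True)

index = {"bass": {1: 2, 2: 1, 3: 1}, "fishing": {1: 3, 3: 1}}
print(top_n(index, ["bass", "fishing"], 1))
# prints [(5, 1)]
```

With n=1, documents 2 and 3 are dropped after the first term: their partial score of 1 plus the best possible gain of 3 cannot exceed the score of 5 already held by document 1.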

Evaluation of Lucene
  Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
    Tellex et al, MIT AI Lab 2003
  Compared Prise to Lucene for question and answer tasks
  Question & Answer
    <Who is the president?> <George W. Bush .76>

Evaluation-II
  Prise
    An IR system developed by NIST that according to the paper uses modern search engine techniques
  Findings
    Found Prise was better than Lucene since Boolean query engines are considered old school and its answers to questions were better

Eval-III
  Lucene
    Found although Prise had better correct answers Lucene found more documents containing relevant information

Eval-Conclusion
  External Knowledge Sources for Question Answering
    http://people.csail.mit.edu/gremio/publications
    http://people.csail.mit.edu/gremio/publication
    Katz et al, MIT Lab 2005
  MIT used Lucene in their 2005 TREC submission not Prise

Users
  Lucene is used widely
    TREC
    Document Retrieval
    Enterprise Systems
    Part of Database/Web engine
    Part of Nutch
  Used by academics for large projects
    MIT AI Lab
    Know-It-All Project (UW)

Conclusions
  Lucene is a good set of classes
  Designed to allow customization without having to reinvent the wheel
  Robust
  Fast
  Large development groups
  Used widely in academia and industry

Questions?
  Feel free to ask questions, make comments, tell jokes.

That's ALL Folks!!!!!
