What is it?
Doug
http://lucene.sourceforge.net/talks/pisa/
Developed
Wrote
Xerox/Apple/Excite/Nutch
several papers in IR
for IR
tokens are indexed
Analysis
Tokenization Where Where
Document
Parser
the magic of query happens across indexes
Where
Search
Searches
Spans
Spans
K +/- words
Example: Find me a document that has Rachael Ray and Alton Brown within 100 words of each other and that also has the term cooking
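The proximity example above can be sketched in a few lines of plain Python. This is only an illustration of the idea behind span queries, not Lucene's implementation; the function names (`phrase_positions`, `near`, `matches`) and the whitespace tokenizer are assumptions made for this sketch.

```python
def phrase_positions(tokens, phrase):
    """Return the start offsets where `phrase` (a list of words) occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def near(tokens, phrase_a, phrase_b, k):
    """True if some occurrence of phrase_a starts within k words of phrase_b."""
    pos_a = phrase_positions(tokens, phrase_a)
    pos_b = phrase_positions(tokens, phrase_b)
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)


def matches(text, k=100):
    """Both names within k words of each other, and the term 'cooking' present."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return (near(tokens, ["rachael", "ray"], ["alton", "brown"], k)
            and "cooking" in tokens)


if __name__ == "__main__":
    print(matches("Rachael Ray and Alton Brown host a cooking show"))
    print(matches("Rachael Ray likes cooking"))
```

A real engine would read the positions from the index instead of rescanning the text, but the distance test is the same.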
Store/Util
Store
Theory
Space
Cutting
lecture at Pisa
previous link
Vector
Vectors
terms
Uses a cosine distance to determine how close terms/documents are
This distance can then be used for WSD/Clustering/IR
Example:
Bass, fishing: .6506
Bass, guitar: .000423
This tells us the document is about fishing, not about guitars
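As a rough illustration of the cosine measure, the sketch below compares a small term-frequency vector against two topic vectors. The slide calls the measure a distance, but since the higher value is read as the closer match, the sketch computes cosine similarity. The toy word lists are invented for the sketch and do not reproduce the figures above.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy data (assumption): a document mentioning "bass" and two topic vectors.
    doc = Counter("bass lake fishing boat bass".split())
    fishing = Counter("fishing lake boat rod".split())
    guitar = Counter("guitar amp strings chord".split())
    print(cosine(doc, fishing), cosine(doc, guitar))
```

The larger value points to the closer topic, which is the same reading the Bass example gives.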
Vectors-IR
Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.
http://www.perl.com/pub/a/2003/02/19/engine.html
Intro to Comp Ling and its applications to IR
Nisonger 2005 :P
Inverted Index
Term/Doc
Term
A
Id/Weight
Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stopword elimination, stemming, filtering, term normalization, or language translation -- has been applied.
http://www.javaworld.com/javaworld/jw-09-2000/jw-09
Id: unique key that identifies each document
Weight: Binary or Freq
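The Term to Id/Weight layout can be sketched as a dictionary of postings. This is a minimal in-memory illustration, assuming whitespace tokenization and no stopword removal or stemming; `build_index` is a name made up for the sketch and says nothing about Lucene's on-disk format.

```python
from collections import Counter, defaultdict


def build_index(docs, binary=False):
    """Map each term to {doc_id: weight}.

    docs: {doc_id: text}. The weight is the term frequency, or 1 when
    binary=True (the term is simply present in the document).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = 1 if binary else freq
    return index


if __name__ == "__main__":
    docs = {1: "bass fishing bass", 2: "bass guitar"}
    print(dict(build_index(docs)))
    print(dict(build_index(docs, binary=True)))
```

Looking up a term then returns every document Id that contains it, together with the weight used later for scoring.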
Index Merge
Basic/Basket/Basketball
Only
Query
Boolean
Only
Search
Search
Query-II
Threshold
If the partial score is too low for the document to be part of the N-best, the document is ignored even before the search is complete
Example
Potential New Doc [0,0,0,0,0,0,i]
Document ranked 14 [233,202,109,100,i]
Potential New Doc is ignored
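One way to picture the threshold idea is an N-best heap with early termination. In the sketch below, scoring of a document stops as soon as its partial score plus the best it could still earn cannot beat the current N-th best. The per-term upper bounds (`max_term_scores`) and the function name `top_n` are assumptions for illustration; the slides do not say how Lucene bounds the remaining score.

```python
import heapq


def top_n(doc_term_scores, max_term_scores, n):
    """Return (ranked N-best list, ids of documents abandoned early).

    doc_term_scores: {doc_id: [score for each query term]}
    max_term_scores: upper bound on the score of each query term (assumed known)
    """
    # remaining[i] = most a document can still gain after term i is scored
    remaining = [sum(max_term_scores[i + 1:]) for i in range(len(max_term_scores))]
    best = []     # min-heap of (score, doc_id); best[0] is the current N-th best
    skipped = []
    for doc_id, scores in doc_term_scores.items():
        partial = 0.0
        pruned = False
        for i, s in enumerate(scores):
            partial += s
            # Even with maximal remaining scores this document cannot enter the N-best.
            if len(best) == n and partial + remaining[i] <= best[0][0]:
                pruned = True
                break
        if pruned:
            skipped.append(doc_id)
        elif len(best) < n:
            heapq.heappush(best, (partial, doc_id))
        else:
            heapq.heappushpop(best, (partial, doc_id))
    return sorted(best, reverse=True), skipped


if __name__ == "__main__":
    scores = {"a": [5, 5], "b": [4, 4], "c": [0, 0]}
    print(top_n(scores, max_term_scores=[5, 5], n=2))
```

In this toy run document "c" is dropped after its first term, much like the potential new document in the example above.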
Small
Evaluation of Lucene
Quantitative
Compared
Question
<Who
Evaluation-II
Prise
An IR system developed by NIST that, according to the paper, uses modern search engine techniques
Prise was rated better than Lucene, since Boolean query engines are considered old school and its answers to questions were better
Findings
Found
Eval-III
Lucene
Found that although Prise had better correct answers, Lucene found more documents containing relevant information
Eval-Conclusion
External
http://people.csail.mit.edu/gremio/publications http://people.csail.mit.edu/gremio/publication
Katz
MIT
Users
Lucene is used widely
TREC Document Retrieval
Enterprise Systems
Part of Database/Web engine
Part of Nutch
Used by academics for large projects
MIT,
Conclusions
Lucene
Designed to allow customization without having to reinvent the wheel
Robust
Fast
Large development groups
Used widely in academia and industry
Questions?
Feel