
LUCENE.NET SEARCH ENGINE


Introduction:
Lucene is a free, open-source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP. Lucene is a search engine library that takes full-text search a step further.
While suitable for any application which requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene itself is just an indexing and search library and does not contain crawling or HTML parsing functionality.
In the present application we implement Lucene on the .NET Framework using C# to perform indexing and searching on a database.
Overview:
In the present application we perform a search on the local database using Lucene. We pass an English query as the input, or keyword, for the search. Lucene creates an index on the particular table, takes the keyword as its input and searches the index for matches. Consider the following example:
INPUT: PROTON
OUTPUT: DISPLAYS ALL MATCHES FOR PROTON.
In the above example "PROTON" is an element in a particular table residing in the database. Lucene takes PROTON as the keyword and displays all the elements in the table which contain proton.
It sounds easy, doesn't it? But to perform this functionality Lucene uses a number of classes and functions. Let's discuss them in detail.
LUCENE DOTNET ARCHITECTURE
As discussed earlier, Lucene.Net is an open-source search engine library which takes the implementation of indexing and searching a step further. To do this, Lucene.Net defines a number of classes.
The Lucene.Net namespace has 11 other namespaces defined in it. The ones described here are as follows:
Namespace    Description
Lucene.Net.Analysis    Used to analyze the given keyword or input, build tokens from it and then filter them to search for matches.
Lucene.Net.Analysis.Standard
Lucene.Net.Documents    Used to define the given document (for example: name, field, etc.).
Lucene.Net.Index    The key to the Lucene search engine. This namespace contains the classes that create the index on a document.
Lucene.Net.QueryParsers    Used to parse the given query, which is nothing but the given keyword.
Lucene.Net.Search    Also a key to Lucene. This namespace contains classes for performing the search operation.
Lucene.Net.Search.Spans
Lucene.Net.Store    Contains classes to store the created indexes in a specified directory.
Lucene.Net.Util    Contains classes for manipulating the given string, giving priorities to the keyword, etc.
Let's discuss the namespaces in further detail.
Lucene.Net.Analysis Namespace:
Classes:
Class    Description
Analyzer    An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer. WARNING: You must override one of the methods defined by this class in your subclass or the Analyzer will enter an infinite loop.
CharTokenizer    An abstract base class for simple, character-oriented tokenizers.
LetterTokenizer    A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
LowerCaseFilter    Normalizes token text to lower case.
LowerCaseTokenizer
PerFieldAnalyzerWrapper    This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use addAnalyzer to add a non-default analyzer on a Field name basis. See TestPerFieldAnalyzerWrapper.java for example usage.
PorterStemFilter
SimpleAnalyzer    An Analyzer that filters LetterTokenizer with LowerCaseFilter.
StopAnalyzer    Filters LetterTokenizer with LowerCaseFilter and StopFilter.
StopFilter    Removes stop words from a token stream.
Token    A Token is an occurrence of a term from the text of a Field. It consists of a term's text, the start and end offset of the term in the text of the Field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (Keyword In Context) display, etc. The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".
TokenFilter
Tokenizer
TokenStream
WhitespaceAnalyzer    An Analyzer that uses WhitespaceTokenizer.
WhitespaceTokenizer    A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
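The tokenizer-plus-filters chain described for StopAnalyzer can be sketched in a few lines. This is a minimal illustration in plain Python, not the Lucene.Net API: the function names and the stop-word list are assumptions made for the example, and each token is represented as a (text, start offset, end offset) tuple to mirror the Token description above.

```python
# Hypothetical stop-word list for illustration only; not Lucene's built-in list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def letter_tokenize(text):
    """Split text into maximal runs of letters, keeping start/end offsets."""
    tokens, start = [], None
    for i, ch in enumerate(text):
        if ch.isalpha():
            if start is None:
                start = i  # a new token begins here
        elif start is not None:
            tokens.append((text[start:i], start, i))
            start = None
    if start is not None:  # text ended while inside a token
        tokens.append((text[start:], start, len(text)))
    return tokens

def lower_case_filter(tokens):
    """Normalize token text to lower case; offsets are unchanged."""
    return [(term.lower(), s, e) for term, s, e in tokens]

def stop_filter(tokens, stop_words=STOP_WORDS):
    """Remove tokens whose text is a stop word."""
    return [tok for tok in tokens if tok[0] not in stop_words]

def analyze(text):
    """Tokenizer followed by two filters, in the order a stop analyzer applies them."""
    return stop_filter(lower_case_filter(letter_tokenize(text)))

print(analyze("The Proton-Proton chain"))
```

Because the offsets survive the filters, an application can still map each remaining token back to its position in the source text, for example to highlight it.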
Lucene.Net.Analysis.Standard:
Classes:
Class    Description
FastCharStream
ParseException    This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
StandardAnalyzer    Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter.
StandardFilter    Normalizes tokens extracted with StandardTokenizer.
StandardTokenizer
StandardTokenizerConstants
StandardTokenizerTokenManager
Token    Describes the input token stream.
TokenMgrError
Lucene.Net.Documents:
Classes:
Class    Description
DateField
Document
Field    A Field is a section of a Document. Each Field has two parts, a name and a value. Values may be free text, provided as a String or as a Reader, or they may be atomic keywords, which are not further processed. Such keywords may be used to represent dates, URLs, etc. Fields are optionally stored in the index, so that they may be returned with hits on the document.
Lucene.Net.Index:
Classes:
Class    Description
CompoundFileReader    Class for accessing a compound stream. This class implements a directory, but is limited to only read operations. Directory methods that would normally modify data throw an exception.
CompoundFileReader.CSInputStream    Implementation of an InputStream that reads from a portion of the compound file. The visibility is left as "package" only because this helps with testing, since JUnit test cases in a different class can then access package fields of this class.
CompoundFileWriter
DocumentWriter
FieldInfo
FieldInfos    Access to the Field Info file that describes document fields and whether or not they are indexed. Each segment has a separate Field Info file. Objects of this class are thread-safe for multiple readers, but only one thread can be adding documents at a time, with no other reader or writer threads accessing this object.
FieldsReader    Class responsible for access to stored document fields. It uses the per-segment .fdt and .fdx files.
FilterIndexReader    A FilterIndexReader contains another IndexReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality. The class FilterIndexReader itself simply implements all abstract methods of IndexReader with versions that pass all requests to the contained index reader. Subclasses of FilterIndexReader may further override some of these methods and may also provide additional methods and fields.
FilterIndexReader.FilterTermDocs    Base class for filtering TermDocs implementations.
FilterIndexReader.FilterTermEnum    Base class for filtering TermEnum implementations.
FilterIndexReader.FilterTermPositions    Base class for filtering TermPositions implementations.
IndexReader    Reads the index.
IndexWriter    An IndexWriter creates and maintains an index. The third argument to the constructor determines whether a new index is created, or whether an existing index is opened for the addition of new documents. In either case, documents are added with the AddDocument method. When finished adding documents, Close should be called. If an index will not have more documents added for a while and optimal search performance is desired, then the Optimize method should be called before the index is closed.
MultipleTermPositions
MultiReader    An IndexReader which reads multiple indexes, appending their content.
SegmentInfo
SegmentInfos
SegmentMerger
SegmentReader
SegmentTermDocs
SegmentTermEnum
Term    A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the Field that the text occurred in, an interned string. Note that terms may represent more than words from text fields, but also things like dates, email addresses, URLs, etc.
TermEnum
TermInfo    A TermInfo is the record of information stored for a term.
TermInfosReader
TermInfosWriter
TermVectorsReader
TermVectorsWriter    The writer works by opening a document, then opening the fields within the document, and then writing out the vectors for each Field.
IndexWriter plays a major role in creating the index. Its members are as follows.
Public Static Fields:
COMMIT_LOCK_NAME
COMMIT_LOCK_TIMEOUT    Default value is 10000. Use the Lucene.Net.commitLockTimeout system property to override.
DEFAULT_MAX_FIELD_LENGTH    Default value is 10000. Use the Lucene.Net.maxFieldLength system property to override.
DEFAULT_MAX_MERGE_DOCS    Default value is Integer.MAX_VALUE. Use the Lucene.Net.maxMergeDocs system property to override.
DEFAULT_MERGE_FACTOR    Default value is 10. Use the Lucene.Net.mergeFactor system property to override.
DEFAULT_MIN_MERGE_DOCS    Default value is 10. Use the Lucene.Net.minMergeDocs system property to override.
WRITE_LOCK_NAME
WRITE_LOCK_TIMEOUT    Default value is 1000. Use the Lucene.Net.writeLockTimeout system property to override.
Public Instance Constructors:
IndexWriter    Overloaded. Initializes a new instance of the IndexWriter class.
Public Instance Fields:
infoStream    If non-null, information about merges will be printed to this.
maxFieldLength    The maximum number of terms that will be indexed for a single Field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a Field.
maxMergeDocs
mergeFactor
minMergeDocs
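
The truncating effect of maxFieldLength is easy to overlook, so here is a minimal sketch of it in plain Python. It is not the Lucene.Net implementation: the function name is hypothetical and whitespace splitting stands in for a real analyzer. It only illustrates that terms beyond the limit never reach the index and therefore cannot be found later.

```python
DEFAULT_MAX_FIELD_LENGTH = 10000  # the default value given in this document

def terms_to_index(field_text, max_field_length=DEFAULT_MAX_FIELD_LENGTH):
    """Keep only the first max_field_length terms of a field; drop the rest."""
    terms = field_text.lower().split()  # stand-in for a real analyzer
    return terms[:max_field_length]

# With a limit of 3, the fourth term is silently excluded from the index.
indexed = terms_to_index("alpha beta gamma delta", max_field_length=3)
print(indexed)
print("delta" in indexed)
```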

Public Instance Methods:
AddDocument    Overloaded. Adds a document to this index, using the provided analyzer instead of the value of GetAnalyzer(). If the document contains more than maxFieldLength terms for a given Field, the remainder are discarded.
AddIndexes    Overloaded. Merges the provided indexes into this index. After this completes, the index is optimized. The provided IndexReaders are not closed.
Close    Flushes all changes to an index and closes all associated files.
DocCount    Returns the number of documents currently in this index.
Equals (inherited from Object)    Determines whether the specified Object is equal to the current Object.
GetAnalyzer    Returns the analyzer used by this index.
GetHashCode (inherited from Object)    Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures like a hash table.
GetSimilarity
GetType (inherited from Object)    Gets the Type of the current instance.
GetUseCompoundFile    Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
Optimize    Merges all segments together into a single segment, optimizing an index for search.
SetSimilarity    Expert: Sets the Similarity implementation used by this IndexWriter.
SetUseCompoundFile    Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
ToString (inherited from Object)    Returns a String that represents the current Object.
Protected Instance Methods:
Finalize    Releases the write lock, if needed.
MemberwiseClone (inherited from Object)    Creates a shallow copy of the current Object.
Lucene.Net.QueryParsers:
Classes:
Class    Description
FastCharStream
MultiFieldQueryParser    A QueryParser which constructs queries to search multiple fields.
ParseException    This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
QueryParser
QueryParserConstants
QueryParserTokenManager
Token    Describes the input token stream.
TokenMgrError
Lucene.Net.Search:
Classes:
Class    Description
AnonymousClassScoreDocComparator
AnonymousClassScoreDocComparator1
BooleanClause    A clause in a BooleanQuery.
BooleanQuery    A Query that matches documents matching boolean combinations of other queries, typically TermQuerys or PhraseQuerys.
BooleanQuery.TooManyClauses    Thrown when an attempt is made to add more than GetMaxClauseCount() clauses.
CachingWrapperFilter    Wraps another filter's result and caches it. The caching behavior is like QueryFilter. The purpose is to allow filters to simply filter, and then wrap with this class to add caching, keeping the two concerns decoupled yet composable.
DateFilter
DefaultSimilarity    Expert: Default scoring implementation.
Explanation    Expert: Describes the score computation for document and query.
FieldDoc
Filter    Abstract base class providing a mechanism to restrict searches to a subset of an index.
FilteredQuery
FilteredTermEnum
FuzzyQuery    Implements the fuzzy search query. The similarity measurement is based on the Levenshtein (edit distance) algorithm.
FuzzyTermEnum
HitCollector    Lower-level search API.
Hits    A ranked list of documents, used to hold search results.
IndexSearcher
MultiSearcher
MultiTermQuery
ParallelMultiSearcher
PhrasePrefixQuery    PhrasePrefixQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class to search for the phrase "Microsoft app*", first use add(Term) on the term "Microsoft", then find all terms that have "app" as a prefix using IndexReader.terms(Term), and use PhrasePrefixQuery.add(Term[] terms) to add them to the query.
PhraseQuery    A Query that matches documents containing a particular sequence of terms. This may be combined with other terms with a BooleanQuery.
PrefixQuery    A Query that matches documents containing terms with a specified prefix.
Query
QueryFilter
QueryTermVector
RangeQuery    A Query that matches documents within an exclusive range.
RemoteSearchable    A remote searchable implementation.
ScoreDoc    Expert: Returned by low-level search implementations.
Scorer    Expert: Implements scoring for a class of queries.
Searcher    An abstract base class for search implementations. Implements some common utility methods.
Similarity
Sort
SortComparator
SortField
StringIndex
TermQuery    A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.
TopDocs    Expert: Returned by low-level search implementations.
TopFieldDocs
WildcardQuery    Implements the wildcard search query. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note this query can be slow, as it needs to iterate over all terms. In order to prevent extremely slow WildcardQueries, a wildcard term must not start with one of the wildcards * or ?.
WildcardTermEnum
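Two of the query types above rest on small, well-known techniques: FuzzyQuery on Levenshtein edit distance and WildcardQuery on * and ? pattern matching. The following is a minimal sketch of both in plain Python. It is not how Lucene.Net implements them; the function names are hypothetical, and the wildcard matcher simply translates the pattern to a regular expression and tests every term, which also shows why such queries must iterate over all terms.

```python
import re

def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def wildcard_to_regex(pattern):
    """Translate * (any sequence, possibly empty) and ? (any single character)."""
    if pattern[:1] in ("*", "?"):
        # Mirrors the restriction stated above: no leading wildcard.
        raise ValueError("a wildcard term must not start with * or ?")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(pattern, terms):
    """Test every term against the pattern (a full scan of the term list)."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]

terms = ["proton", "photon", "pro", "neutron"]
print(levenshtein("proton", "photon"))
print(wildcard_match("pro*", terms))
print(wildcard_match("p?oton", terms))
```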
Lucene.Net.Search.Spans:
Classes:
Class    Description
SpanFirstQuery    Matches spans near the beginning of a Field.
SpanNearQuery    Matches spans which are near one another. One can specify slop, the maximum number of intervening unmatched positions, as well as whether matches are required to be in-order.
SpanNotQuery    Removes matches which overlap with another SpanQuery.
SpanOrQuery    Matches the union of its clauses.
SpanQuery    Base class for span-based queries.
SpanTermQuery    Matches spans containing a term.
Lucene.Net.Store:
Classes:
Class    Description
Directory
FSDirectory
FSInputStream
InputStream    Abstract base class for input from a file in a Directory. A random-access input stream. Used for all Lucene index input operations.
Lock
Lock.With    Utility class for executing code with exclusive access.
OutputStream    Abstract class for output to a file in a Directory. A random-access output stream. Used for all Lucene index output operations.
RAMDirectory    A memory-resident Directory implementation.
RAMOutputStream    A memory-resident OutputStream implementation.
Lucene.Net.Util:
Classes:
Class    Description
BitVector    Optimized implementation of a vector of bits. This is more or less like java.util.BitSet, but also includes the following: a count() method, which efficiently computes the number of one bits; optimized read from and write to disk; an inlinable get() method.
Constants    Some useful constants.
PriorityQueue    A PriorityQueue maintains a partial ordering of its elements such that the least element can always be found in constant time. Put() and pop() operations require log(size) time.
StringHelper    Methods for manipulating strings.
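The PriorityQueue behaviour described above (least element available in constant time, logarithmic insert and remove) is what a binary heap provides. The sketch below uses Python's standard heapq module, not the Lucene.Net class, to show one common use of such a queue in a search engine: keeping only the n best-scoring hits. The function name and the (score, document id) tuples are assumptions made for the example.

```python
import heapq

def top_n(scored_docs, n):
    """Return the n highest (score, doc_id) pairs, best first.

    The heap holds at most n entries and its least element sits at heap[0],
    so deciding whether a new hit belongs in the top n is a constant-time peek.
    """
    if n <= 0:
        return []
    heap = []
    for entry in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, entry)      # O(log n) insert
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)   # evict the current least, O(log n)
    return sorted(heap, reverse=True)

print(top_n([(0.2, 1), (0.9, 2), (0.5, 3), (0.7, 4)], 2))
```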
HOW DOES IT WORK?
Lucene is a high-performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of the indexing and searching implementation behind a simple-to-use API.
As said earlier, Lucene allows you to add indexing and searching capabilities to your applications. Lucene can index and make searchable any data that can be converted to a textual format. Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide.
At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.
Indexing:
Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or a phrase. How would you go about writing a program to do this? A naïve approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn't scale to larger file sets or cases where files are very large. This is where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.
You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. A Lucene index is a tool that allows quick word lookup.
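The difference between sequential scanning and an index can be made concrete with a toy inverted index. This is a minimal sketch in plain Python under simplifying assumptions (an in-memory dictionary, whitespace tokenization, hypothetical function names); Lucene's on-disk index format is far more elaborate, but the lookup idea is the same.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map every word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, word):
    """Searching the index: one dictionary access, independent of corpus size."""
    return sorted(index.get(word.lower(), set()))

def sequential_scan(docs, word):
    """The naive approach: read every document again for every query."""
    return sorted(d for d, text in docs.items() if word.lower() in text.lower().split())

docs = {
    1: "a proton carries positive charge",
    2: "a neutron carries no charge",
    3: "the proton and the neutron form a nucleus",
}
index = build_index(docs)
print(lookup(index, "proton"))
print(sequential_scan(docs, "proton"))
```

Both calls return the same documents, but the scan touches every document on every query, while the index pays that cost once, at indexing time.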
Searching:
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene's powerful software library offers a number of search features.
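Precision and recall have simple set-based definitions, sketched below in plain Python. The function name and the sample document ids are made up for illustration; precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant  # relevant documents that were actually found
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# The search returned documents 1-4, but only 2, 4 and 5 are truly relevant.
print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```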

Now that we understand the concepts of "Indexing" and "Searching", let's see Lucene in action.
First, to search the database we need to create an index on the database. To do this we use a specific class, namely "IndexWriter". The importance and functionality of this class have been discussed above under the Lucene.Net.Index namespace. The created index is stored at a particular path that is specified as one of the arguments to IndexWriter.
After creating the index we save it by calling the function Close().
Now that the index is created, we can perform the search operation on it. Searching in Lucene is as fast and simple as indexing. The search operation is controlled by two classes, namely "Searcher" and "IndexSearcher". We need to specify the path of the index (i.e., the path of the index created on the database) as the argument to IndexSearcher, so that it performs the search on that particular index.
The search operation is based on whatever keyword or query we give as the input. The given query, which is human readable, is first parsed into Lucene's Query class. This is done by the class "QueryParser".
Searching the index returns the output, i.e., the hits, in the form of a "Hits" object. Note that the Hits object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index in a lazy fashion, only when requested with a call to hits.doc(int).
Finally, the hits or matches for our given query are displayed and the search is complete.
These are the basic steps followed in a simple Lucene search operation. The flowchart below shows them in sequence.
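The steps described above (index a database table, parse the keyword, search the index, load hits lazily) can be sketched end to end. This is a stand-in written in plain Python with the standard sqlite3 module, not Lucene.Net code: the table, its rows, and every function and class name are hypothetical, and the index is a simple in-memory dictionary rather than index files in a directory.

```python
import sqlite3
from collections import defaultdict

def create_index(conn, table, column):
    """Step 1: build an inverted index over one text column of a table."""
    index = defaultdict(set)
    # table and column are trusted constants here, never user input
    for rowid, text in conn.execute(f"SELECT rowid, {column} FROM {table}"):
        for word in (text or "").lower().split():
            index[word].add(rowid)
    return index

def parse_query(keyword):
    """Step 2: turn the human-readable keyword into a normalized term."""
    return keyword.strip().lower()

class Hits:
    """Step 3: holds only row ids; a row is fetched when doc(i) is called."""
    def __init__(self, conn, table, rowids):
        self._conn, self._table, self._rowids = conn, table, sorted(rowids)

    def length(self):
        return len(self._rowids)

    def doc(self, i):
        cur = self._conn.execute(
            f"SELECT * FROM {self._table} WHERE rowid = ?", (self._rowids[i],))
        return cur.fetchone()

def search(conn, table, index, keyword):
    return Hits(conn, table, index.get(parse_query(keyword), set()))

# Hypothetical sample data, echoing the PROTON example from the overview.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE particles (name TEXT, note TEXT)")
conn.executemany("INSERT INTO particles VALUES (?, ?)", [
    ("PROTON", "positive charge"),
    ("NEUTRON", "no charge"),
    ("PROTON BEAM", "accelerated protons"),
])
index = create_index(conn, "particles", "name")
hits = search(conn, "particles", index, "PROTON")
for i in range(hits.length()):
    print(hits.doc(i))  # Step 4: display the matches
```

Keeping only row ids in the Hits object is the design point worth noticing: a query with many matches costs little until the caller actually asks for a particular document.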
FLOWCHART FOR LUCENE.NET DATABASE SEARCH
START
AN ENGLISH QUERY IS PASSED AS A KEYWORD (INPUT: EXISTING DATABASE)
CREATES AN INDEX ON THE PARTICULAR TABLE TO WHICH THE GIVEN QUERY BELONGS AND STORES THE INDEXES IN THE LOCAL DIRECTORY
SEARCH OPERATION IS CALLED ON THAT PARTICULAR INDEX
HITS FOUND? (YES / NO)
YES: MATCHES FOR THE GIVEN KEYWORD ARE DISPLAYED
SEARCH IS PERFORMED

Who uses Lucene?
Who doesn't? A number of large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's Open Courseware and DSpace, Akamai's Edge Computing platform, and so on. Your name will be on this list soon, too. Some of the other applications are:
Google Desktop has made a splash by bringing this functionality to end users. Now you have the power to bring the same indexing and searching capabilities into your applications using Lucene.Net, a high-performance, scalable search engine library written in the C# language and utilizing the .NET Framework.
It can be used for any web application which needs a search portal.
It can also be used for searching a local database, and can be handy for local computer search.
