Introduction:
Lucene is a free, open-source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages, including Delphi, Perl, C#, C++, Python, Ruby and PHP.

Lucene takes full-text search a step further. While suitable for any application that requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene itself is just an indexing and search library; it does not contain crawling or HTML parsing functionality.

In the present application we use Lucene on the .NET Framework, from C#, to perform indexing and searching on a database.

Overview:
In the present application we perform search on a local database using Lucene. An English query is passed as the input, or keyword, for the search. Lucene takes the keyword as its input, creates an index on the particular table, and searches the index for matches. Consider the following example:

INPUT: PROTON
OUTPUT: DISPLAYS ALL MATCHES FOR PROTON

In the above example, "PROTON" is an element in a particular table residing in the database. Lucene takes PROTON as the keyword and displays all the elements in the table that contain PROTON. It sounds easy, doesn't it? To provide this functionality, however, Lucene relies on a number of classes and functions. Let's discuss them in detail.

LUCENE DOTNET ARCHITECTURE
As discussed earlier, Lucene.Net is an open-source search library that takes the implementation of indexing and searching a step further. To do this, Lucene.Net defines a number of classes. The Lucene.Net namespace has 11 other namespaces defined in it.
The ones described here are as follows:

Namespace: Description
Lucene.Net.Analysis: Used to analyze the given keyword or input, build tokens from it, and filter them before searching for matches.
Lucene.Net.Analysis.Standard
Lucene.Net.Documents: Used to define the given document (for example: name, field, etc.).
Lucene.Net.Index: The key to the Lucene search engine. Contains the classes that create the index on a document.
Lucene.Net.QueryParsers: Used to parse the given query, which is simply the given keyword.
Lucene.Net.Search: Also key to Lucene. Contains the classes that perform the search operation.
Lucene.Net.Search.Spans
Lucene.Net.Store: Contains classes that store the created indexes in a specified directory.
Lucene.Net.Util: Contains classes for manipulating the given string, giving priorities to the keyword, etc.

Let's discuss the namespaces in more detail.

Lucene.Net.Analysis Namespace:
Classes (class: description):
Analyzer: An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer. WARNING: You must override one of the methods defined by this class in your subclass or the Analyzer will enter an infinite loop.
CharTokenizer: An abstract base class for simple, character-oriented tokenizers.
LetterTokenizer: A tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
LowerCaseFilter: Normalizes token text to lower case.
LowerCaseTokenizer
PerFieldAnalyzerWrapper: This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use addAnalyzer to add a non-default analyzer on a Field name basis. See TestPerFieldAnalyzerWrapper.java for example usage.
PorterStemFilter
SimpleAnalyzer: An Analyzer that filters LetterTokenizer with LowerCaseFilter.
StopAnalyzer: Filters LetterTokenizer with LowerCaseFilter and StopFilter.
StopFilter: Removes stop words from a token stream.
Token: A Token is an occurrence of a term from the text of a Field. It consists of a term's text, the start and end offset of the term in the text of the Field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display. The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".
TokenFilter
Tokenizer
TokenStream
WhitespaceAnalyzer: An Analyzer that uses WhitespaceTokenizer.
WhitespaceTokenizer: A tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.

Lucene.Net.Analysis.Standard:
Classes (class: description):
FastCharStream
ParseException: This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
StandardAnalyzer: Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter.
StandardFilter: Normalizes tokens extracted with StandardTokenizer.
StandardTokenizer
StandardTokenizerConstants
StandardTokenizerTokenManager
Token: Describes the input token stream.
TokenMgrError

Lucene.Net.Documents:
Classes (class: description):
DateField
Document
Field: A Field is a section of a Document. Each Field has two parts, a name and a value. Values may be free text, provided as a String or as a Reader, or they may be atomic keywords, which are not further processed. Such keywords may be used to represent dates, URLs, etc. Fields are optionally stored in the index, so that they may be returned with hits on the document.

Lucene.Net.Index:
Classes (class: description):
CompoundFileReader: Class for accessing a compound stream. This class implements a directory, but is limited to read operations only. Directory methods that would normally modify data throw an exception.
CompoundFileReader.CSInputStream: Implementation of an InputStream that reads from a portion of the compound file. The visibility is left as "package" only because this helps with testing, since JUnit test cases in a different class can then access package fields of this class.
CompoundFileWriter
DocumentWriter
FieldInfo
FieldInfos: Access to the Field Info file that describes document fields and whether or not they are indexed. Each segment has a separate Field Info file. Objects of this class are thread-safe for multiple readers, but only one thread can be adding documents at a time, with no other reader or writer threads accessing this object.
FieldsReader: Class responsible for access to stored document fields. It uses the <segment>.fdt and <segment>.fdx files.
FilterIndexReader: A FilterIndexReader contains another IndexReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality. The class FilterIndexReader itself simply implements all abstract methods of IndexReader with versions that pass all requests to the contained index reader.
Subclasses of FilterIndexReader may further override some of these methods and may also provide additional methods and fields.
FilterIndexReader.FilterTermDocs: Base class for filtering TermDocs implementations.
FilterIndexReader.FilterTermEnum: Base class for filtering TermEnum implementations.
FilterIndexReader.FilterTermPositions: Base class for filtering TermPositions implementations.
IndexReader: Reads the index.
IndexWriter: An IndexWriter creates and maintains an index. The third argument to the constructor determines whether a new index is created, or whether an existing index is opened for the addition of new documents. In either case, documents are added with the addDocument method. When finished adding documents, close should be called. If an index will not have more documents added for a while and optimal search performance is desired, then the optimize method should be called before the index is closed.
MultipleTermPositions
MultiReader: An IndexReader which reads multiple indexes, appending their content.
SegmentInfo
SegmentInfos
SegmentMerger
SegmentReader
SegmentTermDocs
SegmentTermEnum
Term: A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the Field that the text occurred in, an interned string. Note that terms may represent more than words from text fields; they may also represent things like dates, email addresses, URLs, etc.
TermEnum
TermInfo: A TermInfo is the record of information stored for a term.
TermInfosReader
TermInfosWriter
TermVectorsReader
TermVectorsWriter: The writer works by opening a document, then opening the fields within the document, and then writing out the vectors for each Field.

IndexWriter plays a major role in creating the index. Its members are listed below.

Public Static Fields:
COMMIT_LOCK_NAME
COMMIT_LOCK_TIMEOUT: Default value is 10000.
Use the Lucene.Net.commitLockTimeout system property to override.
DEFAULT_MAX_FIELD_LENGTH: Default value is 10000. Use the Lucene.Net.maxFieldLength system property to override.
DEFAULT_MAX_MERGE_DOCS: Default value is Integer.MAX_VALUE. Use the Lucene.Net.maxMergeDocs system property to override.
DEFAULT_MERGE_FACTOR: Default value is 10. Use the Lucene.Net.mergeFactor system property to override.
DEFAULT_MIN_MERGE_DOCS: Default value is 10. Use the Lucene.Net.minMergeDocs system property to override.
WRITE_LOCK_NAME
WRITE_LOCK_TIMEOUT: Default value is 1000. Use the Lucene.Net.writeLockTimeout system property to override.

Public Instance Constructors:
IndexWriter: Overloaded. Initializes a new instance of the IndexWriter class.

Public Instance Fields:
infoStream: If non-null, information about merges will be printed to this.
maxFieldLength: The maximum number of terms that will be indexed for a single Field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a Field.
maxMergeDocs
mergeFactor
minMergeDocs
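
The truncation effect of maxFieldLength described above can be illustrated with a small language-neutral sketch. This is not Lucene's implementation: the function name index_terms and the whitespace tokenization are assumptions made only for illustration.

```python
def index_terms(text, max_field_length=10000):
    """Return the terms that would be indexed for one field.

    Hypothetical helper: only the first max_field_length terms are kept,
    so anything later in the document never reaches the index.
    """
    terms = text.lower().split()  # assumption: plain whitespace tokenization
    return terms[:max_field_length]


body = "alpha beta gamma delta epsilon"
indexed = index_terms(body, max_field_length=3)
print(indexed)             # ['alpha', 'beta', 'gamma']
print("delta" in indexed)  # False: the term lies past the limit
```

A search for "delta" would therefore find nothing in this field, which is why the limit should be raised when the source documents are known to be large.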
Public Instance Methods:
AddDocument: Overloaded. Adds a document to this index, using the provided analyzer instead of the value of GetAnalyzer(). If the document contains more than maxFieldLength terms for a given Field, the remainder are discarded.
AddIndexes: Overloaded. Merges the provided indexes into this index. After this completes, the index is optimized. The provided IndexReaders are not closed.
Close: Flushes all changes to an index and closes all associated files.
DocCount: Returns the number of documents currently in this index.
Equals (inherited from Object): Determines whether the specified Object is equal to the current Object.
GetAnalyzer: Returns the analyzer used by this index.
GetHashCode (inherited from Object): Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures like a hash table.
GetSimilarity
GetType (inherited from Object): Gets the Type of the current instance.
GetUseCompoundFile: Gets the setting that turns on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
Optimize: Merges all segments together into a single segment, optimizing an index for search.
SetSimilarity: Expert: Sets the Similarity implementation used by this IndexWriter.
SetUseCompoundFile: Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use.
ToString (inherited from Object): Returns a String that represents the current Object.

Protected Instance Methods:
Finalize: Releases the write lock, if needed.
MemberwiseClone (inherited from Object): Creates a shallow copy of the current Object.

Lucene.Net.QueryParsers:
Classes (class: description):
FastCharStream
MultiFieldQueryParser: A QueryParser which constructs queries to search multiple fields.
ParseException: This exception is thrown when parse errors are encountered. You can explicitly create objects of this exception type by calling the method generateParseException in the generated parser. You can modify this class to customize your error reporting mechanisms so long as you retain the public fields.
QueryParser
QueryParserConstants
QueryParserTokenManager
Token: Describes the input token stream.
TokenMgrError

Lucene.Net.Search:
Classes (class: description):
AnonymousClassScoreDocComparator
AnonymousClassScoreDocComparator1
BooleanClause: A clause in a BooleanQuery.
BooleanQuery: A Query that matches documents matching boolean combinations of other queries, typically TermQuery or PhraseQuery instances.
BooleanQuery.TooManyClauses: Thrown when an attempt is made to add more than GetMaxClauseCount() clauses.
CachingWrapperFilter: Wraps another filter's result and caches it. The caching behavior is like that of QueryFilter. The purpose is to allow filters to simply filter, and then be wrapped with this class to add caching, keeping the two concerns decoupled yet composable.
DateFilter
DefaultSimilarity: Expert: Default scoring implementation.
Explanation: Expert: Describes the score computation for a document and query.
FieldDoc
Filter: Abstract base class providing a mechanism to restrict searches to a subset of an index.
FilteredQuery
FilteredTermEnum
FuzzyQuery: Implements the fuzzy search query. The similarity measurement is based on the Levenshtein (edit distance) algorithm.
FuzzyTermEnum
HitCollector: Lower-level search API.
Hits: A ranked list of documents, used to hold search results.
IndexSearcher
MultiSearcher
MultiTermQuery
ParallelMultiSearcher
PhrasePrefixQuery: A generalized version of PhraseQuery, with an added method Add(Term[]).
To use this class to search for the phrase "Microsoft app*", first use add(Term) on the term "Microsoft", then find all terms that have "app" as a prefix using IndexReader.terms(Term), and use PhrasePrefixQuery.add(Term[] terms) to add them to the query.
PhraseQuery: A Query that matches documents containing a particular sequence of terms. This may be combined with other terms with a BooleanQuery.
PrefixQuery: A Query that matches documents containing terms with a specified prefix.
Query
QueryFilter
QueryTermVector
RangeQuery: A Query that matches documents within an exclusive range.
RemoteSearchable: A remote searchable implementation.
ScoreDoc: Expert: Returned by low-level search implementations.
Scorer: Expert: Implements scoring for a class of queries.
Searcher: An abstract base class for search implementations. Implements some common utility methods.
Similarity
Sort
SortComparator
SortField
StringIndex
TermQuery: A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.
TopDocs: Expert: Returned by low-level search implementations.
TopFieldDocs
WildcardQuery: Implements the wildcard search query. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note that this query can be slow, as it needs to iterate over all terms. In order to prevent extremely slow WildcardQueries, a wildcard term must not start with one of the wildcards * or ?.
WildcardTermEnum

Lucene.Net.Search.Spans:
Classes (class: description):
SpanFirstQuery: Matches spans near the beginning of a Field.
SpanNearQuery: Matches spans which are near one another. One can specify slop, the maximum number of intervening unmatched positions, as well as whether matches are required to be in order.
SpanNotQuery: Removes matches which overlap with another SpanQuery.
SpanOrQuery: Matches the union of its clauses.
SpanQuery: Base class for span-based queries.
SpanTermQuery: Matches spans containing a term.

Lucene.Net.Store:
Classes (class: description):
Directory
FSDirectory
FSInputStream
InputStream: Abstract base class for input from a file in a Directory. A random-access input stream. Used for all Lucene index input operations.
Lock
Lock.With: Utility class for executing code with exclusive access.
OutputStream: Abstract class for output to a file in a Directory. A random-access output stream. Used for all Lucene index output operations.
RAMDirectory: A memory-resident Directory implementation.
RAMOutputStream: A memory-resident OutputStream implementation.

Lucene.Net.Util:
Classes (class: description):
BitVector: Optimized implementation of a vector of bits. This is more or less like java.util.BitSet, but also includes the following: a count() method, which efficiently computes the number of one bits; optimized read from and write to disk; and an inlinable get() method.
Constants: Some useful constants.
PriorityQueue: A PriorityQueue maintains a partial ordering of its elements such that the least element can always be found in constant time. Calls to put() and pop() require log(size) time.
StringHelper: Methods for manipulating strings.

HOW DOES IT WORK?
Lucene is a high-performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of the indexing and searching implementation behind a simple-to-use API.
As said earlier, Lucene allows you to add indexing and searching capabilities to your applications. Lucene can index and make searchable any data that can be converted to a textual format. Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide.

At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.

Indexing:
Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or phrase. How would you go about writing a program to do this? A naïve approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn't scale to larger file sets or to cases where files are very large. This is where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index. You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics.
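
The difference between sequential scanning and an index lookup can be sketched in a few lines. This is a minimal, language-neutral illustration of an inverted index, not Lucene's on-disk format or API; the function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase runs of letters."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def build_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def sequential_scan(documents, word):
    """The naive approach: re-read every document on every query."""
    return sorted(d for d, text in documents.items() if word.lower() in tokenize(text))


# Hypothetical sample data.
docs = {
    1: "The proton is a subatomic particle",
    2: "An electron orbits the nucleus",
    3: "Proton and neutron form the nucleus",
}
index = build_index(docs)
print(index["proton"])   # [1, 3]
print(index["nucleus"])  # [2, 3]
```

Both approaches return the same documents, but the index is built once and each later query is a single dictionary lookup, whereas the scan touches every document every time.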
In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. A Lucene index is a tool that allows quick word lookup.

Searching:
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting is also important, as is a friendly syntax for entering those queries. Lucene's powerful software library offers a number of search features.
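
Precision and recall, as described above, can be computed from two sets: the documents a search returned and the documents that are actually relevant. The sketch below uses the standard definitions; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three documents truly relevant.
retrieved = [1, 3, 4, 7]
relevant = [1, 3, 5]
p, r = precision_recall(retrieved, relevant)
print(p)  # 0.5: two of the four returned documents are relevant
print(r)  # two of the three relevant documents were found
```

A system can raise recall by returning more documents, usually at the cost of precision, which is why the two metrics are reported together.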
Now that we understand the concepts of "Indexing" and "Searching", let's see Lucene in action.

To search the database, we first need to create an index on it. This is done with a specific class, "IndexWriter", whose importance and functionality were discussed earlier under Lucene.Net.Index. The created index is stored at a path that is specified as one of the arguments to IndexWriter. After creating the index, we save it by calling the function Close().

Now that the index is created, we can perform search operations on it. Searching in Lucene is as fast and simple as indexing. The search operation is controlled by 2 classes, "Searcher" and "IndexSearcher". We specify the path of the index (i.e., the path of the index created on the database) as the argument to IndexSearcher, so that it performs the search on that particular index. The search operation is based on whatever keyword or query we give as the input. The given query, which is human readable, is first parsed into Lucene's Query class. This is done by the class "QueryParser". Searching the index returns the output, i.e., the hits, in the form of a "Hits" object. Note that the Hits object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index in a lazy fashion, only when requested with a call to hits.doc(int). Finally, the hits or matches for the given query are displayed and the search is complete. These are the basic steps followed in a simple Lucene search operation. The flowchart below explains them in more detail.

FLOWCHART FOR LUCENE .NET DATABASE SEARCH
1. START
2. An English query is passed as a keyword.
3. Existing database: an index is created on the particular table to which the given query belongs, and the indexes are stored in the local directory.
4. The search operation is called on that particular index.
5. Decision: hits found? (YES / NO)
6. Matches of the given keyword are displayed.
7. Search is performed.

Who uses Lucene?
Who doesn't? A number of large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's OpenCourseWare and DSpace, Akamai's EdgeComputing platform, and so on. Your name will be on this list soon, too.

Some of the other applications are:
Google Desktop has made a splash by bringing this functionality to end users. Now you have the power to bring the same indexing and searching capabilities into your applications using Lucene.Net, a high-performance, scalable search engine library written in the C# language and utilizing the .NET Framework.
It can be used for any web application that needs a search portal.
It can also be used for searching a local database, and can be handy for local computer search.
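
The workflow in the flowchart (index a table, run a keyword search, display the hits) can be sketched end to end. This is a language-neutral illustration, not the Lucene.Net API: the classes TinyIndex and Hits, their methods, and the sample table are all hypothetical. The Hits class mirrors the lazy behavior described earlier by holding only row ids until doc(i) is called.

```python
import re
from collections import defaultdict


class Hits:
    """Holds only row ids; a row is fetched from the store when doc(i) is called."""

    def __init__(self, row_ids, store):
        self._row_ids = row_ids
        self._store = store

    def length(self):
        return len(self._row_ids)

    def doc(self, i):
        return self._store[self._row_ids[i]]


class TinyIndex:
    """Hypothetical in-memory index over one text field of a table."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> row ids, in insertion order
        self._store = {}                    # row id -> stored row

    def add_document(self, row_id, row, field):
        self._store[row_id] = row
        for term in set(re.findall(r"[a-z0-9]+", row[field].lower())):
            self._postings[term].append(row_id)

    def search(self, keyword):
        return Hits(list(self._postings.get(keyword.lower(), [])), self._store)


# Hypothetical table standing in for the database.
table = [
    {"id": 1, "name": "PROTON"},
    {"id": 2, "name": "ELECTRON"},
    {"id": 3, "name": "PROTON BEAM"},
]

index = TinyIndex()
for row in table:                      # create the index on the table
    index.add_document(row["id"], row, field="name")

hits = index.search("PROTON")          # run the keyword search
if hits.length() > 0:                  # hits found?
    for i in range(hits.length()):
        print(hits.doc(i)["name"])     # rows are loaded only here
else:
    print("no matches")
```

With the sample table, the search for PROTON prints the two rows whose name contains that keyword.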