
Efficient Information Retrieval using Lucene, LIndex and HIndex in Hadoop
Anita Brigit Mathew
Department of CSE
NIT Calicut
anita.brigit@rediffmail.com

Priyabrat Pattnaik
Department of CSE
NIT Calicut
priyabrat.nitc12@gmail.com

S. D. Madhu Kumar
Department of CSE
NIT Calicut
madhu@nitc.ac.in

Abstract
The growth of unstructured and partially structured data in biological networks, social media, geographical information and other web-based applications presents an open challenge to the cloud database community. Hence, approaches to exhaustive BigData analysis that integrate structured and unstructured data processing have become increasingly critical in today's world. MapReduce has recently emerged as a popular framework for extensive data analytics, and the use of powerful indexing techniques allows users to significantly speed up query processing in MapReduce jobs. There are currently a number of indexing techniques such as Hadoop++, HAIL, LIAH and adaptive indexing, but none of them provides an optimized technique for text-based selection operations. This paper proposes two indexing approaches in HDFS, namely LIndex and HIndex, which are found to perform selection operations better than the existing Lucene index approach. A fast retrieval technique is suggested in the MapReduce framework with the new LIndex and HIndex approaches. LIndex provides a complete-text index and informs the Hadoop execution engine to scan only those data blocks which contain the terms of interest. LIndex also enhances throughput (minimizes response time) and overcomes some drawbacks such as the upfront cost and long idle time for index creation. LIndex gives better performance than Lucene but still falls short in response and computation time; hence a new index named HIndex is suggested, which is found to perform better than LIndex in both response and computation time.
Keywords: Hadoop, MapReduce, Complete-text indexing, Lucene, LIndex, HIndex.

I Introduction

Current BigData systems rely on underlying distributed file systems to manage data; Google's GFS [5] and Hadoop's HDFS [11] are examples. Hadoop's distributed file system facilitates fast data transfer rates among nodes and allows the system to continue operating in case of node failures. The Hadoop framework is used by major companies like Yahoo, Facebook, IBM, etc. Hadoop provides the application programming interfaces (APIs) necessary to distribute and track workloads as they run on large numbers of machines, and offers an easy and fast platform to store and analyze huge amounts of data. The challenges here lie in how to partition large data sets and how to distribute and place the data among the nodes in Hadoop. Making these nodes collaborate on the data for a specific job in Hadoop is another challenge. For a large number of datasets it is very difficult to keep track of the replications, and for any update or retrieval operation the execution engine has to search the whole system, which degrades the overall performance of Hadoop.
Current BigData systems perform a parallel scan of the whole data set to simplify the implementation of simple query processing; given enough processing nodes, this provides acceptable performance. For example, searching 1 terabyte of data on a single node takes approximately 10-15 minutes, whereas searching the same data in parallel on 10 datanodes takes approximately 2 to 3 minutes. Instead of scanning everything, a more efficient data access method is therefore required. To access data effectively, indexing techniques need to be implemented in the HDFS layer through which only the data of interest is accessed, skipping the unnecessary data and thereby obtaining better performance. The motivation behind this work is to investigate the importance of full-text processing capabilities in large-data analysis.

This is illustrated through implementing effective text-based indexing in the Hadoop ecosystem. The rest of this paper is organized as follows. We provide a comparison between Hadoop and parallel databases along with a survey on text-based indexing in Section II; the proposed approach is described in Section III. The experimental evaluation of the approach is given in Section IV. Finally, Section V concludes the paper with a note on probable future extensions.

II Existing Indexing Techniques
In today's world the growth of partially structured and unstructured data outruns the growth of relational data. As a result, approaches to large-data investigation that integrate structured and unstructured data processing are becoming critical; hence the optimization of textual operations such as search queries, item descriptions, email information and other user-generated content needs to be looked at. Hadoop lacks many of the improvements seen in relational databases and therefore falls short in performance on analytical tasks carried out on BigData. Andrew Pavlo et al. [9] have compared the MapReduce framework with traditional parallel RDBMSs in terms of performance and development complexity. Their experiments show that traditional parallel RDBMSs held a performance advantage over Hadoop MapReduce [8] in executing a variety of data-intensive benchmarks, including tasks such as data selection, data loading, aggregation and joins. Erik Paulson et al. [12], in a similar theme, argued that using MapReduce systems to perform operations that are best suited for RDBMSs does not yield good results, concluding that MapReduce is more like an extract-transform system than a traditional RDBMS, as it quickly loads and processes large amounts of data in an ad-hoc manner. Hence, after their experiments, they concluded that MapReduce complements DBMSs, since databases are not designed for extract-transform tasks. The biggest issue arises when creating an indexing framework in the Hadoop ecosystem dealing with key-value pairs. In an RDBMS the data are structured, so indexing based on some attributes is easy to implement, whereas in HDFS the data are not structured and no concept of a candidate key is present. Still, researchers have come up with some indexing techniques in Hadoop through which performance can be improved.
All these indexing approaches have three main drawbacks:
a) They require a high upfront cost and long idle times for index creation.
b) They can support only one physical sort order, that is, one clustered index per dataset.
c) They require users to have good knowledge of the workload in order to choose the indices to create.
Sai Wu and Kun-Lung Wu [13] propose an indexing framework for the BigData cloud based on a structured overlay. This framework reduces the amount of data transferred inside the BigData cloud and smooths the development of database back-end applications. To demonstrate the effectiveness of the framework, a commonly used index structure, the B+ tree, is employed. Each processing node builds its local index to speed up data access.
Disadvantages of this indexing framework are:
a) The network cost dominates the index lookup cost.
b) Concurrent access cannot be avoided.
Manimal [3] proposes a new framework for applying relational optimizations to MapReduce programs. It uses static code analysis to examine MapReduce program code and thereby enables automatic optimization of MapReduce programs. Manimal also uses a map function similar to the Dean and Ghemawat [4] MapReduce program described in their paper. Pavlo et al. [9], in their comparison of MapReduce and RDBMS performance, injected a User Defined Function into the existing framework which encodes a simple grep word-count program that counts the number of lines in a text file matching the regular expression given by a pattern.
In [7], Jimmy Lin et al. address one inefficient aspect of Hadoop-based processing: the need to perform a complete scan of the full dataset, even in cases where only a small portion of the data is required. To overcome this disadvantage a full-text index is proposed which informs the Hadoop execution engine which compressed data blocks contain the query terms of interest; these blocks are referenced by byte offset position, and only those data blocks are decompressed and scanned.
The advantage of this approach is that it speeds up the selection procedure and hence enhances performance.
Disadvantages of this framework are:
a) The indices are stored in a tabular format, so the access time is much higher than it could be.
b) Input splits are computed from the main node, so the framework fails to address features like fault tolerance and scalability.


In the related work of Hadoop++ [5], Jens Dittrich et al. describe indexing and joining techniques known as Trojan Index and Trojan Join. Hadoop++ injects Trojan indexes into Hadoop input splits at data loading time. These indices make it possible to execute relational operations efficiently, but they are not designed for full-text processing.
Advantages of these techniques are:
a) They are non-invasive, that is, they do not change the existing framework.
b) They provide optional index access paths which can be used for selective MapReduce jobs.
c) They outperform the plain Hadoop framework on index and join operations.
The disadvantage is that they are not designed for full-text processing.
The best approach for combining complete-text search capabilities with the Hadoop ecosystem remains an open question, and hence there is a need for an effective indexing technique which enhances the overall performance of the Hadoop ecosystem. An effective index framework in HDFS and MapReduce systems can be developed, without changing the underlying Hadoop framework, with the help of Lucene index mechanisms [6]. We therefore examine how a Lucene index helps the Hadoop ecosystem retrieve data, what its performance is, and how that performance can be improved with a modified Lucene index called LIndex. In order to obtain better data retrieval response and computation time, a HashMap-based index is also suggested.

III Proposed LIndex

The idea behind the proposed approach is to create a full-text indexing structure on the Hadoop MapReduce framework for fast retrieval of data. The Lucene [1] index was considered for creating a full-text search index. Lucene, written in Java, is a scalable, high-performing, powerful and full-featured text search engine. Even though Lucene creates a full-text search index, each time a file is updated the index structure needs to be rebuilt for the entire file. This is because when the file is opened using the SetOpenMode function for update and insert operations, the IndexWriter function has to create the index for the entire file and the IndexDoc function has to create the index for each document again. This lets one retrieve the data, its location and the time of its creation, but it deteriorates performance in index creation time and data retrieval time. Lucene also uses a standard analyzer to analyze the type of input data. To overcome these shortcomings of Lucene, a modified Lucene called LIndex is suggested. We found in our experiments that LIndex performs better than the Lucene index in data retrieval.
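For concreteness, the following is a minimal sketch of the baseline Lucene update path just described, written against the standard Lucene API; the field names, paths and class name are illustrative assumptions rather than the paper's code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import java.io.FileReader;
import java.nio.file.Paths;

public class LuceneBaselineIndexer {
    public static void main(String[] args) throws Exception {
        // args[0] = index directory, args[1] = text file to (re-)index.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // SetOpenMode: reopen an existing index so updates can be appended.
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(args[0])), config);

        // Every update re-adds the whole document, which is the cost the paper notes.
        Document doc = new Document();
        doc.add(new StringField("path", args[1], Field.Store.YES));
        doc.add(new TextField("contents", new FileReader(args[1])));
        writer.updateDocument(new Term("path", args[1]), doc);

        writer.close();
    }
}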

LIndex-Algorithm

Algorithm 1 describes the proposed LIndex algorithm, which is a modification of the Lucene index. In this algorithm the input is first checked by the jobtracker. The jobtracker checks whether it has to insert or search for a given file. If the choice is to insert, it conveys the message to the LIndex jar file. This jar checks the file type, which should be of .txt format; the function check_file_validity(filename) does the file-format checking. The file size is obtained and the file path is set. The CreateIndex function creates the index and automatically records the time of data arrival. The SetOpenMode function is called to open the file in update mode if it is to be modified; otherwise it opens the file for reading or writing. SetOpenMode also calls the create_index_writer and tot_doc_index functions to index the file and the contents of the file. The contents of the file are indexed with the help of the Mapper interface, which generates key-value pairs for each data item inside the file and hands the created index to the tot_doc_index function. This function helps to sort the data based on key-value pairs and combines the values of common keys. Using tf-idf, we filter out the stop words and write the result to the LIndex-writer function. In the filtering process we inject a User Defined Function after the merge phase whose sole responsibility is to filter out insignificant words and write to the LIndex-writer function. If the user's choice is to search, the jobtracker conveys the message to the LIndex jar file together with the input query for retrieval. The jar file then compares each datanode's individual index with the query input.
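The following is a hedged sketch, not the authors' implementation, of how the Mapper interface could emit one key-value pair per term and how the reducer could combine the values of common keys; the fixed stop-word list is purely illustrative, whereas the paper filters insignificant words with tf-idf after the merge phase.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LIndexSketch {
    // Illustrative stop-word list; the paper eliminates such terms via tf-idf weighting.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "an"));

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // The file currently being indexed, taken from the input split.
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            for (String word : line.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty() && !STOP.contains(word)) {
                    ctx.write(new Text(word), new Text(file)); // (term, file) pair
                }
            }
        }
    }

    public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> files, Context ctx)
                throws IOException, InterruptedException {
            // Combine the values of a common key into one posting entry.
            Set<String> uniq = new HashSet<>();
            for (Text f : files) uniq.add(f.toString());
            ctx.write(term, new Text(String.join(",", uniq)));
        }
    }
}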
The most popular indexing technique used for these kinds of operations is inverted indexing [2], but the inverted index has drawbacks: as the dataset becomes larger, the postings lists grow longer and at some point the reducers run out of memory. To avoid such shortcomings we eliminate the unnecessary text in the filtering process. Step 24 of the algorithm eliminates unnecessary words from the list. This can be performed by two approaches.


Algorithm 1: Proposed Algorithm - LIndex


1. if (case == "Insert") then
2. if (file_type == ".txt") then
3. Call check_file_validity(filename)
4. Obtain file_length and Assign file_path
5. Call CreateIndex()
6. Set curr_t ← SystemCurrentTime()
7. if SetOpenMode = true
8. Call fopen in update mode
9. Call create_index_writer(filename)
10. if index_directory ≠ 0
11. then Append create_index_writer(filename)
12. else Create index_directory
13. else
14. Call tot_doc_index(filename)
15. Interface Mapper ← tot_doc_index(filename)
16. key ← value
17. OutputCollector ← Mapper
18. while (OutputCollector != null)
19. while ≠ EOF
20. if key[i] = key[i+1] then
21. Assign value ← key[i]
22. Assign key[i] ← key[i+1]
23. Call tot_doc_index(filename)
24. Filter stop words using tf-idf
25. LIndex-writer ← Output
26. else print "check text format"
27. goto step 1
28. else if (case == "Search")
29. then LIndex_Searcher ← query

The first approach is to eliminate stop words such as "the", "a", "an", etc. The second approach is tf-idf. tf-idf is short for term frequency-inverse document frequency, a numerical statistic that reflects how important a word is to a document in a collection. It is often used as a weighting factor in information retrieval and text data mining.

tf-idf = tf × idf, where idf = log(N / D:N),

N is the total number of documents and D:N is the number of documents in which the term is found. tf-idf is the product of two statistics, term frequency and inverse document frequency. For the term frequency tf(t, d) it is sufficient to use the raw frequency of a term in a document, that is, the number of times the term t occurs in document d. The inverse document frequency measures whether the term is common or rare among all documents. It is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.
tf-idf is used to calculate a weight for each term in the document such that the weight follows these rules:
1. The weight is highest when the term occurs many times within a small number of documents.
2. The weight is lower when the term occurs fewer times in a document, or occurs in many documents.
3. The weight is lowest when the term occurs in virtually all documents.
After obtaining the weights we can eliminate the terms with lower weight, and the result is then sent to the LIndex-writer.
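As a small self-contained illustration of this weighting (the corpus representation and method names are assumptions, not the paper's code), terms whose weight falls below a chosen threshold would then be dropped before writing to the LIndex-writer.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
    // Computes tf-idf weights for one document's terms against a small in-memory corpus.
    public static Map<String, Double> weights(List<String> docTerms, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTerms) tf.merge(t, 1, Integer::sum); // raw term frequency tf(t, d)

        Map<String, Double> w = new HashMap<>();
        int n = corpus.size(); // N: total number of documents
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // D:N: number of documents in which the term appears (at least this one).
            long dn = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log((double) n / dn); // idf = log(N / D:N)
            w.put(e.getKey(), e.getValue() * idf);  // tf-idf = tf * idf
        }
        return w;
    }
}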


Fig. 1 Graphical illustration of the phases of the LIndex approach.

The next time users perform any text-based search operation, the search engine will first look into the index structure using LIndex_Searcher and then collect information regarding the query. Based on the output from LIndex_Searcher, the search engine directs the jobtracker to return the result to the user. A diagrammatic view of the proposed LIndex algorithm is given in Figure 1. Figure 1 shows how the three input files, namely File1, File2 and File3 in .txt format, are passed through the CreateIndex function to the mapper interface, sort-merged, and sent to the reducer, which in turn writes the output in the reduce phase of Hadoop.

Fig. 2 Graphical illustration of information retrieval.

The indexing is done first on the basis of blocks and then on the documents within a block. That is, if a query is provided to the LIndex searcher, then only those blocks containing documents related to the query are decompressed and read by the Hadoop execution engine, and the remaining documents are skipped. This significantly improves the performance of the whole system. However, this approach lacked good response time during computation. When a query is received, the full-text search operation should be fast enough to retrieve the desired output through the reducer; in applications such as geographical information systems and banking-sector data, queries must be answered quickly or customer satisfaction cannot be met. For this reason a new approach called HIndex is suggested. A further problem is that the LIndex approach produces a large number of indices, which causes performance deterioration during analysis; hence a modification was made to LIndex using hashing, and the new index named HIndex was designed.

Figure 2 gives a general representation of information retrieval with the LIndex and HIndex approaches. When a query is received as input, it is automatically passed to the index searcher, which in turn selects the index approach. After selecting the index, the query is searched and the results are sent back to the user-defined function, which appends them to HDFS.
The basic components of LIndex are as follows:
1. index_writer(): Creates an index for each incoming file.
2. check_file_validity(): Checks the validity of the file: the mode in which the file is opened, file read privileges, file length and file path.
3. tot_doc_index(): Parses the document file using the standard analyzer. The document is checked by providing it to the mapper interface, sorting, and placing the result in HDFS. It also checks whether the datatypes and format are compatible with the index storage.
4. IndexSearcher: The search() function of the LIndexSearcher returns the list of matching document files (a usage sketch is given after this list).
5. IndexReader class: This class takes input from LIndexSearcher and checks for hits in the index database or storage.
6. LIndex Database/Storage: The storage place where the indexes are stored and retrieved.
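To illustrate components 4 to 6, here is a minimal sketch of how a Lucene-style searcher and reader can be used together; everything outside the standard Lucene API, such as the index path and field names, is assumed.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LIndexSearchSketch {
    public static void main(String[] args) throws Exception {
        // Open the index storage produced at insert time (path is illustrative).
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/lindex")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Parse the user's query against the indexed "contents" field.
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse(args[0]);

        // Check for hits and return the matching document files.
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("path") + " score=" + hit.score);
        }
        reader.close();
    }
}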


HIndex Approach

Fig. 3 Graphical illustration of the phases of the HIndex approach.

HIndex is a new index structure, modified from LIndex, in which arbitrary bytes of data are converted to fixed-size data. This is done with the help of a HashMap. A HashMap [10] is a data structure that maps keys to values; it uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. The HashMap takes the individual text in each file and assigns a unique hashcode key value to each word of the text. This function is deterministic, that is, when it is invoked twice on identical data (two strings with exactly the same characters) it produces the same value. If repetitions of words exist in different locations, the Mapper-Combiner interface takes care of this. The whole process is captured by entries such as (hadoop, File1, HashMap1). Each individual data item of File1 is then assigned a unique hashcode key and indexed. Hence, when a request is made to access the index for data retrieval, File1 and its corresponding HashSet are compared. The average cost per operation is constant and is appreciably better than LIndex. This method is illustrated in Figure 3.
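A minimal sketch of this key-to-file-set structure is given below; the class and method names are illustrative, and in the actual system the (word, file) pairs would be produced by the Mapper-Combiner interface rather than by direct calls.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HIndexSketch {
    // Term mapped to the set of files containing it (the HashSet(filename) of the paper).
    private final Map<String, Set<String>> index = new HashMap<>();

    // Insert path: each word is keyed by its deterministic hash code inside the HashMap.
    public void insert(String word, String filePath) {
        index.computeIfAbsent(word, k -> new HashSet<>()).add(filePath);
    }

    // Search path: constant average-cost lookup returning every file holding the word.
    public Set<String> search(String word) {
        return index.getOrDefault(word, new HashSet<>());
    }

    public static void main(String[] args) {
        HIndexSketch h = new HIndexSketch();
        h.insert("hadoop", "File1");
        h.insert("hadoop", "File2");
        System.out.println(h.search("hadoop")); // prints the files containing "hadoop"
    }
}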
Algorithm 2: HIndex Approach
1. if (case == "Search") then
2. while (file_type == ".txt")
3. filename ← parser
4. Set pagehit initially to zero
5. Call IndexReader(parser)
6. if (IndexReader(parser) ≠ 0) then
7. Call IndexSearch(parser)
8. Compare IndexSearch(parser) and HashSet
9. if (true) then return HashSet(filename)
10. else while parser ≠ HashSet(filename)
11. then increment_counter
12. if Search not found then return 0
13. goto step 1
14. else if (case == "Insert") then
15. Call fopen in read-write mode
16. Assign Interface Mapper ← filename
17. Generate HashSet(filename)
18. while ≠ EOF
19. Assign key ← value
20. Assign value ← HashSet(filename)
21. Compare key and HashSet(filename)
22. if (true) then
23. Point to the right index structure
24. Generate index with hashcode and text
25. else return hashkeys, key.record

IV Experimental Results

Our experimental setup consists of a four-node cluster with three datanodes and one namenode, each node having an Intel Core 2 Duo E4500 processor and running the Ubuntu 14.04 operating system. The master node runs the namenode process and the slave nodes run the datanode process. The jobtracker runs on the master node and the tasktracker runs on all the slave nodes. The prerequisite for a MapReduce program is to ensure that the Hadoop cluster is installed, configured and running. To run the MapReduce programs on our cluster we used the following procedure, where HADOOP_HOME is the root of the installation and HADOOP_VERSION is the installed Hadoop version. First we set the CLASSPATH. We then compiled the Lucene, LIndex and HIndex programs using the Java compiler, collected the generated class files, created a jar from them and placed it in the Hadoop local directory. The program is run by:
bin/hadoop jar LIndex.jar MapR input output
Here MapR is the class present in LIndex.jar, input is the input directory and the result of the program is stored in the output folder. Similar steps are repeated for HIndex.
The HIndex algorithm, designed with the help of a HashMap, takes input from multiple files and, based on the input, dynamically creates a HashMap which stores the information as key-value pairs in a HashSet index. So the next time we search for any text, it checks the HashSet and provides the details of the entire mapping of where the text is present in HDFS. The HashMap structure used for the implementation, referred to as HashSet(filename), is declared as
HashMap<String, Set<String>> hashname;
The HashMap takes a text term as key and a set of file paths as its value. The file path can be extracted in Hadoop by using the following statements:
String name =
    ((FileSplit) context.getInputSplit()).getPath().getName();
String filePathString =
    ((FileSplit) context.getInputSplit()).getPath().toString();
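For completeness, a hedged sketch of a driver class that the run command given above could invoke is shown below; the class name MapR follows the paper, while the mapper and reducer wired in are the illustrative ones from the earlier LIndex sketch, not the authors' code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapR {
    public static void main(String[] args) throws Exception {
        // Job wiring for: bin/hadoop jar LIndex.jar MapR input output
        Job job = Job.getInstance(new Configuration(), "lindex");
        job.setJarByClass(MapR.class);
        job.setMapperClass(LIndexSketch.TokenMapper.class);     // illustrative mapper
        job.setReducerClass(LIndexSketch.PostingReducer.class); // illustrative reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}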
The Lucene index as implemented was modified slightly based on Algorithm 1, and the new modified LIndex was implemented. LIndex mapping of keys to values resulted in long integer keys, and this type of mapping took a long response time. Hence a new index, HIndex (Algorithm 2), was tested. HIndex showed better performance compared to LIndex.
Observations from our experiment: during the implementation of LIndex, about 40 percent of the words are eliminated from the index file as stop words using tf-idf, which improves the overall performance. In Lucene only a complete-text scan of the words in a file is possible. Each word in a file is traversed for any query, which results in O(N), where N is the number of words present in the files. In the proposed LIndex approach only indexed words need to be searched, which results in a running time of O(M), with O(M) < O(N), where M is the number of indexed words. This is because the input file is indexed in a complete-text approach, so data retrieval is faster with the help of the index_writer and tot_doc_index functions in the case of LIndex. The running time of HIndex is estimated to be O(T). Comparing the run times of Lucene, LIndex and HIndex, we found that O(T) ≤ O(M) < O(N).

A Performance Details

Fig. 4 CPU Time taken by each of the queries in the approaches.

The computation time for the three approaches, namely Lucene, LIndex and HIndex, is shown in Figure 4. The performance of HIndex was found to be better than that of Lucene and LIndex in response time during computation, owing to comparisons on hash codes. Figure 4 shows that LIndex outperforms Lucene in CPU time in two aspects: less index creation time and better data retrieval time. HIndex in turn outperforms LIndex due to the small fixed-size hash value comparisons used during the data retrieval phase. Similarly, from Figure 5 we can see that the time taken for reading a specific number of bytes (file information) is greater in the case of Lucene than in LIndex, followed by HIndex.

Fig. 5 Bytes read within a specific time by each of the three approaches.

We have implemented the indexing approach on the MapReduce framework where, after the Map phase, an indexing technique called LIndex or HIndex is injected for quicker information retrieval.
V Conclusion

This paper proposes two new indexing approaches, namely LIndex and HIndex, which provide support for indexing in the Hadoop Distributed File System and MapReduce systems without changing the existing Hadoop framework. These two approaches enhance the text-processing capabilities of the Hadoop ecosystem and overcome some of the drawbacks of traditional indexing techniques, such as upfront cost and long idle time at load time. In our approach the load time of the file system remains the same as that of the original Hadoop ecosystem. The performance of the algorithms was evaluated by comparing the execution time in the original MapReduce framework with the Lucene, proposed LIndex and HIndex approaches. In the initial phase the LIndex and HIndex methods take longer to create the index, but after index creation they perform far better than the Lucene MapReduce approach. We found that HIndex outperforms LIndex in CPU time for processing queries, and the time taken for reading the same set of data items is less for HIndex compared to Lucene and LIndex. As part of strengthening these schemes, we plan to work on composite queries in a bigger cluster setup.


References

[1] Lucene. http://lucene.apache.org/.

[2] Hediyeh Baban, S. Kami Makki, and Stefan Andrei. Comparison of different implementation of inverted indexes in Hadoop. In The Second International Conference on E-Technologies and Business on the Web (EBW2014), pages 52-58. The Society of Digital Information and Wireless Communication, 2014.

[3] Michael J. Cafarella and Christopher Ré. Manimal: relational optimization for data-intensive programs. In Proceedings of the 13th International Workshop on the Web and Databases, page 10. ACM, 2010.

[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[5] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 3(1-2):515-529, 2010.

[6] Jens Dittrich, Stefan Richter, and Stefan Schuh. Efficient OR Hadoop: Why not both? Datenbank-Spektrum, 13(1):17-22, 2013.

[7] Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In Proceedings of the Second International Workshop on MapReduce and its Applications, pages 59-66. ACM, 2011.

[8] Richard McCreadie, Craig Macdonald, and Iadh Ounis. MapReduce indexing strategies: Studying scalability and efficiency. Information Processing & Management, 48(5):873-888, 2012.

[9] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 165-178. ACM, 2009.

[10] G. Sudha Sadasivam, K. G. Saranya, and K. G. Karrthik. HashMap-based Wikipedia search with semantics. International Journal of Web Science, 2(1):66-79, 2013.

[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1-10. IEEE, 2010.

[12] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64-71, 2010.

[13] Sai Wu and Kun-Lung Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Engineering Bulletin, 32(1):75-82, 2009.

