
International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 11, November 2013)

A Survey on Information Retrieval Models, Techniques and Applications
Manish Sharma¹, Rahul Patel²
¹PG Scholar, CSE, AITR, Indore
²Assistant Professor, CSE, AITR, Indore
Abstract-- Information retrieval has become an important research area in the field of computer science. Information retrieval (IR) is generally concerned with searching for and retrieving knowledge-based information from a database. In this survey paper, we present various models and techniques for information retrieval. We describe different indexing methods for reducing the search space and different searching techniques for retrieving information. We also provide an overview of traditional IR models.

Keywords-- Information Retrieval (IR), Indexing, IR model, Searching, Vector Space Model (VSM).

I. INTRODUCTION
Information retrieval is generally considered a subfield of computer science that deals with the representation, storage, and access of information [1]. Information retrieval is concerned with the organization and retrieval of information from large database collections [2]. Information Retrieval (IR) is the process by which a collection of data is represented, stored, and searched for the purpose of knowledge discovery as a response to a user request (query) [3]. This process involves various stages, beginning with representing the data and ending with returning relevant information to the user. The intermediate stages include filtering, searching, matching and ranking operations.
The main goal of an information retrieval system (IRS) is to find relevant information or a document that satisfies the user's information need. To achieve this goal, IRSs usually implement the following processes:
1) In the indexing process, the documents are represented in summarized content form.
2) In the filtering process, all stop words and common words are removed.
3) Searching is the core process of an IRS. There are various techniques for retrieving documents that match the user's need.
There are two basic measures for assessing the quality of information retrieval [2].
Precision: This is the percentage of retrieved documents that are in fact relevant to the query.
Recall: This is the percentage of documents that are relevant to the query and were in fact retrieved.
There are three basic processes an information retrieval system has to support: the representation of the content of the documents, the representation of the user's information need, and the comparison of the two representations. The processes are visualized in Figure 1. In the figure, squared boxes represent data and rounded boxes represent processes.
Representing the documents is usually called the indexing process. The process takes place off-line, that is, the end user of the information retrieval system is not directly involved. The indexing process results in a representation of the document [5].
Users do not search just for fun; they have a need for information. The process of representing their information need is often referred to as the query formulation process. The resulting representation is the query [5].
Comparing the two representations is known as the matching process. Retrieval of documents is the result of this process.
The structure of this paper is as follows. A brief introduction to IR models is presented in Section II, followed by indexing methods in Section III, searching techniques in Section IV, and IR applications in Section V. Finally, Section VI covers conclusions.
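The precision and recall measures defined in this section can be illustrated with a minimal sketch. The function name and the document identifiers below are hypothetical and chosen only for illustration; the values are returned as fractions, which can be multiplied by 100 to obtain the percentages mentioned in the text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    # Precision: share of the retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents truly relevant.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so a system can score well on one measure while scoring poorly on the other.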


[Figure: block diagram. Users' information need -> query formulation -> user query. Document collection -> filtering -> indexing -> indexed documents. User query and indexed documents -> matching algorithm -> matched documents -> retrieved documents.]

Fig. 1 A general framework of an IR system
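The flow in Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the stop-word list, the tokenizer, the toy collection, and the rule that a document matches when it contains every query term are all hypothetical choices, not details taken from the surveyed systems.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def filter_terms(text):
    """Filtering: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(collection):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in filter_terms(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Matching: return ids of documents that contain every query term."""
    terms = filter_terms(query)  # query formulation reuses the same filter
    if not terms:
        return []
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings))


collection = {
    1: "Indexing of documents in a digital library",
    2: "The vector space model for ranking documents",
    3: "Searching an inverted index",
}
index = build_index(collection)
print(match(index, "the documents"))
print(match(index, "indexing documents"))
```

The index is built once, off-line, while `match` runs for every query, which mirrors the separation between the indexing and matching boxes in the figure.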

II. IR MODELS
An IR model specifies the details of the document representation, the query representation and the retrieval functionality [3]. The fundamental IR models can be classified into Boolean, vector, and probabilistic models [8] [3]. The rest of this section briefly describes these models.
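To make the three ingredients of a model concrete (document representation, query representation, retrieval function), the sketch below contrasts a Boolean retrieval function with a vector-space one on the same toy collection. Everything here is an assumption made for illustration: the three documents, the whitespace tokenizer, and the particular weighting (raw term frequency times log(N/df)), which is only one of several common tf-idf variants.

```python
import math
from collections import Counter

docs = {
    "d1": "information retrieval ranks documents",
    "d2": "boolean retrieval returns documents",
    "d3": "signature files index documents",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}


# --- Boolean model: documents are keyword sets, retrieval is a yes/no test ---
def boolean_and(terms):
    """Return the (unranked) documents containing all the given terms."""
    return sorted(d for d, toks in tokens.items() if set(terms) <= set(toks))


# --- Vector space model: weighted vectors ranked by cosine similarity ---
N = len(tokens)
df = Counter(t for toks in tokens.values() for t in set(toks))


def tfidf(toks):
    """Weight = raw term frequency * log(N / df); unknown terms are ignored."""
    tf = Counter(t for t in toks if t in df)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query):
    """Return (doc_id, score) pairs, most similar document first."""
    q = tfidf(query.split())
    scores = {d: cosine(tfidf(toks), q) for d, toks in tokens.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(boolean_and(["retrieval", "documents"]))  # a set of matches, no order
for doc_id, score in rank("information retrieval"):
    print(doc_id, round(score, 3))  # a ranked list with graded scores
```

The Boolean function can only say which documents match, whereas the vector-space function orders all documents by degree of similarity to the query.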

A. Boolean model:
The Boolean model allows the operators of Boolean algebra, AND, OR and NOT, to be used for query formulation, but it has one major disadvantage: a Boolean system is not able to rank the returned list of documents [4]. In the Boolean model, a document is associated with a set of keywords. Queries are also expressions of keywords separated by AND, OR, or NOT/BUT. The retrieval function in this model treats a document as either relevant or irrelevant [3].

B. Vector Space Model:
The vector space model can best be characterized by its attempt to rank documents by the similarity between the query and each document [10]. In the Vector Space Model (VSM), documents and queries are represented as vectors, and the angle between the two vectors is computed using the cosine similarity function. The cosine similarity function can be defined as:

$$\mathrm{sim}(\vec{d}, \vec{q}) = \cos\theta = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert} = \frac{\sum_{i=1}^{t} w_{i,d} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,d}^{2}} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$

where documents and queries are represented as vectors of term weights, $\vec{d} = (w_{1,d}, \ldots, w_{t,d})$ and $\vec{q} = (w_{1,q}, \ldots, w_{t,q})$, over the $t$ index terms.
The Vector Space Model introduced a term-weighting scheme known as tf-idf weighting. These weights have a term frequency (tf) factor measuring the frequency of occurrence of the terms in the document or query texts and an inverse document frequency (idf) factor measuring the inverse of the number of documents that contain a query or document term [4].

C. Probabilistic model:
The most important characteristic of the probabilistic model is its attempt to rank documents by their probability of relevance given a query [9]. Documents and queries are represented by binary vectors $\vec{d}$ and $\vec{q}$, each vector element indicating whether a document attribute or term occurs in the document or query, or not. Instead of probabilities, the probabilistic model uses odds $O(R)$, where $O(R) = P(R) / (1 - P(R))$, $R$ means the document is relevant and $\bar{R}$ means the document is not relevant [4].

III. INDEXING TECHNIQUES
There are several popular information retrieval indexing techniques, including inverted indices and signature files.

A. Signature file:
In the signature file method, each document yields a bit string (signature) using hashing on its words and superimposed coding. The resulting document signatures are stored sequentially in a separate file called the signature file, which is much smaller than the original file and can be searched much faster [6].

B. Inversion indices:
Each document can be represented by a list of keywords which describe the contents of the document for retrieval purposes [6]. Fast retrieval can be achieved if we invert on those keywords. The keywords are stored, e.g. alphabetically, in the index file; for each keyword we maintain a list of pointers to the qualifying documents in the postings file. This method is followed by almost all commercial systems [10].

IV. SEARCHING TECHNIQUES
There are various searching algorithms, including linear search, binary search, and brute force search. Some general searching algorithms are described below.
1) Linear search is a method of finding a particular element or keyword in a list or array by checking every element, one at a time and in sequence. Linear search is the simplest search algorithm. One of its most important drawbacks is its slow searching speed, even on an ordered list. This search is also known as sequential search.
2) Brute force search is a very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement. A brute force algorithm is simple to implement, and it will always find a solution if one exists.
3) Binary search finds the position of a specified element within a sorted array by using its key value. In each step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, then a matching element has been found and its index, or position, is returned. Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the sub-array to the right.
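The linear and binary search procedures described in this section can be sketched as follows. The function names, the sample keyword list, and the use of -1 as the "not found" value are assumptions made for this illustration.

```python
def linear_search(items, key):
    """Check every element in sequence; works on unsorted input too."""
    for position, item in enumerate(items):
        if item == key:
            return position
    return -1  # "not found" indication (a convention assumed here)


def binary_search(sorted_items, key):
    """Halve a sorted list repeatedly until the key is found or nothing is left."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:  # the remaining sub-array is not empty
        mid = (low + high) // 2
        if sorted_items[mid] == key:
            return mid
        if key < sorted_items[mid]:
            high = mid - 1  # continue in the left sub-array
        else:
            low = mid + 1  # continue in the right sub-array
    return -1


keywords = ["boolean", "index", "query", "recall", "vector"]  # already sorted
print(linear_search(keywords, "recall"), binary_search(keywords, "recall"))
print(binary_search(keywords, "ranking"))
```

Linear search may inspect all n elements, while binary search needs at most about log2(n) comparisons but requires sorted input. Python's standard `bisect` module offers the same halving strategy for production use.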

When the remaining sub-array to be searched becomes empty, the key is not in the array, and binary search returns a special "not found" indication.

V. AREA OF IR APPLICATION
Information retrieval (IR) systems were first developed to help manage huge amounts of information. Many university, corporate, and public libraries now use IR systems to provide access to books, journals, and other documents. Information retrieval is used today in many applications [7]. General applications of information retrieval systems are as follows:

A. Digital Library:
A digital library is a library in which collections are stored in digital formats and are accessible by computers. The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system [7].

B. Search Engines:
A search engine is one of the most practical applications of information retrieval techniques to large-scale text collections. Web search engines are the best-known examples, but many other kinds of search exist, such as desktop search, enterprise search, federated search, mobile search, and social search [7].

C. Media search:
An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images [7].

VI. CONCLUSION
We conclude that information retrieval is a process of searching for and retrieving knowledge-based information from a collection of documents. This survey has dealt with the basics of information retrieval. In the first section we defined the information retrieval system along with its basic measures. We then covered the traditional IR models and discussed different indexing and searching techniques. This survey also covers the application areas of IR.

REFERENCES
[1] Mohameth-François Sy, Sylvie Ranwez, Jacky Montmain, Armelle Regnault, Michel Crampes, Vincent Ranwez Pezzoli, User centered and ontology based information retrieval system for life sciences, BMC Bioinformatics, 2012, 1471-2105.
[2] R. Sagayam, S. Srinivasan, S. Roshni, A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques, IJCER, Vol. 2, Issue 5, Sep 2012, pp. 1443-1444.
[3] Anwar A. Alhenshiri, Web Information Retrieval and Search Engines Techniques, Al-Satil Journal, pp. 55-92.
[4] Djoerd Hiemstra, Arjen P. de Vries, Relating the new language models of information retrieval to the traditional retrieval models, CTIT technical report TR-CTIT-00-09, May 2000.
[5] Djoerd Hiemstra, Information Retrieval Models, in Goker, A., and Davies, J., Information Retrieval: Searching in the 21st Century, John Wiley and Sons, Ltd., ISBN-13: 978-0470027622, November 2009.
[6] Christos Faloutsos, Douglas W. Oard, A Survey of Information Retrieval and Filtering Methods, CS-TR-3514, Aug 1995.
[7] Algorithms for Information Retrieval Introduction, Lab module 1.
[8] R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", ACM Press, ISBN: 0-201-39829-X.
[9] S.E. Robertson and K. Sparck Jones, Relevance weighting of search terms, Journal of the American Society for Information Science, 27:129-146, 1976.
[10] G. Salton and M.J. McGill, editors, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
