
International Journal of Computer Information Systems, Vol. 3, No. 6, 2011

Concept-based Information Retrieval Approaches on the Web: A Brief Survey


Ashraf Ali
Department of Computer Science, Singhania University, Pacheri Bari, Rajasthan, India
aali1979@rediffmail.com

Dr. Israr Ahmad
Department of Computer Science, Jamia Millia Islamia University, New Delhi, India
israr_ahmad@rediffmail.com

Abstract- The Web is growing and changing rapidly, and finding meaningful information on it is currently difficult. To retrieve meaningful information from the Web intelligently, current information retrieval approaches need to be improved. More "smartness" should be embedded in search tools so that they can effectively manage the search, retrieval, filtering and presentation of relevant information to the user. Concept-based information retrieval approaches play a major role in retrieving meaningful information intelligently. In this paper we provide an overview of concept-based information retrieval approaches, together with a background study, considering them promising ways of improving search on the Web. Finally, the advantages and disadvantages of concept-based approaches are discussed.
Keywords: Concept-based information retrieval; Semantic Web; Word Sense Disambiguation; WordNet

I. INTRODUCTION

In recent years, advances in computer technology have caused an explosion of information published on the Web, and the amount of information on the Web is growing very fast. In 2011 Google collected more than 25 billion pages (www.google.com, 2011). However, the huge amount of such information, together with the heterogeneous, dynamic and multilingual nature of the Web, makes it difficult to find relevant information [3]. Most information retrieval tools use keyword search, which is an unsatisfactory option because of its low precision and recall. Keyword search faces two major obstacles: polysemy and synonymy. Polysemy, where the same word may have multiple meanings, causes many irrelevant documents to be retrieved. Synonymy, where two words express the same meaning in a given context, causes many relevant documents in the collection to be missed because they use different words from those in the query. In addition to the problems of polysemy and synonymy, keyword searches can inadvertently exclude misspelled words as well as variations on the stems of words (for example, strike vs. striking).

Over the years many methods have been proposed to overcome the low precision of search engines [2, 4, 7, 15, 28]. Concept-based information retrieval can overcome these challenges by utilizing the Semantic Web, word sense disambiguation, and other techniques to derive the actual meanings of words and their underlying concepts, rather than simply matching character strings as keyword search technologies do.

II. CONCEPT-BASED INFORMATION RETRIEVAL

A. Concept and Concept Search

A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. When a user searches for a term, there is more to the query than what is actually entered. Humans think in terms of concepts, but the search is performed using words. Often the query terms are ambiguous words that cannot fully represent the concept the user has in mind. The intended meaning of such words is described by other words that commonly occur in their context. For instance, a word considered alone may have no obvious meaning; when it is supplied with some relevant words from a context, its meaning becomes understandable. A concept can therefore be considered a group of words that together identify the clear meaning of the intended context.

Concept search is an automated information retrieval method used to search for information that is conceptually similar to the information provided in the search query. Concept search techniques were developed because of the limitations imposed by keyword search technologies when dealing with large, unstructured digital collections of text.

B. Concept-based Information Retrieval Model

The meaning of a word depends on conceptual relationships rather than on the linguistic relationships found in a dictionary. Therefore, new-generation IR models should be constructed from this conceptual framework. This can be implemented by mapping sets of words to concepts, so that the content of a document collection can be described by a set of concepts.
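The difference between matching character strings and matching concepts, including the synonymy and stem-variation problems (strike vs. striking), can be illustrated with a minimal sketch. The term-to-concept map and the concept labels below are hypothetical and hand-made for illustration only; they do not come from any of the surveyed systems. Because each word maps to a single concept, this toy does not address polysemy.

```python
# Toy comparison of keyword matching and concept matching.
# TERM_TO_CONCEPT and its concept labels are hypothetical examples.
TERM_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "strike": "LABOR_ACTION",
    "striking": "LABOR_ACTION",
    "walkout": "LABOR_ACTION",
}


def keyword_match(query, document):
    """Return True if any query word appears verbatim in the document."""
    return bool(set(query.lower().split()) & set(document.lower().split()))


def to_concepts(text):
    """Map each known word to its concept; unknown words are ignored."""
    return {TERM_TO_CONCEPT[w] for w in text.lower().split() if w in TERM_TO_CONCEPT}


def concept_match(query, document):
    """Return True if the query and the document share at least one concept."""
    return bool(to_concepts(query) & to_concepts(document))


docs = ["the automobile was repaired", "workers are striking today"]

for query in ("car", "strike"):
    for doc in docs:
        print(query, "|", doc, "| keyword:", keyword_match(query, doc),
              "| concept:", concept_match(query, doc))
```

In this sketch the query car misses the document about an automobile under keyword matching but finds it under concept matching, which is the behaviour a concept-based model aims for.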

December Issue

Page 14 of 72

ISSN 2229 5208

Reference [14] surveyed 13 concept-based information retrieval tools and determined that the existence of a conceptual structure is crucial in a concept-based information retrieval model. At least five models of conceptual structure are currently available in concept-based IR products: conceptual taxonomy, domain ontology, semantic linguistic network of concepts, thesaurus and predictive model. Almost all of these models rely on manual construction, which is a disadvantage because it is time-consuming. Recent work and challenges in concept-based information retrieval can be seen in various research papers [2, 4, 9, 13, 14, 15, 16, 19, 21, 24, 25].

Concept-based information retrieval does not refer to one strict information retrieval approach or tool. Over the years many different approaches have been proposed by various researchers; all of them attempt to improve retrieval effectiveness by incorporating the conceptual or semantic information implied by words into the retrieval methodology. Some approaches use a knowledge base [6, 19] and others do not [21]. The common goal of all approaches is either to perform effective retrieval that goes beyond keyword matching or to utilize conceptual information effectively to improve retrieval performance.

1) WordNet

WordNet [12] is a knowledge base in the form of a lexical database that stores the meanings of words and the relationships between them in a conceptually organized hierarchy. Each word in WordNet belongs to one or more logical structures called synsets. A synset represents a group of synonymous words that together represent one underlying concept. The words belonging to a synset are so closely related that if a word from a particular synset is used in a sentence, it can be replaced by any other word in the synset without affecting the meaning of the sentence.
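The synset idea can be sketched with a tiny hand-made table. The synset identifiers, member words and glosses below are hypothetical simplifications and are not actual WordNet entries; the sketch only shows how one word can belong to several synsets and how the members of a synset act as substitutes for one another.

```python
# Hypothetical miniature synsets; real WordNet entries are far richer.
SYNSETS = {
    "bank_river": {
        "words": {"bank", "riverbank"},
        "gloss": "sloping land beside a body of water",
    },
    "bank_finance": {
        "words": {"bank", "depository"},
        "gloss": "a financial institution",
    },
}


def senses_of(word):
    """Return the ids of every synset that contains the word."""
    return sorted(sid for sid, s in SYNSETS.items() if word in s["words"])


def substitutes(word, synset_id):
    """Return the other members of a synset, i.e. words that could
    replace `word` when it is used in that sense."""
    return sorted(SYNSETS[synset_id]["words"] - {word})


print(senses_of("bank"))
print(substitutes("bank", "bank_river"))
```

Here the ambiguous word bank appears in two synsets, while riverbank appears in only one, so choosing a synset amounts to choosing a sense.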
Effectively, every synset helps distinguish one sense of a word. As a knowledge base covering the general domain, WordNet has proven to be a very useful resource for various natural language processing and information retrieval tasks, because it is a rich source of information about word relations, concepts, meanings, senses and parts of speech. It has excellent coverage of the English language and an accurately organized conceptual hierarchy. But the fine-grained nature of WordNet definitions can sometimes cause problems. WordNet can be very specific in its definitions and includes senses of words that are rarely used. This can prove to be a disadvantage for an approach that does not expect such rare and specific information.

2) Semantic Web

The Semantic Web is an extension of the current Web in which human-understood information is transformed into structured and organized knowledge that can be effectively exploited by humans and automated computer programs [26]. Essentially, this requires the Semantic Web to rely upon knowledge representation, so that information is encoded with structured meaning; ontologies, to form a basis for knowledge representation and provide a framework to resolve ambiguities; and agents, to perform automated inferencing tasks on knowledge across the Semantic Web.
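How these three ingredients fit together can be shown with a toy sketch. It assumes, for illustration only, that knowledge is written as subject-predicate-object triples, that the ontology is a small class hierarchy, and that the "agent" is a function deriving facts that were never stated explicitly. The `ex:` names are hypothetical; `rdf:type` and `rdfs:subClassOf` are borrowed from the RDF vocabulary, but nothing here uses a real RDF library.

```python
# Knowledge representation: facts as (subject, predicate, object) triples.
# The ex: resources and classes are hypothetical examples.
TRIPLES = {
    ("ex:Sedan", "rdfs:subClassOf", "ex:Car"),      # ontology: class hierarchy
    ("ex:Car", "rdfs:subClassOf", "ex:Vehicle"),
    ("ex:myCar", "rdf:type", "ex:Sedan"),           # a fact about one resource
}


def superclasses(cls, triples):
    """Return every class reachable from cls by following subClassOf links."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(resource, triples):
    """A minimal 'agent': combine stated types with types inferred
    from the class hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred


print(sorted(types_of("ex:myCar", TRIPLES)))
```

Only the fact that ex:myCar is a sedan is stated; that it is also a car and a vehicle is inferred from the hierarchy, which is the kind of automated deduction the agents are expected to perform.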

Central technologies allowing information to be transformed into structured knowledge are XML (eXtensible Markup Language) and RDF (Resource Description Framework) [1], a standard for encoding web metadata. Together they form a powerful method of organizing information into meaningful and useful structured knowledge. Concepts, which can be identified through a uniform resource identifier (URI) such as a uniform resource locator (URL), can be established on the Semantic Web by linking them with existing knowledge already encoded in RDF schemas.

Ontologies provide definitions of concepts and the relationships between them, along with the inference rules required for effective deduction. They can be exploited to improve the effectiveness of web search, or used in a more advanced fashion to relate information with knowledge and knowledge with rules. Languages for ontology construction and knowledge representation for the Semantic Web have been proposed [5], and techniques to perform automatic integration of web taxonomies have been shown to be effective [8]. Also, information retrieval techniques have been applied to assist in the construction of ontologies for the Semantic Web [27].

Agents that accept both knowledge and ontologies can use these resources to reason and perform intelligent automated tasks [26]. By taking advantage of the rich semantic environment and framework that characterizes the Semantic Web, they should be able to perform advanced tasks that were not possible with the original Web alone.

Information retrieval can play an important role in retrieving information from the Semantic Web. Reference [17] presented a framework in which inference and retrieval are coupled through the integration of the Semantic Web with information retrieval. The authors illustrate the deficiencies of current information retrieval approaches as candidates for Semantic Web retrieval and inferencing. These are mainly centered around the use of ontologies and semantic markup in documents to facilitate effective retrieval on the Semantic Web [18]. They proposed that semantic queries can be encoded as plain-text queries submitted to a web search engine for retrieval. Ranked pages can then be scraped of their semantic markup, and this information can be fed to the inference engine. Mayfield et al. also describe the OWLIR


system, a Semantic Web retrieval and inferencing system. It can extract concepts from plain text, match them in an ontology and associate related inference rules with them.

Reference [22] presented a semantic search framework called TAP and showed how it can be used to augment a web search engine with semantic search capabilities. Once a simple text query is submitted, it is mapped to one or more denotations specified by the Semantic Web. This is performed through a set of heuristics or explicitly by the user. During retrieval, documents from the text-based search engine are filtered and ranked according to the estimated likelihood that the document belongs to the chosen denotation. The results are also augmented to display specific class type information derived from the Semantic Web.

3) Conceptual Indexing

Some of the most recent and significant work on concept-based information retrieval is presented in [28], from Sun Microsystems. The proposed conceptual indexing system can automatically extract conceptual information from documents and build hierarchical concept graphs dynamically. During retrieval, intelligent matches can be made between queries and the conceptual information for a document. By exploiting the conceptual index, the system can match more general terms with more specific terms and vice versa. The system was tested and evaluated on UNIX manual pages and another document collection from Sun Microsystems, and it was compared against a number of other systems. Precision and recall were recorded along with a new measure called success rate, defined as the percentage of queries that had a successful answer in the top ten results. In this comparison, the conceptual indexing system outperformed the other systems by achieving a better success rate and recall rate.
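The success rate measure as defined above can be written down in a few lines. This is a minimal sketch under the assumption that each query has a ranked list of document ids and a set of ids judged successful; the function name, argument names and data layout are illustrative and are not taken from [28].

```python
def success_rate(ranked_results, relevant, k=10):
    """Percentage of queries with at least one successful answer
    among the top k results (k=10 follows the definition in the text).

    ranked_results: dict mapping query id -> ranked list of document ids
    relevant:       dict mapping query id -> set of successful document ids
    """
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for query, ranking in ranked_results.items()
        if any(doc in relevant.get(query, set()) for doc in ranking[:k])
    )
    return 100.0 * hits / len(ranked_results)


# Hypothetical example: q1 has a hit at rank 2, q2 only at rank 11.
ranked = {
    "q1": ["a", "b"],
    "q2": ["n%d" % i for i in range(10)] + ["c"],
}
judged = {"q1": {"b"}, "q2": {"c"}}
print(success_rate(ranked, judged))
```

Unlike precision and recall, this measure only asks whether a query was answered at all within the first page of results, so a single good hit counts as much as several.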
An elaborate explanation of the various techniques and details of this system can be found in [28]. The fact that the system can dynamically build conceptual hierarchies is very significant. A system that can accurately build its own conceptual knowledge representations can overcome the limitations of using a static, hand-made knowledge base with limited coverage, such as WordNet. Of course, one must be skeptical about the quality and accuracy of conceptual graphs that are generated dynamically.

4) Word Sense Disambiguation

One way to potentially improve retrieval accuracy is to combine word sense disambiguation (WSD) with information retrieval. The aim of WSD is to assign to a given term occurring in a particular context the sense that is implied by that term. Senses are defined through concepts, and thus WSD combined with information retrieval can be considered part of a concept-based retrieval approach. If words in a document and words in a query can be correctly disambiguated, then a system could improve retrieval performance by matching senses and returning more relevant documents to the user. For example, consider the query river bank. By knowing that the word bank in the query refers to the sloping land beside a body of water, only documents using bank in the same context need to be returned. Furthermore, by identifying the correct sense, the problems of synonymy and polysemy can also be addressed. Additional relevant documents that refer to sloping land but do not actually use bank as a term could also be retrieved. To date, the information retrieval community has yet to propose a comprehensive model or approach that improves retrieval performance by matching senses and overcoming synonymy and polysemy.

Reference [20] performed an insightful investigation into the effect of WSD on information retrieval. Specifically, Sanderson investigated the effect of erroneous disambiguation on information retrieval performance.
To do this, a technique that introduces artificial ambiguity into a document collection through the novel use of pseudo-words was proposed [7]. A pseudo-word is a concatenation of two or more words. Once a pseudo-word is created, it replaces all original occurrences of the words it was built from. For example, if car and fish were the two chosen words, the pseudo-word would be car/fish, and all occurrences of car or fish in the document collection would be replaced with car/fish. In doing this, an ambiguous document set is created. Sanderson ran a number of experiments using an ambiguous Reuters newswire collection in which all words had been replaced by pseudo-words and found that it made hardly any difference to retrieval performance. But when disambiguation was performed on the ambiguous set at various controlled levels of accuracy to simulate erroneous disambiguation, retrieval performance suffered as a consequence. When the WSD algorithm performed at 75% accuracy, a degradation in retrieval performance was observed. Sanderson concluded that disambiguation accuracy needed to be at 90% or greater to observe an improvement in retrieval performance.

An earlier and more elaborate study of sense matches between query and document terms was performed in [23]. The authors discovered that when queries are well formulated and describe an information need well, lexical ambiguity is less of an issue. Consider the queries bank teller financial deposit and big bank, where both queries use the word bank in the sense of a financial institution. Using the first query, an information retrieval system would be more likely to return only documents referring to a bank as the financial institution, because the words in the query give a good indication of this.
Whereas an information retrieval system using the second query is just as likely to return documents about a bank, the financial institution, as it is to return documents about a bank, the sloping land beside a body of water. This is called the collocation effect, referring to how groups of words tend to appear together and implicitly distinguish the sense of a word. They also studied the frequency distribution of senses across the collection and


found that certain senses of words tend to occur more frequently than other senses of the same word.

Reference [10] evaluated retrieval effectiveness using a novel WordNet-based WSD algorithm. Voorhees indexed the CACM, CISI, CRAN, MED and TIME collections using three sub-vectors to represent word sense information for each document and query. While not providing any indication of the exact accuracy of the WSD algorithm, she observed a decrease in retrieval performance for the disambiguated, sense-based experiments compared with the standard experiments. Voorhees also investigated the use of WSD for a query expansion approach and found that incorrect WSD degrades performance there as well. It was clearly concluded that erroneous WSD is detrimental to retrieval performance.

Reference [16] investigated the indexing of the SEMCOR collection [11], a subset of the Brown Corpus. SEMCOR consists of about 130 documents that have been manually disambiguated with WordNet senses with 100% accuracy. The aims of the investigation were 1) to ascertain how effectively WordNet could be utilized for information retrieval, and 2) to test the effect of erroneous disambiguation on retrieval performance. The authors used three different document indexing schemes: indexing by WordNet synsets, by WordNet senses and by terms. For each indexing scheme, they observed retrieval performance with no disambiguation error and with different levels of disambiguation error. When running experiments with the original corpus, which was fully and accurately disambiguated, they report substantial improvements in accuracy. Compared with the standard vector space model, WordNet synset indexing and WordNet sense indexing achieve a 14% and an 11% gain in retrieval performance respectively.
In the second phase of their experimentation, they alter the corpus to introduce erroneous disambiguation. They run various experiments with different levels of disambiguation error and, as in [20], conclude that disambiguation needs to be 90% accurate to observe any significant improvement over a standard approach.

Reference [25] presented a heuristic WSD technique to disambiguate short queries using the senses provided in WordNet. A disambiguated query can be expanded by identifying the most significant terms and phrases to be extracted from the identified senses in WordNet. In their approach, each document is given a phrase-based score and a term-based score. The phrase-based score is a similarity score based on matching the phrases in the query with phrases in the document. It is independent of the frequency of phrases in a document and is calculated by summing the IDF scores of each query phrase occurring in the document. The term-based score is a typical content-based score calculated by Okapi. Documents are ranked first by their phrase-based score and then by their term-based score. The approach was tested on the WT10g web collection, and significant improvements of 23-31% were observed over a standard Okapi ranking technique that does not use WSD. The advantage of using WSD on the query only is that the computational expense and complexity of performing WSD on documents is avoided.

III. ADVANTAGES AND DISADVANTAGES

Concept-based retrieval approaches aim to account for the characteristics of human language and knowledge by introducing conceptual information into the retrieval process. They have the advantage of exploiting this additional information and including it in the retrieval process to improve effectiveness. This information is usually provided through manually created ontologies, which usually provide accurate, human-annotated information.
But conceptual information can also be obtained through automatic techniques that extract concepts and the relations between them to produce artificial ontologies and other types of useful concept-based information. This is in contrast to a keyword-based system, where all the information used for retrieval is based on a statistical interpretation of the keywords appearing within the collection. The main disadvantage is the limited term coverage of ontologies and the potential incompleteness of the annotated conceptual information.

IV. CONCLUSION

In this paper, we have presented a brief survey of various concept-based approaches for improving search results on the Web. The approaches covered include WordNet, the Semantic Web, conceptual indexing and word sense disambiguation. The advantages and disadvantages of concept-based approaches were then discussed, which narrows the focus of the paper.

REFERENCES
[1] World Wide Web Consortium, http://www.w3c.org.
[2] A. Hamzah, "Text document information retrieval based on concepts," Jurnal Teknologi, Vol. 4(1), Juni 2011, pp. 45-51.
[3] A. Ali and I. Ahmad, "Information retrieval issues on the web," International Journal of Computer Technology and Applications, Vol. 2(6), pp. 1951-1955, Nov-Dec 2011.
[4] L. Bentivogli, P. Forner, B. Magnini and E. Pianta, "Revising WordNet domains hierarchy: semantics, coverage, and balancing," in Proceedings of COLING 2004, Workshop on Multilingual Linguistic Resources, Aug. 28, pp. 101-108, 2004.
[5] D. Fensel, F. Harmelen, I. Horrocks, D. McGuinness and R. Patel-Schneider, "OIL: an ontology infrastructure for the semantic web," IEEE Intelligent Systems Magazine, Vol. 16, no. 2, pp. 38-45, 2001.
[6] D. Shin, H. Lim, Y. Yoon and K. Choi, "HYKIS - an information retrieval system based on a hybrid knowledge base," in Proceedings of the 2nd International Conference on Information and Knowledge Management, pp. 264-273, 1993.
[7] D. Yarowsky, "One sense per collocation," in Proceedings of the ARPA Human Language Technology Workshop, pp. 266-271, 1993.
[8] D. Zhang and W. Sun Lee, "Web taxonomy integration through co-bootstrapping," in Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, pp. 410-417, 2004.
[9] O. Egozi, S. Markovitch and E. Gabrilovich, "Concept-based information retrieval using explicit semantic analysis," ACM Transactions on Information Systems (TOIS), Vol. 29(2), April 2011.
[10] E. Voorhees, "Using WordNet for text retrieval," in WordNet: An Electronic Lexical Database, MIT Press, 1998, pp. 285-303.


[11] G. Miller, C. Leacock, R. Tengi and R. Bunker, "A semantic concordance," in Proceedings of the ARPA Human Language Technology Workshop, pp. 303-308, 1993.
[12] G. Miller, "WordNet: a lexical database," Communications of the ACM, Vol. 38(11), pp. 39-41, 1995.
[13] P. Goyal, L. Behera and T.M. McGinnity, "An information retrieval model based on automatically learnt concept hierarchies," IEEE International Conference, ICSC '09, 14-16 Sept 2009.
[14] H., Hele-Mai and L. Tanel-Lauri, "A survey of concept-based information retrieval tools on the web," Institute of Cybernetics at Tallinn Technical University, Akadeemia tee 21, 12618 Tallinn, 2005.
[15] V. Jalali and M.R.M. Borujerdi, "Information retrieval with concept-based pseudo-relevance feedback in MEDLINE," Knowledge and Information Systems, DOI: 10.1007/s10115-010-0327-7, 2010.
[16] J. Gonzalo, F. Verdejo, I. Chugur and J. Cigarran, "Indexing with WordNet synsets can improve text retrieval," in Proceedings of the International Conference on Computational Linguistics '98 Workshop, pp. 38-44, 1998.
[17] J. Mayfield and T. Finin, "Information retrieval on the semantic web: integrating inference and retrieval," in Proceedings of the SIGIR 2003 Semantic Web Workshop, pp. 461-468, 2003.
[18] J. Mayfield, "Ontologies and text retrieval," The Knowledge Engineering Review, Vol. 17(1), pp. 71-75, 2002.
[19] J. Zakos, B. Verma, X. Li and S. Kulkarni, "Intelligent encoding of concepts in web document retrieval," in Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, China, pp. 72-77, 2003.
[20] M. Sanderson, "Word sense disambiguation and information retrieval," in Proceedings of the 17th International Conference on Research and Development in Information Retrieval, pp. 49-57, 1994.
[21] P. Henstock, D. Pack, Y. Lee and C. Weinstein, "Toward an improved concept-based information retrieval system," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 384-385, 2001.
[22] R. Guha, R. McCool and E. Miller, "Semantic search," in Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, pp. 700-709, 2003.
[23] R. Krovetz and W. Croft, "Lexical ambiguity and information retrieval," ACM Transactions on Information Systems, Vol. 10(2), pp. 115-141, 1992.
[24] M.P. Rad, H. Hassanpour and R. Poursaikh, "Concept-based information retrieval with ontology approach for cross-language search," World Applied Science Journal (8), pp. 965-971, 2010, ISSN: 1818-4952.
[25] S. Liu, F. Liu, C. Yu and W. Meng, "An effective approach to document retrieval via utilizing WordNet and recognizing phrases," in Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval, pp. 266-272, 2004.
[26] T. Berners-Lee, J. Hendler and O. Lassila, "The semantic web," Scientific American, pp. 35-43, May 2001.
[27] W. van Hage, M. de Rijke and M. Marx, "Information retrieval support for ontology construction and use," in Proceedings of the International Semantic Web Conference, Hiroshima, Japan, pp. 518-533, 2004.
[28] W. Woods, "Conceptual indexing: a better way to organize knowledge," Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, 1997.

AUTHORS PROFILE

Mr. Ashraf Ali received his B.Sc. (Computer Science & Mathematics) degree from the University of Lucknow in 2000 and his M.C.A. degree from Bundelkhand University, Jhansi in 2004. He is currently pursuing a Ph.D. at Singhania University, Rajasthan. He has a total of 7 years of teaching experience in various science and engineering colleges in India and abroad. His research interests include Information Retrieval, Data Mining, and Web Mining. He has presented 2 papers in international journals.

Dr. Israr Ahmad holds a B.Sc. (Hons) in Mathematics from Jamia Millia Islamia (JMI), a Master of Computer Science and Applications (MCA) from Aligarh Muslim University (AMU), Aligarh, and a Ph.D. on "Computer Based Analysis and Study of the Transportation System and its Related Problems" from JMI, New Delhi. He has been working as a Computer Programmer in the Department of Computer Science, JMI, New Delhi since 1991. He assists in teaching computer papers to M.Sc. (Mathematics) evening students and is also engaged in research work. He has participated in twelve national conferences, workshops and seminars. He has published 6 research papers in national and international journals and 2 research papers in national and international conferences. He is supervising two Ph.D. students.

