
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 8, August 2011

A Study About Developing Trends in Information Retrieval from Google to Swoogle


S. Kalarani, Ph.D. Scholar
Department of Information Technology, St. Joseph's Institute of Technology, Chennai - 119.

Dr. G. V. Uma
Professor, Department of IST, Anna University, Chennai - 26.

Abstract— Information retrieval technology has been central to the success of the Web. Web 2.0 is defined as the innovative use of the World Wide Web to expand social and business growth and to harvest collective intelligence from the community; its features span user behavior, software design, and a high-level technical architecture. Google Web Search, owned by Google Inc., is the most-used search engine on the Web, receiving several hundred million queries each day through its various services. The main purpose of Google Search is to hunt for text in web pages, and its information retrieval rests on three principles. The first is keyword search: the engine examines its index and provides a listing of the best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text; the index is built from the information stored with the data and the method by which that information is indexed. The second is PageRank, a link analysis algorithm used by the Google search engine that assigns a numerical weighting to each element of a hyperlinked set of documents; in simple terms, results are ordered by the relative importance of the search terms in each document. The third is indexing: Google maintains an index of all crawled pages based on the terms in each page, and the inverted index technique now stores this information as stems. Finally, we point out the limitations of the current technologies in order to analyze the new technology developments in the Web 3.0 model. The core of the Semantic Web is ontology; the requirements of ontology in the context of the Web are outlined, and the advantages of using ontology in both knowledge-base-style and database-style applications are demonstrated with one real-world application.

Keywords— keyword search, knowledge base, indexing, inverted index, ontology, page rank, stemming, Web 2.0, Web 3.0, ontology-based retrieval.

I. INTRODUCTION

Search engines have given everyone the easiest way to link to and fetch information from an enormous collection of web pages; this has been made feasible through web information retrieval. Web 1.0 was merely a read-only Web, in which users had no control over modifying or manipulating its data. Web 2.0 is the read-and-write Web, and the Google search engine can be cited as an example of it. The google.com site is the most-visited website in the world. Some of its features include a definition link for most searches (including dictionary words), the number of results obtained for a search, and links to related searches (e.g., for words that Google believes to be misspelled, links using the proposed spelling are provided with the search results). The following sections elaborate on information retrieval in Google and on the future Web 3.0. The Google search engine mainly concentrates on: a) keyword search and indexing, b) page ranking, and c) web crawlers. Section II explains keyword search and indexing in Google, the ranking of pages and its advantage over other search engines, and the architecture of web crawling together with its effectiveness in information retrieval through precision and recall; it also deals with inverted indexing. The final sections deal with Web 3.0, which is mainly concerned with ontology: concepts relating to an intelligent information retrieval system model based on domain ontology are highlighted, and a model based on a relation-based search engine is proposed.

II. TRADITIONAL METHODOLOGY OF INFORMATION RETRIEVAL

A. Keyword search and indexing

Google uses the technique of keyword-based search for queries. A keyword can be regarded as a phrase specified in the web page by its author. For example, when


http://sites.google.com/site/ijcsis/ ISSN 1947-5500


you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Web, finds the entry for the term "Google", and with it the list of all pages that reference that term. The search is mainly concerned with a page's subject and content. Such words are gathered by web crawlers and organized in a table with pointers into a database of URLs. Each URL record holds an inlist and an outlist, which act as references to other web pages containing related information; all these references are identified by an index value. Apart from indexed keywords, a considerable amount of data resides in online databases that are accessible by means of queries but not by links. This is called the invisible or deep Web [1]. The deep Web contains library catalogs, official legislative documents of governments, and other content that is dynamically prepared in response to a query.
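The term table and URL database described above can be sketched as a small in-memory structure. This is only an illustration of the idea, not Google's actual schema; the record fields, example URLs, and function names are all invented for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class UrlRecord:
    # One URL database entry: the page address plus its link neighborhood.
    url: str
    inlist: set = field(default_factory=set)   # indices of pages linking here
    outlist: set = field(default_factory=set)  # indices of pages linked from here

# URL database, plus the crawler-built term table:
# keyword -> index values pointing into the URL database.
url_db = [
    UrlRecord("http://example.org/a", outlist={1}),
    UrlRecord("http://example.org/b", inlist={0}),
]
term_index = {"google": [0, 1], "ontology": [1]}

def lookup(term):
    """Return the URL records referenced by a keyword's index entries."""
    return [url_db[i] for i in term_index.get(term.lower(), [])]

print([r.url for r in lookup("Google")])
```

A real engine would compress this table and shard it across machines, but the lookup step itself stays a plain index access, which is what makes keyword search fast.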

Figure 1. URL database with pointers to related information

In the case of keyword-based search, Google guides its users with a set of related keywords to make its use easier. Consider, for example, a search for the founder of Facebook: Google will prompt the user with other related keywords from its database, whereas other search engines, such as Yahoo, do not provide this type of assistance.

Figure 2. Comparison based on keyword search between Google and Yahoo

When a user provides more than a single keyword as a query, Google automatically splits the query into separate keywords, joining them with a plus (+) symbol in place of each space, and searches its database for the resulting query.

B. Page rank

Page ranking is a central part of existing information retrieval [5]; it is maintained through polling (constant monitoring). Google uses PageRank to rank pages based on the idea that information on the web can be ordered in a hierarchy by "link popularity": a page is ranked higher the more links there are to it. PageRank is computed as a probability distribution over collections of documents of different sizes, and the computation involves a series of iterations through which each page's rank is determined; the page that is linked to the most will be ranked higher than other pages. Apart from page ranks, Google offers users a way to customize the search engine by setting a default language, using the SafeSearch filtering technology, and setting the number of results shown on each page [3]. Google enables this customization by placing long-term cookies on users' machines to store these preferences, a tactic that also enables it to track a user's search terms and retain the data for more than a year.

Figure 3. Example of page ranking

A recent enhancement in Google is Instant Search, which reduces each user's search time by 2 to 5 seconds, collectively saving approximately 11 million seconds per hour. For any query, the first 1000 results can be shown, with a maximum of 100 displayed per page, when Instant Search is not enabled; if Instant Search is enabled, only 10 results are displayed, regardless of this setting.



The effectiveness of retrieval is measured through two parameters:

Precision: the ability to retrieve top-ranked documents that are mostly relevant.
Recall: the ability of the search to find all of the relevant items in the corpus.

Problems with precision and recall are: the number of irrelevant documents in the collection is not taken into account; precision is undefined when no document is retrieved; and recall is undefined when there is no relevant document in the collection.
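As a concrete illustration, the two measures and their undefined cases can be computed directly; the document identifiers below are made up for the example:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    Precision is undefined (None) when nothing is retrieved;
    recall is undefined when the collection holds no relevant document.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall

# 3 of 4 retrieved documents are relevant; 3 of 5 relevant ones were found.
p, r = precision_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5"])
print(p, r)  # 0.75 0.6
```

Returning None in the undefined cases, rather than 0, keeps the two failure modes noted above visible instead of silently conflating them with a genuinely poor result.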
Figure 4. Page ranking architecture
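The iterative PageRank computation described in Section B can be sketched as a power-iteration loop. The damping factor of 0.85 follows the original PageRank report [6]; the three-page graph and the function name are illustrative only:

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Iteratively compute PageRank as a probability distribution.

    outlinks maps each page to the pages it links to; ranks sum to 1.
    """
    pages = list(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # uniform starting distribution
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, targets in outlinks.items():
            if targets:                           # spread p's rank over its outlinks
                share = rank[p] / len(targets)
                for t in targets:
                    new[t] += damping * share
            else:                                 # dangling page: spread everywhere
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# A page with more inbound links ("link popularity") ranks higher:
# both A and B link to C, so C comes out on top.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each iteration redistributes every page's current rank along its outgoing links; after enough iterations the values converge to the stationary distribution that the text calls the page's rank.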

The exact percentage of the total web pages that Google indexes is not known, as it is very difficult to calculate accurately. Google not only indexes and caches web pages but also takes "snapshots" of other file types, including PDF, Word documents, Excel spreadsheets, Flash SWF, plain-text files, and so on. Except in the case of text and SWF files, the cached version is a conversion to XHTML, allowing those without the corresponding viewer application to read the file.

C. Web crawler architecture

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a web crawler (sometimes also known as a spider), an automated web browser that follows every link on a site [2]. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries.
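A crawler of the kind shown in Figure 5 reduces to a breadth-first loop that fetches a page, extracts its links, and indexes the contents. In this sketch the network fetch is replaced by a lookup into an in-memory dictionary so the example stays self-contained; a real crawler would issue HTTP requests, respect robots.txt, and normalize URLs:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(start, fetch):
    """Breadth-first crawl: follow every link once, store each page's data."""
    index, queue, seen = {}, deque([start]), {start}
    while queue:
        url = queue.popleft()
        html = fetch(url)                 # a network fetch in a real crawler
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        index[url] = html                 # page data kept for later queries
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Tiny in-memory "web" standing in for real HTTP fetches.
site = {"/a": '<a href="/b">b</a>', "/b": '<a href="/a">back</a> text'}
print(sorted(crawl("/a", site.get)))  # ['/a', '/b']
```

The `seen` set is what guarantees every link is followed only once, even when pages link back to each other.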

D. Inverted index

An inverted index data structure is used to support most IR techniques. Each word in the table has an index entry that specifies the documents containing that word, so a search for documents that contain a word becomes a simple lookup in the index. The list of words may be reduced by using a stop-word list or by stemming [5]: the key terms of a query or document are represented by stems rather than by the original words. Thus "computation" might be stemmed to "comput", provided that different words with the same base meaning are reduced to the same form and words with distinct meanings are kept separate. Stemming provides two advantages: a query can find a document containing different morphological variants of the search term (improved recall), and the reduction in the number of distinct terms needed to represent the corpus reduces processing requirements. The inverted index is largely responsible for the impressive speed and scalability of current search engines. However, for large collections of documents, compiling an inverted index can be a time-consuming task, and fast searching of an inverted index may require large amounts of memory; a large amount of research has therefore been conducted on compressing inverted indexes to reduce these memory requirements.
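A toy version of the inverted index with stop-word removal and stemming might look as follows. The suffix-stripping stemmer is a deliberately crude stand-in for a real algorithm such as Porter's, and the stop-word list and documents are invented for the example:

```python
import re

def stem(word):
    """Crude suffix stripper: 'computation' and 'computing' both
    reduce to 'comput', as in the text's example."""
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

STOP_WORDS = {"the", "a", "of", "and"}

def build_inverted_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index.setdefault(stem(word), set()).add(doc_id)
    return index

docs = {1: "the computation of ranks", 2: "computing page ranks", 3: "ontology"}
index = build_inverted_index(docs)
print(index[stem("computation")])  # {1, 2}: morphological variants match
```

Because "computation" and "computing" share the stem "comput", a query for either word finds both documents 1 and 2, which is exactly the recall improvement the paragraph above describes.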

Figure 5. Web crawler architecture

The effectiveness of retrieval is related to the relevancy of the retrieved items; relevance judgments are subjective, situational, cognitive, and dynamic. Effectiveness is measured through the two parameters of precision and recall, defined above.

E. Drawbacks

The major drawbacks observed in traditional information retrieval systems are as follows. Traditional information retrieval mainly makes use of approaches such as categories, indexes, and keywords; however, these methods cannot reflect the deep meaning of a word, so the retrieval result set is very large and fails to meet the user's requirements. Coverage of retrieval is also limited: one word can have more than one meaning, and several words can



have the same meaning, retrieval coverage will be negatively affected if the user cannot provide all the possible words to be searched. Especially in specialized article retrieval, where there are several different definitions for specialized terms and knowledge, it is difficult for researchers to quickly and correctly find all articles related to a certain research topic among a huge number of technical articles. One way to solve this problem is to improve information retrieval from the traditional keyword-based approach to a knowledge- or concept-based approach, that is, to combine information retrieval with artificial intelligence and natural language technology.

III. INTRODUCTION TO ONTOLOGY

A. The Concept of Ontology

Ontology comes from the domain of philosophy, where it aims at studying the nature and composition of objective things. Some scholars consider an ontology to be a category system for a certain domain of the world that does not depend on any particular language. In knowledge engineering, an ontology is built from the knowledge concepts, terms, and their mutual relationships in a system. Ontology generally falls into four categories: a) top-level ontology, b) domain ontology, c) task ontology, and d) application ontology. Among them, domain ontology covers the concepts and their mutual relationships within an ordinary domain, such as medicine or automobiles [7]. The target of constructing an ontology is to normalize the concepts and terms in one or more domains, and to provide convenience to practical applications in that domain or across several domains.

B. The Function of Ontology

The function of ontology can be summarized as communication, interoperability, and system engineering. 1) Communication: providing common words for communication between people or between organizations, which forms the basis for communication. 2) Interoperability: ontology builds translation and mapping mechanisms among different model construction methods, equations, languages, and software tools, in order to integrate different systems. 3) System engineering: ontology analysis can provide the following benefits to system engineering [4]: a) Reusability: an ontology is a formalized description of important entities, properties, processes, and their mutual relationships; this formalized description can become a reusable and shared component in software systems. b) Knowledge acquisition: when a knowledge-based system is constructed, speed and reliability can be improved if an existing ontology is used as the starting point and basis to guide knowledge acquisition. c) Reliability: as the description of an ontology is formalized, this formal statement makes automatic consistency checking possible, so the software system's reliability is improved. d) Specification: ontology analysis helps determine a system's (such as a knowledge database's) requirements and criteria. An ontology determines the exact meaning of concepts by strictly defining them and the relationships among them, in order to express commonly recognizable and sharable knowledge [8]. It therefore becomes much more convenient for a computer to process information in a domain if that domain's ontology is built by abstracting or summarizing a group of concepts and relating them [5][10].

IV. INTELLIGENT INFORMATION RETRIEVAL SYSTEM MODEL BASED ON DOMAIN ONTOLOGY

A. Ontology knowledge module

The first step in building an intelligent information retrieval system based on ontology is to develop the related domain ontology with the help of specialists in that domain, and then to gather the required data sources and store them in a database (a relational database and a knowledge database).

B. Units retrieval module

As shown in Figure 6, from the retrieval condition input by the user, the retrieval transformer uses the given words together with words of similar or identical meaning, obtained via the retrieval user interface, to build a retrieval set.

Figure 6. Ontology based retrieval module.



For example, when a user retrieves "computer", the engine retrieves all the information related to computers, such as "personal computer" and "microcomputer", from the ontology built for this domain. This enhanced approach makes the search query more precise. What is more, the retrieval transformer is very helpful in improving the retrieval hit rate. For example, when a user inputs "apple", does the user mean the fruit or the brand of Apple PC? Such ambiguity can be clarified by the concept descriptions in the ontology knowledge database.

V. SYSTEM EVALUATION

This model solves the following retrieval problems with the support of the knowledge database.

A. Retrieval hit rate is high

The ontological model limits all possible interpretations of a term to the only reasonable one by adding an ontology knowledge database between the user and the database, which solves the problem of one word having multiple meanings. The model can also adapt to dynamic changes of users and information sources, providing retrieval to users only within the domain.

B. Retrieval coverage rate is broad

In this model, the system can infer a set of words with the same or similar meaning from the user's input retrieval word and use them as the actual retrieval terms, because concept descriptions of equivalence relations, such as synonyms and word abbreviations, are added into the ontology's knowledge database. This method lowers the user's burden and improves the retrieval coverage range.

C. Retrieval inference

An ontology is a conceptual explanation of a certain domain. It makes the terms in the domain form a knowledge system that can express the corresponding meaning logic and can be used for inference, which can efficiently and correctly feed back the information most important to users.

VI. IMPLEMENTATION OF A RELATION-BASED SEARCH ENGINE

This model does not only process the keywords; it also processes the relationships between entities offered by the architecture of the Semantic Web. A page is returned to users only when it includes the required relationship between the keywords, and pages that are related only to the keywords, without the proper relationship, are discarded. The sample screenshots illustrate ontology-based retrieval of information. The user selects the domain of information retrieval; here "train" is chosen as the domain, and related information is provided, supplied during annotation through the sign-in option. Then the list of related sites is listed based on usage (page ranking); the sites are listed in a table, and the user selects the site relevant to them. Screenshot 2 provides the detailed information required.
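A minimal sketch of the ontology-driven retrieval step described above might look as follows, with a hand-built concept table standing in for a real ontology language such as OWL; the concepts, page texts, and function names are all illustrative:

```python
# A toy domain ontology: each concept lists equivalent or narrower terms,
# playing the role of the ontology knowledge database.
ONTOLOGY = {
    "computer": {"computer", "personal computer", "microcomputer", "pc"},
    "train": {"train", "railway", "locomotive"},
}

def expand(term):
    """Retrieval transformer: widen a keyword to its ontology concept set."""
    return ONTOLOGY.get(term.lower(), {term.lower()})

def retrieve(term, pages):
    """Return pages mentioning any term of the expanded concept."""
    terms = expand(term)
    return [url for url, text in pages.items()
            if any(t in text.lower() for t in terms)]

pages = {
    "p1": "Buying a personal computer",
    "p2": "Apple fruit recipes",
    "p3": "Locomotive maintenance",
}
print(retrieve("computer", pages))  # ['p1']: matched via "personal computer"
```

Even this toy expansion shows the coverage benefit claimed above: a query for "computer" reaches a page that never contains the literal query word, while pages outside the concept are left out.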



VII. CONCLUSION

We have presented a framework for an information retrieval system model based on ontology. Initially, information retrieval in Google is analyzed and its drawbacks are pointed out. Then, the concept of ontology and its application in the intelligent retrieval domain are elaborated, together with an information retrieval system structure model based on domain ontology. The idea of combining ontologies and knowledge is the key tool for developing the proposed system. The final result is in the form of a user-friendly direct URL, hence the searching time is considerably reduced. Efficient methods for utilizing the domain knowledge of an ontology to realize concept-based queries during information retrieval are improvised, and the performance of the model is evaluated.

REFERENCES
[1] En.wikipedia.org
[2] www.semanticfocus.com
[3] J. Farrugia, "Model-theoretic semantics for the Web", Budapest, Hungary, May 2003.
[4] Semantic Web Services Ontology, www.daml.org
[5] Zaihisma Che Cob and Rusli Abdullah, "Ontology-based Semantic Web Services Framework for Knowledge Management System", IEEE, 2008.
[6] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the Web", Technical report, Stanford Database Group, 1998.
[7] N. J. Nilsson, Artificial Intelligence: A New Synthesis, Beijing: China Machine Press and Morgan Kaufmann Publishers, Inc., 1999.
[8] Dong Hui, Yang Ning, and Yu Chuanming, "Research on the Ontology-based Retrieval Model of Digital Library (I)", Journal of the China Society for Scientific and Technical Information, 2006(3), pp. 269-275.
[9] Muhammad Shahbaz, Syed Muhammad Ahsen, Farzeen Abbas, Muhammad Shaheen, and Syed Athar Masood, "An Efficient Method to Improve Information Recovery on Web", Journal of American Science, Vol. 7, No. 7, 2011.
[10] Shivani Agarwal and Michael Collins, "Maximum Margin Ranking Algorithms for Information Retrieval", Springer, 2010.
[11] Jianguo Jiang, Zhongxu Wang, Chunyan Liu, Zhiwen Tan, Xiaoze Chen, and Min Li, "The Technology of Intelligent Information Retrieval Based on the Semantic Web", IEEE 2nd International Conference on Signal Processing Systems, 2010.
[12] A. Abusujhon and M. Tamib, "Improving Load Balance and Query Throughput of Distributed IR Systems", International Journal of Computing and ICT Research, Vol. 4, No. 1, June 2010.
[13] Peter D. Turney and Patrick Pantel, "From Frequency to Meaning: Vector Space Models of Semantics", Journal of Artificial Intelligence Research, No. 37, pp. 141-188, 2010.
[14] Abdelkrim Bouramoul, Mohamed-Khireddine Kholladi, and Bich-Lien Doan, "Using Content to Improve the Evaluation of the Information Retrieval System", International Journal of Database Management Systems (IJDMS), Vol. 3, No. 2, May 2011.
[15] Guo Chengxia and Huang Dongmei, "Research on Domain Ontology Based Information Retrieval Model", International Symposium on Intelligent Ubiquitous Computing and Education, 2009.

AUTHORS PROFILE

1. S. Kalarani, M.E., (Ph.D.), Associate Professor, Department of Information Technology, St. Joseph's Institute of Technology, Chennai - 119.
2. Dr. G. V. Uma, Professor / IST, Anna University, Chennai - 26.
