International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 512-518, IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com
A NOVEL METHOD TO SEARCH INFORMATION THROUGH MULTI AGENT SEARCH AND RETRIEVE OPERATION USING CONTENT AND CONTEXT BASED SEARCH
Poonam Ghuli (1), Swapna Rao P (2), Harsha S (3) and Rajashree Shettar (4)
(1) Asst. Prof., Dept. of CSE, R.V. College of Engineering, Bangalore, Karnataka, India
(2) Asst. Prof., Dept. of CSE, Nandi Institution of Technology and Management Sciences, Bangalore, Karnataka, India
(3) Asst. Prof., Dept. of CSE, Vidya Vikas Institute of Engineering and Technology, Mysore, Karnataka, India
(4) Professor, Dept. of CSE, R.V. College of Engineering, Bangalore, Karnataka, India
ABSTRACT

Searching has become tedious because of the enormous growth of online databases and Internet usage. A search for information or files may be carried out on a personal computer or on the Internet. Either way, it consumes time and requires the extra effort of filtering the results, because relevant and irrelevant results are returned together. The aim of this paper is to provide the user with a novel method to search for files and information on both personal computers and the Internet. The paper describes a new search mechanism which accepts two text inputs, processes them according to the chosen domain (desktop search or Internet search), and returns relevant results. The processing of the input includes context based search and content based search using indexing and multi-agents. Content based search is performed on the Hadoop MapReduce framework to improve performance.

Keywords: content search, context search, multi-agents, indexing, Hadoop.

I. INTRODUCTION
Web search and desktop search have become an integral part of daily life. Many applications search for information on the Internet or on personal computers; these are commonly called search engines or desktop search engines. They help users find information or files in enormously large databases. But still
research is being done to improve the efficiency and performance of these search engines. If only keywords are used to access information, the retrieved results contain numerous irrelevant entries, which users must filter out at a later stage, and this consumes time. Irrelevant data is retrieved because the content of the information is not checked against the purpose of the search. If the reason for which the information is sought is also captured, numerous irrelevant results can be removed from the search result.

The proposed system first captures both a search string and a relevance string. It then performs context search and content search with the help of multi-agents. The search results are filtered according to the relevance string specified by the user. The whole process genetically evolves each time to give better results by filtering irrelevant search results. There is a provision to save all the links previously visited by the user in a database, so that these links have the highest priority in the next search result. Because previously visited pages are tracked, the search for relevant data improves with each use.

This process is carried out by multi-agents, which are sets of individual agents put together. An agent is an autonomous entity which performs a specific task. An agent is autonomous, communicative, adaptive and decentralized. Multiple agents performing the same task can be created, or multiple agents performing different tasks can be created. In this paper three different types of agents performing different tasks are created, each in multiple numbers.

Context search means searching for the exact string provided, either in filenames or as a sub-string in web page links.
Content search means searching for relevant content inside the files or web pages by opening them.

II. LITERATURE SURVEY

A lot of research is ongoing in the field of search engines and search result optimization. These search engines use different methodologies or combinations of methodologies. A few such methodologies are mentioned below.

Some search engines are confined to searching a particular type of file, for example video search engines. One such video search engine, which gives refined video results, is discussed in [1]. It accepts text and image inputs from the user to retrieve video results. Video concept detection, detection of SIFT features, multi-modality web categorization and semantic indexing are a few of the mechanisms used to retrieve relevant results. Another search engine, discussed in [2], is Talash, a Hindi search engine which implements three models for query expansion. These models are based on lexical variance, user context and a combination of the two.

Search engines may implement different methods such as context search and indexing. One such example is context based indexing of web documents using ontology [3]. In this work an index of files gathered by crawlers is maintained, and a knowledge base ontology repository, ontology context filtering and ontology ranking are the mechanisms used to build the system.

Content based ranking for search engines is discussed in [4]. The approach ranks the content of a web resource based on the user query, ranking the relevant pages by their content and keywords. The user query is used to retrieve results, and each result is analyzed using keywords and content. A dictionary is built from the identified root words. Each result page is compared with the dictionary using the keywords and content of the page. The matched words are given a weight and the total relevancy is calculated. All pages are ranked in descending order of their total relevancy, which lies between 0 and 1.
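As an illustration only, the ranking idea summarized above for [4] can be sketched in Python. The sketch assumes that the dictionary is simply the set of query words (no stemming to root words is attempted), that every matched word carries a weight of 1, and that the score is normalized by the dictionary size so that it lies between 0 and 1. The weighting scheme, the function names and the sample pages are assumptions made for this example and are not taken from [4].

```python
import re


def tokenize(text):
    """Lower-case a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def relevancy(page_text, dictionary):
    """Fraction of dictionary words that occur in the page (0.0 to 1.0)."""
    if not dictionary:
        return 0.0
    words = set(tokenize(page_text))
    matched = sum(1 for term in dictionary if term in words)
    return matched / len(dictionary)


def rank_pages(pages, query):
    """Return (page, score) pairs in descending order of relevancy."""
    dictionary = set(tokenize(query))
    scored = [(name, relevancy(text, dictionary)) for name, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample pages, used only to exercise the functions.
pages = {
    "page_a": "Hadoop runs map and reduce jobs on a cluster",
    "page_b": "A recipe for tomato soup",
    "page_c": "Introduction to Hadoop",
}
ranking = rank_pages(pages, "hadoop map reduce")
```

Here `page_a` contains all three query words, `page_c` contains one of them and `page_b` contains none, so the pages are ordered `page_a`, `page_c`, `page_b`.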
As discussed earlier, ranking can also be combined with context based indexing to develop an efficient information retrieval system, as mentioned
in [5]. Many systems have been developed using multi-agents and genetic algorithms; specific ones are discussed in [6, 8]. In [6] an agent driven shopping system called GAMA (Genetic Algorithm driven Multi-Agents) is explained. The system helps shoppers through product brokering and negotiation. Shopping agents that adopt a genetic algorithm are created to perform product brokering, and the system simulates the process of purchasing computer hardware. Evolutionary algorithms, multi-agent systems, the interactions of multi-agent systems and the evolutionary approach are explained in the book [8]. Its section on multi-agent systems discusses the characteristics of agents, agent classification and agent architecture. Multi-agent based cryptanalysis techniques have been used to break large file encryption; they use context based search to increase the probability of breaking it. The multi-agent mechanism used in those cryptosystems is adopted here to implement multi-agents. Multi-agents are also used in secured e-shopping using elliptic curve cryptography [9] and in protecting against attacks in elliptic curve cryptography [10].

III. OBJECTIVE AND PROBLEM STATEMENT

The proposed system is a generic framework to optimize search on a standalone system as well as on the Internet. It uses a genetic algorithm for parallel pattern recognition in context search, and a content search using multiple agents to speed up operations on large databases. Most on-line documents are multi-dimensional graphs, and searching and retrieving them is difficult, if not infeasible, using 1D or 2D search algorithms. 3D search algorithms have been proposed; however, they are for databases on the same machine, and they are either content based only or designed for 3D models.
The scope of the proposed system is to provide an optimized search engine which gives fast, accurate and appropriate results by incorporating features such as Genetic Algorithm driven Multi-Agents, an Agent Based Evolutionary Approach and a Search Algorithm for 3-Dimension.

IV. PROPOSED SYSTEM

The proposed system introduces a new mechanism of searching. It is a generic framework which has two models, desktop search and Internet search. Both models use the same methodology with slight differences. The proposed system first performs context search and then performs content search in the documents or web pages being searched. Context search means searching for the exact string provided, either as a filename or as a sub-string in web page links. Content search means searching for the relevance string by opening the files or web pages. The content search process is performed using multi-agents. This mechanism is the same for both modules, i.e. desktop search and Internet search. However, desktop search uses an index file to perform context search, whereas Internet search relies on Google to perform context search. The index file maintained by desktop search is a text file which contains the file paths of the files present in the system. The whole process genetically evolves by keeping track of the search results previously selected by the user for the same strings.

The system architecture shown in figure 4.1 consists of the system configuration on which the proposed model is deployed. The bottom-most layer consists of the operating system. The operating system preferred here is Ubuntu 12.04, which is fast, secure and compatible
with a range of devices. The next layer introduces the Hadoop framework, to take advantage of parallel computing using clusters of commodity hardware and the Hadoop Distributed File System (HDFS). NetBeans IDE is used as the tool to develop the system, using a programming language such as Java or Python. MySQL is the database used to store the user's interactions with the search engine.
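The index file and the desktop context search described in this section can be illustrated with a minimal Python sketch. It assumes that the index holds one file path per line and that a file matches when the search string occurs in its file name; the function names `build_index` and `context_search`, the sample file names and the use of a temporary directory are hypothetical choices made for the example, not details given in the paper.

```python
import os
import tempfile


def build_index(root_dir, index_path):
    """Walk root_dir and write one file path per line to index_path."""
    with open(index_path, "w", encoding="utf-8") as index:
        for folder, _subdirs, files in os.walk(root_dir):
            for name in sorted(files):
                index.write(os.path.join(folder, name) + "\n")


def context_search(index_path, search_string):
    """Return indexed paths whose file name contains search_string."""
    matches = []
    with open(index_path, encoding="utf-8") as index:
        for line in index:
            path = line.rstrip("\n")
            if search_string in os.path.basename(path):
                matches.append(path)
    return matches


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as index_dir:
    for name in ("report_2013.txt", "notes.txt", "annual_report.doc"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write("sample")
    index_file = os.path.join(index_dir, "index.txt")
    build_index(root, index_file)
    found = sorted(os.path.basename(p) for p in context_search(index_file, "report"))
```

In this run `found` holds the two file names containing "report". The matched paths would then be passed, together with the relevance string, to the content search step.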
Figure 4.1 System Architecture for multi-agent search based on context and content search
Figure 4.2 Flow of mechanism of context and content search system using multi-agents
Figure 4.2 shows the flow of the entire system, which is composed of different sub-processes that together produce relevant and optimal search results. The user provides two strings, the search string and the relevance string. Next the user chooses the domain in which the search is to be performed, which is either standalone system search or Internet search.

If the user chooses desktop search, the desktop search mechanism follows. In this process the search string is compared with the index file. The index is a text file holding the file paths of all files present on the standalone system. The matched file paths are provided as search results. These results and the relevance string are the inputs to the Hadoop map job. The map job outputs the word frequency of the relevance string. In the reduce job, if the word frequency of the relevance string is sufficiently high, the respective filename with its path is selected as a search result.

If the user chooses Internet search, the Internet search mechanism follows. In this process the search string is provided to the Google search engine and the results are retrieved. Content search is then performed on these results; before this, the respective agents are created and their tasks are assigned. Agents are created using multithreading. A thread is created for each page link present on the result page; these threads are named Agent3. Each Agent3 thread creates 10 more threads, one for each link present on its result page; these are named Agent2. Each Agent2 thread creates another thread, which opens the link, searches the opened page for string2 (the relevance string) and returns the link if it is relevant. The major part of the content search process is performed on Hadoop. The links and the relevance string are the input to the map job, which provides the word frequency of the relevance string as output.
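The agent hierarchy described above can be sketched with Python threads. This is a simplified illustration under several assumptions: the web is replaced by a hypothetical in-memory dictionary (`FAKE_WEB`) so that no network access is needed, each result page holds two links instead of ten, a link counts as relevant when the relevance string occurs in its text, and the innermost thread is called `worker` because the paper does not name it.

```python
import threading

# Hypothetical stand-in for the web: link -> page text (no network access).
FAKE_WEB = {
    "http://a.example/1": "Hadoop map reduce tutorial",
    "http://a.example/2": "Cooking with tomatoes",
    "http://b.example/1": "Scaling search with Hadoop clusters",
    "http://b.example/2": "Football results",
}


def open_link(link):
    """Pretend to open a link and return its text."""
    return FAKE_WEB.get(link, "")


def run_all(threads):
    """Start every thread, then wait for all of them to finish."""
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def worker(link, relevance_string, relevant, lock):
    """Innermost thread: open the link and keep it if it is relevant."""
    if relevance_string.lower() in open_link(link).lower():
        with lock:
            relevant.append(link)


def agent2(link, relevance_string, relevant, lock):
    """Agent2: one per link; creates the thread that opens the link."""
    run_all([threading.Thread(target=worker,
                              args=(link, relevance_string, relevant, lock))])


def agent3(page_links, relevance_string, relevant, lock):
    """Agent3: one per result page; creates one Agent2 per link on the page."""
    run_all([threading.Thread(target=agent2,
                              args=(link, relevance_string, relevant, lock))
             for link in page_links])


def content_search(result_pages, relevance_string):
    """Create one Agent3 per result page and collect the relevant links."""
    relevant, lock = [], threading.Lock()
    run_all([threading.Thread(target=agent3,
                              args=(links, relevance_string, relevant, lock))
             for links in result_pages])
    return sorted(relevant)


result_pages = [
    ["http://a.example/1", "http://a.example/2"],
    ["http://b.example/1", "http://b.example/2"],
]
relevant_links = content_search(result_pages, "hadoop")
```

The shared list is protected by a lock because the innermost threads may append to it concurrently; the result is sorted only to make the output order deterministic.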
The links, the relevance string and the word frequency of the relevance string are the inputs to the reduce job. If the word frequency of the relevance string is sufficiently high, the respective link is output. If the search string and relevance string pair already has an entry in the database, the stored search result is merged with the present search result. All the links hit by the user are stored in the database to improve the result list in forthcoming transactions. Figure 4.3 depicts how the different sub-processes communicate with each other internally.
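The map, reduce and merge steps described above can be sketched in plain Python. The sketch does not use the Hadoop API; it only imitates the data flow. The threshold `min_frequency` is an assumption, since the paper does not state how high the word frequency must be, and the in-memory `history` dictionary is a hypothetical stand-in for the MySQL table of user hit links.

```python
import re
from collections import Counter


def map_job(item, text, relevance_string):
    """Map: emit (item, frequency of the relevance-string words in text)."""
    terms = set(re.findall(r"\w+", relevance_string.lower()))
    counts = Counter(re.findall(r"\w+", text.lower()))
    return item, sum(counts[term] for term in terms)


def reduce_job(mapped, min_frequency):
    """Reduce: keep the items whose word frequency reaches min_frequency."""
    return [item for item, frequency in mapped if frequency >= min_frequency]


def merge_with_history(current, previous_hits):
    """Put previously selected links first, then the remaining new results."""
    merged = list(previous_hits)
    merged += [link for link in current if link not in previous_hits]
    return merged


# Hypothetical links and page texts.
documents = {
    "link1": "Hadoop cluster setup. Hadoop tuning tips.",
    "link2": "Gardening tips for spring",
    "link3": "Cluster computing overview",
}
search_string, relevance_string = "big data", "hadoop cluster"

mapped = [map_job(link, text, relevance_string) for link, text in documents.items()]
filtered = reduce_job(mapped, min_frequency=1)

# Links the user selected earlier for the same pair of strings (assumed data).
history = {("big data", "hadoop cluster"): ["link3", "link0"]}
final = merge_with_history(filtered, history.get((search_string, relevance_string), []))
```

Placing the stored links first reflects the statement in the introduction that previously visited links have the highest priority in the next search result.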
Figure 4.3 The overall structure of communication between sub-processes

The various components are described in this section. In context search, the search string is used to search for files in desktop search or for links in Internet search. In content search, the relevance string is used to search within files in desktop search or within links in Internet search. In the preprocessing step, updating the index file whenever a new drive is
mounted is considered preprocessing in the desktop search model, while searching, retrieving and updating the database is considered preprocessing in the Internet search model. The agent creation process creates a specified number of agents at different levels of the content search process. Agents are created using multithreading, and each thread performs a specific task according to its specification. The three dimensional content search process is performed by multiple agents. The dimensions considered here are pages, links and contents. A set of agents is created for each of these dimensions, and those agents are responsible for the search in that dimension. Depending on the relevance of the files or links to the relevance string provided, a filtering process is performed on the files or links. Filtering removes irrelevant file paths or web links and keeps only the files and links whose content is related to the relevance string. Search results are always stored in and retrieved from the database. Instead of storing all search results, only the links hit by the user are stored and updated on each search, which makes the entire process genetic. Because the results are refined on each search, the system exhibits a learning process.

V. CONCLUSION AND FUTURE ENHANCEMENTS

In this paper two models are discussed. The first model is desktop search, which searches for files on the standalone system by maintaining an index file. The second model is Internet search, which searches for documents or links on the Internet with the help of Google™. This model optimizes the Google results based on the relevance of the search conducted. In the proposed system, 100 different pairs of search string and relevance string were provided to the Internet module. All links in the search results were relevant to the relevance string provided by the user.
About 95% of the search results consisted of links needed by the user, and these links appeared within the top 5 links of the search result. When 50 different pairs of search string and relevance string were provided to the desktop module, it returned exactly the required result in 100% of cases. For example, even if 10 files with the same name are retrieved in the context search process, only the files whose content matches the relevance string are part of the output.

The proposed system can be extended to support the following functionality. The system currently optimizes results from the Google search engine alone, but it can be extended to other search engines. The desktop search module is proposed only for a standalone system; it can be enhanced to cover other systems connected in a LAN. Filtering of the strings provided to the search engine can be done based on the meaning of the words, the phrases used and the frequency of the words used. The ontological meaning of the strings provided to search engines can be analyzed and filtered, and this analysis helps in the storage and retrieval of results. The system can also be enhanced for use in social networking applications.

REFERENCES

1. S. Gomathy, K.P. Deepa, T. Revathi & L. Maria Michael Visuwasam. Genre Specific Classification for Information Search and Multimodal Semantic Indexing for Data Retrieval. The Standard International Journals Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 1, No. 1, March-April 2013.
2. Nandkishor Vasnik, Shriya Sahu, Devshri Roy. Talash: A semantic and context based optimized hindi search engine. International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol. 2, No. 3, June 2012.
3. Priyanka Saxena, Nidhi Tyagi & Dr. M.P. Yadav, Manik Chandra Pandey. Context Based Indexing of Web Document using Ontology. International Conference on Recent Trends in Engineering & Technology (ICRTET2012), ISBN: 978-81-925922-06.
4. P. Sudhakar, G. Poonkuzhali, R. Kishore Kumar, Member IAENG. Content Based Ranking for Search Engines. International Multi-conference of Engineers and Computer Scientists 2012, Vol. I, IMECS, March 12-14, 2012, Hong Kong. ISBN-978988-19251-1-4.
5. Sunita Rani, Vinod Jain & Geetanjali Gandhi. Context Based Indexing and Ranking in Information Retrieval Systems. International Journal of Computer Science and Management Research, Vol. 2, Issue 4, April 2013, ISSN 2278-733X.
6. Dr. Magda B. Fayek, Dr. Ihab A. Talkhan and Khalil S. El-Masry. GAMA (Genetic Algorithm driven Multi-Agents) for E-Commerce Integrative Negotiation. GECCO'09, July 8-12, 2009, Montréal, Québec, Canada.
7. Wooldridge M. An Introduction to Multi-Agent Systems. New York, John Wiley & Sons (2002).
8. Ruhul Amin Sarkar and Tapabrata Ray. Agent Based Evolutionary Search. Adaptation, Learning and Optimization, Volume 5. Springer-Verlag Berlin Heidelberg 2010. ISBN: 978-3-642-13424-8.
9. Sougata Khatua, Arijit Das, Zhang Yuheng, LI Li and N.Ch.S.N. Iyengar. Agent Based secured e-shopping Using Elliptic Curve Cryptography. International Journal of Advanced Science and Technology, Vol. 38, January 2012.
10. Xu Huang, Pritam Gajkumar Shah and Dharmendra Sharma. Multi-Agent System Protecting from Attacking in Elliptic Curve Cryptography. G. Phillips-Wren et al. (Eds.), Advances in Intel. Decision Technologies, SIST 4, pp. 123-131. Springer-Verlag Berlin, Heidelberg 2010.
11. Shruti V Kamath, Mayank Darbari and Dr. Rajashree Shettar. Content Based Indexing and Retrival From Vehicle Surveillance Videos using Gaussian Mixture Model. International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 420-429, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
12. Shraddha Chaurasia and Lalit Dole. Secure Masid: Secure Multi-Agent System for Intrusion Detection. International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 392-397, ISSN Print: 0976-6367, ISSN Online: 0976-6375.