
Improving Web Search Result Using Cluster Analysis

A thesis Submitted by

Biswapratap Singh Sahoo
in partial fulfillment for the award of the degree of

MASTER OF SCIENCE
IN

COMPUTER SCIENCE

Supervisor: Dr. R. C. Balabantaray

UTKAL UNIVERSITY: ODISHA


JUNE 2010

Copyright by Biswapratap Singh Sahoo, June 2010

Contents
Declaration
Abstract
Dedication
Acknowledgements
List of Figures

Chapter 1  Introduction
1.1  Motivation
1.1.1  From organic to mineral memory
1.1.2  The problem of abundance
1.1.3  Information retrieval and Web search
1.1.4  Web search and Web crawling
1.1.5  Why is the Web so popular now?
1.1.6  Search Engine System Architecture
1.1.7  Overview of Information Retrieval
1.1.8  Evaluation in IR
1.1.9  Methods for IR

Chapter 2  Related works
2.1  Search Engine Tools
2.1.1  Web Crawlers
2.1.2  How the Web Crawler Works
2.1.3  Overview of data clustering
2.2  An example information retrieval problem

Chapter 3  Implementation Details
3.1  Determining the user terms
3.1.1  Tokenization
3.1.2  Processing Boolean queries
3.1.3  Schematic Representation of Our Approach
3.1.4  Methodology of Our Proposed Model
3.2  Our Proposed Model Tool
3.2.1  Cluster Processor
3.2.2  DB/Cluster
3.3  Working Methodology

Chapter 4  Future Work and Conclusions
4.1  Future Work
4.2  Conclusion

Appendices
References and Bibliography
Annexure: Biswapratap Singh Sahoo, "A Modern Approach to Search Engine Using Cluster Analysis," National Seminar on Computer Security: Issues and Challenges, 13th & 14th February 2010, held at PJ College of Management & Technology, Bhubaneswar, sponsored by the All India Council for Technical Education, New Delhi, page 27.

DECLARATION
I, Sri Biswapratap Singh Sahoo, do hereby declare that this thesis entitled Improving Web Search Result Using Cluster Analysis, submitted to Utkal University, Bhubaneswar for the award of the degree of Master of Science in Computer Science, is an original piece of work done by me and has not been submitted for the award of any degree or diploma at any other university. Any help or source of information which has been availed of in this connection is duly acknowledged.

Date: Place:

Biswapratap Singh Sahoo
Researcher

Abstract
The key factors in the success of the World Wide Web are its large size and the lack of centralized control over its contents. Both are also the most important sources of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is highly skewed, and interesting pages are scarce in comparison with the rest of the content. Search engines have changed the way people access and discover knowledge, allowing information on almost any subject to be quickly and easily retrieved within seconds. As increasingly more material becomes available electronically, the influence of search engines on our lives will continue to grow. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. In this thesis we give an overview of current web search engine design and propose a model based on cluster analysis. We introduce a new meta search engine which dynamically groups the search results into clusters labeled by phrases extracted from the snippets. There has been less research on cluster analysis using user terminology rather than document keywords: until log files of web sites were made available, it was difficult to accumulate enough exact user searches to make a cluster analysis feasible. Another limitation in using searcher terms is that most users of the Internet employ short (one to two word) queries (Jansen et al., 1998). Wu et al. (2001) used queries as a basis for clustering documents selected by searchers in response to similar queries. This thesis reports on an experimental search engine based on a cluster analysis of user text for quality information.

To my parents and to all my teachers both formal and informal

Serendipity is too important to be left to chance.

Acknowledgements
What you are is a consequence of whom you interact with, but just saying "thanks everyone for everything" would be wasting this opportunity. I have been very lucky to interact with truly great people, even if at times I am able to understand only a small fraction of what they have to teach me. I am sincerely grateful for the support given by my advisor, Dr. R. C. Balabantaray, during this thesis. The comments I received from my advisor during the review process were also very helpful and detailed. It is my pleasure and good fortune to express my profound sense of gratitude and indebtedness towards him for the inspiring guidance, unceasing encouragement and, above all, critical insight that have gone into the eventual fruition of this work. His blessings and inspiration helped me to stand in this new and most challenging field of Information Retrieval. This thesis is just a step on a very long road. I want to thank the professors I met during this study; in particular, I take this opportunity to extend my thanks to Prof. S. Prasad and Er. S. K. Naik for their continuous encouragement throughout the entire course of the work. I am also thankful to each and every staff member of Spintronic Technology & Advance Research, Bhubaneswar, for their cooperation and help from time to time. I would say at the end that I owe everything to my parents, but that would imply that they also owe everything to their parents and so on, creating an infinite recursion that is outside the context of this work. Therefore, I thank Dr. Balabantaray for being with me even from before the beginning, sometimes giving everything he has and more; I need no calculation to say that he has given me the best guidance. Thank you.

List of Figures
Figure 1.1: Cyclic architecture for search engines, showing how different components can use the information generated by the other components.
Figure 1.2: Architecture of a simple Web Search Engine.
Figure 2.1: A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
Figure 2.2: Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
Figure 2.3: The two parts of an inverted index.
Figure 3.1: A Schematic Model of Our Approach.

Chapter 1 Introduction
The World Wide Web (WWW) has seen a tremendous increase in size in the last two decades, as well as in the number of new users inexperienced in the art of web search [1]. The amount of information and resources available on the WWW today has grown exponentially, and almost any kind of information is present if the user looks long enough. In order to find relevant pages, a user has to browse through many WWW sites that may contain the information. Users may either browse the pages through entry points such as the popular portals (Google, Yahoo, MSN, AOL, etc.) or submit queries to a search engine to look for specific information. Beginning the search from one of the entry points is not always the best approach, since there is no particular organized structure for the WWW, and not all pages are reachable from others. In the case of using a search engine, a user submits a query, typically a list of keywords, and the search engine returns a list of the web pages that may be relevant according to the keywords. In order to achieve this, the search engine has to search its already existing index of all web pages for the relevant ones. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, but very little academic research has been done on search engines themselves. Search engines are constantly engaged in the task of crawling through the WWW for the purpose of indexing. When a user submits keywords for search, the search engine selects and ranks the documents from its index. The task of ranking the documents, according to some predefined criteria, falls under the responsibilities of the ranking algorithms. A good search engine should present relevant

documents higher in the ranking, with less relevant documents following them. A crawler for a large search engine has to address two issues. First, it has to have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture that can download a large number of pages per second from WWW.

1.1 Motivation

1.1.1 From organic to mineral memory


As we mentioned before, finding relevant information in mixed results is a time-consuming task. In this context we introduce a simple high-precision information retrieval system that clusters and re-ranks retrieval results with the intention of eliminating these shortcomings. The proposed architecture has some key features. Simple and high performance: our experimental results (Section 4) show that it is almost 79 percent better than the best known standard Persian retrieval systems [1, 2, 18]. Independent of the initial system architecture: it can be embedded in any existing information retrieval system, which makes the proposed architecture a very good fit for web search engines. High precision: relevant documents appear at the top of the result list.

We have three types of memory. The first one is organic, which is the memory made of flesh and blood and the one administered by our brain. The second is mineral, and in this sense mankind has known two kinds of mineral memory: millennia ago, this was the memory represented by clay tablets and obelisks, pretty well known in this country, on which people carved their texts. However, this second type is also the electronic memory of today's computers, based upon silicon. We have also known another kind of memory, the vegetal one, the one represented by the first papyruses, again well known in this country, and then by books, made of paper.

The World Wide Web, a vast mineral memory, has become in a few years the largest cultural endeavor of all time, equivalent in importance to the first Library of Alexandria. How was the ancient library created? This is one version of the story:

By decree of Ptolemy III of Egypt, all visitors to the city were required to surrender all books and scrolls in their possession; these writings were then swiftly copied by official scribes. The originals were put into the Library, and the copies were delivered to the previous owners. While encroaching on the rights of the traveler or merchant, it also helped to create a reservoir of books in the relatively new city.

The main difference between the Library of Alexandria and the Web is not that one was vegetal, made of scrolls and ink, and the other one is mineral, made of cables and digital signals. The main difference is that while in the Library books were copied by hand, most of the information on the Web has been reviewed only once, by its author, at the time of writing.

Also, modern mineral memory allows fast reproduction of the work, with no human effort. The cost of disseminating content is lower due to new technologies, and has been decreasing substantially from oral tradition to writing, and then from printing and the press to electronic communications. This has generated much more information than we can handle.

1.1.2 The problem of abundance


The signal-to-noise ratio of the products of human culture is remarkably low: mass media, including the press, radio and cable networks, provide strong evidence of this phenomenon every day, as do more small-scale activities such as browsing a book store or having a conversation. The average modern working day consists of dealing with 46 phone calls, 15 internal memos, 19 items of external post and 22 emails. We live in an era of information explosion, with information being measured in exabytes (10^18 bytes): print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002, and new stored information is estimated to have grown about 30% a year between 1999 and 2002. Information flowing through electronic channels (telephone, radio, TV, and the Internet) contained almost 18 exabytes of new information in 2002, three and a half times more than was recorded in storage media. The World Wide Web contains about 170 terabytes of information on its surface. At the dawn of the World Wide Web, finding information was done mainly by scanning through lists of links collected and sorted by humans according to some criteria. Automated Web search engines were not needed when Web pages were counted only by thousands, and most directories of the Web included a prominent button to add a new Web page; web site administrators were encouraged to submit their sites. Today, URLs of new pages are no longer a scarce resource, as there are thousands of millions of Web pages. The main problem search engines have to deal with is the size and rate of change of the Web, with no search engine indexing more than one third of the publicly available Web. As the number of pages grows, it will be increasingly important to focus on the most valuable pages, as no search engine will be capable of indexing the complete Web. Moreover, in this thesis we state that the number of Web pages is essentially infinite; this makes this area even more relevant.

1.1.3 Information retrieval and Web search


Information Retrieval (IR) is the area of computer science concerned with retrieving information about a subject from a collection of data objects. This is not the same as Data Retrieval, which in the context of documents consists mainly in determining which documents of a collection contain the keywords of a user query. Information Retrieval deals with satisfying a user need: ... the IR system must somehow interpret the contents of the information items (documents) in a collection and rank them according to a degree of relevance to the user query. This interpretation of document content involves extracting syntactic and semantic information from the document text ...

Although there was an important body of Information Retrieval techniques published before the invention of the World Wide Web, there are unique characteristics of the Web that made them unsuitable or insufficient. A survey by Arasu et al. on searching the Web notes that: IR algorithms were developed for relatively small and coherent collections such as newspaper articles or book catalogs in a (physical) library. The Web, on the other hand, is massive, much less coherent, changes more rapidly, and is spread over

geographically distributed computers ...

This idea is also present in a survey about Web search by Brooks, which states that a distinction could be made between the closed Web, which comprises high-quality controlled collections that a search engine can fully trust, and the open Web, which includes the vast majority of web pages and on which traditional IR techniques, concepts and methods are challenged. One of the main challenges the open Web poses to search engines is search engine spamming, i.e., malicious attempts to get an undeserved high ranking in the results. This has created a whole branch of Information Retrieval called adversarial IR, which is concerned with retrieving information from collections in which a subset of the collection has been manipulated to influence the algorithms. For instance, the vector space model for documents and the TF-IDF similarity measure are useful for identifying which documents in a collection are relevant in terms of a set of keywords provided by the user. However, this scheme can be easily defeated in the open Web by just adding frequently-asked query terms to Web pages. A solution to this problem is to use the hypertext structure of the Web, using links between pages as citations are used in academic literature to find the most important papers in an area. Link analysis, which is often not possible in traditional information repositories but is quite natural on the Web, can be used to exploit links and extract useful information from them, but this has to be done carefully, as in the case of PageRank:

Unlike academic papers, which are scrupulously reviewed, web pages proliferate free of quality control or publishing costs. With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit-seeking ventures, attention-getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation.

The low cost of publishing on the open Web is a key part of its success, but it implies that searching for information on the Web will always be inherently more difficult than searching for information in traditional, closed repositories.

1.1.4 Web search and Web crawling


The typical design of search engines is a cascade, in which a Web crawler creates a collection which is indexed and searched. Most of the designs of search engines consider the Web crawler as just a first stage in Web search, with little feedback from the ranking algorithms to the crawling process. This is a cascade model, in which operations are executed in strict order: first crawling, then indexing, and then searching. Our approach is to provide the crawler with access to all the information about the collection to guide the crawling process effectively. This can be taken one step further, as there are tools available for dealing with all the possible interactions between the modules of a search engine, as shown in Figure 1.1

Figure 1.1: Cyclic architecture for search engines, showing how different components can use the information generated by the other components.

The typical cascade model is depicted with thick arrows. The indexing module can help the Web crawler by providing information about the ranking of pages, so the crawler can be more selective and try to collect important pages first. The searching process, through log file analysis or other techniques, is a source of optimizations for the index, and can also help the crawler by determining the active set of pages which are actually seen by users. Finally, the Web crawler could provide on-demand crawling services for search engines. All of these interactions are possible if we conceive the search engine as a whole from the very beginning.

1.1.5 Why is the Web so popular now?


Commercial developers noticed the potential of the web as a communications and marketing tool when graphical Web browsers broke onto the Internet scene (Mosaic, the precursor to Netscape Navigator, was the first popular web browser), making the Internet, and specifically the Web, "user friendly." As more sites were developed, the more popular the browser became as an interface for the Web, which spurred more Web use, more Web development, and so on. Now graphical web browsers are powerful, easy and fun to use, and incorporate many "extra" features such as news and mail readers. The nature of the Web itself invites user interaction; web sites are composed of hypertext documents, which means they are linked to one another. The user can choose his or her own path by selecting predefined "links". Since hypertext documents are not organized in an arrangement which requires the user to access the pages sequentially, users really like the ability to choose what they will see next and the chance to interact with the site contents.

1.1.6 Search Engine System Architecture


This section provides an overview of how the whole system of a search engine works. The major functions of a search engine (crawling, indexing and searching) are also covered in detail in the later sections. Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. A Web crawler is a program which automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing. Other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, worms, etc. Crawlers are automated programs that follow the links found on web pages.

There is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server then compresses and stores the web pages in a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URL Resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by doc ID, and resorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of word IDs and offsets into the inverted index. A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the PageRanks to answer queries (Brin and Page 1998).
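To make the sorting step above concrete, the following minimal sketch (the class names and the simplified hit record are ours, not the actual data layout of any production engine) inverts a forward index keyed by doc ID into one keyed by word ID, with each postings list kept in doc ID order:

    import java.util.*;

    // A simplified "hit": one occurrence of a word in a document.
    // Real hits also record an approximation of font size and capitalization.
    class Hit {
        int docId, wordId, position;
        Hit(int docId, int wordId, int position) {
            this.docId = docId; this.wordId = wordId; this.position = position;
        }
    }

    class SorterSketch {
        // Forward index: docId -> hits of that document. Inverted index: wordId -> hits.
        static Map<Integer, List<Hit>> invert(Map<Integer, List<Hit>> forwardIndex) {
            Map<Integer, List<Hit>> inverted = new TreeMap<Integer, List<Hit>>();
            for (List<Hit> docHits : forwardIndex.values()) {
                for (Hit h : docHits) {
                    List<Hit> postings = inverted.get(h.wordId);
                    if (postings == null) {
                        postings = new ArrayList<Hit>();
                        inverted.put(h.wordId, postings);
                    }
                    postings.add(h);
                }
            }
            // Keep each postings list sorted by doc ID for efficient query processing.
            for (List<Hit> postings : inverted.values()) {
                Collections.sort(postings, new Comparator<Hit>() {
                    public int compare(Hit a, Hit b) { return a.docId - b.docId; }
                });
            }
            return inverted;
        }
    }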

Figure 1.2 Architecture of a simple Web Search Engine

Figure 1.2 illustrates the architecture of a simple WWW search engine. In general, a search engine usually consists of three major modules: a) information gathering, b) data extraction and indexing, and c) document ranking. Retrieval systems generally look at each document as a unique entity when assigning a rank. If the document is instead viewed as a combination of other related documents in the query area, we can obtain better results. The conjecture that relevant documents tend to cluster was made by [26]. Irrelevant documents may share many terms with relevant documents while being about completely different topics, so they may still exhibit some patterns. On the other hand, an irrelevant cluster can be viewed as the retrieval result for a different query that shares many terms with the original query. Xu et al. believe that document clustering can make mistakes and that, when this happens, it adds more noise to the query expansion process. But as we discuss, document clustering is a good tool for high-precision information retrieval systems. In this context we propose an architecture (Fig. 3.1) to cluster search results and re-rank them based on cluster analysis. Although our benchmark is in the Persian language, we believe that similar results should be obtained on other benchmarks.

1.1.7 Overview of Information Retrieval


People have the ability to understand abstract meanings that are conveyed by natural language. This is why reference librarians are useful; they can talk to a library patron about her information needs and then find the documents that are relevant. The challenge of information retrieval is to mimic this interaction, replacing the librarian with an automated system. This task is difficult because the machine

understanding of natural language is, in the general case, still an open

research problem. More formally, the field of Information Retrieval (IR) is concerned with the retrieval of information content that is relevant to a user's information needs (Frakes 1992). Information Retrieval is often regarded as synonymous with document retrieval and text retrieval, though many IR systems also retrieve pictures, audio or other types of non-textual information. The word document is used here to include not just text documents, but any clump of information. Document retrieval subsumes two related activities: indexing and searching (Sparck Jones 1997). Indexing refers to the way documents, i.e. information to be retrieved, and queries, i.e. statements of a user's information needs, are represented for retrieval purposes. Searching refers to the process whereby queries are used to produce a set of documents that are relevant to the query. Relevance here means simply that the documents are about the same topic as the query, as would be determined by a human judge. Relevance is an inherently fuzzy concept, and documents can be more or less relevant to a given query. This fuzziness puts IR in opposition to Data Retrieval, which uses deductive and Boolean logic to find documents that completely match a query (van Rijsbergen 1979).

1.1.8 Evaluation in IR
Information retrieval algorithms are usually evaluated in terms of relevance to a given query, which is an arduous task considering that relevance judgements must be made by a human for each document retrieved. The Text REtrieval Conference (TREC) provides a forum for pooling resources to evaluate text retrieval algorithms. Document corpora are chosen from naturally occurring collections such as the

Congressional Record and the Wall Street Journal. Queries are created by searching corpora for topics of interest, and then selecting queries that

have a decent number of documents relevant to that topic. Queries and corpora are distributed to participants, who use their algorithms to return ranked lists of documents related to the given queries. These documents are then evaluated for relevance by the same person who wrote the query (Voorhees 1999).

This evaluation method is based on two assumptions. First, it assumes that relevance to a query is the right criterion on which to judge a retrieval system. Other factors, such as the quality of the documents returned, whether the document was already known, the effort required to find a document, and whether the query actually represented the user's true information needs, are not considered. This assumption is controversial in the field. One alternative that has been proposed is to determine the overall utility of documents retrieved during a normal task (Cooper 1973).

Users would be asked how many dollars (or other units of utility) each contact with a document was worth. The answer could be positive, zero, or negative depending on the experience. Utility would therefore be defined as any subjective value a document gives the user, regardless of why the document is valuable. The second assumption inherent in the evaluation method used in TREC is that the queries tested are representative of queries that will be performed during actual use. This is not necessarily a valid assumption, since queries that are not well represented by documents in the corpus are explicitly removed from consideration. These two assumptions can be summarized as follows: if a retrieval system returns no documents that meet a user's information needs, it is not considered the fault of the system so long as the failure is due either to poor query construction or to poor documents in the corpus.

1.1.9 Methods for IR


There are many different methods for both indexing and retrieval, and a full description is out of the scope of this thesis. However, a few broad categories will be described to give a feel for the range of methods that exist.

Vector-space model. The vector-space model represents queries and documents as vectors, where indexing terms are regarded as the coordinates of a multidimensional information space (Salton 1975). Terms can be words from the document or query itself or picked from a controlled list of topics. Relevance is represented by the distance of a query vector to a document vector within this information space.
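One common instantiation scores relevance by the angle between the vectors rather than their raw distance. The following minimal sketch (class and method names are ours) computes that cosine similarity between a query and a document represented as term-weight maps; the weighting scheme, typically TF-IDF in practice, is left abstract here:

    import java.util.*;

    class VectorSpaceSketch {
        // Cosine similarity between two term-weight vectors (term -> weight).
        static double cosine(Map<String, Double> query, Map<String, Double> doc) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (Map.Entry<String, Double> e : query.entrySet()) {
                Double w = doc.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // only shared terms contribute
                qNorm += e.getValue() * e.getValue();
            }
            for (double w : doc.values()) dNorm += w * w;
            if (qNorm == 0 || dNorm == 0) return 0;
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }
    }

Documents would then be ranked for a query in decreasing order of this score.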

Probabilistic model. The probabilistic model views IR as the attempt to rank documents in order of the probability that, given a query, the document will be useful (van Rijsbergen 1979). These models rely on relevance feedback: a list of documents that have already been annotated by the user as relevant or non-relevant to the query. With this information and the simplifying assumption that terms in a document are independent, an assessment can be made about which terms make a document more or less likely to be useful.

Natural language processing model. Most of the other approaches described are tricks to retrieve relevant documents without requiring the computer to understand the contents of a document in any deep way. Natural Language Processing (NLP) does not shirk this job, and attempts to parse naturally occurring language into representations of abstract meanings. The conceptual models of queries and documents can then be compared directly (Rau 1988).

Knowledge-based approaches. Sometimes knowledge about a particular domain can be used to aid retrieval. For example, an expert system might retrieve documents on diseases based on a list of symptoms. Such a system would rely on knowledge from the medical domain to make a diagnosis and retrieve the appropriate documents. Other domains may have additional structure that can be leveraged. For example, links between web pages have been used to identify authorities on a particular topic (Chakrabarti 1999).

Data Fusion. Data fusion is a meta-technique whereby several algorithms, indexing methods and search methods are used to produce different sets of relevant documents. The results are then combined in some form of voting to produce an overall best set of documents (Lee 1995). The Savant system described in Chapter 2.7 is an example of a data fusion IR system.

Chapter 2 Related works


Using some kind of document clustering technique to improve retrieval results is not new, although we believe we are the first to explicitly present and deal with the low-precision problem in terms of clustering search results. Many research efforts such as [10] have been made on how to solve the keyword barrier, which exists because there is no perfect correlation between matching words and intended meaning. [9] presents TermRank, a variation of the PageRank algorithm based on a relational graph representation of the content of web document collections. Search result clustering has successfully served this purpose in both commercial and scientific systems [30, 10, 23, 16, 25, 33]. The proposed methods focus on separating search results into meaningful groups that the user can browse and view. One of the first approaches to search results clustering, called Suffix Tree Clustering (STC), groups documents according to common phrases [13]. STC has two key features: the use of phrases and a simple cluster definition. This is very important when attempting to describe the contents of a cluster. [12] proposes a new approach for web search result clustering to improve on the performance of approaches that use the earlier STC algorithms. Search results clustering has a few interesting characteristics, one of them being the fact that it is based only on document snippets. Document snippets returned by search engines are usually very short and noisy. Another shortcoming of these systems is the cluster names: a cluster name must accurately and concisely describe the contents of the cluster, so that the user can quickly decide whether the cluster is interesting or not. This aspect of these systems is difficult and sometimes neglected [7, 12]. In this context, our aim is to provide a very simple high-precision system based on the cluster hypothesis [16] without any user feedback.

Document clustering can be performed, in advance, on the collection as a whole (static clustering) [7, 15], but post-retrieval document clustering (dynamic clustering) has been shown to produce superior results [10, 8]. Tombros et al. [14] conducted a number of experiments using five document collections and four hierarchic clustering methods to show that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase retrieval effectiveness compared both to static clustering and to a conventional inverted file search. The actual effectiveness of hierarchic clustering can be gauged by cluster-based retrieval strategies, which perform a ranking of clusters instead of individual documents in response to each query [13]. The generation of precision-recall graphs is thus not possible in such systems, and in order to derive an evaluation function for clustering systems an effectiveness function was proposed by [13]. In this thesis, firstly, we propose a simple architecture which uses local cluster analysis to improve the effectiveness of retrieval while still allowing traditional precision-recall evaluation. Secondly, this work is devoted to high-precision retrieval. Thirdly, we use a larger Persian standard test collection, created according to TREC specifications, which validates the findings in a wider context.

Query expansion is another approach to improving the effectiveness of information retrieval. These techniques can be categorized as either global or local. While global techniques rely on analysis of a whole collection to discover word relationships, local techniques emphasize analysis of the top-ranked documents retrieved for a query [28], and local techniques have been shown to be more effective than global techniques in general [29, 2]. In this thesis we do not expand a query based on the information in the set of top-ranked documents retrieved for the query; instead, we use a very simple and more efficient re-ranking approach to improve the effectiveness of the search results and build a high-precision system that places more relevant documents at the top of the result list, helping the user find the needed information efficiently.

2.1 Search Engine Tools

2.1.1 Web Crawlers


To find information from the hundreds of millions of Web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on Web sites [6]. When a spider is building its lists, the process is called Web crawling. A Web crawler is a program which automatically traverses the web by downloading documents and following links from page to page [8]. Crawlers are mainly used by web search engines to gather data for indexing. Web crawlers are also known as spiders, robots, worms, etc. They are automated programs that follow the links found on web pages [10]. There are a number of different scenarios in which crawlers are used for data acquisition; a few examples, which differ in the crawling strategies used, are breadth-first crawling, recrawling pages for updates, focused crawling, random walking and sampling, and crawling the hidden Web [11].

2.1.2 How the Web Crawler Works


Following is the process by which Web crawlers work [3]:
1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.

In the first step, a Web crawler takes a URL and downloads the page from the Internet at the given URL. Oftentimes the downloaded page is saved to a file on disk or put in a database [3]. In the second step, a Web crawler parses through the downloaded page and retrieves the links to other pages. After the crawler has retrieved the links from the page, each link is added to a list of links to be crawled [3]. The third step of Web crawling repeats the process. All crawlers work in a recursive or loop fashion, but there are two different ways to handle it: links can be crawled in a depth-first or breadth-first manner [3].

Web pages and the links between them can be modeled by a directed graph called the web graph. Web pages are represented by vertices and links are represented by directed edges [7]. Using depth-first search, an initial web page is selected, a link is followed to a second web page (if there exists such a link), a link on the second web page is followed to a third web page, if there is such a link, and so on, until a page with no new link is found. Backtracking is then used to examine links at the previous level to look for new links, and so on. (Because of practical limitations, web spiders limit the depth to which they search in depth-first crawling.) Using breadth-first search, an initial web page is selected and a link on this page is followed to a second web page, then a second link on the initial page is followed (if it exists), and so on, until all links of the initial page have been followed. Then links on the pages one level down are followed, page by page, and so on.
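As a small illustration of the breadth-first strategy just described, the following sketch (the class and the placeholder fetchAndExtractLinks method are ours) maintains a FIFO frontier of URLs; a real crawler would also handle robots.txt, politeness delays and depth limits:

    import java.util.*;

    class BreadthFirstCrawlerSketch {
        // Placeholder standing in for the download-and-parse steps described above:
        // it would fetch the page at `url` and return the URLs of the links found on it.
        static List<String> fetchAndExtractLinks(String url) {
            return new ArrayList<String>();
        }

        static Set<String> crawl(String seedUrl, int maxPages) {
            Set<String> visited = new LinkedHashSet<String>();
            Queue<String> frontier = new LinkedList<String>();   // FIFO queue => breadth-first order
            frontier.add(seedUrl);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;                  // skip pages already crawled
                for (String link : fetchAndExtractLinks(url)) {
                    if (!visited.contains(link)) frontier.add(link);
                }
            }
            return visited;
        }
    }

Replacing the FIFO queue with a stack turns the same loop into the depth-first strategy, subject to the depth limits mentioned above.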

2.1.3 Overview of data clustering


Data clustering, as a class of data mining techniques, partitions a given data set into separate clusters, with each cluster composed of data objects with similar characteristics. Most existing clustering methods can be broadly classified into two categories: partitioning methods and hierarchical methods. Partitioning algorithms, such as k-means, k-medoid and EM, attempt to partition a data set into k clusters such that a previously given evaluation function is optimized. The basic idea of hierarchical clustering methods is to first construct a hierarchy by decomposing the given data set, and then use agglomerative or divisive operations to form clusters. In general, an agglomeration-based hierarchical method starts with a disjoint set of clusters, placing each data object into an individual cluster, and then merges pairs of clusters until the number of clusters is reduced to a given number k. On the other hand, a division-based hierarchical method treats the whole data set as one cluster at the beginning, and divides it iteratively until the number of clusters is increased to k. See [11] for more information. Although [17, 20, 23, 31, 33] have developed special algorithms for clustering search results, we prefer to use traditional methods in this thesis. We will show that our method, with basic clustering algorithms such as k-means and Principal Direction Divisive Partitioning (PDDP), achieves a significant improvement over methods based on similarity-search ranking alone.
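To make the partitioning idea concrete, the following minimal sketch of k-means (class name ours) alternates the assignment and update steps for a fixed number of iterations on numeric feature vectors; mapping search-result snippets to such vectors, and choosing the initial centroids (for example from PDDP, as in the hybrid approach discussed later), are omitted:

    import java.util.*;

    class KMeansSketch {
        static int[] cluster(double[][] points, double[][] centroids, int iterations) {
            int k = centroids.length, dim = points[0].length;
            int[] assignment = new int[points.length];
            for (int it = 0; it < iterations; it++) {
                // Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
                for (int p = 0; p < points.length; p++) {
                    double best = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dist = 0;
                        for (int d = 0; d < dim; d++) {
                            double diff = points[p][d] - centroids[c][d];
                            dist += diff * diff;
                        }
                        if (dist < best) { best = dist; assignment[p] = c; }
                    }
                }
                // Update step: move each centroid to the mean of the points assigned to it.
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (int p = 0; p < points.length; p++) {
                    counts[assignment[p]]++;
                    for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
                }
                for (int c = 0; c < k; c++)
                    if (counts[c] > 0)
                        for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
            return assignment;
        }
    }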

2.2 An example information retrieval problem


A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and it often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare's Collected Works is a bit under one million words of text in total), you really need nothing more. But for many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".

3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document (here a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 2.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.

Figure 2.1: A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then take the bitwise AND of the three vectors (a sketch of this computation, using placeholder vectors, is given after Figure 2.2).

The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 2.2). The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

Figure 2.2: Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
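The bitwise computation behind this answer can be sketched as follows; the incidence vectors below are hypothetical placeholders (one bit per play) standing in for the rows of Figure 2.1:

    class BooleanQuerySketch {
        public static void main(String[] args) {
            int allPlays  = 0b111111;   // six plays in the collection
            int brutus    = 0b110100;   // hypothetical incidence vector for Brutus
            int caesar    = 0b110111;   // hypothetical incidence vector for Caesar
            int calpurnia = 0b010000;   // hypothetical incidence vector for Calpurnia

            // Brutus AND Caesar AND NOT Calpurnia
            int answer = brutus & caesar & (~calpurnia & allPlays);
            System.out.println(Integer.toBinaryString(answer));   // set bits = matching plays
        }
    }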

Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2-3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle.

Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as "pipeline rupture". To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

A 500,000 x 1,000,000 term-document incidence matrix has half a trillion 0's and 1's - too many to fit in a computer's memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1's, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions. This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval. We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; here, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list - which records that a term appeared in a document (and, later, often, the positions in the document) - is conventionally called a posting. The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.

Figure 2.3: The two parts of an inverted index.
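A minimal sketch of this dictionary-and-postings structure is given below (class and method names are ours). Note that only the 1 positions of the incidence matrix are stored, and each postings list is kept sorted by document ID:

    import java.util.*;

    class InvertedIndexSketch {
        // Dictionary: term -> postings list of document IDs (TreeSet keeps them sorted).
        private final Map<String, TreeSet<Integer>> index = new HashMap<String, TreeSet<Integer>>();

        void addDocument(int docId, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.length() == 0) continue;
                TreeSet<Integer> postings = index.get(token);
                if (postings == null) {
                    postings = new TreeSet<Integer>();
                    index.put(token, postings);
                }
                postings.add(docId);
            }
        }

        SortedSet<Integer> postingsFor(String term) {
            TreeSet<Integer> postings = index.get(term.toLowerCase());
            return postings == null ? new TreeSet<Integer>() : postings;
        }
    }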

Chapter 3 Implementation Details

3.1 Determining the user terms

3.1.1 Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A small example of tokenization is sketched at the end of this subsection.

These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary. The set of index terms could be entirely distinct from the tokens; for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes. For example, if the document to be indexed is "to sleep perchance to dream", then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index, then there will be only 3 terms: sleep, perchance, and dream.

The major question of the tokenization phase is: what are the correct tokens to use? In this example, it looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions? Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
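The "to sleep perchance to dream" example above can be reproduced with a few lines of code; the whitespace-only tokenization and the one-word stop list are simplifications:

    import java.util.*;

    class TokenizationSketch {
        public static void main(String[] args) {
            String document = "to sleep perchance to dream";
            Set<String> stopWords = new HashSet<String>(Arrays.asList("to"));

            String[] tokens = document.toLowerCase().split("\\s+");                // every occurrence
            Set<String> types = new LinkedHashSet<String>(Arrays.asList(tokens));  // distinct sequences
            Set<String> terms = new LinkedHashSet<String>(types);
            terms.removeAll(stopWords);                                            // dictionary entries

            // Prints: 5 tokens, 4 types, 3 terms
            System.out.println(tokens.length + " tokens, " + types.size() + " types, " + terms.size() + " terms");
        }
    }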

3.1.2 Processing Boolean queries


How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query: 1.1 Brutus AND Calpurnia The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists with a logical AND operation.) There is a simple and effective method of intersecting postings lists using the merge algorithm: we maintain pointers into both lists

and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations. Formally, the complexity of querying is Θ(N), where N is the number of documents in the collection. Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering; using a numeric sort by docID is one simple way to achieve this. We can extend the intersection operation to process more complicated queries like:

1.2 (Brutus OR Caesar) AND NOT Calpurnia 1.3 Brutus AND Caesar AND Calpurnia 1.4 (Calpurnia AND Brutus) AND Caesar 1.5 (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
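The linear-time merge described above can be sketched as follows (class name ours); for queries such as 1.2-1.5 the same routine is applied repeatedly, typically starting with the shortest postings lists:

    import java.util.*;

    class IntersectSketch {
        // Intersects two postings lists sorted by doc ID; two pointers advance in step,
        // so the intersection of lists of lengths x and y takes O(x + y) comparisons.
        static List<Integer> intersect(List<Integer> p1, List<Integer> p2) {
            List<Integer> answer = new ArrayList<Integer>();
            int i = 0, j = 0;
            while (i < p1.size() && j < p2.size()) {
                int d1 = p1.get(i), d2 = p2.get(j);
                if (d1 == d2) { answer.add(d1); i++; j++; }   // same doc ID: keep it, advance both
                else if (d1 < d2) i++;                        // advance the pointer at the smaller doc ID
                else j++;
            }
            return answer;
        }
    }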

3.1.3 Schematic Representation of Our Approach

Figure 3.1: A Schematic Model of Our Approach

3.1.4 Methodology of Our Proposed Model


We have followed the existing process to obtain the DB/Indexes. We then group or cluster the existing index database by analyzing the popularity of each page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. Each cluster is associated with a set of keywords, which is assumed to represent a concept, e.g. technology, science, arts, film, medical, music, sex, photo and so on.
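As a purely illustrative sketch of how an indexed page might be routed to such a concept cluster, the following code assigns a page to the concept whose keyword set it overlaps most; the concept labels, keywords and scoring are hypothetical placeholders rather than the actual sets used by our model:

    import java.util.*;

    class ConceptAssignmentSketch {
        // Hypothetical concept clusters, each described by a set of keywords.
        static Map<String, Set<String>> concepts = new HashMap<String, Set<String>>();
        static {
            concepts.put("technology", new HashSet<String>(Arrays.asList("laptop", "software", "processor")));
            concepts.put("music", new HashSet<String>(Arrays.asList("album", "guitar", "concert")));
        }

        // Assigns a page (given its extracted terms) to the concept with the largest keyword overlap.
        static String assign(Set<String> pageTerms) {
            String best = null;
            int bestOverlap = 0;
            for (Map.Entry<String, Set<String>> concept : concepts.entrySet()) {
                Set<String> overlap = new HashSet<String>(concept.getValue());
                overlap.retainAll(pageTerms);            // keywords shared with the page
                if (overlap.size() > bestOverlap) {
                    bestOverlap = overlap.size();
                    best = concept.getKey();
                }
            }
            return best;                                 // null if no concept keyword matches
        }
    }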

3.2 Our Proposed Model Tool

3.2.1 Cluster Processor


The Cluster Processor improves its performance automatically by learning relationships and associations within the stored data and then builds the clusters; a statistical technique is used for identifying patterns and associations in complex data. It is somewhat difficult to accumulate enough exact user searches to build a cluster. The clustering process fully depends on fuzzy methods.

3.2.2. DB/Cluster
This is the second major module of our approach. It stores the patterns, or clusters, of complex data present on the web together with the corresponding URLs. The content inside the DB/Cluster is similar to the DB/Indexes, but the terms, strings or keywords that are related to a pattern or concept are found together in the same cluster, whereas in the DB/Indexes the index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. The data structure used in the DB/Cluster allows rapid access to documents that contain the user's query terms.

3.3 Working Methodology


When the user gives a query string through the entry point of the search engine [12], the query engine filters those strings or keywords by analyzing them. This is also done by a learning process. Next, the Query Engine detects which clusters the searched string is associated with. Then, the Query Engine retrieves results from the relevant cluster only, without searching the entire DB/Indexes as in the previous architecture. In this way our methodology can give fast and relevant results. One potential problem with this system is that a single string may be present in many clusters. For example, Ferrari is a string which is a laptop model from Acer and is also a car model. How will the query engine know which Ferrari the user is looking for? In this study we therefore store the frequency of each string in a file in the DB/Cluster, so that the query engine can compare the matching clusters for the searched string and return results from the cluster in which the string occurs most often.
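The frequency-based disambiguation just described can be sketched as follows; the cluster names and counts are invented for illustration, and in the real system they would be read from the frequency file stored in the DB/Cluster:

    import java.util.*;

    class ClusterSelectionSketch {
        // Hypothetical per-cluster frequency counts for a query term such as "ferrari".
        static Map<String, Integer> frequencyOf(String term) {
            Map<String, Integer> freq = new HashMap<String, Integer>();
            freq.put("automobiles", 930);   // invented count for the automobiles cluster
            freq.put("laptops", 120);       // invented count for the laptops cluster
            return freq;
        }

        // Picks the cluster in which the query term occurs most often; results would then
        // be retrieved from that cluster only, not from the whole DB/Indexes.
        static String selectCluster(String term) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : frequencyOf(term).entrySet()) {
                if (e.getValue() > bestCount) { bestCount = e.getValue(); best = e.getKey(); }
            }
            return best;
        }
    }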

Chapter 4 Future Work and Conclusions

4.1 Future Work


There are several directions in which this research can proceed. In this thesis, we proposed a model for retrieval systems that is based on a simple document re-ranking method using local cluster analysis. Experimental results show that it is more effective than existing techniques. To demonstrate the efficiency of the proposed architecture, we used a single clustering method (PDDP) to produce clusters that are tailored to the information need represented by the query. Afterwards, we used K-means with the PDDP clusters as the initial configuration (a hybrid approach) and showed that PDDP has the potential to improve results on its own.

In our approach, the context of a document is considered in the retrieved results through the combination of information search and local cluster analysis. This, first, yields a relevant cluster tailored to the user's information need and improves the search results efficiently, and second, makes a high-precision system that contains more relevant documents at the top of the result list. As was shown, even for the worst query, in which average precision decreased by 0.1982 percent, our system remains high-precision.

4.2 Conclusion
We will pursue this work in several directions. Firstly, the current methods for clustering search results are PDDP and hybrid K-means. Our experimental results have shown that PDDP is efficient for our purpose, but since the total size of the input in search-results clustering is small, we can afford some more complex processing, which could possibly let us achieve better results. Unlike previous clustering techniques that use some proximity measure between documents, such an approach would try to discover meaningful phrases that can become cluster descriptions and only then assign documents to those phrases to form clusters. Using these concept-driven clustering approaches may be a useful direction for future work.

Secondly, we assumed that the search results contain two clusters (relevant and irrelevant). In some cases the irrelevant cluster can be split into further sub-clusters by semantic relations. Obtaining the optimal sub-clusters semantically could produce better results.

Thirdly, we re-ranked results based on both clusters and afterwards chose the better one manually. As we mentioned before, we conjecture that the relevant cluster's centroid must be nearer to the query than the irrelevant cluster's centroid, so we could automatically choose the cluster whose centroid is closer to the query (the relevant cluster).

Lastly, we evaluated the proposed architecture on ad hoc retrieval. As we mentioned before, our approach is independent of the initial system architecture, so it can be embedded in any existing search engine. Web search engines are among the systems most in need of high precision, so evaluating this approach on Web search engines would be a prominent piece of future work.

Appendices
import javax.swing.*; import javax.swing.JScrollPane; import java.awt.*; import java.awt.event.*; import java.util.*; import java.io.*; public class GDemo { public static void main(String args[]) { SimpleFrame frame = new SimpleFrame(); frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); frame.setVisible(true); } }

class SimpleFrame extends JFrame implements ActionListener { public static HashMap<String,ArrayList> result= new HashMap<String, ArrayList>(); public static String token; public static String op; public static String searchstring; public static final int DEFAULT_WIDTH = 600; public static final int DEFAULT_HEIGHT = 400; final JTextArea textArea; final JTextField textField; final JButton button;

    public SimpleFrame() {
        setTitle("Information Retrieval System");
        setSize(DEFAULT_WIDTH, DEFAULT_HEIGHT);

        Font f = new Font("Old Book Style", Font.BOLD, 12);
        textField = new JTextField(30); textField.setFont(f);
        textArea = new JTextArea(20, 50); textArea.setWrapStyleWord(true); textArea.setFont(f);
        JScrollPane scrollPane = new JScrollPane(textArea);
        add(scrollPane, BorderLayout.CENTER);

        JPanel panel = new JPanel();
        panel.setLayout(new FlowLayout(FlowLayout.CENTER));
        JLabel label = new JLabel("Input Text: ");
        button = new JButton("Click Here");
        panel.add(label); panel.add(textField); panel.add(button); panel.add(textArea);
        button.addActionListener(this);

        Container cp = getContentPane();
        cp.add(panel, BorderLayout.CENTER);
    } // SimpleFrame()

    // Parses the query ("term1 AND term2", "term1 OR term2", or a single term)
    // and shows the names of the matching files in the text area.
    public void actionPerformed(ActionEvent event) {
        if (event.getSource() == button) {
            textArea.setText("");
            searchstring = textField.getText();
            String tokens[] = searchstring.split(" ");
            if (tokens.length > 2) {
                op = tokens[1];
                result.put(tokens[0], searchText(tokens[0]));
                result.put(tokens[2], searchText(tokens[2]));
                if (op.equals("AND")) {
                    HashSet<String> hs1 = new HashSet<String>(result.get(tokens[0]));
                    HashSet<String> hs2 = new HashSet<String>(result.get(tokens[2]));
                    hs1.retainAll(hs2);   // intersection: files containing both terms
                    textArea.setText("");
                    for (String fileName : hs1) textArea.append(fileName + "\n");
                } else if (op.equals("OR")) {
                    HashSet<String> hs1 = new HashSet<String>(result.get(tokens[0]));
                    HashSet<String> hs2 = new HashSet<String>(result.get(tokens[2]));
                    hs1.addAll(hs2);      // union: files containing either term
                    textArea.setText("");
                    for (String fileName : hs1) textArea.append(fileName + "\n");
                }
            } else {
                ArrayList list = searchText(searchstring);
                textArea.setText("");
                Iterator fileName = list.iterator();
                while (fileName.hasNext()) textArea.append(fileName.next() + "\n");
            }
        }
    } // actionPerformed()

    // Searches every file in the corpus directory for the given term(s)
    // and returns the paths of the files that contain at least one of them.
    public ArrayList searchText(String args1) {
        String args[] = args1.split(" ");
        for (int i = 0; i < args.length; i++) args[i] = args[i].toUpperCase();
        ArrayList<String> filefound = new ArrayList<String>();
        File f = new File("D:\\program\\Java\\Test");   // corpus directory
        for (File s : f.listFiles()) {
            for (int i = 0; i < args.length; i++) {
                try {
                    if (search(s.getPath(), args[i])) filefound.add(s.getPath());
                } catch (Exception e) { e.toString(); }
            }
        }
        textArea.append(filefound + "\n");
        return filefound;
    } // searchText()

    // Returns true if the file contains the token (case-insensitive; lines are
    // tokenized on spaces, commas and periods).
    public boolean search(String file, String token) throws Exception {
        HashSet<String> set = new HashSet<String>();
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, " ,.");
            while (st.hasMoreElements()) set.add(st.nextToken().toUpperCase());
        }
        br.close();
        return set.contains(token);
    } // search()
} // class SimpleFrame

Results

REFERENCES AND BIBLIOGRAPHY


1. Brin, Sergey and Page, Lawrence. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, April 1998.
2. Shahram Rahimi, Bidyut Gupta, and Kaushik Adya. A Novel Page Ranking Algorithm for Search Engines Using Implicit Feedback. Southern Illinois University, USA. Engineering Letters, 13:3, EL_13_3_20 (advance online publication: 4 November 2006).
3. James Holmes. Crawling the Web with Java, Chapter 6, pages 2-3.
4. Marc Najork and Janet L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. Compaq Systems Research Center, USA.
5. Monica Peshave. How Search Engines Work and a Web Crawler Application. Department of Computer Science, University of Illinois at Springfield, Springfield.
6. K.T. Anuradha. Search Engines for Intranets. National Centre for Science Information (NCSI), Indian Institute of Science, Bangalore.
7. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. Computer Science Department, Stanford University.
8. Franklin, Curt. How Internet Search Engines Work, 2002. www.howstuffworks.com
9. Garcia-Molina, Hector. Searching the Web, August 2001. http://oak.cs.ucla.edu/~cho/papers/cho-toit01.pdf
10. Pant, Gautam, Padmini Srinivasan, and Filippo Menczer. Crawling the Web, 2003. http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
11. Anupam Joshi (University of Maryland, USA) and Zhihua Jiang (American Management Systems, Inc., USA). Retriever: Improving Web Search Engine Results Using Clustering.
12. Carlos Castillo. Effective Web Crawling. PhD thesis, Dept. of Computer Science, University of Chile, November 2004.
13. Vladislav Shkapenyuk and Torsten Suel. Design and Implementation of a High-Performance Distributed Web Crawler. CIS Department, Polytechnic University, Brooklyn, New York 11201.
14. R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Natural language processing in the FAQ Finder system: Results and prospects, 1997.
15. T. Calishain and R. Dornfest. Google Hacks: 100 Industrial-Strength Tips & Tools. O'Reilly, ISBN 0596004478, 2003.
16. David Carmel, Eitan Farchi, Yael Petruschka, and Aya Soffer. Automatic query refinement using lexical affinities with maximal information gain. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 283-290. ACM Press, 2002.
17. Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.
18. Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1623-1640, 1999.
19. Michael Chau, Hsinchun Chen, Jailun Qin, Yilu Zhou, Yi Qin, Wai-Ki Sung, and Daniel McDonald. Comparison of two approaches to building a vertical search tool: A case study in the nanotechnology domain. In Proceedings of the Joint Conference on Digital Libraries, Portland, OR, 2002.
20. C.W. Cleverdon, J. Mills, and M. Keen. Factors determining the performance of indexing systems. Volume I - Design, Volume II - Test Results, ASLIB Cranfield Project; reprinted in Sparck Jones & Willett, Readings in Information Retrieval, 1966.
21. B. D. Davison, D. G. Deschenes, and D. B. Lewanda. Finding relevant website queries. In Proceedings of the twelfth international World Wide Web conference, 2003.
22. Daniel Dreilinger and Adele E. Howe. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3):195-222, 1997.
23. Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the tenth international conference on World Wide Web, pages 613-622. ACM Press, 2001.
24. B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1-26, 1979.
25. Tina Eliassi-Rad and Jude Shavlik. Intelligent Web agents that learn to retrieve and extract information. Physica-Verlag GmbH, 2003.
26. Oren Etzioni. Moving up the information food chain: Deploying softbots on the world wide web. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, pages 1322-1326, Menlo Park, 4-8 1996. AAAI Press / MIT Press.
27. Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin, and David P. Williamson. Searching the workplace web. In WWW '03: Proceedings of the twelfth international conference on World Wide Web, pages 366-375. ACM Press, 2003.
28. Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 301-312. ACM Press, 2003.
29. A. Finn, N. Kushmerick, and B. Smyth. Genre classification and domain transfer for information filtering. In Proc. 24th European Colloquium on Information Retrieval Research, Glasgow, pages 353-362, 2002.
30. Aidan Finn and Nicholas Kushmerick. Learning to classify documents according to genre. In IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.
31. C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing system. In Ian Witten, Rob Akscyn, and Frank M. Shipman III, editors, Digital Libraries '98: The Third ACM Conference on Digital Libraries, pages 89-98, Pittsburgh, PA, June 23-26, 1998. ACM Press.
32. Eric Glover, Gary Flake, Steve Lawrence, William P. Birmingham, Andries Kruger, C. Lee Giles, and David Pennock. Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet, SAINT, pages 23-31, San Diego, CA, January 8-12, 2001. IEEE Computer Society, Los Alamitos, CA.
33. Eric J. Glover, Steve Lawrence, William P. Birmingham, and C. Lee Giles. Architecture of a metasearch engine that supports user information needs. In Proceedings of the eighth international conference on Information and knowledge management, pages 210-216. ACM Press, 1999.
34. Ayse Goker. Capturing information need by learning user context. In Sixteenth International Joint Conference in Artificial Intelligence: Learning About Users Workshop, pages 21-27, 1999.
35. Ayse Goker, Stuart Watt, Hans I. Myrhaug, Nik Whitehead, Murat Yakici, Ralf Bierig, Sree Kanth Nuti, and Hannah Cumming. User context learning for intelligent information retrieval. In EUSAI '04: Proceedings of the 2nd European Union symposium on Ambient intelligence, pages 19-24. ACM Press, 2004.
36. Google Web APIs. http://www.google.com/apis/
37. Luis Gravano, Chen-Chuan K. Chang, Hector Garcia-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the 1997 ACM SIGMOD international conference on Management of data, pages 207-218. ACM Press, 1997.
38. Robert H. Guttmann and Pattie Maes. Agent-mediated integrative negotiation for retail electronic commerce. Lecture Notes in Computer Science, pages 70-90, 1999.
39. Monika Henzinger, Bay-Wei Chang, Brian Milch, and Sergey Brin. Query-free news search. In Twelfth international World Wide Web Conference (WWW-2003), Budapest, Hungary, May 20-24, 2003.
40. Adele E. Howe and Daniel Dreilinger. SAVVYSEARCH: A metasearch engine that learns which search engines to query. AI Magazine, 18(2):19-25, 1997.
41. Jianying Hu, Ramanujan Kashi, and Gordon T. Wilfong. Document classification using layout analysis. In DEXA Workshop, pages 556-560, 1999.
42. David Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR '93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 329-338. ACM Press, 1993.
43. Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137-142. Springer-Verlag, 1998.
44. Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nedellec and Celine Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
45. George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem. In International Conference on Machine Learning, pages 121-129, 1994.
