
MINOR PROJECT PRESENTATION

PROJECT ON

IMPROVING THE PERFORMANCE OF WEB CRAWLER


SUBMITTED BY: YASHA GOEL (MCS1235), A.I.T.M

INDEX
INTRODUCTION
What is the problem
Objective of the project

INTRODUCTION

Focused crawlers, in particular, have been introduced to satisfy the need of individuals or organizations to create and maintain subject-specific Web portals or Web document collections locally, or to address complex information needs.

There are three kinds of focused crawling strategies:

Classic focused crawlers: take as input a user query and a seed URL and guide the search towards pages of interest. Pages pointed to by links with higher priority are downloaded first, and the crawler proceeds recursively on the links contained in the downloaded pages.

Semantic crawlers: assign download priorities by applying semantic similarity criteria to compute page-to-topic relevance: a page and the topic can be relevant even when they share conceptually related terms rather than identical keywords.

Learning crawlers: are supplied with a training set consisting of relevant and non-relevant Web pages, which is used to train the crawler to recognise promising links.

Well-known focused crawlers include:

Best-First crawler: keeps a priority queue ordered by the similarity between the topic and the page where each link was found.

PageRank crawler: crawls in PageRank order, recomputing the ranks every 25th page.

InfoSpiders: uses a neural network trained with backpropagation and considers the text around links.
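As an illustration, the best-first strategy can be sketched in a few lines of Python. The `fetch` callback and the term-overlap similarity below are simplifying assumptions for the sketch, not part of any real crawler:

```python
import heapq

def similarity(topic_terms, page_text):
    # Toy relevance measure: fraction of topic terms present in the page.
    # A real crawler would use TF-IDF cosine similarity or a semantic measure.
    words = set(page_text.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def best_first_crawl(seed_url, topic_terms, fetch, limit=100):
    """fetch(url) -> (page_text, outlinks) is a caller-supplied stand-in
    for downloading a page and extracting its links."""
    frontier = [(-1.0, seed_url)]          # max-priority via negated scores
    visited, results = set(), []
    while frontier and len(results) < limit:
        _, url = heapq.heappop(frontier)   # most promising link first
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)
        results.append(url)
        # Links inherit the similarity of the page they were found on.
        score = similarity(topic_terms, text)
        for link in outlinks:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return results
```

The priority queue is what distinguishes best-first from plain breadth-first crawling: links discovered on highly relevant pages are downloaded before links found on off-topic ones.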


Problems of focused crawlers

Focused crawlers use local search algorithms, so they will miss a relevant page if no chain of hyperlinks leads to it from the pages already crawled. The main problem with local search algorithms is becoming trapped in a local optimum. The limitations of local search became even more obvious after recent studies of Web structure revealed the existence of Web communities.
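This trap can be simulated on a toy in-memory link graph. The relevance scores and the threshold used here are made-up values for illustration only:

```python
def local_search_crawl(seed, web, relevance, threshold=0.5):
    """Toy focused crawler: expands outlinks only from pages whose
    relevance meets the threshold. `web` (url -> outlinks) and
    `relevance` (url -> score) are illustrative stand-ins."""
    visited, stack, collected = set(), [seed], []
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        if relevance[url] >= threshold:
            collected.append(url)
            stack.extend(web[url])   # only relevant pages are expanded
    return collected
```

If the only path to a relevant page runs through an irrelevant "bridge" page, the crawler prunes the bridge and the relevant page is never reached, even though it exists just one hop further.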

Problem with Web communities: Researchers have found that three structural properties of Web communities make local search algorithms unsuitable for building collections for scientific digital libraries.

1. Instead of directly linking to each other, many pages in the same Web community relate to each other only through co-citation relationships.

Contd.

Figure 1: Relevant pages with co-citation relationships missed by focused crawlers
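One way to make the co-citation idea concrete is to count, for every pair of pages, how many other pages link to both of them. The graph below is a made-up example; in it, p1 and p2 never link to each other, yet their co-citation count reveals that they belong together:

```python
from collections import defaultdict
from itertools import combinations

def co_citation_counts(link_graph):
    """Count how often each pair of pages is linked to ("cited") by the
    same page. A high count suggests the pair belongs to one community
    even when the two pages never link to each other directly."""
    counts = defaultdict(int)
    for outlinks in link_graph.values():
        for a, b in combinations(sorted(set(outlinks)), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

A link-following crawler that reaches p1 would still never discover p2, because the relationship is visible only in the in-links, which the crawler does not see.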

Contd.
2. Web pages relevant to the same domain could be separated into different Web communities by irrelevant pages.

Figure 2: A focused crawler trapped within the initial community

Contd.
3. Researchers found that, even when links exist between Web pages belonging to two relevant Web communities, these links may all point from the pages of one community to those of the other, with none of them pointing in the reverse direction.
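A minimal sketch of this effect, on a made-up directed graph where community A links into community B but nothing points back:

```python
def reachable(links, start):
    """All pages reachable from `start` by following hyperlinks forward.
    `links` maps url -> outlinks in a toy directed web graph."""
    seen, stack = set(), [start]
    while stack:
        url = stack.pop()
        if url not in seen:
            seen.add(url)
            stack.extend(links.get(url, ()))
    return seen
```

A crawler seeded in community A can reach both communities, but a crawler seeded in community B can never reach A, even though both are relevant to the same topic.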

THANK YOU
