
The 1st International Conference on Information Science and Engineering (ICISE2009)

The Improved Pagerank in Web Crawler


Ling Zhang
Department of Information Science and Engineering, Hunan First Normal University
Changsha, China
Xiaoling79113@sohu.com

Zheng Qin
College of Computer and Communication, Hunan University
Changsha, China
lingchengmi@163.com
Abstract: Pagerank is an algorithm for rating web pages. It introduces the citation relationship of academic papers to evaluate web pages' authority. It gives the same weight to all edges and ignores the relevancy of web pages to the topic, resulting in the problem of topic-drift. On the analysis of several pagerank algorithms, an improved pagerank based upon thematic segments is proposed. In this algorithm, a web page is divided into several blocks by the HTML document's structure, and the most weight is given to linkages in the block that is most relevant to the given topic. Moreover, the visited outlinks are regarded as feedback to modify the blocks' relevancy. The experiment on a web crawler shows that the new algorithm has some effect on resolving the problem of topic-drift.
Keywords: Pagerank; web crawler; topic-drift; relevancy
INTRODUCTION

With the rapid growth of web resources on the Internet, besides hoping that search engines can provide more and more appropriate information, people also require centralized queries on a given topic. Because the searching range of a topic-specific search engine is limited to a professional area and the searching object is only a small portion of web resources, the traditional breadth-first or depth-first searching strategy is no longer suitable. In order to reach more target pages while visiting few irrelevant web pages, the web crawler usually takes a heuristic searching strategy that ranks URLs by their importance and preferentially visits the more important web pages [1]. So how to decide a URL's importance has become a hotspot of recent research.
The existing ranking algorithms mainly estimate a URL's importance by the web page's relevancy to the topic or by its authority [2, 3]. There are several common algorithms for evaluating authority, such as PageRank, Kleinberg's HITS, and SALSA [4]. These algorithms can evaluate a web page's authority exactly, but they hardly consider topical information, resulting in the problem of topic-drift [5, 6]: although a web page with a high authority score certainly has high universal authority, it does not always have high authority on the given topic as well. To resolve this problem, Bharat and Henzinger implemented a new heuristic strategy that assigns different weights to outlinks. Taher Haveliwala proposed the topic-sensitive pagerank [7], which uses different topic-vectors for different topics and regards the similarity of these topic-vectors to the given topic as weights; the weighted sum of the web page's relevancies to each topic is then its pagerank score. Matthew Richardson combines the linkage information and the content information to improve the traditional pagerank [5]: when a web page's pagerank score is passed to its outlinks, not only the citation relationship but also the web page's topical relevancy is taken into account. The double focused pagerank [8] proposed by Diligenti divides the probability of visiting a linkage into two parts: one is the probability of jumping to a certain web page or following outlinks in this page, which is proportional to the page's relevancy to the topic; the other is the probability of following a certain outlink, which is proportional to the linkage's relevancy to the topic. The above algorithms all introduce topical information into pagerank to resolve the problem of topic-drift, but they differentiate linkages only by assigning more pagerank score to the outlinks that are more relevant to the topic, without fully making use of the content structure of HTML documents.

* This paper is supported by the National Natural Science Foundation of China under Grant No. 60273070 and the Sci. & Tech. Project of Hunan under Grant No. 04GK3022.
On the analysis of the above pagerank algorithms and the content structure of HTML documents, we propose the pagerank based on thematic segments. It divides a web page into several blocks based upon its content structure, and then assigns the page's pagerank score to each block according to the block's relevancy to the topic. Finally, each block's pagerank score is further assigned to the outlinks in it according to the linkages' relevancy. Moreover, the visited linkages provide feedback to modify the blocks' relevancy. We applied this pagerank algorithm in web crawler experiments, and the results showed that, compared with the above-mentioned pagerank algorithms, the new pagerank improves the searching precision.
I. PAGERANK

Sergey Brin and Larry Page proposed the pagerank algorithm for scoring web pages. Each web page is given an authority score that evaluates its importance. In the beginning, pagerank was used only for ranking the results of information retrieval, but now it has been applied in many fields such as web crawling, clustering web pages, and searching for relevant web pages.
Imagine a web surfer who jumps from web page to web page, choosing with uniform probability which link to follow at each step. The surfer will occasionally jump to a random page with some small probability. We consider the web as a directed graph: let F_i be the set of pages which page i links to, and B_i be the set of pages which link to page i. Averaged over a sufficient number of steps, the probability that the surfer is on page j at some point in time is given by formula (1):

P(j) = \frac{\varepsilon}{N} + (1 - \varepsilon) \sum_{i \in B_j} \frac{P(i)}{|F_i|}    (1)
where N is the total number of web pages and 0 < \varepsilon < 1; the usual value of \varepsilon is 0.15. The pagerank score reflects the citation relationship of web pages: if a web page is cited by many important pages, it is also an important page. Although the pagerank score can properly reflect the authority of web pages, it ignores the web pages' relevancy to the topic, and the authority score is independent of topics. A web page has only one pagerank score, but some web pages (especially some door-way pages) include information about many different topics. For example, a web page that has high authority on the topic "art" may not also have high authority on the topic "sports". Moreover, linkages often serve only for navigation or advertisement, so web pages linked to each other do not always have the same topical relevancy. Therefore, it is not appropriate to assign the same pagerank score to all outlinks in a web page.
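As a concrete illustration, the following Python sketch computes formula (1) by simple power iteration over a toy link graph; the graph, function names, and iteration count are our own illustrative choices, not part of the original algorithm description.

def pagerank(links, eps=0.15, iters=50):
    """links: dict mapping page -> list of pages it links to (F_i)."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    p = {page: 1.0 / n for page in pages}           # uniform start
    for _ in range(iters):
        nxt = {page: eps / n for page in pages}     # random-jump term eps/N
        for i, outs in links.items():
            if not outs:
                continue
            share = (1 - eps) * p[i] / len(outs)    # (1-eps) * P(i) / |F_i|
            for j in outs:                          # i is a member of B_j
                nxt[j] += share
        p = nxt
    return p

# Example: three pages where "a" and "b" both cite "c".
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))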
II. THE PAGERANK COMBINED WITH CONTENTS

To combine authority with topical relevancy, Matthew Richardson improved the traditional pagerank algorithm: when the pagerank score is passed between web pages, the relevancy to the topic is considered, and out-pages more relevant to the topic are assigned more pagerank score [5]. He proposed an intelligent web surfer, who probabilistically hops from page to page depending on the content of the pages and the topic terms the surfer is looking for. The resulting probability distribution over pages is given by formula (2):

P_q(j) = \varepsilon P'_q(j) + (1 - \varepsilon) \sum_{i \in B_j} P_q(i) P_q(i \to j)    (2)

where P_q(i \to j) is the probability that the surfer transitions to out-page j given that he is on page i and is searching for the topic q, and P'_q(j) specifies where the surfer chooses to jump when not following outlinks. The two probabilities both relate to the web pages' relevancy to topic q. Given that W is the set of all web pages and R_q(k) is a measure of the relevancy of page k to topic q, P_q(i \to j) and P'_q(j) are defined in formulas (3) and (4):

P_q(i \to j) = \frac{R_q(j)}{\sum_{k \in F_i} R_q(k)}    (3)

P'_q(j) = \frac{R_q(j)}{\sum_{k \in W} R_q(k)}    (4)

Seen from the above formulas, the pagerank combined with content assigns pagerank score according to the web pages' and outlinks' relevancy to the topic. So, among the outlinks on the same web page, those more relevant to the topic get more of the parent web page's pagerank score.
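For clarity, here is a minimal Python sketch of formulas (3) and (4), assuming the relevance measure R_q is already computed and stored in a dictionary; all names and values are illustrative.

def transition_prob(rel, F_i, j):
    """P_q(i -> j): relevance of j normalized over page i's outlinks F_i (3)."""
    total = sum(rel[k] for k in F_i)
    return rel[j] / total if total else 0.0

def jump_prob(rel, W, j):
    """P'_q(j): relevance of j normalized over all pages W (4)."""
    total = sum(rel[k] for k in W)
    return rel[j] / total if total else 0.0

rel = {"a": 0.1, "b": 0.7, "c": 0.2}           # toy R_q values
print(transition_prob(rel, ["b", "c"], "b"))   # 0.7 / 0.9: b gets most score
print(jump_prob(rel, rel.keys(), "b"))         # 0.7 / 1.0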
III. THE PAGERANK BASED ON TOPICAL BLOCKS

For considering the topical relevancy, the pagerank combined with content has some effect on the problem of topic-drift in information retrieval. However, when the pagerank algorithm is applied in web crawling, the crawler cannot see the contents of unvisited pages, so it can only estimate the relevancy of unvisited web pages from the visited pages and from the information in hyperlinks. But hyperlinks are usually incapable of providing enough information. On the analysis of the content structure of web pages, an improved pagerank algorithm based upon thematic segments is proposed, in order to make the web crawler able to estimate the importance of links more accurately.
The general web information processor usually treats the web page as a unit. In fact, this processing is too coarse [9, 10]. When designing a web page, the author does not pile up various information pell-mell, but organizes it with a certain layout and structure.
As seen from Figure 1, according to a web page's layout and structure, it can be divided into several information blocks, which comprise many single information items. Moreover, these information blocks can be classed into four types: the text block, the relevant hyperlinks block, the navigation and advertisement block, and the block relevant to other topics. If a web page includes information not only about a single topic but about multiple topics, then, considering that information about one topic is often placed together, we need to further classify these information blocks by topic.

Figure 1. An example of an HTML document, with blocks for relevant hyperlinks, navigation, advertisement, and other topics.
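The paper does not spell out the segmentation procedure itself; the following Python sketch, using only the standard library, illustrates one plausible way to divide a page into blocks, under the assumption that top-level container tags such as <div> and <table> delimit blocks.

from html.parser import HTMLParser

class BlockSplitter(HTMLParser):
    BLOCK_TAGS = {"div", "table", "td", "ul"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.blocks = [[]]              # text chunks collected per block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.blocks.append([])  # open a new top-level block

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.blocks[-1].append(data.strip())

splitter = BlockSplitter()
splitter.feed("<div>computer science text</div><div><a href='/ad'>ad</a></div>")
print([b for b in splitter.blocks if b])  # [['computer science text'], ['ad']]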


Among these information blocks, some include linkages relevant to the given topic, some include linkages only for navigation or advertisement, and some include linkages for other topics. So we need to classify these blocks further according to their relevance to the given topic, and then assign the web page's pagerank score to each linkage block in proportion: the more relevant a linkage block is to the topic, the more pagerank score it gets. The pagerank score that each linkage block receives is then assigned to each linkage in the block according to the relevance of the linkages. Moreover, a visited outlink can be regarded as feedback to modify the block's relevance: if the outlink is relevant to the topic, the relevance of the block in which the outlink lies is accordingly augmented; otherwise, the relevance is diminished. The web crawler chooses the linkage which points to web page j with the probability given in formula (5):

P_q(j) = \varepsilon P'_q(j) + (1 - \varepsilon) \sum_{i \in B_j} P_q(i) S_q(m) P_q(i \to j)    (5)

Given the set S of all information blocks in web page i, suppose linkage l_j points to web page j and lies in information block m. S_q(m) is the topical relevance of m compared with the other blocks, L(m) is the set of all linkages in block m, and W is the set of linkages in the URL candidate frontier. P'_q(j), S_q(m), and P_q(i \to j) are defined as below:

P'_q(j) = \frac{1}{|B_j|} \sum_{i \in B_j} \frac{R_q(l_j)}{\sum_{k \in W} R_q(k)}    (6)

S_q(m) = \frac{R_q(m)}{\sum_{k \in S} R_q(k)}    (7)

P_q(i \to j) = \frac{R_q(l_j)}{\sum_{k \in L(m)} R_q(k)}    (8)

Each time, the web crawler chooses the most important linkage in the URL frontier to visit. After the web page pointed to by this linkage is visited, the pagerank scores of the web pages and linkages connected with it are updated immediately.
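The following Python sketch illustrates formulas (5)-(8) together with the feedback rule, for a single parent page; the data layout, the feedback step size, and all names are assumptions made for illustration.

def block_weight(blocks, m):
    """S_q(m), formula (7): block m's relevance over all blocks S of the page."""
    total = sum(b["rel"] for b in blocks.values())
    return blocks[m]["rel"] / total if total else 0.0

def link_prob(blocks, m, link):
    """P_q(i -> j), formula (8): link relevance over the links L(m) of block m."""
    total = sum(blocks[m]["links"].values())
    return blocks[m]["links"][link] / total if total else 0.0

def link_score(pr_parent, blocks, m, link, frontier_rel, eps=0.15):
    """Formula (5) for one parent page i; the full score sums this over B_j."""
    jump = frontier_rel[link] / sum(frontier_rel.values())   # ~ formula (6)
    return eps * jump + (1 - eps) * pr_parent \
        * block_weight(blocks, m) * link_prob(blocks, m, link)

def feedback(blocks, m, relevant, step=0.1):
    """Augment or diminish a block's relevance after visiting one of its outlinks."""
    blocks[m]["rel"] = max(0.0, blocks[m]["rel"] + (step if relevant else -step))

# Toy page with two blocks; the relevance values R_q are assumed inputs.
blocks = {"text": {"rel": 0.8, "links": {"u1": 0.6, "u2": 0.2}},
          "nav":  {"rel": 0.1, "links": {"u3": 0.1}}}
frontier = {"u1": 0.6, "u2": 0.2, "u3": 0.1}
print(link_score(0.05, blocks, "text", "u1", frontier))
feedback(blocks, "text", relevant=True)  # visited u1 turned out on-topic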
IV. EXPERIMENTS

In order to compare with other pagerank algorithms, we implemented web crawlers based on three different pagerank algorithms: the traditional pagerank, the pagerank combined with content, and the pagerank based on segments. The objects of the web crawlers are computer-related web pages. To compute the relevance to the topic, we choose and enrich the FOLDOC online computer dictionary as the set of computer keywords. The semantic similarity of web pages, or of the anchor texts around linkages, to the computer dictionary is regarded as their relevance to the topic. Here we choose the vector space model to express web pages and compute the semantic similarity by the cosine formula (9) [12]:

s'(q, p) = \frac{\sum_{k \in q \cap p} f_{kq} f_{kp}}{\sqrt{\sum_{k \in q} f_{kq}^2} \sqrt{\sum_{k \in p} f_{kp}^2}}    (9)

In the above formula, q is the set of topic keyword terms, p is the set of word terms in the anchor texts, and f_{kd} is the frequency of term k appearing in d. We choose the searching precision to evaluate the algorithms' performance [13]:
\text{searching precision} = \frac{|\text{the crawled relevant pages}|}{|\text{the crawled pages}|}    (10)
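As a small worked example, the Python sketch below computes the cosine similarity of formula (9) from raw term lists and the searching precision of formula (10); the tokenization and the toy inputs are our own assumptions.

import math
from collections import Counter

def cosine(q_terms, p_terms):
    """s'(q, p), formula (9): cosine similarity of term-frequency vectors."""
    fq, fp = Counter(q_terms), Counter(p_terms)
    num = sum(fq[k] * fp[k] for k in fq.keys() & fp.keys())
    den = math.sqrt(sum(v * v for v in fq.values())) \
        * math.sqrt(sum(v * v for v in fp.values()))
    return num / den if den else 0.0

def precision(crawled_pages, relevant_pages):
    """Searching precision, formula (10)."""
    return len(set(crawled_pages) & set(relevant_pages)) / len(crawled_pages)

print(cosine("computer network protocol".split(),
             "the network protocol stack".split()))
print(precision(["p1", "p2", "p3", "p4"], ["p1", "p3"]))  # 0.5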
archive.ncsa.uiuc.edu and www.dmoz.org/science are selected as the initial URL seeds. Each web crawler starts from the two URLs to search for web pages on the topic "computer". The results are shown in Figure 2.


Figure 2. The comparison of searching precision with the archive.ncsa.uiuc.edu and www.dmoz.org/science URL seeds, respectively.
As seen from the first graph of Figure 2, because the traditional pagerank ignores the web pages' text information, its performance is the worst and its searching precision is always under 0.3. In the initial and middle searching stages, the precision of the pagerank combined with content is close to that of the pagerank based on segments, but it drops rather quickly at the end. So, over the whole web crawl, the pagerank based on segments exceeds the other two pageranks in performance. From the second graph of Figure 2, we can draw the same conclusion, except that for a short stage the precision of the pagerank combined with content is a bit higher than that of the pagerank based on segments. By analyzing the crawled web pages, we find that there is a website containing many irregularly designed web pages, which influences the performance of the pagerank that uses the content structure for segmentation. However, as the number of crawled web pages increases, more and more information becomes available for segmentation, so the searching efficiency of the pagerank based on segments exceeds the other two more and more obviously.
REFERENCES
[1] Junghoo Cho, Hector Garcia-Molina, Lawrence Page, "Efficient crawling through URL ordering," In Proceedings of the 7th International World Wide Web Conference, 1998.
[2] P. Chirita, D. Olmedilla, W. Nejdl, "Finding Related Pages Using the Link Structure of the WWW," In Proceedings of the IEEE/WIC/ACM International Conference, 2004.
[3] P. Ingongngam, A. Rungsawang, "Topic-centric algorithm: a novel approach to Web link analysis," Advanced Information Networking and Applications, vol. 2, 2004, pp. 299-301.
[4] B. L. Narayan, C. A. Murthy, Sankar K. Pal, "Topic continuity for Web document categorization and ranking," IEEE/WIC International Conference on Web Intelligence, 2003, pp. 310-315.
[5] Matthew Richardson, Pedro Domingos, "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank," In Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 673-680.
[6] K. Bharat, M. R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," In Proceedings of the Twenty-First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
[7] Taher H. Haveliwala, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15(4), 2003, pp. 784-796.
[8] Michelangelo Diligenti, Marco Gori, Marco Maggini, "Web Page Scoring Systems for Horizontal and Vertical Search," In Proceedings of the 11th International World Wide Web Conference, 2002.
[9] Michael Brinkmeier, "PageRank revisited," ACM Transactions on Internet Technology, vol. 6(3), 2006, pp. 282-301.
[10] L. Wood, "Programming the Web: the W3C DOM specification," IEEE Internet Computing, vol. 3(1), 1999, pp. 48-54.
[11] Soumen Chakrabarti, Mukul Joshi, Vivek Tawde, "Enhanced topic distillation using text, markup tags, and hyperlinks," In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[12] P. Srinivasan, G. Pant, F. Menczer, "Target seeking crawlers and their topical performance," In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[13] M. Diligenti, F. M. Coetzee, S. Lawrence, et al., "Focused Crawling Using Context Graphs," In Proceedings of the 26th International Conference on Very Large Databases, 2000.
