
Web as a directed graph:
  Nodes: Webpages
  Edges: Hyperlinks

[Figure: a small web graph whose nodes are example pages ("I teach a class on IR. CS F469: Classes are in the LTC", "CSIS Dept at BITS HYD", "BITS PILANI HYD CAMPUS", "Today 2.00pm there will be a demo on writing a scraper in F-106 by Mr. Amit") connected by hyperlink edges.]
How to organize the Web?
First try: Human-curated web directories
  Yahoo, DMOZ, LookSmart
Second try: Web search
Information retrieval investigates:
  Finding relevant docs in a small and trusted set
  Newspaper articles, patents, etc.
But: the Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
(1) The Web contains many sources of information. Whom to trust?
    Trick: Trustworthy pages may point to each other!
(2) What is the best answer to the query "newspaper"?
    No single right answer
    Trick: Pages that actually know about newspapers might all be pointing to many newspapers

All web pages are not equally important:
  http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
  vs.
  http://www.bits-pilani.ac.in/Hyderabad/index.aspx

There is large diversity in the web-graph node connectivity. Let's rank the pages by the link structure!

We will cover the following link analysis approaches for computing the importance of nodes in a graph:
  PageRank
  Topic-Specific (Personalized) PageRank
  Web Spam Detection Algorithms
Idea: Links as votes
  A page is more important if it has more links
  In-coming links? Out-going links?
Think of in-links as votes:
  http://www.bits-pilani.ac.in/Hyderabad/index.aspx has 23,400 in-links
  http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile has 1 in-link

Are all in-links equal?
  Links from important pages count more
  Recursive question!
[Figure: example graph with PageRank scores on its nodes: B = 38.4, C = 34.3, E = 8.1, D = F = 3.9, A = 3.3, and five peripheral pages at 1.6 each.]
Each link's vote is proportional to the importance of its source page:
  If page j with importance rj has n out-links, each link gets rj/n votes
  Page j's own importance is the sum of the votes on its in-links

[Figure: page j with in-links from i and k; i contributes ri/3 (i has 3 out-links), k contributes rk/4 (k has 4 out-links); each of j's 3 out-links carries rj/3]

  rj = ri/3 + rk/4
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank rj for page j:

  rj = Σi→j ri/di        (di ... out-degree of node i)

Example: "The web in 1839" has pages y, a, m with links y→{y, a}, a→{y, m}, m→{a}; each of y's and a's out-links carries half its rank (y/2, a/2), and m's single link carries all of rm.

Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2

3 equations, 3 unknowns, no constraints
  No unique solution
  All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
  ry + ra + rm = 1
Solution: ry = 2/5, ra = 2/5, rm = 1/5

The Gaussian elimination method works for small examples, but we need a better method for large web-sized graphs.
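As a quick sanity check, the small system can indeed be solved directly. A minimal sketch in Python/NumPy, assuming the y/a/m graph above (not part of the original slides):

```python
import numpy as np

# Flow equations r = M r rewritten as (M - I) r = 0, with one redundant
# row replaced by the constraint ry + ra + rm = 1.
M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
              [0.5, 0.0, 1.0],   # ra = ry/2 + rm
              [0.0, 0.5, 0.0]])  # rm = ra/2

A = M - np.eye(3)
A[2, :] = 1.0                    # constraint row: ry + ra + rm = 1
b = np.array([0.0, 0.0, 1.0])

print(np.linalg.solve(A, b))     # -> [0.4 0.4 0.2], i.e. 2/5, 2/5, 1/5
```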
We need a new formulation!

Stochastic adjacency matrix M:
  Let page i have di out-links
  If i→j, then Mji = 1/di, else Mji = 0
  M is a column stochastic matrix: columns sum to 1
Rank vector r: a vector with an entry per page
  ri is the importance score of page i
  Σi ri = 1
The flow equations rj = Σi→j ri/di can be written: r = M·r
Remember the flow equation: rj = Σi→j ri/di
Flow equation in matrix form: M·r = r
  Suppose page i links to 3 pages, including j
  Then column i of M has the entry 1/3 in row j, so row j of the product M·r picks up (1/3)·ri, which is exactly i's contribution to rj
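To make the definition concrete, here is a small sketch that builds the column-stochastic M from an edge list; the y/a/m example graph is assumed as input:

```python
import numpy as np

# Edges of the example graph: y -> {y, a}; a -> {y, m}; m -> {a}.
edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "a")]
nodes = ["y", "a", "m"]
idx = {n: k for k, n in enumerate(nodes)}

N = len(nodes)
d = {n: sum(1 for s, _ in edges if s == n) for n in nodes}  # out-degrees

M = np.zeros((N, N))
for i, j in edges:
    M[idx[j], idx[i]] = 1.0 / d[i]   # M_ji = 1/d_i when i -> j

print(M.sum(axis=0))  # each column sums to 1 -> [1. 1. 1.]
```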
NOTE: x is an eigenvector of A with the corresponding eigenvalue λ if: A·x = λ·x

The flow equations can be written: r = M·r
So the rank vector r is an eigenvector of the stochastic web matrix M
  In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
  We know r is unit length (in the L1 norm) and each column of M sums to one, so ‖M·r‖₁ ≤ 1
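One can check the eigenvector claim numerically; a small sketch using NumPy's dense eigensolver on the example M (fine for toy graphs, not for web scale):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)       # principal eigenvalue of the stochastic matrix
v = vecs[:, k].real

print(vals[k].real)            # -> 1.0
print(v / v.sum())             # -> [0.4 0.4 0.2], the rank vector r
```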

We can now efficiently solve for r!
The method is called Power iteration.

Example graph y, a, m (y→{y, a}, a→{y, m}, m→{a}) and its stochastic matrix:

          y    a    m
    y [ 1/2  1/2   0 ]
M = a [ 1/2   0    1 ]
    m [  0   1/2   0 ]

r = M·r, i.e. the flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Power iteration: a simple iterative scheme
  Initialize: r(0) = [1/N, ..., 1/N]^T
  Iterate: r(t+1) = M·r(t), i.e. rj(t+1) = Σi→j ri(t)/di   (di ... out-degree of node i)
  Stop when |r(t+1) − r(t)|1 < ε
    |x|1 = Σ1≤i≤N |xi| is the L1 norm
    Can use any other vector norm, e.g., Euclidean
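A minimal sketch of this scheme in Python (the function name and defaults are mine, not from the slides):

```python
import numpy as np

def power_iterate(M, eps=1e-10, max_iter=100):
    """Iterate r(t+1) = M r(t) from the uniform vector until the L1 change < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                    # r(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r                         # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:     # |r(t+1) - r(t)|_1 < eps
            return r_next
        r = r_next
    return r
```

On the example matrix M above, this returns approximately [0.4, 0.4, 0.2].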
Power Iteration on the example:

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    1 ]          ra = ry/2 + rm
    m [  0   1/2   0 ]          rm = ra/2

  Set rj = 1/N
  1: r'j = Σi→j ri/di
  2: r = r'
  Goto 1

Example (iterations 0, 1, 2, ...):
  ry     1/3   1/3   5/12    9/24   ...   6/15
  ra  =  1/3   3/6   1/3    11/24   ...   6/15
  rm     1/3   1/6   3/12    1/6    ...   3/15
Imagine a random web surfer:
  At any time t, the surfer is on some page i
  At time t+1, the surfer follows an out-link from i uniformly at random
  Ends up on some page j linked from i
  Process repeats indefinitely
Let:
  p(t) ... vector whose ith coordinate is the probability that the surfer is at page i at time t
  So, p(t) is a probability distribution over pages

Where is the surfer at time t+1?
  Follows a link uniformly at random: p(t+1) = M·p(t)
Suppose the random walk reaches a steady state: p(t+1) = M·p(t) = p(t)
  Then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M·r
  So, r is a stationary distribution for the random walk
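The random-surfer picture can also be simulated directly; a sketch (the graph and step count are assumptions) whose visit frequencies approach the stationary distribution:

```python
import random
from collections import Counter

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # the y/a/m example

def surf(steps=1_000_000, start="y"):
    visits, page = Counter(), start
    for _ in range(steps):
        page = random.choice(out_links[page])  # follow an out-link uniformly at random
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

print(surf())  # visit fractions approach r = (0.4, 0.4, 0.2)
```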
A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
  rj(t+1) = Σi→j ri(t)/di,  or equivalently  r = M·r

Does this converge?
Does it converge to what we want?
Are the results reasonable?
Does this converge? Example: two pages a ⇄ b (links a→b and b→a), iterating rj(t+1) = Σi→j ri(t)/di:

  ra  =  1   0   1   0   ...
  rb     0   1   0   1   ...
  Iteration 0, 1, 2, ...

The scores oscillate forever and never converge.
Does it converge to what we want? Example: a→b, where b is a dead end, iterating rj(t+1) = Σi→j ri(t)/di:

  ra  =  1   0   0   0   ...
  rb     0   1   0   0   ...
  Iteration 0, 1, 2, ...

All the importance leaks out through the dead end.
2 problems:
(1) Some pages are dead ends (have no out-links)
    Random walk has nowhere to go
    Such pages cause importance to "leak out"
(2) Spider traps (all out-links are within the group)
    Random walker gets stuck in the trap
    And eventually spider traps absorb all importance
Power Iteration with a spider trap (m now links only to itself):

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    0 ]          ra = ry/2
    m [  0   1/2   1 ]          rm = ra/2 + rm

  Set rj = 1/N and iterate

Example (iterations 0, 1, 2, ...):
  ry     1/3   2/6   3/12    5/24   ...   0
  ra  =  1/3   1/6   2/12    3/24   ...   0
  rm     1/3   3/6   7/12   16/24   ...   1

All the PageRank score gets trapped in node m.
The Google solution for spider traps: At each time step, the random surfer has two options:
  With prob. β, follow a link at random
  With prob. 1−β, jump to some random page
  Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of the spider trap within a few time steps.

[Figure: the y/a/m graph before and after adding teleport edges out of the trap at m]
Power Iteration with a dead end (m now has no out-links):

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    0 ]          ra = ry/2
    m [  0   1/2   0 ]          rm = ra/2

  Set rj = 1/N and iterate

Example (iterations 0, 1, 2, ...):
  ry     1/3   2/6   3/12   5/24   ...   0
  ra  =  1/3   1/6   2/12   3/24   ...   0
  rm     1/3   1/6   1/12   2/24   ...   0

Here the PageRank leaks out since the matrix is not column stochastic.
Teleports: Follow random teleport links with probability 1.0 from dead ends
  Adjust matrix accordingly

Before (m is a dead end):          After (teleport from m):
      y    a    m                      y    a    m
  y [ 1/2  1/2  0 ]                y [ 1/2  1/2  1/3 ]
  a [ 1/2   0   0 ]                a [ 1/2   0   1/3 ]
  m [  0   1/2  0 ]                m [  0   1/2  1/3 ]
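One way to implement this adjustment (a sketch; the function name is mine):

```python
import numpy as np

def patch_dead_ends(M):
    """Return a copy of M whose all-zero (dead-end) columns are replaced
    by uniform teleport columns of 1/N, making M column stochastic."""
    N = M.shape[0]
    M = M.copy()
    dead = M.sum(axis=0) == 0   # a column sums to 0 iff the page has no out-links
    M[:, dead] = 1.0 / N        # teleport with probability 1.0 from dead ends
    return M
```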
Google's solution that does it all:
At each step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1−β, jump to some random page

PageRank equation [Brin-Page, 98]:

  rj = Σi→j β·ri/di + (1−β)·1/N        (di ... out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
PageRank equation [Brin-Page, 98]:

  rj = Σi→j β·ri/di + (1−β)·1/N

The Google Matrix A:

  A = β·M + (1−β)·[1/N]N×N

  [1/N]N×N ... N-by-N matrix where all entries are 1/N

We again have a recursive problem: r = A·r
And the Power method still works!
What is β?
  In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)
Example with β = 0.8 (m is a spider trap):

          1/2 1/2  0           1/3 1/3 1/3        7/15  7/15   1/15
A = 0.8 · 1/2  0   0   + 0.2 · 1/3 1/3 1/3   =    7/15  1/15   1/15
           0  1/2  1           1/3 1/3 1/3        1/15  7/15  13/15

Power iteration on A (iterations 0, 1, 2, ...):
  y     1/3   0.33   0.24   0.26   ...    7/33
  a  =  1/3   0.20   0.20   0.18   ...    5/33
  m     1/3   0.46   0.52   0.56   ...   21/33
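The example can be reproduced in a few lines; a sketch that forms A explicitly and iterates (fine for a 3-node toy, not for the real Web):

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])        # m is a spider trap
N = M.shape[0]

A = beta * M + (1 - beta) / N          # Google matrix A

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r

print(r)  # -> approx [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]
```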
Key step is matrix-vector multiplication: rnew = A·rold
  Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages
  We need 4 bytes (say) for each entry
  2 billion entries for the two vectors: approx 8 GB
  But matrix A has N² entries
  10^18 is a large number!

          1/2 1/2  0           1/3 1/3 1/3        7/15  7/15   1/15
A = 0.8 · 1/2  0   0   + 0.2 · 1/3 1/3 1/3   =    7/15  1/15   1/15
           0  1/2  1           1/3 1/3 1/3        1/15  7/15  13/15
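This is why practical implementations keep M sparse (one stored entry per hyperlink) and never materialize the dense A; a sketch with SciPy, assuming the 3-node example above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse M for the spider-trap example: one stored entry per hyperlink.
rows = [0, 0, 1, 2, 2]
cols = [0, 1, 0, 1, 2]
vals = [0.5, 0.5, 0.5, 0.5, 1.0]
M = csr_matrix((vals, (rows, cols)), shape=(3, 3))

beta, N = 0.8, 3
r = np.full(N, 1.0 / N)
for _ in range(100):
    r = beta * (M @ r) + (1 - beta) / N   # teleport term added implicitly
print(r)                                   # same result, no N x N dense matrix
```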
Suppose there are N pages
Consider page i, with di out-links
We have Mji = 1/di when i→j, and Mji = 0 otherwise
The random teleport is equivalent to:
  Adding a teleport link from i to every other page, with transition probability (1−β)/N
  Reducing the probability of following each out-link from 1/di to β/di
  Equivalent: Tax each page a fraction (1−β) of its score and redistribute it evenly

So we have r = A·r, where Aji = β·Mji + (1−β)/N:

  rj = Σi Aji·ri
     = Σi [β·Mji + (1−β)/N]·ri
     = β·Σi Mji·ri + (1−β)/N · Σi ri
     = β·Σi Mji·ri + (1−β)/N        since Σi ri = 1

So we get: r = β·M·r + [(1−β)/N]N

  [x]N ... a vector of length N with all entries x

Note: Here we assumed M has no dead ends.
Input: Graph G and parameter β
  Directed graph G (can have spider traps and dead ends)
  Parameter β
Output: PageRank vector r

Set: rj(0) = 1/N
Repeat until convergence (Σj |rj(t) − rj(t−1)| > ε):
  ∀j:  r'j(t) = Σi→j β·ri(t−1)/di
       r'j(t) = 0 if in-degree of j is 0
  Now re-insert the leaked PageRank:
  ∀j:  rj(t) = r'j(t) + (1−S)/N,   where S = Σj r'j(t)

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
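A direct sketch of the full algorithm above in Python (the graph input format and function name are my assumptions):

```python
import numpy as np

def pagerank(out_links, beta=0.8, eps=1e-10):
    """PageRank for a graph that may contain spider traps and dead ends.
    out_links maps each page to the list of pages it links to."""
    nodes = sorted(out_links)
    idx = {n: k for k, n in enumerate(nodes)}
    N = len(nodes)
    r = np.full(N, 1.0 / N)                          # r_j(0) = 1/N
    while True:
        r_new = np.zeros(N)                          # pages with in-degree 0 stay 0
        for i, targets in out_links.items():
            for j in targets:                        # i -> j contributes beta * r_i / d_i
                r_new[idx[j]] += beta * r[idx[i]] / len(targets)
        S = r_new.sum()                              # leaked mass (dead ends leak extra)
        r_new += (1.0 - S) / N                       # re-insert the leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return dict(zip(nodes, r_new))
        r = r_new

# Example: "m" is a dead end, so more than 1 - beta leaks each step.
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```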
Problems with PageRank:
  Measures generic popularity of a page
    Biased against topic-specific authorities
    Solution: Topic-Specific PageRank (next)
  Uses a single measure of importance
    Other models of importance exist
    Solution: Hubs-and-Authorities
  Susceptible to link spam
    Artificial link topologies created in order to boost PageRank
    Solution: TrustRank
