
Web as a directed graph:
  Nodes: Webpages
  Edges: Hyperlinks

[Figure: a small web graph whose nodes are example pages ("I teach a class on IR. CS F469: Classes are in the LTC", "CSIS Dept at BITS HYD", "BITS PILANI HYD CAMPUS", "Today 2.00pm there will be a demo on writing a scraper in F-106 by Mr. Amit") connected by hyperlink edges.]
How to organize the Web?
First try: Human-curated web directories
  Yahoo, DMOZ, LookSmart
Second try: Web search
Information retrieval investigates:
  Finding relevant docs in a small and trusted set
  Newspaper articles, patents, etc.
But: the Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
(1) The Web contains many sources of information. Whom to trust?
    Trick: Trustworthy pages may point to each other!
(2) What is the best answer to the query "newspaper"?
    No single right answer
    Trick: Pages that actually know about newspapers might all be pointing to many newspapers

All web pages are not equally important:
  http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
  vs.
  http://www.bits-pilani.ac.in/Hyderabad/index.aspx

There is large diversity in the web-graph node connectivity. Let's rank the pages by the link structure!

We will cover the following link analysis approaches for computing the importance of nodes in a graph:
  PageRank
  Topic-Specific (Personalized) PageRank
  Web Spam Detection Algorithms
Idea: Links as votes
  A page is more important if it has more links
  In-coming links? Out-going links?
Think of in-links as votes:
  http://www.bits-pilani.ac.in/Hyderabad/index.aspx has 23,400 in-links
  http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile has 1 in-link

Are all in-links equal?
  Links from important pages count more
  Recursive question!
[Figure: example graph with PageRank scores on its nodes: B = 38.4, C = 34.3, E = 8.1, D = F = 3.9, A = 3.3, and five peripheral pages at 1.6 each.]
Each link's vote is proportional to the importance of its source page:
  If page j with importance rj has n out-links, each link gets rj/n votes
  Page j's own importance is the sum of the votes on its in-links

[Figure: page j with in-links from i and k; i contributes ri/3 (i has 3 out-links), k contributes rk/4 (k has 4 out-links); each of j's 3 out-links carries rj/3]

  rj = ri/3 + rk/4
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank rj for page j:

  rj = Σi→j ri/di        (di ... out-degree of node i)

Example: "The web in 1839" has pages y, a, m with links y→{y, a}, a→{y, m}, m→{a}; each of y's and a's out-links carries half its rank (y/2, a/2), and m's single link carries all of rm.

Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2

3 equations, 3 unknowns, no constraints
  No unique solution
  All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
  ry + ra + rm = 1
Solution: ry = 2/5, ra = 2/5, rm = 1/5

The Gaussian elimination method works for small examples, but we need a better method for large web-sized graphs.
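As a quick sanity check, the small system can indeed be solved directly. A minimal sketch in Python/NumPy, assuming the y/a/m graph above (not part of the original slides):

```python
import numpy as np

# Flow equations r = M r rewritten as (M - I) r = 0, with one redundant
# row replaced by the constraint ry + ra + rm = 1.
M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
              [0.5, 0.0, 1.0],   # ra = ry/2 + rm
              [0.0, 0.5, 0.0]])  # rm = ra/2

A = M - np.eye(3)
A[2, :] = 1.0                    # constraint row: ry + ra + rm = 1
b = np.array([0.0, 0.0, 1.0])

print(np.linalg.solve(A, b))     # -> [0.4 0.4 0.2], i.e. 2/5, 2/5, 1/5
```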
We need a new formulation!

Stochastic adjacency matrix M:
  Let page i have di out-links
  If i→j, then Mji = 1/di, else Mji = 0
  M is a column stochastic matrix: columns sum to 1
Rank vector r: a vector with an entry per page
  ri is the importance score of page i
  Σi ri = 1
The flow equations rj = Σi→j ri/di can be written: r = M·r
Remember the flow equation: rj = Σi→j ri/di
Flow equation in matrix form: M·r = r
  Suppose page i links to 3 pages, including j
  Then column i of M has the entry 1/3 in row j, so row j of the product M·r picks up (1/3)·ri, which is exactly i's contribution to rj
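To make the definition concrete, here is a small sketch that builds the column-stochastic M from an edge list; the y/a/m example graph is assumed as input:

```python
import numpy as np

# Edges of the example graph: y -> {y, a}; a -> {y, m}; m -> {a}.
edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "a")]
nodes = ["y", "a", "m"]
idx = {n: k for k, n in enumerate(nodes)}

N = len(nodes)
d = {n: sum(1 for s, _ in edges if s == n) for n in nodes}  # out-degrees

M = np.zeros((N, N))
for i, j in edges:
    M[idx[j], idx[i]] = 1.0 / d[i]   # M_ji = 1/d_i when i -> j

print(M.sum(axis=0))  # each column sums to 1 -> [1. 1. 1.]
```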
NOTE: x is an eigenvector of A with the corresponding eigenvalue λ if: A·x = λ·x

The flow equations can be written: r = M·r
So the rank vector r is an eigenvector of the stochastic web matrix M
  In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
  We know r is unit length (in the L1 norm) and each column of M sums to one, so ‖M·r‖₁ ≤ 1
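One can check the eigenvector claim numerically; a small sketch using NumPy's dense eigensolver on the example M (fine for toy graphs, not for web scale):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)       # principal eigenvalue of the stochastic matrix
v = vecs[:, k].real

print(vals[k].real)            # -> 1.0
print(v / v.sum())             # -> [0.4 0.4 0.2], the rank vector r
```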

We can now efficiently solve for r!
The method is called Power iteration.

Example graph y, a, m (y→{y, a}, a→{y, m}, m→{a}) and its stochastic matrix:

          y    a    m
    y [ 1/2  1/2   0 ]
M = a [ 1/2   0    1 ]
    m [  0   1/2   0 ]

r = M·r, i.e. the flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Power iteration: a simple iterative scheme
  Initialize: r(0) = [1/N, ..., 1/N]^T
  Iterate: r(t+1) = M·r(t), i.e. rj(t+1) = Σi→j ri(t)/di   (di ... out-degree of node i)
  Stop when |r(t+1) − r(t)|1 < ε
    |x|1 = Σ1≤i≤N |xi| is the L1 norm
    Can use any other vector norm, e.g., Euclidean
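A minimal sketch of this scheme in Python (the function name and defaults are mine, not from the slides):

```python
import numpy as np

def power_iterate(M, eps=1e-10, max_iter=100):
    """Iterate r(t+1) = M r(t) from the uniform vector until the L1 change < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                    # r(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r                         # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:     # |r(t+1) - r(t)|_1 < eps
            return r_next
        r = r_next
    return r
```

On the example matrix M above, this returns approximately [0.4, 0.4, 0.2].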
Power Iteration on the example:

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    1 ]          ra = ry/2 + rm
    m [  0   1/2   0 ]          rm = ra/2

  Set rj = 1/N
  1: r'j = Σi→j ri/di
  2: r = r'
  Goto 1

Example (iterations 0, 1, 2, ...):
  ry     1/3   1/3   5/12    9/24   ...   6/15
  ra  =  1/3   3/6   1/3    11/24   ...   6/15
  rm     1/3   1/6   3/12    1/6    ...   3/15
Imagine a random web surfer:
  At any time t, the surfer is on some page i
  At time t+1, the surfer follows an out-link from i uniformly at random
  Ends up on some page j linked from i
  Process repeats indefinitely
Let:
  p(t) ... vector whose ith coordinate is the probability that the surfer is at page i at time t
  So, p(t) is a probability distribution over pages

Where is the surfer at time t+1?
  Follows a link uniformly at random: p(t+1) = M·p(t)
Suppose the random walk reaches a steady state: p(t+1) = M·p(t) = p(t)
  Then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M·r
  So, r is a stationary distribution for the random walk
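The random-surfer picture can also be simulated directly; a sketch (the graph and step count are assumptions) whose visit frequencies approach the stationary distribution:

```python
import random
from collections import Counter

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # the y/a/m example

def surf(steps=1_000_000, start="y"):
    visits, page = Counter(), start
    for _ in range(steps):
        page = random.choice(out_links[page])  # follow an out-link uniformly at random
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

print(surf())  # visit fractions approach r = (0.4, 0.4, 0.2)
```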
A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
  rj(t+1) = Σi→j ri(t)/di,  or equivalently  r = M·r

Does this converge?
Does it converge to what we want?
Are the results reasonable?
Does this converge? Example: two pages a ⇄ b (links a→b and b→a), iterating rj(t+1) = Σi→j ri(t)/di:

  ra  =  1   0   1   0   ...
  rb     0   1   0   1   ...
  Iteration 0, 1, 2, ...

The scores oscillate forever and never converge.
Does it converge to what we want? Example: a→b, where b is a dead end, iterating rj(t+1) = Σi→j ri(t)/di:

  ra  =  1   0   0   0   ...
  rb     0   1   0   0   ...
  Iteration 0, 1, 2, ...

All the importance leaks out through the dead end.
2 problems:
(1) Some pages are dead ends (have no out-links)
    Random walk has nowhere to go
    Such pages cause importance to "leak out"
(2) Spider traps (all out-links are within the group)
    Random walker gets stuck in the trap
    And eventually spider traps absorb all importance
Power Iteration with a spider trap (m now links only to itself):

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    0 ]          ra = ry/2
    m [  0   1/2   1 ]          rm = ra/2 + rm

  Set rj = 1/N and iterate

Example (iterations 0, 1, 2, ...):
  ry     1/3   2/6   3/12    5/24   ...   0
  ra  =  1/3   1/6   2/12    3/24   ...   0
  rm     1/3   3/6   7/12   16/24   ...   1

All the PageRank score gets trapped in node m.
The Google solution for spider traps: At each time step, the random surfer has two options:
  With prob. β, follow a link at random
  With prob. 1−β, jump to some random page
  Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of the spider trap within a few time steps.

[Figure: the y/a/m graph before and after adding teleport edges out of the trap at m]
Power Iteration with a dead end (m now has no out-links):

          y    a    m
    y [ 1/2  1/2   0 ]          ry = ry/2 + ra/2
M = a [ 1/2   0    0 ]          ra = ry/2
    m [  0   1/2   0 ]          rm = ra/2

  Set rj = 1/N and iterate

Example (iterations 0, 1, 2, ...):
  ry     1/3   2/6   3/12   5/24   ...   0
  ra  =  1/3   1/6   2/12   3/24   ...   0
  rm     1/3   1/6   1/12   2/24   ...   0

Here the PageRank leaks out since the matrix is not column stochastic.
Teleports: Follow random teleport links with probability 1.0 from dead ends
  Adjust matrix accordingly

Before (m is a dead end):          After (teleport from m):
      y    a    m                      y    a    m
  y [ 1/2  1/2  0 ]                y [ 1/2  1/2  1/3 ]
  a [ 1/2   0   0 ]                a [ 1/2   0   1/3 ]
  m [  0   1/2  0 ]                m [  0   1/2  1/3 ]
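One way to implement this adjustment (a sketch; the function name is mine):

```python
import numpy as np

def patch_dead_ends(M):
    """Return a copy of M whose all-zero (dead-end) columns are replaced
    by uniform teleport columns of 1/N, making M column stochastic."""
    N = M.shape[0]
    M = M.copy()
    dead = M.sum(axis=0) == 0   # a column sums to 0 iff the page has no out-links
    M[:, dead] = 1.0 / N        # teleport with probability 1.0 from dead ends
    return M
```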
Google's solution that does it all:
At each step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1−β, jump to some random page

PageRank equation [Brin-Page, 98]:

  rj = Σi→j β·ri/di + (1−β)·1/N        (di ... out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
PageRank equation [Brin-Page, 98]:

  rj = Σi→j β·ri/di + (1−β)·1/N

The Google Matrix A:

  A = β·M + (1−β)·[1/N]N×N

  [1/N]N×N ... N-by-N matrix where all entries are 1/N

We again have a recursive problem: r = A·r
And the Power method still works!
What is β?
  In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)
Example with β = 0.8 (m is a spider trap):

          1/2 1/2  0           1/3 1/3 1/3        7/15  7/15   1/15
A = 0.8 · 1/2  0   0   + 0.2 · 1/3 1/3 1/3   =    7/15  1/15   1/15
           0  1/2  1           1/3 1/3 1/3        1/15  7/15  13/15

Power iteration on A (iterations 0, 1, 2, ...):
  y     1/3   0.33   0.24   0.26   ...    7/33
  a  =  1/3   0.20   0.20   0.18   ...    5/33
  m     1/3   0.46   0.52   0.56   ...   21/33
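The example can be reproduced in a few lines; a sketch that forms A explicitly and iterates (fine for a 3-node toy, not for the real Web):

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])        # m is a spider trap
N = M.shape[0]

A = beta * M + (1 - beta) / N          # Google matrix A

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r

print(r)  # -> approx [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]
```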
Key step is matrix-vector multiplication: rnew = A·rold
  Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages
  We need 4 bytes (say) for each entry
  2 billion entries for the two vectors: approx 8 GB
  But matrix A has N² entries
  10^18 is a large number!

          1/2 1/2  0           1/3 1/3 1/3        7/15  7/15   1/15
A = 0.8 · 1/2  0   0   + 0.2 · 1/3 1/3 1/3   =    7/15  1/15   1/15
           0  1/2  1           1/3 1/3 1/3        1/15  7/15  13/15
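This is why practical implementations keep M sparse (one stored entry per hyperlink) and never materialize the dense A; a sketch with SciPy, assuming the 3-node example above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse M for the spider-trap example: one stored entry per hyperlink.
rows = [0, 0, 1, 2, 2]
cols = [0, 1, 0, 1, 2]
vals = [0.5, 0.5, 0.5, 0.5, 1.0]
M = csr_matrix((vals, (rows, cols)), shape=(3, 3))

beta, N = 0.8, 3
r = np.full(N, 1.0 / N)
for _ in range(100):
    r = beta * (M @ r) + (1 - beta) / N   # teleport term added implicitly
print(r)                                   # same result, no N x N dense matrix
```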
Suppose there are N pages
Consider page i, with di out-links
We have Mji = 1/di when i→j, and Mji = 0 otherwise
The random teleport is equivalent to:
  Adding a teleport link from i to every other page, with transition probability (1−β)/N
  Reducing the probability of following each out-link from 1/di to β/di
  Equivalent: Tax each page a fraction (1−β) of its score and redistribute it evenly

So we have r = A·r, where Aji = β·Mji + (1−β)/N:

  rj = Σi Aji·ri
     = Σi [β·Mji + (1−β)/N]·ri
     = β·Σi Mji·ri + (1−β)/N · Σi ri
     = β·Σi Mji·ri + (1−β)/N        since Σi ri = 1

So we get: r = β·M·r + [(1−β)/N]N

  [x]N ... a vector of length N with all entries x

Note: Here we assumed M has no dead ends.
Input: Graph G and parameter β
  Directed graph G (can have spider traps and dead ends)
  Parameter β
Output: PageRank vector r

Set: rj(0) = 1/N
Repeat until convergence (Σj |rj(t) − rj(t−1)| > ε):
  ∀j:  r'j(t) = Σi→j β·ri(t−1)/di
       r'j(t) = 0 if in-degree of j is 0
  Now re-insert the leaked PageRank:
  ∀j:  rj(t) = r'j(t) + (1−S)/N,   where S = Σj r'j(t)

If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
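A direct sketch of the full algorithm above in Python (the graph input format and function name are my assumptions):

```python
import numpy as np

def pagerank(out_links, beta=0.8, eps=1e-10):
    """PageRank for a graph that may contain spider traps and dead ends.
    out_links maps each page to the list of pages it links to."""
    nodes = sorted(out_links)
    idx = {n: k for k, n in enumerate(nodes)}
    N = len(nodes)
    r = np.full(N, 1.0 / N)                          # r_j(0) = 1/N
    while True:
        r_new = np.zeros(N)                          # pages with in-degree 0 stay 0
        for i, targets in out_links.items():
            for j in targets:                        # i -> j contributes beta * r_i / d_i
                r_new[idx[j]] += beta * r[idx[i]] / len(targets)
        S = r_new.sum()                              # leaked mass (dead ends leak extra)
        r_new += (1.0 - S) / N                       # re-insert the leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return dict(zip(nodes, r_new))
        r = r_new

# Example: "m" is a dead end, so more than 1 - beta leaks each step.
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```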
Problems with PageRank:
  Measures generic popularity of a page
    Biased against topic-specific authorities
    Solution: Topic-Specific PageRank (next)
  Uses a single measure of importance
    Other models of importance exist
    Solution: Hubs-and-Authorities
  Susceptible to link spam
    Artificial link topologies created in order to boost PageRank
    Solution: TrustRank
