
Speeding up Algorithms on Compressed Web Graphs
Practica de Cercetare, June 16, 2011

Student: Neamtu Elena Ramona, MOC1

ABSTRACT
This paper shows that several important classes of web graph algorithms can be extended to run directly on virtual-node-compressed graphs, so that their running times depend on the size of the compressed graph rather than on the original. These include algorithms for link analysis, for estimating the size of vertex neighborhoods, and a variety of algorithms based on matrix-vector products and random walks.

1. INTRODUCTION
One approach to implementing algorithms on compressed graphs is to decompress the graph on the fly, so that a client algorithm does not need to know how the underlying graph is compressed. Another approach is to design specialized algorithms that use the compressed representation directly. In virtual node compression, a succinct representation of the graph is constructed by replacing dense subgraphs with sparse ones. In particular, a directed bipartite clique on a vertex set K is replaced by a star centered at a new virtual node, with the nodes of K as its leaves. Applying this transformation repeatedly yields a compressed graph with significantly fewer edges and a relatively small number of additional nodes. Motwani and Feder showed that several classical graph algorithms can be sped up using a similar type of virtual node compression, in which an undirected clique is transformed into a star. More recently, Buehrer and Chellapilla demonstrated that virtual node compression can achieve high compression ratios for web graphs. A large class of web graph algorithms can be extended to run on virtual-node-compressed graphs, with running-time speed-ups proportional to the compression ratio. As a fundamental tool, we first show that multiplication by the adjacency matrix of the graph can be performed in time proportional to the size of the compressed graph. Using this matrix multiplication routine as a black box, we obtain significant speed-ups for numerous popular web graph algorithms, including PageRank, HITS, and SALSA, and various algorithms based on random walks. The multiplication routine can be implemented in the sequential file access model, and can be implemented on a distributed graph using a small number of global synchronizations.

2. BACKGROUND
2.1 Graph Compression Using Virtual Nodes
A directed bipartite clique (or biclique) (S, T) is a pair of disjoint vertex sets S and T such that for each u ∈ S and v ∈ T there is a directed edge from u to v in G. Given a biclique (S, T), we form a new compressed graph G′ by adding a new vertex w to the graph, removing all the edges of (S, T), and adding a new edge uw for each u ∈ S and a new edge wv for each v ∈ T. This transformation is depicted in Figure 1.
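As a concrete illustration, here is a minimal Python sketch of a single transformation; the edge-set representation and the names biclique_star and w are illustrative assumptions, not part of the original construction:

    # Minimal sketch: replace the biclique (S, T) by a star through a new
    # virtual node w. The graph is represented as a set of directed edges.
    def biclique_star(edges, S, T, w):
        edges = {(u, v) for (u, v) in edges if not (u in S and v in T)}
        edges |= {(u, w) for u in S}   # u -> w for each u in S
        edges |= {(w, v) for v in T}   # w -> v for each v in T
        return edges

    # The 3x3 biclique has 9 edges; the star replacing it has only 6.
    G = {(u, v) for u in (1, 2, 3) for v in (4, 5, 6)}
    G2 = biclique_star(G, {1, 2, 3}, {4, 5, 6}, w="w1")
    print(len(G), len(G2))  # 9 6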

Figure 1: The bipartite clique-star transformation.

We call the node w a virtual node, as opposed to the real nodes already present in G. Note that the biclique-star transformation essentially replaces an edge uv in G with a unique path u → w → v in G′ that acts as a placeholder for the original edge. We will call such a path a virtual edge. The biclique-star transformation may be performed again on G′. We allow virtual nodes to be reused, so the bipartite clique (S′, T′) found in G′ may contain virtual nodes; the virtual edge path between u and v in the resulting graph G′′ is then extended, for example to u → w → w′ → v. To obtain significant compression, this process is repeated many times, and the graph obtained is called a compression of G. Generally, given two digraphs G(V, E) and G′(V′, E′), we say G′ is a compression of G if it can be obtained by applying a series of bipartite clique-star transformations to G. This relation is denoted G′ ⪯ G. Any compression G′ ⪯ G satisfies the following properties:
1. There is a one-to-one correspondence between the edges of G and the set of edges and virtual edges of G′: if uv ∈ E is such that uv ∉ E′, then there exists a unique virtual-edge path from u to v in G′.
2. The graph induced by the edges incident to and from virtual nodes in G′ is a set of disjoint directed trees. We will refer to this as the acyclic property of compressed graphs.
3. Let Q be the set of real nodes in G′ that are reachable from a real node u by paths consisting internally entirely of virtual nodes. Then Q is exactly the set of out-neighbors of u in G.

2.2 Finding Virtual Nodes Using Frequent Item-set Mining


The most important parameters of a compression are the numbers of edges and nodes in the compressed graph. We refer to the quantity |E|/|E′| as the compression ratio. We also want to bound the maximum length of any virtual edge, which we call the depth of the compression.

Buehrer and Chellapilla introduced an algorithm that produces compressions of web graphs with high compression ratio and small depth. Their algorithm finds collections of bicliques using techniques from frequent item-set mining, and runs in time O(|E| log |V|). They report that the resulting compressed graphs contain five to ten times fewer edges than the originals, for a variety of page-level web graphs. Obtaining this compression typically requires 4-5 phases of the algorithm, leading to compressions whose depth is a small constant. Finding the optimal compression is NP-hard, but a good approximation algorithm exists for the restricted problem of finding the best compression obtained from a collection of vertex-disjoint cliques.

2.3 Notation
We consider directed graphs G(V, E) with no self-loops or parallel edges. We denote the sets of in-neighbors and out-neighbors of a node v by in^G(v) and out^G(v) respectively. We overload the symbol E to denote the adjacency matrix of the graph, where

E[u, v] = 1 if uv ∈ E, and 0 otherwise.

We will denote the probability of a transition from u to v by Pr(u, v). By W we denote the random walk matrix obtained by normalizing each row of E. So, if p_0 is the starting probability distribution, then p_1 = W^T p_0 is the distribution resulting from a single step of the uniform random walk on the graph.
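For concreteness, a small numpy sketch of a single step; the 3-node graph is an arbitrary example, not taken from the paper:

    # One uniform random-walk step p1 = W^T p0 on a toy 3-node graph.
    import numpy as np

    E = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
    W = E / E.sum(axis=1, keepdims=True)  # row-normalize: the walk matrix
    p0 = np.full(3, 1 / 3)                # uniform starting distribution
    p1 = W.T @ p0                         # distribution after one step
    print(p1, p1.sum())                   # p1 still sums to 1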

3. Speeding up Matrix-Vector Multiplication


The multiplication of a vector by the adjacency matrix of a graph can be carried out in time proportional to the size of the graph's compressed representation. This matrix multiplication can then be used as a black box to obtain efficient compressed implementations.

3.1 Adjacency Matrix Multiplication


Proposition 1. Let G be a graph with adjacency matrix E, and let G′ ⪯ G be a compression of G. Then for any vector x ∈ R^{|V|}, the matrix-vector product E^T x can be computed in time O(|E′| + |V′|). This computation needs only sequential access to the adjacency list of G′ and does not require the original graph G.

Proof. First, consider what the computation y = E^T x looks like when the uncompressed graph G is accessible. Algorithm 1 performs a series of what are popularly called push operations: the value stored at node u in x is pushed along each edge uv. This algorithm simply encodes the definition of y:

y[v] = Σ_{uv ∈ E} x[u]    (1)

Algorithm 1: Multiply(E, x)
  forall nodes v ∈ V do y[v] = 0
  forall nodes u ∈ V do
    forall edges uv ∈ E do
      y[v] = y[v] + x[u]
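A direct Python rendering of Algorithm 1 could look as follows; the adjacency-list convention (a dict adj mapping every node, including sinks, to its list of out-neighbors) is an assumption used throughout these sketches:

    # Push-based computation of y = E^T x on the uncompressed graph.
    def multiply(adj, x):
        y = {v: 0.0 for v in adj}   # adj has an entry for every node
        for u in adj:               # push x[u] along every out-edge uv
            for v in adj[u]:
                y[v] += x[u]
        return y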

For a virtual node v, we expand x[v] as:

x[v] = Σ_{uv ∈ E′} x[u]    (2)

The equation that computes y using the compressed graph G′ is:

y[v] = Σ_{uv ∈ E′} x[u]    (3)

Definitions (1) and (3) of y are equivalent; this follows easily from the acyclic property. Using the recursive definition (2), we can expand the terms corresponding to virtual nodes on the right side of equation (3) to obtain exactly equation (1). Note that the input vector x is not defined on virtual nodes; due to the recursive definition (2), the values at virtual nodes have dependencies. We illustrate this in Figure 2, where w is a virtual node.

Figure 2: Push operation on a compressed graph:
y[v] = x[u1] + x[u2] + x[u3] + x[u4] + x[u5] = x[u1] + x[u2] + x[w]

We now formalize this by assigning a rank R(v) to each virtual node v, using the following recursive definition. If every in-neighbor u of v (that is, every u with uv ∈ E′) is real, then R(v) = 0; otherwise, R(v) = 1 + max{R(u) : u ∈ in^{G′}(v), u virtual}. We now reorder the rows of the adjacency list representation of G′ in the following manner:
1. Adjacency lists of real nodes appear before those of virtual nodes.
2. For two virtual nodes u and v, if R(u) < R(v) then the adjacency list of u appears before that of v.
Algorithm 2 computes y using this reordered representation of G′.

Algorithm 2: Compressed-Multiply(E′, x)
  forall real nodes v do y[v] = 0
  forall virtual nodes v do x[v] = 0
  forall nodes u ∈ V′ do
    forall edges uv ∈ E′ do
      if v is real then y[v] = y[v] + x[u]
      else x[v] = x[v] + x[u]

Note that the reordering can be performed during preprocessing, by computing the ranking function R with a simple algorithm that requires O(|E′| + |V′|) time.
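A Python sketch of Algorithm 2 under the same assumed representation; node_order is assumed to list real nodes first and then virtual nodes in increasing rank, exactly the reordering described above, so each virtual accumulator x[v] is complete before v's own adjacency list is processed:

    # Push-based computation of y = E^T x using only the compressed graph.
    def compressed_multiply(adj, node_order, is_virtual, x):
        x = dict(x)                  # x is given on real nodes only
        y = {}
        for v in node_order:
            if is_virtual(v):
                x[v] = 0.0           # accumulator for the virtual node
            else:
                y[v] = 0.0
        for u in node_order:         # real nodes first, then by rank
            for v in adj[u]:
                if is_virtual(v):
                    x[v] += x[u]     # defer through the virtual node
                else:
                    y[v] += x[u]
        return y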

3.2 Applications of Compressed Multiplication


Here we describe a few examples of algorithms that can be written in terms of adjacency matrix multiplication.

Random walk distribution: starting from an initial distribution p_0, we perform T iterations, computing p_{t+1} = E^T D^{-1} p_t, where D is the diagonal matrix of out-degrees. The time per iteration is O(|V′|) + O(|E′| + |V′|) = O(|E′| + |V′|).

Eigenvectors and spectral methods: the time required per iteration is O(|E′| + |V′|).

Top singular vectors: the compressed in-link graph and out-link graph have the same values of |E′| and |V′|, so the time per iteration is O(|E′| + |V′|).

Estimating the size of neighborhoods: Becchetti et al. introduced an algorithm for estimating the number of nodes within r steps of each node in a graph, based on probabilistic counting. Each iteration can be viewed as a multiplication by the adjacency matrix in which the sum is replaced by bitwise OR.

The approaches described above can be used to speed up the canonical link analysis algorithms PageRank, HITS, and SALSA. These algorithms essentially perform several iterations of the power method, for different graph-related matrices. Each iteration requires Θ(|E| + |V|) operations on an uncompressed graph G.

PageRank: given a graph G with adjacency matrix E, PageRank can be computed by the following power method step: x_{i+1} = (1 − ε) E^T (D^{-1} x_i) + ε j, where ε is the jump probability and j is the jump vector.

HITS and SALSA: the HITS algorithm assigns a separate hub score and authority score to each web page in a query-dependent graph, equal to the top eigenvectors of E E^T and E^T E. SALSA can be viewed as a normalized version of HITS, where the authority vector a and hub vector h are the top eigenvectors of W_r^T W_c and W_c W_r^T, and W_r and W_c are the row- and column-normalized versions of E.
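Returning to the PageRank step above, a hedged sketch of one iteration built on the black-box multiply() from Section 3.1; on a compressed graph one would substitute compressed_multiply. The names eps and j, and the crude handling of dangling nodes, are illustrative assumptions:

    # One power step: x <- (1 - eps) * E^T (D^{-1} x) + eps * j.
    def pagerank_step(adj, x, eps, j):
        # D^{-1} x: divide each entry by the out-degree (dangling nodes
        # are clamped to degree 1 here, a simplification).
        scaled = {u: x[u] / max(len(adj[u]), 1) for u in adj}
        y = multiply(adj, scaled)    # black-box E^T (D^{-1} x)
        return {v: (1 - eps) * y[v] + eps * j[v] for v in adj}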

4. Stochastic Algorithms on Compressed Graphs


In this section, we show that the stationary vector of a random walk on the original graph can be computed by finding the stationary vector of a Markov chain running on the compressed graph, then projecting and rescaling.

4.1 PageRank on Compressed Graphs


PageRank models a uniform random walk on the web graph performed by a random surfer. The matrix W represents the underlying Markov chain. To ensure ergodicity, we assume that the surfer clicks on a random link on a page only with probability 1 − ε, 0 < ε < 1. With probability ε, he jumps to a page chosen from the probability distribution j, where j is the jump vector. The equation governing the steady state is:

p = ((1 − ε) W^T + ε J) p = L^T p

Our goal is to run a similar algorithm on a compression G′ ⪯ G such that, restricted to the nodes of V, it models the jump-adjusted uniform random walk on G. For a graph G′ (compressed or otherwise), we define φ_{G′}(u) as follows:

φ_{G′}(u) = 1, if u is real
φ_{G′}(u) = Σ_{w ∈ out^{G′}(u)} φ_{G′}(w), if u is virtual

For a virtual node u, φ_{G′}(u) is the number of real nodes reachable from u along paths of virtual nodes. Given the function φ_{G′}, we define the real out-degree of u in G′ as:

d^{out}_{G′}(u) = Σ_{w ∈ out^{G′}(u)} φ_{G′}(w)
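A short Python sketch of computing φ_{G′} and d^{out}_{G′}; the memoized recursion terminates because edges between virtual nodes form disjoint trees (the acyclic property). The (adj, is_virtual) representation is an assumption carried over from the earlier sketches:

    # phi[u] = 1 for real u; for virtual u, the number of real nodes
    # reachable from u through virtual nodes.
    def compute_phi(adj, is_virtual):
        phi = {}
        def rec(u):
            if u not in phi:
                phi[u] = 1 if not is_virtual(u) else sum(rec(w) for w in adj[u])
            return phi[u]
        for u in adj:
            rec(u)
        return phi

    # Real out-degree: d[u] = sum of phi over u's out-neighbors.
    def real_out_degree(adj, phi):
        return {u: sum(phi[w] for w in adj[u]) for u in adj}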

Figure 3: Illustration of the function φ.

For a real node u, d^{out}_{G′}(u) is the number of real nodes in G′ reachable from u using one edge or one virtual edge, that is, its out-degree in G. For a virtual node u, d^{out}_{G′}(u) = φ_{G′}(u). We design a random walk on a graph G′ compressed from G that exhibits the desired modeling behavior:
1. The random walk on G′ is not uniform (unlike the one on G).
2. We ensure that the jump vector has zeros in the entries corresponding to virtual nodes. Similarly, transitions made from virtual nodes have zero jump probability.
Given graphs G′ ⪯ G, jump probability ε, and the jump vector j, we define the random walk on G′ as follows.

Let X be the matrix of dimension |V′| × |V′| such that:

X[u, v] = φ_{G′}(v) / d^{out}_{G′}(u), if uv ∈ E′, and 0 otherwise.

We obtain Y from X by making adjustments for the jump probability:

Y[u, v] = (1 − ε) X[u, v], if u is real
Y[u, v] = X[u, v], if u is virtual

Pad the jump vector j with zeros to obtain a jump vector j′ for G′, and let J′ be the jump matrix containing a copy of j′ in each column. The desired Markov chain is given by the transition matrix MC(G′) = Z = Y + ε J′^T. Algorithm 3 takes as input a graph G, its compressed representation G′, the jump probability ε, and the jump vector j, and computes PageRank on the vertices of G using strictly the graph G′.

Algorithm 3: ComputePageRank(G, G′, ε, j)
1. Compute Z = MC(G′).
2. Compute the steady state p′ of the Markov chain represented by Z.
3. Project p′ onto the set of real nodes to obtain p̃, discarding the values at virtual nodes.
4. Scale p̃ up to unit L1 norm to obtain p, the desired vector of PageRank values on G.
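Steps 3 and 4 are simple enough to state in code; a minimal sketch, assuming p_prime maps every node of G′ to its steady-state probability:

    # Project the compressed steady state onto the real nodes, then
    # rescale to unit L1 norm (steps 3-4 of Algorithm 3).
    def project_and_rescale(p_prime, real_nodes):
        p = {u: p_prime[u] for u in real_nodes}  # discard virtual entries
        total = sum(p.values())
        return {u: val / total for u, val in p.items()}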

This algorithm can be implemented to run in time Θ(r(|E′| + |V′|)), where r is the desired number of power iterations.

Theorem 1. The vector p computed by Algorithm 3 satisfies p = ((1 − ε) W^T + ε J) p. That is, p is the steady state of the jump-adjusted uniform random walk MC(G).

Claim 1. For all 0 ≤ i ≤ k and u ∈ V_i, p_{i+1}[u] = α_i p_i[u], where α_i is a constant depending only upon i.

If the scaling factor is very small, the computed values of p′ will contain very few bits of accuracy, and the subsequent scaling up will only maintain this precision. Theorem 2 proves a lower bound on α.

Theorem 2. Let G′ = G_k ⪯ G_{k-1} ⪯ ... ⪯ G_1 ⪯ G_0 = G be any sequence of graphs as in the proof of Theorem 1, and let α be the scaling factor between p̃ and p in Algorithm 3. Then α ≥ 2^{−k}.

We can use a sequence of graphs G_i such that G_i is the graph after i phases of edge-disjoint clique-star transformations. Since only 4-5 phases are required in practice to obtain nearly the best possible compression, the theorem implies that we lose only 4-5 bits of floating point accuracy when using Algorithm 3.

4.2 SALSA on Compressed Graphs


SALSA is a link analysis algorithm, similar to HITS, that assigns each webpage a separate authority score and hub score. Let G(V, E) be the query-specific graph under consideration, with W_r and W_c being the row- and column-normalized versions of E respectively. Then the authority vector a and the hub vector h are the top eigenvectors of W_r^T W_c and W_c W_r^T respectively, satisfying the following recursive definition:

a = W_r^T h
h = W_c a
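On the uncompressed graph these coupled updates give a straightforward power iteration; a minimal numpy sketch, assuming every node has at least one in-link and one out-link (otherwise the normalizations below divide by zero):

    # SALSA power iteration: a <- Wr^T h, h <- Wc a, renormalized each round.
    import numpy as np

    def salsa(E, iters=100):
        Wr = E / E.sum(axis=1, keepdims=True)  # row-normalized
        Wc = E / E.sum(axis=0, keepdims=True)  # column-normalized
        n = E.shape[0]
        a = np.full(n, 1 / n)
        h = np.full(n, 1 / n)
        for _ in range(iters):
            a = Wr.T @ h
            h = Wc @ a
            a /= a.sum()                       # keep both score vectors
            h /= h.sum()                       # at unit L1 norm
        return a, h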

The solutions a and h of the above system are unique and have non-negative entries. As with PageRank, the power method can be employed to compute these eigenvectors. We define a function ψ_{G′} in a manner analogous to φ_{G′}, but over in-neighbors:

ψ_{G′}(u) = 1, if u is real
ψ_{G′}(u) = Σ_{w ∈ in^{G′}(u)} ψ_{G′}(w), if u is virtual

Similarly, we define the in-degree analogue of d^{out}_{G′} as:

d^{in}_{G′}(u) = Σ_{w ∈ in^{G′}(u)} ψ_{G′}(w)

Consider an edge uv ∈ E in the graph G and the corresponding virtual edge u → w → v in the compressed graph G′. In the case of PageRank, only one commodity, the PageRank score, flows through the network; the virtual nodes in the compressed graph merely delay the flow of PageRank between real nodes. In the case of SALSA the situation is different: in the original graph G, the hub score of node u is pushed along a forward edge (uv ∈ E) into the authority score bucket of node v, whereas the authority score of node v is pushed along the reverse edge into the hub score of node u. If we attempted to run SALSA power iterations unchanged on G′, h[u] would contribute to a[w] but never to a[v]. For a virtual edge u → w → v, we must push the hub score h[u] into the hub score h[w], which subsequently contributes to a[v] as desired.

a_{i+1}[u] = Σ_{v ∈ in^G(u)} h_i(v) / |out^G(v)|
h_{i+1}[u] = Σ_{v ∈ out^G(u)} a_i(v) / |in^G(v)|

Figure 4: SALSA on the uncompressed graph.

a_{i+1}[u] = Σ_{v ∈ in^{G′}(u)} h_i(v) / d^{out}_{G′}(v), if u is real
a_{i+1}[u] = d^{in}_{G′}(u) · Σ_{v ∈ out^{G′}(u)} a_i(v) / d^{in}_{G′}(v), if u is virtual

h_{i+1}[u] = Σ_{v ∈ out^{G′}(u)} a_i(v) / d^{in}_{G′}(v), if u is real
h_{i+1}[u] = d^{out}_{G′}(u) · Σ_{v ∈ in^{G′}(u)} h_i(v) / d^{out}_{G′}(v), if u is virtual

Figure 5: SALSA on the compressed graph.

Unlike PageRank, the irreducibility and aperiodicity of this Markov chain are not immediately obvious. Aperiodicity can be obtained by introducing a non-zero probability of non-transition on real nodes. The following theorem proves the correctness of our solution.

Theorem 3. Let [a′, h′] and [a, h] be the top eigenvectors of M′ and M respectively. Then:
1. a′[u] = α a[u] and h′[u] = α h[u] for all u ∈ V(G), for some scaling factor α.
2. If k is the length of the longest virtual edge in G′, then α ≥ 2^{−k}.

4.3 Comparison of the Two Approaches

We now summarize the advantages and disadvantages of computing PageRank and SALSA with the black-box multiplication algorithms versus the Markov chain algorithms. Although the Markov chain algorithms converge to eigenvectors that are similar to the corresponding eigenvectors on the uncompressed graph, the number of iterations required may change; it may increase by at most a factor of the length of the longest virtual edge. The black-box methods simply speed up each individual iteration, so the number of iterations required is identical. The number of iterations required by the Markov chain algorithms, and the overall comparison in speed-up ratios, is examined experimentally in Section 5. An existing implementation of PageRank can be run directly on the compressed graph, with appropriately modified weights, to compute PageRank on the original uncompressed graph. Both methods can be efficiently parallelized. Black-box multiplication requires that certain sets of virtual nodes be pushed before others, requiring a small number of global synchronizations in each iteration. For the Markov chain method, any parallel algorithm for computing PageRank or SALSA can be used. The Markov chain methods are not directly applicable to HITS, because the scaling step after every iteration destroys correctness. The black-box method for SALSA needs lists of in-links of virtual nodes and separate orderings of virtual nodes by in-links and out-links; this adds to the storage required for the compressed graph, apart from slowing the algorithm down to a small extent.

5. Experiments

The methods discussed above were implemented and compared against the standard versions of PageRank and SALSA running on uncompressed graphs. Both the black-box and Markov chain methods show an improvement in the time per iteration over the uncompressed versions of the algorithms. However, the Markov chain method requires more iterations to converge to the same accuracy, bringing down the net performance gain. The overall speed-up ratios do not exactly match the reduction in the number of edges; this is due to the fact that both algorithms perform some book-keeping operations on the compressed graph because of the increased number of nodes. The algorithms nonetheless achieve significant speed-ups over the uncompressed versions. In the case of SALSA, the Markov chain method performs better than the black-box method. The Markov chain method for SALSA also requires slightly less storage on disk, since it only needs to store one ordering of the virtual nodes.

References
[1] R. Andersen and K. J. Lang. Communities from seed sets. In WWW, 2006, pp. 223-232.
[2] P. Boldi and S. Vigna. The WebGraph framework II: Codes for the World-Wide Web. In Data Compression Conference, 2004, p. 258.
[3] F. R. K. Chung. Spectral Graph Theory. AMS, 1997.
[4] T. Feder and R. Motwani. Clique partitions, graph compression and speeding-up algorithms. 1995, pp. 261-272.
[5] F. McSherry. A uniform approach to accelerated PageRank computation. In WWW, 2005, pp. 575-582.

