
Link Prediction across Networks by Biased Cross-Network Sampling

ICDE 2013

Introduction

Goal: predict future links in a growing network using the existing network structure.

Problem: the existing network may be too sparse, with too few links.

Solution: other (more densely linked) networks may be available which show similar linkage structure -> such source networks can be used in conjunction with the node attribute information of the sparse network.

Problem Formulation: Cross-Network Transfer Learning

Source network (mature, with richer linkage information): G0 = (V0, E0)

Node set V0 = {v1, v2, ..., vn}, with nodes indexed i, j

Edge (vi, vj) ∈ E0

Target network (sparse, nascent): G = (V, E)

Node set V = {v1, v2, ..., vm}, with nodes indexed s, t

Edge (vs, vt) ∈ E

The correspondence between the nodes in V and V0 is unknown; the only information which relates them is the attribute information available at the nodes.

Problem Formulation: Cross-Network Transfer Learning
Each node in V and V0 is associated with a set of keywords derived from the profile information of the respective social network.

The attributes associated with a node vs of the target network G form a feature vector xs.

The attributes associated with a node vi of the source network G0 form a feature vector x̄i.

Both xs and x̄i lie in the vector space R^d of dimension d.

Problem: given the training (source) network G0 = (V0, E0), along with its associated content attributes {x̄i}, determine the links which have the highest probability of appearing in the future in a currently existing target network G = (V, E) with corresponding attributes {xs}.

Problem Formulation

Two main algorithms are proposed:

Cross-Network Link Model: leverages link information in the source network in order to predict the links in the target network.

Cross-Network Bias Correction: determines sampling weights to correct bias, ensuring that links in the target network which are consistent with the source network are given much greater importance.

Algorithm 1: Cross-Network Link Model

Uses a latent-space approach that relates the node attributes to the probability of link presence in the source and target networks.

In a co-authorship network, the attribute vector of an author node corresponds to, for example, the keywords of their published papers.

Algorithm 1: Cross-Network Link Model

The attribute vectors x̄i and xs are mapped to latent vectors z̄i and zs in a latent topic space R^k.

The mapping is a linear transformation: z̄i = W x̄i and zs = W xs.

The social interaction between two nodes vs and vt in the target network can be measured by the similarity zs · zt between their corresponding latent vectors.

W is a k x d matrix, where k is the dimension of the latent topic space.

Ex: collaboration links between the authors in a co-authorship network can be inferred from the similarity between the latent topic vectors of their research interests and expertise.

To map the attribute vectors x to the latent vectors z, we first need to learn W.
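The mapping and the similarity-based link score above can be sketched as follows. This is a minimal illustration: W is random here purely for demonstration (in the algorithm it is learned from both networks), and the logistic squashing of the similarity is an assumed choice.

```python
import numpy as np

# Sketch of the latent-space mapping. W is random here purely for
# illustration; in the algorithm it is learned from both networks.
rng = np.random.default_rng(0)
d, k = 6, 3                      # attribute and latent dimensions
W = rng.standard_normal((k, d))  # latent transformation matrix (k x d)

x_s = rng.random(d)              # attribute vector of target node v_s
x_t = rng.random(d)              # attribute vector of target node v_t

z_s = W @ x_s                    # latent topic vector z_s = W x_s
z_t = W @ x_t                    # latent topic vector z_t = W x_t

def link_prob(z_a, z_b):
    """Logistic squashing of the latent similarity (an assumed choice)."""
    return 1.0 / (1.0 + np.exp(-(z_a @ z_b)))

p = link_prob(z_s, z_t)          # probability-like link score in (0, 1)
```

The key design point is that W is shared across both networks, so attribute vectors from either network land in the same topic space and become comparable.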

Algorithm 1: Cross-Network Link Model

To determine the topic space W, nodes with relevant content (in the source network), as well as nodes which are topologically well connected (in the target network), tend to be placed close together in the topic space.

Two components of matrix W:

1. The current state of the target network (content and structure)

The log-likelihood of the existing links

Algorithm 1: Cross-Network Link Model

Two components of matrix W:

2. Cross-network knowledge transferred from the source network to the target network

We combine the model with knowledge from a re-sampled source network, based on the sampling importance of the link between nodes vi and vj, denoted Pij.

Pij weights the importance of sampling the link (vi, vj) in the source network (each candidate link is either sampled, and thus exists in the re-sampled network, or does not).

Algorithm 1: Cross-Network Link Model

Next, we maximize the combined log-likelihood of the links in the source and target networks; this yields the optimal latent transformation matrix W.

Then, knowing W, the probability of a link between a pair of nodes is predicted from two quantities:

zs · zt = the latent-vector similarity of nodes vs and vt

bst = an Adamic-Adar feature defined on the pair of nodes vs and vt, capturing their common neighbors in the target network
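A minimal sketch of this prediction step, assuming the final score is a logistic function of the latent similarity plus a weighted Adamic-Adar term; the weight c and this exact combination are assumptions for illustration, not the paper's fitted model.

```python
import math

def adamic_adar(neigh, s, t):
    """Adamic-Adar feature b_st: sum of 1/log(degree) over common neighbors."""
    common = neigh[s] & neigh[t]
    return sum(1.0 / math.log(len(neigh[w])) for w in common if len(neigh[w]) > 1)

def predict_link(z_s, z_t, b_st, c=1.0):
    """Assumed combination: logistic of latent similarity plus c * b_st."""
    score = sum(a * b for a, b in zip(z_s, z_t)) + c * b_st
    return 1.0 / (1.0 + math.exp(-score))

# Toy target network: adjacency as neighbor sets.
neigh = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1}}
b01 = adamic_adar(neigh, 0, 1)          # nodes 0 and 1 share neighbors 2 and 3
p01 = predict_link([1.0, 0.0], [1.0, 0.0], b01)
```

The Adamic-Adar term lets the model exploit what little topology the sparse target network already has, while the latent term carries the transferred content knowledge.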

Algorithm 2: Cross-Network Bias Correction

Idea: ensure that the links in the target network which are consistent with the source network in terms of the node-content relationship are given much greater importance.

The closer two nodes are, the more relevant their attributes.

Some nodes in the source network are more relevant to the nodes in the target network than others.

Such nodes should therefore receive larger sampling weights in the re-sampling process.

The link structure among these relevant nodes should be kept intact, to preserve the links incident with them.

Algorithm 2: Cross-Network Bias Correction

The re-sampling process is done to:

Maximize the consistency between the source and target networks in terms of the attributes associated with their nodes.

Preserve the richness of the structure of the sampled network, so that as much structural information as possible is available for the transfer learning process.

Algorithm 2: Cross-Network Bias Correction

Re-sampling the source network

Each node is sampled according to a weighting distribution α = {α1, α2, ..., αn} on the node set V0 of the source network.

Definition: a re-sampled source network is a stochastic network structure whose nodes are sampled from the node set V0 of the source network G0 according to the sampling weights α. Formally, a node U in the re-sampled network is a random variable which takes on values from V0, with P(U = vi) = αi for i = 1, 2, ..., n.

Thus, the probability Pij of sampling a link (vi, vj) ∈ E0 in the re-sampling process is the product of its endpoint weights, Pij = αi · αj (the two endpoints being sampled independently).
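Under this definition, with the two endpoints drawn independently from α, each source link survives re-sampling with probability Pij = αi · αj. A small sketch with toy weights and links:

```python
import numpy as np

# Toy sampling distribution over the n = 4 source nodes (sums to 1).
alpha = np.array([0.4, 0.3, 0.2, 0.1])

# Source-network links (v_i, v_j) in E0.
E0 = [(0, 1), (1, 2), (2, 3)]

# Assuming the two endpoints are sampled independently from alpha,
# each link survives re-sampling with probability P_ij = alpha_i * alpha_j.
P = {(i, j): alpha[i] * alpha[j] for (i, j) in E0}
```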

Algorithm 2: Cross-Network Bias Correction

The value of Pij depends upon the sampling distribution α. Thus, our goal is to determine the α which minimizes the cross-network bias, while retaining the richness of the network structure.

1. Cross-Network Relevance

The relevance R(G0, G) between the source and target networks captures the consistency of the distributions underlying these two networks.

If we don't consider the node distribution, we can simply measure the average attribute similarity between the nodes of the two networks by a naive definition, where S(x̄i, xs) denotes the similarity between the attributes of a pair of nodes.

Next, we generalize this naive definition to measure the relevance between the re-sampled source network, parameterized by the node distribution α, and the target network G.

Algorithm 2: Cross-Network Bias Correction

1. Cross-Network Relevance

Instead of averaging over all nodes in the source network, we compute an expected value under the sampling distribution α.

Consider a node U sampled from the node set V0 according to α in the re-sampled source network. Its average relevance to the nodes in the target network is the mean attribute similarity between U and the target nodes.

Taking the expected value of this quantity over α provides a measure of the cross-network relevance between the re-sampled source network and the target network.

Algorithm 2: Cross-Network Bias Correction

1. Cross-Network Relevance

We then define the cross-network relevance between the re-sampled source network and the target network as α^T u, where ^T denotes the transpose operator and u is the n x 1 vector whose i-th entry is the average attribute similarity of source node vi to the nodes of the target network.
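A sketch of this relevance computation, with cosine similarity standing in for the attribute-similarity function S (the choice of cosine, and the toy vectors, are assumptions for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, used here as an assumed choice for S(x_bar_i, x_s)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def relevance(alpha, X_src, X_tgt):
    """R(alpha) = alpha^T u, where u_i is the average attribute similarity
    of source node i to all target nodes."""
    u = np.array([np.mean([cosine(x, y) for y in X_tgt]) for x in X_src])
    return alpha @ u

X_src = np.array([[1.0, 0.0], [0.0, 1.0]])   # source node attributes
X_tgt = np.array([[1.0, 0.0]])               # target node attributes
alpha = np.array([0.9, 0.1])                 # weight on the relevant node
r = relevance(alpha, X_src, X_tgt)
```

Putting more sampling weight on source nodes that resemble the target nodes raises α^T u, which is exactly what the bias correction optimizes for.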

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

In optimizing the sampling weights α, a link-richness contribution is also needed.

We sum the sampling probabilities of all the links in the source network to measure the proportion of preserved links:

Ni = the set of neighbors of node vi in the source network

ki = |Ni| = the node degree

For each node vi, we measure the average sampling probability over all the links incident with it, and sum over all the nodes in the network.

The factor 1/ki ensures that densely linked nodes are not over-sampled excessively compared with sparsely linked nodes.

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

Next, we regularize the sampling weight αi of each node to prevent over-sampling of some nodes in the source network.

The regularization implies that the sampling weight αi of a node should be penalized by an extra factor 1/ki when it is linked to a neighbor node.

Thus, two neighboring nodes compete for the distribution of their sampling weights, and a node with dense links is penalized to a greater degree to avoid being over-sampled. This guarantees that a sparsely linked node can still be sufficiently sampled.

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

Lastly, combining equations 11 and 12 gives the overall link-richness expression LinkRich(α).
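The degree-normalized link-preservation component described above can be sketched as follows. This models only that first component (not the full combined expression), and it assumes Pij = αi · αj; the star network and weights are toy values.

```python
import numpy as np

def link_richness(alpha, neigh):
    """Sketch of the link-preservation component: for each node i, average
    the link sampling probabilities alpha_i * alpha_j over its incident links
    (the 1/k_i factor keeps hub nodes from dominating), then sum over nodes."""
    total = 0.0
    for i, Ni in neigh.items():
        k_i = len(Ni)
        total += sum(alpha[i] * alpha[j] for j in Ni) / k_i
    return total

neigh = {0: {1, 2}, 1: {0}, 2: {0}}   # a small star: node 0 is the hub
alpha = np.array([0.5, 0.25, 0.25])
rich = link_richness(alpha, neigh)
```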

Algorithm 2: Cross-Network Bias Correction

Example of link richness

Suppose a central node has a set of neighboring nodes which, on average, have greater sampling weights.

To preserve the links between these neighbors and the central node, the sampling weight of the central node tends to be increased.

Moreover, a neighbor with only one incident link (a sparsely linked node) makes that link especially important to preserve.

Thus, as indicated by the second term of equation 13, the sampling weight of the central node should be increased further to preserve this link.

Algorithm 2: Cross-Network Bias Correction

Combining cross-network relevance and link richness

The optimal sampling distribution α is obtained by maximizing the cross-network relevance and the link richness LinkRich(α) of the re-sampled source network.
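A toy sketch of this combined optimization: gradient ascent on α^T u + γ · LinkRich(α) over the probability simplex, where LinkRich here contains only the link-preservation term. The trade-off weight γ, the step size, and the clip-and-renormalize projection are all assumptions, not the paper's actual solver.

```python
import numpy as np

def optimize_alpha(u, neigh, gamma=0.5, lr=0.1, iters=200):
    """Toy gradient ascent on alpha^T u + gamma * LinkRich(alpha), kept on
    the probability simplex by clipping and renormalizing (an assumption)."""
    n = len(u)
    alpha = np.full(n, 1.0 / n)
    for _ in range(iters):
        grad = u.copy()                  # gradient of the relevance term
        for i, Ni in neigh.items():
            for j in Ni:
                # d/d alpha_i of the term alpha_i * alpha_j * (1/k_i + 1/k_j)
                grad[i] += gamma * alpha[j] * (1.0 / len(Ni) + 1.0 / len(neigh[j]))
        alpha = np.clip(alpha + lr * grad, 1e-9, None)
        alpha /= alpha.sum()             # project back onto the simplex
    return alpha

u = np.array([0.9, 0.1, 0.1])            # relevance of each source node
neigh = {0: {1}, 1: {0, 2}, 2: {1}}      # a 3-node source path
alpha = optimize_alpha(u, neigh)         # more weight goes to node 0
```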

Experiments

Experimental dataset

Co-authorship networks from academic publications in four areas: Database, Data Mining, Machine Learning, Information Retrieval.

Papers from 20 conferences; 28,702 authors.

66,832 co-author links; each author is linked with 2.3 co-authors on average.

Attributes of the authors in the network: 13,214 keywords drawn from the titles of their publications.

Source network: the combination of publications in 3 of the areas.

Target network: the remaining area (links and attributes from 20% of its publications are retained to create a nascent target network).

Experiments

Baseline algorithms

Adamic-Adar (AA): predicts a link between two authors from their common neighbors.

Logistic Regression LR(A+T): combines attribute (A) similarity and topological (T) features to predict the links in the target network.

CNLP without re-sampling: cross-network link prediction as in Algorithm 1, but without the re-sampling of Algorithm 2.

CNLP: cross-network link prediction with re-sampling.

A 5-fold cross-validation process is applied to the current target network.

Performance is measured by top-K precision, with K = 500 and K = 1000.
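The top-K precision metric used above is straightforward to compute; a minimal sketch with toy predictions (the pairs and truth set are invented for illustration):

```python
def precision_at_k(ranked_pairs, true_links, k):
    """Top-K precision: the fraction of the K highest-scoring predicted
    links that actually appear in the held-out target network."""
    top = ranked_pairs[:k]
    return sum(1 for pair in top if pair in true_links) / k

ranked = [(0, 1), (2, 3), (1, 4), (0, 2)]   # pairs sorted by predicted score
truth = {(0, 1), (1, 4)}                    # held-out true links
p2 = precision_at_k(ranked, truth, 2)       # only (0, 1) is in the top 2
```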

Experiments

Link prediction results

Table 1. Top conferences in the 4 areas

Table 2. Top keywords in the source network

Experiments

Re-sampling results

Table 3. Top 10 keywords re-sampled in the source networks

Experiments

Computational efficiency

Experiments were conducted on an Intel Xeon 2.40 GHz CPU with 8 GB of physical memory, running Linux.

Link model building: 38.36 seconds.

Source network re-sampling: 57.70 seconds.

Once the link model is built, a link between two authors can be predicted in 2.70 x 10^-5 milliseconds.

Comparison: AA needs 0.86 x 10^-5 milliseconds, and LR(A+T) needs 1.38 x 10^-5 milliseconds, to predict a link between two authors.
