
Link Prediction across Networks by Biased Cross-Network Sampling

ICDE 2013

Introduction

Goal: predict future links in a growing network using the existing network structure.

Problem: the existing network may be too sparse, with too few links.

Solution: other (more densely linked) networks may be available which show similar linkage structure -> such source networks can be used in conjunction with the node attribute information of the sparse network.

Problem Formulation: Cross-Network Transfer Learning

Source network (mature, with richer linkage information): G0 = (V0, E0)

Node set V0 = {v1, v2, ..., vn}, with nodes indexed i, j

Edge (vi, vj) ∈ E0

Target network (sparse, nascent): G = (V, E)

Node set V = {v1, v2, ..., vm}, with nodes indexed s, t

Edge (vs, vt) ∈ E

The correspondence between the nodes in V and V0 is unknown; the only information which relates them is the attribute information available at the nodes.

Problem Formulation: Cross-Network Transfer Learning
Each node in V and V0 is associated with a set of keywords derived from the profile information of the respective social network.

The attributes associated with a node vs of the target network G form a feature vector xs.

The attributes associated with a node vi of the source network G0 form a feature vector x̄i.

Both xs and x̄i lie in the vector space R^d of dimension d.

Problem: given the training (source) network G0 = (V0, E0), along with its associated content attributes {x̄i}, determine the links which have the highest probability of appearing in the future in a currently existing target network G = (V, E) with corresponding attributes {xs}.

Problem Formulation

Two main algorithms are proposed:

Cross-Network Link Model: leverages link information in the source network in order to predict the links in the target network.

Cross-Network Bias Correction: determines sampling weights to correct bias, ensuring that links in the target network which are consistent with the source network are given much greater importance.

Algorithm 1: Cross-Network Link Model

Uses a latent-space approach that relates the node attributes to the probability of link presence in the source and target networks.

In a co-authorship network, the attribute vector of an author node corresponds to, for example, the keywords of their published papers.

Algorithm 1: Cross-Network Link Model

The attribute vectors x̄i and xs are mapped to latent vectors z̄i and zs in a latent topic space R^k.

The mapping is a linear transformation: z̄i = W x̄i and zs = W xs.

The social interaction between two nodes vs and vt in the target network can be measured by the similarity zs · zt between their corresponding latent vectors.

W is a k x d matrix, where k is the dimension of the latent topic space.

Ex: collaboration links between the authors in a co-authorship network can be inferred from the similarity between the latent topic vectors of their research interests and expertise.

To map the attribute vectors x to the latent vectors z, we first need to learn W.
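The mapping and the similarity-based link score above can be sketched as follows. This is a minimal illustration: W is random here purely for demonstration (in the algorithm it is learned from both networks), and the logistic squashing of the similarity is an assumed choice.

```python
import numpy as np

# Sketch of the latent-space mapping. W is random here purely for
# illustration; in the algorithm it is learned from both networks.
rng = np.random.default_rng(0)
d, k = 6, 3                      # attribute and latent dimensions
W = rng.standard_normal((k, d))  # latent transformation matrix (k x d)

x_s = rng.random(d)              # attribute vector of target node v_s
x_t = rng.random(d)              # attribute vector of target node v_t

z_s = W @ x_s                    # latent topic vector z_s = W x_s
z_t = W @ x_t                    # latent topic vector z_t = W x_t

def link_prob(z_a, z_b):
    """Logistic squashing of the latent similarity (an assumed choice)."""
    return 1.0 / (1.0 + np.exp(-(z_a @ z_b)))

p = link_prob(z_s, z_t)          # probability-like link score in (0, 1)
```

The key design point is that W is shared across both networks, so attribute vectors from either network land in the same topic space and become comparable.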

Algorithm 1: Cross-Network Link Model

To determine the topic space W, nodes with relevant content (in the source network), as well as nodes which are topologically well connected (in the target network), tend to be placed close together in the topic space.

Two components of matrix W:

1. The current state of the target network (content and structure)

The log-likelihood of the existing links

Algorithm 1: Cross-Network Link Model

Two components of matrix W:

2. Cross-network knowledge transferred from the source network to the target network

We combine the model with knowledge from a re-sampled source network, based on the sampling importance of the link between nodes vi and vj, denoted Pij.

Pij weights the importance of sampling the link (vi, vj) in the source network (each candidate link is either sampled, and thus exists in the re-sampled network, or does not).

Algorithm 1: Cross-Network Link Model

Next, we maximize the combined log-likelihood of the links in the source and target networks; this yields the optimal latent transformation matrix W.

Then, knowing W, the probability of a link between a pair of nodes is predicted from two quantities:

zs · zt = the latent-vector similarity of nodes vs and vt

bst = an Adamic-Adar feature defined on the pair of nodes vs and vt, capturing their common neighbors in the target network
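A minimal sketch of this prediction step, assuming the final score is a logistic function of the latent similarity plus a weighted Adamic-Adar term; the weight c and this exact combination are assumptions for illustration, not the paper's fitted model.

```python
import math

def adamic_adar(neigh, s, t):
    """Adamic-Adar feature b_st: sum of 1/log(degree) over common neighbors."""
    common = neigh[s] & neigh[t]
    return sum(1.0 / math.log(len(neigh[w])) for w in common if len(neigh[w]) > 1)

def predict_link(z_s, z_t, b_st, c=1.0):
    """Assumed combination: logistic of latent similarity plus c * b_st."""
    score = sum(a * b for a, b in zip(z_s, z_t)) + c * b_st
    return 1.0 / (1.0 + math.exp(-score))

# Toy target network: adjacency as neighbor sets.
neigh = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1}}
b01 = adamic_adar(neigh, 0, 1)          # nodes 0 and 1 share neighbors 2 and 3
p01 = predict_link([1.0, 0.0], [1.0, 0.0], b01)
```

The Adamic-Adar term lets the model exploit what little topology the sparse target network already has, while the latent term carries the transferred content knowledge.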

Algorithm 2: Cross-Network Bias Correction

Idea: ensure that the links in the target network which are consistent with the source network in terms of the node-content relationship are given much greater importance.

The closer two nodes are, the more relevant their attributes.

Some nodes in the source network are more relevant to the nodes in the target network than others.

Such nodes should therefore receive larger sampling weights in the re-sampling process.

The link structure among these relevant nodes should be kept intact, to preserve the links incident with them.

Algorithm 2: Cross-Network Bias Correction

The re-sampling process is done to:

Maximize the consistency between the source and target networks in terms of the attributes associated with their nodes.

Preserve the richness of the structure of the sampled network, so that as much structural information as possible is available for the transfer learning process.

Algorithm 2: Cross-Network Bias Correction

Re-sampling the source network

Each node is sampled according to a weighting distribution α = {α1, α2, ..., αn} on the node set V0 of the source network.

Definition: a re-sampled source network is a stochastic network structure whose nodes are sampled from the node set V0 of the source network G0 according to the sampling weights α. Formally, a node U in the re-sampled network is a random variable which takes on values from V0, with P(U = vi) = αi for i = 1, 2, ..., n.

Thus, the probability Pij of sampling a link (vi, vj) ∈ E0 in the re-sampling process is the product of its endpoint weights, Pij = αi · αj (the two endpoints being sampled independently).
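Under this definition, with the two endpoints drawn independently from α, each source link survives re-sampling with probability Pij = αi · αj. A small sketch with toy weights and links:

```python
import numpy as np

# Toy sampling distribution over the n = 4 source nodes (sums to 1).
alpha = np.array([0.4, 0.3, 0.2, 0.1])

# Source-network links (v_i, v_j) in E0.
E0 = [(0, 1), (1, 2), (2, 3)]

# Assuming the two endpoints are sampled independently from alpha,
# each link survives re-sampling with probability P_ij = alpha_i * alpha_j.
P = {(i, j): alpha[i] * alpha[j] for (i, j) in E0}
```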

Algorithm 2: Cross-Network Bias Correction

The value of Pij depends upon the sampling distribution α. Thus, our goal is to determine the α which minimizes the cross-network bias, while retaining the richness of the network structure.

1. Cross-Network Relevance

The relevance R(G0, G) between the source and target networks captures the consistency of the distributions underlying these two networks.

If we don't consider the node distribution, we can simply measure the average attribute similarity between the nodes of the two networks by a naive definition, where S(x̄i, xs) denotes the similarity between the attributes of a pair of nodes.

Next, we generalize this naive definition to measure the relevance between the re-sampled source network, parameterized by the node distribution α, and the target network G.

Algorithm 2: Cross-Network Bias Correction

1. Cross-Network Relevance

Instead of averaging over all nodes in the source network, we compute an expected value under the sampling distribution α.

Consider a node U sampled from the node set V0 according to α in the re-sampled source network. Its average relevance to the nodes in the target network is the mean attribute similarity between U and the target nodes.

Taking the expected value of this quantity over α provides a measure of the cross-network relevance between the re-sampled source network and the target network.

Algorithm 2: Cross-Network Bias Correction

1. Cross-Network Relevance

We then define the cross-network relevance between the re-sampled source network and the target network as α^T u, where ^T denotes the transpose operator and u is the n x 1 vector whose i-th entry is the average attribute similarity of source node vi to the nodes of the target network.
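A sketch of this relevance computation, with cosine similarity standing in for the attribute-similarity function S (the choice of cosine, and the toy vectors, are assumptions for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, used here as an assumed choice for S(x_bar_i, x_s)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def relevance(alpha, X_src, X_tgt):
    """R(alpha) = alpha^T u, where u_i is the average attribute similarity
    of source node i to all target nodes."""
    u = np.array([np.mean([cosine(x, y) for y in X_tgt]) for x in X_src])
    return alpha @ u

X_src = np.array([[1.0, 0.0], [0.0, 1.0]])   # source node attributes
X_tgt = np.array([[1.0, 0.0]])               # target node attributes
alpha = np.array([0.9, 0.1])                 # weight on the relevant node
r = relevance(alpha, X_src, X_tgt)
```

Putting more sampling weight on source nodes that resemble the target nodes raises α^T u, which is exactly what the bias correction optimizes for.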

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

In optimizing the sampling weights α, a link-richness contribution is also needed.

We sum the sampling probabilities of all the links in the source network to measure the proportion of preserved links:

Ni = the set of neighbors of node vi in the source network

ki = |Ni| = the node degree

For each node vi, we measure the average sampling probability over all the links incident with it, and sum over all the nodes in the network.

The factor 1/ki ensures that densely linked nodes are not over-sampled excessively compared with sparsely linked nodes.

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

Next, we regularize the sampling weight αi of each node to prevent over-sampling of some nodes in the source network.

The regularization implies that the sampling weight αi of a node should be penalized by an extra factor 1/ki when it is linked to a neighbor node.

Thus, two neighboring nodes compete for the distribution of their sampling weights, and a node with dense links is penalized to a greater degree to avoid being over-sampled. This guarantees that a sparsely linked node can still be sufficiently sampled.

Algorithm 2: Cross-Network Bias Correction

2. Link Richness

Lastly, combining equations 11 and 12 gives the overall link-richness expression LinkRich(α).
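The degree-normalized link-preservation component described above can be sketched as follows. This models only that first component (not the full combined expression), and it assumes Pij = αi · αj; the star network and weights are toy values.

```python
import numpy as np

def link_richness(alpha, neigh):
    """Sketch of the link-preservation component: for each node i, average
    the link sampling probabilities alpha_i * alpha_j over its incident links
    (the 1/k_i factor keeps hub nodes from dominating), then sum over nodes."""
    total = 0.0
    for i, Ni in neigh.items():
        k_i = len(Ni)
        total += sum(alpha[i] * alpha[j] for j in Ni) / k_i
    return total

neigh = {0: {1, 2}, 1: {0}, 2: {0}}   # a small star: node 0 is the hub
alpha = np.array([0.5, 0.25, 0.25])
rich = link_richness(alpha, neigh)
```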

Algorithm 2: Cross-Network Bias Correction

Example of link richness

Suppose a central node has a set of neighboring nodes which, on average, have greater sampling weights.

To preserve the links between these neighbors and the central node, the sampling weight of the central node tends to be increased.

Moreover, a neighbor with only one incident link (a sparsely linked node) makes that link especially important to preserve.

Thus, as indicated by the second term of equation 13, the sampling weight of the central node should be increased further to preserve this link.

Algorithm 2: Cross-Network Bias Correction

Combining cross-network relevance and link richness

The optimal sampling distribution α is obtained by maximizing the cross-network relevance and the link richness LinkRich(α) of the re-sampled source network.
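A toy sketch of this combined optimization: gradient ascent on α^T u + γ · LinkRich(α) over the probability simplex, where LinkRich here contains only the link-preservation term. The trade-off weight γ, the step size, and the clip-and-renormalize projection are all assumptions, not the paper's actual solver.

```python
import numpy as np

def optimize_alpha(u, neigh, gamma=0.5, lr=0.1, iters=200):
    """Toy gradient ascent on alpha^T u + gamma * LinkRich(alpha), kept on
    the probability simplex by clipping and renormalizing (an assumption)."""
    n = len(u)
    alpha = np.full(n, 1.0 / n)
    for _ in range(iters):
        grad = u.copy()                  # gradient of the relevance term
        for i, Ni in neigh.items():
            for j in Ni:
                # d/d alpha_i of the term alpha_i * alpha_j * (1/k_i + 1/k_j)
                grad[i] += gamma * alpha[j] * (1.0 / len(Ni) + 1.0 / len(neigh[j]))
        alpha = np.clip(alpha + lr * grad, 1e-9, None)
        alpha /= alpha.sum()             # project back onto the simplex
    return alpha

u = np.array([0.9, 0.1, 0.1])            # relevance of each source node
neigh = {0: {1}, 1: {0, 2}, 2: {1}}      # a 3-node source path
alpha = optimize_alpha(u, neigh)         # more weight goes to node 0
```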

Experiments

Experimental dataset

Co-authorship networks from academic publications in four areas: Database, Data Mining, Machine Learning, Information Retrieval.

Papers from 20 conferences; 28,702 authors.

66,832 co-author links; each author is linked with 2.3 co-authors on average.

Attributes of the authors in the network: 13,214 keywords drawn from the titles of their publications.

Source network: the combination of publications in 3 of the areas.

Target network: the remaining area (links and attributes from 20% of its publications are retained to create a nascent target network).

Experiments

Baseline algorithms

Adamic-Adar (AA): predicts a link between two authors from their common neighbors.

Logistic Regression LR(A+T): combines attribute (A) similarity and topological (T) features to predict the links in the target network.

CNLP without re-sampling: cross-network link prediction as in Algorithm 1, but without the re-sampling of Algorithm 2.

CNLP: cross-network link prediction with re-sampling.

A 5-fold cross-validation process is applied to the current target network.

Performance is measured by top-K precision, with K = 500 and K = 1000.
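The top-K precision metric used above is straightforward to compute; a minimal sketch with toy predictions (the pairs and truth set are invented for illustration):

```python
def precision_at_k(ranked_pairs, true_links, k):
    """Top-K precision: the fraction of the K highest-scoring predicted
    links that actually appear in the held-out target network."""
    top = ranked_pairs[:k]
    return sum(1 for pair in top if pair in true_links) / k

ranked = [(0, 1), (2, 3), (1, 4), (0, 2)]   # pairs sorted by predicted score
truth = {(0, 1), (1, 4)}                    # held-out true links
p2 = precision_at_k(ranked, truth, 2)       # only (0, 1) is in the top 2
```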

Experiments

Link prediction results

Table 1. Top conferences in the 4 areas

Table 2. Top keywords in the source network

Experiments

Re-sampling results

Table 3. Top 10 keywords re-sampled in the source networks

Experiments

Computational efficiency

Experiments were conducted on an Intel Xeon 2.40 GHz CPU with 8 GB of physical memory, running Linux.

Link model building: 38.36 seconds.

Source network re-sampling: 57.70 seconds.

Once the link model is built, a link between two authors can be predicted in 2.70 x 10^-5 milliseconds.

Comparison: AA needs 0.86 x 10^-5 milliseconds, and LR(A+T) needs 1.38 x 10^-5 milliseconds, to predict a link between two authors.
