Networks by Biased
Cross-Network Sampling
ICDE 2013
Introduction
Goal: predict future links in a growing network using the existing network structure
Problem: the existing network may be too sparse, with too few links
Solution: other (more densely linked) networks with a similar linkage structure may be available -> these networks can be used in conjunction with the node attribute information of the sparse network
Problem Formulation
Cross-network transfer learning
Source network G0 = (V0, E0): node set V0, edge set E0
Target network G = (V, E): node set V, edge set E
The correspondence between the nodes in V and V0 is unknown; the only information that relates them is the attribute information available at the nodes
Each node in V and V0 is associated with a set of keywords derived from the profile information of both social networks
Problem: given the training network G0 = (V0, E0), along with its associated content attributes, determine the links that have the highest probability of appearing in the future in the currently existing target network G = (V, E) with its corresponding attributes.
Cross Network Link Model: leverage link information in the source network in order to predict the links in the target network
Uses a latent space approach: relates the node attributes to the probability of link presence in the source and target networks
The attribute vectors of the nodes (e.g. x_s) are mapped to latent vectors in a latent topic space R^k
The social interaction between two nodes v_s and v_t in the target network can be measured by the similarity between their corresponding latent vectors
Ex: collaboration links between the authors in a co-authorship network can be inferred based on the
similarity between the latent topic vectors of their research interests and expertise
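The latent space idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the model from the talk: the inner-product similarity and the logistic squashing are assumed for the example, the matrix `W` is filled with made-up values, and the names `to_latent` and `link_score` are hypothetical.

```python
import math

def to_latent(W, x):
    """Map an attribute vector x (length d) to a latent vector (length k) with a k x d matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def link_score(W, x_s, x_t):
    """Score a candidate link by the similarity of the two latent vectors.

    Assumption: inner-product similarity passed through a logistic function.
    """
    z_s, z_t = to_latent(W, x_s), to_latent(W, x_t)
    similarity = sum(a * b for a, b in zip(z_s, z_t))
    return 1.0 / (1.0 + math.exp(-similarity))

# Toy data: 3 keywords mapped to 2 latent topics (all values are made up).
W = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
alice = [1.0, 1.0, 0.0]   # keywords on topic 1
bob = [1.0, 0.0, 0.0]     # keywords on topic 1
carol = [0.0, 0.0, 1.0]   # keywords on topic 2

score_same_topic = link_score(W, alice, bob)
score_other_topic = link_score(W, alice, carol)
```

In this toy setting the two authors who share a latent topic get the higher score, which mirrors the co-authorship example.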
Combine the model with knowledge from a re-sampled source network, based on the sampling importance of the link between two source nodes, denoted by P_ij
Next, maximize the combined log-likelihood of the links in the source and target networks to obtain the optimal latent transformation matrix W
Once W is known, it is used to predict the probability of a link between a pair of nodes
b_st = Adamic-Adar feature defined on a pair of nodes v_s and v_t, capturing their common neighbors in the target network
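For reference, the Adamic-Adar feature mentioned above can be computed as in the sketch below. It uses the standard definition (sum over common neighbors of 1 / log of the neighbor's degree); the adjacency dictionary `adj` and the function name `adamic_adar` are illustrative, and the graph is assumed to be undirected with no self-loops.

```python
import math

def adamic_adar(adj, s, t):
    """Adamic-Adar score: sum over common neighbors z of 1 / log(degree(z))."""
    common = adj[s] & adj[t]
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in common)

# Toy undirected graph stored as node -> set of neighbors.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
score_ab = adamic_adar(adj, "a", "b")  # common neighbors: c (degree 2) and d (degree 3)
```

A common neighbor with few links contributes more to the score than a highly connected one.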
Idea: ensure that the links in the source network that are consistent with the target network in terms of the node-content relationship are given much greater importance
The closer two nodes are, the more
relevant their attributes
Maximize the consistency between the source and target networks in terms of the attributes
associated with their nodes
Preserve the richness of the structure of the sampled network, so that as much structural
information as possible is available for the transfer learning process
Thus, each link (v_i, v_j) in E0 is sampled with probability P_ij in the re-sampling process
The value of P_ij depends upon the sampling distribution over the source nodes. Thus, our goal is to determine the sampling distribution that minimizes the cross-network bias while retaining the richness of the network structure
1. Cross-Network Relevance
Relevance R(G0, G) between the source and target networks -> consistency of the distributions underlying these 2 networks
If we don't consider the node distribution, we can simply measure the average attribute similarity between the nodes of the two networks with a naive definition, where
S(·, ·) = similarity between the attributes of two nodes
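The naive definition can be illustrated as a plain average of pairwise attribute similarities. The sketch below assumes cosine similarity for S, which is a choice made for the example only; `cosine` and `naive_relevance` are hypothetical names.

```python
import math

def cosine(x, y):
    """Cosine similarity between two attribute vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def naive_relevance(source_attrs, target_attrs):
    """Average similarity over every (source node, target node) pair."""
    total = sum(cosine(x0, x) for x0 in source_attrs for x in target_attrs)
    return total / (len(source_attrs) * len(target_attrs))

# Toy keyword-count vectors (made up): one source node matches the target, one does not.
source = [[1, 0], [0, 1]]
target = [[1, 0]]
relevance = naive_relevance(source, target)
```

Every source node counts equally here, which is exactly what the re-sampled version relaxes.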
Next, generalize the naive definition to measure the relevance between the re-sampled source network, parameterized by the node distribution, and the target network G
Instead of averaging over all nodes in the source network, we compute the expected value under the sampling distribution.
Consider a node U sampled from the node set V0 according to the sampling distribution in the re-sampled source network, and take its average relevance to the nodes in the target network
The expected value of this average relevance provides a measure of the cross-network relevance between the re-sampled source network and the target network
Then, we define the cross-network relevance between the re-sampled source network and the target network in vector form, where the superscript denotes the transpose operator and u is an n x 1 vector
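A small sketch of the expected-value computation described above. Two assumptions are made for illustration: `u[i]` is taken to be the average similarity of source node i to all target nodes, and the sampling distribution is represented by a list named `mu` (the symbol used in the talk is not recoverable here). The dot-product similarity and the function names are also hypothetical.

```python
def dot(x, y):
    """Dot-product similarity between two attribute vectors (an assumed choice of S)."""
    return sum(a * b for a, b in zip(x, y))

def relevance_vector(source_attrs, target_attrs, sim=dot):
    """u[i] = average similarity of source node i to all target nodes."""
    m = len(target_attrs)
    return [sum(sim(x0, x) for x in target_attrs) / m for x0 in source_attrs]

def resampled_relevance(mu, source_attrs, target_attrs, sim=dot):
    """Expected relevance of a source node drawn from mu: the inner product of mu and u."""
    u = relevance_vector(source_attrs, target_attrs, sim)
    return sum(p * ui for p, ui in zip(mu, u))

# Toy data (made up): source node 0 resembles the target nodes, source node 1 does not.
source = [[1, 0], [0, 1]]
target = [[1, 0], [1, 0]]
uniform = [0.5, 0.5]   # recovers the plain average over source nodes
biased = [0.9, 0.1]    # puts more sampling weight on the relevant source node

r_uniform = resampled_relevance(uniform, source, target)
r_biased = resampled_relevance(biased, source, target)
```

With a uniform distribution the value reduces to the naive average; shifting weight toward source nodes that resemble the target raises it.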
2. Link Richness
We sum up the sampling probabilities of all the links in the source network to measure the proportion of preserved links
For each node, we measure the average sampling probability over all the links incident to it, and sum over all the nodes in the network
The factor 1/k_i ensures that densely linked nodes are not over-sampled excessively compared with sparsely linked nodes
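The per-node averaging described above can be sketched as follows. This covers only the averaging step, not the regularized final expression. Two things are assumed for the example and are not taken from the talk: k_i is read as the number of links incident to node i, and the link sampling probability is modeled as the product of the two endpoint weights (`link_prob`). The names `mu`, `link_prob`, and `link_richness` are hypothetical.

```python
def link_prob(mu, i, j):
    """ASSUMPTION for illustration: P_ij is the product of the endpoint sampling weights."""
    return mu[i] * mu[j]

def link_richness(adj, mu):
    """Sum over nodes of the average sampling probability of their incident links."""
    total = 0.0
    for i, neighbors in adj.items():
        k_i = len(neighbors)  # number of links incident to node i
        if k_i == 0:
            continue  # an isolated node has no links to preserve
        total += sum(link_prob(mu, i, j) for j in neighbors) / k_i
    return total

# Toy star graph: hub "h" linked to three leaves, uniform sampling weights.
adj = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
mu = {"h": 0.25, "a": 0.25, "b": 0.25, "c": 0.25}
richness = link_richness(adj, mu)
```

Because each node's contribution is divided by k_i, the hub contributes the same amount as a single leaf instead of three times as much.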
Next, we need to regularize the sampling weight of each node to prevent over-sampling of some nodes in the source network
The equation suggests that the sampling weight of a node should be penalized by an extra factor 1/k_i when it is linked to a neighbor node
Thus, 2 neighboring nodes will compete for the distribution of their sampling weights, and a node with dense links should be penalized to a greater degree to avoid being over-sampled. This guarantees that a sparsely linked node can be sufficiently sampled
Lastly, combining equations 11 and 12 gives the link richness expression
The optimal sampling distribution can be obtained by maximizing the cross-network relevance and the link richness LinkRich() of the re-sampled source network
Experiments
Experimental Dataset
Co-authorship networks from academic publications in four areas: Database, Data Mining, Machine Learning, Information Retrieval
Target network: the remaining 1 area (retain links and attributes from 20% of its publications to create a nascent target network)
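The construction of a nascent target network could look like the sketch below. It assumes the 20% of publications are chosen uniformly at random (the selection rule is not stated here) and only builds the links, not the attributes; the publication list and the names `coauthor_edges` and `nascent_network` are made up for the example.

```python
import itertools
import random

def coauthor_edges(publications):
    """Build undirected co-authorship links: one edge per pair of authors on a paper."""
    edges = set()
    for authors in publications:
        for a, b in itertools.combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

def nascent_network(publications, fraction=0.2, seed=0):
    """Keep a random fraction of the publications and return the links they induce."""
    rng = random.Random(seed)
    keep = min(len(publications), max(1, round(fraction * len(publications))))
    return coauthor_edges(rng.sample(publications, keep))

# Toy publication list: each entry is the author list of one paper.
pubs = [
    ["ann", "bob"], ["bob", "cai"], ["ann", "cai", "dee"], ["dee", "eli"],
    ["eli", "fay"], ["ann", "fay"], ["bob", "dee"], ["cai", "eli"],
    ["fay", "gus"], ["gus", "ann"],
]
full = coauthor_edges(pubs)
nascent = nascent_network(pubs, fraction=0.2, seed=0)
```

The links in `full` that are missing from `nascent` play the role of the future links to be predicted.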
Baseline Algorithms
Logistic Regression (A+T): combines attribute (A) similarity and topological (T) features to predict the links in the target network
CNLP without re-sampling: cross-network link prediction in Algorithm 1, but without the re-sampling in Algorithm 2
Link Prediction Result
Re-sampling Result
Computational Efficiency
Experiments were conducted on an Intel Xeon 2.40 GHz CPU with 8 GB of physical memory, running Linux
Once the link model has been learned, a link between two authors can be predicted in 2.70 x 10^-5 milliseconds
Comparison: AA needs 0.86 x 10^-5 milliseconds and LR(A+T) needs 1.38 x 10^-5 milliseconds to predict a link between two authors