
Link Prediction in Social Networks by Social Distance

H.D.D.D. Premadasa
Department of Computer Science and Engineering, Faculty of Engineering, University of Moratuwa, Sri Lanka damitha.premadasa@uom.lk

Abstract
Over the last decade, social networks have been studied extensively in the context of analyzing relationships and interactions between people and determining the interesting structural properties of the network. These studies focus mainly on online social network applications such as Facebook, Twitter, LinkedIn and MySpace. The social network data available on the Internet, together with the Application Programming Interfaces (APIs) that expose it, enables research on social networks. Identifying people who are more likely to become connected, and introducing them to each other, makes the network more tightly connected and helps it expand further. This research focuses on identifying the social distance and communities of members in a social network by taking both structural variables and social variables into account.

Introduction
In general, a social network is defined as a network of interactions or relationships, where the nodes are actors and the edges are the relationships or interactions between these actors. The relationships in a social network are not limited to online social networks such as Facebook and LinkedIn; interactions may take any conventional or nonconventional form, such as face-to-face, telecommunication, email or postal mail interactions. In the context of research and analysis, however, online social networks are used to model social networks because of the high volume of social data they make available. Associations are usually driven by mutual interests and attributes. Understanding the dynamics that drive the evolution of a social network is a complex problem because of the large number of variable parameters. A comparatively easier problem is to understand the association between two specific nodes. Several variations of this problem make interesting research topics: what are the factors that drive associations, and how is the association between two nodes affected by other nodes and attributes? The specific problem instance addressed in this research is to predict the likelihood of a future association between two nodes, given that there is no association between them in the current state of the graph. This problem is commonly known as the link prediction problem, and it is the focus of this research.

Background
Although most of the early research on social networks was done outside computer science, numerous efforts have been made by computer scientists recently. Most of this work has concentrated on analyzing and characterizing existing social networks. The most similar research I found is "Predicting Tie Strength With Social Media" [3], which derives a seven-dimension tie strength model for existing relationships. Few efforts have been made to solve the link prediction problem, especially in the social network domain. The closest matches to this work are the work by D. Liben-Nowell et al. [1], where the authors extracted several features from the network topology of a co-authorship network, and "Link Prediction using Supervised Learning" by M. Al Hasan et al. [2]. Their experiments evaluated the effectiveness of these features for the link prediction problem. This project intends to build a link prediction model by analyzing the structural and social variables of the nodes in a social network.

1.1 Multi-Attribute Relationship Networks

A multi-dimensional network has multiple types of interactions among its nodes. Each dimension of the network represents one type of connectivity between users. For instance, on Facebook a user can connect with others directly or indirectly through friend relationships, through common structural attributes such as a shared workplace or place of education, through common groups, and so on. The main attribute types of these networks can be identified as follows [3]:

1. Predictive Intensity Attributes
2. Duration Attributes
3. Reciprocal Services Attributes
4. Structural Attributes
5. Emotional Support Attributes
6. Social Distance Attributes
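As a rough illustration of the multi-dimensional view, the following Python sketch stores one edge set per dimension and reports which dimensions connect a given pair of users. The class name `MultiDimensionalNetwork`, its methods, the dimension labels and the user names are all hypothetical and chosen only for this example; they are not part of the study's implementation.

```python
from collections import defaultdict


class MultiDimensionalNetwork:
    """Toy multi-dimensional network: one undirected edge set per dimension."""

    def __init__(self):
        # Maps a dimension name (hypothetical labels such as "friend")
        # to the set of unordered user pairs connected in that dimension.
        self._edges = defaultdict(set)

    def connect(self, dimension, u, v):
        """Record that users u and v are connected in the given dimension."""
        self._edges[dimension].add(frozenset((u, v)))

    def shared_dimensions(self, u, v):
        """Return the sorted names of all dimensions in which u and v are connected."""
        pair = frozenset((u, v))
        return sorted(d for d, pairs in self._edges.items() if pair in pairs)


network = MultiDimensionalNetwork()
network.connect("friend", "alice", "bob")
network.connect("employer", "alice", "bob")
network.connect("college", "alice", "carol")

alice_bob = network.shared_dimensions("bob", "alice")
bob_carol = network.shared_dimensions("bob", "carol")
```

Storing pairs as `frozenset` objects keeps the edges undirected, so the order in which two users are given does not matter.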

1.2 Tie Strength Model

Relationships between users in social networks have been studied, and the Tie Strength Model [3] was derived from that work. It is a predictive model that maps social media data to tie strength. The model is built on a dataset of Facebook ties and performs well, distinguishing between strong and weak ties with over 85% accuracy.

Figure 1

1.3 Facebook Data Extracting API

The new Graph API [4] of Facebook presents a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people) and the connections between them (e.g., friend relationships). The old API also provides functions that this research analysis needs. RestFB [5] is a simple and flexible open source client for the Facebook Graph API and Old REST API, implemented in Java. Hence RestFB is used to extract information from Facebook.

Methodology
In this section, the approach, technologies and techniques are discussed. The methodology can be divided into three main parts:
1. Data extraction and preprocessing
2. Build link prediction model
3. Test link prediction model

1.4 Data Extraction and Preprocessing

This research focuses on selected attributes of each member's profile. The Basic Information, Friends and Family, and Education and Work fields were chosen to represent the social and structural behavior of a member. Mutual relationships are also considered in building the link prediction model.

Figure 2

The following attributes were extracted:
Basic Information: User Id, Current City, Birth Year, Gender
Friends and Family: Mutual Friends
Education and Work: Employer, College / University

To extract quality data, a Java client program that uses RestFB was developed. A large amount of incomplete data was observed when the social data was extracted; for example, some profiles exist with only the user id while the other attributes are missing. Incomplete data was filtered out using the same program. The output of the program was two .csv files, which are the input to the WEKA [6] data mining software.

1.5 Build and Evaluate Link Prediction Model

A classification model for the link prediction problem needs to predict links by successfully distinguishing the positive class from the rest of the dataset. Thus, link prediction can be posed as a binary classification problem that can be solved by employing effective features in a supervised learning framework. To build the link prediction model, 500 randomly selected users were used. The model produced by each classification methodology is saved so that it can be re-evaluated with the test dataset.
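The binary classification framing can be made concrete with a small sketch. The Python snippet below is illustrative only and is not the program used in this study: the function name `build_pair_instances` and the toy users and friendships are assumptions. It labels every unordered pair of users as positive (1) if a link currently exists between them and negative (0) otherwise, which is one plausible way to produce the labeled instances that a supervised classifier needs.

```python
from itertools import combinations


def build_pair_instances(users, friendships):
    """Label every unordered user pair: 1 if currently linked, else 0."""
    # Store links as frozensets so (u, v) and (v, u) are treated as the same edge.
    linked = {frozenset(edge) for edge in friendships}
    instances = []
    for u, v in combinations(sorted(users), 2):
        label = 1 if frozenset((u, v)) in linked else 0
        instances.append((u, v, label))
    return instances


# Hypothetical toy data, not taken from the study's dataset.
users = ["a", "b", "c", "d"]
friendships = [("a", "b"), ("b", "c")]

instances = build_pair_instances(users, friendships)
positives = [inst for inst in instances if inst[2] == 1]
negatives = [inst for inst in instances if inst[2] == 0]
```

In a real network the negative pairs greatly outnumber the positive ones, so in practice the negative class is usually sampled rather than enumerated in full.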

1.5.1 Feature Set

Choosing an appropriate feature set is the most critical part of any machine learning algorithm. For link prediction, we should choose features that represent some form of proximity between the pair of vertices that make up a data point. Although the discussion in this research focuses on a feature set for structural and social attribute analysis, the same proximity concept provides a clear direction for choosing conceptually identical features in other network domains. One favorable property of these features is that they are very cheap to compute. Individual node attributes can also provide helpful clues for link prediction. Since these attributes pertain to only one node in the social network, some aggregation function needs to be used to combine the attribute values of the two nodes in a node pair. These are called aggregated features. For example, the total number of mutual friends of two members is an aggregated feature, and it is an interesting feature in the process of link prediction.

1.5.2 Classification Algorithms

There are a large number of classification algorithms for supervised learning. Although their performance is comparable, some usually work better than others for a specific dataset or domain. In this research, experiments were done with five different classification algorithms. For some of these, we tried more than one variation and reported the result that showed the best performance. The algorithms used are decision tree, Multilayer Perceptron, Naive Bayes, RBF Network and Bagging. The well-known machine learning library WEKA [6] was used for all the algorithms. We then compared the performance of the classifiers using different performance metrics: accuracy, precision-recall, F-value and squared error. For all the algorithms, the reported results were obtained with 5-fold cross validation. A separate test dataset is used to evaluate the classification models, and the same metrics are considered during the testing phase.

Results and Discussions
The results obtained from this experiment are given below. The result set is primarily a performance analysis that contains accuracy, precision, F-measure and squared error.

Classifier              Accuracy (%)   Precision   F-Measure
C4.5 (J48)              78.77          0.814       0.793
Naive Bayes             77.1           0.772       0.771
Decision Table          74.30          0.743       0.747
Multilayer Perceptron   74.30          0.745       0.744
RBF Network             73.18          0.732       0.732
Bagging                 64.24          0.554       0.553

Table 1 - Accuracy, Precision and F-measure Results
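The accuracy, precision and F-measure used to compare the classifiers can all be derived from the counts of a binary confusion matrix. The following Python sketch shows the standard definitions; the function name `classification_metrics` and the example counts are assumptions made for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-measure from confusion counts.

    tp / fp: predicted-positive pairs that are truly linked / not linked.
    fn / tn: predicted-negative pairs that are truly linked / not linked.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for 100 evaluated node pairs.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

The guards against zero denominators matter for weak classifiers that never predict the positive class, in which case precision is undefined and is reported here as 0.0.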

All the classification algorithms other than Bagging achieve more than 73% accuracy, which indicates that popular classification algorithms yield quality results in the social network domain as well. This also indicates that the selected features have good discriminating ability. In terms of the Table 1 measures, the C4.5 algorithm and Naive Bayes dominate. The accuracy levels of Decision Table, Multilayer Perceptron and RBF Network lie between 73.18% and 74.30%. Such a small difference is not statistically significant, so no conclusion can be drawn from the accuracy metric alone about the most suitable algorithm for link prediction. Further, to understand the inconsistency in feature values, the distributions of positively and negatively labeled samples were analyzed for the selected features in the dataset. For most of the features, the distributions of the positive and negative classes differ significantly, which allows the classification algorithm to pick up patterns in the feature values and classify the samples correctly. The F-value is the harmonic mean of precision and recall, and it is sometimes considered a better performance measure for a classification model than accuracy, especially if the class populations are biased in the training dataset. The F-values do not differ significantly across the models for this dataset.

Classifier              Root mean squared error   Relative absolute error
C4.5 (J48)              0.3088                    0.3899
Naive Bayes             0.305                     0.4118
Decision Table          0.3318                    0.4058
Multilayer Perceptron   0.2729                    0.4808
RBF Network             0.3156                    0.4644
Bagging                 0.4228                    0.4748

Table 2 - Error results
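For reference, the two error measures reported in Table 2 can be sketched as follows. This uses the common textbook definitions, where relative absolute error compares the total absolute error with that of a baseline that always predicts the mean of the actual values; the exact computation inside WEKA may differ in detail. The function names and the sample values are hypothetical.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between predictions and truth."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def relative_absolute_error(actual, predicted):
    """Total absolute error divided by that of a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(a - mean_actual) for a in actual)
    return model_error / baseline_error


# Hypothetical class labels (1 = link, 0 = no link) and predicted probabilities.
actual = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.1]

rmse = root_mean_squared_error(actual, predicted)
rae = relative_absolute_error(actual, predicted)
```

A relative absolute error below 1 means the model does better than the mean baseline, and both measures fall toward 0 as the predictions approach the true labels.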

To compare the errors of the different algorithms, root mean squared error and relative absolute error were used. Recent research [7] shows that the squared error metric is remarkably robust and has the highest average correlation with the other metrics, and hence is an excellent metric for comparing the performance of different classifiers. The C4.5 decision tree algorithm has the lowest relative absolute error of the algorithms used, whereas Naive Bayes has the lowest root mean squared error. However, a root mean squared error of 0.3088 and a relative absolute error of 0.3899 imply that the C4.5 model still has a considerable amount of error, even though it has the best error results overall.

C4.5 [9], an algorithm used to generate a decision tree, yields the best accuracy, precision, F-measure and error results. When the tree structure was analyzed, the following was discovered. The attribute with the highest information gain is the Mutual Friends count: the higher the number of mutual friends, the higher the possibility of being connected in the network. The other attributes, such as College, Work Place and Birth Year, are covered by the Mutual Friends attribute.
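To make the notion of information gain concrete, the following Python sketch computes it for a categorical attribute against binary link labels. It shows plain information gain as mentioned in the text; C4.5 itself normalises this quantity into a gain ratio and handles numeric attributes by choosing thresholds. The function names, the discretised attribute values and the labels are hypothetical and are not data from this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on an attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical node pairs: 1 = linked, 0 = not linked.
labels = [1, 1, 0, 0]
mutual_friends = ["high", "high", "low", "low"]  # separates the classes perfectly
same_city = ["yes", "no", "yes", "no"]           # carries no information here

gain_mutual = information_gain(mutual_friends, labels)
gain_city = information_gain(same_city, labels)
```

In this toy data the mutual friends attribute removes all uncertainty about the label while the city attribute removes none, which mirrors how a decision tree learner would place the more informative attribute at the root.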

Conclusion
Link prediction in a social network is an important problem, as it is very helpful in analyzing the possible growth of inter-relationships in social groups. Such understanding can lead to efficient implementation of tools to identify hidden characteristics, find missing members of groups and find missing relationships. This research shows that link prediction in a real-life social network can be solved with an accuracy of up to 78.77% by considering only a few features. It has been shown that most of the popular classification models can solve the problem with acceptable accuracy, but C4.5 proved to be the most suitable algorithm for this dataset. A comparison of the features is also provided, ranking them according to their prediction ability using different feature analysis algorithms. We believe that these ranks are meaningful and can help other researchers choose attributes for the link prediction problem in a similar domain. Moreover, the methodology, key feature extraction and overall approach can serve as guidance for solving social network problems in the future.

Future Works
Link prediction was done using comparatively small localized datasets, but a larger localized dataset will yield a more accurate prediction model. The behavior of the link prediction model varies with the locality of the user; therefore, a localized dataset will yield a more accurate model than a global link prediction model. The feature selection for link prediction can be extended, and more attributes with higher information gain can be added, for example the membership attributes in the Friends and Family section and the favorite sports in the Sports section. Building concept hierarchies will help produce more fine-grained results. For example, the user location contains country, current city and hometown, so a hierarchy can be defined in the context of the structural attributes of the user. The results show that the most significant feature is "Mutual Friendships", which is a topological feature of the social network. Adding more topological features, such as edge-disjoint k shortest distances, would make the model stronger: the shorter the distance, the better the chance that two members will collaborate. There are other similar measures, such as Jaccard's coefficient [8]. Social networks contain a large amount of forged data. The data preprocessing step of this research focused only on incomplete data. Identifying forged data patterns in social networks is also a challenging task, and solving it would help extract high-quality data.
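The topological measures mentioned for future work can be illustrated with a short sketch. The Python code below computes Jaccard's coefficient over the friend sets of two members and the plain shortest-path distance between them using breadth-first search. Note that this is the ordinary shortest distance, not the edge-disjoint k shortest distances named above, and that the function names and the toy graph are assumptions made for this example.

```python
from collections import deque


def jaccard_coefficient(graph, u, v):
    """Shared friends of u and v divided by the size of their combined friend sets."""
    friends_u = graph.get(u, set())
    friends_v = graph.get(v, set())
    union = friends_u | friends_v
    return len(friends_u & friends_v) / len(union) if union else 0.0


def shortest_distance(graph, source, target):
    """Number of hops on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, set()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical undirected friendship graph stored as adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
    "z": set(),
}

jaccard_a_d = jaccard_coefficient(graph, "a", "d")
distance_a_e = shortest_distance(graph, "a", "e")
```

Both values can be appended to the feature vector of a node pair alongside the mutual friends count, since they are computed from the graph structure alone.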

Acknowledgements
I wish to acknowledge the active support and advice given by Dr. Shehan Perera. His constant guidance and frequent reviews enabled the completion of this research project.

References
[1] D. Liben-Nowell and J. M. Kleinberg, "The link-prediction problem for social networks", JASIST, 2007, pp. 1019-1031.
[2] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, Mohammed Zaki, "Link Prediction using Supervised Learning", Citeseer, New York, 2006, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.1225
[3] Eric Gilbert and Karrie Karahalios, "Predicting Tie Strength With Social Media", CHI '09 Proceedings of the 27th International Conference on Human Factors in Computing Systems, 2009, pp. 211-220.
[4] Facebook, Facebook Graph API, July 2011, https://developers.facebook.com/docs/reference/api/
[5] Mark Allen, restfb, July 2011, http://restfb.com/
[6] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, 2005.
[7] R. Caruana and A. Niculescu-Mizil, "Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria", KDD, 2004.
[8] D. Liben-Nowell and J. Kleinberg, "The Link Prediction Problem for Social Networks", LinkKDD, 2004.
[9] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
