Link Prediction in Social Networks

CS378 Data Mining Final Project Report
Dustin Ho : dsh544
Eric Shrewsberry : eas2389

1. Introduction

Social networks are becoming increasingly prevalent in the daily life of the 21st century student. Key to the definition of a social network is the nature of the relationships between the people participating in the network, and especially the formation and dissolution of these relationships. By viewing a social network as an undirected graph, with the people as the nodes and the relationships between them as edges, we can begin to formulate techniques to analyze the network in order to obtain interesting and useful results. In particular, link prediction, the problem of estimating the probability that a given relationship will form at a future time given the current state of the social graph, plays a critical role in the evolution of the network and is the focus of our study.

2. Original Proposal

Our original project proposal was to explore the mechanisms of social network evolution by testing various methods of unsupervised link prediction on the social network datasets. We found many of the predictors listed in the Kleinberg paper [1] to be interesting, and we intended to implement as many as we could in order to measure their performance on this dataset. A secondary goal was to visualize the results we obtained in a clear and informative manner, preferably in an interactive form. We also originally planned to use Python as our language of choice for this project. Many of these assumptions have changed in the past few weeks as we worked on this project, though our end goal remains the same: to analyze the natural layout of social networks and to build and test various predictive models of link prediction in social networks.

3. Background Research

In order to familiarize ourselves with the current state of the field, we spent time reading the papers that Wei suggested as well as seeking out resources on our own.
Of note were the many websites and papers we looked at detailing typical layouts of social networks. These were our primary motivation for visualizing the data and for confirming that it does indeed follow a power-law distribution. The Kleinberg paper [1] has proven to be the most useful to us so far. Partly because it was written in a very accessible manner and partly because it presented such compelling results, we have mostly been trying to replicate the algorithms and methods discussed in this paper.

We used the Song paper [2] mainly to get a feel for the properties of the data. A few of the methods in the paper felt very advanced, and we decided to implement them only if we had time at the end of the project.

We also greatly benefited from the discussion of ROC curves in class [5] and have adopted them as the primary metric by which we evaluate the various link prediction techniques. We feel that ROC curves provide a new way of perceiving the accuracy of different predictors and make it clear when a predictor performs well.

After our meeting with Dr. Dhillon and Wei Tang, we also considered spectral clustering as a method for link prediction [4]. However, we ran into significant issues with the implementation of the method, which we discuss later.

4. Examining Properties of the Social Network

This is a graph of the degrees of separation from an individual.

Due to the limitations of our visualization framework (discussed later), we decided that degrees of separation would be an interesting way to visualize the data. It takes a lot of time to process the separation tree for moderate to large data sets. For the full LiveJournal set it takes 5 days to process the full separation tree for one person. In contrast, the views of node degree frequency took less than 10 minutes to process for the entire set. Because of the large amount of time required for the separation tree, we spent a lot of time trying to optimize our algorithm, but so far none of our efforts have produced significant improvements.
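A separation tree of the kind described above can be built with a breadth-first search from the chosen person. The following is a minimal sketch in Python, assuming the graph fits in memory as a dictionary mapping each node to a set of neighbors; it is illustrative only and is not the project's own code. The names separation_levels, level_sizes and toy_graph are hypothetical.

```python
from collections import deque


def separation_levels(adjacency, root):
    """Return {node: degrees of separation from root} using breadth-first search."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:  # first visit gives the shortest distance
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def level_sizes(distance):
    """Count how many nodes sit at each degree of separation."""
    sizes = {}
    for level in distance.values():
        sizes[level] = sizes.get(level, 0) + 1
    return sizes


# Hypothetical toy graph (undirected, so every edge appears in both directions).
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
toy_distance = separation_levels(toy_graph, "a")
toy_sizes = level_sizes(toy_distance)
```

Each node and each edge is visited a bounded number of times, so the running time of this sketch grows linearly with the size of the part of the graph reachable from the root.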

Node degree    LiveJournal    MySpace
Max            28786          92390
Mean           46.8           42.3
Median         17             9
Mode           1              1

The node degree frequency graph shows a power-law distribution. We were surprised to learn that the most frequent node degree in both cases is 1. Having one connection in a pool of over 1.5M certainly demonstrates the sparseness of the data. We think this shows that a large number of users are not engaged in the social networking aspects of LiveJournal and MySpace, or that users sign up for a single purpose and do not care to come back.

We spent a good deal of time examining different visualization options, including popular packages such as GraphViz, NetworkX, and NodeBox. In the end, the package that seemed the most relevant was Walrus [3], a tool developed by the Cooperative Association for Internet Data Analysis. It has the ability to visualize incredibly large graphs (essential for our 2 million node networks) in a way that is visually appealing.

We also ran the graph and analysis algorithms on the new arxiv data we received:

Graph of index 2 from the astro-ph dataset for 6 and then 11 years cumulative.

Graph of index 2084 from the hep-th dataset for 6 and then 11 years cumulative.

As expected, the total number of nodes as well as the degree of the closer nodes increased dramatically over the last 5 years in the datasets. As one can see, there is a slight increase in nodes connected directly to the root node in each case. There is also a huge increase in total nodes, and many additional layers were added separating the root node from its furthest connection. The code for generating these graphs is explained in the README, and more pictures and graph files are contained in the project folder.

5. Unsupervised Link Prediction

Before the mid-term report, we only had access to the immense LiveJournal and MySpace datasets, so we ran the predictors only on a 100k subset of the LiveJournal network (about 5% of all the data in that month) and tested predictive accuracy over the next month. To simplify the problem, we assumed that the number of links that develop over the course of the month was known, though testing shows that we could build a fairly good linear estimator of links per month.

Predictor                  Accuracy (%)
Random                     0.00000844
Common Neighbors           5.784
Preferential Attachment    3.875

As predicted, these simple classifiers don't do very well, but they perform much better than random (especially since the graph is sparse). Preliminary data shows that these predictors seem to be slightly more accurate on MySpace, though the sample size of 5,000 seemed too small to report conclusively. We decided not to investigate further on these datasets, instead turning our attention to the arxiv datasets made available to us.

Previously, running a predictor on even a subset of the LiveJournal network could take a day. The arxiv datasets allowed us to try multiple methods, with multiple datasets, in a matter of hours. All computation was run on Dustin Ho's computer (Quad Core 3.2 GHz, 6GB RAM), since the largest dataset needed approximately 5GB of RAM. All of our code is available on our website [6]. A README is provided to assist with reproduction of our results. We created a test harness to help automate the data collection process, as well as separate scoring functions for each predictor we wished to test.

The data we used was collected by Kleinberg [1] and consists of 3 sections of the arxiv coauthorship network: gr-qc (general relativity and quantum cosmology), hep-ph (high energy physics - phenomenology), and hep-th (high energy physics - theory). This data was collected over the course of 11 years, with networks available for each year. We tried a variety of test situations, including training on 5 years to predict the next 6 years, training on 3 years to predict the next 3 years, and training on 1 year to predict the following year. We felt most comfortable with this last situation, and we compiled the results for this data as follows:

Results:

Common Neighbors Predictor, 1 year training, 1 year testing data.

Common Neighbors on GR-QC

Common Neighbors on HEP-PH

Common Neighbors on HEP-TH
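The Common Neighbors score behind the figures above counts the neighbors that two unlinked nodes share in the training graph. The following is a minimal sketch in Python, assuming a small graph stored as a dictionary of neighbor sets; it is illustrative only and is not the project's own code. The names common_neighbors_scores and train_graph are hypothetical, and the loop over all node pairs is only practical for small graphs.

```python
from itertools import combinations


def common_neighbors_scores(adjacency):
    """Score every currently unlinked node pair by its number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the link already exists, so there is nothing to predict
        scores[(u, v)] = len(adjacency[u] & adjacency[v])
    return scores


# Hypothetical toy training graph (undirected).
train_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
cn_scores = common_neighbors_scores(train_graph)
best_pair = max(cn_scores, key=cn_scores.get)  # highest-scoring candidate link
```

Ranking the candidate pairs by this score and taking the top of the list gives the predicted links for the test period.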

Common Neighbors was our simplest predictor, and the baseline against which we compared the other predictors we implemented (note that it barely does better than random). It scores a link by the number of neighbors the two nodes have in common.

Preferential Attachment Predictor, 1 year training, 1 year testing data.

Preferential Attachment on GR-QC

Preferential Attachment on HEP-PH

Preferential Attachment on HEP-TH
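The Preferential Attachment score behind the figures above is the product of the degrees of the two nodes in the training graph. The following is a minimal sketch in Python under the same assumptions as before (a small graph stored as a dictionary of neighbor sets); it is illustrative only and is not the project's own code. The names preferential_attachment_scores and train_graph are hypothetical.

```python
from itertools import combinations


def preferential_attachment_scores(adjacency):
    """Score every currently unlinked node pair by the product of the two degrees."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v not in adjacency[u]:
            scores[(u, v)] = len(adjacency[u]) * len(adjacency[v])
    return scores


# Hypothetical toy training graph (undirected).
train_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
pa_scores = preferential_attachment_scores(train_graph)
```

Unlike Common Neighbors, this score needs only the degree of each node, so it can be computed without looking at the neighborhoods themselves.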

Preferential Attachment scores a link by the product of the degrees of its two endpoint vertices. We note that it appears to perform better than Common Neighbors on each data set. Also, the HEP-TH and GR-QC ROC curves seem to match exactly. We reproduced these results to double-check, however, and did not find any source of error to account for this apparent discrepancy.

Weighted Katz Predictor, 1 year training, 1 year testing data.

Weighted Katz on GR-QC

Weighted Katz on HEP-PH

Weighted Katz on HEP-TH
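The Katz score behind the figures above sums over the paths between two nodes, with each path of length l damped by a factor beta^l so that short paths count most. The following is a minimal sketch in Python, assuming a small graph stored as nested dictionaries of edge multiplicities; it is illustrative only and is not the project's own code. The infinite sum is truncated at max_length, paths are counted as walks (nodes may repeat), and the name katz_scores, the parameters beta and max_length, their default values, and path_graph are all assumptions made for this example.

```python
def katz_scores(weights, source, beta=0.05, max_length=4):
    """Truncated Katz scores from source to every other reachable node.

    weights[u][v] is the multiplicity of the edge between u and v
    (1 for an unweighted graph).
    """
    scores = {}
    walks = {source: 1.0}  # walks[x] = weighted number of length-l walks source -> x
    damping = 1.0
    for _ in range(max_length):
        next_walks = {}
        for node, count in walks.items():
            for neighbor, multiplicity in weights.get(node, {}).items():
                next_walks[neighbor] = next_walks.get(neighbor, 0.0) + count * multiplicity
        walks = next_walks
        damping *= beta  # beta ** l for the current walk length l
        for node, count in walks.items():
            scores[node] = scores.get(node, 0.0) + damping * count
    scores.pop(source, None)  # a node is not a link candidate for itself
    return scores


# Hypothetical toy graph: the path a - b - c with unit multiplicities.
path_graph = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1},
}
katz_from_a = katz_scores(path_graph, "a", beta=0.5, max_length=3)
```

In this toy example the directly connected node b receives a higher score than c, which is only reachable in two steps; for link prediction only the scores of currently unlinked pairs would be ranked.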

Weighted Katz considers the number and the length of the paths between two nodes to score the link between them. Weighted Katz appears to perform slightly better than Common Neighbors but worse than Preferential Attachment, except in the case of HEP-PH, where Weighted Katz performs the best.

6. Problems we have encountered

The biggest obstacle we ran into is the sheer size of the datasets. With the LiveJournal network spanning around 1.7 million nodes and the MySpace network around 2.1 million nodes, we quickly ran into both time and space restrictions. So far we have been satisfied with taking a simple subset of the graph (usually around 5%, or 100,000 nodes) in order to test our predictors, but there is a fairly large problem with taking subsets of social networks. Since social networks are so heavily based on clustering (and therefore are not evenly distributed), a bad subset choice could throw away a lot of the useful evidence that could be used in link prediction. We have explored many methods for getting around this problem, including taking a subgraph, which resulted in poor data, as well as attempting to find the "core" of a dataset, as described in Kleinberg [1].

It is possible that the Condor clusters could be used to alleviate this problem. However, we have tried many different requirements and configurations of the Condor job description and have yet to find one that works on a full network and finishes in a reasonable amount of time, even for the fairly simple Common Neighbors predictor. We have also tried using the sparse() option in MATLAB to take advantage of the fact that the adjacency representation of the networks is sparse, but that also proved unfruitful.
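The sparsity argument above can be made concrete with the figures reported earlier for LiveJournal (about 1.7 million nodes, mean degree 46.8). The following is a minimal sketch in Python that compares the number of entries in a dense adjacency matrix with the number of entries a sparse, neighbor-set representation needs to store; it is illustrative only, is not MATLAB's sparse(), and the names build_adjacency, dense_entries and sparse_entries are hypothetical.

```python
def build_adjacency(edges):
    """Store an undirected graph sparsely: only the neighbors that actually exist."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def dense_entries(num_nodes):
    """A dense adjacency matrix holds one entry per ordered node pair."""
    return num_nodes * num_nodes


def sparse_entries(num_nodes, mean_degree):
    """A neighbor-set representation holds one entry per (node, neighbor) pair."""
    return round(num_nodes * mean_degree)


# Figures taken from this report; treated here as approximate inputs.
LIVEJOURNAL_NODES = 1_700_000
LIVEJOURNAL_MEAN_DEGREE = 46.8

dense = dense_entries(LIVEJOURNAL_NODES)
sparse = sparse_entries(LIVEJOURNAL_NODES, LIVEJOURNAL_MEAN_DEGREE)
ratio = dense / sparse  # how many times more entries the dense form needs

tiny = build_adjacency([("a", "b"), ("b", "c")])
```

Under these assumptions the dense form needs tens of thousands of times more entries than the sparse one, which is why a dense matrix for the full network cannot fit in memory on a single machine.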

We have also explored using low-rank approximations of the full network and working with those, since creating the predictors is of high complexity while testing them is a linear operation. After receiving the arxiv data, though, we have been successful at using the full dataset for our predictors. Even on the largest dataset with the Katz predictor (the most complicated one), the runtime is less than an hour.

Also of note is our attempt at using spectral graph embedding for link prediction. Though we were able to get GRACLUS to install correctly and made some attempts at implementing the algorithm, our data didn't seem to be correct and we ran into many bugs. In the end we decided not to include our incomplete algorithm and to submit only the results whose quality we were comfortable with.

7. Things we learned

We enjoyed this project greatly and felt that we got a taste of research in the Data Mining field. We come out of this project with a new appreciation of the complexity of social networks and of the difficult problem that is link prediction. Of note are the sheer size of the dataset and the time spent computing on it: in no other project have we had to work on computations that lasted for days (and once even a whole week). The critical importance of complexity in this field is now obvious to us, as is the power of smart code to reduce running time by orders of magnitude. We have only scratched the surface of link prediction, but the cleverness of the algorithms and predictors we have examined leaves us intrigued by the possibility of more powerful, more elegant solutions to this problem. In conclusion, we feel we have learned a great deal from this project and would like to thank both the professor and the TA for the opportunity to work on such an interesting problem.

References

1. D. Liben-Nowell and J. Kleinberg. The Link-prediction Problem for Social Networks. In Journal of the American Society for Information Science and Technology 58(7), 2007.
2. H. Song, T. Cho, V. Dave, Y. Zhang and L. Qiu. Scalable Proximity Estimation and Link Prediction in Online Social Networks. In Proc. of the ACM/USENIX Internet Measurement Conference, 2009.
3. The Cooperative Association for Internet Data Analysis (CAIDA). Walrus - Graph Visualization Tool. http://www.caida.org/tools/visualization/walrus/
4. H. Song, B. Savas, T. Cho, V. Dave, Z. Lu, I. S. Dhillon, Y. Zhang and L. Qiu. Clustered Embedding of Massive Online Social Networks. Submitted for publication, February 2010.
5. W. Tang. ROC Curves. CS378 Lecture, Spring 2010.
6. D. Ho and E. Shrewsberry. Link Prediction in Social Networks. http://code.google.com/p/dmhoshrews/