Link Prediction in Social Networks




CS378 Data Mining Final Project Report
Dustin Ho: dsh544
Eric Shrewsberry: eas2389

Link Prediction in Social Networks

1. Introduction

Social networks are increasingly prevalent in the daily life of the 21st century student. Key to the definition of a social network is the nature of the relationships between the people participating in it, and especially the formation and dissolution of these relationships. By viewing a social network as an undirected graph, with the people as nodes and the relationships between them as edges, we can begin to formulate techniques to analyze the network and obtain interesting and useful results. In particular, link prediction, the problem of estimating the probability that a given relationship will form at a future time given evidence from the social graph, is central to understanding how the network evolves and is the focus of our study.

2. Original Proposal

Our original project proposal was to explore the mechanisms of social network evolution by testing various methods of unsupervised link prediction on the social network datasets. We found many of the predictors listed in the Kleinberg paper [1] interesting, and we intended to implement as many as we could in order to measure their performance on this dataset. A secondary goal was to visualize the results in a clear and informative manner, preferably in an interactive form. We also originally planned to use Python as our language of choice for this project. Many of these assumptions have changed in the past few weeks as we worked on the project, though our end goal remains the same: to analyze the natural layout of social networks and to build and test various predictive models of link prediction in social networks.

3. Background Research

In order to familiarize ourselves with the current state of the field, we spent time reading the papers that Wei suggested as well as seeking out resources on our own.
Of note were the many websites and papers we looked at detailing typical layouts of social networks. These were our primary motivation for visualizing the data and for confirming that its node degrees indeed follow a power-law distribution. The Kleinberg paper [1] has proven to be the most useful to us so far. Partly because it was written in a very accessible manner and partly because it presented such compelling results, we have mostly been trying to replicate the algorithms and methods discussed in this paper.

We used the Song paper [2] mainly to get a feel for the properties of the data. A few of the methods in the paper felt very advanced, and we decided to implement them only if we had time at the end of the project. We also benefited greatly from the discussion of ROC curves in class [5] and have adopted them as our primary metric for evaluating link prediction techniques. We feel that ROC curves provide a new way of perceiving the accuracy of different predictors and make it clear when a predictor performs well. After our meeting with Dr. Dhillon and Wei Tang, we also considered spectral clustering as a method for link prediction [4]. However, we ran into significant issues with the implementation of the method, which we discuss later.

4. Examining Properties of the Social Network

This is a graph of the degrees of separation from an individual. Due to the limitations of our visualization framework (discussed later), we decided that the degrees of separation in the graph would be an interesting way to visualize the data. Processing the separation tree takes a long time for moderately sized to large datasets. For the full LiveJournal set, it takes 5 days to process the full separation tree for one person. In contrast, the views of node degree frequency took less than 10 minutes to process for the entire set. Because of the large amount of time required for the separation tree, we spent a lot of time trying to optimize our algorithm, but so far none of our efforts have produced significant improvements.
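As an illustration of the separation tree described above, here is a minimal sketch of one standard way to compute degrees of separation from a single person: a breadth-first search over an adjacency-list graph. It is not our actual implementation; the function names, the dictionary-based graph layout, and the toy data are all assumptions made for the example. A breadth-first search visits each reachable node and edge once, so its cost grows linearly with the size of the reachable part of the graph.

```python
from collections import deque


def separation_levels(adjacency, root):
    """Return {node: degrees of separation from root} for every reachable node.

    adjacency is a hypothetical {node: iterable of neighbors} mapping.
    """
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:  # first visit is the shortest path
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def level_sizes(distance):
    """Count how many nodes sit at each degree of separation."""
    sizes = {}
    for depth in distance.values():
        sizes[depth] = sizes.get(depth, 0) + 1
    return sizes


# Toy undirected graph (illustrative data only); "z" is unreachable from "a".
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "z": [],
}
levels = separation_levels(toy_graph, "a")
print(level_sizes(levels))
```

The per-level counts are what a layered separation view would display: one root, then the nodes one step away, two steps away, and so on.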

Node degree    LiveJournal    MySpace
Max            28786          92390
Mean           46.8           42.3
Median         17             9
Mode           1              1

The node degree frequency graph shows a power-law distribution. We were surprised to learn that the most frequent node degree in both cases is 1. Having one connection in a pool of over 1.5M certainly demonstrates the sparseness of the data. We think this shows that a large number of users are not engaged in the social networking aspects of LiveJournal and MySpace, or that users sign up for a single purpose and do not care to come back.

We spent a good deal of time examining different visualization options, including popular packages such as GraphViz, NetworkX, and NodeBox. In the end, the package that seemed the most relevant was Walrus [3], a tool developed by the Cooperative Association for Internet Data Analysis. It can visualize incredibly large graphs (essential for our 2 million node networks) in a visually appealing way.

We also ran the graph and analysis algorithms on the new arxiv data we received:

[Figure: Graph of index 2 from the astro-ph dataset, cumulative over 6 and then 11 years.]

[Figure: Graph of index 2084 from the hep-th dataset, cumulative over 6 and then 11 years.]

As expected, the total number of nodes as well as the degree of closer nodes increased dramatically over the last 5 years in the datasets. As one can see, there is a slight increase in the number of nodes connected directly to the root node in each case. There is also a huge increase in total nodes, and many additional layers separate the root node from its furthest connection. The code for generating these graphs is explained in the README, and more pictures and graph files are contained in the project folder.

5. Unsupervised Link Prediction

Before the mid-term report, we only had access to the immense LiveJournal and MySpace datasets. As such, we only ran the predictors on a 100k subset of the LiveJournal network (about 5% of all the data in that month) and tested predictive accuracy over the next month. To simplify the problem, we assumed that the number of links that develop over the course of the month was known, though testing shows that we could build a pretty good linear estimator of links per month.

Predictor                  Accuracy (%)
Random                     0.00000844
Common Neighbors           5.784
Preferential Attachment    3.875

As expected, these simple classifiers don't do very well, but they perform much better than random (especially since the graph is sparse). Preliminary data suggests that these predictors are slightly more accurate on MySpace, though the sample size of 5,000 seemed too small to report conclusively. We decided not to investigate further on these datasets, instead turning our attention to the arxiv datasets made available to us. Previously, running a predictor on even a subset of the LiveJournal network could take a day. The arxiv datasets allowed us to try multiple methods on multiple datasets in a matter of hours.

All computation was run on Dustin Ho's computer (quad core 3.2 GHz, 6GB RAM), since the largest dataset needed approximately 5GB of RAM. All of our code is available on our website [6]. A README is provided to assist with reproduction of our results. We created a test harness to help automate the data collection process, as well as separate scoring functions for each predictor we wished to test.

The data we used was collected by Kleinberg [1] and consists of 3 sections of the arxiv coauthorship network: gr-qc (general relativity and quantum cosmology), hep-ph (high energy physics - phenomenology), and hep-th (high energy physics - theory). This data was collected over the course of 11 years, with networks available for each year. We tried a variety of test situations, including training on 5 years to predict the next 6 years, training on 3 years to predict the next 3 years, and training on 1 year to predict the following year. We felt most comfortable with this last situation, and we compiled the results for this data as follows:

Results: Common Neighbors Predictor, 1 year training, 1 year testing data.

[Figure: Common Neighbors on GR-QC]

[Figure: Common Neighbors on HEP-PH]

[Figure: Common Neighbors on HEP-TH]
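Since ROC curves are the primary metric used to evaluate the predictors in this report, the following is a minimal sketch of how such a curve can be built from predictor output. It assumes each candidate link has a numeric score and a 0/1 label saying whether the link actually formed in the test period; the function names and the toy scores are hypothetical and are not taken from our test harness.

```python
def roc_points(scores, labels):
    """Return the (false positive rate, true positive rate) points of an ROC curve.

    scores: predictor score per candidate link (higher means more likely).
    labels: 1 if the link actually formed in the test period, else 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Candidates with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve; 0.5 corresponds to random guessing."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (illustrative data only): five candidate links, two of which formed.
toy_scores = [3, 2, 2, 1, 0]
toy_labels = [1, 1, 0, 0, 0]
curve = roc_points(toy_scores, toy_labels)
print(curve, area_under_curve(curve))
```

A predictor that ranks every true link above every non-link gives an area of 1.0, while a curve that hugs the diagonal indicates performance close to random.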

Common Neighbors was our simplest predictor and the baseline against which we compared the other predictors we implemented (note that it barely does better than random). It simply scores a link by the number of neighbors the two nodes have in common.

Preferential Attachment Predictor, 1 year training, 1 year testing data.

[Figure: Preferential Attachment on GR-QC]

[Figure: Preferential Attachment on HEP-PH]

[Figure: Preferential Attachment on HEP-TH]
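The two neighborhood-based scores in this section, common neighbors (the number of neighbors two nodes share) and preferential attachment (the product of the two nodes' degrees), can be sketched in a few lines. This is an illustrative sketch rather than our actual scoring code; the helper names, the set-based graph layout, and the toy edge list are assumptions.

```python
from itertools import combinations


def build_neighbors(edges):
    """Turn an undirected edge list into a {node: set of neighbors} mapping."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def common_neighbors(neighbors, u, v):
    """Score a candidate link by how many neighbors u and v share."""
    return len(neighbors[u] & neighbors[v])


def preferential_attachment(neighbors, u, v):
    """Score a candidate link by the product of the degrees of u and v."""
    return len(neighbors[u]) * len(neighbors[v])


def rank_candidates(neighbors, score):
    """Rank all currently unlinked node pairs from highest to lowest score."""
    candidates = [
        (u, v)
        for u, v in combinations(sorted(neighbors), 2)
        if v not in neighbors[u]
    ]
    return sorted(candidates, key=lambda pair: -score(neighbors, *pair))


# Toy training graph (illustrative data only).
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
graph = build_neighbors(toy_edges)
print(rank_candidates(graph, common_neighbors))
print(rank_candidates(graph, preferential_attachment))
```

Note that `rank_candidates` enumerates every unlinked pair, which is quadratic in the number of nodes; that is fine for a toy graph but is exactly the kind of cost that becomes a problem on networks with millions of nodes.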

Preferential Attachment - This predictor scores a candidate link by the product of the degrees of its two vertices. We note that it appears to perform better than Common Neighbors on each dataset. Also, the HEP-TH and GR-QC ROC curves appear to match exactly. We reran the experiments to double-check, however, and did not find any source of error that would account for this.

Weighted Katz Predictor, 1 year training, 1 year testing data.

[Figure: Weighted Katz on GR-QC]

[Figure: Weighted Katz on HEP-PH]

[Figure: Weighted Katz on HEP-TH]
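For reference, the Katz measure scores a pair of nodes by summing over the paths between them (counted as walks, in which nodes may repeat), with each path of length l damped by a factor beta^l so that short paths count most. The sketch below is a truncated version of that idea under stated assumptions: the damping factor `beta`, the cutoff `max_length`, the dict-of-dicts graph layout, and the use of edge weights as multiplicities are illustrative choices, not the settings or code used for the results above.

```python
def katz_scores(weights, source, beta=0.05, max_length=4):
    """Truncated Katz scores from source to every node reachable within max_length steps.

    weights: hypothetical symmetric mapping {u: {v: edge weight}}; a weight can
    stand for the number of parallel links between u and v.
    """
    scores = {}
    walks = {source: 1.0}  # weighted count of walks of the current length, per end node
    damping = 1.0
    for _ in range(max_length):
        damping *= beta
        next_walks = {}
        for node, count in walks.items():
            for neighbor, weight in weights.get(node, {}).items():
                next_walks[neighbor] = next_walks.get(neighbor, 0.0) + count * weight
        # Add this length's contribution: beta^length * (number of walks).
        for node, count in next_walks.items():
            scores[node] = scores.get(node, 0.0) + damping * count
        walks = next_walks
    return scores


# Toy path graph a - b - c with unit weights (illustrative data only).
toy_weights = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1},
}
print(katz_scores(toy_weights, "a", beta=0.5, max_length=2))
```

Propagating walk counts from a single source avoids forming powers of the full adjacency matrix, at the price of ignoring paths longer than the cutoff; with a small `beta` those longer paths contribute very little.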

Weighted Katz scores a link by the number and length of the paths between its two nodes, with shorter paths counting more. Weighted Katz appears to perform slightly better than Common Neighbors but worse than Preferential Attachment, except on HEP-PH, where Weighted Katz performs the best.

6. Problems we have encountered

The biggest obstacle we ran into is the sheer size of the datasets. With the LiveJournal network spanning around 1.7 million nodes and the MySpace network at around 2.1 million nodes, we quickly ran into both time and space restrictions. So far we have been satisfied with taking a simple subset of the graph (usually around 5%, or 100,000 nodes) in order to test our predictors, but there is a fairly large problem with taking subsets of social networks. Since social networks are so heavily clustered (and therefore not evenly distributed), a bad choice of subset could throw away much of the useful evidence for link prediction. We have explored many methods for getting around this problem, including taking a subgraph, which resulted in poor data, as well as attempting to find the "core" of a dataset, as described in Kleinberg [1].

It is possible that the Condor clusters could be used to alleviate this problem. However, we have tried many different requirements and configurations of the Condor job description and have yet to find one that works on a full network and finishes in a reasonable amount of time, even for the fairly simple Common Neighbors predictor. We have also tried using sparse() in MATLAB to take advantage of the fact that the adjacency representation of the networks is sparse, but that has also proven unfruitful.

We have also explored using low-rank approximations of the full network and working with those, since creating the predictors is of high complexity while testing them is a linear operation. After receiving the arxiv data, though, we have been able to use the full dataset for our predictors. Even on the largest dataset with the Katz predictor (the most complicated one), runtime is less than an hour.

Also of note is our attempt at using spectral graph embedding for link prediction. Although we were able to get GRACLUS to install correctly and made some attempts at implementing the algorithm, our data didn't seem to be correct and we ran into many bugs. In the end we decided not to include our incomplete algorithm and to submit only the results whose quality we were comfortable with.

7. Things we learned

We enjoyed this project greatly and felt that we got a taste of research in the data mining field. We come out of this project with a new appreciation of the complexity of social networks and of how difficult the link prediction problem is. Of note is the sheer size of the datasets and the time spent computing on them - in no other project have we had to work on computations that lasted for days (and once even a whole week). The critical importance of algorithmic complexity in this field is now obvious to us, as is the power of smart code to reduce running time by orders of magnitude. We have only scratched the surface of link prediction, but the cleverness of the algorithms and predictors we have examined leaves us intrigued by the possibility of more powerful, more elegant solutions to this problem. In conclusion, we feel we have learned a great deal from this project and would like to thank both the professor and the TA for the opportunity to work on such an interesting problem.

References

1. D. Liben-Nowell and J. Kleinberg. The Link-prediction Problem for Social Networks. Journal of the American Society for Information Science 58(7), 2007.
2. H. Song, T. Cho, V. Dave, Y. Zhang and L. Qiu. Scalable Proximity Estimation and Link Prediction in Online Social Networks. In Proc. of the ACM/USENIX Internet Measurement Conference, 2009.
3. The Cooperative Association for Internet Data Analysis (CAIDA). Walrus - Graph Visualization Tool. http://www.caida.org/tools/visualization/walrus/
4. H. Song, B. Savas, T. Cho, V. Dave, Z. Lu, I. S. Dhillon, Y. Zhang and L. Qiu. Clustered Embedding of Massive Online Social Networks. Submitted for publication, February 2010.
5. W. Tang. ROC Curves. CS378 Lecture, Spring 2010.
6. D. Ho and E. Shrewsberry. Link Prediction in Social Networks. http://code.google.com/p/dmhoshrews/