Social Network Analysis Challenges in Computer Science April 1, 2014 Frank Takes (ftakes@liacs.nl) LIACS, Leiden University
Overview Context Social Network Analysis Online Social Networks Friendship Graph Centrality Measures Example Conclusions
Data Science Data Science is the study of the generalizable extraction of knowledge from data, yet the key word is science (Wikipedia) Builds on techniques and theories from many fields, including machine learning, computer programming, statistics, data engineering, pattern recognition and learning, visualization, Goal is extracting meaning from data and creating data products Data science is a buzzword, often used interchangeably with analytics or big data
Big Data? Online Social Network Friendship Graph 8 million users 1 billion friendships between users 10GB on disk Is this Big Data? No.
Three V s of Big Data Volume Velocity Volatile
Or.?
From Data to Networks Unstructured data: numeric measurements from a temperature sensor, textual contexts of a news article Structured data: data organized according to a model or data structure, for example: Database: tables with rows and columns Graph/Network: nodes and edges
Graphs n = 5 nodes m = 6 edges A B Vertex/Node/Knoop C Distance/Afstand d(c, E) = d(e, C) = 2 D E Relationship/Edge/Link/Tak
Social Network Analysis Social Network Analysis (SNA): the study of social networks to understand their structure and behavior. Social Network: a social structure of people, related (directly or indirectly) to each other through a common relation or interest. Social Networks!= Social Media
Social Network Analysis Social Network Analysis (SNA) Sociology Algorithms Data Mining Social Networks Real-life (explicit) Online (explicit) Derived (implicit) e-mail networks, citation networks, co-author networks, terrorist collaboration networks
Online Social Networks
History 1997: SixDegrees.com 2000: Friendster 2003: LinkedIn & MySpace 2004: Hyves 2005: Facebook 2006: Twitter... 2010: The Social Network (movie)
Online Social Networks User (node) has a profile Profiles have attributes (labels/annotations) Explicit links (edges) Social Links / Friendship links User groups Implicit links Social messaging Common attributes Directed vs. undirected links
2011
Example: Facebook More than 1 billion active users Average user has 130 friends Estimated 100 billion social links Over 600 million interactive objects (pages, groups and events) More than 45 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
OSNA Research Topics User behavior Privacy & Anonymity Trust & Authorities Diffusion of information Sampling & Crawling Community Detection Friendship Graph
Friendship Graph
Friendship Graph
Friendship Graph Static analysis Who is the most important person in a network? Can we distinguish between groups of people in the network? What is the average distance between two people in the network? Dynamic analysis Who are likely to become friends next? How does the social network evolve over time?
Centrality measures Degree centrality Betweenness centrality Closeness centrality Graph centrality (eccentricity centrality) Eigenvector centrality Random walk centrality Hyperlink Induced Topic Search (HITS) PageRank
Centrality Who has a central position in this graph? A E F G H A E F G H B B D D C C Degree Centrality: C has the highest degree
Centrality Who has a central position in this graph? A E F G H A E F G H B B D D C C Betweenness Centrality: E is part of the largest number of shortest paths
Google PageRank
Google PageRank 4 webpages A, B, C and D (N = 4) Initially: PR(A) = PR(B) = PR(C) = PR(D) = 1/n L(A) is the outdegree of page A Now if B, C and D each link to A, the simple PageRank PR(A) of a page A is equal to:
Google PageRank 0.25 0.25 A B 0.458 0.083 A B C D 0.25 0.25 C D 0.208 0
Google PageRank PageRank as suggested by Larry Page in 1999 N = number of pages, p i and p j are pages M(p i ) is the set of pages linking to p i L(p j ) is the outdegree of p j d = 0.85, 85% chance to follow a link, 15% chance to jump to a random page (random surfer) t = 0 t = t + 1
Frequent Subgraphs What pattern occurs frequently in this graph? B A A B C A A A A A A B B B B B B C C C Frequent Subgraph: A-B-C
Clustering http://bayramannakov.wordpress.com
Six Degrees of Separation "I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation. Between us and everybody else on this planet. The president of the United States. A gondolier in Venice. Fill in the names. I find that A) tremendously comforting that we're so close and B) like Chinese water torture that we're so close. Because you have to find the right six people to make the connection. It's not just big names. It's anyone. A native in a rain forest. A Tierra del Fuegan. An Eskimo. I am bound to everyone on this planet by a trail of six people. It's a profound thought." John Guare, 1990
Six Degrees of Separation Stanley Milgram, 1969 300 brieven van Omaha naar Boston Geadresseerd aan 300 willekeurige mensen, met het verzoek de brief door te sturen richting de uiteindelijk geadresseerde. Na gemiddeld 5.5 stappen kwam de brief bij de geadresseerde aan.
Six Degrees of Separation Testen op een online sociaal netwerk Dataset 8 miljoen gebruikers 900 miljoen onderlinge vriendschappen 9GB in text (datafile), 4GB in memory Alle afstanden vergelijken: 8M x 8M = 64 x 10 12 Sampling: onderlinge afstand van paren van 1000 willekeurige gebruikers bepalen mb.v. kortstepadalgoritme van Dijkstra
Gemiddelde afstanden 1000 samples Random gekozen Voor datasets van sociale netwerken: Netwerk Gebruikers Vriendschappen Gemiddelde afstand Flickr 1.800.000 22.600.000 5.67 Hyves 8.000.000 900.000.000 4.75 LiveJournal 5.300.000 77.400.000 5.88 Orkut 3.100.000 223.500.000 4.25 YouTube 1.160.000 4.950.000 5.10
Distance Distribution
Data storage in memory n nodes, m links, k links per node on average 1 < k < n < m < n 2 Adjacency Matrix Sorted Adjecency List Size 8M x 8M x 1bit = 64 Tbit = 8 Tbyte O(n 2 ) space 900M x 8 bytes (INT pairs) = 7.2 Gbyte O(m) space Link existence O(1) time O(log k) time Link addition O(1) time O(k log k) time Link deletion O(1) time O(1) time Neighborhood O(n) time O(1) time
Friendship Graph Analysis Static analysis Densely connected core Fringe of low-degree nodes Few isolated communities & singletons Static properties Node degree distribution, average distance, diameter Edge/node ratio, level of symmetry Number of cliques, k-cliques, etc. Small world phenomanon
Degree Distribution
Small World Networks Class of networks with certain properties: Sparse graphs Highly connected Short average node-to-node distance: d ~ log(n) Fat tailed power law node degree distribution Densely connected core with many (near-)cliques Existence of hubs: nodes with a very high degree Fringe of low(er)-degree nodes
Regular vs. Small World
Small World Networks Other examples of small world networks Web graphs Gene networks E-mail networks Telephone call graphs Information networks Internet topology networks Scientific co-authorship networks Corporate networks (interlocks or ownerships)
Collaboration @ LIACS
Friendship Graph Analysis Static analysis Dynamic analysis Network evolution Network modelling Network growth Link Prediction Triadic Closure Preferential Attachment
Link Prediction A E A E B B D D C C Two principles: preferential attachment and triadic closure
Triadic Closure A C A C B B Two principles: preferential attachment and triadic closure
Preferential Attachment Nodes with a large degree acquire new links at a faster rate. A E F B J G C D
Conclusions When you hear big data, then it is almost never really big data. Online social networks are an excellent domain of study for data (or graph-) miners. Social Network Analysis is important for many areas of research, not only computer science. (Small world) networks are everywhere.
Try this at home Graph visualization: http://www.gephi.org http://nodexl.codeplex.com Network datasets: http://snap.stanford.edu http://konect.uni-koblenz.de/networks