Mining Social Network Graphs




Debapriyo Majumdar
Data Mining, Fall 2014
Indian Statistical Institute Kolkata
November 13 and 17, 2014

Social Network
- No introduction required. Really?
- We still need to understand a few properties.
- Disclaimer: the brand logos are used here entirely for educational purposes.

Social Network
- A collection of entities
  - Typically people, but could be something else too
- At least one relationship between entities of the network
  - For example: friends
  - Sometimes boolean: two people are either friends or they are not
  - May have a degree
    - Discrete degree: friends, family, acquaintances, or none
    - Real-valued degree: the fraction of the average day that two people spend talking to each other
- An assumption of nonrandomness or locality
  - Hard to formalize
  - Intuition: relationships tend to cluster
  - If entity A is related to both B and C, then the probability that B and C are related is higher than average (random)

Social Network as a Graph
[Figure: a graph with a boolean (friends) relationship on the nodes A, B, C, D, E, F, G]
- Check for the non-randomness criterion
- In a random graph (V, E) of 7 nodes and 9 edges, if XY is an edge and YZ is an edge, what is the probability that XZ is an edge?
  - For a large random graph, it would be close to $|E| / \binom{|V|}{2} = 9/21 \approx 0.43$
  - Small graph: XY and YZ are already edges, so compute within the rest
  - So the probability is $(|E| - 2) / \left(\binom{|V|}{2} - 2\right) = 7/19 \approx 0.37$
- Now let's compute this probability for this graph in particular
Example courtesy: Leskovec, Rajaraman and Ullman

Social Network as a Graph
[Figure: a graph with a boolean (friends) relationship on the nodes A, B, C, D, E, F, G]
- For each X, take every pair Y, Z of nodes adjacent to X and check whether YZ is an edge or not
- Example: if X = A, the only pair is YZ = BC, and it is an edge

X      YZ                        Yes/Total
A      BC                        1/1
B      AC, AD, CD                1/3
C      AB                        1/1
D      BE, BG, BF, EF, EG, FG    2/6
E      DF                        1/1
F      DE, DG, EG                2/3
G      DF                        1/1
Total                            9/16 ≈ 0.56

- The graph does have the locality property: 0.56 is well above the 0.37 expected for a random graph of this size
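
The table can be reproduced mechanically. The following is a minimal sketch in Python; the edge list is read off the table above (the figure itself did not survive), and the names `EDGES`, `build_adjacency` and `locality_fraction` are illustrative, not part of the original slides.

```python
from itertools import combinations

# Edges of the 7-node example graph, reconstructed from the table
# (assumption: the lost figure shows exactly these 9 edges).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def locality_fraction(edges):
    """For every node X and every pair Y, Z of its neighbours,
    count the pair (total) and whether YZ is itself an edge (yes)."""
    adj = build_adjacency(edges)
    yes = 0
    total = 0
    for x in adj:
        for y, z in combinations(sorted(adj[x]), 2):
            total += 1
            if z in adj[y]:
                yes += 1
    return yes, total


yes, total = locality_fraction(EDGES)
print(f"{yes}/{total} = {yes / total:.2f}")
```

For the example graph this gives the same 9 out of 16 pairs as the table.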

Types of Social (or Professional) Networks
- Of course, the "social network". But also several other types
- Telephone network
  - Nodes are phone numbers
  - AB is an edge if A and B talked over the phone within the last week, or month, or ever
  - Edges could be weighted by the number of phone calls made, or by the total time of conversation

Types of Social (or Professional) Networks
- Email network: nodes are email addresses
  - AB is an edge if A and B sent emails to each other within the last week, or month, or ever
  - One-directional edges would allow spammers to have edges
  - Edges could be weighted
- Other networks: collaboration network
  - Nodes are authors of papers; two authors are connected if they have jointly written a paper
- These networks also exhibit the locality property

Clustering of Social Network Graphs
- Locality property → there are clusters
- Clusters are communities
  - People of the same institute, or company
  - People in a photography club
  - A set of people with something in common between them
- Need to define a distance between points (nodes)
- In graphs with weighted edges, different distances exist
- For graphs with a "friends" or "not friends" relationship
  - Distance is 0 (friends) or 1 (not friends)
  - Or 1 (friends) and infinity (not friends)
  - Both of these violate the triangle inequality: if AB and BC are edges but AC is not, then d(A, C) > d(A, B) + d(B, C)
  - Fix the triangle inequality: distance = 1 (friends) and 1.5 or 2 (not friends), or the length of the shortest path
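
The claims about the triangle inequality can be checked by brute force on the 7-node example graph used in these slides. This is only a sketch: the edge list is reconstructed from the earlier table, and `two_valued`, `shortest_path` and `satisfies_triangle_inequality` are hypothetical helper names.

```python
from collections import deque
from itertools import permutations

# Example graph from the slides (edge list reconstructed, an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
NODES = sorted({n for edge in EDGES for n in edge})
ADJ = {n: set() for n in NODES}
for u, v in EDGES:
    ADJ[u].add(v)
    ADJ[v].add(u)


def two_valued(friends, not_friends):
    """Distance that only depends on whether u and v are adjacent."""
    def d(u, v):
        if u == v:
            return 0
        return friends if v in ADJ[u] else not_friends
    return d


def shortest_path(u, v):
    """Length of the shortest path from u to v, found by BFS."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in ADJ[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return float("inf")


def satisfies_triangle_inequality(d):
    """Check d(x, z) <= d(x, y) + d(y, z) for all distinct x, y, z."""
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in permutations(NODES, 3))


print("0 / 1:        ", satisfies_triangle_inequality(two_valued(0, 1)))
print("1 / infinity: ", satisfies_triangle_inequality(two_valued(1, float("inf"))))
print("1 / 1.5:      ", satisfies_triangle_inequality(two_valued(1, 1.5)))
print("shortest path:", satisfies_triangle_inequality(shortest_path))
```

The first two measures fail on triples such as A, B, D, where AB and BD are edges but AD is not.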

Traditional Clustering
- Intuitively, the example graph has two communities
- Traditional clustering depends on the distance
  - Likely to put two nodes with small distance in the same cluster
- Social network graphs would have cross-community edges
  - Severe merging of communities is likely
- May join B and D (and hence the two communities) with not-so-low probability

Betweenness of an Edge
- Betweenness of an edge AB: the number of pairs of nodes (X, Y) such that AB lies on the shortest path between X and Y
  - There can be more than one shortest path between X and Y
  - Credit AB with the fraction of those paths that include the edge AB
- What does a high betweenness score mean?
  - The edge runs between two communities
- Betweenness gives a better measure
  - Edges such as BD get a higher score than edges such as AB
- Not a distance measure, and may not satisfy the triangle inequality. Doesn't matter!

The Girvan-Newman Algorithm: calculating the betweenness of edges
- Step 1, BFS: start at a node X and perform a BFS with X as root
  - Observe: level of node Y = length of the shortest path from X to Y
  - Edges between levels are called DAG edges
  - Each DAG edge is part of at least one shortest path from X
- Step 2, labeling: label each node Y with the number of shortest paths from X to Y
[Figure: BFS from root E on the example graph. Level 1: D, F. Level 2: B, G. Level 3: A, C. Labels: E = 1, D = 1, F = 1, B = 1, G = 2, A = 1, C = 1]

The Girvan-Newman Algorithm: calculating the betweenness of edges
- Step 3, credit sharing
  - Each leaf node gets credit 1
  - Each non-leaf node gets 1 + sum(credits of the DAG edges to the level below)
  - Credit of DAG edges: let $Y_i$ ($i = 1, \ldots, k$) be the parents of Z, and $p_i = \mathrm{label}(Y_i)$

    $\mathrm{credit}(Y_i, Z) = \mathrm{credit}(Z) \cdot \dfrac{p_i}{p_1 + \cdots + p_k}$

  - Intuition: a DAG edge $Y_i Z$ gets a share of the credit of Z proportional to the number of shortest paths from X to Z going through $Y_i Z$
- Finally: repeat Steps 1, 2 and 3 with each node as root. For each edge, betweenness = (sum of the credits obtained in all iterations) / 2, because every shortest path is counted once from each of its two endpoints
[Figure: credits for root E. Nodes: A = 1, C = 1, G = 1, B = 3, F = 1.5, D = 4.5. DAG edges: BA = 1, BC = 1, DB = 3, DG = 0.5, FG = 0.5, ED = 4.5, EF = 1.5]
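
Steps 1 to 3 can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the edge list is the reconstructed 7-node example graph, and the function names `build_adjacency`, `edge_credits` and `edge_betweenness` are made up for this sketch.

```python
from collections import deque

# Example graph from the slides (edge list reconstructed, an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_credits(adj, root):
    """Run Steps 1-3 for a single root and return the DAG edge credits."""
    # Steps 1 and 2: BFS levels and number of shortest paths (labels).
    level = {root: 0}
    paths = {root: 1}
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                paths[v] = 0
                queue.append(v)
            if level[v] == level[u] + 1:      # (u, v) is a DAG edge
                paths[v] += paths[u]
    # Step 3: share credit upwards, deepest nodes first.
    credit = {v: 1.0 for v in order}
    edge_credit = {}
    for z in reversed(order):
        for y in adj[z]:
            if level[y] == level[z] - 1:      # y is a parent of z
                # paths[z] equals p_1 + ... + p_k over the parents of z
                share = credit[z] * paths[y] / paths[z]
                edge_credit[frozenset((y, z))] = share
                credit[y] += share
    return edge_credit


def edge_betweenness(edges):
    """Sum the credits over all roots and divide by 2."""
    adj = build_adjacency(edges)
    total = {frozenset(e): 0.0 for e in edges}
    for root in adj:
        for edge, c in edge_credits(adj, root).items():
            total[edge] += c
    return {edge: c / 2 for edge, c in total.items()}


for edge, score in sorted(edge_betweenness(EDGES).items(),
                          key=lambda item: -item[1]):
    print("-".join(sorted(edge)), score)
```

With root E, `edge_credits` reproduces the credits shown for the example (4.5 on ED, 1.5 on EF, 3 on DB), and over all roots the edge BD ends up with the highest betweenness, 12.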

Computation in practice
- Complexity: n nodes, e edges
  - BFS starting at each node: O(e)
  - Do it for n nodes
  - Total: O(ne) time
  - Very expensive
- Method in practice
  - Choose a random subset W of the nodes
  - Compute the credit of each edge starting at each node in W
  - Sum and compute betweenness
  - A reasonable approximation
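
A possible shape for the sampled computation is sketched below. The slide does not say how the sampled sums are normalised; rescaling by n / |W| before halving is an assumption of this sketch, as are the function names and the reconstructed example edge list.

```python
import random
from collections import deque

# Example graph from the slides (edge list reconstructed, an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_credits(adj, root):
    """BFS from root, count shortest paths, then share credits upwards."""
    level, paths, order = {root: 0}, {root: 1}, []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v], paths[v] = level[u] + 1, 0
                queue.append(v)
            if level[v] == level[u] + 1:
                paths[v] += paths[u]
    credit = {v: 1.0 for v in order}
    edge_credit = {}
    for z in reversed(order):
        for y in adj[z]:
            if level[y] == level[z] - 1:
                share = credit[z] * paths[y] / paths[z]
                edge_credit[frozenset((y, z))] = share
                credit[y] += share
    return edge_credit


def approximate_betweenness(edges, sample_size, seed=0):
    """Estimate betweenness from BFS runs rooted at a random subset W."""
    adj = build_adjacency(edges)
    nodes = sorted(adj)
    roots = random.Random(seed).sample(nodes, sample_size)   # the subset W
    total = {frozenset(e): 0.0 for e in edges}
    for root in roots:
        for edge, c in edge_credits(adj, root).items():
            total[edge] += c
    # Assumption: scale the sampled sum up to n roots, then halve as usual.
    scale = len(nodes) / (2 * sample_size)
    return {edge: c * scale for edge, c in total.items()}


estimate = approximate_betweenness(EDGES, sample_size=3)
print(estimate[frozenset(("B", "D"))])
```

When `sample_size` equals the number of nodes, the estimate coincides with the exact betweenness; smaller samples trade accuracy for fewer BFS runs.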

Finding Communities using Betweenness
- Method 1: keep adding edges (among the existing ones), starting from the lowest betweenness
- Gradually join small components to build large connected components


Finding Communities using Betweenness
- Method 2: start from all existing edges. The graph may look like one big component.
- Keep removing edges, starting from the highest betweenness
- Gradually split large components to arrive at communities
- At some point, removing the edge with the highest betweenness splits the graph into separate components

Finding Communities using Betweenness
- For a fixed threshold of betweenness, both methods would ultimately produce the same clustering
- However, a suitable threshold is not known beforehand
- Method 1 vs Method 2
  - Method 2 is likely to take fewer operations. Why?
  - There are fewer inter-community edges than intra-community edges
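
The effect of a threshold can be illustrated with a small Python sketch. It takes the additive view of Method 1 (keep only edges whose betweenness is below the threshold and report the connected components). The betweenness values are those of the 7-node example graph, computed by hand for this sketch rather than quoted from the slides; `communities` and the union-find helper are hypothetical names.

```python
# Edge betweenness of the example graph (hand-computed for this sketch).
BETWEENNESS = {
    ("A", "B"): 5.0, ("A", "C"): 1.0, ("B", "C"): 5.0,
    ("B", "D"): 12.0, ("D", "E"): 4.5, ("D", "F"): 4.0,
    ("D", "G"): 4.5, ("E", "F"): 1.5, ("F", "G"): 1.5,
}


def communities(betweenness, threshold):
    """Add edges in increasing order of betweenness, stopping short of
    the threshold, and return the resulting connected components."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), score in sorted(betweenness.items(), key=lambda kv: kv[1]):
        root_u, root_v = find(u), find(v)     # also registers both nodes
        if score < threshold and root_u != root_v:
            parent[root_u] = root_v

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


print(communities(BETWEENNESS, threshold=12))
print(communities(BETWEENNESS, threshold=4.5))
```

With threshold 12 only the edge BD is left out, which separates {A, B, C} from {D, E, F, G}; lowering the threshold to 4.5 splits the graph further. Removing edges from the top down (Method 2) with the same threshold leaves the same set of edges, hence the same components.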

Triangles in Social Network Graph
- The number of triangles in a social network graph is expected to be much larger than in a random graph of the same size
  - The locality property
- Counting the number of triangles indicates
  - How much the graph looks like a social network
  - The age of a community
    - A new community forms
    - Members bring in their like-minded friends
    - Such new members are expected to eventually connect to other members directly

Triangle Counting Algorithm
- Graph (V, E); |V| = n, |E| = m
- Step 1: compute the degree of each node
  - Examine each edge
  - Add 1 to the degree of each of its two nodes
  - Takes O(m) time
- Step 2: a hash table $(v_i, v_j) \to 1$
  - So that, given two nodes, we can determine if they have an edge between them
  - Construction takes O(m) time
  - Each query takes expected O(1) time, with a proper hash function
- Step 3: an index v → list of nodes adjacent to v
  - Construction takes O(m) time, querying takes O(1) time

Counting Heavy Hitter Triangles
- Heavy hitter node: a node with degree $\ge \sqrt{m}$
  - Note: there are at most $2\sqrt{m}$ heavy hitter nodes
  - More than $2\sqrt{m}$ such nodes → total degree > 2m (but |E| = m, so the total degree is exactly 2m)
- Heavy hitter triangle: a triangle whose 3 nodes are all heavy hitters
- Number of possible heavy hitter triangles: at most $\binom{2\sqrt{m}}{3} = O(m^{3/2})$
- For each possible triangle, use the hash table (Step 2) to check if all three edges exist
- Takes $O(m^{3/2})$ time

Counting other Triangles
- Consider an ordering of nodes: $v_i \prec v_j$ if
  - either degree($v_i$) < degree($v_j$),
  - or degree($v_i$) = degree($v_j$) and i < j
- For each edge $(v_i, v_j)$
  - If both nodes are heavy hitters, skip (already done)
  - Suppose $v_i$ is not a heavy hitter
  - Find the nodes $w_1, w_2, \ldots, w_k$ which are adjacent to $v_i$ (using the node → adjacent nodes index, Step 3) [takes O(k) time]
  - For each $w_l$, $l = 1, \ldots, k$, check if the edge $v_j w_l$ exists, in O(1) time each, O(k) time in total
  - Count the triangle $\{v_i, v_j, w_l\}$ if and only if
    - the edge $v_j w_l$ exists
    - and also $v_i \prec w_l$
- Total time for each edge $(v_i, v_j)$ is $O(\sqrt{m})$, since $k < \sqrt{m}$
- There are m edges, so the total time is $O(m^{3/2})$
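
The three preparation steps and the two counting phases can be put together as in the sketch below. It assumes a simple undirected graph (no self-loops, no repeated edges) with mutually comparable node labels, and it breaks degree ties by node label instead of by index. One deliberate deviation, which is this sketch's own choice: it orients each edge so that $v_i \prec v_j$ and requires $v_j \prec w_l$, so that every triangle is counted from exactly one of its edges. The name `count_triangles` is illustrative.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in O(m^(3/2)) using heavy hitters and an ordering."""
    # Step 1: degree of each node.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Step 2: hash table for O(1) expected-time edge lookups.
    edge_set = {frozenset(e) for e in edges}
    # Step 3: index node -> adjacent nodes.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    m = len(edges)
    # Heavy hitter: degree >= sqrt(m), tested without floating point.
    heavy = {v for v, d in degree.items() if d * d >= m}

    def has_edge(a, b):
        return frozenset((a, b)) in edge_set

    def rank(v):
        # Ordering by degree, ties broken by label (assumption of the sketch).
        return (degree[v], v)

    count = 0
    # Phase 1: triangles whose three nodes are all heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1
    # Phase 2: all other triangles, found from their lowest-ranked node.
    for u, v in edges:
        if u in heavy and v in heavy:
            continue
        a, b = sorted((u, v), key=rank)       # a precedes b, a is not heavy
        for w in adj[a]:
            if rank(b) < rank(w) and has_edge(b, w):
                count += 1
    return count


EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
print(count_triangles(EDGES))
```

On the example graph (edge list reconstructed from the earlier table) the heavy hitters are B, D and F, which do not form a triangle, and the second phase finds the triangles ABC, DEF and DFG.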

Optimality
- Worst case scenario
- If G is a complete graph
  - With n nodes, $m = \binom{n}{2}$, so the number of triangles is $\binom{n}{3} = \Theta(m^{3/2})$
  - Cannot even enumerate all triangles in less than $O(m^{3/2})$ time
  - Hence it is the lower bound for computing all triangles
- If G is sparse
  - Consider a complete graph G with n nodes and m edges
  - Note that $m = \binom{n}{2} = O(n^2)$
  - Construct G' from G by adding a chain of length $n^2$
  - The number of triangles remains the same, $O(m^{3/2})$
  - The number of edges remains of the same order, O(m)
  - G' is quite sparse, with a lower edge-to-node ratio
  - Still cannot compute the triangles in less than $O(m^{3/2})$ time

Directed Graphs in (Social) Networks
- Set of nodes V and directed edges (arcs) u → v
  - The web: pages link to other pages
  - Persons made calls to other persons
  - Twitter, Google+: people follow other people
- All undirected graphs can be considered as directed
  - Think of each edge as bidirectional

Paths and Neighborhoods
- Path of length k: a sequence of nodes $v_0, v_1, \ldots, v_k$ from $v_0$ to $v_k$ such that $v_i \to v_{i+1}$ is an arc for $i = 0, \ldots, k-1$
- Neighborhood N(v, d) of radius d for a node v: the set of all nodes w such that there is a path from v to w of length at most d (so v itself is included)
- For a set of nodes V: N(V, d) := {w | there is a path of length at most d from some v in V to w}
- Neighborhood profile of a node v: the sequence of sizes of its neighborhoods of radius d = 1, 2, ...; that is, |N(v, 1)|, |N(v, 2)|, |N(v, 3)|, ...

Neighborhood Profile
- Neighborhood profile of B
  - |N(B, 1)| = 4
  - |N(B, 2)| = 7
- Neighborhood profile of A
  - |N(A, 1)| = 3
  - |N(A, 2)| = 4
  - |N(A, 3)| = 7
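
A neighborhood profile can be computed with a single BFS. The sketch below treats the example graph as directed with every edge bidirectional and counts v itself as part of N(v, d), which matches the numbers above; the edge list is the reconstructed example and `neighborhood_profile` is an illustrative name.

```python
from collections import deque

# Example graph from the slides (edge list reconstructed, an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]

# Successor lists; each undirected edge becomes two arcs.
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)


def neighborhood_profile(adj, v):
    """Return [|N(v,1)|, |N(v,2)|, ...] up to the radius at which
    no further node becomes reachable."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    max_radius = max(dist.values())
    return [sum(1 for d in dist.values() if d <= radius)
            for radius in range(1, max_radius + 1)]


print("B:", neighborhood_profile(ADJ, "B"))
print("A:", neighborhood_profile(ADJ, "A"))
```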

Diameter of a Graph
- Diameter of a graph G(V, E): the smallest integer d such that for any two nodes v, w in V, there is a path of length at most d from v to w
- Only makes sense for strongly connected graphs
  - Can reach any node from any node
- The web graph: not strongly connected
  - But there is a large strongly connected component
- The six degrees of separation conjecture
  - The diameter of the graph of the people in the world is six

Diameter and Neighborhood Profile
- Neighborhood profile of a node v: |N(v, 1)|, |N(v, 2)|, |N(v, 3)|, ...
- V = N(v, k) for some k; denote the smallest such k as d(v)
- If G is a complete graph, d(v) = 1
- Diameter of G is $\max_v \{d(v)\}$
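
The relation between d(v) and the diameter can be sketched as follows. The code assumes that every node appears as a key of the successor map and returns `None` when the graph is not strongly connected, since the diameter is then undefined; `radius_to_cover` and `diameter` are hypothetical names, and the edge list is the reconstructed example graph.

```python
from collections import deque

# Example graph from the slides (edge list reconstructed, an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)


def radius_to_cover(adj, v):
    """d(v): the smallest k with N(v, k) = V, or None if some node
    cannot be reached from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    if len(dist) < len(adj):
        return None
    return max(dist.values())


def diameter(adj):
    """max over v of d(v); None if the graph is not strongly connected."""
    radii = [radius_to_cover(adj, v) for v in adj]
    if any(r is None for r in radii):
        return None
    return max(radii)


print({v: radius_to_cover(ADJ, v) for v in sorted(ADJ)})
print("diameter:", diameter(ADJ))
```

For the example graph d(B) = d(D) = 2 and d(v) = 3 for every other node, so the diameter is 3.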

Reference
- Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 10