Part 2: Community Detection




Chapter 8: Graph Data
Part 2: Community Detection
Based on Leskovec, Rajaraman, Ullman (2014): Mining of Massive Datasets
Big Data Management and Analytics

Outline
Community Detection
- Social networks
- Betweenness
- Girvan-Newman Algorithm
- Modularity
- Graph Partitioning
- Spectral Graph Partitioning
- Trawling

Networks & Communities
Think of networks as being organized into modules, clusters, or communities.
Goal: find densely linked clusters.

What is a Social Network?
Characteristics of a social network:
- A collection of entities participating in the network (entities might be individuals, phone numbers, email addresses, ...).
- At least one relationship between entities of the network (on Facebook: "friend"). The relationship can be all-or-nothing or specified by a degree (e.g. the fraction of the average day that two people communicate with each other).
- An assumption of nonrandomness or locality, i.e. relationships tend to cluster (e.g. if A is related to B and C, there is a higher probability that B is related to C).

How to find communities?
Here we work with undirected, unweighted networks. Two questions need to be resolved:
- How to compute betweenness?
- How to select the number of clusters?


Betweenness
Definition: The betweenness of an edge (a,b) is the number of pairs of nodes x and y such that (a,b) lies on the shortest path between x and y. If there are several shortest paths between x and y, the edge is credited with the fraction of those paths that pass through it.
Example: Edge (B,D) has the highest betweenness, because it lies on the shortest path from each of A, B, C to each of D, E, F, G. The betweenness of (B,D) is therefore 3 x 4 = 12.
Question: What is the betweenness of edge (D,F)?

Girvan-Newman Algorithm
Goal: computation of the betweenness of edges.
Step 1: Perform a breadth-first search (BFS) starting at a node X and construct a DAG (directed acyclic graph) from the search levels.
Example: start at node E; in the figure, the other nodes are arranged below E on levels 1, 2 and 3.

Girvan-Newman Algorithm
Goal: computation of the betweenness of edges.
Step 2: Label each node with the number of shortest paths that reach it from the root. The root gets label 1; every other node is labeled with the sum of the labels of its parents.
Example (figure labels): root: 1; level 1: 1, 1; level 2: 1, 2; level 3: 1, 1.

Girvan-Newman Algorithm
Goal: computation of the betweenness of edges.
Step 3: For each edge e, calculate the sum over all nodes Y of the fraction of shortest paths from the root X to Y that pass through e.
In detail:
1. Each leaf gets a credit of 1.
2. Each non-leaf node gets a credit of 1 plus the sum of the credits of the DAG edges from that node to its children.
3. A DAG edge e entering node Z from the level above is given a share of the credit of Z proportional to the fraction of shortest paths from the root to Z that pass through e.
Formally: let $Y_1, \dots, Y_k$ be the parent nodes of Z, and let $p_i$, $1 \le i \le k$, be the number of shortest paths from the root to $Y_i$. The credit for edge $(Y_i, Z)$ is given by:
$$\mathrm{credit}(Y_i, Z) = \mathrm{credit}(Z) \cdot \frac{p_i}{\sum_{j=1}^{k} p_j}$$
Example (figure labels, node credits): level 1: 4.5, 1.5; level 2: 3, 1; level 3: 1, 1.
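Steps 1 to 3 can be sketched in Python for a single root. This is a minimal sketch, not the authors' implementation. The edge list `EDGES` is an assumption: the transcript does not reproduce the figure, so the seven-node graph is reconstructed from the edges and communities named on these slides. The helper names `build_adjacency` and `edge_credits` are hypothetical.

```python
from collections import deque

# Assumed edge list for the 7-node example graph (the figure is not in the text).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Undirected adjacency lists."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def edge_credits(adj, root):
    """Credits of all DAG edges for one BFS root (assumes a connected graph)."""
    # Step 1: BFS from the root; `order` lists the nodes level by level.
    level = {root: 0}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)

    def parents(z):
        # DAG parents of z: neighbors exactly one level closer to the root.
        return [u for u in adj[z] if level[u] == level[z] - 1]

    # Step 2: number of shortest paths from the root to each node.
    paths = {root: 1}
    for z in order[1:]:
        paths[z] = sum(paths[y] for y in parents(z))

    # Step 3: push credits bottom-up, splitting in proportion to the path counts.
    node_credit = {v: 1.0 for v in order}
    credit = {}
    for z in reversed(order[1:]):
        total = sum(paths[y] for y in parents(z))
        for y in parents(z):
            share = node_credit[z] * paths[y] / total
            credit[frozenset((y, z))] = share
            node_credit[y] += share
    return credit


credits_from_E = edge_credits(build_adjacency(EDGES), "E")
for edge, value in sorted(credits_from_E.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), value)
```

With root E, the edges (E,D) and (E,F) receive credits 4.5 and 1.5, and the edge (D,B) receives 3, which agrees with the node credits shown on the slide. Edges between nodes on the same level, such as (A,C), are not DAG edges and receive no credit from this root.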

Find Communities using Betweenness
Idea: Clustering is performed either by taking the edges in order of increasing betweenness and adding them to the graph one at a time, or as a process of edge removal starting with the highest betweenness.
Example: The GN algorithm has been run with every node as the root, and the betweenness of each edge has been calculated by summing its credits and dividing the sum by 2. (Why?)
Remove edges, starting with the highest betweenness:
1. Remove (B,D). Communities: {A,B,C} and {D,E,F,G}.
2. Remove (A,B), (B,C), (D,G), (D,E), (D,F). Communities: {A,C} and {E,F,G}.
Nodes B and D are left on their own, as "traitors" to their communities.
(Figure: the example graph with edge betweenness values 12, 5, 5, 4.5, 4.5, 4, 1.5, 1.5 and 1; edge (B,D) carries the value 12.)

Find Communities using Betweenness
Girvan-Newman Algorithm:
- The connected components that remain after edge removal are the communities.
- Removing more and more edges gives a hierarchical decomposition of the network.
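The whole procedure can be sketched as follows: compute the credits from every root, halve the sums (each shortest path is found twice, once from each of its endpoints), then drop the edges at or above a betweenness threshold and read off the connected components. This is a sketch under assumptions. The edge list is reconstructed from the edges named on the slides, since the figure is not in the transcript. All function names are hypothetical. As on the slides, betweenness is computed once and not recomputed after each removal.

```python
from collections import deque

# Assumed edge list for the 7-node example graph (the figure is not in the text).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
NODES = sorted({v for e in EDGES for v in e})


def adjacency(edges):
    adj = {v: [] for v in NODES}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def credits_from(adj, root):
    """Girvan-Newman steps 1 to 3 for one root (connected graph assumed)."""
    level, order, queue = {root: 0}, [root], deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)
    parents = {z: [u for u in adj[z] if level[u] == level[z] - 1] for z in order}
    paths = {root: 1}
    for z in order[1:]:
        paths[z] = sum(paths[y] for y in parents[z])
    node_credit = {v: 1.0 for v in order}
    credit = {}
    for z in reversed(order[1:]):
        for y in parents[z]:
            share = node_credit[z] * paths[y] / paths[z]
            credit[frozenset((y, z))] = share
            node_credit[y] += share
    return credit


def betweenness(edges):
    adj = adjacency(edges)
    total = {frozenset(e): 0.0 for e in edges}
    for root in NODES:
        for edge, value in credits_from(adj, root).items():
            total[edge] += value
    # Every shortest path was counted from both of its endpoints.
    return {edge: value / 2 for edge, value in total.items()}


def components(edges):
    adj = adjacency(edges)
    seen, result = set(), []
    for start in NODES:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u])
        seen |= comp
        result.append(sorted(comp))
    return result


BW = betweenness(EDGES)


def communities(threshold):
    """Remove every edge whose betweenness is >= threshold."""
    kept = [e for e in EDGES if BW[frozenset(e)] < threshold]
    return components(kept)


print(communities(12))  # only (B,D) is removed
print(communities(4))   # the six edges with betweenness >= 4 are removed
```

In `credits_from`, the sum of the parents' path counts equals `paths[z]`, so the share formula is the same as on the Step 3 slide. With the assumed graph, the first call yields {A,B,C} and {D,E,F,G}; the second yields {A,C}, {E,F,G} and the isolated nodes B and D.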


Network Communities
The two open questions: How to compute betweenness? How to select the number of clusters?
Communities: sets of tightly connected nodes.
Modularity Q:
- A measure of how well a network is partitioned into communities.
- Given a partitioning of the network into groups $s \in S$:
$$Q \propto \sum_{s \in S} \left[ (\#\text{edges within group } s) - (\text{expected } \#\text{edges within group } s) \right]$$
- The expected number of edges is defined by a null model.

Null Model: Configuration Model
- Given a graph G with n nodes and m edges, construct a rewired network G':
  - same degree distribution, but random connections;
  - consider G' as a multigraph.
- The expected number of edges between nodes i and j of degrees $k_i$ and $k_j$ is given by:
$$\frac{k_i k_j}{2m}$$
- Proof that G' contains the expected number of m edges (using $\sum_{i \in N} k_i = 2m$):
$$\frac{1}{2} \sum_{i \in N} \sum_{j \in N} \frac{k_i k_j}{2m} = \frac{1}{2} \cdot \frac{1}{2m} \sum_{i \in N} k_i \sum_{j \in N} k_j = \frac{1}{4m} \cdot 2m \cdot 2m = m$$
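The identity above can be checked numerically. The following sketch uses the degrees 3, 2, 3, 3, 3, 2 of the six-node example graph from the spectral partitioning slides; the variable and function names are illustrative only.

```python
# Degrees of the six-node example graph (from the degree matrix D on the
# spectral partitioning slides): node -> degree.
degrees = {1: 3, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}

# Every edge contributes to two degrees, so the degrees sum to 2m.
m = sum(degrees.values()) // 2


def expected_edges(i, j):
    """Expected number of edges between i and j in the configuration model."""
    return degrees[i] * degrees[j] / (2 * m)


# Sum over all ordered pairs (i, j) and halve, as in the proof.
expected_total = 0.5 * sum(expected_edges(i, j) for i in degrees for j in degrees)

print(m, expected_total)
```

The sum runs over ordered pairs including i = j, exactly as in the derivation, which is why the factor 1/2 is needed.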

Modularity
Modularity of a partitioning S of graph G:
$$Q \propto \sum_{s \in S} \left[ (\#\text{edges within group } s) - (\text{expected } \#\text{edges within group } s) \right]$$
$$Q(G, S) = \frac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s} \left( a_{ij} - \frac{k_i k_j}{2m} \right)$$
Normalizing: $-1 < Q < 1$, i.e. modularity values lie between -1 and 1.
- Q is positive if the number of edges within groups exceeds the expected number.
- Q greater than about 0.3 to 0.7 means significant community structure.

Modularity
Modularity Q is useful for selecting the number of clusters: among the candidate partitionings, prefer the one with the highest Q.
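To illustrate how Q can guide the choice, the sketch below evaluates the modularity formula for the two partitionings produced in the betweenness example. The edge list is an assumption reconstructed from the edges named on those slides, and the function name `modularity` is illustrative.

```python
# Assumed edge list for the 7-node example graph (the figure is not in the text).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def modularity(edges, partition):
    """Q(G, S) = 1/(2m) * sum over groups s and i, j in s of (a_ij - k_i*k_j/(2m))."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    edge_set = {frozenset(e) for e in edges}
    q = 0.0
    for group in partition:
        for i in group:
            for j in group:
                # a_ij is 1 for an edge, 0 otherwise (including i == j).
                a_ij = 1 if frozenset((i, j)) in edge_set else 0
                q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


two_groups = [{"A", "B", "C"}, {"D", "E", "F", "G"}]
four_groups = [{"A", "C"}, {"B"}, {"D"}, {"E", "F", "G"}]
one_group = [{"A", "B", "C", "D", "E", "F", "G"}]

q_two = modularity(EDGES, two_groups)
q_four = modularity(EDGES, four_groups)
q_one = modularity(EDGES, one_group)
print(round(q_two, 4), round(q_four, 4), round(q_one, 4))
```

Under these assumptions the two-community split scores about 0.364, the four-way split about 0.056, and the trivial single group scores 0, so Q favors stopping after the removal of (B,D).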


Partitioning of Graphs
Given an undirected graph G(V, E).
Bi-partitioning task:
- Divide the vertices into two disjoint groups A and B.
Questions:
- How can we define a good partition of G?
- How can we efficiently identify such a partition?

Partitioning of Graphs
What makes a good partition?
- Maximize the number of within-group connections.
- Minimize the number of between-group connections.
(Figure: example graph split into groups A and B.)

Partitioning of Graphs: Graph Cuts
Express partitioning objectives as a function of the edge cut of the partition.
Cut: the set of edges with only one vertex in a group:
$$\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$$
Example: in the figure, cut(A, B) = 2.

Partitioning of Graphs: Minimum Cut
Minimize the weight of connections between groups:
$$\arg\min_{A, B} \mathrm{cut}(A, B)$$
(Figure: example graph in which the minimum cut differs from the optimal cut.)
Problem:
- Only considers external cluster connections.
- Does not consider internal cluster connectivity.

Partitioning of Graphs: Graph Cuts
Normalized cut: connectivity between groups relative to the density of each group:
$$\mathrm{ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$$
vol(X): total weight of the edges with at least one endpoint in X:
$$\mathrm{vol}(X) = \sum_{i \in X} k_i$$
- Produces more balanced partitions.
How to find a good partition efficiently? Problem: computing the optimal normalized cut is NP-hard!
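The difference between the two objectives can be seen on the six-node example graph whose adjacency matrix appears on the spectral partitioning slides. The sketch below is illustrative; the function names `cut`, `vol` and `ncut` mirror the formulas but are otherwise hypothetical, and all edge weights are taken to be 1.

```python
# Edges of the six-node example graph, read off the adjacency matrix
# on the spectral partitioning slides (unweighted, so w_ij = 1).
EDGES = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]


def cut(edges, a, b):
    """Number of edges with one endpoint in a and the other in b."""
    return sum(1 for u, v in edges
               if (u in a and v in b) or (u in b and v in a))


def vol(edges, x):
    """Sum of the degrees of the nodes in x."""
    return sum((u in x) + (v in x) for u, v in edges)


def ncut(edges, a, b):
    c = cut(edges, a, b)
    return c / vol(edges, a) + c / vol(edges, b)


balanced = ({1, 2, 3}, {4, 5, 6})
lopsided = ({2}, {1, 3, 4, 5, 6})

for a, b in (balanced, lopsided):
    print(sorted(a), sorted(b), cut(EDGES, a, b), round(ncut(EDGES, a, b), 3))
```

Both partitions cut exactly two edges, so the plain cut cannot tell them apart, but the normalized cut is 0.5 for the balanced split and about 1.143 for the split that isolates node 2.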

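Spectral Graph Partitioning
- Adjacency matrix A of an undirected graph G: $a_{ij} = 1$ if edge (i, j) exists in G, else 0.
- Vector $x \in \mathbb{R}^n$ with components $x_1, \dots, x_n$: think of it as a label/value for each node of G.
What is the meaning of $A \cdot x$?
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad y_i = \sum_{j=1}^{n} a_{ij} x_j = \sum_{(i,j) \in E} x_j$$
Entry $y_i$ is the sum of the labels/values $x_j$ of the neighbors of i.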
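A small NumPy check makes this reading of $A \cdot x$ concrete. The matrix is the six-node adjacency matrix from the example slides; the label vector `x` is an arbitrary choice made for illustration.

```python
import numpy as np

# Adjacency matrix of the six-node example graph from the slides.
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

# Arbitrary labels for nodes 1..6 (illustrative values only).
x = np.array([10, 20, 30, 40, 50, 60])

# Matrix-vector product.
y = A @ x

# The same result, computed as "sum of the labels of my neighbors".
neighbor_sums = np.array([
    sum(x[j] for j in range(6) if A[i, j] == 1) for i in range(6)
])

print(y)
print(neighbor_sums)
```

For instance, node 1 is adjacent to nodes 2, 3 and 5, so its entry is 20 + 30 + 50 = 100.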

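Spectral Graph Partitioning
What is the meaning of $A \cdot x$?
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
If the result is a multiple of x, we have an eigenvalue problem:
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \lambda \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}, \qquad A \cdot x = \lambda \cdot x$$
Spectral graph theory: analyze the spectrum of a matrix representing G.
Spectrum: the eigenvectors $x_i$ of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues $\lambda_i$:
$$\Lambda = \{\lambda_1, \dots, \lambda_n\} \quad \text{with} \quad \lambda_1 \le \dots \le \lambda_n$$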

Spectral Graph Partitioning: Intuition
Suppose all nodes in G have degree d and G is connected. What are some eigenvalues/eigenvectors of G?
Eigenvalue problem: $A \cdot x = \lambda \cdot x$; find $\lambda$ and $x$.
Let's try $x = (1, \dots, 1)$. Then $A \cdot x = (d, \dots, d) = \lambda \cdot x$, so $\lambda = d$.
Remember: $y_i = \sum_{j=1}^{n} a_{ij} x_j = \sum_{(i,j) \in E} x_j$.

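Spectral Graph Partitioning: Intuition
What if G is not connected? Suppose G has 2 components A and B, each d-regular. What are some eigenvectors?
- Put all 1s on A and 0s on B, or vice versa:
  - $x' = (1, \dots, 1, 0, \dots, 0)$, then $A \cdot x' = (d, \dots, d, 0, \dots, 0)$
  - $x'' = (0, \dots, 0, 1, \dots, 1)$, then $A \cdot x'' = (0, \dots, 0, d, \dots, d)$
- In both cases the corresponding $\lambda = d$.
(Figure, left: A and B disconnected, $\lambda_n = \lambda_{n-1}$. Right: A and B joined by a few edges, $\lambda_n - \lambda_{n-1} \approx 0$.)
In the nearly disconnected case, the 2nd largest eigenvalue $\lambda_{n-1}$ has a value very close to $\lambda_n$.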
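The disconnected case can be verified on a tiny instance. The sketch below assumes two separate triangles as the components A and B, so each component is 2-regular (d = 2); this graph is chosen for illustration and is not taken from the slides.

```python
import numpy as np

# A triangle is 2-regular: every node is adjacent to the other two.
triangle = np.ones((3, 3)) - np.eye(3)
zeros = np.zeros((3, 3))

# Block-diagonal adjacency matrix: two triangles with no edges between them.
A = np.block([[triangle, zeros],
              [zeros, triangle]])
d = 2

# Indicator vectors of the two components.
x_a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
x_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Both are eigenvectors with eigenvalue d.
print(A @ x_a, d * x_a)
print(A @ x_b, d * x_b)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
# so the last two entries are lambda_{n-1} and lambda_n.
eigenvalues = np.linalg.eigvalsh(A)
print(np.round(eigenvalues, 6))
```

The two largest eigenvalues coincide at d = 2, which is the $\lambda_n = \lambda_{n-1}$ situation of the disconnected example.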

Spectral Graph Partitioning: Intuition
- If the graph is connected (right example in the figure), we already know that $x_n = (1, \dots, 1)$ is an eigenvector.
- Since eigenvectors are orthogonal, the components of $x_{n-1}$ sum to 0. Why? Because $x_n \cdot x_{n-1} = \sum_i x_n[i] \cdot x_{n-1}[i] = \sum_i x_{n-1}[i] = 0$.
General idea: look at the eigenvector of the 2nd largest eigenvalue and put the nodes with a positive label in A and the nodes with a negative label in B.

Spectral Graph Partitioning
- Adjacency matrix A: n x n matrix $A = [a_{ij}]$, with $a_{ij} = 1$ if there is an edge between nodes i and j, else 0.
Example (graph with nodes 1 to 6):

         1    2    3    4    5    6
    1    0    1    1    0    1    0
    2    1    0    1    0    0    0
    3    1    1    0    1    0    0
    4    0    0    1    0    1    1
    5    1    0    0    1    0    1
    6    0    0    0    1    1    0

Spectral Graph Partitioning
- Degree matrix D: n x n diagonal matrix $D = [d_{ii}]$, with $d_{ii}$ = degree of node i.
Example (graph with nodes 1 to 6):

         1    2    3    4    5    6
    1    3    0    0    0    0    0
    2    0    2    0    0    0    0
    3    0    0    3    0    0    0
    4    0    0    0    3    0    0
    5    0    0    0    0    3    0
    6    0    0    0    0    0    2

Spectral Graph Partitioning
- Laplacian matrix L: n x n symmetric matrix $L = D - A$.
Example (graph with nodes 1 to 6):

         1    2    3    4    5    6
    1    3   -1   -1    0   -1    0
    2   -1    2   -1    0    0    0
    3   -1   -1    3   -1    0    0
    4    0    0   -1    3   -1   -1
    5   -1    0    0   -1    3   -1
    6    0    0    0   -1   -1    2

Trivial eigenpair? $x = (1, \dots, 1)$, then $L \cdot x = 0$ and so $\lambda_1 = 0$.
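The three matrices can be built in a few lines of NumPy. This sketch starts from the example adjacency matrix and confirms the trivial eigenpair; the variable names are illustrative.

```python
import numpy as np

# Adjacency matrix of the six-node example graph.
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

# Degree matrix: row sums of A on the diagonal.
D = np.diag(A.sum(axis=1))

# Laplacian matrix.
L = D - A

# Every row of L sums to zero, so the all-ones vector is an eigenvector
# with eigenvalue 0.
ones = np.ones(6, dtype=int)
print(np.diag(D))
print(L)
print(L @ ones)
```

Each row of L contains the degree on the diagonal and one -1 per neighbor, which is why the row sums vanish.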

Spectral Clustering Algorithms
Three basic stages:
1. Pre-processing: construct a matrix representation of the graph.
2. Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping: assign points to two or more clusters, based on the new representation.

Spectral Clustering Algorithms
1. Pre-processing: build the Laplacian matrix L of the graph.

         1    2    3    4    5    6
    1    3   -1   -1    0   -1    0
    2   -1    2   -1    0    0    0
    3   -1   -1    3   -1    0    0
    4    0    0   -1    3   -1   -1
    5   -1    0    0   -1    3   -1
    6    0    0    0   -1   -1    2

2. Decomposition: find the eigenvalues λ and eigenvectors x of the matrix L.

    Λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0)

    X (column i is the eigenvector of the i-th eigenvalue; rows are nodes 1 to 6):

            x1     x2     x3     x4     x5     x6
    1      0.4    0.3   -0.5   -0.2   -0.4   -0.5
    2      0.4    0.6    0.4   -0.4    0.4    0.0
    3      0.4    0.3    0.1    0.6   -0.4    0.5
    4      0.4   -0.3    0.1    0.6    0.4   -0.5
    5      0.4   -0.3   -0.5   -0.2    0.4    0.5
    6      0.4   -0.6    0.4   -0.4   -0.4    0.0

Map the vertices to a lower-dimensional representation, here the eigenvector x2 of the second-smallest eigenvalue:

    node     1      2      3      4      5      6
    x2      0.3    0.6    0.3   -0.3   -0.3   -0.6

How do we now find the clusters?

Spectral Clustering Algorithms
3. Grouping:
- Sort the components of the reduced 1-dimensional vector.
- Identify clusters by splitting the sorted vector in two (threshold ε).
- By choosing m vectors, there are at most $2^m$ clusters.
How to choose a splitting point, i.e. the threshold ε?
- Naive approaches: split at ε = 0 or at the median value.
- More expensive approaches: attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector).
Example: split at ε = 0. Cluster A: positive points; cluster B: negative points.

    node    x2      cluster
    1       0.3     A
    2       0.6     A
    3       0.3     A
    4      -0.3     B
    5      -0.3     B
    6      -0.6     B
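The three stages can be put together in a short NumPy sketch for the six-node example. This is one possible implementation, not the one used to produce the slides: it relies on `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order, and it uses the naive split at ε = 0. Because an eigenvector is only determined up to sign, the sketch flips it so that node 1 is positive; that orientation rule is an arbitrary choice.

```python
import numpy as np

# 1. Pre-processing: adjacency matrix -> Laplacian L = D - A.
A = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# 2. Decomposition: eigenvalues in ascending order, eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Eigenvector of the second-smallest eigenvalue (x2 on the slides).
x2 = eigenvectors[:, 1]
if x2[0] < 0:
    # Fix the arbitrary sign so that node 1 gets a positive value.
    x2 = -x2

# 3. Grouping: naive split at threshold 0.
cluster_a = [node + 1 for node in range(6) if x2[node] > 0]
cluster_b = [node + 1 for node in range(6) if x2[node] <= 0]

print(np.round(eigenvalues, 1))
print(np.round(x2, 1))
print(cluster_a, cluster_b)
```

The components of x2 sum to zero because x2 is orthogonal to the all-ones eigenvector of eigenvalue 0, so a split at 0 always produces two non-empty groups here.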

Spectral Clustering Algorithms
(Figures: components of x2, plotted as value of x2 against rank in x2; components of x3.)


Trawling
Goal: find small communities in huge graphs, e.g. how to describe a community or discussion on the Web.
Example: people talking about the same things or visiting the same web pages.

Trawling
Problem definition: enumerate complete bipartite subgraphs $K_{s,t}$.
- All vertices in $K_{s,t}$ can be partitioned into two sets; each vertex in the first set, of size s, is linked to each vertex in the second set, of size t.
- $K_{s,t}$: s nodes on the left, each of which links to the same t other nodes on the right.
(Figure: $K_{3,4}$ with $|X| = s = 3$ and $|Y| = t = 4$.)

Trawling: Frequent Itemset Analysis
Market basket analysis:
- Market: universe U of n items.
- Baskets: subsets of U: $S_1, S_2, \dots, S_m \subseteq U$ ($S_i$ is the set of items one person bought).
- Support: frequency threshold f.
- Goal: find all subsets T such that $T \subseteq S_i$ for at least f sets $S_i$ (the items in T were bought together at least f times).
Frequent itemsets = complete bipartite graphs:
- View each node i as the set $S_i$ of nodes that i points to, e.g. $S_i = \{a, b, c, d\}$.
- $K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$.

Trawling
A complete bipartite subgraph (e.g. $K_{3,4}$) corresponds to a frequent itemset.
Suppose Y = {a,b,c} is a frequent itemset of support s. Then there are s nodes that link to all of {a,b,c}; in the figure these are x, y and z, each of which links to a, b and c.
We found $K_{s,t}$!
$K_{s,t}$ = a set Y of size t that occurs in s sets $S_i$.

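Trawling
Example graph with nodes a to f. Itemsets (the set of nodes each node points to):
- a = {b,c,d}
- b = {d}
- c = {b,d,e,f}
- d = {e,f}
- e = {b,d}
- f = {}
Frequent itemsets (of at least two items) with support > 1:
- {b,d}: support 3 (contained in the itemsets of a, c and e)
- {e,f}: support 2 (contained in the itemsets of c and d)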
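The example can be reproduced with a brute-force support count. The sketch below enumerates every item pair inside each node's out-link set and keeps the pairs that occur in at least two sets. It is a naive illustration and not a scalable frequent-itemset algorithm such as A-Priori; the names `OUT_LINKS` and `frequent_itemsets` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Out-links of each node in the example: node -> set of nodes it points to.
OUT_LINKS = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}


def frequent_itemsets(baskets, size, min_support):
    """Map each itemset of the given size to the baskets that contain it,
    keeping only itemsets found in at least `min_support` baskets."""
    supporters = defaultdict(set)
    for node, items in baskets.items():
        for itemset in combinations(sorted(items), size):
            supporters[itemset].add(node)
    return {itemset: nodes for itemset, nodes in supporters.items()
            if len(nodes) >= min_support}


# Itemsets of size t = 2 with support s >= 2.
found = frequent_itemsets(OUT_LINKS, size=2, min_support=2)
for itemset, nodes in sorted(found.items()):
    # Each hit is a complete bipartite subgraph K_{s,t}:
    # left side = supporting nodes, right side = the itemset.
    print(sorted(nodes), "->", list(itemset), "support", len(nodes))
```

Each result is a complete bipartite subgraph: {a, c, e} on the left with {b, d} on the right forms a $K_{3,2}$, and {c, d} with {e, f} forms a $K_{2,2}$.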