
B490 Mining the Big Data. 2 Clustering. Qin Zhang

Motivations

Group together similar documents, webpages, images, people, proteins, products, etc. Clustering is one of the most important problems in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Examples: cluster articles from the web that cover the same story; Google image search; community discovery in social networks; many others. Given a cloud of data points, we want to understand their structure.

The plan

Clustering is an extremely broad and ill-defined topic. There are many, many variants; we will not try to cover all of them. Instead we will focus on three broad classes of algorithms:
- Hierarchical clustering
- Assignment-based clustering
- Spectral clustering

There is a saying: when data is easily cluster-able, most clustering algorithms work quickly and well; when it is not easily cluster-able, no algorithm will find good clusters.

Hierarchical Clustering

Hard clustering

Problem: Start with a set X ⊂ R^d and a metric d : R^d × R^d → R_+. A cluster C is a subset of X. A clustering is a partition ρ(X) = {C_1, C_2, ..., C_k}, where
1. each C_i ⊆ X,
2. C_i ∩ C_j = ∅ for each pair i ≠ j,
3. ∪_{i=1}^k C_i = X.

Goal:
1. (close intra) For each C ∈ ρ(X) and all x, x' ∈ C, d(x, x') is small.
2. (far inter) For each C_i, C_j ∈ ρ(X) (i ≠ j), for most x_i ∈ C_i and x_j ∈ C_j, d(x_i, x_j) is large.

Hierarchical clustering

Hierarchical (agglomerative) clustering:
1. Each x_i ∈ X forms its own cluster C_i.
2. While there exist two clusters that are close enough:
   (a) Find the closest two clusters C_i, C_j.
   (b) Merge C_i and C_j into a single cluster.

Two things to define:
1. a distance between clusters;
2. what "close enough" means.

Definition of distance between clusters

Distance between the centers of the clusters. What does "center" mean?
- the mean (average) point, called the centroid;
- the center-point (like a high-dimensional median);
- the center of the MEB (minimum enclosing ball);
- some (random) representative point.

Other choices:
- distance between the closest pair of points;
- distance between the furthest pair of points;
- average distance between all pairs of points in different clusters;
- radius of the minimum enclosing ball of the joined cluster.

Definition of "close enough"

- If information is known about the data, perhaps some absolute threshold τ can be given.
- If the diameter, the radius of the MEB, or the average distance from the center point is beneath some threshold. This fixes the scale of a cluster, which could be good or bad.
- If the density is beneath some threshold, where density is #points/volume or #points/radius^d.
- If the joint density decreases too much from the single-cluster densities. Variations of this are called the elbow technique.
- When the number of clusters is k (for some magical value k).

Example (on board): the distance is the distance between centroids; stop when there is one cluster left. A sketch of this variant is given below.
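As a concrete illustration of the variant in the example above, here is a minimal Python sketch of centroid-linkage agglomerative clustering. It keeps merging the two clusters with the closest centroids until k clusters remain; the function name and the stopping parameter k are my own choices, not from the slides.

```python
import numpy as np

def centroid_linkage_clustering(X, k=1):
    """Agglomerative clustering: repeatedly merge the two clusters
    whose centroids are closest, until only k clusters remain."""
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(X))]
    centroids = [X[i].astype(float) for i in range(len(X))]

    while len(clusters) > k:
        # Find the pair of clusters with the closest centroids.
        best, best_dist = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best_dist:
                    best_dist, best = d, (i, j)
        i, j = best
        # Merge cluster j into cluster i and recompute the centroid.
        clusters[i] += clusters[j]
        centroids[i] = X[clusters[i]].mean(axis=0)
        del clusters[j], centroids[j]
    return clusters

# Tiny usage example on two well-separated blobs.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(centroid_linkage_clustering(X, k=2))
```

Swapping in a different cluster distance from the list above only changes the inner distance computation; the merge loop stays the same.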

Assignment-based Clustering (k-center, k-means, k-median)

Assignment-based clustering

Definitions: For a set X and a distance d : X × X → R_+, the output is a set of centers C = {c_1, ..., c_k}, which implicitly defines a set of clusters: each point x is assigned to its closest center φ_C(x) = argmin_{c ∈ C} d(x, c).

- The k-center clustering problem: find the set C of k centers (often, but not always, a subset of X) to minimize max_{x ∈ X} d(φ_C(x), x).
- The k-means clustering problem: minimize Σ_{x ∈ X} d(φ_C(x), x)^2.
- The k-median clustering problem: minimize Σ_{x ∈ X} d(φ_C(x), x).

Gonzalez Algorithm

A simple greedy algorithm for k-center:
1. Choose c_1 ∈ X arbitrarily. Let S_1 = {c_1}.
2. For i = 2 to k:
   (a) Set c_i = argmax_{x ∈ X} d(x, φ_{S_{i-1}}(x)).
   (b) Let S_i = {c_1, ..., c_i}.

In the worst case, it gives a 2-approximation to the optimal solution. Analysis (on board).
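A minimal Python sketch of the greedy procedure above; the helper name is mine, and distances are Euclidean, which is only one possible choice of metric.

```python
import numpy as np

def gonzalez_k_center(X, k):
    """Greedy 2-approximation for k-center: repeatedly add the point
    farthest from the current set of centers."""
    centers = [0]                      # c_1: an arbitrary point (index 0)
    # dist[x] = distance from x to its closest chosen center
    dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(1, k):
        far = int(np.argmax(dist))     # farthest point becomes the next center
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(X - X[far], axis=1))
    return centers

X = np.random.rand(100, 2)
print(gonzalez_k_center(X, k=3))
```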

Parallel Guessing Algorithm

Parallel guessing algorithm for k-center. Run, in parallel, for guesses R = (1 + ε/2), (1 + ε/2)^2, ...:
1. Pick an arbitrary point as the first center, C = {c_1}.
2. For each point p ∈ P, compute r_p = min_{c ∈ C} d(p, c) and test whether r_p > R. If it is, set C = C ∪ {p}.
3. Abort when |C| > k.
Pick the minimum R that does not abort.

In the worst case, it gives a (2 + ε)-approximation to the optimal solution (on board). It can be implemented as a streaming algorithm: it performs a linear scan of the data, uses space Õ(k), and time Õ(nk).
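A sketch of a single guess of the procedure above plus the outer search over R, written sequentially in Python; the starting guess r0 and the way the R values are enumerated are my own simplifications, since the slides leave them unspecified.

```python
import numpy as np

def one_guess(X, k, R):
    """Single pass with a guessed radius R: open a new center whenever a
    point is farther than R from all current centers. Returns the centers,
    or None if more than k centers are needed (abort)."""
    centers = [X[0]]
    for p in X[1:]:
        if min(np.linalg.norm(p - c) for c in centers) > R:
            centers.append(p)
            if len(centers) > k:
                return None            # abort: R is too small
    return centers

def parallel_guessing_k_center(X, k, eps=0.5, r0=1e-3):
    """Try geometrically increasing guesses R = r0*(1+eps/2)^i and return
    the smallest R (and its centers) that does not abort."""
    R = r0
    while True:
        centers = one_guess(X, k, R)
        if centers is not None:
            return R, centers
        R *= (1 + eps / 2)

X = np.random.rand(200, 2)
R, centers = parallel_guessing_k_center(X, k=4)
print(R, len(centers))
```

In the streaming setting the guesses run in parallel over one scan of the data; the sequential loop here is only for illustration.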

Lloyd's Algorithm

The well-known algorithm for k-means:
1. Choose k points C ⊂ X (arbitrarily?).
2. Repeat:
   (a) For all x ∈ X, find φ_C(x) (the closest center c ∈ C to x).
   (b) For all i ∈ [k], let c_i = average{x ∈ X : φ_C(x) = c_i}.
3. Until the set C is unchanged.

Demo: http://www.math.le.ac.uk/people/ag153/homepage/kmeanskmedoids/Kmeans_Kmedoids.html

If the main loop runs for R rounds, then the running time is O(Rnk). But:
1. What is R? Does it converge in a finite number of steps?
2. How do we initialize C?
3. How accurate is this algorithm?
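A minimal numpy sketch of Lloyd's iterations, assuming Euclidean distance and a random initialization; both choices are discussed on the following slides, so treat them as placeholders.

```python
import numpy as np

def lloyds(X, k, max_rounds=100, seed=0):
    """Lloyd's algorithm for k-means: alternate assignment and averaging."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_rounds):
        # (a) assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # (b) move each center to the average of its assigned points
        newC = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                         else C[i] for i in range(k)])
        if np.allclose(newC, C):       # stop when the centers are unchanged
            break
        C = newC
    return C, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
C, assign = lloyds(X, k=2)
print(C)
```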

Value of R

It is finite: the cost Σ_{x ∈ X} d(x, φ_C(x))^2 is always decreasing, and there are only finitely many possible distinct sets of cluster centers. But it could be exponential in k and d (the dimension, when Euclidean distance is used): n^{O(kd)}, the number of Voronoi cells.

However, usually R = 10 is fine.

Smoothed analysis: if the data is perturbed randomly and slightly, then R = O(n^{35} k^{34} d^8). This is polynomial, but still ridiculous.

If all points are on a grid of length M, then R = O(d n^4 M^2). But that's still way too big.

Conclusion: there are crazy special cases that can take a long time, but usually it works.

Initialize C

- A random set of points. By the coupon collector argument, we need about O(k log k) random points to get one in each cluster, assuming each cluster contains an equal number of points. We can later reduce to k clusters by merging extra clusters.
- Randomly partition X = {X_1, X_2, ..., X_k} and take c_i = average(X_i). This biases towards the center of X.
- The Gonzalez algorithm (for k-center). This biases too much towards outlier points.

Accuracy

Lloyd's algorithm can be arbitrarily bad. (Give an example? Consider the 4 vertices of a rectangle whose width is much larger than its height.)

Theory algorithm: gets a (1 + ε)-approximation for k-means in 2^{(k/ε)^{O(1)}} · nd time (Kumar, Sabharwal, Sen 2004).

The following k-means++ seeding is an O(log k)-approximation (Ref: "k-means++: The Advantages of Careful Seeding").

k-means++
1. Choose c_1 ∈ X arbitrarily. Let S_1 = {c_1}.
2. For i = 2 to k:
   (a) Choose c_i = x from X with probability proportional to d(x, φ_{S_{i-1}}(x))^2.
   (b) Let S_i = {c_1, ..., c_i}.
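A small Python sketch of the k-means++ seeding rule above; the resulting seeds would then be handed to Lloyd's algorithm. The function name is mine.

```python
import numpy as np

def kmeans_pp_seeding(X, k, seed=0):
    """k-means++ seeding: each new center is sampled with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    d2 = np.linalg.norm(X - centers[0], axis=1) ** 2
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(len(X), p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.linalg.norm(X - X[idx], axis=1) ** 2)
    return np.array(centers)

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [8, 0], [0, 8])])
print(kmeans_pp_seeding(X, k=3))
```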

Other issues

The key step that makes Lloyd's algorithm so cool is that average{x ∈ X} = argmin_{c ∈ R^d} Σ_{x ∈ X} ||x - c||_2^2. But this only works for the distance function d(x, c) = ||x - c||_2, and it is affected by outliers more than k-median clustering.

Local Search Algorithm

Designed for k-median:
1. Choose k centers at random from X. Call this set C.
2. Repeat: for each u ∈ X and c ∈ C, compute Cost(C - {c} + {u}). Find the pair (u, c) for which this cost is smallest, and replace C with C - {c} + {u}.
3. Stop when Cost(C) does not change significantly.

Can be used to design a 5-approximation algorithm (Ref: "Local Search Heuristics for k-Median and Facility Location Problems").

Q: How can you speed up the swap operation by avoiding searching over all (u, c) pairs?
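A straightforward, deliberately unoptimized Python sketch of the single-swap local search above; it searches all (u, c) pairs each round, which is exactly the part the closing question asks you to speed up. The function names and tolerance are mine.

```python
import numpy as np

def kmedian_cost(X, centers):
    """Sum over all points of the distance to the closest center."""
    d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
    return d.min(axis=1).sum()

def local_search_kmedian(X, k, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    C = list(X[rng.choice(len(X), size=k, replace=False)])
    cost = kmedian_cost(X, C)
    while True:
        best_cost, best_swap = cost, None
        # Try every swap of one current center c for one data point u.
        for ci in range(k):
            for u in X:
                trial_cost = kmedian_cost(X, C[:ci] + [u] + C[ci + 1:])
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (ci, u)
        if best_swap is None or cost - best_cost < tol:
            return C                   # no significant improvement left
        ci, u = best_swap
        C[ci] = u
        cost = best_cost

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6])
print(local_search_kmedian(X, k=2))
```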

Spectral Clustering

Top-down clustering

Top-down clustering framework:
1. Find the best cut of the data into two pieces.
2. Recurse on both pieces until the data should not be split anymore.

What is the best way to partition the data? This is a big question. Let's first talk about graphs.

Graphs

A graph G = (V, E) is defined by a set of vertices V = {v_1, v_2, ..., v_n} and a set of edges E = {e_1, e_2, ..., e_m}, where each edge e_j is an unordered pair of vertices e_j = {v_i, v_i'} (or an ordered pair e_j = (v_i, v_i') in a directed graph).

Example: a graph on the eight vertices a, b, c, d, e, f, g, h (drawn on the slide), with adjacency matrix

A =
0 1 1 1 0 0 0 0
1 0 0 1 0 0 0 0
1 0 0 1 1 0 0 0
1 1 1 0 0 0 0 0
0 0 1 0 0 1 1 0
0 0 0 0 1 0 1 1
0 0 0 0 1 1 0 0
0 0 0 0 0 1 0 0

From point sets to graphs

- ε-neighborhood graph: connect all pairs of points whose pairwise distance is smaller than ε. An unweighted graph.
- k-nearest neighbor graph: connect v_i with v_j if v_j is among the k nearest neighbors of v_i. Naturally a directed graph; two ways to make it undirected: (1) ignore the directions; (2) mutual k-nearest neighbors.
- The fully connected graph: connect all pairs of points at finite distance, and weight each edge by d(v_i, v_j), the distance between v_i and v_j.
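A short sketch of the first two constructions, building unweighted adjacency matrices with numpy; the ε and k values in the usage line are arbitrary.

```python
import numpy as np

def eps_neighborhood_graph(X, eps):
    """Adjacency matrix: connect points whose distance is smaller than eps."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = (D < eps).astype(int)
    np.fill_diagonal(A, 0)             # no self-loops
    return A

def knn_graph(X, k, mutual=False):
    """k-nearest-neighbor graph, symmetrized either by ignoring directions
    or by keeping only mutual neighbors."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]  # indices of the k nearest neighbors
    A = np.zeros_like(D, dtype=int)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1            # directed edges i -> neighbor
    return (A & A.T) if mutual else (A | A.T)

X = np.random.rand(50, 2)
print(eps_neighborhood_graph(X, eps=0.2).sum(), knn_graph(X, k=3).sum())
```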

A good cut

Example: the 8-vertex graph on a, ..., h from before (drawn on the slide).

Rule of thumb:
1. Many edges within a cluster. The volume of a cluster S is Vol(S) = the number of edges with at least one vertex in S.
2. Few edges between clusters. The cut between two clusters S, T is Cut(S, T) = the number of edges with one vertex in S and the other in T.

Ratio cut: RatioCut(S, T) = Cut(S, T)/|S| + Cut(S, T)/|T|.
Normalized cut: NCut(S, T) = Cut(S, T)/Vol(S) + Cut(S, T)/Vol(T).

Both can be generalized to general k.
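A tiny helper, as a sketch, that evaluates these two scores for a candidate 2-way partition of the example graph, using the slide's definition of Vol(S); the function name and vertex indexing (a = 0, ..., h = 7) are mine.

```python
import numpy as np

# Adjacency matrix of the 8-vertex example graph (a, b, c, d, e, f, g, h).
A = np.array([[0,1,1,1,0,0,0,0],
              [1,0,0,1,0,0,0,0],
              [1,0,0,1,1,0,0,0],
              [1,1,1,0,0,0,0,0],
              [0,0,1,0,0,1,1,0],
              [0,0,0,0,1,0,1,1],
              [0,0,0,0,1,1,0,0],
              [0,0,0,0,0,1,0,0]])

def cut_scores(A, S):
    """RatioCut and NCut for the partition (S, T), with Vol(S) = number of
    edges having at least one vertex in S (as defined on the slide)."""
    n = len(A)
    T = [v for v in range(n) if v not in S]
    cut = A[np.ix_(S, T)].sum()                  # edges with one endpoint on each side
    def vol(U):
        W = [v for v in range(n) if v not in U]
        return A[np.ix_(U, U)].sum() / 2 + A[np.ix_(U, W)].sum()
    return cut / len(S) + cut / len(T), cut / vol(S) + cut / vol(T)

print(cut_scores(A, S=[0, 1, 2, 3]))   # the cut {a,b,c,d} | {e,f,g,h}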

Laplacian matrix and eigenvectors

1. The adjacency matrix of the graph, A.
2. The degree (diagonal) matrix, D.
3. The Laplacian matrix L = D - A.
4. An eigenvector of a matrix M is a vector v such that Mv = λv, where the scalar λ is the corresponding eigenvalue.

For the example graph on vertices a, ..., h, A is the adjacency matrix shown earlier and the degree matrix is D = diag(3, 2, 3, 3, 3, 3, 2, 1).

Laplacian matrix and eigenvectors (cont.)

L = D - A =
 3 -1 -1 -1  0  0  0  0
-1  2  0 -1  0  0  0  0
-1  0  3 -1 -1  0  0  0
-1 -1 -1  3  0  0  0  0
 0  0 -1  0  3 -1 -1  0
 0  0  0  0 -1  3 -1 -1
 0  0  0  0 -1 -1  2  0
 0  0  0  0  0 -1  0  1

Eigenvalues and (normalized) eigenvectors of the Laplacian matrix: λ = 0, 0.278, 1.11, 2.31, 3.46, 4, 4.82, ... The eigenvector for λ = 0 is the constant vector (1/√8)(1, 1, ..., 1); the full eigenvector matrix V is tabulated on the slide. (Graph embedding on board.)
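The numbers above can be checked with a few lines of numpy. This sketch builds L = D - A for the example graph and prints its eigenvalues and the eigenvector of the second-smallest eigenvalue (the Fiedler vector), whose signs already separate {a, b, c, d} from {e, f, g, h}.

```python
import numpy as np

A = np.array([[0,1,1,1,0,0,0,0],
              [1,0,0,1,0,0,0,0],
              [1,0,0,1,1,0,0,0],
              [1,1,1,0,0,0,0,0],
              [0,0,1,0,0,1,1,0],
              [0,0,0,0,1,0,1,1],
              [0,0,0,0,1,1,0,0],
              [0,0,0,0,0,1,0,0]])
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # (unnormalized) graph Laplacian

# L is symmetric, so eigh returns eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(L)
print(np.round(vals, 3))            # smallest eigenvalue is 0
print(np.round(vecs[:, 1], 2))      # Fiedler vector: its signs give the 2-way split
```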

Spectral clustering algorithm

Spectral Clustering
1. Compute the Laplacian L.
2. Compute the first k eigenvectors u_1, ..., u_k of L.
3. Let U ∈ R^{n×k} be the matrix containing the vectors u_1, ..., u_k as columns.
4. For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row of U.
5. Cluster the points {y_1, ..., y_n} in R^k with the k-means algorithm into clusters C_1, ..., C_k.
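Putting the pieces together, a compact sketch of the five steps above. It uses numpy's eigendecomposition and scipy's kmeans2 for the final k-means step; the library choice is mine, not something the slides prescribe.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k):
    """Unnormalized spectral clustering on a 0/1 adjacency matrix A."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # step 1: Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvectors, ascending eigenvalues
    U = vecs[:, :k]                             # steps 2-3: first k eigenvectors as columns
    Y = U                                       # step 4: row i of U is the embedding y_i
    _, labels = kmeans2(Y, k, minit='points')   # step 5: k-means on the embedded points
    return labels

# Example: the 8-vertex graph from the previous slides should split into
# {a, b, c, d} and {e, f, g, h}.
A = np.array([[0,1,1,1,0,0,0,0],
              [1,0,0,1,0,0,0,0],
              [1,0,0,1,1,0,0,0],
              [1,1,1,0,0,0,0,0],
              [0,0,1,0,0,1,1,0],
              [0,0,0,0,1,0,1,1],
              [0,0,0,0,1,1,0,0],
              [0,0,0,0,0,1,0,0]])
print(spectral_clustering(A, k=2))
```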

The connections

When k = 2, we want to compute min_{S ⊂ V} RatioCut(S, T), where T = V \ S.

We can prove that this is equivalent to

min_{S ⊂ V} f^T L f   subject to   f ⊥ 1 and ||f|| = 1,

where f is determined by S coordinate-wise:

f_i = (1/√n)·√(|T|/|S|)   if v_i ∈ S,
f_i = -(1/√n)·√(|S|/|T|)  if v_i ∈ T = V \ S.

This is NP-hard, so we solve the following relaxed optimization problem instead:

min_{f ∈ R^n} f^T L f   subject to   f ⊥ 1 and ||f|| = 1.

The best f is the eigenvector corresponding to the second smallest eigenvalue of L.

Clustering Social Networks

Social networks and communities

- Telephone networks. Communities: groups of people that communicate frequently, e.g., groups of friends, members of a club, etc.
- Email networks.
- Collaboration networks: e.g., two people who publish a paper together are connected. Communities: authors working on a particular topic (Wikipedia articles), people working on the same project at Google.

Goal: find the communities. Similar to normal clustering, but we sometimes allow overlaps (not discussed here).

Previous clustering algorithms

Example graph on vertices a, b, c, d, e, f, g (drawn on the slide).

Difficulty: it is hard to define the distance. E.g., if we assign 1 to (x, y) when x, y are friends and 0 otherwise, then the triangle inequality is not satisfied.

- Hierarchical clustering: may combine c and e first.
- Assignment-based clustering, e.g., k-means: if the two random seeds are chosen to be b and e, then c may be assigned to e.
- Spectral clustering: may work. Try it.

Betweenness

A new distance measure for social networks: betweenness. Betweenness(A, B) = the number of shortest paths that use the edge (A, B).

Intuition: if you need to take this edge to get between two communities, its betweenness score will be high. An edge with high betweenness may be a facilitator edge rather than a within-community edge.

Girvan-Newman Algorithm

1. For each v ∈ V:
   (a) Run BFS from v to build a directed acyclic graph (DAG) over the entire graph. Give 1 credit to each node except the root.
   (b) Walk back up the DAG and add credit to each edge. When paths merge, the credits add up; when paths split, the credit splits as well.
2. The final score of each edge is the sum of its credits over all n runs, divided by 2.
3. Remove the edges with high betweenness; the remaining connected components are the communities.

Example (on board). What's the running time? Can we speed it up?
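A short sketch of the removal loop using networkx, which ships an edge-betweenness routine so we do not have to reimplement the BFS credit computation of steps 1-2; using the library here is my shortcut, not part of the slides.

```python
import networkx as nx

def girvan_newman_communities(G, target_k=2):
    """Repeatedly remove the highest-betweenness edge until the graph
    falls apart into target_k connected components."""
    G = G.copy()
    while nx.number_connected_components(G) < target_k:
        bet = nx.edge_betweenness_centrality(G)       # steps 1-2 of the slide
        worst = max(bet, key=bet.get)                 # edge with the highest score
        G.remove_edge(*worst)                         # step 3
    return list(nx.connected_components(G))

# Two triangles joined by a single bridge edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)])
print(girvan_newman_communities(G, target_k=2))
```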

What do real graphs look like?

Power-law distributions

A nonnegative random variable X is said to have a power-law distribution if Pr[X = x] ≈ c·x^{-α} for constants c > 0 and α > 0.

- Zipf's law states that the frequency of the j-th most common word in English (or other common languages) is proportional to j^{-1}.
- A city grows in proportion to its current size, as a result of people having children.
- Price studied the network of citations between scientific papers and found that the in-degrees (the number of times a paper has been cited) have power-law distributions.
- The WWW, ...

The rich get richer; heavy tails.

Power-law graphs (figures on the slide)

Preferential attachment

Barabasi-Albert model: a new node is added to the network at each time t ∈ N.
1. With probability p ∈ [0, 1], the new node connects to m existing nodes chosen uniformly at random.
2. With probability 1 - p, the new node connects to m existing nodes, each chosen with probability proportional to its degree.

After some math, we get m(k) ∝ k^{-β}, where m(k) is the number of nodes whose degree is k, and β = (3 - p)/(1 - p). (On board, for p = 0.)
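A small simulation sketch of the p = 0 case (pure preferential attachment). The degree-list trick, where each node appears once per incident edge, gives degree-proportional sampling without explicit weights; the seed clique used to start the process is my own choice.

```python
import random
from collections import Counter

def barabasi_albert(n, m=2, seed=0):
    """Pure preferential attachment (p = 0): each new node attaches m edges,
    choosing endpoints with probability proportional to current degree."""
    random.seed(seed)
    # Start from a small clique so early nodes have nonzero degree.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    targets = [v for e in edges for v in e]       # node v appears deg(v) times
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(random.choice(targets))    # degree-proportional choice
        for v in chosen:
            edges.append((new, v))
            targets += [new, v]
    return edges

edges = barabasi_albert(5000, m=2)
deg = Counter(v for e in edges for v in e)
hist = Counter(deg.values())
print(sorted(hist.items())[:10])   # heavy-tailed: many low-degree nodes, few hubs
```

networkx also provides barabasi_albert_graph(n, m) if you only need the resulting graph rather than the simulation itself.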

Thank you!

Some slides are based on the MMDS book (http://infolab.stanford.edu/~ullman/mmds.html) and Jeff Phillips' lecture notes (http://www.cs.utah.edu/~jeffp/teaching/cs5955.html).