SCAN: A Structural Clustering Algorithm for Networks



Similar documents
SCAN: A Structural Clustering Algorithm for Networks

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Social Media Mining. Graph Essentials

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

How To Cluster Of Complex Systems

Graph Mining and Social Network Analysis

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Part 2: Community Detection

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

5.1 Bipartite Matching

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

Chapter ML:XI (continued)

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

DATA ANALYSIS II. Matrix Algorithms

Protein Protein Interaction Networks

Network (Tree) Topology Inference Based on Prüfer Sequence

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Small Maximal Independent Sets and Faster Exact Graph Coloring

Practical Graph Mining with R. 5. Link Analysis

A Brief Survey on Anonymization Techniques for Privacy Preserving Publishing of Social Network Data

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

A Fast Algorithm For Finding Hamilton Cycles

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

5. Binary objects labeling

Simplified External memory Algorithms for Planar DAGs. July 2004

Constrained Tetrahedral Mesh Generation of Human Organs on Segmented Volume *

CIS 700: algorithms for Big Data

Big Data Graph Algorithms

Analysis of Algorithms, I

TU e. Advanced Algorithms: experimentation project. The problem: load balancing with bounded look-ahead. Input: integer m 2: number of machines

ALBERTA. Social Network Analysis for the Assessment of Learning UNIVERSITY OF. Osmar R. Zaïane Professor & Scientific Director of AICML

Cluster Analysis: Advanced Concepts

MapReduce Approach to Collective Classification for Networks

Complex Networks Analysis: Clustering Methods

Cluster Analysis: Basic Concepts and Algorithms

A comparison of various clustering methods and algorithms in data mining

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

Discrete Mathematics & Mathematical Reasoning Chapter 10: Graphs

Using Library Dependencies for Clustering

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

Distributed Computing over Communication Networks: Maximal Independent Set

Clustering UE 141 Spring 2013

V. Adamchik 1. Graph Theory. Victor Adamchik. Fall of 2005

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Exponential time algorithms for graph coloring

Introduction to Graph Mining

Social Media Mining. Data Mining Essentials

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

SPANNING CACTI FOR STRUCTURALLY CONTROLLABLE NETWORKS NGO THI TU ANH NATIONAL UNIVERSITY OF SINGAPORE

Big graphs for big data: parallel matching and clustering on billion-vertex graphs

An Empirical Study of Two MIS Algorithms

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Mining Social Network Graphs

Tree-representation of set families and applications to combinatorial decompositions

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Max Flow, Min Cut, and Matchings (Solution)

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Unsupervised Data Mining (Clustering)

ROCK: A Robust Clustering Algorithm for Categorical Attributes

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

Forschungskolleg Data Analytics Methods and Techniques

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Course on Social Network Analysis Graphs and Networks

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

High-dimensional labeled data analysis with Gabriel graphs

On Integer Additive Set-Indexers of Graphs

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Performance Metrics for Graph Mining Tasks

Data Structures in Java. Session 15 Instructor: Bert Huang

Mining Social-Network Graphs

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Character Image Patterns as Big Data

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

Computational Game Theory and Clustering

Static Load Balancing

How To Understand The Network Of A Network

A FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHS

USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS

Crawling and Detecting Community Structure in Online Social Networks using Local Information

Linköpings Universitet - ITN TNM DBSCAN. A Density-Based Spatial Clustering of Application with Noise

Home Page. Data Structures. Title Page. Page 1 of 24. Go Back. Full Screen. Close. Quit

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Tools and Techniques for Social Network Analysis

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Lecture 17 : Equivalence and Order Relations DRAFT

Clustering Via Decision Tree Construction

Network Algorithms for Homeland Security

Transcription:

SCAN: A Structural Clustering Algorithm for Networks Xiaowei Xu, Nurcan Yuruk, Zhidan Feng (University of Arkansas at Little Rock) Thomas A. J. Schweiger (Acxiom Corporation)

Networks scaling: #edges connected to respective vertex

Network basics Network graph G = {V, E} V is set of vertices, E is set of edges connecting the vertices Hub Hubs: vertices that bridge (many) clusters Outliers: vertices marginally connected to clusters featuring a weak association Outlier network clustering/graph partitioning: division of graph into set of disjoint sub-graphs in order to find hidden structures existing approaches aim at finding clusters based on the number of edges between vertices or clusters

SCAN A Structural Clustering Algorithm for Networks based on definitions of DBSCAN detects clusters, hubs and outliers using structure and connectivity of the vertices as a clustering criterion vertices sharing a certain quantity of neighbors should be grouped into one cluster, hubs and outliers should be isolated runtime complexity wrt. n vertices and m edges in a network: O(m) focusses on simple, undirected and unweighted graphs

Definitions for structure-connected clusters (I) VERTEX STRUCTURE, the structure of v is defined by its neighborhood STRUCTURAL SIMILARITY two objects sharing a similar structure in terms of their neighborbood, will have a large structural similarity

Definitions for structure-connected clusters (II) ε NEIGHBORHOOD apply threshold to structural similarity in order to assign cluster memberships ε = 0.65 CORE VERTEX core vertex shares structural similiarity of at least ε with at least μ neighbors

Definitions for structure-connected clusters (III) DIRECT STRUCTURE REACHABILITY if vertex is in ε Neighborhood of a core vertex, they should be in the same cluster w v ε = 0.65, μ = 2 STRUCTURE REACHABILITY a vertex is structure-reachable from, if there is a chain of vertices such that is directly structure-reachable from w ε = 0.65, μ = 2 v

Definitions for structure-connected clusters (IV) STRUCTURE CONNECTIVITY two non-core vertices belonging to the same cluster may not be structure-reachable, as the core-condition does not hold for them. However, they still belong to the same cluster, if they are structure-reachable from the same core vertex. This is called structure-connectivity. w u v ε = 0.7, μ = 3

Definitions for structure-connected clusters (V) STRUCTURE-CONNECTED CLUSTER a non-empty subset is called a structure-connected cluster, if all vertices in C are structure-connected and C is maximal wrt. structure reachability 1. Connectivity: 2. Maximality: CLUSTERING a clustering P consists of all structure-connected clusters

Definitions for structure-connected clusters (VI) If a vertex is not a member of any cluster, it is either a hub or an outlier, depending on its neighborhood. HUB if an isolated vertex a hub has neighbors belonging to two or more different clusters, it is 1. v is not a member of any cluster: 2. v bridges different clusters: OUTLIER an isolated vertex is an outlier, if and only if all its neighbors either belong to only one cluster or do not belong to any cluster 1. v is not a member of any cluster: 2. v does not bridge different clusters:

Algorithm SCAN for each vertex v not yet classified, it is checked whether vertex is a core if so, a new cluster is expanded from this vertex: 1. the algorithm generates a new cluster id, looks for all unclassified vertices in the neighborhood of the core and inserts them into a queue vertices which are directly structure-reachable from core vertex v 2. it traverses the queue until it is empty and establishes all vertices which are directly structure-reachable from the respective vertex w vertices which are structure-reachable from core vertex v (via vertex w) 3. the same cluster id is assigned to all those vertices and if they are not labeled as a non-member yet, they are also inserted into the queue if not, it is labeled as a non-member non-member vertices which have edges to two or more clusters, are classified as hubs. Otherwise, they are classified as outliers. ε-value between 0.5 and 0.8, μ = 2

Experimental Evaluation (I) Comparison to FastModularity algorithm: hierarchical network clustering algorithm optimizing the modularity modularity measures whether division of a network into communities is a good one in terms of many edges within communities and preferably little edges between communities 1. initally, each vertex is the only member of a community 2. iteratively, the algorithm greedily merges the two communities causing the largest increase of modularity, until all vertices are members of the same community

Experimental Evaluation (II) Customer Segmentation dataset

Experimental Evaluation (III) FastModularity clustering result SCAN clustering result

Experimental Evaluation (IV) Books about US politics dataset

Experimental Evaluation (V) FastModularity clustering result SCAN clustering result

Discussion + identifies not only clusters, but also outliers and hubs + linear run-time complexity wrt. # edges, each vertex is visited only once - performance highly depends on sensitive input parameters - ignores domain knowledge in clustering attributes - assumes that network is homogeneous and adjacency matrix is already defined

Q & A