Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

Content
- What is Community Detection?
- Motivation
- Defining a community
- Methods to find communities
- Overlapping communities
- Clique percolation method
- Finding a community with query nodes
- Conclusion

What is Community Detection?
- Different from traditional clustering
- Algorithms exploit the structure of the graph
- Graphs with a natural origin have a structure that is not random
- We try to find these structures by analyzing the graph
- A perfect solution has yet to be found

Motivation
- Communities can represent parts of a larger system (like organs in the human body)
- Communities can be considered a summary of the graph
- Communities make it easy to visualize and understand complex systems
- Communities on the web might represent pages on related topics
- Communities can reveal properties of the data without exposing private information about individuals

Defining a Community
- There is no exact definition of a community in a graph
- It depends on the application
- A general definition:
  - Separation between nodes in different communities
  - Cohesion between nodes in a community
- The differences between algorithms come down to the precise definition

Basics
- For a graph G = (V, E) and a subgraph C ⊆ G with |G| = |V| = n and |C| = n_c:
  - φ_int(C), measured on the edges inside C, should be higher than the corresponding value for the whole graph
  - φ_ext(C), measured on the edges leaving C, should be much lower
- Local definitions see a community as an autonomous entity within a larger system
- Global definitions see the communities as essential parts of a larger system
- Vertex similarity: compare individual nodes and group them based on a similarity measure
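
The slide does not spell out φ_int and φ_ext. One common way to make them concrete is the intra-cluster and inter-cluster density used in Fortunato's survey: internal edges divided by n_c(n_c-1)/2, and outgoing edges divided by n_c(n-n_c). The sketch below assumes exactly these two formulas and an undirected graph stored as a dict of neighbor sets; the function name `densities` is hypothetical.

```python
def densities(adj, community):
    """Return (intra-cluster density, inter-cluster density) of `community`.

    Assumption: adj maps every node to the set of its neighbors (undirected),
    and the two densities follow the definitions in Fortunato's survey.
    """
    c = set(community)
    n, n_c = len(adj), len(c)
    if n_c < 2 or n_c == n:
        raise ValueError("community needs at least 2 nodes and must not be the whole graph")
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(1 for u in c for v in adj[u] if v in c) // 2
    # Edges with exactly one endpoint inside the community.
    external = sum(1 for u in c for v in adj[u] if v not in c)
    d_int = internal / (n_c * (n_c - 1) / 2)
    d_ext = external / (n_c * (n - n_c))
    return d_int, d_ext


# Two triangles joined by the single edge 3-4.
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
d_int, d_ext = densities(adj, {1, 2, 3})
```

For the triangle {1, 2, 3} the internal density is 1 and the external density is 1/9, while the whole graph has density 7/15, so the triangle fits the general definition of a community.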

Methods
- Finding overlapping communities: clique percolation method (CPM)
- Finding communities with query nodes

Clique Percolation Method
- CPM is based on the idea that communities are likely to consist of cliques
- Assumption: every node in a community is connected to nearly every other node in it
- A community is built up from a chain of adjacent k-cliques
- Two k-cliques are adjacent if they share k-1 nodes
- The largest possible chain is defined as a community
- This is a local definition
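
The definition above can be turned into code almost literally. The following is a minimal sketch for small graphs only: it enumerates every k-clique by brute force and merges adjacent ones with a union-find structure. It is not how CFinder or any production tool works; the names `k_cliques` and `cpm_communities` are hypothetical.

```python
from itertools import combinations


def k_cliques(adj, k):
    """Brute-force list of all k-cliques (only feasible for small graphs)."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def cpm_communities(adj, k):
    """Union of all k-cliques reachable through chains of adjacent k-cliques."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        # Two distinct k-cliques are adjacent if they share k-1 nodes.
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)


# Triangles {1,2,3} and {2,3,4} share two nodes; {4,5,6} shares only node 4.
adj = {
    1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4},
    4: {2, 3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
communities = cpm_communities(adj, 3)
```

With k = 3 the example yields two communities, {1, 2, 3, 4} and {4, 5, 6}. Node 4 belongs to both, which is the overlap that CPM is designed to find.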

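Implementation of CPM
- The number of possible k-cliques in a graph is quite high
- Implementations therefore search for maximal cliques instead (an NP-hard problem)
- We build a clique-clique overlap matrix O, whose entries count the nodes two cliques share
- All off-diagonal entries smaller than k-1 (and diagonal entries smaller than k) are removed; the connected components of what remains are the communities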
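
The overlap-matrix step might look like the sketch below. It assumes the variant described by Palla et al.: maximal cliques are enumerated first (here with a plain Bron-Kerbosch recursion without pivoting), cliques smaller than k are dropped, and two cliques are linked when they share at least k-1 nodes. All function names are hypothetical, and the code is meant for small examples rather than large graphs.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def overlap_matrix(cliques):
    """O[i][j] = number of shared nodes; the diagonal holds the clique sizes."""
    return [[len(a & b) for b in cliques] for a in cliques]


def cpm_from_overlap(adj, k):
    # Dropping cliques smaller than k corresponds to removing diagonal entries < k.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    o = overlap_matrix(cliques)
    seen, communities = set(), []
    for start in range(len(cliques)):
        if start in seen:
            continue
        seen.add(start)
        stack, nodes = [start], set()
        while stack:
            i = stack.pop()
            nodes |= cliques[i]
            for j in range(len(cliques)):
                # Keep only overlaps of at least k-1 nodes.
                if j not in seen and o[i][j] >= k - 1:
                    seen.add(j)
                    stack.append(j)
        communities.append(nodes)
    return sorted(communities, key=sorted)


adj = {
    1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4},
    4: {2, 3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
communities = cpm_from_overlap(adj, 3)
```

Working on maximal cliques keeps the matrix small: a maximal clique of size s stands in for all of the k-cliques it contains.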

[Figure: results of processing the example graph with the CFinder software, for k = 3 and k = 4]

Drawbacks
- Although the underlying problem is NP-hard, the algorithm is reasonably fast on large sparse graphs
- Some cases lead to useless results:
  - It looks for cliques, not dense subgraphs
  - It requires a large number of cliques, but not too many

Finding a community with query nodes
- The goal is to find a subgraph H that contains a given set Q of query nodes and is densely connected
- Density is measured by a function f, which is maximized over all possible choices for H
- In this case we choose the minimum degree for f
- Additionally we add a distance constraint d

Without size restriction - Greedy algorithm
- Choose f(H) = minimum degree of a node in H
- Set G_0 = G, then repeat:
  - Obtain G_{t+1} from G_t by removing a node that violates the distance constraint or has minimum degree
  - Terminate if one of the query nodes has minimum degree or the query nodes are no longer connected
- Among all G_t, choose the component containing the query nodes for which the minimum degree f(H) is maximized
- This can be implemented in O(n+m)
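
A minimal sketch of the greedy peeling loop, under several simplifying assumptions: the distance constraint is left out, the graph is restricted to the component containing the query nodes after every removal, and degrees are recomputed naively, so this version does not reach the O(n+m) bound mentioned above. The names `component_of` and `greedy_community` are hypothetical.

```python
from collections import deque


def component_of(adj, start):
    """Nodes reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_community(adj, query):
    """Return (nodes, min_degree) of the best subgraph seen while peeling."""
    query = set(query)
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    best_nodes, best_min_deg = None, -1
    while True:
        comp = component_of(adj, next(iter(query)))
        if not query <= comp:
            break  # query nodes are no longer connected
        adj = {u: adj[u] & comp for u in comp}
        min_deg = min(len(vs) for vs in adj.values())
        if min_deg > best_min_deg:
            best_nodes, best_min_deg = set(adj), min_deg
        if any(len(adj[q]) == min_deg for q in query):
            break  # a query node has minimum degree
        victim = min(adj, key=lambda u: len(adj[u]))
        for v in adj.pop(victim):
            adj[v].discard(victim)
    return best_nodes, best_min_deg


# A 4-clique {1,2,3,4} with a tail 4-5-6 attached.
adj = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4},
    4: {1, 2, 3, 5}, 5: {4, 6}, 6: {5},
}
nodes, min_deg = greedy_community(adj, {1, 2})
```

On this example the loop peels off nodes 6 and 5 and then stops, because the query nodes now have minimum degree. The result is the 4-clique with minimum degree 3.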

[Figure: the greedy algorithm, without size constraint, applied to the example graph with Q = {1, 2, 3}]

Communities with size restriction
- A size constraint k makes the problem NP-hard (this can be shown via a reduction from the Steiner tree problem)
- But it can be assumed that the size of the result set is correlated with the distance constraint
- The paper proposes two heuristics:
  - GreedyDist repeatedly executes Greedy and decreases d until the size of the resulting graph is small enough (at most k)
  - GreedyFast restricts the graph to the k nodes closest to the query nodes, then invokes Greedy
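
The preprocessing step of GreedyFast could be sketched as follows. The slide does not say how closeness to the query nodes is measured, so this sketch assumes the sum of squared hop distances to all query nodes. It also always keeps the query nodes themselves. The names `bfs_distances` and `k_closest_subgraph` are hypothetical; the Greedy call that would follow is omitted.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_closest_subgraph(adj, query, k):
    """Induced subgraph on the k nodes closest to the query nodes.

    Assumption: closeness = sum of squared hop distances to all query nodes.
    Nodes that cannot reach every query node are ignored.
    """
    per_query = [bfs_distances(adj, q) for q in query]
    score = {u: sum(d[u] ** 2 for d in per_query)
             for u in adj if all(u in d for d in per_query)}
    keep = set(sorted(score, key=lambda u: (score[u], u))[:k])
    keep |= set(query)  # query nodes always stay in the graph
    return {u: set(adj[u]) & keep for u in keep}


# Path graph 1-2-3-4-5 with a single query node.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
sub = k_closest_subgraph(adj, [1], 3)
```

Shrinking the graph first means Greedy only has to peel a graph of about k nodes, which is where the speed-up of GreedyFast comes from.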

Evaluation with the DBLP dataset
- The goal was to find a network of scientific collaboration around Christos Papadimitriou

Conclusion
- A very broad topic with many applications
- Each algorithm is built with different problems in mind
- Algorithms are difficult to compare because there is no standard way of testing them

Bibliography
[1] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5:17–61, 1960.
[2] S. Fortunato. Community detection in graphs. Physics Reports, 486(3–5):75–174, 2010.
[3] P. F. Jonsson and P. A. Bates. Global topological features of cancer proteins in the human interactome. Bioinformatics, 2291–2297, 2006.
[4] T. H. J. S. J.-P. O. K. Kaski. Spectral and network methods in the analysis of correlation matrices of stock returns. Physica A 383, 147–151, 2007.
[5] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. Sequential algorithm for fast clique percolation. Phys. Rev. E, 78:026109, Aug 2008.
[6] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–818, June 2005.
[7] M. E. Porter, K. Schwab, M. E. Porter, K. Schwab, F. Paua, E. T. Herrera, and M. Porter. Communities in networks. Notices of the American Mathematical Society, 1164–1166, 2009.
[8] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '10, 939–948, New York, NY, USA, 2010. ACM.
[9] K.-F. W. Wei Gao. Information Retrieval Technology. Springer Berlin Heidelberg, 2008.