How To Cluster Of Complex Systems



Similar documents
Protein Protein Interaction Networks

SCAN: A Structural Clustering Algorithm for Networks

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Crawling and Detecting Community Structure in Online Social Networks using Local Information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

A scalable multilevel algorithm for graph clustering and community structure detection

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Part 2: Community Detection

Practical Graph Mining with R. 5. Link Analysis

Graph Mining and Social Network Analysis

GEOENGINE MSc in Geomatics Engineering (Master Thesis) Anamelechi, Falasy Ebere

How To Cluster

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Parallel Algorithms for Small-world Network. David A. Bader and Kamesh Madduri

Link Prediction in Social Networks

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Unsupervised Data Mining (Clustering)

Social Media Mining. Data Mining Essentials

MODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Distributed Computing over Communication Networks: Maximal Independent Set

Graph Database Proof of Concept Report

Big Data Graph Algorithms

Network Algorithms for Homeland Security

DATA ANALYSIS II. Matrix Algorithms

CD-HIT User s Guide. Last updated: April 5,

InfiniteGraph: The Distributed Graph Database

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Graph Processing and Social Networks

Hadoop SNS. renren.com. Saturday, December 3, 11

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Forschungskolleg Data Analytics Methods and Techniques

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Validation and automatic repair of two- and three-dimensional GIS datasets

Data Mining - Evaluation of Classifiers

W6.B.1. FAQs CS535 BIG DATA W6.B If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

Distance Degree Sequences for Network Analysis

Software Testing Strategies and Techniques

Conclusions and Future Directions

Complex Networks Analysis: Clustering Methods

Evaluating partitioning of big graphs

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Chapter 13: Query Processing. Basic Steps in Query Processing

Presto/Blockus: Towards Scalable R Data Analysis

Performance Metrics for Graph Mining Tasks

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Statistical machine learning, high dimension and big data

Network Analysis. BCH 5101: Analysis of -Omics Data 1/34

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Java Modules for Time Series Analysis

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

How To Find Influence Between Two Concepts In A Network

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

A Process for ATLAS Software Development

Logistic Regression for Spam Filtering

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Affinity Prediction in Online Social Networks

Automatic parameter regulation for a tracking system with an auto-critical function

A Property & Casualty Insurance Predictive Modeling Process in SAS

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

Graph Theory and Networks in Biology

CLUSTERING FOR FORENSIC ANALYSIS

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

arxiv:cs.dm/ v1 30 Mar 2002

Mathematics for Algorithm and System Analysis

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Gamma Distribution Fitting

Component Ordering in Independent Component Analysis Based on Data Power

Bioinformatics: Network Analysis

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

A Business Process Driven Approach for Generating Software Modules

How To Cluster On A Search Engine

How to Design the Shortest Path Between Two Keyword Algorithms

How To Make A Credit Risk Model For A Bank Account

Tools and Techniques for Social Network Analysis

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.

How To Develop Software

On Efficiently Capturing Scien3fic Proper3es in Distributed Big Data without Moving the Data:

Automatic Labeling of Lane Markings for Autonomous Vehicles

Data Mining Practical Machine Learning Tools and Techniques

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Map/Reduce Affinity Propagation Clustering Algorithm

An Empirical Study of Two MIS Algorithms

Machine Learning Final Project Spam Filtering

Introduction to Logistic Regression

IBM SPSS Modeler Social Network Analysis 15 User Guide

Transcription:

Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University

Complex Systems Definition Dynamically evolving networks that have a large number of objects with complex connectivity Examples Social networks Biological i l networks Telecommunication networks WWW

Clustering of Complex Systems Purpose Issues Predicting groups of objects that closely communicate with each other Groups having the same subjects (called communities) Groups having the same functions (called modules) Large scale Scalability / Efficiency Complex connectivity Robustness / Accuracy Approaches Density based methods Searching densely connected sub graphs Partition based methods Detecting the best partition Hierarchical methods Merging sub graphs iteratively, or dividing sub graphs recursively

New Definition: Graph Entropy Assumptions Entropy in a graph represents structural stability Isolated complete sub graphs as clusters have the most stable status (i.e., lowest entropy) Random reconnections decrease stability (i.e., increase entropy) General Notations Inner links of v given G (V,E ): edges from v to the vertices in V p i (v): probability of v having inner links Outer links of v given G (V,E ): edges from v to the vertices not in V p o (v): probability of v having outer links Definitions Vertex Entropy: Graph Entropy: e(v) = p i (v) log 2 p i (v) p o (v) log 2 p o (v) e(g(v,e)) = v V e(v)

Examples of Graph Entropy Examples e(a) = e(b) = e(c) = 0 e(d) = 0.81 e(f) = 1.00 e(g) = e(h) = e(k) = 0 e(g) = 1.81 e(a) = e(b) = e(c) = e(d) = 0 e(f) = 1.00 e(g) = 0.92 e(h) = e(k) = 0 e(g) = 1.92 {a,b,c,d} giveshigher structural stability to G than {a,b,c,d,f}

Entropy Based Graph Clustering Algorithm (1) Select a random seed vertex, and form an initial cluster with the seed and its neighbors (2) Remove a vertex on the inner boundary of the cluster if it decreases graph entropy (3) Repeat step (2) until graph entropy is minimal (4) Add a vertex on the outer boundary of the cluster if it decreases graph entropy (5) Repeat step (4) until graph entropy is minimal (6) Output the cluster (7) Repeat steps (1) to (6) until there is no more vertex left Features Local optimization No parameters Randomness Low density clusters Overlapping clusters

Application to Biological Networks Dataset Yeast genome wide protein protein interaction data from DIP 4,928 proteins (vertices) and 17,186 interactions (edges) Clustering Accuracy Evaluation Method Comparison with known protein complex data f measure score: mean of recall and precision, Results

Revisions of Entropy Based Clustering Priority on Seed Selection in Step (1) Selecting a seed vertex in the descending order of degree Selecting a seed vertex in the descending order of clustering coefficient Priority on Seed Growth in Step (4) Adding a vertex on the outer boundary in the ascending order of vertex entropy Increase of Cluster Overlapping Rates Allowing random selection of a seed vertex from previous clusters Post Process Filtering out the clusters with graph entropy higher than a threshold

Accuracy Improvement Results before post processprocess after post processprocess # clusters avg. size avg. f score # clusters avg. size avg. f score original algorithm 501 3.82 0.414 474 3.50 0.420 degree based seed selection 442 4.02 0.416 413 3.60 0.423 coef. based seed selection 551 3.73 0.415 522 3.39 0.422 degree based seed selection AND entropy based seed growth coef. based seed selection AND entropy based seed growth 442 4.03 0.416 413 3.61 0.424 551 374 3.74 0.415 522 339 3.39 0.422 Comparison with other competing methods method # clusters avg. size avg. f score Entropy Based Clustering 413 3.61 0.424 MCL 428 5.42 0.419 CNM 40 58.55 0.391

Application to Social / Telecommunication Networks Dataset AS (Autonomous System) link network YouTube video link network MySpace social network dataset # vertices # edges density (%) AS links 45,744 323,009 0.031 YouTube 321,683 505,845 0.001 MySpace 100,000 6,854,231 0.137 Clustering Accuracy Evaluation Method p value of cumulative hypergeometric distribution for each vertex in a cluster Averaging p values of all vertices in a cluster Results method AS links YouTube MySpace # clusters avg. size avg. log p # clusters avg. size avg. log p # clusters avg. size avg. log p Entropy Based 824 63.1 5.49 25,897 8.8 6,505 6.2 MCL 979 46.7 4.31 23,001 14.0 117 853.8

Efficiency Improvement Parallelization Multiple seed growth concurrently producing different clusters Issues Avoiding duplicate cluster generation Sharing resource of candidate seeds Results Close to linear increase of speed up as the increment of the number of threads Runtime Comparison with other competing methods method AS links YouTube MySpace Entropy Based (1 thread) 162 75 50,028 Entropy Based (4 threads) 58 28 16,519 MCL 6,983 317 3,764 CNM 414 118 5,334

Conclusion Significance Systematic approach to analyze complex systems Information theoretic model to formulate modularity Local optimization algorithm for graph clustering Applications to diverse areas: Computational systems biology Social network analysis National security Internet flow analysis Future Works Further revisions of entropy based graph clustering algorithm Revising the seed selection step, e.g., starting from cliques as initial seed clusters Revising the seed growth step, e.g., alternating vertex addition and removal