Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro



Similar documents
On Network Tools for Network Motif Finding: A Survey Study

Protein Protein Interaction Networks

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Implementing Graph Pattern Mining for Big Data in the Cloud

Clustering & Visualization

Graph Mining and Social Network Analysis

Load Balancing. Load Balancing 1 / 24

Distance Degree Sequences for Network Analysis

The Scientific Data Mining Process

9.1. Graph Mining, Social Network Analysis, and Multirelational Data Mining. Graph Mining

2. (a) Explain the strassen s matrix multiplication. (b) Write deletion algorithm, of Binary search tree. [8+8]

Bioinformatics: Network Analysis

Unsupervised Data Mining (Clustering)

A Fast Algorithm For Finding Hamilton Cycles

Big Data Text Mining and Visualization. Anton Heijs

PPD: Scheduling and Load Balancing 2

InfiniteGraph: The Distributed Graph Database

Parallel Algorithms for Small-world Network. David A. Bader and Kamesh Madduri

D A T A M I N I N G C L A S S I F I C A T I O N

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Part 2: Community Detection

Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming)

Social Media Mining. Graph Essentials

Network (Tree) Topology Inference Based on Prüfer Sequence

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Graph Processing and Social Networks

Sanjeev Kumar. contribute

Introduction to Graph Mining

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Information Management course

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

CSE 4351/5351 Notes 7: Task Scheduling & Load Balancing

CHAPTER 1 INTRODUCTION

Binary Coded Web Access Pattern Tree in Education Domain

GRAPH THEORY LECTURE 4: TREES

Graph theoretic approach to analyze amino acid network

Comparison of Data Mining Techniques for Money Laundering Detection System

Statistical and computational challenges in networks and cybersecurity

BIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou

Systems and Algorithms for Big Data Analytics

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Performance Analysis and Optimization Tool

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis. Contents. Introduction. Maarten van Steen. Version: April 28, 2014

Fast Matching of Binary Features

Clustering through Decision Tree Construction in Geology

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

A Review of Data Mining Techniques

Random graphs with a given degree sequence

Graph Mining and Social Network Analysis. Data Mining; EECS 4412 Darren Rolfe + Vince Chu

Complex Network Analysis of Brain Connectivity: An Introduction LABREPORT 5

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

PARALLEL PROGRAMMING

How To Find Local Affinity Patterns In Big Data

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

The University of Jordan

CAD Algorithms. P and NP

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Map-Reduce for Machine Learning on Multicore

SCAN: A Structural Clustering Algorithm for Networks

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Network Algorithms for Homeland Security

Practical Graph Mining with R. 5. Link Analysis

Supervised Feature Selection & Unsupervised Dimensionality Reduction

High Throughput Network Analysis

Network Metrics, Planar Graphs, and Software Tools. Based on materials by Lala Adamic, UMichigan

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Decision Trees from large Databases: SLIQ

Big Data Systems CS 5965/6965 FALL 2015

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Chapter 12 Discovering New Knowledge Data Mining

A comparative study of social network analysis tools

Graph Database Proof of Concept Report

Analysis of MapReduce Algorithms

WEB SITE OPTIMIZATION THROUGH MINING USER NAVIGATIONAL PATTERNS

Using In-Memory Computing to Simplify Big Data Analytics

Exponential time algorithms for graph coloring

Big Data Management and NoSQL Databases

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION

Data Warehousing und Data Mining

Transcription:

Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro

Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free, small-world Community Detection What groups of nodes are related? Node Classification Importance and function of a certain node? Network Comparison What is the type of the network? 2

Building Blocks of Networks Subnetworks, or subgraphs, are the building blocks of networks They have the power to characterize and discriminate networks 3

Building Blocks of Networks 4

Some Graph Definitions Subgraph: subset of vertices and edges inside another graph Induced subgraph: You must consider all connections between selected nodes Induced Subgraph Non-Induced Subgraph 5

Some Graph Definitions Graph Isomorphism: Two graphs G and H are isomorphic (G ~ H) when there is a one to one mapping between the nodes of both graphs and there is an edge between two vertices of G if and only if the corresponding vertices in H also form an edge Informally we are seeing if the topology is the same 4 Isomorphic Graphs 6

Some Graph Definitions Graph Automorphisms: Set of isomorphisms of a graph into itself Equivalence of Nodes Two nodes are equivalent if there exists an automorphism that maps one into the other This relation partitions a graph into equivalence classes Informally, we are looking at the graph symmetry Example graph automorphisms and equivalence classes 7

Some Graph Definitions Graph Matches: A match of a graph H in a larger graph G is a set of nodes of G that induce the respective subgraph H. In other words, a subgraph of G that is isomorphic to H 8

Example Application of Subgraphs Consider all possible directed subgraphs of size 3 9

Example Application of Subgraphs Z-Score measures overrepresentation Different networks present similar fingerprints Image: Adapted from (Milo et al., 2004) 10

Network Motifs Milo et al. (2002) came up with the definition of network motifs: recurring, significant patterns of interconnections How to define: Pattern: induced subgraph Recurring: found many times, i.e., high frequency Significant: more frequent than it would be expected in similar networks (same degree sequence) Parameters P, U, D, N control the definition (Milo et al., 2002, used {0.01, 4, 0.1, 1000}) Image: Adapted from (Milo et al., 2004) 11

Network Motifs Random Networks Motif Original Network Image: Adapted from (Milo et al., 2002) 12

Network Motifs: Example Application Table: Adapted from (Milo et al., 2002) 13

Network Motifs: Example Application Image: taken from (Sporns and Kotter, 2004) 14

Graphlet Degree Distribution What about a node subgraph metric? The degree distribution is in a way measuring participation in subgraphs of size 2 Can we generalize this? Przulj (2006) came up with the definition of graphlet degree distribution: Where does the node appear in orbits of subgraphs? 15

Graphlet Degree Distribution 16

Computational Problem In its core, finding motifs and graphlets its all about finding and counting subgraphs. Just knowing if a certain subgraph exists is already an hard computational problem! Subgraph isomorphism is NP-complete Execution time grows exponentially as the size of the graph or the motif/graphlet increases Feasible motif size is usually small (3 to 8) and network size in the order of hundreds or thousands of nodes 17

Subgraph Discovery Research Part of my research is focused on improving efficiency in network motif detection. Scale up! How? Novel data structures for the graphs and subgraphs Novel faster algorithms Sampling techniques Parallel approaches (with different paradigms) 18

Previous Approaches Network-centric approaches: Enumerate all subgraphs and then compute isomorphisms (ex: ESU/Fanmod, Kavosh). Subgraph-centric approaches: Find one subgraph at a time (ex: Grochow and Kellis) 19

A set-centric approach Key insight: can we do better looking for a given set of subgraphs? Looking for all does not allow for discarding subgraphs we don't care about (which would save computation time) Looking for one at the time does not allow to reuse results (which would avoid redundant computation) Can we find what is common between subgraphs and use that? Set-centric approach: Find a custom set of subgraphs (maybe one, maybe all, maybe something in between) 20

G-Tries Motivation Sequences and Prefix Trees (tries) Can the concept be extended to graphs? 21

G-Tries Motivation Graphs have common substructure! G-Tries (etimology: Graph retrieval ) 22

The G-Trie data structure G-Tries: (customized) collections of subgraphs Common substructures are identified Information is compressed 23

The G-Trie data structure G-Tries: also valid for directed networks 24

The G-Trie data structure G-Tries: also valid for colored/labeled networks G-Trie 25

Creating a G-Trie Iterative insertion Start with an empty g-trie 26

The Need for a Canonical Form There are different node orderings representing the same subgraph Canonical form for a getting an unique g-tries Different canon will give origin to different g-tries 27

Impact of Canonical Form 28

Custom Canonical Form Connectivity Path induces connected subgraph Compressibility More common subtructre, less g-tries nodes Constraining As many connections as possible to ancestor nodes (limit possible matches) GTCanon 29

GTCanon Example 30

Searching with G-Tries Backtracking Procedure Searching at the same time for several subgraphs Candidates for node 1: {0, 1, 2, 3, 4, 5} Try 0: Match = {0}, Neighb. = {1,3,4} Try 1: Match = {0,1}, Neighb. = {2,3,4,5} Try 2: no edge from 2 to 0! FAIL Try 3: no edge from 3 to 1! FAIL Try 4: Match = {0, 1, 4} FOUND! Try 5: no edge from 5 to 1! FAIL 31

Searching with G-Tries The same subgraph could be found several times due to automorphisms (symmetries) We would not only find {0,1,4} but also: {0, 4, 1} {1, 0, 4} {1, 4, 0} {4, 0, 1} {4, 1, 0} 32

Symmetry Breaking Conditions Conditions on node labels Add conditions to respective g-trie path Match only when conditions of at least one descendant are respected Filter conditions to ensure minimum work Ex: transitive property (a<b,a<c,b<c leads to a<b,b<c); assured descendants, only store relevant to node, etc 33

Complete G-Trie Example All six undirected 4-subgraphs 34

Complete G-Trie Example All thirteen directed 3-subgraphs 35

Sequential version: some results Set of 12 representative real networks Network Group Directed dolphins social no 62 159 5.1 12 circuit physical no 252 399 3.2 14 neural biological yes 297 2,345 14.5 134 metabolic biological yes 453 2,025 8.9 237 links social yes 1,490 19,022 22.4 351 coauthors social no 1,589 2,742 3.5 34 ppi biological no 2,361 6,646 5.6 64 odlis semantic yes 2,909 18,241 11.3 592 power physical no 4,941 6,594 2.7 19 company social yes 8,497 6,724 1.6 552 foldoc Semantic yes 13,356 120,238 13.7 728 internet Physical no 22,963 48,436 4.2 2,390 V(G) 36 E(G) Nr. Neighbours Average Max

Sequential version: some results On both directed and undirected graphs g-tries were from 1 to 2 orders of magnitude faster than existing state of the art From 10x to 200x Example results for full census of size k (speedup on a set of undirected networks) Network k ESU Kavosh Grochow dolphins 8 28.9 26.9 39.5 circuit 9 53.2 52.0 39.4 coauthors 6 64.4 66.3 39.7 ppi 5 61.8 62.1 25.6 power 7 38.2 38.0 285.9 internet 4 46.9 45.5 14.7 37

Sequential version: some results On both directed and undirected graphs g-tries were from 1 to 2 orders of magnitude faster than existing state of the art From 10x to 200x Example results for full census of size k (speedup on a set of directed networks) Network K ESU Kavosh Grochow neural 5 25.3 25.5 28.8 metabolic 5 69.9 68.9 15.4 links 4 14.9 15.2 13.2 odlis 4 29.3 29.7 22.6 company 4 48.9 50.1 25.3 foldoc 4 15.8 16.0 50.5 38

Sequential version: some results On both directed and undirected graphs g-tries were from 1 to 2 orders of magnitude faster than existing state of the art From 10x to 200x 39

Sampling approach Sample some subgraph occurrences Deduce approximate total counting Trade accuracy for speed 40

Sampling approach: search tree Backtracking produces search tree 41

Sampling approach Original: unbalanced search tree Goal: Uniform sampling Subgraph occurrences are in last tree depth 42

Sampling approach Original: unbalanced search tree Goal: Uniform sampling Associate a probability of traversing with each search tree depth 100% 100% 80% 80% 43

Sampling Approach Probabilities associated with each depth: {P0, P1,, Pmax} Sampling is uniform Probability of any occurrence is P0 x P1 x x Pmax We can produce unbiased estimator Estimate = (Nr sampled occurences) / (P0 x P1 x x Pmax) The probabilities control the search Regarding accuracy: avoid small values close to root Entire search branches disregarded more variance Regarding speed: avoid high values close to root More search branches higher execution time Some results We can obtain ~90% of accuracy using ~20% of time 44

Parallelism Subgraph Discovery is a tree shaped computation Search nodes are independent from each other {0,1,3} If we know where we are, we can continue from there Tree Nodes -> Work Units 45

Parallel Problem Input: set of work units G-Trie: (network, g-trie node, partial match) Goal: efficiently distribute work units among processors We need dynamic load balancing 46

Parallel approaches Environments we explored Clusters (distributed memory using MPI) [almost linear speedup up to 128 cores] Multicores (shared memory using threads) [almost linear speedup up to 64 cores] Some of the techniques Receiver initiated work stealing Diagonal splitting and efficient snapshots Random polling 47

Example Computation G-Trie Node Graph Vertex 48

Example Computation G-Trie Node Graph Vertex 49

Example Computation G-Trie Node Graph Vertex 50

Example Computation G-Trie Node Graph Vertex 51

Example Computation G-Trie Node Graph Vertex Current Work Unit Explored Work Units STOP 52

Example Computation Keep Give to requester Both Diagonal Splitting 53

Some publications Core sequential algorithms - G-Tries: a data structure for storing and finding subgraphs. Data Mining and Know. Discovery, 2014. - Towards a faster network-centric subgraph census. ASONAM'2013 - Querying Subgraph Sets with G-Tries. DBSocial'2012 (best paper award) - Strategies for Network Motifs Discovery. E-Science 2009. Sampling approach - Efficient Subgraph Frequency Estimation with G-Tries. WABI'2010. - Rand-FaSE: fast approximate subgraph census. SNAM, 2015 Parallel approach - Parallel subgraph counting for multicore architectures. ISPA'2014 - A Scalable Parallel Approach for Subgraph Census Computation. MuCoCos'2014 - Parallel Discovery of Network Motifs. Journal of Parallel and Distributed Computing. 2012. - Efficient Parallel Subgraph Counting using G-Tries. IEEE Cluster'2010. Concept variation and applications - Discovering Colored Network Motifs. Complenet'2014 - Motif Mining in Weighted Networks. Damnet'2012 (and new related in paper ACM-SAC'2015) - Co-authorship network comparison across research fields using motifs. ASONAM'2012. 54

Software Reference sequential implementation is available online and uses C++: http://www.dcc.fc.up.pt/gtries/ More versions available by request and we are working on giving more options to practitioners (ex: GUI, plugin for Gephi/Cytoscape) Graphlets, large scale and adaptive sampling almost public 55

Complex Networks "Group Pedro Ribeiro Fernando Silva (leader) David Aparício (PhD Student) Multicores, Graphlets Pedro Paredes (BSc Student) Network-Centric Graph Theory (leader) Miguel Araújo S. Choobdar (PhD Student) (former PhD Stud) Community Detection Matrix Factorization Structural Roles Temporal, Weigths Miguel Silva (MSc Student) Random Networks Ahmad Naser (MSc Student) GUI and Visualization Some collaborations (ongoing and/or starting): - M. Kaiser (U Newcastle) [Brain Networks] - L. F. Costa (U São Paulo) [Null Models] - K. Pingali (UT Austin) [Temporal Patterns] - C. Faloutsos (CMU) [Large Scale Networks] - S. Parthasarathy (Ohio State) [Roles] 56