Fast Algorithm for Modularity-based Graph Clustering


Hiroaki Shiokawa, NTT Software Innovation Center, NTT Corporation
July 23rd, 2013

BACKGROUND & MOTIVATION

Large Graphs
- Large-scale graphs are becoming available:
  - Facebook: 1.11 billion active users per month (*1)
  - Twitter: 140 million active users per day, 340 million new posts per day (*2)
  - and more
- Many techniques exist for analyzing massive-scale graphs
- Massive data requires a great deal of time to analyze
- It is important to analyze large-scale data quickly
(*1) Key Facts, http://newsroom.fb.com/key-facts
(*2) http://dev.twitter.com/media/authors

Graph Clustering
- Graph clustering is one of the most important graph analysis methods
- Applications:
  - Community detection over social networks
  - Event detection from microblogging services
  - Brain analysis, image segmentation, clustering analysis
- [Figure: nodes inside a cluster are densely connected; different clusters are sparsely connected]

Modularity-based Graph Clustering
- Clustering methods that find the division of a graph that maximizes the modularity measure
- Improvement of clustering speed:
  - 100M to 1B nodes/hour: no existing algorithms (our research target)
  - 10M nodes/hour: BGLL [Blondel et al., 2008]
  - 1M nodes/hour: CNM [Clauset et al., 2004], WT [Wakita et al., 2008]
  - 100k nodes/hour: Newman method [Newman et al., 2004]
  - 10k nodes/hour: Girvan-Newman method [Girvan et al., 2004]

Objective and Contributions
- Objective: a fast graph clustering method with high modularity
- 3 key techniques:
  1. Incremental node aggregation
  2. Incremental node pruning
  3. Efficient ordering of node selection
- Contributions of our algorithm:
  - Efficiency: considerably faster than BGLL; clusters 100M nodes within 3 minutes
  - High modularity: achieves modularity as high as that of BGLL
  - Effectiveness: improves performance for complex networks

PRELIMINARIES

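Modularity
- Modularity evaluates the strength of a division of a graph into clusters [Newman and Girvan 2004]
- Finding the division that maximizes modularity is NP-complete
- Many greedy approaches have been proposed

\[
Q = \sum_{i \in C} \left[ \frac{e_{ii}}{2m} - \left( \frac{\sum_{j \in C} e_{ij}}{2m} \right)^{2} \right]
\]

- $C$: set of clusters
- $e_{ij}$: number of edges between clusters $i$ and $j$
- $m$: total number of edges in the graph
- First term, $e_{ii}/2m$: the fraction of the edges within cluster $i$
- Second term, $\sum_{j \in C} e_{ij}/2m$: the fraction of outgoing edges from cluster $i$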
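The modularity definition above can be checked on a toy graph. The following is a minimal sketch, not the authors' code. It assumes an unweighted, undirected edge list without self-loops, and it assumes that each internal edge is counted from both endpoints, so that the first term equals (internal edges) / m and the sum in the second term equals the total degree of the cluster. The names `modularity`, `edges`, and `cluster_of` are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Return Q = sum over clusters of internal/m - (degree/(2m))^2.

    edges: list of (u, v) pairs, undirected, no self-loops (assumption).
    cluster_of: dict mapping each node to its cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    degree = defaultdict(int)    # sum of node degrees in the cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_cluster = {n: "x" for n in range(6)}

q_two = modularity(edges, two_clusters)  # 5/14, about 0.357
q_one = modularity(edges, one_cluster)   # 0.0
```

Splitting the graph at the bridge scores higher than keeping everything in one cluster, which is the behavior the measure is meant to reward.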

State-of-the-art algorithm: BGLL
- Repeats the following passes until the modularity score no longer increases
- Pass = 1st phase + 2nd phase
- 1st phase: local clustering
  1) Selects a node
  2) Computes the modularity gain of joining each neighbor's cluster
  3) Places the node in the same cluster as the neighbor that gives the largest gain
- 2nd phase: data reduction
  - Aggregates all nodes in the same cluster into a single node
- After the 2nd phase, continue to the next pass

PROPOSED ALGORITHM

Overview of proposed algorithm
- Our method: 3 key techniques
  - Incremental aggregation
  - Incremental pruning
  - Efficient ordering
- BGLL:
  - Batch-based aggregation
  - Random ordering
- Graph properties exploited:
  - Clustering coefficient
  - Power-law degree distribution

Idea 1: Clustering coefficient
- Complex networks have a large clustering coefficient
- The clustering coefficient measures the degree to which nodes in a graph tend to cluster together
- A graph with a large clustering coefficient contains many duplicated nodes/edges:
  - Duplicated edges between different clusters
  - Nodes whose clusters are trivially obtained (internal nodes within a cluster)
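To make the clustering coefficient concrete, here is a minimal sketch of the standard local clustering coefficient of a node: the fraction of pairs of its neighbors that are themselves connected. The slides do not say which variant (local or global) is meant, so this choice is an assumption, and the names `local_clustering` and `adj` are hypothetical.

```python
def local_clustering(adj, v):
    """Fraction of neighbor pairs of v that are directly connected.

    adj: dict mapping each node to the set of its neighbors
    (undirected, no self-loops; node labels must be comparable).
    """
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to close
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


# A triangle 0-1-2 with one pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)  # both neighbors are linked
c2 = local_clustering(adj, 2)  # only 1 of 3 neighbor pairs is linked
c3 = local_clustering(adj, 3)  # single neighbor
```

Node 0 sits fully inside the triangle, so its cluster is easy to predict from its neighbors alone; this is the kind of redundancy the slide refers to.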

Idea 2: Power-law degree distribution
- Complex networks have a highly skewed degree distribution that follows a power law
- Most nodes have only a few neighbor nodes, and only a few nodes have a large number of neighbor nodes
- The frequency $F$ of nodes with degree $d$ is $F \propto d^{-\gamma}$, where $\gamma$ is the exponent (skewness) of the distribution
- [Figure: example of the degree distribution of a complex network]
- Random ordering of node selection leads to redundant computation

Incremental aggregation
- Incrementally aggregates nodes that belong to the same cluster
- Aggregates duplicated edges between clusters while keeping the graph semantics
- [Figure: example in which nodes of the same cluster are merged into one aggregated node and the duplicated edges are merged into a single edge of weight 2]
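The aggregation step can be sketched as a merge of two nodes in a weighted adjacency map. This is an illustration under stated assumptions, not the authors' data structure: the graph is undirected, `adj[u][v]` holds the edge weight, the edge between the two merged nodes is kept as a self-loop so that internal edges are not lost, and the function name `aggregate` is hypothetical.

```python
def aggregate(adj, keep, drop):
    """Merge node `drop` into node `keep` in place and return adj.

    adj: dict of dicts, adj[u][v] = weight of the undirected edge u-v.
    Parallel edges to the same neighbor are summed into one weight.
    """
    for nbr, w in adj.pop(drop).items():
        if nbr == drop:
            # Self-loop on the dropped node stays internal to the merged node.
            adj[keep][keep] = adj[keep].get(keep, 0) + w
            continue
        del adj[nbr][drop]
        if nbr == keep:
            # The edge between the merged nodes becomes a self-loop.
            adj[keep][keep] = adj[keep].get(keep, 0) + w
        else:
            # Duplicated edges to the same neighbor collapse into one edge.
            adj[keep][nbr] = adj[keep].get(nbr, 0) + w
            adj[nbr][keep] = adj[keep][nbr]
    return adj


# Triangle a-b-c plus a pendant node d attached to c.
graph = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1},
}
aggregate(graph, "a", "b")  # a and b are assumed to share a cluster
```

After the merge, the two edges a-c and b-c are represented by one edge of weight 2, which matches the weight shown in the slide's example.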

Incremental pruning
- Incrementally prunes nodes whose cluster is trivially obtained
- Such nodes can be assigned to clusters without computing modularity gains
- From the graph structure, there are 2 patterns of pruning:
  - Pattern A: a node that has only a single neighbor node. These nodes are easy to prune.
  - Pattern B: a node surrounded by nodes that belong to the same cluster. This case is non-trivial because the clusters of the neighbors must be checked.

Incremental pruning (cont.)
- Efficient pruning approach for pattern B
- All nodes within the same cluster have already been aggregated into a single node by incremental aggregation
- All prunable nodes can therefore be found by collecting the nodes that have only a single adjacent node
- [Figure: after incremental aggregation of the cluster, a node of pattern B is converted to pattern A]
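Once clusters have been aggregated, finding prunable nodes reduces to a degree check. The sketch below assumes the same weighted adjacency-map layout as a typical aggregated graph (`adj[u][v]` = weight, with self-loops standing for internal edges); the function name `prunable_nodes` and the example graph are hypothetical.

```python
def prunable_nodes(adj):
    """Map each node with exactly one adjacent node to that neighbor.

    Self-loops are ignored because they represent edges inside an
    aggregated node, not links to another node.
    """
    result = {}
    for node, nbrs in adj.items():
        others = [n for n in nbrs if n != node]
        if len(others) == 1:
            result[node] = others[0]
    return result


# Aggregated graph: node "ab" stands for a merged cluster (self-loop = 1
# internal edge), linked to c by a weight-2 edge; d hangs off c; e links c and f.
aggregated = {
    "ab": {"ab": 1, "c": 2},
    "c": {"ab": 2, "d": 1, "e": 1},
    "d": {"c": 1},
    "e": {"c": 1, "f": 1},
    "f": {"e": 1},
}
to_prune = prunable_nodes(aggregated)
```

Here `ab`, `d`, and `f` each have one adjacent node and can be assigned to that neighbor's cluster without a modularity-gain computation, while `c` and `e` still need the full evaluation.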

Efficient ordering of node selection
- Dynamically selects the node with the smallest degree
- Example: nodes A and B are to be assigned to the same cluster, where A has many neighbors and B has 2
  - Selecting A requires many computations
  - Selecting B requires 2 computations
  - It is more efficient to compute from node B than from node A
- Selecting the node with the smallest degree also avoids producing super-cluster structures
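One common way to select the smallest-degree node dynamically is a binary heap with lazy invalidation, since degrees change as nodes are aggregated and pruned. The slides do not state which data structure the authors use, so the class below is an assumed illustration; `MinDegreeQueue` and its methods are hypothetical names.

```python
import heapq


class MinDegreeQueue:
    """Priority queue that returns the node with the smallest current degree."""

    def __init__(self, degrees):
        self.degree = dict(degrees)  # current degree of each live node
        self.heap = [(d, n) for n, d in self.degree.items()]
        heapq.heapify(self.heap)

    def update(self, node, degree):
        """Record a new degree; the old heap entry becomes stale."""
        self.degree[node] = degree
        heapq.heappush(self.heap, (degree, node))

    def remove(self, node):
        """Drop a node (for example after it was aggregated or pruned)."""
        self.degree.pop(node, None)

    def pop_min(self):
        """Return the live node with the smallest degree, or None if empty."""
        while self.heap:
            d, n = heapq.heappop(self.heap)
            if self.degree.get(n) == d:  # skip stale or removed entries
                del self.degree[n]
                return n
        return None


queue = MinDegreeQueue({"A": 5, "B": 2, "C": 3})
first = queue.pop_min()   # B has the smallest degree
queue.update("A", 1)      # A lost neighbors through aggregation
second = queue.pop_min()
third = queue.pop_min()
fourth = queue.pop_min()  # only a stale entry for A remains
```

Stale entries are skipped at pop time instead of being deleted eagerly, which keeps each update at logarithmic cost.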

EVALUATION

Datasets & Experimental Environment
- Real-world datasets: 2 social networks and 3 Web graphs of IP domains

Dataset          |V|            |E|              Skewness of degree distribution
dblp-2010        326,186        1,615,400        2.82
ljournal-2008    5,363,260      79,023,142       2.29
uk-2005          39,459,925     936,364,282      1.71
webbase          118,142,155    1,019,903,190    2.14
uk2007-05        105,896,555    3,738,733,648    1.51

- Experimental environment:
  - All experiments were conducted on a Linux 2.6.18 server with an Intel Xeon L5640 CPU (2.27 GHz) and 144 GB of RAM
  - All methods were run on 1 core of 1 CPU

Computational time
- The proposed method is up to 60 times faster than the state-of-the-art algorithm BGLL
- Graphs with a highly skewed degree distribution are processed faster than the other datasets
- 100 million nodes and 1 billion edges in 156 seconds
- 320k nodes within 1 second

Computational time: power-law differences

Computational time: size differences

Modularity score
- Modularity scores for each dataset
- A large modularity score means the output of the algorithm is well clustered
- The proposed method achieves modularity scores that are almost the same as, or slightly higher than, those of BGLL

CONCLUSION

Conclusion
- Fast clustering algorithm for large graphs
- Our solution:
  - Incremental aggregation
  - Incremental pruning
  - Efficient ordering of node selection
- Contributions of our algorithm:
  - Efficiency: considerably faster than BGLL; clusters 100M nodes within 3 minutes
  - High modularity: achieves modularity as high as that of BGLL
  - Effectiveness: improves performance for complex networks