graph analysis beyond linear algebra



E. Jason Riedy
DMML, 24 October 2015
HPC Lab, School of Computational Science and Engineering, Georgia Institute of Technology

motivation and applications

(insert prefix here)-scale data analysis

- Cyber-security: Identify anomalies, malicious actors
- Health care: Finding outbreaks, population epidemiology
- Social networks: Advertising, searching, grouping
- Intelligence: Decisions at scale, regulating algorithms
- Systems biology: Understanding interactions, drug design
- Power grid: Disruptions, conservation
- Simulation: Discrete events, cracking meshes

Graphs are a motif / theme in data analysis. Changing and dynamic graphs are important!

outline

1. Motivation and background
2. Linear algebra leads to a better graph algorithm: incremental PageRank
3. Sparse linear algebra techniques lead to a scoop: community detection
4. And something else: connected components

why graphs?

Another tool, like dense and sparse linear algebra.
- Combine things with pairwise relationships.
- Smaller, more generic than raw data.
- Taught (roughly) to all CS students...
- Semantic attributions can capture essential relationships.
- Traversals can be faster than filtering DB joins.
- Provide clear phrasing for queries about relationships.

potential applications

- Social networks
  - Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  - Potential to help social sciences, city planning, and others with large-scale data.
- Cybersecurity
  - Determine if new connections can access a device or represent a new threat in < 5 ms...
  - Is the transfer by a virus / persistent threat?
- Bioinformatics, health
  - Construct gene sequences, analyze protein interactions, map brain interactions.
- Credit fraud forensics, detection, monitoring
  - Integrate all the customer's data, identify in real time.

streaming graph data

Network data rates:
- Gigabit Ethernet: 81k to 1.5M packets per second
- Over 130 000 flows per second on 10 GigE (< 7.7 µs)

Person-level data rates:
- 500M posts per day on Twitter (6k / sec) [1]
- 3M posts per minute on Facebook (50k / sec) [2]

We need to analyze only the changes and not the entire graph. Throughput and latency trade off and expose different levels of concurrency.

[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
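The quoted rates can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope check; the Ethernet framing constants (64- and 1518-byte frames plus 20 bytes of preamble and inter-frame gap) are standard values assumed here, not figures taken from the slide.

```python
# Back-of-the-envelope checks for the data rates quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed standard Ethernet framing (not stated on the slide): every frame
# also occupies 8 bytes of preamble and a 12-byte inter-frame gap on the wire.
LINE_RATE_BITS = 1_000_000_000
WIRE_OVERHEAD_BYTES = 8 + 12
MAX_FRAME_BYTES = 1518
MIN_FRAME_BYTES = 64


def packets_per_second(frame_bytes):
    """Packets per second on a saturated gigabit link for one frame size."""
    return LINE_RATE_BITS / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


slow = packets_per_second(MAX_FRAME_BYTES)   # roughly 81k packets/s
fast = packets_per_second(MIN_FRAME_BYTES)   # roughly 1.5M packets/s

# 130 000 flows per second leaves this many microseconds per flow.
flow_budget_us = 1e6 / 130_000

twitter_per_sec = 500e6 / SECONDS_PER_DAY    # roughly 6k posts/s
facebook_per_sec = 3e6 / 60                  # 50k posts/s
```

The per-flow budget of under 8 microseconds is what motivates touching only the changed part of the graph.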

incremental pagerank

pagerank

Everyone's favorite metric: PageRank.
- Stationary distribution of the random surfer model.
- The eigenvalue problem can be re-phrased as a linear system
  $(I - \alpha A^T D^{-1})\, x = k v$, with
  - $\alpha$: teleportation constant, much < 1
  - $A$: adjacency matrix
  - $D$: diagonal matrix of out-degrees, with $x/0 = x$ (self-loop)
  - $v$: personalization vector, here $1/|V|$
  - $k$: irrelevant scaling constant
- Amenable to analysis, etc.
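As a minimal sketch of the linear-system view, the fixed-point iteration x <- alpha A^T D^-1 x + k v can be written in a few lines of plain Python. The function name, the adjacency-list input, the value alpha = 0.85, and the choice k = 1 - alpha are all assumptions made for illustration; zero out-degree vertices are treated as self-loops, as the slide describes.

```python
def pagerank_fixed_point(out_edges, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Solve (I - alpha A^T D^-1) x = k v by iterating x <- alpha A^T D^-1 x + k v.

    out_edges[u] lists the out-neighbours of vertex u.
    Assumed here: k = 1 - alpha and v = 1/|V|, so x sums to one.
    """
    n = len(out_edges)
    kv = (1.0 - alpha) / n
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [kv] * n
        for u, nbrs in enumerate(out_edges):
            if nbrs:
                share = alpha * x[u] / len(nbrs)
                for w in nbrs:
                    nxt[w] += share
            else:
                nxt[u] += alpha * x[u]   # zero out-degree: act as a self-loop
        change = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            break
    return x


# Tiny example: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank_fixed_point([[1, 2], [2], [0]])
```

Each sweep touches every edge once, which is exactly the cost the incremental formulation tries to avoid.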

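incremental pagerank

- Streaming data setting: update PageRank without touching the entire graph.
- Existing methods maintain databases of walks, etc.
- Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph; we want to solve for $x + \Delta x$.
- Simple algebra:
  $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
  and the right-hand side is sparse.
- Re-arrange for Jacobi,
  $\Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
  iterate, ...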
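The "simple algebra" step can be spelled out. The derivation below writes $A_\Delta$, $D_\Delta$ for the updated matrices (a notation assumed here for symbols lost in the transcript) and uses only the PageRank system stated earlier for the old and new graphs.

```latex
\begin{align*}
(I - \alpha A^T D^{-1})\, x &= k v
  && \text{(old graph)} \\
(I - \alpha A_\Delta^T D_\Delta^{-1})\,(x + \Delta x) &= k v
  && \text{(new graph)} \\
(I - \alpha A_\Delta^T D_\Delta^{-1})\,\Delta x
  &= k v - (I - \alpha A_\Delta^T D_\Delta^{-1})\, x \\
  &= (I - \alpha A^T D^{-1})\, x - (I - \alpha A_\Delta^T D_\Delta^{-1})\, x \\
  &= \alpha \left( A_\Delta^T D_\Delta^{-1} - A^T D^{-1} \right) x .
\end{align*}
```

Only the columns belonging to vertices whose out-edges changed differ between the two matrices, which is why the right-hand side is sparse.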

incremental pagerank: whoops

[Figure: val (log scale) versus iteration k, one panel per graph: caidarouterlevel, copapersciteseer, copapersdblp, great_britain.osm, PGPgiantcompo, power]

And fail. The updated solution wanders away from the true solution. Top rankings stay the same...

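incremental pagerank: think instead

- The old solution $x$ is an ok, not exact, solution to the original problem, now a nearby problem.
- How close? Residual on the updated graph ($A_\Delta$, $D_\Delta$), with $r$ the residual of $x$ on the original graph:
  $r_\Delta = k v - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$.
- Solve $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = r_\Delta$.
- Cheat by not refining all of $r_\Delta$, only region growing around the changes.
- (Also cheat by updating $r$ rather than recomputing at the changes.)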
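A small sketch of the residual-correction idea, under stated assumptions: plain Python adjacency lists, alpha = 0.85, k = 1 - alpha, and helper names (`apply_m`, `solve`) invented for illustration. Unlike the slide's approach, this sketch refines the whole residual rather than growing a region around the changes; it only shows that the residual is nonzero near the change and that x plus the correction matches a from-scratch solve.

```python
def apply_m(out_edges, x, alpha):
    """Return alpha * A^T D^-1 x; zero out-degree vertices act as self-loops."""
    y = [0.0] * len(x)
    for u, nbrs in enumerate(out_edges):
        if nbrs:
            share = alpha * x[u] / len(nbrs)
            for w in nbrs:
                y[w] += share
        else:
            y[u] += alpha * x[u]
    return y


def solve(out_edges, rhs, alpha, tol=1e-14, max_iter=10_000):
    """Iterate z <- alpha A^T D^-1 z + rhs, solving (I - alpha A^T D^-1) z = rhs."""
    z = list(rhs)
    for _ in range(max_iter):
        nxt = [m + b for m, b in zip(apply_m(out_edges, z, alpha), rhs)]
        change = sum(abs(a - b) for a, b in zip(nxt, z))
        z = nxt
        if change < tol:
            break
    return z


alpha = 0.85                                 # assumed value
old_graph = [[1], [2], [0], [0]]
new_graph = [[1, 3], [2], [0], [0]]          # one added edge, 0 -> 3
n = len(old_graph)
kv = [(1 - alpha) / n] * n

x = solve(old_graph, kv, alpha)              # PageRank of the old graph

# Residual of the old solution measured on the new graph.
mx = apply_m(new_graph, x, alpha)
r_new = [b - xi + m for b, xi, m in zip(kv, x, mx)]
touched = [i for i, r in enumerate(r_new) if abs(r) > 1e-10]

dx = solve(new_graph, r_new, alpha)          # correction
updated = [xi + d for xi, d in zip(x, dx)]
scratch = solve(new_graph, kv, alpha)        # reference: recompute everything
err = max(abs(a - b) for a, b in zip(updated, scratch))
```

Here `touched` contains only the out-neighbours of the vertex whose edge list changed, which is the sparsity a region-growing solver would exploit.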

incremental pagerank: works

[Figure: relative 1-norm-wise backward error v. restarted PageRank (log scale) versus number of edges added; one panel per graph: belgium.osm, caidarouterlevel, copapersdblp, luxembourg.osm; algorithms: dpr, dprheld, pr, pr_restart]

Thinking about the numerical linear algebra issues can lead to better graph algorithms.

community detection

graphs: big, nasty hairballs

Yifan Hu's (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html

but no shortage of structure...

[Figure: Protein interactions, Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.]
[Figure: Jason's network via LinkedIn Labs]

- Locally, there are clusters or communities.
- Until 2011, no parallel method for community detection.
- But, gee, the problem looks familiar...

community detection

- Partition a graph's vertices into disjoint communities.
- A community locally optimizes some metric; the optimization is NP-hard.
- Trying to capture that vertices are more similar within one community than between communities.

[Figure: Jason's network via LinkedIn Labs]

common community metric: modularity

- Modularity: Deviation of connectivity in the community induced by a vertex set $S$ from some expected background model of connectivity.
- Newman's uniform model: the modularity of a cluster is the fraction of edges in the community minus the fraction expected from uniformly sampling graphs with the same degree sequence,
  $Q_S = (m_S - x_S^2 / 4m) / m$.
- Modularity: sum of cluster contributions.
- Sufficiently large modularity implies some structure.
- Known issues: resolution limit, NP, etc.
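To make the formula concrete, here is a minimal sketch that evaluates it on a small undirected, unweighted graph. It assumes the usual reading of the symbols, which the slide does not spell out: m_S counts edges with both endpoints in S, x_S is the total degree of the vertices in S, and m is the total number of edges. The function name and the input format are illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Sum over communities S of (m_S - x_S**2 / (4 m)) / m.

    edges: list of undirected (u, v) pairs; community[u]: label of vertex u.
    """
    m = len(edges)
    inside = defaultdict(int)   # m_S: edges with both endpoints in S
    volume = defaultdict(int)   # x_S: total degree of the vertices in S
    for u, v in edges:
        cu, cv = community[u], community[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum((inside[s] - volume[s] ** 2 / (4 * m)) / m for s in volume)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two = modularity(edges, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(edges, [0, 0, 0, 0, 0, 0])   # everything in one community
```

Splitting along the bridge scores 2.5/7 (about 0.357), while lumping everything together scores zero, matching the intuition that larger modularity signals structure.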

sequential agglomerative method

[Figure: example graph on vertices A through G]

A common method (e.g. Clauset, Newman, & Moore, 2004) agglomerates vertices into communities.
- Each vertex begins in its own community.
- An edge is chosen to contract.
  - Merging maximally increases modularity.
  - Priority queue.
- Known often to fall into an $O(n^2)$ performance trap with modularity (Wakita & Tsurumi '07).
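The score that decides which edge to contract follows from the per-community modularity $Q_S = (m_S - x_S^2/4m)/m$. As a worked step, assuming $m_{ij}$ denotes the number of edges between communities $i$ and $j$ (a symbol introduced here for illustration), merging $i$ and $j$ changes total modularity by:

```latex
\begin{align*}
\Delta Q_{ij} &= Q_{i \cup j} - Q_i - Q_j \\
  &= \frac{1}{m}\left[ (m_i + m_j + m_{ij}) - \frac{(x_i + x_j)^2}{4m}
       - m_i + \frac{x_i^2}{4m} - m_j + \frac{x_j^2}{4m} \right] \\
  &= \frac{1}{m}\left( m_{ij} - \frac{x_i x_j}{2m} \right).
\end{align*}
```

The sequential method keeps these scores in a priority queue and contracts the edge with the largest $\Delta Q_{ij}$ at each step.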

parallel agglomerative method

Use a matching to avoid the queue.

[Figure: example graph on vertices A through G]

- Compute a heavy weight matching.
  - Simple greedy, maximal algorithm.
  - Within a factor of 2 of the heaviest.
- Merge all communities at once.
- Maintains some balance.
- Produces different results.
- Agnostic to weighting, matching.
  - Up until 2011, no one tried this...
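One pass of the matching-based approach might look like the following sketch. It is a sequential stand-in, not the parallel implementation: edges of the community graph are scored by modularity gain (m_ij - x_i x_j / 2m) / m, a sorted greedy pass builds a maximal matching over positive-score edges, and all matched pairs are merged together. The function names and the dictionary-based graph format are assumptions for illustration.

```python
def greedy_matching(scores):
    """Greedy maximal matching over positive-score edges.

    Edges are scanned by decreasing score and kept when both endpoints are
    still free; the result weighs at least half as much as the heaviest matching.
    """
    mate = {}
    for (u, v), s in sorted(scores.items(), key=lambda item: -item[1]):
        if s > 0 and u not in mate and v not in mate:
            mate[u] = v
            mate[v] = u
    return mate


def agglomerate_once(weights, volume, m):
    """Score community-graph edges, match them, and merge every matched pair.

    weights[(i, j)] with i < j: edge weight between communities i and j.
    volume[i]: total degree of community i; m: total edge weight.
    """
    scores = {(i, j): (w - volume[i] * volume[j] / (2 * m)) / m
              for (i, j), w in weights.items()}
    mate = greedy_matching(scores)
    # A matched pair takes the smaller of its two labels; others keep theirs.
    label = {c: min(c, mate.get(c, c)) for c in volume}
    new_volume = {}
    for c, x in volume.items():
        new_volume[label[c]] = new_volume.get(label[c], 0) + x
    new_weights = {}
    for (i, j), w in weights.items():
        a, b = sorted((label[i], label[j]))
        if a != b:                       # edges inside a merged pair vanish
            new_weights[(a, b)] = new_weights.get((a, b), 0) + w
    return label, new_weights, new_volume


# Two triangles joined by the bridge (2, 3); every vertex starts alone.
weights = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1,
           (3, 4): 1, (3, 5): 1, (4, 5): 1}
volume = {0: 2, 1: 2, 2: 3, 3: 3, 4: 2, 5: 2}
label, new_weights, new_volume = agglomerate_once(weights, volume, m=7)
```

On this example the first pass pairs (0, 1), (4, 5), and the bridge (2, 3), which illustrates how the matching can produce different results from the queue-based method. Repeating the pass on `new_weights` and `new_volume` continues the agglomeration.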

parallel agglomerative community detection

Graph               |V|            |E|              Reference
soc-livejournal1      4 847 571       68 993 773    SNAP
uk-2007-05          105 896 555    3 301 876 564    Ubicrawler

Peak processing rates in edges/second:

Platform   Mem      soc-livejournal1   uk-2007-05
E7-8870    256GiB   6.90 × 10^6        6.54 × 10^6
XMT2       2TiB     1.73 × 10^6        3.11 × 10^6

- Clustering: Sufficiently good.
- Won the 10th DIMACS Implementation Challenge's mix category in 2012.
- Later: Fagginger Auer & Bisseling (2012), add star detection. LaSalle and Karypis (2014), add m-l refinement.

what about streaming?

Preliminary experiments...
- Simple re-agglomeration: Fast, decreasing modularity.
- Backtracking appears to work, but carries more data (see also Görke, et al. at KIT).
- Clusterings are very sensitive.

Data and plots from Pushkar Godbolé.

connected components

sensitivity and components

- Ok, clusterings are optimizing over a bumpy surface, of course they're sensitive... (Streaming exacerbates.)
- Pick a clean problem: connected components.
- Where could errors occur?
  - Streaming: Dropped, forgotten information
  - Computing: Stop for energy or time, thresholds
  - Real life: Surveys not returned
- How do you even measure errors?
  - Pairwise co-membership counts
  - Empirical distributions from vertex membership
  - No one measure... All need the true solution...
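As a sketch of one such measure, the code below labels connected components with a small union-find and then counts vertex pairs whose "same component" status differs between the true graph and a copy with a dropped edge. The function names and the choice to normalize by the number of pairs are assumptions for illustration, and, as noted above, the measure needs the true solution.

```python
from itertools import combinations


def components(n, edges):
    """Label connected components of an n-vertex graph with union-find."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return [find(a) for a in range(n)]


def comembership_error(true_label, test_label):
    """Fraction of vertex pairs whose same-component status differs."""
    pairs = list(combinations(range(len(true_label)), 2))
    wrong = sum(
        (true_label[a] == true_label[b]) != (test_label[a] == test_label[b])
        for a, b in pairs
    )
    return wrong / len(pairs)


edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
truth = components(6, edges)
# Simulate missing data: the edge (1, 2) was dropped.
observed = components(6, [e for e in edges if e != (1, 2)])
error = comembership_error(truth, observed)
```

Dropping one edge splits the path 0-1-2-3 in two, so 4 of the 15 vertex pairs disagree. The all-pairs loop is quadratic and only suitable for small examples.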

sensitivity of connected components

From Zakrzewska & Bader, "Measuring the Sensitivity of Graph Metrics to Missing Data," PPAM 2013.

[Figure: error versus fraction of graph remaining (kinda)]

questions / future

- Can graph analysis learn from linear algebra and numerical analysis?
  - Are there relevant concepts of backward error?
  - Don't need the true solution to evaluate (or estimate) some distance.
- Should graph analysis look more in the statistical direction?
  - Moderate graphs hit converging limits.
- Are there other easy analogies / low-hanging fruit?
- Environments for playing with large graphs?
  - Sane threading and atomic operations (not data)
- Feel free to join in...

acknowledgements

hpc lab people

Faculty: David A. Bader, Oded Green (was student)
STINGER: Robert McColl, James Fairbanks, Adam McLaughlin, Daniel Henderson, David Ediger (now GTRI), Jason Poovey (GTRI), Karl Jiang, and feedback from users in industry, government, academia
Data: Pushkar Godbolé, Anita Zakrzewska

stinger: where do you get it?

Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/

Gateway to code, development, documentation, presentations...
Remember: Academic code, but maturing with contributions.
Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi, ...