Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014



Similar documents
An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013

Big Graph Processing: Some Background

Social Media Mining. Graph Essentials

Practical Graph Mining with R. 5. Link Analysis

Graph Processing and Social Networks

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Distance Degree Sequences for Network Analysis

SCAN: A Structural Clustering Algorithm for Networks

DATA ANALYSIS II. Matrix Algorithms

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

Social Media Mining. Network Measures

Mining Social Network Graphs

Hadoop SNS. renren.com. Saturday, December 3, 11

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Analysis of Algorithms, I

Machine Learning over Big Data

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis. Contents. Introduction. Maarten van Steen. Version: April 28, 2014

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

General Network Analysis: Graph-theoretic. COMP572 Fall 2009

Network (Tree) Topology Inference Based on Prüfer Sequence

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

Big Data Analytics. Lucas Rego Drumond

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.


Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Discrete Mathematics & Mathematical Reasoning Chapter 10: Graphs

Approximation Algorithms

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

InfiniteGraph: The Distributed Graph Database

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

Complex Networks Analysis: Clustering Methods

Systems and Algorithms for Big Data Analytics

Search engines: ranking algorithms

Mining Large Datasets: Case of Mining Graph Data in the Cloud

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Part 2: Community Detection

Introduction to Graph Mining

Similarity Search in a Very Large Scale Using Hadoop and HBase

Small Maximal Independent Sets and Faster Exact Graph Coloring

Mining Social-Network Graphs

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

V. Adamchik 1. Graph Theory. Victor Adamchik. Fall of 2005

Efficient Identification of Starters and Followers in Social Media

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Presto/Blockus: Towards Scalable R Data Analysis

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Large-Scale Data Processing

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Fast Parallel Conversion of Edge List to Adjacency List for Large-Scale Graphs

Distributed Computing over Communication Networks: Maximal Independent Set

Euler Paths and Euler Circuits

How To Cluster Of Complex Systems

Energy Efficient MapReduce

On the independence number of graphs with maximum degree 3

Introduction to Networks and Business Intelligence

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

A1 and FARM scalable graph database on top of a transactional memory layer

B490 Mining the Big Data. 0 Introduction

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Finding and counting given length cycles

SECTIONS NOTES ON GRAPH THEORY NOTATION AND ITS USE IN THE STUDY OF SPARSE SYMMETRIC MATRICES

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Hadoop and Map-reduce computing

Frans J.C.T. de Ruiter, Norman L. Biggs Applications of integer programming methods to cages

How To Understand The Network Of A Network

Using Map-Reduce for Large Scale Analysis of Graph-Based Data

White Paper: Big Data and the hype around IoT

Graph Database Proof of Concept Report

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Graph theory and network analysis. Devika Subramanian Comp 140 Fall 2008

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

MapReduce and Hadoop Distributed File System V I J A Y R A O

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Network/Graph Theory. What is a Network? What is network theory? Graph-based representations. Friendship Network. What makes a problem graph-like?

Graph Mining and Social Network Analysis

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

(67902) Topics in Theory and Complexity Nov 2, Lecture 7

Design of LDPC codes

Solutions to Homework 6

Arrangements And Duality

CIS 700: algorithms for Big Data

GRAPH data are important data types in many scientific

SGL: Stata graph library for network analysis

Large Scale Graph Processing with Apache Giraph

HIGH PERFORMANCE BIG DATA ANALYTICS

Transcription:

Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014

300 years before Watson there was Euler! The first (Jeopardy!) graph question? A path crossing each of the Seven Bridges of Königsberg exactly once is not possible because of this type of graph.

300 years before Watson there was Euler! The first (Jeopardy!) graph question? A path crossing each of the Seven Bridges of Königsberg exactly once is not possible because of this type of graph. Answer: What is a graph with more than two, odd degree vertices?

Asking a question is a search for an answer... Can we find X in Y? This is a graph search problem! What is a graph? A network of pairwise relationships... vertices connected by edges Brain network of C. elegans Watts, Strogatz, Nature 393(6684), 1998

Search the Web Graph! Google it! In 1998 Google published the PageRank algorithm... Treat the Web as a Big Graph Web Pages are vertices and hyperlinks are edges... Initialize all pages with a starting rank Imitate web surfer randomly clicking on hyperlinks Random Walk (Markov Chain) on a graph! Rank pages by quantity and quality of links Important/Relevant websites rank higher and websites referenced by important websites rank higher

Search the Social Graph! Facebook Graph Search index a social graph of 1 trillion edges complex queries based on the social connections between users in Facebook What questions can it answer? The Facebook Graph Search can answer questions like: which restaurants did my friends like? did people like my comments about the latest movie?

Search the Knowledge Graph! Semantic Graphs The meaning of a graph is encoded in the graph nodes are linked by semantics, e.g. dog is a mammal semantics are machine-readable... RDF, OWL, Open Graph Google Knowledge Graph Added to Google search engine in 2012 Microsoft Satori Announced in 2013 for the Microsoft Bing! search engine

Graphs are great, so what s the problem? Locality of Reference Conventional implementations expect O(1) random access, but the real-world is different... Random Access Memory (RAM) Typically takes 100 nanoseconds (ns) to load a memory reference.

Back to Basics... Breadth-First Search Search can be a walk on a graph... Traversal by breadth-first expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O

Back to Basics... Breadth-First Search Search can be a walk on a graph... Traversal by breadth-first expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O

Back to Basics... Breadth-First Search Search can be a walk on a graph... Traversal by breadth-first expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O

Back to Basics... Breadth-First Search Search can be a walk on a graph... Traversal by breadth-first expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O

Back to Basics... Breadth-First Search Search can be a walk on a graph... Traversal by breadth-first expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O

Real-world costs for simple BFS Preliminaries (U) For a simple, undirected graph G = (V, E) with n = V vertices and m = E edges where... N(v) = {u V (v, u) E} is the neighborhood of v, d v = N(v) is the degree of v A is the adjacency matrix of G where A T = A Cost of Breadth-First Search { Θ(2n) Θ() Θ(n + 2m) algorithm storage memory references

Why locality matters! Example The cost of BFS on the 2002 Yahoo! Web Graph (n = 1.4 10 9, m = 6.6 10 9 ): Θ() { Θ(2n) 8 = 22.4 10 9 Θ(n + 2m) = 14.6 10 9 bytes of storage memory references If each memory reference took 100 ns the minimum time to complete BFS would be more than 24 minutes!

What s next... Bigger? Big Data begets Big Graphs Big Data challenges conventional algorithms... can t store it all in memory but... disks are 1000x slower need a scalable Breadth-First Search... An NSA Big Graph experiment Tech Report NSA-RD-2013-056002v1 1 Petabyte RMAT graph 19.5x more than cluster memory linear performance from 1 trillion to 70 trillion edges... Problem Class 36 39 42 80 70 Problem Class Memory = 57.6 TB 1126 60 50 40 30 20 10 140 0 17 Traversed Edges (trillion) 0 20 40 60 80 100 120 140 Time (h) Terabytes

... Harder graph questions? Beyond Breadth-First Search Hard questions aren t always linear time... Example How similar is each vertex to all other vertices in the graph? Real-world use case Find all duplicate or near-duplicate web pages...

Similarity on Graphs Vertex Similarity Similarity between a pair of vertices can be defined by the overlap of common neighbors; i.e. structural similarity depends only on adjacency information does not require transitivity, i.e. triangles

Jaccard Similarity Jaccard Coefficient Ratio of intersection to union cardinalities of two sets. Properties J ij = N(i) N(j) N(i) N(j) range in [0, 1]; 0 disjoint sets and 1 identical sets non-zero if (i, j) are endpoints of paths of length two, i.e. 2-paths Jaccard Distance, 1 J ij, satisfies the triangle inequality

Computing exact, all-pairs Jaccard Similarity is hard! Cubic upper bound! O(n 3 ) worst case complexity! Count (i, j) pairs where i, j N(v) Let γ ij denote count of (i, j) 2-paths and δ ij = d i + d j then, γ ij J ij = δ ij γ ij Runtime complexity Neighbor pairing 1: for all v V do 2: for all {i, j} N(v) do 3: γ ij γ ij + 1 4: end for 5: end for ( ( ) dv ) ( O O 2 v v d 2 v ) ) O (d max d v O(md max ) O(mn) O(n 3 ) v

MapReduce Jaccard Similarity (memory-bound) Round 1 Map: Identity u, (v, dv ) u, (v, d v ) Reduce: Create ( d v 2 ) ordered pairs of neighbors as compound keys u, {(v, dv ) v N(u)} (v w, w), d v + d w v, w N(u) Round 2 Map: Identity Reduce function: Calculate the Jaccard Coefficient (i, j), {δij, δ ij,...} (i, j), J ij = γ ij /(δ ij γ ij ) δ ij = d i + d j γ ij = {δ ij, δ ij,...}

Must load-balance v ( d(v) ) 2 pair construction Parallel, pairwise combinations construct neighbor pairs for each adjacency set; for each v i N(u) { (u, } (u 1, v 1 ) (u 2, v 2 ) (u 3, v 3 ) (u 4, v 4 ) j), vi (u 1, v 2 ) (u 2, v 3 ) (u 3, v 4 ) j=1..i (u 1, v 3 ) (u 2, v 4 ) (u 1, v 4 ) pair first element in each column with the remaining elements to generate all ( ) d(v) 2 pairs Benefit Enables distributed pair construction in O(1) memory

Jaccard Similarity in O(1) rounds and memory Round 1 Map: Identity Reduce: Label edges with ordinal Du, (v i, d vi ) v i N(u) E n (u, i), (vi, d vi ) o i=1..d(u) Round 2 Map: Prepare edges for pairing (u, i), (v, dv ) n (u, j), (v, dv ) o j=1..i Reduce: Create ordered neighbor pairs as compound keys Du, (v, d v ) v N(u) E (v w, w), d v + d w v, w N(u) Round 3 Map: Identity Reduce: Calculate and output J ij (i, j), {δij, δ ij,...} (i, j), J ij = γ ij /(δ ij γ ij ) δ ij = d i + d j γ ij = {δ ij, δ ij,...}

Exact, All-Pairs Jaccard Similarity benchmarks Experiments Verify scalability on synthetic datasets Graph500 RMAT graphs (A=.57,B=C=.19,D=0.05) n = 2 SCALE, m = 16n Cluster 1000 nodes 12 cores per node 64 GB RAM per node Constant-memory MapReduce Job parameters pre-step round to annotate undirected edges with degree, e.g. u, (v, d v ) edge preparation is block-distributed total of 5 rounds; one additional round inserted to randomize output from round 1

All-Pairs Jaccard Similarity on Graph500 datasets Graph500 RMAT Graphs (undirected) n (10 6 ) m (10 6 ) 2-paths (10 9 ) J ij (10 9 ) RMAT 16 0.06554 1.049 0.2004 0.08235 RMAT 18 0.2621 4.194 1.464 0.6381 RMAT 20 1.049 16.78 10.46 4.854 RMAT 22 4.194 67.11 73.22 36.00 RMAT 24 16.78 268.4 504.4 261.5 RMAT 26 67.11 1074 3433 1871 Time (minutes) 256 128 64 32 16 8 4 2 1 Fast 2 20 2 22 2 24 2 26 2 28 2 30 Total Edges Time (minutes) RMAT 16 2.30 RMAT 18 2.83 RMAT 20 4.40 RMAT 22 11.2 RMAT 24 39.1 RMAT 26 208

Results Exact, all-pairs Jaccard Similarity in O(1) memory and rounds neighbor pairing similar to Node Iterator for triangle listing load-balance O( v d(v)2 ) pair generation scales well with increasing 2-path count MapReduce All-Pairs Jaccard Similarity performance 9 billion Jaccard coefficients per minute! (RMAT scale 26 1.9 trillion J ij )

What s the next Big question? Beyond...?