Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014


 Justin Lyons
 1 years ago
 Views:
Transcription
1 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate  R6 Technical Report February 3, 2014
2 300 years before Watson there was Euler! The first (Jeopardy!) graph question? A path crossing each of the Seven Bridges of Königsberg exactly once is not possible because of this type of graph.
3 300 years before Watson there was Euler! The first (Jeopardy!) graph question? A path crossing each of the Seven Bridges of Königsberg exactly once is not possible because of this type of graph. Answer: What is a graph with more than two, odd degree vertices?
4 Asking a question is a search for an answer... Can we find X in Y? This is a graph search problem! What is a graph? A network of pairwise relationships... vertices connected by edges Brain network of C. elegans Watts, Strogatz, Nature 393(6684), 1998
5 Search the Web Graph! Google it! In 1998 Google published the PageRank algorithm... Treat the Web as a Big Graph Web Pages are vertices and hyperlinks are edges... Initialize all pages with a starting rank Imitate web surfer randomly clicking on hyperlinks Random Walk (Markov Chain) on a graph! Rank pages by quantity and quality of links Important/Relevant websites rank higher and websites referenced by important websites rank higher
6 Search the Social Graph! Facebook Graph Search index a social graph of 1 trillion edges complex queries based on the social connections between users in Facebook What questions can it answer? The Facebook Graph Search can answer questions like: which restaurants did my friends like? did people like my comments about the latest movie?
7 Search the Knowledge Graph! Semantic Graphs The meaning of a graph is encoded in the graph nodes are linked by semantics, e.g. dog is a mammal semantics are machinereadable... RDF, OWL, Open Graph Google Knowledge Graph Added to Google search engine in 2012 Microsoft Satori Announced in 2013 for the Microsoft Bing! search engine
8 Graphs are great, so what s the problem? Locality of Reference Conventional implementations expect O(1) random access, but the realworld is different... Random Access Memory (RAM) Typically takes 100 nanoseconds (ns) to load a memory reference.
9 Back to Basics... BreadthFirst Search Search can be a walk on a graph... Traversal by breadthfirst expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O
10 Back to Basics... BreadthFirst Search Search can be a walk on a graph... Traversal by breadthfirst expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O
11 Back to Basics... BreadthFirst Search Search can be a walk on a graph... Traversal by breadthfirst expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O
12 Back to Basics... BreadthFirst Search Search can be a walk on a graph... Traversal by breadthfirst expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O
13 Back to Basics... BreadthFirst Search Search can be a walk on a graph... Traversal by breadthfirst expansion can answer the question: Starting from A can we find K? A B C D E F G H I J K L M N O
14 Realworld costs for simple BFS Preliminaries (U) For a simple, undirected graph G = (V, E) with n = V vertices and m = E edges where... N(v) = {u V (v, u) E} is the neighborhood of v, d v = N(v) is the degree of v A is the adjacency matrix of G where A T = A Cost of BreadthFirst Search { Θ(2n) Θ() Θ(n + 2m) algorithm storage memory references
15 Why locality matters! Example The cost of BFS on the 2002 Yahoo! Web Graph (n = , m = ): Θ() { Θ(2n) 8 = Θ(n + 2m) = bytes of storage memory references If each memory reference took 100 ns the minimum time to complete BFS would be more than 24 minutes!
16 What s next... Bigger? Big Data begets Big Graphs Big Data challenges conventional algorithms... can t store it all in memory but... disks are 1000x slower need a scalable BreadthFirst Search... An NSA Big Graph experiment Tech Report NSARD v1 1 Petabyte RMAT graph 19.5x more than cluster memory linear performance from 1 trillion to 70 trillion edges... Problem Class Problem Class Memory = 57.6 TB Traversed Edges (trillion) Time (h) Terabytes
17 ... Harder graph questions? Beyond BreadthFirst Search Hard questions aren t always linear time... Example How similar is each vertex to all other vertices in the graph? Realworld use case Find all duplicate or nearduplicate web pages...
18 Similarity on Graphs Vertex Similarity Similarity between a pair of vertices can be defined by the overlap of common neighbors; i.e. structural similarity depends only on adjacency information does not require transitivity, i.e. triangles
19 Jaccard Similarity Jaccard Coefficient Ratio of intersection to union cardinalities of two sets. Properties J ij = N(i) N(j) N(i) N(j) range in [0, 1]; 0 disjoint sets and 1 identical sets nonzero if (i, j) are endpoints of paths of length two, i.e. 2paths Jaccard Distance, 1 J ij, satisfies the triangle inequality
20 Computing exact, allpairs Jaccard Similarity is hard! Cubic upper bound! O(n 3 ) worst case complexity! Count (i, j) pairs where i, j N(v) Let γ ij denote count of (i, j) 2paths and δ ij = d i + d j then, γ ij J ij = δ ij γ ij Runtime complexity Neighbor pairing 1: for all v V do 2: for all {i, j} N(v) do 3: γ ij γ ij + 1 4: end for 5: end for ( ( ) dv ) ( O O 2 v v d 2 v ) ) O (d max d v O(md max ) O(mn) O(n 3 ) v
21 MapReduce Jaccard Similarity (memorybound) Round 1 Map: Identity u, (v, dv ) u, (v, d v ) Reduce: Create ( d v 2 ) ordered pairs of neighbors as compound keys u, {(v, dv ) v N(u)} (v w, w), d v + d w v, w N(u) Round 2 Map: Identity Reduce function: Calculate the Jaccard Coefficient (i, j), {δij, δ ij,...} (i, j), J ij = γ ij /(δ ij γ ij ) δ ij = d i + d j γ ij = {δ ij, δ ij,...}
22 Must loadbalance v ( d(v) ) 2 pair construction Parallel, pairwise combinations construct neighbor pairs for each adjacency set; for each v i N(u) { (u, } (u 1, v 1 ) (u 2, v 2 ) (u 3, v 3 ) (u 4, v 4 ) j), vi (u 1, v 2 ) (u 2, v 3 ) (u 3, v 4 ) j=1..i (u 1, v 3 ) (u 2, v 4 ) (u 1, v 4 ) pair first element in each column with the remaining elements to generate all ( ) d(v) 2 pairs Benefit Enables distributed pair construction in O(1) memory
23 Jaccard Similarity in O(1) rounds and memory Round 1 Map: Identity Reduce: Label edges with ordinal Du, (v i, d vi ) v i N(u) E n (u, i), (vi, d vi ) o i=1..d(u) Round 2 Map: Prepare edges for pairing (u, i), (v, dv ) n (u, j), (v, dv ) o j=1..i Reduce: Create ordered neighbor pairs as compound keys Du, (v, d v ) v N(u) E (v w, w), d v + d w v, w N(u) Round 3 Map: Identity Reduce: Calculate and output J ij (i, j), {δij, δ ij,...} (i, j), J ij = γ ij /(δ ij γ ij ) δ ij = d i + d j γ ij = {δ ij, δ ij,...}
24 Exact, AllPairs Jaccard Similarity benchmarks Experiments Verify scalability on synthetic datasets Graph500 RMAT graphs (A=.57,B=C=.19,D=0.05) n = 2 SCALE, m = 16n Cluster 1000 nodes 12 cores per node 64 GB RAM per node Constantmemory MapReduce Job parameters prestep round to annotate undirected edges with degree, e.g. u, (v, d v ) edge preparation is blockdistributed total of 5 rounds; one additional round inserted to randomize output from round 1
25 AllPairs Jaccard Similarity on Graph500 datasets Graph500 RMAT Graphs (undirected) n (10 6 ) m (10 6 ) 2paths (10 9 ) J ij (10 9 ) RMAT RMAT RMAT RMAT RMAT RMAT Time (minutes) Fast Total Edges Time (minutes) RMAT RMAT RMAT RMAT RMAT RMAT
26 Results Exact, allpairs Jaccard Similarity in O(1) memory and rounds neighbor pairing similar to Node Iterator for triangle listing loadbalance O( v d(v)2 ) pair generation scales well with increasing 2path count MapReduce AllPairs Jaccard Similarity performance 9 billion Jaccard coefficients per minute! (RMAT scale trillion J ij )
27 What s the next Big question? Beyond...?
Trinity: A Distributed Graph Engine on a Memory Cloud
Trinity: A Distributed Graph Engine on a Memory Cloud Bin Shao Microsoft Research Asia Beijing, China binshao@microsoft.com Haixun Wang Microsoft Research Asia Beijing, China haixunw@microsoft.com Yatao
More informationAn efficient reconciliation algorithm for social networks
An efficient reconciliation algorithm for social networks Nitish Korula Google Inc. 76 Ninth Ave, 4th Floor New York, NY nitish@google.com Silvio Lattanzi Google Inc. 76 Ninth Ave, 4th Floor New York,
More informationQuestion Answering on Interlinked Data
Question Answering on Interlinked Data Saeedeh Shekarpour Universität Leipzig, IFI/AKSW shekarpour@informatik.unileipzig.de AxelCyrille Ngonga Ngomo Universität Leipzig, IFI/AKSW ngonga@informatik.unileipzig.de
More informationComputing Personalized PageRank Quickly by Exploiting Graph Structures
Computing Personalized PageRank Quickly by Exploiting Graph Structures Takanori Maehara Takuya Akiba Yoichi Iwata Ken ichi Kawarabayashi National Institute of Informatics, The University of Tokyo JST,
More informationFast InMemory Triangle Listing for Large RealWorld Graphs
Fast InMemory Triangle Listing for Large RealWorld Graphs ABSTRACT Martin Sevenich Oracle Labs martin.sevenich@oracle.com Adam Welc Oracle Labs adam.welc@oracle.com Triangle listing, or identifying all
More informationMizan: A System for Dynamic Load Balancing in Largescale Graph Processing
: A System for Dynamic Load Balancing in Largescale Graph Processing Zuhair Khayyat Karim Awara Amani Alonazi Hani Jamjoom Dan Williams Panos Kalnis King Abdullah University of Science and Technology,
More informationDOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems
DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems Yin Huai 1 Rubao Lee 1 Simon Zhang 2 Cathy H Xia 3 Xiaodong Zhang 1 1,3Department of Computer
More informationWTF: The Who to Follow Service at Twitter
WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT
More informationStructural and functional analytics for community detection in largescale complex networks
Chopade and Zhan Journal of Big Data DOI 10.1186/s405370150019y RESEARCH Open Access Structural and functional analytics for community detection in largescale complex networks Pravin Chopade 1* and
More informationStreaming Similarity Search over one Billion Tweets using Parallel LocalitySensitive Hashing
Streaming Similarity Search over one Billion Tweets using Parallel LocalitySensitive Hashing Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden and Pradeep
More informationAn External Memory Algorithm for Listing Triangles
UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL INSTITUTO DE INFORMÁTICA DEPARTAMENTO DE INFORMÁTICA TEÓRICA CURSO DE CIÊNCIA DA COMPUTAÇÃO BRUNO MENEGOLA An External Memory Algorithm for Listing Triangles Trabalho
More informationScaling Datalog for Machine Learning on Big Data
Scaling Datalog for Machine Learning on Big Data Yingyi Bu, Vinayak Borkar, Michael J. Carey University of California, Irvine Joshua Rosen, Neoklis Polyzotis University of California, Santa Cruz Tyson
More informationTowards a Benchmark for Graph Data Management and Processing
Towards a Benchmark for Graph Data Management and Processing Michael Grossniklaus University of Konstanz Germany michael.grossniklaus @unikonstanz.de Stefania Leone University of Southern California United
More informationHow Well do GraphProcessing Platforms Perform? An Empirical Performance Evaluation and Analysis
How Well do GraphProcessing Platforms Perform? An Empirical Performance Evaluation and Analysis Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella and Theodore L. Willke
More informationReal Time Personalized Search on Social Networks
Real Time Personalized Search on Social Networks Yuchen Li #, Zhifeng Bao, Guoliang Li +, KianLee Tan # # National University of Singapore University of Tasmania + Tsinghua University {liyuchen, tankl}@comp.nus.edu.sg
More informationSo Who Won? Dynamic Max Discovery with the Crowd
So Who Won? Dynamic Max Discovery with the Crowd Stephen Guo Stanford University Stanford, CA, USA sdguo@cs.stanford.edu Aditya Parameswaran Stanford University Stanford, CA, USA adityagp@cs.stanford.edu
More informationThe mathematics of networks
The mathematics of networks M. E. J. Newman Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109 1040 In much of economic theory it is assumed that economic agents interact,
More informationFast PointtoPoint Shortest Path Computations with ArcFlags
Fast PointtoPoint Shortest Path Computations with ArcFlags Ekkehard Köhler, Rolf H. Möhring, and Heiko Schilling Institute of Mathematics, TU Berlin, Germany {Ekkehard.Koehler Rolf.Moehring Heiko.Schilling}@TUBerlin.DE
More informationTop 10 algorithms in data mining
Knowl Inf Syst (2008) 14:1 37 DOI 10.1007/s1011500701142 SURVEY PAPER Top 10 algorithms in data mining Xindong Wu Vipin Kumar J. Ross Quinlan Joydeep Ghosh Qiang Yang Hiroshi Motoda Geoffrey J. McLachlan
More informationUNIVERSITÀ DEGLI STUDI DI CATANIA facoltà di scienze matematiche, fisiche e naturali corso di laurea specialistica in fisica
UNIVERSITÀ DEGLI STUDI DI CATANIA facoltà di scienze matematiche, fisiche e naturali corso di laurea specialistica in fisica Alessio Vincenzo Cardillo Structural Properties of Planar Graphs of Urban Street
More informationThe LinkPrediction Problem for Social Networks
The LinkPrediction Problem for Social Networks David LibenNowell Department of Computer Science Carleton College Northfield, MN 55057 USA dlibenno@carleton.edu Jon Kleinberg Department of Computer Science
More informationModern Information Retrieval
Modern Information Retrieval Chapter 11 Web Retrieval with Yoelle Maarek A Challenging Problem The Web Search Engine Architectures Search Engine Ranking Managing Web Data Search Engine User Interaction
More informationA Googlelike Model of Road Network Dynamics and its Application to Regulation and Control
A Googlelike Model of Road Network Dynamics and its Application to Regulation and Control Emanuele Crisostomi, Steve Kirkland, Robert Shorten August, 2010 Abstract Inspired by the ability of Markov chains
More informationClean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University
More informationFinding Similar Items
72 Chapter 3 Finding Similar Items A fundamental datamining problem is to examine data for similar items. We shall take up applications in Section 3.1, but an example would be looking at a collection
More informationRanking on Data Manifolds
Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname
More informationMonte Carlo methods in PageRank computation: When one iteration is sufficient
Monte Carlo methods in PageRank computation: When one iteration is sufficient K.Avrachenkov, N. Litvak, D. Nemirovsky, N. Osipova Abstract PageRank is one of the principle criteria according to which Google
More informationEssential Web Pages Are Easy to Find
Essential Web Pages Are Easy to Find Ricardo BaezaYates Yahoo Labs Barcelona Spain rbaeza@acmorg Paolo Boldi Dipartimento di Informatica Università degli Studi di ilano Italy paoloboldi@unimiit Flavio
More informationHelp File Version 1.0.14
Version 1.0.14 By engineering@optimumg.com www.optimumg.com Welcome Thank you for purchasing OptimumT the new benchmark in tire model fitting and analysis. This help file contains information about all
More information2 Basic Concepts and Techniques of Cluster Analysis
The Challenges of Clustering High Dimensional Data * Michael Steinbach, Levent Ertöz, and Vipin Kumar Abstract Cluster analysis divides data into groups (clusters) for the purposes of summarization or
More information