Algorithms and Tools for Scalable Graph Analy8cs. Kamesh Madduri

Transcription

1 Algorithms and Tools for Scalable Graph Analy8cs Kamesh Madduri Computer Science and Engineering The Pennsylvania State University MMDS 2012 July 13, 2012

2 This talk: A methodology for blazingly fast graph analy8cs! 2

3 Recent projects involving graph- structured computa8ons De novo Genome Assembly [~ 400 GB seq. data] Distributed BFS and Graph500 [SyntheBc graphs up to a trillion edges] with A. Buluc, SC 11 and DIMACS Challenge 2012 Indexes for SPARQL queries [~ 100 GB RDF data] with K. Wu, SSDBM 11 Parallel centrality computabons [graphs with up to 1 billion edges] with M. Frasca and P. Raghavan, SC 12 Algorithms for k shortest loopless paths [graphs with up to 10 million edges] 3

4 0. Having a reference/correct implementa8on Possible with well- deﬁned problems Six orders of magnitude performance Improvement in the past few years! And parallel programming challenges Challenges: NP- hard problems And BigData 4

5 1. Realis8c performance es8mates My new algorithm takes 10 seconds for data set bigdata-x. Is it fast and efficient? Given a problem, an algorithm, a data set, and a parallel pla`orm, we need an esbmate of execubon Bme. SoluBon: Look beyond asymptobc worst- case analysis Average case analysis Pla`orm- independent algorithm counts ExecuBon Bme in terms of subroubnes/library calls 5

6 e.g., A very crude es8mate Linear work algorithm n = 1 billion verbces m = 10 billion edges Edge represented using 8 bytes Lower bound 80 GB/(~ 50 GB/s read BW) 1.6 seconds 6.25 billion traversed edges per second AMD Magny- Cours 6

7 e.g., Models designed by computer architects Source: Hong, Kim, An analytical model for a GPU Architecture, Proc. ISCA

8 8

9 Borrowing ideas from scien8fic compu8ng O( log(n) ) O( 1 ) O( N ) Sparse Graph computations A r i t h m e t i c I n t e n s i t y SpMV, BLAS1,2 Stencils (PDEs) Lattice Methods FFTs Dense Linear Algebra (BLAS3) Particle Methods ArithmeBc Intensity Total floabng point operabons/total DRAM bytes An algorithm/implementabon can be Compute- bound, Bandwidth- bound, Latency- bound A naïve sparse graph implementabon is almost always latency- bound We d like to make them compute- bound 9

10 A simpler es8mate of level- synchronous parallel BFS execu8on 8me The granularity of algorithm analysis I propose: Intra-node memory cost: β L m p + α L, n / p n + m p Inter-node communication cost: edgecut β N, a2a ( p) + α p N p Inverse local RAM bandwidth Local latency on working set n/p All-to-all remote bandwidth with p participating processors

11 2. Data- centric op8miza8ons Frequency Human Protein Interaction Network (18669 proteins, interactions) Vertex Degree We need new algorithms! Reducing computabon sampling- based methods approximabon algorithms Linear- work heurisbcs Improving parallel scaling and efficiency Reducing memory ublizabon; communicabon- minimizing graph layout; improving computabonal load balance Orders of magnitude speedup possible! High- dimensional data, Low graph diameter, Skewed degree distribubons 11

12 3. Design memory hierarchy- aware, cache- aware shared memory algorithms Move your algorithm from latency- bound to bandwidth- bound regime UBlize shared caches more efficiently Maximize memory bandwidth Strategies Reduce synchronizabon Reordering Blocked data structures Locality- aware work queues Can lead up to an order of magnitude speedup! 12

13 Betweenness Centrality Shortest paths- based centrality metric O(mn) work serial algorithm [Brandes, 2001] Our prior work Sampling- based Approximate BC computabon Memory- efficient inner- loop parallelizabon Lock- free strategy ReducBon in non- conbguous memory references Recent improvements [Frasca et al., SC 12] (Parallel) low- overhead graph reordering NUMA- aware work scheduling 13

14 Experimental study: several different graphs Performance results on a quad- socket Intel Westmere- EX server Xeon E processors 2.6 GHz processor 24 MB L3 cache per processor 256 GB memory (vertex and edge counts in millions) 14

15 Enhanced scaling on NUMA pla]orms 15

16 Op8miza8ons enable reduced per- thread working sets 16

17 and improve cache hit rates 17

18 4. Space/8me tradeoffs HPC systems typically provide greater amounts of fast memory Use it (judiciously) Some good examples Preprocessing to store auxiliary informabon replicate shared data structures data layout to minimize inter- node communicabon., i.e., replicabon Bad example Beefy in- memory representabon 18

19 5. Not all big data problems are big graph problems: turn yours into a small graph problem May want to operate on local structure, an induced subgraph with verbces of interest Exploit hierarchical structure in networks Sparsify networks Divide and conquer Genome assembly: sequence data can be reduced into a graph problem that is two orders of magnitude smaller 19

20 De novo Genome Assembly conbgs ACACGTGTGCACTACTGCACTCTACTCCACTGACTA Scaffold the contigs Align the reads Short reads ~ 100 s GB/run ACATCGTCTG TCGCGCTGAA nucleobde Genome ~ billions of nucleo8des Genome assembler Sequencer ~ PB/yr Sample 20

21 De novo Metagenome Assembly ACACGTGTGCACTACTGCACTCTACTCCACTGACTA ACACGTGTGCACTACTGCACTCTACTCCACTGACTA ACACGTGTGCACTACTGCACTCTACTCCACTGACTA ACACGTGTGCACTACTGCACTCTACTCCACTGACTA conbgs De Bruijn graph Scaffold the contigs Short reads ~ 100 s GB/run Align the reads ACATCGTCTG TCGCGCTGAA 100 s 1000 s of organisms ~ millions of bases each Parallel metagenome assembly Sequencer ~ PB/yr Sample 21

22 6. Par88on, if you must High- dimensional data, lack of balanced separators ImplicaBons for memory- intensive graph computabons O(m) inter- node communicabon, O(m) local memory references network bandwidths/latencies will be the primary performance limiters Load balancing is non- trivial the trivial solubon is randomly shuffling vertex idenbfiers, but that destroys locality 22

23 Graph500 (began Nov 2010) Parallel BFS (from a single vertex) on a stabc, undirected synthebc network (R- MAT generator) with average vertex degree 16. EvaluaBon criteria: minimum execubon Bme largest problem size Reference distributed and shared memory implementabons provided. Computers/supercomputers ranked every 6 months 23

24 Graph500 Performance: top parallel system Billions of traversed edges/second IBM BlueGene/P 8192 nodes Nov 10 Jun 11 Nov 11 Jun 12 ANL Mira IBM BlueGene/Q nodes Best single-node performance Convey HC-2EX 7.85 GTEPS 24

25 Graph500 Normalized Performance (per node) Millions of traversed edges/ second #2 #2 #2 #8 Top entries LBNL/NERSC submissions 0 Nov 10 Jun 11 Nov 11 Jun 12 25

26 Graph500 Normalized Performance (per node) Millions of traversed edges/ second #2 #2 #2 #8 Nov 10 Jun 11 Nov 11 Jun 12 Top entries LBNL/NERSC submissions 500 nodes of Cray XT4 MPI- only All- to- all communicabon limited performance CompeBBon ranking criterion: largest problem size 26

27 Graph500 Normalized Performance (per node) Millions of traversed edges/ second #2 #2 #2 #8 Nov 10 Jun 11 Nov 11 Jun 12 Top entries LBNL/NERSC submissions 1800 nodes of Cray XE6 MPI + OpenMP All- to- all communicabon limited performance CompeBBon ranking criterion: largest problem size 27

28 Graph500 Normalized Performance (per node) Millions of traversed edges/ second #2 #2 #2 #8 Nov 10 Jun 11 Nov 11 Jun 12 Top entries LBNL/NERSC submissions 4000 nodes of Cray XE6 MPI + OpenMP CompeBBon ranking criterion: peak performance => Smaller problem size 2D parbboning Led to a 2X performance improvement 28

29 Graph500 Normalized Performance (per node) Millions of traversed edges/ second #2 #2 #2 #8 Nov 10 Jun 11 Nov 11 Jun 12 Top entries LBNL/NERSC submissions Ninja programming! 4817 nodes of Cray XE6 HeurisBc to reduce memory references in power- law graphs [Beamer et al., 2011] Again, led to a 2X performance improvement 29

30 Balanced par88ons, reduced edge cut does not necessarily mean faster graph algorithm execu8on ExecuBon Bmeline for parallel BFS on a web crawl (eu- 2005) 16 nodes of Cray XE6 (Bmes with 4 nodes shown below) 30

31 7. Adapt exis8ng scalable frameworks/tools for your problem Problems amenable to a MapReduce- style of execubon Borrow ideas from scienbfic compubng, parbcularly parallel sparse linear algebra Our recent work: AdapBng FastBit, a compressed bitmap index, to speed up SPARQL queries 31

32 Seman8c data analysis and RDF The RDF (Resource DescripBon Framework) data model is a popular abstracbon for linked data repositories Records in triple form [<subject> <predicate> <object>] Data sets with a few billion triples quite common Triple- stores: custom databases for storage and retrieval of RDF data Jena, Virtuoso, Sesame

33 SPARQL Query language expressing conjuncbons and disjuncbons of triple pa{erns Each conjuncbon corresponds to a database join SPARQL queries can be viewed as graph pa{ern- matching problems Example query from the Lehigh University Benchmark Suite (LUBM): select?x?y?z where { }?x rdf:type ub:graduatestudent.?y rdf:type ub:university.?z rdf:type ub:department.?x ub:memberof?z.?z ub:suborganizabonof?y.?x ub:undergraduatedegreefrom?y.

34 FastBit+RDF: Our Contribu8ons We use the compressed bitmap indexing so~ware FastBit to index RDF data Several different types of bitmap indexes Scalable parallel index construcbon We present a new SPARQL query evaluabon approach Pa{ern- matching queries on RDF data are modified to use bitmap indexes Our approach is X faster than the RDF- 3X SPARQL query so~ware Speedup insight: The nested joins in SPARQL queries can be expressed as fast bitvector operabons.

35 8. Always ques8on conven8onal wisdom With appropriate sanity checks i.e., O(n 2 ) algorithms aren t a good idea for massive data, even on massively parallel systems Several innovabve ideas from this workshop 35

36 Summary: Methodology for High- performance large graph analy8cs 1. Performance Models 2. Data- centric alg. 3. Memory opt. 4. Space/Bme tradeoffs 5. Reduce problem size 6. Scale out 7. Adapt current state- of- the- art tools My recent research contribu8ons Parallel Centrality Genome assembly Parallel BFS SPARQL queries 36

37 Acknowledgments M. Frasca, P.Raghavan M. Poss, M. Roossinck A. Buluc K. Wu, S. Williams, L. Oliker V. Markowitz, K. Yelick, R. Egan 37

38 Thank you! QuesBons? madduri.org 38