Tutorial: Algorithm Engineering for Big Data

Transcription

1 Tutorial: Algorithm Engineering for Big Data Peter Sanders, Karlsruhe Institute of Technology Efficient algorithms are at the heart of any nontrivial computer application. But how can we obtain innovative algorithmic solutions for demanding application problems with exploding input sizes using complex modern hardware and advanced algorithmic techniques? This tutorial proposes algorithm engineering as a methodology for taking all these issues into account. Algorithm engineering tightly integrates modeling, algorithm design, analysis, implementation and experimental evaluation into a cycle resembling the scientific method used in the natural sciences. Reusable, robust, flexible, and efficient implementations are put into algorithm libraries. Benchmark instances provide further coupling to applications. algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications We begin with examples representing fundamental algorithms and data structures with a particular emphasis on large data sets. We first look at sorting in detail. Then we will have shorter examples for full text indices, priority queue data structures, route planning, graph partitioning, and minimum spanning trees. We will also give examples of future challenges centered on particular big data applications like genome sequencing and phylogenetic tree reconstruction, particle tracking at the CERN LHC, and the SAP-HANA data base, Further Information Duration: half-day Intended Audience: Practitioners with some basic background in algorithms (2nd semester computer science in most German universities) Slides are attached. Some images with unclear copyright are removed

2 Algorithm Engineering for Big Data Peter Sanders Institute of Theoretical Informatics - Algorithmics Xeon Xeon 2D write buffers ping 16 GB RAM. 240 GB O(D)prefetch buffers D blocks disk scheduling Infiniband switch 400 MB / s node all all read buffers k+o(d) overlap buffers merge buffers 1. k merge merging overlap algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications 0 KIT Peter UniversitySanders: of the State of Baden-Wuerttemberg and National Algorithm Laboratory of Engineering the Helmholtz Association for Big Data Institute of Theoretical Informatics Algorithmics

3 Sanders: Algorithm Engineering / Big Data 2 Overview A detailed explanation of algorithm engineering with sorting for (more or less) big inputs as a throughgoing example More Big Data examples from my group [with: David Bader, Veit Batz, Andreas Beckmann, Timo Bingmann, Stefan Burkhardt, Jonathan Dees, Daniel Delling, Roman Dementiev, Daniel Funke, Robert Geisberger, David Hutchinson, Juha Kärkkäinen, Lutz Kettner, Moritz Kobitzsch, Nicolai Leischner, Dennis Luxen, Kurt Mehlhorn, Ulrich Meyer, Henning Meyerhenke, Rolf Möhring, Ingo Müller, Petra Mutzel, Vitaly Osipov, Felix Putze, Günther Quast, Mirko Rahn, Dennis Schieferdecker, Sebastian Schlag, Dominik Schultes, Christian Schulz, Jop Sibeyn, Johannes Singler, Jeff Vitter, Dorothea Wagner, Jan Wassenberg, Martin Weidner, Sebastian Winkel, Emmanuel Ziegler]

4 Sanders: Algorithm Engineering / Big Data 3 Algorithmics = the systematic design of efficient software and hardware computer science theoretical algorithmics logics efficient soft & hardware correct practical

5 Sanders: Algorithm Engineering / Big Data 4 (Caricatured) Traditional View: Algorithm Theory models design Theory Practice analysis perf. guarantees deduction implementation applications

6 Sanders: Algorithm Engineering / Big Data 5 Gaps Between Theory & Practice Theory Practice simple appl. model complex simple machine model real complex algorithms FOR simple advanced data structures arrays,... worst case max complexity measure inputs asympt. O( ) efficiency 42% constant factors

7 Sanders: Algorithm Engineering / Big Data 6 Algorithmics as Algorithm Engineering algorithm engineering models design analysis?? experiments deduction? perf. guarantees [1]

8 Sanders: Algorithm Engineering / Big Data 7 Algorithmics as Algorithm Engineering algorithm engineering models analysis deduction perf. guarantees design falsifiable hypotheses induction implementation experiments [1]

9 Sanders: Algorithm Engineering / Big Data 8 Algorithmics as Algorithm Engineering algorithm engineering realistic models real. design real. analysis deduction perf. guarantees falsifiable hypotheses induction implementation experiments [1]

10 Sanders: Algorithm Engineering / Big Data 9 Algorithmics as Algorithm Engineering algorithm engineering realistic models design real Inputs analysis deduction perf. guarantees falsifiable hypotheses induction implementation algorithm libraries experiments [1]

11 Sanders: Algorithm Engineering / Big Data 10 Algorithmics as Algorithm Engineering algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications [1]

12 Sanders: Algorithm Engineering / Big Data 11 Goals bridge gaps between theory and practice accelerate transfer of algorithmic results into applications keep the advantages of theoretical treatment: generality of solutions and reliabiltiy, predictabilty from performance guarantees algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications

13 Sanders: Algorithm Engineering / Big Data 12 Bits of History 1843 Algorithms in theory and practice 1950s,1960s Still infancy 1970s,1980s Paper and pencil algorithm theory. Exceptions exist, e.g., [D. Johnson] 1986 Term used by [T. Beth], lecture Algorithmentechnik in Karlsruhe Library of Efficient Data Types and Algorithms (LEDA) [2] 1997 Workshop on Algorithm Engineering ESA applied track [G. Italiano] 1997 Term used in US policy paper [Aho, Johnson, Karp, et. al] 1998 Alex workshop in Italy ALENEX

14 Sanders: Algorithm Engineering / Big Data 13 Realistic Models Theory Practice simple appl. model complex Careful refinements simple machine model real Try to preserve (partial) analyzability / simple results

15 Sanders: Algorithm Engineering / Big Data 14 Sorting Model Comparison based < true/false arbitrary e.g. integer full information

16 Sanders: Algorithm Engineering / Big Data 15 Advanced Machine Models[3] RAM / von Neumann External registers ALU freely programmable registers ALU fast memory capacity M freely programmable B count instructions large memory (also) count (block) I/Os [3]

17 Sanders: Algorithm Engineering / Big Data 16 Distributed Memory [4] P (also) determine communication volume

18 Sanders: Algorithm Engineering / Big Data 17 Parallel Disks [5] M M B 1 2 B... D

19 Sanders: Algorithm Engineering / Big Data 18 Set Associative Caches [6] M B B cache sets cache a=2 cache lines of the memory main memory

20 Sanders: Algorithm Engineering / Big Data 19 Branch Prediction [7]

21 Sanders: Algorithm Engineering / Big Data 20 Hierarchical Parallel External Memory [8] multicore disks MPI...

22 Sanders: Algorithm Engineering / Big Data 21 Graphics Processing Units [9]

23 Sanders: Algorithm Engineering / Big Data 22 Combininig Models? design / analyze one aspect at a time hierarchical combination autotuning? Comparison based < true/false arbitrary e.g. integer full information P B cache sets cache lines of the memory cache a= main memory 1 2 M B... D

24 Sanders: Algorithm Engineering / Big Data 23 Design of algorithms that work well in practice simplicity reuse constant factors exploit easy instances

25 Sanders: Algorithm Engineering / Big Data 24 Design Sorting merge distribute simplicity reuse disk scheduling, prefetching, load balancing, sequence partitioning [10, 5, 11, 8] constant factors detailed machine model (caches, TLBs, registers, branch prediction, ILP) [3, 7] instances randomization for difficult instances [5, 8]

26 Sanders: Algorithm Engineering / Big Data 25 Example: External Sorting [12] n: input size M : internal memory size B: block size registers ALU fast memory capacity M freely programmable B large memory

27 Sanders: Algorithm Engineering / Big Data 26 Procedure externalmerge(a, b, c :File of Element) x := a.readelement // Assume emptyfile.readelement= y := b.readelement forj :=1to a + b do if x y then c.writeelement(x); x := a.readelement else c.writeelement(y); y := b.readelement a x c b y

28 Sanders: Algorithm Engineering / Big Data 27 External Binary Merging read filea: a /B. read fileb: b /B. write filec: ( a + b )/B. overall: 2 a + b B a x c b y

29 Sanders: Algorithm Engineering / Big Data 28 Run Formation Sort input pieces of sizem f M i=0 i=m i=2m run sort internal t i=0 i=m i=2m I/Os: 2 n B

30 Sanders: Algorithm Engineering / Big Data 29 Sorting by External Binary Merging make_things_ formruns formruns formruns formruns aeghikmnst merge as_simple_as aaeilmpsss aaaeeghiiklmmnpsssst merge _possible_bu aaeilmpsss merge t_no_simpler eilmnoprst bbeeiillmnoopprssstu aaabbeeeeghiiiiklllmmmnnooppprsssssssttu Procedure externalbinarymergesort run formation // I/Os: // 2n/B while more than one run left do // log n M merge pairs of runs output remaining run // : 2 n B ( 1+ log n M // 2n/B )

31 Sanders: Algorithm Engineering / Big Data 30 Example Numbers: PC 2013 n = 2 40 Byte (1 TB) M = 2 33 Byte (8 GB) B = 2 22 Byte (4 MB) one I/O needs2 5 s (31.25 ms) time =2 n B ( 1+ log n M ) 2 5 s Idea: 8 passes 2passes = (1+7) 2 5 s = 2 17 s 36h

32 Sanders: Algorithm Engineering / Big Data 31 Multiway Merging Procedure multiwaymerge(a 1,...,a k,c :File of Element) fori:=1tok do x i :=a i.readelement forj :=1to k i=1 a i do findi 1..k that minimizesx i c.writeelement(x i ) x i :=a i.readelement // no I/Os!,O(logk) time internal buffers...

33 Sanders: Algorithm Engineering / Big Data 32 Mulitway Merging Analysis I/Os: read filea i : a i /B. write filec: k i=1 a i /B overall: 2 k i=1 a i B constraint: We needk +1buffer blocks, i.e.,k +1 < M/B interne puffer...

34 Sanders: Algorithm Engineering / Big Data 33 Sorting by Multiway-Merging sort n/m runs withm elements each 2n/B I/Os mergem/b runs at a time unit a single run remains 2n/B I/Os log M/B n M merging phases overall make_things_ - sort(n) := 2n B ( 1+ log M/B n M formruns formruns formruns formruns aeghikmnst as_simple_as aaeilmpsss multi merge _possible_bu aaeilmpsss ) t_no_simpler eilmnoprst aaabbeeeeghiiiiklllmmmnnooppprsssssssttu I/Os

35 Sanders: Algorithm Engineering / Big Data 34 External Sorting by Multiway-Merging More than one merging phase?: Not for the hierarchy main memory, hard disk. reason: >2000 {}}{ M B > 200 { }} { RAM Euro/bit Platte Euro/bit

36 Sanders: Algorithm Engineering / Big Data 35 More on Multiway Mergesort Parallel Disks Randomized Striping [5] Optimal Prefetching [5] Overlapping of I/O and Computation [10] ping 2D write buffers. O(D) prefetch buffers D blocks disk scheduling read buffers k+o(d) overlap buffers merge buffers 1. merge k merging overlap

37 Sanders: Algorithm Engineering / Big Data 36 Shared Memory Multiway Mergesort [11]

38 Sanders: Algorithm Engineering / Big Data 37 Combinations parallel disk + shared memory: [13] + distributed memory: [8] stay tuned load balancing, randomization, collective communication + energy: [14] stay tuned multicore disks MPI...

39 Sanders: Algorithm Engineering / Big Data 38 Analysis Constant factors matter Beyond worst case analysis Practical algorithms might be difficult to analyze (randomization, meta heuristics,... )

40 Sanders: Algorithm Engineering / Big Data 39 Analysis Sorting write. prefetch overlap merge. merge Constant factors matter (1 + o(1)) lower bound [5, 8] I/Os for parallel (disk) external sorting Beyond worst case analysis Practical algorithms might be difficult to analyze Open Problem: [5] greedy algorithm for parallel disk prefetching [Knuth@48]

41 Sanders: Algorithm Engineering / Big Data 40 Implementation sanity check for algorithms! Challenges Semantic gaps: Abstract algorithm C++... write. prefetch overlap merge. merge hardware Small constant factors: compare highly tuned competitors

42 Sanders: Algorithm Engineering / Big Data 41 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcount(const T* const a, const int n, const int k, const T* const s, Oracle* const oracle, int* const bucket) { { for (int i = 0; i < n; i++) } } int j = 1; while (j < k) { } j = j*2 + (a[i] > s[j]); int b = j-k; bucket[b]++; oracle[i] = b; < s splitter 4 1 array index > *2 +1 decisions < s 2 s > < > *2 +1 *2 +1 decisions s 1 4 s 3 5 s 5 6 s 7 7 < > < > < > < > decisions buckets

43 Sanders: Algorithm Engineering / Big Data 42 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcountunrolled([...]){ } } for (int i = 0; i < n; i++) int j = 1; j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); int b = j-k; bucket[b]++; oracle[i] = b; < s splitter 4 1 array index > *2 +1 decisions < s 2 s > < > *2 +1 *2 +1 decisions s 1 4 s 3 5 s 5 6 s 7 7 < > < > < > < > decisions buckets

44 Sanders: Algorithm Engineering / Big Data 43 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcountunrolled2([...]){ for (int i = n & 1; i < n; i+=2) {\ int j0 = 1; int j1 = 1; T ai0 = a[i]; T ai1 = a[i+1]; j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); int b0 = j0-k; int b1 = j1-k; bucket[b0]++; bucket[b1]++; oracle[i] = b0; oracle[i+1] = b1; } }

45 Sanders: Algorithm Engineering / Big Data 44 Experiments sometimes a good surrogate for analysis too much rather than too little output data reproducibility (10 years!) software engineering

46 Sanders: Algorithm Engineering / Big Data 45 Example, Parallel External Sorting [10] sort 100GiB per node sec worst case input worst case input, randomized random input random input, randomized nodes

47 Sanders: Algorithm Engineering / Big Data 46 Algorithm Libraries Challenges software engineering, e.g. CGAL [ standardization, e.g. java.util, C++ STL and BOOST performance generality simplicity applications are a priori unknown result checking, verification MCSTL TXXL S Applications STL user layer Streaming layer Containers: vector, stack, set priority_queue, map Pipelined sorting, Algorithms: sort, for_each, merge zero I/O scanning Block management layer typed block, block manager, buffered streams, block prefetcher, buffered block writer Asynchronous I/O primitives layer files, I/O requests, disk queues, completion handlers Operating System

48 Sanders: Algorithm Engineering / Big Data 47 Example: External Sorting [10, 15] TXXL S Applications STL user layer Streaming layer Containers: vector, stack, set priority_queue, map Pipelined sorting, Algorithms: sort for_each, merge zero I/O scanning Block management layer typed block, block manager, buffered streams, block prefetcher, buffered block writer Asynchronous I/O primitives layer files, I/O requests, disk queues, completion handlers Operating System Linux Windows Mac,...

49 Sanders: Algorithm Engineering / Big Data 48 Example: Shared Memory Sorting [11, 16] MCSTL STL-alike STL-integrated

50 Sanders: Algorithm Engineering / Big Data 49 Problem Instances Benchmark instances for NP-hard problems TSP Steiner-Tree SAT set covering graph partitioning [17]... have proved essential for development of practical algorithms Strange: much less real world instances for polynomial problems (MST, shortest path, max flow, matching... )

51 Sanders: Algorithm Engineering / Big Data 50 Example: Sorting Benchmark (Indy) [8, 14] 100 byte records, 10 byte random keys, with file I/O Category data volume performance improvement GraySort 100 TB 564 GB / min 17 MinuteSort 955 GB 955 GB / min > 10 JouleSort GB Recs/Joule 4 JouleSort 100 GB Recs/Joule 3 JouleSort 10 GB Recs/Joule 3 Also: PennySort

52 Sanders: Algorithm Engineering / Big Data 51 GraySort: inplace multiway mergesort, exact splitting [8] Xeon Xeon 16 GB RAM 240 GB Infiniband switch 400 MB / s node all all

53 Sanders: Algorithm Engineering / Big Data 52 JouleSort [14] Intel Atom N330 4 GB RAM GB SSD (SuperTalent) Algorithm similar to GraySort

54 Sanders: Algorithm Engineering / Big Data 53 Applications that Change the World Algorithmics has the potential to SHAPE applications (not just the other way round) [G. Myers] Bioinformatics: sequencing, proteomics, phylogenetic trees,... Information Retrieval: Searching, ranking, Traffic Planning: navigation, flow optimization, adaptive toll, disruption management Geographic Information Systems: agriculture, environmental protection, disaster management, tourism,... Communication Networks: mobile, P2P, grid, selfish users,...

55 Sanders: Algorithm Engineering / Big Data 54 AE for Big Data Techniques data structures graphs geometry strings coding theory... AE Applications sensor data genomes data bases www GIS mobile... Technology parallelism memory hierarchies communication fault tolerance energy experience PS

56 Sanders: Algorithm Engineering / Big Data 55 Larger Sorting Problems millions of processors multipass algorithms fault tolerance still energy time? Higly related to MapReduce, index construction,...

57 Sanders: Algorithm Engineering / Big Data 56 More Big Data Examples From my Group Suffix Sorting and its applications Main Memory Data Bases Graph Partitioning Track Reconstruction at CERN Route Planning Genome Sequencing Image Processing Priority Queues

58 Sanders: Algorithm Engineering / Big Data 57 Suffix Sorting sort suffixess i s n of string S = s 1 s n,s i {1..n}. Applications: full text search, Burrows-Wheeler text compression, bioinformatics,... b a n a n a a ana anana banana na nana E.g. phrase search in time logarithmic or even independent of input size. particularly interesting for large data "to be or not to be" WWW

59 Sanders: Algorithm Engineering / Big Data 58 Linear Work Suffix Sorting [18] simple: Radix-Sort + linear recursion + merging. trivial external [19], parallel [20] adaptation I anananas. nananas.0 sort anaananannass.0 ananas lexicographic triple names I

60 Sanders: Algorithm Engineering / Big Data 59 Current Work distributed memory (external) query parallel distributed construction of query data structure (longest common prefixes,... )

61 Sanders: Algorithm Engineering / Big Data 60 Data Bases Our Approach [21, 22] [with SAP HANA team, PhD students Dees, Müller] main memory based column based many-core machines NUMA-aware no precomputed aggregates aggressive indexing inverted index forward index generate C++ code close to tuned manual implementation

62 Sanders: Algorithm Engineering / Big Data 61 TPC-H Decision Support Benchmark 22 realistic queries of varying complexity pseudorealistic random data F GByte space 6M*F TPC H Scheme LINEITEM 800K*F PARTSUPP ORDERS 1.5M*F 10K*F SUPPLIER CUSTOMER 150K*F 200K*F PART F = scale factor =attribute NATION REGION 25 5

63 Sanders: Algorithm Engineering / Big Data 62 Typical TPC-H Queries Q1: Revenue etc. of all shipped LINEITEMs (aggregated into 6 categories) plain flat scan of all LINEITEMs Q9: Sum profit for all LINEITEMs with a given color for each nation and order year. scan PARTs, use inverted index to access matching LINEITEMs Go down from there using forward indices ( pointers). price... LINEITEM date cost PARTSUPP ORDERS nation SUPPLIER CUSTOMER color PART NATION name REGION

64 Sanders: Algorithm Engineering / Big Data 63 First Results 30 faster than current record in 300GB category (manual implementation) Compiler: seems to be largely orthogonal to algorithmic and parallelization issues TPC H Scheme 6M*F LINEITEM [21] 800K*F PARTSUPP ORDERS 1.5M*F 10K*F SUPPLIER CUSTOMER 150K*F 200K*F PART F = scale factor =attribute NATION REGION 25 5

65 1 TB RAM Sanders: Algorithm Engineering / Big Data 64 Larger Inputs Already needed by some large customers of SAP Move to clusters Master thesis Martin Weidner seems to give positive results 256 GB RAM 256 GB RAM 256 GB RAM (5 TPC-H queries) [22] fault tolerance 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB beyond recovery? 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB energy efficiency using many small nodes (ARM)? Algorithmic Meat: Randomization, collective communication, communication complexity, sorting, data structures, multi-level memory hierarchies, coding theory

66 Sanders: Algorithm Engineering / Big Data 65 Graph Partitionierung [23, 24] Input: Graph(V,E) (possibly with node and edge weights),ǫ,k Output:V 1 V k mit V i (1+ǫ) Objective Function: minimize cut V Applications: finite element simulations, VLSI-design, route planning,... Variants: hypergraphs, clustering, different objective functions,... k

67 Sanders: Algorithm Engineering / Big Data 66 Multilevel Graph Partitioning input graph Output Partition local improvement contract initial uncontract partitioning

68 Sanders: Algorithm Engineering / Big Data 67 Reengineering Multilevel Graph Partitioning distr. evol. Alg. [Alenex12] V F W [ESA11] Cycles a la multigrid input graph Output Partition [IPDPS10] edge ratings match + contract [SEA12] local improvement uncontract initial parallel [IPDPS10] n level [ESA10] flows etc. [ESA11] augm. paths [SEA13] partitioning todo

69 Sanders: Algorithm Engineering / Big Data 68 Our Contribution scalable parallelization KaPPa (matching, edge coloring, evolutionary) thorough reengineering of multilevel approch (use flows, SCCs, BFS, matching, edge coloring, negative cycle detection,... ) high quality (e.g % entries in Walshaw s benchmark)

70 Sanders: Algorithm Engineering / Big Data 69 Large Data Graph Partitioning difficult inputs: social networks, WWW, 3D/4D models, VLSI, knowledge graph? more difficult parallelization

71 Sanders: Algorithm Engineering / Big Data 70 Future Work parallel external other variants fault tolerant component of a graph processing framework

72 Sanders: Algorithm Engineering / Big Data 71 Track reconstruction [25] Input: clouds of D points Output:< 10 3 spiral tracks of high energy particles Also cluster tracks by emergence point Large Data??? up to10 5 instances / s cost of processors / energy memory constrained exploit SIMD/GPU parallelism? Algorithmic Meat: Geometric data structures, parallelization, clustering

73 Sanders: Algorithm Engineering / Big Data 72 Route Planning Large Data 2004: Western European network (18M nodes). Dijkstra s algorithm needs 6s. too much time for servers too much memory for mobile devices inaccurate heuristics with tedious manual preprocessing Our contribution: Automatic preprocessing techniques times faster exact query on servers still instantaneous on mobile devices (external implementation) [26]

74 Sanders: Algorithm Engineering / Big Data 73 Large Data G nodes OpenStreetMap routing graph (edge based) billions of GPS traces (+ road based sensors + elevation data) public transportation Potential use: time-dependent edge weights [27] detailed traffic jam detection Google, TomTom,... multi-modal route planning [28] probabilistic route planning attempts really useful detours around traffic jams??? use real time traffic simulation??

75 Sanders: Algorithm Engineering / Big Data 74 Genome Sequencing [29]: CPU hours for shotgun sequencing of the human genome ( base pairs, 5 10 times oversampling. Prototypical large data problem? Today: a few minutes on a work station [ZieglerDFMS work in progr.] (use template, modern hardware, AE + cheap sequencing) routine use for personal medicine New Challenge: processing many sequences