Tutorial: Algorithm Engineering for Big Data

Similar documents
Binary search tree with SIMD bandwidth optimization using SSE

Rethinking SIMD Vectorization for In-Memory Databases

Parallel Computing. Benson Muite. benson.

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Engineering Route Planning Algorithms. Online Topological Ordering for Dense DAGs

Computing Many-to-Many Shortest Paths Using Highway Hierarchies

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Big Data Graph Algorithms

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Processing with Google s MapReduce. Alexandru Costan

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Stream Processing on GPUs Using Distributed Multimedia Middleware

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Virtuoso and Database Scalability

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

GraySort on Apache Spark by Databricks

Chapter 11 I/O Management and Disk Scheduling

Main Memory Data Warehouses

Oracle8i Spatial: Experiences with Extensible Databases

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

CGMgraph/CGMlib: Implementing and Testing CGM Graph Algorithms on PC Clusters

Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)

Binary Heap Algorithms

Vector storage and access; algorithms in GIS. This is lecture 6

FAWN - a Fast Array of Wimpy Nodes

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

4.2 Sorting and Searching

Oracle Database In-Memory The Next Big Thing

Overlapping Data Transfer With Application Execution on Clusters

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

CS2101a Foundations of Programming for High Performance Computing

Project DALKIT (informal working title)

Parallel Programming Survey

Lecture 10 - Functional programming: Hadoop and MapReduce

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

Bringing Big Data Modelling into the Hands of Domain Experts

MEng, BSc Applied Computer Science

Physical Data Organization

Workload Characterization and Analysis of Storage and Bandwidth Needs of LEAD Workspace

Big Data Challenges in Bioinformatics

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK

Optimizing the Performance of Your Longview Application

Assessment Plan for CS and CIS Degree Programs Computer Science Dept. Texas A&M University - Commerce

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Big Graph Processing: Some Background

HPC Programming Framework Research Team

Understanding Neo4j Scalability

High Performance Computing for Operation Research

Distance Degree Sequences for Network Analysis

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Introduction to GPU hardware and to CUDA

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

Performance And Scalability In Oracle9i And SQL Server 2000

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

Building an Inexpensive Parallel Computer

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

Big Table A Distributed Storage System For Data

Benchmarking Cassandra on Violin

Big Data Mining Services and Knowledge Discovery Applications on Clouds

SYSTAP / bigdata. Open Source High Performance Highly Available. 1 bigdata Presented to CSHALS 2/27/2014

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Scalability and Classifications

Partitioning under the hood in MySQL 5.5

Data Processing Solutions - A Case Study

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

Lecture 2 Parallel Programming Platforms

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Optimization of ETL Work Flow in Data Warehouse

Doctor of Philosophy in Computer Science

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

CSE-E5430 Scalable Cloud Computing Lecture 2

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Architectures for Big Data Analytics A database perspective

BSPCloud: A Hybrid Programming Library for Cloud Computing *

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

Program Optimization for Multi-core Architectures

Security-Aware Beacon Based Network Monitoring

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MEng, BSc Computer Science with Artificial Intelligence

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

2009 Oracle Corporation 1

Chapter 7. Using Hadoop Cluster and MapReduce

Transcription:

Tutorial: Algorithm Engineering for Big Data Peter Sanders, Karlsruhe Institute of Technology Efficient algorithms are at the heart of any nontrivial computer application. But how can we obtain innovative algorithmic solutions for demanding application problems with exploding input sizes using complex modern hardware and advanced algorithmic techniques? This tutorial proposes algorithm engineering as a methodology for taking all these issues into account. Algorithm engineering tightly integrates modeling, algorithm design, analysis, implementation and experimental evaluation into a cycle resembling the scientific method used in the natural sciences. Reusable, robust, flexible, and efficient implementations are put into algorithm libraries. Benchmark instances provide further coupling to applications. algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications We begin with examples representing fundamental algorithms and data structures with a particular emphasis on large data sets. We first look at sorting in detail. Then we will have shorter examples for full text indices, priority queue data structures, route planning, graph partitioning, and minimum spanning trees. We will also give examples of future challenges centered on particular big data applications like genome sequencing and phylogenetic tree reconstruction, particle tracking at the CERN LHC, and the SAP-HANA data base, Further Information Duration: half-day Intended Audience: Practitioners with some basic background in algorithms (2nd semester computer science in most German universities) Slides are attached. Some images with unclear copyright are removed

Algorithm Engineering for Big Data Peter Sanders Institute of Theoretical Informatics - Algorithmics Xeon Xeon 2D write buffers ping 16 GB RAM. 240 GB O(D)prefetch buffers D blocks disk scheduling Infiniband switch 400 MB / s node all all read buffers k+o(d) overlap buffers merge buffers 1. k merge merging overlap algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications 0 KIT Peter UniversitySanders: of the State of Baden-Wuerttemberg and National Algorithm Laboratory of Engineering the Helmholtz Association for Big Data Institute of Theoretical Informatics Algorithmics www.kit.edu

Sanders: Algorithm Engineering / Big Data 2 Overview A detailed explanation of algorithm engineering with sorting for (more or less) big inputs as a throughgoing example More Big Data examples from my group [with: David Bader, Veit Batz, Andreas Beckmann, Timo Bingmann, Stefan Burkhardt, Jonathan Dees, Daniel Delling, Roman Dementiev, Daniel Funke, Robert Geisberger, David Hutchinson, Juha Kärkkäinen, Lutz Kettner, Moritz Kobitzsch, Nicolai Leischner, Dennis Luxen, Kurt Mehlhorn, Ulrich Meyer, Henning Meyerhenke, Rolf Möhring, Ingo Müller, Petra Mutzel, Vitaly Osipov, Felix Putze, Günther Quast, Mirko Rahn, Dennis Schieferdecker, Sebastian Schlag, Dominik Schultes, Christian Schulz, Jop Sibeyn, Johannes Singler, Jeff Vitter, Dorothea Wagner, Jan Wassenberg, Martin Weidner, Sebastian Winkel, Emmanuel Ziegler]

Sanders: Algorithm Engineering / Big Data 3 Algorithmics = the systematic design of efficient software and hardware computer science theoretical algorithmics logics efficient soft & hardware correct practical

Sanders: Algorithm Engineering / Big Data 4 (Caricatured) Traditional View: Algorithm Theory models design Theory Practice analysis perf. guarantees deduction implementation applications

Sanders: Algorithm Engineering / Big Data 5 Gaps Between Theory & Practice Theory Practice simple appl. model complex simple machine model real complex algorithms FOR simple advanced data structures arrays,... worst case max complexity measure inputs asympt. O( ) efficiency 42% constant factors

Sanders: Algorithm Engineering / Big Data 6 Algorithmics as Algorithm Engineering algorithm engineering models design analysis?? experiments deduction? perf. guarantees [1]

Sanders: Algorithm Engineering / Big Data 7 Algorithmics as Algorithm Engineering algorithm engineering models analysis deduction perf. guarantees design falsifiable hypotheses induction implementation experiments [1]

Sanders: Algorithm Engineering / Big Data 8 Algorithmics as Algorithm Engineering algorithm engineering realistic models real. design real. analysis deduction perf. guarantees falsifiable hypotheses induction implementation experiments [1]

Sanders: Algorithm Engineering / Big Data 9 Algorithmics as Algorithm Engineering algorithm engineering realistic models design real Inputs analysis deduction perf. guarantees falsifiable hypotheses induction implementation algorithm libraries experiments [1]

Sanders: Algorithm Engineering / Big Data 10 Algorithmics as Algorithm Engineering algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications [1]

Sanders: Algorithm Engineering / Big Data 11 Goals bridge gaps between theory and practice accelerate transfer of algorithmic results into applications keep the advantages of theoretical treatment: generality of solutions and reliabiltiy, predictabilty from performance guarantees algorithm engineering analysis deduction perf. guarantees realistic models design falsifiable hypotheses induction implementation algorithm libraries real Inputs experiments appl. engin. applications

Sanders: Algorithm Engineering / Big Data 12 Bits of History 1843 Algorithms in theory and practice 1950s,1960s Still infancy 1970s,1980s Paper and pencil algorithm theory. Exceptions exist, e.g., [D. Johnson] 1986 Term used by [T. Beth], lecture Algorithmentechnik in Karlsruhe. 1988 Library of Efficient Data Types and Algorithms (LEDA) [2] 1997 Workshop on Algorithm Engineering ESA applied track [G. Italiano] 1997 Term used in US policy paper [Aho, Johnson, Karp, et. al] 1998 Alex workshop in Italy ALENEX

Sanders: Algorithm Engineering / Big Data 13 Realistic Models Theory Practice simple appl. model complex Careful refinements simple machine model real Try to preserve (partial) analyzability / simple results

Sanders: Algorithm Engineering / Big Data 14 Sorting Model Comparison based < true/false arbitrary e.g. integer full information

Sanders: Algorithm Engineering / Big Data 15 Advanced Machine Models[3] RAM / von Neumann External registers ALU freely programmable registers ALU fast memory capacity M freely programmable B count instructions large memory (also) count (block) I/Os [3]

Sanders: Algorithm Engineering / Big Data 16 Distributed Memory [4] 1 2...... P (also) determine communication volume

Sanders: Algorithm Engineering / Big Data 17 Parallel Disks [5] M M B 1 2 B... D

Sanders: Algorithm Engineering / Big Data 18 Set Associative Caches [6] M B B cache sets cache a=2 cache lines of the memory...... main memory

Sanders: Algorithm Engineering / Big Data 19 Branch Prediction [7]

Sanders: Algorithm Engineering / Big Data 20 Hierarchical Parallel External Memory [8] multicore disks MPI...

Sanders: Algorithm Engineering / Big Data 21 Graphics Processing Units [9]

Sanders: Algorithm Engineering / Big Data 22 Combininig Models? design / analyze one aspect at a time hierarchical combination autotuning? Comparison based < true/false arbitrary e.g. integer full information 1 2...... P B cache sets cache lines of the memory cache a=2...... main memory 1 2 M B... D

Sanders: Algorithm Engineering / Big Data 23 Design of algorithms that work well in practice simplicity reuse constant factors exploit easy instances

Sanders: Algorithm Engineering / Big Data 24 Design Sorting...... merge distribute.................. simplicity............ reuse disk scheduling, prefetching, load balancing, sequence partitioning [10, 5, 11, 8] constant factors detailed machine model (caches, TLBs, registers, branch prediction, ILP) [3, 7] instances randomization for difficult instances [5, 8]

Sanders: Algorithm Engineering / Big Data 25 Example: External Sorting [12] n: input size M : internal memory size B: block size registers ALU fast memory capacity M freely programmable B large memory

Sanders: Algorithm Engineering / Big Data 26 Procedure externalmerge(a, b, c :File of Element) x := a.readelement // Assume emptyfile.readelement= y := b.readelement forj :=1to a + b do if x y then c.writeelement(x); x := a.readelement else c.writeelement(y); y := b.readelement a x c b y

Sanders: Algorithm Engineering / Big Data 27 External Binary Merging read filea: a /B. read fileb: b /B. write filec: ( a + b )/B. overall: 2 a + b B a x c b y

Sanders: Algorithm Engineering / Big Data 28 Run Formation Sort input pieces of sizem f M i=0 i=m i=2m run sort internal t i=0 i=m i=2m I/Os: 2 n B

Sanders: Algorithm Engineering / Big Data 29 Sorting by External Binary Merging make_things_ formruns formruns formruns formruns aeghikmnst merge as_simple_as aaeilmpsss aaaeeghiiklmmnpsssst merge _possible_bu aaeilmpsss merge t_no_simpler eilmnoprst bbeeiillmnoopprssstu aaabbeeeeghiiiiklllmmmnnooppprsssssssttu Procedure externalbinarymergesort run formation // I/Os: // 2n/B while more than one run left do // log n M merge pairs of runs output remaining run // : 2 n B ( 1+ log n M // 2n/B )

Sanders: Algorithm Engineering / Big Data 30 Example Numbers: PC 2013 n = 2 40 Byte (1 TB) M = 2 33 Byte (8 GB) B = 2 22 Byte (4 MB) one I/O needs2 5 s (31.25 ms) time =2 n B ( 1+ log n M ) 2 5 s Idea: 8 passes 2passes =2 2 18 (1+7) 2 5 s = 2 17 s 36h

Sanders: Algorithm Engineering / Big Data 31 Multiway Merging Procedure multiwaymerge(a 1,...,a k,c :File of Element) fori:=1tok do x i :=a i.readelement forj :=1to k i=1 a i do findi 1..k that minimizesx i c.writeelement(x i ) x i :=a i.readelement // no I/Os!,O(logk) time............... internal buffers...

Sanders: Algorithm Engineering / Big Data 32 Mulitway Merging Analysis I/Os: read filea i : a i /B. write filec: k i=1 a i /B overall: 2 k i=1 a i B constraint: We needk +1buffer blocks, i.e.,k +1 < M/B............... interne puffer...

Sanders: Algorithm Engineering / Big Data 33 Sorting by Multiway-Merging sort n/m runs withm elements each 2n/B I/Os mergem/b runs at a time unit a single run remains 2n/B I/Os log M/B n M merging phases overall make_things_ - sort(n) := 2n B ( 1+ log M/B n M formruns formruns formruns formruns aeghikmnst as_simple_as aaeilmpsss multi merge _possible_bu aaeilmpsss ) t_no_simpler eilmnoprst aaabbeeeeghiiiiklllmmmnnooppprsssssssttu I/Os

Sanders: Algorithm Engineering / Big Data 34 External Sorting by Multiway-Merging More than one merging phase?: Not for the hierarchy main memory, hard disk. reason: >2000 {}}{ M B > 200 { }} { RAM Euro/bit Platte Euro/bit

Sanders: Algorithm Engineering / Big Data 35 More on Multiway Mergesort Parallel Disks Randomized Striping [5] Optimal Prefetching [5] Overlapping of I/O and Computation [10] ping 2D write buffers. O(D) prefetch buffers D blocks disk scheduling read buffers k+o(d) overlap buffers merge buffers 1. merge k merging overlap

Sanders: Algorithm Engineering / Big Data 36 Shared Memory Multiway Mergesort [11]

Sanders: Algorithm Engineering / Big Data 37 Combinations parallel disk + shared memory: [13] + distributed memory: [8] stay tuned load balancing, randomization, collective communication + energy: [14] stay tuned multicore disks MPI...

Sanders: Algorithm Engineering / Big Data 38 Analysis Constant factors matter Beyond worst case analysis Practical algorithms might be difficult to analyze (randomization, meta heuristics,... )

Sanders: Algorithm Engineering / Big Data 39 Analysis Sorting write. prefetch overlap merge. merge Constant factors matter (1 + o(1)) lower bound [5, 8] I/Os for parallel (disk) external sorting Beyond worst case analysis Practical algorithms might be difficult to analyze Open Problem: [5] greedy algorithm for parallel disk prefetching [Knuth@48]

Sanders: Algorithm Engineering / Big Data 40 Implementation sanity check for algorithms! Challenges Semantic gaps: Abstract algorithm C++... write. prefetch overlap merge. merge hardware Small constant factors: compare highly tuned competitors

Sanders: Algorithm Engineering / Big Data 41 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcount(const T* const a, const int n, const int k, const T* const s, Oracle* const oracle, int* const bucket) { { for (int i = 0; i < n; i++) } } int j = 1; while (j < k) { } j = j*2 + (a[i] > s[j]); int b = j-k; bucket[b]++; oracle[i] = b; < s splitter 4 1 array index > *2 +1 decisions < s 2 s 2 6 3 > < > *2 +1 *2 +1 decisions s 1 4 s 3 5 s 5 6 s 7 7 < > < > < > < > decisions buckets

Sanders: Algorithm Engineering / Big Data 42 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcountunrolled([...]){ } } for (int i = 0; i < n; i++) int j = 1; j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); j = j*2 + (a[i] > s[j]); int b = j-k; bucket[b]++; oracle[i] = b; < s splitter 4 1 array index > *2 +1 decisions < s 2 s 2 6 3 > < > *2 +1 *2 +1 decisions s 1 4 s 3 5 s 5 6 s 7 7 < > < > < > < > decisions buckets

Sanders: Algorithm Engineering / Big Data 43 Example: Inner Loops Sample Sort [7] template <class T> void findoraclesandcountunrolled2([...]){ for (int i = n & 1; i < n; i+=2) {\ int j0 = 1; int j1 = 1; T ai0 = a[i]; T ai1 = a[i+1]; j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]); int b0 = j0-k; int b1 = j1-k; bucket[b0]++; bucket[b1]++; oracle[i] = b0; oracle[i+1] = b1; } }

Sanders: Algorithm Engineering / Big Data 44 Experiments sometimes a good surrogate for analysis too much rather than too little output data reproducibility (10 years!) software engineering

Sanders: Algorithm Engineering / Big Data 45 Example, Parallel External Sorting [10] sort 100GiB per node 4000 3000 sec 2000 1000 0 worst case input worst case input, randomized random input random input, randomized 1 2 4 8 16 32 64 nodes

Sanders: Algorithm Engineering / Big Data 46 Algorithm Libraries Challenges software engineering, e.g. CGAL [www.cgal.org] standardization, e.g. java.util, C++ STL and BOOST performance generality simplicity applications are a priori unknown result checking, verification MCSTL TXXL S 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 Applications 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 STL user layer 0000000000000 1111111111111 Streaming layer 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 Containers: vector, stack, set 0000000000000 1111111111111 00000000000000000 11111111111111111 priority_queue, map 0000000000000 1111111111111 Pipelined sorting, 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 Algorithms: sort, for_each, merge 0000000000000 1111111111111 zero I/O scanning 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 00000000000000000 11111111111111111 0000000000000 1111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 Block management layer 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 typed block, block manager, buffered streams, 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 block prefetcher, buffered block writer 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 Asynchronous I/O primitives layer 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 files, I/O requests, disk queues, 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 completion handlers 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 Operating System 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111

Sanders: Algorithm Engineering / Big Data 47 Example: External Sorting [10, 15] TXXL S Applications 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000 11111111111111111100000000000000 000000000000000000 11111111111111111100000000000000 000000000000000000 11111111111111111100000000000000 000000000000000000 111111111111111111 STL user layer 00000000000000 11111111111111 Streaming layer 000000000000000000 11111111111111111100000000000000 000000000000000000 11111111111111111100000000000000 000000000000000000 11111111111111111100000000000000 000000000000000000 111111111111111111 Containers: vector, stack, set 00000000000000 11111111111111 000000000000000000 111111111111111111 priority_queue, map Pipelined sorting, 00000000000000 11111111111111 000000000000000000 11111111111111111100000000000000 000000000000000000 111111111111111111 Algorithms: sort for_each, merge 00000000000000 11111111111111 zero I/O scanning 000000000000000000 11111111111111111100000000000000 000000000000000000 11111111111111111100000000000000 Block management layer 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 typed block, block manager, buffered streams, 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 block prefetcher, buffered block writer 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 Asynchronous I/O primitives layer 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 files, I/O requests, disk queues, 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 completion handlers 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 Operating System 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 000000000000000000000000000000 111111111111111111111111111111 Linux Windows Mac,...

Sanders: Algorithm Engineering / Big Data 48 Example: Shared Memory Sorting [11, 16] MCSTL STL-alike STL-integrated

Sanders: Algorithm Engineering / Big Data 49 Problem Instances Benchmark instances for NP-hard problems TSP Steiner-Tree SAT set covering graph partitioning [17]... have proved essential for development of practical algorithms Strange: much less real world instances for polynomial problems (MST, shortest path, max flow, matching... )

Sanders: Algorithm Engineering / Big Data 50 Example: Sorting Benchmark (Indy) [8, 14] 100 byte records, 10 byte random keys, with file I/O Category data volume performance improvement GraySort 100 TB 564 GB / min 17 MinuteSort 955 GB 955 GB / min > 10 JouleSort 1 000 GB 13 400 Recs/Joule 4 JouleSort 100 GB 35 500 Recs/Joule 3 JouleSort 10 GB 34 300 Recs/Joule 3 Also: PennySort

Sanders: Algorithm Engineering / Big Data 51 GraySort: inplace multiway mergesort, exact splitting [8] Xeon Xeon 16 GB RAM 240 GB Infiniband switch 400 MB / s node all all

Sanders: Algorithm Engineering / Big Data 52 JouleSort [14] Intel Atom N330 4 GB RAM 4 256 GB SSD (SuperTalent) Algorithm similar to GraySort

Sanders: Algorithm Engineering / Big Data 53 Applications that Change the World Algorithmics has the potential to SHAPE applications (not just the other way round) [G. Myers] Bioinformatics: sequencing, proteomics, phylogenetic trees,... Information Retrieval: Searching, ranking, Traffic Planning: navigation, flow optimization, adaptive toll, disruption management Geographic Information Systems: agriculture, environmental protection, disaster management, tourism,... Communication Networks: mobile, P2P, grid, selfish users,...

Sanders: Algorithm Engineering / Big Data 54 AE for Big Data Techniques data structures graphs geometry strings coding theory... AE Applications sensor data genomes data bases www GIS mobile... Technology parallelism memory hierarchies communication fault tolerance energy experience PS

Sanders: Algorithm Engineering / Big Data 55 Larger Sorting Problems millions of processors multipass algorithms fault tolerance still energy time? Higly related to MapReduce, index construction,...

Sanders: Algorithm Engineering / Big Data 56 More Big Data Examples From my Group Suffix Sorting and its applications Main Memory Data Bases Graph Partitioning Track Reconstruction at CERN Route Planning Genome Sequencing Image Processing Priority Queues

Sanders: Algorithm Engineering / Big Data 57 Suffix Sorting sort suffixess i s n of string S = s 1 s n,s i {1..n}. Applications: full text search, Burrows-Wheeler text compression, bioinformatics,... b a n a n a a ana anana banana na nana E.g. phrase search in time logarithmic or even independent of input size. particularly interesting for large data "to be or not to be" WWW

Sanders: Algorithm Engineering / Big Data 58 Linear Work Suffix Sorting [18] simple: Radix-Sort + linear recursion + merging. trivial external [19], parallel [20] adaptation I 012345678 anananas. nananas.0 sort 3 2 5.00anaananannass.0 ananas.00 1 2 2 3 4 5 2 4 1 lexicographic triple names I 325241

Sanders: Algorithm Engineering / Big Data 59 Current Work distributed memory (external) query parallel distributed construction of query data structure (longest common prefixes,... )

Sanders: Algorithm Engineering / Big Data 60 Data Bases Our Approach [21, 22] [with SAP HANA team, PhD students Dees, Müller] main memory based column based many-core machines NUMA-aware no precomputed aggregates aggressive indexing inverted index forward index generate C++ code close to tuned manual implementation

Sanders: Algorithm Engineering / Big Data 61 TPC-H Decision Support Benchmark 22 realistic queries of varying complexity pseudorealistic random data F GByte space 6M*F TPC H Scheme LINEITEM 800K*F PARTSUPP ORDERS 1.5M*F 10K*F SUPPLIER CUSTOMER 150K*F 200K*F PART F = scale factor =attribute NATION REGION 25 5

Sanders: Algorithm Engineering / Big Data 62 Typical TPC-H Queries Q1: Revenue etc. of all shipped LINEITEMs (aggregated into 6 categories) plain flat scan of all LINEITEMs Q9: Sum profit for all LINEITEMs with a given color for each nation and order year. scan PARTs, use inverted index to access matching LINEITEMs Go down from there using forward indices ( pointers). price... LINEITEM date cost PARTSUPP ORDERS nation SUPPLIER CUSTOMER color PART NATION name REGION

Sanders: Algorithm Engineering / Big Data 63 First Results 30 faster than current record in 300GB category (manual implementation) Compiler: seems to be largely orthogonal to algorithmic and parallelization issues TPC H Scheme 6M*F LINEITEM [21] 800K*F PARTSUPP ORDERS 1.5M*F 10K*F SUPPLIER CUSTOMER 150K*F 200K*F PART F = scale factor =attribute NATION REGION 25 5

1 TB RAM Sanders: Algorithm Engineering / Big Data 64 Larger Inputs Already needed by some large customers of SAP Move to clusters Master thesis Martin Weidner seems to give positive results 256 GB RAM 256 GB RAM 256 GB RAM (5 TPC-H queries) [22] fault tolerance 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB beyond recovery? 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB 16 GB energy efficiency using many small nodes (ARM)? Algorithmic Meat: Randomization, collective communication, communication complexity, sorting, data structures, multi-level memory hierarchies, coding theory

Sanders: Algorithm Engineering / Big Data 65 Graph Partitionierung [23, 24] Input: Graph(V,E) (possibly with node and edge weights),ǫ,k Output:V 1 V k mit V i (1+ǫ) Objective Function: minimize cut V Applications: finite element simulations, VLSI-design, route planning,... Variants: hypergraphs, clustering, different objective functions,... k

Sanders: Algorithm Engineering / Big Data 66 Multilevel Graph Partitioning input graph Output Partition...... local improvement contract initial uncontract partitioning

Sanders: Algorithm Engineering / Big Data 67 Reengineering Multilevel Graph Partitioning distr. evol. Alg. [Alenex12] V F W [ESA11] Cycles a la multigrid input graph Output Partition [IPDPS10] edge ratings match + contract [SEA12]...... local improvement uncontract initial parallel [IPDPS10] n level [ESA10] flows etc. [ESA11] augm. paths [SEA13] partitioning todo

Sanders: Algorithm Engineering / Big Data 68 Our Contribution scalable parallelization KaPPa (matching, edge coloring, evolutionary) thorough reengineering of multilevel approch (use flows, SCCs, BFS, matching, edge coloring, negative cycle detection,... ) high quality (e.g. 90 99% entries in Walshaw s benchmark)

Sanders: Algorithm Engineering / Big Data 69 Large Data Graph Partitioning difficult inputs: social networks, WWW, 3D/4D models, VLSI, knowledge graph? more difficult parallelization

Sanders: Algorithm Engineering / Big Data 70 Future Work parallel external other variants fault tolerant component of a graph processing framework

Sanders: Algorithm Engineering / Big Data 71 Track reconstruction [25] Input: clouds of 10 4 3D points Output:< 10 3 spiral tracks of high energy particles Also cluster tracks by emergence point Large Data??? up to10 5 instances / s cost of processors / energy memory constrained exploit SIMD/GPU parallelism? Algorithmic Meat: Geometric data structures, parallelization, clustering

Sanders: Algorithm Engineering / Big Data 72 Route Planning Large Data 2004: Western European network (18M nodes). Dijkstra s algorithm needs 6s. too much time for servers too much memory for mobile devices inaccurate heuristics with tedious manual preprocessing Our contribution: Automatic preprocessing techniques 10 4 10 6 times faster exact query on servers still instantaneous on mobile devices (external implementation) [26]

Sanders: Algorithm Engineering / Big Data 73 Large Data 2013 1.6G nodes OpenStreetMap routing graph (edge based) billions of GPS traces (+ road based sensors + elevation data) public transportation Potential use: time-dependent edge weights [27] detailed traffic jam detection Google, TomTom,... multi-modal route planning [28] probabilistic route planning attempts really useful detours around traffic jams??? use real time traffic simulation??

Sanders: Algorithm Engineering / Big Data 74 Genome Sequencing [29]: 20 000 CPU hours for shotgun sequencing of the human genome (3 10 9 base pairs, 5 10 times oversampling. Prototypical large data problem? Today: a few minutes on a work station [ZieglerDFMS work in progr.] (use template, modern hardware, AE + cheap sequencing) routine use for personal medicine New Challenge: processing many sequences