Locality-Sensitive Operators for Parallel Main-Memory Database Clusters



Similar documents
FPGA-based Multithreading for In-Memory Hash Joins

Hadoop's Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3rd, 2010

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Can the Elephants Handle the NoSQL Onslaught?

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Beyond the Stars: Revisiting Virtual Cluster Embeddings

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Integrating Apache Spark with an Enterprise Data Warehouse

Chapter 14: Distributed Operating Systems

Module 15: Network Structures

In-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Chapter 16: Distributed Operating Systems

Operating System Concepts. Operating System — Department of Computer Science, Prof. Hsien-Ming Yuan

Report: Declarative Machine Learning on MapReduce (SystemML)

Topological Properties

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Parallel Computing. Benson Muite. benson.

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Components: Interconnect Page 1 of 18

Big Data and Apache Hadoop's MapReduce

GeoGrid Project and Experiences with Hadoop

Advances in Virtualization In Support of In-Memory Big Data Applications

Capacity Planning Guide for Adobe LiveCycle Data Services 2.6

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Using diversification, communication and parallelism to solve mixed-integer linear programs

Query Optimization in Cloud Environment

MAPREDUCE Programming Model

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Distributed Architecture of Oracle Database In-memory

Understanding the Benefits of IBM SPSS Statistics Server

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

VMWARE WHITE PAPER 1

Bringing Big Data Modelling into the Hands of Domain Experts

Big Data Processing with Google's MapReduce. Alexandru Costan

Architectures for Big Data Analytics A database perspective

THE PROBLEM WORMS (1) WORMS (2) THE PROBLEM OF WORM PROPAGATION/PREVENTION THE MINIMUM VERTEX COVER PROBLEM

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Chapter 15: Distributed Structures. Topology

Binary search tree with SIMD bandwidth optimization using SSE

DISTRIBUTED AND PARALLEL DATABASE

New Approach of Computing Data Cubes in Data Warehousing

Developing MapReduce Programs

Resource Allocation Schemes for Gang Scheduling

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

RevoScaleR Speed and Scalability

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Management & Analysis of Big Data in Zenith Team

CSE-E5430 Scalable Cloud Computing Lecture 2

Graph Database Proof of Concept Report

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

A Comparison of General Approaches to Multiprocessor Scheduling

FAST 11. Yongseok Oh University of Seoul. Mobile Embedded System Laboratory

BigMemory & Hybris : Working together to improve the e-commerce customer experience

A1 and FARM scalable graph database on top of a transactional memory layer

Research Statement Immanuel Trummer

SQL Server 2005 Features Comparison

Scalable Architecture on Amazon AWS Cloud

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Database Scalability and Oracle 12c

How To Scale Out Of A Nosql Database

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Performance Analysis of Cloud Relational Database Services

Cloud Computing on Amazon's EC2

Midterm. Name: Andrew user id:

Energy Efficient MapReduce

Hadoop. Sunday, November 25, 12

Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Challenges for Data Driven Systems

A Review of Customized Dynamic Load Balancing for a Network of Workstations

Introduction to DISC and Hadoop

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

ICONICS Choosing the Correct Edition of MS SQL Server

Transcription:

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München, *Snowflake Computing, Inc. 1 / 31

Scale Out
HyPer: high-performance in-memory transaction and query processing system
Scale out to process extremely large inputs
Aiming at clusters with large main-memory capacity
[Figure: main-memory cluster of six nodes connected via a switch; TPC-H Query 12 plan with sort, group, and join over lineitem and orders] 2 / 31

Running Example (1)
Focus on analytical query processing in this talk
TPC-H query 12 used as running example
Runtime dominated by the join of orders and lineitem
[Figure: TPC-H Query 12 plan — sort, group by, and join over orders and lineitem] 3 / 31

Running Example (2)
Relations are equally distributed across nodes
We make no other assumptions on data distribution
Network communication required as tuples join with tuples on remote nodes
[Figure: orders (key, priority) and lineitem (key, shipmode) fragments spread over nodes 1-3; matching keys reside on different nodes] 4 / 31

Scale Out: Network is the Bottleneck
Single node: performance is bound algorithmically
Cluster: network is the bottleneck for query processing
We propose a novel join algorithm called Neo-Join
Goal: increase local processing to close the performance gap
[Plot: join performance in M tuples/s, local vs. distributed] 5 / 31

Neo-Join: Network-optimized Join
1. Open Shop Scheduling: efficient network communication
2. Optimal Partition Assignment: increase local processing
3. Selective Broadcast: handle value skew 6 / 31

Open Shop Scheduling Efficient network communication 7 / 31

Standard Network Model
Star topology: nodes are connected to a central switch
Fully switched: all links can be used simultaneously
Full duplex: nodes can both send and receive at full speed
[Figure: six nodes connected to a central switch] 8 / 31

Bandwidth Sharing
Simultaneous use of a single link creates a bottleneck
Reduces bandwidth by at least a factor of 2
[Figure: two senders sharing the link to one receiver, and one sender sharing its uplink to two receivers, both via the switch] 9 / 31

Naïve Schedule
Nodes 2 and 3 send to node 1 at the same time
Bandwidth sharing increases the network duration significantly
[Figure: naïve transfer schedule; overlapping transfers to node 1 share bandwidth, stretching the schedule beyond the maximum straggler] 10 / 31

Open Shop Scheduling (1)
Avoiding bandwidth sharing translates directly to open shop scheduling
Network transfer: receivers receive from at most one sender, senders send to at most one receiver
Open shop: processors perform one task at a time, and only one task of a job is processed at a time

  Open Shop      | Network Transfer
  ---------------+-----------------
  task           | data transfer
  processor      | receiver
  job            | sender
  execution time | message size 11 / 31

Open Shop Scheduling (2)
Bipartite graph of senders and receivers
Edge weights represent transfer sizes
Scheduler repeatedly finds perfect matchings
Each matching specifies one communication phase
Transfers in a phase never share bandwidth
[Figure: bipartite graph between sender nodes 1-3 and receiver nodes 1-3 with edge weights 5, 5, and 4] 12 / 31
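The matching-based scheduling idea above can be sketched in a few lines. This is a greedy decomposition, not the paper's optimal open shop algorithm: each phase is a matching of senders to receivers (so no link is shared), built heaviest-transfer-first, and lasts until the smallest matched transfer completes. All names are illustrative.

```python
def greedy_open_shop_schedule(demand):
    """Greedily decompose a sender x receiver demand matrix into
    communication phases. Within a phase every sender talks to at most
    one receiver and every receiver hears from at most one sender."""
    remaining = [row[:] for row in demand]   # data still to transfer
    phases = []
    while True:
        edges = [(remaining[i][j], i, j)
                 for i in range(len(remaining))
                 for j in range(len(remaining[i]))
                 if remaining[i][j] > 0]
        if not edges:
            break
        edges.sort(reverse=True)             # heaviest transfers first
        used_s, used_r, matching = set(), set(), []
        for size, i, j in edges:             # build a conflict-free matching
            if i not in used_s and j not in used_r:
                used_s.add(i)
                used_r.add(j)
                matching.append((i, j))
        # the phase ends when its smallest transfer completes
        length = min(remaining[i][j] for i, j in matching)
        for i, j in matching:
            remaining[i][j] -= length
        phases.append((length, matching))
    return phases
```

Summing the phase lengths over which an edge is matched reproduces exactly the demanded transfer volume, and no phase ever shares a link.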

Optimal Schedule
The open shop schedule achieves the minimal network duration
The schedule duration is determined by the maximum straggler
[Figure: optimal schedule without bandwidth sharing; every transfer runs at full link speed and the schedule ends with the maximum straggler] 13 / 31
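The maximum-straggler bound is easy to compute: on a full-duplex switched network every node sends and receives at full line rate, so no schedule can finish before the busiest sender or receiver is done. A small helper, assuming a square sender x receiver demand matrix (names are illustrative):

```python
def network_duration_lower_bound(demand):
    """Lower bound on network duration: the maximum over all nodes of
    that node's total send volume and total receive volume."""
    n = len(demand)
    send = [sum(demand[i]) for i in range(n)]                    # row sums
    recv = [sum(demand[i][j] for i in range(n)) for j in range(n)]  # column sums
    return max(max(send), max(recv))
```

An open shop schedule that avoids bandwidth sharing can meet this bound; a naïve schedule with shared links generally cannot.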

Optimal Partition Assignment Minimize network duration for distributed joins 14 / 31

Distributed Join
Tuples may join with tuples on remote nodes
Repartition and redistribute both relations for a local join
Tuples then join only within the corresponding partition
Using hash, range, radix, or any other partitioning scheme
In any case: decide how to assign partitions to nodes
[Figure: fragmented relations O and L repartitioned into O1-O3 and L1-L3 and redistributed] 15 / 31
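The repartition-then-join-locally scheme can be illustrated with a minimal sketch. This is a toy single-process model, not the system's implementation; tuples are (key, payload) pairs and the partitioning is a plain hash:

```python
def repartition(tuples, num_partitions, key=lambda t: t[0]):
    """Split a relation fragment into partitions by hashing the join
    key; equal keys always land in the same partition, so after
    redistribution each partition can be joined locally."""
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[hash(key(t)) % num_partitions].append(t)
    return parts

def local_hash_join(r, s, key=lambda t: t[0]):
    """Classic hash join of two co-partitioned fragments."""
    table = {}
    for t in r:                       # build side
        table.setdefault(key(t), []).append(t)
    return [(rt, st)                  # probe side
            for st in s for rt in table.get(key(st), [])]
```

Joining partition i of one relation with partition i of the other, for all i, yields the same result as a global join because matching keys hash to the same partition.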

Running Example: Hash Partitioning
[Figure: orders and lineitem tuples on nodes 1-3 hash-partitioned by key into P1-P3; each node holds fragments of all three partitions] 16 / 31

Assign Partitions to Nodes (1)
Option 1: Minimize network traffic
Assign each partition to the node that owns its largest fragment
Only the smaller fragments of a partition are sent over the network
But: a schedule with minimal network traffic may still have a high duration
[Figure: hash partitioning (x mod 3); assigning every partition to node 1 yields traffic 26 and network duration 26] 17 / 31

Assign Partitions to Nodes (2)
Option 2: Minimize response time
Query response time is the time from request to result
Query response time is dominated by the network duration
To minimize the network duration, minimize the maximum straggler
[Figure: for the same hash partitioning (x mod 3), the traffic-minimal assignment needs traffic 26 and duration 26, while the straggler-minimal assignment needs traffic 28 but only duration 10] 18 / 31

Minimize Maximum Straggler
Formalized as a mixed-integer linear program (MILP)
The objective function minimizes the maximum straggler
Shown to be NP-hard (see paper for a proof sketch)
In practice fast enough using CPLEX or Gurobi (< 0.5 % overhead for 32 nodes, 200 M tuples each)
Partition assignment can optimize any partitioning scheme

Let h_ij denote the size of partition j's fragment on node i, and let the binary variable x_ij be 1 iff partition j is assigned to node i. The send cost of node i is the size of all its local fragments of partitions not assigned to it; its receive cost is the size of all remote fragments of partitions assigned to it. The network phase duration is determined by the maximum send/receive cost, so the assignment should minimize this maximum. MILPs require a linear objective, which minimizing a maximum is not; we therefore minimize a new variable w and add constraints that force w to the maximum over the send/receive costs:

(OPT-ASSIGN)  minimize w, subject to

  w >= sum_{j=0}^{p-1} h_ij (1 - x_ij)                        for 0 <= i < n   (send cost)
  w >= sum_{j=0}^{p-1} x_ij sum_{k=0, k != i}^{n-1} h_kj      for 0 <= i < n   (receive cost)
  sum_{i=0}^{n-1} x_ij = 1                                    for 0 <= j < p   (each partition assigned exactly once)

Note that assigning each partition to the node that owns its largest fragment is not optimal in general: in the running example it assigns every partition to node 1 and overloads it. One can obtain an optimal solution for a specific partition assignment problem by passing the program to a solver such as Gurobi (http://www.gurobi.com) or IBM CPLEX (http://ibm.com/software/integration/optimization/cplex); both can be linked as a library to create and solve linear programs via API calls. 19 / 31
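For tiny instances the OPT-ASSIGN objective can be checked by brute force instead of a MILP solver. The sketch below enumerates all n^p assignments and keeps the one with the smallest maximum send/receive cost; it is exponential and only meant to make the objective concrete. The fragment matrix in the usage note is reconstructed from the slides' running example, so treat the exact numbers as illustrative.

```python
from itertools import product

def optimal_assignment(h):
    """Brute-force OPT-ASSIGN: h[i][j] is the size of partition j's
    fragment on node i. Returns the minimal maximum straggler cost and
    one assignment (assign[j] = node of partition j) achieving it."""
    n, p = len(h), len(h[0])
    best_w, best_assign = None, None
    for assign in product(range(n), repeat=p):
        # send cost: local fragments of partitions assigned elsewhere
        send = [sum(h[i][j] for j in range(p) if assign[j] != i)
                for i in range(n)]
        # receive cost: remote fragments of partitions assigned here
        recv = [sum(h[k][j] for j in range(p) if assign[j] == i
                    for k in range(n) if k != i)
                for i in range(n)]
        w = max(max(send), max(recv))
        if best_w is None or w < best_w:
            best_w, best_assign = w, assign
    return best_w, best_assign
```

For h = [[5, 6, 5], [4, 5, 4], [4, 5, 4]] the largest-fragment heuristic sends everything to node 1 (duration 26), while the optimum spreads the partitions and finishes in duration 10, mirroring the slide's traffic-26/duration-26 versus traffic-28/duration-10 comparison.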

Running Example: Locality
[Figure: orders and lineitem tuples on nodes 1-3 radix-partitioned by key; due to time-of-creation clustering almost all tuples of a partition already reside on a single node] 20 / 31

Locality
The running example exhibits time-of-creation clustering
Radix repartitioning on the most significant bits retains locality
Partition assignment can exploit this locality
Significantly reduces query response time
[Figure: radix partitioning (MSB) leaves only small remote fragments; the resulting schedule needs traffic 5 and duration 3] 21 / 31
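MSB-radix partitioning can be sketched as follows; it assumes integer keys of a known bit width and a power-of-two-friendly partition count, both illustrative choices. Because consecutive keys share their high bits, a key range created on one node stays together in one partition:

```python
def radix_partition(keys, num_partitions, key_bits=8):
    """Partition integer keys on their most significant bits. Keys
    created around the same time have similar values, so whole key
    ranges -- and thus whole partitions -- tend to stay on the node
    that produced them, which the partition assignment can exploit."""
    shift = key_bits - (num_partitions - 1).bit_length()
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[min(k >> shift, num_partitions - 1)].append(k)
    return parts
```

A hash partitioner would scatter a dense key range over all partitions; the MSB-radix partitioner keeps it in one, so most fragments need not move at all.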

Selective Broadcast Handle value skew 22 / 31

Running Example: Skew
[Figure: orders and lineitem tuples hash-partitioned by key; a few heavily repeated keys in lineitem make one partition far larger than the others] 23 / 31

Skew
Value skew can lead to some very large partitions
Assigning these partitions increases the network duration
One may try to balance skewed partitions by splitting the input into more partitions
High skew is still a problem
[Figure: hash partitioning (mod 3) of a skewed input; the resulting schedule needs traffic 21 and duration 14] 24 / 31

Broadcast
Alternative data redistribution scheme
Replicate the smaller relation to all nodes
The larger relation remains fragmented across nodes
[Figure: O is broadcast to every node and joined locally with the fragments of L] 25 / 31

Selective Broadcast
Decide per partition whether to assign or broadcast
Broadcast partitions with a large relation size difference
Assign the other partitions, taking locality into account
Role reversal possible: different partitions may be broadcast by different relations
[Figure: mixing assignment and broadcast reduces the schedule to traffic 14 and duration 6] 26 / 31
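The per-partition decision can be sketched with a simple size-ratio rule. The paper bases this decision on its network cost model; the fixed `threshold` below is a made-up stand-in for that model, and all names are illustrative:

```python
def selective_broadcast_plan(sizes_r, sizes_s, threshold=4.0):
    """For each partition j, given its total size in relation R and in
    relation S, choose 'broadcast' when the two sides differ by more
    than `threshold`, replicating only the smaller side (role reversal
    per partition); otherwise fall back to normal assignment."""
    plan = {}
    for j, (r, s) in enumerate(zip(sizes_r, sizes_s)):
        small, large = min(r, s), max(r, s)
        if small == 0 or large / small > threshold:
            side = 'R' if r <= s else 'S'   # broadcast the smaller side
            plan[j] = ('broadcast', side)
        else:
            plan[j] = ('assign', None)
    return plan
```

Broadcasting a heavily skewed partition avoids funneling its huge fragment to a single node, while balanced partitions still benefit from locality-aware assignment.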

Evaluation 27 / 31

Locality
Vary locality from 0 % (uniform distribution) to 100 % (range partitioning)
Neo-Join improves join performance from 29 M to 156 M tuples/s (> 500 %)
3 nodes (Core i7, 4 cores, 3.4 GHz, 32 GB RAM), 600 M tuples (64 bit key, 64 bit payload)
[Plot: join performance in tuples/s over locality for Neo-Join, Hadoop Hive, DBMS-X, and MySQL Cluster] 28 / 31

Skew
Zipfian distribution models realistic data skew
Using more partitions alleviates the problem
Selective broadcast actually improves performance for skewed inputs
4 nodes, 400 M tuples

  partitions | Zipf factor s:  0.00   0.25   0.50   0.75   1.00
  -----------+------------------------------------------------
  16         |                 27 s   24 s   23 s   29 s   44 s
  512        |                 23 s   23 s   23 s   23 s   33 s
  16 (SB)    |                 24 s   24 s   23 s   20 s   10 s 29 / 31

TPC-H Results (scale factor 100)
Results for three selected TPC-H queries
Broadcast outperforms hash repartitioning for large relation size differences
Neo-Join always performs best due to selective broadcast and locality
4 nodes, scale factor 100
[Plot: execution time in seconds for queries Q12, Q14, and Q19 with hash, broadcast, and Neo-Join] 30 / 31

Summary
Motivation: the network is the bottleneck for distributed query processing; increase local processing to close the performance gap
Contributions:
- Open Shop Scheduling avoids bandwidth sharing
- Optimal Partition Assignment minimizes query response time and can exploit locality in the data distribution
- Selective Broadcast combines repartitioning and broadcast to improve performance for skewed inputs 31 / 31