Locality-Sensitive Operators for Parallel Main-Memory Database Clusters
Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann
Technische Universität München, *Snowflake Computing, Inc.
1 / 31
Scale Out
HyPer: High-performance in-memory transaction and query processing system
Scale out to process extremely large join inputs
Aiming at clusters with large main-memory capacity
[Figures: TPC-H Query 12 plan (sort, group, join over lineitem and orders); main-memory cluster of six nodes connected by a switch]
2 / 31
Running Example (1)
Focus on analytical query processing in this talk
TPC-H query 12 used as running example
Runtime dominated by the join of orders and lineitem
[Figure: TPC-H Query 12 plan: sort, group, join over orders and lineitem]
3 / 31
Running Example (2)
Relations are equally distributed across nodes
We make no other assumptions on the data distribution
Network communication is required as tuples join with tuples on remote nodes
[Figure: orders (key, priority) and lineitem (key, shipmode) fragments spread over nodes 1–3]
4 / 31
Scale Out: Network is the Bottleneck
Single node: Performance is bound algorithmically
Cluster: Network is the bottleneck for query processing
We propose a novel join algorithm called Neo-Join
Goal: Increase local processing to close the performance gap
[Figure: bar chart of join performance [M tuples/s], local vs. distributed]
5 / 31
Neo-Join: Network-optimized Join
1. Open Shop Scheduling: efficient network communication
2. Optimal Partition Assignment: increase local processing
3. Selective Broadcast: handle value skew
6 / 31
Open Shop Scheduling
Efficient network communication
7 / 31
Standard Network Model
Star topology: Nodes are connected to a central switch
Fully switched: All links can be used simultaneously
Full duplex: Nodes can both send and receive at full speed
[Figure: six nodes connected to a central switch]
8 / 31
Bandwidth Sharing
Simultaneous use of a single link creates a bottleneck
Reduces bandwidth by at least a factor of 2
[Figure: two senders sharing one receiver's link, and one sender sharing its uplink to two receivers; both create a bottleneck at the switch]
9 / 31
Naïve Schedule
Nodes 2 and 3 send to node 1 at the same time
Bandwidth sharing increases the network duration significantly
[Figure: Gantt chart of the naïve schedule; overlapping transfers share bandwidth, and the schedule duration is set by the maximum straggler]
10 / 31
Open Shop Scheduling (1)
Avoiding bandwidth sharing translates directly to open shop scheduling
Network transfer: Receivers receive from at most one sender, senders send to at most one receiver
Open shop: Processors perform one task at a time, only one task of a job is processed at a time
Correspondence: task ↔ data transfer, processor ↔ receiver, job ↔ sender, execution time ↔ message size
11 / 31
Open Shop Scheduling (2)
Bipartite graph of senders and receivers
Edge weights represent transfer sizes
The scheduler repeatedly finds perfect matchings
Each matching specifies one communication phase
Transfers in a phase never share bandwidth
[Figure: bipartite graph with sender nodes 1–3, receiver nodes 1–3, and weighted edges]
12 / 31
Optimal Schedule
The open shop schedule achieves the minimal network duration
The schedule duration is determined by the maximum straggler
[Figure: Gantt chart of the naïve schedule with bandwidth sharing vs. the optimal schedule without it]
13 / 31
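The matching-based phase construction from the previous slides can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the paper's scheduler: it uses a plain augmenting-path bipartite matching and greedily runs each phase until the first matched transfer completes, so it produces valid bandwidth-sharing-free phases but does not guarantee the minimal schedule duration. `transfer[i][j]` is the amount node i must send to node j:

```python
def max_bipartite_matching(edges, n):
    """edges[s] = receivers sender s still has data for.
    Returns a matching as a dict sender -> receiver."""
    match_to = {}  # receiver -> sender

    def augment(s, seen):
        for r in edges[s]:
            if r in seen:
                continue
            seen.add(r)
            if r not in match_to or augment(match_to[r], seen):
                match_to[r] = s
                return True
        return False

    for s in range(n):
        augment(s, set())
    return {s: r for r, s in match_to.items()}


def schedule_phases(transfer):
    """Split transfers into phases in which every node sends to at most
    one node and receives from at most one node (no bandwidth sharing)."""
    n = len(transfer)
    remaining = [row[:] for row in transfer]
    phases = []
    while any(v > 0 for row in remaining for v in row):
        edges = {i: [j for j in range(n) if remaining[i][j] > 0]
                 for i in range(n)}
        matching = max_bipartite_matching(edges, n)
        # run each matched transfer until the smallest one finishes,
        # so at least one pair completes in every phase
        amount = min(remaining[s][r] for s, r in matching.items())
        for s, r in matching.items():
            remaining[s][r] -= amount
        phases.append((matching, amount))
    return phases
```

Each returned phase pairs distinct senders with distinct receivers, so no link is ever shared within a phase.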
Optimal Partition Assignment
Minimize network duration for distributed joins
14 / 31
Distributed Join
Tuples may join with tuples on remote nodes
Repartition and redistribute both relations for a local join
Tuples will join only within the corresponding partition
Using hash, range, radix, or another partitioning scheme
In any case: Decide how to assign partitions to nodes
[Figure: relations O and L, fragmented across nodes, redistributed into partitions O1–O3 and L1–L3]
15 / 31
Running Example: Hash Partitioning
[Figure: orders and lineitem tuples on nodes 1–3, hash-partitioned by key into P1–P3; every node holds fragments of all three partitions]
16 / 31
Assign Partitions to Nodes (1)
Option 1: Minimize network traffic
Assign each partition to the node that owns its largest fragment
Only the small fragments of a partition are sent over the network
But: A schedule with minimal network traffic may still have a high duration
[Figure: hash partitioning (x mod 3) and the resulting open shop schedule; traffic: 26, time: 26]
17 / 31
Assign Partitions to Nodes (2)
Option 2: Minimize response time
Query response time is the time from request to result
Query response time is dominated by the network duration
To minimize the network duration, minimize the maximum straggler
[Figure: same hash partitioning with a straggler-minimizing assignment; traffic: 28, time: 10 (more traffic, but a much shorter duration)]
18 / 31
Minimize Maximum Straggler
Formalized as a mixed-integer linear program (MILP)
The objective function minimizes the maximum straggler
Shown to be NP-hard (see the paper for a proof sketch)
In practice fast enough using CPLEX or Gurobi (< 0.5 % overhead for 32 nodes, 200 M tuples each)
Partition assignment can optimize any partitioning scheme

Let h_ij be the size of the fragment of partition j stored on node i, and let x_ij ∈ {0, 1} indicate that partition j is assigned to node i. The send cost of node i is the size of all its local fragments of partitions that are not assigned to it; its receive cost is the size of all remote fragments of partitions that are assigned to it. MILPs require a linear objective, which minimizing a maximum is not. Fortunately, we can rephrase the objective and instead minimize a new variable w; additional constraints ensure that w assumes the maximum over all send/receive costs:

(OPT-ASSIGN)  minimize w, subject to

  w ≥ Σ_{j=0..p−1} h_ij (1 − x_ij)                for 0 ≤ i < n   (send cost of node i)
  w ≥ Σ_{j=0..p−1} x_ij · Σ_{k=0..n−1, k≠i} h_kj  for 0 ≤ i < n   (receive cost of node i)
  Σ_{i=0..n−1} x_ij = 1                           for 0 ≤ j < p   (every partition assigned to exactly one node)

An optimal solution for a specific partition assignment problem is obtained by passing this program to a solver such as Gurobi³ or IBM CPLEX⁴. These solvers can be linked as a library to create and solve linear programs via API calls.
³ http://www.gurobi.com
⁴ http://ibm.com/software/integration/optimization/cplex
19 / 31
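For intuition only, a brute-force version of OPT-ASSIGN in Python. It enumerates all n^p assignments, so it is usable only for toy instances; the paper instead hands the MILP to CPLEX or Gurobi. The names are mine; `h[i][j]` is the size of the fragment of partition j stored on node i, as in the program above:

```python
from itertools import product

def opt_assign_bruteforce(h):
    """Return (assignment, w): assignment[j] is the node that partition j
    is assigned to, w the minimal achievable maximum send/receive cost
    (the 'maximum straggler')."""
    n, p = len(h), len(h[0])
    best_assign, best_w = None, float('inf')
    for assign in product(range(n), repeat=p):
        # send cost of node i: local fragments of partitions assigned elsewhere
        send = [sum(h[i][j] for j in range(p) if assign[j] != i)
                for i in range(n)]
        # receive cost of node i: remote fragments of partitions assigned to i
        recv = [sum(h[k][j] for j in range(p) if assign[j] == i
                    for k in range(n) if k != i)
                for i in range(n)]
        w = max(send + recv)
        if w < best_w:
            best_assign, best_w = assign, w
    return best_assign, best_w
```

When each partition lives almost entirely on one node, the optimum assigns it there, which is exactly the locality effect the following slides exploit.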
Running Example: Locality
[Figure: orders and lineitem tuples on nodes 1–3; keys are clustered by creation time, so radix partitioning keeps almost all tuples in the locally dominant partition]
20 / 31
Locality
The running example exhibits time-of-creation clustering
Radix repartitioning on the most significant bits retains this locality
Partition assignment can exploit locality
Significantly reduces query response time
[Figure: radix partitioning (MSB) and the resulting open shop schedule; traffic: 5, time: 3]
21 / 31
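A small sketch of why MSB radix partitioning preserves this kind of locality, with made-up sizes (4 nodes, 8-bit keys, 4 partitions): when keys are clustered by creation time, hash partitioning scatters each node's keys over all partitions, while partitioning on the most significant bits keeps them together:

```python
def radix_partition_msb(key, bits, key_bits=8):
    """Partition into 2**bits partitions by the key's most significant bits."""
    return key >> (key_bits - bits)

# keys clustered by creation time: node i created keys [64*i, 64*(i+1))
nodes = [list(range(64 * i, 64 * (i + 1))) for i in range(4)]

def cross_node_traffic(partition_fn):
    """Tuples that must leave their node if partition j is assigned to node j."""
    return sum(1 for i, keys in enumerate(nodes)
               for k in keys if partition_fn(k) != i)

hash_traffic = cross_node_traffic(lambda k: k % 4)   # scatters each cluster
radix_traffic = cross_node_traffic(lambda k: radix_partition_msb(k, 2))
```

Here radix partitioning sends nothing over the network (`radix_traffic == 0`), while hashing ships three quarters of all tuples.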
Selective Broadcast
Handle value skew
22 / 31
Running Example: Skew
[Figure: orders and lineitem tuples on nodes 1–3; lineitem key 8 occurs very often, so hash partitioning produces one very large partition]
23 / 31
Skew
Value skew can lead to some very large partitions
Assigning these partitions increases the network duration
One may try to balance skewed partitions by splitting the input into more partitions
High skew is still a problem
[Figure: hash partitioning (mod 3) and the resulting open shop schedule; traffic: 21, time: 14]
24 / 31
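To illustrate why more partitions help only up to a point, a sketch that samples Zipf-distributed join keys and measures the share of the largest hash partition (the parameters are made up, not the paper's workload). However many partitions are used, the partition containing the heaviest key stays large:

```python
import random

def zipf_keys(n, domain, s, seed=42):
    """Sample n join keys from [0, domain) with Zipf exponent s."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(domain)]
    return rng.choices(range(domain), weights=weights, k=n)

def largest_partition_share(keys, p):
    """Fraction of tuples landing in the largest hash partition (key mod p)."""
    sizes = [0] * p
    for k in keys:
        sizes[k % p] += 1
    return max(sizes) / len(keys)

keys = zipf_keys(100_000, 10_000, s=1.0)
share_16 = largest_partition_share(keys, 16)
share_512 = largest_partition_share(keys, 512)
```

Going from 16 to 512 partitions shrinks the largest partition a little, but it can never fall below the frequency of the single heaviest key (roughly 10 % here), which is why selective broadcast is needed for high skew.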
Broadcast
Alternative data redistribution scheme
Replicate the smaller relation to all nodes
The larger relation remains fragmented across nodes
[Figure: relation O is broadcast to every node, then joined locally with the fragments of L]
25 / 31
Selective Broadcast
Decide per partition whether to assign or broadcast it
Broadcast partitions with a large relation size difference
Assign the other partitions, taking locality into account
Role reversal possible: Different partitions may be broadcast by different relations
[Figure: hash partitioning (mod 3) with mixed assignment/broadcast and the resulting open shop schedule; traffic: 14, time: 6]
26 / 31
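A sketch of the per-partition decision using a deliberately simplified cost rule of my own (the paper's criterion is more refined): repartitioning ships roughly (n−1)/n of both relations' tuples for a partition, while broadcasting ships n−1 copies of the smaller side, so broadcasting wins when one side of the partition is much smaller:

```python
def choose_strategy(o_sizes, l_sizes, n):
    """For each partition j, pick ('assign', None) or ('broadcast', side).
    o_sizes[j], l_sizes[j]: tuples of orders/lineitem in partition j.
    Simplified cost model: repartitioning ships ~ (n-1)/n of both sides;
    broadcasting ships (n-1) copies of the smaller side."""
    decisions = []
    for o, l in zip(o_sizes, l_sizes):
        repartition_cost = (o + l) * (n - 1) / n
        broadcast_cost = min(o, l) * (n - 1)
        if broadcast_cost < repartition_cost:
            # role reversal: whichever relation is smaller gets broadcast
            side = 'orders' if o <= l else 'lineitem'
            decisions.append(('broadcast', side))
        else:
            decisions.append(('assign', None))
    return decisions
```

For a heavily skewed partition (few orders, many lineitem tuples) the rule broadcasts the orders side; for balanced partitions it falls back to assignment, which can then exploit locality.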
Evaluation
27 / 31
Locality
Vary locality from 0 % (uniform distribution) to 100 % (range partitioning)
Neo-Join improves join performance from 29 M to 156 M tuples/s (> 500 %)
Setup: 3 nodes (Core i7, 4 cores, 3.4 GHz, 32 GB RAM), 600 M tuples (64 bit key, 64 bit payload)
[Figure: join performance [M tuples/s] over locality for Neo-Join, Hadoop Hive, DBMS-X, and MySQL Cluster]
28 / 31
Skew
A Zipfian distribution models realistic data skew
Using more partitions alleviates the problem
Selective broadcast actually improves performance for skewed inputs
Setup: 4 nodes, 400 M tuples

Execution time by Zipf factor s:

  partitions | s = 0.00 | 0.25 | 0.50 | 0.75 | 1.00
  16         |     27 s | 24 s | 23 s | 29 s | 44 s
  512        |     23 s | 23 s | 23 s | 23 s | 33 s
  16 (SB)    |     24 s | 24 s | 23 s | 20 s | 10 s

29 / 31
TPC-H Results (scale factor 100)
Results for three selected TPC-H queries
Broadcast outperforms hash partitioning for large relation size differences
Neo-Join always performs best due to selective broadcast and locality
Setup: 4 nodes, scale factor 100
[Figure: execution time [s] for TPC-H queries Q12, Q14, and Q19 with hash, broadcast, and Neo-Join]
30 / 31
Summary
Motivation:
Network is the bottleneck for distributed query processing
Increase local processing to close the performance gap
Contributions:
Open Shop Scheduling avoids bandwidth sharing
Optimal Partition Assignment minimizes query response time and can exploit locality in the data distribution
Selective Broadcast combines repartitioning and broadcast to improve performance for skewed inputs
31 / 31