Operator Variant Selection on Heterogeneous Hardware. Viktor Rosenfeld, Max Heimel, Christoph Viebig, Volker Markl

Size: px

Start display at page:

Download "Operator Variant Selection on Heterogeneous Hardware. Viktor Rosenfeld, Max Heimel, Christoph Viebig, Volker Markl"

Edmund Ryan
7 years ago
Views:

1 Operator Variant Selection on Heterogeneous Hardware Viktor Rosenfeld, Max Heimel, Christoph Viebig, Volker Markl ADMS '15 in conjunction with VLDB '41 Kohala Coast, Hawai'i August 31, 2015

2 1 Challenge High performance or small implementations 2 Experimental demonstration Lack of performance portability in OpenCL 3 Solution sketch Performance-portable database operators 4 Current work Learning fast operator implementations 2

3 1 Challenge High performance or small implementations 3

4 Hardware-sensitive Approach dedicated operator implementations for each device x86 CPUs NVIDIA GPUs AMD GPUs Intel Xeon Phi ARM CPUs Power CPUs pro: optimal implementations for each device AMD Fusion contra: development and maintenance overhead 4

5 Hardware-oblivious Approach support multiple devices from a single implementation single operator implementation pro: smaller code base pro: support for unknown devices OpenCL vendor driver / compiler contra: lost optimization opportunities CPUs GPUs Xeon Phi... hardwarespecific code 5

6 OpenCL Portability OpenCL offers functional portability But not performance portability Many parameters to tweak: thread workload, memory access, special functions,... Hardware-specific OpenCL implementations? lack of performance portability limits the value of functional portability 6

7 2 Experimental demonstration Lack of performance portability in OpenCL 7

8 Selection Kernels Basic algorithm scan over column evaluate predicate for each value, x < const return bitmap indicating satisfying values 8

9 Variant Dimensions Code modifications ~ 60 variants Basic algorithm (memory access & result bitmap construction): sequential, atomic-global, atomic-local, reduce, collect, transpose Result bitmap granularity: 8 bit, 16 bit, 32 bit, 64 bit Loop unrolling: yes, no Predication: yes, no Workload parameters Local size: 1, 2, 4, 8,..., max Elements per thread: 1, 2, 4, 8,..., 1024 ~5000 of selection kernel variants 9

10 Percentage of variants 50% 40% 30% 20% 10% Competitive Variants percentage of variants that are at most 2x slower than fastest variant for each device AMD Opteron 6128 HE Intel Iris 5100 Intel Xeon E5620 AMD Opteron 2356 Intel Core i7 4900MQ Intel Core i7 870 Intel Xeon E v2 NVIDIA Quadro K2100M NVIDIA Tesla K40M AMD Radeon HD 6950 NVIDIA Geforce GTX 460 Intel Xeon Phi IBM 8231 E2B 0% Slowdown relative to optimal variant often many variants are competitive 10

11 Competitive Variants percentage of variants that are at most 2x slower than fastest variant for each device 50% Percentage of variants 40% 30% 20% 10% Intel Xeon E5620 (15) Intel Xeon Phi (444) 0% Slowdown relative to optimal variant "easy" and "difficult" devices 11

12 3 Solution sketch Performance-portable database operators 12

13 Automatic Variant Tuning 1. specify operators in generic fashion? 2. derive different implementations? 3. learn best implementation per device let the system generate and find the best variant 13

14 4 Current work Learning fast operator implementations 14

15 Micro Adaptivity Flavor 1 Flavor 2 Flavor 3 Flavor 4 explore flavor performance exploit best flavor Learn best operator implementation for current workload Implementation details Loop Unrolling Branch Elimination... data for a single query no optimal operator implementation, even for a single query Raducanu et al., "Micro Adaptivity In Vectorwise", SIGMOD,

16 Micro Adaptivity With Many Variants Micro Adaptivity Random Optimal runtime distribution of 300 queries with 1K chunks for each query: select pool of size X from ~5000 different variants 16

17 Micro Adaptivity With Many Variants 1 variant in the pool can be good, bad, or average Micro Adaptivity Random Optimal runtime distribution of 300 queries with 1K chunks for each query: select pool of size X from ~5000 different variants 17

18 Micro Adaptivity With Many Variants Micro Adaptivity Random Optimal 16 variants in the pool every variant is used (no explore/exploit) distribution is compacted runtime distribution of 300 queries with 1K chunks for each query: select pool of size X from ~5000 different variants 18

19 Micro Adaptivity With Many Variants Micro Adaptivity Random Optimal as the number of variants in the pool increases, we converge on the mean runtime of the variant universe runtime distribution of 300 selection queries with 1K chunks for each query: select pool of size X from ~5000 different variants 19

20 Micro Adaptivity With Many Variants Micro Adaptivity Random Optimal small pool sizes distribution is shifted to optimal value distribution is compacted runtime distribution of 300 queries with 1K chunks for each query: select pool of size X from ~5000 different variants 20

21 Micro Adaptivity With Many Variants large pool sizes overhead of explore destroys benefit of Micro Adaptivity Micro Adaptivity Random Optimal runtime distribution of 300 queries with 1K chunks for each query: select pool of size X from ~5000 different variants 21

22 Search Strategies improve the pool between queries Greedy keep 2 fastest variants randomly replace others Genetic keep 2 fastest variants replace others by combining attributes from 2 parents currently in pool chance of becoming a parent depends on performance mutate variants to get out of local minima 22

23 Competitive Variants percentage of variants that are at most 2x slower than fastest variant for each device 50% Percentage of variants 40% 30% 20% 10% Intel Xeon E5620 (15) Intel Xeon Phi (444) 0% Slowdown relative to optimal variant "easy" and "difficult" devices 23

24 Influence of Search Strategies Average slowdown (compared to optimal) Intel Xeon CPU close to optimal performance for "easy" devices Query Number None Greedy Genetic 100 series of 10 consecutive selection queries working pool: 8 variants chosen randomly at start of series baseline None: no updates of working pool between queries 24

25 Influence of Search Strategies Average slowdown (compared to optimal) Intel Xeon Phi Accelerator room for improvement for "difficult" devices Query Number None Greedy Genetic 100 series of 10 consecutive selection queries working pool: 8 variants chosen randomly at start of series baseline None: no updates of working pool between queries 25

26 Summary and Outlook OpenCL offers functional portability, but lack of performance portability limits usefulness we can use query feedback to learn fast data processing operators generate variants automatically (step 1 and 2) improve search strategies (micro benchmarks, source code metrics,...) 26

27 Learning Framework 27

28 Processor Characteristics fast CPU variants Sequential Interleaved-atomic-global and interleaved-atomic-local Interleaved-reduce manufacturer and architecture differences Interleaved-collect Interleaved-transpose fast GPU variants no results Intel Core i7-870 (Nehalem) Intel Xeon E5620 (Nehalem) Intel Xeon E v2 (Ivy Bridge) Intel Core i7-4900mq (Haswell) AMD Opteron 2356 (Barcelona) AMD Opteron 6128 HE (Magny-Cours) IBM 8231-E2B (POWER7) Intel Xeon Phi SE10/7120 (Knights Corner) NVIDIA GeForce GTX 460 (Fermi) NVIDIA Quadro K2100M (Kepler) NVIDIA Tesla K40M (Kepler) AMD Radeon HD 6950 (Terascale3) Intel Iris 5100 (Gen7.5) 28 Figure 2: Heatmaps illustrating the relative performance of selection kernel variants across a range of selectivities, plotted

29 Best Variants By Device Device 1024 Variant Core i7 4900MQ Core i7 870 Elements per thread Intel CPUs 64-bit sequential PU device-specific Local size NVIDIA GeForce GTX bit transpose U NVIDIA Quadro K2100M 16-bit transpose (P/PU) 1/2/4 128 NVIDIA Tesla K40M 16-bit transpose (U) AMD Radeon HD bit collect (U) Intel Iris bit transpose P/PU /128 P: predicated, U: unrolled Local size Elements per thread best variant is hardware-specific 29

Several tips on how to choose a suitable computer

Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec