Heterogeneity-Conscious Parallel Query Execution: Getting a better mileage while driving faster!

1 Heterogeneity-Conscious Parallel Query Execution: Getting a better mileage while driving faster!
Tobias Mühlbauer, Wolf Rödiger, Robert Seilbeck, Alfons Kemper, Thomas Neumann
Technische Universität München
Data Management on New Hardware (DaMoN 2014)

2 The Rise of Dark Silicon
- Moore's law is still valid
  - transistor density doubles with each generation
- + Failure of Dennard scaling
  - proportional scaling of threshold/supply voltages failed in ~22
  - power density is growing
- = Dimmed or dark silicon: > 5% at 1nm (ITRS roadmap projection)
  - processor power budget is constant
  - not all transistors can be used simultaneously
  - multi-core scaling is just a workaround
- Heterogeneous processors (CPU including GPGPU, FPGA, ASICs, ...)

3 Single-ISA Heterogeneous Multi-Cores
- Cores implement the same instruction set architecture (ISA)
- Cache-coherent access to main memory
- Advantages
  - Different types of cores for energy efficiency and performance
  - Multiple simple cores for high parallel performance, some complex cores for high serial performance (Amdahl's law)
  - Single ISA = single implementation
  - Avoid over-specialization (GPGPUs, ASICs, FPGAs)
- Challenge: mapping jobs to the cores that fit best
  - OS has to rely on performance counters, history, ...
  - DBMS has more valuable knowledge about the work that will be executed
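
The Amdahl's law argument on this slide can be made concrete with a short calculation. The following is a minimal sketch, not taken from the slides: the function name `amdahl_speedup` and the parallel fractions are illustrative assumptions. It shows why a workload with a large serial share benefits little from additional simple cores and still wants a fast core for the serial part.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over one worker when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical parallel fractions, evaluated for a 4-core cluster.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 4), 2))
```

With half of the work serial, four cores give a speedup of only 1.6, which is the case where a complex core for the serial portion pays off.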

4 ARM big.LITTLE (Exynos 5410)
- LITTLE cluster
  - 4 cores, in-order
  - max. 2 issues/cycle
  - 8-10 stage pipeline
  - 32B cache lines
  - 32kB L1 I/D
  - 512kB L2 LLC
  - 3.8mm² die area
- big cluster
  - 4 cores, out-of-order
  - max. 3 issues/cycle
  - stage pipeline (stage count missing)
  - 64B cache lines
  - 32kB L1 I/D
  - 2MB L2 LLC
  - 19mm² die area
- Cache-coherent interconnect (currently only allows one active cluster)
- 2GB dual-channel LPDDR3 main memory (12.8GB/s peak transfer rate)

5 Contributions
1) Analysis of parallel query execution on a single-ISA heterogeneous system (ARM big.LITTLE system)
2) Analysis of parallel database operators on the LITTLE and big cluster
3) A heterogeneity-conscious DBMS-controlled job-to-core mapping approach for parallel query execution

6 Parallel Query Execution in HyPer
- Data-centric code generation
  - operators that do not require intermediate materialization are interleaved and compiled together into pipelines
  - tight work loops: keep data in registers
- Morsel-driven parallelism
  - execute a pipeline in parallel
  - morsels are fragments of input tuples
  - all operators are parallelized (lock-free)
  - elastic
  - NUMA-aware
[Figure: morsel-driven execution of the join lineitem ⋈ part. Pipeline P1 scans part and builds the hash table (HT); pipeline P2 scans lineitem, applies the selection σ, and probes the HT.]
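
The morsel-driven scheme above can be sketched in a few lines. This is an illustrative toy, not HyPer's implementation: `run_pipeline`, `select_and_project`, and the morsel size are assumed names and values, and a plain lock guards the result list for simplicity, whereas the slide states that HyPer's operators are lock-free.

```python
import queue
import threading


def run_pipeline(tuples, pipeline, morsel_size=4, workers=4):
    """Split the input into morsels and let worker threads pull them on demand."""
    morsels = queue.Queue()
    for start in range(0, len(tuples), morsel_size):
        morsels.put(tuples[start:start + morsel_size])

    results = []
    results_lock = threading.Lock()  # simplification: not lock-free

    def worker():
        while True:
            try:
                morsel = morsels.get_nowait()
            except queue.Empty:
                return  # no morsels left for this pipeline
            # Run the whole pipeline on each tuple without materializing
            # intermediate results between operators.
            local = [out for t in morsel for out in pipeline(t)]
            with results_lock:
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def select_and_project(t):
    """Hypothetical pipeline: a selection followed by a projection."""
    if t % 2 == 0:
        yield t * 10


print(sorted(run_pipeline(list(range(10)), select_and_project)))
```

Because workers pull morsels from a shared queue, a faster core simply processes more morsels, which is what makes this style of scheduling a natural fit for cores of unequal speed.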

7 Initial Results: HyPer on big.LITTLE
[Figure: EDP [kJ·s] and response time [s] on LITTLE and big, for single-threaded and multi-threaded (4 threads) execution.]
- TPC-H scale factor 2, running all 22 queries, OS ondemand
- Energy Delay Product: EDP = energy consumed × response time
- Single-threaded execution: LITTLE core has worse EDP than big core
- Multi-threaded execution: LITTLE cluster and big cluster have equal EDP
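
As a worked illustration of the energy delay product, the sketch below multiplies energy by response time for two runs. The measurements are made-up numbers chosen only to show how a slow, frugal cluster and a fast, power-hungry cluster can end up with the same EDP; they are not results from the slides.

```python
def edp(energy_joules: float, response_time_s: float) -> float:
    """Energy delay product: energy consumed times response time (J*s)."""
    return energy_joules * response_time_s


# Hypothetical (energy [J], response time [s]) pairs, not measured values.
runs = {"LITTLE": (20.0, 10.0), "big": (50.0, 4.0)}

for cluster, (energy, seconds) in runs.items():
    print(cluster, edp(energy, seconds))
```

A lower EDP is better; it rewards configurations that reduce energy without giving up too much response time, and vice versa.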

8 Core Database Operator Analysis (i)
[Figure: response time [ms] over clock rate [MHz] on LITTLE and big for four operators: hash equi-join, hash group-by (duplicate elimination, 5 groups), aggregation (5 columns, 1 group), merge sort.]
- Working set in LLC
- 4-way parallel processing of operators

9 Core Database Operator Analysis (ii)
[Figure: response time [ms] over clock rate [MHz] on LITTLE and big for four operators: hash equi-join, hash group-by (duplicate elimination, 5 groups), aggregation (5 columns, 1 group), merge sort.]
- Working set exceeding LLC
- 4-way parallel processing of operators

10 Core Database Operator Analysis (iii)
[Figure: EDP [mJ·s] of LITTLE (600 MHz) and big (1600 MHz) for the operators equi-join, group-by, aggregation, and sort. Bar annotations in plot order: LITTLE (-44%), LITTLE (-27%), big (-65%), big (-37%).]
- Working set exceeding LLC
- 4-way parallel processing of operators
- Even with varying implementations, the trend stays the same

11 Parallel Hash Equi-Join
- Join build
  - LITTLE cluster has better EDP
  - atomic CAS to build hash table has worse performance on big cluster
- Join probe
  - big and LITTLE cluster have almost equal EDP
  - pointer chasing vs. hash table in LLC (marked 1 and 2 in the plot)
[Figure: (a) build, (b) probe. Response time [ms] and energy consumption over tuples in R for LITTLE 600 MHz (4 cores), big 1600 MHz (4 cores), and big 1600 MHz (1 core). Caption: Response time and energy consumption of multi-threaded build and probe phases of the hash join R ⋈ S on the LITTLE and big cluster.]

12 Getting a better mileage while driving faster
[Figure: plot against response time [s] with a reference curve of constant EDP relative to 1600 MHz. Series: fixed clock (LITTLE), fixed clock (big), OS scheduling, DBMS-controlled job-to-core mapping. Labeled points: 1600 MHz, 600 MHz, 250 MHz, and the OS governors Performance, Ondemand, Powersave.]

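13 Heterogeneity-Conscious Dispatching
- Operator benchmarks feed the Performance and Energy Model
- The dispatcher takes the mapping decision at runtime
[Figure: the dispatcher consults the Performance and Energy Model and assigns the morsels (M1, M2, ...) of pipeline jobs (J1, J2, ...) to the cores of the big cluster and the LITTLE cluster.]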
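
A runtime mapping decision of this kind could be sketched as follows. This is an assumption-laden illustration, not the dispatcher from the slides: it assumes the objective is the lowest predicted EDP, and `MODELS`, `choose_cluster`, and all coefficients are hypothetical stand-ins for a performance and energy model.

```python
# Hypothetical per-cluster, per-operator predictors:
# number of tuples -> (predicted seconds, predicted joules).
MODELS = {
    ("LITTLE", "join-build"): lambda n: (2e-6 * n, 1e-6 * n),
    ("big", "join-build"): lambda n: (1e-6 * n, 3e-6 * n),
    ("LITTLE", "aggregation"): lambda n: (4e-6 * n, 2e-6 * n),
    ("big", "aggregation"): lambda n: (1e-6 * n, 3e-6 * n),
}
CLUSTERS = ("LITTLE", "big")


def choose_cluster(operator: str, tuples: int) -> str:
    """Pick the cluster with the lowest predicted EDP for this pipeline job."""

    def predicted_edp(cluster: str) -> float:
        seconds, joules = MODELS[(cluster, operator)](tuples)
        return seconds * joules

    return min(CLUSTERS, key=predicted_edp)


for operator in ("join-build", "aggregation"):
    print(operator, "->", choose_cluster(operator, 1000))
```

The point of the sketch is that the decision is made per pipeline job from knowledge about the operator and its input size, which is information a DBMS has and an OS scheduler does not.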

14 Performance Energy Model (PEM)
[Figure: join build. Response time [s] over tuples in R for LITTLE 600 MHz and big 1600 MHz, with the fitted predictors f_LITTLE,join-build and f_big,join-build.]
- PEM
  - 2 segments: working set in LLC and exceeding LLC
  - linear regression
  - models based on benchmarks, data from query processing
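
A two-segment linear model of this shape can be fitted with ordinary least squares on each segment. The sketch below is one possible reading of the slide, not the authors' code: `fit_line`, `fit_two_segment`, the segment boundary `llc_limit` (expressed here as a tuple count), and the benchmark samples are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_two_segment(samples, llc_limit):
    """Fit one line for working sets within the LLC and one for those beyond it."""
    in_llc = [(x, y) for x, y in samples if x <= llc_limit]
    beyond = [(x, y) for x, y in samples if x > llc_limit]
    small = fit_line(*zip(*in_llc))
    large = fit_line(*zip(*beyond))

    def predict(x):
        slope, intercept = small if x <= llc_limit else large
        return slope * x + intercept

    return predict


# Hypothetical benchmark samples: (tuples in R, response time in ms).
samples = [(1000, 1.0), (2000, 2.0), (3000, 3.0),
           (5000, 9.0), (6000, 12.0), (7000, 15.0)]
predict = fit_two_segment(samples, llc_limit=4000)
print(predict(2500), predict(8000))
```

Fitting the segments separately captures the change in slope once the working set no longer fits into the last-level cache, which a single line over the whole range would average away.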

15 Evaluation: TPC-H Scale Factor 2
[Figure: EDP [J·s] and response time [s] for DBMS (our approach), big 1600 MHz, LITTLE 600 MHz, and OS ondemand.]
- %/12%/14% improved EDP over OS ondemand/LITTLE/big
- Some queries are both faster and more energy efficient (e.g., query 14)
- Query 14: 63%/45% improved EDP over OS ondemand/big

16 Conclusion
- Heterogeneous single-ISA multi-core processors are an interesting design space in light of dark silicon and Amdahl's law: in our experiments, 4 LITTLE cores consistently outperform 1 big core (but occupy the same die area)
- These processors are no free lunch for parallel database systems: job-to-core mapping is challenging and is better performed by the DBMS than by the OS (more knowledge)
- A heterogeneity-conscious DBMS-controlled job-to-core mapping approach for parallel query execution achieves higher performance while using less energy
- We get a better mileage while driving faster
