Towards Fast SQL Query Processing in DB2 BLU Using GPUs: A Technology Demonstration. Sina Meraji, sinamera@ca.ibm.com


Please Note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Outline
- DB2 BLU Acceleration
- Hardware Acceleration
- Nvidia GPU
- Key Analytic Database Operators
- Our Acceleration Design
- Live Technology Demonstration

DB2 with BLU Acceleration: Next-generation database
- Super Fast (query performance)
- Super Simple (load-and-go)
- Super Small (storage savings)
Seamlessly integrated:
- Built seamlessly into DB2
- Consistent SQL, language interfaces, administration
- Dramatic simplification
Hardware optimized:
- Memory optimized
- CPU optimized
- I/O optimized

Customer example: one of the world's most profitable and secure-rated banks. Its risk system injects 1/2 TB per night from 25 different source systems, with impressive load times. Some queries achieved an almost 100x speed-up with literally no tuning. 6 hours from installing BLU to query results.

DB2 with BLU Acceleration: The 7 Big Ideas

Hardware Acceleration
Use specific hardware to execute software functions faster.
Popular accelerator technologies:
- SIMD: present in every CPU
- GPUs: easy to program
- FPGAs: hard to program

Nvidia GPU: NVIDIA Tesla K40 (Kepler technology)
- Peak double-precision performance: 1.66 TFLOPS
- Peak single-precision performance: 5 TFLOPS
- High memory bandwidth: up to 288 GB/s
- Memory size: 12 GB
- Number of cores: 2880

Key Analytic Database Operators
GROUP BY / Aggregation:
SELECT column_name, aggregate_function(column_name) FROM table_name WHERE column_name operator value GROUP BY column_name;
Join:
SELECT column_name(s) FROM table1 JOIN table2 ON table1.column_name = table2.column_name;
Sort:
SELECT column_name FROM table_name ORDER BY column_name;

Hardware Configuration
- POWER8 S824L: 2 sockets, 12 cores per socket, SMT-8, 512 GB memory
- Ubuntu 14.04.2 LTS (little endian)
- GPU: 2x NVIDIA Tesla K40

Infrastructure: Adding Support for the Nvidia GPU
- CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA.
- Memory management: pin/unpin memory. To run on the GPU, threads ask for pinned memory, which enables fast transfers to/from the GPU (see the sketch below).
- Transfers go over PCIe Gen3; this will be improved in 2016 with NVLink on POWER.
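A minimal sketch (hypothetical, not the DB2 code) of why pinned memory matters: cudaHostAlloc returns page-locked host memory, which cudaMemcpyAsync can DMA to the device at full PCIe bandwidth on a stream.

// Hypothetical sketch: pinned host buffers allow asynchronous DMA over PCIe.
#include <cuda_runtime.h>
#include <cstring>

cudaError_t copy_column_to_gpu(const void *src, size_t bytes,
                               void **d_out, cudaStream_t stream)
{
    void *h_pinned = nullptr;
    void *d_buf = nullptr;

    // Pinned (page-locked) host memory: required for truly asynchronous,
    // full-bandwidth transfers across PCIe Gen3.
    cudaError_t rc = cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);
    if (rc != cudaSuccess) return rc;

    memcpy(h_pinned, src, bytes);              // stage the column data

    rc = cudaMalloc(&d_buf, bytes);            // device-side buffer
    if (rc == cudaSuccess)
        rc = cudaMemcpyAsync(d_buf, h_pinned, bytes,
                             cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);             // wait for the copy in this sketch
    cudaFreeHost(h_pinned);
    *d_out = d_buf;
    return rc;
}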

GPU Scheduler
- Each CPU thread can submit tasks to the GPU scheduler and must submit its memory requirement on the GPU.
- The scheduler checks all the GPUs on the system, reserves the memory on a GPU, creates a stream, and returns to the CPU thread with the GPU number and stream id (sketched below).
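A hypothetical sketch of the scheduling step just described, using only standard CUDA runtime calls (cudaMemGetInfo, cudaStreamCreate); the real scheduler would also reserve the memory it promises, which is omitted here.

// Hypothetical sketch: pick a GPU with enough free memory, create a stream,
// and hand (device, stream) back to the submitting CPU thread.
#include <cuda_runtime.h>

struct GpuAssignment { int device; cudaStream_t stream; bool ok; };

GpuAssignment assign_gpu(size_t bytes_needed)
{
    GpuAssignment a = { -1, nullptr, false };
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    for (int dev = 0; dev < ngpus; ++dev) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(dev);
        cudaMemGetInfo(&free_b, &total_b);    // how much device memory is left
        if (free_b >= bytes_needed) {
            cudaStreamCreate(&a.stream);      // one stream per submitted task
            a.device = dev;
            a.ok = true;
            break;                            // first GPU with room wins
        }
    }
    return a;                                 // caller launches work on (device, stream)
}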

Our Acceleration Design
- Use parallel POWER8 threads for reading/preprocessing data
- Transfer data to the GPU
- Have the GPU process the query
- Transfer the result back to the host machine

Hybrid Design: Use Both POWER8 and GPU for Query Processing
Decide where to execute the query dynamically at runtime (an illustrative routing heuristic is sketched below):
- Use GPU only
- Use CPU only
- Use both
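The talk does not describe the actual cost model, so the following is a purely illustrative heuristic for the kind of runtime decision involved: route a query fragment to the GPU only when its operators are supported there and the working set fits in device memory.

// Purely illustrative heuristic (not the DB2 BLU cost model).
enum Target { RUN_ON_CPU, RUN_ON_GPU, RUN_ON_BOTH };

Target choose_target(size_t working_set_bytes, size_t gpu_free_bytes,
                     bool gpu_supports_operators, double cpu_load)
{
    if (!gpu_supports_operators)            return RUN_ON_CPU;
    if (working_set_bytes > gpu_free_bytes) return RUN_ON_CPU;   // would thrash PCIe
    if (cpu_load > 0.8)                     return RUN_ON_GPU;   // offload a busy CPU
    return RUN_ON_BOTH;                                          // split the work
}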

GPU Kernels
- Design and develop our own GPU runtime
- Developed fast kernels, e.g. GROUP BY, aggregation
- Use Nvidia CUDA calls, e.g. atomic operations
- Use Nvidia fast kernels, e.g. sort

What Does It Do? Group By/Aggregation
SELECT C1, SUM(C2) FROM Simple_Table GROUP BY C1

Simple_Table:
C1 | C2
9  | 98
2  | 21
9  | 92
3  | 38
9  | 90
2  | 22

Result:
C1 | SUM(C2)
9  | 280
3  | 38
2  | 43

Group By/Aggregate Operation on the GPU
Hash-based algorithm:
- Use grouping keys and a hash function to insert keys into a hash table
- Aggregation: use Nvidia atomic CUDA calls for some data types (INT32, INT64, etc.); use locks for other data types (DOUBLE, FixedString, etc.)
Three main steps (steps 1 and 2 are detailed on the next slides; a sketch of step 3 follows):
1. Initialize the hash table
2. Grouping/aggregation in a global hash table
3. Scanning the global hash table to retrieve groups
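A minimal sketch of step 3, compacting the occupied hash-table slots into a dense result array; the slot layout and the EMPTY marker are illustrative assumptions, not the DB2 BLU implementation.

// Hypothetical sketch: every occupied slot is one group; atomicAdd on a
// global counter assigns each group a position in the output array.
#include <stdint.h>

struct Slot { unsigned long long key; long long sum; };
#define EMPTY 0xFFFFFFFFFFFFFFFFULL

__global__ void scan_hash_table(const Slot *ht, int n_slots,
                                Slot *out, unsigned int *out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < n_slots; i += gridDim.x * blockDim.x) {    // grid-stride loop
        if (ht[i].key != EMPTY) {                         // occupied slot = one group
            unsigned int pos = atomicAdd(out_count, 1u);  // grab an output position
            out[pos] = ht[i];                             // emit (key, aggregate)
        }
    }
}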

Initialization Kernel
- Create/initialize the hash table in device memory
- Data needs to be aligned; padding may be needed, and the grouping key can be anywhere in the hash-table entry based on alignment requirements
- Initialization happens in parallel using parallel GPU threads (see the sketch below)

Example: SELECT SUM(C1), MAX(C2), MIN(C3) FROM table1 GROUP BY C1
(Int64: C1, C2; Int32: C3)

C1 (64 bit)      | SUM(C1) (64 bit) | MAX(C2) (64 bit)     | MIN(C3) (64 bit) | Padding (32 bit)
FFFFFFFFFFFFFFFF | 0                | -9223372036854775808 | 2147483647       | 0
FFFFFFFFFFFFFFFF | 0                | -9223372036854775808 | 2147483647       | 0
FFFFFFFFFFFFFFFF | 0                | -9223372036854775808 | 2147483647       | 0
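A hypothetical sketch of such an initialization kernel: each GPU thread clears a stripe of entries, writing the "empty" key marker and the identity value of each aggregate. The HtEntry layout mirrors the example above but is an illustrative assumption.

// Hypothetical sketch: parallel initialization of the device-resident hash table.
#include <stdint.h>

struct HtEntry {               // illustrative layout, not the DB2 one
    uint64_t key;              // grouping key; UINT64_MAX marks an empty slot
    int64_t  sum_c1;           // SUM starts at 0
    int64_t  max_c2;           // MAX starts at the smallest representable value
    int64_t  min_c3;           // MIN starts at the largest value of the source type
    int32_t  pad;              // keep the entry size a multiple of the alignment
};

__global__ void init_hash_table(HtEntry *ht, int n_slots)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < n_slots; i += gridDim.x * blockDim.x) {   // grid-stride loop
        ht[i].key    = UINT64_MAX;                       // 0xFFFFFFFFFFFFFFFF
        ht[i].sum_c1 = 0;
        ht[i].max_c2 = INT64_MIN;                        // -9223372036854775808
        ht[i].min_c3 = INT32_MAX;                        // 2147483647
        ht[i].pad    = 0;
    }
}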

Hash-Based Group By/Aggregate
Group by: parallel threads read keys/payloads from the table and insert keys into the hash table (HT).
- Use a hash function to hash keys:
  - Murmur hashing for wide keys (larger than 64 bits): http://en.wikipedia.org/wiki/murmurhash
  - Mod hashing for short keys (smaller than 64 bits)
- If a collision happens, we check the next empty slot in the hash table
Aggregation: if a thread's key matches an entry in the HT, we perform the aggregation function (see the kernel sketch below).

Example: Thread i reads (Key = ABFGH, Payload1 = 13, Payload2 = 21.2)
HT before aggregation: Key = ABFGH, Sum = 8,  Min = 1.2
HT after aggregation:  Key = ABFGH, Agg1(Sum) = 21, Agg2(Min) = 1.2
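A hypothetical sketch of the grouping/aggregation kernel for the simplest case, 64-bit keys with a single SUM payload: mod hashing, linear probing on collision, and atomics both to claim a slot and to aggregate. The real DB2 BLU kernel handles many more key and payload layouts.

// Hypothetical sketch: hash-based GROUP BY with SUM, one slot claimed per key.
#include <stdint.h>

struct SumSlot { unsigned long long key; long long sum; };
#define SLOT_EMPTY 0xFFFFFFFFFFFFFFFFULL

__global__ void group_and_sum(const unsigned long long *keys, const long long *vals,
                              int n_rows, SumSlot *ht, int n_slots)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < n_rows; i += gridDim.x * blockDim.x) {
        unsigned long long k = keys[i];
        int slot = (int)(k % (unsigned long long)n_slots);   // mod hashing for short keys
        for (;;) {
            // Try to claim the slot for this key, or find it already claimed by the same key.
            unsigned long long prev = atomicCAS(&ht[slot].key, SLOT_EMPTY, k);
            if (prev == SLOT_EMPTY || prev == k) {
                // 64-bit atomic add; signed SUM is correct under the unsigned cast.
                atomicAdd((unsigned long long *)&ht[slot].sum,
                          (unsigned long long)vals[i]);
                break;
            }
            slot = (slot + 1) % n_slots;                      // collision: probe next slot
        }
    }
}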

Aggregation
CUDA atomic operations:
- Implemented in hardware (very fast)
- Usable for both global and shared memory
- Available for specific data types (INT32, INT64, etc.)
Use atomicCAS for specific data types (e.g. DOUBLE) and specific agg functions/data types, as in the double-precision atomicAdd from the CUDA programming guide:

__device__ double atomicAdd(double* address, double val)
{
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it.
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Retry until no other thread changed the value between the read and the CAS.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Check Nvidia docs for more details: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomic-functions

Aggregation (Continued): Locks
- Used for data types that are larger than 64 bits (DECIMAL, FixedString)
- Each thread needs to: acquire a lock associated with the corresponding hash-table entry, apply the AGG function, and release the lock
- This is a costly operation, since it serializes threads that hit the same entry (a sketch follows)
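A hypothetical sketch of such a lock path for a wide type: a per-slot spinlock built from atomicCAS guards a non-atomic read-modify-write. The 128-bit decimal representation and MIN comparison are illustrative assumptions; the critical section sits inside the CAS success branch so a warp cannot deadlock while one of its threads holds the lock.

// Hypothetical sketch: lock-protected MIN aggregation for a 128-bit value.
struct Decimal128 { unsigned long long hi, lo; };   // illustrative wide value

__device__ void agg_min_decimal(Decimal128 *slot, Decimal128 v, int *slot_mutex)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(slot_mutex, 0, 1) == 0) {     // acquire the per-slot lock
            if (v.hi < slot->hi || (v.hi == slot->hi && v.lo < slot->lo))
                *slot = v;                          // apply the AGG function
            __threadfence();                        // make the update visible
            atomicExch(slot_mutex, 0);              // release the lock
            done = true;
        }                                            // else: spin and retry
    }
}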

Hash-Based Group By/Aggregate: Example
SELECT C1, SUM(C2) FROM Simple_Table GROUP BY C1

Simple_Table (KEY, Value):
93 | 5
23 | 2
93 | 1
23 | 5
93 | 0
93 | 1000

Parallel HT creation -> Hash Table (Key, Aggregated Value):
23 | 7
93 | 1006

Parallel probe -> Result:
23 | 7
93 | 1006

Supported Data Types & AGG Functions

SQL type          | MAX            | MIN            | SUM                       | COUNT
SINT8             | Cast to SINT32 | Cast to SINT32 | Cast to SINT32            | AtomicCount
SINT16            | Cast to SINT32 | Cast to SINT32 | Cast to SINT32            | AtomicCount
SINT32            | AtomicMax      | AtomicMin      | AtomicAdd                 | AtomicCount
SINT64            | AtomicMax      | AtomicMin      | AtomicAdd                 | AtomicCount
REAL              | Use AtomicCAS  | Use AtomicCAS  | AtomicAdd                 | AtomicCount
DOUBLE            | Use AtomicCAS  | Use AtomicCAS  | Use AtomicCAS             | AtomicCount
DECIMAL           | Lock           | Lock           | Use AtomicAdd (2-3 steps) | AtomicCount
DATE              | Cast to SINT32 | Cast to SINT32 | N/A                       | AtomicCount
TIME              | Cast to SINT32 | Cast to SINT32 | N/A                       | AtomicCount
TIMESTAMP (64bit) | AtomicMax      | AtomicMin      | N/A                       | AtomicCount
FixedString       | Lock           | Lock           | N/A                       | AtomicCount

GPU Sort
- Reduced the amount of data transferred between the host and the GPU device
- Use the Nvidia fast sort kernel: copy key and data to GPU memory, using a 4-byte key and 4-byte payload
- Skip the copy-back of the sorted keys
- Skip the copy of payload data into GPU memory on subsequent sorts that resolve duplicates
- Use the same data format between DB2 and the GPU sort routines
- Handle multiple small sort jobs concurrently in the GPU: each thread works on its own sort data range while there are more sort-key bytes to process
(A sketch of the device-side sort follows.)
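The talk only says an Nvidia "fast sort kernel" is used; the sketch below assumes Thrust's sort_by_key from the CUDA toolkit, which dispatches a radix sort for 32-bit keys, and mirrors the 4-byte key / 4-byte payload scheme described above.

// Hypothetical sketch: sort 4-byte keys with 4-byte row-id payloads on the GPU.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdint>

void gpu_sort_rows(const uint32_t *h_keys, const uint32_t *h_rowids, size_t n,
                   thrust::device_vector<uint32_t> &d_keys,
                   thrust::device_vector<uint32_t> &d_rowids)
{
    // Copy 4-byte keys and 4-byte payloads to the GPU once.
    d_keys.assign(h_keys, h_keys + n);
    d_rowids.assign(h_rowids, h_rowids + n);

    // Sort payloads by key on the device (radix sort for 32-bit keys).
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_rowids.begin());

    // Per the slide, copying the sorted keys back can be skipped; only the
    // resulting row order (the payload) is needed on the host.
}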

GPU Sort: Where the GPU Performs Best
- Up to 750M rows, when all sort data fits within GPU device memory
- Sort on a single integer column of size 4 bytes or less, i.e. only one trip to the GPU is required

Acceleration Demonstration
Accelerating DB2 BLU query processing with Nvidia GPUs on POWER8 servers: a hardware/software innovation preview.
- Compare query acceleration of DB2 BLU with GPU vs. a non-GPU baseline
- Show CPU offload by demonstrating increased multi-user throughput with DB2 BLU with GPU

BLU Analytic Workload
- A set of queries from existing BLU analytic workloads
- TPC-DS database schema: based on a retail database with in-store, on-line, and catalog sales of merchandise
- 15% of queries use the GPU heavily, 50% use the GPU moderately, 35% do not use the GPU at all
Benchmark configuration:
- 100 GB (raw) data set
- 10 concurrent users

Performance Result
[Chart: Queries Per Hour, Using GPU vs. No GPU]
~2x improvement in workload throughput; CPU offload plus improved query runtimes are the main factors.
[Chart: Avg CPU Total Duration vs. Avg GPU Total Duration for Simple Non-GPU Q47, Simple Non-GPU Q15, ROLAP Q5, ROLAP Q20, ROLAP Q11, GROUP BY Store Sales, BDI CQ1]
Most individual queries improve in end-to-end run time.

GPU Utilization
[Charts: GPU card utilization over time and GPU memory used (MiB) over time, for GPU 0 and GPU 1]
The DB2 BLU GPU demo technology will attempt to balance GPU operations across the available GPU devices.
These measurements are taken from the demo workload running in continuous mode.

Summary
Hardware/Software Innovation Preview demonstrated GPU acceleration:
- Improved DB2 BLU query throughput
- Use both the POWER8 processor and Nvidia GPUs
- Design and develop fast GPU kernels
- Use Nvidia kernels, function calls, etc.
Hardware acceleration shows potential for:
- Faster execution time
- CPU off-loading