FPGA-based Multithreading for In-Memory Hash Joins



FPGA-based Multithreading for In-Memory Hash Joins
Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras
University of California, Riverside

Outline
- Background
- What are FPGAs
- Multithreaded Architectures & Memory Masking
- Case Study: In-Memory Hash Join
- FPGA Implementation
- Software Implementation
- Experimental Results

What are FPGAs?
- Reprogrammable fabric
  - Build custom application-specific circuits, e.g. join, aggregation, etc.
  - Load different circuits onto the same FPGA chip
- Highly parallel by nature
  - Designs are capable of managing thousands of threads concurrently

Memory Masking
- Multithreaded architectures
  - Issue a memory request & stall the thread
  - Fast context switching
  - Resume the thread on the memory response
- Multithreading is an alternative to caching
  - Not a general-purpose solution: requires highly parallel applications
  - Good for irregular operations (e.g. hashing, graphs, etc.)
- Some database operations could benefit from multithreading
  - SPARC processors and GPUs offer limited multithreading
  - FPGAs can offer full multithreading
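To make the context-switch-on-request pattern concrete, here is a loose software analogue: a thread that issues a memory request is parked instead of busy-waiting, another ready thread runs, and the parked thread resumes when its response arrives. All names here (Scheduler, on_memory_request, etc.) are illustrative assumptions, not the FPGA's actual control logic.

```cpp
#include <cstdint>
#include <deque>

// Illustrative only: a software analogue of masking memory latency with
// many lightweight threads. Each "thread" is just an ID here.
struct Scheduler {
    std::deque<uint64_t> ready;    // threads able to run right now
    std::deque<uint64_t> waiting;  // threads stalled on an outstanding memory request

    // A thread issues a memory request: park it instead of letting it spin.
    void on_memory_request(uint64_t tid) { waiting.push_back(tid); }

    // Its memory response arrives: the thread becomes runnable again.
    void on_memory_response(uint64_t tid) {
        for (auto it = waiting.begin(); it != waiting.end(); ++it)
            if (*it == tid) { waiting.erase(it); break; }
        ready.push_back(tid);
    }

    // "Fast context switch": simply hand out the next ready thread.
    bool next_thread(uint64_t& tid) {
        if (ready.empty()) return false;
        tid = ready.front();
        ready.pop_front();
        return true;
    }
};
```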

Case Study: In-Memory Hash Join
- Relational join is crucial to any OLAP workload
  - Hash join is faster than sort-merge join on multicore CPUs [2]
  - Typically, FPGAs implement sort-merge join [3]
- Building a hash table is non-trivial for FPGAs
  - Store data on the FPGA [4]: fast memory accesses, but small size (a few MBs)
  - Store data in memory: larger size, but longer memory accesses
- We propose the first end-to-end in-memory hash join implementation on FPGAs

[2] Balkesen, C. et al. Main-memory Hash Joins on Multi-core CPUs: Tuning to the underlying hardware. ICDE 2013
[3] Casper, J. et al. Hardware Acceleration of Database Operations. FPGA 2014
[4] Halstead, R. et al. Accelerating Join Operation for Relational Databases with FPGAs. FPGA 2013

FPGA Implementation
- All data structures are maintained in memory
  - Relations, hash table, and the linked lists
  - Separate chaining with linked lists for conflict resolution
- An FPGA engine is a digital circuit
  - Separate engines for the build & probe phases
  - Reads tuples, and updates the hash table and linked lists
  - Handles multiple tuples concurrently
- Engines operate independently of each other
  - Many engines can be placed on a single FPGA chip
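As a rough software counterpart of this layout (hash table plus separately chained linked lists, all resident in memory), the following C++ sketch defines the types reused by the build and probe sketches below. The field names, the atomic head pointers, and the modulo hash are assumptions made for illustration, not the exact structures used on the Convey-MX.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One linked-list node per build tuple (separate chaining for conflicts).
struct Node {
    uint32_t key;      // 4-byte integer join key
    uint32_t payload;  // 4-byte payload value
    Node*    next;     // next node in this bucket's chain (nullptr at the tail)
};

// The hash table itself is just an array of head pointers, one per bucket.
struct HashTable {
    std::vector<std::atomic<Node*>> heads;
    explicit HashTable(size_t buckets) : heads(buckets) {
        for (auto& h : heads) h.store(nullptr, std::memory_order_relaxed);
    }
    // Placeholder hash; a real design would use a stronger hash function.
    size_t bucket(uint32_t key) const { return key % heads.size(); }
};
```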

FPGA Implementation: Build Phase Engine
- Every cycle a new tuple enters the FPGA engine
- Every tuple in R is treated as a unique thread:
  - Fetch the tuple from memory
  - Calculate the hash value
  - Create a new linked-list node
  - Update the hash table (has to be synchronized via atomic operations)
  - Insert the new node into the linked list
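A minimal software sketch of the per-tuple build work listed above, reusing Node and HashTable from the previous sketch. The compare-and-swap loop stands in for the synchronized hash-table update the slide mentions; on the FPGA these steps are pipelined, with a new tuple entering every cycle, rather than executed sequentially.

```cpp
// Build phase, one tuple of R per "thread": hash the key, create a node,
// and atomically insert it at the head of its bucket's chain.
void build_one(HashTable& ht, uint32_t key, uint32_t payload) {
    Node* node = new Node{key, payload, nullptr};
    std::atomic<Node*>& head = ht.heads[ht.bucket(key)];
    Node* old_head = head.load(std::memory_order_relaxed);
    do {
        node->next = old_head;  // link the new node in front of the current chain
    } while (!head.compare_exchange_weak(old_head, node));  // retry if another thread won the race
}
```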

FPGA Implementation: Probe Phase Engine
- Every tuple in S is treated as a unique thread:
  - Fetch the tuple from memory
  - Calculate the hash value
  - Probe the hash table for the linked-list head pointer (drop the tuple if the hash table location is empty)
  - Search the linked list for a match (threads are recycled through the data-path until they reach the last node)
  - Tuples with matches are joined
- Stalls can be issued between new & recycled jobs
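A matching sketch of the per-tuple probe work, again reusing the types from the earlier sketch. The loop over the chain corresponds to the FPGA recycling a tuple through the data-path until the last node: a tuple whose bucket is empty drops out immediately, and every key match is emitted as a joined pair.

```cpp
#include <utility>

// Probe phase, one tuple of S per "thread": hash the key, walk the bucket's
// chain, and emit a joined (R.payload, S.payload) pair for every match.
void probe_one(const HashTable& ht, uint32_t key, uint32_t payload,
               std::vector<std::pair<uint32_t, uint32_t>>& out) {
    const Node* node = ht.heads[ht.bucket(key)].load(std::memory_order_relaxed);
    for (; node != nullptr; node = node->next) {  // empty bucket: body never runs
        if (node->key == key)
            out.emplace_back(node->payload, payload);
    }
}
```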

FPGA Area & Memory Channel Constraints
- Target platform: Convey-MX
  - Xilinx Virtex-6 760 FPGAs
  - 4 FPGAs
  - 16 memory channels per FPGA
- Build engines need 4 channels
- Probe engines need 5 channels
- Designs are memory-channel limited

Software Implementation
- An existing state-of-the-art multi-core software implementation was used [5]
- Hardware-oblivious approach
  - Relies on hyper-threading to mask memory & thread synchronization latency
  - Does not require any architecture configuration
- Hardware-conscious approach
  - Performs a preliminary radix partitioning step
  - Parameterized by L2 & TLB cache size (to determine the number of partitions & fan-out of the partitioning algorithm)
- Data format, commonly used in column stores: two 4-byte-wide columns per relation
  - Integer join key
  - Random payload value
  - [Figure: relations R (tuples r1..rn) and S (tuples s1..sm), each with a Key and a Payload column, joined on R.key = S.key]

[5] Balkesen, C. et al. Main-memory Hash Joins on Multi-core CPUs: Tuning to the underlying hardware. ICDE 2013
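Tying the column-store format on this slide to the earlier sketches: a relation is just two 4-byte-wide columns, and a single-threaded driver can build on R and then probe with S. The Relation struct, the bucket count, and the driver itself are illustrative assumptions; the hardware instead runs the per-tuple build and probe steps as thousands of concurrent threads across several engines.

```cpp
// Column-store format from the slide: two 4-byte columns per relation.
struct Relation {
    std::vector<uint32_t> key;      // integer join key
    std::vector<uint32_t> payload;  // random payload value
};

// Minimal single-threaded driver: build on R, then probe with S (R.key = S.key).
std::vector<std::pair<uint32_t, uint32_t>>
hash_join(const Relation& R, const Relation& S) {
    HashTable ht(2 * R.key.size() + 1);  // assumed bucket count (+1 avoids an empty table)
    for (size_t i = 0; i < R.key.size(); ++i)
        build_one(ht, R.key[i], R.payload[i]);

    std::vector<std::pair<uint32_t, uint32_t>> out;
    for (size_t i = 0; i < S.key.size(); ++i)
        probe_one(ht, S.key[i], S.payload[i], out);
    return out;
}
```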

Experimental Evaluation
- Four synthetically generated datasets with varied key distribution
  - Unique: shuffled sequentially increasing values (no repeats)
  - Random: uniformly distributed random values (few repeats)
  - Zipf: skewed values with skew factors 0.5 and 1
- Each dataset has a set of relation pairs (R & S) ranging from 1M to 1B tuples
- Results were obtained on the Convey-MX heterogeneous platform:

  Hardware Region                          Software Region
  FPGA board: Virtex-6 760                 CPU: Intel Xeon E5-2643
  # FPGAs: 4                               # CPUs: 2
  Clock Freq.: 150 MHz                     Clock Freq.: 3.3 GHz
  Engines per FPGA: 4 / 3                  Cores / Threads: 4 / 8
  Memory Channels: 32                      L3 Cache: 10 MB
  Memory Bandwidth (total): 76.8 GB/s      Memory Bandwidth (total): 102.4 GB/s

Throughput Results: Unique Dataset
- 1 CPU (51.2 GB/s)
  - The non-partitioned CPU approach is better than the partitioned one, since each bucket has exactly one linked-list node
- 2 FPGAs (38.4 GB/s)
  - 900 Mtuples/s when the probe phase dominated
  - 450 Mtuples/s when the build phase dominated
  - 2x speedup over the CPU

Throughput Results: Random & Zipf_0.5 Datasets
- As the average chain length grows beyond one, the non-partitioned CPU solution is outperformed by the partitioned one
- The FPGA maintains similar throughput; speedup over the CPU is ~3.4x

Throughput Results: Zipf_1.0 Dataset
- FPGA throughput decreases significantly due to stalling during the probe phase

Scale-up Results: Probe-dominated
- Scale up: every 4 CPU threads are compared to 1 FPGA (roughly matching memory bandwidth)
- Only the Unique dataset is shown; Random & Zipf_0.5 behave similarly
- The FPGA does not scale on Zipf_1.0 data
- The partitioned CPU solution scales up, but at a much lower rate than the FPGA

Scale-up Results: R = S
- The FPGA does not scale better than the partitioned CPU, but it is still ~2x faster

Conclusions
- Presented the first end-to-end in-memory hash join implementation on FPGAs
- Showed that memory masking can be a viable alternative to caching
- FPGA multithreading can achieve a 2x to 3.4x speedup over CPUs
  - This does not hold for heavily skewed datasets (e.g. Zipf 1.0)

Normalized Throughput Comparison
- Hash join is a memory-bound problem
- The Convey-MX platform gives the multicore solution an advantage in terms of memory bandwidth
- The normalized comparison shows that the FPGA approach achieves speedups of up to 6x (Unique) and 10x (Random & Zipf_0.5)