Rethinking SIMD Vectorization for In-Memory Databases




SIGMOD 2015, Melbourne, Victoria, Australia
Rethinking SIMD Vectorization for In-Memory Databases
Orestis Polychroniou (Columbia University), Arun Raghavan (Oracle Labs), Kenneth A. Ross (Columbia University)

Latest Hardware Designs
Mainstream multi-core CPUs
- Use complex cores (e.g. Intel Haswell): massively superscalar, aggressively out-of-order
- 2-18 cores; pack few cores per chip
- High power & area per core
[Diagram: eight cores sharing an L3 cache]

Latest Hardware Designs
Mainstream multi-core CPUs
- Use complex cores (e.g. Intel Haswell): massively superscalar, aggressively out-of-order
- Pack few cores per chip: high power & area per core
Many Integrated Cores (MIC)
- Use simple cores (e.g. Intel P54C): in-order & non-superscalar
- Augment SIMD to bridge the gap: increase SIMD register size, more advanced SIMD instructions
- Pack many cores per chip (~60 cores): low power & area per core
[Diagram: many cores with private L2 caches connected by a ring bus]

SIMD & Databases
Automatic vectorization
- Works for simple loops only
- Rare in database operators
[What is SIMD? One instruction a + b = c operates on W lanes element-wise: a1+b1=c1, a2+b2=c2, ..., aW+bW=cW]

SIMD & Databases
Automatic vectorization
- Works for simple loops only
- Rare in database operators
Manual vectorization
- Linear access operators: predicate evaluation, compression
- Ad-hoc vectorization: sorting (e.g. merge-sort, comb-sort, ...), merging (e.g. 2-way, bitonic, ...)
- Generic vectorization: multi-way trees, bucketized hash tables
[What is SIMD? One instruction a + b = c operates on W lanes element-wise: a1+b1=c1, ..., aW+bW=cW]
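To make the W-lane picture above concrete, here is a minimal element-wise addition in C with AVX-512 intrinsics (W = 16 lanes of 32-bit integers); the function name and the scalar tail loop are illustrative, not from the paper.

    #include <immintrin.h>
    #include <stddef.h>

    /* c[i] = a[i] + b[i]: one AVX-512 instruction adds W = 16 lanes at once. */
    void add_vectors(const int *a, const int *b, int *c, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            _mm512_storeu_si512(c + i, _mm512_add_epi32(va, vb));
        }
        for (; i < n; i++)   /* scalar tail for the last n % 16 elements */
            c[i] = a[i] + b[i];
    }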

Contributions & Outline
Full vectorization
- From O(f(n)) scalar to O(f(n)/W) vector operations
- Random accesses excluded
Principles for good (efficient) vectorization
- Reuse fundamental operations across multiple vectorizations
- Favor vertical vectorization by processing different input data per lane
- Maximize lane utilization by executing different things per lane subset

Contributions & Outline
Full vectorization
- From O(f(n)) scalar to O(f(n)/W) vector operations
- Random accesses excluded
Principles for good (efficient) vectorization
- Reuse fundamental operations across multiple vectorizations
- Favor vertical vectorization by processing different input data per lane
- Maximize lane utilization by executing different things per lane subset
Vectorize basic operators
- Selection scans
- Hash tables
- Partitioning

Contributions & Outline
Full vectorization
- From O(f(n)) scalar to O(f(n)/W) vector operations
- Random accesses excluded
Principles for good (efficient) vectorization
- Reuse fundamental operations across multiple vectorizations
- Favor vertical vectorization by processing different input data per lane
- Maximize lane utilization by executing different things per lane subset
Vectorize basic operators & build advanced operators (in memory)
- Selection scans
- Hash tables
- Partitioning
- Sorting
- Joins

Contributions & Outline
Full vectorization
- From O(f(n)) scalar to O(f(n)/W) vector operations
- Random accesses excluded
Principles for good (efficient) vectorization
- Reuse fundamental operations across multiple vectorizations
- Favor vertical vectorization by processing different input data per lane
- Maximize lane utilization by executing different things per lane subset
Vectorize basic operators & build advanced operators (in memory)
- Selection scans
- Hash tables
- Partitioning
- Sorting
- Joins
Show impact of good vectorization
- On software (database system) & hardware design

Fundamental Vector Operations
Selective load: with mask 0 1 1 0, load array values X, Y into the masked lanes of the input vector A B C D, producing the output vector A X Y D
Selective store: with mask 0 1 1 0, store the masked lanes of vector A B C D contiguously into the array, producing B C

Fundamental Vector Operations
(Selective) gather: value vector[i] = array[index vector[i]] for each lane
(Selective) scatter: array[index vector[i]] = value vector[i] for each lane
[Diagram: gather and scatter examples driven by an index vector over an array A..H]
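These four primitives map directly onto hardware instructions. Below is a minimal sketch with AVX-512F intrinsics (W = 16 lanes of 32-bit values); the paper itself targets the Xeon Phi (Knights Corner) instruction set, whose intrinsics are spelled differently, and the wrapper names here are illustrative only.

    #include <immintrin.h>

    /* Selective load: replace the lanes selected by mask m with consecutive
     * values from the array; the other lanes keep their old contents. */
    __m512i selective_load(__m512i old, __mmask16 m, const int *array)
    {
        return _mm512_mask_expandloadu_epi32(old, m, array);
    }

    /* Selective store: write only the lanes selected by mask m, packed
     * contiguously into the array. */
    void selective_store(int *array, __mmask16 m, __m512i v)
    {
        _mm512_mask_compressstoreu_epi32(array, m, v);
    }

    /* Gather: value[i] = array[index[i]] for every lane i. */
    __m512i gather(const int *array, __m512i index)
    {
        return _mm512_i32gather_epi32(index, array, sizeof(int));
    }

    /* Scatter: array[index[i]] = value[i] for every lane i. */
    void scatter(int *array, __m512i index, __m512i value)
    {
        _mm512_i32scatter_epi32(array, index, value, sizeof(int));
    }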

Selection Scans
Scalar
- Branching: branch to store qualifiers; affected by branch misses
- Branchless: eliminate branching; use conditional flags

Selection Scans
Scalar
- Branching: branch to store qualifiers; affected by branch misses
- Branchless: eliminate branching; use conditional flags
Vectorized
- Simple: evaluate predicates (in SIMD), selectively store qualifiers; early materialized
- Advanced: late materialized (in the paper)
[Diagram: select where column > C — 1) evaluate the predicate on a vector of input keys against C, 2) selectively store the qualifying keys and payloads to the output]
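A sketch of the simple (early-materialized) vectorized selection scan in C with AVX-512 intrinsics, following the two steps on this slide; it assumes signed 32-bit keys, that n is a multiple of 16, and uses illustrative names throughout.

    #include <immintrin.h>
    #include <stddef.h>

    /* select ... where key > C, early materialized (AVX-512 sketch, W = 16) */
    size_t select_greater(const int *keys, const int *payloads, int C,
                          int *out_keys, int *out_payloads, size_t n)
    {
        __m512i vc = _mm512_set1_epi32(C);
        size_t out = 0;
        for (size_t i = 0; i < n; i += 16) {
            __m512i k = _mm512_loadu_si512(keys + i);
            __m512i p = _mm512_loadu_si512(payloads + i);
            /* 1) evaluate the predicate on 16 keys at once */
            __mmask16 m = _mm512_cmpgt_epi32_mask(k, vc);
            /* 2) selectively store only the qualifying keys & payloads */
            _mm512_mask_compressstoreu_epi32(out_keys + out, m, k);
            _mm512_mask_compressstoreu_epi32(out_payloads + out, m, p);
            out += (size_t)_mm_popcnt_u32((unsigned)m);
        }
        return out;   /* number of qualifying tuples */
    }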

Selection Scans
[Chart: throughput (billion tuples/second) vs. selectivity (1-100%) for scalar (branching) and scalar (branchless), on Xeon Phi 7120P (MIC) and Xeon E3-1275v3 (Haswell)]
- Branchless is slower on Xeon Phi!
- Branchless is faster on Haswell!

Selection Scans
[Chart: throughput (billion tuples/second) vs. selectivity (1-100%) for scalar (branching), scalar (branchless), vectorized (early), and vectorized (late), on Xeon Phi 7120P (MIC) and Xeon E3-1275v3 (Haswell); an "important case" region is highlighted]

Hash Tables
Scalar hash tables
- Branching or branchless
- 1 input key at a time
- 1 table key at a time
[Diagram: input key k is hashed to index h into a linear-probing hash table of keys & payloads; a probe may visit more than 1 bucket]

Hash Tables
Scalar hash tables
- Branching or branchless; 1 input key at a time; 1 table key at a time
Vectorized hash tables: horizontal vectorization
- Proposed in previous work
- 1 input key at a time, W table keys per input key
- Load a bucket with W keys (bucketized hash table); W comparisons per bucket
[Diagram: input key k hashed to index h; the key is replicated across all lanes and compared against one W-wide bucket]

Hash Tables
Scalar hash tables
- Branching or branchless; 1 input key at a time; 1 table key at a time
Vectorized hash tables: horizontal vectorization
- Proposed in previous work
- 1 input key at a time, W table keys per input key
- Load a bucket with W keys (bucketized hash table); W comparisons per bucket
- However, W comparisons per key are too many: no advantage from larger SIMD

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key
- Gather buckets
- Probing (linear probing): store matching lanes
[Diagram: W input keys k1..k4 are hashed to indexes h1..h4; one bucket per lane is gathered from the linear-probing hash table and compared, splitting the lanes into matching and non-matching lanes]

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key
- Gather buckets
- Probing (linear probing): store matching lanes, replace finished lanes, keep unfinished lanes
[Diagram: matching lanes are replaced with new input keys via a selective load; non-matching lanes keep their keys and advance their linear-probing offsets (e.g. h1+1, h4+1)]
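The loop below sketches vertical probing with AVX-512 intrinsics. It is simplified relative to the paper: it drives one group of W = 16 probe keys to completion instead of refilling finished lanes from the input with a selective load, it assumes a power-of-two table that is not full and keys that never equal the sentinel, and the multiplicative hash constant, EMPTY value, and names are illustrative.

    #include <immintrin.h>

    #define EMPTY ((int)0x80000000)   /* sentinel marking an empty slot */

    /* Vertical vectorization of linear-probing lookups (AVX-512 sketch, W = 16).
     * out_vals[i] receives the payload for probe_keys[i] when lane i is found. */
    void probe16(const int *probe_keys, const int *table_keys,
                 const int *table_vals, int log_size,
                 int *out_vals, __mmask16 *out_found)
    {
        __m512i keys = _mm512_loadu_si512(probe_keys);
        /* multiplicative hashing: keep the high log_size bits (illustrative) */
        __m512i hash = _mm512_mullo_epi32(keys, _mm512_set1_epi32((int)0x9e3779b1u));
        __m512i pos  = _mm512_srl_epi32(hash, _mm_cvtsi32_si128(32 - log_size));
        __m512i vempty = _mm512_set1_epi32(EMPTY);
        __m512i vmask  = _mm512_set1_epi32((1 << log_size) - 1);
        __mmask16 active = 0xFFFF;    /* lanes still probing */
        __mmask16 found  = 0;
        while (active) {
            /* gather one table key per active lane */
            __m512i tk = _mm512_mask_i32gather_epi32(vempty, active, pos,
                                                     table_keys, sizeof(int));
            __mmask16 hit = _mm512_mask_cmpeq_epi32_mask(active, tk, keys);
            __mmask16 mis = _mm512_mask_cmpeq_epi32_mask(active, tk, vempty);
            /* matching lanes gather their payloads into lane-aligned outputs */
            __m512i tv = _mm512_mask_i32gather_epi32(vempty, hit, pos,
                                                     table_vals, sizeof(int));
            _mm512_mask_storeu_epi32(out_vals, hit, tv);
            found  |= hit;
            active = active & ~(hit | mis);   /* done on a match or an empty slot */
            /* unfinished lanes advance to the next bucket (linear probing) */
            pos = _mm512_and_epi32(_mm512_add_epi32(pos, _mm512_set1_epi32(1)), vmask);
        }
        *out_found = found;
    }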

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key; gather buckets
- Probing (linear probing): store matching lanes, replace finished lanes, keep unfinished lanes
- Building (linear probing): keep non-empty lanes
[Diagram: input keys k1..k4 with hash indexes h1..h4 gather their buckets; lanes split into empty-bucket lanes (null) and non-empty-bucket lanes]

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key; gather buckets
- Probing (linear probing): store matching lanes, replace finished lanes, keep unfinished lanes
- Building (linear probing): keep non-empty lanes, detect scatter conflicts
[Diagram: conflict detection — 1) selectively scatter the lane ids 1..4 to positions h1..h4, 2) selectively gather them back, 3) compare: lanes that do not read back their own id are conflicting lanes, the rest are non-conflicting]
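A sketch of the scatter-conflict detection step with AVX-512 intrinsics: every active lane scatters its lane id to its target slot, gathers it back, and "wins" the slot only if it reads its own id. The slide scatters into the bucket positions themselves; the separate scratch array and the function name here are assumptions to keep the sketch self-contained. (On AVX-512CD hardware, _mm512_conflict_epi32 can detect such duplicates without touching memory.)

    #include <immintrin.h>

    /* Returns the mask of non-conflicting lanes among the active lanes. */
    __mmask16 detect_conflicts(__m512i pos, __mmask16 active, int *scratch)
    {
        const __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                               8, 9, 10, 11, 12, 13, 14, 15);
        /* lanes that target the same slot overwrite each other here ... */
        _mm512_mask_i32scatter_epi32(scratch, active, pos, lane, sizeof(int));
        /* ... so only one writer per slot reads its own id back */
        __m512i back = _mm512_mask_i32gather_epi32(lane, active, pos,
                                                   scratch, sizeof(int));
        return _mm512_mask_cmpeq_epi32_mask(active, back, lane);
    }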

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key; gather buckets
- Probing (linear probing): store matching lanes, replace finished lanes, keep unfinished lanes
- Building (linear probing): keep non-empty lanes, detect scatter conflicts, scatter tuples to empty lanes
[Diagram: the non-conflicting lanes with empty buckets scatter their tuples (key, payload) into the hash table; conflicting lanes and non-empty-bucket lanes are not scattered]

Hash Tables
Vectorized hash tables: vertical vectorization
- W input keys at a time, 1 table key per input key; gather buckets
- Probing (linear probing): store matching lanes, replace finished lanes, keep unfinished lanes
- Building (linear probing): keep non-empty lanes, detect scatter conflicts, scatter empty lanes & replace, keep conflicting lanes
[Diagram: 2nd iteration — lanes whose tuples were scattered are replaced with new input keys via a selective load; conflicting lanes and non-empty-bucket lanes keep their keys]

Hash Table (Linear Probing)
[Chart: probing throughput (billion tuples/second) vs. hash table size (4 KB to 64 MB) for the scalar version, on Xeon Phi 7120P and Xeon E3-1275v3 (Haswell)]

Hash Table (Linear Probing)
[Chart: probing throughput (billion tuples/second) vs. hash table size (4 KB to 64 MB) for scalar and vector (horizontal), on Xeon Phi 7120P and Xeon E3-1275v3 (Haswell); annotations mark where the table falls out of cache]

Hash Table (Linear Probing)
[Chart: probing throughput (billion tuples/second) vs. hash table size (4 KB to 64 MB) for scalar, vector (horizontal), and vector (vertical), on Xeon Phi 7120P and Xeon E3-1275v3 (Haswell)]
- Vertical vectorization is much faster in cache on Xeon Phi!
- Faster in the L1/L2 cache on Haswell!
- Annotations mark where the table falls out of cache

Partitioning
Types
- Radix: 2 shifts (in SIMD)
- Hash: 2 multiplications (in SIMD)
- Range: range function index, binary search (in the paper)
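The radix and hash partition functions can be evaluated on W = 16 keys at once; a sketch with AVX-512 intrinsics follows, using a shift-and-mask radix function and a multiplicative hash that keeps the high bits (one common variant; the constants and names are illustrative, not the paper's exact formulas).

    #include <immintrin.h>

    /* Radix: shift away the bits already consumed, keep log2(P) bits. */
    __m512i radix_partition(__m512i keys, int shift, int log_parts)
    {
        __m512i s = _mm512_srl_epi32(keys, _mm_cvtsi32_si128(shift));
        return _mm512_and_epi32(s, _mm512_set1_epi32((1 << log_parts) - 1));
    }

    /* Hash: multiplicative hashing, keeping the high log2(P) bits. */
    __m512i hash_partition(__m512i keys, int log_parts)
    {
        __m512i h = _mm512_mullo_epi32(keys, _mm512_set1_epi32((int)0x9e3779b1u));
        return _mm512_srl_epi32(h, _mm_cvtsi32_si128(32 - log_parts));
    }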

Partitioning
Types
- Radix: 2 shifts (in SIMD); Hash: 2 multiplications (in SIMD); Range: range function index, binary search (in the paper)
Histogram
- Data-parallel update: gather & scatter counts into a histogram with P counts
- Conflicts miss counts (two lanes hitting the same partition contribute +1 instead of +2)
[Diagram: input keys k1..k4 map to partition ids h1..h4]

Partitioning
Types
- Radix: 2 shifts (in SIMD); Hash: 2 multiplications (in SIMD); Range: range function index, binary search (in the paper)
Histogram
- Data-parallel update: gather & scatter counts
- Conflicts miss counts
- Fix: replicate the histogram W times (P * W counts), one replica per lane
- More solutions in the paper
[Diagram: input keys k1..k4 map to partition ids h1..h4; each lane updates its own histogram replica]
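A sketch of the replicated-histogram update with AVX-512 intrinsics: lane i always touches slot part_id * 16 + i, so the gathered and scattered counts can never collide within one vector, and the 16 replicas are summed at the end. The interleaved layout and all names are illustrative assumptions, not the paper's exact code.

    #include <immintrin.h>
    #include <stdlib.h>

    /* counts[] receives P totals; part_ids[] holds n partition ids in [0, P). */
    void histogram_replicated(const int *part_ids, size_t n, int P, int *counts)
    {
        int *hist = calloc((size_t)P * 16, sizeof(int));   /* 16 replicas */
        const __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                               8, 9, 10, 11, 12, 13, 14, 15);
        const __m512i one = _mm512_set1_epi32(1);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i p   = _mm512_loadu_si512(part_ids + i);
            __m512i idx = _mm512_add_epi32(_mm512_slli_epi32(p, 4), lane);
            __m512i c   = _mm512_i32gather_epi32(idx, hist, sizeof(int));
            _mm512_i32scatter_epi32(hist, idx, _mm512_add_epi32(c, one),
                                    sizeof(int));
        }
        for (; i < n; i++)                    /* scalar tail uses replica 0 */
            hist[part_ids[i] * 16]++;
        for (int p = 0; p < P; p++) {         /* reduce the 16 replicas */
            int sum = 0;
            for (int l = 0; l < 16; l++) sum += hist[p * 16 + l];
            counts[p] = sum;
        }
        free(hist);
    }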

Radix & Hash Histogram
Histogram generation on Xeon Phi
[Chart: histogram throughput (billion tuples/second) vs. fanout (log2, 3 to 13) for scalar radix and scalar hash]
- In scalar code: hash < radix

Radix & Hash Histogram
Histogram generation on Xeon Phi
[Chart: histogram throughput (billion tuples/second) vs. fanout (log2, 3 to 13) for scalar radix, scalar hash, and vector radix/hash (replicate)]
- At high fanout, the replicated histogram falls out of cache

Radix & Hash Histogram
Histogram generation on Xeon Phi
[Chart: histogram throughput (billion tuples/second) vs. fanout (log2, 3 to 13) for scalar radix, scalar hash, vector radix/hash (replicate), and vector radix/hash (replicate & compress); annotations mark the fanouts where the histogram fits in L1 and in L2]

Partitioning Shuffling
Update the offsets
- Gather & scatter counts
- Conflicts miss counts
Tuple transfer
- Scatter directly to output
- Conflicts overwrite tuples
[Diagram: 1) gather each lane's partition offset (l1..l4) from the offset array using the partition ids h1..h4, 2) scatter the tuples to the output; when two lanes share a partition, one tuple overwrites the other (here k2/v2 is lost)]

Partitioning Shuffling
Update the offsets
- Gather & scatter counts; conflicts miss counts
Tuple transfer
- Scatter directly to output; conflicts overwrite tuples
- Serialize conflicts
[Diagram: lane ids are selectively scattered into the offset array by partition id and gathered back to detect lanes that share a partition and to compute per-lane serialization offsets]

Partitioning Shuffling
Update the offsets
- Gather & scatter counts; conflicts miss counts
Tuple transfer
- Scatter directly to output; conflicts overwrite tuples
- Serialize conflicts
- More in the paper
[Diagram: 1) gather partition offsets l1..l4, 2) serialize conflicts by adding each conflicting lane's serialization offset (e.g. +1), 3) scatter the tuples; now all tuples k1..k4 / v1..v4 reach the output]

Partitioning Shuffling
Update the offsets
- Gather & scatter counts; conflicts miss counts
Tuple transfer
- Scatter directly to output; conflicts overwrite tuples; serialize conflicts; more in the paper
Buffering (when input >> cache)
- Handle conflicts
- Scatter tuples to buffers; buffers are cache-resident
- Stream full buffers to output
[Diagram: 1) gather offsets, 2) serialize conflicts, 3) scatter keys and payloads to per-partition buffers, 4) flush full buffers]

Partitioning Shuffling
Update the offsets
- Gather & scatter counts; conflicts miss counts
Tuple transfer
- Scatter directly to output; conflicts overwrite tuples; serialize conflicts; more in the paper
Buffering (when input >> cache)
- Handle conflicts; scatter tuples to buffers; buffers are cache-resident; stream full buffers to output
[Diagram: vertical logic scatters 1 tuple at a time from the input into the cache-resident buffers; horizontal logic streams W-tuple stores from full buffers to the output]
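A scalar sketch of the buffering idea (vectorization is orthogonal, as the next slide's chart shows): each partition owns a small cache-resident buffer, and only full buffers are written to the output in sequential bursts, so random writes never leave the cache. The paper flushes buffers with streaming (non-temporal) stores; plain memcpy keeps this sketch short, and the buffer size, names, and radix partition function are illustrative assumptions.

    #include <stdlib.h>
    #include <string.h>

    #define BUF_TUPLES 64                       /* tuples buffered per partition */

    typedef struct { int key, payload; } tuple_t;

    /* offsets[] holds the prefix-sum output position of each partition and is
     * advanced in place as buffers are flushed. */
    void partition_buffered(const tuple_t *in, size_t n, int log_parts,
                            tuple_t *out, size_t *offsets)
    {
        int P = 1 << log_parts;
        tuple_t *buf = malloc((size_t)P * BUF_TUPLES * sizeof(tuple_t));
        int *fill = calloc((size_t)P, sizeof(int));
        for (size_t i = 0; i < n; i++) {
            int p = in[i].key & (P - 1);        /* radix partition id */
            buf[p * BUF_TUPLES + fill[p]++] = in[i];
            if (fill[p] == BUF_TUPLES) {        /* flush one full buffer */
                memcpy(out + offsets[p], buf + p * BUF_TUPLES,
                       BUF_TUPLES * sizeof(tuple_t));
                offsets[p] += BUF_TUPLES;
                fill[p] = 0;
            }
        }
        for (int p = 0; p < P; p++) {           /* flush the partial buffers */
            memcpy(out + offsets[p], buf + p * BUF_TUPLES,
                   (size_t)fill[p] * sizeof(tuple_t));
            offsets[p] += fill[p];
        }
        free(buf);
        free(fill);
    }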

Buffered & Unbuffered Partitioning
Large-scale data shuffling on Xeon Phi (input >> cache)
[Chart: shuffling throughput vs. fanout (log2, 3 to 13) for scalar unbuffered and scalar buffered]

Buffered & Unbuffered Partitioning
Large-scale data shuffling on Xeon Phi
[Chart: shuffling throughput vs. fanout (log2, 3 to 13) for scalar unbuffered, scalar buffered, vector unbuffered, and vector buffered]
- Vectorization is orthogonal to buffering

Sorting & Hash Join Algorithms
Least-significant-bit (LSB) radix-sort
- Stable radix partitioning passes
- Fully vectorized

Sorting & Hash Join Algorithms
Least-significant-bit (LSB) radix-sort
- Stable radix partitioning passes; fully vectorized
Hash join: no partitioning
- Build 1 shared hash table using atomics
- Partially vectorized

Sorting & Hash Join Algorithms
Least-significant-bit (LSB) radix-sort
- Stable radix partitioning passes; fully vectorized
Hash join: no partitioning
- Build 1 shared hash table using atomics; partially vectorized
Hash join: min partitioning
- Partition the building table; build 1 hash table per thread
- Fully vectorized

Sorting & Hash Join Algorithms
Least-significant-bit (LSB) radix-sort
- Stable radix partitioning passes; fully vectorized
Hash join: no partitioning
- Build 1 shared hash table using atomics; partially vectorized
Hash join: min partitioning
- Partition the building table; build 1 hash table per thread; fully vectorized
Hash join: max partitioning
- Partition both tables repeatedly; build & probe cache-resident hash tables
- Fully vectorized

Hash Joins
Sort 400 million tuples & join 200 & 200 million tuples, 32-bit keys & payloads, on Xeon Phi
[Chart: sort time (seconds) for scalar vs. vector LSB radixsort, and join time (seconds) for scalar vs. vector no-partition, min-partition, and max-partition joins, broken down into partition, build, probe, and build & probe; the vectorized max-partition join is the fastest]

Hash Joins
Sort 400 million tuples & join 200 & 200 million tuples, 32-bit keys & payloads, on Xeon Phi
[Chart: same breakdown as the previous slide]
- LSB radixsort: improved 2.2X by vectorization!
- No-partition join: not improved
- Partitioned join: improved 3.4X!

Power Efficiency
4 CPUs (Xeon E5-4620): 32 Sandy Bridge cores, 2.2 GHz
[Chart: time (seconds) for Sort and Join, broken down into Histogram, Shuffle, Partition, Build & probe, and NUMA split]

Power Efficiency
4 CPUs (Xeon E5-4620): 32 Sandy Bridge cores, 2.2 GHz
1 co-processor (Xeon Phi 7120P): 61 P54C cores, 1.238 GHz, 300 W TDP
42% less power
[Chart: time (seconds) for Sort and Join on both platforms, broken down into Histogram, Shuffle, Partition, Build & probe, and NUMA split]
* comparable speed even if equalized (in the paper)

Conclusions
Full vectorization
- O(f(n)) scalar → O(f(n)/W) vector operations
Good vectorization principles improve performance
- Define & reuse fundamental operations
- e.g. vertical vectorization, maximize lane utilization, ...

Conclusions
Full vectorization
- O(f(n)) scalar → O(f(n)/W) vector operations
Good vectorization principles improve performance
- Define & reuse fundamental operations
- e.g. vertical vectorization, maximize lane utilization, ...
Impact on software design
- Vectorization favors cache-conscious algorithms: e.g. partitioned hash join >> non-partitioned hash join if vectorized
- Vectorization is orthogonal to other optimizations: e.g. both unbuffered & buffered partitioning get a vectorization speedup

Conclusions
Full vectorization
- O(f(n)) scalar → O(f(n)/W) vector operations
Good vectorization principles improve performance
- Define & reuse fundamental operations
- e.g. vertical vectorization, maximize lane utilization, ...
Impact on software design
- Vectorization favors cache-conscious algorithms: e.g. partitioned hash join >> non-partitioned hash join if vectorized
- Vectorization is orthogonal to other optimizations: e.g. both unbuffered & buffered partitioning get a vectorization speedup
Impact on hardware design
- Simple cores almost as fast as complex cores (for OLAP): 61 simple P54C cores ~ 32 complex Sandy Bridge cores
- Improved power efficiency for analytical databases

Questions

Join & Sort with Early Materialized Payloads
[Chart: LSB radixsort time (seconds) for 200 million 32-bit keys & X payloads on Xeon Phi, and partitioned hash join time (seconds) for 100 & 100 million 32-bit keys & X payloads on Xeon Phi; payload configurations range from 8-bit through 4x 64-bit columns, with join ratios from 4:1x 64-bit down to 1:4x 64-bit]