In-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki 30.11.2012



Similar documents
Object Oriented Database Management System for Decision Support System.

IN-MEMORY DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: In-Memory DBMS - SoSe

In-Memory Data Management for Enterprise Applications

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK

SQL Server 2014 New Features/In- Memory Store. Juergen Thomas Microsoft Corporation

In-Memory Databases MemSQL

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

MS SQL Performance (Tuning) Best Practices:

Daniel J. Adabi. Workshop presentation by Lukas Probst

In-memory databases and innovations in Business Intelligence

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

An Evaluation of Strict Timestamp Ordering Concurrency Control for Main-Memory Database Systems

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier

Distributed Architecture of Oracle Database In-memory

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

SQL Server 2005 Features Comparison

Data Management in the Cloud

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

The SAP HANA Database An Architecture Overview

Understanding the Value of In-Memory in the IT Landscape

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

SQL Server 2012 Performance White Paper

Design Patterns for Distributed Non-Relational Databases

Innovative technology for big data analytics

Why DBMSs Matter More than Ever in the Big Data Era

Crack Open Your Operational Database. Jamie Martin September 24th, 2013

<Insert Picture Here> Best Practices for Extreme Performance with Data Warehousing on Oracle Database

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Rethinking SIMD Vectorization for In-Memory Databases

OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni

Application-Tier In-Memory Analytics Best Practices and Use Cases

Oracle Database In-Memory The Next Big Thing

Data Warehouse: Introduction

Tiber Solutions. Understanding the Current & Future Landscape of BI and Data Storage. Jim Hadley

Architectures for Big Data Analytics A database perspective

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

The Oracle Universal Server Buffer Manager

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

TECHNICAL OVERVIEW HIGH PERFORMANCE, SCALE-OUT RDBMS FOR FAST DATA APPS RE- QUIRING REAL-TIME ANALYTICS WITH TRANSACTIONS.

Module 14: Scalability and High Availability

In-memory Tables Technology overview and solutions

Data Mining in the Swamp

2009 Oracle Corporation 1

How, What, and Where of Data Warehouses for MySQL

A1 and FARM scalable graph database on top of a transactional memory layer

Benchmarking Cassandra on Violin

How To Handle Big Data With A Data Scientist

Big Data Technologies. Prof. Dr. Uta Störl Hochschule Darmstadt Fachbereich Informatik Sommersemester 2015

IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop

Oracle TimesTen: An In-Memory Database for Enterprise Applications

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

The Vertica Analytic Database Technical Overview White Paper. A DBMS Architecture Optimized for Next-Generation Data Warehousing

Whitepaper. Innovations in Business Intelligence Database Technology.

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

Original-page small file oriented EXT3 file storage system

Actian Vector in Hadoop

ICONICS Choosing the Correct Edition of MS SQL Server

Safe Harbor Statement

Safe Harbor Statement

QuickDB Yet YetAnother Database Management System?

Would-be system and database administrators. PREREQUISITES: At least 6 months experience with a Windows operating system.

CitusDB Architecture for Real-Time Big Data

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Aggregates Caching in Columnar In-Memory Databases

One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone. Michael Stonebraker December, 2008

<Insert Picture Here> Oracle In-Memory Database Cache Overview

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Cloud Computing at Google. Architecture

Database Performance with In-Memory Solutions

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

Optimize Oracle Business Intelligence Analytics with Oracle 12c In-Memory Database Option

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Inge Os Sales Consulting Manager Oracle Norway

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Data Warehousing & Data Mining

High-Volume Data Warehousing in Centerprise. Product Datasheet

Optimizing Performance. Training Division New Delhi

HP ProLiant DL580 Gen8 and HP LE PCIe Workload WHITE PAPER Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

DBMS / Business Intelligence, SQL Server

What Is In-Memory Computing and What Does It Mean to U.S. Leaders? EXECUTIVE WHITE PAPER

Technical Challenges for Big Health Care Data. Donald Kossmann Systems Group Department of Computer Science ETH Zurich

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries

Big Data With Hadoop

DKDA 2012 and the Impact of In-Memory Database Algorithms

IBM DB2 Near-Line Storage Solution for SAP NetWeaver BW

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Maximum performance, minimal risk for data warehousing

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

DISTRIBUTED AND PARALLELL DATABASE

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Transcription:

In-Memory Columnar Databases HyPer Arto Kärki University of Helsinki 30.11.2012 1

Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 2

Introduction The relational database systems are today mostly separated in two different technical solutions because of the increased number of rows and for performance issues. OLTP (OnLine Transaction Processing) is designed for fast row inserts, updates and selections. OLAP (OnLine Analytic Processing) is designed for long lasting queries. 3

Introduction Problematic ETL-mechanism An OLAP database is populated with rows from OLTP database. This means complicated data extracting, transforming and loading mechanism. This mechanism is usually slow and can be made only once a day submitting batch jobs during night time. Data is not fresh any more for business intelligence calculations. 4

Introduction ETL example Extract Transform Load 5

Introduction The HyPer s challenge, how one could: 1. avoid ETL overhead and merge OLTP and OLAP back in the same database system. 2. manage to preserve the fast OLTP transactions and at the same time achieve up-to-date data for OLAP queries. 3. satisfy business intelligence calculations. 6

Introduction Next: Some aspects on columnar databases Hyper s design choices Hyper s data clustering and compression And finally: Conclusion 7

Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 8

H.Plattner: Columnar Databases - Early tests at SAP and HPI with in-memory databases of the relational type based on row storage did not show significant advantages over leading RDBMSs with equivalent memory caching. - The alternative idea to investigate the advantages of using column store databases for OLTP was born. - Column storage was successfully used for many years in OLAP and really surged when main memory became abundant. 9

Abadi et al.: Columnar Databases - Store each column separately, with attribute values belonging to the same column stored contiguously as opposed to traditional db systems that store entire rows one after the other. - Reading a subset of table s columns becomes faster when scanning multiple columns. - Potential expense of exessive disk-head seeking from column to column for scattered reads or updates. 10

Abadi et al.: Columnar Databases Two of the most-often cited disadvantages: - write operations (inserted tuples have to be broken up into their component attributes and each attribute must be written separately). - the dense-packed data layout makes moving tuples within page nearly impossible. 11

Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 12

Design Choices What s HyPer? It is a hybrid OLTP&OLAP main memory database system. And it is columnar in order to achieve best possible query execution performance for OLAP applications. Next: Hyper s performance is due to the following design choices. 13

Design Choices 1 HyPer relies on in-memory data management without the ballast of traditional database systems caused by DBMS-controlled page structures and buffer management. The SQL table definitions are transformed into simple vector-based virtual memory representations which constitutes a column-oriented physical storage scheme. 14

Design Choices 2 The OLAP processing is separated from the missioncritical OLTP transaction processing by fork-ing virtual memory snapshots. Thus, no concurrency control mechanisms other than the hardwareassisted VM management are needed to separate the two workload classes. 15

Design Choices 3 Fork OLAP queries Hardware assisted copy 16

Design Choices 4 Transactions and queries are specified in SQL and are efficiently compiled into LLVM assembly code. The transactions are specified in an SQL scripting language and registered stored procedures. The query evaluation follows a data-centric paradigm by applying as many operations on a data object as possible in between pipeline breakers. 17

Design Choices 5 Example Query Compiled query Example execution plan 18

LLVM assembler Design Choices 6 The experiments have shown that data-centric query processing is a very efficient query execution model. DBMS can achieve a query processing efficiency that rivals hand-written C++ code. The data-centric compilation approach is promising for all new database projects. By relying on mainstream compilation frameworks the DBMS automatically benefits from future compiler and processor improvements without re-engineering the query engine. 19

Design Choices 7 As in VoltDB, the parallel transactions are separated via lock-free admission control that allows only nonconflicting transactions at the same time. Parallelism in this serial execution model is achieved by logically partitioning the database and admitting multiple partition-constrained transactions in parallel. However, for executing partition-crossing transactions the scheduler resorts to strict serial execution, rather than costly locking-based synchronization. 20

Design Choices 8 HyPer relies on logical logging where, in essence, the invocation parameters of the stored prosedures / transactions are logged via a high-speed network. The serial execution model in combination with partitioning and group committing achieves extreme scalability in terms of transaction throughput without compromising the holy grail of ACID. 21

Design Choices 9 While in-core OLAP query processing can be based on sequential scans, this is not possible for transaction processing as we require execution times of a few microseconds only. Therefore, HyPer has deleloped sophisticated main-memory indexing structures based on hashing, balanced search trees and radiax trees. Hash indexes are dispensable for exact match (e.g., primary key) accesses that are most common in transactional processing while the tree structured indexes are essential for smallrange queries, that are commonly encouterd in transactional scripts as well. 22

Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 23

Data Clustering and Compression HyPer s approach to compression in hybrid OLTP & OLAP column stores is based on the observation that while OLTP workloads frequently modify the dataset, they often follow the working set assumption: only a small subset of the data is accessed and an even smaller subset of this working set is being modified. 24

Data Clustering and Compression Hot (volatile) data item Cold data item Cold & compressed data item 25

Data Clustering and Compression Hot/cold clustering is an elegant solution to this problem, as the cold bulk of the data can be stored on huge memory pages while the hot, frequently modified working set remains on regular memory pages that can be replicated inexpensively. The frozen, huge data pages are never modified; if a frozen data object is changed, after all, it is invalidated in the frozen partition and reinserted into the hot working set. 26

Introduction Columnar Databases Design Choices Snapshotting Data Compression Conclusion 27

Conclusion - A reunion of OLTP & OLAP systems via snapshotting. - Queries compiled into machine code using the optimizing LLVM compiler => the DBMS can achieve a query processing efficiency that rivals hand-written C++ code. - Hot/cold clustering to store frequently accessed tuples together on regular memory pages while cold, immutable tuples can reside on huge pages => advantageous combination of page table size and thus snapshot creation costs. 28

Thank you 29

References T. Cao, M. Salles, B. Sowell, Y. Yue, J. Gehrke, A.Demers, and W. White. Fast Checkpoint Recovery Algorithms for Frequently Consistent Applications. In SIGMOD, 2011. Florian Funke, Alfons Kemper, Thomas Neumann: Compacting Transactional Data in Hybrid OLTP & OLAP Databases. PVLDB 5(11):1424-1435(2012). S. H eman, M. Zukowski, N. J. Nes, L. Sidirourgos,and P. A. Boncz. Positional Update Handling in Column Stores. In SIGMOD, pages 543 554, 2010. Alfons Kemper, Thomas Neumann, Florian Funke, Viktor Leis, Henrik Mühe: HyPer:Adapting Columnar Main-Memory Data Management for Transactional AND Query Processing. IEEE Data Eng. Bull. (DEBU)35(1):46-51(2012). T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539 550, 2011. Hasso Plattner: A common database approach for OLTP and OLAP using an in-memory column database. SIGMOD 2009:1-2. http://www.information-management.com/issues/20031101/7607-1.html 30