DataBlitz Main Memory DataBase System




What is DataBlitz?

DataBlitz is a general-purpose main memory database system that enables:
– high-speed access to data
– concurrent access to shared data
– data integrity to be preserved in the presence of faults or failures

DataBlitz differs from typical commercial database systems in that:
– data is stored in main memory, not on disk
– data is accessed directly, not over a network
– there is no buffer manager
– lower-level APIs are exposed to applications

DataBlitz's Broad Applicability

DataBlitz's high performance (high throughput and short, predictable response times) makes it an ideal basis for a variety of purposes. The fast-emerging world of e-commerce thrives on speed and is a natural fit for DataBlitz. DataBlitz plays an enabling role in web servers as well as in web infrastructure (cache servers and other web accelerators). In financial trading, time is money: even being a minute late may cause a trader to miss a market move. DataBlitz's performance and strength in transaction processing make it an ideal basis for on-line stock exchanges as well as for program traders. Decision-support systems, currently based on off-line analysis of information in data warehouses, can be made much more effective by rapid, DataBlitz-based analysis of real-time data.

Target Telecom Applications

Real-time billing
– Bill for up-to-the-moment service and usage
– Maintain billing history information
– Keep customer and service summaries
Intelligent network applications
– 800 number translation
– Intelligent 800 number service (via customer logic)
– Call screening
– Call routing (local number portability, unified messaging, etc.)
Switch-based functions
– Switching
– Call routing
– Call forwarding
– Call waiting

High Performance Databases

Characteristics:
– High throughput: hundreds to thousands of transactions per second
– Real-time response on the order of a few milliseconds
– Data must be in main memory to attain this performance
– Disk accesses have high latency (~20 milliseconds) and low throughput
– Main-memory accesses have low latency (~100 nanoseconds) and high throughput, and memory prices are falling rapidly

[Figure: storage hierarchy — CPU; main memory (low latency, high throughput, higher cost, smaller size); disk (high latency, low throughput, lower cost, larger size)]

Existing Commercial Databases

Most commercial systems assume data is primarily disk-resident.

[Figure: an application communicating over a network with a server, whose buffer manager mediates between RAM and the on-disk database]

Every access incurs:
– Network latency
– Disk latency
– Buffer manager overhead

DataBlitz – High Performance MMDB System

DataBlitz provides real-time concurrent access to shared data.

[Figure: multiple applications directly accessing shared databases in RAM]

True 64-bit architecture support!

DataBlitz Architecture

[Figure: two processes, each running user code linked with the DataBlitz library, mapping database files (DB File 1, DB File 2, DB File 3) and the system database (SysDB) into their virtual memory, backed by shared physical memory; locks and logs are shared structures; checkpoint images and the log of updates reside on disk]

Multiple Layers of Abstraction

Applications can be written to use one of four APIs:
– SQL
– C++ Relational API
– Collections API
– Storage Manager API

Lower-level APIs offer simpler functionality and even higher performance.

The Modular Architecture of DataBlitz

[Figure: layered architecture — SQL and the C++ Relational API on top of the Collections API (heap files, hash indices, Ttrees, file indices); below them, the Storage Manager API comprising transactions, recovery/fault resilience, concurrency control, and the storage allocator; all built on infrastructure services]


Infrastructure Services

Database management
– Keeps track of database files in a DataBlitz system
– Provides services to create, remove, resize, open, close, and coordinate access to a database file
– Manages the virtual address space of processes and keeps track of start addresses for mapped database files
– Allocates/grows the underlying shared memory for a database file
Process management
– Keeps track of active processes and the resources held by them (e.g. mutexes)
– Performs cleanup for a failed process or thread
Communication management
– Provides a framework for messaging (local and remote)
– Supports client/server remote procedure calls


Recovery/Fault Resilience

Components: cleanup, recovery, archive, checkpoint, codeword protection, memory protection, logging.

Recovery from system failures
– Recovers the database to a transaction-consistent state
– Fuzzy checkpointing, multi-level and post-commit logging
Recovery from process failures
– Process failures are detected and their transactions aborted
Recovery from media failures
– Creation of and recovery from database archives
Detecting data corruption (due to application errors)
– Codeword (one-word checksum) for each database page
– Any page written to disk must match its codeword
Preventing data corruption (due to application errors)
– Uses the mprotect call to cause page faults on bad writes

Traditional Recovery Algorithms

Each transaction writes the following to the log tail:
– Undo (before-image) logs before an update
– Redo (after-image) logs after an update
Write-ahead logging
– Undo logs are flushed to disk before any affected page is written
– Logs are flushed on transaction commit
Recovery in 3 phases
– Analysis: find winner and loser transactions
– Redo: redo the effects of completed transactions
– Undo: undo the effects of incomplete transactions
Checkpointing
– Flush dirty (updated) pages and note transaction information
– Discard the old part of the log

Recovery from System Failure

– Logs all updates on transaction commit/abort
– Writes periodic checkpoints of database state
– Performs recovery using the checkpoint and the logs

[Figure: in main memory, the database, dirty page table, active transaction table (ATT, with per-transaction local undo logs), redo log, and system log tail; on disk, the stable system log with its end-of-stable-log marker, and two checkpoints (Ckpt A and Ckpt B, with cur_ckpt designating the current one), each holding the database image, ckpt_dpt, and the ATT with undo logs]

DataBlitz Recovery Algorithm

Repeats history
Low amount of disk I/O
– Redo logs are written to disk
– No undo logs go to disk in the normal case
Ping-pong checkpointing scheme
– Writes to alternate disk copies on successive checkpoints, making the scheme resilient to crashes during checkpointing
– Only relevant undo logs are written to disk when checkpointing: the undo logs of transactions that are active at the end of the checkpoint
Fuzzy checkpoints
– Updates execute concurrently with checkpoints (e.g. no page latching during checkpoints)
– Only dirty pages are written to disk during checkpoints
Single-pass recovery algorithm

Extended Logging Support

Logging support for high concurrency
– Multi-level logging: physical undo during updates, logical undo afterward
– Post-commit actions: actions that cannot be rolled back (e.g. freeing storage space)
– Support for user-defined operation logging
Time-stamp recovery
– Recover the database to a transaction-consistent point in the past
Turning redo logging off
– Supports atomic transactions without requiring them to be durable

Codeword Protection

When codeword protection is turned on for a database, a parity-based checksum (codeword) is maintained for each page, and updates performed through DataBlitz update the checksum. Testing a page against its checksum detects errors such as writes that did not go through the DataBlitz interface and random writes caused by bad application pointers. The integrity of a page is always tested before it is written to disk, so the disk copy is never corrupted. The user can also request "codeword audits" at other times.

Memory Protection Support

Memory protection allows pages to be protected using the UNIX system call mprotect(). Pages are unprotected when written and reprotected on transaction commit. While unprotecting and reprotecting are slow (by DataBlitz standards), memory protection can be very useful for debugging, for read-only databases, and for databases with relatively low update rates.


Concurrency Control

Components: lock table, lock manager, shared and exclusive mutexes.

Fast recoverable mutexes
– Operating system semaphores are too expensive
– Mutexes are implemented in user space for speed (e.g. spin locks)
– Algorithms are provided to detect a failed process holding a mutex
Lock manager
– Fine-grained locking (read/write/intention)
– Supports lock upgrades/downgrades
– Lock table used to obtain named locks with scalable latching


Storage Allocator

Components: fixed-size allocator, coalescing allocator, chunk manager.

– Chunk = collection of segments
– Segment = contiguous, page-aligned space
– Chunks can have different recovery characteristics (zeroed memory, persistent memory)
– Control data is stored separately from data, which reduces the probability of corruption due to stray application pointers

Storage Allocator (continued)

BlzPtrs: database file plus offset
– Enable database files to be mapped anywhere in the address space of a process
– Enable database files to be resized
– The size of a database can exceed the virtual address space of processes (e.g., on 32-bit machines)
Multiple storage allocators
– Fixed-size allocator
– Variable-size allocator: exact fit; coalesces adjacent free blocks


Transactions

Full transaction semantics: the transaction commit, abort, and savepoint interfaces provide fault tolerance through internal locking and logging facilities that deliver the (A)tomicity, (C)onsistency, (I)solation, and (D)urability properties. Using transactions greatly eases the task of implementing systems that handle concurrent programs and a variety of failures while maintaining the integrity of vital data.


Heap File

The heap file is an abstraction for handling a large number of fixed-length data items, and is implemented as a thin layer on top of the Storage Manager allocator. In addition to insertion and deletion, the heap file supports locking and an unordered scan of items. Item locks are obtained transparently when items are inserted, deleted, updated, or scanned.


Extendible Hash Indices

DataBlitz offers both fixed and extendible hash indices:
– The directory grows with the number of elements in the hash index
– Inserts/deletes/scans involving different buckets can execute concurrently
– The directory can grow while inserts/deletes/lookups continue
– Ideal for point lookups on data
– Supports varying degrees of isolation (e.g., cursor stability, phantom protection)

[Figure: hash headers pointing to a directory of hash nodes]


Ttree Indices

– Ordered structure, like an AVL (height-balanced) tree
– Multiple keys per node
– Ideal for ordered scans over data
– Supports varying degrees of isolation (e.g. cursor stability, phantom protection)

[Figure: a Ttree node holding multiple keys between its min and max values]


C++ Relational API

Components: views, DML, DDL (SQL).

– Supports hash and Ttree indices on an arbitrary subset of attributes
– Predicate-based scans on tables
– Materialized joins and one-to-many relationships
– Enforces referential integrity constraints
– Wide variety of field types: variable-length fields; Date, Time, Decimal, and other numeric types; null values
– Views: simple select-project-joins on base tables
– Supports different isolation levels (e.g. cursor stability, phantom protection)
– Schema/data import/export


SQL/JDBC/ODBC Interfaces

DataBlitz builds upon the Dharma/SQL engine to provide SQL, JDBC, and ODBC support:
– The Dharma/SQL engine provides feature and performance parity with leading DBMS vendors
– The Dharma/SQL engine is production-proven, with a large installed customer base
Standards conformance
– Complete SQL-92 entry-level support
– ODBC 3.0 (core, level 1, and most level 2 APIs)
– JDBC 1.2

SQL Interface Features

Query optimization
– Statistics and cost-based optimization (e.g. query selectivity, table cardinality, histograms)
– Join order/method selection for complex queries
– Query rewrite algorithms for nested queries
– Join hints for influencing query execution plans
Run-time optimizations
– Compiled/cached SQL for repeated queries
– Multi-tuple fetches/appends
– Tuple-level locking support
– Execution plan data pipelining
Complete SQL-92 support plus
– Date/time/interval types
– Variable-length character strings
– Nested queries
– Transaction isolation levels
– Recursive views
– Schema manipulation statements
Java support
– Java stored procedures
– Java triggers
– JDBC 1.2

High Performance Modular Architecture

DataBlitz provides:
– the highest performance
– the most reliability
– the most flexibility

[Figure: the layered DataBlitz architecture — SQL and the C++ Relational API over the Collections API (heap files, hash indices, Ttrees), over the Storage Manager API (transactions, recovery/fault resilience, concurrency control, storage allocator), over infrastructure services]

Data Replication

Data replication can help improve:
– System availability (data is still available even if a site fails)
– System performance (replicated data can be accessed locally)
DataBlitz supports an "update anywhere" replication model
– Data is replicated at the granularity of tables
– Any site can update/access replicas without consulting other sites
– Updates are propagated to replicas asynchronously using logs
– Conflicts between updates are resolved using timestamps
The DataBlitz replication manager guarantees that
– Every table update is eventually propagated to every table replica
– Table replicas converge to an identical state when the system is quiesced

Data Replication (continued)

[Figure: three DataBlitz sites, each with a replication manager shipping logged updates to the others — updates performed at site 1 via T1 propagate to site 3, updates performed at site 2 via T2 propagate to site 1, and updates performed at site 3 via T3 propagate to site 2]

The DataBlitz replication protocol ensures that all sites converge to the same state.

Data Replication (continued)

Handling conflicting updates
– The update with the largest timestamp wins
– Timestamp and update information for losing updates is output to a file
Handling "normal" site failures
– After a site recovers, it resumes propagating its updates and applying updates to replicated tables (shipped to it from other sites)
Handling extended outages
– Logs for certain updates to be applied at the recovering site may have been truncated
– After the site recovers, it merges the most recent copy of the table from an operational site with its own copy of the table

Hot Spare Using Replication

[Figure: a client connected to the primary, with logs shipped to the backup in a system pair]

– The hot spare is the backup (secondary) system
– It keeps in sync with the primary via logs
– Higher availability: the backup takes over when the primary fails
– Once the backup has taken over, it becomes the primary; the original primary then becomes the secondary

Additional DataBlitz Features

Thread safety
– Concurrent thread access to DataBlitz
True 64-bit architecture support
– Database files greater than 4 GB
Configurability
– Features can be turned on and off (e.g. logging, locking)
– Various system parameters can be fine-tuned (e.g. number of chunks, maximum database files, active transaction table size)
Bulk loading
– Extremely fast data loading while locking and logging are disabled

DataBlitz System Servers

Root server: initializes system structures (e.g. the lock table)
Checkpoint server: checkpoints database files periodically
Flush server: writes logs to disk asynchronously
Cleanup server: detects failed processes and coordinates cleanup
Recovery server: recovers a database in response to data corruption
Mlock server: memory-locks performance-critical database files in memory

DataBlitz System Tools

Data migration
– Tools for transferring schema information or table data between databases and files
– Tools for transferring relational data in tabular form
RelDDL
– A stand-alone tool for data definitions in standard SQL syntax
Administrative GUI
– Monitor system resource usage: storage space (segments, bytes allocated in a chunk); servers, processes, threads; transactions, locks
– Perform archives/restores
– Change configuration parameters
Archive/Restore
– C++ API and command-line interpreter for database backups
Resource monitoring/checking
– C++ API and command-line interpreter for checking resource usage

Fastest Main Memory Database System

DataBlitz Main Memory DataBase System
– Concurrency
– High availability
– True 64-bit support
– Flexibility
– Multiple interface layers