Recovery Principles in MySQL Cluster 5.1



Mikael Ronström, Senior Software Architect, MySQL AB

Outline of Talk
- Introduction to MySQL Cluster in versions 4.1 and 5.0
- Discussion of the requirements for MySQL Cluster version 5.1
- Adaptation of the system recovery algorithms
- Adaptation of the node recovery algorithms

MySQL Cluster Architecture
[Architecture diagram: applications access the cluster either through a MySQL Server or directly via the NDB API; the NDB kernel runs on the database nodes, while the MGM Server runs on the management nodes and is reached through the Management Client and the Management API.]

Overview of MySQL Cluster in versions 4.1 and 5.0
- Main memory storage, complemented with logging to disk
- Clustered database distributed over many nodes
- Synchronous replication between nodes
- Database automatically partitioned over the nodes
- Fail-over capability in case of node failure
- Update in one MySQL Server, immediately see the result in another
- Support for high update loads (as well as very high read loads)

Node Recovery
[Diagram, example with two replicas: the transaction coordinator coordinates fragment operations on the running node (primary) and the restarting node (backup) while fragments are being copied.] Once a fragment has two copies, both copies take part in transactions.

System Recovery: Flushing the Log to Disk
[Timeline: T1 commits, T2 commits, GCP, T3 commits.] A global checkpoint (GCP) flushes the REDO log to disk; transactions T1 and T2 thereby become disk persistent. This is also called group commit.
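To make the group-commit behaviour concrete, here is a minimal sketch in Python. It is illustrative only, not the NDB implementation: the RedoLog class, the record format, and the file handling are invented for the example.

```python
import os

class RedoLog:
    """Illustrative group commit: commits buffer in memory, a GCP flushes them."""

    def __init__(self, path):
        self.file = open(path, "ab")
        self.buffer = []          # committed but not yet disk-persistent
        self.durable_gcp = 0      # last global checkpoint flushed to disk

    def commit(self, txn_id, records):
        # The transaction is committed in memory (in all replicas), not yet on disk.
        self.buffer.append((txn_id, records))

    def global_checkpoint(self, gcp_id):
        # Group commit: write every buffered transaction, then sync once.
        for txn_id, records in self.buffer:
            self.file.write(f"GCP={gcp_id} TXN={txn_id} {records}\n".encode())
        self.file.flush()
        os.fsync(self.file.fileno())
        self.buffer.clear()
        self.durable_gcp = gcp_id  # buffered transactions are now disk persistent

log = RedoLog("redo.log")
log.commit("T1", ["UPDATE a"])
log.commit("T2", ["INSERT b"])
log.global_checkpoint(17)       # T1 and T2 become disk persistent
log.commit("T3", ["DELETE c"])  # T3 waits for the next GCP
```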

System Recovery: Log and Local Checkpoints
[Timeline: LCP, T1 commits, T2 commits, T3 commits.] The REDO log is written to disk. A local checkpoint (LCP) saves an image of the database on disk, which makes it possible to cut the tail of the REDO log. Since MySQL Cluster replicates all data, the REDO log and LCPs are only used to recover from failures of the whole system. Transactions T1, T2 and T3 are main-memory persistent in all replicas.
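A small illustrative sketch (Python, with a hypothetical record format) of why a completed LCP lets the REDO log tail be cut: only records from the checkpoint's start point onward are still needed at system restart.

```python
from collections import namedtuple

RedoRecord = namedtuple("RedoRecord", "gcp txn op")

def trim_redo_log(redo_log, lcp_start_gcp):
    """After a local checkpoint is safely on disk, REDO records older than
    the checkpoint's start point are no longer needed for system recovery."""
    return [r for r in redo_log if r.gcp >= lcp_start_gcp]

log = [RedoRecord(15, "T1", "UPDATE"), RedoRecord(16, "T2", "INSERT"),
       RedoRecord(17, "T3", "UPDATE")]
print(trim_redo_log(log, lcp_start_gcp=16))   # records from GCP 16 onward remain
```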

System Recovery: Checkpointing Takes Time
[Timeline: LCP start, T1 commits, T2 commits, T3 commits, LCP stop; UNDO and REDO logs written throughout.] The UNDO log makes the LCP image consistent with the database as it was at LCP start time. This is needed because the REDO log records are exactly the operations performed, so the LCP image must be action-consistent.

Restoring a Fragment at System Restart
1) Load data pages from the local checkpoint into memory.
2) Execute the UNDO log, from its start to its end, back to the fuzzy point at which the local checkpoint started.
3) Execute the REDO log, up to the last global checkpoint log record.
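The restore sequence can be sketched as follows; this is illustrative Python under simplified assumptions (a page is a dictionary entry, LogRecord is an invented type), not the NDB kernel code.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    gcp: int        # global checkpoint the record belongs to
    page_id: int
    value: str

    def apply(self, memory):
        memory[self.page_id] = self.value

def restore_fragment(lcp_pages, undo_log, redo_log, last_durable_gcp):
    """Sketch of fragment restore: load the fuzzy LCP image, make it
    action-consistent with UNDO, then roll forward with REDO."""
    # 1) Load data pages from the local checkpoint into memory.
    memory = dict(lcp_pages)

    # 2) Execute the UNDO log backwards to the fuzzy start point of the
    #    local checkpoint, making the image action-consistent.
    for rec in reversed(undo_log):
        rec.apply(memory)

    # 3) Execute the REDO log forwards (assumed to already be trimmed to
    #    start at the LCP), up to the last durable global checkpoint.
    for rec in redo_log:
        if rec.gcp <= last_durable_gcp:
            rec.apply(memory)
    return memory

pages = {1: "old"}
undo  = [LogRecord(17, 1, "pre-LCP")]
redo  = [LogRecord(16, 1, "v16"), LogRecord(18, 1, "v18")]
print(restore_fragment(pages, undo, redo, last_durable_gcp=17))   # {1: 'v16'}
```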

New Requirements for MySQL Cluster 5.1
- Disk data, to handle larger data volumes
- Variable-sized records
- User-defined partitioning
- Replication between clusters

Discussion of the Disk Data Requirement (1)
Two major options:
- Put an existing disk-based engine into the NDB kernel
- Carefully reengineer the NDB kernel to add support for disk-based data

Discussion of the Disk Data Requirement (2)
We opted for a careful reengineering of the NDB kernel to handle disk-based data. The aim is to keep its very good performance even though data resides on disk. The implementation is divided into two phases: the first phase puts non-indexed data on disk, and the second phase implements disk-based indexes.

Discussion of the Disk Data Requirement (3)
MySQL Cluster version 5.1 implements the first phase: non-indexed fields on disk. The approach of careful reengineering also enabled, and in some cases forced, us to improve parts of the recovery architecture. This paper presents some of those improvements.

Availability of MySQL Cluster 5.1
The 5.1 version of MySQL has just been made available for source download. As of today the version includes support for user-defined partitioning. Support for disk-based data will be included shortly, and cluster replication will soon be available in this downloadable tree.

Structure of a Disk-based Row
[Figure. Main memory part of the row: tuple version (2B), bits (2B), checksum (4B), NULL bits (4B), GCP id (4B), disk reference (8B), a fixed-size part, and the variable-sized attributes (attribute id, attribute descriptor offset, variable attribute length array). Disk memory part of the row: bits (4B), checksum (4B), NULL bits (4B), GCP id (4B), main memory reference (4B), a fixed-size part, and the variable-sized attributes.]
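Purely as an illustration of the fixed-size headers above, the layouts can be expressed with Python's struct module. The field order and sizes follow my reading of the figure and are not taken from the server source.

```python
import struct

# Main memory part of the row: tuple version (2B), bits (2B),
# checksum (4B), NULL bits (4B), GCP id (4B), disk reference (8B).
MM_ROW_HEADER = struct.Struct("<HHIIIQ")

# Disk memory part of the row: bits (4B), checksum (4B),
# NULL bits (4B), GCP id (4B), main memory reference (4B).
DISK_ROW_HEADER = struct.Struct("<IIIII")

header = MM_ROW_HEADER.pack(1, 0, 0xDEADBEEF, 0, 17, 0x0000000100000002)
print(MM_ROW_HEADER.size, DISK_ROW_HEADER.size)   # 24 and 20 bytes
print(MM_ROW_HEADER.unpack(header))
```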

Node Recovery
Node restart is clearly affected by the need to copy the entire data set if the data volume goes up by a factor of 10-100. A node restart of 10-20 minutes with a few gigabytes of data is acceptable, since it causes no downtime, but a node restart of 10 hours with a few hundred gigabytes of data is not an acceptable period. This forced a reengineering.

Node Restart, Replay Log Variant
[Diagram with a running node and a starting node.]
1) Restart the node from local logs (e.g. GCP 16).
2) Replay the log from the first GCP not restored (17 in this example) to the current GCP.
3) Complex hand-over when the log replay is completed (most likely involving some exclusive locks).

Pros and Cons of the Replay Log Variant
Cons 1: Need to copy all rows if there are not enough log records in the running node.
Cons 2: Need to read the REDO log while writing it.
Cons 3: Difficult hand-over problem to complete the copy phase as the end of the log comes closer.
Pros 1: Minimum effort when the REDO logs exist.

Node Restart, Delete Log Variant
[Diagram with a running node and a starting node.]
1) Restart the node from local logs (e.g. GCP 16).
2) Start participating in transactions.
3) Replay the delete log from the first GCP not restored (17 in this example) to the GCP of step 2.
4) Scan the running node for rows with a higher GCP than the GCP restored (16 here) and synchronise those rows with the starting node.
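Steps 3 and 4 can be sketched as follows (illustrative Python; the row, GCP, and delete-log representations are invented for the example, and the upper bound on the delete-log replay is omitted).

```python
def sync_delete_log_variant(running, starting, delete_log, restored_gcp):
    """Delete-log variant: replay deletes missed by the starting node,
    then push every row changed after the GCP it restored."""
    # 3) Replay the delete log from the first GCP not restored onward.
    for row_id, gcp in delete_log:
        if gcp > restored_gcp:
            starting.pop(row_id, None)

    # 4) Scan the running node; rows with a higher GCP (the per-row
    #    timestamp) than the restored GCP are sent to the starting node.
    for row_id, (gcp, data) in running.items():
        if gcp > restored_gcp:
            starting[row_id] = (gcp, data)

running  = {1: (15, "a"), 2: (18, "b'"), 4: (19, "d")}
starting = {1: (15, "a"), 2: (14, "b"), 3: (16, "c")}   # restored from GCP 16
delete_log = [(3, 17)]                                  # row 3 deleted in GCP 17
sync_delete_log_variant(running, starting, delete_log, restored_gcp=16)
print(starting)   # row 3 removed, rows 2 and 4 brought up to date
```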

Pros and Cons of the Delete Log Variant
Cons 1: Need to copy all data if there are not enough delete log records in the running node.
Cons 2: Need to read the delete log while writing it.
Cons 3: Need to scan all rows in the running node.
Cons 4: Need a TIMESTAMP on rows.
Pros 1: No hand-over problem.
Pros 2: The delete log is much smaller than the REDO log.

iSCSI Variant, Introduction
- View all data as ROWID entries.
- ROWIDs are only added or deleted in chunks of pages, which is an uncommon operation.
- Thus an INSERT/DELETE is transformed into an UPDATE of the ROWID entry.
- Thus synchronisation can be done by a side-by-side check of whether the old and the new data are the same.
- An efficient way of performing the check is to have a TIMESTAMP on all rows (in our case TIMESTAMP = GCP id).

Node Restart, iSCSI Variant
[Diagram with a running node and a starting node.]
1) Restart the node from local logs (e.g. GCP 16).
2) Specify the page ranges existing in the partition to copy.
3) Delete all pages no longer present.
4) Start participating in transactions.
5) Synchronise all ROWIDs, sending new data when a row has changed since the GCP the node restored (16 in this example).
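The ROWID synchronisation of step 5 can be sketched like this (illustrative Python with invented names): every ROWID slot carries a GCP id as its timestamp, deletes appear as updates that mark the slot free, and a slot is shipped only when its timestamp is newer than the GCP the starting node restored.

```python
FREE = None   # an unused ROWID slot (DELETE updates the slot to FREE)

def sync_rowid_variant(running_slots, starting_slots, restored_gcp):
    """iSCSI-style variant sketch: all data is viewed as ROWID slots and
    INSERT/DELETE become updates of a slot, so synchronisation is a
    side-by-side comparison driven by the per-slot GCP timestamp."""
    for rowid, (gcp, data) in running_slots.items():
        # Send new data only when the slot changed after the GCP the
        # starting node restored from its local logs.
        if gcp > restored_gcp:
            starting_slots[rowid] = (gcp, data)

running  = {0: (15, "a"), 1: (18, FREE), 2: (19, "c2")}   # slot 1 deleted in GCP 18
starting = {0: (15, "a"), 1: (14, "b"),  2: (16, "c")}    # restored from GCP 16
sync_rowid_variant(running, starting, restored_gcp=16)
print(starting)   # slot 1 now FREE, slot 2 refreshed, slot 0 untouched
```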

Pros and Cons of the iSCSI Variant
Cons 1: Need to scan all rows in the running node.
Cons 2: Need a TIMESTAMP on rows.
Cons 3: Requires ROWID as the record identifier.
Pros 1: No hand-over problem.
Pros 2: Not limited by log sizes.

System Recovery
Algorithms that write the entire data set during each local checkpoint cannot work for cached disk data, so a new algorithm is needed for the disk-based parts. The previous algorithm had a coupling between the abort mechanisms and the recovery mechanisms; here the reengineering was an enabled improvement rather than a forced one.

Buffer Manager Problem
Design goal: any update should generate at most two disk writes in the normal case, to sustain very good performance for high update loads. A side effect should be that we get performance almost comparable to file systems.

Buffer Manager Solution (1)
- No changes to the REDO log.
- Write pages using a page cache algorithm following the Write-Ahead Logging (WAL) principle.
- For inserts, we try to make each page write cover several new records by choosing an extent with room for the record and as much free space as possible (a sketch of this choice follows after the figure below).

Searching for an Extent to Write Into (Down First, Then Right)
[Figure: a grid of extents with varying amounts of free space (extent size = 1 MB, about 70% full, record size = 10 kB); the search proceeds down a column first, then moves right to the next column.]
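A sketch of the extent choice, under the assumption that candidate extents are simply compared on their remaining free space (illustrative Python, not the NDB allocation code):

```python
def choose_extent(extents, record_size):
    """Pick an extent with room for the record and as much free space as
    possible, so consecutive inserts land in the same extent and generate
    as few distinct page writes as possible."""
    candidates = [e for e in extents if e["free_bytes"] >= record_size]
    if not candidates:
        return None                      # allocate a fresh extent instead
    return max(candidates, key=lambda e: e["free_bytes"])

extents = [{"id": 1, "free_bytes": 500_000},
           {"id": 2, "free_bytes": 100_000},
           {"id": 3, "free_bytes": 300_000}]
print(choose_extent(extents, record_size=10_000))   # extent 1, 0.5 MB free
```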

Buffer Manager Solution (2)
- For disk-based data we need to UNDO log all structural changes (data movement, allocation, deallocation).
- Data UNDO logging is not needed if the GCP has been written, since the REDO log will replay the changes anyway.
- This is implemented by filtering UNDO log records when preparing them for writing to disk, which avoids complex logic around page writes.
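One possible reading of the filtering, sketched in Python: data UNDO records whose changes already belong to a durable GCP are dropped before the UNDO log is written, while structural UNDO records are always kept. The record fields and the exact condition are assumptions made for the example.

```python
def filter_undo_records(undo_records, last_durable_gcp):
    """Drop data UNDO records already covered by a durable GCP; the REDO
    log will replay those changes at recovery. Structural changes
    (allocation, deallocation, data movement) are always UNDO logged."""
    kept = []
    for rec in undo_records:
        if rec["kind"] == "structural":
            kept.append(rec)
        elif rec["gcp"] > last_durable_gcp:   # not yet replayable from REDO
            kept.append(rec)
    return kept

records = [{"kind": "data", "gcp": 16, "page": 7},
           {"kind": "data", "gcp": 18, "page": 7},
           {"kind": "structural", "gcp": 16, "page": 9}]
print(filter_undo_records(records, last_durable_gcp=17))
```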

Abort Handling Decoupled from Recovery
- We decided to use a No-Steal algorithm (only committed pages are written to disk). This simplifies recovery, but requires the transaction state to be in main memory for optimum performance.
- Large transactions could require that transactional information is spooled down to disk and read back later.
- As in MySQL Cluster 4.1 and 5.0, version 5.1 will impose a configurable limit on transaction sizes.
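A minimal sketch of the No-Steal policy (illustrative Python, not the NDB buffer manager): a page becomes eligible for writing to disk only once no active transaction has uncommitted changes on it.

```python
class NoStealPageCache:
    """Only committed pages are written to disk (No-Steal), which keeps
    crash recovery simple: disk never contains uncommitted changes."""

    def __init__(self):
        self.uncommitted = {}   # page_id -> set of transactions with open changes

    def modify(self, page_id, txn_id):
        self.uncommitted.setdefault(page_id, set()).add(txn_id)

    def commit(self, txn_id):
        for txns in self.uncommitted.values():
            txns.discard(txn_id)

    def can_write_to_disk(self, page_id):
        # A page may be flushed only when it carries no uncommitted changes.
        return not self.uncommitted.get(page_id)

cache = NoStealPageCache()
cache.modify(7, "T1")
print(cache.can_write_to_disk(7))   # False: T1 is still active
cache.commit("T1")
print(cache.can_write_to_disk(7))   # True: only committed data on page 7
```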

Buffer Manager Conclusions
- A good solution for minimising the number of disk writes for transactions that fit into main memory (which can still be up to gigabytes in size).
- The growth of memory sizes makes design choices with a simpler architecture (logical logging and a No-Steal algorithm) the better option.

New LCP Algorithm for Main Memory Parts that Avoids UNDO Logging Entirely
- Scan the memory parts and write each row into the checkpoint log.
- Use one bit in the row to synchronise the checkpoint write.
- This bit does not survive crashes, so there is no problem maintaining it after a crash: it is always initialised to 0.
- A simple variant of copy-on-write.

Actions by Insert/Update/Delete when a scan is ongoing and has not yet passed the row
- A committed insert sets the bit; no write to the checkpoint.
- A committed update or delete sets the bit and writes the row to the checkpoint log.

Action by the Scan
- If the bit is 0 when the scan reads the row, write the row to the local checkpoint file.
- If the bit is 1, set the bit to 0 and do not write the row to the local checkpoint.
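The whole scheme can be sketched as follows (illustrative Python). One assumption is mine: I take the checkpoint write performed by a committing update or delete to be the row's previous image, which is what keeps the checkpoint consistent with the state at LCP start.

```python
class Row:
    def __init__(self, data):
        self.data = data
        self.lcp_bit = 0          # does not survive crashes, starts at 0

class MainMemoryLCP:
    """Sketch of the UNDO-free LCP for main-memory data: the scan writes
    each row once, and one bit per row tells concurrent commits and the
    scan which of them is responsible for getting the row into the
    checkpoint (a simple copy-on-write variant)."""

    def __init__(self, rows):
        self.rows = rows          # rowid -> Row
        self.checkpoint = []      # rows written for this LCP

    # Actions by committing operations while the scan is ongoing and has
    # not yet passed the row.
    def commit_insert(self, rowid, data):
        row = Row(data)
        row.lcp_bit = 1           # new row: scan must skip it, no write
        self.rows[rowid] = row

    def commit_update(self, rowid, data):
        row = self.rows[rowid]
        if row.lcp_bit == 0:
            self.checkpoint.append((rowid, row.data))   # save the pre-update image
            row.lcp_bit = 1
        row.data = data

    def commit_delete(self, rowid):
        row = self.rows.pop(rowid)
        if row.lcp_bit == 0:
            self.checkpoint.append((rowid, row.data))   # save the deleted image

    # Action by the LCP scan itself.
    def scan_row(self, rowid):
        row = self.rows[rowid]
        if row.lcp_bit == 0:
            self.checkpoint.append((rowid, row.data))   # not yet checkpointed
        else:
            row.lcp_bit = 0       # already written by a commit, skip it

lcp = MainMemoryLCP({1: Row("a"), 2: Row("b")})
lcp.commit_update(2, "b2")   # scan has not passed row 2: pre-image goes to checkpoint
lcp.scan_row(1)              # bit 0: the scan writes row 1
lcp.scan_row(2)              # bit 1: already in the checkpoint, just reset the bit
print(lcp.checkpoint)        # [(2, 'b'), (1, 'a')]
```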

Index Recovery
All indexes are main-memory indexes and are rebuilt as part of system recovery and node recovery. All index builds can be performed while updates are ongoing (a feature of the NDB kernel that will be exported to the SQL level in 5.1).

Conclusions
- Careful reengineering of the NDB kernel introduces non-indexed fields on disk for MySQL Cluster version 5.1.
- Lots of improvements to the recovery algorithms, both performance-wise and in terms of simplicity.
- You are welcome to play with it and check whether there are areas needing improvement; the code will soon be available.