A Shared-nothing cluster system: Postgres-XC


Transcription:

Welcome A Shared-nothing cluster system: Postgres-XC - Amit Khandekar

Agenda
- Postgres-XC configuration
- Shared-nothing architecture applied to Postgres-XC
- Supported functionalities: present and future

Configuration (user view)
To the PG client, Postgres-XC looks like a single PostgreSQL (XC) server.

Configuration (shared-nothing)
The PG client connects to a Postgres-XC coordinator, which in turn connects over libpq to several Postgres-XC datanodes.

Configuration (synchronous and symmetric)
Multiple PG clients can connect to any coordinator and see the same, consistent view of the database.

What is shared?
The GXID and snapshot: the GTM hands them out to every coordinator and datanode.

Shared-nothing cluster
Benefits:
- Scalability and performance (no CPU, disk, or memory bottleneck)
- Lower hardware cost (commodity hardware)
- Scope for data redundancy (high availability)
Efforts required:
- Accessing non-local data
- Implementing distributed data access
- Node addition/removal requires reorganizing the database

Shared-nothing: scalability
- Parallelism
- Load balancing
- Data distribution
- Data movement

Distributed tables
  psql:
    CREATE TABLE employee (id int, ...) DISTRIBUTE BY HASH (id) TO NODE (D1, D3);
    INSERT INTO employee VALUES (1, ...), (2, ...), (3, ...), (4, ...);
The inserted rows are split between the two listed datanodes (for example rows 1 and 3 on D1, rows 2 and 4 on D3), while D2 stores nothing for this table.

Distributed tables
  psql:
    CREATE TABLE employee (id int, ...) DISTRIBUTE BY HASH (id) TO NODE (D1, D3);
    SELECT * FROM employee WHERE id = 3;
    UPDATE employee SET ... WHERE id = 3;
Because the distribution key appears in the WHERE clause, the coordinator sends the SELECT and UPDATE only to the datanode that holds the row with id = 3.

Distributed tables: distribution strategies (see the DDL sketch below)
- Distribute by HASH
- Distribute by MODULO
- Distribute by round robin
- Distribute by range
- Vertical fragmentation
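A minimal DDL sketch of the first three strategies, assuming the HASH/MODULO/ROUNDROBIN keywords of Postgres-XC 1.0 (table and column names are illustrative; range distribution and vertical fragmentation are listed above as directions rather than released syntax):

  CREATE TABLE t_hash (id int, val text) DISTRIBUTE BY HASH (id);
  CREATE TABLE t_mod  (id int, val text) DISTRIBUTE BY MODULO (id);
  CREATE TABLE t_rr   (id int, val text) DISTRIBUTE BY ROUNDROBIN;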

Replicated tables
  psql:
    CREATE TABLE employee (id int, ...) DISTRIBUTE BY REPLICATION TO NODE (D1, D3);
    INSERT INTO employee VALUES (1, ...), (2, ...), (3, ...), (4, ...);
Every row (1 through 4) is stored on both D1 and D3; D2 holds no copy of the table.

Replicated tables (preferred node)
  psql:
    CREATE TABLE employee (id int, ...) DISTRIBUTE BY REPLICATION TO NODE (D1, D3);
    SELECT * FROM employee;
When coordinator C1 runs the SELECT, it reads the full copy from the datanode marked as preferred for C1 rather than picking one of the replicas at random.

Parallelism: example configuration
Several PG clients issue statements concurrently through different coordinators: an UPDATE on employee (WHERE id = 100) and SELECTs on a replicated table. With C1 preferring D2 and C2 preferring D3, the UPDATE is routed to the datanode that holds id = 100, while the SELECTs are served from different replicas (D2 and D3) and so run in parallel.

Parallelism
- Inter-query
- Intra-query
  - Intra plan-node: a single remote table scan is done in parallel on the datanodes
  - Inter plan-node: scope for future work, e.g. join table scans running in parallel

Load balancing
- Replicated tables offer a good load-balancing opportunity: reads go to the preferred node for replicated tables, and XC randomly chooses a datanode if no preferred node is set.
- DBA responsibility: data distribution.
- Coordinator load balancing requires an external application to redirect client requests to a particular coordinator.

Reducing data movement
  psql:
    SELECT empname FROM employee WHERE NOT is_old_employee(id) AND id = 3;
Postgres-XC pushes the shippable part of the work to the datanodes as
    SELECT empname, id FROM employee WHERE (id = 3)
and keeps the remaining condition on the coordinator as a coordinator qual: (NOT is_old_employee(id)).
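Whether such a function qual can be shipped depends on its volatility (a later slide lists immutable functions as pushable). A purely illustrative sketch, assuming hypothetical column and function definitions:

  -- Hypothetical: reads the table, so it cannot be IMMUTABLE and stays behind as a coordinator qual
  CREATE FUNCTION is_old_employee(int) RETURNS boolean AS $$
      SELECT e.joined_on < now() - interval '10 years' FROM employee e WHERE e.id = $1
  $$ LANGUAGE sql STABLE;

  -- Hypothetical: a pure predicate can be declared IMMUTABLE, so the planner may ship it to the datanodes
  CREATE FUNCTION is_low_id(int) RETURNS boolean AS $$
      SELECT $1 < 1000
  $$ LANGUAGE sql IMMUTABLE;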

Reading plans
  EXPLAIN VERBOSE SELECT empname FROM employee WHERE NOT is_old_employee(id) AND id = 3;

PostgreSQL:
  Seq Scan on public.employee  (cost=0.00..332.88 rows=4 width=32)
    Output: empname
    Filter: ((employee.id = 3) AND (NOT is_old_employee(employee.id)))

Postgres-XC:
  Data Node Scan on employee  (cost=0.00..0.00 rows=1000 width=32)
    Output: employee.empname
    Node/s: data_node_2
    Remote query: SELECT empname, id FROM ONLY employee WHERE (id = 3)
    Coordinator quals: (NOT is_old_employee(employee.id))

Pushing work to datanodes
Pushable:
- Immutable functions
- Constant expressions
- Joins involving at least one common replicated table
- The whole query in certain scenarios (FQS, Fast Query Shipping)
Work in progress; a sketch of the replicated-join case is below.
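A common pattern that exploits the replicated-join rule: keep a small lookup table replicated so a join with a distributed table can run entirely on the datanodes (table definitions are illustrative):

  CREATE TABLE dept (deptid int, name text) DISTRIBUTE BY REPLICATION;
  CREATE TABLE employee (id int, dept int, empname text) DISTRIBUTE BY HASH (id);

  -- Every datanode holds a full copy of dept, so the join can be shipped as a whole
  SELECT e.empname, d.name
  FROM employee e JOIN dept d ON e.dept = d.deptid;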

Pushing work to datanodes
  EXPLAIN SELECT * FROM employee JOIN dept ON employee.dept = dept.deptid;

                                   QUERY PLAN
  -------------------------------------------------------------------------------
   Hash Join  (cost=0.12..0.26 rows=10 width=76)
     Hash Cond: (employee.dept = dept.deptid)
     ->  Data Node Scan on employee  (cost=0.00..0.00 rows=1000 width=40)
           Node/s: data_node_1, data_node_2
     ->  Hash  (cost=0.00..0.00 rows=1000 width=36)
           ->  Data Node Scan on dept  (cost=0.00..0.00 rows=1000 width=36)
                 Node/s: data_node_1, data_node_2

Pushing work to datanodes
  EXPLAIN SELECT name FROM employee JOIN dept ON employee.dept = dept.deptid;

                                   QUERY PLAN
  -------------------------------------------------------------------------------
   Data Node Scan on "__REMOTE_FQS_QUERY__"  (cost=0.00..0.00 rows=0 width=0)
     Output: employee.name
     Node/s: data_node_1
     Remote query: SELECT employee.name FROM (employee JOIN dept ON ((employee.deptid = dept.deptid)))
  (4 rows)

Cost estimates (future work)
- Cost estimation is not cluster-aware: data transfer cost is not calculated.
- No selectivity information on the coordinator: ANALYZE updates statistics on the datanodes but not on the coordinator, so the cheapest plan is not always chosen.
- Datanodes themselves use the usual PostgreSQL cost estimation.

Deadlocks
- No cluster-wide deadlock detection.
- Updates on replicated tables make deadlocks more likely: with two parallel updates of the same row of a replicated table, Q1 takes the row lock on node1 while Q2 takes it on node2; Q1 then waits for Q2's lock on node2 and Q2 waits for Q1's lock on node1.
- Workaround: assign the same primary datanode on each coordinator (see the sketch below); this may not even be needed in the future.
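One way to follow that advice is the PRIMARY option of CREATE NODE, issued identically on every coordinator (host and port are illustrative):

  CREATE NODE d1 WITH (TYPE = 'datanode', HOST = '10.0.0.1', PORT = 15432, PRIMARY);
  -- With the same primary everywhere, replicated-table updates take their first row lock
  -- on the same datanode, so the cross-node wait cycle above cannot form.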

ACID Properties

ACID properties (Consistency)
A consistent view of the database throughout the cluster is provided by the global transaction ID (GXID) and the global snapshot; MVCC takes care of the rest. The GTM hands out the GXID, snapshot, timestamp, and sequence values to every coordinator and datanode.

ACID properties (Consistency)
- Global constraints are not supported yet. Constraint checks run only on the individual node, not across datanodes, so creating a table with a constraint that would require a cluster-wide check is not allowed. For example, a distributed table may not have a unique constraint on a column unless that column is the distribution key (example below). This restriction stays until global constraint checks are supported.
- Updating the distribution key column is not supported.
- TIP: explicitly choose the distribution key while creating the table.
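A small sketch of the unique-constraint rule (table names are illustrative; the exact error text may differ):

  -- Accepted: the unique constraint is on the distribution key, so each datanode can check it locally
  CREATE TABLE emp_ok (id int UNIQUE, name text) DISTRIBUTE BY HASH (id);

  -- Rejected: uniqueness of "name" would need a check across all datanodes
  CREATE TABLE emp_bad (id int, name text UNIQUE) DISTRIBUTE BY HASH (id);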

ACID properties (Isolation)
Supported transaction isolation levels: read committed and repeatable read. Serializable (PostgreSQL >= 9.1) falls back to repeatable read.
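From a client session the fallback looks like this (standard PostgreSQL syntax; the comment reflects the behaviour described above):

  BEGIN;
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- accepted, but Postgres-XC runs it as repeatable read
  SELECT * FROM employee WHERE id = 3;
  COMMIT;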

ACID properties (Atomicity and Durability)
- Two-phase commit protocol: the coordinator uses it transparently on the nodes involved in write activity. This ensures that either all nodes commit or all nodes abort the transaction, even if a node crashes.
- It is always used when explicitly requested by the application with PREPARE TRANSACTION.
- It needs to be disabled if temporary tables are involved (a PostgreSQL restriction): SET enforce_two_phase_commit = off;
- Because datanodes are PostgreSQL-based servers, each datanode has its own CLOG and can be recovered individually after a crash.
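A minimal sketch of the explicit form issued through any coordinator (the transaction identifier is illustrative):

  BEGIN;
  UPDATE employee SET empname = 'new name' WHERE id = 1;
  PREPARE TRANSACTION 'demo_txn';   -- the coordinator prepares the transaction on the datanodes it touched
  COMMIT PREPARED 'demo_txn';

  -- When temporary tables are involved, disable two-phase commit first:
  SET enforce_two_phase_commit = off;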

ACID properties (Durability)
- pg_prepared_xacts: if all nodes have executed PREPARE TRANSACTION and a node crashes just as the coordinator is about to send the commit or abort, pg_prepared_xacts shows such in-doubt transactions.
- pgxc_clean utility: cleans up such transactions on the nodes once they are recovered, by issuing COMMIT PREPARED.
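Checking for such in-doubt transactions on a node uses the standard catalog view:

  SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
  -- pgxc_clean inspects this view on the recovered nodes and issues COMMIT PREPARED to clean up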

Bottlenecks: the GTM
Simultaneous snapshot requests from many backends are funnelled through a GTM proxy, which merges them into a single snapshot request to the GTM.

Bottlenecks: the pooler
Each coordinator runs a connection pooler, which can itself become a bottleneck.

High availability: built-in?
- Redundancy is possible using replicated tables; queries that do not access the failed node keep on executing.
- If a node holding all the replicated data crashes, the data is still available on other nodes, but this is not true HA: the coordinator does not automatically fail over to another node holding the replica.
- Scope for further research.

High availability
The GTM is backed by a GTM standby that keeps a copy of the GTM status; on failure it is promoted with gtm_ctl promote, and the GTM proxies are pointed at it with gtm_ctl reconnect. Coordinators and datanodes are protected with ordinary PostgreSQL replication.

High availability
- For automatic failover, integrate Postgres-XC with HA middleware such as Pacemaker, which continuously monitors each component: GTM, coordinators, and datanodes.
- Write Pacemaker resource agents for Postgres-XC that implement start, stop, status, promote, etc.
- Manual intervention may still be needed: ALTER NODE for a new IP, pgxc_clean.
- The Linux-HA Japan team is actively working on the above.

Recovery
- For PITR, the whole cluster should be recovered up to the same point on all nodes.
- CREATE BARRIER 'barrier_id', issued from any coordinator, waits for all in-flight transactions to complete and writes an XLOG entry for barrier recovery on each node.
- In recovery.conf, set recovery_target_barrier = 'barrier_id', just as you would set a recovery target xid or timestamp.
- Recovery then rolls the xlog forward up to that barrier.
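A minimal sketch of the two sides of barrier-based PITR (barrier name and archive path are illustrative):

  -- on any coordinator, while the cluster is running:
  CREATE BARRIER 'backup_2012_06_01';

  # recovery.conf on each node being restored:
  restore_command = 'cp /archive/%f %p'
  recovery_target_barrier = 'backup_2012_06_01'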

Catalog objects
Queries on catalogs are always run locally: all nodes have the same copy of the catalogs, and DDL statements are propagated to all nodes.
- Views/rules: rule rewrite happens on the coordinator.
- Sequences: values are fetched from the GTM.
- User functions: definitions exist everywhere; the coordinator chooses whether a function should be called on a datanode.
- System tables: hold local information.
- Triggers: under development.
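Sequences look exactly like stock PostgreSQL to the client, except that the values come from the GTM and are therefore unique across the cluster:

  CREATE SEQUENCE emp_id_seq;
  SELECT nextval('emp_id_seq');   -- the next value is fetched from the GTM, unique cluster-wide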

Cluster initialization
On coordinator C1:
  CREATE NODE C2 WITH (HOST = '238.12.34.11', type = 'coordinator');
  CREATE NODE D1 WITH (HOST = 'localhost', type = 'datanode', preferred);
  CREATE NODE D2 WITH (HOST = '238.12.88.11', type = 'datanode');
On coordinator C2:
  CREATE NODE C1 WITH (HOST = '238.12.34.12', type = 'coordinator');
  CREATE NODE D1 WITH (HOST = '238.12.88.12', type = 'datanode');
  CREATE NODE D2 WITH (HOST = 'localhost', type = 'datanode', preferred);
GTM:
  gtm_ctl start -Z gtm
Datanode D1:
  initdb ... nodename=d1; pg_ctl start -Z datanode
Datanode D2:
  initdb ... nodename=d2; pg_ctl start -Z datanode

Cluster management
- Each coordinator needs to run CREATE NODE for all the other nodes, including the other coordinators.
- Node configuration is static and should be changed offline: pg_dump from any one coordinator; stop the cluster; add and reinitialize all nodes again, including the new node; pg_restore on any coordinator.
- Online node addition/removal: TODO.
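A hedged sketch of that offline procedure using the standard client tools (host names, database name, and file names are illustrative):

  pg_dump -h coord1 -Fc -f mydb.dump mydb     # dump the whole database through any one coordinator
  # stop the cluster, re-initialize all nodes (including the new one), re-run CREATE NODE everywhere
  pg_restore -h coord1 -d mydb mydb.dump      # restore through any coordinator; DDL is propagated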

Cluster management
- Online data redistribution: used to change the distribution strategy, e.g. ALTER TABLE tab1 DISTRIBUTE BY REPLICATION ...; it can also be used to redistribute the data onto newly added nodes.
- Concurrent online data redistribution (TODO): ALTER TABLE ... CONCURRENTLY.
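A sketch of the two uses above (node names are illustrative, and the exact node-list clause may vary between releases):

  ALTER TABLE tab1 DISTRIBUTE BY REPLICATION;    -- change the distribution strategy
  ALTER TABLE tab1 TO NODE (d1, d2, d3);         -- redistribute the data onto a new set of nodes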

Features support (< 1.0)
Postgres-XC 0.9.6:
- HAVING clause
- GROUP BY optimization for push-down
- Temporary objects
- PREPARE/EXECUTE
Postgres-XC 0.9.7:
- Cluster node management with DDLs
- SELECT INTO / CREATE TABLE AS
- INSERT SELECT
- Window functions
- Views, correlated subqueries, common table expressions

Features support (>= 1.0)
Postgres-XC 1.0 (based on PostgreSQL 9.1):
- Stabilization
- SERIAL types
- TABLESPACE
- Advisory locks
- Fast Query Shipping
- Cursors
Development branch:
- Merged with PostgreSQL 9.2
- Data redistribution with ALTER TABLE
- Planner improvements
- RETURNING clause
- WHERE CURRENT OF
- Triggers

Future
- Online node addition and removal
- Ongoing query-processing improvements
- SAVEPOINT
- Serializable Snapshot Isolation
- HA improvements, and many others

Thank you
Project web page: http://postgres-xc.sourceforge.net/
Help at: postgres-xc-general@lists.sourceforge.net, postgres-xc-developers@lists.sourceforge.net