Multi-CPU performance in PostgreSQL 9.2 Heikki Linnakangas / EnterpriseDB
Scalability 101 Scalability defined: when you throw more hardware at a problem, the software can make use of it. This presentation focuses on multi-CPU scalability
Multi-CPU systems today Even laptops typically have 2 cores Servers Low-end: 4 cores High-end: 32 cores and beyond RAM: > 128 GB
Journey begins: Itanium test box # machinfo CPU info: 8 Intel(R) Itanium(R) Processor 9350s (1.73 GHz, 24 MB) 4.79 GT/s QPI, CPU version E0 32 logical processors (4 per socket) Memory: 392917 MB (383.71 GB)... Platform info: Model: "ia64 hp Integrity BL890c i2"
Making software scale 1. Benchmark 2. Identify bottleneck 3. Fix bottleneck
Choosing the benchmark There are workloads where PostgreSQL scales great SELECTs, bundled into large transactions > 64 CPUs, no problem! On other workloads, PostgreSQL scales poorly Concurrent inserts choke at 2 CPUs
COPY Bulk loading data with COPY has always scaled well unless it needs to be WAL-logged Which is most of the time
COPY, solved commit d326d9e8ea1d690cf6d968000efaa5121206d231 Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Date: Wed Nov 9 10:54:41 2011 +0200 In COPY, insert tuples to the heap in batches. This greatly reduces the WAL volume, especially when the table is narrow. The overhead of locking the heap page is also reduced. Reduced WAL traffic also makes it scale a lot better, if you run multiple COPY processes at the same time.
Impact Makes bulk loading scale Reduces WAL volume Caveats: Optimization does not apply if there are BEFORE/AFTER triggers or volatile DEFAULT expressions When loading into a single table, extending the file becomes the bottleneck
Next workload: SELECTs I said SELECTs already scale well But in 9.1, only if you SELECTed different tables PostgreSQL lock manager is partitioned But when all backends hit the same table, that doesn't help, and the lock manager became a bottleneck
Lock manager, solved commit 3cba8999b343648c4c528432ab3d51400194e93b Author: Robert Haas <rhaas@postgresql.org> Date: Sat May 28 19:52:00 2011 -0400 Create a "fast path" for acquiring weak relation locks. When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested on an unshared database relation, and we can verify that no conflicting locks can possibly be present, record the lock in a per-backend queue, stored within the PGPROC, rather than in the primary lock table. This eliminates a great deal of contention on the lock manager LWLocks.... Review by Jeff Davis.
Impact Here are the results of alternating runs without and with the patch on that machine: tps = 36291.996228 (including connections establishing) tps = 129242.054578 (including connections establishing) tps = 36704.393055 (including connections establishing) tps = 128998.648106 (including connections establishing) tps = 36531.208898 (including connections establishing) tps = 131341.367344 (including connections establishing) That's an improvement of about ~3.5x. According to the vmstat output, when running without the patch, the CPU state was about 40% idle. With the patch, it dropped down to around 6%. - Robert Haas, 3 Jun 2011
SELECTs continued: ProcArrayLock Each session has an entry in shared memory, in the proc array The proc array is protected by a lock called ProcArrayLock It is acquired in shared mode whenever a snapshot is taken (~= at the beginning of each transaction) It is acquired in exclusive mode whenever a transaction commits
ProcArrayLock Becomes a bottleneck at high transaction rates A problem with OLTP workloads with a lot of small transactions Not a problem with larger transactions that do more stuff per transaction
commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4 Author: Robert Haas <rhaas@postgresql.org> Date: Fri Jul 29 16:46:13 2011 -0400 Reduce sinval synchronization overhead. Testing shows that the overhead of acquiring and releasing SInvalReadLock and msgNumLock on high-core count boxes can waste a lot of CPU time and hurt performance. This patch adds a per-backend flag that allows us to skip all that locking in most cases. Further testing shows that this improves performance even when sinval traffic is very high. Patch by me. Review and testing by Noah Misch.
commit 84e37126770dd6de903dad88ce150a49b63b5ef9 Author: Robert Haas <rhaas@postgresql.org> Date: Thu Aug 4 12:38:33 2011 -0400 Create VXID locks "lazily" in the main lock table. Instead of entering them on transaction startup, we materialize them only when someone wants to wait, which will occur only during CREATE INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also be able to probe for conflicting VXID locks, but the lock need never be fully materialized, because the startup process does not use the normal lock wait mechanism. Since most VXID locks never need to touch the lock manager partition locks, this can significantly reduce blocking contention on read-heavy workloads. Patch by me. Review by Jeff Davis.
Spinlocks commit c01c25fbe525869fa81237954727e1eb4b7d4a14 Author: Robert Haas <rhaas@postgresql.org> Date: Mon Aug 29 10:05:48 2011 -0400 Improve spinlock performance for HP-UX, ia64, non-gcc. At least on this architecture, it's very important to spin on a non-atomic instruction and only retry the atomic once it appears that it will succeed. To fix this, split TAS() into two macros: TAS(), for trying to grab the lock the first time, and TAS_SPIN(), for spinning until we get it. TAS_SPIN() defaults to same as TAS(), but we can override it when we know there's a better way. It's likely that some of the other cases in s_lock.h require similar treatment, but this is the only one we've got conclusive evidence for at present.
commit ed0b409d22346b1b027a4c2099ca66984d94b6dd Author: Robert Haas <rhaas@postgresql.org> Date: Fri Nov 25 08:02:10 2011 -0500 Move "hot" members of PGPROC into a separate PGXACT array. This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas
commit 0d76b60db4684d3487223b003833828fe9655fe2 Author: Robert Haas <rhaas@postgresql.org> Date: Fri Dec 16 21:44:26 2011 -0500 Various micro-optimizations for GetSnapshotData(). Heikki Linnakangas had the idea of rearranging GetSnapshotData to avoid checking for sub-xids when no top-level XID is present. This patch does that plus a bit of further, related rearrangement. Benchmarking shows a significant improvement on unlogged tables at higher concurrency levels, and a mostly indifferent result on permanent tables (which are presumably bottlenecked elsewhere). Most of the benefit seems to come from using the new NormalTransactionIdPrecedes() macro rather than the function call TransactionIdPrecedes().
commit d573e239f03506920938bf0be56c868d9c3416da Author: Robert Haas <rhaas@postgresql.org> Date: Wed Dec 21 09:16:55 2011 -0500 Take fewer snapshots. When a PORTAL_ONE_SELECT query is executed, we can opportunistically reuse the parse/plan snapshot for the execution phase. This cuts down the number of snapshots per simple query from 2 to 1 for the simple protocol, and 3 to 2 for the extended protocol. Since we are only reusing a snapshot taken early in the processing of the same protocol message, the change shouldn't be user-visible, except that the remote possibility of the planning and execution snapshots being different is eliminated.... Patch by me; review by Dimitri Fontaine.
Next bottleneck So I have a new theory: on permanent tables, *anything* that reduces ProcArrayLock contention causes an approximately equal increase in WALInsertLock contention (or maybe some other lock), and in some cases that increase in contention elsewhere can cost more than the amount we're saving here. - Robert Haas, 15 Dec 2011
Next workload: read/write Reran pgbench, now with writes Excluding branch-table updates, to avoid bottlenecking on the application level
WALInsertLock Remember the COPY problem? The patch to solve that was a special hack targeting just COPY. The problem remains for all other inserts/updates/deletes
Profiling HP-UX has a nice tool called caliper Like oprofile, but can include wait times too 34.62 postgres::ProcArrayEndTransaction [51] 30.77 postgres::XLogInsert [49] 11.54 postgres::LockBuffer [113] 7.69 postgres::TransactionIdSetPageStatus [146] 7.69 postgres::BufferAlloc [72] 7.69 postgres::GetSnapshotData [246] [25] 13.6 1.1 12.5 3.85 postgres::LWLockAcquire 69.23 postgres::PGSemaphoreLock [38] 26.92 postgres::s_lock [60]
Summary 9.2 scales much better for many common workloads! Future focus Serializable transactions WAL-logging Thanks to Nathan Boley, for lending a server for benchmarking Greg Smith, for creating pgbench-tools