Comparing MySQL and Postgres 9.0 Replication
An EnterpriseDB White Paper
For DBAs, Application Developers, and Enterprise Architects
March 2010
Table of Contents

Introduction
A Look at the Replication of Oracle's MySQL
Configuring MySQL Replication
PostgreSQL Replication Options
Built-in Streaming Replication
Other PostgreSQL Replication Options
Comparing MySQL and PostgreSQL Replication
Conclusions
About EnterpriseDB

EnterpriseDB The Enterprise PostgreSQL Company
Introduction

Replication is one of the most popular features used in RDBMSs today. It is used for disaster recovery (i.e. backup or warm standby servers), for reporting systems where query activity is offloaded onto another machine to conserve resources on the transactional server, and for scale-out architectures that use sharding or other methods to increase overall query performance and data throughput.

Replication is not restricted to the major proprietary databases; open source databases such as Oracle's MySQL and PostgreSQL also offer built-in replication. With the recent releases of MySQL 5.5 and PostgreSQL 9.0, questions are being asked about how their replication technologies differ. What follows is an overview of both MySQL and PostgreSQL replication, with a summarized comparison of the two implementations at the end of this paper.

A Look at the Replication of Oracle's MySQL

Asynchronous replication was introduced into Oracle's MySQL with version 3.23, and today it remains the primary feature employed by many MySQL users to create scale-out architectures, standby servers, read-only data marts, and more. The supported MySQL replication topologies include:

- Single master to one slave
- Single master to multiple slaves
- Single master to one slave to one or more slaves
- Circular replication (A to B to C and back to A)
- Master to master

The major replication topology not currently supported in Oracle's MySQL is multi-source replication: having two or more master servers feed a single slave.

[Figure: graphical view of how MySQL replication functions]
Object, data, and security operations run on the master are written to the master server's binary log. A user has the option of replicating an entire server, one or more databases, or just selected tables (although filtering by table is only done on the slave). The slave server obtains information from the master's binary log over the network, copies the commands and/or data, and first writes them to the slave's relay log. That log is then read by another thread, the SQL thread, which applies the replicated operations/data to the slave database and its binary log.

Prior to release 5.1, MySQL replication was statement-based, meaning that the actual SQL commands were replicated from the master to one or more slaves. However, certain use cases did not lend themselves to statement-based replication (e.g. non-deterministic function calls), so row-based replication was introduced in MySQL 5.1. A user now has the option of setting a configuration parameter to use either statement-based or row-based replication.

The primary bottleneck for busy MySQL replication configurations is the single-threaded nature of its design: replication operations are not multi-threaded at the moment, although MySQL has declared that multi-threading is coming in a future release. This limitation can cause some slave servers under heavy load to fall far behind the master in applying binary log information.

Configuring MySQL Replication

Setting up MySQL replication is a fairly painless process. Although various setup procedures exist, the following is a basic outline of how it is done:

1. The master and slave servers are identified.
2. The master server is modified to include a replication security account.
3. The master server's MySQL configuration file is modified to enable binary logging. A few other parameters are included as well (e.g. a unique server ID and the type of replication, such as statement-based or row-based).
4. The slave server's MySQL configuration file is modified to include a unique server ID.
5. The master server is restarted.
6. The master server's log file position is recorded.
7. The master's data is copied to the slave to initially seed the slave server. This can be done via a cold backup/restore, using the mysqldump utility, locking the master tables and doing a file copy, etc.
8. The slave server is restarted.
9. The MySQL CHANGE MASTER TO command is executed on the slave server to set the master host name as well as other parameters such as the master account username and password, the log file name, and the beginning log file position.

Once set up, MySQL replication is quite reliable. Being asynchronous in nature, however, there are use cases that could result in data loss between a master and slave. To help combat these situations, MySQL 5.5 introduced semi-synchronous replication, in which a transaction is sent from the master to a slave but is not yet applied on the slave; it merely lands safely in the slave's relay log to be run as soon as possible. The committing session on the master waits until a slave acknowledges that it has received the transaction (or until a timeout expires) before the commit is reported to the client as complete.

In terms of MySQL replication limitations and missing features, besides the already mentioned single-threaded nature of the implementation and the inability to perform multi-source replication, other wish-list items include a fully synchronous option, conflict detection and resolution, time-delayed replication, changing the binary log to a storage engine, better replication filtering on the master, global statement IDs, and graphical tools to manage replication functions.

There are third-party providers of MySQL replication solutions that overcome some of the current shortcomings in what is provided out-of-the-box with MySQL. One example is Continuent's Tungsten product.

For more information about Oracle's MySQL replication, see: http://dev.mysql.com/doc/refman/5.5/en/replication.html.
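The configuration side of the setup outline above can be sketched in a few lines of Python. This is only an illustration, not a provisioning tool: the helper names (master_cnf, slave_cnf, change_master_sql), the host name, the account, and the log coordinates are all hypothetical, and the sketch merely renders text that a DBA would review before placing it in a configuration file or running it on the slave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasterCoordinates:
    """Binary log file name and position recorded on the master."""
    log_file: str
    log_pos: int


def master_cnf(server_id: int, binlog_format: str = "ROW") -> str:
    """Render a [mysqld] fragment that enables binary logging on the master."""
    if binlog_format not in ("STATEMENT", "ROW", "MIXED"):
        raise ValueError(f"unsupported binlog format: {binlog_format}")
    return "\n".join([
        "[mysqld]",
        f"server-id={server_id}",          # must be unique across all servers
        "log-bin=mysql-bin",               # turns on the binary log
        f"binlog-format={binlog_format}",  # statement-based or row-based
    ])


def slave_cnf(server_id: int) -> str:
    """Render the minimal [mysqld] fragment a slave needs: a unique server ID."""
    return "\n".join(["[mysqld]", f"server-id={server_id}"])


def change_master_sql(host: str, user: str, password: str,
                      coords: MasterCoordinates) -> str:
    """Build the CHANGE MASTER TO statement that is run on the slave.

    Values are interpolated without escaping, so this is suitable only
    for illustration with trusted inputs.
    """
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', "
        f"MASTER_USER='{user}', "
        f"MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{coords.log_file}', "
        f"MASTER_LOG_POS={coords.log_pos};"
    )


if __name__ == "__main__":
    # Hypothetical host, account, and log coordinates.
    coords = MasterCoordinates("mysql-bin.000001", 107)
    print(master_cnf(server_id=1))
    print(slave_cnf(server_id=2))
    print(change_master_sql("master.example.com", "repl", "secret", coords))
```

Keeping the recorded log file name and position together in one value mirrors the procedure: the coordinates captured on the master are exactly what the slave needs in order to know where to begin reading.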
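The reason non-deterministic function calls do not lend themselves to statement-based replication can be shown with a toy model. The sketch below does not use MySQL at all; the Server class and both replicate functions are hypothetical stand-ins, and each server's random number generator plays the role of a non-deterministic SQL function.

```python
import random


class Server:
    """A toy database server: one table (a list of values) and its own RNG."""

    def __init__(self, seed: int) -> None:
        self.rows = []
        self._rng = random.Random(seed)

    def insert_random(self) -> float:
        """Run the 'statement': insert the result of a non-deterministic call."""
        value = self._rng.random()
        self.rows.append(value)
        return value

    def insert_value(self, value: float) -> None:
        """Apply a replicated row: store the value exactly as given."""
        self.rows.append(value)


def replicate_statement_based(master: Server, slave: Server, count: int) -> None:
    """The slave re-executes each statement, so the function is evaluated again."""
    for _ in range(count):
        master.insert_random()
        slave.insert_random()


def replicate_row_based(master: Server, slave: Server, count: int) -> None:
    """The slave receives the resulting row data, not the statement."""
    for _ in range(count):
        slave.insert_value(master.insert_random())


if __name__ == "__main__":
    # Different seeds model two machines whose function calls disagree.
    m1, s1 = Server(seed=1), Server(seed=2)
    replicate_statement_based(m1, s1, 3)
    print("statement-based copies match:", m1.rows == s1.rows)

    m2, s2 = Server(seed=1), Server(seed=2)
    replicate_row_based(m2, s2, 3)
    print("row-based copies match:", m2.rows == s2.rows)
```

In the statement-based run the two servers each evaluate the function themselves and end up with different data, whereas the row-based run ships the values the master actually stored.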
PostgreSQL Replication Options

Having briefly examined the options available for MySQL, let's now look at the various replication options that exist for users of PostgreSQL.

Built-in Streaming Replication

Up until PostgreSQL 9.0 (released in September 2010), users of PostgreSQL who needed to build architectures that utilized database replication had to rely on community-provided or third-party software solutions. This changed with the release of version 9.0; PostgreSQL users now have replication (named "streaming replication") bundled right into the database server.

Although new with 9.0, PostgreSQL streaming replication is based on a mature and long-used technology in the database called write-ahead logging, or WAL. WAL has been deployed in PostgreSQL since version 7.1 and is used to ensure transactional integrity in the database server. It is also used for backup and point-in-time restore functionality, as well as the warm and hot standby features of PostgreSQL (slave servers that are fed log updates and kept either in an offline or online mode for disaster recovery purposes).

A number of significant enhancements were made in PostgreSQL 9.0 that resulted in extremely fast WAL processing, with the outcome being near real-time replication for master-slave configurations, and also hot standby capabilities for slave servers. The currently supported PostgreSQL replication topologies include:

- Single master to one slave
- Single master to multiple slaves

[Figure: graphical view of how PostgreSQL replication functions]
All object, data (including schema), and security operations executed on the master are written to the WAL and streamed directly to the slave machine for safety (avoiding complete data loss in the event of a catastrophic master failure). WAL also ensures that no transaction is committed on the master until a successful write of the WAL has occurred. The slave then applies the WAL by directly rewriting the raw table data on disk.

Streaming replication uses a row-based replication methodology; hence it is the safest form of replicating data between database servers, as it avoids data mismatches when non-deterministic function calls are made, such as the following:

INSERT INTO table_name (column_name) SELECT function_name();

The primary limitations of PostgreSQL 9.0 replication are topology based. It cannot currently do cascading replication, replicate only certain objects, or filter tables by rows for replication. In other words, no filtering is currently possible with streaming replication, so a complete copy of the master is replicated on the slave.

Configuring Streaming Replication

Setting up PostgreSQL replication is very straightforward. WAL logging is always enabled, and only minimal configuration is needed to utilize replication. The basic process to get replication going is:

1. The master and slave servers are identified.
2. The postgresql.conf file on the master is edited to turn on streaming replication.
3. The pg_hba.conf file on the master is edited in order to let the slave connect.
4. The recovery.conf and postgresql.conf files on the slave are edited to start up replication and hot standby.
5. The master is shut down and the data files are copied to the slave.
6. The slave is started first.
7. The master is started.

Miscellaneous Notes on Streaming Replication

As stated earlier, PostgreSQL 9.0's streaming replication is considered extremely reliable because it is based on the WAL technology that ensures transactional integrity in the database. Users should note, however, that streaming replication (as of 9.0) is asynchronous in nature, which means that the master server does not wait for a transaction to be applied to a slave server before it commits the transaction. While such a mode improves transactional response times on the master, the ramification of asynchronous replication is that data loss between a master and a slave could occur. That said, given the speed of replication as well as the hardware and network architecture, it is also possible that a slave's picture of the database may lag by only a single transaction.

A synchronous replication enhancement has already been developed and should be available in the next version of PostgreSQL (9.1), which is due out in either the summer or fall of 2011. Once available, users will have the option of using either asynchronous or synchronous replication with PostgreSQL.

Other PostgreSQL Replication Options

Besides built-in streaming replication, there are a number of other replication options that PostgreSQL users have at their disposal.

Slony

Slony (http://www.slony.info/) is a community-driven, open source replication solution that was the primary replication option before streaming replication was introduced in PostgreSQL 9.0. Slony's replication is an asynchronous, trigger-based solution that allows a user to define a topology of one master to 1-n slaves. The concept of cascading replication is also supported; this is where the master node replicates data to a slave node, which in turn replicates to other slave nodes.
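The cascading arrangement can be pictured with a small model. The sketch below does not use Slony or any of its configuration language; the node names and the propagation_order function are hypothetical, and the only point is to show how a change fans out from the master through an intermediate slave.

```python
from collections import deque


def propagation_order(topology, master):
    """Return (node, hops_from_master) pairs in the order changes arrive.

    topology maps each node name to the list of nodes it feeds directly.
    """
    hops = {master: 0}
    queue = deque([master])
    order = []
    while queue:
        node = queue.popleft()
        order.append((node, hops[node]))
        for subscriber in topology.get(node, []):
            if subscriber not in hops:      # each node is fed only once
                hops[subscriber] = hops[node] + 1
                queue.append(subscriber)
    return order


if __name__ == "__main__":
    # Hypothetical cluster: the master feeds slave1, which feeds two more slaves.
    cascade = {"master": ["slave1"], "slave1": ["slave2", "slave3"]}
    for node, distance in propagation_order(cascade, "master"):
        print(f"{node}: {distance} hop(s) from the master")
```

Each additional hop is another asynchronous step, so nodes further down the cascade can be expected to trail the master by more than the nodes it feeds directly.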
Why would a user deploy Slony instead of PostgreSQL 9.0 streaming replication? Some of the use cases may include the following:

- A user needs to replicate data from/to versions of PostgreSQL below 9.0.
- A user wants to replicate data between systems that are not of identical architectures.
- A user desires to replicate only part of a database system (e.g. only a few tables) and not an entire database instance.

More information and downloads of Slony can be found at: http://www.slony.info/.

Bucardo

Bucardo supplies an open source multi-master (or master-to-master) replication system for PostgreSQL. Bucardo is an asynchronous, trigger-based technology that allows multiple-master and multiple-slave configurations. The Bucardo solution is written in Perl and provides the ability to perform conflict detection, conflict resolution, and exception handling via custom Perl routines. The Bucardo architecture is composed of a Perl daemon and a master Bucardo database that contains all the information about the databases involved in the replication configuration.

Information and downloads of Bucardo can be found at: http://bucardo.org/.

pgpool-II

The community-provided pgpool-II software sits as middleware between PostgreSQL servers and incoming user requests. There are a number of high availability, scalability, and replication functions that pgpool-II provides, including the following:

- Replication: Using the replication function enables pgpool-II to create and maintain a real-time backup of a database.
- Connection Pooling: This option saves and reuses connections whenever a new connection with the same properties (i.e. username, database, protocol version) is sent to a server, with the goal being to reduce connection overhead and improve the system's overall throughput.
- Load Balancing: If a database is replicated, pgpool-II can reduce the load on each PostgreSQL server by distributing SELECT queries among multiple servers, improving the system's overall throughput.
- Parallel Query: This feature causes data to be divided among multiple PostgreSQL servers so that a query can be executed on all the servers concurrently, with the goal being a divide-and-conquer approach to reducing query execution times.

More about pgpool-II can be found at: http://pgpool.projects.postgresql.org/.

Tungsten by Continuent

Tungsten is a proprietary replication and data management solution for MySQL and PostgreSQL that is offered by Continuent. Tungsten uses replication and distributed management to create cloned databases using redundant data copies.
At present, the Continuent Tungsten product offers more features for MySQL users than PostgreSQL users, but the company says it will be closing the feature gap between the two soon.

More information on Tungsten can be found at: http://www.continuent.com/.

EnterpriseDB xDB Replication Server

xDB Replication Server is available with a subscription to EnterpriseDB's Postgres Plus Standard Server and Postgres Plus Advanced Server. The xDB Replication Server offers a number of features and more flexible options than PostgreSQL's built-in streaming replication provides, including:

- Replication of Oracle data to PostgreSQL
- Replication of Microsoft SQL Server data to PostgreSQL (due out June 2011)
- A distributed multi-publication/subscription architecture
- The flexibility to replicate only the databases or objects desired (e.g. just one or two tables out of many others in a database)
- The ability to define and apply row filters for very granular replication of data
- The ability to synchronize data across geographies
- The option of having both snapshot (on-demand) and continuous modes
- A replication scheduler that allows a user to schedule when and how often replication should occur
- Support for cascading replication
- A replication history viewing utility that allows a user to see all the activity that has occurred between managed servers
- A point-and-click, graphical replication console
- A call level interface for extending the tool
The xDB Replication Server is easy to try; more information and downloads can be found at: http://www.enterprisedb.com/products-services-training/products-overview/postgresplus-solution-pack/xdb-replication-server.

Comparing MySQL and PostgreSQL Replication

Those wanting to use an open source database for an application project that requires replication have two good choices in MySQL and PostgreSQL. But the question naturally arises: which should be used? Is one just as good as the other?

As shown above, there are both feature and functional differences between how MySQL and PostgreSQL implement replication. However, for many general application use cases, either MySQL or PostgreSQL replication will serve just fine; from a functional and performance perspective, it won't matter which solution is chosen.

That said, there are still some considerations to keep in mind in deciding between the different offerings. Some of these include the following:

- Oracle's MySQL offers both statement-based and row-based replication, whereas PostgreSQL uses only the latter, based on write-ahead log (WAL) information. There are pros and cons to using statement-based replication, which MySQL has documented here: http://dev.mysql.com/doc/refman/5.5/en/replication-sbrrbr.html. It is generally acknowledged that row-based or WAL-based replication is the safest and most reliable form of replication. It does, however, result in larger log files for MySQL than the statement-based option does.
- MySQL currently supports more replication topologies than PostgreSQL (e.g. ring). However, PostgreSQL does have a number of community-supported replication offerings that help close this gap (e.g. Bucardo's master-to-master solution).
- In regard to data loss, MySQL 5.5 offers the semi-synchronous option, which helps minimize the risk of master-slave synchronization problems due to a master server going down.
For PostgreSQL, a fully synchronous replication option is in development and scheduled for release in 2011.
- As to replication filtering, MySQL provides filtering on the slave server, whereas with built-in PostgreSQL streaming replication no filtering is available; in other words, the entire database from the master is replicated to the slave. With MySQL, all the information is sent, but options exist to selectively apply the replicated events on the slave. However, because the MySQL binary log is not used for crash recovery in the same way as PostgreSQL's WAL is, a user can configure a MySQL master so that only certain databases are logged and, in that sense, a filter for the master server is available. If PostgreSQL users need to filter what data is replicated between systems or desire to replicate only a subset of a database server, they can use a third-party software option (e.g. EnterpriseDB's xDB Replication Server).
- Both MySQL and PostgreSQL replication are single-threaded at the current time.
- With respect to monitoring replication, MySQL provides a number of SHOW commands to understand the state of replication between a master and slave. To date, PostgreSQL offers functions to compute the differences in log positions between the master and slave servers, but that is all that is currently provided in 9.0.
- For failover and load balancing, the PostgreSQL community provides pgpool-II, which is middleware that provides connection pooling, load balancing, failover, and more between replicated servers. An upcoming version of Oracle's MySQL will support connection pooling (but only in the Enterprise edition); however, failover and load balancing must be handled via a third-party product or custom development.

Conclusions

For many application use cases, either Oracle's MySQL or PostgreSQL replication will be an equally good choice. The best way to determine which is right for a particular project is to download both and put each through a comprehensive evaluation. You can download Oracle's MySQL at http://www.mysql.com/downloads/, while both the community and EnterpriseDB offerings of PostgreSQL can be found at: http://www.enterprisedb.com/products/download.do.

About EnterpriseDB

EnterpriseDB is the enterprise PostgreSQL company, providing products and services worldwide that are based on and support PostgreSQL, the world's most advanced open source database. EnterpriseDB's Postgres Plus products are ideally suited for transaction-intensive applications requiring superior performance, massive scalability, and compatibility with proprietary database products. Postgres Plus products provide an economical open source alternative or complement to proprietary databases without sacrificing features or quality.

EnterpriseDB understands that adopting an open source database is not a trivial task. You have lots of questions needing answers, schedules and budgets to keep, and processes to follow. We have helped thousands of organizations like yours through the steps to investigate, evaluate, prove, develop, and deploy their open source solutions. To make your work easier and faster, we have special self-service sections on our website dedicated to assisting you in each of the steps. EnterpriseDB has many free resources on the website targeted at the various stages of open source adoption. Visit http://www.enterprisedb.com/solutions/stages/overview.do.
- Getting started: access to free downloads, installation guides, demos, starter tutorials, and more to help get familiar with the database.
- Evaluations and pilots: learn how Postgres has helped hundreds of Oracle users cut costs and MySQL users improve operations.
- Development: EnterpriseDB employs more Postgres experts, developers, and community members than any other company, and offers key application development resources.
- Deployment: information on how to scale a Postgres application, add Qualities of Service (QoS) like high availability or security, or get a health check.

If you would like to discuss training, consulting, or enterprise support options, please do not hesitate to contact EnterpriseDB directly. EnterpriseDB has offices in North America, Europe, and Asia. The company was founded in 2004 and is headquartered in Bedford, MA. For more information, please visit http://www.enterprisedb.com.

Sales Inquiries:
sales-us@enterprisedb.com (US)
sales-intl@enterprisedb.com (Intl)
+1-732-331-1315
1-877-377-4352

General Inquiries:
info@enterprisedb.com
info.asiapacific@enterprisedb.com (APAC)
info.emea@enterprisedb.com (EMEA)
+1-732-331-1300
© 2011 EnterpriseDB Corporation. All rights reserved. EnterpriseDB and Postgres Plus are trademarks of EnterpriseDB