Synchronous High Availability and Disaster Recovery Replication

Greg Eckert
Matt Kereczman
Devin Vance

Copyright 2015 LINBIT USA, LLC

Trademark notice
DRBD and LINBIT are trademarks or registered trademarks of LINBIT in Austria, the United States, and other countries. Other names mentioned in this document may be trademarks or registered trademarks of their respective owners.

License information
The text and illustrations in this document are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license ("CC BY-NC-ND"). A summary of CC BY-NC-ND is available at http://creativecommons.org/licenses/by-nc-nd/3.0/. The full license text is available at http://creativecommons.org/licenses/by-nc-nd/3.0/legalcode. In accordance with CC BY-NC-ND, if you distribute this document, you must provide the URL for the original version.

Table of Contents
1. Summary
   1.1. Background
   1.2. High Availability Testing: Sequential Read/Writes
   1.3. High Availability Testing: Random Read/Write Tests
   1.4. Disaster Recovery Testing
2. Conclusion
3. Appendix
   3.1. Notes
   3.2. DRBD Replication Protocols: A, B, and C
   3.3. Primary/Secondary vs. Dual Primary Clusters
4. Testing Specifications
   4.1. Server Specifications
   4.2. Dolphin Interconnects (Network)
   4.3. Software Versions
   4.4. DRBD Configuration
   4.5. IO Commands Used

1. Summary

LINBIT [1] tests replication speeds to determine the overhead of synchronously replicating data across two Intel DC S3700 Series 800GB SATA SSD drives.

[1] http://www.linbit.com/en/
1.1. Background

Intel Corporation designs, manufactures, and sells integrated digital technology platforms worldwide. The company provides NAND flash memory products, which are used in solid-state drives, and also produces SSDs itself. LINBIT is known for developing DRBD [2], the backbone of Linux High Availability software.

[2] http://drbd.linbit.com

LINBIT tests how quickly data can be synchronously replicated from an Intel 800GB SATA SSD in server A to an identical SSD in server B. Disaster Recovery replication to an off-site server is also investigated using the same hardware.

For those unfamiliar with the "shared nothing" High Availability approach to synchronous, block-level data replication: DRBD uses two separate servers so that if one fails, the other takes over. Synchronous replication is completely transaction safe and is used for 100% data protection. DRBD has been part of the mainline Linux kernel since version 2.6.33.

This paper reviews DRBD in an active/passive configuration using synchronous replication (DRBD's Protocol C). Server A is active and server B is passive. When data is written to server A, it must also be moved over the network and written to server B. A message is then sent back to server A confirming that the data has been written in both places before the application is told that the write was successful. Because every write is confirmed in two places, the state of the data is known even during hardware failures.

Due to DRBD's position in the Linux kernel (just above the disk scheduler), DRBD is application agnostic. It can work with any filesystem, database, or application that writes data to disk on Linux.

1.2. High Availability Testing: Sequential Read/Writes

The goal: determine the performance implications of synchronous replication when using high performance Intel SSD drives.

In the initial test, LINBIT used a 10GbE connection between the servers. The Ethernet connection's latency became the bottleneck when replicating data. Fortunately, thanks to a partnership with Dolphin Interconnects, LINBIT had a faster network connection on hand.

"When benchmarking synchronous replication, there are many factors to consider. However, the bandwidth and latency of the replication link and backing disks are typically our limiting factors. In the past, it was common for the replication link to be much faster than our backing disks; with newer SSD storage this is no longer the case," said Matt Kereczman, the LINBIT High Availability engineer who performed the testing.

After mitigating the network latency issue and averaging five separate testing trials, LINBIT achieved the results below:

Table 1. Sequential Testing Results Data (MB/s)

  Test                 Intel Single Drive   DRBD Disconnected   DRBD Protocol C [a]   Performance Difference
  Seq R raw            491                  470                 839                   70.89%
  Seq R ext4           526                  522                 889                   69.01%
  Seq W raw (10 GB)    469                  467                 460                   -1.92%
  Seq W raw (1 GB)     489                  492                 471                   -3.68%
  Seq W ext4 (10 GB)   470                  470                 468                   -0.43%
  Seq W ext4 (1 GB)    504                  495                 494                   -1.98%

  [a] DRBD Protocol C uses synchronous replication.

The advertised Intel drive speeds are: read 500MB/s, write 460MB/s. Full testing specifications are located in the Appendix. As the data shows, installing DRBD introduced virtually no noticeable
write overhead. Mounting an ext4 filesystem on top of DRBD, writing 1GiB of data to server A, transferring that data over the replication network to server B, and then sending a confirmation back to server A that the write is complete incurs only a 1.98% performance hit. Even with DRBD running, the SSDs still perform above the drive's advertised speed. In every write scenario, the high performance Intel SSD drives with DRBD performed near or above advertised speeds. For 100% guaranteed data integrity, 0.5%-2% overhead is a very small price to pay, even in high performance systems.

The data in Table 1, Sequential Testing Results Data, is graphically represented by Figure 1, Sequential Read/Write Results Graph. The utilities and configurations used to generate these results can be found in the Appendix of this document.

Figure 1. Sequential Read/Write Results Graph
(Horizontal blue and orange lines represent the advertised drive read and write speeds of 500MB/s and 460MB/s, respectively.)

1.3. High Availability Testing: Random Read/Write Tests

The goal: mimic production scenarios by using random reads and writes to determine the performance implications of synchronous replication when using high performance Intel SSD drives.

Having established the maximum sequential speeds of DRBD replication with the Intel DC S3700 800GB SSDs, LINBIT dug deeper using random read and write assessments. Random reads and writes simulate how many applications and databases behave in a production environment. The purpose of the random read/write test is to provide a realistic example of what users will experience as they add more load to their systems. Naturally, the disks slow down as the read and write load increases. The goal of this test is to gain insight into whether DRBD can keep up with fast transactions when simulating a typical customer database environment. LINBIT's chosen values are displayed in the Appendix; a representative invocation is sketched below.
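To make the workload concrete, the following is a minimal sketch of an fio invocation for the random read test, built from the parameters listed in Section 4.5. The exact command lines LINBIT used are not reproduced in this document, so the job name, --rw mode, and target device below are illustrative assumptions.

  # Sketch only: parameters taken from Section 4.5; job name, --rw mode,
  # and target device (/dev/drbd100) are illustrative assumptions.
  fio --name=randread --rw=randread \
      --ioengine=libaio --direct=1 --bs=4k --filesize=10g \
      --iodepth=64 --numjobs=16 \
      --filename=/dev/drbd100 --group_reporting

  # The corresponding random write run would use --rw=randwrite.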
Table 2. Random RW Testing Results Data (IOPS)

  Test                 Intel Single Drive [a]   DRBD Disconnected   DRBD Protocol C   Performance Change
  Raw read             53858                    54077               84062             56.1%
  Raw write            25950                    26167               24052             -7.3%
  ext4 read            55828                    56791               91508             63.9%
  ext4 write           25843                    26098               24073             -6.9%
  Mixed [b] raw read   26507                    26722               34049             28.5%
  Mixed raw write      11360                    11455               11447             0.8%
  Mixed ext4 read      26521                    26756               37622             41.9%
  Mixed ext4 write     11367                    11469               11480             1.0%

  [a] Advertised IOPS, random read/random write: 75k/36k
  [b] A 70/30 read/write mix was used to simulate real-world load.

The data demonstrates that in this type of environment, deploying DRBD for local data replication with Intel hardware has minimal impact on overall performance compared to running a single SSD, and can even improve it. Because LINBIT had both very fast SSDs and low-latency Dolphin replication links, DRBD's read-balancing functionality was used to increase the read performance of the DRBD device (the relevant configuration fragment is shown after Figure 2). As the results show, read performance surpasses that of a single Intel SSD by up to 63.9%.

LINBIT achieved 11367 IOPS when writing to the SSD through the ext4 filesystem without DRBD installed; when replicating writes with DRBD, 11480 IOPS. This represents a slight performance gain when using DRBD and synchronously replicating data. The improvement is even larger for reads.

Figure 2. Random Read/Write Results Graph
(Advertised IOPS, random read/random write: 75k/36k)
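The read balancing mentioned above is enabled through a DRBD disk option. The relevant fragment of the test configuration (see Section 4.4) is reproduced here; the comment is an explanatory addition, not part of the original file:

  disk {
      # Distribute read requests across both replicas instead of always
      # serving them from the local disk; "least-pending" sends each read
      # to whichever node currently has the fewest requests in flight.
      read-balancing least-pending;
  }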
Increased performance when using DRBD seems counterintuitive: synchronous replication carries natural overhead, so why are the disks performing faster? DRBD is carefully optimized for performance. This includes flushing kernel-internal request queues where it makes sense from DRBD's point of view, which can lead to certain test patterns executing faster with DRBD than without it. For random read/write workloads, it is safe to say that using these technologies together will enhance service availability with minimal performance implications.

1.4. Disaster Recovery Testing

The goal: determine the performance implications of asynchronous long-distance replication when using high performance Intel SSD drives.

Off-site data replication is the next step for the Intel SSD testing. When replicating across large distances, the terminology changes from High Availability to Disaster Recovery. For this type of scenario, LINBIT uses real-time asynchronous replication with DRBD Proxy. Without DRBD Proxy, writes can only move as fast as the WAN link; in LINBIT's test environment, 1Gb per second. DRBD Proxy buffers writes in memory so that the network connection speed does not limit the local disk speed. As a result, the local speed results were the same as in the High Availability testing above.

After connecting the SSDs via a 1Gb/s WAN connection spanning over 15 miles with ~50ms latency, we maxed out the throughput of the line. When using DRBD Proxy, local performance was not affected by the WAN connection's high latency or comparatively low throughput, and replication to the DR site was as fast as the WAN link could handle.

2. Conclusion

Shared Nothing High Availability and Disaster Recovery replication architectures, with the help of fast SSD storage, can add outstanding resiliency to IT systems with minimal performance implications. LINBIT finds that when synchronously replicating data using DRBD, the achieved sequential write speed is near the advertised speed of a single Intel 800GB SSD. With random read/writes, deploying DRBD also has very little impact on SSD write performance compared to a single drive, and actually increases read performance.

Users can guarantee 100% data protection without sacrificing performance using the DRBD open source replication solution. They simply need two separate systems, the DRBD data replication software, and high performance storage in the form of Intel DC S3700 Series SSDs.

3. Appendix

3.1. Notes

The CPU cost of using DRBD is negligible. In our testing, sustained sequential writes to our DRBD-replicated ext4 volume put our 16-core system under a load of 1.54; the same test against a non-replicated ext4 volume resulted in a load of 1.44.

DRBD has an option to checksum blocks as they are replicated. This puts the CPU under heavy load; it should not be used in production unless the user suspects corruption over the network.

The setting "al-updates no;" causes a full resynchronization after a Primary crashes. This may not be practical for the typical DRBD cluster; however, it is commonly used when performance is the higher priority and the disks are capable of completing a full resync in an acceptable amount of time (under an hour in our test environment).
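For reference, the corresponding fragment of the test configuration (Section 4.4) is shown below; the comments are explanatory additions describing the general trade-off these settings make, not part of the original file:

  disk {
      al-updates   no;   # skip activity-log metadata updates; faster writes,
                         # but a crashed Primary requires a full resync
      md-flushes   no;   # do not force metadata writes to stable storage
      disk-flushes no;   # do not issue flushes for replicated data writes
      disk-barrier no;   # do not use write barriers
  }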
Because DRBD replicates at the block level, it is completely application agnostic. This means it can replicate databases, filesystems, and any other data that is written to disk on a Linux system. Each of these databases, filesystems, and applications is considered a resource. Users frequently replicate multiple resources simultaneously. Local replication maxes out at about 60 resources at a time, and long-distance replication has a set maximum of 32 resources.

3.2. DRBD Replication Protocols: A, B, and C

Protocol A. Asynchronous replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has finished and the replication packet has been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The data on the standby node is consistent after fail-over; however, the most recent updates performed prior to the crash could be lost. Protocol A is most often used in long-distance replication scenarios. When used in combination with DRBD Proxy it makes an effective disaster recovery solution. LINBIT used DRBD in Protocol A when replicating data over the WAN due to bandwidth and latency constraints. Plenty of database and filesystem replication technologies can also replicate asynchronously, making these results less significant than the DRBD Protocol C tests. The significance of LINBIT's Protocol A trial lies in the fact that it replicated arbitrary data over long distances, not just a specific database or filesystem.

Protocol B. Memory-synchronous (semi-synchronous) replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has occurred and the replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over. However, in the event of simultaneous power failure on both nodes and concurrent, irreversible destruction of the primary's data store, the most recent writes completed on the primary may be lost. Protocol B is mostly intended for completeness: if you need to know your data is in two places at once, use Protocol C; if you do not require that guarantee, or you are replicating over long distances, use Protocol A.

Protocol C. Synchronous replication protocol. Local write operations on the primary node are considered completed only after both the local and the remote disk writes have been confirmed. As a result, the loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course, inevitable even with this replication protocol if both nodes (or their storage subsystems) are irreversibly destroyed at the same time.

3.3. Primary/Secondary vs. Dual Primary Clusters

DRBD can be run in three different modes:

Primary/Secondary (active/passive) mode is what was used for the replication tests in this document. Only the primary server can write data; the replica is inaccessible. If the primary server fails during a transaction, the secondary server takes over so that services continue. Oftentimes this fail-over transition is transparent to end users, or may be perceived as a brief blip in the network. In more complex configurations with large amounts of data, the transition could take up to a few minutes.

Active/Active cluster mode runs some services/resources on one node and the rest on another node. This is, essentially, a way to ensure that both servers are being used, instead of one sitting idle. For instance, your e-mail server could be active on node A, and your database server could be active on node B.
If either node fails, the active services can be transitioned to the other. We did not use this type of cluster in our testing, as we did not have multiple applications running at the same time and our focus was simply to measure the speed and overhead of a single resource.

Dual Primary clusters write data to one filesystem that is mounted on both servers simultaneously. These clusters have the advantage of nearly zero fail-over time (depending on fencing speed, server-down detection time, and so on) and a certain amount of load sharing between servers. However, this mode also introduces an additional risk of data divergence in the event of network issues (so-called split-brains) as well as additional locking overhead. Given the additional overhead of a clustered filesystem, which would be necessary in
order to run DRBD in dual-primary mode, LINBIT did not use this method for testing. If we were testing the operating costs of a clustered filesystem such as GFS2 or OCFS2, this method would be appropriate.

4. Testing Specifications

4.1. Server Specifications

CPU: Intel Xeon CPU E7520 @ 1.87GHz (16 cores)
RAM: 32GiB

4.2. Dolphin Interconnects (Network)

Throughput: 6.07Gib/s (777MiB/s)
Latency: 95.42 microseconds (usec) with a packet size of 65536 bytes
Super jumbo frames: packet size of 65536 bytes

4.3. Software Versions

OS: CentOS 6 x86_64
DRBD: 8.4.5

4.4. DRBD Configuration

resource intel {
    disk       /dev/sdb1;
    device     /dev/drbd100;
    meta-disk  internal;

    disk {
        md-flushes      no;
        read-balancing  least-pending;
        al-updates      no;
        disk-barrier    no;
        disk-flushes    no;
        c-plan-ahead    10;
        resync-rate     400M;
        c-max-rate      600M;
        c-min-rate      10M;
        c-fill-target   44K;
        al-extents      6481;
    }

    net {
        protocol        C;
        max-buffers     80k;
        max-epoch-size  20000;
        sndbuf-size     512k;
        verify-alg      sha1;
    }

    on thor.us.linbit {
        address ssocks 10.6.0.1:7805;
    }

    on odin.us.linbit {
        address ssocks 10.6.0.2:7805;
    }
}
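As a usage note, a typical bring-up sequence for a resource defined like the one above might look as follows. This is a minimal sketch assuming the file is installed as /etc/drbd.d/intel.res on both nodes; it is not a record of the exact commands LINBIT ran:

  # on both nodes: create metadata and bring the resource up
  drbdadm create-md intel
  drbdadm up intel

  # on one node only: start the initial sync and promote to Primary
  drbdadm primary --force intel

  # the replicated device is then available as /dev/drbd100
  mkfs.ext4 /dev/drbd100
  mount /dev/drbd100 /mnt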
4.5. IO Commands Used

For sequential write testing, dd was used with 128MiB blocks.

fio testing parameters for random read/writes:

    ioengine=libaio
    direct=1
    bs=4k
    filesize=10g
    iodepth=64
    numjobs=16

fio testing parameters for mixed read/writes:

    rwmixread=70
    ioengine=libaio
    direct=1
    bs=4k
    iodepth=64
    filesize=10g
    numjobs=16
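For illustration, the following sketches show how these parameters could translate into concrete commands. The exact invocations are not reproduced in this document, so the write size, direct-I/O flags, target device, and job names are assumptions:

  # Sequential write sketch: 128MiB blocks; count=8 or count=80 would
  # correspond roughly to the 1GB and 10GB cases in Table 1; oflag=direct
  # is an assumption to bypass the page cache.
  dd if=/dev/zero of=/dev/drbd100 bs=128M count=8 oflag=direct

  # Mixed 70/30 random read/write sketch built from the parameters above;
  # job name and target device are illustrative.
  fio --name=mixed --rw=randrw --rwmixread=70 \
      --ioengine=libaio --direct=1 --bs=4k --filesize=10g \
      --iodepth=64 --numjobs=16 --filename=/dev/drbd100 --group_reporting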