EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring
Celerra Unified Storage Platforms Using iSCSI

Applied Technology

Abstract

Microsoft SQL Server includes a powerful capability to protect active databases by using either synchronous or asynchronous mirroring to automatically fail over to the mirror site in the event of a failure on the primary site. This white paper examines the performance implications of enabling this feature and how those implications change in the presence of a wide-area network connection between the sites.

June 2009

Copyright 2009 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part Number h6393

Table of Contents

Executive summary
Introduction
    Audience
    Terminology
Overview of SQL database mirroring
Test environment
    Storage architecture
    Hardware resources
    Software resources
Test methodology
Test results
    Database mirroring on a LAN
    Database mirroring on a WAN
SATA drives at the mirror server
Conclusion
References

Executive summary

IT planners and database administrators must plan for and maintain high availability of databases through disasters and both planned and unplanned outages. Planned outages include system upgrades and patch installations. Unplanned outages include hardware failures, power failures, link failures, natural calamities, and terrorism. The challenge is to keep the database available during such events with little or no data loss and minimal interruption to the production workload, while keeping the associated costs to a minimum. To keep the database available in all these cases, a secondary copy must be maintained either at a geographically distant location or at the same site, depending on the criticality of the database.

EMC Celerra provides an array-based replication technology called Celerra Replicator for iSCSI, which can be used to protect SQL databases and provide a secondary copy of the database for disaster recovery (DR). The EMC Solutions for Microsoft SQL Server EMC Celerra Unified Storage Platforms Reference Architecture, available on Powerlink, provides more details on the array-based technologies of EMC Celerra.

Apart from these array-based technologies, Microsoft SQL Server provides a native high-availability feature called database mirroring. In this method, log records are transferred to a remote database server and applied to the remote database, which keeps the remote database up to date with recent changes on the primary database. SQL database mirroring has two operating modes, synchronous and asynchronous, each with its own advantages and disadvantages. Database mirroring performance depends on the operating mode and the network characteristics.
This white paper explains the following recommendations, which are based on different test scenarios:

- The two operating modes for database mirroring, synchronous and asynchronous, provide different tradeoffs that must be evaluated for each environment individually.
- Asynchronous mirroring provides protection with minimal impact on the performance of the primary server, but with some possibility of data loss.
- Synchronous mirroring provides protection with no data loss, but with greater impact on the performance of the primary server.
- The cost of each method, whether potential data loss or reduced performance, is directly related to the quality of the network connection between the sites. This relationship holds regardless of the storage platform used.

Introduction

This white paper helps you select a suitable database mirroring operating mode based on your recovery time objective (RTO) and recovery point objective (RPO) requirements. It also shows how database mirroring performance varies with network properties, and it explores the possibility of reducing the total cost of ownership of a database mirroring solution by using low-cost storage at the secondary site.

Audience

This white paper is intended for EMC personnel, EMC partners, and customers.

Terminology

Mirror: The mirror is the copy of the principal database. The server that hosts the mirror database is known as the mirror server. The mirror is not accessible to applications and is always in a restoring state.

Principal: In database mirroring, there are two copies of a single database, but only one copy is accessible to clients at any given time. The copy that applications connect to is called the principal database. The server that hosts the principal database is known as the principal server.

Recovery point objective (RPO): RPO is the point in time to which systems and data must be recovered after an outage. This defines the amount of data loss a business can endure.

Recovery time objective (RTO): RTO is the period of time within which systems, applications, or functions must be recovered after an outage. This defines the amount of downtime that a business can endure and survive.

Witness: When database mirroring is used in synchronous mode, the witness provides a mechanism for automatic failover. The witness server does not manage any transactions, but simply serves as a tie-breaker vote when the principal and mirror are trying to make automatic state changes.

Overview of SQL database mirroring

In database mirroring, two copies of a single database are maintained on two partner servers. At any time, only one copy is available to clients and the other copy acts as a standby. The database that is serving clients is known as the principal database, and the server holding it is known as the principal server. The standby copy is called the mirror database, and the server that holds it is called the mirror server.

The principal server transfers a stream of database log records, which are applied on the mirror database. Every modification made on the principal database is applied on the mirror database. This includes not only data modifications but also any modifications to the physical or logical structure of the database.

There are two operating modes in database mirroring. Microsoft calls these high-safety and high-performance modes.
High-safety mode is synchronous mirroring, while high-performance mode is asynchronous mirroring. The operating mode is set through the transaction safety setting.

If transaction safety is set to FULL, the database is in high-safety (synchronous mirroring) mode. In this mode, the principal server waits for an acknowledgement from the mirror server before committing a transaction, which increases transaction latency.

If transaction safety is set to OFF, the database is in high-performance (asynchronous mirroring) mode. In this mode, the principal server sends a confirmation to the client immediately after sending the log record to the mirror server, without waiting for an acknowledgement. The mirror server tries to keep up with the log records sent by the principal, but it might lag behind. The gap between the principal and the mirror depends on the workload on the principal and on the throughput and latency of the network between the two servers. In asynchronous mirroring, transaction latency is typically small because no wait time is involved, but there is a risk of some data loss.

If log records cannot be sent from the principal to the mirror at the rate at which they are generated, a queue builds up at the principal. This is known as the send queue. The send queue does not use extra storage or memory. It exists entirely in the transaction log of the principal and refers to the part of the log that has not yet been sent to the mirror.

If log records cannot be applied on the mirror at the rate at which they are received, a queue builds up at the mirror. This is known as the redo queue. Like the send queue, the redo queue does not use extra storage or memory. It exists entirely in the transaction log of the mirror.
It refers to the part of the hardened log that remains to be applied to the mirror database to roll it forward.

When the principal database or the server hosting it fails, database mirroring provides a mechanism to fail over to the mirror database. An administrator can log in to the mirror server and bring it online as the principal. When this is completed, the database is once again available for user transactions.

However, some applications require a much faster response to failure than is possible when relying solely on an administrator to complete the process. For such scenarios, SQL Server database mirroring can also fail over automatically. Automatic failover is an extension of the high-safety mode called high-availability mode. It is not available in the high-performance (asynchronous mirroring) mode. It requires the addition of a third database server role called a witness. The witness server does not manage any transactions, but simply serves as a tie-breaker vote when the principal and mirror are trying to make automatic state changes. When the principal and mirror lose contact with each other, both try to contact the witness. If the witness can still communicate with the principal, nothing changes. However, if the witness cannot contact the principal, the mirror server becomes the principal and comes online automatically. The witness is very similar to the quorum resource commonly used in failover clustering implementations: it prevents split-brain, a condition in which both sides of a cluster relationship are online at the same time.

Test environment

This section describes the test environment. Figure 1 shows the reference architecture diagram of a SQL database mirroring setup that uses EMC Celerra iSCSI storage.

Figure 1 SQL database mirroring reference architecture

Storage architecture

One EMC Celerra NS40 is connected to each database server. Table 1 provides the details of the Celerra configuration.

Table 1 Celerra configuration

Item                          Details
Hardware                      EMC Celerra NS40
DART version                  5.6.39-5
Disks                         Fifteen FC disks in one shelf, each disk with 300 GB and 15k rpm
                              Fifteen additional SATA disks at the mirror, each with 1 TB and 7200 rpm (used for the SATA test instead of FC disks at the mirror)
RAID config and disk layout   RAID 1 (1+1) config
                              Ten disks for DB and four disks for log

Figure 2 shows the disk layout used for this testing.

Figure 2 Disk layout

Ten disks are used to store the DB files and four disks are used to store the log files. The disks are configured by using RAID 1. The storage for the database servers is provisioned by using iSCSI. The Microsoft iSCSI initiator is used to connect to LUNs on the Celerra target.

Hardware resources

Table 2 lists the configuration of the servers used in the test environment.

Table 2 Server configuration

Hardware                         Quantity   Configuration
4U servers                       Two        Four 2.8 GHz AMD Opteron dual-core processors
                                            16 GB RAM
                                            Six 1 Gb/s Ethernet NICs
                                            Two 146 GB, 15k rpm internal SCSI drives
1U server (for WANem machine)    One        Two 3.0 GHz Intel Xeon processors
                                            2 GB RAM
                                            Two on-board 1 Gb/s Ethernet NICs
                                            40 GB hard disk

Software resources

Table 3 lists the software resources used in the test environment.

Table 3 Software resources

Software            Description
Operating system    Microsoft Windows Server 2008 Enterprise Edition (64-bit)
SQL Server          Microsoft SQL Server 2008 Enterprise Edition (64-bit)
Network emulator    WANem for simulating network latency

Test methodology

EMC performed a series of tests to characterize the performance of SQL Server database mirroring. Database mirroring performance is a function of the transaction safety level and of environmental characteristics such as network latency and throughput. Both transaction safety levels (FULL for synchronous mode and OFF for asynchronous mode) were tested. The tests were performed in LAN and WAN environments with different network latencies. A T3 WAN link was simulated between the two SQL servers, and latencies of 10 ms, 30 ms, 50 ms, 70 ms, and 100 ms were tested.

The workload was a simulation of an Online Transaction Processing (OLTP) environment suitable for many types of database applications. Because the workload was simulated, the user counts and TPS metrics reported only represent the system response to this workload. In the remainder of this paper, a user means an OLTP user similar to that defined in common industry benchmarks. Table 4 shows the distribution of transactions in the workload tested.

Table 4 Distribution of workload transactions

Transaction type            Percentage (%)
Stock level transaction     4.0
Delivery transaction        4.0
Order status transaction    4.0
Payment transaction         43.0
New order transaction       45.0

A configuration is considered to be scaling linearly if the transactions per second (TPS) obtained increase linearly with the user load and the average response time is less than 2 seconds. The configuration saturates at the point where TPS declines and the average response time crosses 2 seconds. The saturation user count (or the maximum supported users) is the point just before the 2-second gating metric is exceeded.

The results shown are comparative numbers and not absolute values.
For example, if the TPS obtained for a configuration in the LAN environment is 50 and the value drops to 40 after including a WAN delay of 10 ms, then the TPS value for the 10 ms scenario is shown in the graph as 80 percent of the LAN scenario TPS.

Note: In figures containing a series for both TPS and average response time, higher values are preferred for TPS, while lower values are preferred for average response time.

Test results

This section details the results of the various tests outlined in the previous section.

Database mirroring on a LAN

Figure 3 compares the saturation user counts for three scenarios tested in a LAN environment. The first scenario is the baseline for comparison: a standalone server without mirroring. In the other two scenarios, SQL Server is configured with synchronous (safety FULL) and asynchronous (safety OFF) database mirroring modes. Both principal and mirror servers are connected in a LAN environment. The network delay in this case is less than 1 ms.
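The saturation rule and the relative reporting described under Test methodology can be sketched as follows. This is a minimal illustration, not the tooling used for the tests: the sample format, the function names, and the sample numbers are hypothetical, and only the 2-second gating metric and the 50-to-40 TPS example come from the text.

```python
from typing import List, Optional, Tuple

# Hypothetical sample format: (user_count, tps, avg_response_time_seconds).
Sample = Tuple[int, float, float]

# Gating metric stated in the test methodology.
RESPONSE_TIME_LIMIT_S = 2.0


def saturation_user_count(samples: List[Sample]) -> Optional[int]:
    """Return the highest user count reached before the average
    response time crosses the 2-second limit, or None if even the
    smallest load already crosses it."""
    supported = None
    for users, _tps, response_s in sorted(samples):
        if response_s >= RESPONSE_TIME_LIMIT_S:
            break
        supported = users
    return supported


def relative_percent(value: float, baseline: float) -> float:
    """Express a measured value as a percentage of its baseline."""
    return 100.0 * value / baseline


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    run = [(100, 10.0, 0.5), (200, 19.0, 1.1), (300, 24.0, 1.9), (400, 22.0, 2.6)]
    print("saturation users:", saturation_user_count(run))
    print("relative TPS (%):", relative_percent(40.0, 50.0))
```

The graphs in this paper report values in the form returned by `relative_percent`, which is why they should be read as comparisons against a baseline rather than as absolute throughput.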

Figure 3 Database mirroring performance comparison with the baseline in a LAN

Both the synchronous and asynchronous mirroring scenarios saturated at 73 percent of the baseline saturation users. This indicates a 27 percent performance decline in terms of maximum supported users when using either database mirroring mode.

Both mirroring scenarios saturated at the same user count in the LAN environment. This appears to contradict the expectation that asynchronous mode performs better than synchronous mode. However, even though the saturation levels are the same, asynchronous mode clearly outperforms synchronous mode in terms of TPS and average response time.

Figure 4 shows TPS and the average response time for synchronous and asynchronous modes.

Figure 4 Comparison of TPS and average response time with the baseline

The saturation user count is the same for both modes in a LAN environment, but the difference becomes obvious when the principal and mirror servers are connected through a WAN (explained in the following sections). Figure 4 shows a significant difference in the average response time and TPS values between the synchronous and asynchronous modes. In synchronous mode, the average response time increased to more than 300 percent of the baseline value and TPS dropped to 95 percent of the baseline value. In asynchronous mode, these values stayed almost equal to their baseline values.

Synchronous mirroring is suitable in environments where data loss is unacceptable and the associated performance decline is acceptable. Asynchronous mirroring is suitable when some data loss can be tolerated in exchange for better system performance. The magnitude of these tradeoffs is described in the following sections.

Database mirroring on a WAN

Database mirroring over a WAN is a commonly used scenario. Normally the mirror server is located at a distance from the principal server. This serves as a disaster recovery (DR) solution: if there is a problem at the primary site, database operations can continue from the mirror site.

This section presents the test results of both database mirroring methods in a simulated WAN environment. A T3 WAN link was simulated between the principal and the mirror sites. The bandwidth remained constant and the latencies were changed to study the performance implications.

Figure 5 shows the saturation user counts for both mirroring methods with varying network latency.

Figure 5 Synchronous and asynchronous mode performance comparison on a WAN

Note: The OLTP users supported in the synchronous and asynchronous modes during the LAN test are taken as the baseline for calculating the values shown in the graph.

As the network latency increases, the database configured with synchronous mirroring saturates more quickly than the database configured with asynchronous mirroring. In synchronous mode, the maximum number of OLTP users supported decreases gradually as the latency increases. When the WAN latency is 100 ms, synchronous mode supports only 25 percent of the users that it supported in the LAN environment. In asynchronous mode, the saturation user count remains constant irrespective of the WAN delay.

A constant OLTP user load, equal to 70 percent of the baseline (LAN scenario) saturation, was run on both scenarios to study the performance implications of varying network latencies. The following section illustrates how each mode performed with increasing network latency. Figure 6 shows the impact of network latency on the average response time and TPS values, relative to the values at 100 ms latency.

Figure 6 Synchronous mirroring: TPS and average response time with increasing network latency

Note: The TPS and average response time percentage values are calculated based on the values obtained in the 100 ms WAN latency test.

Figure 6 shows that with synchronous mirroring, the average response time rises to very high values as the network latency increases. Similarly, TPS drops to low values as the network latency increases.

Figure 7 shows the average delay per transaction as the network latency increases for synchronous mirroring.

Figure 7 Synchronous mirroring: Average delay per transaction with increasing network latency

In synchronous mirroring, the transactional delay increases in line with the network latency. This means that the principal server must wait longer to receive the acknowledgement from the mirror server before it can complete a transaction. Hence, the average response time increases and TPS decreases as the network latency increases. The transaction delay in asynchronous mirroring is always zero.
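The relationship described above can be illustrated with a deliberately simplified model. It is a sketch under stated assumptions, not a description of SQL Server internals: it assumes each synchronous commit waits for one network round trip plus an optional mirror log write, and it models the user population as a closed loop in which throughput equals users divided by response time plus think time. All function names, parameters, and numbers are hypothetical.

```python
def commit_delay_ms(mode: str, round_trip_ms: float,
                    mirror_log_write_ms: float = 0.0) -> float:
    """Extra wait added to each commit by mirroring (simplified model)."""
    if mode == "async":
        # Safety OFF: the principal does not wait for the mirror.
        return 0.0
    if mode == "sync":
        # Safety FULL: the principal waits for the mirror's acknowledgement.
        # Assumption: one round trip plus the mirror's log write time.
        return round_trip_ms + mirror_log_write_ms
    raise ValueError(f"unknown mode: {mode!r}")


def closed_loop_tps(users: int, base_response_ms: float,
                    delay_ms: float, think_time_ms: float) -> float:
    """Throughput of a closed system: users / (response time + think time)."""
    cycle_s = (base_response_ms + delay_ms + think_time_ms) / 1000.0
    return users / cycle_s


if __name__ == "__main__":
    # Latencies match those tested; all other numbers are illustrative.
    for latency_ms in (10, 30, 50, 70, 100):
        delay = commit_delay_ms("sync", round_trip_ms=latency_ms)
        tps = closed_loop_tps(users=100, base_response_ms=50.0,
                              delay_ms=delay, think_time_ms=950.0)
        print(f"{latency_ms:>3} ms latency: delay={delay:.0f} ms, modeled TPS={tps:.1f}")
```

In this model the asynchronous delay is zero at every latency, which matches the behavior reported in the tests, while the synchronous delay grows with the network latency and lowers the modeled throughput.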

Figure 8 shows TPS and the average response time as the network latency increases for asynchronous mirroring.

Figure 8 Asynchronous mirroring: TPS and average response time with increasing WAN latency

Figure 8 shows that asynchronous mirroring does not show any performance degradation as the network latency increases. The average response time and TPS values are almost constant for all the WAN latencies (note that the average response time is in milliseconds and the deviation from maximum to minimum is less than 3 ms). This is because no wait time is involved in asynchronous mirroring. The transactional delay is zero in all the tested latency scenarios.

Although asynchronous mirroring is better from a performance standpoint, it poses a risk of data loss in the event of a principal server failure. When log records are sent from the principal database to the mirror database, a send queue builds up at the principal if the log records cannot be sent at the rate at which they are generated. The send queue is the total number of bytes of log that have not yet been sent to the mirror server. The size of the unsent log is an indicator of the possible data loss in the event of a principal server failure. If there is no log send queue at the principal server, all the data on the principal server has been transmitted to and received by the mirror server. If a log send queue exists, the mirror server is lagging behind the principal server by the amount of log held in the send queue.
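The way a send queue builds up can be sketched with simple bookkeeping: in each interval the queue grows by the log generated and shrinks by the log that the link can carry, and it never drops below zero. This is an illustrative model only. The function name, the one-second intervals, and the rates are hypothetical and are not taken from the tests.

```python
from typing import List


def send_queue_kb(generated_kb_per_s: List[int],
                  sendable_kb_per_s: List[int],
                  start_kb: int = 0) -> List[int]:
    """Track the unsent log (KB) at the principal, one value per second.

    generated_kb_per_s: log produced by the workload in each second.
    sendable_kb_per_s:  log the network can deliver in each second.
    """
    queue = start_kb
    history = []
    for generated, sendable in zip(generated_kb_per_s, sendable_kb_per_s):
        # The queue cannot be negative: the link cannot send log
        # that has not been generated yet.
        queue = max(0, queue + generated - sendable)
        history.append(queue)
    return history


if __name__ == "__main__":
    # Illustrative rates: the workload outpaces the link, then slows down.
    print(send_queue_kb([500, 500, 500, 300, 300], [400, 400, 400, 400, 400]))
```

The last value in the returned list corresponds to the exposure described above: it is the amount of log that would not yet have reached the mirror if the principal failed at that moment.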

Figure 9 shows the log send queue at the principal server in asynchronous mirroring.

Figure 9 Asynchronous mirroring: Log send queue

As the network latency increases, the send queue increases. In asynchronous mode, the unsent log accumulates when the mirror server falls behind the principal server (and also when mirroring is paused or suspended). Therefore, it can be concluded that in case of principal server failure:

- There is a minimal risk of data loss in synchronous mirroring.
- There is a risk of data loss in asynchronous mirroring, and the degree of risk increases as the network latency increases.

The log send queue for synchronous mirroring is almost zero in all the tested scenarios. This is expected behavior. In synchronous mirroring, after synchronization, unsent log accumulates only when mirroring is paused or suspended.

After receiving the log records from the principal, the mirror server writes them to the log disk and redoes the log on the mirror to roll the mirror database forward. If the principal server sends log records faster than the mirror can roll forward, a redo queue builds up at the mirror. This redo queue is an indicator of the time required to fail over to the mirror server in case of failure. The actual time to roll forward all the records in the redo queue depends on the system hardware and the current workload. The longer the redo queue, the longer it takes to fail over to the mirror server.
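As a rough illustration of why a longer redo queue means a longer failover, the roll-forward time can be approximated by dividing the redo queue size by the rate at which the mirror applies log. This is a back-of-the-envelope sketch, assuming a constant redo rate. The function name and the numbers are hypothetical, and as noted above the actual time depends on the system hardware and the current workload.

```python
def estimated_redo_seconds(redo_queue_kb: float, redo_rate_kb_per_s: float) -> float:
    """Approximate time to roll the mirror forward before it can come online.

    Assumes the mirror applies log at a constant rate, which is a
    simplification of real behavior.
    """
    if redo_queue_kb < 0:
        raise ValueError("redo queue size cannot be negative")
    if redo_rate_kb_per_s <= 0:
        raise ValueError("redo rate must be positive")
    return redo_queue_kb / redo_rate_kb_per_s


if __name__ == "__main__":
    # Illustrative values only: the same queue on a slower mirror
    # takes proportionally longer to roll forward.
    for rate in (400.0, 200.0):
        seconds = estimated_redo_seconds(6000.0, rate)
        print(f"redo rate {rate:.0f} KB/s -> about {seconds:.0f} s to roll forward")
```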

Figure 10 shows the log redo queue on the mirror server for both synchronous and asynchronous mirroring.

Figure 10 Redo queue (in both mirroring modes) as network latency increases

Figure 10 shows that in synchronous mode the redo queue is very small from a 30 ms WAN delay onward. This is because TPS is reduced from 30 ms onward, and hence fewer log records are sent to the mirror server. In asynchronous mode, the redo queue increases as the network latency increases. This indicates that as the network latency increases, asynchronous mode requires more time to fail over to the mirror. Synchronous mirroring therefore achieves a better RTO than asynchronous mirroring.

SATA drives at the mirror server

EMC performed testing to explore the possibility of replacing the FC drives with low-cost SATA drives at the mirror server, while the principal server still uses FC disks. The aim was to determine whether this approach can be used for lightly loaded databases where the log generation rate of the principal server is comparatively low but the database is critical enough to be mirrored to a remote location. If this proves to be a practical use case, the costly FC drives can be replaced with low-cost SATA drives, reducing the total cost of ownership (TCO) of the solution.

The test configuration is unchanged except that SATA drives replace the FC drives at the mirror server. The workload used in this test is exactly the same as that used in the scenario where FC drives are connected at both the principal and mirror sites.

Figure 11 shows the comparison of saturation user counts for the synchronous mirroring mode for the FC and SATA disk scenarios.

Figure 11 Synchronous mirroring saturation user count: FC disks and SATA disks at the mirror server

The saturation user counts are identical for both scenarios at all the WAN latencies tested.

Figure 12 shows the saturation user count for the asynchronous mirroring mode for both FC and SATA disk scenarios at different WAN latencies.

Figure 12 Asynchronous mirroring mode saturation user count: FC disks compared with SATA disks at the mirror server

Figure 12 shows that the results were the same for both scenarios at all the latencies tested.

Figure 13 shows TPS and average response time values in the synchronous mirroring mode for both FC and SATA disk scenarios with a constant user load at different WAN latencies.

Figure 13 Synchronous mirroring: TPS and average response time comparison for FC and SATA disks at the mirror site

Figure 13 shows that the TPS and average response time values for both scenarios are almost equal.

Figure 14 shows TPS and average response time comparisons in the asynchronous mirroring mode for both FC and SATA disk scenarios with a constant user load at different WAN latencies.

Figure 14 Asynchronous mirroring: TPS and average response time comparison for FC and SATA disks at the mirror site

The average response time values are higher for the SATA disk scenario than for the FC disk scenario. At a 100 ms WAN delay, the average response time for the SATA disk scenario is 9.5 times higher than for the FC disk scenario. However, there is no meaningful difference in TPS: the values are almost equal throughout the tested range.

Figure 15 shows the log send queue in the asynchronous mirroring mode for both FC and SATA drive scenarios for a constant user load at different WAN latencies.

Figure 15 Asynchronous mirroring: Log send queue comparison for FC and SATA drives at the mirror site

The log send queue values for both scenarios are comparable and do not differ greatly.

Figure 16 shows the redo queue for both scenarios.

Figure 16 Asynchronous mirroring: Redo queue comparison for FC and SATA disks at the mirror site

The redo queue for the SATA disk scenario is always greater than for the FC disk scenario. This means that in case of a disaster, the SATA disk scenario will take longer to fail over than the FC disk scenario.

The general assumption is that the SATA disk scenario will perform worse for the following reasons:

- Proven performance superiority of FC disks
- The mirror server data disks handle higher write rates compared to the principal server

However, the preceding results (Figures 11 through 14) show that SATA disks at the mirror site perform almost as well as FC disks in terms of saturation user counts and TPS values. This is probably because the SATA drives are not fully busy at the tested user load, as shown in Figure 17.

Figure 17 SATA disk idle time at the mirror site during asynchronous mirroring

The following points should be taken into consideration when SATA drives are used at the mirror site:

- RTO will be poorer because SATA drives take more time to roll forward the database in case of a disaster.
- In case of a disaster, the production database will be running on SATA drives, so performance levels will be poor compared to FC disks.
- The results shown in Figure 16 were obtained in an EMC test lab environment. Because no two environments are the same, results may vary in other environments. It is advised to test the performance before deciding to use SATA drives in any production environment.

Conclusion

The following conclusions can be drawn from the tests:

- The addition of database mirroring to your environment may incur a performance penalty. In the OLTP environment presented in this paper, the penalty was around 27 percent.
- As the latency of the network connection between the principal and mirror sites increases, there are clear impacts on the mirror relationship. In synchronous mirroring, the performance of the primary database declines, while in asynchronous mirroring the mirror server begins to lag behind the primary. The exact magnitude of these effects depends on the workload being serviced and on the network connection between the two sites.
- There is a risk of data loss in case of principal server failure in asynchronous mirroring mode. The degree of data loss increases as the network latency increases.

- For some workloads, the mirror server may be able to use SATA drives while the primary server uses FC. Such an approach may help reduce the cost associated with the solution. However, thorough testing must be done to ensure that this approach can meet the organization's service levels during primary operation and in the event of a failover.

References

The following documents are available on Powerlink:

- EMC Solutions for Microsoft SQL Server EMC Celerra Unified Storage Platforms Applied Best Practices white paper
- EMC Solutions for Microsoft SQL Server 2005 on VMware ESX Server EMC CLARiiON CX3 FCP Applied Best Practices white paper
- EMC Solutions for Microsoft SQL Server EMC Celerra Unified Storage Platforms Reference Architecture
- EMC Solutions for Microsoft SQL Server 2008 for Tiered Storage Enabled by EMC CLARiiON CX4 Series on iSCSI and Windows 2008 Reference Architecture
- EMC Solutions for Tiered Storage for Microsoft SQL Server 2008 Enabled by EMC CLARiiON CX4 Series iSCSI, Windows 2008, and VMware ESX Server Reference Architecture
- EMC Solutions for Microsoft SQL Server 2005 on Windows 2008 EMC CLARiiON CX3 Series FCP Reference Architecture