Business Continuity Solutions on Dell PowerEdge Servers
A White Paper Series for Microsoft SQL Server

Abstract

This white paper is one in a series of Business Continuity papers that describe the options from Dell and Dell's premier partners, such as Microsoft and EMC, for achieving maximum data availability, protection, and integrity for Microsoft SQL Server 2005 databases.

April 2008

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.
Microsoft and SQL Server are registered trademarks of Microsoft Corporation.
Table of Contents

INTRODUCTION
  Purpose and Scope
SQL SERVER 2005 BUSINESS CONTINUITY OPTIONS
  Log shipping
  Replication
  Clustered SQL Server
  Native Backup and Restore
  Database Mirroring
CONCLUSION
GLOSSARY
REFERENCES
Introduction

Most of today's business applications are data-centric, requiring fast and reliable access to intelligent information architectures that can often be provided by a high-performance relational database system. Microsoft SQL Server is one of the relational database systems that provide such a back-end data store for mission-critical, line-of-business applications. Microsoft SQL Server 2005 offers significant architectural enhancements in performance, scalability, availability, and security.

Database outages are disruptive and expensive when they adversely affect customers, employees, partners, and other stakeholders. DELL, along with Microsoft, has extended its reach to increase productivity and keep information close at hand, while remaining flexible enough to meet your organization's administrative model.

This technical report gives an overview of how to achieve Business Continuity for Microsoft SQL Server 2005. In this series of white papers we explore the options and techniques available at the database, hardware, and storage layers, as well as Dell partner software solutions. In addition, the Dell Engineering team will provide a deep dive into each of these technologies, covering use cases, best practices, manageability, and data protection, and how they fit into the Business Continuity model.

Purpose and Scope

The purpose of this technical report is to summarize the options and techniques available at the database layer. We look at each of the features Microsoft has built into the database engine and how they fit into a business continuity strategy. The target audience is database and systems administrators, decision makers, and architects who are beginning to implement a business continuity strategy.

SQL Server 2005 Business Continuity Options:

Configuring systems to prepare for disaster recovery and to support business continuity is one of the primary considerations when architecting a database solution.
With each subsequent release of SQL Server, Microsoft develops new techniques for running business-critical applications, uninterrupted, on the SQL Server platform. Early approaches to business continuity and disaster recovery (DR) were typically clumsy and required an expert to architect, configure, and continually monitor the setup to ensure the DR solution was functioning properly. Prior to SQL Server 2005, architects had four choices for configuring a disaster recovery solution:

- Log Shipping
- Replication
- Microsoft Clustering Services (MSCS)
- Native Backup and Restore

With the growing demand for business continuity, Microsoft introduced a new feature in SQL Server 2005: Database Mirroring. This feature enables organizations to meet demanding requirements for mission-critical databases. These features vary in their recovery point objective (RPO) and recovery time objective (RTO), as well as in their relative complexity and resource needs. The following are the DR features that SQL Server 2005 presents:
Log shipping:

Using log shipping to provide business continuity (BC) and DR is the configuration that requires the least alteration to the out-of-the-box operation of the database server. In this configuration the architect exploits the standard operation of the database to provide DR. A database operating normally writes every operation that adds, modifies, or deletes data to the database's transaction log (.ldf) before applying the change to the data in the database. The transaction log provides the database with speed and can be used to recover the database with minimal data loss. The architect uses the log's ability to recover the database to keep a near real time copy in a separate location, by periodically making a copy of the transaction log and shipping that copy to the DR server. The DR server then applies the transaction log copy to keep its database in a near real time state.

To simplify the setup and configuration of log shipping, administrators can configure the necessary components in a SQL Server Database Maintenance Plan. This tool lets the architect set up the source, shipping location, backup and restore intervals, and alert thresholds for log shipping. Once the Database Maintenance Plan is completed, it creates all the corresponding SQL Server Agent jobs that log shipping needs to run successfully.

Note that log shipping is a per-database BC solution; it cannot be used at the instance level. While a standby server is restoring transaction logs, the database is in exclusive mode and is unusable. However, between transaction log restorations you can run batch reporting jobs, or Database Console Commands (DBCC) checks to continuously verify the integrity of the standby server. For applications such as decision support servers that require continuous processing on a database server, log shipping is not an appropriate option.
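The backup, copy, and restore cycle described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SQL Server code: the `Database` class, the `ship_and_restore` function, and the row names are hypothetical and exist only to show why changes made after the last shipped log backup are held on the primary alone.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Toy database: data rows plus the log records not yet backed up."""
    rows: list = field(default_factory=list)
    active_log: list = field(default_factory=list)

    def write(self, row):
        # Write-ahead logging: record the change in the log before the data.
        self.active_log.append(row)
        self.rows.append(row)

    def backup_log(self):
        # A log backup captures everything since the previous log backup.
        backup, self.active_log = self.active_log, []
        return backup


def ship_and_restore(primary, standby_rows):
    """One backup -> copy -> restore cycle (hypothetical job sequence)."""
    log_backup = primary.backup_log()   # backup job on the primary
    shipped = list(log_backup)          # copy job: file lands on the DR server
    standby_rows.extend(shipped)        # restore job on the standby
    return len(shipped)


primary = Database()
standby_rows = []

for row in ["order-1", "order-2", "order-3"]:
    primary.write(row)
ship_and_restore(primary, standby_rows)

# A change made after the last shipped log backup exists only on the primary.
primary.write("order-4")
at_risk = [r for r in primary.rows if r not in standby_rows]

print(standby_rows)  # rows already present on the standby
print(at_risk)       # rows that would be lost if the primary's log were lost now
```

In this model the size of `at_risk` is bounded by how often `ship_and_restore` runs, which is the trade-off the backup interval controls.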
The latency on the standby server depends on how frequently transaction log backups are taken at the primary server and then applied at the standby server. If the primary server fails, you may lose the changes made by transactions that occurred after your most recent transaction log backup. For example, if transaction log backups are taken every 10 minutes, transactions from the most recent 10 minutes may be lost. This does not necessarily mean that the data updates made to the primary server during the latency period will be lost. Typically, provided the primary transaction log is still readable, new updates in it can be recovered and applied at the warm standby server with only a small delay in switching from the primary server to the standby server. The main purpose of log shipping is to maintain a warm standby server.

Advantages and disadvantages of using log shipping

Advantages:

- You can recover all database activities. The recovery includes any objects that were created, such as tables and views. It also includes security changes, such as new users that were created and any permission changes.
- You can restore the database faster. The restoration of the database and the transaction log is based on low-level page formats. Therefore, log shipping speeds up the restoration process and results in fast recovery of data.
- Log shipping offers one advantage over database mirroring, which is explained later in this paper: it allows for multiple secondary servers, and any secondary server can be used for reporting. Because of this advantage, enterprises can implement log shipping in combination with database mirroring.
- Minimal alteration from standard database operations.
- Less skill is needed to configure or maintain the setup.
- Near real time database that can remain secure or can be configured to be opened for queries.

Disadvantages:

- The database is unusable during the restoration process because the database is in exclusive mode on the standby server.
- There is a lack of granularity. During the restoration process, all the changes in the primary server are applied at the standby server. You cannot use log shipping to apply changes to a few tables and reject the remaining changes.
- There is no automatic failover of applications. When the primary server fails because of a disaster, the standby server does not take over automatically. You must explicitly redirect the applications that connect to the primary server to the standby (failover) server. The connecting applications also need to be aware of both servers involved in the log shipping pair.
- Free space must be monitored on both the source and the DR file systems, because copies of the log files are kept on each.
- Status reporting is not very robust: there is limited visibility into the copy, the restore of logs, or potential failures in the process.
- The configuration is highly dependent upon disk space, network bandwidth, etc.
- If a log file is missing, a full backup is required to rebuild the standby.

[Figure 1: Transaction Log Shipping. An application connects to the SQL Server 2005 principal; transaction log backups move from the principal's transaction log backup directory to the transaction log backup directory of the warm standby server.]

Replication:

Replication is the other database-level solution for business continuity and disaster recovery that was already available on the SQL Server 2000 platform. Unlike log shipping, replication offers many different configurations for providing data redundancy. Based upon the requirements of the business and the configuration of the environment, the architect can choose the configuration that best addresses those requirements. Replication has three main configurations:

- Snapshot
- Transactional
- Merge

Snapshot replication takes a picture of the database at a point in time and sends the data to the DR. This configuration is beneficial for small databases that do not need to be kept in a near real time state.
Transactional replication takes the initial picture of the database, sends the data to the DR, and then sends individual insert, update, and delete statements from the primary database to the DR through stored procedures configured during setup. Of the three configurations, transactional replication keeps the DR database closest to real time.

Merge replication takes the picture of the database and sends it to the DR, but then allows the administrator to configure rules that permit data to be updated on either the primary or the DR server. The rules, configured by the administrator, define which data updates take precedence over others. Merge replication is very complex to set up and configure and should be used only when experienced DBAs are on staff and the architecture requires this model.

Replication can provide near real time DR, but it requires a thorough understanding of the data for setup and troubleshooting.

Advantages:

- Wizards are in place for guidance through basic configurations.
- Flexible architecture allows replication to be configured to fit the specific requirements of the data and environment.
- Built-in tools assist in monitoring and troubleshooting replication.
- Provides near real time DR and allows for a geographically separate DR site.

Disadvantages:

- Implementation of replication becomes difficult if there is no underlying control over the database architecture (3rd party applications).
- Replication configuration or troubleshooting can become complex quickly and requires an expert DBA to resolve.
- Thorough understanding of the database configuration is required for effective setup of a replication solution.
- Very large databases can take a long time to set up or reconfigure, which can translate to no DR for extended periods.
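The difference between the transactional and merge configurations can be illustrated with a small Python sketch. It is only a model under stated assumptions: tables are plain dictionaries, the `(op, key, value)` command tuples stand in for the insert, update, and delete statements, and `resolve_conflict` with its `primary_wins` flag is a hypothetical stand-in for an administrator-defined precedence rule.

```python
import copy


def apply_command(table, command):
    """Apply one replicated command, a hypothetical (op, key, value) tuple."""
    op, key, value = command
    if op in ("insert", "update"):
        table[key] = value
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def resolve_conflict(primary_value, dr_value, primary_wins=True):
    """Merge-style rule: decide which side's update takes precedence."""
    return primary_value if primary_wins else dr_value


# Initial picture of the database sent to the DR (the snapshot step).
primary = {1: "alice", 2: "bob"}
dr = copy.deepcopy(primary)

# Transactional step: the same commands are replayed at the DR, in order.
commands = [("insert", 3, "carol"), ("update", 2, "robert"), ("delete", 1, None)]
for command in commands:
    apply_command(primary, command)
for command in commands:
    apply_command(dr, command)

print(dr == primary)  # both copies hold the same rows after the replay

# Merge step: both sides changed the same row, so a rule picks the winner.
merged = resolve_conflict("robert", "bobby", primary_wins=True)
print(merged)
```

Replaying commands in their original order is what keeps the two copies identical; a merge configuration gives up that single ordering, which is why it needs explicit precedence rules.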
[Figure 2: Peer-to-peer replication with writes to primary. (1) The application writes to the primary server; (2) changes are replicated to the secondary host; (3) the application reads from the primary or the secondary. The primary and secondary SQL Server 2005 servers each use their own external storage.]
Clustered SQL Server:

Microsoft Clustering Service (MSCS) assists in disaster recovery and business continuity by enabling administrators to configure the system with redundant resources that are available in case of a failure. MSCS differs from the previous two solutions because it is configured at the operating system level, and the SQL Server Database Engine takes advantage of the capabilities of this service.

The most common configuration for Microsoft Clustering Service is Active / Passive, which consists of one active server taking database transactions and one idle passive server. The two servers are configured to share the same disk storage for the database and use continuous internal polling to determine which server is active and which is passive. If the active server fails, it returns no polling response, and the cluster runs internal programs and processes to fail over the database and ownership of the disk storage to the passive server. The passive server takes ownership of the disk and can then open the database to user connections. Once the database is open, the node continues reading from and writing to the same disk media that the original node was using.

Advantages:

- Microsoft supported high availability solution (supported since NT4.0).
- Minimized downtime due to CPU or memory failures.
- High availability not only for the database, but also for the OS and other cluster-aware applications.

Disadvantages:

- Does not provide true disaster recovery, because the cluster configuration requires a shared disk.
- Requires operating system, hardware, and database knowledge to configure the Microsoft clustering solution.
- Blocking on the database can be interpreted as unresponsiveness and cause the cluster to fail over the database, which can result in unnecessary downtime.
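The Active / Passive polling and failover behavior described above can be sketched as follows. This is a simplified illustration, not how MSCS is implemented: the `Node` and `Cluster` classes, the `poll` method, and the list used as shared storage are all hypothetical names chosen for the example.

```python
class Node:
    """A cluster node that answers heartbeat polls while it is healthy."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def heartbeat(self):
        return self.alive


class Cluster:
    """Two nodes sharing one disk; only the active node owns the disk."""

    def __init__(self, active, passive, shared_disk):
        self.active = active
        self.passive = passive
        self.shared_disk = shared_disk
        self.disk_owner = active

    def poll(self):
        """Return True if a failover happened during this poll."""
        if self.active.heartbeat():
            return False
        if not self.passive.heartbeat():
            raise RuntimeError("no healthy node left to take ownership")
        # No response from the active node: swap roles and move disk ownership.
        self.active, self.passive = self.passive, self.active
        self.disk_owner = self.active
        return True

    def write(self, row):
        if not self.disk_owner.alive:
            raise RuntimeError("disk owner is down; failover has not run yet")
        self.shared_disk.append(row)


node_a, node_b = Node("node-a"), Node("node-b")
cluster = Cluster(active=node_a, passive=node_b, shared_disk=[])

cluster.write("txn-1")
node_a.alive = False            # simulated CPU or memory failure
failed_over = cluster.poll()    # the cluster notices the missing heartbeat
cluster.write("txn-2")          # the new active node uses the same disk

print(cluster.active.name, cluster.shared_disk)
```

Because both writes land in the same `shared_disk`, the sketch also shows the limitation listed above: losing the shared storage loses the database, so clustering alone is not disaster recovery.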
[Figure 3: Typical Microsoft Clustering (MSCS) implementation. The application connects to an MSCS resource group spanning a primary and a secondary SQL Server 2005 server, which are linked by a private heartbeat interconnect and attached to shared external storage. In the event of failure on the primary active node, SQL Server is restarted on the secondary node.]
Native Backup and Restore:

As part of a Business Continuity strategy, and to satisfy legal regulations for data retention, backing up the database at regular intervals is a fundamental technique. In the event of an unrecoverable error, organizations are required to be able to restore critical information. The Backup and Restore feature allows organizations to create copies of their databases and store them in an external backup location. These backups can be used at a later date to restore lost data.

A backup strategy needs to be architected around the specific requirements of each database. For example, weekly or nightly backups can be implemented; however, any data added to the database after the last backup will be lost in the event of failure. In addition, as databases increase in size, so do backup times. This plays a crucial part in deciding which solution to implement. If near real-time recovery or minimal downtime is required, the other solutions listed in this document may be more suitable. Microsoft recommends that you use the Backup and Restore feature only for non-mission-critical database applications.

Advantages

- You can back the database up to removable media to help protect against disk failures.
- You do not have to depend on the network as you do when you use failover clustering or log shipping.

Disadvantages

- While you back up the database, you cannot perform operations such as table creation, index creation, database shrinking, or non-logged operations.
- If a failure occurs, you may lose your most recent data.
- If a disaster occurs, you must manually restore the database.

Database Mirroring:

The introduction of SQL Server 2005 brought a new choice for high availability and business continuity.
Database mirroring was intended as another way to address companies' requirements for high availability and disaster recovery at the database level, mimicking the capabilities of the Microsoft Clustering solution. It was designed to provide this functionality to the database without additional configuration of the operating system and hardware. Mirroring also allows for true disaster recovery, because the databases are on separate servers, in different data centers, running off their own sets of disk media.

Database mirroring is similar to log shipping (see the earlier section) in its mechanism for sending transactions to another database. The difference is that log shipping batches up a set of transactions and sends them to the secondary location to be restored, whereas mirroring continuously streams each individual transaction to the DR site. This provides the benefit of a near real time standby server. The database can be configured for either automatic or manual failover.

While a DR database is set up as part of a mirror, it stays in a standby mode to prevent users from tampering with it and to ensure the fastest possible restore time. Users can set up a database snapshot to query the DR for reporting or troubleshooting purposes.

The mirror can be configured in three ways:

1. High Safety
2. High Performance
3. Witness role
The first setting is high safety, which ensures synchronous transfer of data: the transaction is written to the DR before it is committed on the primary. The high safety setting can add some transaction time overhead to the write process.

The second setting is high performance, which writes the transaction to the primary server and then sends it to the DR on the assumption that it will be written. This is the fastest setting for write performance, but it does not guarantee that the transaction was committed to the DR.

The third setting allows for automatic failover but requires a third server to take on the role of a witness, which determines whether the primary server or the secondary server has ownership. This setting also requires that the mirror adopt the high safety setting of two phase commit, to ensure transactions are written to the DR before being committed on the primary server. This ensures that, in the case of an automatic failover, your DR is an exact replica of your primary server.

Database mirroring is a full HA/BC solution in SQL Server 2005 at the database level. The source and destination SQL Server databases are known as the principal and the mirror, respectively. A client interacts with the principal and submits a transaction. The principal writes the requested change to the principal transaction log and automatically transfers the information describing the transaction to the mirror, where it is written to the mirror transaction log. The mirror then sends an acknowledgement to the principal. The mirror continuously uses its copy of the transaction log to replicate changes made to the principal database to the mirror database. In case of a disaster, the mirror is made live and new connections to the principal are diverted to it automatically.

Mirroring can be implemented on a per-database basis. Mirroring only works with databases that use the full recovery model. The simple and bulk-logged recovery models do not support database mirroring.
Therefore, all bulk operations are always fully logged. Database mirroring works with any supported database compatibility level. Two or three instances of SQL Server can leverage the existing infrastructure. Many applications can use multiple databases on a single server, or one application may reference multiple databases; however, database mirroring works with a single database at a time.

Advantages and disadvantages of using database mirroring

Advantages

- Microsoft supported high availability and disaster recovery solution.
- Closest to real time failover of the database from primary to DR.
- Setup is configured at the database level only and allows for three different setups to help meet business requirements.
- Database mirroring improves the availability of the production database during upgrades.

Disadvantages

- The mirror database should be identical to the principal database. For example, all objects, logins, and permissions should be identical.
- Database mirroring involves the transfer of information from one computer to another over a network. Therefore, the security of the information that SQL Server transfers is very important.
- The high safety setting requires network overhead and can slow database write performance.
- Can require an additional server resource if automatic failover is configured.
- The DR database is closed to user connections, and a database snapshot is required to create a readable copy of its content.
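The trade-off between the high safety and high performance settings can be sketched in Python. This is a deliberately small model built on assumptions rather than the SQL Server implementation: the `Principal` and `Mirror` classes, the mode strings, and the `send_queue` and `flush` names are hypothetical, and network failures and the witness are left out.

```python
class Mirror:
    """Receives log records from the principal and hardens them to its log."""

    def __init__(self):
        self.log = []

    def harden(self, record):
        self.log.append(record)
        return True  # acknowledgement sent back to the principal


class Principal:
    def __init__(self, mirror, mode):
        if mode not in ("high_safety", "high_performance"):
            raise ValueError(f"unknown mode: {mode}")
        self.mirror = mirror
        self.mode = mode
        self.log = []
        self.send_queue = []  # records not yet hardened on the mirror

    def commit(self, record):
        self.log.append(record)
        if self.mode == "high_safety":
            # Synchronous: wait for the mirror's acknowledgement before
            # reporting the commit, at the cost of extra write latency.
            self.mirror.harden(record)
        else:
            # Asynchronous: report the commit now and send the record later.
            self.send_queue.append(record)

    def flush(self):
        """Send any queued records to the mirror (asynchronous catch-up)."""
        while self.send_queue:
            self.mirror.harden(self.send_queue.pop(0))


safe = Principal(Mirror(), "high_safety")
safe.commit("t1")

fast = Principal(Mirror(), "high_performance")
fast.commit("t1")
fast.commit("t2")
lag_before_flush = len(fast.log) - len(fast.mirror.log)
fast.flush()

print(safe.mirror.log, lag_before_flush, fast.mirror.log)
```

In this model the records waiting in `send_queue` are the ones that would be missing from the DR if the principal failed before `flush` ran, which is the guarantee the high performance setting gives up in exchange for faster writes.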
[Figure 4: Typical Implementation of Database mirroring. The application connects to a SQL Server 2005 principal, which is paired with a SQL Server 2005 mirror and an optional witness.]

Conclusion

This paper summarizes the solutions that Microsoft SQL Server 2005 offers for achieving business continuity at the database layer. Depending on data protection and availability requirements, any of these solutions, or a combination of them, can be implemented as part of the enterprise strategy. For instance, Log Shipping can supplement Database Mirroring [1] to provide maximum protection in the event of unplanned downtime.

In addition, Dell recommends implementing a DR strategy at every level of the solution stack. The complete business continuity strategy should include high availability at the database, hardware, and storage layers, as well as planning for downtime and methods for backup and recovery of the data. Along with premier partners such as EMC and Microsoft, DELL offers services and reference architectures with best practices, such as redundant components and storage RAID, to help customers achieve maximum business continuity.

The following table presents a comparison of the Microsoft SQL Server 2005 options.

| Feature | Backup and Restore | Log Shipping | Replication | Failover Clustering | Database Mirroring |
|---|---|---|---|---|---|
| Automatic Failover | No | No | No | Yes | Yes |
| Ease of Configuration | Easy | Easy | Medium hard | Hard | Medium |
| Granularity of Recovery | Database | Database | Database object | SQL Server instance | Database |
| RPO | Data loss up to last backup copy | Possible data loss | Some data loss is possible | No data loss | No data loss |
| RTO | Time taken for database recovery depends on size | Time taken for database recovery | May run into minutes | 20 to 30 seconds plus time taken to recover databases | <3 seconds |
| Administrative Overhead | Maintain backups and manual restores | Minimal | May get involved in case of complex publisher-subscriber scenarios | Maintaining cluster hardware | Checking mirror status |

[1] Database Mirroring and Log Shipping Working Together, SQL Server Best Practices Article, Jan 2008 http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/dbmandlogshipping.docx

Glossary:

Recovery Time Objective (RTO): The maximum time an outage can be tolerated is referred to as the recovery time objective.

Recovery Point Objective (RPO): The amount of data loss that can be tolerated is referred to as the recovery point objective.

Transaction log: The transaction log is a serial record of all the transactions that have been performed against the database since the transaction log was last backed up.

Business Continuity: Business continuance (referred to as business continuity) describes the processes and procedures an organization puts in place to ensure that essential functions can continue during and after a disaster. Business continuity planning seeks to prevent disruption of mission-critical services and to reinstate full functioning quickly and efficiently. Business continuity is not a specific technology; it should integrate a variety of strategies and technologies to address all potential causes of outage, balancing cost against acceptable risk, resulting in a resilient infrastructure. As a first step in business continuity, high-availability planning decides which of the organization's functions must be available and operational during a crisis. Once the mission-critical components are identified, it is essential to identify RPO and RTO objectives for each of them, apportioned in terms of cost and acceptable risk. To appropriately architect a disaster recovery solution, one must be familiar with the following terms.
Availability: Generally, the degree to which a system, subsystem, service, or equipment is in an operable state and functional condition for a proportion of time. It refers to the ability of the user community to access the system.

Disaster Recovery (DR): The process of regaining access to the data, hardware, and software necessary to resume critical business operations after a disaster. A disaster recovery plan should also include methods or plans for copying necessary mission-critical data to a recovery site, so that access to that data can be regained after a disaster.

High Availability (HA): A system design protocol and associated implementation that ensure a certain absolute degree of operational continuity of a system, service, or equipment during a given measurement period.
High-availability planning should include strategies to prevent single points of failure that could potentially disrupt the availability of mission-critical business operations.

References

- DELL Database Solutions: http://www.dell.com/sql
- DELL Services: http://www.dell.com/services
- Database Mirroring and Log Shipping Working Together, Jan 2008: http://technet.microsoft.com/en-us/sqlserver/bb671430.aspx
- Best Practices for Backup and Restore in SQL Server 2005: http://www.dell.com/downloads/global/solutions/public/white_papers/sql2005_backup_wp.pdf
- SQL Server 2005 Mission Critical High Availability: http://technet.microsoft.com/en-us/sqlserver/bb331801.aspx
- SQL Server 2005 DR Options: http://support.microsoft.com/kb/822400