Protecting SQL Server in Physical And

THE NEED TO PROTECT SQL SERVER AND RELATED APPLICATIONS

Microsoft SQL Server is the preferred database management system for a large number of Windows-based applications. Often, these applications and their associated data are critical to the operation of the business. Typical SQL Server applications include order entry, manufacturing control, inventory management, purchasing, along with many industry-specific applications. Protecting SQL Server and its databases is a top priority for many businesses, but it is generally not enough. If the application itself is not also protected, business operations will be disrupted. All but the simplest database applications include a separate application element that includes application logic, user interface control and sometimes communication with other applications or networks. Applications may include in-memory data or file system data, in addition to SQL Server data, all adding to the complexity of the protection challenge.

DEFINE THE PROBLEM

Improving the availability of SQL Server and related applications involves reducing or eliminating many possible downtime causes. At the highest level, downtime can be separated into two categories: planned downtime and unplanned downtime. Planned downtime is less disruptive since it can be scheduled for nights or weekends when user activity is much lower. Unplanned downtime, on the other hand, tends to occur at the worst possible times and can impact the business severely. Unplanned downtime can have many causes including hardware failures, software failures, operator errors, data loss or corruption, and site outages. This paper discusses the different causes of
both planned and unplanned downtime along with some important considerations for evaluating solutions to address them.

SOLVE THE PROBLEM

Most availability solutions today fall into one of three categories: traditional failover clusters, virtualization clusters and data replication. Some solutions combine elements of both clustering and data replication; however, there is no single solution that can address all possible causes of downtime. Traditional and virtualization clusters both rely on shared storage and the ability to run applications on an alternate server if the primary server fails or requires maintenance. Data replication solutions maintain a second copy of the application data, at either a local or remote site, and support either manual or automated failover to handle planned or unplanned server failures. All of these solutions rely on redundant servers to provide availability. Applications can be moved to an alternate server if a primary server fails or requires maintenance. It is also possible to add redundant components within a server to reduce the chances of server failure.

ELIMINATE DOWNTIME

Most availability solutions rely on a recovery process called failover that begins after a failure occurs. A failover moves application processing to an alternate host after an unplanned failure occurs or by operator command to accommodate planned maintenance activity. Failovers are effective in bringing applications back online reasonably quickly but they do result in application downtime, loss of in-process transactions and in-memory application data, and expose the possibility of data corruption. Even a routine failover will result in minutes or tens of minutes of downtime including the time required for application restart and data recovery resulting from an unplanned failure.
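As a rough illustration of how failover downtime adds up, the following sketch converts a yearly failover count and a per-failover duration into an availability figure. The function name and the sample inputs (four failovers a year at 15 minutes each) are assumptions chosen for illustration, not measurements from this paper, and the calculation ignores every other source of downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_availability(failovers_per_year: float, minutes_per_failover: float) -> float:
    """Return availability as a fraction of the year, counting only failover downtime."""
    downtime = failovers_per_year * minutes_per_failover
    if downtime < 0 or downtime > MINUTES_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year")
    return 1.0 - downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical inputs: four failovers a year, 15 minutes each.
    print(f"{annual_availability(4, 15):.5%}")
```

Either input can be varied independently, which mirrors the two levers available: fewer failovers, or shorter ones.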
In the worst case, software bugs or errors in scripts or operational procedures can result in failovers that do not work properly, with the result that downtime can extend to hours or even days. Reducing the number of failovers, shortening the duration of failovers, and ensuring that the failover process is completely reliable all contribute to the elimination of SQL Server downtime. SQL Server includes support for Microsoft clustering, data mirroring with failover, and data replication. Many third-party availability solutions also include specific support for SQL Server. Local server redundancy and basic failover address the most common failures that cause unplanned SQL Server and application downtime. However, data loss or corruption, and site disruptions, although less common, can
cause much longer outages and require additional solution elements to properly address.

EVALUATE UNPLANNED DOWNTIME CAUSES

Unplanned downtime can be caused by a number of different events:

- Catastrophic server failures caused by memory, processor or motherboard failures
- Server component failures including power supplies, fans, internal disks, disk controllers, host bus adapters and network adapters
- Software failures of the operating system, middleware or application
- Site problems such as power failures, network disruptions, fire, flooding or natural disasters

There are also the problems of data loss and corruption that require solutions beyond hardware redundancy and failover. Each category of unplanned downtime is addressed in more detail below.

AVOID SERVER HARDWARE FAILURES

Server core components include power supplies, fans, memory, CPUs and main logic boards. Purchasing robust, name-brand servers, performing recommended preventative maintenance, and monitoring server errors for signs of future problems can all help reduce the chances of failover due to catastrophic server failure. Downtime caused by server component failures can be significantly reduced by adding redundancy at the component level. Examples include redundant power and cooling, ECC memory (which can correct single-bit memory errors), teaming of Ethernet cards, and RAID arrays.

REDUCE STORAGE HARDWARE FAILURES

Storage protection relies on device redundancy combined with RAID storage algorithms to protect data access and data integrity from hardware failures. There are distinct issues for both local disk storage and for shared network storage. For local storage, it is quite easy to add extra disks configured with RAID protection. A second disk controller is also required if you want to protect against controller failures. Access to shared storage relies on either a Fibre Channel or Ethernet storage network.
To assure uninterrupted access to shared storage, these networks must be designed to eliminate all single points of failure. This requires redundancy of network paths, network switches, and network connections to each storage array.
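One way to reason about single points of failure is to model the storage network as a graph and test whether the server can still reach the storage array after each component is removed in turn. The sketch below does this with a breadth-first search. The topology, the component names (hba1, switch1 and so on) and the helper functions are hypothetical and exist only for illustration; a real design review would also cover power, cabling and the array's own controllers.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from a list of (a, b) links."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed component."""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, server, array):
    """Return intermediate components whose loss cuts the server off from the array."""
    components = set(links) - {server, array}
    return sorted(c for c in components
                  if not reachable(links, server, array, removed=c))


# Hypothetical topologies: one path versus two fully independent paths.
single_path = undirected([("server", "hba1"), ("hba1", "switch1"), ("switch1", "array")])
dual_path = undirected([
    ("server", "hba1"), ("hba1", "switch1"), ("switch1", "array"),
    ("server", "hba2"), ("hba2", "switch2"), ("switch2", "array"),
])

if __name__ == "__main__":
    print(single_points_of_failure(single_path, "server", "array"))
    print(single_points_of_failure(dual_path, "server", "array"))
```

In the single-path layout every adapter and switch is reported; in the dual-path layout the list is empty because no one component carries all traffic.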
SAY GOODBYE TO NETWORKING FAILURES

The network infrastructure itself must be fault-tolerant, consisting of redundant network paths, switches, routers and other network elements. Server connections can also be duplicated to eliminate failovers caused by the failure of a single server or network component. Take care to ensure that the physical network hardware does not share common components. For example, dual-ported network cards share common hardware logic, and a single card failure can disable both ports. Full redundancy requires either two separate adapters or the combination of a built-in network port along with a separate network adapter.

MINIMIZE SOFTWARE FAILURES

Software failures can occur at the operating system level or at the SQL Server and application level. In virtualization environments, the hypervisor itself or virtual machines can fail. In addition to hard failures, performance problems or functional problems can seriously impact SQL Server users, even while all of the software components continue to operate. Beyond proper software installation and configuration along with the timely installation of hot fixes, the best way to improve software reliability is the use of effective monitoring tools. Fortunately, there is a wide choice of monitoring and management tools for SQL Server available from Microsoft as well as from third parties.

REDUCE OPERATOR ERRORS

Operator errors are a major cause of downtime. Proven, well-documented procedures and properly skilled and trained IT staff will greatly reduce the chance of operator errors. But some availability solutions can actually increase the chance of operator errors by requiring specialized staff skills and training, by introducing the need for complex failover script development and maintenance, or by requiring the precise coordination of configuration changes across multiple servers.
SECURE YOURSELF FROM SITE-WIDE OUTAGES

Site failures can range from an air conditioning failure or leaking roof that affects a single building, to a power failure that affects a limited local area, to a major hurricane that affects a large geographic area. Site disruptions can last anywhere from a few hours to days or even weeks. There are two methods for dealing with site disasters: redundant servers can be tightly coupled across high-speed, low-latency links to provide zero data loss and zero downtime, or redundant servers can be loosely coupled over medium-speed, higher-latency, greater-distance lines to provide a disaster recovery (DR) capability where a remote server can be restarted with a copy of the application database which only
misses the last few updates. In the latter case, asynchronous data replication is used to keep a backup copy of the data. Data replication is combined with error detection and failover tools to help get a disaster recovery site up and running in minutes or hours, rather than days.

PROTECT AGAINST DATA LOSS AND CORRUPTION

Data loss and corruption cannot be eliminated through hardware redundancy alone. Errors in application logic or mistakes by users or IT staff can result in accidentally deleted files or records, incorrect data changes and other data loss or integrity problems. Certain types of hardware or software failures can lead to data corruption. Site problems or natural disasters can result in loss of access to data or the complete loss of data. Beyond the need to protect current data, both business and regulatory requirements add the need to archive and retrieve historical data, often spanning several years and multiple types of data.

Full protection against data loss and corruption requires a comprehensive backup and recovery strategy along with a disaster recovery plan. In the past, backup and recovery strategies have been based on writing data to tape media that can be stored off-site. However, this approach has several drawbacks:

- Backup operations require storage and processing resources that can interfere with production operation and may require some applications to be stopped during the backup window
- Backup intervals typically range from a few hours to a full day, with the risk of losing several hours of data updates that occur between backups
- Using tape backup for disaster recovery results in recovery times measured in days, an unacceptable level of downtime for many organizations

Data replication is a better solution for both data protection and disaster recovery. Data replication solutions capture data changes from the primary production system and send them, in real time, to a backup system at a remote disaster site, at the local site, or both.
There is still the chance that a system failure can occur before data changes have been replicated, but the exposure is in seconds or minutes rather than hours or days. Data replication can be combined with error detection and failover tools to help get a disaster recovery site up and running in minutes or hours rather than days. Local data copies can be used to reduce tape backup requirements and to separate archival tape backup from production system operation to eliminate resource contention and remove backup window restrictions.
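The difference in exposure between periodic backup and replication can be put into numbers with a very small calculation. The sketch below multiplies an update rate by the worst-case window between protected copies. The update rate, the backup intervals and the 30-second replication lag are all hypothetical values picked for illustration; none of them come from this paper.

```python
def worst_case_updates_lost(updates_per_minute: float, exposure_minutes: float) -> float:
    """Updates at risk if a failure strikes just before the next copy is taken."""
    if updates_per_minute < 0 or exposure_minutes < 0:
        raise ValueError("inputs must be non-negative")
    return updates_per_minute * exposure_minutes


# Hypothetical workload and protection schemes (illustrative figures only).
UPDATES_PER_MINUTE = 200
EXPOSURE_MINUTES = {
    "nightly tape backup": 24 * 60,
    "backup every 4 hours": 4 * 60,
    "asynchronous replication, 30 s lag": 0.5,
}

at_risk = {name: worst_case_updates_lost(UPDATES_PER_MINUTE, minutes)
           for name, minutes in EXPOSURE_MINUTES.items()}

if __name__ == "__main__":
    for name, lost in at_risk.items():
        print(f"{name}: up to {lost:,.0f} updates at risk")
```

The same function can be fed a measured update rate and a measured replication lag to estimate exposure for a specific environment.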
CONSIDER ISSUES THAT CAUSE PLANNED DOWNTIME

Hardware and software reconfiguration, hardware upgrades, software hot fixes and service packs, and new software releases can all require planned downtime. Planned downtime can be scheduled for nights and weekends, when system activity is lower, but there are still issues to consider. IT staff morale can suffer if off-hour activity is too frequent. Companies may need to pay overtime costs for this work. And application downtime, even on nights and weekends, can still be a problem for many companies that use their systems on a 24/7 basis.

Using redundant servers in any availability solution allows reconfiguration and upgrades to be applied to one server while SQL Server and applications continue to run on a different server. After the reconfiguration or upgrade is completed, SQL Server and applications can be moved to the upgraded server with minimal downtime. Most of the work can be done during normal hours. Solutions based on virtualization, which can move applications from one server to another with no downtime, can reduce planned downtime even further.

ADDED BENEFITS OF VIRTUALIZATION

The latest server virtualization technologies, while not required for protecting SQL Server, do offer some unique benefits that can make SQL Server protection both easier and more effective. Virtualization makes it very easy to set up evaluation, test and development environments without the need for additional, dedicated hardware. Many companies cannot afford the additional hardware required for testing SQL Server and database applications in a traditional, physical environment, but effective testing is one of the keys to avoiding problems when making configuration changes, installing hot fixes or moving to a new update release.

Virtualization offers a single solution for protecting both SQL Server and the related database applications.
SQL Server and each application can be configured as individual virtual machines and run on any available host within a resource pool. Availability characteristics can be matched to each virtual machine as dictated by application characteristics and business needs. Virtualization allows resources to be adjusted dynamically to accommodate growth or peak loads. The alternative is to buy enough extra capacity upfront to handle expected growth, but this can result in expensive excess capacity. On the other hand, if the configuration was sized only for the short-term load requirements, growth can
lead to poor performance and ultimately to the disruption associated with upgrading or replacing production hardware.

Infrastructure components that support the SQL Server environment, including Active Directory, DNS and DHCP, have traditionally required separate servers and distinct availability solutions. They can now be implemented as virtual machines in a common resource pool and leverage the common availability solution that is used to address the entire virtualization environment.

Virtualization also makes disaster recovery easier to implement, more effective, and less costly. Virtual machines separate the software configuration from the underlying hardware. This provides total flexibility in the hardware required for the disaster site. One set of hardware can provide disaster backup for multiple applications, and cost-effective configurations can be chosen strictly based on their disaster recovery role. Software configurations change over time and changes must be duplicated at the disaster site to ensure proper operation. This can be extremely time-consuming and error-prone in a physical environment. In a virtual environment, the configuration is contained within the virtual machine definition file. Simply copying this file to the disaster site is all that is needed to maintain configuration compatibility.
everrun BENEFITS

everrun software offers a unique set of features and benefits that make everrun a great choice for protecting Microsoft SQL Server and related database applications:

- Higher levels of availability than competing failover solutions
- Selectable fault-tolerance using a range of everrun products along with the availability dial of everrun VM
- Simple to install and manage with no need for specialized staff experience or training
- Solutions for both virtual and physical environments
- Solutions for local, near distance and long distance availability and data protection
- Choice of either locally-attached or networked storage
- Built-in failover policy eliminates complex policy definition and the potential for policy errors
- Sophisticated error detection is faster and more reliable than simple heartbeats
- Cost effective for a wide range of SQL Server environments
- Application transparency allows database applications to be protected along with SQL Server without the need for failover scripting or other application-specific procedures or knowledge
Providing availability for SQL Servers and database applications running in remote locations can present unique challenges compared to operating in a corporate data center. Remote sites often have less skilled IT staff or lack local IT staff entirely. A simple availability solution like everrun is an ideal fit. The ability to use local storage, without the requirement for a SAN, makes everrun more suited to environments without a datacenter infrastructure.

everrun HA

everrun HA turns two commodity servers into a high-availability server that presents a single image to software, administrators and users. Component failures of disks, storage interfaces and network interfaces are handled transparently with no application disruption. Unlike with everrun FT, system failures require an application restart on the surviving system. everrun HA provides Component Fault Tolerance that uses disk and network components across the pair of servers for redundancy. Unlike most clustering solutions, everrun HA does not require redundant components within each server and does not use multipath IO and NIC teaming for failover, saving both cost and complexity while providing a solution that is compatible with existing storage and network infrastructure.

everrun FT

everrun FT turns two commodity servers into a fault-tolerant server that presents a single image to software, administrators and users. SQL Server and applications continue to run through a full range of possible errors including power supplies, fans, storage devices and interfaces, network interfaces, memory, processors and even motherboards. There is no downtime and no loss of application state or in-memory data; the everrun software transparently utilizes redundant components to handle processing tasks in the event of component or complete system failures.
The individual servers within the redundant pair can be taken offline for hardware or software maintenance and reintegrated online for most maintenance activities.

everrun VM

everrun VM integrates Marathon's availability solutions with the Citrix XenServer virtualization platform. everrun VM offers a range of availability options, an availability dial, within a single product, allowing users to choose the most appropriate availability level for each application. Basic Failover and Component Fault Tolerance options are available today; more options on the dial will become available over time and will be easily integrated into an existing configuration.
Applications that do not need any protection at all can run in the standard XenServer environment along with everrun protected applications. Availability options include standard failover (similar to the VMware HA feature), Component Fault-Tolerance (availability features equivalent to everrun HA), and System Fault-Tolerance (availability features equivalent to everrun FT). Availability selections are made for each protected virtual machine. Virtual machines at different availability levels, along with unprotected virtual machines, can coexist within the same XenServer pool.

everrun SPLITSITE

everrun SplitSite is an option for the VM, HA and FT products that allows the two systems in a server pair (or two hosts running a protected virtual machine) to be geographically separated by distances up to 100 miles. The operating characteristics of SplitSite systems are identical to VM, HA and FT systems running side by side. With SplitSite, protection from many local site problems can be combined with best-in-class local availability for a single, integrated availability solution.

everrun DR

everrun DR is a data replication and failover software solution for long-distance disaster recovery protection. everrun DR uses sophisticated data compression, coalescing and de-duplication techniques to make optimum use of WAN bandwidth. Application-aware solutions for SQL Server 2000, SQL Server 2005 and SQL Server 2008 can restore business operation within minutes.

everrun CDP

everrun CDP is a continuous data protection software solution that supports data recovery to any point in time or bookmarked event. Continuous replication eliminates the risk of data loss that can occur between traditional backups or snapshots. everrun CDP allows users to create an out-of-band backup environment using the continuously updated data replicas as the source for all backups, eliminating resource contention with production systems and removing restrictive backup windows.
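The general idea behind recovery to any point in time or bookmarked event can be sketched as an append-only journal of writes that is replayed up to a chosen cutoff. The example below is a generic illustration of that technique and makes no claim about how everrun CDP is implemented; the ChangeJournal class, its methods and the sample order records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeJournal:
    """Append-only log of (timestamp, key, value) writes plus named bookmarks."""
    entries: list = field(default_factory=list)
    bookmarks: dict = field(default_factory=dict)

    def record(self, timestamp, key, value):
        """Log one write; a value of None represents a delete."""
        if self.entries and timestamp < self.entries[-1][0]:
            raise ValueError("writes must be recorded in time order")
        self.entries.append((timestamp, key, value))

    def bookmark(self, name, timestamp):
        """Attach a name to a point in time, e.g. just before a risky change."""
        self.bookmarks[name] = timestamp

    def restore(self, point):
        """Rebuild the data as of a timestamp or a bookmark name."""
        cutoff = self.bookmarks[point] if isinstance(point, str) else point
        state = {}
        for timestamp, key, value in self.entries:
            if timestamp > cutoff:
                break  # entries are time-ordered, so nothing later applies
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state


# Hypothetical history: two orders are created, then one is deleted by mistake.
journal = ChangeJournal()
journal.record(1, "order-1", "open")
journal.record(2, "order-2", "open")
journal.bookmark("before-cleanup", 2)
journal.record(3, "order-1", None)

before = journal.restore("before-cleanup")  # state just before the mistake
after = journal.restore(3)                  # state including the mistake
```

Because every write is kept, the recovery point is chosen at restore time rather than being fixed by a backup schedule.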
ARCHITECTURE EXAMPLES

The following diagrams show how everrun software can be used in several different scenarios to protect SQL Server.

Diagram 1

Diagram 1 shows SQL Server, along with a manufacturing application, in an everrun VM environment with two physical servers. SQL Server and the manufacturing application each run as virtual machines protected by everrun VM. Both SQL Server and the application are fully protected using a single solution that is simple to install and maintain.
Diagram 2

Diagram 2 shows SQL Server, along with a medical application, with protection provided by everrun FT and everrun HA across two separate sites using the SplitSite option. A pair of servers, one located at the primary site and one at the backup site, protects SQL Server using everrun HA. A second pair of servers, also split between the two sites, protects the medical application using everrun FT. The sites are connected using a high-speed network link. The combination of everrun HA, everrun FT and everrun SplitSite provides all the benefits of everrun local protection while adding protection for site failures and local disasters.
Diagram 3

Diagram 3 shows SQL Server, along with an ERP application, in a disaster recovery scenario using everrun DR across two extended distance sites. everrun VM is used for local protection and to simplify management and reduce cost of the disaster recovery solution. Data is replicated across the two sites in real time using everrun DR. SQL Server failover is also managed by everrun DR.

PROTECTING SQL SERVER WITH everrun CASE STUDIES

A Canadian hospital specializing in cancer care uses a medical application called MOSAIQ and its associated SQL 2005 database to manage radiation therapy. After evaluating several availability technologies, the hospital chose everrun FT and the SplitSite option to protect this critical application and database. HP servers are located in two separate hospital buildings, each connected to a different power grid. The SQL 2005 database is mirrored across local storage on the two servers. This
solution provides continuous availability for both the application and database and protects against a range of hardware, software and site failures. The hospital installs application upgrades, a major cause of planned downtime, on one server at a time, virtually eliminating downtime associated with this process. Commenting on their experience with Marathon everrun, the hospital's system administrator says, "The systems just stay up and run."

A major U.S. retailer uses a sophisticated warehouse control system, built on a SQL Server database, to operate a major distribution center. Any failure of the SQL Server or the warehouse control application means that the distribution center operation and the resulting restocking of hundreds of retail stores come to a complete stop. The solution recommended by the warehouse system integrator included Marathon everrun HA. Using a pair of HP DL380 servers and internal storage mirrored across the servers, the company has encountered no problems with the Marathon system in almost two years of operation.

Rich Products, a major U.S. frozen foods manufacturer, operates a large manufacturing facility in Arlington, Tennessee on a 24 hours per day, 7 days per week schedule. The facility produces its products in large batches using Wonderware Historian and SQL Server 2005 to collect a wide range of critical data throughout the process. Continuous data collection is a necessity for Rich Products to ensure that its products are manufactured to the highest levels of quality and are in full compliance with all FDA and other regulatory requirements. Failure of SQL Server or Wonderware Historian can shut down the entire production line with direct financial consequences. Rich Products considered a cluster solution but determined that cluster failover introduced unacceptable levels of downtime and data loss for their environment. They instead chose Marathon everrun FT with its ability to continue processing through a variety of hardware and software faults. Since installing their Marathon solution, Rich Products has experienced no downtime or data loss from their SQL Server and Wonderware Historian applications.

CONCLUSION

Protecting SQL Server and related applications requires addressing many different possible causes of both planned and unplanned downtime. Marathon everrun solutions provide the most comprehensive, effective and affordable options that can address the full range of SQL Server and database application downtime risks.

Want to keep Microsoft SQL Server up and running through failures and disasters? Contact us for more information or to take a test drive: marathontechnologies.com

WORLDWIDE HEADQUARTERS
Marathon Technologies Corporation
295 Foster Street, Littleton, MA 01460
Tel 1.800.884.6425 / 1.978.489.1100
Fax 1.978.489.1101
Email: info@marathontechnologies.com
Web: www.marathontechnologies.com

EMEA HEADQUARTERS
Marathon Technologies UK Ltd
Regus House, Trinity Court
Wokingham Road, Bracknell
Berkshire, RG42 1PL
Tel +44 (0) 1344.706.241
Fax +44 (0) 1344.706.242
Email: emea@marathontechnologies.com
Web: www.marathontechnologies.com

The Marathon logo, SplitSite and everrun are trademarks or registered trademarks of Marathon Technologies Corporation. Microsoft and Windows are registered trademarks of Microsoft Corporation. All other trademarks and registered trademarks are the property of their respective owners. Copyright 2008 Marathon Technologies Corporation. All rights reserved. Marathon Technologies Corporation reserves the right to make changes to this document at any time and without further notice. Marathon Technologies Corporation assumes no responsibility for any errors that may appear in this document.