Disaster Recovery and Data Replication Architectures




Disaster Recovery and Data Replication Architectures
Gartner IT Security Summit 2005, 6-8 June 2005, Marriott Wardman Park Hotel, Washington, District of Columbia.
These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via e-mail: quote.requests@gartner.com.

Strategic Planning Assumptions: Through 2007, fewer than 20 percent of enterprises will operate at Stage 3, the highest level of disaster recovery management process maturity (0.8 probability). By year-end 2007, the share of large enterprises with well-defined disaster recovery processes and regularly tested plans will rise from approximately 60 percent today to 80 percent (0.8 probability).

Disaster Recovery Management Is Maturing
- Stage 0: No data recovery plan, or a shelfware plan
- Stage 1: Data recovery as an IT project; platform-based; plan occasionally tested; ad hoc project status reporting
- Stage 2: Data recovery as a process and a component of business continuity management; DR linked to business process requirements; defined organization; plan regularly tested; formalized reporting
- Stage 3: Business integration; partner integration; process integration; continuous improvement culture; frequent, diverse testing; formalized reporting to BCM, executives and the board

Disaster recovery management (DRM) has evolved over 20 years, from its roots in platform-based IT recovery (such as the mainframe) to integration in business continuity plans. Enterprises tend to evolve through at least four DRM stages, moving to the next stage when the benefits outweigh the risks of inaction. In the first phase, disaster recovery (DR) plans are nonexistent or exist only as shelfware; they are not tested or maintained and would not enable or direct recovery actions. Enterprises typically move next to define a DR plan on a project basis. Typically, there is a realization inside IT that some disaster risk mitigation must be implemented, or there is outside business pressure to protect a specific business process (such as the call center). The project is focused on a plan with occasional testing; however, it is typically not integrated into other IT/business processes and is not maintained. In the next phase, enterprises focus on building a DRM organization and processes, ensuring a life cycle approach to maintaining the plan, and testing the plan regularly (once or twice per year). Business process owners actively determine IT recoverability requirements and participate in tests. In the final phase, the focus is on process integration, that is, with change management to ensure that DR plans are kept up to date and with incident/problem management to leverage IT support processes. DRM is also considered early in the stages of a new project. Emphasis is also on end-to-end planning, including partner integration and continuous improvement/best practices.

Client Issues
1. How will enterprises justify investments in the technologies, people and business processes needed to deliver continuous application availability and protection from site disasters?
2. What technologies will be critical for data replication architectures, and what are their tradeoffs?
3. What architectures and best practices will enable enterprises to achieve 24x7 availability and disaster protection as required by the business?

The need for 24x7 application availability and protection against site disasters is mandatory for applications used in critical business processes. This presentation focuses on the technologies, data replication architectures and best practices for achieving near-real-time recovery time and recovery point objectives, organized around the three client issues listed above.

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters?

Investing to Reduce Unplanned Downtime
- Causes of unplanned downtime: 40% application failure; 40% operations errors; 20% environmental factors, hardware, OS, power and disasters
- Investment strategy: redundancy; service contracts; availability monitoring; BCM/DRM and testing
- People and process: hiring and training; IT process maturity; reduced complexity; automation; change and problem management
- People and process: application architecture/design; management instrumentation; change management; problem management; configuration management; performance/capacity management

Based on extensive feedback from clients, we estimate that, on average, 40 percent of unplanned mission-critical application downtime is caused by application failures (including bugs, performance issues or changes to applications that cause problems); 40 percent by operations errors (including performing an operations task incorrectly or not at all); and about 20 percent by hardware, operating systems, environmental factors (for example, heating, cooling and power failures), first-day security vulnerabilities, and natural or manmade disasters. To address the 80 percent of unplanned downtime caused by people failures (vs. technology failures or disasters), enterprises should invest in improving their change and problem management processes (to reduce the downtime caused by application failures); automation tools, such as job scheduling and event management (to reduce the downtime caused by operator errors); and improving availability through enterprise architecture (including reducing complexity) and management instrumentation. The balance should be addressed by eliminating single points of failure through redundancy, implementing BC/DR plans and reducing time to repair through technology support/maintenance agreements. Action Item: Don't let redundancy give you a false sense of security, since 80 percent of downtime is caused by people and process issues.

Strategic Planning Assumption: By year-end 2007, 65 percent of large enterprises will integrate disaster recovery requirements into the new project life cycle, up from fewer than 25 percent in 2004 (0.8 probability).

Justification Vehicle: The Business Impact Assessment
Know your downtime costs per hour, per day, per two days and so on, and use them to set RTO and RPO:
- Productivity: number of employees affected x hours out x burdened hourly rate
- Damaged reputation: customers, suppliers, financial markets, banks, business partners...
- Revenue: direct loss; compensatory payments; lost future revenue; billing losses; investment losses
- Other expenses: temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses, legal obligations...
- Financial performance: revenue recognition, cash flow, lost discounts (A/P), payment guarantees, credit rating, stock price

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters? Enterprises need to understand the consequences of downtime to justify investments in operational availability and business continuity (BC)/DR. A first step in developing a BC/DR plan is performing a business impact analysis (BIA), in which critical business processes are identified and prioritized and the costs of downtime are evaluated over time. The BIA is performed by a project team consisting of business unit, security and IT personnel. Key goals of the BIA are to: 1) agree on the cost of business downtime over varying time periods, 2) identify business process availability and recovery time objectives, and 3) identify business process recovery point objectives. The BIA results feed into the recovery strategy and process. Enterprises that have never incorporated a BIA into their application life cycle processes typically initiate a BIA project and use the findings to ensure that current recovery strategies meet business process requirements. For real-time enterprise (RTE) applications, it is critical that BC/DR is built into the life cycle for new applications and business process enhancement projects so that availability and recovery requirements are built into the architecture and design. Action Item: Integrate business continuity management (BCM) and DRM into the enterprise project life cycle to ensure that recovery needs are identified in the initial phases of projects or in changes to business processes and systems.
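
As a rough illustration of the productivity line in the BIA, the sketch below multiplies affected head count, outage duration and burdened hourly rate, then layers on an assumed direct revenue loss to show how the cost compounds over an hour, a day and two days. All of the figures are hypothetical assumptions chosen for the example, not Gartner data.

```python
# Illustrative BIA downtime-cost sketch; all figures are hypothetical assumptions.
# Productivity loss = employees affected x hours out x burdened hourly rate.

EMPLOYEES_AFFECTED = 250          # head count idled by the outage (assumed)
BURDENED_HOURLY_RATE = 75.0       # salary + benefits + overhead, USD/hour (assumed)
REVENUE_LOSS_PER_HOUR = 20_000.0  # direct revenue loss, USD/hour (assumed)

def downtime_cost(hours_out: float) -> dict:
    """Return the cost components for an outage of the given duration."""
    productivity = EMPLOYEES_AFFECTED * hours_out * BURDENED_HOURLY_RATE
    revenue = REVENUE_LOSS_PER_HOUR * hours_out
    return {"hours": hours_out,
            "productivity": productivity,
            "revenue": revenue,
            "total": productivity + revenue}

if __name__ == "__main__":
    for hours in (1, 8, 24, 48):   # per hour, per day, two days...
        c = downtime_cost(hours)
        print(f"{c['hours']:>3} h: productivity ${c['productivity']:>10,.0f}  "
              f"revenue ${c['revenue']:>10,.0f}  total ${c['total']:>10,.0f}")
```

A table like this, produced for each critical business process, is what feeds the RTO/RPO discussion with business process owners.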

Criticality Ratings/Classification Systems
- Class 1 (RTE): customer-/partner-facing functions critical to revenue production, where loss is significant. DR service levels and strategy: RTO = 0-4 hrs.; RPO = 0-4 hrs.; dedicated recovery environment; architecture may include automated failover.
- Class 2: less-critical revenue-producing functions; supply chain. RTO = 8-24 hrs.; RPO = four hours; dedicated or shared recovery environment.
- Class 3: enterprise back-office functions. RTO = three days; RPO = one day; shared recovery environment; may include quick-ship programs.
- Class 4: departmental functions. RTO = five+ days; RPO = one day; quick-ship contracts typical; sourcing at time of disaster where RTOs are lengthy.

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters? Business needs for application service availability/DR should be defined during the business requirements phase. Ignoring this early often results in a solution that doesn't meet needs and ultimately requires significant rearchitecture to improve service. We recommend a classification scheme of supported service levels and associated costs. These drive tasks and spending in development and application architecture, systems architecture and operations. Business managers then develop a business case for a particular service classification. From a DR perspective, this case is developed in the BIA and recovery strategy phases. Service-level definitions should include scheduled uptime, percentage availability within scheduled uptime, and recovery time and point objectives. In this example, Class 1 application services have an RTE strategy and are those whose unavailability would cause the enterprise irreparable harm. Not all applications in a critical business process would be grouped in Class 1; rather, only those deemed most critical or with the greatest downtime effect. The DR architecture for Class 1, and even Class 2, would typically span two physical sites to meet availability/recovery needs. Action Item: Develop a service-level classification system with associated development, infrastructure and operations architecture requirements. A repeatable process is a process that works.
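
One way to make such a classification scheme operational is to encode the classes and their recovery targets in a small catalog that project teams can query during design. The class names and targets below mirror the table above; the lookup helper and its assumption that lower-numbered classes are more expensive are illustrative, not part of any specific methodology.

```python
# Illustrative service-level classification catalog; classes and targets follow the
# table above. The lookup helper is an assumption about how a team might apply it.
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceClass:
    name: str
    rto_hours: float      # maximum recovery time the class delivers
    rpo_hours: float      # maximum data loss the class delivers
    strategy: str

CATALOG = [
    ServiceClass("Class 1 (RTE)", 4,   4,  "Dedicated recovery environment; may include automated failover"),
    ServiceClass("Class 2",       24,  4,  "Dedicated or shared recovery environment"),
    ServiceClass("Class 3",       72,  24, "Shared recovery environment; may include quick-ship programs"),
    ServiceClass("Class 4",       120, 24, "Quick-ship contracts; sourcing at time of disaster"),
]

def classify(required_rto_hours: float, required_rpo_hours: float) -> ServiceClass:
    """Return the least expensive class whose RTO/RPO still meet the business requirement."""
    for svc in reversed(CATALOG):      # assume Class 4 is cheapest, Class 1 most expensive
        if svc.rto_hours <= required_rto_hours and svc.rpo_hours <= required_rpo_hours:
            return svc
    return CATALOG[0]                  # nothing looser fits, so it must be Class 1

print(classify(required_rto_hours=24, required_rpo_hours=4).name)   # -> Class 2
```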

Tactical Guideline: There is no one right disaster recovery data center strategy. Many companies implement all four methods, depending on their application and recovery requirements.

Disaster Recovery Strategies
Client questions (trade-offs: costs, risks, complexity):
- How many data centers should I have? One or many?
- Where should they be located? Is close better than far?
- Should I reduce the cost of DR by using idle assets for other purposes?
Common strategies:
- Production/load sharing
- Production/outsourcing/DR
- Production/development and test DR
- Production/standby DR

Gartner frequently gets questions from clients about data center strategies: How many data centers? How are they used? What is the right strategy for DR? Although there are no universally right answers (the right answer for your organization depends on your business and IT strategy), there are common themes across large enterprises. Although data center consolidation has often been used to reduce cost through economies of scale, consolidation across oceans is fairly rare, because network latency causes unacceptable response times for worldwide applications. However, organizations that do operate a single application instance worldwide achieve greater visibility across business units (such as the supply chain) and reduce the overall costs of operation and application integration. As for the number of data centers, since the Sept. 11 tragedy there has been a slight increase for many organizations seeking protection from disasters. From a DR perspective, the trend is toward sub-24-hour recovery time objectives (RTOs), and often sub-four-hour RTOs, resulting in dedicated recovery environments either internally or at outsourcers. Often, to reduce total cost of ownership, the recovery environment is shared with development/test or production load sharing, or through shared contracts with DR service providers. Furthermore, capacity-on-demand programs are popular for internally recovered mainframe environments.

Tactical Guideline: Zero transaction loss requires transaction mirroring and all the costs associated with it.

Data Recovery Architectures: Redundant Everything
[Diagram: a production site and a secondary site, each with a geographic load balancer, site load balancer, Web server clusters, application server clusters, database server clusters and disk; the sites are linked by transaction replication, database replication and remote copy; LAN and PC tape backup at the production site; point-in-time image and tape backup at the secondary site.]

For application services with short RTO/recovery point objective (RPO) requirements, multisite architectures are used. Often, a new IT service or application is initially deployed with a single-site architecture and migrates to multiple sites as its criticality grows. Multiple sites complicate architecture design (for example, load balancing, database partitioning, database replication and site synchronization must be designed into the architecture). For non-transaction-processing applications, multiple sites often run concurrently, connecting users to the closest or least-used site. To reduce complexity, most transaction processing (TP) applications replicate data to an alternative site, but the alternative databases sit idle unless a disaster occurs. A switch to the alternative site can typically be accomplished in 15 to 60 minutes. Some enterprises prefer to partition databases, split the TP load between sites and consolidate data later for decision support and reporting. This reduces the impact of a site outage, affecting only a portion of the user base. Others prefer more complex architectures with bidirectional replication to maintain a single database image. All application services require end-to-end data backup and offsite storage as a component of the DR strategy. Often, the DR architecture will implement point-in-time replicas to enable synchronized backup and recovery. Application services with a greater-than-24-hour RTO typically recover via tape at the alternative site.

Strategic Planning Assumption: By 2008, more than 75 percent of all enterprises using SAN storage will have deployed a PIT solution to meet service-level objectives (0.7 probability).

Point-in-Time Copies: The Data Corruption Solution
- Controller-based: EMC TimeFinder/Snap, IBM FlashCopy, HDS ShadowImage, StorageTek SnapShot
- Software-based: Oracle 10g Flashback, BMC SQL BackTrack, Veritas Storage Foundation

PIT copy solutions are a prerequisite to building real-time infrastructures (RTIs). Their penetration rate in enterprise sites is at least three to five times greater than the penetration rate of remote copy solutions. There are two key reasons for this disparity. First, PIT copies protect against a frequent source of downtime: data corruption. Second, they shrink planned downtime for backups from hours to seconds or minutes, and they simplify other operational issues such as checkpointing production workloads and application testing. Software-based PIT copy technologies limit storage vendor lock-in and have the potential to leverage their closeness to the protected applications into greater efficiency and tighter application integration. However, with these advantages comes the downside of potentially more complex software architectures (with many tools potentially implemented) and the need for additional testing. Storage controller-based solutions give up some intimacy with the applications to deliver a more platform- and application-agnostic solution, but at the cost of greater storage vendor lock-in. In most situations, choosing between software- and controller-based solutions will be driven by prior investments, internal skills, application scale and complexity, and price.
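
For readers unfamiliar with how a PIT copy rolls back data corruption, the toy volume below keeps a copy-on-write snapshot: the first overwrite of a block after the snapshot saves the old contents, so reverting is simply restoring the saved blocks. This is a conceptual sketch only; it does not describe how any of the listed products are implemented.

```python
# Toy copy-on-write point-in-time snapshot; conceptual only, not any vendor's design.
class Volume:
    def __init__(self, num_blocks: int):
        self.blocks = ["" for _ in range(num_blocks)]
        self._snapshot = None            # block index -> contents saved before first overwrite

    def take_snapshot(self) -> None:
        self._snapshot = {}              # an empty map marks the point in time

    def write(self, index: int, data: str) -> None:
        if self._snapshot is not None and index not in self._snapshot:
            self._snapshot[index] = self.blocks[index]   # copy-on-write: save the old block once
        self.blocks[index] = data

    def revert_to_snapshot(self) -> None:
        for index, old in self._snapshot.items():
            self.blocks[index] = old     # undo every change made since the snapshot
        self._snapshot = {}

vol = Volume(4)
vol.write(0, "ledger v1")
vol.take_snapshot()                      # the PIT copy takes effect in seconds, not hours
vol.write(0, "corrupted!")               # later: data corruption hits the production copy
vol.revert_to_snapshot()                 # recover to the point in time instead of restoring from tape
print(vol.blocks[0])                     # -> ledger v1
```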

Decision Framework: Use application/transaction-level replication where 24x7 continuous application availability (no downtime) is required, and for new application projects.
Strategic Planning Assumption: Through 2007, application- and transaction-level replication will be used by fewer than 10 percent of large enterprises (0.8 probability).

Data Replication Alternatives: Application/Transaction Level
Pros:
- Architected for no downtime, with transparent data access
- Full or partial recovery scenarios
- Loosely coupled application components designed for integrity
- Supports heterogeneous disk
- DBAs understand and have confidence in the solution
Cons:
- Must be designed upfront; significant rearchitecture when it is not
- Requires application/database groups to be responsible for recovery
- No prepackaged solutions that provide integrity and consistency across applications
Product examples: IBM WebSphere MQ or other message-oriented middleware; Teradata Dual Active Warehouse or other built-in application-level replication; also fault-tolerant middleware or virtualization.

The best method of achieving continuous 24x7 availability is building recovery into the application or enterprise architecture itself. This way, enterprises architect transparent access to data, even when some components are down. Users are never, or very rarely, affected by downtime, even in a site failure. Typically, the architecture consists of asynchronous message queuing middleware, but it may be implemented with fault-tolerant infrastructure middleware that replicates each transaction to redundant applications in another location. Application and database architects and database administrators (DBAs) have confidence in this type of solution because it is based on transactions, not bits, blocks or files that lack application/transaction context. However, this type of architecture may take significant effort in the application design stages, and most enterprises do not task their application development organization with recovery responsibilities. Furthermore, this method does not provide a prepackaged solution to handle conflict resolution and ensure consistency and integrity across applications during recovery. Rather, application developers and architects must assess methods to roll back or forward to a consistent point in time, and they may code consistency transactions into applications to enable this. Because of these drawbacks, most enterprises use the infrastructure to enable recovery for the majority of their needs and reserve this method for the most critical subset of applications.
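
A minimal sketch of the queuing pattern described above, under assumptions of a hypothetical two-site deployment: each business transaction is placed on a queue per site and applied independently, so a site outage delays but does not lose work. Real deployments would use durable message-oriented middleware such as WebSphere MQ; the in-process queues and site names below are stand-ins for illustration only.

```python
# Minimal sketch of application/transaction-level replication through message queues.
# Hypothetical example: in-process queues stand in for message-oriented middleware.
import queue
import threading

SITES = {"site_a": queue.Queue(), "site_b": queue.Queue()}
applied = {name: [] for name in SITES}          # what each site's database has committed

def publish(txn: dict) -> None:
    """Enqueue one transaction for every site; the sender never waits on a remote apply."""
    for q in SITES.values():
        q.put(txn)

def apply_worker(site: str) -> None:
    """Each site drains its own queue and applies transactions in arrival order."""
    q = SITES[site]
    while True:
        txn = q.get()
        if txn is None:                         # shutdown sentinel
            break
        applied[site].append(txn)               # stand-in for a local database commit
        q.task_done()

threads = [threading.Thread(target=apply_worker, args=(s,)) for s in SITES]
for t in threads:
    t.start()

publish({"account": "1001", "amount": 250.00})
publish({"account": "2002", "amount": -75.00})

for q in SITES.values():
    q.put(None)
for t in threads:
    t.join()

print(applied["site_a"] == applied["site_b"])   # -> True: both sites hold the same transactions
```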

Decision Framework: Consider replication at the database management system level to provide short RPO and RTO for mission-critical applications, keeping in mind that data integrity across applications must be designed into the applications and transactions.

Additional Data Replication Alternatives: DBMS Log-Based Replication
Pros:
- Often included with the DBMS
- Some products enable read/write use of the second copy and conflict resolution
- Hardware-independent; supports heterogeneous disk
- No special application design required
- Allows flexibility in recovering to a specific point in time other than the last backup or point of failure
- DBAs understand and have confidence in the solution
- Generally low bandwidth requirement; lower network costs
- Can be used for short RPO with longer RTO, reducing software license costs
Cons:
- DBMS-specific solution
- More operational complexity than storage-controller-based replication
- Automation and integration vary by solution
- Does not replicate configuration data stored in files
- Requires active duplication of server resources
- Requires database logs and/or journaling, which could affect production performance
- No assurance of cross-application data integrity
- Complex failback due to synchronization issues

Database log-based replication is a popular method for ensuring recoverability. Logs are read, with changes typically shipped asynchronously (some solutions offer synchronous replication, but it is rarely used), and the changes can be applied continuously, on a delay or to a backup upon disaster (this decision depends on the RTO and RPO). As with transaction-level replication, DBAs and application architects understand and have confidence in the solution. Furthermore, many solutions allow read/write access to the second copy, so it is possible to create failover transparency (if replication is closely synchronized). However, care must be taken to avoid conflicts. To minimize them, most enterprises apply transactions at a primary location and switch to the secondary only when the application cannot access the primary database. A major downside of this approach is that replication is needed for every database; thus, labor and management costs increase. Furthermore, configuration data stored in file systems (rather than the database) is not replicated, and its synchronization must be designed separately, typically through change control procedures. Moreover, cross-application recovery integrity must be built into the solution, for example, by writing synchronization transactions and rolling back or forward to achieve consistency. Despite the drawbacks, thousands of enterprises use database log-based replication to achieve short RTOs or RPOs.
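
The loop below sketches the asynchronous log-shipping pattern just described, under simplified assumptions: change records are read from a source log, shipped to a standby, and applied either continuously or after a configurable delay so that corrupt changes can be caught before they reach the copy. It is not modeled on any specific DBMS; the record format and delay parameter are invented for the example.

```python
# Sketch of asynchronous DBMS log shipping with an optional apply delay.
# Simplified assumptions throughout; not modeled on any particular product.
import time

source_log = []                    # change records produced by the primary DBMS
shipped = []                       # records received at the standby site
standby_db = {}                    # the standby's view of the data
APPLY_DELAY_SECONDS = 0.0          # e.g. 3600 to hold changes for an hour before applying

def commit(key, value):
    """Primary commit: append the change record to the log (the data itself lives elsewhere)."""
    source_log.append({"key": key, "value": value, "committed_at": time.time()})

def ship():
    """Asynchronously forward any log records not yet sent to the standby."""
    while len(shipped) < len(source_log):
        shipped.append(source_log[len(shipped)])

def apply_to_standby(now=None):
    """Apply shipped records whose delay window has elapsed; RPO = unshipped + unapplied changes."""
    now = now or time.time()
    for record in shipped:
        if now - record["committed_at"] >= APPLY_DELAY_SECONDS:
            standby_db[record["key"]] = record["value"]

commit("customer:42", {"balance": 100})
commit("customer:42", {"balance": 80})
ship()
apply_to_standby()
print(standby_db["customer:42"])   # -> {'balance': 80}
```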

Decision Framework: Most relational DBMS products include log-based replication, but the degree of synchronization (speed) and automation varies considerably. Some third-party tools offer multi-DBMS replication support, as well as integrated automation and synchronization.

DBMS Log-Based/Journaling/Shadowing Products
- Oracle Data Guard. Strength: automation, function included. Weakness: failover resynchronization.
- DB2 UDB HADR for v.8.2. Strength: log apply automation; included in ESE. Weakness: cannot read target.
- SQL Server Log Shipping. Strength: function included. Weakness: automation, failover resynchronization.
- Quest SharePlex for Oracle. Strength: bidirectional. Weakness: cost.
- GoldenGate Data Synchronization. Strength: bidirectional, multi-DBMS support. Weakness: still gaining ground outside NonStop.
- Lakeview, Vision, DataMirror. Strength: AS/400 failover automation. Weakness: still gaining ground outside AS/400.
- ENET RRDF. Strength: z/OS. Weakness: z/OS only.
- HP NonStop RDF. Strength: NonStop. Weakness: NonStop only.

Oracle Data Guard is a popular method for DR, as it is included with the Oracle license and, in 10g, has integrated automation for failover, failback and failback resynchronization, which enables changes made at the secondary site to be integrated back into the primary database management system (DBMS), resulting in no lost transactions. Data Guard offers two methods of replication: shipping of archive logs (after commitment), which could mean an RPO of 15 to 30 or more minutes, or shipping of redo log changes, which can run in synchronous mode for zero data loss or in the more commonly implemented asynchronous mode. DB2 log shipping is included and offers asynchronous replication but has no built-in automation. In DB2 UDB v.8.2, IBM added HADR (included with Enterprise Server Edition only), which automates shipping, receiving and applying logs (but not failover). HADR does not enable users to read the target DBMS; for that, you must implement the more complex DB2 replication. SQL Server also lacks failover and failback automation. Quest SharePlex, a popular tool for Oracle replication, provides close synchronization (a few seconds to a minute) and bidirectional support. GoldenGate offers similar technology for multiple DBMS platforms. HP NonStop has strong replication functionality for its DBMS. Suppliers of AS/400 replication technology include Lakeview Technology, Vision Solutions and DataMirror. On the mainframe, ENET RRDF is often deployed in environments with a short RPO and a longer RTO, so that the changed data is maintained at an alternative site but not applied until a disaster (or a test).

Decision Framework: Consider storage controller unit-based replication to achieve short RPO/RTO where enterprises want a single solution to address many applications and data sources across the enterprise.

Data Replication Alternatives Pros/Cons: Storage Controller Unit-Based
Pros:
- Infrastructure-based solution requires less effort from application groups
- Platform and data type independence (including mainframe)
- Single solution for all applications/DBMSs
- Operational simplicity
- Most solutions assure, but do not guarantee, data integrity across servers and logical disk volumes
- Minimal host resource consumption
Cons:
- Data copies are not available for read/write access
- Short recovery time, but user work is interrupted
- Dependent on specific storage hardware and software
- Less storage vendor pricing leverage
- Failover is typically not packaged and requires scripting
- High connectivity costs
- Monitoring/control must be built
- Lack of customer and vendor understanding of the procedures and technology needed to assure data integrity; it is taken for granted that it works
- Homogeneous SAN required

Storage controller unit-based solutions are popular with enterprises seeking to build recoverability into the infrastructure and to use the same solution for all applications and databases. Software on the disk array captures block-level writes and transmits them to a disk array in another location. Because many servers (including mainframes) can be attached to a single disk array, there are fewer replication sessions to manage, greatly reducing complexity. Although solutions generally ensure write-order integrity within each array, only some provide a method for data integrity of the copy across arrays. These solutions started out synchronous and, therefore, have been used mostly across close proximity (under 50 miles). Synchronous solutions are extremely popular in the financial services industry, where RPOs are set to no loss of work. However, asynchronous solutions are slowly gaining ground. The major drawback of storage controller-based solutions is that recovery cannot be transparent to the applications, because control of the target copy is maintained by the primary site; this is an issue only for applications requiring 24x7 availability. Many enterprises use storage controller-based solutions broadly and use transaction-level or DBMS-level replication for the few applications requiring more stringent availability. Another drawback is lock-in to the vendor's storage hardware and software.
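
To make the write-order point concrete, the sketch below tags every block write with a single sequence number shared across all volumes in a consistency group and suspends replication for the whole group on the first transmission error, so the secondary never holds a later write from one volume without the earlier writes from another. It is a conceptual illustration of the consistency-group idea, not any array vendor's algorithm.

```python
# Conceptual consistency-group sketch: one write-order counter across all volumes,
# replication suspended for the whole group on the first error. Not a vendor algorithm.
import itertools

class ConsistencyGroup:
    def __init__(self, volumes):
        self.seq = itertools.count(1)            # single write-order counter for the group
        self.primary = {v: {} for v in volumes}
        self.secondary = {v: {} for v in volumes}
        self.suspended = False

    def write(self, volume, block, data, link_ok=True):
        n = next(self.seq)
        self.primary[volume][block] = (n, data)  # the primary always accepts the write
        if self.suspended:
            return                               # group already halted: keep the copy consistent
        if not link_ok:
            self.suspended = True                # halt *all* volumes, not just the failing one
            return
        self.secondary[volume][block] = (n, data)

group = ConsistencyGroup(["db_data", "db_log"])
group.write("db_log", 0, "begin txn")
group.write("db_data", 7, "new row", link_ok=False)   # replication link hiccup
group.write("db_log", 1, "commit txn")                # not replicated: group is suspended

# The secondary is stale but write-ordered: the commit record never arrives
# without the data change that preceded it.
print(group.suspended, 1 not in group.secondary["db_log"])   # -> True True
```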

Decision Framework: Consider synchronous storage controller unit replication where the two facilities are less than 50 to 60 miles apart, so that network latency does not affect application performance.

Storage Controller Unit-Based Synchronous Products
- EMC SRDF. Strength: strong market leader; cross-platform, mainframe and distributed; m:n consistency groups. Weakness: price, but getting more competitive.
- IBM ESS Metro Mirror (formerly PPRC). Strength: price competitive; consistency groups across mainframe and distributed. Weakness: late entry to market in early 2004.
- Hitachi TrueCopy. Strength: consistency groups (4:1 on the mainframe); small but loyal base; Sun reseller. Weakness: consistency groups are 1:1 for distributed operating system environments.
- HP Continuous Access XP. Strength: small but loyal base; MC/ServiceGuard integration. Weakness: nearly exclusive to HP-UX.

EMC SRDF was first to market with an array-based replication solution in the mid-1990s and, as a result, is the clear market leader. In addition, EMC offers consistency groups across multiple arrays, so that when problems occur, all replication is halted to provide greater assurance that the secondary site has data integrity. Furthermore, unlike the alternative solutions, it is the only one that supports distributed servers and mainframes on the same array. In the late 1990s, when SRDF had little competition, users often complained about pricing; however, with additional market entrants, SRDF is being priced more competitively. IBM's Enterprise Storage Server (ESS) Metro Mirror (formerly Peer-to-Peer Remote Copy, or PPRC) is price-competitive. It supports consistency groups across mainframe and distributed environments, and it now supports Geographically Dispersed Parallel Sysplex (GDPS). Hitachi's TrueCopy supports consistency groups via time-stamping for mainframe and distributed operating system platforms. However, its consistency groups are more functional on mainframe platforms, where they can span four arrays vs. one in the distributed environment. HP licenses Hitachi's TrueCopy and adds value to it for its solution. It can support multiple server operating system platforms, but most customers use it almost exclusively for HP-UX systems.
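
The 50-to-60-mile guidance follows directly from propagation delay: a synchronous remote write cannot complete until the remote array acknowledges it, so every write absorbs at least one round trip. The numbers below are a back-of-the-envelope calculation assuming light in fiber travels roughly 100 miles per millisecond and one round trip per write; real links add switch, protocol and array latency on top of this floor.

```python
# Back-of-the-envelope latency cost of synchronous remote copy over distance.
# Assumes ~100 miles per millisecond one-way in fiber and one round trip per write;
# real links add protocol and equipment overhead on top of this floor.
MILES_PER_MS = 100.0

def added_write_latency_ms(distance_miles: float, round_trips: float = 1.0) -> float:
    one_way_ms = distance_miles / MILES_PER_MS
    return 2 * one_way_ms * round_trips

for miles in (10, 60, 300, 1000):
    print(f"{miles:>5} miles: >= {added_write_latency_ms(miles):.1f} ms added to every synchronous write")
# Around 60 miles the floor is roughly a millisecond per write; at 1,000 miles it is
# about 20 ms, which is why longer distances push enterprises toward asynchronous replication.
```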

Decision Framework: Consider asynchronous storage controller unit replication where the two facilities are beyond synchronous replication distances (more than 50 to 60 miles apart).

Storage Controller Unit-Based Asynchronous Products
- Hitachi TrueCopy Async. Strength: market leader; Sun reseller. Weakness: mostly mainframe installed base.
- Hitachi Universal Replicator. Strength: journal-based; pull technology; Sun reseller. Weakness: 1:1 consistency groups; new to market.
- HP Continuous Access XP Extension. Strength: integrated with MC/ServiceGuard. Weakness: nearly exclusive to HP-UX.
- EMC SRDF/A. Strength: m:n consistency groups (new). Weakness: fairly new to market; few production installs.
- EMC/Hitachi/IBM XRC (controller- and host-based replication). Strength: supported by multiple storage vendors. Weakness: mainframe only.
- NetApp SnapMirror. Strength: proven market leader. Weakness: historically a midrange player.
- IBM ESS Global Mirror. Strength: HACMP integration; 8:1 consistency groups; mainframe and distributed. Weakness: new to market; does not support GDPS (planned for year-end 2004).

Compared with synchronous storage controller unit replication, asynchronous storage controller unit replication is relatively new, making its debut with Hitachi TrueCopy in the late 1990s. Although Hitachi is the market leader in asynchronous storage controller unit-based replication, its installed base and market share pale in comparison with synchronous replication. However, for many enterprises with recovery sites more than 50 to 60 miles apart, asynchronous replication alternatives have reduced the complexity of the recovery environment, because they could migrate from synchronous multihop architectures to asynchronous replication. A synchronous multihop architecture is one in which, because of the greater distance between facilities, a local copy is taken, then split and replicated in asynchronous mode to the secondary site. In this architecture, four to six copies of the data are required vs. the two copies required otherwise. EMC's SRDF has many multihop installations, and many clients are testing the new SRDF/A to assess whether they can migrate from their multihop architectures to a single hop. In April 2004, IBM announced its first asynchronous solution for the ESS, branding it Global Mirror rather than PPRC. Hitachi released its new Universal Replicator in September 2004, enabling more replication flexibility and promising future heterogeneity.

Decision Framework: Consider file-level replication to provide short RPO and RTO for Windows-based applications. Consider volume manager-based replication for applications requiring short RTO/RPO where heterogeneous disk is implemented.

Data Replication Alternatives Pros/Cons
File-based:
- Pros: storage hardware independent; one solution for all applications/data on a server; failover/failback may be integrated and automated; read access to the second copy, supporting horizontal scaling; low software cost
- Cons: file system dependent; more operational complexity than storage controller-based replication; application synchronization must be designed into the application
Volume manager-based:
- Pros: storage hardware independent; one solution for all applications/data on a server; failover/failback may be integrated and automated
- Cons: volume manager dependent; data copies not available for read/write access; application synchronization must be designed into the application; more operational complexity than storage controller-based replication

File-based replication is a single-server solution that captures file writes and transmits them to an alternative location. The major benefits are: 1) it does not require storage area network (SAN) storage, and 2) the files can be used for read access at the alternative location. File-based replication is most popular in the Windows environment, where SAN storage is not as prevalent, especially for critical applications such as Exchange. A drawback to this type of solution is that it is server-based; therefore, management complexity rises compared with storage controller unit replication. Volume manager-based replication is similar to storage controller unit-based replication in that it replicates at the block level and the target copy cannot be accessed for read/write. It requires a replication session for each server and, therefore, has high management complexity. However, no SAN storage is required, and it supports all types of disk storage. Both of these solutions are used for one-off applications/servers where recoverability is critical. Furthermore, both tend to offer integrated and automated failover/failback functionality.

Decision Framework: Consider file-based replication for critical Windows-based applications and volume-based replication for critical applications where heterogeneous disk is deployed.

File and Volume/Host-Based Replication Products
- NSI DoubleTake. Strength: market leader; integrated automation. Weakness: Windows only.
- Legato RepliStor. Strength: EMC; integrated automation. Weakness: lack of focus.
- XOsoft WANSync. Strength: no planned downtime required; integrated automation. Weakness: new to market.
- Veritas Volume Replicator. Strength: market leader; integrated with a commonly used volume manager; multiplatform; VCS integration. Weakness: price; requires VxVM.
- IBM Geo Remote Mirror. Strength: integrated with the AIX volume manager. Weakness: AIX only.
- Softek Replicator. Strength: multiplatform. Weakness: low market penetration.

In file-based replication, NSI DoubleTake was the market leader in 2003, with an estimated $19.4 million in new license revenue. NSI sells primarily through indirect channels (such as Dell and HP) to midmarket and enterprise clients. Many use DoubleTake for Exchange and file/print. In the mid-to-late 1990s, Legato had significant file-based replication market share with its RepliStor product (then called Octopus), but it narrowed its focus (and thus its market share) and has been broadening it again since EMC's acquisition of Legato. RepliStor provides EMC with a solution for enterprises that do not have, or do not want, heterogeneous disk. A newcomer to the market, XOsoft differentiates itself on scheduled uptime: no planned downtime is necessary to implement replication, so one common use is disk migrations. Veritas is the leader in volume manager-based replication, and its product has the same look and feel as its popular volume manager, VxVM. It is also integrated into VCS, where the DR option provides long-distance replication with failover. Veritas improves the manageability of multiple, heterogeneous replication sessions and geographic clusters with CommandCentral Availability, previously called Global Cluster Manager. Softek also offers a multiplatform, volume manager-based solution, but it has low market penetration. Formerly called TDMF Open, it has been rebranded Replicator. IBM offers a volume manager-based solution for AIX called Geo Remote Mirror.

Strategic Imperative: Managing the diversity of the infrastructure will reduce complexity and improve recoverability and the ability to automate the process.

Other Recovery Technologies
- Emerging network-based replication: Topio Data Protection Suite, Kashya KBX4000, FalconStor IPStor Mirroring, DataCore SANsymphony Remote Mirroring, StoreAge multimirror, IBM SAN Volume Controller
- Point-in-time copies or snapshots to quickly recover from data corruption: EMC TimeFinder/Snap, IBM FlashCopy, HDS ShadowImage, Oracle 10g Flashback, BMC SQL Backtrack, Imceda SQL LiteSpeed, StorageTek SnapShot, Veritas Storage Foundation
- Wide-area clusters for automated recovery: HP Continental Cluster, IBM Geographically Dispersed Parallel Sysplex, Veritas Cluster Server Global Cluster Option
- Stretching local clusters across a campus to increase return on investment: HP MC/ServiceGuard, IBM HACMP, Microsoft Clustering, Oracle RAC, Sun Cluster, Veritas Cluster Server
- Capacity on demand/emergency backup for in-house recovery; becoming mainstream on S/390 and zSeries mainframes
- Speeding server recovery with server provisioning and configuration management tools

There are many other recovery technologies that may be used in disaster recovery architectures. A relatively new set of network-based replication products (sometimes called virtualization controllers) moves the replication software from the storage array controller into a separate controller sitting in the storage fabric. This group of suppliers hopes to change the game by chipping away at storage controller unit-based replication market share; they offer similar benefits plus heterogeneous disk support. Clustering (local, campus and wide-area) offers automation for failover and failback, speeding recovery and reducing manual errors. Stretch clustering, in which a local cluster is stretched across buildings or a campus using the same architecture as local clustering, is becoming more popular as a way to leverage already-purchased redundancy for some degree of disaster recovery (with a single point of failure for the data and networks). Servers configured with capacity on demand allow preloaded but idle CPUs and memory to be turned on at the recovery site for disaster recovery testing and in the event of a disaster, reducing the overall cost of dedicated DR hardware. Finally, many enterprises are implementing standard server images (or scripted installation routines) and using these templates (on disk) to restore servers and applications. This is significantly faster than restoring a server from tape, can restore many servers in parallel and significantly reduces manual effort.

Client Issue: What architectures and best practices will enable enterprises to achieve 24x7 availability and disaster protection as required by the business?

Best Practices in Disaster Recovery
- Consider DR requirements in the new project design phase and annually thereafter
- Testing, testing, testing: end-to-end tests where possible, partial tests where not; tabletop tests can help assess capabilities against scenarios as well as procedures; follow up quickly on test findings
- Incident/problem/crisis process: where an IT incident could result in invocation of the DR plan, leverage the problem management process, which should already be in place; damage assessment must weigh the cost of failing over to the alternate location against the time to recover at the primary location
- Use automation to reduce errors in failover/failback
- Use the same automation for planned downtime (which results in frequent testing)

The most important parts of disaster recovery management are: 1) considering DR requirements during the new project design phase, to match an appropriate solution to business requirements rather than retrofitting one at higher cost, and 2) testing: it is only through testing that an enterprise can be confident in its plan and improve it by refining procedures and process. As much as possible, tests should be end-to-end in nature and include business process owners as well as external partners (for example, those that integrate with enterprise systems). When an end-to-end test is not possible, partial tests should be done, with tabletop walkthroughs to talk through the other components of the test. Through frequent testing, participants become comfortable solving many kinds of problems; in a way, they become more agile, so that whatever the disaster, people are likely to react positively and recover the enterprise rather than lapsing into a chaotic state (which would threaten recoverability). Moreover, for IT disasters, enterprises should leverage their incident and problem management processes and pull in the DR team during the assessment process. Another best practice is using automation as much as possible, not only to avoid human error during times of crisis, but also to enable other employees who may be implementing the plan to proceed with recovery even if members of the primary recovery team are unavailable. By using the automation during planned downtime periods, testing becomes part of standard production operations.

Case Study: A large, regional financial services company uses DBMS-based replication to build a DR architecture with an RPO under 15 seconds, no data loss upon failback and a one-hour RTO.

Case Study: DBMS Log-Based Replication Provides an RTO Under One Hour
- Primary production site: production DBMS; Quest SharePlex captures DBMS changes from the Oracle redo logs; a local failover server (HP-UX, MC/ServiceGuard) hosts the standby DBMS (to mitigate the risk of data corruption), the month-end DBMS (for reporting) and test DBMSs
- Secondary production site: disaster recovery DBMS and DR test DBMSs; SQL is applied continuously to the remote DBMSs; in the event of a disaster, replication is reversed for failback
- Asynchronous replication: RPO = 0 to 15 seconds; RTO less than one hour; disaster process tested once per quarter; the architecture also minimizes planned downtime for migrations and upgrades

Client Issue: What architectures and best practices will enable enterprises to achieve 24x7 availability and disaster protection as required by the business? A financial services company processes transaction data with a packaged application based on the Oracle RDBMS. Database access comes from internal users (such as loan officers) and external customers (such as automated teller machines), with some 300,000 transactions per day. To ensure data availability/recovery, the company deployed Quest SharePlex to replicate its 500GB Oracle DBMS: 1) locally, to mitigate data corruption risks in the production database and provide a reporting database, and 2) remotely (500 miles away) as part of its DR plan. SharePlex captures the changes to the DBMS (from the Oracle redo logs) and transmits them to local and remote hosts. Changes are then applied continuously (by converting them to SQL and applying them to the target DBMSs). SharePlex keeps the primary and target DBMSs synchronized, and the company maintains a maximum RPO of 15 seconds. In a site disaster, the target is activated as the primary, any unposted changes are posted, and the active database is updated in the application middleware. The failover process, once initiated, takes approximately one hour. Once the remote site is processing transactions, replication is reversed back toward the primary data center. Although the remote site may be missing some transactions (an RPO of less than 15 seconds), they are not lost: when failback occurs, 100 percent of the transactions will have been accounted for, with zero data loss. The company uses the same architecture to minimize downtime for migrations (Oracle 8i to 9i; Tru64 to HP-UX).
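
The sequence in the case study can be read as a small runbook; the code below simply restates the described failover and failback flow in executable form. The Database and Middleware classes, their attributes and the site names are hypothetical placeholders for illustration, not SharePlex or any vendor interface.

```python
# Illustrative restatement of the case study's failover/failback flow.
# The Database and Middleware classes are hypothetical stand-ins, not vendor interfaces.

class Database:
    def __init__(self, name):
        self.name = name
        self.active = False
        self.pending = []                       # shipped-but-unposted change records

    def apply_pending(self):
        applied, self.pending = self.pending, []
        return applied                          # under 15 seconds of work in the case study

class Middleware:
    def __init__(self, db):
        self.active_db = db                     # where the application tier sends its SQL

primary = Database("primary-site")
standby = Database("recovery-site")
mw = Middleware(primary)

# --- disaster at the primary site ---
standby.active = True                           # 1. activate the target DBMS as the primary
standby.apply_pending()                         # 2. post any unapplied changes
mw.active_db = standby                          # 3. repoint the application middleware
replication_direction = (standby, primary)      # 4. reverse replication for the later failback

# --- failback once the primary site is restored ---
primary.pending.extend([])                      # changes captured at the recovery site ship back
primary.apply_pending()                         # nothing is lost: 100 percent accounted for
primary.active, standby.active = True, False
mw.active_db = primary
replication_direction = (primary, standby)      # normal replication direction restored

print(mw.active_db.name)                        # -> primary-site
```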

Recommendations
- Make your disaster recovery management processes mature, so they are integrated with business and IT processes and meet changing business requirements. Infuse a continuous improvement culture.
- Plan for disaster recovery and availability requirements during the design phase of new projects, and reassess production systems annually.
- Test, test and test more.
- Use automation to reduce the complexity and errors associated with failover/failback.
- Select the replication methods that match business requirements for RPO and RTO. If a single infrastructure-based solution is desirable, consider storage controller-based replication. If 24x7 continuous availability is required, consider application-, transaction- or database-level replication.
