Delivering Fat-Free CDP with Delphix
Using Database Virtualization for Continuous Data Protection without Storage Bloat
White Paper
Revision: June 2012

You can find the most up-to-date technical documentation at: http://www.delphix.com/support

The Delphix Web site also provides the latest product updates. If you have comments about this documentation, submit your feedback to: help@delphix.com

© 2012 Delphix Corp. All rights reserved. The Delphix logo and design are registered trademarks of Delphix Corp. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.

Delphix Corp.
275 Middlefield Road, Suite 50
Menlo Park, CA 94025
www.delphix.com

Enterprise Architecture Issues

Global organizations have significantly digitized their business processes and information over the past decade. Today, applications that run on databases touch nearly every Global 1000 process and interaction with employees, partners, and customers. To ensure that a digital business can maintain operations and uptime, firms have spent billions building out enterprise IT architectures.

A primary function of enterprise IT is safeguarding and ensuring the availability of data, the most valuable of which is often stored in databases. Without a strong recovery strategy, a large organization would grind to a halt following any significant outage or disaster. To ensure that never happens, firms have built extensive disaster recovery and backup systems. While these solutions can synchronously replicate up-to-date information to remote locations, this approach on its own can also backfire. If a critical database is corrupted, business processes such as supply chain management or order processing can stop completely. Even worse, in many cases the very solution for protecting information (the disaster recovery systems) can increase the damage by replicating corrupted data to backups and failover systems.

A recent example of this cascading failure occurred at Salesforce.com in August 2011. Despite using industry best practices and state-of-the-art failover and backup technology, Salesforce.com suffered a major outage due to a corrupted database. Salesforce.com's storage systems replicated the corruption to their alternate datacenter, so there was no fast recovery mechanism in place.

There are several key lessons in the Salesforce.com outage. First, replication strategies designed to protect an organization's operations can become a weapon instead of a shield. Salesforce.com is hardly unique; the problem of storage technology propagating corrupted data across the enterprise is very common among large organizations.

Second, as organizations scale up their digital operations, the very nature of that scale changes the notion of an edge case. Put differently, a one-in-a-million chance of failure is a real problem if you process a billion transactions a month. Over time and in large environments, edge-case failures are a certainty, and it is imperative that IT organizations put protective measures in place ahead of such failures.

Third, synchronous replication technologies do not provide sufficient coverage. Any organization that has had to resort to restoring from tapes can attest to the cost, pain, and time of trying to recover from unreliable and antiquated backup mechanisms.

Existing methods for data protection are not sufficient in a modern environment at today's scale. Outages will become more frequent and more costly as digital businesses continue to scale.
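
The one-in-a-million remark in the second lesson can be checked with simple arithmetic. The sketch below assumes that transactions fail independently and uses the illustrative figures from the text (a failure probability of one in a million per transaction, one billion transactions per month). The function names are hypothetical and chosen only for this illustration.

```python
import math


def expected_failures(p_failure: float, transactions: int) -> float:
    """Expected number of failures, assuming independent transactions."""
    return p_failure * transactions


def prob_at_least_one(p_failure: float, transactions: int) -> float:
    """Probability of one or more failures, assuming independence.

    Computes 1 - (1 - p)^n using log1p/expm1 for numerical stability.
    """
    return -math.expm1(transactions * math.log1p(-p_failure))


if __name__ == "__main__":
    p = 1e-6            # one-in-a-million chance per transaction (from the text)
    n = 1_000_000_000   # one billion transactions per month (from the text)
    print(f"Expected failures per month: {expected_failures(p, n):.0f}")
    print(f"Probability of at least one failure: {prob_at_least_one(p, n):.6f}")
```

Under these assumptions the expected count is about 1,000 failures per month, which is why the text treats edge-case failures at this scale as a certainty.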

Outage Costs

The report Trends in IT Value, by The Standish Group, provides useful estimates of the costs of system outages for key business processes. At one end of the spectrum, The Standish Group estimates that each minute of downtime in a securities trading operation costs $73,000. That translates to roughly $4.4 million per hour and $105 million per day of outage costs. At the other end of the spectrum, an outage in an enterprise email system is estimated to cost $1,900 per minute, i.e. $114,000 per hour and $2.7 million per day. This amount is smaller but still significant. In between those extremes are the key applications that power most large organizations: ERP ($888,000 per hour of outage), order processing ($798,000 per hour of outage), and supply chain ($690,000 per hour of outage).

Source: Trends in IT Value Report (Standish Group)

For a large investment bank, the costs of a major outage can reach hundreds of millions of dollars over a few days.

Traditional Data Protection Solutions

Traditional data protection solutions for disaster recovery include short-term rollback technologies and longer-term backup/restore technologies. Each addresses a piece of the overall protection problem, but neither can fully prevent dramatic outage costs (e.g. due to complex, corruption-triggered events). In addition, each brings significant storage costs via storage bloat. These increased costs can limit enterprise-wide deployment.

[Figure: fast-recovery layers, with labels Production, SRDF, and Disaster Recovery. Flashback: 4 hours (continuous); on production, limited flexibility. Business Continuance Volumes: 5 days (hourly/daily); susceptible to array errors, patches.]

While many organizations have short-term rollback options like Oracle Flashback, these tend to work only for a short window of time on production systems, often as little as 4 hours. The use case for technologies such as Flashback is immediate rollback of user error (e.g.
a DBA accidentally drops a table, which breaks referential integrity and forces an application error). Short-term rollback technologies usually do not retain days or weeks of data and cannot help if the production database itself is corrupted and unavailable due to an event like a problematic database software patch.

Enterprise storage arrays often provide less granular coverage through full copies of storage volumes, but the recovery window typically covers only a few days due to the cost of full system copies. Snapshots can be more space efficient and provide a longer window of
coverage, but often come at the cost of production performance, a tradeoff most companies are not willing to make.

In addition to primary recovery strategies, most organizations also have longer-term disk and tape backup options, including offsite tape archives. While these will likely provide the necessary coverage window, the process of setting up a restore volume, restoring data via backup software, applying logs to get to a specific point in time, and configuring database servers can dramatically increase the duration and cost of downtime. Even worse, the time to retrieve offsite tapes alone can be counted in days or even weeks, adding insult to injury. At a large organization, a week of outage in a key system can mean hundreds of millions of dollars lost and damaged credibility for IT and the business itself.

[Figure: the fast-recovery layers (Flashback, Business Continuance Volumes) extended with slow-restore layers under Disk and Tape Backup, Archive. 1-2 weeks (daily): requires lengthy restore, no granularity. Months to years: tape shipping time, unreliable restores.]

While traditional options for near-term and long-term data recovery each serve a purpose, they are of limited value in a complex database failure. Data is either missing or cannot be recovered within an acceptable period of time. While a significant outage may be an edge case, the potential cost is so large that protective mechanisms need to be put in place.

Delphix Virtualization Creates a Bloat-Free Data Recovery Layer

Delphix provides a powerful addition to existing replication, backup, and recovery solutions.
By virtualizing copies of production databases and managing all updates to these copies, Delphix brings extended and flexible firefighting capabilities to database outages generated by complex events (e.g. late identification of corruption). Moreover, Delphix does this without increasing the storage footprint.

[Figure: Delphix TimeFlow shown alongside the traditional fast-recovery and slow-restore layers. Fast: quickly provision parallel firefighting environments. Complete: continuous granularity with log shipping (superior RPO). Extended: 30:1 efficiency enables long retention periods.]

Delphix enables DBAs to instantly roll back a virtual copy of a corrupted production database to the last known good state. At the same time, Delphix can retain a very long tail of information, so that the customer can perform this rollback across an extended period of time. Database teams can open virtual databases quickly and easily, in a few minutes and as little as three clicks, without waiting for lengthy restores to a new target volume. As a result, a business gains both the instant, granular recovery capabilities of rollback technologies and the long-term coverage of a tape-based backup/recovery system, while eliminating the need to wait for prolonged restores.

With its patented TimeFlow technology, Delphix retains all updates to copies of production databases, down to the second. A Delphix user can pick any point in time in the life of a protected database and instantly open a virtual copy of that database at that point in time. Users can even set specific transaction boundaries, such as a specific SCN for an Oracle database.

With its unique ability to compress and eliminate redundancy in database storage, Delphix can deliver reduction ratios as high as 30:1, so a business can store 30 days' worth of recoverability in the space of a single copy. Traditional continuous data protection (CDP) technologies, in contrast, generally require enormous amounts of storage, as much as 7x the original production copy, for a similar window of recoverability. As a result, the Delphix efficiency advantage enables far longer retention periods to recover to a known good state following an outage or disaster.

In the Salesforce.com example above, Delphix could have dramatically reduced the effects of the corrupted database outage. Upon realizing that the corrupted data blocks had been propagated to the recovery site, a Delphix admin could have simply moved the Delphix TimeFlow slider back to the last known good state.
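
The storage figures quoted above can be compared with a rough calculation. The sketch below takes the paper's numbers at face value: a 30:1 reduction, so that 30 days of recoverability fit in the space of a single copy, and up to 7x the production copy for traditional CDP. The 1 TB database size, the naive daily-full-copy baseline, and the function names are illustrative assumptions, not Delphix measurements.

```python
def full_copy_storage_tb(db_size_tb: float, days: int) -> float:
    """Baseline: one full physical copy kept per day of retention."""
    return db_size_tb * days


def reduced_storage_tb(db_size_tb: float, days: int,
                       reduction_ratio: float = 30.0) -> float:
    """Storage if the daily copies shrink by the given ratio (30:1 per the paper)."""
    return full_copy_storage_tb(db_size_tb, days) / reduction_ratio


def traditional_cdp_storage_tb(db_size_tb: float, multiple: float = 7.0) -> float:
    """Storage for traditional CDP as a multiple of production (up to 7x per the paper)."""
    return db_size_tb * multiple


if __name__ == "__main__":
    size_tb = 1.0   # hypothetical production database size
    days = 30       # retention window discussed in the text
    print("Daily full copies:", full_copy_storage_tb(size_tb, days), "TB")
    print("With 30:1 reduction:", reduced_storage_tb(size_tb, days), "TB")
    print("Traditional CDP at 7x:", traditional_cdp_storage_tb(size_tb), "TB")
```

For the hypothetical 1 TB database, this gives 30 TB for daily full copies, 1 TB at a 30:1 reduction, and 7 TB for traditional CDP at the quoted upper bound.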

Parallel Firefighting Environments to Determine Last Good State

If a previous good state in time is unknown, a Delphix admin can open multiple parallel firefighting environments at different points in time simultaneously. With database virtualization, Delphix can open up to 35 copies of a database at different points in time, all from a single, shared data footprint. Virtual databases (VDBs) look and behave like normal, full read/write databases, with performance parity to physical database copies.

[Figure: timeline from 11:00 AM to 4:00 PM with a marker at 1:21:05 PM; sandboxes opened at different points are labeled "No Corruption" or "Corruption!". Steps: 1. Problem first detected. 2. Quickly open parallel firefighting sandboxes for troubleshooting. 3. Identify root cause and determine resolution.]
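
The search for the last good state described above can be sketched as a set of concurrent checks. This is an illustration, not Delphix's implementation or API: `is_corrupted` stands in for whatever validation a team would run against a copy opened at a given time, and every name below is hypothetical. The default of 35 concurrent checks mirrors the limit quoted in the text, and the timeline values come from the figure with an arbitrary date.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


def candidate_times(start: datetime, end: datetime,
                    step: timedelta) -> List[datetime]:
    """Evenly spaced points in time from start to end, inclusive."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += step
    return times


def bracket_corruption(
    times: List[datetime],
    is_corrupted: Callable[[datetime], bool],
    max_parallel: int = 35,  # limit quoted in the text
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Check all candidate points concurrently.

    Returns (last clean point, first corrupted point); either may be None.
    """
    ordered = sorted(times)
    if not ordered:
        return None, None
    workers = min(max_parallel, len(ordered))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(is_corrupted, ordered))
    last_good = None
    for t, bad in zip(ordered, results):
        if bad:
            return last_good, t
        last_good = t
    return last_good, None


if __name__ == "__main__":
    # Simulated check: corruption first appears at 1:21:05 PM (hypothetical date).
    corruption_start = datetime(2012, 6, 1, 13, 21, 5)
    hourly = candidate_times(datetime(2012, 6, 1, 11, 0),
                             datetime(2012, 6, 1, 16, 0),
                             timedelta(hours=1))
    good, bad = bracket_corruption(hourly, lambda t: t >= corruption_start)
    print("Last clean point:", good)
    print("First corrupted point:", bad)
```

Running the same function again with a finer step between the two returned points narrows the window further, for example from hours to minutes and then to seconds.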

VDBs can be used as failover databases or can be recovered to a physical target using the Delphix V2P feature. As a result, Delphix acts as a flexible safety net for large organizations that require critical databases to be running and available at all times.

Delphix can be deployed as an additional protective measure for enterprise databases, and does not impact other backup technologies or solutions. Delphix installs as a virtual appliance in a private cloud in minutes and connects to databases through standard APIs. For instance, with Oracle databases, Delphix maintains synchronization with an Oracle database through the RMAN API, and does not interfere with RMAN processes used by backup solutions (Delphix is an Oracle Gold-Certified Partner).

Conclusion

The scale of modern business operations has greatly increased the risk of system outage. Simply put, the effective likelihood of a failure due to outage or error is much greater when transactions occur at the rate of millions instead of thousands. For global organizations, the cost of downtime is simply too great; traditional recovery solutions are not comprehensive enough on their own.

Delphix adds a powerful extended recovery layer to global IT operations. By virtualizing the information in production databases, Delphix can prevent tens to hundreds of millions of dollars in downtime costs from a single outage incident. Organizations looking to increase agility by rolling out new applications and technologies should evaluate Delphix as a fundamental layer in their data protection architecture.

Delphix provides four significant, differentiated benefits to enterprise customers:

- Ability to quickly provision multiple parallel firefighting environments simultaneously
- Full granularity with log synchronization and application for superior RPO (recovery point objective)
- Superior RTO (recovery time objective) with the ability to open VDBs in minutes
- Ability to maintain long retention windows with 30:1 TimeFlow reduction ratios