Application Defined Continuity
Sanovi DRM for Oracle Database
White Paper
Copyright © 2012, Sanovi Technologies

Table of Contents

Executive Summary
Introduction
Audience
Oracle Protection Overview
Oracle Protection Using Logs
Oracle Log Solution Architecture
Benefits of Sanovi DRM for Oracle
Choice of Replication
  Oracle DataGuard
  Sanovi File Replication
  Storage-based Block Replication
RPO Monitoring
Solution Boundary Conditions
DR Lifecycle
Sanovi DRM Use-case Scenarios
  Monitoring RPO and DR Solution Health
  Failover and Recovery
  DR Drills
  Reports and Analytics
Summary

Executive Summary

Sanovi DRM provides cost-effective and comprehensive disaster recovery (DR) solutions for Oracle databases, which are the backend for many enterprise applications, such as core banking and ERP solutions. It is enterprise-class disaster recovery management software that offers monitoring, recovery automation, and DR drill automation for Oracle databases. This enables end users to deploy a DR solution quickly and ensures that the Oracle database on the DR site recovers predictably, with little or no manual intervention. Sanovi DRM helps end users meet their recovery metrics reliably and predictably.

Sanovi DRM interoperates with the heterogeneous technologies used to run and replicate Oracle data between the primary site and the DR site, such as the platform OS and various third-party file or block replication products.

Introduction

Businesses face an ever-increasing need to recover critical databases and applications within set recovery goals. This can be a complex issue if not handled carefully, because replicating data to the DR (secondary) site does not in itself ensure that the database will recover and come up.

This white paper introduces Oracle database protection concepts. It also discusses how Sanovi DRM provides a comprehensive best-practice approach for monitoring and automating Oracle recovery and drills.

Audience

This white paper is intended for Oracle DBAs, system integrators, DR solution designers, and system administrators.

Oracle Protection Overview

Oracle is a leading RDBMS and a data store for several critical enterprise applications. An RDBMS presents data to the user as relations and stores the data in tables made up of rows and columns. A database transaction is an operation on the database that results in one or more rows and/or columns being altered in a coherent and reliable manner. Applications that use the database access data through transactions.
When replicating a database from a primary server to a DR server, the coherence of the relationships between the data elements of the database must be preserved. That is why replicating a live database is not as simple as copying database files to the DR site.

Oracle Protection Using Logs

Oracle recommends, as a best practice, protecting its database by shipping Oracle log files from the production server to the DR server. An Oracle database has three important types of physical files:

- Control files: These files are used for database administration.
- Log files: These files keep records of the changes made to the data, which are used during the recovery of the database.
- Data files: These files contain system and application information about the database.

Oracle protection using log files entails creating a log file on the production server where the live database is running. This log file is made available on the DR server and applied to the Oracle database on the DR side. Oracle protection using log files makes use of the archive log facility of the Oracle database to protect and recover the data. In this solution, a server with a standby database is set up on the remote site with an initial copy of the complete database from the production server. During normal operations, the archive logs are:

1. Periodically dumped on the production server.
2. Replicated or transferred to the DR (remote) server.
3. Applied to the database on the DR (remote) server.

Oracle Log Solution Architecture

A typical Oracle DR log solution architecture is shown in the figure below.

[Figure: Oracle DR log solution architecture, showing the application server, the network, the primary Oracle database, and the DR Oracle database.]

The following requirements must be met for the Oracle log solution to be viable:

- There must be two sites. The first site is termed the primary site; under normal operating conditions, it has the production database. The second site is termed the remote site; under normal operating conditions, it has the database operating as a standby.
- The two sites are connected over an IP network. The network connection bandwidth must be sufficient to support replication of log files from the primary site to the remote site.
- The production database must operate in archive log mode and the DR database must operate in standby mode.
- The databases on the production and remote servers can reside on an external or internal storage medium, and the platform can run in a physical or virtual hardware environment.
- The platform and OS at the secondary (DR) site must be the same as the production platform and OS.
- The Sanovi DRM server should have network access to the production and remote database servers to monitor and manage the DR solution.

Benefits of Sanovi DRM for Oracle

Sanovi DRM for Oracle offers several features and benefits. Some of them are:

- RPO monitoring: Reports on the transaction lag between the DR and the primary database.
- Events and alerts: Monitors and provides alerts for over 40 conditions that impair the health of the DR solution.
- Automation of DR lifecycle: Provides out-of-the-box automation of the protection and recovery process based on best practices.
- DR drill automation: Automates the steps to run a DR drill.
- Heterogeneous replication support: Supports various replication technologies to implement the log-based solution.
- Reports and analytics: Provides easy-to-publish reports on compliance with RPO and RTO metrics.

Choice of Replication

The choice of replication technology to transport log files from the primary to the DR server is mainly driven by the Recovery Point Objective (RPO) of the solution. A zero RPO requires synchronous replication, while a non-zero RPO can be supported using file-level or storage-level asynchronous replication between the primary and the DR server. Sanovi DRM supports the following replication technologies:

Oracle DataGuard

Oracle Enterprise Edition ships with DataGuard, which provides the replication capability to ship log files from the primary to the DR database. DataGuard is well integrated with Oracle and supports both synchronous and asynchronous replication.

Sanovi File Replication

Sanovi DRM ships with built-in file replication, which integrates seamlessly with the Oracle database. This replication can be used with the standard as well as the enterprise edition of Oracle. Sanovi File Replication is ideal when the recovery point is more than 15 minutes.

Storage-based Block Replication

Products such as Hitachi TrueCopy, EMC SRDF, and IBM Global Mirror offer block-level replication of the storage volumes that the database log files reside on. The processing required for replication is handled by the storage rather than the host CPU.

Across all of the above replication choices, Sanovi DRM offers an abstraction and reports on recovery metrics such as RPO and on database transaction details.
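
The relationship between the RPO and the replication choices above can be sketched in a few lines of Python. This is only an illustration: the function name and the returned labels are hypothetical, and the rules encode nothing beyond what this section states (zero RPO needs synchronous replication, and Sanovi File Replication is suggested when the recovery point exceeds 15 minutes).

```python
from datetime import timedelta
from typing import List


def candidate_replication(rpo: timedelta) -> List[str]:
    """Return the replication options named in this paper that fit a target RPO.

    Hypothetical helper for illustration only; it is not part of Sanovi DRM.
    """
    if rpo < timedelta(0):
        raise ValueError("RPO cannot be negative")
    if rpo == timedelta(0):
        # Zero RPO requires synchronous replication. DataGuard is the only
        # option this paper explicitly describes as supporting it.
        return ["Oracle DataGuard (synchronous)"]
    # Any non-zero RPO can be served by asynchronous replication.
    options = [
        "Oracle DataGuard (asynchronous)",
        "Storage-based block replication (asynchronous)",
    ]
    if rpo > timedelta(minutes=15):
        # The paper suggests Sanovi File Replication above 15 minutes.
        options.append("Sanovi File Replication")
    return options
```

For example, `candidate_replication(timedelta(minutes=30))` includes Sanovi File Replication, while `candidate_replication(timedelta(minutes=5))` does not.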

RPO Monitoring

Sanovi DRM provides a real-time view of the RPO, a key metric for the Oracle DR solution. The current recovery point is calculated as the difference between the current transaction on the primary database and the last applied transaction on the DR database. Using the reported RPO value, the user knows the point up to which the DR database will recover if the primary database fails. When the RPO rises above a set threshold, an alert is generated to warn the user.
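
A minimal sketch of this calculation is shown below. It assumes that each transaction can be represented by a timestamp, which the paper does not specify; the function names and the alert message are hypothetical and do not describe Sanovi DRM's actual interface.

```python
from datetime import datetime, timedelta
from typing import Optional


def current_rpo(primary_current_txn: datetime, dr_last_applied_txn: datetime) -> timedelta:
    """Lag between the primary's current transaction and the last one applied on DR."""
    lag = primary_current_txn - dr_last_applied_txn
    # The DR database cannot be ahead of the primary; clamp clock skew to zero.
    return max(lag, timedelta(0))


def rpo_alert(rpo: timedelta, threshold: timedelta) -> Optional[str]:
    """Return a warning message when the measured RPO exceeds the configured threshold."""
    if rpo > threshold:
        return f"RPO {rpo} exceeds configured threshold {threshold}"
    return None


# Example: the DR database is 10 minutes behind a primary with a 5-minute threshold.
primary = datetime(2012, 1, 1, 12, 10)
standby = datetime(2012, 1, 1, 12, 0)
lag = current_rpo(primary, standby)
message = rpo_alert(lag, timedelta(minutes=5))
```

Here `lag` is ten minutes, so `message` carries a warning; with a 15-minute threshold, `rpo_alert` would return `None`.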

Solution Boundary Conditions

While setting up log dumping on the primary server and log apply on the DR server is straightforward, several conditions can cause the solution to derail and fail. Sanovi DRM provides monitoring and policy-based remediation of several such conditions. Here are a few examples of conditions that impair the solution:

- The folder to which the log file is dumped on the primary database server is full, causing the production database to stop.
- The folder to which the log file is copied on the DR database server is full, causing replication to stop.
- The network link is down and the outage causes replication to stop. This condition must be monitored, and replication should be restarted once the network becomes available.
- The log-apply process on the DR server stops due to apply errors.
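
The four example conditions above could be checked along the following lines. This is a sketch under stated assumptions: the `SolutionStatus` fields, the 90% "full" threshold, and the message texts are all hypothetical and are not taken from Sanovi DRM. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from dataclasses import dataclass
from typing import List


@dataclass
class SolutionStatus:
    """Snapshot of the conditions listed above (hypothetical structure)."""
    primary_log_dir_used_pct: float
    dr_log_dir_used_pct: float
    network_link_up: bool
    log_apply_running: bool


def used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def find_impairments(status: SolutionStatus, full_pct: float = 90.0) -> List[str]:
    """Return one message per boundary condition that currently holds."""
    problems = []
    if status.primary_log_dir_used_pct >= full_pct:
        problems.append("Primary log folder is nearly full; production database may stop")
    if status.dr_log_dir_used_pct >= full_pct:
        problems.append("DR log folder is nearly full; replication may stop")
    if not status.network_link_up:
        problems.append("Network link is down; restart replication when it returns")
    if not status.log_apply_running:
        problems.append("Log-apply process on the DR server has stopped")
    return problems
```

In practice the two percentages would come from `used_percent` run on each server's log folder; they are passed in as plain numbers here so that the policy check stays independent of where the measurement is taken.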

The table below lists the top conditions that are monitored and alerted on by Sanovi DRM.

Event Description | Event Severity | Event Impact
Oracle process not running | | Affects Oracle availability
Log replication failed due to PFR failure | | Affects normal mode
Database instance down/not available | | Affects application availability/normal mode
Database server down/not available | | Affects application availability/normal mode
DB log path - Out of space | | Log replication failure due to insufficient disk space
WAN link error | | The WAN link has toggled its state
Oracle process not running on DR server | | Affects the availability of Oracle database
Oracle not opened and/or not connecting to DR server | | Affects Oracle availability
Oracle Listener not running on DR server | | Affects Oracle availability
Database instance down/not available on DR server | | Affects application availability/normal mode
Intermittent log failure in PFR | SERIOUS | RPO may be impacted
An operation on Oracle database failed on primary server | | Execution of current continuity operation is affected
An operation on Oracle database failed on DR server | | Execution of current continuity operation is affected
Network connectivity to primary server is lost | | The group cannot be managed
Network connectivity to DR server is lost | | The group cannot be managed
Network connectivity to DR server (current production) is lost | | Current production server cannot be managed
Database instance is down on the DR (current production) server | | Production data is not available for applications
Database instance is down on the primary (currently non-production) server | | Fallback operation cannot be initiated
Database is active and is under Sanovi DRM management on DR (current production) server | | Production data is available for applications
Database is active and is under Sanovi DRM management on primary (currently non-production) server | | Fallback operation can be initiated
Dumping of archive logs failed in Normal Copy | WARNING | Affects RPO and continuity
Applying of archive logs failed in Normal Copy | WARNING | Affects RPO/RTO and continuity
Normal Copy failed due to failure of dumping or applying of logs | | Affects continuity, which in turn affects RPO/RTO
Log sequence missed in the archive logs on DR | | Affects DR; continuity operations cannot proceed
Log volume on the primary server is nearing threshold | | Affects continuity
Log volume on the DR server is nearing threshold | | Affects continuity
Log volume on the current production (DR) server is nearing threshold; the group is in Failover mode | | Affects continuity
Oracle configuration STANDBY_FILE_MANAGEMENT is not set to AUTO | | DR will not be in sync with production if tablespaces or data files are added or deleted in production
Oracle production database has been restored from an older backup | | DR database is out of sync with the production database
Archive logs on the Oracle production database have been reset | SERIOUS | DR database is out of sync with the production database
Control files on the DR need to be refreshed | | Affects the consistency of the database

DR Lifecycle

Every DR solution has a lifecycle that it operates around. The phases of the DR lifecycle are:

- Normal Copy
- Failover
- Switchover
- Switchback

Normal Copy

This is the everyday mode of operation, when the production server is up and running. During this operation, Oracle archive logs are created, replicated to the DR site, and applied on the DR server; nothing else is transferred. Sanovi DRM provides the Normal Copy workflow, which automates the following tasks:

- Ensure that the primary DB is configured to be in ARCHIVELOG mode. This means that the DB will archive the redo logs.
- Ensure that the DR database, which is the standby DB, is in manual mode. This is required to apply the archive logs that are transported from the primary to the DR server.
- Periodically dump archive logs on the primary DB. Options to this action dictate the frequency of dumping the logs. These options can be used to tune the effect of archive log dumping on DB performance.
- Replicate the archive logs dumped on the primary server to the DR standby server.
- Periodically apply the archive logs to the DR database.
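
The apply step of Normal Copy depends on archive logs arriving in sequence, which is why a missed log sequence on the DR side is one of the monitored events. The sketch below illustrates that ordering rule only. It assumes archive logs are identified by consecutive integer sequence numbers; the function name is hypothetical and the code does not interact with Oracle or Sanovi DRM.

```python
from typing import Iterable, List, Tuple


def apply_in_order(last_applied: int, received: Iterable[int]) -> Tuple[int, List[int]]:
    """Apply contiguous archive log sequences and report any gap.

    Returns the new last-applied sequence number and the list of missing
    sequence numbers that block further apply (empty when nothing is blocked).
    """
    # Ignore duplicates and logs that were already applied.
    pending = sorted({seq for seq in received if seq > last_applied})
    for seq in pending:
        if seq != last_applied + 1:
            break  # gap found: later logs cannot be applied yet
        last_applied = seq  # stands in for applying the log on the DR database
    remaining = [seq for seq in pending if seq > last_applied]
    missing = list(range(last_applied + 1, remaining[0])) if remaining else []
    return last_applied, missing


# Example: log 13 never reached the DR server.
new_last, gap = apply_in_order(10, [11, 12, 14, 15])
```

In this example logs 11 and 12 are applied, `new_last` is 12, and `gap` is `[13]`: logs 14 and 15 stay pending until the missing log is replicated again.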

Switchover

In this operation the DR site becomes the primary site, the primary site becomes the DR site, and reverse replication is established from the DR site to the original primary site. This workflow is used to exercise the solution during a DR drill. The workflow for the switchover operation is shown in the BPM diagram below.

Switchback

During this operation, control is transferred back to the primary site (the primary site is brought into production again) and the original DR site functions as the DR site once more. The workflow for the switchback operation is shown in the BPM diagram below.

Failover

This operation is triggered when a disaster strikes the primary Oracle server, or when it is triggered manually. During this operation the DR Oracle server comes online and becomes the production server. The following actions constitute the failover operation:

- Ensure that the DR (remote) DB is in manual mode.
- Stop replication of log files from the primary server to the DR server. If the outage has caused the network between the sites to go down, the replication process may already be down.
- On the DR server, apply any remaining archive logs, switch the DR standby DB from manual mode to read-write mode, and ensure it is accessible for operations.

The switchover and switchback phases of the lifecycle are typically used for testing the DR solution.

Sanovi DRM Use-case Scenarios

Monitoring RPO and DR Solution Health

Sanovi DRM provides a central dashboard to monitor the real-time health and recovery metrics of critical applications. The DR solution health is composed of the replication status, the DR server status, and the Normal Copy process status. When the status is green, all parameters and processes regarding DR are active and healthy. The RPO meter gives the real-time calculated value of the recovery point. The configured value versus the current measured value is shown in the bar graph.

Failover and Recovery

Sanovi DRM for Oracle dramatically reduces the need for manual operations to fail over and recover the Oracle database at the DR site. Sanovi DRM delivers pre-packaged workflows that are based on industry best practices and hence ensure that the database is recovered in a consistent manner irrespective of the replication technology. The screen shot below displays the buttons available in case of a disaster.

DR Drills

The Sanovi DRM Drill module for Oracle provides the required workflow automation for conducting DR drills. There are three phases of activity in performing drills:

- Pre-drill phase
- Execution phase
- Post-drill phase

In the pre-drill phase, basic validation of various configurations and resources is done. Depending on the complexity of the infrastructure, this task usually takes a few days. The second phase is the execution of the drill.
This is usually a completely manual operation, where the application, database, and infrastructure teams have to be present. The third and final phase is the post-drill phase, which involves reporting and analytics and is a manual, Excel sheet-driven exercise.

Sanovi DRM provides solutions to dramatically reduce the time and manual effort involved in performing DR drills. Customer experience indicates an 80% reduction in time and an over 70% reduction in the number of people required to perform these drills. The screen shot above is a snapshot of the test report for an Oracle group.

Reports and Analytics

Sanovi DRM provides reports that are required for regulatory and audit purposes. The software also provides reports that are useful in capacity planning, to ensure that the underlying infrastructure can meet business recovery SLAs. Reports include:

- RPO deviation from configured value over time
- RTO deviation from configured value over time
- Replication data lag over time
- Workflow execution instances versus time to execute

An RPO deviation graph over time is shown below.

Summary

Sanovi DRM for Oracle provides a complete pre-packaged DR solution that includes monitoring, automation, and alerting for the Normal Copy process, as well as automation of failover and DR drills for Oracle databases. The Sanovi solution provides business benefits by increasing operational efficiency and application uptime. Critical enterprise applications are built around Oracle databases, and Sanovi DRM provides a comprehensive recovery capability for complex applications built around these databases.