Daniela Milanova Senior Sales Consultant
Oracle Disaster Recovery Solution
What is Data Guard? Management, monitoring and automation software infrastructure that protects data against failures, errors, and corruptions of the database. Automates the process of maintaining a copy of an Oracle production database (standby database)
Data Guard Architecture Clients Clients Primary Site Standby Site Data Changes Primary Database Standby Database Services types: Log transport services Log apply services Role-management services
Software Data Guard Requirements The same release of Oracle Database Enterprise Edition must be installed for all databases If ASM/OMF is used, all databases should use the same combination
Hardware and OS Data Guard Requirements The hardware can be different for the primary and standby databases The operating system and platform architecture for the primary and standby databases must be the same The operating system version for the primary and standby databases can be different If all databases are on the same system, the OS must allow mounting more than one database with the same name.
Data Guard At the Highest Level Data Guard comprises two parts REDO APPLY Maintains a physical, block-for-block copy of the Production (also called Primary) database. Can be opened in Read Only mode for short-term reporting SQL APPLY Maintains a logical, transaction-for-transaction copy of the Production database. Can be open in Read Write mode for reporting purposes and cloning activities
REDO Apply Architecture Primary Database Asynchronous/Synchronous Redo Shipping Physical Standby Database MRP Redo Apply Network Backup Maintains a physical, block-for-block copy of the Primary Database
SQL Apply Architecture Primary Database Asynchronous/ Synchronous Redo Shipping Logical Standby Database Network SQL Apply Continuously Open for Reports Transform Redo to SQL Maintains a Logical transactional copy of the Primary Database Additional Indexes and Materialized Views
Data Protection & Disaster Recovery Solution with Reporting Capability Clients Standby Site Physical Standby Database Primary Site Data Guard Data Changes Reporting Clients Primary Database Data Changes Standby Site Logical Standby Database
Data Guard Data Protection Modes Maximum protection: No data loss. If the remote write fails, the primary database shuts down. Maximum availability: No data loss. If the remote write fails, the primary database continues in maximum performance mode. Maximum performance: Highest possible level of data protection without affecting the performance of the primary database.
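A minimal Python sketch of the behaviour listed above for each protection mode when the write to the standby fails. The enum, the function name and the returned strings are illustrative assumptions, not Data Guard interfaces.

```python
from enum import Enum


class ProtectionMode(Enum):
    """The three protection modes named on the slide."""
    MAXIMUM_PROTECTION = "maximum protection"
    MAXIMUM_AVAILABILITY = "maximum availability"
    MAXIMUM_PERFORMANCE = "maximum performance"


def primary_reaction(mode: ProtectionMode, remote_write_ok: bool) -> str:
    """Describe how the primary reacts, following the slide's wording.

    This is a hypothetical helper for illustration only.
    """
    if remote_write_ok:
        return "continue"
    if mode is ProtectionMode.MAXIMUM_PROTECTION:
        # No data loss is allowed, so the primary stops.
        return "shut down"
    if mode is ProtectionMode.MAXIMUM_AVAILABILITY:
        # The primary keeps running, behaving as in maximum performance.
        return "continue as maximum performance"
    # Maximum performance never waits on the standby.
    return "continue"


for mode in ProtectionMode:
    print(mode.value, "->", primary_reaction(mode, remote_write_ok=False))
```

The sketch only restates the slide as a lookup; it makes no claim about how the database implements these modes.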
Data Guard Role Transition Oracle Data Guard supports two role transition operations Switchover Planned role reversal Used for OS or hardware maintenance No data loss Failover Unplanned role reversal Used in emergencies Zero or minimal data loss depending on choice of data protection mode
Existing Site Recovery Tradeoffs Primary Database Redo Shipment Standby Database Reporting on delayed data Delayed Apply Log apply may be delayed to protect from user errors but: Switchover/Failover gets delayed Reports run on old data After failing over to standby, production DB must be rebuilt
Enhanced DR with Flashback Database Primary Database Redo Shipment Real Time Apply Standby Database Real Time Reporting No Delay! Flashback Log Flashback Log Primary: No reinstantiation after failover! Flashback DB removes the need to delay application of logs Flashback DB removes the need to reinstantiate primary after failover Real-time apply enables real-time reporting on standby
Rolling Database Upgrades In Oracle Database SQL Apply provides the starting point for performing rolling upgrades of the Oracle RDBMS software and database with minimal interruption of service. By utilizing a Logical standby database customers can upgrade one database while running on the original production database and then run in a mixed version environment before returning to the original, but upgraded, configuration!
SQL Apply Rolling Database Upgrades [Diagram: clients, database A shipping redo to database B] Applies to: Patch Set Upgrades, Major Release Upgrades, Cluster Software & Hardware Upgrades Step 1: Initial SQL Apply configuration, A and B both at Version X Step 4: Switchover to B, upgrade A, both at Version X+1
Benefits of Oracle Disaster Recovery Solution Disaster recovery and high availability Complete data protection Efficient utilization of system resources Flexibility in data protection to balance availability against performance requirements Automatic gap detection and resolution Centralized and simple management Integrated with Oracle database
Ease of Use New and Improved Data Guard Manager! Monitoring SQL Apply Unsupported Storage Attributes Applied Logs and Apply Progress Managing the Logical Standby Bypassing the Guard Skipping Table Redo Skipping Failed (and subsequently fixed) Transactions
New Data Guard Feature: Fast-Start Failover Automatic and fast Physical and Logical standby each complete failover in less than 20 seconds Old primary is reinstated automatically once connectivity is re-established between Observer and primary database
Data Guard Best Practices: Switchover for Planned Maintenance For fastest switchover (< 1 minute) Prior to switchover A physical standby transitioning from read-only back to Redo Apply should be restarted Disconnect all sessions and stop job processing Shutdown abort for all secondary RAC instances on both primary and standby databases Enable real-time apply on the standby database and ensure the standby is synchronized with the primary database For switchovers using SQL or command line interface, open the new primary directly from the mount state Or, simulate a Fast-Start Failover - complete transactions and shutdown abort all primary instances
Data Guard Best Practices: Faster Redo Transport Set SDU=32K Tune network parameters that affect network buffer sizes and queue lengths Ensure sufficient network bandwidth for peak database redo generation rate + other activities http://www.oracle.com/technology/deploy/availability/pdf/maa_dg_netbestprac.pdf
Data Guard Best Practices: Tune Network Parameters Send and receive buffer size = 3 x bandwidth delay product (BDP) BDP = the product of the estimated minimum bandwidth and the round trip time between the primary and standby server BDP = 1,000 Mbps * 25 ms (0.025 secs) = 1,000,000,000 bits/sec * 0.025 secs = 25,000,000 bits / 8 = 3,125,000 bytes, so buffer size = 3 x 3,125,000 = 9,375,000 bytes Tune network device queues to eliminate packet losses and waits. Set device queues to a minimum of 10,000 (default 100)
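The buffer sizing arithmetic above, written as a small Python sketch. The function and parameter names are assumptions made for illustration; the 3 x multiplier and the 1,000 Mbps / 25 ms figures come from the slide. Integer arithmetic is used so the result matches the slide exactly.

```python
def bdp_bytes(bandwidth_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth delay product in bytes: bandwidth x round trip time / 8."""
    bits_in_flight = bandwidth_bits_per_sec * rtt_ms // 1000
    return bits_in_flight // 8


def socket_buffer_bytes(bandwidth_bits_per_sec: int, rtt_ms: int,
                        multiplier: int = 3) -> int:
    """Send/receive buffer size, taken as a multiple of the BDP (3 x here)."""
    return multiplier * bdp_bytes(bandwidth_bits_per_sec, rtt_ms)


# Figures from the slide: 1,000 Mbps link, 25 ms round trip time.
bandwidth = 1_000_000_000
rtt = 25
print("BDP bytes:", bdp_bytes(bandwidth, rtt))
print("Buffer bytes:", socket_buffer_bytes(bandwidth, rtt))
```

How the resulting value is applied (operating system socket limits, Oracle Net settings) depends on the platform and is not shown here.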
Impact of Network Tuning Test Results - Oracle Database 10g Release 1 & 2
Data Guard or Remote Mirroring Remote Mirroring (host-based and storage-based) is another way to protect enterprise data However: What about Data Reliability? What about Data Recoverability? What about Data Availability? What about Cost? A well-designed Business Continuity Plan must consider these critical issues in addition to simple data protection
Data Guard is the Preferred Solution 1. Better Network Efficiency - Transmits only redo data - Remote mirroring solutions: datafiles, archivelog files, redolog files must be mirrored 2. Better suited for WANs Fibre/ESCON-based mirroring solutions have an intrinsic distance limitation The protocol converters they need add to the cost, complexity and latency Data Guard is based on standard TCP/IP Data Guard doesn't have to deal with protocol converters, extra cost and latency issues 3. Better Data Protection Data Guard enables zero data loss Preserves write-order consistency Avoids logical and physical corruptions Both SQL Apply and Redo Apply validate redo data before applying
Data Guard is the Preferred Solution 4. Higher Flexibility Data Guard is based on commodity hardware Does not force lock-in with storage vendors Remote mirroring solutions typically need identically configured storage from the same vendor 5. Better Functionality Data Guard is a comprehensive DR solution: Redo Apply/SQL Apply Flexible protection modes Push-button switchover/failover Graceful handling of network connectivity problems 6. Higher ROI Provides more value for DR investment Standby database can be opened read-only or read-write Allows backups to be offloaded to the standby database Allows reporting/queries using the standby database Integrated natively with other HA features (RAC, RMAN, etc.) No extra cost
Data Guard and Remote Mirroring - Summary For protecting Oracle data, Oracle Data Guard's integrated disaster recovery solution involving standby databases is preferred to remote disk mirroring: For technical reasons For business reasons Remote mirroring may be used to protect non-Oracle database data that are changing frequently: File system data Data in databases that are not Oracle
Competitive Strengths vs. SharePlex
SharePlex: Redo log-based replication tool from Quest Software. Heavy front-end processing to extract transaction information from the primary redo logs. Somewhat similar to Data Guard SQL Apply.
It doesn't make sense for customers to use SharePlex:
Cost: Data Guard is free; SharePlex is expensive
Feature support: Data Guard is a native feature of the database; SharePlex is based on an unpublished and unsupported interface (1)
DR: Data Guard is a comprehensive and integrated DR solution; SharePlex is at best a replication solution
Zero Data Loss: Data Guard supports it; SharePlex has no support because of architecture limitations
Primary system overhead: Data Guard minimal; SharePlex much more
Integration with HA features: Data Guard is integrated with RAC, RMAN, Flashback; SharePlex has limited integration
(1) See MetaLink Note 97080.1
10g New Features and Best Practices
Data Guard Release 10.2 Redo Transport Improvements Increased network write sizes to 10 MB to better utilize network capacity for both ARCH and LNS LNS can potentially write 10MB or less Full decoupling of LGWR and LNS processes No more waits during log switches No more waits when LNS buffer is full Intra-file parallelism support for ARCH Up to 29 parallel remote archive processes
1 GB / 100 Mbps / 25 ms RTT
Data Guard Best Practices: Gap Resolution and Data Loss For fastest gap resolution Leverage intra-file archive parallelism (MAX_CONNECTIONS attr) Follow tips for tuning redo transport to improve network utilization To minimize data loss For a low latency, high bandwidth network, use SYNC transport For high latency or low bandwidth networks, use ASYNC to minimize primary database performance impact Follow tips for tuning redo transport Example: Less than 7 seconds of data loss exposure for high redo rates of 2-12 MB/sec with <=25 ms latency in our tests
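One way to read the 7 second figure above is as an upper bound on the volume of redo at risk: redo rate multiplied by exposure time. The Python sketch below shows only that interpretation, with a hypothetical function name; it is not a measurement and not part of the original test results.

```python
def redo_at_risk_mb(redo_rate_mb_per_sec: int, exposure_seconds: int) -> int:
    """Upper bound on unshipped redo, assuming a constant redo rate."""
    return redo_rate_mb_per_sec * exposure_seconds


# Redo rates (2 and 12 MB/sec) and the 7 second exposure are from the slide.
for rate in (2, 12):
    print(f"{rate} MB/sec for 7 sec -> at most {redo_at_risk_mb(rate, 7)} MB")
```

Under this reading, the quoted range corresponds to at most 14 MB to 84 MB of redo exposed at failover time.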
Data Guard Best Practices: Reduce Overhead on Primary Performance Gains with 10g Release 2 ASYNC Transport For redo rates less than 2 MB/sec, there is less than 5% impact on the primary database across different latencies For very high redo rates of 20 MB/sec, less than 10% impact on primary database even with latencies of 50 and 100 ms Primary database performance impact was 2-3 times less with the new ASYNC transport compared to previous releases Best Practice Allocate additional I/O bandwidth for Online Redo Log Files
Data Guard Best Practices: Using Standby for Backups Offload Backups to Physical Standby Database Eliminate backup overhead on primary database RMAN allows for backup operations while Redo Apply is in progress Best Practices For simplicity, use identical directory structures on the primary and standby databases Use RMAN Recovery Catalog so that backups taken on one database server can be restored on another Use a catalog server physically separate from primary and standby sites Reference MAA RMAN/Data Guard best practices paper http://www.oracle.com/technology/deploy/availability/pdf/ RMAN_DataGuard_10g_wp.pdf
Data Guard or Remote Mirroring? Load 200txns/sec & Redo rate 1.1 MB/sec Data Guard SYNC transport has less overhead on the primary database
Data Guard Advantage Data Guard transmits only redo, whereas a remote mirroring solution must transmit all database writes. A remote mirroring solution needs to transmit the following writes: LGWR (log writer), DBWR (database writer), ARCH (archiver), RVWR (flashback log writer), and foreground direct writes. Both DBWR and LGWR are affected by network latency in a remote mirroring solution. In contrast, only LGWR is impacted by network latency in a Data Guard solution. Higher wait times for DBWR can be very detrimental to performance, causing contention for free buffers and an increase in buffer busy waits
Some customer references
First American Real Estate Solutions Nation's largest source of Real Estate data 100 million properties Online services for 50,000 clients Lenders, Information Resellers, Government, Utilities, Corporations, Appraisers, Agents & Title Companies Thousands of concurrent online users at peak www.firstamres.com
HA/DR Requirements High Availability: 24x7-365 days/year Limited instances of planned downtime once/quarter Recovery Point Objective (RPO) - maximum data loss Oracle9i: 10MB for computer failure, 200MB for site failure Recovery Time Objective (RTO) for Oracle Database Oracle9i: 10 minutes for computer failure, 1 hour for site failure Oracle Database 10g goals RPO: zero data loss for computer failure, 10MB for site failure RTO: zero downtime for computer failure, 10 minutes for site failure
First American Oracle 9i HA/DR Architecture Primary Production Site Local Standby #1 Data Guard LGWR Asynchronous Redo Shipping Local Standby #2 Data Guard Delayed Apply (30 minutes) LGWR Asynchronous Redo Shipping Remote Disaster Recovery Site Remote Standby #3 Primary Database Data Guard Archive Log Shipping (ARCH) 1500 miles >
Looking Ahead to Oracle Database 10g Real Application Clusters Transparent failover on node failure, zero data loss Flashback Technologies Flashback Database & Flashback Table Protect/repair for logical corruptions Enhanced LGWR ASYNC redo transport Improve RPO for remote DR site Real Time Apply Improve RTO
First American Oracle Database 10g Architecture - Plan Primary Production Site Remote Disaster Recovery Site Primary Database Real Application Cluster Data Guard LGWR Asynchronous redo shipping 1500 miles > Standby Database Data Guard
First American Oracle Database 10g Benefits Higher Availability transparent node failover RAC for HA, Data Guard for DR Better remote data protection ASYNC enhancements = less compromise on WAN Better protection against logical corruption Fewer databases, surgically repair vs full point in time Less downtime Faster failover, quicker repair of logical corruptions
Oracle Corporation Global Single Instance (GSI) A key enabler in Oracle saving $1 billion annually Consolidation: 1 is the magic number Versus 75 separate implementations of Oracle Apps Versus 100s of Oracle databases worldwide Oracle E-Business Suite 7,000 concurrent users 5.5TB Oracle database www.oracle.com
Oracle Global Single Instance HA/DR Requirements HA requirement Continuous operation regardless of component failure DR requirement Protect against site failure, physical & logical corruption RPO 5 minutes of transactions RTO database failover in less than 1 hour High workload OLTP system 8.2MB/sec redo generation at peak, 2.5MB/sec sustained WAN, dual OC12 1,000 miles of separation, 25-35ms RTT network latency
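A rough Python sketch of checking the redo rates above against the WAN capacity. It assumes 1 MB = 1,000,000 bytes and uses the SONET OC-12 line rate of 622.08 Mbps (usable payload is somewhat lower); the function names are hypothetical.

```python
OC12_LINE_RATE_MBPS = 622.08  # SONET OC-12 line rate, not usable payload


def redo_rate_to_mbps(redo_mb_per_sec: float) -> float:
    """Convert a redo rate in MB/sec to Mbit/sec (assumes 1 MB = 10^6 bytes)."""
    return redo_mb_per_sec * 8


def link_utilization(redo_mb_per_sec: float, link_mbps: float) -> float:
    """Fraction of the link consumed by redo alone, ignoring protocol overhead."""
    return redo_rate_to_mbps(redo_mb_per_sec) / link_mbps


dual_oc12 = 2 * OC12_LINE_RATE_MBPS
for label, rate in (("peak", 8.2), ("sustained", 2.5)):
    print(f"{label}: {redo_rate_to_mbps(rate):.1f} Mbps, "
          f"{link_utilization(rate, dual_oc12):.1%} of dual OC12")
```

On these assumptions the peak redo rate uses roughly 5% of the raw dual OC12 capacity, which leaves room for the other traffic sharing the link.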
Oracle Global Single Instance HA/DR Architecture GSI Production Site (4) SUN F12Ks 36 CPUs each Disaster Recovery Site (4) SUN F12Ks DR domain 8 CPUs each Development & Test domain: 28 CPUs each Primary Database Data Guard LGWR Asynchronous redo shipping 1,000 miles > Standby Database (4 hour delayed apply)
Utilization of Standby Resources Four node Standby Cluster 2 domains: DR, Development & Test DR domain has sufficient capacity to maintain standby database and execute failover At Failover time: Failover is executed, standby assumes primary role Development & Test is stopped CPUs are re-allocated to the new production domain Nodes are upgraded in a rolling fashion with no application downtime
Delayed Apply Downtime Avoided Human error caused logical corruption on primary 160,000 row table updated by mistake Standby database configured with 4 hour delayed apply Instead of 10 hours of downtime, just 30 minutes Cancel recovery on standby and open read only Stop the affected application on primary Export data from standby Recreate table on primary, import data to primary db after disabling triggers Restart application on primary Restart recovery on standby
Oracle Global Single Instance Oracle Database 10g Feature Adoption Flashback Technologies Flashback Table Flashback Database Data Guard 10g Real Time Apply Asynchronous Redo Transport enhancements Redo Apply performance enhancements Benefits Faster failover, better data protection
Ohio Savings Bank Founded in 1899 In Top 20 of all US Mortgage Lenders Provide mortgage services to independent brokers nationwide via Web $13 billion in assets Reputation for Innovation 2002 Web Site of the Year (Mortgage Technology Magazine) www.ohiosavings.com
HA/DR Requirements 24 x 7-365 days/year Recovery Point Objective: zero data loss Recovery Time Objective: 30 minutes Planned maintenance windows Sunday mornings
Ohio Savings Bank Oracle9i Architecture Online Mortgage Services Primary Production 2-node RAC Cluster HP N-Class PA-RISC EMC Symmetrix SAN attached HP-UX v11.0 Remote DR Site HP N-Class PA-RISC EMC Symmetrix SAN attached HP-UX v11.0 Data Guard Archive Log Shipping (ARCH) Primary Database 3rd party storage based synchronous disk mirroring for online logs 15 miles >
Ohio Savings Bank Oracle Database 10g Architecture Customer Call Center Primary Production 3-node RAC Cluster HP DL-380, 2 Xeon CPUs/node EMC Symmetrix & Clariion SAN attached Red Hat Linux Remote DR Site 3-node RAC Cluster HP DL-380, 2 Xeon CPUs/node EMC Symmetrix & Clariion SAN attached Red Hat Linux Primary Database Data Guard Maximum Availability synchronous redo shipping Zero Data Loss 15 miles > Standby Database
Ohio Savings Bank Oracle Database 10g Features Deployed Automatic Storage Management Reduces time spent managing storage RMAN Flash Recovery Area Fully automates disk-based backup & recovery Oracle Data Guard Zero Data Loss Replaces 3rd party remote mirroring Standby DB also used for daily exports
Ohio Savings Bank Automatic Storage Management Automatically spreads database files across all available storage Automatic rebalancing of used disk space when disks are added or removed Increases I/O distribution beyond disk array striping Reduces DBA workload
Ohio Savings Bank, Future Plans GRID from concept to reality Add nodes to the existing RAC 10g cluster Manage cluster via a single system view Add mortgage database, and potentially the OSB Data Warehouse to same RAC 10g cluster Define application workloads as services Establish rules to dynamically allocate processing resources to services Maximize the utilization of resources while meeting changing business needs
Oracle Disaster Recovery Solution Oracle products included: Oracle Database Enterprise Edition on both sites
Oracle Maximum Availability Architecture
Oracle Maximum Availability Architecture Clients Clients Application Servers Application Servers WAN Traffic Manager hb Dedicated Network Instance1 hb Instance2 Instance1 hb Instance2 Data Guard hb Primary Site RAC based Secondary Site
Resources Maximum Availability Architecture white papers: http://otn.oracle.com/deploy/availability/htdocs/maa.html New SQL Apply Best Practices Paper now available! HA Portal on OTN: http://otn.oracle.com/deploy/availability Data Guard home page on OTN: http://otn.oracle.com/deploy/availability/htdocs/odg_overview.html