Combining high-availability and disaster recovery: Implementing Oracle Maximum Availability Architecture (MAA) on Oracle 10gR2 RDBMS

Tuomas Nurmela
TietoEnator Processing & Network Oy, Espoo, Finland

Abstract. The increasing dependence of business on 24/7 IT systems sets requirements to support recovery from hardware, software and operational errors and to reduce the number and length of maintenance windows through online and automated or semi-automated system management activities. These requirements cannot be satisfied purely by technology, let alone by a single technological innovation on a single layer of a system stack. Oracle has defined the Maximum Availability Architecture (MAA) to describe the combination of its technologies that supports recovery from hardware, software and operational errors. With regard to the Oracle RDBMS, these extend the online and automated management features of the RDBMS. This paper reviews the MAA on Oracle RDBMS, with particular focus on using Linux as the platform to implement the MAA. The paper also provides a threat analysis to review the extent and limitations of the high-availability and disaster recovery capabilities of an MAA-based implementation.

Keywords: Databases, high-availability, disaster recovery

1 Introduction

Availability is the ability of a component or service to perform its required function at a stated instant or over a stated period of time. It is usually expressed as the availability ratio, i.e. the proportion of time that the service is actually available for use by the Customers within the agreed service hours [7, A.2]. Availability can be defined as a function of mean time between failures (MTBF) and mean time to recover (MTTR). Therefore, anything that increases MTBF (or uptime between planned maintenance) and/or decreases MTTR (or downtime during planned maintenance) provides high-availability (HA) support.
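The dependence of availability on MTBF and MTTR can be illustrated with a small calculation. The sketch below assumes the commonly used steady-state relation A = MTBF / (MTBF + MTTR); the text above only states that availability is a function of the two, so the exact formula and the function names are assumptions made for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability ratio under the assumed relation MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(ratio: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year, in minutes, for a given availability ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("availability ratio must be between 0 and 1")
    return (1.0 - ratio) * hours_per_year * 60.0


# Halving MTTR improves availability just as lengthening MTBF does.
baseline = availability(mtbf_hours=999.0, mttr_hours=1.0)
faster_recovery = availability(mtbf_hours=999.0, mttr_hours=0.5)
```

Under this relation a shorter recovery time and a longer time between failures are interchangeable ways of raising the ratio, which is why both HA mechanisms and faster recovery procedures count as availability support.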
As the definition implies, from the end-user perspective sufficiently slow response times can make a system unavailable in practice. Disaster recovery (DR) can be defined as the means of recovering lost data and the capability to restart the service. Whereas high-availability mechanisms typically provide means for service self-healing, disaster recovery takes the alternative view of backing up and recovering. Typical metrics include recovery point objective (RPO, how much data loss is acceptable under any circumstances) as
well as recovery time objective (RTO, the time within which recovery of the system is completed technically or operationally under any circumstances).

The increasing dependence of businesses on 24/7 IT systems, or regulatory requirements to set up such systems, leads to near telecom-level availability requirements of % or %. This sets requirements to support recovery from hardware, software and operational errors as well as to reduce maintenance windows. The general view of ICT services market analysts is that only 20% of failures are hardware related, whereas operational errors (whether unintentional or mistaken actions, e.g. truncating a table from the wrong database schema or dropping a table instead of truncating it) contribute 40% of total failures, and application failures, including performance problems, contribute the remaining 40%. In a survey by Gartner, when IT decision makers were asked to name the most significant contributor to unplanned downtime, 53 percent indicated application failures (due to people or practices failures in e.g. application changes), 21 percent indicated operational failures (due to people or practices relating to infrastructure changes) and 22 percent indicated technology (such as hardware) or environment (the site) failures.

The distribution of errors among error categories gives a clear indication that operational errors, among others, should be addressed in systems design. However, the architecture discussion often focuses only on hardware and software faults. For example, Drake et al. describe highly available database system design for five or six nines environments. Yet they provide no reference as to how operational errors could be addressed through systems design, and they do not discuss how fault-tolerant architectures by definition increase architectural complexity, thereby potentially increasing unavailability through operational errors.
This paper provides an overview of Oracle database features relating to the Maximum Availability Architecture (MAA), the best practices approach to utilizing Oracle database high-availability and disaster recovery capability. MAA is not a single technology, nor does it assume that all problems could be solved on one technology layer of the database stack, or only in the database for that matter. Rather, MAA provides guidance on how different Oracle technologies are supposed to be used together and describes how to minimize unavailability under different types of errors and changes.

The structure of the paper is as follows: Section 2 introduces the main database technologies used in MAA, describing their background, current state in Oracle 10gR2 on Linux, operational issues and 11g new features. Section 3 provides an assessment of how these technologies work together under different threat scenarios, in a manner similar to an earlier discussion of another database engine.

The paper makes a number of assumptions in regard to MAA scope. The paper does consider the impact of the immediate environment of the database (e.g. application server connectivity through database drivers, the IP network in terms of both LAN and WAN connectivity, and the storage area network (SAN) including the storage subsystem). Beyond the immediate environment, a number of other assumptions limit the scope. First, the focus is on the Linux implementation of MAA. The architecture implementation on other platforms, especially on Windows, differs. Second, the assumption is that the processor architecture is based on symmetric multiprocessing. Issues that relate to e.g. non-uniform memory access (NUMA) server architectures are not discussed. Third, HA design for complete redundancy including access-routing
redundancy and application server redundancy is outside the scope of the paper. Fourth, competing high-availability or disaster recovery approaches from Oracle or other vendors are not discussed; the focus is only on the MAA approach.

2 MAA architectural components

MAA on Oracle database is based on the use of Oracle Real Application Cluster (RAC) and Oracle Data Guard. Architecturally this equals a 2 to n node shared disk, shared distributed cache cluster with one-way replication to another identical cluster. The replication protocol supports either 1-safe or 2-safe schemes. Additionally, the local shared disks can be mirrored using Oracle Automatic Storage Management (ASM). Fast recovery from logical errors (including operational errors) is based on Oracle Flashback technologies, which enable a selective temporal scan of system changes and application of undo changes to the database. Finally, the impact of backups on disk performance is reduced by block-level incremental backups.

The next subsections focus on RAC, Data Guard and Flashback respectively, concluding with a short overview of ASM and RMAN backup support. The main focus is on Oracle 10gR2 functionality, with an overview of the evolution and of future developments in 11g which relate to high availability, disaster recovery or recovery from operational errors. Recovery steps are discussed in more detail in Section 3.

2.1 Real Application Cluster (RAC)

Oracle Real Application Cluster (RAC) is a shared disk, shared distributed cache database. RAC is a continuation of technology initially introduced in Oracle 6 on DEC VAX/VMS (in Oracle 8 for all platforms in general) as Oracle Parallel Server (OPS). Oracle 8 introduced the shared distributed cache, but limited this to global enqueues (i.e. locks). With 8i Cache Fusion, sharing of non-dirty data blocks was implemented. Oracle 9i removed this limitation, allowing use of the shared distributed cache for all reads.
However, it still mainly relied on external, OS-vendor dependent clusterware to provide server node-level high availability functionality [2, 4 pp. 24, 5 pp. 111]. This is not to say that DBMS reliance on OS support is anything new or novel; rather, it is the norm. With Oracle 10g, Oracle made the move to provide its own clusterware, thereby moving down the stack. Potentially this provides a more integrated solution, reducing the occasionally observed, hard to troubleshoot problems resulting from the use of third-party clusterware. It has been noted that the clusterware was originally licensed from Compaq. It was originally known as the Oracle Cluster Management Services (OCMS) and was released on Linux and Windows.

This subsection focuses on the clusterware layer, the Oracle RAC layer and the additional services that support failure masking, taking into account Oracle 11g additions. Additionally, the operational management impact and the expectations on the immediate environment are noted.
Clusterware provides a shared-disk active-passive cluster capability on top of normal OS services. It functions as a building block for the upper RAC layer. Clusterware contains the disk data structures and processes that enable this. It also assumes a private network interconnect between nodes for node membership management and other cross-node activities.

Clusterware disk data structures [14, Chapter 1-2, 30, pp.18-19] contain the voting disks and the Oracle Cluster Registry (OCR). Voting disks are used by the Clusterware to detect node failure and avoid split brain syndrome (i.e., unsynchronized multiple node access to global resources in case of network partitioning). Clusterware nodes register themselves to the voting disks, which requires that they have an access path to the disks. To ensure voting disk availability, an MAA configuration uses two additional copies of it. Each voting disk should be allocated its own separate 256 MB LUN from the SAN. The OCR is used to maintain a centralized repository of all clusterware resources and members. The OCR should be located on its own 256 MB LUN from the SAN. The OCR registry is supported by an OCR process local to each instance. This provides a cache of the OCR content locally on the cluster node. One of the local OCR processes functions as the master OCR cache, providing and synchronizing access to the OCR repository.

Instance-specific clusterware processes [14, Chapter 1-2][30 pp ] consist of three daemon processes. These are the Oracle Cluster Synchronization Services Daemon (OCSSD), the Event Manager Daemon (EVMD) and the Cluster Ready Services Daemon (CRSD). All processes are initialized by the INIT.CSSD process. OCSSD establishes the initial state of the node and cluster and maintains a local view of the cluster. The first steps consist of establishing the cluster configuration, establishing node relationships with other nodes and determining the location of the OCR repository. After this, the identity of the master OCR cache is dynamically established through voting.
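How three voting disk copies help against split brain can be sketched with a simple majority rule. The sketch assumes that a node may stay in the cluster only while it can access a strict majority of the configured voting disks; the text above does not spell out this rule, and the function, node and disk names are hypothetical.

```python
from typing import Dict, List, Set


def has_quorum(accessible: int, configured: int) -> bool:
    """True if a node sees a strict majority of the configured voting disks."""
    if configured <= 0 or not 0 <= accessible <= configured:
        raise ValueError("invalid voting disk counts")
    return accessible > configured // 2


def surviving_nodes(node_access: Dict[str, Set[str]], voting_disks: Set[str]) -> List[str]:
    """Nodes that keep quorum; the others would have to leave the cluster."""
    return sorted(
        node
        for node, disks in node_access.items()
        if has_quorum(len(disks & voting_disks), len(voting_disks))
    )


# Three voting disks (the original plus two copies); node B has lost two access paths.
disks = {"vote1", "vote2", "vote3"}
access = {"nodeA": {"vote1", "vote2", "vote3"}, "nodeB": {"vote3"}}
survivors = surviving_nodes(access, disks)
```

With an odd number of copies, two partitions cannot both hold a majority, so at most one side continues to access global resources.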
The node which receives the master OCR cache identity activates itself. This is followed by synchronization of group and lock resources and activation of the other nodes. After initialization, the OCSSD functionality consists of Node Membership (NM) and Group Membership (GM) management. NM handles both private network and storage connectivity checks. GM handles joining the group as well as managed and unmanaged departure from the group. OCSSD functionality is central to cluster functionality: in case of private interconnect or group membership failure, OCSSD reboots the local node. In case of OCSSD failure, the local node is rebooted.

CRSD is the main resource manager of the cluster. It learns the identity of the resources from the OCR and starts them. Upon failure of a resource, CRSD attempts to restart it up to five times (a non-configurable threshold). CRSD is also responsible for initializing the RACGIMON clusterwide health monitoring process. Failure of CRSD results in a restart of CRSD by the Oracle instance's PMON background process. Cluster resources managed by the CRSD include the Oracle instance, Global Services Daemon (GSD), Oracle Net listener, Virtual IP (VIP) and Oracle Notification Services (ONS), in addition to the local database instance background processes. GSD supports system management by enabling execution of global (cluster-wide) commands from a single node. Virtual IP provides an abstraction of the actual IP address to support failover handling between RAC nodes. By binding to a floating end-point address,
connections can be transferred from a failed host to another cluster member. However, this requires support in the Oracle Net listener and communication between nodes through ONS. The Oracle Net listener is the connection handler of an Oracle host. In RAC environments, the services (Oracle instance schemas) served by the listener can be bound to a VIP. Both nodes can have their own listener bound to a VIP in case load balancing and failover are configured.

RAC instance load balancing can be done on the client side or the server side, as with normal single-instance listener load balancing. In client-side load balancing, clients are configured with multiple IP addresses (or VIPs), with the thick driver connecting in a round-robin manner at connect time. Fault tolerance can be supported through failover to pre-created connections. In server-side load balancing, two listeners are configured, each with information about the other, remote listener. PMON processes communicate with each other periodically to provide state and load information, which is utilized for load balancing based on least load. In both client-side and server-side load balancing the decision is made at connection initialization, in the listener handshake, prior to binding to the user process. With connection pools, this means that connection reuse cannot benefit from the load balancing mechanisms. More importantly for HA, additional fault masking mechanisms exist. These optional enhancements and ONS are discussed after the RAC layer overview.

EVMD publishes CRSD-generated events through the ONS peer-to-peer process group. It also initiates the RACGEVT process, which manages server-side Fast Application Notification (FAN) callouts. FAN callouts are server-side executables which are run as a response to an event [14, Chapter 6]. Callouts automate basic administration tasks such as sending an event to a work management system or stopping and starting applications.
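The two connect-time balancing policies described above can be contrasted in a short sketch. It is not Oracle driver or listener code: the function names, the VIP labels and the load figures are hypothetical, and the load values stand in for the state that the PMON processes are described as exchanging.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def client_side_round_robin(addresses: List[str]) -> Iterator[str]:
    """Client-side policy: walk the configured address list in turn at connect time."""
    if not addresses:
        raise ValueError("at least one address is required")
    return cycle(addresses)


def server_side_least_load(load_by_instance: Dict[str, float]) -> str:
    """Server-side policy: the listener hands the connection to the least loaded instance."""
    if not load_by_instance:
        raise ValueError("no instance load information available")
    # Sorting the names first makes ties resolve deterministically.
    return min(sorted(load_by_instance), key=load_by_instance.__getitem__)


picker = client_side_round_robin(["vip-a", "vip-b"])
first_three = [next(picker) for _ in range(3)]
chosen = server_side_least_load({"rac1": 0.7, "rac2": 0.2})
```

Both functions are evaluated once per new physical connection, which mirrors the limitation noted above: a pooled connection that is reused keeps whatever instance it was bound to at connect time.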
The handling of a particular event is defined in a callout file; callout files are collected in a specific directory.

Finally, OPROCD provides monitoring against hangs and IO fencing. On Linux the clusterware utilizes the hangcheck-timer kernel module for monitoring all CPU cores against hangs. This is done through thread creation and timed sleeps. While not mandatory, MAA recommends its usage. The IO fencing mechanism is dependent on the SAN storage subsystem used. Use of SCSI-3 Persistent Reservations on voting disk LUNs is common in today's high-end systems.

Clusterwide clusterware processes are processes which have only one instance running in the whole cluster. RACGIMON is the only clusterwide process. The process handles failover, start and stop of the processes. RACGIMON reuses the shared memory data structures used by the local PMON instance to evaluate node states [30, pp.51-52]. On failure, RACGIMON is restarted on another node by CRSD. The clusterware architecture is presented in Figure 1 below.
[Figure 1: Oracle Clusterware Architecture. Node A clusterware: OPROCD, OCSSD, CRSD, RACGIMON, EVMD, RACGEVT, callout and action files, hangd. Node B clusterware: OCSSD, OPROCD, CRSD, EVMD, RACGEVT, callout and action files, hangd. Resources on each node: GSD, ONS, VIP, listener. The nodes connect through public and private LANs and through the SAN to SAN storage holding the shared resources and the voting disk.]

Clusterware is basically an extension which provides node fault tolerance mechanisms and coordinates these with the database instance on process faults. The RAC layer builds on top of this functionality to create a shared-disk active-active 2 to n node cluster. A test on a real application in one of the largest installations has shown near-linear scalability for 64 nodes. A node on the RAC layer is called a RAC instance. All the nodes in the cluster form the RAC database. The actual enhancements to standard Oracle database instance functionality are contained in changed disk structures, new RAC data structures, RAC-specific background processes and additional functionality included in the background processes that are part of a normal Oracle database instance.

RAC disk data structures relate to enabling RAC instances to access all the files that form the RAC database state. These include the database files (containing metadata, rollback and database data), the control file (and its redundant copies), log files and archived log files. RAC instances have concurrent, shared access to database files and control files. Under normal operating conditions redo log files and archived log files are accessed only by the local RAC instance, with each instance running its own set of redo logs, called a redo log thread. However, these are located in a shared storage area. In case of RAC instance failure, another RAC instance can recover the RAC database state by acquiring the redo log thread and related archives.
Each RAC instance records its redo log thread information in the control files to support faster recovery.

RAC memory data structures contain the Global Resource Directory (GRD). The GRD contains status information on all global enqueues (i.e., locks) and on the shared data block buffer cache (BC). These are collectively called GRD resources. For the BC, the GRD maintains records of the following information [30, pp ]: data block address; location (the instance having the most up-to-date version of the resource); mode of reservation; role of the block, either local (i.e., not yet shared) or global (already shared through the distributed cache); System Change Number (local to the RAC instance) indicating the most recent change; and an indication of whether the data block is a current or some past image.
The GRD is distributed, with each RAC instance having its own local part of the GRD. This is located in the instance shared memory allocation, the System Global Area (SGA). Mastership over data blocks and related enqueues is determined at node initialization. This happens by determining hash values for GRD resources and distributing mastership of ranges of hash values to different RAC instances. However, remastering of a resource can happen as a result of usage patterns or RAC instance failure. GRD resource information is maintained by the RAC instance-specific Global Enqueue Service Daemons and Global Cache Service Daemons.

RAC-specific background processes [14, Chapter 1][30, pp ] include the Global Enqueue Service Daemon (LMD), Global Enqueue Service Monitor process (LMON), Global Cache Service Daemon (LMSn), Lock Manager process (LCK) and the Diagnostics Daemon (DIAG). LMD functions as a resource access manager, providing access to global enqueues and data blocks in the distributed cache. LMD handles all requests to resources that it masters (i.e. that are in the local part of the GRD), whether these requests come from the local or a remote instance. LMD does not actually provide resource lock management; this is delegated through service requests to LMS (on the distributed cache level) and LCK (for local instance non-shared resources, such as other SGA caches). As the primary resource manager for mastered resources, LMD is also responsible for global (i.e., cross-instance) deadlock detection. LMSn handles the actual block transfer from the local instance to the requesting remote RAC instance, based on the service request queue filled by LMD. The transfer is done by direct copy from the data block buffer cache to the remote RAC instance. LMS provides read consistency by rolling back uncommitted transactions prior to transfer. This is required because the distributed shared cache functions at block granularity.
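The hash-based distribution of mastership can be illustrated as follows. This is a sketch of the general technique only: the CRC32 hash, the bucket count, the record field names and the instance names are assumptions made for illustration and do not reflect Oracle's internal GRD layout or hash function.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class GrdBlockEntry:
    """Illustrative per-block record with the kinds of fields the text lists for the GRD."""
    block_address: int
    location: str      # instance holding the most up-to-date version
    mode: str          # mode of reservation
    role: str          # "local" (not yet shared) or "global" (already shared)
    scn: int           # System Change Number of the most recent change
    is_current: bool   # current image or some past image


def master_for(resource_name: str, instances: List[str], buckets: int = 1024) -> str:
    """Map a resource to its mastering instance via contiguous ranges of hash values."""
    if not instances or buckets <= 0:
        raise ValueError("instances and buckets must be non-empty and positive")
    bucket = zlib.crc32(resource_name.encode("utf-8")) % buckets
    range_size = -(-buckets // len(instances))  # ceiling division
    return instances[bucket // range_size]


# Remastering after an instance failure: recompute mastership over the survivors.
before = master_for("block:0x1A2B", ["rac1", "rac2", "rac3"])
after_failure = master_for("block:0x1A2B", ["rac1", "rac3"])
```

In this simplified form a membership change remaps whole hash ranges at once; remastering driven by usage patterns, which the text also mentions, is not modelled.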
LMS also informs the remote instance if the block cannot be acquired due to lock type incompatibility. On successful transfer of a block, the GRD information is updated. LMON handles group membership on the RAC layer. It maintains group membership by polling local and remote RAC instances. On group membership changes, LMON is responsible for remastering GRD resources. In instance recovery operations, LMON is responsible for determining the failed instance prior to the start of cache recovery (a specific part of recovery), which comprises recovery of GRD resources. Therefore, LMON can be thought of as analogous to the instance-specific SMON in terms of recovery, functioning on the RAC distributed cache level. LMON is often referred to as Cluster Group Services in Oracle conceptual documentation. LCK handles access to local resources that are not shared, e.g. SGA cache resources and local row-lock information. As such, LCK seems to be a mediating process between the normal Oracle instance functionality and RAC-specific functionality. Information on LCK functionality was especially scarce. DIAG is the instance health monitor. It also records process failure dump information in the instance alert log.

Normal local instance background processes are also affected by RAC to a certain degree. RAC mainly impacts instance monitoring processes as well as processes participating in data I/O. The process monitor, PMON, is responsible for monitoring local non-RAC related processes (both user and background processes) and attempting to reinitialize them. From RAC
perspective, PMON also handles local listener registration and monitoring for the local and, when utilized, remote listeners, to support listener failover and server-side load balancing. Upon failure of a process, PMON handles the clean-up of memory data structures and creates the alert log entry (with the exception of RAC-specific background processes, for which DIAG writes the alert log entries). The system monitor, SMON, is responsible for system recovery actions during the ARIES-like two-phase recovery. Given that RAC is a shared database, the SMON of a non-failed instance may be required to carry out recovery of database state from logs and archives on behalf of a failed instance. SMON does not carry out cache recovery, as this is done by LMON. The dirty buffer writer, DBWR, is responsible for writing dirty buffer cache blocks to disk under certain conditions. With RAC, DBWR must coordinate writes with the global cache services LMD and LMSn processes as well as with the LCK lock manager.

The RAC architecture is depicted in Figure 2 below. The RAC instances are shown only with the background processes which directly relate to RAC; a number of normal Oracle instance background processes are not depicted.

[Figure 2: Oracle Real Application Cluster Architecture. Instances A and B each contain an SGA with the buffer cache (BC) and GRD, and the processes LMD, LMON, LMSn, LCK, DIAG, DBWR, PMON, SMON, LGWR and ARCH, with an SPFILE, running on top of Oracle Clusterware. The instances connect through public and private LANs and through the SAN to REDO A, ARCH A, REDO B, ARCH B, the control file and the data file(s).]

The figure assumes archives are shared through the SAN instead of being copied to multiple locations.

Optional mechanisms contain enhancements enabling failure masking through utilization of the Oracle Notification Service (ONS). Failure masking from clients such as
application servers is important to enable the whole system to benefit from the database HA capabilities. ONS is a publish and subscribe system for non-reliable messaging. It is implemented as a set of ONS daemons, one running on each node and on each involved client such as middle-tier servers, with ONS daemons replicating locally
received cluster events to others. Events include up, down and restart of registered clusterware and RAC instance components.

The Transparent Application Failover (TAF) framework is a RAC failure masking mechanism functioning on the application layer. TAF enables transparent re-execution of read queries. It does not support continuation of transactions or maintenance of non-default session parameters or database session-related information (such as PL/SQL procedure state). TAF use requires additional configuration on top of the client-side or server-side load balancing that utilizes fault tolerance, and it is independent of ONS.

To extend failure masking capabilities to handle transaction continuation on connection failure, Fast Application Notification (FAN) based services need to be utilized either in the driver layer or in the application code. FAN is based on ONS. FAN is not transparent like TAF, as specific event handling is required. Because typical middle-tier connections utilize connection pooling to establish a logical connection to a database, the connection pooling layer seems to fit well for the internal plumbing required for such event handling. This is the concept behind the Oracle 10gR2 Implicit Connection Cache (ICC) [30, pp , ]. As ICC utilizes ONS, the combination is called Fast Connection Failover (FCF). ICC functions as any JDBC 3.0 cache. If the underlying connectivity is based on a type 2 driver ("thick" driver, Oracle OCI driver), ICC is able to handle connectivity-related ONS events behind the scenes, transparently to the application. It does this by wrapping the physical connection in a logical connection. In case of a client connection failure to a RAC instance listener, the physical connections are closed, new connections are established to an available RAC node and these are wrapped in the same logical connections. Other events are forwarded to the application layer (which in turn may have its own logic for handling transaction errors etc.).
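The wrapping of physical connections in logical ones can be sketched as below. This is a conceptual model only and not the Oracle JDBC Implicit Connection Cache API: the class names, the event strings and the node names are hypothetical, and a physical connection is reduced to the name of the node it points to.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogicalConnection:
    """Handle given to the application; the node behind it may be swapped."""
    node: str


class ConnectionCache:
    """Hypothetical cache that reacts to node up/down events behind the scenes."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.available = list(nodes)
        self.logical: List[LogicalConnection] = []
        self._next = 0

    def borrow(self) -> LogicalConnection:
        """Create a logical connection bound to the next available node."""
        if not self.available:
            raise RuntimeError("no available node")
        conn = LogicalConnection(self.available[self._next % len(self.available)])
        self._next += 1
        self.logical.append(conn)
        return conn

    def on_event(self, event_type: str, node: str) -> None:
        """Handle a node event; the application keeps using the same logical handles."""
        if event_type == "down":
            if node in self.available:
                self.available.remove(node)
            if not self.available:
                raise RuntimeError("no surviving node to fail over to")
            for conn in self.logical:
                if conn.node == node:
                    conn.node = self.available[0]  # re-wrap onto a surviving node
        elif event_type == "up" and node not in self.available:
            self.available.append(node)  # existing logical connections are left as they are


cache = ConnectionCache(["rac1", "rac2"])
first, second = cache.borrow(), cache.borrow()
cache.on_event("down", "rac1")
```

After the down event the application still holds `first`, but it now points at the surviving node; any transaction that was in flight on the failed node would still have to be handled by the application.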
ICC does not do connection rebalancing in case of an up event. Beyond failure masking, ONS events can also be utilized for more advanced forms of load balancing, such as the one provided by the Load Balancing Advisory service. This uses listener service-specific goals (definable through the PL/SQL package DBMS_SERVICE procedure MODIFY_SERVICE) to maintain load sharing, informing the application of load imbalances.

The immediate environment in an MAA configuration should support RAC through a number of redundancy mechanisms. These include (i) use of two network interface cards (NICs) and Linux network interface bonding for production and private interconnect network access, to support HA of network interfacing, (ii) use of two host bus adapters (HBAs) and third-party HA drivers to support HA of SAN interfacing and (iii) dual switching infrastructure for both the LAN and the SAN network to avoid single points of failure in the immediate network. Finally, (iv) either Oracle storage management or an external storage redundancy mechanism should be used to ensure data availability. Clusterware provides a verification tool, by which the correct configuration can be ensured to a certain extent.

From the operational management perspective, RAC supports all the normal Oracle online and automated management features. In terms of rolling upgrades, RAC is a limited solution: system and hardware upgrades can be done, but Oracle patching is supported only for explicitly qualified one-off patches (i.e., patch sets, version migration and some one-off patches cannot be applied without off-line maintenance). On the other hand, since RAC is an active-active cluster, there is no downtime from taking a node
down and the expected application staleness time ("brownout period") due to the transfer from one node to another only impacts individual sessions.

Oracle 11g RAC new features offer mainly administrative improvements. In addition to these, from the clusterware perspective, new voting disks can be added online. From the shared database improvements perspective, runtime load balancing has been incorporated into the OCI thick driver functionality. Additionally, in 11g a RAC instance can act as a global coordinator for a distributed transaction spanning multiple RAC instances.

2.2 Data Guard

Oracle Data Guard is a one-way database replication mechanism supporting 1-safe replication (i.e., commit on the primary site is sufficient) or 2-safe replication (i.e., data is committed to the standby prior to commit acknowledgement). Data Guard [6, Chapter 7, 13] is a continuation of what was first introduced in Oracle 8i as Oracle Standby Database. The Standby Database focused only on physical log shipping. It supported read queries if put into a non-standby mode to do so (but had to be switched back to standby mode). Also, once the original site was recovered after a disaster, the original standby database could not be converted back to standby mode; rather, the whole standby database had to be re-established (a particular problem when running non-identical production and standby environments). Likewise, if there were gaps in the transfer of log records, these had to be resolved manually.

With Oracle 9i, role switchover support was established. Detection of gaps became automatic, based on log shipping gaps or lack of heartbeat from the primary node. Also, Data Guard Broker was established to support a third-node controlled, automated switchover, and 2-safe support was established as one of the possible data protection modes. With 9iR2, the physical log shipping architecture was complemented with the logical standby database. With 10gR1, support for usage of Data Guard with RAC was established.
Flashback database support (i.e., point-in-time recovery of the production database could be mirrored on the standby database) was also established. With 10gR2, support for automated failover management through Data Guard Broker was established. Most of the other developments relate to performance and manageability improvements of Data Guard itself. Both 10g releases have also extended the logical standby database implementation by removing a number of limitations concerning its use in terms of database data types and object types (index types, non-standard tables etc.).

This section focuses on the 10gR2 Data Guard physical and logical standby database architectures, providing an overview of the processes and disk data structures related to these. Both of the architectures enable one-way replication with different levels of safety.

The physical standby database requires a physical database structure identical to that of the primary node, with the same OS and Oracle Enterprise Edition versions. Primary and standby database functionality depend on the mode of protection, which defines how the Oracle background processes participate in the activity. With 10gR2, there are three possible safety modes (called "levels of protection") [6, Chapter 7]:
Maximum protection: Redo information must be written to one standby (as there can be multiple standbys) prior to commit, following 2-safe algorithm requirements. If this is not possible, primary node shutdown follows. The redo transfer is delegated to the log network service (LNS) process. LNS captures the log writer (LGWR) redo record write from the redo log buffer (RLB) to the redo log file, attempting to immediately transfer the data to the standby.

Maximum availability: Redo information must be written to one standby prior to commit. If this is not possible, the primary node protection mode is changed to maximum performance. Therefore, no standbys need to be available. The transfer mechanism depends on whether the maximum protection or the maximum performance mode is currently being followed.

Maximum performance: Transactions commit on local redo write, following 1-safe algorithm requirements. Therefore, no standbys need to be available. Transfer of redo records can be asynchronous or deferred to the archiving of the redo log. The asynchronous log write is done by a separate log writer network service process (LNS). LNS reads the redo log record and transfers it to the standbys. Alternatively, if the write is deferred to redo log archival, the archival process (ARCH) ships the log after it completes the creation of an archive from a full redo log.

Regardless of the protection mode, a separate Remote File Server (RFS) process receives the redo information (whether it is from LNS or ARCH) on the standby. The redo is written to a separate redo log group called standby redo logs if the source is the primary database instance LNS, or to archive logs if the source is ARCH. Standby redo logs contain the redo log updates, whether these are transaction queries (updates, deletes, inserts), transaction commits or physical changes. As the standby redo log fills, ARCH creates a local archive of the standby redo log.
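The commit-time behaviour of the three protection modes can be summarised as a small decision function. This is an illustrative model of the rules stated above, not Data Guard code: the enum, the action strings and the assumption that the commit proceeds once maximum availability has fallen back to maximum performance are simplifications introduced here.

```python
from enum import Enum
from typing import Tuple


class Mode(Enum):
    MAX_PROTECTION = "maximum protection"
    MAX_AVAILABILITY = "maximum availability"
    MAX_PERFORMANCE = "maximum performance"


def commit_outcome(mode: Mode, standby_ack: bool) -> Tuple[str, Mode]:
    """Return (action, resulting mode) once the local redo write has succeeded.

    standby_ack tells whether at least one standby confirmed the redo write.
    """
    if mode is Mode.MAX_PERFORMANCE:
        return ("commit", mode)  # 1-safe: the local redo write is sufficient
    if standby_ack:
        return ("commit", mode)  # 2-safe condition met
    if mode is Mode.MAX_PROTECTION:
        return ("shutdown_primary", mode)  # never run without a protected standby
    # Maximum availability: fall back to 1-safe operation (assumed to let the commit proceed).
    return ("commit", Mode.MAX_PERFORMANCE)


outcomes = {m: commit_outcome(m, standby_ack=False) for m in Mode}
```

The table in `outcomes` shows the trade-off directly: with no reachable standby, only maximum protection sacrifices primary availability in order to guarantee zero data loss.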
After creation of an archive, a separate Managed Recovery Process (MRP) uses the created archive logs, applying them to the database files to bring the standby database up to date. Archives which have been applied are deleted on the standby. MRP can spawn additional processes for the work. Basically, the closer the MRP apply is to the primary database committed state, the shorter the actual recovery time will be. As the functionality is not based on applying individual redo records (but rather completed logs), a recovery time of a few minutes is typical. This rather long recovery time can be reduced if standby redo is used. In that case, Real Time Apply [15, Chapter 6] is possible, in which MRP applies the current standby redo log as it is being written.

Also, as noted, the maximum availability and maximum performance protection modes enable disconnected operation of the primary site. Therefore, a mechanism is required for the standby to catch up with the primary after the connection problem has been resolved. The Fetch Archive Log (FAL) client (FALc) and server (FALs) processes are responsible for managing these situations. Under such circumstances the standby FALc contacts the primary node FALs, requesting retransfer of the missing data. FALs checks the missing archives and requests ARCH to transfer these. For catch-up transfer, ARCH supports intra-file parallelism: up to five ARCH processes can share the read and transfer of a single archived log file. On the receiving side, up to 29 ARCH processes can be spawned to keep up with archive creation and with applying the archives to the database. After the archives have been applied, they are deleted on the standby database. Therefore, point in time
recovery of the standby database cannot be done after switchover beyond what the flashback database (described later) allows, in case it is used.

Finally, it should be noted that fast switchover support should be taken into account on both the standby and the primary node. This may be supported by a number of means. First, it is important to understand that the standby redo logs used by a physical standby differ from online redo logs (as they may not be online if e.g. the standby is down). As such, they are configured separately. The standby redo logs can also be created on the primary site so that a switchover from primary to standby can be done quickly in cases where it is part of operations practices (e.g. certain types of Oracle patching or patching at e.g. the OS layer). Second, if the primary or standby instance is actually a RAC database, all but one instance of the RAC database should be shut down to reduce the role transition duration.

Optional Data Guard background processes include those of the Data Guard Broker (DGB). This enables centralized management of the Data Guard deployment and, when required, automatic failover (called fast-start failover). DGB consists of the Observer third party and the DGB Monitor daemon (DMON). The physical standby database architecture with the DGB is depicted in Figure 3. The instances only contain background processes which directly relate to Data Guard; a number of normal Oracle instance processes are not depicted. A description of the optional Broker elements follows the figure.
Figure 3: Oracle Data Guard Physical Standby Database Architecture

The Observer is a third-party arbitrator process, located in a third location (e.g. a test environment database server), which also functions as a coordinator for failover. After a disconnection between primary and standby, the standby checks with the Observer whether it also considers the primary to be down. If so, the standby can take over activities automatically.

DMON runs on the primary and all standby instances. DMON establishes a global view of the Data Guard environment, enabling centralized management of operations. The Oracle 10gR2 DMON understands the notion of a RAC database and local RAC
instances, starting automatic failover only if all RAC instances of the RAC database have failed. The DGB configuration file is local to each participating node.

Logical standby database relaxes the requirement for identical versions compared to the physical standby. A logical standby also allows the standby database to be queried without a separate state change that would limit applying changes. This is further supported by allowing limited differences in logical and physical database structure between primary and standby. These include e.g. separate data files for additional indexes and materialized views, which can support read queries (for e.g. reporting applications). However, logical standby usage sets limitations in terms of supported data types and object types (tables, views, sequences and PL/SQL procedures).

Logical standby only supports the maximum availability and maximum performance protection modes. In both cases redo transfer from the primary is the same as in the physical standby, but the application of SQL to the target database is done through separate transformation steps. These consist of redo log mining and redo record apply. Log mining consists of reader (Pmr), preparer (PmpX) and builder (Pmb) processes. The reader accesses the standby redo and feeds the parallel preparer processes. The transfer is done through the shared pool (SP) in the SGA. The preparers transform the redo records into logical change records (LCRs). The LCRs are then provided to the builder, which organizes them into transactions. In the redo record apply, the transactions are first sorted into dependency order by an analyzer process (Paa), which feeds the applier processes (PaaX). A Logical Standby Process (LSP) coordinates the activities between analyzer and appliers by monitoring the dependency information provided by the analyzer, assigning apply processes to transactions and separately authorizing commits. Because SQL is applied, online redo logs are generated in addition to the standby redo logs.
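The mining and apply stages described above can be illustrated with a small, heavily simplified Python sketch. The record layout and the function names (`prepare`, `build`, `apply_transactions`) are hypothetical, the stages run sequentially rather than as parallel processes, and the analyzer's dependency ordering is approximated here by plain commit order.

```python
from collections import defaultdict


def prepare(redo_records):
    """Preparer stage: turn raw redo records into logical change records."""
    for rec in redo_records:
        yield {"txn": rec["txn"], "op": rec["op"],
               "key": rec.get("key"), "value": rec.get("value")}


def build(lcrs):
    """Builder stage: group LCRs per transaction, emit one on its commit."""
    open_txns = defaultdict(list)
    for lcr in lcrs:
        if lcr["op"] == "commit":
            yield lcr["txn"], open_txns.pop(lcr["txn"], [])
        else:
            open_txns[lcr["txn"]].append(lcr)
    # Transactions without a commit record are never emitted.


def apply_transactions(transactions, table):
    """Apply stage: replay committed transactions in commit order."""
    for _txn, changes in transactions:
        for change in changes:
            if change["op"] == "delete":
                table.pop(change["key"], None)
            else:  # insert or update
                table[change["key"]] = change["value"]
    return table


if __name__ == "__main__":
    redo = [
        {"txn": "T1", "op": "insert", "key": "a", "value": 1},
        {"txn": "T2", "op": "insert", "key": "b", "value": 2},
        {"txn": "T1", "op": "update", "key": "a", "value": 3},
        {"txn": "T1", "op": "commit"},
    ]
    print(apply_transactions(build(prepare(redo)), {}))
```

In the usage example the uncommitted transaction T2 never reaches the standby table, which mirrors the separate commit authorization step described for the LSP.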
Use of Data Guard Broker is identical to its use with a physical standby database. The logical standby database with the Broker is depicted in Figure 4. The instances only contain background processes which directly relate to Data Guard; a number of normal Oracle instance processes are not depicted.

Figure 4: Oracle Data Guard Logical Standby Database Architecture
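The Observer arbitration used by the Broker in both standby architectures can be sketched as a simple decision function. This is only an illustration of the rule described for the Observer; the function name and parameters are hypothetical, and the handling of an unreachable Observer is an assumption made here rather than something stated in the text.

```python
def may_fail_over(standby_sees_primary, observer_reachable, observer_sees_primary):
    """Return True if the standby may take over automatically."""
    if standby_sees_primary:
        # The primary is still reachable from the standby: nothing to do.
        return False
    if not observer_reachable:
        # Assumption: without a second opinion the standby stays passive,
        # since it cannot tell a primary failure from its own isolation.
        return False
    # Fail over only when the Observer also considers the primary down.
    return not observer_sees_primary


if __name__ == "__main__":
    print(may_fail_over(False, True, False))
```

Requiring agreement from a party in a third location is what lets the standby distinguish a failed primary from a broken link between the two sites.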
Mixed standbys (i.e., an environment where a primary has both physical and logical standbys) are also supported. If the switchover or failover is made to the logical standby, the physical standby continues to protect the original primary (as its identical copy).

From the immediate environment perspective, Data Guard is a disaster recovery solution. Therefore, the immediate environment is not a central concern. Data Guard can be deployed over both LAN and WAN. While the protection mode needs to be based on business requirements, the actual replication throughput also depends on application profiles, the protection mode used, the redo transport mode (when it does not equal the protection mode) and the tuning of Oracle network protocol and networking-related parameters.

From the operational management perspective, the central reason for using Data Guard in a MAA configuration is to enable rolling upgrades during planned maintenance in cases where RAC does not support them, without requiring a complete shutdown of a service. In such a high-end two-site MAA architecture, the primary instance is formed of an Oracle RAC cluster which uses Data Guard to transfer redo records to a standby site, which also consists of an Oracle RAC cluster. Data Guard Broker understands that site-level recovery is needed only when all RAC instances are down. Such setups require the logical standby architecture, as only it allows different Oracle versions to be used. Also, in environments with such rolling upgrade requirements, planned switchover requires minimizing the downtime caused. This can be done through e.g. shutting down all but one RAC instance on both the primary and standby site and ensuring that long transactions have been committed and applied on the standby [15, 28]. Still, even with these capabilities enabled, the actual work requires significant preplanning. This is required to avoid both potential planning and execution errors.
This way back-out plans can also be created to avoid the typical errors of least-effort ad-hoc plans [26, pp. 157] in case something unexpected happens (although some DBAs may first take a short break before taking any actions that change the environment or database, specifically to avoid these types of errors). This also leads to the conclusion that MAA environments are more expensive not only in terms of hardware and software license costs but also in terms of operations (due to more time spent on planning and, most likely, a higher need for senior DBAs to handle the increased complexity). Of course, this needs to be weighed against the cost of unavailability.

It should be noted that both sites can act as the primary for a specific Oracle RAC database, with each functioning as the standby for the other. However, such circumstances require planning in terms of access network connectivity if the two sites are separated by a WAN infrastructure and use two separate network segments. Alternatively, the standby can run non-critical services, which are not redundant.

Oracle 11g Data Guard new features [21, Chapter 1] include enhancements to redo transport (through e.g. the option to compress transferred data) and to both the physical and logical standby database architectures. Physical standbys also support read queries without a separate mode switch. Transactional consistency is guaranteed, with the natural limitation to synchronized protection modes. Primary and standby database structures still need to be identical. This capability requires a separate product option called Active Data Guard. For structural differences, the physical standby also supports the notion of a snapshot database. This is a fully functional database forked
to its own lifecycle at one point from a physical standby. A snapshot can be reverted back to physical standby mode, at the same time losing its own changes and applying the changes the primary has provided [21, Chapter 9]. This could also be one way of maintaining an identical test database copy for all change testing. Beyond snapshots, a physical standby can also be converted to a logical standby for the duration of a rolling upgrade. Physical primary and standbys also support more extensive block corruption testing by enabling redo record creation for reads. The redo information can be used to rebuild block row content in case corruption is identified in a block read. Logical standbys support additional object types (e.g. table, index etc.). Logical standbys also enable switchover without the need to shut down all but one of the primary and standby RAC database instances.

2.3 Flashback technologies

Flashback technologies are a group of database features which enable queries or database state transfer backwards in time. Prior to this, Oracle had a separate set of packaged PL/SQL procedures called LogMiner, which effectively enabled similar query functionality, but was based on archived redo logs and required a significant amount of manual work. Database state transfer backwards in time was only possible through point-in-time recovery, requiring significant amounts of downtime. The scope of flashback technologies functionality depends on the Oracle database version.

Flashback Queries were introduced in Oracle9iR1, enabling use of undo tablespace (a single database file or a group of database files used for Oracle multiversioning, i.e., transactional read-consistency and rollbacks) segment information. This allowed for querying all data as of a point in time, assuming the data was available in undo. The functionality was extended in 10g with Flashback Versions Query, enabling e.g.
finding out which transactions had manipulated data or how the content of a particular row had changed over transactions. Similarly, all changes of a particular transaction can be discovered with Flashback Transaction Query. All flashback query activities help in finding out the extent of an operational or application logic error.

Other flashback technologies enable direct data recovery in case of e.g. operational or application logic error. Oracle 10g extends Flashback Query with recovery-related commands. These are the Flashback Table, Flashback Drop and Flashback Database functionality. These enable backward-in-time recovery for particular tables (regardless of e.g. table structural changes), recovery from accidental drop of a table or index, and generic backward-in-time travel, respectively. The recovery mechanisms eliminate the need to take the whole database instance offline for point-in-time recovery, reducing operational and technical MTTR.

Technically, whereas Flashback Queries and Flashback Table are based on the undo tablespace, Flashback Drop uses metadata modifications, only marking the table or index as dropped. The actual drop is done as a timed job, providing the DBA time to undo the drop. Flashback Database, on the other hand, utilizes a separate Flash Recovery Area. This consists of copies of all datafiles, incremental backups, an extra copy of archive logs, control files, control file backups and flash recovery logs. Flashback Database utilizes a separate flashback buffer in the SGA (FB), which is populated with before images of blocks. A recovery writer process (RVWR)
periodically writes from FB to the flash recovery logs. These logs are written in a circular fashion similar to redo logs. In case a flashback database recovery is required, the before images taken prior to the intended recovery point in time are used. After this, redo is used to roll the database forward to the exact intended point in time. I.e., the physically logged before image of a block establishes a data block baseline, to which physiologically logged redo records are applied to roll the block information forward. Flash recovery logs are not archived. Therefore, the capability of Flashback Database is limited by the flash recovery log size, the MTTR of flash recovery (as basically any one image can be used as the basis of roll forward) and the number and amount of activities which require before image logging (i.e. any write activity, whether from DML or DDL). The other files contained in the flash recovery area are created during backups to effectively maintain local backups. The backups effectively determine one point in time beyond which flash recovery log data should not be maintained.

Oracle 11g Flashback new features extend both Flashback Queries and Flashback Database. Flashback Transaction extends Flashback Queries by allowing the undo of a certain transaction, utilizing undo structures to create a compensating transaction. The flashback can be extended to dependent transactions. Flashback Database Archive extends the Flashback Database functionality by allowing creation of flash recovery log archives. This removes the need to size the flashback area according to some time estimate. The approach allows e.g. simulation of a transaction at any point in time. Flashback Database Archive is a separate database product option, called Oracle Total Recall.
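The Flashback Database recovery described earlier in this section, i.e. a before image as baseline followed by redo roll forward, can be illustrated with a minimal Python sketch for a single block. The data layout is hypothetical: a before image tagged with SCN s is assumed to reflect all changes up to and including s, and blocks are modelled as small dictionaries.

```python
def flashback_block(images, redo, block_id, target_scn):
    """Rebuild one block as of target_scn.

    images: list of (scn, block_id, content) entries from the flashback logs
    redo:   list of (scn, block_id, key, value) change records
    """
    candidates = [(scn, content) for scn, blk, content in images
                  if blk == block_id and scn <= target_scn]
    if not candidates:
        # The circular flashback logs no longer reach back far enough.
        raise LookupError("target is older than the retained flashback logs")
    # Baseline: the latest before image at or before the target.
    base_scn, base = max(candidates, key=lambda c: c[0])
    block = dict(base)
    # Roll forward with redo from the baseline up to the exact target.
    for scn, blk, key, value in sorted(redo, key=lambda r: r[0]):
        if blk == block_id and base_scn < scn <= target_scn:
            block[key] = value
    return block


if __name__ == "__main__":
    images = [(10, "B1", {"x": 1}), (30, "B1", {"x": 3, "y": 1})]
    redo = [(20, "B1", "x", 2), (25, "B1", "y", 1),
            (30, "B1", "x", 3), (40, "B1", "x", 4)]
    print(flashback_block(images, redo, "B1", 27))
```

The error branch corresponds to the limitation noted above: because flash recovery logs are circular and not archived, a target older than the oldest retained image cannot be reached.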
2.4 Other MAA technologies on Oracle database

Beyond the clustering, replication and operational error supporting features, Oracle Automatic Storage Management (ASM) and the Recovery Manager (RMAN) backup application are seen as central parts of MAA. The former relates to ensuring an integrated stack and centralized management, in this sense having similar underlying goals as clusterware. The latter relates to incremental backup capability, which in general is extremely important in VLDB environments to ensure performance during backup and reduced MTTR in recovery.

ASM plays a relatively minor role in many MAA setups. The database file fault tolerance is typically provided by a SAN storage device. Therefore, ensuring operational management of the SAN storage device becomes a primary concern from the system perspective. Still, for DBAs, ASM may reduce errors by allowing file management to be done with the same tools as database management, although this greatly depends on DBA background. In non-MAA setups, when using Oracle RAC on Oracle Standard Edition, ASM is required. While ASM usage for fault tolerance is relatively small, it is still worth noting that ASM does support online data management actions, resulting in increased uptime. Tasks such as moving data files (and also migrating between storage subsystems) can be done through its disk management or redundant file configuration facilities.
The special features of the RMAN backup application relate to the Oracle 10g capability to do block-level incremental backup. With this, only database blocks changed since the last backup are included in the backup set. Basically, this requires that changes in an active database are tracked on block level. From the database architecture perspective this requires an additional disk data structure and a background process. The new disk data structure is called the change tracking file. This is built by capturing block identifiers as redo is generated to the redo logs. A separate local-instance background process, the change tracking writer (CTWR), is responsible for this redo capture and write to the change tracking file. In RAC environments, this results in all nodes creating their own change tracking file and using that as the basis of backup. On start of backup, the RMAN backup process reads the change tracking file, backing up only the listed blocks. Dynamic views provide information from the internal database structures, which enables evaluating the reduction of backup size.

3 Threat analysis

The threat analysis contains only recovery-related scenarios of the MAA environment provided in Figure 5.
Figure 5: Oracle MAA architecture for threat analysis

As discussed in the previous section, configuring MAA requires a number of architecture decisions and use of options. In particular, clusterware requires (i) optional voting disks, (ii) configuring VIP, listeners and fault-tolerant load balancing and (iii) correct setup of the immediate environment. RAC requires (i) sharing of optional files on the SAN storage to ensure automated node recovery and (ii) use of TAF and FAN event handling in the application. Data Guard requires (i) use of a logical standby database, (ii) pre-configuration of standby redo for the primary and (iii) use of Data Guard Broker to centralize management (and therefore reduce operational error risk). In particular, Data Guard sets limitations, as maximum protection safety (i.e., a strict 2-safe algorithm) cannot be used.
The main focus is to establish the means by which the database is protected against different types of failures, following the previous discussion of features. Issues outside high availability and disaster recovery, as well as analysis of multiple concurrent failures, are beyond the scope of the analysis.

3.1 Database process and resource failure

Process failure in the database can happen to user-related processes as well as database instance processes. User processes include per-user query processes and e.g. Oracle multithreaded server (MTS) processes aka dispatch servers, which mimic user processes to ensure resource availability. PMON handles monitoring and recovery of these processes. Background processes, on the other hand, are monitored by CRSD on the clusterware layer and RACGIMON on the RAC layer. Failure of a background process results in node failure.

Disk data structure failure depends on whether the failure in question is local to one node or shared by both RAC nodes using a common resource (control file, database tablespace file, redo log). The former case leads to node failure. The latter case leads to site failure. In some cases of storage corruption, impact can be prevented through use of a data validation algorithm inside the storage system, following the Hardware Assisted Resilient Data (HARD) initiative.

Data block recovery is supported by Oracle RAC as an automated online activity. In case a data block buffer has been corrupted by e.g. user process termination or incomplete redo application, it is recovered through recovery initiated by PMON. The predecessor image is looked up from the shared distributed cache and fetched, and redo is applied. If the predecessor image is not available, the datafile data block image is used [30, pp ].

3.2 Server hardware and immediate environment failure

CPU failure can be masked by the server hardware in some cases. Otherwise node failure follows.
Local disk storage failure is handled initially by disk parity error checking and mirroring. If this fails, node failure follows.

Memory failure handling depends on the hardware, which in itself may be error correcting or may support shutdown of individual memory banks. If this fails, node failure follows.

LAN interconnect failure is handled at the OS layer through Linux interface bonding. The failure of one NIC port results in the other one taking over without noticeable gaps. If this fails (e.g. due to a configuration error), clusterware heartbeat failure follows, resulting in node failure.

LAN network failure divides into private and public network failures. In case of a public access network failure in the site infrastructure to site A cluster nodes, site failure follows. In case of a private network failure, voting disks are used to determine group membership on the clusterware layer and some members are evicted from the
cluster and RAC database. This avoids the split-brain syndrome. However, site failure is a plausible approach, as split brain is not the only problem. Another would be the latency increase: because the distributed cache cannot be used, disk would need to be used for transfer of data blocks between the RAC instances.

SAN interconnect and SAN network failure for a single node is handled in third-party host bus adapter (HBA) drivers. In case of loss of connectivity through one SAN fabric route to the SAN storage, an alternative SAN interconnect or SAN fabric route is used. If this fails, node failure results from either inability to access voting disks (clusterware layer) or inability to access one of the shared files on the shared storage (RAC layer). In case access to a shared resource fails on all RAC instance nodes, e.g. due to failure of shared storage, site failure follows after CRSD has attempted to mount the resources five times and the nodes are finally rebooted on the clusterware layer (resulting in DMON loss of connectivity to the Observer in the Oracle Data Guard Broker functional area).

3.3 RAC node failure

RAC node failure [30, pp ] is either failure of a node on the clusterware layer or failure of a RAC instance on the RAC layer. This can result from misconfiguration of the previously discussed high availability masking as well as RAC instance failure (e.g. internal software error). RAC node failure requires that another RAC instance takes over the management of instance recovery.

First, another instance must identify the failure of a node. This happens on both the clusterware layer (by OCSSD notification of group membership loss with OPROCD failed IO fencing, or CRSD failure to restart resources and the resulting PMON boot) and the RAC layer (by LMON). If the RAC database has more than two RAC instances, this identification can happen concurrently on a number of RAC instances.
Such race conditions are resolved by which LMON process is first able to acquire access to the redo logs of the failed RAC instance. That node starts remastering the resources. Only the failed RAC instance's GRD resources are remastered (called lazy remastering). The cache recovery starts with rebuilding the global enqueues of the failed RAC instance's part of the GRD. This is done by the LMD and LMON processes of each node that received mastership of new resources. Remastering continues with remastering cache data block resources in continuous sets of blocks. The set size depends on a hidden parameter. As with dynamic remastering, node workloads impact the remastering result.

Cache recovery can be started by SMON on the recovering RAC instance after the global enqueues have been rebuilt. SMON first builds the set of data blocks that require recovery. To do this, SMON executes the first-pass read of the failed RAC instance's redo logs in parallel. Redo records contain Block Written Records, which indicate when a block was written to disk. This can be used to determine whether redo record changes need to be applied to a block or not (i.e., whether the data block on disk already contains the row-related changes). If recovery is required, the block is put into the recovery set. Also, the related redo records are collected. After building the recovery set, the recovering RAC instance initiates remastering to assume mastership of the required GRD resources (the data blocks and the related enqueues). This ends
the brownout period of the RAC database, as the other RAC instances can provide service to all GRD resources that do not require recovery and the database is opened.

Transaction recovery is limited to the recovery set, following normal two-phase, ARIES-like recovery. In the first phase, the collected redo records are applied. This is followed by the rollback of uncommitted transactions to complete the two-phase recovery. The behavior can follow the normal Oracle deferred rollback approach, in which case the recovery set is available for use after the first phase. In this case, rollback is deferred until data blocks containing uncommitted data are requested. Rollback is applied to the data block prior to providing it to the requestor. The rest of the transaction (i.e., rows on other blocks that relate to the uncommitted transaction(s) on the recovered block) is rolled back in the background. Alternatively, early admission to the database can be skipped and the second phase can be carried out through a parallel recovery mechanism. In this approach, concurrency depends on the number of CPU cores available on the recovering node and on the processing load on these not being high already.

Beyond these, the failed RAC node at some point comes back online (either automatically after a reboot initiated by the node itself, or after the maintenance work required to boot the node). At this point, GRD resources are remastered as if a new node had joined the cluster.

From the client perspective, the current MAA documentation promises that client failover for RAC node failure is resolved in less than 10 seconds irrespective of the type of query being executed. However, this requires that the server listener and thick driver clients are configured with the failover configuration discussed in Section 2.1 and that checkpointing of RAC is tuned to match recovery time requirements.
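The instance recovery steps of this section (first-pass read to build the recovery set from Block Written Records, roll forward, then rollback of uncommitted transactions) can be sketched in Python as follows. The record format is hypothetical and much simplified: a Block Written Record with SCN s is assumed to mean that all changes up to s are on disk, and old values are carried in the change records instead of in separate undo structures. As a further simplification, rollback here covers every uncommitted change found in the redo and is done eagerly rather than deferred.

```python
def build_recovery_set(redo):
    """First pass: find blocks whose latest changes are not yet on disk."""
    last_written = {}  # block -> SCN covered by its latest Block Written Record
    last_changed = {}  # block -> SCN of its latest change record
    for rec in redo:
        if rec["kind"] == "bwr":
            last_written[rec["block"]] = rec["scn"]
        elif rec["kind"] == "change":
            last_changed[rec["block"]] = rec["scn"]
    recovery_set = {blk for blk, scn in last_changed.items()
                    if scn > last_written.get(blk, -1)}
    return recovery_set, last_written


def recover(redo, disk):
    """Two-phase recovery of the on-disk blocks of a failed instance."""
    recovery_set, last_written = build_recovery_set(redo)
    committed = {rec["txn"] for rec in redo if rec["kind"] == "commit"}
    # Phase 1 (roll forward): reapply changes that never reached disk.
    for rec in redo:
        if rec["kind"] != "change" or rec["block"] not in recovery_set:
            continue
        if rec["scn"] > last_written.get(rec["block"], -1):
            disk.setdefault(rec["block"], {})[rec["key"]] = rec["new"]
    # Phase 2 (roll back): undo uncommitted changes, newest first.
    for rec in reversed(redo):
        if rec["kind"] == "change" and rec["txn"] not in committed:
            disk.setdefault(rec["block"], {})[rec["key"]] = rec["old"]
    return disk


if __name__ == "__main__":
    redo = [
        {"kind": "change", "scn": 1, "block": "B1", "txn": "T1", "key": "x", "old": 0, "new": 1},
        {"kind": "bwr", "scn": 1, "block": "B1"},
        {"kind": "change", "scn": 2, "block": "B1", "txn": "T1", "key": "x", "old": 1, "new": 2},
        {"kind": "change", "scn": 3, "block": "B2", "txn": "T2", "key": "y", "old": 5, "new": 6},
        {"kind": "commit", "scn": 4, "txn": "T1"},
    ]
    print(recover(redo, {"B1": {"x": 1}, "B2": {"y": 5}}))
```

Blocks outside the recovery set are never touched by the roll forward, which is why the other instances can keep serving them during recovery.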
3.4 Operational errors and planned activities

Operational management focusing on reducing service downtime needs to be concerned with fast recovery from operational errors (whether mistakes or misunderstandings) and reduction of downtime in carrying out planned activities, which typically involve some type of change to the system.

Operational error recovery depends on the type of operational error. For user or application errors, flashback queries are used to track the impact of a faulty transaction. Compensating transactions that undo the error need to be created manually. Administrator table and index drops are recovered through Flashback Drop. Other accidental changes are recovered through Flashback Database, impacting the whole database state.

Planned activities contain changes such as hardware upgrades, OS patching, Oracle clusterware and database patching, and database migration to the next release. Hardware upgrades, OS patching and some Oracle patching can be masked (resulting at worst in session timeout) on the Oracle RAC layer. However, clusterware patching, database patching and migration to the next release require a site switchover.