Disaster Recovery with General Parallel File System

August 2004
Abstract

The ability to detect and quickly recover from a massive-scale hardware failure is of paramount importance to businesses that make use of real-time data processing systems. General Parallel File System (GPFS) provides a number of features that facilitate the implementation of highly available GPFS environments capable of withstanding catastrophic hardware failures. By maintaining a redundant replica of the file system's data at another (geographically separated) location, the system can sustain its processing using the secondary replica of the data in the event of a total failure in the primary environment. In this paper, we present an overview of the disaster recovery features available in the 2.2 release of GPFS and provide detailed hands-on guidance on implementing the various types of disaster-tolerant configurations supported in this release.

Overview and terminology

Businesses that depend on real-time data processing systems are vulnerable to a profound negative impact from natural or unnatural disasters such as fires, tornadoes, earthquakes, or power failures. A catastrophe can permanently disable the customer's data processing infrastructure and, without proper backup procedures, result in a permanent loss of business-critical data. To help minimize the impact of such unplanned hardware outages, many businesses implement preventive measures that enable them to recover quickly from a disastrous failure and provide near-continual availability of mission-critical applications and the associated data. GPFS 2.2 provides a number of features to support your disaster recovery solution.

At a high level, a disaster-resilient GPFS cluster environment is typically made up of two (sometimes three) distinct hardware sites that operate in a coordinated fashion and are separated by a substantial geographic distance. Each site houses some number of GPFS nodes and a storage resource holding a complete replica of the file system. In the event of a catastrophic hardware failure that disables the operation of an entire site, GPFS is designed to fail over to the remaining subset of the cluster and continue serving the data using the replica of the file system that survived the disaster. In this paper, we examine several methods for maintaining the secondary replica: synchronous mirroring (through the use of IBM TotalStorage Enterprise Storage Server (ESS) Peer-to-Peer Remote Copy (PPRC) or logical GPFS replication), along with a non-synchronous approach that utilizes ESS FlashCopy.

Certain high-end storage subsystems such as ESS implement support for volume-level geographic mirroring of data. For example, the PPRC feature of the ESS enables users to establish a persistent mirroring relationship between pairs of Logical Units (LUNs) on two subsystems connected over an ESCON or a Fibre Channel link. All updates performed on the set of primary (source) LUNs appear in the same order on the secondary (target) disks in the target subsystem. PPRC ensures that if the source volume fails, the target holds an exact bitwise replica of the source's content as seen at the time of the failure. This solution is described in the section "Active/passive GPFS configurations with ESS Peer-to-Peer Remote Copy".

The existing data and metadata replication features of GPFS can be similarly used to implement synchronous mirroring between a pair of geographically separated sites.
The use of logical replication-based mirroring is an attractive alternative to PPRC, in particular because it offers a generic solution that requires no specific support from the disk subsystem beyond the basic ability to read and write data blocks. This solution is described in the section "Active/active GPFS configurations employing logical replication and quorum tiebreakers".

The primary advantage of both synchronous mirroring methods lies in the minimization of the risk of permanent data loss. Both methods provide two consistent, up-to-date replicas of the file system, each available for recovery should the other one fail. However, inherent to all solutions that synchronously mirror data over a wide-area network link is the latency penalty induced by the replicated write I/Os. This makes both mirroring methods prohibitively inefficient for certain types of performance-oriented applications.

An alternative technique involves taking periodic point-in-time copies of the file system using a facility such as ESS FlashCopy. The copy is subsequently transferred to a remote backup location using PPRC and/or written to tape. The key difference between this and the synchronous mirroring techniques is that the creation of the secondary replica takes place asynchronously with respect to the regular file system activity against the primary replica, effectively eliminating the write penalty associated with synchronous mirroring. This solution is described in the section "Online backup with ESS FlashCopy".

This paper is written for Information Technology professionals who have some experience with GPFS, ESS PPRC, and FlashCopy. For more information on these technologies, please refer to the sources listed in the References section. The reader is assumed to have some familiarity with the standard administrative concepts of GPFS and to understand the node quorum rules enforced by GPFS to help protect the integrity of on-disk data in the event of a network partition.

Active/active GPFS configurations employing logical replication and quorum tiebreakers

Overview of active/active dispersed GPFS clusters

In an active/active GPFS configuration, a single GPFS cluster is defined over two geographically separated physical clusters, or sites. One or more file systems are created, each to be mounted and accessed concurrently from both locations. The data and metadata replication features of GPFS are used to maintain a secondary copy of each file system block, relying on the concept of disk failure groups to control the physical placement of the individual copies.

The high-level organization of a replicated active/active GPFS cluster is shown in Figure 1. We partition the set of available disk volumes into two failure groups (one defined for each site) and create a standard replicated file system, specifying a replication factor of two for both data and metadata. When allocating new file system blocks, GPFS always assigns replicas of the same block to distinct failure groups. This provides a sufficient level of redundancy, allowing each site to continue operating independently should the other site fail.
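Once the file system described in the example that follows has been created, the replication settings and failure group assignments can be verified with the standard GPFS listing commands. A minimal sketch, assuming the file system is named fs0 as in the setup example below:

   mmlsfs fs0 -m -M -r -R   # default and maximum replicas for metadata and data
   mmlsdisk fs0             # per-disk listing; the failure group column should
                            # show groups 1 and 2 plus the tiebreaker's group 3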
[Figure 1. High-level organization of a replicated geographically dispersed GPFS cluster. Sites A and B each house quorum and non-quorum nodes with shared disk access (SAN/NSD) to the site A and site B storage; the primary and secondary cluster configuration data servers reside at sites A and B, and site C hosts the tiebreaker node. All sites are connected by an IP network.]

As described in the GPFS Concepts, Planning, and Installation Guide, GPFS enforces a node quorum rule to prevent multiple nodes from assuming the role of the file system manager in the event of a network partition. Thus, a majority of quorum nodes must remain active in order for the cluster to sustain normal file system usage (multi-node quorum). Furthermore, GPFS uses a quorum replication algorithm to maintain the content of the file system descriptor (one of the central elements of the GPFS metadata). When formatting the file system, GPFS assigns some number of disks (usually three) as the descriptor replica holders that are responsible for maintaining an up-to-date copy of the descriptor. Similar to the node quorum requirement, a majority of the replica holder disks must remain available at all times to sustain normal file system operations.

Considering these quorum constraints, we suggest introducing a third site into the configuration to fulfill the role of a tiebreaker for the node and file system descriptor quorum decisions. Typically, the tiebreaker site would accommodate a single quorum node and a single GPFS disk (we suggest using a logical volume built on top of the tiebreaker node's internal disk) with the disk usage defined as descriptor-only. The desconly disk usage option instructs GPFS to exclude the respective disk from the regular block allocation activity and avoid using this disk for the storage of data and metadata. This disk serves no function other than to provide an additional replica of the file system descriptor needed to sustain quorum should a disaster cripple one of the other descriptor replica disks. The tiebreaker disk is accessed over NSD, with the tiebreaker node defined as its primary NSD server.
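The tiebreaker volume can be any small logical volume on the tiebreaker node's internal storage, since GPFS stores only a file system descriptor on it. A minimal sketch for an AIX tiebreaker node; the volume group name and size are assumptions, not requirements from the source:

   mklv -y tielv rootvg 1   # create a one-partition logical volume named tielv;
                            # one logical partition is ample for a desconly disk

The corresponding disk descriptor line then names nodetiebr as the disk's primary NSD server, as shown in the sample DiskDescFile later in this section.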
Since the tiebreaker node does not normally access the file system, direct SAN connectivity between that node and the subsystems that supply the actual storage is unnecessary. However, we must instruct GPFS to disregard all disk access errors on the tiebreaker by enabling the unmountondiskfail configuration parameter through the mmchconfig command for that node. When enabled, this parameter forces the tiebreaker node to treat the lack of disk connectivity as a local error (resulting in a failure to mount the file system) rather than reporting this condition to the file system manager as a disk failure.

The three-site configuration proposed here is resilient to a complete failure of any single hardware site. Should all disk volumes in one of the failure groups become unavailable as a result of a disaster, GPFS performs a transparent failover to the remaining set of disks and continues serving the data to the surviving subset of nodes with no administrative intervention. Note that while nothing prevents users from placing the tiebreaker resources at one of the active sites, we suggest installing the tiebreakers at a third geographically distinct location to minimize the risk of double-site failures.

Setting up an active/active dispersed GPFS cluster

This is a step-by-step example illustrating the creation of an active/active geographically dispersed GPFS cluster and a single replicated file system. We assume the following configuration:

Site A
   Node hostnames: nodea001, nodea002, nodea003, nodea004
   Disk volume names: hdisk11, hdisk12, hdisk13, hdisk14

Site B
   Node hostnames: nodeb001, nodeb002, nodeb003, nodeb004
   Disk volume names: hdisk15, hdisk16, hdisk17, hdisk18

Site C (tiebreaker)
   Node hostname: nodetiebr
   Disk volume name: tielv

Disks hdisk11-hdisk18 are SAN-attached and accessible from all nodes at sites A and B. tielv is a logical volume defined over nodetiebr's internal disk.

1. Create a GPFS cluster of type lc, selecting two nodes, one at each site, as the primary and secondary cluster data server nodes:

      mmcrcluster -t lc -n NodeDescFile -p nodea001 -s nodeb001

2. Create and configure the GPFS nodeset:

      mmconfig -a

3. Disable the GPFS autoload option (see below):

      mmchconfig autoload=no
4. Define the set of network shared disks for the cluster, where the disks at sites A and B are assigned to failure groups 1 and 2, respectively, and the tiebreaker disk is assigned to failure group 3:

      mmcrnsd -F DiskDescFile

5. Enable unmountondiskfail on the tiebreaker to prevent this node from falsely declaring the disks as having failed:

      mmchconfig unmountondiskfail=yes nodetiebr

6. Start the GPFS daemon on all nodes:

      mmstartup -a

7. Create a replicated file system and mount it on all nodes in the cluster:

      mmcrfs /gpfs/fs0 fs0 -F DiskDescFile -m 2 -M 2 -r 2 -R 2
      mount /gpfs/fs0

Sample NodeDescFile:

   nodea001:nonquorum-client
   nodea002:nonquorum-client
   nodea003:nonquorum-client
   nodea004:quorum-manager
   nodeb001:nonquorum-client
   nodeb002:nonquorum-client
   nodeb003:nonquorum-client
   nodeb004:quorum-manager
   nodetiebr:quorum-client

Sample DiskDescFile, as rewritten by mmcrnsd (the commented lines preserve the original disk descriptors of the form DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup):

   #/dev/hdisk11:::dataandmetadata:1
   gpfs1nsd:::dataandmetadata:1
   #/dev/hdisk12:::dataandmetadata:1
   gpfs2nsd:::dataandmetadata:1
   #/dev/hdisk13:::dataandmetadata:1
   gpfs3nsd:::dataandmetadata:1
   #/dev/hdisk14:::dataandmetadata:1
   gpfs4nsd:::dataandmetadata:1
   #/dev/hdisk15:::dataandmetadata:2
   gpfs5nsd:::dataandmetadata:2
   #/dev/hdisk16:::dataandmetadata:2
   gpfs6nsd:::dataandmetadata:2
   #/dev/hdisk17:::dataandmetadata:2
   gpfs7nsd:::dataandmetadata:2
   #/dev/hdisk18:::dataandmetadata:2
   gpfs8nsd:::dataandmetadata:2
   #/dev/tielv:nodetiebr::desconly:3
   gpfs9nsd:::desconly:3

Steps to take after a disaster

The procedure provided in "Setting up an active/active dispersed GPFS cluster" creates a disaster-resilient GPFS file system that masks failures of a single geographic site (assuming that the other site and the tiebreaker remain operational). That is, if a catastrophic failure disables site A, GPFS detects this failure, marks site A's disks as down, and continues serving the data using the replica located at site B.
The failover to the surviving replica is performed automatically and requires no administrative intervention, although, depending on the specifics of the user application, manual steps may need to be taken to transfer the application's activity to the nodes at the surviving site.

Later, depending on the nature of the disaster, we may want to permanently remove the failed resources from the GPFS configuration or, if the outage is of a temporary nature, preserve the original configuration and simply restart GPFS on the affected nodes after the repair.

In the case of permanent loss of hardware assets, execute the following procedure to remove all references to the failed nodes and disks from the GPFS configuration. (Note: we assume that site A has become unavailable while site B and the tiebreaker survived the disaster.)

1. As the primary cluster data server node, nodea001 has become unavailable. Migrate the server to a node at site B:

      mmchcluster -p nodeb002

   (Alternatively, if the secondary cluster data server node has become unavailable as a result of the disaster, run mmchcluster -s to migrate the secondary server to the surviving site.)

2. Delete the failed disks from the GPFS configuration:

      mmdeldisk fs0 "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"
      mmdelnsd "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"

3. Shut down GPFS on all surviving nodes:

      mmshutdown -a

4. Delete the failed nodes from the GPFS configuration:

      mmdelnode nodea001,nodea002,nodea003,nodea004
      mmdelnode -c
      mmdelcluster nodea001,nodea002,nodea003,nodea004

5. Start the GPFS daemon on the remaining nodes and remount the file system on all nodes:

      mmstartup -a
      mount /gpfs/fs0

On the other hand, if the nature of the disaster is such that the resources at the failed site are left physically intact but temporarily inaccessible (for example, as a result of a power outage), no changes to the GPFS configuration are needed at this stage. The GPFS cluster enters the failover state, in which it will remain until a decision is made to restore the operation of the affected site by executing the failback procedure described in the section "Transition to the active/active state after the repair".
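For the permanent-loss case, the five steps above lend themselves to a small recovery script. A hedged sketch, assuming the file system, node, and NSD names from this example, run from a surviving node at site B:

   #!/bin/ksh
   # Remove all references to failed site A from the GPFS configuration.
   mmchcluster -p nodeb002                               # migrate primary server
   mmdeldisk fs0 "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"   # drop failed disks
   mmdelnsd "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"
   mmshutdown -a                                         # stop surviving nodes
   mmdelnode nodea001,nodea002,nodea003,nodea004         # drop failed nodes
   mmdelnode -c
   mmdelcluster nodea001,nodea002,nodea003,nodea004
   mmstartup -a                                          # restart the daemon
   mount /gpfs/fs0                                       # remount on this node

In practice each step should be checked for success before proceeding; the linear form is kept here for readability.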
Please note: we suggest that users not make any changes to the GPFS configuration (with commands such as mmchconfig, mmchfs, and so on) while the cluster is in the failover state. These commands require both cluster configuration servers to be operational, and while the servers can be migrated to the surviving site, it is best to avoid this step if the disaster does not leave the affected site permanently disabled. The reason is that the migration and the subsequent changes to the configuration would leave the two sites with distinct (and possibly inconsistent) copies of the GPFS configuration data file (mmsdrfs). If it becomes absolutely necessary to modify the GPFS configuration while in failover mode, please ensure that all nodes at the affected site are left in a stable inactive state (e.g. powered down) and will remain in such a state until the decision is made to execute the failback procedure. As a precaution, we suggest disabling the GPFS autoload option on all nodes to prevent GPFS from bringing itself up automatically on the affected nodes should they come up spontaneously at some point after a disaster. During failback, please make sure to run the mmchcluster -p LATEST command to resynchronize the mmsdrfs data across the entire cluster before restarting the GPFS daemon at the disaster-affected site.

Transition to the active/active state after the repair

Once the physical operation of site A has been restored, an administrative procedure is executed to resume file system access at the recovered site and rebalance the replicated file system data across the two locations. The following two cases must be considered.

If the disaster resulted in a permanent loss of hardware resources (nodes and/or disks) and these resources have subsequently been replaced, the procedure involves defining the new resources as part of the GPFS cluster. In the following example, we augment the existing GPFS configuration with four new nodes (nodea005-nodea008) and four disks (hdisk11-hdisk14) located at site A:

1. Add the new nodes to the GPFS cluster and the nodeset:

      mmaddcluster nodea005:nonquorum-client,nodea006:nonquorum-client,nodea007:nonquorum-client,nodea008:quorum-manager
      mmaddnode nodea005,nodea006,nodea007,nodea008

2. Migrate the primary cluster data server back to site A:

      mmchcluster -p nodea005

   (Alternatively, if the secondary cluster data server was migrated after the disaster, run mmchcluster -s to migrate the secondary server back to the recovered site.)

3. Start GPFS on the newly added nodes and mount the file system:

      mmstartup -w nodea005,nodea006,nodea007,nodea008
      mount /gpfs/fs0   (on nodes nodea005, nodea006, nodea007, nodea008)

4. Create the new NSDs:

      mmcrnsd -F NewDiskDescFile

5. Add the disks to the GPFS file system and rebalance it to restore the initial replication properties:

      mmadddisk fs0 -F NewDiskDescFile -r -a

   This command performs a full scan of the file system's metadata and rebalances the files to make use of the new disks. This is a potentially time-consuming task, so we suggest running the command with the -a option to move the rebalancing operation into asynchronous mode. For the same reason, we recommend executing this command only once, specifying all new disks in the NewDiskDescFile, as opposed to running it once for every new disk.

Sample NewDiskDescFile describing the disks at site A, as rewritten by mmcrnsd:

   #/dev/hdisk11:::dataandmetadata:1
   gpfs10nsd:::dataandmetadata:1
   #/dev/hdisk12:::dataandmetadata:1
   gpfs11nsd:::dataandmetadata:1
   #/dev/hdisk13:::dataandmetadata:1
   gpfs12nsd:::dataandmetadata:1
   #/dev/hdisk14:::dataandmetadata:1
   gpfs13nsd:::dataandmetadata:1

On the other hand, if the GPFS cluster is in the failover state (with the affected resources still part of the configuration), the failback procedure is executed to resume the operation of GPFS across the entire cluster. The procedure is as follows:

1. Ensure that all nodes have the latest copy of the mmsdrfs file:

      mmchcluster -p LATEST

2. Start GPFS on all nodes at site A and mount the file system on these nodes:

      mmstartup -w nodea001,nodea002,nodea003,nodea004
      mount /gpfs/fs0   (on nodes nodea001, nodea002, nodea003, nodea004)

3a. If the data on the disks at site A is intact (meaning that no unexpected modifications to the content of these disks have taken place during or after the disaster), bring the disks online:

      mmchdisk fs0 start -a

   This command also restripes the file system to restore the initial replication properties.

3b. Otherwise, delete the affected disks from the file system, recreate the NSDs, and add the disks back:

      mmdeldisk fs0 "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"
      mmdelnsd "gpfs1nsd;gpfs2nsd;gpfs3nsd;gpfs4nsd"
      mmcrnsd -F NewDiskDescFile
      mmadddisk fs0 -F NewDiskDescFile -r -a

   The mmadddisk command with the -r option also restripes the file system to restore the initial replication properties.

Active/passive GPFS configurations with ESS Peer-to-Peer Remote Copy

Overview of active/passive PPRC environments

In an active/passive environment, two GPFS clusters are set up in two geographically distinct locations (the production and the recovery sites); we refer to these as peer GPFS clusters. A GPFS file system is defined over a set of disk volumes located at the production site, and these disks are mirrored using PPRC to a secondary set of volumes located at the recovery site. During normal operation, only the nodes in the production GPFS cluster mount and access the main replica of the file system, which resides on the production site's storage resource.

In the event of a catastrophe in the production cluster, the PPRC failover task is executed to enable access to the secondary replica of the file system located on the target PPRC disks. (Please refer to the IBM ESS Redbook for a detailed explanation of PPRC failover and failback and the means of invoking these procedures in your environment.) The secondary replica is then mounted on nodes in the recovery cluster as a regular GPFS file system, thus allowing the processing of data to resume at the recovery site. At a later point, after restoring the physical operation of the production site, we execute the failback procedure to resynchronize the content of the PPRC volume pairs between the two clusters and re-enable access to the file system in the production environment.

A schematic view of an active/passive GPFS environment with PPRC is shown in Figure 2. Note that in such environments, only one of the sites accesses the GPFS file system at any given time, which is one of the primary differences between a configuration of this type and the active/active model described in the previous section.

After defining the set of NSDs and creating the GPFS file system in the primary cluster, we use the mmfsctl syncfsconfig command to import the file system's definition into the peer recovery cluster. This command extracts the relevant configuration data from the local mmsdrfs file, contacts one of the nodes in the peer recovery cluster, and imports the extracted data into that cluster's mmsdrfs file to make the file system known to the recovery nodes. The mmfsctl command is available for GPFS beginning with APAR IY.
[Figure 2. An active/passive disaster-tolerant GPFS environment using PPRC mirroring. The production GPFS cluster and the recovery GPFS cluster, each with quorum and non-quorum nodes and primary and secondary cluster configuration data servers, are connected by an IP network; each cluster has shared disk access (SAN/NSD) to its local volumes, and the PPRC source volumes at the production site are mirrored over PPRC to the target volumes at the recovery site.]

Syntax

   mmfsctl {Device | all} syncfsconfig -n PeerNodesFile [-S PeerChangeSpecFile]

Purpose

   Synchronizes the configuration state of a GPFS file system between the local cluster and its peer.

Parameters

   Device | all
      The device name of the GPFS file system whose configuration is being resynchronized. If the all option is used, this operation is performed on all GPFS file systems defined in the local cluster.

   -n PeerNodesFile
      The list of contact nodes (one hostname per line) for the peer cluster. For efficiency reasons, we suggest specifying the identities of the peer cluster's primary and secondary configuration data servers. If none of the contact nodes are reachable, the command reports an error and no changes take place.

   -S PeerChangeSpecFile
      The description of changes to be made to the file system in the peer cluster during the import step. The format of this file is identical to that of the ChangeSpecFile used as input to the mmimportfs command. This option can be used, for example, to define the assignment of the NSD servers for use in the peer cluster.

Comments

   The primary configuration data server of the peer cluster must be available and accessible via remote shell and remote copy at the time of the execution of this command. Additionally, the peer GPFS clusters should be defined to use identical remote shell and remote copy mechanisms (ssh/scp or rsh/rcp), and these mechanisms should be set up to allow nodes in the peer clusters to communicate without the use of a password.
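As a concrete illustration, the definition of a file system fs0 could be pushed to the peer cluster as follows. A minimal sketch, assuming noder001 and noder002 are the recovery cluster's configuration data servers and the file path is an arbitrary choice:

   printf "noder001\nnoder002\n" > /tmp/PeerNodesFile   # contact nodes, one per line
   mmfsctl fs0 syncfsconfig -n /tmp/PeerNodesFile       # export fs0's definition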
At all times, the peer clusters must see a consistent image of the mirrored file system's configuration state contained in the mmsdrfs file. You must ensure that after the initial creation of the file system, all subsequent updates to the local configuration data are propagated and imported into the peer cluster. The mmfsctl syncfsconfig command should be executed to resynchronize the configuration state between the peer clusters after each of the following actions in the primary GPFS cluster:

1. Additions, removals, and replacements of disks (mmadddisk, mmdeldisk, mmrpldisk)
2. Modifications to disk attributes (mmchdisk)
3. Changes to the file system's mount point (mmchfs -T)

Furthermore, we suggest that users run the mmfsctl syncfsconfig command as part of the failback procedure to propagate any changes made in the failover state back into the primary cluster's mmsdrfs file.

The /var/mmfs/etc/syncfsconfig user exit provides a means of automating this resynchronization step. This exit is invoked after the initial creation of the file system in the local cluster, as well as after the execution of all administrative commands that modify the file system's configuration state.

Syntax

   /var/mmfs/etc/syncfsconfig Device

Purpose

   Invoked after every administrative command that modifies the state of the file system in mmsdrfs on the node from which the command was issued. In disaster recovery environments, this exit is typically used for the propagation of administrative changes to the peer recovery cluster.

Parameters

   Device
      The device name of the GPFS file system whose configuration has been modified.

Unless some form of specialized administrative processing is implemented in your GPFS environment, we suggest defining this user exit to call mmfsctl syncfsconfig, for instance as follows:

   mmfsctl $1 syncfsconfig -n recoverynodefile -S recoveryspecfile

Sample contents of /var/mmfs/etc/syncfsconfig

Setting up an active/passive PPRC-mirrored GPFS environment

This example describes the creation of an active/passive GPFS environment with PPRC mirroring. We assume the following configuration:

Production site
   Node hostnames: nodep001, nodep002, nodep003, nodep004
   Storage subsystems: ESS P; logical subsystem LSS P
   LUN ids and disk volume names: lunp1 (hdisk11), lunp2 (hdisk12), lunp3 (hdisk13), lunp4 (hdisk14)
Recovery site
   Node hostnames: noder001, noder002, noder003, noder004
   Storage subsystems: ESS R; logical subsystem LSS R
   LUN ids and disk volume names: lunr1 (hdisk11), lunr2 (hdisk12), lunr3 (hdisk13), lunr4 (hdisk14)

All disks are SAN-attached and are directly accessible from all local nodes.

1. Create a dual-active ESS Copy Services domain; assign ESS P as ServerA and ESS R as ServerB.

2. Establish a PPRC logical path from LSS P to LSS R. Enable the consistency group option. (See the section "Data integrity and the use of PPRC consistency groups".)

3. Establish synchronous PPRC volume pairs:

      lunp1-lunr1 (source-target)
      lunp2-lunr2 (source-target)
      lunp3-lunr3 (source-target)
      lunp4-lunr4 (source-target)

   Use the "copy entire volume" option and leave the "permit read from secondary" option disabled.

At the recovery site:

4. Define the recovery GPFS cluster and the nodeset:

      mmcrcluster -t lc -n NodeDescFileR -p noder001 -s noder002
      mmconfig -a

At the production site:

5. Define the production GPFS cluster and the nodeset:

      mmcrcluster -t lc -n NodeDescFileP -p nodep001 -s nodep002
      mmconfig -a

6. Automate the propagation of the configuration state to the recovery cluster by invoking the mmfsctl syncfsconfig command from the syncfsconfig user exit:

      echo 'mmfsctl $1 syncfsconfig -n NodeDescFileR' > /var/mmfs/etc/syncfsconfig
      chmod 744 /var/mmfs/etc/syncfsconfig

7. Start the GPFS daemon on all nodes in the production cluster:

      mmstartup -a
8. Define the NSDs:

      mmcrnsd -F DiskDescFileP

9. Create the GPFS file system and mount it on all nodes:

      mmcrfs /gpfs/fs0 fs0 -F DiskDescFileP
      mount /gpfs/fs0

Sample NodeDescFileP:

   nodep001
   nodep002
   nodep003
   nodep004

Sample NodeDescFileR:

   noder001
   noder002
   noder003
   noder004

Sample DiskDescFileP, as rewritten by mmcrnsd:

   #/dev/hdisk11
   gpfs10nsd:::dataandmetadata:-1
   #/dev/hdisk12
   gpfs11nsd:::dataandmetadata:-1
   #/dev/hdisk13
   gpfs12nsd:::dataandmetadata:-1
   #/dev/hdisk14
   gpfs13nsd:::dataandmetadata:-1

Failover to the recovery site and subsequent failback

The following steps restore access to the file system at the recovery site after a disastrous node and/or subsystem failure at the production location.

Failover to the recovery site

1. From a node in the primary cluster, stop the GPFS daemon on all surviving nodes in the production cluster (if not already stopped):

      mmshutdown -a

2. Perform PPRC failover. This step involves establishing synchronous PPRC pairs lunr1-lunp1, lunr2-lunp2, etc. with the failover option. After the completion of this step, the volumes on ESS R should enter the suspended source PPRC state.

3. Start the daemon on all nodes in the recovery cluster and mount the file system:

      mmstartup -a      (on a node in the recovery cluster)
      mount /gpfs/fs0   (on all nodes in the recovery cluster)
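Before redirecting application activity to the recovery site, it is worth confirming that the failover completed cleanly. A minimal verification sketch, run on a node in the recovery cluster and assuming the names from this example:

   mmlsdisk fs0      # every NSD should be available through the lunr* volumes
   df /gpfs/fs0      # confirm the file system is mounted with the expected size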
Once the physical operation of the production site has been restored, execute the failback procedure to transfer the file system activity back to the production GPFS cluster. The failback operation is a two-step process:

1. Resynchronize the paired volumes by establishing a temporary PPRC pair with lunr* acting as the sources for lunp*. The PPRC modification bitmap is traversed to find the mismatching disk tracks, whose content is then copied from lunr* to lunp*.

2. Stop GPFS on all nodes in the recovery cluster and reverse the disk roles (the original primary disks become the primaries again), effectively bringing the PPRC environment to the initial (pre-disaster) state.

Additionally, if the state of the file system's configuration changed while in failover mode, issue the mmfsctl syncfsconfig command to resynchronize the content of the mmsdrfs file between the two clusters.

Failback to site A: resynchronization

1. Remove the existing PPRC path from LSS P to LSS R. Use the force removal option.

2. Establish a PPRC path from LSS R to LSS P. Enable the consistency group option.

3. Establish synchronous PPRC volume pairs lunr1-lunp1, lunr2-lunp2, etc. with the failback option. Leave the "permit read from secondary" option disabled.

4. Wait for all PPRC pairs to reach the duplex (fully synchronized) state.

5. If necessary, update the GPFS configuration data in the production cluster to propagate the changes made while in failover mode:

      mmfsctl fs0 syncfsconfig -n NodeDescFileP   (on a node at the recovery site)

Failback to site A: role reversal

6. Stop the GPFS daemon on all nodes in the recovery cluster:

      mmshutdown -a   (on a node in the recovery cluster)

7. Establish synchronous PPRC volume pairs lunp1-lunr1, lunp2-lunr2, etc. with the PPRC failover option.

8. Remove the existing PPRC path from LSS R to LSS P. Use the force removal option.

9. Establish a PPRC path from LSS P to LSS R. Enable the consistency group option.

10. Establish synchronous PPRC volume pairs lunp1-lunr1, lunp2-lunr2, etc. with the PPRC failback option. Leave the "permit read from secondary" option disabled.

11. Start the daemon on all nodes in the production cluster and mount the file system:
      mmstartup -a      (on a node in the primary cluster)
      mount /gpfs/fs0   (on all nodes in the primary cluster)

Data integrity and the use of PPRC consistency groups

The integrity of the post-disaster replica of the file system (contained on the secondary PPRC disk volumes) depends on the assumption that the order of dependent write operations is preserved by the PPRC mirroring mechanism. That is, given two PPRC volume pairs P1-S1 and P2-S2, if the application issues two sequential write requests, first to P1 and later to P2, we expect PPRC to commit these updates to the secondary disks in exactly that order: S1 followed by S2. While the synchronous nature of PPRC enables such ordering during periods of stability, certain types of rolling failure scenarios require additional consideration.

By default (without the PPRC consistency group option), if the primary ESS detects the loss of physical PPRC connectivity or the failure of one of the secondary disks, it responds by moving the corresponding primary volume(s) to the suspended state but continues to process the subsequent I/O requests as normal. The subsystem does not report this condition to the driver and, as a result, the application continues normal processing without any knowledge of the lost updates. GPFS relies on log recovery techniques to provide for the atomicity of updates to the file system's metadata, which is why such behavior would expose GPFS to a serious data integrity risk.

Figure 3 illustrates this problem. Consider a setup with two primary and two secondary subsystems, each providing access to two LUNs. The four primary LUNs make up the primary replica of the file system. At some point, GPFS attempts to issue a sequence of four write requests, one to each primary disk, with the expectation that the updates appear on the secondary disks in the exact order they were issued. (That is, if we number the writes 1 through 4, the outcomes 1-2-3-4, 1-2-3, 1-2, or 1 could all be considered consistent, but an outcome in which a later write survives without an earlier one, such as 1-2-4, would not be.) If PPRC path 1 breaks before the start of the first write, the recovery site receives updates 3 and 4, but not necessarily 1 and 2, a result that violates the write dependency rule and renders the target replica of the file system unusable.

The PPRC consistency group option determines the behavior of the primary subsystem in situations where a sudden failure of the secondary volume (or a failure of the inter-site interconnect) makes it impossible to sustain the normal synchronous mirroring process. If the PPRC path between the pair has been defined with the consistency group option, the primary volume enters a long busy state, and the subsystem reports this condition back to the host system with the QUEUE FULL (QF) SCSI status byte code. Typically, the driver makes several attempts to requeue the request and ultimately reports the failure to the requestor. This, in turn, allows GPFS to execute the appropriate action in response to the failure (marking the disk down or panicking the file system). To ensure the proper ordering of updates to the recovery copy of the file system, we urge users to always enable the consistency group option when configuring the PPRC paths between the peer logical subsystems.
[Figure 3. Violation of write ordering without the use of a PPRC consistency group. Writes 1 and 2 are mirrored from ESS A1 (PPRC primary) at the production site to ESS B1 (PPRC secondary) at the recovery site over PPRC path 1; writes 3 and 4 are mirrored from ESS A2 to ESS B2 over a second PPRC path.]

Online backup with ESS FlashCopy

Creating a FlashCopy image of a GPFS file system

The FlashCopy feature of the Enterprise Storage Server allows users to make a point-in-time copy of an ESS disk volume to another volume in the same logical subsystem. This function is designed to provide a near-instantaneous (time-zero) copy, or view, of the original data on the target disk, while the actual transfer of data takes place asynchronously and is fully transparent to the user. The intended use of FlashCopy is to produce secondary copies of the production data quickly and with minimal application downtime. In a GPFS cluster, the FlashCopy technology can be used to produce point-in-time copies of a GPFS file system, allowing for an easy implementation of an online backup mechanism.

As in the PPRC mirroring scenarios, it is crucially important to consider the issues of data consistency when FlashCopy is used. To prevent the appearance of out-of-order updates, all disk volumes that make up the file system must be brought to the same logical point in time prior to taking the FlashCopy image. Two methods may be used to enable data consistency in the FlashCopy image of your GPFS file system:

1. Generate a FlashCopy image using a consistency group. With FlashCopy v2, you can use the FlashCopy consistency group mechanism to effectively freeze the source disk volume at the logical instant at which its image appears on the target disk.
   When issuing the FlashCopy creation command, enable the "freeze FlashCopy consistency group" option, which instructs the ESS to suspend all I/O activity against the source volume. From that point on, all I/O requests directed to this volume are answered with the QUEUE FULL (QF) SCSI status code. Once all disk volumes belonging to your file system have been flashed, run the "FlashCopy consistency created" task to release the consistency group and resume the normal processing of I/O. For details on the use of FlashCopy consistency groups, please refer to IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services in Open Environments. Note that the consistency group feature is available only with FlashCopy v2.

2. Generate a FlashCopy image using file system-level suspension. You may temporarily quiesce the I/O activity at the GPFS file system level (as opposed to the disk volume level) through the use of the mmfsctl suspend and mmfsctl resume commands. The mmfsctl command is available for GPFS beginning with APAR IY. The mmfsctl suspend command instructs GPFS to flush its data buffers on all nodes, write the cached metadata structures to disk, and suspend the execution of all subsequent I/O requests. This command leaves the on-disk copy of the file system in a fully consistent state, ready to be flashed and copied onto a set of backup disks. mmfsctl resume is the converse operation that resumes normal file system I/O after its suspension.

Syntax

   mmfsctl Device {suspend | resume}

Purpose

   Prior to generating a FlashCopy image of the file system, run mmfsctl suspend to temporarily quiesce all I/O activity and flush the GPFS buffers on all nodes that mount the file system. All subsequent requests are suspended until mmfsctl resume is issued to restore normal processing.

Parameters

   Device
      The device name of the GPFS file system.

   suspend
      Instructs GPFS to flush its internal buffers on all nodes, bring the file system to a consistent state on disk, and suspend the execution of all subsequent application I/O requests.

   resume
      Instructs GPFS to resume the normal processing of I/O requests on all nodes.

The following examples illustrate the creation of a consistent FlashCopy image. We assume the following configuration:

   Storage subsystems: ESS 1; logical subsystem LSS 1
   LUN ids and disk volume names: luns1 (hdisk11), luns2 (hdisk12), lunt1, lunt2

luns1 and luns2 are the FlashCopy source volumes. These disks are SAN-connected and appear on the GPFS nodes as hdisk11 and hdisk12, respectively. A single GPFS file system, fs0, has been defined over these two disks. lunt1 and lunt2 are the FlashCopy target volumes. None of the GPFS nodes have direct connectivity to these disks.
Generating a FlashCopy image using a consistency group

1. Run the "establish FlashCopy pair" task with the "freeze FlashCopy consistency group" option. Create the following volume pairs:

      luns1-lunt1 (source-target)
      luns2-lunt2 (source-target)

2. When the previous task completes, run the "consistency created" task on all logical subsystems that hold any of the FlashCopy disks (only LSS 1 in this example).

Generating a FlashCopy image using file system-level suspension

1. Suspend all file system activity and flush the GPFS buffers on all nodes:

      mmfsctl fs0 suspend   (run on any node in the GPFS cluster)

2. Run the "establish FlashCopy pair" task to create the following volume pairs:

      luns1-lunt1 (source-target)
      luns2-lunt2 (source-target)

3. Resume the file system activity:

      mmfsctl fs0 resume   (run on any node in the GPFS cluster)

Both techniques provide for the consistency of the FlashCopy image by means of a temporary suspension of I/O, but either one may be preferable depending on your specific requirements and the nature of your GPFS client application. While the use of FlashCopy consistency groups provides for the proper ordering of updates, this method does not by itself suffice to guarantee the atomicity of updates as seen from the point of view of the user application. If the application process is actively writing data to GPFS, the on-disk content of the file system may, at any point in time, contain some number of incomplete data record updates (those that have not been fully written to disk), and possibly some number of in-progress updates to the GPFS metadata. These would appear as partial updates in the FlashCopy image of the file system, which must be dealt with before enabling the image for normal file system use. While the use of metadata logging techniques enables GPFS to detect and recover from partial updates to the file system's metadata, ensuring the atomicity of updates to the actual data remains the responsibility of the user application. Consequently, the use of FlashCopy consistency groups is suitable only for applications that implement proper mechanisms for recovery from incomplete updates to their data.

On the other hand, by quiescing the I/O activity at the file system level (with the mmfsctl suspend and mmfsctl resume commands), we permit GPFS to fully flush its data buffers, allowing it to complete all outstanding updates to the user data as well as the GPFS metadata. This, in turn, prevents the appearance of incomplete updates in the FlashCopy image.
For this reason, we suggest file system-level suspension as the general-purpose method for protecting the integrity of your FlashCopy images.

Several uses of the FlashCopy replica after its initial creation can be considered. For example, if your primary operating environment suffers a permanent loss or a corruption of data, you may choose to flash the target disks back onto the originals to quickly restore access to a copy of the file system as seen at the time of the previous snapshot. Before restoring the file system from a FlashCopy, please make sure to suspend the activity of the GPFS client processes and unmount the file system on all GPFS nodes.

To help protect your data against site-wide disasters, you may instead choose to transfer the replica off-site to a remote location using PPRC or any other type of bulk data transfer technology. Alternatively, you may choose to mount the FlashCopy image as a separate file system on your backup processing server and transfer the data to tape storage. To enable regular file system access to your FlashCopy replica, create a single-node GPFS cluster on your backup server node and execute the mmfsctl syncfsconfig command to import the definition of the file system from your production GPFS cluster.

Please note:

1. The primary copy of a GPFS file system and its FlashCopy image cannot coexist in the same GPFS cluster. A node can mount either the original copy of the file system or one of its FlashCopy replicas, but not both. This restriction has to do with the current implementation of the NSD-to-LUN mapping mechanism, which scans all locally attached disks, searching for a specific value (the NSD id) at a particular location on disk. If both the original volume and its FlashCopy image are visible to a particular node, these disks would appear to GPFS as distinct devices with identical NSD ids. For this reason, we ask users to zone their SAN configurations such that at most one replica of any given GPFS disk is visible from any node. That is, the nodes in your production cluster should have access to the disks that make up the actual file system but should not see the disks holding the FlashCopy image(s), whereas the backup server should see the FlashCopy targets but not the originals. Alternatively, you can use the /var/mmfs/etc/nsddevices user exit to explicitly define the subset of the locally visible disks to be accessed during the NSD device scan on the local node. (Consult the GPFS documentation errata for details on the use of nsddevices.)

2. We suggest generating a new FlashCopy replica immediately after every administrative change to the state of the file system (for example, the addition or removal of disks). This helps eliminate the risk of a discrepancy between the GPFS configuration data contained in the mmsdrfs file and the on-disk content of the replica.
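The file system-level suspension procedure above is easily wrapped into a repeatable online backup job. A hedged sketch, assuming the file system fs0 from this example and that the "establish FlashCopy pair" step is driven by a site-specific Copy Services task; the task invocation shown is a placeholder, not a real command:

   #!/bin/ksh
   # Online backup of fs0: quiesce I/O, flash the source volumes, resume.
   mmfsctl fs0 suspend                  # flush GPFS buffers and halt new I/O
   trap 'mmfsctl fs0 resume' EXIT       # guarantee I/O resumes even on error
   # placeholder: run the "establish FlashCopy pair" task for luns1-lunt1 and
   # luns2-lunt2 through your ESS Copy Services interface, for example:
   # run_flashcopy_task ...
   # normal file system I/O is resumed by the trap when the script exits

Because the logical completion of FlashCopy is near-instantaneous, the window during which the file system remains suspended is short, keeping the impact on the application small.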
References

- GPFS documentation
- GPFS FAQ
- GPFS Architecture and Performance
- IBM Enterprise Storage Server
- IBM Enterprise Storage Server User's Guide
- IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services in Open Environments
- IBM TotalStorage Enterprise Storage Server Web Interface User's Guide
- IBM Enterprise Storage Server Command-Line Interfaces User's Guide
- ESS Copy Services Tasks: A Possible Naming Convention
- Valid Combinations of ESS Copy Services
- Connectivity Guidelines for Synchronous PPRC
- A Disaster Recovery Solution Selection Methodology
- IBM TotalStorage Solutions for Disaster Recovery
- Seven Tiers of Disaster Recovery
(c) IBM Corporation 2004

IBM Corporation
Marketing Communications
Systems and Technology Group
Route 100
Somers, New York

Produced in the United States of America, August 2004. All Rights Reserved.

This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice. Consult your local IBM business contact for information on the products, features, and services available in your area. All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only.

IBM, the IBM logo, Enterprise Storage Server, FlashCopy, and TotalStorage are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

Information concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of the non-IBM products should be addressed with the suppliers.
White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores
Oracle Database Disaster Recovery Using Dell Storage Replication Solutions
Oracle Database Disaster Recovery Using Dell Storage Replication Solutions This paper describes how to leverage Dell storage replication technologies to build a comprehensive disaster recovery solution
FICON Extended Distance Solution (FEDS)
IBM ^ zseries Extended Distance Solution (FEDS) The Optimal Transport Solution for Backup and Recovery in a Metropolitan Network Author: Brian Fallon [email protected] FEDS: The Optimal Transport Solution
Affordable Remote Data Replication
SANmelody Application Affordable Remote Data Replication Your Data is as Valuable as Anyone s You know very well how critical your data is to your organization and how much your business would be impacted
PADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute
PADS GPFS Filesystem: Crash Root Cause Analysis Computation Institute Argonne National Laboratory Table of Contents Purpose 1 Terminology 2 Infrastructure 4 Timeline of Events 5 Background 5 Corruption
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
Preface... 1. Introduction... 1 High Availability... 2 Users... 4 Other Resources... 5 Conventions... 5
Table of Contents Preface.................................................... 1 Introduction............................................................. 1 High Availability.........................................................
Deploying Global Clusters for Site Disaster Recovery via Symantec Storage Foundation on Infortrend Systems
Deploying Global Clusters for Site Disaster Recovery via Symantec Storage Foundation on Infortrend Systems Application Notes Abstract: This document describes how to apply global clusters in site disaster
IBM TotalStorage IBM TotalStorage Virtual Tape Server
IBM TotalStorage IBM TotalStorage Virtual Tape Server A powerful tape storage system that helps address the demanding storage requirements of e-business storag Storage for Improved How can you strategically
WHITE PAPER: ENTERPRISE SECURITY. Symantec Backup Exec Quick Recovery and Off-Host Backup Solutions
WHITE PAPER: ENTERPRISE SECURITY Symantec Backup Exec Quick Recovery and Off-Host Backup Solutions for Microsoft Exchange Server 2003 and Microsoft SQL Server White Paper: Enterprise Security Symantec
How To Virtualize A Storage Area Network (San) With Virtualization
A New Method of SAN Storage Virtualization Table of Contents 1 - ABSTRACT 2 - THE NEED FOR STORAGE VIRTUALIZATION 3 - EXISTING STORAGE VIRTUALIZATION METHODS 4 - A NEW METHOD OF VIRTUALIZATION: Storage
A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES
A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES By: Edward Whalen Performance Tuning Corporation INTRODUCTION There are a number of clustering products available on the market today, and clustering has become
Stretched Clusters and VMware
Stretched Clusters and VMware vcenter TM Site Recovery Manager Understanding the Options and Goals TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Terms and Definitions... 4 Local Availability....
Automatic Service Migration in WebLogic Server An Oracle White Paper July 2008
Automatic Service Migration in WebLogic Server An Oracle White Paper July 2008 NOTE: The following is intended to outline our general product direction. It is intended for information purposes only, and
Big data management with IBM General Parallel File System
Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers
Appendix A Core Concepts in SQL Server High Availability and Replication
Appendix A Core Concepts in SQL Server High Availability and Replication Appendix Overview Core Concepts in High Availability Core Concepts in Replication 1 Lesson 1: Core Concepts in High Availability
WHITE PAPER PPAPER. Symantec Backup Exec Quick Recovery & Off-Host Backup Solutions. for Microsoft Exchange Server 2003 & Microsoft SQL Server
WHITE PAPER PPAPER Symantec Backup Exec Quick Recovery & Off-Host Backup Solutions Symantec Backup Exec Quick Recovery & Off-Host Backup Solutions for Microsoft Exchange Server 2003 & Microsoft SQL Server
Unitt www.unitt.com. Zero Data Loss Service (ZDLS) The ultimate weapon against data loss
Zero Data Loss Service (ZDLS) The ultimate weapon against data loss The ultimate protection for your business-critical data In the past, ultimate data protection was a costly issue, if not an impossible
Redundancy Options. Presented By: Chris Williams
Redundancy Options Presented By: Chris Williams Table of Contents Redundancy Overview... 3 Redundancy Benefits... 3 Introduction to Backup and Restore Strategies... 3 Recovery Models... 4 Cold Backup...
Disaster Recovery Solution Achieved by EXPRESSCLUSTER
Disaster Recovery Solution Achieved by EXPRESSCLUSTER http://www.nec.com/expresscluster/ NEC Corporation System Software Division December, 2012 Contents 1. Clustering system and disaster recovery 2. Disaster
VMware Site Recovery Manager with EMC RecoverPoint
VMware Site Recovery Manager with EMC RecoverPoint Implementation Guide EMC Global Solutions Centers EMC Corporation Corporate Headquarters Hopkinton MA 01748-9103 1.508.435.1000 www.emc.com Copyright
Bigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
HRG Assessment: Stratus everrun Enterprise
HRG Assessment: Stratus everrun Enterprise Today IT executive decision makers and their technology recommenders are faced with escalating demands for more effective technology based solutions while at
System Migrations Without Business Downtime. An Executive Overview
System Migrations Without Business Downtime An Executive Overview Businesses grow. Technologies evolve. System migrations may be inevitable, but business downtime isn t. All businesses strive for growth.
VERITAS Storage Foundation 4.3 for Windows
DATASHEET VERITAS Storage Foundation 4.3 for Windows Advanced Volume Management Technology for Windows In distributed client/server environments, users demand that databases, mission-critical applications
EMC DOCUMENTUM xplore 1.1 DISASTER RECOVERY USING EMC NETWORKER
White Paper EMC DOCUMENTUM xplore 1.1 DISASTER RECOVERY USING EMC NETWORKER Abstract The objective of this white paper is to describe the architecture of and procedure for configuring EMC Documentum xplore
IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications
IBM Software Information Management Scaling strategies for mission-critical discovery and navigation applications Scaling strategies for mission-critical discovery and navigation applications Contents
DeltaV Virtualization High Availability and Disaster Recovery
DeltaV Distributed Control System Whitepaper October 2014 DeltaV Virtualization High Availability and Disaster Recovery This document describes High Availiability and Disaster Recovery features supported
Version 5.0. MIMIX ha1 and MIMIX ha Lite for IBM i5/os. Using MIMIX. Published: May 2008 level 5.0.13.00. Copyrights, Trademarks, and Notices
Version 5.0 MIMIX ha1 and MIMIX ha Lite for IBM i5/os Using MIMIX Published: May 2008 level 5.0.13.00 Copyrights, Trademarks, and Notices Product conventions... 10 Menus and commands... 10 Accessing online
PROTECTING MICROSOFT SQL SERVER TM
WHITE PAPER PROTECTING MICROSOFT SQL SERVER TM Your company relies on its databases. How are you protecting them? Published: February 2006 Executive Summary Database Management Systems (DBMS) are the hidden
High Availability Database Solutions. for PostgreSQL & Postgres Plus
High Availability Database Solutions for PostgreSQL & Postgres Plus An EnterpriseDB White Paper for DBAs, Application Developers and Enterprise Architects November, 2008 High Availability Database Solutions
TECHNICAL WHITE PAPER: DATA AND SYSTEM PROTECTION. Achieving High Availability with Symantec Enterprise Vault. Chris Dooley January 3, 2007
TECHNICAL WHITE PAPER: DATA AND SYSTEM PROTECTION Achieving High Availability with Symantec Enterprise Vault Chris Dooley January 3, 2007 Technical White Paper: Data and System Protection Achieving High
Protecting Microsoft SQL Server
Your company relies on its databases. How are you protecting them? Protecting Microsoft SQL Server 2 Hudson Place suite 700 Hoboken, NJ 07030 Powered by 800-674-9495 www.nsisoftware.com Executive Summary
Top Ten Private Cloud Risks. Potential downtime and data loss causes
Top Ten Private Cloud Risks Potential downtime and data loss causes Introduction: Risk sources Enterprises routinely build Disaster Recovery and High Availability measures into their private cloud environments.
IBM System Storage DS5020 Express
IBM DS5020 Express Manage growth, complexity, and risk with scalable, high-performance storage Highlights Mixed host interfaces support (Fibre Channel/iSCSI) enables SAN tiering Balanced performance well-suited
Module 14: Scalability and High Availability
Module 14: Scalability and High Availability Overview Key high availability features available in Oracle and SQL Server Key scalability features available in Oracle and SQL Server High Availability High
High availability and disaster recovery with Microsoft, Citrix and HP
High availability and disaster recovery White Paper High availability and disaster recovery with Microsoft, Citrix and HP Using virtualization, automation and next-generation storage to improve business
Microsoft Exchange 2003 Disaster Recovery Operations Guide
Microsoft Exchange 2003 Disaster Recovery Operations Guide Microsoft Corporation Published: December 12, 2006 Author: Exchange Server Documentation Team Abstract This guide provides installation and deployment
Abstract... 2. Introduction... 2. Overview of Insight Dynamics VSE and Logical Servers... 2
HP Insight Recovery Technical white paper Abstract... 2 Introduction... 2 Overview of Insight Dynamics VSE and Logical Servers... 2 Disaster Recovery Concepts... 4 Recovery Time Objective (RTO) and Recovery
Implementing Storage Concentrator FailOver Clusters
Implementing Concentrator FailOver Clusters Technical Brief All trademark names are the property of their respective companies. This publication contains opinions of StoneFly, Inc. which are subject to
Windows Server 2008 Hyper-V Backup and Replication on EMC CLARiiON Storage. Applied Technology
Windows Server 2008 Hyper-V Backup and Replication on EMC CLARiiON Storage Applied Technology Abstract This white paper provides an overview of the technologies that are used to perform backup and replication
Downtime, whether planned or unplanned,
Deploying Simple, Cost-Effective Disaster Recovery with Dell and VMware Because of their complexity and lack of standardization, traditional disaster recovery infrastructures often fail to meet enterprise
Maximum Availability Architecture. Oracle Best Practices For High Availability. Backup and Recovery Scenarios for Oracle WebLogic Server: 10.
Backup and Recovery Scenarios for Oracle WebLogic Server: 10.3 An Oracle White Paper January, 2009 Maximum Availability Architecture Oracle Best Practices For High Availability Backup and Recovery Scenarios
istorage Server: High-Availability iscsi SAN for Windows Server 2008 & Hyper-V Clustering
istorage Server: High-Availability iscsi SAN for Windows Server 2008 & Hyper-V Clustering Tuesday, Feb 21 st, 2012 KernSafe Technologies, Inc. www.kernsafe.com Copyright KernSafe Technologies 2006-2012.
Real-time Protection for Hyper-V
1-888-674-9495 www.doubletake.com Real-time Protection for Hyper-V Real-Time Protection for Hyper-V Computer virtualization has come a long way in a very short time, triggered primarily by the rapid rate
RPO represents the data differential between the source cluster and the replicas.
Technical brief Introduction Disaster recovery (DR) is the science of returning a system to operating status after a site-wide disaster. DR enables business continuity for significant data center failures
