Unisys SafeGuard Solutions


Unisys SafeGuard Solutions Troubleshooting Guide
Unisys SafeGuard Solutions Release 8.0
July 2009



NO WARRANTIES OF ANY NATURE ARE EXTENDED BY THIS DOCUMENT. Any product or related information described herein is only furnished pursuant and subject to the terms and conditions of a duly executed agreement to purchase or lease equipment or to license software. The only warranties made by Unisys, if any, with respect to the products described in this document are set forth in such agreement. Unisys cannot accept any financial or other responsibility that may be the result of your use of the information in this document or software material, including direct, special, or consequential damages.

You should be very careful to ensure that the use of this information and/or software material complies with the laws, rules, and regulations of the jurisdictions with respect to which it is used.

The information contained herein is subject to change without notice. Revisions may be issued to advise of such changes and/or additions.

Notice to U.S. Government End Users: This is commercial computer software or hardware documentation developed at private expense. Use, reproduction, or disclosure by the Government is subject to the terms of Unisys standard commercial license for the products, and where applicable, the restricted/limited rights provisions of the contract data rights clauses.

Unisys is a registered trademark of Unisys Corporation in the United States and other countries. All other brands and products referenced in this document are acknowledged to be the trademarks or registered trademarks of their respective holders.

Contents

Section 1. About This Guide
  Purpose and Audience
  Related Product Information
  Documentation Updates
  What's New in This Release
  Using This Guide

Section 2. Overview
  Geographic Replication Environment
  Geographic Clustered Environment
  Data Flow
  Diagnostic Tools and Capabilities
  Event Log
  System Status
  Notifications
  Installation Diagnostics
  Host Information Collector (HIC)
  Cluster Logs
  Unisys SafeGuard 30m Collector
  RA Diagnostics
  Hardware Indicators
  SNMP Support
  kutils Utility
  Discovering Problems
  Events That Cause Journal Distribution
  Troubleshooting Procedures
  Identifying the Main Components and Connectivity of the Configuration
  Understanding the Current State of the System
  Verifying the System Connectivity
  Analyzing the Configuration Settings

Section 3. Recovering in a Geographic Replication Environment
  Manual Failover of Volumes and Data Consistency Groups
  Accessing an Image
  Testing the Selected Image at Remote Site

Section 4. Recovering in a Geographic Clustered Environment
  Checking the Cluster Setup
  MSCS Properties
  Network Bindings
  Group Initialization Effects on a Cluster
  Move-Group Operation
  Full-Sweep Initialization
  Long Resynchronization
  Initialization from Marking Mode
  Behavior of SafeGuard 30m Control During a Move-Group Operation
  Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group
  Taking a Cluster Data Group Offline
  Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image
  Bringing a Cluster Data Group Online and Checking the Validity of the Image
  Reversing the Replication Direction of the Consistency Group
  Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)
  Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)
  Recovery When All RAs and All Servers Fail on One Site
  Site 1 Failure (Site 1 Quorum Owner)
  Site 1 Failure (Site 2 Quorum Owner)

Section 5. Solving Storage Problems
  User or Replication Volume Not Accessible
  Repository Volume Not Accessible
  Reformatting the Repository Volume
  Journal Not Accessible
  Journal Volume Lost Scenarios
  Total Storage Loss in a Geographic Replicated Environment
  Storage Failure on One Site in a Geographic Clustered Environment
  Storage Failure on One Site with Quorum Owner on Failed Site
  Storage Failure on One Site with Quorum Owner on Surviving Site

Section 6. Solving SAN Connectivity Problems
  Volume Not Accessible to RAs
  Volume Not Accessible to SafeGuard 30m Splitter
  RAs Not Accessible to SafeGuard 30m Splitter
  Total SAN Switch Failure on One Site in a Geographic Clustered Environment

  Cluster Quorum Owner Located on Site with Failed SAN Switch
  Cluster Quorum Owner Not on Site with Failed SAN Switch

Section 7. Solving Network Problems
  Public NIC Failure on a Cluster Node in a Geographic Clustered Environment
  Public or Client WAN Failure in a Geographic Clustered Environment
  Management Network Failure in a Geographic Clustered Environment
  Replication Network Failure in a Geographic Clustered Environment
  Temporary WAN Failures
  Private Cluster Network Failure in a Geographic Clustered Environment
  Total Communication Failure in a Geographic Clustered Environment
  Port Information

Section 8. Solving Replication Appliance (RA) Problems
  Single RA Failures
  Single RA Failure with Switchover
  Reboot Regulation
  Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)
  Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter
  Single RA Failures Without a Switchover
  Port Failure on a Single SAN Fibre Channel HBA on One RA
  Onboard Management Network Adapter Failure
  Single Hard Disk Failure
  Failure of All RAs at One Site
  All RAs Are Not Attached

Section 9. Solving Server Problems
  Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment
  Possible Subset Scenarios
  Windows Server Reboot
  Unexpected Server Shutdown Because of a Bug Check
  Server Crash or Restart
  Server Unable to Connect with SAN
  Server HBA Failure

  Infrastructure (NTP) Server Failure
  Server Failure (Hardware or Software) in a Geographic Replication Environment

Section 10. Solving Performance Problems
  Slow Initialization
  General Description of High-Load Event
  High-Load (Disk Manager) Condition
  High-Load (Distributor) Condition
  Failover Time Lengthens

Appendix A. Collecting and Using Logs
  Collecting RA Logs
  Setting the Automatic Host Info Collection Option
  Testing FTP Connectivity
  Determining When the Failure Occurred
  Converting Local Time to GMT or UTC
  Collecting RA Logs
  Collecting Server (Host) Logs
  Using the MPS Report Utility
  Using the Host Information Collector (HIC) Utility
  Analyzing RA Log Collection Files
  RA Log Extraction Directory
  tmp Directory
  Host Log Extraction Directory
  Analyzing Server (Host) Logs
  Analyzing Intelligent Fabric Switch Logs

Appendix B. Running Replication Appliance (RA) Diagnostics
  Clearing the System Event Log (SEL)
  Running Hardware Diagnostics
  Custom Test
  Express Test
  LCD Status Messages

Appendix C. Running Installation Manager Diagnostics
  Using the SSH Client
  Running Diagnostics
  IP Diagnostics
  Fibre Channel Diagnostics
  Synchronization Diagnostics
  Collect System Info

Appendix D. Replacing a Replication Appliance (RA)
  Saving the Configuration Settings
  Recording Policy Properties and Saving Settings
  Modifying the Preferred RA Setting
  Removing Fibre Channel Adapter Cards
  Installing and Configuring the Replacement RA
  Cable and Apply Power to the New RA
  Connecting and Accessing the RA
  Checking Storage-to-RA Access
  Enabling PCI-X Slot Functionality
  Configuring the RA
  Verifying the RA Installation
  Restoring Group Properties
  Ensuring the Existing RA Can Switch Over to the New RA

Appendix E. Understanding Events
  Event Log
  Event Topics
  Event Levels
  Event Scope
  Displaying the Event Log
  Using the Event Log for Troubleshooting
  List of Events
  List of Normal Events
  List of Detailed Events

Appendix F. Configuring and Using SNMP Traps
  Software Monitoring
  SNMP Monitoring and Trap Configuration
  Installing MIB Files on an SNMP Browser
  Resolving SNMP Issues

Appendix G. Using the Unisys SafeGuard 30m Collector
  Installing the SafeGuard 30m Collector
  Before You Begin the Configuration
  Handling the Security Breach Warning
  Using Collector Mode
  Getting Started
  Understanding Operations in Collector Mode
  Using Configuration Manager
  Using Site Verifier
  Using View Mode

Appendix H. Using kutils
  Usage

  Path Designations
  Command Summary

Appendix I. Analyzing Cluster Logs
  Introduction to Cluster Logs
  Creating the Cluster Log
  Understanding the Cluster Log Layout
  Sample Cluster Log
  Posting Information to the Cluster Log
  Diagnosing a Problem Using Cluster Logs
  Gathering Materials
  Opening the Cluster Log
  Converting GMT/UTC to Local Time
  Converting Cluster Log GUIDs to Text Resource Names
  Understanding State Codes
  Understanding Persistent State
  Understanding Error and Status Codes

Index

Figures

2 1. Basic Geographic Clustered Environment
2 2. Data Flow
2 3. Data Flow with Fabric Splitter
2 4. Data Flow in CDP
All RAs Fail on Site 1 (Site 1 Quorum Owner)
All RAs Fail on Site 1 (Site 2 Quorum Owner)
All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)
All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)
Volumes Tab Showing Volume Connection Errors
Groups Tab Shows Paused by System
Management Console Display: Storage Error and RAs Tab Shows Volume Errors
Volumes Tab Shows Error for Repository Volume
Groups Tab Shows All Groups Are Still Alive
Management Console Messages for the Repository Volume Not Accessible Problem
Volumes Tab Shows Journal Volume Error
RAs Tab Shows Connection Errors
Groups Tab Shows Group Paused by System
Management Console Messages for the Journal Not Accessible Problem
Management Console Volumes Tab Shows Errors for All Volumes
RAs Tab Shows Volumes That Are Not Accessible
Multipathing Software Reports Failed Paths to Storage Device
Storage on Site 1 Fails
Cluster Regroup Process
Cluster Administrator Displays
Multipathing Software Shows Server Errors for Failed Storage Subsystem
Management Console Showing Inaccessible Volume Errors
Management Console Messages for Inaccessible Volumes
Management Console Error Display Screen
Management Console Messages for Volumes Inaccessible to Splitter
EMC PowerPath Shows Disk Error
Management Console Display Shows a Splitter Down
Management Console Messages for Splitter Inaccessible to RA
SAN Switch Failure on One Site
Management Console Display with Errors for Failed SAN Switch
Management Console Messages for Failed SAN Switch

Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site
Public NIC Failure of a Cluster Node
Public NIC Error Shown in the Cluster Administrator
Public or Client WAN Failure
Cluster Administrator Showing Public LAN Network Error
Management Network Failure
Management Console Display: Not Connected
Management Console Message for Event
Replication Network Failure
Management Console Display: WAN Down
Management Console Log Messages: WAN Down
Management Console RAs Tab: All RAs Data Link Down
Private Cluster Network Failure
Cluster Administrator Display with Failures
Total Communication Failure
Management Console Display Showing WAN Error
RAs Tab for Total Communication Failure
Management Console Messages for Total Communication Failure
Cluster Administrator Showing Private Network Down
Cluster Administrator Showing Public Network Down
Single RA Failure
Sample BIOS Display
Management Console Display Showing RA Error and RAs Tab
Management Console Messages for Single RA Failure with Switchover
LCD Display on Front Panel of RA
Rear Panel of RA Showing Indicators
Location of Network LEDs
Location of SAN Fibre Channel HBA LEDs
Management Console Display: Host Connection with RA Is Down
Management Console Messages for Failed RA (All SAN HBAs Fail)
Management Console Showing WAN Data Link Failure
Location of Hard Drive LEDs
Management Console Showing All RAs Down
Cluster Node Failure
Management Console Display with Server Error
Management Console Messages for Server Down
Management Console Messages for Server Down for Bug Check
Management Console Display Showing LA Site Server Down
Management Console Images Showing Messages for Server Unable to Connect to SAN
PowerPath Administrator Console Showing Failures
PowerPath Administrator Console Showing Adapter Failure
Event 1009 Display
I 1. Layout of the Cluster Log
I 2. Expanded Cluster Hive (in Windows 2000 Server)

Tables

2 1. User Types
2 2. Events That Cause Journal Distribution
Possible Storage Problems with Symptoms
Indicators and Management Console Errors to Distinguish Different Storage Volume Failures
Possible SAN Connectivity Problems
Possible Networking Problems with Symptoms
Ports for Internet Communication
Ports for Management LAN Communication and Notification
Ports for RA-to-RA Internal Communication
Possible Problems for Single RA Failure with a Switchover
Possible Problems for Single RA Failure Without a Switchover
Possible Problems for Multiple RA Failures with Symptoms
Management Console Messages Pertaining to Reboots
Possible Server Problems with Symptoms
Possible Performance Problems with Symptoms
B 1. LCD Status Messages
C 1. Messages from the Connectivity Testing Tool
E 1. Normal Events
E 2. Detailed Events
F 1. Trap Variables and Values
I 1. System Environment Variables Related to Clustering
I 2. Modules of MSCS
I 3. Node State Codes
I 4. Group State Codes
I 5. Resource State Codes
I 6. Network Interface State Codes
I 7. Network State Codes


Section 1. About This Guide

Purpose and Audience

This document presents procedures for problem analysis and troubleshooting of the Unisys SafeGuard 30m solution. It is intended for Unisys service representatives and other technical personnel who are responsible for maintaining a Unisys SafeGuard 30m solution installation.

Related Product Information

The methods described in this document are based on support and diagnostic tools that are provided as standard components of the Unisys SafeGuard 30m solution. You can find additional information about these tools in the following documents:

- Unisys SafeGuard Solutions Planning and Installation Guide
- Unisys SafeGuard Solutions Replication Appliance Administrator's Guide
- Unisys SafeGuard Solutions Replication Appliance Command Line Interface (CLI) Reference Guide
- Unisys SafeGuard Solutions Replication Appliance Installation Guide

Note: Review the information in the Unisys SafeGuard Solutions Planning and Installation Guide about making configuration changes before you begin troubleshooting a problem.

Documentation Updates

This document contains all the information that was available at the time of publication. Changes identified after release of this document are included in a problem list entry (PLE). To obtain a copy of the PLE, contact your Unisys service representative or access the current PLE from the Unisys Product Support Web site.

Note: If you are not logged in to the Product Support site, you will be asked to do so.

What's New in This Release

Some of the important changes in the 8.0 release include the following:

Changes in the UI
The SafeGuard UI has changed in this release; however, these UI changes have not been made in this guide. Refer to the SafeGuard 8.0 UI for the latest component names, but follow the steps given in this guide to complete any procedure. For example, Stretch Cluster Support is now renamed Stretch Cluster/VMware SRM Support in the SafeGuard 8.0 UI.

Synchronous Replication
You can now replicate data synchronously over Fibre Channel. The system can be set to replicate in synchronous mode, in asynchronous mode, or to switch dynamically between the two modes, as determined by threshold values that are based on latency and throughput.

Unisys SafeGuard Solutions Installer Wizard
This wizard helps you install and configure a new Unisys SafeGuard Solutions installation on one or two sites. The wizard now supports IPv6.

Add New RAs Wizard
This wizard helps you add new RAs to existing RA clusters without any disruption.

RA Replacement Wizard
This wizard helps you replace an existing RA in an RA cluster with a new RA without any disruption.

Upgrade Tool
This CLI wizard helps you upgrade the RA code from 7.0 and 7.1 to 8.0. It is composed of two wizards: Prepare Upgrade and Apply.

System Monitoring
Unisys SafeGuard Solutions monitors selected parameter values to let the user know how close they are to their limits. The system, policies, licensing, or limitations of external technologies determine the limits. Monitored parameters are shown in the Unisys SafeGuard Solutions Management Application and at the CLI command line.

Support for CLARiiON LUNs Greater Than 2 TB
When using a Unisys SafeGuard Solutions 8.0 CLARiiON splitter, Unisys SafeGuard Solutions supports the replication of CLARiiON CX3 and CX4 Series LUNs that are larger than 2 TB.

CLARiiON Splitter Support for 2048 LUNs
When using a Unisys SafeGuard Solutions 8.0 CLARiiON splitter, Unisys SafeGuard Solutions supports attachment of up to 2048 LUNs of CLARiiON CX3 and CX4 Series arrays.

Improved SAN Diagnostics
If there are SAN diagnostics errors, it is no longer possible to continue the installation until they are corrected. SAN Diagnostics and host SAN Diagnostics run automatically approximately once every hour on each RA. In addition, SAN Diagnostics runs each time the time zone changes. These tests are transparent to the user and do not affect system performance. Any SAN or host errors encountered are displayed in the Unisys SafeGuard Solutions GUI and in the output of the get_system_status command in the CLI. If certain configuration errors are encountered during the SAN Diagnostics, the tests rerun every minute until they are corrected, immediately displaying correction results.

Using This Guide

This guide offers general information in the first four sections. Read Section 2 to understand the overall approach to troubleshooting and to gain an understanding of the Unisys SafeGuard 30m solution architecture. Section 3 describes recovery in a geographic replication environment, and Section 4 offers information and recovery procedures for geographic clustered environments.

Sections 5 through 10 group potential problems into categories and describe the problems. You must recognize symptoms, identify the problem or failed component, and then decide what to do to correct the problem. Sections 5 through 10 include a table at the beginning of each section that lists symptoms and potential problems. Each problem is then presented in the following format:

Problem Description: Description of the problem
Symptoms: List of symptoms that are typical for this problem
Actions to Resolve the Problem: Steps recommended to solve the problem

The appendixes provide information about using tools and offer reference information that you might find useful in different situations.


Section 2. Overview

The Unisys SafeGuard Solutions are flexible, integrated business continuance solutions especially suitable for protecting business-critical application environments. The Unisys SafeGuard 30m solution provides two distinct functions that act in concert: replication of data and automated application recovery through clustering over great distances.

Typically, the Unisys SafeGuard 30m solution is implemented in one of these environments:

- Geographic replication environment: In this replication environment, data from servers at one site is replicated to a remote site.

- Geographic clustered environment: In this environment, Microsoft Cluster Service (MSCS) is installed on servers that span sites and that participate in one cluster. A Unisys SafeGuard 30m Control resource allows automated failover and recovery by controlling the replication direction with an MSCS resource. The resource is used in this environment only.

Geographic Replication Environment

Unisys SafeGuard Solutions supports replication of data over Fibre Channel to local SAN-attached storage and over WAN to remote sites. It also allows failover to a secondary site and continued operations in the event of a disaster at the primary site. Unisys SafeGuard Solutions replicates data over any distance: within the same site (CDP), to another site halfway around the globe (CRR), or both (CLR).

Geographic Clustered Environment

In the geographic clustered environment, MSCS and cluster nodes are part of the environment. Figure 2 1 illustrates a basic geographic clustered environment that consists of two sites. In addition to server clusters, the typical configuration is made up of an RA cluster (RA 1 and RA 2) at each of the two sites. However, multiple RA cluster configurations are also possible.

Note: The dashed lines in Figure 2 1 represent the server WAN connections. To simplify the view, redundant and physical connections are not shown.

Figure 2 1. Basic Geographic Clustered Environment

Data Flow

Figure 2 2 shows the data flow in the basic system configuration for data written by the server. The system replicates the data in snapshot replication mode to a remote site. The data flow is divided into the following segments: write, transfer, and distribute.

Figure 2 2. Data Flow

Write

The flow of data for a write transaction is as follows:

1. The host writes data to the splitter (either on the host or in the fabric), which immediately sends it to the RA and to the production-site replication volume (storage system).
2. After receiving the data, the RA returns an acknowledgement (ACK) to the splitter. The storage system returns an ACK after successfully writing the data to storage.
3. The splitter sends an ACK to the host indicating that the write operation completed successfully.

In snapshot replication mode, this sequence of events (steps 1 to 3) can be repeated multiple times before the snapshot is closed.

Transfer

The flow of data for transfer is as follows:

1. After processing the snapshot data (that is, applying the various compression techniques), the RA sends the snapshot over the WAN to its peer RA at the remote site.
2. The RA at the remote site writes the snapshot to the journal. At the same time, the remote RA returns an ACK to its peer at the production site.

   Note: Alternatively, you can set an advanced policy parameter so that lag is measured to the journal. In that case, the RA at the target site returns an ACK to its peer at the source site only after it receives an ACK from the journal (step 3).

3. After the complete snapshot is written to the journal, the journal returns an ACK to the RA.
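The write and transfer ACK sequences can be modeled in a few lines. The following is an illustrative sketch only, not Unisys code; all class and method names are invented for the sketch:

```python
# Illustrative model of the write-split ACK sequence: a write completes for
# the host only after BOTH the RA and the local storage volume acknowledge
# their copies. Writes accumulate in a snapshot until the snapshot closes.

class RA:
    def __init__(self):
        self.snapshot = []            # writes accumulated until snapshot close

    def receive(self, data):
        self.snapshot.append(data)    # step 2: RA ACKs on receipt
        return True

    def close_snapshot(self):
        # Transfer: in the real system, the closed snapshot is compressed and
        # sent over the WAN to the peer RA, which journals it and ACKs back.
        closed, self.snapshot = self.snapshot, []
        return closed

class Storage:
    def __init__(self):
        self.blocks = []

    def write(self, data):
        self.blocks.append(data)      # ACK after persisting the data
        return True

class Splitter:
    def __init__(self, ra, storage):
        self.ra, self.storage = ra, storage

    def write(self, data):
        # Step 1: split the write to the RA and to the replication volume.
        ra_ack = self.ra.receive(data)
        storage_ack = self.storage.write(data)
        # Step 3: ACK the host only when both copies are acknowledged.
        return ra_ack and storage_ack

ra, storage = RA(), Storage()
splitter = Splitter(ra, storage)
for block in ("w1", "w2", "w3"):      # steps 1 to 3 repeat before close
    assert splitter.write(block)
print(ra.close_snapshot())            # → ['w1', 'w2', 'w3']
```

Note that the host's view of write latency depends only on the local ACKs; the WAN transfer happens after the snapshot closes.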

Distribute

When possible, and unless instructed otherwise, the Unisys SafeGuard 30m solution proceeds at first opportunity to distribute the image to the appropriate location on the storage system at the remote site. The logical flow of data for distribution is as follows:

1. The remote RA reads the image from the journal.
2. The RA reads existing information from the relevant remote replication volume.
3. The RA writes undo information (that is, information that can support a rollback, if necessary) to the journal.

   Note: Steps 2 and 3 are skipped when the maximum journal lag policy parameter causes distribution to operate in fast-forward mode. (See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information.)

4. The RA writes the image to the appropriate remote replication volume.

Alternatives to the Basic System Architecture

The following are derivatives of the basic system architecture:

Fabric Splitter

An intelligent fabric switch can perform the splitting function instead of a Unisys SafeGuard Solutions host-based splitter installed on the host. In this case, the host sends a single write transaction to the switch on its way to storage. At the switch, however, the message is split, with a copy also sent to the RA (as shown in Figure 2 3). The system behaves the same way as it does when a Unisys SafeGuard Solutions host-based splitter on the host performs the splitting function.
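The distribution flow described earlier in this section (read image, record undo information, write to the replication volume) and the rollback it enables can be sketched as follows. This is an illustrative model, not Unisys code; in fast-forward mode the undo steps are simply skipped, which is why earlier images become unrecoverable:

```python
# Illustrative model of journal distribution with undo records. Recording
# the volume's prior contents before each overwrite is what makes rollback
# to a pre-distribution image possible.

def distribute(journal, volume, fast_forward=False):
    image = journal["pending"].pop(0)             # 1. read image from journal
    for addr, new_data in image.items():
        if not fast_forward:
            old = volume.get(addr)                # 2. read existing data
            journal["undo"].append((addr, old))   # 3. write undo record
        volume[addr] = new_data                   # 4. write image to volume

def rollback(journal, volume):
    # Apply undo records in reverse order to restore the prior state.
    while journal["undo"]:
        addr, old = journal["undo"].pop()
        if old is None:
            volume.pop(addr, None)                # address did not exist before
        else:
            volume[addr] = old

journal = {"pending": [{0: "A1", 1: "B1"}], "undo": []}
volume = {0: "A0"}
distribute(journal, volume)
print(volume)   # → {0: 'A1', 1: 'B1'}
rollback(journal, volume)
print(volume)   # → {0: 'A0'}
```

With `fast_forward=True`, no undo records are written, so `rollback` would leave the distributed image in place.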

Figure 2 3. Data Flow with Fabric Splitter

Local Replication by CDP

You can use CDP to perform replication over short distances; that is, to replicate storage at the same site as CRR does over long distances. Operation of the system is similar to CRR, including the ability to use the journal to recover from a corrupted data image and the ability, if necessary, to fail over to the remote side or storage pool. In Figure 2 4, there is no WAN, the storage pools are part of the storage at the same site, and the same RA appears in each of the segments.

Figure 2 4. Data Flow in CDP

Note: The repository volume must belong to the remote-side storage pool.

Unisys SafeGuard Solutions supports a simultaneous mix of groups for remote and local replication. Individual volumes and groups, however, must be designated for either remote or local replication, but not for both. Certain policy parameters do not apply for local replication by CDP.

Single RA

Note: Unisys SafeGuard Solutions does not support a single-RA configuration (at both sites or at a single site).

Diagnostic Tools and Capabilities

The Unisys SafeGuard 30m solution offers the following tools and capabilities to help you diagnose and solve problems.

Event Log

The replication capability of the Unisys SafeGuard 30m solution records log entries in response to a wide range of predefined events. The event log records all significant events that have recently occurred in the system. Appendix E lists and explains the events.

Each event is classified by an event ID. The event ID can be used to help analyze or diagnose system behavior, including identifying the trigger for a rolling problem, understanding a sequence of events, and examining whether the system performed the correct set of actions in response to a component failure.

You can monitor system behavior by viewing the event log through the management console, by issuing CLI commands, or by reading RA logs. The exact period of time covered by the log varies according to the operational state of the environment during that period or, in the case of RA logs, the time period that was specified. The capacity of the event log is 5000 events.

For problems that are not readily apparent and for situations that you are monitoring for failure, you can configure an e-mail notification that sends all logs to you in a daily summary. Once you resolve the problem, you can remove the event notifications. See "Configuring a Diagnostic Notification" in this section to configure a daily summary of events.

System Status

The management console displays an immediate indication of any problem that interferes with normal operation of the Unisys SafeGuard 30m environment. If a component fails, the indication is accompanied by an error message that provides detailed information about the failure.

You must log in to the management console to monitor the environment and to view events. The RAs are preconfigured with the users defined in Table 2 1.

Table 2 1. User Types

User          Initial Password   Permissions
boxmgmt       boxmgmt            Install
admin         admin              All except install and webdownload
monitor       monitor            Read only
webdownload   webdownload       webdownload
SE            Unisys(CSC)        All except install and webdownload

Note: The password boxmgmt is not used to log in to the management console; it is used only for SSH sessions.

The CLI provides all users with status commands for the complete set of Unisys SafeGuard 30m components. You can use the information and statistics provided by these commands to identify bottlenecks in the system.

Notifications

The e-mail notification mechanism sends specified event notifications (or alerts) to designated individuals. Also, you can set up an e-mail notification, sent once a day, that contains a daily summary of events.

Configuring a Diagnostic Notification

1. From the management console, click Alert Settings on the System menu.
2. Under Rules, click Add.
3. Using the diagnostic rule, select the appropriate topic, level, and type options.

   Diagnostic Rule
   This rule sends all messages on a daily basis to personnel of your choice.
   Topics: All Topics
   Level: Information
   Scope: Detailed
   Type: Daily

4. Under Addresses, click Add.
5. In the New Address box, type the e-mail address to which you would like event notifications sent. You can specify more than one address.
6. Click OK.
7. Repeat steps 4 through 6 for each additional recipient.
8. Click OK.
9. Click OK.

Installation Diagnostics

The Diagnostics menu of the Installation Manager provides a suite of diagnostic tools for testing the functionality and connectivity of the installed RAs and Unisys SafeGuard 30m components. Appendix C explains how to use the Installation Manager diagnostics. Installation Manager is also used to collect RA logs and host splitter logs from one centralized location. See Appendix A for more information about collecting logs.

Host Information Collector (HIC)

The HIC collects extensive information about the environment, operation, and performance of any server on which a splitter has been installed. You can use the Installation Manager to collect logs across the entire environment, including RAs and all servers on which the HIC feature is enabled. The HIC can also be used at the server. See Appendix A for more information about collecting logs.
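The diagnostic rule configured earlier in this section (Topics: All Topics, Level: Information, Scope: Detailed, Type: Daily) can be thought of as a filter over the event stream. The following sketch is illustrative only, not Unisys code; the field names and matching semantics are assumptions made for the sketch:

```python
# Illustrative model of how an alert rule selects events for the daily
# summary e-mail. Field names and matching rules are assumptions.

LEVELS = ["Information", "Warning", "Error"]   # ascending severity

def rule_matches(rule, event):
    # "All Topics" matches any topic; otherwise the topic must be listed.
    topic_ok = rule["topics"] == "All Topics" or event["topic"] in rule["topics"]
    # A rule set at a given level also matches more severe events.
    level_ok = LEVELS.index(event["level"]) >= LEVELS.index(rule["level"])
    # "Detailed" scope is assumed to include all events.
    scope_ok = rule["scope"] == "Detailed" or event["scope"] == rule["scope"]
    return topic_ok and level_ok and scope_ok

diagnostic_rule = {"topics": "All Topics", "level": "Information",
                   "scope": "Detailed", "type": "Daily"}

events = [
    {"id": 4042, "topic": "Groups", "level": "Information", "scope": "Detailed"},
    {"id": 4097, "topic": "Groups", "level": "Warning", "scope": "Detailed"},
]
daily_digest = [e["id"] for e in events if rule_matches(diagnostic_rule, e)]
print(daily_digest)  # → [4042, 4097]
```

Because the rule sits at the lowest level with the widest topic and scope, every event qualifies for the daily summary, which is exactly what makes it useful while you are monitoring for an intermittent failure.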

Cluster Logs

In a geographic clustered environment, MSCS maintains logs of events for the clustered environment. Analyzing these logs is helpful in diagnosing certain problems. Appendix I explains how to analyze these logs.

Unisys SafeGuard 30m Collector

The Unisys SafeGuard 30m Collector utility enables you to easily collect various pieces of information about the environment that can help in solving problems. Appendix G describes this utility.

RA Diagnostics

Diagnostics specific to the RAs are available to aid in identifying problems. Appendix B explains how to use the RA diagnostics.

Hardware Indicators

Hardware problems (for example, RA disk failures or RA power problems) are identified by status LEDs located on the RAs themselves. Several indicators are explained in Section 8, Solving Replication Appliance (RA) Problems.

SNMP Support

The RAs support monitoring and problem notification using standard SNMP, including support for SNMPv3. You can issue SNMP queries to the agent on the RA. Also, you can configure the environment so that events generate SNMP traps that are then sent to designated hosts. Appendix F explains how to configure and use SNMP traps.

kutils Utility

The kutils utility is a proprietary server-based program that enables you to manage server splitters across all platforms. The command-line utility is installed automatically when the Unisys SafeGuard 30m splitter is installed on the application server. If the splitting function is not on a host but rather on an intelligent switch, the kutils utility is copied from the Splitter CD-ROM. (See the Unisys SafeGuard Solutions Planning and Installation Guide for more information.) Appendix H explains some kutils commands that are helpful in troubleshooting problems. See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for complete reference information on the kutils utility.
Discovering Problems

Symptoms of problems and notifications occur in various ways with the Unisys SafeGuard 30m solution. The tools and capabilities described previously provide notifications for some conditions and events. Other problems are recognized from failures. Problems might be noted in the following ways:

- Problems with data because of a rolling disaster, which means that the site needs to use a previous snapshot to recover
- Problems with applications failing
- Inability to switch processing to the remote or secondary site
- Problems with the MSCS cluster (such as a failover to another cluster or site)
- Problems reported in an e-mail notification from an RA
- Problems reported in an SNMP trap notification
- Problems listed on the management console as reported in the overall system status or in group state or properties
- Problems reported in the daily summary of events

In this guide, symptoms and notifications are often listed with potential problems. However, the messages and notifications vary based on the problem, and multiple events and notifications are possible at any given time.

Events That Cause Journal Distribution

Certain conditions might occur that can prevent access to the expected journal image. For instance, images might be flushed or distributed so that they are not available. Table 2-2 lists events that might cause the images to be unavailable. For tables listing all events, see Appendix E.

Table 2-2. Events That Cause Journal Distribution

Event ID 4042 (Info, Detailed)
  Description: Group deactivated. (Group <group>, RA <RA>)
  Trigger: A user action deactivated the group.

Event ID 4062 (Info, Detailed)
  Description: Access enabled to latest image. (Group <group>, Failover site <site>)
  Trigger: Access was enabled to the latest image during automatic failover.

Event ID 4097 (Warning, Detailed)
  Description: Maximum journal lag exceeded. Distribution in fast-forward; older images removed from journal. (Group <group>)
  Trigger: A fast-forward action started and caused the snapshots taken before the fast-forward action to be lost and the maximum journal lag to be exceeded.

Event ID 4099 (Info, Detailed)
  Description: Initializing in long resynchronization mode. (Group <group>)
  Trigger: The system started a long resynchronization.
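As an illustration of the fast-forward behavior behind event 4097, the following is a minimal sketch of a bounded journal. The class, capacity, and timestamps are invented for the example and do not reflect the product's actual data structures:

```python
from collections import deque

class JournalSketch:
    """Toy model of a bounded journal: distributing past the lag limit
    discards the oldest images (the fast-forward case of event 4097)."""

    def __init__(self, max_images):
        self.images = deque()
        self.max_images = max_images
        self.fast_forwarded = False  # set once any older image is removed

    def distribute(self, image):
        self.images.append(image)
        while len(self.images) > self.max_images:
            self.images.popleft()    # older image is no longer accessible
            self.fast_forwarded = True

journal = JournalSketch(max_images=3)
for snapshot in ["09:00", "10:00", "11:00", "12:00"]:
    journal.distribute(snapshot)
# journal now holds only the three newest images; "09:00" is gone
```

This is why an expected bookmarked image can simply be missing after a fast-forward: the journal keeps only the newest images that fit within the lag limit.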

Troubleshooting Procedures

For troubleshooting, you must differentiate between problems that arise from environmental changes (network changes such as cabling, routing, and port blocking; changes related to zoning, logical unit number (LUN) masking, or other devices in the SAN; and storage failures) and problems that arise from misconfiguration or internal errors in the environmental setup.

Refer to the preceding diagrams as you consider the general troubleshooting procedures that follow. Use the following four general tasks to help you identify symptoms and causes whenever you encounter a problem.

Identifying the Main Components and Connectivity of the Configuration

Knowledge of the main system components and the connectivity between these components is key to understanding how the entire environment operates. This knowledge helps you understand where the problem exists in the overall system context and can help you correctly identify which components are affected. Identify the following components:

- Storage device, controller, and the configuration of connections to the Fibre Channel (FC) switch
- Switch and port types, and their connectivity
- Network configuration (WAN and LAN): IP addresses, routing schemes, subnet masks, and gateways
- Participating servers: operating system, host bus adapters (HBAs), connectivity to the FC switch
- Participating volumes: repository volumes, journal volumes, and replication volumes

Understanding the Current State of the System

Use the management console and the CLI get commands to understand the current state of the system:

- Is any component shown to be in an error state? If so, what is the error? Is it down or disconnected from any other components?
- What is the state of the groups, splitters, volumes, transfer, and distribution?
- Is the current state stable, or changing within intervals of time?
Verifying the System Connectivity

To verify the system connectivity, use physical and tool-based verification methods to answer the following questions:

- Are all the components physically connected? Are the activity or link lights active?

- Are the components connected to the correct switch or switches? Are they connected to the correct ports?
- Is there connectivity over the WAN between all appliances?
- Is there connectivity between the appliances on the same site over the management network?

Analyzing the Configuration Settings

Many problems occur because of improper configuration settings, such as improper zoning. Analyze the configuration settings to ensure they are not the cause of the problem.

- Are the zones properly configured? Splitter-to-storage? Splitter-to-RA? RA-to-storage? RA-to-RA?
- Are the zones in the switch config? Has the proper switch config been applied?
- Are the LUNs properly masked? Is the splitter masked to see only the relevant replication volume or volumes? Are the RAs masked to see the relevant replication volume or volumes, repository volume, and journal volume or volumes?
- Are the network settings (such as gateway) for the RAs correct?
- Are there any possible IP conflicts on the network?
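Two of the checks above, reachability between appliances over the management network and duplicate IP detection, lend themselves to simple scripting. The following is a hedged sketch; the host names, addresses, and port are placeholders, and this is not a Unisys-supplied tool:

```python
import socket
from collections import Counter

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def find_ip_conflicts(assignments):
    """Given {component: ip}, return {ip: [components]} for addresses used twice."""
    counts = Counter(assignments.values())
    return {ip: sorted(name for name, addr in assignments.items() if addr == ip)
            for ip, n in counts.items() if n > 1}

# Placeholder inventory; a real survey would list the actual management
# addresses of the RAs, switches, and application servers.
inventory = {
    "site1-ra1": "10.0.1.10",
    "site1-ra2": "10.0.1.11",
    "app-server1": "10.0.1.10",  # deliberate duplicate for the example
}
conflicts = find_ip_conflicts(inventory)
```

Running `tcp_reachable` against each management address from each site quickly separates WAN problems from same-site management network problems.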

Section 3. Recovering in a Geographic Replication Environment

This section provides recovery procedures so that user applications can be online as quickly as possible in a geographic replication environment. An older image might be required to recover from a rolling disaster, human error, a virus, or any other failure that corrupts the latest snapshot image. Ensure that the image is tested prior to reversing direction. Complete the procedures for manual failover of volumes and data consistency groups for each group that needs to be moved.

Refer to the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information on logged and virtual (with roll or without roll) access modes. For specific environments, refer to the best practices documents listed under SafeGuard Solutions documentation on the Unisys Product Support Web site.

Manual Failover of Volumes and Data Consistency Groups

When you need to perform a manual failover of volumes and data consistency groups, complete the following tasks:

1. Accessing an image
2. Testing the selected image

Accessing an Image

1. From the management console, select any one of the data consistency groups on the navigation pane.
2. Select the Status tab (if it is not already open).
3. Perform the following steps to allow access to the target image:
   a. Right-click Consistency Groups, and select Bookmark Image.
   b. Select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   c. Right-click Consistency Groups and scroll down.
   d. Select the Remote Copy name and click Enable Image Access.

      The Enable Image Access dialog box appears.

   e. Select one of the following options:
      i. Select the latest image (for latest image access).
      ii. Select an image from the list (a bookmarked image can be selected from the list).
      iii. Specify desired point in time (a bookmarked image at the desired time can be selected).
   f. Click Next. The Image Access Mode dialog box appears.
   g. Select the option Logged access (physical) and click Next. The Summary screen displays the Image name and the Image Access mode.
   h. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access

   i. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused displays at the bottom of the Status tab under the components pane.

Testing the Selected Image at Remote Site

Perform the following steps to test the selected image at the remote site:

1. Mount the volumes at the remote site using the mountvol utility provided by Windows. Enter the command

   mountvol <drive:\path> <volume name>

2. Repeat step 1 for all volumes in the group.
3. Ensure that the selected image is valid:
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
4. Skip to "Unmounting the Volumes at Production Site and Reversing Replication Direction" if you have tested the validity of the image and the test is successful. If the test is unsuccessful, continue with step 5.
5. To test a different image, perform the procedure "Unmounting the Volumes and Disabling the Image Access at Remote Site."

Unmounting the Volumes and Disabling the Image Access at Remote Site

1. Before choosing another image, unmount the volume using the following batch file. If necessary, modify the program files\kdriver path to fit your environment.

   @echo off
   cd "c:\program files\kdriver\kutils"
   "c:\program files\kdriver\kutils\kutils.exe" flushfs e:
   c:\windows\system32\mountvol.exe E:\ /P

2. Repeat step 1 for all volumes in the group.
3. Select one of the Consistency Groups in the navigation pane on the management console.
4. Right-click Consistency Groups and scroll down.
5. Select the Remote Copy name and click Disable Image Access.
6. Click Yes when the system prompts you to ensure that all group volumes are unmounted.
7. Repeat the procedures "Accessing an Image" and "Testing the Selected Image at Remote Site."

Unmounting the Volumes at Production Site and Reversing Replication Direction

Perform these steps at the host:

1. To unmount a volume at the production site, run the following batch file. If necessary, modify the program files\kdriver path to fit your environment.

   @echo off
   cd "c:\program files\kdriver\kutils"
   "c:\program files\kdriver\kutils\kutils.exe" flushfs e:
   c:\windows\system32\mountvol.exe E:\ /P

2. Repeat step 1 for all volumes in the group.

Perform these steps on the management console:

1. Select a consistency group from the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
3. Select the Status tab. The status of the transfer must display Paused.
4. Select the Remote Copy name and scroll down.
5. Select Failover to <Remote Site Name>.
6. Click Yes when the system prompts you to confirm failover.
7. Ensure that the Start data transfer immediately check box is selected. The following warning message appears:

   Warning: Journal will be erased. Do you wish to continue?

8. Click Yes to continue.

Section 4. Recovering in a Geographic Clustered Environment

This section provides information and procedures that relate to geographic clustered environments running Microsoft Cluster Service (MSCS).

Checking the Cluster Setup

To ensure that the cluster configuration is correct, check the MSCS properties and the network bindings. For more detailed information, refer to Guide to Creating and Configuring a Server Cluster under Windows Server 2003, which you can download from the Microsoft Web site.

MSCS Properties

To check the MSCS properties, enter the following command from the command prompt:

   Cluster /prop

Output similar to the following is displayed:

   T Cluster Name Value
   M AdminExtensions {4EC90FB0-D0BB-11CF-B5EF-0A0C90AB505}
   D DefaultNetworkRole 2 (0x2)
   S Description
   B Security (148 bytes)
   B Security Descriptor (148 bytes)
   M Groups\AdminExtensions
   M Networks\AdminExtensions
   M NetworkInterfaces\AdminExtensions
   M Nodes\AdminExtensions
   M Resources\AdminExtensions
   M ResourceTypes\AdminExtensions
   D EnableEventLogReplication 0 (0x0)
   D QuorumArbitrationTimeMax 300 (0x12c)
   D QuorumArbitrationTimeMin 15 (0xf)
   D DisableGroupPreferredOwnerRandomization 0 (0x0)
   D EnableEventDeltaGeneration 1 (0x1)
   D EnableResourceDllDeadlockDetection 0 (0x0)
   D ResourceDllDeadlockTimeout 240 (0xf0)
   D ResourceDllDeadlockThreshold 3 (0x3)
   D ResourceDllDeadlockPeriod 1800 (0x708)
   D ClusSvcHeartbeatTimeout 60 (0x3c)
   D HangRecoveryAction 3 (0x3)
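If you audit these dumps regularly, the comparison against the recommended values (listed in the next subsection) can be scripted. This is a simplified sketch; the parsing assumes the single-line "D <name> <value> (0x..)" form shown above, and the recommended set is only the subset discussed in this section:

```python
# Recommended values from this section (QuorumArbitrationTime* does not
# apply to a majority node set quorum).
RECOMMENDED = {
    "HangRecoveryAction": "3",
    "EnableEventLogReplication": "0",
    "QuorumArbitrationTimeMax": "300",
    "QuorumArbitrationTimeMin": "15",
}

def parse_props(dump):
    """Pull 'D <name> <value> (0x..)' rows out of a cluster /prop dump."""
    props = {}
    for line in dump.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "D":
            props[parts[1]] = parts[2]
    return props

def audit(dump):
    """Return {property: (actual, recommended)} for any mismatches."""
    actual = parse_props(dump)
    return {name: (actual.get(name), want)
            for name, want in RECOMMENDED.items()
            if actual.get(name) != want}

sample = """D EnableEventLogReplication 1 (0x1)
D QuorumArbitrationTimeMax 300 (0x12c)
D QuorumArbitrationTimeMin 15 (0xf)
D HangRecoveryAction 3 (0x3)"""
bad = audit(sample)   # only EnableEventLogReplication deviates here
```

Any property reported by `audit` can then be corrected with the Cluster /prop commands that follow.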

If the properties are not set correctly, use the following commands to correct the settings.

Majority Node Set Quorum

   Cluster /prop HangRecoveryAction=3
   Cluster /prop EnableEventLogReplication=0

Shared Quorum

   Cluster /prop QuorumArbitrationTimeMax=300 (not for majority node set)
   Cluster /prop QuorumArbitrationTimeMin=15
   Cluster /prop HangRecoveryAction=3
   Cluster /prop EnableEventLogReplication=0

Network Bindings

The following binding priority order and settings are suggested as best practices for clustered configurations. These procedures assume that you can identify the public and private networks by the connection names that are referenced in the steps.

Host-Specific Network Bindings and Settings

1. Open the Network Connections window.
2. On the Advanced menu, click Advanced Settings.
3. Select the Networks and Bindings tab. This tab shows the binding order in the upper pane and specific connection properties in the lower pane.
4. Verify that the public network connection is above the private network in the binding list in the upper pane. If it is not, follow these steps to change the order:
   a. Select a network connection in the binding list in the upper pane.
   b. Use the arrows to the right to move the network connection up or down in the list as appropriate.
5. Select the private network in the binding list. In the lower pane, verify that the File and Print Sharing for Microsoft Networks and the Client for Microsoft Networks check boxes are cleared for the private network.
6. Click OK.
7. Highlight the public connection, then right-click it and click Properties.
8. Select Internet Protocol (TCP/IP) in the list, and click Properties.
9. Click Advanced.

10. Select the WINS tab.
11. Ensure that Enable LMHOSTS lookup is selected.
12. Ensure that Disable NetBIOS over TCP/IP is selected.
13. Repeat steps 7 through 12 for the private network connection.

Cluster-Specific Network Bindings and Settings

1. Open the Cluster Administrator.
2. Right-click the cluster (the top node in the tree structure in the left pane) and click Properties.
3. Select the Networks Priority tab.
4. Ensure that the private network is at the top of the list and that the public network is below the private network. If it is not, follow these steps to change the order:
   a. Select the private network.
   b. Use the command button at the right to move the private network up in the list as appropriate.
5. Select the private network, and click Properties.
6. Verify that the Enable this network for cluster use check box is selected and that Internal cluster communications only (private network) is selected.
7. Click OK.
8. Select the public network, and click Properties.
9. Verify that the Enable this network for cluster use check box is selected and that All communications (mixed network) is selected.
10. Click OK.

Group Initialization Effects on a Cluster Move-Group Operation

The following conditions affect failover times for a cluster move-group operation. A cluster move-group operation cannot complete if a lengthy consistency group initialization, such as a full-sweep initialization, long resynchronization, or initialization from marking mode, is executing in the background. Review these conditions and plan accordingly.

Full-Sweep Initialization

A full-sweep initialization occurs when the disks on both sites are scanned or read in their entirety and a comparison is made, using checksums, to check for differences. Any differences are then replicated from the production site disk to the remote site disk. A full-sweep initialization generates an entry in the management console log.

A full-sweep initialization occurs in the following circumstances:

- Disabling or enabling a group. Disabling a group causes all disk replication in the group to stop. A full-sweep initialization is performed once the group is enabled. The full-sweep initialization guarantees that the disks are consistent between the sites.
- Adding a new splitter server or host that has access to the disks in the group. When adding a new splitter to the replication, there is a time before the splitter is added to the configuration when activity from this splitter to the disks is not being monitored or replicated. To guarantee that no write operations were performed by the new splitter before the splitter was configured in the replication, a full-sweep initialization is required for all groups that contain disks accessed by this splitter. This initialization is done automatically by the system.
- Double failure of a main component. When a double failure of a main component occurs, a full-sweep initialization is required to guarantee that consistency was maintained. The main components include the host, the replication appliance (RA), and the storage subsystem.

Long Resynchronization

A long resynchronization occurs when the data difference that needs to be replicated to the other site cannot fit on the journal volume. The data is split into multiple snapshots for distribution to the other site, and all the previous snapshots are lost.
Long resynchronization can be caused by long WAN outages, a group being disabled for a long time period, and other instances in which replication has not been functional for a long time period. Long resynchronization is not connected with full-sweep initialization and can also happen during initialization from marking (see "Initialization from Marking Mode"). It depends only on the journal volume size and the amount of data to be replicated.

A long resynchronization is identified on the Status tab in the components pane under the remote journal bitmap in the management console. The status Performing Long Resync is visible for the group that is currently performing a long resynchronization.
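The checksum comparison that drives a full-sweep initialization (described above) can be illustrated with a toy model. The block size, hash choice, and in-memory "disks" below are stand-ins for illustration, not the product algorithm:

```python
import hashlib

BLOCK = 4  # toy block size; real sweeps use far larger blocks

def block_checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def full_sweep(production: bytearray, remote: bytearray):
    """Scan both disks, compare per-block checksums, and copy any
    differing block from the production side to the remote side."""
    replicated_offsets = []
    for off in range(0, len(production), BLOCK):
        src = bytes(production[off:off + BLOCK])
        dst = bytes(remote[off:off + BLOCK])
        if block_checksum(src) != block_checksum(dst):
            remote[off:off + BLOCK] = src
            replicated_offsets.append(off)
    return replicated_offsets

prod = bytearray(b"AAAABBBBCCCC")
rem = bytearray(b"AAAAXXXXCCCC")
changed = full_sweep(prod, rem)  # only the middle block differs
```

Because only the differing blocks are copied, the cost of a full sweep is dominated by reading and checksumming both disks end to end, which is why it lengthens a move-group operation.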

Initialization from Marking Mode

All other instances of initialization in the replication are caused by marking. The marking mode refers to a replication mode in which the location of dirty, or changed, data is marked in a bitmap on the repository volume. This bitmap is a standard size (no matter how much data changes or what size disks are being monitored), so the repository volume cannot fill up during marking. The replication moves to marking mode when replication cannot be performed normally, such as during WAN outages. This marking mode guarantees that all data changes are still being recorded until replication is functioning normally.

When replication can perform normally again, the RAs read the dirty, or changed, data from the source disk based on data recorded in the bitmap and replicate it to the disk on the remote site. The length of time for this process to complete depends on the amount of dirty, or changed, data as well as the performance of other components in the configuration, such as bandwidth and the storage subsystem.

A high-load state can also cause the replication to move to marking mode. A high-load state occurs when write activity to the source disks exceeds the limits that the replication, bandwidth, or remote disks can handle. Replication moves into marking mode at this time until the replication determines that the activity has reached a level at which it can continue normal replication. The replication then exits the high-load state and an initialization from marking occurs. See Section 10, Solving Performance Problems, for more information on high-load conditions and problems.

Behavior of SafeGuard 30m Control During a Move-Group Operation

During a move-group operation, the Unisys SafeGuard 30m Control resource in a clustered environment behaves as follows. Be aware of this information when dealing with various failure scenarios.

1.
MSCS issues an offline request because of a failure with a group resource (for example, a physical disk) or because of an MSCS move-group operation. The request is sent to the Unisys SafeGuard 30m Control resource on the node that owns the group. The MSCS resources that are dependent on the Unisys SafeGuard 30m Control resource, such as physical disk resources, are taken offline first. Taking the resources offline does not issue any commands to the RA.

2. MSCS issues an online request to the Unisys SafeGuard 30m Control resource on the node to which a group was moved, or in the case of failure, to the next node in the preferred owners list.

3. When the resource receives an online request from MSCS, the Unisys SafeGuard 30m Control resource issues two commands to control access to the disks: initiate_failover and verify_failover.
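The fixed-size bitmap described under "Initialization from Marking Mode" can be sketched as follows; the region count and disk size are arbitrary illustrative values, not the product's actual granularity:

```python
class MarkingBitmap:
    """Toy fixed-size dirty-region bitmap: one bit per disk region,
    so its size never grows no matter how much data is written."""

    def __init__(self, disk_size, n_regions):
        self.region_size = disk_size // n_regions
        self.bits = [False] * n_regions

    def mark(self, offset, length):
        """Record a write: flag every region the write touches."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for i in range(first, last + 1):
            self.bits[i] = True

    def dirty_regions(self):
        """Regions the RAs would re-read and replicate after the outage."""
        return [i for i, dirty in enumerate(self.bits) if dirty]

# 1024-byte "disk" tracked by 8 bits; each bit covers 128 bytes.
bm = MarkingBitmap(disk_size=1024, n_regions=8)
bm.mark(0, 10)      # small write touches region 0
bm.mark(200, 300)   # larger write spans regions 1 through 3
```

Repeated writes to the same region cost nothing extra, which is why the repository volume cannot fill up during marking regardless of the write volume.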

Initiate_Failover Command

This command changes the replication direction from one site to another.

If a same-site failover is requested, the command completes successfully with no action performed by the RA. The resource issues the verify_failover command to see whether the RA performed the operations successfully.

If a different-site failover is requested, the RA starts changing direction between sites and returns successfully. In certain circumstances, such as when the WAN is down or a long resynchronization occurs, the RA returns a failure. If the RA returns a failure to the Unisys SafeGuard 30m Control resource, the resource logs the failure in the Windows application event log and retries the command continuously until the cluster pending timeout is reached. When a move-group operation fails, check the application event log to view events posted by the resource. The event source of the event entry is the 30m Control.

Verify_Failover Command

This command enables the Unisys SafeGuard 30m Control resource to determine the time at which the change of the replication direction completes.

If a same-site failover is requested, the command completes successfully with no action performed by the RA.

If a different-site failover is requested, the verify_failover command returns a pending status until the replication direction changes. The change of direction takes from 2 to 30 minutes. When the verify_failover command completes, write access to the physical disk is enabled to the host from the RA and the splitter.

If the time to complete the verify_failover command is within the pending timeout, the Unisys SafeGuard 30m Control resource comes online, followed by all the resources dependent on this resource. All dependent disks come online using the default physical disk timeout of an MSCS cluster. The physical disk is available to the physical disk resource immediately; there is no delay.
Physical disk access is available when the Unisys SafeGuard 30m Control resource comes online. You do not need to change the default resource settings for the physical disk. However, the physical disk must be dependent on the Unisys SafeGuard 30m Control resource.

If the time to complete the verify_failover command is longer than the pending timeout of the Unisys SafeGuard 30m Control resource, MSCS fails this resource. The default pending timeout for a Unisys SafeGuard 30m Control resource is 15 minutes (900 seconds). This timeout occurs before the cluster disk timeout.
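The initiate_failover/verify_failover polling described above can be modeled with a small simulation. The RA stub, return strings, and poll counts are invented for illustration and do not represent the real RA API:

```python
class SimulatedRA:
    """Stand-in for the replication appliance: verify_failover reports
    'pending' for a few polls, then 'done' once direction has changed."""

    def __init__(self, polls_until_done=3):
        self.polls = 0
        self.polls_until_done = polls_until_done

    def initiate_failover(self):
        # Starts the direction change; assumed 'ok' unless the WAN is down.
        return "ok"

    def verify_failover(self):
        self.polls += 1
        return "done" if self.polls >= self.polls_until_done else "pending"

def bring_online(ra, max_polls=10):
    """Mimic the control resource: initiate, then poll verify_failover
    until it completes or the (simulated) pending timeout expires."""
    if ra.initiate_failover() != "ok":
        return False
    for _ in range(max_polls):      # max_polls stands in for the pending timeout
        if ra.verify_failover() == "done":
            return True             # dependent disk resources may now come online
    return False                    # MSCS would fail the resource at this point

result = bring_online(SimulatedRA())
```

If the direction change outlasts the simulated timeout, `bring_online` returns False, which corresponds to MSCS failing the resource and retrying on the next preferred owner.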

If you use the default retry value of 1, this resource issues the following commands:

1. Initiate_failover
2. Verify_failover
3. Initiate_failover
4. Verify_failover

Using the default pending timeout, the Unisys SafeGuard 30m Control resource waits a total of 30 minutes to come online; this timeout period equals the timeout plus one retry. If the resource does not come online, MSCS attempts to move the group to the next node in the preferred owners list and then repeats this process.

Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group

An older image might be required to recover from a rolling disaster, human error, a virus, or any other failure that corrupts the latest snapshot image. It is impossible to recover automatically to an older image using MSCS because automatic cluster failover is designed to minimize data loss. The Unisys SafeGuard 30m solution always attempts to fail over to the latest image.

Note: Manual image recovery is only for data consistency groups, not for the quorum group.

To recover a data consistency group using an older image, you must complete the following tasks:

- Take the cluster data group offline.
- Perform a manual failover of an auto-data (shared quorum) consistency group to a selected image.
- Bring the cluster group online and check the validity of the image.
- Reverse the replication direction of the consistency group.

Taking a Cluster Data Group Offline

To take a group offline in the cluster for which you are performing a manual recovery, complete the following steps:

1. Open Cluster Administrator on one of the nodes in the MSCS cluster.
2. Right-click the group that you want to recover and click Take Offline.
3. Wait until all resources in the group show the status as Offline.

Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image

1. Open the management console.
2. Select a consistency group from the navigation pane.

   Note: Do not select the quorum group. The data consistency group you select should be the cluster data group that you took offline.

3. Select the Policy tab in the selected consistency group.
4. Scroll down and select Stretch Cluster Support in the Policy tab.
5. Under Management Mode, select Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor [Manual (shared quorum) mode].
6. Click Apply.
7. Perform the following steps to access the image:
   a. Right-click Consistency Groups and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   b. Right-click Consistency Groups and scroll down.
   c. Select the Remote Copy name and click Enable Image Access. The Enable Image Access dialog box appears.
   d. Choose Select an image from the list and click Next. The Select Explicit Image dialog box appears and displays the available images.
   e. Select the desired image from the list and click Next. The Image Access Mode dialog box appears.
   f. Select the option Logged access (physical) and click Next. The Summary screen displays the Image name and the Image Access mode.
   g. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access

   h. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused status appears at the bottom of the Status tab under the components pane.

Bringing a Cluster Data Group Online and Checking the Validity of the Image

1. Open the Cluster Administrator window on the management console.
2. Move the group to the node on the recovered site by right-clicking the group that you previously took offline and then clicking Move Group. If the cluster has more than two nodes, a list of possible owner target nodes appears. Select the node to which you want to move the group. If the cluster has only two nodes, the move starts immediately. Go to step 3.
3. Bring the group online by right-clicking the group name and then clicking Bring Online.
4. Ensure that the selected image is valid; that is, verify that
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
5. If you tested the validity of the image and the test completed successfully, skip to "Reversing the Replication Direction of the Consistency Group."
6. If the validity test of the image fails and you choose to test a different image, perform the following steps:
   a. To take the group offline, right-click the group name and then click Take Offline in the Cluster Administrator.
   b. Select one of the consistency groups in the navigation pane on the management console.
   c. Right-click Consistency Groups and scroll down.
   d. Select the Remote Copy name and click Disable Image Access.
   e. Click Yes when the system prompts you to ensure that all group volumes are unmounted.
7. Perform the following steps if you want to choose a different image:
   a. Right-click Consistency Groups and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   b. Right-click Consistency Groups and scroll down.
   c. Select the Remote Copy name and click Enable Image Access.
      The Enable Image Access dialog box appears.

   d. Choose Select an image from the list and click Next. The Select Explicit Image dialog box appears and displays the available images.

   e. Select the desired image from the list and click Next. The Image Access Mode dialog box appears.
   f. Select the option Logged access (physical) and click Next. The Summary screen displays the Image name and the Image Access mode.
   g. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access

   h. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused status appears at the bottom of the Status tab under the components pane.

8. To bring the cluster group online, using the Cluster Administrator, right-click the group name and then click Bring Online.
9. Ensure that the selected image is valid. Verify that
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
10. If you tested the validity of the image and the test completed successfully, skip to "Reversing the Replication Direction of the Consistency Group."
11. If the image is not valid, repeat steps 6 through 9 as necessary.

Reversing the Replication Direction of the Consistency Group

1. Select Consistency Groups from the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
3. Select the Status tab. The status of the transfer must display Paused.
4. Select the Policy tab and expand the Advanced Settings (if it is not expanded).
5. Select Auto data (shared quorum) from the Global Cluster mode list.
6. Right-click Consistency Groups and select Failover to <Remote Site Name>.
7. Click Yes when the system prompts you to confirm failover.

8. Ensure that the Start data transfer immediately check box is selected. The following warning message appears:

   Warning: Journal will be erased. Do you wish to continue?

9. Click Yes to continue.

Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)

Problem Description

The following points describe the behavior of the components in this event:

- When the quorum group is running on the site where the RAs failed (site 1), the cluster nodes on site 1 fail because of lost quorum reservations, and cluster nodes on site 2 attempt to arbitrate for the quorum resource.
- To prevent a split-brain scenario, the RAs assume that the other site is active when a WAN failure occurs. (A WAN failure occurs if the RAs cannot communicate with at least one RA at the other site.)
- When the MSCS Reservation Manager on the surviving site (site 2) attempts the quorum arbitration request, the RA prevents access. Eventually, all cluster services stop and manual intervention is required to bring up the cluster service.

Figure 4-1 illustrates this failure.

Figure 4-1. All RAs Fail on Site 1 (Site 1 Quorum Owner)

Symptoms

The following symptoms might help you identify this failure:

- The management console display shows errors and messages similar to those for "Total Communication Failure in a Geographic Clustered Environment" in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Surviving Host)

   8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2
   Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.

   8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2
   The system failed to flush data to the transaction log. Corruption may occur.

System Event Log for Usmv-West2 (Failure Host)

   8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2
   Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.

   8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2
   The system failed to flush data to the transaction log. Corruption may occur.

- If you review the cluster log, you find messages similar to the following examples:

Cluster Log for Usmv-East2 (Surviving Host)

The following entries were recorded five times before the cluster timed out:

00000f ::2008/02/02-20:36: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (The requested resource is in use)
00000f ::2008/02/02-20:36: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
00000f ::2008/02/02-20:36: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status
00000f ::2008/02/02-20:36: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error
b10::2008/02/02-20:36: ERR [FM] Failed to arbitrate quorum resource c336021a-083e-4fa0-9d a590c206, error
b10::2008/02/02-20:36: ERR [RGP] Node 2: REGROUP ERROR: arbitration failed.
b10::2008/02/02-20:36: ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster service on this node)
a8::2008/02/02-20:37: ERR [JOIN] Unable to connect to any sponsor node.
a8::2008/02/02-20:38: ERR [FM] FmGetQuorumResource failed, error
a8::2008/02/02-20:38: ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service)
a8::2008/02/02-20:38: ERR [INIT] Failed to form cluster, status 5086 (The quorum disk could not be located by the cluster service).

Cluster Log for Usmv-West2 (Failure Host)

00000d bbc::2008/02/02-20:31: ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000da ::2008/02/02-20:35: ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status 170 (The requested resource is in use)
00000da ::2008/02/02-20:35: ERR [RM] LostQuorumResource, cluster service terminated.
da b80::2008/02/02-20:35: ERR Network Name <Cluster Name>: Unable to open handle to cluster, status 1753 (There are no more endpoints available from the endpoint mapper)
da c20::2008/02/02-20:35: ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (The handle is invalid)
a a14::2008/02/02-20:37: ERR [JOIN] Unable to connect to any sponsor node.

The following entries were recorded five times before the cluster timed out:

e ::2008/02/02-20:37: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (The resource is in use)
e ::2008/02/02-20:37: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
e ::2008/02/02-20:37: ERR Physical Disk <Disk Q:>: [DiskArb] BusReset completed, status 31 (A device attached to the system is not functioning)
e ::2008/02/02-20:37: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to break reservation, error
a a14::2008/02/02-20:37: ERR [FM] FmGetQuorumResource failed, error
a a14::2008/02/02-20:37: ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service)
a a14::2008/02/02-20:37: ERR [INIT] Failed to form cluster, status 5086 (The quorum disk could not be located by the cluster service)
a a14::2008/02/02-20:37: ERR [CS] ClusterInitialize failed
a a14::2008/02/02-20:37: ERR [CS] Service Stopped. exit code = 5086

Actions to Resolve the Problem

If all RAs on site 1 fail and site 1 owns the quorum resource, perform the following tasks to recover:
1. Disable MSCS on all nodes at the site with the failed RAs.
2. Perform a manual failover of the quorum consistency group.
3. Reverse the replication direction.
4. Start MSCS on a node on the surviving site.
5. Complete the recovery process.

Caution: Manual recovery is required only if the quorum device is lost because of a failure of an RA cluster. Before you bring the remote site online and before you perform the manual recovery procedure, ensure that MSCS is stopped and disabled on the cluster nodes at the production site (site 1 in this case). You must verify the server status with a network test. Improper use of the manual recovery procedure can lead to an inconsistent quorum disk and unpredictable results that might require a long recovery process.

Disabling MSCS

Stop MSCS on each node at the site where the RAs failed by completing the following steps:
1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service and click Stop.
3. Change the startup type to Disabled.
4. Repeat steps 1 through 3 for each node on the site.

Performing a Manual Failover of the Quorum Consistency Group

1. Connect to the management console by opening a browser to the management IP address of the surviving site. The management console can be accessed only from the site with a functional RA cluster because the WAN is down.
2. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum drive) in the navigation pane.
3. Select the Policy tab.

4. Under Stretch Cluster Support, select the option Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor, and then click Apply.
5. Right-click the Quorum Consistency Group and then select Pause Transfer. Click Yes when the system prompts that the group activity will be stopped.
6. Perform the following steps to allow access to the target image:
a. Right-click Consistency Groups and scroll down.
b. Select the Remote Copy name and click Enable Image Access. The Enable Image Access dialog box appears.
c. Choose Select an image from the list and click Next. The Select Explicit Image dialog box displays the available images.
d. Select the desired image from the list and then click Next. The Image Access Mode dialog box appears.
e. Select Logged access (physical) and click Next. The Summary screen shows the image name and the image access mode.
f. Click Finish.
Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group.
g. Verify the target image name displayed below the bitmap in the components pane under the Status tab. A Transfer: Paused status appears under the bitmap on the Status tab.

Reversing Replication Direction

1. Select the Quorum Consistency Group in the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
3. Select the Status tab. The status of the transfer must show Paused.
4. Right-click Consistency Groups and select Failover to <Remote Site>.
5. Click Yes when the system prompts to confirm the failover.
6. Ensure that the Start data transfer immediately check box is selected. The following warning message appears: Warning: Journal will be erased. Do you wish to continue?
7. Click Yes to continue.

Starting MSCS

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery setting is enabled. You can manually start MSCS on each node of the surviving site by performing the following steps:
1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service, and click Start. MSCS starts the cluster group and automatically moves all groups to the first-started cluster node.
3. Repeat steps 1 and 2 for each node on the site.

Completing the Recovery Process

To complete the recovery process, you must restore the global cluster mode property and start MSCS.

Restoring the Global Cluster Mode Property for the Quorum Group

Once the primary site is operational and you have verified that all nodes at both sites are online in the cluster, restore the failover settings by performing the following steps:
1. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum device) in the navigation pane.
2. Select the Policy tab.
3. Under Stretch Cluster Support, select the option Group is managed by 30m, Unisys SafeGuard Solutions can only monitor.
4. Click Apply.
5. Click Yes when the system prompts that the group activity will be stopped.

Enabling MSCS

Enable and start MSCS on each node at the site where the RAs failed by completing the following steps:
1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service and click Properties.
3. Change the startup type to Automatic.
4. Click Start.
5. Repeat steps 1 through 4 for each node on the site.
6. Open the Cluster Administrator and move the groups to the preferred node.
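If you prefer the command line, the Services control-panel steps used throughout this section can also be performed with the standard Windows service-control commands. This is a sketch: clussvc is the usual service name for the Cluster Service, but verify the name on your nodes before using it.

```
REM Stop and disable the Cluster Service (nodes at the site with failed RAs)
net stop clussvc
sc config clussvc start= disabled

REM After recovery: re-enable and start the Cluster Service
sc config clussvc start= auto
net start clussvc
```

Run the commands on each node in turn, exactly as you would repeat the Services steps on each node.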

Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and the RAs fail on site 1, all cluster nodes remain in a running state. All consistency groups remain at their respective sites because all disk accesses are successful.

If the consistency group option Allow application to run even when Unisys SafeGuard Solutions cannot mark data was selected, data is stored on the replication volumes but the corresponding marking information is not written to the repository volume, so a full-sweep resynchronization is required following recovery. If the option was not selected, the splitter prevents access to the disks when the RAs are not available to write marking data to the repository volume, and I/Os fail.

Figure 4-2 illustrates this failure.

Symptoms

Figure 4-2. All RAs Fail on Site 1 (Site 2 Quorum Owner)

The following symptoms might help you identify this failure:
- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Surviving Site, Site 2)
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" offline."
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in Resource Group 'Group 0' failed.
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-EAST2.
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" online."

System Event Log for Usmv-West2 Host (Failure Site, Site 1)
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" offline."
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in Resource Group 'Group 0' failed.
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-EAST2.
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" online."

- If you review the cluster log, you find messages similar to the following examples:

Cluster Log for the Surviving Site (Site 2)
a fdc::2008/02/02-21:57: ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000ec b4::2008/02/02-22:09: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.
00000ec f48::2008/02/02-22:10: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Cluster Log for the Failure Site (Site 1)
c e4::2008/02/02-22:09: ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_system_settings command failed. Error: ( )
c e4::2008/02/02-22:09: ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be performed because of an I/O device error)
c.00000b8c::2008/02/02-22:10: ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_version command failed. Error: ( )
c.00000b8c::2008/02/02-22:10: ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_system_settings command failed. Error: ( )
c.00000b8c::2008/02/02-22:10: ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be performed because of an I/O device error)

Actions to Resolve the Problem

If all RAs on site 1 fail and site 2 owns the quorum resource, you do not need to perform a manual recovery. Because the surviving site owns the quorum consistency group, MSCS automatically restarts, and the data consistency group fails over to the surviving site.

Recovery When All RAs and All Servers Fail on One Site

The following two cases describe an event in which a complete site fails (for example, site 1) and all data I/O, cluster node communication, disk reservations, and so forth, stop responding. MSCS nodes on site 2 detect a network heartbeat loss and a loss of disk reservations, and they try to take over the cluster groups that had been running on the failed nodes.

There are two cases for recovering from this failure, based on which site owns the quorum group:
- The RAs and servers fail on site 1, and that site owns the quorum group. Manual recovery of MSCS is required, as described in the following topic, Site 1 Failure (Site 1 Quorum Owner).
- The RAs and servers fail on site 1, and site 2 owns the quorum group. If the site can recover in an acceptable amount of time and the quorum owner does not reside on the failed site, manual recovery should not be performed.

The two cases respond differently and are solved differently based on where the quorum owner resides.

Site 1 Failure (Site 1 Quorum Owner)

Problem Description

In the first failure case, all nodes at site 1 fail, as well as the RAs. Thus, the RAs must fail the quorum arbitration attempts initiated by nodes on the surviving site. Because the RAs on the surviving site (site 2) are not able to communicate over the communication networks, the RAs assume that a WAN network failure occurred and do not allow automatic failover of cluster resources. MSCS attempts to fail over to a node at site 2. Because the quorum resource was owned by site 1, site 2 must be brought up using the manual quorum recovery procedure.

Figure 4-3 illustrates this case.

Symptoms

Figure 4-3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)

The following symptoms might help you identify this failure:
- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Failure Site)
8/3/ :46:01 AM ClusSvc Error Startup/Shutdown 1073 N/A USMV-EAST2 Cluster service was halted to prevent an inconsistency within the server cluster. The error code was 5892 (The membership engine requested shutdown of the cluster service on this node).
8/3/ :46:00 AM ClusSvc Error Membership Mgr 1177 N/A USMV-EAST2 Cluster service is shutting down because the membership engine failed to arbitrate for the quorum device. This could be due to the loss of network connectivity with the current quorum owner. Check your physical network infrastructure to ensure that communication between this node and all other nodes in the server cluster is intact.
8/3/ :47:40 AM ClusSvc Error Startup/Shutdown 1009 N/A USMV-EAST2 Cluster service could not join an existing server cluster and could not form a new server cluster. Cluster service has terminated.
8/3/ :50:16 AM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting a bus reset for device \Device\ClusDisk.

- If you review the cluster log, you find messages similar to the following examples:

Cluster Log for the Surviving Site (Site 2)
00000c f4::2008/02/02-17:13: ERR [NMJOIN] Unable to begin join, status 1717 (the NIC interface is unknown)
00000c f4::2008/02/02-17:13: ERR [CS] ClusterInitialize failed
00000c f4::2008/02/02-17:13: ERR [CS] Service Stopped. exit code =
be e0::2008/02/02-17:14: ERR [JOIN] Unable to connect to any sponsor node
be e0::2008/02/02-17:14: ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
e bac::2008/02/02-17:16: ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status
e8c.00000ea8::2008/02/02-17:30: ERR Physical Disk <Disk Q:>: [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0xe1e7208e new signature 0xe1e7208e, status 2 (the system cannot find the file specified)
e8c.00000ea8::2008/02/02-17:30: ERR Physical Disk <Disk Q:>: SCSI: Attach, error attaching to signature e1e7208e, error
e fc::2008/02/02-17:30: ERR [FM] FmGetQuorumResource failed, error
e fc::2008/02/02-17:30: ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service)
e fc::2008/02/02-17:30: ERR [INIT] Failed to form cluster, status
e fc::2008/02/02-17:30: ERR [CS] ClusterInitialize failed
e fc::2008/02/02-17:30: ERR [CS] Service Stopped. exit code =
e80::2008/02/02-17:55: ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
cc ::2008/02/02-17:55: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Cluster Log for the Failure Site (Site 1)
00000dc c48::2008/02/02-17:12: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status
00000dc c48::2008/02/02-17:12: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
00000dc c48::2008/02/02-17:12: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status
00000dc c48::2008/02/02-17:12: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error
fe ::2008/02/02-17:13: ERR [FM] Failed to arbitrate quorum resource c336021a-083e-4fa0-9d a590c206, error
fe ::2008/02/02-17:13: ERR [RGP] Node 1: REGROUP ERROR: arbitration failed.
fe ::2008/02/02-17:13: ERR [NM] Halting this node due to membership or communications error. Halt code =
fe ::2008/02/02-17:13: ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster service on this node)
dc f34::2008/02/02-17:13: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Pending processing terminated by resource monitor.
dc f34::2008/02/02-17:13: ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online.
e4::2008/02/02-17:29: ERR [FM] FmGetQuorumResource failed, error

e e4::2008/02/02-17:29: ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status =
e e4::2008/02/02-17:29: ERR [INIT] Failed to form cluster, status
e e4::2008/02/02-17:29: ERR [CS] ClusterInitialize failed
e e4::2008/02/02-17:29: ERR [CS] Service Stopped. exit code =
b cc::2008/02/02-17:31: ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000ff d8::2008/02/02-17:31: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Actions to Resolve the Problem

If all RAs and servers on site 1 fail and site 1 owns the quorum resource, perform the following tasks to recover:
1. Perform a manual failover of the quorum consistency group.
2. Reverse the replication direction.
3. Start MSCS.
4. Power on the site if a power failure occurred.
5. Restore the failover settings.

Note: Do not bring up any nodes until the manual recovery process is complete.

Caution: Manual recovery is required only if the quorum device is lost because of a failure of an RA cluster. If the cluster nodes at the production site are operational, you must disable MSCS. You must verify the server status with a network test or attempt to log in to the server. Use the procedure in Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner). Improper use of the manual recovery procedure can lead to an inconsistent quorum disk and unpredictable results that might require a long recovery process.

Performing a Manual Failover of the Quorum Consistency Group

To perform a manual failover of the quorum consistency group, follow the procedure given in Actions to Resolve the Problem under Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner), earlier in this section.

Reversing Replication Direction

1. Select the Consistency Group from the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.

3. Select the Status tab. The status of the transfer must display Paused.
4. Right-click the Consistency Groups and select Failover to <Remote Site Name>.
5. Click Yes when the system prompts to confirm the failover.
6. Ensure that the Start data transfer immediately check box is selected. The following warning message appears: Warning: Journal will be erased. Do you wish to continue?
7. Click Yes to continue.

Starting MSCS

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery setting is enabled. You can manually start MSCS on each node of the surviving site by completing the following steps:
1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service, and click Start. MSCS starts the cluster group and automatically moves all groups to the first-started cluster node.
3. Repeat steps 1 and 2 for each node on the site.

Powering On a Site

If a site experienced a power failure, power on the site in the following order:
1. Switches
2. Storage
Note: Wait until all switches and storage units are initialized before continuing to power on the site.
3. RAs
Note: Wait 10 minutes after you power on the RAs before you power on the hosts.
4. Hosts

Restoring the Global Cluster Mode Property for the Quorum Group

Once the primary site is again operational and you have verified that all nodes at both sites are online in the cluster, restore the failover settings by completing the following steps:
1. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum drive) from the navigation pane.
2. Select the Policy tab.
3. Under Stretch Cluster Support, select the option Group is managed by 30m, Unisys SafeGuard Solutions can only monitor [Auto-quorum (shared quorum) mode].

4. Ensure that the Allow Regulation check box is selected.
5. Click Apply.

Site 1 Failure (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and a complete site failure occurs on site 1, a quorum failover is not required. Only data groups on the failed site require failover. All data that is not mirrored and was in the failed RA cache is lost; the latest image on the remote site is used to recover. Cluster services remain up on all nodes on site 2, and the cluster nodes on site 1 fail. You cannot move a group to nodes on a site where the RAs are down (site 1). MSCS attempts to fail over to a node at site 2. An alert is sent stating that a site or RA cluster has failed.

Figure 4-4 illustrates this case.

Figure 4-4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)

Symptoms

The following symptoms might help you identify this failure:
- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-West2 (Failure Site)
8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2 "The Cluster Service failed to bring the Resource Group ""Cluster Group"" completely online or offline."
8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1203 N/A USMV-WEST2 "The Cluster Service is attempting to offline the Resource Group ""Cluster Group""."
8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2 Cluster service is requesting a bus reset for device \Device\ClusDisk0.

- If you review the cluster log, you find messages similar to the following examples:

Cluster Log for the Failure Site (Site 1)
00000e c10::2008/02/02-20:50: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (the requested resource is in use)
00000e c10::2008/02/02-20:50: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
e fb4::2008/02/02-20:52: ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (the handle is invalid).

Cluster Log for the Surviving Site (Site 2)
dd8::2008/02/02-20:49: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status
dd8::2008/02/02-20:49: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
d e68::2008/02/02-20:49: ERR [GUM] GumSendUpdate: GumQueueLocking update to node 1 failed with 1818 (The remote procedure call was cancelled)
d e68::2008/02/02-20:49: ERR [GUM] GumpCommFailure 1818 communicating with node
::2008/02/02-20:50: ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (The handle is invalid).
Actions to Resolve the Problem

If all RAs and all servers on site 1 fail and site 2 owns the quorum resource, you do not need to perform a manual recovery. Because the surviving site owns the quorum consistency group, MSCS automatically restarts, and the data consistency group fails over to the surviving site.


Section 5. Solving Storage Problems

This section lists symptoms that usually indicate problems with storage. Table 5-1 lists symptoms and the possible problems indicated by those symptoms. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive messages or SNMP traps for possible problems. Also, messages similar to notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 5-1. Possible Storage Problems with Symptoms

Possible problem: User or replication volume not accessible
Symptoms:
- The system pauses the transfer for the relevant consistency group.
- The server cannot access this volume; writes to this volume fail; the file system cannot be mounted; and so forth.
- The management console shows an error for all connections to this volume (that is, all RAs on the relevant site and all splitters attached to this volume).

Possible problem: Repository volume not accessible
Symptoms:
- The system pauses the transfer for all consistency groups.
- The management console shows an error for all connections to this volume (that is, all RAs on the relevant site and all splitters attached to this volume).
- The event log reports that the repository volume is inaccessible.
- The event log indicates that the repository volume is corrupted.

Table 5-1. Possible Storage Problems with Symptoms (cont.)

Possible problem: Journal not accessible
Symptoms:
- The management console shows an error for the connections between this volume and all RAs on the relevant site.
- The system pauses the transfer for the relevant consistency group.
- The event log indicates that the journal was lost or corrupted.

Possible problem: Total storage loss in a geographic replicated environment
Symptoms:
- No volumes from the relevant target and worldwide name (WWN) are accessible to any initiator on the SAN.

Possible problem: Storage failure on one site with the quorum owner on the failed site in a geographic clustered environment
Symptoms:
- The cluster regroup process begins, and the quorum device fails over to a site without failed storage.
- The management console shows a storage error, and replication has stopped.
- Servers report multipath software errors.

Possible problem: Storage failure on one site with the quorum owner on the surviving site in a geographic clustered environment
Symptoms:
- Applications that depend on physical disk resources go offline and fail when attempting to come online.
- Once the resource retry threshold parameters are reached, site 1 fails over to site 2. With the default settings, this takes about 30 minutes.

Table 5-2 lists specific storage volume failures and the types of errors and indicators on the management console that distinguish each failure.

Table 5-2. Indicators and Management Console Errors to Distinguish Different Storage Volume Failures

Failure: Data volume lost or failed
- Groups Paused: Relevant data group
- System Status: Storage error
- Volumes Tab: Replication volume with error status
- Logs Tab: Error 3012

Failure: Journal volume lost, failed, or corrupt
- Groups Paused: Relevant data group
- System Status: Storage error
- Volumes Tab: Journal volume with error status
- Logs Tab: Error 3012

Failure: Repository volume lost, failed, or corrupt
- Groups Paused: All
- System Status: Storage error and RA failure
- Volumes Tab: Repository volume with error status
- Logs Tab: Error 3014

User or Replication Volume Not Accessible

Problem Description

The replication volume is not accessible to any host or splitter.

Symptoms

The following symptoms might help you identify this failure:
- The management console shows an error for storage, and the Volumes tab (status column) shows additional errors. (See Figure 5-1.)

Figure 5-1. Volumes Tab Showing Volume Connection Errors

- Warnings and informational messages similar to those shown in Figure 5-2 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 5-2. Management Console Messages for the User Volume Not Accessible Problem

The following table explains the numbered messages in Figure 5-2.

Reference No. / Event ID / Description / Immediate / Daily Summary
1. Group capabilities problem, with the details showing that the RA is unable to access <group>. X
2. The RA is unable to access the volume. X

- The Groups tab on the management console shows that the system paused the transfer for the relevant consistency group. (See Figure 5-3.)

Figure 5-3. Groups Tab Shows Paused by System

- The server cannot access this volume; writes to this volume fail; the file system cannot be mounted; and so forth.

Actions to Resolve

Perform the following actions to isolate and resolve the problem:
- Determine whether other volumes from the same storage device are accessible to the same RAs, to rule out a total storage loss. If no volumes are seen by an RA, refer to Total Storage Loss in a Geographic Replicated Environment.
- Verify that this LUN still exists and has not failed or been removed from the storage device.
- Verify that the LUN is masked to the proper splitter or splitters and RAs.
- Verify that other servers in the SAN do not use this volume. For example, if an MSCS cluster in the SAN acquired ownership of this volume, it might reserve the volume and block other initiators from seeing the volume.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.

Repository Volume Not Accessible

Problem Description

The repository volume is not accessible to any SAN-attached initiator, including the splitter and RAs.

Or, the repository volume is corrupted, either by another initiator because of storage changes or as a result of storage failure. You must reformat the repository volume before replication can proceed normally.

Symptoms

The following symptoms might help you identify this failure:

- The management console shows an error for all connections to this volume; that is, all RAs on the relevant site and all splitters attached to this volume.
- The RAs tab on the management console shows errors for the volume. (See Figure 5-4.) The following error messages appear for the RAs error condition when you click Details:

  Error: RA 1 in LA can't access repository volume
  Error: RA 2 in LA can't access repository volume

  The following error message appears for the storage error condition when you click Details:

  Error: Repository volume can't be accessed by any RAs

Figure 5-4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors

- The Volumes tab on the management console shows an error for the repository volume, as shown in Figure 5-5.

Figure 5-5. Volumes Tab Shows Error for Repository Volume

- The Groups tab on the management console shows that the transfer is active for all consistency groups, as shown in Figure 5-6.

Figure 5-6. Groups Tab Shows All Groups Are Still Alive

- The Logs tab on the management console lists a message indicating that the RA is unable to access the repository volume or that the repository volume is corrupted. (See Figure 5-7.)

Figure 5-7. Management Console Messages for the Repository Volume Not Accessible Problem

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

- Determine whether other volumes from the same storage device are accessible to the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer to "Total Storage Loss in a Geographic Replicated Environment."
- Verify that this LUN still exists and has not failed or been removed from the storage device.
- Verify that the LUN is masked to the proper splitter or splitters and RAs.
- Verify that other servers in the SAN do not use this volume. For example, if an MSCS cluster in the SAN acquired ownership of this volume, it might reserve the volume and block other initiators from seeing the volume.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.
- If the volume is corrupted or you determine that it must be reformatted, perform the steps in "Reformatting the Repository Volume."
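Several of the checks above come down to comparing the WWN and LUN recorded in the management console against what the storage device actually presents. The comparison can be sketched as follows; all WWN and LUN values here are hypothetical examples, not taken from a real configuration:

```python
# Cross-check volumes configured in the management console against the
# LUNs the storage device actually presents. A configured pair that is
# missing from the presented set points at the "LUN removed, failed, or
# masked incorrectly" checks described above.
# All WWN/LUN values here are hypothetical examples.

def missing_volumes(configured, presented):
    """Return configured (wwn, lun) pairs that storage no longer presents."""
    return sorted(set(configured) - set(presented))

configured = {("50060482d52e4f23", 1), ("50060482d52e4f23", 2)}
presented = {("50060482d52e4f23", 1)}

for wwn, lun in missing_volumes(configured, presented):
    print(f"Check WWN {wwn} LUN {lun}: not presented by the storage device")
```

Any pair the sketch reports is a candidate for the LUN-existence, masking, and permission checks listed above.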

Reformatting the Repository Volume

Before you begin the reformatting process in a geographic clustered environment, be sure that all groups are located at the site for which the repository volume is not to be formatted.

On RA 1 at the site for which the repository volume is to be formatted, determine from the Site Planning Guide which LUN is used for the repository volume. If the LUN is not recorded for the repository volume, a list is presented during the volume formatting process that shows LUNs, and the previously used repository volume is identified.

Perform the following steps to reformat a repository volume for a particular site:

1. Click the Data Group in the management console, and perform the following steps:
   a. Click Policy in the right pane.
   b. Scroll down and select Stretch Cluster Support in the Policy tab.
   c. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
   d. Click Apply.
   e. Right-click the Data Group and select Disable Group.
   f. Click Yes when the system prompts for confirmation.
   g. Click Yes when the system prompts that the copy activities will be stopped.
2. Skip to step 6 for geographic replication environments.
3. Perform the following steps for geographic clustered environments:
   a. Open the Group Policy window for the quorum group.
   b. Click Policy in the right pane.
   c. Scroll down and select Stretch Cluster Support in the Policy tab.
   d. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
   e. Click Apply.
4. Right-click the Consistency Group and select Disable Group.
5. Click Yes when the system prompts that the copy activities will be stopped.
6. Select the Splitters tab.
   a. Open the Splitter Properties window for the splitter.
   b. Select all the attached volumes.
   c. Click Detach and then click Apply.
   d. Click OK to close the window.

   e. Delete the splitter at the site for which the repository volume is to be reformatted.
7. Open a PuTTY session on RA 1 for the site.
   a. Log on with boxmgmt as the user ID and boxmgmt as the password. The Main menu is displayed.
   b. At the prompt, type 4 (Cluster Operation) and press Enter.
   c. Type 2 (Detach from cluster) at the Cluster Operations menu.
   d. Type y when prompted for confirmation.
   e. Type b to go back to the Setup menu.
   f. On the Setup menu, type 2 (Configure repository volume) and press Enter.
   g. Type 1 (Format repository volume) and press Enter.
   h. Enter the appropriate number from the list to select the LUN. The LUN and identifier are displayed. Ensure that the WWN and LUN are for the volume that you want to format.
   i. Confirm the volume to format. All data is removed from the volume.
   j. Verify that the operation succeeds and press Enter.
   k. On the Main menu, type Q (quit) and press Enter.
8. Open a PuTTY session on each additional RA at the site for which the repository volume is to be formatted.
9. Log on with boxmgmt as the user ID and boxmgmt as the password. The Main menu is displayed.
   a. At the prompt, type 2 (Setup) and press Enter.
   b. On the Setup menu, type 2 (Configure repository volume) and press Enter.
   c. Type 2 (Select a previously formatted repository volume) and press Enter.
   d. Enter the appropriate number from the list to select the LUN. The LUN and identifier are displayed. Ensure that the WWN and LUN are for the volume that you want to select.
   e. Confirm the volume selection.
   f. Verify that the operation succeeds and press Enter.
   g. On the Main menu, type Q (quit) and press Enter.
   Note: Complete step 9 for each additional RA at the site.
10. On the management console, select the Splitters tab.
    a. Click the Add New Splitter icon to open the Add splitter window.
    b. Click Rescan and select the splitter.
11. Open the Group Properties window, click the Policy tab, and perform the following steps for each data group:

    a. Scroll down and select Stretch Cluster Support in the Policy tab.
    b. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
    c. Click Apply.
    d. Right-click the Data Group and click Enable Group.
12. Skip to step 16 for geographic replication environments.
13. Perform the following steps for geographic clustered environments:
    a. Right-click the Quorum Group and click Enable Group.
    b. Click the Quorum Group and select Policy in the right pane.
    c. Scroll down and select Stretch Cluster Support in the Policy tab.
    d. Under "This consistency group works with," select the "This is the quorum group" check box.
    e. Under Management Mode, select "Group is managed by 30m, Unisys SafeGuard Solutions can only monitor."
    f. Click Apply.
14. Verify that initialization completes for all the groups.
15. Review the management console event log.
16. Ensure that no storage error or other component error appears.

Journal Not Accessible

Problem Description

The journal is not accessible to either RA, or a journal for one of the consistency groups is corrupted. The corruption results from another initiator because of storage changes or as a result of storage failure. Because the snapshot history is corrupted, replication for the relevant consistency group cannot proceed.

Symptoms

The following symptoms might help you identify this failure:

- The Volumes tab on the management console shows an error for the journal volume. (See Figure 5-8.)

Figure 5-8. Volumes Tab Shows Journal Volume Error

- The RAs tab on the management console shows errors for connections between this volume and the RAs. (See Figure 5-9.)

Figure 5-9. RAs Tab Shows Connection Errors

- The Groups tab on the management console shows that the system paused the transfer for the relevant consistency group, as shown in Figure 5-10.

Figure 5-10. Groups Tab Shows Group Paused by System

- The Logs tab on the management console lists a message indicating that the RA is unable to access the volume. (See Figure 5-11.)

Figure 5-11. Management Console Messages for the Journal Not Accessible Problem

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

- Determine whether other volumes from the same storage device are accessible to the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer to "Total Storage Loss in a Geographic Replicated Environment."
- Verify that this LUN still exists on the storage device and that it is only masked to the RAs.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.
- For a corrupted journal, check that the system recovers automatically by re-creating the data structures for the corrupted journal and that the system then initiates a full-sweep resynchronization. No manual intervention is needed.

Journal Volume Lost Scenarios

Problem Description

The journal volume is lost and is not available in the following scenarios:

1. Data is written to the journal volume faster than journal data is distributed to the replication volume. The journal volume fills, and a subsequent attempt to write to it results in loss of journal data.
2. The user performs either of the following operations:
   - Failover
   - Recover production

Actions to Resolve

You can minimize the occurrence of this problem in scenario 1 by carefully configuring the Journal Lag. The problem is unavoidable in scenario 2.

Total Storage Loss in a Geographic Replicated Environment

Problem Description

All volumes belonging to a certain storage target and WWN (or controller, device) have been lost.

Symptoms

The following symptoms might help you identify this failure:

- The symptoms can be the same as those from any of the volume failure problems listed previously (or a subset of those symptoms), if the symptoms are relevant to the volumes that were used on this target.
- All volumes common to a particular storage array have failed.
- The Volumes tab on the management console shows errors for all volumes. (See Figure 5-12.)

Figure 5-12. Management Console Volumes Tab Shows Errors for All Volumes

- No volumes from the relevant target and WWN are accessible to any initiator on the SAN, as shown on the RAs tab on the management console. (See Figure 5-13.)

Figure 5-13. RAs Tab Shows Volumes That Are Not Accessible

- Multipathing software (such as EMC PowerPath Administrator) reports failed paths to the storage device, as shown in Figure 5-14.

Figure 5-14. Multipathing Software Reports Failed Paths to Storage Device

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

- Verify that the storage device has not experienced a power outage and that the device is functioning normally according to all external indicators.
- Verify that the Fibre Channel switch and the storage device indicate an operating Fibre Channel connection (that is, the relevant LEDs show OK). If the indicators are not OK, the problem might be a faulty Fibre Channel port (storage, switch, or patch panel) or a faulty Fibre Channel cable.
- Verify that the initiator can be seen from the switch name server. If not, the problem could be a Fibre Channel port or cable problem (as in the preceding item). Otherwise, the problem could be a misconfiguration of the port on the switch (for example, the type or speed could be wrong).
- Verify that the target WWN is included in the relevant zones (that is, hosts and RA). Verify also that the current zoning configuration is the active configuration. If you use the default zone, verify that it is set to permit by default.
- Verify that the relevant LUNs still exist on the storage device and are masked to the proper splitters and RAs.
- Verify that volumes have read and write permissions on the storage system.
- Verify that these volumes are exposed to and managed by the proper hosts and that no other hosts on the SAN use these volumes.

Storage Failure on One Site in a Geographic Clustered Environment

In a geographic clustered environment where MSCS is running, if the storage subsystem on one site fails, the symptoms and resulting actions depend on whether the quorum owner resided on the failed storage subsystem.

To understand the two scenarios and to follow the actions for both possibilities, review Figure 5-15.

Figure 5-15. Storage on Site 1 Fails

Storage Failure on One Site with Quorum Owner on Failed Site

Problem Description

In this case, the cluster quorum owner as well as the quorum resource resides on the failed storage subsystem. The quorum and resource automatically fail over to the node that gains control through MSCS arbitration. This node resides on the site without the storage failure. The RAs use the last available image. This action results in a loss of data that has yet to be replicated. The resources cannot fail back to the failed site until the storage subsystem is restored.

Symptoms

The following symptoms might help you identify this failure:

- A node on which the cluster was running might report a delayed write failure or similar error.

- The quorum reservation is lost, and MSCS stops on the cluster node that owned the quorum resource. This action triggers a cluster regroup process, which allows other cluster nodes to arbitrate for the quorum device. Figure 5-16 shows typical listings for the cluster regroup process.

Figure 5-16. Cluster Regroup Process

- Cluster nodes located on the failed storage subsystem fail quorum arbitration because the service cannot provide a reservation on the quorum volume. The resources fail over to the site without a storage failure. The first cluster node on the site without the storage failure that successfully completes arbitration of the quorum device assumes ownership of the cluster. The following messages illustrate this process.

Cluster Log Entries

INFO Physical Disk <Disk Q:>: [DiskArb] DisksArbitrate
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Disk unique id present trying new attach
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number from ClusDisk registry key
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to PhysicalDrive7.
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Returns success.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitration Parameters: ArbAttempts 5, SleepBeforeRetry 500 ms.
INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info to insure the disk is accessible.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature b876c301.
ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate for ownership of the disk by reading/writing various disk sectors.
ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
INFO Physical Disk <Disk Q:>: [DiskArb] We are about to break reserve.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing BusReset on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info from the disk to insure disk is accessible.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 0.
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO [RGP] Node 2: RGP Send packets: 0x3, 0xc , 0x , 0x0.
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 11) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 11) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 0.
WARN Physical Disk <Disk Q:>: [DiskArb] Assume ownership of the device.
INFO Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine starts.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 0.

- In Cluster Administrator, the groups that were online on one node change to the node that wins arbitration, as shown in Figure 5-17.

Figure 5-17. Cluster Administrator Display

- Multipathing software, if present, reports errors on the host servers of the site for which the storage subsystem failed. Figure 5-18 shows errors for failed storage devices.

Figure 5-18. Multipathing Software Shows Server Errors for Failed Storage Subsystem
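The DiskArb entries in the cluster log excerpts above follow a regular shape, so the arbitration outcome can be pulled out mechanically when you review a collected log. A minimal sketch; the line formats are assumed from the excerpts in this section only:

```python
import re

# Scan cluster-log lines for the DiskArb arbitration result. In the
# excerpts above, "Reserve completed, status 0" marks a won reservation,
# while "reservation lost" marks the node that lost the quorum disk.

def arbitration_outcome(lines):
    outcome = "unknown"
    for line in lines:
        if re.search(r"reservation lost", line, re.IGNORECASE):
            outcome = "reservation lost"
        elif "Reserve completed, status 0" in line:
            outcome = "reservation acquired"
    return outcome

log = [
    "INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature b876c301.",
    "INFO Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 0.",
]
print(arbitration_outcome(log))  # reservation acquired
```

Running the sketch over the log from each node shows which node won the quorum device and which nodes lost their reservation.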

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

- Verify that all cluster resources failed over to a node on the site for which the storage subsystem did not fail and that these resources are online. If the cluster is running and no additional errors are reported, the problem has probably been isolated to a total site storage failure.
- Log in to the storage subsystem, and verify that all LUNs are present and configured properly. If the storage subsystem appears to be operating, the problem is most likely because of a failed SAN switch. See "Total SAN Switch Failure on One Site in a Geographic Clustered Environment" in Section 6.
- Resolve the failure of the storage subsystem before attempting failback. Once the storage subsystem is working and the RAs and host can access it, a full initialization is initiated.

Storage Failure on One Site with Quorum Owner on Surviving Site

Problem Description

In this case, the cluster quorum owner does not reside on the failed storage subsystem, but other resources do reside on the failed storage subsystem. The cluster resources fail over to a site without a failed storage subsystem. The RAs use the last available image. This action results in a loss of data that has yet to be replicated (if replication is not synchronous). The resources cannot fail back to the failed site until the storage subsystem is restored.

Symptoms

The following symptoms might help you identify this failure:

- The cluster marks the data groups containing the physical disk resources as failed.
- Applications dependent on the physical disk resource go offline.
- Failed resources attempt to come online on the failed site, but fail. Then the resources fail over to the site with a valid storage subsystem.

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

- Verify that multipathing software, if present, reports errors on the host servers at the site with the suspected failed storage subsystem. (See Figure 5-19.)
- Verify that all cluster resources failed over to site 2 in Cluster Administrator.
- Entries similar to the following occur in the cluster log for a host at the site with a failed storage subsystem (thread ID and timestamp removed).

Cluster Log

Disk reservation lost...
ERR Physical Disk <Disk R:>: [DiskArb] CompletionRoutine: reservation lost! Status 1167

Arbitrate for disk...
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is called.
INFO Physical Disk <Disk R:>: [DiskArb] Stopping reservation thread.
ERR Physical Disk <Disk R:>: [DiskArb] Failed to read (sector 12), error
ERR Physical Disk <Disk R:>: [DiskArb] Error cleaning arbitration sector, error
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is complete.
INFO Physical Disk <Disk R:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature 42b77e24
INFO Physical Disk <Disk R:>: [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0x42b77e24 new signature 0x42b77e24, status 2
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is called.
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is complete.

Control goes offline at failed site...
INFO [FM] FmpDoMoveGroup: Entry
WARN [FM] FmpHandleResourceTransition: Resource failed, post a work item
INFO [FM] FmpMoveGroup: Entry
INFO [FM] FmpMoveGroup: Moving group c3cb79bc-b92f-427c-ac9a-2c474b81f6da to node 1 (1)
INFO [FM] FmpOfflineResource: Disk R: depends on Data1LANY. Shut down first.
INFO Unisys SafeGuard 30m Control <Data1LANY>: KfResourceOffline: Resource 'Data1LANY' going offline.

After trying other nodes at site, move to remote site...
INFO [FM] FmpMoveGroup: Take group c3cb79bc-b92f-427c-ac9a-2c474b81f6da request to remote node 1

Move succeeds...
INFO [FM] FmpMoveGroup: Exit group < DiskR >, status = 0
INFO [FM] New owner of Group c3cb79bc-b92f-427c-ac9a-2c474b81f6da is 1, state 0, curstate 0.
INFO [GUM] s_gumupdatenode: completed update seq 443 type 0 context 9
INFO [FM] FmpDoMoveGroup: Exit, status = 0
INFO [FM] FmpDoMoveGroupOnFailure: FmpDoMoveGroup returns 0
INFO [FM] FmpDoMoveGroupOnFailure Exit.

- Log in to the failed storage subsystem and determine whether the storage reports failed or missing disks. If the storage subsystem appears to be fine, the problem is most likely because of a SAN switch failure. See "Total SAN Switch Failure on One Site in a Geographic Clustered Environment" in Section 6.
- Once the storage for the site that failed is back online, a full sweep is initiated. Check that the messages "Starting volume sweep" and "Starting full sweep" are displayed as an Events Notice.
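The final check above, confirming that resynchronization began, can also be run against a saved copy of the event log. A minimal sketch; the notice strings are taken from this section, while the surrounding log format is assumed:

```python
# Confirm that resynchronization began after storage came back online by
# looking for the sweep notices in saved event-log messages.
SWEEP_NOTICES = ("Starting volume sweep", "Starting full sweep")

def sweep_notices_found(events):
    """Return the sweep notices that appear among the event messages."""
    return [n for n in SWEEP_NOTICES if any(n in e for e in events)]

events = [
    "Notice: Starting volume sweep",
    "Notice: Starting full sweep",
]
print(sweep_notices_found(events))  # ['Starting volume sweep', 'Starting full sweep']
```

If either notice is missing from the collected log, the full sweep may not have started and the storage restoration should be rechecked.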


Section 6. Solving SAN Connectivity Problems

This section lists symptoms that usually indicate problems with connections to the storage subsystem. Table 6-1 lists symptoms and the possible problems indicated by the symptoms. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive messages or SNMP traps for possible problems. Also, messages similar to notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 6-1. Possible SAN Connectivity Problems

Possible problem: Volume not accessible to RAs
Symptoms:
- The system pauses the transfer. If the volume is accessible to another RA, a switchover occurs, and the relevant groups start running on the new RA. The relevant message appears in the event log.
- The link to the volume from the disconnected RA or RAs shows an error.
- The volume is accessible to the splitters that are attached to it.

Possible problem: Volume not accessible to SafeGuard 30m splitter
Symptoms:
- The system pauses the transfer for the relevant groups.
- If the repository volume is not accessible, the management console shows an error for the splitter.
- If a replication volume is not accessible, the splitter connection to that volume shows an error.

Possible problem: RAs not accessible to SafeGuard 30m splitter
Symptoms:
- The system pauses the transfer for the relevant group or groups. If the connection with only one of the RAs is lost, the group or groups can restart the transfer by means of another RA, beginning with a short initialization.
- The splitter connection to the relevant RAs shows an error.
- The relevant message describes the lost connection in the event log.

Possible problem: Server unable to connect with SAN (See "Server Unable to Connect with SAN" in Section 9. This problem is not described in this section.)
Symptoms:
- The management console shows a server down.
- Messages on the management console show that the splitter is down and that the node fails over.
- Multipathing software (such as EMC PowerPath Administrator) messages report an error.

Possible problem: Total SAN switch failure on one site in a geographic clustered environment
Symptoms:
- Cluster nodes fail and the cluster regroup process begins.
- Applications fail and attempt to restart.
- Messages regarding failed physical disks are displayed on the management console.
- The cluster resources fail over to the remote site.

Volume Not Accessible to RAs

Problem Description

A volume (repository volume, replication volume, or journal) is not accessible to one or more RAs, but it is accessible to all other relevant initiators, that is, the splitter.

Symptoms

The following symptoms might help you identify this failure:

- The system pauses the transfer. If the volume is accessible to another RA, a switchover occurs, and the relevant group or groups start running on the new RA.

- The management console displays failures similar to those in Figure 6-1.

Figure 6-1. Management Console Showing Inaccessible Volume Errors

- Warnings and informational messages similar to those shown in Figure 6-2 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-2. Management Console Messages for Inaccessible Volumes

The following table explains the numbered messages shown in Figure 6-2.

- Reference 1: For each consistency group, the surviving site reports a group capabilities problem.
- Reference 2: The group is deactivated indefinitely by the system.
- Reference 3: The RA is unable to access the volume (RA1, Data1_LA_1).
- Reference 4: For each consistency group, the site reports a group capabilities problem.
- Reference 5: Splitter write to RA failed.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/4/2009 2:07:04 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
1/4/2009 2:07:04 AM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 3'.
1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 4'.
1/4/2009 5:07:05 AM ClusDisk Error None 1209 N/A ES3120-X64 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1135 N/A ES3120-X64 Cluster node USMV-X460 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
1/4/2009 5:07:05 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
1/4/2009 5:07:05 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 "The Cluster Service brought the Resource Group ""Cluster Group"" online."
1/4/2009 5:07:05 AM Service Control Manager Information None 7036 N/A USMV-X460 The Cluster Service entered the running state.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

2009/01/04-10:07 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status
2009/01/04-10:07 ERR [RM] LostQuorumResource, cluster service terminated
2009/01/04-10:07 ERR Network Name <Cluster Name>: Unable to open handle to cluster, status
2009/01/04-10:07 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status
2009/01/04-10:07 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
2009/01/04-10:07 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector, error 2.

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

2009/01/04-10:07 INFO [ClMsg] Received interface unreachable event for node 2 network
2009/01/04-10:07 INFO [ClMsg] Received interface unreachable event for node 2 network
2009/01/04-10:07 WARN [NM] Interface 1db021ff-a472-4df2-97fe-77fda4dc1a38 is unavailable (node: USMV-X460, network: Local Area Connection 3)
2009/01/04-10:07 WARN [NM] Interface fc-1fd0-4fc9-b7b0-a2355ca47f75 is unavailable (node: USMV-X460, network: Local Area Connection 4).

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

- Verify that the physical connection between the inaccessible RAs and the Fibre Channel switch is healthy.
- Verify that any disconnected RA appears in the name server of the Fibre Channel switch. If not, the problem could be because of a bad port on the switch, a bad host bus adaptor (HBA), or a bad cable.
- Verify that any disconnected RA is present in the proper zone and that the current zoning configuration is enabled.
- Verify that the correct volume is configured (WWN and LUN). To double-check, enter the Create Volume command in the management console, and verify that the same volume does not appear on the list of volumes that are available to be created.
- If the volume is not accessible to the RAs but is accessible to a splitter, and the server on which that splitter is installed is clustered using MSCS, Oracle RAC, or any other software that uses a reservation method, the problem probably occurs because the server has reserved the volume. For more information about the clustered environment installation process, see the Unisys SafeGuard Solutions Planning and Installation Guide and the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.
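When a clustered server has reserved the volume, the Windows system event log usually records it, as in the event 1038 example earlier in this section. A minimal sketch that flags such entries in an exported log; the export line format is assumed from the examples above:

```python
# Flag reservation-related entries in an exported Windows system event log.
# The "Reservation of cluster disk ... has been lost" text matches the
# event 1038 example shown earlier in this section.

def reservation_entries(lines):
    """Return log lines that mention a cluster disk reservation."""
    return [line for line in lines if "Reservation of cluster disk" in line]

log = [
    "1/4/2009 2:07:04 AM ClusSvc Error Physical Disk Resource 1038 N/A "
    "USMV-X460 Reservation of cluster disk 'Disk Q:' has been lost.",
    "1/4/2009 2:07:04 AM Ftdisk Warning Disk 57 N/A USMV-X460 The system "
    "failed to flush data to the transaction log. Corruption may occur.",
]
for entry in reservation_entries(log):
    print(entry)
```

Finding such entries supports the last action above: a server holding a reservation on the volume, rather than a storage fault, may be blocking the RAs.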

Volume Not Accessible to SafeGuard 30m Splitter

Problem Description

A volume (repository volume, replication volume, or journal) is not accessible to one or more splitters but is accessible to all other relevant initiators (for example, the RAs).

Symptoms

The following symptoms might help you identify this failure:

- The system pauses the transfer for the relevant groups.
- If the repository volume is not accessible, the management console shows an error for the splitter.
- If a replication volume is not accessible, the splitter connection to that volume shows an error.
- The management console System Status screen and the Splitter Settings screen show error indications similar to those in Figure 6-3.

Figure 6-3. Management Console Error Display Screen

- Warnings and informational messages similar to those shown in Figure 6-4 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-4. Management Console Messages for Volumes Inaccessible to Splitter

The following table explains the numbered messages shown in Figure 6-4. Each message is reported either immediately or in the daily summary notification.

For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
Negotiating transfer protocol.
For each consistency group at the failed site, the data transfer starts and then the initialization starts.
For each consistency group at the failed site, initialization completes.
Pausing data transfer.
For each consistency group, a minor problem is reported. The details show that the sides are not linked and cannot transfer data.
Transferring the latest snapshot before pausing the transfer (no detail is lost).
The splitter write operation failed.
Writes to replication volume ID disabled.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.
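When reviewing collected logs for the time of the failure, a small filter can narrow thousands of lines to the failure window. This is an illustrative sketch, not part of the product; the timestamp format follows the cluster-log excerpts shown in this section:

```python
# Illustrative sketch: keep only log lines whose leading timestamp falls
# within a window around the failure time. The YYYY/MM/DD-HH:MM:SS
# format is assumed from the cluster-log excerpts; adjust for other logs.
from datetime import datetime

def lines_near_failure(lines, failure, window_minutes=5):
    fmt = "%Y/%m/%d-%H:%M:%S"
    hits = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], fmt)
        except ValueError:
            continue  # line does not start with a timestamp
        if abs((stamp - failure).total_seconds()) <= window_minutes * 60:
            hits.append(line)
    return hits

log = [
    "2009/01/04-10:02:11 INFO [ClMsg] heartbeat ok",
    "2009/01/04-10:07:44 ERR Physical Disk <Disk Q:>: reservation lost",
]
failure = datetime(2009, 1, 4, 10, 7, 0)
print(lines_near_failure(log, failure, window_minutes=2))
```

Running this against the collected logs for each host makes it easier to correlate console messages with the event IDs listed in Appendix E.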

The multipathing software (such as EMC PowerPath) on the server at the failed site reports a disk error as shown in Figure 6-5.

Figure 6-5. EMC PowerPath Shows Disk Error

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/4/2009 4:09:28 AM Emcmpio Error None 100 N/A USMV-X460 Path Bus 2 Tgt 60 Lun 1 to APM is dead.
1/4/2009 4:09:28 AM Emcmpio Error None 102 N/A USMV-X460 A560E00378AABEBF3C8DB11 is dead.
1/4/2009 4:09:28 AM Emcmpio Error None 104 N/A USMV-X460 All paths to A560E00378AABEBF3C8DB11 are dead.
1/4/2009 4:09:31 AM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/4/2009 4:11:07 AM ClusDisk Error None 1069 N/A USMV-X460 Cluster resource 'Disk R:' in Resource Group 'DiskR' failed.
1/4/2009 4:11:08 AM ClusSvc Information Failover Mgr 1153 N/A USMV-X460 Cluster service is attempting to failover the Cluster Resource Group 'DiskR' from node USMV-X460 to node ES3120-X64.
1/4/2009 4:11:30 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/4/2009 7:10:34 AM ClusDisk Error None 1209 N/A USMV-X460 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/4/2009 7:11:07 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 The Cluster Service is attempting to bring online the Resource Group "DiskR".
1/4/2009 7:11:30 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

2009/01/04-12:09 ERR Physical Disk <Disk R:>: [DiskArb] CompletionRoutine: reservation lost!
2009/01/04-12:09 ERR Physical Disk <Disk R:>: LooksAlive, error checking device
2009/01/04-12:09 ERR Physical Disk <Disk R:>: IsAlive, error checking device
2009/01/04-12:10 ERR Physical Disk <Disk R:>: [DiskArb] Error cleaning arbitration sector, error 2

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

2009/01/04-11:52 INFO [ClMsg] Received interface unreachable event for node 2 network
2009/01/04-11:52 INFO [ClMsg] Received interface unreachable event for node 2 network
2009/01/04-11:52 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed
2009/01/04-11:52 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

Verify that the physical connection between the disconnected splitter or splitters and the Fibre Channel switch is healthy.

Verify that any host on which a disconnected splitter resides appears in the name server of the Fibre Channel switch.
If it does not, the problem could be a bad port on the switch, a bad HBA, or a bad cable.

Verify that any host on which a disconnected splitter resides is present in the proper zone and that the current zoning configuration is enabled.

If a replication volume is not accessible to the splitter at the source site but appears as OK in the management console for that splitter, verify that the splitter is not functioning at the target site (TSP not enabled). During normal replication, the system prevents target-site splitters from accessing the replication volumes.
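The zoning checks above can be spot-verified against an exported zone configuration. The sketch below is illustrative only; the zone names and member aliases are hypothetical, not taken from any real fabric:

```python
# Illustrative sketch: confirm that two fabric members (for example, a
# host HBA and an RA port) share at least one zone in the enabled
# configuration. Zone names and aliases are hypothetical examples.

zones = {
    "zone_host_storage": {"host_hba_0", "array_spa_0"},
    "zone_host_ra": {"host_hba_0", "ra1_port0", "ra2_port0"},
}

def shares_zone(zones, member_a, member_b):
    """Return True if both members appear together in any zone."""
    return any(member_a in members and member_b in members
               for members in zones.values())

print(shares_zone(zones, "host_hba_0", "ra1_port0"))   # True: host is zoned to the RA
print(shares_zone(zones, "array_spa_0", "ra1_port0"))  # False: RA is not zoned to storage
```

A missing pairing like the second case is the kind of zoning gap that leaves the splitter unable to reach the RAs even though every physical link is healthy.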

RAs Not Accessible to SafeGuard 30m Splitter

Problem Description

One or more RAs on a site are not accessible to the splitter through the Fibre Channel.

Symptoms

The following symptoms might help you identify this failure:

The system pauses the transfer for the relevant groups. If the connection with only one of the RAs is lost, the groups can restart the transfer by means of another RA, beginning with a short initialization.

The splitter connection to the relevant RAs shows an error.

The management console displays error indicators similar to those in Figure 6-6.

Figure 6-6. Management Console Display Shows a Splitter Down

Warnings and informational messages similar to those shown in Figure 6-7 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-7. Management Console Messages for Splitter Inaccessible to RA

The following table explains the numbered messages shown in Figure 6-7.

The surviving site negotiating transfer protocol.
The failed site stops accepting writes to the consistency group.
For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
Splitter down problem.
The splitter for server USMV-X460 is unable to access the RA.
The synchronization completed message after the splitter is restored and replication completes.
The original site starts the synchronization.
For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure.

Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/4/2009 9:04:18 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
1/4/2009 9:04:19 PM Service Control Manager Error None 7031 N/A USMV-X460 The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in milliseconds: Restart the service.
1/4/2009 9:04:19 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/4/2009 9:05:34 PM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/5/ :04:20 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 3'.
1/5/ :04:20 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 4'.
1/5/ :04:35 AM ClusDisk Error None 1209 N/A ES3120-X64 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/5/ :04:56 AM ClusSvc Warning Node Mgr 1135 N/A ES3120-X64 Cluster node USMV-X460 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
1/5/ :04:56 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 The Cluster Service is attempting to bring online the Resource Group "Cluster Group".
1/5/ :05:09 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "Cluster Group" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

2009/01/05-05:04 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost!
2009/01/05-05:04 ERR [RM] LostQuorumResource, cluster service terminated
2009/01/05-05:04 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12)
2009/01/05-05:04 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector
2009/01/05-05:04 ERR Network Name <Cluster Name>: Unable to open handle to cluster

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

2009/01/05-05:05 INFO [ClMsg] Received interface up event for node 2 network
2009/01/05-05:05 INFO [ClMsg] Received interface up event for node 2 network
2009/01/05-05:04 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed
2009/01/05-05:04 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error
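Cluster-log excerpts like the ones above can be scanned programmatically so that the arbitration errors stand out from the surrounding INFO traffic. An illustrative sketch; the line layout (process.thread::timestamp severity message) is assumed from the excerpts, not from a published specification:

```python
# Illustrative sketch: pull only the ERR entries out of an MSCS cluster
# log. The assumed layout is "<proc>.<thread>::<timestamp> <SEV> <text>".
import re

ERR_RE = re.compile(r"::(\S+)\s+ERR\s+(.*)")

def cluster_errors(text):
    """Return (timestamp, message) pairs for every ERR line in the log."""
    return [(m.group(1), m.group(2))
            for m in map(ERR_RE.search, text.splitlines()) if m]

sample = """\
00000ae4.00000123::2009/01/05-05:04:01.123 ERR [RM] LostQuorumResource, cluster service terminated
00000ae4.00000124::2009/01/05-05:05:02.456 INFO [ClMsg] Received interface up event
"""
for stamp, msg in cluster_errors(sample):
    print(stamp, msg)
```

Sorting the extracted pairs from both hosts onto one timeline is a quick way to see which site lost the reservation first.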

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

Identify which of the components is the problematic one. A problematic component is likely to have additional errors or problems: a problematic RA might not be accessible to other splitters or might not recognize certain volumes; a problematic splitter might not recognize any RAs or the storage subsystem.

Connect to the storage switch to verify the status of each connection. Ensure that each connection is configured correctly. If you cannot find any additional problems, there is a good chance that the problem is with the zoning; that is, somehow, the splitters are not exposed to the RAs.

Verify the physical connectivity of the RAs and the servers (those on which the potentially problematic splitters reside) to the Fibre Channel switch. For each connection, verify that it is healthy and appears correctly in the name server, zoning, and so forth.

Verify that this is not a temporary situation. For instance, if the RAs were rebooting or recovering from another failure, the splitter might not yet identify them.

Total SAN Switch Failure on One Site in a Geographic Clustered Environment

A total SAN switch failure implies that cluster nodes and RAs have lost access to the storage device that was connected to the SAN on one site. This failure causes the cluster nodes to lose their reservation of the physical disks and triggers an MSCS failover to the remote site.

In a geographic clustered environment where MSCS is running, if the connection to a storage device on one site fails, the symptoms and resulting actions depend on whether or not the quorum owner resided on the failed storage device. To understand the two scenarios and to follow the actions for both possibilities, review Figure 6-8.

Figure 6-8. SAN Switch Failure on One Site

Cluster Quorum Owner Located on Site with Failed SAN Switch

Problem Description

The following points explain the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

If the cluster quorum owner is located on the site with the failed SAN, the quorum reservation is lost. This loss causes the cluster nodes to fail and triggers a cluster regroup process. This regroup process allows other cluster nodes participating in the cluster to arbitrate for the quorum device.

Cluster nodes located on the failed SAN fail quorum arbitration because the failed SAN is not able to provide a reservation on the quorum volume. The cluster nodes in the remote location attempt to reserve the quorum device and succeed in arbitrating for the quorum. The node that owns the quorum device assumes ownership of the cluster. The cluster owner brings online the data groups that were owned by the failed site.

Symptoms

The following symptoms might help you identify this failure:

All resources fail over to the surviving site (site 2 in this case) and come online successfully.

Cluster nodes fail at the source site.

If the consistency groups are configured asynchronously, this failover results in loss of data.

The failover is fully automated and does not require additional downtime.

The RAs cannot replicate data until the SAN is operational.

Failures are reported on the server and the management console.

Replication has stopped on all consistency groups.

The management console displays error indications similar to those in Figure 6-9.

Figure 6-9. Management Console Display with Errors for Failed SAN Switch

Warnings and informational messages similar to those shown in Figure 6-10 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-10. Management Console Messages for Failed SAN Switch

The following table explains the numbered messages shown in Figure 6-10.

The surviving site pauses the data transfer.
The original site reports the splitter down status.
RA unable to access splitter.
The group is deactivated indefinitely by the system.
For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem.
The RA is unable to access the repository volume.
The RA is unable to access the volume.
The system is pausing data transfer.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/14/2009 8:25:58 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/14/2009 8:25:58 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk '' has been lost. Please check your system and disk configuration.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/14/ :25:58 PM Service Control Manager Error None 7031 N/A USMV-X460 The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in milliseconds: Restart the service.
1/14/ :25:58 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

2009/01/15-04:25 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost!
2009/01/15-04:25 ERR [RM] LostQuorumResource, cluster service terminated
2009/01/15-04:26 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12)
2009/01/15-04:26 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector
2009/01/15-04:26 ERR Network Name <Cluster Name>: Unable to open handle to cluster
2009/01/15-04:26 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.
Cluster Log for ES3120-X64 Host (Host on Surviving Site)

2009/01/15-04:26 INFO [ClMsg] Received interface unreachable event for node 1 network
2009/01/15-04:26 INFO [ClMsg] Received interface unreachable event for node 1 network
2009/01/15-04:26 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed
2009/01/15-04:26 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.

Cluster Quorum Owner Not on Site with Failed SAN Switch

Problem Description

The following points explain the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

If a SAN failure occurs and the cluster nodes do not own the quorum resource, the state of the cluster services on these nodes is not affected. The cluster nodes remain as active cluster members; however, the data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart, and then eventually fail over to the surviving site.

Symptoms

The following symptoms might help you identify this failure:

Applications fail and attempt to restart.

The data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart, and then eventually fail over to the surviving site.

The cluster nodes remain as active cluster members.

The management console displays error indications similar to those in Figure 6-9.

Warnings and informational messages similar to those shown in Figure 6-11 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-11. Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site

The following table explains the numbered messages shown in Figure 6-11.

The system is pausing data transfer on the failure site.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for ES3120-X64 Host (Host on Failure Site)

1:36:07 AM ClusDisk Error None 1209 N/A USMV-X460 Cluster service is requesting a bus reset for device \Device\ClusDisk0.

System Event Log for USMV-X460 Host (Host on Surviving Site)

1/14/ :36:46 PM ClusSvc Information Node Mgr 1201 N/A USMV-X460 The Cluster Service brought the Resource Group "Cluster Group" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for ES3120-X64 Host (Host on Failure Site)

2009/01/15-06:25 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed
2009/01/15-06:25 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12)
2009/01/15-06:36 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.

Cluster Log for USMV-X460 Host (Host on Surviving Site)

2009/01/15-06:24 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status
2009/01/15-06:36 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed
2009/01/15-06:36 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.


Section 7. Solving Network Problems

This section lists symptoms that usually indicate networking problems. Table 7-1 lists symptoms and the possible problems indicated by each symptom. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive messages or SNMP traps for possible problems, and similar messages are displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 7-1. Possible Networking Problems with Symptoms

Possible problem: Public NIC failure on a cluster node in a geographic clustered environment. Symptoms:
The cluster groups with the failed network connection fail over to the next preferred node. If only one node is configured at the site with the failure, the replication direction changes and applications run on the backup site. If the NIC is teamed, no failover occurs and no symptoms are obvious.
The networks on the Cluster Administrator screen show an error.
Host system and application event log messages contain error or warning messages.

Possible problem: Public or client WAN failure in a geographic clustered environment. Symptoms:
Clients on site 2 are not able to access resources associated with the IP resource located on site 1.
Public communication between the two sites fails, allowing only local public communication between cluster nodes and local clients.
The networks on the Cluster Administrator screen show an error.

Table 7-1. Possible Networking Problems with Symptoms (continued)

Possible problem: Management network failure in a geographic clustered environment. Symptom:
You cannot access the management console or initiate an SSH session through PuTTY using the management IP address of the remote site.

Possible problem: Replication network failure in a geographic clustered environment. Symptoms:
The management console log indicates that the WAN data links to the RAs are down.
All consistency groups show the transfer status as Paused by system.

Possible problem: Temporary WAN failures. Symptoms:
On the management console, all consistency groups show the transfer status switching between Paused by system and initializing/active.
All groups appear unstable over the WAN connection.

Possible problem: Private cluster network failure in a geographic clustered environment. Symptom:
The networks on the Cluster Administrator screen show an error.

Possible problem: Total communication failure in a geographic clustered environment. Symptoms:
You cannot access the management console using the management IP address of the remote site.
The cluster is no longer accessible from nodes except from one surviving node.
Unable to reach the DNS server.
Unable to communicate with the NTP server.
Unable to reach the mail server.
The management console shows errors for the WAN or for RA data links.
The management console logs show RA communication errors.

Possible problem: Port information.

Public NIC Failure on a Cluster Node in a Geographic Clustered Environment

Problem Description

If a public network interface card (NIC) of a cluster node fails, that cluster node cannot access clients. The node can still participate in the cluster as a member because it can communicate over the private cluster network. Other cluster nodes are not affected by this error.

The MSCS software detects a failed network, and the cluster resources fail over to the next preferred node. All cluster groups used for replication that contain a virtual IP address for the failed network connection successfully fail over to the next preferred node. However, the Unisys SafeGuard 30m Control resources cannot fail back to the node with a failed public network because they cannot communicate with the site management IP address of the RAs.

Note: A teamed public network interface does not experience this problem and therefore is the recommended configuration.

Figure 7-1 illustrates this failure.

Figure 7-1. Public NIC Failure of a Cluster Node

Symptoms

The following symptoms might help you identify this failure:

All cluster groups used for replication that contain a virtual IP address for the failed network connection fail over to the next preferred node. If no other node exists at the same site, the replication direction changes and the applications run at the backup site.

If you review the host system event log, you can find messages similar to the following examples:

Windows System Event Log Messages on Host Server

Type: Error
Source: ClusSvc
Event ID: 1077
Description: The TCP/IP interface for Cluster IP Address xxx has failed.

Type: Error
Source: ClusSvc
Event ID: 1069
Description: Cluster resource xxx in Resource Group xxx failed.

Type: Error
Source: ClusSvc
Event ID: 1127
Description: The interface for cluster node xxx on network xxx failed. If the condition persists, check the cabling connecting the node to the network. Next, check for hardware or software errors in the node's network adapter.

If you attempt to move a cluster group to the node with the failing public NIC, the event 2002 message is displayed in the host application event log.

Application Event Log Message on Host Server

Type: Warning
Source: 30mControl
Event Category: None
Event ID: 2002
Date: 12/17/2008
Time: 16:16:36
User: N/A
Computer: USMV-WEST2
Description: Online resource failed. RA CLI command failed because of a network communication error or invalid IP address. Action: Verify the network connection between the system and the site management IP address specified for the resource. Ping each site management IP address specified for the specified resource.

Note: The preceding information can also be viewed in the cluster log.

The management console display and management console logs do not show any errors.

When the public NIC fails on a node that does not use teaming, the Cluster Administrator displays an error indicator similar to Figure 7-2. If the public NIC interface is teamed, you do not see error messages in the Cluster Administrator.

Figure 7-2. Public NIC Error Shown in the Cluster Administrator

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, verify that the public interface for all nodes is in an Up state. If multiple nodes at a site show public connections failed in the Cluster Administrator, physically check the network switch for connection errors. If the private network also shows errors, physically check the network switch for connection errors.

2. Inspect the NIC link indicators on the host and, from a client, use the Ping command to verify the physical IP address of the adapter (not the virtual IP address).

3. Isolate a NIC or cabling issue by moving cables at the network switch and at the NIC.

4. Replace the NIC in the host if necessary. No configuration of the replaced NIC is necessary.

5. Move the cluster resources back to the original node after the resolution of the failure.
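The ping verification in step 2 can be scripted from a client. The sketch below is illustrative only; it builds the platform-appropriate ping argument list (the address shown is a placeholder for the adapter's physical IP):

```python
# Illustrative sketch: build and run a ping against the adapter's
# physical (not virtual) IP address. ping_command is a pure helper so
# the flag selection (-n on Windows, -c elsewhere) can be checked
# without actually sending packets.
import platform
import subprocess

def ping_command(ip, count=2, system=None):
    """Build a ping argument list for the given (or current) platform."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), ip]

def host_responds(ip, count=2):
    """Return True if the host answers the ping (exit status 0)."""
    result = subprocess.run(ping_command(ip, count),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

# Placeholder address; substitute the physical IP of the suspect NIC.
print(ping_command("192.168.10.21", system="Windows"))
```

If the physical IP does not respond but the link indicators are lit, suspect cabling or the switch port (step 3) before replacing the NIC (step 4).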

Public or Client WAN Failure in a Geographic Clustered Environment

Problem Description

When the public or client WAN fails, some clients cannot access virtual IP networks that are associated with the cluster. The WAN components involved in this failure might be two switches, possibly on different subnets using gateways. This failure results from connectivity issues. The MSCS cluster would detect and fail the associated node if the failure resulted from an adapter failure or a media failure to the adapter. Instead, cluster groups do not fail, and the public LAN shows as unreachable for this failure mode.

Public communication between the two sites fails, allowing only local public communication between cluster nodes and local clients. The cluster node state does not change on either site because all cluster nodes are able to communicate over the private cluster network. All resources remain online, and no cluster group errors are reported in the Cluster Administrator. Clients on the remote site cannot access resources associated with the IP resource located on the local site until the public or client network is again operational.

Depending on the cause of the failure and the network configuration, the SafeGuard 30m Control might fail to move a cluster group because the management network might be the same physical network as the public network. Whether this failure to move the group occurs depends on how the RAs are physically wired to the network.

Figure 7-3 illustrates this scenario.

Figure 7-3. Public or Client WAN Failure

Symptoms

The following symptoms might help you identify this failure:

Clients on site 2 are not able to access resources associated with the IP resource located on site 1.

Public communication between the two sites displays as unreachable, allowing only local public communication between cluster nodes and local clients.

When the public cluster network fails, the Cluster Administrator displays an error indicator similar to Figure 7-4. All private network connections show as unreachable when the problem is a WAN issue. If only two of the connections show as failed (and the nodes are physically located at the same site), the issue is probably local to the site. If only one connection failed, the issue is probably a host network adapter.

Figure 7-4. Cluster Administrator Showing Public LAN Network Error

If you review the system event log, messages similar to the following examples are displayed:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008
Time: 16:25:36
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on network 'Public LAN'.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1126
Date: 12/17/2008
Time: 16:25:36
User: N/A
Computer: USMV-WEST2
Description: The interface for cluster node 'USMV-WEST2' on network 'Public LAN' is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node USMV-WEST2. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check

for failures in any other network components to which the node is connected, such as hubs, switches, or bridges.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1130
Date: 12/17/2008
Time: 16:25:36
User: N/A
Computer: USMV-WEST2
Description: Cluster network 'Public LAN' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected, such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.

A cluster group containing a SafeGuard 30m Control resource might fail to move to another node when the management network has network components common to the public network. (Refer to "Management Network Failure in a Geographic Clustered Environment.") Symptoms might include those described in that topic when these networks are physically the same network. Refer to that topic if the clients at one site are not able to access the IP resources at another site.

The management console logs might display the following messages when this connection fails and is then restored:

Event ID 3023: For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA <RA>)
Event ID 3022: When the LAN link is restored, the console log displays: LAN link to RA restored. (RA <RA>)

114 Solving Network Problems

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, view the network properties of the public and private network. The private network should be operational with no failure indications. The public network should display errors. Refer to the previous symptoms to identify that this is a WAN issue. If the error is limited to one host, the problem might be a host network adapter. See "Cluster Node Public NIC Failure in a Geographic Clustered Environment."
2. Check for network problems using a method such as isolating the failure to the network switch or gateway by pinging from the cluster node to the gateway at each site.
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.)
a. Log on to an RA with user ID boxmgmt and password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the gateway that you want to test.
g. Repeat steps a through f for each RA.
4. Isolate the site by determining which gateway or network switch failed. Use standard network methods such as pinging to make the determination.
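The gateway checks in steps 2 and 4 can be scripted from any host with ping access. This is a minimal sketch, assuming a Linux iputils ping; the two gateway addresses are placeholders to replace with your own site gateways.

```shell
#!/bin/sh
# Placeholder gateway addresses -- substitute the real gateway at each site.
WEST_GATEWAY="192.0.2.1"
EAST_GATEWAY="198.51.100.1"

# Succeeds only if the target answers one ping within 2 seconds.
check_host() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

for gw in "$WEST_GATEWAY" "$EAST_GATEWAY"; do
    if check_host "$gw"; then
        echo "gateway $gw: reachable"
    else
        echo "gateway $gw: NOT reachable -- suspect the switch or gateway at that site"
    fi
done
```

Run the script from a cluster node at each site: a gateway that is reachable locally but not from the peer site points at the WAN link rather than the site LAN.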

115 Solving Network Problems Management Network Failure in a Geographic Clustered Environment Problem Description When the management network fails in a geographic clustered environment, you cannot access the management console for the affected site. The replication environment is not affected. If you try to move a cluster group to the site with the failed management network, the move fails. Figure 7 5 illustrates this scenario. Symptoms Figure 7 5. Management Network Failure The following symptoms might help you identify this failure: The indicators for the onboard management network adapter of the RA are not illuminated. Network switch port lights show that no link exists with the host adapter

116 Solving Network Problems

You cannot access the management console or initiate an SSH session through PuTTY using the management IP address of the failed site from the remote site. You can access the management console from a client local to the site. If you cannot access the management IP address from either site, see Section 8, Solving Replication Appliance (RA) Problems.

A cluster move operation to the site with the failed management network might fail. The event ID 2002 message is displayed in the host application event log.

Application Event Log Message on Host Server

Type: Warning
Source: 30mControl
Event Category: None
Event ID: 2002
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Online resource failed. RA CLI command failed because of a network communication error or invalid IP address.
Action: Verify the network connection between the system and the site management IP Address specified for the resource. Ping each site management IP Address mentioned for the specified resource.

Note: The preceding information can also be viewed in the cluster log.

If the management console was open with the IP address of the failed site, the message "Connection with RA was lost, please check RA and network settings" is displayed. The management console display shows not connected, and the components have a question mark (Unknown) status as illustrated in Figure 7-6.

117 Solving Network Problems

Figure 7-6. Management Console Display: Not Connected

The management console log displays a message for event 3023 as shown in Figure 7-7.

118 Solving Network Problems

Figure 7-7. Management Console Message for Event 3023

The management console log messages might appear as in the following table.

Event ID 3023 (Immediate: X): For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA <RA>)
Event ID 3022 (Immediate: X): When the LAN link is restored, a management console log displays: LAN link to RA restored. (RA <RA>)

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. Ping from the cluster node to the RA box management IP address at the same site. Repeat this action for the other site. If the local connections are working at both sites, the problem is with the WAN connection, such as a network switch or gateway connection.
2. If one site from step 1 fails, ping from the cluster node to the gateway of that site. If the ping completes, then proceed to step 3.

119 Solving Network Problems 3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.) a. Log in to an RA as user boxmgmt with the password boxmgmt. b. On the Main Menu, type 3 (Diagnostics) and press Enter. c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter. d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter. e. When asked to select a target for the tests, type 5 (Other host) and press Enter. f. Enter the IP address for the gateway that you want to test. g. Repeat steps a through f for each RA. 4. Isolate the site by determining which gateway failed. Use standard network methods such as pinging to make the determination. Replication Network Failure in a Geographic Clustered Environment Problem Description This type of event occurs when the RA cannot replicate data to the remote site because of a replication network (WAN) failure. Because this error is transparent to MSCS and the cluster nodes, cluster resources and nodes are not affected. Each cluster node continues to run, and data transactions sent to their local cluster disk are completed

120 Solving Network Problems

Figure 7-8 illustrates this failure.

Figure 7-8. Replication Network Failure

The RA cannot replicate data while the WAN is down. During this failure, the RA keeps a record of data written to local storage. Once the WAN is restored, the RA updates the replication volumes on the remote site.

During the replication network failure, the RAs prevent the quorum and data resources from failing over to the remote site. This behavior differs from a total communication failure or a total site failure, in which the data groups are allowed to fail over. The quorum group is never allowed to fail over automatically when the RAs cannot communicate over the WAN.

Notes:

If the management network has also failed, see "Total Communication Failure in a Geographic Clustered Environment" later in this section.

If all RAs at a site have failed, see "Failure of All RAs at One Site" in Section 8.

If the administrator issues a move-group operation from the Cluster Administrator for a data or quorum group, the cluster accepts failover only to another node within the same site. Group failover to the remote site is not allowed, and the resource group fails back to a node on the source site.

121 Solving Network Problems

Symptoms

Although automatic failover is not allowed, the administrator can perform a manual failover to the remote site. Performing a manual failover results in a loss of data. The administrator chooses an available image for the failover. Important considerations for this type of failure are as follows:

This type of failure does not have an immediate effect on the cluster service or the cluster nodes.

The quorum group cannot fail over to the remote site and goes back online at the source site. Only local failovers are permitted. Remote failovers require that the administrator perform the manual failover process.

The SafeGuard 30m Control resource and the data consistency groups cannot fail over to the remote site while the WAN is down; they go back online at the source site.

Only one site has up-to-date data. Replication does not occur until the WAN is restored. If the administrator manually chooses to use remote data instead of the source data, data loss occurs.

Once the WAN is restored, normal operation continues; however, the groups might initiate a long resynchronization.

The following symptoms might help you identify this failure:

The management console display shows errors similar to the image in Figure 7-9. This image shows the dialog box displayed after clicking the red Errors in the right column. The More Info message box is displayed with messages similar to those in the figure but appropriate for your site. If only one RA is down, see Section 8 for resolution actions. Notice in the figure that all RA data links at the site are down.

Figure 7-9. Management Console Display: WAN Down

122 Solving Network Problems

This figure also shows the Groups tab and the messages that the data consistency groups and the quorum group are Paused by system. If the groups are not paused by the system, a switchover might have occurred. See Section 8 for more information. If all groups are not paused, see Section 5, Solving Storage Problems.

Warnings and informational messages similar to those shown in Figure 7-10 appear on the management console when the WAN is down. See the table after the figure for an explanation of the numbered console messages.

Figure 7-10. Management Console Log Messages: WAN Down

The following table explains the numbers in Figure 7-10. You might also see the events in the table denoted by an asterisk (*) in the management console log.

* 3001: The RA is currently experiencing a problem communicating with its cluster. The details explain that an event 3000 means that the RA functionality will be restored. X
* 3000: The RA is successfully communicating with its cluster. In this case, the RA communicates by means of the management link. X
For each consistency group on the EAST2 and the WEST2 sites, the transfer is paused. X
For each quorum group on the EAST2 and the WEST2 sites, the transfer is paused. X

123 Solving Network Problems

* 4043: For each group on the EAST2 and WEST2 sites, a "group site is deactivated" message might appear with the detail showing the reason for the switchover. The RA attempts to switch over to resolve the problem. The event is repeated after the switchover attempt. X X

If you review the management console RAs tab, the data link column lists errors for all RAs, as shown in Figure 7-11. The data link is the replication link between peer RAs. Notice that the WAN link shows OK because the RAs can still communicate over the management link. There is no column for the management link.

Figure 7-11. Management Console RAs Tab: All RAs Data Link Down

If you review the host application event log, no messages appear for this failure unless a data resource move-group operation is attempted. If this move-group operation is attempted, then messages similar to the following are listed:

Application event log

Event Type: Warning
Event Source: 30mControl
Event Category: None
Event ID: 1119
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Online resource failed. Cannot complete transfer for auto failover (7). The following could cause this error: 1. WAN is down. 2. Long resynchronization might be in progress. The resource might have to be brought online manually. RA Version: 3.1(K.87) Resource name: Data1

124 Solving Network Problems RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch initiate_failover group=data1 active_site=west cluster_owner=usmv-west2 If you review the system event log, a message similar to the following example is displayed: System Event Log Event Type : Error Event Source : ClusSvc Event Category: Event ID : 1069 Failover Mgr Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description : Cluster resource 'Data1' in Resource Group 'Group 0' failed. Note: Data1 would change to the Quorum drive if the quorum was moved. If you review the cluster log, you can see an error if a data or a quorum move-group operation is attempted. Messages similar to the following are listed: Cluster Log for the Node to which the Move Was Attempted Key messages 00000e c::2008/12/16-22:39: INFO [RGP] Node 2: RGP Incoming pkt: 0x3fff, 0x1, 0x3, 0x e c::2008/12/16-22:39: INFO [RGP] Node 2: RGP recv pkt : 0x10003, 0x , 0x , 0x b6c c0::2008/12/16-22:39: INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info from the disk to insure disk is accessible b6c c0::2008/12/16-22:39: INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature 4a b6c c0::2008/12/16-22:39: ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status b6c c0::2008/12/16-22:39: ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error b6c c0::2008/12/16-22:39: INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status e fd0::2008/12/16-22:39: INFO [MM] MmSetQuorumOwner(0,0), old owner e fd0::2008/12/16-22:39: WARN [MM] MmSetQuorumOwner: regroup is in progress, forcing the new value in e fd0::2008/12/16-22:39: ERR [FM] Failed to arbitrate quorum resource 6fbf7ffc- 8c a6cde32e, error e fd0::2008/12/16-22:39: INFO [NM] We do not own the quorum resource, status

125 Solving Network Problems Cluster Log for the Node to which the Data Group Move Was Attempted ::2008/10/04-04:41: INFO [GUM] GumSendUpdate: completed update seq type 0 context ::2008/10/04-04:41: INFO [FM] FmpPropagateResourceState: resource 4b59c5d6-5c66-4e8f f6d62f999 offline event ::2008/10/04-04:41: INFO [FM] RmTerminateResource: 4b59c5d6-5c66-4e8f f6d62f999 is now offline cc dc::2008/10/04-04:41: INFO Unisys SafeGuard 30m Control <Data2>: KfResourceTerminate: Resource 'Data2' terminated. AbortOnline=1 CancelConnect=0 terminateprocess= ::2008/10/04-04:41: INFO [CP] CppResourceNotify for resource Data ::2008/10/04-04:41: INFO [FM] RmTerminateResource: a0-48ecb9b1-c8b205038ed4 is now offline ::2008/10/04-04:41: INFO [FM] RestartResourceTree, Restart resource a0-48ec-b9b1-c8b205038ed ::2008/10/04-04:41: INFO [FM] FmpRmOnlineResource: bringing resource a0-48ec-b9b1-c8b205038ed4 (resid ) online ::2008/10/04-04:41: INFO [CP] CppResourceNotify for resource Data ::2008/10/04-04:41: INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwquoblockingresources for resource a0-48ec-b9b1- c8b205038ed cc.00000eac::2008/10/04-04:41: INFO Unisys SafeGuard 30m Control <Data2>: KfResourceOnline: 'Data2' going online. PendingTimeout= cc.00000eac::2008/10/04-04:41: INFO Unisys SafeGuard 30m Control <Data2>: KfGetLocalSiteInfo: FirstSiteName = 'WEST', FirstSiteIP = ' ', SecondSiteName = 'EAST', SecondSiteIP = ' ' cc.00000eac::2008/10/04-04:41: INFO Unisys SafeGuard 30m Control <Data2>: KfResourceOnline: SiteIP = ' '. SiteName = EAST. Status =!u! d3c::2008/10/04-04:41: ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: FAILED to run command 'UCLI -ssh -l plugin -pw **** -superbatch initiate_failover group='data1' active_site='east' cluster_owner=usmv-east2'. 
Return code: (6) d3c::2008/10/04-04:41: ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online d3c::2008/10/04-04:41: INFO [RM] RmpSetResourceStatus, Posting state 4 notification for resource <Data1> ::2008/10/04-04:41: INFO [FM] NotifyCallBackRoutine: enqueuing event Actions to Resolve the Problem Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure. Perform the following actions to isolate and resolve the problem: 1. On the management console, observe that a WAN error occurred for all RAs and that the data link is in error for all RAs. If that is not the case, see Section 8 for resolution actions. 2. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.)

126 Solving Network Problems

a. Log in to an RA as user boxmgmt with the password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the gateway that you want to test.
g. Repeat steps a through f for each RA.
3. Isolate the site by determining which network switch or gateway failed. Use standard network methods such as pinging to make the determination.
4. In some cases, the WAN connection might appear to be down because a firewall is blocking ports. See "Port Information" later in this section.
5. If all RAs at both sites can connect to the gateway, the problem is related to the link. In this case, check the connectivity between subnets by pinging between machines on the same subnet (not RAs) and between a non-RA machine at one site and an RA at the other site.
6. Verify that no routing problems exist between the sites.
7. Optionally, follow the recovery actions to manually move cluster and data resource groups to the other site if necessary. This action results in a loss of data. Do not attempt this manual recovery unless the WAN failure has affected applications. If you choose to manually move groups, refer to Section 4 for the procedures.

Once you observe on the management console that the WAN error is gone, verify that the consistency groups are resynchronizing. If a move-group operation is issued to the other site while the group is resynchronizing, the command fails with a return code 7 (long resync in progress) and moves back to the original node.

Temporary WAN Failures

Problem Description

Symptoms

All applications are unaffected.

The target image is not up-to-date.
On the management console, messages showing the transfer between sites switch between the paused by system and initializing/active. All groups appear unstable over the WAN connection. Actions to Resolve the Problem Perform the following actions to isolate and resolve this problem: 1. If the connection problem is temporary but reoccurs, check for a problematic network such as a high percentage of packet loss because of bad network

127 Solving Network Problems

connections, insufficient bandwidth that is causing an overloaded network, and so on.

2. Verify that the bandwidth allocated to this link is reasonable and that no unreasonable external or internal (consistency group bandwidth policy) limits are causing an overloaded network.

Private Cluster Network Failure in a Geographic Clustered Environment

Problem Description

When the private cluster network fails, the cluster nodes are able to communicate over the public cluster network if the cluster public address is set for all communication. No cluster resources fail over, and current processing on the cluster nodes continues. Clients do not experience any impact from this failure.

128 Solving Network Problems

Figure 7-12 illustrates this scenario.

Figure 7-12. Private Cluster Network Failure

Symptoms

Unisys recommends that the public cluster network be set for All communications and the private cluster LAN be set for internal cluster communications only. You can verify these settings in the Networks properties section within Cluster Administrator. See "Checking the Cluster Setup" in Section 4.

If the public cluster network was not set for All communications but instead was set for Client access only, the following symptoms occur:

All nodes except the node that owned the quorum stop MSCS. This action is completed to prevent a split-brain situation.

All resources move to the surviving node.

The following symptoms might help you identify this failure:

When the private cluster network fails, the Cluster Administrator displays an error indicator similar to Figure 7-13. All private network connections show a status of Unknown when the problem is a WAN issue.

129 Solving Network Problems

If only two of the connections failed (and the nodes are physically located at the same site), the issue is probably local to the site. If only one connection failed, the issue is probably a host network adapter.

Figure 7-13. Cluster Administrator Display with Failures

On the cluster nodes at both sites, the system event log contains entries from the cluster service similar to the following:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on network 'Private'.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1126
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A

130 Solving Network Problems

Computer: USMV-WEST2
Description: The interface for cluster node 'USMV-EAST2' on network 'Private' is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node USMV-EAST2. If the condition persists, check the cable connecting the node to the network. Then, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1130
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Cluster network 'Private' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, view the network properties of the public and private network. The public network should be operational with no failure indications. The private network should display errors. Refer to the previous symptoms to identify that this is a WAN issue. If the error is limited to one host, the problem might be a host network adapter. See "Public NIC Failure on a Cluster Node in a Geographic Clustered Environment" for action to resolve a host network problem.
2. Check for network problems using methods such as isolating the failure to the network switch or gateway with the problem.

131 Solving Network Problems

Total Communication Failure in a Geographic Clustered Environment

Problem Description

A total communication failure implies that the cluster nodes and RAs are no longer able to communicate with each other over the public and private network interfaces. Figure 7-14 illustrates this failure.

Figure 7-14. Total Communication Failure

When this failure occurs, the cluster nodes on both sites detect that the cluster heartbeat has been broken. After six missed heartbeats, the cluster nodes go into a regroup process to determine which node takes ownership of all cluster resources. This process consists of checking network interface states and then arbitrating for the quorum device. During the network interface detection phase, all nodes perform a network interface check to determine that the node is communicating through at least one network interface dedicated for client access, assuming the network interface is set for All communications or Client access only. If this process determines that the node is not communicating through any viable network, the cluster node voluntarily stops cluster service and drops out of the quorum arbitration process. The remaining nodes then attempt to arbitrate for the quorum device.

132 Solving Network Problems

Symptoms

Quorum arbitration succeeds on the site that originally owned the quorum consistency group and fails on the nodes that did not own the quorum consistency group. Cluster service then shuts itself down on the nodes where quorum arbitration fails. In Microsoft Windows 2000 environments, MSCS does not check for network interface availability during the regroup process and starts the quorum arbitration process immediately after a regroup process is initiated, that is, after six missed heartbeats.

Once the cluster has determined which nodes are allowed to remain active in the cluster, the cluster node attempts to bring online all data groups previously owned by the other cluster nodes. The SafeGuard 30m Control resource and its associated dependent resources will come online. During this total communication failure, replication is Paused by system. An extended outage requires a full volume sweep. Refer to Section 4 for more information.

The following symptoms might help you identify this failure:

The management console shows a WAN error; all groups are paused. The other site shows a status of Unknown. Figure 7-15 illustrates one site.

Figure 7-15. Management Console Display Showing WAN Error

133 Solving Network Problems

The RAs tab on the management console lists errors as shown in Figure 7-16.

Figure 7-16. RAs Tab for Total Communication Failure

Warnings and informational messages similar to those shown in Figure 7-17 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 7-17. Management Console Messages for Total Communication Failure

134 Solving Network Problems

The following table explains the numbered messages in Figure 7-17.

For each consistency group, a group capabilities minor problem is reported. The details indicate that a WAN problem is suspected on both RAs. X
For each consistency group on the West and the East sites, the transfer is paused. The details indicate a WAN problem is suspected. X
For each RA at each site, the following error message is reported: Error in WAN link to RA at other site (RA x) X
The following message is displayed: User action succeeded. The details indicate that a failover was initiated. This message appears when the groups are moved by the SafeGuard Control resource to the surviving cluster node. X

All cluster resources appear online after successfully failing over to the surviving node. The cluster service stops on all nodes except the surviving node. From the surviving node, the host system event log has entries similar to the following:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on Public network.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008

135 Solving Network Problems

Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on Private network.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1135
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Cluster node USMV-EAST2 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.

Event Type: Information
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1200
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The Cluster Service is attempting to bring online the Resource Group "Group 1".

Event Type: Information
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1201
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The Cluster Service brought the Resource Group "Group 1" online.

136 Solving Network Problems

From the surviving node, the private and public network connections show an exclamation mark (Unknown) status as shown in Figures 7-18 and 7-19.

Figure 7-18. Cluster Administrator Showing Private Network Down

Figure 7-19. Cluster Administrator Showing Public Network Down

137 Solving Network Problems Actions to Resolve the Problem Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure. Perform the following actions to isolate and resolve the problem: 1. When you observe on the management console that a WAN error occurred on site 1 and on site 2, call the other site to verify that each management console is available and shows a WAN down because of the failure. If only one site can access the management console, the problem is probably not a total WAN failure but rather a management network failure. In that case, see Management Network Failure in a Geographic Clustered Environment. 2. In the Cluster Administrator, verify that only one node is active in the cluster. 3. View the network properties of the public and private network. The display should show an Unknown status for the private and public network. 4. Check for network problems using methods such as isolating the failure to the network switch or gateway by pinging from the cluster node to the gateway at each site. Port Information Problem Description Symptoms Communications problems might occur because of firewall settings that prevent all necessary communication. The following symptoms might help you identify this problem: Unable to reach the DNS server. Unable to communicate to the NTP server. Unable to reach the mail server. The RAs tab shows RA data link errors. The management console shows errors for the WAN. The management console logs show RA communications errors

138 Solving Network Problems

Actions to Resolve

Perform the port diagnostics from each of the RAs by following the steps given in Appendix C. The following tables provide port information that you can use in troubleshooting the status of connections.

Table 7-2. Ports for Internet Communication

Port Numbers | Protocol or Protocols | Unisys Product Support IP Address
21 | FTP; used for remote maintenance (TCP) |

The following tables list ports used for communication other than Internet communication.

Table 7-3. Ports for Management LAN Communication and Notification

Port Numbers | Protocol or Protocols
21 | Default FTP port (needed for collecting system information)
22 | Default SSH and communications between RAs
25 | Default outgoing mail (SMTP); used when alerts from the RA are configured
80 | Web server for management (TCP)
123 | Default NTP port
161 | Default SNMP port
443 | Secure Web server for management (TCP)
514 | Syslog (UDP)
1097 | RMI (TCP)
1099 | RMI (TCP)
4401 | RMI (TCP)
4405 | Host-to-RA kutils communications (SQL commands) and KVSS (TCP)
7777 | Automatic host information collection
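Individual TCP ports from Table 7-3 can also be probed from a host on the management LAN. The following is a minimal sketch using bash's /dev/tcp redirection; the RA address is a placeholder, and note that from this side a firewalled port is indistinguishable from one that is simply closed.

```shell
#!/bin/sh
# Placeholder address -- substitute the management IP of an RA.
RA_MGMT_IP="192.0.2.10"

# Succeeds only when a TCP connection to host $1, port $2 completes.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# A few management-LAN ports from Table 7-3: SSH, HTTP, HTTPS, kutils/KVSS.
for port in 22 80 443 4405; do
    if port_open "$RA_MGMT_IP" "$port"; then
        echo "port $port: open"
    else
        echo "port $port: blocked, closed, or host unreachable"
    fi
done
```

Repeat the probe from both sides of any firewall to narrow down where the connection is being dropped.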

139 Solving Network Problems

The ports listed in Table 7-4 are used for both the management LAN and WAN.

Table 7-4. Ports for RA-to-RA Internal Communication

Port Numbers | Protocol or Protocols
23 | telnet
123 | NTP (UDP)
1097 | RMI (TCP)
1099 | RMI (TCP)
4444 | TCP
5001 | TCP (default iperf port for performance measuring between RAs)
5010 | Management server (UDP, TCP)
5020 | Control (UDP, TCP)
5030 | RMI (TCP)
5040 | Replication (UDP, TCP)
5060 | Mpi_perf (TCP)
5080 | Connectivity diagnostics tool
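Port 5001 in Table 7-4 is the default port for iperf, which can give a rough throughput figure for the link between sites. The following is a hedged sketch, assuming an iperf client is available on the local host and a listener (iperf -s) is already running on the peer; the peer address is a placeholder.

```shell
#!/bin/sh
# Placeholder -- substitute the WAN address of an RA at the remote site.
REMOTE_RA="203.0.113.20"

measure_link() {
    if command -v iperf >/dev/null 2>&1; then
        # 10-second TCP throughput test on the default port 5001; the
        # timeout guard keeps the probe bounded if the WAN is down.
        timeout 20 iperf -c "$1" -p 5001 -t 10 || echo "could not reach $1:5001"
    else
        echo "iperf is not installed on this host"
    fi
}

measure_link "$REMOTE_RA"
```

Compare the measured rate against the bandwidth allocated to the replication link; a large gap suggests an overloaded or lossy network of the kind described under "Temporary WAN Failures."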


Section 8. Solving Replication Appliance (RA) Problems

This section lists symptoms that usually indicate problems with one or more Unisys SafeGuard 30m replication appliances (RAs). The problems include hardware failures. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

For problems relating to RAs, gather the RA logs and ask the following questions:

- Are any errors displayed on the management console?
- Is the issue constant? Is the issue a one-time occurrence? Does the issue occur at intervals?
- What are the states of the consistency groups?
- What is the timeframe in which the problem occurred? When was the first occurrence of the problem?
- What actions were taken as a result of the problem or issue?
- Were any recent changes made in the replication environment? If so, what?

Table 8-1 lists symptoms and possible causes for the failure of a single RA on one site with a switchover as a symptom. Table 8-2 lists symptoms and possible causes for the failure of a single RA on one site without switchover symptoms. Table 8-3 lists symptoms and other possible problems regarding multiple RA failures. Each problem and the actions to resolve it are described in this section.

In addition to the symptoms listed, you might receive messages or SNMP traps for possible problems. Also, messages similar to notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 8-1. Possible Problems for Single RA Failure with a Switchover

Symptoms: The management console shows RA failure.
Possible Problem: Single RA failure

Possible Contributing Causes to Single RA Failure with a Switchover

Reboot regulation failover:
- The system frequently pauses transfer for all consistency groups.
- If you log in to the failed RA as the boxmgmt user, a message is displayed explaining that the reboot regulation limit has been exceeded.
- The management console shows repeated events that report an RA is up followed by an RA is down.

Failure of all SAN Fibre Channel HBAs on one RA:
- The link indicator lights on all host bus adapters (HBAs) are not illuminated.
- The port indicator lights on the Fibre Channel switch no longer show a link to the RA.
- Port errors occur, or there is no target, when running the SAN diagnostics.
- The management console shows RA failure with details pointing to a problem with the repository volume.

Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter):
- The link indicator lights on the HBA or HBAs are not illuminated.
- The port indicator lights on the network switch or hub no longer show a link to the RA.

Table 8-2. Possible Problems for Single RA Failure Without a Switchover

Onboard management network adapter failure:
- The link indicator lights on the onboard management network adapter are not illuminated.

Single hard-disk failure:
- The failure light for the hard disk indicates a failure.
- An error message that appears during a boot operation indicates failure of one of the internal disks.

Port failure of a single SAN Fibre Channel HBA on one RA:
- The link indicator lights on the HBA are not illuminated.
- The port indicator lights on the Fibre Channel switch no longer show a link to the RA.
- For one of the ports on the relevant RA, errors appear when running the SAN diagnostics.

Table 8-3. Possible Problems for Multiple RA Failures with Symptoms

Failure of all RAs on one site:
- Replication has stopped on all groups.
- MSCS fails over groups to the other site, or MSCS fails on all nodes.
- The management console displays a WAN error to the other site.

All RAs on one site are not attached:
- Replication has stopped on all groups.
- MSCS fails over groups to the other site, or MSCS fails on all nodes.
- The management console displays a WAN error to the other site.

Single RA Failures

Problem Description

When an RA fails, a switchover might occur. In some cases, a switchover does not occur. See "Single RA Failures With Switchover" and "Single RA Failures Without Switchover."

Understanding Management Console Access

If the RA that failed had been running site control (that is, the RA owned the virtual management IP network) and a switchover occurs, the virtual IP address moves to the new RA. If you attempt to connect to the management console using one of the static management IP addresses of the RAs, a connection error occurs if the RA does not have site control. Thus, you should use the site management IP address to connect to the management console. At least one RA (either RA 1 or RA 2) must be attached to the RA cluster for the management console to function.

If the RA that failed was running site control and a switchover does not occur (such as with an onboard management network connection failure), the management console might not be accessible. Also, attempts to log in using PuTTY fail if you use the boxmgmt log-in account. When an RA does not have site control, you can always log in using PuTTY and the boxmgmt log-in account.

You cannot determine which RA owns site control unless the management console is accessible. The site control RA is designated at the bottom of the display.

Another situation in which you cannot log in to the management console is when the user account has been locked. In this case, follow these steps:

1. Log in interactively using PuTTY with another unlocked user account.
2. Enter unlock_user.
3. Determine whether any users are listed, and follow the messages to unlock the locked user accounts.

Figure 8-1 illustrates a single RA failure.

Figure 8-1. Single RA Failure

Single RA Failure with Switchover

In this case, a single RA fails, and there is an automatic switchover to a surviving RA on the same site. Any groups that had been running on the failed RA run on a surviving RA at the same site.

Each RA handles the replicating activities of the consistency groups for which it is designated as the preferred RA. The consistency groups that are affected are those that were configured with the failed RA as the preferred RA. Thus, whenever an RA becomes inoperable, the handling of the consistency groups for that RA switches over automatically to the functioning RAs in the same RA cluster.

During the RA switchover process, the server applications do not experience any I/O failures. In a geographic clustered environment, MSCS is not aware of the RA failure, and all application and replication operations continue to function normally. However, performance might be affected because the I/O load on the surviving RAs is now increased.

Symptoms

Failures of an RA that cause a switchover are as follows:

- RA hardware issues (such as memory, motherboard, and so forth)
- Reboot regulation failover
- Failure of all SAN Fibre Channel HBAs on one RA
- Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter)

The following symptoms might help you identify this failure:

- The RA does not boot. From a power-on reset, the BIOS display shows the BIOS information, RAID adapter utility prompt, logical drives found, and so forth. The display is similar to the information shown in Figure 8-2.

Figure 8-2. Sample BIOS Display

Once the RA initializes, the log-in screen is displayed.

Note: Because status messages normally scroll on the screen, you might need to press Enter to see the log-in screen.

- The management console system status shows an RA failure. (See Figure 8-3.) To display more information about the error, click the red error in the right column. The More Info dialog box is displayed with a message similar to the following:

  RA 1 in West is down

Figure 8-3. Management Console Display Showing RA Error and RAs Tab

The RAs tab on the management console shows information similar to that in Figure 8-3, specifically:

- The RA status for RA 1 on the West site shows an error.
- The peer RA on the East site (RA 1) shows a data link error.
- Each RA on the East site shows a WAN connection failure.
- The surviving RA at the failed site (West) does not show any errors.

Warnings and informational messages similar to those shown in Figure 8-4 appear on the management console when an RA fails and a switchover occurs. See the table after the figure for an explanation of the numbered console messages. In your

environment, the messages pertain only to the groups configured to use the failed RA as the preferred RA.

Figure 8-4. Management Console Messages for Single RA Failure with Switchover

The following list explains the numbered messages shown in Figure 8-4:

- At the same site, the other RA reports a problem getting to the LAN of the failed RA.
- The site with the failed RA reports that the RA is probably down.

- The management console is now running on the surviving RA.
- For each consistency group, a minor problem is reported. The details show that the RA is down or not a cluster member.
- For each consistency group, the transfer is paused at the surviving site to allow a switchover. The details show the reason for the pause as switchover.
- For each consistency group at the same site, the groups are activated at the surviving RA. This probably means that a switchover to RA 2 at the failed site was successful.
- For each consistency group at the failed site, the splitter is again splitting.
- A WAN link error is reported from each RA at the surviving site regarding the failed RA at the other site.
- For each consistency group at the failed site, the transfer is started.
- For each consistency group at the failed site, an initialization is performed.
- For each consistency group at the failed site, the initialization completes.
- The failed RA (RA 1) is now restored.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

Actions to Resolve the Problem

The following list summarizes the actions you need to perform to isolate and resolve the problem:

- Check the LCD display on the front panel of the RA. See "LCD Status Messages" in Appendix B for more information.
- If the LCD display shows an error, run the RA diagnostics. See Appendix B for more information.
- Check all indicator lights on the rear panel of the RA.
- Review the symptoms and actions in the following topics:
  - Reboot Regulation

  - Onboard WAN Network Adapter Failure

If you determine that the failed RA must be replaced, contact the Unisys service representative for a replacement RA. After you receive the replacement RA, follow the steps in Appendix D to install and configure it.

The following procedure provides a detailed description of the actions to perform:

1. Remove the front bezel of the RA and look at the LCD display. During normal operation, the illuminated message should identify the system. If the LCD display flashes amber, the system needs attention because of a problem with power supplies, fans, system temperature, or hard drives. Figure 8-5 shows the location of the LCD display.

Figure 8-5. LCD Display on Front Panel of RA

If an error message is displayed, check Table B-1. For example, the message E0D76 indicates a drive failure. (Refer to "Single Hard Disk Failure" in this section.) If the message code is not listed in Table B-1, run the RA diagnostics (see Appendix B).

2. Check the indicators at the rear of the RA as described in the following steps and visually verify that all are working correctly. Figure 8-6 illustrates the rear panel of the RA.
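The LCD check in step 1 is a simple table lookup. The sketch below encodes only the one code quoted in this guide (E0D76); Table B-1 is the authoritative list, and any code not found there falls through to running the RA diagnostics:

```python
# Only the code quoted in the text is included; Table B-1 holds the full list.
LCD_CODES = {
    "E0D76": "Drive failure. Refer to 'Single Hard Disk Failure' in this section.",
}

def lcd_next_action(code):
    """Map an LCD message code to the next troubleshooting action."""
    return LCD_CODES.get(
        code.upper(),
        "Code not listed in Table B-1: run the RA diagnostics (see Appendix B).",
    )
```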

Note: The network connections on the rear panel labeled 1 and 2 in the following illustration might appear different on your RA. The connection labeled 1 is always the RA replication network, and the connection labeled 2 is always the RA management network. Pay special attention to the labeling when checking the network connections.

Figure 8-6. Rear Panel of RA Showing Indicators

Ping each network connection (management network and replication network), and visually verify that the LEDs on either side of the cable on the back panel are illuminated. Figure 8-7 shows the location of these LEDs. If the LEDs are off, the network is not connected. The green LED is lit if the network is connected to a valid link partner on the network. The amber LED blinks when network data is being sent or received.

If the management network LEDs indicate a problem, refer to "Onboard Management Network Adapter Failure" in this section. If the replication network LEDs indicate a problem, refer to "Onboard WAN Network Adapter Failure" in this section.

Figure 8-7. Location of Network LEDs
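The LED behavior just described amounts to a small lookup, sketched here for illustration only:

```python
def network_led_status(green_on, amber_blinking):
    """Interpret the rear-panel network LEDs as described above:
    both off = not connected; green lit = valid link partner;
    amber blinking = data being sent or received."""
    if not green_on and not amber_blinking:
        return "not connected"
    parts = []
    if green_on:
        parts.append("link to a valid partner")
    if amber_blinking:
        parts.append("data being sent or received")
    return "; ".join(parts)
```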

Check that the green LEDs for the SAN Fibre Channel HBAs are illuminated as shown in Figure 8-8.

Figure 8-8. Location of SAN Fibre Channel HBA LEDs

The following table explains the LED patterns and their meanings. If the LEDs indicate a problem, refer to the two topics for SAN Fibre Channel HBA failures in this section.

Green LED   Amber LED   Activity
On          On          Power
On          Off         Online
Off         On          Signal acquired
Off         Flashing    Loss of synchronization
Flashing    Flashing    Firmware error

Reboot Regulation

Problem Description

After frequent, unexplained reboots or restarts of the replication process, the RA automatically detaches from the RA cluster. When installing the RAs, you can enable or disable this reboot regulation feature. The factory default is for the feature to be enabled so that reboot regulation is triggered whenever a specified number of reboots or failures occur within the specified time interval.

The two parameters available for the reboot regulation feature are the number of reboots (including internal failures) and the time interval. The default value for the number of reboots is 10, and the default value for the time interval is 2 hours.

Only Unisys personnel should change these values. Use the Installation Manager to change the parameter values or disable the feature. See the Unisys SafeGuard Solutions Replication Appliance Installation Guide for information about using the Installation Manager tools to make these changes.

Symptoms

The following symptoms might help you identify this failure:

- Frequent transfer pauses occur for all consistency groups that have the same preferred RA.
- If you log in to the RA as the boxmgmt user, the following message is displayed:

  Reboot regulation limit has been exceeded

- Several messages might be displayed on the Logs tab of the management console as an RA reboots to try to correct a problem. These messages are listed in Table 8-4.

Table 8-4. Management Console Messages Pertaining to Reboots

Event ID   Description
3008       The RA appears to be down. The RA might attempt to perform a reboot to correct the problem.
3023       Error in LAN link (as the RA reboots).
3021       Error in WAN link (as the RA reboots).
3007       The RA is up (the reboot has completed).
3022       The LAN link is restored (the reboot has completed).
3020       The WAN link at the other site is restored (the reboot has completed).

When any of these messages appear multiple times in a short time period, they might indicate an RA that has continuously rebooted and might have reached the reboot regulation limit.
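The reboot regulation policy described above (by default, 10 reboots within a 2-hour window) behaves like a sliding-window counter. This sketch models the policy only; it is not the appliance's implementation, and the class and method names are mine:

```python
from collections import deque
import time

class RebootRegulator:
    """Sliding-window counter modeling the reboot regulation policy."""

    def __init__(self, max_reboots=10, window_seconds=2 * 3600):
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self._events = deque()

    def record_reboot(self, timestamp=None):
        """Record a reboot; return True once the limit is reached within the
        window, which is when the RA would detach from the RA cluster."""
        if timestamp is None:
            timestamp = time.monotonic()
        self._events.append(timestamp)
        # Discard reboots that fell out of the regulation window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.max_reboots
```

For example, with `max_reboots=3` and a 100-second window, a third reboot at t=20 trips the limit, but a third reboot at t=200 does not, because the first two have aged out of the window.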

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. Collect the RA logs before you attempt to resolve the problem. See Appendix A for information about collecting logs.
2. To determine whether the hardware is faulty, run the RA diagnostics described in Appendix B.
3. If the problem remains, submit the RA logs to Unisys for analysis.
4. Once the problem is corrected, the RA automatically attaches to the RA cluster after a power-on reset. If necessary, reattach the RA to the RA cluster manually by following these steps:
   a. Log in as boxmgmt to the RA through an SSH session using PuTTY.
   b. At the prompt, type 4 (Cluster operations) and press Enter.
   c. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.
   d. At the prompt, type Q (Quit).

Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)

Problem Description

All SAN Fibre Channel HBAs or adapter ports on the RA fail. This scenario is unlikely because the RA has redundant ports that are located on different physical adapters. A SAN connectivity problem is more likely.

Note: A single redundant path does not show errors on the management console display. See "Port Failure on a Single SAN Fibre Channel HBA on One RA."

Symptoms

The following symptoms might help you identify this failure:

- The link indicator lights on all SAN Fibre Channel HBAs are not illuminated. (Refer to Figure 8-8 for the location of these LEDs.)
- The port indicator lights on the Fibre Channel switch no longer show a link to the RA.
- Port errors occur, or no target appears, when running the Installation Manager SAN diagnostics.
- Information on the Volumes tab of the management console is inconsistent or periodically changing.
- The management console shows failures for RAs, storage, and hosts. (See Figure 8-9.)

Figure 8-9. Management Console Display: Host Connection with RA Is Down

If you click the red error indication for RAs in the right column, the message is:

  RA 2 in West can't access repository volume

If you click the red error indication for storage in the right column, the following messages are displayed:

If you click the red error indication in the right column for splitters, the message is:

  ERROR: USMV-WEST2's connection with RA2 is down

Warnings and informational messages similar to those shown in Figure 8-10 appear on the management console when an RA fails with this type of problem. See the table after the figure for an explanation of the numbered console messages. Also, refer to Figure 8-4 and the table that explains the messages for information about an RA failure with a generic switchover. Refer to Table 8-4 for other messages that might occur whenever an RA reboots to try to correct the problem.


More information

GlobalSCAPE DMZ Gateway, v1. User Guide

GlobalSCAPE DMZ Gateway, v1. User Guide GlobalSCAPE DMZ Gateway, v1 User Guide GlobalSCAPE, Inc. (GSB) Address: 4500 Lockhill-Selma Road, Suite 150 San Antonio, TX (USA) 78249 Sales: (210) 308-8267 Sales (Toll Free): (800) 290-5054 Technical

More information

Microsoft BackOffice Small Business Server 4.5 Installation Instructions for Compaq Prosignia and ProLiant Servers

Microsoft BackOffice Small Business Server 4.5 Installation Instructions for Compaq Prosignia and ProLiant Servers Integration Note October 2000 Prepared by OS Integration Engineering Compaq Computer Corporation Contents Introduction...3 Requirements...3 Minimum Requirements...4 Required Information...5 Additional

More information

Using RAID Admin and Disk Utility

Using RAID Admin and Disk Utility Using RAID Admin and Disk Utility Xserve RAID Includes instructions for creating RAID arrays and monitoring Xserve RAID systems K Apple Computer, Inc. 2003 Apple Computer, Inc. All rights reserved. Under

More information

Preface... 1. Introduction... 1 High Availability... 2 Users... 4 Other Resources... 5 Conventions... 5

Preface... 1. Introduction... 1 High Availability... 2 Users... 4 Other Resources... 5 Conventions... 5 Table of Contents Preface.................................................... 1 Introduction............................................................. 1 High Availability.........................................................

More information

EMC MID-RANGE STORAGE AND THE MICROSOFT SQL SERVER I/O RELIABILITY PROGRAM

EMC MID-RANGE STORAGE AND THE MICROSOFT SQL SERVER I/O RELIABILITY PROGRAM White Paper EMC MID-RANGE STORAGE AND THE MICROSOFT SQL SERVER I/O RELIABILITY PROGRAM Abstract This white paper explains the integration of EMC Mid-range Storage arrays with the Microsoft SQL Server I/O

More information

StorSimple Appliance Quick Start Guide

StorSimple Appliance Quick Start Guide StorSimple Appliance Quick Start Guide 5000 and 7000 Series Appliance Software Version 2.1.1 (2.1.1-267) Exported from Online Help on September 15, 2012 Contents Getting Started... 3 Power and Cabling...

More information

F-Secure Messaging Security Gateway. Deployment Guide

F-Secure Messaging Security Gateway. Deployment Guide F-Secure Messaging Security Gateway Deployment Guide TOC F-Secure Messaging Security Gateway Contents Chapter 1: Deploying F-Secure Messaging Security Gateway...3 1.1 The typical product deployment model...4

More information

EXPRESSCLUSTER X for Windows Quick Start Guide for Microsoft SQL Server 2014. Version 1

EXPRESSCLUSTER X for Windows Quick Start Guide for Microsoft SQL Server 2014. Version 1 EXPRESSCLUSTER X for Windows Quick Start Guide for Microsoft SQL Server 2014 Version 1 NEC EXPRESSCLUSTER X 3.x for Windows SQL Server 2014 Quick Start Guide Document Number ECX-MSSQL2014-QSG, Version

More information

Barracuda Link Balancer Administrator s Guide

Barracuda Link Balancer Administrator s Guide Barracuda Link Balancer Administrator s Guide Version 1.0 Barracuda Networks Inc. 3175 S. Winchester Blvd. Campbell, CA 95008 http://www.barracuda.com Copyright Notice Copyright 2008, Barracuda Networks

More information

SILVER PEAK ACCELERATION WITH EMC VSPEX PRIVATE CLOUD WITH RECOVERPOINT FOR VMWARE VSPHERE

SILVER PEAK ACCELERATION WITH EMC VSPEX PRIVATE CLOUD WITH RECOVERPOINT FOR VMWARE VSPHERE VSPEX IMPLEMENTATION GUIDE SILVER PEAK ACCELERATION WITH EMC VSPEX PRIVATE CLOUD WITH RECOVERPOINT FOR VMWARE VSPHERE Silver Peak Abstract This Implementation Guide describes the deployment of Silver Peak

More information

Configuring a Microsoft Windows Server 2012/R2 Failover Cluster with Storage Center

Configuring a Microsoft Windows Server 2012/R2 Failover Cluster with Storage Center Configuring a Microsoft Windows Server 2012/R2 Failover Cluster with Storage Center Dell Compellent Solution Guide Kris Piepho, Microsoft Product Specialist October, 2013 Revisions Date Description 1/4/2013

More information

Direct Attached Storage

Direct Attached Storage , page 1 Fibre Channel Switching Mode, page 1 Configuring Fibre Channel Switching Mode, page 2 Creating a Storage VSAN, page 3 Creating a VSAN for Fibre Channel Zoning, page 4 Configuring a Fibre Channel

More information

Microsoft File and Print Service Failover Using Microsoft Cluster Server

Microsoft File and Print Service Failover Using Microsoft Cluster Server Microsoft File and Print Service Failover Using Microsoft Cluster Server TechNote First Edition (March 1998) Part Number 309826-001 Compaq Computer Corporation Notice The information in this publication

More information

Whitepaper Continuous Availability Suite: Neverfail Solution Architecture

Whitepaper Continuous Availability Suite: Neverfail Solution Architecture Continuous Availability Suite: Neverfail s Continuous Availability Suite is at the core of every Neverfail solution. It provides a comprehensive software solution for High Availability (HA) and Disaster

More information

Enterprise Manager. Version 6.2. Administrator s Guide

Enterprise Manager. Version 6.2. Administrator s Guide Enterprise Manager Version 6.2 Administrator s Guide Enterprise Manager 6.2 Administrator s Guide Document Number 680-017-017 Revision Date Description A August 2012 Initial release to support version

More information

Installing and Configuring vcloud Connector

Installing and Configuring vcloud Connector Installing and Configuring vcloud Connector vcloud Connector 2.7.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new

More information

vbackup Administrator s Guide Thinware vbackup 4.0.1

vbackup Administrator s Guide Thinware vbackup 4.0.1 vbackup Administrator s Guide Thinware vbackup 4.0.1 Thinware vbackup Administrator s Guide Thinware vbackup Administrator s Guide Revision: 4.0.1-1 The latest product updates and most up-to-date documentation

More information

Metalogix SharePoint Backup. Advanced Installation Guide. Publication Date: August 24, 2015

Metalogix SharePoint Backup. Advanced Installation Guide. Publication Date: August 24, 2015 Metalogix SharePoint Backup Publication Date: August 24, 2015 All Rights Reserved. This software is protected by copyright law and international treaties. Unauthorized reproduction or distribution of this

More information

vsphere Host Profiles

vsphere Host Profiles ESXi 5.1 vcenter Server 5.1 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions

More information

StarWind Virtual SAN Installation and Configuration of Hyper-Converged 2 Nodes with Hyper-V Cluster

StarWind Virtual SAN Installation and Configuration of Hyper-Converged 2 Nodes with Hyper-V Cluster #1 HyperConverged Appliance for SMB and ROBO StarWind Virtual SAN Installation and Configuration of Hyper-Converged 2 Nodes with MARCH 2015 TECHNICAL PAPER Trademarks StarWind, StarWind Software and the

More information

Support Document: Microsoft SQL Server - LiveVault 7.6X

Support Document: Microsoft SQL Server - LiveVault 7.6X Contents Preparing to create a Microsoft SQL backup policy... 2 Adjusting the SQL max worker threads option... 2 Preparing for Log truncation... 3 Best Practices... 3 Microsoft SQL Server 2005, 2008, or

More information

Intel Storage System Software User Manual

Intel Storage System Software User Manual Intel Storage System Software User Manual Intel Storage System SSR316MJ2 Intel Storage System SSR212MA Intel Order Number: D26451-003 Disclaimer Information in this document is provided in connection with

More information

Administration GUIDE. Exchange Database idataagent. Published On: 11/19/2013 V10 Service Pack 4A Page 1 of 233

Administration GUIDE. Exchange Database idataagent. Published On: 11/19/2013 V10 Service Pack 4A Page 1 of 233 Administration GUIDE Exchange Database idataagent Published On: 11/19/2013 V10 Service Pack 4A Page 1 of 233 User Guide - Exchange Database idataagent Table of Contents Overview Introduction Key Features

More information

Setup for Failover Clustering and Microsoft Cluster Service

Setup for Failover Clustering and Microsoft Cluster Service Setup for Failover Clustering and Microsoft Cluster Service ESX 4.0 ESXi 4.0 vcenter Server 4.0 This document supports the version of each product listed and supports all subsequent versions until the

More information

System Compatibility. Enhancements. Email Security. SonicWALL Email Security 7.3.2 Appliance Release Notes

System Compatibility. Enhancements. Email Security. SonicWALL Email Security 7.3.2 Appliance Release Notes Email Security SonicWALL Email Security 7.3.2 Appliance Release Notes System Compatibility SonicWALL Email Security 7.3.2 is supported on the following SonicWALL Email Security appliances: SonicWALL Email

More information

SmartFiler Backup Appliance User Guide 2.0

SmartFiler Backup Appliance User Guide 2.0 SmartFiler Backup Appliance User Guide 2.0 SmartFiler Backup Appliance User Guide 1 Table of Contents Overview... 5 Solution Overview... 5 SmartFiler Backup Appliance Overview... 5 Getting Started... 7

More information

Course: WIN310. Student Lab Setup Guide. Microsoft Windows Server 2003 Network Infrastructure (70-291) ISBN: 0-470-06887-6 STUDENT COMPUTER SETUP

Course: WIN310. Student Lab Setup Guide. Microsoft Windows Server 2003 Network Infrastructure (70-291) ISBN: 0-470-06887-6 STUDENT COMPUTER SETUP Course: WIN310 Student Lab Setup Guide Microsoft Windows Server 2003 Network Infrastructure (70-291) ISBN: 0-470-06887-6 STUDENT COMPUTER SETUP Hardware Requirements All hardware must be on the Microsoft

More information

BrightStor ARCserve Backup for Windows

BrightStor ARCserve Backup for Windows BrightStor ARCserve Backup for Windows Agent for Microsoft SQL Server r11.5 D01173-2E This documentation and related computer software program (hereinafter referred to as the "Documentation") is for the

More information

Network Scanner Tool R3.1. User s Guide Version 3.0.04

Network Scanner Tool R3.1. User s Guide Version 3.0.04 Network Scanner Tool R3.1 User s Guide Version 3.0.04 Copyright 2000-2004 by Sharp Corporation. All rights reserved. Reproduction, adaptation or translation without prior written permission is prohibited,

More information

Installing and Using the vnios Trial

Installing and Using the vnios Trial Installing and Using the vnios Trial The vnios Trial is a software package designed for efficient evaluation of the Infoblox vnios appliance platform. Providing the complete suite of DNS, DHCP and IPAM

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

System i and System p. Customer service, support, and troubleshooting

System i and System p. Customer service, support, and troubleshooting System i and System p Customer service, support, and troubleshooting System i and System p Customer service, support, and troubleshooting Note Before using this information and the product it supports,

More information

StarWind iscsi SAN Software: Using StarWind with MS Cluster on Windows Server 2003

StarWind iscsi SAN Software: Using StarWind with MS Cluster on Windows Server 2003 StarWind iscsi SAN Software: Using StarWind with MS Cluster on Windows Server 2003 www.starwindsoftware.com Copyright 2008-2011. All rights reserved. COPYRIGHT Copyright 2008-2011. All rights reserved.

More information

Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide

Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Windows Server 2003, Windows Server 2008 5.1 Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Copyright

More information

Veritas Cluster Server Application Note: High Availability for BlackBerry Enterprise Server

Veritas Cluster Server Application Note: High Availability for BlackBerry Enterprise Server Veritas Cluster Server Application Note: High Availability for BlackBerry Enterprise Server Windows Server 2003, Windows Server 2008 5.1 Service Pack 1 Veritas Cluster Server Application Note: High Availability

More information

VERITAS NetBackup 6.0 for Microsoft Exchange Server

VERITAS NetBackup 6.0 for Microsoft Exchange Server VERITAS NetBackup 6.0 for Microsoft Exchange Server System Administrator s Guide for Windows N152688 September 2005 Disclaimer The information contained in this publication is subject to change without

More information

Enterprise Manager. Version 6.2. Installation Guide

Enterprise Manager. Version 6.2. Installation Guide Enterprise Manager Version 6.2 Installation Guide Enterprise Manager 6.2 Installation Guide Document Number 680-028-014 Revision Date Description A August 2012 Initial release to support version 6.2.1

More information

User Manual. Onsight Management Suite Version 5.1. Another Innovation by Librestream

User Manual. Onsight Management Suite Version 5.1. Another Innovation by Librestream User Manual Onsight Management Suite Version 5.1 Another Innovation by Librestream Doc #: 400075-06 May 2012 Information in this document is subject to change without notice. Reproduction in any manner

More information

Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide

Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Windows 2000, Windows Server 2003 5.0 11293743 Veritas Cluster Server Database Agent for Microsoft SQL Configuration Guide Copyright

More information

TECHNICAL PAPER. Veeam Backup & Replication with Nimble Storage

TECHNICAL PAPER. Veeam Backup & Replication with Nimble Storage TECHNICAL PAPER Veeam Backup & Replication with Nimble Storage Document Revision Date Revision Description (author) 11/26/2014 1. 0 Draft release (Bill Roth) 12/23/2014 1.1 Draft update (Bill Roth) 2/20/2015

More information

HP ProLiant DL380 G5 High Availability Storage Server

HP ProLiant DL380 G5 High Availability Storage Server HP ProLiant DL380 G5 High Availability Storage Server installation instructions *5697-7748* Part number: 5697 7748 First edition: November 2008 Legal and notice information Copyright 1999, 2008 Hewlett-Packard

More information

Legal Notes. Regarding Trademarks. 2012 KYOCERA Document Solutions Inc.

Legal Notes. Regarding Trademarks. 2012 KYOCERA Document Solutions Inc. Legal Notes Unauthorized reproduction of all or part of this guide is prohibited. The information in this guide is subject to change without notice. We cannot be held liable for any problems arising from

More information

Introduction to Hyper-V High- Availability with Failover Clustering

Introduction to Hyper-V High- Availability with Failover Clustering Introduction to Hyper-V High- Availability with Failover Clustering Lab Guide This lab is for anyone who wants to learn about Windows Server 2012 R2 Failover Clustering, focusing on configuration for Hyper-V

More information

NMS300 Network Management System

NMS300 Network Management System NMS300 Network Management System User Manual June 2013 202-11289-01 350 East Plumeria Drive San Jose, CA 95134 USA Support Thank you for purchasing this NETGEAR product. After installing your device, locate

More information

FalconStor Recovery Agents User Guide

FalconStor Recovery Agents User Guide FalconStor Recovery Agents User Guide FalconStor Software, Inc. 2 Huntington Quadrangle Melville, NY 11747 Phone: 631-777-5188 Fax: 631-501-7633 Web site: www.falconstor.com Copyright 2007-2009 FalconStor

More information

Microsoft Windows Storage Server 2003 R2

Microsoft Windows Storage Server 2003 R2 Microsoft Windows Storage Server 2003 R2 Getting Started Guide Abstract This guide documents the various features available in Microsoft Windows Storage Server 2003 R2. Rev 1. 2005 Microsoft Corporation.

More information

StarWind iscsi SAN Software: Using StarWind with VMware ESX Server

StarWind iscsi SAN Software: Using StarWind with VMware ESX Server StarWind iscsi SAN Software: Using StarWind with VMware ESX Server www.starwindsoftware.com Copyright 2008-2010. All rights reserved. COPYRIGHT Copyright 2008-2010. All rights reserved. No part of this

More information

Drobo How-To Guide. Topics. What You Will Need. Prerequisites. Deploy Drobo B1200i with Microsoft Hyper-V Clustering

Drobo How-To Guide. Topics. What You Will Need. Prerequisites. Deploy Drobo B1200i with Microsoft Hyper-V Clustering Multipathing I/O (MPIO) enables the use of multiple iscsi ports on a Drobo SAN to provide fault tolerance. MPIO can also boost performance of an application by load balancing traffic across multiple ports.

More information

Administrator Guide VMware vcenter Server Heartbeat 6.3 Update 1

Administrator Guide VMware vcenter Server Heartbeat 6.3 Update 1 Administrator Guide VMware vcenter Server Heartbeat 6.3 Update 1 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition.

More information

vsphere Replication for Disaster Recovery to Cloud

vsphere Replication for Disaster Recovery to Cloud vsphere Replication for Disaster Recovery to Cloud vsphere Replication 5.8 This document supports the version of each product listed and supports all subsequent versions until the document is replaced

More information

Configuring SSL VPN on the Cisco ISA500 Security Appliance

Configuring SSL VPN on the Cisco ISA500 Security Appliance Application Note Configuring SSL VPN on the Cisco ISA500 Security Appliance This application note describes how to configure SSL VPN on the Cisco ISA500 security appliance. This document includes these

More information

EMC NetWorker Module for Microsoft for Windows Bare Metal Recovery Solution

EMC NetWorker Module for Microsoft for Windows Bare Metal Recovery Solution EMC NetWorker Module for Microsoft for Windows Bare Metal Recovery Solution Release 8.2 User Guide P/N 302-000-658 REV 01 Copyright 2007-2014 EMC Corporation. All rights reserved. Published in the USA.

More information

WhatsUp Gold v16.1 Database Migration and Management Guide Learn how to migrate a WhatsUp Gold database from Microsoft SQL Server 2008 R2 Express

WhatsUp Gold v16.1 Database Migration and Management Guide Learn how to migrate a WhatsUp Gold database from Microsoft SQL Server 2008 R2 Express WhatsUp Gold v16.1 Database Migration and Management Guide Learn how to migrate a WhatsUp Gold database from Microsoft SQL Server 2008 R2 Express Edition to Microsoft SQL Server 2005, 2008, or 2008 R2

More information

Symantec Database Security and Audit 3100 Series Appliance. Getting Started Guide

Symantec Database Security and Audit 3100 Series Appliance. Getting Started Guide Symantec Database Security and Audit 3100 Series Appliance Getting Started Guide Symantec Database Security and Audit 3100 Series Getting Started Guide The software described in this book is furnished

More information

Pharos Uniprint 8.4. Maintenance Guide. Document Version: UP84-Maintenance-1.0. Distribution Date: July 2013

Pharos Uniprint 8.4. Maintenance Guide. Document Version: UP84-Maintenance-1.0. Distribution Date: July 2013 Pharos Uniprint 8.4 Maintenance Guide Document Version: UP84-Maintenance-1.0 Distribution Date: July 2013 Pharos Systems International Suite 310, 80 Linden Oaks Rochester, New York 14625 Phone: 1-585-939-7000

More information