Unisys SafeGuard Solutions
Troubleshooting Guide

Unisys SafeGuard Solutions Release 8.0

July 2009

6872 5688 006
NO WARRANTIES OF ANY NATURE ARE EXTENDED BY THIS DOCUMENT. Any product or related information described herein is only furnished pursuant and subject to the terms and conditions of a duly executed agreement to purchase or lease equipment or to license software. The only warranties made by Unisys, if any, with respect to the products described in this document are set forth in such agreement. Unisys cannot accept any financial or other responsibility that may be the result of your use of the information in this document or software material, including direct, special, or consequential damages. You should be very careful to ensure that the use of this information and/or software material complies with the laws, rules, and regulations of the jurisdictions with respect to which it is used. The information contained herein is subject to change without notice. Revisions may be issued to advise of such changes and/or additions. Notice to U.S. Government End Users: This is commercial computer software or hardware documentation developed at private expense. Use, reproduction, or disclosure by the Government is subject to the terms of Unisys standard commercial license for the products, and where applicable, the restricted/limited rights provisions of the contract data rights clauses. Unisys is a registered trademark of Unisys Corporation in the United States and other countries. All other brands and products referenced in this document are acknowledged to be the trademarks or registered trademarks of their respective holders.
Contents

Section 1. About This Guide
Purpose and Audience ... 1-1
Related Product Information ... 1-1
Documentation Updates ... 1-1
What's New in This Release ... 1-2
Using This Guide ... 1-3

Section 2. Overview
Geographic Replication Environment ... 2-1
Geographic Clustered Environment ... 2-1
Data Flow ... 2-2
Diagnostic Tools and Capabilities ... 2-6
Event Log ... 2-6
System Status ... 2-7
E-mail Notifications ... 2-8
Installation Diagnostics ... 2-8
Host Information Collector (HIC) ... 2-8
Cluster Logs ... 2-9
Unisys SafeGuard 30m Collector ... 2-9
RA Diagnostics ... 2-9
Hardware Indicators ... 2-9
SNMP Support ... 2-9
kutils Utility ... 2-9
Discovering Problems ... 2-9
Events That Cause Journal Distribution ... 2-10
Troubleshooting Procedures ... 2-11
Identifying the Main Components and Connectivity of the Configuration ... 2-11
Understanding the Current State of the System ... 2-11
Verifying the System Connectivity ... 2-11
Analyzing the Configuration Settings ... 2-12

Section 3. Recovering in a Geographic Replication Environment
Manual Failover of Volumes and Data Consistency Groups ... 3-1
Accessing an Image ... 3-1
Testing the Selected Image at Remote Site ... 3-2
Section 4. Recovering in a Geographic Clustered Environment
Checking the Cluster Setup ... 4-1
MSCS Properties ... 4-1
Network Bindings ... 4-2
Group Initialization Effects on a Cluster Move-Group Operation ... 4-3
Full-Sweep Initialization ... 4-4
Long Resynchronization ... 4-4
Initialization from Marking Mode ... 4-5
Behavior of SafeGuard 30m Control During a Move-Group Operation ... 4-5
Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group ... 4-7
Taking a Cluster Data Group Offline ... 4-7
Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image ... 4-8
Bringing a Cluster Data Group Online and Checking the Validity of the Image ... 4-9
Reversing the Replication Direction of the Consistency Group ... 4-10
Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner) ... 4-11
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner) ... 4-17
Recovery When All RAs and All Servers Fail on One Site ... 4-19
Site 1 Failure (Site 1 Quorum Owner) ... 4-19
Site 1 Failure (Site 2 Quorum Owner) ... 4-24

Section 5. Solving Storage Problems
User or Replication Volume Not Accessible ... 5-3
Repository Volume Not Accessible ... 5-5
Reformatting the Repository Volume ... 5-8
Journal Not Accessible ... 5-10
Journal Volume Lost Scenarios ... 5-12
Total Storage Loss in a Geographic Replicated Environment ... 5-13
Storage Failure on One Site in a Geographic Clustered Environment ... 5-15
Storage Failure on One Site with Quorum Owner on Failed Site ... 5-16
Storage Failure on One Site with Quorum Owner on Surviving Site ... 5-20

Section 6. Solving SAN Connectivity Problems
Volume Not Accessible to RAs ... 6-2
Volume Not Accessible to SafeGuard 30m Splitter ... 6-6
RAs Not Accessible to SafeGuard 30m Splitter ... 6-11
Total SAN Switch Failure on One Site in a Geographic Clustered Environment ... 6-14
Cluster Quorum Owner Located on Site with Failed SAN Switch ... 6-16
Cluster Quorum Owner Not on Site with Failed SAN Switch ... 6-20

Section 7. Solving Network Problems
Public NIC Failure on a Cluster Node in a Geographic Clustered Environment ... 7-3
Public or Client WAN Failure in a Geographic Clustered Environment ... 7-6
Management Network Failure in a Geographic Clustered Environment ... 7-11
Replication Network Failure in a Geographic Clustered Environment ... 7-15
Temporary WAN Failures ... 7-22
Private Cluster Network Failure in a Geographic Clustered Environment ... 7-23
Total Communication Failure in a Geographic Clustered Environment ... 7-27
Port Information ... 7-33

Section 8. Solving Replication Appliance (RA) Problems
Single RA Failures ... 8-4
Single RA Failure with Switchover ... 8-5
Reboot Regulation ... 8-12
Failure of All SAN Fibre Channel Host Bus Adapters (HBAs) ... 8-14
Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter ... 8-20
Single RA Failures Without a Switchover ... 8-22
Port Failure on a Single SAN Fibre Channel HBA on One RA ... 8-22
Onboard Management Network Adapter Failure ... 8-24
Single Hard Disk Failure ... 8-25
Failure of All RAs at One Site ... 8-26
All RAs Are Not Attached ... 8-28

Section 9. Solving Server Problems
Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment ... 9-2
Possible Subset Scenarios ... 9-3
Windows Server Reboot ... 9-3
Unexpected Server Shutdown Because of a Bug Check ... 9-8
Server Crash or Restart ... 9-12
Server Unable to Connect with SAN ... 9-14
Server HBA Failure ... 9-18
Infrastructure (NTP) Server Failure ... 9-19
Server Failure (Hardware or Software) in a Geographic Replication Environment ... 9-21

Section 10. Solving Performance Problems
Slow Initialization ... 10-2
General Description of High-Load Event ... 10-3
High-Load (Disk Manager) Condition ... 10-4
High-Load (Distributor) Condition ... 10-4
Failover Time Lengthens ... 10-5

Appendix A. Collecting and Using Logs
Collecting RA Logs ... A-1
Setting the Automatic Host Info Collection Option ... A-2
Testing FTP Connectivity ... A-2
Determining When the Failure Occurred ... A-2
Converting Local Time to GMT or UTC ... A-3
Collecting RA Logs ... A-3
Collecting Server (Host) Logs ... A-6
Using the MPS Report Utility ... A-6
Using the Host Information Collector (HIC) Utility ... A-7
Analyzing RA Log Collection Files ... A-8
RA Log Extraction Directory ... A-9
tmp Directory ... A-14
Host Log Extraction Directory ... A-15
Analyzing Server (Host) Logs ... A-16
Analyzing Intelligent Fabric Switch Logs ... A-16

Appendix B. Running Replication Appliance (RA) Diagnostics
Clearing the System Event Log (SEL) ... B-1
Running Hardware Diagnostics ... B-2
Custom Test ... B-3
Express Test ... B-4
LCD Status Messages ... B-4

Appendix C. Running Installation Manager Diagnostics
Using the SSH Client ... C-1
Running Diagnostics ... C-1
IP Diagnostics ... C-2
Fibre Channel Diagnostics ... C-9
Synchronization Diagnostics ... C-17
Collect System Info ... C-18
Appendix D. Replacing a Replication Appliance (RA)
Saving the Configuration Settings ... D-2
Recording Policy Properties and Saving Settings ... D-2
Modifying the Preferred RA Setting ... D-3
Removing Fibre Channel Adapter Cards ... D-4
Installing and Configuring the Replacement RA ... D-4
Cable and Apply Power to the New RA ... D-4
Connecting and Accessing the RA ... D-4
Checking Storage-to-RA Access ... D-5
Enabling PCI-X Slot Functionality ... D-5
Configuring the RA ... D-6
Verifying the RA Installation ... D-7
Restoring Group Properties ... D-8
Ensuring the Existing RA Can Switch Over to the New RA ... D-8

Appendix E. Understanding Events
Event Log ... E-1
Event Topics ... E-1
Event Levels ... E-2
Event Scope ... E-2
Displaying the Event Log ... E-3
Using the Event Log for Troubleshooting ... E-3
List of Events ... E-4
List of Normal Events ... E-5
List of Detailed Events ... E-22

Appendix F. Configuring and Using SNMP Traps
Software Monitoring ... F-1
SNMP Monitoring and Trap Configuration ... F-3
Installing MIB Files on an SNMP Browser ... F-3
Resolving SNMP Issues ... F-4

Appendix G. Using the Unisys SafeGuard 30m Collector
Installing the SafeGuard 30m Collector ... G-1
Before You Begin the Configuration ... G-2
Handling the Security Breach Warning ... G-3
Using Collector Mode ... G-3
Getting Started ... G-3
Understanding Operations in Collector Mode ... G-7
Using Configuration Manager ... G-15
Using Site Verifier ... G-19
Using View Mode ... G-20

Appendix H. Using kutils
Usage ... H-1
Path Designations ... H-1
Command Summary ... H-2

Appendix I. Analyzing Cluster Logs
Introduction to Cluster Logs ... I-1
Creating the Cluster Log ... I-2
Understanding the Cluster Log Layout ... I-4
Sample Cluster Log ... I-5
Posting Information to the Cluster Log ... I-6
Diagnosing a Problem Using Cluster Logs ... I-7
Gathering Materials ... I-7
Opening the Cluster Log ... I-8
Converting GMT/UTC to Local Time ... I-8
Converting Cluster Log GUIDs to Text Resource Names ... I-8
Understanding State Codes ... I-10
Understanding Persistent State ... I-13
Understanding Error and Status Codes ... I-14

Index ... 1
Figures

2-1. Basic Geographic Clustered Environment ... 2-2
2-2. Data Flow ... 2-3
2-3. Data Flow with Fabric Splitter ... 2-5
2-4. Data Flow in CDP ... 2-6
4-1. All RAs Fail on Site 1 (Site 1 Quorum Owner) ... 4-12
4-2. All RAs Fail on Site 1 (Site 2 Quorum Owner) ... 4-17
4-3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner) ... 4-20
4-4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner) ... 4-24
5-1. Volumes Tab Showing Volume Connection Errors ... 5-4
5-3. Groups Tab Shows Paused by System ... 5-5
5-4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors ... 5-6
5-5. Volumes Tab Shows Error for Repository Volume ... 5-6
5-6. Groups Tab Shows All Groups Are Still Alive ... 5-7
5-7. Management Console Messages for the Repository Volume Not Accessible Problem ... 5-7
5-8. Volumes Tab Shows Journal Volume Error ... 5-11
5-9. RAs Tab Shows Connection Errors ... 5-11
5-10. Groups Tab Shows Group Paused by System ... 5-11
5-11. Management Console Messages for the Journal Not Accessible Problem ... 5-12
5-12. Management Console Volumes Tab Shows Errors for All Volumes ... 5-14
5-13. RAs Tab Shows Volumes That Are Not Accessible ... 5-14
5-14. Multipathing Software Reports Failed Paths to Storage Device ... 5-15
5-15. Storage on Site 1 Fails ... 5-16
5-16. Cluster Regroup Process ... 5-17
5-17. Cluster Administrator Displays ... 5-19
5-18. Multipathing Software Shows Server Errors for Failed Storage Subsystem ... 5-19
6-1. Management Console Showing Inaccessible Volume Errors ... 6-3
6-2. Management Console Messages for Inaccessible Volumes ... 6-3
6-3. Management Console Error Display Screen ... 6-6
6-4. Management Console Messages for Volumes Inaccessible to Splitter ... 6-7
6-5. EMC PowerPath Shows Disk Error ... 6-9
6-6. Management Console Display Shows a Splitter Down ... 6-11
6-7. Management Console Messages for Splitter Inaccessible to RA ... 6-12
6-8. SAN Switch Failure on One Site ... 6-15
6-9. Management Console Display with Errors for Failed SAN Switch ... 6-16
6-10. Management Console Messages for Failed SAN Switch ... 6-17
6-11. Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site ... 6-20
7-1. Public NIC Failure of a Cluster Node ... 7-3
7-2. Public NIC Error Shown in the Cluster Administrator ... 7-5
7-3. Public or Client WAN Failure ... 7-7
7-4. Cluster Administrator Showing Public LAN Network Error ... 7-8
7-5. Management Network Failure ... 7-11
7-6. Management Console Display: Not Connected ... 7-13
7-7. Management Console Message for Event 3023 ... 7-14
7-8. Replication Network Failure ... 7-16
7-9. Management Console Display: WAN Down ... 7-17
7-10. Management Console Log Messages: WAN Down ... 7-18
7-11. Management Console RAs Tab: All RAs Data Link Down ... 7-19
7-12. Private Cluster Network Failure ... 7-24
7-13. Cluster Administrator Display with Failures ... 7-25
7-14. Total Communication Failure ... 7-27
7-15. Management Console Display Showing WAN Error ... 7-28
7-16. RAs Tab for Total Communication Failure ... 7-29
7-17. Management Console Messages for Total Communication Failure ... 7-29
7-18. Cluster Administrator Showing Private Network Down ... 7-32
7-19. Cluster Administrator Showing Public Network Down ... 7-32
8-1. Single RA Failure ... 8-5
8-2. Sample BIOS Display ... 8-6
8-3. Management Console Display Showing RA Error and RAs Tab ... 8-7
8-4. Management Console Messages for Single RA Failure with Switchover ... 8-8
8-5. LCD Display on Front Panel of RA ... 8-10
8-6. Rear Panel of RA Showing Indicators ... 8-11
8-7. Location of Network LEDs ... 8-11
8-8. Location of SAN Fibre Channel HBA LEDs ... 8-12
8-9. Management Console Display: Host Connection with RA Is Down ... 8-15
8-10. Management Console Messages for Failed RA (All SAN HBAs Fail) ... 8-17
8-11. Management Console Showing WAN Data Link Failure ... 8-21
8-12. Location of Hard Drive LEDs ... 8-26
8-13. Management Console Showing All RAs Down ... 8-27
9-1. Cluster Node Failure ... 9-2
9-2. Management Console Display with Server Error ... 9-4
9-3. Management Console Messages for Server Down ... 9-5
9-4. Management Console Messages for Server Down for Bug Check ... 9-9
9-5. Management Console Display Showing LA Site Server Down ... 9-14
9-6. Management Console Images Showing Messages for Server Unable to Connect to SAN ... 9-15
9-7. PowerPath Administrator Console Showing Failures ... 9-17
9-8. PowerPath Administrator Console Showing Adapter Failure ... 9-18
9-9. Event 1009 Display ... 9-20
I-1. Layout of the Cluster Log ... I-4
I-2. Expanded Cluster Hive (in Windows 2000 Server) ... I-10
Tables

2-1. User Types ... 2-7
2-2. Events That Cause Journal Distribution ... 2-10
5-1. Possible Storage Problems with Symptoms ... 5-1
5-2. Indicators and Management Console Errors to Distinguish Different Storage Volume Failures ... 5-3
6-1. Possible SAN Connectivity Problems ... 6-1
7-1. Possible Networking Problems with Symptoms ... 7-1
7-2. Ports for Internet Communication ... 7-34
7-3. Ports for Management LAN Communication and Notification ... 7-34
7-4. Ports for RA-to-RA Internal Communication ... 7-35
8-1. Possible Problems for Single RA Failure with a Switchover ... 8-2
8-2. Possible Problems for Single RA Failure Without a Switchover ... 8-3
8-3. Possible Problems for Multiple RA Failures with Symptoms ... 8-3
8-4. Management Console Messages Pertaining to Reboots ... 8-13
9-1. Possible Server Problems with Symptoms ... 9-1
10-1. Possible Performance Problems with Symptoms ... 10-1
B-1. LCD Status Messages ... B-5
C-1. Messages from the Connectivity Testing Tool ... C-8
E-1. Normal Events ... E-5
E-2. Detailed Events ... E-23
F-1. Trap Variables and Values ... F-2
I-1. System Environment Variables Related to Clustering ... I-2
I-2. Modules of MSCS ... I-4
I-3. Node State Codes ... I-12
I-4. Group State Codes ... I-12
I-5. Resource State Codes ... I-12
I-6. Network Interface State Codes ... I-13
I-7. Network State Codes ... I-13
Section 1
About This Guide

Purpose and Audience

This document presents procedures for problem analysis and troubleshooting of the Unisys SafeGuard 30m solution. It is intended for Unisys service representatives and other technical personnel who are responsible for maintaining the Unisys SafeGuard 30m solution installation.

Related Product Information

The methods described in this document are based on support and diagnostic tools that are provided as standard components of the Unisys SafeGuard 30m solution. You can find additional information about these tools in the following documents:

Unisys SafeGuard Solutions Planning and Installation Guide
Unisys SafeGuard Solutions Replication Appliance Administrator's Guide
Unisys SafeGuard Solutions Replication Appliance Command Line Interface (CLI) Reference Guide
Unisys SafeGuard Solutions Replication Appliance Installation Guide

Note: Review the information in the Unisys SafeGuard Solutions Planning and Installation Guide about making configuration changes before you begin troubleshooting a problem.

Documentation Updates

This document contains all the information that was available at the time of publication. Changes identified after release of this document are included in problem list entry (PLE) 18697467. To obtain a copy of the PLE, contact your Unisys service representative or access the current PLE from the Unisys Product Support Web site:

http://www.support.unisys.com/all/ple/18697467

Note: If you are not logged in to the Product Support site, you will be asked to do so.
What's New in This Release

Some of the important changes in the 8.0 release include the following:

Changes in the UI
The SafeGuard UI has changed in this release; however, these UI changes have not been made in this guide. Refer to the SafeGuard 8.0 UI for the latest component names, but follow the steps given in this guide to complete any procedure. For example, Stretch Cluster Support is now renamed Stretch Cluster/VMware SRM Support in the SafeGuard 8.0 UI.

Synchronous Replication
You can now replicate data synchronously over Fibre Channel. The system can be set to replicate in synchronous mode, in asynchronous mode, or to switch dynamically between the two modes. The switch is determined by threshold values that are based on latency and throughput.

Unisys SafeGuard Solutions Installer Wizard
This wizard helps you install and configure a new Unisys SafeGuard Solutions installation on one or two sites. The wizard now supports IPv6.

Add New RAs Wizard
This wizard helps you add new RAs to existing RA clusters without any disruption.

RA Replacement Wizard
This wizard helps you replace an existing RA in an RA cluster with a new RA without any disruption.

Upgrade Tool
This CLI wizard helps you upgrade the RA code from 7.0 and 7.1 to 8.0. It is composed of two wizards: Prepare Upgrade and Apply.

System Monitoring
Unisys SafeGuard Solutions monitors selected parameter values to let you know how close they are to their limits. The limits are determined by the system, policies, licensing, or limitations of external technologies. Monitored parameters are shown in the Unisys SafeGuard Solutions Management Application and at the CLI command line.

Support for CLARiiON LUNs Greater Than 2 TB
When using a Unisys SafeGuard Solutions 8.0 CLARiiON splitter, Unisys SafeGuard Solutions supports the replication of CLARiiON CX3 and CX4 Series LUNs that are larger than 2 TB.
CLARiiON Splitter Support for 2048 LUNs
When using a Unisys SafeGuard Solutions 8.0 CLARiiON splitter, Unisys SafeGuard Solutions supports attachment of up to 2048 LUNs of CLARiiON CX3 and CX4 Series arrays.

Improved SAN Diagnostics
If there are SAN diagnostics errors, it is no longer possible to continue the installation until they are corrected. SAN Diagnostics and host SAN Diagnostics run automatically approximately once every hour on each RA. In addition, SAN Diagnostics runs each time the time zone changes. These tests are transparent to the user and do not affect system performance. Any SAN or host errors encountered are displayed in the Unisys SafeGuard Solutions GUI and in the output of the get_system_status command in the CLI. If certain configuration errors are encountered during SAN Diagnostics, the tests rerun every minute until the errors are corrected, immediately displaying correction results.

Using This Guide

This guide offers general information in the first four sections. Read Section 2 to understand the overall approach to troubleshooting and to gain an understanding of the Unisys SafeGuard 30m solution architecture. Section 3 describes recovery in a geographic replication environment, and Section 4 offers information and recovery procedures for geographic clustered environments.

Sections 5 through 10 group potential problems into categories and describe the problems. You must recognize symptoms, identify the problem or failed component, and then decide what to do to correct the problem. Sections 5 through 10 include a table at the beginning of each section that lists symptoms and potential problems. Each problem is then presented in the following format:

Problem Description: Description of the problem
Symptoms: List of symptoms that are typical for this problem
Actions to Resolve the Problem: Steps recommended to solve the problem

The appendixes provide information about using tools and offer reference information that you might find useful in different situations.
Section 2
Overview

The Unisys SafeGuard Solutions are flexible, integrated business continuance solutions especially suitable for protecting business-critical application environments. The Unisys SafeGuard 30m solution provides two distinct functions that act in concert: replication of data and automated application recovery through clustering over great distances.

Typically, the Unisys SafeGuard 30m solution is implemented in one of these environments:

Geographic replication environment: In this environment, data from servers at one site is replicated to a remote site.

Geographic clustered environment: In this environment, Microsoft Cluster Service (MSCS) is installed on servers that span sites and that participate in one cluster. The use of a Unisys SafeGuard 30m Control resource allows automated failover and recovery by controlling the replication direction with an MSCS resource. The resource is used in this environment only.

Geographic Replication Environment

Unisys SafeGuard Solutions supports replication of data over Fibre Channel to local SAN-attached storage and over WAN to remote sites. It also allows failover to a secondary site and continued operations in the event of a disaster at the primary site. Unisys SafeGuard Solutions replicates data over any distance: within the same site (CDP), to another site halfway around the globe (CRR), or both (CLR).

Geographic Clustered Environment

In the geographic clustered environment, MSCS and cluster nodes are part of the environment. Figure 2-1 illustrates a basic geographic clustered environment that consists of two sites. In addition to server clusters, the typical configuration is made up of an RA cluster (RA 1 and RA 2) at each of the two sites. However, multiple RA cluster configurations are also possible.

Note: The dashed lines in Figure 2-1 represent the server WAN connections. To simplify the view, redundant and physical connections are not shown.
Figure 2-1. Basic Geographic Clustered Environment

Data Flow

Figure 2-2 shows the data flow in the basic system configuration for data written by the server. The system replicates the data in snapshot replication mode to a remote site. The data flow is divided into the following segments: write, transfer, and distribute.
Figure 2-2. Data Flow

Write

The flow of data for a write transaction is as follows:

1. The host writes data to the splitter (either on the host or the fabric), which immediately sends it to the RA and to the production site replication volume (storage system).

2. After receiving the data, the RA returns an acknowledgement (ACK) to the splitter. The storage system returns an ACK after successfully writing the data to storage.

3. The splitter sends an ACK to the host that the write operation has been completed successfully.

In snapshot replication mode, this sequence of events (steps 1 to 3) can be repeated multiple times before the snapshot is closed.

Transfer

The flow of data for transfer is as follows:

1. After processing the snapshot data (that is, applying the various compression techniques), the RA sends the snapshot over the WAN to its peer RA at the remote site.

2. The RA at the remote site writes the snapshot to the journal. At the same time, the remote RA returns an ACK to its peer at the production site.

Note: Alternatively, you can set an advanced policy parameter so that lag is measured to the journal. In that case, the RA at the target site returns an ACK to its peer at the source site only after it receives an ACK from the journal (step 3).

3. After the complete snapshot is written to the journal, the journal returns an ACK to the RA.
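As a conceptual illustration only (not product code), the ordering of acknowledgements in the write sequence can be sketched in Python. All names here are invented for the example:

```python
def write_sequence():
    """Model the ordering of messages for one host write (illustrative only)."""
    events = []
    events.append("host->splitter: write")    # step 1: host issues the write
    events.append("splitter->ra: copy")       # splitter sends a copy to the RA
    events.append("splitter->storage: copy")  # and to the production volume
    events.append("ra->splitter: ack")        # step 2: RA acknowledges receipt
    events.append("storage->splitter: ack")   # storage acknowledges its write
    events.append("splitter->host: ack")      # step 3: host sees the write complete
    return events

# The host is acknowledged only after both the RA and storage have acknowledged.
seq = write_sequence()
assert seq.index("splitter->host: ack") == len(seq) - 1
```

The point of the sketch is the ordering constraint: the host-visible completion is always last, after both downstream copies are acknowledged.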
Distribute

When possible, and unless instructed otherwise, the Unisys SafeGuard 30m solution proceeds at the first opportunity to distribute the image to the appropriate location on the storage system at the remote site. The logical flow of data for distribution is as follows:

1. The remote RA reads the image from the journal.

2. The RA reads existing information from the relevant remote replication volume.

3. The RA writes undo information (that is, information that can support a rollback, if necessary) to the journal.

Note: Steps 2 and 3 are skipped when the maximum journal lag policy parameter causes distribution to operate in fast-forward mode. (See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information.)

4. The RA writes the image to the appropriate remote replication volume.

Alternatives to the Basic System Architecture

The following are derivatives of the basic system architecture:

Fabric Splitter

An intelligent fabric switch can perform the splitting function instead of a Unisys SafeGuard Solutions host-based splitter installed on the host. In this case, the host sends a single write transaction to the switch on its way to storage. At the switch, however, the message is split, with a copy also sent to the RA (as shown in Figure 2-3). The system behaves the same way as it does when a Unisys SafeGuard Solutions host-based splitter on the host performs the splitting function.
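The undo information written in steps 2 and 3 of the distribute sequence is what makes rollback possible. A minimal sketch of that idea, using invented in-memory data structures (this is not the product's on-disk format):

```python
def distribute(volume, journal_image, undo_log):
    """Apply a journal image to a volume, saving undo records first."""
    for block, new_data in journal_image.items():
        undo_log.append((block, volume.get(block)))  # steps 2-3: save old contents
        volume[block] = new_data                     # step 4: write the image

def rollback(volume, undo_log):
    """Undo a distribution by replaying the undo log in reverse."""
    for block, old_data in reversed(undo_log):
        if old_data is None:
            volume.pop(block, None)  # block did not exist before distribution
        else:
            volume[block] = old_data

volume = {0: "aaa", 1: "bbb"}
undo = []
distribute(volume, {1: "BBB", 2: "ccc"}, undo)
rollback(volume, undo)
assert volume == {0: "aaa", 1: "bbb"}  # original image restored
```

In fast-forward mode, the undo step is skipped, which is why a rollback is no longer possible for images distributed that way.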
Figure 2-3. Data Flow with Fabric Splitter

Local Replication by CDP

You can use CDP to perform replication over short distances; that is, to replicate storage at the same site as CRR does over long distances. Operation of the system is similar to CRR, including the ability to use the journal to recover from a corrupted data image and the ability, if necessary, to fail over to the remote side or storage pool. In Figure 2-4, there is no WAN, the storage pools are part of the storage at the same site, and the same RA appears in each of the segments.
Figure 2-4. Data Flow in CDP

Note: The repository volume must belong to the remote-side storage pool.

Unisys SafeGuard Solutions supports a simultaneous mix of groups for remote and local replication. Individual volumes and groups, however, must be designated for either remote or local replication, but not both. Certain policy parameters do not apply for local replication by CDP.

Single RA

Note: Unisys SafeGuard Solutions does not support a single RA configuration (at both sites or at a single site).

Diagnostic Tools and Capabilities

The Unisys SafeGuard 30m solution offers the following tools and capabilities to help you diagnose and solve problems.

Event Log

The replication capability of the Unisys SafeGuard 30m solution records log entries in response to a wide range of predefined events. The event log records all significant events that have recently occurred in the system. Appendix E lists and explains the events.
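Because the event log holds only recent events (this guide later notes a capacity of 5000), it behaves like a bounded buffer that discards its oldest entries. A conceptual sketch in Python, with invented class and field names:

```python
from collections import deque

class EventLog:
    """Bounded event log: once full, the oldest entries are discarded."""
    def __init__(self, capacity=5000):
        self.entries = deque(maxlen=capacity)

    def record(self, event_id, description):
        self.entries.append((event_id, description))

    def recent(self, n=10):
        """Return up to the n most recent events, newest last."""
        return list(self.entries)[-n:]

log = EventLog(capacity=3)
for i in range(5):
    log.record(1000 + i, "sample event")
assert [e[0] for e in log.recent()] == [1002, 1003, 1004]  # oldest two dropped
```

The practical consequence for troubleshooting is the same as with the real log: collect logs promptly after a failure, before the window of retained events moves past the incident.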
Overview Each event is classified by an event ID. The event ID can be used to help analyze or diagnose system behavior, including identifying the trigger for a rolling problem, understanding a sequence of events, and examining whether the system performed the correct set of actions in response to a component failure. You can monitor system behavior by viewing the event log through the management console, by issuing CLI commands, or by reading RA logs. The exact period of time covered by the log varies according to the operational state of the environment during that period or, in the case of RA logs, the time period that was specified. The capacity of the event log is 5000 events. For problems that are not readily apparent and for situations that you are monitoring for failure, you can configure an e-mail notification to send all logs to you in a daily summary. Once you resolve the problem, you can remove the event notifications. See Configuring a Diagnostic E-mail Notification in this section to configure a daily summary of events. System Status The management console displays an immediate indication of any problem that interferes with normal operation of the Unisys SafeGuard 30m environment. If a component fails, the indication is accompanied by an error message that provides detailed information about the failure. You must log in to the management console to monitor the environment and to view events. The RAs are preconfigured with the users defined in Table 2 1. Table 2 1. User Types User Initial Password Permissions boxmgmt boxmgmt Install admin admin All except install and webdownload monitor monitor Read only webdownload webdownload webdownload SE Unisys(CSC) All except install and webdownload Note: The password boxmgmt is not used to log in to the management console; it is only used for SSH sessions. The CLI provides all users with status commands for the complete set of Unisys SafeGuard 30m components. 
You can use the information and statistics provided by these commands to identify bottlenecks in the system.
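When reviewing status or log output collected from the CLI (for example, output of the get_system_status command mentioned earlier), a short script can flag the lines that need attention. This is an illustrative sketch; the sample text below is invented and is not actual command output:

```python
def find_error_lines(text):
    """Return (line_number, line) pairs for lines that mention an error."""
    return [(i, line) for i, line in enumerate(text.splitlines(), 1)
            if "error" in line.lower()]

sample = "RA 1: OK\nVolume vol01: ERROR - not accessible\nRA 2: OK"
assert find_error_lines(sample) == [(2, "Volume vol01: ERROR - not accessible")]
```

A filter like this is most useful when comparing captures taken before and after a suspected failure window.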
E-mail Notifications

The e-mail notification mechanism sends specified event notifications (or alerts) to designated individuals. Also, you can set up an e-mail notification once a day that contains a daily summary of events.

Configuring a Diagnostic E-mail Notification

1. From the management console, click Alert Settings on the System menu.

2. Under Rules, click Add.

3. Using the diagnostic rule, select the appropriate topic, level, and type options.

Diagnostic Rule
This rule sends all messages on a daily basis to personnel of your choice.
Topics: All Topics
Level: Information
Scope: Detailed
Type: Daily

4. Under Addresses, click Add.

5. In the New Address box, type the e-mail address to which you would like event notifications sent. You can specify more than one e-mail address.

6. Click OK.

7. Repeat steps 4 through 6 for each additional e-mail recipient.

8. Click OK.

9. Click OK.

Installation Diagnostics

The Diagnostics menu of the Installation Manager provides a suite of diagnostic tools for testing the functionality and connectivity of the installed RAs and Unisys SafeGuard 30m components. Appendix C explains how to use the Installation Manager diagnostics.

Installation Manager is also used to collect RA logs and host splitter logs from one centralized location. See Appendix A for more information about collecting logs.

Host Information Collector (HIC)

The HIC collects extensive information about the environment, operation, and performance of any server on which a splitter has been installed. You can use the Installation Manager to collect logs across the entire environment, including RAs and all servers on which the HIC feature is enabled. The HIC can also be used at the server. See Appendix A for more information about collecting logs.
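The daily summary rule described above aggregates a day's events into one message. As a rough illustration of what such an aggregation involves (the event data and format below are invented, not the product's), a summary body might be assembled like this:

```python
def daily_summary(events):
    """Group a day's (level, event_id, text) events by level into a text summary."""
    by_level = {}
    for level, event_id, text in events:
        by_level.setdefault(level, []).append((event_id, text))
    lines = []
    for level in sorted(by_level):
        lines.append(f"{level} ({len(by_level[level])} events):")
        for event_id, text in by_level[level]:
            lines.append(f"  {event_id}: {text}")
    return "\n".join(lines)

events = [
    ("Info", 4001, "Group transfer started"),
    ("Warning", 3023, "Management connection lost"),
    ("Info", 4002, "Group transfer completed"),
]
print(daily_summary(events))
```

Grouping by level first makes warnings and errors stand out even on days with a large volume of informational events.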
Cluster Logs

In a geographic clustered environment, MSCS maintains logs of events for the clustered environment. Analyzing these logs is helpful in diagnosing certain problems. Appendix I explains how to analyze these logs.

Unisys SafeGuard 30m Collector

The Unisys SafeGuard 30m Collector utility enables you to easily collect various pieces of information about the environment that can help in solving problems. Appendix G describes this utility.

RA Diagnostics

Diagnostics specific to the RAs are available to aid in identifying problems. Appendix B explains how to use the RA diagnostics.

Hardware Indicators

Hardware problems (for example, RA disk failures or RA power problems) are identified by status LEDs located on the RAs themselves. Several indicators are explained in Section 8, Solving Replication Appliance (RA) Problems.

SNMP Support

The RAs support monitoring and problem notification using standard SNMP, including support for SNMPv3. You can issue SNMP queries to the agent on the RA. You can also configure the environment so that events generate SNMP traps that are then sent to designated hosts. Appendix F explains how to configure and use SNMP traps.

kutils Utility

The kutils utility is a proprietary server-based program that enables you to manage server splitters across all platforms. The command-line utility is installed automatically when the Unisys SafeGuard 30m splitter is installed on the application server. If the splitting function is not on a host but rather on an intelligent switch, the kutils utility is copied from the Splitter CD-ROM. (See the Unisys SafeGuard Solutions Planning and Installation Guide for more information.) Appendix H explains some kutils commands that are helpful in troubleshooting problems. See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for complete reference information on the kutils utility.
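The SNMP queries described above can be issued with standard tooling such as Net-SNMP from a management host. This is an illustrative sketch only, not product documentation: the IP address, SNMPv3 user name, and passphrases are placeholders, and only the standard MIB-II system subtree is queried.

```bat
rem Query standard MIB-II system information from the SNMP agent on an RA.
rem 10.10.10.10, "rauser", and both passphrases are placeholders.
snmpget -v3 -l authPriv -u rauser -A authpass123 -X privpass123 10.10.10.10 SNMPv2-MIB::sysDescr.0

rem Walk the system subtree to confirm that the agent responds.
snmpwalk -v3 -l authPriv -u rauser -A authpass123 -X privpass123 10.10.10.10 SNMPv2-MIB::system
```

A successful response to either command confirms basic SNMP connectivity to the RA before you configure trap destinations.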
Discovering Problems

Symptoms and notifications of problems appear in various ways with the Unisys SafeGuard 30m solution. The tools and capabilities described previously provide notifications for some conditions and events. Other problems are recognized from failures. Problems might be noted in the following ways:
- Problems with data because of a rolling disaster, which means that the site needs to use a previous snapshot to recover
- Problems with applications failing
- Inability to switch processing to the remote or secondary site
- Problems with the MSCS cluster (such as a failover to another cluster or site)
- Problems reported in an e-mail notification from an RA
- Problems reported in an SNMP trap notification
- Problems listed on the management console, as reported in the overall system status or in group state or properties
- Problems reported in the daily summary of events

In this guide, symptoms and notifications are often listed with potential problems. However, the messages and notifications vary based on the problem, and multiple events and notifications are possible at any given time.

Events That Cause Journal Distribution

Certain conditions might occur that can prevent access to the expected journal image. For instance, images might be flushed or distributed so that they are not available. Table 2-2 lists events that might cause the images to be unavailable. For tables listing all events, see Appendix E.

Table 2-2. Events That Cause Journal Distribution

Event ID  Level    Scope     Description                              Trigger
4042      Info     Detailed  Group deactivated.                       A user action deactivated the
                             (Group <group>, RA <RA>)                 group.
4062      Info     Detailed  Access enabled to latest image.          Access was enabled to the latest
                             (Group <group>, Failover site <site>)    image during automatic failover.
4097      Warning  Detailed  Maximum journal lag exceeded.            A fast-forward action started,
                             Distribution in fast-forward; older      causing the snapshots taken before
                             images removed from journal.             the fast-forward action to be lost
                             (Group <group>)                          and the maximum journal lag to be
                                                                      exceeded.
4099      Info     Detailed  Initializing in long                     The system started a long
                             resynchronization mode.                  resynchronization.
                             (Group <group>)
Troubleshooting Procedures

For troubleshooting, you must differentiate between problems that arise from changes in the environment (network changes such as cabling, routing, and port blocking; changes related to zoning or logical unit number (LUN) masking; other devices in the SAN; and storage failures) and problems that arise from misconfiguration or internal errors in the environment setup. Refer to the preceding diagrams as you consider the general troubleshooting procedures that follow.

Use the following four general tasks to help you identify symptoms and causes whenever you encounter a problem.

Identifying the Main Components and Connectivity of the Configuration

Knowledge of the main system components and the connectivity between these components is key to understanding how the entire environment operates. This knowledge helps you understand where the problem exists in the overall system context and can help you correctly identify which components are affected. Identify the following components:

- Storage device, controller, and the configuration of connections to the Fibre Channel (FC) switch
- Switch and port types, and their connectivity
- Network configuration (WAN and LAN): IP addresses, routing schemes, subnet masks, and gateways
- Participating servers: operating system, host bus adapters (HBAs), and connectivity to the FC switch
- Participating volumes: repository volumes, journal volumes, and replication volumes

Understanding the Current State of the System

Use the management console and the CLI get commands to understand the current state of the system:

- Is any component shown to be in an error state? If so, what is the error? Is the component down or disconnected from other components?
- What is the state of the groups, splitters, volumes, transfer, and distribution?
- Is the current state stable, or is it changing over intervals of time?
Verifying the System Connectivity

To verify the system connectivity, use physical and tool-based verification methods to answer the following questions:

- Are all the components physically connected? Are the activity or link lights active?
- Are the components connected to the correct switch or switches? Are they connected to the correct ports?
- Is there connectivity over the WAN between all appliances?
- Is there connectivity between the appliances on the same site over the management network?

Analyzing the Configuration Settings

Many problems occur because of improper configuration settings, such as improper zoning. Analyze the configuration settings to ensure they are not the cause of the problem.

- Are the zones properly configured? Splitter-to-storage? Splitter-to-RA? RA-to-storage? RA-to-RA?
- Are the zones in the switch configuration? Has the proper switch configuration been applied?
- Are the LUNs properly masked? Is the splitter masked to see only the relevant replication volume or volumes? Are the RAs masked to see the relevant replication volume or volumes, repository volume, and journal volume or volumes?
- Are the network settings (such as the gateway) for the RAs correct?
- Are there any possible IP conflicts on the network?
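The network-related questions above can be approached from a Windows host with standard commands. This is a generic sketch; the gateway address shown is a placeholder for your environment.

```bat
rem Display the IP address, subnet mask, gateway, and DNS configured on this host.
ipconfig /all

rem Verify that the default gateway (placeholder address) responds.
ping 192.168.1.1

rem List the ARP cache; different MAC addresses reported for the same
rem IP address over time can indicate an IP address conflict.
arp -a
```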
Section 3
Recovering in a Geographic Replication Environment

This section provides recovery procedures so that user applications can be brought online as quickly as possible in a geographic replication environment. An older image might be required to recover from a rolling disaster, human error, a virus, or any other failure that corrupts the latest snapshot image. Ensure that the image is tested prior to reversing the replication direction.

Complete the procedures for manual failover of volumes and data consistency groups for each group that needs to be moved. Refer to the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information on logged and virtual (with roll or without roll) access modes.

For specific environments, refer to the best practices documents listed under SafeGuard Solutions documentation on the Unisys Product Support Web site, www.support.unisys.com.

Manual Failover of Volumes and Data Consistency Groups

When you need to perform a manual failover of volumes and data consistency groups, complete the following tasks:

1. Accessing an image
2. Testing the selected image

Accessing an Image

1. From the management console, select any one of the data consistency groups in the navigation pane.
2. Select the Status tab (if it is not already open).
3. Perform the following steps to allow access to the target image:
   a. Right-click Consistency Groups, and select Bookmark Image.
   b. Select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   c. Right-click Consistency Groups and scroll down.
   d. Select the Remote Copy name and click Enable Image Access.
      The Enable Image Access dialog box appears.
   e. Select one of the following options:
      i. Select the latest image (for latest image access).
      ii. Select an image from the list (a bookmarked image can be selected from the list).
      iii. Specify desired point in time (a bookmarked image at the desired time can be selected).
   f. Click Next.

      The Image Access Mode dialog box appears.
   g. Select the option Logged access (physical) and click Next.

      The Summary screen displays the image name and the image access mode.
   h. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access
   i. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused is displayed at the bottom of the Status tab under the components pane.

Testing the Selected Image at Remote Site

Perform the following steps to test the selected image at the remote site:

1. Mount the volumes at the remote site using the mountvol utility provided by Windows. Enter the command

   mountvol <drive:> <path> <volume name>

2. Repeat step 1 for all volumes in the group.
3. Ensure that the selected image is valid:
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
4. If you have tested the validity of the image and the test is successful, skip to "Unmounting the Volumes at Production Site and Reversing Replication Direction." If the test is unsuccessful, continue with step 5.
5. To test a different image, perform the procedure "Unmounting the Volumes and Disabling the Image Access at Remote Site."
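Step 1 above uses the standard Windows mountvol utility. As a sketch (the drive letter is an example, and the volume GUID is a placeholder to be copied from the actual mountvol listing):

```bat
rem With no arguments, mountvol lists volume GUID paths and current mount points.
mountvol

rem Mount a replicated volume at drive E: using its volume GUID path.
rem The GUID below is a placeholder; copy the real one from the listing above.
mountvol E:\ \\?\Volume{00000000-0000-0000-0000-000000000000}\
```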
Unmounting the Volumes and Disabling the Image Access at Remote Site

1. Before choosing another image, unmount the volume using the following batch file. If necessary, modify the program files\kdriver path to fit your environment.

   @echo off
   cd "c:\program files\kdriver\kutils"
   "c:\program files\kdriver\kutils\kutils.exe" flushfs e:
   c:\windows\system32\mountvol.exe E:\ /P

2. Repeat step 1 for all volumes in the group.
3. Select one of the consistency groups in the navigation pane on the management console.
4. Right-click Consistency Groups and scroll down.
5. Select the Remote Copy name and click Disable Image Access.
6. Click Yes when the system prompts you to ensure that all group volumes are unmounted.
7. Repeat the procedures "Accessing an Image" and "Testing the Selected Image at Remote Site."

Unmounting the Volumes at Production Site and Reversing Replication Direction

Perform these steps at the host:

1. To unmount a volume at the production site, run the following batch file. If necessary, modify the program files\kdriver path to fit your environment.

   @echo off
   cd "c:\program files\kdriver\kutils"
   "c:\program files\kdriver\kutils\kutils.exe" flushfs e:
   c:\windows\system32\mountvol.exe E:\ /P

2. Repeat step 1 for all volumes in the group.

Perform these steps on the management console:

1. Select a consistency group from the navigation pane.
2. Right-click Group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
3. Select the Status tab. The status of the transfer must display Paused.
4. Select the Remote Copy name and scroll down.
5. Select Failover to <Remote Site Name>.
6. Click Yes when the system prompts you to confirm failover.
7. Ensure that the Start data transfer immediately check box is selected. The following warning message appears:
   Warning: Journal will be erased. Do you wish to continue?

8. Click Yes to continue.
Section 4
Recovering in a Geographic Clustered Environment

This section provides information and procedures that relate to geographic clustered environments running Microsoft Cluster Service (MSCS).

Checking the Cluster Setup

To ensure that the cluster configuration is correct, check the MSCS properties and the network bindings. For more detailed information, refer to Guide to Creating and Configuring a Server Cluster under Windows Server 2003, which you can download at

http://www.microsoft.com/downloads/details.aspx?familyid=96f76ed7-9634-4300-9159-89638f4b4ef7&displaylang=en

MSCS Properties

To check the MSCS properties, enter the following command from the command prompt:

Cluster /prop

Output similar to the following is displayed:

T  Cluster  Name                                     Value
-- -------- ---------------------------------------- -----------------------
M           AdminExtensions                          {4EC90FB0-D0BB-11CF-B5EF-00A0C90AB505}
D           DefaultNetworkRole                       2 (0x2)
S           Description
B           Security                                 01 00 14 80... (148 bytes)
B           Security Descriptor                      01 00 14 80... (148 bytes)
M           Groups\AdminExtensions
M           Networks\AdminExtensions
M           NetworkInterfaces\AdminExtensions
M           Nodes\AdminExtensions
M           Resources\AdminExtensions
M           ResourceTypes\AdminExtensions
D           EnableEventLogReplication                0 (0x0)
D           QuorumArbitrationTimeMax                 300 (0x12c)
D           QuorumArbitrationTimeMin                 15 (0xf)
D           DisableGroupPreferredOwnerRandomization  0 (0x0)
D           EnableEventDeltaGeneration               1 (0x1)
D           EnableResourceDllDeadlockDetection       0 (0x0)
D           ResourceDllDeadlockTimeout               240 (0xf0)
D           ResourceDllDeadlockThreshold             3 (0x3)
D           ResourceDllDeadlockPeriod                1800 (0x708)
D           ClusSvcHeartbeatTimeout                  60 (0x3c)
D           HangRecoveryAction                       3 (0x3)
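To narrow the output to the properties this guide asks you to verify, the listing can be filtered with the standard findstr command; a sketch:

```bat
rem Show only the quorum arbitration and recovery-related cluster properties.
cluster /prop | findstr /i "QuorumArbitration HangRecovery EnableEventLogReplication"
```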
If the properties are not set correctly, use one of the following sets of commands to correct the settings.

Majority Node Set Quorum

Cluster /prop HangRecoveryAction=3
Cluster /prop EnableEventLogReplication=0

Shared Quorum

Cluster /prop QuorumArbitrationTimeMax=300 (not for majority node set)
Cluster /prop QuorumArbitrationTimeMin=15
Cluster /prop HangRecoveryAction=3
Cluster /prop EnableEventLogReplication=0

Network Bindings

The following binding priority order and settings are suggested as best practices for clustered configurations. These procedures assume that you can identify the public and private networks by the connection names referenced in the steps.

Host-Specific Network Bindings and Settings

1. Open the Network Connections window.
2. On the Advanced menu, click Advanced Settings.
3. Select the Networks and Bindings tab. This tab shows the binding order in the upper pane and specific connection properties in the lower pane.
4. Verify that the public network connection is above the private network in the binding list in the upper pane. If it is not, follow these steps to change the order:
   a. Select a network connection in the binding list in the upper pane.
   b. Use the arrows to the right to move the network connection up or down in the list as appropriate.
5. Select the private network in the binding list. In the lower pane, verify that the File and Print Sharing for Microsoft Networks and the Client for Microsoft Networks check boxes are cleared for the private network.
6. Click OK.
7. Highlight the public connection, then right-click and click Properties.
8. Select Internet Protocol (TCP/IP) in the list, and click Properties.
9. Click Advanced.
10. Select the WINS tab.
11. Ensure that Enable LMHOSTS lookup is selected.
12. Ensure that Disable NetBIOS over TCP/IP is selected.
13. Repeat steps 7 through 12 for the private network connection.

Cluster-Specific Network Bindings and Settings

1. Open the Cluster Administrator.
2. Right-click the cluster (the top node in the tree structure in the left pane), and click Properties.
3. Select the Network Priority tab.
4. Ensure that the private network is at the top of the list and that the public network is below the private network. If it is not, follow these steps to change the order:
   a. Select the private network.
   b. Use the command button at the right to move the private network up in the list as appropriate.
5. Select the private network, and click Properties.
6. Verify that the Enable this network for cluster use check box is selected and that Internal cluster communications only (private network) is selected.
7. Click OK.
8. Select the public network, and click Properties.
9. Verify that the Enable this network for cluster use check box is selected and that All communications (mixed network) is selected.
10. Click OK.

Group Initialization Effects on a Cluster Move-Group Operation

The following conditions affect failover times for a cluster move-group operation. A cluster move-group operation cannot complete if a lengthy consistency group initialization, such as a full-sweep initialization, a long resynchronization, or an initialization from marking mode, is executing in the background. Review these conditions and plan accordingly.
Full-Sweep Initialization

A full-sweep initialization occurs when the disks on both sites are scanned or read in their entirety and a comparison is made, using checksums, to check for differences. Any differences are then replicated from the production site disk to the remote site disk. A full-sweep initialization generates an entry in the management console log.

A full-sweep initialization occurs in the following circumstances:

- Disabling or enabling a group

  Disabling a group causes all disk replication in the group to stop. A full-sweep initialization is performed once the group is enabled. The full-sweep initialization guarantees that the disks are consistent between the sites.

- Adding a new splitter server or host that has access to the disks in the group

  When a new splitter is added to the replication, there is a period before the splitter is added to the configuration during which activity from this splitter to the disks is not being monitored or replicated. To guarantee that no write operations were performed by the new splitter before the splitter was configured in the replication, a full-sweep initialization is required for all groups that contain disks accessed by this splitter. This initialization is done automatically by the system.

- Double failure of a main component

  When a double failure of a main component occurs, a full-sweep initialization is required to guarantee that consistency was maintained. The main components include the host, the replication appliance (RA), and the storage subsystem.

Long Resynchronization

A long resynchronization occurs when the data difference that needs to be replicated to the other site cannot fit on the journal volume. The data is split into multiple snapshots for distribution to the other site, and all the previous snapshots are lost.

Long resynchronization can be caused by long WAN outages, a group being disabled for a long time period, and other instances when replication has not been functional for a long time period. Long resynchronization is not connected with full-sweep initialization and can also happen during initialization from marking (see "Initialization from Marking Mode"). It depends only on the journal volume size and the amount of data to be replicated.

A long resynchronization is identified on the Status tab in the components pane under the remote journal bitmap in the management console. The status Performing Long Resync is visible for the group that is currently performing a long resynchronization.
Initialization from Marking Mode

All other instances of initialization in the replication are caused by marking. Marking mode refers to a replication mode in which the location of dirty, or changed, data is marked in a bitmap on the repository volume. This bitmap is a standard size (no matter how much data changes or what size disks are being monitored), so the repository volume cannot fill up during marking.

The replication moves to marking mode when replication cannot be performed normally, such as during WAN outages. Marking mode guarantees that all data changes are still being recorded until replication is functioning normally. When replication can perform normally again, the RAs read the dirty, or changed, data from the source disk based on the data recorded in the bitmap and replicate it to the disk on the remote site. The length of time for this process to complete depends on the amount of dirty, or changed, data as well as the performance of other components in the configuration, such as bandwidth and the storage subsystem.

A high-load state can also cause the replication to move to marking mode. A high-load state occurs when write activity to the source disks exceeds the limits that the replication, bandwidth, or remote disks can handle. Replication moves into marking mode at this time until the replication determines that the activity has reached a level at which it can continue normal replication. The replication then exits the high-load state, and an initialization from marking occurs. See Section 10, Solving Performance Problems, for more information on high-load conditions and problems.

Behavior of SafeGuard 30m Control During a Move-Group Operation

During a move-group operation, the Unisys SafeGuard 30m Control resource in a clustered environment behaves as follows. Be aware of this information when dealing with various failure scenarios.

1. MSCS issues an offline request because of a failure with a group resource (for example, a physical disk) or an MSCS move group. The request is sent to the Unisys SafeGuard 30m Control resource on the node that owns the group. The MSCS resources that are dependent on the Unisys SafeGuard 30m Control resource, such as physical disk resources, are taken offline first. Taking the resources offline does not issue any commands to the RA.
2. MSCS issues an online request to the Unisys SafeGuard 30m Control resource on the node to which a group was moved or, in the case of failure, to the next node in the preferred owners list.
3. When the resource receives an online request from MSCS, the Unisys SafeGuard 30m Control resource issues two commands to control the access to disks: initiate_failover and verify_failover.
Initiate_Failover Command

This command changes the replication direction from one site to another.

If a same-site failover is requested, the command completes successfully with no action performed by the RA. The resource then issues the verify_failover command to see whether the RA performed the operations successfully.

If a different-site failover is requested, the RA starts changing direction between sites and returns successfully. In certain circumstances, such as when the WAN is down or a long resynchronization occurs, the RA returns a failure. If the RA returns a failure to the Unisys SafeGuard 30m Control resource, the resource logs the failure in the Windows application event log and retries the command continuously until the cluster pending timeout is reached. When a move-group operation fails, check the application event log to view events posted by the resource. The event source of the event entry is the 30m Control.

Verify_Failover Command

This command enables the Unisys SafeGuard 30m Control resource to determine the time at which the change of the replication direction completes.

If a same-site failover is requested, the command completes successfully with no action performed by the RA.

If a different-site failover is requested, the verify_failover command returns a pending status until the replication direction changes. The change of direction takes from 2 to 30 minutes. When the verify_failover command completes, write access to the physical disk is enabled to the host from the RA and the splitter.

If the time to complete the verify_failover command is within the pending timeout, the Unisys SafeGuard 30m Control resource comes online, followed by all the resources dependent on this resource. All dependent disks come online using the default physical disk timeout of an MSCS cluster. The physical disk is available to the physical disk resource immediately; there is no delay. Physical disk access is available when the Unisys SafeGuard 30m Control resource comes online. You do not need to change the default resource settings for the physical disk. However, the physical disk must be dependent on the Unisys SafeGuard 30m Control resource.

If the time to complete the verify_failover command is longer than the pending timeout of the Unisys SafeGuard 30m Control resource, MSCS fails this resource. The default pending timeout for a Unisys SafeGuard 30m Control resource is 15 minutes (900 seconds). This timeout occurs before the cluster disk timeout.
If you use the default retry value of 1, this resource issues the following commands in sequence:

1. Initiate_failover
2. Verify_failover
3. Initiate_failover
4. Verify_failover

Using the default pending timeout, the Unisys SafeGuard 30m Control resource waits a total of 30 minutes to come online; this timeout period equals the timeout plus one retry (15 minutes each). If the resource does not come online, MSCS attempts to move the group to the next node in the preferred owners list and then repeats this process.

Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group

An older image might be required to recover from a rolling disaster, human error, a virus, or any other failure that corrupts the latest snapshot image. It is impossible to recover automatically to an older image using MSCS because automatic cluster failover is designed to minimize data loss: the Unisys SafeGuard 30m solution always attempts to fail over to the latest image.

Note: Manual image recovery is only for data consistency groups, not for the quorum group.

To recover a data consistency group using an older image, you must complete the following tasks:

- Take the cluster data group offline.
- Perform a manual failover of an auto-data (shared quorum) consistency group to a selected image.
- Bring the cluster group online and check the validity of the image.
- Reverse the replication direction of the consistency group.

Taking a Cluster Data Group Offline

To take a group offline in the cluster for which you are performing a manual recovery, complete the following steps:

1. Open Cluster Administrator on one of the nodes in the MSCS cluster.
2. Right-click the group that you want to recover and click Take Offline.
3. Wait until all resources in the group show the status as Offline.
Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image

1. Open the management console.
2. Select a consistency group from the navigation pane.

   Note: Do not select the quorum group. The data consistency group you select should be the cluster data group that you took offline.

3. Select the Policy tab in the selected consistency group.
4. Scroll down and select Stretch Cluster Support in the Policy tab.
5. Under Management Mode, select Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor [Manual (shared quorum) mode].
6. Click Apply.
7. Perform the following steps to access the image:
   a. Right-click Consistency Groups and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   b. Right-click Consistency Groups and scroll down.
   c. Select the Remote Copy name and click Enable Image Access.

      The Enable Image Access dialog box appears.
   d. Choose Select an image from the list and click Next.

      The Select Explicit Image dialog box appears and displays the available images.
   e. Select the desired image from the list and click Next.

      The Image Access Mode dialog box appears.
   f. Select the option Logged access (physical) and click Next.

      The Summary screen displays the image name and the image access mode.
   g. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access
   h. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused status appears at the bottom of the Status tab under the components pane.
Bringing a Cluster Data Group Online and Checking the Validity of the Image

1. Open the Cluster Administrator window on the management console.
2. Move the group to the node on the recovered site by right-clicking the group that you previously took offline and then clicking Move Group.

   If the cluster has more than two nodes, a list of possible owner target nodes appears. Select the node to which you want to move the group. If the cluster has only two nodes, the move starts immediately. Go to step 3.

3. Bring the group online by right-clicking the group name and then clicking Bring Online.
4. Ensure that the selected image is valid; that is, verify that
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
5. If you tested the validity of the image and the test completed successfully, skip to "Reversing the Replication Direction of the Consistency Group."
6. If the validity test of the image fails and you choose to test a different image, perform the following steps:
   a. To take the group offline, right-click the group name and then click Take Offline in the Cluster Administrator.
   b. Select one of the consistency groups in the navigation pane on the management console.
   c. Right-click Consistency Groups and scroll down.
   d. Select the Remote Copy name and click Disable Image Access.
   e. Click Yes when the system prompts you to ensure that all group volumes are unmounted.
7. Perform the following steps if you want to choose a different image:
   a. Right-click Consistency Groups and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
   b. Right-click Consistency Groups and scroll down.
   c. Select the Remote Copy name and click Enable Image Access.

      The Enable Image Access dialog box appears.
   d. Choose Select an image from the list and click Next.

      The Select Explicit Image dialog box appears and displays the available images.
   e. Select the desired image from the list and click Next.

      The Image Access Mode dialog box appears.
   f. Select the option Logged access (physical) and click Next.

      The Summary screen displays the image name and the image access mode.
   g. Click Finish.

      Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group. The following message appears during the process: Enabling log access
   h. Verify the target image name displayed below the bitmap in the components pane under the Status tab. Transfer: Paused status appears at the bottom of the Status tab under the components pane.
8. To bring the cluster group online, in the Cluster Administrator, right-click the group name and then click Bring Online.
9. Ensure that the selected image is valid. Verify that
   - All applications start successfully using the selected image.
   - The data in the image is consistent and valid.

   For example, you might want to test whether you can start a database application on this image. You might also want to run proprietary test procedures to validate the data.
10. If you tested the validity of the image and the test completed successfully, skip to "Reversing the Replication Direction of the Consistency Group."
11. If the image is not valid, repeat steps 6 through 9 as necessary.

Reversing the Replication Direction of the Consistency Group

1. Select Consistency Groups from the navigation pane.
2. Right-click the Group and select Pause Transfer. Click Yes when the system prompts that the group activity will be paused.
3. Select the Status tab. The transfer status must display Paused.
4. Select the Policy tab and expand the Advanced Settings (if they are not expanded).
5. Select Auto data (shared quorum) from the Global Cluster mode list.
6. Right-click Consistency Groups and select Failover to <Remote Site Name>.
7. Click Yes when the system prompts you to confirm failover.
8. Ensure that the Start data transfer immediately check box is selected. The following warning message appears: Warning: Journal will be erased. Do you wish to continue?
9. Click Yes to continue.

Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)

Problem Description

The following points describe the behavior of the components in this event:

- When the quorum group is running on the site where the RAs failed (site 1), the cluster nodes on site 1 fail because of lost quorum reservations, and cluster nodes on site 2 attempt to arbitrate for the quorum resource.
- To prevent a split-brain scenario, the RAs assume that the other site is active when a WAN failure occurs. (A WAN failure occurs if the RAs cannot communicate with at least one RA at the other site.) When the MSCS Reservation Manager on the surviving site (site 2) attempts the quorum arbitration request, the RA prevents access.
- Eventually, all cluster services stop, and manual intervention is required to bring up the cluster service.

Figure 4-1 illustrates this failure.
Symptoms

Figure 4-1. All RAs Fail on Site 1 (Site 1 Quorum Owner)

The following symptoms might help you identify this failure:

- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Surviving Host)

8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.

8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to the transaction log. Corruption may occur.

System Event Log for Usmv-West2 (Failure Host)

8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.

8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to the transaction log. Corruption may occur.
If you review the cluster log, you find messages similar to the following examples:

Cluster Log for Usmv-East2 (Surviving Host)

The cluster attempted arbitration five times before timing out; the following entries are recorded five times in the log:

00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (The requested resource is in use).
00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error 170.
00000638.00000b10::2008/02/02-20:36:18.273 ERR [FM] Failed to arbitrate quorum resource c336021a-083e-4fa0-9d37-7077a590c206, error 170.
00000638.00000b10::2008/02/02-20:36:18.273 ERR [RGP] Node 2: REGROUP ERROR: arbitration failed.
00000638.00000b10::2008/02/02-20:36:18.273 ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster service on this node).
00000684.000005a8::2008/02/02-20:37:53.473 ERR [JOIN] Unable to connect to any sponsor node.
00000684.000005a8::2008/02/02-20:38:06.020 ERR [FM] FmGetQuorumResource failed, error 170.
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] Failed to form cluster, status 5086 (The quorum disk could not be located by the cluster service).
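These cluster log entries share a fixed layout: a process ID and thread ID in hexadecimal, a timestamp, a severity keyword, and the message. When a log is long, filtering the ERR lines programmatically is faster than scanning by eye. The following Python sketch is a hypothetical helper; the regular expressions are inferred only from the excerpts shown here, not from a documented format:

```python
import re

# Line pattern inferred from the excerpts above (hypothetical, not a
# documented format): "pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message".
LINE_RE = re.compile(
    r"(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>ERR|WARN|INFO) (?P<msg>.*)"
)
# Many messages end with "error NNN" or "status NNN"; capture that number.
ERROR_RE = re.compile(r"(?:error|status) (\d+)")

def parse_errors(log_text):
    """Return (timestamp, error_code_or_None, message) for each ERR line."""
    results = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERR":
            continue
        code = ERROR_RE.search(m.group("msg"))
        results.append((m.group("ts"),
                        int(code.group(1)) if code else None,
                        m.group("msg")))
    return results

sample = (
    "00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk "
    "<Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.\n"
    "00000638.00000b10::2008/02/02-20:36:18.273 ERR [RGP] Node 2: "
    "REGROUP ERROR: arbitration failed."
)
for ts, code, msg in parse_errors(sample):
    print(ts, code)
```

The helper returns None for lines whose message carries no numeric error or status value, so repeated arbitration failures (error 170 in this case) stand out immediately.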
Cluster Log for Usmv-West2 (Failure Host)

00000d80.00000bbc::2008/02/02-20:31:21.257 ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000da0.00000130::2008/02/02-20:35:48.395 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status 170 (The requested resource is in use)
00000da0.00000130::2008/02/02-20:35:48.395 ERR [RM] LostQuorumResource, cluster service terminated...
00000da0.00000b80::2008/02/02-20:35:49.145 ERR Network Name <Cluster Name>: Unable to open handle to cluster, status 1753 (There are no more endpoints available from the endpoint mapper).
00000da0.00000c20::2008/02/02-20:35:49.145 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (The handle is invalid).
00000a04.00000a14::2008/02/02-20:37:23.456 ERR [JOIN] Unable to connect to any sponsor node.

The cluster attempted arbitration five times before timing out; the following entries are recorded five times in the log:

000001e4.00000598::2008/02/02-20:37:23.799 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (The resource is in use).
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk <Disk Q:>: [DiskArb] BusReset completed, status 31 (A device attached to the system is not functioning).
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to break reservation, error 31.
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [FM] FmGetQuorumResource failed, error 31.
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [INIT] Failed to form cluster, status 5086 (The
quorum disk could not be located by the cluster service).
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [CS] ClusterInitialize failed 5086
00000a04.00000a14::2008/02/02-20:37:25.846 ERR [CS] Service Stopped. exit code = 5086

Actions to Resolve the Problem

If all RAs on site 1 fail and site 1 owns the quorum resource, perform the following tasks to recover:

1. Disable MSCS on all nodes at the site with the failed RAs.
2. Perform a manual failover of the quorum consistency group.
3. Reverse the replication direction.
4. Start MSCS on a node on the surviving site.
5. Complete the recovery process.

Caution

Manual recovery is required only if the quorum device is lost because of a failure of an RA cluster. Before you bring the remote site online and before you perform the manual recovery procedure, ensure that MSCS is stopped and disabled on the cluster nodes at the production site (site 1 in this case). You must verify the server status with a network test. Improper use of the manual recovery procedure can lead to an inconsistent quorum disk and unpredictable results that might require a long recovery process.

Disabling MSCS

Stop MSCS on each node at the site where the RAs failed by completing the following steps:

1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service and click Stop.
3. Change the startup type to Disabled.
4. Repeat steps 1 through 3 for each node on the site.

Performing a Manual Failover of the Quorum Consistency Group

1. Connect to the Management Console by opening a browser to the management IP address of the surviving site. The management console can be accessed only from the site with a functional RA cluster because the WAN is down.
2. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum drive) in the navigation pane.
3. Select the Policy tab.
4. Under Stretch Cluster Support, select the option Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor, and then click Apply.
5. Right-click the Quorum Consistency Group and then select Pause Transfer. Click Yes when the system prompts you; the group activity is stopped.
6. Perform the following steps to allow access to the target image:

a. Right-click Consistency Groups and scroll down.
b. Select the Remote Copy name and click Enable Image Access. The Enable Image Access dialog box appears.
c. Choose Select an image from the list and click Next. The Select Explicit Image dialog box displays the available images.
d. Select the desired image from the list and then click Next. The Image Access Mode dialog box appears.
e. Select Logged access (physical) and click Next. The Summary screen shows the Image name and the Image Access mode.
f. Click Finish.

Note: This process might take a long time to complete depending on the value of the journal lag setting in the group policy of the consistency group.

g. Verify the target image name displayed below the bitmap in the components pane under the Status tab. The Transfer: Paused status appears under the bitmap in the Status tab.

Reversing Replication Direction

1. Select the Quorum Consistency Group in the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts you; the group activity is paused.
3. Select the Status tab. The transfer status must show Paused.
4. Right-click Consistency Groups and select Failover to <Remote Site>.
5. Click Yes when the system prompts you to confirm failover.
6. Ensure that the Start data transfer immediately check box is selected. The following warning message appears: Warning: Journal will be erased. Do you wish to continue?
7. Click Yes to continue.
Starting MSCS

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery setting is enabled. You can manually start MSCS on each node of the surviving site by performing the following steps:

1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service, and click Start. MSCS starts the cluster group and automatically moves all groups to the first-started cluster node.
3. Repeat steps 1 and 2 for each node on the site.

Completing the Recovery Process

To complete the recovery process, you must restore the global cluster mode property and start MSCS.

Restoring the Global Cluster Mode Property for the Quorum Group

Once the primary site is operational and you have verified that all nodes at both sites are online in the cluster, restore the failover settings by performing the following steps:

1. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum device) in the navigation pane.
2. Select the Policy tab.
3. Under Stretch Cluster Support, select Group is managed by 30m, Unisys SafeGuard Solutions can only monitor.
4. Click Apply.
5. Click Yes when the system prompts that the group activity will be stopped.

Enabling MSCS

Enable and start MSCS on each node at the site where the RAs failed by completing the following steps:

1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service and click Properties.
3. Change the startup type to Automatic.
4. Click Start.
5. Repeat steps 1 through 4 for each node on the site.
6. Open the Cluster Administrator and move the groups to the preferred node.
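Because the Services steps above must be repeated on every node, they are natural candidates for scripting. The sketch below is a hypothetical Python helper that generates the equivalent Windows command lines: `sc config` sets the startup type and `sc start` starts the service, and `clussvc` is the short service name of the Cluster service (logged as ClusSvc in the event logs shown earlier). The node names are placeholders; run the generated commands against each node from an administrative prompt.

```python
# Hedged sketch: generate the Windows commands that correspond to the
# manual Services steps (set startup type to Automatic, then start the
# Cluster service). "clussvc" is the Cluster service short name; the node
# names are placeholders, not names from this guide.
def mscs_enable_commands(nodes):
    """Return the sc.exe command lines to enable and start MSCS per node."""
    cmds = []
    for node in nodes:
        # sc.exe accepts a remote computer as \\node; note that sc.exe
        # requires a space after "start=".
        cmds.append(rf"sc \\{node} config clussvc start= auto")
        cmds.append(rf"sc \\{node} start clussvc")
    return cmds

for cmd in mscs_enable_commands(["NODE-A", "NODE-B"]):
    print(cmd)
```

Generating the commands rather than executing them keeps the sketch platform-neutral; on site, you would paste them into a prompt or feed them to a remote-execution tool.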
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and the RAs fail on site 1, all cluster nodes remain in a running state. All consistency groups remain at their respective sites because all disk accesses are successful. In this case, if the consistency group option Allow application to run even when Unisys SafeGuard Solutions cannot mark data was selected, data is stored on the replication volumes but the corresponding marking information is not written to the repository volume, so a full-sweep resynchronization is required following recovery. If the option was not selected, the splitter prevents access to disks when the RAs are not available to write marking data to the repository volume, and I/Os fail.

Figure 4-2 illustrates this failure.

Symptoms

Figure 4-2. All RAs Fail on Site 1 (Site 2 Quorum Owner)

The following symptoms might help you identify this failure:

- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:
System Event Log for Usmv-East2 Host (Surviving Site: Site 2)

8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" offline."

8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in Resource Group 'Group 0' failed.

8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-EAST2.

8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" online."

System Event Log for Usmv-West2 Host (Failure Site: Site 1)

8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" offline."

8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in Resource Group 'Group 0' failed.

8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-EAST2.

8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster Service brought the Resource Group ""Group 0"" online."

If you review the cluster log, you find messages similar to the following examples:

Cluster Log for Surviving Site (Site 2)

000005a0.00000fdc::2008/02/02-21:57:33.543 ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000ec8.000008b4::2008/02/02-22:09:03.139 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.
00000ec8.00000f48::2008/02/02-22:10:39.715 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed.
Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Cluster Log for Failure Site (Site 1)

0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_system_settings command failed. Error: (2685470674).
0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be performed because of an I/O device error).
0000033c.00000b8c::2008/02/02-22:10:08.168 ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_version command failed. Error: (2685470674).
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control <Data1>: KfGetKboxData: get_system_settings command failed. Error: (2685470674).
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be performed because of an I/O device error).
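The get_system_settings and get_version failures above report a raw vendor error value (2685470674). Values like this are usually easier to compare against documentation in hexadecimal form; a quick conversion:

```python
# Convert the decimal error value quoted in the cluster log to hexadecimal.
# (The meaning of the individual hex fields is not documented in this guide.)
code = 2685470674
print(hex(code))  # prints 0xa01107d2
```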
Actions to Resolve the Problem

If all RAs on site 1 fail and site 2 owns the quorum resource, you do not need to perform manual recovery. Because the surviving site owns the quorum consistency group, MSCS automatically restarts, and the data consistency group fails over on the surviving site.

Recovery When All RAs and All Servers Fail on One Site

The following two cases describe an event in which a complete site fails (for example, site 1) and all data I/O, cluster node communication, disk reservations, and so forth, stop responding. MSCS nodes on site 2 detect a network heartbeat loss and a loss of disk reservations, and try to take over the cluster groups that had been running on the failed nodes. There are two cases for recovering from this failure, based on which site owns the quorum group:

- The RAs and servers fail on site 1, and that site owns the quorum group.
- The RAs and servers fail on site 1, and site 2 owns the quorum group.

In the first case, manual recovery of MSCS is required, as described in the following topic, Site 1 Failure (Site 1 Quorum Owner). If the site can recover in an acceptable amount of time and the quorum owner does not reside on the failed site, do not perform manual recovery. The two cases respond differently and are solved differently based on where the quorum owner resides.

Site 1 Failure (Site 1 Quorum Owner)

Problem Description

In the first failure case, all nodes at site 1 fail, as well as the RAs. Because the RAs on the surviving site (site 2) cannot communicate over the communication networks, they assume that a WAN network failure has occurred and do not allow automatic failover of cluster resources; thus, the RAs fail quorum arbitration attempts initiated by nodes on the surviving site. MSCS attempts to fail over to a node at site 2. Because the quorum resource was owned by site 1, site 2 must be brought up by using the manual quorum recovery procedure.
Figure 4-3 illustrates this case.

Symptoms

Figure 4-3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)

The following symptoms might help you identify this failure:

- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Failure Site)

8/3/2008 10:46:01 AM ClusSvc Error Startup/Shutdown 1073 N/A USMV-EAST2 Cluster service was halted to prevent an inconsistency within the server cluster. The error code was 5892 (The membership engine requested shutdown of the cluster service on this node).

8/3/2008 10:46:00 AM ClusSvc Error Membership Mgr 1177 N/A USMV-EAST2 Cluster service is shutting down because the membership engine failed to arbitrate for the quorum device. This could be due to the loss of network connectivity with the current quorum owner. Check your physical network infrastructure to ensure that communication between this node and all other nodes in the server cluster is intact.

8/3/2008 10:47:40 AM ClusSvc Error Startup/Shutdown 1009 N/A USMV-EAST2 Cluster service could not join an existing server cluster and could not form a new server cluster. Cluster service has terminated.

8/3/2008 10:50:16 AM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
If you review the cluster log, you find messages similar to the following examples:

Cluster Log for Surviving Site (Site 2)

00000c54.000008f4::2008/02/02-17:13:31.901 ERR [NMJOIN] Unable to begin join, status 1717 (the NIC interface is unknown).
00000c54.000008f4::2008/02/02-17:13:31.901 ERR [CS] ClusterInitialize failed 1717
00000c54.000008f4::2008/02/02-17:13:31.917 ERR [CS] Service Stopped. exit code = 1717
00000be0.000008e0::2008/02/02-17:14:53.686 ERR [JOIN] Unable to connect to any sponsor node.
00000be0.000008e0::2008/02/02-17:14:56.374 ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
000001e0.00000bac::2008/02/02-17:16:37.563 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.
00000e8c.00000ea8::2008/02/02-17:30:20.275 ERR Physical Disk <Disk Q:>: [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0xe1e7208e new signature 0xe1e7208e, status 2 (the system cannot find the file specified).
00000e8c.00000ea8::2008/02/02-17:30:20.289 ERR Physical Disk <Disk Q:>: SCSI: Attach, error attaching to signature e1e7208e, error 2.
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [FM] FmGetQuorumResource failed, error 2.
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [INIT] Failed to form cluster, status 5086.
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [CS] ClusterInitialize failed 5086
000008e8.000008fc::2008/02/02-17:30:20.360 ERR [CS] Service Stopped. exit code = 5086
00000710.00000e80::2008/02/02-17:55:02.092 ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
000009cc.00000884::2008/02/02-17:55:12.413 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed.
Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Cluster Log for Failure Site (Site 1)

00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 2.
00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 2.
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 2.
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error 2.
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [FM] Failed to arbitrate quorum resource c336021a-083e-4fa0-9d37-7077a590c206, error 2.
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [RGP] Node 1: REGROUP ERROR: arbitration failed.
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [NM] Halting this node due to membership or communications error. Halt code = 1000
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster service on this node).
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Pending processing terminated by resource monitor.
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online.
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [FM] FmGetQuorumResource failed, error 2.
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] ClusterForm: Could not get quorum resource. No fixup attempted. Status = 5086
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] Failed to form cluster, status 5086.
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [CS] ClusterInitialize failed 5086
000008e4.000009e4::2008/02/02-17:29:20.602 ERR [CS] Service Stopped. exit code = 5086
000005b4.000008cc::2008/02/02-17:31:11.075 ERR [FM] FmpSetGroupEnumOwner:: MM returned MM_INVALID_NODE, chose the default target
00000ff4.000008d8::2008/02/02-17:31:19.901 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the Management Console that the WAN connection is operational.

Actions to Resolve the Problem

If all RAs and servers on site 1 fail and site 1 owns the quorum resource, perform the following tasks to recover:

1. Perform a manual failover of the quorum consistency group.
2. Reverse the replication direction.
3. Start MSCS.
4. Power on the site if a power failure occurred.
5. Restore the failover settings.

Note: Do not bring up any nodes until the manual recovery process is complete.

Caution

Manual recovery is required only if the quorum device is lost because of a failure of an RA cluster. If the cluster nodes at the production site are operational, you must disable MSCS; use the procedure in Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner). You must verify the server status with a network test or attempt to log in to the server. Improper use of the manual recovery procedure can lead to an inconsistent quorum disk and unpredictable results that might require a long recovery process.
Performing a Manual Failover of the Quorum Consistency Group

To perform a manual failover of the quorum consistency group, follow the procedure given in Actions to Resolve the Problem under Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner), earlier in this section.

Reversing Replication Direction

1. Select the Consistency Group from the navigation pane.
2. Right-click the group and select Pause Transfer. Click Yes when the system prompts you; the group activity is paused.
3. Select the Status tab. The transfer status must display Paused.
4. Right-click Consistency Groups and select Failover to <Remote Site Name>.
5. Click Yes when the system prompts you to confirm failover.
6. Ensure that the Start data transfer immediately check box is selected. The following warning message appears: Warning: Journal will be erased. Do you wish to continue?
7. Click Yes to continue.

Starting MSCS

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery setting is enabled. You can manually start MSCS on each node of the surviving site by completing the following steps:

1. In the Control Panel, point to Administrative Tools, and then click Services.
2. Right-click Cluster Service, and click Start. MSCS starts the cluster group and automatically moves all groups to the first-started cluster node.
3. Repeat steps 1 and 2 for each node on the site.

Powering On a Site

If a site experienced a power failure, power on the site in the following order:

1. Switches
2. Storage

Note: Wait until all switches and storage units are initialized before continuing to power on the site.

3. RAs

Note: Wait 10 minutes after you power on the RAs before you power on the hosts.

4. Hosts

Restoring the Global Cluster Mode Property for the Quorum Group

Once the primary site is again operational and you have verified that all nodes at both sites are online in the cluster, restore the failover settings by completing the following steps:

1. Click the Quorum Consistency Group (that is, the consistency group that holds the quorum drive) in the navigation pane.
2. Select the Policy tab.
3. Under Stretch Cluster Support, select Group is managed by 30m, Unisys SafeGuard Solutions can only monitor [Auto-quorum (shared quorum) mode].
4. Ensure that the Allow Regulation check box is selected.
5. Click Apply.

Site 1 Failure (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and a complete site failure occurs on site 1, a quorum failover is not required; only data groups on the failed site require failover. All data that is not mirrored and was in the failed RA cache is lost; the latest image on the remote site is used to recover. Cluster services remain up on all nodes on site 2, and cluster nodes fail on site 1. You cannot move a group to nodes on a site where the RAs are down (site 1). MSCS attempts to fail over to a node at site 2. An e-mail alert is sent stating that a site or RA cluster has failed.

Figure 4-4 illustrates this case.

Figure 4-4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)
Symptoms

The following symptoms might help you identify this failure:

- The management console display shows errors and messages similar to those for Total Communication Failure in a Geographic Clustered Environment in Section 7.
- If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-West2 (Failure Site)

8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2 "The Cluster Service failed to bring the Resource Group ""Cluster Group"" completely online or offline."

8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1203 N/A USMV-WEST2 "The Cluster Service is attempting to offline the Resource Group ""Cluster Group""."

8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2 Cluster service is requesting a bus reset for device \Device\ClusDisk0.

If you review the cluster log, you find messages similar to the following examples:

Cluster Log for Failure Site (Site 1)

00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170 (the requested resource is in use).
00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
00000e50.00000fb4::2008/02/02-20:52:05.133 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (the handle is invalid).

Cluster Log for Surviving Site (Site 2)

00000178.00000dd8::2008/02/02-20:49:30.976 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000178.00000dd8::2008/02/02-20:49:30.992 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumSendUpdate: GumQueueLocking update to node 1 failed with 1818 (The remote procedure call was cancelled).
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumpCommFailure 1818 communicating with node 1
00000178.00000810::2008/02/02-20:50:45.492 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6 (The handle is invalid).

Actions to Resolve the Problem

If all RAs and all servers on site 1 fail and site 2 owns the quorum resource, you do not need to perform manual recovery. Because the surviving site owns the quorum consistency group, MSCS automatically restarts, and the data consistency group fails over on the surviving site.
Section 5. Solving Storage Problems

This section lists symptoms that usually indicate problems with storage. Table 5-1 lists symptoms and the possible problems they indicate. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for possible problems. Also, messages similar to the e-mail notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display; review the management console logs for messages that have dropped off the display.

Table 5-1. Possible Storage Problems with Symptoms

Possible problem: User or replication volume not accessible
Symptoms:
- The system pauses the transfer for the relevant consistency group.
- The server cannot access this volume; writes to this volume fail; the file system cannot be mounted; and so forth.
- The management console shows an error for all connections to this volume, that is, all RAs on the relevant site and all splitters attached to this volume.

Possible problem: Repository volume not accessible
Symptoms:
- The system pauses the transfer for all consistency groups.
- The management console shows an error for all connections to this volume, that is, all RAs on the relevant site and all splitters attached to this volume.
- The event log reports that the repository volume is inaccessible.
- The event log indicates that the repository volume is corrupted.
Table 5-1. Possible Storage Problems with Symptoms (continued)

Possible problem: Journal not accessible
Symptoms:
- The management console shows an error for the connections between this volume and all RAs on the relevant site.
- The system pauses the transfer for the relevant consistency group.
- The event log indicates that the journal was lost or corrupted.

Possible problem: Total storage loss in a geographic replicated environment
Symptoms:
- No volumes from the relevant target and worldwide name (WWN) are accessible to any initiator on the SAN.

Possible problem: Storage failure on one site with quorum owner on failed site in a geographic clustered environment
Symptoms:
- The cluster regroup process begins, and the quorum device fails over to a site without failed storage.
- The management console shows a storage error, and replication has stopped.
- Servers report multipath software errors.

Possible problem: Storage failure on one site with quorum owner on surviving site in a geographic clustered environment
Symptoms:
- Applications that depend on physical disk resources go offline and fail when attempting to come online.
- Once the resource retry threshold parameters are reached, site 1 fails over to site 2. With the default settings, this takes about 30 minutes.
Table 5-2 lists specific storage volume failures and the types of errors and indicators on the management console that distinguish each failure.

Table 5-2. Indicators and Management Console Errors to Distinguish Different Storage Volume Failures

Data volume lost or failed:
- Groups paused: relevant data group
- System status: storage error
- Volumes tab: replication volume with error status
- Logs tab: error 3012

Journal volume lost, failed, or corrupt:
- Groups paused: relevant data group
- System status: storage error
- Volumes tab: journal volume with error status
- Logs tab: error 3012

Repository volume lost, failed, or corrupt:
- Groups paused: all
- System status: storage error and RA failure
- Volumes tab: repository volume with error status
- Logs tab: error 3014

User or Replication Volume Not Accessible

Problem Description
The replication volume is not accessible to any host or splitter.

Symptoms
The following symptoms might help you identify this failure:
- The management console shows an error for storage, and the Volumes tab (status column) shows additional errors. (See Figure 5-1.)
Figure 5-1. Volumes Tab Showing Volume Connection Errors

- Warnings and informational messages similar to those shown in Figure 5-2 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 5-2. Management Console Messages for the User Volume Not Accessible Problem
The following table explains the numbered messages in Figure 5-2.

Ref. 1, event ID 4003: Group capabilities problem, with the details showing that the RA is unable to access <group>. An e-mail notification is sent.
Ref. 2, event ID 3012: The RA is unable to access the volume. An e-mail notification is sent.

- The Groups tab on the management console shows that the system paused the transfer for the relevant consistency group. (See Figure 5-3.)

Figure 5-3. Groups Tab Shows Paused by System

- The server cannot access this volume; writes to this volume fail; the file system cannot be mounted; and so forth.

Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Determine whether other volumes from the same storage device are accessible to the same RAs, to rule out a total storage loss. If no volumes are seen by an RA, refer to "Total Storage Loss in a Geographic Replicated Environment."
- Verify that this LUN still exists and has not failed or been removed from the storage device.
- Verify that the LUN is masked to the proper splitter or splitters and RAs.
- Verify that other servers in the SAN do not use this volume. For example, if an MSCS cluster in the SAN acquired ownership of this volume, it might reserve the volume and block other initiators from seeing the volume.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.

Repository Volume Not Accessible

Problem Description
The repository volume is not accessible to any SAN-attached initiator, including the splitter and RAs.
Or, the repository volume is corrupted, either by another initiator because of storage changes or as a result of a storage failure. You must reformat the repository volume before replication can proceed normally.

Symptoms
The following symptoms might help you identify this failure:
- The management console shows an error for all connections to this volume, that is, all RAs on the relevant site and all splitters attached to this volume.
- The RAs tab on the management console shows errors for the volume. (See Figure 5-4.) The following error messages appear for the RAs error condition when you click Details:
Error: RA 1 in LA can't access repository volume
Error: RA 2 in LA can't access repository volume
The following error message appears for the storage error condition when you click Details:
Error: Repository volume can't be accessed by any RAs

Figure 5-4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors

- The Volumes tab on the management console shows an error for the repository volume, as shown in Figure 5-5.

Figure 5-5. Volumes Tab Shows Error for Repository Volume
- The Groups tab on the management console shows that the transfer is active for all consistency groups, as shown in Figure 5-6.

Figure 5-6. Groups Tab Shows All Groups Are Still Alive

- The Logs tab on the management console lists a message for event ID 3014. This message indicates that the RA is unable to access the repository volume or that the repository volume is corrupted. (See Figure 5-7.)

Figure 5-7. Management Console Messages for the Repository Volume Not Accessible Problem

Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Determine whether other volumes from the same storage device are accessible to the same RAs, to rule out a total storage loss. If no volumes are seen by an RA, refer to "Total Storage Loss in a Geographic Replicated Environment."
- Verify that this LUN still exists and has not failed or been removed from the storage device.
- Verify that the LUN is masked to the proper splitter or splitters and RAs.
- Verify that other servers in the SAN do not use this volume. For example, if an MSCS cluster in the SAN acquired ownership of this volume, it might reserve the volume and block other initiators from seeing the volume.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.
- If the volume is corrupted or you determine that it must be reformatted, perform the steps in "Reformatting the Repository Volume."
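The last verification, that each volume as configured in the management console still has the expected WWN and LUN, lends itself to scripting when your storage tools can export an inventory. The following Python sketch is illustrative only; the volume names, WWNs, and LUN numbers are hypothetical stand-ins for whatever your management console and storage array actually report.

```python
# Compare the WWN/LUN configured in the management console against what
# the storage device actually exposes. All identifiers below are
# hypothetical examples, not values from a real configuration.

configured = {
    # volume name -> (WWN, LUN) as configured in the management console
    "Data1_LA_1": ("60060160a1b21c00", 12),
    "Journal_LA": ("60060160a1b21c01", 13),
}

discovered = {
    # volume name -> (WWN, LUN) as reported by the storage device
    "Data1_LA_1": ("60060160a1b21c00", 12),
    "Journal_LA": ("60060160a1b21c01", 14),  # LUN renumbered on the array
}

def find_mismatches(configured, discovered):
    """Return (volume, problem) pairs for volumes whose configuration
    no longer matches what the storage device reports."""
    problems = []
    for name, (wwn, lun) in configured.items():
        if name not in discovered:
            problems.append((name, "volume not found on storage device"))
            continue
        d_wwn, d_lun = discovered[name]
        if d_wwn != wwn:
            problems.append((name, f"WWN changed: {wwn} -> {d_wwn}"))
        if d_lun != lun:
            problems.append((name, f"LUN changed: {lun} -> {d_lun}"))
    return problems

for volume, problem in find_mismatches(configured, discovered):
    print(f"{volume}: {problem}")
```

A mismatch reported by such a check points at the same root causes listed above: a removed or failed LUN, a renumbered LUN, or a masking change.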
Reformatting the Repository Volume

Before you begin the reformatting process in a geographic clustered environment, be sure that all groups are located at the site for which the repository volume is not to be formatted.

On RA 1 at the site for which the repository volume is to be formatted, determine from the Site Planning Guide which LUN is used for the repository volume. If the LUN is not recorded for the repository volume, a list is presented during the volume formatting process that shows LUNs and identifies the previously used repository volume.

Perform the following steps to reformat a repository volume for a particular site:

1. Click the Data Group in the management console, and perform the following steps:
a. Click Policy in the right pane.
b. Scroll down and select Stretch Cluster Support in the Policy tab.
c. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
d. Click Apply.
e. Right-click the Data Group and select Disable Group.
f. Click Yes when the system prompts for confirmation.
g. Click Yes when the system prompts that the copy activities will be stopped.
2. For geographic replication environments, skip to step 6.
3. Perform the following steps for geographic clustered environments:
a. Open the Group Policy window for the quorum group.
b. Click Policy in the right pane.
c. Scroll down and select Stretch Cluster Support in the Policy tab.
d. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
e. Click Apply.
4. Right-click the Consistency Group and select Disable Group.
5. Click Yes when the system prompts that the copy activities will be stopped.
6. Select the Splitters tab, and perform the following steps:
a. Open the Splitter Properties window for the splitter.
b. Select all the attached volumes.
c. Click Detach and then click Apply.
d. Click OK to close the window.
e. Delete the splitter at the site for which the repository volume is to be reformatted.
7. Open a PuTTY session on RA 1 for the site, and perform the following steps:
a. Log on with boxmgmt as the user ID and boxmgmt as the password. The Main menu is displayed.
b. At the prompt, type 4 (Cluster Operation) and press Enter.
c. Type 2 (Detach from cluster) at the Cluster Operations menu.
d. Type y when prompted for confirmation.
e. Type b to go back to the Setup menu.
f. On the Setup menu, type 2 (Configure repository volume) and press Enter.
g. Type 1 (Format repository volume) and press Enter.
h. Enter the appropriate number from the list to select the LUN. Ensure that the WWN and LUN are for the volume that you want to format. The LUN and identifier are displayed.
i. Confirm the volume to format. All data is removed from the volume.
j. Verify that the operation succeeds and press Enter.
k. On the Main menu, type Q (quit) and press Enter.
8. Open a PuTTY session on each additional RA at the site for which the repository volume is to be formatted.
9. Log on with boxmgmt as the user ID and boxmgmt as the password. The Main menu is displayed. Then perform the following steps:
a. At the prompt, type 2 (Setup) and press Enter.
b. On the Setup menu, type 2 (Configure repository volume) and press Enter.
c. Type 2 (Select a previously formatted repository volume) and press Enter.
d. Enter the appropriate number from the list to select the LUN. Ensure that the WWN and LUN are for the volume that you want to select. The LUN and identifier are displayed.
e. Confirm the volume selection.
f. Verify that the operation succeeds and press Enter.
g. On the Main menu, type Q (quit) and press Enter.
Note: Complete step 9 for each additional RA at the site.
10. On the management console, select the Splitters tab, and perform the following steps:
a. Click the Add New Splitter icon to open the Add splitter window.
b. Click Rescan and select the splitter.
11. Open the Group Properties window, click the Policy tab, and perform the following steps for each data group:
a. Scroll down and select Stretch Cluster Support in the Policy tab.
b. Under Management Mode, select "Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor."
c. Click Apply.
d. Right-click the Data Group and click Enable Group.
12. For geographic replication environments, skip to step 16.
13. Perform the following steps for geographic clustered environments:
a. Right-click the Quorum Group and click Enable Group.
b. Click the Quorum Group and select Policy in the right pane.
c. Scroll down and select Stretch Cluster Support in the Policy tab.
d. Under "This consistency group works with," select the "This is the quorum group" check box.
e. Under Management Mode, select "Group is managed by 30m, Unisys SafeGuard Solutions can only monitor."
f. Click Apply.
14. Verify that initialization completes for all the groups.
15. Review the management console event log.
16. Ensure that no storage error or other component error appears.

Journal Not Accessible

Problem Description
The journal is not accessible to either RA, or a journal for one of the consistency groups is corrupted. The corruption is caused by another initiator because of storage changes, or results from a storage failure. Because the snapshot history is corrupted, replication for the relevant consistency group cannot proceed.

Symptoms
The following symptoms might help you identify this failure:
- The Volumes tab on the management console shows an error for the journal volume. (See Figure 5-8.)
Figure 5-8. Volumes Tab Shows Journal Volume Error

- The RAs tab on the management console shows errors for connections between this volume and the RAs. (See Figure 5-9.)

Figure 5-9. RAs Tab Shows Connection Errors

- The Groups tab on the management console shows that the system paused the transfer for the relevant consistency group, as shown in Figure 5-10.

Figure 5-10. Groups Tab Shows Group Paused by System

- The Logs tab on the management console lists a message for event ID 3012. This message indicates that the RA is unable to access the volume. (See Figure 5-11.)
Figure 5-11. Management Console Messages for the Journal Not Accessible Problem

Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Determine whether other volumes from the same storage device are accessible to the same RAs, to rule out a total storage loss. If no volumes are seen by an RA, refer to "Total Storage Loss in a Geographic Replicated Environment."
- Verify that this LUN still exists on the storage device and that it is masked only to the RAs.
- Verify that the volume has read and write permissions on the storage system.
- Verify that the volume, as configured in the management console, has the expected WWN and LUN.
- For a corrupted journal, check that the system recovers automatically by re-creating the data structures for the corrupted journal and that the system then initiates a full-sweep resynchronization. No manual intervention is needed.

Journal Volume Lost Scenarios

Problem Description
The journal volume is lost, and journal data is not available, in the following scenarios:
1. Data is written to the journal volume faster than journal data is distributed to the replication volume. The journal volume fills, and a subsequent attempt to write to it results in journal data loss.
2. The user performs the following operations: failover, then recover production.
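Scenario 1 is a throughput problem: the journal fills whenever incoming writes outpace distribution to the replication volume. The following Python sketch gives a back-of-the-envelope estimate of how long a journal can absorb such a burst; the capacity and rates are hypothetical examples, not recommendations.

```python
def hours_until_journal_full(journal_gb, write_mb_s, distribute_mb_s):
    """Estimate how long until the journal fills when writes arrive
    faster than journal data is distributed to the replication volume.
    Returns None when distribution keeps up (the journal drains)."""
    backlog_mb_s = write_mb_s - distribute_mb_s
    if backlog_mb_s <= 0:
        return None
    # Convert journal capacity to MB, divide by net fill rate, then to hours.
    return journal_gb * 1024 / backlog_mb_s / 3600

# Hypothetical example: a 100 GB journal, 30 MB/s of incoming writes,
# and 20 MB/s of distribution to the replication volume.
t = hours_until_journal_full(100, 30, 20)
print(f"journal full in about {t:.1f} hours")
```

An estimate like this is one way to sanity-check the journal lag configuration mentioned under "Actions to Resolve" below the scenarios.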
Actions to Resolve
You can minimize the occurrence of this problem in scenario 1 by carefully configuring the journal lag. The problem is unavoidable in scenario 2.

Total Storage Loss in a Geographic Replicated Environment

Problem Description
All volumes belonging to a certain storage target and WWN (or controller, device) have been lost.

Symptoms
The following symptoms might help you identify this failure:
- The symptoms can be the same as those from any of the volume failure problems listed previously (or a subset of those symptoms), if the symptoms are relevant to the volumes that were used on this target. All volumes common to a particular storage array have failed.
- The Volumes tab on the management console shows errors for all volumes. (See Figure 5-12.)
Figure 5-12. Management Console Volumes Tab Shows Errors for All Volumes

- No volumes from the relevant target and WWN are accessible to any initiator on the SAN, as shown on the RAs tab on the management console. (See Figure 5-13.)

Figure 5-13. RAs Tab Shows Volumes That Are Not Accessible

- Multipathing software (such as EMC PowerPath Administrator) reports failed paths to the storage device, as shown in Figure 5-14.
Figure 5-14. Multipathing Software Reports Failed Paths to Storage Device

Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Verify that the storage device has not experienced a power outage and that the device is functioning normally according to all external indicators.
- Verify that the Fibre Channel switch and the storage device indicate an operating Fibre Channel connection (that is, the relevant LEDs show OK). If the indicators are not OK, the problem might be a faulty Fibre Channel port (storage, switch, or patch panel) or a faulty Fibre Channel cable.
- Verify that the initiator can be seen from the switch name server. If not, the problem could be a Fibre Channel port or cable problem (as in the preceding item). Otherwise, the problem could be a misconfiguration of the port on the switch (for example, the type or speed could be wrong).
- Verify that the target WWN is included in the relevant zones (that is, hosts and RA). Verify also that the current zoning configuration is the active configuration. If you use the default zone, verify that it is set to permit by default.
- Verify that the relevant LUNs still exist on the storage device and are masked to the proper splitters and RAs.
- Verify that the volumes have read and write permissions on the storage system.
- Verify that these volumes are exposed to and managed by the proper hosts and that no other hosts on the SAN use these volumes.

Storage Failure on One Site in a Geographic Clustered Environment

In a geographic clustered environment where MSCS is running, if the storage subsystem on one site fails, the symptoms and resulting actions depend on whether the quorum owner resided on the failed storage subsystem.
To understand the two scenarios and to follow the actions for both possibilities, review Figure 5-15.

Figure 5-15. Storage on Site 1 Fails

Storage Failure on One Site with Quorum Owner on Failed Site

Problem Description
In this case, the cluster quorum owner as well as the quorum resource resides on the failed storage subsystem. The quorum and resources automatically fail over to the node that gains control through MSCS arbitration. This node resides on the site without the storage failure. The RAs use the last available image. This action results in a loss of any data that has yet to be replicated. The resources cannot fail back to the failed site until the storage subsystem is restored.

Symptoms
The following symptoms might help you identify this failure:
- A node on which the cluster was running might report a delayed write failure or similar error.
- The quorum reservation is lost, and MSCS stops on the cluster node that owned the quorum resource. This action triggers a cluster regroup process, which allows other cluster nodes to arbitrate for the quorum device. Figure 5-16 shows typical listings for the cluster regroup process.

Figure 5-16. Cluster Regroup Process
- Cluster nodes located on the failed storage subsystem fail quorum arbitration because the service cannot provide a reservation on the quorum volume. The resources fail over to the site without a storage failure. The first cluster node on the site without the storage failure that successfully completes arbitration of the quorum device assumes ownership of the cluster. The following messages illustrate this process.

Cluster Log Entries
INFO Physical Disk <Disk Q:>: [DiskArb]------- DisksArbitrate -------.
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Disk unique id present trying new attach
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number from ClusDisk registry key
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to PhysicalDrive7.
INFO Physical Disk <Disk Q:>: [DiskArb] DisksOpenResourceFileHandle: Returns success.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitration Parameters: ArbAttempts 5, SleepBeforeRetry 500 ms.
INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info to insure the disk is accessible.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature b876c301.
ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate for ownership of the disk by reading/writing various disk sectors.
ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
INFO Physical Disk <Disk Q:>: [DiskArb] We are about to break reserve.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing BusReset on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info from the disk to insure disk is accessible.
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 0.
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO [RGP] Node 2: RGP Send packets: 0x3, 0xc0004000, 0x40004000, 0x0.
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 11) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 12) [ES3120-X64:0] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful write (sector 12) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Successful read (sector 11) [ES3120-X64:1] (0,4cbd785d:01c96d8e).
INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature b876c301.
INFO Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 0.
WARN Physical Disk <Disk Q:>: [DiskArb] Assume ownership of the device.
INFO Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine starts.
INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 0.

In Cluster Administrator, the groups that were online on one node change to the node that wins arbitration, as shown in Figure 5-17.
Figure 5-17. Cluster Administrator Displays

- Multipathing software, if present, reports errors on the host servers of the site for which the storage subsystem failed. Figure 5-18 shows errors for failed storage devices.

Figure 5-18. Multipathing Software Shows Server Errors for Failed Storage Subsystem
Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Verify that all cluster resources failed over to a node on the site for which the storage subsystem did not fail and that these resources are online. If the cluster is running and no additional errors are reported, the problem has probably been isolated to a total site storage failure.
- Log in to the storage subsystem, and verify that all LUNs are present and configured properly. If the storage subsystem appears to be operating, the problem is most likely a failed SAN switch. See "Total SAN Switch Failure on One Site in a Geographic Clustered Environment" in Section 6.
- Resolve the failure of the storage subsystem before attempting failback. Once the storage subsystem is working and the RAs and host can access it, a full initialization is initiated.

Storage Failure on One Site with Quorum Owner on Surviving Site

Problem Description
In this case, the cluster quorum owner does not reside on the failed storage subsystem, but other resources do. The cluster resources fail over to a site without a failed storage subsystem. The RAs use the last available image. This action results in a loss of data that has yet to be replicated (if replication is not synchronous). The resources cannot fail back to the failed site until the storage subsystem is restored.

Symptoms
The following symptoms might help you identify this failure:
- The cluster marks the data groups containing the physical disk resources as failed.
- Applications dependent on the physical disk resources go offline.
- Failed resources attempt to come online on the failed site but fail. Then the resources fail over to the site with a valid storage subsystem.

Actions to Resolve
Perform the following actions to isolate and resolve the problem:
- Verify that multipathing software, if present, reports errors on the host servers at the site with the suspected failed storage subsystem.
(See Figure 5-19.)
- Verify in Cluster Administrator that all cluster resources failed over to site 2. Entries similar to the following occur in the cluster log for a host at the site with a failed storage subsystem (thread ID and timestamp removed).
Cluster Log

Disk reservation lost:
ERR Physical Disk <Disk R:>: [DiskArb] CompletionRoutine: reservation lost! Status 1167

Arbitrate for disk:
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is called.
INFO Physical Disk <Disk R:>: [DiskArb] Stopping reservation thread.
ERR Physical Disk <Disk R:>: [DiskArb] Failed to read (sector 12), error 1168.
ERR Physical Disk <Disk R:>: [DiskArb] Error cleaning arbitration sector, error 1168.
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is complete.
INFO Physical Disk <Disk R:>: [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with signature 42b77e24
INFO Physical Disk <Disk R:>: [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0x42b77e24 new signature 0x42b77e24, status 2
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is called.
INFO Physical Disk <Disk R:>: [DiskArb] StopPersistentReservations is complete.

Control goes offline at failed site:
INFO [FM] FmpDoMoveGroup: Entry
WARN [FM] FmpHandleResourceTransition: Resource failed, post a work item
INFO [FM] FmpMoveGroup: Entry
INFO [FM] FmpMoveGroup: Moving group c3cb79bc-b92f-427c-ac9a-2c474b81f6da to node 1 (1)
INFO [FM] FmpOfflineResource: Disk R: depends on Data1LANY. Shut down first.
INFO Unisys SafeGuard 30m Control <Data1LANY>: KfResourceOffline: Resource 'Data1LANY' going offline.

After trying other nodes at the failed site, move to remote site:
INFO [FM] FmpMoveGroup: Take group c3cb79bc-b92f-427c-ac9a-2c474b81f6da request to remote node 1

Move succeeds:
INFO [FM] FmpMoveGroup: Exit group <DiskR>, status = 0
INFO [FM] New owner of Group c3cb79bc-b92f-427c-ac9a-2c474b81f6da is 1, state 0, curstate 0.
INFO [GUM] s_gumupdatenode: completed update seq 443 type 0 context 9
INFO [FM] FmpDoMoveGroup: Exit, status = 0
INFO [FM] FmpDoMoveGroupOnFailure: FmpDoMoveGroup returns 0
INFO [FM] FmpDoMoveGroupOnFailure Exit.
- Log in to the failed storage subsystem and determine whether the storage reports failed or missing disks. If the storage subsystem appears to be fine, the problem is most likely a SAN switch failure. See "Total SAN Switch Failure on One Site in a Geographic Clustered Environment" in Section 6.
- Once the storage for the site that failed is back online, a full sweep is initiated. Check that the messages "Starting volume sweep" and "Starting full sweep" are displayed as Notice events.
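The final check, confirming that the "Starting volume sweep" and "Starting full sweep" notices appeared, is easy to automate if you export the event log as text. A minimal Python sketch follows; the exported log lines shown are hypothetical.

```python
# The two notices the text above says should appear after storage recovery.
EXPECTED_SWEEP_MESSAGES = ("Starting volume sweep", "Starting full sweep")

def missing_sweep_messages(log_lines):
    """Return the expected sweep notices that do NOT appear anywhere
    in the exported event-log lines."""
    return [
        msg for msg in EXPECTED_SWEEP_MESSAGES
        if not any(msg in line for line in log_lines)
    ]

# Hypothetical exported event-log lines after storage is back online.
log = [
    "Notice: Starting volume sweep",
    "Notice: Starting full sweep",
]
print(missing_sweep_messages(log))
```

An empty result means both notices were found; any returned message indicates the corresponding sweep has not been observed and the recovery should be re-checked.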
Section 6. Solving SAN Connectivity Problems

This section lists symptoms that usually indicate problems with connections to the storage subsystem. Table 6-1 lists symptoms and the possible problems they indicate. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for possible problems. Also, messages similar to e-mail notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 6-1. Possible SAN Connectivity Problems

Possible problem: Volume not accessible to RAs
Symptoms:
- The system pauses the transfer. If the volume is accessible to another RA, a switchover occurs, and the relevant groups start running on the new RA.
- The relevant message appears in the event log.
- The link to the volume from the disconnected RA or RAs shows an error.
- The volume is accessible to the splitters that are attached to it.

Possible problem: Volume not accessible to SafeGuard 30m splitter
Symptoms:
- The system pauses the transfer for the relevant groups.
- If the volume is not accessible, the management console shows an error for the splitter.
- If a replication volume is not accessible, the splitter connection to that volume shows an error.
Table 6-1. Possible SAN Connectivity Problems (continued)

Possible problem: RAs not accessible to SafeGuard 30m splitter
Symptoms:
- The system pauses the transfer for the relevant group or groups. If the connection with only one of the RAs is lost, the group or groups can restart the transfer by means of another RA, beginning with a short initialization.
- The splitter connection to the relevant RAs shows an error.
- The relevant message describes the lost connection in the event log.

Possible problem: Server unable to connect with SAN (see "Server Unable to Connect with SAN" in Section 9; this problem is not described in this section)
Symptoms:
- The management console shows a server down.
- Messages on the management console show that the splitter is down and that the node fails over.
- Multipathing software (such as EMC PowerPath Administrator) messages report an error.

Possible problem: Total SAN switch failure on one site in a geographic clustered environment
Symptoms:
- Cluster nodes fail, and the cluster regroup process begins.
- Applications fail and attempt to restart.
- Messages regarding failed physical disks are displayed on the management console.
- The cluster resources fail over to the remote site.

Volume Not Accessible to RAs

Problem Description
A volume (repository volume, replication volume, or journal) is not accessible to one or more RAs, but it is accessible to all other relevant initiators, that is, the splitter.

Symptoms
The following symptoms might help you identify this failure:
- The system pauses the transfer. If the volume is accessible to another RA, a switchover occurs, and the relevant group or groups start running on the new RA.
- The management console displays failures similar to those in Figure 6-1.

Figure 6-1. Management Console Showing Inaccessible Volume Errors

- Warnings and informational messages similar to those shown in Figure 6-2 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-2. Management Console Messages for Inaccessible Volumes
The following table explains the numbered messages shown in Figure 6-2.

Ref. 1, event ID 4003: For each consistency group, the surviving site reports a group capabilities problem.
Ref. 2, event ID 4044: The group is deactivated indefinitely by the system. E-mail notifications are sent immediately and in the daily summary.
Ref. 3, event ID 3012: The RA is unable to access the volume (RA1, Data1_LA_1). An e-mail notification is sent.
Ref. 4, event ID 4003: For each consistency group, the site reports a group capabilities problem. An e-mail notification is sent.
Ref. 5, event ID 5049: Splitter writer to RA failed. An e-mail notification is sent.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)
1/4/2009 2:07:04 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
1/4/2009 2:07:04 AM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.

System Event Log for ES3120-X64 Host (Host on Surviving Site)
1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 3'.
1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 4'.
1/4/2009 5:07:05 AM ClusDisk Error None 1209 N/A ES3120-X64 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/4/2009 5:07:05 AM ClusSvc Warning Node Mgr 1135 N/A ES3120-X64 Cluster node USMV-X460 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
1/4/2009 5:07:05 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 The Cluster Service is attempting to bring online the Resource Group "Cluster Group".
1/4/2009 5:07:05 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "Cluster Group" online.
1/4/2009 5:07:05 AM Service Control Manager Information None 7036 N/A USMV-X460 The Cluster Service entered the running state.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:
Cluster Log for USMV-X460 Host (Host on Failure Site)

000000a4.00000110::2009/01/04-10:07:04.042 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status 2
000000a4.00000110::2009/01/04-10:07:04.042 ERR [RM] LostQuorumResource, cluster service terminated...
000000a4.000000a8::2009/01/04-10:07:30.040 ERR Network Name <Cluster Name>: Unable to open handle to cluster, status 1753.
000000a4.0000088c::2009/01/04-10:07:30.040 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.
000000a4.000000a8::2009/01/04-10:07:30.040 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 2
000000a4.000000a8::2009/01/04-10:07:30.040 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector, error 2.

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

00000300.00000644::2009/01/04-10:07:05.914 INFO [ClMsg] Received interface unreachable event for node 2 network 1
00000300.00000644::2009/01/04-10:07:05.914 INFO [ClMsg] Received interface unreachable event for node 2 network 2
00000300.0000072c::2009/01/04-10:07:34.101 WARN [NM] Interface 1db021ff-a472-4df2-97fe-77fda4dc1a38 is unavailable (node: USMV-X460, network: Local Area Connection 3).
00000300.000000e8::2009/01/04-10:07:34.101 WARN [NM] Interface 280245fc-1fd0-4fc9-b7b0-a2355ca47f75 is unavailable (node: USMV-X460, network: Local Area Connection 4).

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

- Verify that the physical connection between the inaccessible RAs and the Fibre Channel switch is healthy.
- Verify that any disconnected RA appears in the name server of the Fibre Channel switch. If not, the problem could be caused by a bad port on the switch, a bad host bus adapter (HBA), or a bad cable.
- Verify that any disconnected RA is present in the proper zone and that the current zoning configuration is enabled.
- Verify that the correct volume is configured (WWN and LUN). To double-check, enter the Create Volume command in the management console and verify that the same volume does not appear on the list of volumes that are available to be created.
- If the volume is not accessible to the RAs but is accessible to a splitter, and the server on which that splitter is installed is clustered using MSCS, Oracle RAC, or any other software that uses a reservation method, the problem probably occurs because the server has reserved the volume.

For more information about the clustered environment installation process, see the Unisys SafeGuard Solutions Planning and Installation Guide and the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.
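When reviewing collected logs for the time of a failure, as described above, a small filter script can narrow hundreds of exported event-log lines down to the failure window. This is a minimal sketch, not part of the product: the sample lines are taken from the examples in this section, and the assumption that the collected log was saved as plain text with the timestamp at the start of each line is mine.

```python
from datetime import datetime, timedelta

def entries_near(lines, failure_time, window_minutes=5):
    """Return exported event-log lines stamped within +/- window_minutes of failure_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            # Timestamps look like "1/4/2009 2:07:04 AM" in the exported log.
            stamp = datetime.strptime(" ".join(parts[:3]), "%m/%d/%Y %I:%M:%S %p")
        except ValueError:
            continue  # line does not start with a timestamp
        if abs(stamp - failure_time) <= window:
            hits.append(line)
    return hits

# Sample lines adapted from the event-log excerpts in this section.
log = [
    "1/4/2009 2:07:04 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 "
    "Reservation of cluster disk 'Disk Q:' has been lost.",
    "1/4/2009 1:15:00 AM ClusSvc Information Node Mgr 1122 N/A USMV-X460 "
    "The node (re)established communication with cluster node.",
]
failure = datetime(2009, 1, 4, 2, 7, 4)
print(entries_near(log, failure))  # only the 2:07:04 AM entry falls in the window
```

The same window-based filter can be pointed at the management console logs once Appendix A's collection procedure has produced a text export.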
Volume Not Accessible to SafeGuard 30m Splitter

Problem Description

A volume (repository volume, replication volume, or journal) is not accessible to one or more splitters but is accessible to all other relevant initiators (for example, the RAs).

Symptoms

The following symptoms might help you identify this failure:

- The system pauses the transfer for the relevant groups.
- If the repository volume is not accessible, the management console shows an error for the splitter.
- If a replication volume is not accessible, the splitter connection to that volume shows an error.
- The management console System Status screen and the Splitter Settings screen show error indications similar to those in Figure 6-3.

Figure 6-3. Management Console Error Display Screen

Warnings and informational messages similar to those shown in Figure 6-4 appear on the management console. See the table after the figure for an explanation of the numbered console messages.
Figure 6-4. Management Console Messages for Volumes Inaccessible to Splitter
The following table explains the numbered messages shown in Figure 6-4. An e-mail note indicates whether the event generates an immediate e-mail notification, a daily summary notification, or both.

Ref 1, Event ID 4008: For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site. (E-mail: immediate)
Ref 2, Event ID 4005: Negotiating transfer protocol. (E-mail: immediate)
Ref 3, Event ID 4086: For each consistency group at the failed site, the data transfer starts and then the initialization starts.
Ref 4, Event ID 4087: For each consistency group at the failed site, initialization completes. (E-mail: immediate and daily summary)
Ref 5, Event ID 4007: Pausing data transfer. (E-mail: immediate)
Ref 6, Event ID 4001: For each consistency group, a minor problem is reported. The details show that sides are not linked and cannot transfer data.
Ref 7, Event ID 4016: Transferring the latest snapshot before pausing the transfer (no detail is lost). (E-mail: immediate and daily summary)
Ref 8, Event ID 5030: The splitter write operation failed. (E-mail: immediate)
Ref 9, Event ID 5035: Writes to replication volume ID are disabled. (E-mail: immediate)

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.
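The event-ID tables in this section can be condensed into a small lookup that is handy when scanning collected logs. The descriptions below are abbreviated from the tables in this section; Appendix E remains the authoritative list, and this sketch is illustrative, not a product artifact.

```python
# Abbreviated from the event-ID tables in this section; see Appendix E for the full list.
EVENT_IDS = {
    3012: "RA unable to access a volume",
    3014: "RA unable to access the repository volume",
    4001: "Minor group problem (sides not linked)",
    4003: "Group capabilities problem",
    4005: "Negotiating transfer protocol",
    4007: "Pausing data transfer",
    4008: "Transfer paused to allow failover",
    4016: "Transferring latest snapshot before pausing",
    4044: "Group deactivated indefinitely by the system",
    4086: "Initialization/synchronization started",
    4087: "Initialization/synchronization completed",
    5002: "Splitter unable to access the RA",
    5013: "Splitter down",
    5030: "Splitter write operation failed",
    5035: "Writes to replication volume disabled",
    5049: "Splitter write to RA failed",
}

def describe(event_id):
    """Return a short description for a console event ID, if known."""
    return EVENT_IDS.get(event_id, "Unknown; see Appendix E")

print(describe(4044))
```

A lookup like this can be combined with a log filter so that each console event ID in a collected log is annotated inline while you review the failure window.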
The multipathing software (such as EMC PowerPath) on the server at the failed site reports disk errors as shown in Figure 6-5.

Figure 6-5. EMC PowerPath Shows Disk Error

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/4/2009 4:09:28 AM Emcmpio Error None 100 N/A USMV-X460 Path Bus 2 Tgt 60 Lun 1 to APM00033702717 is dead.
1/4/2009 4:09:28 AM Emcmpio Error None 102 N/A USMV-X460 600601609A560E00378AABEBF3C8DB11 is dead.
1/4/2009 4:09:28 AM Emcmpio Error None 104 N/A USMV-X460 All paths to 600601609A560E00378AABEBF3C8DB11 are dead.
1/4/2009 4:09:31 AM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/4/2009 4:11:07 AM ClusDisk Error None 1069 N/A USMV-X460 Cluster resource 'Disk R:' in Resource Group 'DiskR' failed.
1/4/2009 4:11:08 AM ClusSvc Information Failover Mgr 1153 N/A USMV-X460 Cluster service is attempting to failover the Cluster Resource Group 'DiskR' from node USMV-X460 to node ES3120-X64.
1/4/2009 4:11:30 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.
System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/4/2009 7:10:34 AM ClusDisk Error None 1209 N/A USMV-X460 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/4/2009 7:11:07 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 The Cluster Service is attempting to bring online the Resource Group "DiskR".
1/4/2009 7:11:30 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

00000ae8.00000780::2009/01/04-12:09:31.578 ERR Physical Disk <Disk R:>: [DiskArb] CompletionRoutine: reservation lost! Status 2
00000ae8.00000600::2009/01/04-12:09:31.875 ERR Physical Disk <Disk R:>: LooksAlive, error checking device, error 2.
00000ae8.00000600::2009/01/04-12:09:31.875 ERR Physical Disk <Disk R:>: IsAlive, error checking device, error 2.
00000ae8.000009cc::2009/01/04-12:10:31.874 ERR Physical Disk <Disk R:>: [DiskArb] Error cleaning arbitration sector, error 2

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

00000300.00000644::2009/01/04-11:52:40.825 INFO [ClMsg] Received interface unreachable event for node 2 network 1
00000300.00000644::2009/01/04-11:52:40.825 INFO [ClMsg] Received interface unreachable event for node 2 network 2
00000950.00000bdc::2009/01/04-11:52:55.512 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000950.00000bdc::2009/01/04-11:52:55.512 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

- Verify that the physical connection between the disconnected splitter or splitters and the Fibre Channel switch is healthy.
- Verify that any host on which a disconnected splitter resides appears in the name server of the Fibre Channel switch. If not, the problem could be caused by a bad port on the switch, a bad HBA, or a bad cable.
- Verify that any host on which a disconnected splitter resides is present in the proper zone and that the current zoning configuration is enabled.
- If a replication volume is not accessible to the splitter at the source site but appears as OK in the management console for that splitter, verify that the splitter is not functioning at the target site (TSP not enabled). During normal replication, the system prevents target-site splitters from accessing the replication volumes.
RAs Not Accessible to SafeGuard 30m Splitter

Problem Description

One or more RAs on a site are not accessible to the splitter through the Fibre Channel.

Symptoms

The following symptoms might help you identify this failure:

- The system pauses the transfer for the relevant groups. If the connection with only one of the RAs is lost, the groups can restart the transfer by means of another RA, beginning with a short initialization.
- The splitter connection to the relevant RAs shows an error.
- The management console displays error indicators similar to those in Figure 6-6.

Figure 6-6. Management Console Display Shows a Splitter Down

Warnings and informational messages similar to those shown in Figure 6-7 appear on the management console. See the table after the figure for an explanation of the numbered console messages.
Figure 6-7. Management Console Messages for Splitter Inaccessible to RA

The following table explains the numbered messages shown in Figure 6-7. An e-mail note indicates whether the event generates an immediate e-mail notification, a daily summary notification, or both.

Ref 1, Event ID 4005: The surviving site negotiates the transfer protocol.
Ref 2, Event ID 4105: The failed site stops accepting writes to the consistency group.
Ref 3, Event ID 4008: For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site. (E-mail: immediate and daily summary)
Ref 4, Event ID 5013: Splitter down problem. (E-mail: immediate)
Ref 5, Event ID 5002: The splitter for server USMV-X460 is unable to access the RA.
Ref 6, Event ID 4087: Synchronization completes after the splitter is restored and replication completes.
Ref 7, Event ID 4086: The original site starts the synchronization.
Ref 8, Event ID 4008: For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site. (E-mail: immediate and daily summary)

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure.
Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-X460 Host (Host on Failure Site)

1/4/2009 9:04:18 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
1/4/2009 9:04:19 PM Service Control Manager Error None 7031 N/A USMV-X460 The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
1/4/2009 9:04:19 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/4/2009 9:05:34 PM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "DiskR" online.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/5/2009 12:04:20 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 3'.
1/5/2009 12:04:20 AM ClusSvc Warning Node Mgr 1123 N/A ES3120-X64 The node lost communication with cluster node 'USMV-X460' on network 'Local Area Connection 4'.
1/5/2009 12:04:35 AM ClusDisk Error None 1209 N/A ES3120-X64 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
1/5/2009 12:04:56 AM ClusSvc Warning Node Mgr 1135 N/A ES3120-X64 Cluster node USMV-X460 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
1/5/2009 12:04:56 AM ClusSvc Information Failover Mgr 1200 N/A ES3120-X64 The Cluster Service is attempting to bring online the Resource Group "Cluster Group".
1/5/2009 12:05:09 AM ClusSvc Information Failover Mgr 1201 N/A ES3120-X64 The Cluster Service brought the Resource Group "Cluster Group" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

00000ae8.00000780::2009/01/05-05:04:18.662 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status 2
00000ae8.00000780::2009/01/05-05:04:18.662 ERR [RM] LostQuorumResource, cluster service terminated...
00000ae8.00000a50::2009/01/05-05:04:19.288 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
00000ae8.00000a50::2009/01/05-05:04:19.288 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector, error 170.
00000ae8.00000a50::2009/01/05-05:04:19.288 ERR Network Name <Cluster Name>: Unable to open handle to cluster, status 1753.

Cluster Log for ES3120-X64 Host (Host on Surviving Site)

00000300.00000644::2009/01/05-05:05:19.638 INFO [ClMsg] Received interface up event for node 2 network 1
00000300.00000644::2009/01/05-05:05:19.638 INFO [ClMsg] Received interface up event for node 2 network 2
00000950.000008b0::2009/01/05-05:04:35.334 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000950.000008b0::2009/01/05-05:04:35.334 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170
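When correlating the two kinds of logs shown in this section, note that the cluster log stamps entries in GMT while the system event log uses the host's local time, which is why the timestamps in the examples differ by a fixed offset. A small helper can convert a cluster-log timestamp for side-by-side comparison; the UTC offset you pass in is an assumption you must supply for your own host.

```python
from datetime import datetime, timedelta

def cluster_stamp_to_local(stamp, utc_offset_hours):
    """Convert a cluster-log timestamp such as '2009/01/05-05:04:18.662' (GMT)
    to the host's local time, given that host's UTC offset in hours."""
    t = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return t + timedelta(hours=utc_offset_hours)

# The failure-site host in the examples above appears to be at UTC-8:
local = cluster_stamp_to_local("2009/01/05-05:04:18.662", -8)
print(local)  # 2009-01-04 21:04:18.662000, matching the 9:04:18 PM event-log entry
```

Converting both logs into one timeline makes it much easier to see that the reservation loss, the service termination, and the surviving-site regroup are all facets of the same incident.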
Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

- Identify which of the components is the problematic one. A problematic component is likely to have additional errors or problems: a problematic RA might not be accessible to other splitters or might not recognize certain volumes; a problematic splitter might not recognize any RAs or the storage subsystem.
- Connect to the storage switch to verify the status of each connection. Ensure that each connection is configured correctly.
- If you cannot find any additional problems, there is a good chance that the problem is with the zoning; that is, the splitters are somehow not exposed to the RAs. Verify the physical connectivity of the RAs and the servers (those on which the potentially problematic splitters reside) to the Fibre Channel switch. For each connection, verify that it is healthy and appears correctly in the name server, zoning, and so forth.
- Verify that this is not a temporary situation; for instance, if the RAs were rebooting or recovering from another failure, the splitter might not yet identify them.

Total SAN Switch Failure on One Site in a Geographic Clustered Environment

A total SAN switch failure means that cluster nodes and RAs have lost access to the storage device that was connected to the SAN on one site. This failure causes the cluster nodes to lose their reservation of the physical disks and triggers an MSCS failover to the remote site.

In a geographic clustered environment where MSCS is running, if the connection to a storage device on one site fails, the symptoms and resulting actions depend on whether the quorum owner resided on the failed storage device. To understand the two scenarios and to follow the actions for both possibilities, review Figure 6-8.
Figure 6-8. SAN Switch Failure on One Site
Cluster Quorum Owner Located on Site with Failed SAN Switch

Problem Description

The following points explain the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

- If the cluster quorum owner is located on the site with the failed SAN, the quorum reservation is lost. This loss causes the cluster nodes to fail and triggers a cluster regroup process, which allows the other nodes participating in the cluster to arbitrate for the quorum device.
- Cluster nodes located on the failed SAN fail quorum arbitration because the failed SAN cannot provide a reservation on the quorum volume.
- The cluster nodes in the remote location attempt to reserve the quorum device and win the arbitration. The node that owns the quorum device assumes ownership of the cluster and brings online the data groups that were owned by the failed site.

Symptoms

The following symptoms might help you identify this failure:

- All resources fail over to the surviving site (site 2 in this case) and come online successfully.
- Cluster nodes fail at the source site.
- If the consistency groups are configured asynchronously, this failover results in loss of data.
- The failover is fully automated and does not require additional downtime.
- The RAs cannot replicate data until the SAN is operational.
- Failures are reported on the server and the management console.
- Replication stops on all consistency groups.
- The management console displays error indications similar to those in Figure 6-9.

Figure 6-9. Management Console Display with Errors for Failed SAN Switch

Warnings and informational messages similar to those shown in Figure 6-10 appear on the management console. See the table after the figure for an explanation of the numbered console messages.
Figure 6-10. Management Console Messages for Failed SAN Switch
The following table explains the numbered messages shown in Figure 6-10. An e-mail note indicates whether the event generates an immediate e-mail notification, a daily summary notification, or both.

Ref 1, Event ID 4008: The surviving site pauses the data transfer. (E-mail: immediate)
Ref 2, Event ID 5013: The original site reports the splitter down status.
Ref 3, Event ID 5002: The RA is unable to access the splitter.
Ref 4, Event ID 4044: The group is deactivated indefinitely by the system.
Ref 5, Event ID 4003: For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem. (E-mail: immediate and daily summary)
Ref 6, Event ID 3014: The RA is unable to access the repository volume. (E-mail: immediate)
Ref 7, Event ID 3012: The RA is unable to access the volume. (E-mail: immediate)
Ref 8, Event ID 4007: The system is pausing data transfer. (E-mail: immediate)

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:
System Event Log for USMV-X460 Host (Host on Failure Site)

1/14/2009 8:25:58 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.
1/14/2009 8:25:58 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-X460 Reservation of cluster disk '' has been lost. Please check your system and disk configuration.

System Event Log for ES3120-X64 Host (Host on Surviving Site)

1/14/2009 11:25:58 PM Service Control Manager Error None 7031 N/A USMV-X460 The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
1/14/2009 11:25:58 PM Ftdisk Warning Disk 57 N/A USMV-X460 The system failed to flush data to the transaction log. Corruption may occur.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-X460 Host (Host on Failure Site)

00000ba4.00000b6c::2009/01/15-04:25:58.072 ERR Physical Disk <Disk Q:>: [DiskArb] CompletionRoutine: reservation lost! Status 2
00000ba4.00000b6c::2009/01/15-04:25:58.072 ERR [RM] LostQuorumResource, cluster service terminated...
00000ba4.00000bb4::2009/01/15-04:26:01.713 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 55.
00000ba4.00000bb4::2009/01/15-04:26:01.713 ERR Physical Disk <Disk Q:>: [DiskArb] Error cleaning arbitration sector, error 55.
00000ba4.00000bb4::2009/01/15-04:26:01.713 ERR Network Name <Cluster Name>: Unable to open handle to cluster, status 1753.
00000ba4.000000a8::2009/01/15-04:26:01.713 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.
Cluster Log for ES3120-X64 Host (Host on Surviving Site)

00000b98.0000056c::2009/01/15-04:26:01.287 INFO [ClMsg] Received interface unreachable event for node 1 network 1
00000b98.0000056c::2009/01/15-04:26:01.287 INFO [ClMsg] Received interface unreachable event for node 1 network 2
00000768.0000023c::2009/01/15-04:26:16.130 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000768.0000023c::2009/01/15-04:26:16.130 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.
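The quorum-arbitration sequence described in this subsection (failed-site nodes cannot reserve the quorum volume; a surviving-site node wins arbitration and assumes cluster ownership) can be illustrated with a toy model. The node list and the arbitration rule below are a deliberate simplification for illustration, not the actual MSCS algorithm.

```python
def arbitrate(nodes, failed_site):
    """Toy model of the MSCS regroup described above: nodes on the failed SAN
    lose quorum arbitration, and the first surviving-site node to reserve the
    quorum device becomes the cluster owner."""
    for node, site in nodes:
        if site != failed_site:   # failed-site nodes cannot reserve the quorum volume
            return node           # this node wins arbitration and owns the cluster
    return None                   # no surviving node: the whole cluster is down

# Node names taken from the examples in this section.
nodes = [("USMV-X460", "site1"), ("ES3120-X64", "site2")]
print(arbitrate(nodes, failed_site="site1"))  # ES3120-X64 assumes cluster ownership
```

The model also shows why the failover is fully automated: ownership transfers as a direct consequence of which nodes can still reach the quorum device, with no operator step in between.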
Cluster Quorum Owner Not on Site with Failed SAN Switch

Problem Description

The following points explain the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

- If a SAN failure occurs and the cluster nodes do not own the quorum resource, the state of the cluster services on these nodes is not affected. The cluster nodes remain active cluster members.
- However, the data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart and then eventually fail over to the surviving site.

Symptoms

The following symptoms might help you identify this failure:

- Applications fail and attempt to restart.
- The data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart and then eventually fail over to the surviving site.
- The cluster nodes remain active cluster members.
- The management console displays error indications similar to those in Figure 6-9.

Warnings and informational messages similar to those shown in Figure 6-11 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6-11. Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site

The following table explains the numbered messages shown in Figure 6-11.

Ref 1, Event ID 4009: The system is pausing data transfer on the failure site. (E-mail: immediate)
To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

System Event Log for ES3120-X64 Host (Host on Failure Site)

1:36:07 AM ClusDisk Error None 1209 N/A USMV-X460 Cluster service is requesting a bus reset for device \Device\ClusDisk0.

System Event Log for USMV-X460 Host (Host on Surviving Site)

1/14/2009 10:36:46 PM ClusSvc Information Node Mgr 1201 N/A USMV-X460 The Cluster Service brought the Resource Group "Cluster Group" online.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases used to generate the previous management console images:

Cluster Log for ES3120-X64 Host (Host on Failure Site)

00000268.00000ae0::2009/01/15-06:25:08.838 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000268.00000ae0::2009/01/15-06:25:08.838 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
00000268.00000750::2009/01/15-06:36:06.578 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.

Cluster Log for USMV-X460 Host (Host on Surviving Site)

00000740.00000ba8::2009/01/15-06:24:52.302 ERR IP Address <Cluster IP Address>: WorkerThread: GetClusterNotify failed with status 6.
00000bcc.00000bd0::2009/01/15-06:36:07.097 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000bcc.00000bd0::2009/01/15-06:36:07.097 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.
Section 7
Solving Network Problems

This section lists symptoms that usually indicate networking problems. Table 7-1 lists symptoms and the possible problems they indicate. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for possible problems. Messages similar to the e-mail messages are also displayed on the management console. If you do not see the messages, they might have already dropped off the display; review the management console logs for messages that have dropped off the display.

Table 7-1. Possible Networking Problems with Symptoms

Symptoms: The cluster groups with the failed network connection fail over to the next preferred node. If only one node is configured at the site with the failure, the replication direction changes and applications run on the backup site. If the NIC is teamed, no failover occurs and no symptoms are obvious. The networks on the Cluster Administrator screen show an error. Host system and application event log messages contain error or warning messages.
Possible Problem: Public NIC failure on a cluster node in a geographic clustered environment

Symptoms: Clients on site 2 are not able to access resources associated with the IP resource located on site 1. Public communication between the two sites fails, allowing only local public communication between cluster nodes and local clients. The networks on the Cluster Administrator screen show an error.
Possible Problem: Public or client WAN failure in a geographic clustered environment
Table 7-1. Possible Networking Problems with Symptoms (cont.)

Symptoms: You cannot access the management console or initiate an SSH session through PuTTY using the management IP address of the remote site.
Possible Problem: Management network failure in a geographic clustered environment

Symptoms: The management console log indicates that the WAN data links to the RAs are down. All consistency groups show the transfer status as Paused by system.
Possible Problem: Replication network failure in a geographic clustered environment

Symptoms: On the management console, all consistency groups show the transfer status switching between Paused by system and initializing/active. All groups appear unstable over the WAN connection.
Possible Problem: Temporary WAN failures

Symptoms: The networks on the Cluster Administrator screen show an error. You cannot access the management console using the management IP address of the remote site. The cluster is no longer accessible from any node except one surviving node.
Possible Problem: Private cluster network failure in a geographic clustered environment

Symptoms: Unable to reach the DNS server. Unable to communicate with the NTP server. Unable to reach the mail server.
Possible Problem: Total communication failure in a geographic clustered environment

Symptoms: The management console shows errors for the WAN or for RA data links. The management console logs show RA communication errors.
Possible Problem: Port information
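For quick triage, the symptom-to-problem mapping in Table 7-1 can be expressed as a keyword lookup. The keywords below are a hypothetical condensation of the table written for this sketch; they are not product strings, and the full table remains the reference.

```python
# Hypothetical condensation of Table 7-1 for quick triage; not a product artifact.
SYMPTOM_MAP = [
    ("next preferred node", "Public NIC failure on a cluster node"),
    ("cannot access the management console", "Management network or total communication failure"),
    ("Paused by system", "Replication network failure or temporary WAN failure"),
    ("Unable to reach DNS", "Total communication failure"),
    ("RA communication errors", "Port information (blocked ports)"),
]

def triage(symptom):
    """Return the possible problems whose keywords appear in the observed symptom."""
    return [problem for key, problem in SYMPTOM_MAP if key.lower() in symptom.lower()]

print(triage("All consistency groups show the transfer status as Paused by system"))
```

A lookup like this only narrows the candidates; the subsections that follow give the actual isolation and resolution steps for each problem.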
Public NIC Failure on a Cluster Node in a Geographic Clustered Environment

Problem Description

If a public network interface card (NIC) of a cluster node fails, that cluster node cannot access clients. The node can still participate in the cluster as a member because it can communicate over the private cluster network. Other cluster nodes are not affected by this error.

The MSCS software detects a failed network, and the cluster resources fail over to the next preferred node. All cluster groups used for replication that contain a virtual IP address for the failed network connection successfully fail over to the next preferred node. However, the Unisys SafeGuard 30m Control resources cannot fail back to the node with the failed public network because they cannot communicate with the site management IP address of the RAs.

Note: A teamed public network interface does not experience this problem and therefore is the recommended configuration.

Figure 7-1 illustrates this failure.

Figure 7-1. Public NIC Failure of a Cluster Node
Symptoms

The following symptoms might help you identify this failure:

- All cluster groups used for replication that contain a virtual IP address for the failed network connection fail over to the next preferred node. If no other node exists at the same site, the replication direction changes and the applications run at the backup site.
- If you review the host system event log, you can find messages similar to the following examples:

Windows System Event Log Messages on Host Server

Type: error Source: ClusSvc EventID: 1077, 1069 Description: The TCP/IP interface for Cluster IP Address xxx has failed.

Type: error Source: ClusSvc EventID: 1069 Description: Cluster resource xxx in Resource Group xxx failed.

Type: error Source: ClusSvc EventID: 1127 Description: The interface for cluster node xxx on network xxx failed. If the condition persists, check the cabling connecting the node to the network. Next, check for hardware or software errors in the node's network adapter.

- If you attempt to move a cluster group to the node with the failing public NIC, the event 2002 message is displayed in the host application event log.

Application Event Log Message on Host Server

Type: warning Source: 30mControl Event Category: None EventID: 2002 Date: 12/17/2008 Time: 16:16:36 User: N/A Computer: USMV-WEST2 Description: Online resource failed. RA CLI command failed because of a network communication error or invalid IP address. Action: Verify the network connection between the system and the site management IP address specified for the resource. Ping each site management IP address specified for the resource.

Note: The preceding information can also be viewed in the cluster log.
The management console display and management console logs do not show any errors.

When the public NIC fails on a node that does not use teaming, the Cluster Administrator displays an error indicator similar to Figure 7-2. If the public NIC interface is teamed, you do not see error messages in the Cluster Administrator.

Figure 7-2. Public NIC Error Shown in the Cluster Administrator

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, verify that the public interface for all nodes is in an Up state. If multiple nodes at a site show failed public connections in the Cluster Administrator, physically check the network switch for connection errors. If the private network also shows errors, physically check the network switch for connection errors.
2. Inspect the NIC link indicators on the host and, from a client, use the Ping command to verify the physical IP address of the adapter (not the virtual IP address).
3. Isolate a NIC or cabling issue by moving cables at the network switch and at the NIC.
4. Replace the NIC in the host if necessary. No configuration of the replaced NIC is necessary.
5. Move the cluster resources back to the original node after the resolution of the failure.
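The isolation logic in steps 1 through 3 can be sketched as a small triage helper. This is an illustrative sketch only; the node names, site labels, and interface states below are invented, and in practice the states come from the Cluster Administrator networks view:

```python
# Hypothetical triage of public-interface states, following the steps above.
# states maps node name -> {"site": site label, "public": "Up" or "Down"}.

def triage_public_network(states):
    """Return a coarse next action: switch check, or NIC/cabling check."""
    down = [node for node, s in states.items() if s["public"] == "Down"]
    if not down:
        return "no public failure detected"
    sites = {states[node]["site"] for node in down}
    if len(down) > 1 and len(sites) == 1:
        # Multiple nodes down at one site point at shared gear: the switch.
        return "check network switch at site " + sites.pop()
    # A single failed node points at its own NIC or cabling.
    return "check NIC and cabling on node " + down[0]
```

For example, two WEST nodes down together would suggest the site switch rather than two simultaneous NIC failures.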
Public or Client WAN Failure in a Geographic Clustered Environment

Problem Description

When the public or client WAN fails, some clients cannot access virtual IP networks that are associated with the cluster. The WAN components involved in this failure might be two switches, possibly on different subnets using gateways. This failure results from connectivity issues. The MSCS cluster would detect and fail the associated node if the failure resulted from an adapter failure or a media failure to the adapter. Instead, in this failure mode, cluster groups do not fail and the public LAN shows as unreachable. Public communication between the two sites fails, allowing only local public communication between cluster nodes and local clients. The cluster node state does not change on either site because all cluster nodes are able to communicate over the private cluster network. All resources remain online, and no cluster group errors are reported in the Cluster Administrator. Clients on the remote site cannot access resources associated with the IP resource located on the local site until the public or client network is again operational. Depending on the cause of the failure and the network configuration, the SafeGuard 30m Control might fail to move a cluster group because the management network might be the same physical network as the public network. Whether this failure to move the group occurs depends on how the RAs are physically wired to the network.
Figure 7-3 illustrates this scenario.

Figure 7-3. Public or Client WAN Failure

Symptoms

The following symptoms might help you identify this failure:

Clients on site 2 are not able to access resources associated with the IP resource located on site 1.

Public communication between the two sites displays as unreachable, allowing only local public communication between cluster nodes and local clients.

When the public cluster network fails, the Cluster Administrator displays an error indicator similar to Figure 7-4.

All private network connections show as unreachable when the problem is a WAN issue. If only two of the connections show as failed (and the nodes are physically located at the same site), the issue is probably local to the site. If only one connection failed, the issue is probably a host network adapter.
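The scoping rule in the last symptom (all connections unreachable versus two at one site versus one) can be expressed as a small decision function. This is a sketch for illustration; its inputs are assumptions, not an actual cluster API:

```python
# Hedged sketch of the failure-scoping rule described above.
# failed_sites: list of site labels, one entry per failed/unreachable
# connection. total: total number of cluster network connections.

def scope_network_failure(failed_sites, total):
    failed = len(failed_sites)
    if failed == total and failed > 0:
        return "WAN issue"                       # everything is unreachable
    if failed == 2 and len(set(failed_sites)) == 1:
        return "local to site " + failed_sites[0]  # two failures, same site
    if failed == 1:
        return "host network adapter"            # a single node's adapter
    return "indeterminate"
```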
Figure 7-4. Cluster Administrator Showing Public LAN Network Error

If you review the system event log, messages similar to the following examples are displayed:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on network 'Public LAN'.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1126
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The interface for cluster node 'USMV-WEST2' on network 'Public LAN' is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node USMV-WEST2. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check
for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1130
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Cluster network 'Public LAN' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.

A cluster group containing a SafeGuard 30m Control resource might fail to move to another node when the management network has network components common to the public network. (Refer to Management Network Failure in a Geographic Clustered Environment.) Symptoms might include those in Management Network Failure in a Geographic Clustered Environment when these networks are physically the same network. Refer to that topic if the clients at one site are not able to access the IP resources at another site.

The management console logs might display the messages in the following table when this connection fails and is then restored.

Event ID Description E-mail Immediate E-mail Daily Summary
3023 For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA <RA>)
3022 When the LAN link is restored, a management console log displays: LAN link to RA restored. (RA <RA>) X X
Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, view the network properties of the public and private networks. The private network should be operational with no failure indications. The public network should display errors. Refer to the previous symptoms to identify that this is a WAN issue. If the error is limited to one host, the problem might be a host network adapter. See Public NIC Failure on a Cluster Node in a Geographic Clustered Environment.
2. Check for network problems using a method such as isolating the failure to the network switch or gateway by pinging from the cluster node to the gateway at each site.
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.)
a. Log on to an RA with user ID boxmgmt and password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the gateway that you want to test.
g. Repeat steps a through f for each RA.
4. Isolate the site by determining which gateway or network switch failed. Use standard network methods such as pinging to make the determination.
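Steps 2 and 4 amount to a ping sweep against the site gateways. A minimal sketch, under the assumption that it runs from a cluster node (gateway addresses are invented; the probe hook allows the reachability logic to be exercised without a live network):

```python
import subprocess

def ping_ok(address, probe=None):
    """True if one echo request succeeds. probe may be injected for testing."""
    if probe is not None:
        return probe(address)
    # "-c 1" is the Linux ping flag; on a Windows host use "-n 1" instead.
    result = subprocess.run(["ping", "-c", "1", address], capture_output=True)
    return result.returncode == 0

def sweep_gateways(gateways, probe=None):
    """Map each gateway address to 'ok' or 'unreachable'."""
    return {gw: ("ok" if ping_ok(gw, probe) else "unreachable")
            for gw in gateways}
```

A leg that reports unreachable from every node at a site points at that site's gateway or switch rather than at an individual host.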
Management Network Failure in a Geographic Clustered Environment

Problem Description

When the management network fails in a geographic clustered environment, you cannot access the management console for the affected site. The replication environment is not affected. If you try to move a cluster group to the site with the failed management network, the move fails.

Figure 7-5 illustrates this scenario.

Figure 7-5. Management Network Failure

Symptoms

The following symptoms might help you identify this failure:

The indicators for the onboard management network adapter of the RA are not illuminated. Network switch port lights show that no link exists with the host adapter.
You cannot access the management console or initiate an SSH session through PuTTY using the management IP address of the failed site from the remote site. You can access the management console from a client local to the site. If you cannot access the management IP address from either site, see Section 8, Solving Replication Appliance (RA) Problems.

A cluster move operation to the site with the failed management network might fail. The event ID 2002 message is displayed in the host application event log.

Application Event Log Message on Host Server

Type: warning
Source: 30mControl
Event Category: None
EventID: 2002
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Online resource failed. RA CLI command failed because of a network communication error or invalid IP address.
Action: Verify the network connection between the system and the site management IP Address specified for the resource. Ping each site management IP Address mentioned for the specified resource.

Note: The preceding information can also be viewed in the cluster log.

If the management console was open with the IP address of the failed site, the message "Connection with RA was lost, please check RA and network settings" is displayed. The management console display shows not connected, and the components have a question mark (Unknown status) as illustrated in Figure 7-6.
Figure 7-6. Management Console Display: Not Connected

The management console log displays a message for event 3023 as shown in Figure 7-7.
Figure 7-7. Management Console Message for Event 3023

The management console log messages might appear as in the following table.

Event ID Description E-mail Immediate E-mail Daily Summary
3023 For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA <RA>)
3022 When the LAN link is restored, a management console log displays: LAN link to RA restored. (RA <RA>) X X

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. Ping from the cluster node to the RA box management IP address at the same site. Repeat this action for the other site. If the local connections are working at both sites, the problem is with the WAN connection, such as a network switch or gateway connection.
2. If one site from step 1 fails, ping from the cluster node to the gateway of that site. If the ping completes, then proceed to step 3.
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.)
a. Log in to an RA as user boxmgmt with the password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the gateway that you want to test.
g. Repeat steps a through f for each RA.
4. Isolate the site by determining which gateway failed. Use standard network methods such as pinging to make the determination.

Replication Network Failure in a Geographic Clustered Environment

Problem Description

This type of event occurs when the RA cannot replicate data to the remote site because of a replication network (WAN) failure. Because this error is transparent to MSCS and the cluster nodes, cluster resources and nodes are not affected. Each cluster node continues to run, and data transactions sent to its local cluster disk are completed.
Figure 7-8 illustrates this failure.

Figure 7-8. Replication Network Failure

The RA cannot replicate data while the WAN is down. During this failure, the RA keeps a record of data written to local storage. Once the WAN is restored, the RA updates the replication volumes on the remote site.

During the replication network failure, the RAs prevent the quorum and data resources from failing over to the remote site. This behavior differs from a total communication failure or a total site failure, in which the data groups are allowed to fail over. The quorum group is never allowed to fail over automatically when the RAs cannot communicate over the WAN.

Notes:
If the management network has also failed, see Total Communication Failure in a Geographic Clustered Environment later in this section.
If all RAs at a site have failed, see Failure of All RAs at One Site in Section 8.

If the administrator issues a move-group operation from the Cluster Administrator for a data or quorum group, the cluster accepts failover only to another node within the same site. Group failover to the remote site is not allowed, and the resource group fails back to a node on the source site.
Although automatic failover is not allowed, the administrator can perform a manual failover to the remote site. Performing a manual failover results in a loss of data. The administrator chooses an available image for the failover.

Important considerations for this type of failure are as follows:
This type of failure does not have an immediate effect on the cluster service or the cluster nodes.
The quorum group cannot fail over to the remote site and goes back online at the source site. Only local failovers are permitted. Remote failovers require that the administrator perform the manual failover process.
The SafeGuard 30m Control resource and the data consistency groups cannot fail over to the remote site while the WAN is down; they go back online at the source site.
Only one site has up-to-date data. Replication does not occur until the WAN is restored. If the administrator manually chooses to use remote data instead of the source data, data loss occurs.
Once the WAN is restored, normal operation continues; however, the groups might initiate a long resynchronization.

Symptoms

The following symptoms might help you identify this failure:

The management console display shows errors similar to the image in Figure 7-9. This image shows the dialog box displayed after clicking the red Errors in the right column. The More Info message box is displayed with messages similar to those in the figure but appropriate for your site. If only one RA is down, see Section 8 for resolution actions. Notice in the figure that all RA data links at the site are down.

Figure 7-9. Management Console Display: WAN Down
This figure also shows the Groups tab and the messages that the data consistency groups and the quorum group are Paused by system. If the groups are not paused by the system, a switchover might have occurred; see Section 8 for more information. If all groups are not paused, see Section 5, Solving Storage Problems.

Warnings and informational messages similar to those shown in Figure 7-10 appear on the management console when the WAN is down. See the table after the figure for an explanation of the numbered console messages.

Figure 7-10. Management Console Log Messages: WAN Down

The following table explains the numbers in Figure 7-10. You might also see the events in the table denoted by an asterisk (*) in the management console log.

Reference No./Legend Event ID Description E-mail Immediate E-mail Daily Summary
* 3001 The RA is currently experiencing a problem communicating with its cluster. The details explain that a subsequent event 3000 indicates that RA functionality has been restored.
* 3000 The RA is successfully communicating with its cluster. In this case, the RA communicates by means of the management link.
1 4001 For each consistency group on the EAST2 and the WEST2 sites, the transfer is paused.
2 4008 For each quorum group on the EAST2 and the WEST2 sites, the transfer is paused. X X X X
Reference No./Legend Event ID Description E-mail Immediate E-mail Daily Summary
* 4043 For each group on the EAST2 and WEST2 sites, the "group site is deactivated" message might appear with the detail showing the reason for the switchover. The RA attempts to switch over to resolve the problem.
3 4001 The event is repeated after the switchover attempt. X X

If you review the management console RAs tab, the data link column lists errors for all RAs, as shown in Figure 7-11. The data link is the replication link between peer RAs. Notice that the WAN link shows OK because the RAs can still communicate over the management link. There is no column for the management link.

Figure 7-11. Management Console RAs Tab: All RAs Data Link Down

If you review the host application event log, no messages appear for this failure unless a data resource move-group operation is attempted. If this move-group operation is attempted, then messages similar to the following are listed:

Application event log
Event Type: Warning
Event Source: 30mControl
Event Category: None
Event ID: 1119
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Online resource failed. Cannot complete transfer for auto failover (7). The following could cause this error: 1. Wan is down. 2. Long resynchronization might be in progress. The resource might have to be brought online manually. RA Version: 3.1(K.87) Resource name: Data1
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.25.50 initiate_failover group=data1 active_site=west cluster_owner=usmv-west2

If you review the system event log, a message similar to the following example is displayed:

System Event Log
Event Type: Error
Event Source: ClusSvc
Event Category: Failover Mgr
Event ID: 1069
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: Cluster resource 'Data1' in Resource Group 'Group 0' failed.

Note: Data1 would change to the Quorum drive if the quorum was moved.

If you review the cluster log, you can see an error if a data or a quorum move-group operation is attempted. Messages similar to the following are listed:

Cluster Log for the Node to which the Move Was Attempted

Key messages
00000e20.0000064c::2008/12/16-22:39:09.851 INFO [RGP] Node 2: RGP Incoming pkt: 0x3fff, 0x1, 0x3, 0x2.
00000e20.0000064c::2008/12/16-22:39:09.851 INFO [RGP] Node 2: RGP recv pkt : 0x10003, 0x40004000, 0x40000000, 0x1.
00000b6c.000008c0::2008/12/16-22:39:09.992 INFO Physical Disk <Disk Q:>: [DiskArb] Read the partition info from the disk to insure disk is accessible.
00000b6c.000008c0::2008/12/16-22:39:09.992 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing GetPartInfo on signature 4a130615.
00000b6c.000008c0::2008/12/16-22:39:09.992 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170.
00000b6c.000008c0::2008/12/16-22:39:09.992 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to write (sector 12), error 170.
00000b6c.000008c0::2008/12/16-22:39:09.992 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.
00000e20.00000fd0::2008/12/16-22:39:09.992 INFO [MM] MmSetQuorumOwner(0,0), old owner 2.
00000e20.00000fd0::2008/12/16-22:39:09.992 WARN [MM] MmSetQuorumOwner: regroup is in progress, forcing the new value in.
00000e20.00000fd0::2008/12/16-22:39:09.992 ERR [FM] Failed to arbitrate quorum resource 6fbf7ffc-8c96-4257-9272-3655a6cde32e, error 170.
00000e20.00000fd0::2008/12/16-22:39:09.992 INFO [NM] We do not own the quorum resource, status 6.
Cluster Log for the Node to which the Data Group Move Was Attempted
00000754.00000790::2008/10/04-04:41:05.292 INFO [GUM] GumSendUpdate: completed update seq 281137 type 0 context 8
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] FmpPropagateResourceState: resource 4b59c5d6-5c66-4e8f-8329-819f6d62f999 offline event.
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] RmTerminateResource: 4b59c5d6-5c66-4e8f-8329-819f6d62f999 is now offline
000007cc.000007dc::2008/10/04-04:41:05.292 INFO Unisys SafeGuard 30m Control <Data2>: KfResourceTerminate: Resource 'Data2' terminated. AbortOnline=1 CancelConnect=0 terminateprocess=0.
00000754.00000790::2008/10/04-04:41:05.292 INFO [CP] CppResourceNotify for resource Data2
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] RmTerminateResource: 18766578-77a0-48ec-b9b1-c8b205038ed4 is now offline
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] RestartResourceTree, Restart resource 18766578-77a0-48ec-b9b1-c8b205038ed4
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] FmpRmOnlineResource: bringing resource 18766578-77a0-48ec-b9b1-c8b205038ed4 (resid 1234104) online.
00000754.00000790::2008/10/04-04:41:05.292 INFO [CP] CppResourceNotify for resource Data2
00000754.00000790::2008/10/04-04:41:05.292 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwquoblockingresources for resource 18766578-77a0-48ec-b9b1-c8b205038ed4
000007cc.00000eac::2008/10/04-04:41:05.292 INFO Unisys SafeGuard 30m Control <Data2>: KfResourceOnline: 'Data2' going online. PendingTimeout=900000.
000007cc.00000eac::2008/10/04-04:41:05.292 INFO Unisys SafeGuard 30m Control <Data2>: KfGetLocalSiteInfo: FirstSiteName = 'WEST', FirstSiteIP = '172.16.7.50', SecondSiteName = 'EAST', SecondSiteIP = '172.16.7.60'.
000007cc.00000eac::2008/10/04-04:41:05.292 INFO Unisys SafeGuard 30m Control <Data2>: KfResourceOnline: SiteIP = '172.16.7.60'. SiteName = EAST. Status =!u!.
00000820.00000d3c::2008/10/04-04:41:05.510 ERR Unisys SafeGuard 30m Control <Data1>: KfLogit: FAILED to run command 'UCLI -ssh -l plugin -pw **** -superbatch 172.16.7.60 initiate_failover group='data1' active_site='east' cluster_owner=usmv-east2'. Return code: (6)
00000820.00000d3c::2008/10/04-04:41:05.510 ERR Unisys SafeGuard 30m Control <Data1>: UrcfKConGroupOnlineThread: Error 1117 bringing resource online.
00000820.00000d3c::2008/10/04-04:41:05.510 INFO [RM] RmpSetResourceStatus, Posting state 4 notification for resource <Data1>
00000754.00000834::2008/10/04-04:41:05.510 INFO [FM] NotifyCallBackRoutine: enqueuing event

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure.

Perform the following actions to isolate and resolve the problem:

1. On the management console, observe that a WAN error occurred for all RAs and that the data link is in error for all RAs. If that is not the case, see Section 8 for resolution actions.
2. Use the Installation Manager site connectivity IP diagnostic from the RAs to the gateway at each site by performing the following steps. (For more information, see Appendix C.)
a. Log in to an RA as user boxmgmt with the password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the gateway that you want to test.
g. Repeat steps a through f for each RA.
3. Isolate the site by determining which network switch or gateway failed. Use standard network methods such as pinging to make the determination.
4. In some cases, the WAN connection might appear to be down because a firewall is blocking ports. See Port Information later in this section.
5. If all RAs at both sites can connect to the gateway, the problem is related to the link. In this case, check the connectivity between subnets by pinging between machines on the same subnet (not RAs) and between a non-RA machine at one site and an RA at the other site.
6. Verify that no routing problems exist between the sites.
7. Optionally, follow the recovery actions to manually move cluster and data resource groups to the other site if necessary. This action results in a loss of data. Do not attempt this manual recovery unless the WAN failure has affected applications. If you choose to manually move groups, refer to Section 4 for the procedures.

Once you observe on the management console that the WAN error is gone, verify that the consistency groups are resynchronizing. If a move-group operation is issued to the other site while a group is resynchronizing, the command fails with return code 7 (long resynchronization in progress), and the group moves back to the original node.

Temporary WAN Failures

Problem Description

Symptoms

All applications are unaffected. The target image is not up-to-date.
On the management console, messages show the transfer between sites switching between Paused by system and Initializing/Active. All groups appear unstable over the WAN connection.

Actions to Resolve the Problem

Perform the following actions to isolate and resolve this problem:

1. If the connection problem is temporary but recurs, check for a problematic network, such as a high percentage of packet loss because of bad network
connections, insufficient bandwidth that is causing an overloaded network, and so on.
2. Verify that the bandwidth allocated to this link is reasonable and that no unreasonable external or internal (consistency group bandwidth policy) limits are causing an overloaded network.

Private Cluster Network Failure in a Geographic Clustered Environment

Problem Description

When the private cluster network fails, the cluster nodes are able to communicate over the public cluster network if the cluster public address is set for all communication. No cluster resources fail over, and current processing on the cluster nodes continues. Clients are not affected by this failure.
Figure 7-12 illustrates this scenario.

Figure 7-12. Private Cluster Network Failure

Symptoms

Unisys recommends that the public cluster network be set for All communications and the private cluster LAN be set for Internal cluster communications only. You can verify these settings in the Networks properties section within the Cluster Administrator. See Checking the Cluster Setup in Section 4.

If the public cluster network was not set for All communications but instead was set for Client access only, the following symptoms occur:
All nodes except the node that owned the quorum stop MSCS. This action is completed to prevent a split-brain situation.
All resources move to the surviving node.

The following symptoms might help you identify this failure:

When the private cluster network fails, the Cluster Administrator displays an error indicator similar to Figure 7-13. All private network connections show a status of Unknown when the problem is a WAN issue.
If only two of the connections failed (and the nodes are physically located at the same site), the issue is probably local to the site. If only one connection failed, the issue is probably a host network adapter.

Figure 7-13. Cluster Administrator Display with Failures

On the cluster nodes at both sites, the system event log contains entries from the cluster service similar to the following:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Computer: USMV-WEST2
Description: The node lost communication with cluster node 'USMV-EAST2' on network 'Private'.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1126
Date: 12/17/2008
Time: 16:25:36 PM
User: N/A
Solving Network Problems Computer : USMV-WEST2 Description: The interface for cluster node 'USMV- EAST2' on network 'Private' is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node USMV- EAST2. If the condition persists, check the cable connecting the node to the network. Then, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. Event Type : Warning Event Source : ClusSvc Event Category: Event ID : 1130 Node Mgr Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: Cluster network 'Private is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network. Actions to Resolve the Problem Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure. Perform the following actions to isolate and resolve the problem: 1. In the Cluster Administrator, view the network properties of the public and private network. The public network should be operational with no failure indications. The private network should display errors. Refer to the previous symptoms to identify that this is a WAN issue. If the error is limited to one host, the problem might be a host network adapter. See Public NIC Failure on a Cluster Node in a Geographic Clustered Environment for action to resolve a host network problem. 2. 
Check for network problems using methods such as isolating the failure to the network switch or gateway with the problem.
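When working through step 1, it can help to confirm that the expected ClusSvc events (1123, 1126, and 1130) are all present in an exported System event log. A rough sketch, assuming the log was saved as plain text in a layout like the excerpts quoted above:

```python
import re

# The three ClusSvc events quoted above for a private-network failure.
PRIVATE_NETWORK_EVENTS = {"1123", "1126", "1130"}

def clussvc_event_ids(log_text):
    """Return ClusSvc event IDs in order of appearance in the log text."""
    pattern = r"Event Source\s*:\s*ClusSvc.*?Event ID\s*:\s*(\d+)"
    return [m.group(1) for m in re.finditer(pattern, log_text, re.DOTALL)]

def private_network_down(log_text):
    """True if all three private-network failure events were logged."""
    return PRIVATE_NETWORK_EVENTS.issubset(clussvc_event_ids(log_text))
```

The exact export format varies by tool, so the pattern may need adjusting; the point is simply to check for the full event set rather than a single entry.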
Total Communication Failure in a Geographic Clustered Environment

Problem Description

A total communication failure implies that the cluster nodes and RAs are no longer able to communicate with each other over the public and private network interfaces. Figure 7-14 illustrates this failure.

Figure 7-14. Total Communication Failure

When this failure occurs, the cluster nodes on both sites detect that the cluster heartbeat has been broken. After six missed heartbeats, the cluster nodes go into a regroup process to determine which node takes ownership of all cluster resources. This process consists of checking network interface states and then arbitrating for the quorum device. During the network interface detection phase, all nodes perform a network interface check to determine that the node is communicating through at least one network interface dedicated for client access, assuming the network interface is set for All communications or Client access only. If this process determines that the node is not communicating through any viable network, the cluster node voluntarily stops cluster service and drops out of the quorum arbitration process. The remaining nodes then attempt to arbitrate for the quorum device.
Symptoms Quorum arbitration succeeds on the site that originally owned the quorum consistency group and fails on the nodes that did not own the quorum consistency group. Cluster service then shuts itself down on the nodes where quorum arbitration fails. In Microsoft Windows 2000 environments, MSCS does not check for network interface availability during the regroup process and starts the quorum arbitration process immediately after a regroup process is initiated, that is, after six missed heartbeats. Once the cluster has determined which nodes are allowed to remain active in the cluster, the cluster node attempts to bring online all data groups previously owned by the other cluster nodes. The SafeGuard 30m Control resource and its associated dependent resources come online. During this total communication failure, replication is Paused by system. An extended outage requires a full volume sweep. Refer to Section 4 for more information. The following symptoms might help you identify this failure: The management console shows a WAN error; all groups are paused. The other site shows a status of Unknown. Figure 7 15 illustrates one site. Figure 7 15. Management Console Display Showing WAN Error
Solving Network Problems The RAs tab on the management console lists errors as shown in Figure 7 16. Figure 7 16. RAs Tab for Total Communication Failure Warnings and informational messages similar to those shown in Figure 7 17 appear on the management console. See the table after the figure for an explanation of the numbered console messages. Figure 7 17. Management Console Messages for Total Communication Failure 6872 5688 006 7 29
The following table explains the numbered messages in Figure 7 17. Reference No. Event ID Description 1 4001 For each consistency group, a group capabilities minor problem is reported. The details indicate that a WAN problem is suspected on both RAs. 2 4008 For each consistency group on the West and the East sites, the transfer is paused. The details indicate a WAN problem is suspected. 3 3021 For each RA at each site, the following error message is reported: Error in WAN link to RA at other site (RA x) 4 1008 The following message is displayed: User action succeeded. The details indicate that a failover was initiated. This message appears when the groups are moved by the SafeGuard Control resource to the surviving cluster node. E-mail Immediate E-mail Daily Summary X X X X All cluster resources appear online after successfully failing over to the surviving node. The cluster service stops on all nodes except the surviving node. From the surviving node, the host system event log has entries similar to the following: Event Type : Warning Event Source : ClusSvc Event Category: Node Mgr Event ID : 1123 Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: The node lost communication with cluster node 'USMV-EAST2' on Public network. Event Type : Warning Event Source : ClusSvc Event Category: Node Mgr Event ID : 1123 Date : 12/17/2008
Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: The node lost communication with cluster node 'USMV-EAST2' on Private network. Event Type : Warning Event Source : ClusSvc Event Category: Node Mgr Event ID : 1135 Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: Cluster node USMV-EAST2 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes. Event Type : Information Event Source : ClusSvc Event Category: Failover Mgr Event ID : 1200 Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: The Cluster Service is attempting to bring online the Resource Group "Group 1". Event Type : Information Event Source : ClusSvc Event Category: Failover Mgr Event ID : 1201 Date : 12/17/2008 Time : 16:25:36 PM User : N/A Computer : USMV-WEST2 Description: The Cluster Service brought the Resource Group "Group 1" online.
From the surviving node, the private and public network connections show an exclamation mark (Unknown status) as shown in Figures 7 18 and 7 19. Figure 7 18. Cluster Administrator Showing Private Network Down Figure 7 19. Cluster Administrator Showing Public Network Down
Actions to Resolve the Problem Note: Typically, a network administrator for the site is required to diagnose which network switch, gateway, or connection is the cause of this failure. Perform the following actions to isolate and resolve the problem: 1. When you observe on the management console that a WAN error occurred on site 1 and on site 2, call the other site to verify that each management console is available and shows a WAN down because of the failure. If only one site can access the management console, the problem is probably not a total WAN failure but rather a management network failure. In that case, see Management Network Failure in a Geographic Clustered Environment. 2. In the Cluster Administrator, verify that only one node is active in the cluster. 3. View the network properties of the public and private networks. The display should show an Unknown status for the private and public networks. 4. Check for network problems using methods such as isolating the failure to the network switch or gateway by pinging from the cluster node to the gateway at each site. Port Information Problem Description Communications problems might occur because of firewall settings that prevent all necessary communication. Symptoms The following symptoms might help you identify this problem: Unable to reach the DNS server. Unable to communicate with the NTP server. Unable to reach the mail server. The RAs tab shows RA data link errors. The management console shows errors for the WAN. The management console logs show RA communications errors.
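The isolation method in the last step above, walking outward from the cluster node until a hop stops answering, can be sketched as follows. This is an illustrative Python sketch only, not part of the product: the hop names are hypothetical, and the probe is pluggable (in practice it would wrap a ping to each address).

```python
# Sketch: isolate a WAN failure to the first unreachable hop on the
# path from a cluster node to the remote site. Hop names below are
# hypothetical; the probe callable stands in for a real ping.

def first_unreachable_hop(hops, probe):
    """Return the first hop for which probe() fails, or None if the
    whole path is reachable."""
    for hop in hops:
        if not probe(hop):
            return hop
    return None

if __name__ == "__main__":
    # Simulated outage: only the local switch answers, so the failure
    # isolates to the local gateway.
    path = ["local-switch", "local-gateway", "remote-gateway", "remote-node"]
    print(first_unreachable_hop(path, lambda hop: hop == "local-switch"))
    # -> local-gateway
```

Because the probe is a parameter, the same routine can be exercised against simulated outages before being pointed at real gateways.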
Actions to Resolve Perform the port diagnostics from each of the RAs by following the steps given in Appendix C. The following tables provide port information that you can use in troubleshooting the status of connections. Table 7 2. Ports for Internet Communication Port Numbers Protocol or Protocols Unisys Product Support IP Address 21 FTP 192.61.61.78 443 Used for remote maintenance (TCP) 129.225.216.130 The following tables list ports used for communication other than Internet communication. Table 7 3. Ports for Management LAN Communication and Notification Port Numbers Protocol or Protocols 21 Default FTP port (needed for collecting system information) 22 Default SSH port and communications between RAs 25 Default outgoing mail (SMTP) port, used when e-mail alerts from the RA are configured 80 Web server for management (TCP) 123 Default NTP port 161 Default SNMP port 443 Secure Web server for management (TCP) 514 Syslog (UDP) 1097 RMI (TCP) 1099 RMI (TCP) 4401 RMI (TCP) 4405 Host-to-RA kutils communications (SQL commands) and KVSS (TCP) 7777 Automatic host information collection
Solving Network Problems The ports listed in Table 7 4 are used for both the management LAN and WAN. Table 7 4. Ports for RA-to-RA Internal Communication Port Numbers Protocol or Protocols 23 telnet 123 NTP (UDP) 1097 RMI (TCP) 1099 RMI (TCP) 4444 TCP 5001 TCP (default iperf port for performance measuring between RAs) 5010 Management server (UDP, TCP) 5020 Control (UDP, TCP) 5030 RMI (TCP) 5040 Replication (UDP, TCP) 5060 Mpi_perf (TCP) 5080 Connectivity diagnostics tool 6872 5688 006 7 35
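When a firewall is suspected of blocking one of the ports in Tables 7 2 through 7 4, a quick reachability probe can narrow the problem down before running the full port diagnostics. The sketch below is illustrative and not part of the product; the RA address in the example is hypothetical, so substitute the management IP addresses of your own RAs.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established
    within the timeout; False if the port is closed or filtered."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Example: check a few management-LAN ports from Table 7-3 against
    # a hypothetical RA management address.
    for port in (22, 80, 443):
        print(port, tcp_port_open("10.0.0.10", port, timeout=0.5))
```

Note that this only exercises TCP ports; UDP services such as NTP (123), SNMP (161), and syslog (514) need a protocol-specific check.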
Solving Network Problems 7 36 6872 5688 006
Section 8 Solving Replication Appliance (RA) Problems This section lists symptoms that usually indicate problems with one or more Unisys SafeGuard 30m replication appliances (RAs). The problems include hardware failures. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details. For problems relating to RAs, gather the RA logs and ask the following questions: Are any errors displayed on the management console? Is the issue constant? Is the issue a one-time occurrence? Does the issue occur at intervals? What are the states of the consistency groups? What is the timeframe in which the problem occurred? When was the first occurrence of the problem? What actions were taken as a result of the problem or issue? Were any recent changes made in the replication environment? If so, what? Table 8 1 lists symptoms and possible causes for the failure of a single RA on one site with a switchover as a symptom. Table 8 2 lists symptoms and possible causes for the failure of a single RA on one site without switchover symptoms. Table 8 3 lists symptoms and other possible problems regarding multiple RA failures. Each problem and the actions to resolve it are described in this section. In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for possible problems. Also, messages similar to e-mail notifications might be displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display. 6872 5688 006 8 1
Solving Replication Appliance (RA) Problems Table 8 1. Possible Problems for Single RA Failure with a Switchover Symptoms The management console shows RA failure. Single RA failure Possible Problem Possible Contributing Causes to Single RA Failure with a Switchover The system frequently pauses transfer for all consistency groups. If you log in to the failed RA as the boxmgmt user, a message is displayed explaining that the reboot regulation limit has been exceeded. The management console shows repeated events that report an RA is up followed by an RA is down. The link indicator lights on all host bus adapters (HBAs) are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. Port errors occur or there is no target when running the SAN diagnostics. The management console shows RA failure with details pointing to a problem with the repository volume. The link indicator lights on the HBA or HBAs are not illuminated. The port indicator lights on the network switch or hub no longer show a link to the RA. Reboot regulation failover Failure of all SAN Fibre Channel HBAs on one RA Onboard WAN network adapter failure (Or failure of the optional gigabit Fibre Channel WAN network adapter) 8 2 6872 5688 006
Solving Replication Appliance (RA) Problems Table 8 2. Possible Problems for Single RA Failure Without a Switchover Symptoms The link indicators lights on the onboard management network adapter are not illuminated. The failure light for the hard disk indicates a failure. An error message that appears during a boot operation indicates failure of one of the internal disks. The link indicator lights on the HBA are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. For one of the ports on the relevant RA, errors appear when running the SAN diagnostics. Possible Problem Onboard management network adapter failure Single hard-disk failure Port failure of a single SAN Fibre Channel HBA on one RA Table 8 3. Possible Problems for Multiple RA Failures with Symptoms Symptoms Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site. Replication has stopped on all groups. MSCS fails over groups to the other site, or MCSC fails on all nodes. The management console displays a WAN error to the other site. Possible Problem Failure of all RAs on one site All RAs on one site are not attached 6872 5688 006 8 3
Single RA Failures Problem Description When an RA fails, a switchover might occur. In some cases, a switchover does not occur. See Single RA Failures With Switchover and Single RA Failures Without Switchover. Understanding Management Console Access If the RA that failed had been running site control (that is, the RA owned the virtual management IP network) and a switchover occurs, the virtual IP address moves to the new RA. If you attempt to connect to the management console using one of the static management IP addresses of the RAs, a connection error occurs if the RA does not have site control. Thus, you should use the site management IP address to connect to the management console. At least one RA (either RA 1 or RA 2) must be attached to the RA cluster for the management console to function. If the RA that failed was running site control and a switchover does not occur (such as with an onboard management network connection failure), the management console might not be accessible. Also, attempts to log in using PuTTY fail if you use the boxmgmt log-in account. When an RA does not have site control, you can always log in using PuTTY and the boxmgmt log-in account. You cannot determine which RA owns site control unless the management console is accessible. The site control RA is designated at the bottom of the display as follows: Another situation in which you cannot log in to the management console is when the user account has been locked. In this case, follow these steps: 1. Log in interactively using PuTTY with another unlocked user account. 2. Enter unlock_user. 3. Determine whether any users are listed, and follow the messages to unlock the locked user accounts.
Solving Replication Appliance (RA) Problems Figure 8 1 illustrates a single RA failure. Figure 8 1. Single RA Failure Single RA Failure with Switchover In this case, a single RA fails, and there is an automatic switchover to a surviving RA on the same site. Any groups that had been running on the failed RA run on a surviving RA at the same site. Each RA handles the replicating activities of the consistency groups for which it is designated as the preferred RA. The consistency groups that are affected are those that were configured with the failed RA as the preferred RA. Thus, whenever an RA becomes inoperable, the handling of the consistency groups for that RA switches over automatically to the functioning RAs in the same RA cluster. During the RA switchover process, the server applications do not experience any I/O failures. In a geographic clustered environment, MSCS is not aware of the RA failure, and all application and replication operations continue to function normally. However, performance might be affected because the I/O load on the surviving RAs is now increased. 6872 5688 006 8 5
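The preferred-RA behavior described above reduces to a simple assignment rule: a consistency group runs on its preferred RA while that RA is up, and switches over to a surviving RA in the same cluster when it is not. The following Python sketch models that rule for illustration only; the group and RA names are hypothetical, and the choice of surviving RA here is an arbitrary deterministic pick.

```python
def assign_groups(groups, live_ras):
    """Map each consistency group to its preferred RA when that RA is
    up; otherwise switch the group over to a surviving RA in the same
    RA cluster. groups maps group name -> preferred RA name."""
    assignment = {}
    for group, preferred in groups.items():
        if preferred in live_ras:
            assignment[group] = preferred
        elif live_ras:
            # Switchover: hand the group to a surviving RA
            # (deterministic pick for the sake of the sketch).
            assignment[group] = sorted(live_ras)[0]
        else:
            # No RA available at the site; replication stops.
            assignment[group] = None
    return assignment
```

For example, with groups `{"G1": "RA1", "G2": "RA2"}` and only RA2 surviving, both groups end up on RA2, which also illustrates why the surviving RA carries an increased I/O load after a switchover.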
Solving Replication Appliance (RA) Problems Symptoms Failures of an RA that cause a switchover are as follows: RA hardware issues (such as memory, motherboard, and so forth) Reboot regulation failover Failure of all SAN Fibre Channel HBAs on one RA Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter) The following symptoms might help you identify this failure: The RA does not boot. From a power-on reset, the BIOS display shows the BIOS information, RAID adapter utility prompt, logical drives found, and so forth. The display is similar to the information shown in Figure 8 2. Figure 8 2. Sample BIOS Display Once the RA initializes, the log-in screen is displayed. Note: Because status messages normally scroll on the screen, you might need to press Enter to see the log-in screen. The management console system status shows an RA failure. (See Figure 8 3.) To display more information about the error, click the red error in the right column. The More Info dialog box is displayed with a message similar to the following: RA 1 in West is down 8 6 6872 5688 006
Solving Replication Appliance (RA) Problems Figure 8 3. Management Console Display Showing RA Error and RAs Tab The RAs tab on the management console shows information similar to that in Figure 8 3, specifically The RA status for RA 1 on the West site shows an error. The peer RA on the East site (RA 1) shows a data link error. Each RA on the East site shows a WAN connection failure. The surviving RA at the failed site (West) does not show any errors. Warnings and informational messages similar to those shown in Figure 8 4 appear on the management console when an RA fails and a switchover occurs. See the table after the figure for an explanation of the numbered console messages. In your 6872 5688 006 8 7
Solving Replication Appliance (RA) Problems environment, the messages pertain only to the groups configured to use the failed RA as the preferred RA. Figure 8 4. Management Console Messages for Single RA Failure with Switchover The following table explains the numbered messages shown in Figure 8 4. Reference No. Event ID Description E-mail Immediate E-mail Daily Summary 1 3023 At the same site, the other RA reports a problem getting to the LAN of the failed RA. 2 3008 The site with the failed RA reports that the RA is probably down. X X 8 8 6872 5688 006
Solving Replication Appliance (RA) Problems Reference No. Event ID Description E-mail Immediate E-mail Daily Summary 3 2000 The management console is now running on RA 2. 4 4001 For each consistency group, a minor problem is reported. The details show that the RA is down or not a cluster member. 5 4008 For each consistency group, the transfer is paused at the surviving site to allow a switchover. The details show the reason for the pause as switchover. 6 4041 For each consistency group at the same site, the groups are activated at the surviving RA. This probably means that a switchover to RA 2 at the failed site was successful. 7 5032 For each consistency group at the failed site, the splitter is again splitting. 8 3021 A WAN link error is reported from each RA at the surviving site regarding the failed RA at the other site. 9 4010 For each consistency group at the failed site, the transfer is started. 10 4086 For each consistency group at the failed site, an initialization is performed. 11 4087 For each consistency group at the failed site, the initialization completes. X X X X X X X X X 12 3007 The failed RA (RA 1) is now restored. X To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations. Actions to Resolve the Problem The following list summarizes the actions you need to perform to isolate and resolve the problem: Check the LCD display on the front panel of the RA. See LCD Status Messages in Appendix B for more information. If the LCD display shows an error, run the RA diagnostics. See Appendix B for more information. Check all indicator lights on the rear panel of the RA. Review the symptoms and actions in the following topics: Reboot Regulation 6872 5688 006 8 9
Solving Replication Appliance (RA) Problems Onboard WAN Network Adapter Failure If you determine that the failed RA must be replaced, contact the Unisys service representative for a replacement RA. After you receive the replacement RA, follow the steps in Appendix D to install and configure it. The following procedure provides a detailed description of the actions to perform: 1. Remove the front bezel of the RA and look at the LCD display. During normal operation, the illuminated message should identify the system. If the LCD display flashes amber, the system needs attention because of a problem with power supplies, fans, system temperature, or hard drives. Figure 8 5 shows the location of the LCD display. Figure 8 5. LCD Display on Front Panel of RA If an error message is displayed, check Table B 1. For example, the message E0D76 indicates a drive failure. (Refer to Single Hard Disk Failure in this section.) If the message code is not listed in the Table B 1, run the RA diagnostics, (see Appendix B). 2. Check the indicators at the rear of the RA as described in the following steps and visually verify that all are working correctly. Figure 8 6 illustrates the rear panel of the RA. 8 10 6872 5688 006
Note: The network connections on the rear panel labeled 1 and 2 in the following illustration might appear different on your RA. The connection labeled 1 is always the RA replication network, and the connection labeled 2 is always the RA management network. Pay special attention to the labeling when checking the network connections. Figure 8 6. Rear Panel of RA Showing Indicators Ping each network connection (management network and replication network), and visually verify that the LEDs on either side of the cable on the back panel are illuminated. Figure 8 7 shows the location of these LEDs. If the LEDs are off, the network is not connected. The green LED is lit if the network is connected to a valid link partner on the network. The amber LED blinks when network data is being sent or received. If the management network LEDs indicate a problem, refer to Onboard Management Network Adapter Failure in this section. If the replication network LEDs indicate a problem, refer to Onboard WAN Network Adapter Failure in this section. Figure 8 7. Location of Network LEDs
Check that the green LEDs for the SAN Fibre Channel HBAs are illuminated as shown in Figure 8 8. Figure 8 8. Location of SAN Fibre Channel HBA LEDs The following table explains the LED patterns and their meanings. If the LEDs indicate a problem, refer to the two topics for SAN Fibre Channel HBA failures in this section. Green LED Amber LED Activity On On Power On Off Online Off On Signal acquired Off Flashing Loss of synchronization Flashing Flashing Firmware error Reboot Regulation Problem Description After frequent, unexplained reboots or restarts of the replication process, the RA automatically detaches from the RA cluster. When installing the RAs, you can enable or disable this reboot regulation feature. The factory default is for the feature to be enabled so that reboot regulation is triggered whenever a specified number of reboots or failures occur within the specified time interval. The two parameters available for the reboot regulation feature are the number of reboots (including internal failures) and the time interval. The default value for the number of reboots is 10, and the default value for the time interval is 2 hours.
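The reboot-regulation behavior, detach once a configured number of reboots occurs within a configured interval, is essentially a sliding-window counter. The Python sketch below models it for illustration only (it is not the RA implementation); the defaults mirror the documented values of 10 reboots within 2 hours.

```python
from collections import deque

class RebootRegulator:
    """Sliding-window model of reboot regulation: the RA detaches from
    its cluster once `limit` reboots (including internal failures)
    occur within `window_seconds`. Illustrative sketch only."""

    def __init__(self, limit=10, window_seconds=2 * 3600):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of recent reboots

    def record_reboot(self, timestamp):
        """Record a reboot at `timestamp` (seconds); return True when
        the regulation limit is reached and the RA should detach."""
        self.events.append(timestamp)
        # Drop reboots that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.limit
```

With the defaults, nine reboots in quick succession do not trigger regulation, but the tenth does; reboots spaced widely enough age out of the two-hour window and never accumulate to the limit.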
Only Unisys personnel should change these values. Use the Installation Manager to change the parameter values or disable the feature. See the Unisys SafeGuard Solutions Replication Appliance Installation Guide for information about using the Installation Manager tools to make these changes. Symptoms The following symptoms might help you identify this failure: Frequent transfer pauses occur for all consistency groups that have the same preferred RA. If you log in to the RA as the boxmgmt user, the following message is displayed: Reboot regulation limit has been exceeded Several messages might be displayed on the Logs tab of the management console as an RA reboots to try to correct a problem. These messages are listed in Table 8 4. Table 8 4. Management Console Messages Pertaining to Reboots Reference No./Legend Event ID Description E-mail Immediate E-mail Daily Summary * 3008 The RA appears to be down. The RA might attempt to perform a reboot to correct the problem. X * 3023 Error in LAN link (as the RA reboots). * 3021 Error in WAN link (as the RA reboots). X X * 3007 The RA is up (the reboot has completed). * 3022 The LAN link is restored (the reboot has completed). * 3020 The WAN link at the other site is restored (the reboot has completed). X X X When any of these messages appear multiple times in a short time period, they might indicate an RA that has continuously rebooted and might have reached the reboot regulation limit.
Actions to Resolve the Problem Perform the following actions to isolate and resolve the problem: 1. Collect the RA logs before you attempt to resolve the problem. See Appendix A for information about collecting logs. 2. To determine whether the hardware is faulty, run the RA diagnostics described in Appendix B. 3. If the problem remains, submit the RA logs to Unisys for analysis. 4. Once the problem is corrected, the RA automatically attaches to the RA cluster after a power-on reset. If necessary, reattach the RA to the RA cluster manually by following these steps: a. Log in as boxmgmt to the RA through an SSH session using PuTTY. b. At the prompt, type 4 (Cluster operations) and press Enter. c. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster. d. At the prompt, type Q (Quit). Failure of All SAN Fibre Channel Host Bus Adapters (HBAs) Problem Description All SAN Fibre Channel HBAs or adapter ports on the RA fail. This scenario is unlikely because the RA has redundant ports that are located on different physical adapters. A SAN connectivity problem is more likely. Note: A single redundant path does not show errors on the management console display. See Port Failure on a Single SAN Fibre Channel HBA on One RA. Symptoms The following symptoms might help you identify this failure: The link indicator lights on all SAN Fibre Channel HBAs are not illuminated. (Refer to Figure 8 8 for the location of these LEDs.) The port indicator lights on the Fibre Channel switch no longer show a link to the RA. Port errors occur or no target appears when running the Installation Manager SAN diagnostics. Information on the Volumes tab of the management console is inconsistent or periodically changing. The management console shows failures for RAs, storage, and hosts. (See Figure 8 9.)
Figure 8 9. Management Console Display: Host Connection with RA Is Down If you click the red error indication for RAs in the right column, the message is RA 2 in West can't access repository volume If you click the red error indication for storage in the right column, the following messages are displayed:
If you click the red error indication in the right column for splitters, the message is ERROR: USMV-WEST2's connection with RA2 is down Warnings and informational messages similar to those shown in Figure 8 10 appear on the management console when an RA fails with this type of problem. See the table after the figure for an explanation of the numbered console messages. Also, refer to Figure 8 4 and the table that explains the messages for information about an RA failure with a generic switchover. Refer to Table 8 4 for other messages that might occur whenever an RA reboots to try to correct the problem.
Solving Replication Appliance (RA) Problems Figure 8 10. Management Console Messages for Failed RA (All SAN HBAs Fail) 6872 5688 006 8 17
The following table explains the numbered messages shown in Figure 8 10. You might also see the messages denoted with an asterisk (*). Reference No. Event ID Description E-mail Immediate E-mail Daily Summary 1 3014 The RA is unable to access the repository volume (RA 2). X 2 4003 For each consistency group that had the failed RA as the preferred RA, a group consistency problem is reported. The details show a repository volume problem. X 3 3012 The RA is unable to access volumes (all volumes for repository, journal, and data are listed). 4 4086 Initialization started (RA 1, Quorum - West). 5 4087 Initialization complete (RA 1, Quorum - West). The group has completed the switchover. X X X To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations. Actions to Resolve the Problem Perform the following actions to isolate and resolve the problem: 1. Refer to Section 6, Solving SAN Connectivity Problems, to determine whether the problem is described there. 2. If you determine that the SAN Fibre Channel HBA failed and must be replaced, contact a Unisys service representative for a replacement adapter. 3. Once the replacement adapter is received, perform the following steps to replace the failed HBA: a. Open a PuTTY session using the IP address of the RA and log in as boxmgmt/boxmgmt. Appendix C provides additional information about the Installation Manager diagnostics. b. On the Main menu, type 3 (Diagnostics) and press Enter. c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter. d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel details) and press Enter.
Information similar to the following is displayed: >>Site1 Box 1>>3 Port 0 wwn = 50012482001c6fb0 node_wwn = 50012482001c6fb1 Port id = 0x20100 operating mode = point to point speed = 2 GB Port 1 ---------------------------------- wwn = 50012482001ce3c4 node_wwn = 50012482001ce3c5 Port id = 0x10100 operating mode = point to point speed = 2 GB e. Write down the port information. f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter. g. On the Diagnostics menu, type B (Back) and press Enter. h. On the Main Menu, type 4 (Cluster operations) and press Enter. i. On the Cluster Operations menu, type 2 (Detach from cluster) and press Enter. j. Shut down the RA. k. Replace the failed adapter with the replacement, and then boot the RA. Note: The replacement adapter does not require any settings to be changed. l. Repeat steps a through d, and again view the Fibre Channel details to see the new WWN for the replaced HBA. m. Using the management interface of the SAN switch, make the modifications to the zoning as needed to replace the failed WWN with the new WWN. n. Use the new WWN to configure the storage. o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and press Enter. (Refer to steps a through c to access the Fibre Channel Diagnostics menu.) When you select the SAN diagnostics option, the system conducts automatic tests that are designed to identify the most common problems encountered in the configuration of SAN environments. Once the tests complete, a message is displayed confirming the successful completion of SAN diagnostics, or a report is displayed that details any critical configuration problems. p. Once no problems are reported from the SAN diagnostics, type B (Back) and press Enter.
Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter Problem Description The onboard WAN adapter failed. This capability serves the replication network. Notes: The gigabit Fibre Channel WAN adapter is an optional component found in some environments. When this board fails, the symptoms are the same as those observed when the onboard WAN adapter fails. In that case, the indicator lights pertain to the gigabit Fibre Channel WAN board instead of the onboard capability. The actions to resolve the problem are similar once you isolate the board as the problem. That is, contact a Unisys service representative for a replacement part. Symptoms The following symptoms might help you identify this failure: Transfer between sites pauses temporarily for all consistency groups for which this is the preferred RA while an RA switchover occurs. Applications continue to run. High loads might occur because of reduced total throughput capacity. The link indicators on the onboard WAN adapter might not be illuminated. (See Figure 8 6 for the location of the connector for the replication network WAN. Figure 8 7 illustrates the LEDs.) The port lights on the network switch might indicate that there is no link to the onboard WAN adapter. The management console shows a WAN data link failure for RA 1. The More information for this error provides the message: RA-x WAN data link is down. (See Figure 8 11.)
Figure 8 11. Management Console Showing WAN Data Link Failure

The RAs tab on the management console (Figure 8 11) shows an error for the same RA at each site, indicating that the connectivity between them has been lost.

Warnings and informational messages similar to those shown in Figure 8 4 for an RA failure are displayed for this failure. Refer to the table after Figure 8 4 for descriptions of the messages. For this failure, the details of event ID 4001 show a WAN data path problem.

Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
- Isolate the problem to the onboard WAN adapter by performing the actions in Replication Network Failure in a Geographic Clustered Environment in Section 7.
- If you determine that the motherboard must be replaced, contact a Unisys service representative for a replacement part. Contact the Unisys Support Center for the appropriate BIOS for the replacement part.
Note: The replacement motherboard might not have the disk controller set for RAID1 (mirroring). Check the setting and change it if necessary.
- In rare cases, you might need to obtain a replacement RA from a Unisys service representative. After you receive the replacement RA, follow the steps in Appendix D to install and configure it.
Single RA Failures Without a Switchover

Problem Description
Some failures that might occur on an RA do not cause a switchover. These failures are
- Port failure on a single SAN Fibre Channel HBA on one RA
- Onboard management network adapter failure
- Single hard disk failure

Port Failure on a Single SAN Fibre Channel HBA on One RA

Problem Description
One SAN Fibre Channel HBA port on the RA failed.

Symptoms
The following symptoms might help you identify this failure:
- The Logs tab on the management console displays a warning message for event ID 3030, RA switched path to storage. (RA <RA>, Volumes <volumes>), but only if the connection failed during an I/O operation.
- The link indicator lights on the SAN Fibre Channel HBA are not illuminated. (Refer to Figure 8 8 for the location of these LEDs.)
- The port indicator lights on the Fibre Channel switch no longer show a link to the RA.
- For one port on the relevant RA, errors occur when running the Installation Manager SAN diagnostics. See Appendix C for information about these diagnostics.

Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
1. If you determine that the SAN Fibre Channel HBA failed and must be replaced, contact a Unisys service representative for a replacement part.
2. Once the replacement adapter is received, perform the following steps to replace the failed HBA:
a. Open a PuTTY session using the IP address of the RA, and log in as boxmgmt/boxmgmt. Appendix C provides additional information about the Installation Manager diagnostics.
b. On the Main menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter.
d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel details) and press Enter.
Information similar to the following is displayed:

>>Site1 Box 1>>3
Port 0
wwn = 50012482001c6fb0
node_wwn = 50012482001c6fb1
Port id = 0x20100
operating mode = point to point
speed = 2 GB
Port 1
----------------------------------
wwn = 50012482001ce3c4
node_wwn = 50012482001ce3c5
Port id = 0x10100
operating mode = point to point
speed = 2 GB

e. Write down the port information.
f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.
g. On the Diagnostics menu, type B (Back) and press Enter.
h. On the Main Menu, type 4 (Cluster operations) and press Enter.
i. On the Cluster Operations menu, type 2 (Detach from cluster) and press Enter.
j. Shut down the RA.
k. Replace the failed adapter with the replacement, and then boot the RA.
Note: The replacement adapter does not require any settings to be changed.
l. Repeat steps a through d, and again view the Fibre Channel details to see the new WWN for the replaced HBA.
m. Using the management interface of the SAN switch, modify the zoning as needed to replace the failed WWN with the new WWN.
n. Use the new WWN to configure the storage.
o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and press Enter. (Refer to steps a through c to access the Fibre Channel Diagnostics menu.)
When you select the SAN diagnostics option, the system runs automatic tests designed to identify the most common problems encountered in the configuration of SAN environments. Once the tests complete, a message confirms the successful completion of SAN diagnostics, or a report details any critical configuration problems.
p. Once no problems are reported from the SAN diagnostics, type B (Back) and press Enter.
q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.
r. On the Diagnostics menu, type B (Back) and press Enter.
s. On the Main Menu, type 4 (Cluster operations) and press Enter.
t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.
This action reattaches the RA, which automatically reboots and restarts replication.
Note: The replacement Fibre Channel HBA does not need any configuration changes.

Onboard Management Network Adapter Failure

Problem Description
The onboard management network adapter failed.

Symptoms
The following symptoms might help you identify this failure:
- On the management console, the system status and RA status do not display any error indications.
- The link indicators on the onboard management network adapter are not illuminated. (See Figure 8 6 for the location of the connector for the onboard management network adapter. Figure 8 7 illustrates the LEDs.)
- If RA site control was running on the failed RA, you cannot access the management console; if the management console was open, a banner is displayed showing not connected. If RA site control was not running on the failed RA, you can access the management console.
- You cannot determine which RA owns site control unless the management console is accessible. The RA that owns site control is designated at the bottom of the display.
- See Management Network Failure in a Geographic Clustered Environment in Section 7 for additional symptoms.
- The Logs tab on the management console might display a message for event ID 3023, Error in LAN link to RA (RA1), for this failure.
Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
- Isolate the problem to the onboard management network adapter by performing the actions in Management Network Failure in a Geographic Clustered Environment in Section 7.
- If you determine the motherboard must be replaced, contact a Unisys service representative for a replacement part. Contact the Unisys Support Center for the appropriate BIOS for the replacement part.
Note: The replacement motherboard might not have the disk controller set for RAID1 (mirroring). Check the setting and change it if necessary.
- In rare cases, you might need to obtain a replacement RA from a Unisys service representative. After you receive the replacement RA, follow the steps in Appendix D to install and configure it.

Single Hard Disk Failure

Problem Description
One of the mirrored internal hard disks for the RA failed.

Symptoms
The following symptoms might help you identify this failure:
- The failure light for a hard disk indicates a failure. Figure 8 12 illustrates the location of the LEDs for hard disks in the RA.
Figure 8 12. Location of Hard Drive LEDs

- An error message that appears during boot indicates failure of one of the internal disks.
- The LCD display on the front panel of the RA indicates a drive failure. This error code is E0D76, as shown in Figure 8 5.

Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
- If the drive failed, you must replace the hard drive. Contact a Unisys service representative for a replacement part.
- Install the new drive; resynchronization occurs automatically. Do not power off or reboot the RA while resynchronization is taking place.

Failure of All RAs at One Site

Problem Description
If all RAs fail at one site, replication stops, and the data that are currently changing on the remote site are marked for synchronization. Once the RAs are restored, synchronization occurs through a full-sweep operation. This type of failure is unlikely unless the power source fails.
Symptoms
The following symptoms might help you identify this failure:
- Transfer is paused for all consistency groups.
- Depending on the environment and group settings, applications that were running on the failed site might stop.
- If the quorum resource belonged to a node at the failed site, MSCS might fail.
The symptoms for this failure are similar to those of a total site failure and of a network failure on both the management network and WAN. Because the WAN link is still functioning, the difference is that the following are true:
- Neither site can access the management console using the site management IP address of the site with the failed RAs.
- Both sites can access the management console using the site management IP address of the site with the functioning RAs.
Communicate with the administrator at the other site to determine whether that site can access the management console. Both sites should see a display similar to Figure 8 13.

Figure 8 13. Management Console Showing All RAs Down

Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
1. Restore power to the failed RAs.
2. If recovery of applications is needed prior to restoring the RAs, see the recovery topics in Section 3 for geographic replication environments and in Section 4 for geographic clustered environments.
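The console-reachability difference described above can be summarized as a small decision rule. The following is an illustrative sketch only, not part of the product; the function name, inputs, and return strings are invented, and the actual checks must be performed manually as described.

```python
def classify_ra_outage(console_via_site_a, console_via_site_b):
    """Classify an outage from which site management IP answers.

    Per the symptoms above: when all RAs fail at one site, the management
    console is reachable only through the site management IP address of
    the site whose RAs are still functioning.
    """
    if console_via_site_a and console_via_site_b:
        return "RAs functioning at both sites"
    if not console_via_site_a and not console_via_site_b:
        return "possible total site failure or dual network failure"
    # Exactly one site management IP answers; the silent one marks the
    # site whose RAs are down.
    failed = "A" if not console_via_site_a else "B"
    return "all RAs down at site " + failed
```

For example, if the console responds through site A's management IP but not through site B's, the rule points at site B's RAs.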
All RAs Are Not Attached

Problem Description
If all RAs at a site are not attached, connection to the management console is not available. Also, you cannot access the RA using a PuTTY session and the site management IP address, and you cannot log in to the RA using the RA management IP address and the admin user account. The RA that runs site control is assigned a virtual IP address that is the site management IP address. Either RA 1 or RA 2 must be attached to the cluster to have an RA cluster with site control running.

Symptoms
The following symptoms might help you identify this failure:
- You cannot log in to the management console using the site management IP addresses of the failed sites.
- You cannot initiate an SSH session through PuTTY using the admin account to either RA management IP address or the site management IP address.
- From the management console of the other site, the WAN appears to be down. (See Figure 8 11.)

Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
1. Ping the RA using the management IP address. If the ping is not successful, refer to Management Network Failure in a Geographic Clustered Environment in Section 7. If the ping completes successfully, continue with steps 2 through 5.
2. Log in as boxmgmt to each RA management IP address through an SSH session using PuTTY. (See Using the SSH Client in Appendix C for more information.) If this is not successful, the RA is probably not attached.
3. To verify that the RA is not attached, follow these steps:
a. Log in as boxmgmt to the RA.
b. At the prompt, type 4 (Cluster operations) and press Enter.
Note: The "reboot regulation limit has been exceeded" message might be displayed when you log in as boxmgmt. In that case, see Reboot Regulation in this section.
c. At the prompt, type 2 (Detach from cluster) and press Enter. Do not type y to detach.
If the RA was not attached, a message is displayed stating that it is not attached.
Note: Either RA 1 or RA 2 must be attached to have a cluster. RAs 3 through 8 cannot become cluster masters.
4. If the RA is not attached, type B (Back) and press Enter.
5. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.
6. At the prompt, type Q (Quit).
7. Once the RA is attached, log in as admin to the management console, and also initiate an SSH session to the management IP address to ensure that both are operational.
8. At the management console, click the RAs tab and check that all connections are working.
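Steps 1 and 2 of this procedure form a small decision tree. The sketch below restates that logic for illustration; the function name and return strings are hypothetical, and the real checks (ping, PuTTY/SSH logins) must still be performed manually as described above.

```python
def diagnose_unattached_ra(ping_ok, admin_ssh_ok, boxmgmt_ssh_ok):
    """Mirror the ping/SSH decision tree from the procedure above."""
    if not ping_ok:
        # Step 1: no ping response points at the management network.
        return "check Management Network Failure in Section 7"
    if admin_ssh_ok:
        # An attached RA accepts admin SSH sessions.
        return "RA appears attached; investigate other causes"
    if boxmgmt_ssh_ok:
        # Step 2: boxmgmt works but admin does not; the RA is likely detached.
        return "RA probably not attached; verify and reattach"
    return "RA unreachable over SSH; check RA power and status"
```

The ordering matters: the network is ruled out first, because no SSH result is meaningful while the management IP address does not answer a ping.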
Section 9
Solving Server Problems

This section lists symptoms that usually indicate problems with one or more servers. The problems listed in this section include hardware failure problems. Table 9 1 lists symptoms and possible problems indicated by the symptoms. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for any of the possible problems or causes. Also, messages similar to e-mail notifications are displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 9 1. Possible Server Problems with Symptoms

Symptom: The management console shows a server down. Messages on the management console show the splitter is down and that the node fails over. Multipathing software (such as EMC PowerPath Administrator) messages report errors. (This symptom might occur if the server is unable to connect with the SAN or if the server HBA fails.)
Possible problem: Cluster node failure (hardware or software) in a geographic clustered environment, possibly resulting from
- Windows server reboot
- Unexpected server shutdown because of a bug check
- Server crash or restart
- Server unable to connect with SAN
- Server HBA failure

Symptom: Host logs and RA log timestamps are not synchronized.
Possible problem: Infrastructure (NTP) server failure

Symptom: Applications are down.
Possible problem: Server failure (hardware or software) in a geographic replication environment, possibly resulting from
- Windows server reboot
- Unexpected server shutdown because of a bug check
- Server crash or restart
- Server unable to connect with SAN
- Server HBA failure
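Table 9 1 can be read as a symptom-to-causes lookup. The mapping below restates the table for illustration only; the dictionary keys are shortened paraphrases of the symptom column, not product identifiers.

```python
# The five subset scenarios shared by both failure categories in Table 9-1.
COMMON_SERVER_CAUSES = [
    "Windows server reboot",
    "Unexpected server shutdown because of a bug check",
    "Server crash or restart",
    "Server unable to connect with SAN",
    "Server HBA failure",
]

SYMPTOM_TO_CAUSES = {
    "management console shows server down":
        ["Cluster node failure (geographic clustered environment)"]
        + COMMON_SERVER_CAUSES,
    "host and RA log timestamps not synchronized":
        ["Infrastructure (NTP) server failure"],
    "applications are down":
        ["Server failure (geographic replication environment)"]
        + COMMON_SERVER_CAUSES,
}
```

A lookup such as `SYMPTOM_TO_CAUSES["applications are down"]` returns the candidate problems to work through in the order this section discusses them.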
Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment

Problem Description
MSCS uses several heartbeat mechanisms to detect whether a node is still actively responding to cluster activities. MSCS assumes a cluster node has failed when the cluster node no longer responds to heartbeats that are broadcast over the public/private cluster networks and when a SCSI reservation is lost on the quorum volume. Figure 9 1 illustrates this failure.

Figure 9 1. Cluster Node Failure

If the server that crashed was the MSCS leader (quorum owner), another cluster node (the challenger) tries to become leader and arbitrate for the quorum device. Because the failed server is no longer the quorum device owner in the reservation manager, the arbitration by the challenger instantly succeeds. If the challenger node is from the same site as the failed server, arbitration instantly succeeds, and no failover of the quorum device to the remote site is required. If the challenger node is from the remote site, the RA reverses the replication direction of the quorum consistency group. Once failover completes, the challenger arbitration is completed.
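The arbitration outcome described above depends only on whether the challenger shares a site with the failed leader. A sketch of that rule follows; the function is hypothetical and purely illustrative, since the real logic lives inside MSCS and the RAs.

```python
def quorum_arbitration_outcome(challenger_site, failed_leader_site):
    """Summarize the quorum arbitration rule described above."""
    if challenger_site == failed_leader_site:
        # Same-site challenger: arbitration succeeds instantly, and no
        # failover of the quorum device to the remote site is required.
        return "instant arbitration; no quorum failover"
    # Remote challenger: the RA first reverses the replication direction
    # of the quorum consistency group; arbitration completes afterward.
    return "quorum consistency group fails over; arbitration completes after failover"
```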
When a nonleader MSCS node fails, the data groups move to the remaining MSCS local or remote nodes, depending on preferred ownership settings. From the perspective of the RA, this situation is equivalent to a user-initiated move of the data groups. That is, the SafeGuard 30m Control resource on the node that tries to bring the group online sends a command to fail over the group to its site. If the group fails over to a cluster node on the same site, failover occurs instantly. Otherwise, a consistency group failover is initiated to the remote site. The SafeGuard 30m Control resource does not come online until the consistency group has completed failover.

Possible Subset Scenarios
The symptoms of a server failure vary based on the reasons that the server went down. Five different scenarios are described as subsets of this type of failure:
- Windows Server Reboot
- Unexpected Server Shutdown Because of a Bug Check
- Server Crash or Restart
- Server Unable to Connect with SAN
- Server HBA Failure

One of the first things to determine in troubleshooting a server failure is whether the failure was an unexpected event (a "crash") or an orderly event such as an operator reboot. When the server crashes, you usually see a blue screen and do not have access to messages. Once the server comes up again, you can view messages regarding the reason it crashed. These messages help diagnose the reason for the initial shutdown or failure. In an orderly event, the Windows event log service is stopped, and you can view events that point to the reason for the reboot or restart.

Windows Server Reboot

Problem Description
The consistency groups fail over to another local node or to the other site because a server fails or goes down. In this scenario, the shutdown is an orderly event and thus causes the Windows event log service to stop.
Symptoms
The following symptoms might help you identify this failure:
- The management console display shows a server failure similar to that shown in Figure 9 2.

Figure 9 2. Management Console Display with Server Error

- Warning and informational messages similar to those shown in Figure 9 3 appear on the management console when a server fails. See the table after the figure for an explanation of the numbered console messages.
Figure 9 3. Management Console Messages for Server Down
The following table explains the numbered messages shown in Figure 9 3. Some of these messages also generate an immediate e-mail notification, a daily summary e-mail notification, or both.

1 (event ID 5008): The source site reports that server USMV-WEST2 performed an orderly shutdown.
2 (event ID 4062): The surviving site accesses the latest image of the consistency group during the failover.
3 (event ID 5032): For each consistency group that moves to a surviving node, the splitter is again splitting.
4 (event ID 4008): For each consistency group that moves to a surviving node, the transfer is paused. In the details of this message, the reason for the pause is given.
5 (event ID 1008): The Unisys SafeGuard 30m Control resource successfully issued an initiate_failover command.
6 (event ID 4086): For each consistency group that moves to a surviving node, data transfer starts and then a quick initialization starts.
7 (event ID 4087): For each consistency group that moves to a surviving node, initialization completes.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the system event logs, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images.

System Event Log for Usmv-West2 Host (Failed Host on Site 1)

12/17/2008 18:17:13 PM EventLog Information None 6006 N/A USMV-WEST2 The Event log service was stopped.
12/17/2008 18:17:48 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
12/17/2008 18:17:48 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log service was started.
System Event Log for Usmv-East2 Host (Surviving Host on Site 2)

12/17/2008 18:17:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Public'.
12/17/2008 18:17:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Private'.
12/17/2008 18:17:56 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.

If you review the cluster log, you can find messages similar to the following examples, which are based on the testing cases (with the failed node owning the quorum) used to generate the previous management console images:

Cluster Log for Usmv-West2 Host (Failed Host on Site 1)

0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [GUM] GumUpdateRemoteNode: Failed to get completion status for async RPC call, status 1115. (Error 1115: A system shutdown is in progress)
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [GUM] GumSendUpdate: Update on node 2 failed with 1115 when it must succeed
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [GUM] GumpCommFailure 1115 communicating with node 2
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [NM] Banishing node 1 from active cluster membership.
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [RGP] Node 1: REGROUP WARNING: reload failed.
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [NM] Halting this node due to membership or communications error. Halt code = 1.
0000089c.00000a54:: 2008/12/17-18:17:42.107 ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5890.
(Error 5890: An operation was attempted that is incompatible with the current membership state of the node)
0000091c.00000fe4:: 2008/12/17-18:17:42.107 ERR [RM] LostQuorumResource, cluster service terminated...

Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)

00000268.00000c38:: 2008/12/17-18:17:42.107 INFO [ClMsg] Received interface unreachable event for node 1 network 2
00000268.00000c38:: 2008/12/17-18:17:42.107 INFO [ClMsg] Received interface unreachable event for node 1 network 1
00000268.00000b70:: 2008/12/17-18:17:42.107 WARN [NM] Communication was lost with interface 374359a2-5782-4b1d-a863-07f84f8c97d9 (node: USMV-WEST2, network: private)
00000268.00000b70:: 2008/12/17-18:17:42.107 INFO [NM] Updating local connectivity info for network afe1f350-f66a-460a-a526-6f58987b911d.
00000268.00000b70:: 2008/12/17-18:17:42.107 INFO [NM] Started state recalculation timer (2400ms) for network afe1f350-f66a-460a-a526-6f58987b911d (private)
00000268.00000b70:: 2008/12/17-18:17:42.107 WARN [NM] Communication was lost with interface 15b9fbe1-c05f-4e90-b937-17fdc27c133e (node: USMV-WEST2, network: public)
00000268.00000b70:: 2008/12/17-18:17:42.107 INFO [NM] Updating local connectivity info for network 9d905035-8105-4c87-a5bc-ce82e49e764a.
00000268.00000b70:: 2008/12/17-18:17:42.107 INFO [NM] Started state recalculation timer (2400ms) for network 9d905035-8105-4c87-a5bc-ce82e49e764a (public)
00000268.000005d0:: 2008/12/17-18:17:39.733 INFO [NM] We own the quorum resource..
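The orderly-versus-unexpected distinction rests on a few well-known EventLog event IDs visible in the system log examples above. The following is an illustrative sketch of that check; in practice you would read the IDs out of the Windows system event log rather than pass them in a list.

```python
# EventLog event IDs seen in the system log examples above.
ORDERLY_STOP = 6006         # "The Event log service was stopped."
UNEXPECTED_SHUTDOWN = 6008  # "The previous system shutdown ... was unexpected."

def classify_shutdown(event_ids):
    """Classify a server restart from the event IDs logged around it."""
    if UNEXPECTED_SHUTDOWN in event_ids:
        return "unexpected (crash or bug check)"
    if ORDERLY_STOP in event_ids:
        return "orderly (planned reboot or shutdown)"
    return "undetermined"
```

Events 6009 (boot banner) and 6005 (event log service started) appear after every restart, so they carry no diagnostic weight here; only 6006 and 6008 distinguish the two cases.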
Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
- Check for event 5008 in the management console logs. If this event is replaced by event 5013, the host probably crashed. See Unexpected Server Shutdown Because of a Bug Check and Server Crash or Restart.
- Review the cluster log and check for the system shutdown message as shown in the preceding examples. Determine whether the quorum resource moved by checking the surviving nodes for the message We own the quorum resource.
- Review the Windows system event log messages and determine whether the server failure was a crash or an orderly event. In this case, based on the example messages, the Windows system event log shows that the system started the reboot or shutdown in an orderly manner at 6:17:13 p.m. (message 6006). Because the event log service was shut down, the events that follow show that the event log service restarted. For an orderly event, often an operator shuts down the system for some planned reason. If the event log messages do not point to an orderly event, then review Unexpected Server Shutdown Because of a Bug Check and Server Crash or Restart as possible scenarios that fit the circumstances.

Unexpected Server Shutdown Because of a Bug Check

Problem Description
The consistency groups fail over to another local node or to the other site because a server fails or shuts down unexpectedly and then reboots after the blue screen event.

Symptoms
The following symptoms might help you identify this failure:
- The management console display shows a server failure similar to that shown in Figure 9 2.
- Warning and informational messages similar to those shown in Figure 9 4 appear on the management console when a server fails. See the table after the figure for an explanation of the numbered console messages.
Figure 9 4. Management Console Messages for Server Down for Bug Check
The following table explains the numbered messages shown in Figure 9 4. Some of these messages also generate an immediate e-mail notification, a daily summary e-mail notification, or both.

1 (event ID 5013): The splitter for the server USMV-WEST2 is down unexpectedly.
2 (event ID 4008): For each consistency group, the transfer is paused at the source (down) site. In the details of this message, the reason for the pause is given.
3 (event ID 5002): The splitter for server USMV-WEST2 is unable to access the RA unexpectedly.
4 (event ID 4008): For each consistency group, the transfer is paused at the surviving site to allow a switchover. In the details of this message, the reason for the pause is given.
5 (event ID 4062): The surviving site accesses the latest image of the consistency group during the failover.
6 (event ID 5032): For each consistency group at the surviving site, the splitter is splitting to the replication volumes.
7 (event ID 5002): The RA at the source (down) site cannot access the splitter for server USMV-WEST2.
8 (event ID 4010): For each consistency group at the source site, the transfer is started.
9 (event ID 4086): For each consistency group at the source site, data transfer starts and then initialization starts.
10 (event ID 4087): For each consistency group at the source site, initialization completes.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

If you review the Windows system event logs after the system reboots, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images.
System Log for Usmv-West2 Host (Failed Host on Site 1)

12/17/2008 18:17:42 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system shutdown at 18:25:42 PM on 12/17/2008 was unexpected.
12/17/2008 18:17:42 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R) Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.
12/17/2008 18:17:42 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log service was started.
12/17/2008 18:17:42 PM Save Dump Information None 1001 N/A USMV-WEST2 The computer has rebooted from a bugcheck. The bugcheck was: 0x0000007e (0xffffffffc0000005, 0xe000015f97c8a664, 0xe000015f9e52be68, 0xe000015f9e52afb0). A dump was saved in: C:\WINDOWS\MEMORY.DMP.

System Log for Usmv-East2 Host (Surviving Host on Site 2)

12/17/2008 18:25:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Public'.
12/17/2008 18:25:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Private'.
12/17/2008 18:25:42 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
12/17/2008 18:25:42 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.

If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

Cluster Log for Usmv-West2 Host (Failed Host on Site 1)

For this error situation, no entries appear in the cluster log.
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)

000007e0.00000138:: 2008/12/17-18:25:42.104 INFO [ClMsg] Received interface unreachable event for node 1 network 2
000007e0.00000138:: 2008/12/17-18:25:42.104 INFO [ClMsg] Received interface unreachable event for node 1 network 1
000007e0.00000124:: 2008/12/17-18:25:42.104 WARN [NM] Communication was lost with interface 5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)
000007e0.00000124:: 2008/12/17-18:25:42.104 WARN [NM] Communication was lost with interface f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: B)
000001c0.00000664:: 2008/12/17-18:25:42.507 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170. (Error 170: the requested resource is in use)
000001c0.00000664:: 2008/12/17-18:25:42.507 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
000001c0.00000664:: 2008/12/17-18:25:42.507 INFO Physical Disk <Disk Q:>: [DiskArb] We are about to break reserve.
000007e0.00000a0c:: 2008/12/17-18:25:42.881 INFO [NM] We own the quorum resource.
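The Save Dump (event 1001) message shown earlier embeds the stop code needed for a Knowledge Base search. A sketch of extracting it follows; the helper is illustrative and assumes the message wording shown in the example log.

```python
import re

def extract_stop_code(save_dump_message):
    """Pull the bug check stop code out of a Save Dump (event 1001) message."""
    match = re.search(r"The bugcheck was: (0x[0-9a-fA-F]+)", save_dump_message)
    return match.group(1) if match else None
```

Searching the Microsoft Knowledge Base for the returned value (for example, 0x0000007e) locates the matching bug check article.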
Actions to Resolve the Problem
Perform the following actions to isolate and resolve the problem:
1. Review the Windows system event log messages to determine the cause of the unexpected event. In this case, based on the four example messages, the first Windows system event log entry shows event 6008, in which the system unexpectedly shut down; it was not a reboot. Then event 6009 is typically displayed as a reboot message. This event occurs regardless of the reason for the reboot. The same is true for event 6005. The Save Dump event 1001 shows that a memory dump was saved. Based on this message, consult the Microsoft Knowledge Base regarding bug checks (http://support.microsoft.com/). Search for "bug check 0x0000007e" or "stop error 0x0000007e", and replace the stop number with the one displayed.
2. Once you have the appropriate Knowledge Base article from the Microsoft site, follow the recommendations in the article to resolve the issue.
3. If the information from the Knowledge Base article does not resolve the problem, collect and save the memory dump file and then submit it to the Unisys Support Center.

Server Crash or Restart

Problem Description
When the server goes down for whatever reason and then restarts in a geographic clustered environment, the consistency groups fail over to the other site and then fail over to the original site once the server is restarted.

Symptoms
The following symptoms might help you identify this failure:
- The management console display shows a server failure similar to that shown in Figure 9 2.
- Warnings and informational messages similar to those shown in Figure 9 4 appear on the management console when the server fails. See the table after that figure for an explanation of the numbered console messages.
If you review the Windows system event log, you can find messages similar to the following examples that are based on the testing cases used to generate the management console images for Figures 9 2 and 9 4:
System Log for Usmv-West2 Host (Failure Host on Site 1)

12/17/2008 18:42:39 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system shutdown at 18:05:55 PM on 12/17/2008 was unexpected.
12/17/2008 18:42:39 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R) Windows (R) 5.02. 3790 Service Pack 2 Multiprocessor Free.
12/17/2008 18:42:39 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log service was started.

System Log for Usmv-East2 Host (Surviving Host on Site 2)

12/17/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Public'.
12/17/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost communication with cluster node 'USMV-WEST2' on network 'Private'.
12/17/2008 18:05:55 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
12/17/2008 18:05:55 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.

If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the management console images for Figures 9 2 and 9 4:

Cluster Log for Usmv-West2 Host (Failure Host on Site 1)

For this error situation, no entries appear in the cluster log.
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)

000007e0.00000138:: 2008/12/17-18:05:55.102 INFO [ClMsg] Received interface unreachable event for node 1 network 2
000007e0.00000138:: 2008/12/17-18:05:55.102 INFO [ClMsg] Received interface unreachable event for node 1 network 1
000007e0.00000124:: 2008/12/17-18:05:55.102 WARN [NM] Communication was lost with interface 5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)
000007e0.00000124:: 2008/12/17-18:05:55.102 WARN [NM] Communication was lost with interface f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: USMV-WEST2, network: Private LAN)
000001c0.00000168:: 2008/12/17-18:05:55.504 ERR Physical Disk <Disk Q:>: [DiskArb] GetPartInfo completed, status 170. (Error 170: the requested resource is in use)
000001c0.00000168:: 2008/12/17-18:05:55.504 ERR Physical Disk <Disk Q:>: [DiskArb] Failed to read (sector 12), error 170.
000001c0.00000168:: 2008/12/17-18:05:55.504 INFO Physical Disk <Disk Q:>: [DiskArb] We are about to break reserve.
000007e0.00000764:: 2008/12/17-18:05:55.079 INFO [NM] We own the quorum resource.
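If you need to scan a large cluster log for the WARN and ERR entries shown in these excerpts, the line format can be parsed with a short script. The following is a generic sketch based only on the sample entries above (process ID, thread ID, timestamp, level, message); it is not a product tool, and real cluster logs may contain lines in other shapes, for which the function returns None.

```python
import re
from datetime import datetime

# Matches lines such as:
# 000007e0.00000138:: 2008/12/17-18:05:55.102 INFO [ClMsg] Received ...
CLUSTER_LINE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::\s*"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<msg>.*)$")

def parse_cluster_line(line):
    """Split one cluster log line into its fields; None if it does not match."""
    m = CLUSTER_LINE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the yyyy/mm/dd-hh:mm:ss.mmm timestamp into a datetime object
    fields["ts"] = datetime.strptime(fields["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return fields
```

Filtering a log for failures then reduces to keeping the lines whose parsed level is WARN or ERR, which makes it easy to correlate cluster log entries with the system event log timestamps shown above.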
Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. Run the Microsoft Product Support MPS Report Utility to gather system information. (See Using the MPS Report Utility in Appendix A.)

2. Submit the MPS report to the Unisys Support Center.

Server Unable to Connect with SAN

Problem Description

Symptoms

The server is unable to connect to the SAN. The following symptoms might help you identify this failure:

The management console display shows a server failure similar to that shown in Figure 9 5.

Figure 9 5. Management Console Display Showing LA Site Server Down

To display more information about the error, click on More in the right column. A message similar to the following is displayed:

ERROR: Splitter USMV-WEST2 is down

Warnings and informational messages similar to those shown in Figure 9 6 appear on the management console when the server fails. See the table after the figure for an explanation of the numbered console messages.
Figure 9 6. Management Console Images Showing Messages for Server Unable to Connect to SAN

The following table explains the numbered messages in Figure 9 6.
Reference No.  Event ID  Description  E-mail (Immediate or Daily Summary)

1  5013  The splitter for the server USMV-WEST2 is down.  X
2  4008  For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.  X
3  4008  For each consistency group, the transfer is paused at the surviving site to allow a failover. In the details of this message, the reason for the pause is given.  X
4  5002  The splitter for the server USMV-WEST2 is unable to access the RA.  X
5  4010  The consistency groups on the original failed site start data transfer.  X
6  4086  For each consistency group at the failed site, data transfer starts and then initialization starts.  X
7  4087  For each consistency group at the failed site, data transfer completes.  X

The multipathing software (EMC PowerPath Administrator) flashes a red X on the right side of the toolbar.

The PowerPath Administrator Console reports failures similar to those shown in Figure 9 7.
Figure 9 7. PowerPath Administrator Console Showing Failures

If you review the server system event log, you can find error messages similar to the following examples that are based on the testing cases used to generate the previous management console images.

Type : warning
Source : Ftdisk
EventID : 57
Description : The system failed to flush data to the transaction log. Corruption may occur.

Type : error
Source : Emcpbase
EventID : 100
Description : Path Bus x Tgt y LUN z to APMxxxx is dead

The event 100 message appears numerous times, once for each bus, target, and LUN.

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. At the server, run a tool such as the PowerPath Administrator that might aid in diagnosing the problem.

2. Log in to the storage software and determine whether problems are reported. If so, use the information for that software to correct the problems. Something might have happened to the volume, or the zoning configuration on the switch might have been changed. Also, a connection issue could exist, such as a fabric switch or storage cable failure.
3. If the problem is not limited to one server, run the Installation Manager Fibre Channel diagnostics. Appendix C explains how to run the Installation Manager diagnostics and provides information about the various diagnostic capabilities.

4. If the problem still appears at the host, an adapter with multiple ports might have failed. Replace the Fibre Channel adapter in the host if the storage, zoning, and cabling appear correct. Ensure that the storage and zoning are corrected to use the new WWN as necessary. (See Server HBA Failure for resolution actions.)

Server HBA Failure

Problem Description

Symptoms

One HBA failed on a host that has multiple paths to storage. The following symptoms might help you identify this failure:

The multipathing software (such as EMC PowerPath Administrator) flashes a red X on the right side of the toolbar.

The PowerPath Administrator console reports failures similar to those shown in Figure 9 8.

Figure 9 8. PowerPath Administrator Console Showing Adapter Failure
If you review the server system event log, you can find error messages similar to the following example:

Type : error
Source : Emcpbase
EventID : 100
Description: Path Bus x Tgt y LUN z to APMxxxx is dead

The event 100 message appears numerous times, once for each target and LUN.

Actions to Resolve

To replace an HBA in the server, perform the following steps:

1. Run Emulex HBAnywhere and record the WWNs in use by the server.
2. Shut down the server.
3. Replace the failed HBA and then boot the server.
4. Run Emulex HBAnywhere and record the new WWN.
5. Using the SAN switch management software, modify the zoning as needed to replace the failed WWN with the new WWN.
6. If manual discovery was used for the storage, update the configuration to use the new WWN.

Infrastructure (NTP) Server Failure

Problem Description

Symptoms

The replication environment is not affected by an NTP server failure, but the timestamps of log entries are affected. The following symptoms might help you identify the failure:

When comparing log entries of a failover, the host application log and the management console entries are not synchronized.

You are unable to run the synchronization diagnostics as described in the Unisys SafeGuard Solutions Replication Appliance Installation Guide.

Actions to Resolve the Problem

To resolve an NTP server failure, perform the following steps:

1. Temporarily change the cluster mode for a data consistency group to MSCS manual (for a group replicating from the source site to the target site).

2. Perform a move-group operation on a cluster group that contains a Unisys SafeGuard Control resource to a node at the target site.

3. View the management console log for event 1009 as shown in Figure 9 9.
Figure 9 9. Event 1009 Display

4. View the host application event log for event 1115, as follows:

Event Type : Warning
Event Source : 30mControl
Event Category : None
Event ID : 1115
Date : 12/17/2008
Time : 18:17:07 PM
User : N/A
Computer : USMV-EAST2
Description: Online resource failed. Group is not a MSCS auto-data group (5).
Action: Verify through the Management Console that the Global cluster mode is set to MSCS auto-data. Or if doing manual recovery, ensure an image has been selected.
Resource name: Data1
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.7.60 initiate_failover group=data1 active_site=east cluster_owner=usmv-east2

5. Compare the timestamps. If the time between the timestamps is not within a couple of minutes, the host and RAs are not synchronized.

6. Use the Installation Manager site connectivity IP diagnostic by performing the following steps. (For more information, see Appendix C.)

a. Log in to an RA as user boxmgmt with the password boxmgmt.
b. On the Main Menu, type 3 (Diagnostics) and press Enter.
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.
e. When asked to select a target for the tests, type 5 (Other host) and press Enter.
f. Enter the IP address for the NTP server that you want to test.

Note: In step e, you must specify 5 (Other host) rather than 4 (NTP Server). Site 2 does not specify an NTP server in the configuration, and the test will fail if you use 4 (NTP Server).

7. If the NTP server test fails, check that the NTP service on the NTP server is functioning correctly.

8. Use the Installation Manager port diagnostics IP diagnostic to ensure that no ports are blocked. (For more information about running port diagnostics, see Appendix C.)

9. Check that the NTP server specified for the host is the same NTP server specified for the RAs at site 1. (If you want to view the RA configuration settings, use the Installation Manager Setup View capability. For information about that capability, refer to the Unisys SafeGuard Solutions Replication Appliance Installation Guide.)

10. Repeat steps 1 through 5, this time choosing a group that moves from the target site to the source site.

Server Failure (Hardware or Software) in a Geographic Replication Environment

Problem Description

When a server goes down in a geographic replication environment, the circumstances and Windows event log messages are similar to those for a server failure in a geographic clustered environment. That is, the five subset scenarios previously presented apply as far as the event log messages and actions to resolve are concerned. The primary difference is that the main symptom of a server failure in this environment is that the user applications fail. Refer to the previous five subset scenarios for more details.
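Two of the checks in the NTP procedure lend themselves to quick scripting. The sketch below is illustrative only and is not part of the product tooling: the first helper implements the timestamp comparison from step 5 (both timestamps are assumed to already be in the same time zone), and the second is a generic TCP reachability probe that can serve as a first pass before running the Installation Manager port diagnostics in step 8. Note that NTP itself uses UDP port 123, which a TCP probe cannot verify.

```python
import socket
from datetime import datetime

def clocks_synchronized(host_ts, ra_ts, tolerance_minutes=2):
    """Return True when two event timestamps are within the tolerance.

    Both timestamps are MM/DD/YYYY HH:MM:SS strings expressed in the
    same time zone (for example, both converted to GMT/UTC).
    """
    fmt = "%m/%d/%Y %H:%M:%S"
    delta = datetime.strptime(host_ts, fmt) - datetime.strptime(ra_ts, fmt)
    return abs(delta.total_seconds()) <= tolerance_minutes * 60

def tcp_port_open(host, port, timeout=3.0):
    """Attempt a TCP connection; return True if the port accepts it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, a host event at 12/17/2008 18:17:07 compared with an RA log entry at 18:18:30 falls within the two-minute tolerance, while one at 18:25:00 does not, which would indicate that the host and RAs are not synchronized.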
Section 10
Solving Performance Problems

This section lists symptoms that usually indicate performance problems. Table 10 1 lists symptoms and the possible problems indicated by the symptoms. The problems and their solutions are described in this section. This section also includes a general discussion of high-load events.

The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

The management console provides graphs that you can use to evaluate performance. For more information, see the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for the possible problems. Also, messages similar to e-mail notifications are displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 10 1. Possible Performance Problems with Symptoms

Symptom: The initialization progression indicator (%) in the management interface progresses significantly slower than expected. Initialization completes after a significantly longer period of time than expected.
Possible Problem: Slow initialization

Symptom: The event log indicates that the disk manager has reported high load conditions for a specific consistency group or groups. A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.
Possible Problem: High-load (disk manager)
Table 10 1. Possible Performance Problems with Symptoms (cont.)

Symptom: The event log indicates that the distributor has reported high load conditions for a specific consistency group or groups. A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.
Possible Problem: High load (distributor)

Symptom: Applications are offline for a lengthy period during changes in the replication direction.
Possible Problem: Failover time lengthens

Slow Initialization

Problem Description

Symptoms

Initialization of a consistency group or groups takes longer than expected. Progression of initialization is reported through the management console in percentages. You might notice that the percentage for a group has not progressed in a long time or progresses at a slow rate. This progression might or might not be normal, depending on several factors. For some groups, it might be natural to take a long time to advance to the next percentage: one percent of 10 TB is much larger than one percent of 100 GB; therefore, larger groups take longer to advance in initialization.

The following symptoms might help you identify this failure:

The initialization progression indicator (%) in the management interface progresses significantly slower than expected.

Initialization completes after a significantly longer period of time than expected.

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

Verify the bandwidth of the connection between sites using the Installation Manager network diagnostic tools to test the WAN speed while there is no traffic over the WAN. Appendix C explains how to run these diagnostics.

Use the Installation Manager Fibre Channel diagnostic tools or customer storage/SAN diagnostic tools to test the performance of the source and target storage LUNs to ensure that all storage LUNs are capable of handling the observed load.
Appendix C explains how to run the Installation Manager diagnostics.
If storage performance on either site is poor, the replication system could be limited in its ability to read from the replication volumes on the source site or to write to the journal volume on the remote site. Poor storage performance reduces the maximum speed at which the RAs can initialize.

Verify that no bandwidth limitation exists in the properties of the relevant group or groups.

Use the event log to verify that no other events occurred during initialization (for example, high load conditions, WAN disconnections, or storage disconnections) that could have caused the initialization to restart.

Diagnosis of these types of problems is usually specific to the environment. Collect RA logs and submit a service request to Unisys support if the cause of slow initialization cannot be determined through the actions given above. See Appendix A for information about collecting logs.

General Description of High-Load Events

A high-load event reports that, at the time of the event, a bottleneck existed in the replication process. To keep track of the changes being made during the bottleneck, the replication goes into marking mode and records the location of all changed data on the source replication volume until the activity causing the bottleneck has subsided.

The three possible points at which a bottleneck might occur are as follows:

Between the host and RA (disk manager). Of the three points, this is the rarest cause of a bottleneck. This type of bottleneck occurs when the host is writing to the storage device faster than the RA can handle.

The WAN. This type of bottleneck occurs when the host is writing to the storage device faster than the RAs can replicate over the available bandwidth. For example, a host is writing to the storage device during peak hours at a rate of 60 Mbps. The RAs compress this data down to 15 Mbps. The available bandwidth is 10 Mbps.
Clearly, during peak hours, the bandwidth is not sufficient to support the write rate; therefore, during peak hours, a number of high load events occur.

The remote storage (distributor). This type of bottleneck occurs when the storage device containing the journal volume on the remote site cannot keep up with the speed at which data is being replicated to the remote site. To avoid this situation, configure the journal volume on the fastest possible LUNs using the fastest RAID and the most disk spindles. Also, use multiple journal volumes located on different physical disks in the storage array, or use separate disk subsystems in the same consistency group, so that the replication can perform an additional layer of striping. The replication stripes the images across these multiple journal volumes.
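The arithmetic behind the WAN example can be made explicit. The following sketch is illustrative only; the 4:1 compression ratio is inferred from the 60 Mbps to 15 Mbps figures in the example above, and actual compression varies with the data.

```python
def wan_overloaded(write_rate_mbps, compression_ratio, available_mbps):
    """Return True when the compressed write stream exceeds the link,
    which is the condition that produces high-load events and sends
    the replication into marking mode."""
    compressed_mbps = write_rate_mbps / compression_ratio
    return compressed_mbps > available_mbps
```

With the figures from the example, wan_overloaded(60, 4, 10) is True (15 Mbps of compressed traffic against a 10 Mbps link), whereas a 20 Mbps link would carry the same peak-hour load without high-load events.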
High-Load (Disk Manager) Condition

Problem Description

Symptoms

The disk manager reports high-load conditions. The following symptoms might help you identify this failure:

The event log indicates that the disk manager reported high load conditions for a specific consistency group or groups (event ID 4019).

A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

Use the Installation Manager network diagnostic tools to test the WAN speed while there is no traffic over the WAN. Appendix C explains how to run these diagnostics.

Analyze the performance data for the consistency groups on the RA to ensure that the incoming write rate is not outside the limits of the available bandwidth or the capabilities of the RA. High loads can occur naturally during traffic peaks or during periods of high external activity on the WAN. If the high load events occur infrequently or can be associated with a temporal peak, consider this behavior as normal.

Diagnosis of these types of problems is usually specific to the environment. Collect RA logs and submit a service request to the Unisys Support Center if the high load events occur frequently and you cannot resolve the problem through the actions previously listed. See Appendix A for information about collecting logs.

High-Load (Distributor) Condition

Problem Description

Symptoms

The distributor reports high-load conditions. The following symptoms might help you identify this failure:

The event log indicates that the distributor reported high load conditions for a specific consistency group or groups.

A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.
Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

Use the Installation Manager Fibre Channel diagnostic tools or customer storage or SAN diagnostic tools to test the performance of the target-site storage LUNs. Appendix C explains how to run the Installation Manager diagnostics.

Analyze the WAN performance of the consistency group or groups, and ensure that loads are not too high for handling by the target-site storage devices. High loads can occur naturally during traffic peaks. If the high-load events occur infrequently or can be associated with a temporal peak, consider this behavior as normal.

Diagnosis of these types of problems is usually specific to the environment. Collect RA logs and submit a service request to the Unisys Support Center if the high-load events occur frequently and you cannot resolve the problem through the actions previously listed. See Appendix A for information about collecting logs.

Failover Time Lengthens

Problem Description

Symptoms

Prior to changing the replication direction, the images must be distributed to the target-site volumes. The applications are not available during this process.

Applications are offline for a lengthy period during changes to the replication direction.

Actions to Resolve the Problem

Refer to the Unisys SafeGuard Solutions Planning and Installation Guide for more information on pending timeouts.
Appendix A
Collecting and Using Logs

Whenever a failure occurs, you might need to collect and analyze log information to assist in diagnosing the problem. This appendix presents information on the following tasks:

Collecting RA logs
Collecting server (host) logs
Analyzing RA log collection files
Analyzing server (host) logs
Analyzing intelligent fabric switch logs

Collecting RA Logs

When you collect logs from one RA, you automatically collect logs from all other RAs and from the servers. Occasionally, you might need to collect logs from the servers (hosts) manually. Refer to Collecting Server (Host) Logs later in this appendix for more information.

Each time you complete a log collection, the files are saved for a maximum of 7 days. The length of time the files remain available depends on the size and number of log collections performed. To ensure that you have the log files that you need, download and store the files locally. Log files with dates older than 7 days from the current date are automatically removed.

To collect the RA logs, perform the following procedures:

1. Set the Automatic Host Info Collection option
2. Test FTP connectivity
3. Determine when the failure occurred
4. Convert local time to GMT or UTC
5. Collect logs from the RA
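Procedure 2 in the list above, testing FTP connectivity, can also be scripted with Python's standard ftplib module instead of typing the commands interactively. This is a generic connectivity check, not a Unisys tool; the server address and credentials follow the same rules given in the manual procedure later in this appendix (the Unisys FTP address with FTP as the user and your e-mail address as the password, or your local FTP server account).

```python
from ftplib import FTP, all_errors

def ftp_connectivity_ok(server, user, password, port=21, timeout=30):
    """Open an FTP session, log in, and log out (the manual 'bye' step).

    server: ftp.ess.unisys.com or your local FTP server address.
    Returns True if the login succeeds, False on any FTP or network error.
    """
    try:
        ftp = FTP(timeout=timeout)
        ftp.connect(server, port)
        ftp.login(user, password)
        ftp.quit()
        return True
    except all_errors:
        return False
```

A False result corresponds to any failure in the interactive session, whether the connection is refused, the server is unreachable, or the login is rejected, so run the manual steps to see the specific FTP error message.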
Setting the Automatic Host Info Collection Option

Perform the following steps to set the Automatic Host Info Collection option:

1. In the Management Console, select System Settings on the System menu. The System Settings page appears.

2. Choose the Automatic Host Info Collection option from Miscellaneous Settings. For more information, refer to the Unisys SafeGuard Solutions Planning and Installation Guide.

Testing FTP Connectivity

To test FTP connectivity, perform the following steps on the management PC. The information you provide depends on whether logs are being collected locally on an FTP server or sent to an FTP server at the Unisys Product Support site.

1. To initiate an FTP session, type FTP at a command prompt. Press Enter.

2. Type Open. Press Enter.

3. At the To prompt, enter one of the following and then press Enter:
ftp.ess.unisys.com (the Unisys FTP address)
Your local FTP server IP address

4. At the User prompt, enter one of the following and then press Enter:
FTP, if you specified the Unisys FTP address
Your local FTP user account

5. At the Password prompt, enter one of the following and then press Enter:
Your Internet e-mail address, if you specified the Unisys FTP address
Your local FTP account password

6. Type bye and press Enter to log out.

Determining When the Failure Occurred

Perform the following steps to determine when the failure occurred:

Note: If you cannot determine the failure time from the RA logs, use the Windows event logs on each server (host) to determine the failure time.

1. Select the Logs tab from the navigation pane in the Management Console. A list of events is displayed. Each event entry includes a Level column that indicates the severity of the event. If necessary, click View and select Detailed.

2. Scan the Description column to find the event for which you want to gather logs.
3. Select the event and click the Filter Log option. The Filter Log dialog box appears.

4. Select any option from the scope list (normal, detailed, advanced) and the level list (info, warning, error).

5. Write down the timestamp that is displayed for the event. You must convert the time displayed to GMT, also called Coordinated Universal Time (UTC). This timestamp is used to calculate the start date and end time for log collection.

6. Click OK.

Converting Local Time to GMT or UTC

Perform the following steps to convert the time at which the failure occurred to GMT or UTC. You need the time zone you noted in the preceding procedure.

1. In Windows Control Panel, click Date and Time.

2. Select the Time Zone tab.

3. Look in the list for the GMT or UTC offset value corresponding to the time zone you wrote down in the procedure Determining When the Failure Occurred. The offset value represents the number of hours that the time zone is ahead of or behind GMT or UTC.

4. Add or subtract the GMT or UTC offset value from the local time.

Example

If the time zone is Pacific Standard Time, the GMT or UTC offset value is 8:00. If the time at which the failure occurred is 13:30, then the GMT or UTC time is 21:30.

Collecting RA Logs

Use the Installation Manager, which is a centralized collection tool, to collect logs from all accessible RAs, servers (hosts), and intelligent fabric switches.

Before you begin log collection, determine the failure date and time. If you have SANTap switches and want to collect information from the switches, know the user name and password to access the switches.

To collect RA logs, perform the following steps:

1. Start the SSH client by performing the steps in Using the SSH Client in Appendix C. Use the site management IP address; log in with boxmgmt as the login user name and boxmgmt as the password.

2. On the Main Menu, type 3 (Diagnostics) and press Enter.

3. On the Diagnostics menu, type 4 (Collect system info) and press Enter.
4. When prompted, provide the following information. Press Enter after each item. (The program displays date and time in GMT/UTC format.)

a. Start date: This date specifies how far back the log collection is to start. Use the MM/DD/YYYY format. Do not accept the default date; the date should be at least 2 days earlier than the current date. This date must include the date and time at which the failure occurred.
b. Start time: This time specifies the GMT/UTC time at which log collection is to start. Use the HH:MM:SS format.
c. End date: This date specifies when log collection is to end. Accept the default date, which is the current date.
d. End time: This time specifies when log collection is to end. Accept the default time, which is the current time.

5. Type y to collect information from the other site.

6. Type y or n, and press Enter when asked about sending the results to an FTP server. If you choose not to send the results to an FTP server, skip to step 8. The results are stored at the URL http://<ra boxmgmt IP address>/info/. You can access the collected results by logging in with webdownload as the log-in name and webdownload as the password. (If your system is set for secure Web transactions, then the URL begins with https://.)

If you choose to send the results to an FTP server and the procedure has been performed previously, all of the information is filled in. If not, provide the following information for the management PC:

a. When prompted for the FTP server, type one of the following and then press Enter:
The IP address of the Unisys Product Support FTP server, 192.61.61.78, or ftp.ess.unisys.com
The IP address of your local FTP server
b. Press Enter to accept the default FTP port number, or type a different port number if you are using a management PC with a nonstandard port number.
c. Type the local user account when prompted for the FTP user name. Press Enter.
d. If you are using the Unisys FTP server, type incoming as the folder name of the FTP location in which to store the collected information. Press Enter. If you are using a local FTP server, press Enter for none.
e. Type a name for the file on the FTP server in the following format:
<8-digit UCF number_><customer name>.tar
Example: 19557111_Company1.tar

Note: If no name is specified, the name will be similar to the following:
sysinfo-<ra-list>-hosts-from-<ra-list>-<date>.tar
Example: sysinfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar

f. Type the appropriate password. Press Enter.

7. On the Collection mode menu, type 3 (RAs and hosts) and press Enter.

Note: The hosts part of this menu selection (RAs and hosts) collects intelligent fabric switch information.

8. Type y or n, and press Enter when asked if you have SANTap switches from which you want to collect information. If you do not have SANTap switches, go to step 10. If you want to collect information from SANTap switches, enter the user name and password to access the switch when prompted.

9. Type n if prompted on whether to perform a full collection, unless otherwise instructed by a Unisys service representative.

10. Type n when prompted to limit collection time. The collection program checks connectivity to all RAs and then displays a list of the available hosts and SANTap switches from which to collect information.

11. Type All and press Enter. The Installation Manager shows the collection progress and reports that it successfully collected data. This collection might take several minutes. Once the data collection completes, a message indicates that the collected information is available at the FTP server you specified or at the URL (http://<ra boxmgmt IP address>/info/ or https://<ra boxmgmt IP address>/info/).

12. Press Enter.

13. On the Diagnostics dialog box, type Q and press Enter to exit the program.

14. Type Y when prompted to quit and press Enter.

Verifying the Results

Ensure that Failed for hosts has no entries. The success or failure entries might be listed multiple times.
For the collection to be successful for hosts and intelligent fabric switches, all entries must indicate Succeeded for hosts. For the collection to be successful for RAs, all entries must indicate Collected data from <RA list>.
There is a 20-minute timeout on the collection process for RAs. There is a 15-minute timeout on the collection process for each host.

If the collection from the remote site failed because of a WAN failure, run the process locally at the remote site.

If the connection with an RA is lost while the collection is in process, no information is collected. Run the process again.

If you transferred the data by FTP to a management PC, you can transfer the collected data to the Unisys Product Support Web site at your convenience. Otherwise, if you are connected to the Unisys Product Support Web site, the collected data is transferred automatically to this Web site. If you use the Web interface, you must download the collected data to the management PC and then transfer the collected data to the Unisys Product Support Web site at your convenience.

Collecting Server (Host) Logs

Use the following utilities to collect log information:

MPS Report Utility
Host information collector (HIC) utility

Using the MPS Report Utility

Use the Microsoft MPS Report Utility to collect detailed information about the current host configuration. You must have administrative rights to run this utility. Unisys uses the cluster (MSCS) version of this utility if that version is available from Microsoft. This version of the utility enables you to gather cluster information as well as the standard Microsoft information. If the server is not clustered, the utility still runs, but the cluster files in the output are blank.

The average time for the utility to complete is between 5 and 20 minutes. It might take longer if you run the utility during peak production time.

You can download the MPS Report Utility from the Unisys FTP server at the following location. (You are not prompted for a username or password.)
ftp://ftp.ntsupport.unisys.com/outbound/mps-reports/

Select one of the following directories, depending on your operating system environment:
32-BIT
64-BIT-IA64
64-BIT-X64 (not a clustered version)
Output Files

Individual output files are created by using the following directory structure. Depending on the MPS Report version, the file name and directory name might vary.
Directory: %systemroot%\MPSReports, typically C:\Windows\MPSReports
File name: %COMPUTERNAME%_MPSReports_xxx.CAB

Using the Host Information Collector (HIC) Utility

Note: You can skip this procedure unless directed to complete it by Unisys support personnel.

Host log collection occurs automatically if the Automatic Host Info Collection option on the System menu of the management console is selected. Perform the following steps to collect log information from the hosts:
1. At the command prompt on the host, change to the appropriate directory, depending on your system:
For 32-bit and Intel Itanium 2-based systems, enter cd C:\Program Files\KDriver\hic
For x64 systems, enter cd C:\Program Files (x86)\KDriver\hic
2. Type one of the following commands:
host_info_collector n (noninteractive mode)
host_info_collector (interactive mode)
If you choose interactive mode, provide the following site information:
Account ID: Click System Settings on the System menu of the Management Console, and then click Account Settings in the System Settings dialog box to access this information.
Account name: The name of the customer who purchased the Unisys SafeGuard 30m solution.
Contact name: The name of the person responsible for collecting logs.
Contact mail: The mail account of the person responsible for collecting logs.
Note: Ignore messages about utilities that are not installed.
Verifying the Results

The process generates a single tar file of the host logs in gzip format. On 32-bit and Intel Itanium 2-based systems, the host logs are located in the following directory: C:\Program Files\KDriver\hic. On 64-bit systems, the host logs are located in the following directory: C:\Program Files (x86)\KDriver\hic.

Analyzing RA Log Collection Files

If you use the Installation Manager RA log collection process, logs are collected from all accessible RAs and servers (hosts). When the tar file is extracted using this process, the information is gathered in a file on the FTP server that is, by default, named with the following format:
sysinfo-<ra list>-hosts-from-<ra list>-<date>.tar
The <date> is in the format yyyy.mm.dd.hh.mm.ss. An example of such a file name is
sysinfo-l1-l2-r1-r2-hosts-from-l1-r1-2007.09.07.17.37.39.tar
For each RA on which logs were collected, directories are created with the following formats:
extracted.<RA identifier>.<date and time of collection>
HLR-<RA identifier>-<date and time of collection>
The <date and time of collection> is in the format yyyy.mm.dd.hh.mm.ss. An example of the name of an extracted directory for the RA is
extracted.l1.2007.06.05.19.25.03 (from left RA 1 on June 5, 2007 at 19:25:03)
In the RA identifier information, the l1 to l8 and r1 to r8 designations refer to RAs at the left and right sites. That is, site 1 RAs 1 through 8 are designated with l, and site 2 RAs 1 through 8 are designated with r.
If the RA collected a host log, the host information is collected in a directory beginning with HLR. For example, HLR-r1-2007.06.05.19.25.03 is the directory from right (site 2) RA 1 on June 5, 2007 at 19:25:03. This directory is described in Host Log Extraction Directory later in this appendix.
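The directory-naming convention above is regular enough to sort collections mechanically. The following is a minimal sketch, not part of the product; the helper name is hypothetical, and the format is assumed to be exactly as described above:

```python
import re
from datetime import datetime

# Hypothetical helper: interpret an RA log extraction directory name of the
# form extracted.<RA identifier>.<yyyy.mm.dd.hh.mm.ss> described above.
def parse_extracted_dir(name):
    m = re.match(r"extracted\.([lr])([1-8])\.(\d{4}(?:\.\d{2}){5})$", name)
    if not m:
        return None
    side, number, stamp = m.groups()
    site = "left (site 1)" if side == "l" else "right (site 2)"
    when = datetime.strptime(stamp, "%Y.%m.%d.%H.%M.%S")
    return site, int(number), when

# Left RA 1, collected June 5, 2007 at 19:25:03
print(parse_extracted_dir("extracted.l1.2007.06.05.19.25.03"))
```

A sweep of this helper over the extraction directory quickly shows which RAs contributed logs and when each collection ran.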
RA Log Extraction Directory

Several files and directories are placed inside the extracted directory for the RA:
parameters: file containing the time frame for the collection
CLI: file containing the output collected by running CLI commands
aiw: file containing the internal log of the system, which is used by third-level support
aiq: file containing the internal log of the system, which is used by third-level support
cm_cli: internal file used by third-level support
init_hl: internal file used by third-level support
kbox_status: file used by third-level support
unfinished_init_hl: file used by third-level support
log: file containing the log of the collection process itself (used only by third-level support)
summary: file containing a summary of the main events from the internal logs of the system, which is used by third-level support
files: directory containing the original directories from the appliance
processes: directory containing some internal information from the system, such as network configuration, process states, and so forth
tmp: temporary directory

Of the preceding items, you should understand the time frame of the collection from the parameters file and focus on the CLI file information. To determine whether the logs were correctly collected, check that the time frame of the collection correlates with the time of the issue, and verify that logs were collected from all nodes.

Root-Level Files

Several files are saved at the root level of the extracted directory: the parameters, CLI, aiw, aiq, cm_cli, init_hl, kbox_status, unfinished_init_hl, log, and summary files.

Parameters File

The parameters file contains the parameters given to the log gathering tool. Those parameters set the time frame for the log collection and are reflected in the parameters file. The format for the date is yyyy/mm/dd.
The following example illustrates the contents of a parameters file:
only_connectivity= 0
min= 2007/08/03 16:25:02
max= 2007/08/04 19:25:02
withcores= 1
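The first check described above, confirming that the collection window covers the time of the issue, can be scripted. A small sketch (Python); the field names and layout are assumed to match the example, and the issue time is a placeholder:

```python
from datetime import datetime

# Sketch: parse a parameters file and confirm the collection window
# covers the time of the issue under investigation.
def parse_parameters(text):
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params

sample = """only_connectivity= 0
min= 2007/08/03 16:25:02
max= 2007/08/04 19:25:02
withcores= 1"""

p = parse_parameters(sample)
fmt = "%Y/%m/%d %H:%M:%S"
window = (datetime.strptime(p["min"], fmt), datetime.strptime(p["max"], fmt))
issue_time = datetime(2007, 8, 4, 2, 0, 0)   # hypothetical time of the issue
print(window[0] <= issue_time <= window[1])  # True
```

If the check prints False, the collection missed the incident and should be rerun with a wider time frame.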
The value 0 for only_connectivity in the parameters file is a standard value for logs. The value 1 for withcores means that core logs (long) were collected for the time displayed.

CLI File

The CLI file contains the output from executing various CLI commands. The commands issued to produce the information are saved to the CLI file in the tmp directory. Executing CLI commands in the process of collecting logs usually produces a large volume of output. The types of information contained in the CLI file are as follows:
Account settings and license
Alert settings
Box states
Consistency groups, settings, and state
Consistency group statistics
Site name
<site> splitters
Management console logs for the period collected
Global accumulators (used by third-level support)
Various settings and system statistics
Save_settings command output
Splitters settings and state
Volumes settings and state
Available images
The commands used to collect the output are listed in the runcli file, described later in this appendix.

Log File

This file contains a report of the log collection that executed. It shows the start and stop time for the log. If there is a problem running CLI commands, information similar to the following appears at the end of the file:
2007/06/05 19:25:40: info: running CLI commands
2007/06/05 19:25:40: info: retrieving site name
2007/06/05 19:25:40: info: site name is "Tunguska"
2007/06/05 19:25:40: info: retrieving groups
2007/06/05 19:25:40: error: while running CLI commands: when running CLI get_groups, RC=2
2007/06/05 19:25:40: error: while running CLI commands: errors retrieving groups. skipping CLI commands.
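When triaging many collections, the check for failed CLI commands can be automated. The following sketch (Python) pulls the error entries out of a log file; it assumes only the "timestamp: level: message" layout shown in the sample:

```python
# Sketch: extract the error lines from an RA collection log, using the
# "<timestamp>: <level>: <message>" layout shown in the sample above.
def collection_errors(log_text):
    errors = []
    for line in log_text.splitlines():
        parts = line.split(": ", 2)
        if len(parts) == 3 and parts[1] == "error":
            errors.append(parts[2])
    return errors

sample_log = """2007/06/05 19:25:40: info: running CLI commands
2007/06/05 19:25:40: info: retrieving groups
2007/06/05 19:25:40: error: while running CLI commands: when running CLI get_groups, RC=2
2007/06/05 19:25:40: error: while running CLI commands: errors retrieving groups. skipping CLI commands."""

for err in collection_errors(sample_log):
    print(err)
```

An empty result suggests the CLI portion of the collection completed cleanly; any output points at commands to investigate.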
Summary File

The summary file is at the root of the extracted directory and contains a summary of the main events from the internal logs of the system. The format of this file is used by third-level support. However, you might find a summary of the errors helpful in some cases.

Files Directory

The files directory contains several subdirectories and files in those directories. The directories are etc, home, collector, rreasons, proc, and var.

etc Directory
This directory contains the rc.local file, which is used by third-level support.

home Directory
The home directory contains the kos directory, which contains several files and these subdirectories: cli, connectivity_tool, control, customer_monitor, hlr, install_logs, kbox, management, monitor, mpi perf, old_config, replication, rmi, snmp, and utils. The home directory also contains the collector and rreasons directories.

collector Directory
This directory contains the connectivity_tool subdirectory, which lists results from connectivity tests to configured IP addresses on the local host loopback and the specific ports on the IP addresses that require testing for various protocols.

rreasons Directory
This directory contains the rreasons.log file, which lists the reasons for any reboots in the specified time frame.
This file is used by third-level support but can be helpful in reviewing the reboot reasons, as shown in the following sample file:
***************************************************************************
=== LogLT STARTED HERE - 2007/07/05 22:40:40 ===
***************************************************************************
Couldn't open logger.ini file, so assuming default all with level DEBUG
2007/07/05 22:40:40.834 - #2-1421 - RebootReasons: getrebootreasons
2007/07/05 22:40:40.834 - #2-1421 - rreasons: Reboot Log: [Mon Apr 16 20:33:00 2007] : kernel watchdog 0 expired (time=66714 lease=1390 last_tick=65233) 0=(1390,65233) 1=(30000,63214) 2=(1400,65233)
Note: In the example, the kernel watchdog 0 expired message indicates a typical reboot that was not the result of an error.

Other Directories
The proc and var directories are also contained within the files directory and are used by third-level support.

processes Directory
The processes directory contains the InfoCollect, sbin, usr, home, and bin directories and several subdirectories.

InfoCollect Directory
Under the InfoCollect directory, the SanDiag.sh file contains the SAN diagnostic logs. The ConnectivityTest.sh file contains connection information. Connection errors in this log do not indicate an error in the configuration or function.

sbin Directory
This directory contains files with information pertaining to networking.
ifconfig file: Lists configuration information, as shown in the following example:
eth0   Link encap:Ethernet HWaddr 00:14:22:11:DD:1B
       inet addr:10.10.21.51 Bcast:10.255.255.255 Mask:255.255.255.0
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:286265797 errors:0 dropped:0 overruns:0 frame:0
       TX packets:228318046 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:100
       RX bytes:1377792659 (1.2 GiB) TX bytes:2189256742 (2.0 GiB)
       Base address:0xecc0 Memory:fe6e0000-fe700000

eth1   Link encap:Ethernet HWaddr 00:14:22:11:DD:1C
       inet addr:172.16.21.51 Bcast:172.16.255.255 Mask:255.255.0.0
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:13341097 errors:0 dropped:0 overruns:0 frame:0
       TX packets:12365085 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:5000
       RX bytes:4156827090 (3.8 GiB) TX bytes:4192345752 (3.9 GiB)
       Base address:0xdcc0 Memory:fe4e0000-fe500000

eth1:1 Link encap:Ethernet HWaddr 00:14:22:11:DD:1C
       inet addr:172.16.21.51 Bcast:172.16.255.255 Mask:255.255.0.0
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       Base address:0xdcc0 Memory:fe4e0000-fe500000

lo     Link encap:Local Loopback
       inet addr:127.0.0.1 Mask:255.0.0.0
       UP LOOPBACK RUNNING MTU:16436 Metric:1
       RX packets:11289452 errors:0 dropped:0 overruns:0 frame:0
       TX packets:11289452 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:3269809825 (3.0 GiB) TX bytes:3269809825 (3.0 GiB)
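Rather than reading the counters in a saved ifconfig capture by eye, the error totals can be summed mechanically. A rough sketch (Python); it assumes only the errors:<n>, dropped:<n>, and overruns:<n> token format shown above, and the sample lines are illustrative:

```python
import re

# Sketch: total the errors/dropped/overruns counters in a saved ifconfig
# capture, as a quick health check on the RA interfaces.
def counter_totals(ifconfig_text):
    totals = {"errors": 0, "dropped": 0, "overruns": 0}
    for key in totals:
        for value in re.findall(rf"{key}:(\d+)", ifconfig_text):
            totals[key] += int(value)
    return totals

sample = """RX packets:286265797 errors:0 dropped:0 overruns:0 frame:0
TX packets:228318046 errors:3 dropped:1 overruns:0 carrier:0"""

print(counter_totals(sample))  # {'errors': 3, 'dropped': 1, 'overruns': 0}
```

Nonzero totals do not prove a fault by themselves, but they indicate which interface statistics deserve a closer look.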
route file: Lists other pieces of routing information, as shown in the following example:
Kernel IP routing table
Destination  Gateway  Genmask        Flags  Metric  Ref  Use  Iface
10.10.21.0   *        255.255.255.0  U      0       0    0    eth0
172.16.0.0   *        255.255.0.0    U      0       0    0    eth1

usr Directory
The usr directory contains two subdirectories: bin and sbin. The bin subdirectory contains the kps.pl file. The following is an example of the kps.pl file for an attached RA:
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05
Processes:
control_process - UP
control_loop.tcsh - UP
replication - UP
mgmt_loop.tcsh - UP
management_server - UP
cli - down
rmi_loop.tcsh - UP
rmi - UP
monitor_loop.tcsh - UP
load_monitor.pl - UP
runall - down
hlr_kbox - UP
rcm_run_loop.tcsh - UP
customer_monitor.pl - UP
Modules:
st - UP
sll - UP
var_link - UP
kaio_mod-2.4.32-k22 - UP

The following is an example of the kps.pl file for a detached RA:
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05
Processes:
control_process - down
control_loop.tcsh - down
replication - down
mgmt_loop.tcsh - down
management_server - down
cli - down
rmi_loop.tcsh - down
rmi - down
monitor_loop.tcsh - down
load_monitor.pl - down
runall - down
hlr_kbox - UP
rcm_run_loop.tcsh - down
customer_monitor.pl - down
Modules:
st - UP
sll - UP
var_link - UP
kaio_mod-2.4.32-k22 - UP

The sbin subdirectory contains the biosdecode and dmidecode files. The biosdecode file provides hardware-specific RA BIOS information and pointers to the locations where this information is stored. The dmidecode file provides handle and other information for components capable of passing this information to a Desktop Management Interface (DMI) agent.

home Directory
The home directory contains the kos subdirectory, which contains other subdirectories that yield the get_users_lock_state.tcsh file. This file lists all the users on the RA.

bin Directory
The bin directory contains the df-h and lspci files. The df-h file contains directory size and disk usage statistics for the RA hard disk drive. The lspci file contains PCI bridge bus numbers, revisions, and OEM identification strings for inbuilt devices in the RA.

tmp Directory
The tmp directory contains the runcli file, which lists the commands that generated the CLI file. It also contains the getgroups file, a temporary file used to gather the list of consistency groups.

runcli File
The following is an example of the runcli file saved in the tmp directory that shows the CLI commands executed:
get_logs from=<start time and date> to=<end time and date> n
The time and date are specified as day, month, year as follows:
get_logs from="22:03 03/08/2007" to="17:03 04/08/2007" n
config_io_throttling n
config_multipath_monitoring n
get_account_settings n
get_alert_settings n
get_box_states n
get_global_policy n
get_groups n
get_groups_sets n
get_group_settings n
get_group_state n
get_group_statistics n
get_id_names n
get_initiator_bindings n
get_pairs n
get_raw_stats n
get_snmp_settings n
get_syslog_settings n
get_system_status n
get_system_settings n
get_system_statistics n
get_tweak_params n
get_version n
get_virtual_targets n
save_settings n
get_splitter_settings site="<site name>"
get_splitter_states site="<site name>"
get_san_splitter_view site="<site name>"
get_san_volumes site="<site name>"
get_santap_view site="<site name>"
get_volume_settings site="<site name>"
get_volume_state site="<site name>"
get_images group="<group name>" (This command is repeated for each group.)

getgroups File
This internal file is used to generate the runcli file.

Host Log Extraction Directory

When the RA collects a host log, the host information is collected in a directory named with the HLR-<RA identifier>-<date and time of collection> format. Such a directory contains a tar.gz file for servers with a name similar in format to the following:
HLR-r1_USMVEAST2_1157647546524147.tar.gz
When you extract a tar.gz file, you can choose to decompress the ZIP file (to_transfer.tar) to a temp folder and open it, or you can choose to extract the files to a directory. When the file is for intelligent fabric switches, the file name does not have the .gz extension.
Analyzing Server (Host) Logs

The output file from host collection is named
Unisys_host_info_<server name>_<date>_<time>.tar.gz
This file contains a folder named collected_items, which contains the following files and directories:
Cluster_log: a folder containing the cluster.log file generated by MSCS
Hic_logs: a folder containing logs used by third-level support
Host_logs: a folder containing logs used by third-level support
Msinfo32: information from the Msinfo32.exe file
Registry.dump: the registry dump for this server
Tweak: the internal RA parameters on this server
Watchdog log: log created by the KDriverWatchDog service
Commands: a file containing output from commands executed on this server, including
  A view of the LUNs recognized by this server
  Some internal RA structures
  Output from the dumpcfg.exe file
  Windows event logs for system, security, and applications

Analyzing Intelligent Fabric Switch Logs

The output file from collecting information from intelligent fabric switches is named with the following format:
HLR-<RA info>_<switch vendor>_identifier.tar
The following name is an example of this format:
HLR-l1_CISCO_232c000dec1a7a02.tar
Once you extract the .tar file, some files are listed with formats similar to the following:
<site name>cvt_<identifier>.tar_at_<switch IP address>_m3_tech
<site name>cvt_<identifier>.tar_at_<switch IP address>_m3_isapi_tech
<site name>cvt_<identifier>.tar_at_<switch IP address>_m3_santap_tech
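Before extracting a collected host-log archive, it can help to list its contents and confirm the collected_items layout is present. A self-contained sketch (Python); the archive here is fabricated in memory purely for illustration, and a real file would be opened with tarfile.open(path, "r:gz"):

```python
import io
import tarfile

# Build a tiny stand-in archive in memory so the listing step can be shown.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"sample log text"
    info = tarfile.TarInfo(name="collected_items/Cluster_log/cluster.log")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# List the member names without extracting anything to disk.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
print(names)
```

Listing first avoids scattering files from a malformed or truncated collection across the working directory.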
Appendix B
Running Replication Appliance (RA) Diagnostics

This appendix
Explains how to clear the system event log (SEL).
Describes how to run hardware diagnostics for the RA.
Lists the LCD status messages shown on the RA.

Clearing the System Event Log (SEL)

Before you run the RA diagnostics, you need to clear the SEL to prevent errors from being generated during the diagnostics run.
1. Insert the bootable Replication Appliance (RA) Diagnostic CD-ROM in the CD/DVD drive.
2. Press Ctrl+Alt+Delete to reboot the RA. The RA displays the event log menu.
3. Select Show all system event log records using the arrow keys; then press Enter. This action results in an SEL summary and indicates whether the SEL contains errors. If there are errors, an error description is given.
Note: You cannot scroll up or down in this screen.
A clear SEL without errors displays IPMI SEL contains 1 records in the summary. Anything greater than one record indicates that errors are present.
Note: The preceding step did not clear the SEL; ignore the statement Log area Reset/Cleared.
4. Press any key to return to the main boot menu.
5. Select Clear System Event Log using the arrow keys, and press Enter to ensure that the SEL is cleared of all error entries.
Note: Depending on whether there are error entries, this clearing action could take up to 1 minute to complete.
6. Press any key again to return to the main boot menu.
7. Select Show all system event log records using the arrow keys and press Enter. Confirm that IPMI SEL contains 1 records is shown.
8. Press any key to return to the main boot menu.
Note: If you accidentally press Escape and leave the main boot menu, a Diag prompt is displayed. Type menu to return to the main boot menu.

Running Hardware Diagnostics

Running the hardware diagnostics for the RA includes completing the Custom Test and Express Test diagnostics. Follow these steps to run the hardware diagnostics for the RA:
1. At the main boot menu, use the arrow keys to select Run Diags; then press Enter.
2. On the Customer Diagnostic Menu, press 2 to select Run ddgui graphics-based diagnostic. The system diagnostic files begin loading, and a message is displayed giving information about the software and showing initializing. Once the diagnostics are loaded and ready to be executed, the Main Menu is displayed.
Custom Test
1. On the Main Menu, select Custom Test using the arrow keys; then press Enter. The Custom Test dialog box is displayed.
2. Expand the PCI Devices folder to view the PCI devices installed in the system, including those devices that are on-board.
3. Select the PCI Devices folder; then press Enter. This action causes each PCI device to be interrogated in turn, and a message is displayed for each one. Verify that the correct number of QLogic adapters is shown.
4. Press OK after each message is displayed until all PCI devices have been recognized and passed. The message All tests passed. is displayed.
Note: If any devices fail this test, investigate and rectify the problem; then clear the SEL as explained in Clearing the System Event Log (SEL).
5. Close the Custom Test dialog box and return to the Main Menu.
Express Test
1. On the Main Menu, select Express Test using the arrow keys; then press Enter. A warning is displayed advising that media must be installed on all drives or else some tests might fail.
2. If a diskette drive is installed in the system, insert a blank, formatted diskette and then click OK to start the test. If no diskette drive is installed, just click OK. During testing, a status screen is displayed. If the diagnostic test run is successful, the message All tests passed. appears.
Notes:
During the video portion of the testing, the screen typically flickers and goes blank.
If any errors occur, investigate and resolve the problem, and then rerun the diagnostic tests. Before you rerun the tests, be sure to clear the SEL as explained in Clearing the System Event Log (SEL).
3. Click OK to exit the diagnostic tests. The Main Menu is then displayed.
4. Select Exit using the arrow keys; then press Enter. The following message is displayed:
Displaying the end of test result.log ddgui.txt. Strike a Key when ready.
5. Press any key to display the diagnostic test summary screen.
6. Verify that no errors are listed. Scroll up and down to see the different portions of the output.
Note: If any errors are listed, investigate and resolve the problem; then rerun the diagnostic tests. Before you rerun the tests, be sure to clear the SEL as explained in Clearing the System Event Log (SEL).
7. Press Escape to return to the original Customer Diagnostic Menu.
8. Press 4 to quit and return to the main boot menu.
9. Select Exit; then press Enter.
10. Remove all media from the diskette and CD/DVD drives.

LCD Status Messages

The LCDs on the RA signify status messages. Table B-1 lists the LCD status messages that can occur and the probable cause for each message. The LCD messages refer to events recorded in the SEL.
Note: For information about corrective actions for the messages listed in Table B-1, refer to the documentation supplied with the system.
Table B-1. LCD Status Messages

Line 1 Message | Line 2 Message | Cause
SYSTEM ID | SYSTEM NAME | The system ID is a unique name, 5 characters or less, defined by the user. The system name is a unique name, 16 characters or less, defined by the user. The system ID and name display when the system is powered on, or when the power is off and active POST errors are displayed.
E000 | OVRFLW CHECK LOG | LCD overflow message. A maximum of three error messages can display sequentially on the LCD. The fourth message is displayed as the standard overflow message.
E0119 | TEMP AMBIENT | Ambient system temperature is out of the acceptable range.
E0119 | TEMP BP | The backplane board is out of the acceptable temperature range.
E0119 | TEMP CPU n | The specified microprocessor is out of the acceptable temperature range.
E0119 | TEMP SYSTEM | The system board is out of the acceptable temperature range.
E0212 | VOLT 3.3 | The system power supply is out of the acceptable voltage range; the power supply is faulty or improperly installed.
E0212 | VOLT 5 | The system power supply is out of the acceptable voltage range; the power supply is faulty or improperly installed.
E0212 | VOLT 12 | The system power supply is out of the acceptable voltage range; the power supply is faulty or improperly installed.
E0212 | VOLT BATT | Faulty battery; faulty system board.
E0212 | VOLT BP 12 | The backplane board is out of the acceptable voltage range.
E0212 | VOLT BP 3.3 | The backplane board is out of the acceptable voltage range.
E0212 | VOLT BP 5 | The backplane board is out of the acceptable voltage range.
E0212 | VOLT CPU VRM | The microprocessor voltage regulator module (VRM) voltage is out of the acceptable range. The microprocessor VRM is faulty or improperly installed. The system board is faulty.
E0212 | VOLT NIC 1.8V | Integrated NIC voltage is out of the acceptable range; the power supply is faulty or improperly installed. The system board is faulty.
Table B-1. LCD Status Messages (cont.)

Line 1 Message | Line 2 Message | Cause
E0212 | VOLT NIC 2.5V | Integrated NIC voltage is out of the acceptable range. The power supply is faulty or improperly installed. The system board is faulty.
E0212 | VOLT PLANAR REG | The system board is out of the acceptable voltage range. The system board is faulty.
E0276 | CPU VRM n | The specified microprocessor VRM is faulty, unsupported, improperly installed, or missing.
E0276 | MISMATCH VRM n | The specified microprocessor VRM is faulty, unsupported, improperly installed, or missing.
E0280 | MISSING VRM n | The specified microprocessor VRM is faulty, unsupported, improperly installed, or missing.
E0319 | PCI OVER CURRENT | The expansion card is faulty or improperly installed.
E0412 | RPM FAN n | The specified cooling fan is faulty, improperly installed, or missing.
E0780 | MISSING CPU 1 | Microprocessor is not installed in socket PROC_1.
E07F0 | CPU IERR | The microprocessor is faulty or improperly installed.
E07F1 | TEMP CPU n HOT | The specified microprocessor is out of the acceptable temperature range and has halted operation.
E07F4 | POST CACHE | The microprocessor is faulty or improperly installed.
E07F4 | POST CPU REG | The microprocessor is faulty or improperly installed.
E07FA | TEMP CPU n THERM | The specified microprocessor is out of the acceptable temperature range and is operating at a reduced speed or frequency.
E0876 | POWER PS n | No power is available from the specified power supply. The specified power supply is improperly installed or faulty.
E0880 | INSUFFICIENT PS | Insufficient power is being supplied to the system. The power supplies are improperly installed, faulty, or missing.
E0CB2 | MEM SPARE ROW | The correctable errors threshold was met in a memory bank; the errors were remapped to the spare row.
E0CF1 | MBE DIMM Bank n | The memory modules installed in the specified bank are not the same type and size. The memory module or modules are faulty.
E0CF1 | POST MEM 64K | A parity failure occurred in the first 64 KB of main memory.
E0CF1 | POST NO MEMORY | The main-memory refresh verification failed.
E0CF5 | LOG DISABLE SBE | Multiple single-bit errors occurred on a single memory module.
Table B-1. LCD Status Messages (cont.)

Line 1 Message | Line 2 Message | Cause
E0D76 | DRIVE FAIL | A hard drive or RAID controller is faulty or improperly installed.
E0F04 | POST DMA INIT | Direct memory access (DMA) initialization failed. DMA page register write/read operation failed.
E0F04 | POST MEM RFSH | The main-memory refresh verification failed.
E0F04 | POST SHADOW | BIOS-shadowing failed.
E0F04 | POST SHD TEST | The shutdown test failed.
E0F0B | POST ROM CHKSUM | The expansion card is faulty or improperly installed.
E0F0C | VID MATCH CPU n | The specified microprocessor is faulty, unsupported, improperly installed, or missing.
E10F3 | LOG DISABLE BIOS | The BIOS disabled logging errors.
E13F2 | IO CHANNEL CHECK | The expansion card is faulty or improperly installed. The system board is faulty.
E13F4 | PCI PARITY |
E13F5 | PCI SYSTEM |
E13F8 | CPU BUS INIT | The microprocessor or system board is faulty or improperly installed.
E13F8 | CPU MCKERR | Machine check error. The microprocessor or system board is faulty or improperly installed.
E13F8 | HOST TO PCI BUS |
E13F8 | MEM CONTROLLER | A memory module or the system board is faulty or improperly installed.
E20F1 | OS HANG | The operating system watchdog timer has timed out.
EFFF1 | POST ERROR | A BIOS error occurred.
EFFF2 | BP ERROR | The backplane board is faulty or improperly installed.
Appendix C
Running Installation Manager Diagnostics

To determine the causes of various problems, as well as to perform numerous procedures, you must access the Installation Manager functions and diagnostics capabilities.

Using the SSH Client

Throughout the procedures in this guide, you might need to use the secure shell (SSH) client. Perform the following steps whenever you are asked to use the SSH client or to open a PuTTY session:
1. From Windows Explorer, double-click the PuTTY.exe file.
2. When prompted, enter the applicable IP address.
3. Select SSH for the protocol and keep the default port setting (port 22).
4. Click Open.
5. If prompted by a PuTTY security dialog box, click Yes.
6. When prompted to log in, type the identified user name and then press Enter.
7. When prompted for a password, type the identified password and then press Enter.

Running Diagnostics

When you open the PuTTY session and log in as boxmgmt/boxmgmt, the Main Menu of Installation Manager is displayed. This menu offers the following six choices: Installation, Setup, Diagnostics, Cluster Operations, Reboot/Shutdown, and Quit. For more information about these capabilities, see the Unisys SafeGuard Solutions Replication Appliance Installation Guide.
To access the various diagnostic capabilities of Installation Manager, perform the following steps:
1. Open a PuTTY session using the IP address of the RA, and log in as boxmgmt/boxmgmt. The Main Menu is displayed, as follows:
** Main Menu **
[1] Install
[2] Setup
[3] Diagnostics
[4] Cluster Operations
[5] Reboot / Shutdown
[Q] Quit
2. Type 3 (Diagnostics) and press Enter. The Diagnostics menu is displayed as follows:
** Diagnostics **
[1] IP diagnostics
[2] Fibre Channel diagnostics
[3] Synchronization diagnostics
[4] Collect system info
[B] Back
[Q] Quit
The four diagnostics capabilities are explained in the following topics.

IP Diagnostics

Use the IP diagnostics when you need to check port connectivity, view IP addresses, test throughput, and review other related information. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter to access the IP Diagnostics menu as shown:
** IP Diagnostics **
[1] Site connectivity tests
[2] View IP details
[3] View routing table
[4] Test throughput
[5] Port diagnostics
[6] System connectivity
[B] Back
[Q] Quit
Site Connectivity Tests

On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter to access the Site connectivity tests menu.
Note: You must apply settings to the RA before you can test options 1 through 4 in the following list.
The options to test are as follows:
** Select the target to which to test connectivity: **
[1] Gateway
[2] Primary DNS server
[3] Secondary DNS server
[4] NTP Server
[5] Other host
[B] Back
[Q] Quit
Tests for options 1 through 4 return a result of success or failure. For option 5, you must specify the target IP address that you want to test. The test returns the relative success of 0 through 100 percent over both the management and WAN interfaces.

View IP Details

From the IP Diagnostics menu, type 2 (View IP details) and press Enter to run an ifconfig process. The displayed results of the process are similar to the following:
eth0   Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E7
       inet addr:10.10.17.61 Bcast:10.10.17.255 Mask:255.255.255.0
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:12751337 errors:0 dropped:0 overruns:0 frame:0
       TX packets:13628048 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:1084700432 (1034.4 Mb) TX bytes:2661155798 (2537.8 Mb)
       Base address:0xecc0 Memory:fe6e0000-fe700000

eth1   Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E8
       inet addr:172.16.17.61 Bcast:172.16.255.255 Mask:255.255.0.0
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:10519453 errors:0 dropped:0 overruns:0 frame:0
       TX packets:10244866 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:5000
       RX bytes:2846677622 (2714.8 Mb) TX bytes:2702094827 (2576.9 Mb)
       Base address:0xdcc0 Memory:fe4e0000-fe500000
eth1:1  Link encap:Ethernet  HWaddr 00:0F:1F:6A:03:E8
        inet addr:172.16.17.60  Bcast:172.16.255.255  Mask:255.255.0.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        Base address:0xdcc0 Memory:fe4e0000-fe500000

lo      Link encap:Local Loopback
        inet addr:127.0.0.1  Mask:255.0.0.0
        UP LOOPBACK RUNNING  MTU:16436  Metric:1
        RX packets:3853904 errors:0 dropped:0 overruns:0 frame:0
        TX packets:3853904 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:0
        RX bytes:3312865098 (3159.3 Mb)  TX bytes:3312865098 (3159.3 Mb)

View Routing Table

On the IP Diagnostics menu, type 3 (View routing table) and press Enter to display the routing table.

Test Throughput

On the IP Diagnostics menu, type 4 (Test throughput) and press Enter to use iperf to test throughput to another RA. Once you select this option, Installation Manager guides you through the following dialog. The bold text shows sample entries.

Note: The Fibre Channel interface appears only if the Installation Manager diagnostic capability was preconfigured to run on Fibre Channel. The option then appears as [2] in the menu list.

Enter the IP address to which to test throughput:
>>192.168.1.86
Select the interface from which to test throughput:
** Interface **
[1] Management interface
[2] Fibre Channel Interface
[3] WAN interface
>>3
Enter the desired number of concurrent streams:
>>2
Enter the test duration (seconds):
>>10
If the test is successful, the system responds with a standard iperf output that resembles the following:

Checking connectivity to 10.10.17.51
Connection to 10.10.17.51 established.
Client connecting to 10.10.17.51, TCP port 5001
Binding to local address 10.10.17.61
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
[ 6] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001
[ 5] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 5]  0.0-10.6 sec   59.1 MBytes  46.9 Mbits/sec
[ 6]  0.0-10.6 sec   59.1 MBytes  46.9 Mbits/sec
[SUM] 0.0-10.6 sec   118 MBytes   93.9 Mbits/sec

Port Diagnostics

On the IP Diagnostics menu, type 5 (Port diagnostics) and press Enter to check that none of the ports used by the RAs are blocked (for example, by a firewall). You must test each RA individually; that is, designate each RA, in turn, to be the server.

Once you select the option, Installation Manager guides you through one of the following dialogs, depending on whether you designate the RA to be the server or the client. In the dialogs, sample entries are bold.

For the server, the dialog is as follows:

In which mode do you want to run ports diagnostics?
** **
[1] Server
[2] Client
>>1

Note: Before you select the server designation for the RA, detach the RA that you intend to specify as the server.
After you specify the RA that you want to test as the server, move to the RA from which you wish to run the port diagnostics tests. Designate that RA as a client, as noted in the following dialog:

** **
[1] Server
[2] Client
>>2
Did you already designate another RA to be the server? (y/n)
>>y
Enter the IP address to test:
>>10.10.17.51

If the test is successful, the system responds with output that resembles the following:

Port No.    TCP Connection
5030        OK
5040        OK
4401        OK
1099        OK
5060        Blocked
4405        OK
5001        OK
5010        OK
5020        OK

Correct the problem on any port that returns a Blocked response.

System Connectivity

Use the system connectivity options to test connections and generate reports on connections between RAs anywhere in the system. You can perform the tests during installation and during normal operation. The tests performed to verify connections are as follows:

- Ping
- TCP (to ports and IP addresses, to the specific processes of the RA, and using SSH)
- UDP (general and to RA processes)
- RA internal protocols
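The TCP port checks above amount to attempting a connection to each well-known RA port and reporting OK or Blocked. When you want to pre-check firewall rules from a management PC before running the RA dialog, a rough stand-in can be sketched in Python; the port list is taken from the sample output above and may differ by release, so treat it as illustrative:

```python
import socket

# Ports shown in the sample port-diagnostics output; illustrative only.
RA_PORTS = [5030, 5040, 4401, 1099, 5060, 4405, 5001, 5010, 5020]

def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'OK' if a TCP connection to host:port succeeds, else 'Blocked'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except OSError:
        return "Blocked"

def port_report(host: str, ports=RA_PORTS) -> dict:
    """Map each port to its connectivity result, mirroring the RA report."""
    return {port: check_tcp_port(host, port) for port in ports}
```

For example, `port_report("10.10.17.51")` produces a dictionary you can compare against the table above. This checks only TCP reachability from the machine running the script, not the RA-to-RA path that the built-in diagnostic exercises.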
On the IP Diagnostics menu, type 6 (System connectivity) and press Enter to access the System Connectivity menu as follows:

** System Connectivity **
[1] System connectivity test
[2] Advanced connectivity test
[3] Show all results from last connectivity check
[B] Back
[Q] Quit

When you select System connectivity test and Full mesh network check, the test reports errors in communications from any RA to any other RA in the system.

When you select System connectivity test and Check from local RA to all other boxes, the test reports errors from the local RA to any other RA in the system.

When you select Advanced connectivity test, the test reports on the connection from an IP address that you specify on the local appliance to an IP address and port that you specify on an RA anywhere in the system. Use this option to diagnose a problem specific to a local IP address or port.

When you select Show all results from last connectivity check, the test reports all results from the previous tests, not only the errors but also the tests that completed successfully.
You might receive one of the messages shown in Table C-1 from the connectivity test tool.

Table C-1. Messages from the Connectivity Testing Tool

Message: Machine is down.
Meaning: There is no communication with the RA. Perform the following steps to determine the problem:
- Verify that the firewall permits pinging the RA, that is, using an ICMP echo.
- Check that the RA is connected and operating.
- Check that the required ports are open. (Refer to Section 7, Solving Networking Problems, for tables with the port information.)

Message: <RA> is down.
Meaning: The host connection exists, but the RA is not responding. Perform the following steps to determine the problem:
- Check that the required ports are open. (Refer to Section 7, Solving Networking Problems, for tables with the port information.)
- Verify that the RA is attached to an RA cluster.

Message: Connection to link: <link> protocol: <protocol> FAILED.
Meaning: No connection is available to the host through the protocol.

Message: Link <link> (<type of connection>) FAILED.
Meaning: The connection that was checked has failed.

Message: All OK.
Meaning: The connection is working.

To discover which port is involved in an error or failure, run the test again and select Show all results from last connectivity check. The port on which each failure occurred is shown.
Fibre Channel Diagnostics

Use the Fibre Channel diagnostics when you need to check SAN connections, review port settings, see details of the Fibre Channel, determine Fibre Channel targets and LUNs, and perform I/O operations to a LUN.

On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter to access the Fibre Channel Diagnostics menu as follows:

** Fibre Channel Diagnostics **
[1] Run SAN diagnostics
[2] View Fibre Channel details
[3] Detect Fibre Channel targets
[4] Detect Fibre Channel LUNs
[5] Detect Fibre Channel SCSI-3 reserved LUNs
[6] Perform I/O to a LUN
[B] Back
[Q] Quit

Run SAN Diagnostics

On the Fibre Channel Diagnostics menu, type 1 (Run SAN diagnostics) and press Enter to run the SAN diagnostics. When you select this option, the system conducts a series of automatic tests to identify the most common problems encountered in the configuration of SAN environments, such as the following:

- Storage inaccessible within a site
- Delays with writes or reads to disk
- Disk not accessible in the network
- Configuration issues

Once the tests complete, a message is displayed confirming the successful completion of SAN diagnostics, or a report is displayed that provides additional details. Results similar to the following are displayed for a successful diagnostics run of port 0:

0 errors:
0 warnings:
Total=0
Sample results follow for a diagnostics run that returns errors:

ConfigB_Site2 Box2>>1
>>Running SAN diagnostics. This may take a few moments...
results of SAN diagnostics are
3 errors:
1. Found device with no guid : wwn=5006016b1060090d lun=0 port=0 vendor=dgc product=lunz
2. Found device with no guid : wwn=500601631060090d lun=0 port=0 vendor=dgc product=lunz
3. Found device with no guid : wwn=5006016b1060090d lun=0 port=1 vendor=dgc product=lunz
9 warnings:
1. device wwn=500601631060090d lun=8 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,125,87,93,152,230,229,218,17)) found in port 1 and not in port 0
2. device wwn=500601631060090d lun=7 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,127,87,93,152,230,229,218,17)) found in port 1 and not in port 0
3. device wwn=500601631060090d lun=6 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,129,87,93,152,230,229,218,17)) found in port 1 and not in port 0
4. device wwn=500601631060090d lun=5 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,131,87,93,152,230,229,218,17)) found in port 1 and not in port 0
5. device wwn=500601631060090d lun=4 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,133,87,93,152,230,229,218,17)) found in port 1 and not in port 0
6. device wwn=500601631060090d lun=3 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,135,87,93,152,230,229,218,17)) found in port 1 and not in port 0
7. device wwn=500601631060090d lun=2 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,137,87,93,152,230,229,218,17)) found in port 1 and not in port 0
8. device wwn=500601631060090d lun=1 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,139,87,93,152,230,229,218,17)) found in port 1 and not in port 0
9. device wwn=500601631060090d lun=0 guid=(storage=clarion,buffer=vector(96,6,1,96,155,195,14,0,141,87,93,152,230,229,218,17)) found in port 1 and not in port 0
Total=12

View the Fibre Channel Details

On the Fibre Channel Diagnostics menu, type 2 (View Fibre Channel details) and press Enter to show the current Fibre Channel details. The operation mode is identified automatically according to the SAN switch configuration. Usually the RA is configured for the point-to-point mode unless the SAN switch is hard-wired to port L.

Note: You can use the View Fibre Channel details capability to obtain information about WWNs that is needed for zoning.

You can check the status for the following on the Fibre Channel Diagnostics menu:

- Speed
- Operating mode
- Node WWN
- Changes made
- Connection issues
- Additions of new HBAs

Sample results showing Fibre Channel details for port 0 and port 1 follow:

ConfigB_Site2 Box2>>2
>>
Port 0
------------------------------------
wwn            = 5001248200875c81
node_wwn       = 5001248200875c80
port id        = 0x20100
operating mode = point to point
speed          = 2 GB

Port 1
------------------------------------
wwn            = 5001248201a75c81
node_wwn       = 5001248201a75c80
port id        = 0x20500
operating mode = point to point
speed          = 2 GB
If all cables are disconnected, the operating mode results for all ports are disconnected. If only one cable is disconnected, then the operating mode for the affected port is disconnected, as shown in the following sample results:

ConfigB_Site2 Box2>>2
>>
Port 0
------------------------------------
wwn            = 5001248200875c81
node_wwn       = 5001248200875c80
port id        = 0x20100
operating mode = point to point
speed          = 2 GB

Port 1
------------------------------------
wwn            = 5001248201a75c81
node_wwn       = 5001248201a75c80
port id        = 0x0
operating mode = disconnected
speed          = 2 GB

Detect Fibre Channel Targets

On the Fibre Channel Diagnostics menu, type 3 (Detect Fibre Channel targets) and press Enter to see a list of the targets that are accessible to the RA through ports A and B. Some of the reasons to use this capability are as follows:

- Zoning issues
- Failure to detect a host
- SAN connection issues
- Need for WWN or storage details of each RA

The following sample results provide port WWN, node WWN, and port information:

ConfigB_Site2 Box2>>3
>>
Port 0
   Port WWN              Node WWN              Port ID
   ----------------------------------------------------
1) 0x500601631060090d    0x500601609060090d    0x20000
2) 0x5006016b1060090d    0x500601609060090d    0x20400
Port 1
   Port WWN              Node WWN              Port ID
   ----------------------------------------------------
1) 0x500601631060090d    0x500601609060090d    0x20000
2) 0x5006016b1060090d    0x500601609060090d    0x20400

Detect Fibre Channel LUNs

On the Fibre Channel Diagnostics menu, type 4 (Detect Fibre Channel LUNs) and press Enter to see a list of all volumes on the SAN that are visible to the RA. Using this capability can detect

- Issues with volume access
- LUN repository details
- Additions of volumes

In the following sample results that show the types of information returned, the information wraps around:

ConfigB_Site2 Box2>>4
>>This operation may take a few minutes...
Size    Vendor  Product  Serial Number   Vendor Specific
UID                                                Port  WWN               LUN  CGs Site ID
================================================================================
1. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 127
   CLARION: 60,06,01,60,9b,c3,0e,00,8d,57,5d,98,e6,e5,da,11:0  1  500601631060090d  0  2
2. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 125
   CLARION: 60,06,01,60,9b,c3,0e,00,8b,57,5d,98,e6,e5,da,11:0  1  500601631060090d  1  2
3. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 123
   CLARION: 60,06,01,60,9b,c3,0e,00,89,57,5d,98,e6,e5,da,11:0  1  500601631060090d  2  2
4. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 121
   CLARION: 60,06,01,60,9b,c3,0e,00,87,57,5d,98,e6,e5,da,11:0  1  500601631060090d  3  2
5. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 119
   CLARION: 60,06,01,60,9b,c3,0e,00,85,57,5d,98,e6,e5,da,11:0  1  500601631060090d  4  2
6. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 117
   CLARION: 60,06,01,60,9b,c3,0e,00,83,57,5d,98,e6,e5,da,11:0  1  500601631060090d  5  2
7. 1.00GB  DGC  RAID 5  APM00031800182  LUN ID: 115
   CLARION: 60,06,01,60,9b,c3,0e,00,81,57,5d,98,e6,e5,da,11:0  1  500601631060090d  6  0
8. 4.00GB  DGC  RAID 5  APM00031800182  LUN ID: 113
   CLARION: 60,06,01,60,9b,c3,0e,00,7f,57,5d,98,e6,e5,da,11:0  1  500601631060090d  7  2
9. 62.00GB DGC  RAID 5  APM00031800182  LUN ID: 111
   CLARION: 60,06,01,60,9b,c3,0e,00,7d,57,5d,98,e6,e5,da,11:0  1  500601631060090d  8  40
10. N/A    DGC  LUNZ    APM00031800182  -
    N/A                                                        0  500601631060090d  0  N/A
11. N/A    DGC  LUNZ    APM00031800182  -
    N/A                                                        0  5006016b1060090d  0  N/A
12. N/A    DGC  LUNZ    APM00031800182  -
    N/A                                                        1  5006016b1060090d  0  N/A
Detect Fibre Channel SCSI-3 Reserved LUNs

On the Fibre Channel Diagnostics menu, type 5 (Detect Fibre Channel SCSI-3 reserved LUNs) and press Enter to list all LUNs that have SCSI-3 reservations. The information returned includes the WWN, LUN number, port number, and reservation type.

Perform I/O to a LUN

On the Fibre Channel Diagnostics menu, type 6 (Perform I/O to a LUN) and press Enter to initiate a dialog that guides you through performing an I/O operation to a LUN.

Note: The write operation removes any data that the LUN might contain. Use the write operation only when you are installing at the site.

The following example for a read operation shows sample responses in bold type.

SYDNEY Box1>>6
>>This operation may take a few minutes...
Size     Vendor  Product  Serial Number   Vendor Specific
UID                                                  Port  WWN               Ctrl  LUN
============================================================================
1. 9.00GB   DGC  RAID 5  APM00024400378  LUN 29 Sydney Jou
   CLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0
                                                     0     500601601060009a  SP-A  0
                                                     0     500601681060009a  SP-B  0
                                                     1     500601601060009a  SP-A  0
                                                     1     500601681060009a  SP-B  0
...
10. 10.00GB DGC  RAID 5  APM00024400378  LUN 36 Sydney Jou
    CLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0
                                                     0     500601601060009a  SP-A  10
                                                     0     500601681060009a  SP-B  10
                                                     1     500601601060009a  SP-A  10
                                                     1     500601681060009a  SP-B  10
Select: 6
Select operation to perform:
** Operation To Perform **
[1] Read
[2] Write
SYDNEY Box1>>1
>> Enter the desired transaction size:
SYDNEY Box1>>10485760
Do you want to read the whole LUN? (y/n)
>>y
1 buffers in
1 buffers out
total time : 0.395567 seconds
2.65082e+07 bytes/sec
25.2802 MB/sec
2.52802 IO/sec
CRC = 4126172682534249172
I/O succeeded.

The following example for a write operation shows sample responses in bold type.

SYDNEY Box1>>6
>>This operation may take a few minutes...
Size     Vendor  Product  Serial Number   Vendor Specific
UID                                                  Port  WWN               Ctrl  LUN
============================================================================
1. 9.00GB   DGC  RAID 5  APM00024400378  LUN 29 Sydney Jou
   CLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0
                                                     0     500601601060009a  SP-A  0
                                                     0     500601681060009a  SP-B  0
                                                     1     500601601060009a  SP-A  0
                                                     1     500601681060009a  SP-B  0
...
10. 10.00GB DGC  RAID 5  APM00024400378  LUN 36 Sydney Jou
    CLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0
                                                     0     500601601060009a  SP-A  10
                                                     0     500601681060009a  SP-B  10
                                                     1     500601601060009a  SP-A  10
                                                     1     500601681060009a  SP-B  10
============================================================================
Select: 10
Select operation to perform:
** Operation To Perform **
[1] Read
[2] Write
SYDNEY Box1>>2
>> Enter the desired transaction size:
SYDNEY Box1>>10485760
Enter the number of transactions to perform:
SYDNEY Box1>>100
Enter the number of blocks to skip:
SYDNEY Box1>>16
100 buffers in
100 buffers out
total time : 40.7502 seconds
2.57318e+07 bytes/sec
24.5398 MB/sec
2.45398 IO/sec
CRC = 3829111553924479115
I/O succeeded.

Synchronization Diagnostics

On the Diagnostics menu, type 3 (Synchronization diagnostics) and press Enter to verify that an RA is synchronized.

Note: The RA must be attached to run the synchronization diagnostics. Reattaching the RA causes the RA to reboot.

The results displayed are similar to the following example:

     remote          refid            st t when poll reach  delay  offset  jitter
=============================================================================
*10.10.0.1       192.116.202.203      3 u  438 1024  377   0.337  12.971   6.241
+11              10.10.0.1            2 u  484 1024  376   0.090  -4.530   0.023
 LOCAL(0)        LOCAL(0)            13 l    2   64  377   0.000   0.000   0.004

The columns in the previous output are defined as follows:

- remote: host names or addresses of the servers and peers used for synchronization
- refid: current source of synchronization
- st: stratum
- t: type (u=unicast, m=multicast, l=local, -=do not know)
- when: time since the peer was last heard, in seconds
- poll: poll interval, in seconds
- reach: status of the reachability register, in octal format
- delay: latest delay, in milliseconds
- offset: latest offset, in milliseconds
- jitter: latest jitter, in milliseconds
The symbol at the left margin indicates the synchronization status of each peer. The currently selected peer is marked with an asterisk (*); additional peers designated as acceptable for synchronization are marked with a plus sign (+). Peers marked with * and + are included in the weighted average computation to set the local clock. Data produced by peers marked with other symbols is discarded. The LOCAL(0) entry represents the values obtained from the internal clock on the local machine.

Collect System Info

On the Diagnostics menu, type 4 (Collect system info) and press Enter to collect system information for later processing and analysis. You specify where to place the collected information. In some cases, you might need to transfer it to a vendor for technical support.

You are prompted to provide the following information:

- The time frame for log collection
- Whether to collect information from the remote site
- FTP details, if you choose to send the results to an FTP server
- Which logs to collect
- Whether you have SANTap switches from which you want to collect information

Note: The dialog asks whether you want full collection. If you choose full collection, additional technical information is supplied, but the time required for the collection process is lengthened. Unless specifically instructed by a Unisys service representative, do not choose full collection.

The following dialog provides sample responses in bold type for collecting system information:

>>GMT right now is 01/16/2009 08:24:16
Enter the start date:
>>01/16/2009
Enter the start time:
>>06:00:00
Enter the end date:
>>01/16/2009
Enter the end time:
>>08:24:16
Note: The start and end times are used only for collection of the system logs. Logs from hosts are collected in their entirety.
Do you want to collect system information from the other site also? (y/n)
>>y
Do you want to send results to an ftp server? (y/n)
>>y
Enter the name of the ftp server to which you want to transfer the collected system information:
>>ftp.ess.unisys.com
Enter the port number to which to connect on the FTP server:
>>21
Enter the FTP user name:
>>MY_USERNAME
Enter the location on the FTP server in which you want to put the collected system information:
>>incoming
Enter the file on the FTP server in which you want to put the collected system information:
>>19557111_company.tar
Enter the FTP password:
>>*******
Select the logs you want to collect:
** Collection mode **
[1] Collect logs from RAs only
[2] Collect logs from hosts only
[3] Collect logs from RAs and hosts
>>3
Do you have SANTap switches from which you want to collect information?
>>n
Do you want to perform full collection? (y/n)
>>n
Do you want to limit collection time? (y/n)
>>n

Once you complete the information-entry dialog, Installation Manager checks connectivity and displays a list of accessible hosts for which the feature is enabled. (See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information.) You must indicate the hosts for which you want to collect logs. You can select one or more individual hosts or enter NONE or ALL.

Once you specify the hosts, Installation Manager returns system information and logs for all accessible RAs, including the remote RAs, if so instructed. The software also returns a success or failure status report for each RA from which it has been instructed to collect information.
Installation Manager also collects logs for the selected hosts and reports on the success or failure of each collection. The timeout on the collection process is 20 minutes.

If you requested that the collected information be stored on an FTP server, the system reports that it is transferring the collected information to the specified FTP location once the collection completes. When the transfer completes, you are prompted to press Enter to continue.

You can also open or download the stored files using your browser. Log in as webdownload/webdownload, and access the files at one of these URLs:

For nonsecured servers: http://<RA IP address>/info/
For secured servers: https://<RA IP address>/info/

The following error conditions apply:

- If the connection with an RA is lost while information collection is in progress, no information is collected. You can run the process again.
- If the collection from the remote site failed because of a WAN failure, run the process locally at the remote site.
- If simultaneous information collection is occurring from the same RA, only the collector that established the first connection can succeed.
- FTP failure results in failure of the entire process.

If this process fails to collect the desired host information, you can alternatively generate host information collection directly for individual hosts. Use the Host Information Collector (HIC) utility as described in Appendix A. The Unisys SafeGuard Solutions Administrator's Guide also provides additional information about the HIC utility.
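If you script retrieval of the collected files, the only release-specific detail is the URL shape documented above. The helper below merely builds that URL; fetching it with the webdownload credentials is left to whatever HTTP client you prefer. A minimal sketch:

```python
def collected_info_url(ra_ip: str, secured: bool = False) -> str:
    """Build the documented /info/ download URL for collected system
    information: HTTP for nonsecured servers, HTTPS for secured ones."""
    scheme = "https" if secured else "http"
    return f"{scheme}://{ra_ip}/info/"
```

For example, `collected_info_url("10.10.17.61")` returns the nonsecured URL for that RA, which you would open in a browser with the webdownload/webdownload login.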
Appendix D
Replacing a Replication Appliance (RA)

To replace an RA at a site, you must perform the following tasks as described in this appendix:

- Save configuration settings.
- Record the group properties and save the Global cluster mode settings.
- Modify the Preferred RA setting.
- Detach the failed RA.
- Remove the Fibre Channel adapter cards.
- Install and configure the replacement RA.
- Verify the RA installation.
- Restore group properties.
- Ensure the existing RA can switch over to the new RA.

Note: During this process, be sure that the direction of all consistency groups is from the site without the failed RA to the site with the failed RA. You might need to move groups.
Saving the Configuration Settings

Before you replace an RA, Unisys recommends that you save the current environment settings to a file. The saved file is a script that contains CLI commands for all groups, volumes, and replication pairs needed to re-create the environment. The file is used for backup purposes only.

1. From a command prompt on the management PC, enter the following command to change to the directory where the plink.exe file is located:

   cd putty

2. Update the following command with your site management IP address and administrator (admin) password, and then enter the command:

   plink -ssh <site management IP address> -l admin -pw <admin password> save_settings > sitexandsitey.txt

   Note: If a message is displayed asking whether you want to add a cached registry key, type y and press Enter.

The file is automatically saved to the management PC in the same directory from which the command was issued.

If you need to restore the settings saved in the previous procedure, update the following command with your site management IP address and administrator (admin) password, and then enter the command:

plink -ssh <site management IP address> -l admin -pw <admin password> -m version30.txt

Recording Policy Properties and Saving Settings

Before you begin the RA replacement procedure, be sure to record the policy properties and save the Global cluster mode settings.

Perform the following steps for each consistency group to record policy properties and save settings:

1. Select the Policy tab.
2. Write down and save the current preferred RA settings and the Stretch Cluster Support parameter for each consistency group. Use this record to restore these values after you replace the RA.
3. Click OK.
4. Repeat steps 1 through 3 for all the other groups.
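When the backup above is run repeatedly or from a script, assembling the plink invocation as an argument list avoids shell-quoting mistakes with the password. The sketch below follows the documented command shape; it assumes plink.exe is on the PATH and handles the output redirection in the script rather than the shell:

```python
import subprocess

def build_save_settings_cmd(site_mgmt_ip: str, admin_password: str) -> list:
    """Argv list for: plink -ssh <ip> -l admin -pw <password> save_settings"""
    return ["plink", "-ssh", site_mgmt_ip,
            "-l", "admin", "-pw", admin_password, "save_settings"]

def save_settings(site_mgmt_ip: str, admin_password: str, out_file: str) -> None:
    """Run the documented save_settings command and write the returned
    CLI script (the backup file) to out_file."""
    result = subprocess.run(build_save_settings_cmd(site_mgmt_ip, admin_password),
                            capture_output=True, text=True, check=True)
    with open(out_file, "w") as f:
        f.write(result.stdout)
```

For example, `save_settings("<site management IP address>", "<admin password>", "sitexandsitey.txt")` reproduces the manual command. Note that plink may still prompt interactively to cache the host key on first contact, as the note above describes.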
Modifying the Preferred RA Setting

For each consistency group, record the Preferred RA and Global cluster mode settings so that they can be restored at the end of this procedure.

Perform the following steps to change all consistency groups that were running on the failed RA to a surviving RA:

1. Select the Policy tab.
2. Change the Preferred RA setting to a surviving RA number for all consistency groups that had the Preferred RA value set to the failed RA. Perform steps 2a through 2f for each group.
   a. If the Stretch Cluster Support parameter is set to one of the following options, skip this step and continue with step 2d:
      - The Use Unisys SafeGuard Solutions/30m check box is not selected.
      - Under the Management Mode, the Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor option is selected.
   b. In the Stretch Cluster Support, under the Management Mode, change the parameter to Group in maintenance mode. It is managed by Unisys SafeGuard Solutions, 30m can only monitor (if using MSCS with shared quorum).
   c. Click Apply.
   d. Change the Preferred RA setting, and then click Apply.
   e. Change the Stretch Cluster Support parameter to the original setting.
   f. Click Apply.
3. Select the consistency group and click the Status tab to verify that all groups are running on the new RA number. Review the current status of the preferred RA under the components pane.
4. Detach the failed RA. If you can log on to the RA, detach the RA by performing the following steps. Otherwise, continue with "Removing Fibre Channel Adapter Cards."
   a. Use the PuTTY utility to connect to the box management IP address for the RA that is being replaced.
   b. Type boxmgmt when prompted to log in, and then type the appropriate password if it has changed from the default password boxmgmt. The Main Menu is displayed.
   c. Type 4 (Cluster operations) and press Enter.
   d. Type 2 (Detach from cluster) to detach the RA from the cluster, and then press Enter.
   e. Type y when prompted to detach and press Enter.
   f. Type B (Back) and press Enter to return to the Main Menu.
   g. Type quit and close the PuTTY window.

Removing Fibre Channel Adapter Cards

Perform the following steps to remove the RA and Fibre Channel host bus adapters (HBAs):

1. Power off the failed RA.
2. Physically disconnect and remove the failed RA from the rack.
3. Physically remove the Fibre Channel HBAs from the failed RA and insert them into the replacement RA.

Note: If you cannot use the cards from the existing RA, refer to "Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)" in Section 8 for information about replacing a failed HBA.

Installing and Configuring the Replacement RA

To install and configure the replacement RA, you must complete several tasks, as follows:

- Complete the procedure in "Cable and Apply Power to the New RA."
- Complete the procedure in "Connecting and Accessing the RA."
- Complete the procedure in "Configuring the RA."
- Complete the procedures in "Verifying the RA Installation."

Cable and Apply Power to the New RA

1. Insert the new RA into the rack and apply power.
2. Insert the Unisys SafeGuard Solutions RA Setup Disk CD-ROM into the CD/DVD drive of the RA. Ensure that this disk is the same version that is running in the other RAs.
3. Power off and then power on the RA.
4. As the RA boots, check the BIOS level as displayed in the Unisys banner and note the level displayed. At the end of the replacement procedure, you can compare the existing RA BIOS level with the new RA BIOS level. The RA BIOS might need to be updated.

Connecting and Accessing the RA

1. Power on the appropriate RA.
2. Connect an Ethernet cable between the management PC used for installation and the WAN Ethernet segment to which the RA is connected. If you connect the management PC directly to the RA, use a crossover cable.
3. Assign the following IP address and subnet mask to the management PC:
   10.77.77.50 (IP address)
   255.255.255.0 (subnet mask)
4. Access the RA by using the SSH client. (See Appendix C.) Use the 10.77.77.77 IP address, which has a subnet mask of 255.255.255.0.
5. Log in with the boxmgmt user name and the boxmgmt password.
6. Provide the following information for the layout of the RA installation:
   a. When prompted about the number of sites in the environment, type 2 to install in a geographic replication environment or a geographic clustered environment, or type 1 to install in a continuous data protection environment.
   b. Type the number of RAs at the site, and press Enter. The Main Menu appears.

Checking Storage-to-RA Access

Verify that all LUNs are accessible by using the Main Menu of the Installation Manager and performing the following steps. If the LUNs are not accessible, check your switch configuration and zoning.

1. Type 3 (Diagnostics).
2. Type 2 (Fibre Channel diagnostics).
3. Type 4 (Detect Fibre Channel LUNs). After a few minutes, a list of detected LUNs appears.
4. Press the spacebar until all expected LUNs appear.
5. Type B (Back).
6. Type B again. The Main Menu appears.
7. If you do not see all Fibre Channel LUNs in step 4, correct the environment and repeat steps 1 through 6.

Enabling PCI-X Slot Functionality

If your system is configured with a gigabit (Gb) WAN, which is used for the optical WAN connection, perform the following steps on the Main Menu of the replacement RA:

1. Type 2 (Setup).
2. Type 8 (Advanced option).
3. Type 12 (Enable/disable additional remote interface).
4. Type yes when prompted on whether to enable the additional remote interface.
5. Type B twice to return to the Main Menu.
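The direct-connection addressing in "Connecting and Accessing the RA" works only because the management PC (10.77.77.50, mask 255.255.255.0) and the RA service address (10.77.77.77) fall in the same subnet. If you script your own address assignment, Python's ipaddress module can confirm this before you cable up; a sketch:

```python
import ipaddress

def same_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """True when both addresses fall inside the same network for the
    given dotted-quad netmask."""
    net = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    return ipaddress.ip_address(ip_b) in net
```

For example, `same_subnet("10.77.77.50", "10.77.77.77", "255.255.255.0")` returns True, confirming that an SSH session from the management PC can reach the RA without a gateway.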
Configuring the RA

1. On the Main Menu, type 1 (Installation).
2. Type 2 (Get Setup information from an installed RA), and press Enter. The Get Settings Wizard menu appears with Get Settings from Installed RA selected.
3. Press Enter.
4. Type 1 (Management interface) to view the settings from the installed RA.
5. Type y when prompted to configure a temporary IP address.
6. Type the IP address.
7. Type the IP subnet mask and then press Enter.
8. Type y or n, depending on your environment, when prompted to configure a gateway.
9. Type the box management IP address of Site 1 RA 1 to import the settings from that RA.
10. Type y to import the settings.
11. Press Enter to continue when a message states that the configuration was successfully imported. The Get Settings Wizard menu appears with Apply selected.
12. Perform the following steps to apply the configuration to the RA:
a. Press Enter to continue. The complete list of settings is displayed. These settings are the same as the ones for Site 1 RA 1.
b. Type y to apply these settings.
c. Type 1 or 2 when prompted for a site number, depending on the site on which the RA is located.
d. Type the RA number when prompted. A confirmation message appears when the settings are applied successfully.
e. Press Enter. The Get Settings Wizard menu appears with Proceed to the Complete Installation Wizard selected.
f. Press Enter to continue. The Complete Installation Wizard menu appears with Configure repository volume selected.
13. Configure the repository volume by completing the following steps:
a. Press Enter.
b. Type 2 (Select a previously formatted repository volume).
c. Select the number of the repository volume corresponding to the group of displayed volumes, and press Enter.
d. Press Enter again. The Complete Installation Wizard menu appears with Attach to cluster selected.
14. Attach the RA to the RA cluster by completing the following steps:
a. Press Enter.
b. Type y at the prompt to attach to the cluster. The RA reboots.
c. Close the PuTTY session if necessary.

Verifying the RA Installation

To verify that the RA is correctly installed, you must
Verify the WAN bandwidth
Verify the clock synchronization

Verifying WAN Bandwidth

Use the following procedure to verify the actual versus the expected WAN bandwidth.
Note: Correct any problems and rerun the verification.
1. Open an SSH session to the box management IP address for the replacement RA.
2. Type boxmgmt when prompted to log in, and then type the appropriate password if it has been changed from the default password boxmgmt. The Main Menu is displayed.
3. Type 3 (Diagnostics) and press Enter. The Diagnostics menu appears.
4. Type 1 (IP diagnostics) and press Enter. The IP Diagnostics menu appears.
5. Type 4 (Test throughput) and press Enter.
6. Type the WAN IP address of the peer RA; for example, site 2 RA 1 is the peer for site 1 RA 1.
7. Type 2 (WAN interface).
8. At the prompt, type 20 to change the default value for the desired number of concurrent streams.
9. At the prompt for the test duration, type 60 to change the default value. A message is displayed that the connection was established.
10. After 60 seconds, make sure that the following information is displayed on the screen. Ignore any TCP Window Size warnings.
IP connection for every stream
Interval, Transfer, and Bandwidth for every stream
Expected bandwidth in the [SUM] display at the bottom of the screen
11. On the IP Diagnostics menu, type Q (Quit), and then type y.

Verifying Clock Synchronization

The timing of all Unisys SafeGuard 30m activities across all RAs in an installation must be synchronized against a single clock (for example, on the network time protocol [NTP] server). Consequently, you need to synchronize the replacement RA. For the procedure to verify RA synchronization, see the Unisys SafeGuard Solutions Replication Appliance Installation Guide.

Restoring Group Properties

After the replacement, all Preferred RA settings are set to RA 1. Perform the following steps on the Management Console for each group whose Preferred RA setting must be restored to an RA other than RA 1.
1. Select the Policy tab for the consistency group.
2. In the General Settings section, change the Preferred RA setting to the original setting, and then click Apply.
3. Change the Stretch Cluster Support setting under Advanced to the original setting if it was changed earlier.
4. Click Apply.

Ensuring the Existing RA Can Switch Over to the New RA

Once the new RA is part of the configuration, the management console does not display any errors. Shut down any other RA at the site to ensure that the newly replaced RA can successfully complete the switchover. As the existing RA reboots, check the BIOS level displayed in the Unisys banner and note it. Compare the BIOS level noted for the existing (rebooting) RA with the BIOS level you noted for the replacement RA. If the BIOS levels do not match, contact the Unisys Support Center to obtain the correct BIOS.
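The [SUM] figure in step 10 should roughly equal the total of the per-stream bandwidths, and that total should approach the provisioned WAN bandwidth. A small sketch of that arithmetic, using made-up per-stream numbers (the 100 Mbit/s expected WAN bandwidth is an assumption for illustration, not a product requirement):

```python
# Hypothetical per-stream results (Mbit/s) from the 20 concurrent streams;
# real values come from the Bandwidth column printed for each stream.
stream_bandwidths_mbps = [4.8] * 20

total_mbps = sum(stream_bandwidths_mbps)      # should match the [SUM] line
expected_wan_mbps = 100                       # assumed provisioned bandwidth

print(round(total_mbps, 1))                   # 96.0
print(total_mbps >= 0.9 * expected_wan_mbps)  # True: within 10% of expected
```

If the sum falls well short of the expected bandwidth, correct the WAN problem and rerun the throughput test, as the Note above directs.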
Appendix E. Understanding Events

Event Log

Event Topics

Various events generate entries to the Unisys SafeGuard 30m solution system log. These events are predefined in the system according to topic, level of severity, and scope. The Unisys SafeGuard 30m solution supports proactive notification of an event, either by sending e-mail messages or by generating system log events that are logged by a management application.
The system records log entries in response to a wide range of predefined events. Each event carries an event ID. For manageability, the system divides the events into general and advanced types. In most cases, you can monitor system behavior effectively by viewing the general events only. For troubleshooting a problem, technical support personnel might want to review the advanced log events.
Event topics correspond to the components where the events occur, including
Management (management console and CLI)
Site
RA
Consistency group
Splitter
A single event can generate multiple log entries.
Event Levels

The levels of severity for events are defined as follows (in ascending order):
Info: These messages are informative in nature, usually referring to changes in the configuration or normal system state.
Warning: These messages indicate a warning, usually referring to a transient state or to an abnormal condition that does not degrade system performance.
Error: These messages indicate an important event that is likely to disrupt normal system behavior, performance, or both.

Event Scope

A single change in the system (for example, an error over a communications line) can affect a wide range of system components and cause the system to generate a large number of log events. Many of these events contain highly technical information that is intended for use by Unisys service representatives. When all of the events are displayed, you might find it difficult to identify the particular events in which you are interested. You can use the scope to manage the type and quantity of events that are displayed in the log. An event belongs to one of the following scopes:
Normal: Events with a Normal scope result when the system analyzes a wide range of system data to generate a single event that explains the root cause for an entire set of Detailed and Advanced events. Usually, these events are sufficient for effective monitoring of system behavior.
Detailed: Events with a Detailed scope include all events for all components that are generated for users and that are not included among the events that have a Normal scope. The display of Detailed events includes Normal events also.
Advanced: Events with an Advanced scope contain technical information. In some cases, such as troubleshooting a problem, a Unisys service representative might need to retrieve information from the Advanced log events.
Displaying the Event Log

The event log is displayed either from the Management Console or by using the CLI. To display event logs, select Logs in the navigation pane; the most recent events in the event log are displayed. For more information about a particular event log, double-click the event log. The Log Event Properties dialog box displays details of the individual event. You can sort the log events according to any of the columns (that is, level, scope, time, site, ID, and topic) in ascending or descending order.
Perform the following steps to display advanced logs:
1. Click the Filter log toolbar option in the event pane. The Filter Log dialog box appears.
2. Change the scope to Advanced.
3. Click OK.
For more information about using the management console, see the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.
To display the event log from the CLI, run the get_logs command and specify values for each of the parameters. Specify the parameters carefully to avoid displaying unnecessary log information. You can use the terse display parameter to show more or less information for the displayed events as desired. For information about the CLI, see the Unisys SafeGuard Solutions Replication Appliance Command Line Interface (CLI) Reference Guide.

Using the Event Log for Troubleshooting

The event log provides information that can be useful in determining the cause or nature of problems that might arise during operation. The group capabilities events provide an important tool for understanding the behavior of a consistency group. Each group capabilities event, such as group capabilities OK, group capabilities minor problem, or group capabilities problem, provides a high-level description of a current group situation with regard to each of the RAs and identifies the RA that is currently handling the group.
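The sorting and filtering described above is easy to prototype offline. The following sketch uses a hypothetical record structure (the real fields correspond to the console columns: level, scope, time, site, ID, and topic) and shows the usual first pass: Normal-scope events only, most severe first, before widening the filter to Detailed or Advanced.

```python
# Severity ordering as defined under Event Levels (ascending).
SEVERITY = {"Info": 0, "Warning": 1, "Error": 2}

# Hypothetical event records; field names are illustrative only.
events = [
    {"id": 1000, "level": "Info",    "scope": "Normal",   "topic": "Management"},
    {"id": 3005, "level": "Error",   "scope": "Normal",   "topic": "RA"},
    {"id": 1002, "level": "Info",    "scope": "Detailed", "topic": "Management"},
    {"id": 5011, "level": "Warning", "scope": "Normal",   "topic": "Splitter"},
]

# Keep only Normal-scope events and sort them most severe first.
normal = sorted(
    (e for e in events if e["scope"] == "Normal"),
    key=lambda e: SEVERITY[e["level"]],
    reverse=True,
)
print([e["id"] for e in normal])  # [3005, 5011, 1000]
```

The same two-step pattern (filter by scope, then sort by level) mirrors what the Filter Log dialog box and column sorting do in the console.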
The information reported for each RA includes the following:
RA status: Indicates whether an RA is currently a member of the RA cluster (that is, alive) or not a member (that is, dead).
Marking status: yes or no.
Transfer status: yes, no, no data loss (that is, flushing), or yes unstable (that is, the RA cannot be initialized if closed or detached).
Journal capability: yes (that is, distributing, logged access, and so forth), no, or static (that is, access to an image is enabled but access to a different image is not enabled; the group cannot distribute and cannot support image access).
Preferred: yes or no.
In addition, the event log reports the RA on which the group is actually running and the status of the link between the sites. A group capabilities event is generated whenever there is a change in the capabilities of a group on any RA. The message reports on any limitations to the capabilities of the group and provides reasons for these limitations. Tracking logged events can explain changes in a group state (for example, the reason replication was paused, the reason the group switched to another RA, and so forth).
The group capabilities events might offer reasons that particular actions are not performed. For example, if you want to know the reason the group transfer was paused, you can check the event log for the pause replication action. If, however, you want to know the reason a group transfer did not start, you might check the most recent group capabilities event.
The level of a group capabilities event can be INFO, WARNING, or ERROR, depending on the severity of the reported situation. These levels correspond to the OK, minor problem, and problem bookmarks that follow group capabilities in the message descriptions.

List of Events

The list of events is presented in tabular format with the following given for each event:
Event ID
Topic (for example, Management, Site, RA, Splitter, Group)
Level (for example, Info, Warning, Error)
Description
Scope
Time
Site
List of Normal Events

Normal events include both root-cause events (a single description for an event that can generate multiple events) and other selected basic events. Some Normal events do not have a topic or trigger. Table E-1 lists Normal events with their descriptions.

Table E-1. Normal Events

1000 Management Info: User logged in. (User <user>) Trigger: User log-in action.
1001 Management Warning: Log in failed. (User <user>) Trigger: A user failed to log in.
1003 Management Warning: Failed to generate SNMP trap. (Trap contents) Trigger: The system failed to send an SNMP trap.
1004 Management Warning: Failed to send e-mail alert to specified address. (Address <e-mail address>, Event summary <summary>) Trigger: The system failed to send an e-mail alert.
1005 Management Warning: Failed to update file. (File <file>) Trigger: The system failed to update the local configuration file (passwords, SSH keys, system log configuration, and SNMP configuration).
1006 Management Info: Settings changed. (User <user>, Settings <settings>) Trigger: The user changed settings.
1007 Management Warning: Settings change failed. (User <user>, Settings <settings>, Reason <reason>) Trigger: The system failed to change settings.
1008 Management Info: User action succeeded. (User <user>, Action <action>) Trigger: The user performed one of these actions: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
Table E-1. Normal Events (continued)

1009 Management Warning: User action failed. (User <user>, Action <action>, Reason <reason>) Trigger: One of these actions failed: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
1011 Management Error: Grace period expired. You must install an activation code to activate your <RecoverPoint> license. Trigger: The grace period expired.
1014 Management Info: User bookmarked an image. (Group <group>, Snapshot <bookmark>) Trigger: The user bookmarked an image.
1015 Management Warning: RA-to-storage multipathing problem. (RA <RA>, Volume <volume>) Trigger: One or more paths between the RA and the volume are not available.
1016 Management Warning Off: RA-to-storage multipathing problem fixed. (RA <RA>, Volume <volume>) Trigger: All paths between the RA and the volume are available.
1017 Management Warning: RA multipathing problem. (RA <RA>, Splitter <splitter>) Trigger: One or more paths between the RA and the splitter are not available.
1018 Management Warning Off: RA multipathing problem fixed. (RA <RA>, Splitter <splitter>) Trigger: All paths between the RA and the splitter are available.
1019 Management Warning: User action succeeded. (Markers cleared. Group <group>, <copy>) (Replication set attached as clean. Group <group>) Trigger: The user cleared markers or attached a replication set as clean.
3001 RA Warning: RA is no longer a cluster member. (RA <RA>) Trigger: An RA is disconnected from site control.
3005 RA Error: Settings conflict between sites. (Reason <reason>) Trigger: A settings conflict between the sites was discovered.
Table E-1. Normal Events (continued)

3006 RA Error Off: Settings conflict between sites resolved by user. (Using Site <site> settings) Trigger: A settings conflict between the sites was resolved by the user.
3030 RA Warning: RA switched path to storage. (RA <RA>, Volume <volume>) Trigger: A storage path change was initiated by the RA.
4056 Group Warning: No image was found in the journal to match the query. (Group <group>) Trigger: No image was found in the journal to match the query.
4090 Group Warning: Target-side log is 90 percent full. When the log is full, writing by hosts at the target side is disabled. (Group <group>) Trigger: The target-side log is 90 percent full.
4106 Group Warning: Capacity reached; cannot write additional markers for this group to <repository volume>. Starting full sweep. (Group <group>) Trigger: The disk space for the markers was filled for the group.
4117 Group Warning: Virtual access buffer is 90 percent full. When the buffer is full, writing by hosts at the target side is disabled. (Group <group>) Trigger: The usage of the virtual access buffer has reached 90 percent.
5008 Splitter Warning: Host shut down. (Host Splitter <splitter>) Trigger: The host was shut down or restarted.
5010 Splitter Warning: Splitter stopped; depending on policy, writing by host might be disabled for some groups, and a full sweep might be required for other groups. (Splitter <splitter>) Trigger: The user stopped the splitter after removing volumes; volumes are disconnected.
5011 Splitter Warning: Splitter stopped; full sweep is required. (Splitter <splitter>) Trigger: The user stopped the splitter after removing volumes; volumes are disconnected.
5012 Splitter Warning: The splitter stopped; write operations to replication volumes are disabled. (Splitter <splitter>) Trigger: The splitter stopped; host access to all volumes is disabled.
Table E-1. Normal Events (continued)

10000 Info: Changes are occurring in the system. Analysis in progress.
10001 Info: System changes have occurred. The system is now stable.
10002 Info: The system activity has not stabilized; issuing an intermediate report.
10101 Error: The cause of the system activity is unclear. To obtain more information, filter the event log using the Detailed scope.
10102 Info: Site control recorded internal changes that do not affect system operation.
10202 Info: Settings have changed.
10203 Info: The RA cluster is down.
10204 Error: One or more RAs are disconnected from the RA cluster.
10205 Error: A communications problem occurred in an internal process.
10206 Info: An internal process was restarted.
10207 Error: An internal process was restarted.
10210 Error: Initialization is experiencing high-load conditions.
10211 Error: A temporary problem occurred in the Fibre Channel link between the splitters and the RAs.
10212 Error Off: The temporary problem that occurred in the Fibre Channel link between the splitters and the RAs is resolved.
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 10501 Info Synchronization completed. 10502 Info Access to the target-side image is enabled. 10503 Error The system is transferring the latest snapshot before pausing transfer (no data loss). 10504 Info The journal was cleared. 10505 Info The system completed undoing writes to the target-side log. 10506 Info The roll to the physical images is complete. Logged access to the physical image is now available. 10507 Info Because of system changes, the journal was temporarily out of service. The journal is now available. 10508 Info All data were flushed from the local-side RA; automatic failover proceeds. 10509 Info The initial long resynchronization has completed. 10510 Info Following a paused transfer, the system is now cleared to restart transfer. 10511 Info The system finished recovering the replication backlog. 12001 Error The splitter is down. 12002 Error An error occurred in all WAN links to the other site. The other site is possibly down. 6872 5688 006 E 9
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 12003 Error An error occurred in the WAN link to the RA at the other site. 12004 Error An error occurred in the data link over the WAN. All RAs are unable to transfer replicated data to the other site. 12005 Error An error occurred in the data link over the WAN. The RA is unable to transfer replicated data to the other site. 12006 Error The RA is disconnected from the RA cluster. 12007 Error All RAs are disconnected from the RA cluster. 12008 Error The RA is down. 12009 Error The group entered high load. 12010 Error A journal error occurred. Full sweep is to be performed after the error is corrected. 12011 Error The target-side log or virtual buffer is full. Writing by hosts at the target side is disabled. 12012 Error The system cannot enable virtual access to the image. 12013 Error The system cannot enable access to a specified image. 12014 Error The Fibre Channel link between all RAs and all splitters and storage is down. 12016 Error The Fibre Channel link between all RAs and all storage is down. E 10 6872 5688 006
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 12022 Error The Fibre Channel link between the RA and splitters or storage volumes (or both) is down. 12023 Error The Fibre Channel link between the RA and all splitters and storage is down. 12024 Error The Fibre Channel link between the RA and all splitters is down. 12025 Error The Fibre Channel link between the RA and all storage is down. 12026 Error An error occurred in the WAN link to the RA at the other site. 12027 Error All replication volumes attached to the consistency group (or groups) are not accessible. 12029 Error The Fibre Channel link between all RAs and one or more volumes is down. 12033 Error The repository volume is not accessible; data might be lost. 12034 Error Writes to storage occurred without corresponding writes to the RA. 12035 Error An error occurred in the WAN link to the RA cluster at the other site. 12036 Error A renegotiation of the transfer protocol is requested. 12037 Error All volumes attached to the consistency group (or groups) are not accessible. 6872 5688 006 E 11
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 12038 Error All journal volumes attached to the consistency group (or groups) are not accessible. 12039 Error A long resynchronization started. 12040 Error The system detected bad sectors in a volume. 12041 Error The splitter is up. 12042 Error All WAN links to the other site are restored. 12043 Error The WAN link to the RA at the other site is restored. 12044 Error Problem with IP link between RA (in at least in one direction). 12045 Error Problem with all IP links between RA 12046 Error Problem with IP links between RA 12047 Error RA network interface card (NIC) problem. 14001 Error Off The splitter is up. 14002 Error Off All WAN links to the other site are restored. 14003 Error Off The WAN link to the RA at the other site is restored. 14004 Error Off The data link over the WAN is restored. All RAs can transfer replicated data to the other site. 14005 Error Off The data link over the WAN is restored. The RA can transfer replicated data to the other site. 14006 Error Off The connection of the RA to the RA cluster is restored. E 12 6872 5688 006
Table E-1. Normal Events (continued)

12038 Error: All journal volumes attached to the consistency group (or groups) are not accessible.
12039 Error: A long resynchronization started.
12040 Error: The system detected bad sectors in a volume.
12041 Error: The splitter is up.
12042 Error: All WAN links to the other site are restored.
12043 Error: The WAN link to the RA at the other site is restored.
12044 Error: Problem with the IP link between RAs (in at least one direction).
12045 Error: Problem with all IP links between RAs.
12046 Error: Problem with IP links between RAs.
12047 Error: RA network interface card (NIC) problem.
14001 Error Off: The splitter is up.
14002 Error Off: All WAN links to the other site are restored.
14003 Error Off: The WAN link to the RA at the other site is restored.
14004 Error Off: The data link over the WAN is restored. All RAs can transfer replicated data to the other site.
14005 Error Off: The data link over the WAN is restored. The RA can transfer replicated data to the other site.
14006 Error Off: The connection of the RA to the RA cluster is restored.
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 14027 Error Off Access to all volumes attached to the consistency group (or groups) is restored. 14029 Error Off The Fibre Channel link between all RAs and one or more volumes is restored. 14033 Error Off Access to the repository volume is restored. 14034 Error Off Replication consistency in writes to storage is restored. 14035 Error Off The WAN link to the RA at the other site is restored. 14036 Error Off The renegotiation of the transfer protocol is complete. 14037 Error Off Access to all replication volumes attached to the consistency group (or groups) is restored. 14038 Error Off Access to all journal volumes attached to the consistency group (or groups) is restored. 14039 Info The long resynchronization has completed. 14040 Error Off The system detected a correction of bad sectors in the volume. 14041 Error Off The system detected that the volume is no longer read-only. 14042 Error Off A synchronization is in progress to restore any failed writes in the group. 14043 Error Off A synchronization is in progress to restore any failed writes. E 14 6872 5688 006
Table E-1. Normal Events (continued)

14027 Error Off: Access to all volumes attached to the consistency group (or groups) is restored.
14029 Error Off: The Fibre Channel link between all RAs and one or more volumes is restored.
14033 Error Off: Access to the repository volume is restored.
14034 Error Off: Replication consistency in writes to storage is restored.
14035 Error Off: The WAN link to the RA at the other site is restored.
14036 Error Off: The renegotiation of the transfer protocol is complete.
14037 Error Off: Access to all replication volumes attached to the consistency group (or groups) is restored.
14038 Error Off: Access to all journal volumes attached to the consistency group (or groups) is restored.
14039 Info: The long resynchronization has completed.
14040 Error Off: The system detected a correction of bad sectors in the volume.
14041 Error Off: The system detected that the volume is no longer read-only.
14042 Error Off: A synchronization is in progress to restore any failed writes in the group.
14043 Error Off: A synchronization is in progress to restore any failed writes.
14044 Error Off: The problem with the IP link between RAs (in at least one direction) was corrected.
14045 Error Off: All IP links between RAs were restored.
14046 Error Off: The IP link between RAs was restored.
14047 Error Off: The RA network interface card (NIC) problem was corrected.
16000 Error: Transient root cause.
16001 Error: The splitter was down. The problem is corrected.
16002 Error: An error occurred in all WAN links to the other site. The problem is corrected.
16003 Error: An error occurred in the WAN link to the RA at the other site. The problem is corrected.
16004 Error: An error occurred in the data link over the WAN. All RAs were unable to transfer replicated data to the other site. The problem is corrected.
16005 Error: An error occurred in the data link over the WAN. The RA was unable to transfer replicated data to the other site. The problem is corrected.
16006 Error: The RA was disconnected from the RA cluster. The connection is restored.
16007 Error: All RAs were disconnected from the RA cluster. The problem is corrected.
16008 Error: The RA was down. The problem is corrected.
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 16009 Error The group entered high load. The problem is corrected. 16010 Error A journal error occurred. The problem is corrected. A full sweep is required. 16011 Error The target-side log or virtual buffer was full. Writing by the hosts at the target side was disabled. The problem is corrected. 16012 Error The system could not enable virtual access to the image. The problem is corrected. 16013 Error The system could not enable access to the specified image. The problem is corrected. 16014 Error The Fibre Channel link between all RAs and all splitters and storage was down. The problem is corrected. 16016 Error The Fibre Channel link between all RAs and all storage was down. The problem is corrected. 16022 Error The Fibre Channel link between the RA and splitters or storage volumes (or both) was down. The problem is corrected. 16023 Error The Fibre Channel link between the RA and all splitters and storage was down. The problem is corrected. 16024 Error The Fibre Channel link between the RA and all splitters was down. The problem is corrected. E 16 6872 5688 006
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 16025 Error The Fibre Channel link between the RA and all storage was down. The problem is corrected. 16026 Error An error occurred in the WAN link to the RA at the other site. The problem is corrected. 16027 Error All volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. 16029 Error The Fibre Channel link between all RAs and one or more volumes was down. The problem is corrected. 16033 Error The repository volume was not accessible. The problem is corrected. 16034 Error Off Writes to storage occurred without corresponding writes to the RA. The problem is corrected. 16035 Error An error occurred in the WAN link to the RA at the other site. The problem is corrected. 16036 Error The renegotiation of the transfer protocol was requested and has been completed. 16037 Error All replication volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. 6872 5688 006 E 17
Table E-1. Normal Events (continued)

16038 Error: All journal volumes attached to the consistency group (or groups) were not accessible. The problem is corrected.
16039 Info: The system ran a long resynchronization.
16040 Error: The system detected bad sectors in the volume. The problem is corrected.
16041 Error: The system detected that the volume was read-only. The problem is corrected.
16042 Error: The splitter write operation might have failed while the group was transferring data.
16043 Error: The splitter write operations might have failed.
16044 Error: There was a problem with an IP link between RAs (in at least one direction). The problem has been corrected.
16045 Error: There was a problem with all IP links between RAs. The problem has been corrected.
16046 Error: There was a problem with an IP link between RAs. The problem has been corrected.
16047 Error: There was an RA network interface card (NIC) problem. The problem has been corrected.
18001 Error Off: The splitter was temporarily up but is down again.
18002 Error Off: All WAN links to the other site were temporarily restored, but the problem has returned.
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 18003 Error Off The WAN link to the RA at the other site was temporarily restored, but the problem has returned. 18004 Error Off The data link over the WAN was temporarily restored, but the problem has returned. All RAs are unable to transfer replicated data to the other site. 18005 Error Off The data link over the WAN was temporarily restored, but the problem has returned. The RA is currently unable to transfer replicated data to the other site. 18006 Error Off The connection of the RA to the RA cluster was temporarily restored, but the problem has returned. 18007 Error Off All RAs were temporarily restored to the RA cluster, but the problem has returned. 18008 Error Off The RA was temporarily up, but is down again. 18009 Error Off The group temporarily exited high load, but the problem has returned. 18010 Error Off The journal error was temporarily corrected, but the problem has returned. 18011 Error Off The target-side log or virtual buffer was temporarily no longer full, and write operations by the hosts at the target side were re-enabled. However, the problem has returned. 6872 5688 006 E 19
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 18012 Error Off Virtual access to the image was temporarily enabled, but the problem has returned. 18013 Error Off Access to an image was temporarily enabled, but the problem has returned. 18014 Error Off The Fibre Channel link between all RAs and all splitters and storage was temporarily restored, but the problem has returned. 18016 Error Off The Fibre Channel link between all splitters and all storage was temporarily restored, but the problem has returned. 18022 Error Off The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) was temporarily restored, but the problem has returned. 18023 Error Off The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. 18024 Error Off The Fibre Channel link between the RA and all splitters was temporarily restored, but the problem has returned. 18025 Error Off The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. 18026 Error The WAN link to the RA at the other site was temporarily restored, but the problem has returned. E 20 6872 5688 006
Understanding Events Table E 1. Normal Events Event ID Topic Level Description Trigger 18027 Error Off Access to all journal volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. 18029 Error Off The Fibre Channel link between all RAs and one or more volumes was temporarily restored, but the problem has returned. 18033 Error Off Access to the repository volume was temporarily restored, but the problem has returned. 18034 Error Off Replication consistency in write operations to storage and to RAs was temporarily restored, but the problem has returned. 18035 Error Off The WAN link to the RA at the other site was temporarily restored, but the problem has returned. 18036 Error Off The negotiation of the transfer protocol was completed but is again requested. 18037 Error Off Access to all volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. 18038 Error Off Access to all replication volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. 18039 Info The long resynchronization completed but has now restarted. 6872 5688 006 E 21
18040      Error Off   The user marked the volume as OK, but the bad-sectors problem persists.
18041      Error Off   The user marked the volume as OK, but the read-only problem persists.
18042      Error Off   The synchronization restored any failed write operations in the group, but the problem has returned.
18043      Error Off   An internal problem has occurred.
18044      Error Off   A problem with an IP link between RAs (in at least one direction) was corrected, but the problem has returned.
18045      Error Off   A problem with all IP links between RAs (in at least one direction) was corrected, but the problem has returned.
18046      Error Off   A problem with an IP link between RAs was corrected, but the problem has returned.
18047      Error Off   An RA network interface card (NIC) problem was corrected, but the problem has returned.

List of Detailed Events

Detailed events are all component-related events that are generated for use by users; they do not have a normal scope. Table E-2 lists these events and their descriptions.
Table E-2. Detailed Events

1002  Management  Info
      Description: User logged out. (User <user>)
      Trigger: The user logged out of the system.

1010  Management  Warning
      Description: Grace period expires in 1 day. You must install an activation code to activate your Unisys SafeGuard solution license.
      Trigger: The grace period expires in 1 day.

1012  Management  Warning
      Description: License expires in 1 day. You must obtain a new Unisys SafeGuard 30m solution license.
      Trigger: The Unisys SafeGuard 30m solution license expires in 1 day.

1013  Management  Error
      Description: License expired. You must obtain a new Unisys SafeGuard 30m solution license.
      Trigger: The Unisys SafeGuard 30m solution license expired.

2000  Site  Info
      Description: Site management running on <RA>.
      Trigger: Site control is open; the RA has become the cluster leader.

3000  RA  Info
      Description: RA has become a cluster member. (RA <RA>)
      Trigger: The RA is connected to site control.

3002  RA  Warning
      Description: Site management switched over to this RA. (RA <RA>, Reason <reason>)
      Trigger: Leadership is transferred from an RA to another RA.

3007  RA  Warning Off
      Description: RA is up. (RA <RA>)
      Trigger: The RA that was previously down came up.

3008  RA  Warning
      Description: RA appears to be down. (RA <RA>)
      Trigger: An RA suspects that the other RA is down.

3011  RA  Info
      Description: RA access to a volume or volumes restored. (RA <RA>, Volume <volume>, Volume Type <type>)
      Trigger: Volumes that were inaccessible became accessible.

3012  RA  Warning
      Description: RA unable to access a volume or volumes. (RA <RA>, Volume <volume>, Volume Type <type>)
      Trigger: Volumes ceased to be accessible to the RA.
3013  RA  Warning Off
      Description: RA access to <repository volume> restored. (RA <RA>, Volume <volume>)
      Trigger: The repository volume that was inaccessible became accessible.

3014  RA  Warning
      Description: RA unable to access <repository volume>. (RA <RA>, Volume <volume>)
      Trigger: The repository volume became inaccessible to a single RA.

3020  RA  Warning Off
      Description: WAN connection to an RA at other site is restored. (RA at other site: <RA>)
      Trigger: The RA regained the WAN connection to an RA at the other site.

3021  RA  Warning
      Description: Error in WAN connection to an RA at other site. (RA at other site: <RA>)
      Trigger: The RA lost the WAN connection to an RA at the other site.

3022  RA  Warning Off
      Description: LAN connection to RA restored. (RA <RA>)
      Trigger: The RA regained the LAN connection to an RA at the local site.

3023  RA  Warning
      Description: Error in LAN connection to an RA. (RA <RA>)
      Trigger: The RA lost the LAN connection to an RA at the local site, without losing the connection through the repository volume.

4000  Group  Info
      Description: Group capabilities OK. (Group <group>)
      Trigger: Capabilities are full and previous capabilities are unknown.

4001  Group  Warning
      Description: Group capabilities minor problem. (Group <group>)
      Trigger: Capabilities are either temporarily not full on the RA on which the group is currently running, or indefinitely not full on the RA on which the group is not running.
4003  Group  Error
      Description: Group capabilities problem. (Group <group>)
      Trigger: Capabilities are not full indefinitely on the RA on which the group is running.

4007  Group  Info
      Description: Pausing data transfer. (Group <group>, Reason: <reason>)
      Trigger: The user stopped the transfer.

4008  Group  Warning
      Description: Pausing data transfer. (Group <group>, Reason: <reason>)
      Trigger: The system temporarily stopped the transfer.

4009  Group  Error
      Description: Pausing data transfer. (Group <group>, Reason: <reason>)
      Trigger: The system stopped the transfer indefinitely.

4010  Group  Info
      Description: Starting data transfer. (Group <group>)
      Trigger: The user requested a start transfer.

4015  Group  Info
      Description: Transferring latest snapshot before pausing transfer (no data loss). (Group <group>)
      Trigger: In a total storage disaster, the system flushed the buffer before stopping replication.

4016  Group  Warning
      Description: Transferring latest snapshot before pausing transfer (no data loss). (Group <group>)
      Trigger: In a total storage disaster, the system flushed the buffer before stopping replication.

4017  Group  Error
      Description: Transferring latest snapshot before pausing transfer (no data loss). (Group <group>)
      Trigger: In a total storage disaster, the system flushed the buffer before stopping replication.

4018  Group  Warning
      Description: Transfer of latest snapshot from source is complete (no data loss). (Group <group>)
      Trigger: In a total storage disaster, the last snapshot from the source site is available at the target site.
4019  Group  Warning
      Description: Group in high load; transfer is to be paused temporarily. (Group <group>)
      Trigger: The disk manager has a high load.

4020  Group  Warning Off
      Description: Group is no longer in high load. (Group <group>)
      Trigger: The disk manager no longer has a high load.

4021  Group  Error
      Description: Journal full; initialization paused. To complete initialization, enlarge the journal or allow long resynchronization. (Group <group>)
      Trigger: In initialization, the journal is full and a long resynchronization is not allowed.

4022  Group  Error Off
      Description: Initialization resumed. (Group <group>)
      Trigger: End of an initialization situation in which the journal is full and a long resynchronization was not allowed.

4023  Group  Error
      Description: Journal full; transfer paused. To restart the transfer, first disable access to image. (Group <group>)
      Trigger: Access to the image is enabled and the journal is full.

4024  Group  Error Off
      Description: Transfer restarted. (Group <group>)
      Trigger: End of a situation in which access to the image is enabled and the journal is full.

4025  Group  Warning
      Description: Group in high load; initialization to be restarted. (Group <group>)
      Trigger: The group has a high load; initialization is to be restarted.

4026  Group  Warning Off
      Description: Group no longer in high load. (Group <group>)
      Trigger: The group no longer has a high load.

4027  Group  Error
      Description: Group in high load; the journal is full. The roll to physical image is paused, and transfer is paused. (Group <group>)
      Trigger: No space remains to which to write during roll.

4028  Group  Error Off
      Description: Group no longer in high load. (Group <group>)
      Trigger: Journal capacity was added, or image access was disabled.
4040  Group  Error
      Description: Journal error; full sweep to be performed. (Group <group>)
      Trigger: A journal volume error occurred.

4041  Group  Info
      Description: Group activated. (Group <group>, RA <RA>)
      Trigger: The group is replication-ready; that is, replication could take place if other factors are acceptable, such as RAs, network, and storage access.

4042  Group  Info
      Description: Group deactivated. (Group <group>, RA <RA>)
      Trigger: A user action deactivated the group.

4043  Group  Warning
      Description: Group deactivated. (Group <group>, RA <RA>)
      Trigger: The system temporarily deactivated the group.

4044  Group  Error
      Description: Group deactivated. (Group <group>, RA <RA>)
      Trigger: The system deactivated the group indefinitely.

4051  Group  Info
      Description: Disabling access to image; resuming distribution. (Group <group>)
      Trigger: The user disabled access to an image (that is, distribution is resumed).

4054  Group  Error
      Description: Enabling access to image. (Group <group>)
      Trigger: The system enabled access to an image indefinitely.

4057  Group  Warning
      Description: Specified image was removed from the journal. Try a later image. (Group <group>)
      Trigger: The specified image was removed from the journal (that is, FIFO).

4062  Group  Info
      Description: Access enabled to latest image. (Group <group>, Failover site <site>)
      Trigger: Access was enabled to the latest image during automatic failover.

4063  Group  Warning
      Description: Access enabled to latest image. (Group <group>, Failover site <site>)
      Trigger: Access was enabled to the latest image during automatic failover.
4064  Group  Error
      Description: Access enabled to latest image. (Group <group>, Failover site <site>)
      Trigger: Access was enabled to the latest image during automatic failover.

4080  Group  Warning
      Description: Current lag exceeds maximum lag. (Group <group>, Lag <lag>, Maximum lag <max_lag>)
      Trigger: The group lag exceeds the maximum lag (when not regulating an application).

4081  Group  Warning Off
      Description: Current lag within policy. (Group <group>, Lag <lag>, Maximum lag <max_lag>)
      Trigger: The group lag drops from above the maximum lag to below 90 percent of the maximum.

4082  Group  Warning
      Description: Starting full sweep. (Group <group>)
      Trigger: Group markers were set.

4083  Group  Warning
      Description: Starting volume sweep. (Group <group>, Pair <pair>)
      Trigger: Volume markers were set.

4084  Group  Info
      Description: Markers cleared. (Group <group>)
      Trigger: Group markers were cleared.

4085  Group  Warning
      Description: Unable to clear markers. (Group <group>)
      Trigger: An attempt to clear the group markers failed.

4086  Group  Info
      Description: Initialization started. (Group <group>)
      Trigger: Initialization started.

4087  Group  Info
      Description: Initialization completed. (Group <group>)
      Trigger: Initialization completed.

4091  Group  Error
      Description: Target-side log is full; write operations by the hosts at the target side are disabled. (Group <group>, Site <site>)
      Trigger: The target-side log is full.

4095  Group  Info
      Description: Writing target-side log to storage; writes to log cannot be undone. (Group <group>)
      Trigger: Started marking to retain write operations in the target-side log.
4097  Group  Warning
      Description: Maximum journal lag exceeded. Distribution in fast-forward; older images removed from journal. (Group <group>)
      Trigger: Fast-forward action started (causing a loss of snapshots taken before the maximum journal lag was exceeded).

4098  Group  Warning Off
      Description: Maximum journal lag within limit. Distribution normal; rollback information retained. (Group <group>)
      Trigger: Five minutes have passed since the fast-forward action stopped.

4099  Group  Info
      Description: Initializing in long resynchronization mode. (Group <group>)
      Trigger: The system started a long resynchronization.

4110  Group  Info
      Description: Enabling virtual access to image. (Group <group>)
      Trigger: The user initiated enabling virtual access to an image.

4111  Group  Info
      Description: Virtual access to image enabled. (Group <group>)
      Trigger: The user enabled virtual access to an image.

4112  Group  Info
      Description: Rolling to physical image. (Group <group>)
      Trigger: Rolling to the image (in the background) while virtual access to the image is enabled.

4113  Group  Info
      Description: Roll to physical image stopped. (Group <group>)
      Trigger: Rolling to the image (in the background, while virtual access to the image is enabled) is stopped.

4114  Group  Info
      Description: Roll to physical image complete; logged access to physical image is now enabled. (Group <group>)
      Trigger: The system completed the roll to the physical image.
4115  Group  Error
      Description: Unable to enable access to virtual image because of partition table error. (The partition table on at least one of the volumes in group <group> has been modified since logged access was last enabled to a physical image. To enable access to a virtual image, first enable logged access to a physical image.)
      Trigger: An attempt to pause on a virtual image is unsuccessful because of a change in the partition table of a volume or volumes in the group.

4116  Group  Error
      Description: Virtual access buffer is full; writing by hosts at the target side is disabled. (Group <group>)
      Trigger: An attempt to write to the virtual image is unsuccessful because the virtual access buffer usage is 100 percent.

4118  Group  Error
      Description: Cannot enable virtual access to an image. (Group <group>)
      Trigger: An attempt to enable virtual access to the image is unsuccessful because of insufficient memory.

4119  Group  Error
      Description: Initiator issued an out-of-bounds I/O operation. Contact technical support. (Initiator <initiator WWN>, Group <group>, Volume <volume>)
      Trigger: A configuration problem exists.

4120  Group  Warning
      Description: Journal usage (with logged access enabled) now exceeds this threshold. (Group <group>, <journal usage threshold>)
      Trigger: Journal usage (with logged access enabled) has passed a specified threshold.

4121  Group  Error
      Description: Unable to gain permissions to write to replica.
      Trigger: RAs are unable to write to replication or journal volumes because they do not have proper permissions.
4122  Group
      Description: Trying to regain permissions to write to replica.
      Trigger: User has indicated that the permissions problem has been corrected.

4123  Group  Error
      Description: Unable to access volumes; bad sectors encountered.
      Trigger: RAs are unable to write to replication or journal volumes due to bad sectors on the storage.

4124  Group  Error Off
      Description: Trying to access volumes that previously had bad sectors.
      Trigger: User has indicated that the bad-sectors problem has been corrected.

4125  Group  Error
      Description: Current protection window is now insufficient.
      Trigger: Selected current protection window is smaller than the required protection window.

4126  Group  Info
      Description: Current protection window is now sufficient.
      Trigger: Selected required protection window is smaller than the current protection window.

4127  Group  Error
      Description: Predicted protection window is now insufficient.
      Trigger: Predicted protection window is smaller than the current protection window.

4128  Group  Info
      Description: Predicted protection window is now sufficient.
      Trigger: Closing the previous event.

5000  Splitter  Info
      Description: Splitter or splitters are attached to a volume. (Splitter <splitter>, Volume <volume>)
      Trigger: The user attached a splitter to a volume.

5001  Splitter  Info
      Description: Splitter or splitters are detached from a volume. (Splitter <splitter>, Volume <volume>)
      Trigger: The user detached a splitter from a volume.

5002  Splitter  Error
      Description: RA is unable to access splitter. (Splitter <splitter>, RA <RA>)
      Trigger: The RA is unable to access a splitter.
5003  Splitter  Error Off
      Description: RA access to splitter is restored. (Splitter <splitter>, RA <RA>)
      Trigger: The RA can access a splitter that was previously inaccessible.

5004  Splitter  Error
      Description: Splitter is unable to access a replication volume or volumes. (Splitter <splitter>, Volume <volume>)
      Trigger: The splitter cannot access a volume.

5005  Splitter  Error Off
      Description: Splitter access to replication volume or volumes is restored. (Splitter <splitter>, Volume <volume>)
      Trigger: The splitter can access a volume that was previously inaccessible.

5006  OBSOLETE

5007  OBSOLETE

5013  Splitter  Error
      Description: Splitter is down. (Splitter <splitter>)
      Trigger: Connection to the splitter was lost with no warning; the splitter crashed or the connection is down.

5015  Splitter  Error Off
      Description: Splitter is up. (Splitter <splitter>)
      Trigger: Connection to the splitter was regained after a splitter crash.

5016  Splitter  Warning
      Description: Splitter has restarted. (Splitter <splitter>)
      Trigger: The boot timestamp of the splitter has changed.

5030  Splitter  Error
      Description: Splitter write failed. (Splitter <splitter>, Group <group>)
      Trigger: The splitter write operation to the RA was successful; the write operation to the storage device was not successful.

5031  Splitter  Warning
      Description: Splitter is not splitting to replication volumes; volume sweeps are required. (Host <host>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: The splitter is not splitting to the replication volumes.
5032  Splitter  Info
      Description: Splitter is splitting to replication volumes. (Host <Host>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: The splitter started splitting to the replication volumes.

5035  Splitter  Info
      Description: Writes to replication volumes are disabled. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: Write operations to the replication volumes are disabled.

5036  Splitter  Warning
      Description: Writes to replication volumes are disabled. (Host <host>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: Write operations to the replication volumes are disabled.

5037  Splitter  Error
      Description: Writes to replication volumes are disabled. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: Write operations to the replication volumes are disabled.

5038  Splitter  Info
      Description: Splitter delaying writes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)

5039  Splitter  Warning
      Description: Splitter delaying writes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)

5040  Splitter  Error
      Description: Splitter delaying writes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)

5041  Splitter  Info
      Description: Splitter is not splitting to replication volumes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: The splitter is not splitting to the replication volumes because of a user decision.

5042  Splitter  Warning
      Description: Splitter is not splitting to replication volumes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: The splitter is not splitting to the replication volumes.

5043  Splitter  Error
      Description: Splitter not splitting to replication volumes. (Splitter <splitter>, Volumes <Volume Names>, Groups <Groups>)
      Trigger: The splitter is not splitting to the replication volumes because of a system action.
5045  Splitter  Warning
      Description: Simultaneous problems reported in splitter and RA. Full-sweep resynchronization is required after restarting data transfer.
      Trigger: The marking backlog on the splitter was lost as a result of concurrent disasters to the splitter and the RA.

5046  Splitter  Warning
      Description: Transient error; reissuing splitter write.
Appendix F
Configuring and Using SNMP Traps

The RA in the Unisys SafeGuard 30m solution is SNMP capable; that is, the solution supports monitoring and problem notification using the standard Simple Network Management Protocol (SNMP), including SNMPv3. The solution supports various SNMP queries to the agent and can be configured so that events generate SNMP traps, which are sent to designated servers.

Software Monitoring

To configure SNMP traps for monitoring, see the Unisys SafeGuard 30m Solution Planning and Installation Guide. You cannot query the RA software management information base (MIB). You can, however, query MIB-II; the RA SNMP agent includes MIB-II support. Also see Hardware Monitoring. For more information on MIB-II, see the document at http://www.faqs.org/rfcs/rfc1213.html.

All of the management console log events listed in Appendix E generate SNMP traps, depending on the severity configured for traps. The Unisys MIB OID is 1.3.6.1.4.1.21658. The trap identifiers for Unisys traps are as follows:

1: Info
2: Warning
3: Error
The Unisys trap variables and their possible values are defined in Table F-1.

Table F-1. Trap Variables and Values

Variable          OID       Description and Values
dateandtime       3.1.1.1   Date and time that the trap was sent
eventid           3.1.1.2   Unique event identifier (See values in List of Events in Appendix E.)
sitename          3.1.1.3   Name of site where event occurred
eventlevel        3.1.1.4   1: info, 2: warning, 3: warning off, 4: error, 5: error off
eventtopic        3.1.1.5   1: site, 2: K-Box, 3: group, 4: splitter, 5: management
hostname          3.1.1.6   Name of host
kboxname          3.1.1.7   Name of RA
volumename        3.1.1.8   Name of volume
groupname         3.1.1.9   Name of group
eventsummary      3.1.1.10  Short description of event
eventdescription  3.1.1.11  More detailed description of event
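The variable OIDs in Table F-1 are relative to the Unisys enterprise MIB OID given earlier. The following sketch shows how a trap receiver might assemble the absolute OIDs and decode the eventlevel field; the base OID and the numeric values come from this appendix, while the function and dictionary names are illustrative assumptions, not part of the product.

```python
# Illustrative helper for the Unisys trap variables in Table F-1.
# The enterprise OID (1.3.6.1.4.1.21658) and per-variable suffixes are
# from this appendix; the helper names are assumptions for this sketch.
UNISYS_MIB_OID = "1.3.6.1.4.1.21658"

TRAP_VARIABLES = {
    "dateandtime": "3.1.1.1",
    "eventid": "3.1.1.2",
    "sitename": "3.1.1.3",
    "eventlevel": "3.1.1.4",
    "eventtopic": "3.1.1.5",
    "hostname": "3.1.1.6",
    "kboxname": "3.1.1.7",
    "volumename": "3.1.1.8",
    "groupname": "3.1.1.9",
    "eventsummary": "3.1.1.10",
    "eventdescription": "3.1.1.11",
}

# eventlevel values as listed in Table F-1.
EVENT_LEVELS = {1: "info", 2: "warning", 3: "warning off",
                4: "error", 5: "error off"}

def full_oid(variable: str) -> str:
    """Return the absolute OID for a Unisys trap variable."""
    return f"{UNISYS_MIB_OID}.{TRAP_VARIABLES[variable]}"

print(full_oid("eventlevel"))  # 1.3.6.1.4.1.21658.3.1.1.4
print(EVENT_LEVELS[4])         # error
```

A receiver that sees varbind 1.3.6.1.4.1.21658.3.1.1.4 with value 4 would therefore report an error-level event.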
SNMP Monitoring and Trap Configuration

To configure SNMP traps, see the Unisys SafeGuard Solutions Planning and Installation Guide. On the management console, use the SNMP Settings menu (in the System menu) to manage the SNMP capabilities. Through that menu, you can enable and disable the agent or the SNMP traps feature, modify the configuration for SNMP traps, and add or remove SNMP users. In addition, the RA provides several CLI commands for SNMP, as follows:

- The enable_snmp command to enable the SNMP agent
- The disable_snmp command to disable the SNMP agent
- The set_snmp_community command to define a community of users (for SNMPv1)
- The add_snmp_user command to add SNMP users (for SNMPv3)
- The remove_snmp_user command to remove SNMP users (for SNMPv3)
- The get_snmp_settings command to display whether the agent is currently enabled, the current configuration for SNMP traps, and the list of registered SNMP users
- The config_snmp_traps command to configure the SNMP traps feature so that events generate traps. Before you enable the feature, you must designate the IP address or DNS name for a host at one or more sites to receive the SNMP traps.
  Note: You can designate a DNS name for a host only in installations for which a DNS has been configured.
- The test_snmp_trap command to send a test SNMP trap

When the SNMP agent is enabled, SNMP users can submit queries to retrieve various types of information about the RA. You can also designate the minimum severity for which an event should generate an SNMP trap (info, warning, or error, in order from least to most severe, with error as the initial default). Once the SNMP traps feature is enabled, the system sends an SNMP trap to the designated host whenever an event of sufficient severity occurs.

Installing MIB Files on an SNMP Browser

Install the RA MIB file (\MIBS\mib.txt on the Unisys SafeGuard Solutions Splitter Install Disk CD-ROM) on an SNMP browser.
Follow the instructions for your browser to load the MIB file.
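The minimum-severity setting described above acts as a simple threshold: an event generates a trap only when its level is at or above the configured minimum, and error is the initial default. A minimal sketch of that behavior, using illustrative names that are not part of the product:

```python
# Sketch of the minimum-severity threshold for trap generation.
# Severity order (least to most severe) is from this appendix:
# info < warning < error. The helper names below are illustrative only.
_ORDER = {"info": 0, "warning": 1, "error": 2}

def generates_trap(event_severity: str, minimum: str = "error") -> bool:
    """True if an event of event_severity meets the configured minimum."""
    return _ORDER[event_severity] >= _ORDER[minimum]

# With the initial default minimum (error), warnings do not generate traps:
print(generates_trap("warning"))  # False
# Lowering the minimum to warning admits both Warning and Error events:
print(generates_trap("warning", minimum="warning"))  # True
```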
Resolving SNMP Issues

For SNMP issues, first determine whether the issue is an SNMP trap issue or an SNMP monitoring issue by performing the procedure for verifying SNMP traps in the Unisys SafeGuard Solutions Planning and Installation Guide. If you do not receive traps, perform the steps in Monitoring Issues and then in Trap Issues.

Monitoring Issues

1. Ping the RA management IP address from the management server that has the SNMP browser.
2. Ensure that the community name used in the RA configuration matches the one on the management server running the SNMP browser (versions 1 and 2). Use public as the community name.
3. Ensure that the user and password used in the RA configuration match those on the management server running the SNMP browser (version 3).

Trap Issues

1. Ensure that the trap destination is on the same network as the management network and that a firewall has not blocked SNMP traffic.
2. Ensure that the same version of SNMP is configured in the management software that receives traps.
Appendix G
Using the Unisys SafeGuard 30m Collector

The Unisys SafeGuard 30m Collector utility enables you to easily collect information about the environment so that you can solve problems. An enterprise solution requires many logs, and gathering the log information can be time intensive. Often the person who collects the information is not familiar with all the interfaces to the hardware. The Collector solves these problems: an experienced installer configures log collection one time, and then other personnel can use a one-button approach to log collection.

Upgrading from an old RA version to a new version involves many overhead tasks, such as saving the current settings, recording the LUN IDs, recording the consistency group names, recording the replication sets, and recording other settings as necessary. Previously these tasks were performed using the plink program; now you can accomplish them using the 30m Collector. When upgrading from 7.x to 8.0, collect the current settings with the 30m Collector, reload the RA software, and then export the settings to the new RA with the help of the Configuration Manager, which has been added as a feature of the 30m Collector. After exporting the settings to the new RA, verify the site configuration with the Site Verifier utility in the 30m Collector. All of these tasks can be done using a one-button approach.

Note: When upgrading from 5.x to 8.0, the 30m Collector does not work. Use the plink program instead.

You can use the 30m Collector utility to create custom scripts to complete tasks tailored to your environment. You choose which CLI commands to include in the custom scripts so that you build the capabilities you need. Refer to the Unisys SafeGuard Solutions Replication Appliance Command Line Interface (CLI) Reference Guide for more information about CLI commands.

The Collector gathers configuration information from RAs, storage subsystems, and switches.
No information is collected from the servers in the environment.

Installing the SafeGuard 30m Collector

This utility offers two modes: Collector and View. You determine the available modes when you install the program. If you install the Collector and specify Collector mode, both modes are enabled. If you install the Collector and specify View mode, the Collector mode functions are disabled. The View mode is primarily used by support personnel at the Unisys Support Center.

If you are installing the Collector at a customer installation, be sure to install the utility on PCs at both sites.
The utility requires the .NET Framework 3.5 and the J# 2.0 redistributable package (second edition), which are on the Unisys SafeGuard 30m Solution Control Install Disk CD-ROM in the Redistributable folder. The directories under this folder are dotnet Framework 3.5 and JSharp.

Notes:
- The readme file on that CD-ROM contains the same information as this appendix.
- If you installed a previous version of the Collector, uninstall this utility and remove the folder and all of the files in the folder before you begin this installation.

Perform the following steps to install the Collector:

1. Insert the CD-ROM in the CD/DVD drive, and start the file Unisys SafeGuard 30m Collector.msi.
2. On the Installation Wizard welcome screen, click Next.
3. On the Customer Information screen, type the user name and organization, and click Next.
4. On the Destination Folder screen, select a destination folder and click Next.
   Note: If you are using the Windows Vista operating system, install the Collector into a separate directory named C:\Unisys\30m\Collector.
5. On the Select Options screen, select Collector mode install at site or View mode install at support center, and then click Next.
6. On the Ready to Install the Program screen, click Install. The Installation wizard begins installing the files, and the Installing Unisys SafeGuard 30m Collector screen indicates the status of the installation. After the files are installed, the Installation Wizard Completed screen is displayed.
7. Click Finish.

Before You Begin the Configuration

Before you begin configuring the Collector, be sure you have the following information:

IP addresses
- SAN switches
- Network switches
- RA site management

Log-in names
- SAN switches
- Network switches
- RA (for custom scripts, Configuration Manager, and Site Verifier)
Passwords
- SAN switches
- Network switches
- RA (for custom scripts, Configuration Manager, and Site Verifier)
- EMC Navisphere CLI
- Storage

Autologon configuration
- SAN switches (Consult your SAN switch documentation for the autologon configuration.)

If you are using a Cisco SAN switch, enable the SSH server before you begin the configuration. See Configuring RA, Storage, and SAN Switch Component Types Using Built-Ins in this appendix.

Handling the Security Breach Warning

If you previously installed the Collector and have since uninstalled the utility and all the files, you might get this message when you begin configuring RAs or adding RAs:

WARNING - POTENTIAL SECURITY BREACH!

If you receive this message, complete these steps:

1. Delete the IP address for the RA.
2. Use the following plink command:
   C:\>plink <IP address> -l admin -pw admin get_version
   Messages about the host key and a new key are displayed.
3. Type Y in response to the message Update cached key?

Once you have updated the cached key, complete the steps in Configuring RAs to discover the IP addresses for the RAs.

Using Collector Mode

Installing the utility in Collector mode enables all the capabilities to gather log information using scripts and also enables View mode.

Getting Started

To access the Collector, follow these steps:

1. On the Start menu, point to Programs, click Unisys, click SafeGuard 30m Collector, and then click SafeGuard 30m Collector.
2. Select the Components.ssc file on the Open Unisys SafeGuard 30m Collector File dialog box.
Configuring RAs

The Unisys SafeGuard 30m Collector program window is displayed with two panes open. To collect data, specify the site management IP address of either of the RA clusters for a site; the other site management IP address is discovered automatically. The built-in scripts are a preconfigured set of CLI commands that facilitate easy data collection.

To configure the RA, perform these steps:

1. Start the Collector.
2. If needed, expand the Components tree in the left pane.
3. Select BI Built-In (under RA), right-click, and click Copy Built-In (Discover RA).
4. On the Script dialog box, type the RA site management IP address in the IP Address field and click Save.

If you have multiple SafeGuard solutions, repeat steps 3 and 4 for each set of RA clusters. After you enter the IP address, the Collector window is updated, with a folder for each site management IP address appearing below the RA folder. Each IP folder contains the built-in scripts that are enabled. The following sample window shows the IP address folders listed in the left pane. In this figure, two SafeGuard solutions are configured; the set of IP addresses (192.59.152.85 and 192.59.152.86) identifies the two RA clusters in solution 1. The continuous data protection (CDP) solution always has only one RA cluster.
Adding Customer Information

Add information about the Unisys service representative, customer, and architect so that the Unisys Support Center can contact the site easily. To add the information, perform the following steps on the Unisys SafeGuard 30m Collector program window.

1. On the File menu, click Properties.
2. On the Properties dialog box, select the appropriate tab: Customer, Architect, or CIR.
3. Type the information for each field on each tab. (For instance, type text in the Name, Office, Mobile, E-mail, and Additional Info fields on the CIR tab.) The Architect tab provides an Installed Date field. Use the Additional Info field for any other information that the Unisys Support Center might need, such as a support request number.
4. Click OK.
Running All Scripts

To collect data from all enabled scripts in a SafeGuard Solutions Components (SSC) file, perform these steps on the Unisys SafeGuard 30m Collector program window.

1. Select Components.
2. Right-click, and click Run, or click the Run button.

Note: The status bar shows the progress of script executions and the amount of data collected.

Compressing an SSC File to Send to the Support Center

Once you run the utility to collect information, you can compress the SSC file to send to the Unisys Support Center.

Note: A Collector components file has the .ssc suffix. Once an SSC file is compressed, the corresponding SafeGuard Solutions Data (SSD) file has the .ssd suffix.

On the Unisys SafeGuard 30m Collector program window, follow these steps to compress an SSC file:

1. Click Compress SSC on the File menu. Once the file is compressed, the file name and path are displayed at the top of the right pane of the window. The data is exported to the file named Components.ssd in the directory C:\Program Files\Unisys\30m\Collector\Data.
   Note: For the Microsoft Vista operating system, the SSD file resides in the directory where the Collector is installed. A typical location for this file is C:\Unisys\30m\Collector\Components.ssd.
2. Send the SSD file to the Unisys Support Center at Safeguard30msupport@unisys.com.

Duplicating the Installation on Another PC

To duplicate the installation of the Collector on a different PC (for example, at the second site), perform these steps:

1. Copy the SSD file from the PC with the installed Collector to the second PC, placing it in the C:\Program Files\Unisys\30m\Collector\Data directory.
2. Start the Collector.
3. Click Cancel on the Open Unisys SafeGuard 30m Collector File dialog box. The Unisys SafeGuard 30m Collector program window is displayed.
   Note: Once an SSD file is extracted, you can select the <name>.ssc file.
4. On the File menu, select Uncompress SSD.
5. On the Open SafeGuard 30m Data File dialog box, select from the list of available files the SSD file that you wish to uncompress. If a message appears asking about overwriting the SSC file, click Yes.
6. Ensure that all scripts run from this PC by selecting each component type and running the scripts for each component.

Understanding Operations in Collector Mode

The Components.ssc file contains the configuration information. If you make changes to the Components.ssc file, such as adding, deleting, editing, enabling, or disabling scripts, these changes are automatically saved. You can also make these changes to a saved SSC file, except that you cannot delete scripts from a saved SSC file. You must open the Components.ssc file to delete scripts.

Understanding and Saving SSC Files

Because you can enable and disable scripts in any SSC file, you can create saved SSC files for specific uses. If you want to run a subset of the available scripts, save the Components.ssc file as a new SSC file with a unique name. You can then enable or disable scripts in the saved SSC file.

The saved SSC file is always updated from the Components.ssc file for information such as the available scripts and the details within each script. In addition, all changes that are made to any SSC file are updated in the Components.ssc file. Only scripts that were enabled in the saved SSC file are enabled when it is updated from the Components.ssc file.

For example, you could save an SSC file with all RAs except one disabled. You might name it radisabled.ssc. If you have the radisabled.ssc file open and add a new script to it, the script is automatically added to the Components.ssc file. Whenever the Components.ssc file is updated with a new script, that script is automatically added to any saved SSC files. If you add a new RA to the configuration, the Components.ssc file and any existing saved SSC files are updated with the component, and its scripts are disabled.
If you delete scripts from the Components.ssc file, the deleted scripts are automatically removed from any saved SSC files as well.
Sample Scenario

If you want to collect data at one site only, or if you want to view the data from one site, you can create a new saved SSC file for each site. Follow these steps to create the saved SSC files.

1. Add any desired scripts to the Components.ssc file.
2. Open an SSC file.
3. Click Save As on the File menu, and enter a unique name for the file.
4. Enable and disable scripts as desired. For example, you might disable one site. To do so, follow these steps:
   a. Select the IP address of a component (for example, the Site 1 RA cluster management IP).
   b. Right-click and click Disable.

Repeat steps 2 through 4 to create additional customized files.

Opening an SSC File

On the Unisys SafeGuard 30m Collector program window, perform the following steps to open an SSC file:

1. Click Open on the File menu.
2. Select an SSC file and click Open.

Configuring RA, Storage, and SAN Switch Component Types Using Built-In Scripts

The built-in scripts are preconfigured; they contain CLI commands for RAs, navicli commands for Clariion storage, and CLI commands for switches that facilitate easy data collection. It takes about 4 minutes for the built-in scripts for one RA to run and about 2 minutes for the built-in scripts for a SAN switch to run.

After you configure built-in scripts, the left pane is updated with the IP addresses below the component type. Each IP folder contains the built-in scripts that are enabled.

On the Unisys SafeGuard 30m Collector program window, follow these steps to use built-in scripts to configure the RA, Storage, and SAN Switch component types:

1. Expand a component type (RA, Storage, or SAN Switch) and select BI-Built-In.
2. Right-click and click Copy Built-In.
3. On the Script dialog box, complete the available fields and click Save.

Note: You can select one script instead of all scripts by selecting a script name instead of selecting BI-Built-In.
For the RA Component Type

To collect data, specify the site management IP address of either of the RA clusters for a site. The other site management IP address is automatically discovered when you specify either of the RA site management addresses. If you have multiple SafeGuard solutions, repeat the three previous steps for each set of RA clusters.

For the Storage Component Type

Clariion is the only storage component with built-in scripts available.

For the SAN Switch Component Type

Before configuring a Cisco SAN switch, enter config mode on the switch and type ssh server enable. To determine the state of the SSH server, type show ssh server when not in config mode. Refer to the Cisco MDS 9020 Fabric Switch Configuration Guide and Command Reference for more information about switch commands.

If you run the tech-support command under SAN Switch from the Collector, the data capture might take a long time. You can follow the progress in the status bar of the window.

If you run commands for a Brocade switch and receive the following message, the Brocade switch is downlevel and does not support the SSH protocol:

rbash: switchshow: command not found

Upgrade the switch software to a later version that supports the SSH protocol.

Enabling Scripts

You can interactively enable all the scripts in any SSC file, the scripts for one component in the SSC file, or a single script. To enable a disabled script, you must open the SSC file containing the script. Perform the following steps on the Unisys SafeGuard 30m Collector program window.

Enable All Scripts

1. Select Components.
2. Right-click and click Enable. Enabled scripts are shown in green.

Enable Scripts for One Component

1. Select the IP address of the component.
2. Right-click and click Enable. Enabled scripts are shown in green.
Enable a Single Script

1. Select the script name.
2. Right-click and click Enable. The enabled script is shown in green.

Disabling Scripts

You can interactively disable all the scripts in any SSC file, the scripts for one component in the SSC file, or a single script. Perform the following steps on the Unisys SafeGuard 30m Collector program window.

Disable All Scripts

1. Select Components.
2. Right-click and click Disable. Disabled scripts are shown in red.

Disable Scripts for One Component

1. Select the IP address of the component.
2. Right-click and click Disable. Disabled scripts are shown in red.

Disable a Single Script

1. Select the script name.
2. Right-click and click Disable. The disabled script is shown in red.

Running Scripts

You can interactively run all the scripts in any SSC file; the scripts for one component type such as RA, Storage, SAN Switch, or Other; the scripts for one component in the SSC file; or a single script.

Note: You can use the Run button on the Collector toolbar or the Run command in the following procedures.

Perform the following steps on the Unisys SafeGuard 30m Collector program window.

Run All Scripts

1. Select Components.
2. Right-click and click Run.

Run Scripts for One Component Type

1. Select a component type (RA, Storage, SAN Switch, or Other).
2. Right-click and click Run.
The status of the executing scripts is displayed in the right pane. The status bar shows the component type that is running, the IP address, the script name, and instructions for halting script execution. A progress bar indicates that the Collector is running the script and shows the amount of data being captured by the script. Once script execution completes, the status bar shows the last script run.

Run Scripts for One Component

1. Select either the IP address or the custom-named component.
2. Right-click and click Run.

The same status information is displayed in the right pane and in the status bar as for the preceding procedure.

Run a Single Script

1. Select a script name.
2. Right-click and click Run.

The same status information is displayed in the right pane and in the status bar as for the preceding procedures.

Stopping Script Execution

To stop a script while it is executing, click Stop on the Collector toolbar. All scripts that have been stopped are marked with a green X. The status of the stopped script is displayed in the right pane.

Deleting Scripts

You can interactively delete scripts only in the Components.ssc file. Perform the following steps on the Unisys SafeGuard 30m Collector program window.

Delete Scripts for One Component

1. Select the IP address or custom-named component.
2. Right-click and click Delete.

Delete a Single Script

1.
Expand an IP address or a custom-named component; then select a script name.
2. Right-click and click Delete.

Adding Scripts for RA, Storage, and SAN Switch Component Types

You can interactively add custom scripts to any SSC file by copying an existing script or by specifying a new script. Perform the following steps on the Unisys SafeGuard 30m Collector program window.
Add a New Script for a Component Type

1. Select a component type (RA, Storage, or SAN Switch).
2. Right-click and click New.
3. Complete the script form.
4. Click Save.

Add a New Script Based on an Existing Custom Script

1. Select a script name.
2. Right-click and click New.
3. Complete the form. Change the script name and the command.
4. Click Save.

Adding Scripts for the Other Component Type

Perform the following steps on the Unisys SafeGuard 30m Collector program window.

1. Select the component type Other.
2. Right-click and click New.
3. On the Select Program dialog box, navigate to the appropriate directory and choose the file to run. Then click Open.
4. On the Script dialog box, type a component name in the Component field.
5. Type a unique name for the script in the Script Name field.
6. Review the selected file name that is displayed in the Command field. Modify the file name as necessary.

The following example illustrates using a custom component (adding a new script as shown in the previous procedure) to dismount and mount drives. (The mountvol /P option dismounts a volume; supplying a volume name mounts it.)

Note: In this example, the Collector must be installed on the server.

C:\batch_File\dismount_e.bat
REM This script, when run, dismounts the specified drive
Echo ON
cd c:\windows\system32
mountvol.exe E:\ /P
echo "Finished"

C:\batch_File\mount_r.bat
REM This script, when run, mounts the specified drive
Echo ON
cd c:\windows\system32
mountvol.exe R:\ \\?\Volume{1a1fb6a4-55bf-11db-9ef6-444553544200}\
echo "Finished"
Scheduling an SSC File

Perform the following steps on the Unisys SafeGuard 30m Collector program window.

1. Click Schedule on the menu bar.
2. On the Schedule Unisys SafeGuard 30m Collector File dialog box, enter the information required for each field as follows:
   a. Type the password.
   b. Type the date and start time.
   c. Select a Perform task option, which determines how often the schedule runs.
   d. Enter the end date if shown. (You do not need an end date for a Perform task of Once.)
3. Click Select.
4. On the Select Unisys SafeGuard 30m Collector dialog box, select the SSC file for which you wish to run the schedule, and then click Open. The Schedule Unisys SafeGuard 30m Collector File dialog box is displayed again. The Collector opens the selected SSC file as the current SSC file.
5. Click Add.
6. Click Exit.

Note: You can create one schedule for an SSC file. To create additional schedules, create additional SSC files with the desired scripts enabled.

The resultant scheduled data is appended to any current data (if available). For example, if you run the Collector using Windows Scheduler three times, three outputs are displayed in the right pane, one after another, with a timestamp for each.

Querying a Scheduled SSC File

Perform the following steps on the Unisys SafeGuard 30m Collector program window.

1. Click Schedule on the menu bar.
2. On the Schedule Unisys SafeGuard 30m Collector File dialog box, click Query.
3. On the Tasks window, select the task name that is the same as the scheduled SSC file.
4. Right-click and click Properties.
5. View the details of the scheduled task in the window; then click OK to close the task Properties window.
6. Close the Tasks window and then select the Schedule Unisys SafeGuard 30m Collector window.
7. Click Exit.
Note: For the Microsoft Vista operating system, if you want to see the scheduled task after scheduling a task, click Query on the Schedule Unisys SafeGuard 30m Collector File dialog box. The Vista Microsoft Management Console (mmc) window is displayed. Press F5 to see the scheduled task.

Deleting a Scheduled SSC File

Perform the following steps on the Unisys SafeGuard 30m Collector program window.

1. Click Schedule on the menu bar.
2. On the Schedule Unisys SafeGuard 30m Collector File dialog box, click Query.
3. On the Tasks window, select the task name that is the same as the scheduled SSC file.
4. Right-click and click Delete.
5. Close the Tasks window and then select the Schedule Unisys SafeGuard 30m Collector window.
6. Click Exit.
Using Configuration Manager

Note: This feature works only with RA version 3.0.x or later.

Configuration Manager helps you to collect the following settings information from an existing RA:

- Names of the consistency groups
- Names of the copies
- Replication set information
- Production copy information
- Replication and journal volume information for each group
- Group, Copy, and Consolidation level policy information
- Splitter information

Apart from the preceding list, settings information can also include Group Sets, Account Settings (Company Name and Contact Info), Alert Settings and Rules, SNMP settings, Multipath Monitoring, and so on.

Suppose your current RA version is 3.0.x and you want to upgrade it to 3.1. Perform the following steps on the Unisys SafeGuard 30m Collector program window:

1. Click Configuration Manager and select Save Current Settings and Export to RA. The Site Configuration window appears.
2. In Step 1: Collect Current Settings, perform the following steps:
   a. Enter details for Management IP, Admin User ID, and Admin Password.
   b. Click Collect Settings.
3. In Step 2: Settings Collected Successfully?, perform the following steps:
   a. If the credentials are correct, the current settings are collected and saved in a file. Click Next.
   b. If the credentials are incorrect, a message appears stating that the Collector was unable to collect the settings. Reenter the credentials.
   c. Back up the current settings file (a collection of configuration commands) when prompted, and click OK to continue.
      Note: It is safe to back up the current settings file, and Unisys recommends doing so.
   d. The Verify the File button is enabled. Click it to verify that the information collected in Step 2 is correct.
4. In Step 3: Open Configuration File and Check Version, perform the following steps:
   a. Open the saved configuration file when prompted.
   b. Click Browse to open the File Open dialog box.
   c. A folder named with the IP address entered in Step 1: Collect Current Settings is displayed. Open the folder to view the SaveSettings.dat data file.
   d. Double-click to select the file and then click Done. The current script version appears.
5. In Step 4: Upgrade Current Settings, perform the following steps:
   a. Select the new version (for example, 3.1).
   b. Click the Upgrade Current Settings button. This command upgrades the previous script to the current RA version and saves the output to the ConvertedSaveSettings.dat file in the same folder where the SaveSettings.dat file resides. Be careful when selecting the new RA version.

Notes:
- Unisys recommends that you upgrade the RA (for example, from 3.0.x to 3.1). For more information, see the Unisys SafeGuard Solutions Upgrade Instructions.
- Do not close the Collector window until you are instructed to do so.
- After the RA is upgraded, you can create consistency groups, copies, volumes, and so on by using the Configuration Manager feature in the 30m Collector without using the plink application. In addition, you can restore the previous settings and policies.
- If the script is not upgraded properly, the exported converted settings can result in an incorrect configuration. Choose the new version carefully.

6. In Final Step - Export Settings, the Export Settings operation transfers the saved data of the ConvertedSaveSettings file to the RA.
This step uses the same credentials and management IP as entered in Step 1: Collect Current Settings. Perform the following steps to export the settings:
   a. You are prompted with a choice to either click Yes (to continue) or No (to discontinue).
   b. If you select the Open Output Of Export Settings When Finished (recommended) check box, the 30m Collector opens the output of the commands in Notepad. This output serves as a report file. All commands that are executed on the RAs to create consistency groups, copies, volumes, policy settings, and so on are saved automatically. A message appears stating that during export settings the Collector may stop responding. Click OK to continue.
   c. After the settings are exported, a message appears stating that all the settings are exported. Click OK to continue.

Note: It is highly recommended that you check all the settings thoroughly by opening the RA management console. Wrong settings may result in unexpected behavior and/or data loss.

7. A message appears displaying the file path that contains the output of all the commands exported to the RAs. Click OK to continue.
8. The output of each of the commands opens in Notepad. You can also verify which commands executed successfully and which did not by reviewing the return codes, and the meaning of the codes, in a single file.
9. Click Cancel to close the Site Configuration window.

Note: Instead of exporting the settings from a command-line program such as plink.exe, use the Configuration Manager; it is an easy-to-use, one-button GUI. It is highly recommended that after exporting the settings to the new RAs, you thoroughly verify the settings manually.
Using Site Verifier

The Site Verifier feature provides information about site configuration. This feature compares two SaveSettings files, saved before and after an RA configuration change.

Note: Sometimes configuration changes are made manually and intentionally; Site Verifier may mark them as changed settings. In these cases, it is recommended that you check the RA configuration by opening the RA management console and checking the configuration manually and thoroughly.

Perform the following steps on the Unisys SafeGuard 30m Collector program window:

1. Select Configuration Manager and then select Site Verifier. The Verify Site Configuration window appears.
2. Click the Open Previous Settings button. In the File Open dialog box, click the IP folder and open the SaveSettings.dat file. The file opens in the left pane of the Verify Site Configuration window.
3. Click the Open Current Settings button. In the File Open dialog box, you are redirected to the Current Settings folder. If you do not see the folder and the files under it, click Cancel.
4. Under Collect Settings, enter details for Site Management IP, Admin User ID, and Admin Password. Click Collect. A message appears confirming that the new settings were collected successfully. Click OK to continue.
5. Click the Open Current Settings button again. In the File Open dialog box, you are redirected to the Current Settings folder. Click the folder and open the SaveSettings.dat file. The file opens in the right pane of the Verify Site Configuration window.
6. Click the Compare Both Settings button. If the files are equal, the settings before and after the RA upgrade are the same and no discrepancies were found. If the files are not equal, the settings before and after the RA upgrade differ. This does not necessarily indicate a problem; some changes are necessary and intentional.
You can also verify the settings by scrolling through either pane.

Note: If the previous and current settings are unequal, recheck the settings by opening the RA management console. Check the entire configuration manually to be sure.

7. Click Cancel to close the Site Verifier window.
Using View Mode

If you installed the Collector in View mode, support personnel at the Unisys Support Center can use View mode to view the collected information. To access the Collector, follow these steps:

1. Start the Collector.
2. On the Open Unisys SafeGuard 30m Collector File dialog box, click Cancel. The Unisys SafeGuard 30m Collector program window is displayed.
Note: Once an SSD file is extracted, you can select the <name>.ssc file.

3. On the File menu, click Uncompress SSD.
4. On the Open SafeGuard 30m Data File dialog box, select from the list of available files the SSD file that you wish to uncompress.
5. In View mode, expand the components tree and then expand a component type: RA, Storage, SAN Switch, or Other.
6. Click a script name from those displayed to view the data collected from that script. The data is displayed in the right pane. The following figure displays a sample of View mode with data displayed in the right pane.
7. On the File menu, click Exit.
Appendix H
Using kutils

Usage

The server-based kutils utility enables you to manage host splitters across all platforms. This utility is installed automatically when you install the Unisys SafeGuard 30m splitter on a host machine. When the splitting function is performed by an intelligent fabric switch, you can install a stand-alone version of the kutils utility separately on host machines.

For details on the syntax and use of the kutils commands, see the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.

A kutils command is always introduced with the kutils string. If you enter the string by itself (that is, without any parameters), the kutils utility returns usage notes, as follows:

C:\program files\kdriver\kutils>kutils
Usage: kutils <command> <arguments>

Path Designations

You can designate the path to a device in the following ways:

Device path example:
SCSI\DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001\5&133EF78A&0&000

Storage path example:
SCSI#DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001#5&133EF78A&0&000#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}

Volume path example:
\\?\Volume{33b4a391-26af-11d9-b57b-505054503030}

Each command notes the particular designation to use. In addition, some commands, such as showdevices and showfs, return the symbolic link for a device. The symbolic link generally provides additional information about the characteristics of the specific devices.
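The three designation styles have distinct shapes, so a script that post-processes kutils output can tell them apart mechanically. The following Python sketch is an illustrative helper only (it is not part of kutils, and the patterns are inferred solely from the example strings above, so they may not cover every device type):

```python
import re

# Patterns inferred from the example path strings in this appendix.
DEVICE_RE = re.compile(r"^SCSI\\")                        # backslash-separated device path
STORAGE_RE = re.compile(r"^SCSI#.*#\{[0-9a-fA-F-]+\}$")   # '#'-separated, ends with a GUID
VOLUME_RE = re.compile(r"^\\\\\?\\Volume\{[0-9a-fA-F-]+\}")  # \\?\Volume{GUID}

def classify_path(path: str) -> str:
    """Classify a path string as one of the three designation styles."""
    if VOLUME_RE.match(path):
        return "volume"
    if STORAGE_RE.match(path):
        return "storage"
    if DEVICE_RE.match(path):
        return "device"
    return "unknown"
```

For example, classify_path applied to the volume path example above returns "volume".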
The following are examples of symbolic links:

\Device\0000005c
\Device\EmcPower\Power2
\Device\Scsi\q123001Port2Path0Target0Lun2

Command Summary

The kutils utility offers the following commands:

flushfs: Initiates an operating system flush of the file system (Windows only).
manage_auto_host_info_collection: Indicates whether automatic host information collection is enabled or disabled, or enables or disables automatic host information collection.
showdevices: Presents a list of physical devices to which the host has access, providing (as available) the device path, storage path, and symbolic link for each device (Windows only).
showfs: Presents the drive designation and, as available, the device path, storage path, and symbolic link for each mounted physical device (Windows only).
show_vol_info: Presents information on the specified volume, including the Unisys SafeGuard 30m solution name (if created in Unisys SafeGuard Solutions), size, and storage path.
show_vols: Presents information on all volumes to which the host has access, including the Unisys SafeGuard 30m solution name (if created in Unisys SafeGuard Solutions), size, and storage path.
sqlrestore: Restores an image previously created by the sqlsnap command (Windows only).
sqlsnap: Creates a VDI-based SQL Server image (Windows only).
start: Resumes the splitting of write operations.
stop: Discontinues the splitting of write operations to an RA (that is, places the host splitter in pass-through mode, in which data is written to storage only).
unmount: Unmounts the disk drives.
Appendix I
Analyzing Cluster Logs

Samples of cluster log messages for problems and situations are listed throughout this guide. You can search on text strings from cluster log messages to find specific references.

The information gathered in cluster logs is critical in determining the cause of a given cluster problem. Without the diagnostic information from the cluster logs, you might find it difficult to determine the root cause of a cluster problem. This appendix provides information to help you use the cluster log as a diagnostic tool.

Introduction to Cluster Logs

The cluster log is a text log file updated by the Microsoft Cluster Service (MSCS) and its associated cluster resources. The cluster log contains diagnostic messages about cluster events that occur on an individual cluster member or node. This file provides more detailed information than the cluster events written in the system event log.

A cluster log reports activity for one node. All member nodes in a cluster perform as a single unit. Therefore, when a problem occurs, it is important to gather log information from all member nodes in the cluster. This information gathering is typically done using the Microsoft MPS Report Utility. Gather the information immediately after a problem occurs to ensure that cluster log data is not overwritten.

By default, the cluster log name and location are as follows:

C:\Winnt\Cluster\cluster.log

Note: For Windows 2003, the cluster.log file is located in the following path: C:\WINDOWS\Cluster

Note: For Windows 2008, the cluster.log file is located in the following path: C:\WINDOWS\Cluster\Reports

When captured with the MPS Report Utility, the file is named <server name>_cluster.log.
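Tools that collect logs from several nodes need to know where to look on each Windows version. The following Python sketch is purely illustrative (the paths are exactly those listed above; the lookup function itself is not part of any Unisys or Microsoft tool):

```python
# Default cluster.log locations by Windows version, as listed above.
DEFAULT_CLUSTER_LOG = {
    "windows 2000": r"C:\Winnt\Cluster\cluster.log",
    "windows 2003": r"C:\WINDOWS\Cluster\cluster.log",
    "windows 2008": r"C:\WINDOWS\Cluster\Reports\cluster.log",
}

def default_log_path(version: str) -> str:
    """Return the default cluster.log path for a given Windows version."""
    return DEFAULT_CLUSTER_LOG[version.strip().lower()]
```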
Creating the Cluster Log

In Windows 2000 Advanced Server and Windows 2000 Datacenter Server, cluster logging is enabled on all nodes by default. You can define the characteristics and behavior of the cluster log with system environment variables.

To access the system environment variables, perform the following actions in Windows Server 2003:

1. In Control Panel, double-click System.
2. Select the Advanced tab.
3. Click Environment Variables.

To access the system environment variables, perform the following actions in Windows Server 2008 (Longhorn):

1. In Control Panel, double-click System.
2. Click Advanced system settings tasks.
3. Click Environment Variables.

You can get additional information regarding the system environment variables in Microsoft Knowledge Base article 168801, How to Turn On Cluster Logging in Microsoft Cluster Server, at this URL:

http://support.microsoft.com/default.aspx?scid=kb;en-us;168801

The default cluster settings are listed in Table I 1. Some parameters might not be listed when viewing the system environment variables. If a variable is not listed, its default value is still in effect.

Table I 1. System Environment Variables Related to Clustering

ClusterLog
Default setting: %SystemRoot%\Cluster\Cluster.log
Determines the location and name of the cluster log file.

ClusterLogSize
Default setting: 8 MB
Determines the size of the cluster log. The default size is usually not large enough to retain history on enterprise systems. The recommended setting is 64 MB.
Table I 1. System Environment Variables Related to Clustering (continued)

ClusterLogLevel
Default setting: 2
Sets the level of detail for log entries, as follows:
0 = No logging
1 = Errors only
2 = Errors and Warnings
3 = Everything that occurs
Used only with the /debug parameter on MSCS startup. Review Microsoft Knowledge Base article 258078 for more information about using the /debug parameter.

ClusterLogOverwrite
Default setting: 0
Determines whether a new cluster log is to be created when MSCS starts.
0 = Disabled
1 = Enabled
Note: By default, the ClusterLogOverwrite setting is disabled. Unisys recommends that this setting remain disabled. When this setting is enabled, all cluster log history is lost if MSCS is restarted twice in succession.

To create a cluster log in Windows Server 2008 (Longhorn) using the command prompt:

Command description:

CLUSTER [[/CLUSTER:]cluster-name] LOG <options>

<options> =
/G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]
/SIZE:logsize-MB
/LEVEL:logLevel

Notes:
The /SIZE value must be between 8 and 1024 MB.
The /LEVEL value must be between 0 and 10.

For example, the following command creates the cluster log for the last 10 minutes:

CLUSTER /CLUSTER:CLUSTER_NAME LOG /G 10
Understanding the Cluster Log Layout

Figure I 1 illustrates the layout of the cluster log. The paragraphs following the figure explain the various parts of the layout.

Figure I 1. Layout of the Cluster Log (an entry consists of a process ID, a thread ID, the date, and the GMT time)

The process ID is the process number assigned by the operating system to a service or application.

The thread ID is a thread of a particular process. A process typically has multiple threads listed. Within a large cluster log, it is particularly useful to search by thread ID to find the messages related to the same thread.

The date listed is the date of the entry. You can use this date to match the date of the problem in the system event log.

The time entered in the Windows 2000 cluster log is always in Greenwich Mean Time (GMT). The format of the entry is HH:MM:SS.SSS. The SS.SSS entry represents seconds carried out to the thousandths of a second. There can be multiple .SSS entries for the same thousandth of a second. Therefore, more than 999 cluster log entries can exist for any given second.

Cluster Module

Table I 2 lists the various modules of MSCS. These module names are logged within square brackets in the cluster log.

Table I 2. Modules of MSCS

API: API Support
ClMsg: Cluster messaging
ClNet: Cluster network engine
CP: Checkpoint Manager
Table I 2. Modules of MSCS (continued)

CS: Cluster service
DM: Database Manager
EP: Event Processor
FM: Failover Manager
GUM: Global Update Manager
INIT: Initialization
JOIN: Join
LM: Log Manager
MM: Membership Manager
NM: Node Manager
OM: Object Manager
RGP: Regroup
RM: Resource Monitor

For additional descriptions of the cluster components, refer to the Windows 2000 Server Resource Kit at this URL:

http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/default.mspx?mfr=true

Click the following link for Windows 2003 to refer to the Windows 2003 Server Resource Kit:

http://www.microsoft.com/windowsserver2003/techinfo/reskit/tools/default.mspx

Click the following link to interpret the cluster logs:

http://technet2.microsoft.com/windowsserver/en/library/16eb134d-584e-46d9-9bf4-6836698cd26a1033.mspx?mfr=true

Sample Cluster Log

The sample cluster log that follows illustrates the component names in brackets.

00000848.00000ba0::2008/05/05-16:11:31.000 [RGP] Node 1: REGROUP INFO: regroup engine requested immediate shutdown.
00000848.00000ba0::2008/05/05-16:11:31.000 [NM] Prompt shutdown is requested by a membership engine
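Because every entry begins with the same process.thread::timestamp prefix, the fields described in this appendix can be pulled apart programmatically. The following Python sketch is an illustrative helper only (it is not a Unisys or Microsoft tool; the pattern is derived from the sample entries shown here):

```python
import re
from datetime import datetime, timezone

# One cluster log entry: process ID, thread ID, GMT timestamp, optional
# [module] tag, and the remaining message text.
ENTRY_RE = re.compile(
    r"^(?P<process>[0-9a-fA-F]{8})\.(?P<thread>[0-9a-fA-F]{8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"(?:\[(?P<module>[^\]]+)\]\s*)?(?P<message>.*)$"
)

def parse_entry(line: str) -> dict:
    """Split one cluster log line into its component fields."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError("not a cluster log entry: " + line)
    fields = m.groupdict()
    # Cluster log times are always GMT (UTC).
    fields["time"] = datetime.strptime(
        fields["stamp"], "%Y/%m/%d-%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return fields
```

Applied to the first sample entry above, the parser yields process 00000848, thread 00000ba0, module RGP, and a UTC timestamp. For resource entries such as "Physical Disk <Disk V:>: ...", the module tag appears later in the text, so the module field is left empty and the whole remainder is kept as the message.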
00000adc.00000acc::2008/05/05-16:11:31.234 [RM] Going away, Status = 1, Shutdown = 0.

Cluster Operation

The cluster operation is the task currently being performed by the cluster. Each cluster module (listed in Table I 2) can perform hundreds of operations, such as forming a cluster, joining a cluster, checkpointing, moving a group manually, and moving a group because of a failure.

Posting Information to the Cluster Log

The cluster log file is organized by date and time. Process threads of MSCS and resources post entries in an intermixed fashion. As the threads perform various cluster functions, they constantly post entries to the cluster log in an interspersed manner. The following sample cluster log shows various disks in the process of coming online. The entries are not logically grouped by disk; rather, the entries are logged as each thread posts its unique information.

Sample Cluster Log

00000444.00000600::2008/11/18-18:23:48.307 Physical Disk <Disk V:>: [DiskArb] Issuing GetSectorSize on signature 9a042144.
00000444.000005e0::2008/11/18-18:23:48.307 Physical Disk <Disk R:>: [DiskArb] Successful read (sector 12) [:0] (0,00000000:00000000).
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk <Disk W:>: [DiskArb] DisksOpenResourceFileHandle: CreateFile successful.
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk <Disk V:>: [DiskArb] GetSectorSize completed, status 0.
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk <Disk W:>: DiskArbitration must be called before DisksOnline.
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk <Disk V:>: [DiskArb] ArbitrationInfo.SectorSize is 512
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk <Disk W:>: [DiskArb] Arbitration Parameters (1 9999).
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk <Disk V:>: [DiskArb] Issuing GetPartInfo on signature 9a042144.

Because the cluster performs many operations simultaneously, the log entries pertaining to a particular thread are interwoven with those of the other cluster operations. Depending on the number of cluster groups and resources, reading a cluster log can become difficult.

Tip: To follow a particular operation, search by the thread ID. For instance, to follow online events for Physical Disk V, perform these steps using the preceding sample cluster log:

1. Anchor the cursor in the desired area.
2. Search up or down for thread 00000600.

Diagnosing a Problem Using Cluster Logs

The following topics provide useful information for diagnosing problems using cluster logs:

- Gathering Materials
- Opening the Cluster Log
- Converting GMT to Local Time
- Converting Cluster Log GUIDs to Text Resource Names
- Understanding State Codes
- Understanding Persistent State
- Understanding Error and Status Codes

Gathering Materials

Gather the following information, tools, and files to use with the cluster logs to diagnose problems:

Information
- Date and time of problem occurrence
- Server time zone

Tools
- Notepad or Wordpad text viewer
- Net Helpmsg, a command-line tool embedded in Windows (the command syntax is Net Helpmsg <error number>)
- Output from the MPS Report Utility from all cluster nodes

Files from the MPS Report Utility run
- Cluster log (Mandatory)
The file name is <server name>_cluster.log.

- System event log (Mandatory). The file name is <server name>_event_log_system.txt.
- .nfo system information file for installed adapters and driver versions (Reference). The file name is <server name>_msinfo.nfo.
- Cluster registry hive for cross-referencing information used in the cluster log (Reference). The file name is <server name>_cluster_registry.hiv.
- Cluster configuration file for a basic listing of cluster nodes, groups, resources, and dependencies (available in MPS Report Utility version 7.2 or later). The file name is <server name>_cluster_mps_information.txt.

Opening the Cluster Log

Use a text editor to view the cluster log file in the MPS Report Utility. Notepad or Wordpad works well. Notepad allows text searches up or down the document; Wordpad allows text searches only down the document.

Note: Do not open the cluster.log file on a production cluster. Logging stops while the file is open. Instead, copy the cluster.log file first and then open the copy. The cluster log is on the local system in the directory Winnt/Cluster/Cluster.log.

Converting GMT/UTC to Local Time

The time posted in the cluster log is given in GMT/UTC. You must convert GMT/UTC to local time to cross-reference cluster log entries with system and application event log entries. You can find the local time zone in the .nfo file in MPS Reports under System Summary. You can also use the Web site www.worldtimeserver.com to find the accurate local time for a given city, the current GMT/UTC time, and the difference between the two in hours.

Converting Cluster Log GUIDs to Text Resource Names

A globally unique identifier (GUID) is a 32-character hexadecimal string used to identify a unique entity in the cluster. A unique entity can be a node name, group name, resource name, or cluster name. The GUID format is nnnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn.
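Once GUIDs have been mapped to text names (using either method described under "Mapping a Text Name to a GUID" later in this section), the substitution can be scripted. The following is a hypothetical sketch, not part of the product: it assumes the `guid = name` listing format that MPS Report versions with automatic mapping write to the cluster configuration file, and uses sample entries from the listings shown in this section.

```python
import re

# Hypothetical sketch: build a GUID-to-name map from the "guid = name"
# listing produced by MPS Report versions with automatic mapping, then
# substitute the text names into cluster log lines. The sample mapping
# data is taken from the listings shown in this section.

MAPPING_TEXT = """\
f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:
"""

# GUID format: nnnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn (hex digits)
GUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}", re.I)

def load_guid_map(text):
    """Parse '<guid> = <name>' lines into a dictionary."""
    guid_map = {}
    for line in text.splitlines():
        guid, sep, name = line.partition("=")
        if sep:
            guid_map[guid.strip().lower()] = name.strip()
    return guid_map

def name_guids(log_line, guid_map):
    """Replace each known GUID in a log line with its text name."""
    return GUID_RE.sub(
        lambda m: guid_map.get(m.group(0).lower(), m.group(0)), log_line)

guid_map = load_guid_map(MAPPING_TEXT)
print(name_guids(
    "[FM] Setting group e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc owner to node 2",
    guid_map))
# -> [FM] Setting group Disk K: owner to node 2
```

Unknown GUIDs are left unchanged, so the script can be run safely over an entire log even when the mapping listing is incomplete.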
The following are examples of GUIDs in the cluster log:

000007d0.00000808::2008/04/23-21:48:23.105 [FM] FmpHandleResourceTransition: resource Name = ae775058-af20-4ba2-a911-af138b1f65bd old state=130 new state=3
000007d0.00000808::2008/04/23-21:48:23.448 [FM] FmpRmOfflineResource: RMOffline() for 6060dc33-5737-4277-b2f2-9cc45629ef0 returned error 997
000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: e65bc275-66d1-41ff-8a4e-89ad6643838b depends on 758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online first.
000007d0.0000081::2008/05/04-17:21:06.888 [FM] New owner of Group b072608c-b7f3-48b0-83f8-7c922c14e709 is 2, state 0, curstate 1.

Mapping a Text Name to a GUID

The two methods for mapping a text name to a GUID are

- Automatic mapping
- Reviewing the cluster registry hive

Automatic Mapping

The simplest method of mapping a text name to a GUID is the automatic mapping performed by some versions of the MPS Report tool. However, most versions of the MPS Report tool do not perform this automatic function. For those versions with the automatic mapping feature, you can find the information in the cluster configuration file (<server name>_cluster_mps_information.txt). The following listing shows this mapping:

f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name
f0dd1852-acc8-4921-b33a-a77dd5cdcfee = SQL Server Fulltext (SQL1)
f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC
eff360f3-d987-4a020-8f3c-4118056a50b2 = MSDTC IP Address
e74769f8-67e1-43b2-9bec-93171c31d182 = SQL IP Address 1
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:

Reviewing the Cluster Registry Hive

The second method of mapping a text name to a GUID is more complex and involves opening the cluster registry hive from the MPS Report tool and then reviewing the contents. Follow these steps to open and review the cluster registry hive:

1. Start the Registry Editor (Regedt32.exe).
2. Click the HKEY_LOCAL_MACHINE hive.
3. Click the HKEY_LOCAL_MACHINE root folder.
4. Click Load Hive on the Registry menu.
5. Select the <server name>_cluster_registry.hiv file; then press Ctrl-C.
6. Select Open.
7. Press Ctrl-V to obtain the key name.
8. Expand the cluster hive and review the GUIDs, which are located in the subkeys Groups, Resources, Networks, and NetworkInterfaces, as shown in Figure I 2.

[Figure I 2. Expanded Cluster Hive (in Windows 2000 Server)]

Scroll through the GUIDs until you find the one that matches the GUID from the cluster log. You can also open each key until you find the matching GUID.

Tip: Under each GUID is a TYPE field. This field identifies a resource type such as physical disk, IP address, network name, generic application, generic service, and so forth. You can use this field to find a specific resource type and then map it to the GUID.

Understanding State Codes

MSCS uses state codes to determine the status of a cluster component. The state varies depending on the type of cluster component: node, group, resource, network, or network interface. Some state codes are posted in the cluster log using the numeric code and others using the actual value for the code.
Examples of State Codes in the Cluster Log

The following example entries show state codes for the resource, group, network interface, node, and network types of cluster component:

Resource. In this example, the resource is changing states from online pending (129) to online (2).

00000850.00000888::2008/05/05-17:37:29.125 [FM] FmpHandleResourceTransition: Resource Name = 87e55402-87cb-4354-95e7-6dd864b79039 old state=129 new state=2

Group. In this example, the group state is set to offline (1).

00000898.000008a0::2008/05/05-06:25:55.062 [FM] Setting group 1951e272-6271-4ea3-b0f9-cd767537f245 owner to node 2, state 1

Network interface. This example provides the actual value of the state code, not the numeric code.

00000898.00000598::2008/05/05-06:28:40.921 [ClMsg] Received interface unreachable event for node 2 network 1

Node. This example provides the actual value of the state code, not the numeric code.

00000898.0000060c::2008/05/05-06:28:45.953 [EP] Node down event received
00000898.000008a8::2008/05/05-06:28:45.953 [Gum] Nodes down: 0002. Locker=1, Locking=1

Network. This example provides the actual value of the state code, not the numeric code.

00000898.000008a4::2008/05/05-06:25:53.703 [NM] Processing local interface up event for network 0433c4e2-a577-4325-9ebd-a9d3d2b9b81f.
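Numeric state codes, such as those in the resource transition example above, can be translated mechanically with a small lookup table. The following is a sketch in Python, assuming the resource state codes listed in Table I 5; the helper name and regular expression are illustrative, not part of MSCS.

```python
import re

# Sketch: decode the numeric state codes in FmpHandleResourceTransition
# entries using the resource state codes from Table I 5 (Windows 2000
# Resource Kit values).

RESOURCE_STATES = {
    -1: "ClusterResourceStateUnknown",
    0: "ClusterResourceInherited",
    1: "ClusterResourceInitializing",
    2: "ClusterResourceOnline",
    3: "ClusterResourceOffline",
    4: "ClusterResourceFailed",
    128: "ClusterResourcePending",
    129: "ClusterResourceOnlinePending",
    130: "ClusterResourceOfflinePending",
}

def decode_transition(log_line):
    """Return (old, new) state names from an 'old state=X ... new state=Y' entry."""
    m = re.search(r"old state\s*=\s*(-?\d+).*?new state\s*=\s*(-?\d+)", log_line)
    if not m:
        return None
    old, new = (int(g) for g in m.groups())
    # Fall back to the raw number for codes not in the table.
    return (RESOURCE_STATES.get(old, str(old)),
            RESOURCE_STATES.get(new, str(new)))

print(decode_transition(
    "[FM] FmpHandleResourceTransition: old state=129 new state=2"))
# -> ('ClusterResourceOnlinePending', 'ClusterResourceOnline')
```

The same pattern applies to the node, group, network, and network interface tables that follow; only the lookup dictionary changes.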
State Codes

Table I 3 lists the state codes from the Windows 2000 Resource Kit for nodes.

Table I 3. Node State Codes

    State Code   State
    -1           ClusterNodeStateUnknown
     0           ClusterNodeUp
     1           ClusterNodeDown
     2           ClusterNodePaused
     3           ClusterNodeJoining

Table I 4 lists the state codes from the Windows 2000 Resource Kit for groups.

Table I 4. Group State Codes

    State Code   State
    -1           ClusterGroupStateUnknown
     0           ClusterGroupOnline
     1           ClusterGroupOffline
     2           ClusterGroupFailed
     3           ClusterGroupPartialOnline

Table I 5 lists the state codes from the Windows 2000 Resource Kit for resources.

Table I 5. Resource State Codes

    State Code   State
    -1           ClusterResourceStateUnknown
     0           ClusterResourceInherited
     1           ClusterResourceInitializing
     2           ClusterResourceOnline
     3           ClusterResourceOffline
     4           ClusterResourceFailed
    128          ClusterResourcePending
Table I 5. Resource State Codes (continued)

    State Code   State
    129          ClusterResourceOnlinePending
    130          ClusterResourceOfflinePending

Table I 6 lists the state codes from the Windows 2000 Resource Kit for network interfaces.

Table I 6. Network Interface State Codes

    State Code   State
    -1           ClusterNetInterfaceStateUnknown
     0           ClusterNetInterfaceUnavailable
     1           ClusterNetInterfaceFailed
     2           ClusterNetInterfaceUnreachable
     3           ClusterNetInterfaceUp

Table I 7 lists the state codes from the Windows 2000 Resource Kit for networks.

Table I 7. Network State Codes

    State Code   State
    -1           ClusterNetworkStateUnknown
     0           ClusterNetworkUnavailable
     1           ClusterNetworkDown
     2           ClusterNetworkPartitioned
     3           ClusterNetworkUp

Understanding Persistent State

Persistent state is not a state code, but rather a key in the cluster registry hive for groups and resources. The persistent state key reflects the current state of a resource or group. This key is not a permanent value; it changes value when a group or resource changes states.
You can change the value of the persistent state key, which can be useful for troubleshooting or managing the cluster. For example, you can change the value before a manual failover or shutdown to prevent a particular group or resource from starting automatically.

The value for the persistent state can be 0 (disabled or offline) or 1 (enabled or online). The default value is 1. If the value for persistent state is 0, the group or resource remains in an offline state until it is manually brought online.

The following is an example cluster log reference to persistent state:

000008bc.00000908::2008/05/12-23:45:36.687 [FM] FmpPropagateGroupState: Group 1951e272-6271-4ea3-b0f9-cd767537f245 state = 3, persistent state = 1

For more information about persistent state, view Microsoft Knowledge Base article 259243, How to Set the Startup Value for a Resource on a Clustered Server, at this URL:

http://support.microsoft.com/default.aspx?scid=kb;en-us;259243

Understanding Error and Status Codes

You can easily interpret error and status codes that occur in cluster log entries by issuing the following command from the command line:

Net Helpmsg <error number>

This command returns a line of explanatory text that corresponds to the number.

Examples

For the error code value of 5, as shown in the following example, the Net Helpmsg command returns Access is denied.

00000898.000008f0::2008/30-16:03:31.979 [DM] DmpCheckpointTimerCb - Failed to reset log, error=5

For the status code value of 997, as shown in the following example, the Net Helpmsg command returns Overlapped I/O operation is in progress. This status code is also known as I/O pending.

00000898.00000a8c::2008/05/05-06:38:14.187 [FM] FmpOnlineResource: Returning Resource 87e55402-87cb-4354-95e7-6dd864b79039, state 129, status 997

For the status code value of 170, as shown in the following example, the Net Helpmsg command returns The requested resource is in use.
000009a4.000009c4::2008/05/15-07:28:42.303 Physical Disk <Disk J:>: [DiskArb] CompletionRoutine, status 170
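On a cluster node you would simply run Net Helpmsg <error number> as described above. When reading MPS Report files away from a Windows system, a small lookup table can approximate the same translations; the following hedged sketch holds only the codes quoted in this section (the descriptions match the Net Helpmsg output quoted above) and can be extended as needed.

```python
# Hedged sketch: an off-box stand-in for "Net Helpmsg <error number>".
# The table holds only the Win32 error/status codes cited in this
# section; extend it with further codes as you encounter them.

WIN32_CODES = {
    5: "Access is denied.",
    170: "The requested resource is in use.",
    997: "Overlapped I/O operation is in progress.",
}

def helpmsg(code):
    """Return the explanatory text for a known Win32 error/status code."""
    if code in WIN32_CODES:
        return WIN32_CODES[code]
    return "Unknown code %d; run Net Helpmsg %d on a Windows system" % (code, code)

print(helpmsg(997))
# -> Overlapped I/O operation is in progress.
```

This keeps the common case (codes 5, 170, and 997 in the examples above) readable without access to a Windows command line.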
Index A accessing an image, 3-1 analyzing intelligent fabric switch logs, A-16 RA log collection files, A-8 server (host) logs, A-16 B bandwidth, verifying, D-7 bin directory, A-14 C clearing the system event log (SEL), B-1 CLI file, A-10 clock synchronization, verifying, D-8 cluster failure, recovering, 4-19 cluster log cluster registry hive, I-9 definition, I-1 error and status codes, I-14 GUID format, I-8 GUIDs, I-8 layout, I-4 mapping GUID to text name, I-9 name and location, I-1 opening, I-8 overview, 2-9 persistent state, I-13 state codes, I-10, I-12 cluster registry hive, I-9 cluster service modules, I-4 cluster settings system environment variables, I-2 cluster setup, checking, 4-1 collecting host logs using host information collector (HIC) utility, A-7 using MPS utility, A-6 collecting RA logs, A-1, A-3 Collector (See Unisys SafeGuard 30m Collector) collector directory, A-11 configuration settings, saving, D-2 configuring the replacement RA, D-6 connecting, accessing the replacement RA, D-4 connectivity testing tool messages, C-8 converting local time to GMT or UTC, A-3 D data consistency group bringing online, 3-2, 4-9 manual failover, 3-1, 4-8 recovery tasks, 3-1, 4-7 taking offline, 4-7, 5-8 data flow, overview, 2-2 detaching the failed RA, D-3 determining when the failure occurred, A-2 diagnostics Installation Manager, C-1 RA hardware, B-2 directory bin, A-14 collector, A-11 etc, A-11 files, A-11 home, A-11, A-14 host log extraction, A-15 InfoCollect, A-12 processes, A-12 rreasons, A-11 sbin, A-12 tmp, A-14 usr, A-13 E e-mail notifications 6872 5688 006 Index 1
Index configuring a diagnostic e-mail notification, 2-8 overview, 2-8 enabling PCI-X slot functionality, D-5 environment settings, restoring, D-2 etc directory, A-11 event log, E-1 displaying, E-3 event levels, E-2 event scope, E-2 event topics, E-1 list of Detailed events, E-22 list of Normal events, E-5 overview, 2-6 using for troubleshooting, E-3 events event log, E-1 understanding, E-1 events that cause journal distribution, 2-10 group initialization effects on move-group operation, 4-3 H HIC (See host information collector (HIC) utility) high load disk manager reports, 10-4 general description, 10-3 home directory, A-11, A-14 host information collector (HIC) utility overview, 2-8 using, A-7 host logs collection using host information collector (HIC) utility, A-7 using MPS utility, A-6 F Fabric Splitter, 2-4 Fibre Channel diagnostics detecting Fibre Channel LUNs, C-13 detecting Fibre Channel Scsi3 Reserved LUNs, C-15 detecting Fibre Channel targets, C-12 performing I/O to LUN, C-15 running SAN diagnostics, C-9 viewing Fibre Channel details, C-11 Fibre Channel HBA LEDs location, 8-12 files directory, A-11 full-sweep initialization, 4-4 G geographic clustered environment basic configuration diagram, 2-1 definition, 2-1 overview, 2-1 recovery from total failure of one site, 4-19 geographic replication environment, 2-1 definition, 2-1 server failure, 9-21 total storage loss, 5-13 GMT converting local time to, A-3 example of local time conversion, A-3 I InfoCollect directory, A-12 initialization from marking mode, 4-5 full sweep, 4-4 long resynchronization, 4-4 initiate_failover command, 4-6 Installation Manager diagnostics, 2-8 Diagnostics menu, 8-18, 8-22, C-2 steps to run, C-2 Installation Manager diagnostics collect system info, C-18 Fibre Channel diagnostics, C-9 IP diagnostics, C-2 synchronization diagnostics, C-17 installing and configuring the replacement RA, D-4 IP diagnostics port diagnostics, C-5 site connectivity tests, C-3 system connectivity, C-6, C-7 
test throughput, C-4 view IP details, C-3 view routing table, C-4 K kutils Index 2 6872 5688 006
Index L command summary, H-2 overview, 2-9 path designations, H-1 string, H-1 using, H-1 Local Replication by CDP, 2-5 log extraction directory host, A-15 RA, A-9 log file, A-10 long resynchronization, 4-4 M management console locked user, 8-4 RA attached to cluster, 8-4 understanding access, 8-4 manual failover data consistency group, 3-1, 4-8 performing, 4-7 performing with data consistency group (older image), 4-8 quorum consistency groups, 4-14, 4-22 manual failover of volumes and data consistency groups accessing an image, 3-1 marking mode, initializing from, 4-5 MIB OID Unisys, F-1 RA file, F-3 MIB II, F-1 Microsoft Cluster Service, 2-1 modifying the Preferred RA setting, D-3 move group operation, initialization effects, 4-3 MPS utility, A-6 MSCS (See Microsoft Cluster Service) MSCS properties, checking, 4-1 N network bindings checking, 4-2 cluster specific, 4-3 host network specific, 4-2 network LEDs location, 8-11 networking problem cluster node public NIC failure (geographic clustered environment), 7-3 management network failure (geographic clustered environment), 7-11 port information, 7-33 private cluster network failure (geographic clustered environment), 7-23 public or client WAN failure (geographic clustered environment), 7-6 replication network failure (geographic clustered environment), 7-15 temporary WAN failures, 7-22 total communication failure (geographic clustered environment), 7-27 new for this release, 1-2 P parameters file, A-9 performance problem failover time lengthens, 10-5 high load disk manager, 10-4 distributer, 10-4 slow initialization, 10-2 persistent state key, I-13 port information, 7-33 processes directory, A-12 Q quorum consistency group manual failover, 4-14, 4-22 R RA problem all RAs at one site fail, 8-26 all RAs not attached, 8-28 all SAN Fibre Channel HBAs fail, 8-14 onboard management network adapter fails, 8-24 onboard WAN network adapter fails, 8-20 optional Gigabit Fibre Channel WAN network adapter fails, 8-20 reboot 
regulation failover, 8-12 6872 5688 006 Index 3
single hard disk fails, 8-25 single RA failure, 8-4 single RA failures with switchover, 8-5 single RA failures without switchover, 8-22 single SAN Fibre Channel HBA on one RA fails, 8-22 rear panel indicators, 8-11 recording group properties and saving settings, D-2 recovery all RAs fail on site, 4-11 from site failure, 4-19 from total failure of one site in geographic clustered environment, 4-19 site 1 failure with quorum owner located on site 2, 4-24 site 1 failure with quorum resource owned by site 1, 4-19 using older image, 4-7 recovery tasks data consistency group, 3-1, 4-7 reformatting the repository volume, 5-8 removing Fibre Channel host bus adapters, D-4 replacing an RA, D-1 replication appliance (RA) connecting, accessing, D-4 diagnostics, B-2 LCD status messages, B-4 replacing, D-1 replication appliance (RA) analyzing logs from, A-8 collecting logs from, A-1 replication, reversing direction, 4-10, 4-15 repository volume not accessible, 5-6 reformatting, 5-8 restoring environment settings, D-2 restoring failover settings, 4-23 restoring group properties, D-8 resynchronization, long, 4-4 rreasons directory, A-11 runcli file, A-14 S SafeGuard 30m Control behavior during move group, 4-5 SAN connectivity problem RAs not accessible to splitter, 6-11 total SAN switch failure (geographic clustered environment), 6-14 volume not accessible to RAs, 6-2 volume not accessible to splitter, 6-6 saving configuration settings, D-2 sbin directory, A-12 server problem cluster node failure (geographic clustered environment), 9-2 infrastructure (NTP) server fails, 9-19 server crash or restart, 9-12 server failure (geographic replication environment), 9-21 server HBA fails, 9-18 server unable to connect with SAN, 9-14 unexpected server shutdown because of a bug check, 9-8 Windows server reboot, 9-3 SNMP traps configuring and using, F-1 MIB, F-1 resolving issues, F-4 variables and values, F-2 SSH client, using, C-1 state codes, I-10, I-12 storage problem journal volume
not accessible, 5-10 repository volume not accessible, 5-5 storage failure on one site (geographic clustered environment), 5-15 total storage loss (geographic replicated environment), 5-13 user or replication volume not accessible, 5-3 storage-to-ra access, checking, D-5 summary file, A-11 system event log (SEL), clearing, B-1 system status using CLI commands, 2-7 using the management console, 2-7 T tar file, A-15 testing FTP connectivity, A-2 tmp directory, A-14 troubleshooting general procedures, 2-11 recovering from site failure, 4-19 Index 4 6872 5688 006
Index U Unisys SafeGuard 30m Collector, G-1 Collector mode, G-3 adding customer information, G-5 adding scripts, G-11 automatic discovery of RAs, G-4 compressing an SSC file, G-6 configuring component types using built-ins scripts, G-8 configuring RAs, G-4 configuring SAN switches, G-9 deleting a scheduled SSC file, G-14, G-15, G-19 deleting scripts, G-11 disabling scripts, G-10 duplicating installation on another PC, G-6 enabling scripts, G-9 opening an SSC file, G-8 querying a scheduled SSC file, G-13 running all scripts, G-6 running scripts, G-10 scheduling an SSC file, G-13 stopping script execution, G-11 installing, G-1 prior to configuring, G-2 security breach warning, G-3 View mode, G-20 Unisys SafeGuard 30m solution definition, 2-1 unmounting volumes at production site, 3-3 at remote site, 3-3 user types, preconfigured for RAs, 2-7 using the SSH client, C-1 using this guide, 1-3 usr directory, A-13 UTC converting local time to, A-3 example of local time conversion, A-3 V verify_failover command, 4-6 verifying clock synchronization, D-8 verifying the replacement RA installation, D-7 W WAN bandwidth, verifying, D-7 webdownload/webdownload, 2-7, C-20 6872 5688 006 Index 5
© 2009 Unisys Corporation. All rights reserved.

6872 5688 006