Module 7: Server Cluster Maintenance and Troubleshooting




Module 7: Server Cluster Maintenance and Troubleshooting

Contents
Overview
Cluster Maintenance
Troubleshooting Cluster Service
Lab A: Cluster Maintenance
Review

Information in this document is subject to change without notice. The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted. Complying with all applicable copyright laws is the responsibility of the user. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Microsoft Corporation. If, however, your only means of access is electronic, permission to print one copy is hereby granted. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. 2000 Microsoft Corporation. All rights reserved. Microsoft, MS-DOS, MS, Windows, and Windows NT <plus other appropriate product names or titles> are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other countries. <This is where mention of specific, contractually obligated to, third party trademarks should be listed.> The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted. Other product and company names mentioned herein may be the trademarks of their respective owners.

Overview
To provide an overview of the module topics and objectives.
! Cluster Maintenance
! Troubleshooting Cluster Service

Server cluster maintenance and troubleshooting are considered two separate disciplines. Maintenance is continuous, whereas troubleshooting has a beginning, when the problem is discovered, and an end, when the problem is resolved. The two disciplines are complementary, however. If every troubleshooting procedure that you follow fails, you will need to rebuild the cluster from a backup tape generated during a maintenance procedure.

After completing this module, you will be able to:
! Perform the steps to successfully back up a server cluster.
! Perform the steps to successfully restore a server cluster.
! Evict a node from a server cluster.
! Identify the tools that are necessary to troubleshoot a cluster failure.
! Interpret the entries in the quorum log.
! Identify and troubleshoot common server cluster failures: network communications, SCSI configuration problems, and group, resource, and quorum failures.

# Cluster Maintenance
To introduce the topics covered in this section.
! Backup
! Restoring the First Node
! Restoring the Shared Disks
! Restoring the Second Node
! Leaving a Cluster

Cluster service uses the self-tuning features of Microsoft Windows 2000 and requires very little maintenance. The only day-to-day maintenance operation you need to perform is to back up the cluster. Under special circumstances, a node in the cluster may need to be replaced, for example, for a hardware upgrade. In this situation, you need to evict the node from the cluster and then add the upgraded node to the cluster.

Backup
To describe how to back up the system state, node, and shared disks.
! Backing Up the System State
! Backing Up the Node
! Backing Up the Shared Disk

Backing up the cluster nodes is no different from backing up installations of Microsoft Windows 2000 Advanced Server. It is recommended that you perform regular backups by using the Windows 2000 Backup program (ntbackup), or another compatible backup program.

Note A compatible backup program would be able to perform the same backup operations as ntbackup, especially with regard to backing up the System State and the cluster configuration database.

Backing Up the System State
The cluster registry on each node contains the configuration information for the cluster. The Backup tool that is included with Windows 2000 backs up the cluster database on the quorum disk when you back up the system state. The node's local cluster registry hive, Clusdb, is not backed up. Using ntbackup, back up the system state on each node. The system state includes:
! The cluster database on the quorum disk.
! The quorum log on the quorum disk.
! The local registry.
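In addition to the wizard, ntbackup can be driven from the command line, which is convenient for scheduling the system state backup on each node. A minimal sketch follows; the job name and target path are example values only.

rem Back up this node's system state to a local .bkf file.
rem Job name and file path below are placeholders; adjust to your environment.
ntbackup backup systemstate /j "Node system state" /f "C:\Backups\systemstate.bkf"

The resulting .bkf file can later be cataloged and restored through the Restore Wizard in the Backup program.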

Backing Up the Node
Follow standard computer backup procedures to back up the operating system and the data on the local drives.

Note Backup is essential, but regular testing to make sure that backups and restores actually work as expected is also essential. A good practice is to schedule test backup and restore operations frequently.

Backing Up the Shared Disks
It is critical to back up data on the shared disks, because cluster-aware applications will almost certainly be placing data on these disks. Because either node of the cluster could own the shared disk resource at any time, it is possible for each node to back up the data on the drive. However, having each node back up the data would require you to install backup hardware and software on each cluster node, which is not the best solution. One possibility is to identify a non-clustered server running Windows 2000 Server and schedule it to back up this data remotely through a network connection to the drive's administrative share or to a hidden share that you create. For example, you might create FBackup$, GBackup$, and HBackup$ file share resources on the virtual server for the root of drives F, G, and H. These shares would not appear in the browse list and could be configured to allow access only to members of the Backup Operators group.
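As an illustration, such a hidden share could be created as a clustered File Share resource from the command line with Cluster.exe. This is a sketch only: the group name, disk resource name, and property names are assumptions about a typical configuration, and most administrators would create the resource in Cluster Administrator instead.

rem Create a hidden FBackup$ share as a File Share resource in the group that
rem owns drive F, make it depend on the disk, and bring it online.
rem All names are placeholders; verify against your own cluster before use.
cluster res "FBackup$" /create /group:"Disk Group F" /type:"File Share"
cluster res "FBackup$" /priv ShareName="FBackup$"
cluster res "FBackup$" /priv Path="F:\"
cluster res "FBackup$" /adddep:"Disk F:"
cluster res "FBackup$" /online

The trailing $ keeps the share out of the browse list; restricting access to the Backup Operators group would then be done through the share's security settings, which this sketch does not show.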

Restoring the First Node
To list the steps for restoring a server cluster and describe how to restore the first node.
Steps for restoring a server cluster:
1. Restore the first node.
2. Restore the shared disks.
3. Restore the second node.
4. Perform unit testing.

The following sections describe the procedure for restoring a server cluster in the event that both nodes and the shared disk fail. It is also possible for any one of the components in the cluster to fail independently. In the case of a failed component, you follow the same steps to restore that specific component.

Restoring a Server Cluster
Performing a complete restore of a server cluster is a straightforward process.
1. Restore a node of the cluster. Restore the operating system, local files, and system state.
2. Restore the shared disks of the restored first node: restore the disk signature files, the data on the shared disk, and the cluster configuration files on the shared disk.
3. Restore the remaining node of the cluster. Restore the operating system, local files, and system state of the second node of the cluster.
4. Perform unit testing.

Restoring a Node of the Cluster
To restore a node in a server cluster, you follow the same procedure that you would use to restore a Windows 2000 operating system.
1. Install a fresh copy of Windows 2000 Advanced Server on the node to be restored.
2. Log on as Administrator and restore the system and boot partition, system state, and associated volumes from the backup. Make sure that you select the option in the backup program to restore the system state to the original location.
3. Restart the node.
4. Perform the steps for restoring the shared disks. These steps are described in the next section.

Note The difference between the time of the backup and the time of the restoration to the new computer may affect the computer account on the domain controller. You may have to join a workgroup and then rejoin the domain.

Restoring the Shared Disks
To describe how to restore the shared disk by restoring signature files, data, and cluster configuration files.
! Restoring Disk Signature Files
! Restoring the Data on the Shared Disk
! Restoring the Cluster Configuration Files

After you have restored a node in the cluster, you must restore the shared disks. Restoring the shared disks involves restoring the disk signature file that the cluster uses to identify each disk. You may also need to replace a shared disk if you are running out of disk space or if a disk failure is impending. It can be costly to make mistakes while replacing a cluster disk; the consequence can be irrecoverable loss of all of the data on that disk. If the disk is the quorum disk, the server cluster's configuration data is at risk.

Restoring Disk Signature Files
Because Cluster service relies on disk signatures to identify and mount volumes, if a disk is replaced, or if the bus is re-enumerated, Cluster service will not find the disk signatures that it is expecting and will not function. You can extract the disk signature from the registry and write it to the new disk by running Dumpcfg.exe, which writes the signature from the old disk to the new disk. Cluster service will then recognize the new disk and successfully start the resource.

Note At a command prompt, type Dumpcfg.exe /signature disknumber, where signature is the disk signature and disknumber is the number of the disk that you replaced. If the disk that you are replacing is the quorum disk, use Cluster Administrator to move the quorum to a different disk, and then proceed with replacing the disk. After the new disk is brought back online, you can move the quorum back to it.

Restoring the Data on the Shared Disk
Restoring the data on the shared disk is the same as restoring a local disk. Before restoring the data, make sure that you have associated each shared disk with the same drive letter as before the disaster or failure. When restoring, make sure that you restore the data to the original location, and verify the integrity of the data after you have completed the restore.

Restoring the Cluster Configuration Files
The cluster configuration files include the cluster database and the quorum log, both of which are stored on the shared disk. The cluster database is the configuration data (cluster objects and their settings) pertinent to the cluster. This database is the product of the cluster registry key checkpoint and the changes that are recorded in the quorum log. Each node of the cluster maintains a local copy of this database in its local registry, in the cluster hive.

After you have restored the disk signature file and the data, you can start the server cluster. If the cluster files were not restored, or were corrupted, the following procedure restores the cluster database from the registry of the restored node. Identify the node on which you will restore the database (in the case of a disaster restore, this will be the first node that has been restored). Restore the cluster database on the selected node by restoring the system state. Restoring the system state creates a temporary folder under the %Systemroot%\Cluster folder called Cluster_backup. You use NTBackup to restore the cluster configuration files, which places the cluster configuration files on the node but not on the shared disk. You then restore them to the shared disk by using the Clusrest.exe tool that is available in the Windows 2000 Resource Kit. Clusrest.exe restores both the quorum log file (Quorum.log) and the cluster database (Clusdb).

Note You can use the resource kit Clusrest.exe tool, which is a free download from www.microsoft.com.
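Assuming the Resource Kit tool is installed and in the path, the sequence on the restored node looks roughly like the following; Clusrest.exe is typically run without arguments after the system state restore has completed. This is a sketch, not a definitive procedure; verify the tool's options in the Resource Kit documentation.

rem Run on the node whose system state was just restored. Clusrest pushes the
rem restored cluster database and quorum log from the local Cluster_backup
rem folder to the quorum disk.
cd /d %SystemRoot%\Cluster
clusrest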

Restoring the Second Node
To describe how to restore the second or remaining nodes of a cluster and test the failover and failback policies.
! Restoring the Remaining Nodes of a Cluster
! Perform Unit Testing

After you complete the process of restoring a node of the cluster, and Cluster service has started successfully on the newly restored node, you can start the restore process on the other node of the cluster.

Restoring the Remaining Node(s) of the Cluster
Restoring the second node of a cluster is the same procedure as restoring the first node, except that you will not have to restore the shared disks.

Performing Unit Testing
Testing the failover and failback policy is recommended before putting the cluster back into production.
1. Verify that the disk and cluster resources are available on the correct node.
2. Fail over each group and resource to verify that they can successfully start on the other node of the cluster.
3. Test the failback policy of each resource by allowing the resource to fail back to a preferred owner after the node has come back online.
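Failover testing can also be driven from the command line with Cluster.exe. A minimal sketch follows; the group and node names are placeholders for your own cluster.

rem List the groups and see where each one is currently online.
cluster group
rem Move a group to the other node and back to exercise failover and failback.
cluster group "Cluster Group" /moveto:SERVER2
cluster group "Cluster Group" /moveto:SERVER1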

Leaving a Cluster
To describe how to evict a node from a cluster.
! Steps for Leaving a Cluster
1. Back up both nodes.
2. Verify the backup.
3. Move all groups to the remaining node.
4. Stop Cluster service on the node to be removed.
5. Evict the node.
6. Unplug the server from the shared bus.

If you need to change a node of a cluster, for example, to add a more powerful server, you need to logically remove the node before physically removing it from the cluster. When you have configured a new server with the shared bus and the public and private networks, you can then run the Cluster Installation Wizard. To remove a node from a cluster, in Cluster Administrator, right-click the node to access the menu with the Stop Cluster Service and Evict Node options.

To leave a cluster:
1. Back up both nodes.
2. Verify the backup.
3. Move all of the groups to the remaining node.
4. Stop Cluster service on the node that is to be removed.
5. Evict the node.
6. Unplug the server from the shared bus (if the shared bus is a small computer system interface (SCSI) bus, be careful about termination).

Note If a new server is to join the cluster later, run the Cluster Installation Wizard and select Join a Cluster.
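Steps 3 through 5 can also be carried out with Cluster.exe and the net command. A minimal sketch, using placeholder group and node names (NodeB is the node being removed):

rem Move each group off the node that will be removed.
cluster group "Cluster Group" /moveto:NODEA
cluster group "Mygroup" /moveto:NODEA
rem On the node being removed, stop Cluster service.
net stop clussvc
rem Evict the stopped node from the cluster.
cluster node NODEB /evict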

# Troubleshooting Cluster Service
To introduce the topic of troubleshooting as it relates to server clusters.
! Troubleshooting Tools
! Examining the Cluster Log
! Troubleshooting Network Communications
! SCSI Configuration Problems
! Group and Resource Failure
! Quorum Failure

Troubleshooting a problem with Cluster service can be more complex than troubleshooting a single server. Virtual servers change ownership from one node to another, which may cause network connectivity problems. Applications running on the cluster are more difficult to troubleshoot because they run from a virtual server instead of a physical server. You could also have a node-to-node communication problem, because servers usually work independently of each other rather than together. You might experience hardware problems with the shared bus and the shared disk resources. The most common failures are due to improper configurations within groups and resources. Cluster service will fail if the quorum log becomes corrupt, so it is important to know how to repair the quorum log in order to restart the cluster.

You use the same tools to identify problems on the cluster as you would use to identify problems on a physical server. The best resource for troubleshooting is the cluster log, because Cluster service records the activity of each node in the cluster log. This log can help you identify problems on the node or in the cluster.

Troubleshooting Tools
To describe the tools used for troubleshooting Cluster service problems.
! Disk Manager
! Task Manager
! Performance Monitor
! Network Monitor
! Dr. Watson
! Services Snap-in

When troubleshooting Cluster service, you can use the same tools and methodologies that you would use when troubleshooting Windows 2000 Advanced Server. Cluster service writes logging information to the system log of every node in the cluster. Cluster service also writes a more detailed log of cluster activity to the cluster log on each node. Use these two sources to gather information when you begin troubleshooting a problem. You will be able to determine whether the problem is related to the network, to services or applications, or to physical components in the cluster.

Note Use Event Viewer to filter the system log on the event source ClusSvc. You can view general events such as "Microsoft Clustering Service failed to join the cluster on this node" and "Microsoft Clustering Service successfully created a cluster on this node."

After you have determined the type of problem, you can use the following tools to search for the source of the problem. You must check each node individually when using any of these tools.
! Disk Manager. Check Disk Manager to determine the health of the shared disks. You can check whether the disks are recognized by the operating system and whether the shared disks are basic or dynamic. You also need to verify that the drive letters assigned to the shared disks are the same on both nodes.
! Task Manager. You can verify that Cluster service is running in Microsoft Windows 2000 Task Manager. You can also use Task Manager as a performance monitor, although you do not obtain the same level of detail as with Performance Monitor. In Task Manager, you can verify the CPU utilization percentage and the memory resources on the node.

! Performance Monitor. Microsoft Windows 2000 Performance Monitor is the primary tool for finding bottlenecks on servers running Windows 2000. It is recommended that you create a baseline before and after cluster resources are added to the cluster. You also need to create a baseline on each node during failover and failback of resources to check for potential physical resource deficiencies. It is also recommended that you configure a computer to monitor Cluster service on every node of the cluster and to send an e-mail message to an administrator when a node or the cluster is offline.
! Network Monitor. Microsoft Windows 2000 Network Monitor is used to troubleshoot any node-to-node and client-to-node communication. You must configure Network Monitor to capture data on the private network to see node-to-node communication.
! Dr. Watson. Dr. Watson is a user-mode debugging tool. If a clustered application or Cluster Administrator crashes, the debugging information is found in the Dr. Watson log file.
! Services Snap-in. Cluster service runs as a service in Windows 2000. If Cluster service is not running correctly, check the properties of the service through the Services snap-in to ensure that the default properties have not changed. Verify that Cluster service:
  Is set to start automatically.
  Is set to log on as the designated domain service account.
  Is set to restart after a failure.
Also make sure that the following four services have started:
  Network Connections (Network Connections has a Remote Procedure Call (RPC) dependency)
  Remote Procedure Call (RPC)
  Windows Management Instrumentation Driver Extensions
  Windows Time
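A quick way to spot a stopped dependency from a command prompt is to list the running services on each node; a minimal sketch (the display names assume the standard Windows 2000 service names):

rem "net start" lists started services by display name; findstr /c: searches
rem for each literal name. A missing line means the service is not running.
net start | findstr /i /c:"Network Connections"
net start | findstr /i /c:"Remote Procedure Call"
net start | findstr /i /c:"Windows Management Instrumentation Driver Extensions"
net start | findstr /i /c:"Windows Time"
net start | findstr /i /c:"Cluster Service"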

Examining the Cluster Log
To learn how to use the cluster log to troubleshoot Cluster service problems.
[Slide: a sample cluster log opened in WordPad, with callouts identifying the process and thread IDs, the timestamp, and the event description that make up each entry.]

The cluster log is a diagnostic log that is a more complete record of cluster activity than the Microsoft Windows 2000 event log. The cluster log records the Cluster service activity (Clussvc.exe and associated processes) that leads up to the events recorded in the event log. Although the event log can point you to a problem, the cluster log helps you determine the source of the problem. So, for diagnosis, check the event log for general information and the cluster log for specific details about the cluster status. If you see a problem in the event log, note the timestamp and go to approximately the same timestamp in the cluster log. The cluster log is enabled by default in Windows 2000. Cluster log output is printed to a .log file in %SystemRoot%\Cluster.

Setting the Logging Level
Four logging levels are possible in the cluster log. The default level is 2, which logs enough information for normal troubleshooting. To set a different logging level, open Control Panel\System and, on the Advanced tab, create a system environment variable called ClusterLogLevel with a value of 0, 1, 2, or 3, where 0 = no logging, 1 = errors only, 2 = errors and warnings, and 3 = everything that happens.
Setting the Log File Size
The log file defaults to a maximum size of eight megabytes (MB). When the log file reaches eight MB, it starts overwriting the data in the log file. To specify a larger file size, add the registry entry ClusterLogSize under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters. ClusterLogSize has a type of DWORD, and it specifies the maximum size in MB for the log file. If this value is set to 0, logging is disabled.
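As an illustration, the log size value might be created as follows, assuming a reg.exe with the add /v /t /d syntax is available (Windows 2000 Support Tools or a later operating system); otherwise create the value in Regedit. The 16 MB size is an example only, and ClusterLogLevel remains an environment variable set through Control Panel\System, not a registry value under ClusSvc.

rem Raise the cluster log size limit to 16 MB (DWORD, size in MB).
reg add HKLM\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters /v ClusterLogSize /t REG_DWORD /d 16
rem A restart of Cluster service is typically needed for the change to take effect.
net stop clussvc
net start clussvc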

Cluster Log Entries
There are two types of cluster log entries: component event log entries and resource dynamic-link library (DLL) log entries. Cluster service is made up of a number of components, such as the database manager and the global update manager. The cluster log records the interactions of these components, making it a powerful diagnostic tool. Because resource groups are the basic unit of failover, resource DLL entries are essential to understanding cluster activity.

The first line in the body of a typical cluster log is:
378.32c::1999/06/09-18:00:18.874 Cluster service started - Cluster Node Version 3.2051
The main elements of this line are common to every line of the log:
! The IDs of the process and thread issuing the log entry. These two IDs are concatenated, separated by a period. In the previous example, the process ID is 378, and the thread ID is 32c.
! Timestamp. The timestamp is recorded in the following format, in Greenwich Mean Time (GMT): yyyy/mm/dd-hh:mm:ss.sss
! Event description. One example of an event description is "Cluster service started."

Component Event Log Entries
In the following example, [NM] indicates the component that wrote the event to the cluster log; in this case, NM stands for node manager.
378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.

Resource DLL Log Entries
The following example is a cluster log entry for a resource DLL event. This example is one of the entries from the disk arbitration process.
15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999).
Instead of listing an abbreviated component name between the timestamp and the event description, as component log entries do, entries describing resource DLL events list the following information:
! Resource type (Physical Disk)
! Resource name (<Disk D:>)
The event description in this example is [DISKARB] Arbitration Parameters (1 9999).

Troubleshooting Network Communications
To describe how to troubleshoot node-to-node and client-to-node communication.
! Troubleshooting Node-to-Node Communication
  $ Verify RPC Communication
  $ Verify Cluster Heartbeats
! Troubleshooting Client-to-Node Communications
  $ Check NetBT Cache with Nbtstat
  $ Ping IP Address
  $ WINS and Static Mappings

There are two types of cluster network communications that can fail: the clients may be unable to access the cluster, or the nodes may be unable to communicate with each other. When client communications are interrupted, there is a problem with the public network. When the nodes are unable to communicate, there is a problem with either the public or the private network. Troubleshooting these two types of network-related problems requires different approaches.

Troubleshooting Node-to-Node Communication
You can use Windows 2000 Network Monitor before installing Cluster service to capture a trace of the ping between the nodes on the public and private networks. After Cluster service is installed, you use Network Monitor to verify remote procedure call (RPC) communication and cluster heartbeats.

Note You can also use RPC Ping, an RPC connectivity verification tool that is a free download from www.microsoft.com. This tool verifies that Windows 2000 Server services are responding to remote procedure call requests between nodes.

Verifying RPC Communication
To verify that RPC communication is occurring between the nodes of a cluster, use a network capture utility, such as Microsoft Network Monitor. Windows 2000 Server includes a simple version of Network Monitor that you can install by using the Network program in Control Panel. To verify RPC communication, configure the capture utility to capture all of the traffic between the nodes of the cluster. After you have started a capture, using Cluster Administrator to create a group or resource will generate RPC traffic between the nodes.

Verifying Cluster Heartbeats
As with RPC communication, to verify that cluster heartbeats are occurring between the nodes of a cluster, you must use a network capture utility. To verify cluster heartbeats, make sure that Cluster service is started on both nodes. Use Network Monitor to capture User Datagram Protocol (UDP) port 3343 frames; this is the port number that Cluster service uses to send heartbeats on the network.

Troubleshooting Client-to-Node Communications
After a failover occurs, clients must still be able to gain access to the cluster, even though they will be accessing a different node. For this to work, the clients must be able to resolve the cluster network names so that they always connect to the node on which the resources are online. If clients cannot connect to virtual servers, verify that:
! The client is accessing the cluster by using the correct network name or IP address.
! The client has the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol correctly installed and configured.

Check the NetBT Cache with Nbtstat
Depending on the resource being accessed, the client can address the cluster by specifying either the resource network name or the IP address. In the case of the network name, you can verify proper name resolution by checking the NetBT cache (using the Nbtstat.exe utility) to determine whether the name has been resolved. Also, confirm proper Windows Internet Name Service (WINS) configuration, both at the client and at the cluster nodes.

Ping the IP Address
If the client is accessing the resource through a specific IP address, use the ping utility from a command prompt to ping the IP address of the cluster resource and of the cluster nodes.

WINS Static Mappings
You should not create static network name to IP address mappings for any cluster names in a WINS database. WINS is the only name resolution method that causes problems when static mappings are used, because WINS static mappings use the media access control (MAC) address of the network adapter as part of the static mapping. If clients are having a problem connecting to a virtual server, an administrator might have created a WINS static mapping for the virtual server. The node for which the mapping was created will be able to bring the network name resource online, and clients will be able to connect. However, if failover occurs, the second node in the cluster will be able to bring the IP address online but not the network name. When the second node attempts to bring the network name online, WINS will return an error that prevents it from registering the network name. WINS prevents the network name from coming online because the second node does not have the same physical address as the one recorded in the static mapping for the network name.
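For example, from an affected client the name-resolution and connectivity checks might look like the following; the virtual server name and IP addresses are placeholders.

rem Show the NetBT name cache to see whether the virtual server name has
rem already been resolved, and to which IP address.
nbtstat -c
rem Test basic IP connectivity to the virtual server and to each cluster node.
ping 10.1.1.50
ping 10.1.1.1
ping 10.1.1.2
rem If the name itself is suspect, ping it by name and recheck the cache.
ping FILESRV1
nbtstat -c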

SCSI Configuration Problems
To explain how to troubleshoot SCSI configuration problems.
! SCSI Controllers
! SCSI Termination
! SCSI Cabling

If you suffer hardware failures, you may have to replace hardware components of the cluster. If you replace components in the SCSI subsystem, you need to make sure that the new SCSI configuration conforms to the following guidelines.

SCSI Controllers
SCSI IDs. Each device on the shared SCSI bus must have a unique SCSI ID. Most SCSI controllers default to SCSI ID 7. Therefore, the SCSI ID for one of the controllers on the shared SCSI bus must be changed to something other than ID 7.
Boot-Time SCSI Bus Reset. Cluster service uses SCSI bus resets, but in a controlled way during a membership regroup operation. Some SCSI controllers reset the SCSI bus when they initialize at start time, before Windows 2000 is loaded. If a SCSI controller resets the SCSI bus, the bus reset can interrupt any data transfers between the other node and the drives on the shared SCSI bus. Therefore, you should disable automatic SCSI bus resets, if possible, by using the adapter configuration program accessible at PC start time.
Non-Compliant Controllers. It is important to verify that the SCSI controllers that are being used are on the Cluster service Hardware Compatibility List (HCL). For a SCSI controller to work with Cluster service, it must support the SCSI reserve and release commands and bus resets.

SCSI Termination
Active or Forced-Perfect Termination. There are three types of termination that are used for terminating the SCSI bus: passive termination, active termination, and forced perfect termination. Because both active and forced perfect termination use electronics to provide termination, these types provide the best termination. You should not use passive termination in a cluster, because it can result in problems such as unnecessary failover or inability to access the quorum disk.
On-Card Termination. Many SCSI controllers provide on-card termination; however, on-card termination does not provide termination when the computer is not turned on. On-card termination only becomes an issue when external terminators are not used. When using external terminators, the on-card termination should be disabled.

SCSI Cabling
Tri-Link or Y-Cable SCSI Connectors. Attaching Y-cables or tri-link connectors to the back of the SCSI controllers at each end of the bus is one method that you can use to allow the SCSI bus to remain terminated even when one node is turned off. These components allow you to use external terminators that will continue to provide termination if a node is turned off. You must ensure that the SCSI cards in the nodes are not providing termination when using these connectors.
Long Cables. It is very common to have multiple external SCSI drives on the shared SCSI bus. When configuring multiple external drives, it is very important not to exceed the maximum combined cable length that the controller manufacturer recommends. The SCSI specifications define the maximum combined cable length for different types of cabling. If the manufacturer of the controller recommends a shorter distance, be sure to follow the manufacturer's recommendation.

Group and Resource Failures
To describe how to troubleshoot group and resource failures.
[Slide: Cluster Administrator connected to MYCLUSTER, showing the resources Cluster IP Address, Cluster Name, Disk W:, and Printer Spooler online on SERVER2, and a file share resource named Public in a Failed state.]

If groups or resources are not available to clients, you need to determine whether it is a restart, failover, or failback problem. In Cluster Administrator, you will see a visual notification that a group, or a resource in a group, is offline. Because there are a variety of reasons for a failure, you will have to troubleshoot the cause to find out whether it is a resource or a group failure.

Problem: A resource fails but is not brought back online.
Possible resolution: In the Policies dialog box for the resource properties, verify that Don't restart is cleared (not selected). Verify that the resource dependencies are correctly configured, and verify that any dependent resources are online.

Problem: The default quorum resource will not come online.
Possible resolution: Verify that there are no hardware errors by using Event Viewer and looking for disk input/output (I/O) error messages.

Problem: Cannot bring a group online.
Possible resolution: Verify that there are no hardware or configuration problems with any disk resources for the group. Verify that the resource dependencies are correctly configured. Move the group to the other node and attempt to bring the group online. If this works, verify that the first node can gain access to everything that is necessary to bring the group's resources online (for instance, the disk resource).

Problem: A group cannot be moved or failed over to the other node.
Possible resolution: Verify that the resource is properly installed on the node. Verify that the other node is set as a possible owner for all resources in the group in the Properties dialog box for each resource.

Problem: A group failed over but did not fail back.
Possible resolution: Verify that the failback policies for the group are properly configured. In the Properties dialog box for the group, verify that Prevent failback is cleared. If Failback immediately is selected, be sure to wait long enough for the group to fail back. Check these settings for all of the resources within the group; because groups fail over as a whole, one resource that is prevented from failing back will affect the entire group. Ensure that the node to which you want the group to fail back is configured as the preferred owner of the group. If it is not, Cluster service will leave the group on the node to which it failed over.

Problem: The entire group failed and has not restarted.
Possible resolution: If the node on which the group had been running is offline, verify that the other node is a possible owner of the group and of all the resources in the group. Ensure that the group has not exceeded its failover threshold or its failover period. Bring the resources online one at a time to determine which resource is causing the problem. Create a temporary group (for testing purposes), and then move the resources to it one at a time, bringing each resource online after moving it.
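Bringing resources online one at a time, as the last resolution suggests, can be done in Cluster Administrator or with Cluster.exe. A minimal sketch with placeholder group and resource names:

rem Take the suspect group offline, then bring its resources online one by one
rem until the failing resource is identified.
cluster group "Mygroup" /offline
cluster res "Disk W:" /online
cluster res "Mygroup Network Name" /online
cluster res "Public" /online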

Quorum Failure
To describe how to recover from a quorum failure by resetting the quorum log.
! Reset the Quorum Log

If, after running the Clusrest.exe tool, the cluster still will not start because of a corrupted quorum log, you can reset the quorum log. Cluster service maintains details about changes within the cluster in the quorum log file. If this file becomes corrupted for any reason, Cluster service may not start. The following error message may occur when you attempt to start Cluster service on a node of the server cluster:
Event ID: 1147 Source: ClusSvc
If you do not have a backup of the quorum log file, perform the following steps:
1. Open a command prompt.
2. Change to the %Systemroot%\Cluster folder.
3. Start Cluster service by typing clussvc -debug -resetquorumlog, which attempts to create a new quorum log file based on the cluster configuration information in the local system's cluster registry hive.
4. Stop Cluster service by pressing CTRL+C.
5. Restart Cluster service by typing net start clussvc.
6. Close the command prompt.
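Put together, the recovery session at the command prompt looks roughly like this (a sketch of the steps above; CTRL+C is pressed interactively once the debug instance has rebuilt the quorum log):

cd /d %SystemRoot%\Cluster
rem Run Cluster service in debug mode and rebuild the quorum log from the
rem local cluster registry hive, then stop it with CTRL+C.
clussvc -debug -resetquorumlog
rem Restart Cluster service normally.
net start clussvc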

Lab A: Cluster Maintenance
To introduce the lab.
In this lab, you will back up and restore the cluster configuration files, evict a node from the cluster, and uninstall Cluster service.

Objectives
After completing this lab, you will be able to:
! Back up cluster configuration files.
! Restore cluster configuration files.
! Evict a node from the cluster.
! Uninstall Cluster service.

Prerequisites
Before working on this lab, you must be familiar with the concepts in Module 7, Server Cluster Maintenance and Troubleshooting. You must also have a server cluster installed and running on both nodes.

Lab Setup
To complete this lab, you need the following:
! Two computers running Microsoft Windows 2000 Advanced Server, each with a small computer system interface (SCSI) adapter installed. The SCSI adapter in the first computer is configured with SCSI ID 6. The SCSI adapter in the second computer is configured with SCSI ID 7.
! A shared external SCSI disk. Each computer is connected to the shared disk with SCSI cables.
! Two network adapters in each node of the cluster.

Scenario
In this exercise, you will back up a node's system state, which includes the cluster configuration files. After the backup is complete, you will restore the system state and verify that the cluster configuration files were restored to the node. At this point, to restore the cluster, you would run the Clusrest.exe utility; for the purposes of this lab, however, you will not restore the cluster. You will then evict a node from the cluster and uninstall Cluster service on both nodes. The following exercises refer to your computers as NodeA and NodeB. For this lab, you will perform all of the tasks on both NodeA and NodeB, with the exception of evicting a node, which you will perform on NodeB.

Estimated time to complete this lab: 45 minutes

Exercise 1: Backup and Restore
In this exercise, you will learn how NTBackup is used to back up and restore the cluster.

! Complete this lab from NodeA and NodeB
1. Click Start, point to Programs, point to Accessories, point to System Tools, and then click Backup.
2. In the Backup dialog box, click Backup Wizard.
3. In the Backup Wizard dialog box, click Next.
4. Select Only back up the System State data, and then click Next.
5. In the Backup media or file name box, type c:\backup.bkf and then click Next.
6. Click Finish to start the backup.
7. NTBackup will start backing up the system state, which will take a couple of minutes.
8. When the backup is complete, click Close.
9. In the Backup dialog box, click Restore Wizard.
10. Click Next.
11. Click Import File to locate the backup file of the system state.
12. In the Catalog backup file dialog box, type c:\backup.bkf and then click OK.
13. In the What to restore box, expand File, and then expand the media created.
14. Select the System State check box, click Next, and then click Finish.
15. In the Enter Backup File Name dialog box, click OK.
16. The restore process will take a couple of minutes.
17. When the restore is complete, click Close.
18. When you are prompted to restart the computer, click No.
19. Close NTBackup.

Note NTBackup does not restore the cluster files to the shared disk. NTBackup places the cluster files on the local node.

! This exercise will examine the cluster files restored by NTBackup
1. Click Start, and then click Run.
2. In the Run dialog box, type %systemroot%\cluster and then click OK.
3. Double-click the Cluster_backup folder to view the files that were restored by NTBackup.
4. Question: What utility would you use to restore these files to the shared drive?

Exercise 2: Removing the Cluster Service
In this exercise, you will remove Cluster service from both computers in the cluster.

! To evict a node
Complete this task from NodeA only.
1. Log on as Administrator with a password of password.
2. Open Cluster Administrator from the Administrative Tools menu.
3. Right-click serverx (where serverx is NodeB).
4. Click Stop Cluster Service.
5. Click Yes.
6. When you are asked "Are you sure?", click Yes.

! To remove Cluster service from NodeA and NodeB
Complete this task from NodeA and NodeB.
1. Log on as Administrator with a password of password.
2. Open Add/Remove Programs from Control Panel.
3. Click Add/Remove Windows Components.
4. Clear the Cluster Service check box, and then click Next.
5. Click Finish.
6. Click Yes to restart the computer.

Review
To reinforce module objectives by reviewing key points. The review questions cover some of the key concepts taught in the module.

1. Which two cluster components are backed up when you back up the system state on a clustered node?
The cluster database on the quorum disk and the quorum log on the quorum disk.

2. You are trying to restore a cluster by using NTBackup. You successfully restore the system state, but you notice that the cluster files on the quorum disk were not restored. What else must you do to restore the cluster files to the quorum disk?
You must run the Clusrest.exe tool to restore the cluster files from the node that was restored.

3. The event log indicated a potential error on the cluster. You think the cluster log will give you a better explanation of the error. You explore the files on the quorum disk and cannot find the cluster log. Why can't you find the cluster log on the quorum disk?
The cluster log is kept in the %systemroot%\cluster directory on each node. It is called Cluster.log.
