WHITE PAPER

VERITAS Volume Manager for Windows 2000

CAMPUS CLUSTERING: USING VERITAS VOLUME MANAGER FOR WINDOWS WITH MICROSOFT CLUSTER SERVER (MSCS)

Using Volume Manager for Windows with MSCS in Campus Clustering Environments

White Paper, August 5th, 2002
TABLE OF CONTENTS

Overview
Dynamic Volumes Concepts
    Dynamic Volume Overview
    Dynamic Volumes Virtualize Storage
    Dynamic Volumes in Microsoft Windows 2000
    Dynamic Volumes in VERITAS Volume Manager for Windows
Dynamic Disk Groups
Microsoft Cluster Server (MSCS)
    MSCS Overview
    MSCS Quorum Resource
    Cluster Ownership of the Quorum Resource
    The Heartbeat of a Cluster
    MSCS Challenge/Defense Protocol
VERITAS Volume Manager in an MSCS Cluster Environment
    Volume Manager Advantages with MSCS
    Dynamic Volume Support with MSCS
    Campus Clustering using Fault Tolerant MSCS Resources
    How VERITAS Volume Manager can help solve Challenges in a Cluster Environment
        ISSUE #1: Can't Grow Data Volumes Online
        ISSUE #2: Can't Use Fault Tolerant Data Volumes
        ISSUE #3: Quorum Disk is a Single Point of Failure for the Cluster
    Disaster Recovery: Campus Clusters
    Campus Clustering: Mirrored Quorum Disks at Two Locations
Summary
OVERVIEW

The Microsoft Windows 2000 operating system offers significant advances in performance, scalability, and manageability. One of the key features of this new operating system is the Logical Disk Manager (LDM), which provides logical volume management and online disk administration capabilities. VERITAS Volume Manager for Windows 2000 extends these in-the-box basic capabilities to create a highly scalable, manageable platform for the most data-intensive or critical application environments.

Windows 2000 also supports Microsoft Cluster Server (MSCS), the Microsoft solution for creating a loosely coupled configuration of servers with application failover capabilities. The MSCS technology has been in place for a few years and is used to improve the availability and manageability of Windows systems.

Using VERITAS Volume Manager for Windows 2000, system administrators can create flexible storage configurations integrated with the MSCS cluster server, so that the Cluster Server can automatically migrate all the storage required for a specific application between nodes when a failover occurs. This solution combines the high-availability failover capabilities of MSCS with the highly configurable and manageable storage capabilities of VERITAS logical volume management.

This paper provides a brief overview of the components involved in this solution and then discusses specifically how to create application-specific storage migration for MSCS using VERITAS Volume Manager. The paper also discusses the two key advantages of using VERITAS Volume Manager in an MSCS environment: the ability to use dynamic disks with clustering and the ability to create Campus Clusters with fault tolerant mirrored quorum and data resources. Both of these Volume Manager advantages reduce planned and unplanned downtime in a clustering environment.

More information on Windows 2000 can be found on the Microsoft Web site at http://www.microsoft.com. The VERITAS Web site (http://www.veritas.com/) contains other sources of information on VERITAS Volume Manager for Windows 2000.
DYNAMIC VOLUMES CONCEPTS

Dynamic Volume Overview

VERITAS worked with Microsoft to develop the logical volume management in the Windows 2000 software. Logical volume management through the use of dynamic volumes removes physical limitations of storage, enabling administrators to build higher-performance, more available storage configurations from existing disk devices. This simplifies disk administration tasks for reduced cost of ownership.

Windows 2000 introduces a new Logical Disk Manager (LDM) facility that supports both basic disks and dynamic disks. Basic disks use standard disk partition tables to support basic volumes and have been supported on previous versions of Windows. Dynamic disks, which contain dynamic volumes, store the disk and volume information on the disk itself.

A dynamic volume is an abstract online storage management unit instantiated by a system software component called a volume manager. To file systems, database management systems, and applications that do raw I/O, a dynamic volume appears to be located on a single disk, in the sense that:

- It has a fixed amount of non-volatile storage
- Its storage capacity is organized as consecutively numbered 512-byte blocks
- Sequences of consecutively numbered blocks can be read or written with a single request
- Reading and writing can start at any block
- The smallest unit of data that can be read or written is one 512-byte block

Dynamic Volumes Virtualize Storage

Unlike basic volumes, a dynamic volume can aggregate the capacity of several disks into a single storage unit, so that there are fewer storage units to manage or to accommodate files larger than the largest available disk.

A dynamic volume can also aggregate the I/O performance of several disks. This allows large files to be transferred faster than would be possible with the fastest available disk. In some circumstances, it also enables more I/O transactions per second to be executed than would be possible with the fastest available disk (i.e., by issuing concurrent I/Os).

A dynamic volume can improve data availability through mirroring or RAID techniques that tolerate disk failures. Failure-tolerant volumes can remain fully functional when one or more of the disks that comprise them fail.

A dynamic volume created with VERITAS Volume Manager can grow dynamically. More complex volumes can be created to provide a combination of these benefits.
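To make the aggregation idea concrete, the following Python sketch is purely illustrative (the disk names and sizes are hypothetical, and no volume manager is actually implemented this way). It models a spanned volume that presents two disks as one space of consecutively numbered 512-byte blocks, the single-disk view described above:

# Illustrative sketch only: a minimal model of how a spanned dynamic volume
# can present several disks as one consecutively numbered 512-byte block space.
# Disk names and sizes are hypothetical.

BLOCK_SIZE = 512

class SpannedVolume:
    def __init__(self, disks):
        # disks: list of (disk_name, capacity_in_blocks), concatenated in order
        self.disks = disks

    def total_blocks(self):
        return sum(capacity for _, capacity in self.disks)

    def locate(self, block_number):
        """Map a volume block number to (disk_name, block offset on that disk)."""
        if not 0 <= block_number < self.total_blocks():
            raise ValueError("block number outside volume")
        for name, capacity in self.disks:
            if block_number < capacity:
                return name, block_number
            block_number -= capacity

volume = SpannedVolume([("Disk1", 1000), ("Disk2", 2000)])
print(volume.total_blocks())   # 3000 blocks, roughly 1.5 MB at 512 bytes per block
print(volume.locate(1500))     # ('Disk2', 500): the request is redirected transparently

A mirrored or striped layout would change how locate() maps blocks to disks, but the single contiguous block address space seen by file systems and applications stays the same.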
Dynamic Volumes in Microsoft Windows 2000

Dynamic volumes in Windows 2000 can host software-managed RAID volumes. Because the disk and volume information is stored on the disk itself instead of in system tables, moving or reallocating dynamic disk storage between systems is easier. Another major benefit is that administrators can perform disk and volume management tasks without restarting the system. The volume manager supports online growth and management of storage.

Dynamic volumes in Windows 2000 may be simple, spanned, striped (RAID-0), mirrored (RAID-1), or RAID-5 (striping with distributed parity). The Windows 2000 Logical Disk Manager provides online management and configuration of local and remote disk storage and a domain-wide view of storage resources. Together, these features support highly configurable and manageable storage solutions.

Dynamic Volumes in VERITAS Volume Manager for Windows

VERITAS Volume Manager extends the capabilities of Windows 2000 dynamic volumes. Volume Manager dynamic volumes have all the capabilities of the native Windows 2000 dynamic volumes, plus:

- Striped and RAID-5 volumes using more than 32 physical disks (columns)
- Mirrored stripe volumes for a high-performance, highly available storage solution
- The ability to grow software RAID volumes dynamically without taking users or applications offline (no rebooting)
- N-way mirroring: administrators can create and detach additional (e.g., third) mirrors on mirrored volumes
- Preferred plex: a local mirror can be designated as the preferred read device for data with heavy request loads
- Hot spares, Hot Relocation, and Undo Hot Relocation
- RAID-5 logging and Dirty Region Logging to speed recovery after a RAID-5 or mirrored volume failure

Volume Manager also provides advanced online management capabilities. For example, administrators can expand mirrored, striped, and RAID-5 volumes while the data is online and available. Administrators can use the graphical interface to identify storage bottlenecks and move data to correct or prevent performance problems.

Finally, VERITAS Volume Manager supports shared and partitioned shared storage configurations using the concept of multiple disk groups. This makes it easier for multiple Windows servers to share a disk farm or Storage Area Network by segmenting the available storage, with each server owning specific storage segments. The administrator can easily reconfigure or change the segmentation. This last feature is relevant for supporting storage migration with MSCS.
DYNAMIC DISK GROUPS

VERITAS Volume Manager supports a concept called dynamic disk groups. A dynamic disk group is a collection of disks with an arbitrary volume layout. The Windows 2000 Logical Disk Manager does not support dynamic disk groups. Volume Manager support for multiple dynamic disk groups is a key feature when used in an MSCS environment.

A dynamic disk group is the object that is imported or deported. When a disk group is imported, all the volumes contained in the disk group are brought online and made available by the volume manager. When a dynamic disk group is deported, all the volumes contained within the dynamic disk group are taken offline and made unavailable by the volume manager.

There are three types of VERITAS Volume Manager dynamic disk groups:

1. Primary disk group - contains the boot/system disk and zero to many additional disks with arbitrary volume layout
2. Secondary disk group - contains one to many disks with arbitrary volume layout
3. Cluster disk group - contains one to many disks with arbitrary volume layout, with two additional properties:
   - A cluster disk group is NOT automatically imported at boot time. If the group is not managed by a cluster, the user must perform a manual import through the GUI, CLI, or API.
   - A cluster disk group uses hardware locking mechanisms (e.g., SCSI-2 reserve/release) to guarantee that the disks within the group are exclusively owned by one node at a time.

Cluster disk groups are intended to be used by clustering applications such as MSCS and VERITAS Cluster Server (VCS). Both MSCS and VCS can import and deport a cluster disk group through online and offline operations. When VERITAS Volume Manager is used with MSCS, the cluster disk group is the resource managed by MSCS.
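The import/deport and exclusive-ownership behavior described above can be summarized in a small conceptual model. The Python sketch below is illustrative only: the class and method names are invented for this example, and the real product enforces ownership with SCSI-2 reservations on the physical disks rather than with software objects.

# Conceptual sketch only: models the import/deport semantics of a cluster disk
# group. Names are hypothetical; real ownership is enforced with SCSI-2
# reserve/release on the physical disks, not Python objects.

class Disk:
    def __init__(self, name):
        self.name = name
        self.reserved_by = None          # node currently holding the reservation

    def reserve(self, node):
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False                     # another node already owns the reservation

class ClusterDiskGroup:
    def __init__(self, name, disks):
        self.name = name
        self.disks = disks
        self.owner = None                # cluster disk groups are NOT auto-imported at boot

    def import_group(self, node):
        """Bring all volumes in the group online on one node, exclusively."""
        # A real implementation would release any reservations it obtained if the import fails.
        if all(disk.reserve(node) for disk in self.disks):
            self.owner = node
            return True
        return False

    def deport_group(self):
        """Take all volumes offline and release the disks."""
        for disk in self.disks:
            disk.reserved_by = None
        self.owner = None

group = ClusterDiskGroup("SQLGroup", [Disk("Disk1"), Disk("Disk2")])
print(group.import_group("NodeA"))   # True: NodeA now owns the group
print(group.import_group("NodeB"))   # False: existing reservations block a second importer

The key point mirrored by the sketch is that a cluster disk group is never brought online automatically; a node must explicitly import it, and the reservations keep a second node from importing it at the same time.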
MICROSOFT CLUSTER SERVER (MSCS)

MSCS Overview

Microsoft Cluster Server is the Microsoft clustering solution for Microsoft Windows-based servers. A detailed description of MSCS is beyond the scope of this paper; more detail can be found on the Microsoft Web site. This section highlights some relevant points.

A cluster is a group of independent computers working together as a single system to ensure that mission-critical applications and resources are as highly available as possible. The group is managed as a single system, shares a common namespace, and is specifically designed to tolerate component failures and to support the addition or removal of components in a way that is transparent to users. Clustered systems have several advantages, including fault tolerance, high availability, scalability, simplified management, and support for rolling upgrades, to name a few.

Microsoft Cluster Server employs a shared nothing architecture and was initially introduced as a two-node clustering solution for the Enterprise Edition of Windows NT. As of this writing, MSCS is supported with up to two nodes on the Windows 2000 Advanced Server operating system and up to four nodes on the Windows 2000 Datacenter Server operating system. Shared nothing refers to the fact that resources such as storage belong to only one system in the cluster at any time. However, the storage is physically connected to both nodes via a SCSI bus or Fibre Channel. In the event of a failure on one node, the other node can import the storage resource and host the application.

The primary operating attributes of MSCS are as follows:

- Each system in the cluster is a node. Today the nodes must be Windows NT Server or Windows 2000 systems, and both nodes should be at the same OS level.
- Any item managed by MSCS is a resource. Resources may include storage devices, file shares, TCP/IP addresses, applications, and databases.
- A resource group is the collection of resources that fail over as a group. All resource dependencies (such as a network name that depends on an IP address) must exist in the same group.

This cluster design brings both availability and manageability benefits. The MSCS software tracks the state of the nodes in the cluster. In the event of an application or server failure, it either restarts the application or performs a failover to an available node. When the failed node is again available, MSCS switches the applications back to their preferred node.

Failover is the process by which services that were running on one node are moved to another node or nodes in a cluster. Most stateless applications switch to the failover node transparently. Some applications that track the state of the nodes need to re-establish a connection to the cluster; otherwise, the failover is transparent.
System administrators can also manually move resources from one node to another to perform load balancing and system maintenance tasks without taking production applications down.

MSCS Quorum Resource

Another concept central to MSCS clustering is the quorum resource. A quorum resource is usually, though not necessarily, a SCSI disk that arbitrates for a resource by supporting the challenge/defense protocol (explained later in this document). This resource must be capable of storing the cluster registry and cluster logs. It is also used to persist configuration change logs, tracking changes to the configuration database when any defined cluster member is missing or not active. This prevents configuration partitions in time, also known as temporal partitions. Temporal partitions are undesirable because changed configuration data is not persisted, leaving the cluster out of sync.

In MSCS, the quorum is a resource that determines ownership of the cluster. There is exactly one quorum resource in every cluster. In a way, the quorum resource is a global control lock for the cluster. The quorum is also used to determine which node owns the cluster when the network heartbeat is lost. This prevents split-brain situations, in which the connection between the nodes is broken and both nodes try to start the cluster.

Split-Brain

Split-brain refers to a state where nodes in a cluster lose contact with each other across the network, but shared disks continue to operate (also known as network partitioning). The secondary server in the cluster, believing the primary server has failed because it no longer hears its heartbeat, takes over the disk. The primary server, which no longer receives a heartbeat from the backup server but knows that it (the primary) is still operating properly, continues to write to disk. Windows NT and Windows 2000 (as well as other operating systems and file systems today) cannot support multiple systems writing to the same disk at the same time, so some data may be lost.

Because split-brain is a situation that must be avoided, MSCS uses the following mechanism to keep split-brain from happening to the cluster:

- SCSI reservation - the process used to control a hard disk (or, in an array, multiple hard disks) so that only the server with the reservation can access the drive(s). Through SCSI reserve and release commands, MSCS is able to maintain control of the shared disks so that only the server that has control of the drives has access.

The quorum resource is critical to the cluster. However, without third-party augmentation via hardware or software, the quorum is a single point of failure in MSCS. Volume Manager provides mechanisms for building redundancy into the quorum resource. By placing the quorum in a Volume Manager Disk Group resource (cluster disk group), it can be configured as a fault tolerant (RAID) volume.
A two- or four-way mirror strategy provides a high level of redundancy and prevents a deadlock situation during arbitration. SCSI reservations are placed on all disks in a disk group to physically fence off the disks from other nodes in the cluster. Hardware augmentation alone (hardware RAID) can provide electronic fault tolerance for the quorum, but a hardware solution for physical protection can be cost prohibitive. Volume Manager can offer both at a reasonable cost. With Volume Manager it is possible to physically separate the disks of the quorum and locate them at separate sites. This allows the quorum, and thus the cluster, to remain online should fire, flood, or similar disaster damage one, or possibly more, of the quorum disks.

Cluster Ownership of the Quorum Resource

The quorum device drives the practice of cluster ownership. Ideally, only one server in a cluster should hold the cluster configuration and be able to make decisions on behalf of the cluster service. When the cluster service is built, an algorithm is needed to determine which node is in charge. One candidate is a simple majority vote, but in a two-node cluster the votes would simply cancel each other out. Instead, a quorum resource is used, as follows:

Within the cluster administrator, you designate a quorum (usually, but not always, part of a SCSI disk) that determines who has ownership of the cluster. Recall that only one owner can own a resource at any time. That same mechanism is used to ensure that only one node is in charge of the cluster at any time. This is important because, to implement a cluster server, you must designate a disk to act as the quorum device, which provides arbitration and knowledge of who is in charge at any time. A device arbitrates for a resource by supporting the challenge/defense protocol and by storing the cluster registry and logs.

The quorum resource not only arbitrates, but also provides a place for checkpoints. This means it persists configuration-change logs, tracking changes to the configuration database whenever any defined member is missing (not active). The quorum device thereby prevents configuration partitions in time, also known as temporal partitions, which are undesirable in clustering. If you change one node while a second node is down, you expect that when the second node comes up, it will have the right configuration information. You do not want to go from state S to state S' on one machine and then bring up another machine and have it come back at state S rather than state S'; that would mean some state information had been lost. The quorum device is used as a means of logging those changes so that, at any time, the cluster can survive catastrophic failures and bring the configuration data back in an orderly manner. With the configuration data on the quorum device, you always know where the information is located.
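The checkpoint and change-log behavior described above can be illustrated with a short sketch. The fragment below is a conceptual model only (the data structures and function names are hypothetical and do not reflect the actual MSCS cluster registry format); it shows why replaying the quorum log keeps a returning node from reintroducing stale configuration, which is how temporal partitions are avoided.

# Illustrative sketch only, not the MSCS implementation. Models why logging
# configuration changes on the quorum device prevents temporal partitions:
# a node that was down during a change refreshes from the quorum log instead
# of reintroducing stale configuration (state S instead of S').

quorum_log = []          # persisted on the quorum disk in a real cluster

def record_change(change, cluster_config):
    """Apply a configuration change and persist it to the quorum log."""
    cluster_config.update(change)
    quorum_log.append({"sequence": len(quorum_log) + 1, "change": change})

def rejoin(node_config, node_last_sequence):
    """A returning node replays any changes it missed while offline."""
    for entry in quorum_log[node_last_sequence:]:
        node_config.update(entry["change"])
    return len(quorum_log)   # the node is now current with the cluster

cluster = {"quorum_disk": "Q:", "preferred_owner": "NodeA"}
node_b = dict(cluster)                                   # NodeB's view before it goes down
record_change({"preferred_owner": "NodeB"}, cluster)     # change made while NodeB is offline
rejoin(node_b, 0)
print(node_b["preferred_owner"])                         # 'NodeB': no temporal partition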
The Heartbeat of a Cluster

A third concept central to clustering depends on both the namespace and the quorum device: the heartbeat of a cluster.

Heartbeat: In a failover configuration, the heartbeat allows two or more systems to communicate privately with each other. Heartbeats are signals that are sent periodically from one system to another to verify that the systems are active.

Consider the heartbeat using the MSCS cluster example in Figure 1.

Figure 1

In this example, Server A owns the disk in cabinet A and is running Microsoft Exchange. Server B owns the disk in cabinet B and is running Microsoft SQL Server. Server A and Server B are both active, servicing client requests. If the network connections between these two servers become unplugged, the heartbeat between the servers will fail. The servers can then use the MSCS challenge/defense protocol (described below) and the quorum resource to learn whether the other server is still functioning. The challenge/defense protocol uses a low-level reset of the SCSI buses between the machines to attempt to gain control of the quorum resource disks.
When a SCSI bus reset is issued, the reservation that the server had been holding on the quorum disk is lost. That server then has roughly 10 seconds to re-establish the reservation, which in turn lets the other server know that it is still functioning, even though the two servers cannot necessarily communicate with each other. If the active cluster server does not re-establish the SCSI reservation on the quorum resource within the time limit, all applications that were on that server fail over to the other server. The new server servicing the applications may be a bit slower, but clients will still get their applications serviced. The IP (Internet Protocol) addresses and network names will move, applications will be reconstituted according to the defined dependencies, and clients will still be serviced, without any question as to the state of the cluster.

MSCS Challenge/Defense Protocol

The MSCS challenge/defense protocol works as follows: SCSI-2 has reserve/release verbs with a semaphore on the disk controller. The owner of the disk controller gets a lease on the semaphore, which it can renew every three seconds. To preempt ownership, a challenger clears the semaphore with a SCSI bus reset and waits ten seconds (three seconds for renewal plus two seconds of bus-settle time, twice, to give the current owner two chances to renew). If the semaphore is still clear, the challenger takes the lease from the former owner by issuing a reserve to acquire the semaphore.

Figure 2
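The timing of the challenge/defense exchange can be sketched in a few lines. The following fragment is an illustrative simulation (time is modeled as loop iterations, and a dictionary stands in for the SCSI semaphore; it is not MSCS code), using the intervals given above: a three-second renewal lease and a ten-second challenge window.

# Illustrative timing sketch of the SCSI reserve/release challenge/defense
# protocol described above; not MSCS source code. Time is simulated one
# second per loop iteration.

RENEWAL_INTERVAL = 3          # seconds between the owner's reservation renewals
BUS_SETTLE_TIME = 2           # seconds allowed for the SCSI bus to settle
CHALLENGE_WAIT = 2 * (RENEWAL_INTERVAL + BUS_SETTLE_TIME)   # 10 seconds: two renewal chances

def challenge(quorum_disk, challenger, owner_is_alive):
    """Challenger issues a bus reset, waits, and takes the reservation if undefended."""
    quorum_disk["reserved_by"] = None                # SCSI bus reset clears the reservation
    for second in range(CHALLENGE_WAIT):
        if owner_is_alive and second % RENEWAL_INTERVAL == 0:
            quorum_disk["reserved_by"] = quorum_disk["owner"]   # defense: the owner renews
    if quorum_disk["reserved_by"] is None:
        quorum_disk["reserved_by"] = challenger      # undefended: challenger issues a reserve
        quorum_disk["owner"] = challenger            # challenger now owns the quorum, and the cluster
    return quorum_disk["owner"]

disk = {"owner": "ServerA", "reserved_by": "ServerA"}
print(challenge(disk, "ServerB", owner_is_alive=True))    # 'ServerA': the defense succeeds
print(challenge(disk, "ServerB", owner_is_alive=False))   # 'ServerB': the challenger takes over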
VERITAS VOLUME MANAGER IN AN MSCS CLUSTER ENVIRONMENT

Volume Manager Advantages with MSCS

VERITAS Volume Manager for Windows has two key advantages when used with Microsoft Cluster Server (MSCS):

- Dynamic volume support with MSCS
- Campus Clustering using fault tolerant MSCS resources

Dynamic Volume Support with MSCS

Windows 2000 Advanced Server and Datacenter Server as shipped from Microsoft do not provide support for dynamic disks in a server cluster (MSCS) environment. VERITAS Volume Manager for Windows 2000 adds dynamic disk features to a server cluster and fully supports dynamic volumes in an MSCS environment.

Campus Clustering using Fault Tolerant MSCS Resources

VERITAS Volume Manager provides additional fault tolerant support for the MSCS quorum by using the dynamic disk group as a quorum resource. This is accomplished by configuring the quorum on a mirrored volume contained within a cluster disk group, which provides additional redundancy and protection from a single disk failure. Volume Manager supports up to 32-way mirrored volumes, providing quorum protection against up to 31 hard disk failures. Typically a two- or four-way mirror is used for the quorum resource to provide fault tolerance.

To own a disk group, a server must successfully import the disk group and must be able to obtain a SCSI reservation on a majority of the disks in the disk group. The quorum disk group is critical to the system; a two- or four-way mirror strategy provides high levels of redundancy and helps prevent a deadlock situation or Split Brain. SCSI reservations are placed on all disks in a disk group to physically fence off the disks from other nodes in the cluster. Volume Manager also has built-in technology to support 2N-way mirroring at two sites in a campus cluster.

Unlike the physical disk resource, the Volume Manager Disk Group resource is a cluster disk group. A cluster disk group can have one to many physical disks containing an arbitrary number of volume layouts. On cluster disk groups that have an online Volume Manager Disk Group resource, Volume Manager uses a challenge/defense protocol similar to that of the MSCS physical disk resource. To support the challenge/defense protocol, Volume Manager uses a majority algorithm to determine exclusive ownership of the disk group. When a challenge occurs, Volume Manager arbitrates for ownership of the disk group.
To gain exclusive ownership, one node must obtain a lease (i.e., semaphore) on a majority of the disks in the disk group. Volume Manager obtains leases on the physical disks in the group, not on the volumes on those disks. The majority algorithm protects against a deadlock situation in which one node obtains a lease on half of the disks in the disk group and another node obtains a lease on the other half.

VERITAS recommends that the cluster disk group containing the quorum volume contain an even number of disks, and that the quorum volume include a mirror plex on each disk in the group. Using an even number of disks in a cluster disk group, evenly distributed between the two Campus Cluster sites, protects against potential deadlock and Split Brain situations. In the case of a secondary server (failover site) failure, the primary server will remain active with ownership of 50% of the disks in the quorum cluster disk group [1]. If the primary server (active site) fails and cluster control is manually failed over to the secondary server, the primary site, once restored, cannot obtain a majority and successfully arbitrate for ownership of the disk group; this can lead to a Split Brain situation or, in the case of data volumes, expose the data to probable corruption. Note that the challenge/defense protocol is only invoked when the public and private MSCS cluster heartbeats fail.

This Volume Manager for Windows protection in a Campus Cluster is not limited to the MSCS quorum resource. Volume Manager can also be used to protect any Volume Manager Disk Group cluster resource, whether it contains the quorum or data. This allows protection of critical volumes and data files across a Campus Cluster.

How VERITAS Volume Manager can help solve Challenges in a Cluster Environment

As presented thus far in this paper, there are challenges that face a system administrator deploying MSCS in a production environment.

ISSUE #1: Can't Grow Data Volumes Online

Clustering provides higher availability than non-clustered systems. Yet if a server's data grows and storage space must be added to existing volumes, there is no way to avoid downtime with native MSCS. By using Volume Manager in conjunction with MSCS, dynamic disks can be utilized, allowing you to grow your volumes without interrupting data availability.

ISSUE #2: Can't Use Fault Tolerant Data Volumes

Another consideration in building a high availability solution is protecting against possible hardware failure. Because native MSCS does not allow the use of dynamic disks, data in a cluster cannot be made fault tolerant. If a disk that holds your data in a cluster fails, you must take the cluster offline, replace the faulty hardware, and restore the data from a backup, unless you use Volume Manager. Volume Manager's dynamic disks support mirrored, RAID-5, and mirrored stripe volumes to keep your data online through hardware failures.

[1] Volume Manager 3.0 for Windows 2000, Service Pack 1, is a requirement for this functionality.
ISSUE #3: Quorum Disk is a Single Point of Failure for the Cluster

Since native MSCS cannot use dynamic disks, the quorum will go offline if the disk it resides on fails. MSCS has a utility that allows the quorum to be rebuilt, but this obviously must take place while the cluster is offline. By putting the quorum on a Volume Manager Disk Group resource, it can be mirrored to avoid downtime from a single hardware failure. Without the use of dynamic volumes and software mirroring, a quorum resource on a basic disk is a single point of failure for the cluster. By mirroring the quorum and spreading the plexes of the mirror evenly across separate hardware storage arrays, even losing an entire array will not force the cluster offline. The methods used, and the caveats to be aware of, in developing a solution that keeps the quorum from being a single point of failure are explained in the following paragraphs.

Disaster Recovery: Campus Clusters

It is becoming commonplace for customers to protect their Microsoft clusters by utilizing Campus Clusters to guard against natural disasters such as floods and hurricanes. This practice is also becoming more common as power blackouts become an issue that customers must account for in their system planning. Campus Clusters are multiple nodes in separate buildings with mirrored SAN-attached storage located in each building.

Microsoft Windows based servers support campus clusters out of the box, without the use of VERITAS Volume Manager. The cluster nodes can be located up to 20 miles apart using Fibre Channel storage area networks (SANs) and long wave optical technologies [2]. This solution can disperse the clustered servers into different buildings or areas to protect the servers from most disasters that could strike one of the locations.

A key resource in the cluster is the quorum resource. If this quorum resource is lost to the cluster, the cluster will fail, as none of the cluster servers will be able to gain control of the quorum resource and ultimately the cluster. Most customers place this resource on hardware RAID, which provides hardware redundancy and protection from the loss of one or more physical disks. While a quorum resource located on hardware RAID is protected against disk failures within the hardware RAID, this type of protection does not guard against natural disasters and power failures that could affect the physical location containing this single quorum resource.

[2] The distance between cluster sites is dependent upon the SAN equipment used in the Campus Cluster implementation.
MSCS alone cannot keep the quorum resource from being a single point of failure within the cluster.

Volume Manager allows the quorum resource to be located on multiple disks at multiple locations by using the software mirroring capabilities of dynamic volumes in an MSCS environment. By using software mirroring, the quorum resource can be distributed across multiple hardware RAID enclosures located at multiple physical sites. This provides a fully fault tolerant solution for customers who want to maximize the uptime of their MSCS cluster solution.

Campus Clustering: Mirrored Quorum Disks at Two Locations

The Volume Manager Campus Clustering support protects against a single site failure by locating the disks in a cluster disk group evenly across two separate locations. For example, in the case of the quorum, Site A would have one cluster server and half of the plexes of a mirrored quorum resource, and Site B would have one cluster server and the other half of the plexes of the mirrored quorum resource.

[Figure: Volume Manager 2- or 4-way mirrored quorum resource clustering support. Site A contains Windows Server A and Disk Cabinet A; Site B contains Windows Server B and Disk Cabinet B. The sites are connected by Ethernet carrying the cluster heartbeat, and the quorum resource mirror plexes (1 through 4) are distributed across the two disk cabinets.]
The scenarios that can occur when there is a cluster server failure include the following:

- If the site not owning the cluster goes offline, the quorum and data volumes will stay online at the other site, and other cluster resources will stay online or move to that site. Volume Manager for Windows will allow the owning cluster node to remain online with ownership of 50% of the disks in the quorum group [3].
- If the site owning the quorum volume goes offline, the remaining site will not be able to gain control of the quorum volume because it cannot reserve a majority of disks in the quorum group. As mentioned earlier, this is a safeguard to prevent multiple nodes from onlining members of a cluster disk group to which they have access.

[3] Using Volume Manager 3.0 for Windows 2000 Service Pack 1 or higher.

The only recommended and supported Campus Cluster configurations are those made up of two sites utilizing cluster disk groups with an even number of disks, with the disks evenly distributed between the sites.

WARNING: Manual failover of a cluster between the two sites should only be performed after coordination between the two sites to ensure that the primary server has in fact failed. If a cluster disk group containing the MSCS quorum is manually (forced) imported into the secondary (failover) server while the primary server is still active, this will cause a Split Brain situation. There may be data loss if Split Brain occurs, because each plex of the mirrored volume may be updated independently when the same disk group is imported on both nodes.

Volume Manager 3.0 for Windows 2000 Service Pack 1 is a requirement for this Campus Cluster implementation. Volume Manager 3.0 for Windows 2000 with Service Pack 1 maintains the reservation on a disk group when 50% of its disks are available, but will not import a disk group in which only a minority of the disks (50% or less) is available. This makes it possible to implement a two-site Campus Cluster configuration that utilizes cluster disk groups comprised of an even number of disks, evenly distributed across the two sites, while significantly reducing the risks of split-brain and data corruption usually associated with such a configuration. This ability to maintain reservations on disk groups that have only 50% of their disks available applies to all cluster disk groups.

VERITAS has a special command line utility, vxclus, that aids in this manual failover. The operator at Site B runs the vxclus command to gain control over the cluster in spite of not having a majority of disks in the quorum cluster disk group. When Site A comes back online, special procedures are needed to ensure that single control of the cluster is maintained. Please reference the VERITAS Technote titled "VERITAS Volume Manager 3.0 for Windows 2000 Force Import Utility" on this subject, which is available from your local VERITAS office.
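The ownership rules in this Campus Cluster configuration reduce to two simple tests, summarized in the sketch below. This is an illustrative model rather than Volume Manager code, and the force parameter merely stands in for the kind of deliberate manual override that the vxclus utility provides.

# Illustrative decision model for cluster disk group ownership in a two-site
# campus cluster; not Volume Manager code. "force" represents a deliberate
# manual override such as the vxclus-assisted failover described above.

def can_keep_reservation(disks_available, disks_total):
    """An owning node keeps its reservation with 50% or more of the disks available."""
    return disks_available * 2 >= disks_total

def can_import(disks_available, disks_total, force=False):
    """A normal import requires a strict majority (more than 50%) of the disks."""
    return force or disks_available * 2 > disks_total

# Four-disk quorum group, two disks at each site, and one site goes offline:
print(can_keep_reservation(2, 4))    # True:  the owning site stays online with 50%
print(can_import(2, 4))              # False: the other site cannot arbitrate a majority
print(can_import(2, 4, force=True))  # True:  only a coordinated manual failover succeeds

With a four-disk quorum group split two and two across the sites, the surviving site can defend what it already owns but can never quietly take over the other site's half, which is the behavior described in this section.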
SUMMARY

It is a challenge to ensure the high availability of applications and data in today's rapidly growing Windows environments. Many factors can cause downtime: planned downtime to perform system maintenance and necessary upgrades, as well as unexpected faults in software and hardware.

VERITAS Volume Manager for Windows 2000 builds on the strong foundation of logical volume management and dynamic disks in Windows 2000. It provides advanced storage management capabilities for applications with critical performance or availability requirements and offers the highest level of online disk and volume management capabilities available. In addition, Volume Manager enables the use of dynamic volumes in an MSCS cluster environment and the creation of Campus Clusters with fault tolerant MSCS quorum resources and data volumes using 2N-way mirroring. Using Volume Manager and MSCS together provides a flexible, inexpensive clustering solution that uses commodity hardware and provides a great deal of flexibility and manageability.

To learn more about VERITAS Volume Manager for Windows, visit http://www.veritas.com/us/products/volumemanagerwin.