Stretching A Wolfpack Cluster Of Servers For Disaster Tolerance. Dick Wilkins Program Manager Hewlett-Packard Co. Redmond, WA dick_wilkins@hp.

Similar documents

Windows Server Failover Clustering April 2010

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

Long-Distance Configurations for MSCS with IBM Enterprise Storage Server

Server Clusters : Geographically Dispersed Clusters For Windows 2000 and Windows Server 2003

Windows clustering glossary

VERITAS Volume Manager for Windows 2000

Availability Guide for Deploying SQL Server on VMware vsphere. August 2009

Tiburon Master Support Agreement Exhibit 6 Back Up Schedule & Procedures. General Notes on Backups

Contingency Planning and Disaster Recovery

SAN Conceptual and Design Basics

HP Serviceguard Cluster Configuration for HP-UX 11i or Linux Partitioned Systems April 2009

OVERVIEW. CEP Cluster Server is Ideal For: First-time users who want to make applications highly available

Extended Distance SAN with MC/ServiceGuard Opens New Disaster Recovery Opportunities

Cisco Active Network Abstraction Gateway High Availability Solution

Best practices for fully automated disaster recovery of Microsoft SQL Server 2008 using HP Continuous Access EVA with Cluster Extension EVA

Remote Copy Technology of ETERNUS6000 and ETERNUS3000 Disk Arrays

HP Serviceguard Cluster Configuration for Partitioned Systems

Availability and Disaster Recovery: Basic Principles

Managing and Maintaining a Windows Server 2003 Network Environment

Windows Geo-Clustering: SQL Server

Five Secrets to SQL Server Availability

Site-Wide Disaster Recovery and Business Continuity Solutions

High Availability Solutions for the MariaDB and MySQL Database

Deploying Global Clusters for Site Disaster Recovery via Symantec Storage Foundation on Infortrend Systems

High Availability & Disaster Recovery Development Project. Concepts, Design and Implementation

SanDisk ION Accelerator High Availability

DISASTER RECOVERY BUSINESS CONTINUITY DISASTER AVOIDANCE STRATEGIES

HA / DR Jargon Buster High Availability / Disaster Recovery

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage

VERITAS Volume Management Technology for Windows COMPARISON: MICROSOFT LOGICAL DISK MANAGER (LDM) AND VERITAS STORAGE FOUNDATION FOR WINDOWS

INSIDE. Preventing Data Loss. > Disaster Recovery Types and Categories. > Disaster Recovery Site Types. > Disaster Recovery Procedure Lists

VMware Virtual Machine File System: Technical Overview and Best Practices

Oracle Database 10g: Backup and Recovery 1-2

Achieving High Availability & Rapid Disaster Recovery in a Microsoft Exchange IP SAN April 2006

Deploying Exchange Server 2007 SP1 on Windows Server 2008

Clustering ExtremeZ-IP 4.1

Data Protection with IBM TotalStorage NAS and NSI Double- Take Data Replication Software

Microsoft File and Print Service Failover Using Microsoft Cluster Server

Owner of the content within this article is Written by Marc Grote

Using Hitachi Protection Manager and Hitachi Storage Cluster software for Rapid Recovery and Disaster Recovery in Microsoft Environments

Traditionally, a typical SAN topology uses fibre channel switch wiring while a typical NAS topology uses TCP/IP protocol over common networking

Blackboard Managed Hosting SM Disaster Recovery Planning Document

EMC MID-RANGE STORAGE AND THE MICROSOFT SQL SERVER I/O RELIABILITY PROGRAM

Metalogix SharePoint Backup. Advanced Installation Guide. Publication Date: August 24, 2015

WHITE PAPER: DATA PROTECTION. Implementing Highly Available Data Protection with Veritas NetBackup. Alex Davies January 2008

SiS964 RAID. User s Manual. Edition. Trademarks V1.0 P/N: U49-M2-0E

High Availability Storage

WHITE PAPER: ENTERPRISE SECURITY. Symantec Backup Exec Quick Recovery and Off-Host Backup Solutions

Module: Business Continuity

High Availability and MetroCluster Configuration Guide For 7-Mode

Network Attached Storage. Jinfeng Yang Oct/19/2015

PROTECTING MICROSOFT SQL SERVER TM

MICROSOFT WINDOWS SERVER FAILOVER CLUSTERING WITH EMC VPLEX

Simple Software for a Business Continuity Plan and a Disaster Recovery Plan (BCP DRP) pdf ppt

Attunity RepliWeb Event Driven Jobs

Symantec Storage Foundation and High Availability Solutions Microsoft Clustering Solutions Guide for Microsoft SQL Server

SQL Server AlwaysOn. Michal Tinthofer 11. Praha What to avoid and how to optimize, deploy and operate.

RECOVERY OF CA ARCSERVE DATABASE IN A CLUSTER ENVIRONMENT AFTER DISASTER RECOVERY

End-to-End Availability for Microsoft SQL Server

Fault Tolerance & Reliability CDA Chapter 3 RAID & Sample Commercial FT Systems

SnapManager 1.0 for Virtual Infrastructure Best Practices

HIGH AVAILABILITY STRATEGIES

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup

SQL Server AlwaysOn Deep Dive for SharePoint Administrators

Restoring Microsoft SQL Server 7 Master Databases

Vess A2000 Series HA Surveillance with Milestone XProtect VMS Version 1.0

Microsoft SharePoint 2010 on VMware Availability and Recovery Options. Microsoft SharePoint 2010 on VMware Availability and Recovery Options

Getting Started with Endurance FTvirtual Server

PoINT Jukebox Manager Deployment in a Windows Cluster Configuration

Backup and Redundancy

Solution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware

The events of September 11, 2001 didn t

HP StorageWorks Data Protection Strategy brief

High Availability Option for Windows Clusters Detailed Design Specification

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

GoGrid Implement.com Configuring a SQL Server 2012 AlwaysOn Cluster

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

EMC DOCUMENTUM xplore 1.1 DISASTER RECOVERY USING EMC NETWORKER

Deploy App Orchestration 2.6 for High Availability and Disaster Recovery

High availability and disaster recovery with Microsoft, Citrix and HP

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

Windows Server 2008 Hyper-V Backup and Replication on EMC CLARiiON Storage. Applied Technology

Leveraging Virtualization in Data Centers

VERITAS Storage Foundation 4.3 for Windows

istorage Server: High Availability iscsi SAN for Windows Server 2012 Cluster

High Availability and Disaster Recovery Solutions for Perforce

Contents. SnapComms Data Protection Recommendations

EXTENDED ORACLE RAC with EMC VPLEX Metro

Managing Cisco ISE Backup and Restore Operations

Bosch Video Management System High Availability with Hyper-V

Aljex Software, Inc. Business Continuity & Disaster Recovery Plan. Last Updated: June 16, 2009

Critical SQL Server Databases:

Business Continuity: Choosing the Right Technology Solution

Administering and Managing Failover Clustering

Availability Digest. MySQL Clusters Go Active/Active. December 2006

Oracle Database Disaster Recovery Using Dell Storage Replication Solutions

12 NETWORK MANAGEMENT

Setup for Failover Clustering and Microsoft Cluster Service

Live Migration across data centers and disaster tolerant virtualization architecture with HP Cluster Extension and Microsoft Hyper-V

Perforce Backup Strategy & Disaster Recovery at National Instruments

Transcription:

Stretching A Wolfpack Cluster Of Servers For Disaster Tolerance Dick Wilkins Program Manager Hewlett-Packard Co. Redmond, WA dick_wilkins@hp.com

Motivation WWW access has made many businesses 24 by 7 operations. Critical sales and support functions commonly take place online. Downtime is likely result in losses of many millions of dollars an hour. A disaster that results in a week or two of critical application unavailability is likely to result in business failure.

Disaster Causes Environmental (fire, tornado, earthquake, flood, major power outage) Civil unrest, terrorist actions Major operational errors Etc.

Traditional Recovery Methods Recovery from offsite backups Warm/hot site restoration Remote mirroring Clusters (HA only, not disaster tolerant) Fault tolerant architectures (ditto)

How to obtain Disaster Tolerance Combine solutions In addition to routine backups, etc. Use remote mirroring.... and clustering........ and place cluster nodes at each site

The The Internet Cluster Node Heartbeat Link Remote Cluster Node Mirror Interconnect Local Site Remote Site

What makes this hard? 1) Mirroring is unidirectional 2) The quorum disk is special 3) Total loss of communications between sites can present protocol problems

Unidirectional mirroring Mirroring was not designed as a general purpose communications path. It just moves data from primary to secondary disk copies. MSCS is a shared nothing clustering system (vs. shared disk ). When ownership of a disk is changed, mirror direction may need to be swapped.

The Quorum disk is special Used as a tie breaker to prevent splitbrain Where two (or more) nodes think the others are down and try to provide the same service Low level SCSI commands may be delivered at any time Therefore must accurately follow the SCSI Specification semantics for Reserve, Release and Bus/Device Reset

Preventing Split-Brain When all inter-site communications are lost It s hard to run a protocol between sites to decide to move control of the cluster 1. The remote site may be down ( disaster ) 2. It may have lost network connectivity 3. It may be up and providing service How to decide?

Wolfpack Background Two approaches to application data disks Shared Data, all nodes see all disks Shared Nothing, one node at a time Applications and their resources are packaged into groups Disk(s) IP address Network name The application itself

The Resource DLL Inserted into group bring up prior to disks Checks to see the mirroring status and direction Swap direction it if needed Mirror Direction

Other resource DLL functions The disk array allows for sync and async I/O to mirror volumes (LUN by LUN) Numerous options are available based on data-safety requirements The DLL must preserve and protect the selected level of data-safety (600+ rules) User supplied scripts can be run GUI is provided to setup and manage

Dealing with the Quorum disk We want the quorum disk mirrored in two places, but It must continue to operate as a single drive in the face of failures We didn t want to create more communications paths to rely on

Disk control protocol (simplified) 1. A node wanting to control a disk issues a Bus/Device Reset to the disk 2. It waits INTERVAL+ time 3. It issues a SCSI Reserve command 4. If the reserve fails, it assumes another node has defended its ownership 5. If it succeeds, it is the new disk owner and starts routine re-reservations every INTERVAL time

Disk protocol (cont.) MSCS also uses reads/write to probe the current reservation state We place a filter driver in the SCSI stack to intercept SCSI commands for the quorum disk A service program cooperates with the filter driver to insure correct behavior of the quorum disk pair We use three special disk pairs to communicate between servers and sites

The special disks 1. The first disk pair, synchronously mirrored, holds the current reservation state of the quorum disk 2. A second disk pair (not mirrored) is used to hold a exclusionary lock between local nodes 3. The third disk pair is used as a global lock. It s pair state and direction indicates the ownership of the global lock.

The special disks (cont.) These disks are small (the smallest that can be created in the array) They are not part of the cluster configuration They are not assigned drive letters They do not contain file systems Also required are Command Device disks on each array used for array control commands

The Disks

One type of failures is hard! If all communications are lost between sites you need help to recover You need the quorum, but we stretched it An external arbitrator can help Located with the application users Is prepared to allow one and only one site proceed with cluster ops Arbiter location available from Active Directory (or config files)

How it works For the node that currently holds the quorum device It detects the loss of communications It creates a process The process contacts the arbitrator If it makes contact, it knows all is OK and exits If it is unable to make contact, it assumes the communications failure has isolated it and shuts down clustering (also assuming the other site will restart clustering)

How it works (cont) When MSCS asks a node to take control of the quorum disk and the normal protocol finds a communications failure It causes clustering to shut down It creates a new process to see if recovery is needed That process contacts the arbitrator If the arbitrator tells it to proceed, it cleans up the metadata and restarts clustering locally, if not, it just exits

How the arbitrator decides Upon request for arbitration, the arbitrator process attempts to contact the cluster s general IP address If the cluster responds (a operating node that holds the quorum device) it reports that the cluster is running If it cannot contact the cluster s IP, it reports that the cluster is down or isolated If multiple nodes request this service, it only responds to the first

Software Installation Hardware configuration is done first Typical Microsoft Installer/InstallShield script setup is used One node at a time (quorum is moved ahead of each node s setup) Same script installs the resource DLL Also installs the arbitrator on a node external to the cluster

Summary A stretched cluster significantly increases disaster tolerance It is possible to quickly and automatically recover from major failures Inter-site link technology doesn t matter Cost is a small increment when clustering and distance mirroring are already present This implementation is hardware specific but the technology is portable to other distance mirroring solutions This implementation is Microsoft WHQL certified

Future Work New arrays (distance mirroring is all that is necessary) Windows.NET and Win64 Redundant Arbitrator New quorum technologies from Microsoft (SCSI persistent reservations, new quorum algorithms) Transportability to Unix clusters?