Virtual Linux Server Disaster Recovery Planning

Rick Barlow
Nationwide Insurance
August 2013
Session 13726
Agenda

- Definitions
- Our Environment
- Business Recovery Philosophy at Nationwide
- Planning
- Execution
Definitions

High Availability
- With any IT system, it is desirable that the system and its components (be they hardware or software) are up and running and fully functional for as long as possible, at their highest availability. The most desirable high-availability rate is known as five 9s, or 99.999% availability. A good deal of planning for high availability centers on backup and failover processing and on data storage and access.
- Deals with a significant outage within the data center:
  - LPAR failure
  - Operating system outage
  - Application ABEND
High Availability

[Diagram: production high-availability topology. Redundant paths run from the Internet and intranet through paired Cisco switches with firewalls (front end, back end, and tools zones) to multiple OSA cards, then through z/VM TCP/IP stacks and vswitches on several LPARs. HTTP servers, WAS nodes, session nodes, and DB servers are duplicated across five VLANs so that no single LPAR, vswitch, OSA, or switch is a single point of failure.]
Definitions

Disaster Recovery
- Disaster recovery in information technology is the ability of an infrastructure to restart operations after a disaster. While many of today's larger computer systems contain built-in programs for disaster recovery, standalone recovery programs often provide enhanced features. Disaster recovery is used both in the context of data loss prevention and data recovery.
- Deals with a complete outage:
  - Natural catastrophe
  - Data center loss
  - Major hardware failure
Our Environment

- Four z196s installed in 2011
- Current configuration:
  - Production boxes
    - 36 IFLs: 18 on each z196
    - 696 GB memory
    - 6 z/VM LPARs (plus a system programmer configuration-test LPAR)
  - Development boxes
    - 21 IFLs: 10 on one z196 and 11 on the other
    - 850 GB memory
    - 6 z/VM LPARs (plus a system programmer test LPAR)
- Tier 4+ data center
  - Fully redundant power, telecom, generators, etc.
Planning

- Design
- Challenges
- Priorities
- Setup
- Automation
- Documentation
- Teamwork
Design

- What
  - Identify what needs to be recovered: everything or a subset?
  - Prioritize the recovery order
- When
  - Know the recovery objectives
- Where
  - Identify where recovery will occur: a second site or a vendor site
- How
  - Identify how to transfer programs and data
  - Identify how to perform the recovery
Challenges

- Production configuration changes may require DR configuration changes
  - Processor model changes (not necessarily)
  - Capacity
  - Network configuration
- Transition
  - Consistent distribution and synchronization of boot and DR scripts
- Commitment to regular testing
  - DR infrastructure
  - Application DR
Priorities

- Prioritizing your application recovery must be done by the people in your organization who understand the business processes.
- Business requirements drive the recovery time-frame:
  - Regulated
  - Financial / investments
  - Responsive to customers
Setup

- Asynchronous replication with SRDF between the production and recovery sites
- Replicated volumes at the recovery site
  - z/VM: different unit addresses between production and dev/test/DR
  - SAN: target WWPN and LUN differ between production and recovery
- Changes for recovery are automated in a Linux script run at boot
- Manual processes:
  - Initiate Clone copies
  - Vary ECKD DR volumes online
  - Start Linux servers
  - Update DNS to reflect DR IP addresses for all servers (a scripted sketch follows this list)
  - Middleware and application hard-coded parameters (e.g. IP addresses)
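The DNS update is listed as manual; where the DNS infrastructure permits dynamic updates, it could be scripted. A minimal sketch, assuming a BIND-style server that accepts nsupdate with a TSIG key; the zone, server, key file, and addresses here are hypothetical:

    #!/bin/bash
    # Hypothetical sketch: repoint one server's A record to its DR address.
    # Assumes the DNS server accepts dynamic updates signed with a TSIG key.
    HOST=pzvmws001.example.com
    DR_IP=10.221.1.1
    nsupdate -k /etc/dr/Kdr-update.key <<EOF
    server dns1.example.com
    update delete ${HOST} A
    update add ${HOST} 300 A ${DR_IP}
    send
    EOF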
Setup

- The DR process would be the same for the following failures:
  - System z failure
  - DASD / SAN storage frame failure
- Long-distance (inter-data center) fiber failure: no DR required (redundant links installed)
Setup

[Diagram: two sites, Prod and Dev, each with a pair of IBM z196s attached to EMC V-Max storage by FICON (x8) for ECKD and FCP (x12) for the open-systems SAN. ECKD replication fibre and SRDF replication fibre link the two storage frames.]
Normal Production

[Diagram: the production DMX (R1 volumes) replicates across replication fiber to the DR DMX (R2 volumes); SAN storage replicates over the same links.]
Failure Happens

- If a failure occurs: manually stop replication; initiate DR Clones; bring the ECKD volumes online at the DR site; start the Linux servers (a scripted sketch follows).

[Diagram: the production site is lost; the DR DMX and SAN continue standalone, with the replication fiber idle.]
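The vary and start steps could themselves be driven from an authorized guest at the DR site. A minimal sketch, assuming a Linux guest whose z/VM user has the needed CP privilege classes (class B for VARY) and the vmcp module available; the device range and guest names are hypothetical:

    #!/bin/bash
    # Hypothetical sketch of the manual recovery steps, issued via vmcp.
    modprobe vmcp
    # Bring the replicated ECKD DR volumes online to z/VM.
    vmcp vary online 1000-10ff
    # Autolog the production guests at the DR site.
    for guest in PZVWS001 PZVWS002 PZVDB001; do
        vmcp xautolog $guest
    done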
Recovery Begins

- Servers identify the DR configuration, change IP addresses, change SAN parameters, register the new IPs with DNS, and start replication south to north.

[Diagram: the DR V-Max now serves the production workload; replication back to the production-site V-Max has not yet started.]
Recovery Begins (cont'd)

- Servers identify the DR configuration, change IP addresses, change SAN parameters, register the new IPs with DNS, and start replication south to north.

[Diagram: roles reversed — the DR V-Max (now R1) replicates across the fiber to the production-site V-Max (now R2).]
Automation

- Avoid manual processes:
  - Dependent on key individuals
  - Prone to mistakes
  - Slow
- Automated processes:
  - Require only basic knowledge of the environment and the technologies in use
  - Accurate
  - Repeatable
  - Faster
- Automation does not mean build it once and then ignore it; it requires regular review and updates
Automation

- Automation begins at provisioning
- DR configuration is stored with the production configuration
  - CMS NAMES file
    - Contains all information about the provisioned server
    - Copy stored on DR disk as well
    - Also used to generate a report of server definitions for easy lookup (a parsing sketch follows the extract below)
  - Linux PARM file stored on a CMS disk
    - Stored on a disk accessible at boot time
    - Copy stored on DR disk as well
- Define everything needed to provision the server and at boot time
Automation

Extract from LINUX NAMES file for one guest:

    :nick.ws001         :userid.pzvws001    :node.vn2
    :desc.prod web server 1                 :env.prod
    :hostname.pzvmws001                     :load.linuxws
    :ip.10.1.1.1        :vswitch.prodvsw1   :vlan.2102
    :ip_nb.10.2.1.1     :vsw_nb.netbkup1    :vlan_nb.3940
    :ip_dr.10.221.1.1   :vsw_dr.prodvsw1    :vlan_dr.2102
    :ip_drbu.10.222.1.1 :vsw_drbu.netbkup1  :vlan_drbu.3940
    :oth_ip.10.1.1.5 10.1.1.15
    :dr_oth_ip.10.221.1.5 10.221.1.15
    :status.2005-09-08  :gold.v1.2
    :memory.256m        :cpus.1             :share.200,ls-200
    :comments.
    :storage.2.3g       :storage_os.7.1g    :bootdev.251
Automation

Extract from LINUX NAMES file for one guest (cont'd):

    :storage_san.16.86g
    :sanluns.r1:0100:5006048ad52d2588:0059000000000000:8.43
             R1:0100:5006048AD52D2588:005A000000000000:8.43
             R1:0200:5006048AD52D2587:0059000000000000:8.43
             R1:0200:5006048AD52D2587:005A000000000000:8.43
    :storage_san_dr.16.86g
    :sanluns_dr.r2:0100:5006048ad52e4f87:006a000000000000:8.43
                R2:0100:5006048AD52E4F87:006B000000000000:8.43
                R2:0200:5006048AD52E4F88:006A000000000000:8.43
                R2:0200:5006048AD52E4F88:006B000000000000:8.43
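Because every tag is a plain :tag.value pair, the lookup report mentioned earlier can be produced with ordinary text tools. A minimal sketch, assuming the NAMES file has been copied to the Linux side (for example with cmsfscat) under the hypothetical name linux.names:

    #!/bin/bash
    # Hypothetical sketch: list guest nick, host name, and prod/DR IPs
    # from a LINUX NAMES extract.
    awk '{
        for (i = 1; i <= NF; i++) {
            if (substr($i, 1, 1) != ":") continue   # only :tag.value words
            p = index($i, ".")
            key = substr($i, 2, p - 2)
            val = substr($i, p + 1)
            if (key == "nick")     printf "\n%-10s", val
            if (key == "hostname") printf " %-12s", val
            if (key == "ip")       printf " prod=%-12s", val
            if (key == "ip_dr")    printf " dr=%s", val
        }
    } END { print "" }' linux.names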
Automation

PARM file:

    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    BOOTDEV=251
    VIP=10.1.1.5,10.1.1.15
    DRVIP=10.221.1.5,10.221.1.15
    SAN_1=0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000
    SAN_2=0100:5006048AD52D2588:0059000000000000,0200:5006048AD52D2587:0059000000000000
    SAN_3=0100:5006048AD52E4F87:006A000000000000,0200:5006048AD52E4F88:006A000000000000
    SAN_4=0100:5006048AD52E4F87:006B000000000000,0200:5006048AD52E4F88:006B000000000000
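At boot, each SAN_n entry breaks apart into FCP device number, target WWPN, and LUN, which 2013-era Linux on System z attached through the zfcp sysfs interface. A minimal sketch of that step, written as an assumption about how the boot script might consume these variables; it presumes a kernel with automatic port discovery (older kernels also required a port_add write first):

    #!/bin/bash
    # Hypothetical sketch: attach the LUNs from one SAN_n parameter.
    # Each comma-separated path has the form devno:wwpn:lun.
    SAN_1=0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000
    for path in `echo $SAN_1 | tr ',' ' '`; do
        devno=`echo $path | cut -d: -f1`
        wwpn=`echo $path | cut -d: -f2 | tr 'A-Z' 'a-z'`
        lun=`echo $path | cut -d: -f3 | tr 'A-Z' 'a-z'`
        # Bring the FCP subchannel online, then register the LUN.
        echo 1 > /sys/bus/ccw/drivers/zfcp/0.0.$devno/online
        echo 0x$lun > /sys/bus/ccw/drivers/zfcp/0.0.$devno/0x$wwpn/unit_add
    done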
Automation

Alternate start-up scripts:
- Identify production or DR mode
  - VMCP: interact with CP
  - CMSFS: read CMS files
- Set parameters for the environment (see the sketch after this list)
  - Hostname to /etc/hostname
  - IP addresses to /etc/sysconfig/network/ifcfg-qeth-bus-ccw-0.0.xxxx
  - SAN LUN information
- Color the prompt by environment
  - Prod = Red
  - DR = Yellow
  - Tools = Green
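For the IP-address step, the start-up script only needs to choose the right variable for the mode it detected and patch the interface file before networking starts. A minimal sketch, assuming the SLES-style ifcfg layout named above; the device number 0.0.0600 is hypothetical:

    #!/bin/bash
    # Hypothetical sketch: pick the address for the detected mode and
    # rewrite the SLES ifcfg file before the interface comes up.
    IFCFG=/etc/sysconfig/network/ifcfg-qeth-bus-ccw-0.0.0600
    if [ "$ENV" == "DR" ]; then
        IP=$DRADMIN
    else
        IP=$ADMIN
    fi
    sed -i "s/^IPADDR=.*/IPADDR='$IP'/" $IFCFG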
Automation

Extract from boot.config:

    # Set up variables
    echo "1" > /sys/bus/ccw/devices/0.0.0191/online
    sleep 5
    # modprobe required just in case
    modprobe cpint
    PARMDEV=`grep 191 /proc/dasd/devices | awk '{print $7}'`
    QUSERID=`hcp query userid`              # e.g. "NZVWS001 AT VN1N"
    GUEST=`echo $QUSERID | cut -d" " -f 1`
    LOCO=`echo $QUSERID | cut -c14`
    LPAR=`echo $QUSERID | cut -c13-15`
    BOX=`echo $QUSERID | cut -c1`
    cmsfscat -d /dev/$PARMDEV -a ${GUEST}.PARMFILE > /tmp/sourceinfo
    . /tmp/sourceinfo
    echo "0" > /sys/bus/ccw/devices/0.0.0191/online
Automation

Result of cmsfscat:

    cat /tmp/sourceinfo
    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    BOOTDEV=251
    VIP=10.1.1.5,10.1.1.15
    DRVIP=10.221.1.5,10.221.1.15
    SAN_1=0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000
    SAN_2=0100:5006048AD52D2588:0059000000000000,0200:5006048AD52D2587:0059000000000000
    SAN_3=0100:5006048AD52E4F87:006A000000000000,0200:5006048AD52E4F88:006A000000000000
    SAN_4=0100:5006048AD52E4F87:006B000000000000,0200:5006048AD52E4F88:006B000000000000
Automation

More extract from boot.config:

    case "$ENV" in
        PROD)
            if [ "$LOCO" == "$BOX" ]
            then
                CLR="41";          # Red
            else
                CLR="43";          # Yellow/Gold
                ENV="DR";
            fi
            ;;
        DEV|JT|TOOLS|TOOL)
            CLR="42";              # Green
            ;;
        ST)
            CLR="44";              # Blue
            ;;
        PT)
            CLR="45";              # Purple
            ;;
        UAT|IT)
            CLR="46";              # Turquoise
            ;;
        *)
            CLR="42";              # Green
            ENV="UNK";
            ;;
    esac

Examples:

    barlowr@szvmjt002:jt:barlowr>
    barlowr@nzvmws001:prod:barlowr>
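The deck does not show how CLR reaches the prompt; one plausible way, sketched here as an assumption, is to export a colored PS1 from the same boot script (or from /etc/profile.local):

    # Hypothetical sketch: turn $CLR and $ENV into the colored prompt
    # shown in the examples above (41/43/... are ANSI background codes).
    ENVLC=`echo $ENV | tr 'A-Z' 'a-z'`
    export PS1="\[\033[${CLR}m\]\u@\h:${ENVLC}:\u>\[\033[0m\] "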
Documentation

Document everything:
- Declaration criteria
- Contact information (your list may be different)
  - Operating system
  - Middleware
  - Application
  - Network
  - Security
- Lists of servers
- Recovery process
- Verification process
- Fail-back process
Documentation

DR procedure example:
1. Confirm DISASTER declaration.
2. Begin shutting down all test/development guests to ensure sufficient capacity (a scripted sketch follows).
3. Bring up the production DR guests identified by the business units for each application environment.
4. Make the appropriate emergency DNS changes to point users to the DR environment, per the definitions for each application environment.
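Step 2 lends itself to scripting as well. A minimal sketch, assuming an authorized guest with vmcp and a prepared list of test/development guest names (the list path is hypothetical); SIGNAL SHUTDOWN gives each Linux guest time to stop cleanly:

    #!/bin/bash
    # Hypothetical sketch: cleanly stop every test/development guest
    # so its IFL capacity is free for the production DR guests.
    modprobe vmcp
    while read guest; do
        vmcp signal shutdown $guest within 300
    done < /etc/dr/testdev-guests.list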
Documentation

Return procedure example:
1. Confirm DISASTER OVER declaration.
2. Reverse disk replication; confirm synchronization.
3. Follow the instructions for confirmation of the original production environment for each application.
4. Bring down the DR guests identified by the business units for each application environment.
5. Make the appropriate DNS changes to point users to the non-DR environment, per the definitions for each application environment.
6. Resume normal disk replication.
Teamwork

- Recovery coordinator
- z/VM system programmers
- Linux system administrators
- Middleware
  - WAS administrators
  - Database administrators
  - MQ administrators
- Application teams
  - Testing methodology
  - Expected results
- Avoid processes that are dependent on subject matter experts (SMEs) when a disaster happens
Execution

- Test
- Document results
- Compare to plan
- Repeat
Execution

- Where to recover the systems
  - Your own second site
  - A recovery vendor
- Where do the people go
  - Identify which personnel need to travel to the recovery site
  - Document travel procedures
  - Identify alternate (local) office space
  - Some office locations may be able to access the recovery site if connectivity is available
Execution

Testing:
- Test as often as feasible
  - Frequency may depend on having your own site or contracting with a vendor
- Tests should be as close as possible to real recovery conditions
  - Operating systems are easy
  - Some subsystems are not so easy (e.g. a large database)
  - Multi-platform applications are more complex
- Automate as much as possible to avoid manual effort
Document Results / Compare to Plan

- Keep detailed plans for all test scenarios
- Carefully track tests
- Document action items and follow up on improvements
- Build on successes
Repeat

- Do it again
- Do it regularly
- Corporate emphasis may be required to encourage all applications to test
- If you have a DR plan and it is not regularly exercised and validated, your business is at risk.
Contact Information

Rick Barlow
Senior z/VM Systems Programmer
Phone: (614) 249-5213
Internet: Richard.Barlow@nationwide.com

"And I thought we were busy before Linux showed up!"