IBM MQ Disaster Recovery



Similar documents
WebSphere MQ Disaster Recovery

15026: WebSphere MQ HA and DR

IBM Software Services for Collaboration

The predictive power of Big Data in healthcare

Why Finance Should Automate Management & Regulatory Reporting Processes

SAML and OAUTH Technologies WebSphere Application Server

Ubiquitous Computing, Pervasive Risk: Securely Deploy and Manage Enterprise Mobile Devices

IBM Digital Experience meets IBM WebSphere Commerce

Using computing resources with IBM ILOG CPLEX

IBM MobileFirst Analytics: Discover what s working, what s not, and why

IBM Systems Director Navigator for i5/os New Web console for i5, Fast, Easy, Ready

New Single Sign-on Options for IBM Lotus Notes & Domino IBM Corporation

Developing Exceptional Mobile and Multi-Channel Applications using IBM Web Experience Factory IBM Corporation 1

IBM Worklight: Responsive Design for Mul8- Channel Applica8on Development

How to Deliver Measurable Business Value with the Enterprise CMDB

Cloud Computing with xcat on z/vm 6.3

Maximo Business Intelligence Reporting Roadmap Washington DC Users Group

How To Write An Article For Ibm.Com

A Strategic Guide to Enterprise Mobility

IBM Systems and Technology Group Technical Conference

SHOW101 Step by Step: Install, manage and configure DAOS and ID Vault. Gabriella Davis Queen of Turtles ;) Paul Mooney Senior Architect, Bluewave

Oleksandr Romanko, Ph.D. Senior Research Analyst, Risk Analytics Business Analytics, IBM Canada October 8, Business Analytics and Optimization

Disaster Recovery. Websense Web Security Web Security Gateway. v7.6

Memory-to-memory session replication

How To Write An Architecture For An Bm Security Framework

Lotus Sametime. FIPS Support for IBM Lotus Sametime 8.0. Version 8.0 SC

Software Services for WebSphere. Capitalware's MQ Technical Conference v

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Backups in the Cloud Ron McCracken IBM Business Environment

Session 1494: IBM Tivoli Storage FlashCopy Manager

z/osmf Software Deployment Application- User Experience Enhancement Update

Module 14: Scalability and High Availability

MIMIX Availability. Version 7.1 MIMIX Operations 5250

The Consolidation Process

Oracle on System z Linux- High Availability Options Session ID 252

IBM Tivoli Storage FlashCopy Manager Overview Wolfgang Hitzler Technical Sales IBM Tivoli Storage Management

Title. Click to edit Master text styles Second level Third level

IBM RATIONAL PERFORMANCE TESTER

Disaster Recovery for Oracle Database

IBM Maximo Asset Management Essentials

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

Lisa Gundy IBM Corporation. Wednesday, March 12, 2014: 11:00 AM 12:00 PM Session 15077

Symantec NetBackup for Lotus Notes Administrator's Guide

Managing and Securing the Mobile Device Invasion IBM Corporation

System z Batch Network Analyzer Tool (zbna) - Because Batch is Back!

How To Manage Energy At An Energy Efficient Cost

Energy Management in a Cloud Computing Environment

Symantec NetBackup for Microsoft SharePoint Server Administrator s Guide

Integrated and reliable the heart of your iseries system. i5/os the next generation iseries operating system

Managed Services - A Paradigm for Cloud- Based Business Continuity

Westek Technology Snapshot and HA iscsi Replication Suite

Deploying a private database cloud on z Systems

Neverfail for Windows Applications June 2010

VERITAS Business Solutions. for DB2

UPSTREAM for Linux on System z

IBM Lotus Enterprise Integrator (LEI) for Domino. Version August 17, 2010

IBM XIV Storage System. MSCS Guide. Version 1.0.x

Leveraging WebSphere Commerce for Search Engine Optimization (SEO)

Platform LSF Version 9 Release 1.2. Migrating on Windows SC

The objective of WebSphere MQ clustering is to make this system as easy to administer and scale as the Single Queue Manager solution.

z/tpf FTP Client Support

CA ARCserve and CA XOsoft r12.5 Best Practices for protecting Microsoft Exchange

EMC NetWorker Module for Microsoft for Windows Bare Metal Recovery Solution

Connectivity. Alliance Access 7.0. Database Recovery. Information Paper

Session Title: Cloud Computing 101 What every z Person must know

Data Protection with IBM TotalStorage NAS and NSI Double- Take Data Replication Software

Single Sign-on (SSO) technologies for the Domino Web Server

WebSphere Application Server Development Environment. Anita Rass Wan WebSphere Foundation Product Manger

Big Data at the Speed of Business - IBM Innovationen für eine neue Ära

CA XOsoft Replication for Windows

SteelEye Protection Suite for Linux v8.2.0 WebSphere MQ / MQSeries Recovery Kit. Administration Guide

Accelerate with ATS DS8000 Hardware Management Console (HMC) Best practices and Remote Support Configuration September 23rd, 2014.

IBM System i Integrated Collaboration

DB2 9 for LUW Advanced Database Recovery CL492; 4 days, Instructor-led

Symantec NetBackup OpenStorage Solutions Guide for Disk

CA ARCserve and CA XOsoft r12.5 Best Practices for protecting Microsoft SQL Server

CS z/os Application Enhancements: Introduction to Advanced Encryption Standards (AES)

RECOVERY OF CA ARCSERVE DATABASE IN A CLUSTER ENVIRONMENT AFTER DISASTER RECOVERY

IBM WebSphere MQ File Transfer Edition, Version 7.0

Connectivity. Alliance Access 7.0. Database Recovery. Information Paper

Performance and scalability of a large OLTP workload

IBM WebSphere Premises Server

IBM Tivoli Remote Control

HA / DR Jargon Buster High Availability / Disaster Recovery

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

Data Transfer Tips and Techniques

Symantec NetBackup for Microsoft SharePoint Server Administrator s Guide

Automatic Service Migration in WebLogic Server An Oracle White Paper July 2008

CA ARCserve Replication and High Availability Deployment Options for Hyper-V

High Availability and Disaster Recovery Solutions for Perforce

Deployment Options for Microsoft Hyper-V Server

SAN Conceptual and Design Basics

High Availability & Disaster Recovery. Sivagopal Modadugula/SAP HANA Product Management Session # 0506 May 09, 2014

Transcription:

IBM MQ Disaster Recovery Mark Taylor marke_taylor@uk.ibm.com IBM Hursley

Please Note IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Introduction Availability is a very large subject You can have the best technology in the world, but you have to manage it correctly Technology is not a substitute for good planning and testing!

What is DR Wikipedia Version Disaster recovery is the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster. Disaster recovery is a subset of business continuity. While business continuity involves planning for keeping all aspects of a business functioning in the midst of disruptive events, disaster recovery focuses on the IT or technology systems that support business functions.

What is DR Getting applications running after a major (often whole-site) failure or loss It is not about High Availability although often the two are related and share design and implementation choices HA is having 2, DR is having them a long way apart More seriously, HA is about keeping things running, while DR is about recovering when HA has failed. Requirements driven by business, and often by regulators Data integrity, timescales, geography One major decision point: cost How much does DR cost you, even if it s never used? How much are you prepared to lose

Disaster Recovery vs High Availability Designs for HA typically involve a single site for each component of the overall architecture Designs for DR typically involve separate sites Designs for HA (and CA) typically require no data loss Designs for DR typically can have limited data loss Designs for HA typically involve high-speed takeover Designs for DR typically can permit several hours down-time

Local Recovery

What makes a Queue Manager on Dist? /var/mqm/log/qmgr ini files ini files Recovery Logs SSL Store Registry Registry /var/mqm/qmgrs/qmgr System QMgr QFiles Obj Definiitions Security Cluster etc

Backups At minimum, backup definitions at regular intervals Include ini files and security settings One view is there is no point to backing up messages They will be obsolete if they ever need to be restored Distributed platforms data backup only possible when qmgr stopped Use rcdmqimg on Distributed platforms to take images Channel sync information is recovered even for circular logs Backup everything before upgrading code levels On Distributed, you cannot go back Exclude queue manager data from normal system backups Some backup products interfere with MQ processing

What makes a Queue Manager on z/os? VSAM DS BSDS HIGHEST RBA CHKPT LIST Private Objects Private Messages Pagesets RECOV RBAs VSAM DS LOG INVENTORY Tapes ARCHIVED LOGS... DB2... Shared Objects Group Objects CP ACTIVE CP LOGS Shared Messages CP CF LOG VSAM Linear DS

What makes up a Queue Manager? Queue manager started task procedure Specifies MQ libraries to use, location of BSDS and pagesets and INP1, INP2 members start up processing System Parameter Module zparm Configuration settings for logging, trace and connection environments for MQ BSDS: Vital for Queue Manager start up Contains info about log RBAs, checkpoint information and log dataset names Active and Archive Logs: Vital for Queue Manager start up Contain records of all recoverable activity performed by the Queue Manager Pagesets Updates made lazily and brought up to date from logs during restart Start up with an old pageset (restored backup) is not really any different from start up after queue manager failure Backup needs to copy page 0 of pageset first (don t do volume backup!) DB2 Configuration information & Group Object Definitions Coupling Facility Structures Hold QSG control information and MQ messages

Backing Up a z/os Queue Manager Keep copies of ZPARM, MSTR procedure, product datasets and INP1/INP2 members Use dual BSDS, dual active and dual archive logs Take backups of your pagesets This can be done while the queue manager is running (fuzzy backups) Make sure you backup Page 0 first, REPRO or ADRDSSU logical copy DB2 data should be backed up as part of the DB2 backup procedures CF application structures should be backed up on a regular basis These are made in the logs of the queue manager where the backup was issued

Rebuilding a Queue Manager Make sure you have a backup of the definitions Either through a tool such as Omegamon Configuration Manager Or by manually creating the backup MAKEDEF, dmpmqcfg or MS03 Make sure you know which version of MQ is installed And you have the install images for the code Make sure you ve got the security configuration Windows SIDs? Also any customisation in the qm.ini file (or registry) And sometimes exits might have external configuration Don t forget to have the binaries available - often separately installed

Rebuilding a Queue Manager A rebuild creates a replacement queue manager Same object definitions But loss of message data and channel sequence numbers Replacement queue manager has a new QMID MQ Explorer saves QMID in its list of known queue managers Will allow you to connect, but requires confirmation that the new qmid is expected Recommend issuing RESET CLUSTER at full repository to remove the old QMID before bringing the replacement online

Remote Recovery

Topologies Sometimes a data centre is kept PURELY as the DR site Sometimes 2 data centres are in daily use; back each other up for disasters Normal workload distributed to the 2 sites These sites are probably geographically distant Another variation has 2 data centres near each other Often synchronous replication With a 3rd site providing a long-distance backup "Tornados never hit two sites 10 miles apart so we only use nearby backup site" And of course further variations and combinations of these

Queue Manager Connections DR topologies have little difference for individual queue managers But they do affect overall design Where do applications connect to How are messages routed Clients need ClntConn definitions that reach any machine Will be affected by how you manage network Do DNS names move with the site? Do IP addresses move with the site? Some sites always put IP addresses in CONNAME; others use hostname No rule on which is better

Disk replication Disk replication can be used for MQ disaster recovery Either synchronous or asynchronous disk replication is OK Synchronous: No data loss if disaster occurs Performance is impacted by replication delay Limited by distance (eg 100km) Asynchronous: Some limited data loss if disaster occurs It is critical that queue manager data and logs are replicated in the same consistency group if replicating both Disk replication cannot be used between the active and standby instances of a multi-instance queue manager Could be used to replicate to a DR site in addition though

Combining HA and DR Primary Site Backup Site HA Pair Machine A Active instance Machine B Standby instance Machine C Backup instance QM1 shared storage Replication QM1

Combining HA and DR Active/Active Site 1 Site 2 HA Pair Machine A Active instance Machine B Standby instance QM1 Machine C Backup instance Replication QM1 HA Pair Machine A Backup instance Machine C Active instance Machine D Standby instance QM2 Replication QM2

More about disk replication During failover, need to stop disk replication and make backup disks live This is probably the most time-consuming piece Both for HA and DR, managing disks is critical path Will probably be replicating data for all resource managers, not just MQ Configuration variation: A local disk-only synchronous replica A remote full datacentre using async replication chained from primary If primary fails, the local replica can still bring the backup site up-to-date QM1 Sync QM1 Async QM1 Primary datacentre Disk-only site Backup datacentre

Backup Queue Manager - Objective Feature introduced in MQ V6 Prepares a queue manager for restart/recovery Without needing to replay all logs at a critical time For Windows, Unix and System i Backup queue manager takes the place of original New QMID but contains original definitions and messages

Backup Queue Manager - Procedure Configure queue manager with linear logging Create a queue manager at the primary site Create an identical one at the DR site the backup queue manager Ship full, inactive log files from active QM to the DR site Can use disk replication to do this, but disks have to be mounted No need to replicate qfiles Or modify SupportPac or sample programs for log management to copy files at the same time as deleting/archiving local logs Replay log files on the backup QM to bring it up to date Do this at regular intervals strmqm -r If disaster occurs, activate the backup queue manager strmqm -a For more control, can force filling of current log file MQSC RESET QMGR TYPE(ADVANCELOG)

A non-dr DR configuration Based on a real customer scenario: different qmgrs in different sites Regular deliberate partial failovers of apps and/or qmgrs Recommended MQ clusters: "marooned" messages allowed for this design Sender App QM3 Site 3 Site 1 Site 2 Sender App QM1 Receiver App App moves site QM2 Receiver App

Integration with other products May want to have consistency with other data resources For example, databases and app servers Only way for guaranteed consistency is disk replication where all logs are in same group Otherwise transactional state might be out of sync DB2 can use MQ as part of its own replication strategy InfoSphere Replication Server

Planning and Testing

Planning for Recovery Write a DR plan Document everything to tedious levels of detail Include actual commands, not just a description of the operation Not Stop MQ, but as mqm, run /usr/local/bin/stopmq.sh US.PROD.01 And test it frequently Recommend twice a year Record time taken for each task Remember that the person executing the plan in a real emergency might be under-skilled and over-pressured Plan for no access to phones, email, online docs Each test is likely to show something you ve forgotten Update the plan to match You re likely to have new applications, hardware, software May have different plans for different disaster scenarios

Example Exercises from MQ Development Different groups have different activities that must continue Realistic scenarios can help show what might not be available From the MQ development lab Most of the change team were told there was a virulent disease and they had to work from home Could they continue to support customers If Hursley machine room was taken out by a plane missing its landing at Southampton airport Could we carry on developing the MQ product Source code libraries, build machines, test machines Could fixes be produced (A common one) Someone hit emergency power-off button Not just paper exercises

Networking Considerations DNS - You will probably redirect hostnames to a new site But keep the same IP addresses? Consider NAT for routing to external partners? Affects CONNAME Include external organisations in your testing 3rd parties may have firewalls that do not recognize your DR servers LOCLADDR configuration Not normally used by MQ, but firewalls, IPT and channel exits may inspect it May need modification if a machine changes address Clustering needs special consideration Easy to accidentally join the real cluster and start stealing messages Ideally keep network separated, but can help by: Not giving backup live security certs Not starting chinit address space (z/os) Not allowing channel initiators to start (distributed) Use CHLAUTH rules Backup will be out of sync with the cluster REFRESH CLUSTER() resolves updates

A Real MQ Network Story Customer did an IP move during a DR test Forgot to do the IP move back when they returned to prime systems Didn t have monitoring in place that picked this up until users complained about lack of response

Application Patterns for availability Article describing examples of how to build a hub topology supporting: Continuous availability to send MQ messages, with no single point of failure Linear horizontal scale of throughput, for both MQ and the attaching applications Exactly once delivery, with high availability of individual persistent messages Three messaging styles: Request/response, fire-and-forget, and pub/sub http://www.ibm.com/developerworks/websphere/library/techarticles/1303_b roadhurst/1303_broadhurst.html

Other Resources Applications may need to deal with replay or loss of data. Decide whether to clear queues down to a known state, or enough information elsewhere to manage replays Order of recovery may change with different product releases Every time you install a new version of a product revisit your DR plan What do you really need to recover DR site might be lower-power than primary site Some apps might not be critical to the business But some might be unrecognised prereqs

If a Real Disaster Hits Hopefully you never need it. But if the worst happens: Follow your tested plan Don t try shortcuts But also, if possible: Get someone to take notes and keep track of the time tasks took Prepare to attend post mortem meetings on steps you took to recover Accept all offers of assistance And afterwards: Update your plan for the next time

Summary Various ways of recovering queue managers Plan what you need to recover for MQ Plan the relationship with other resources Test your plan

Legal Disclaimer IBM Corporation 2013. All Rights Reserved. The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete: All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus Sametime Unyte ). Subsequent references can drop IBM but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the or symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both. If you reference Adobe in the text, please mark the first use and include the following; otherwise delete: Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. If you reference Java in the text, please mark the first use and include the following; otherwise delete: Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. If you reference Microsoft and/or Windows in the text, please mark the first use and include the following, as applicable; otherwise delete: Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both. If you reference Intel and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete: Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. If you reference UNIX in the text, please mark the first use and include the following; otherwise delete: UNIX is a registered trademark of The Open Group in the United States and other countries. If you reference Linux in your presentation, please mark the first use and include the following; otherwise delete: Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.