Database High Availability. Solutions 2010




Database High Availability, Solutions 2010
DB2 9 DBA certification exam 731

P.O. Box 200, 5520 AE Eersel, The Netherlands
Tel.: (+31) 497-530190, Fax: (+31) 497-530191
E-mail: kbrant@kbce.nl

Disclaimer
The information contained in this presentation is based on techniques, algorithms, and documentation published by several authors and companies, and in addition is the result of research. It is therefore subject to change at any time without notice or warning. The information contained in this presentation has not been submitted to any formal tests or review and is distributed on an "as is" basis without any warranty, either expressed or implied. The use of this information or the implementation of any of these techniques is a client responsibility and depends on the client's ability to evaluate and integrate them into the client's operational environment. While each item may have been reviewed for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Clients attempting to adapt these techniques to their own environments do so at their own risk. Foils, handouts, and additional materials distributed as part of this presentation or seminar should be reviewed in their entirety.

Note: This presentation gives you an overview of techniques used by database vendors. It cannot be used for making company decisions regarding high availability without further study.

Copyright KBCE b.v., 2010 All Intellectual Rights Reserved

Trademarks
This presentation contains many trademarks in use by database vendors. Where we are aware of a trademark, we put it in capitals.

Agenda
- What is downtime
- Techniques in use
- SQL Server Cluster
- DB2 Data Sharing / PureScale vs Oracle RAC
- Wise Words

What is downtime?

Terminology in use

Term               Business risk                                  Solution
Data Recovery      Downtime and data loss                         Redundant data
High Availability  Downtime                                       Redundant system components
Disaster Recovery  Permanent data loss and "unable to continue"   Redundant systems and facilities

Not investing in hardware, software and knowledge means a potentially high risk of downtime and data loss.

Permitted downtime?
About downtime and data loss

Uptime SLA   Downtime per year   Downtime per month
99.9%        8.76 hours          43.8 minutes
99.99%       52.6 minutes        4.38 minutes
99.999%      5.26 minutes        0.438 minutes

- Acceptable data/transaction loss (if any)?
- Mean time to recovery?
- Difference between "normal down" and disaster?
- How much damage is done after how much time? $$?
- Who is first in case of disaster?

Note: database uptime ≠ application availability
- Application failures
- Hardware outages (power, network, etc.)

What is causing downtime?
- Unplanned down
  - Hardware failure
  - Data corruption
- Planned down
  - Data / application changes
  - System upgrades
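The figures in the uptime table follow directly from the SLA percentage. The sketch below reproduces them, assuming a 365-day year and twelve equal months; the function name `downtime_budget` is only illustrative and does not come from the presentation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def downtime_budget(uptime_percent):
    """Return the permitted downtime as (hours per year, minutes per month)."""
    down_fraction = 1 - uptime_percent / 100
    hours_per_year = down_fraction * HOURS_PER_YEAR
    # Assumption: a month is simply one twelfth of a year.
    minutes_per_month = hours_per_year * 60 / 12
    return hours_per_year, minutes_per_month


for sla in (99.9, 99.99, 99.999):
    hours, minutes = downtime_budget(sla)
    print(f"{sla}%: {hours * 60:.2f} min/year, {minutes:.3f} min/month")
```

Each extra "nine" divides the budget by ten, which is why the jump from 99.9% to 99.999% leaves only minutes per year for every kind of outage, planned or unplanned.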

Unplanned: Hardware Failure
- Storage subsystem
  - Disk or controller
  - Firmware or driver problem
- Network
  - Often causes a partial down (difficult to measure)
  - Often relies on a third party (SLA!)
- Server
  - Cluster support: Oracle RAC / Microsoft MSCS / IBM Sysplex and PureScale
  - Where is the backup / cluster server?
  - Can a virtual server be a solution?
- Power outage
  - Environment change, too many requests (unstable grid) etc.
  - Third party, difficult to put under an SLA

Unplanned: Site failure
- Site failure
  - Complete server room down (e.g. fire)
  - Can always happen because you depend on an external party (e.g. power)
- More than just a database problem
  - All data and hardware are involved
- Can you handle it?
  - Network changes
  - Workload (different config)
  - Fail-back situation
- Isn't there a hidden Single Point of Failure?
  - E.g. the backup fiber-optic link runs in the same cable bundle

Unplanned: Data Corruption
- Human error
  - The biggest problem; difficult because there are many scenarios that need different solutions
- Logical corruption
  - Difficult to detect (sometimes strange data emerges only after years)
  - Can you detect the cause (e.g. a program) and how much data is affected?
  - Special techniques are needed to go back in time and select the data again
  - Can you re-process the data / transactions?

Planned: Data / Application changes
- Application upgrades
  - Schema changes are still difficult
    - Still a market for vendor tools
    - DB2 versioning has a performance impact
  - Running systems need to shut down in order to reload
    - E.g. middleware transaction refresh
  - How are we testing / accepting this?
  - If the test fails, how do we undo the change?
- Data maintenance
  - Offline REORG (sometimes needed)
  - Roll-in / roll-out of data

Planned: System Upgrades
- Hardware upgrades
  - Growth
  - End of life cycle / vendor support
  - Redundant is not always hot-swap
- Software upgrades
  - Operating system, middleware, DBMS etc.
  - Wise to combine upgrades?
  - What if the new combination is not stable?
- How to respond to vendor patches
  - Needed?
  - What if we don't?
  - Policy?
  - One size fits all?

Down vs Solution

[Chart residue: axis label "Time Needed"]

Most common down
- Partial down
  - The system works, but certain functions are unavailable
  - Examples:
    - Transactions with certain input abort
    - A certain location cannot connect
    - A single database is corrupt
- The path from end user to database goes through many layers
  - How to report?
- A layered approach can buy time during downtime
  - Let the user work with the front end as if it is real-time
  - Bring down the back end for maintenance and queue requests
  - E.g. online banking without full history or real-time transactions

How long does it take to fix it
Breakdown of recovery
- For many problems the analysis takes a (very) long time
  - Human errors
  - Corruptions
  - Many companies suffer from a knowledge problem
- How to fix it
  - Creating the scenario
  - Testing the scenario?
- Speed of the recovery itself
  - Best parameters
  - Tools

[Chart: Recovery Time]
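The layered approach from the "Most common down" slide (keep the front end available and queue requests while the back end is down) can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's design: the class `QueueingFrontEnd` and its method names are hypothetical, and the back end is modelled as a plain callable.

```python
from collections import deque


class QueueingFrontEnd:
    """Front end that queues requests while the back end is in maintenance."""

    def __init__(self, backend):
        self.backend = backend      # callable that processes one request
        self.backend_up = True
        self.pending = deque()      # requests accepted during maintenance

    def submit(self, request):
        """Process immediately if possible, otherwise queue for later."""
        if self.backend_up:
            return self.backend(request)
        self.pending.append(request)
        return "queued"

    def start_maintenance(self):
        self.backend_up = False

    def end_maintenance(self):
        """Bring the back end up again and replay queued requests in order."""
        self.backend_up = True
        results = []
        while self.pending:
            results.append(self.backend(self.pending.popleft()))
        return results
```

The user sees "queued" instead of an error during the maintenance window. The price is that answers are no longer real-time, which matches the slide's example of online banking without full history or real-time transactions.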

Techniques in use to minimize downtime

Traditional backup types
- Database backups
  - Full backup
    - Online / offline (z/OS Sharelevel)
  - Incremental & differential backup
  - Include log backup
  - Anything other than a full backup is a substitute for the log!
- Disk is better than tape
  - First back up to disk (a separate physical disk volume)
  - Detect exceptions encountered during backup
  - Verify backup files
  - Copy backup files to tape, remote disk or storage manager (TSM)
- Data retention policy for backup files
  - What are you going to do with these backups?
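The "back up to disk, verify, then copy elsewhere" sequence can be partly automated. The sketch below copies a backup file and checks that the copy is bit-identical by comparing SHA-256 checksums; the function names `sha256_of` and `copy_and_verify` are illustrative assumptions. Note that a matching checksum only proves the copy succeeded, not that the backup can actually be restored.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, target_dir):
    """Copy a backup file to target_dir and fail loudly on a checksum mismatch."""
    source = Path(source)
    target = Path(target_dir) / source.name
    shutil.copy2(source, target)  # copies content and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch for {target}")
    return target
```

A periodic test restore remains the only real verification, which is the point the later "Wise words" slide makes with "Test it! Frequently!!".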

Backup Retention Policy
- Location of backup files
- Duration of retention
- Protection of sensitive data!
  - Sarbanes/Oxley (SOX)
  - HIPAA
  - Internal policies for data management and protection
- Access to backups from offsite data storage
  - Often the weakest link in security scenarios

Snapshot technology
- Two techniques
  - Real disk mirroring
  - Share the disk after the snap until an update is done
- How useful is the snapped disk?
  - Was the DBMS aware of the snap?
  - Was it up?
  - Did the DBMS participate in the snap?
- A snap can be extremely useful
  - As a backup
  - As a fallback (e.g. after a failed upgrade)
- Be careful
  - Not all scenarios work on the snapped copy

Rebuild database (restore & rollforward)
- DB2 LUW is more flexible than z/OS
  - Include log with backup
  - Differential and delta backups
- Used for:
  - Redirected restore (problem investigation)
  - Simple scenarios which allow for downtime
  - As a safety net when all else fails
- Less safe than you might imagine
  - Backup just reads files (it does not analyze them)

Replication
- Create an offline solution
- Based on master / slave
- Can be a "High Availability" solution
- All databases support this (sometimes very advanced)
  - Often horizontally / vertically segmented
  - Mostly row based but sometimes optimized
  - E.g. MySQL DRBD is like RAID 1 over the network
- Replication does not take care of:
  - IP takeover
  - Heartbeat & automated takeover
  - Slave becoming master
  - Fail back and resync
  - Conflict resolution (slave is read-only)
- With further automation it can be the base for High Availability (MySQL Flipper)

- DB2 for LUW HADR
  - Replication using log shipping
  - Many limitations compared to Data Guard
  - Xkoto Gridscale for DB2
    - More options than HADR
- Data Guard
  - Supports both log shipping and SQL shipping
  - Very mature / flexible product
- DB2 for z/OS tracker site
  - Very basic (really in use?)
- Microsoft SQL Server mirror
  - Witness server automates the takeover
  - Xkoto also has Gridscale for SQL Server

Clustering solution
- Withdraw a "node" from the solution
  - Network problem (IP needs to be re-routed)
  - Failover might not be active
- Can it be automated?
  - Heartbeat or timeout
  - Client re-route
  - Split-brain problem
- What happens to running Units Of Work?
  - Locked data, or backout by another node?
  - Take-over of transaction / restart of transaction / fail transaction
- Controlled "failure" for planned down (e.g. upgrade)
- Fail back / insert node into the cluster
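The "heartbeat or timeout" bullet on the clustering slide describes the usual way automated failover decides that a node is gone. A minimal sketch of such a detector follows, assuming each node reports in periodically; `HeartbeatMonitor` and its parameters are hypothetical names, and the injectable `clock` exists only to make the logic testable.

```python
import time


class HeartbeatMonitor:
    """Flags nodes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock          # injectable so tests can control time
        self.last_seen = {}         # node name -> time of last heartbeat

    def record_heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def suspected_down(self):
        """Return the nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The detector can only report "suspected" nodes: a silent node may have crashed, or only the link to it may have failed. That ambiguity is exactly what leads to the split-brain problem listed on the same slide.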

Shared nothing vs shared everything
- Shared Everything
  - DB2 Data Sharing & PureScale (HA & Perf)
  - Oracle RAC (HA & Perf)
    - Does not share memory, only disk
  - Microsoft SQL Server Cluster (HA)
    - Based on a hardware / operating system solution
  - Sybase IQ??
- Shared Nothing
  - Microsoft SQL Server mirror (HA)
  - Oracle Data Guard (HA)
  - DB2 HADR (HA)
  - MySQL Cluster (HA)
  - MySQL Replication (Perf)
    - Both have many limitations, yet large systems still use them
  - Perf: Teradata, Postgres (Greenplum), Netezza, DB2 LUW DPF

High Availability: Replication vs. Cluster
- Both can have a Single Point of Failure (SPF)
  - A wrong configuration can destroy data
- SAN/NAS
  - I/O overhead with shared storage
  - With RAID, the SAN is no longer an SPF
  - Make sure the network to the SAN is not an SPF
- Replication is easy to break
  - Inconsistent data (e.g. in the middle of a transaction)
  - Painful start-up / restart

Split Brain condition
- Due to communication failures, nodes become separated
- Does a missing heartbeat mean the node is really down, or is it a communication failure?
- If multiple nodes take control of the cluster, it is called a split-brain condition
- If this happens, bad things will happen
- Special software solutions are needed to be 100% sure that the other node(s) are down
- This software can itself become a Single Point of Failure

SQL Server Cluster
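One widely used guard against the split-brain condition described above is a majority quorum: a group of nodes may take control only if it can reach more than half of the cluster. The slides do not say which technique the "special software solutions" use, so the sketch below is an assumption chosen for illustration, and `has_quorum` is a hypothetical helper.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if this partition sees a strict majority of the cluster."""
    return len(set(reachable_nodes)) > cluster_size // 2


# A five-node cluster splits into two partitions after a network failure.
cluster = ["n1", "n2", "n3", "n4", "n5"]
partition_a = ["n1", "n2", "n3"]
partition_b = ["n4", "n5"]

for name, partition in (("A", partition_a), ("B", partition_b)):
    allowed = has_quorum(partition, len(cluster))
    print(f"partition {name}: may take control = {allowed}")
```

Because two disjoint partitions can never both hold a strict majority, at most one side continues. With an even split (for example two against two) neither side has quorum and the cluster stops, which trades availability for safety.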

SQL Server Cluster: Failover Clustering
[Figure: client PCs connect to a two-node cluster (Node A and Node B, each running SQL, linked by a heartbeat, one of them passive) with disk cabinets A and B. Captions: "Failure occurs!", "SCSI reserve broken", "New reservation established", "SQL fails over and is available to clients".]

SQL Server Cluster: Data Mirroring
[Figure: an application commit passes through numbered steps 1 to 5 between the SQL Server principal and the SQL Server mirror, each with its own log and data, together with a witness.]

SQL Server Comparison

Database Mirroring                 Failover Clustering
Scope: user database               Scope: full instance
Standard hardware                  Certified hardware
Very fast failover (seconds)       Automatic failover (minutes)
OS flexible (e.g. 32/64)           Enterprise OS
Independent storage                Shared storage
Reporting on mirror (read-only)    Standby not available
Geographic separation OK           Servers co-located (site failure!)

Compare DB2 to Oracle
[Figure: three DB2 members, each with its own log, two CFs, and shared data.]
- DB2 Data Sharing / PureScale
  - Single system image
  - Dynamic workload management
  - Software / hardware solution
  - No Single Point of Failure
- Oracle RAC
  - High-speed inter-system links
    - Lots of communication
  - No global cache
  - Cache Fusion / Interconnect
    - Passes data around a lot
    - Extra communication overhead
    - Scalability?

Components in a RAC Cluster
- Global Cache Services (GCS)
  - Manages data page synchronization
  - Sends DATA to other nodes
- Global Enqueue Service (GES)
  - Manages global locks for non-data pages

Oracle RAC in action: cache fusion (1)
Example of how Oracle RAC moves data around
[Figure: Instances A, B and C, with Instance D as master node, and data block 8741. Step labels: 1 "Want to read data block 8741", 2 "Do READ data block 8741", 3 "Read into buffer cache".]
- Instance B becomes "owner" of the block

Oracle RAC in action: cache fusion (2)
Example of how Oracle RAC moves data around
[Figure: Instances A, B and C, with Instance D as master node. Step labels: 1 "Want to read data block 8741", 2 "Please send 8741 to node C", 3 "Send 8741", 4 "Received data block 8741 from Node B".]
- Data no longer comes from disk; owners forward it
- Multiple copies are around

Oracle RAC in action: cache fusion (3)
Example of how Oracle RAC moves data around
[Figure: Instances A, B and C, with Instance D as master node, numbered steps 1 to 6. Recoverable step labels: 3 "Forward 8741 to node B", 4 "Write data block 8741" (Instance B), 6 "Flush PI data block 8741" (Instances A and C).]
- PI = Previous Image
- Owners have to write; after the write, caches are flushed
- New read requests move the data around again
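The read path shown in the cache fusion slides (first reader loads the block from disk and becomes owner, later readers get it forwarded by the owner) can be mimicked with a toy directory. This is a deliberately simplified model for illustration only: it is not Oracle's actual protocol, it ignores writes, locks and previous images, and the class `BlockDirectory` is a hypothetical name.

```python
class BlockDirectory:
    """Toy master-node directory that tracks who holds which data block."""

    def __init__(self):
        self.owner = {}     # block id -> instance that first read it from disk
        self.holders = {}   # block id -> set of instances with a cached copy

    def read(self, instance, block):
        """Return where the requesting instance gets the block from."""
        if block not in self.owner:
            # Nobody has the block yet: read from disk, requester becomes owner.
            self.owner[block] = instance
            self.holders[block] = {instance}
            return "disk"
        if instance in self.holders[block]:
            return "local cache"
        # Another instance owns it: the owner forwards a copy over the interconnect.
        self.holders[block].add(instance)
        return f"forwarded by {self.owner[block]}"


directory = BlockDirectory()
print(directory.read("B", 8741))
print(directory.read("C", 8741))
print(sorted(directory.holders[8741]))
```

Even in this toy form the slide's concern is visible: after the first read, every new reader adds an interconnect transfer and another cached copy that has to be dealt with when the block is written.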

What happens if a RAC node fails
- Crash recovery by another node
  - Freeze GCS, no longer allowing updates to the database
  - Remaster data blocks and recover the pages using redo/undo
  - Invalidate the recovered blocks
- False node failure detection
  - Can have a split-brain problem
- Oracle Cluster software
  - Heartbeat based, Single Point of Failure (yes, says IBM)

Unavailable time

RAC vs PureScale Scalability
- IBM PureScale benchmark:
  - 95% scalability with 32 members
  - 81% scalability with 112 members
- Oracle RAC??
  - No figures
  - License forbids publication of measurements

Even Larry admits
eWeek (www.eweek.com), 31-Oct-2003:
"I make fun of a lot of other databases, all other databases, in fact, except the mainframe version of DB2. It's a first-rate piece of technology."
Larry Ellison, Oracle's Founder and CEO

I guess we have to add DB2 LUW PureScale to it ;-)

Wise words
- Prepare for failure
  - DISASTER WILL HAPPEN
  - Ensure that no important data is lost
  - Think of the different types of unavailability (there is no silver bullet)
- Keep It Simple, Stupid (KISS)
  - Complexity is the enemy of reliability
  - Saving on education is like stopping a watch to save time
- Automate as much as possible
  - Take care that you still understand it, so document it (incl. what-if scenarios)
- Test it! Frequently!!
  - Use good scenarios!
- Audit it
  - You need a devil's advocate to find the holes

New technology
Source: channelinsider.com
- If you can see it and touch it, then it is: physical
- If you cannot see it but you can touch it, then it is: transparent
- If you can see it but not touch it, then it is: virtual
- If you can neither see it nor touch it, then: it's gone!
Be careful with new technology ;-)

QUESTIONS?