IBM SOLUTIONS FOR A PEACEFUL NIGHT'S SLEEP Danijel Paulin, Systems Architect, IBM Croatia
Presentation contents: Introduction; Business Continuity; RPO; RTO; Business Continuity Tiers; IBM methodology; Overview of IBM solutions
Importance of Business Resilience
With the constant emergence of new regulations, security threats, and service outages ranging from nature to deliberate attacks to human error, uptime for IT business is increasingly essential.
Non-resilience affects: growth; business risk; competitive posture; compliance with regulations.
"Business Resilience is an integral thread that runs through the entire operation." (An electronics manufacturer)
Source: IGS FactPoint Study, May 2004
Requirements for IT Business Continuity in a Time To Market world
Recovery times must be repeatable and reliable: this allows business continuity processes to be built upon a reliable, consistent recovery time.
Large scalability: recovery times must be known even as the system scales. In today's Time To Market world, it is unacceptable not to have assured scalability.
Testing must be affordable and nearly continuous: repeatable, reliable, scalable IT business continuity can only be assured through testing that can affordably be performed often.
Fundamentally, this requires end-to-end automation and testing.
Industry standard definitions
Business Resilience (BR) - The ability of the business to rapidly adapt and respond to opportunities, regulations and risks, in order to maintain secure and continuous business operations, be a more trusted partner, and enable growth. Business Resilience spans business strategy, organizational structure, business and IT processes, IT infrastructure, applications and data, and facilities. It arises from the implementation and management of a plan that ensures high availability through monitoring and automatic adjustment of redundant or virtualized infrastructure components. Disaster Recovery (DR) is one component of an overall Business Resilience Plan.
Continuous Availability (CA) - The attribute of a system to deliver non-disruptive service to the end user 7 days a week, 24 hours a day (there are no planned or unplanned outages).
Continuous Operations (CO) - The attribute of a system to operate continuously and mask planned outages from end users. It employs non-disruptive hardware and software changes, non-disruptive configuration, and software coexistence.
High Availability (HA) - The attribute of a system to provide service during defined periods, at acceptable or agreed-upon levels, and mask unplanned outages from end users. It employs fault tolerance, automated failure detection, recovery, bypass, reconfiguration, testing, and problem and change management.
What is needed to provide Business Continuity?
Business Continuity combines three elements:
High Availability: tight integration with server failover; supports 24x7.
Continuous Operations: non-disruptive backups; non-disruptive planned outages.
Disaster Recovery: protects against unplanned outages such as disasters and site outages.
Goals: critical business data is protected; operations continue after a disaster; recovery is predictable and reliable; costs are predictable and manageable.
Does your system really need HA/DR/BC?
What is the target recovery time? Minutes? Hours? Days?
Costs associated with implementing and maintaining an HA or DR solution: redundant hardware; inter-site networking; operations staff; system maintenance and updates (a major factor in a dynamic environment).
HA/DR/BC is a balance of recovery time requirements and cost!
Cost of Outage vs. Cost of Solution
[Figure: cost plotted against time, with two curves, "Cost of Solution and Time to Recover" and "Cost of Outage over Time", and a marked cost/time window]
Find the balance
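The balance between the cost of an outage and the cost of a solution can be sketched as a small calculation. The following is a minimal sketch, not a sizing method: the option names, the cost figures, the outage frequency and the simple linear cost model are all assumptions made for illustration and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    annual_solution_cost: float  # hypothetical yearly cost of the solution
    recovery_hours: float        # hypothetical time to recover after an outage


def expected_annual_cost(option, outage_cost_per_hour, outages_per_year):
    """Solution cost plus the expected cost of downtime while recovering."""
    downtime_cost = option.recovery_hours * outage_cost_per_hour * outages_per_year
    return option.annual_solution_cost + downtime_cost


def find_balance(options, outage_cost_per_hour, outages_per_year):
    """Return the option with the lowest combined cost."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, outage_cost_per_hour, outages_per_year),
    )


# Illustrative figures only; none of them come from the presentation.
OPTIONS = [
    RecoveryOption("restore from tape", 20_000, 48.0),
    RecoveryOption("point-in-time disk copy", 120_000, 6.0),
    RecoveryOption("automated site failover", 600_000, 0.25),
]

if __name__ == "__main__":
    for hourly in (1_000, 10_000, 500_000):
        best = find_balance(OPTIONS, hourly, outages_per_year=0.5)
        print(f"outage cost {hourly}/h -> {best.name}")
```

With these made-up numbers the cheapest overall option moves from tape restore to automated failover as the hourly cost of an outage rises, which is the trade-off the two curves describe.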
Business Continuance Objectives
Business objectives for Disaster Recovery:
Recovery Time Objective (RTO): how long can you afford to be without your systems?
Recovery Point Objective (RPO): when it is recovered, how much data can you afford to recreate?
Network Recovery Objective (NRO): how long to switch over the network?
Degraded Operations Objective (DOO): what will be the impact on operations with fewer data centers?
[Figure: clock face showing the RPO as the interval between the last backup and the application failure, during application processing]
The real solution is selected based on the particular cost curve slope:
If I spend a little more, how much faster is Disaster Recovery?
If I spend a little less, how much slower is Disaster Recovery?
Understanding the recovery time vs. cost curve is the key to selecting solution(s).
The Recovery Time and Recovery Point Objectives are critical to establishing the Disaster Definition.
[Figure: recovery timeline. Technology recovery: last off-site vital records backup, event occurs, system available. The RPO covers the off-site storage window up to the event (lost data); the RTO covers system, communications and data recovery while the system is unavailable. Business recovery: normal procedures, all manual procedures, forward recovery and data recreation, user recovery, normal procedures.]
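The two objectives can be checked against a recovery timeline with simple date arithmetic. This is a minimal sketch under stated assumptions: the function names and the example timestamps are hypothetical, and the RPO is measured from the last recoverable copy to the event, the RTO from the event to the moment the system is available again.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, event: datetime) -> timedelta:
    """Data written after the last recoverable copy is lost and must be recreated."""
    return event - last_recoverable_copy


def achieved_rto(event: datetime, system_available: datetime) -> timedelta:
    """Time during which the system was unavailable."""
    return system_available - event


def meets_objectives(last_copy, event, available, rpo_target, rto_target) -> bool:
    """True when both the data-loss window and the outage fit the targets."""
    return (achieved_rpo(last_copy, event) <= rpo_target
            and achieved_rto(event, available) <= rto_target)


# Hypothetical timeline: nightly off-site backup, failure the next afternoon.
LAST_COPY = datetime(2010, 3, 1, 22, 0)
EVENT = datetime(2010, 3, 2, 14, 30)
AVAILABLE = datetime(2010, 3, 3, 2, 30)

if __name__ == "__main__":
    print("RPO achieved:", achieved_rpo(LAST_COPY, EVENT))
    print("RTO achieved:", achieved_rto(EVENT, AVAILABLE))
```

In this made-up timeline a daily off-site backup leaves 16.5 hours of work to recreate, so it would satisfy a 24-hour RPO but not a 4-hour one.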
Business Continuity Tiers
[Chart: cost / value against Recovery Time Objective (guidelines only); axis marks: 15 Min., 1-4 Hr., 4-8 Hr., 8-12 Hr., 12-16 Hr., 24 Hr., Days; regions: recovery from a disk image, recovery from a tape copy]
BC Tier 7: Server or storage replication with end-to-end automated server recovery
BC Tier 6: Real-time continuous data availability, server or storage
BC Tier 5: Application/database integration
BC Tier 4: Point in Time replication, Tiered Storage
BC Tier 3: VTL, Data De-Dup, Remote vault
BC Tier 2: Tape libraries + Automation
BC Tier 1: Restore from Tape
Blend solutions to maximize application coverage at optimum cost.
Rule of thumb: in tiers 5-7, network costs can be 50% of TCO over 3 years.
Tier 0: There is no saved information, no documentation, and no backup hardware. Typical recovery time is unpredictable; it may not be possible to recover at all.
Tier 1: Pickup Truck Access Method (PTAM); disk subsystem or tape based mirroring to locations without processors. [Diagram label: daily] Product: IBM Tivoli Storage Manager.
Tier 2: PTAM with hot site available. [Diagram labels: daily; at recovery time] Product: IBM Tivoli Storage Manager.
Tier 3: Electronic vaulting of data (some mission critical data). [Diagram labels: daily; at recovery time] Product: IBM Tivoli Storage Manager - Disaster Recovery Manager.
Tier 4: High bandwidth connections; batch/online database shadowing and journaling; greater data currency and faster recovery; a more disk based solution. [Diagram labels: daily; at recovery time] Products: Global Copy, Global Mirror, FlashCopy, VTL, IBM Tivoli Storage Manager - Disaster Recovery Manager, TPC-R.
Tier 5: High bandwidth connections; software two-phase commit; requirement for consistency of data; little to no data loss in such solutions.
Tier 6: Application independent solutions with data consistency; high bandwidth connections; highest levels of data currency; no dependence on the applications. Products: GDPS/MM HS Manager, XRC, Global Mirror, TPC-R, AIX Logical Volume Mirroring.
Tier 7: Application independent solutions with data consistency; high bandwidth connections; automation. Products: GDPS/MM (open LUN Mgmt, HyperSwap), GDPS/XRC, GDPS/MGM, PowerHA/XD, GDOC.
IBM End to End approach to IT Business Continuity Solution
[Diagram labels: IT End to End Automation; Data Availability; Application Availability; System Availability; Network Availability; Disk Mirroring; Server Clustering; System Automation; Network Automation]
Business Continuity is a process and people design, not a technology
40% Technology: hardware and software capabilities.
60% Process and People. Process: definition/design, compliance and continuous improvement. People: roles and responsibilities, management, skills development and discipline.
Ideal Business Continuity design
Business processes drive strategies and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.
[Diagram: Resilience Program Management cycle with the phases Business Prioritization, Integration into IT, and Manage. Labels: risk assessment (risks, vulnerabilities and threats); business impact analysis (impacts of outage, RTO/RPO); program assessment (maturity model, measure ROI, roadmap, current capability); program design (crisis team, business resumption, disaster recovery, high availability); strategy design (high availability design: high availability servers, storage, data replication, database and software design); implement; program validation (estimated recovery time); awareness, regular validation, change management. Layers: 1. People 2. Processes 3. Plans 4. Strategies 5. Networks 6. Platforms 7. Facilities]
Source: IBM STG, IBM Global Services
IT Business Continuity Solution Selection Methodology
[Diagram: the CEO states "We need to be online 24x7"; the CIO concludes "Hmm... That means Oracle and SAP must be recovered"; these business requirements pass to the design team in three numbered steps]
Risk analysis and BIA / RTO RPO analysis turn the customer requirements into answers to the key questions.
Identify the BC Tier requirement and eliminate the BC automation candidates which do not meet all requirements; what remains are valid preliminary candidate solutions (initial solutioning).
The design team then performs the detailed evaluation.
Key IT Business Continuity Requirements Questions (in proper order):
1. What applications or databases to recover?
2. What platform? (z, p, i, x and Windows, Linux, heterogeneous open, heterogeneous z+open)
3. What is the desired Recovery Time Objective (RTO)?
4. What is the distance between the sites? (if there are 2 sites)
5. What is the connectivity, infrastructure, and bandwidth between sites?
6. What specific hardware equipment needs to be recovered?
7. What is the Level of Recovery? (Planned Outage, Unplanned Outage, Transaction Integrity)
8. What is the Recovery Point Objective?
9. What is the amount of data to be recovered (in GB or TB)?
10. Who will design the solution?
11. Who will implement the solution?
12. The remaining solutions are valid choices to give to the detailed DR evaluation team.
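The elimination step of the methodology can be sketched as a filter over candidate solutions. This is a minimal sketch, assuming each candidate can be described by a handful of attributes taken from the key questions; the candidate names and all attribute values below are hypothetical placeholders, not properties of any IBM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    platforms: frozenset      # platforms the candidate supports
    max_distance_km: float    # longest supported distance between sites
    rto_hours: float          # recovery time the candidate can deliver
    rpo_seconds: float        # data loss window the candidate can deliver


@dataclass(frozen=True)
class Requirements:
    platform: str
    distance_km: float
    rto_hours: float
    rpo_seconds: float


def eliminate(candidates, req):
    """Keep only the candidates that meet every requirement."""
    return [
        c for c in candidates
        if req.platform in c.platforms
        and c.max_distance_km >= req.distance_km
        and c.rto_hours <= req.rto_hours
        and c.rpo_seconds <= req.rpo_seconds
    ]


# Hypothetical candidates with made-up capabilities.
CANDIDATES = [
    Candidate("candidate-sync", frozenset({"open", "z"}), 300, 1, 0),
    Candidate("candidate-async", frozenset({"open", "z"}), 10_000, 4, 5),
    Candidate("candidate-tape", frozenset({"open"}), 10_000, 48, 86_400),
]

if __name__ == "__main__":
    req = Requirements(platform="open", distance_km=800, rto_hours=8, rpo_seconds=60)
    print([c.name for c in eliminate(CANDIDATES, req)])
```

Whatever survives the filter corresponds to the "valid preliminary candidate solutions" that go to the detailed evaluation team.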
IBM solutions
IBM Copy Services and Terminologies
FlashCopy: point in time copy, within a storage system.
Metro Mirror: synchronous mirroring over metro distance (<300 km), from primary Site A to Site B.
Global Mirror: asynchronous mirroring, from primary Site A to an out-of-region Site B.
Metro / Global Mirror: three-site synchronous and asynchronous mirroring, from primary Site A to metro Site B and out-of-region Site C.
Available on (platforms listed across the four functions): DS8000*, DS6000*, ESS*, SAN Volume Controller*, DS4000/DS5000.
* Supported by TPC-R 4.1
IBM Portfolio for IBM Business Resiliency
[Chart: IBM products placed on the Business Continuity tiers; cost / value against Recovery Time Objective (guidelines only); axis marks: 15 Min., 1-4 Hr., 4-8 Hr., 8-12 Hr., 12-16 Hr., 24 Hr., Days]
Products shown: Tivoli Storage Productivity Center - Replication; GPFS - SONAS; GDOC (open systems); System z GDPS; Power Systems p, i: PowerHA; Tivoli Systems Automation; Metro Mirror / Global Mirror (DS8K, XIV, SVC, ...); Tivoli FlashCopy Manager; DB2 HA/DR; WebSphere MQSeries; Snapshot / FlashCopy (XIV, NAS, DS8K, ...); ProtecTIER; Tivoli Storage Manager.
BC Tier 7: Server or storage replication with end-to-end automated server recovery
BC Tier 6: Real-time continuous data availability, server or storage
BC Tier 5: Application/database integration
BC Tier 4: Point in Time replication, Tiered Storage
BC Tier 3: VTL, Data De-Dup, Remote vault
BC Tier 2: Tape libraries + Automation
BC Tier 1: Restore from Tape
Two Site Disk Mirroring Solutions
IBM Two Site Metro Mirror
Designed to provide: no data loss; ease of use, lower cost; industry-leading performance.
Supports: Open Systems, System i, System z, System z GDPS HyperSwap.
Write sequence (primary Volume A, secondary Volume B): the host sends the write request to Volume A; Volume A sends the I/O write to Volume B; Volume B confirms the I/O write; Volume A acknowledges to the host.
Approximately 35% of all ESS, DS6000 and DS8000 subsystems have a license for Metro Mirror.
Host side mirroring (LVM)
Designed to provide: no data loss; no site failover in case of primary storage failure; lower cost.
Supports: Open Systems, System i.
Write sequence: the host sends the I/O write to both Volume A (primary) and Volume B (secondary), and each volume confirms the write to the host.
Used for extremely high availability requirements.
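Because a synchronous write is acknowledged only after the secondary volume confirms it, every write pays at least one round trip between the sites. A rough estimate of that propagation delay, assuming light travels through fiber at about 200,000 km/s and assuming one round trip per write (both are simplifying assumptions; real links add switch, protocol and queuing overhead):

```python
# Assumption: propagation speed in optical fiber, about two thirds of c.
FIBER_KM_PER_SECOND = 200_000.0


def sync_write_delay_ms(distance_km: float, round_trips: int = 1) -> float:
    """Added latency per write caused purely by signal propagation."""
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips * 1000.0


if __name__ == "__main__":
    for km in (10, 100, 300):
        print(f"{km} km: about {sync_write_delay_ms(km):.2f} ms added per write")
```

At the 300 km metro distance limit quoted for Metro Mirror in this deck, the estimate comes to about 3 ms per write before any other overhead, which is why synchronous mirroring is confined to metro distances.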
IBM Two Site Global Mirror
[Diagram: primary A and secondary B connected across SANs by Global Copy, with a FlashCopy at the secondary; labels: native performance, performance transmission, consistent data]
Designed to provide:
Unlimited global distance.
Reliable data currency: 3-5 seconds (bandwidth permitting).
Scalability: up to 8 subsystems (17 with IBM RPQ).
Heterogeneous consistency group: System z, System i, open systems data.
Native application performance.
Ease of use, lower cost: no active external controlling software or server cycles required to form consistency groups; dynamic creation/deletion of Global Mirror configuration.
Customer example: a large European bank using Global Mirror (Top 50 on Fortune Magazine's list of the world's 500 largest companies): 35 TB, 15 ESSs in mirrored config, 27,000 IO/sec, RPO 4 to 7 seconds.
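The difference between the "no data loss" of synchronous mirroring and the seconds of data currency of asynchronous mirroring can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the `MirroredPair` class is hypothetical, it ships records one by one, and it does not model how Global Mirror actually forms consistency groups.

```python
from collections import deque


class MirroredPair:
    """Toy model of a primary volume mirrored to a secondary volume."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.secondary = []
        self._pending = deque()  # writes accepted but not yet at the secondary

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the write completes only once the secondary has it.
            self.secondary.append(record)
        else:
            # Asynchronous: the host is acknowledged first, shipping comes later.
            self._pending.append(record)

    def ship(self, count=None):
        """Send up to `count` pending writes to the secondary (all if None)."""
        n = len(self._pending) if count is None else min(count, len(self._pending))
        for _ in range(n):
            self.secondary.append(self._pending.popleft())
        return n

    def lost_if_primary_fails(self):
        """Records that exist only at the primary at this moment."""
        return self.primary[len(self.secondary):]


if __name__ == "__main__":
    pair = MirroredPair(synchronous=False)
    for i in range(5):
        pair.write(i)
    pair.ship(3)
    print("at risk:", pair.lost_if_primary_fails())
```

The records still waiting to be shipped are what the RPO measures: with enough bandwidth the backlog stays small, which matches the "bandwidth permitting" qualifier on the slide.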
Business Continuity with SAN Volume Controller (SVC)
Traditional SAN: replication APIs differ by vendor; the replication destination must be the same as the source; different multipath drivers for each array; lower-cost disks offer primitive or no replication services.
SAN Volume Controller: a common, SAN-wide replication API that does not change as storage hardware changes; a common multipath driver for all arrays; replication targets can be on lower-cost disks, reducing the overall cost of exploiting replication services.
[Diagram: a traditional SAN with per-vendor FlashCopy, PPRC, TimeFinder and SRDF over IBM DSx and EMC Sym arrays, compared with a SAN fronted by SVC over IBM DSx, IBM DS4x, EMC Sym, HP MA and IBM S-ATA storage]
SVC - Virtual Disk Mirroring
SVC stores two copies of a virtual disk, usually on separate disk systems.
SVC keeps both copies in sync and writes to both copies.
If the disk supporting one copy fails, SVC provides continuous data access by using the other copy.
Copies are automatically resynchronized after repair.
Intended to protect critical data against failure of a disk system or disk array.
A local high availability function, not a disaster recovery function.
Copies can be split; either copy can continue as the production copy.
Either or both copies may be space-efficient.
SVC Split I/O Group
[Diagram: one SVC I/O group split across two sites, Node 1 at Site 1 and Node 2 at Site 2, joined by intersite links ISL 1 and ISL 2, with VDisk Mirroring between the sites and SVC quorum disks on a disk system that supports Extended Quorum]
Automated failover, with SVC handling the loss of: an SVC node; a quorum disk; a storage subsystem.
Can incorporate MM/GM to provide disaster recovery, giving a 3-site-like capability.
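The mirroring behaviour described above (write to both copies, keep serving from the surviving copy, resynchronize after repair) can be sketched in a few lines. This is a toy model for illustration only: the `MirroredVolume` class and its change tracking are assumptions, not a description of how SVC implements Virtual Disk Mirroring.

```python
class MirroredVolume:
    """Toy model of a virtual disk kept as two copies."""

    def __init__(self):
        self.copies = [{}, {}]          # block number -> data, one dict per copy
        self.online = [True, True]
        self.stale = [set(), set()]     # blocks each offline copy has missed

    def write(self, block, data):
        if not any(self.online):
            raise IOError("no copy available")
        for i in (0, 1):
            if self.online[i]:
                self.copies[i][block] = data
            else:
                self.stale[i].add(block)  # remember what to resync later

    def read(self, block):
        for i in (0, 1):
            if self.online[i]:
                return self.copies[i].get(block)
        raise IOError("no copy available")

    def fail(self, i):
        self.online[i] = False

    def repair(self, i):
        """Bring copy i back and resynchronize only the blocks it missed."""
        source = 1 - i
        if not self.online[source]:
            raise IOError("no in-sync copy to resynchronize from")
        for block in self.stale[i]:
            self.copies[i][block] = self.copies[source][block]
        resynced = len(self.stale[i])
        self.stale[i].clear()
        self.online[i] = True
        return resynced


if __name__ == "__main__":
    vol = MirroredVolume()
    vol.write(0, "a")
    vol.fail(0)
    vol.write(0, "b")
    print(vol.read(0), vol.repair(0))
```

Tracking only the blocks written while a copy was offline keeps the resynchronization proportional to the changes rather than to the size of the disk.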
Three Site Disk Mirroring Solutions
Requirements of three site disk mirroring
[Diagram: Metro/Global Mirror A->B->C, with a Metro leg from A to B and a Global leg to C]
Fast failover / failback to any site.
Fast re-establishment of 3-site recovery, without production outages.
Quickly resynchronize any site with incremental changes only.
Links and bandwidth are assumed between all sites.
Three Site IBM Metro / Global Mirror
[Diagram: cascading configuration, Metro Mirror from A to B and Global Mirror from B to C]
Designed to provide:
Performance and scalability: Metro Mirror distance and performance; Global Mirror 3 to 5 seconds data currency; Global Mirror tolerance of bandwidth constraints.
All 3-site requirements satisfied: fast failover / failback to any site; fast re-establishment of 3-site recovery without production outages.
Ease of use: autonomic, self-monitoring.
Lower Total Cost of Ownership: lower cache and SAN / HBA port requirements compared to many competitive alternatives.
Three Site IBM Metro / Global Mirror Overview
Cascaded configuration: Metro Mirror (sync PPRC) from the production systems at site A to site B for Continuous Availability, and cascaded Global Mirror (async) from site B to site C for Disaster Recovery.
Metro Mirror provides synchronous mirroring at metro distance from site A to site B.
Load on controller A is reduced compared to a multi-target configuration.
Global Mirror provides regional Disaster Recovery at site C.
Planned or unplanned switch to site B: Disaster Recovery coverage continues on the B-C leg; no reconfiguration is necessary.
After a switch to site B: full ability to re-establish B-A-C recoverability while production continues to run at B, using incremental change resync.
In the event of a site B outage, or failure of links A-B or B-C: A-C recoverability can be established quickly using incremental change resync, and A-B-C recoverability can be quickly re-established, again by incremental change resync, as soon as the links or site B recover.
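The failure scenarios described for the cascaded three-site configuration can be summarized as a small lookup. This is a minimal sketch that encodes only what the slides state; the function name and the failure labels are hypothetical, and the result lists only the replication legs the slides mention for each case.

```python
def topology_after(failure=None):
    """Return the production site and the replication legs for a scenario."""
    if failure is None:
        # Normal cascaded operation: Metro Mirror A-B, Global Mirror B-C.
        return {"production": "A", "legs": [("A", "B"), ("B", "C")]}
    if failure == "site A":
        # Planned or unplanned switch to B: coverage continues on the B-C leg.
        return {"production": "B", "legs": [("B", "C")]}
    if failure in ("site B", "link A-B", "link B-C"):
        # A-C recoverability is established with incremental change resync.
        return {"production": "A", "legs": [("A", "C")]}
    raise ValueError(f"unknown scenario: {failure}")


if __name__ == "__main__":
    for scenario in (None, "site A", "site B", "link A-B", "link B-C"):
        print(scenario, "->", topology_after(scenario))
```

In every listed scenario at least one replication leg remains, which is the point of the "fast re-establishment without production outages" requirement.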
TPC for Replication
Volume level Copy Service management: manages data consistency across a set of volumes with logical dependencies; coordinates Copy Service functions across different hardware (FlashCopy, Metro Mirror, Global Mirror, Metro Global Mirror).
Ease of use: single common point of control; web browser based GUI and CLI; persistent store database; source / target volume matching; SNMP alerts; wizard based configuration.
Business Continuity: site awareness; high availability configuration with active and standby management servers; no single point of failure; disaster recovery testing; disaster recovery management.
GDPS solutions
Continuous Availability of Data within a Data Center (single data center): applications remain active; continuous access to data in the event of a storage subsystem outage. Offerings: GDPS/HyperSwap Mgr, Basic HyperSwap.
Continuous Availability / Disaster Recovery within a Metropolitan Region (two data centers): systems may remain active; multi-site workloads can withstand site and/or storage failures. Offerings: GDPS/HyperSwap Mgr, GDPS/PPRC.
Disaster Recovery at Extended Distance (two data centers): rapid systems disaster recovery with seconds of data loss; disaster recovery for out-of-region interruptions. Offerings: GDPS/GM, GDPS/XRC.
Continuous Availability Regionally and Disaster Recovery at Extended Distance (three data centers, A, B and C): high availability for site disasters; disaster recovery for regional disasters. Offerings: GDPS/MGM, GDPS/MzGM.
GDOC solution (VERITAS Cluster Server)
Large range of platforms supported, including AIX, HP-UX, Linux, Solaris and Windows.
High availability clusters with disaster recovery within a metropolitan area (MAN).
High availability clusters with extended distance disaster recovery (WAN).
Supports the following IBM replication technologies: SVC (Metro Mirror and Global Mirror); DS8000 (Metro Mirror and Global Mirror); DS5000 (ERM Sync); XIV (Sync; Windows and Solaris).
Firedrill capability integrates the use of PiT copy for DR testing.
[Diagram: pairs of VERITAS Cluster Server (VCS) clusters linked by GCO]
VMware Site Recovery Manager (SRM)
Site Recovery Manager is the VMware product for disaster recovery.
Hardware replication support is provided by vendor Storage Replication Adapters.
Supports the following IBM replication technologies: DS8000 (Metro Mirror and Global Mirror); SVC (Metro Mirror and Global Mirror); XIV (Sync and Async); DS4/5000 (ERM Sync and Async); N Series (Sync, Async and Semi Sync).
Tivoli System Automation Application Manager
[Architecture diagram] Web-based operations: Operations Console and Policy Editor. Automation J2EE framework: WebSphere with ISC, hosting the Automation Engine and the Abstract Resource Model (policy).
SA AppMan adapters ensure communication with the systems and automated applications: MSCS (Windows on PCs); SA MP (Linux on PCs; AIX and Linux on System p; Linux on System i; Linux on System z); HACMP (AIX on System p); SA z/OS (z/OS on System z); VCS (Sun Solaris).
Tivoli System Automation Application Manager: Distributed DR Manager, planned site switch
Steps:
1. Operator initiates planned site switch.
2. SA AM triggers TPC-R to switch replication direction.
3. SA AM starts application components on site II.
[Diagram: Site I and Site II each run HA clusters for Web, WAS, DB2 and SAP/DB2 (nodes 1-2 and nodes 3-4), with a TPC-R session over primary and secondary DSxxxx storage; SA App Man is operated from a browser client]
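The three steps of the planned site switch can be sketched as a short orchestration routine. This is a minimal sketch with hypothetical stand-ins: `ReplicationSession` and `planned_site_switch` are invented for illustration and are not the TPC-R or SA AppMan interfaces, and the order in which the application tiers start is whatever the caller passes in.

```python
class ReplicationSession:
    """Hypothetical stand-in for a replication session between two sites."""

    def __init__(self, primary: str, secondary: str):
        self.primary = primary
        self.secondary = secondary

    def switch_direction(self):
        # After the switch, the former secondary site holds the primary volumes.
        self.primary, self.secondary = self.secondary, self.primary


def planned_site_switch(session, application_tiers):
    """Run the three steps and return a log of what was done."""
    log = []
    target = session.secondary
    # Step 1: the operator initiates the planned site switch.
    log.append(f"operator initiates planned switch to {target}")
    # Step 2: the replication direction is reversed.
    session.switch_direction()
    log.append(f"replication now runs {session.primary} -> {session.secondary}")
    # Step 3: the application components are started on the target site.
    for tier in application_tiers:
        log.append(f"start {tier} on {target}")
    return log


if __name__ == "__main__":
    session = ReplicationSession("Site I", "Site II")
    for line in planned_site_switch(session, ["DB2", "WAS", "Web"]):
        print(line)
```

Keeping the whole sequence in one routine is what makes the switch repeatable and testable, in line with the earlier requirement for end-to-end automation.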
GDPS Distributed Cluster Management (DCM)
Coordinated actions for distributed cluster recovery: end-to-end recovery solution; helps optimize operations; helps meet enterprise-level RTO and RPO.
Integrated, industry-unique, automated DCM support added in conjunction with: Symantec Veritas Cluster Server (GDPS/PPRC and GDPS/XRC); Tivoli System Automation Application Manager (GDPS/PPRC only).
GDPS/MM DCM for Tivoli SA Application Manager
[Diagram labels: GDPS z/OS Sysplex; GDPS K-System; Site 1; Site 2; SA AppMan; Metro Mirror; clustered distributed applications]
Business Resiliency: IBM Offerings
Customer profiles: client looking for guidance with the resiliency of their business; client interested in HA awareness (new HA technology and BR approach); customer interested in getting a high level view of his IT availability state; client with a specific HA/DR project and ready to get started designing a solution; client with availability concerns; client looking for help in building an availability plan.
GTS / STG free offerings: DI Leadership Center (High Availability Center of Competency); Resilient Business Infrastructure (RBI); Business Resiliency Briefing; Business Resiliency QuickCheck; DI Business Resiliency Exploration Session; assessment phase.
Service and product plays, for fee (STG, SWG, GTS, GBS): STG product implementation (storage, server, PowerHA, ...); Tivoli offerings (monitoring, backup / restore solution etc.); software HA solutions (DB2, WebSphere etc.); Resiliency Consulting (SPL5) (BIA, risk assessment, ...); GBS offerings (HA application design etc.); Service management (SPL1); site and facility services (DR sites etc.); On Ramp offerings.
Useful links
http://www.ibm.com/services/us/bcrs/self-assessment/index.html?ibmmerch
http://www-03.ibm.com/systems/business_resiliency/
http://www.redbooks.ibm.com/abstracts/sg246547.html?open
http://www.redbooks.ibm.com/abstracts/sg246548.html?open
http://www.redbooks.ibm.com/abstracts/sg246684.html?open
Thank you for your attention
Contact: Danijel Paulin, Systems Architect, IBM Hrvatska, Miramarska 23, Zagreb. Phone: +385 91 6308206. E-mail: danijel.paulin@hr.ibm.com. www.ibm.com/hr/hr/