Building Continuous Cloud Infrastructures
Deepak Verma, Senior Manager, Data Protection Products & Solutions
John Harker, Senior Product Marketing Manager
October 8, 2014
WebTech Educational Series: Building Continuous Cloud Infrastructures

In this WebTech, Hitachi design experts cover what is needed to build Continuous Cloud Infrastructures: servers, networks, and storage. Geographically distributed, fault-tolerant, stateless designs allow efficient distributed load balancing, easy migrations, and continuous uptime in the face of individual system-element or site failures. Starting with distributed stretch-cluster server environments, learn how to design and deliver enterprise-class cloud storage with the Hitachi Storage Virtualization Operating System and Hitachi Virtual Storage Platform G1000.

In this session, you will learn:
- Options for designing Continuous Cloud Infrastructures from an application point of view
- Why a stretch-cluster server operating environment is important to Continuous Cloud Infrastructure system design
- How Hitachi global storage virtualization and global-active devices can simplify and improve server-side stretch-cluster systems
Application Business Continuity Choices
- Types of failure scenarios
- Locality of reference of a failure
- How much data can we lose on failover? (RPO)
- How long does recovery take? (RTO)
- How automatic is failover?
- How much does the solution cost?
Types of Failure Events & Locality of Reference
[Quadrant chart: probability on one axis, cost on the other.]

Logical failure, localized recovery:
- Probability: High. Causes: human error, bugs
- Desired RTO/RPO: Low/Low
- Remediation: savepoints, logs, backups, point-in-time snapshots. Cost: $

Logical failure, remote recovery:
- Probability: Low. Causes: rolling disasters
- Desired RTO/RPO: Medium/–
- Remediation: remote replication with point-in-time snapshots. Cost: $$$$

Physical failure, localized recovery:
- Probability: Medium. Causes: hardware failure
- Desired RTO/RPO: Zero/Zero
- Remediation: local high-availability clusters and storage. Cost: $$

Physical failure, remote recovery:
- Probability: Very low. Causes: immediate site failure
- Desired RTO/RPO: Low/Zero-Low
- Remediation: synchronous replication, remote high availability. Cost: $$$
Understanding RPO and RTO
[Timeline diagram: an outage occurs partway through the day. RPO is measured backward from the outage (hours → seconds → zero data loss); RTO is measured forward from it (zero → seconds → hours until service is recovered).] A worked example follows below.
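To make the two metrics concrete, here is a minimal Python sketch (the timestamps are illustrative assumptions, not from the deck) that derives RPO and RTO from an outage timeline:

```python
from datetime import datetime

# Hypothetical timeline (all timestamps are illustrative assumptions).
last_good_copy = datetime(2014, 10, 8, 4, 0)     # last replicated/backed-up state
outage_start = datetime(2014, 10, 8, 6, 30)      # failure event
service_restored = datetime(2014, 10, 8, 10, 0)  # application back online

# RPO: how much data can be lost -- the gap between the last
# recoverable copy and the moment of failure.
rpo = outage_start - last_good_copy

# RTO: how long recovery takes -- the gap between the failure
# and the point where service is available again.
rto = service_restored - outage_start

print(f"RPO = {rpo}")  # 2:30:00 -- nightly backups put RPO in hours
print(f"RTO = {rto}")  # 3:30:00 -- manual restores put RTO in hours
```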
Data Protection Options
- Traditional approach: multipathing, clusters, backups
- Application- or database-driven
- Storage-array-based replication: remote and local protection
- Appliance-based solutions: stretched clusters, quorums
- Array-based high availability
Traditional Data Protection Approach
[Diagram: clustered app/DB servers with buffers at each site; app & DB backups with DB logs to tape; tape moved offsite by truck, tape copy, or VTL replication; app & DB restore on the far side.]
- Focus has been on server failures at the local site only
- Coupled with enterprise storage for higher localized uptime
- Logical failures and rolling disasters have high RPO/RTO
- Scalability and efficiency are at odds
- Recovery involves manual intervention and scripting

|     | Local Physical | Local Logical | Remote Physical | Remote Logical & Physical |
|-----|----------------|---------------|-----------------|---------------------------|
| RPO | 0*             | 4-24 hrs.     | 8-48 hrs.       | 8-48 hrs.                 |
| RTO | 0*             | 4-8 hrs.      | 4+ hrs.         | 4+ hrs.                   |

Caveats: * Assumes HA for every component and a cluster-aware application.
Application-Based Data Protection Approach
[Diagram: as above, with application-level data transfer between the clustered sites in addition to tape-based backup and restore.]
- Reduces remote physical recovery times
- Requires additional standby infrastructure and licenses
- Consumes processing capacity of the application/DB servers
- Specific to each application type, OS type, etc.
- Fail-back involves manual intervention and scripting

|     | Local Physical | Local Logical | Remote Physical  | Remote Logical & Physical |
|-----|----------------|---------------|------------------|---------------------------|
| RPO | 0*             | 4-24 hrs.     | 0-4 hrs.#        | 8-48 hrs.                 |
| RTO | 0*             | 4-8 hrs.      | 15 min.-4 hrs.#  | 4+ hrs.                   |

Caveats: * Assumes HA for every component and a cluster-aware application. # Network latency and application overhead dictate values.
Array-Based Data Protection Approach
[Diagram: clustered app/DB servers with offline app/DB standbys at the remote site; array-based block sync or async replication with single-I/O consistency (the sync vs. async trade-off is sketched below); app/DB-aware local and remote array clones/snaps of DB and logs; optional batch copy to tape.]
- Reduces recovery times across the board
- No additional standby infrastructure, licenses, or compute power
- Generic to any application type, OS type, etc.
- Fail-back as easy as fail-over, with some scripting
- No application awareness; usually crash-consistent

|     | Local Physical | Local Logical   | Remote Physical | Remote Logical & Physical |
|-----|----------------|-----------------|-----------------|---------------------------|
| RPO | 0*             | 15 min.-24 hrs. | 0-4 hrs.#       | 15 min.-24 hrs.           |
| RTO | 0*             | 15 min.         | 5-15 min.       | 15 min.                   |

Caveats: * Assumes HA for every component and a cluster-aware application. # Network latency and application overhead dictate values.
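The sync/async trade-off behind these RPO numbers can be modeled in a few lines of Python. This is an illustrative sketch only (the class names and the simulated link latency are assumptions, not any array's microcode): synchronous replication pays the link round trip on every write but keeps RPO at zero, while asynchronous replication acknowledges immediately and accepts the un-shipped journal as RPO exposure.

```python
import queue
import threading
import time

class RemoteArray:
    """Stand-in for the secondary array at the DR site."""
    def __init__(self, latency=0.05):
        self.latency = latency          # simulated link round trip
        self.blocks = []
    def apply(self, block):
        time.sleep(self.latency)        # distance costs time
        self.blocks.append(block)

class SyncReplicator:
    """Synchronous: each host write is acknowledged only after the
    remote array commits it. RPO is zero; host latency includes the link."""
    def __init__(self, remote):
        self.remote = remote
    def write(self, block):
        self.remote.apply(block)        # wait for the remote commit
        return "ack"

class AsyncReplicator:
    """Asynchronous: writes are journaled locally and shipped in the
    background. Host latency stays low; un-shipped journal = RPO exposure."""
    def __init__(self, remote):
        self.remote = remote
        self.journal = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()
    def write(self, block):
        self.journal.put(block)         # acknowledge immediately
        return "ack"
    def _drain(self):
        while True:
            self.remote.apply(self.journal.get())

t0 = time.time()
SyncReplicator(RemoteArray()).write(b"block-1")
print(f"sync write latency:  {time.time() - t0:.3f}s")  # pays the link cost

t0 = time.time()
AsyncReplicator(RemoteArray()).write(b"block-1")
print(f"async write latency: {time.time() - t0:.3f}s")  # near zero
```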
Appliance-Based High-Availability Approach
[Diagram: extended cluster across sites with replication appliances in the data path at each site holding DB and logs, a quorum between them, plus tape-based backup and restore.]
- Takes remote physical recovery times to zero
- Combine with app/DB/OS clusters for true 0 RPO & RTO
- Introduces complexity (connectivity, quorum), risk, and added latency to performance
- Does not address logical recovery RPO and RTO

|     | Local Physical | Local Logical | Remote Physical | Remote Logical & Physical |
|-----|----------------|---------------|-----------------|---------------------------|
| RPO | 0*             | 4-24 hrs.     | 0#              | 8-48 hrs.                 |
| RTO | 0*             | 4-8 hrs.      | 0#              | 4+ hrs.                   |

Caveats: * Assumes HA for every component and a cluster-aware application. # Synchronous distances, coupled with app/DB/OS geo-clusters.
Array-Based H/A + Data Protection Approach
[Diagram: extended cluster with offline app/DB standbys; array-based bi-directional block high-availability sync or async copy between sites with a quorum; app/DB-aware local and remote array clones/snaps of DB and logs, each with single-I/O consistency; optional copy to tape.]
- Takes remote physical recovery times down to zero
- Generic to any application type, OS type, etc.
- No performance impact; a built-in capability of the array
- Combine with app/DB/OS clusters for true 0 RPO & RTO
- Fail-back as easy as fail-over, no scripting
- Combined with snaps/clones for dual logical protection

|     | Local Physical | Local Logical   | Remote Physical | Remote Logical & Physical |
|-----|----------------|-----------------|-----------------|---------------------------|
| RPO | 0*             | 15 min.-24 hrs. | 0#              | 15 min.-24 hrs.           |
| RTO | 0*             | 15 min.         | 0#              | 15 min.                   |

Caveats: * Assumes HA for every component and a cluster-aware application. # Synchronous distances, coupled with app/DB/OS geo-clusters.
Considerations for Moving to an Active-Active Highly Available Architecture
- Storage platform capable of supporting H/A
- Application/DB/OS clusters capable of using storage H/A functionality without impact
- Network capable of running dual-site workloads with low latency
- Quorum-site considerations to protect against split-brain or H/A downtime (see the sketch after this list)
- People and process maturity in managing active-active sites
- Coupled logical protection across both sites and third-site DR
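To see why a third quorum site matters, consider a minimal arbitration sketch (illustrative Python under simplified assumptions, not Hitachi's implementation): when the inter-site replication link drops, each array asks the quorum which site should keep serving I/O, so both sides can never accept writes independently.

```python
class QuorumDisk:
    """Third-site arbiter. The first site to claim it after a link
    failure wins; the loser must stop serving I/O to avoid split-brain."""
    def __init__(self):
        self.owner = None
    def claim(self, site):
        if self.owner is None:
            self.owner = site
        return self.owner == site

class StorageSite:
    def __init__(self, name, quorum):
        self.name, self.quorum = name, quorum
        self.serving = True
    def on_replication_link_down(self):
        # Without a quorum, both sites would keep writing (split-brain).
        if not self.quorum.claim(self.name):
            self.serving = False    # lost arbitration: fence ourselves off

quorum = QuorumDisk()
a, b = StorageSite("Site A", quorum), StorageSite("Site B", quorum)
a.on_replication_link_down()
b.on_replication_link_down()
print(a.serving, b.serving)   # True False -- exactly one site keeps I/O
```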
Options for Data Protection
[Timeline diagram: RPO runs backward from the outage (hours → seconds → zero) and RTO forward from it (zero → hours), with Hitachi offerings mapped along the spectrum:]
- Archive: Hitachi Content Platform
- Backup: Data Instance Manager, Data Protection Suite, Symantec NetBackup
- Application-aware snapshot and mirroring (operational recovery): HAPRO, HDPS IntelliSnap, Thin Image or in-system replication
- Disaster recovery: Universal Replicator (async), TrueCopy (sync)
- Operational resiliency: Universal Replicator (async), TrueCopy (sync), CDP with Data Instance Manager
- Transparent cluster failover (always on): global-active device
- Recovery methods along the timeline: restore/recover from snapshot, mirroring, and replication; restore from database logs and backup
Hitachi Storage Virtualization Operating System: Introducing Global Storage Virtualization
Virtual machines forever changed the way we see data centers. The Hitachi Storage Virtualization Operating System is doing the same for storage.
[Diagram: the server virtualization stack (applications → operating systems → virtual hardware → physical CPU, memory, NIC, drives, managed by the OS and VM file system) alongside the analogous storage stack (virtual storage identities → host I/O and copy management → virtual hardware → physical director, cache, front-end ports, media, managed by virtual storage software).]
Disaster Avoidance Simplified
- New SVOS global-active device (GAD)
- A virtual storage machine abstracts the underlying physical arrays from hosts (see the sketch after this list)
- Storage-site failover is transparent to the host and requires no reconfiguration
- When new global-active device volumes are provisioned from the virtual storage machine, they can be automatically protected
- Simplified management from a single pane of glass
[Diagram: compute HA cluster and storage HA cluster spanning Site A and Site B, presented through global storage virtualization as one virtual storage machine.]
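The virtual-storage-machine idea can be sketched as follows (a simplified Python model, not SVOS internals; the serial numbers and LDEV IDs are illustrative): hosts address a stable virtual identity and virtual LDEV numbers, while the mapping to physical arrays can change underneath without the host noticing.

```python
class PhysicalArray:
    def __init__(self, serial):
        self.serial = serial
        self.ldevs = {}          # physical LDEV id -> data

class VirtualStorageMachine:
    """One stable identity presented to hosts, backed by whichever
    physical arrays currently hold the data."""
    def __init__(self, virtual_serial):
        self.virtual_serial = virtual_serial
        self.map = {}            # virtual LDEV -> (array, physical LDEV)
    def provision(self, vldev, array, pldev):
        array.ldevs.setdefault(pldev, b"")
        self.map[vldev] = (array, pldev)
    def migrate(self, vldev, new_array, new_pldev):
        # Host-transparent: the virtual identity and LDEV never change.
        old_array, old_pldev = self.map[vldev]
        new_array.ldevs[new_pldev] = old_array.ldevs[old_pldev]
        self.map[vldev] = (new_array, new_pldev)
    def read(self, vldev):
        array, pldev = self.map[vldev]
        return array.ldevs[pldev]

site_a, site_b = PhysicalArray("53001"), PhysicalArray("53002")
vsm = VirtualStorageMachine("123456")
vsm.provision("10:01", site_a, "00:10")
site_a.ldevs["00:10"] = b"app data"
vsm.migrate("10:01", site_b, "00:20")  # move to the other array
print(vsm.read("10:01"))               # b'app data' -- host path unchanged
```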
Supported Cluster Environments
SVOS global-active device: OS + multipath + cluster software support matrix

| OS                    | Version      | Cluster             | August 2014 | 1Q2015    |
|-----------------------|--------------|---------------------|-------------|-----------|
| VMware                | 4.x, 5.x     | VMware HA (vMotion) | Supported   | Supported |
| IBM AIX               | 6.x, 7.x     | HACMP / PowerHA     | Supported   |           |
| Microsoft Windows     | 2008         | MSFC                | Supported   |           |
| Microsoft Windows     | 2008 R2      | MSFC                | Supported   |           |
| Microsoft Windows     | 2012         | MSFC                | Supported   |           |
| Microsoft Windows     | 2012 R2      | MSFC                | Supported   |           |
| Red Hat Linux         | 5.x, 6.x     | Red Hat Cluster     | Supported   |           |
| Red Hat Linux         | 5.x, 6.x     | VCS                 |             | Supported |
| Hewlett Packard HP-UX | 11iv2, 11iv3 | MC/SG               | Supported   |           |
| Hewlett Packard HP-UX | 11iv2, 11iv3 | SC                  |             | Supported |
| Oracle Solaris        | 10, 11.1     | VCS                 | Supported   |           |
| Oracle Solaris        | 10, 11.1     | Oracle RAC          | Supported   |           |
Hitachi SVOS Global-Active Device
Clustered active-active systems for hosts with apps requiring high availability (see the sketch after this list).
- Global storage virtualization presents a single virtual storage identity (123456) across both arrays
- Global-active device pairs virtual LDEVs 10:01 and 10:02, backed by Resource Group 1 (LDEVs 10:00-10:02) on one array and Resource Group 2 (LDEVs 20:00-20:02) on the other
- Writes go to multiple copies simultaneously, from multiple applications
- Reads are served locally, simultaneously, from multiple applications
- A quorum arbitrates between the arrays
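The active-active behavior can be modeled in a short sketch (illustrative Python; the real arrays enforce this in microcode with single-I/O consistency, which this toy model does not attempt):

```python
class GadPair:
    """Simplified model of a global-active device pair: writes are
    mirrored synchronously to both arrays; reads are served locally."""
    def __init__(self, array_a, array_b):
        self.copies = {"Site A": array_a, "Site B": array_b}
    def write(self, vldev, data):
        # Bi-directional synchronous write: both copies commit before the
        # ack, so either site can fail without losing acknowledged data.
        for array in self.copies.values():
            array[vldev] = data
        return "ack"
    def read(self, host_site, vldev):
        # Read-local: no cross-site round trip on the read path.
        return self.copies[host_site][vldev]

pair = GadPair({}, {})
pair.write("10:01", b"record-1")
assert pair.read("Site A", "10:01") == pair.read("Site B", "10:01")
```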
One Technology, Many Use Cases

Heterogeneous storage virtualization:
[Diagram: a host accesses global-active devices on a virtual storage machine; the physical storage machine (CPU, cache, ports) virtualizes media from external storage machines.]

Non-disruptive migration (preserve identity during migration):
[Diagram: the host's global-active devices and logical devices keep their virtual-storage-machine identity while data moves between physical storage machines.]

Multi-tenancy:
[Diagram: separate hosts access global-active devices on virtual storage machine #1 and virtual storage machine #2, both carved from one physical storage machine.]

Fault tolerance:
[Diagram: a host's global-active devices are mirrored across two physical storage machines behind one virtual storage machine.]

Application/host load balancing:
[Diagram: an application's global-active devices on one virtual storage machine span physical storage machines #1 and #2.]

Disaster avoidance and active-active data center:
[Diagram: a clustered NAS host pair accesses mirrored global-active devices on one virtual storage machine spanning physical storage machines at Site A and Site B.]
Delivering Always-Available VMware
- Extend native VMware functionality with or without vSphere Metro Storage Cluster
- Active/active over metro distances
- Fast, simple, non-disruptive migrations
- 3-data-center high availability (with SRM support)
- Hitachi Thin Image snapshot support
[Diagram: active production VMs at Site 1 and Site 2 in a VMware stretch cluster over a global-active device pair, with a quorum system at a third site.]
VMware Continuous Infrastructure Scenarios
Each scenario runs VMware ESX with Hitachi Dynamic Link Manager (HDLM), managed through HCS, with a quorum at a third site (path selection is sketched after this list).
- Application migration: read/write I/O switches to the local site's path
- Path/storage failover: ESX switches paths to the alternate site's path
- HA failover: VMware HA fails over the VM, and the local site's I/O path is used
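Path-failover logic of the kind a multipath driver provides can be sketched like this (an illustrative model only, not HDLM's actual policy engine; the path names are hypothetical):

```python
class MultipathDevice:
    """Prefers paths to the local array; fails over to the remote
    site's paths only when every local path is down."""
    def __init__(self, local_paths, remote_paths):
        self.local_paths = local_paths      # e.g. HBA ports to Site 1
        self.remote_paths = remote_paths    # cross-site alternates
        self.failed = set()
    def mark_failed(self, path):
        self.failed.add(path)
    def pick_path(self):
        for path in self.local_paths + self.remote_paths:
            if path not in self.failed:
                return path                 # first healthy path wins
        raise IOError("all paths failed")

dev = MultipathDevice(["site1-p0", "site1-p1"], ["site2-p0"])
dev.mark_failed("site1-p0")
dev.mark_failed("site1-p1")
print(dev.pick_path())   # 'site2-p0': I/O continues via the alternate site
```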
Delivering Always-Available Oracle RAC
An elegant distance extension to Oracle RAC.
- Active/active over metro distances
- Simplified designs; fast, non-disruptive migrations
- 3-data-center high availability
- Increased infrastructure utilization and reduced costs
[Diagram: active production servers at Site 1 and Site 2 in an Oracle RAC cluster over a global-active device pair, with a quorum at a third site.]
Delivering Always-Available Microsoft Hyper-V
- Active/active over metro distances
- Complements or avoids Microsoft geo-clustering
- Fast, simple, non-disruptive application migrations
- Hitachi Thin Image snapshot support
- Simple failover and failback
[Diagram: active production VMs at Site 1 and Site 2 in a Microsoft multisite/stretch cluster over a global-active device pair, with a quorum at a third site.]
Global-Active Device Management
Hitachi Command Suite (HCS) offers efficient management of global-active devices while providing central control of multiple systems (a takeover sketch follows after this list).
- A clustered HCS server is used; the local HCS server enables GAD management
- If the local site fails, the remote HCS server takes over GAD management
- The HCS database should be replicated with either TrueCopy or GAD
- Pair-management servers run the Hitachi Device Manager agent and CCI and are managed through Hitachi Replication Manager
- HCS management requests to configure and operate the HA mirror are issued via the command device
[Diagram: active HCS storage management at the primary site and passive HCS at the remote site, each with command devices and pair-management servers (HCS agent, CCI); HCS clustering and app/DBMS clustering span the sites; the HCS database is kept in sync over TC/HA mirroring; active production servers at both sites share a quorum volume.]
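A minimal sketch of the active/passive management takeover described above (illustrative Python; HCS performs this through its own clustering, and the pair name shown is hypothetical). The key point is that because the management database is replicated, the standby server takes over with current pair definitions rather than rebuilding state:

```python
class ReplicatedDatabase:
    """Stand-in for the HCS DB kept in sync via TrueCopy or GAD, so the
    standby server sees current pair definitions on takeover."""
    def __init__(self):
        self.pairs = {}

class ManagementServer:
    def __init__(self, site, db):
        self.site, self.db = site, db
        self.active = False

db = ReplicatedDatabase()
primary = ManagementServer("Site 1", db)
standby = ManagementServer("Site 2", db)
primary.active = True
db.pairs["oraVG"] = "PAIR"          # GAD pair state recorded by the primary

def site_failed(failed, backup):
    failed.active = False
    backup.active = True            # takeover: same replicated DB, no rebuild

site_failed(primary, standby)
assert standby.active and standby.db.pairs["oraVG"] == "PAIR"
```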
3-Data-Center Always-Available Infrastructures: Protecting the Protected
- Global-active device (GAD): active-active high availability, read-local, bi-directional synchronous writes, metro distance, consistency groups (supported early 2015)
- Hitachi Universal Replicator (HUR): active/standby remote paths, journal groups with delta resync, any distance
- Pair configuration is on a GAD consistency-group and HUR journal-group basis, with delta resync (sketched after this list)
[Diagram: active cluster nodes (e.g., Oracle RAC) at two metro sites, each issuing I/O to an HUR PVOL journal group, linked by GAD with a quorum; an active and a standby HUR path replicate over FCIP to an HUR SVOL journal group at a remote third data center.]
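Journal-based replication with delta resync can be sketched as follows (a deliberately simplified Python model, not HUR's actual journaling): because writes are sequenced in a journal, the surviving GAD site can resume async replication after a failure by shipping only the entries the remote site lacks, instead of a full re-copy.

```python
class JournalGroup:
    """Writes land in an ordered journal before being shipped, so the
    remote copy can always be brought to a consistent point in time."""
    def __init__(self):
        self.entries = []        # (sequence, block) in write order
        self.shipped = 0         # how far the remote site has applied
    def write(self, seq, block):
        self.entries.append((seq, block))

def delta_resync(surviving, remote):
    """After the primary fails, the surviving GAD site resumes async
    replication by shipping only the journal entries the remote lacks."""
    delta = surviving.entries[remote.shipped:]
    remote.entries.extend(delta)
    remote.shipped = len(remote.entries)
    return len(delta)            # entries transferred, not total capacity

site_a, site_b, dr = JournalGroup(), JournalGroup(), JournalGroup()
for seq in range(5):             # GAD mirrors writes to both metro sites
    site_a.write(seq, b"x")
    site_b.write(seq, b"x")
dr.entries = site_a.entries[:3]  # the DR site lags by two writes
dr.shipped = 3
print(delta_resync(site_b, dr))  # 2 -- only the missing tail is shipped
```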
Global-Active Device Specifications

| Index                                                      | August 2014                                                   | Late 2014                                    |
|------------------------------------------------------------|---------------------------------------------------------------|----------------------------------------------|
| Global-active device management                            | Hitachi Command Suite v8.0.1 or later                         |                                              |
| Max number of volumes (creatable pairs)                    | 64K                                                           |                                              |
| Max pool capacity                                          | 12.3 PB                                                       |                                              |
| Max volume capacity                                        | 46 MB to 4 TB                                                 | 46 MB to 59.9 TB                             |
| Supported products in combination (either side or both)    | Dynamic Provisioning / Dynamic Tiering / Hitachi Universal Volume Manager; ShadowImage / Thin Image | HUR with delta resync; Nondisruptive Migration (NDM) |
| Distance support                                           | Campus                                                        | Metro                                        |
| Path failover                                              | Hitachi Dynamic Link Manager is required (until ALUA support) | Can use any qualified path-failover software |
Hitachi Storage Software Implementation Services
Service description:
- Pre-deployment assessment of your environment
- Planning and design
- Prepare the subsystem for replication options
- Implementations: create and delete test configuration; create production configuration; integrate production environment with Hitachi storage software
- Test and validate installation
- Knowledge transfer
Don't Pay the Appliance Tax!
With appliances, complexity scales faster than capacity:
- SAN port explosion
- Appliance proliferation
- Additional management tools
- Limited snapshot support
- Per-appliance capacity pools
- Disruptive migrations
- All of the above
Global-Active Device: Simplicity at Scale
Avoid the appliance tax with Hitachi:
- Native, high-performance design
- Single management interface
- Advanced non-disruptive migrations
- Simplified SAN topologies
- Large-scale data protection support
- Full access to the storage pool
- All of the above
Hitachi Global Storage Virtualization: Operational Simplicity, Enterprise Scale
Questions and Discussion
Upcoming WebTechs
WebTechs run at 9 a.m. PT / 12 p.m. ET:
- The Rise of Enterprise IT-as-a-Service, October 22
- Stay tuned for new sessions in November
Check www.hds.com/webtech for:
- Links to the recording, the presentation, and Q&A (available next week)
- Schedule and registration for upcoming WebTech sessions
Questions will be posted in the HDS Community: http://community.hds.com/groups/webtech
Thank You