Comparing TCO for Mission Critical Linux and NonStop
Iain Liston-Brown, EMEA NonStop PreSales
BITUG, 2nd December 2014
Agenda
- What do we mean by Mission Critical?
- Mission Critical infrastructure principles
- Architecture comparisons
- Application TCO comparison for real payment solutions
- Q&A
What do we want to achieve with our MC infrastructure?

MC requirements
- Provide transactional business services to end users
- Keep accurate track of transactions

Expected behavior
- Service is always available (24x7: normal operations do not include stop/start), no matter how dynamic the business environment may be
- Service is always responsive (human interaction time: response within a very few seconds), no matter how many users want to use the services concurrently

Critical resources
- HW infrastructure: processing power, storage
- Database services
- Business application (OLTP)

Expected outcome
- Minimize downtime
- Maximize data protection
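To make "always available" concrete, here is a back-of-the-envelope sketch converting an availability percentage into the downtime budget it allows per year. The targets shown are illustrative examples, not figures from the slides.

```python
# Convert an availability target into the downtime budget it allows per year.
# The targets below are illustrative examples, not figures from the slides.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% availability -> {downtime_minutes_per_year(target):8.1f} min/year")
```

At "five nines" the budget is roughly five minutes per year, which is why a platform whose normal operations include stop/start cannot meet the requirement.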
Availability, Data Integrity & Scalability
Design principles and a few not-so-easy tasks

Eliminate single points of failure; minimize the impact of failures
- Use of redundant resources
- Smart provisioning
- Smaller areas of failure

Data preservation and recovery
- ACID properties preservation (Atomicity, Consistency, Isolation & Durability), illustrated in the sketch below
- Database backup
- Online replication

Build fault tolerance into each and every layer
- All-active clustering
- Stand-by clustering

Data scalability
- Multi-node database: pros & cons for OLTP

These design principles go beyond the capabilities of standard standalone servers.
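A minimal sketch of what "ACID preservation" demands of the transaction layer, using a toy in-memory store (the class and method names are hypothetical, for illustration only): either every update in a transaction lands, or none does.

```python
# Toy illustration of transactional atomicity: apply all updates or none.
# The store and its API are hypothetical; real OLTP engines (NonStop TMF,
# Oracle redo/undo) implement this with write-ahead logging, not copies.
class ToyStore:
    def __init__(self):
        self.data = {}

    def transaction(self, updates):
        snapshot = dict(self.data)          # cheap stand-in for undo logging
        try:
            for key, func in updates:
                self.data[key] = func(self.data.get(key, 0))
        except Exception:
            self.data = snapshot            # rollback: atomicity preserved
            raise

store = ToyStore()
store.data = {"acct_a": 100, "acct_b": 0}
try:
    # Transfer 30 from A to B, then a failing step aborts the whole txn.
    store.transaction([
        ("acct_a", lambda v: v - 30),
        ("acct_b", lambda v: v + 30),
        ("acct_c", lambda v: 1 / 0),        # simulated failure mid-transaction
    ])
except ZeroDivisionError:
    pass
assert store.data == {"acct_a": 100, "acct_b": 0}   # no partial update visible
```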
What about generic SMP scale-up?

Scale-out SMP processing and MPP processing require:
- Application awareness of the underlying processing architecture
- Specific devices for interoperation and load balancing
- A specific RDBMS that takes advantage of the scale-out architecture

Quite often the more generic scale-up SMP model is used instead; in such cases an effort is made to build HA or FT on top of it (Sysplex and FT SMP options):
- Mission-critical-specific scale-up-like processing
- Virtualized infrastructure
- StandBy nodes and clustering; different layers may be used for application, database and storage
- Switch mechanisms in place for planned or unplanned failover (growth)

But:
- Clustering may be as complex to manage as scale-out or MPP, or more so
- Larger components mean a larger impact of failure and a less predictable Recovery Time and Recovery Point (see the sketch below)
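One way to make "less predictable Recovery Time and Recovery Point" concrete is a simple model: under asynchronous replication, data written since the last shipped log batch is at risk (RPO), and service is down from failure detection until the standby is promoted and the database recovered (RTO). All inputs below are hypothetical illustration values, not measured figures.

```python
# Back-of-the-envelope RTO/RPO model for a standby cluster.
# All inputs are hypothetical illustration values, not measured figures.
def recovery_point_seconds(replication_interval_s: float,
                           replication_lag_s: float) -> float:
    """Worst-case data-loss window (RPO) under async log shipping."""
    return replication_interval_s + replication_lag_s

def recovery_time_seconds(detection_s: float, failover_s: float,
                          db_recovery_s: float) -> float:
    """Outage duration (RTO): detect failure, switch nodes, recover the DB."""
    return detection_s + failover_s + db_recovery_s

print("RPO:", recovery_point_seconds(replication_interval_s=30,
                                     replication_lag_s=5), "s of transactions at risk")
print("RTO:", recovery_time_seconds(detection_s=30, failover_s=60,
                                    db_recovery_s=120), "s of downtime")
```

The slide's point shows up directly in the model: the bigger the node, the longer database recovery tends to take, and the harder both numbers are to bound.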
Scale-out SMP servers & MPP NonStop servers
Two approaches to Availability, Scalability & Data Integrity
[Diagram: on the left, a scale-out SMP stack with a load balancer in front of stateless or cluster-aware application nodes (from 2 to N), RAC database nodes (from 2 to ..?) connected by interconnect & coherence, and an external storage subsystem; on the right, the MPP architecture with NonStop OS, NonStop SQL/MX and an integrated storage subsystem.]
Design Principles: Manageability

- Greater complexity means a bigger chance of something going wrong, and makes it more difficult to identify problems and determine what is a symptom and what is a cause.
- Creating clusters out of systems that were originally intended to work standalone increases complexity and the chance of a failure occurring. Adding SAN storage can also increase complexity, and storage systems are often used in multipurpose mode.
- In a disaster or failure, people are under pressure, and under pressure they make mistakes. A well-designed system helps avoid bad decisions by doing the right thing by design. (Wendy Bartlett)

These design principles mean that complexity decreases consistency, and that good design and integration make for simplicity, leading to lower cost. (The sketch below puts rough numbers on the complexity argument.)
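A rough way to quantify "more components, more chance of something going wrong": if a service needs every one of its components working (components in series), its availability is the product of theirs. The component counts and per-component availability below are hypothetical.

```python
# Availability of a chain of required components is the product of their
# individual availabilities: adding components always lowers the total.
# All figures are hypothetical illustration values.
from math import prod

def serial_availability(component_availabilities):
    """Availability of a service that needs every listed component up."""
    return prod(component_availabilities)

simple_stack  = [0.999] * 4    # e.g. server, OS, DB, app
complex_stack = [0.999] * 12   # same plus LB, SAN, cluster SW, replicas...

print(f" 4 components: {serial_availability(simple_stack):.4%}")
print(f"12 components: {serial_availability(complex_stack):.4%}")
```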
Design Principles: Summary & Conclusions

Summary
- Redundancy of the critical components, and efficient provisioning of them
- Smaller areas of failure are easier to replace and easier to recover
- Growth by resource addition is less disruptive than growth by replacement
- Larger numbers of components are more complex to manage and so more failure prone
- Larger numbers of components require smarter interoperation infrastructure

Conclusion
- The HW architectures that best respond to these principles are MPP processing and scale-out SMP processing (redundancy's payoff is sketched below)
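The flip side of the serial-chain arithmetic above: components placed in parallel (redundant, either can serve) multiply their unavailability, which is why redundancy of critical components is the first principle on the slide. The 99% single-component figure is again hypothetical.

```python
# Redundant (parallel) components fail together only if all of them fail:
# unavailability multiplies, so availability climbs fast with each spare.
# The 99% single-component figure is a hypothetical illustration.
def parallel_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, any one of which suffices."""
    return 1 - (1 - single) ** copies

for copies in (1, 2, 3):
    print(f"{copies} copies of a 99% component -> {parallel_availability(0.99, copies):.6%}")
```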
IDC Availability Spectrum and Options

Level | Characterisation                | Impact of component failure
AL4   | Fault Tolerant Server           | Switch to alternate resources is not perceptible to end users (NonStop delivers AL4)
AL3   | Clustered Server                | A short outage is needed for failover to take place
AL2   | Workload Balancing              | Balancing may not be perceptible to end users because of retry
AL1   | Not shipped as highly available | Need to switch to redundant resources before processing resumes

Source: IDC, September 2012, Doc #236946, Worldwide and U.S. High-Availability Server 2012-2016 Forecast and Analysis
VMware options for HA & FT
Fault tolerant processing: another not-so-easy task

VMware provides two features to increase the availability of applications running on vSphere:
- VMware vSphere High Availability (HA) provides automated restart of virtual machines in the event of hardware or operating system failures.
- VMware vSphere Fault Tolerance (FT) provides continuous availability of virtual machines by running a secondary virtual machine on another host that closely shadows the primary virtual machine.

In addition, VMware vSphere vMotion live migration allows you to move an entire running virtual machine from one physical server to another, without downtime and with transaction integrity.

The architecture and usage of these features are described in detail in documents available from VMware:
http://www.vmware.com/products/vsphere/features/fault-tolerance.html
http://www.vmware.com/products/vsphere/features/high-availability.html
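A hedged way to see why HA and FT land at different IDC availability levels: HA's recovery includes a VM reboot, while FT's is a near-instant switch to the already-running shadow. The sketch models only that difference, with hypothetical timings; it is not VMware's implementation.

```python
# Contrast of the two recovery styles as a simple downtime model.
# Timings are hypothetical illustration values, not VMware measurements.
def ha_downtime(detect_s: float, vm_boot_s: float, app_start_s: float) -> float:
    """vSphere-HA-style recovery: detect the failed host, then restart the VM."""
    return detect_s + vm_boot_s + app_start_s

def ft_downtime(switch_s: float) -> float:
    """FT-style recovery: the shadow VM already runs; only the switch costs time."""
    return switch_s

print(f"HA-style restart: ~{ha_downtime(15, 60, 30):.0f} s outage")
print(f"FT-style shadow:  ~{ft_downtime(0.5):.1f} s outage")
```

One caveat: because the FT shadow executes the same instruction stream as the primary, it protects against hardware failure but will faithfully reproduce an application or OS crash.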
HP Serviceguard: Handling Application Failures
[Diagram: an HP Serviceguard for Linux cluster of two nodes connected by a heartbeat LAN. The application package, with its floating IP address 15.12.45.23, application configuration and app data, runs on a primary node (SG + Linux OS). When the application fails, the package fails over to the adoptive node, which is configured to run it, taking the 15.12.45.23 address with it.]
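A minimal sketch of the failover pattern the diagram shows: a package (service plus floating address) that an adoptive node adopts when heartbeats stop. The classes, timings and IP are hypothetical; this is the general pattern, not Serviceguard's actual interfaces.

```python
# Sketch of heartbeat-driven package failover, the pattern Serviceguard
# automates. Class names, timings and the IP are hypothetical illustrations.
import time

HEARTBEAT_TIMEOUT_S = 5.0

class Package:
    """A failover unit: service, its config, and a floating IP address."""
    def __init__(self, name: str, floating_ip: str):
        self.name, self.floating_ip = name, floating_ip

class Node:
    def __init__(self, name: str):
        self.name = name
        self.last_heartbeat = time.monotonic()
        self.packages = []

    def peer_looks_dead(self) -> bool:
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S

    def adopt(self, pkg: Package) -> None:
        # Real cluster managers also fence the failed node first, so the
        # floating IP and storage are never active in two places at once.
        self.packages.append(pkg)
        print(f"{self.name}: acquired {pkg.floating_ip}, starting {pkg.name}")

pkg = Package("payment-app", "15.12.45.23")
adoptive = Node("node-b")
adoptive.last_heartbeat -= 10          # simulate missed heartbeats from node-a
if adoptive.peer_looks_dead():
    adoptive.adopt(pkg)
```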
Real Application: High End
NonStop MPP vs SMP Linux + Oracle
From No HA to Pure StandBy Clustering
Building minimal HA features on top of the ISV-suggested configuration
From Pure StandBy Clustering to Cluster Awareness
Building resiliency against DB node failures
Sizing for the different tps rates on both options
(*) Only one app server is shown in the picture and included in the quotes
NonStop vs StandBy Linux: 3-year TCO comparison
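The TCO figures themselves were presented on the slide; as a hedged illustration of what a 3-year TCO roll-up typically adds together, here is the arithmetic with placeholder numbers. Every cost input below is hypothetical, not a figure from this comparison.

```python
# Shape of a 3-year TCO roll-up: one-off acquisition costs plus three years
# of recurring costs. Every figure below is a hypothetical placeholder,
# not a number from the BITUG comparison.
YEARS = 3

def tco(hw_acquisition, sw_licenses, annual_hw_support,
        annual_sw_support, annual_ops_staff, annual_facilities):
    one_off = hw_acquisition + sw_licenses
    recurring = (annual_hw_support + annual_sw_support +
                 annual_ops_staff + annual_facilities) * YEARS
    return one_off + recurring

example = tco(hw_acquisition=500_000, sw_licenses=300_000,
              annual_hw_support=50_000, annual_sw_support=60_000,
              annual_ops_staff=120_000, annual_facilities=20_000)
print(f"Illustrative 3-year TCO: ${example:,}")
```

In clustered configurations the license and support lines typically scale with node and core counts, which is where such comparisons tend to diverge.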
NonStop vs Linux RAC DB: 3-year TCO comparison
Real Application: Entry Systems
NonStop MPP vs SMP Linux + Oracle
Pure StandBy Clustering at Each Site
The ISV-suggested configuration includes DB replication to a DR site
[Diagram: primary site and DR site, each a StandBy cluster; the active DB node at the primary site replicates to the active DB node at the DR site.]
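A minimal sketch of the asynchronous log-shipping pattern behind "active DB node replicates to DR": the primary appends committed changes to a queue, the DR node applies them in order, and the unapplied backlog is exactly the data at risk (the RPO from the earlier model). All structures and names are hypothetical illustrations of the pattern only.

```python
# Async log-shipping between sites, the pattern behind DB replication to DR.
# Structures and names are hypothetical illustrations of the pattern only.
from collections import deque

class PrimaryDB:
    def __init__(self):
        self.data, self.redo_log = {}, deque()

    def commit(self, key, value):
        self.data[key] = value
        self.redo_log.append((key, value))   # shipped to DR asynchronously

class DrDB:
    def __init__(self):
        self.data = {}

    def apply(self, log: deque, batch: int):
        for _ in range(min(batch, len(log))):
            key, value = log.popleft()       # apply strictly in commit order
            self.data[key] = value

primary, dr = PrimaryDB(), DrDB()
for txn in range(5):
    primary.commit(f"txn{txn}", txn)
dr.apply(primary.redo_log, batch=3)
print("unreplicated commits (data at risk):", len(primary.redo_log))  # -> 2
```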
From Pure StandBy Clustering to Cluster Awareness
Building resiliency against DB node failures
Sizing for the different tps rates on both options
NonStop vs Linux: 3-year TCO comparison
Mission Critical Application Availability and SMP Scale-Up
Summary
Mission Critical Infrastructure & Features Summary

Layer                | Scalability                | High Availability       | Data Integrity     | Recovery
NonStop MPP HW       | UpLift & incremental       | Built-in                | At every component | SF TakeOver or DR
SMP HW               | ForkLift & node addition   | Multiplicity            | Platform dependent | FailOver or DR
Linux OS - Scale UP  | Passive, NUMA constraints  | StandBy Clustering      | Passive            | FailOver or DR
Linux OS - Scale OUT | Passive, App/DB dependent  | SL Retry or Clustering  | Passive            | TakeOver or DR
NonStop OS           | Active                     | Active                  | End-To-End         | SF TakeOver or DR
Oracle S1, SE, EE    | One node (2, 4, N sockets) | StandBy Clustering      | Redo logs          | RMAN / Data Guard
Oracle EE RAC        | Multi node (3 nodes max?)  | RAC Multi               | Redo logs          | RMAN / Data Guard
NonStop SQL/MX       | 16 Blade units per Node    | Built-In                | TMF Audit Trail    | TMF / RDF
Application          | HW Aware, StateLess, ??    | HW Aware, StateLess, ?? | DB Dependent       | DB Dependent
Thank you