High availability
From Wikipedia, the free encyclopedia



Similar documents
BUILDING HIGH-AVAILABILITY SERVICES IN JAVA

Total Business Continuity with Cyberoam High Availability

MaximumOnTM. Bringing High Availability to a New Level. Introducing the Comm100 Live Chat Patent Pending MaximumOn TM Technology

Executive Brief Infor Cloverleaf High Availability. Downtime is not an option

Creating A Highly Available Database Solution

Safety in Numbers. Using Multiple WAN Links to Secure Your Network. Roger J. Ruby Sr. Product Manager August Intelligent WAN Access Solutions

An Oracle White Paper November Oracle Real Application Clusters One Node: The Always On Single-Instance Database

How Routine Data Center Operations Put Your HA/DR Plans at Risk

Five Secrets to SQL Server Availability

High Availability Concepts for Video Surveillance

Fault Tolerant Servers: The Choice for Continuous Availability on Microsoft Windows Server Platform

HA / DR Jargon Buster High Availability / Disaster Recovery

DISASTER RECOVERY PLANNING GUIDE

Finding a Cure for Downtime

Blackboard Managed Hosting SM Disaster Recovery Planning Document

Program: Management Information Systems. David Pfafman 01/11/2006

High Availability for Citrix XenApp

Oracle Maps Cloud Service Enterprise Hosting and Delivery Policies Effective Date: October 1, 2015 Version 1.0

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications

INSIDE. Preventing Data Loss. > Disaster Recovery Types and Categories. > Disaster Recovery Site Types. > Disaster Recovery Procedure Lists

System Infrastructure Non-Functional Requirements Related Item List

Fault Tolerant Servers: The Choice for Continuous Availability

Leveraging Virtualization for Disaster Recovery in Your Growing Business

The Benefits of Continuous Data Protection (CDP) for IBM i and AIX Environments

Business Continuity: Choosing the Right Technology Solution

VoIP Logic: Disaster Recovery and Resiliency

POLICY NAME IT DISASTER RECOVERY POLICY AND PLAN POLICY NUMBER POLICY FILE REFERENCE 3/3/6 DATE OF ADOPTION REVIEW OR AMENDMENT DATES

Disaster Recovery for Oracle Database

Solution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware

Benefit from Disaster Recovery... Without a Disaster

Skelta BPM and High Availability

The Importance of Software License Server Monitoring White Paper

DISASTER RECOVERY. Omniture Disaster Plan. June 2, 2008 Version 2.0

Veritas Cluster Server from Symantec

Introduction to Virtualization. Paul A. Strassmann George Mason University October 29, 2008, 7:20 to 10:00 PM

Getting Started with Endurance FTvirtual Server

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

Contents. SnapComms Data Protection Recommendations

High Availability Design Patterns

High Availability Cluster for RC18015xs+

COST-BENEFIT ANALYSIS: HIGH AVAILABILITY IN THE CLOUD AVI FREEDMAN, TECHNICAL ADVISOR. a white paper by

Windows Server 2008 R2 Hyper-V Live Migration

High Availability for Citrix XenServer

Mastering Disaster A DATA CENTER CHECKLIST

Avaya Aura Communication Manager Greater than 5 Nines Availability

BUILDING THE CARRIER GRADE NFV INFRASTRUCTURE Wind River Titanium Server

Red Hat Enterprise linux 5 Continuous Availability

Informatica MDM High Availability Solution

Neverfail Solutions for VMware: Continuous Availability for Mission-Critical Applications throughout the Virtual Lifecycle

Protecting SQL Server in Physical And Virtual Environments

Why Fails MessageOne Survey of Outages

HRG Assessment: Stratus everrun Enterprise

Cloud Based Application Architectures using Smart Computing

Meeting Management Solution. Technology and Security Overview N. Dale Mabry Hwy Suite 115 Tampa, FL Ext 702

WHITE PAPER. Best Practices to Ensure SAP Availability. Software for Innovative Open Solutions. Abstract. What is high availability?

Implementing High-Availability (HA) Solutions for Siebel ebusiness Applications

Microsoft SQL Server on Stratus ftserver Systems

MAKING YOUR VIRTUAL INFRASTUCTURE NON-STOP Making availability efficient with Veritas products

Exhibit E - Support & Service Definitions. v1.11 /

CHAPTER 1 Basic High Availability Concepts

IBM Virtualization Engine TS7700 GRID Solutions for Business Continuity

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 13 Business Continuity

Fault Tolerant Solutions Overview

Executive Summary WHAT IS DRIVING THE PUSH FOR HIGH AVAILABILITY?

High Availability White Paper

NEC Corporation of America Intro to High Availability / Fault Tolerant Solutions

High Availability Database Solutions. for PostgreSQL & Postgres Plus

A High Availability Clusters Model Combined with Load Balancing and Shared Storage Technologies for Web Servers

Mission-Critical Fault Tolerance for Financial Transaction Processing

Accounts Payable Imaging & Workflow Automation. In-House Systems vs. Software-as-a-Service Solutions. Cost & Risk Analysis

Virtual Fax Server Solutions. White Paper March 2010

SaaS Service Level Agreement (SLA)

Windows Server 2008 R2 Hyper-V Live Migration

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

Ohio Supercomputer Center

Domains. Seminar on High Availability and Timeliness in Linux. Zhao, Xiaodong March 2003 Department of Computer Science University of Helsinki

An Oracle White Paper January A Technical Overview of New Features for Automatic Storage Management in Oracle Database 12c

Everything You Need to Know About Network Failover

Virtualizing disaster recovery using cloud computing

PROMAPP TECHNICAL INFORMATION

High Availability and Disaster Recovery Solutions for Perforce

Code Subsidiary Document No. 0007: Business Continuity Management. September 2015

Pervasive PSQL Meets Critical Business Requirements

Module 7: System Component Failure Contingencies

How To Build A Clustered Storage Area Network (Csan) From Power All Networks

Integration of PRIMECLUSTER and Mission- Critical IA Server PRIMEQUEST

Ingres Replicated High Availability Cluster

Blackboard Collaborate Web Conferencing Hosted Environment Technical Infrastructure and Security

A risky business. Why you can t afford to gamble on the resilience of business-critical infrastructure

Disaster Recovery for Small Businesses

Service Level Terms Inter8 Cloud Services. Service Level Terms Inter8 Cloud Services

HP StorageWorks Data Protection Strategy brief

CCNP Switch Questions/Answers Implementing High Availability and Redundancy

Synology High Availability (SHA)

The functionality and advantages of a high-availability file server system

IBM Global Technology Services March Virtualization for disaster recovery: areas of focus and consideration.

Whitepaper Continuous Availability Suite: Neverfail Solution Architecture

Protect Your Business with Automated Business Continuity Solutions

SanDisk ION Accelerator High Availability

W H I T E P A P E R. Reducing Server Total Cost of Ownership with VMware Virtualization Software

Transcription:

High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period. Users want their systems, for example wristwatches, hospitals, airplanes or computers, to be ready to serve them at all times. Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. [1] Generally, the term downtime is used to refer to periods when a system is unavailable.

Contents
1 Scheduled and Unscheduled Downtime
2 Percentage calculation
3 Measurement and interpretation
4 Closely related concepts
5 System design for high availability
6 Reasons for unavailability
7 Costs of unavailability
8 See also
9 References
10 External links

Scheduled and Unscheduled Downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot, or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature-related shutdown, logically or physically severed network connections, catastrophic security breaches, or various application, middleware, and operating system failures.

Many computing sites exclude scheduled downtime from availability calculations, assuming, correctly or incorrectly, that scheduled downtime has little or no impact upon the computing user community. By excluding scheduled downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. [citation needed] For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable per year, month, or week.

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
95%                       18.25 days          36 hours              8.4 hours
98%                       7.30 days           14.4 hours            3.36 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.5%                     1.83 days           3.60 hours            50.4 minutes
99.8%                     17.52 hours         86.23 minutes         20.16 minutes
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.95%                    4.38 hours          21.56 minutes         5.04 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

* For monthly calculations, a 30-day month is used.

Uptime and availability are not synonymous: a system can be up but not available, as in the case of a network outage. In practice, the number of nines is not often used by network engineers when modeling and measuring availability, because it is hard to apply in formulas; more often, unavailability is expressed as a probability (such as 0.00001) or as a downtime per year. Availability specified as a number of nines is most often seen in marketing documents. [citation needed] The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence. [2]

Measurement and interpretation

Clearly, how availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might still have suffered a network failure that lasted 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.

Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. In the absence of instrumentation, systems supporting high-volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems that experience periodic lulls in demand.

Closely related concepts

Recovery time (or estimated time of repair, ETR), closely related to availability, is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Recovery time can be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center. Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events.
Some users can tolerate application service interruptions but cannot tolerate data loss. A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.
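
The percentage table and the measurement example above are simple arithmetic. As a minimal illustrative sketch in Python (the function and variable names are ours, not from the article), the following reproduces both the downtime-per-period conversion and the 8751-out-of-8760-hours calculation:

    # Convert an availability percentage into allowed downtime per period.
    # Illustrative sketch; names are ours, not from the article.
    def downtime_hours(availability_pct, period_hours):
        """Hours of allowed downtime in a period at a given availability %."""
        return period_hours * (1 - availability_pct / 100.0)

    for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
        year = downtime_hours(pct, 365 * 24)    # non-leap year
        month = downtime_hours(pct, 30 * 24)    # 30-day month, as in the table
        week = downtime_hours(pct, 7 * 24)
        print(f"{pct}%: {year:.2f} h/year, {month * 60:.2f} min/month, "
              f"{week * 60:.2f} min/week")

    # Measured availability from the example above:
    # 8751 available hours out of 8760 hours in a non-leap year.
    print(f"{8751 / 8760:.4%}")  # about 99.90%, i.e. roughly "three nines"

Running it reproduces the rows of the table (for example, 99.9% gives 8.76 hours per year and 43.2 minutes per 30-day month).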

System design for high availability

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. Some analysts put forth the theory that the most highly available systems adhere to a simple architecture: a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy. However, this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover).

High availability implies no human intervention to restore operation in complex systems. For example, an availability limit of 99.999% allows about one second of downtime per day, which is impractical to achieve with human labor; the human intervention needed for maintenance actions in a large system will exceed this limit. An availability limit of 99% would allow an average of about 15 minutes of downtime per day, which is realistic for human intervention. Redundancy (engineering) is used to eliminate the need for human intervention. The two kinds of redundancy are passive redundancy and active redundancy.

Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers: the boat continues toward its destination despite failure of a single engine or propeller, so long as the boat's speed exceeds the water's velocity for long enough to avoid running out of fuel. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of a single component is not considered a failure unless the resulting performance decline exceeds the specification limits for the entire system.

Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked; Internet routing is derived from early work by Birman and Joseph in this area. [3] Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.

Zero-downtime system design means that modeling and simulation indicate that the mean time between failures significantly exceeds the period of time between planned maintenance, upgrade events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellite. The Global Positioning System is an example of a zero-downtime system.

Fault instrumentation can be used in systems with limited redundancy to achieve high availability: maintenance actions occur during brief periods of downtime only after a fault indicator activates, and a failure is only significant if it occurs during a mission-critical period. This strategy is called condition-based maintenance, and it is only effective with active redundancy.
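
The combined availability produced by redundancy can be estimated with the standard series/parallel formulas from reliability engineering. These formulas are well established but are not given in the article; the sketch below assumes independent component failures and perfect failover, and its names are ours:

    # Standard reliability-engineering estimates for combined availability,
    # assuming independent component failures. Illustrative sketch only.
    def series(*avail):
        """All components must work (no redundancy): availabilities multiply."""
        p = 1.0
        for a in avail:
            p *= a
        return p

    def parallel(*avail):
        """The system works if any one component works (redundant components
        with perfect failover): multiply the unavailabilities."""
        p_all_down = 1.0
        for a in avail:
            p_all_down *= 1.0 - a
        return 1.0 - p_all_down

    engine = 0.99                    # one engine available 99% of the time
    print(parallel(engine, engine))  # two redundant engines -> 0.9999
    print(series(engine, engine))    # both engines required  -> 0.9801

In the boat example, two 99% engines yield roughly 99.99% availability when either one suffices, but only about 98% when both are required, which is why redundant designs are arranged so that a single malfunction stays within specification.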
Modeling and simulation is used to evaluate the theoretical reliability of large systems, and the outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criteria: N represents the total number of components in the system, and x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations in which one component is faulted; N-2 means the model is stressed by evaluating performance with all possible combinations in which two components are faulted simultaneously. A sketch of such a stress test follows.
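
As a hypothetical sketch of the N-x criteria (the capacity model and all names are our assumptions, not from the article), a model can be stressed by faulting every combination of x components and checking that the surviving capacity still meets the specification:

    # Hypothetical N-x stress test: fault every combination of x components
    # in a simple capacity model and check that the remaining system still
    # meets its specification. The capacity model is an illustrative assumption.
    from itertools import combinations

    def survives_n_minus_x(capacities, required, x):
        """True if every combination of x faulted components leaves
        at least `required` capacity."""
        total = sum(capacities)
        for faulted in combinations(range(len(capacities)), x):
            lost = sum(capacities[i] for i in faulted)
            if total - lost < required:
                return False
        return True

    # Four 50 MW generators serving a 100 MW load:
    caps = [50, 50, 50, 50]
    print(survives_n_minus_x(caps, required=100, x=1))  # N-1: True
    print(survives_n_minus_x(caps, required=100, x=2))  # N-2: True (100 MW left)
    print(survives_n_minus_x(caps, required=100, x=3))  # N-3: False (50 MW left)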

Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems, from most to least important, as follows [4]:

Causal factor of unavailability
Lack of best-practice change control
Lack of best-practice monitoring of the relevant components
Lack of best-practice requirements and procurement
Lack of best-practice operations
Lack of best-practice avoidance of network failures
Lack of best-practice avoidance of internal application failures
Lack of best-practice avoidance of external services that fail
Lack of best-practice physical environment
Lack of best-practice network redundancy
Lack of best-practice technical solution of backup
Lack of best-practice process solution of backup
Lack of best-practice physical location
Lack of best-practice infrastructure redundancy
Lack of best-practice storage architecture redundancy

The factors themselves are based on the work of Evan Marcus and Hal Stern. [5]

Costs of unavailability

In a 1998 report from IBM Global Services, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues. [6]

See also

Business continuity planning
Carrier grade
Disaster recovery
Fault-tolerant system
Reliability (computer networking)
High-availability cluster
Service Availability Forum
Split multi-link trunking
OpenSAF
Uptime

References

1. ^ Piedad, Floyd. High Availability: Design, Techniques, and Processes. [1] (http://books.google.com/books?id=khb0hdq98qyc&dq=high+availability+floyd+piedad+book&printsec=frontcover&source=bn&hl=en&ei=gs0lsrlvbkj
2. ^ Evan L. Marcus, The Myth of the Nines (http://searchstorage.techtarget.com/tip/0,289483,sid5_gci921823,00.html)
3. ^ RFC 992
4. ^ Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid. [2] (http://www.kth.se/ees/forskning/publikationer/modules/publications_polopoly/reports/2010/ir-ee-ICS_2010_047.pdf?l=en_UK)
5. ^ E. Marcus and H. Stern, Blueprints for High Availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
6. ^ IBM Global Services, Improving Systems Availability, IBM Global Services, 1998. [3] (http://www.dis.uniroma1.it/~irl/docs/availabilitytutorial.pdf)

External links

Carrier Grade: The Myth of the Nines (http://www.pipelinepub.com/0407/pdf/article%204_carrier%20Grade_LTC.pdf) - Pipeline PDF
Service Availability Reporting (http://themonitoringguy.com/articles/service-availability-reporting/) - A Guide to Service Availability Reporting
Cisco IOS Management for High Availability Networking (http://www.cisco.com/en/us/tech/tk869/tk769/technologies_white_paper09186a00800a998b.shtml/) - Best Practices White Paper

Retrieved from "http://en.wikipedia.org/wiki/high_availability"
Categories: System administration | Quality control | Applied probability | Reliability engineering

This page was last modified on 20 April 2011 at 08:14. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of Use for details. Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.