HighAv - Version: 2 21 June 2016 Design of High Availability Systems & Software
Duration: 2 days

Course Description: This course examines the high-level design of embedded systems and software that must provide their services at near-continuous availability. High availability systems must tolerate both expected and unexpected faults. Their design is based on redundant hardware and software combined in ways that will achieve five-nines (99.999%) or greater availability, equivalent to less than 1 second of downtime per day. Basic hardware N-plexing and voting issues are discussed, followed by an in-depth study of fault tolerance techniques including static N-version programming and the backward error recovery techniques Checkpoint-Rollback, Process Pairs, and Recovery Blocks. The class continues with several forward error recovery techniques. Technical issues such as failover management, data replication, and software design defects are addressed in depth. Many real-world examples are presented. This is not a general course about system or software design theory; rather, it is highly focused on the design of embedded systems and software that must make their services available at all times, with no more than about 5 minutes of downtime per year.

Intended audience: This course is intended for practicing real-time and embedded systems software system architects, project managers and technical consultants who have responsibility for designing, structuring and
implementing the software for real-time and embedded computer systems that are required to continue providing service despite the occurrence of internal and external faults.

Prerequisites: Many (but not all) high-availability systems are also safety-critical systems, which can threaten human safety or even human life if the system fails and remains unavailable for a significant period of time. For those high-availability systems that also have safety-critical requirements, we recommend that the course "Design of Safety-Critical Systems and Software" be taken at the same time as this course. The two courses have little overlap in content and offer complementary approaches and perspectives. It is possible to combine these two courses into a unified three- or four-day course for presentation at customer sites.

Objectives: The primary goal of this course is to give participants the skills necessary to design software for real-time and embedded computer systems that must relentlessly provide service despite the occurrence of internal and external faults. This is a very practical, results-oriented course that will provide knowledge and skills that can be applied immediately.

Topics:

Definitions and Background
  High Availability
  Fault -> Error -> Failure
  Single Points of Failure
  Fault Tree Analysis
  Exercise: Probabilistic Fault Tree Analysis
Underlying Principles
  Fault Avoidance vs. Tolerance
  Redundancy
  Failure Curves
  Replication vs. Functional Redundancy vs. Analytic Redundancy
  Dynamic vs. Static Redundancy
  Extended Example: Space Shuttle Software
Fundamental System-Level Design Patterns
  Static Hardware Fault Tolerance
  N-Plex Design
  Exercise: MTBF, MTTF Calculations in Triple Modular Redundancy
  Dynamic System Fault Tolerance
  Redundant Pairs
  Clusters
  Cluster Failover Strategy Choices
  Examples: Redundant Cluster Design
Concepts for Backward Error Recovery
  Design Diversity
  Dynamic System Redundancy
  Backward Error Recovery
  Transactions
  Checkpointing
System and Software Design Patterns for High Availability
  Checkpoint-Rollback
  Process Pairs
  Recovery Blocks
  Limitations of Backward Error Recovery Patterns
  Forward Error Recovery Design Patterns
Technical Issues in High Availability Design
  RAID: Redundant Arrays of Inexpensive Disks
  Exercise: Hamming Codes
  Failover Management
  Data Replication
  Dealing with Software Design Faults
  Extended Example: Airbus A330/340 Fly-by-Wire
  C Language in Critical Systems
  Software Robustness: MISRA-C, LINT, Static Code Analyzers
  Exercise: C-Language Shenanigans
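
Note on the availability figures: the downtime budgets quoted in the course description follow directly from the availability percentage. The short Python sketch below is not part of the course material. It is only an illustration of the arithmetic, and the function names and the assumption of a 365-day year are ours.

```python
# Illustrative sketch only: converts an availability percentage into a
# downtime budget. Function names are hypothetical; a 365-day year is assumed.

SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes (non-leap year)


def downtime_fraction(availability_percent):
    """Fraction of time the service may be unavailable."""
    return 1.0 - availability_percent / 100.0


def downtime_seconds_per_day(availability_percent):
    """Allowed downtime in seconds per day."""
    return downtime_fraction(availability_percent) * SECONDS_PER_DAY


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime in minutes per year."""
    return downtime_fraction(availability_percent) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for percent in (99.9, 99.99, 99.999):
        print(
            f"{percent}% availability: "
            f"{downtime_seconds_per_day(percent):.3f} s/day, "
            f"{downtime_minutes_per_year(percent):.2f} min/year"
        )
```

For five-nines (99.999%), this works out to about 0.86 seconds of downtime per day, or about 5.26 minutes per year.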