Safety Critical & High Availability Systems
SCHA - Version: 1
21 June 2016
3 days

Course Description:
This Masterclass examines the design of embedded systems and software that provide services in applications whose failure could threaten the well-being or safety of people. Many, though not all, of these systems must not be stopped under any circumstances, and thus must be designed for high availability. Practical guidance is offered on how to address these concerns when designing systems in fields such as medical, automotive, avionics, nuclear and chemical process control.

The Masterclass surveys concepts and alternatives for system and software architectures appropriate for safety-critical and high availability systems. Following an examination of hazard and risk analysis techniques, the seminar goes on to list a number of approaches to software safety that span fault avoidance, fault detection, and fault containment tactics including redundancy, recovery, masking and barriers. A variety of candidate architectural design patterns are examined, including dual/triple modular redundancy, shutdown monitors, dissimilar independent designs, backup parallel patterns and active/monitor parallel patterns. Many real-world examples are presented.

Systems that are required to provide high availability must be designed to tolerate faults. Their design is usually based on off-the-shelf hardware and software combined in ways that will achieve five-nines (99.999%) or greater availability. Basic hardware N-plexing and voting issues are discussed, followed by an in-depth study of a number of backward error recovery fault tolerance techniques including Checkpoint-Rollback, Process Pairs, and Recovery Blocks. The class continues with several forward error recovery techniques. Technical issues such as failover management, data replication, and software design defects are addressed in depth.

This Masterclass is not a general course about system or software design theory. Rather, it is tightly focused on the design of embedded systems and software that are required to provide their intended functions without endangering the safety or life of users or their environment, while at the same time maintaining high availability if required.

Intended audience:
This Masterclass is intended for practicing real-time and embedded systems engineers, software system architects, project managers and technical consultants who have responsibility for designing, structuring and implementing the hardware and software for real-time and embedded computer systems in applications whose failure could threaten the well-being or life of people. Many of these systems have high availability as an additional design requirement.

Prerequisites:
Course participants are expected to be familiar with general embedded and real-time software design.

Objectives:
The primary goal of this Masterclass is to give the participant the skills necessary to design systems and software for real-time and embedded computers in which faults and failures could pose a danger to human life. As part of this, participants gain skills in designing systems for high availability. This is very practical, results-oriented training that provides knowledge and skills that can be applied immediately.

Topics:
Definitions and Background
Hazards and Risks
Safety vs. Fault Tolerance
Design Issues for Safety
Redundancy Approaches to Dependability
Examples: Automotive Brake-by-Wire, Steer-by-Wire
Preparatory Analyses
Hazard Analysis: FMEA
Fault & Event Tree Analysis
Exercise: Fault Tree Analysis
Probabilistic Event Tree Analysis
Risk Analysis
Approaches to Safety: Fault Avoidance, Fault Detection, Fault Tolerance
Fundamental Safety Design Patterns
Detection of Sensor Errors
Failstop
Fault Masking
Shutdown Design Patterns
Single Channel Patterns
Multi-Channel Safety Design Patterns
Actuation Monitoring Options
Dual Channel Patterns
Dual Closed-Loop Patterns
Heterogeneous Peer-Channel Pattern
Example: Flight Control Computer Development
Dual-Dual Pattern
Design Patterns for Resiliency and Safety
Monitor-Actuator Pattern
Extended Example: Medical Respiratory Ventilator
The Safety Executive
Extended Example: Automotive Drive-by-Wire
Extended Example: Airbus A330/340 Fly-by-Wire
A Cookbook for Safety-Critical Design Functionality
Learning from System Failures and Accidents
Sources of System Accidents
Hazard-Based Risk Analysis Calculations
Exercise: Spacecraft Risk Analysis
Software Factors in Some Famous Accidents
High Availability: Underlying Principles
Fault Avoidance vs. Tolerance
Failure Curves
Replication vs. Functional Redundancy vs. Analytic Redundancy
Dynamic vs. Static Redundancy
Extended Example: Space Shuttle Software
Fundamental System-Level Availability Design Patterns
Static Hardware Fault Tolerance
N-Plex Design
Exercise: MTBF, MTTF Calculations in Triple Modular Redundancy
Dynamic System Fault Tolerance
Redundant Pairs
Clusters
Cluster Failover Strategy Choices
Concepts for Backward Error Recovery
Design Diversity
Dynamic System Redundancy
Backward Error Recovery
Transactions & Checkpointing
System and Software Design Patterns for High Availability
Checkpoint-Rollback
Process Pairs
Recovery Blocks
Limitations of Backward Error Recovery Patterns
Forward Error Recovery Design Patterns
Technical Issues in High Availability Design
Failover Management
Data Replication
Dealing with Software Design Faults
C Language in Critical Systems
Software Robustness: MISRA-C, LINT, Static Code Analyzers
Exercise: C-Language Shenanigans
Final Examination.

www.sela.co.il 03-6176066