Embedded Systems Lecture 9: Reliability & Fault Tolerance. Björn Franke University of Edinburgh



Overview Definitions System Reliability Fault Tolerance

Sources and Detection of Errors
Stage: Specification & Design. Error sources: Algorithm Design, Formal Specification. Error detection: Consistency Checks, Simulation.
Stage: Prototype. Error sources: Algorithm Design, Wiring & Assembly, Timing, Component Failure. Error detection: Stimulus/Response Testing.
Stage: Manufacture. Error sources: Wiring & Assembly, Component Failure. Error detection: System Testing, Diagnostics.
Stage: Installation. Error sources: Assembly, Component Failure. Error detection: System Testing, Diagnostics.
Stage: Field Operation. Error sources: Component Failure, Operator Errors, Environmental Factors. Error detection: Diagnostics.

Definitions
Reliability: survival probability. The relevant measure when the function is critical during the mission time.
Availability: the fraction of time a system meets its specification. The relevant measure when continuous service is important but service can be delayed or denied.
Failsafe: the system fails to a known safe state.
Dependability: a generalisation; the system does the right thing at the right time.

System Reliability R_F(t)
The reliability of a system is the probability that no fault of the class F occurs (i.e. the system survives) during time t:
R_F(t) = P(t_init ≤ t < t_f, ∀ f ∈ F)
where t_init is the time of introduction of the system to service, and t_f is the time of occurrence of the first failure f drawn from F.
The failure probability Q_F(t) is complementary to R_F(t): R_F(t) + Q_F(t) = 1.
We can drop the F subscript from R_F(t) and Q_F(t).
When the lifetime of a system is exponentially distributed, the reliability of the system is R(t) = e^(-λt), where the parameter λ is called the failure rate.
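The exponential reliability model above can be sketched in a few lines of Python. This is a minimal illustration; the failure rate and mission time are made-up example values:

```python
import math

def reliability(lam, t):
    """R(t) = exp(-lambda * t) for a component with constant failure rate lam."""
    return math.exp(-lam * t)

def failure_probability(lam, t):
    """Q(t) = 1 - R(t), the complementary failure probability."""
    return 1.0 - reliability(lam, t)

# Hypothetical component: lambda = 0.12 failures per million operating hours,
# evaluated over a 10,000-hour mission.
lam = 0.12 / 1e6          # failures per hour
t = 10_000                # hours
print(reliability(lam, t))                                # close to 1 (about 0.9988)
print(reliability(lam, t) + failure_probability(lam, t))  # R + Q = 1
```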

Component Reliability Model
The component failure rate follows the "bathtub curve" with three phases: Burn In, Useful Life, Wear Out.
During its useful life, a component exhibits a constant failure rate λ, so its reliability can be modelled using an exponential distribution: R(t) = e^(-λt).

Component Failure Rate
Failure rates are often expressed in failures per million operating hours. Failure rates λ of automotive embedded system components:
Military Microprocessor: 0.022
Typical Automotive Microprocessor: 0.12
Electric Motor:
Lead/Acid Battery: 16.9
Oil Pump: 37.3
Automotive Wiring Harness (luxury): 775

MTTF: Mean Time To Failure
MTTF (Mean Time To (first) Failure, or expected life) is defined as the expected value of t_f:
MTTF = E(t_f) = ∫₀^∞ R(t) dt = 1/λ
where λ is the failure rate.
The MTTF of a system is the expected time of the first failure in a sample of identical, initially perfect systems.
MTTR: Mean Time To Repair is defined as the expected time for repair.
MTBF: Mean Time Between Failures.
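The integral relation MTTF = ∫₀^∞ R(t) dt = 1/λ can be checked numerically. A rough sketch with a hypothetical failure rate, using a simple Riemann sum rather than a proper quadrature routine:

```python
import math

def mttf_numeric(lam, dt=1.0, t_max=None):
    """Approximate MTTF = integral of R(t) = exp(-lam*t) by a Riemann sum."""
    if t_max is None:
        t_max = 20.0 / lam   # far into the tail of the exponential
    total, t = 0.0, 0.0
    while t < t_max:
        total += math.exp(-lam * t) * dt
        t += dt
    return total

lam = 0.001   # hypothetical failure rate: 0.001 failures per hour
print(mttf_numeric(lam))   # approximately 1/lam = 1000 hours
```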

MTTF - MTTR - MTBF
MTTF, MTTR and MTBF are related by MTBF = MTTF + MTTR, and Availability = MTBF/(MTBF + MTTR).
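The availability formula above can be evaluated directly. The MTBF and MTTR figures below are hypothetical example values:

```python
def availability(mtbf, mttr):
    """Steady-state availability: fraction of time the system is in service."""
    return mtbf / (mtbf + mttr)

# Hypothetical system: fails every 2000 hours on average, takes 4 hours to repair.
print(availability(2000, 4))   # about 0.998
```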

Serial System Reliability
Serially connected components: R_k(t) = e^(-λ_k t) is the reliability of a single component k.
Assuming the failure rates of the components are statistically independent, the overall system reliability is
R_ser(t) = R_1(t) · R_2(t) · R_3(t) · ... · R_n(t) = ∏_{i=1}^n R_i(t) = e^(-t Σ_{i=1}^n λ_i)
so the serial failure rate is λ_ser = Σ_{i=1}^n λ_i.
No redundancy: the overall system reliability depends on the proper working of every component.

System Reliability
Building a reliable serial system is extraordinarily difficult and expensive. For example, to build a serial system with 100 components, each of which has a reliability of 0.999, the overall system reliability would be (0.999)^100 = 0.905.
Reliability of a System of Components
Minimal Path Set: a minimal set of components whose functioning ensures the functioning of the system, e.g. {1,3,4}, {2,3,4}, {1,5}, {2,5}.
G.Khan, COE718: HW/SW Codesign of Embedded Systems
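The serial formulas can be captured in a short sketch, which also reproduces the 100-component example above:

```python
from functools import reduce

def serial_reliability(reliabilities):
    """R_ser = product of the individual component reliabilities."""
    return reduce(lambda a, b: a * b, reliabilities, 1.0)

def serial_failure_rate(lambdas):
    """lambda_ser = sum of the component failure rates."""
    return sum(lambdas)

# The slide's example: 100 components in series, each with reliability 0.999.
print(round(serial_reliability([0.999] * 100), 3))   # 0.905
```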

Parallel System Reliability
Parallel connected components: the failure probability of a single component k is Q_k(t) = 1 - R_k(t) = 1 - e^(-λ_k t).
Assuming the failure rates of the components are statistically independent, the overall failure probability is Q_par(t) = ∏_{i=1}^n Q_i(t), and the overall system reliability is
R_par(t) = 1 - ∏_{i=1}^n (1 - R_i(t))

Example
Consider 4 identical modules connected in parallel. The system operates correctly provided at least one module is operational. If the reliability of each module is 0.95, the overall system reliability is 1 - (1 - 0.95)^4 = 0.99999375.
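A minimal sketch of the parallel formula, reproducing the four-module example above:

```python
def parallel_reliability(reliabilities):
    """R_par = 1 - product of (1 - R_i): the system fails only if all modules fail."""
    q = 1.0
    for r in reliabilities:
        q *= (1.0 - r)
    return 1.0 - q

# Four identical modules with R = 0.95 in parallel.
print(parallel_reliability([0.95] * 4))   # 0.99999375
```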

Parallel-Serial Reliability
Parallel and serial connected components: the total reliability is the reliability of the first half in series with that of the second half.
Given R1 = 0.9, R2 = 0.9, R3 = 0.99, R4 = 0.99, R5 = 0.87:
R_t = [1 - (1 - 0.9)(1 - 0.9)] · [1 - (1 - 0.87)(1 - 0.99 · 0.99)] = 0.987
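Combining the two helper functions reproduces the parallel-serial example; the network structure ({R1, R2} in parallel, in series with {R5 in parallel with R3·R4}) is taken from the slide:

```python
def parallel_r(rs):
    """1 - product of (1 - R_i) for components in parallel."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

def serial_r(rs):
    """Product of R_i for components in series."""
    p = 1.0
    for r in rs:
        p *= r
    return p

r1 = r2 = 0.9
r3 = r4 = 0.99
r5 = 0.87
first_half = parallel_r([r1, r2])                  # 1 - 0.1*0.1 = 0.99
second_half = parallel_r([r5, serial_r([r3, r4])]) # R5 parallel with R3*R4
print(round(serial_r([first_half, second_half]), 3))   # 0.987
```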

Faults and Their Sources
What is a fault? A fault is an erroneous state of software or hardware resulting from failures of its components.
Fault sources: design errors, manufacturing problems, external disturbances, harsh environmental conditions, system misuse.

Fault Sources
Mechanical (wears out): deterioration (wear, fatigue, corrosion); shock (fractures, overload, etc.).
Electronic hardware (bad fabrication; wears out): latent manufacturing defects; operating environment (noise, heat, ESD, electromigration); design defects.
Software (bad design): design defects; "code rot", i.e. accumulated run-time faults.
People: human error could fill a whole lecture on its own...

Fault Classifications
Failure: a component does not provide service.
Fault: a defect within a system.
Error: a deviation from the required operation of the system or subsystem.
Faults are classified by:
Extent: local (independent) or distributed (related).
Value: determinate or indeterminate (varying values).
Duration: transient, intermittent or permanent.

Fault-Tolerant Computing
The main aspects of fault-tolerant computing (FTC) are: fault detection, fault isolation and containment, system recovery, fault diagnosis, and repair.
[Timeline figure: normal processing; a fault is detected at t_d, isolated at t_i and recovered from at t_r; a spare is switched in, followed by diagnosis and repair time.]

Tolerating Faults
There is a four-fold categorisation of methods to deal with system faults and increase system reliability and/or availability:
Fault Avoidance: how to prevent fault occurrence; increase reliability through conservative design and use of high-reliability components.
Fault Tolerance: how to provide the service complying with the specification in spite of faults having occurred or occurring.
Fault Removal: how to minimise the presence of faults.
Fault Forecasting: how to estimate the presence, occurrence, and consequences of faults.
Fault tolerance is the ability of a computer system to survive in the presence of faults.

Fault-Tolerance Techniques Hardware Fault Tolerance Software Fault Tolerance

Hardware Fault-Tolerance Techniques
Fault detection.
Redundancy (static and dynamic): use of extra components to mask the effect of a faulty component.
Redundancy alone does not guarantee fault tolerance; in fact, the extra hardware leads to higher fault arrival rates. Redundancy management is important: a fault-tolerant computer can end up spending as much as 50% of its throughput on managing redundancy.

Hardware Fault-Tolerance: Fault Detection
Detection of a failure is a challenge: many faults are latent and only show up later. Fault detection gives a warning when a fault occurs.
Duplication: two identical copies of hardware run the same computation and compare each other's results. When the results do not match, a fault is declared.
Error-Detecting Codes: utilise information redundancy.

Hardware Fault-Tolerance: Redundancy
Static and dynamic redundancy: extra components mask the effect of a faulty component.
Masking (static) redundancy: once the redundant copies of an element are installed, their interconnection remains fixed, e.g. N-modular redundancy (NMR), ECC, TMR (Triple Modular Redundancy). In TMR, 3 identical copies of a module provide separate results to a voter that produces a majority vote at its output.
Dynamic redundancy: the system configuration is changed in response to faults. Its success largely depends upon the fault detection ability.

TMR Configuration
Processors PR1, PR2 and PR3 execute different versions of the code for the same application. The voter compares the results and forwards the majority vote (two out of three) to the output.
[Figure: inputs feed PR1, PR2 and PR3 in parallel; their results go to the voter V, which produces the outputs.]
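The voter's two-out-of-three logic can be sketched in a few lines; this models only the vote, not the redundant processors themselves:

```python
def majority_vote(results):
    """Two-out-of-three majority voter; returns None if all three disagree."""
    a, b, c = results
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None   # no majority: the voter cannot mask the fault

# One processor produces a faulty result (41); the voter masks it.
print(majority_vote([42, 41, 42]))   # 42
print(majority_vote([1, 2, 3]))      # None
```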

Software Fault-Tolerance
Hardware-based fault tolerance provides tolerance against physical, i.e. hardware, faults. How do we tolerate design/software faults?
It is virtually impossible to produce fully correct software. We need something to prevent software bugs from causing system disasters, and to mask out software bugs.
Tolerating unanticipated design faults is much more difficult than tolerating anticipated physical faults.
Software fault tolerance is needed because software bugs will occur no matter what we do, and there is no fully dependable way of eliminating them; these bugs have to be tolerated.

Tolerating Software Failures
How to tolerate software faults? Software fault tolerance uses design redundancy to mask residual design faults of software programs.
Software fault tolerance strategy: defensive programming. If you cannot be sure that what you are doing is correct, do it in many ways:
Review and test the software.
Verify the software.
Execute the specifications.
Produce programs automatically.

SW Fault-Tolerance Techniques
Software fault tolerance builds on hardware fault tolerance, but software fault detection is a bigger challenge: many software faults are latent and only show up later. A watchdog can be used to detect that the program has crashed.
Options for dealing with faulty software:
Change the specification to provide a lower level of service.
Write new versions of the software and throw the original version away.
Use all the versions and vote on the results: N-Version Programming (NVP).
Use an on-line acceptance test to determine which version to believe: the Recovery Block scheme.

Fault-Tolerant Software Design Techniques
Recovery Block scheme (RB): dynamic redundancy.
N-Version Programming scheme (NVP): N-modular redundancy.
Hardware redundancy is needed to implement the above software fault-tolerance techniques.

Fault-Tolerant Software Design Techniques
[Figure: RB runs a primary and alternates with an acceptance test on one host; NVP runs versions V1, V2 and V3 on separate hosts feeding a voter.]
The RB scheme comprises three elements: a primary module to execute critical software functions; an acceptance test for the output of the primary module; and alternate modules that perform the same functions as the primary.
In NVP, N independent program variants execute in parallel on identical input, and the results are obtained by voting upon the outputs of the individual programs.
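The NVP idea can be sketched as follows. This is a simplified illustration: the three "variants" are toy functions invented for the example, and outputs are compared for exact equality (real NVP voters often need inexact comparison):

```python
from collections import Counter

def n_version_execute(versions, x):
    """Run N independently developed variants on the same input and
    return the majority output, if one exists."""
    outputs = [v(x) for v in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(versions) // 2:
        return value
    raise RuntimeError("no majority among versions")

# Three hypothetical variants of a square-root routine; v3 has a design fault.
v1 = lambda x: round(x ** 0.5, 6)
v2 = lambda x: round(pow(x, 0.5), 6)
v3 = lambda x: x / 2          # residual design fault
print(n_version_execute([v1, v2, v3], 9.0))   # 3.0 (v3's 4.5 is outvoted)
```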

RB: Recovery Block Scheme
[Architectural view: the input is saved in recovery memory; a switch selects the primary or an alternate module, whose result goes to the acceptance test (AT). If the AT fails, roll back and try the next alternate version; if it passes, deliver the output; if the AT fails and all alternates are exhausted, raise an error.]
RB uses diverse versions in an attempt to catch residual software faults.
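The control flow of the architectural view above can be sketched directly. The primary, alternate and acceptance test below are hypothetical toy functions, and "recovery memory" is modelled simply by saving the input:

```python
def recovery_block(primary, alternates, acceptance_test, x):
    """Try the primary; on acceptance-test failure, roll back to the
    saved input (recovery memory) and try each alternate in turn."""
    saved_input = x                      # checkpoint before execution
    for variant in [primary] + alternates:
        result = variant(saved_input)
        if acceptance_test(result):
            return result                # AT passed: deliver output
    raise RuntimeError("alternates exhausted")   # raise error

# Hypothetical example: the primary fails the AT, the first alternate passes.
primary = lambda x: -abs(x)             # buggy: wrong sign
alt1 = lambda x: abs(x)
acceptance = lambda r: r >= 0
print(recovery_block(primary, [alt1], acceptance, -5))   # 5
```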

Fault Recovery
The success of a fault recovery technique depends on detecting faults accurately and as early as possible. There are three classes of recovery procedures:
Full recovery: requires all the aspects of fault-tolerant computing.
Degraded recovery: also referred to as graceful degradation. Similar to full recovery, but no subsystem is switched in; the defective component is simply taken out of service. Well suited for multiprocessors.
Safe shutdown.

Fault Recovery
Two basic approaches:
Forward recovery: produces correct results through continuation of normal processing. Highly application dependent.
Backward recovery: some redundant process and state information is recorded as the computation progresses; the interrupted process is rolled back to a point for which correct information is available, e.g. retry, checkpointing, journaling.
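Backward recovery via checkpointing can be sketched as follows. This is a toy model under stated assumptions: state is a plain Python value, a "transient fault" is a simulated exception, and rollback simply restores the last snapshot:

```python
import copy

def run_with_checkpoints(state, steps, step_fn, checkpoint_every=2):
    """Backward recovery sketch: snapshot the state periodically and
    roll back to the last checkpoint when a step raises a fault."""
    checkpoint = copy.deepcopy(state)
    i = 0
    while i < steps:
        if i % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)   # record recovery point
        try:
            state = step_fn(state, i)
            i += 1
        except RuntimeError:
            state = copy.deepcopy(checkpoint)   # rollback...
            i = (i // checkpoint_every) * checkpoint_every   # ...and retry

    return state

# Each step appends its index; step 3 fails once (transient), then succeeds.
failed = {"done": False}
def step(state, i):
    if i == 3 and not failed["done"]:
        failed["done"] = True
        raise RuntimeError("transient fault")
    return state + [i]

print(run_with_checkpoints([], 5, step))   # [0, 1, 2, 3, 4]
```

After the simulated fault at step 3, the computation rolls back to the checkpoint taken at step 2 and re-executes from there, so the final result is the same as a fault-free run.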

Summary Reliability Serial Reliability, Parallel Reliability, System Reliability Fault Tolerance Hardware, Software

Preview Scheduling Theory Priority Inversion Priority Inheritance Protocol