Séminaire de Sûreté de Fonctionnement de l'X (Dependability Seminar of École Polytechnique)
Safety Integrity Levels
Antoine Rauzy
École Polytechnique
Agenda
- Safety Integrity Levels and related measures as introduced by the Standards
- How to interpret these notions?
- Calculations
- Wrap-Up
Risk: a Bi-Dimensional Notion
A Risk for a System is a pair (e, q), where e is an Initiating Event that leads the system into a Degraded State q in which the integrity of the system is more or less severely impacted.
Whether a Risk is Acceptable depends on the Frequency of the event e and on the Severity of the degradation in the state q.

IEC 61508 Risk Matrix
Severity classes:
- Negligible: minor injuries at worst
- Marginal: major injuries to one or more persons
- Critical: loss of a single life
- Catastrophic: multiple loss of life

Frequency  | Range            | Negligible  | Marginal     | Critical     | Catastrophic
Frequent   | > 10^-3          | Undesirable | Unacceptable | Unacceptable | Unacceptable
Probable   | 10^-3 to 10^-4   | Tolerable   | Undesirable  | Unacceptable | Unacceptable
Occasional | 10^-4 to 10^-5   | Tolerable   | Tolerable    | Undesirable  | Unacceptable
Remote     | 10^-5 to 10^-6   | Acceptable  | Tolerable    | Tolerable    | Undesirable
Improbable | 10^-6 to 10^-7   | Acceptable  | Acceptable   | Tolerable    | Tolerable
Incredible | < 10^-7          | Acceptable  | Acceptable   | Acceptable   | Acceptable
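The risk matrix above can be read as a simple lookup from a (frequency, severity) pair to an acceptability level. The following is a minimal sketch of that lookup; the function name `classify` and the choice to treat each range as open at its lower bound are assumptions made for illustration, since the slide does not say to which class a boundary value belongs.

```python
SEVERITIES = ("Negligible", "Marginal", "Critical", "Catastrophic")

# (frequency class, lower bound of the class, acceptability per severity).
# Bounds are taken from the IEC 61508 example matrix; treating the lower
# bound as exclusive is an assumption of this sketch.
RISK_MATRIX = [
    ("Frequent",   1e-3, ("Undesirable", "Unacceptable", "Unacceptable", "Unacceptable")),
    ("Probable",   1e-4, ("Tolerable", "Undesirable", "Unacceptable", "Unacceptable")),
    ("Occasional", 1e-5, ("Tolerable", "Tolerable", "Undesirable", "Unacceptable")),
    ("Remote",     1e-6, ("Acceptable", "Tolerable", "Tolerable", "Undesirable")),
    ("Improbable", 1e-7, ("Acceptable", "Acceptable", "Tolerable", "Tolerable")),
    ("Incredible", 0.0,  ("Acceptable", "Acceptable", "Acceptable", "Acceptable")),
]


def classify(frequency, severity):
    """Return (frequency class, acceptability) for a risk (e, q)."""
    if frequency < 0.0:
        raise ValueError("a frequency cannot be negative")
    column = SEVERITIES.index(severity)
    for name, lower_bound, row in RISK_MATRIX:
        if frequency > lower_bound:
            return name, row[column]
    # Only reached when frequency == 0.
    return "Incredible", RISK_MATRIX[-1][2][column]


if __name__ == "__main__":
    print(classify(5e-4, "Critical"))
    print(classify(5e-6, "Marginal"))
```

Mitigating a risk then amounts to moving it to another cell: lowering the frequency changes the row, lowering the severity changes the column.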
Risk Mitigation
To mitigate a Risk, one has to reduce its Frequency, its Severity, or both.
Risk Mitigation is usually achieved by means of Safety Mechanisms and induces additional Development Efforts.
(Figure: risk reduction, by reduction of the frequency and by reduction of the severity.)
Safety Standards
Safety is regulated by Standards:
- Safety Instrumented Systems: IEC 61508
- Industrial Processes: IEC 61511
- Nuclear: IEC 61513
- Machines: IEC 62061
- Train: EN 50126, EN 50128, EN 50129
- Automotive: ISO 26262
- Avionic: ARP 4761, ARP 4754, DO 178B (C), DO 254
Strongly related concepts appear under different names in standards:
- Safety Integrity Levels (IEC 61508, IEC 61511)
- Automotive Safety Integrity Levels (ISO 26262)
- Design Assurance Levels (ARP 4754, ARP 4761, DO 178B)
- Tolerable Hazard Rate (EN 50126, EN 50128, EN 50129)
They are indicators of:
- The severity of the risk under consideration
- The mitigation necessary to make this risk acceptable
- The effort to be done to achieve this risk mitigation
Design Assurance Levels (ARP 4754)
A Risk is acceptable when its frequency per flight hour is lower than the one defined for the DAL corresponding to its severity.
Severity classes:
- Minor: failure is noticeable, but has a lesser impact than a Major failure (for example, causing passenger inconvenience or a routine flight plan change)
- Major: failure is significant, but has a lesser impact than a Hazardous failure (for example, leads to passenger discomfort rather than injuries) or significantly increases crew workload
- Hazardous: failure has a large negative impact on safety or performance, or reduces the ability of the crew to operate the aircraft due to physical distress or a higher workload, or causes serious or fatal injuries among the passengers
- Catastrophic: failure may cause a crash. Error or loss of critical function required to safely fly and land aircraft

Frequency per flight hour (F)    | Minor      | Major        | Hazardous    | Catastrophic
Probable: 10^-5 < F ≤ 10^-1      | DAL D      | Unacceptable | Unacceptable | Unacceptable
Occasional: 10^-7 < F ≤ 10^-5    | Acceptable | DAL C        | Unacceptable | Unacceptable
Remote: 10^-9 < F ≤ 10^-7        | Negligible | Acceptable   | DAL B        | Unacceptable
Improbable: F ≤ 10^-9            | Negligible | Negligible   | Acceptable   | DAL A
IEC 61508: Safety Instrumented Systems
Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems: Safety Instrumented Systems
(Figure: a Safety Instrumented System made of Sensors, a Logic Controller and Actuators. Risk: over pressure in the reactor.)
IEC 61508 Prescriptions/Concerns:
1. Reduction of Systematic Failures (e.g. errors in the logic of the controller)
2. Probabilistic Safety Assessment (random mechanical failures in the SIS)
3. Architectural constraints (redundancies, ...)
IEC 61508: Safety Integrity Levels
The Safety Integrity Level (SIL) of the Safety Instrumented System (SIS) is determined by the Risk Reduction Factor (RRF) provided by the SIS to the Equipment Under Control (EUC).
Assuming that the SIS prevents the whole risk, this is also a measure of the likelihood of a Failure of the SIS.

SIL   | Low Demand Mode: Probability of Failure on Demand (PFDavg) | RRF                 | High Demand Mode: Probability of Failure per Hour (PFH)
SIL 4 | 10^-5 to 10^-4                                             | RRF > 10000         | 10^-9 to 10^-8
SIL 3 | 10^-4 to 10^-3                                             | 1000 ≤ RRF < 10000  | 10^-8 to 10^-7
SIL 2 | 10^-3 to 10^-2                                             | 100 ≤ RRF < 1000    | 10^-7 to 10^-6
SIL 1 | 10^-2 to 10^-1                                             | 10 ≤ RRF < 100      | 10^-6 to 10^-5
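The SIL table can be read as a mapping from a calculated PFDavg (low demand mode) or PFH (high demand mode) to a level. Here is a minimal sketch of that mapping; the names `sil_band` and `risk_reduction_factor` are hypothetical, and treating each band as closed at its lower bound and open at its upper bound is an assumption, since the table does not say to which band a boundary value belongs.

```python
# Bands copied from the IEC 61508 table: mode -> {SIL: (lower, upper)}.
SIL_BANDS = {
    "low":  {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)},
    "high": {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)},
}


def sil_band(value, mode):
    """Return the SIL whose band contains `value`, or None if no band does.

    `value` is a PFDavg when mode == "low" and a PFH when mode == "high".
    Values below the SIL 4 band or above the SIL 1 band yield None.
    """
    for sil, (lower, upper) in SIL_BANDS[mode].items():
        if lower <= value < upper:  # boundary convention is an assumption
            return sil
    return None


def risk_reduction_factor(pfd_avg):
    """RRF read as the inverse of PFDavg (the SIS prevents the whole risk)."""
    return 1.0 / pfd_avg


if __name__ == "__main__":
    print(sil_band(5e-4, "low"), risk_reduction_factor(5e-4))
    print(sil_band(2e-8, "high"))
```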
Other Standards
Similar concepts are used in other Standards, e.g.

ISO 26262: random hardware failure target values
ASIL   | Target
ASIL D | < 10^-8 h^-1
ASIL C | < 10^-7 h^-1
ASIL B | < 10^-6 h^-1
ASIL A | < 10^-5 h^-1

EN 50126: maximum THR per hour
SIL | Maximum THR per hour
4   | 10^-9 to 10^-8
3   | 10^-8 to 10^-7
2   | 10^-7 to 10^-6
1   | 10^-6 to 10^-5
Issues with IEC 61508
The formulation of the current version of IEC 61508 (next version expected in 2015) raises a number of problems:
- Fuzzy definitions:
  - High demand mode: systems that operate continuously (more than once per year)
  - Low demand mode: systems that operate intermittently (less than once a year)
- Obscure concepts: PFD? PFH? Safe Failure Fraction
- Formulas given without justification
- Many issues not taken into account, e.g. ageing, failure dependencies

SIL   | Low Demand Mode: Probability of Failure on Demand (PFDavg) | High Demand Mode: Probability of Failure per Hour (PFH)
SIL 4 | 10^-5 to 10^-4                                             | 10^-9 to 10^-8
SIL 3 | 10^-4 to 10^-3                                             | 10^-8 to 10^-7
SIL 2 | 10^-3 to 10^-2                                             | 10^-7 to 10^-6
SIL 1 | 10^-2 to 10^-1                                             | 10^-6 to 10^-5
Risk = Hazard x Exposure
Hazard: condition, event, or circumstance that could lead to or contribute to an unplanned or undesirable event (FAA).
The initiating event of the accident sequence can be external to the system, i.e. a hazard, or internal, or a combination of both.
The frequency of the risk is the product of the probability of this initiating event and the exposure time.
Example of hazard: Bird Strike

Flight phase     | Exposure | Accidents | Fatalities
taxi             | -        | 12%       | 0%
takeoff          | 1%       | 12%       | 16%
initial climb    | 1%       | 8%        | 14%
climb            | 14%      | 10%       | 13%
cruise           | 57%      | 8%        | 16%
descent          | 11%      | 4%        | 4%
initial approach | 12%      | 10%       | 12%
final approach   | 3%       | 11%       | 13%
landing          | 1%       | 25%       | 12%

Averaging the risk per use hour (here flight hour) may be incorrect: although a bird strike can be dangerous only during takeoff and initial climb, those two phases occur in each flight.
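To make the last remark concrete, the sketch below divides the share of accidents of each flight phase by its share of exposure. It assumes that the eight exposure figures of the table correspond to the phases from takeoff to landing (taxi is not flight time and is left out); the function name `relative_accident_density` is hypothetical.

```python
# phase -> (exposure %, accidents %), copied from the table above.
# Assumption: the eight exposure values map to takeoff ... landing.
PHASES = {
    "takeoff": (1, 12),
    "initial climb": (1, 8),
    "climb": (14, 10),
    "cruise": (57, 8),
    "descent": (11, 4),
    "initial approach": (12, 10),
    "final approach": (3, 11),
    "landing": (1, 25),
}


def relative_accident_density(phases):
    """Share of accidents divided by share of exposure, for each phase.

    A value of 1 means the phase has as many accidents as its duration
    alone would suggest; larger values mean accidents are concentrated there.
    """
    return {name: accidents / exposure
            for name, (exposure, accidents) in phases.items()}


if __name__ == "__main__":
    for name, density in relative_accident_density(PHASES).items():
        print(f"{name:17s} {density:6.2f}")
```

Under this assumption, the short phases (takeoff, initial climb, final approach, landing) concentrate far more accidents per unit of exposure than cruise, which is why a single average per flight hour hides most of the information.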
Reliability versus Availability
S: system under study. T: (random variable) date of the first failure of S.
Reliability: $R_S(t) \stackrel{def}{=} \Pr\{t < T\}$, $F_S(t) \stackrel{def}{=} 1 - R_S(t)$
Availability: $A_S(t) \stackrel{def}{=} \Pr\{S \text{ is working at } t\}$, $Q_S(t) \stackrel{def}{=} 1 - A_S(t)$
If the system is not repairable, then $R_S(t) = A_S(t)$.
However, most of the components of the Safety Instrumented Systems are periodically tested.
In general, $F_S(t) \neq Q_S(t)$ for a periodically tested component.
Low Demand versus High Demand Modes
Low Demand Mode: demand frequency << test frequency
- When failed, the Safety Instrumented System is likely to be repaired before being demanded.
- The Safety Instrumented System behaves almost independently from the Equipment Under Control.
- Consequence: $PFD_S(t) = Q_S(t)$, and of course $PFD_{avg,S}(t) = \frac{1}{t}\int_0^t Q_S(u)\,du$
(Figure: PFD and PFDavg curves.)
High Demand Mode: demand frequency > test frequency
- The Safety Instrumented System and the Equipment Under Control are tightly linked.
- An accident is likely to occur as soon as the SIS fails.
- Consequence: $PFH_S(t)$ is related to the (un)reliability $F_S(t)$
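As an illustration of the low demand case, here is a minimal sketch of the PFDavg of a single periodically tested component, taking PFDavg as the time average of the unavailability Q_S(t). The model is an assumption made for the example, not something prescribed by the slides: constant failure rate `lam`, tests every `tau` hours, tests that are perfect and instantaneous (the component is as good as new after each test). The numerical values are illustrative only.

```python
import math


def unavailability(t, lam, tau):
    """Q(t) for a component renewed at each test: 1 - exp(-lam * time since last test)."""
    return -math.expm1(-lam * (t % tau))


def pfd_avg_exact(lam, tau):
    """Average of Q over one test interval, in closed form."""
    x = lam * tau
    return 1.0 + math.expm1(-x) / x


def pfd_avg_numeric(lam, tau, n_intervals=3, steps_per_interval=10_000):
    """Same average, by the midpoint rule over several test intervals."""
    dt = tau / steps_per_interval
    steps = n_intervals * steps_per_interval
    total = sum(unavailability((i + 0.5) * dt, lam, tau) for i in range(steps))
    return total / steps


if __name__ == "__main__":
    lam, tau = 1e-5, 8760.0  # illustrative: 1e-5 per hour, yearly test
    print(pfd_avg_exact(lam, tau), pfd_avg_numeric(lam, tau), lam * tau / 2)
```

When `lam * tau` is small, the exact average is close to the familiar approximation `lam * tau / 2`.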
Periodically Tested Systems
To be taken into account at component level:
- Availability of the component during the test
- Coverage (probability that a failure is actually detected)
- Failures due to the test (bad reconfiguration)
- Duration of the test
To be taken into account at system level:
- Average versus maximum
- Test shifts
(Figures: 1-out-of-1 system unavailable during the test; 2-out-of-2 system with simultaneous tests; 2-out-of-2 system with shifted tests.)
Probability of Failure per Hour (1)
How to interpret the notion of Probability of Failure per Hour?
$PFH_S(t) \stackrel{?}{=} \frac{F_S(t)}{t}$ does not mean anything. Moreover, it tends to 0 as t increases.
$f_S(t) \stackrel{def}{=} \frac{dF_S(t)}{dt}$
The Failure Intensity $f_S(t)$ is of no help, for the very same reason. Moreover it is not a probability.
$r_S(t) \stackrel{def}{=} \lim_{dt \to 0} \frac{\Pr\{\text{the system fails between } t \text{ and } t + dt \mid C\}}{dt}$
where C stands for the event: the system worked without interruption from 0 to t (included).
The Failure Rate $r_S(t)$ is of no help. It is undetermined when t increases.
Probability of Failure per Hour (2)
$w_S(t) \stackrel{def}{=} \lim_{dt \to 0} \frac{\Pr\{\text{the system fails between } t \text{ and } t + dt \mid E\}}{dt}$
where E stands for the event: the system was working at t = 0.
The Unconditional Failure Intensity $w_S(t)$ is probably the right notion.
We know how to calculate $w_S(t)$ (see articles by Dutuit, Rauzy & Signoret):
$w_S(t) = \sum_c MIF_{S,c}(t) \cdot w_c(t)$
$MIF_{S,c}(t) \stackrel{def}{=} \frac{\partial Q_S(t)}{\partial Q_c(t)}$
$w_c(t) = \lambda_c(t) \cdot A_c(t)$
where $\lambda_c(t)$ is the Failure Rate of component c and $MIF_{S,c}(t)$ is the Marginal Importance Factor, also called Birnbaum Importance Factor.
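The formula for w_S(t) can be tried on a very small case. The sketch below assumes, for illustration only, a system of two non-repairable components in parallel (the system fails when both have failed) with constant failure rates, so that Q_c(t) = 1 - exp(-λ_c t) and A_c(t) = exp(-λ_c t). The function names are hypothetical. The Birnbaum factor is computed as Q_S with Q_c set to 1 minus Q_S with Q_c set to 0, which equals the partial derivative because Q_S is linear in each Q_c when components are independent.

```python
import math


def q_parallel(q):
    """Unavailability of a system that fails only when all components have failed."""
    return math.prod(q)


def birnbaum(q_system, q, c):
    """MIF_{S,c}: Q_S with component c failed minus Q_S with component c working."""
    failed = list(q)
    failed[c] = 1.0
    working = list(q)
    working[c] = 0.0
    return q_system(failed) - q_system(working)


def system_failure_intensity(q_system, rates, t):
    """w_S(t) = sum over c of MIF_{S,c}(t) * w_c(t), with w_c = lambda_c * A_c."""
    q = [-math.expm1(-lam * t) for lam in rates]      # Q_c(t)
    w = [lam * math.exp(-lam * t) for lam in rates]   # lambda_c * A_c(t)
    return sum(birnbaum(q_system, q, c) * w[c] for c in range(len(rates)))


if __name__ == "__main__":
    rates = [1e-4, 2e-4]  # illustrative failure rates, per hour
    print(system_failure_intensity(q_parallel, rates, 1000.0))
```

For a non-repairable system, w_S(t) coincides with the derivative of F_S(t), which gives a simple way to check the result numerically.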
Safe Failure Fraction
The Safe Failure Fraction (SFF) is an awkward attempt to measure the proportion/relative importance of failures/failure scenarios of the Safety Instrumented System that are not harmful for the Equipment Under Control.
Similar indicators are proposed in other standards.
(Figure: state diagram with a nominal state, a degraded state, a safe state, a repair state and an accident state, linked by the transitions failure(s), detection/alert, test/maintenance and repair.
- safe state: the Equipment Under Control is stopped correctly
- accident: the Safety Instrumented System is demanded while being failed
- repair state: the Safety Instrumented System is repaired before being demanded)
At best the SFF can be understood as a conditional probability: given that a failure occurs, what is the probability that something bad happens?
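Multi-Phases Markov Processes
Multi-phases Markov processes with rewards
- Phase 1 (before 1st test): states OK and KO; transition from OK to KO at rate λ.
- Phase 2 (between tests): states OK, KO and R; transition from OK to KO at rate λ, transition from R to OK at rate µ.

Transition matrix from phase 1 to phase 2:
     OK    KO    R
OK   0.99  0.01  0.00
KO   0.00  0.02  0.98

Transition matrix from phase 2 to phase 2:
     OK    KO    R
OK   0.99  0.01  0.00
KO   0.00  0.02  0.98
R    0.00  0.00  1.00

Rewards:
- phase 1, availability: OK 1.00, KO 0.00
- phase 2, availability: OK 1.00, KO 0.00, R 0.00

Calculations:
- Transient probabilities (+ rewards)
- Sink state for unreliability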
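A minimal sketch of such a calculation for one component is given below. It uses the three states OK, KO and R and the test matrix of the slide; everything else is an assumption made for illustration: the reading of µ as the rate from R back to OK, the numerical values of λ, µ and of the test interval, and the use of a plain explicit Euler scheme between tests (a real tool would rather use a dedicated transient solver). Phase 1 is handled with the same three states, since R has probability 0 before the first test.

```python
# States are indexed OK = 0, KO = 1, R = 2.
TEST_MATRIX = [
    [0.99, 0.01, 0.00],  # OK: 1% of the tests leave the component failed
    [0.00, 0.02, 0.98],  # KO: the failure is detected with probability 0.98
    [0.00, 0.00, 1.00],  # R: a repair in progress is not affected by the test
]
AVAILABILITY_REWARD = [1.0, 0.0, 0.0]


def evolve(p, lam, mu, duration, dt):
    """Integrate the phase with explicit Euler steps.

    Returns the final distribution and the integral of the availability reward.
    """
    reward = 0.0
    for _ in range(round(duration / dt)):
        reward += sum(r * x for r, x in zip(AVAILABILITY_REWARD, p)) * dt
        ok, ko, rep = p
        p = [ok + (mu * rep - lam * ok) * dt,  # OK loses failures, gains repairs
             ko + lam * ok * dt,               # KO gains failures
             rep - mu * rep * dt]              # R loses completed repairs
    return p, reward


def apply_test(p, matrix):
    """Redistribute the probabilities at a test (row vector times matrix)."""
    n = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def average_availability(lam, mu, tau, n_tests, dt=0.1):
    """Average availability over n_tests test intervals, starting from OK."""
    p = [1.0, 0.0, 0.0]
    total = 0.0
    for _ in range(n_tests):
        p, reward = evolve(p, lam, mu, tau, dt)
        total += reward
        p = apply_test(p, TEST_MATRIX)
    return total / (n_tests * tau), p


if __name__ == "__main__":
    # Illustrative values: lam = 1e-4 /h, mu = 0.1 /h, one test every 720 h.
    print(average_availability(1e-4, 0.1, 720.0, 10))
```

Replacing the reward vector, or making KO absorbing (a sink state), gives the other indicators mentioned on the slide with the same machinery.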
Calculations
Several tools can be used for SIL calculation, e.g.
- Multi-Phases Markov Processes with Rewards are a very convenient framework to model components and calculate their availability. They can hardly be used for systems because of the exponential blow-up of the size of the models.
- For systems, tools such as Fault Trees must be used.
- Otherwise, there remains Stochastic Simulation (from Petri Nets, AltaRica, ...), which is a versatile tool and works well, at least when numbers are not too low.
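The remark about low numbers can be illustrated with a small Monte Carlo sketch. It estimates the probability that a system of two non-repairable components in parallel fails before a mission time; the system, the failure rates and the function name `simulate_unreliability` are assumptions chosen for illustration, not a model taken from the slides.

```python
import math
import random


def simulate_unreliability(rates, mission_time, n_runs, seed=0):
    """Estimate Pr{system failed before mission_time} and its standard error.

    The system is assumed to fail when its last component fails (parallel
    redundancy); component lifetimes are exponentially distributed.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_runs):
        last_failure = max(rng.expovariate(lam) for lam in rates)
        if last_failure <= mission_time:
            failures += 1
    estimate = failures / n_runs
    std_error = math.sqrt(estimate * (1.0 - estimate) / n_runs)
    return estimate, std_error


if __name__ == "__main__":
    rates, mission_time = [1e-4, 2e-4], 1000.0
    exact = math.prod(-math.expm1(-lam * mission_time) for lam in rates)
    print(simulate_unreliability(rates, mission_time, 200_000), exact)
```

The relative standard error of such an estimate is about sqrt((1 - p) / (n p)) for a probability p estimated from n runs, so the number of runs needed grows as 1/p: this is why plain simulation becomes impractical for targets in the 10^-8 or 10^-9 range.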
SIL/DAL Allocation
Standards such as ARP 4761 describe top-down allocation methods, typically based on the Fault Tree structure.
(Figure: fault tree for the top event "Loss of the Aircraft", with Safety Requirement 5.0 x 10^-9. Events shown: Unannounced Loss of Braking Capacities; Spurious Braking after V1; Other Failure Conditions; Loss of Thrust Inverters; Loss of Wheel Braking; Spurious Thrust Inversion after V1; Spurious Wheel Braking after V1; Erroneous Fault Report after V1; Announced Loss of the Braking System on Dangerous Runway; Unannounced Loss of the Braking System. The targets allocated to these events range from 5.0 x 10^-9 to 5.0 x 10^-3.)
Such empirical methods can take seamlessly into account duplicated events, technology readiness, costs, ...
It is much better to see allocation as optimization problems, working bottom-up.
Wrap-Up
Safety Standards introduce SIL and related concepts.
Definitions are rather fuzzy. The same concept may appear under different names and the same name can be used with different meanings.
However, once mathematically cleaned up, SIL and related concepts are very useful indicators of:
- The severity of the risk under consideration
- The mitigation necessary to make this risk acceptable
- The effort to be done to achieve this risk mitigation
There exist efficient calculation methods:
- Multi-phases Markov Processes with Rewards (for components)
- Fault Trees
- Stochastic Simulation