Safety Analysis: FMEA and Risk Analysis (Lecture 8)
Failure modes and effects analysis (FMEA)
Why: to identify the contribution of component failures to system failure.
How: progressively select the individual components or functions within a system and investigate their possible modes of failure.
Information analyzed: possible failure modes, possible causes, local and system effects, and how to fix them (remedial actions).
What is the proper level?
It depends on the design stage: the analysis might be very general or very detailed.
Hardware: down to off-the-shelf components, or to field-replaceable assemblies for which failure modes are available.
Software: treated as a single component, with failure modes taken as the worst possible effects.
Humans are not included in the analysis.
Example of hardware-oriented FMEA
Evaluation of FMEA
+ Allows identification of redundancy, single-point failures, inspection points, and how often the system needs to be serviced.
+ The technique is complete.
- Time-consuming.
- Does not consider the effects of multiple or common-cause failures.
Some notes about FMEA
Hardware-oriented FMEA very often formulates software requirements vaguely, e.g., "modify software to detect failure".
How can we do better? Find a common model, i.e., a model that serves as an intermediary between safety analysis and software requirements.
Example: a conveyor system
A conveyor system consists of a feed belt and an elevating rotary table. The feed belt transports objects placed on its left end to the right end and then onto the table. The table then elevates and rotates an object to make it available for processing by further machines.
The belt has photo-electric cells, which signal when an object has arrived at either of its ends. The motor of the belt may be switched on and off: it has to be on while waiting for a new object, and it has to be switched off when an object is at the end of the belt but cannot be delivered onto the table because the table is not in the proper position.
The table lifts and rotates an object clockwise to a position for further processing. When the object is taken, the table moves down and counterclockwise to accept another object. For brevity we omit the details of the table's implementation. We merely observe that the table moves between two positions: one for loading an object from the feed belt and another for unloading an object for further machines.
Initially the table is in its loading position and the feed belt is running empty while waiting for an object to be placed.
Hazards:
An object is jammed between the belt and the table.
Objects pile up.
Safety requirements
Safety requirement 1: The feed belt may only convey an object past its exit sensor if the table is in the loading position.
Safety requirement 2: A new object may only be put on the feed belt after the exit sensor confirms that the previous one has arrived at the end of the feed belt.
Statechart of fault free system
Safety requirements in terms of statecharts
Safety requirement 1: FB is in Delivering implies TAB is in ReadyForLoading.
Safety requirement 2: EntrySen_On arrives only when FB is in VACANT.
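The two requirements above are predicates over the statechart state, so they can be checked mechanically. The following is a minimal sketch, assuming states are represented as plain strings; the class SystemState, the function names, and the table state "Moving" used in the usage note are hypothetical and not part of the lecture's model.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    fb: str   # current feed belt state, e.g. "VACANT" or "Delivering"
    tab: str  # current table state, e.g. "ReadyForLoading"


def requirement1_holds(state: SystemState) -> bool:
    """FB is in Delivering implies TAB is in ReadyForLoading."""
    return state.fb != "Delivering" or state.tab == "ReadyForLoading"


def requirement2_holds(state: SystemState, event: str) -> bool:
    """EntrySen_On arrives only when FB is in VACANT."""
    return event != "EntrySen_On" or state.fb == "VACANT"
```

For instance, requirement1_holds(SystemState("Delivering", "Moving")) is False, which corresponds to the hazard of an object being pushed off the belt while the table is not in position.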
Analysis
The model of reality that the controller has is in fact a statechart model.
The controller keeps a record of the state and updates it upon the arrival of every event.
We introduce variables to model the states of components.
Example of traditional FMEA
Unit: Feed belt entry sensor
Failure mode: Stuck at zero
Possible cause: Primary sensor failure
Local effects: Sensor sends a zero signal constantly.
System effects: Arrival of an object is undetected. No control over the distance between arriving objects. Danger of pile-up.
Remedial action: Ensure that the fault is always detected. Modify software to detect the fault. If the fault occurs, switch on the alarm and stop the system.
Formalizing FMEA
The main idea is to use statecharts to express how each error should be detected and mitigated.
Two types of detection: aberrant event and timeout.
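The two detection types can be illustrated with two small predicates. This is only a sketch under stated assumptions: the aberrant-event rule is derived from safety requirement 2 (an entry-sensor event while the belt is not vacant), the limit MAX_TRANSP_DELIV_TIME is a hypothetical value, and the function names are not from the lecture.

```python
# Hypothetical limit, in the same time units as the time stamps.
MAX_TRANSP_DELIV_TIME = 30


def is_aberrant(event: str, fb_state: str) -> bool:
    """Aberrant event: an event arrives in a state where the model forbids it.

    Here: the entry sensor fires although the feed belt is not vacant.
    """
    return event == "EntrySen_ON" and fb_state != "VACANT"


def is_timed_out(timer: str, stamp: int, now: int,
                 limit: int = MAX_TRANSP_DELIV_TIME) -> bool:
    """Timeout: an expected event has not arrived within the allowed time.

    The timer is ON while the event is awaited; stamp is when it was started.
    """
    return timer == "ON" and now - stamp > limit
```

An aberrant event is detected at the moment the unexpected event arrives, whereas a timeout has to be checked periodically, because its symptom is the absence of an event.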
Examples of formalized FMEA
Statechart model of fail-safe conveyor system (see appendix 3)
Fail-safe systems
A fail-safe system is shut down upon detection of an error.
Fail-safe controller
Constants
/* Maximal time to reload an object from the feed belt to the table */
MaxDeliveringTime
/* Maximal time for an object to travel from the beginning to the end of the feed belt */
MaxTranspDelivTime
Procedures
/* Halts feed belt */
HaltFB =
  if FB=VACANT ∧ FB_st=VACANT then
    FB := HALTINGV
    FB_st := HaltedVac
  elseif FB=FBLOADED ∧ FB_st=StartTransporting then
    FB := HALTINGL
    FB_st := HStartTransporting
/* Immediately stops feed belt */
StopFB =
  if (FB=VACANT ∨ FB=FBLOADED) then
    FB := FBSTOPPED
    FB_st := FBSTOPPED
/* Outputs message to an operator */
Warning (Msg: String) = output(Msg)
/* Timers are active when ON */
DeliveryTimer, TranspDelivTimer : {ON, OFF}
/* Time stamps record the time when timers are activated */
DeliveryTimerStamp, TranspDelivTStamp : INT
/* Object arrived at the beginning: activate timer TranspDelivTimer */
E=EntrySen_ON ∧ FB=VACANT ∧ FB_st=VACANT →
  E := NIL
  FB := FBLOADED
  FB_st := StartTransporting
  TranspDelivTimer := ON
  TranspDelivTStamp := t
/* Object arrived at the end: deactivate timer TranspDelivTimer */
E=ExitSen_ON ∧ FB=FBLOADED ∧ FB_st=Transporting ∧ TAB=TVACANT ∧ TAB_st=ReadyToLoad →
  FB_st := Delivering
  TranspDelivTimer := OFF
[]
E=ExitSen_ON ∧ FB=FBLOADED ∧ FB_st=Transporting ∧ (TAB=TVACANT ∨ TAB=TLOADED) ∧ TAB_st≠ReadyToLoad →
  E := NIL
  FB_st := Waiting
  TranspDelivTimer := OFF
/* Object passed the exit sensor while the motor was supposed to be OFF */
E=ExitSen_OFF ∧ FB=FBLOADED ∧ FB_st=Waiting →
  E := NIL
  FB := FBFAILED
  FB_st := FBFAILED
  call Warning("Feed belt motor fails to stop")
  call StopTAB
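The guarded commands of the controller can be tried out in an ordinary programming language. The sketch below is a hypothetical Python rendering of only the transitions listed in this excerpt. It assumes that the guards are conjunctions of the listed conditions, with a disjunction inside the parentheses and an inequality before ReadyToLoad. The class name, the on_event method, and the tab_stopped flag (a stand-in for StopTAB, whose body is not given) are assumptions.

```python
class FeedBeltController:
    """Sketch of the feed belt part of the fail-safe controller."""

    def __init__(self, fb="VACANT", fb_st="VACANT",
                 tab="TVACANT", tab_st="ReadyToLoad"):
        self.fb, self.fb_st = fb, fb_st
        self.tab, self.tab_st = tab, tab_st
        self.transp_deliv_timer = "OFF"
        self.transp_deliv_stamp = 0
        self.warnings = []        # messages that Warning would output
        self.tab_stopped = False  # stand-in for "call StopTAB"

    def on_event(self, event, t):
        """Apply the first enabled transition; return True if one fired."""
        if event == "EntrySen_ON" and self.fb == "VACANT" and self.fb_st == "VACANT":
            # Object arrived at the beginning: start the transport timer.
            self.fb, self.fb_st = "FBLOADED", "StartTransporting"
            self.transp_deliv_timer, self.transp_deliv_stamp = "ON", t
            return True
        if event == "ExitSen_ON" and self.fb == "FBLOADED" and self.fb_st == "Transporting":
            if self.tab == "TVACANT" and self.tab_st == "ReadyToLoad":
                self.fb_st = "Delivering"   # table can accept the object
            elif self.tab in ("TVACANT", "TLOADED") and self.tab_st != "ReadyToLoad":
                self.fb_st = "Waiting"      # belt must wait for the table
            else:
                return False
            self.transp_deliv_timer = "OFF"
            return True
        if event == "ExitSen_OFF" and self.fb == "FBLOADED" and self.fb_st == "Waiting":
            # Aberrant event: the object left the belt although the motor should be off.
            self.fb = self.fb_st = "FBFAILED"
            self.warnings.append("Feed belt motor fails to stop")
            self.tab_stopped = True
            return True
        return False
```

Starting a controller with fb="FBLOADED", fb_st="Transporting" and a table that is not in ReadyToLoad, the event ExitSen_ON moves the belt to Waiting, and a subsequent ExitSen_OFF drives it to FBFAILED with a warning recorded.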
Conclusions on safety analysis
Do not be scared by hardware terms in fault trees and FMEA! Your knowledge of the simple failure modes of sensors and actuators, which we have already studied, will be sufficient.
But always deduce software requirements from safety analysis: overlooked safety requirements are the greatest threat!
Always start safety analysis by identifying hazards, i.e., by asking yourself what you would like to avoid happening in this system.
Draw a fault tree to analyse how it can happen.
Conduct FMEA to see which components, in which failure modes, contribute to hazards.
Derive specifications of error detection procedures and remedial actions, and make sure you implement them correctly!
Hazard and accident
A hazard is a potential for an accident.
How do we judge whether a hazard is acceptable? The importance of a hazard is related to the accidents that may result from it.
An accident is an unintended event or sequence of events that causes death, injury, environmental or material damage.
Risk
Two factors are significant for an accident:
the potential consequences of any accident that might result from the hazard;
the frequency (or probability) of such an accident occurring.
Risk is a combination of the frequency or probability of a specified hazardous event, and its consequence.
Example
Failure of a particular component is likely to result in an explosion that could kill 100 people. It is estimated that this component will fail once every 10,000 years. What is the risk associated with that component?
Risk = severity x frequency = 100 x 0.0001 = 0.01 deaths per year
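The arithmetic of this example can be reproduced in a few lines. This is a small sketch; the function name risk and the use of exact fractions (to avoid floating-point rounding) are choices made here, not part of the lecture.

```python
from fractions import Fraction


def risk(severity, frequency):
    """Risk as the product of severity and frequency of the hazardous event."""
    return severity * frequency


deaths_per_accident = 100
accidents_per_year = Fraction(1, 10_000)  # one failure every 10,000 years

r = risk(deaths_per_accident, accidents_per_year)
print(r, float(r))  # prints: 1/100 0.01  (deaths per year)
```

The same product is obtained for a component that kills one person once every 100 years, which is why a single risk figure hides the difference between rare catastrophic events and frequent minor ones.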
Categories of severity for military systems
Category: Definition
Catastrophic: Multiple deaths
Critical: A single death, and/or multiple severe injuries or severe occupational illnesses
Marginal: A single severe injury or occupational illness, and/or multiple minor injuries or minor occupational illnesses
Negligible: At most a single injury or minor occupational illness
Accident probability ranges for military systems
Accident frequency: Occurrence during operational life considering all instances of the system
Frequent: Likely to be continually experienced
Probable: Likely to occur often
Occasional: Likely to occur several times
Remote: Likely to occur some time
Improbable: Unlikely, but may exceptionally occur
Incredible: Extremely unlikely that the event will occur at all
Risk classification
(Figure: the severity of the hazardous event and the frequency of the hazardous event together determine the risk classification.)
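Combining a severity category with a frequency category amounts to a table lookup. The sketch below assumes four risk classes named A (worst) to D; both the class names and the individual matrix entries are illustrative placeholders chosen only so that the class worsens with severity and frequency. They are not taken from this lecture and should be replaced by the matrix of the applicable standard.

```python
SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]

# ILLUSTRATIVE entries only: one letter per severity, in the order of SEVERITIES.
RISK_MATRIX = {
    "Frequent":   "AAAB",
    "Probable":   "AABC",
    "Occasional": "ABCC",
    "Remote":     "BCCD",
    "Improbable": "CCDD",
    "Incredible": "DDDD",
}


def risk_class(frequency: str, severity: str) -> str:
    """Look up the (illustrative) risk class for a frequency/severity pair."""
    return RISK_MATRIX[frequency][SEVERITIES.index(severity)]


print(risk_class("Occasional", "Critical"))  # prints: B
```

Keeping the matrix as data rather than as nested conditionals makes it easy to swap in the table required by a particular standard or project.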
Why classify risks?
Risks can be expressed qualitatively and quantitatively.
Calculation of risk results in a risk class (or risk level).
Most standards define a number of risk classes and then set out development and design techniques appropriate for each category of risk.
Risk classes and interpretations for military systems
As Low As Reasonably Practicable (ALARP) principle
The process of risk reduction
Difference between criticality of systems
Both an electric toaster and a nuclear reactor protection system should be adequately safe, but the meaning of "adequately" is different in these two cases.
Hence the importance of safe operation differs widely between applications.
Different safety requirements for different projects mean that different levels of risk reduction are required.