Failure Behavior Analysis for Reliable Distributed Embedded Systems

Mario Trapp, Bernd Schürmann, Torsten Tetteroo
{trapp, schuerma, tetteroo}@informatik.uni-kl.de
Department of Computer Science, University of Kaiserslautern

Abstract

Failure behavior analysis is a very important phase in developing large distributed embedded systems with weak safety requirements which degrade gracefully in case of failures. Today, the analysis is usually done with standard methods like FTA and FMEA, which consider only the existence of faults. Gradations of errors are not regarded, although this is a very coarse approximation of the system behavior. In contrast, our advanced failure behavior analysis yields more sophisticated and graded results. We obtain comprehensive results by assigning a quality description to all the information in a system and extending the pure information flow to an information quality flow that also models the system failure behavior. We model this information quality flow with object-oriented hierarchical Petri nets. Large parts of these nets can be generated automatically from the existing behavioral system structure. A net simulator enables us to perform all the sophisticated analyses we need to examine the failure behavior.

1 Introduction

Failure analysis is one of the main aspects in developing reliable embedded systems. Knowing the effects of faults on the system behavior enables the developer to strengthen the weak points and to prevent the system from failing in the case of faults. In safety-critical systems, e.g. drive-by-wire systems, failures must be prevented, which is why redundant components or functionality are added. Our research group focuses on the development of large embedded systems with weak safety requirements, like building automation systems. Preventing all possible failures is not necessary and, of course, too expensive in these systems. Therefore, our approach follows the idea of graceful degradation.
We allow certain failures; however, they must be controllable in order to obtain guaranteed and predictable gradations of the system functionality. These gradations can be seen as different failure modes of the system.

We use an object-oriented approach to model the system. Figure 1 shows the simplified data model used in our development process as a UML class diagram [1]. The system behavior is realized by tasks. Tasks are related to other tasks by the information they interchange. Each task is realized in a system component, which can aggregate other subcomponents, creating a structural hierarchy. One appropriate requirements analysis process producing data which can be mapped to that simplified data model is, for example, the requirements engineering process for large distributed reactive systems developed in our research group [2]. It has to be mentioned that the real data model is more complex than shown in figure 1, but it can be transformed to the simplified model.

[Fig. 1. Simplified data model. UML class diagram with the classes System, Task and Component and the associations "system tasks" (1..*), "sends information" (Task to Task, 1..*), "realized in" (Task to Component), "system components" and "subcomponents".]

To realize a controllable and predictable gradation of the system functionality, it is necessary to analyze which faults affect the system behavior. A fault causes a gradation of functionality, switching the system to a specific failure mode. Depending on that mode, a succeeding fault causes another failure mode. Therefore, the analysis must be able to handle successive faults and their chronological order. Furthermore, it is important to analyze in detail how faults affect the system.

In section 2, we give an overview of related failure analysis and verification techniques. Then, we describe our analysis approach in sections 3 and 4.

2 Related Work

There exist many techniques and methods to analyze and to verify systems, so this section can only give an incomplete overview.
The techniques and methods can be divided into two groups: on the one hand, there are formal verification techniques like model checking and Markov chains; on the other hand, there are semi-formal methods like fault tree analysis and failure mode and effects analysis.

Fault Tree Analysis (FTA) is a top-down approach: it starts with system failure situations which must be avoided and analyzes how these failures can be caused by faults of subsystems or system components. The faults are combined by boolean operators like and, or, not, etc., producing boolean equations as the result of FTA [3]. Usually, FTA does not consider different modes of the system except success and failure, and it considers neither multiple faults nor their chronological order.

Failure Mode and Effects Analysis (FMEA) is another well-known analysis technique. It starts with a possible fault of a component and analyzes bottom-up how higher-level functionality and the system's functionality are affected. The results of FMEA are listed informally in large tables describing, for example, the component failure, the effects on the system, the criticality of the failure, etc. Usually, FMEA oversimplifies a system into two modes, success and failure, and does not consider different modes which could represent gradations of system functionality and performance [4]. Multiple faults are usually not considered, either [5]. Price and Taylor extended FMEA to analyze and report the most likely multiple simultaneous failure combinations, but they do not handle the chronological order of faults [5]. Yang and Kapur introduced a customer-driven reliability to FMEA as a quality over time. They consider different performance levels of a product which degrade over time [4].

Model Checking is a formal verification technique used to automatically check systems which have a finite state space. A system is modelled as a finite automaton, and each state is mapped to the elementary propositions which are fulfilled in that state. Then, this model of the system is used to automatically prove whether or not the specification is satisfied. Such a specification is given as a set of properties, usually expressed in temporal logic [6].

Markov Chain Analysis is a technique to examine stochastic systems. The system is modelled with a finite set of states and transitions between the states. Each transition from state $s_i$ to state $s_j$ is labeled with $p_{ij}$, expressing the probability that the transition will be executed in the next step. This model is called a (discrete-time) Markov chain, which is interpreted at discrete time steps. Continuous-time Markov chains are interpreted over continuous time. Their transitions are labeled with the rate of the exponential probability distribution.
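To make the discrete-time case concrete, the following minimal sketch steps a two-state chain forward in time. The state names and all transition probabilities are invented for illustration; they do not come from the paper or from any of the cited tools.

```python
# Hypothetical two-state discrete-time Markov chain: state 0 = "ok", state 1 = "failed".
# P[i][j] is the probability p_ij of moving from state i to state j in one step.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]


def step(dist, matrix):
    """Advance a state distribution by one discrete time step."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def evolve(dist, matrix, steps):
    """Return the distribution after the given number of steps."""
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


start = [1.0, 0.0]                  # the system starts in "ok"
after_two = evolve(start, P, 2)     # evolution up to a given point in time
long_run = evolve(start, P, 200)    # approximates the long-run behavior
```

Each step uses only the current distribution and the fixed matrix, so the history of earlier states never enters the computation.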
It should be noted that Markov chains are memoryless: the probabilities of the transitions depend only on the current state. Neither the previous states nor the time the system has spent in the current state influence these probabilities. Markov chains can be analyzed using numerical or analytical solutions. For example, the evolution of the model up to a given point in time or the long-run average behavior of the system can be studied [7]. Adding elementary propositions to the states enables the usage of model checking with Markov chains. For example, the Erlangen-Twente Markov Chain Model Checker (E⊢MC²) is a model checker for discrete-time and continuous-time Markov chains [8].

3 Failure Behavior Analysis

3.1 Motivation

Usually, FTA and FMEA only consider a fault as existing or non-existing. Gradations of errors are not regarded. This is, however, only a very coarse approximation of the system behavior in the case of a failure. A major disadvantage of FTA and FMEA is that they are completely static, i.e. as soon as an output error depends on the current characteristics of an input error, FTA and FMEA cannot be used. We want to explain that with the following example.

Let us regard three components used to control a radiator temperature. The first component is a temperature sensor. The sensor value is smoothed in a special moving average filter. Finally, the smoothed value is sent to the third component, a hysteresis switch which opens or closes the radiator valve. A new temperature value is sampled every 20 seconds, and the moving average filter uses the last 6 values to calculate a new mean temperature. Due to the special implementation of the moving average filter, the hysteresis switch gets a new value every 2 minutes.

FTA examines how the hysteresis switch can be influenced. As it uses the value of the moving average filter as input, it is of course influenced by a fault of the latter. In turn, the output value of the moving average filter depends on the value of the temperature sensor.
In consequence, it is concluded that a fault of the temperature sensor causes a failure in the hysteresis switch. A similar result would be obtained by an FMEA.

Now, let us assume that a valve movement temporarily disturbs the temperature sensor and causes a relative error of 10%. Ten seconds are needed to open or close the valve. That means, for a period of 10 seconds the temperature value might have a relative error of 10%. Now, it must be examined whether or not it is necessary to use a better and thus more expensive temperature sensor. For that, FTA as well as FMEA are useless. FTA and FMEA have only shown that the hysteresis switch depends on the temperature sensor. The detailed information whether and how a relative error of 10% of the temperature value that persists for 10 seconds influences the hysteresis switch cannot be obtained. Although it would be possible to calculate the influence on the hysteresis switch manually in that simple example, this is not possible if the interdependencies of large (distributed) systems must be considered.

Therefore, a failure behavior analysis is required that is capable of treating the dynamic dependencies of the output values on the error of the input values as well as the timing aspects. The treatment of dynamic dependencies poses two major problems. First, it must be possible to describe faults and errors, and thus to use more gradations than only valid and invalid. Second, it is necessary to describe the dependency of an output value of a component on the error characteristics of a current input value. For the description of faults and the definition of the dependencies, it must be possible to consider timing aspects. These additional features are provided by the failure behavior analysis we introduce in this paper. In comparison with FTA and FMEA, our failure behavior analysis enables the analyst to obtain much more sophisticated information about the failure behavior of a large and possibly distributed system.
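The manual calculation mentioned for the radiator example can be sketched as a small simulation. The sampling period, window length, disturbance duration and relative error are taken from the text; the constant true temperature of 20.0 and the use of a sliding window (the text's filter emits one value every 2 minutes) are simplifying assumptions made only for this sketch.

```python
from collections import deque

SAMPLE_PERIOD_S = 20   # from the text: one sample every 20 seconds
WINDOW = 6             # from the text: the filter averages the last 6 values
DISTURBANCE_S = 10     # from the text: the valve movement takes 10 seconds
REL_ERROR = 0.10       # from the text: 10% relative error while the valve moves
TRUE_TEMP = 20.0       # assumption: constant true temperature


def worst_smoothed_error(disturbance_start_s, n_samples=12):
    """Return the largest relative error seen at the filter output."""
    window = deque(maxlen=WINDOW)
    worst = 0.0
    for k in range(n_samples):
        t = k * SAMPLE_PERIOD_S
        disturbed = disturbance_start_s <= t < disturbance_start_s + DISTURBANCE_S
        window.append(TRUE_TEMP * (1 + REL_ERROR) if disturbed else TRUE_TEMP)
        if len(window) == WINDOW:
            mean = sum(window) / WINDOW
            worst = max(worst, abs(mean - TRUE_TEMP) / TRUE_TEMP)
    return worst


# The disturbance covers the sampling instant t = 120 s: one sample is faulty.
worst_hit = worst_smoothed_error(disturbance_start_s=115)
# The disturbance falls between the samples at 120 s and 140 s: no sample is faulty.
worst_miss = worst_smoothed_error(disturbance_start_s=125)
```

Since the disturbance is shorter than the sampling period, at most one of the six averaged samples can be faulty, so under these assumptions the smoothed value is off by at most 10%/6, roughly 1.7%. Whether that is tolerable depends on the thresholds of the hysteresis switch, which the radiator example does not specify.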
3.2 Overview of the Analysis

The advanced analysis of the system behavior in the case of a failure shall deliver sophisticated and graded analysis results. We do not only want to know which components are influenced by certain errors, but also how and by which kind of errors they are influenced.

We obtain such analysis results by assigning a quality description to all the information transformed and transported in a system. Instead of regarding the pure information flow in a system, which characterizes the functional behavior, we regard the flow of the information qualities, which also defines the system behavior in the case of a failure. This is explained in more detail in the following.

As mentioned earlier, we assume that the overall functionality of a system is decomposed into various tasks, whereby the relations between interdependent tasks are known. The system functionality is defined by the collaboration of the single tasks. The stand-alone functionality of a single task is usually quite simple. Due to the communication between tasks, however, their functionalities are composed to an arbitrarily complex, superordinated system functionality. That means, at the lowest level the system functionality is defined by task functionalities defining the possibly state-dependent transformation of input information to output information. To obtain the superordinated functionality, the communication between tasks, i.e. the transportation of information, must be regarded.

We want to adopt the same concept for the failure behavior analysis. Instead of regarding the transformation and transportation of information, we are interested in the quality of the information. For that, we will first introduce a construct that enables us to describe such a quality. Then, we will define how these qualities are transformed by a single task. We describe this with Petri nets. In the last step, we will replace the transportation of information by the transportation of their qualities in the task collaboration network. For that, we use hierarchical Petri nets. The system developer only needs to define the mapping of the information qualities at the task level. The superordinated Petri net defining the collaboration of the single tasks can be generated automatically.
This leads to a high scalability of our approach.

In section 3.3 we will give a precise definition of the term information. After that, we will introduce the concept of the information quality in section 3.4. Finally, in section 3.5 we will show how Petri nets can be used to define the information quality flow.

3.3 Information

Any kind of data that enters or leaves the system, or is moved or stored in the system, is called information. This is the only meaning of the term information in this paper, independent of any other existing definitions. The information concept is illustrated in figure 2. A sensor value enters the system and is stored in attribute A1 of component A. This information is moved to component B, whereby information can be moved using method calls or by sending signals. Then, it is sent to an actuator.

[Fig. 2. Sample information flow: Sensor, Component A (Attribute A1), Information, Component B (Attribute B1), Actuator.]

Although the sensor value has not yet been stored in the system, it is called information since it is data that enters the system. The same applies to data sent to an actuator.

3.4 Information Quality

We assign a quality attribute to each piece of information. For example, the validity of an information or its relative error could be used. We do this by using a class representing the information quality. This class may have an arbitrary number of attributes describing the quality of an information. Of course, all capabilities of the flexible object-oriented approach can be utilized. A UML representation of the information quality class is shown in figure 3.

[Fig. 3. Information quality class. Class InformationQuality with the attributes validity, relativeerror, nominalvalue, currentvalue, MTTC, MTTO and propertylist.]

The attribute validity defines whether or not the information is valid at all. The attribute relativeerror represents the relative error of a continuous information, e.g. a temperature value.
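The class of figure 3 could be sketched as a plain data class. The attribute names follow figure 3; the types, the default values and the helper method `deviates` are assumptions added for this sketch and are not part of the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InformationQuality:
    """Quality description attached to one piece of information (after figure 3)."""
    validity: bool = True                 # is the information valid at all?
    relativeerror: float = 0.0            # relative error of a continuous value
    nominalvalue: Optional[str] = None    # value a discrete information should have
    currentvalue: Optional[str] = None    # value it actually has
    mttc: float = 0.0                     # MTTC of figure 3 (assumed unit: seconds)
    mtto: float = float("inf")            # MTTO of figure 3 (assumed unit: seconds)
    propertylist: List[str] = field(default_factory=list)  # free-text remarks

    def deviates(self) -> bool:
        """Hypothetical helper: True if any attribute signals a deviation."""
        discrete_mismatch = (
            self.nominalvalue is not None and self.currentvalue != self.nominalvalue
        )
        return (not self.validity) or self.relativeerror > 0 or discrete_mismatch


# A temperature value that is off by 10% for at most 10 seconds.
temperature = InformationQuality(relativeerror=0.10, mttc=10.0)
# A discrete information whose current value differs from its nominal value.
switch = InformationQuality(nominalvalue="off", currentvalue="on")
```

Keeping the quality in its own class means that further attributes or methods can be added for a particular domain without touching the tasks that merely pass the quality on.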
As we will see in the application examples, in most cases it is sufficient to have only a relative error and no absolute error. As our approach ought to be applicable at various stages of the development process, we do not assume an executable specification of the functional behavior. Therefore, it is not always possible to examine the propagation of absolute errors of continuous values. However, it is possible and even necessary to regard the absolute values of discrete information. Therefore, the attributes nominalvalue and currentvalue are used. If, for example, the on/off switch of a cruise control is defective, it is a crucial difference whether the control should be switched on or off. Therefore, considering the nominal value is necessary to describe how a component is influenced. If we regard a component that de-/activates an alarm system which has the possible states deactivated, activated, and alarmon, then besides the nominal value (e.g. the alarm system should be deactivated) the current value must be given (e.g. due to a failure the alarm system might be activated).

The attributes MTTC (Maximum Time To Correction) and MTTO (Minimal Time To Occurrence) are necessary to describe and to examine transient errors. A software or hardware fault leads to an error only if the affected component is used. If, for example, a faulty sensor value is corrected before it has been read, it will not result in an error. Therefore, it is necessary to regard the maximum time that is required to correct a fault. This time is represented by the attribute MTTC. Furthermore, it must be defined how long a value remains correct at minimum. This time is represented by the attribute MTTO.

We want to use the following example to illustrate the meaning and the usage of these attributes. Let us regard a microcontroller reading the values of a sensor. We assume that a watchdog timer is used to reinitialize the microcontroller if it has stalled, so it can be assured that in the case of a failure the microcontroller is working again after at most $t_1$ seconds, i.e. the MTTC is $t_1$ seconds. Further, we assume that the microcontroller has been tested intensively, so that it can be guaranteed that it takes at least $t_2$ seconds before a failure occurs after the microcontroller has been reinitialized, i.e. the MTTO is $t_2$ seconds. In addition, our analysis can also be used to determine the required values of $t_1$ and $t_2$ as constraints at a very early stage of the development process.

The last attribute we want to provide is the propertylist. All characteristics of an information quality that are not required to determine whether and how a component is influenced can be represented by a property. For example, it might be possible to use a textual representation to describe the influence of a component failure on the energy consumption of a system. Therefore, in each component traversed by an information quality due to the error propagation, a new property can be attached to the traversing information quality. Although the influences informally described in these properties are not used to determine the failure behavior of depending components, at the end of the analysis the analyst obtains a detailed textual description of all relevant influences on arbitrary concerns, i.e. a kind of report is generated automatically.

Due to the usage of an information quality class, our approach is very flexible and extendable, as it is possible to modify the existing attributes and methods or to add new ones to the class in order to adapt the concept to various domains and kinds of application.

3.5 Information Quality Flow

Errors can be described with information qualities. Now, it is necessary to describe the propagation of errors.
Therefore, we regard the information quality flow instead of the information flow of a system, as shown in figure 4.

[Fig. 4. Sample information quality flow: Sensor, Component A (Attribute A1), InformationQuality (Validity, PropertyList), Component B (Attribute B1), Actuator.]

A functional behavior specification defines the information flow, i.e. it specifies how an information is modified by a single task. Further, the system structure defines the interdependencies between tasks. A failure behavior specification basically uses the same flow structure; however, it defines how the quality of an output information of the task is influenced by the quality of the input information. We use hierarchical reference nets [9] to describe this aspect. Reference nets are object-oriented Petri nets which support predicates. These nets enable us to have tokens represent the information qualities, and we can assign conditions to transitions.

Modelling the failure behavior of a single task

Depending on the qualities of the input information, a task is set to a certain failure mode. This dependency is described in a so-called guard condition that is assigned to the failure mode. If a guard condition of a failure mode is true, the task is set to that failure mode. In consequence, the guard conditions of all failure modes of a task must be mutually exclusive. Each possible failure mode is represented by a place FM_j in the Petri net modelling the task. The according guard condition is assigned to the transition leading to that place. The qualities of the output information are defined depending on the current failure mode and the input information qualities of a task. This is realized by assigning an action to a transition leading to the task output place, whereby the necessary input information qualities are available at the transition in the form of tokens moved from the input places to the output places. The resulting general structure of a task Petri net is shown in figure 5.
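The guard/action structure described above can be approximated in ordinary code, without any Petri net machinery. In the sketch below, the dictionary-based `Quality` type, the class `FailureMode`, the function `evaluate_task` and the example task are all hypothetical; the check that exactly one guard holds goes slightly beyond the mutual exclusiveness required in the text, because it also rejects inputs that no guard covers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Quality = Dict[str, float]  # simplified stand-in for an information quality object


@dataclass
class FailureMode:
    name: str
    guard: Callable[[List[Quality]], bool]             # condition on the input qualities
    action: Callable[[List[Quality]], List[Quality]]   # computes the output qualities


def evaluate_task(modes: List[FailureMode], inputs: List[Quality]) -> Tuple[str, List[Quality]]:
    """Select the single enabled failure mode and apply its action."""
    enabled = [m for m in modes if m.guard(inputs)]
    if len(enabled) != 1:
        raise ValueError("exactly one guard must hold for the given inputs")
    mode = enabled[0]
    return mode.name, mode.action(inputs)


# Hypothetical task: error-free inputs keep it in Normal, any input error
# switches it to Error and is passed on to the single output.
modes = [
    FailureMode(
        "Normal",
        guard=lambda qs: all(q["error"] == 0 for q in qs),
        action=lambda qs: [{"error": 0.0}],
    ),
    FailureMode(
        "Error",
        guard=lambda qs: any(q["error"] > 0 for q in qs),
        action=lambda qs: [{"error": max(q["error"] for q in qs)}],
    ),
]

mode_ok, out_ok = evaluate_task(modes, [{"error": 0.0}])
mode_bad, out_bad = evaluate_task(modes, [{"error": 0.1}])
```

The guards play the role of the transitions leading to the places FM_j, and the actions play the role of the transitions leading to the output places.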
[Fig. 5. General Petri net structure for the definition of the information quality mapping within a single task: input places Input 1 to Input m, guard transitions Guard 1 to Guard n, failure mode places FM 1 to FM n, action transitions Action 1 to Action n, output places Output 1 to Output o.]

When the Petri net is to be evaluated, the tokens referencing information quality objects are moved to the input places, whereby a separate place exists for each input information quality. According to the guard conditions, the tokens are moved to the place representing the appropriate failure mode (FM_j). In the following transition, new tokens representing the qualities of each output information are created. For each failure mode, a different dependency of the output qualities on the input qualities may be defined. In section 4 this concept will be shown in an example.

Chronological order of faults

One disadvantage of the data flow principle of Petri nets is that the token representing the current failure mode is directly removed from the failure mode place FM_j, because the action transitions are enabled immediately after the according failure mode place has been marked. We therefore create an additional token that is tagged with the name of the current failure mode and move it to an additional place called currentfailuremode. This way, we can preserve the current failure mode. In consequence, the

current failure mode can also be considered in the guard conditions. That means, the chronological order of faults and all preceding events can influence the failure behavior.

Although the current failure mode is preserved, it is not possible to figure out the failure history after an analysis. We therefore further extend the Petri net by a history place (fig. 6) to log this history. The history place can contain an arbitrary number of tokens. Each of these tokens is tagged with the name of the according failure mode and a time stamp (NOW). As every change of the failure mode is represented by a token on the history place, the complete failure behavior history of the task during the analysis is logged.

[Fig. 6. All changes of the current failure mode during the analysis are logged in a history place. Each guard transition Guard i puts a token [FM i, NOW] on the place History.]

Modelling the interdependencies of tasks

By now, we have shown how the transformation of the information qualities by a single task can be modelled. Now, it is necessary to describe the interconnection of the tasks with Petri nets. Utilizing the hierarchy concept provided by Petri nets, we can automatically generate a superordinated Petri net representing the interdependencies of the tasks, as these interdependencies are given by the system structure (figure 1). The general principle of the hierarchy support of reference nets is illustrated in figure 7. Transitions in the superordinated net are uniquely connected with transitions in the sub nets (in figure 7 these connections are illustrated with dashed arcs). As soon as transition 1a is enabled, transition 1b is fired, too. Token b is injected into the sub net. The token on the place Failure Modes in the superordinated net is a reference to the sub net. After the sub net has been executed, transition 2a is enabled and, in consequence, transition 2b is fired, too. Token c is then passed out to the superordinated net and is propagated to task C.

[Fig. 7. General principle of hierarchy in reference nets. Superordinated net: Task A, Transition 1a, Task B with the place Failure Modes, Transition 2b, Task C. Sub net: Transition 1b, guards b.valid and !b.valid leading to the places Normal and Failure, actions c.valid=b.valid, Transition 2a.]

4 Application Example

In the following, an example is introduced that demonstrates the application of the analysis. The main aspect is to show that the concepts introduced in the preceding sections are sufficient to obtain sophisticated analysis results. In the example, we regard a pressure control. The focus lies on data processing. In particular, the usage of the relative error attribute is shown.

4.1 Pressure Control

Obviously, it is possible to describe the propagation of a relative error through data processing components completely mathematically. However, relative errors seem to be problematic if absolute values must be considered. Therefore, data processing components whose error propagation behavior can be described mathematically are combined with a failure behavior specification that uses absolute values in the guards. Thus, the example demonstrates that it is sufficient to use the relative error, although absolute values are required to decide whether a guard is fulfilled. Especially if the MTTC is considered as well, a wide range of failure cases can be examined and sophisticated results can be obtained.

4.1.1 Requirements. The main task of the control system is to control the pressure of an oxygen tank that belongs to a chemical plant. The target pressure is 500 bar. A pressure between 480 and 520 bar is optimal. If the pressure is between 470 and 480 bar or between 520 and 530 bar, the efficiency of the depending machines is decreased. As soon as the pressure gets higher than 530 bar, the situation becomes dangerous. If the pressure is lower than 470 bar, the machines are stalled. This is also illustrated in figure 8.

Pressure/bar            Plant state
480-520                 optimal operation
470-480 or 520-530      reduced efficiency / should be avoided
<470 or >530            machine stall, danger / must be avoided

Fig. 8. Requirements on tank pressure.

4.1.2 System Design. The chosen structure of the pressure control system is illustrated in figure 9. Basically, three pressure sensors at different places in the pressure tank are available. An average filter calculates the actual overall pressure. In order to suppress temporal fluctuations, a moving average filter is additionally applied in the next step. After that, a hysteresis switch decides when the valve has to be opened or closed.

[Fig. 9. Structure of pressure control system: Pressure Sensor 1, Pressure Sensor 2 and Pressure Sensor 3 feed the Average Filter, followed by the Moving Average Filter, the Hysteresis Switch and the Valve.]

The formulae defining the digital filters are shown in figure 10. The average filter weights all of the three pressure values equally to calculate the actual overall pressure value. The moving average filter uses the last five actual pressure values to calculate the smoothed average pressure value, whereby again all values are weighted equally.

$p_i[k]$: pressure $i$ at time $k$
$p[k]$: average pressure at time $k$
$p_a[k]$: smoothed average pressure at time $k$

average filter: $$p[k] = \frac{1}{3}\,\bigl(p_1[k] + p_2[k] + p_3[k]\bigr)$$
moving average filter: $$p_a[k] = \frac{1}{5}\sum_{i=0}^{4} p[k-i]$$

Fig. 10. Formulae describing the digital filters.

The hysteresis loop of the switch is shown in figure 11. Although a pressure between 480 and 520 bar is optimal, the threshold values of the hysteresis have been set to 490 and 510 bar in order to have a 10 bar reserve at both ends. As the target pressure is 500 bar, a tolerance interval of +/- 10 bar has been established before the valve is opened or closed, respectively.

[Fig. 11. Hysteresis loop used for the hysteresis switch: switch command (open/close) over pressure [bar], with the thresholds 490 and 510 bar around the target pressure of 500 bar.]

4.1.3 Failure Behavior. In the next step, the failure behavior must be defined. Therefore, the failure modes of the three components, which can be considered as tasks of the system, are specified. As explained earlier, we use extended Petri nets to describe the failure behavior.

Average Filter

The failure behavior model of the average filter is shown in figure 12. The general structure illustrated in figure 5 has been used to model the failure behavior of that task. The qualities of the three pressure values p1, p2, and p3 are considered as input. If any of these values has a relative error different from zero, the task is set to the failure mode Error; otherwise the task remains in the mode Normal. If the task is in the normal mode, the output value has no error; therefore, the error attribute is set to 0 (the setting of the attribute validity has been neglected to keep the example simple). In the failure mode Error, the relative error of the output pressure is calculated using the relative errors of the input values, as specified by the formula shown in figure 13. The current failure mode is represented by an additional token on the place currentfailuremode, as mentioned earlier. The transitions outside the box are required as interface and will be connected to transitions in the superordinated net, as explained in section 3.5.

[Fig. 12. Petri net specifying the failure behavior of the average filter. Interface transitions: enter, getstate, exit. Places: Error, Normal, currentfailuremode.
Guards:
  g1: p1.error>0 or p2.error>0 or p3.error>0
  g2: p1.error==0 and p2.error==0 and p3.error==0
Actions:
  a1: p.error = 1/3 (p1.error + p2.error + p3.error)
  a2: p.error = 0]

$$p.\mathrm{error} = \frac{1}{3}\,\bigl(p_1.\mathrm{error} + p_2.\mathrm{error} + p_3.\mathrm{error}\bigr)$$

Fig. 13. Formula specifying the error propagation of the average filter.

Moving Average Filter

The moving average filter uses the last five values of the average filter, whereby a new value is sampled every 2 milliseconds. The error propagation can be specified, in general, depending on the number of regarded samples N and the sample period T, as shown in figure 14.

$$w = \min\left(1,\ \frac{1}{N}\left\lfloor \frac{p.\mathrm{MTTC}}{T} \right\rfloor\right), \qquad p_a.\mathrm{error} = w \cdot p.\mathrm{error}$$

Fig. 14. Formula specifying the error propagation of a moving average filter.

The quotient of the MTTC over the period T defines how many samples can be influenced by a fault. All other samples have no error (it is assumed that $\mathrm{MTTO} \gg N \cdot T$).
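The two propagation rules of figures 13 and 14 can be checked with a few numbers. The function names and the example fault (one sensor off by 3%, MTTC of 4 ms or 10 ms) are assumptions chosen for this sketch; N = 5 and T = 2 ms follow the example.

```python
import math


def average_filter_error(input_errors):
    """Figure 13: the output error is the mean of the input relative errors."""
    return sum(input_errors) / len(input_errors)


def moving_average_weight(mttc, n_samples, period):
    """Figure 14: share of the N regarded samples that a fault can influence."""
    return min(1.0, math.floor(mttc / period) / n_samples)


def moving_average_error(input_error, mttc, n_samples=5, period=2):
    """Relative error of the smoothed value; defaults follow the example (N = 5, T = 2 ms)."""
    return moving_average_weight(mttc, n_samples, period) * input_error


# Hypothetical fault: sensor 1 is off by 3%, the other two sensors are correct.
p_error = average_filter_error([3.0, 0.0, 0.0])       # error of p in percent
short_fault = moving_average_error(p_error, mttc=4)   # floor(4/2) = 2 of 5 samples
long_fault = moving_average_error(p_error, mttc=10)   # floor(10/2) = 5 of 5 samples
```

With these numbers the average filter reports an error of 1%, which the moving average filter reduces to 0.4% if the fault is corrected within 4 ms and passes on unchanged once the MTTC reaches N·T = 10 ms.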
The ratio between the number of influenced samples and the number of all samples therefore defines the weight w of the relative error. The according Petri net for the filter is shown in figure 15. Again, the failure modes Normal and Error are defined. The normal mode is valid when the input is correct; otherwise, the task is set to the error mode. In the normal mode, again, the error attribute of the output quality is set to zero. In the error mode, the relative error of the output value is calculated according to the formula shown in figure 14. To express a persistent error with the MTTC, the latter is set to the maximal possible integer value (MAXINT). If an error is persistent, all samples are faulty; therefore, pa must have the same relative error as p. This requirement is met if the MTTC is set to MAXINT, as in that case the weight w always evaluates to 1, i.e. the error of p is assigned to the error attribute of pa.

[Fig. 15. Petri net specifying the failure behavior of the moving average filter. Interface transitions: enter, getstate, exit. Places: Error, Normal, CurrentFailureMode.
Guards:
  g1: p.error>0
  g2: p.error==0
Actions:
  a1: pa.error = min(1, 1/5 * floor(p.mttc / 2)) * p.error
  a2: pa.error = 0]

value of pa is lower than the actual pressure: $520\,(1 - x) \ge 510 \Rightarrow x \le 0.019 = 1.9\%$
value of pa is higher than the actual pressure: $480\,(1 + x) \le 510 \Rightarrow x \le 0.0625 = 6.25\%$

Fig. 16. Maximal relative error for the upper threshold value.

Hysteresis Switch

Before we define the Petri net for the hysteresis switch, we examine its failure behavior. The hysteresis loop is illustrated in figure 11, and the consequences of too low or too high a pressure are shown in figure 8. Besides the normal operation, wrong pressure values can result in a problematic or a dangerous situation. Obviously, it is reasonable to define the three failure modes Normal, Problem, and Danger.

Now, we must examine which errors result in which failure mode. At first sight, it seems quite simple to define guards like pa.currentvalue > 520. However, our approach is applicable at early stages of the development process; therefore, we do not assume to have absolute values or absolute errors available. For this reason, we regard the threshold values and use the available relative errors to obtain the absolute values. This is sufficient, as it is only necessary to consider the worst case. We must regard both threshold values and, since the relative error does not express whether the faulty value is higher or lower than the actual value, it is also necessary to cover both cases in the consideration. As an example, we want to calculate the maximal relative error that is allowed for remaining in the normal mode for the upper threshold value. The upper threshold value is 510 bar.
Actually, it is only necessary to open the valve when the pressure is higher than 520 bar. If the faulty value of pa is lower than the actual pressure in the tank, it must nevertheless be ensured that the valve is opened at the latest when the pressure is higher than 520 bar. That means, if the pressure is 520 bar, the current, faulty value of pa must be at least 510 bar. According to the upper formula shown in figure 16, the relative error must therefore be lower than or equal to 1.9%. If the value of pa is higher than the actual pressure, the valve might be opened too early. We demand that the pressure must be at least 480 bar before the valve is opened. The corresponding maximal relative error is calculated using the lower formula of figure 16. The remaining maximal relative errors can be calculated in the same way. The Petri net representing the resulting failure behavior of the hysteresis switch is shown in figure 17. As the output of this task is a direct system output (the command for the valve), it is only necessary to output the failure mode to obtain the influence on the system behavior.

Fig. 17. Petri net specifying the failure behavior of the hysteresis switch. Guards on the input pa: Normal if pa.error <= 1.9; Problem if (pa.error > 1.9) and (pa.error <= 3.8); Danger if pa.error > 3.8. The selected mode is held in currentfailuremode.

4.1.4 Interdependencies of tasks.

So far, the failure behavior of three single tasks has been described. Now, it is necessary to combine the single Petri nets in a superordinate Petri net to obtain the system failure behavior. The overall Petri net specifying the failure behavior of the pressure control system is shown in figure 18. It mainly rebuilds the system structure. First, an instance of the Petri net representing the average filter is created and three qualities of the pressure values can be injected, as explained in section 3.5. The current failure mode and the quality of the average pressure value are requested from the sub net describing the average filter.
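Returning to the hysteresis switch, the threshold derivation of figure 16 and the guards of figure 17 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and not part of the paper, and the limits 1.9% and 3.8% are taken directly from figure 17.

```python
def max_error_reading_too_low(actual_limit: float, threshold: float) -> float:
    """Largest relative error x with actual_limit * (1 - x) >= threshold.

    Covers the case where the faulty value of pa is lower than the
    actual pressure (upper formula of figure 16).
    """
    return 1.0 - threshold / actual_limit


def max_error_reading_too_high(actual_limit: float, threshold: float) -> float:
    """Largest relative error x with actual_limit * (1 + x) <= threshold.

    Covers the case where the faulty value of pa is higher than the
    actual pressure (lower formula of figure 16).
    """
    return threshold / actual_limit - 1.0


def hysteresis_failure_mode(error_percent: float) -> str:
    """Guards of figure 17, applied to pa.error given in percent."""
    if error_percent <= 1.9:
        return "Normal"
    if error_percent <= 3.8:
        return "Problem"
    return "Danger"


if __name__ == "__main__":
    # Upper threshold of 510 bar with the limits 520 bar and 480 bar from the text.
    print(round(100 * max_error_reading_too_low(520.0, 510.0), 2), "%")
    print(round(100 * max_error_reading_too_high(480.0, 510.0), 2), "%")
    print(hysteresis_failure_mode(2.8))
```

The smaller of the two bounds is the one that limits the normal mode for this threshold, because the relative error does not say in which direction the value deviates.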
The quality of the average pressure value is used as input for an instance of the Petri net specifying the failure behavior of the moving average filter. The output of this sub net is its current failure mode and the quality of the smoothed average pressure value, which, in turn, is used as input for the hysteresis switch. As the output of the hysteresis switch is a direct system output, it is reasonable not to use an information quality as output, but only the current failure mode. An analysis can be started by simulating the Petri net. Errors can be injected using the injection transitions. We create an information quality object, which is referenced by a token, and put this token on the place leading to the respective transition. These information quality tokens can either be defined and placed manually on the input places, or automatically by separate software. The injection and the propagation of the information qualities are then done automatically by a Petri net simulator [10]. Placing the tokens manually has the disadvantage that new information quality tokens must be created and placed for each analysis run. Furthermore, it is not possible to change the injections during an analysis. However, it is one major advantage of our approach that the
basic concepts allow various errors to be simulated and corrected during an analysis run; even the order of faults can be examined. For these reasons, it is reasonable to use separate software. Then, it is possible to define a complete analysis scenario in advance, and the software creates and places the corresponding information quality tokens in the appropriate order. Moreover, a user interface can be provided that enables the analyst to set new information qualities dynamically during the analysis.

Fig. 18. Petri net defining system failure behavior of pressure control. Injection transitions and input places for p1, p2, and p3 lead to the sub net of the average filter, followed by the sub net of the moving average filter (output pa) and the sub net of the hysteresis switch; each sub net reports its failure mode. The Petri net can be analyzed by putting tokens, which reference the corresponding information qualities, on the input places. Qualities of all information in the system can be injected into arbitrary parts of the system. A Petri net simulator handles the injection into the sub nets and the propagation between dependent nets.

4.1.5 Analysis.

For the purposes of this paper, it is not necessary to distinguish between manual and automatic creation and placing of information quality tokens. For that reason, in the following analyses the information quality tokens are assumed to be already placed. First, we will examine the permanent fault of a single sensor. As a second scenario, we will examine the influence of a transient disturbance on the system behavior.

Permanent Sensor Fault

The first analysis is used to answer the question of whether or not it is necessary to provide redundant pressure sensors. Therefore, we regard the permanent fault of a single pressure sensor. We want to examine whether or not the two remaining sensors are sufficient to maintain correct system behavior.
In addition to the functionalities of the tasks introduced in the previous sections, we assume that an error detection is implemented which detects a missing sensor value. In that case, the average filter only uses the two remaining values. The error detection also considers a sensor defective if its value deviates by 15% from the values of the other two sensors. A detected failure of a single sensor has no effect on the system behavior. Exactly that would be the result of a simple analysis, too. However, we want to examine the influence of a relative sensor error of 14%, as this error will not be recognized by the error detection. In advance, a token representing an information quality object with the error attribute set to 14% was placed on the input place of p1. As the fault is considered to be permanent, the MTTC was set to MAXINT. The error attribute of the two remaining information qualities was set to 0. When the Petri net simulation is started, the three information qualities are injected into an instance of the Petri net defining the failure behavior of the average filter. After that Petri net has been evaluated, the token representing the information quality of p is passed on, along with the current failure mode. The relative error of p is 4.66%, as the relative error of 14% has been reduced by the average filter. The token representing the information quality of p is then injected into the Petri net of the moving average filter. As the MTTC is assumed to be infinite, the moving average filter cannot reduce the error. For that reason, the error attribute of the resulting information quality of pa is still 4.66%. This quality is then injected into the sub net of the hysteresis switch. As the relative error of pa is higher than 3.8%, a dangerous situation arises. The analysis steps are illustrated in figure 19.
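The arithmetic behind this scenario can be written out as a short worked calculation. It is a sketch under one assumption: that the average filter divides the sum of the three relative sensor errors by three, which is consistent with the reported value but is not restated in this section. The symbols e_{p_1}, e_{p_2}, e_{p_3}, e_p, and e_{p_a} are introduced here only as shorthand for the error attributes of p1, p2, p3, p, and pa. The text reports the first result truncated to 4.66%.

```latex
\begin{align*}
  e_p &= \frac{e_{p_1} + e_{p_2} + e_{p_3}}{3}
       = \frac{14\% + 0\% + 0\%}{3} \approx 4.67\% \\
  w &= \min\!\left(1,\ \tfrac{1}{5}\left\lfloor \tfrac{\mathrm{MTTC}}{2} \right\rfloor\right) = 1
       \qquad (\mathrm{MTTC} = \mathrm{MAXINT}) \\
  e_{p_a} &= w \cdot e_p \approx 4.67\% > 3.8\%
       \;\Rightarrow\; \text{failure mode Danger}
\end{align*}
```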
Obviously, the error detection is not sufficient. Even if only a deviation of 6% instead of 15% was allowed before an error is detected, a problematic situation would be the consequence. Although we admit that it is possible to calculate a correct deviation rate manually for this simple example, this is impossible for large, complex, and distributed systems. Our analysis, in contrast, is intended for and particularly suited to the analysis of those systems.

Fig. 19. Analysis of the influences of a permanent sensor fault. Created tokens: p1.valid=false; p1.error = 14%; p1.MTTC=MAXINT; p2.valid=true; p3.valid=true. Average filter: failure mode Error, p.error = 4.66%. Moving average filter: failure mode Error, pa.error = 4.66%. Hysteresis switch: failure mode Danger.

Transient Sensor Fault

Now, we want to consider the following scenario: The sensors can be influenced by electromagnetic radiation. We assume that the electromagnetic radiation does not persist longer than 4 milliseconds in the tank. Measurements showed that in the case of electromagnetic radiation, the value of sensor p1 has a relative error of 15%, the value of p2 has a relative error of 6%, and sensor p3 is not influenced. Now, we want to examine whether or not it is necessary to improve the sensors or to suppress the radiation. The corresponding analysis is illustrated in figure 20. Two information qualities for the values of p1 and p2 are created with the corresponding relative errors. The MTTC is set to 4 milliseconds. The MTTO is neglected, as we assume that the frequency of occurrence of the disturbance is very low. After the average filter, the actual overall pressure value has a relative error of 7%. The disturbance persists for 4 milliseconds, i.e. in the worst case 2 of 5 samples have a relative error of 7%. In consequence, the relative error of
the smoothed pressure value pa is reduced to 2.8%. That means the efficiency of the dependent machines is reduced (see figure 17). It is obviously necessary to improve the behavior.

Fig. 20. Analysis of the influences of a transient sensor fault. Created tokens: p1.valid=false; p1.error = 15%; p1.MTTC=4; p2.valid=false; p2.error = 6%; p2.MTTC=4; p3.valid=true. Average filter: failure mode Error, p.error = 7%. Moving average filter: failure mode Error, pa.error = 2.8%. Hysteresis switch: failure mode Problem.

In the next scenario, we assume that the engineers of the plant propose two possible improvements: First, it is possible to reduce the maximal time of persistence of the radiation from 4 to 3.5 milliseconds. Second, it is possible to shield the sensors; in consequence, the relative errors could be reduced to 12% and 4.8%, respectively, which means a reduction of the disturbance by 20%. If we use these values as input for the analysis, we obtain the result that the reduction of the time of disturbance to 3.5 milliseconds is sufficient to remain in the normal mode. A persistence of the radiation of less than 4 milliseconds means that at most one sample is influenced. For that reason, the moving average filter compensates for the relative error. Shielding the sensors as assumed above, however, is not sufficient. Despite the reduced relative errors, the overall pressure still has a relative error of 5.6%. The moving average filter only reduces the error to 2.24%, which is not sufficient to remain in the normal mode.

5 Conclusion: Analyzing large distributed systems

In this paper, we introduced a flexible, scalable, and extendable approach for failure behavior analysis. In comparison to existing analyses, like FTA or FMEA, our analysis yields more sophisticated results which enable the analyst to understand the system behavior in the case of faults. In the application example, we demonstrated the applicability of our analysis. However, we limited the complexity of the example to avoid going beyond the scope of a paper.
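The three transient-fault scenarios of section 4.1.5 can be recomputed with a short Python sketch of the error propagation through the three tasks. It is not the Petri net simulation itself. It rests on assumptions: the average filter is modelled as the mean of the three relative sensor errors (consistent with the reported 7% and 5.6%), the window of 5 samples and the 2 millisecond sampling period are read off action a1 of figure 15, and all function and constant names are hypothetical.

```python
import math

WINDOW_SAMPLES = 5    # samples in the moving average window ("2 of 5 samples")
SAMPLE_PERIOD_MS = 2  # sampling period implied by floor(p.mttc / 2) in action a1


def average_filter_error(sensor_errors_percent):
    """Assumed model: mean of the relative errors of p1, p2, and p3."""
    return sum(sensor_errors_percent) / len(sensor_errors_percent)


def moving_average_error(error_percent, mttc_ms):
    """Action a1 of figure 15: weight the error by the share of faulty samples."""
    faulty_samples = math.floor(mttc_ms / SAMPLE_PERIOD_MS)
    weight = min(1.0, faulty_samples / WINDOW_SAMPLES)
    return weight * error_percent


def hysteresis_failure_mode(error_percent):
    """Guards of figure 17, applied to pa.error given in percent."""
    if error_percent <= 1.9:
        return "Normal"
    if error_percent <= 3.8:
        return "Problem"
    return "Danger"


def analyze(sensor_errors_percent, mttc_ms):
    """Propagate the injected qualities through all three tasks."""
    p_error = average_filter_error(sensor_errors_percent)
    pa_error = moving_average_error(p_error, mttc_ms)
    return p_error, pa_error, hysteresis_failure_mode(pa_error)


if __name__ == "__main__":
    scenarios = {
        "radiation, 4 ms": ([15.0, 6.0, 0.0], 4.0),
        "radiation limited to 3.5 ms": ([15.0, 6.0, 0.0], 3.5),
        "shielded sensors, 4 ms": ([12.0, 4.8, 0.0], 4.0),
    }
    for name, (errors, mttc_ms) in scenarios.items():
        p_error, pa_error, mode = analyze(errors, mttc_ms)
        print(f"{name}: p.error={p_error:.2f}%, pa.error={pa_error:.2f}%, mode={mode}")
```

Keeping each task as its own function mirrors the sub net structure of figure 18, so a different failure behavior for one task can be swapped in without touching the others.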
Our research focuses on large, distributed embedded systems. Therefore, our analysis has been developed for those systems. Mastering the complexity of those systems is one major problem. For that reason, the scalability of our approach is of crucial importance. A further essential aspect is the possibility to automatically generate major parts of the Petri nets. For example, it is even possible to generate a simple failure behavior that sets the validity attribute of the output information qualities to false if any input quality is invalid. For that reason, results similar to those obtained by FMEA are yielded automatically without any additional effort by the analyst. Although the generated Petri nets define only a very coarse approximation of the actual failure behavior, it is, in contrast to a common FMEA, possible to examine the effects of fault combinations very easily or even automatically.

A further aspect that is important for the analysis of large systems is reuse. If tasks or components are reused in other projects, the Petri nets defining their failure behavior can be reused, too.

If distributed systems are to be analyzed, our approach has two further advantages. First, the partitioning of the system is supported, as one can examine which tasks should be assigned to which partition so that a failure of one partition has the least influence on the overall system behavior. Second, the effects of missing or delayed information interchanged between system partitions can be analyzed. The delay of a piece of information can be assigned to its quality, and the effects on single tasks can be modelled explicitly with Petri nets. The effect on the overall system behavior is obtained automatically, as the error propagation is provided by the Petri net simulator.

6 References

[1] G. Booch, I. Jacobson, J. Rumbaugh, The Unified Modeling Language User Guide, Addison Wesley Longman, Reading, MA, 1999.
[2] A. Metzger, S. Queins, A Reuse- and Prototyping-based Approach for the Specification of Building Automation Systems, OMER-2 Workshop, Hersching, Germany, 2001.
[3] IEC 61025 (1990-10), Fault Tree Analysis, International Electrotechnical Commission, Geneva, Switzerland, 1990.
[4] K. Yang, C. K. Kapur, Customer Driven Reliability: Integration of QFD and Robust Design, Proceedings IEEE Annual Reliability and Maintainability Symposium, 1997.
[5] C. J. Price, N. S. Taylor, FMEA for Multiple Failures, Proceedings IEEE Annual Reliability and Maintainability Symposium, 1998.
[6] B. Berard, M. Bidoit, A. Finkel, F. Laroussinie, A. Petit, L. Petrucci, Ph. Schnoebelen, P. McKenzie, Systems and Software Verification, Springer Verlag, Berlin, 2001.
[7] H. Hermanns, Construction and Verification of Performance and Reliability Models, in Bulletin of the European Association for Theoretical Computer Science (EATCS), 2001.
[8] H. Hermanns, J.P. Katoen, J. Meyer-Kayser and M. Siegle, A Markov chain model checker, Proceedings of Sixth International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Springer Verlag, Berlin, 2001.
[9] Olaf Kummer, Simulating Synchronous Channels and Net Instances, 5. Workshop on Algorithms and Tools for Petri Nets, 1998.
[10] Olaf Kummer, Frank Wienberg, RENEW - The Reference Net Workshop, Petri Net Newsletter, No. 56, 1999.