I.CHEM.E. SYMPOSIUM SERIES NO. 110

RELIABILITY ASSESSMENT OF COMPUTER SYSTEMS USED IN THE CONTROL OF CHEMICAL PLANT

E Johnson*

The use of process control computers frequently leads to improvements in the operation and safety of chemical plant. However, there is a need to recognise and identify the potential hazards that could arise through failure of a computer system which is providing total control of a chemical plant or process. This paper describes the approaches to assessment used by ICI and refers to other assessment methodologies. The assessment techniques described, both quantitative and qualitative, consider the total computer system: the installation, hardware, software and modifications.

INTRODUCTION

Over a quarter of a century ago, in 1962, ICI installed its first digital computer for the direct control of a chemical plant. The computer, a Ferranti ARGUS 200 installed on the Fleetwood Ammonia Soda Plant, took in some 256 process measurements of flow, temperature, pressure, level and pH, and controlled 120 valves. It proved that a chemical plant could be controlled from one digital computer, and a new era in centralised control had begun. Improved process operation followed from computer control even though the process technology of ammonia soda had been established for nearly 100 years.

Over the last two decades many process control computers have been installed throughout the company, with improvements in process operation, safety and profitability. This has largely resulted from the 'technology explosion' in integrated microcircuits, which enables present-day computer systems to provide, in a cost-effective way:

- complex control functions and logic operations
- process optimisation
- data recording for management information and incident investigation
- sophisticated presentation of plant information with the use of colour VDUs
- early warning diagnostics of process malfunctions

* ICI plc Engineering Department (N.W.).

However, with such sophisticated and complex equipment providing total control of a chemical plant or process, there are accompanying disadvantages with the computer system:

- unpredictable failure modes
- limited failure data
- greater susceptibility to common mode failure
- software unreliability, with accompanying difficulties in total validation and verification
- ease of modification and hence ease of corruption

There is, therefore, a need to recognise these problems and to identify and assess the potential hazards that could arise through failure (in total or in part) of a computer system which is providing total control of a chemical plant or process.

The approaches to assessment used by ICI cover the following topics:

Control System Requirements
- Potential hazards
- Control philosophy
- Functional specification

Computer System Installation
- Environmental aspects
- Power supplies and cables
- Fire detection and protection
- Electrical interference

Computer System Hardware
- Critical inputs and outputs
- Distribution and redundancy
- Fault tree analysis

Computer System Software
- Application software
- Approaches to reliability
- Operating system software
- Modifications

Each of the above topics will be described in some detail, and a further section will refer to other assessment techniques.

CONTROL SYSTEM REQUIREMENTS

Before any reliability assessment of the computer system can take place, the areas of the plant where potential hazards exist need to be identified. These potential hazards of fire, explosion, release of toxic material and pollution of the environment, and their consequences, are identified and quantified during the hazard and operability studies of the process plant. Within ICI, any new plant (or modification to an existing one) is subjected to a sequence of studies from the design phase through to commissioning, using the techniques developed and refined since the early 1970s.

Within these hazard studies, the critical measurements and control parameters will be highlighted, together with the necessary protection systems. Where there are significant risks to people and the environment, it is ICI's current philosophy to install protective systems which are usually hardwired and which are independent of the control systems, whether the control system is computerised or not.

In addition to demands from the plant, some failures of the computer system may also place demands on the plant protection system, and these need to be identified at an early stage in the design. However, it is recognised that with a computer controlled plant, the in-built diagnostic and logical capabilities of the computer can be used as the 'first line of defence' for the protective system. Using a computer in this way, the demand rate on the protective system may be reduced, which may lead to simplification of the protection system itself.

Once the critical measurements, control functions, alarm requirements and any special software requirements for the plant have been identified, the functional specification and the control philosophy for the computer system can be compiled. It is at this point that the effects of hardware and software failures, data errors and operator interface failures are considered, together with their effect on the sensitive areas of the process plant.

Considerations relevant to the safety and reliability of the computer system should include:

a) Operator/plant interface: VDU display(s), keyboard(s) and printer(s). How many are required? Are they dedicated to a particular plant section? Are all facilities for the whole plant included on each one? What degree of control is each plant operator given?

b) Although the alarm function will be included, can the loss of the plant alarm function be tolerated when the computer system fails? Are independent and hardwired alarms required?

c) In the case of computer failure, will the operator be presented with enough information to appreciate the state of the plant? Will the operator have sufficient facilities to make the plant safe? Will he be presented with too much (and conflicting) information?

d) Will the computer system include both the control function and part of the protection system? There are cost benefits in combining the two functions, but a computer system failure leading to a control fault could at the same time prevent the initiation of the protection system.

e) For sequence control: if the plant operates in a batchwise rather than a continuous manner, will the system provide 'full control', stepping through each sequence automatically, or will it just control the timing of each stage, carry out the necessary checks and then indicate to the operator that the stage has finished, thus allowing him to initiate the next stage of the process?

From this initial assessment of the computer system, two main factors will emerge: firstly, the reliability requirements to which the computer system must conform, and secondly, the control philosophy and operating strategy necessary to accommodate both normal plant operation and equipment failure or malfunction.

From the former, the design team will be able to provide the manufacturer or supplier with a deeper understanding of the computer application, stressing the importance of the equipment meeting the functional specification. From the latter, the commissioning/operating management, who, like the design team, will have been present in the plant hazard study, will now start planning the plant operational requirements, operator training needs and maintenance needs, and writing operating instructions and maintenance procedures for the computer system.

HOW TO ASSESS THE SYSTEM

For any reliability assessment there are three basic needs: some form of standard or guidelines to which the assessment should conform, a systematic method, and reliability data.

Methodologies, techniques and tools abound for assessing the reliability of hardware, with even more for the validation and verification of system specification, software and testing (Refs 1, 2, 3, 4).

Although reliability data for 'system hardware' is limited for computer system assessment, it is highly unlikely that any assessment for a chemical plant would need to go 'below circuit board level'. Most manufacturers provide reliability figures for the major hardware components of their system, figures which may have been compiled from field data on installed systems. However, for new equipment with little field data, reliability figures may have been compiled from component data (eg Military Handbook data, MIL.HDBK.217D).

There is also a need for some form of standard or guideline to which the assessment should conform, especially if the equipment being used forms part of a safety related system. The manufacturer, in producing the equipment, has BS 5760 and MIL.HDBK.217D to guide him, but for the 'system' assessor there are no comprehensive national or international standards. This has been recognised by the H.S.E., who published their Guidelines on Programmable Electronic Systems in June 1987 (Ref 6). The Commission of the European Communities, in particular DG XIII, has also recognised the deficiency and part funded the European Workshop on Industrial Computer Systems Technical Committee 7 (EWICS TC7), who have recently produced four guidelines (Refs 7-13).

Meanwhile companies, including ICI, are building new plants and controlling them with computer systems. ICI have acknowledged the lack of guidelines and have adapted Hazard and Operability techniques to assess computer systems, relating the results back to the criteria applied to the total plant.

COMPUTER SYSTEM INSTALLATION

The reliability of the computer system depends not only on the hardware and software but also on its installation and the environment in which it is housed. With the high reliability which may be obtained from the computer hardware, these factors can become very important.

The installation should be subjected to a stringent and structured hazard analysis similar to that for the plant, in which, for example, the following topics are examined:

1. Air Conditioning

Some form of air conditioning equipment is usually required in order to satisfy the manufacturer's specification for relative humidity and temperature. For computer systems on chemical plant, the possibility of ingress of noxious fumes into the air conditioning system should be considered, with either a chemical type filter (ie activated carbon) or air input dampers which close automatically on detection of toxic gas. The air conditioning system should at least be fitted with an alarm to indicate its failure, especially if the computer equipment is in an unmanned area.

2. Fire Detection/Protection

The risk of a computer system fire in normal operation is very low. However, since it is likely that a large proportion of the equipment would be in an unmanned area, a combination of heat and ionisation detectors, strategically placed around the installation, should be installed. The likely causes of fire would be some form of electrical failure or overheating (hence the need for air conditioning equipment). Therefore, consideration should be given to switching off the air conditioning system and isolating the power source on detection of fire. Since any fire would generally be localised, and switching off the power would remove the cause, local hand held extinguishers would probably suffice. However, for larger systems, manually or automatically triggered flood systems of Halon or CO2 may be considered, although there is considerable debate on their effectiveness and on the ensuing damage if triggered spuriously.

3. Power Supplies and Cables

The integrity of the incoming supply to the system should be assessed (failure, voltage spikes, frequency variations) and the question of an alternative supply or Uninterruptible Power Supply (U.P.S.) system considered. The consequences of power failure and power return should be examined, especially with respect to restart procedures for the software. This is particularly important on batch processes and sequence control systems.
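As an illustration of what examining the restart procedure can involve, the sketch below records the current step of a batch sequence in non-volatile storage and, on power return, resumes in a hold state until the operator confirms. It is a minimal sketch under stated assumptions: the step names, the checkpoint file and the hold-on-restart policy are all hypothetical and are not taken from any ICI procedure.

```python
import json
import os
import tempfile

# Hypothetical batch sequence; real step names come from the functional specification.
STEPS = ["charge", "heat", "react", "cool", "discharge"]


def save_checkpoint(path, step_index):
    """Write the current step via a temporary file and an atomic rename,
    so a power loss cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"step_index": step_index}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def restart_state(path):
    """Decide how to restart after power return.

    Returns (step_index, mode). Mode is "RUN" for a cold start with no batch
    in progress, and "HOLD" when a batch was interrupted or the checkpoint is
    unreadable, leaving the decision to the operator (assumed policy).
    """
    if not os.path.exists(path):
        return 0, "RUN"
    try:
        with open(path) as handle:
            step_index = json.load(handle)["step_index"]
    except (ValueError, KeyError, TypeError):
        return None, "HOLD"  # corrupt checkpoint: do not guess the plant state
    if not isinstance(step_index, int) or not 0 <= step_index < len(STEPS):
        return None, "HOLD"
    return step_index, "HOLD"
```

The design choice worth noting is that the software never resumes an interrupted sequence on its own: the plant may have drifted while power was off, so the recorded step is offered to the operator rather than acted upon.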

Cables to and from the computer system should be segregated wherever possible from power cables, particularly from those supplying power to large electric motors. They should also be protected along their routes from mechanical damage, and duplicated signals from sensitive areas should follow separate routes.

4. Earthing

The basic requirements for the earthing system are to avoid the risk of electric shock (a.c. earth), to provide a signal reference potential (d.c. earth), and to avoid the build up of a static charge which may cause damage to or maloperation of the equipment. In addition, the earthing of the computer system should be implemented in such a way that the invasion of fault currents from other circuits (eg electric motor circuits) is avoided. This has been recognised by my company and guidance is available for the design team.

5. Electrical Interference

The earthing system, plus the choice of cable routes, would usually avoid electro-magnetic interference problems from large electrical drives, but the problem should not be ignored. Electrostatic build up on computer cabinets can sometimes be a problem and the earthing system should take this into account. Facilities for earth bonding operating/maintenance personnel should also be considered.

The use of radio transceivers is widespread on chemical plant, particularly by instrument maintenance personnel when calibrating or checking field instruments back to the computer. Note should be made of the susceptibility of computer equipment to Radio Frequency Interference (R.F.I.).

COMPUTER SYSTEM HARDWARE

In the first stage of the assessment, the philosophy of plant control was established and the critical measurements and control loops were identified. Failures that could lead to a potential hazard are governed by the reliability of the hardware relating to these critical parameters. In order to enhance the reliability, a number of actions can be taken. For example, distributing the critical inputs and outputs across the respective interface cards ensures that a single card failure would not lead to the loss of a number of critical measurements. Again, if a parameter is used for control plus a critical alarm, it should preferably have two separate measuring devices fed in via separate channels. It follows, therefore, that the emphasis regarding hardware reliability is focussed on its configuration.
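The configuration rule above lends itself to a mechanical check. The following is a minimal sketch, assuming a hypothetical table that maps each critical signal tag to the interface card it is wired to; the tag names (TT101A and so on) and card names are invented for illustration.

```python
from collections import defaultdict


def cards_with_shared_critical_signals(assignments):
    """Return {card: [tags]} for every card that carries more than one
    critical signal, i.e. where a single card failure loses several at once."""
    by_card = defaultdict(list)
    for tag, card in assignments.items():
        by_card[card].append(tag)
    return {card: sorted(tags) for card, tags in by_card.items() if len(tags) > 1}


def distribute_round_robin(tags, cards):
    """Spread critical tags across the available cards in turn.
    If there are more tags than cards, some sharing is unavoidable and
    cards_with_shared_critical_signals will report it."""
    return {tag: cards[i % len(cards)] for i, tag in enumerate(tags)}


# Duplicated temperature measurement wired to the same card: flagged.
as_wired = {"TT101A": "AI-1", "TT101B": "AI-1", "PT205": "AI-2"}
print(cards_with_shared_critical_signals(as_wired))

# Redistributed over three cards: nothing to report.
proposed = distribute_round_robin(["TT101A", "TT101B", "PT205"], ["AI-1", "AI-2", "AI-3"])
print(cards_with_shared_critical_signals(proposed))
```

A check of this kind only covers card allocation; separate cable routes and separate measuring devices still have to be confirmed against the installation drawings.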

System failure modes can generally be classified according to their effects on plant safety and whether they are revealed or unrevealed. They can be 'FAIL SAFE' or 'FAIL TO DANGER', or have negligible effect. A 'FAIL SAFE' mode exists when, as a result of some failure, the plant moves towards a safe state, eg a controlled plant shutdown. 'FAIL TO DANGER' modes exist when, as a result of the failure, the plant moves towards a potentially hazardous situation. It is the latter that have to be quantitatively assessed against the safety criteria for the plant.

The method of analysis most frequently used within the hazard studies is Fault Tree Analysis (F.T.A.). This method, a 'top down' approach, identifies the system failure mode of interest (the top event) and then identifies the failures at the sub-system levels that can cause that failure, until the lowest desired level of component is reached. Applied to computer systems on chemical plant, the lowest level would be input/output boards or a processor board, not the components within that board.

Complementary to F.T.A., a form of Failure Mode and Effect Analysis (F.M.E.A.) would have been applied at the first stage of the assessment in order to determine the required level of redundancy for the computer system.

Referring back to revealed failures, all on-line computer systems have some form of in-built software diagnostics, or 'watch-dog', which to a greater or lesser degree will detect failures when they occur and provide some form of signal. This signal, in its simplest form, would merely notify the operator of the failure. On more complex systems, it would cause switch over to the back-up systems, or initiate plant shut-down.
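The top-down structure of the fault tree method described above can be illustrated with a small worked sketch at board level. It assumes that the basic events are statistically independent, which ignores common mode failure, and the event names and probabilities are invented for illustration only; they are not data from any ICI assessment. Under that assumption an OR gate combines as 1 - (1 - p1)(1 - p2)... and an AND gate as p1 x p2 x ...

```python
from math import prod


def or_gate(*probabilities):
    """Output event occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(*probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


# Hypothetical failure probabilities over the assessment period, per board.
P_PROCESSOR_BOARD = 0.01
P_INPUT_BOARD = 0.02
P_OUTPUT_BOARD = 0.005

# Duplicated processors and duplicated input boards: both of a pair must fail.
processors_lost = and_gate(P_PROCESSOR_BOARD, P_PROCESSOR_BOARD)
inputs_lost = and_gate(P_INPUT_BOARD, P_INPUT_BOARD)

# Top event: loss of a critical control loop. The single output board is
# not duplicated, so it enters the OR gate directly.
top_event = or_gate(processors_lost, inputs_lost, P_OUTPUT_BOARD)

print(f"processors lost: {processors_lost:.6f}")
print(f"inputs lost:     {inputs_lost:.6f}")
print(f"top event:       {top_event:.6f}")
```

With these made-up figures the unduplicated output board dominates the top event, which is the kind of finding that points the assessment back to the hardware configuration.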
The degree of on-line diagnostics will determine the percentage of failures detected; one manufacturer of a distributed system claims that 98% of failures are detected.

Following this analysis, a quantified assessment of the hardware reliability can be carried out, using the available failure data and including the mean time to repair of any equipment installed to provide redundancy.

COMPUTER SYSTEM SOFTWARE

"No techniques are available which allow the frequency of software failures to be predicted in advance of use" (ref 1, p180).

"It was recognised that hardware fails in a random manner but software which has specification and design error induced failures does not" (ref 1, p197).

"Although programmable system failures can be identified using F.T.A., for specific hazards only the hardware element can be quantified" (ref 1, p180).

Although Hazard and Operability techniques for assessing chemical plant have been developed and refined by ICI since the early 1970s, computer systems were at first only subjected to detailed study and checking by the design discipline. Formalised hazard studies by a multi-disciplined team were not carried out on computer systems, especially the software.

However, procedures for software modifications (Ref 14), based on established HAZOP techniques, were initiated in the early 1980s. A process control computing guideline was compiled at about the same time.

On 6 December 1985, an incident occurred on a Nylon Polymer Plant at Wilton Site when, due to a computer failure, the plant was shut down in an uncontrolled manner. Control valves were wrongly set, and 3 tonnes of molten polymer at 300 °C under a nitrogen pressure of 27 barg, followed by raw materials for the next batch, came out of the extrusion valve onto the floor.

Following this incident and the subsequent Hazard Study of the computer system, especially the software (ref 1, pp189-206), improved and more comprehensive design and checking procedures for ensuring software reliability were established. The procedures are constructed on the basis of pre-requisites followed by checklists. This approach now enables software to be subjected to a formalised review at each phase of its development by a multi-discipline team. It also ensures that comprehensive documentation is provided. Structured programming techniques are used, tried and tested routines are used wherever possible, and separate modules are written for logically separate functions. Examples of the pre-requisites are given in Appendix I and of the checklists in Appendix II.

This approach, however, only applies to the application or user software. Each system will be accompanied by a suite of software provided by the manufacturer: the operating system or system software. The user will have no say in the creation of this system software, nor will he know what quality checks have been imposed. It will therefore be the responsibility of the user to establish this aspect of quality and reliability with the manufacturer. During these discussions it will be most important to establish the program change control procedures followed by the manufacturer. This has been recognised by the H.S.E. in the P.E.S. Guidelines Part 2, checklist No 14A.

OTHER TECHNIQUES

I have briefly described ICI's approach to the reliability assessment of computer systems, namely, the application of well established Hazard and Operability techniques to the installation and the system hardware, plus quality assurance procedures for the software using pre-requisites and checklists throughout the software life-cycle.

It has been widely recognised that, whilst the reliability of a computer system's hardware can be relatively easily assessed both quantitatively and qualitatively, software reliability assessment is difficult, complex, and well nigh impossible.

Software failures are considered to be 'systematic' and cover the whole life cycle from specification to testing. In order to reduce these errors, formal methods for specification have been designed, eg MASCOT (Ref 15), which is a design method supported by a programming system. It defines a systematic method of expressing the structure of a real-time system in a way that is independent of the target hardware or implementation language.

Software checking programs are in abundance; one in particular, MALPAS (MALvern Program Analysis Suite), was developed by R.S.R.E. over many years for the static analysis of critical software.

Program testing is the last step before system test and commissioning, and test methods and test data are as critical as the program itself. In his paper at the Safety and Reliability Society Symposium 1987, SARSS 87 (ref 1, pp100-117), P G Bishop outlined the STEM project (Software Test and Evaluation Methodologies), the objective of which was to evaluate a number of fault detection, fault prediction and failure methods. This was done by applying these methods to the documented computer programs on diverse software. These programs were produced against a common specification for a safety trip system on an experimental nuclear reactor at Halden, Norway, and were known to contain faults. Of the various methods used within the project, testing diverse software programs back to back with random data proved to be the most effective, finding over 90% of known faults.

These are just a small sample of the methodologies that are being used to improve software reliability, which is recognised as the problem area in achieving safety and reliability for computer systems.

CONCLUSIONS

ICI has been one of the forerunners in designing and developing Hazard and Operability study techniques for ensuring chemical plant safety since the early 1970s. Computer technology and its application have improved and increased enormously over the same period, and can make a positive contribution to plant safety. At the same time, it has been recognised that failures within a computer system controlling a hazardous plant can have serious consequences.

There is, therefore, a need to carry out reliability and safety assessment on computer systems in safety related applications, and ICI has adapted its well proven HAZOP techniques for this purpose. It is not considered that entirely new methods of analysis are required but, due to the rapidly evolving technology and functional complexity, some development of existing analytical techniques is necessary.

The need for guidelines and standards in this area has also been recognised, particularly with the advent of new statutory regulations. The HSE have published guidelines (ref 6) on programmable electronic systems in safety related applications. The European Workshop on Industrial Computer Systems Technical Committee 7 (EWICS TC7) has also produced a number of position papers and four guidelines on safety related computer systems under part sponsorship from the CEC (refs 7 to 13).

Material from the EWICS TC7 documents is being used by the international standards organisations, the International Electrotechnical Commission (IEC) and the International Standards Organisation (ISO), in their current work.

Even with standards, guidelines and methodologies, it will still be necessary to allow a committed, well-disciplined team time within projects to produce reliable systems and prevent major chemical and process related accidents.

ACKNOWLEDGEMENTS

The author would like to thank ICI plc for permission to publish this paper.

REFERENCES

1. 'Achieving Safety and Reliability with Computer Systems'. Proceedings of the Safety and Reliability Society Symposium 1987, 11-12 November 1987. Elsevier Applied Science.

2. Safety and Reliability of Programmable Electronic Systems. Proceedings of P.E.S. Safety Symposium, Guernsey, UK, 28-30 May 1986. Elsevier Applied Science.

3. SAFECOMP 83, Achieving Safe Real Time Computer Systems. Proceedings of the third IFAC/IFIP Workshop, Cambridge, UK, September 1983. Pergamon Press.

4. SAFECOMP 86, Trends in Safe Real Time Computer Systems. Proceedings of the fifth IFAC Workshop, Sarlat, France, 14-17 October 1986. Pergamon Press.

5. Symposium on Safety and Reliability of Programmable Electronic Systems. I.E.E. Teesside sub centre, Middlesbrough, 14 May 1987.

6. H.S.E. Guidelines - Programmable Electronic Systems in Safety Related Applications - Part 2, General Technical Guidelines. H.M.S.O., June 1987.

7. Draft Guidelines to Design Computer Systems for Safety.

8. Attributes, Criteria and Measures.

9. Draft Guidelines for the Assessment of the Safety and Reliability of High Integrity Industrial Computer Systems.

10. Techniques Directory.

11. Draft Guidelines on Safety Related Measures to be used in Software Quality Assurance.

12. Classification of Functions of Quality Assurance.

13. Draft Guidelines for the Maintenance and Modification of Safety Related Computer Systems.

Refs 7-13 above are EWICS TC7 final deliverable documents under a contract with CEC DG XIII, which are intended to be published in book form in 1989.

14. Software Maintenance and Change Control. S R Nunns, ICI plc Engineering Department, NE. Internal working document.

15. MASCOT 3 User Guide, 1987. MASCOT Users Forum, RSRE Malvern, England.

APPENDIX I

EXAMPLES OF PRE-REQUISITES

Designing within Constraints

(a) Estimate the processor timing load for each task.

(b) Estimate peripheral/resource access constraints for each task.

(c) Estimate space requirements for each task.

(Any problems revealed by (a), (b) and (c) should cause an urgent review of the previous phase.)

(d) Identify the partitions, bearing in mind the physical constraints imposed by the operating system. From this, split the task into individual functions which will identify the modules. Modules should have a minimum of external links and be designed to ensure that probable modifications result in changes to a minimum number of modules. These modules should be functionally complete with well defined interfaces (average size should be 300 lines).

Software should be partitioned so that aspects which handle, for example, computer external interfaces, scheduling and store allocation are separated from application procedures, with well defined interfaces between them. This produces a clearer system structure and aids verification.

Diagrams should be used (see Appendix B) to describe:

(1) Modules and data with control flow.
(2) Core store map.
(3) Backing store map.
(4) Task thread diagram.

These should be kept up-to-date throughout the life-cycle. Principal decisions should be documented to avoid later ambiguity, and areas of difficulty should be identified at an early stage in the design procedure.
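Pre-requisite (a) above reduces to simple arithmetic once an execution time and a repetition period have been estimated for each task. The sketch below is illustrative only: it assumes periodic tasks, the task names and timings are hypothetical, and the 70% review threshold is an arbitrary figure chosen for the example rather than one taken from the ICI procedure.

```python
def processor_load(tasks):
    """Sum execution_time / period over all tasks.
    Both figures for a task must be in the same time unit (here milliseconds)."""
    return sum(execution / period for execution, period in tasks.values())


# Hypothetical estimates: task name -> (execution time, period), in ms.
tasks = {
    "scan_inputs": (20, 100),
    "control_loops": (150, 1000),
    "display_update": (200, 2000),
}

REVIEW_THRESHOLD = 0.70  # assumed margin; choose to suit the project

load = processor_load(tasks)
needs_review = load > REVIEW_THRESHOLD

for name, (execution, period) in tasks.items():
    print(f"{name}: {execution / period:.0%} of processor time")
print(f"total load: {load:.0%}, review previous phase: {needs_review}")
```

A total close to or above the chosen threshold would be one of the problems that, as noted after (c), should trigger an urgent review of the previous phase.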

APPENDIX II

EXAMPLE OF CHECKLIST

(1) Are the man/machine and instrument interfaces properly defined?
(2) Have means of extending it been defined?
(3) Are there various modes, ie for skilled and unskilled users?
(4) Has the interface been defined such that illegal actions do not corrupt the system or lock up the interface?
(5) Are there means of monitoring the interface loading and the type and frequency of user actions?
(6) Has the interface built-in security?
(7) Has the interface response time been defined?

(1) Are the files detailed?
(2) Have the controls over them been stated?
(3) Has the detailed layout been defined?
(4) What routines have access to what parts of the files?
(5) What security is there on the files from the system and user points of view (ie passwords, etc)?
(6) What routine can update these, when and how?
(7) What duplications and recovery do they have?
(8) Have overflow conditions and actions been defined?
(9) Have the lives of the files and records been defined?
(10) Has the rate of update been stated?
(11) Have the checks on data loss and how to recover the loss been defined?
(12) Have the methods and means been defined of how a process checks that the part of the file read or updated is the correct part?
(13) Have the block lengths been defined or, if of variable length, have means of determining the length been defined?
(14) Have the means of assessing been defined?
(15) Have the means of creating the initial file been defined?
(16) Can the frequency of file access be monitored?