Software Engineering. Computer Science Tripos 1B Michaelmas 2011. Richard Clayton



Similar documents
The Therac 25 A case study in safety failure. Therac 25 Background

Software Safety Basics

Safety and Hazard Analysis

PATRIOT MISSILE DEFENSE Software Problem Led to System Failure at Dhahran, Saudi Arabia

New trends in medical software safety: Are you up to date? 5 October 2015, USA

Dependable Systems Course. Introduction. Dr. Peter Tröger

SFTA, SFMECA AND STPA APPLIED TO BRAZILIAN SPACE SOFTWARE

World-first Proton Pencil Beam Scanning System with FDA Clearance

Issue 1.0 Jan A Short Guide to Quality and Reliability Issues In Semiconductors For High Rel. Applications

Information Security Policies. Version 6.1

Risk. Risk Categories. Project Risk (aka Development Risk) Technical Risk Business Risk. Lecture 5, Part 1: Risk

Arc Flash Mitigation. Remote Racking and Switching for Arc Flash danger mitigation in distribution class switchgear.

JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 21/2012, ISSN

Occupational Electrical Accidents in the U.S., James C. Cawley, P.E.

Safety-Critical Firmware What can we learn from past failures?

WHITEPAPER: SOFTWARE APPS AS MEDICAL DEVICES THE REGULATORY LANDSCAPE

Project Risks. Risk Management. Characteristics of Risks. Why Software Development has Risks? Uncertainty Loss

Software Engineering. Hans van Vliet Vrije Universiteit Amsterdam, The Netherlands

A Patient s Guide to the Calypso System for Breast Cancer Treatment

Turntable microswitch assembly Plunger

NTSU Fleet Vehicle Driving

Use of the VALIDATOR Dosimetry System for Quality Assurance and Quality Control of Blood Irradiators

University of Paderborn Software Engineering Group II-25. Dr. Holger Giese. University of Paderborn Software Engineering Group. External facilities

Cathode Ray Tube. Introduction. Functional principle

COMMONWEALTH OF PENNSYLVANIA DEPARTMENT S OF PUBLIC WELFARE, INSURANCE, AND AGING

CS4507 Advanced Software Engineering

C3306 LOCKOUT/TAGOUT FOR AUTHORIZED EMPLOYEES. Leader s Guide. 2005, CLMI Training

Embedded Real-Time Systems (TI-IRTS) Safety and Reliability Patterns B.D. Chapter

Analyzing Electrical Hazards in the Workplace

BWC Division of Safety and Hygiene

Developing software for Autonomous Vehicle Applications; a Look Into the Software Development Process

Motivation and Contents Overview

Risk and Hazard Management

How To Use An Nc Express5800 Card (Nec) With A Remote Management Card (Ec) On A Pc Or Mac Computer (Nemca) With An Ipa/Sda/Sdu) On An Ipad Or

Human Factors and Medical Device Accidents. Presented by Frank R. Painter, MS, CCE University of Connecticut

penelope athena software SOFTWARE AS A SERVICE INFORMATION PACKAGE case management software

Risk Assessment. Module 1. Health & Safety. Essentials November Registered charity number

CS 6290 I/O and Storage. Milos Prvulovic

Workplace Incident Fatalities Accepted by the Workers Compensation Board in 2014

Near Miss Reporting. Loss Causation Model

Building Software in an Agile Manner

Rice University Laser Safety Manual

Risk Analysis of a CBTC Signaling System

LS1024B / LS2024B/ LS3024B. Solar Charge Controller USER MANUAL

Lockout/Tagout (LOTO): Automating the process

Citation 1 Item 1a. #22: Struck by Inspection #

System Integration. system integration at a glance. Manufacturing Capabilities. Control and Identity Systems

Safety and Radiation Protection Office Working with X-Ray Equipment

20-CS X Network Security Spring, An Introduction To. Network Security. Week 1. January 7

Defense Acquisition Review January April 2004

Software Engineering Introduction & Background. Complaints. General Problems. Department of Computer Science Kent State University

Developing software which should never compromise the overall safety of a system

Preventing Workplace Injuries. Mike Griffiths, HM Inspector of Health and Safety

Ethical Issues in the Software Quality Assurance Function

Software in safety critical systems

An accident is an unplanned and unwelcome event which disrupts normal routine.

DISASTER RECOVERY AND NETWORK REDUNDANCY WHITE PAPER

White Paper. The Ten Features Your Web Application Monitoring Software Must Have. Executive Summary

Software-based medical devices from defibrillators

Safety Requirements Specification Guideline

Workers Compensation: Making a claim

New Challenges In Certification For Aircraft Software

Guide to Sources Relating to Units of the Canadian Expeditionary Force. Tramways Companies, Canadian Engineers

Elevator Malfunction Anyone Going Down?

How to Mimic the PanelMate Readout Template with Vijeo Designer

Introduction. Getting started with software engineering. Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 1 Slide 1

PG DRIVES TECHNOLOGY R-NET PROGRAMMER R-NET PROGRAMMING SOFTWARE - DEALER ELECTRONIC MANUAL SK78809/2 SK78809/2 1

APA format (title page missing) Proton Therapy 3. A paper. A New Cancer Treatment: Proton Therapy. A Review of the Literature

Generic risk assessment form. This document forms part of Loughborough University s health and safety policy Version 3 February 2014

Bypass transfer switch mechanisms

the compensation myth

An Escape. our man John on a research binge. Electric Slide

AHIS Road safety project Student Council THINK!

WHS Induction Series. 36 Toolbox Talks. Contents

Motorcycle Airbag System

Owner/ Operator Hauling Asphalt Flux Dies After Driving into a Ravine and Striking Trees

Data Loss Prevention Program

How To Control Gimbal

NFPA 70E 2012 Rolls Out New Electrical Safety Requirements Affecting Data Centers

Machine Guarding and Operator Safety. Leader Guide and Quiz

ROTATING MACHINES. Alignment & Positioning of. Fast, easy and accurate alignment of rotating machines, pumps, drives, foundations, etc.

PhoneWatch Smart Security System User Manual - Domonial

Transcription:

Software Engineering Computer Science Tripos 1B Michaelmas 2011 Richard Clayton

Critical software Many systems must avoid a certain class of failures with high assurance safety critical systems failure could cause, death, injury or property damage security critical systems failure could allow leakage of confidential data, fraud, real time systems software must accomplish certain tasks on time Critical systems have much in common with critical mechanical systems (bridges, brakes, locks, ) Key: engineers study how things fail

Tacoma Narrows, Nov 7 1940

Definitions I Error design flaw or deviation from intended state (a static quality) Failure non-performance of system (a dynamic quality). Classical definition says under specified environmental conditions. Reliability probability of failure within a set period of time typically expressed as MTBF/MTTF: mean time between failures / to failure, depending whether system will be repaired and restarted Accident undesired, unplanned event resulting in specified kind/level of loss Near Miss (or Incident) event with the potential to be an accident, but no loss occurs

Definitions II Safety freedom from accidents Hazard Risk set of conditions on system which in some environmental conditions, will lead to an accident hence: hazard + failure = accident the probability of a bad outcome the probability that hazard leads to accident (danger), combined with the hazard exposure or duration (latency) Uncertainty risk not quantifiable

Arianne 5, June 4 1996 Arianne 5 accelerated faster than Arianne 4 This caused an operand error in float-to-integer conversion The backup inertial navigation set dumped core The core was interpreted by the live set as flight data Full nozzle deflection 20 o angle of attack booster separation $370 million of satellites destroyed

Real-time systems Many safety-critical systems are also real-time systems used in monitoring or control Criticality of timing makes many simple verification techniques inadequate Often, good design requires very extensive application domain expertise Exception handling tricky, as with Arianne Testing can also be really hard

Patriot missile failure, Feb 25 1991 Failed to intercept an Iraqi scud missile in First Gulf War SCUD struck US barracks in Dhahran; 29 dead Other SCUDs hit Saudi Arabia, Israel

Patriot missile failure Reason for failure measured time in 1/10 sec, truncated from.0001100110011 when system upgraded from air-defence to anti-ballistic-missile, accuracy increased but not everywhere in the (assembly language) code! modules got out of step by 1/3 sec after 100 hours operation so system looked for Scud 600 metres away from where it was since nothing visible at incorrect location, no launch occurred not found in testing as spec only called for 4h tests Critical system failures are typically multifactorial: a reliable system can t fail in a simple way But classical definition of `failure said under specified environmental conditions... So was this a failure?

Security critical systems Usual approach try to get high assurance of one aspect of protection Example: stop classified data flowing from high to low using one-way flow Assurance via simple mechanism Keeping this small and verifiable is often harder than it looks at first!

Building critical systems Some things go wrong at the detail level and can only be dealt with there (e.g. integer scaling) However in general safety (along with security and real-time performance) is a system property and has to be dealt with at the system level A very common error is not getting the scope right for example, designers don t consider human factors such as usability and training

Hazard elimination e.g., motor reversing circuit above, in the left hand circuit failure of both switches to move together will short the battery Some tools can eliminate whole classes of software hazards, e.g. using a strongly-typed language such as Ada But usually hazards involve more than just software

The Therac accidents I The Therac-25 was a radiotherapy machine sold by AECL; 11 machines shipped Between 1985 and 1987 three people died in six accidents Example of a fatal coding error, compounded with usability problems and poor safety engineering

The Therac accidents II 25 MeV therapeutic accelerator with two modes of operation: 1. 25MeV focussed electron beam on target to generate X-rays 2. 5-25 MeV spread electron beam for skin treatment (with 1% of beam current) Safety requirement don t fire 100% beam at human!

The Therac accidents III Previous models (Therac 6 and 20) had mechanical interlocks to prevent high-intensity beam use unless X-ray target in place The Therac-25 replaced these with software Fault tree analysis arbitrarily assigned probability of 10-11 to computer selects wrong energy Code was poorly written, unstructured and not really documented

The Therac accidents IV Marietta, GA, June 85: woman s shoulder burnt. settled out of court. FDA not told. Hamilton, Ontario, July 85: woman s hip burnt. AECL suspected a micro-switch error (reporting incorrect turntable positions) but could not reproduce fault; changed software anyway. Yakima, WA, Dec 85: woman s hip burned could not be a malfunction East Texas Cancer Centre, Mar 86: man burned in neck died five months later of complications 3 weeks later: another man burned on face & died after 3 weeks Hospital physicist managed to reproduce flaw: if parameters changed too quickly from x-ray to electron beam, then the safety interlocks failed Yakima, WA, Jan 87: man burned in chest and died different bug now thought to have caused Ontario accident

The Therac accidents V East Texas deaths caused by editing beam type and then issuing a start treatment request very quickly thereafter This was due to poor software design

The Therac accidents VI Data entry routine sets turntable and MEOS (the mode and energy level) When data entry complete (cursor on last line) machine starts configuration Part of this involves setting magnets into correct position (which takes 8 seconds, so a timer routine is called) The timer routine checks for cursor movement if it is being called whilst the magnets are being moved Unfortunately, it also cleared the magnets moving flag; so it didn t check the cursor for subsequent magnet moves

The Therac accidents VII AECL had ignored safety aspects of software initial investigations had looked for hardware faults Confused reliability with safety Lack of defensive design Inadequate reporting, follow-up and regulation failed to explain Ontario accident at the time a true/false flag was being incremented to keep it true, and after 255 increments it speciously got set to the wrong value! Unrealistic risk assessments ( think of a number and double it ) Inadequate software engineering practices specification an afterthought, complex architecture, dangerous coding, little testing, careless HCI design, incomprehensible messages displayed to users, failure to follow up accident reports