Software Safety Basics



Similar documents
PATRIOT MISSILE DEFENSE Software Problem Led to System Failure at Dhahran, Saudi Arabia

Software Engineering. Computer Science Tripos 1B Michaelmas Richard Clayton

CS4507 Advanced Software Engineering

Dependable Systems Course. Introduction. Dr. Peter Tröger

Software Engineering. Lecture 1 Introduction. Adapted from: Chap 1. Sommerville 9 th ed. Chap 1. Pressman 6 th ed.

New trends in medical software safety: Are you up to date? 5 October 2015, USA

The Therac 25 A case study in safety failure. Therac 25 Background

Safety and Hazard Analysis

Software Engineering. Hans van Vliet Vrije Universiteit Amsterdam, The Netherlands

University of Paderborn Software Engineering Group II-25. Dr. Holger Giese. University of Paderborn Software Engineering Group. External facilities

Part 2: The Use of Software in Safety Critical Systems

Practical Guide to Endurance and Data Retention

Developing software which should never compromise the overall safety of a system

Value Paper Author: Edgar C. Ramirez. Diverse redundancy used in SIS technology to achieve higher safety integrity

UC CubeSat Main MCU Software Requirements Specification

Dependable Systems. Trends in Software Dependability. Dr. Peter Tröger. Most pictures (C) IBM

Introduction. Getting started with software engineering. Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 1 Slide 1

Embedded Real-Time Systems (TI-IRTS) Safety and Reliability Patterns B.D. Chapter

On-line PD Monitoring Makes Good Business Sense

Overview and History of Software Engineering

JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 21/2012, ISSN

Software Tragedies: Case Studies in Software Safety

The Role of Software in Recent Catastrophic Accidents

No quality mobile apps without testing in production (TiP) Marc van t Veer

What Types of ECC Should Be Used on Flash Memory?

Ethical Issues in the Software Quality Assurance Function

Safety-Critical Firmware What can we learn from past failures?

Defense Acquisition Review January April 2004

Embedded Systems Lecture 9: Reliability & Fault Tolerance. Björn Franke University of Edinburgh

Functional safety. Essential to overall safety

Analyzing the Security Significance of System Requirements

Mission Assurance Manager (MAM) Life Cycle Risk Management Best Practices David Pinkley Ball Aerospace MA Chief Engineer September 23, 2014

The Fallacy of Software Write Protection in Computer Forensics Mark Menz & Steve Bress Version 2.4 May 2, 2004

Software Engineering. Introduction. Software Costs. Software is Expensive [Boehm] ... Columbus set sail for India. He ended up in the Bahamas...

Risk Assessment. Module 1. Health & Safety. Essentials November Registered charity number

What You Should Know About Cloud- Based Data Backup

Manage the RAID system from event log

A Quality Requirements Safety Model for Embedded and Real Time Software Product Quality

Code of Conduct for Commercial Drivers

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

SILs and Software. Introduction. The SIL concept. Problems with SIL. Unpicking the SIL concept

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

Numerical Matrix Analysis

The Role of Automation Systems in Management of Change

CAST Analysis John Thomas and Nancy Leveson. All rights reserved.

Fault Tolerant Servers: The Choice for Continuous Availability on Microsoft Windows Server Platform

REM-Rocks: A Runtime Environment Migration Scheme for Rocks based Linux HPC Clusters

3.4.4 Description of risk management plan Unofficial Translation Only the Thai version of the text is legally binding.

PRIMECLUSTER GLS for Windows. GLS Setup Guide for Cluster Systems MSCS/MSFC Edition

Atlas Emergency Detection System (EDS)

Yaffs NAND Flash Failure Mitigation

Risk. Risk Categories. Project Risk (aka Development Risk) Technical Risk Business Risk. Lecture 5, Part 1: Risk

NEW ZEALAND INJURY PREVENTION STRATEGY SERIOUS INJURY OUTCOME INDICATORS

Database and Data Mining Security

Selecting Sensors for Safety Instrumented Systems per IEC (ISA )

Software Safety Hazard Analysis

Introducing Tripp Lite s PowerAlert Network Management Software

Reducing Medical Errors with an Electronic Medical Records System

The Course.

Basic Components of a Spacecraft Computer. Hardware, Software, and Documentation

Professional Issues Part V1: Liability for Defective Software. Dr. Amanda Sharkey

Seion. A Statistical Method for Alarm System Optimisation. White Paper. Dr. Tim Butters. Data Assimilation & Numerical Analysis Specialist

Practical issues in DIY RAID Recovery

How Routine Data Center Operations Put Your HA/DR Plans at Risk

ANDRA ZAHARIA MARCOM MANAGER

Dell Server Management Pack Suite Version 6.0 for Microsoft System Center Operations Manager User's Guide

PTP-Global. Alarm Management An Introduction

Achieving High Availability

25 Backup and Restoring of the Database

Basic Fundamentals Of Safety Instrumented Systems

Occupational safety risk management in Australian mining

Controlling Risks Risk Assessment

A New Approach to Nuclear Computer Security

Test Data Management Best Practice

LDA, the new family of Lortu Data Appliances

Levels of Software Testing. Functional Testing

Vijeo Citect run as a Windows service

Software-based medical devices from defibrillators

How to Mimic the PanelMate Readout Template with Vijeo Designer

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING

How Complex Systems Fail

RAID Basics Training Guide

System Release Notes Express5800/320LB System Release Notes

A Risk Assessment Methodology (RAM) for Physical Security

DEDICATED TO EMBEDDED SOLUTIONS

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Why disk arrays? CPUs speeds increase faster than disks. - Time won t really help workloads where disk in bottleneck

Hazardous Material Waller County ARES training material used with permission from Christine Smith, N5CAS.

The Role of Software in Spacecraft Accidents

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 13 Business Continuity

Normal Accident Theory

Software Testing Strategies and Techniques

A Methodology for Safety Critical Software Systems Planning

Clinical Risk Management: Agile Development Implementation Guidance

Safety Issues in Automotive Software

PIONEER RESEARCH & DEVELOPMENT GROUP

WHITE PAPER. Improving Driver Safety with GPS Fleet Tracking Technology

6.828 Operating System Engineering: Fall Quiz II Solutions THIS IS AN OPEN BOOK, OPEN NOTES QUIZ.

DANGER indicates that death or severe personal injury will result if proper precautions are not taken.

Compliance. TODAY February Meet Lew Morris

Transcription:

Software Safety Basics (Herrmann, Ch. 2) 1

Patriot missile defense system failure On February 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud. This Scud subsequently hit an Army barracks, killing 28 Americans. [GAO] 2

Patriot: A software failure [A] software problem in the system s weapons control computer led to an inaccurate tracking calculation that became worse the longer the system operated. At the time of the incident, the battery had been operating continuously for over 100 hours. By then, the inaccuracy was serious enough to cause the system to look in the wrong place for the incoming Scud. [GAO] 3

Tracking a missile: what should happen Search: Wide range scanned When missile detected, range gate calculates the next area to scan Validation, Tracking: Only range gated area scanned 4

Software design flaw Range gate calculates predicated position from Time of last radar detection: integer, measuring tenths of seconds Known velocity of missile: floating-point value Problem: Range gate used 24-bit registers, and each 0.1-second time increment added a little error Over time, this error became significant enough to cause range gate to miscalculate missile position 5

What actually happened Range gated area shifted, no longer accurate 6

Sources of the problem Patriot designed for use against slower (Mach 2) missiles, not Scuds (Mach 5) Proper calibration not performed largely due to fear that adding an external recorder could crash the system(!) Patriot system typically used in short intervals no longer than 8 hours Supposed to be mobile, quick on/off, to avoid detection 7

Ariane 5 failure On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700m, the launcher veered off its flight path, broke up and exploded. 8

Ariane 5: A software failure Unexpectedly large values encountered during alignment of inertial platform Attempt to convert overly large 64-bit value into a 16-bit value Software exception Guidance system (hardware) shutdown 9

Sources of the problem Alignment code reused from (smaller, less powerful) Ariane 4 Velocity values of Ariane 5 were out of range of Ariane 4 Ironically, alignment not even needed after lift-off! Why was alignment code running? Engineers decided to leave it running for 40 seconds after planned lift-off time Permitting easy restart if launch was put on hold briefly 10

Panama Cancer Institute accidents (Gage & McCormick, 2004) November 2000: 27 cancer patients given massive doses of radiation Partly due to flaws in Multidata software Medical physicists who used the software were found guilty of 2 nd degree murder in Panama Note: In the well-known Therac-25 incidents of the 1980s, software failures led to massive doses of radiation being administered to patients. Do we ever learn?... 11

Multidata software Used to plan radiation treatment Operator enters patient data Operator indicates placement of blocks (metal shields used to protect sensitive areas) through graphical editor Software provides 3D prediction of where radiation would be distributed From this data, dosage is determined 12

NRC Information Notice 2001-08, Supp. 2 Block placement editor Blocks drawn as separate polygons (There are 2 blocks in this picture) Software limitation: At most 4 blocks What if doctors want to use more blocks? 13

NRC Information Notice 2001-08, Supp. 2 A solution Note: This is a single unbroken line Software treated it as a single block Now you can draw more blocks! 14

Fatal problem Dosage prediction algorithm expected blocks in the form of polygons, but graphical editor allowed nonpolygons When run on non-polygon blocks, predictions were drastically wrong; overly high dosages prescribed 15

What is software safety? Features and procedures which ensure that a product performs predictably under normal and abnormal conditions, and the likelihood of an unplanned event occurring is minimized and its consequences controlled and contained; thereby preventing accidental injury or death, whether intentional or unintentional. (Herrmann) 16

Features and procedures Features: built into the software itself Range checks; monitors; warnings/alarms Procedures: concern the proper environment for the software, and its proper use Computer hardware that the software runs on Physical, mechanical components of environment Human users 17

Normal and abnormal conditions Abnormal conditions: Failure of hardware components Power outage Extreme environmental conditions (temperature, velocity) What to do? Not necessarily the best reaction, but one that has the best chance of preventing injury or death Fail-safe: shut down Fail-operational: continue in simpler degraded mode 18

Avoiding unplanned events To Herrmann, human users are the primary source of such events Can produce unusual inputs or combinations of inputs User interface design, testing can be crucial to software safety Understand user behavior Create interfaces that guide users toward good input 19

Terminology alert #1 There are many definitions of safety Herrmann thinks of safety as a set of features and procedures Something you can actually see in the software Leveson: freedom from accidents or losses This is an idealized property of the software something to aim for rather than actually achieve Storey distinguishes safety from adequate safety Here, safety is close to Leveson s definition; adequate safety is closer to Herrman s definition 20

Fault, error and failure Fault: cause of error - Misinterpreted software requirements - Code defects: code doesn t implement required behavior Error: dangerous state, from which failure is possible Failure: unacceptable system behavior 21

Fault, error and failure: Example Fault: memory device fails - Note: failure of component means fault at higher level - If memory device not used, fault remains hidden Error: attempt at memory access is unsuccessful - Now, presence of fault is evident - But, software may be able to handle it gracefully Failure: unsuccessful memory access results in computing an out of bounds value 22

Faults: Hardware vs. software Some hardware faults may be random Due to manufacturing defects or simple wear and tear Probability can be estimated statistically Well-known techniques to minimize random faults: error-correcting codes, redundant systems Software faults are always systematic not random Generated during design or specification not execution Software is not manufactured and doesn t wear out Techniques for minimizing random faults don t work with systematic faults Ariane 5 had redundant systems all running the same software! 23

Fault management options Avoidance: Prevent faults from entering the system during the design phase good practices in design e.g. programming standards Removal: Find faults in the system before release Testing costly and not always very effective 24

Fault management options Tolerance: Find faults in operational system after release, allow system to proceed correctly Recovery blocks: Create duplicate code modules Run primary module, then run an acceptance test If test fails, roll back changes and run an alternative module N-version programming: several independent implementations of a program Goal: ensure design diversity, avoid common faults Both approaches are costly, and may not be very effective For a study on whether N-version programming really achieves design diversity, read Knight & Leveson s article. 25

Model of system failure behavior fault removed fault not introduced Perfect fault introduced OK Erroneous error detected error not detected Fail Operational Fail Safe Innocuous Failure Dangerous Failure Known Safe State Unknown or Dangerous State 26

Terminology alert #2 fault and error have many alternative definitions Sometimes, error is a synonym for what we re calling fault, and fault means behavior that may trigger a failure Following these alternative definitions, we have: error 27

References United States General Accounting Office. Report IMTEC-92-26, February 4, 1992. http://www.fas.org/spp/starwars/gao/im92026.htm Ariane 5 Flight 501 Failure Report by the Inquiry Board. July 19, 1996. http://sunnyday.mit.edu/accidents/ariane5accidentreport.html U.S. Nuclear Regulatory Commission. Update on radiation therapy overexposures in Panama. NRC Information Notice 2001-08, Supp. 2, November 20, 2001. http://www.hsrd.ornl.gov/nrc/special/in200108s2.pdf D. Gage and J. McCormick. Why software quality matters. Baseline, March 2004, 33-56. http://www.baselinemag.com/print_article2/0,1217,a=120920,00.asp Nancy G. Leveson. Safeware: System Safety and Computers. Addison Wesley, 1995. Neil Storey. Safety-Critical Computer Systems. Prentice Hall, 1996. J.C. Knight and N.G. Leveson. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering 12(1), 1986, 96-109. 28