Safety-Critical Firmware
What can we learn from past failures?
Michael Barr & Dan Smith
Webinar: September 9, 2014

MICHAEL BARR, CTO
- BSEE/MSEE and Firmware Developer
- Consultant and Trainer (1999-present)
- Former Adjunct Professor
  - University of Maryland, Johns Hopkins University
- Former Editor-in-Chief; Columnist; Conference Chair
- Expert witness
  - Unintended acceleration injuries; smartphone and set-top patents
- Author of 3 books and 70+ articles/papers

Copyright Barr Group. All rights reserved.

BARR GROUP
The Embedded Systems Experts
Barr Group helps companies make their embedded systems safer and more secure.
barrgroup.com

UPCOMING PUBLIC BOOT CAMPS
- Embedded SOFTWARE Boot Camp
  - October 20-24 near Detroit, Michigan
- Embedded ANDROID Boot Camp
  - October 27-31 in Costa Mesa, California
- Embedded SECURITY Boot Camp
  - November 3-7 in Dallas, Texas
http://barrgroup.com/training-calendar

UPCOMING PUBLIC 1-DAY TRAINING
- Firmware Defect Prevention for Safety-Critical Devices
  - September 23rd near Detroit, Michigan
  - http://barrgroup.com/courses/1day/safety-critical
- Overview
  - Focus on cost-effective defect prevention best practices
  - For engineers and managers in safety-critical fields

DAN SMITH, PRINCIPAL ENGINEER
- BSEE from Princeton
- 20+ years of embedded systems design
  - Fields: control systems, telecom/datacom, medical devices, defense, transportation
  - Roles: engineer, instructor, speaker, consultant
  - Numerous RTOSes, processors, platforms
- Focus on secure, safe, fault-tolerant systems

OVERVIEW OF TODAY'S WEBINAR
- Goal
  - Examine past software failures in critical systems
  - Learn how to avoid repeating the past
- Key Takeaways
  - Failures are often traceable to preventable defects
  - Prevention takes a combination of education, process and vigilance
- Prerequisites
  - Knowledge of C (and perhaps a bit of C++)

CRITICAL SYSTEMS
- Defined: A (safety) critical system can cause injury or death when it malfunctions.
- Other disciplines, weighty concerns:
  - High Security Systems (access control, military)
  - High Availability Systems (grid, mobile/cellular, internet)
  - Mission critical systems (unmanned exploration)
- Commission / Omission

LOOKING THROUGH A KEYHOLE
- There is much more to developing safety-critical systems
  - Planning, staffing, training, budgeting
  - Product specifications, requirements, test plans
  - Hardware, mechanical, redundancy, fail-safes
  - Modeling / simulation, formal proofs, fuzzing
  - Testing, validation, verification
- Presentation covers only the implementation phase
  - Specifically, firmware development

ROLE OF FIRMWARE IN CRITICAL SYSTEMS
- Increasing role of firmware in:
  - Automobiles & transportation in general
  - Mobile electronics (phone, GPS, etc.)
  - Medical devices
  - ... we could go on & on & on
- More functionality being pushed into firmware
  - Operations formerly handled by hardware
  - Greater complexity, greater potential for problems

RIPPED FROM THE HEADLINES
Source:
http://en.ria.ru/world/20140828/192413515/galileo-satellites-incident-likely-result-of-software-errors.html
http://spectrum.ieee.org/tech-talk/aerospace/satellites/two-galileo-satellites-are-parked-in-the-wrong-spots

HINDSIGHT
- Of course the cause & fix are obvious!
  - Then why are the same mistakes repeated over & over?
  - Similar lessons from security (e.g. buffer overflow)
- The point isn't to criticize or taunt
  - Avoid repeating the same mistakes

THERAC-25
Images: http://hci.cs.siue.edu/nsf/files/semester/week13-2/ppt-text/slide13.html

THERAC: WHAT HAPPENED?
- Hardware interlocks were designed out
  - Previous generations had them; they were replaced with software
  - Early sign of software's increasing safety responsibility
- Race conditions & improper machine settings
  - High energy beam activated without spreader plate
  - A one-byte counter overflowed at just the wrong time
- Result: 100x radiation dosage
  - At least 6 patients harmed, 3 killed

THERAC: FINDINGS
- Atomic Energy of Canada Limited (AECL):
  - Immature and inadequate software development process ("untestable software")
  - Incomplete reliability modeling & failure mode analysis
  - No (independent) review of critical software
  - Improper software re-use from older models
  - Improper inter-task synchronization
- Also notable:
  - System implemented in assembly language
  - System used its own in-house operating system

ARIANE 5 / FLIGHT 501
- Successor to the smaller Ariane 4 rocket
  - Designed to carry larger, heavier payloads
  - Today: standard launch vehicle for ESA
- June 1996: Maiden flight
- Payload: Cluster
  - Four 1200-kg spacecraft
  - Mission: study Earth's magnetosphere
Image Source: https://en.wikipedia.org/wiki/file:ariane_501_cluster.svg

FLIGHT 501 FAILURE
- 37 seconds into launch
  - Both inertial navigation systems malfunction & crash
  - Thrusters steered into extreme & incorrect orientations
  - Vehicle departed from intended flight path
- Flight termination system
  - Mechanical stresses triggered deliberate self-destruction
  - Fortunately that worked as intended!
- Cost: approximately $370M

FLIGHT 501 - CAUSE
- Inertial navigation system (SRI) re-used from Ariane 4
- Flight 501 had much greater horizontal velocity
  - Conversion: 64-bit floating point to 16-bit integer
  - Variable holding horizontal velocity overflowed
  - Overflow checks omitted for efficiency
- Implementation language was Ada
  - Typically regarded as a safer implementation language
- Software where error occurred:
  - Not needed after launch!
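
The conversion described on the cause slide can be guarded in C with an explicit range check. The following is a minimal sketch, not the Ariane code (which was written in Ada); the helper name `double_to_int16_checked` is hypothetical. It relies on the fact that converting an out-of-range floating-point value to an integer type is undefined behavior in C, so the range test must happen before the cast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a 64-bit float to a 16-bit signed integer.
 * Returns false (and leaves *out untouched) instead of overflowing.
 * The comparison is written so that NaN also fails the test. */
bool double_to_int16_checked(double value, int16_t *out)
{
    if (out == NULL) {
        return false;
    }
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;               /* out of range or NaN */
    }
    *out = (int16_t)value;          /* in range: truncates toward zero */
    return true;
}

/* Usage sketch: the caller must decide what a failed conversion means. */
void conversion_demo(void)
{
    int16_t v = 0;
    assert(double_to_int16_checked(1234.9, &v) && v == 1234);
    assert(!double_to_int16_checked(40000.0, &v));   /* would overflow */
}
```

The check only detects the problem. What the system does next (saturate, use a fallback value, enter a safe state) is a failure-mode decision that belongs in the design.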

MISRA C:2012
- Directive 4.1: Run-time failures shall be minimized.
- C's run-time environment is very light-weight
  - Unchecked array access, divide by 0, dynamic allocation
- Implication:
  - Burden is on you, the programmer
- Tactic: Extensive (dynamic) run-time checking

ASSERTIONS
- Software assertions
  - Used to confirm programmer's assumptions at runtime
  - Also a form of documentation
- C language: assert() (header file <assert.h>)
  - Expression is expected to evaluate to TRUE

    bool isinrange(int lower_bound, int upper_bound, int value)
    {
      assert(lower_bound <= upper_bound);
      /* ... */
    }

- What if expression evaluates to FALSE?
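
As an illustration of the run-time checking tactic, the sketch below guards two of the hazards named on the MISRA slide: unchecked array access and divide by zero. All names (`SAMPLE_COUNT`, `sample_get`, `div_checked`) are hypothetical and chosen for this example only. Each function pairs an assert() for development builds with an explicit check that still runs when assertions are compiled out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 8u                 /* hypothetical buffer size */

static int32_t samples[SAMPLE_COUNT];   /* zero-initialized */

/* Bounds-checked read: refuses an out-of-range index instead of
 * reading past the end of the array. */
bool sample_get(size_t index, int32_t *out)
{
    assert(out != NULL);                /* programmer error if violated */
    if (out == NULL || index >= SAMPLE_COUNT) {
        return false;
    }
    *out = samples[index];
    return true;
}

/* Checked division: rejects divide by zero and the INT32_MIN / -1
 * case, whose result does not fit in int32_t. */
bool div_checked(int32_t num, int32_t den, int32_t *out)
{
    assert(out != NULL);
    if (out == NULL || den == 0 || (num == INT32_MIN && den == -1)) {
        return false;
    }
    *out = num / den;
    return true;
}
```

Returning a status forces the caller to handle the failure path, which is where the consideration of failure modes comes in.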

REMOVING ASSERTIONS
- Cost of assertions
  - Run time (CPU), code size
- Removing / disabling assertions
  - Typically by defining NDEBUG at compile time (assertions turn into whitespace)
  - Often done just before shipping / production
- Ship what you tested
- Parachutes and pennies
  - And seatbelts

FLIGHT 501 - LESSONS
- Re-use of software isn't always an automatic win
- Mixing data types (e.g. int & float) can be problematic
  - It's not just C that suffers from such problems
- Don't execute unnecessary software
- Disable assertions (sanity checks) at your own risk
- Consideration of failure modes is important too
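
One way to "ship what you tested" is to use a check that does not depend on NDEBUG. The sketch below is an assumption about how that might look, not a Barr Group or standard-library facility: `ALWAYS_ASSERT`, `fault_handler` and `g_fault_count` are hypothetical names. Unlike assert(), the macro stays in the binary in every build and routes failures to a handler the project controls.

```c
#include <assert.h>   /* standard assert(): expands to nothing under NDEBUG */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fault bookkeeping; a real system might log the location,
 * enter a safe state, or force a reset here. */
static volatile uint32_t g_fault_count = 0u;

static void fault_handler(const char *file, int line)
{
    (void)file;
    (void)line;
    g_fault_count++;
}

/* Hypothetical always-on assertion: unaffected by NDEBUG. */
#define ALWAYS_ASSERT(expr) \
    ((expr) ? (void)0 : fault_handler(__FILE__, __LINE__))

bool is_in_range(int lower_bound, int upper_bound, int value)
{
    ALWAYS_ASSERT(lower_bound <= upper_bound);
    return (value >= lower_bound) && (value <= upper_bound);
}
```

The cost named on the slide (CPU time and code size) is still paid, so the trade-off has to be made deliberately rather than by flipping a build flag just before release.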

WHAT ABOUT TESTING?
- Testing is necessary and important
  - But not sufficient
- Testing does not prove the absence of bugs
  - Some bugs escape to the field
  - And are often very difficult to reproduce
- Tests are software, too
  - Bugs aren't limited to production code
- Tests are just one part of an overall quality strategy

PATRIOT MISSILE SYSTEM

PATRIOT MISSILE FAILURE
- February 25, 1991
  - 28 U.S. soldiers dead; 100+ wounded
  - Single deadliest incident for U.S.

THE PATRIOT SOFTWARE BUG
- Two versions of system time
  - Clock 1: integer ticks (one tick = 0.1s)
  - Clock 2: fixed-point representation
    3.25s: 000000000000000000000011.010000000000000000000000
- Problem: no exact representation of 0.1 decimal (base 10) in binary ("non-terminating")
  - Conversion from integer ticks to floating point values results in rounding (about 1 part in a million)
- After 100 hours (360,000 seconds), this is ~0.34 seconds
  - But what does that translate to in terms of distance?
GAO Report: https://www.fas.org/spp/starwars/gao/im92026.htm
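
The ~0.34 second figure can be reproduced with a few lines of arithmetic. The sketch below assumes that the constant 0.1 was stored with 23 fractional bits and truncated (not rounded); that bit width is an assumption taken from commonly cited analyses of the incident, chosen because it matches the figure above, and is not stated on the slide. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FRACTION_BITS 23u   /* assumption: fractional bits kept for 0.1 */

/* Error introduced each time one 0.1 s tick is converted, in seconds. */
double tick_error_seconds(void)
{
    const double scale = (double)(1UL << FRACTION_BITS);
    /* Truncate 0.1 to FRACTION_BITS binary digits after the point. */
    const double stored = (double)(uint32_t)(0.1 * scale) / scale;
    return 0.1 - stored;            /* roughly 9.5e-8 s per tick */
}

/* Accumulated clock error after a given uptime. */
double accumulated_error_seconds(double uptime_hours)
{
    const double ticks = uptime_hours * 3600.0 * 10.0;   /* 10 ticks/s */
    return ticks * tick_error_seconds();
}

/* Usage sketch: 100 hours of uptime yields about a third of a second. */
void patriot_error_demo(void)
{
    const double err = accumulated_error_seconds(100.0);
    assert(err > 0.3432 && err < 0.3434);
}
```

The per-tick error is tiny; the failure came from multiplying it by 3.6 million ticks and then by the closing speed of the target.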

PERILS OF FLOATING POINT, 1

    #include <stdio.h>

    void test1(void) {
      float f = 0.1f; printf("%0.6f\n", f);
      f += 0.1f;      printf("%0.6f\n", f);
    }

    void test2(void) {
      float f = 0.1f; printf("%0.9f\n", f);
      f += 0.1f;      printf("%0.9f\n", f);
    }

    int main(void) {
      test1();
      test2();
      return 0;
    }

Output:

    0.100000
    0.200000
    0.100000001
    0.200000003   ?!?!?

PERILS OF FLOATING POINT, 2

    #include <stdio.h>

    void test3(void) {
      float f1 = 0.1f, f2 = 0.3f;
      f1 += 0.3f; f1 += 0.7f;
      for (int i = 0; i < 8; ++i) {
        f2 += 0.1f;
      }
      printf("%0.9f\n%0.9f\n", f1, f2);
    }

    int main(void) {
      test3();
      return 0;
    }

Output:

    1.100000024
    1.100000143

Larger error accumulation due to rounding on each iteration of the loop

ACCUMULATED ERROR

    Uptime (h)   Error (s)   Shift (m)
    1            0.0034      7
    8            0.0275      55
    20           0.0687      137
    100          0.3433      687

GAO Report: https://www.fas.org/spp/starwars/gao/im92026.htm

PATRIOT MISSILE FAILURE: LESSONS
- Testing will not catch all problems
- Mixing floating point and fixed point/integer operations can be tricky
  - In fact using floating point alone can be tricky
- Tracking time (or any precise quantity)
  - Be consistent
  - Understand precision, rounding and conversion
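
One way to act on the "be consistent" lesson is to keep time as an integer tick count and convert only at the edges, in a unit that the tick period divides exactly. The sketch below is illustrative only; `TICK_MS`, `ticks_to_ms` and `accumulate_float_seconds` are hypothetical names, and the float version is included purely to contrast with the integer one.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_MS 100u   /* one tick = 0.1 s = 100 ms, exactly representable */

/* Exact conversion: integer ticks to integer milliseconds.
 * The 64-bit result avoids wrap-around on long uptimes. */
uint64_t ticks_to_ms(uint32_t ticks)
{
    return (uint64_t)ticks * TICK_MS;
}

/* Contrast case: summing 0.1f once per tick rounds on every addition,
 * so the total drifts away from the true elapsed time. */
float accumulate_float_seconds(uint32_t ticks)
{
    float t = 0.0f;
    for (uint32_t i = 0u; i < ticks; ++i) {
        t += 0.1f;
    }
    return t;
}

/* Usage sketch: one hour of ticks is exactly 3,600,000 ms. */
void tick_time_demo(void)
{
    assert(ticks_to_ms(36000u) == 3600000u);
}
```

Comparing `accumulate_float_seconds(36000u)` against 3600.0f on a target of interest is a quick way to see how much drift a given floating-point format introduces over an hour.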

MARS CLIMATE ORBITER
Source: https://upload.wikimedia.org/wikipedia/commons/1/19/mars_climate_orbiter_2.jpg

UNITS ARE IMPORTANT
- Ultimately, computers calculate things
  - Most calculations involve dimensions & units
  - Pressure (kPa), velocity (m/s), flow (l/m), etc.
- Common unit mistakes in calculations
  - Same fundamental dimension, different system
    e.g. SetVelocityMetersPerSec(MPH_55);
  - Disagreement in fundamental dimensions
    e.g. SetAcceleration((pos2 - pos1) / time);

DIMENSIONAL ANALYSIS
- In C, no unit information in standard types
  e.g. int speed = 1234;
  - Is that 123.4 meters per second?
  - Is that 123.4 miles per hour?
  e.g. float calcpress(float force, float area);
  - What are the units for force & area?
- Can we use the language's type system to help?
  - Yes, and static analysis, too (e.g. Flexelint 9)

USING FLEXELINT 9 TO EXPOSE DIMENSION / UNIT PROBLEMS

     1  // Dimensional analysis demonstration.
     2  // Report whenever a variable (such as v) typed as a Velocity
     3  // is assigned anything other than a Velocity or a Met/Sec.
     4
     5  //lint -strong( AcJcX, Met, Sec, Velocity = Met/Sec )
     6  typedef double Met, Sec, Velocity;
     7
     8  Velocity speed( Met d, Sec t ) {
     9    Velocity v;
    10    v = d / t;              // ok
    11    v = 1 / t;              // nope!
    12    v = (3.5/t) * d;        // ok

Flagged statement:

    v = 1 / t;              // warning

Lint output:

    dimensional2.c  11  warning 632: assignment to strong type 'Met/Sec' in context: assignment
    dimensional2.c  11  warning 633: assignment from a strong type '1/Sec' in context: assignment

C: DON'T USE NAKED NUMBERS
- Consider an object-oriented approach
  - Create different types (classes) for different units

    typedef uint32_t SPEED1;
    typedef uint16_t SPEED2;

    typedef struct foo_tag {
      SPEED1 SpeedInCmPerSec;
    } SPEED_CM_S;

    typedef struct foo_tag2 {
      SPEED2 SpeedInMilesPerHour;
    } SPEED_M_H;

    // Below routines would have built-in bounds checking, etc.
    void Ctor1SpeedCmPerSec(SPEED_CM_S *obj, SPEED1 InitSpeedCmPerSec);
    void Ctor2SpeedCmPerSec(SPEED_CM_S *obj, SPEED_M_H const *SpeedIn);
    void AdjustSpeedCmPerSec(SPEED_CM_S *current, SPEED_CM_S const *adjustment);

EVEN BETTER: USE C++
- C++ is a perfect fit for this problem
  - Stronger type system than C
  - Templates & metaprogramming enforce dimensional correctness at compile time
- More information:
  - Paper: Scott Meyers, "Dimensional Analysis in C++" [1]
  - See Boost::Units [2]
[1] http://se.ethz.ch/~meyer/publications/others/scott_meyers/dimensions.pdf
[2] http://www.boost.org/doc/libs/1_56_0/doc/html/boost_units/dimensional_analysis.html
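
The struct-per-unit idea from the C slide can be made concrete as follows. This is a minimal sketch under stated assumptions, not the presenters' implementation: the type and function names, and the `SPEED_MAX_CM_S` limit, are hypothetical. Because each unit is a distinct struct type, passing miles per hour where centimeters per second are expected is a compile-time error, and the only way across is an explicit conversion routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t cm_per_sec; }     speed_cm_s_t;
typedef struct { uint16_t miles_per_hour; } speed_mph_t;

#define SPEED_MAX_CM_S 10000u   /* hypothetical plausibility limit */

/* "Constructor" with built-in bounds checking. */
bool speed_cm_s_init(speed_cm_s_t *obj, uint32_t cm_per_sec)
{
    if (obj == NULL || cm_per_sec > SPEED_MAX_CM_S) {
        return false;
    }
    obj->cm_per_sec = cm_per_sec;
    return true;
}

/* Explicit unit conversion: 1 mph is exactly 44.704 cm/s.
 * Integer math with rounding; the product cannot overflow uint32_t
 * because the input is at most 65535. */
bool speed_cm_s_from_mph(speed_cm_s_t *obj, const speed_mph_t *in)
{
    if (in == NULL) {
        return false;
    }
    const uint32_t cm_s =
        ((uint32_t)in->miles_per_hour * 44704u + 500u) / 1000u;
    return speed_cm_s_init(obj, cm_s);
}

/* Usage sketch: 55 mph converts to 2459 cm/s (2458.72 rounded). */
void speed_demo(void)
{
    const speed_mph_t limit = { 55u };
    speed_cm_s_t s = { 0u };
    assert(speed_cm_s_from_mph(&s, &limit) && s.cm_per_sec == 2459u);
}
```

The wrapper structs cost nothing at run time on typical compilers, but they do require discipline: code that reaches into the member directly bypasses the checks.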

FILTERING OUT THE DEFECTS
- Coding Standard (e.g. MISRA)
- Static Analysis
- Formal Code Inspection
- Testing (multiple levels)

KEY TAKEAWAYS
- No such thing as bug-free software
- Testing is not sufficient
- Defense in depth, just like security
  - Coding Standard / safe subset (e.g. MISRA standard)
  - Process (static analysis, code inspections)
  - Knowledge is a pro-active approach
- Always better to prevent than to find & fix

FURTHER READING
- "Haven't found that glitch" (Dr. David Cummings)
  http://articles.latimes.com/2010/mar/11/opinion/la-oewcummings12-2010mar12
- "Mars Code" (Gerard J. Holzmann)
  Text: http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext
  Video: http://vimeo.com/84991949
- Better Embedded System SW (Phil Koopman)
  http://betterembsw.blogspot.com/

QUESTION & ANSWER

ADDITIONAL RESOURCES
- Paper: Top 10 Bug-Killing Coding Standard Rules
  barrgroup.com/embedded-systems/how-to/bug-killing-Standards-for-Embedded-C
- Michael Barr's Blog: Barr Code
  http://embeddedgurus.com/barr-code/
- Training: Barr Group's Upcoming Public Courses
  barrgroup.com/training-calendar

CONCLUSION