Numbers. are abstract objects used for counting and measuring. There are several number types: Natural numbers (0, 1, 2, 3, 4, 5,...

Similar documents
ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

Binary Adders: Half Adders and Full Adders

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

Computer Science 281 Binary and Hexadecimal Review

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

Oct: 50 8 = 6 (r = 2) 6 8 = 0 (r = 6) Writing the remainders in reverse order we get: (50) 10 = (62) 8

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

CHAPTER 5 Round-off errors

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

The string of digits in the binary number system represents the quantity

1. Give the 16 bit signed (twos complement) representation of the following decimal numbers, and convert to hexadecimal:

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Lecture 2. Binary and Hexadecimal Numbers

Let s put together a Manual Processor

Useful Number Systems

HOMEWORK # 2 SOLUTIO

CS321. Introduction to Numerical Methods

2010/9/19. Binary number system. Binary numbers. Outline. Binary to decimal

NUMBER SYSTEMS. William Stallings

Today. Binary addition Representing negative numbers. Andrew H. Fagg: Embedded Real- Time Systems: Binary Arithmetic

(Refer Slide Time: 00:01:16 min)

Binary Representation. Number Systems. Base 10, Base 2, Base 16. Positional Notation. Conversion of Any Base to Decimal.

EE 261 Introduction to Logic Circuits. Module #2 Number Systems

Measures of Error: for exact x and approximation x Absolute error e = x x. Relative error r = (x x )/x.

CSI 333 Lecture 1 Number Systems

Binary Numbering Systems

Number Representation

Levent EREN A-306 Office Phone: INTRODUCTION TO DIGITAL LOGIC

Lecture 8: Binary Multiplication & Division

Understanding Logic Design

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Digital Logic Design. Basics Combinational Circuits Sequential Circuits. Pu-Jen Cheng

Integer Operations. Overview. Grade 7 Mathematics, Quarter 1, Unit 1.1. Number of Instructional Days: 15 (1 day = 45 minutes) Essential Questions

Review of Scientific Notation and Significant Figures

Negative Exponents and Scientific Notation

Paramedic Program Pre-Admission Mathematics Test Study Guide

LSN 2 Number Systems. ECT 224 Digital Computer Fundamentals. Department of Engineering Technology

1. Convert the following base 10 numbers into 8-bit 2 s complement notation 0, -1, -12

Numbering Systems. InThisAppendix...

Software Safety Basics

COMBINATIONAL CIRCUITS

Digital System Design Prof. D Roychoudhry Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

5 Combinatorial Components. 5.0 Full adder. Full subtractor

Fractions to decimals

Systems I: Computer Organization and Architecture

Part 1 Expressions, Equations, and Inequalities: Simplifying and Solving

Multiplying and Dividing Signed Numbers. Finding the Product of Two Signed Numbers. (a) (3)( 4) ( 4) ( 4) ( 4) 12 (b) (4)( 5) ( 5) ( 5) ( 5) ( 5) 20

Answer Key for California State Standards: Algebra I

DNA Data and Program Representation. Alexandre David

Lecture 12: More on Registers, Multiplexers, Decoders, Comparators and Wot- Nots

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Numeral Systems. The number twenty-five can be represented in many ways: Decimal system (base 10): 25 Roman numerals:

How do you compare numbers? On a number line, larger numbers are to the right and smaller numbers are to the left.

1. True or False? A voltage level in the range 0 to 2 volts is interpreted as a binary 1.

Solution for Homework 2

Activity 1: Using base ten blocks to model operations on decimals

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Numerical Matrix Analysis

Basic numerical skills: FRACTIONS, DECIMALS, PROPORTIONS, RATIOS AND PERCENTAGES

Exponents, Radicals, and Scientific Notation

Gates, Circuits, and Boolean Algebra

Correctly Rounded Floating-point Binary-to-Decimal and Decimal-to-Binary Conversion Routines in Standard ML. By Prashanth Tilleti

Sistemas Digitais I LESI - 2º ano

Binary Numbers. Binary Octal Hexadecimal

CHAPTER 3 Boolean Algebra and Digital Logic

3.1. RATIONAL EXPRESSIONS

CDA 3200 Digital Systems. Instructor: Dr. Janusz Zalewski Developed by: Dr. Dahai Guo Spring 2012

Session 29 Scientific Notation and Laws of Exponents. If you have ever taken a Chemistry class, you may have encountered the following numbers:

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

5.1 Radical Notation and Rational Exponents

PREPARATION FOR MATH TESTING at CityLab Academy

Base Conversion written by Cathy Saxton

EVALUATING ACADEMIC READINESS FOR APPRENTICESHIP TRAINING Revised For ACCESS TO APPRENTICESHIP

Radicals - Multiply and Divide Radicals

CS201: Architecture and Assembly Language

Number Systems and Radix Conversion

NUMBER SYSTEMS. 1.1 Introduction

MATH 60 NOTEBOOK CERTIFICATIONS

PATRIOT MISSILE DEFENSE Software Problem Led to System Failure at Dhahran, Saudi Arabia

Figure 1. A typical Laboratory Thermometer graduated in C.

Recall the process used for adding decimal numbers. 1. Place the numbers to be added in vertical format, aligning the decimal points.

Lab 1: Full Adder 0.0

CS101 Lecture 11: Number Systems and Binary Numbers. Aaron Stevens 14 February 2011

BINARY CODED DECIMAL: B.C.D.

Attention: This material is copyright Chris Hecker. All rights reserved.

Preliminary Mathematics

Zuse's Z3 Square Root Algorithm Talk given at Fall meeting of the Ohio Section of the MAA October College of Wooster

This 3-digit ASCII string could also be calculated as n = (Data[2]-0x30) +10*((Data[1]-0x30)+10*(Data[0]-0x30));

Logic Reference Guide

=

Chapter 1: Digital Systems and Binary Numbers

Dr Brian Beaudrie pg. 1

2.2 Scientific Notation: Writing Large and Small Numbers

26 Integers: Multiplication, Division, and Order

Solving Quadratic Equations

Rational Exponents. Squaring both sides of the equation yields. and to be consistent, we must have

Session 7 Fractions and Decimals

MATH-0910 Review Concepts (Haugen)

JobTestPrep's Numeracy Review Decimals & Percentages

Transcription:

Numbers are abstract objects used for counting and measuring. There are several number types: Natural numbers (0, 1, 2, 3, 4, 5,... ad infinitum) Integer numbers (..., 3, 2, 1, 0, 1, 2, 3,... ) Rational numbers ( 7 3, 4 9, etc.; formally, they are of the form m, where m and n are integers and n 0) n Real numbers (like π = 3.14159265358..., 2 = 1.41421356237..., etc.) see https://en.wikipedia.org/wiki/irrational_number... How can we represent all these numbers in computers? Computer words are binary (no problem: binary can encode everything) Computer words are finite, usually of 32 or 64 bits (can lead actually has already led to disaster) Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 1

Numeral systems A numeral is a symbol or group of symbols that represents a number Unary: numeral means 7 Decimal: numeral 456 means (4 10 2 ) + (5 10 1 ) + (6 10 0 ) numeral 101 10 means (1 10 2 ) + (0 10 1 ) + (1 10 0 ) Binary: numeral 101 2 means (1 2 2 ) + (0 2 1 ) + (1 2 0 ), d-ary, for any d > 1: a n 1 a n 2... a 1 a 0 means i.e., decimal 5 a n 1 d n 1 + a n 2 d n 2 + + a 1 d 1 + a 0 d 0 From now on we assume that computer words are 32 bits long. This means that we can represent the natural numbers from 0 to 2 32 1 : 0000 0000 0000 0000 0000 0000 0000 0000 2 = 0 10 0000 0000 0000 0000 0000 0000 0000 0001 2 = 1 10 0000 0000 0000 0000 0000 0000 0000 0010 2 = 2 10...... 1111 1111 1111 1111 1111 1111 1111 1110 2 = 2 32 2 = 4, 294, 967, 294 10 1111 1111 1111 1111 1111 1111 1111 1111 2 = 2 32 1 = 4, 294, 967, 295 10 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 2

Converting decimal numbers to binaries: divide by 2 Rule: divide repeatedly by 2, keeping track of the remainders as you go Example: covert 123 10 to binary 123/2 = 61 remainder = 1 61/2 = 30 remainder = 1 30/2 = 15 remainder = 0 15/2 = 7 remainder = 1 7/2 = 3 remainder = 1 3/2 = 1 remainder = 1 1/2 = 0 remainder = 1 The result is read from the last remainder upwards: 123 10 = 1111011 2 why does it work? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 3

A few binary numbers Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 4

Binary addition and subtraction (for unsigned numbers) The four basic rules for adding binary digits are as follows: 0 + 0 = 0 sum of 0 with a carry of 0 0 + 1 = 1 sum of 1 with a carry of 0 1 + 0 = 1 sum of 1 with a carry of 0 1 + 1 = 10 sum of 0 with a carry of 1 Example: Carry Carry Carry 1 1 1 0 1 1 + 1 0 1 1 0 0 0 Formulate four rules for subtracting bits. What about multiplication? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 5

Negative numbers: sign-magnitude representation How to represent negative integer numbers? Most obvious idea: treat the most significant (left-most) bit in the word as a sign: if it is 0, the number is positive; if the left-most bit is 1, the number is negative; the right-most 31 bits contain the magnitude of the integer Example: +18 10 = 0000 0000 0000 0000 0000 0000 0001 0010 2 18 10 = 1000 0000 0000 0000 0000 0000 0001 0010 2 This representation was used in early machines, but several shortcomings have been revealed Awkward arithmetic Two zeros: +0 and 0... Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 6

Sign-magnitude representation 1111 0000 0001 1110 0010 0 1 1101-6 -7 2 0011-5 3 1100-4 4 0100-3 5 1011-2 6 0101-1 -0 7 1010 0110. 1001 0111 1000 110... 0 111... 1-2 n-1-2 n-2 100... 0 000... 0 0-0 2 n-2 2 n-1-1 011... 1. 010... 0 4-bit numbers n-bit numbers Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 7

Two s complement representation The most significant bit represents the sign, as before The positive numbers are also represented as before 0a 30... a 0 means 0 2 31 + a 30 2 30 + + a 1 2 + a 0 To represent a negative number, take its complement to 2 31 : 1a 30... a 0 means 1 2 31 + a 30 2 30 + + a 1 2 + a 0 0000 0000 0000 0000 0000 0000 0000 0000 2 = 0 10 0000 0000 0000 0000 0000 0000 0000 0001 2 = 1 10 0000 0000 0000 0000 0000 0000 0000 0010 2... = 2 10... 0111 1111 1111 1111 1111 1111 1111 1101 2 = 2, 147, 483, 645 10 0111 1111 1111 1111 1111 1111 1111 1110 2 = 2, 147, 483, 646 10 0111 1111 1111 1111 1111 1111 1111 1111 2 = 2, 147, 483, 647 10 1000 0000 0000 0000 0000 0000 0000 0000 2 = 2, 147, 483, 648 10 1000 0000 0000 0000 0000 0000 0000 0001 2 = 2, 147, 483, 647 10 1000 0000 0000 0000 0000 0000 0000 0010 2... = 2, 147, 483, 646 10... 1111 1111 1111 1111 1111 1111 1111 1101 2 = 3 10 1111 1111 1111 1111 1111 1111 1111 1110 2 = 2 10 1111 1111 1111 1111 1111 1111 1111 1111 2 = 1 10 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 8

Two s complement representation (cont.) 1111 0000 0001 1110 0010 0 1 1101-2 -1 2 0011-3 3 1100-4 4 0100-5 5 1011-6 6 0101-7 -8 7 1010 0110. 1001 0111 1000 110... 0 111... 1-1 -2 n-2 100... 0 000... 0 0 2 n-2-2 n-1 2n-1-1 011... 1. 010... 0 4-bit numbers n-bit numbers Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 9

Two s complement arithmetic: negation a 31 a 30... a 1 a 0 represents a 31 2 31 + a 30 2 30 + + a 1 2 + a 0 The rules for forming the negation of an integer (in two s complement notation): Take the Boolean negation of each bit of the integer (including the sign bit) That is, set each 1 to 0 and each 0 to 1. Treating the result as an unsigned binary integer, add 1. Example: Negate 2 10 = 0000 0000 0000 0000 0000 0000 0000 0010 2 Invert the bits and add 1: 1111 1111 1111 1111 1111 1111 1111 1101 2 + 1 2 1111 1111 1111 1111 1111 1111 1111 1110 2 (= 2 10 ) Negate 2 10 and check whether you get 2 10 Negate 0. Any problem? Overflow; but is the result correct? Negate 1000 0000 0000 0000 0000 0000 0000 0000 2 = 2, 147, 483, 648 10 Any problem? Is the result correct? Why? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 10

Two s complement addition and subtraction Subtraction is achieved using addition x y = x + ( y): to subtract y from x, negate y and add the result to x Addition is the same as the addition of unsigned numbers on page 5 The result of addition can be larger than can be held in a 32 bit word. This situation is called overflow. Detecting overflow: if two numbers are added, and they are both positive or both negative, then overflow occurs if the result the opposite sign Examples of overflow: assume that we deal with 4 bit words only 0101 2 = 5 10 + 0100 2 = 4 10 1001 2 = 7 10 + 1010 2 = 6 10 1001 2 = 7 10 overflow 1 0011 2 = 3 10 overflow When adding numbers with different signs, overflow cannot occur. Why? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 11

Arithmetic and logic unit (ALU) The ALU is the part of computer that performs arithmetic and logical operations. It is the workhorse of the CPU because it carries out all the calculations. The hardware required to build an ALU is the basic gates: AND, OR and NOT We use these gates to construct a 32-bit adder for integers Such an adder can be created by connecting 32 1-bit adders Inputs and outputs of a single-bit adder: two inputs for the operands (the bits we want to add), say, A and B; a single-bit output for the sum; a second output to pass on the carry CarryOut a third input is CarryIn the CarryOut from the previous adder A B CarryIn + CarryOut Sum Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 12

Designing a single-bit adder We begin by giving precise input and output specifications (truth-table) A B CarryIn CarryOut Sum Comments 0 0 0 0 0 0 + 0 + 0 = 00 2 0 0 1 0 1 0 + 0 + 1 = 01 2 0 1 0 0 1 0 + 1 + 0 = 01 2 0 1 1 1 0 0 + 1 + 1 = 10 2 1 0 0 0 1 1 + 0 + 0 = 01 2 1 0 1 1 0 1 + 0 + 1 = 10 2 1 1 0 1 0 1 + 1 + 0 = 10 2 1 1 1 1 1 1 + 1 + 1 = 11 2 Then we turn this truth-table into logical equations: CarryOut = ( A B CarryIn) (A B CarryIn) can be simplified to (A B CarryIn) (A B CarryIn) (remember the majority function?) CarryOut = (B CarryIn) (A CarryIn) (A B) Sum = (A B CarryIn) ( A B CarryIn) and ( A B CarryIn) (A B CarryIn) Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 13

Boolean circuits for the 1-bit adder A B CarryIn CarryIn A B CarryOut Sum Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 14

Boolean circuit for the 1-bit adder (cont.) Check that the following circuit realises the same functions CarryOut and Sum A B CarryIn Sum CarryOut Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 15

32-bit adder Adding A = a 31 a 30... a 2 a 1 a 0 and B = b 31 b 30... b 2 b 1 b 0 a 31 b 31 a 2 b 2 a 1 b 1 a 0 b 0 CarryIn A B CarryIn A B CarryIn A B CarryIn A B CarryIn Full adder Full adder Full adder Full adder CarryOut CarryOut Sum... s 31 CarryOut Sum s 2 CarryOut Sum s 1 CarryOut Sum s 0 CarryIn in the least significant adder is supposed to be 0 What happens if we set this CarryIn to 1 instead of 0? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 16

32-bit adder and subtractor Control bit C: C = 0: Add A + B C = 1: Subtract A B (because b i 1 = b i ) b 31 b 2 b 1 b 0 C a 31 a 2 a 1 a 0 A B CarryIn A B CarryIn A B CarryIn A B CarryIn Full adder Full adder Full adder Full adder CarryOut CarryOut Sum... s 31 CarryOut Sum s 2 CarryOut Sum s 1 CarryOut Sum s 0 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 17

Scientific notation The two s complement representation allowed us to deal with the integer numbers in the interval between 2 31 and +2 31 1 What if we need larger numbers (say, in astronomy)? How to deal with small fractions (say, in nuclear physics)? Scientific notation: 976, 000, 000, 000, 000 = 9.76 10 14 0.0000000000000976 = 9.76 10 14 We dynamically slide the decimal point to a convenient location and use the exponent of 10 to keep track of that decimal point Any given number can be written in the form a 10 b in many ways; e.g., 350 = 3.5 10 2 = 35 10 1 = 350 10 0 In normalised scientific notation, exponent b is chosen such that 1 a < 10 Using binary numeral system instead of decimal, we can represent any real non-zero number in the form a 2 E with 1 a < 2, or ( 1) S (1 + F ) 2 E, 0 F < 1 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 18

Binary fractions 16.625 10 = 1 10 1 + 6 10 0 + 6 10 1 + 2 10 2 + 5 10 3 = 10000.101 2 = 1 2 4 + 0 2 3 + 0 2 2 + 0 2 1 + 0 2 0 + 1 2 1 + 0 2 2 + 1 2 3 To convert a decimal number with both integer and fractional parts, convert each part separately and combine the answers (integer conversion is on page 3) Example: convert 0.6875 10 to binary (multiply by 2) 0.6875 2 = 1.3750 integer = 1 0.3750 2 = 0.7500 integer = 0 0.7500 2 = 1.5000 integer = 1 0.5000 2 = 1.0000 integer = 1 The result is read from the first integer downwards: 0.6875 10 = 0.1011 2 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 19

IEEE 754 floating-point standard Every real number different from 0 can be represented in the form ( 1) S (1 + F ) 2 E, 0 F < 1 S is the sign (0 for positive and 1 for negative) F is the fraction (0 F < 1) E is the exponent (determining the actual location of the binary point) The size of E and F may vary: a fixed word size means that one must take a bit from one to add a bit to the other. There existing compromises are: Single precision floating-point format: 8 bits for E and 23 bits for F S } Exponent {{ } F (fraction) 8 bits }{{} 1 bit } {{ } 23 bits Double precision floating-point format: 11 bits for E and 52 bits for F S } Exponent {{ } F (fraction) 11 bits }{{} 1 bit } {{ } 52 bits Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 20

IEEE 754 floating-point standard (cont.) If we number the bits of the fraction F from left to right b 1, b 2,..., b n (where n = 23 or n = 52), then the IEEE 754 standard gives us the numbers ( 1) S (1 + b 1 2 1 + b 2 2 2 + b 3 2 3 + + b n 2 n ) 2 E How to represent E? We need both positive and negative exponents. Biased notation: fix Bias = 127 10 = 2 7 1 = 0111 1111 2 for single precision Bias = 1023 10 = 2 10 1 = 011 1111 1111 2 for double precision E = Exponent Bias where Exponent is the number stored in the exponent field Examples: An exponent of 1 is represented by the bit pattern of the value 1 + 127 10 = 0111 1110 2 (single precision) An exponent of +1 is represented by the bit pattern of the value 1 + 127 10 = 1000 0000 2 (single precision) Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 21

IEEE 754 floating-point standard (cont.) The range of single precision numbers is then from as small as ±1.0000 0000 0000 0000 0000 000 2 2 126 to as large as ±1.1111 1111 1111 1111 1111 111 2 2 127 expressible negatives expressible positives overflow {}}{{}}{{ underflow }}{{ underflow }}{{}}{{ overflow }}{ (1 2 24 ) 2 128 0.5 2 127 0 0.5 2 127 (1 2 24 ) 2 128 Number line 0 is represented as both 0000... 0000 (+0) and 1000... 0000 ( 0) not all numbers in the intervals between (1 2 24 ) 2 128 and 0.5 2 127 and between 0.5 2 127 and (1 2 24 ) 2 128 are represented (why?) Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 22

Example: floating point representation Show the IEEE 754 binary representation of the number 0.75 10 in single and double precision 0.75 10 = (3/4) 10 = (3/2 2 ) 10 = 11 2 /2 2 10 = 0.11 2 = The general representation for a single precision number is ( 1) S (1 + Fraction) 2 Exponent 127 = 0.11 2 2 0 = 1.1 2 2 1 So, in our case Fraction = 0.1, Exponent = 1 + 127 = 126 The single precision binary representation of 0.75 10 is therefore 1 } 01111110 {{}} 10000000000000000000000 {{} 8 bits 23 bits }{{} 1 bit What is the double precision representation? Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 23

Example: converting binary to decimal floating point What decimal number is represented? 1 } 10000001 {{}} 01000000000000000000000 {{} 8 bits 23 bits }{{} 1 bit The sign bit is 1, the exponent field contains 10000001 2 = 129 10, and the fraction field contains 0.01 2 = 1 2 2 = 1 4 = 0.25 ( 1) S (1 + Fraction) 2 Exponent Bias = ( 1) 1 (1 + 0.25) 2 129 127 = = 1 1.25 2 2 = 1.25 4 = 5 Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 24

Accurate arithmetic All integers in the interval between 2 31 and 2 31 1 are represented exactly Floating-point numbers are normally approximations for a number they can t really represent (there are infinitely many numbers between, say, 0 and 1, but no more than 2 53 can be represented exactly in double precision floating point. The best we can do is get the floating-point representation close to the actual number Rounding in IEEE 754: always keeps two extra bits on the right during intermediate computations, called guard and round Example: add 2.56 10 10 0 to 2.34 10 10 2 assuming we have three significant decimal digits. Round to the nearest decimal number with three significant decimal digits, first with guard and round digits, and then without them. To add 2.56 10 10 0 to 2.34 10 10 2, we have to shift the smaller number to the right to align the exponents, so 2.56 10 10 0 becomes 0.0256 10 10 2. The guard digit holds 5 and the round digit holds 6. As 2.3400 + 0.0256 = 2.3656, the sum is 2.3656 10 2. Since we have two digits to round, we obtain 2.37 10 2. Doing this withot guard and round would give 2.36 10 2. Does it matter? Well it does... Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 25

Finite precision can lead to disaster The Patriot missile failure http://www.ima.umn.edu/~arnold/disasters/disasters.html American Patriot Missile battery in Dharan, Saudi Arabia, failed to intercept incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28 soldiers and injuring around 100 other people. Cause: software problem, namely, inaccurate calculation of the time since boot due to computer arithmetic errors. Problem specifics: Time in tenths of second as measured by the system s internal clock was multiplied by 1/10 to get the time in seconds. Internal registers were 24 bits wide. 1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 bits) Error 0.1100 1100 2 23 9.5 10 8 Error in 100 hour operation period: 9.5 108 100 60 60 10 = 0.34s Distance traveled by Scud = (0.34s) (1676m/s) 570m. This was far enough that the incoming Scud was outside the range gate that the Patriot tracked. Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 26

Finite range can lead to disaster Explosion of Ariane rocket 4/06/1996 http://www.ima.umn.edu/~arnold/disasters/disasters.html Unmanned Ariane 5 rocket of the European Space Agency veered off its flight path, broke up, and exploded only 30s after lift-off (altitude of 3700m) The $500 million rocket (with cargo) was on its first voyage after a decade of development costing $7 billion Cause: software error in the inertial reference system Problem specifics: A 64 bit floating point number relating to the horizontal velocity of the rocket was being converted to a 16 bit signed integer. An SRI 1 software exception arose during conversion because the 64-bit floating point number had a value greater than what could be represented by a 16-bit signed integer (max = 32, 767) 1 SRI = Système de Référence Inertielle or Inertial Reference System Fundamentals of Computing 2016 17 (2) http://www.dcs.bbk.ac.uk/~michael/foc/foc.html 27