UNIVERSITY OF CALIFORNIA, BERKELEY
Engineering 7
Department of Civil and Environmental Engineering
Spring 2013
Professor: S. Govindjee

Internal representation of numbers

1 Introduction

In everyday life, numbers are almost exclusively represented using the base 10 decimal system. This means, for example, that an integer N written as

    N = a_n a_{n-1} ... a_1 a_0                                                      (1)

is a short-hand for

    N = (a_n a_{n-1} ... a_1 a_0)_{10} = a_n · 10^n + a_{n-1} · 10^{n-1} + ... + a_1 · 10^1 + a_0 · 10^0      (2)

As a concrete example we have that:

    N = (231)_{10} = 2 · 10^2 + 3 · 10^1 + 1 · 10^0                                  (3)

In the above, 10 is called the base (or radix) of the system. The a_i's are integers between 0 and 9; i.e., a_i ∈ {0, 1, 2, ..., base - 1}. Rational numbers less than 1 (numbers with digits after the decimal point) make use of the negative powers of the base:

    N = (0.a_1 a_2 ... a_{n-1} a_n)_{10} = a_1 · 10^{-1} + a_2 · 10^{-2} + ... + a_{n-1} · 10^{-(n-1)} + a_n · 10^{-n}      (4)

As a concrete example we have that:

    N = (0.231)_{10} = 2 · 10^{-1} + 3 · 10^{-2} + 1 · 10^{-3}                       (5)

2 Binary Representation

Computers generally use a base 2 system, which is called the binary system. An integer N is represented as

    N = (a_n a_{n-1} ... a_1 a_0)_2 = a_n · 2^n + a_{n-1} · 2^{n-1} + ... + a_1 · 2^1 + a_0 · 2^0      (6)

where the a_i's are only allowed to be 0 or 1. As a concrete example we have:

    N = (11100111)_2 = 1 · 2^7 + 1 · 2^6 + 1 · 2^5 + 0 · 2^4 + 0 · 2^3 + 1 · 2^2 + 1 · 2^1 + 1 · 2^0      (7)

The binary system is quite convenient for computers, since the basic unit of information that they store is an electrical charge that is either on or off. A single binary digit (a 0 or a 1) is called a bit. An 8-bit sequence of 0s and 1s is usually called a byte.
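To make Eq. (6) concrete, the expansion can be carried out directly. The following is a minimal MATLAB sketch (our addition, not part of the original handout; the variable names are illustrative) that expands the bit string of Eq. (7) and recovers 231:

    % Expand the binary string '11100111' per Eq. (6): sum of a_i * 2^i.
    bits   = '11100111' - '0';          % convert the characters to the digits a_n ... a_0
    powers = (length(bits)-1):-1:0;     % exponents n, n-1, ..., 1, 0
    N = sum(bits .* 2.^powers);         % N = 231, matching Eq. (7)
    disp(N)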

Determining the binary representation of integers is made easy in MATLAB by the command dec2bin. For example,

>> A = dec2bin(231)
A =
11100111

You can convert the other way using bin2dec.

2.1 Storage of binary integers

The set of all integers is countably infinite. This means that in order to represent all integers, computers would need infinite storage capacity. In reality, computers can store only a finite subset of all integers because each integer is only allowed a certain number of bits. The number of bits allocated for any integer stored is sometimes called a word length. Using a 16-bit integer word length, the integer N = 231 may be stored as

    0 000000011100111.                                                               (8)

The first of the 16 bits stores the sign of the number (here, positive is stored as 0), while the remaining 15 bits store the number itself using the binary system representation.

Example 1 What is the largest integer that can be stored in a 32-bit signed integer? The binary representation for this would be the case where all the bits (save the sign bit) are set to unity:

    0 1111111111111111111111111111111    (sign bit followed by 31 bits, all set)     (9)

This is equivalent to

    2^0 + 2^1 + 2^2 + ... + 2^{30} = (1 - 2^{31})/(1 - 2) = 2^{31} - 1 ≈ 2 × 10^9    (10)

or roughly 2 billion. For unsigned integers, all 32 bits are available and one can get up to roughly 4 billion. If you type intmax in MATLAB it will show you the precise value.
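As a quick check of Example 1 (a sketch we add here, not part of the original handout), the sum in Eq. (10) can be compared against MATLAB's built-in intmax:

    % Check Example 1: the sum of 2^0 through 2^30 should equal intmax('int32').
    s = sum(2.^(0:30));        % 2^31 - 1 = 2147483647
    disp(s)
    disp(intmax('int32'))      % largest signed 32-bit integer
    disp(intmax('uint32'))     % largest unsigned 32-bit integer, roughly 4 billion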

Example 2 Convert the integer (578)_{10} to base 3. This can be performed by repeatedly subtracting successive powers of 3 from the number until one arrives at zero. We start the process with the largest power of 3 that will divide into (578)_{10}; its exponent can be computed as y = ⌊log_3(578)⌋. We can now divide our number by 3^y to find out how many times it goes into (578)_{10}, then subtract this amount and repeat. A simple bit of code that achieves this is

    % base conversion
    clear;
    x = 578;
    while x ~= 0
        y = floor(log(x)/log(3));   % Compute highest exponent
        d(y+1) = floor(x/3^y);      % Determine digit value
        x = x - d(y+1)*3^y;         % Deflate number
    end
    disp('base 3 representation');
    disp(d(end:-1:1));

The result is (210102)_3. Our algorithm can be checked using the built-in MATLAB function dec2base(578,3).

2.2 Binary storage of real numbers

Real numbers, as opposed to integers, have decimal points. They can be represented in many different ways. One convenient way is to represent them in scientific notation, or floating-point form:

    x = ±(d_1.d_2 ... d_n)_{β_1} × β_2^E.                                            (11)

In the above, ± is the sign of the number. The factor (d_1.d_2 ... d_n)_{β_1} is known as the significand. The digits after the decimal point, d_2 ... d_n, are called the fraction or mantissa. β_1 is the base for the fraction, β_2 is the base for the exponent, and E is the exponent. The floating-point form of a real number is called normal if d_1 ≠ 0; note 0 ≤ d_i < β_1. In normal form one also has the relation (d_1.d_2 ... d_n)_{β_1} < β_1.

In order to store a real number expressed in normal form, the computer needs to allocate space for the sign, fraction, and exponent; there is no need to store d_1 since in the binary system it is always equal to 1. In most computer systems real numbers are stored using 32 bits (single precision) or 64 bits (double precision). For both integers and real numbers, the precise word length is dependent on the computer architecture and can vary widely from system to system.¹

¹ When buying computers and/or operating systems, they are often designated as 32-bit or 64-bit. This designation refers to the word length used to index locations in memory. Thus a 64-bit computer/OS can utilize a lot more memory than a 32-bit computer/OS. But this is independent of the sizes used to store variables in a program.
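Before moving on to the IEEE layout, it may help to see Eq. (11) in action for β_1 = β_2 = 2. The following minimal MATLAB sketch (our addition; the variable names are illustrative) splits a number into a base-2 exponent and a significand lying in [1, 2):

    % Decompose x into x = s * 2^E with 1 <= s < 2 (normal form, base 2).
    x = 231.0;
    E = floor(log2(abs(x)));           % exponent
    s = abs(x) / 2^E;                  % significand, lies in [1, 2)
    fprintf('x = %.10g * 2^%d\n', s, E);
    % MATLAB's two-output log2 performs a closely related split, x = f0 * 2^e0,
    % but with the different normalization 0.5 <= f0 < 1:
    [f0, e0] = log2(x);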

2.3 IEEE standard

MATLAB uses the IEEE standard for storing floating-point numbers. Its default is to use double precision numbers. However, we will start by discussing single precision numbers as they require less writing. For single precision IEEE numbers (32 bits), the first bit is the sign bit, the next 8 bits are for the biased exponent (a shifted form of the exponent), and the remaining 23 bits are for the fraction (mantissa). Both the exponent and the fraction are stored in base 2, so that β_1 = β_2 = 2. A single precision number, V, looks like:

    V = | sign bit (s) | biased exponent bits (e) | mantissa bits (f) |              (12)

By definition:

- If all the biased exponent bits are set, i.e. e = 255, and f is nonzero, then V = NaN (Not a Number).
- If all the biased exponent bits are set, i.e. e = 255, f = 0, and s = 1, then V = -∞.
- If all the biased exponent bits are set, i.e. e = 255, f = 0, and s = 0, then V = +∞.
- If 0 < e < 255, then

    V = (-1)^s · 2^{e-127} · (1.f)_2.                                                (13)

The (1.f)_2 represents the binary number created by appending f to an implicit "1.". The actual exponent E is not stored. Rather, e = E + 127 is stored so that the exponent can have positive as well as negative values without the need for another sign bit; 127 is referred to as the bias.

Beyond normal numbers there is also the special case of subnormal numbers. A subnormal number occurs when e = 0. In this situation, the significand is assumed to be of the form (0.f)_2. On some systems, numbers of this form cause underflow errors. The most important subnormal number is the case where f = 0, in which case V = ±0 depending on the sign bit.

There are similar rules for double precision numbers. It is not critical to memorize exactly how IEEE numbers are stored. What really matters is understanding that there are limits on what can be stored and computed. For example, there is a minimum (non-zero) and a maximum value that can be stored. The maximum value of an IEEE 32-bit number is

    s   e (8 bits)   f (23 bits)
    0   11111110     11111111111111111111111   = (-1)^0 · 2^{254-127} · (2 - 2^{-23}) ≈ 3.40 × 10^{38}      (14)

The term 2 - 2^{-23} comes from exploiting the relation for the geometric sum:

    1 + a^1 + a^2 + ... + a^n = (1 - a^{n+1})/(1 - a);                               (15)

i.e.,

    (1.11111111111111111111111)_2 = (1/2)^0 + (1/2)^1 + ... + (1/2)^{23} = (1 - (1/2)^{23+1})/(1 - (1/2)) = 2 - 2^{-23}      (16)

The minimum (non-zero) normal number representable in IEEE single precision is

    s   e (8 bits)   f (23 bits)
    0   00000001     00000000000000000000000   = (-1)^0 · 2^{1-127} · (1) ≈ 1.2 × 10^{-38}      (17)

MATLAB's default for storing numbers is the IEEE double precision floating-point format, in which there is one sign bit, 11 (biased) exponent bits, and 52 mantissa bits; the bias is 1023. The biased exponent e is required to be in the range 0 < e < 2047 for normal numbers:

    V = (-1)^s · 2^{e-1023} · (1.f)_2.                                               (18)

Rules for NaN, ±∞, etc. follow as above, with the critical value of e being 2047 instead of 255.
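The limits in Eqs. (14) and (17) can be compared with MATLAB's built-in constants. The following check is a sketch we add here (not part of the original handout); it assumes the standard functions realmax, realmin, and num2hex:

    % Compare the hand-derived single precision limits with MATLAB's constants.
    vmax = 2^127 * (2 - 2^-23);        % Eq. (14)
    vmin = 2^(1 - 127);                % Eq. (17), i.e. 2^-126
    fprintf('max: %g vs realmax(single): %g\n', vmax, realmax('single'));
    fprintf('min: %g vs realmin(single): %g\n', vmin, realmin('single'));
    % num2hex shows the raw IEEE double precision bit pattern as hexadecimal;
    % for example, 1.0 is stored as sign 0, biased exponent 1023 = 0x3FF, f = 0:
    num2hex(1)                         % returns '3ff0000000000000'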

3 Finite precision of numbers

The fact that numbers are stored using a finite number of bits for the exponent and the mantissa has important implications for the accuracy of numerical computations. It is quite possible to have very well formulated mathematical expressions that simply do not compute as one would expect.

Example 3 Let us consider adding a number to unity; i.e., 1 + δ. Because of the way the numbers are stored, we could for instance obtain the seemingly odd result that 1 + δ = 1 for δ > 0. To see how this could occur we need to think about how the simple operation of addition could be performed inside a computer. Suppose we start with the number 1; in IEEE double precision, it has an exponent E = 0, a biased exponent e = 1023 = (01111111111)_2, and a mantissa f = 0. If we are to try to make as small a change as possible to 1, we would take the mantissa and add a 1 to the very last bit; i.e., make f = 2^{-52} (recall there are 52 bits in the double precision mantissa). This represents adding roughly 2.2 × 10^{-16} to 1. If we try to add a number smaller than 2^{-52} to 1, there are not enough mantissa bits to be able to represent the result. The computer simply throws away these extra bits and the information is lost. It is easy to check this fact in MATLAB with a simple program:

    for ii = -50:-1:-54
        if 1 == (1 + 2^ii)
            fprintf('1 + 2^%d equals 1\n', ii);
        else
            fprintf('1 + 2^%d does not equal 1\n', ii);
        end
    end

The output of which is

    1 + 2^-50 does not equal 1
    1 + 2^-51 does not equal 1
    1 + 2^-52 does not equal 1
    1 + 2^-53 equals 1
    1 + 2^-54 equals 1

The number 2^{-52} is known as machine epsilon; it is simply the smallest number one can add to unity and still be able to detect it. In MATLAB, machine epsilon is the pre-defined variable eps. When represented as a base 10 number, machine epsilon ≈ 2.2 × 10^{-16}, and this tells us that, in general, when one adds two numbers in IEEE double precision there are about 16 (base 10) digits of accuracy in the computation; i.e., roughly speaking, (a + a*b) == a whenever b is below machine epsilon (for a = 1 the loop above shows the cutoff sits between 2^{-52} and 2^{-53}).²

² To have MATLAB display more digits of a number in the command window, type format long e at the command prompt. For other printing format options type help format.
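As a small follow-up (our addition, not part of the original handout), the value of eps and its role as a threshold can be confirmed directly:

    % eps is exactly 2^-52 in IEEE double precision.
    disp(eps == 2^-52)        % displays 1 (true)
    disp(1 + eps   >  1)      % true: adding eps to 1 is detectable
    disp(1 + eps/2 == 1)      % true: half of eps is rounded away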

Example 4 Consider a hypothetical computer system constructed upon a base 10 number system that can store 3 digits in the significand (leading digit plus mantissa). Now consider adding x = 2 = 2.00 × 10^0 to y = 0.001 = 1.00 × 10^{-3}. In order for the computer to perform this operation it needs to have both numbers utilizing the same exponent; this makes it possible to directly add the significands. The computation will look like

      2.00  × 10^0
    + 0.001 × 10^0
    --------------
      2.001 × 10^0                                                                   (19)

Note that to get the exponent of y to match that of x, the 1 is necessarily shifted beyond the 3 digits that our hypothetical computer stores. The result will be that our computer will tell us that x + y is 2.00.

Example 5 As a last example, consider the innocuous looking request to MATLAB: >> 1.0 - 0.8 - 0.2. Type this into MATLAB and one finds that the result is not zero! However, >> 1.0 - 0.2 - 0.8 is zero. This occurs even though MATLAB is using 11 biased exponent bits and 52 mantissa bits. The reason becomes apparent if one carefully looks at how these numbers are stored in IEEE format. The number 1 looks like:

    s (1 bit)   e (11 bits)   f (52 bits)
    0           01111111111   0000000000000000000000000000000000000000000000000000   (20)

The number 0.8 appears as

    0           01111111110   1001100110011001100110011001100110011001100110011001   (21)

and the number 0.2 appears as

    0           01111111100   1001100110011001100110011001100110011001100110011001   (22)

(the infinitely repeating fraction pattern is shown truncated at 52 bits; in the values actually stored, the trailing bits are rounded). The first thing to note is that in base 2 these decimals repeat infinitely. They can not be stored exactly, similar to how 1/3 can not be stored exactly in a finite number of base 10 digits. When the numbers are shifted to effect the computations, bits get lost, and the result is errors whereby 1.0 - 0.8 - 0.2 is not equal to 1.0 - 0.2 - 0.8. The difference, however, can be determined by noting that the errors will be associated with incorrect trailing bits in the IEEE format. In this particular case, with a bit of effort, one can show that the error occurs in the second-to-last bit, the one with value 2^{-3} · 2^{-51} (-3 is the exponent for 0.2). In fact:

    >> 1.0 - 0.8 - 0.2 + 2^-54
    ans =
         0

One would never try to implement this type of brute-force correction scheme to improve precision, as it is quite computation dependent. Instead, in programs that depend upon floating-point computations, care is taken when comparing numbers to ensure that one is not asking for precision that is not computable by the computer.
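A common way to exercise such care is to compare floating-point results against a small tolerance rather than testing for exact equality. The sketch below is our own illustration; the tolerance value is an assumption chosen for the example, not a rule from the handout:

    % Compare two floating-point results up to a small tolerance instead of using ==.
    a = 1.0 - 0.8 - 0.2;
    b = 0.0;
    tol = 1e-12;                      % tolerance chosen for illustration only
    if abs(a - b) < tol
        disp('a and b are equal to within tol');
    else
        disp('a and b differ by more than tol');
    end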

Remark: To convert a base 10 number x to IEEE format, one first computes the true exponent as E = ⌊log_2(|x|)⌋. The IEEE (biased) exponent is then the binary representation of E shifted by the bias. The mantissa can be found by reducing to normal form and subtracting the leading 1: f = (|x|/2^E) - 1. Finally, the mantissa can be converted to binary by successively subtracting powers 2^{-n} for n ∈ {1, 2, ..., 52}, a process similar to what was done in Example 2 (but without the need for determining the digit value, since each digit can only be 1 or 0 in a binary representation).
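To close, here is a minimal MATLAB sketch of the recipe in the Remark (our addition; it handles positive normal numbers only, and the variable names are illustrative). It assembles the biased exponent and mantissa bits of a double precision number and compares against the bit pattern reported by num2hex:

    % Build the IEEE double precision fields of x per the Remark (normal x > 0 assumed).
    x = 0.2;
    E = floor(log2(abs(x)));           % true exponent
    e = E + 1023;                      % biased exponent
    ebits = dec2bin(e, 11);            % 11 exponent bits
    f = abs(x)/2^E - 1;                % mantissa as a fraction, 0 <= f < 1
    fbits = repmat('0', 1, 52);        % 52 mantissa bits
    for n = 1:52                       % successively subtract powers 2^-n
        if f >= 2^(-n)
            f = f - 2^(-n);
            fbits(n) = '1';
        end
    end
    disp(['0 ' ebits ' ' fbits]);      % sign bit is 0 since x > 0
    disp(num2hex(x));                  % MATLAB's stored bit pattern, in hexadecimal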