This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers




This Unit: Floating Point Arithmetic
CIS 371 Computer Organization and Design
Unit 7: Floating Point
[Figure: system stack showing applications, system software, and hardware (Mem, CPU, I/O)]
• Formats
  • Precision and range
  • IEEE 754 standard
• Operations
  • Addition and subtraction
  • Multiplication and division
• Error analysis
  • Error and bias
  • Rounding and truncation
CIS371 (Roth/Martin): Floating Point

Readings
• P+H
  • Chapter 3.6 and 3.7

Floating Point (FP) Numbers
• Floating point numbers: numbers in scientific notation
• Two uses
  • Use I: real numbers (numbers with non-zero fractions)
    • 3.1415926
    • 2.1878
    • 6.62 * 10^-34
  • Use II: really big numbers
    • 3.0 * 10^8
    • 6.02 * 10^23
• Aside: best not used for currency values

The Land Before Floating Point
• Early computers were built for scientific calculations
  • ENIAC: ballistic firing tables
• But they didn't have primitive floating point data types
  • Many embedded chips today lack floating point hardware
• Programmers built scale factors into programs
  • A large constant multiplier turns all FP numbers into integers
  • Inputs multiplied by the scale factor manually
  • Outputs divided by the scale factor manually
  • Sometimes called fixed point arithmetic

The Fixed Width Dilemma
• Natural arithmetic has infinite width
  • Infinite number of integers
  • Infinite number of reals
  • Infinitely more reals than integers (head spinning...)
• Hardware arithmetic has finite width N (e.g., 16, 32, 64)
  • Can represent 2^N numbers
• If you could represent 2^N integers, which would they be?
  • Easy, the 2^(N-1) on either side of 0
• If you could represent 2^N reals, which would they be?
  • 2^N reals from 0 to 1, not too useful
  • Uhh... umm...

Range and Precision
• Range
  • Distance between the largest and smallest representable numbers
  • Want big
• Precision
  • Distance between two consecutive representable numbers
  • Want small
• In fixed width, can't have unlimited both

Scientific Notation
• Scientific notation: good compromise
  • Number [S,F,E] = S * F * 2^E
  • S: sign
  • F: significand (fraction)
  • E: exponent
  • "Floating point": the binary (decimal) point sits at a different magnitude depending on the exponent
  + Sliding window of precision using the notion of significant digits
    • Small numbers are very precise, many places after the decimal point
    • Big numbers are much less so, not all integers representable
    • But for those instances you don't really care anyway
  • Caveat: all representations are just approximations
    • Sometimes weirdos like 0.9999999 or 1.0000001 come up
    + But good enough for most purposes
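The scale-factor technique from "The Land Before Floating Point" can be sketched in a few lines. This is only an illustration: the scale factor of 100 and the helper names (`to_fixed`, `fixed_mul`, and so on) are assumptions chosen for the example, not anything taken from the slides.

```python
SCALE = 100  # hypothetical scale factor: keeps two decimal places


def to_fixed(x: float) -> int:
    """Multiply an input by the scale factor and store it as an integer."""
    return round(x * SCALE)


def from_fixed(a: int) -> float:
    """Divide an output by the scale factor to recover the real value."""
    return a / SCALE


def fixed_add(a: int, b: int) -> int:
    """Both operands carry the same scale, so plain integer addition works."""
    return a + b


def fixed_mul(a: int, b: int) -> int:
    """The raw product carries SCALE twice, so divide one copy back out.

    Floor division discards any digits below 1/SCALE.
    """
    return (a * b) // SCALE


price = to_fixed(19.99)          # stored as 1999
quantity = to_fixed(3.00)        # stored as 300
total = fixed_mul(price, quantity)
print(from_fixed(total))         # 59.97
```

Note that the range and precision are both fixed by the choice of `SCALE`, which is the dilemma the following slides address.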

IEEE 754 Standard Precision/Range
• Single precision: float in C
  • 32-bit: 1-bit sign + 8-bit exponent + 23-bit significand
  • Range (magnitude, approximately): 2.0 * 10^-38 < N < 2.0 * 10^38
  • Precision: ~7 significant (decimal) digits
  • Used when exact precision is less important (e.g., 3D games)
• Double precision: double in C
  • 64-bit: 1-bit sign + 11-bit exponent + 52-bit significand
  • Range (magnitude, approximately): 2.0 * 10^-308 < N < 2.0 * 10^308
  • Precision: ~15 significant (decimal) digits
  • Used for scientific computations
• Numbers >10^308 don't come up in many calculations
  • 10^80 ~ number of atoms in the universe

How Do Bits Represent Fractions?
• Sign: 0 or 1 -> easy
• Exponent: signed integer -> also easy
• Significand: unsigned fraction -> ??
• How do we represent integers?
  • Sums of positive powers of two
  • S-bit unsigned integer A: A_(S-1)*2^(S-1) + A_(S-2)*2^(S-2) + ... + A_1*2^1 + A_0*2^0
• So how can we represent fractions?
  • Sums of negative powers of two
  • S-bit unsigned fraction A: A_(S-1)*2^0 + A_(S-2)*2^-1 + ... + A_1*2^(-S+2) + A_0*2^(-S+1)
  • 1, 1/2, 1/4, 1/8, 1/16, 1/32, ...
  • More significant bits correspond to larger multipliers

Some Examples
• What is 5 in floating point?
  • Sign: 0
  • 5 = 1.25 * 2^2
  • Significand: 1.25 = 1*2^0 + 1*2^-2 = 101 0000 0000 0000 0000 0000
  • Exponent: 2 = 0000 0010
• What is -0.5 in floating point?
  • Sign: 1
  • 0.5 = 0.5 * 2^0
  • Significand: 0.5 = 1*2^-1 = 010 0000 0000 0000 0000 0000
  • Exponent: 0 = 0000 0000

Normalized Numbers
• Notice
  • 5 is 1.25 * 2^2
  • But isn't it also 0.625 * 2^3 and 0.3125 * 2^4 and ...?
  • With 8-bit exponent, we can have 256 representations of 5
• Multiple representations for one number
  • Lead to computational errors
  • Waste bits
• Solution: choose a normal (canonical) form
  • Disallow de-normalized numbers
  • IEEE 754 normal form: coefficient of 2^0 is 1
  • Similar to scientific notation: one non-zero digit left of the decimal point
  • Normalized representation of 5 is 1.25 * 2^2 (1.25 = 1*2^0 + 1*2^-2)
  • 0.625 * 2^3 is de-normalized (0.625 = 0*2^0 + 1*2^-1 + 1*2^-3)
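The normalization rule and the "sums of negative powers of two" encoding can be checked with a short sketch. It assumes positive inputs and uses the standard library's `math.frexp`, which happens to return the de-normalized form with a significand in [0.5, 1); the helper names `normalize` and `fraction_bits` are made up for this illustration.

```python
import math


def normalize(x: float) -> tuple[float, int]:
    """Return (significand, exponent) with 1 <= significand < 2.

    Assumes x > 0. math.frexp yields 0.5 <= m < 1 with x == m * 2**e,
    so shifting one factor of two into m gives the normal form.
    """
    m, e = math.frexp(x)
    return m * 2, e - 1


def fraction_bits(f: float, nbits: int) -> str:
    """Write f (0 <= f < 2) as bits weighted 2**0, 2**-1, 2**-2, ..."""
    bits = []
    for i in range(nbits):
        weight = 2.0 ** -i
        if f >= weight:
            bits.append("1")
            f -= weight
        else:
            bits.append("0")
    return "".join(bits)


print(math.frexp(5.0))           # (0.625, 3): the de-normalized 0.625 * 2^3
print(normalize(5.0))            # (1.25, 2): the normalized 1.25 * 2^2
print(fraction_bits(1.25, 8))    # 10100000
```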

More About Normalization
• What is -0.5 in normalized floating point?
  • Sign: 1
  • 0.5 = 1 * 2^-1
  • Significand: 1 = 1*2^0 = 100 0000 0000 0000 0000 0000
  • Exponent: -1 = 1111 1111
• IEEE 754: no need to represent the coefficient of 2^0 explicitly
  • It's always 1
  + Buy yourself an extra bit (~1/3 of a decimal digit) of precision
  • Yeeha
• Problem: what about 0?
  • How can we represent 0 if 2^0 is always implicitly 1?

IEEE 754: The Whole Story
• Exponent: signed integer -> not so fast
• Exponent represented in excess or bias notation
  • N bits typically can represent signed numbers from -2^(N-1) to 2^(N-1)-1
  • But in IEEE 754, they represent exponents from -2^(N-1)+2 to 2^(N-1)-1
  • And they represent those as unsigned with an implicit 2^(N-1)-1 added
  • The implicit added quantity is called the bias
  • Actual exponent is E - (2^(N-1)-1)
• Example: single precision (8-bit exponent)
  • Bias is 127, exponent range is -126 to 127
  • -126 is represented as 1 = 0000 0001
  • 127 is represented as 254 = 1111 1110
  • 0 is represented as 127 = 0111 1111
  • 1 is represented as 128 = 1000 0000

IEEE 754: Continued
• Notice: two exponent bit patterns are unused
• 0000 0000: represents de-normalized numbers
  • Numbers that have an implicit 0 (rather than 1) in 2^0
  • Zero is a special kind of de-normalized number
    + Exponent is all 0s, significand is all 0s (bzero still works)
    • There are both +0 and -0, but they are considered the same
  • Also represent numbers smaller than the smallest normalized numbers
• 1111 1111: represents infinity and NaN
  • ± infinities have 0s in the significand
  • ± NaNs do not

IEEE 754: Infinity and Beyond
• What are infinity and NaN used for?
  • To allow operations to proceed past overflow/underflow situations
  • Overflow: operation yields exponent greater than 2^(N-1)-1
  • Underflow: operation yields exponent less than -2^(N-1)+2
• IEEE 754 defines operations on infinity and NaN
  • N / 0 = infinity (for non-zero N)
  • N / infinity = 0
  • 0 / 0 = NaN
  • Infinity / infinity = NaN
  • Infinity - infinity = NaN
  • Any operation involving NaN = NaN
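The biased exponent, implicit leading 1, and reserved exponent patterns can be inspected directly. The following is a minimal sketch that uses Python's standard `struct` module to view a value as a single-precision bit pattern; the helper name `fields` is an assumption for illustration. One caveat: Python raises `ZeroDivisionError` for float division by zero rather than returning infinity, so that rule is not demonstrated here.

```python
import math
import struct

BIAS = 127  # single-precision bias: 2**(8-1) - 1


def fields(x: float) -> tuple[int, int, int]:
    """Split x, rounded to single precision, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored (biased) 8-bit exponent
    fraction = bits & 0x7FFFFF       # 23 stored bits, leading 1 is implicit
    return sign, exponent, fraction


inf = float("inf")

print(fields(5.0))    # (0, 129, 2097152): exponent 2 + 127, fraction .01000...
print(fields(-0.5))   # (1, 126, 0): exponent -1 + 127, fraction all zeros
print(fields(0.0))    # (0, 0, 0): the all-zeros pattern
print(fields(inf))    # (0, 255, 0): all-ones exponent, zero significand

# Operations on the special values behave as the slide lists them.
print(1.0 / inf)      # 0.0
print(inf - inf)      # nan
print(inf / inf)      # nan
print(-0.0 == 0.0)    # True: +0 and -0 compare equal
```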

IEEE 754: Final Format
[Figure: bit layout from most to least significant: sign, exp, significand]
• Biased exponent
• Normalized significand
• Exponent more significant than significand
  • Helps comparing FP numbers
  • Exponent bias notation helps there too
• Nearly every computer since the mid-1980s supports this standard
  • Makes code portable (at the source level at least)
  • Makes hardware faster (stand on each other's shoulders)

Floating Point Arithmetic
• We will look at
  • Addition/subtraction
  • Multiplication/division
• Implementation
  • Basically, integer arithmetic on significand and exponent
  • Using integer ALUs
  • Plus extra hardware for normalization
• To help us here, look at a toy "quarter" precision format
  • 8 bits: 1-bit sign + 3-bit exponent + 4-bit significand
  • Bias is 3

FP Addition
• Assume
  • A represented as bit pattern [S_A, E_A, F_A]
  • B represented as bit pattern [S_B, E_B, F_B]
• What is the bit pattern for A+B [S_(A+B), E_(A+B), F_(A+B)]?
  • [S_A+S_B, E_A+E_B, F_A+F_B]? Of course not
  • So what is it then?

FP Addition Decimal Example
• Let's look at a decimal example first: 99.5 + 0.8
  • 9.95*10^1 + 8.0*10^-1
• Step I: align exponents (if necessary)
  • Temporarily de-normalize the one with the smaller exponent
  • Add 2 to exponent -> shift significand right by 2
  • 8.0*10^-1 -> 0.08*10^1
• Step II: add significands
  • Remember overflow, it isn't treated like integer overflow
  • 9.95*10^1 + 0.08*10^1 -> 10.03*10^1
• Step III: normalize result
  • Shift significand right by 1, add 1 to exponent
  • 10.03*10^1 -> 1.003*10^2
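The claim that placing the biased exponent above the significand helps comparison can be illustrated as follows. This sketch only covers non-negative values, where the single-precision bit patterns, read as unsigned integers, sort in the same order as the numbers they encode; the helper name `bit_pattern` and the sample values are assumptions for the example.

```python
import struct


def bit_pattern(x: float) -> int:
    """Single-precision bit pattern of x, read as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


values = [99.5, 0.5, 1e30, 5.0, 1.5, 0.0, 1.0]

# For non-negative floats, sorting by integer bit pattern matches sorting
# by value, so an integer comparator can order them.
by_bits = sorted(values, key=bit_pattern)
by_value = sorted(values)
print(by_bits == by_value)   # True
```

Negative numbers need extra handling because the format is sign-magnitude rather than two's complement.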

FP Addition Quarter Example
• Now a binary "quarter" example: 7.5 + 0.5
  • 7.5 = 1.875*2^2 = 0 101 11110
    • 1.875 = 1*2^0 + 1*2^-1 + 1*2^-2 + 1*2^-3
  • 0.5 = 1*2^-1 = 0 010 10000
• Step I: align exponents (if necessary)
  • 0 010 10000 -> 0 101 00010
  • Add 3 to exponent -> shift significand right by 3
• Step II: add significands
  • 0 101 11110 + 0 101 00010 = 0 101 100000
• Step III: normalize result
  • 0 101 100000 -> 0 110 10000
  • Shift significand right by 1 -> add 1 to exponent

FP Addition Hardware
[Figure: adder datapath with inputs E1, F1, E2, F2, a control block, right shifters, and adders producing outputs E and F]
• De-normalize smaller exponent
• Add significands
• Normalize result

What About FP Subtraction?
• Or addition of negative quantities for that matter
  • How to subtract significands that are not in 2C (two's complement) form?
  • Can we still use an adder?
• Trick: internally and temporarily convert to 2C
  • Add a "phantom" -2 bit in front (-1*2^1)
  • Use the standard negation trick
  • Add as usual
  • If the phantom -2 bit is 1, the result is negative
    • Negate it using the standard trick again, flip the result sign bit
    • Ignore the phantom bit, which is now 0 anyway
  • Got all that?
• Basically, the big ALU has negation prefix and postfix circuits

FP Multiplication
• Assume
  • A represented as bit pattern [S_A, E_A, F_A]
  • B represented as bit pattern [S_B, E_B, F_B]
• What is the bit pattern for A*B [S_(A*B), E_(A*B), F_(A*B)]?
  • This one is actually a little easier (conceptually) than addition
  • Scientific notation is logarithmic
  • In logarithmic form: multiplication is addition
  • [S_A XOR S_B, E_A+E_B, F_A*F_B]? Pretty much, except for
    • Normalization
    • Addition of exponents in biased notation (must subtract bias)
    • Tricky: when multiplying two normalized significands, where is the binary point?
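The three addition steps, and the multiplication rule with its bias correction, can be traced in code on the toy quarter format. This is a minimal sketch for positive operands only: it stores the significand as a 5-bit integer with the leading 1 explicit (so 0b11110 means 1.1110 binary = 1.875), truncates shifted-out bits, and does not check for exponent overflow. The function names are assumptions for illustration.

```python
BIAS = 3  # quarter-precision bias from the slides


def qp_value(exp: int, sig: int) -> float:
    """Decode a positive toy number: 5-bit significand 1.xxxx, biased exponent."""
    return (sig / 16) * 2.0 ** (exp - BIAS)


def qp_add(ea: int, fa: int, eb: int, fb: int) -> tuple[int, int]:
    """Add two positive toy numbers, following Steps I to III."""
    # Step I: align exponents by de-normalizing the smaller operand.
    if ea < eb:
        ea, fa, eb, fb = eb, fb, ea, fa
    fb >>= ea - eb               # shifted-out bits are simply dropped here
    # Step II: add significands as plain integers.
    sig, exp = fa + fb, ea
    # Step III: normalize if the sum overflowed into a sixth bit.
    if sig >= 0b100000:
        sig >>= 1
        exp += 1
    return exp, sig


def qp_mul(ea: int, fa: int, eb: int, fb: int) -> tuple[int, int]:
    """Multiply two positive toy numbers."""
    exp = ea + eb - BIAS         # both exponents carry the bias; remove one copy
    sig = (fa * fb) >> 4         # product has 8 fraction bits; keep 4 (truncate)
    if sig >= 0b100000:          # product of two 1.xxxx values lies in [1, 4)
        sig >>= 1
        exp += 1
    return exp, sig


# 7.5 + 0.5 from the slide: 0 101 11110 + 0 010 10000 -> 0 110 10000
exp, sig = qp_add(0b101, 0b11110, 0b010, 0b10000)
print(format(exp, "03b"), format(sig, "05b"), qp_value(exp, sig))   # 110 10000 8.0
```

In `qp_mul` the binary point question from the slide shows up as the `>> 4`: two significands with four fraction bits each give a product with eight.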

FP Division
• Assume
  • A represented as bit pattern [S_A, E_A, F_A]
  • B represented as bit pattern [S_B, E_B, F_B]
• What is the bit pattern for A/B [S_(A/B), E_(A/B), F_(A/B)]?
  • [S_A XOR S_B, E_A-E_B, F_A/F_B]? Pretty much, again except for
    • Normalization
    • Subtraction of exponents in biased notation (must add bias)
    • Binary point placement
    • No need to worry about remainders, either
• Ironic
  • Multiplication/division roughly the same complexity for FP and integer
  • Addition/subtraction much more complicated for FP than integer

Accuracy
• Remember our decimal addition example?
  • 9.95*10^1 + 8.00*10^-1 -> 1.003*10^2
  • Extra decimal place caused by de-normalization
  • But what if our representation only has two digits of precision?
  • What happens to the 3?
  • Corresponding binary question: what happens to extra 1s?
• Solution: round
  • Option I: round down (truncate), no hardware necessary
  • Option II: round up (round), need an incrementer
  • Why is rounding up called "round"?
    • Because an extra 1 is half-way, which is rounded up

More About Accuracy
• Problem with both truncation and rounding
  • They cause errors to accumulate
  • E.g., if you always round up, the result will gradually "crawl" upwards
• One solution: round to nearest even
  • If un-rounded LSB is 1 -> round up (011 -> 10)
  • If un-rounded LSB is 0 -> round down (001 -> 00)
  • Round up half the time, down the other half -> overall error is stable
• Another solution: multiple intermediate precision bits
  • IEEE 754 defines 3: guard + round + sticky
  • Guard and round are shifted by de-normalization as usual
  • Sticky is 1 if any shifted-out bits are 1
  • Round up if 101 or higher, round down if 011 or lower
  • Round to nearest even if 100

Numerical Analysis
• Accuracy problems sometimes get bad
  • Addition of big and small numbers
  • Summing many small numbers
  • Subtraction of big numbers
  • Example: what's 1*10^30 + 1*10^0 - 1*10^30?
    • Intuitively: 1*10^0 = 1
    • But: (1*10^30 + 1*10^0) - 1*10^30 = (1*10^30 - 1*10^30) = 0
• Numerical analysis: field formed around this problem
  • Bounding the error of numerical algorithms
  • Re-formulating algorithms in a way that bounds numerical error
• Practical hints: never test for equality between FP numbers
  • Use something like: if (abs(a-b) < 0.00001) then ...
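Both the absorption example and the practical hint about equality tests can be reproduced in double precision. In this sketch the function name `nearly_equal` and the default tolerance are assumptions that mirror the slide's `abs(a-b) < 0.00001`; `math.isclose` is the standard library's relative-tolerance alternative, which tends to behave better when the operands are very large or very small.

```python
import math

big, small = 1e30, 1.0

# Evaluated left to right, the 1 is absorbed: it is far below the
# precision available at magnitude 1e30.
absorbed = (big + small) - big
# Re-ordering the same terms keeps it.
reordered = (big - big) + small
print(absorbed, reordered)        # 0.0 1.0


def nearly_equal(a: float, b: float, tol: float = 0.00001) -> bool:
    """Absolute-tolerance comparison in the style the slide suggests."""
    return abs(a - b) < tol


total = 0.1 + 0.2
print(total == 0.3)               # False: exact equality fails
print(nearly_equal(total, 0.3))   # True
print(math.isclose(total, 0.3))   # True: relative tolerance, default 1e-09
```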

One Last Thing About Accuracy
• Suppose you added two numbers and came up with
  • 0 101 11111 101
  • What happens when you round?
  • Number becomes denormalized... arrrrgggghhh
• FP adder actually has more than three steps
  • Align exponents
  • Add/subtract significands
  • Re-normalize
  • Round
  • Potentially re-normalize again
  • Potentially round again

Arithmetic Latencies
• Latency in cycles of common arithmetic operations
• Source: Software Optimization Guide for AMD Family 10h Processors, Dec 2007
  • Intel Core 2 chips similar

            Add/Subtract   Multiply   Divide
  Int 32    1              3          14 to 40
  Int 64    1              5          23 to 87
  Fp 32     4              4          16
  Fp 64     4              4          20

• Floating point divide faster than integer divide?
  • Why?

Summary
[Figure: system stack showing applications, system software, and hardware (Mem, CPU, I/O)]
• FP representation
  • Scientific notation: S*F*2^E
  • IEEE 754 standard
  • Representing fractions
• FP operations
  • Addition/subtraction: harder than integer
  • Multiplication/division: same as integer!!
• Accuracy problems
  • Rounding and truncation
• Upshot: FP is painful
  • Thank lucky stars P37X has no FP
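The rounding case from "One Last Thing About Accuracy" can be worked through with the guard/round/sticky rule stated earlier (round up at 101 or higher, down at 011 or lower, to nearest even at 100). This sketch assumes the trailing 101 in "0 101 11111 101" is the three extra bits, reuses the toy 5-bit significand with an explicit leading 1, and uses a made-up function name.

```python
def round_and_renormalize(exp: int, sig: int, grs: int) -> tuple[int, int]:
    """Round a 5-bit significand using 3 extra bits, then renormalize.

    grs holds the guard, round, and sticky bits as a 3-bit integer.
    """
    # Above half-way rounds up; exactly half-way rounds to the even neighbour.
    if grs > 0b100 or (grs == 0b100 and sig & 1):
        sig += 1
    # Rounding 11111 up carries out to 100000, which is no longer 1.xxxx,
    # so a second normalization step is needed.
    if sig >= 0b100000:
        sig >>= 1
        exp += 1
    return exp, sig


exp, sig = round_and_renormalize(0b101, 0b11111, 0b101)
print(format(exp, "03b"), format(sig, "05b"))   # 110 10000
```

This is why the adder's step list includes "potentially re-normalize again" after rounding.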