The representation of real numbers with a pre-determined number of integer and fractional digits is called the fixed point notation (FixP).

Similar documents
Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

LSN 2 Number Systems. ECT 224 Digital Computer Fundamentals. Department of Engineering Technology

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

The string of digits in the binary number system represents the quantity

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

Oct: 50 8 = 6 (r = 2) 6 8 = 0 (r = 6) Writing the remainders in reverse order we get: (50) 10 = (62) 8

Numerical Matrix Analysis

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

1. Give the 16 bit signed (twos complement) representation of the following decimal numbers, and convert to hexadecimal:

Correctly Rounded Floating-point Binary-to-Decimal and Decimal-to-Binary Conversion Routines in Standard ML. By Prashanth Tilleti

CHAPTER 5 Round-off errors

Data Storage 3.1. Foundations of Computer Science Cengage Learning

CSI 333 Lecture 1 Number Systems

CS321. Introduction to Numerical Methods

Lecture 11: Number Systems

Measures of Error: for exact x and approximation x Absolute error e = x x. Relative error r = (x x )/x.

2010/9/19. Binary number system. Binary numbers. Outline. Binary to decimal

Numbering Systems. InThisAppendix...

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Number Representation

Attention: This material is copyright Chris Hecker. All rights reserved.

Binary Numbering Systems

Computer Science 281 Binary and Hexadecimal Review

HOMEWORK # 2 SOLUTIO

Lecture 2. Binary and Hexadecimal Numbers

Today. Binary addition Representing negative numbers. Andrew H. Fagg: Embedded Real- Time Systems: Binary Arithmetic

Systems I: Computer Organization and Architecture

NUMBER SYSTEMS. 1.1 Introduction

Q-Format number representation. Lecture 5 Fixed Point vs Floating Point. How to store Q30 number to 16-bit memory? Q-format notation.

Levent EREN A-306 Office Phone: INTRODUCTION TO DIGITAL LOGIC

Solution for Homework 2

This 3-digit ASCII string could also be calculated as n = (Data[2]-0x30) +10*((Data[1]-0x30)+10*(Data[0]-0x30));

Useful Number Systems

Goals. Unary Numbers. Decimal Numbers. 3,148 is s 100 s 10 s 1 s. Number Bases 1/12/2009. COMP370 Intro to Computer Architecture 1

Chapter 2. Binary Values and Number Systems

Number and codes in digital systems

Binary Numbers. Binary Octal Hexadecimal

Digital Design. Assoc. Prof. Dr. Berna Örs Yalçın

Review of Scientific Notation and Significant Figures

Binary Representation. Number Systems. Base 10, Base 2, Base 16. Positional Notation. Conversion of Any Base to Decimal.

THE EXACT DOT PRODUCT AS BASIC TOOL FOR LONG INTERVAL ARITHMETIC

Chapter 7 - Roots, Radicals, and Complex Numbers

Number Conversions Dr. Sarita Agarwal (Acharya Narendra Dev College,University of Delhi)

Session 29 Scientific Notation and Laws of Exponents. If you have ever taken a Chemistry class, you may have encountered the following numbers:

Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer

EE 261 Introduction to Logic Circuits. Module #2 Number Systems

COMPSCI 210. Binary Fractions. Agenda & Reading

Advanced Tutorials. Numeric Data In SAS : Guidelines for Storage and Display Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD

Binary Adders: Half Adders and Full Adders

Numerical Analysis I

2011, The McGraw-Hill Companies, Inc. Chapter 3

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Figure 1. A typical Laboratory Thermometer graduated in C.

DNA Data and Program Representation. Alexandre David

ASCII Characters. 146 CHAPTER 3 Information Representation. The sign bit is 1, so the number is negative. Converting to decimal gives

26 Integers: Multiplication, Division, and Order

CDA 3200 Digital Systems. Instructor: Dr. Janusz Zalewski Developed by: Dr. Dahai Guo Spring 2012

Chapter 1. Binary, octal and hexadecimal numbers

Number Systems. Introduction / Number Systems

FAST INVERSE SQUARE ROOT

5.1 Radical Notation and Rational Exponents

3. Convert a number from one number system to another

Exponents. Learning Objectives 4-1

A new binary floating-point division algorithm and its software implementation on the ST231 processor

Negative Exponents and Scientific Notation

Digital System Design Prof. D Roychoudhry Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Pemrograman Dasar. Basic Elements Of Java

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

How do you compare numbers? On a number line, larger numbers are to the right and smaller numbers are to the left.

CPEN Digital Logic Design Binary Systems

Exponents, Radicals, and Scientific Notation

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Number Systems and Radix Conversion

Chapter 5. Binary, octal and hexadecimal numbers

2 Number Systems. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

Floating point package user s guide By David Bishop (dbishop@vhdl.org)

Arithmetic in MIPS. Objectives. Instruction. Integer arithmetic. After completing this lab you will:

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

DIGITAL-TO-ANALOGUE AND ANALOGUE-TO-DIGITAL CONVERSION

NUMBER SYSTEMS. William Stallings

Monday January 19th 2015 Title: "Transmathematics - a survey of recent results on division by zero" Facilitator: TheNumberNullity / James Anderson, UK

Binary Representation

( 214 (represen ed by flipping) (magnitude = -42) = = -42 (flip the bits and add 1)

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Digital to Analog and Analog to Digital Conversion

Binary, Hexadecimal, Octal, and BCD Numbers

AN617. Fixed Point Routines FIXED POINT ARITHMETIC INTRODUCTION. Thi d t t d ith F M k Design Consultant

Instruction Set Architecture (ISA)

Chapter 3 Review Math 1030

Some Functions Computable with a Fused-mac

Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer?

Copy in your notebook: Add an example of each term with the symbols used in algebra 2 if there are any.

Supplemental Worksheet Problems To Accompany: The Pre-Algebra Tutor: Volume 1 Section 1 Real Numbers

10CS35: Data Structures Using C

A White Paper about. MiniSEED for LISS and data compression using Steim1 and Steim2

Digital Logic Design. Introduction

Florida Math for College Readiness

Transcription:

Real Numbers and Floating-Point Arithmetic A real consists of an integer and a fractional part. The number of digits employed for the fractional part determines the precision, while the number of digits of the integer part determines the range of the representation. The transcendental numbers such as Integers work for many, but not all situations Often need to represent real numbers π = 3.14159265 e = 2.71828 0.0000000031 decimal point exponent Sign 6.02 x 10 23 magnitude Mantissa radix (base) Sign magnitude An alternate way to write numbers is in scientific notation where the number is normalized 3.1 10-9 Several advantages Simplifies exchange since has common form Increases accuracy since unnecessary 0 s are replaced Ex: 0.1 10-5, 10.4 10-4 are not normalized. Similarly, we can represent binary numbers in normalized form: (0.1) 2 (1.0) 2 2-1 (normalized) Computer arithmetic that supports such numbers is called floating point. So, in binary: (1.xxxxxxxxx) 2 2 yyyy The representation of real numbers with a pre-determined number of integer and fractional digits is called the fixed point notation (FixP). Example: The real number pi can be represented in a 16-bit register using 8- bit for the integer, and the remaining 8-bit for the fractional part (8.8-bit FixP ) in the following form:

integer part fractional part 3 = 1 1.14159265358.. 1 1 (LSB) 0.28318530716.. 0 (MSB) 0 1 (MSB) 0.56637061432.. 0.13274122864.. 1 0.26548245728.. 0 0.53096491456.. 0.06192982912.. 1 0.12385965824.. 0 0.24771931648.. 0 (LSB) pi (8.8-bit FixP) = ( 00000011.00100100 ) 2 = 3+1/8+1/64 =3.140625 Resolution of this representation is 1/256 = 0.00390625 Error in representation is (pi 3.140625) 0.00096765 The disadvantage of the FixP notation is the loss of relative or percent resolution (U % ) in representing small numbers. Example: Consider the numbers A= 152.005 and B=0.025 in 8.8-bit FixP. A (8.8-bit FixP) = 10011000.00000001 2 and B (8.8-bit FixP) = 00000000.00000110 2. The absolute resolution or precision is in both cases U(A) =U(B) =1/256 0.004. The percent resolution or precision in the representation of A is U % (A) = absolute-precision / significand-magnitude 100 % = (1/256) / 152.005 100 % 0.0025 %. However, for B it is U % (B) = (1/256) /0.025 100 15.6 %. The loss of percent resolution of B is a result of decreased number of significant bits in the register. The scientific notation uses a fixed number of significant-figures (significant-digits) and it provides an almost constant percent resolution for a much larger range of numbers.

Example: Write the numbers A = 152.005 and B = 0.025 in the scientific notation with 6 significant digits and find the percent resolution for each case. <<solution A = 1.52005 10 2, and B = 2.50000 10 2. The percent resolution for the numbers A and B are: U % (A) = 0.00001 / 1.52005 100 = 0.0006 % U % (B) = 0.00001 / 2.50000 100 = 0.0004 % Floating point (FP) notation is the scientific notation of the binary numbers. The format of FP notation depends on the width and sign convention of the significand and exponent fields. A floating point format is composed of three binary-fields: a sign-bit field S, a binary-exponent E, and a binary-significand-fraction F. The value of a non-zero number R is: R = ( 1) S F 2 E, The leading bit of F of a non-zero number is always one. (i.e., 1 F < 2). Example Find the sign S, fraction F, and exponent E of the binary numbers A= 152.005 and B= 0.025. Use 16-bit for F, and 8-bit E fields. <<solution A= 152.005 = 10011000.00000001 2, A = ( 1) 0 1. 0011 0000 0000 001 2 2 7, S = 0; (positive). F must have a leading-one, and a fraction. F = 1. 0011 0000 0000 001 2 152.005 /2 7 1.187539...; In the binary form of A the binary point is shifted right 7 bits to have a leading-one. Each shift is equivalent to dividing by 2, total effect is divide by 2 7 =1/128. This effect is compensated by the exponent term 2 7, giving that E = 00000111 2 = 7.

B = 0.025 = 00000000.00000110 0110 0110 0110 0110 2 B = ( 1) 1 1. 1001 1001 1001 100 2 2 6, S = 1; (negative). E = 6 = 1111 1010 2. F = 1.1001 1001 1001 100 2 0.025/2 6 1.6 Example: A non-standard 24-bit floating point format has 1-bit sign S, 15-bit significand F, 8-bit exponent E in signed-binary notation, and represents r = ( 1) S F 2 E. The significand F is normalized to have both a leading-one and a fraction. Find the binary-fp form of A= 152.005 : S=0; F= 1.0011 0000 0000 00 2 ; and E=7= 0000 0111 2. S F E A (24-bit-FP) = 0 100 1100 0000 0000 0000 0111 2. In this example, the least significant figure of A is not fitting into the significand field of this FP format. Because of the loss of one bit the represented value is 152. The largest positive number which can be stored in this 24-bit notation is obtained as follows: N maximum = 1. 1111 1111 1111 111 2 2 011111112 = 2 2 127 1 2 127 15, 2 128 negligible since 2 127 15 is 3.4 x 10 38 2 15 times smaller than 2 127 IEEE-754 Standard of Floating Point Numbers In IEEE-754 standard, the exponent field is written in biasedsigned-number format, IEEE-754 defines two floating point notation for two different precision.

The single-precision-format requires 32-bit, and it is shortly called float. The single precision FP format has the following fields: b 31 1-bit S sign-bit (1:negative, 0:positive) b 23 b 30 8-bit Es biased-exponent ( Es =E+127) b 0 -b 22 23-bit Fs binary-significand ( Fs =F 1) bits: 31 30 29 28 27 26 25 24 23 22 21 20...... 2 1 0 S Biased-Exponent Es=E+127 Binary-Significand Fs=F 1 fields: (8 bits) ( 23 bits) where, S is set if the number is negative, Es = E+127 is the biased-exponent to represent the exponents from -126 to 128, ( Es=0 is reserved for number-zero). Fs = F 1 is the binary-significand of the number, after the leading-one bit in the normalized fraction of the number is removed, The value of the represented number is R = ( 1) S (Fs+1) 2 Es 127 IEEE-754 single-precision-fp-fields value of the number S (1-bit), Es (8-bit), Fs (23-bit) ( 1) S (Fs+1 ) 2 Es 127 In IEEE-754 float format, Es=0 (or E= 127) is reserved for a zero number, so that a number that contains zeros in all fields is encountered to be zero. The smallest positive number R min has Es=1 (E= 126) and Fs=0, (F=1). It corresponds to 1 2 126 = 1.17 10 38. Any number under this value causes to an underflow condition, because Es=0 is reserved to represent R=0. The largest number R max has Es=255, (or E=+128) and Fs=1 2 23 1 (F 2). It corresponds to 2 2 128 6.8 10 38. Any number over this value causes to an overflow condition, because 8-bit exponent field is not wide enough for Es.

Example: Write the numbers A= 152.005 and B= 0.025 in IEEE- 754 float form. A = 10011000.00000001 2, A = 1 0 1.001 1000 0000 0001 2 2 7, S = 0; F = 1.0011 0000 0000 001 2 152.005/2 7 1.187539...; Fs =.0011 0000 0000 001 2 F 1 = 0.187539... E = 7 ; Es = 7 +127 = 128+6 = 1000 0110 2 S Es Fs A = 0 1000 0110 0011 0000 0000 0010 0000 000 B= 00000000.00000110 01100110 0110 2 B = 1 1 1.100 1100 1100 1100 2 2 6, S = 1; F = 1.100 1100 1100 1100 2 0.025/2 6 1.6 ; Fs =.100 1100 1100 1100 2 1.6 1 0.6 ; E = 6 ; Es = 6 +127 = 121 = 0111 1001 2 S Es Fs B = 1 0111 1001 100 1100 1100 1100 1100 1100 IEEE 754 double-precision-float format extends the precision to 53-bits (16-significant-figures in decimal), and the range of the floating point numbers to 11-bit ( R max 10 308 ). From MSB to LSB, the fields of the double precision format are: W1: Most b 31 1-bit S sign-bit (0:pozitive, 1:negative) Significant b 20 b 30 11-bit Ed biased Exponent (Ed=E+1023) Word b 0 -b 19 52-bit Fd binary-significand (Fd=F 1) W0: L.S.Word b 31 -b 0