2 Computer arithmetics


Lorena Simon
3 years ago

Digital systems are implemented on hardware with finite word length. Implementations require special attention because of possible quantization and arithmetic errors.

Part I: Real number representations
- Characteristics
- Basis: integers
- Fixed-point numbers
- Floating-point formats

Part II: Design flows of fixed-point solutions
- Analysis-based flow
- Simulation-based flow

2.1 Background: real number representations

Two categories: fixed-point and floating-point. In both cases, a certain number of significant bits represent an integer value, and an associated scaling maps that integer to some real number.

- Floating-point representation: the sign (s) and significand (f) are the significant bits; the exponent (e) gives the scaling.
- Fixed-point representation: the integer / fraction bits are the significant bits; the scaling is fixed during design.

Exponent: floating-point hardware performs the appropriate scaling at run time. For fixed-point numbers the scaling is a design-time decision, which can be hard.

Implications for choosing the computational platform:
- Do we really need an optimized fixed-point solution?
- Or do we want an easier, money-saving design process?
- In addition, the floating-point HW might not implement all features of the standards, and its word length might also be limited.

2.1.1 Characterization of representations

- Word length: the total number of bits in a representation
- Bit precision: the number of significant bits
- Range: the smallest and the largest representable value (exceeding it causes overflow)
- Precision: the smallest interval between two consecutive numbers (unit in the last place, ulp); it determines roundoff noise and underflow
- Dynamic range: the ratio between the largest and smallest absolute values. Dynamic range in dB $= 20 \log_{10}(A_{\max}/A_{\min})$.

Fixed-point case: all numbers are represented with the same precision; the dynamic range grows by about 6 dB per bit.
Floating-point case: large numbers are represented with less precision; the dynamic range is huge.

                              32-bit signed   32-bit signed   IEEE 754 single precision
                              integer         fractional      floating-point
word length                   32              32              32
bit precision                 32              32              24 (= 23 bit significand + sign bit)
range (max. absolute value)   ...             ...             ...
precision (ulp)               ...             ...             ... (for E_min), ... (for E_max)
dynamic range                 187 dB          187 dB          1535 dB
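The fixed-point dynamic-range figures can be checked by evaluating the formula directly. The following is a minimal sketch, not part of the original slides: the function name `dynamic_range_db` is hypothetical, and the smallest absolute value is assumed to be one ulp.

```python
import math

def dynamic_range_db(a_max, a_min):
    """Dynamic range in dB: 20 * log10(A_max / A_min)."""
    return 20.0 * math.log10(a_max / a_min)

# 32-bit signed integer: largest magnitude 2**31, smallest nonzero magnitude 1 (assumed = 1 ulp).
int32_db = dynamic_range_db(2**31, 1)

# 32-bit signed fractional: largest magnitude 1, smallest nonzero magnitude 2**-31.
frac32_db = dynamic_range_db(1.0, 2.0**-31)

# Each additional bit doubles the ratio, which adds about 6 dB.
per_bit_db = dynamic_range_db(2, 1)

print(round(int32_db), round(frac32_db), round(per_bit_db, 2))
```

Both fixed-point formats give the same value because scaling changes the largest and smallest magnitudes by the same factor.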

2.1.2 Integers

Fixed-point and floating-point representations are based on representations of integer values. The choice of a particular representation depends on what has to be done with the numbers.

1. Unsigned integers
2. Signed integers
   - sign-magnitude: encoding of the significant bits in floating-point formats
   - one's complement: the advantage is easy negation
   - two's complement: the common choice for fixed-point formats
   - biased representation (excess-B): a bias value, B, is subtracted from an unsigned integer value. Examples: offset binary (ADC output, DAC input), encoding of exponents in floating-point formats

[Table: values of the 3-bit patterns a_2 a_1 a_0 under sign-magnitude, one's complement, two's complement, and offset binary.]
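The four signed encodings can be compared by decoding every 3-bit pattern. This is an illustrative sketch only: the helper name `decode` is hypothetical, and the offset-binary bias is assumed to be B = 2^(width-1) = 4, which the text does not state.

```python
def decode(bits, width=3):
    """Interpret an unsigned bit pattern under four signed encodings.

    Returns (sign_magnitude, ones_complement, twos_complement, offset_binary).
    """
    sign = bits >> (width - 1)                    # most significant bit
    magnitude = bits & ((1 << (width - 1)) - 1)   # remaining bits
    sign_magnitude = -magnitude if sign else magnitude
    ones_complement = bits - ((1 << width) - 1) if sign else bits
    twos_complement = bits - (1 << width) if sign else bits
    offset_binary = bits - (1 << (width - 1))     # assumed bias B = 2**(width-1)
    return sign_magnitude, ones_complement, twos_complement, offset_binary

print("a2a1a0  sign-mag  1's compl  2's compl  offset")
for pattern in range(8):
    sm, oc, tc, ob = decode(pattern)
    print(f"{pattern:03b}     {sm:>8}  {oc:>9}  {tc:>9}  {ob:>6}")
```

Note that sign-magnitude and one's complement each have two patterns for zero (100 and 111 respectively decode to "negative zero", printed as 0 here), while two's complement and offset binary use that pattern for an extra negative value.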

2.1.3 Fixed-point representation

Fixed-point numbers are based on scaling of integers (unsigned or signed). Two's complement integers are used as the basis for signed fixed-point formats.

(1) Binary point scaling: the value represented is $\tilde{V} = V_{\mathrm{int}} / 2^{n}$, where $V_{\mathrm{int}}$ is the integer value represented by the bit string and $n$ is the number of fraction bits.

Notation: up.n for unsigned and sp.n for signed formats, where p is the word length and n is the number of fraction bits. For example, s8.3 is a signed 8-bit format with 3 fraction bits.
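Binary point scaling can be sketched as a pair of conversion helpers. The names `to_fixed` and `from_fixed` are hypothetical, and raising an error on overflow is an assumption made for illustration; real hardware would wrap or saturate instead.

```python
def to_fixed(value, p, n):
    """Convert a real value to the integer V_int of an sp.n format.

    Uses Python's round(), which rounds exact ties to the even integer.
    Raises OverflowError if the result does not fit in p bits (assumed policy).
    """
    v_int = round(value * 2**n)
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    if not lo <= v_int <= hi:
        raise OverflowError(f"{value} does not fit in s{p}.{n}")
    return v_int

def from_fixed(v_int, n):
    """Value represented by V_int with n fraction bits: V_int / 2**n."""
    return v_int / 2**n

v = to_fixed(2.625, 8, 3)              # s8.3
print(v, format(v & 0xFF, "08b"), from_fixed(v, 3))
```

In s8.3 the value 2.625 is stored as the integer 21, since 21 / 2^3 = 2.625.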

Example. 4-bit signed fixed-point numbers in the formats s4.-1, s4.0, s4.1, s4.2, s4.3, s4.4, and s4.5, each with bits a_3 a_2 a_1 a_0.

[Table columns: format, binary point position, range, precision (ulp).]
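The range and precision of each s4.n format follow directly from the definition $\tilde{V} = V_{\mathrm{int}} / 2^{n}$ applied to the smallest and largest 4-bit two's complement integers. A minimal sketch, with the hypothetical helper name `format_range`:

```python
def format_range(p, n):
    """Return (min_value, max_value, ulp) of the signed format sp.n."""
    ulp = 2.0 ** -n                      # weight of the least significant bit
    v_min = -(2 ** (p - 1)) * ulp        # most negative two's complement integer, scaled
    v_max = (2 ** (p - 1) - 1) * ulp     # most positive two's complement integer, scaled
    return v_min, v_max, ulp

print("format  range                 ulp")
for n in range(-1, 6):
    v_min, v_max, ulp = format_range(4, n)
    print(f"s4.{n:<3}  [{v_min}, {v_max}]".ljust(30) + f"{ulp}")
```

A negative n (s4.-1) places the binary point to the right of the stored bits, and n larger than the word length (s4.5) places it to the left of them; in both cases the same four bits are stored.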

(2) Slope-bias scaling: the value represented is $\tilde{V} = s \, V_{\mathrm{int}} + b$, where $s > 0$ is called the slope and $b$ is the bias (offset). The slope can be written as $s = f \cdot 2^{e}$, where $1 \le f < 2$ is called the fractional slope and $e$ gives the radix point position.
- binary point scaling is the special case $b = 0$, $f = 1$, $e = -n$
- the precision (the weight of the least significant bit) is equal to the slope
- the goal of slope (and bias) selection is to utilize the full dynamic range

Example. We want to represent angles in the range $[-\pi, \pi)$ with maximal precision using 6 bits.

1) Binary point scaling:
- we need two integer bits in addition to the sign bit, since the integer part of $\pi$ is 3 = 011
- thus the format to be used is s6.3
- the range is $[-4, 4 - 2^{-3}]$
- the precision of the format is ulp $= 2^{-3} = 0.125$

2) Slope-bias scaling:
- using zero bias, we choose $s = \pi / 2^{5}$ to use the full dynamic range
- the range is then $\pi \, [-1, 1 - 2^{-5}]$ and ulp $= s \approx 0.098$
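The angle example can be reproduced numerically. The sketch below is illustrative: the function name `quantize` is hypothetical, and saturating out-of-range codes is an assumption, not something the example specifies.

```python
import math

def quantize(value, slope, bias, p):
    """Quantize value to a p-bit signed integer code under slope-bias scaling.

    Returns (v_int, represented_value), where represented_value = slope * v_int + bias.
    Codes outside the p-bit two's complement range are saturated (assumed policy).
    """
    v_int = round((value - bias) / slope)
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    v_int = max(lo, min(hi, v_int))
    return v_int, slope * v_int + bias

SLOPE_BINARY_POINT = 2.0 ** -3       # s6.3: ulp = 0.125
SLOPE_FULL_RANGE = math.pi / 2**5    # slope-bias with zero bias: ulp ~ 0.098

for angle in (-math.pi, 2.0, 3.0):
    print(angle,
          quantize(angle, SLOPE_BINARY_POINT, 0.0, 6),
          quantize(angle, SLOPE_FULL_RANGE, 0.0, 6))
```

With binary point scaling the angle $-\pi$ maps to code -25, so the codes from -32 to -26 are never used; with the slope $\pi/2^5$ it maps to -32, the most negative 6-bit code.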

2.1.4 Fixed-point arithmetics

(1) Addition: guard bits
- sp.n + sp.n → s(p+1).n
- guard bits g are added to the accumulator of the MAC data path: n × n multiply → 2n-bit product → add in a (2n+g)-bit accumulator
- $g = \lceil \log_2 N \rceil$ guard bits suffice, where N is the number of terms to be added
- in MAC-based FIR filtering, the bound for g depends on the coefficient values

(2) Multiplication: the law of conservation of bits
- sp1.n1 × sp2.n2 → s(p1+p2).(n1+n2)
- note: one extra integer bit is introduced (e.g. s4.3 × s4.3 → s8.6); it is needed only for the product of the two largest negative values, (−1) × (−1) = 1
- if the largest negative value does not occur, that extra integer bit is not needed
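The word-length rules for addition and multiplication can be checked on the integer level. This is a sketch under stated assumptions: the helper names `guard_bits`, `fits`, and `multiply` are hypothetical, and the guard-bit count is the worst-case value that ignores coefficient magnitudes.

```python
def guard_bits(num_terms):
    """Worst-case guard bits for summing num_terms values: ceil(log2(N))."""
    return (num_terms - 1).bit_length()

def fits(v_int, p):
    """True if v_int is representable as a p-bit two's complement integer."""
    return -(1 << (p - 1)) <= v_int <= (1 << (p - 1)) - 1

def multiply(a_int, n1, b_int, n2):
    """Exact fixed-point product: integer product and its number of fraction bits."""
    return a_int * b_int, n1 + n2

# s4.3 x s4.3: the worst case is (-1) * (-1), i.e. integers -8 * -8.
prod, n = multiply(-8, 3, -8, 3)
print(prod / 2**n, fits(prod, 8), fits(prod, 7))

# Accumulating N = 16 such products needs 8 + guard_bits(16) bits in the worst case.
N = 16
worst_sum = N * prod
print(guard_bits(N), fits(worst_sum, 8 + guard_bits(N)))
```

Every other s4.3 product, for example (-8) × 7 = -56, already fits in 7 bits, which is why the extra integer bit is needed only when the largest negative value occurs in both operands.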

Modes of arithmetic:
- In integer arithmetic, fixed-point values are treated as integers, and the programmer must take care that overflows do not occur (e.g. by intermediate scaling operations or coefficient magnitude restrictions).
- In fractional arithmetic, one uses fractional fixed-point values. Multiplication and storing of the result can be implemented in a special manner:
  (1) integer multiplication of operands A and B, each with a sign bit S followed by the binary point (saturating to handle (−1) × (−1)); the product has a duplicated sign bit: S (S) x x
  (2) arithmetic shift left (removing the redundant sign bit) + binary point movement: S x x
  (3) (rounding +) taking the most significant bits as the result: S x

Multiplication by a power of two:
(1) can be implemented simply as an arithmetic shift
- left: may cause overflow; right: precision may be lost
(2) can be implemented as a movement of the binary point to the right/left
- overflow and precision loss are not possible
- basis of the CORDIC algorithm discussed later
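The fractional multiplication steps can be sketched for 16-bit operands. The word length (s16.15), the name `frac_mul`, and the round-half-up step are assumptions chosen for illustration; the text does not fix any of them.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def saturate(v, lo, hi):
    """Clamp v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))

def frac_mul(a, b):
    """Multiply two s16.15 integers and return an s16.15 result."""
    prod = a * b                          # s32.30: two sign bits
    shifted = prod << 1                   # s32.31: redundant sign bit removed
    shifted = saturate(shifted, INT32_MIN, INT32_MAX)   # only (-1) * (-1) overflows
    rounded = (shifted + (1 << 15)) >> 16  # add half an output ulp, keep the upper 16 bits
    return saturate(rounded, INT16_MIN, INT16_MAX)

half = 1 << 14                            # 0.5 in s16.15
print(frac_mul(half, half) / 2**15)       # 0.5 * 0.5
print(frac_mul(INT16_MIN, INT16_MIN))     # (-1) * (-1) saturates to the largest value
```

Python integers do not overflow, so the two `saturate` calls stand in for what saturating hardware would do at the 32-bit and 16-bit boundaries.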

(3) Signal quantization: rounding of the arithmetic results to specific word lengths
- There are different kinds of rounding methods: (1) truncation: simply discard the least significant bits, (2) round-to-nearest, (3) convergent rounding, (4) magnitude truncation
- Rounding introduces roundoff noise e: the quantizer $s \to s_q$ is modelled as $s_q = s + e$
- Depending on the rounding method, the noise can be biased (expectation $E\{e\} \neq 0$)
- The quantization noise gets amplified through noise transfer functions

(4) Overflow handling: hardware may use guard bits, wrapping, or saturation
- in the case of wrapping, overflows are ignored in HW. Therefore, one must either (1) ascertain that the final result is within the range, or (2) check that overflows cannot occur (by analysis/simulation)
- saturating operations are not associative! Therefore, some standards for algorithms may specify the exact order of performing operations
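The bias of the rounding methods and the non-associativity of saturation can both be seen on small integers. This is a minimal sketch: all function names are hypothetical, round-to-nearest is assumed to round ties upward, and the mean error is measured over one small test set rather than a real signal.

```python
def truncate(v, k):
    """Discard the k least significant bits (rounds toward minus infinity)."""
    return v >> k

def round_nearest(v, k):
    """Round to nearest, ties upward (assumed tie rule)."""
    return (v + (1 << (k - 1))) >> k

def convergent(v, k):
    """Round to nearest, ties to the even result."""
    q, r = v >> k, v & ((1 << k) - 1)
    half = 1 << (k - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q

def magnitude_truncate(v, k):
    """Discard bits of the magnitude (rounds toward zero)."""
    return -((-v) >> k) if v < 0 else v >> k

def mean_error(rounder, k=2, values=range(-8, 8)):
    """Mean of e = s_q - s, in units of the input LSB, over the test values."""
    errors = [rounder(v, k) * (1 << k) - v for v in values]
    return sum(errors) / len(errors)

for method in (truncate, round_nearest, convergent, magnitude_truncate):
    print(method.__name__, mean_error(method))

def sat_add8(a, b):
    """Saturating addition for 8-bit two's complement values."""
    return max(-128, min(127, a + b))

# The grouping changes the result when an intermediate sum saturates.
print(sat_add8(sat_add8(100, 100), -100), sat_add8(100, sat_add8(100, -100)))
```

On this test set truncation and round-to-nearest with ties upward show a nonzero mean error, while convergent rounding and magnitude truncation average to zero.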

2.1.5 Floating-point representation

The design of the representation is based on HW implementation issues.

Bit string parts: (1) sign bit, (2) exponent, and (3) significand (the order is important!).

Modes of bit string interpretation:
- normalized number: the exponent is adjusted so that the maximum precision is achieved: $\tilde{V} = (-1)^{s} (1 + f) \, 2^{e - e_b}$, where $s$ is the value of the sign bit, $e$ is the unsigned integer encoded by the exponent part, $e_b$ is the exponent bias, and $f$ is the unsigned fractional fixed-point value encoded by the significand
- zero: representations of +0 and −0
- denormalized number: for representing small numbers, which fill the underflow gap around zero
- infinity, not-a-number (NaN)

IEEE 754 standard formats are commonly used:
- single precision (32 bits = sign bit + exponent 8 bits + significand 23 bits; $e_b = 127$)
- double precision (64 bits = sign bit + exponent 11 bits + significand 52 bits; $e_b = 1023$)
- half precision (16 bits = sign bit + exponent 5 bits + significand 10 bits; $e_b = 15$): used especially in computer graphics

A non-standard format may be designed for the arithmetic of a particular application. Support for all modes might not be needed in HW.
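The normalized-number formula can be applied to the bit fields of a single-precision value. The sketch below uses Python's standard `struct` module to obtain the bit pattern; the function name `decode_normalized_single` is hypothetical, and other modes are rejected rather than decoded.

```python
import struct

def decode_normalized_single(x):
    """Split a single-precision value into (s, e, f) and rebuild it.

    Returns (s, e, f, value) with value = (-1)**s * (1 + f) * 2**(e - 127).
    Raises ValueError unless the number is normalized (1 <= e <= 254).
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                      # sign bit
    e = (bits >> 23) & 0xFF             # 8-bit biased exponent
    f = (bits & 0x7FFFFF) / 2**23       # 23-bit fraction as a value in [0, 1)
    if not 1 <= e <= 254:
        raise ValueError("not a normalized number")
    value = (-1) ** s * (1 + f) * 2.0 ** (e - 127)
    return s, e, f, value

print(decode_normalized_single(-6.5))
```

Here -6.5 = -1.625 × 2^2, so the sign bit is 1, the stored exponent is 2 + 127 = 129, and the fraction is 0.625.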

Example. IEEE 754 single precision

Mode            When
normalized      e ∈ {1, 2, ..., 254}
zero            e = 0, f = 0
denormalized    e = 0, f ≠ 0
infinity        e = 255, f = 0
not-a-number    e = 255, f ≠ 0
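The mode conditions translate directly into a classifier on the exponent and significand fields. A minimal sketch, with the hypothetical function name `classify_single` taking the 32-bit pattern as an unsigned integer:

```python
def classify_single(bits):
    """Return the interpretation mode of a 32-bit single-precision bit pattern."""
    e = (bits >> 23) & 0xFF     # 8-bit exponent field
    f = bits & 0x7FFFFF         # 23-bit significand field
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 255:
        return "infinity" if f == 0 else "not-a-number"
    return "normalized"

for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:#010x}", classify_single(pattern))
```

The sign bit plays no role in the classification, which is why both 0x00000000 and 0x80000000 are reported as zero.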
