2 Computer arithmetic

Digital systems are implemented on hardware with finite word length. Implementations require special attention because of possible quantization and arithmetic errors.

Part I: Real number representations
- Characteristics
- Basis: integers
- Fixed-point numbers
- Floating-point formats

Part II: Design flows of fixed-point solutions
- Analysis based flow
- Simulation based flow
2.1 Background: real number representations

Two categories: fixed-point and floating-point. In both cases, a certain number of significant bits represent an integer value, and an associated scaling maps that integer to some real number.

[Diagram: a floating-point representation consists of a sign (s) and a significand (f), which give the significant bits, plus an exponent (e), which gives the scaling; a fixed-point representation consists of the integer/fraction bits alone, with the scaling fixed during design.]

Exponent: floating-point hardware performs the appropriate scaling at run time. In the case of fixed-point numbers this is a design-time decision. Can be hard!

Implications for choosing the computational platform: do we really need an optimized fixed-point solution, or do we settle for the easier, money-saving floating-point design process? In addition, floating-point HW might not contain all features of the standards, and its word length might also be limited.
2.1.1 Fixed-point representation

Fixed-point numbers are based on scaling of integers (unsigned or signed). Two's complement integers are used as the basis for signed fixed-point.

(1) Binary point scaling: the value represented is

    Ṽ = V_int / 2^n,

where V_int is the integer value represented by the bit string, and n is the number of fraction bits.

Notation: up.n for unsigned and sp.n for signed formats, where p is the word length and n is the number of fraction bits. For example, s8.3 denotes a signed 8-bit format with 3 fraction bits.
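As an illustration (not part of the original slides), binary point scaling and the s8.3 example can be sketched in Python; the helper names are my own:

```python
def fixed_to_real(v_int: int, n: int) -> float:
    """Binary point scaling: V~ = V_int / 2^n, with n fraction bits."""
    return v_int / 2**n

def bits_to_real(bits: str, n: int) -> float:
    """Interpret a bit string as a signed (two's complement) sp.n value."""
    p = len(bits)
    v_int = int(bits, 2)
    if bits[0] == "1":          # MSB set: negative in two's complement
        v_int -= 1 << p
    return fixed_to_real(v_int, n)

# s8.3 examples: 8-bit word, 3 fraction bits
print(bits_to_real("00001100", 3))   # 12 / 8  =  1.5
print(bits_to_real("11111111", 3))   # -1 / 8  = -0.125
```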
(2) Slope-bias scaling: the value represented is

    Ṽ = s·V_int + b,

where s > 0 is called the slope, and b is the (offset) bias. The slope can be represented as s = f·2^e, where 1 ≤ f < 2 is called the fractional slope and e shows the radix point position.
- binary point scaling is a special case of this: b = 0, f = 1, e = −n
- the precision (the weight of the least significant bit) is equal to the slope
- the goal of slope (and bias) selection: utilization of the full dynamic range

Example. We want to represent angles in the range [−π, π) with maximal precision using 6 bits.
1) Binary point scaling:
- we must have two integer bits, since the integer part can be as large as 3 (011₂ = 3)
- thus the format to be used is s6.3
- the range is [−4, 4 − 2^−3]
- the precision of the format is ulp = 2^−3 = 0.125
2) Slope-bias scaling:
- using zero bias, we choose s = π/2^5 to use the full dynamic range
- the range is then π·[−1, 1 − 2^−5] and ulp = s ≈ 0.0982
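The angle example can be checked numerically. This sketch (my own addition; the function `quantize` is a hypothetical helper) quantizes a value under slope-bias scaling and compares the two ulp values:

```python
import math

def quantize(x, s, b, p):
    """Nearest representable value V~ = s*V_int + b in a signed p-bit format."""
    v_int = round((x - b) / s)
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    v_int = max(lo, min(hi, v_int))     # saturate to the p-bit integer range
    return s * v_int + b

p = 6
ulp_bp = 2**-3            # 1) binary point scaling s6.3: ulp = 0.125
ulp_sb = math.pi / 2**5   # 2) slope-bias with zero bias:  ulp ~ 0.0982

print(ulp_bp, round(ulp_sb, 4))          # 0.125 0.0982
print(quantize(1.0, ulp_sb, 0.0, p))     # nearest representable to 1.0
print(quantize(100.0, ulp_sb, 0.0, p))   # out of range: saturates to 31*s
```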
2.1.2 Fixed-point arithmetics (1) Addition: guard bits - sp.n + sp.n s(p+1).n - guard bits g added to the accumulator of the MAC data path: n n multiply 2n add 2n+g - g log 2 (N), where N is the number of terms to be added - in MAC based FIR filtering, the bound for g depends on the coefficient values (2) Multiplication: The law of conservation of bits: sp 1.n 1 sp 2.n 2 s(p1+p 2 ).(n 1 +n 2 ) - note: one extra integer bit introduced (e.g. s4.3 s4.3 s8.6; 4-3-1 = 0, 8-6-1 = 1) - if the largest negative value does not occur, that extra integer bit is not needed 5
Modes of arithmetic:
- in integer arithmetic, fixed-point values are treated as integers, and the programmer must take care that overflows do not occur (e.g. intermediate scaling operations, coefficient magnitude restrictions)
- in fractional arithmetic, one uses fractional fixed-point values. Multiplication and storing the result can be implemented in a special manner:

[Diagram: operands A and B (sign bit S, binary point, fraction bits x…x) are multiplied as integers (saturating to handle −1 × −1); the double-length product has a redundant sign bit, which is removed by an arithmetic shift left plus a binary point movement; the result is obtained by (rounding and) taking the most significant bits.]

Multiplication by a power of two:
(1) can be implemented simply as an arithmetic shift
- left: may cause overflow; right: precision may be lost
(2) or as a movement of the binary point to the right/left
- overflow and precision loss are not possible
- this is the basis of the CORDIC algorithm discussed later
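The fractional multiplication scheme sketched in the diagram can be illustrated in Python (my own sketch, truncation used for rounding; `frac_mul` is a hypothetical helper name):

```python
def frac_mul(a_int, b_int, p):
    """Fractional multiply of two signed p-bit values with p-1 fraction bits.

    The integer product has a redundant sign bit; shift left once to drop
    it, then keep the p most significant bits (truncation). The case
    (-1)*(-1) = +1 is not representable and must be saturated.
    """
    max_pos = (1 << (p - 1)) - 1
    prod = a_int * b_int             # 2p-bit product with two sign bits
    prod <<= 1                       # arithmetic shift left: drop one sign bit
    result = prod >> p               # take the p most significant bits
    return min(result, max_pos)      # saturate the -1 * -1 case

# s8.7 example: 0.5 * 0.5 = 0.25
a = 1 << 6                           # 0.5 = 64/128
print(frac_mul(a, a, 8) / 2**7)      # 0.25
print(frac_mul(-128, -128, 8))       # 127: saturated, +1.0 not representable
```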
(3) Signal quantization: rounding of the arithmetic results to specific word lengths
- There are different kinds of rounding methods:
  (1) truncation: simply discard the least significant bits
  (2) round-to-nearest
  (3) convergent rounding
  (4) magnitude truncation
- Introduces roundoff noise e = s_q − s, where s_q = round(s); modelled as s → s_q = s + e
- Depending on the rounding method, the noise can be biased (expectation E{e} ≠ 0)
- The quantization noise gets amplified through noise transfer functions

(4) Overflow handling: hardware may use guard bits, wrapping, or saturation
- in the case of wrapping, overflows are neglected in HW. Therefore, one must either
  (1) ascertain that the final result is within the range, or
  (2) check that overflows cannot occur (by analysis/simulation)
- saturating operations are not associative! Therefore some standards for algorithms may specify the exact order of performing operations
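The bias of different rounding methods is easy to see empirically. This sketch (my own addition) compares truncation with round-to-nearest on random data:

```python
import random

def truncate(v, k):
    """(1) truncation: discard the k least significant bits (floors)."""
    return (v >> k) << k

def round_nearest(v, k):
    """(2) round-to-nearest: add half an LSB, then truncate."""
    return ((v + (1 << (k - 1))) >> k) << k

random.seed(0)
k = 4
vals = [random.randrange(-1 << 15, 1 << 15) for _ in range(10000)]
bias_trunc = sum(truncate(v, k) - v for v in vals) / len(vals)
bias_round = sum(round_nearest(v, k) - v for v in vals) / len(vals)
print(bias_trunc)   # strongly biased: mean error near -(2**k - 1)/2 = -7.5
print(bias_round)   # nearly unbiased: mean error near 0
```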
2.1.3 Floating-point representation

Design of the representation is based on HW implementation issues.
Bit string parts: (1) sign bit, (2) exponent, and (3) significand (the order is important!).

Modes of bit string interpretation:
- normalized number: the exponent is adjusted so that maximum precision is achieved:

    Ṽ = (−1)^s · (1 + f) · 2^(e − e_b),

  where s is the value of the sign bit, e is the unsigned integer encoded by the exponent part, e_b is the exponent bias, and f is the unsigned fractional fixed-point value encoded by the significand
- zero: representations of +0 and −0
- denormalized number: for representing small numbers, which fill the underflow gap around zero
- infinity, not-a-number (NaN)

IEEE 754 standard formats are commonly used:
- single precision (32 bits = sign bit + exponent 8 bits + significand 23 bits; e_b = 127)
- double precision (64 bits = 1 + 11 + 52; e_b = 1023)
- half precision (16 bits = 1 + 5 + 10; e_b = 15): used especially in computer graphics

A non-standard format may be designed for the arithmetic of a particular application. Support for all modes might not be needed in HW.
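The normalized-mode formula can be verified by picking apart an actual single-precision bit pattern. This sketch (my own addition) uses Python's `struct` module to obtain the IEEE 754 encoding:

```python
import struct

def decode_float32(x: float):
    """Split a value, stored as IEEE 754 single precision, into (s, e, f) fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31              # 1 sign bit
    e = (bits >> 23) & 0xFF     # 8 exponent bits (biased)
    f = bits & 0x7FFFFF         # 23 significand bits
    return s, e, f

def fields_to_value(s, e, f, e_b=127, frac_bits=23):
    """Normalized mode: V~ = (-1)^s * (1 + f) * 2^(e - e_b)."""
    return (-1)**s * (1 + f / 2**frac_bits) * 2.0**(e - e_b)

s, e, f = decode_float32(-6.25)      # -6.25 = -1.5625 * 2^2
print(s, e, f)                       # 1 129 4718592
print(fields_to_value(s, e, f))      # -6.25
```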
Example. IEEE 754 single precision

  Mode           When
  normalized     e ∈ {1, 2, ..., 254}
  zero           e = 0,   f = 0
  denormalized   e = 0,   f ≠ 0
  infinity       e = 255, f = 0
  not-a-number   e = 255, f ≠ 0
2.2 Design of a fixed-point signal processing solution

1. The first step in the development of signal processing systems is the development/selection of the algorithms:
- design of a floating-point reference model
- exploration of the numerical properties of the algorithm using that model
- e.g. Matlab provides excellent tools for the design

2. Then, the fixed-point model should be designed, using the floating-point model as a reference:
- the fixed-point model serves as the verification model for the final implementation
- details of the fixed-point design reflect the primitives of the hardware, e.g. split ALUs, CORDIC processors, saturation/wrapping arithmetic...
- sufficient word lengths for the various number objects must be determined (what ranges and precisions are needed?)

Two basic approaches for the conversion work are:
- Analytic approach. Favored by algorithm designers who do not have a complete understanding of the hardware.
- Simulation approach. Favored by HW designers who frown upon the mathematics of the models.
2.2.1 Analysis based design flow

Design outline:
- Algorithm design via mathematical modelling
- Analysis of the coefficient quantization effects
- Analysis of the rounding and scaling effects
- Selection of the word lengths
- Simulation to verify the design results

(1) Coefficient quantization: one must check that the specifications are met and, in the case of IIR filtering, that the filter remains stable.

Analysis example:
- assume an FIR filter with N coefficients
- an sp.(p−1) format for the coefficients is to be selected
- quantization introduces a parallel filter: H_q(z) = H(z) + E(z)
- the maximum quantization error for the format is ulp/2; since ulp = 2^−(p−1), we have |e(n)| ≤ 2^−p
- assuming the worst-case error for all coefficients, we get |E(ω)| ≤ 2^−p·N
- the achievable stop-band attenuation is therefore bounded by −20·log10(2^−p·N)
- having a specification available, we get a lower bound for the word length p
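The bound above can be turned into a quick word-length check. This is my own sketch with hypothetical numbers (64 taps, 60 dB attenuation), not a figure from the slides:

```python
import math

def coeff_wordlength(N, atten_db):
    """Smallest p for which the worst-case coefficient-quantization error
    floor 20*log10(2**-p * N) stays at or below -atten_db.

    From 2**-p * N <= 10**(-atten_db/20):
        p >= log2(N) + (atten_db/20) * log2(10).
    """
    p = math.log2(N) + atten_db / 20 * math.log2(10)
    return math.ceil(p)

# hypothetical spec: 64-tap FIR filter, 60 dB stop-band attenuation
print(coeff_wordlength(64, 60))   # 16-bit coefficients suffice per this bound
```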
(2) Effect of the roundoff noise
- one can think in terms of the impulse response h(n) from the round-off location to the output
- if the quantization interval is q, then the signal quantization noise power is σ²_ir = q²/12
- the effect at the output is σ²_or = σ²_ir · Σ_{n=0}^{∞} h²(n)
- if the noises are assumed uncorrelated, the effects of the round-off locations sum up
- however, the noises can be correlated; e.g. in telecommunications the signals are often periodic
- such correlation leads to peaking in the noise spectrum

(3) Overflows in IIR structures
- the adders on the feedback path may overflow
- use of saturation arithmetic may lead to instability and spoil the response
- another solution is to perform input scaling to reduce its range

Note on analysis:
- approximation formulas may only provide quick checks and starting points for the design
- more detailed derivations are needed, or a simulation-based approach taken
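The output-noise formula can be evaluated numerically. This sketch (my own addition) uses a hypothetical first-order IIR example, y(n) = a·y(n−1) + x(n), whose noise gain has the closed form 1/(1 − a²):

```python
def output_noise_power(q, h):
    """sigma_or^2 = (q^2 / 12) * sum_n h(n)^2 for a round-off source whose
    impulse response to the output is h (a truncated sequence)."""
    sigma_ir2 = q**2 / 12            # input noise power for interval q
    return sigma_ir2 * sum(x * x for x in h)

a, M = 0.5, 100
h = [a**n for n in range(M)]         # truncated impulse response
q = 2**-15                           # quantization step of an s16.15 format
print(output_noise_power(q, h))      # ~ (q^2/12) * 1/(1 - a^2) = (q^2/12)*4/3
```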
2.2.2 Tool/simulation based design flow

Simulation-based development of the fixed-point reference model can be based on writing C/C++ code.
- alternative: in Matlab one can use the Fixed-Point Toolbox
- the Simulink environment also contains features that can be used to implement fixed-point simulations

Outline of the design process:
1. Algorithm design using floating-point precision
2. Conversion into a full-precision fixed-point model
3. Determination of the coefficient quantization effects
4. Determination of the maxima via value logging
5. Reduction of the word lengths for the implementation platform

Demonstration: fixed-point modelling support in the Simulink environment
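Step 4, determining the maxima via value logging, can be sketched as a small helper that records signal extrema during a floating-point simulation and derives the integer word length from them (my own sketch; `RangeLogger` is a hypothetical name, not a toolbox class):

```python
import math

class RangeLogger:
    """Log the extrema of a signal during simulation, then derive the
    number of integer bits a signed fixed-point format needs to hold it."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def log(self, x):
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        return x                      # pass-through: insert into the data flow

    def integer_bits(self):
        """Integer bits (excluding the sign bit) covering the logged range."""
        m = max(abs(self.lo), abs(self.hi))
        return max(0, math.ceil(math.log2(m))) if m > 0 else 0

logger = RangeLogger()
for x in [0.3, -2.7, 1.9, 3.2]:       # hypothetical simulation samples
    logger.log(x)
print(logger.integer_bits())          # 2: values fit in an s(3+n).n format
```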