A New Algorithm for Carry-Free Addition of Binary Signed-Digit Numbers


2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines

Klaus Schneider and Adrian Willenbücher
Embedded Systems Group, University of Kaiserslautern, Kaiserslautern, Germany
{schneider, willenbuecher}@cs.uni-kl.de

Abstract

Signed-digit (SD) numbers generalize traditional radix numbers by allowing negative digits within a certain range. Typically, this leads to redundant number representations that can be used to avoid the carry propagation problem of addition of radix numbers. Unfortunately, as proved by Avizienis, the standard algorithm for carry-free addition of SD numbers does not work for the binary case. In this paper, we therefore construct a special algorithm for the carry-free addition and subtraction of binary SD numbers, i.e., addition and subtraction of $n$-digit numbers are performed with circuits of depth $O(1)$ and size $O(n)$. This is possible by computing, in addition to the transfer digits used by the standard algorithm, one additional bit that allows us to distinguish the relevant cases and thereby avoid the propagation of dependencies. The additional bit and the transfer digit used to compute the sum digit at position $i$ depend only on the summand digits at positions $i$ and $i-1$, so that all sum digits can be computed with a hardware circuit of a depth that is independent of the number of digits. We first explain the basics of the standard addition algorithm to derive the additional information needed to fix the algorithm for the binary case. After proving the correctness of our algorithm, we present experimental results that show that our implementation clearly outperforms two's complement addition even for small numbers, and saves 50% of the required chip area compared to other carry-free implementations.

I. INTRODUCTION

Although there are many other number systems, simple radix numbers to a base $B > 0$ are still popular in computer arithmetic.
An $n$-digit radix-$B$ number is thereby given as a sequence of digits $[x_{n-1},\ldots,x_0]$ with $x_i \in \{0,\ldots,B-1\}$ that denotes the following natural number:

$$[x_{n-1},\ldots,x_0]_B := \sum_{i=0}^{n-1} x_i \cdot B^i$$

It is well known that the addition of radix-$B$ numbers suffers inherently from carry propagation: In the worst case, a carry is generated when adding the least significant digits $x_0$ and $y_0$, and is then propagated from the rightmost digits $x_0, y_0$ to the leftmost digits $x_{n-1}, y_{n-1}$. As a consequence, simple carry-ripple adders have depth $O(n)$, where the depth of a circuit is the length of the longest path from inputs to outputs. Even though this can be reduced to a depth of $O(\log(n))$, e.g., by carry-lookahead adders [1], the depth still grows with the number of digits. Circuits with a depth depending on the number of digits $n$ limit the clock speed of synchronous circuits in terms of $n$.

For radix-$B$ numbers, it is not difficult to see that addition, subtraction, multiplication, division and comparison operations of $n$-digit numbers all require a depth of at least $O(\log(n))$ since the digits of the results depend on all digits of the operands. For all basic operations, optimal $O(\log(n))$ algorithms are known, even though these sometimes require substantial mathematical effort [2]-[4].

Since this minimal $O(\log(n))$ depth cannot be improved for radix-$B$ numbers, one has to consider non-conventional number systems for improvements. For example, residue number systems (RNS) [5], [6] encode a number $x$ by its residues $(x_1,\ldots,x_n) := ((x \bmod p_1),\ldots,(x \bmod p_n))$, which are unique for numbers $x \in \{0,\ldots,(\prod_{i=1}^{n} p_i) - 1\}$ if the moduli $p_i$ are pairwise relatively prime. Addition, subtraction, and multiplication can be done in parallel on the residues, and thus with a depth of $O(1)$. Division can only be done by iterative methods like Newton-Raphson or Goldschmidt iteration, which lead again to a depth of $O(\log(n))$.
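As a small illustration of the digit-parallel RNS arithmetic described above, the following sketch adds and multiplies residue-wise. The moduli (3, 5, 7) and all function names are assumptions chosen for this example only; they do not come from the paper.

```python
from math import prod

# Pairwise relatively prime moduli, chosen only for illustration.
MODULI = (3, 5, 7)
M = prod(MODULI)  # residues are unique for 0 <= x < M

def to_rns(x, moduli=MODULI):
    """Encode x by its residues (x mod p_1, ..., x mod p_n)."""
    return tuple(x % p for p in moduli)

def rns_add(a, b, moduli=MODULI):
    """Residue-wise addition: no position depends on any other position."""
    return tuple((ai + bi) % p for ai, bi, p in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Residue-wise multiplication, again without cross-position dependencies."""
    return tuple((ai * bi) % p for ai, bi, p in zip(a, b, moduli))

x, y = 17, 4
sum_rns = rns_add(to_rns(x), to_rns(y))   # encodes (x + y) mod M
prod_rns = rns_mul(to_rns(x), to_rns(y))  # encodes (x * y) mod M
```

Each output residue is computed from the two input residues at the same position, which is why the depth does not grow with the number of moduli.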
The main problems for RNS numbers are, however, that comparison ($<$) is not possible and that conversions to and from radix numbers are relatively expensive.

An alternative to RNS numbers are signed-digit (SD) numbers [3], [7]-[12] that allow negative digits of a range $\{-D,\ldots,+D\}$ with $D < B$ for radix-$B$ numbers. Due to the redundant number representation, addition and subtraction can be implemented with a depth of $O(1)$, i.e., independent of the number of digits, while multiplication, division, and comparison can still be implemented with a depth of $O(\log(n))$. The key to carry-free addition is thereby to switch to another representation of the sum in case carries would have to be generated (see Section II-A). However, the standard algorithm for addition and subtraction of SD numbers [7] does not work for the important base $B = 2$, as we will also explain in Section II-A. For this reason, Parhami [8] and others [13] suggested recoding the given input numbers so that the later addition and subtraction of binary numbers become carry-free.

In this paper, we prove that the standard algorithm of Avizienis can be refined to correctly handle binary SD numbers. Avizienis' algorithm computes, for two digits $x_i$ and $y_i$, a transfer digit $t_{i+1} \in \{-1, 0, +1\}$ and an interim sum $w_i$ such that the sum digit $s_i$ can be computed as $s_i = t_i + w_i$. Our algorithm computes an additional condition $l_i$ that stores some important information to define the transfer and sum digits. Our transfer digits $t_i$ depend on the operand digits $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$, and the additional condition $l_i$ depends on $x_{i-1}, y_{i-1}$ only, so that our algorithm still has depth $O(1)$.

We implemented our algorithm on FPGAs and compared its speed and area requirements with previous approaches to SD addition and also with a carry-lookahead adder. It turned out that our algorithm is faster than a hybrid carry-lookahead/carry-ripple adder for more than 24 bits on our hardware platform, and requires just about 50% of the chip area of other SD addition circuits.

Our paper is organized as follows: In Section II, we discuss Avizienis' algorithm for adding SD numbers. In Section III, we first analyze why that algorithm does not work for the case of binary numbers, and then develop a solution for this problem in Section III-B. To demonstrate the efficiency of our algorithm, we present experimental results in Section IV.

II. PREVIOUS WORK

In this section, we review known results about signed-digit numbers. To this end, we provide new proofs that allow us to discuss in the next section where the difficulties in defining a carry-free addition for binary SD numbers come from.

A. Signed-Digit Numbers

Avizienis introduced in [7] the following SD numbers to a radix $B > 1$ and a digit set $\{-D,\ldots,+D\}$:

Definition 1: Given some number $D$ and a radix $B > 1$, a sequence $[x_{n-1},\ldots,x_0]$ of digits $x_i \in \{-D,\ldots,+D\}$ encodes the following integer:

$$[x_{n-1},\ldots,x_0]_{D,B} := \sum_{i=0}^{n-1} x_i \cdot B^i$$

There may be several SD representations of the same number. For example, for $B = 3$ and $D = 2$, the value 5 can be encoded as $[2, -1]$, $[1, 2]$ or $[1, -1, -1]$.
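The redundancy in Definition 1 can be checked by brute force. The sketch below evaluates an SD digit sequence and enumerates all fixed-length encodings of a value; the names `sd_value` and `sd_representations` are hypothetical helpers introduced for this illustration.

```python
from itertools import product

def sd_value(digits, B):
    """Value of [x_{n-1}, ..., x_0]_{D,B}; digits are given most significant first."""
    value = 0
    for d in digits:
        value = value * B + d  # Horner evaluation of sum(x_i * B**i)
    return value

def sd_representations(value, n, B, D):
    """All n-digit sequences over {-D, ..., +D} that encode `value` (brute force)."""
    return [ds for ds in product(range(-D, D + 1), repeat=n)
            if sd_value(ds, B) == value]

# The three encodings of 5 named in the text, padded to three digits where needed.
reps = sd_representations(5, 3, B=3, D=2)
```

With three digits, the enumeration contains the three encodings from the example (with a leading zero for the two-digit ones) and shows that further encodings of the same value can exist.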
To understand the different redundant representations of a number, we list the following well-known theorem without proof:

Theorem 1 (Uniqueness of Division with Remainder): For all integers $x, y \in \mathbb{Z}$ with $y \neq 0$, there are uniquely defined numbers $q, r \in \mathbb{Z}$ with $x = q \cdot y + r$ and $0 \le r < |y|$. We therefore write $q := (x \operatorname{div} y)$ and $r := (x \bmod y)$.

By the above theorem, we conclude the following result:

Lemma 1 (SD Number Representations): $x = [x_{n-1},\ldots,x_0]_{D,B} = [x'_{n-1},\ldots,x'_0]_{D,B}$ implies $x'_0 = x_0 + k \cdot B$ for some $k \in \mathbb{Z}$.

Proof: Using $y_1 := [x_{n-1},\ldots,x_1]_{D,B}$ and $y'_1 := [x'_{n-1},\ldots,x'_1]_{D,B}$, we obviously have $x = y_1 \cdot B + x_0 = y'_1 \cdot B + x'_0$, and therefore $x'_0 - x_0 = (y_1 - y'_1) \cdot B$ holds. Hence, $x'_0 - x_0$ is a multiple of $B$, so that the proposition holds with $k := y_1 - y'_1$.

Due to the redundant representations of a number, it is not possible to reduce equality testing to checking the equality of the corresponding digits. However, due to the (constant depth) reduction $x = y \Leftrightarrow x - y = 0$, checking equality can be reduced to checking whether the result is zero. This is possible with depth $O(\log(n))$ if zero has a unique representation (i.e., all digits being zero). To be able to check equality of SD numbers, Avizienis therefore imposed that $D < B$ must hold because of the following result:

Theorem 2 (Unique Representation of Zero): The number 0 has a unique representation as SD number $[x_{n-1},\ldots,x_0]_{D,B}$ if and only if $D < B$ holds.

Proof: For any $n$, we have $[x_{n-1},\ldots,x_0]_{D,B} = 0$ for $x_i = 0$. For any other representation $[x'_{n-1},\ldots,x'_0]_{D,B} = 0$ with $x'_0 \neq x_0$, we would have $x'_0 - x_0 = x'_0 = k \cdot B$ with $k \neq 0$ by the previous lemma. However, this is impossible iff $x'_0 \in \{-D,\ldots,+D\} \subseteq \{-(B-1),\ldots,B-1\}$ holds. Hence, we see that $x_0 = 0$ is uniquely determined for $x = 0$ if and only if $D < B$ holds. Then, we have $[x_{n-1},\ldots,x_1]_{D,B} = 0$, and the same argument applies to the next digit $x_1$, and so on.

For example, we would have $[1, -B]_{D,B} = [-1, B]_{D,B} = 0$ if we allowed $D = B$.
Hence, we always assume $D < B$ in the following to ensure the unique representation of 0. This uniqueness result can be generalized to other least significant digits $x_0$: Assume first that $B \le 2D$ (and thus $B - D \le D$) holds, so that we can partition the legal digits $\{-D,\ldots,+D\}$ into the following intervals:

$$\underbrace{-D,\ldots,D-B}_{\mathcal{D}_{-1}},\quad \underbrace{D-B+1,\ldots,B-D-1}_{\mathcal{D}_{0}},\quad \underbrace{B-D,\ldots,D}_{\mathcal{D}_{+1}}$$

By Lemma 1, the digits $\mathcal{D}_{-1} := \{-D,\ldots,D-B\}$ and $\mathcal{D}_{+1} := \{B-D,\ldots,D\}$ can be mapped to each other by either adding or subtracting $B$, while for the digits $\mathcal{D}_0 := \{D-B+1,\ldots,B-D-1\}$ no legal digits are obtained this way. Thus, digits in $\mathcal{D}_0$ are uniquely determined, while digits in either $\mathcal{D}_{-1}$ or $\mathcal{D}_{+1}$ have exactly one alternative. Choosing the alternative, we have to either increment or decrement the next digit $x_{i+1}$, and then the same discussion can be repeated for $x_{i+1}$.

However, if $B > 2D$ holds, then there are no alternatives left for the digits (since $x_0 - B \le D - B < D - 2D = -D$). Hence, to ensure redundancy, we have to impose as second constraint $B \le 2D$ (in addition to $D < B$) to obtain the following result:

Lemma 2 (Redundancy of SD Representations): For any SD number $x = [x_{n-1},\ldots,x_0]_{D,B}$ with $D < B \le 2D$, the following holds:
- If $(x \bmod B) \in \{0,\ldots,B-D-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B)$.
- If $(x \bmod B) \in \{B-D,\ldots,D\}$, then either $x_0 = (x \bmod B)$ or $x_0 = (x \bmod B) - B$ holds, thus there are exactly two solutions for $x_0$.
- If $(x \bmod B) \in \{D+1,\ldots,B-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B) - B$.
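The three cases of Lemma 2 can be illustrated with a short sketch that lists the legal least significant digits of a value. The function name `least_digit_choices` is an assumption made for this example, and the code presumes $D < B \le 2D$ as in the lemma.

```python
def least_digit_choices(x, B, D):
    """Legal least significant digits x_0 for value x, assuming D < B <= 2*D.

    By Lemma 1, x_0 can only be (x mod B) or (x mod B) - B; the candidates
    are filtered against the digit set {-D, ..., +D}.
    """
    r = x % B  # Python's % yields 0 <= r < B for positive B, also for negative x
    return [d for d in (r, r - B) if -D <= d <= D]

# With B = 10 and D = 6 the three cases of Lemma 2 are:
unique_low = least_digit_choices(2, 10, 6)   # (x mod B) in {0, ..., 3}
two_ways = least_digit_choices(5, 10, 6)     # (x mod B) in {4, ..., 6}
unique_high = least_digit_choices(7, 10, 6)  # (x mod B) in {7, ..., 9}
```

For $B = 3$ and $D = 2$, the value 5 falls into the middle case, which matches the two encodings $[1, 2]$ and $[2, -1]$ mentioned after Definition 1.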

Table I
POSSIBLE DECOMPOSITIONS $u_i = x_i + y_i = t_{i+1} \cdot B + w_i$ WITH $x_i, y_i, w_i \in \{-D,\ldots,+D\}$ ASSUMING $D < B \le 2D$.

range of u_i                      (t_{i+1}, w_i)       range of w_i
{-2D, ..., -D-1}                  (-1, B + u_i)        {B-2D, ..., B-D-1}         subset of {-D+1, ..., D-1}
{-D, ..., -B+D}                   (-1, B + u_i)        {B-D, ..., D}              subset of {-D+1, ..., D}
                               or (0, u_i)             {-D, ..., -B+D}            subset of {-D, ..., D-1}
{-(B-D-1), ..., B-D-1}            (0, u_i)             {-(B-D-1), ..., B-D-1}     subset of {-D+1, ..., D-1}
{B-D, ..., D}                     (0, u_i)             {B-D, ..., D}              subset of {-D+1, ..., D}
                               or (+1, -B + u_i)       {-D, ..., -B+D}            subset of {-D, ..., D-1}
{D+1, ..., 2D}                    (+1, -B + u_i)       {D-B+1, ..., 2D-B}         subset of {-D+1, ..., D-1}

The constraint $D < B$ is added to ensure the unique representation of zero (to ensure that we can check equality of SD numbers), while the second constraint $B \le 2D$ is added to ensure a minimal redundancy that can be exploited for a carry-free addition as explained below. Note that Avizienis imposed a stronger second constraint $B < 2D$ that then excludes the case $B = 2$. We will see in the following discussion why he did so and why we will not be that strict.

The above lemma is the key to construct a carry-free addition algorithm: If two SD numbers $[x_{n-1},\ldots,x_0]_{D,B}$ and $[y_{n-1},\ldots,y_0]_{D,B}$ have to be added, we may first consider the expression $[u_{n-1},\ldots,u_0]_{D,B}$ with $u_i := x_i + y_i$. Since each $x_i$ and each $y_i$ is a legal digit, we have $-2D \le u_i \le 2D$. According to Avizienis, each $u_i$ is decomposed into an outgoing transfer digit $t_{i+1}$ and an interim sum digit $w_i$ so that $x_i + y_i = u_i = t_{i+1} \cdot B + w_i$ holds. Due to $-2B < -2D \le x_i + y_i \le 2D < 2B$, it follows that $t_{i+1} \in \{-1, 0, +1\}$ holds for all such decompositions.
Note that a particular choice $t_{i+1} \in \{-1, 0, +1\}$ determines the range of $u_i = t_{i+1} \cdot B + w_i$, so that we can easily prove the following lemma (note that $D < B \le 2D$ implies $-B-D < -2D < -D \le -B+D < 0 < B-D \le D < 2D < B+D$):

Lemma 3: For given digits $x_i, y_i \in \{-D,\ldots,+D\}$ with $D < B \le 2D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D,\ldots,+D\}$ and $t_{i+1} \in \{-1, 0, +1\}$ as shown in Table I.

The proof is easily obtained by checking the cases mentioned in Table I. The final step of the computation now consists of computing the sum digits $s_i := w_i + t_i$ by means of the transfer and interim sum digits. We have to make sure that these additions will not produce a carry. For this reason, Avizienis demanded that $w_i \in \{-D+1,\ldots,+D-1\}$ must hold, which is also possible according to the following lemma:

Lemma 4: For given digits $x_i, y_i \in \{-D,\ldots,+D\}$ with $D < B < 2D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D+1,\ldots,+D-1\}$ as shown in Table I.

Proof: The proof is easily obtained by checking all the cases mentioned in Table I. Note that the cases with $u_i \in \{-D,\ldots,-B+D\}$ and $u_i \in \{B-D,\ldots,D\}$ allow two different decompositions and for each case, there is one $u_i$ that produces an interim sum $w_i \notin \{-D+1,\ldots,+D-1\}$. In that case, however, we use the other possible decomposition and can therefore ensure $w_i \in \{-D+1,\ldots,+D-1\}$.

Since it is possible to find a decomposition with $w_i \in \{-D+1,\ldots,+D-1\}$, it is now possible to compute the final sum digits $s_i := w_i + t_i$ without producing a carry. However, the reader might have noted that we had to strengthen the constraint $D < B \le 2D$ used before to $D < B < 2D$ to make this possible.
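The decompositions of Lemmas 3 and 4 can be enumerated directly. The following sketch uses two hypothetical helper names (`decompositions`, `carry_free_choices`) and brute force rather than the case analysis of Table I. It lists all legal pairs $(t_{i+1}, w_i)$ for a given $u_i$ and then keeps only those with $|w_i| \le D-1$.

```python
def decompositions(u, B, D):
    """All (t, w) with u == t*B + w, t in {-1, 0, +1} and w in {-D, ..., +D}."""
    return [(t, u - t * B) for t in (-1, 0, 1) if -D <= u - t * B <= D]

def carry_free_choices(u, B, D):
    """Decompositions whose interim sum also satisfies |w| <= D - 1 (Lemma 4)."""
    return [(t, w) for t, w in decompositions(u, B, D) if abs(w) <= D - 1]

# For D < B < 2*D every u in {-2D, ..., 2D} has a carry-free decomposition.
ok_radix3 = all(carry_free_choices(u, 3, 2) for u in range(-4, 5))

# For B = 2, D = 1 (so B == 2*D) the value u = +1 still has two legal
# decompositions, but neither has w == 0, so no carry-free choice remains.
binary_plus_one = decompositions(1, 2, 1)
binary_plus_one_carry_free = carry_free_choices(1, 2, 1)
```

The last two lines show why the strengthening from $B \le 2D$ to $B < 2D$ matters: with $B = 2D$ the set $\{-D+1,\ldots,D-1\}$ can be too small to contain an interim sum for every $u_i$.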
Based on the above lemma, the carry-free addition due to Avizienis is now as follows:

Theorem 3 (Carry-Free Addition by Avizienis): The addition of SD numbers $x = [x_{n-1},\ldots,x_0]_{D,B}$ and $y = [y_{n-1},\ldots,y_0]_{D,B}$ with $D < B < 2D$ can be computed in depth $O(1)$ with $O(n)$ work (gates) as follows:

1) for $i \in \{0,\ldots,n-1\}$, compute $u_i := x_i + y_i$
2) for $i \in \{0,\ldots,n-1\}$, compute
$$t_{i+1} := \begin{cases} +1 & \text{if } u_i \ge +D \\ -1 & \text{if } u_i \le -D \\ 0 & \text{if } -D < u_i < +D \end{cases}$$
3) for $i \in \{0,\ldots,n-1\}$, compute digits $s_i := t_i + \underbrace{u_i - t_{i+1} \cdot B}_{=: w_i}$ with $t_0 := 0$

The final sum is then the SD number $s = [t_n, s_{n-1},\ldots,s_0]_{D,B}$.

Each of the above steps can be performed in parallel, so that the sum can be computed in three steps. Moreover, in the cases $u_i \in \{-D,\ldots,-B+D\}$ and $u_i \in \{B-D,\ldots,D\}$ the algorithm prefers the decomposition with $t_{i+1} = 0$ except for the cases $u_i = \pm D$, where the other possible decomposition is used. This way, we always have $w_i \in \{-D+1,\ldots,+D-1\}$ and therefore, the final addition $s_i := t_i + w_i$ produces a legal digit.

Other operations can be implemented as follows:
- Subtraction of $x$ and $y$ can be simply performed by addition of $x$ and $-y = [-y_{n-1},\ldots,-y_0]_{D,B}$, which can also be done with depth $O(1)$ and work $O(n)$ (see footnote 2).
- Checking equality of $x$ and $y$ is reduced to checking whether $x - y = 0$ holds. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking that all obtained digits are zero requires depth $O(\log n)$ and work $O(n)$.
- Comparing $x < y$ is reduced to testing for $x - y < 0$. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking the sign may require depth $O(\log(n))$ since some of the leading digits can be zero (the sign of the first non-zero digit determines the sign).
- Multiplication can be obtained by adding the partial products $x \cdot y_i \cdot B^i$, which can be arranged with a depth of $O(\log(n))$ and work $O(n^2)$ [14], [15].
- Division can be implemented by multiplication with the integer reciprocal, requiring depth $O(\log(n))$ and work $O(n^2)$ [2].

Hence, SD numbers are an interesting number representation that leads to efficient arithmetic algorithms.

B. Binary SD Numbers

Avizienis already noted that his algorithm does not work for binary SD numbers for the reasons we explained in the previous section. Using the weaker constraints $D < B \le 2D$, we can reconsider Table I, which reduces for $B = 2$ and $D = 1$ to the following decompositions:

u_i    (t_{i+1}, w_i)
-2     (-1, 0)
-1     (0, -1) or (-1, +1)
 0     (0, 0)
+1     (0, +1) or (+1, -1)
+2     (+1, 0)

As can be seen, there is no decomposition that always allows us to achieve that $w_i \in \{-D+1,\ldots,+D-1\} = \{0\}$ holds. For this reason, it was widely accepted that there is no carry-free addition for general binary SD numbers.
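The three steps of Theorem 3 can be sketched in a few lines. The code below is a software simulation under the assumption $D < B < 2D$, not a hardware description; the names `avizienis_add` and `sd_value` are introduced only for this illustration, and digit lists are written most significant digit first as in the text.

```python
def sd_value(digits, B):
    """Value of an SD number given most significant digit first."""
    value = 0
    for d in digits:
        value = value * B + d
    return value

def avizienis_add(x, y, B, D):
    """Carry-free addition of two equal-length SD numbers, assuming D < B < 2*D."""
    xs, ys = x[::-1], y[::-1]                    # least significant digit first
    u = [a + b for a, b in zip(xs, ys)]          # step 1: u_i = x_i + y_i
    # step 2: transfer digit t_{i+1} depends on u_i only
    t_out = [1 if v >= D else -1 if v <= -D else 0 for v in u]
    t_in = [0] + t_out[:-1]                      # t_0 := 0
    # step 3: s_i = t_i + w_i with w_i = u_i - t_{i+1} * B
    s = [ti + v - to * B for ti, v, to in zip(t_in, u, t_out)]
    return [t_out[-1]] + s[::-1]                 # [t_n, s_{n-1}, ..., s_0]

# Example with B = 3, D = 2: 8 + 8 = 16
example = avizienis_add([2, 2], [2, 2], B=3, D=2)
```

Every list comprehension reads only values of the same or the neighboring position, which mirrors the claim that each step can be evaluated for all digits in parallel.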
One possible solution is to consider a radix $B = 2^k$ and to represent the digits $x_i$ then as two's complement numbers with $k+1$ bits. The disadvantage is that the depth is increased to $O(\log(k))$ (due to the addition of two's complement numbers with $k$ bits), as considered in [11]. Since small numbers $k$ can be chosen, this may still be a practical solution. Many papers also consider variants of these SD number representations, e.g., using asymmetric digit sets [12].

Footnote 2: The work of a parallel algorithm is the number of executed operations, i.e., the number of gates of the corresponding circuit.

As another solution, Parhami [8] suggested recoding a given binary SD number $x$ of length $n$ to an equivalent SD number $x'$ of length $n+1$ such that there are no two neighboring digits $x'_{i+1}$ and $x'_i$ with $x'_{i+1} \cdot x'_i = 1$. Unfortunately, the output of his addition algorithm does not satisfy this condition, so that it has to be recoded again before another addition takes place. This not only increases the required chip area, but also adds further latency to each addition. Other works on recoding SD numbers are discussed in [13].

We therefore considered whether it is possible to construct a direct algorithm for the addition of binary SD numbers despite the problems with the decomposition mentioned in the previous section. As we report in the next section, it turns out that there is indeed such an algorithm, and it can be efficiently implemented in hardware.

III. OUR ALGORITHM FOR CARRY-FREE ADDITION OF BINARY SD NUMBERS

A. Analyzing the Problem

Below, we first analyze the problem for base $B = 2$ and then construct a carry-free binary SD addition algorithm. We have to add two digits $x_i$ and $y_i$ of given numbers plus the transfer digit $t_i$ that comes from the neighboring digits to the right.
All of $x_i$, $y_i$, and $t_i$ belong to the digit set $\{-1, 0, +1\}$, and we have to define transfer digits $t_{i+1}$, an interim sum $w_i$, and the final sum digit $s_i$ such that the following constraints hold:

1) $x_i + y_i = 2 \cdot t_{i+1} + w_i$
2) $s_i = t_i + w_i$
3) $t_{i+1}$, $w_i$ and $s_i$ are digits from $\{-1, 0, +1\}$
4) $t_{i+1}$ is defined independently of $t_i$ (to avoid a propagation chain)

To this end, consider Table II: The first three columns list the possible inputs for $x_i$, $y_i$ and $t_i$. The next two columns are values for $t_{i+1}$ and $w_i$ that were computed by the algorithm of the previous section, i.e.,

$$t_{i+1} := \begin{cases} +1 & \text{if } x_i + y_i \ge +1 \\ -1 & \text{if } x_i + y_i \le -1 \\ 0 & \text{otherwise} \end{cases}$$

and $w_i := x_i + y_i - 2 \cdot t_{i+1}$ and $s_i := x_i + y_i + t_i - 2 \cdot t_{i+1}$. As can be seen, the algorithm sometimes computes values for $s_i$ that are not in the allowed range. The symbol * in the rightmost column marks these rows (where the algorithm fails), and we have colored these rows in dark gray. It is not difficult to see that a correct result would have been possible, since $[t_{i+1}, s_i]_{1,2} = [-1, +2]_{1,2} = 0 = [0, 0]_{1,2}$ and $[t_{i+1}, s_i]_{1,2} = [+1, -2]_{1,2} = 0 = [0, 0]_{1,2}$ hold.
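The failing inputs described above can be reproduced by enumerating all 27 input triples. This is a minimal sketch with hypothetical names (`standard_step`, `legal_for_all_transfers`); it applies the transfer rule quoted above for $B = 2$, $D = 1$ and also checks that no single choice of $t_{i+1}$ per pair $(x_i, y_i)$ repairs a critical class.

```python
from itertools import product

DIGITS = (-1, 0, 1)

def standard_step(x, y, t_in):
    """One digit position of the standard rule for B = 2, D = 1."""
    u = x + y
    t_out = 1 if u >= 1 else -1 if u <= -1 else 0
    w = u - 2 * t_out
    return t_out, w, w + t_in        # (t_{i+1}, w_i, s_i)

# Input triples (x_i, y_i, t_i) for which s_i leaves {-1, 0, +1}.
failures = [(x, y, t) for x, y, t in product(DIGITS, repeat=3)
            if abs(standard_step(x, y, t)[2]) > 1]

def legal_for_all_transfers(x, y, t_out):
    """True if choosing t_out for (x, y) yields a legal s_i for every t_i."""
    w = x + y - 2 * t_out
    return all(abs(w + t_in) <= 1 for t_in in DIGITS)

# Pairs (x_i, y_i) for which no t_{i+1} chosen from x_i and y_i alone works.
unfixable = [(x, y) for x, y in product(DIGITS, repeat=2)
             if not any(legal_for_all_transfers(x, y, t) for t in DIGITS)]
```

Under these assumptions, `unfixable` consists exactly of the pairs with $x_i + y_i = \pm 1$, i.e., the cases of Table I that admit two decompositions.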

Table II
VALUES OF $t_{i+1}$ AND $s_i$ FOR STANDARD SD ADDITION.

x_i  y_i  t_i   t_{i+1}  w_i  s_i
-1   -1   -1      -1      0   -1
-1   -1    0      -1      0    0
-1   -1   +1      -1      0   +1
-1    0   -1      -1     +1    0   +
-1    0    0      -1     +1   +1   +
-1    0   +1      -1     +1   +2   *
-1   +1   -1       0      0   -1
-1   +1    0       0      0    0
-1   +1   +1       0      0   +1
 0   -1   -1      -1     +1    0   +
 0   -1    0      -1     +1   +1   +
 0   -1   +1      -1     +1   +2   *
 0    0   -1       0      0   -1
 0    0    0       0      0    0
 0    0   +1       0      0   +1
 0   +1   -1      +1     -1   -2   *
 0   +1    0      +1     -1   -1   +
 0   +1   +1      +1     -1    0   +
+1   -1   -1       0      0   -1
+1   -1    0       0      0    0
+1   -1   +1       0      0   +1
+1    0   -1      +1     -1   -2   *
+1    0    0      +1     -1   -1   +
+1    0   +1      +1     -1    0   +
+1   +1   -1      +1      0   -1
+1   +1    0      +1      0    0
+1   +1   +1      +1      0   +1

However, we cannot simply change these rows in the table to correct the outputs $t_{i+1}$ and $s_i$, since the computation of $t_{i+1}$ must be independent of $t_i$, and should only depend on $x_i$ and $y_i$. Hence, changing the value of $t_{i+1}$ in a row forces us to make the same change in all rows where $x_i$ and $y_i$ have the same values. We therefore say that two input triples $(x_i, y_i, t_i)$ and $(x'_i, y'_i, t'_i)$ are equivalent iff $x_i = x'_i \land y_i = y'_i$ holds. The symbol + denotes the rows that are equivalent in this sense to another input that leads to wrong results, and we have colored these rows in a lighter gray.

We therefore see that we have four critical input classes $(x_i, y_i, t_i) = (-1, 0, *)$, $(x_i, y_i, t_i) = (0, -1, *)$, $(x_i, y_i, t_i) = (0, +1, *)$, and $(x_i, y_i, t_i) = (+1, 0, *)$ that refer to the decomposition cases in Table I where two decompositions are possible. Since we have to define a decomposition $2 \cdot t_{i+1} + w_i = x_i + y_i$ independent of $t_i$, there is no solution by the information given in this table.

For example, consider the critical input class $(x_i, y_i, t_i) = (-1, 0, *)$: Using $t_{i+1} = -1$ as computed by the algorithm leads to the value $s_i = +2$ for $t_i = +1$. Using $t_{i+1} = 0$ instead leads to the value $s_i = -2$ for $t_i = -1$, and using $t_{i+1} = +1$ leads to forbidden values of $s_i$ for all values of $t_i$ (see Table III). Thus, it is not possible to define a decomposition for $t_{i+1}$ that only depends on $x_i$ and $y_i$, as remarked by Avizienis.

Table III
ALTERNATIVE VALUES OF $t_{i+1}$ FOR ($x_i = -1$ AND $y_i = 0$).

x_i  y_i  t_i   t_{i+1}  s_i   t_{i+1}  s_i
-1    0   -1       0     -2      +1     -4
-1    0    0       0     -1      +1     -3
-1    0   +1       0      0      +1     -2

B. Solution

Our algorithm uses additional information that solves the problem explained in the previous section. As the algorithm describes a hardware circuit, we make use of an encoding of the digits $\{-1, 0, +1\}$ by a pair of booleans $(x.0, x.1)$.
There are many encodings of the digits $\{-1, 0, +1\}$, but the following two are the most popular ones:

Value   sign-value       neg-pos
-1      (true, true)     (true, false)
 0      (false, false)   (false, false)
+1      (false, true)    (false, true)

We choose the neg-pos encoding for our algorithm because it lends itself well to a concise description of the logic equations below; in addition, it makes negating a value a simple swap of the pair's elements.

The key idea of our solution is to choose different decompositions $x_i + y_i = 2 \cdot t_{i+1} + w_i$ in the critical cases (with gray color) of Table II. Since we cannot do this based on $x_i$ and $y_i$ only, and since we are not allowed to consider $t_i$, we introduce a new input $l_i$ such that $(t_i = +1 \rightarrow \neg l_i) \land (t_i = -1 \rightarrow l_i)$ holds, and we generate an output $l_{i+1}$ that maintains this property as an invariant

$$(t_{i+1} = +1 \rightarrow \neg l_{i+1}) \land (t_{i+1} = -1 \rightarrow l_{i+1}) \qquad (1)$$

that is forwarded to the full adder that receives $x_{i+1}$ and $y_{i+1}$ as inputs, while $l_i$ is provided in addition to $t_i$ by the full adder for $x_{i-1}$ and $y_{i-1}$. Using $l_i$, we can then decide whether we use the one or the other possible decomposition in the critical cases (with gray color) of Table II. Note that $l_i$ does not hold the full information of $t_i$, since it is not determined for $t_i = 0$. To establish the above invariant, we define $l_{i+1} := x_i.0 \lor y_i.0$, which means that $l_{i+1}$ holds if and only if at least one of the digits $x_i, y_i$ is $-1$.

We prove that equation (1) holds by inspecting Table IV, where the solution computed by our algorithm is given in the three rightmost columns, and we can also verify that the important equation $x_i + y_i + t_i = 2 \cdot t_{i+1} + s_i$ holds, and that all computed values are legal digits. Note that the inputs in Table IV are arbitrary, but input $l_i$ must respect the invariant mentioned above. We use * in case its value is a don't care (i.e., if $t_i = 0$); for critical inputs with $t_i = 0$, both values of $l_i$ are listed.

Table IV
VALUES OF $t_{i+1}$ AND $s_i$ FOR OUR SD ADDITION ALGORITHM.
(Critical inputs, shaded gray in the original, are marked with #.)

x_i  y_i  t_i  l_i   l_{i+1}  t_{i+1}  s_i
 x    y   tin  lin    lout     tout     s
-1   -1   -1    T      T       -1      -1
-1   -1    0    *      T       -1       0
-1   -1   +1    F      T       -1      +1
-1    0   -1    T      T       -1       0   #
-1    0    0    T      T       -1      +1   #
-1    0    0    F      T        0      -1   #
-1    0   +1    F      T        0       0   #
-1   +1   -1    T      T        0      -1
-1   +1    0    *      T        0       0
-1   +1   +1    F      T        0      +1
 0   -1   -1    T      T       -1       0   #
 0   -1    0    T      T       -1      +1   #
 0   -1    0    F      T        0      -1   #
 0   -1   +1    F      T        0       0   #
 0    0   -1    T      F        0      -1
 0    0    0    *      F        0       0
 0    0   +1    F      F        0      +1
 0   +1   -1    T      F        0       0   #
 0   +1    0    T      F        0      +1   #
 0   +1    0    F      F       +1      -1   #
 0   +1   +1    F      F       +1       0   #
+1   -1   -1    T      T        0      -1
+1   -1    0    *      T        0       0
+1   -1   +1    F      T        0      +1
+1    0   -1    T      F        0       0   #
+1    0    0    T      F        0      +1   #
+1    0    0    F      F       +1      -1   #
+1    0   +1    F      F       +1       0   #
+1   +1   -1    T      F       +1      -1
+1   +1    0    *      F       +1       0
+1   +1   +1    F      F       +1      +1

As can be seen, in case of non-critical inputs (those that are not given in gray color), the decomposition of $x_i + y_i = u_i$ into $t_{i+1} \cdot 2 + w_i$ depends only on $x_i$ and $y_i$, while in the critical cases, it also depends on $l_i$. Using the information of $l_i$, it is possible to choose a decomposition where legal digits are always obtained for $t_{i+1}$ and $s_i$ without generating a carry digit.

It is interesting to note that $l_i$ and $l_{i+1}$ have strong relationships to $t_i$ and $t_{i+1}$ due to the mentioned invariants. However, $l_{i+1}$ only depends on the digits $x_i$ and $y_i$, while $t_{i+1}$ depends on $l_i$, but not on $t_i$. This is very important: In principle, we could replace $l_i$ by $(t_i \neq +1)$ without making the equations incorrect. However, the hardware circuit would then suffer from carry propagation since $t_{i+1}$ would then depend on $t_i$.

Figure 1 defines a full adder using the Quartz language [16] that can be cascaded to obtain a carry-free binary SD adder. Inputs are declared by ? while outputs are declared with !. The inputs tin, x, and y are thereby pairs of booleans that encode digits $\{-1, 0, +1\}$ via the neg-pos encoding, i.e., $\varepsilon(x.0, x.1) = (x.0\ ?\ {-1} : 0) + (x.1\ ?\ {+1} : 0)$ maps a pair of booleans to the corresponding digit. The module also makes use of local boolean variables w1, w2, w3, w4, w, u1, and u0. w is thereby defined such that it holds if and only if one of the critical input cases is given (the gray shaded ones in Table IV). Variables u1 and u0 are used to define some common subexpressions.

    module SgnFullAdd((bool * bool) ?tin, ?x, ?y, bool ?lin,
                      (bool * bool) !tout, !s, bool !lout) {
        bool w1, w2, w3, w4, w, u1, u0;
        // define the critical input cases:
        w1 = !x.0 & !x.1 & y.1;   // x==0 & y==+1
        w2 = !x.0 & !x.1 & y.0;   // x==0 & y==-1
        w3 = !y.0 & !y.1 & x.1;   // y==0 & x==+1
        w4 = !y.0 & !y.1 & x.0;   // y==0 & x==-1
        w  = w1 | w2 | w3 | w4;
        u1 = !lin & w;            // tin!=-1 & critical input
        u0 = lin & w;             // tin!=+1 & critical input
        // determine lout := x=-1 | y=-1
        lout = x.0 | y.0;
        // tout.0 holds iff x=y=-1 | tin!=+1 & x+y=-1
        tout.0 = x.0 & y.0 | lin & (w2 | w4);
        // tout.1 holds iff x=y=+1 | tin!=-1 & x+y=+1
        tout.1 = x.1 & y.1 | !lin & (w1 | w3);
        // determine sum digit
        s.0 = tin.0 & !u0 | u1 & !tin.1;
        s.1 = tin.1 & !u1 | u0 & !tin.0;
    }

Figure 1. Implementation of a Full Adder for Binary SD Numbers

As can be seen, tout only depends on x, y, lin; s depends on tin, lin, x, y; and lout depends on x, y. Therefore, there is no dependency from tin to tout and neither is there one from lin to lout. Dependencies between neighboring full adder modules are shown in Figure 2. As can be seen, a sum digit $s_i$ depends on $x_i, y_i, x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$; $l_i$ depends on $x_{i-1}, y_{i-1}$; and $t_i$ depends on $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$.

Figure 2. Dependencies of the Variables in SgnFullAdd
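To experiment with the full adder outside of Quartz, its equations can be transcribed into Python. The following is a sketch, not the authors' implementation: the helper names `encode`, `decode`, and `sd_add` are hypothetical, the or-operators are reconstructed from the comments in Figure 1, and the sequential loop only simulates a cascade whose cells would all evaluate in parallel in hardware. Choosing `lin = False` for the rightmost cell is an assumption that is consistent with the invariant because $t_0 = 0$.

```python
def encode(d):
    """neg-pos encoding: -1 -> (True, False), 0 -> (False, False), +1 -> (False, True)."""
    return (d < 0, d > 0)

def decode(p):
    """Digit encoded by a neg-pos pair (p.0, p.1)."""
    return int(p[1]) - int(p[0])

def sgn_full_add(tin, x, y, lin):
    """Transcription of the SgnFullAdd equations; returns (tout, s, lout)."""
    crit_pos = (not x[0] and not x[1] and y[1]) or (not y[0] and not y[1] and x[1])  # w1 | w3
    crit_neg = (not x[0] and not x[1] and y[0]) or (not y[0] and not y[1] and x[0])  # w2 | w4
    w = crit_pos or crit_neg
    u1 = (not lin) and w
    u0 = lin and w
    lout = x[0] or y[0]
    tout = ((x[0] and y[0]) or (lin and crit_neg),
            (x[1] and y[1]) or ((not lin) and crit_pos))
    s = ((tin[0] and not u0) or (u1 and not tin[1]),
         (tin[1] and not u1) or (u0 and not tin[0]))
    return tout, s, lout

def sd_add(xs, ys):
    """Add two equal-length binary SD numbers (digits -1/0/+1, most significant first)."""
    tin, lin = encode(0), False          # t_0 = 0, so either value of l_0 is admissible
    digits = []
    for dx, dy in zip(reversed(xs), reversed(ys)):
        tout, s, lout = sgn_full_add(tin, encode(dx), encode(dy), lin)
        digits.append(decode(s))
        tin, lin = tout, lout            # tout never depends on tin: no carry chain
    digits.append(decode(tin))           # leading digit t_n
    return digits[::-1]

# 3 + 3 = 6, with both operands written as [+1, +1]
example = sd_add([1, 1], [1, 1])
```

Note that `tout` is computed from `x`, `y`, and `lin` only, and `lout` from `x` and `y` only, which matches the dependency structure stated for Figure 2.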

It is not difficult to prove that the following theorem holds, where $\varepsilon(x)$ maps the pair of booleans $x = (x.0, x.1)$ to a digit in $\{-1, 0, +1\}$ according to the neg-pos encoding:

Theorem 4 (Correctness of SgnFullAdd): If x, y, tin are pairs of booleans that encode digits $\{-1, 0, +1\}$, and if lin is a boolean such that the condition $(\mathit{lin} \rightarrow \neg\mathit{tin}.1) \land (\neg\mathit{lin} \rightarrow \neg\mathit{tin}.0)$ holds, then the following holds for the module SgnFullAdd shown in Figure 1:
- tout and s encode signed binary digits $\{-1, 0, +1\}$
- $(\mathit{lout} \rightarrow \neg\mathit{tout}.1) \land (\neg\mathit{lout} \rightarrow \neg\mathit{tout}.0)$
- $\varepsilon(x) + \varepsilon(y) + \varepsilon(\mathit{tin}) = 2 \cdot \varepsilon(\mathit{tout}) + \varepsilon(s)$

Proof: The proof can be made by an exhaustive enumeration of all cases, which has been performed by means of the Averest tool set.

Thus, all bits $l_i$, then all transfer digits, and then all sum digits are computed in three parallel steps, thus requiring time $O(1)$. Hence, we obtained a carry-free addition of binary SD numbers without the need to re-encode the inputs. The crucial fact used here is that we can extract enough information from the next less-significant digits to distinguish the cases where forbidden digits for $s_i$ would be computed within the critical inputs. Note that $l_i$ does not have the complete information to determine $t_i$ since that would lead to a dependency between $t_{i+1}$ and $t_i$ that would introduce a carry chain.

C. Conversion to/from Binary Numbers

Converting radix-2 or two's complement numbers to binary SD numbers does not require any logic resources. For a radix-2 number $x = [x_{n-1},\ldots,x_0]$, the equivalent SD number $x'$ in neg-pos encoding is $x'.0 := [0,\ldots,0]$ and $x'.1 := [x_{n-1},\ldots,x_0]$; for a two's complement number $x = [x_{n-1},\ldots,x_0]$, an equivalent SD number is $x'.0 := [x_{n-1}, 0,\ldots,0]$ and $x'.1 := [0, x_{n-2},\ldots,x_0]$. The correctness of this can be easily seen from the equation

$$[x_{n-1},\ldots,x_0]_{2C} = -x_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} x_i \cdot 2^i,$$

where $x_{2C}$ denotes the two's complement interpretation of a bitvector $x$.
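The conversions just described amount to rewiring bits, which the following sketch makes explicit. The function names are assumptions for this illustration; bit lists are written most significant bit first, and `sd_to_int` evaluates the neg-pos pair as the difference of the two bitvectors, i.e., as $\sum_i (x_i.1 - x_i.0) \cdot 2^i$.

```python
def unsigned_to_sd(bits):
    """Radix-2 number -> neg-pos SD number: neg vector all zero, pos vector = bits."""
    return [0] * len(bits), list(bits)

def twos_complement_to_sd(bits):
    """Two's complement number -> neg-pos SD number.

    The sign bit has weight -2^(n-1), so it moves to the neg vector;
    all other bits stay in the pos vector.
    """
    neg = [bits[0]] + [0] * (len(bits) - 1)
    pos = [0] + list(bits[1:])
    return neg, pos

def sd_to_int(neg, pos):
    """Integer value of a neg-pos SD number: digit i is pos_i - neg_i."""
    value = 0
    for n_bit, p_bit in zip(neg, pos):
        value = 2 * value + (p_bit - n_bit)
    return value

# 1011 in two's complement is -8 + 3 = -5
neg, pos = twos_complement_to_sd([1, 0, 1, 1])
minus_five = sd_to_int(neg, pos)
```

No gate is needed for the forward direction, since each output bit is either a constant or a copy of an input bit.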
To convert an SD number $x'$ back to a radix-2 or a two's complement number, the bitvector $[x'_{n-1}.0,\ldots,x'_0.0]$ is interpreted as a radix-2 number and subtracted from the radix-2 number $[x'_{n-1}.1,\ldots,x'_0.1]$ (since $[x'_{n-1},\ldots,x'_0]_{1,2} = \sum_{i=0}^{n-1} (x'_i.1 - x'_i.0) \cdot 2^i$). This requires a single $n$-bit subtraction, which needs time $O(\log(n))$ and returns an $(n+1)$-bit radix-2/two's complement number.

IV. BENCHMARK RESULTS

A. Setup

We implemented our addition algorithm in hardware on a Xilinx Virtex 5 FPGA, along with Parhami's algorithm [8], and a simple addition of two's complement numbers to make comparisons. On these FPGAs, simple addition is implemented using a dedicated carry logic and fast carry chains, resulting in a combination of carry-lookahead and carry-ripple adders. This method is the fastest and the smallest carry-based addition for all but very high bit-width numbers. For Parhami's method, we chose the sign-value encoding, since it was the one they focused on in [8].

Our benchmarks were set up as follows: To measure latency, we registered the inputs and outputs of the respective adder implementation. The synthesis and implementation tools were set to optimize for clock frequency, and the given latencies are the minimum clock periods which were still routable. For area, the design was solely comprised of the adder circuit, with the FPGA's pins serving as the inputs and the outputs of the adder. The tools were set to optimize for area, and the area is measured in occupied lookup tables (LUTs).

For our benchmarks, we assumed that the inputs are given as signed-digit numbers. This is necessary in order to ensure that the input is as general as possible so that the synthesis tools are not able to optimize the circuit unrealistically by exploiting don't-care conditions.

We measured the following benchmarks:
- add2: addition circuit with two $n$-digit inputs and an $(n+1)$-digit output
- add3: addition circuit with three $n$-digit inputs and an $(n+2)$-digit output

B. Results

Table V shows the latency and the maximum frequency of the two-input and the three-input adder for our new addition algorithm and compares it to Parhami's adder. The values were determined for an input width of $n = 64$, but they are actually independent of $n$ (with very small deviations due to slight variations in the LUT array and the routing network of the FPGA). We included the values for a 64-bit native addition as a reference.

Table V
LATENCY OF TWO-INPUT AND THREE-INPUT ADDERS IN NANOSECONDS, RESP. MAXIMUM FREQUENCY IN MHZ.

                        add2 (ns / MHz)    add3 (ns / MHz)
Our adder               2.02 /                  / 318
Parhami's adder         2.88 /                  / 201
Simple adder (64-bit)   3.19 /                  / 212

As can be seen, our algorithm is more than 40 % faster than Parhami's SD addition. It also tends to achieve a frequency which is 50 % higher than a 64-bit native FPGA addition. This is to be expected, since our algorithm has a constant $O(1)$ latency, while the best latency which any carry-based addition can achieve is $O(\log n)$. In fact, our algorithm is so efficient that the break-even point is at $n = 24$, for which native addition has a latency of 2.11 ns. Interestingly, Parhami's adder is actually slower than native FPGA addition for the three-input case, even though it is faster for the two-input case.

In Table VI, we show the area requirements for the different algorithms. For all of them, the occupied area is proportional to their input width, hence we give the number of LUTs per input digit (measured for $n = 64$). For example, our method requires 3 LUTs per digit, so adding two 32-digit numbers requires 96 LUTs. As expected, three-input adders need twice the area of two-input adders, since they are just two adders in sequence. Our method requires three times as much area as the native addition algorithm, and less than half of Parhami's algorithm.

Table VI
AREA REQUIREMENTS OF TWO-INPUT AND THREE-INPUT ADDERS IN LUTS PER INPUT BIT.

                          add2    add3
Our adder
Parhami's adder
Two's complement adder

Note that in the case of an ASIC implementation, our algorithm would likely perform even better compared to a two's complement adder since the latter benefits from the dedicated carry-propagation chain on the FPGA, an advantage which would not exist on an ASIC.

V. CONCLUSION

We developed an algorithm for adding binary SD numbers which does not require the recoding step of previous approaches [8]. Our algorithm makes use of an additional input $l_i$ that is used to determine suitable transfer and interim sum digits and thereby avoids generating a carry. By implementing our addition algorithm on an FPGA, we showed that our method is approximately 40 % faster and needs less than half as much area compared to previous approaches to binary SD addition. It has a lower latency than even the fastest carry-based two's complement addition for input widths as low as 24 bits, allowing it to be used as a replacement in many practical, latency-critical hardware designs.

REFERENCES

[1] P. Kogge and H. Stone, A parallel algorithm for the efficient solution of a general class of recurrences, IEEE Transactions on Computers (T-C), vol. 22, pp ,
[2] P. Beame, S. Cook, and H. Hoover, Log depth circuits for division and related problems, in Foundations of Computer Science (FOCS). West Palm Beach, Florida, USA: IEEE Computer Society, 1984, pp
[3] B. Parhami, Computer Arithmetic Algorithms and Hardware Designs. Oxford University Press,
[4] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann,
[5] H. Garner, The residue number system, IRE Transactions on Electronic Computers, vol. 8, pp , June
[6] H. Garner, R. Arnold, B. Benson, C. Brockus, R. Gonzalez, and D. Rozenberg, Residue number systems for computers, University of Michigan, Technical Report , October
[7] A. Avizienis, Signed-digit number representations for fast parallel arithmetic, IRE Transactions on Electronic Computers, vol. 10, no. 3, pp , September
[8] B. Parhami, Carry-free addition of recoded binary signed-digit numbers, IEEE Transactions on Computers (T-C), vol. 37, no. 11, pp ,
[9] B. Parhami, Generalized signed-digit number systems: A unifying framework for redundant number representations, IEEE Transactions on Computers (T-C), vol. 39, no. 1, pp , January
[10] S.-H. Shieh and C.-W. Wu, Asymmetric high-radix signed-digit number systems for carry-free addition, Journal of Information Science and Engineering, vol. 19, no. 6, pp ,
[11] G. Jaberipur and M. Ghodsi, High radix signed digit number systems: Representation paradigms, Scientia Iranica, vol. 10, no. 4, pp ,
[12] S. Gorgin and G. Jaberipur, A family of high radix signed digit adders, in Symposium on Computer Arithmetic (ARITH). Tübingen, Germany: IEEE Computer Society, 2011, pp
[13] M. Joye and S.-M. Yen, Optimal left-to-right binary signed-digit recoding, IEEE Transactions on Computers (T-C), vol. 49, no. 7, pp ,
[14] C. Koc and S. Johnson, Multiplication of signed-digit numbers, Electronics Letters, vol. 30, no. 11, pp ,
[15] C. Hung and B. Parhami, Generalized signed-digit multiplication and its systolic realizations, in Circuits and Systems. Detroit, Michigan, USA: IEEE Computer Society, 1993, pp
[16] K. Schneider, The synchronous programming language Quartz, Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany, Internal Report 375, December
[17] S. Arno and F. Wheeler, Signed digit representation of minimal Hamming weight, IEEE Transactions on Computers (T-C), vol. 42, no. 8, pp , August
[18] A. Booth, A signed binary multiplication technique, Quarterly Journal of Mechanics and Applied Mathematics (QJMAM), vol. 4, no. 2, pp ,
[19] D. Phatak, T. Goff, and I. Koren, Constant-time addition and simultaneous format conversion based on redundant binary representations, IEEE Transactions on Computers (T-C), vol. 50,
[20] G. Reitwiesner, Advances in Computers. Academic Press, 1960, ch. Binary Arithmetic.