A New Euclidean Division Algorithm for Residue Number Systems

Jean-Claude Bajard and Laurent Stéphane Didier
Laboratoire d'Informatique de Marseille, CMI, Université de Provence, 39 rue Joliot-Curie, 13453 Marseille Cedex, FRANCE

Jean-Michel Muller
CNRS, Laboratoire de l'Informatique du Parallélisme, 46 Allée d'Italie, 69364 Lyon Cedex 07, FRANCE

Abstract

We propose in this paper a new algorithm and architecture for performing divisions in residue number systems. Our algorithm is suitable for residue number systems with large moduli, with the aim of manipulating very large integers on a parallel computer or a special-purpose architecture. The two basic features of our algorithm are, on the one hand, the use of a high-radix division method and, on the other hand, the use of a floating-point arithmetic that should run in parallel with the modular arithmetic.

1 Introduction

A Residue Number System (abbreviated as RNS) is composed of $n$ moduli $m_1, m_2, \ldots, m_n$ that are pairwise relatively prime integers (i.e., $\gcd(m_i, m_j) = 1$ if $i \neq j$). A number $X$ is represented in the RNS by its residues $x_i = X \bmod m_i$. Additions, subtractions and multiplications can be performed in parallel for each modulus, using separate ALUs: this allows RNS computations to be completed quickly [4, 24], [28]. Unfortunately, division and comparison are difficult to perform in a residue number system [, 5]. The origin of such systems is quite old: it goes back to the well-known Chinese Remainder Theorem (CRT) [18]. A history of Residue Number Systems can be found in [27]. In the following, the set of relatively prime integers $(m_1, \ldots, m_n)$ is called the RNS base, and to simplify we assume that the $m_i$'s are prime numbers.
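To make the residue representation concrete, the following Python sketch encodes integers in a small RNS base, adds and multiplies them residue by residue, and recovers the result with the CRT. The moduli (7, 11, 13, 17) and the names `to_rns`, `rns_add`, `rns_mul` and `from_rns` are illustrative assumptions and do not come from the paper; it is only meant to show that each component can be processed independently.

```python
from math import prod

MODULI = (7, 11, 13, 17)   # hypothetical pairwise-coprime RNS base
M = prod(MODULI)           # representable range is [0, M), here M = 17017


def to_rns(x):
    """Residues of x modulo each modulus of the base."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Component-wise addition; each position could run on its own ALU."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Component-wise multiplication, again independent per modulus."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def from_rns(residues):
    """CRT reconstruction: X = sum(x_i * inv_i * M_i) mod M, with M_i = M / m_i."""
    total = 0
    for x_i, m in zip(residues, MODULI):
        M_i = M // m
        inv_i = pow(M_i, -1, m)   # modular inverse, requires Python 3.8+
        total += x_i * inv_i * M_i
    return total % M


a, b = to_rns(123), to_rns(45)
assert from_rns(rns_add(a, b)) == 123 + 45
assert from_rns(rns_mul(a, b)) == 123 * 45   # valid because 5535 < M
```

Results are only meaningful while they stay below `M`; detecting when they do not is one of the comparison problems that motivates the rest of the paper.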

The CRT shows that for any $n$-tuple $(x_1, \ldots, x_n)$ with $0 \le x_i < m_i$, there exists a unique integer $X$, $0 \le X < M = \prod_{i=1}^{n} m_i$, such that $(x_1, \ldots, x_n)$ represents $X$. The CRT also gives an algorithm that computes $X$. In 1989, D. Gamberger [13, 12] proposed a new algorithm for performing divisions in an RNS. Then, Lu and Chiang proposed a RNS division algorithm based on the combination of a classical division method and a parity checking method [5, 22].

In this paper, we propose a new division algorithm for Residue Number Systems. Our algorithm has been designed for applications where manipulation of huge numbers is required. The first part of this paper is devoted to a weighted representation, where a floating-point-like representation is added to the usual RNS representation. In the second part we present our new algorithm for RNS division, which gives the quotient and the remainder of the Euclidean division of two numbers given in RNS form. Then we analyze the various features of our algorithm.

2 Weighted RNS arithmetic

As stated above, our algorithm uses a weighted arithmetic, based on the combination of a RNS arithmetic and a floating-point-like arithmetic. The basic idea behind this, namely keeping track of the order of magnitude of RNS numbers, is not new: such an idea has been proposed [8, 5] in order to detect overflows. What is new is the fact that we use this for comparisons and quotient-digit estimations.

2.1 Definition of the FPL representation

Assume that is the RNS representation of an integer in the RNS-base, where. The integers will be represented in radix. We assume that all the $m_i$'s are between the same two consecutive powers, that is to say: there exists such that Thus, all the computations can be carried out using -digit numbers. Define and as integers satisfying: (1) Also define and as integers satisfying: (2) (3)

In other words, is a -digit number, and is a small number. We have: (4) therefore can be viewed as a (weighted) floating-point representation of. In the following, it will be called the Floating-Point-Like (FPL) representation of the RNS number. There may be several possible values of that satisfy the requirement of the above definition. In practice, those possible values do not all lead to the same accuracy. The most accurate representation is the one with the largest value of.

The FPL representations will be manipulated in parallel with the RNS representations, using the classical floating-point algorithm for addition and a slightly modified one for multiplication (to compute, one first computes using the usual floating-point algorithm, then one multiplies the result by a floating-point approximation of to take into account the weight that appears in (5)). Most of the time, the FPL representations will suffice for comparing numbers. However, if the numbers being compared are very close, or if, after many computations, the FPL representations have become too inaccurate, it becomes necessary to refresh them (i.e., to recompute them from the RNS representations). It is not necessary to perform a refreshment after each operation: a refreshment is performed only when the error on the FPL representations becomes so large that a comparison is impossible. In the following, we show how to construct and how to refresh the FPL representation of a RNS number.

2.2 Refreshment

To compute or to refresh the FPL representation of a RNS number, one can use the Mixed-Radix System associated to the RNS-base [18]. Unfortunately, this solution seems intrinsically sequential. Another solution is to directly implement the formula that appears in the proof of the Chinese Remainder Theorem. We choose this last solution because it looks much more parallelisable. The proof of the CRT uses the relation: (5) modulo (6) with and modulo.
We can notice that, The refreshment method is characterized by the fact that all the computations are performed with integers. Since the evaluation is done modulo 1, we just have to manipulate the fractional part of each term, and the fractional part of the sum.
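The idea of working only with fractional parts can be sketched in Python as follows. This is a simplified illustration and not the paper's exact refreshment procedure: it assumes binary fixed-point numbers with `K` fractional bits, hypothetical prime moduli between 2^7 and 2^8, and precomputed truncated weights `WEIGHTS[i]` approximating the fractional part of inv_i / m_i, where inv_i is the inverse of M / m_i modulo m_i. Only integer multiplications and additions modulo 2^K are used, and the result underestimates X / M by less than sum(m_i) / 2^K, provided X is not so close to 0 that the sum wraps around.

```python
from fractions import Fraction
from math import prod

MODULI = (233, 239, 241, 251)   # hypothetical primes, all between 2**7 and 2**8
M = prod(MODULI)
K = 32                          # assumed number of fractional bits kept

# Precomputed constants: first K fractional bits of inv_i / m_i (truncated).
WEIGHTS = tuple((pow(M // m, -1, m) << K) // m for m in MODULI)


def to_rns(x):
    return tuple(x % m for m in MODULI)


def approx_fraction(residues):
    """Integer A such that A / 2**K approximates X / M from below.

    Each term x_i * w_i is reduced modulo 2**K, i.e. only its fractional
    part (in units of 2**-K) is kept, as is the fractional part of the sum.
    """
    mask = (1 << K) - 1
    acc = 0
    for x_i, w_i in zip(residues, WEIGHTS):
        acc = (acc + x_i * w_i) & mask
    return acc


X = 123456789
A = approx_fraction(to_rns(X))
err = Fraction(X, M) - Fraction(A, 1 << K)
# Truncating the weights loses less than x_i units of 2**-K per term.
assert 0 <= err < Fraction(sum(MODULI), 1 << K)
```

When `A` comes out very small or very close to 2^K, the bound no longer separates a tiny X from one close to M, which is one way to see why a first attempt may not yield a significant value.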

When and how do we obtain a significant? First of all, we can remark that for each integer,, we have Define as the integer constituted by the first digits of the fractional part of, that is to say: -digit number -digit number is a -digit number We have, (7) The terms are pre-computed and stored constants. Now, let us multiply the -digit integer by. The result is a -digit integer. We note the integer whose digits are the first digits of the fractional part of. That is: -digit number -digit number -digit number integer part error part We have,

From Eq. (7) we deduce, (8) Define as the integer whose digits are the first digits of the fractional part of, that is: Thus, (9) So, we deduce from (8), (9) and (6) that: Thus, Using Eq. (1) we can conclude that, This gives: (10) If we compare (4) and (10), we can see that if then one can choose: and if, this means that is small compared with ( ). In such a case, we start the process again, by first multiplying (in the RNS system) by, and trying to convert it to the FPL system ( will be added to the obtained exponent). As a matter of fact, even if, if is small compared with, it may be advantageous to multiply (in the RNS system) by, where is such that, and to start the process again, to get a much more accurate FPL representation. The following theorem summarizes our result.

Theorem 1 If is such that:

then we can obtain such that: only using integer arithmetic with numbers less than

Remark It is useful to refresh the FPL representation every additions or multiplications. In this case we can perform the refreshment twice: the first time to find a good, the second time to find the corresponding.

2.3 Implementation of the refreshment

The implementation uses -digit Arithmetic Units (AUs) and an FPL unit, which are connected to each other by a -digit bus. This bus is composed of sub-buses connected by gates so as to simulate a binary tree for the evaluation of (Figure 1). Thus, we can assume a computation time for. Each -digit arithmetic unit has its own registers and memory to contain, and local variables like, or the partial value of. Such a unit can be a single -digit adder with a control part. It can perform specific additions to compute, (just additions with no overflow control: the overflows correspond to the integer part) and addition and multiplication modulo. The control part of each unit can be driven by a single sequencer, with the instructions for the different operations stored in a ROM. We can use redundant adders. All the numbers are positive, so it is easy to consider only the least significant part, knowing that the first significant digit must be positive. To give an idea of the size (which is ), a Borrow-Save implementation (radix and digits in ) with and (which gives ) uses around transistors for the arithmetic part. So it is possible to implement other operations in hardware, for example a modular multiplication [2].

3 A RNS division algorithm

We note the RNS representation of. We note the representation of in the Mixed-Radix system associated to the RNS-base. That is to say, with. If is non-zero then the function SupRNS returns the RNS representation of, else it returns the RNS representation of. In all cases this function returns a flag such that if then, else

Figure 1: Architecture of the implementation. (The diagram shows the FPL unit, FPL-U, and the arithmetic units, AU, linked by a gated bus. After an initialization in which all gates are closed, each step of the evaluation opens some gates so that one AU sends its partial result to another AU, which computes a new partial result.)
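The binary-tree evaluation suggested by the gated bus can be sketched in software as a pairwise reduction: at each round, every second unit sends its partial sum to a neighbour, so the number of rounds grows with the logarithm of the number of units. The function name `tree_sum_mod`, the choice of 32 units and the modulus are hypothetical and chosen only for illustration; the sketch models the communication pattern, not the hardware.

```python
def tree_sum_mod(terms, modulus):
    """Sum `terms` modulo `modulus` by pairwise combination.

    Returns (total, rounds), where `rounds` is the number of parallel
    steps a binary tree of adders would need.
    """
    level = [t % modulus for t in terms]
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        # Units 0, 2, 4, ... each absorb the partial sum of their right neighbour.
        nxt = [(level[i] + level[i + 1]) % modulus
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired unit just forwards its value
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


# 32 hypothetical arithmetic units, each holding one partial term.
total, rounds = tree_sum_mod(list(range(1, 33)), 1 << 16)
assert total == sum(range(1, 33)) % (1 << 16)
assert rounds == 5                  # log2(32) parallel steps
```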

Theorem 2 If then for given numbers and, the following algorithm evaluates and such that: with

Algorithm 1
pre computing: construction of
initialization: and
loop: while or ( and ) SupRNS if
end: if if else refresh

4 Analysis of the RNS division algorithm

4.1 Proof of the while loop: decrease of to

Let us denote. We have, and, Thus,

(11) as we have, we obtain, thus, If we have, (this is possible with two refreshments) then, (12) so, Thus (13) As, we have, Eq. (13) becomes, as we have, we obtain,

Thus we have no overflow problems in because And, if then the next value of can be, else. This confirms that decreases to.

4.2 When

As we have seen in (4), and considering (2) we deduce that just before the last iteration we have and. Now the last iteration gives with (3), (4) and (4) becomes, in other words for, This requires at most another iteration with an exact comparison using a Mixed Radix number system.

4.3 Implementations

Now we can complete the description of the implementation of the refreshment, in particular the description of the implementation of the function RNSsup. The evaluation of the RNS representation of is done in two steps. One in the FPL-unit for the evaluation of two numbers and such that. is a -digit number, thus modulo for each.

To compute and in the FPL unit (15)

Algorithm 2
if then we compute If such that then inf else and
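The exact comparison in a mixed-radix number system mentioned in Section 4.2 can be illustrated with the standard conversion from residues to mixed-radix digits. The sketch below is a textbook version of that conversion, not code from the paper, and the names `to_mixed_radix` and `rns_less_than` are hypothetical. It writes X = a_1 + a_2 m_1 + a_3 m_1 m_2 + ... with 0 <= a_i < m_i, so that two numbers can be compared digit by digit starting from the most significant digit. Each digit depends on the previous ones, which reflects the sequential nature of this method.

```python
def to_mixed_radix(residues, moduli):
    """Mixed-radix digits [a_1, ..., a_n] of the integer with the given residues."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a_i = r[i] % m_i
        digits.append(a_i)
        # Replace X by (X - a_i) / m_i in every remaining residue channel.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            r[j] = ((r[j] - a_i) * pow(m_i, -1, m_j)) % m_j   # Python 3.8+
    return digits


def rns_less_than(a, b, moduli):
    """Exact comparison A < B, most significant mixed-radix digit first."""
    return to_mixed_radix(a, moduli)[::-1] < to_mixed_radix(b, moduli)[::-1]


moduli = (7, 11, 13, 17)                       # hypothetical RNS base
x = tuple(1000 % m for m in moduli)
y = tuple(1001 % m for m in moduli)
assert to_mixed_radix(x, moduli) == [6, 10, 12, 0]   # 6 + 10*7 + 12*77 = 1000
assert rns_less_than(x, y, moduli)
```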

To compute modulo in the -th Arithmetic Unit (AU)

The FPL unit broadcasts to each arithmetic unit. The -th AU receives and computes such that modulo. The value modulo of is read in a local table or is computed using the radix- decomposition of (all the digits are less than ) with a lookup of the values modulo of for in a table. In this last case the table is very small (if and then it only contains one value). Thus is evaluated with at most modular multiplications.

4.4 Performances

At each step of the while loop we compute the RNSsup function, and we perform one refreshment, one modular multiplication and two modular additions. The number of steps is at most equal to. Thus the cost of this loop is at most, in other words. Now the last step of our algorithm is done in time [18].

Theorem 3 The proposed RNS-division algorithm is a -time algorithm with space.

5 Other recent algorithms

Comparison and division are difficult problems in RNS arithmetic. Many algorithms have been proposed in the literature to cope with these problems. D. Gamberger presented an original RNS division algorithm without comparisons [13]. The execution time of his algorithm only depends on the value of the divisor. Each iteration requires modular steps, and the number of iterations is often greater than. For example with moduli,,, and, with a divisor close to the maximum number of iterations is equal to. The execution time of our algorithm does not depend on the value of the divisor: it only depends on the size of the difference between the dividend and the divisor. The worst case of our algorithm has a better execution time than the average case of Gamberger's algorithm.

More recently, M. Lu and J.S. Chiang proposed a RNS division based on a classical high-radix division algorithm with comparisons performed using a parity checking [5, 22]. To have an efficient -time algorithm they use tables. The size of each table is proportional to. For example, if then the size of the tables is close to bits. At each iteration, Lu and Chiang's algorithm computes a binary digit of the quotient. Each iteration requires the evaluation of. Our algorithm computes a radix- -digit at each iteration. Each iteration requires a floating-point division and the evaluation of. Thus the execution time of our algorithm is better as soon as times the time of one iteration of Lu and Chiang's algorithm is greater than the floating-point division time. Moreover, we are not limited by the size of a table.
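The contrast between one quotient bit per iteration and one high-radix digit per iteration can be illustrated with the following Python sketch. It is a loose illustration of the general principle (a quotient chunk estimated from a low-precision floating-point ratio, with the quotient and remainder updated residue by residue) and should not be read as the algorithm of this paper. In particular, `from_rns` performs a full CRT reconstruction as a stand-in for the FPL comparisons and refreshments, and the moduli, the accuracy `EPS` and all function names are assumptions made for the example.

```python
from math import prod

MODULI = (233, 239, 241, 251)   # hypothetical primes between 2**7 and 2**8
M = prod(MODULI)
EPS = 2.0 ** -8                 # assumed relative accuracy of the estimate


def to_rns(x):
    return tuple(x % m for m in MODULI)


def from_rns(r):
    """Exact CRT reconstruction (stand-in for FPL comparison and refreshment)."""
    return sum(x * pow(M // m, -1, m) * (M // m) for x, m in zip(r, MODULI)) % M


def rns_divmod(a, b):
    """RNS forms of (A // B, A % B) for RNS inputs a and b, with B != 0."""
    b_val = from_rns(b)
    if b_val == 0:
        raise ZeroDivisionError("division by zero")
    q, r = to_rns(0), a
    while True:
        r_val = from_rns(r)
        if r_val < b_val:
            return q, r
        # Low-precision under-estimate of R / B: never exceeds the true
        # quotient, so the remainder stays non-negative; at least 1 here.
        digit = max(1, int(r_val / b_val * (1.0 - EPS)))
        d = to_rns(digit)
        q = tuple((qi + di) % m for qi, di, m in zip(q, d, MODULI))
        r = tuple((ri - bi * di) % m for ri, bi, di, m in zip(r, b, d, MODULI))


A, B = 3_000_000_000, 12345
q, r = rns_divmod(to_rns(A), to_rns(B))
assert (from_rns(q), from_rns(r)) == divmod(A, B)
```

With an estimate accurate to about 8 bits, each pass removes roughly 8 bits of the remaining quotient, so a handful of iterations suffice here, whereas a bit-per-iteration scheme would need about 18.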

6 Conclusion

We have proposed an efficient algorithm for RNS division, the implementation of which is realistic. The execution time of this algorithm is better than that of previously published algorithms, and it does not require large tables. We do not claim that our algorithm is attractive for applications such as computer cards (e.g., credit cards or phone cards): for such applications, other algorithms (see for instance [26]) look more promising. The use of our algorithm to perform modular multiplications of huge numbers (i.e., a RNS multiplication followed by a RNS division) would be comparable to the use of the modular multiplication algorithms proposed by N. Takagi [25, 26] or P. Kornerup [19] as soon as.

References

[1] G. Alia and E. Martinelli. On the lower bound to the VLSI complexity of number conversion from weighted to residue representation. IEEE Transactions on Computers, 42(8):962, August 1993.
[2] G. Alia and E. Martinelli. A VLSI modulo m multiplier. IEEE Transactions on Computers, 4(7):873, July 1994.
[3] D.K. Banerji, T.Y. Cheung, and V. Ganesan. A High-Speed Division Method in Residue System. In 5th IEEE Symp. on Comp. Arith., page 58, 1981.
[4] P.W. Beame, S.A. Cook, and H.J. Hoover. Log depth circuits for division and related problems. SIAM J. Comp., 5(4):994, November 1986.
[5] J.S. Chiang and Mi Lu. A general division algorithm for residue number system. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, page 76. IEEE Computer Society Press, 1991.
[6] W.A. Chren. A New Residue Number System Division Algorithm. Comput. Math. Appl., 9(7):3, 1990.
[7] L. Ciminiera and P. Montuschi. Over-redundant digit sets and the design of digit-by-digit division units. IEEE Transactions on Computers, 43(3):269, March 1994.
[8] E.D. Di Claudio, G. Orlandi, and F. Piazza. A systolic redundant residue arithmetic error correction circuit. IEEE Transactions on Computers, 42(4):427, April 1994.
[9] G.I. Davida and B. Litow. Fast parallel arithmetic via modular representation. SIAM J. Comp., 2(4):756, August 1991.
[10] G. Dimauro, S. Impedovo, and G. Pirlo. A new technique for fast number comparison in the residue number systems. IEEE Transactions on Computers, 42(5):67, May 1993.
[11] S.E. Eldridge and C.D. Walter. Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Transactions on Computers, 42(6):693, July 1993.

[12] D. Gamberger. Incomplete specified numbers in residue number system - Definition and applications. In M.D. Ercegovac and E. Swartzlander, editors, Proceedings of the 9th Symposium on Computer Arithmetic, pages 2 25. IEEE Computer Society Press, 1989.
[13] D. Gamberger. New Approach to Integer Division in Residue Number System. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, pages 84 9. IEEE Computer Society Press, 1991.
[14] H.L. Garner. The Residue Number System. IRE Trans. Electronic Computer, 8:4, June 1959.
[15] S. Kaushik. Sign detection in non-redundant residue number system with reduced information. In 6th Symposium on Computer Architecture. IEEE Computer Society Press, 1983.
[16] Y.A. Keir, P.W. Cheney, and M. Tannenbaum. Division and Overflow Detection in Residue Number Systems. IRE Trans. Electron. Comp., August 1962.
[17] E. Kinoshita, H. Kosako, and Y. Kojima. General Division in Symmetric Residue Number System. IEEE Transactions on Computers, 22:34, February 1973.
[18] D. Knuth. The Art of Computer Programming, volume 2. Addison Wesley, 1973.
[19] P. Kornerup. High-radix modular multiplication for cryptosystem. In 11th IEEE Symposium on Computer Arithmetic, 1993.
[20] P. Kornerup. A systolic, linear-array multiplier for a class of right-shift algorithm. IEEE Transactions on Computers, 43(8):892, August 1994.
[21] M.L. Lin, E. Leiss, and B. McInnis. Division and Sign Detection Algorithms for Residue Number Systems. Comput. Math. Appl., (4/5):33, 1984.
[22] Mi Lu and J.S. Chiang. A novel division algorithm for the residue number system. IEEE Transactions on Computers, 1992.
[23] M. Shand and J. Vuillemin. Fast implementation of RSA cryptography. In ARITH, 11th Symposium on Computer Arithmetic, Windsor, Canada, 1993.
[24] N.S. Szabo and R.I. Tanaka. Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, 1967.
[25] N. Takagi. A Radix-4 Modular Multiplication Hardware Algorithm Efficient for Iterative Modular Multiplications. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, pages 35 4. IEEE Computer Society Press, 1991.
[26] N. Takagi. A modular multiplication algorithm with triangle additions. In Proceedings of the 11th Symposium on Computer Arithmetic, page 272. IEEE Computer Society Press, 1993.
[27] F.J. Taylor. Residue Arithmetic: a tutorial with examples. IEEE Computer Magazine, May 1984.
[28] F.J. Taylor. A more efficient residue arithmetic implementation of the FFT. In 5th Symposium on Computer Arithmetic. IEEE Computer Society Press, 1985.

[29] C.D. Walter. Systolic modular multiplication. IEEE Transactions on Computers, 42(3):376, March 1993.