Division by Invariant Integers using Multiplication

Torbjörn Granlund
Cygnus Support, 1937 Landings Drive, Mountain View, CA

Peter L. Montgomery
Centrum voor Wiskunde en Informatica, 780 Las Colinas Road, San Rafael, CA

Work done by first author while at Swedish Institute of Computer Science, Stockholm, Sweden.
Work done by second author while at University of California, Los Angeles. Supported by U.S. Army fellowship DAAL03-89-G.

Abstract

Integer division remains expensive on today's processors as the cost of integer multiplication declines. We present code sequences for division by arbitrary nonzero integer constants and run-time invariants using integer multiplication. The algorithms assume a two's complement architecture. Most also require that the upper half of an integer product be quickly accessible. We treat unsigned division, signed division where the quotient rounds towards zero, signed division where the quotient rounds towards $-\infty$, and division where the result is known a priori to be exact. We give some implementation results using the C compiler GCC.

1 Introduction

The cost of an integer division on today's RISC processors is several times that of an integer multiplication. The trend is towards fast, often pipelined combinatoric multipliers that perform an operation in typically less than 10 cycles, with either no hardware support for integer division or iterating dividers that are several times slower than the multiplier.

Table 1.1 compares multiplication and division times on some processors. This table illustrates that the discrepancy between multiplication and division timing has been growing.

Integer division is used heavily in base conversions, number theoretic codes, and graphics codes. Compilers generate integer divisions to compute loop counts and subtract pointers. In a static analysis of FORTRAN programs, Knuth [13, p. 9] reports that 39% of arithmetic operators were additions, 22% subtractions, 27% multiplications, 10% divisions, and 2% exponentiations.
Knuth's counts do not distinguish integer and floating point operations, except that 4% of the divisions were divisions by 2.

When integer multiplication is cheaper than integer division, it is beneficial to substitute a multiplication for a division. Multiple authors [2, 11, 15] present algorithms for division by constants, but only when the divisor divides $2^k - 1$ for some small $k$. Magenheimer et al [16, §7] give the foundation of a more general approach, which Alverson [1] implements on the Tera Computing System. Compiler writers are only beginning to become aware of the general technique. For example, version 1.02 of the IBM RS/6000 xlc and xlf compilers uses the integer multiply instruction to expand signed integer divisions by 3, 5, 7, 9, 25, and 125, but not by other odd integer divisors below 256, and never for unsigned division.

We assume an N-bit two's complement architecture. Unsigned (i.e., nonnegative) integers range from 0 to $2^N - 1$ inclusive; signed integers range from $-2^{N-1}$ to $2^{N-1} - 1$. We denote these integers by uword and sword respectively. Unsigned doubleword integers (range 0 to $2^{2N} - 1$) are denoted by udword. Signed doubleword integers (range $-2^{2N-1}$ to $2^{2N-1} - 1$) are denoted by sdword. The type int is used for shift counts and logarithms.

Several of the algorithms require the upper half of an integer product obtained by multiplying two uwords or two swords. All algorithms need simple operations such as adds, shifts, and bitwise operations (bit-ops) on uwords and swords, as summarized in Table 3.1.

We show how to use these operations to divide by arbitrary nonzero constants, as well as by divisors which are loop invariant or repeated in a basic block, using one multiplication plus a few simple instructions per division. The presentation concentrates on three types of division, in order by difficulty: (i) unsigned, (ii) signed, quotient rounded towards zero, (iii) signed, quotient rounded towards $-\infty$. Other topics are division of a udword by a run-time invariant uword, division when the remainder is known a priori to be zero, and testing for a given remainder. In each case we give the mathematical background and suggest an algorithm which a compiler can use to generate the code.

The algorithms are ineffective when a divisor is not invariant, such as in the Euclidean GCD algorithm. Most algorithms presented herein yield only the quotient. The remainder, if desired, can be computed by an additional multiplication and subtraction.

We have implemented the algorithms in a developmental version of the GCC 2.6 compiler [21]. DEC uses some of these algorithms in its Alpha AXP compilers.

[Table 1.1: Multiplication and division times on different CPUs. Columns: Architecture/Implementation; N; approximate year; time (cycles) for HIGH(N-bit * N-bit); time (cycles) for N-bit/N-bit divide (unsigned) and (signed). Processors covered include Motorola MC68020 [18, pp. 9-22], Intel 386 [9], Intel 486 [10], Intel Pentium, SPARC Viking [20], HP PA 83 [16], MIPS R3000 [12], MIPS R4000 [17], POWER/RIOS I [4, 22], PowerPC/MPC601 [19], and DEC Alpha 21064AA [8]. Legend: S = no direct hardware support, approximate cycle count for software implementation; F = does not include time for moving data to/from floating point registers; P = pipelined implementation (i.e., independent instructions can execute simultaneously).]

2 Mathematical notations

Let $x$ be a real number. Then $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$ and $\lceil x \rceil$ denotes the least integer not less than $x$. Let TRUNC(x) denote the integer part of $x$, rounded towards zero. Formally, TRUNC(x) $= \lfloor x \rfloor$ if $x \ge 0$ and TRUNC(x) $= \lceil x \rceil$ if $x < 0$. The absolute value of $x$ is $|x|$. For $x > 0$, the (real) base 2 logarithm of $x$ is $\log_2 x$. A multiplication is written $x * y$.
If $x$, $y$, and $n$ are integers and $n \ne 0$, then $x \equiv y \pmod{n}$ means $x - y$ is a multiple of $n$.

Two remainder operators are common in language definitions. Sometimes a remainder has the sign of the dividend and sometimes the sign of the divisor. We use the Ada notations

$$ n \;\mathrm{rem}\; d = n - d * \mathrm{TRUNC}(n/d) \quad \text{(sign of dividend)}, \qquad n \bmod d = n - d * \lfloor n/d \rfloor \quad \text{(sign of divisor)}. \tag{2.1} $$

The Fortran 90 names are MOD and MODULO. In C, the definition of remainder is implementation-dependent (many C implementations round signed quotients towards zero and use rem remaindering). Other definitions have been proposed [6, 7].

If $n$ is a udword or sdword, then HIGH(n) and LOW(n) denote the most significant and least significant halves of $n$. LOW(n) is a uword, while HIGH(n) is a uword if $n$ is a udword and an sword if $n$ is an sdword. In both cases $n = 2^N * \mathrm{HIGH}(n) + \mathrm{LOW}(n)$.

3 Assumed instructions

The suggested code assumes the operations in Table 3.1, on an N-bit machine. Some primitives, such as loading constants and operands, are implicit in the notation and are not included in the operation counts.

    TRUNC(x)            Truncation towards zero; see §2.
    HIGH(x), LOW(x)     Upper and lower halves of x; see §2.
    MULL(x, y)          Lower half of product x * y (i.e., product modulo 2^N).
    MULSH(x, y)         Upper half of signed product x * y:
                        if -2^(N-1) <= x, y <= 2^(N-1) - 1, then
                        x * y = 2^N * MULSH(x, y) + MULL(x, y).
    MULUH(x, y)         Upper half of unsigned product x * y:
                        if 0 <= x, y <= 2^N - 1, then
                        x * y = 2^N * MULUH(x, y) + MULL(x, y).
    AND(x, y)           Bitwise AND of x and y.
    EOR(x, y)           Bitwise exclusive OR of x and y.
    NOT(x)              Bitwise complement of x. Equal to -1 - x if x is signed,
                        to 2^N - 1 - x if x is unsigned.
    OR(x, y)            Bitwise OR of x and y.
    SLL(x, n)           Logical left shift of x by n bits (0 <= n <= N - 1).
    SRA(x, n)           Arithmetic right shift of x by n bits (0 <= n <= N - 1).
    SRL(x, n)           Logical right shift of x by n bits (0 <= n <= N - 1).
    XSIGN(x)            -1 if x < 0; 0 if x >= 0.
                        Short for SRA(x, N - 1) or -SRL(x, N - 1).
    x + y, x - y, -x    Two's complement addition, subtraction, negation.

Table 3.1: Mathematical notations and primitive operations

The algorithm in §8 requires the ability to add or subtract two doublewords, obtaining a doubleword result; this typically expands into 2-4 instructions.

The algorithms for processing constant divisors require compile-time arithmetic on udwords.

Algorithms for processing run-time invariant divisors require taking the base 2 logarithm of a positive integer (sometimes rounded up, sometimes down) and require dividing a udword by a uword. If the algorithms are used only for constant divisors, then these operations are needed only at compile time. If the architecture has a leading zero count (LDZ) instruction, then these logarithms can be found from

$$ \lceil \log_2 x \rceil = N - \mathrm{LDZ}(x - 1), \qquad \lfloor \log_2 x \rfloor = N - 1 - \mathrm{LDZ}(x) \qquad (1 \le x \le 2^N - 1). $$

Some algorithms may produce expressions such as SRL(x, 0) or $-(x - y)$; the optimizer should make the obvious simplifications. Some descriptions show an addition or subtraction of $2^N$, which is a no-op.

If an architecture lacks arithmetic right shift, then it can be computed from the identity

$$ \mathrm{SRA}(x, l) = \mathrm{SRL}(x + 2^{N-1}, l) - 2^{N-1-l} \qquad \text{whenever } 0 \le l \le N - 1. $$
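The logarithm and shift identities of this section can be checked with a small C sketch. It assumes N = 32, with `uint32_t` standing in for uword; the helper names (`ldz32`, `ceil_log2`, `floor_log2`, `sra_via_srl`) are illustrative and do not come from the paper, and the leading-zero count is written as a portable loop rather than a hardware instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32, with uint32_t standing in for the paper's uword. */

/* LDZ: count of leading zero bits; returns 32 when x == 0. */
static int ldz32(uint32_t x)
{
    int count = 0;
    while (count < 32 && (x & 0x80000000u) == 0) {
        x <<= 1;
        count++;
    }
    return count;
}

/* ceil(log2 x) = N - LDZ(x - 1), for 1 <= x <= 2^N - 1. */
static int ceil_log2(uint32_t x)
{
    assert(x >= 1);
    return 32 - ldz32(x - 1);
}

/* floor(log2 x) = N - 1 - LDZ(x), for 1 <= x <= 2^N - 1. */
static int floor_log2(uint32_t x)
{
    assert(x >= 1);
    return 31 - ldz32(x);
}

/* SRA(x, l) = SRL(x + 2^(N-1), l) - 2^(N-1-l), computed on bit patterns
 * so that all arithmetic wraps modulo 2^32. */
static uint32_t sra_via_srl(uint32_t x, int l)
{
    assert(l >= 0 && l <= 31);
    return ((x + 0x80000000u) >> l) - (0x80000000u >> l);
}
```

For instance, `sra_via_srl(0xFFFFFFF8u, 1)` yields the bit pattern of -4, the arithmetic right shift of -8 by one position.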
If an architecture has only one of MULSH and MULUH, then the other can be computed using

$$ \mathrm{MULUH}(x, y) = \mathrm{MULSH}(x, y) + \mathrm{AND}(x, \mathrm{XSIGN}(y)) + \mathrm{AND}(y, \mathrm{XSIGN}(x)) $$

for arbitrary N-bit patterns $x$, $y$ (interpreted as uwords for MULUH and as swords for MULSH).

4 Unsigned division

Suppose we want to compile an unsigned division $q = \lfloor n/d \rfloor$, where $0 < d < 2^N$ is a constant or run-time invariant and $0 \le n < 2^N$ is variable. Let's try to find a rational approximation $m/2^{N+l}$ of $1/d$ such that

$$ \left\lfloor \frac{n}{d} \right\rfloor = \left\lfloor \frac{m * n}{2^{N+l}} \right\rfloor \qquad \text{whenever } 0 \le n \le 2^N - 1. \tag{4.1} $$

Setting $n = d$ in (4.1) shows we require $2^{N+l} \le m * d$. Setting $n = q * d - 1$ shows $2^{N+l} * q > m * (q * d - 1)$. Multiply by $d$ to derive $(m * d - 2^{N+l}) * (q * d - 1) < 2^{N+l}$. This inequality will hold for all values of $q * d - 1$ below $2^N$ if $m * d - 2^{N+l} \le 2^l$. Theorem 4.2 below states that these conditions are sufficient, because the maximum relative error (1 part in $2^N$) is too small to affect the quotient when $n < 2^N$.

Theorem 4.2 Suppose $m$, $d$, $l$ are nonnegative integers such that $d \ne 0$ and

$$ 2^{N+l} \le m * d \le 2^{N+l} + 2^l. \tag{4.3} $$

Then $\lfloor n/d \rfloor = \lfloor m * n / 2^{N+l} \rfloor$ for every integer $n$ with $0 \le n < 2^N$.

Proof. Define $k = m * d - 2^{N+l}$. Then $0 \le k \le 2^l$ by hypothesis. Given $n$ with $0 \le n < 2^N$, write $n = q * d + r$ where $q = \lfloor n/d \rfloor$ and $0 \le r \le d - 1$. We must show that $q = \lfloor m * n / 2^{N+l} \rfloor$. A calculation gives

$$ \frac{m * n}{2^{N+l}} - q = \frac{k + 2^{N+l}}{d} * \frac{n}{2^{N+l}} - q = \frac{k * n}{d * 2^{N+l}} + \frac{n}{d} - \frac{n - r}{d} = \frac{k}{2^l} * \frac{n}{2^N} * \frac{1}{d} + \frac{r}{d}. \tag{4.4} $$

This difference is nonnegative and does not exceed

$$ 1 * \frac{2^N - 1}{2^N} * \frac{1}{d} + \frac{d - 1}{d} = 1 - \frac{1}{2^N * d} < 1. $$

Theorem 4.2 allows division by $d$ to be replaced with multiplication by $m/2^{N+l}$ if (4.3) holds. In general we require $2^l \ge d - 1$ to ensure that a suitable multiple of $d$ exists in the interval $[2^{N+l}, 2^{N+l} + 2^l]$. For compatibility with the algorithms for signed division (§5 and §6), it is convenient to choose $m * d > 2^{N+l}$ even though Theorem 4.2 permits equality. Since $m$ can be almost as large as $2^{N+1}$, we don't multiply by $m$ directly, but instead by $2^N$ and $m - 2^N$. This leads to the code in Figure 4.1. Its cost is 1 multiply, 2 adds/subtracts, and 2 shifts per quotient, after computing constants dependent only on the divisor.

    Initialization (given uword d with 1 <= d < 2^N):
      int l = ceil(log2 d);                      /* d <= 2^l <= 2*d - 1 */
      uword m' = floor(2^N * (2^l - d)/d) + 1;   /* m' = floor(2^(N+l)/d) - 2^N + 1 */
      int sh1 = min(l, 1);
      int sh2 = max(l - 1, 0);                   /* sh2 = l - sh1 */

    For q = n/d, all uword:
      uword t1 = MULUH(m', n);
      q = SRL(t1 + SRL(n - t1, sh1), sh2);

Figure 4.1: Unsigned division by run-time invariant divisor

Explanation of Figure 4.1. If $d = 1$, then $l = 0$, so $m' = 1$ and $sh_1 = sh_2 = 0$. The code computes $t_1 = \lfloor 1 * n / 2^N \rfloor = 0$ and $q = n$. If $d > 1$, then $l \ge 1$, so $sh_1 = 1$ and $sh_2 = l - 1$. Since

$$ m' \le \frac{2^N * (2^l - d)}{d} + 1 \le \frac{2^N * (d - 1)}{d} + 1 < 2^N, $$

the value of $m'$ fits in a uword. Since $0 \le t_1 \le n$, the formula for $q$ simplifies to

$$ q = \mathrm{SRL}(t_1 + \mathrm{SRL}(n - t_1, 1), l - 1) = \left\lfloor \frac{t_1 + \lfloor (n - t_1)/2 \rfloor}{2^{l-1}} \right\rfloor = \left\lfloor \frac{\lfloor (t_1 + n)/2 \rfloor}{2^{l-1}} \right\rfloor = \left\lfloor \frac{t_1 + n}{2^l} \right\rfloor. \tag{4.5} $$

But $t_1 + n = \lfloor m' * n / 2^N \rfloor + n = \lfloor (m' + 2^N) * n / 2^N \rfloor$. Set $m = m' + 2^N = \lfloor 2^{N+l}/d \rfloor + 1$. The hypothesis of Theorem 4.2 is satisfied since $2^{N+l} < m * d \le 2^{N+l} + d \le 2^{N+l} + 2^l$.

Caution. Conceptually $q$ is SRL(n + t1, l), as in (4.5). Do not compute $q$ this way, since $n + t_1$ may overflow N bits and the shift count may be out of bounds.

Improvement. If $d$ is constant and a power of 2, replace the division by a shift.

Improvement. If $d$ is constant and $m = m' + 2^N$ is even, then reduce $m/2^l$ to lowest terms. The reduced multiplier fits in N bits, unlike the original.
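Figure 4.1 can be sketched in C for N = 32 as follows. This is a minimal illustration rather than the authors' implementation: the type `udiv32_t` and the functions `udiv32_init` and `udiv32` are hypothetical names, and MULUH is modelled with a 64-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32. Constants that depend only on the divisor d. */
typedef struct {
    uint32_t mprime;   /* m' = floor(2^32 * (2^l - d) / d) + 1 */
    int sh1;           /* min(l, 1) */
    int sh2;           /* max(l - 1, 0) */
} udiv32_t;

static udiv32_t udiv32_init(uint32_t d)
{
    assert(d != 0);
    int l = 0;
    while (((uint64_t)1 << l) < d)   /* l = ceil(log2 d), at most 32 */
        l++;
    udiv32_t u;
    /* (2^l - d) < 2^31, so the shifted numerator fits in 64 bits. */
    u.mprime = (uint32_t)((((((uint64_t)1 << l) - d) << 32) / d) + 1);
    u.sh1 = l < 1 ? l : 1;
    u.sh2 = l > 1 ? l - 1 : 0;
    return u;
}

static uint32_t udiv32(uint32_t n, udiv32_t u)
{
    /* t1 = MULUH(m', n) */
    uint32_t t1 = (uint32_t)(((uint64_t)u.mprime * n) >> 32);
    /* q = SRL(t1 + SRL(n - t1, sh1), sh2); n - t1 cannot underflow
     * because 0 <= t1 <= n. */
    return (t1 + ((n - t1) >> u.sh1)) >> u.sh2;
}
```

A caller would compute `udiv32_init(d)` once, for example in a loop header, and then call `udiv32(n, u)` for each dividend.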
In rare cases (e.g., $d = 641$ on a 32-bit machine, $d = 274177$ on a 64-bit machine) the final shift of the reduced form is zero.

Improvement. If $d$ is constant and even, rewrite $\lfloor n/d \rfloor = \left\lfloor \lfloor n/2^e \rfloor / (d/2^e) \right\rfloor$ for some $e > 0$. Then $\lfloor n/2^e \rfloor$ can be computed using SRL. Since $\lfloor n/2^e \rfloor < 2^{N-e}$, less precision is needed in the multiplier than before.

These ideas are reflected in Figure 4.2, which generates code for $n/d$ where $n$ is unsigned and $d$ is constant. Procedure CHOOSE_MULTIPLIER, which is shared by this and later algorithms, appears in Figure 6.2.

    Inputs: uword d and n, with d constant.
    uword d_odd, t1;
    udword m;
    int e, l, l_dummy, sh_post, sh_pre;

    (m, sh_post, l) = CHOOSE_MULTIPLIER(d, N);
    if m >= 2^N and d is even then
      Find e such that d = 2^e * d_odd and d_odd is odd.
      /* 2^e = AND(d, 2^N - d) */
      sh_pre = e;
      (m, sh_post, l_dummy) = CHOOSE_MULTIPLIER(d_odd, N - e);
    else
      sh_pre = 0;
    end if
    if d = 2^l then
      Issue q = SRL(n, l);
    else if m >= 2^N then
      assert sh_pre = 0;
      Issue t1 = MULUH(m - 2^N, n);
      Issue q = SRL(t1 + SRL(n - t1, 1), sh_post - 1);
    else
      Issue q = SRL(MULUH(m, SRL(n, sh_pre)), sh_post);
    end if

Figure 4.2: Optimized code generation of unsigned q = n/d for constant nonzero d

The following three examples illustrate the cases in Figure 4.2. All assume unsigned 32-bit arithmetic.

Example. $q = \lfloor n/10 \rfloor$. CHOOSE_MULTIPLIER finds $m_{low} = (2^{36} - 6)/10$ and $m_{high} = (2^{36} + 14)/10$. After one round of divisions by 2, it returns $(m, 3, 4)$, where $m = (2^{34} + 1)/5$. The suggested code q = SRL(MULUH((2^34 + 1)/5, n), 3) eliminates the pre-shift by 0. See Table 11.1.

Example. $q = \lfloor n/7 \rfloor$. Here $m = (2^{35} + 3)/7 > 2^{32}$. This example uses the longer sequence in Figure 4.1.

Example. $q = \lfloor n/14 \rfloor$. CHOOSE_MULTIPLIER first returns the same multiplier as when $d = 7$. The suggested code uses separate divisions by 2 and 7: q = SRL(MULUH((2^34 + 5)/7, SRL(n, 1)), 2).

5 Signed division, quotient rounded towards 0

Suppose we want to compile a signed division $q = \mathrm{TRUNC}(n/d)$, where $d$ is constant or run-time invariant, $0 < |d| \le 2^{N-1}$, and where $-2^{N-1} \le n \le 2^{N-1} - 1$ is variable. All quotients are to be rounded towards zero. We could prove a theorem like Theorem 4.2 about when $\mathrm{TRUNC}(n/d) = \mathrm{TRUNC}(m * n / 2^{N+l})$ for all $n$ in a suitable range (cf. (7.1)), but it wouldn't help since we can't compute the right side given only $\lfloor m * n / 2^N \rfloor$. Instead we show how to adjust the estimated quotient when the dividend or divisor is negative.

Theorem 5.1 Suppose $m$, $d$, $l$ are integers such that $d \ne 0$ and $0 < m * |d| - 2^{N+l-1} \le 2^l$. Let $n$ be an arbitrary integer such that $-2^{N-1} \le n \le 2^{N-1} - 1$. Define $q_0 = \lfloor m * n / 2^{N+l-1} \rfloor$. Then

$$ \mathrm{TRUNC}\left(\frac{n}{d}\right) = \begin{cases} q_0 & \text{if } n \ge 0 \text{ and } d > 0, \\ 1 + q_0 & \text{if } n < 0 \text{ and } d > 0, \\ -q_0 & \text{if } n \ge 0 \text{ and } d < 0, \\ -1 - q_0 & \text{if } n < 0 \text{ and } d < 0. \end{cases} $$

Proof. When $n \ge 0$ and $d > 0$, this is Theorem 4.2 with $N$ replaced by $N - 1$. Suppose $n < 0$ and $d > 0$, say $n = q * d - r$ where $0 \le r \le d - 1$. Define $k = m * d - 2^{N+l-1}$. Then

$$ q - \frac{m * n}{2^{N+l-1}} = \frac{k}{2^l} * \frac{-n}{2^{N-1}} * \frac{1}{d} + \frac{r}{d}, \tag{5.2} $$

as in (4.4). Since $0 < k \le 2^l$ by hypothesis, the first fraction on the right of (5.2) is positive and $r/d$ is nonnegative. The sum is at most $1/d + (d - 1)/d = 1$, so $q_0 = \lfloor m * n / 2^{N+l-1} \rfloor = q - 1$, as asserted. For $d < 0$, use $\mathrm{TRUNC}(n/d) = -\mathrm{TRUNC}(n/|d|)$.

Caution. When $d < 0$, avoid rewriting the quotient as $\mathrm{TRUNC}((-n)/|d|)$, which fails for $n = -2^{N-1}$.

For a run-time invariant divisor, this leads to the code in Figure 5.1. Its cost is 1 multiply, 3 adds, 2 shifts, and 1 bit-op per quotient.

Explanation of Figure 5.1. The multiplier $m$ satisfies $2^{N-1} < m < 2^N$ except when $d = \pm 1$; in the latter cases $m = 2^N + 1$. In either case $m' = m - 2^N$ fits in an sword. We compute $\lfloor m * n / 2^N \rfloor$ as $n + \lfloor (m - 2^N) * n / 2^N \rfloor$, using MULSH. The subtraction of XSIGN(n) adds one if $n < 0$. The last line negates the tentative quotient if $d < 0$ (i.e., if $d_{sign} = -1$).

Variation. An alternate computation of $m'$ is

$$ m' = \mathrm{TRUNC}\left(\frac{2^N * (2^{l-1} - |d|) + 1}{|d|}\right). $$

This uses signed (2N)-bit/N-bit division, with N-bit quotient.
    Initialization (given constant sword d with d != 0):
      int l = max(ceil(log2 |d|), 1);
      udword m = 1 + floor(2^(N+l-1) / |d|);
      sword m' = m - 2^N;
      sword d_sign = XSIGN(d);
      int sh_post = l - 1;

    For q = TRUNC(n/d), all sword:
      sword q0 = n + MULSH(m', n);
      q0 = SRA(q0, sh_post) - XSIGN(n);
      q = EOR(q0, d_sign) - d_sign;

Figure 5.1: Signed division by run-time invariant divisor, rounded towards zero

Overflow detection. The quotient $n/d$ overflows if $n = -2^{N-1}$ and $d = -1$. The algorithm in Figure 5.1 returns $-2^{N-1}$. If overflow detection is required, the final subtraction of $d_{sign}$ should check for overflow.

Improvement. If $m$ is constant and even, then reduce $m/2^l$ to lowest terms, as in the unsigned case. This improvement is reflected in Figure 5.2, which generates code for $\mathrm{TRUNC}(n/d)$ where $d$ is a nonzero constant. Figure 5.2 also checks for the divisor being a power of 2 or negative thereof.

    Inputs: sword d and n, with d constant and d != 0.
    udword m;
    int l, sh_post;

    (m, sh_post, l) = CHOOSE_MULTIPLIER(|d|, N - 1);
    if |d| = 1 then
      Issue q = n;
    else if |d| = 2^l then
      Issue q = SRA(n + SRL(SRA(n, l - 1), N - l), l);
    else if m < 2^(N-1) then
      Issue q = SRA(MULSH(m, n), sh_post) - XSIGN(n);
    else
      Issue q = SRA(n + MULSH(m - 2^N, n), sh_post) - XSIGN(n);
      Cmt. Caution: m - 2^N is negative.
    end if
    if d < 0 then
      Issue q = -q;
    end if

Figure 5.2: Optimized code generation of signed q = TRUNC(n/d) for constant d != 0

Example. $q = \mathrm{TRUNC}(n/3)$. On a 32-bit machine, CHOOSE_MULTIPLIER(3, 31) returns $sh_{post} = 0$ and $m = (2^{32} + 2)/3$. The code q = MULSH(m, n) - XSIGN(n) uses one multiply, one shift, one subtract.
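A C sketch of Figure 5.1 for N = 32 may help make the sign handling concrete. The names `sdiv32_t`, `sdiv32_init`, `sdiv32`, and `floor_shift` are hypothetical. As a deliberate simplification, the intermediate values are kept in 64-bit integers so that the wrap-around the figure relies on never becomes signed overflow in C, and the arithmetic shift is written portably.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32. Constants that depend only on the divisor d. */
typedef struct {
    int32_t d;        /* the divisor, kept for the overflow check */
    int32_t mprime;   /* m' = m - 2^32, where m = 1 + floor(2^(31+l)/|d|) */
    int64_t dsign;    /* XSIGN(d): -1 if d < 0, else 0 */
    int sh_post;      /* l - 1 */
} sdiv32_t;

/* floor(x / 2^s), written so that no negative value is ever shifted. */
static int64_t floor_shift(int64_t x, int s)
{
    return x >= 0 ? x >> s : ~(~x >> s);
}

static sdiv32_t sdiv32_init(int32_t d)
{
    assert(d != 0);
    uint32_t ad = d < 0 ? 0u - (uint32_t)d : (uint32_t)d;   /* |d| */
    int l = 1;
    while (((uint64_t)1 << l) < ad)   /* l = max(ceil(log2 |d|), 1) */
        l++;
    uint64_t m = 1 + (((uint64_t)1 << (31 + l)) / ad);
    sdiv32_t s;
    s.d = d;
    s.mprime = (int32_t)((int64_t)m - ((int64_t)1 << 32));
    s.dsign = d < 0 ? -1 : 0;
    s.sh_post = l - 1;
    return s;
}

/* q = TRUNC(n/d), following the three statements of Figure 5.1. */
static int32_t sdiv32(int32_t n, sdiv32_t s)
{
    assert(!(n == INT32_MIN && s.d == -1));   /* quotient would overflow */
    int64_t xsign_n = n < 0 ? -1 : 0;
    int64_t mulsh = floor_shift((int64_t)s.mprime * n, 32);  /* MULSH(m', n) */
    int64_t q0 = (int64_t)n + mulsh;
    q0 = floor_shift(q0, s.sh_post) - xsign_n;   /* SRA, then add 1 if n < 0 */
    return (int32_t)((q0 ^ s.dsign) - s.dsign);  /* negate if d < 0 */
}
```

For example, `sdiv32(-7, sdiv32_init(3))` evaluates to -2, matching truncation towards zero.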

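6 Signed division, quotient rounded towards $-\infty$

Some languages require negative quotients to round towards $-\infty$ rather than zero. With some ingenuity, we can compute these quotients in terms of quotients which round towards zero, even if the signs of the dividend and divisor are unknown at compile time.

If $n$ and $d$ are integers, then the identities

$$ \left\lfloor \frac{n}{d} \right\rfloor = \begin{cases} \mathrm{TRUNC}(n/d) & \text{if } n \ge 0 \text{ and } d > 0, \\ \mathrm{TRUNC}((n + 1)/d) - 1 & \text{if } n < 0 \text{ and } d > 0, \\ \mathrm{TRUNC}((n - 1)/d) - 1 & \text{if } n > 0 \text{ and } d < 0, \\ \mathrm{TRUNC}(n/d) & \text{if } n \le 0 \text{ and } d < 0 \end{cases} $$

are easily verified. Since the new numerators $n \pm 1$ never overflow, these identities can be used for computation. They are summarized by

$$ \left\lfloor \frac{n}{d} \right\rfloor = \mathrm{TRUNC}\left(\frac{n + d_{sign} - n_{sign}}{d}\right) + q_{sign}, \tag{6.1} $$

where $d_{sign} = \mathrm{XSIGN}(d)$, $n_{sign} = \mathrm{XSIGN}(\mathrm{OR}(n, n + d_{sign}))$, and $q_{sign} = \mathrm{EOR}(n_{sign}, d_{sign})$. The cost is 2 shifts, 3 adds/subtracts, and 2 bit-ops, plus the divide ($n + d_{sign}$ is a repeated subexpression).

For remainders, a corollary to (2.1) and (6.1) is

$$ \begin{aligned} n \bmod d &= n - d * \mathrm{TRUNC}((n + d_{sign} - n_{sign})/d) - d * q_{sign} \\ &= ((n + d_{sign} - n_{sign}) \;\mathrm{rem}\; d) - d_{sign} + n_{sign} - d * q_{sign} \\ &= ((n + d_{sign} - n_{sign}) \;\mathrm{rem}\; d) + \mathrm{AND}(d - 2 * d_{sign} - 1, q_{sign}). \end{aligned} \tag{6.2} $$

The last equality in (6.2) can be verified by separately checking the cases $q_{sign} = n_{sign} - d_{sign} = 0$ and $q_{sign} = n_{sign} + d_{sign} = -1$. The subexpression $d - 2 * d_{sign} - 1$ depends only on $d$.

For rounding towards $+\infty$, an analog of (6.1) is

$$ \left\lceil \frac{n}{d} \right\rceil = \mathrm{TRUNC}\left(\frac{n - d_{sign} + n_{pos}}{d}\right) - \mathrm{EOR}(d_{sign}, n_{pos}), $$

where $d_{sign} = \mathrm{XSIGN}(d)$ and $n_{pos} = -(n > d_{sign})$.

Improvement. If $d > 0$ is constant, then $d_{sign} = 0$. Then (6.1) becomes

$$ \left\lfloor \frac{n}{d} \right\rfloor = \mathrm{TRUNC}\left(\frac{n - n_{sign}}{d}\right) + n_{sign}, $$

where $n_{sign} = \mathrm{XSIGN}(n)$. Since $\mathrm{TRUNC}(-x) = -\mathrm{TRUNC}(x)$ and $\mathrm{EOR}(-1, n) = -1 - n = -(n + 1)$, this is equivalent to

$$ \left\lfloor \frac{n}{d} \right\rfloor = \mathrm{EOR}\left(n_{sign}, \mathrm{TRUNC}\left(\frac{\mathrm{EOR}(n_{sign}, n)}{d}\right)\right) \qquad (d > 0). \tag{6.3} $$

The dividend and divisor on the right of (6.3) are both nonnegative and below $2^{N-1}$. One can view them as signed or as unsigned when applying earlier algorithms.

Improvement. The $\mathrm{XSIGN}(\mathrm{OR}(n, n + d_{sign}))$ is equivalent to $-(n \le \mathrm{NOT}(d_{sign}))$ and to $-(n < -d_{sign})$, where the relationals produce 1 if true and 0 if false.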
On the MIPS R2000/R3000 [12], for example, one can compute

    -d_sign = SRL(d, N - 1);
    -n_sign = (n < -d_sign);              /* SLT, signed */
    -q_sign = EOR(-n_sign, -d_sign);
    q = TRUNC((n - (-d_sign) + (-n_sign))/d) - (-q_sign);

(six instructions plus the divide), saving an instruction over (6.1).

Improvement. If $n$ is known to be nonzero, then $n_{sign}$ simplifies to XSIGN(n).

For constant divisors, one can use (6.1) and the algorithm in Figure 5.2. For constant $d > 0$ a shorter algorithm, based on (6.3), appears in Figure 6.1.

    Inputs: sword n and d, with d constant and d > 0.
    udword m;
    int l, sh_post;

    (m, sh_post, l) = CHOOSE_MULTIPLIER(d, N - 1);
    if d = 2^l then
      Issue q = SRA(n, l);
    else
      assert m < 2^N;
      Issue sword n_sign = XSIGN(n);
      Issue uword q0 = MULUH(m, EOR(n_sign, n));
      Issue q = EOR(n_sign, SRL(q0, sh_post));
    end if

Figure 6.1: Optimized code generation of signed q = floor(n/d) for constant d > 0

Example. Using signed 32-bit arithmetic, the code for r = n mod 10 (nonnegative remainder) can be

    sword n_sign = XSIGN(n);
    uword q0 = MULUH((2^33 + 3)/5, EOR(n_sign, n));
    sword q = EOR(n_sign, SRL(q0, 2));
    r = n - SLL(q, 1) - SLL(q, 3);

The cost is 1 multiply, 4 shifts, 2 bit-ops, 2 subtracts. Alternately, if one has a fast signed division algorithm which rounds quotients towards 0 and returns remainders, then (6.2) justifies the code

    r = ((n - XSIGN(n)) rem 10) + AND(9, XSIGN(n));

The cost is 1 divide, 1 shift, 1 bit-op, 2 adds/subtracts.
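The multiplier selection of Figure 6.2 and its use in Figure 6.1 can be sketched together in C for N = 32. The names `multiplier_t`, `choose_multiplier`, and `floor_div_const` are hypothetical. The sketch computes the multiplier at run time for illustration, whereas a compiler would do so once per constant, and it restricts d to at most 2^31 so that every intermediate fits in 64 bits (an assumption of this sketch, not of the paper).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32. Result triple of CHOOSE_MULTIPLIER. */
typedef struct {
    uint64_t m;     /* multiplier (may need 33 bits) */
    int sh_post;    /* post-shift count */
    int l;          /* ceil(log2 d) */
} multiplier_t;

static multiplier_t choose_multiplier(uint32_t d, int prec)
{
    assert(d >= 1 && d <= 0x80000000u);   /* keeps 2^(32+l) within 64 bits */
    assert(prec >= 1 && prec <= 32);
    int l = 0;
    while (((uint64_t)1 << l) < d)
        l++;
    int sh_post = l;
    uint64_t pow = (uint64_t)1 << (32 + l);
    uint64_t m_low = pow / d;
    uint64_t m_high = (pow + ((uint64_t)1 << (32 + l - prec))) / d;
    /* Reduce to lowest terms. */
    while (m_low / 2 < m_high / 2 && sh_post > 0) {
        m_low /= 2;
        m_high /= 2;
        sh_post--;
    }
    multiplier_t r = { m_high, sh_post, l };
    return r;
}

/* floor(n/d) for d > 0, following Figure 6.1 and identity (6.3). */
static int32_t floor_div_const(int32_t n, int32_t d)
{
    assert(d > 0);
    multiplier_t c = choose_multiplier((uint32_t)d, 31);
    if (((uint64_t)1 << c.l) == (uint64_t)d) {
        /* d = 2^l: q = SRA(n, l), written without shifting a negative value */
        return n >= 0 ? n >> c.l : ~(~n >> c.l);
    }
    assert(c.m < ((uint64_t)1 << 32));
    uint32_t nsign = n < 0 ? 0xFFFFFFFFu : 0u;      /* XSIGN(n) as a mask */
    uint32_t x = nsign ^ (uint32_t)n;               /* n, or -n - 1 if n < 0 */
    uint32_t q0 = (uint32_t)((c.m * x) >> 32);      /* MULUH(m, x) */
    uint32_t q = q0 >> c.sh_post;
    /* EOR(n_sign, q): q itself, or -q - 1 when n < 0. */
    return n < 0 ? -(int32_t)q - 1 : (int32_t)q;
}
```

With these definitions, `choose_multiplier(10, 31)` produces the multiplier $(2^{33} + 3)/5$ with a post-shift of 2, as in the example above, and the nonnegative remainder is `n - 10 * floor_div_const(n, 10)`.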

    procedure CHOOSE_MULTIPLIER(uword d, int prec);
    Cmt. d    - Constant divisor to invert. 1 <= d < 2^N.
    Cmt. prec - Number of bits of precision needed, 1 <= prec <= N.
    Cmt. Finds m, sh_post, l such that:
    Cmt.   2^(l-1) < d <= 2^l.
    Cmt.   0 <= sh_post <= l. If sh_post > 0, then N + sh_post <= l + prec.
    Cmt.   2^(N+sh_post) < m * d <= 2^(N+sh_post) * (1 + 2^(-prec)).
    Cmt. Corollary. If d <= 2^prec, then
    Cmt.   m < 2^(N+sh_post) * (1 + 2^(1-l))/d <= 2^(N+sh_post-l+1).
    Cmt.   Hence m fits in max(prec, N - l) + 1 bits (unsigned).
    Cmt.
    int l = ceil(log2 d), sh_post = l;
    udword m_low = floor(2^(N+l) / d),
           m_high = floor((2^(N+l) + 2^(N+l-prec)) / d);
    Cmt. To avoid numerator overflow, compute m_low as 2^N + (m_low - 2^N).
    Cmt. Likewise for m_high. Compare m' in Figure 4.1.
    Invariant. m_low = floor(2^(N+sh_post) / d)
               < m_high = floor(2^(N+sh_post) * (1 + 2^(-prec)) / d).
    while floor(m_low/2) < floor(m_high/2) and sh_post > 0 do
      m_low = floor(m_low/2); m_high = floor(m_high/2); sh_post = sh_post - 1;
    end while;                        /* Reduce to lowest terms. */
    return (m_high, sh_post, l);      /* Three outputs. */
    end CHOOSE_MULTIPLIER;

Figure 6.2: Selection of multiplier and shift count

7 Use of floating point

One alternative to MULUH and MULSH uses floating point arithmetic. Let the floating point mantissa be $F$ bits wide (e.g., $F = 53$ for IEEE double precision arithmetic). Then any floating point operation has relative error at most $2^{1-F}$, regardless of the rounding mode, unless exponent overflow or underflow occurs. Suppose $N \ge 1$ and $F \ge N + 3$. We claim that

$$ \mathrm{TRUNC}\left(\frac{n}{d}\right) = \mathrm{TRUNC}(q_{est}), \qquad \text{where } q_{est} \approx n * \left(\frac{1 + 2^{2-F}}{d}\right), \tag{7.1} $$

whenever $|n| \le 2^N - 1$ and $0 < |d| < 2^N$, regardless of the rounding modes used to compute $q_{est}$.

The proof assumes that $n > 0$ and $d > 0$, by negating both sides of (7.1) if necessary (the case $n = 0$ is trivial). Since the relative error per operation is at most $2^{1-F}$, the estimated quotient $q_{est}$ satisfies

$$ (1 + 2^{2-F}) * (1 - 2^{1-F})^2 * \frac{n}{d} \le q_{est} \le (1 + 2^{2-F}) * (1 + 2^{1-F})^2 * \frac{n}{d}. $$

Use this and the inequalities

$$ 1 - 2^{-F} < (1 + 2^{2-F}) * (1 - 2^{1-F})^2, \qquad (1 + 2^{2-F}) * (1 + 2^{1-F})^2 < \frac{1}{1 - 2^{3-F}} \le \frac{1}{1 - 2^{-N}} $$

to derive

$$ (1 - 2^{-F}) * \frac{n}{d} < q_{est} < \frac{n/d}{1 - 2^{-N}} \le \frac{n/d}{1 - \frac{1}{n+1}} = \frac{n + 1}{d}. $$

Denote $q = \mathrm{TRUNC}(n/d)$. Then $q_{est} < (n + 1)/d$ implies $\mathrm{TRUNC}(q_{est}) \le q$. If $q_{est} < q$, then $(1 - 2^{-F}) * q \le (1 - 2^{-F}) * n/d < q_{est} < q$.
Both $q$ and $q_{est}$ are exactly representable as floating point numbers, but there are no representable numbers strictly between $(1 - 2^{-F}) * q$ and $q$. This contradiction shows that $q_{est} \ge q$ and hence $q = \mathrm{TRUNC}(q_{est})$.

For quotients rounded towards $-\infty$, use (6.1).

If $F = 53$ and $N \le 50$, then (7.1) can be used for N-bit integer division. The algorithm may trigger an IEEE exception for inexactness if the application program enables that condition.

Alverson [1] uses integer multiplication, but computes the multiplier using floating point arithmetic. Baker [3] does modular multiplication using a combination of floating point and integer arithmetic.

8 Dividing udword by uword

One primitive operation for multiple-precision arithmetic [14, p. 251] is the division of a udword by a uword, obtaining uword quotient and remainder, where the quotient is known to be less than $2^N$. We describe a way to compute this quotient and remainder after some preliminary computations involving only the divisor, when the divisor is a run-time invariant expression.

    Initialization (given uword d, where 0 < d < 2^N):
      int l = 1 + floor(log2 d);                 /* 2^(l-1) <= d < 2^l */
      uword m' = floor((2^N * (2^l - d) - 1)/d); /* m' = floor((2^(N+l) - 1)/d) - 2^N */
      uword d_norm = SLL(d, N - l);              /* Normalized divisor d * 2^(N-l) */

    For q = floor(n/d) and r = n - q*d, where d, q, r are uword and n is udword:
      uword n2 = SLL(HIGH(n), N - l) + SRL(LOW(n), l);
                                    /* See note about shift count. */
      uword n10 = SLL(LOW(n), N - l);
                                    /* n10 = n1 * 2^(N-1) + n0 * 2^(N-l) */
                                    /* Ignore overflow. */
      sword -n1 = XSIGN(n10);
      uword n_adj = n10 + AND(-n1, d_norm - 2^N);
                                    /* n_adj = n10 + n1 * (d_norm - 2^N) */
                                    /*       = n1 * (d_norm - 2^(N-1)) + n0 * 2^(N-l) */
      uword q1 = n2 + HIGH(m' * (n2 - (-n1)) + n_adj);
                                    /* Underflow is impossible. See Lemma 8.1. */
      sdword dr = n - 2^N * d + (2^N - 1 - q1) * d;
                                    /* dr = n - q1*d - d, -d <= dr < d */
      q = HIGH(dr) - (2^N - 1 - q1) + 2^N;
                                    /* Add 1 to quotient if dr >= 0. */
      r = LOW(dr) + AND(d - 2^N, HIGH(dr));
                                    /* Add d to remainder if dr < 0. */

Figure 8.1: Unsigned division of udword by run-time invariant uword.

Lemma 8.1 Suppose that $d$, $m$, and $l$ are nonnegative integers such that $2^{l-1} \le d < 2^l \le 2^N$ and

$$ 0 < 2^{N+l} - m * d \le d. \tag{8.2} $$

Given $n$ with $0 \le n \le d * 2^N - 1$, write $n = n_2 * 2^l + n_1 * 2^{l-1} + n_0$, where $n_0$, $n_1$, and $n_2$ are integers with $0 \le n_1 \le 1$ and $0 \le n_0 \le 2^{l-1} - 1$. Define integers $q_1$ and $q_0$ by

$$ q_1 * 2^N + q_0 = n_2 * 2^N + (n_2 + n_1) * (m - 2^N) + n_1 * (d * 2^{N-l} - 2^{N-1}) + n_0 * 2^{N-l} \tag{8.3} $$

and $0 \le q_0 \le 2^N - 1$. Then $0 \le q_1 \le 2^N - 1$ and $0 \le n - q_1 * d < 2 * d$.

Proof. Define $k = 2^{N+l} - m * d$. Then (8.2) implies $0 < k \le d \le 2^l - 1$. The bound $n \le d * 2^N - 1$ implies $n_2 \le d * 2^{N-l} - 1$. Equation (8.2) implies $m \ge 2^{N+l}/d - 1 > 2^N - 1$, so $m \ge 2^N$. A corollary to (8.3) is

$$ \begin{aligned} q_1 * 2^N + q_0 &= n_2 * m + n_1 * (m - 2^N) + 2^{N-l} * \left( n_1 * (d - 2^{l-1}) + n_0 \right) \\ &\le (d * 2^{N-l} - 1) * m + 1 * (m - 2^N) + 2^{N-l} * \left( 1 * (2^{l-1} - 1) + (2^{l-1} - 1) \right) \\ &= 2^{N-l} * (d * m - 2) < 2^{2N}. \end{aligned} $$

This proves the upper bound on the integer $q_1$. A straightforward calculation using the definitions of $k$ and $q_0$ and $n_0$ reveals that

$$ n - q_1 * d = \frac{(n_2 + n_1) * k + q_0 * d}{2^N} + \left(1 - \frac{d}{2^l}\right) * \left( n_1 * (d - 2^{l-1}) + n_0 \right). \tag{8.4} $$

Since $2^{l-1} \le d < 2^l$ by hypothesis, the right side of (8.4) is nonnegative. This remainder is bounded by

$$ \frac{(d * 2^{N-l}) * d + (2^N - 1) * d}{2^N} + \left(1 - \frac{d}{2^l}\right) * \left( 1 * (d - 2^{l-1}) + (2^{l-1} - 1) \right) < \frac{d^2}{2^l} + d + \left(1 - \frac{d}{2^l}\right) * d = 2 * d, $$

completing the proof.

This leads to an algorithm like that in Figure 8.1 when dividing a udword by a run-time invariant uword with quotient known to be less than $2^N$. Unlike the previous algorithms, this code rounds the multiplier down when computing a reciprocal. After initializations depending only on the divisor, this algorithm requires two products (both halves of each) and simple operations (including doubleword adds and subtracts). Five registers hold $d$, $d_{norm}$, $l$, $m'$, and $N - l$.

Note. The shift count $l$ in the computations of $m'$ and $n_2$ may equal $N$. If this is too large, use separate shifts by $l - 1$ and 1. If a doubleword shift is available, compute $n_2$ and $n_{10}$ together.
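The steps of Figure 8.1 can be traced in a C sketch for N = 32, with `uint64_t` standing in for udword. The names `udinv_t`, `qr32_t`, `udinv_init`, and `udiv_dword` are hypothetical. The sketch uses 64-bit shifts to sidestep the shift-count issue raised in the Note, and it folds the last three statements of the figure into a sign test on the wrapped difference, so it illustrates the figure rather than transcribing it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32. Constants that depend only on the divisor d. */
typedef struct {
    uint32_t d;
    uint32_t mprime;   /* m' = floor((2^32 * (2^l - d) - 1) / d) */
    uint32_t dnorm;    /* d * 2^(32-l) */
    int l;             /* 1 + floor(log2 d) */
} udinv_t;

typedef struct { uint32_t q, r; } qr32_t;

static udinv_t udinv_init(uint32_t d)
{
    assert(d != 0);
    int l = 1;
    while (l < 32 && (d >> l) != 0)   /* ends with 2^(l-1) <= d < 2^l */
        l++;
    udinv_t v;
    v.d = d;
    v.l = l;
    v.mprime = (uint32_t)((((((uint64_t)1 << l) - d) << 32) - 1) / d);
    v.dnorm = d << (32 - l);
    return v;
}

/* Divide the 64-bit n by v.d; requires HIGH(n) < d so the quotient fits. */
static qr32_t udiv_dword(uint64_t n, udinv_t v)
{
    assert((uint32_t)(n >> 32) < v.d);
    uint32_t low = (uint32_t)n;
    uint32_t n2 = (uint32_t)(n >> v.l);   /* SLL(HIGH, N-l) + SRL(LOW, l) */
    uint32_t n10 = (uint32_t)((uint64_t)low << (32 - v.l)); /* overflow ignored */
    uint32_t neg_n1 = (n10 >> 31) ? 0xFFFFFFFFu : 0u;       /* XSIGN(n10) */
    uint32_t nadj = n10 + (neg_n1 & v.dnorm);   /* d_norm - 2^32 wraps to d_norm */
    uint64_t t = (uint64_t)v.mprime * (n2 - neg_n1) + nadj;
    uint32_t q1 = n2 + (uint32_t)(t >> 32);
    /* dr = n - (q1 + 1) * d lies in [-d, d); it wraps if negative. */
    uint64_t dr = n - ((uint64_t)q1 + 1) * v.d;
    int negative = (dr >> 63) != 0;
    qr32_t out;
    out.q = q1 + 1u - (negative ? 1u : 0u);     /* add 1 to quotient if dr >= 0 */
    out.r = (uint32_t)dr + (negative ? v.d : 0u); /* add d to remainder if dr < 0 */
    return out;
}
```

As in multiple-precision division, the caller is responsible for ensuring that the high word of the dividend is smaller than the divisor.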

9 Exact division by constants

Occasionally a language construct requires a division whose remainder is known to vanish. An example occurs in C when subtracting two pointers. Their numerical difference is divided by the object size. The object size is a compile-time constant.

Suppose we want code for $q = n/d$, where $d$ is a nonzero constant and $n$ is an expression known to be divisible by $d$. Write $d = 2^e * d_{odd}$ where $d_{odd}$ is odd. Find $d_{inv}$ such that $1 \le d_{inv} \le 2^N - 1$ and

$$ d_{inv} * d_{odd} \equiv 1 \pmod{2^N}. \tag{9.1} $$

Then

$$ 2^e * q = 2^e * \frac{n}{d} = \frac{n}{d_{odd}} \equiv (d_{inv} * d_{odd}) * \frac{n}{d_{odd}} = d_{inv} * n \pmod{2^N}, $$

as in [2]. Hence $2^e * q \equiv d_{inv} * n \pmod{2^N}$. Since $n/d_{odd} = 2^e * q$ fits in N bits, it must equal the lower half of the product $d_{inv} * n$, namely MULL(d_inv, n). An SRA (for signed division) or SRL (for unsigned division) produces the quotient $q$.

The multiplicative inverse $d_{inv}$ of $d_{odd}$ modulo $2^N$ can be found by the extended Euclidean GCD algorithm [14, p. 325]. Another algorithm observes that (9.1) holds modulo $2^3$ if $d_{inv} = d_{odd}$. Each Newton iteration

$$ d_{inv} \leftarrow d_{inv} * (2 - d_{inv} * d_{odd}) \bmod 2^N \tag{9.2} $$

doubles the known exponent by which (9.1) holds, so $\lceil \log_2(N/3) \rceil$ iterations of (9.2) suffice.

If $d_{odd} = \pm 1$, then $d_{inv} = d_{odd}$ so the multiplication by $d_{inv}$ is trivial or a negation. If $d$ is odd, then $e = 0$ and the shift disappears.

A variation tests whether an integer $n$ is exactly divisible by a nonzero constant $d$ without computing the remainder. If $d$ is a power of 2 (or the negative thereof, in the signed case), then check the lower bits of $n$ to test whether $d$ divides $n$. Otherwise compute $d_{inv}$ and $e$ as above. Let $q_0 = \mathrm{MULL}(d_{inv}, n)$. If $n = q * d$ for some $q$, then $q_0 = 2^e * q$ must be a multiple of $2^e$. The original division is exact (no remainder) precisely when (i) $q_0$ is a multiple of $2^e$, and (ii) $q_0$ is sufficiently small that $q_0 * d_{odd}$ is representable by the original data type.

For unsigned division check that $0 \le q_0 \le 2^e * \lfloor (2^N - 1)/d \rfloor$ and that the bottom $e$ bits of $q_0$ (or of $n$) are zero.
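The Newton iteration (9.2), exact division, and the unsigned divisibility test can be sketched in C for N = 32. The names `exact32_t`, `inverse_mod_2_32`, `exact32_init`, `exact_udiv`, and `is_divisible` are hypothetical. The divisibility test folds the two unsigned checks into one comparison by rotating $q_0$ right by $e$ bits, which moves any nonzero low bits to the top of the word; this is one way to combine them and is offered as an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: N = 32. Inverse of an odd d_odd modulo 2^32 by iteration (9.2). */
static uint32_t inverse_mod_2_32(uint32_t d_odd)
{
    assert(d_odd & 1u);
    uint32_t inv = d_odd;             /* correct modulo 2^3 */
    for (int i = 0; i < 4; i++)       /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        inv *= 2u - inv * d_odd;      /* arithmetic wraps modulo 2^32 */
    return inv;
}

typedef struct {
    uint32_t dinv;    /* inverse of the odd part of d */
    int e;            /* d = 2^e * d_odd */
    uint32_t qmax;    /* floor((2^32 - 1) / d) */
} exact32_t;

static exact32_t exact32_init(uint32_t d)
{
    assert(d != 0);
    exact32_t x;
    x.e = 0;
    uint32_t d_odd = d;
    while ((d_odd & 1u) == 0) {
        d_odd >>= 1;
        x.e++;
    }
    x.dinv = inverse_mod_2_32(d_odd);
    x.qmax = 0xFFFFFFFFu / d;
    return x;
}

/* q = n/d when n is known to be a multiple of d: SRL(MULL(d_inv, n), e). */
static uint32_t exact_udiv(uint32_t n, exact32_t x)
{
    return (x.dinv * n) >> x.e;
}

/* Nonzero exactly when d divides n (unsigned). */
static int is_divisible(uint32_t n, exact32_t x)
{
    uint32_t q0 = x.dinv * n;         /* MULL(d_inv, n) */
    /* Rotate right by e; the e == 0 case avoids a shift by 32. */
    uint32_t rot = x.e == 0 ? q0 : (q0 >> x.e) | (q0 << (32 - x.e));
    return rot <= x.qmax;
}
```

For d = 100 the odd part is 25, whose inverse modulo $2^{32}$ is 3264175145.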
When $e > 0$, these tests can be combined if the architecture has a rotate (i.e., circular shift) instruction, or by expanding this rotate into OR(SRL(q0, e), SLL(q0, N - e)).

For signed division check that

$$ -2^e * \left\lfloor \frac{2^{N-1}}{|d|} \right\rfloor \le q_0 \le 2^e * \left\lfloor \frac{2^{N-1} - 1}{|d|} \right\rfloor $$

and that the bottom $e$ bits of $q_0$ are zero; the interval check can be done with an add and one signed or unsigned compare.

Relatedly, to test whether $n \;\mathrm{rem}\; d = r$, where $d$ and $r$ are constants with $1 \le r < d$ and where $n$ is signed, check whether MULL(d_inv, n - r) is a nonnegative multiple of $2^e$ not exceeding $2^e * \lfloor (2^{N-1} - 1 - r)/d \rfloor$.

Example. To test whether a signed 32-bit value $i$ is divisible by 100, let $d_{inv} = (19 * 2^{32} + 1)/25$. Compute sword q0 = MULL(d_inv, i). Next check whether $q_0$ is a multiple of 4 in the interval $[-q_{max}, q_{max}]$, where $q_{max} = (2^{31} - 48)/25$.

Since these algorithms require only the lower half of a product, other optimizations for integer multiplication apply here too. For example, applying strength reduction to the C loop

    signed long i, imax;
    for (i = 0; i < imax; i++) {
        if ((i % 100) == 0) {
            ...
        }
    }

might yield (** denotes exponentiation)

    const unsigned long dinv = (19*2**32 + 1)/25;
    const unsigned long qmax = (2**31 - 48)/25;
    unsigned long test = qmax;   /* test = dinv*i + qmax mod 2**32 */
    for (i = 0; i < imax; i++, test += dinv) {
        if (test <= 2*qmax && (test & 3) == 0) {
            ...
        }
    }

No explicit multiplication or division remains.

10 Implementation in GCC

We have implemented the algorithms for constant divisors in the freely available GCC compiler [21], by extending its machine and language independent internal code generation. We also made minor machine dependent modifications to some of the machine descriptor, or md, files to get optimal code. All languages and almost all processors supported by GCC benefit. Our changes are scheduled for inclusion in GCC 2.6.

To generate code for division of N-bit quantities, the CHOOSE_MULTIPLIER function needs to perform (2N)-bit arithmetic. This makes that procedure more complex than it might appear in Figure 6.2.

Optimal selection of instructions depending on the bitsize of the operation is a tricky problem that we spent quite some time on. For some architectures, it is important to select a multiplication instruction that has the smallest available precision. On other architectures, the multiplication can be performed faster using a sequence of additions, subtractions, and shifts.

We have not implemented any algorithm for run-time invariant divisors. Only a few architectures (AMD 29050, Intel x86, Motorola 68k & 88110, and to some extent IBM POWER) have adequate hardware support to make such an implementation viable, i.e., an instruction that can be used for integer logarithm computation, and a (2N)-bit/N-bit divide instruction. Even with hardware support, one must be careful that the transformation really improves the code; e.g., a loop might need to be executed many times before the faster loop body outweighs the cost of the multiplier computation in the loop header.

11 Results

Figure 11.1 has an example with compile-time constant divisor that gets drastically faster on all recent processor implementations. The program converts a binary number to a decimal string. It calculates one quotient and one remainder per output digit.

Table 11.1 shows the generated assembler codes for Alpha, MIPS, POWER, and SPARC. There is no explicit division. Although initially computed separately, the quotient and remainder calculations have been combined (by GCC's common subexpression elimination pass).

The unsigned int data type has 32 bits on all four architectures, but Alpha is a 64-bit architecture. The Alpha code is longer than the others because it multiplies $(2^{34} + 1)/5$ by $x$ using

$$ 4 * \left[ (2^{16} + 1) * (2^8 + 1) * \left( 4 * \left[ 4 * (4 * x - x) + x \right] - x \right) \right] + x $$

instead of the slower, 23-cycle, mulq.
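The shift-and-add chain quoted for Alpha can be written out in C to confirm that it multiplies by $(2^{34} + 1)/5$, which is 0xCCCCCCCD. The function names `mul_by_cccccccd` and `div10_shift_add` are illustrative, and the 64-bit intermediate mirrors the 64-bit Alpha registers.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a 32-bit x by 0xCCCCCCCD = (2^34 + 1)/5 using only shifts,
 * adds, and subtracts, following the expression in the text. */
static uint64_t mul_by_cccccccd(uint32_t x32)
{
    uint64_t x = x32;
    uint64_t t = 4 * x - x;    /* 3x */
    t = 4 * t + x;             /* 13x */
    t = 4 * t - x;             /* 51x = 0x33 * x */
    t += t << 8;               /* times (2^8 + 1):  0x3333 * x */
    t += t << 16;              /* times (2^16 + 1): 0x33333333 * x */
    t = 4 * t + x;             /* 0xCCCCCCCD * x */
    return t;
}

/* Unsigned 32-bit division by 10: upper product half, then SRL by 3,
 * which together amount to a 64-bit shift by 35. */
static uint32_t div10_shift_add(uint32_t x)
{
    assert(sizeof(uint64_t) == 8);
    return (uint32_t)(mul_by_cccccccd(x) >> 35);
}
```

The final shift by 35 combines the 32-bit shift that selects the upper half of the product with the post-shift of 3 from the n/10 example in §4.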
This illustrates that the multiplications neee by these algorithms can sometimes be compute quickly using a sequence of shifts, as, an subtracts [5], since multipliers for small constant ivisors have regular binary patterns. Table 11.2 compares the timing on some processor implementations for the raix conversion routine, with an without the ivision elimination algorithms. The number converte was a full 32 bit number, sufficiently large to hie proceure calling overhea from the measurements. We also ran the integer benchmarks from SPEC 92. The improvement was negligible for most of the programs; the best improvement seen was only about 3%. Some benchmarks that involve hashing show improvements up to about 30%. We anticipate significant improvements on some number theoretic coes. References [1] Robert Alverson. Integer ivision using reciprocals. In Peter Kornerup an Davi W. Matula, eitors, Proceeings 10th Symposium on Computer Arithmetic, pages , Grenoble, France, June [2] Ehu Artzy, James A. Hins, an Harry J. Saal. A fast ivision technique for constant ivisors. CACM, 19(2):98 101, February [3] Henry G. Baker. Computing A*B (mo N) efficiently in ANSI C. ACM SIGPLAN Notices, 27(1):95 98, January [4] H.B. Bakoglu, G.F. Grohoski, an R. K. Montoye. The IBM RISC system/6000 processor: Harware overview. IBM Journal of Research an Development, 34(1):12 22, January [5] Robert Bernstein. Multiplication by integer constants. Software Practice an Experience, 16(7): , July [6] Raymon T. Boute. The Eucliean efinition of the functions iv an mo. ACM Transactions on Programming Languages an Systems, 14(2): , April [7] A.P. Chang. A note on the moulo operation. SIGPLAN Notices, 20(4):19 23, April [8] Digital Equipment Corporation. DECchip AA Microprocessor, Harware Reference Manual, 1st eition, October [9] Intel Corporation, Santa Clara, CA. 386 DX Microprocessor Programmer s Reference Manual, [10] Intel Corporation, Santa Clara, CA. 
Intel486 Microprocessor Family Programmer's Reference Manual.
[11] David H. Jacobsohn. A combinatoric division algorithm for fixed-integer divisors. IEEE Trans. Comp., C-22(6), June.
[12] Gerry Kane. MIPS RISC Architecture. Prentice Hall, Englewood Cliffs, NJ, 1989.

    #define BUFSIZE 50

    char *decimal (unsigned int x)
    {
      static char buf[BUFSIZE];
      char *bp = buf + BUFSIZE - 1;
      *bp = 0;
      do
        {
          *--bp = '0' + x % 10;
          x /= 10;
        }
      while (x != 0);
      return bp;   /* Return pointer to first digit */
    }

Figure 11.1: Radix conversion code

    Alpha                     MIPS                      POWER                     SPARC
    lda $2,buf                la $5,buf+49              l 10,LC..0(2)             sethi %hi(buf+49),%g2
    ldq_u $1,49($2)           sb $0,0($5)               cau 11,0,0xcccc           or %g2,%lo(buf+49),%o1
    addq $2,49,$0             li $6,0xcccc0000          oril 11,11,0xcccd         stb %g0,[%o1]
    mskbl $1,$0,$1            ori $6,$6,0xcccd          cal 0,0(0)                sethi %hi(0xcccccccd),%g2
    stq_u $1,49($2)       L1: multu $4,$6               stb 0,0(10)               or %g2,0xcd,%o2
L1: zapnot $16,15,$3          mfhi $3               L1: mul 9,3,11            L1: add %o1,-1,%o1
    s4subq $3,$3,$2           subu $5,$5,1              srai 0,3,31               umul %o0,%o2,%g0
    s4addq $2,$3,$2           srl $3,$3,3               and 0,0,11                rd %y,%g3
    s4subq $2,$3,$2           sll $2,$3,2               a 9,9,0                   srl %g3,3,%g3
    sll $2,8,$1               addu $2,$2,$3             a 9,9,3                   sll %g3,2,%g2
    subq $0,1,$0              sll $2,$2,1               sri 9,9,3                 add %g2,%g3,%g2
    addq $2,$1,$2             subu $2,$4,$2             muli 0,9,10               sll %g2,1,%g2
    sll $2,16,$1              addu $2,$2,48             sf 0,0,3                  sub %o0,%g2,%g2
    ldq_u $4,0($0)            move $4,$3                ai. 3,9,0                 add %g2,48,%g2
    addq $2,$1,$2             bne $4,$0,L1              ai 0,0,48                 orcc %g3,%g0,%o0
    s4addq $2,$3,$2           sb $2,0($5)               stbu 0,-1(10)             bne L1
    srl $2,35,$2              j $31                     bc 4,2,L1                 stb %g2,[%o1]
    mskbl $4,$0,$4            move $2,$5                ai 3,10,0                 retl
    s4addl $2,$2,$1                                     br                        mov %o1,%o0
    addq $1,$1,$1
    subl $16,$1,$1
    addl $1,48,$1
    insbl $1,$0,$1
    bis $2,$2,$16
    bis $1,$4,$1
    stq_u $1,0($0)
    bne $16,L1
    ret $31,($26),1

Table 11.1: Code generated by our GCC for radix conversion
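Read back into C, the MIPS and SPARC columns of Table 11.1 correspond roughly to the sketch below. It is an illustration under stated assumptions, not the compiler's output: the names `mulhi_u32` and `decimal_nodiv` are hypothetical, and the upper half of the 32x32-bit product is obtained here through a 64-bit multiplication rather than a `multu`/`mfhi` or `umul`/`rd %y` pair. The constant 0xCCCCCCCD and the shift by 3 are the ones visible in the table.

```c
#include <assert.h>
#include <stdint.h>

#define BUFSIZE 50

/* Hypothetical helper: upper 32 bits of the 64-bit product a * b. */
uint32_t mulhi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Radix conversion as in Figure 11.1, with x / 10 and x % 10 replaced
   by one multiplication, shifts, adds, and a subtract. */
char *decimal_nodiv(uint32_t x)
{
    static char buf[BUFSIZE];
    char *bp = buf + BUFSIZE - 1;

    *bp = 0;
    do {
        /* Quotient: high half of x * 0xCCCCCCCD, shifted right by 3. */
        uint32_t q = mulhi_u32(x, 0xCCCCCCCDu) >> 3;
        /* Remainder: x - 10*q, with 10*q formed as (4q + q) * 2. */
        uint32_t r = x - (((q << 2) + q) << 1);

        assert(r < 10);             /* holds for every 32-bit x */
        *--bp = (char)('0' + r);
        x = q;
    } while (x != 0);
    return bp;                      /* pointer to first digit */
}
```

The quotient and remainder share the single multiplication, which mirrors the combination performed by the common subexpression elimination pass described above.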

Architecture/Implementation       MHz   Time with division performed   Time with division eliminated   Speedup ratio
Motorola MC68020 [18, pp. 9-22]
Motorola MC
SPARC Viking [20]
HP PA
MIPS R3000 [12]
MIPS R4000 [17]
POWER/RIOS I [4, 22]
DEC Alpha [8] *

*This time difference is artificial. The Alpha architecture has no integer divide instruction, and the DEC library functions for division are slow.

Table 11.2: Timing (microseconds) for radix conversion with and without division elimination

[13] Donald E. Knuth. An empirical study of FORTRAN programs. Technical Report CS 186, Computer Science Department, Stanford University. Stanford artificial intelligence project memo AIM 137.
[14] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 2nd edition.
[15] Shuo-Yen Robert Li. Fast constant division routines. IEEE Trans. Comp., C-34(9), September.
[16] Daniel J. Magenheimer, Liz Peters, Karl Pettis, and Dan Zuras. Integer multiplication and division on the HP Precision Architecture. In Proceedings Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II). ACM. Published as SIGPLAN Notices, Volume 22, No. 10, October.
[17] MIPS Computer Systems, Inc, Sunnyvale, CA. MIPS R4000 Microprocessor User's Manual.
[18] Motorola, Inc. MC Bit Microprocessor User's Manual, 2nd edition.
[19] Motorola, Inc. PowerPC 601 RISC Microprocessor User's Manual.
[20] SPARC International, Inc., Menlo Park, CA. The SPARC Architecture Manual, Version 8.
[21] Richard M. Stallman. Using and Porting GCC. The Free Software Foundation, Cambridge, MA.
[22] Henry Warren. Predicting Execution Time on the IBM RISC System/6000. IBM. Preliminary Version.