Improved division by invariant integers


Niels Möller and Torbjörn Granlund

Abstract—This paper considers the problem of dividing a two-word integer by a single-word integer, together with a few extensions and applications. Due to lack of efficient division instructions in current processors, the division is performed as a multiplication using a precomputed single-word approximation of the reciprocal of the divisor, followed by a couple of adjustment steps. There are three common types of unsigned multiplication instructions; we define full word multiplication (umul), which produces the two-word product of two single-word integers, low multiplication (umullo), which produces only the least significant word of the product, and high multiplication (umulhi), which produces only the most significant word. We describe an algorithm which produces a quotient and remainder using one umul and one umullo. This is an improvement over earlier methods, since the new method uses cheaper multiplication operations. It turns out we also get some additional savings from simpler adjustment conditions. The algorithm has been implemented in version 4.3 of the GMP library. When applied to the problem of dividing a large integer by a single word, the new algorithm gives a speedup of roughly 30%, benchmarked on AMD and Intel processors in the x86_64 family.

I. INTRODUCTION

Integer division instructions are either not present at all in current microprocessors, or, if they are present, they are considerably slower than the corresponding multiplication instructions. Multiplication instructions in turn are at least a few times slower than addition instructions, both in terms of throughput and latency. The situation was similar a decade ago [1], and the trend has continued, so that division latency is now typically 5–15 times higher than multiplication latency, and division throughput is up to 50 times worse than multiplication throughput. Another trend is that branches cost gradually more, except for branches that the hardware can predict correctly.
But some branches are inherently unpredictable. Division can be implemented using multiplication, by first computing an approximate reciprocal, e.g., by Newton iteration, followed by a multiplication that results in a candidate quotient. Finally, the remainder corresponding to this candidate quotient is computed, and if the remainder is too small or too large, the quotient is adjusted. This procedure is particularly attractive when the same divisor is used several times; then the reciprocal need be computed only once. Somewhat surprisingly, a well-tuned Newton reciprocal followed by multiplication and adjustments wins over the hardware division instructions even for a single non-invariant division on modern 64-bit PC processors.

This paper considers the problem of dividing a two-word number by a single-word number, using a single-word approximate reciprocal. The main contributions are a new algorithm for division using such a reciprocal, and new algorithms for computing a suitable reciprocal for 32-bit and 64-bit word size. The key idea in our new division algorithm is to compute the candidate remainder as a single word rather than a double word, even though it does not quite fit. We then use a fraction associated with the candidate quotient to resolve the ambiguity. The new method is more efficient than previous methods for two reasons:

- It uses cheaper multiplication operations, omitting the most significant half of one of the two products. Computing the least significant word of a product is a cheaper operation than computing the most significant word (e.g., on AMD Opteron, the difference in latency is one cycle, while on Intel Core 2, the difference is three cycles).
- The needed adjustment conditions are simpler.

(N. Möller is a long-time member of the GMP research team. Email: nisse@lysator.liu.se. T. Granlund is with the Centre for Industrial and Applied Mathematics, KTH, Stockholm. Granlund's work was sponsored by the Swedish Foundation for Strategic Research. Email: tege@nada.kth.se.)
When the division algorithms in this paper are used as building blocks for algorithms working with large numbers, our improvements typically affect the linear term of the execution time. This is of particular importance for applications using integers of size up to a few dozen words; e.g., on a 64-bit CPU, 2048-bit RSA corresponds to computations on 32-word numbers. The new algorithms have been implemented in the GMP library [2]. As an example of the resulting speedup, for division of a large integer by a single word, the new method gives a speedup of 31% compared to earlier methods, benchmarked on AMD Opteron and Intel Core 2.

The outline of this paper is as follows. The rest of this section defines the notation we use. Section II explains how the needed reciprocal approximation is defined, and how it is useful. In Sec. III, we describe new algorithms for computing the reciprocal, and we present our main result, a new algorithm for dividing a two-word number by a single word. Analysis of the probability of the adjustment steps in the latter algorithm is provided in Appendix A. Section IV describes a couple of extensions, primarily motivated by schoolbook division, the most important one being a method for dividing a three-word number by a two-word number. In Sec. V, we consider an algorithm that can take direct advantage of the new division method: dividing a large integer by a single word. We describe the x86_64 implementation of this algorithm using the new method, and compare it to earlier results. Finally, Sec. VI summarises our conclusions.

A. Notation and conventions

Let ℓ denote the computer word size, and let β = 2^ℓ denote the base implied by the word size. Lower-case letters denote single-word numbers, and upper-case letters represent numbers

of any size. We use the notation

  X = ⟨x_{n−1}, …, x_1, x_0⟩ = x_{n−1} β^{n−1} + ⋯ + x_1 β + x_0,

where the n-word integer X is represented by the words x_i, for 0 ≤ i < n. We use the following multiplication operations:

  ⟨p1, p0⟩ ← umul(a, b) = a b          Double-word product
  p0 ← umullo(a, b) = (a b) mod β      Low word
  p1 ← umulhi(a, b) = ⌊a b / β⌋        High word

Our algorithms depend on the existence and efficiency of these basic multiplication operations, but they do not require both umul and umulhi. These are common operations in all processors, and very few processors lack both umul and umulhi.¹

II. DIVISION USING AN APPROXIMATE RECIPROCAL

Consider the problem of dividing a two-word number U = ⟨u1, u0⟩ by a single-word number d, computing the quotient and remainder

  q = ⌊U/d⌋,    r = U − q d.

Clearly, r is a single-word number. We assume that u1 < d, to ensure that the quotient q also fits in a single word. We also restrict attention to the case that d is a normalised single-word number, i.e., β/2 ≤ d < β. This is equivalent to the word d having its most significant bit set. It follows that u0/d < 2, and one can get a reasonable quotient approximation from u1 alone, without considering u0. We have 1/β < 1/d ≤ 2/β.

We represent the reciprocal 1/d using a fixed-point representation, with a single word and an additional implicit one bit at the most significant end. We define the precomputed reciprocal of d as the integer

  v = ⌊(β² − 1)/d⌋ − β.    (1)

The constraints on d imply that 0 ≤ v < β; in particular, v is a single-word number. We have (β + v)/β² ≈ 1/d, or more precisely,

  1 − d/β² ≤ (β + v) d / β² < 1.    (2)

For the borderline case d = β/2, we have the true reciprocal 1/d = 2/β, which equals (β + v)/β² for v = β. Our definition instead gives the single-word number v = β − 1 in this case. The usefulness of v comes from Eq. (2), which implies

  U/d ≥ (u1 β + u0)(β + v)/β² = u1 + (u1 v + u0)/β + u0 v/β².    (3)

Since (β + v)/β² < 1/d, the integer part of the right-hand side is at most q, and hence a single word. Since the terms on the right-hand side are non-negative, this bound is still valid if some of the terms are omitted or truncated.
¹The SPARC v9 architecture is a notable exception, making high-performance arithmetic on large numbers very challenging.

A. Previous methods

The trick of using a precomputed reciprocal to replace integer division by multiplication is well known. The simplest variant is Alg. 1, which uses a quotient approximation based on the first two terms of Eq. (3).

  (q, r) ← DIV2BY1(⟨u1, u0⟩, d, v)
  In: β/2 ≤ d < β, u1 < d, v = ⌊(β² − 1)/d⌋ − β
  1  q ← ⌊v u1 / β⌋ + u1              // Candidate quotient (umulhi)
  2  ⟨p1, p0⟩ ← q d                   // umul
  3  ⟨r1, r0⟩ ← ⟨u1, u0⟩ − ⟨p1, p0⟩   // Candidate remainder
  4  while r1 > 0 or r0 ≥ d           // Repeated at most 3 times
  5    q ← q + 1
  6    ⟨r1, r0⟩ ← ⟨r1, r0⟩ − d
  7  return q, r0

  Algorithm 1: Simple division of a two-word number by a single-word number, using a precomputed single-word reciprocal.

To see how it works, let U = ⟨u1, u0⟩ and let q̄ = ⌊U/d⌋ denote the true quotient. We have (β + v) d = β² − k, where 1 ≤ k ≤ d. Let q denote the candidate quotient computed at line 1, and let q0 = (v u1) mod β denote the low, ignored, half of the product, so that q = (u1 (β + v) − q0)/β. Let R denote the corresponding candidate remainder, computed on line 3. Then

  R = U − q d = u0 + u1 β − (u1 (β + v) − q0) d / β = u0 + (u1 k + q0 d)/β.

We see that R ≥ 0, which corresponds to q ≤ q̄. Since k ≤ d, we also get the upper bound R < β + 2d ≤ 4d, which implies that q̄ − q ≤ 3. Since R may be larger than β, it must be computed as a two-word number at line 3 and in the loop at line 5, which is executed at most three times. The problem is that in the two-word subtraction U − q d, most, but not all, bits in the most significant word cancel. Hence, we must use the expensive umul operation rather than the cheaper umullo.

The quotient approximation can be improved. By checking if u0 ≥ d, and if so, incrementing q before computing r, one gets R < 3d and q̄ − q ≤ 2. The method in [1], Sec. 8, is more intricate, guaranteeing that R < 2d, so that q̄ − q ≤ 1. However, it still computes the full product q d, so this method needs one umul and one umulhi.

III. NEW ALGORITHMS

In this section, we describe our new algorithms.
We first give efficient algorithms for computing the approximate reciprocal, and we then describe our new algorithm for division of a double-word number by a single word.

A. Computing the reciprocal

From the definition of v, we have

  v = ⌊(β² − 1)/d⌋ − β = ⌊⟨β − 1 − d, β − 1⟩ / d⌋,

so for architectures that provide an instruction for dividing a two-word number by a single word, that instruction can be used to compute the reciprocal straightforwardly. If such a division instruction is lacking, or if it is slow, the reciprocal can be computed using the Newton iteration

  x_{k+1} = x_k + x_k (1 − x_k d).    (4)

This equation implies that

  1 − x_{k+1} d = (1 − x_k d)².    (5)

Consider one iteration, and assume that the accuracy of x_k is roughly n bits. Then the desired accuracy of x_{k+1} is about 2n bits, and to achieve that, only about 2n bits of d are needed in Eq. (4). If x_k is represented using n bits, matching its accuracy, then the computation of the right-hand side yields 4n bits. In a practical implementation, the result should be truncated to match the accuracy of 2n bits. The resulting error in x_{k+1} is the combination of the error according to Eq. (5), the truncation of the result, and any truncation of the input.

  v ← RECIPROCAL_WORD(d)
  In: 2^63 ≤ d < 2^64
  1  d0 ← d mod 2                                  // Least significant bit
  2  d9 ← ⌊d / 2^55⌋                               // Most significant 9 bits
  3  d40 ← ⌊d / 2^24⌋ + 1                          // Most significant 40 bits
  4  d63 ← ⌈d / 2⌉                                 // Most significant 63 bits
  5  v0 ← ⌊(2^19 − 3·2^8) / d9⌋                    // By table lookup
  6  v1 ← 2^11 v0 − ⌊v0² d40 / 2^40⌋ − 1           // 2 umullo
  7  v2 ← 2^13 v1 + ⌊v1 (2^60 − v1 d40) / 2^47⌋    // 2 umullo
  8  e ← 2^96 − v2 d63 + ⌊v2 / 2⌋ d0               // umullo
  9  v3 ← (2^31 v2 + ⌊v2 e / 2^65⌋) mod 2^64       // umulhi
  10 v4 ← (v3 − ⌊(v3 + 2^64 + 1) d / 2^64⌋) mod 2^64   // umul
  11 return v4

  Algorithm 2: Computing the reciprocal v = ⌊(β² − 1)/d⌋ − β, for 64-bit machines (β = 2^64).

Algorithm 2 gives one variant, for β = 2^64. Here, v0 is represented as 11 bits, v1 as 21 bits, v2 as 34 bits, and v3 and v4 as 65-bit values where the most significant bit, which is always one, is implicit. Note that since d40 and d63 are rounded upwards, they may be equal to 2^40 and 2^63 respectively, and hence not quite fit in 40 and 63 bits.

Theorem 1 (64-bit reciprocal): With β = 2^64, the output v of Alg. 2 satisfies 0 < β² − (β + v) d ≤ d.
Proof: We will prove that the errors in each iteration are bounded as follows:

  e0 = 2^50 − v0 d40,           |e0| < (5/8)·2^42              (6)
  e1 = 2^60 − v1 d40,           0 < e1 < (29/32)·2^43          (7)
  e2 = 2^97 − v2 d,             0 < e2 < (873/1024)·2^63 + d   (8)
  e3 = 2^128 − (2^64 + v3) d,   0 < e3 < 2d                    (9)
  e4 = 2^128 − (2^64 + v4) d,   0 < e4 ≤ d                     (10)

Each step involves a truncation, and we let 0 ≤ δ_k < 1 denote the truncation error in each step.

Start with (6). Let Δ = d40 − 2^31 d9; then 1 ≤ Δ ≤ 2^31. We have

  v0 = (2^19 − 3·2^8)/d9 − δ0,
  e0 = 2^50 − ((2^19 − 3·2^8)/d9 − δ0)(2^31 d9 + Δ)
     = 3·2^39 + δ0 d40 − (2^19 − 3·2^8) Δ/d9.

From this, we get

  e0 ≤ 3·2^39 + δ0 d40 < 3·2^39 + 2^40 = 5·2^39,
  e0 ≥ 3·2^39 − (2^19 − 3·2^8) Δ/d9 > 3·2^39 − 2^42 = −5·2^39.

For (7), we get

  v1 = 2^11 v0 − v0² d40 / 2^40 − (1 − δ1),
  e1 = 2^60 − (2^11 v0 − v0² d40 / 2^40) d40 + (1 − δ1) d40
     = e0² / 2^40 + (1 − δ1) d40.

It follows that e1 > 0 and that

  e1 < (5/8)²·2^44 + 2^40 = (29/32)·2^43.

For (8), we first note that the product v1 (2^60 − v1 d40) fits in 64 bits, since the first factor is 21 bits and the second factor is e1, which fits in 43 bits. Let Δ = 2^24 d40 − d; then 1 ≤ Δ ≤ 2^24. We get

  v2 = 2^13 v1 + v1 (2^60 − v1 d40) / 2^47 − δ2,
  e2 = 2^97 − v2 (2^24 d40 − Δ)
     = 2^97 − 2^24 (2^13 v1 + v1 (2^60 − v1 d40) / 2^47) d40 + v2 Δ + δ2 d
     = e1² / 2^23 + v2 Δ + δ2 d.

It follows that e2 > 0 and that

  e2 < (29/32)²·2^63 + 2^58 + d = (873/1024)·2^63 + d.

For (9), first note that the value e, computed at line 8, equals ⌊e2/2⌋. Then (8) implies that this value fits in 64 bits. Let ε denote the least significant bit of e2, so that e = (e2 − ε)/2. Define

  v3′ = 2^31 v2 + v2 (e2 − ε) / 2^66 − δ3,    e3′ = 2^128 − v3′ d.

(We will see in a moment that v3′ = 2^64 + v3, and hence also e3′ = e3.) We get

  e3′ = 2^128 − (2^31 v2 + v2 (2^97 − v2 d − ε) / 2^66) d + δ3 d
      = e2² / 2^66 + (v2 ε / 2^66 + δ3) d.

It follows that e3′ > 0 and that

  e3′ < ((873/1024)·2^63 + d)² / 2^66 + (v2 ε / 2^66 + δ3) d < 2d,

where the last step uses d ≥ 2^63 and v2 < 2^34.

It remains to show that 2^64 ≤ v3′ < 2·2^64. The upper bound follows from e3′ > 0. For the borderline case d = 2^64 − 1, one can verify that v3′ = 2^64, and for d ≤ 2^64 − 2, we get

  v3′ = (2^128 − e3′)/d ≥ (2^128 − e3′)/(2^64 − 2) = 2^64 + (2·2^64 − e3′)/(2^64 − 2) > 2^64.

For the final adjustment step, we have

  ⌊(v3 + 2^64 + 1) d / 2^64⌋ = ⌊(2^128 − e3 + d) / 2^64⌋ = 2^64 + ⌊(d − e3) / 2^64⌋
                             = 2^64       if e3 ≤ d,
                               2^64 − 1   if e3 > d.

Hence, the effect of the adjustment at line 10 is to increment the reciprocal approximation if and only if e3 > d. The desired bound, Eq. (10), follows.

  v ← RECIPROCAL_WORD(d)
  In: 2^31 ≤ d < 2^32
  1  d0 ← d mod 2                                  // Least significant bit
  2  d10 ← ⌊d / 2^22⌋                              // Most significant 10 bits
  3  d21 ← ⌊d / 2^11⌋ + 1                          // Most significant 21 bits
  4  d31 ← ⌈d / 2⌉                                 // Most significant 31 bits
  5  v0 ← ⌊(2^24 − 2^14 + 2^9) / d10⌋              // By table lookup
  6  v1 ← 2^4 v0 − ⌊v0² d21 / 2^32⌋ − 1            // umullo + umulhi
  7  e ← 2^48 − v1 d31 + ⌊v1 / 2⌋ d0               // umullo
  8  v2 ← 2^15 v1 + ⌊v1 e / 2^33⌋                  // umulhi
  9  v3 ← (v2 − ⌊(v2 + 2^32 + 1) d / 2^32⌋) mod 2^32   // umul
  10 return v3

  Algorithm 3: Computing the reciprocal v = ⌊(β² − 1)/d⌋ − β, for 32-bit machines (β = 2^32).

Algorithm 3 is a similar algorithm for β = 2^32. In this algorithm, v0 is represented as 15 bits, v1 as 18 bits, and v2 and v3 as 33-bit values where the most significant bit, always one, is implicit. The correctness proof is analogous, with the following error bounds:

  e0 = 2^35 − v0 d21,           |e0| < (33/64)·2^26
  e1 = 2^49 − v1 d,             0 < e1 < (2113/4096)·2^31 + d
  e2 = 2^64 − (2^32 + v2) d,    0 < e2 < 2d
  e3 = 2^64 − (2^32 + v3) d,    0 < e3 ≤ d

Remarks:

- The final step in the algorithm is not a Newton iteration, but an adjustment step which adds zero or one to the reciprocal approximation.
- We gain precision in the first Newton iteration by choosing the initial value v0 so that the range for the error e0 is symmetric around zero.
- In the Newton iteration x + x(1 − x d), there is cancellation in the subtraction 1 − x d, since x d is close to 1. In Alg.
2 and 3 we arrange so that the errors e_k, for k ≥ 1, are non-negative, and exploit that a certain number of the high bits of v_k are known a priori to be all ones.

The execution time of Alg. 2 is roughly 48 cycles on AMD Opteron, and 70 cycles on Intel Core 2.

  (q, r) ← DIV2BY1(⟨u1, u0⟩, d, v)
  In: β/2 ≤ d < β, u1 < d, v = ⌊(β² − 1)/d⌋ − β
  1  ⟨q1, q0⟩ ← v u1                  // umul
  2  ⟨q1, q0⟩ ← ⟨q1, q0⟩ + ⟨u1, u0⟩
  3  q1 ← (q1 + 1) mod β
  4  r ← (u0 − q1 d) mod β            // umullo
  5  if r > q0                        // Unpredictable condition
  6    q1 ← (q1 − 1) mod β
  7    r ← (r + d) mod β
  8  if r ≥ d                         // Unlikely condition
  9    q1 ← q1 + 1
  10   r ← r − d
  11 return q1, r

  Algorithm 4: New algorithm for dividing a two-word number by a single-word number, using a precomputed single-word reciprocal.

B. Dividing a two-word number by a single word

To improve the performance of division, it would be nice if we could get away with using umullo for the multiplication q d in Alg. 1 (line 2), rather than a full umul. Then the candidate remainder U − q d will be computed only modulo β, even though the full range of possible values is too large to be represented by a single word. We will need some additional information to be able to make a correct adjustment. It turns out that this is possible, if we take the fractional part of the quotient approximation into account. Intuitively, we expect the candidate remainder to be roughly proportional to the quotient fraction. Our new and improved method is given in Alg. 4. It is based on the following theorem.

Theorem 2: Assume β/2 ≤ d < β, 0 ≤ u1 < d, and 0 ≤ u0 < β. Put v = ⌊(β² − 1)/d⌋ − β. Form the two-word number

  ⟨q1, q0⟩ = (β + v) u1 + u0.

Form the candidate quotient and remainder

  q = q1 + 1,    r = ⟨u1, u0⟩ − q d.

Then r satisfies

  max(β − d, q0 + 1) − β ≤ r < max(β − d, q0).

Hence r is uniquely determined given r mod β, d, and q0.

Proof: We have (β + v) d = β² − k, where 1 ≤ k ≤ d. Substitution in the expression for r gives

  r = u1 β + u0 − (q1 + 1) d = (u1 k + u0 (β − d) + q0 d)/β − d.

For the lower bound, we clearly have r ≥ q0 d/β − d. This bound implies that both these inequalities hold:

  r ≥ −d,    r ≥ d (q0 − β)/β > q0 − β.

The desired lower bound on r now follows. For the upper bound, we have

  r < (d² + β(β − d) + q0 d)/β − d = (β − d)(β − d)/β + q0 d/β ≤ max(β − d, q0),

where the final inequality follows from recognising the expression as a convex combination.

Remark: The lower bound for r is attained if and only if u0 = u1 = 0. Then q1 = q0 = 0, and r = −d. The upper bound is attained if and only if u0 = u1 = β − 1 and d = β/2. Then v = β − 1, q1 = β − 2, q0 = β/2, and r = β/2 − 1.

In Alg. 4, denote the value computed at line 4 by r̄. Then r̄ = r mod β. A straightforward application of Theorem 2 would compare this value to max(β − d, q0). In Alg. 4, we instead compare r̄ to q0. To see why this gives the correct result, consider two cases.

First assume r ≥ 0. Then r̄ = r < max(β − d, q0). Hence, whenever the condition at line 5 is true, we have r̄ < β − d, so that the addition at the next line does not overflow. The second adjustment condition, at line 8, reduces the remainder to the proper range 0 ≤ r < d.

Otherwise, r < 0. Then r̄ = r + β ≥ max(β − d, q0 + 1). Since r̄ > q0, the condition at line 5 is true, and since r̄ ≥ β − d, the addition (r̄ + d) mod β = r̄ + d − β = r + d yields a correct remainder in the proper range. The condition at line 8 is then false.

Of the two adjustment conditions, the first one is inherently unpredictable, with a non-negligible probability for either outcome. This means that branch prediction will not be effective. For good performance, the first adjustment must be implemented in a branch-free fashion, e.g., using conditional move instructions. The second condition, r ≥ d, is true with very low probability (see Appendix A for an analysis of this probability), and can be handled by a predicted branch or using conditional move.

IV. EXTENSIONS FOR SCHOOLBOOK DIVISION

The key idea in Alg. 4 can be applied to other small divisions, not just two words divided by a single word (which we call a 2/1 division).
This leads to a family of algorithms, all of which compute a quotient approximation by multiplication by a precomputed reciprocal, then omit computing the high, almost cancelling, part of the corresponding candidate remainder, and finally perform an adjustment step using a fraction associated with the quotient approximation. We will focus on extensions that are useful for schoolbook division with a large divisor. The most important extension

  (q, ⟨r1, r0⟩) ← DIV3BY2(⟨u2, u1, u0⟩, ⟨d1, d0⟩, v)
  In: β/2 ≤ d1 < β, ⟨u2, u1⟩ < ⟨d1, d0⟩, v = ⌊(β³ − 1)/⟨d1, d0⟩⌋ − β
  1  ⟨q1, q0⟩ ← v u2                                // umul
  2  ⟨q1, q0⟩ ← ⟨q1, q0⟩ + ⟨u2, u1⟩
  3  r1 ← (u1 − q1 d1) mod β                        // umullo
  4  ⟨t1, t0⟩ ← d0 q1                               // umul
  5  ⟨r1, r0⟩ ← (⟨r1, u0⟩ − ⟨t1, t0⟩ − ⟨d1, d0⟩) mod β²
  6  q1 ← (q1 + 1) mod β
  7  if r1 ≥ q0                                     // Unpredictable condition
  8    q1 ← (q1 − 1) mod β
  9    ⟨r1, r0⟩ ← (⟨r1, r0⟩ + ⟨d1, d0⟩) mod β²
  10 if ⟨r1, r0⟩ ≥ ⟨d1, d0⟩                         // Unlikely condition
  11   q1 ← q1 + 1
  12   ⟨r1, r0⟩ ← ⟨r1, r0⟩ − ⟨d1, d0⟩
  13 return q1, ⟨r1, r0⟩

  Algorithm 5: Dividing a three-word number by a two-word number, using a precomputed single-word reciprocal.

is 3/2 division, i.e., dividing a three-word number by a two-word number. This is described next. Later on in this section, we will also look into variations that produce more than one quotient word.

A. Dividing a three-word number by a two-word number

For schoolbook division with a large divisor, the simplest method is to compute one quotient word at a time by dividing the most significant two words of the dividend by the single most significant word of the divisor, which is a direct application of Alg. 4. Assuming the divisor is normalised, the resulting quotient approximation is at most two units too large. Next, the corresponding remainder candidate is computed and adjusted if necessary. A drawback with this method is that the probability of adjustment is significant, and that each adjustment has to do an addition or a subtraction of large numbers. To improve performance, it is preferable to compute a quotient approximation based on one more word of both dividend and divisor, three words divided by two words.
With a normalised divisor, the quotient approximation is at most one off, and the probability of error is small. For more details on the schoolbook division algorithm, see [3, Sec. 4.3.1, Alg. D] and [4].

We therefore consider the following problem: divide ⟨u2, u1, u0⟩ by ⟨d1, d0⟩, computing the quotient q and remainder ⟨r1, r0⟩. To ensure that q fits in a single word, we assume that ⟨u2, u1⟩ < ⟨d1, d0⟩, and like for 2/1 division, we also assume that the divisor is normalised, d1 ≥ β/2. Algorithm 5 is a new algorithm for 3/2 division. The adjustment condition at line 7 is inherently unpredictable, and should therefore be implemented in a branch-free fashion, while the second one, at line 10, is true with very low probability. The algorithm is similar in spirit to Alg. 4. The correctness of the algorithm follows from the following theorem.

Theorem 3: Consider the division of the three-word number U = ⟨u2, u1, u0⟩ by the two-word number D = ⟨d1, d0⟩.

Assume that β/2 ≤ d1 < β and ⟨u2, u1⟩ < ⟨d1, d0⟩. Put

  v = ⌊(β³ − 1)/D⌋ − β,

which is in the range 0 ≤ v < β. Form the two-word number

  ⟨q1, q0⟩ = (β + v) u2 + u1.

Form the candidate quotient and remainder

  q = q1 + 1,    r = ⟨u2, u1, u0⟩ − q ⟨d1, d0⟩.

Then r satisfies

  c − β² ≤ r < c,    with c = max(β² − D, q0 β).

Proof: We have (β + v) D = β³ − K, for some K in the range 1 ≤ K ≤ D. Substitution gives

  r = U − q D = (u2 K + u1 (β² − D) + u0 β + q0 D)/β − D.

The lower bounds r ≥ −D and r > q0 β − β² follow in the same way as in the proof of Theorem 2, proving the lower bound c − β² ≤ r. For the upper bound, the borderline cases make the proof more involved. We need to consider several cases.

If u2 ≤ d1 − 1, then

  r < ((d1 − 1) D + (β − 1)(β² − D) + (β − 1) β + q0 D)/β − D
    = (β² − D)(β² − D)/β² + q0 β · D/β² − d0 D/β² − 1
    < max(β² − D, q0 β) = c,

where the final inequality follows from recognising the first two terms as a convex combination of β² − D and q0 β.

If u2 = d1, then u1 ≤ d0 − 1, by assumption. In this case, we get

  r < (d1 D + (d0 − 1)(β² − D) + (β − 1) β + q0 D)/β − D
    = (β² − D)(β² − D)/β² + q0 β · D/β² − 1 + (β − d0)((β + 1) D − β³)/β²
    < c + (β − d0)((β + 1) D − β³)/β².

Under the additional assumption that D ≤ β(β − 1), we get (β + 1) D − β³ < 0, and it follows that r < c.

Finally, the remaining borderline case is u2 = d1 and D > β(β − 1). We then have u2 = d1 = β − 1, 0 ≤ u1 < d0, and v = 0, since (β³ − 1)/D − β < 1. It follows that ⟨q1, q0⟩ = β u2 + u1, so that q1 = u2 = β − 1 and q0 = u1. We get

  r = U − β D = (u1 − d0) β + u0 < 0 < c.

Hence the upper bound r < c is valid in all cases.

  v ← RECIPROCAL_WORD_3BY2(⟨d1, d0⟩)
  In: β/2 ≤ d1 < β
  1  v ← RECIPROCAL_WORD(d1)     // We have β² − d1 ≤ (β + v) d1 < β²
  2  p ← d1 v mod β              // umullo
  3  p ← (p + d0) mod β
  4  if p < d0                   // Equivalent to carry out
  5    v ← v − 1
  6    if p ≥ d1
  7      v ← v − 1
  8      p ← p − d1
  9    p ← (p − d1) mod β
                                 // We have β² − d1 ≤ (β + v) d1 + d0 < β²
  10 ⟨t1, t0⟩ ← v d0             // umul
  11 p ← (p + t1) mod β
  12 if p < t1                   // Equivalent to carry out
  13   v ← v − 1
  14   if ⟨p, t0⟩ ≥ ⟨d1, d0⟩
  15     v ← v − 1
  16 return v

  Algorithm 6: Computing the reciprocal which DIV3BY2 expects, v = ⌊(β³ − 1)/⟨d1, d0⟩⌋ − β. This is a single-word reciprocal based on a two-word divisor.

B. Computing the reciprocal for 3/2 division

The reciprocal needed by Alg. 5, even though still a single word, is slightly different from the reciprocal that is needed by Alg. 4. One can use Alg. 2 or Alg.
3 (depending on word size) to compute the reciprocal of the most significant word d1, followed by a couple of adjustment steps to take into account the least significant word d0. We suggest the following strategy: start with the initial reciprocal v, based on d1 only, and the corresponding product (β + v) d1, where only the middle word is represented explicitly (the high word is β − 1, and the low word is zero). We then add first β d0 and then v d0 to this product. For each addition, if we get a carry out, we cancel that carry by appropriate subtractions of d1 and d0 to get an underflow. The details are given in Alg. 6.

Remark: The product d1 v mod β, computed in line 2, may be available cheaply, without multiplication, from the intermediate values used in the final adjustment step of RECIPROCAL_WORD (Alg. 2 or Alg. 3).

C. Larger quotients

The basic algorithms for 2/1 division and 3/2 division can easily be extended in two ways. One can substitute double words or other fixed-size units for the single words in Alg. 4 and Alg. 5. This way, one can construct efficient algorithms that produce quotients of two or more words. E.g., with double-word units, we get algorithms for division of sizes 4/2 and 6/4. In any of the algorithms constructed as above, one can fix one or more of the least significant words of both

  (Q, r) ← DIV_NBY1(U, d)
  In: U = ⟨u_{n−1}, …, u_0⟩, β/2 ≤ d < β
  Out: Q = ⟨q_{n−1}, …, q_0⟩
  1  v ← RECIPROCAL_WORD(d)
  2  r ← 0
  3  for j = n − 1, …, 0
  4    (q_j, r) ← DIV2BY1(⟨r, u_j⟩, d, v)
  5  return Q, r

  Algorithm 7: Dividing a large integer U = ⟨u_{n−1}, …, u_0⟩ by a normalised single-word integer.

dividend and divisor to zero. This gives us algorithms for division of sizes such as 3/1 and 5/3 (and applying this procedure to 3/2 would recover the good old 2/1 division). Details and applications for some of these variants are described in [4].

V. CASE STUDY: X86_64 IMPLEMENTATION OF n/1 DIVISION

Schoolbook division is the main application of 3/2 division, as was described briefly in the previous section. We now turn to a more direct application of 2/1 division using Alg. 4. In this section, we describe our implementation of DIV_NBY1, dividing a large number by a single-word number, for current processors in the x86_64 family. We use conditional move (cmov) to avoid branches that are difficult to handle efficiently by branch prediction. Besides cmov, the most crucial instructions used are mul, imul, add, adc, sub and lea. Detailed latency and throughput measurements of these instructions, for 32-bit and 64-bit processors in the x86 family, are given in [5]. We discuss the timing only for AMD Opteron ("K8/K9") and Intel Core 2 (65 nm "Conroe") in this section. The AMD Opteron results are valid also for processors with the brand names Athlon and Phenom². Other recent Intel processors give results slightly different from the 65 nm Core 2 results we describe³. Our results focus mainly on AMD chips since they are better optimised for scientific integer operations, i.e., the ones we depend on. If we don't specify host architecture, we are talking about AMD Opteron.

A. Dividing a large integer by a single word

Consider division of an n-word number U by a single-word number d. The result of the division is an n-word quotient and a single-word remainder.
This can be implemented by repeatedly replacing the two most significant words of U by their single-word remainder modulo d, and recording the corresponding quotient word [3, Sec. 4.3.1, exercise 16]. The variant shown in Alg. 7 computes a reciprocal of d (and hence requires that d is normalised), and applies our new 2/1 division algorithm in each step.

To use Alg. 7 directly, d must be normalised. To also handle unnormalised divisors, we select a shift count k such that β/2 ≤ 2^k d < β. Alg. 7 can then be applied to the shifted operands 2^k U and 2^k d. The quotient is unchanged by this transformation, while the resulting remainder has to be shifted k bits right at the end. Shifting of U can be done on the fly in the main loop. In the code examples, register cl holds the normalisation shift count k.

² Phenom has the same multiplication latencies, but slightly higher(!) latency for division.
³ The 45 nm Core 2 has somewhat lower division latency, and the same multiplication latencies. The Core ix processors (x = 3, 5, 7, 9) have lower division latency, and for mul, they have lower latency for the low product word, but higher(!) latency for the high product word.

B. Naïve implementation

The main loop of an implementation in x86_64 assembler is shown in Example 1. Note that the div instruction in the x86 family appears to be tailor-made for this loop: This instruction takes a divisor as the explicit argument. The two-word input dividend is placed with the most significant word in the rdx register and the least significant word in the rax register. The output quotient is produced in rax and the remainder in rdx. No other instruction in the loop needs to touch rdx, as the remainder is produced by each iteration and consumed in the next.

	loop:	mov	(np, un, 8), %rax
		div	d
		mov	%rax, (qp, un, 8)
		dec	un
		jnz	loop

Example 1: Basic division loop using the div instruction, running at 71 cycles per iteration on AMD Opteron, and 116 cycles on Intel Core 2. Note that rax and rdx are implicit input and output arguments to the div instruction.
However, the dependency between iterations, via the remainder in rdx, means that the execution time is lower bounded by the latency of the div instruction, which is 71 cycles on AMD Opteron [5] (and even longer, 116 cycles, on Intel Core 2). Thanks to parallelism and out-of-order execution, the rest of the instructions are executed while waiting for the result from the division. This loop is more than an order of magnitude slower than the loop for multiplying a large number by a single-word number.

C. Old division method

The earlier division method from [1] can be implemented with the main loop in Example 2. The dependency between operations, via the rax register, is still crucial to understand the performance. Consider the sequence of dependent instructions in the loop, from the first use of rax until the output value of the iteration is produced. This is what we call the recurrency chain of the loop.

The assembler listing is annotated with cycle numbers, for AMD Opteron and Intel Core 2. We let cycle 0 be the cycle when the first instruction on the recurrency chain starts executing, and the following instructions in the chain are annotated with the cycle number of the earliest cycle the

	AMD Intel
	loop:	mov	(up,un,8), %rdx
		shld	%cl, %rdx, %r14
		lea	(d,%r14), %r12
		bt	$63, %r14
		cmovnc	%r14, %r12
	 0  0	mov	%rax, %r10
	 0  0	adc	$0, %rax
	 1  2	mul	dinv
	 5 10	add	%r12, %rax
		mov	d, %rax
	 6 11	adc	%r10, %rdx
	 7 13	not	%rdx
	 8 14	mov	%rdx, %r12
	 8 14	mul	%rdx
	12 22	add	%rax, %r14
	13 23	adc	%rdx, %r10
	14 25	sub	d, %r10
	13 23	lea	(d,%r14), %rax
	14 26	cmovnc	%r14, %rax
		sub	%r12, %r10
		mov	(up,un,8), %r14
		mov	%r10, 8(qp,un,8)
		dec	un
		jnz	loop

Example 2: Previous method using a precomputed reciprocal, running at 17 cycles per iteration on AMD Opteron, and 32 cycles on Intel Core 2.

instruction can start executing, taking its input dependencies into account. To create the annotations, one needs to know the latencies of the instructions. Most arithmetic instructions, including cmov and lea, have a latency of one cycle. The crucial mul instruction has a latency of four cycles until the low word of the product is available in rax, and one more cycle until the high word is available in rdx. The imul instruction, which produces the low half only, also has a latency of four cycles. These numbers are for AMD; the latencies are slightly longer on Intel Core 2 (2 cycles for adc and cmov, 5 cycles for imul and 8 for mul). See [5] for extensive empirical timing data.

Using these latency figures, we find that the latency of the recurrency chain in Example 2 is 15 cycles. This is a lower bound on the execution time. It turns out that the loop runs in 17 cycles per iteration; the instructions not on the recurrency chain are mostly scheduled for execution in parallel with the recurrency instructions, and there's plenty of time, 8 cycles, when the CPU is otherwise just waiting for the results from the multiplication unit. This is a four-fold speedup compared to the 71-cycle loop based on the div instruction. For Intel Core 2, the latency of the recurrency chain is 28 cycles, while the actual running time is 32 cycles per iteration.

D. New division method

The main loop of an implementation of the new division method is given in Example 3.
	AMD Intel
	loop:	nop
		mov	(up,un,8), %r10
	 0  0	lea	1(%rax), %r11
		shld	%cl, %r10, %rbp
	 0  0	mul	dinv
	 4  8	add	%rbp, %rax
	 5  9	adc	%r11, %rdx
		mov	%rax, %r11
		mov	%rdx, %r13
	 6 11	imul	d, %rdx
	10 16	sub	%rdx, %rbp
		mov	d, %rax
	11 17	add	%rbp, %rax
	11 17	cmp	%r11, %rbp
	12 18	cmovb	%rbp, %rax
		adc	$-1, %r13
		cmp	d, %rax
		jae	fix
	ok:	mov	%r13, (qp)
		sub	$8, qp
		dec	un
		mov	%r10, %rbp
		jnz	loop
		jmp	done
	fix:	sub	d, %rax
		inc	%r13
		jmp	ok
	done:

Example 3: Division code (from GMP-4.3) with the new division method, based on Alg. 4. Running at 13 cycles per iteration on AMD Opteron, and 25 cycles on Intel Core 2.

Annotating the listing with cycle numbers in the same way, we see that the latency of the recurrency chain is 13 cycles. Note that the rarely taken branch does not belong to the recurrency chain. The loop actually also runs at 13 cycles per iteration; all the remaining instructions are scheduled for execution in parallel with the recurrency chain.⁴ For Intel Core 2, the latency of the recurrency chain is 20 cycles, with an actual running time of 25 cycles per iteration.

Comparing the old and the new method, first make the assumption (which is conservative in the Opteron case) that all the loops can be tuned to get their running times down to the respective latency bounds. We then get a speedup of 15% on AMD Opteron and 40% on Intel Core 2. If we instead compare actual cycle counts, we see a speedup of 31% on both Opteron and Core 2. On Opteron, we gain one cycle from replacing one of the mul instructions by the faster imul; the other cycle shaved off the recurrency chain is due to the simpler adjustment conditions.

In this application, the code runs slower on Intel Core 2 than on AMD Opteron. The Intel CPU loses some cycles due

⁴ It is curious that if the nop instruction at the top of the loop is removed, the loop runs one cycle slower. It seems likely that similar random changes to the instruction sequence in Example 2 can reduce its running time by one or even two cycles, to reach the lower bound of 15 cycles.

	Implementation		AMD Opteron	Intel Core 2
				latency	cycles	latency	cycles
	Naïve div loop (Ex. 1)	71	71	116	116
	Old method (Ex. 2)	15	17	28	32
	New method (Ex. 3)	13	13	20	25

TABLE I: Summary of the latency of the recurrency chain, and actual cycle counts, for two x86_64 processors. The latency numbers are lower bounds for the actual cycle counts.

to higher latencies for multiplication and carry propagation, resulting in a higher overall latency of the recurrency chain. And then it loses some additional cycles due to the fact that the code was written and scheduled with Opteron in mind.

VI. CONCLUSIONS

We have described and analysed a new algorithm for dividing a two-word number by a single-word number (2/1 division). The key idea is that when computing a candidate remainder where the most significant word almost cancels, we omit computing the most significant word. To enable correct adjustment of the quotient and the remainder, we work with a slightly more precise quotient approximation than in previous algorithms, and an associated fractional word. Like previous methods, we compute the quotient via an approximate reciprocal of the divisor. We describe new, more efficient, algorithms for computing this reciprocal for the most common cases of a word size of 32 or 64 bits.

The new algorithm for 2/1 division directly gives a speedup of roughly 30% on current processors in the x86_64 family, for the application of dividing a large integer by a single word. It is curious that on these processors, the combination of our reciprocal algorithm (Alg. 2) and division algorithm (Alg. 4) is significantly faster than the built-in assembler instruction for 2/1 division. This indicates that the algorithms may be of interest for implementation in CPU microcode.

We have also described a couple of extensions of the basic algorithm, primarily to enable more efficient schoolbook division with a large divisor. Most of the algorithms we describe have been implemented in the GMP library [2].
ACKNOWLEDGEMENTS

The authors wish to thank Stephan Tolksdorf, Björn Terelius, David Harvey and Johan Håstad for valuable feedback on draft versions of this paper. As always, the responsibility for any remaining errors stays with the authors.

REFERENCES

[1] T. Granlund and P. L. Montgomery, "Division by invariant integers using multiplication," in Proceedings of the SIGPLAN PLDI'94 Conference, June 1994.
[2] T. Granlund, "GNU multiple precision arithmetic library, version 4.3," May 2009, http://gmplib.org/.
[3] D. E. Knuth, Seminumerical Algorithms, 3rd ed., ser. The Art of Computer Programming. Reading, Massachusetts: Addison-Wesley, 1998, vol. 2.
[4] T. Granlund and N. Möller, "Division of integers large and small," August 2009, to appear.
[5] T. Granlund, "Instruction latencies and throughput for AMD and Intel x86 processors," 2009, http://gmplib.org/~tege/x86-timing.pdf.

APPENDIX A
PROBABILITY OF THE SECOND ADJUSTMENT STEP

In this appendix, we analyse the probability of the second adjustment step (line 8 in Alg. 4), and substantiate our claim that the second adjustment is unlikely. We use the notation from Sec. III-B. We also use the notation that P[event] is the probability of a given event, and E[X] is the expected value of a random variable X.

We will treat r̃ as a random variable, but we first need to investigate for which values of r̃ the second adjustment step is done. There are two cases: If r̃ − βd ≥ βd, then r < max(β − d, q0) and r ≥ d imply that r < q0. The first adjustment is skipped, the second is done. If r̃ − βd ≥ βq0, then r < max(β − d, q0) implies that r < β − d and r + d < β. The first adjustment is done, then undone by the second adjustment. The inequalities r̃ − βd ≥ βd and r̃ − βd ≥ βq0 are thus mutually exclusive, the former possible only when q0 > β − d and the latter possible only when q0 < β − d. One example of each kind, for β = 2⁵ = 32:

	  U	d	q	r	v	k	q1	q0	r̄
	414	18	23	0	24	16	22	30	18
	504	18	28	0	24	16	28	 0	 0

To find the probabilities, we treat r̃ as a random variable. Consider the expression for r̃,

	r̃ = u1·k + u0·(β − d) + q0·d.
We assume we have a fixed d = ξβ, with 1/2 ≤ ξ < 1, and consider u1 and u0 as independent uniformly distributed random variables in the ranges 0 ≤ u1 < d and 0 ≤ u0 < β. We also make the simplifying assumptions that k and q0 are independent and uniformly distributed, in the ranges 0 < k ≤ d and 0 ≤ q0 < β, and that all these variables are continuous rather than integer-valued.⁵

Lemma 4: Assume that 1/2 ≤ ξ < 1, and that u1, u0, k and q0 are independent random variables, continuously and uniformly distributed with ranges 0 ≤ u1, k ≤ ξ and 0 ≤ u0, q0 ≤ 1. Let

	r̃ = u1·k + u0·(1 − ξ) + q0·ξ.

Then

	P[r̃ − ξ ≥ ξ or r̃ − ξ ≥ q0]
	  = ((2 − 1/ξ)³/(6(1 − ξ)²)) log((2ξ − 1)/ξ²)
	    + (1/(1 − ξ)²) (ξ³/18 − ξ(2ξ − 1)/4 + (2ξ − 1)²/(2ξ) − 11(2ξ − 1)³/(36ξ³))	(11)

⁵ These assumptions are justified for large word-size. Strictly speaking, with d fixed, the variable k is of course not random at all. To make this argument strict, we would have to treat d as a random variable with values in a small range around ξβ, e.g., uniformly distributed in the range ξβ ± β^{3/4}, and consider the limit as β → ∞. Then the modulo operations involved in q0 and k make these variables behave as almost independent and uniformly distributed.

Furthermore, if we define

	f(ξ) = 1 − (297/64)(1 − ξ) + (15/2)(1 − ξ)² − (17/4)(1 − ξ)³

then

	P[r̃ − ξ ≥ ξ or r̃ − ξ ≥ q0] ≈ (1 − ξ)⁶ / (24 f(ξ))	(12)

with an absolute error less than 0.01 percentage points, and a relative error less than 5%.

Proof: Define the stochastic variables

	X = u1·k/ξ²,   R = (u1·k + u0·(1 − ξ))/ξ,   Q = q0.

Now,

	r̃/ξ = R + Q.

By assumption, Q is uniformly distributed, while R has a more complicated distribution. Conditioning on Q = s, we get the probabilities

	P[r̃ − ξ ≥ ξ] = ∫₀¹ P[R ≥ 2 − s] ds = ∫₀^{ξ+1/ξ−2} P[R ≥ 1 + s] ds
	P[r̃ − ξ ≥ q0] = ∫₀¹ P[R ≥ 1 + (1/ξ − 1)s] ds = (ξ/(1 − ξ)) ∫₀^{ξ+1/ξ−2} P[R ≥ 1 + s] ds.

Adding the probabilities (recall that the events are mutually exclusive), we get the probability of adjustment as

	(1/(1 − ξ)) ∫₀^{ξ+1/ξ−2} P[R ≥ 1 + s] ds.	(13)

We next need the probabilities P[R ≥ s] for 1 ≤ s ≤ ξ + 1/ξ − 1. By somewhat tedious calculations, we find

	P[X ≤ s] = s(1 − log s)
	P[R ≥ s] = (ξ²/(1 − ξ)) E[max(0, X − (s − (1/ξ − 1))/ξ)]
	  = −((s + 1 − 1/ξ)²/(2(1 − ξ))) log((s + 1 − 1/ξ)/ξ)
	    + ξ²/(4(1 − ξ)) − ξ(s + 1 − 1/ξ)/(1 − ξ) + 3(s + 1 − 1/ξ)²/(4(1 − ξ)),

where the latter equation is valid only for s in the interval of interest. Substituting in Eq. (13) and integrating yields Eq. (11).

To approximate this complicated expression, we first derive its asymptotics:

	(1 − ξ)⁶/24 + O((1 − ξ)⁷)

for ξ close to 1, and

	1/36 − (13/18)(ξ − 1/2) + (34/3)(ξ − 1/2)² + O((ξ − 1/2)³ log(ξ − 1/2))

for ξ close to 1/2. The coefficients of f are chosen to give the same asymptotics. The error bounds for Eq. (12) are found numerically.

[Figure: curve of the adjustment probability, in percent, against ξ from 0.5 to 1; plot data omitted.]
Fig. 1. Probability of the unlikely adjustment step, as a function of the ratio ξ = d/β.

In Fig. 1, the adjustment probability of Eq. (11) is plotted as a function of the ratio ξ = d/β. This is a rapidly decreasing function, with maximum value for ξ = 1/2, which gives the worst case probability of 1/36 for d close to β/2. This curve is based on the assumptions of continuity and independence of the random variables. For a fixed d and word size, the adjustment probability for random u1 and u0 will deviate some from this continuous curve. In particular, the borderline case d = β/2 actually gives an adjustment probability of zero, so it is not the worst case.