Multiple Precision Integer Multiplication on GPUs

Koji Kitano and Noriyuki Fujimoto
Graduate School of Science, Osaka Prefecture University, Sakai-shi, Osaka, Japan

Abstract

This paper addresses multiple precision integer multiplication on GPUs. We propose a novel data structure named a product digit table and present a GPU algorithm that performs the multiplication with the product digit table. Experimental results on a 3.10 GHz Intel Core i3-2100 CPU and an NVIDIA GeForce GTX480 GPU show that the proposed GPU algorithm runs up to 71.4 times and 12.8 times faster, respectively, than the NTL and GMP libraries, two common libraries for single-threaded multiple precision arithmetic on CPUs. Further experiments show that the proposed GPU algorithm is also faster than the fastest existing GPU algorithm, which is based on FFT multiplication, when the bit lengths of the two given multiple precision integers differ.

Keywords: multiple precision integer, parallel multiplication, GPGPU, CUDA

1. Introduction

Multiple precision integer arithmetic has several applications, for example primality testing of large numbers, which is important in public-key cryptography. Among the multiple precision arithmetic operations, multiplication underlies most of the others, and many works address it. Representative methods include the Karatsuba method [7], the Toom-Cook method [19], the 4-way method [22], the 5-way method [22], and Strassen FFT multiplication [16]. The time complexities of the first four methods are O(n^1.585), O(n^1.465), O(n^1.404), and O(n^1.365) respectively, where n is the number of bits of the multiplicand and the multiplier, and that of FFT multiplication is O(n log n log log n). However, in the study [22], Zuras compares implementations of these methods in C and assembly language on an HP-9000/720, and reports that the naive O(n^2) method is the best for small numbers and that all of the above methods can be faster than FFT multiplication, despite its advantage in time complexity for large numbers. Zuras concludes that FFT multiplication is not always the fastest even for extremely large numbers (> 37,000,000 bits).

On the other hand, research on GPGPU (General-Purpose computation on Graphics Processing Units) has attracted much attention in recent years, and several works on multiple precision integer multiplication with a GPU are known. The fastest of these is the implementation [2] of FFT multiplication on GPUs. This method represents multiple precision integers in base 2^393216 and multiplies them with the Karatsuba method, except that each one-digit by one-digit product is computed by FFT multiplication. This method is fast if the bit lengths of the multiplier and the multiplicand are equal and are multiples of 393216. However, it slows down significantly if the bit lengths are different or are not multiples of 393216, because in such a case the two given numbers are promoted to numbers whose bit length is the smallest multiple of 393216 not smaller than the bit length of the larger of the two [2]. Moreover, the bit lengths currently required in cryptography are only a few thousand.

In this paper, we propose a novel data structure named a product digit table and present a GPU algorithm that performs the multiplication with the product digit table. The proposed method differs from FFT multiplication in that there is no need to promote the two given numbers even if their bit lengths are not the same.
In addition, since the proposed method represents numbers in base 2^32, the efficiency loss is relatively small even for numbers of a few thousand bits.

The remainder of this paper is organized as follows. In Section 2, we briefly review existing algorithms for GPUs. In Section 3, we present the proposed GPU algorithm in detail. In Section 4, we show some experimental results. In Section 5, we give some concluding remarks and future work. Due to limited space, we describe neither the architecture nor the programming of GPUs in this paper; readers unfamiliar with these topics are referred to the studies [4], [8], [13], [14], [15].

2. Related Works

As far as we know, apart from the studies below, there is no existing research on multiple precision integer multiplication for GPUs. In the study [2], the authors improved their result from the study [1] and implemented on CUDA a method in which the Karatsuba method divides the whole multiplication into multiplications of smaller numbers, which are then performed by Strassen FFT multiplication. Their experiments show up to a 4.29 times speedup of an NVIDIA GeForce GTX480 over a single core of a 2.93 GHz Intel Core i7 870 for integer multiplications of length 255 Kbits to 24.512 Mbits. In the study [6], the authors reported 0.8 to 2.9 times speedups relative to the CPU library mpfq [5] with SSE2 instructions for integer multiplications of length 160 to 384 bits, on a 3 GHz Intel Core2 Duo E8400 and an NVIDIA GeForce 9800GX2. In the study [21], under the assumption that many independent multiple precision arithmetic operations are given, the authors proposed a multiple precision arithmetic library that executes each operation with a single thread, and reported about a 4 times speedup relative to the CPU library GNU MP [3] for 30720 integer multiplications of 2048 bits × 2048 bits, on a single core of a 2.80 GHz Intel Core i7 and an NVIDIA GeForce GTX280.

In the study [18], the authors implemented the modular algorithm [9], which can execute in O(n) time a multiplication of two numbers represented in the residue number system, although transforming an n-word integer to and from the residue number system requires O(n^2) time. In the study [20], the author reported on an optimization of 1024 bits × 1024 bits multiple precision integer multiplication with the Karatsuba method on CUDA. In the study [12], the authors implemented multiple precision multiplication with FFT on CUDA and reported a 10 times speedup, but the CPU and GPU used are not stated.

3. The Proposed Method

3.1 A Product Digit Table

We can compute an integer multiplication by manual calculation as shown on the left side of Fig. 1; in the example, the base is 10. We consider computing the multiplication with a table as shown on the right side of Fig. 1. Each element of the table is the quotient or the remainder of dividing the product of one digit of the multiplier and one digit of the multiplicand by the base. Therefore, we can compute each element independently. For example, the white 2 on a black background in the table is the remainder of dividing 8 × 4 by the base 10, and the white 4 on a black background is the quotient of dividing 5 × 9 by the base 10. We can obtain the product by computing the summation of each column of the table and propagating the carries. We call such a table a product digit table.

Fig. 1: Manual calculation of a multiplication and the corresponding product digit table
Fig. 2: Shape of the product digit table in the case of a digits × b digits

Fig. 2 shows the typical shape of a product digit table for the product of a multiple precision integer A of a digits and a multiple precision integer B of b digits (a ≥ b). Only the elements in the gray-scaled areas A1, A2, and A3 have values. Fig. 2 implies that a product digit table has the following properties:

- The number of columns in A1 (A3) is equal to b.
- The number of rows of a column in A1 (A3) increases (decreases) monotonically, two by two, when the columns are viewed from left to right.
- The number of columns in A2 equals a - b, and the number of rows in A2 is exactly 2b.

When A and B in a base BASE are stored in arrays A[0..a-1] and B[0..b-1] respectively (A[0] and B[0] are the least significant digits), the algorithm in Fig. 3 generates a product digit table of A × B in a 2D array T. Every element of T that is not assigned a value is don't-care. Fig. 4 shows an example of a product digit table built by the algorithm when A is a five-digit number and B is a three-digit number.

3.2 Data Structure

Each element of a product digit table is computed from the product of one digit of A and one digit of B. We set BASE = 2^32 so that this product can be computed as efficiently as possible: every such product fits in 64 bits (the unsigned long long type on GPUs), so we can compute each quotient digit by shifting the 64-bit product 32 bits to the right, and each remainder digit by truncating the product to its lower 32 bits (the unsigned int type).
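Since Fig. 3 is not reproduced in this transcription, the following host-side C sketch (our own, not the authors' code) shows one way to implement the construction just described with BASE = 2^32; the row layout, two rows per digit of B, follows the properties listed above, but the authors' exact array layout may differ.

    #include <stdint.h>

    /* Build a product digit table T with 2*b rows and a+b columns,
       stored row-major (T[r*(a+b) + c]), for BASE = 2^32.
       Row 2j holds the remainder digits of the products A[i]*B[j];
       row 2j+1 holds their quotient digits, one column to the left
       (at i+j+1). Don't-care elements are zeroed here for clarity. */
    void build_product_digit_table(const uint32_t *A, int a,
                                   const uint32_t *B, int b,
                                   uint32_t *T)
    {
        int cols = a + b;
        for (int r = 0; r < 2 * b; r++)
            for (int c = 0; c < cols; c++)
                T[r * cols + c] = 0;
        for (int j = 0; j < b; j++) {
            for (int i = 0; i < a; i++) {
                uint64_t p = (uint64_t)A[i] * B[j];
                /* remainder digit: low 32 bits, column i+j */
                T[(2 * j) * cols + (i + j)] = (uint32_t)p;
                /* quotient digit: high 32 bits, column i+j+1 */
                T[(2 * j + 1) * cols + (i + j + 1)] = (uint32_t)(p >> 32);
            }
        }
    }

Summing column c of this table collects exactly the terms lo(A[i] * B[j]) with i + j = c and hi(A[i] * B[j]) with i + j = c - 1, which is the invariant the column summation relies on.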
In the proposed GPU algorithm, we basically allocate one thread to each column of a product digit table, so that both the summation of each column and the carry propagation can be done in parallel. However, if we directly used the product digit table of Fig. 2, the amount of computation and memory access would be imbalanced among the threads allocated to the columns in A1 or A3. Hence, we represent a product digit table as a 2D array of size a × 2b as shown in Fig. 5, rather than as a 2D array of size (a + b) × 2b as shown in Fig. 2. In the rest of this paper, for a product digit table in this load balanced form, R1 is defined as the rectangle that corresponds to A2, and R2 as the rectangle that corresponds to the two triangles A1 and A3 (see Fig. 2 and Fig. 5). Fig. 6 shows an example of a product digit table in the load balanced form.

Fig. 3: An algorithm for constructing a product digit table
Fig. 4: An example of a product digit table generated by the algorithm in Fig. 3 (in the case that A is a five-digit number and B is a three-digit number)

Notice that the order of the elements within each column never affects the result of the multiplication in question; hence, we can arrange the elements of a column arbitrarily. R2 in Fig. 5 has the following properties:

- The horizontal length is b, and the vertical length is 2b.
- The border line that splits R2 into A1 and A3 lies between the 2α-th row and the (2α + 1)-th row of the α-th column from the right.
- In the β-th row, B[β/2] is referenced in all columns; in the α-th column, A[α] is referenced in the top row, and the indices of the referenced elements of A increase cyclically, one by one, every two rows.

3.3 Algorithm

We divide R2 into tiles of width BLOCK_SIZE elements and height BLOCK_SIZE × 2 elements, as shown in Fig. 7. In the proposed algorithm, each thread block (block for short) has BLOCK_SIZE threads and is allocated to one tile, so that each thread computes the summation of one column of the tile. The tiles fall into three types: tiles with A1 elements only, tiles with A3 elements only, and tiles with both. We refer to these types as T1, T3, and T13, respectively. In a tile of type T13 a conditional branch is needed, so its execution efficiency is lower than that of type T1 or T3. However, the slowdown for the whole program is negligible because the number of T13 tiles is much smaller than the total number of T1 and T3 tiles; the parallelism of a block is fully sustained in T1 and T3 tiles. Each block accesses BLOCK_SIZE × 2 elements of A and BLOCK_SIZE elements of B. This enables each thread to compute its column summation using only values in shared memory. The device memory accesses that load these elements into shared memory can be coalesced, because each block refers to contiguous elements of A and B.
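To make the column summation concrete, here is a simplified CUDA sketch (the kernel name is ours) that assigns one thread per column of the unbalanced table of Fig. 2 and accumulates the remainder and quotient digits that fall into that column. It deliberately omits the R1/R2 split, the tiling, and the shared-memory staging described above; the paper's mul_bint_a and mul_bint_b kernels compute the same per-column sums in the load balanced form.

    /* One thread per output column c of the product digit table.
       Column c collects lo(A[i]*B[j]) with i + j = c and
       hi(A[i]*B[j]) with i + j = c - 1 (see Section 3.1). */
    __global__ void column_sums_naive(const unsigned int *A, int a,
                                      const unsigned int *B, int b,
                                      unsigned long long *C)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= a + b) return;
        unsigned long long sum = 0;
        for (int j = 0; j < b; j++) {
            int i = c - j;            /* remainder digit lands in column c */
            if (0 <= i && i < a)
                sum += (unsigned int)((unsigned long long)A[i] * B[j]);
            i = c - 1 - j;            /* quotient digit lands in column c */
            if (0 <= i && i < a)
                sum += ((unsigned long long)A[i] * B[j]) >> 32;
        }
        C[c] = sum;  /* per-column sum; carries not yet propagated */
    }

Since each column receives at most 2b terms, each smaller than 2^32, a 64-bit accumulator cannot overflow for any practical b, which is what allows carry propagation to be deferred to a separate pass.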

Fig. 5: A product digit table in load balanced form (general case)

3.4 Details of the Implementation

We illustrate the computation performed for R2. Regardless of the type of a tile, the first thing a block performs is loading the necessary elements from device memory into shared memory. Then each thread of the block computes the summation of the elements in its allocated column col. Each thread of a block that computes a T13 tile separately computes two summations, one of the A1 elements and one of the A3 elements. Each thread then adds the summation of the A1 elements to index col of the array C that stores the answer, and adds the summation of the A3 elements to index col + a of C. We call the CUDA C kernel function for rectangle R2 mul_bint_a. The kernel function for rectangle R1 is identical except that the indices used to refer to A, B, and C are changed; we call it mul_bint_b.

If the number of digits is large enough, almost all elements of C exceed the base 2^32 at this stage with high probability. Thus, we perform carry propagation to make all elements smaller than the base. For efficiency, we divide this processing into two steps. The first step separates each sum in C into its least significant digit and its carry; we call the CUDA C kernel function for this step mul_bint_c. Note that the base 2^32 is large enough to limit each carry to at most one digit. Therefore, we can regard all the least significant digits as forming one multiple precision integer with base 2^32, and likewise all the carries as forming another. Hence, in the second step, we add these two multiple precision integers.

We now explain the detailed implementation of the second step. In general, a multiple precision integer sum involves carry propagation, which is essentially sequential: in the worst case, a carry propagates from the least significant digit to the most significant digit, which requires as much computation time as a single thread computing the whole sum. Our implementation is essentially the same as the "carry skip adder" [10], one of the hardware implementations of an adder. However, we devised our implementation as below so that the addition completes quickly when only short propagations occur, exploiting the fact that the probability that a carry arises at all is very small because the base is 2^32.

Fig. 8: An algorithm for computing a multiple precision integer sum

Fig. 8 illustrates the idea behind our implementation, with base 10 used for simplicity. First, from the given two arrays A and B we generate in parallel an array I that stores carry information, as shown in Fig. 8. The carry information is 1 if the sum of the corresponding two elements is larger than the base, 2 if it equals base - 1 (9 in Fig. 8), and 0 otherwise. Next, for every element whose carry information is 2, in parallel we change the 2 to 1 (0) if its right neighbor is 1 (0); a 2 whose right neighbor is also 2 is not changed. We repeat this parallel step until all 2s have disappeared. Then we compute and store, for each position, the sum of the corresponding two elements of the given multiple precision integers and the incoming carry given by the final carry information. Finally, we replace each element of the array by its remainder modulo the base; the resulting array is the multiple precision integer sum. This process needs barrier synchronization among blocks, which we achieve by dividing it into three kernel functions, mul_bint_d, mul_bint_e, and mul_bint_f. These functions are executed only once each, in the order of mul_bint_d, mul_bint_e, and mul_bint_f.
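A minimal CUDA sketch of this carry-information scheme follows. The kernel names and the double-buffered resolution loop are our assumptions for illustration; the paper's mul_bint_d, mul_bint_e, and mul_bint_f divide the same work differently, with each kernel launched only once. Here lo[] and hi[] are the least-significant-digit and carry arrays produced by the mul_bint_c step, with index 0 the least significant digit.

    /* Carry information per position:
       1: lo[i] + hi[i] >  BASE-1 (a carry is generated)
       2: lo[i] + hi[i] == BASE-1 (an incoming carry would propagate through)
       0: otherwise (any incoming carry is absorbed).
       Position 0 has no right neighbor, so its 2 resolves to 0 at once. */
    __global__ void gen_carry_info(const unsigned int *lo, const unsigned int *hi,
                                   int n, int *I)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned long long s = (unsigned long long)lo[i] + hi[i];
        int v = (s > 0xFFFFFFFFULL) ? 1 : (s == 0xFFFFFFFFULL) ? 2 : 0;
        I[i] = (i == 0 && v == 2) ? 0 : v;
    }

    /* Each unresolved 2 copies its right neighbor once that neighbor is 0 or 1.
       The host resets *pending to 0, launches this kernel on swapped buffers,
       and repeats while *pending is nonzero. */
    __global__ void resolve_carries(const int *in, int *out, int n, int *pending)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int v = in[i];
        if (v == 2) {
            if (in[i - 1] != 2) v = in[i - 1];  /* i >= 1 here: I[0] is never 2 */
            else *pending = 1;                  /* still inside a chain of 2s */
        }
        out[i] = v;
    }

    /* Final digit: (lo + hi + incoming carry) mod 2^32, where the incoming
       carry of position i is the resolved information of position i-1. */
    __global__ void add_with_carries(const unsigned int *lo, const unsigned int *hi,
                                     const int *I, int n, unsigned int *R)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned long long s = (unsigned long long)lo[i] + hi[i]
                             + (i > 0 ? I[i - 1] : 0);
        R[i] = (unsigned int)s;  /* truncation to 32 bits is mod 2^32 */
        /* A carry out of the top digit (I[n-1] == 1) would extend the
           result by one digit; omitted here for brevity. */
    }

Because the base is 2^32, a column sum equal to exactly BASE - 1 (carry information 2) is rare, so the resolution loop almost always terminates after very few passes.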
4. Experiments

In this section, we compare the proposed method with NTL [17] and GMP [3], two representative existing multiple precision arithmetic libraries for a single CPU thread, and also with Emmart et al.'s FFT multiplication on GPUs [2], which is the fastest of the existing studies on multiple precision integer multiplication on GPUs. In every test, a 3.10 GHz Intel Core i3-2100 and an NVIDIA GeForce GTX480 were used. We used 64-bit Windows 7 Professional SP1 as the OS and Visual Studio 2008 Professional as the compiler, except that for the GMP programs we used 64-bit Linux (Ubuntu 11.04) and g++ 4.5.2, because GMP does not officially support Windows. We used CUDA 3.2, display driver version 285.62, and 48 KB of shared memory. The versions of NTL and GMP are 5.5.2 and 5.0.4, respectively. NTL does not use SIMD instructions such as SSE instructions, whereas GMP uses SSE2 and MMX instructions. Both NTL and GMP are single-threaded libraries; multithreaded versions of them do not yet exist, and parallelizing them appears to be hard work. Therefore, the CPU programs use a single core only.

First, we show a comparison between the proposed method and NTL. Table 1 (Table 2) summarizes the execution times of a multiplication with the proposed method (NTL) for two multiple precision integers of length 8 Kbits to 256 Kbits. Table 3 shows the corresponding speedup ratios. These tables indicate that the proposed method is faster than NTL in all cases, with a maximum speedup of over 70 times.

Fig. 6: An example of a product digit table in the load balanced form (reconstructed from the table in Fig. 4)
Fig. 7: Thread block allocation for a product digit table in the load balanced form

Next, we show a comparison with GMP, which is known to be faster than NTL in general. Table 4 summarizes the execution times of a multiplication for the same multiple precision integers as in Table 1 and Table 2. Table 5 shows the corresponding speedup ratios. Although these results are inferior to the results against NTL, we still see over a 10 times speedup.

In addition, we conducted a speed comparison with Emmart et al.'s FFT multiplication on GPUs. Emmart et al. reported that a 255 Kbits × 255 Kbits multiplication takes 0.207 msec on a GTX480. Under the almost identical condition of 256 Kbits × 256 Kbits, the proposed method takes 0.5922 msec, as shown in Table 1. Hence, if the bit lengths of the two given multiple precision integers are the same, Emmart et al.'s FFT multiplication is about three times faster than the proposed method. However, if the bit lengths differ, the proposed method is faster. For example, consider the case of 256 Kbits × 8 Kbits. As noted in Section 1, Emmart et al.'s implementation is based on FFTs of 383 Kbits, so values smaller than 383 Kbits must be promoted to 383 Kbits and multiplied as 383 Kbits × 383 Kbits. In fact, they reported that 383 Kbits × 383 Kbits takes 0.200 msec on a GTX480, while 255 Kbits × 255 Kbits takes 0.207 msec due to the additional time for promotion. So in their implementation, the time for 256 Kbits × 8 Kbits is at least the time for 383 Kbits × 383 Kbits. In contrast, as shown in Table 1, the execution time of the proposed algorithm shortens when either the multiplier or the multiplicand is smaller than the other: a multiplication of 256 Kbits × 8 Kbits is about 13 times faster than a multiplication of 256 Kbits × 256 Kbits. In summary, in the case of 256 Kbits × 8 Kbits, the proposed method is about 4.57 times faster (0.207 msec / 0.0453 msec) than Emmart et al.'s GPU implementation.

5. Conclusion

We have proposed a novel data structure named a product digit table, and we have presented an algorithm based on it that executes multiple precision integer multiplication quickly on GPUs. The proposed method is based on manual calculation, so for a multiple precision integer multiplication where the two integers have the same bit length, our algorithm runs slower than FFT multiplication. However, if the bit lengths of the two given multiple precision integers are different and not extremely large, our algorithm is better than FFT multiplication. Future work includes further optimization of our implementation, in particular for integers of a few thousand bits, the lengths that arise in real problems including public-key cryptography.

Table 1: Execution times of A × B with the proposed algorithm on a GPU (msec; data transfer time between the GPU and the CPU is not included)

B \ A       8Kbits   16Kbits  32Kbits  64Kbits  128Kbits  256Kbits
8Kbits      0.0211   0.0290   0.0292   0.0311   0.0364    0.0453
16Kbits     0.0290   0.0212   0.0301   0.0334   0.0426    0.0607
32Kbits     0.0292   0.0301   0.0248   0.0395   0.0570    0.0934
64Kbits     0.0311   0.0334   0.0395   0.0530   0.0946    0.1671
128Kbits    0.0364   0.0426   0.0570   0.0946   0.1610    0.3110
256Kbits    0.0453   0.0607   0.0934   0.1671   0.3110    0.5922

Table 2: Execution times of A × B with the NTL library on a CPU (msec)

B \ A       8Kbits   16Kbits  32Kbits  64Kbits  128Kbits  256Kbits
8Kbits      0.094    0.179    0.357    0.695    1.388     2.799
16Kbits     0.179    0.273    0.560    1.031    2.049     4.059
32Kbits     0.357    0.560    0.777    1.574    3.068     6.666
64Kbits     0.695    1.031    1.574    2.464    4.590     9.455
128Kbits    1.388    2.049    3.068    4.590    6.989     13.972
256Kbits    2.799    4.059    6.666    9.455    13.972    20.766

Table 3: Speedup ratios of the proposed algorithm to NTL

B \ A       8Kbits   16Kbits  32Kbits  64Kbits  128Kbits  256Kbits
8Kbits      4.5      6.2      12.2     22.3     38.1      61.8
16Kbits     6.2      12.9     18.6     30.9     48.1      66.9
32Kbits     12.2     18.6     31.3     39.8     53.8      71.4
64Kbits     22.3     30.9     39.8     46.5     48.5      56.6
128Kbits    38.1     48.1     53.8     48.5     43.4      44.9
256Kbits    61.8     66.9     71.4     56.6     44.9      35.1

Table 4: Execution times of A × B with GMP on a CPU (msec)

B \ A       8Kbits   16Kbits  32Kbits  64Kbits  128Kbits  256Kbits
8Kbits      0.035    0.053    0.089    0.158    0.284     0.552
16Kbits     0.053    0.074    0.117    0.215    0.396     0.779
32Kbits     0.089    0.117    0.165    0.286    0.543     1.072
64Kbits     0.158    0.215    0.286    0.385    0.748     1.486
128Kbits    0.284    0.396    0.543    0.748    0.997     1.912
256Kbits    0.552    0.779    1.072    1.486    1.912     2.602

Table 5: Speedup ratios of the proposed algorithm to GMP

B \ A       8Kbits   16Kbits  32Kbits  64Kbits  128Kbits  256Kbits
8Kbits      1.7      1.8      3.0      5.1      7.8       12.2
16Kbits     1.8      3.5      3.9      6.4      9.3       12.8
32Kbits     3.0      3.9      6.7      7.2      9.5       11.5
64Kbits     5.1      6.4      7.2      7.3      7.9       8.9
128Kbits    7.8      9.3      9.5      7.9      6.2       6.1
256Kbits    12.2     12.8     11.5     8.9      6.1       4.4

References

[1] Emmart, N. and Weems, C., "High Precision Integer Addition, Subtraction and Multiplication with a Graphics Processing Unit," Parallel Processing Letters, Vol.20, No.4, pp.293-306, 2010.
[2] Emmart, N. and Weems, C., "High Precision Integer Multiplication with a GPU," IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS), pp.1781-1787, May 2011.
[3] Free Software Foundation, "The GNU Multiple Precision Arithmetic Library," http://gmplib.org/, 2011.
[4] Garland, M. and Kirk, D. B., "Understanding Throughput-Oriented Architectures," Communications of the ACM, Vol.53, No.11, pp.58-66, 2010.
[5] Gaudry, P. and Thomé, E., "The mpfq Library and Implementing Curve-based Key Exchanges," Software Performance Enhancement for Encryption and Decryption Workshop, pp.49-64, 2007.
[6] Giorgi, P., Izard, T., and Tisserand, A., "Comparison of Modular Arithmetic Algorithms on GPUs," Int'l Conf. on Parallel Computing (ParCo), http://hal-lirmm.ccsd.cnrs.fr/docs/00/43/06/89/pdf/article-parco09.pdf, Sept. 2009.
[7] Karatsuba, A. and Ofman, Y., "Multiplication of Multidigit Numbers on Automata," Doklady Akademii Nauk SSSR, Vol.145, No.2, pp.293-294, 1962 (in Russian). English translation in Soviet Physics-Doklady, Vol.7, pp.595-596, 1963.
[8] Kirk, D. B. and Hwu, W. W., "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann, 2010.
[9] Knuth, D. E., "The Art of Computer Programming," Vol.2: Seminumerical Algorithms, 3rd Edition, Addison-Wesley, 1997.
[10] Lehman, M. and Burla, N., "Skip Techniques for High-Speed Carry Propagation in Binary Arithmetic Units," IRE Trans. Electronic Computers, Vol.EC-10, pp.691-698, 1961.
[11] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Vol.28, No.2, pp.39-55, 2008.
[12] Liu, H. J. and Tong, C., "GMP Implementation on CUDA: A Backward Compatible Design with Performance Tuning."
[13] NVIDIA, "CUDA Programming Guide Version 5.5," http://www.nvidia.com/object/cuda_develop.html, 2013.
[14] NVIDIA, "CUDA Best Practices Guide Version 5.5," http://www.nvidia.com/object/cuda_develop.html, 2013.
[15] Sanders, J. and Kandrot, E., "CUDA by Example: An Introduction to General-Purpose GPU Programming," Addison-Wesley Professional, 2010.
[16] Schönhage, A. and Strassen, V., "Schnelle Multiplikation grosser Zahlen," Computing, Vol.7, pp.281-292, 1971.
[17] Shoup, V., "NTL: A Library for doing Number Theory," http://www.shoup.net/ntl/, 2009.
[18] Tanaka, T. and Murao, H., "An Efficient Method for Multiple-Precision Integer Arithmetics Using GPU," IPSJ SIG Technical Report, Vol.2010-HPC-124, No.2, pp.1-7, 2010 (in Japanese).
[19] Toom, A. L., "The Complexity of a Scheme of Functional Elements Realizing the Multiplication of Integers," Soviet Mathematics Doklady, Vol.3, pp.714-716, 1963.
[20] Zhao, K., "Implementation of Multiple-precision Modular Multiplication on GPU," http://www.comp.hkbu.edu.hk/~pgday/2009/10th_papers/kzhao.pdf, 2009.
[21] Zhao, K. and Chu, X., "GPUMP: A Multiple-Precision Integer Library for GPUs," IEEE Int'l Conf. on Computer and Information Technology (CIT), pp.1164-1168, June 2010.
[22] Zuras, D., "More on Squaring and Multiplying Large Integers," IEEE Trans. Computers, Vol.43, No.8, pp.899-908, Aug. 1994.