Multiple Precision Integer Multiplication on GPUs

Koji Kitano and Noriyuki Fujimoto
Graduate School of Science, Osaka Prefecture University, Sakai-shi, Osaka, Japan

Abstract
This paper addresses multiple precision integer multiplication on GPUs. We propose a novel data structure named a product digit table and present a GPU algorithm that performs the multiplication with the product digit table. Experimental results on a 3.10 GHz Intel Core i CPU and an NVIDIA GeForce GTX480 GPU show that the proposed GPU algorithm runs over 71.4 times and 12.8 times faster than the NTL library and the GMP library respectively, two common libraries for single-threaded multiple precision arithmetic on CPUs. Further experiments show that the proposed GPU algorithm is also faster than the fastest existing GPU algorithm, which is based on FFT multiplication, when the bit lengths of the two given multiple precision integers differ.

Keywords: multiple precision integer, parallel multiplication, GPGPU, CUDA

1. Introduction
Multiple precision integer arithmetic has several applications, for example the primality testing of large numbers, which is important in public-key cryptography. There are many works on multiplication, which underlies all the other multiple precision arithmetic operations. Representative methods include the Karatsuba method [7], the Toom-Cook method [19], the 4-way method [22], the 5-way method [22], and Strassen FFT multiplication [16]. The time complexities of the first four methods are O(n^1.585), O(n^1.465), O(n^1.404), and O(n^1.365) respectively, where n is the number of bits of the multiplicand and the multiplier. However, in the study [22], Zuras compares implementations of these methods in C and assembly language on an HP-9000/720, and reports that the naive O(n^2) method is the best for small numbers and that all the non-FFT methods are faster than FFT multiplication, which has an advantage in time complexity for large numbers.
Finally, Zuras concludes that FFT multiplication is not always the fastest even for extremely large numbers (> 37,000,000 bits). On the other hand, research on GPGPU (General-Purpose computation on Graphics Processing Units) has attracted much attention in recent years, and several works on multiple precision integer multiplication with a GPU are known. The fastest of these existing works is the implementation [2] of FFT multiplication on GPUs. This method puts multiple precision integers into a fixed-radix representation and computes multiple precision integer multiplication by the Karatsuba method, except that one-digit × one-digit multiplications are performed by FFT multiplication. This method is fast if the bit lengths of the multiplier and the multiplicand are the same and are multiples of the supported FFT size. However, it slows down significantly if the bit lengths of the multiplier and the multiplicand are different or are not such multiples, because in that case the two given numbers are promoted to numbers of the bit length that is the smallest supported multiple not smaller than the bit length of the larger of the two numbers [2]. Moreover, the bit lengths currently required in cryptology are a few thousand. In this paper, we propose a novel data structure named a product digit table and present a GPU algorithm that performs the multiplication with the product digit table. The proposed method differs from FFT multiplication in that there is no need to promote the two given numbers even if their bit lengths are not the same. In addition, since the proposed method represents numbers in base 2^32, the efficiency loss is relatively low even for numbers of a few thousand bits. The remainder of this paper is organized as follows. In Section 2, we briefly review existing algorithms for GPUs. In Section 3, we present the proposed GPU algorithm in detail. In Section 4, we show some experimental results. In Section 5, we give some concluding remarks and future works.
Due to the limited space, we describe neither the architecture nor the programming of GPUs in this paper. Readers unfamiliar with these topics are referred to the studies [4], [8], [13], [14], [15].

2. Related Works
As far as we know, there is no existing research on multiple precision integer multiplication for GPUs other than the following. In the study [2], the authors improved their result in the study [1] and implemented on CUDA a method in which the Karatsuba method divides the whole multiplication into multiplications of smaller numbers, which are then performed by Strassen FFT multiplication. Their experiments show up to 4.29 times speedup over a single core of a 2.93 GHz Intel Core i7 870 with an NVIDIA GeForce GTX480 for integer multiplications of length 255 Kbits to Mbits. In the study [6], the authors reported 0.8 to 2.9 times speedup relative to the CPU library mpfq [5] with SSE2 instructions for integer multiplications of length 160 to 384 bits on a 3 GHz Intel Core2 Duo E8400 and an NVIDIA GeForce 9800GX2. In the study [21], under the assumption that many independent multiple precision arithmetic operations are

given, the authors proposed a multiple precision arithmetic library that executes each operation with a single thread, and they report about 4 times speedup relative to the CPU library GNU MP [3] for integer multiplications of 2048 bits × 2048 bits with a single core of a 2.80 GHz Intel Core i7 and an NVIDIA GeForce GTX280. In the study [18], the authors implemented the modular algorithm [9], which can execute in O(n) time a multiplication of two numbers represented in the residue number system, although transforming an n-word integer to and from the residue number representation requires O(n^2) time. In the study [20], the authors reported an optimization of multiple precision integer multiplication of 1024 bits × 1024 bits with the Karatsuba method on CUDA. In the study [12], the authors implemented a multiple precision multiplication with FFT on CUDA and reported 10 times speedup, but the CPU and GPU used are not specified.

3. The Proposed Method
3.1 A Product Digit Table
We can compute an integer multiplication by manual calculation as shown on the left side of Fig. 1. In the example, the base is 10. We consider computing the multiplication with a table as shown on the right side of Fig. 1. An element of the table is the quotient or remainder obtained when a product of one digit of the multiplier and one digit of the multiplicand is divided by the base. Therefore, we can compute each element independently. For example, the white 2 on a black background in the table is the remainder of 8 × 4 divided by base 10, and the white 4 on a black background is the quotient of 5 × 9 divided by base 10. We can get the product by computing the summation column by column in the table and propagating the carries. We call such a table a product digit table. Fig. 2 shows the typical shape of a product digit table for a product of a multiple precision integer A of a digits and a multiple precision integer B of b digits (a ≥ b). Only the elements in the gray-scaled areas A1, A2, and A3 have values. Fig.
2 implies that a product digit table has the following properties.

Fig. 1: Manual calculation of a multiplication and the corresponding product digit table
Fig. 2: Shape of the product digit table in case of a digits × b digits

- The number of the columns in A1 (A3) is equal to b.
- The number of the rows of a column in A1 (A3) increases (decreases) monotonically by two as the columns are scanned from left to right.
- The number of the columns in A2 equals a - b, and the number of the rows in A2 is exactly 2b.

When A and B in a base BASE are stored in arrays A[0..a-1] and B[0..b-1] respectively (A[0] and B[0] are the least significant digits), the algorithm in Fig. 3 generates a product digit table of A × B on a 2D array T. Each element of T that is not assigned a value is don't-care. Fig. 4 shows an example of a product digit table built by the algorithm when A is a five-digit number and B is a three-digit number.

3.2 Data Structure
Each element of a product digit table is computed from a product of one digit of A and one digit of B. We set BASE = 2^32 to execute the computation of the product as efficiently as possible. This setting makes every such product fit in 64 bits (unsigned long long type on GPUs). Therefore, we can compute each quotient by a 32-bit right shift of the 64-bit product, and each remainder by truncating the product to its low 32 bits (unsigned int type). In the proposed GPU algorithm, we basically allocate a thread to each column of a product digit table so that both the summation for each column and the carry propagation can be done in parallel. However, if we directly used the product digit table of Fig. 2, the amount of computation and memory access would be imbalanced among the threads allocated to the columns in A1 or A3. Hence, we represent a product digit table as a 2D array of size a × 2b as shown in Fig. 5, rather than a 2D array of size (a + b) × 2b as shown in Fig. 2.
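To make the construction and the column summation concrete, the following CPU sketch (Python; the function names are ours, not those of the paper's Fig. 3) builds the column lists of a product digit table and obtains the product by summing each column and propagating carries. Base 10 mirrors Fig. 1; the same code works unchanged with BASE = 2^32.

```python
def product_digit_table(A, B, base):
    """Column lists of the product digit table for little-endian digit
    arrays A and B (A[0] and B[0] are the least significant digits)."""
    a, b = len(A), len(B)
    cols = [[] for _ in range(a + b)]
    for i in range(a):
        for j in range(b):
            p = A[i] * B[j]
            cols[i + j].append(p % base)        # remainder digit
            cols[i + j + 1].append(p // base)   # quotient digit
    return cols

def multiply(A, B, base):
    """Sum each column, then propagate carries to get the product digits."""
    sums = [sum(col) for col in product_digit_table(A, B, base)]
    digits, carry = [], 0
    for s in sums:
        carry, d = divmod(s + carry, base)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, base)
        digits.append(d)
    return digits

# Example in base 10: 85 x 97
print(multiply([5, 8], [7, 9], 10))  # -> [5, 4, 2, 8], i.e. 8245
```

On the GPU each column sum is computed by its own thread; this sketch only shows that the table plus column sums reproduce ordinary multiplication.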
In the rest of this paper, for a product digit table in the load balanced form, R1 is defined as the rectangle that corresponds to A2, and R2 is defined as the rectangle that corresponds to the two triangles A1 and A3 (see Fig. 2 and Fig. 5). Fig. 6 shows an example of a product digit table in the load balanced form.

Fig. 3: An algorithm for constructing a product digit table
Fig. 4: An example of a product digit table generated by the algorithm in Fig. 3 (in case that A is a five-digit number and B is a three-digit number)

Notice that the order of the elements in each column never affects the result of the multiplication in question. Hence, we can arbitrarily arrange the elements within their columns. R2 in Fig. 5 has the following properties.

- The horizontal length is b, and the vertical length is 2b.
- The border line that splits R2 into A1 and A3 lies between the 2α-th row and the (2α + 1)-th row in the α-th column from the right.
- In the β-th row, B[⌊β/2⌋] is referenced in all columns, and in the α-th column, A[α] is referenced at the top row and the indices of the referenced elements of A cyclically increase one by one every two rows.

3.3 Algorithm
We divide R2 into tiles of width BLOCK_SIZE elements and height BLOCK_SIZE × 2 elements as shown in Fig. 7. In the proposed algorithm, each thread block (block for short) has BLOCK_SIZE threads and is allocated to a tile so that each thread computes the summation of one column of the tile. The tiles can be categorized into three types: tiles with A1 elements only, tiles with A3 elements only, and tiles with both. We refer to these types as T1, T3, and T13 respectively. In a tile of type T13, a conditional branch is needed, and therefore the execution efficiency is lower than that of type T1 or T3. However, the slowdown for the whole program is negligible because the number of T13 tiles is much smaller than the total number of T1 and T3 tiles. In fact, the parallelism of a block is fully sustained in tiles of T1 and T3. Each block accesses BLOCK_SIZE × 2 elements of A and BLOCK_SIZE elements of B. This enables each thread to compute its column summation using values on shared memory only. The device memory accesses that load these elements into shared memory can be coalesced, because each block references contiguous elements of A and B.
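The effect of the load balancing can be checked on the CPU with the following sketch (our own naming and indexing, derived from the properties above, not the paper's code): column k of the a × 2b form packs the 2k + 1 entries of output column k together with the 2b - 2k - 1 entries of output column a + k, so every column holds exactly 2b entries, and the two partial sums are accumulated into C[k] and C[k + a] as the kernel does.

```python
def balanced_columns(A, B, base):
    """Fold the (a+b)-column table into a columns of exactly 2b entries
    each (a >= b): column k < b also carries the entries of output
    column a + k."""
    a, b = len(A), len(B)
    assert a >= b
    cols = [[] for _ in range(a + b)]   # plain per-output-column lists
    for i in range(a):
        for j in range(b):
            p = A[i] * B[j]
            cols[i + j].append(p % base)
            cols[i + j + 1].append(p // base)
    balanced = []
    for k in range(a):
        own = cols[k]
        folded = cols[a + k] if k < b else []
        assert len(own) + len(folded) == 2 * b   # perfectly balanced
        balanced.append((own, folded))
    return balanced

def column_sums(A, B, base):
    """Accumulate the two partial sums of each balanced column into C."""
    a, b = len(A), len(B)
    C = [0] * (a + b)
    for k, (own, folded) in enumerate(balanced_columns(A, B, base)):
        C[k] += sum(own)             # A1/A2 part -> C[k]
        if folded:
            C[k + a] += sum(folded)  # A3 part -> C[k + a]
    return C
```

The assertion documents the balancing: every column of the folded table has exactly 2b entries, so every thread does the same amount of work.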
3.4 Details of the Implementation
We describe the computation to be performed for R2. Regardless of the type of a tile, the first thing for a block to do is to load the necessary elements from device memory into shared memory.

Fig. 5: A product digit table in load balanced form (general case)

Then, each thread of a block computes the summation of the elements in its allocated column col. Each thread of a block that computes a T13 tile separately computes two summations, one of the A1 elements and one of the A3 elements. Then, each thread adds the summation of the A1 elements to index col of an array C that stores the answer, and adds the summation of the A3 elements to index col+a of C. We call the CUDA C kernel function for rectangle R2 mul_bint_a. Next, we explain the kernel function for rectangle R1. This function is identical to the function for rectangle R2 except that the indices used to reference A, B, and C are changed. We call the CUDA C kernel function for rectangle R1 mul_bint_b. If the number of digits is large enough, almost all elements of C are likely to exceed the base 2^32 at this stage. Thus, we perform carry propagation to make all elements smaller than the base. We divide this processing into two steps for efficiency. The first step is to separate each sum in C into its least significant digit and its carry. We call the CUDA C kernel function for the first step mul_bint_c. Note that the base 2^32 is large enough to limit each carry to at most one digit. Therefore, we can regard all the least significant digits as forming a multiple precision integer with base 2^32. Similarly, all the carries form a multiple precision integer. Hence, in the second step, we add these two multiple precision integers. We now explain the detailed implementation of the second step. In general, a multiple precision integer sum involves carry propagation, which is essentially sequential. In the worst case, a carry propagates from the least significant digit to the most significant digit, which takes as much computation time as a single thread computing the whole sum. Our implementation is essentially the same as the "carry skip adder" [10], one of the hardware implementations of a full adder.
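A sequential sketch of this two-step carry handling (Python; the function names are ours, assuming BASE = 2^32): the first function splits each column sum as mul_bint_c does; the second adds the resulting low-digit and carry integers using the carry-information idea of Fig. 8, which the text details next.

```python
BASE = 2 ** 32

def split_digit_and_carry(C, base=BASE):
    """First step (cf. mul_bint_c): split each column sum into its least
    significant base digit and its carry.  The carry of C[k] has weight
    base**(k+1), hence the shift by one position."""
    low = [c % base for c in C] + [0]
    high = [0] + [c // base for c in C]
    return low, high          # two equal-length little-endian integers

def carry_info_add(A, B, base=BASE):
    """Second step: add two equal-length digit arrays with the carry
    information of Fig. 8.  I[k] = 1 if A[k]+B[k] >= base (carry out
    for sure), 2 if A[k]+B[k] == base-1 (an incoming carry would pass
    through), else 0.  One ascending sweep resolves the 2s here because
    each right neighbour is already resolved; the GPU version instead
    repeats the parallel neighbour-inspection step until no 2 remains."""
    n = len(A)
    I = []
    for x, y in zip(A, B):
        s = x + y
        I.append(1 if s >= base else 2 if s == base - 1 else 0)
    for k in range(n):
        if I[k] == 2:
            I[k] = I[k - 1] if k > 0 else 0
    # I[k] is now the carry OUT of position k; the carry IN is I[k-1].
    out = [(A[k] + B[k] + (I[k - 1] if k > 0 else 0)) % base
           for k in range(n)]
    if I[n - 1]:
        out.append(I[n - 1])
    return out
```

For example, with base 10 the sums C = [19, 9] split into low = [9, 9, 0] and high = [0, 1, 0], and carry_info_add yields [9, 0, 1], i.e. 109 = 19 + 90.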
However, we devised our implementation as below so that the addition finishes quickly if only short carry propagations occur. Our implementation exploits the fact that the probability that any carry arises at all is very small because the base 2^32 is large. Fig. 8 illustrates the idea behind our implementation, but there the base is 10 for simplicity. First, from the two given arrays A and B we generate in parallel an array I that stores carry information as shown in Fig. 8. The carry information is 1 if the sum of the corresponding two elements is not smaller than the base, 2 if it equals base-1 (i.e., 9 in Fig. 8), or 0 otherwise. Next, for every element whose carry information is 2, in parallel we change the 2 to 1 (0) if its right neighbor is 1 (0). Notice that a 2 whose right neighbor is also 2 is not changed. We repeat this parallel processing until all 2s have disappeared. Then, we compute and store the sum of three elements: the corresponding two elements of the two given multiple precision integers and the final carry information of the right neighbor. Finally, we change each element of the array to its remainder modulo the base. The resulting array is the multiple precision integer sum. This process needs barrier synchronization among blocks, so we achieve it by dividing the whole process into three kernel functions mul_bint_d, mul_bint_e, and mul_bint_f. These functions are executed only once each, in the order mul_bint_d, mul_bint_e, mul_bint_f.

4. Experiments
In this section, we compare the proposed method with NTL [17] and GMP [3], two representative existing multiple precision arithmetic libraries for a single CPU thread, and also with Emmart et al.'s FFT multiplication on GPUs [2], which is the fastest of the existing studies on multiple precision integer multiplication on GPUs. For each test, a 3.10 GHz Intel Core i CPU and an NVIDIA GeForce GTX480 GPU were used.
We used 64-bit Windows 7 Professional SP1 as the OS and Visual Studio 2008 Professional as the compiler, except that, only for the GMP programs, we used 64-bit Linux (Ubuntu 11.04) as the OS and g++ as the compiler, because GMP does not officially support Windows. We used CUDA Ver. 3.2 and 48 KB of shared memory. NTL does not use SIMD instructions such as SSE instructions, but GMP uses SSE2 and MMX instructions. Both NTL and GMP are single-threaded libraries; multithreaded versions of them do not exist yet, and parallelizing them appears to be hard work. Therefore, the CPU programs use a single core only. First, we show a comparison between the proposed method and NTL. Table 1 (Table 2) summarizes the execution times of a multiplication with the proposed method (NTL) for two multiple precision integers of lengths 8 Kbits to 256 Kbits. Table 3 shows the corresponding speedup ratios. These tables indicate that the proposed method is faster than NTL in all cases and that the maximum speedup is over 70 times.

Fig. 6: An example of a product digit table in the load balanced form (reconstructed from the table in Fig. 4)
Fig. 7: Thread block allocation for a product digit table in the load balanced form
Fig. 8: An algorithm for computing a multiple precision integer sum

Next, we show a comparison with GMP, which is known to be generally faster than NTL. Table 4 summarizes the execution times of a multiplication for the same multiple precision integers as in Table 1 and Table 2. Table 5 shows the corresponding speedup ratios. Although these results are inferior to the results for NTL, we can see over 10 times speedup. In addition, we conducted a speed comparison with Emmart et al.'s FFT multiplication on GPUs. Emmart et al. reported that 255 Kbits × 255 Kbits with a GTX480 takes msec. Under the almost identical condition of 256 Kbits × 256 Kbits, the proposed method takes msec as shown in Table 1. Hence, if the bit lengths of the two given multiple precision integers are the same, Emmart et al.'s FFT multiplication is about three times faster than the proposed method. However, if the bit lengths of the two given multiple precision integers are different, the proposed method is faster than Emmart et al.'s FFT multiplication. For example, consider the case of 256 Kbits × 8 Kbits. As noted in Section 1, Emmart et al.'s implementation is based on an FFT of 383 Kbits. Thus, values smaller than 383 Kbits must be promoted to 383 Kbits and multiplied as 383 Kbits × 383 Kbits. In fact, they reported that 383 Kbits × 383 Kbits with a GTX480 takes msec, whereas 255 Kbits × 255 Kbits takes msec due to the additional time for promotion. So in their implementation, the time for 256 Kbits × 8 Kbits is at least the time for 383 Kbits × 383 Kbits. In contrast, as shown in Table 1, the execution time of the proposed algorithm shortens if either the multiplier or the multiplicand is smaller than the other. For example, a multiplication of 256 Kbits × 8 Kbits is about 13 times faster than a multiplication of 256 Kbits × 256 Kbits. In summary, in the case of 256 Kbits × 8 Kbits, the proposed method is about 4.57 times faster than Emmart et al.'s GPU implementation (0.207 msec / msec).

5. Conclusion
We have proposed a novel data structure named a product digit table, and we have presented an algorithm that quickly executes a multiple precision integer multiplication on GPUs based on the product digit table. The proposed method is based on manual calculation, so in case of a multiple

precision integer multiplication of two integers of the same bit length, our algorithm runs slower than FFT multiplication. However, if the bit lengths of the two given multiple precision integers are different and their bit counts are not extremely large, our algorithm is better than FFT multiplication. Future works include further optimization of our implementation, in particular for integers of length a few thousand bits, which are relevant to real problems including public-key cryptography.

Table 1: Execution times of A × B with the proposed algorithm on a GPU (msec; data transfer time between the GPU and the CPU is not included). Rows: B = 8, 16, 32, 64, 128, 256 Kbits; columns: A = 8, 16, 32, 64, 128, 256 Kbits.
Table 2: Execution times of A × B with the NTL library on a CPU (msec), for the same sizes as Table 1.
Table 3: Speedup ratios of the proposed algorithm to NTL, for the same sizes as Table 1.
Table 4: Execution times of A × B with GMP on a CPU (msec), for the same sizes as Table 1.

References
[1] Emmart, N. and Weems, C., "High Precision Integer Addition, Subtraction and Multiplication with a Graphics Processing Unit," Parallel Processing Letters, Vol.20, No.4, 2010.
[2] Emmart, N. and Weems, C., "High Precision Integer Multiplication with a GPU," IEEE Int'l Parallel & Distributed Processing Symposium (IPDPS), May 2011.

Table 5: Speedup ratios of the proposed algorithm to GMP, for the same sizes as Table 1.

[3] Free Software Foundation, "The GNU Multiple Precision Arithmetic Library."
[4] Garland, M. and Kirk, D. B., "Understanding Throughput-Oriented Architectures," Communications of the ACM, Vol.53, No.11, pp.58-66, 2010.
[5] Gaudry, P. and Thomé, E., "The mpfq Library and Implementing Curve-based Key Exchanges," Software Performance Enhancement for Encryption and Decryption (SPEED) Workshop, pp.49-64, 2007.
[6] Giorgi, P., Izard, T., and Tisserand, A., "Comparison of Modular Arithmetic Algorithms on GPUs," Int'l Conf. on Parallel Computing (ParCo), Sept. 2009.
[7] Karatsuba, A. and Ofman, Y., "Multiplication of Multidigit Numbers on Automata," Doklady Akademii Nauk SSSR, Vol.145, No.2, pp.293-294, 1962 (in Russian). English translation in Soviet Physics-Doklady, Vol.7, pp.595-596, 1963.
[8] Kirk, D. B. and Hwu, W. W., "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann, 2010.
[9] Knuth, D. E., "The Art of Computer Programming," Vol.2, Seminumerical Algorithms, 3rd Edition, Addison-Wesley, 1997.
[10] Lehman, M. and Burla, N., "Skip Techniques for High-Speed Carry Propagation in Binary Arithmetic Units," IRE Trans. Electronic Computers, Vol.EC-10, 1961.
[11] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, Vol.28, No.2, 2008.
[12] Liu, H. J. and Tong, C., "GMP Implementation on CUDA: A Backward Compatible Design with Performance Tuning."
[13] NVIDIA, "CUDA Programming Guide, Version 5.5."
[14] NVIDIA, "CUDA Best Practices Guide, Version 5.5."
[15] Sanders, J. and Kandrot, E., "CUDA by Example: An Introduction to General-Purpose GPU Programming," Addison-Wesley Professional, 2010.
[16] Schönhage, A. and Strassen, V., "Schnelle Multiplikation grosser Zahlen," Computing, Vol.7, pp.281-292, 1971.
[17] Shoup, V., "NTL: A Library for doing Number Theory."
[18] Tanaka, T.
and Murao, H., "An Efficient Method for Multiple-Precision Integer Arithmetics Using GPU," Information Processing Society of Japan SIG Technical Report, Vol.2010-HPC-124, No.2, 2010 (in Japanese).
[19] Toom, A. L., "The Complexity of a Scheme of Functional Elements Realizing the Multiplication of Integers," Soviet Mathematics Doklady, Vol.3, 1963.
[20] Zhao, K., "Implementation of Multiple-precision Modular Multiplication on GPU," pgday/2009/10th_papers/kzhao.pdf.
[21] Zhao, K. and Chu, X., "GPUMP: A Multiple-Precision Integer Library for GPUs," IEEE Int'l Conf. on Computer and Information Technology (CIT), June 2010.
[22] Zuras, D., "More on Squaring and Multiplying Large Integers," IEEE Trans. Computers, Vol.43, No.8, Aug. 1994.


More information

Let s put together a Manual Processor

Let s put together a Manual Processor Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)

Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2) Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 683-690 Research India Publications http://www.ripublication.com/aeee.htm Implementation of Modified Booth

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

Image Processing & Video Algorithms with CUDA

Image Processing & Video Algorithms with CUDA Image Processing & Video Algorithms with CUDA Eric Young & Frank Jargstorff 8 NVIDIA Corporation. introduction Image processing is a natural fit for data parallel processing Pixels can be mapped directly

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Lecture 2. Binary and Hexadecimal Numbers

Lecture 2. Binary and Hexadecimal Numbers Lecture 2 Binary and Hexadecimal Numbers Purpose: Review binary and hexadecimal number representations Convert directly from one base to another base Review addition and subtraction in binary representations

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder

Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder Performance Analysis and Comparison of 15.1 and H.264 Encoder and Decoder K.V.Suchethan Swaroop and K.R.Rao, IEEE Fellow Department of Electrical Engineering, University of Texas at Arlington Arlington,

More information

An OpenCL Candidate Slicing Frequent Pattern Mining Algorithm on Graphic Processing Units*

An OpenCL Candidate Slicing Frequent Pattern Mining Algorithm on Graphic Processing Units* An OpenCL Candidate Slicing Frequent Pattern Mining Algorithm on Graphic Processing Units* Che-Yu Lin Science and Information Engineering Chung Hua University b09502017@chu.edu.tw Kun-Ming Yu Science and

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

The enhancement of the operating speed of the algorithm of adaptive compression of binary bitmap images

The enhancement of the operating speed of the algorithm of adaptive compression of binary bitmap images The enhancement of the operating speed of the algorithm of adaptive compression of binary bitmap images Borusyak A.V. Research Institute of Applied Mathematics and Cybernetics Lobachevsky Nizhni Novgorod

More information

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

GPGPU Parallel Merge Sort Algorithm

GPGPU Parallel Merge Sort Algorithm GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.

Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp. Advances in Smart Systems Research : ISSN 2050-8662 : http://nimbusvault.net/publications/koala/assr/ Vol. 3. No. 3 : pp.49-54 : isrp13-005 Optimized Communications on Cloud Computer Processor by Using

More information

Modern Platform for Parallel Algorithms Testing: Java on Intel Xeon Phi

Modern Platform for Parallel Algorithms Testing: Java on Intel Xeon Phi I.J. Information Technology and Computer Science, 2015, 09, 8-14 Published Online August 2015 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2015.09.02 Modern Platform for Parallel Algorithms

More information

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,

More information

Parallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com

Parallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic: Binary Numbers In computer science we deal almost exclusively with binary numbers. it will be very helpful to memorize some binary constants and their decimal and English equivalents. By English equivalents

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

The Methodology of Application Development for Hybrid Architectures

The Methodology of Application Development for Hybrid Architectures Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department

More information

Visualization Tool for GPGPU Programming

Visualization Tool for GPGPU Programming ASEE 2014 Zone I Conference, April 3-5, 2014, University of Bridgeport, Bridgeport, CT, USA. Visualization Tool for GPGPU Programming Peter J. Zeno Department of Computer Science and Engineering University

More information

IP Video Rendering Basics

IP Video Rendering Basics CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.

More information

Operating System for the K computer

Operating System for the K computer Operating System for the K computer Jun Moroo Masahiko Yamada Takeharu Kato For the K computer to achieve the world s highest performance, Fujitsu has worked on the following three performance improvements

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

QCD as a Video Game?

QCD as a Video Game? QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture

More information

Lecture 8: Binary Multiplication & Division

Lecture 8: Binary Multiplication & Division Lecture 8: Binary Multiplication & Division Today s topics: Addition/Subtraction Multiplication Division Reminder: get started early on assignment 3 1 2 s Complement Signed Numbers two = 0 ten 0001 two

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

Interactive Level-Set Deformation On the GPU

Interactive Level-Set Deformation On the GPU Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation

More information

LSN 2 Computer Processors

LSN 2 Computer Processors LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

More information

Exploiting GPU Hardware Saturation for Fast Compiler Optimization

Exploiting GPU Hardware Saturation for Fast Compiler Optimization Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics

More information

Using the Game Boy Advance to Teach Computer Systems and Architecture

Using the Game Boy Advance to Teach Computer Systems and Architecture Using the Game Boy Advance to Teach Computer Systems and Architecture ABSTRACT This paper presents an approach to teaching computer systems and architecture using Nintendo s Game Boy Advance handheld game

More information

Efficient representation of integer sets

Efficient representation of integer sets Efficient representation of integer sets Marco Almeida Rogério Reis Technical Report Series: DCC-2006-06 Version 1.0 Departamento de Ciência de Computadores & Laboratório de Inteligência Artificial e Ciência

More information

Intrusion Detection Architecture Utilizing Graphics Processors

Intrusion Detection Architecture Utilizing Graphics Processors Acta Informatica Pragensia 1(1), 2012, 50 59, DOI: 10.18267/j.aip.5 Section: Online: aip.vse.cz Peer-reviewed papers Intrusion Detection Architecture Utilizing Graphics Processors Liberios Vokorokos 1,

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Real-time Visual Tracker by Stream Processing

Real-time Visual Tracker by Stream Processing Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol

More information

MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN. zl2211@columbia.edu. ml3088@columbia.edu

MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN. zl2211@columbia.edu. ml3088@columbia.edu MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN Zheng Lai Zhao Liu Meng Li Quan Yuan zl2215@columbia.edu zl2211@columbia.edu ml3088@columbia.edu qy2123@columbia.edu I. Overview Architecture The purpose

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

A Distributed Render Farm System for Animation Production

A Distributed Render Farm System for Animation Production A Distributed Render Farm System for Animation Production Jiali Yao, Zhigeng Pan *, Hongxin Zhang State Key Lab of CAD&CG, Zhejiang University, Hangzhou, 310058, China {yaojiali, zgpan, zhx}@cad.zju.edu.cn

More information

Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs Nathan Whitehead Alex Fit-Florea ABSTRACT A number of issues related to floating point accuracy and compliance are a frequent

More information

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0 Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

More information

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method Determining the Optimal Combination of Trial Division and Fermat s Factorization Method Joseph C. Woodson Home School P. O. Box 55005 Tulsa, OK 74155 Abstract The process of finding the prime factorization

More information

Fast Implementations of AES on Various Platforms

Fast Implementations of AES on Various Platforms Fast Implementations of AES on Various Platforms Joppe W. Bos 1 Dag Arne Osvik 1 Deian Stefan 2 1 EPFL IC IIF LACAL, Station 14, CH-1015 Lausanne, Switzerland {joppe.bos, dagarne.osvik}@epfl.ch 2 Dept.

More information

Ashraf Abusharekh Kris Gaj Department of Electrical & Computer Engineering George Mason University

Ashraf Abusharekh Kris Gaj Department of Electrical & Computer Engineering George Mason University COMPARATIVE ANALYSIS OF SOFTWARE LIBRARIES FOR PUBLIC KEY CRYPTOGRAPHY Ashraf Abusharekh Kris Gaj Department of Electrical & Computer Engineering George Mason University 1 OBJECTIVE Evaluation of Multi-precision

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Introduction to GPU Computing

Introduction to GPU Computing Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture

More information

High-speed image processing algorithms using MMX hardware

High-speed image processing algorithms using MMX hardware High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to

More information

Session 6 Number Theory

Session 6 Number Theory Key Terms in This Session Session 6 Number Theory Previously Introduced counting numbers factor factor tree prime number New in This Session composite number greatest common factor least common multiple

More information

Computation of 2700 billion decimal digits of Pi using a Desktop Computer

Computation of 2700 billion decimal digits of Pi using a Desktop Computer Computation of 2700 billion decimal digits of Pi using a Desktop Computer Fabrice Bellard Feb 11, 2010 (4th revision) This article describes some of the methods used to get the world record of the computation

More information