FFT OpenCL and Polynomial Multiplication

CSE 5211 Design and Analysis of Algorithms
D. Eiland, Y. Duan & S. Wang
12/1/2011

OpenCL based Polynomial Multiplication

OpenCL

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. It supports both task-based and data-based parallelism and has been adopted by Intel, AMD, Nvidia, and ARM. Because OpenCL is still a relatively new technology, we begin by illustrating how to write a "Hello World" program.

Headers

Just like any other external API used in C++, a header file must be included when using the OpenCL API. For the C++ bindings we use cl.hpp. Bindings also exist for other languages such as Java, but our team decided to use C++ for this project. A small number of additional C++ headers, which are agnostic to OpenCL, are also used.

Figure 1: Headers in Hello World

Errors

A common property of most OpenCL API calls is that they either return an error code (type cl_int) as the result of the function itself, or store the error code at a location passed by the user as a parameter to the call. The application must therefore check every call and behave correctly when an error occurs, so it is useful to define one function that handles the error code each time, as shown in Figure 2.
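A minimal sketch of such an error-handling function is given here. It is not the code from Figure 2: the name checkErr, the message format, and the choice to throw an exception rather than exit are assumptions made for illustration. The stand-in definitions of cl_int and CL_SUCCESS exist only so that the sketch compiles without an OpenCL SDK; with the real OpenCL headers included first, they are skipped.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Stand-ins (assumption): only used when no OpenCL header has already
// defined CL_SUCCESS, so the sketch builds without an OpenCL SDK.
#ifndef CL_SUCCESS
using cl_int = std::int32_t;
#define CL_SUCCESS 0
#endif

// Hypothetical helper: throws if an OpenCL call reported a failure.
// `name` identifies the call that produced `err`.
inline void checkErr(cl_int err, const char* name) {
    assert(name != nullptr);
    if (err != CL_SUCCESS) {
        throw std::runtime_error(std::string("ERROR: ") + name +
                                 " (" + std::to_string(err) + ")");
    }
}
```

In use, the error code returned by a call, or written through its error parameter, would be passed to checkErr together with the name of the call, so that every failure is reported in one place.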

Figure 2

Context

The steps above are preparatory work. The next step in initializing and using OpenCL is to create a context. The rest of the OpenCL work (creating devices and memory, compiling and running programs) is performed within this context. A context can have a number of associated devices (for example, CPU or GPU devices), and, within a context, OpenCL guarantees a relaxed memory consistency between devices. Before creating the context, however, we need to choose a platform from the platform list.

Figure 3: Platform and context

You can pass either CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU to select where the OpenCL code runs, and you can choose whichever platform you prefer from the platform list.

Buffer

Before delving into compute devices, where the real work happens, an OpenCL buffer should be allocated to hold the result of the kernel that will be run on the device. Passing the flag CL_MEM_USE_HOST_PTR when creating the buffer asks OpenCL to use memory already allocated on the host as the buffer's storage.

Figure 4

Devices

In OpenCL, although many operations are performed with respect to a given context, there are also device-specific operations. OpenCL provides the ability to query information about particular objects; in the C++ API this takes the form object.getInfo<CL_OBJECT_QUERY>().

Figure 5

After obtaining the proper device, the kernel file should be loaded and built for this device.

Kernel

The first few lines of the code in Figure 6 simply load the OpenCL device program from disk, convert it to a string, and create a cl::Program::Sources object using the helper constructor. Given an object of type cl::Program::Sources, a cl::Program object is created and associated with a context, then built for a particular set of devices.

Figure 6

EntryPoint

A given program can have many entry points, called kernels. There is assumed to exist a straightforward mapping from kernel names, represented as strings, to a function defined with the __kernel attribute in the compute program.

In this case we build a cl::Kernel object, kernel. Kernel arguments are set using the C++ API with kernel.setArg(), which takes the index and value for the particular argument.

Figure 7

CommandQueue

Each command queue is associated with exactly one device; it is created with the associated context using a call to the constructor of the class cl::CommandQueue. Given a cl::CommandQueue queue, kernels can be queued using queue.enqueueNDRangeKernel. This queues a kernel for execution on the associated device.

Figure 8

Event

The final argument to the enqueueNDRangeKernel call above was a cl::Event object, which can be used to query the status of the command with which it is associated. It supports the method wait(), which blocks until the command has completed. This is required to ensure the kernel has finished execution before reading the result back into host memory with queue.enqueueReadBuffer(). With the compute result back in host memory, it is simply a matter of outputting the result to std::cout and exiting the program.

Figure 9
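As a concrete illustration of the loading step described under Kernel (reading the device program from disk and converting it to a string), the following sketch uses only the C++ standard library. The function name loadKernelSource is hypothetical and not part of the OpenCL API; the resulting string is what would then be handed to the program-source object.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads an entire kernel source file into a string.
// Throws if the file cannot be opened.
std::string loadKernelSource(const std::string& path) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open kernel file: " + path);
    }
    // Copy every byte from the stream buffer into the string.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

Failing early when the file is missing keeps a bad path from surfacing later as a less obvious build error.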

KernelFile

Before showing the code, it is necessary to introduce the OpenCL memory model, which matters when writing the kernel file. OpenCL 1.0 defines four memory spaces: private, local, constant and global. Figure 10 shows a diagram of the memory hierarchy defined by OpenCL.

Private memory can only be used by a single work-item. This is similar to registers in a single compute unit or a single CPU core.

Local memory can be used by the work-items in a work-group. This is similar to the local data share that is available on the current generation of AMD GPUs.

Constant memory stores constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches that are available on AMD GPUs.

Finally, global memory can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs.

Figure 10: Memory model

Overview

We implemented a polynomial multiplication tool that uses the properties of the Discrete Fourier Transform (DFT) to perform the bulk of the work.

DFT-based Polynomial Multiplication

The product of two polynomials (A*B) is normally an O(n^2) operation; by using the DFT, it can be reduced to an O(n log n) operation. This is done by first doubling the size of the polynomials (A, B) by padding them with zeros and transforming them using a DFT operation. The results are then multiplied together (using a point-wise operation), and an inverse DFT is performed, which yields the expected polynomial coefficients. To achieve O(n log n) time for the transformation, the DFT operation is replaced with the more efficient Fast Fourier Transform (FFT).

The FFT relies on the butterfly operation, which determines how elements (or values) are combined during the transformation. The primary difference between our iterative and parallel implementations is the execution of the butterfly operation. In the iterative version each operation is performed one after another, while the parallel version executes each stage of n operations simultaneously.

OpenCL implementation

OpenCL has three major concepts:

Buffers: Memory that can be accessed within an OpenCL execution context
Devices: Commanded to execute code in parallel
Kernels: Code that can be executed (on devices)

We used the following algorithm for our OpenCL implementation:

1. Create random data sets A and B (size = n) and pad with zeros (size = 2n)
2. Load OpenCL Device
3. Load FFT, Point-Wise Multiply and Inverse FFT Kernels
4. Copy A and B to Buffers on OpenCL Device
5. For log2(2n) iterations: execute (n times in parallel) the FFT Kernel on A and B
6. Execute (n times in parallel) the Point-Wise Multiply Kernel on the FFT output
7. Execute (n times in parallel) the Inverse FFT on the Point-Wise Multiply output
8. Copy the Inverse FFT output from Buffers on OpenCL Device

Results

For our tests, we compared the following implementations:

Iterative
Parallel w/ OpenCL (CPU device)
Parallel w/ OpenCL (GPU device)

All tests were conducted on the following hardware/software configuration:

OS: Linux (Fedora Core 15)
CPU: AMD Phenom X4 (2.6 GHz)
RAM: 4 GB
GPU: Radeon 46XX (1 GB video RAM)
Compiler: GCC
OpenCL: AMD APP 2.3
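Returning to the DFT-based multiplication described above, the pipeline (pad, transform, multiply point-wise, inverse transform) can be sketched sequentially in plain C++. This is a sketch under stated assumptions, not the authors' OpenCL kernels: the names butterfly, fft and multiply are illustrative, integer coefficients are assumed, and a radix-2 decimation-in-time FFT is used. The function butterfly does the work one work-item would do for a single index gid, and the inner loop in fft stands in for launching those n butterflies in parallel at each of the log2(2n) stages.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// One butterfly: combines a[i] and a[i + half] for the pair selected by gid.
// `half` is the length of each sub-transform being merged in this stage.
void butterfly(std::vector<cd>& a, std::size_t half, std::size_t gid, bool inverse) {
    const double pi = std::acos(-1.0);
    std::size_t block = gid / half;          // which pair of sub-transforms
    std::size_t j = gid % half;              // position inside the sub-transform
    std::size_t i = block * 2 * half + j;
    double angle = (inverse ? 1.0 : -1.0) * pi * double(j) / double(half);
    cd w = std::polar(1.0, angle);           // twiddle factor
    cd u = a[i];
    cd v = a[i + half] * w;
    a[i] = u + v;
    a[i + half] = u - v;
}

// Reorders the input so the in-place stages below produce natural order.
void bit_reverse(std::vector<cd>& a) {
    std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
}

// In-place FFT; the size must be a power of two.
void fft(std::vector<cd>& a, bool inverse) {
    std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    bit_reverse(a);
    for (std::size_t half = 1; half < n; half *= 2) {      // log2(n) stages
        for (std::size_t gid = 0; gid < n / 2; ++gid) {    // independent butterflies
            butterfly(a, half, gid, inverse);
        }
    }
    if (inverse) {
        for (cd& x : a) x /= double(n);
    }
}

// Multiplies two polynomials given as coefficient lists (lowest degree first).
std::vector<long long> multiply(const std::vector<int>& A, const std::vector<int>& B) {
    if (A.empty() || B.empty()) return {};
    std::size_t result_len = A.size() + B.size() - 1;
    std::size_t n = 1;
    while (n < result_len) n *= 2;                         // zero-padded size
    std::vector<cd> fa(A.begin(), A.end()), fb(B.begin(), B.end());
    fa.resize(n);
    fb.resize(n);
    fft(fa, false);
    fft(fb, false);
    for (std::size_t i = 0; i < n; ++i) fa[i] *= fb[i];    // point-wise multiply
    fft(fa, true);
    std::vector<long long> result(result_len);
    for (std::size_t i = 0; i < result_len; ++i) result[i] = std::llround(fa[i].real());
    return result;
}
```

For example, multiply({1, 2, 3}, {4, 5, 6}) returns the coefficients 4, 13, 28, 27, 18, which is the product of 1 + 2x + 3x^2 and 4 + 5x + 6x^2. Within one stage no two butterflies touch the same pair of elements, which is what allows a parallel version to run them simultaneously.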

To test the FFT, we used a polynomial with 32768 coefficients on an Nvidia GMS 360 device; the result is shown in Figure 11.

Figure 11

To test the polynomial multiplication, we used a coefficient size of 2^4; the result is shown in Figure 12.

Figure 12

Table 1 shows the overall run-time results. Once the input reaches a certain size (2^10), both parallel OpenCL versions become faster than the iterative version, and in the final run they are 60x faster. However, profiling the run-time of specific portions of the algorithms does not clarify why the OpenCL versions need such a large input to pull ahead. While there is some overhead from loading the OpenCL kernel and copying memory to the compute device, the majority of the time is always spent within parallel (FFT) execution. This might be related to some overhead for launching the threads or possibly an inefficient FFT implementation. Whatever the case, it is clear that, given a large enough problem, OpenCL parallelization has the potential to significantly speed up computation times.

Table 1 - Total Run-Time
Input Size (2^n) | Total Run-Time: Iterative | Parallel (CPU) | Parallel (GPU)

Table 2 - Iterative Run-Time Break Down
Input Size (2^n) | FFT | Point-wise Multiplication | Inverse FFT | Total Run-Time

Table 3 - OpenCL (CPU) Timing Break Down
Input Size (2^n) | Kernel Load / Compile Time | Buffer Copy (CPU -> GPU) | Buffer Copy (GPU -> CPU) | FFT | Point-wise Multiply | Inverse FFT | Total Run-Time

Table 4 - OpenCL (GPU) Timing Break Down
Input Size (2^n) | Kernel Load / Compile Time | Buffer Copy (CPU -> GPU) | Buffer Copy (GPU -> CPU) | FFT | Point-wise Multiply | Inverse FFT | Total Run-Time
