GPU vs. CPU Rasterization. James Doverspike

Abstract

Today, almost all personal computers rely on GPUs to achieve real-time rendering of complex 3-D scenes. This paper reevaluates the performance of scan-line rasterization on the GPU and the CPU for modern hardware. It shows that, for a large enough problem size, the graphics card can achieve a 17x speedup.

1. Introduction

The purpose of this project is to evaluate the increase in performance of 3-D computer graphics using the GPU. I seek to show that, though CPUs have gained significant parallelism since the inception of the GPU, the CPU still cannot approach the raw computing power of a graphics card for embarrassingly parallel problems. The independence of the data allows specialized hardware to compute almost the entire problem in parallel. The speedup is measured by comparing the execution time of two different implementations on the same data. I provide reasonably efficient algorithms using CUDA and POSIX Threads that solve the same problem with different concurrent program structures.

Section 2 describes the problem and its inherent parallelism. It also explains how the problem is typically solved in practice. Section 3 explains the CPU implementation and how the parallelism of newer processors can be exploited. Section 4 explains the GPU implementation and the optimizations introduced to maximize performance. In Sections 5 and 6 the performance of each implementation is tested independently on different hardware and for different problem sizes. Finally, Section 7 compares the performance of the two implementations relative to each other.

2. Problem

People want to see high-performance three-dimensional graphics on a two-dimensional display in their homes. In the early 90s, typical desktop computers had only one processing unit: the CPU. Even worse, these early processors were capable of only one thread of computation at a time. This meant that the natural parallelism of computer graphics went unused by that generation's hardware.
Hence, the Graphics Processing Unit (GPU) was born. The GPU was originally a custom card designed specifically to do graphical computations as fast as possible. A pipeline for doing extremely specific computations was built into the card, and the card was controlled by extremely specific function calls. Programmers began to realize that the impressive computing power of GPUs could be used for other problems. Just as the pipeline was constructed over time to solve one specific problem as fast as possible, it was slowly disassembled to allow nonspecific problems to use the GPU's power. The result of this process for NVIDIA graphics cards is CUDA. Ironically, this project uses the generalized hardware and programming interface to solve the very problem for which the rigid pipeline of hardware and software was originally designed.

But the GPU architecture is not all that has changed since graphics cards became mainstream: CPUs have also adopted parallel processing. Yet mainstream graphics problems have rarely been solved on the CPU since the era of DOOM. Today, a typical desktop built for gaming has a graphics card and a dual-core or quad-core processor, but only the graphics card is used for graphical processing. Therefore, this project also attempts to get reasonable performance out of rasterization on the CPU. Reasonable performance is around 16 ms per frame, or about 60 FPS.

3. CPU Implementation

The CPU rasterizer uses POSIX threads for portable threading. Each thread executes the entire graphics pipeline on its share of the data. The number of threads is specified at the command line, so that you can choose the best number for your processing environment. The computation is divided evenly among the threads, and the threads share the data structures that contain the scene information.

Atomicity is not a problem for the scene data because the computations are independent. Each thread reads its portion of the scene data once, then writes back the transformed information once. It is a problem for the z-buffer and the frame buffer, however, because the depth test for a pixel depends on depths that other threads may be writing at the same time. Thread safety is omitted from both implementations because of time constraints.

As opposed to the SIMD design of the GPU implementation, the CPU implementation follows the fork/join design. The main thread forks the worker threads once the scene has been set up, then joins them upon completion. There needs to be a synchronization point between the vertex transformation phase and the triangle rasterization phase, because rasterization depends on the new vertex information. This is implemented with a pthreads barrier that all of the spawned threads share.

The measurements were conducted on an Intel Core 2 Quad Q6600 (quad-core, 2.4 GHz) processor. Four threads were used in every measurement because tests showed that this was the optimal number for this specific processor.

4. GPU Implementation

The GPU rasterizer has two kernel invocations. The first kernel transforms the vertices from scene space to screen space. The second kernel rasterizes all the triangles in the scene using the new vertex information. There needs to be a synchronization point between the vertex phase and the rasterize phase. This is implemented naturally by dividing the computation into two kernels, vertextransform and rasterize.
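The fork/join structure of Section 3, with a barrier between the two phases, can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the project's code: the Scene struct, the worker function, run_pipeline, and the stand-in work (scaling each vertex value by 2, then summing a mirrored slice that other threads produced) are all hypothetical. Only the pthread calls are real API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes pthread barriers under strict -std= modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared scene data; the real rasterizer stores full vertices. */
typedef struct {
    float *verts;              /* written in phase 1, read in phase 2 */
    float *partial;            /* one phase-2 result per thread */
    int num_verts;
    int num_threads;
    pthread_barrier_t barrier; /* shared by all spawned threads */
} Scene;

typedef struct {
    Scene *scene;
    int id;
} WorkerArg;

static void *worker(void *p)
{
    WorkerArg *arg = p;
    Scene *s = arg->scene;

    /* Divide the data evenly: each thread owns one contiguous slice. */
    int chunk = (s->num_verts + s->num_threads - 1) / s->num_threads;
    int begin = arg->id * chunk;
    int end = begin + chunk;
    if (end > s->num_verts)
        end = s->num_verts;

    /* Phase 1: stand-in for the vertex transformation. */
    for (int i = begin; i < end; i++)
        s->verts[i] *= 2.0f;

    /* No thread starts phase 2 until every thread has finished phase 1. */
    pthread_barrier_wait(&s->barrier);

    /* Phase 2: stand-in for rasterization. It reads the mirrored slice,
       that is, values that other threads transformed. */
    float sum = 0.0f;
    for (int i = begin; i < end; i++)
        sum += s->verts[s->num_verts - 1 - i];
    s->partial[arg->id] = sum;
    return NULL;
}

/* Fork the workers, join them, and return the combined phase-2 result. */
float run_pipeline(int num_verts, int num_threads)
{
    assert(num_verts > 0 && num_threads > 0);

    Scene s;
    s.num_verts = num_verts;
    s.num_threads = num_threads;
    s.verts = malloc((size_t)num_verts * sizeof *s.verts);
    s.partial = calloc((size_t)num_threads, sizeof *s.partial);
    pthread_t *tids = malloc((size_t)num_threads * sizeof *tids);
    WorkerArg *args = malloc((size_t)num_threads * sizeof *args);
    if (!s.verts || !s.partial || !tids || !args)
        abort();

    for (int i = 0; i < num_verts; i++)
        s.verts[i] = 1.0f;
    if (pthread_barrier_init(&s.barrier, NULL, (unsigned)num_threads) != 0)
        abort();

    for (int t = 0; t < num_threads; t++) { /* fork */
        args[t].scene = &s;
        args[t].id = t;
        if (pthread_create(&tids[t], NULL, worker, &args[t]) != 0)
            abort();
    }

    float total = 0.0f;
    for (int t = 0; t < num_threads; t++) { /* join */
        pthread_join(tids[t], NULL);
        total += s.partial[t];
    }

    pthread_barrier_destroy(&s.barrier);
    free(args);
    free(tids);
    free(s.partial);
    free(s.verts);
    return total;
}
```

Because every vertex value starts at 1 and is doubled in phase 1, run_pipeline(n, t) returns 2n only if phase 2 sees fully transformed data. Without the barrier, a fast thread could read values that another thread has not yet scaled.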
Splitting the work into two kernels also provides a more natural data decomposition, because the number of vertices and the number of triangles differ. Thus, one thread is created for each vertex in the first kernel and one thread is created for each triangle in the second. The number of thread blocks is computed from the total number of triangles divided by the number of threads per block, which is fixed. Tests have shown that 32 threads per block gives the best performance for this environment.

For the first frame, the CPU has to transfer the vertex, triangle, and normal information to the card. It also needs to allocate space on the card for the frame-buffer and the z-buffer. The frame-buffer is an array of unsigned ints that stores the pixel information. When the computation is completed, that array becomes the output to the screen. The z-buffer stores the distance of each pixel from the screen. This is required when objects overlap, because the closest object should be displayed.

The second frame of the scene is computed significantly faster because all of the memory is already allocated and on the card. In fact, only the frame-buffer and z-buffer must be cleared to draw the next frame. Hence, the time it takes to compute the second frame is our metric. Every frame after the second takes the same amount of time; only the first frame carries the heavy load of setting up the scene. It is acceptable to have a load time before the first frame, as long as every other frame is rendered quickly.

The measurements were conducted on an NVIDIA GeForce 8800 GT (600 MHz) with 512 MB of RAM.

5. CPU Performance

The CPU shows predictable speedup. It increases as the number of threads approaches the number of cores on the processor, then levels off and slowly decreases. The algorithm scales up reasonably well compared with the theoretical constant value of 1. If the number of processing cores were actually increasing rather than the number of threads, we would see a flat graph. Instead we get a falling slope that lies entirely below the ideal.

[Figure: CPU Speedup. Axes: Threads, Time (ms)]

[Figure: CPU Scaleup. Axes: Threads, Time (ms)]

You can see the execution time of the four different sized models in the table below. For several thousand triangles the CPU achieves a good framerate. If anything, this at least shows that modern processors can handle older games with fewer than 50,000 triangles, which, at the time, required a graphics card. Nevertheless, this performance is well below that of an equivalent GPU. There is no clear bottleneck in the CPU implementation like there is in the GPU implementation, but most of the work is done within the rasterization loops.

[Table: performance of the CPU rasterizer. Columns: Model, Triangles, Time (ms). Rows: Dragon, AK x, AK x, AK x]

You can see that by the second model the CPU rasterizer is already slower than our goal. The AK x9 model would run at about 5 FPS, which is unreasonably slow. The execution times on the GPU cluster for the same four models were 5, 23, 46, and 166 ms.

6. GPU Performance

The GPU's speedup is much harder to measure because of the constraints CUDA imposes. The architecture was designed to operate above a certain threshold of concurrency, so measuring below that threshold is clunky and unhelpful. Either way, this environment only allows a minimum of 16 threads per thread block.

The timing of the computation shows that, as expected, the bottleneck is at the interface between the device and main memory. With this in mind, the device is assumed to be infinitely fast. In almost all cases the kernels executed in under one millisecond. Almost all of the execution time came from transferring data onto the card with cudaMemcpy().

Below you can see the execution time of four different sized models.
The First column indicates the execution time of the first frame, which includes the extra data allocation work.

[Table: 16 threads per block. Columns: Model, Triangles, Time (ms), First (ms). Rows: Dragon, AK x, AK x, AK x]

[Table: threads per block. Columns: Model, Triangles, Time (ms), First (ms). Rows: Dragon, AK x, AK x, AK x]

The only decision branches in the GPU code are the two for loops that iterate over the x and y pixels of a triangle. All of the other branches were eliminated using boolean statements. Below is a table of the execution times of the rasterizer without the if statements replaced. These times were achieved with 16 threads per block. Surprisingly, they are close enough to the optimized version that we can say there is no improvement. This is likely because the threads execute so fast that, at this level of complexity, the difference is not yet noticeable.

[Table: execution times with the if statements retained, 16 threads per block. Columns: Model, Time (ms), First (ms). Rows: Dragon, 4, 7; AK x; AK x; AK x]

7. Performance Comparison

Both implementations increase in execution time linearly as the problem size increases. Note that this is not scaleup, but a graph of optimal performance by the size of the problem. In the graph below, the orange line represents the GPU's performance and the blue line represents the CPU's performance. As you can see, the GPU is significantly faster than the CPU. As predicted, the CPU can only compare when the problem size is a few thousand triangles; after that, the GPU takes off and doesn't stop.

The linearity of the GPU graph is a testament to the computing philosophy of CUDA. It shows that the hundreds of stream processors on the device are designed in such a way that they achieve maximum efficiency when there are millions of threads to run.

There is a distinction between the threading of the two implementations. As more triangles are added, each CPU thread does more work. On the GPU, by contrast, more threads are spawned: each CUDA thread rasterizes one triangle, independent of the total number of triangles. Adding threads causes no noticeable slowdown from overhead because CUDA threads are so lightweight.

[Figure: CPU vs GPU Performance. Axes: Triangles, Time (ms)]

The GPU rasterizes one million triangles many times faster than the CPU (the abstract reports a 17x speedup for large problem sizes).
You can see the speedup between the two implementations in the speedup table below. For this type of computation, the 112 stream processors overwhelm the 4 CPU cores. The massive speedup is clearly worth the extra hardware. Trying to solve this problem with processors that are not as parallel as a GPU is a waste of resources.
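To make the per-triangle work more concrete, here is a minimal C sketch of what one thread does for one triangle, written in the branch-reduced style described in Section 6: the two for loops over the triangle's x and y pixels remain, and the coverage and depth tests inside them are expressed as boolean values instead of if statements. It is an illustration under assumptions, not the project's kernel. The names (Vec3, edge, clear_buffers, rasterize_triangle), the 16x16 buffer size, the finite FAR_DEPTH value, and the use of edge functions over a bounding box are all hypothetical choices, and flat color stands in for Gouraud shading. Like both implementations in this paper, it ignores thread safety for the z-buffer.

```c
#include <assert.h>
#include <stdint.h>

#define WIDTH 16
#define HEIGHT 16
#define FAR_DEPTH 1000.0f /* finite "empty" depth; the blend below needs finite values */

typedef struct { float x, y, z; } Vec3; /* screen-space vertex */

/* Signed area term: its sign tells which side of edge a->b the point is on. */
static float edge(Vec3 a, Vec3 b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

static float min3(float a, float b, float c)
{
    float m = a < b ? a : b;
    return m < c ? m : c;
}

static float max3(float a, float b, float c)
{
    float m = a > b ? a : b;
    return m > c ? m : c;
}

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Reset the frame-buffer to black and the z-buffer to the far depth. */
void clear_buffers(uint32_t *frame, float *zbuf)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        frame[i] = 0u;
        zbuf[i] = FAR_DEPTH;
    }
}

/* Rasterize one triangle; returns the number of pixels actually written. */
int rasterize_triangle(Vec3 v0, Vec3 v1, Vec3 v2, uint32_t color,
                       uint32_t *frame, float *zbuf)
{
    assert(frame && zbuf);
    float area = edge(v0, v1, v2.x, v2.y);
    if (area == 0.0f)
        return 0; /* degenerate triangle: nothing to draw */

    /* Conservative bounding box, clamped to the screen. */
    int min_x = clampi((int)min3(v0.x, v1.x, v2.x), 0, WIDTH - 1);
    int max_x = clampi((int)max3(v0.x, v1.x, v2.x) + 1, 0, WIDTH - 1);
    int min_y = clampi((int)min3(v0.y, v1.y, v2.y), 0, HEIGHT - 1);
    int max_y = clampi((int)max3(v0.y, v1.y, v2.y) + 1, 0, HEIGHT - 1);

    int written = 0;
    for (int y = min_y; y <= max_y; y++) {
        for (int x = min_x; x <= max_x; x++) {
            float px = (float)x + 0.5f, py = (float)y + 0.5f; /* pixel center */

            /* Barycentric weights; dividing by area makes them
               independent of the triangle's winding order. */
            float w0 = edge(v1, v2, px, py) / area;
            float w1 = edge(v2, v0, px, py) / area;
            float w2 = edge(v0, v1, px, py) / area;
            float z = w0 * v0.z + w1 * v1.z + w2 * v2.z;
            int idx = y * WIDTH + x;

            /* Coverage and depth tests as 0/1 values, not if statements. */
            int inside = (w0 >= 0.0f) & (w1 >= 0.0f) & (w2 >= 0.0f);
            int visible = inside & (z < zbuf[idx]);

            /* Select new or old values arithmetically. */
            uint32_t mask = 0u - (uint32_t)visible; /* all ones or all zeros */
            float keep = (float)visible;
            frame[idx] = (color & mask) | (frame[idx] & ~mask);
            zbuf[idx] = keep * z + (1.0f - keep) * zbuf[idx];
            written += visible;
        }
    }
    return written;
}
```

On the 16x16 buffer, the triangle (0,0), (8,0), (0,8) covers 36 pixel centers. Drawing the same footprint again at a greater depth writes nothing, because the depth test fails for every pixel.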

[Table: Speedup by problem size. Columns: Triangles, Speedup]

I was able to reach my target of one million triangles at 60 FPS with the GPU implementation. The CPU implementation, however, could only stay above 60 FPS for the smallest model.

The algorithm for both implementations includes Gouraud shading from a single light source. It does not include the work of texture mapping or more advanced lighting, such as bump mapping or the Phong illumination model. However, a typical scene in a modern video game contains far fewer than one million triangles. The extra triangles were added to compensate for the simplicity of the rendering model. Clearly this is enough computing power to handle the complex scenes of today's video games.

8. Future Work

This is an ongoing project. The ultimate goal is to have a real-time engine for viewing a 3-D scene. Several optimizations (which are noted in comments in the code) will eventually be implemented, as well as features for saving the rendering as an image file. The next big step towards this project's completion is texture mapping, which was left out due to time constraints. After that, more advanced shading and lighting techniques will be added. To allow movement in the scene, SDL will be used to detect keyboard and mouse input so that the scene can be examined in real time.

Finally, the project will be released as open source and uploaded as a tutorial on the internet. It was nigh-impossible to find a document that properly explained rasterization. If anything, this project will serve as an example for future graphics programmers to understand the concepts behind OpenGL and DirectX.