Hardware-Aware Analysis and Optimization of Stable Fluids
Presentation Date: Sep 15th, 2009
Chrissie C. Cui
Outline
- Introduction
- Highlights
- Flop and Bandwidth Analysis
- Mehrstellen Schemes
- Advection Caching
- Conclusions and Future Work
Introduction
- Implement Jos Stam's Stable Fluids simulation algorithm on CPU, GPU, and Cell
- Detailed flop and bandwidth analysis for each computational stage and each implementation
- Propose new schemes to solve the processor-idle problem and the performance loss caused by random memory access
Highlights
- Stam's fluid algorithm is bandwidth-bound: the cores sit idle up to 96% of the time!
- Detailed flop and bandwidth analysis of each computational stage
- Make use of otherwise-idle processors by using higher-order Mehrstellen methods
- Adopt a simple static caching scheme to beat the performance barrier in the semi-Lagrangian advection step (over 99% hit rate, which is very impressive)
Flops & Bandwidth Analysis -Overview Pre-assumptions: Ideal cache with maximum data reuse Three rows of the computational grid fit in cache Hardware: CPU: SSE4 -Intel Xeon 5100 (Woodcrest) GPU: Nvidia Geforce 8800 Ultra, Geforce 7900 Cell: IBM Cell QS20 Code version: CPU: Stam s open source implementation GPU: Harris implementation Cell: Theodore Kim implementation Multiply add is counted as one operation on all the architectures. The following example is in 2D. The 3D results could be obtained by applying the same analysis
Flops & Bandwidth Analysis Add Source Source Code: Analysis: Analysis: Within each loop: 2 loads + 1 store 2 Scalar components of velocity and the single density field Flops: 3N 2 Bandwidth: 9N 2
Flops & Bandwidth Analysis -Diffusion Source Code: Analysis: Perform I iterations on each grid cell 1 store for x[i][j], 1 store for x0[i][j] Under ideal cache assumption: 1 load only for x[i][j+1] 3 adds and 1 multiply-add in each iteration Flops: (3+ 12I) N 2 Bandwidth: (9+ 9I) N 2
Flops & Bandwidth Analysis Projection I Three sub-stages: divergence computation, pressure computation, final projection Divergence Computation: Source Code Analysis Analysis 1 store for div[i][j], 1 load for u, 1 load for new row of v 2 minus, 1 add, 1 multiply Flops: 3N 2 Bandwidth:4N 2 Pressure Computation: Computed by a linear solver The same as Divergence Computation Flops: 3N 2 Bandwidth: 4N 2
Flops & Bandwidth Analysis Projection II Final Projection Source Code Analysis Loads and stores for u and v Loads for p could be amortized into a single load 1 minus, 1 multiply add per line Flops: 5N 2 Bandwidth: 4N 2 Sum up: Flops: (8 + 4I)N 2 Bandwidth: (8+3I)N 2
Flops & Bandwidth Analysis Advection I Three steps: backtrace, grid index computation and interpolation Backtrace Source Code Analysis 1 multiply add for each line Loads from u and v Flops: 2N 2 Bandwidth: 2N 2 Grid Index Computation Source Code
Flops & Bandwidth Analysis Advection II Grid Index Computation: Analysis: The If statements can be stated as ternaries (0 Flops) and emulate a floor function, 1 flop for each Local variable computation Flops: 4N 2 Bandwidth: 0 Interpolation Two steps: weight computation and interpolation computation
Flops & Bandwidth Analysis Advection III Interpolation Weights Computation Source Code Flops: 4N 2 Bandwidth: 0 Interpolation Computation Source Code Analysis 1 load for d[i][j] No amortize for the loads of d0 (unpredictable access pattern) With multiply-add 6 flops Flops: 6N 2 Bandwidth: 5N 2
Flops & Bandwidth Analysis -Summary To sum up 2D case Flops : (56 + 16I)N 2 Bandwidth: (38 + 12I)N 2 3D case (Extended from 2D) Flops: (106 + 30I)N 3 Bandwidth: (71+15I)N 3
Peak Performance Estimate I
Hardware specifications:
- CPU: Intel Xeon 5100, two cores at 3 GHz, each dispatching a 4-float SIMD instruction per clock cycle. Peak performance: 24 GFlops/s. Peak memory bandwidth: 10.66 GB/s
- GPU: Nvidia GeForce 8800 Ultra, 128 scalar cores at 1.5 GHz. Peak performance: 192 GFlops/s. Peak memory bandwidth: 103.7 GB/s
- Cell: IBM QS20 Cell blade, two Cell chips at 3.2 GHz, 8 Synergistic Processing Elements (SPEs) per chip, each dispatching 4-float SIMD instructions every clock cycle. Peak performance: 204.8 GFlops/s. Peak memory bandwidth: 25.6 GB/s
Performance is then estimated from the derived equations (Table 1 on the next page).
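A sketch of how the Table 1 entries follow from the per-frame counts, shown for 2D (the 3D case is analogous with the N³ formulas). The 4-byte-per-word factor is an assumption here; the reported estimate is the lesser of the two:

$$\mathrm{FPS}_{\text{flop}} = \frac{P_{\text{peak}}}{(56+16I)\,N^2}, \qquad \mathrm{FPS}_{\text{bw}} = \frac{B_{\text{peak}}}{4\,(38+12I)\,N^2}$$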
Peak Performance Estimate II
Table 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak performance is estimated for each architecture assuming the computation is compute-bound (i.e. infinite bandwidth is available) and bandwidth-bound (i.e. infinite flops are available). The lesser of these two quantities is the more realistic estimate. In all cases, the algorithm is bandwidth-bound.
The ratio of computation speed to data-arrival speed:
- 2D: CPU 6.65x faster, GPU 5.47x faster, Cell 23.66x faster
- 3D: CPU 4.47x faster, GPU 3.89x faster, Cell 16.8x faster
Processor idle rate:
- 2D: CPU 85%, GPU 82%, Cell 96%
- 3D: CPU 79%, GPU 74%, Cell 94%
Peak Performance Estimate III
Arithmetic intensity: what happens as I (the iteration count) goes to infinity? See the limits below.
A reasonable interpretation: algorithms run well on the Cell and GPU when their arithmetic intensities are much greater than one. Since both the 2D and 3D limits are close to one, the available flops will be underutilized.
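Taking the limit of the flop-to-bandwidth ratio (in flops per word moved) from the summary slide:

$$\lim_{I\to\infty}\frac{(56+16I)N^2}{(38+12I)N^2}=\frac{16}{12}\approx 1.33 \;\;\text{(2D)}, \qquad \lim_{I\to\infty}\frac{(106+30I)N^3}{(71+15I)N^3}=\frac{30}{15}=2 \;\;\text{(3D)}$$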
Frame Rate Performance Measurement
Table 2: Theoretical peak frames per second (the bandwidth-bound values from Table 1) and actual measured frames per second. None of the measured rates exceed the predicted theoretical peaks, validating the finding that the algorithm is bandwidth-bound. A GeForce 7900 was used for the 16-bit timings because its frame rates were uniformly superior to the 8800's.
Findings:
- The predicted theoretical peaks were never exceeded, providing additional evidence that the algorithm is bandwidth-bound.
- On both the GPU and the Cell, the theoretical peak is approached more closely as the resolution increases (larger coherent loads).
Mehrstellen Schemes - Background
Poisson solver for the diffusion and projection stages: the equation, its discretized version, and the matrix form (reconstructed below).
Going from second order to fourth order accuracy reduces the number of iterations.
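The equations on this slide were lost in extraction; a reconstruction of the standard second-order treatment they refer to:

$$\nabla^2 p = b \;\;\longrightarrow\;\; \frac{p_{i-1,j}+p_{i+1,j}+p_{i,j-1}+p_{i,j+1}-4p_{i,j}}{h^2}=b_{i,j} \;\;\longrightarrow\;\; A\mathbf{p}=\mathbf{b}$$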
Mehrstellen Schemes - Details
An alternate discretization that increases the accuracy from second to fourth order without significantly increasing the complexity of the memory access pattern. The 2D and 3D stencils appeared here as figures; a reconstruction of the 2D stencil follows.
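A reconstruction of the lost 2D figure: the standard fourth-order compact (Mehrstellen) nine-point stencil, with the sums running over the four edge and four corner neighbors:

$$\frac{1}{6h^2}\Big[\,4\!\!\sum_{\text{edge}}p \;+\; \sum_{\text{corner}}p \;-\; 20\,p_{i,j}\Big] = b_{i,j} + \frac{h^2}{12}\,\nabla^2 b_{i,j}$$

The stencil stays within the immediate 3x3 neighborhood, which is why the memory access pattern is barely more complex than the five-point case; the 3D version uses the analogous compact 19-point stencil.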
Mehrstellen Schemes - Results I
Spectral radius of the resultant matrix: the error of the current solution is multiplied by the spectral radius of the Jacobi matrix every iteration.
Expectation: if the Mehrstellen radius is significantly smaller than that of the second-order discretization, then fewer Jacobi iterations are needed overall.
Quantities compared (Table 3):
- The spectral radius of Jacobi iteration using the Mehrstellen matrix
- The equivalent radius of the standard Jacobi matrix
- The number of iterations it would take Mehrstellen Jacobi to achieve an error reduction equivalent to 20 iterations of standard Jacobi (derivation below)
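Since the error shrinks by the spectral radius each iteration, the equivalent iteration count in the third column of Table 3 follows from a one-line derivation:

$$\rho_M^{\,n} \le \rho_S^{\,20} \;\;\Longrightarrow\;\; n = \Big\lceil 20\,\frac{\ln \rho_S}{\ln \rho_M}\Big\rceil$$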
Mehrstellen Schemes - Results II
Table 3: Spectral radii of the fourth-order accurate Mehrstellen Jacobi matrix (M) and the standard second-order accurate Jacobi matrix (S). The third column gives the number of Mehrstellen iterations necessary to match the error reduction of 20 standard iterations. The last column gives the fraction of a Mehrstellen iteration necessary to match the error reduction of one standard iteration.
Advection Caching - Scheme
Physical characteristics: reasons to expect that the majority of the velocity field exhibits high spatial locality:
- The time-step size in practice is quite small
- The projection and diffusion operators smear out the velocity field
- Large velocities quickly dissipate into smaller ones in both space and time
Exploiting this, assume that most advection rays terminate in regions very close to their origins.
Static caching scheme (two-part approach; sketch below):
- Prefetch rows j-1, j, and j+1 from the d0 array
- While iterating over the elements of row j, first check whether the semi-Lagrangian ray terminated in a 3x3 neighborhood of its origin. If so, use the prefetched d0 values for the interpolation; otherwise, perform the more expensive fetch from main memory.
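A hypothetical sketch of the static caching scheme, assuming a row-major d0 and 4-byte floats. On the Cell the memcpy would be a DMA into SPE local store; the check here tests rows only, since full rows are prefetched (the paper describes it as a 3x3 neighborhood test). Names such as advect_cached and rows are illustrative:

```c
#include <string.h>

#define IX(i,j) ((i)+(N+2)*(j))

void advect_cached(int N, float *d, const float *d0,
                   const float *u, const float *v, float dt)
{
    float dt0 = dt * N;
    for (int j = 1; j <= N; j++) {
        /* prefetch rows j-1, j, j+1 of d0 into a local buffer */
        float rows[3 * (N + 2)];                /* C99 VLA for brevity */
        memcpy(rows, &d0[IX(0, j - 1)], sizeof(float) * 3 * (N + 2));

        for (int i = 1; i <= N; i++) {
            /* backtrace + clamp + indices, exactly as in advect() */
            float x = i - dt0 * u[IX(i,j)];
            float y = j - dt0 * v[IX(i,j)];
            x = (x < 0.5f) ? 0.5f : ((x > N + 0.5f) ? N + 0.5f : x);
            y = (y < 0.5f) ? 0.5f : ((y > N + 0.5f) ? N + 0.5f : y);
            int i0 = (int)x, i1 = i0 + 1;
            int j0 = (int)y, j1 = j0 + 1;
            float s1 = x - i0, s0 = 1 - s1;
            float t1 = y - j0, t0 = 1 - t1;

            const float *src = d0;
            int off = 0;
            if (j0 >= j - 1 && j1 <= j + 1) {   /* ray ended nearby? */
                src = rows;                     /* hit: local rows */
                off = -(j - 1);                 /* remap to buffer rows */
            }                                   /* else: main memory */
            d[IX(i,j)] =
                s0 * (t0 * src[IX(i0,j0+off)] + t1 * src[IX(i0,j1+off)]) +
                s1 * (t0 * src[IX(i1,j0+off)] + t1 * src[IX(i1,j1+off)]);
        }
    }
}
```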
Advection Caching - Tests
Two test scenes:
- 2D: eight jets of velocity and density injected into a 512² simulation at different points and in different directions, to induce a wide variety of directions in the velocity field
- 3D: a buoyant pocket of smoke continually inserted into a 64³ simulation
Cache miss rate:
- 2D: the miss rate never exceeds 0.65%
- 3D: the miss rate never exceeds 0.44%
Bandwidth test for the 2D scene on the Cell: bandwidth achieved by the advection stage with and without the static cache.
Conclusion & Future Work Adetailed flop and bandwidth analysisof the implementation of Stable Fluids on current CPU, GPU and Cell architectures. Prove theoretically and experimentally that the performance of the algorithm is bandwidth-bound Proposed the use of Mehrstellendiscretizationto reduce the # of iterations in Jacobi solver to reduce processor idle rate This scheme allows the linear solver to terminate 17% earlier in 2D, and 33% earlier in 3D. Designed a static caching scheme for the advection stage that makes more effective use of the available memory bandwidth. 2x speedup is measured in the advection stage using this scheme on the Cell. Map algorithms that handle free surface cases to parallel architecture and do corresponding performance analysis Develop Mehrstellen discretizations like scheme for PCG solvers
Thanks for your attention. Questions?