Hardware-Aware Analysis and Optimization of Stable Fluids
Presentation Date: Sep 15th, 2009
Chrissie C. Cui
Outline
- Introduction
- Highlights
- Flop and Bandwidth Analysis
- Mehrstellen Schemes
- Advection Caching
- Conclusions and Future Work
Introduction
- Implement Jos Stam's Stable Fluids simulation algorithm on CPU, GPU, and Cell
- Detailed flop and bandwidth analysis for each computational stage and each implementation
- Propose new schemes to solve the processor-idle problem and the performance loss caused by random memory access
Highlights
- Stam's fluid algorithm is bandwidth-bound: the cores sit idle up to 96% of the time!
- Detailed flop and bandwidth analysis of each computational stage
- Make use of otherwise-idle processors by using higher-order Mehrstellen methods
- Adopt a simple static caching scheme to beat the performance barrier in the semi-Lagrangian advection step (over 99% hit rate, which is very impressive)
Flops & Bandwidth Analysis -Overview Pre-assumptions: Ideal cache with maximum data reuse Three rows of the computational grid fit in cache Hardware: CPU: SSE4 -Intel Xeon 5100 (Woodcrest) GPU: Nvidia Geforce 8800 Ultra, Geforce 7900 Cell: IBM Cell QS20 Code version: CPU: Stam s open source implementation GPU: Harris implementation Cell: Theodore Kim implementation Multiply add is counted as one operation on all the architectures. The following example is in 2D. The 3D results could be obtained by applying the same analysis
Flops & Bandwidth Analysis Add Source Source Code: Analysis: Analysis: Within each loop: 2 loads + 1 store 2 Scalar components of velocity and the single density field Flops: 3N 2 Bandwidth: 9N 2
Flops & Bandwidth Analysis -Diffusion Source Code: Analysis: Perform I iterations on each grid cell 1 store for x[i][j], 1 store for x0[i][j] Under ideal cache assumption: 1 load only for x[i][j+1] 3 adds and 1 multiply-add in each iteration Flops: (3+ 12I) N 2 Bandwidth: (9+ 9I) N 2
Flops & Bandwidth Analysis Projection I Three sub-stages: divergence computation, pressure computation, final projection Divergence Computation: Source Code Analysis Analysis 1 store for div[i][j], 1 load for u, 1 load for new row of v 2 minus, 1 add, 1 multiply Flops: 3N 2 Bandwidth:4N 2 Pressure Computation: Computed by a linear solver The same as Divergence Computation Flops: 3N 2 Bandwidth: 4N 2
Flops & Bandwidth Analysis Projection II Final Projection Source Code Analysis Loads and stores for u and v Loads for p could be amortized into a single load 1 minus, 1 multiply add per line Flops: 5N 2 Bandwidth: 4N 2 Sum up: Flops: (8 + 4I)N 2 Bandwidth: (8+3I)N 2
Flops & Bandwidth Analysis Advection I Three steps: backtrace, grid index computation and interpolation Backtrace Source Code Analysis 1 multiply add for each line Loads from u and v Flops: 2N 2 Bandwidth: 2N 2 Grid Index Computation Source Code
Flops & Bandwidth Analysis Advection II Grid Index Computation: Analysis: The If statements can be stated as ternaries (0 Flops) and emulate a floor function, 1 flop for each Local variable computation Flops: 4N 2 Bandwidth: 0 Interpolation Two steps: weight computation and interpolation computation
Flops & Bandwidth Analysis Advection III Interpolation Weights Computation Source Code Flops: 4N 2 Bandwidth: 0 Interpolation Computation Source Code Analysis 1 load for d[i][j] No amortize for the loads of d0 (unpredictable access pattern) With multiply-add 6 flops Flops: 6N 2 Bandwidth: 5N 2
Flops & Bandwidth Analysis -Summary To sum up 2D case Flops : (56 + 16I)N 2 Bandwidth: (38 + 12I)N 2 3D case (Extended from 2D) Flops: (106 + 30I)N 3 Bandwidth: (71+15I)N 3
Peak Performance Estimate I
Hardware specifications:
- CPU: Intel Xeon 5100, two cores at 3 GHz, each dispatching a 4-float SIMD instruction per clock cycle. Peak performance: 24 GFlops/s. Peak memory bandwidth: 10.66 GB/s
- GPU: Nvidia GeForce 8800 Ultra, 128 scalar cores at 1.5 GHz. Peak performance: 192 GFlops/s. Peak memory bandwidth: 103.7 GB/s
- Cell: IBM QS20 Cell blade, two Cell chips at 3.2 GHz, 8 Synergistic Processing Elements (SPEs) per chip, each dispatching 4-float SIMD instructions every clock cycle. Peak performance: 204.8 GFlops/s. Peak memory bandwidth: 25.6 GB/s
Performance is then estimated from the derived equations (Table 1 on the next page).
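A sketch of how the Table 1 entries follow from the per-frame counts, shown for 2D (the 3D case is analogous with the N³ formulas). The 4-byte-per-word factor is an assumption here; the reported estimate is the lesser of the two:

$$\mathrm{FPS}_{\text{flop}} = \frac{P_{\text{peak}}}{(56+16I)\,N^2}, \qquad \mathrm{FPS}_{\text{bw}} = \frac{B_{\text{peak}}}{4\,(38+12I)\,N^2}$$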
Peak Performance Estimate II
Table 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak performance is estimated for each architecture assuming the computation is compute-bound (i.e. infinite bandwidth is available) and bandwidth-bound (i.e. infinite flops are available). The lesser of these two quantities is the more realistic estimate. In all cases, the algorithm is bandwidth-bound.
The ratio of computation speed to data-arrival speed:
- 2D: CPU 6.65x faster, GPU 5.47x faster, Cell 23.66x faster
- 3D: CPU 4.47x faster, GPU 3.89x faster, Cell 16.8x faster
Processor idle rate:
- 2D: CPU 85%, GPU 82%, Cell 96%
- 3D: CPU 79%, GPU 74%, Cell 94%
Peak Performance Estimate III
Arithmetic intensity: what happens as I (the iteration count) goes to infinity? See the limits below.
A reasonable interpretation: algorithms run well on the Cell and GPU when their arithmetic intensities are much greater than one. Since both the 2D and 3D limits are close to one, the available flops will be underutilized.
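Taking the limit of the flop-to-bandwidth ratio (in flops per word moved) from the summary slide:

$$\lim_{I\to\infty}\frac{(56+16I)N^2}{(38+12I)N^2}=\frac{16}{12}\approx 1.33 \;\;\text{(2D)}, \qquad \lim_{I\to\infty}\frac{(106+30I)N^3}{(71+15I)N^3}=\frac{30}{15}=2 \;\;\text{(3D)}$$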
Frame Rate Performance Measurement
Table 2: Theoretical peak frames per second (the bandwidth-bound values from Table 1) and actual measured frames per second. None of the measured rates exceed the predicted theoretical peaks, validating the finding that the algorithm is bandwidth-bound. A GeForce 7900 was used for the 16-bit timings because its frame rates were uniformly superior to the 8800's.
Findings:
- The predicted theoretical peaks were never exceeded, providing additional evidence that the algorithm is bandwidth-bound.
- On both the GPU and the Cell, the theoretical peak is approached more closely as the resolution increases (larger coherent loads).
Mehrstellen Schemes - Background
Poisson solver for the diffusion and projection stages: the equation, its discretized version, and the matrix form (reconstructed below).
Going from second order to fourth order accuracy reduces the number of iterations.
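The equations on this slide were lost in extraction; a reconstruction of the standard second-order treatment they refer to:

$$\nabla^2 p = b \;\;\longrightarrow\;\; \frac{p_{i-1,j}+p_{i+1,j}+p_{i,j-1}+p_{i,j+1}-4p_{i,j}}{h^2}=b_{i,j} \;\;\longrightarrow\;\; A\mathbf{p}=\mathbf{b}$$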
Mehrstellen Schemes - Details
An alternate discretization that increases the accuracy from second to fourth order without significantly increasing the complexity of the memory access pattern. The 2D and 3D stencils appeared here as figures; a reconstruction of the 2D stencil follows.
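A reconstruction of the lost 2D figure: the standard fourth-order compact (Mehrstellen) nine-point stencil, with the sums running over the four edge and four corner neighbors:

$$\frac{1}{6h^2}\Big[\,4\!\!\sum_{\text{edge}}p \;+\; \sum_{\text{corner}}p \;-\; 20\,p_{i,j}\Big] = b_{i,j} + \frac{h^2}{12}\,\nabla^2 b_{i,j}$$

The stencil stays within the immediate 3x3 neighborhood, which is why the memory access pattern is barely more complex than the five-point case; the 3D version uses the analogous compact 19-point stencil.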
Mehrstellen Schemes - Results I
Spectral radius of the resultant matrix: the error of the current solution is multiplied by the spectral radius of the Jacobi matrix every iteration.
Expectation: if the Mehrstellen radius is significantly smaller than that of the second-order discretization, then fewer Jacobi iterations are needed overall.
Quantities compared (Table 3):
- The spectral radius of Jacobi iteration using the Mehrstellen matrix
- The equivalent radius of the standard Jacobi matrix
- The number of iterations it would take Mehrstellen Jacobi to achieve an error reduction equivalent to 20 iterations of standard Jacobi (derivation below)
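Since the error shrinks by the spectral radius each iteration, the equivalent iteration count in the third column of Table 3 follows from a one-line derivation:

$$\rho_M^{\,n} \le \rho_S^{\,20} \;\;\Longrightarrow\;\; n = \Big\lceil 20\,\frac{\ln \rho_S}{\ln \rho_M}\Big\rceil$$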
Mehrstellen Schemes - Results II
Table 3: Spectral radii of the fourth-order accurate Mehrstellen Jacobi matrix (M) and the standard second-order accurate Jacobi matrix (S). The third column gives the number of Mehrstellen iterations necessary to match the error reduction of 20 standard iterations. The last column gives the fraction of a Mehrstellen iteration necessary to match the error reduction of one standard iteration.
Advection Caching - Scheme
Physical characteristics: reasons to expect that the majority of the velocity field exhibits high spatial locality:
- The time-step size in practice is quite small
- The projection and diffusion operators smear out the velocity field
- Large velocities quickly dissipate into smaller ones in both space and time
Exploiting this, assume that most advection rays terminate in regions very close to their origins.
Static caching scheme (two-part approach; sketch below):
- Prefetch rows j-1, j, and j+1 from the d0 array
- While iterating over the elements of row j, first check whether the semi-Lagrangian ray terminated in a 3x3 neighborhood of its origin. If so, use the prefetched d0 values for the interpolation; otherwise, perform the more expensive fetch from main memory.
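A hypothetical sketch of the static caching scheme, assuming a row-major d0 and 4-byte floats. On the Cell the memcpy would be a DMA into SPE local store; the check here tests rows only, since full rows are prefetched (the paper describes it as a 3x3 neighborhood test). Names such as advect_cached and rows are illustrative:

```c
#include <string.h>

#define IX(i,j) ((i)+(N+2)*(j))

void advect_cached(int N, float *d, const float *d0,
                   const float *u, const float *v, float dt)
{
    float dt0 = dt * N;
    for (int j = 1; j <= N; j++) {
        /* prefetch rows j-1, j, j+1 of d0 into a local buffer */
        float rows[3 * (N + 2)];                /* C99 VLA for brevity */
        memcpy(rows, &d0[IX(0, j - 1)], sizeof(float) * 3 * (N + 2));

        for (int i = 1; i <= N; i++) {
            /* backtrace + clamp + indices, exactly as in advect() */
            float x = i - dt0 * u[IX(i,j)];
            float y = j - dt0 * v[IX(i,j)];
            x = (x < 0.5f) ? 0.5f : ((x > N + 0.5f) ? N + 0.5f : x);
            y = (y < 0.5f) ? 0.5f : ((y > N + 0.5f) ? N + 0.5f : y);
            int i0 = (int)x, i1 = i0 + 1;
            int j0 = (int)y, j1 = j0 + 1;
            float s1 = x - i0, s0 = 1 - s1;
            float t1 = y - j0, t0 = 1 - t1;

            const float *src = d0;
            int off = 0;
            if (j0 >= j - 1 && j1 <= j + 1) {   /* ray ended nearby? */
                src = rows;                     /* hit: local rows */
                off = -(j - 1);                 /* remap to buffer rows */
            }                                   /* else: main memory */
            d[IX(i,j)] =
                s0 * (t0 * src[IX(i0,j0+off)] + t1 * src[IX(i0,j1+off)]) +
                s1 * (t0 * src[IX(i1,j0+off)] + t1 * src[IX(i1,j1+off)]);
        }
    }
}
```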
Advection Caching - Tests
Two test scenes:
- 2D: eight jets of velocity and density injected into a 512² simulation at different points and in different directions, to induce a wide variety of directions in the velocity field
- 3D: a buoyant pocket of smoke continually inserted into a 64³ simulation
Cache miss rate:
- 2D: the miss rate never exceeds 0.65%
- 3D: the miss rate never exceeds 0.44%
Bandwidth test for the 2D scene on the Cell: bandwidth achieved by the advection stage with and without the static cache.
Conclusion & Future Work Adetailed flop and bandwidth analysisof the implementation of Stable Fluids on current CPU, GPU and Cell architectures. Prove theoretically and experimentally that the performance of the algorithm is bandwidth-bound Proposed the use of Mehrstellendiscretizationto reduce the # of iterations in Jacobi solver to reduce processor idle rate This scheme allows the linear solver to terminate 17% earlier in 2D, and 33% earlier in 3D. Designed a static caching scheme for the advection stage that makes more effective use of the available memory bandwidth. 2x speedup is measured in the advection stage using this scheme on the Cell. Map algorithms that handle free surface cases to parallel architecture and do corresponding performance analysis Develop Mehrstellen discretizations like scheme for PCG solvers
Thanks for your attention. Questions?