Multi Grid for Multi Core

1 Multi Grid for Multi Core
Harald Köstler, Daniel Ritter, Markus Stürmer and U. Rüde (LSS Erlangen, in collaboration with many more)
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de
SIAM Parallel Processing, Seattle, February 2010

2 Motivation / Overview
Architectures: (clusters of) multi-core CPUs: standard architecture, IBM Cell, GPU
Algorithms: multigrid
Performance engineering: porting multigrid to GPUs; local memory blocking techniques for multi-core CPUs
Conclusions

3 Evolution of processors: Improvements
[Table: one row per improvement, with columns Std.-CPU, CBEA and GPU; the cell marks did not survive extraction.]
Improvements: pipelining, superscalar execution, out-of-order, wider buses, SIMD, multithreading, multiprocessing, caches, hardware prefetcher, local storage, resource virtualization
Further labels: instruction, data, thread, transfer

4 Cell/B.E. and PowerXCell 8i

5 Nvidia GeForce GTX 295
Cost: 450 €
Interface: PCI-E 2.0 x16
Shader clock: 1242 MHz
Memory clock: 999 MHz
Memory bandwidth: 2 x 112 GB/s
FLOPS: 2 x 894 GFLOPS
Max power draw: 289 W
Framebuffer: 2 x 896 MB
Memory bus: 2 x 448 bit
Shader processors: 2 x 240

6 ATI Radeon HD 4870
Cost: 150 €
Interface: PCI-E 2.0 x16
Shader clock: 750 MHz
Memory clock: 900 MHz
Memory bandwidth: 115 GB/s
FLOPS: 1200 GFLOPS
Max power draw: 160 W
Framebuffer: 1024 MB
Memory bus: 256 bit
Shader processors: 800

7 Performance of Multigrid for image processing on GPUs

8 The Multigrid Idea
Iterative solvers only remove local components of the error fast. Instead:
smooth: remove local features of the error
restrict: coarsen the error
compute a correction of the error on the coarse grid
prolongate: interpolate and apply the correction
smooth again
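The steps on this slide can be sketched as a recursive V-cycle. The following is a minimal illustration and not the authors' GPU code: it assumes a 1D Poisson problem -u'' = f with zero boundary values, a grid of 2^k + 1 points, damped Jacobi as the smoother, full-weighting restriction and linear interpolation. All function names are hypothetical.

```python
def residual(u, f, h):
    """r = f - A u for the 3-point discretization of -u''; zero on the boundary."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi sweeps: cheap and effective on local (high-frequency) error."""
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = (1.0 - omega) * old[i] + omega * jac
    return u

def restrict(r):
    """Full weighting: coarse point j coincides with fine point 2j."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for j in range(1, nc - 1):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc

def prolongate(ec, n_fine):
    """Linear interpolation from the coarse grid to the fine grid."""
    e = [0.0] * n_fine
    for j in range(len(ec)):
        e[2 * j] = ec[j]
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e

def v_cycle(u, f, h, pre=2, post=2):
    """One V(pre,post)-cycle for -u'' = f with u = 0 on the boundary."""
    if len(u) == 3:                       # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * h * h * f[1]
        return u
    smooth(u, f, h, pre)                  # smooth
    rc = restrict(residual(u, f, h))      # restrict the residual
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)  # coarse-grid correction
    e = prolongate(ec, len(u))            # interpolate the correction
    for i in range(1, len(u) - 1):
        u[i] += e[i]                      # apply the correction
    return smooth(u, f, h, post)          # smooth again

# Usage: a few V(2,2)-cycles on 65 grid points with f = 1.
n = 65
h = 1.0 / (n - 1)
u = [0.0] * n
for _ in range(12):
    v_cycle(u, [1.0] * n, h)
```

With f = 1 the discrete solution is x(1 - x)/2, which makes the sketch easy to check against a known answer.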

9 Full Multigrid Cycle
[Figure: full multigrid cycle built from V-cycles. Legend: smoothing; exact solution; restriction of residual r = f - Au; interpolation of error and correction of solution u; interpolation of solution u.]
Computer Science X - System Simulation Group, Harald Köstler (harald.koestler@informatik.uni-erlangen.de)

10 Multigrid on Nvidia GeForce GTX 295
[Plot: runtime of a V(2,2)-cycle in ms versus image size.]

11 Multigrid on ATI Radeon HD 4870
[Plot: runtime of a V(2,2)-cycle in ms versus image size, up to 4096 x 4096.]

12 Runtime Comparison
Nvidia GeForce GTX 295 (half of the GPU): memory bandwidth 112 GB/s, runtime 41.5 ms (4096 x 4096)
ATI Radeon HD 4870: memory bandwidth 115 GB/s, runtime 40.4 ms (4096 x 4096)
Both cards show very similar performance.

13 Red-black Splitting
Store red and black values in two different arrays.
Doubles the performance.
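The data layout behind red-black splitting can be shown in a few lines. This is a sketch under assumptions, not the GPU implementation from the talk: it uses a 1D grid with an odd number of points, the model problem -u'' = f, and hypothetical function names. It only illustrates that a red-black Gauss-Seidel sweep on two separate arrays performs the same updates with unit-stride accesses; it does not reproduce the measured speedup.

```python
def split_red_black(u):
    """Store even-indexed (red) and odd-indexed (black) values in two arrays."""
    return u[0::2], u[1::2]

def merge_red_black(red, black):
    """Interleave the two colour arrays back into a single grid."""
    u = [0.0] * (len(red) + len(black))
    u[0::2] = red
    u[1::2] = black
    return u

def rbgs_split(red, black, f_red, f_black, h):
    """One red-black Gauss-Seidel sweep for -u'' = f on the split arrays.

    Assumes an odd number of grid points, so both boundary values are red
    and stay fixed. Each half-sweep reads one array and writes the other.
    """
    h2 = h * h
    for k in range(1, len(red) - 1):      # red point 2k, both neighbours are black
        red[k] = 0.5 * (black[k - 1] + black[k] + h2 * f_red[k])
    for k in range(len(black)):           # black point 2k+1, both neighbours are red
        black[k] = 0.5 * (red[k] + red[k + 1] + h2 * f_black[k])

def rbgs_interleaved(u, f, h):
    """Reference sweep on the usual single array (stride-2 accesses)."""
    h2 = h * h
    for start in (2, 1):                  # red (even) points first, then black (odd)
        for i in range(start, len(u) - 1, 2):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i])
```

Splitting once before the solver and merging once afterwards keeps the conversion cost out of the smoothing loop.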

14 Multigrid on GTX 295 with red-black splitting
[Plot: runtime of a V(2,2)-cycle in ms versus image size.]

15 Memory Bandwidth
[Plot: percent of memory bandwidth (0% to 100%) versus image size.]
In percent of the maximum measured (rounded) streaming bandwidth (100 GB/s).
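The percentage on this slide is a ratio of achieved to reference bandwidth. A minimal sketch of that arithmetic follows; the function name and the sample numbers are hypothetical and are not measurements from the talk. Only the 100 GB/s reference value is taken from the slide.

```python
def bandwidth_percent(bytes_moved, runtime_ms, reference_gb_per_s=100.0):
    """Achieved bandwidth as a percentage of a reference streaming bandwidth."""
    achieved_gb_per_s = bytes_moved / (runtime_ms * 1e-3) / 1e9
    return 100.0 * achieved_gb_per_s / reference_gb_per_s

# Hypothetical kernel that moves 2 GB of data in 25 ms.
percent = bandwidth_percent(2e9, 25.0)
```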

16 Runtime Distribution for GPU Kernels
memcopy: 29%
2ndDerivative: 3%
RBGS: 52%
Gradients: 3%
Interpolate_corr: 6%
Residual_Restrict: 7%

17 Frames per second for Image Stitching
[Plot: frames per second versus image size, up to 4096 x 4096. Series: fps (stitching, GTX 295); fps (solver, GTX 295); fps (CPU).]
CPU: Intel Core2 Quad Q9550 @ 2.83 GHz with OpenMP (4 cores)

18 Leaping the memory wall: cache blocking techniques

19 Leaping the memory wall: local storage / cache blocking techniques

20 Cache Blocking: leaping over the memory wall
Idea: change the order of operations to increase cache locality.
Matrix-matrix multiplication blocking: add multiplications of sub-matrices, each of which can be performed completely in cache.
Stencil-based kernels. Example: 5-point stencil without blocking.

21 Cache Blocking: leaping over the memory wall (continued)
Example: 5-point stencil without blocking.
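The matrix-matrix case named on the cache blocking slides can be sketched as follows. This is an illustration in plain Python with a hypothetical function name and block size; it shows only the reordered loop structure, since cache effects are not visible at this level.

```python
def matmul_blocked(a, b, n, bs):
    """C = A * B for n x n lists of lists, accumulated block by block."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Multiply sub-matrix A[ii.., kk..] by B[kk.., jj..]
                # and add the product to sub-matrix C[ii.., jj..].
                for i in range(ii, min(ii + bs, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

In a compiled language the block size bs would be chosen so that three bs x bs sub-matrices fit in cache together.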

22 Cache Blocking: leaping over the memory wall (continued)
Example: 5-point stencil with spatial blocking.

23 Cache Blocking: leaping over the memory wall (continued)
Example: 5-point stencil with spatial blocking.
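Spatial blocking of a 5-point stencil changes only the order in which grid points are visited. The sketch below assumes a Jacobi-type sweep that averages the four neighbours; the function names and tile sizes are hypothetical. Both sweeps write the same values, which is what makes the reordering legal.

```python
def jacobi_point(src, i, j):
    """5-point stencil: average of the four neighbours of (i, j)."""
    return 0.25 * (src[i - 1][j] + src[i + 1][j] + src[i][j - 1] + src[i][j + 1])

def sweep_plain(src, dst):
    """Unblocked sweep: row by row over all interior points."""
    n, m = len(src), len(src[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = jacobi_point(src, i, j)

def sweep_blocked(src, dst, bi, bj):
    """Spatially blocked sweep: the same updates, visited tile by tile."""
    n, m = len(src), len(src[0])
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, m - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, m - 1)):
                    dst[i][j] = jacobi_point(src, i, j)
```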

24 Cache Blocking: leaping over the memory wall (continued)
Example: 5-point stencil with temporal blocking.

25 Cache Blocking: leaping over the memory wall (continued)
Example: 5-point stencil with temporal blocking.

26 Temporal LS-blocking of 3D Stencil Code?
Options for temporal cache blocking in parallel:
synchronize data accesses at boundaries, or
compute required intermediate results locally, which wastes work on block boundaries (high surface/volume ratio for small LS).
Alignment and size of DMA and SIMD constrain the block size.
Low potential (estimate: max. 150% for 2x temporal blocking).
»Not even Markus would like to program that!«
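The second option on this slide, computing intermediate results locally, can be sketched for two fused sweeps. The example is a simplification under stated assumptions: a 2D 5-point Jacobi stencil instead of the 3D code discussed in the talk, serial execution, and hypothetical function names. Each tile loads a halo of width 2, recomputes one halo ring of the first sweep, and then performs the second sweep on the tile itself.

```python
def jacobi_sweep(src):
    """One global 5-point Jacobi sweep; boundary values are copied unchanged."""
    n, m = len(src), len(src[0])
    dst = [row[:] for row in src]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] + src[i][j - 1] + src[i][j + 1])
    return dst

def two_sweeps_blocked(src, bi, bj):
    """Two Jacobi sweeps fused per tile; the halo ring is recomputed locally."""
    n, m = len(src), len(src[0])
    dst = [row[:] for row in src]
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, m - 1, bj):
            i1, j1 = min(i0 + bi, n - 1), min(j0 + bj, m - 1)
            # Load the tile plus a halo of width 2 (clipped to the grid).
            lo_i, hi_i = max(i0 - 2, 0), min(i1 + 2, n)
            lo_j, hi_j = max(j0 - 2, 0), min(j1 + 2, m)
            buf = [src[i][lo_j:hi_j] for i in range(lo_i, hi_i)]
            # Sweep 1 on the tile plus one halo ring: this ring is the
            # redundant work that neighbouring tiles compute as well.
            tmp = [row[:] for row in buf]
            for i in range(max(i0 - 1, 1), min(i1 + 1, n - 1)):
                for j in range(max(j0 - 1, 1), min(j1 + 1, m - 1)):
                    a, b = i - lo_i, j - lo_j
                    tmp[a][b] = 0.25 * (buf[a - 1][b] + buf[a + 1][b] + buf[a][b - 1] + buf[a][b + 1])
            # Sweep 2 on the tile itself, written to the target grid.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    a, b = i - lo_i, j - lo_j
                    dst[i][j] = 0.25 * (tmp[a - 1][b] + tmp[a + 1][b] + tmp[a][b - 1] + tmp[a][b + 1])
    return dst
```

The smaller the tile, the larger the share of recomputed halo points, which is the surface/volume penalty the slide mentions for small local stores.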

27 An approach for local storage blocking

28 Buffered blocking
[Figure labels: buffer structures (local storage); source grid (memory); target grid (memory).]

29 Buffered blocking
[Figure labels: a buffer can hold the data of one tile; 1 stripe per SPE at a time; tile dependencies.]

30-32 Buffered blocking
[Figures: animation steps; only step markers survived extraction.]

33 Buffered blocking
Can also be used with caches.
A supporting (hybrid) framework is feasible.

34 Framework architecture
Framework:
Thread management: creation, synchronization, affinity
Data structures: alignment, padding (optional)
Control: traversal of grid, distribution of work, calling of library kernels
Application code:
General configuration: threads and their properties, type of per-thread buffers
Setup of shared data
Transfer of control: grid size, constraints, number of sweeps
Kernels: storage2buffer( tiledesc&, buf& ), compute( bufferstack& ), buffer2storage( tiledesc&, buf& )
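The division of labour between framework and application code can be sketched as a traversal loop that calls three application kernels. The kernel names are taken from the slide; everything else is an assumption for illustration: the Python signatures, the 1D grid, the tile description as an index pair, and the 3-point averaging stencil. Thread management, alignment and the buffer stack of the real framework are left out.

```python
def storage2buffer(tile, source):
    """Application kernel: copy one tile plus a one-point halo into a buffer."""
    lo, hi = tile
    return source[lo - 1:hi + 1]

def compute(buf):
    """Application kernel: 3-point averaging stencil on the buffered tile."""
    return [0.5 * (buf[k - 1] + buf[k + 1]) for k in range(1, len(buf) - 1)]

def buffer2storage(tile, result, target):
    """Application kernel: write the computed tile back to the target grid."""
    lo, hi = tile
    target[lo:hi] = result

def run_sweeps(grid, tile_size, sweeps, load, kernel, store):
    """Framework side: traverse the grid tile by tile and call the kernels."""
    source = grid[:]
    for _ in range(sweeps):
        target = source[:]                # boundary values carry over
        for lo in range(1, len(source) - 1, tile_size):
            tile = (lo, min(lo + tile_size, len(source) - 1))
            store(tile, kernel(load(tile, source)), target)
        source = target
    return source

# Usage: one sweep over a small grid in tiles of two points.
result = run_sweeps([0.0, 4.0, 0.0, 8.0, 0.0, 2.0, 0.0], 2, 1,
                    storage2buffer, compute, buffer2storage)
```

Because the traversal only sees the three callbacks, the same loop could drive a different stencil or a different buffer type without changes.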

35 Buffered blocking in action: Multigrid Method for Complex Diffusion

36 Image smoothing by complex diffusion
Smooth the image by solving a nonlinear, complex diffusion equation.
The imaginary part works as an edge detector.
Full approximation scheme multigrid solver.
Simple variant of the Schrödinger equation.

37 FAS for Complex Diffusion
~400/450 flop/unknown
On each grid level:
compute RHS (except finest level)
two ω-Jacobi iterations
compute residual
restrict current solution and residual
two ω-Jacobi iterations
interpolate and apply correction
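The per-unknown cost on this slide can be turned into a rough work estimate for one cycle. The sketch below rests on assumptions that the slide does not state: the cost applies to every unknown on every level, the grid is square, and each coarser level halves the resolution per dimension. The function name and the value 425 (a midpoint between 400 and 450) are hypothetical.

```python
def v_cycle_flops(n_fine, flop_per_unknown, coarsest=2):
    """Sum flop_per_unknown over all levels of an n x n grid hierarchy."""
    total, n = 0.0, n_fine
    while n >= coarsest:
        total += flop_per_unknown * n * n
        n //= 2                           # standard coarsening per dimension
    return total

finest_only = 425.0 * 4096 * 4096
all_levels = v_cycle_flops(4096, 425.0)
ratio = all_levels / finest_only          # stays below 4/3 (geometric series)
```

Under these assumptions the coarse levels add at most one third to the work on the finest level.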

38 Framework performance results
V(2,2)-cycle for a 4096 x 4096 image.
[Bar chart: time in ms (axis 0 to 400) for 2x Core2*, Core i7**, QS22 (16 SPEs) and GTX 295 (half). Annotations: more than 36 GB/s; 110 GFLOPS; without blocking.]
Straightforward C++ implementation with OpenMP on Core architectures: >1 s
* Core2 Xeon 2.8 GHz
** Core i7 2.93 GHz

39 Conclusions and Outlook

40 What else do we do?
Parallel Multigrid Algorithms on 10,000 cores and beyond: talk by UR this afternoon (5:00 pm) in MS 16, Challenges in Parallel Adaptive Mesh Refinement
Parallel Rigid Body Dynamics: MS 54, Friday 1:20-3:30 pm; talk by Klaus Iglberger on Friday 1:20 pm; poster by Tobias Preclik, Parallel Rigid Body Dynamics
Graduate Education in CS&E: talk by UR this afternoon (2:50 pm) in MS 11, Graduate Education for the Parallel Revolution
Parallel Lattice Boltzmann Methods for Complex Flows: no talk
Performance Analysis: talk by Georg Hager (Erlangen Computing Center) in MS 45, Friday 9:50-11:50 am, Analysis of Hybrid Applications on Modern Architectures

41 Granular Flows with Non-Spherical Particles and Frictional Elastic Collisions
64 processes; particles each composed of 2-5 overlapping spheres; approx. 13 hours runtime
Reference: D.M. Kaufman, T. Edmunds, and D.K. Pai: Fast frictional dynamics for rigid bodies. ACM Transactions on Graphics 24.

42 Thanks for your attention! Questions?
Slides, reports, thesis, animations available for download at: www10.informatik.uni-erlangen.de
