Attaining High Performance in General-Purpose Computations on Current Graphics Processors

Transcription

1 Attaining High Performance in General-Purpose Computations on Current Graphics Processors Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain) Attaining High Performance in General-Purpose Computations on GPUs 1 Igual et al.

2 Motivation (I) GPUs have become the first widely available HPC platform Successfully used in linear algebra, image processing, data mining, simulations... Recently introduced advances: Hardware level: Unified Architecture Software level: CUDA Peak performance up to 10x (Core2Duo vs Nvidia 8800) Is it always a benefit the use of GPUs? Attaining High Performance in General-Purpose Computations on GPUs 2 Igual et al.

3 Motivation (I) GPUs have become the first widely available HPC platform Successfully used in linear algebra, image processing, data mining, simulations... Recently introduced advances: Hardware level: Unified Architecture Software level: CUDA Peak performance up to 10x (Core2Duo vs Nvidia 8800) Is it always a benefit the use of GPUs? Attaining High Performance in General-Purpose Computations on GPUs 2 Igual et al.

4 Motivation (II) GPUs are specific-purpose processors The GPU architecture is focused on graphics performance Not every general-purpose algorithm will fit correctly on GPU Newest GPU models introduce interesting features from the GPGPU point of view Which algorithms fit better on the graphics processor? Is it always better the use of GPU? Impact of the improvements of last generations of GPUs Attaining High Performance in General-Purpose Computations on GPUs 3 Igual et al.

8 Outline 1 Introduction 2 Classical GPU architecture 3 Unified GPU architecture 4 Experimental setup 5 Results on classical GPUs 6 Results on unified GPUs 7 Conclusions Attaining High Performance in General-Purpose Computations on GPUs 4 Igual et al.

9 Introduction Introduction and goals Methodology Microbenchmarks from BLAS and image processing Implement the benchmarks on CPU and both generations of GPUs Use only optimized benchmarks on both architectures Goals Extract the features of GPU-like algorithms through the evaluation of different routines Compare both generations of GPU architectures Extract the impact of data transfers on both generations Decide which algorithms are CPU or GPU-like Attaining High Performance in General-Purpose Computations on GPUs 5 Igual et al.

10 Introduction Introduction and goals Methodology Microbenchmarks from BLAS and image processing Implement the benchmarks on CPU and both generations of GPUs Use only optimized benchmarks on both architectures Goals Extract the features of GPU-like algorithms through the evaluation of different routines Compare both generations of GPU architectures Extract the impact of data transfers on both generations Decide which algorithms are CPU or GPU-like Attaining High Performance in General-Purpose Computations on GPUs 5 Igual et al.

11 Classical GPU architecture Classical GPU pipeline Each stage of the pipeline works with a different datatype Each stage is computed on a different type of processor (vertex processors and fragment processors) But the processors are programmable... Attaining High Performance in General-Purpose Computations on GPUs 6 Igual et al.

12 Classical GPU architecture Classical GPU overview Programmability Both vertex processors and fragments processors are programmable Usually fragment processors (FP) are programmed The replication of FP fits well with data-parallel routines SIMD architecture Disadvantages No memory hierarchy at all Just one memory direction could be written for each fragment Hard to program (OpenGL + Cg) Attaining High Performance in General-Purpose Computations on GPUs 7 Igual et al.

13 Classical GPU architecture Classical GPU overview Programmability Both vertex processors and fragments processors are programmable Usually fragment processors (FP) are programmed The replication of FP fits well with data-parallel routines SIMD architecture Disadvantages No memory hierarchy at all Just one memory direction could be written for each fragment Hard to program (OpenGL + Cg) Attaining High Performance in General-Purpose Computations on GPUs 7 Igual et al.

14 Unified GPU architecture Unified GPU shader One unified shader with multiple functionality Works with any type of graphical data GPGPU: all programmable shaders are now available Attaining High Performance in General-Purpose Computations on GPUs 8 Igual et al.

15 Unified GPU architecture Unified GPU architecture: G80 NVIDIA G80 Main unified implementation Up to 128 cores in the unified shader Introduction of memory hierarchy Higher clock frequency in the shader More flexibility in reading/writing to memory CUDA library Attaining High Performance in General-Purpose Computations on GPUs 9 Igual et al.

16 Experimental setup Selection of benchmarks Features to be evaluated 1 Data parallelism 2 Input data reutilization 3 Computational intensity per stream element Routines to be evaluated: SGEMM: C = αab + βc SGEMV: y = αax + βy SAXPY: y = αx + y SSCAL: y = αy 2D Convolution Attaining High Performance in General-Purpose Computations on GPUs 10 Igual et al.

17 Experimental setup Features of the benchmarks Selected routines: Routine Type Data parallelism Input reutilization Arith. intensity SGEMM BLAS 3 High High High SGEMV BLAS 2 High Medium Medium SAXPY BLAS 1 High None Low SSCAL BLAS 1 High None Lowest Convolution Image High Low Medium Hardware setup: Classical architecture Unified architecture AMD Athlon Nvidia Geforce 6200 GotoBLAS / CUBLAS 1.0 Intel Core2Duo 1.86 Ghz + Nvidia Geforce 8800 Ultra Attaining High Performance in General-Purpose Computations on GPUs 11 Igual et al.

22 Results on classical GPUs Results on classical GPUs. SGEMM Test SGEMM MFLOPS CPU MFLOPS GPU MFLOPS GPU w. TX. Data reutilization impact MFLOPs CPU outperforms GPU Speedup up to x4 Impact of cache hierarchy on CPU Data transfers not significant Matrix dimension Input data reutilization seems a key factor on GPU SGEMV exhibits less input data reutilization... Attaining High Performance in General-Purpose Computations on GPUs 12 Igual et al.

23 Results on classical GPUs Results on classical GPUs. SGEMV Test SGEMV MFLOPS CPU MFLOPS GPU MFLOPS GPU w. TX Data reutilization impact MFLOPs CPU outperforms GPU, but up to x2 Data transfers more significant Vector dimension Input data reutilization is a key factor on GPU Attaining High Performance in General-Purpose Computations on GPUs 13 Igual et al.

24 Results on classical GPUs Results on classical GPUs. SAXPY and SSCAL Arithmetic intensity Test SAXPY MFLOPS CPU MFLOPS GPU MFLOPS GPU w. TX 2000 Test SSCAL MFLOPS CPU MFLOPS GPU MFLOPS GPU w. TX 1500 MFLOPs MFLOPs Vector dimension Vector dimension SAXPY and SSCAL are stream-oriented operations They should attain high performance on GPUs However, CPU outperforms GPU for these operations Arithmetic intensity per stream element is another key factor Attaining High Performance in General-Purpose Computations on GPUs 14 Igual et al.

25 Results on classical GPUs Results on classical GPUs. Convolution Time (ms) Test Convolution - Image 512x512 CPU GPU4 GPU4 w. TX GPU Convolution performance Attained performance similar to that on CPU Optimized implementation for vectorial capabilities (GPU4) attains 4x speedup Filter Width Required features on GPU High data parallelism Low input data reutilization High arithmetic intensity per stream element Attaining High Performance in General-Purpose Computations on GPUs 15 Igual et al.

26 Results on unified GPUs Results on unified GPUs. SGEMM GFLOPs Test SGEMM CUDA GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX Matrix dimension Data reutilization impact GPU outperforms CPU Speedup up to x10 However, only attaining 20% of the peak performance of the GPU Data transfers more significant than on previous generations Input data reutilization is still important on GPU, but less than on previous generations Reason: more sophisticated memory hierarchy Attaining High Performance in General-Purpose Computations on GPUs 16 Igual et al.

27 Results on unified GPUs Results on unified GPUs. SGEMV GFLOPs Test SGEMV CUDA GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX Vector dimension Data reutilization impact GPU outperforms CPU for big matrices Impact of cache hierarchy on CPU: faster for small streams of data Data transfers very significant Data transfer stage is a very important factor now G6 G80 (x20) but AGP PCIExpress (x2) Modern GPUs work better with big streams of data Attaining High Performance in General-Purpose Computations on GPUs 17 Igual et al.

28 Results on unified GPUs Results on unified GPUs. SAXPY and SSCAL Arithmetic intensity 6 5 Test SAXPY CUDA GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX Test SSCAL CUDA GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX 4 5 GFLOPs 3 GFLOPs Vector dimension Vector dimension However, arithmetic intensity per element is still a key factor CPU outperforms GPU for these operations Attaining High Performance in General-Purpose Computations on GPUs 18 Igual et al.

29 Results on unified GPUs Results on unified GPUs. Convolution GFLOPS Test Convolution - Image 512x512 GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX. Convolution performance Attained the highest speedup of the benchmarks Highest data transfer impact Filter Width Required features on unified GPUs High data parallelism, low reutilization, high intensity Low ratio transfer/calculation Big data streams Attaining High Performance in General-Purpose Computations on GPUs 19 Igual et al.

30 Results on unified GPUs Results on unified GPUs. Convolution GFLOPS Test Convolution - Image 512x512 GFLOPS CPU GFLOPS GPU GFLOPS GPU w. TX. Convolution performance Attained the highest speedup of the benchmarks Highest data transfer impact Filter Width Required features on unified GPUs High data parallelism, low reutilization, high intensity Low ratio transfer/calculation Big data streams Attaining High Performance in General-Purpose Computations on GPUs 19 Igual et al.

31 Conclusions Conclusions The GPU is an interesting platform for HPC However, not every algorithm extracts all the performance Desirable features: High data parallelism Low input data reutilization High arithmetic intensity Operating with big streams of data Necessary to design algorithm minimizing data transfers Attaining High Performance in General-Purpose Computations on GPUs 20 Igual et al.

32 Conclusions Thank you... For more information... figual Attaining High Performance in General-Purpose Computations on GPUs 21 Igual et al.