Performance Portability Study of Linear Algebra Kernels in OpenCL

Karl Rupp (1,2), Philippe Tillet (1), Florian Rudolf (1), Josef Weinbub (1), Ansgar Jüngel (2), Tibor Grasser (1)
(1) Institute for Microelectronics, TU Wien, Austria
(2) Institute for Analysis and Scientific Computing, TU Wien, Austria
International Workshop on OpenCL 2014, May 13th, 2014

Introduction

OpenCL Code Library Expectations
- Just work on my (user) hardware
- Run fast on my (user) hardware
- Integrate easily into my (user) code
- Mix with CUDA and OpenMP
- Available for free

Work Based on Developer Experiences: ViennaCL
- Free open-source C++ linear algebra library (Python? PyViennaCL!)
- API similar to Boost.uBLAS

Performance Portability

How OpenCL Helps
- Unified hardware information queries
- Unified kernel language

Developer Tasks
- OpenCL alone does not give automatic performance portability
- Parameterize each kernel
- Performance models vs. (auto)tuning
- Impractical to repeat for each kernel

Benchmark Setting

Scope for Portability Study
- Vector and matrix-vector operations (BLAS levels 1 and 2)
- Limited by memory bandwidth
- Matrix-matrix multiplication (only briefly today)

Key Question (Memory-Bandwidth-Limited Kernels)
- Good performance of complicated kernels by optimizing the simplest kernel?

Benchmark Setting

Vector Assignment (Copy) Kernel
- x ← y for (large) vectors x, y

Parameters (1900 variations)
- Local work size, global work size
- Vector types (float1, float2, ..., float16)
- Thread increment type
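The parameters listed for the copy kernel span a Cartesian product of choices. The following is a minimal sketch of how such a search space could be enumerated. The candidate values and the names `LOCAL_SIZES`, `NUM_GROUPS`, `VECTOR_WIDTHS`, `INCREMENTS`, and `enumerate_configs` are illustrative assumptions, not the grid used in the study, so the resulting count differs from the 1900 variations quoted on the slide.

```python
from itertools import product

# Hypothetical candidate values; the study's actual grid is not given on the slide.
LOCAL_SIZES = [16, 32, 64, 128, 256]
NUM_GROUPS = [16, 32, 64, 128, 256]
VECTOR_WIDTHS = [1, 2, 4, 8, 16]    # stands in for float1 ... float16
INCREMENTS = ["local", "global"]    # thread increment type


def enumerate_configs():
    """Yield one dict per (local size, global size, vector width, increment) combination."""
    for ls, ng, vw, inc in product(LOCAL_SIZES, NUM_GROUPS, VECTOR_WIDTHS, INCREMENTS):
        # OpenCL 1.x requires the global work size to be a multiple of the
        # local work size, so it is derived as local size times group count.
        yield {
            "local_size": ls,
            "global_size": ls * ng,
            "vector_width": vw,
            "increment": inc,
        }


configs = list(enumerate_configs())
print(len(configs))
```

Deriving the global work size from the number of work-groups keeps every generated configuration valid without a separate filtering step.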

Thread increment types for the copy kernel

Increment by global work size:

    for (size_t i = get_global_id(0); i < N; i += get_global_size(0))
      x[i] = y[i];

Increment by local work size (each work-group loops over its own range from group_start to group_end):

    for (size_t i = group_start + get_local_id(0); i < group_end; i += get_local_size(0))
      x[i] = y[i];
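To make the difference between the two increment types concrete, the sketch below simulates on the host which indices each work-item would visit. It is not OpenCL code. The way `group_start` and `group_end` are computed (equal contiguous chunks per work-group) is an assumption, since the slides do not show their definition.

```python
def indices_global_stride(n, global_size):
    """Indices visited by each work-item when incrementing by the global work size."""
    return [list(range(gid, n, global_size)) for gid in range(global_size)]


def indices_local_stride(n, local_size, num_groups):
    """Indices visited by each work-item when incrementing by the local work size.

    Assumption: group g owns the contiguous range
    [g * chunk, min((g + 1) * chunk, n)), one plausible choice for
    group_start and group_end.
    """
    chunk = (n + num_groups - 1) // num_groups  # ceiling division
    visited = []
    for g in range(num_groups):
        group_start = g * chunk
        group_end = min(group_start + chunk, n)
        for lid in range(local_size):
            visited.append(list(range(group_start + lid, group_end, local_size)))
    return visited


def covers_exactly_once(per_item, n):
    """True if the union of all work-item index lists is 0..n-1 without repeats."""
    flat = sorted(i for item in per_item for i in item)
    return flat == list(range(n))


n = 1000
print(covers_exactly_once(indices_global_stride(n, 64), n))
print(covers_exactly_once(indices_local_stride(n, 16, 4), n))
```

Both schemes touch every element once. They differ in which work-item touches which address: with the global stride, consecutive work-items across the whole device access consecutive elements, while with the local stride each work-group stays inside its own chunk.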

Benchmark Setting

Operations
- Vector copy, vector addition, inner product
- Matrix-vector product

Devices
- AMD: A APU, HD 5850 GPU
- INTEL: Dual Socket Xeon E5-2670, Xeon Phi
- NVIDIA: GTX 285, Tesla K20m
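The benchmark plots report bandwidth as a percentage of peak. A minimal sketch of how such a figure could be computed from a kernel timing follows. The byte counts assume single precision, an addition of the form x ← y + z, and a dense n × n matrix; the function names and the peak value in the usage line are hypothetical and not taken from the slides.

```python
FLOAT_BYTES = 4  # single precision


def bytes_moved(op, n):
    """Minimum global-memory traffic in bytes (vectors of length n, n x n matrix)."""
    if op == "copy":            # x <- y: read y, write x
        return 2 * n * FLOAT_BYTES
    if op == "addition":        # x <- y + z: read y and z, write x
        return 3 * n * FLOAT_BYTES
    if op == "inner_product":   # read x and y; the scalar result is negligible
        return 2 * n * FLOAT_BYTES
    if op == "matvec":          # y <- A x: read A and x, write y
        return (n * n + 2 * n) * FLOAT_BYTES
    raise ValueError(f"unknown operation: {op}")


def percent_of_peak(op, n, seconds, peak_gb_per_s):
    """Achieved bandwidth as a percentage of the device's peak bandwidth."""
    achieved_gb_per_s = bytes_moved(op, n) / seconds / 1e9
    return 100.0 * achieved_gb_per_s / peak_gb_per_s


# Hypothetical timing and peak bandwidth, for illustration only.
print(percent_of_peak("copy", 10**7, 1e-3, 100.0))
```

Normalizing by peak bandwidth is what allows devices with very different memory systems to be compared on the same axis.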

Benchmark: Histograms

[Figure: AMD Radeon HD, inner product bandwidth (% of peak); panels: increment by local work size, increment by global work size]
[Figure: NVIDIA Tesla K20m, inner product bandwidth (% of peak); panels: increment by local work size, increment by global work size]
[Figure: Intel Xeon E, inner product bandwidth (% of theoretical peak); panels: increment by local work size, increment by global work size]
[Figure: NVIDIA Tesla K20m, density over relative bandwidth (%), copy operation]
[Figure: NVIDIA Tesla K20m, density over relative bandwidth (%), matrix-vector product operation]

Benchmark: [Addition, Inner Product, Matrix-Vector] vs. Copy Kernel, Same Device

Figures, one per device, each with three panels (copy vs. addition, copy vs. inner product, copy vs. matrix-vector product); axes: bandwidth of the copy kernel (% of peak) and bandwidth of the compared kernel (% of peak):
- NVIDIA Tesla K20m
- AMD Radeon HD 5850
- 2x INTEL Xeon E
- INTEL Xeon Phi 5110P

Benchmark conclusion: Focusing on the fastest copy-kernel configurations is sufficient.
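One way to act on this conclusion is to benchmark a more expensive kernel only on the configurations that rank highest for the copy kernel. The sketch below illustrates that idea along with a plain Pearson correlation. All names are hypothetical, and the numbers are made-up placeholders, not measurements from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def top_k_by_copy(copy_bw, k):
    """Return the k configuration ids with the highest copy-kernel bandwidth."""
    return sorted(copy_bw, key=copy_bw.get, reverse=True)[:k]


def tune_with_copy_prefilter(copy_bw, measure_other, k):
    """Benchmark the other kernel only on the k best copy configurations."""
    candidates = top_k_by_copy(copy_bw, k)
    results = {cfg: measure_other(cfg) for cfg in candidates}
    best = max(results, key=results.get)
    return best, results


# Made-up placeholder numbers (% of peak), NOT measurements from the study.
copy_bw = {"cfg_a": 20.0, "cfg_b": 55.0, "cfg_c": 80.0, "cfg_d": 78.0}
fake_inner_product_bw = {"cfg_a": 15.0, "cfg_b": 50.0, "cfg_c": 70.0, "cfg_d": 74.0}

best, measured = tune_with_copy_prefilter(copy_bw, fake_inner_product_bw.get, k=2)
print(best, sorted(measured))
```

The prefilter only pays off when copy and target-kernel performance are strongly correlated; with weak correlation the best configuration could fall outside the top k.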

Benchmark: [Copy, Addition, Inner Product, Matrix-Vector] vs. Copy Kernel, Different Device, Same Vendor

Figures, copy bandwidth on one device vs. copy bandwidth on the other (% of peak):
- NVIDIA hardware (x: GTX 285, y: K20m)
- AMD hardware (x: A K GPU, y: HD 5850)
- INTEL hardware (x: Xeon Phi, E5-2670)

Figures, copy bandwidth on device x (% of peak) vs. addition, inner product, and matrix-vector bandwidth on device y (%):
- NVIDIA hardware (x: GTX 285, y: K20m)
- AMD hardware (x: A K GPU, y: HD 5850)

Benchmark conclusion: A certain degree of performance portability exists within each vendor's hardware.

Benchmark: [Copy, Addition, Inner Product, Matrix-Vector] vs. Copy Kernel, Different Device, Different Vendor

Figures, copy bandwidth on both axes (% of peak):
- x: INTEL CPU (E5-2670), y: AMD CPU (A K CPU)
- x: INTEL CPU (E5-2670), y: NVIDIA GPU (K20m)
- x: AMD GPU (HD 5850), y: NVIDIA GPU (K20m)

Figure, x: AMD HD 5850 copy bandwidth (% of peak), y: NVIDIA K20m bandwidth (%) for addition, inner product, and matrix-vector product

Benchmark conclusion: Fast configurations that work across vendors exist.

Benchmark: Matrix-Matrix Multiplication, Single Precision

[Figure: GFLOP/s over matrix rows/columns for INTEL Core i (MKL, ViennaCL), NVIDIA GTX (CUBLAS 5.0, ViennaCL), and AMD HD (clAmdBlas 1.10, ViennaCL). Source: Ph. Tillet et al., HotPar]
[Figure: NVIDIA GeForce GTX 470, SGEMM performance (GFLOP/s) over copy bandwidth (% of peak)]
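For reference, the GFLOP/s unit used for matrix-matrix multiplication is conventionally derived from the 2n³ operation count of a dense n × n product. The helper below is a small sketch under that assumption; the function name and the timing in the usage line are hypothetical.

```python
def sgemm_gflops(n, seconds):
    """GFLOP/s for an n x n by n x n product, counting 2 * n**3 floating-point operations."""
    return 2.0 * n ** 3 / seconds / 1e9


# Hypothetical timing, for illustration only.
print(sgemm_gflops(1000, 0.01))
```

Unlike the bandwidth-limited kernels, this metric grows with arithmetic throughput, which is why the slides relate it to copy bandwidth only briefly.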

Conclusion

Copy Kernel
- Prototype for complicated memory-bandwidth-limited kernels
- Strong performance correlation
- Reduces search space for auto-tuning

Implications
- Tune primarily with memory transfers in mind
- Prefer regular memory access patterns
- Use appropriate vector data types

Next Steps
- Build a device parameter database
- Provide tools for crowd-tuning
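A device parameter database of the kind named under the next steps could, in its simplest form, be a lookup table with fallbacks. The sketch below assumes a three-level lookup (exact device, then vendor default, then generic default), motivated by the per-vendor portability observed in the benchmarks. The structure, names, and all parameter values are hypothetical placeholders, not tuned results and not ViennaCL's actual design.

```python
# Hypothetical entries: parameter values are placeholders, not tuned results.
DEVICE_DB = {
    ("NVIDIA", "Tesla K20m"): {"local_size": 128, "num_groups": 128, "vector_width": 1},
    ("AMD", "HD 5850"): {"local_size": 64, "num_groups": 64, "vector_width": 4},
}

VENDOR_DEFAULTS = {
    "NVIDIA": {"local_size": 128, "num_groups": 128, "vector_width": 1},
    "AMD": {"local_size": 64, "num_groups": 64, "vector_width": 4},
}

GENERIC_DEFAULT = {"local_size": 64, "num_groups": 64, "vector_width": 1}


def lookup_parameters(vendor, device):
    """Return kernel parameters for a device, falling back to vendor and generic defaults."""
    if (vendor, device) in DEVICE_DB:
        return DEVICE_DB[(vendor, device)]
    if vendor in VENDOR_DEFAULTS:
        return VENDOR_DEFAULTS[vendor]
    return GENERIC_DEFAULT


print(lookup_parameters("NVIDIA", "GTX 285"))
```

Crowd-tuning would then amount to users contributing measured entries for the exact-device table, shrinking the set of devices that rely on a fallback.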
