PENCIL: A Platform-Neutral Language for Accelerator Programming
Vincent Grevendonk, Media Processing Group, ARM
15 December 2014

Outline
- Introduction
- PENCIL
- Case Studies
- Conclusions

Accelerator Programming Concerns
Programmer productivity
- Optimized accelerator code is tedious to write
- Algorithms are tightly coupled with hardware
- Poor opportunities for code reuse
Performance portability
- OpenCL is not performance portable
- This is particularly true when moving between desktop and mobile devices
- In fact, some code may not run at all on a different device

DSL-to-Accelerator Compilation
Compiling m DSLs to n different platforms:
    DSL 1 ... DSL m  ->  OpenCL 1 ... OpenCL n
Requires m × n compilers!

CARP Approach
- Domain specific: VOBLA, other DSLs
- DSL compilers: voblac, other DSL compilers
- Domain independent, target independent: PENCIL
- PPCG
- Target specific: CUDA, OpenCL, OpenMP

C99 Subset + Extensions
- No pointer dereferencing or pointer manipulation
- No recursive functions
- Local arrays declared as VLA: float Y[n];
- Array arguments declared as: float X[const restrict static n]
- Strict for-loop shape, e.g.: for (int i = start; i <= stop; i += stride)
- Library functions such as abs, min, max, cos, ...
- Compatibility layer for compiling with regular C99 compilers
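
As a rough illustration of the restrictions listed for the C99 subset, the sketch below uses only those constructs: array arguments declared with const restrict static sizes, a local VLA, and for-loops with an explicit start, bound and stride. The function name abs_half and its behavior are invented for this example. It is plain C99 and is not taken from the PENCIL distribution.

```c
/* Hypothetical example: out[i] = |X[i]| / 2, written in the restricted style.
 * Assumes n >= 1 (the local VLA needs a positive size) and that X and out
 * do not overlap (both are declared restrict). */
void abs_half(int n, float X[const restrict static n],
              float out[const restrict static n])
{
    float Y[n];                       /* local array declared as a VLA */

    for (int i = 0; i < n; i += 1) {  /* start, bound and stride are explicit */
        Y[i] = X[i] < 0.0f ? -X[i] : X[i];
    }
    for (int i = 0; i < n; i += 1) {
        out[i] = 0.5f * Y[i];
    }
}
```

Because the code is ordinary C99, a regular C compiler should accept it unchanged, which is the property the compatibility layer is meant to preserve.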

Basic PENCIL-to-OpenCL Example
PENCIL input code:

    void f(int n, float A[const restrict static n])
    {
        for (int i = 0; i < n; i++) {
            A[i] = i;
        }
    }

PPCG output:
- Host code
- 1 kernel (equivalent code):

    __kernel void kernel0(int n, __global float *A)
    {
        int i = get_global_id(0);
        A[i] = i;
    }

Independent Pragma
#pragma pencil independent
- Tells the compiler that it may ignore any loop-carried dependencies in the loop that follows
- Not checked at runtime

    void f(int n, float A[const restrict static n],
           int B[const restrict static n])
    {
    #pragma pencil independent
        for (int i = 0; i < n; i++) {
            A[B[i]] = i;
        }
    }
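
The pragma is a promise made by the programmer, so it helps to see a case where the promise actually holds. The sketch below is a hypothetical variant of the slide's function, renamed scatter here, with the index array declared as int so that it can be used as a subscript. It assumes the caller passes a permutation of 0..n-1 in B, which is what makes the iterations independent. A regular C compiler is expected to ignore the unknown pragma, possibly with a warning.

```c
/* Hypothetical example: scatter the loop index through an index array.
 * Assumption: B is a permutation of 0..n-1, so no two iterations write the
 * same element of A. The pragma records that promise; nothing checks it. */
void scatter(int n, float A[const restrict static n],
             int B[const restrict static n])
{
#pragma pencil independent
    for (int i = 0; i < n; i++) {
        A[B[i]] = i;
    }
}
```

If B contained a repeated index, two iterations would write the same element and the result would depend on execution order once the loop is parallelized.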

Assume Statements
pencil_assume(...);
- Tells the compiler to assume that the given expression holds
- Not checked at runtime

    void foo(int n, int m, int S, int D[const restrict static S])
    {
        pencil_assume(m > n);
        for (int i = 0; i < n; i++) {
            D[i] = D[i+m];
        }
    }
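
Since the assumption is not checked at runtime, one possible way to build the same source with a regular C99 compiler is a small fallback macro. The sketch below is only a guess at what such a layer could look like: the guard flag PENCIL_CHECK_ASSUME is invented for this example, and the real compatibility layer may define pencil_assume differently.

```c
#include <assert.h>

/* Hypothetical fallback for regular C compilers; the real compatibility
 * layer may define this differently. PENCIL_CHECK_ASSUME is an invented
 * flag: when set, the assumption is checked with assert(), otherwise the
 * statement compiles to nothing. */
#ifndef pencil_assume
#ifdef PENCIL_CHECK_ASSUME
#define pencil_assume(cond) assert(cond)
#else
#define pencil_assume(cond) ((void)0)
#endif
#endif

/* Assumes S >= m + n so that every read D[i + m] is in bounds. */
void foo(int n, int m, int S, int D[const restrict static S])
{
    pencil_assume(m > n);   /* reads D[m..m+n-1], writes D[0..n-1]: disjoint */
    for (int i = 0; i < n; i++) {
        D[i] = D[i + m];
    }
}
```

With m > n the elements read and the elements written never overlap, so no iteration depends on another and the loop can be parallelized.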

Summary Functions
__attribute__((pencil_access(...)));
- Describes the memory access pattern for:
  - Non-PENCIL functions (e.g. hand-optimized OpenCL)
  - PENCIL functions too complex for compiler analysis
- Summary functions are not executed

    void saxpy2_summary(int i, float x[...], float y[...], float alpha)
    {
        y[i] = x[i];
        y[i+1] = x[i+1];
    }

    /* Defined elsewhere */
    __attribute__((pencil_access(saxpy2_summary)))
    void saxpy2(int i, float x[...], float y[...], float alpha);

    void saxpy(int n, float x[...], float y[...], float alpha)
    {
        for (int i = 0; i < n; i+=2)
            saxpy2(i, x, y, alpha);
    }
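
To make the relationship between a function and its summary concrete, here is a self-contained sketch that builds with a regular C99 compiler. Several details are assumptions made for this example: the array sizes elided on the slide are replaced by an extra parameter n, the body of saxpy2 is an invented stand-in for hand-optimized code, and PENCIL_ACCESS is a hypothetical macro that expands to nothing, used in place of the attribute so that compilers without PENCIL support do not complain.

```c
/* Hypothetical macro so the sketch builds with any C99 compiler; a PENCIL
 * compiler would see the __attribute__((pencil_access(...))) form instead. */
#define PENCIL_ACCESS(summary) /* expands to nothing here */

/* Summary: states only which elements are read and written. Never executed. */
void saxpy2_summary(int n, int i, float x[const restrict static n],
                    float y[const restrict static n], float alpha)
{
    (void)alpha;
    y[i] = x[i];
    y[i + 1] = x[i + 1];
}

/* The "real" function, standing in for hand-optimized code. */
PENCIL_ACCESS(saxpy2_summary)
void saxpy2(int n, int i, float x[const restrict static n],
            float y[const restrict static n], float alpha)
{
    y[i] += alpha * x[i];
    y[i + 1] += alpha * x[i + 1];
}

/* Assumes n is even, since each call handles elements i and i + 1. */
void saxpy(int n, float x[const restrict static n],
           float y[const restrict static n], float alpha)
{
    for (int i = 0; i < n; i += 2) {
        saxpy2(n, i, x, y, alpha);
    }
}
```

The summary tells the compiler that call i touches only elements i and i + 1, so calls with different values of i do not interfere.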

PENCIL Programming Recommendations
- Prefer for-loops over while-loops
- Prefer affine conditions and array access expressions such as a[2*i][j], a[i+j], but not a[i*n+j]
- Avoid data-dependent array accesses
- Avoid data-dependent control flow
- Keep arrays multi-dimensional
- Use pencil_assume copiously (but not recklessly!)
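
The recommendations on affine accesses and multi-dimensional arrays can be illustrated with a small sketch. The transpose function below is an invented example, not code from the PENCIL project. It keeps both arrays two-dimensional so that every subscript is a plain loop variable.

```c
/* Hypothetical example: transpose an n-by-m matrix.
 * The subscripts in[i][j] and out[j][i] are affine in the loop variables.
 * A flattened version would need in[i * m + j], where i * m multiplies a
 * loop variable by a parameter and is no longer affine. */
void transpose(int n, int m, float in[const restrict static n][m],
               float out[const restrict static m][n])
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            out[j][i] = in[i][j];
        }
    }
}
```

Both loops are for-loops with fixed bounds, and neither the control flow nor the subscripts depend on the data stored in the arrays.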

PENCIL-to-OpenCL Compilation
Polyhedral Parallel Code Generator (PPCG)
- Developed and maintained by INRIA/ENS, France
- Performs parallelization and memory management
- Produces host and kernel code
- User flags affecting code generation: tile size, grid size, block size

Basic Linear Algebra Subprograms (BLAS)
[Figure: bar chart of speedup on ARM Mali-T604 GPU (normalized to reference), y-axis from 0 to 2, for the routines srot, sswap, sscal, scopy, saxpy, sgemv, sgbmv, ssymv, ssbmv, sspmv, sger, ssyr, sspr, ssyr2, sspr2, sgemm, ssyrk, ssyr2k]

Open Problems
- Finding appropriate PPCG flags
- Producing efficient code for downstream OpenCL compiler
- Handling reduction loops efficiently

Compute Benchmarks: SHOC and Rodinia
PENCIL reimplementation of selected OpenCL benchmarks.
[Figure: bar chart of speedup on ARM Mali-T604 GPU (normalized to original OpenCL), y-axis from 0 to 4, for the benchmarks Stencil, Gaussian, SRAD, SpMV, Radix, BFS]

Conclusions and Future Work
PENCIL: a C99-based intermediate language for compute
- Serves as DSL compilation target
- Addresses OpenCL performance portability problem
- Increases programmer productivity
Future work:
- Continuing development of BLAS library
- Porting SLAMBench to PENCIL
Encouraging results, work in progress.

References
- http://carpproject.github.io/
- https://github.com/carpproject/