PENCIL: A Platform-Neutral Language for Accelerator Programming

Vincent Grevendonk, Media Processing Group, ARM
15 December 2014

Outline
- Introduction
- PENCIL
- Case Studies
- Conclusions

Accelerator Programming Concerns
- Programmer productivity
  - Optimized accelerator code is tedious to write
  - Algorithms are tightly coupled with hardware
  - Poor opportunities for code reuse
- Performance portability
  - OpenCL is not performance portable
  - Particularly true for desktop/mobile
  - In fact, some code may not run at all on a different device

DSL-to-Accelerator Compilation
- Compiling m DSLs to n different platforms
- DSL 1 ... DSL m, each compiled to OpenCL 1 ... OpenCL n
- Requires m × n compilers!

CARP Approach
- Domain specific: VOBLA and other DSLs, compiled by voblac and other DSL compilers
- Domain independent, target independent: PENCIL
- PPCG compiles PENCIL to target code
- Target specific: CUDA, OpenCL, OpenMP
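Counting compilers makes the benefit of a shared intermediate language concrete. The comparison below is a sketch that assumes one front end per DSL and one back end per target, which idealizes the diagram; the values m = 5 and n = 3 are illustrative and do not come from the talk.

```latex
\[
\underbrace{m \times n}_{\text{direct compilation}}
\quad\text{versus}\quad
\underbrace{m + n}_{\text{via PENCIL}},
\qquad
m = 5,\ n = 3:\quad 15 \ \text{versus}\ 8 .
\]
```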

C99 Subset + Extensions
- No pointer dereferencing or pointer manipulation
- No recursive functions
- Local arrays declared as VLA: float Y[n];
- Array arguments declared as: float X[const restrict static n]
- Strict for-loop shape, e.g.: for (int i = start; i <= stop; i += stride)
- Library functions such as abs, min, max, cos, ...
- Compatibility layer for compiling with regular C99 compilers
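The restrictions above can be combined in one small function. The sketch below is plain C99 written in the restricted style; the function name scaled_add and its parameters are hypothetical, and it does not rely on the compatibility layer mentioned on the slide.

```c
/* Hypothetical example: Y[i] = Y[i] + alpha * X[i], written in the restricted style.
 * Assumes n > 0, because a variable-length array must have a positive size. */
void scaled_add(int n, float alpha,
                const float X[const restrict static n],
                float Y[const restrict static n])
{
    float T[n];  /* local array declared as a VLA */

    /* strict for-loop shape: start, inclusive stop, constant stride */
    for (int i = 0; i <= n - 1; i += 1)
        T[i] = alpha * X[i];

    for (int i = 0; i <= n - 1; i += 1)
        Y[i] = Y[i] + T[i];
}
```

The array arguments carry their extent (static n) and the restrict qualifier, so a compiler can see both the bounds and the absence of aliasing without any pointer arithmetic in the body.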

Basic PENCIL-to-OpenCL Example

PENCIL input code:

    void f(int n, float A[const restrict static n]) {
        for (int i = 0; i < n; i++) {
            A[i] = i;
        }
    }

PPCG output:
- Host code
- 1 kernel (equivalent code):

    __kernel void kernel0(int n, __global float *A) {
        int i = get_global_id(0);
        A[i] = i;
    }
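To make the mapping from loop iterations to work-items concrete, the following plain C sketch emulates a kernel launch on the host. It is not PPCG output: the names kernel0_body and f_emulated, the block_size parameter, and the bounds guard are assumptions added for illustration.

```c
/* Body executed by one work-item; gid plays the role of get_global_id(0). */
static void kernel0_body(int gid, int n, float *A)
{
    if (gid < n)             /* guard: the padded launch range can exceed n */
        A[gid] = (float)gid;
}

/* Hypothetical host-side emulation: round n up to whole blocks and run
 * every work-item of every block in turn. */
void f_emulated(int n, float *A, int block_size)
{
    int num_blocks = (n + block_size - 1) / block_size;

    for (int block = 0; block < num_blocks; block++)
        for (int local = 0; local < block_size; local++)
            kernel0_body(block * block_size + local, n, A);
}
```

Each iteration of the original loop becomes one call of the kernel body, which is why the loop disappears from the kernel and only the index computation remains.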

Independent Pragma

#pragma pencil independent
- Indicates that the compiler can ignore any loop-carried dependencies
- Not checked at runtime

Without the pragma, the compiler must assume that two iterations may write the same element of A through B[i]. B is declared as an int array here so that it can be used as an index.

    void f(int n, float A[const restrict static n],
           int B[const restrict static n]) {
    #pragma pencil independent
        for (int i = 0; i < n; i++) {
            A[B[i]] = i;
        }
    }
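Because the pragma is a promise that nothing verifies, it can help to test the promise separately while developing. The sketch below pairs a scatter loop with a hypothetical helper, indices_distinct, that checks the property the pragma relies on; both names are assumptions, and a regular C compiler simply ignores the unrecognized pragma.

```c
#include <stdbool.h>

/* Returns true if every B[i] lies in [0, n) and no value occurs twice,
 * i.e. no two iterations of the scatter loop write the same element. */
bool indices_distinct(int n, const int B[const restrict static n])
{
    for (int i = 0; i < n; i++) {
        if (B[i] < 0 || B[i] >= n)
            return false;
        for (int j = 0; j < i; j++)
            if (B[j] == B[i])
                return false;
    }
    return true;
}

void scatter(int n, float A[const restrict static n],
             const int B[const restrict static n])
{
#pragma pencil independent
    for (int i = 0; i < n; i++)
        A[B[i]] = (float)i;
}
```

If indices_distinct returns false for some input, the iterations are not independent and running them in parallel could produce a different result from the sequential loop.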

Assume Statements

__pencil_assume(...);
- Tells the compiler to assume that the given expression holds
- Not checked at runtime

    void foo(int n, int m, int S, int D[const restrict static S]) {
        __pencil_assume(m > n);
        for (int i = 0; i < n; i++) {
            D[i] = D[i+m];
        }
    }

With m > n, the elements read (D[m] to D[m+n-1]) never overlap the elements written (D[0] to D[n-1]), so the iterations can run in parallel.
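Since the assumption is not checked at runtime, a debug build can check it with a standard assert instead. The sketch below is a plain C variant of the example, not PENCIL itself: foo_checked is a hypothetical name, assert stands in for the assume statement, and the bounds check on m + n is an extra assumption of this sketch.

```c
#include <assert.h>

void foo_checked(int n, int m, int S, int D[const restrict static S])
{
    assert(m > n);                 /* the assumption from the slide */
    assert(n >= 0 && m + n <= S);  /* extra bounds check, not on the slide */

    /* reads touch D[m .. m+n-1], writes touch D[0 .. n-1]: disjoint ranges */
    for (int i = 0; i < n; i++)
        D[i] = D[i + m];
}
```

Compiling with NDEBUG defined removes the checks again, which matches the unchecked behavior described above.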

Summary Functions

__attribute__((pencil_access(...)));
- Describes memory access pattern for
  - Non-PENCIL functions (e.g. hand-optimized OpenCL)
  - PENCIL functions too complex for compiler analysis
- Summary functions are not executed

    void saxpy2_summary(int i, float x[...], float y[...], float alpha) {
        y[i] = x[i];
        y[i+1] = x[i+1];
    }

    __attribute__((pencil_access(saxpy2_summary)))
    /* Defined elsewhere */
    void saxpy2(int i, float x[...], float y[...], float alpha);

    void saxpy(int n, float x[...], float y[...], float alpha) {
        for (int i = 0; i < n; i += 2)
            saxpy2(i, x, y, alpha);
    }
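The relationship between a function and its summary is easier to see when both bodies are written out. The sketch below is plain C with several assumptions: the body of saxpy2 is a guess at what the function defined elsewhere might compute, the extra parameter n replaces the elided array extents, n is taken to be even, and the attribute is left in a comment so that any C99 compiler accepts the code.

```c
/* Hypothetical stand-in for the function "defined elsewhere":
 * updates two consecutive elements of y per call. */
void saxpy2(int n, int i,
            const float x[const restrict static n],
            float y[const restrict static n], float alpha)
{
    y[i]     = alpha * x[i]     + y[i];
    y[i + 1] = alpha * x[i + 1] + y[i + 1];
}

/* Summary: touches the same elements but does none of the arithmetic.
 * In PENCIL it would be attached to saxpy2 with
 * __attribute__((pencil_access(saxpy2_summary))) and never called. */
void saxpy2_summary(int n, int i,
                    const float x[const restrict static n],
                    float y[const restrict static n], float alpha)
{
    (void)alpha;  /* the scaling factor does not affect which elements are accessed */
    y[i]     = x[i];
    y[i + 1] = x[i + 1];
}

/* Assumes n is even, since each call handles elements i and i + 1. */
void saxpy(int n, const float x[const restrict static n],
           float y[const restrict static n], float alpha)
{
    for (int i = 0; i < n; i += 2)
        saxpy2(n, i, x, y, alpha);
}
```

The summary tells the compiler that call i reads x[i], x[i+1] and writes y[i], y[i+1], which is enough to conclude that calls with different even i do not interfere.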

PENCIL Programming Recommendations
- Prefer for-loops over while-loops
- Prefer affine conditions and array access expressions such as a[2*i][j], a[i+j], but not a[i*n+j]
- Avoid data-dependent array accesses
- Avoid data-dependent control flow
- Keep arrays multi-dimensional
- Use __pencil_assume copiously (but not recklessly!)
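The recommendations on affine subscripts and multi-dimensional arrays can be illustrated with a small loop nest. In the sketch below, the function name row_sums and its parameters are hypothetical; the point is that a[i][j] uses one plain loop variable per dimension, whereas the flattened form a[i*m+j] multiplies a loop variable by a parameter and is therefore not affine.

```c
/* Row sums over a two-dimensional array: both subscripts are affine
 * in the loop variables i and j. */
void row_sums(int n, int m,
              float a[const restrict static n][m],
              float sums[const restrict static n])
{
    for (int i = 0; i < n; i++) {
        sums[i] = 0.0f;
        for (int j = 0; j < m; j++)
            sums[i] = sums[i] + a[i][j];
    }
}
```

The loop bounds are affine as well, and neither the control flow nor the subscripts depend on the data stored in the arrays.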

PENCIL-to-OpenCL Compilation
- Polyhedral Parallel Code Generator (PPCG)
- Developed and maintained by INRIA/ENS, France
- Performs parallelization and memory management
- Produces host and kernel code
- User flags affecting code generation: tile size, grid size, block size
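As a rough illustration of what a tile size controls, the sketch below tiles a one-dimensional loop by hand. It is not PPCG output and does not use PPCG's actual flag names; add_tiled and the tile parameter are assumptions made for this example.

```c
/* Hypothetical illustration of a tile size: the iteration range [0, n)
 * is split into tiles of `tile` iterations (tile > 0). */
void add_tiled(int n, int tile,
               const float X[const restrict static n],
               float Y[const restrict static n])
{
    for (int t = 0; t < n; t += tile) {
        int end = (t + tile < n) ? t + tile : n;  /* the last tile may be partial */

        for (int i = t; i < end; i++)
            Y[i] = Y[i] + X[i];
    }
}
```

On a GPU, one natural mapping is to give each tile to a group of work-items, which is why tile, grid, and block sizes interact and usually need tuning per device.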

Basic Linear Algebra Subprograms (BLAS)
[Figure: bar chart of speedup on ARM Mali-T604 GPU, normalized to reference, for srot, sswap, sscal, scopy, saxpy, sgemv, sgbmv, ssymv, ssbmv, sspmv, sger, ssyr, sspr, ssyr2, sspr2, sgemm, ssyrk, ssyr2k. Vertical axis from 0 to 2.]

Open Problems
- Finding appropriate PPCG flags
- Producing efficient code for downstream OpenCL compiler
- Handling reduction loops efficiently
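Reduction loops are awkward to parallelize because every iteration updates the same accumulator. The sketch below contrasts a sequential sum with a two-stage version that computes independent partial sums first; it illustrates the general idea only and is not how PPCG handles reductions. The names sum_sequential, sum_chunked, and the chunks parameter are hypothetical.

```c
/* Sequential reduction: each iteration depends on the previous value of sum. */
float sum_sequential(int n, const float X[const restrict static n])
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum = sum + X[i];
    return sum;
}

/* Two-stage reduction (chunks > 0): the outer loop iterations are
 * independent of each other, and only the short final loop is sequential. */
float sum_chunked(int n, int chunks, const float X[const restrict static n])
{
    float partial[chunks];
    int len = (n + chunks - 1) / chunks;  /* elements per chunk, rounded up */

    for (int c = 0; c < chunks; c++) {
        int begin = c * len;
        int end = (begin + len < n) ? begin + len : n;

        partial[c] = 0.0f;
        for (int i = begin; i < end; i++)
            partial[c] = partial[c] + X[i];
    }

    float sum = 0.0f;
    for (int c = 0; c < chunks; c++)
        sum = sum + partial[c];
    return sum;
}
```

Reordering floating-point additions can change rounding, so in general the two functions may differ in the last bits; they agree exactly only when every intermediate sum is representable.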

Compute Benchmarks: SHOC and Rodinia
- PENCIL reimplementation of selected OpenCL benchmarks
[Figure: bar chart of speedup on ARM Mali-T604 GPU, normalized to original OpenCL, for Stencil, Gaussian, SRAD, SpMV, Radix, BFS. Vertical axis from 0 to 4.]

Conclusions and Future Work
- PENCIL: A C99-based intermediate language for Compute
  - Serves as DSL compilation target
  - Addresses OpenCL performance portability problem
  - Increases programmer productivity
- Future Work:
  - Continuing development of BLAS library
  - Porting SLAMBench to PENCIL
- Encouraging results, work in progress.

References
- http://carpproject.github.io/
- https://github.com/carpproject/