Introduction OpenCL. Kristen Boydstun. TAPIR, California Institute of Technology. August 9, 2011

Similar documents
Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011

Lecture 3. Optimising OpenCL performance

OpenCL. Administrivia. From Monday. Patrick Cozzi University of Pennsylvania CIS Spring Assignment 5 Posted. Project

Introduction to OpenCL Programming. Training Guide

Course materials. In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful:

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

COSCO 2015 Heterogeneous Computing Programming

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Programming Guide. ATI Stream Computing OpenCL. June rev1.03

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

GPGPU. General Purpose Computing on. Diese Folien wurden von Mathias Bach und David Rohr erarbeitet

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009


Altera SDK for OpenCL

How OpenCL enables easy access to FPGA performance?

OpenCL Static C++ Kernel Language Extension

Introduction to GPU Programming Languages

gpus1 Ubuntu Available via ssh

PDC Summer School Introduction to High- Performance Computing: OpenCL Lab

OpenCL Programming for the CUDA Architecture. Version 2.3

Java GPU Computing. Maarten Steur & Arjan Lamers

The Importance of Software in High-Performance Computing

OpenCL. An Introduction for HPC programmers. Tim Mattson, Intel

AMD Accelerated Parallel Processing. OpenCL Programming Guide. November rev2.7

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

WebCL for Hardware-Accelerated Web Applications. Won Jeon, Tasneem Brutch, and Simon Gibbs

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March Copyright Khronos Group Page 1

GPU ACCELERATED DATABASES Database Driven OpenCL Programming. Tim Child 3DMashUp CEO

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Introduction to GPU Architecture

AMD APP SDK v2.8 FAQ. 1 General Questions

SOFTWARE DEVELOPMENT TOOLS USING GPGPU POTENTIALITIES

Next Generation GPU Architecture Code-named Fermi

CUDA programming on NVIDIA GPUs

OpenACC 2.0 and the PGI Accelerator Compilers

THE FUTURE OF THE APU BRAIDED PARALLELISM Session 2901

Introduction to GPU hardware and to CUDA

How To Test Your Code On A Cuda Gdb (Gdb) On A Linux Computer With A Gbd (Gbd) And Gbbd Gbdu (Gdb) (Gdu) (Cuda

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

College of William & Mary Department of Computer Science

Evaluation of CUDA Fortran for the CFD code Strukti

GPU Hardware Performance. Fall 2015

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Leveraging Aparapi to Help Improve Financial Java Application Performance

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

CUDA Programming. Week 4. Shared memory and register

GPUs for Scientific Computing

GPGPU Parallel Merge Sort Algorithm

Rootbeer: Seamlessly using GPUs from Java

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

Le langage OCaml et la programmation des GPU

Introduction to GPGPU. Tiziano Diamanti

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

CUDA Basics. Murphy Stein New York University

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Performance Analysis for GPU Accelerated Applications

GPU Parallel Computing Architecture and CUDA Programming Model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Optimizing Application Performance with CUDA Profiling Tools

GPGPU Computing. Yong Cao

GPU Profiling with AMD CodeXL

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Optimizing Code for Accelerators: The Long Road to High Performance

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

GPGPUs, CUDA and OpenCL

clopencl - Supporting Distributed Heterogeneous Computing in HPC Clusters

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GPU Computing

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Lecture 1: an introduction to CUDA

Parallel Programming Survey

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

ultra fast SOM using CUDA

OpenCL for programming shared memory multicore CPUs

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

OpenACC Basics Directive-based GPGPU Programming

Monte Carlo Method for Stock Options Pricing Sample

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Case Study on Productivity and Performance of GPGPUs

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Parallel and Distributed Computing Programming Assignment 1

GPU Tools Sandra Wienke

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Transcription:

Introduction OpenCL Kristen Boydstun TAPIR, California Institute of Technology August 9, 2011 Kristen Boydstun Introduction to OpenCL August 9, 2011 1

Introduction to OpenCL OpenCL - Open Computing Language Open, royalty-free standard Initially proposed by Apple Specification maintained by the Khronos Group Developed by a number of companies Specification: set of requirements to be satisfied must be implemented to use it Device agnostic Framework for parallel programming across heterogeneous platforms consisting of: CPUs, GPUs and other processors (FPGA,...) Similar: Nvidia s CUDA Kristen Boydstun Introduction to OpenCL August 9, 2011 2

Main Idea Main Idea of OpenCL: Replace loops with data-parallel functions (kernels) that execute at each point in a problem domain Traditional vector addition loop in C void vec_add(int N, const float *a, const float *b, float *c) { int i; for(i=0; i<n; i++) c[i] = a[i] + b[i]; } Vector addition OpenCL kernel kernel void vec_add( global const float *a, global const float *b, global float *c) { int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } Code comparison - note differences: Loop over N elements N kernel instances execute in parallel Qualifiers: kernel, global Each kernel instance has a global identification number Kristen Boydstun Introduction to OpenCL August 9, 2011 3

Motivation AMD OpenCL Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication AMD Phenom II X4 965 CPU (quad core) ATI Radeon HD 5870 GPU Unoptimized CPU performance: 1 SP GFLOP/s Optimized CPU performance reaches: 4 SP GFLOP/s Optimized GPU performance reaches: 50 SP GFLOP/s Kristen Boydstun Introduction to OpenCL August 9, 2011 4

Outline Introduction Operation model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 5

Outline Introduction Operation model Platform model Execution model Programming model Memory model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 6

Platform Model Platform: one Host + one or more OpenCL Devices OpenCL Device: divided into one or more compute units Compute unit: divided into one or more processing elements Platform model designed to present a uniform view of many different kinds of parallel processors Kristen Boydstun Introduction to OpenCL August 9, 2011 7

Platform Model Mapped onto AMD GPU OpenCL Device Example: AMD Radeon HD 6970 24 compute units (SIMD engines or processors) SIMD - Single Instruction, Multiple Data High level of parallelism within a processor 64 processing elements per compute unit = 1536 total processing elements Kristen Boydstun Introduction to OpenCL August 9, 2011 8

Outline Introduction Operation model Platform model Execution model Programming model Memory model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 9

Execution Model OpenCL Application Host Code Written in C/C++ Executes on host Submits work to OpenCL device(s) Device Code Written in OpenCL C Executes on device(s) Kristen Boydstun Introduction to OpenCL August 9, 2011 10

Outline Introduction Operation model Platform model Execution model Programming model Memory model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 11

Programming Model Data-parallel programming: Set of instructions are applied in parallel to each point in some abstract domain of indices. On a SIMD processor, data parallelism achieved by performing the same task on many different pieces of data in parallel Compare to MPI, where different processors perform the same task in parallel Example: 8x8 Matrix addition MPI with a 2-processor system: CPU A could add all elements from top half of matrices, CPU B could add all elements from bottom half - each CPU performs 32 additions serially OpenCL on AMD GPU: Each of the 64 processing elements on the SIMD processor performs 1 addition Task-parallel programming: Multiple parallel tasks Kristen Boydstun Introduction to OpenCL August 9, 2011 12

Programming Model: Data-Parallelism Define N-Dimensional computation domain (N = 1,2 or 3) Kristen Boydstun Introduction to OpenCL August 9, 2011 13

Data-Parallelism with 1D Index Space When a kernel is submitted for execution, an index space is defined A kernel instance (work item, CUDA: thread) executes for each point in index space Each work item executes the same code but the path taken and data operated upon can vary per work item Work items organized into work groups (CUDA: thread blocks) Assigned a unique work group ID Work group synchronization Kristen Boydstun Introduction to OpenCL August 9, 2011 14

Data-Parallelism with 2D Index Space Example: processing a 1024 x 1024 image: Global Size(0) = Global Size(1) = 1024 1 kernel execution per pixel 1,048,576 total kernel executions Kristen Boydstun Introduction to OpenCL August 9, 2011 15

AMD GPU: Work Item Processing All processing elements within SIMD engine execute same instruction Wavefront: block of work-items that are executed together Wavefronts execute N work items in parallel, where N is specific to the GPU For AMD Radeon HD 6970, N = 64 as there are 64 processing elements per SIMD engine Consequence on branching Kristen Boydstun Introduction to OpenCL August 9, 2011 16

Outline Introduction Operation model Platform model Execution model Programming model Memory model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 17

Memory Model Private Memory (CUDA: local) Private to a work item, not visible to other work items Local Memory (CUDA: shared) Shared within a work group Constant Memory Visible to all workgroups, read-only Global Memory Accessible to all work items and the host Host Memory Host-accessible Private Memory 1 Work-Item 1 Local Memory 1 Workgroup 1 Device K Host... Private Memory M Work-Item M... Private Memory 1 Work-Item 1 Global and Constant Memory Host Memory... Local Memory N Workgroup N Private Memory M Work-Item M Kristen Boydstun Introduction to OpenCL August 9, 2011 18

Device Info Example [kboyds@fermi x86_64]$./clinfo Number of platforms: 1 Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. Number of devices: 2 Device Type: CL_DEVICE_TYPE_GPU Max compute units: 24 Global memory size: 1073741824 Constant buffer size: 65536 Local memory size: 32768 Kernel Preferred work group size multiple: 64 Name: Cayman Vendor: Advanced Micro Devices, Inc. Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_amd_fp64... Global memory size = 1024 megabytes Constant buffer size = 64 kilobytes Local memory size = 32 kilobytes Kristen Boydstun Introduction to OpenCL August 9, 2011 19

Outline Introduction Operation model OpenCL framework Platform layer Runtime OpenCL C programming language Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 20

OpenCL Framework Platform Layer Allows host to discover OpenCL devices and create contexts Runtime Allows host to manipulate contexts (memory management, command execution..) OpenCL C Programming Language Supports a subset of the ISO C99 language with extensions for parallelism Device memory hierarchy Address space qualifiers ( global, local..) Extensions for parallelism - support for work items (get global id), work groups (get group id, get local id), synchronization Kristen Boydstun Introduction to OpenCL August 9, 2011 21

Outline Introduction Operation model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 22

Vector Addition Example Simple example: Vector Addition in C void vector_add_c(const float *a, const float *b, float *c, int N) { int i; for(i=0; i<n; i++) c[i] = a[i] + b[i]; } For the OpenCL solution to this problem, there are two parts: Kernel code Host code Kristen Boydstun Introduction to OpenCL August 9, 2011 23

Vector Addition in OpenCL Kernel code kernel void vec_add ( global const float *a, global const float *b, global float *c) { // Get global identification number // (returns a value from 0 to N-1) int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } // kernel executed over N work items Kristen Boydstun Introduction to OpenCL August 9, 2011 24

Vector Addition in OpenCL Inline Kernel Code #include <CL/cl.h> // OpenCL header file // OpenCL kernel source code included inline in host source code: const char *source = " kernel void vec_add ( global const float *a, \n" " global const float *b, \n" " global float *c) \n" "{ \n" " int gid = get_global_id(0); \n" " c[gid] = a[gid] + b[gid]; \n" "} \n"; void main{}{ (...) } Kristen Boydstun Introduction to OpenCL August 9, 2011 25

Vector Addition in OpenCL Host program sets up the environment for the OpenCL program, creates and manages kernels 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 26

Vector Addition - Host 1. Initialize device (Platform layer) #include <CL/cl.h> const char *source = (...) void main(){ int N = 64; // Array length // Get the first available platform // Example: AMD Accelerated Parallel Processing cl_platform_id platform; clgetplatformids(1, // number of platforms to add to list &platform, // list of platforms found NULL); // number of platforms available // Get the first GPU device the platform provides cl_device_id device; clgetdeviceids(platform, CL_DEVICE_TYPE_GPU, 1, // number of devices to add &device, // list of devices NULL); // number of devices available } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 27

Vector Addition - Host 1. Initialize device (Platform layer) void main(){ (...) // Contexts are used by the runtime for managing program // objects, memory, and command queues // Create a context and command queue on that device cl_context context = clcreatecontext( 0, // optional (context properties) 1, // number of devices &device, // pointer to device list NULL, NULL, // optional (callback function for reporting errors) NULL); // no error code returned cl_command_queue queue = clcreatecommandqueue( context, // valid context device, // device associated with context 0, // optional (command queue properties) NULL); // no error code returned } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 28

Vector Addition in OpenCL 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 29

Vector Addition - Host 2. Build program (Compiler) void main(){ (...) // An OpenCL program is a set of OpenCL kernels and // auxiliary functions called by the kernels // Create program object and load source code into program object cl_program program = clcreateprogramwithsource(context, 1, // number of strings &source, // strings NULL, // string length or NULL terminated NULL); // no error code returned } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 30

2. Build program (Compiler) void main(){ (...) Vector Addition - Host // Build program executable from program source clbuildprogram(program, 1, // number of devices &device, // pointer to device list NULL, // optional (build options) NULL, NULL); // optional (callback function, argument) // Create kernel object cl_kernel kernel = clcreatekernel( program, // program object "vec_add", // kernel name in program NULL); // no error code returned } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 31

Vector Addition in OpenCL 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 32

Vector Addition - Host 3. Create buffers (Runtime layer) void main(){ (...) // Initialize arrays cl_float *a = (cl_float *) malloc(n*sizeof(cl_float)); cl_float *b = (cl_float *) malloc(n*sizeof(cl_float)); int i; for(i=0;i<n;i++){ a[i] = i; b[i] = N-i; } } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 33

3. Create buffers (Runtime layer) void main(){ (...) Vector Addition - Host // A buffer object is a handle to a region of memory cl_mem a_buffer = clcreatebuffer(context, CL_MEM_READ_ONLY // buffer object read only for kernel CL_MEM_COPY_HOST_PTR, // copy data from memory referenced // by host pointer N*sizeof(cl_float), // size in bytes of buffer object a, // host pointer NULL); // no error code returned cl_mem b_buffer = clcreatebuffer(context, CL_MEM_READ_ONLY CL_MEM_COPY_HOST_PTR, N*sizeof(cl_float), b, NULL); cl_mem c_buffer = clcreatebuffer(context, CL_MEM_WRITE_ONLY, N*sizeof(cl_float), NULL, NULL); (...) } Kristen Boydstun Introduction to OpenCL August 9, 2011 34

Vector Addition in OpenCL 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 35

Vector Addition - Host 4. Set arguments, enqueue kernel (Runtime layer) void main(){ (...) size_t global_work_size = N; // Set the kernel arguments clsetkernelarg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer); clsetkernelarg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer); clsetkernelarg(kernel, 2, sizeof(c_buffer), (void*) &c_buffer); // Enqueue a command to execute the kernel on the GPU device clenqueuendrangekernel(queue, kernel, 1, NULL, // global work items dimensions and offset &global_work_size, // number of global work items NULL, // number of work items in a work group 0, NULL, // don t wait on any events to complete NULL); // no event object returned } (...) Kristen Boydstun Introduction to OpenCL August 9, 2011 36

Vector Addition in OpenCL 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 37

Vector Addition - Host 5. Read back results (Runtime layer) void main(){ (...) } // Block until all commands in command-queue have completed clfinish(queue); // Read back the results cl_float *c = (cl_float *) malloc(n*sizeof(cl_float)); clenqueuereadbuffer( queue, // command queue in which read command will be queued c_buffer, // buffer object to read back CL_TRUE, // blocking read - doesn t return until buffer copied 0, // offset in bytes in buffer object to read from N * sizeof(cl_float), // size in bytes of data being read c, // pointer to host memory where data is to be read into 0, NULL, // don t wait on any events to complete NULL); // no event object returned Kristen Boydstun Introduction to OpenCL August 9, 2011 38

Vector Addition - Host Cleanup void main(){ (...) free(a); free(b); free(c); clreleasememobject(a_buffer); clreleasememobject(b_buffer); clreleasememobject(c_buffer); clreleasekernel(kernel); clreleaseprogram(program); clreleasecontext(context); clreleasecommandqueue(queue); } Kristen Boydstun Introduction to OpenCL August 9, 2011 39

Vector Addition in OpenCL 5 steps in a basic Host program 1. Initialize device (Platform layer) 2. Build program (Compiler) 3. Create buffers (Runtime layer) 4. Set arguments, enqueue kernel (Runtime layer) 5. Read back results (Runtime layer) Kristen Boydstun Introduction to OpenCL August 9, 2011 40

Outline Introduction Operation model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 41

Optimization Strategies Expose data parallelism in algorithms Minimize host-device data transfer Overlap memory transfer with computation Prevent path divergence between work items Number of work items per work group should be a multiple of the wavefront size (64 for AMD Radeon HD 6970) Use local memory as a cache Others: memory coalescing, bank conflicts, OpenCL C vector data types.. Kristen Boydstun Introduction to OpenCL August 9, 2011 42

Outline Introduction Operation model OpenCL framework Example: vector addition Optimization Resources Kristen Boydstun Introduction to OpenCL August 9, 2011 43

OpenCL Resources Khronos OpenCL specification, reference card, tutorials, etc: http://www.khronos.org/opencl AMD OpenCL Resources: http://developer.amd.com/opencl NVIDIA OpenCL Resources: http://developer.nvidia.com/opencl MacResearch: 6 OpenCL tutorials: http://www.macresearch.org/opencl-tutorials June 2011 Cern Computing Seminar: http://indico.cern.ch/conferencedisplay.py?confid=138427 Kristen Boydstun Introduction to OpenCL August 9, 2011 44