OpenCL Parallel Computing on the GPU and CPU. Aaftab Munshi


Opportunity: Processor Parallelism

Today's processors are increasingly parallel:
- CPUs: multiple cores are driving performance increases
- GPUs: transforming into general-purpose data-parallel computational coprocessors, with improving numerical precision (single and double)

Challenge: Processor Parallelism

Writing parallel programs is different for the CPU and GPU:
- Differing domain-specific techniques
- Vendor-specific technologies
- Graphics APIs are not an ideal abstraction for general-purpose compute

Introducing OpenCL

OpenCL (Open Computing Language):
- Approachable language for accessing heterogeneous computational resources
- Supports parallel execution on single or multiple processors: GPU, CPU, GPU + CPU, or multiple GPUs
- Desktop and Handheld profiles
- Designed to work with graphics APIs such as OpenGL

OpenCL = Open Standard

- Specification under review by the Khronos OpenCL working group (www.khronos.org)
- Royalty free, cross-platform, vendor neutral
- Based on a proposal by Apple; developed in collaboration with industry leaders
- Performance-enhancing technology in Mac OS X Snow Leopard

OpenCL Working Group Members: Broad Industry Support. Copyright Khronos Group, 2008.

OpenCL: A Sneak Preview

Design Goals of OpenCL

- Use all computational resources in the system: GPUs and CPUs as peers
- Data- and task-parallel compute model
- Efficient parallel programming model based on C
- Abstract the specifics of the underlying hardware
- Specify the accuracy of floating-point computations: IEEE 754-compliant rounding behavior; define the maximum allowable error of math functions
- Drive future hardware requirements

OpenCL Software Stack

- Platform Layer: query and select compute devices in the system; initialize compute device(s); create compute contexts and work-queues
- Runtime: resource management; execute compute kernels
- Compiler: a subset of ISO C99 with appropriate language additions; compile and build compute program executables online or offline

OpenCL Execution Model

- Compute kernel: basic unit of executable code, similar to a C function; data-parallel or task-parallel
- Compute program: collection of compute kernels and internal functions, analogous to a dynamic library
- Applications queue compute kernel execution instances: queued in-order; executed in-order or out-of-order; events are used to implement appropriate synchronization between execution instances

OpenCL Data-Parallel Execution

- Define an N-dimensional computation domain; each independent element of execution in the N-D domain is called a work-item
- The N-D domain defines the total number of work-items that execute in parallel: the global work size
- Work-items can be grouped together into a work-group: work-items in a group can communicate with each other, and can synchronize execution among themselves to coordinate memory access
- Multiple work-groups execute in parallel: a mapping of the global work size onto work-groups

OpenCL Task-Parallel Execution

- The data-parallel execution model must be implemented by all OpenCL compute devices
- Some compute devices, such as CPUs, can also execute task-parallel compute kernels; each such kernel executes as a single work-item
- A task-parallel kernel can be a compute kernel written in OpenCL, or a native C / C++ function

OpenCL Memory Model

- Implements a relaxed-consistency, shared-memory model
- Multiple distinct address spaces; address spaces can be collapsed
- Private memory: one per work-item, within each compute unit (address qualifier: private)
- Local memory: shared by the work-items of one work-group, one per compute unit (address qualifier: local)
- Global / constant memory: visible to all work-items, held in compute device memory behind a data cache (address qualifiers: constant and global). Example: global float4 *p;

Language for Writing Compute Kernels

- Derived from ISO C99, with a few restrictions: no recursion, no function pointers, no functions from the C99 standard headers, ...
- Preprocessing directives defined by C99 are supported
- Built-in data types: scalar and vector data types; structs; pointers; data-type conversion functions (convert_type<_sat><_roundingmode>); image types

Beyond Programmable Shading: Fundamentals

Language for Writing Compute Kernels: Built-in Functions

- Required: work-item functions; math.h functions; read and write image functions; relational functions; geometric functions; synchronization functions
- Optional: double precision; atomics to global and local memory; selection of rounding mode

OpenCL FFT Example - Host API

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);


OpenCL FFT Example - Host API

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1d_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1d_1024");


OpenCL FFT Example - Host API

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

OpenCL FFT Example - Compute Kernel

// This kernel computes an FFT of length 1024. The 1024-length FFT is decomposed
// into calls to a radix-16 function, another radix-16 function, and then a
// radix-4 function.
// Based on "Fitting FFT onto the G80 Architecture", Vasily Volkov & Brian Kazian,
// UC Berkeley CS258 project report, May 2008.
__kernel void fft1d_1024(__global float2 *in, __global float2 *out,
                         __local float *smemx, __local float *smemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);              // coalesced global reads
    fftRadix16Pass(data);                   // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);   // twiddle factor multiplication

    // local shuffle using local memory
    localShuffle(data, smemx, smemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                   // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);     // twiddle factor multiplication
    localShuffle(data, smemx, smemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}

OpenCL and OpenGL

- Sharing OpenGL resources: OpenCL is designed to efficiently share with OpenGL
- Textures, buffer objects, and renderbuffers can be shared
- Data is shared, not copied
- Efficient queuing of OpenCL and OpenGL commands
- Apps can select the compute device(s) that will run OpenGL and OpenCL

Summary

- A new compute language that works across GPUs and CPUs
- C99 with extensions: familiar to developers, with a rich set of built-in functions
- Makes it easy to develop data- and task-parallel compute programs
- Defines hardware and numerical precision requirements
- An open standard for heterogeneous parallel computing