Lecture 1: an introduction to OpenCL


Lecture 1: an introduction to OpenCL

Mike Giles
mike.giles@maths.ox.ac.uk
Oxford University Mathematical Institute
Oxford e-Research Centre

Edited from the CUDA originals by Tom Deakin

Overview

- hardware view
- software view
- OpenCL programming

Hardware view

At the top level, a PCIe graphics card with a many-core GPU and high-speed graphics device memory sits inside a standard PC/server with one or two multicore CPUs:

[diagram: motherboard with multicore CPU(s) and DDR3 memory, connected over PCIe to a graphics card with a GPU and GDDR5 memory]

Hardware view

There are multiple GPU products out at the moment.

Consumer graphics cards (GeForce):
- GTX680: 1536 cores, 2/4GB (£360/£440)
- GTX690: 2x1536 cores, 2x2GB (£800)
- AMD Radeon HD 7970 GHz Ed.: 2048 stream proc., 3GB (£420)

Dedicated HPC cards (no graphics output):
- K10 module: 2x1536 cores, 2x4GB
- K20 card: 2496 cores, 5GB
- K20X module: 2688 cores, 6GB

Hardware view

We take a brief look at NVIDIA GPU design. The building block is a streaming multiprocessor (SMX):
- 192 cores and 64K registers
- 64KB of shared memory / L1 cache
- 8KB cache for constants
- 48KB texture cache for read-only arrays
- up to 2K threads per SMX

Different chips have different numbers of these SMXs:

product      SMXs   bandwidth   memory   power
GTX 650 Ti   4      86 GB/s     1/2 GB   110W
GTX 680      8      190 GB/s    2/4 GB   195W
K10 (x2)     8      160 GB/s    4 GB     110W
K20X         14     250 GB/s    6 GB     235W

Hardware View

[diagram: Kepler GPU, multiple SMXs sharing an L2 cache, each SMX with its own L1 cache / shared memory]

Hardware View

[diagram: Fermi GPU, multiple SMs sharing an L2 cache, each SM with its own L1 cache / shared memory]

Multithreading

The key hardware feature is that the cores in an SMX are SIMT (Single Instruction Multiple Threads) cores:
- all cores execute the same instructions simultaneously, but with different data
- similar to vector computing on CRAY supercomputers
- a minimum of 32 threads all doing the same thing at (almost) the same time
- natural for graphics processing and much scientific computing
- SIMT is also a natural choice for many-core chips to simplify each core

Multithreading

Lots of active threads is the key to high performance:
- no context switching; each thread has its own registers, which limits the number of active threads
- threads become inactive whilst waiting for data, or when part of the compute group takes a divergent path (if statements)

Multithreading

- for each thread, one operation completes long before the next starts
- this avoids the complexity of pipeline overlaps which can limit the performance of modern processors

[diagram: successive operations 1-5 for several threads, interleaved in time]

- memory access from device memory has a latency of 400-600 cycles; with 40 active threads this is equivalent to 10-15 operations per thread, so hopefully there's enough computation to hide the latency

Software view

At the top level, we have a master process which runs on the CPU and performs the following steps:
1. initialises compute device
2. defines problem domain
3. allocates memory in host and on device
4. copies data from host to device memory
5. launches execution kernel on device
6. copies data from device memory to host
7. repeats 4-6 as needed
8. de-allocates all memory and terminates

Software view

At a lower level, within the compute device:
- each compute device is composed of multiple work-groups
- each work-group is composed of multiple work-items
- all work-items execute an instance of the kernel simultaneously
- all work-items within one work-group can access shared local memory, but can't see what other work-items are doing
- relaxed consistency memory model, i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times

Platform model

[diagram: the OpenCL platform model, a host connected to one or more compute devices, each composed of compute units containing processing elements]

OpenCL

OpenCL (Open Computing Language) is a program development environment maintained by Khronos:
- based on C functions to set up and communicate with the compute device
- compute kernels are written in C; OpenCL provides some common functions as part of the runtime API
- an alternative C++ API is available for host code
- an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors

Installing OpenCL

Two components:
- driver: low-level software that controls the graphics card
- implementation: often packaged as an SDK, containing tools and examples; vendor-specific OpenCL library and include files; vendors include AMD, NVIDIA, Intel, Apple, etc.

OpenCL programming

We have already explained that an OpenCL program has two pieces:
- host code on the CPU which interfaces to the device
- kernel code which runs on the device

At the host level, there is a choice of 2 APIs (Application Programming Interfaces):
- C: the original API
- C++: built on top of the C API

We will mostly use the C API in this course; the latest C++ API looks promising and simplifies the host code. However, it is useful to know what is really going on.

OpenCL programming

At the host code level, there are library routines for:
- device initialisation
- memory allocation on the graphics card
- data transfer to/from device memory (constants, images, ordinary data)
- programs and kernels
- command queues

OpenCL programming

The boilerplate is similar between each program you write. It's complicated, but DON'T PANIC!

1. Define the platform
- get platform: clGetPlatformIDs()
- discover devices within platform: clGetDeviceIDs()
- create context for device: clCreateContext()
- create command queue to feed device: clCreateCommandQueue()
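A minimal sketch of step 1 using the C API, assuming one platform with a single GPU device (a real program should check the cl_int error code returned by every call):

  cl_platform_id platform;
  cl_device_id   device;
  cl_int         err;

  err = clGetPlatformIDs(1, &platform, NULL);               // get first platform
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                       1, &device, NULL);                   // first GPU device
  cl_context context = clCreateContext(NULL, 1, &device,
                                       NULL, NULL, &err);   // context for device
  cl_command_queue queue = clCreateCommandQueue(context, device,
                                                0, &err);   // in-order queue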

OpenCL programming

2. Create and build the program
- build program object: clCreateProgramWithSource()
- compile program to build library of kernels: clBuildProgram()
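A sketch of step 2, assuming the kernel source text has already been read into the string source (const char *); on a build failure it is useful to print the compiler log:

  cl_program program = clCreateProgramWithSource(context, 1, &source,
                                                 NULL, &err);
  err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  if (err != CL_SUCCESS) {                      // print build log on failure
    char log[4096];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, NULL);
    printf("%s\n", log);
  }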

OpenCL programming

3. Setup memory objects
- allocate and initialise input vectors on host
- define OpenCL memory objects: clCreateBuffer()
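For example, for a vector of N floats (N, h_x and d_x are illustrative names, following the host/device labelling convention used in the practical):

  size_t nbytes = N * sizeof(float);
  float *h_x = (float *) malloc(nbytes);           // host array
  for (int i = 0; i < N; i++) h_x[i] = (float) i;  // initialise on host
  cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE,
                              nbytes, NULL, &err); // device buffer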

OpenCL programming

4. Define the kernel
- create kernel object from program: clCreateKernel()
- attach arguments to the kernel: clSetKernelArg()
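A sketch of step 4, using the my_first_kernel example shown later in this lecture:

  cl_kernel kernel = clCreateKernel(program, "my_first_kernel", &err);
  err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_x);  // argument 0 is the buffer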

OpenCL programming

5. Submit commands
- write buffers from host into global memory: clEnqueueWriteBuffer()
- enqueue kernel for execution: clEnqueueNDRangeKernel()
- read back result: clEnqueueReadBuffer()

Note: the command queue is in-order, so as long as the read back is a blocking call the others do not need to be.
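A sketch of step 5; because the queue is in-order, only the final read needs to block (CL_TRUE):

  size_t global = 128000, local = 128;  // illustrative sizes, see later slide
  err = clEnqueueWriteBuffer(queue, d_x, CL_FALSE, 0, nbytes,
                             h_x, 0, NULL, NULL);            // non-blocking write
  err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);
  err = clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0, nbytes,
                            h_x, 0, NULL, NULL);             // blocking read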

OpenCL programming

At the lower level, when one instance of the kernel is started on a device, it is executed by a number of work-items, each of which knows about:
- some variables passed as arguments
- memory buffers in global or local memory
- global constants in global memory
- local memory and private registers/variables
- some special functions:
  get_global_id(): index in domain
  get_local_id(): index in work-group
  get_group_id(): index of work-group
  get_local_size(): size of work-group
  etc.
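To see how these fit together: assuming a zero global work offset, the global index can be reconstructed from the work-group and local indices. A short kernel fragment (OpenCL C):

  int gid  = get_global_id(0);
  int gid2 = get_group_id(0) * get_local_size(0) + get_local_id(0);
  // gid == gid2 for every work-item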

OpenCL programming

The kernel code looks fairly normal once you get used to two things:

- the code is written from the point of view of a single thread
  - quite different to OpenMP multithreading
  - similar to MPI, where you use the MPI rank to identify the MPI process
  - all private variables are private to that thread

- you need to think about where each variable lives (more on this in the next lecture)
  - any operation involving data in the device memory forces its transfer to/from registers in the GPU
  - it is often better to copy the value into a private register variable

Kernel code

__kernel void my_first_kernel(__global float *x)
{
    int tid = get_global_id(0);
    x[tid] = (float) get_local_id(0);
}

- the __kernel identifier says it's a kernel function
- each work-item sets one element of the x array
- within each work-group, get_local_id(0) ranges from 0 to get_local_size(0)-1, so each thread has a unique value for tid

OpenCL programming

Suppose we have 1000 work-groups, each with 128 work-items. In this simple case we have a 1D grid, and a 1D set of work-items making up each work-group. Then the global size is 128000, and the local size is 128.

If we want to use a 2D grid, we would set our global (and local) work size arrays with two elements:

  const size_t global[] = {nx, ny};

We specify the problem dimension when we enqueue the kernel. Problems can be 1-dimensional (like an array), 2-dimensional (like a grid) or 3-dimensional (like a cube):

  clEnqueueNDRangeKernel(queue, kernel, work_dim, offset,
                         global, local, 0, NULL, NULL);
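As an illustrative sketch of a 2D launch, assuming nx and ny are multiples of 16:

  const size_t global[] = {nx, ny};  // total work-items in each dimension
  const size_t local[]  = {16, 16};  // 16x16 = 256 work-items per work-group
  err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                               global, local, 0, NULL, NULL);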

Practical 1

- start from the code shown above (but with comments)
- test error-checking and printing from kernel functions
- modify the code to add two vectors together (including sending them over from the host to the device)
- if time permits, look at the OpenCL examples in the CUDA SDK

Practical 1

Things to note:

- memory allocation:
  cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, NULL);

- data copying:
  clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0, nbytes, h_x, 0, NULL, NULL);

- reminder: the prefixes h_ and d_ to distinguish between arrays on the host and on the device are not mandatory, just helpful labelling

- the kernel routine is declared with the __kernel prefix, and is written from the point of view of a single thread

Practical 1

The second version of the code is very similar to the first, but uses a header file with various safety checks which gives useful feedback in the event of errors:

- check error return codes (clAPIcall stands for any OpenCL API call):
  clUtilSafeCall(clAPIcall( ... ));

- check errors passed back through an API variable:
  clAPIcall( ..., &err);
  clUtilSafeCall(err);
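The header used in the practical may differ, but a hypothetical version of such a safety macro looks like this (clUtilSafeCall is the illustrative name used above):

  #define clUtilSafeCall(err)                             \
    do {                                                  \
      if ((err) != CL_SUCCESS) {                          \
        fprintf(stderr, "OpenCL error %d at %s:%d\n",     \
                (int)(err), __FILE__, __LINE__);          \
        exit(EXIT_FAILURE);                               \
      }                                                   \
    } while (0)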

Practical 1

One thing to experiment with is the use of printf within a kernel function (requires an OpenCL v1.2 SDK, i.e. AMD):
- essentially the same as standard printf; minor difference in the integer return code
- each thread generates its own output; use conditional code if you want output from only one thread
- output from printf is flushed to an implementation-defined output stream at synchronisation points
- need to use clFinish(queue); at the end of the main code to flush all pending output
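For example, restricting output to a single work-item (a sketch in OpenCL C, based on the kernel shown earlier):

  __kernel void my_first_kernel(__global float *x)
  {
      int tid = get_global_id(0);
      x[tid] = (float) get_local_id(0);
      if (tid == 0)                      // only one thread prints
          printf("x[0] = %f\n", x[0]);
  }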

Key reading

OpenCL Specification, version 1.2:
- Chapter 3: The OpenCL Architecture
- Chapter 4: The OpenCL Platform Layer

OpenCL Programming Guide:
Aaftab Munshi, Benedict Gaster, Timothy G. Mattson and James Fung, 2011

Heterogeneous Computing with OpenCL:
Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry and Dana Schaa, 2011