OpenCL 1.1 Enhancements for Multi-GPU Performance

Similar documents
Next Generation GPU Architecture Code-named Fermi


Stream Processing on GPUs Using Distributed Multimedia Middleware

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March Copyright Khronos Group Page 1

COSCO 2015 Heterogeneous Computing Programming

Press Briefing. GDC, March Neil Trevett Vice President Mobile Ecosystem, NVIDIA President Khronos. Copyright Khronos Group Page 1

Vulkan on NVIDIA GPUs. Piers Daniell, Driver Software Engineer, OpenGL and Vulkan

Parallel Firewalls on General-Purpose Graphics Processing Units

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Final Project Report. Trading Platform Server

NVIDIA Tools For Profiling And Monitoring. David Goodwin

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Introduction to GPU hardware and to CUDA

Lecture 3. Optimising OpenCL performance

Introduction to GPU Programming Languages

Optimizing Application Performance with CUDA Profiling Tools

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

OpenCL Programming for the CUDA Architecture. Version 2.3

GPU Profiling with AMD CodeXL

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

Course materials. In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful:

Case Study on Productivity and Performance of GPGPUs

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introduction to GPU Computing

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

HPC with Multicore and GPUs

Data Center and Cloud Computing Market Landscape and Challenges

Performance Analysis for GPU Accelerated Applications

Resource Utilization of Middleware Components in Embedded Systems

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs

Petascale Visualization: Approaches and Initial Results

Crosswalk: build world class hybrid mobile apps

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

Intel DPDK Boosts Server Appliance Performance White Paper

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Using Intel Graphics Performance Analyzer (GPA) to analyze Intel Media Software Development Kitenabled

Introduction to WebGL

OpenACC 2.0 and the PGI Accelerator Compilers

OpenCL for programming shared memory multicore CPUs

Thesis Proposal: Improving the Performance of Synchronization in Concurrent Haskell

QuickSpecs. NVIDIA Quadro K5200 8GB Graphics INTRODUCTION. NVIDIA Quadro K5200 8GB Graphics. Technical Specifications

CHAPTER 1 INTRODUCTION

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Next Generation Operating Systems

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Embedded Systems: map to FPGA, GPU, CPU?

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

Introduction to OpenCL Programming. Training Guide

Process Scheduling CS 241. February 24, Copyright University of Illinois CS 241 Staff

QuickSpecs. NVIDIA Quadro K5200 8GB Graphics INTRODUCTION. NVIDIA Quadro K5200 8GB Graphics. Overview. NVIDIA Quadro K5200 8GB Graphics J3G90AA

QuickSpecs. NVIDIA Quadro M GB Graphics INTRODUCTION. NVIDIA Quadro M GB Graphics. Overview

Le langage OCaml et la programmation des GPU

Appendix L. General-purpose GPU Radiative Solver. Andrea Tosetto Marco Giardino Matteo Gorlani (Blue Engineering & Design, Italy)

Direct GPU/FPGA Communication Via PCI Express

Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

WebCL for Hardware-Accelerated Web Applications. Won Jeon, Tasneem Brutch, and Simon Gibbs

BSC vision on Big Data and extreme scale computing

- An Essential Building Block for Stable and Reliable Compute Clusters

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

On Demand Satellite Image Processing

Parallel Computing with MATLAB

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures

Introduction to Cloud Computing

GPU for Scientific Computing. -Ali Saleh

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

BLM 413E - Parallel Programming Lecture 3

Operating System for the K computer

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014

How To Manage An Sap Solution

Parallels Desktop 4 for Windows and Linux Read Me

Parallels Cloud Server 6.0

Using process address space on the GPU

SAM XFile. Trial Installation Guide Linux. Snell OD is in the process of being rebranded SAM XFile

How To Understand And Understand An Operating System In C Programming

TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING

Intel Xeon +FPGA Platform for the Data Center

Amazon EC2 Product Details Page 1 of 5

GraySort on Apache Spark by Databricks

Transcription:

OpenCL 1.1 Enhancements for Multi-GPU Performance Eric Young, NVIDIA Copyright Khronos Group, 2011 - Page 1

Overview NVIDIA Driver Availability Driver Performance Improvements OpenCL 1.1 Improvements OpenCL Multi-threading Example Summary Copyright Khronos Group, 2011 - Page 2

NVIDIA Driver Availability NVIDIA R280 drivers now support OpenCL 1.1! Available on Desktop and Mobile NVIDIA Driver Version Operating System Available on NVIDIA.COM 280.26 Release WHQL Vista and Windows 7 08/09/2011 280.13 Release Linux 08/01/2011 Copyright Khronos Group, 2011 - Page 3

NVIDIA Performance Improvements Driver Optimizations: - OpenCL driver adds multi-threaded optimizations. - Non-blocking API calls are enqueued and processed via worker threads. - Support for cross multi-gpu device synchronization. - CPU threads will not block during synchronization. - Less driver overhead with this type of synchronization. Data I/O Improvements - For data transfers between CPU and GPU with pageable memory buffers are asynchronous. Copyright Khronos Group, 2011 - Page 4

OpenCL 1.1 Improvements Commands can be sent from different host threads - APIs are thread safe except for clsetkernelarg( ) Support for event callback functions on Host - Enables heterogeneous computing (CPU + GPU) to be more efficient for both serial and parallel algorithms. - No explicit CPU/GPU synchronization required. Copyright Khronos Group, 2011 - Page 5

Callbacks and Events OpenCL 1.0 has support events - clenqueuebarrier(), clenqueuemarker(), clenqueuewaitforevent() for command-queue synchronization. - clwaitforevents() for waiting on a event. OpenCL 1.1 adds callback functions triggered by an event - clseteventcallback() passes a callback function that will be triggered by a user event. The callback function will get launched it a separate thread. - Use callback functions to distribute additional compute workloads on the GPU or CPU. Copyright Khronos Group, 2011 - Page 6

OpenCL Multi-Thread Example CPU thread initializes an array to be copied each GPU. - When each thread finishes, events (CPU Done Event) are set. 8 host threads run on 2 GPU devices. GPU 0 GPU 1 Copyright Khronos Group, 2011 - Page 7

OpenCL Multi-Thread Example CPU is not blocked during enqueuing of commands. Each host thread has an OpenCL command queue and event. Triggering an event trigger will launch the callback function in a separate thread. Copyright Khronos Group, 2011 - Page 8

With OpenCL 1.1 Data upload clenqueuewritebuffer( ) waits for CPU Done Event. clsetkernelcallback( ) - CPU post-process function gets triggered by the GPU Done Event. Upload Data CPU to GPU Launch Kernel on GPU Readback GPU to CPU CPU Done Event Prepare Data on CPU (thread) GPU Done Event Post-process Data on CPU Copyright Khronos Group, 2011 - Page 9

With OpenCL 1.0 clwaitforevents (GPU Done Event) before GPU->CPU readback and launching CPU Post-Process. Explicit CPU/GPU synchronization required. Prepare Data On CPU (thread) No Is GPU Done? Yes Post-Process Data on CPU CPU Done Event Upload Data CPU to GPU Launch Kernel GPU Blocks on GPU Event Readback GPU to CPU Copyright Khronos Group, 2011 - Page 10

oclmultithreads Example of oclmultithreads with multiple GPUs using OpenCL 1.1. [oclmultithreads.exe] starting... Detected 2 OpenCL devices of type CL_DEVICE_TYPE_GPU > OpenCL Device #0 (Tesla C2050), cl_device_id: 5445872 > OpenCL Device #1 (Quadro 4000), cl_device_id: 5445928 Tesla C2050: simpleincrement(), cl_device_id: 44505176, event_id: 0 Quadro 4000: simpleincrement(), cl_device_id: 44535184, event_id: 1 Tesla C2050: simpleincrement(), cl_device_id: 44505176, event_id: 2 Quadro 4000: simpleincrement(), cl_device_id: 44535184, event_id: 3 Tesla C2050: simpleincrement(), cl_device_id: 44505176, event_id: 4 Quadro 4000: simpleincrement(), cl_device_id: 44535184, event_id: 5 Tesla C2050: simpleincrement(), cl_device_id: 44505176, event_id: 6 Quadro 4000: simpleincrement(), cl_device_id: 44535184, event_id: 7 Copyright Khronos Group, 2011 - Page 11

oclmultithreads (contd.) Timing results from each thread and event > event_callback() event_id=0, kernel runtime 49.389344 (ms) > event_callback() event_id=1, kernel runtime 82.967264 (ms) > event_callback() event_id=2, kernel runtime 91.719552 (ms) > event_callback() event_id=4, kernel runtime 43.965088 (ms) > event_callback() event_id=3, kernel runtime 26.245088 (ms) > event_callback() event_id=5, kernel runtime 65.455936 (ms) > event_callback() event_id=6, kernel runtime 59.854240 (ms) > event_callback() event_id=7, kernel runtime 30.516416 (ms) Copyright Khronos Group, 2011 - Page 12

Summary NVIDIA OpenCL driver is multi-threaded: - Enqueuing commands and Data I/O are asynchronous to the application thread (non-blocking). - Improves overlap of the CPU and GPU workloads. OpenCL 1.1 adds - Event triggered callback functions: - Callback functions are non-blocking and run in a separate thread. Take advantage of these features to achieve: - Higher Compute throughput (CPU + GPU efficiency). - Higher GPU utilization (less time idle). - Lower API overhead with single and multiple GPUs. Copyright Khronos Group, 2011 - Page 13

Thank You! Contact: Eric Young eyoung@nvidia.com Copyright Khronos Group, 2011 - Page 14

OpenCL Programming Guide By Aaftab Munshi, Benedict R Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg Key Features (includes) Understanding OpenCL s core models, concepts, terminology, goals, and rationale Discovering and preparing available resources Programming with OpenCL C, its built-in functions, and runtime API Using buffers, sub-buffers, images, samplers, and events Sharing and synchronizing data with OpenGL and Microsoft s Direct3D Simplifying development with the C++ Wrapper API Using OpenCL Embedded Profiles to support devices ranging from cellphones to supercomputer nodes Building complete applications: image histograms, edge detection filters, physics simulations, Fast Fourier Transforms, optical flow, and more Using OpenCL with PyOpenCL, including issues in porting from C++ to Python Performing matrix multiplication and high-performance sparse matrix multiplication Copyright Khronos Group, 2011 - Page 15