Cross-Platform GPGPU with OpenCL

George van Venrooij
Organic Vectory B.V.
george.van.venrooij@organicvectory.com

Organic Vectory BV
Project Services, Consultancy Services
Expertise and markets: 3D Visualization, Architecture/Design, Computing, Embedded Software, GIS, Finance

As defined by the Khronos Group (www.khronos.org):
"OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software."

The Khronos Group controls many other standards, such as:
- OpenGL (ES)
- OpenVG
- COLLADA
- WebGL
- and many more

OpenCL vs. CUDA

OpenCL & CUDA Device Memory Model (source: NVIDIA tutorial slides)

Terminology

OpenCL               CUDA
Work-Item            Thread
Work-Group           Thread-Block
Global Memory        Global Memory
Constant Memory      Constant Memory
Local Memory         Shared Memory
Private Memory       Local Memory

Qualifiers

OpenCL                  CUDA
__kernel                __global__
(no qualifier needed)   __device__ (function)
__constant              __constant__
__global                __device__ (variable)
__local                 __shared__

Indexing

OpenCL               CUDA
get_num_groups()     gridDim
get_local_size()     blockDim
get_group_id()       blockIdx
get_local_id()       threadIdx
get_global_id()      (calculate manually)
get_global_size()    (calculate manually)

API Objects

OpenCL               CUDA
cl_device_id         CUdevice
cl_context           CUcontext
cl_program           CUmodule
cl_kernel            CUfunction
cl_mem               CUdeviceptr
cl_command_queue     CUstream

Kernel Language

OpenCL:
- Subset of C99 with data-parallel extensions
- Requires run-time compilation by driver
- No function pointers or recursion

CUDA:
- C for CUDA, a subset of C with data-parallel extensions
- C++ features (function templates) enable higher productivity through meta-programming techniques
- Compilation through separate compiler: NVCC
- No function pointers or recursion

Device Thread Synchronization

OpenCL               CUDA
barrier()            __syncthreads()
(no equivalent)      __threadfence()
mem_fence()          __threadfence_block()
read_mem_fence()     (no equivalent)
write_mem_fence()    (no equivalent)
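The two indexing entries marked "(calculate manually)" come down to simple arithmetic on the other four values. The sketch below is plain C, not OpenCL or CUDA code: the struct `index_1d` and the functions `global_id` and `global_size` are hypothetical names introduced here for illustration, they cover one dimension only, and they assume a zero global offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for one dimension of the built-in index values. */
typedef struct {
    size_t num_groups;  /* get_num_groups(0)  /  gridDim.x   */
    size_t local_size;  /* get_local_size(0)  /  blockDim.x  */
    size_t group_id;    /* get_group_id(0)    /  blockIdx.x  */
    size_t local_id;    /* get_local_id(0)    /  threadIdx.x */
} index_1d;

/* Manual equivalent of get_global_id(0), assuming a zero global offset. */
size_t global_id(index_1d ix)
{
    assert(ix.group_id < ix.num_groups);
    assert(ix.local_id < ix.local_size);
    return ix.group_id * ix.local_size + ix.local_id;
}

/* Manual equivalent of get_global_size(0). */
size_t global_size(index_1d ix)
{
    return ix.num_groups * ix.local_size;
}
```

For example, with 4 groups of 64 items each, the item with local id 5 in group 2 has global id 2 * 64 + 5 = 133 out of a global size of 256. In a CUDA kernel the same expression is typically written directly as `blockIdx.x * blockDim.x + threadIdx.x`.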

Performance Comparison (1)
Profiler Tool
accelereyes.com: OpenCL vs. CUDA, tested on Tesla C20

Performance Comparison (2)
NVidia GeForce GT 285, OpenCL vs. CUDA:
- Particle simulation
- PI approximation

Performance Comparison (3)
NVidia GeForce GT 285, OpenCL vs. CUDA:
- Random global memory reads
- Random global memory writes

[Charts not reproduced in this transcript. Panels: Particle simulation, PI approximation, Random global memory reads, Random global memory writes. X-axis: Iterations. Series labels: GT 285, 58.]

Preliminary Conclusions
- There are cases where CUDA performs better than OpenCL
- There are cases where OpenCL performs better than CUDA
- OpenCL seems to have slightly higher overhead for kernel launches compared to CUDA on NVidia's platform
- For some cases the differences can be large, but...
- Measuring = knowing!

Back to the Host

CUDA Host Synchronization: Streams
- Streams are a sequence of commands that execute in order
- Streams can contain kernel launches and/or memory transfers
- Host code can wait for stream completion using the cudaStreamSynchronize() call
- Events can be inserted into the stream
- Host code can query event completion or perform a blocking wait for an event
- Useful for synchronization with host code and timing

OpenCL Host Synchronization: Command Queues
- The default behavior of command queues is similar to streams
- One big difference: out-of-order execution mode
- clEnqueue...() commands can be given a set of events to wait for
- Each command itself can generate an event

Task & Data Parallelism
- The commands and the events they must wait for create a task graph
- OpenCL will execute the commands in the queue as it sees fit, respecting the dependencies specified
- Based on the dependencies between commands in the queue, OpenCL can determine which commands are allowed to execute simultaneously
- The end result is a task-parallel framework supporting data-parallel tasks
- It is possible to create multiple queues for a device
- It is possible to have commands in one queue wait for events from a different queue

Intermediate Conclusions
- The programming methodology for data-parallel applications is virtually identical, i.e. if you can program in one language/environment, you can program in the other
- CUDA currently offers certain productivity advantages at the kernel level
- NVidia's hardware seems to be more capable on the GPGPU side when compared to ATi's hardware
- OpenCL has the platform advantage in that it presents a unified platform API for ALL computing hardware in your machine
- OpenCL programs can be run on hardware from different vendors
- Your application could be written entirely in kernels, requiring only a small framework that fills the command queue

OpenCL Implementations

AVAILABLE:
Vendor       Type          Hardware
Apple                      x86_64 (Intel)
nvidia                     GeForce 8/9 series and higher
ATi                        R0/800 series
AMD                        any x86/x86_64 with SSE3 extensions
Samsung                    ARM A9
IBM          ACCELERATOR   CELL BE
ZiiLabs                    ARM

ANNOUNCED/UPCOMING:
Imagination Technology     PowerVR SG Series 5
VIA                        VN 00 Chipset
S3                         Chrome 5400E Graphics Processor
Apple                      ARM A4

Portability to other platforms
- Results of a kernel are guaranteed across platforms
- Optimal performance is not
- All platforms are required to support data-parallelism, but are not required to support task-parallelism
- OpenCL can be considered a replacement for OpenMP (data-parallel)
- OpenCL can be considered a replacement for Threads (task-parallel)
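The task graph formed by commands and the events they wait for can be illustrated without any OpenCL calls. The sketch below is plain C under stated assumptions: the `command` struct, `MAX_WAIT`, and `schedule_waves` are hypothetical names invented for this example, and a command is assumed to wait only on commands enqueued before it. It is not how any OpenCL runtime is implemented; it only shows how dependencies determine which commands may execute simultaneously.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAIT 4

/* Hypothetical command record: indices of earlier-enqueued commands
   whose events this command must wait for. */
typedef struct {
    size_t num_waits;
    size_t waits[MAX_WAIT];
} command;

/* Assigns each command the earliest "wave" in which it may run:
   wave 0 if it waits on nothing, otherwise one more than the latest
   wave among the commands it waits on. Commands that share a wave have
   no dependency on each other and may execute simultaneously.
   Returns the number of waves. */
size_t schedule_waves(const command *cmds, size_t n, size_t *wave)
{
    size_t num_waves = 0;
    for (size_t i = 0; i < n; ++i) {
        wave[i] = 0;
        assert(cmds[i].num_waits <= MAX_WAIT);
        for (size_t k = 0; k < cmds[i].num_waits; ++k) {
            size_t dep = cmds[i].waits[k];
            assert(dep < i);  /* may only wait on earlier commands */
            if (wave[dep] + 1 > wave[i])
                wave[i] = wave[dep] + 1;
        }
        if (wave[i] + 1 > num_waves)
            num_waves = wave[i] + 1;
    }
    return num_waves;
}
```

As a usage example, take four commands: two buffer writes with no dependencies, a kernel that waits on both writes, and a read that waits on the kernel. The two writes land in wave 0 and may overlap, the kernel in wave 1, and the read in wave 2. An in-order queue would instead run all four strictly one after another.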

Libraries & Tools for OpenCL
- ATi StreamProfiler (ATi hardware only)
- NVidia Visual Profiler (NVidia hardware only)
- Stream KernelAnalyzer (ATi hardware only)
- NVidia NSight (NVidia hardware only)
- gDEBugger CL (Windows, Mac, Linux, currently in beta)
- libstdcl (wrapper around context/queue management functions)
- GATLAS (matrix multiplication)
- ViennaCL (BLAS level 1 and 2)

Libraries & Tools for CUDA
- cublas (closed-source)
- cufft (closed-source)
- CUDPP (data-parallel primitives)
- Thrust (high-level & OpenMP-based algorithms)
- CULATools (LAPACK)
- NSight debugger
- NVidia Visual Profiler

Language bindings
- Language bindings for C++, Fortran, Java, Matlab, .NET, Python and Scala are available
- Language bindings for Python, Java, .NET, MATLAB, Fortran, Perl, Ruby, Lua (unofficial)

Sneak Preview

Things to consider

Platforms
- OpenCL is currently the only choice if you do not want to tie your application to NVidia's hardware

API stability/agility
- OpenCL changes more slowly, retains backward compatibility
- CUDA changes more rapidly, unlocks new hardware features quicker

Third-party library availability
- OpenCL is about 2 years younger, so fewer and less mature libraries are available
- CUDA has spawned a host of initiatives and various libraries are available, especially in the scientific computing domain

Supporting tools
- OpenCL has a fairly young, but decent set of tools
- NVidia recently launched the NSight debugger, which seems more mature

Questions?

Further Reading

GPGPU General
www.gpgpu.org
www.khronos.org/opencl

Implementations
http://developer.amd.com/documentation/articles/pages/-and-the-ati-stream-v2.0-beta.aspx
http://developer.nvidia.com/object/opencl.html
http://www.alphaworks.ibm.com/tech/opencl
http://developer.apple.com/mac/library/documentation/performance/conceptual/_macprogguide/whatis/wh /

OpenCL / CUDA Comparisons
http://www.gpucomputing.net/?q=node/128

Mobile/Embedded announcements
http://www.imgtec.com/news/release/index.asp?newsid=557
http://www.ziilabs.com/technology/opencl.aspx

References
http://blog.accelereyes.com/blog/20/05//nvidia-fermi-cuda-and-opencl/
http://www.s3graphics.com/en/news/news_detail.aspx?id=44
http://www.gremedy.com/gdebuggercl.php
http://browndeertechnology.com/stdcl.html
http://golem5.org/gatlas/
http://www.mainconcept.com/products/sdks/hw-acceleration/opencl-h264avc.html
http://awaregeek.com/news/some-pictures-of-old-computers/