Java GPU Computing. Maarten Steur & Arjan Lamers



Overview
- OpenCL
- Simple example
- Case study
- Tips & tricks
- Questions

Why GPU computing

Abbreviations
- CPU, GPU, APU
- Khronos: OpenCL, OpenGL
- Nvidia: CUDA
- JogAmp JOCL, JavaCL, JOCL

GPU compared with CPU
- Many simple cores
- Lots of high-bandwidth memory
- Intel Core i7: 8 cores, 180 Gflops
- GeForce GT 650M: 384 cores, 650 Gflops

Programming model
- Define a stream (flow)
- Run in parallel
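As a CPU-side analogy for this model, the sketch below defines a range of indices and applies the same operation to each element in parallel. It uses Java parallel streams rather than OpenCL, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class DataParallelSketch {
    /** Applies the same operation to every element; each index is independent of the others. */
    static float[] squareAll(float[] in) {
        float[] out = new float[in.length];
        // Every index writes only its own slot, so the iterations can run concurrently.
        IntStream.range(0, in.length).parallel().forEach(i -> out[i] = in[i] * in[i]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squareAll(new float[] {1f, 2f, 3f})));
    }
}
```

On a GPU the body of the lambda would correspond to the kernel, and each index to one work-item.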

Usage
- The algorithm needs high concurrency and must be partitionable
- But: extra latency from loading data onto and off the GPU
- But: extra complexity

Components

Example (MacBook Pro)

Platform
  name: Apple
  profile: FULL_PROFILE
  spec version: OpenCL 1.2
  vendor: Apple

Device 16925696 HD Graphics 4000
  Driver: 1.2(Aug 17 2014 20:29:07)
  Max work group size: 512
  Global mem size: 1073741824
  Local mem size: 65536
  Max clock freq: 1200
  Max compute units: 16

Device 16918272 GeForce GT 650M
  Driver: 8.26.28 310.40.55b01
  Max work group size: 1024
  Global mem size: 1073741824
  Local mem size: 49152
  Max clock freq: 900
  Max compute units: 2

Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  Driver: 1.1
  Max work group size: 1024
  Global mem size: 17179869184
  Local mem size: 32768
  Max clock freq: 2600
  Max compute units: 8

Work & Memory

Application / Kernel
- Write .cl files in a C variant
- Kernels are the 'public' functions
- Java bytecode: Aparapi (OpenCL), Rootbeer (CUDA)

Disclaimer

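Parallel sort

    kernel void sort(global const float* in, global float* out, int size) {
      int i = get_global_id(0);   // current thread
      if (i >= size) {
        return;                   // guard: the global work size may exceed size
      }
      float ix = in[i];
      int pos = 0;                // number of elements that sort before in[i]
      for (int j = 0; j < size; j++) {
        float jx = in[j];
        // in[j] < in[i]? Ties are broken by index so every element gets a unique position.
        bool smaller = (jx < ix) || (jx == ix && j < i);
        pos += smaller ? 1 : 0;
      }
      out[pos] = ix;
    }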
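A sequential Java version of the same idea can serve as a reference when checking the kernel's output. This is a minimal sketch, not code from the talk; the class and method names are chosen for illustration. Each pass of the outer loop does what one GPU work-item does: it counts how many elements sort before its own and writes its element to that position.

```java
import java.util.Arrays;

class RankSortReference {
    /** Sequential rank sort: element i is placed at the count of elements that sort before it. */
    static float[] rankSort(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) {          // on the GPU: one work-item per i
            int pos = 0;
            for (int j = 0; j < in.length; j++) {
                // Ties are broken by index, so equal values get distinct positions.
                boolean smaller = in[j] < in[i] || (in[j] == in[i] && j < i);
                if (smaller) {
                    pos++;
                }
            }
            out[pos] = in[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1.0, 1.0, 2.0, 3.0]
        System.out.println(Arrays.toString(rankSort(new float[] {3f, 1f, 2f, 1f})));
    }
}
```

The algorithm does O(n²) comparisons in total, but no work-item depends on another, which is what makes it easy to run in parallel.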

Java GPU Computing

    CLContext globalContext = CLContext.create();
    CLDevice device = globalContext.getMaxFlopsDevice(Type.GPU);
    CLContext context = CLContext.create(device);
    CLCommandQueue queue = device.createCommandQueue();
    CLProgram program = context.createProgram(
        First8GpuComputing.class.getResourceAsStream("MyTask.cl")
    ).build();

You can also build for specific devices: build(device)

Java GPU Computing

    // Allocate the input and output buffers.
    CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(
        input.length, READ_ONLY);
    CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(
        input.length, WRITE_ONLY);
    // mapToBuffer is a helper that is not shown on the slides; presumably it copies the workload into the buffer.
    mapToBuffer(inBuffer.getBuffer(), workload);

    // Create the kernel and set its arguments.
    CLKernel kernel = program.createCLKernel("mytask");
    kernel.putArgs(inBuffer, outBuffer).putArg(workload.length);

    // Copy the input to the device, run the kernel, and read the result back (blocking).
    queue.putWriteBuffer(inBuffer, false)
         .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
         .putReadBuffer(outBuffer, true);
    FloatBuffer output = outBuffer.getBuffer();
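The slides do not show how the global and local work sizes are chosen. In OpenCL 1.x the global work size must be a multiple of the local work size, so a common approach is to round the element count up, as in the sketch below. The helper name and the values 256 and 1000 are assumptions for illustration only.

```java
class WorkSizes {
    /** Rounds elementCount up to the nearest multiple of localWorkSize. */
    static int roundUp(int localWorkSize, int elementCount) {
        int remainder = elementCount % localWorkSize;
        return remainder == 0 ? elementCount : elementCount + localWorkSize - remainder;
    }

    public static void main(String[] args) {
        int localWorkSize = 256;                            // assumed; must not exceed the device's max work group size
        int globalWorkSize = roundUp(localWorkSize, 1000);  // 1000 elements -> 1024 work-items
        System.out.println(globalWorkSize);
    }
}
```

With a rounded-up global size, some work-items have an index beyond the last element, so the kernel needs a bounds check on get_global_id(0).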

Case study

Calculation tool supporting the Programmatische Aanpak Stikstof (the Dutch programme for managing nitrogen deposition). http://www.aerius.nl

Tips & tricks
- Managing CL sources: getResourceAsStream()?
- Java constants as #define
- Locale? Oops!
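One plausible reading of the locale warning: when Java constants are formatted into #define lines that are prepended to the kernel source, the default locale can turn the decimal point into a comma, which the OpenCL compiler will not accept. The sketch below pins the locale explicitly. The constant name MAX_DISTANCE and the helper are hypothetical.

```java
import java.util.Locale;

class KernelDefines {
    /** Formats a constant as a preprocessor define, using the given locale for the number. */
    static String define(String name, double value, Locale locale) {
        return String.format(locale, "#define %s %f\n", name, value);
    }

    public static void main(String[] args) {
        // Locale.ROOT always uses a decimal point.
        System.out.print(define("MAX_DISTANCE", 0.5, Locale.ROOT));
        // A Dutch locale formats the same value with a decimal comma.
        System.out.print(define("MAX_DISTANCE", 0.5, Locale.forLanguageTag("nl-NL")));
    }
}
```

Passing Locale.ROOT (or Locale.US) whenever source code is generated keeps the output independent of the machine's settings.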

Tips & tricks
- Unit testing
- Separate test kernels
- Run test cases in batches

    kernel void testDifficultCalculation(const int testCount,
                                         global const double* distance,
                                         global double* results) {
      const int testId = get_global_id(0);
      // Each work-item evaluates one test case; surplus work-items do nothing.
      if (testId < testCount) {
        results[testId] = difficultCalculation(distance[testId]);
      }
    }

Direct memory management
- -XX:MaxDirectMemorySize=??M
- ByteBuffer.allocateDirect(int capacity)
- Max 2 GB per buffer
- Garbage collection comes too late: it is only triggered by a heap collection
- Release manually: ((sun.nio.ch.DirectBuffer) myBuffer).cleaner().clean();
- VisualVM plugin for direct buffers
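For reference, a minimal sketch of allocating a direct buffer for float data with the standard java.nio API. The class and method names are illustrative, and this is not necessarily how the presenters fill their buffers. Because allocateDirect takes an int capacity in bytes, a single buffer cannot exceed 2 GB.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

class DirectBuffers {
    /** Copies the data into a newly allocated direct (off-heap) buffer in native byte order. */
    static FloatBuffer directFloatBuffer(float[] data) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(data.length * Float.BYTES)
                                     .order(ByteOrder.nativeOrder());
        FloatBuffer floats = bytes.asFloatBuffer();
        floats.put(data);
        floats.rewind();   // reset the position so readers start at element 0
        return floats;
    }

    public static void main(String[] args) {
        FloatBuffer buffer = directFloatBuffer(new float[] {1f, 2f, 3f});
        System.out.println(buffer.isDirect() + " " + buffer.capacity());
    }
}
```

The memory behind such a buffer lives outside the Java heap, which is why it counts against -XX:MaxDirectMemorySize rather than -Xmx.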

GPU vs CPU
- GPUs check less than CPUs: division by zero, out-of-bounds checks
- Test on the CPU first

Portability
- OpenCL is portable; its performance is not
- Memory sizes differ
- Memory latencies differ
- Work group sizes differ
- Compute devices differ
- OpenCL implementations differ
- So develop for the production hardware

Finally: float vs double
- Double precision, half the performance
- Double support is optional
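To make the precision side of this trade-off concrete, the sketch below shows a value at which a 32-bit float can no longer represent an increment of one, while a 64-bit double still can. Float has a 24-bit significand, so 2^24 = 16777216 is the first point where consecutive integers stop being representable. The class and method names are illustrative.

```java
class FloatVsDouble {
    /** Returns true if adding one to the value is lost to rounding in single precision. */
    static boolean floatAbsorbsOne(float value) {
        float incremented = value + 1f;
        return incremented == value;
    }

    /** Returns true if adding one to the value is lost to rounding in double precision. */
    static boolean doubleAbsorbsOne(double value) {
        double incremented = value + 1.0;
        return incremented == value;
    }

    public static void main(String[] args) {
        System.out.println(floatAbsorbsOne(16777216f));   // true: 2^24 + 1 is not a float
        System.out.println(doubleAbsorbsOne(16777216.0)); // false: double has 53 significand bits
    }
}
```

Whether the lost precision matters depends on the calculation, so it is worth checking before trading double for float to gain speed.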

Conclusion
When to use it?
- When the performance is really needed
- When the problem has high concurrency
- When the problem is partitionable

Questions?

    Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
    Warming up OpenCL test
    [thread 32003 also had an error][thread 33027 also had an error]
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error]
    [thread 32259 also had an error] at pc=0x00000001250ded70, pid=99851, tid=29475
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
    # Problematic frame:
    # [thread 17415 also had an error]
    C [cl_kernels+0x1d70] sort_wrapper+0x1b0
    #
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /Users/arjanl/Documents/opencl/workspace/opencl-test/jogamp/hs_err_pid99851.log
    [thread 31763 also had an error]
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.sun.com/bugreport/crash.jsp
    #