Java GPU Computing Maarten Steur & Arjan Lamers
Overview
  OpenCL
  Simple example
  Case study
  Tips & tricks
  Questions
Why GPU Computing
Abbreviations
  CPU, GPU, APU
  Khronos: OpenCL, OpenGL
  Nvidia: CUDA
  Java bindings: JogAmp JOCL, JavaCL, JOCL
GPU compared with CPU
  Many simple cores
  Lots of high-bandwidth memory

  Device             Cores   Gflops
  Intel Core i7          8      180
  GeForce GT 650M      384      650
Programming model
  Define a stream (flow)
  Run in parallel
Usage
  Algorithm:
    High concurrency
    Partitionable
  But:
    Extra latency from moving data onto and off the GPU
    Extra complexity
Components
Example (MacBook Pro)
  Platform
    name: Apple
    profile: FULL_PROFILE
    spec version: OpenCL 1.2
    vendor: Apple
  Device 16925696 HD Graphics 4000
    Driver: 1.2(Aug 17 2014 20:29:07)
    Max work group size: 512
    Global mem size: 1073741824
    Local mem size: 65536
    Max clock freq: 1200
    Max compute units: 16
  Device 16918272 GeForce GT 650M
    Driver: 8.26.28 310.40.55b01
    Max work group size: 1024
    Global mem size: 1073741824
    Local mem size: 49152
    Max clock freq: 900
    Max compute units: 2
  Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
    Driver: 1.1
    Max work group size: 1024
    Global mem size: 17179869184
    Local mem size: 32768
    Max clock freq: 2600
    Max compute units: 8
Work & Memory
Application / Kernel
  Write .cl files in a C variant
  Kernels are the 'public' functions
  Alternative: translate Java bytecode
    Aparapi (OpenCL)
    Rootbeer (CUDA)
Disclaimer
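Parallel sort
    kernel void sort(global const float* in, global float* out, int size) {
        int i = get_global_id(0);  // current work item
        if (i >= size) return;     // guard: the global work size may be padded past the input
        float id = in[i];
        int pos = 0;
        for (int j = 0; j < size; j++) {
            float jd = in[j];
            // in[j] < in[i]? Ties are broken by index so equal values get distinct slots.
            bool smaller = (jd < id) || (jd == id && j < i);
            pos += smaller ? 1 : 0;
        }
        out[pos] = id;
    }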
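The kernel is a rank sort: every work item counts how many elements belong before its own and writes its value to that position. A CPU-only reference in plain Java can help to check kernel output. The sketch below is an illustration, not the speakers' code; the class name `RankSortReference` is made up, and a parallel stream stands in for the work items.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class RankSortReference {
    // Each index i plays the role of one work item (get_global_id(0)).
    static float[] sort(float[] in) {
        int size = in.length;
        float[] out = new float[size];
        IntStream.range(0, size).parallel().forEach(i -> {
            float id = in[i];
            int pos = 0;
            for (int j = 0; j < size; j++) {
                float jd = in[j];
                // Same comparison as the kernel, with the index as tie-breaker.
                boolean smaller = (jd < id) || (jd == id && j < i);
                if (smaller) {
                    pos++;
                }
            }
            out[pos] = id; // every work item ends up with a distinct position
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new float[] {3f, 1f, 2f, 1f})));
    }
}
```

Like the kernel, this does O(n) work per element, so it only pays off when many elements are processed in parallel. It also assumes the input contains no NaN values, because NaN comparisons would give several elements the same position.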
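Java GPU Computing
    import com.jogamp.opencl.CLCommandQueue;
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLDevice;
    import com.jogamp.opencl.CLProgram;

    // Pick the GPU with the highest estimated throughput and create a context for it.
    CLContext globalContext = CLContext.create();
    CLDevice device = globalContext.getMaxFlopsDevice(CLDevice.Type.GPU);
    CLContext context = CLContext.create(device);
    CLCommandQueue queue = device.createCommandQueue();
    // Load the kernel source from the classpath and compile it.
    CLProgram program = context.createProgram(
            First8GpuComputing.class.getResourceAsStream("MyTask.cl")).build();
  You can also build for specific devices: build(device)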
Java GPU Computing
    // Allocate device buffers for input and output, then fill the input buffer on the host.
    CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(
            input.length, READ_ONLY);
    CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(
            input.length, WRITE_ONLY);
    mapToBuffer(inBuffer.getBuffer(), workload);
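The slides do not show the helper that fills the input buffer. A minimal sketch of what such a helper could look like, assuming it only copies a `float[]` into the `FloatBuffer` and resets the position; the class name `BufferMapping` and the method body are guesses, not the speakers' code:

```java
import java.nio.FloatBuffer;

final class BufferMapping {
    // Hypothetical equivalent of the slide's helper: copies values into target.
    static void mapToBuffer(FloatBuffer target, float[] values) {
        target.clear();      // position = 0, limit = capacity
        target.put(values);  // throws BufferOverflowException if values does not fit
        target.rewind();     // back to position 0 so a transfer starts at the first element
    }

    public static void main(String[] args) {
        FloatBuffer buffer = FloatBuffer.allocate(4); // heap buffer, for illustration only
        mapToBuffer(buffer, new float[] {4f, 2f, 3f, 1f});
        System.out.println(buffer.position() + " " + buffer.get(0));
    }
}
```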
Java GPU Computing
    // Look up the kernel by name and bind its arguments in declaration order.
    CLKernel kernel = program.createCLKernel("mytask");
    kernel.putArgs(inBuffer, outBuffer).putArg(workload.length);
Java GPU Computing
    // Copy the input to the device (non-blocking), run the kernel,
    // and read the result back (blocking).
    queue.putWriteBuffer(inBuffer, false)
         .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
         .putReadBuffer(outBuffer, true);
    FloatBuffer output = outBuffer.getBuffer();
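The slides do not show how the global and local work sizes are chosen. OpenCL 1.x requires the global work size to be a multiple of the local work size, so a common approach is to round the element count up and let the kernel ignore the surplus work items. A minimal sketch; the class `WorkSizes` and the method `roundUp` are illustrative names:

```java
final class WorkSizes {
    // Smallest multiple of localWorkSize that is >= elementCount.
    static int roundUp(int localWorkSize, int elementCount) {
        int remainder = elementCount % localWorkSize;
        return remainder == 0 ? elementCount : elementCount + localWorkSize - remainder;
    }

    public static void main(String[] args) {
        int localWorkSize = 256;
        int globalWorkSize = roundUp(localWorkSize, 1000);
        System.out.println(globalWorkSize);
    }
}
```

With a padded global size, every kernel needs a bounds check on `get_global_id(0)`; without one, the surplus work items read and write past the end of the buffers.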
Case study
Case study
  Calculation tool supporting the Programmatische Aanpak Stikstof (the Dutch programmatic approach to nitrogen).
  http://www.aerius.nl
Tips & tricks
  Managing CL sources
    getResourceAsStream()?
  Java constants
    #define
    Locale? Oops!
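The "Locale? Oops!" remark presumably refers to generating `#define` lines from Java constants with locale-sensitive formatting: on a machine with a Dutch default locale, `String.format` writes a decimal comma, which is not a valid floating-point literal in OpenCL C. A minimal sketch of the pitfall and one way around it; `KernelDefines` and `define` are illustrative names:

```java
import java.util.Locale;

final class KernelDefines {
    // Builds one "#define NAME value" line; Locale.ROOT keeps the decimal point.
    static String define(String name, double value) {
        return String.format(Locale.ROOT, "#define %s %f\n", name, value);
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        // Locale-sensitive: Dutch locale data uses a decimal comma here.
        System.out.print(String.format(dutch, "#define %s %f\n", "SCALE", 0.5));
        // Locale-independent: always a decimal point.
        System.out.print(define("SCALE", 0.5));
    }
}
```

The generated lines can then be prepended to the kernel source before it is compiled.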
Tips & tricks
  Unit testing
    Separate test kernels
    Test cases in batches
    kernel void testDifficultCalculation(const int testCount,
                                         global const double* distance,
                                         global double* results) {
        const int testId = get_global_id(0);
        // One work item per test case; surplus work items do nothing.
        if (testId < testCount) {
            results[testId] = difficultCalculation(distance[testId]);
        }
    }
Direct memory management
  -XX:MaxDirectMemorySize=??M
  ByteBuffer.allocateDirect(int capacity)
    Max 2 GB per buffer
  Garbage collection comes too late
    Triggered by heap collection
    Manual release (JDK-internal API): ((sun.nio.ch.DirectBuffer) myBuffer).cleaner().clean();
  VisualVM plugin for direct buffers
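The 2 GB limit follows from `allocateDirect` taking its capacity as an `int` number of bytes. A minimal sketch of allocating a direct float buffer with an explicit overflow check; the class `DirectBuffers` and the method `allocateFloats` are illustrative names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class DirectBuffers {
    // Allocates off-heap storage for count floats in the platform's byte order.
    static FloatBuffer allocateFloats(int count) {
        // Throws ArithmeticException instead of silently overflowing past 2 GB.
        int bytes = Math.multiplyExact(count, Float.BYTES);
        return ByteBuffer.allocateDirect(bytes)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer buffer = allocateFloats(1024);
        System.out.println(buffer.isDirect() + " " + buffer.capacity());
    }
}
```

The memory behind such a buffer lives outside the Java heap and counts against `-XX:MaxDirectMemorySize`. It is only released once the buffer object itself is garbage collected, which is why the slide warns that collection can come too late.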
GPU vs CPU
  GPUs perform fewer checks than CPUs
    Division by zero
    Out-of-bounds checks
  Test on the CPU first
Portability
  OpenCL is portable, its performance is not
    Memory sizes differ
    Memory latencies differ
    Work group sizes differ
    Compute devices differ
    OpenCL implementations differ
  So develop for the production hardware
Finally
  Float vs double
    Double the precision
    Half the performance
    Double support is optional
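The precision difference is easy to observe on the CPU. A float has a 24-bit significand, so from 2^24 onwards it can no longer represent every integer, while a double still can. A small illustration; the class `FloatVsDouble` is an illustrative name:

```java
final class FloatVsDouble {
    // True when adding 1 is lost to rounding at this magnitude.
    static boolean absorbsOne(float x) {
        return x + 1f == x;
    }

    static boolean absorbsOne(double x) {
        return x + 1d == x;
    }

    public static void main(String[] args) {
        System.out.println(absorbsOne(16_777_216f)); // 2^24 as float: prints true
        System.out.println(absorbsOne(16_777_216d)); // 2^24 as double: prints false
    }
}
```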
Conclusion
Conclusion
  When to use it?
    When the performance is really needed
    When the problem has high concurrency
    When the problem is partitionable
Questions?
  Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  Warming up OpenCL test
  [thread 32003 also had an error][thread 33027 also had an error]
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  # SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error]
  [thread 32259 also had an error] at pc=0x00000001250ded70, pid=99851, tid=29475
  #
  # JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
  # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
  # Problematic frame:
  # [thread 17415 also had an error] C [cl_kernels+0x1d70] sort_wrapper+0x1b0
  #
  # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
  #
  # An error report file with more information is saved as:
  # /Users/arjanl/Documents/opencl/workspace/opencl-test/jogamp/hs_err_pid99851.log
  [thread 31763 also had an error]
  #
  # If you would like to submit a bug report, please visit:
  # http://bugreport.sun.com/bugreport/crash.jsp
  #