Rootbeer: Seamlessly using GPUs from Java

Similar documents

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Introduction to CUDA C

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

U N C L A S S I F I E D

CUDA Programming. Week 4. Shared memory and register

Hands-on CUDA exercises

CUDA Debugging. GPGPU Workshop, August Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CUDA Basics. Murphy Stein New York University

Introduction to GPU hardware and to CUDA

GPU Parallel Computing Architecture and CUDA Programming Model

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Crash Course in Java

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

GPU Tools Sandra Wienke

Jonathan Worthington Scarborough Linux User Group

Android Development: Part One

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

Chapter 1 Fundamentals of Java Programming

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

1.00 Lecture 1. Course information Course staff (TA, instructor names on syllabus/faq): 2 instructors, 4 TAs, 2 Lab TAs, graders

Raima Database Manager Version 14.0 In-memory Database Engine

picojava TM : A Hardware Implementation of the Java Virtual Machine

Advanced CUDA Webinar. Memory Optimizations

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Java Programming Fundamentals

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Le langage OCaml et la programmation des GPU

OpenACC 2.0 and the PGI Accelerator Compilers

Restraining Execution Environments

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

An Overview of Java. overview-1

Lecture 1: an introduction to CUDA

The BSN Hardware and Software Platform: Enabling Easy Development of Body Sensor Network Applications

To Java SE 8, and Beyond (Plan B)

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

Performance Improvement In Java Application

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

Binary search tree with SIMD bandwidth optimization using SSE

Java Virtual Machine: the key for accurated memory prefetching

First Java Programs. V. Paúl Pauca. CSC 111D Fall, Department of Computer Science Wake Forest University. Introduction to Computer Science

The Java Virtual Machine and Mobile Devices. John Buford, Ph.D. Oct 2003 Presented to Gordon College CS 311

Image Processing & Video Algorithms with CUDA

The Java Virtual Machine (JVM) Pat Morin COMP 3002

Lecture 1 Introduction to Android

General Introduction

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

CSCI E 98: Managed Environments for the Execution of Programs

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

ECE 122. Engineering Problem Solving with Java

Habanero Extreme Scale Software Research Project

The C Programming Language course syllabus associate level

GPU-based Decompression for Medical Imaging Applications

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Free Java textbook available online. Introduction to the Java programming language. Compilation. A simple java program

Free Java textbook available online. Introduction to the Java programming language. Compilation. A simple java program

NVIDIA CUDA. NVIDIA CUDA C Programming Guide. Version 4.2

Java Interview Questions and Answers

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

1. If we need to use each thread to calculate one output element of a vector addition, what would

Java Programming. Binnur Kurt Istanbul Technical University Computer Engineering Department. Java Programming. Version 0.0.

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Handout 1. Introduction to Java programming language. Java primitive types and operations. Reading keyboard Input using class Scanner.

ELEC 377. Operating Systems. Week 1 Class 3

Real-time Java Processor for Monitoring and Test

1 The Java Virtual Machine

OpenACC Basics Directive-based GPGPU Programming

Introduction to Java

JPURE - A PURIFIED JAVA EXECUTION ENVIRONMENT FOR CONTROLLER NETWORKS 1

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

Network Traffic Monitoring & Analysis with GPUs

Optimizing Application Performance with CUDA Profiling Tools

Leveraging Aparapi to Help Improve Financial Java Application Performance

Cloud Computing. Up until now

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

How accurately do Java profilers predict runtime performance bottlenecks?

Compiling Object Oriented Languages. What is an Object-Oriented Programming Language? Implementation: Dynamic Binding

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

UIL Computer Science for Dummies by Jake Warren and works from Mr. Fleming

GPU Performance Analysis and Optimisation

Stream Processing on GPUs Using Distributed Multimedia Middleware

Embedded Systems: map to FPGA, GPU, CPU?

GPU Accelerated Monte Carlo Simulations and Time Series Analysis

Pemrograman Dasar. Basic Elements Of Java

Next Generation GPU Architecture Code-named Fermi

WebSphere Architect (Performance and Monitoring) 2011 IBM Corporation

High-Performance Processing of Large Data Sets via Memory Mapping A Case Study in R and C++

Network Traffic Monitoring and Analysis with GPUs

Transcription:

Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University.

Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java First GPU program was to speed up tomography in C The original program had many layers of looping and function calls 01: for(int degree = 0; degree < 360; ++degree){ 02: //function_calls 03: for(int azimuth = 0; azimuth < 90; ++azimuth){ 04: //function_calls 05: for(int x = 0; x < width; ++x){ 06: //function_calls 07: for(int y = 0; y < width; ++y){ 08: //code_that_can_be_made_parallel_that_uses_globals 09: } 10: } 11: } 12: } Developer needs to chose a cut level, collect jobs in a C++ vector, serialize state, transfer control to the GPU and deserialze after the GPU runs It was difficult to know in advance at what level to make the optimized cut to fit within GPU resource constraints while obtaining the maximum speedup Rootbeer makes trying cut levels simpler with automatic serialization and CUDA code generation for Java code

Hello World! Program Rootbeer allows a developer to program a GPU in Java ArrayMultiplyKernel.java 01: public class ArrayMultiplyKernel implements Kernel { 02: int[] m_array; 03: public void gpumethod(){ 04: m_array[rootbeergpu.getthreadid()] *= 11; 05: } 06: } ArrayMultiply.java 01: public class ArrayMultiply { 02: public void multiplyparallel(int[] array){ 03: List<Kernel> jobs = new ArrayList<Kernel>(); 04: for(int i = 0; i < array.length; ++i){ 05: ArrayMultiplyKernel kernel = new ArrayMultiplyKernel(); 06: kernel.m_array = array; 07: jobs.add(kernel); 08: } 09: Rootbeer rootbeer = new Rootbeer(); 10: rootbeer.runall(jobs); 11: } 12: }

Advanced Matrix Multiply Uses shared memory and syncthreads with tiling MatrixKernel.java (simplified) 01: public class MatrixKernel implements Kernel { 02: public void gpumethod(){ 03: int block_idxx = RootbeerGpu.getBlockIdxx(); 04: int thread_idxx = RootbeerGpu.getThreadIdxx(); 05: for(int block_iter = 0; block_iter < block_iters; ++block_iter){ 06: for(int sub_matrix = 0; sub_matrix < sub_matrix_size; ++sub_matrix){ 07: float sum = 0; 08: for(int m = 0; m < m_size; ++m){ 09: float a_value = a[a_src]; 10: float b_value = b[b_src]; 11: RootbeerGpu.setSharedFloat(thread_idxx, a_value); 12: RootbeerGpu.setSharedFloat((1024 + thread_idxx), b_value); 13: RootbeerGpu.synchthreads(); 14: 15: for(int k = 0; k < 32; ++k){ 16: a_value = RootbeerGpu.getSharedFloat(thread_row * 32 + k); 17: b_value = RootbeerGpu.getSharedFloat(1024 + k * 32 + thread_col); 18: sum += a_value * b_value; 19: } 20: RootbeerGpu.synchthreads(); 21: } 22: c[dest_index] += sum; 23: } 24: } 25: } 26: }

Advanced Matrix Multiply MatrixApp.java (simplified) with Kernel Templates 01: public void gpurun(){ 02: 03: MatrixKernel matrix_kernel = new MatrixKernel(a, b, c, block_size, grid_size, block_iters); 04: 05: Rootbeer rootbeer = new Rootbeer(); 06: rootbeer.setthreadconfig(1024, grid_size); 07: rootbeer.runall(matrix_kernel); 08: 09: } Performance (prototype, indexing needs work) (Java CPU time does not include transpose time) System Time Speedup Java CPU 4 cores 109 seconds - Rootbeer Java 5 seconds 21.8x CUDA (matrixmul) 822 milliseconds 132.6x (6x) A[256][256]/B[256][256*256*14] CPU: 4 core Xeon E5405 @ 2.00GHz with 16GB ram (DDR2 @ 667MHz (1.5ns)) GPU: 448 core Tesla C2050 @ 1.147GHz Geben Sie hier Ihre with Fußzeile 3GB ein (384-bit) ram over PCI-e x16 Gen2

MegaMandel Shows the power of using Java with a GPU CPU Average Time: 80ms GPU Average Time: 8ms Example Provided by Thorsten Kiefer

Language Features Supported Composite Reference Types (Classes) Instance and Static Methods and Fields N-Dimensional Arrays of Any Type Exceptions Strings Synchronized Keyword Unsupported Reflection Dynamic Method Invocation Native Code (JNI) System.out.println File/Network IO Coming Soon (two month s time expected) Java Collections Framework support

Running Rootbeer Download binary from http://rbcompiler.com/ Running easy tests $ java -jar Rootbeer.jar runeasytests Running single test case (and with Native Emulation) $ java -jar Rootbeer.jar runtest SimpleTest -nemu Compiling and Running an Application $ java -jar Rootbeer.jar app.jar app-gpu.jar $ java -jar app-gpu.jar Testing GPU Installation $ java -jar Rootbeer.jar -printdeviceinfo

Testing If you haven t tested it, it doesn t work. Guaranteed. Test Driven Development with Git/Github and Gitflow 8.8k lines of test code 20k lines of product code Master 42/43 tests pass on Windows, Linux and Mac Develop 49/60 tests pass on Windows, Linux and Mac Bugs (will be fixed in two month s time approx.) ExceptionBasicTest AtomicLongTest Java Collections ClassLoading has issues Separate Kernel reachable jar for now

Rootbeer Internals Compilation Packed Jar Classloading with Soot and Rootbeer Generate Serialization Bytecode Attach CompiledKernel Interface to Kernel Page 10 Find Reachable Classes, Fields and Methods Generate CUDA Code Package with Runtime Packed GPU Jar

Rootbeer Internals Memory spaces: 1. Root kernel pointer space 2. Object memory space (big) 3. Exception pointer space CUDA Entry Point: 01: global void entry(int * root_pointers, char * object_space, int * exceptions, int num_blocks){ 02: int loop_control = blockidx.x * blockdim.x + threadidx.x; 03: if(loop_control >= num_blocks){ 04: return; 05: } else { 06: int handle = root_pointers[loop_control]; 07: int exception = 0; 08: %%invoke_run%%(object_space, handle, &exception); 09: exceptions[loop_control] = exception; 10: } 11: } cumodulegetfunction(&cufunction, cumodule, "_Z5entryPcS_PiPxS1_S0_S0_i"); Page 11

Rootbeer Internals CUDA Code for ArrayMult.gpuMethod 01: device void rootbeer_examples_arraymult_arraymult_gpumethod0_(char * gc_info, int thisref, int * exception){ 02: int r0 =-1 ; 03: int $r1 =-1 ; 04: int $i0 ; 05: int $i1 ; 06: int $i2 ; 07: $r1 = instance_getter_rootbeer_examples_arraymult_arraymult_m_source(gc_info, r0, exception); 08: if(*exception!= 0){ 09: return; 10: } 11: $i0 = edu_syr_pcpratts_rootbeer_runtime_rootbeergpu_getthreadidxx(gc_info, exception); 12: if(*exception!= 0){ 13: return; 14: } 15: $i1 = int array_get(gc_info, $r1, $i0, exception); 16: if(*exception!= 0){ 17: return; 18: } 19: $i2 = $i1 * 11; 20: int array_set(gc_info, $r1, $i0, $i2, exception); 21: if(*exception!= 0){ 22: return; 23: } 24: return; 25: }

Rootbeer Internals CUDA Code for ArrayMult.m_source field ref 01: device int 02: instance_getter_rootbeer_examples_arraymult_arraymult_m_source(char * gc_info, int thisref, int * exception){ 03: char * thisref_deref ; 04: if(thisref == -1){ 05: *exception = 11; //corresponds to CompiledKernel.getNullPointerNumber() 06: return 0; 07: } 08: thisref_deref = edu_syr_pcpratts_gc_deref(gc_info, thisref); 09: return *((int *) &thisref_deref[32]); 10: } Page 13

Rootbeer Internals CUDA Code for getting m_source[index] 01: device int int array_get(char * gc_info, int thisref, int parameter0, int * exception) { 02: int offset; 03: int length; 04: char * thisref_deref; 05: offset = 16 + (parameter0 * 4); //object header: 16 bytes. 4 is element size that is generated for each type 06: if(thisref == -1){ 07: *exception = 11; 08: return 0; 09: } 10: thisref_deref = edu_syr_pcpratts_gc_deref(gc_info, thisref); 11: length = edu_syr_pcpratts_getint (thisref_deref, 8); 12: if(parameter0 >= length){ 13: *exception = edu_syr_pcpratts_rootbeer_runtimegpu_gpuexception_arrayoutofbounds(gc_info, 14: parameter0, thisref, length, exception); 15: return 0; 16: } 17: return *((int *) &thisref_deref[offset]); 18: } Page 14 ArrayOutOfBounds throws can be disabled Compile with -noarraychecks

Rootbeer Internals Optimizations/Notes Pointers are 32 bit integers Smallest object is 16 bytes Pointers are shifted before using Allows for addressing of about 56GB inside an int Fields are packed according to sort of size 64 bit fields must be read/written on 64 bit boundary, etc Concurrent malloc is supported on GPU Heap is serialized packed A heap end pointer is kept AtomicAdd is used Previous size is the returned malloc address Add size is the size of the object CUDA code is generated blindly (and some is hand coded) A simple pure Java C parser does dead code elimination before sending to nvcc Temporaries are kept in ~/.rootbeer/ Page 15

Best Performance Kernel Templates Reduces serialization time Single-Dimensional Arrays of Primitive Types Optimizations quickly copy directly from JVM Read a field once per GPU thread if it never changes Java Bytecode is compiled to CUDA C. In un-optimized bytecode fields are always read This translates to unnecessary reads to global ram Page 16

Future Work Support 2-3 dimensional block/thread ids Compile to ptx rather than CUDA C Allow leaving Java objects in GPU memory Support Java Collections Framework Possibly swap in Optimized GPU Hash Table from Dr. John Owen s Lab (UC Davis). Support System.out.println Support File and Network IO NVIDIA: Can you support sockets between CPU and GPU inside CUDA? GPU Garbage Collection Allow to disable null pointer checks Other small issues Page 17

Credits Rootbeer is Open Source MIT license Rootbeer Open Source Developers Thorsten Kiefer: http://toki.burn3r.de/ Aaloan Miftah: aaloanmiftah@yahoo.com ValeriyB Funding Rootbeer is supported by Syracuse University and NSF grant number MCB-0746066 to Roy D. Welch Page 18

Questions? What are your favorite brands of Rootbeer? 1. Virgil s 2. Boylan 3. Saranac Rootbeer 4. IBC 5. Barq s 6. Mug 7. A&W What is the difference between a computer scientist and a mathematician? 1. A computer scientist knows no limits. Page 19