UNCLASSIFIED




CUDA and Java
GPU Computing in a Cross Platform Application
Scot Halverson, sah@lanl.gov
LA-UR-13-20719
Slide 1

What's the goal?
- Run GPU code alongside Java code
- Take advantage of high parallelization
- Utilize GPU technology with existing Java code
  - Minimal modifications
  - Cross platform
Slide 2

How?
- Separate processes communicating directly or indirectly
  - Shared memory, file I/O
- Take advantage of the Java Native Interface directly
- Use an intermediate library
  - JCUDA, JOCL, etc.
  - Provide software interfaces to the device
- Use an additional processing/compiling tool
  - Rootbeer, Aparapi, etc.
  - Converts some Java code to device code
Slide 3

Separate Processes
Advantages:
- Java code is still cross platform
Disadvantages:
- C/CUDA side must be compiled for each platform
- Interprocess communication can be awkward
Slide 4

Java Native Interface
Advantages:
- No management of communication between processes
- All code can be contained in one project
Disadvantages:
- Cross-platform capabilities of Java are much more complicated
- Might require JNI code to be written for each platform supported
- Requires embedded C/C++ & CUDA code
Slide 5

Intermediate Library
Advantages:
- Maintains Java's "build once, run anywhere" ideal
- No recompiling for different platforms
- Any platform that supports Java and CUDA should work
Disadvantages:
- Requires bundled CUDA code
- Requires bundled library
Slide 6

Additional Processing or Compiling Tool
Advantages:
- Write only Java/Java-like code
- Fewer low-level concerns (memory management, pointers, etc.)
Disadvantages:
- Relatively new & experimental
  - Existing support info may not apply
  - Few developers with experience
  - Long-term availability
- Less control
  - Kernels and interfacing code are written by the tool
Slide 7

BeSSy
Motivation for CUDA and Java interoperability at LANL D-3
- BeSSy is a sensor siting tool
  - Used to find optimal/near-optimal placement of a set of sensors
- Uses computationally expensive algorithms
  - Simulated Annealing
  - Genetic Algorithm (experimental)
- Written in Java
- We'd like to improve on this with GPU computing
- A significant amount of developer time has been spent on existing code
  - Rewriting everything is not an option
Slide 8

Optimization Example
[figure: population density map; legend: High, Medium, Low, Zero]
Slide 9

Optimization Example
[figures: same city, same weather, same scenario; only the number of collectors differs]
Slide 10

BeSSy Using Separate Processes
Advantages:
- Code can be distributed and used independently
Disadvantages:
- Multiple processes are cumbersome
- Still requires C/CUDA executable to be recompiled and possibly modified upon platform change
Slide 11

BeSSy Java Native Interface
Advantages:
- Probably don't need separate methods per platform
- Code all in one project
Disadvantages:
- Still not platform independent
  - Requires native code to be compiled for each supported platform
- Additional build steps
  - Producing C/C++ headers
  - Compiling C/C++ code
  - Compiling CUDA code
Slide 12

BeSSy Intermediate Library
Advantages:
- Platform independent (as long as the library is supported on the platform)
- No intermediate C/C++ code between Java and CUDA
Disadvantages:
- Requires library to be distributed
- Additional build step for CUDA code
Slide 13

BeSSy Additional Processing or Compiling Tool
Advantages:
- No C/C++ or CUDA
- Write Java-like code to run on the GPU
Disadvantages:
- Requires additional build step
- Requires use of tool(s) with which few people are familiar
- Relatively untested systems
- Longevity in question due to competing tools, OpenACC
- Depending on the tool, might still need to recompile per platform
Slide 14

BeSSy GPU Parallelization
- Chose intermediate library as the best option; specifically, JCUDA
- JCUDA is open source
- As long as Java and CUDA are supported on a platform, JCUDA will probably work
- JCUDA is mature and well maintained
  - Frequent updates; supports CUDA 5.0 (latest at time of writing)
  - Working project since at least 2009, an eternity in GPU computing
Slide 15

BeSSy GPU Parallelization
- No recompiling of our own code
  - Java is platform independent, and CUDA can be compiled to PTX, supporting a wide range of CUDA GPUs
- No need to distribute BeSSy source code
- The only set of code that is platform dependent is the library
- Existing Java and CUDA developers are already familiar with the necessary tools
Slide 16

BeSSy Traveling Salesman Analog
- BeSSy is a sensitive tool, not to be utilized outside of official use cases
- The Traveling Salesman Problem is an effective analog
  - Both are NP-hard problems
  - Simulated Annealing and Genetic Algorithms are reasonable approaches for generating solutions
- The Traveling Salesman Problem is well known in the CS world
- TSP is not sensitive Gov't work
  - Source code can be shown and distributed
Slide 17

Traveling Salesman Problem
- n = number of cities
- Salesman must visit each city exactly once
- Goal is to produce an optimal or near-optimal sequence of cities to visit
  - Optimal is defined as the shortest overall distance traveled
- For n cities, there are ~n! possible solutions
[diagram: Solver]
Slide 18
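The objective above (shortest overall distance) can be made concrete with a small host-side helper. This is an illustrative sketch only; the class and method names are ours, not from BeSSy or the TSP solver described here:

```java
// Hypothetical helper illustrating the TSP objective: the total length of a
// closed tour over n cities, visiting each city exactly once.
public class TspTour {

    // Euclidean distance between cities a and b.
    static double dist(float[] x, float[] y, int a, int b) {
        double dx = x[a] - x[b], dy = y[a] - y[b];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Total distance of the tour, including the return leg to the start.
    public static double tourLength(float[] x, float[] y, int[] tour) {
        double total = 0.0;
        for (int i = 0; i < tour.length; i++) {
            total += dist(x, y, tour[i], tour[(i + 1) % tour.length]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Four cities on a unit square: the perimeter tour has length 4.
        float[] x = {0, 1, 1, 0};
        float[] y = {0, 0, 1, 1};
        System.out.println(tourLength(x, y, new int[]{0, 1, 2, 3})); // prints 4.0
    }
}
```

A solver's job is then to search the ~n! permutations for one minimizing tourLength.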

TSP Simulated Annealing
1. Generate a set of solutions
2. Modify solutions based on some temperature (T); higher T produces greater change on average
3. Evaluate solutions
4. Replace bad solutions with copies of good solutions
5. Repeat starting at 2, until some number of iterations is reached
Slide 19
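The five steps above can be sketched serially in Java. This is a hedged illustration of the loop structure only, not the BeSSy/TSP implementation: the population size, cooling rate, swap-based mutation, and the 1.15 rejection threshold are assumptions made for the sketch:

```java
import java.util.Random;

// Minimal serial sketch of the slide's simulated-annealing loop.
public class TspAnneal {

    static double length(float[] x, float[] y, int[] t) {
        double s = 0;
        for (int i = 0; i < t.length; i++) {
            double dx = x[t[i]] - x[t[(i + 1) % t.length]];
            double dy = y[t[i]] - y[t[(i + 1) % t.length]];
            s += Math.sqrt(dx * dx + dy * dy);
        }
        return s;
    }

    public static double anneal(float[] x, float[] y, int iterations, long seed) {
        Random rng = new Random(seed);
        int n = x.length;
        // 1 - Generate a set of solutions (identity tours here).
        int[][] sols = new int[8][n];
        double[] vals = new double[8];
        for (int s = 0; s < 8; s++)
            for (int i = 0; i < n; i++) sols[s][i] = i;
        double t = 1.0; // temperature
        for (int it = 0; it < iterations; it++, t *= 0.99) {
            int best = 0;
            for (int s = 0; s < sols.length; s++) {
                // 2 - Modify: swap city pairs; higher T means more swaps.
                int swaps = 1 + (int) (t * 3);
                for (int k = 0; k < swaps; k++) {
                    int a = rng.nextInt(n), b = rng.nextInt(n);
                    int tmp = sols[s][a]; sols[s][a] = sols[s][b]; sols[s][b] = tmp;
                }
                // 3 - Evaluate.
                vals[s] = length(x, y, sols[s]);
                if (vals[s] < vals[best]) best = s;
            }
            // 4 - Replace bad solutions with copies of the best one.
            for (int s = 0; s < sols.length; s++) {
                if (vals[s] > 1.15 * vals[best]) {
                    System.arraycopy(sols[best], 0, sols[s], 0, n);
                    vals[s] = vals[best];
                }
            }
        } // 5 - Repeat until the iteration count is reached.
        double best = vals[0];
        for (double v : vals) best = Math.min(best, v);
        return best;
    }

    public static void main(String[] args) {
        float[] x = {0, 1, 2, 3, 0, 1};
        float[] y = {0, 0, 0, 1, 1, 1};
        System.out.println(anneal(x, y, 50, 42L));
    }
}
```

Each solution in the set is independent between steps, which is what makes the algorithm map naturally onto one GPU thread per solution.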

TSP Genetic Algorithm
1. Generate a set of parent solutions
2. Generate child solutions for each parent solution
3. Correct child solutions
4. Find the best between parent and children and replace the parent
5. Repeat starting at 2, until some number of unchanged iterations is reached
Slide 20
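Steps 2-3 are the TSP-specific part: mutated or crossed-over children can repeat cities, so they must be corrected back into valid tours. A minimal sketch, assuming a generic overwrite-and-repair scheme (not the crossover actually used in the project):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of child generation (step 2) and correction (step 3):
// the repair pass turns any city list back into a valid permutation.
public class TspChild {

    public static int[] makeChild(int[] parent, long seed) {
        Random rng = new Random(seed);
        int n = parent.length;
        int[] child = parent.clone();
        // 2 - Generate a child: randomly overwrite slots (may duplicate cities).
        for (int i = 0; i < n / 2; i++) {
            child[rng.nextInt(n)] = rng.nextInt(n);
        }
        // 3 - Correct the child: keep the first occurrence of each city and
        //     refill duplicate slots with the cities that went missing.
        boolean[] seen = new boolean[n];
        for (int i = 0; i < n; i++) {
            if (seen[child[i]]) child[i] = -1; // mark duplicate slot
            else seen[child[i]] = true;
        }
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (child[i] == -1) {
                while (seen[next]) next++;
                child[i] = next;
                seen[next] = true;
            }
        }
        return child;
    }

    public static void main(String[] args) {
        int[] parent = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(Arrays.toString(makeChild(parent, 7L)));
    }
}
```

Step 4 then just compares tour lengths of parent and children and keeps the shortest.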

Traveling Salesman Problem
Ideal for GPU parallelization:
- Highly parallelizable
- Low memory requirements
- Low device-to-host and host-to-device transfers
- Algorithm porting from CPU to GPU is nearly direct
Substantial performance benefits:
- ~20x improvement for the Genetic Algorithm
- ~80x improvement for Simulated Annealing
Slide 21

TSP Initializing the GPU
Nearly identical to C/C++ GPU initialization:
- Create CUdevice object
- Create CUcontext object using the CUdevice object
- Load PTX on the device
- Create device function pointers
- Allocate memory on the device
- Copy data to the device
Slide 22

TSP Initializing the GPU

cuInit(0);                       // Initialize the driver
device = new CUdevice();         // Create a context for the first device
cuDeviceGet(device, 0);
context = new CUcontext();
cuCtxCreate(context, 0, device);
module = new CUmodule();         // Load the PTX file
cuModuleLoad(module, ptxFileName);
initParents = new CUfunction();  // Initialize CUDA __global__ function pointers
cuModuleGetFunction(initParents, module, "initParents");
d_x = new CUdeviceptr();         // Initialize device array pointers
d_y = new CUdeviceptr();
cuMemAlloc(d_x, cityCount * Sizeof.FLOAT); // Allocate device memory for arrays
cuMemAlloc(d_y, cityCount * Sizeof.FLOAT);
cuMemcpyHtoD(d_x, Pointer.to(x), cityCount * Sizeof.FLOAT); // Copy city positions to device
cuMemcpyHtoD(d_y, Pointer.to(y), cityCount * Sizeof.FLOAT);
Slide 23

TSP Example Kernel

extern "C" __global__ void reject(int *solutions, float *values,
                                  int cityCount, int solutionCount,
                                  int *mi, float *mv) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < solutionCount) {
        float rejectionMultiplier = 1.15f;
        if (values[index] > rejectionMultiplier * mv[0]) {
            values[index] = mv[0];
            for (int a = 0; a < cityCount; a++) {
                solutions[index * cityCount + a] = solutions[mi[0] * cityCount + a];
            }
        }
    }
}
Slide 24
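Read serially, the kernel above replaces any solution more than 15% worse than the current best with a copy of the best. This CPU-side Java rendering of the same logic swaps the thread index for a loop, one iteration per CUDA thread; mi[0] and mv[0] hold the index and value of the best solution, as in the kernel:

```java
// Serial Java equivalent of the reject kernel, for reading purposes only.
public class Reject {
    public static void reject(int[] solutions, float[] values,
                              int cityCount, int solutionCount,
                              int[] mi, float[] mv) {
        for (int index = 0; index < solutionCount; index++) {
            float rejectionMultiplier = 1.15f;
            if (values[index] > rejectionMultiplier * mv[0]) {
                values[index] = mv[0];
                // Overwrite this tour with a copy of the best tour.
                for (int a = 0; a < cityCount; a++) {
                    solutions[index * cityCount + a] = solutions[mi[0] * cityCount + a];
                }
            }
        }
    }
}
```

Because each index touches only its own row of solutions (and reads the shared best row), the iterations are independent and the loop maps directly to one GPU thread per solution.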

TSP Kernel Launch Setup
- Create a Pointer to the set of kernel arguments
- Copy any necessary data to the device
- Call cuLaunchKernel
  - Set grid and block size
  - Set shared memory utilization
  - Set kernel and extra parameters
- Copy any necessary data to the host
Slide 25

TSP Kernel Launch Example

// Create set of parameters for call to reject on device
kernelParameters = Pointer.to(
    Pointer.to(d_solutions),              // Solution array
    Pointer.to(d_values),                 // Value array
    Pointer.to(new int[]{cityCount}),     // Number of cities
    Pointer.to(new int[]{solutionCount}), // Number of solutions
    Pointer.to(d_mi),                     // Index of best solution
    Pointer.to(d_mv)                      // Value of best solution
);

// Call reject on device.
blockSizeX = 256;
gridSizeX = (int) Math.ceil((double) solutionCount / blockSizeX);
cuLaunchKernel(reject,
    gridSizeX, 1, 1,        // Grid dimension
    blockSizeX, 1, 1,       // Block dimension
    0, null,                // Shared memory size and stream
    kernelParameters, null  // Kernel and extra parameters
);
Slide 26
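The grid sizing in the launch above is a ceiling divide: enough 256-thread blocks to cover every solution, with the kernel's index < solutionCount guard absorbing the unused threads in the final block. Isolated as a standalone helper:

```java
// The grid-sizing arithmetic from the launch example above.
public class GridSize {
    public static int gridSizeX(int solutionCount, int blockSizeX) {
        return (int) Math.ceil((double) solutionCount / blockSizeX);
    }

    public static void main(String[] args) {
        System.out.println(gridSizeX(1000, 256)); // prints 4
    }
}
```

With 1000 solutions and 256-thread blocks, four blocks give 1024 threads; the last 24 fail the index guard and do nothing.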

Combining Java and CUDA Notes
Major programming concerns result from the programming styles associated with the two languages:
- Garbage collection in Java vs. manual memory management in CUDA
- Pointers are uncommon and avoided in Java, but critical in CUDA
Typical CUDA parallelization issues:
- Setting up kernels rather than just calling host functions
- Copying data to/from the host
- Designing code to run efficiently in parallel
Slide 27

Conclusion
- A number of options exist for combining CUDA & Java
- The best option for BeSSy is an intermediate library
- CUDA programming with Java is very similar to CUDA programming for C/C++
  - Most functions behave identically or near-identically
- The Java programming model is somewhat different than C/C++ & CUDA
  - Some extra difficulty is introduced for programmers not familiar with either Java or CUDA
Slide 28

Source Code
Code soon to be available at: http://code.google.com/p/travelingsalesmangpu
Slide 29