Optimization of CUDA-based Monte Carlo Simulation for EM Radiation


Optimization of CUDA-based Monte Carlo Simulation for EM Radiation
10th Geant4 Space User Workshop
Main authors: N. Henderson (ICME, Stanford U) & K. Murakami (KEK)
Presented by: A. Dotti (SLAC), adotti@slac.stanford.edu

Note
In the following I will describe a specific Geant4-inspired application from medical physics (treatment with EM particles). It is not directly related to what we have discussed in this workshop, but it is very interesting for its technical content:
- It shows what the future of computing looks like
- It discusses challenges that are common to simulations on accelerators
- It shows the realistic performance boost to expect on accelerators

The collaboration
Special thanks to the CUDA Center of Excellence Program
Makoto Asai, SLAC
Joseph Perl, SLAC
Andrea Dotti, SLAC
Koichi Murakami, KEK
Takashi Sasaki, KEK
Margot Gerritsen, ICME
Nick Henderson, ICME

Radiotherapy simulation methods
- Analytic: seconds to minutes; accurate within 3-5%; used in treatment planning.
- Monte Carlo: several hours to days of CPU time; accurate within 1-2%; used to verify treatment plans in certain cases.

Project overview
GPU-based dose calculation for radiation therapy.
Pipeline: input CT data (DICOM) -> processing on GPU/CUDA (dose calculation) -> output visualization (gmocren).
Features:
- GPU code based on Geant4
- Voxelized geometry with DICOM interface
- Material is water with variable density
- Limited Geant4 EM physics: electron/positron/gamma
- Scoring of dose in each voxel
- Secondary particles are processed

Basic idea
- Particles are transported through the material in steps.
- Physics processes act on particles and may generate secondaries.
- Particles are removed when they cross the domain boundary or run out of energy.

A single step
1. Possibly retrieve a particle from the stack
2. Compute the step length and the limiting process
3. Apply the along-step physical process, then the post-step physical process

G4CU: CUDA implementation
- Each GPU thread processes a single particle until the particle exits the geometry.
- Each thread has a stack for secondary particles.
- Every thread takes one step in each iteration.
- The algorithm is periodically interrupted for various actions.
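
A minimal sketch of this per-thread tracking loop, to make the pattern concrete. It is not the actual G4CU code: the names (Particle, trackParticles, insideWorld), the toy physics, and the fixed stack size are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Toy particle: position, direction, kinetic energy (MeV). Illustrative only.
struct Particle {
    float3 pos, dir;
    float  energy;
};

// Placeholder world: a box roughly the size of the voxel phantom.
__device__ bool insideWorld(const Particle& p)
{
    return fabsf(p.pos.x) < 30.f && fabsf(p.pos.y) < 30.f &&
           p.pos.z >= 0.f && p.pos.z < 75.f;
}

// One thread follows one primary until it leaves the world or runs out of
// energy; secondaries go on a small per-thread stack and are tracked next.
__global__ void trackParticles(const Particle* primaries, int n, float* dose)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    const int kMaxStack = 16;          // per-thread secondary stack (size is a guess)
    Particle stack[kMaxStack];
    int top = 0;

    Particle p = primaries[tid];
    while (true) {
        if (p.energy <= 0.f || !insideWorld(p)) {
            if (top == 0) break;       // nothing left: this thread is done
            p = stack[--top];          // 1. otherwise retrieve a secondary from the stack
            continue;
        }
        // 2. compute step length & limiting process (placeholder: fixed 1 mm step)
        float step = 1.0f;
        // 3. along-step physics: continuous energy loss, scored into the dose grid
        float dE = fminf(p.energy, 0.1f * step);
        p.energy -= dE;
        int voxel = 0;                 // placeholder voxel index derived from p.pos
        atomicAdd(&dose[voxel], dE);   // safe concurrent scoring (see optimization 1 later)
        // 4. post-step physics: the limiting process may create a secondary
        if (dE > 0.05f && top < kMaxStack) {
            Particle s = p;
            s.energy = 0.1f * p.energy;
            stack[top++] = s;
        }
        // move the particle along its direction by the step length
        p.pos.x += p.dir.x * step;
        p.pos.y += p.dir.y * step;
        p.pos.z += p.dir.z * step;
    }
}
```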

G4CU: parallel execution
- CUDA kernels are applied to all particles.
- Threads must wait if the physics process does not apply to their particle.
- Parallel execution is achieved for maintenance operations, the transport process, and the same process applied to each thread's particle.
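
An illustrative sketch of this kernel-per-process pattern, under assumed names (Track, comptonActionKernel, the Process enum): every process kernel is launched over all active tracks, and a thread whose particle did not select that process simply returns, waiting while the rest of its warp does the work.

```cuda
#include <cuda_runtime.h>

enum Process { COMPTON = 0, PHOTOELECTRIC, CONVERSION, BREMSSTRAHLUNG, IONIZATION };

struct Track { float energy; int selectedProcess; };

__global__ void comptonActionKernel(Track* tracks, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (tracks[i].selectedProcess != COMPTON) return;  // thread idles: divergence cost

    // toy Compton action: scatter and lose some energy (placeholder physics)
    tracks[i].energy *= 0.7f;
}
```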

Performance and validation
Hardware and software:
- GPU: Tesla K20 (Kepler), 2496 cores, 706 MHz, 5 GB GDDR5 (ECC)
- CPU: Xeon X5680 (3.33 GHz), single-thread operation
- SDK: CUDA 5.5
- Geant4 release 9.6 p2

Phantom geometry

Particle sources
- 20 MeV electrons / 6 MeV photons (mono-energetic)
- 6 MV and 18 MV spectrum photons

Comparisons for slab phantoms
6 MV photons, with lung or bone as the inner slab material

Visualization with gmocren
Image for demonstration purposes only: 50M 6 MeV photons, pencil beam configuration.

Challenges
- Geant4 is complex -> implement a targeted subset
- Varied character of the physics processes -> bad occupancy, no clear optimization target
- Random simulation -> thread divergence
- Lookup (interpolation) tables -> thread divergence, non-ideal memory access
- Energy deposition in a global array -> possible race condition, non-ideal memory access

Physics Processes

Process               | Lines of code | Registers (initialization, step-length, action kernels)
Compton scattering    | 116           | 14, 18, 49
Photoelectric effect  | 118           | 14, 18, 40
Gamma conversion      | 160           | 14, 18, 36
Bremsstrahlung        | 156           | 14, 18, 44
Ionization            | 394           | 14, (18, 18), (60, 36)
Scattering            | 575           | 12, 28
Positron annihilation | 156           | 12, 26
Electron removal      | 44            | 12, 8
Transport             | 491           | 20, 22, (20, 22)

Physics Profile (6 MeV gamma)
Percentage of run time per process, broken down by kernel phase (step, along-step, after-step, other, init, at-rest):

Process               | Phase breakdown (%)  | Total (%)
em-helper             | 18.4, 4.2            | 22.5
ionization            | 2.5, 15.3, 2.1       | 19.9
multiple-scattering   | 11.6, 7.7            | 19.3
management            | 3.7, 8.7, 0.5        | 12.8
transport             | 2.4, 1.7, 3.3, 1.4   | 8.9
bremsstrahlung        | 5.7                  | 5.7
positron-annihilation | 2.1, 0.9, 0.6, 0.7   | 4.3
compton-scattering    | 2.6                  | 2.6
gamma-conversion      | 1.4                  | 1.4
photo-electric-effect | 1.3                  | 1.3
electron-removal      | 1.2                  | 1.2
Grand total           | step 40.7, along-step 24.6, after-step 17.4, other 8.7, init 6.6, at-rest 2.0 | 100.0

Optimization 1: dose storage
- Old idea: use Thrust to join and reduce the per-thread dose stacks into the global array.
- Better idea: use atomicAdd.
- Speedup: ~1.5 (at the time).

Worked example of the old approach, with (index, dose) pairs:
- dose stack i: indices 5, 10, 13 with doses 3.0, 2.1, 0.9
- dose stack j: indices 10, 5, 19 with doses 4.3, 1.2, 7.2
- join and sort: indices 5, 5, 10, 10, 13, 19 with doses 1.2, 3.0, 2.1, 4.3, 0.9, 7.2
- reduce by index: indices 5, 10, 13, 19 with doses 4.2, 6.4, 0.9, 7.2
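
A hedged sketch of the two strategies on this slide. The names (voxelIdx, deposit, scoreDose, reduceDoseStacks) are illustrative, not the G4CU API; only the general pattern (atomicAdd vs. Thrust sort/reduce by key) is taken from the slide.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Better idea: every thread adds its energy deposit straight into the global
// dose array; atomicAdd serialises only genuine collisions on the same voxel.
__global__ void scoreDose(const int* voxelIdx, const float* deposit, int n, float* dose)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dose[voxelIdx[i]], deposit[i]);
}

// Old idea: collect (voxel, dose) pairs from the per-thread stacks, then sort
// by voxel index and reduce by key, exactly as in the worked example above
// (indices 5,10,13 and 10,5,19 reduce to 5,10,13,19).
void reduceDoseStacks(thrust::device_vector<int>&   idx,
                      thrust::device_vector<float>& dep,
                      thrust::device_vector<int>&   outIdx,
                      thrust::device_vector<float>& outDep)
{
    thrust::sort_by_key(idx.begin(), idx.end(), dep.begin());
    outIdx.resize(idx.size());
    outDep.resize(dep.size());
    auto ends = thrust::reduce_by_key(idx.begin(), idx.end(), dep.begin(),
                                      outIdx.begin(), outDep.begin());
    outIdx.resize(ends.first - outIdx.begin());
    outDep.resize(ends.second - outDep.begin());
}
```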

Optimization 2: device configuration
6 MeV gamma pencil beam, 3,000,000 events; voxels: 61 x 61 x 150.
The tables show run time / shortest time; 1.00 = 22 seconds (~136 events/ms).

blocks \ threads per block | 32    | 64   | 128  | 256  | 512
32                         | 17.98 | 9.85 | 5.62 | 3.21 | 2.04
64                         | 9.85  | 5.63 | 3.20 | 2.00 | 1.48
128                        | 5.62  | 3.21 | 1.98 | 1.39 | 1.22
256                        | 3.55  | 2.15 | 1.36 | 1.10 | 1.14
512                        | 2.42  | 1.46 | 1.08 | 1.02 | 1.15
1024                       | 1.80  | 1.22 | 1.00 | 1.01 | 1.61

blocks \ threads per block | 32   | 64   | 128
1000                       | 2.20 | 1.33 | 1.00
2000                       | 1.82 | 1.19 | 1.03
3000                       | 1.80 | 1.19 | 1.04
4000                       | 1.74 | 1.27 | 1.15
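
A sketch of how such a launch-configuration sweep can be timed: the same kernel is launched with different <<<blocks, threadsPerBlock>>> pairs and measured with CUDA events. The kernel here is an empty stand-in; the real transport work is elided.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void transportKernel(int events) { /* transport work elided */ }

int main()
{
    const int blockCounts[]  = {32, 64, 128, 256, 512, 1024};
    const int threadCounts[] = {32, 64, 128, 256, 512};

    for (int b : blockCounts)
        for (int t : threadCounts) {
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            transportKernel<<<b, t>>>(3000000);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%4d blocks x %4d threads: %.3f ms\n", b, t, ms);
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
    return 0;
}
```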

Other optimizations
New hardware, refactoring; new features sometimes slow us down.
Results from a 512 x 512 x 256 geometry, 6 MeV gamma pencil beam:

Date    | Hardware | Feature                    | Time (min) | Events/ms
March   | C2070    |                            | 72         | 23.1
May     | K20      |                            | 60         | 27.8
June    | K20      | atomicAdd                  | 41         | 40.7
July    | K20      | multiple scattering        | 42.5       | 39.2
August  | K20      | world volume               | 49         | 34.0
Current | K20      | refactoring, device config | 23         | 72.5

Comparison to Geant4 on CPU
Geometry: 61 x 61 x 150. GPU: Tesla K20. CPU: Xeon X5680 (6 cores, 12 threads).
Our measurement was with single-threaded Geant4; the new Geant4 is multi-threaded, so we multiply by 12 to estimate CPU events/ms.

                     | 20 MeV electron (pencil) | 6 MeV photon (pencil)
CPU time (h)         | 29.44                    | 12.42
CPU events/ms/core   | 0.94                     | 2.24
CPU events/ms (x12)  | 11.32                    | 26.85
GPU time (min)       | 22.23                    | 9.65
GPU events/ms        | 74.99                    | 172.69
Speedup (GPU/core)   | 79.49                    | 77.19
Speedup (GPU/CPU)    | 6.62                     | 6.43

Other optimization experiments
6 MeV gamma pencil beam, 1,000,000 events; voxels: 61 x 61 x 150.

Experiment                                                | Run time (s) | Events/ms
standard                                                  | 9.33         | 107.2
prefer L1 cache                                           | 9.28         | 107.8
disable L1 cache                                          | 9.27         | 107.9
remove atomicAdd                                          | 9.31         | 107.4
limit to 32 registers                                     | 9.29         | 107.6
no thread divergence, no atomicAdd                        | 3.25         | 307.7
no thread divergence, no atomicAdd, limit to 32 registers | 3.297        | 303.3
add bounds checking                                       | 17.91        | 55.8
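
A sketch of how two of these experiments are typically expressed in CUDA: the L1/shared-memory preference is a per-kernel runtime setting, while a hard 32-register limit usually comes from the nvcc flag --maxrregcount=32 (and disabling L1 caching of global loads from -Xptxas -dlcm=cg); __launch_bounds__ gives the compiler a related occupancy hint. The kernel here is a stand-in, not the real transport kernel.

```cuda
#include <cuda_runtime.h>

__global__ void __launch_bounds__(256)   // occupancy hint; a hard register cap
transportKernel()                        // would come from nvcc --maxrregcount=32
{
    /* transport work elided */
}

int main()
{
    // "prefer L1 cache": give this kernel's SMs more L1 at the expense of shared memory
    cudaFuncSetCacheConfig(transportKernel, cudaFuncCachePreferL1);
    transportKernel<<<512, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}
```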

Going forward
- Hope to get a factor ~1.5 speedup from splitting particles into separate CUDA streams: reduce thread divergence and allow concurrent kernel launches.
- Some possible benefit from using texture memory and hardware interpolation for the lookup tables (see the sketch below).
- Look for other opportunities in expensive kernels.
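
A hedged sketch of the texture-memory idea: a 1D lookup table is placed in a CUDA array and wrapped in a cudaTextureObject_t with linear filtering, so the hardware performs the interpolation. The table contents and names (sampleTable, the toy 1/(1+i) values) are illustrative only.

```cuda
#include <cuda_runtime.h>

__global__ void sampleTable(cudaTextureObject_t tbl, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float u = (i + 0.5f) / n;              // normalised position along the table
    out[i] = tex1D<float>(tbl, u * 256.f); // hardware-interpolated lookup (table has 256 entries)
}

int main()
{
    const int N = 256;
    float host[N];
    for (int i = 0; i < N; ++i) host[i] = 1.0f / (1.0f + i);   // toy cross-section table

    // copy the table into a CUDA array and wrap it in a texture object
    cudaArray_t arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, N);
    cudaMemcpy2DToArray(arr, 0, 0, host, N * sizeof(float),
                        N * sizeof(float), 1, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.filterMode     = cudaFilterModeLinear;   // hardware linear interpolation
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tblObj;
    cudaCreateTextureObject(&tblObj, &res, &tex, nullptr);

    float* dOut;
    cudaMalloc(&dOut, 1024 * sizeof(float));
    sampleTable<<<4, 256>>>(tblObj, dOut, 1024);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tblObj);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return 0;
}
```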

Acknowledgements
- NVIDIA, for support of the CUDA Center of Excellence
- Koichi Murakami
- Makoto Asai, Joseph Perl, Andrea Dotti @ SLAC
- Takashi Sasaki @ KEK
- Akinori Kimura, Ashikaga Institute of Technology