Optimization of CUDA- based Monte Carlo Simulation for EM Radiation
|
|
|
- Drusilla Francis
- 9 years ago
- Views:
Transcription
1 Optimization of CUDA- based Monte Carlo Simulation for EM Radiation 10th Geant4 Space User Workshop Main authors: N. Henderson (ICME- Stanford U) & K. Murakami (KEK) Presented by: A. Dotti (SLAC)
2 Note In the following I will describe a specific G4 inspired application from medical physics (treatment with em particles) It is not directly related to what we have discussed in this workshop but it is very interesting for its technical content: It shows what the future of computing looks like It discusses some challenges that are common to simulations on accelerators It shows realistic performance boost to expect on accelerators 2
3 The collaboration Special thanks to the CUDA Center of Excellence Program Makoto Asai, SLAC Joseph Perl, SLAC Andrea Dotti, SLAC Koichi Murakami, KEK Takashi Sasaki, KEK Margot Gerritsen, ICME Nick Henderson, ICME 3
4 Radiotherapy simulation methods Analytic time: seconds to minutes accurate within 3-5% used in treatment planning Monte Carlo time: several hours to days of CPU time accurate within 1-2% used to verify treatment plans in certain cases 4
5 Project overview GPU based dose calculation for radiation therapy Input CT data DICOM Processing GPU/CUDA Dose calc. Output Visualization gmocren Features GPU code based on Geant4 Voxelized geometry DICOM interface Material is water with variable density Limited Geant4 EM physics: electron/positron/gamma Scoring of dose in each voxel Process secondary particles
6 Basic idea Particles are transported through material in steps Physics processes act on particles and may generate secondaries Particles are removed when out of domain or out of energy domain boundary
7 A single step 1. possibly retrieve particle from stack 2. compute step length & limiting process post step physical process 3. along step physical process
8 G4CU: CUDA implementation Each GPU thread processes a single particle until the particle exits the geometry Each thread has a stack for secondary particles Every thread takes a step in each iteration Algorithm is periodically interrupted for various actions 8
9 G4CU: parallel execution CUDA kernels are applied to all particles Threads must wait if physics process does not apply to particle Parallel execution is achieved: maintenance operations transport process same process on same particle
10 Performance and validation Hardware and software: GPU: Tesla K20 Kepler 2496 cores, 706 MHz, 5GB GDDR5 (ECC) CPU Xeon X5680 (3.33GHz) single thread operation SDK: CUDA 5.5 Geant4 release 9.6 p2
11 Phantom geometry
12 Particle sources 20 MeV electron / 6 MeV photons (mono- energy) 6 MV & 18 MV spectrum photons
13 Comparisons for slab phantoms 6MV photon, Lung/Bone as inner slab material
14 Visualization with gmocren Image for demonstration purposes only! 50M 6MeV photons Pencil beam configuration
15 Challenges Geant4 is complex Implement targeted subset Varied character of physics process Bad occupancy No clear optimization target Random simulation Thread divergence Lookup (interpolation) tables Thread divergence Non- ideal memory access Energy deposition in global array Possible race condition Non- ideal memory access
16 Physics Processes Process Lines of code Initialization kernel registers Step length kernel registers Action kernel registers Compton scattering Photo electric effect Gamma conversion Bremsstrahlung Ionization (18,18) (60,36) Scattering Positron annihilation Electron removal Transport (20,22)
17 Physics Profile (6 MeV gamma) Sum of Time(%) step along-step after-step other init at-rest Grand Total em-helper ionization multiple-scattering management transport bremsstrahlung positron-annihilation compton-scattering gamma-conversion photo-electric-effect electron-removal Grand Total
18 Optimization 1: dose storage Old idea: use thrust to join and reduce dose stacks into global array Better idea: use atomicadd Speedup: ~1.5 (at the time) index dose dose stack i dose stack j join and sort reduce by index
19 Optimization 2: device configuration 6 MeV gamma pencil beam 3,000,000 events Table shows run time / shortest time Voxels: 61 x 61 x = 22 seconds ~ 136 events / ms threads per block blocks threads per block blocks
20 Other optimizations New hardware! Refactoring New features sometime slow us down Results from: 512 x 512 x 256 geometry 6 MeV gamma pencil beam Date Hardwar e Feature Time (min) Events / ms March C May K June K20 atomicadd July K20 multiple scattering August K20 world volume Current K20 Refactoring, device config
21 Comparison to Geant4 on CPU Geometry:61 x 61 x 150 GPU: Tesla K20 CPU: Xeon X core, 12 thread Our measurement was with single threaded G4 New G4 is multi- threaded We multiply by 12 to get CPU events / ms CPU GPU 20 MeV electron (pencil) 6 MeV photon (pencil) Time (h) Events / ms / core Events / ms Time (min) Events / ms speedup (GPU/core) speedup (GPU/CPU)
22 Other optimization experiments 6 MeV gamma pencil beam 1,000,000 events Voxels: 61 x 61 x 150 Experiment Run time (s) Events / ms standard prefer L1 cache disable L1 cache remove atomicadd limit to 32 registers no thread divergence, no atomic add no thread divergence, no atomic add, limit to 32 registers add bounds checking
23 Going forward Hope to get factor 1.5 speed up from splitting particles into separate CUDA streams Reduce thread divergence Allow concurrent kernel launches Some possible benefits from using texture memory and hardware interpolation for lookup tables Look for other opportunities in expensive kernels
24 Acknowledgements NVidia for support of the CUDA Center of Excellence Koichi Murakami Makoto Asai, Joseph Pearl, Andrea SLAC Takashi KEK Akinori Kimura, Ashikaga Institute of Technology 24
Variance reduction techniques used in BEAMnrc
Variance reduction techniques used in BEAMnrc D.W.O. Rogers Carleton Laboratory for Radiotherapy Physics. Physics Dept, Carleton University Ottawa, Canada http://www.physics.carleton.ca/~drogers ICTP,Trieste,
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
HP ProLiant SL270s Gen8 Server. Evaluation Report
HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich [email protected]
Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code
Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code F. Rossi, S. Sinigardi, P. Londrillo & G. Turchetti University of Bologna & INFN GPU2014, Rome, Sept 17th
Bremsstrahlung Splitting Overview
March 2007 Bremsstrahlung Splitting Overview Jane Tinslay, SLAC Overview & Applications Biases by enhancing secondary production Aim to increase statistics in region of interest while reducing time spent
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
Real-time Visual Tracker by Stream Processing
Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol
Volume Visualization Tools for Geant4 Simulation
Volume Visualization Tools for Geant4 Simulation Ayumu Saitoh, Japan Science and Technology Agency Akinori Kimura, Ashikaga Institute of Technology Satoshi Tanaka, Ritsumeikan University Background and
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Optimizing Application Performance with CUDA Profiling Tools
Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics
90 degrees Bremsstrahlung Source Term Produced in Thick Targets by 50 MeV to 10 GeV Electrons
SLAC-PUB-7722 January 9 degrees Bremsstrahlung Source Term Produced in Thick Targets by 5 MeV to GeV Electrons X. S. Mao et al. Presented at the Ninth International Conference on Radiation Shielding, Tsukuba,
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Guided Performance Analysis with the NVIDIA Visual Profiler
Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided
Advanced variance reduction techniques applied to Monte Carlo simulation of linacs
MAESTRO Advanced variance reduction techniques applied to Monte Carlo simulation of linacs Llorenç Brualla, Francesc Salvat, Eric Franchisseur, Salvador García-Pareja, Antonio Lallena Institut Gustave
15-418 Final Project Report. Trading Platform Server
15-418 Final Project Report Yinghao Wang [email protected] May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
Interactive Level-Set Deformation On the GPU
Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA
MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji [email protected]
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji [email protected] Please Note IBM s statements regarding its plans, directions, and intent are subject to
GPU Performance Analysis and Optimisation
GPU Performance Analysis and Optimisation Thomas Bradley, NVIDIA Corporation Outline What limits performance? Analysing performance: GPU profiling Exposing sufficient parallelism Optimising for Kepler
ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer
ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
Monte Carlo Simulations: Efficiency Improvement Techniques and Statistical Considerations
Monte Carlo Simulations: Efficiency Improvement Techniques and Statistical Considerations Daryoush Sheikh-Bagheri, Ph.D. 1, Iwan Kawrakow, Ph.D. 2, Blake Walters, M.Sc. 2, and David W.O. Rogers, Ph.D.
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Multi-GPU Load Balancing for Simulation and Rendering
Multi- Load Balancing for Simulation and Rendering Yong Cao Computer Science Department, Virginia Tech, USA In-situ ualization and ual Analytics Instant visualization and interaction of computing tasks
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
ACCELERATORS AND MEDICAL PHYSICS 2
ACCELERATORS AND MEDICAL PHYSICS 2 Ugo Amaldi University of Milano Bicocca and TERA Foundation EPFL 2-28.10.10 - U. Amaldi 1 The icone of radiation therapy Radiation beam in matter EPFL 2-28.10.10 - U.
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
Interactive Level-Set Segmentation on the GPU
Interactive Level-Set Segmentation on the GPU Problem Statement Goal Interactive system for deformable surface manipulation Level-sets Challenges Deformation is slow Deformation is hard to control Solution
Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka René Widera1, Erik Zenker1,2, Guido Juckeland1, Benjamin Worpitz1,2, Axel Huebl1,2, Andreas Knüpfer2, Wolfgang E. Nagel2,
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
Atomic and Nuclear Physics Laboratory (Physics 4780)
Gamma Ray Spectroscopy Week of September 27, 2010 Atomic and Nuclear Physics Laboratory (Physics 4780) The University of Toledo Instructor: Randy Ellingson Gamma Ray Production: Co 60 60 60 27Co28Ni *
Introduction to the Monte Carlo method
Some history Simple applications Radiation transport modelling Flux and Dose calculations Variance reduction Easy Monte Carlo Pioneers of the Monte Carlo Simulation Method: Stanisław Ulam (1909 1984) Stanislaw
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Texture Cache Approximation on GPUs
Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, [email protected] 1 Our Contribution GPU Core Cache Cache
Le langage OCaml et la programmation des GPU
Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions
Module 4: Beyond Static Scalar Fields Dynamic Volume Computation and Visualization on the GPU Visualization and Computer Graphics Group University of California, Davis Overview Motivation and applications
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Medical Applications of radiation physics. Riccardo Faccini Universita di Roma La Sapienza
Medical Applications of radiation physics Riccardo Faccini Universita di Roma La Sapienza Outlook Introduction to radiation which one? how does it interact with matter? how is it generated? Diagnostics
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief NVIDIA changed the high performance computing (HPC) landscape by introducing its Fermibased GPUs that delivered
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
Appendix A. An Overview of Monte Carlo N-Particle Software
Appendix A. An Overview of Monte Carlo N-Particle Software A.1 MCNP Input File The input to MCNP is an ASCII file containing command lines called "cards". The cards provide a description of the situation
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Parallel Simplification of Large Meshes on PC Clusters
Parallel Simplification of Large Meshes on PC Clusters Hua Xiong, Xiaohong Jiang, Yaping Zhang, Jiaoying Shi State Key Lab of CAD&CG, College of Computer Science Zhejiang University Hangzhou, China April
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
GPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
Monte Carlo Simulations in Proton Dosimetry with Geant4
Monte Carlo Simulations in Proton Dosimetry with Geant4 Zdenek Moravek, Ludwig Bogner Klinik und Poliklinik für Strahlentherapie Universität Regensburg Objectives of the Study what particles and how much
Introduction to CUDA C
Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
THE DOSIMETRIC EFFECTS OF
THE DOSIMETRIC EFFECTS OF OPTIMIZATION TECHNIQUES IN IMRT/IGRT PLANNING 1 Multiple PTV Head and Neck Treatment Planning, Contouring, Data, Tips, Optimization Techniques, and algorithms AAMD 2013, San Antonio,
Medical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected].
Medical Image Processing on the GPU Past, Present and Future Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected] Outline Motivation why do we need GPUs? Past - how was GPU programming
(Toward) Radiative transfer on AMR with GPUs. Dominique Aubert Université de Strasbourg Austin, TX, 14.12.12
(Toward) Radiative transfer on AMR with GPUs Dominique Aubert Université de Strasbourg Austin, TX, 14.12.12 A few words about GPUs Cache and control replaced by calculation units Large number of Multiprocessors
Introduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run
SFWR ENG 3BB4 Software Design 3 Concurrent System Design 2 SFWR ENG 3BB4 Software Design 3 Concurrent System Design 11.8 10 CPU Scheduling Chapter 11 CPU Scheduling Policies Deciding which process to run
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
The Fastest, Most Efficient HPC Architecture Ever Built
Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing
Linear accelerator output, CT-data implementation, dose deposition in the presence of a 1.5 T magnetic field.
Simulations for the virtual prototyping of a radiotherapy MRI-accelerator system: Linear accelerator output, CT-data implementation, dose deposition in the presence of a 1.5 T magnetic field. A.J.E. Raaijmakers,
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.
Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
GPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014
CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014 Introduction Cloud ification < 2013 2014+ Music, Movies, Books Games GPU Flops GPUs vs. Consoles 10,000
GPU Architecture. Michael Doggett ATI
GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
GPU Accelerated Monte Carlo Simulations and Time Series Analysis
GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
SIDN Server Measurements
SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
Learn CUDA in an Afternoon: Hands-on Practical Exercises
Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA
