Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs




Craig J. Webb 1,2 and Alan Gray 2

1 Acoustics Group, University of Edinburgh
2 Edinburgh Parallel Computing Centre, University of Edinburgh

To be presented at the 21st International Congress on Acoustics, Montréal, Canada, 2013

Abstract

The computation of large-scale virtual acoustics using the 3D finite difference time domain (FDTD) is prohibitively computationally expensive, especially at high audio sample rates, when using traditional CPUs. In recent years the computer gaming industry has driven the development of extremely powerful Graphics Processing Units (GPUs). Through specialised development and tuning we can exploit the highly parallel GPU architecture to make such FDTD computations feasible. This paper describes the simultaneous use of multiple NVIDIA GPUs to compute schemes containing over a billion grid points. We examine the use of asynchronous halo transfers between cards, to hide the latency involved in transferring data, and overall computation time is considered with respect to variation in the size of the partition layers. As hardware memory poses limitations on the size of the room to be rendered, we also investigate the use of single precision arithmetic. This allows twice the domain space, compared with double precision, but results in phase shifting of the output with possible audible artefacts. Using these techniques, large-scale spaces of several thousand cubic metres can be computed at 44.1 kHz in a useable time frame, making their use in room acoustics rendering and auralization applications possible in the near future.

C.J.Webb-2@sms.ed.ac.uk

INTRODUCTION

High fidelity virtual room acoustics can be approached through direct numerical simulation of wave propagation in a defined space. Unlike ray-based [1] or image source [2] techniques, this approach seeks to model the entire acoustic field within the simulation domain. Three-dimensional Finite Difference Time Domain (FDTD) schemes can be employed, however at audio sample rates such schemes are extremely computationally expensive [3], prohibitively so for serial computation. Recent advances in graphics processing unit (GPU) architectures allow for general purpose computation to be performed on these forms of highly parallel hardware. Whilst central processing units (CPUs) may contain a small number of cores, such as four or eight, GPUs contain hundreds of processing cores that can be used to perform parallel computation. Using this architecture, the data independence of FDTD schemes can be leveraged to gain significant acceleration over single-threaded implementations [4], and this allows large-scale simulations to be computed in time scales that are actually useable for performing research.

For scientific computing, Nvidia's Tesla GPUs are typically used in a workstation or compute node that can be configured with four GPUs connected across the same PCIe bus. This paper examines the simultaneous use of these four-GPU systems to render virtual acoustic simulations. This allows greater acceleration of existing models, or the combined use of all available memory across four GPUs to render large-scale domains containing billions of grid points. Recent versions of the CUDA language facilitate this process [5], without recourse to MPI programming techniques.

The first section details the FDTD schemes being used, followed by an outline of the CUDA programming model for the simultaneous use of multiple GPUs. We then describe the implementation of the schemes using both non-asynchronous and asynchronous approaches. Finally, we detail experimental testing in terms of floating-point precision and overall computation times for various configurations, including large-scale simulations that use maximum memory.

VIRTUAL ACOUSTICS USING FINITE DIFFERENCE METHOD

The starting point for acoustic FDTD simulations is the 3D wave equation, which in second order form is given by:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi    (1)

Here \Psi is the target acoustical field quantity, c is the wave speed in air, and \nabla^2 is the 3D Laplacian. Simple first-order boundary conditions are used, where:

    \frac{\partial \Psi}{\partial t} = c\beta\, \mathbf{n} \cdot \nabla \Psi    (2)

Here \mathbf{n} is a unit normal to a wall or obstacle, and \beta is an absorption coefficient. The standard FDTD discretisation leads to the following update equation, which includes boundary loss terms using a single reflection coefficient:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \left( (2 - K\lambda^2)\, w^n_{l,m,p} + \lambda^2 S^n_{l,m,p} - (1 - \lambda\beta)\, w^{n-1}_{l,m,p} \right)    (3)

where w_{l,m,p} is the discrete acoustic field, K is 6 in free space, 5 at a face, 4 at an edge and 3 at a corner, \lambda = cT/X, \beta is the coefficient for boundary reflection losses, and S^n_{l,m,p} = w^n_{l+1,m,p} + w^n_{l-1,m,p} + w^n_{l,m+1,p} + w^n_{l,m-1,p} + w^n_{l,m,p+1} + w^n_{l,m,p-1}. The stability condition for the scheme follows from von Neumann analysis [6], such that for a given time step T the grid spacing X must satisfy:

    X \geq \sqrt{3 c^2 T^2}    (4)
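As an illustration of this condition, the short host-side sketch below (not taken from the paper; the wave speed and room dimensions are assumed values) sets the grid spacing at the Courant limit for a 44.1 kHz sample rate and sizes the grid for a given room, allowing for the layer of ghost points described later.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double SR = 44100.0;            /* audio sample rate (Hz)             */
    double c  = 344.0;              /* assumed speed of sound (m/s)       */
    double T  = 1.0 / SR;           /* time step                          */
    double X  = sqrt(3.0) * c * T;  /* grid spacing at the Courant limit  */

    /* assumed room dimensions in metres */
    double Lx = 12.0, Ly = 5.0, Lz = 4.0;

    /* points per dimension, plus two ghost layers in each direction */
    int Nx = (int)floor(Lx / X) + 2;
    int Ny = (int)floor(Ly / X) + 2;
    int Nz = (int)floor(Lz / X) + 2;

    printf("X = %.4f m, grid = %d x %d x %d (%.1f million points)\n",
           X, Nx, Ny, Nz, (double)Nx * Ny * Nz / 1.0e6);
    return 0;
}

With these assumed dimensions the grid works out at roughly 100 million points, comparable to the 244 m³ test domain used in the experiments below.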

The basic scheme can be extended to include the effect of viscosity, which gives a frequency dependent damping [4]:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi + c\alpha\, \frac{\partial}{\partial t} \nabla^2 \Psi    (5)

Here \alpha is a viscosity coefficient. This leads to an update equation of the form:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \left( (2 - K\lambda^2)\, w^n_{l,m,p} + \lambda^2 S^n_{l,m,p} - (1 - \lambda\beta)\, w^{n-1}_{l,m,p} + cK\alpha\, (S^n_{l,m,p} - S^{n-1}_{l,m,p}) \right)    (6)

Note that this update uses the nearest neighbours from two time steps ago, thus requiring the use of three data grids. The basic scheme, which only uses the centre point from two time steps ago, can be implemented using only two grids and a read, then overwrite procedure. These systems are referred to as the "2 Grid" and "3 Grid" schemes throughout this paper.

PARALLEL COMPUTING AND USE OF MULTIPLE GPUS IN CUDA

CUDA is Nvidia's programming architecture for implementing highly-threaded GPU code. In a serial implementation of equation 3, loops would be used to iterate over the computation domain, applying the update equation to each grid point. In CUDA, we issue a large number of kernel threads that implement the SIMD (Single Instruction Multiple Data) operation, and these threads are scheduled to execute using a large number of parallel processing cores. The program code contains a mixture of host serial C code, and device CUDA code. Whilst the host uses standard CPU memory, the device has multiple memory types. Global memory is the core data store, and is typically in the range of 3 to 6 GB per GPU. CUDA threads make use of local, register memory, and also have a small amount of shared memory per thread block. They can also communicate directly with the global and (read-only) constant memory. Each type has different access speeds, with global memory being the slowest, and shared and constant being fast. With a four GPU server, we have four instances of this memory model which are independent. The GPUs are connected in a pair-wise manner over the PCIe bus, as shown in figure 1.

FIGURE 1: PCIe connections for four GPUs in a single compute node (GPU0-GPU1 and GPU2-GPU3 pairs, connected over PCIe to the host).

Version four and above of the CUDA architecture contains functionality that allows multiple GPUs that are connected in such a manner to be used concurrently [5]. Peer-to-peer communication allows data transfer between GPUs that bypasses the host altogether (transferring data from device to host, then back to another device is an expensive operation). This can be combined with the use of multiple streams of execution and asynchronous scheduling to achieve scalable speedups when using multiple GPUs.
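The peer-to-peer initialisation itself is not listed in the paper; a minimal sketch of what it might look like, using the standard CUDA runtime calls (the function name and simple device numbering are our assumptions), is:

#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: enable peer-to-peer access between every pair of GPUs in the node,
   so that later peer copies can bypass host memory. */
void enablePeerAccess(int num_gpus)
{
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        for (int j = 0; j < num_gpus; j++) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess) {
                /* grant the current device (i) direct access to device j */
                cudaDeviceEnablePeerAccess(j, 0);
            } else {
                printf("P2P unavailable between GPU %d and GPU %d\n", i, j);
            }
        }
    }
}

On systems where only neighbouring cards share a PCIe switch, the capability check simply reports which pairs can be enabled.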

IMPLEMENTATION OF THREE-DIMENSIONAL FDTD SCHEMES

This section gives a detailed description of the implementation of the basic 2 Grid FDTD scheme with its first-order boundaries. We start with a single GPU version, then extend this to an initial multiple GPU implementation. This is then developed to include the use of asynchronous data transfers of the partition halos.

Single GPU implementation

The CUDA programming model makes use of threads that are grouped first into blocks, and then into a grid. Both of these objects can be one, two, or three-dimensional in shape. Given a three-dimensional data domain, there are many possible approaches to mapping the threading model over the domain. Prior to the FERMI architecture, the standard approach was to utilise shared memory by issuing threads to cover a 2D layer of data. Each thread itself would then iterate over the final dimension, reusing data in shared memory [7]. Post FERMI, the caching system negates the benefits of this approach, and one can simply issue threads that cover the whole data domain [8]. Three-dimensional thread blocks can be used, for example 32 x 4 x 2, and a three-dimensional thread grid placed over the data. The time loop for the simulation contains a single kernel launch, then the input/output is processed, followed by swapping the pointers to the data grids, as shown in listing 1.

LISTING 1: Time loop for single GPU implementation.

for (n = 0; n < NF; n++)
{
    UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d, u1_d);
    // perform I/O
    inout<<<dimGridIO, dimBlockIO>>>(u_d, out_d, ins, n);
    // update pointers
    dummy_ptr = u1_d; u1_d = u_d; u_d = dummy_ptr;
}

The thread kernel itself implements both the interior and boundary update in a single SIMD operation, as shown in listing 2.

LISTING 2: Kernel code for single GPU implementation.

1   __global__ void UpDateScheme(real *u, real *u1) {
2       // Get X, Y, Z from 3D thread and block Ids
3       int X = blockIdx.x * Bx + threadIdx.x;
4       int Y = blockIdx.y * By + threadIdx.y;
5       int Z = blockIdx.z * Bz + threadIdx.z + 1;
6
7       // Test that not at halo, Z block excludes Z halo
8       if ((X > 0) && (X < (Nx-1)) && (Y > 0) && (Y < (Ny-1))) {
9           // Calculate linear centre position
10          int cp = Z*area + (Y*Nx + X);
11          int K = (0 < (X-1)) + (X < (Nx-2)) + (0 < (Y-1)) + (Y < (Ny-2)) + (0 < (Z-1)) + (Z < (Nz-2));
12          real cf  = 1.0;
13          real cf2 = 1.0;
14          // set loss coefficients if at a boundary
15          if (K < 6) { cf = cf_d[0].loss1; cf2 = cf_d[0].loss2; }
16          // Get sum of neighbour points
17          real S = u1[cp-1] + u1[cp+1] + u1[cp-Nx] + u1[cp+Nx] + u1[cp-area] + u1[cp+area];
18          // Calculate the update
19          u[cp] = cf * ((2.0 - K*cf_d[0].l2)*u1[cp] + cf_d[0].l2*S - cf2*u[cp]);
20      }
21  }

The kernel keeps the use of conditional statements to a minimum. A layer of non-updated "ghost" points is used around the data domain, and so line 8 employs a conditional to check for this. A single further conditional is used at line 15, to load the coefficients used at a boundary. The logical expression at line 11 computes the boundary position in an efficient manner, without the need for a lengthy IF-ELSEIF statement.
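For reference, a minimal sketch of the launch configuration implied by listings 1 and 2 is given below; this is our assumption rather than code from the paper. It uses the 32 x 4 x 2 thread blocks mentioned above, with the grid rounded up to cover the Nx x Ny x (Nz-2) region that the kernel updates (the +1 offset on Z in listing 2 skips the bottom ghost layer).

#include <cuda_runtime.h>

/* Sketch: build the 3D block and grid dimensions for UpDateScheme.
   Nx, Ny, Nz include the ghost layers; only Nz-2 layers are updated in Z. */
void makeLaunchConfig(int Nx, int Ny, int Nz, dim3 &dimGridInt, dim3 &dimBlockInt)
{
    dimBlockInt = dim3(32, 4, 2);
    dimGridInt  = dim3((Nx + dimBlockInt.x - 1) / dimBlockInt.x,
                       (Ny + dimBlockInt.y - 1) / dimBlockInt.y,
                       ((Nz - 2) + dimBlockInt.z - 1) / dimBlockInt.z);
}

Threads that fall outside the interior region are filtered out by the conditional at line 8 of the kernel.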

Non-asynchronous implementation using multiple GPUs

In transitioning from a single GPU to the use of four GPUs, the data domain needs to be partitioned. The individual GPUs have discrete memory, and so the 3D data needs to be separated into four segments. A further complication is that the FDTD scheme requires neighbouring points in all dimensions, and so overlap halos will be required. The 3D data itself is decomposed using a row-major alignment for each layer of the Z dimension, with consecutive layers in series. In this format, each layer occupies contiguous memory locations. Thus, the most natural partitioning is across the Z dimension, as shown in figure 2. The overlap halos are individual Z layers, and so can be transferred as a single contiguous block of memory.

FIGURE 2: Data partitioning across the Z dimension using four GPUs, with overlap halos.

Whilst the domain partitioning is straightforward, the CUDA code itself requires many extensions compared to the single GPU case. In terms of the pre-time loop setup code, individual commands such as cudaMalloc() become embedded in loops over the four GPUs. A call is made to cudaSetDevice() at each iteration, to perform the operation on individual GPUs. Single pointers to device memory become arrays of pointers, and constant memory has to be allocated to each GPU. The halo offset locations have to be calculated as linear positions across memory, and finally the peer-to-peer access has to be initialised. In a non-asynchronous implementation, the time loop operates as follows:

1. Loop over the GPUs, issuing a kernel launch to compute the data on that GPU.
2. Synchronize all GPUs.
3. Perform peer-to-peer data transfer for overlap halos.
4. Perform input/output.
5. Synchronize and swap data pointers.

Each GPU computes its data simultaneously, but only when all have completed do we then perform the data transfers of the individual overlap halos, as sketched below.
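A minimal sketch of this non-asynchronous time loop follows; it is our reading of the five steps above rather than code from the paper. The variable names (gpu[], u_d[], u1_d[], pos[], area_size, NF) match those of listing 3 in the next subsection, and the setup code and the remaining halo copies are omitted.

for (n = 0; n < NF; n++)
{
    // 1. launch the update kernel on each GPU
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d[i], u1_d[i]);
    }
    // 2. wait until every card has finished its update
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        cudaDeviceSynchronize();
    }
    // 3. blocking peer-to-peer halo exchange, e.g. GPU0 -> GPU1
    cudaMemcpyPeer(u_d[1], gpu[1], &u_d[0][pos[0]], gpu[0], area_size);
    // ...remaining halo copies...
    // 4. perform I/O
    // ...
    // 5. synchronise and swap the data pointers
    // ...
}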

Implementation using asynchronous data transfers

The above implementation contains an inherent time lag, as the GPUs are idle during the data transfers across the PCIe bus. To eliminate this, we can make use of asynchronous behaviour and streams. The approach used is based on that outlined by Nvidia [9], but is extended here to operate with the large-scale halo layers that occur in the 3D case. An individual Z layer can contain millions of floating-point values, and six layers have to be transferred between GPUs at each time step.

A stream is simply a sequence of CUDA events that occur in series. However, multiple streams can be used so that events can execute in a concurrent and asynchronous manner. As the FDTD scheme is data-independent at each time step, the overlap halo layers can be computed and the data transfers performed at the same time as the larger interior data segments on each GPU. This is accomplished by using one stream of events for the halos and transfers on each GPU, whilst a second stream is used for the interior. The streams are identified in the kernel launches, as shown in the time loop code detailed in listing 3.

LISTING 3: Time loop for asynchronous four GPU implementation.

for (n = 0; n < NF; n++)
{
    // Compute halo layers, then interior
    p = 0;
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
        p++;
        if (i > 0 && i < num_gpus-1) {
            UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
            p++;
        }
        cudaStreamQuery(stream_halo[i]);
        UpDateInterior<<<dimGridInt, dimBlockInt, 0, stream_int[i]>>>(u_d[i], u1_d[i], i);
    }
    // Exchange halos
    cudaMemcpyPeerAsync(u_d[1], gpu[1], &u_d[0][pos[0]], gpu[0], area_size, stream_halo[0]);
    ...
    // perform I/O
    ...
    // Synchronise and update pointers
    ...
}

Initially, we iterate over the GPUs and launch the kernels required for the halos using stream_halo. Note that GPUs 0 and 3 have a single halo, whilst GPUs 1 and 2 contain two halos. Then the main interior data kernels are launched, using stream_int. The data transfer events are then pushed into stream_halo, which will execute when the halos have been computed. In this manner, the data transfers proceed at the same time as the interior computation is being performed. The GPUs are then synchronized before swapping the data pointers.

EXPERIMENTAL TESTING

Initial testing is performed using data grids containing 100 million points each. At double precision this requires 0.8 GB of data per grid (1.6 GB for the whole 2 Grid simulation), and so allows a comparison to be made between single GPU and four GPU implementations using Tesla C2050 GPUs that have 3 GB of global memory. Using a sample rate of 44.1 kHz, the domain size is 244 m³.

Computation times

The 2 Grid scheme is used to compare computation times for the single GPU, basic (non-asynchronous) four GPU, and asynchronous four GPU implementations. The simulations are computed for 4,410 samples in each case, at 44.1 kHz, and for both single and double precision floating-point accuracy. Table 1 shows the resulting times. The data grids are of size Nx: 960 points, Ny: 396 points, and Nz: 264 points.

TABLE 1: Computation times and speedups for double (DP) and single (SP) precision.

Setup            DP Time (sec)   Speedup   SP Time (sec)   Speedup
Single GPU       145.3           -         89.5            -
Basic four GPU   59.4            x2.4      36.1            x2.5
Async four GPU   48.1            x3.0      29.5            x3.0

The basic four GPU implementation only achieves a speedup of x2.5, whilst the asynchronous version gets to x3. The grid sizes for these initial tests contain a very large Z layer (960 x 396 = 380,160 points). As six overlap halos of this size have to be transferred between GPUs at each time step, this is still a limiting factor.
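To put the transfer cost in perspective (our own arithmetic, assuming 8-byte double precision values), each of these halo layers occupies

    380{,}160 \times 8\ \text{bytes} \approx 3.0\ \text{MB},

so roughly 18 MB of peer-to-peer traffic must cross the PCIe bus at every time step, which is why the halo size remains a limiting factor even with asynchronous transfers.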

To test the effect of the Z layer size, the double precision simulation is performed for decreasing sizes whilst keeping the same overall domain size of 244 m³, ranging from 380,160 points down to 76,032 points. Figure 3 shows the effect in terms of the dimensions of the space.

FIGURE 3: Variations in Z layer sizes for a domain of 244 m³ (Z layers from 380,160 points, Nx: 960 x Ny: 396, down to 76,032 points, Nx: 192 x Ny: 396).

Table 2 shows the timing results. As the Z layer size decreases we get closer to the x4 scalable speedup.

TABLE 2: Effect of variation in the size of the Z layer on computation time.

Z layer size (points)   Time (sec)   Speedup over single GPU
380,160                 48.1         x3.02
329,472                 46.0         x3.16
278,784                 45.1         x3.22
228,096                 44.1         x3.29
177,408                 43.4         x3.35
126,720                 42.8         x3.39
76,032                  41.3         x3.52

Floating-point precision

Single precision floating-point variables require 32 bits of memory compared to double precision which requires 64 bits. So, using single precision we can effectively double the size of the computation domain using the same amount of memory. There is also an additional benefit, as GPUs offer greater peak performance at single precision. However, testing on the 100 million point domain reveals stability issues when running at, or very close to, the Courant limit for the scheme. Figure 4 shows the outputs for a 40,000 time step simulation at 44.1 kHz using a DC-blocked audio input and grid spacing set at the Courant limit. The single precision output (blue) is stable initially, but shows phase and amplitude differences compared to the double precision (red). After 30,000 samples, the single precision output begins to diverge, and finally becomes unstable after 40,000 samples. Backing away from the Courant limit by around 0.05% ensures stability in single precision, at the cost of introducing greater dispersion.

FIGURE 4: Double (red) vs single (blue) precision at the Courant limit over 40,000 time steps (normalised level against time in samples).
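The paper does not show how the precision is selected in code; one common approach, assumed here rather than taken from the source, is to define the real type used by the kernels at compile time and, for single precision builds, to set the grid spacing slightly above the Courant limit:

/* Sketch (assumption): compile-time switch for the 'real' type used in
   listings 1-3, e.g. build with -DUSE_DOUBLE for double precision. */
#ifdef USE_DOUBLE
typedef double real;
#else
typedef float real;
#endif

/* In single precision, back away from the Courant limit by around 0.05%,
   as described above, accepting slightly greater dispersion:
   X = 1.0005 * sqrt(3.0) * c * T;  */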

LARGE-SCALE ACOUSTIC SIMULATIONS

Having detailed the efficiency of the asynchronous four GPU implementation, we can now consider the use of maximum available memory to perform large-scale simulations. Nvidia's Tesla GPUs come with various amounts of global memory, and so table 3 shows the maximum simulation sizes for various configurations, at a sample rate of 44.1 kHz. Note that the GPUs have less available memory than is actually labelled, for example a 3 GB C2050 has a useable global memory of around 2.8 GB. Whilst the table shows the maximum sizes for both the 2 Grid and 3 Grid schemes, in practice this has to be reduced to allow for storage of audio output arrays and, in the four GPU case, overlap halos of variable size.

TABLE 3: Maximum simulation sizes in points per grid (millions) and cubic metres, at 44.1 kHz.

GPU        2 Grid SP   m³      2 Grid DP   m³      3 Grid SP   m³      3 Grid DP   m³
3 GB       352         865     175         430     235         584     118         290
5 GB       595         1,461   297         729     395         983     199         490
6 GB       722         1,774   361         886     480         1,193   240         589
4 x 3 GB   1,409       3,460   702         1,722   940         2,336   473         1,160
4 x 5 GB   2,380       5,844   1,189       2,918   1,582       3,934   798         1,960
4 x 6 GB   2,889       7,096   1,444       3,546   1,920       4,775   960         2,357

Four Tesla C2050 GPUs are used for the testing here, each of which has 3 GB of global memory. Thus for the basic 2 Grid scheme at single precision we can compute simulations using 1.4 billion grid points, and a resulting simulation size of 3,350 m³. For the 3 Grid scheme including viscosity, the grids contain just under a billion points. Table 4 shows the computation times for maximum memory simulations, running for 44,100 samples at 44.1 kHz.

TABLE 4: Maximum memory computation times for 44,100 samples at 44.1 kHz.

Simulation                     Size (m³)   Time (min)
2 Grid DP (double precision)   1,682       44.6
2 Grid SP (single precision)   3,350       53.1
3 Grid DP (double precision)   1,112       48.5
3 Grid SP (single precision)   2,257       52.7
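As a rough cross-check of table 3 (our own arithmetic, assuming around 2.8 GB of useable memory per 3 GB card and c of roughly 344 m/s): the 2 Grid single precision scheme on four 3 GB cards allows

    \frac{4 \times 2.8\ \text{GB}}{2\ \text{grids} \times 4\ \text{bytes}} \approx 1.4 \times 10^9\ \text{points},

and with X = \sqrt{3}\,cT \approx 13.5 mm at 44.1 kHz each point represents X^3 \approx 2.5 \times 10^{-6} m³, giving a volume of roughly 3,500 m³, in line with the 3,460 m³ entry in the table.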

CONCLUSIONS

The use of asynchronous data transfers and concurrent execution allows multiple GPUs to be used effectively to achieve near-scalable speedups in three-dimensional FDTD schemes, typically ranging from x3 to x3.5 when using four GPUs, depending on the size of the overlap halos. By using all available memory on a four GPU compute node, we can perform virtual acoustic simulations using billions of grid points. At audio rates such as 44.1 kHz, this allows the modelling of large rooms and halls, of several thousand cubic metres.

Stability becomes an issue when running at single precision to maximise memory usage. Computing schemes at the Courant limit using single precision can lead to instability over time, although they may appear stable initially. Backing away from the Courant limit with a small increase in the grid spacing resolves this behaviour.

Computation times for large-scale maximum memory simulations are around forty to fifty minutes per second of output at 44.1 kHz, using four Tesla C2050 GPUs. Initial testing on the latest Kepler architecture GPUs shows a near two-fold speedup over the FERMI Tesla GPUs used here, and so should bring computation times down to under half an hour.

ACKNOWLEDGEMENTS

This work is supported by the European Research Council, under Grant StG-2011-279068-NESS.

REFERENCES

[1] N. Röber, U. Kaminski, and M. Masuch, "Ray acoustics using computer graphics technology", in Proc. of the 10th Int. Conf. on Digital Audio Effects (DAFx-07), Bordeaux, France (2007).

[2] E. Lehmann and A. Johansson, "Diffuse reverberation model for efficient image-source simulation of room impulse responses", IEEE Transactions on Audio, Speech and Language Processing, 18(6), 1429-1439 (2010).

[3] L. Savioja, D. Manocha, and M. Lin, "Use of GPUs in room acoustic modeling and auralization", in Proc. Int. Symposium on Room Acoustics, Melbourne, Australia (2010).

[4] C. Webb and S. Bilbao, "Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity", in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Prague, Czech Republic (2011).

[5] Nvidia, "CUDA C programming guide", CUDA toolkit documentation. [Online] [Cited: 8th Jan 2013.] http://docs.nvidia.com/cuda/ (2012).

[6] J. Strikwerda, Finite Difference Schemes and Partial Differential Equations (Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, California) (1989).

[7] P. Micikevicius, "3D finite difference computation on GPUs using CUDA", in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), 79-84, New York, NY, USA (2009).

[8] C. Webb and S. Bilbao, "Virtual room acoustics: A comparison of techniques for computing 3D FDTD schemes using CUDA", in Proc. 130th Convention of the Audio Engineering Society (AES), London, UK (2011).

[9] P. Micikevicius, "Multi-GPU Programming", Nvidia CUDA webinars. [Online] [Cited: 6th Jan 2013.] http://developer.download.nvidia.com/cuda/training/ (2011).