Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs




Craig J. Webb 1,2 and Alan Gray 2

1 Acoustics Group, University of Edinburgh
2 Edinburgh Parallel Computing Centre, University of Edinburgh

To be presented at the 21st International Congress on Acoustics, Montréal, Canada, 2013

Abstract

The computation of large-scale virtual acoustics using the 3D finite difference time domain (FDTD) is prohibitively computationally expensive, especially at high audio sample rates, when using traditional CPUs. In recent years the computer gaming industry has driven the development of extremely powerful Graphics Processing Units (GPUs). Through specialised development and tuning we can exploit the highly parallel GPU architecture to make such FDTD computations feasible. This paper describes the simultaneous use of multiple NVIDIA GPUs to compute schemes containing over a billion grid points. We examine the use of asynchronous halo transfers between cards, to hide the latency involved in transferring data, and overall computation time is considered with respect to variation in the size of the partition layers. As hardware memory poses limitations on the size of the room to be rendered, we also investigate the use of single precision arithmetic. This allows twice the domain space, compared with double precision, but results in phase shifting of the output with possible audible artefacts. Using these techniques, large-scale spaces of several thousand cubic metres can be computed at 44.1 kHz in a useable time frame, making their use in room acoustics rendering and auralization applications possible in the near future.

C.J.Webb-2@sms.ed.ac.uk

INTRODUCTION

High fidelity virtual room acoustics can be approached through direct numerical simulation of wave propagation in a defined space. Unlike ray-based [1] or image source [2] techniques, this approach seeks to model the entire acoustic field within the simulation domain. Three-dimensional Finite Difference Time Domain (FDTD) schemes can be employed, however at audio sample rates such schemes are extremely computationally expensive [3], prohibitively so for serial computation. Recent advances in graphics processing unit (GPU) architectures allow for general purpose computation to be performed on these forms of highly parallel hardware. Whilst central processing units (CPUs) may contain a small number of cores, such as four or eight, GPUs contain hundreds of processing cores that can be used to perform parallel computation. Using this architecture, the data independence of FDTD schemes can be leveraged to gain significant acceleration over single-threaded implementations [4], and this allows large-scale simulations to be computed in time scales that are actually useable for performing research.

For scientific computing, Nvidia's Tesla GPUs are typically used in a workstation or compute node that can be configured with four GPUs connected across the same PCIe bus. This paper examines the simultaneous use of these four-GPU systems to render virtual acoustic simulations. This allows greater acceleration of existing models, or the combined use of all available memory across four GPUs to render large-scale domains containing billions of grid points. Recent versions of the CUDA language facilitate this process [5], without recourse to MPI programming techniques.

The first section details the FDTD schemes being used, followed by an outline of the CUDA programming model for the simultaneous use of multiple GPUs. We then describe the implementation of the schemes using both non-asynchronous and asynchronous approaches. Finally, we detail experimental testing in terms of floating-point precision and overall computation times for various configurations, including large-scale simulations that use maximum memory.

VIRTUAL ACOUSTICS USING FINITE DIFFERENCE METHOD

The starting point for acoustic FDTD simulations is the 3D wave equation, which in second order form is given by:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi    (1)

Here \Psi is the target acoustical field quantity, c is the wave speed in air, and \nabla^2 is the 3D Laplacian. Simple first-order boundary conditions are used, where:

    \frac{\partial \Psi}{\partial t} = c\beta\, \mathbf{n} \cdot \nabla \Psi    (2)

Here \mathbf{n} is a unit normal to a wall or obstacle, and \beta is an absorption coefficient. The standard FDTD discretisation leads to the following update equation, which includes boundary loss terms using a single reflection coefficient:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \left( (2 - K\lambda^2)\, w^n_{l,m,p} + \lambda^2 S^n_{l,m,p} - (1 - \lambda\beta)\, w^{n-1}_{l,m,p} \right)    (3)

where w_{l,m,p} is the discrete acoustic field, K is 6 in free space, 5 at a face, 4 at an edge and 3 at a corner, \lambda = cT/X, \beta is the coefficient for boundary reflection losses, and S^n_{l,m,p} = w^n_{l+1,m,p} + w^n_{l-1,m,p} + w^n_{l,m+1,p} + w^n_{l,m-1,p} + w^n_{l,m,p+1} + w^n_{l,m,p-1}. The stability condition for the scheme follows from von Neumann analysis [6], such that for a given time step T the grid spacing X must satisfy:

    X \geq \sqrt{3 c^2 T^2}    (4)
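As an illustration of this condition, the short host-side sketch below (not taken from the paper; the wave speed and room dimensions are assumed values) sets the grid spacing at the Courant limit for a 44.1 kHz sample rate and sizes the grid for a given room, allowing for the layer of ghost points described later.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double SR = 44100.0;            /* audio sample rate (Hz)             */
    double c  = 344.0;              /* assumed speed of sound (m/s)       */
    double T  = 1.0 / SR;           /* time step                          */
    double X  = sqrt(3.0) * c * T;  /* grid spacing at the Courant limit  */

    /* assumed room dimensions in metres */
    double Lx = 12.0, Ly = 5.0, Lz = 4.0;

    /* points per dimension, plus two ghost layers in each direction */
    int Nx = (int)floor(Lx / X) + 2;
    int Ny = (int)floor(Ly / X) + 2;
    int Nz = (int)floor(Lz / X) + 2;

    printf("X = %.4f m, grid = %d x %d x %d (%.1f million points)\n",
           X, Nx, Ny, Nz, (double)Nx * Ny * Nz / 1.0e6);
    return 0;
}

With these assumed dimensions the grid works out at roughly 100 million points, comparable to the 244 m³ test domain used in the experiments below.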

The basic scheme can be extended to include the effect of viscosity, which gives a frequency dependent damping [4]:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi + c\alpha\, \frac{\partial}{\partial t} \nabla^2 \Psi    (5)

Here \alpha is a viscosity coefficient. This leads to an update equation of the form:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \left( (2 - K\lambda^2)\, w^n_{l,m,p} + \lambda^2 S^n_{l,m,p} - (1 - \lambda\beta)\, w^{n-1}_{l,m,p} + cK\alpha\, (S^n_{l,m,p} - S^{n-1}_{l,m,p}) \right)    (6)

Note that this update uses the nearest neighbours from two time steps ago, thus requiring the use of three data grids. The basic scheme, which only uses the centre point from two time steps ago, can be implemented using only two grids and a read, then overwrite procedure. These systems are referred to as the "2 Grid" and "3 Grid" schemes throughout this paper.

PARALLEL COMPUTING AND USE OF MULTIPLE GPUS IN CUDA

CUDA is Nvidia's programming architecture for implementing highly-threaded GPU code. In a serial implementation of equation 3, loops would be used to iterate over the computation domain, applying the update equation to each grid point. In CUDA, we issue a large number of kernel threads that implement the SIMD (Single Instruction Multiple Data) operation, and these threads are scheduled to execute using a large number of parallel processing cores. The program code contains a mixture of host serial C code, and device CUDA code. Whilst the host uses standard CPU memory, the device has multiple memory types. Global memory is the core data store, and is typically in the range of 3 to 6 GB per GPU. CUDA threads make use of local, register memory, and also have a small amount of shared memory per thread block. They can also communicate directly with the global and (read-only) constant memory. Each type has different access speeds, with global memory being the slowest, and shared and constant being fast. With a four GPU server, we have four instances of this memory model which are independent. The GPUs are connected in a pair-wise manner over the PCIe bus, as shown in figure 1.

FIGURE 1: PCIe connections for four GPUs in a single compute node (GPU0-GPU1 and GPU2-GPU3 pairs, connected over PCIe to the host).

Version four and above of the CUDA architecture contains functionality that allows multiple GPUs that are connected in such a manner to be used concurrently [5]. Peer-to-peer communication allows data transfer between GPUs that bypasses the host altogether (transferring data from device to host, then back to another device is an expensive operation). This can be combined with the use of multiple streams of execution and asynchronous scheduling to achieve scalable speedups when using multiple GPUs.
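The peer-to-peer initialisation itself is not listed in the paper; a minimal sketch of what it might look like, using the standard CUDA runtime calls (the function name and simple device numbering are our assumptions), is:

#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: enable peer-to-peer access between every pair of GPUs in the node,
   so that later peer copies can bypass host memory. */
void enablePeerAccess(int num_gpus)
{
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        for (int j = 0; j < num_gpus; j++) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess) {
                /* grant the current device (i) direct access to device j */
                cudaDeviceEnablePeerAccess(j, 0);
            } else {
                printf("P2P unavailable between GPU %d and GPU %d\n", i, j);
            }
        }
    }
}

On systems where only neighbouring cards share a PCIe switch, the capability check simply reports which pairs can be enabled.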

IMPLEMENTATION OF THREE-DIMENSIONAL FDTD SCHEMES

This section gives a detailed description of the implementation of the basic 2 Grid FDTD scheme with its first-order boundaries. We start with a single GPU version, then extend this to an initial multiple GPU implementation. This is then developed to include the use of asynchronous data transfers of the partition halos.

Single GPU implementation

The CUDA programming model makes use of threads that are grouped first into blocks, and then into a grid. Both of these objects can be one, two, or three-dimensional in shape. Given a three-dimensional data domain, there are many possible approaches to mapping the threading model over the domain. Prior to the FERMI architecture, the standard approach was to utilise shared memory by issuing threads to cover a 2D layer of data. Each thread itself would then iterate over the final dimension, reusing data in shared memory [7]. Post FERMI, the caching system negates the benefits of this approach, and one can simply issue threads that cover the whole data domain [8]. Three-dimensional thread blocks can be used, for example 32 x 4 x 2, and a three-dimensional thread grid placed over the data. The time loop for the simulation contains a single kernel launch, then the input/output is processed, followed by swapping the pointers to the data grids, as shown in listing 1.

LISTING 1: Time loop for single GPU implementation.

for (n = 0; n < NF; n++)
{
    UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d, u1_d);
    // perform I/O
    inout<<<dimGridIO, dimBlockIO>>>(u_d, out_d, ins, n);
    // update pointers
    dummy_ptr = u1_d; u1_d = u_d; u_d = dummy_ptr;
}

The thread kernel itself implements both the interior and boundary update in a single SIMD operation, as shown in listing 2.

LISTING 2: Kernel code for single GPU implementation.

1   __global__ void UpDateScheme(real *u, real *u1) {
2       // Get X, Y, Z from 3D thread and block Ids
3       int X = blockIdx.x * Bx + threadIdx.x;
4       int Y = blockIdx.y * By + threadIdx.y;
5       int Z = blockIdx.z * Bz + threadIdx.z + 1;
6
7       // Test that not at halo, Z block excludes Z halo
8       if ((X > 0) && (X < (Nx-1)) && (Y > 0) && (Y < (Ny-1))) {
9           // Calculate linear centre position
10          int cp = Z*area + (Y*Nx + X);
11          int K = (0 < (X-1)) + (X < (Nx-2)) + (0 < (Y-1)) + (Y < (Ny-2)) + (0 < (Z-1)) + (Z < (Nz-2));
12          real cf  = 1.0;
13          real cf2 = 1.0;
14          // set loss coefficients if at a boundary
15          if (K < 6) { cf = cf_d[0].loss1; cf2 = cf_d[0].loss2; }
16          // Get sum of neighbour points
17          real S = u1[cp-1] + u1[cp+1] + u1[cp-Nx] + u1[cp+Nx] + u1[cp-area] + u1[cp+area];
18          // Calculate the update
19          u[cp] = cf * ((2.0 - K*cf_d[0].l2)*u1[cp] + cf_d[0].l2*S - cf2*u[cp]);
20      }
21  }

The kernel keeps the use of conditional statements to a minimum. A layer of non-updated "ghost" points is used around the data domain, and so line 8 employs a conditional to check for this. A single further conditional is used at line 15, to load the coefficients used at a boundary. The logical expression at line 11 computes the boundary position in an efficient manner, without the need for a lengthy IF-ELSEIF statement.
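For reference, a minimal sketch of the launch configuration implied by listings 1 and 2 is given below; this is our assumption rather than code from the paper. It uses the 32 x 4 x 2 thread blocks mentioned above, with the grid rounded up to cover the Nx x Ny x (Nz-2) region that the kernel updates (the +1 offset on Z in listing 2 skips the bottom ghost layer).

#include <cuda_runtime.h>

/* Sketch: build the 3D block and grid dimensions for UpDateScheme.
   Nx, Ny, Nz include the ghost layers; only Nz-2 layers are updated in Z. */
void makeLaunchConfig(int Nx, int Ny, int Nz, dim3 &dimGridInt, dim3 &dimBlockInt)
{
    dimBlockInt = dim3(32, 4, 2);
    dimGridInt  = dim3((Nx + dimBlockInt.x - 1) / dimBlockInt.x,
                       (Ny + dimBlockInt.y - 1) / dimBlockInt.y,
                       ((Nz - 2) + dimBlockInt.z - 1) / dimBlockInt.z);
}

Threads that fall outside the interior region are filtered out by the conditional at line 8 of the kernel.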

Non-asynchronous implementation using multiple GPUs

In transitioning from a single GPU to the use of four GPUs, the data domain needs to be partitioned. The individual GPUs have discrete memory, and so the 3D data needs to be separated into four segments. A further complication is that the FDTD scheme requires neighbouring points in all dimensions, and so overlap halos will be required. The 3D data itself is decomposed using a row-major alignment for each layer of the Z dimension, with consecutive layers in series. In this format, each layer occupies contiguous memory locations. Thus, the most natural partitioning is across the Z dimension, as shown in figure 2. The overlap halos are individual Z layers, and so can be transferred as a single contiguous block of memory.

FIGURE 2: Data partitioning across the Z dimension using four GPUs, with overlap halos.

Whilst the domain partitioning is straightforward, the CUDA code itself requires many extensions compared to the single GPU case. In terms of the pre-time loop setup code, individual commands such as cudaMalloc() become embedded in loops over the four GPUs. A call is made to cudaSetDevice() at each iteration, to perform the operation on individual GPUs. Single pointers to device memory become arrays of pointers, and constant memory has to be allocated to each GPU. The halo offset locations have to be calculated as linear positions across memory, and finally the peer-to-peer access has to be initialised. In a non-asynchronous implementation, the time loop operates as follows:

1. Loop over the GPUs, issuing a kernel launch to compute the data on that GPU.
2. Synchronize all GPUs.
3. Perform peer-to-peer data transfer for overlap halos.
4. Perform input/output.
5. Synchronize and swap data pointers.

Each GPU computes its data simultaneously, but only when all have completed do we then perform the data transfers of the individual overlap halos, as sketched below.
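A minimal sketch of this non-asynchronous time loop follows; it is our reading of the five steps above rather than code from the paper. The variable names (gpu[], u_d[], u1_d[], pos[], area_size, NF) match those of listing 3 in the next subsection, and the setup code and the remaining halo copies are omitted.

for (n = 0; n < NF; n++)
{
    // 1. launch the update kernel on each GPU
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d[i], u1_d[i]);
    }
    // 2. wait until every card has finished its update
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        cudaDeviceSynchronize();
    }
    // 3. blocking peer-to-peer halo exchange, e.g. GPU0 -> GPU1
    cudaMemcpyPeer(u_d[1], gpu[1], &u_d[0][pos[0]], gpu[0], area_size);
    // ...remaining halo copies...
    // 4. perform I/O
    // ...
    // 5. synchronise and swap the data pointers
    // ...
}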

Implementation using asynchronous data transfers

The above implementation contains an inherent time lag, as the GPUs are idle during the data transfers across the PCIe bus. To eliminate this, we can make use of asynchronous behaviour and streams. The approach used is based on that outlined by Nvidia [9], but is extended here to operate with the large-scale halo layers that occur in the 3D case. An individual Z layer can contain millions of floating-point values, and six layers have to be transferred between GPUs at each time step.

A stream is simply a sequence of CUDA events that occur in series. However, multiple streams can be used so that events can execute in a concurrent and asynchronous manner. As the FDTD scheme is data-independent at each time step, the overlap halo layers can be computed and the data transfers performed at the same time as the larger interior data segments on each GPU. This is accomplished by using one stream of events for the halos and transfers on each GPU, whilst a second stream is used for the interior. The streams are identified in the kernel launches, as shown in the time loop code detailed in listing 3.

LISTING 3: Time loop for asynchronous four GPU implementation.

for (n = 0; n < NF; n++)
{
    // Compute halo layers, then interior
    p = 0;
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
        p++;
        if (i > 0 && i < num_gpus-1) {
            UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
            p++;
        }
        cudaStreamQuery(stream_halo[i]);
        UpDateInterior<<<dimGridInt, dimBlockInt, 0, stream_int[i]>>>(u_d[i], u1_d[i], i);
    }
    // Exchange halos
    cudaMemcpyPeerAsync(u_d[1], gpu[1], &u_d[0][pos[0]], gpu[0], area_size, stream_halo[0]);
    ...
    // perform I/O
    ...
    // Synchronise and update pointers
    ...
}

Initially, we iterate over the GPUs and launch the kernels required for the halos using stream_halo. Note that GPUs 0 and 3 have a single halo, whilst GPUs 1 and 2 contain two halos. Then the main interior data kernels are launched, using stream_int. The data transfer events are then pushed into stream_halo, which will execute when the halos have been computed. In this manner, the data transfers proceed at the same time as the interior computation is being performed. The GPUs are then synchronized before swapping the data pointers.

EXPERIMENTAL TESTING

Initial testing is performed using data grids containing 100 million points each. At double precision this requires 0.8 GB of data per grid (1.6 GB for the whole 2 Grid simulation), and so allows a comparison to be made between single GPU and four GPU implementations using Tesla C2050 GPUs that have 3 GB of global memory. Using a sample rate of 44.1 kHz, the domain size is 244 m³.

Computation times

The 2 Grid scheme is used to compare computation times for the single GPU, basic (non-asynchronous) four GPU, and asynchronous four GPU implementations. The simulations are computed for 4,410 samples in each case, at 44.1 kHz, and for both single and double precision floating-point accuracy. Table 1 shows the resulting times. The data grids are of size Nx: 960 points, Ny: 396 points, and Nz: 264 points.

TABLE 1: Computation times and speedups for double (DP) and single (SP) precision.

Setup            DP Time (sec)   Speedup   SP Time (sec)   Speedup
Single GPU       145.3           -         89.5            -
Basic four GPU   59.4            x2.4      36.1            x2.5
Async four GPU   48.1            x3.0      29.5            x3.0

The basic four GPU implementation only achieves a speedup of x2.5, whilst the asynchronous version gets to x3. The grid sizes for these initial tests contain a very large Z layer (960 x 396 = 380,160 points). As six overlap halos of this size have to be transferred between GPUs at each time step, this is still a limiting factor.
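To put the transfer cost in perspective (our own arithmetic, assuming 8-byte double precision values), each of these halo layers occupies

    380{,}160 \times 8\ \text{bytes} \approx 3.0\ \text{MB},

so roughly 18 MB of peer-to-peer traffic must cross the PCIe bus at every time step, which is why the halo size remains a limiting factor even with asynchronous transfers.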

To test the effect of the Z layer size, the double precision simulation is performed for decreasing sizes whilst keeping the same overall domain size of 244 m³, ranging from 380,160 points down to 76,032 points. Figure 3 shows the effect in terms of the dimensions of the space.

FIGURE 3: Variations in Z layer sizes for a domain of 244 m³ (Z layers from 380,160 points, Nx: 960 x Ny: 396, down to 76,032 points, Nx: 192 x Ny: 396).

Table 2 shows the timing results. As the Z layer size decreases we get closer to the x4 scalable speedup.

TABLE 2: Effect of variation in the size of the Z layer on computation time.

Z layer size (points)   Time (sec)   Speedup over single GPU
380,160                 48.1         x3.02
329,472                 46.0         x3.16
278,784                 45.1         x3.22
228,096                 44.1         x3.29
177,408                 43.4         x3.35
126,720                 42.8         x3.39
76,032                  41.3         x3.52

Floating-point precision

Single precision floating-point variables require 32 bits of memory compared to double precision which requires 64 bits. So, using single precision we can effectively double the size of the computation domain using the same amount of memory. There is also an additional benefit, as GPUs offer greater peak performance at single precision. However, testing on the 100 million point domain reveals stability issues when running at, or very close to, the Courant limit for the scheme. Figure 4 shows the outputs for a 40,000 time step simulation at 44.1 kHz using a DC-blocked audio input and grid spacing set at the Courant limit. The single precision output (blue) is stable initially, but shows phase and amplitude differences compared to the double precision (red). After 30,000 samples, the single precision output begins to diverge, and finally becomes unstable after 40,000 samples. Backing away from the Courant limit by around 0.05% ensures stability in single precision, at the cost of introducing greater dispersion.

FIGURE 4: Double (red) vs single (blue) precision at the Courant limit over 40,000 time steps (normalised level against time in samples).
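The paper does not show how the precision is selected in code; one common approach, assumed here rather than taken from the source, is to define the real type used by the kernels at compile time and, for single precision builds, to set the grid spacing slightly above the Courant limit:

/* Sketch (assumption): compile-time switch for the 'real' type used in
   listings 1-3, e.g. build with -DUSE_DOUBLE for double precision. */
#ifdef USE_DOUBLE
typedef double real;
#else
typedef float real;
#endif

/* In single precision, back away from the Courant limit by around 0.05%,
   as described above, accepting slightly greater dispersion:
   X = 1.0005 * sqrt(3.0) * c * T;  */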

LARGE-SCALE ACOUSTIC SIMULATIONS

Having detailed the efficiency of the asynchronous four GPU implementation, we can now consider the use of maximum available memory to perform large-scale simulations. Nvidia's Tesla GPUs come with various amounts of global memory, and so table 3 shows the maximum simulation sizes for various configurations, at a sample rate of 44.1 kHz. Note that the GPUs have less available memory than is actually labelled, for example a 3 GB C2050 has a useable global memory of around 2.8 GB. Whilst the table shows the maximum sizes for both the 2 Grid and 3 Grid schemes, in practice this has to be reduced to allow for storage of audio output arrays and, in the four GPU case, overlap halos of variable size.

TABLE 3: Maximum simulation sizes in points per grid (millions) and cubic metres, at 44.1 kHz.

GPU        2 Grid SP   m³      2 Grid DP   m³      3 Grid SP   m³      3 Grid DP   m³
3 GB       352         865     175         430     235         584     118         290
5 GB       595         1,461   297         729     395         983     199         490
6 GB       722         1,774   361         886     480         1,193   240         589
4 x 3 GB   1,409       3,460   702         1,722   940         2,336   473         1,160
4 x 5 GB   2,380       5,844   1,189       2,918   1,582       3,934   798         1,960
4 x 6 GB   2,889       7,096   1,444       3,546   1,920       4,775   960         2,357

Four Tesla C2050 GPUs are used for the testing here, each of which has 3 GB of global memory. Thus for the basic 2 Grid scheme at single precision we can compute simulations using 1.4 billion grid points, and a resulting simulation size of 3,350 m³. For the 3 Grid scheme including viscosity, the grids contain just under a billion points. Table 4 shows the computation times for maximum memory simulations, running for 44,100 samples at 44.1 kHz.

TABLE 4: Maximum memory computation times for 44,100 samples at 44.1 kHz.

Simulation                     Size (m³)   Time (min)
2 Grid DP (double precision)   1,682       44.6
2 Grid SP (single precision)   3,350       53.1
3 Grid DP (double precision)   1,112       48.5
3 Grid SP (single precision)   2,257       52.7
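As a rough cross-check of table 3 (our own arithmetic, assuming around 2.8 GB of useable memory per 3 GB card and c of roughly 344 m/s): the 2 Grid single precision scheme on four 3 GB cards allows

    \frac{4 \times 2.8\ \text{GB}}{2\ \text{grids} \times 4\ \text{bytes}} \approx 1.4 \times 10^9\ \text{points},

and with X = \sqrt{3}\,cT \approx 13.5 mm at 44.1 kHz each point represents X^3 \approx 2.5 \times 10^{-6} m³, giving a volume of roughly 3,500 m³, in line with the 3,460 m³ entry in the table.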

CONCLUSIONS

The use of asynchronous data transfers and concurrent execution allows multiple GPUs to be used effectively to achieve near-scalable speedups in three-dimensional FDTD schemes, typically ranging from x3 to x3.5 when using four GPUs, depending on the size of the overlap halos. By using all available memory on a four GPU compute node, we can perform virtual acoustic simulations using billions of grid points. At audio rates such as 44.1 kHz, this allows the modelling of large rooms and halls, of several thousand cubic metres.

Stability becomes an issue when running at single precision to maximise memory usage. Computing schemes at the Courant limit using single precision can lead to instability over time, although they may appear stable initially. Backing away from the Courant limit with a small increase in the grid spacing resolves this behaviour.

Computation times for large-scale maximum memory simulations are around forty to fifty minutes per second of output at 44.1 kHz, using four Tesla C2050 GPUs. Initial testing on the latest Kepler architecture GPUs shows a near two-fold speedup over the FERMI Tesla GPUs used here, and so should bring computation times down to under half an hour.

ACKNOWLEDGEMENTS

This work is supported by the European Research Council, under Grant StG-2011-279068-NESS.

REFERENCES

[1] N. Röber, U. Kaminski, and M. Masuch, "Ray acoustics using computer graphics technology", in Proc. of the 10th Int. Conf. on Digital Audio Effects (DAFx-07), Bordeaux, France (2007).

[2] E. Lehmann and A. Johansson, "Diffuse reverberation model for efficient image-source simulation of room impulse responses", IEEE Transactions on Audio, Speech and Language Processing, 18(6), 1429-1439 (2010).

[3] L. Savioja, D. Manocha, and M. Lin, "Use of GPUs in room acoustic modeling and auralization", in Proc. Int. Symposium on Room Acoustics, Melbourne, Australia (2010).

[4] C. Webb and S. Bilbao, "Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity", in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Prague, Czech Republic (2011).

[5] Nvidia, "CUDA C programming guide", CUDA toolkit documentation. [Online] [Cited: 8th Jan 2013.] http://docs.nvidia.com/cuda/ (2012).

[6] J. Strikwerda, Finite Difference Schemes and Partial Differential Equations (Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, California) (1989).

[7] P. Micikevicius, "3D finite difference computation on GPUs using CUDA", in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), 79-84, New York, NY, USA (2009).

[8] C. Webb and S. Bilbao, "Virtual room acoustics: A comparison of techniques for computing 3D FDTD schemes using CUDA", in Proc. 130th Convention of the Audio Engineering Society (AES), London, UK (2011).

[9] P. Micikevicius, "Multi-GPU Programming", Nvidia CUDA webinars. [Online] [Cited: 6th Jan 2013.] http://developer.download.nvidia.com/cuda/training/ (2011).