Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs

Craig J. Webb 1,2 and Alan Gray 2
1 Acoustics Group, University of Edinburgh
2 Edinburgh Parallel Computing Centre, University of Edinburgh

To be presented at the 21st International Congress on Acoustics, Montréal, Canada, 2013

Abstract

The computation of large-scale virtual acoustics using the 3D finite difference time domain (FDTD) is prohibitively computationally expensive, especially at high audio sample rates, when using traditional CPUs. In recent years the computer gaming industry has driven the development of extremely powerful Graphics Processing Units (GPUs). Through specialised development and tuning we can exploit the highly parallel GPU architecture to make such FDTD computations feasible. This paper describes the simultaneous use of multiple NVIDIA GPUs to compute schemes containing over a billion grid points. We examine the use of asynchronous halo transfers between cards, to hide the latency involved in transferring data, and overall computation time is considered with respect to variation in the size of the partition layers. As hardware memory poses limitations on the size of the room to be rendered, we also investigate the use of single precision arithmetic. This allows twice the domain space, compared with double precision, but results in phase shifting of the output with possible audible artefacts. Using these techniques, large-scale spaces of several thousand cubic metres can be computed at 44.1kHz in a useable time frame, making their use in room acoustics rendering and auralization applications possible in the near future.

C.J.Webb-2@sms.ed.ac.uk

INTRODUCTION

High fidelity virtual room acoustics can be approached through direct numerical simulation of wave propagation in a defined space. Unlike ray-based [1] or image source [2] techniques, this approach seeks to model the entire acoustic field within the simulation domain. Three-dimensional Finite Difference Time Domain (FDTD) schemes can be employed, however at audio sample rates such schemes are extremely computationally expensive [3], prohibitively so for serial computation.

Recent advances in graphics processing unit (GPU) architectures allow for general purpose computation to be performed on these forms of highly parallel hardware. Whilst central processing units (CPUs) may contain a small number of cores, such as four or eight, GPUs contain hundreds of processing cores that can be used to perform parallel computation. Using this architecture, the data independence of FDTD schemes can be leveraged to gain significant acceleration over single-threaded implementations [4], and this allows large-scale simulations to be computed in time scales that are actually useable for performing research.

For scientific computing, Nvidia's Tesla GPUs are typically used in a workstation or compute node that can be configured with four GPUs connected across the same PCIe bus. This paper examines the simultaneous use of these four-GPU systems to render virtual acoustic simulations. This allows greater acceleration of existing models, or the combined use of all available memory across four GPUs to render large-scale domains containing billions of grid points. Recent versions of the CUDA language facilitate this process [5], without recourse to MPI programming techniques.

The first section details the FDTD schemes being used, followed by an outline of the CUDA programming model for the simultaneous use of multiple GPUs. We then describe the implementation of the schemes using both non-asynchronous and asynchronous approaches. Finally, we detail experimental testing in terms of floating-point precision and overall computation times for various configurations, including large-scale simulations that use maximum memory.

VIRTUAL ACOUSTICS USING FINITE DIFFERENCE METHOD

The starting point for acoustic FDTD simulations is the 3D wave equation, which in second order form is given by:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi    (1)

Here \Psi is the target acoustical field quantity, c is the wave speed in air, and \nabla^2 is the 3D Laplacian. Simple first-order boundary conditions are used, where:

    \frac{\partial \Psi}{\partial t} = c\beta\, \mathbf{n} \cdot \nabla \Psi    (2)

Here \mathbf{n} is a unit normal to a wall or obstacle, and \beta is an absorption coefficient. The standard FDTD discretisation leads to the following update equation, which includes boundary loss terms using a single reflection coefficient:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \Big( (2 - K\lambda^2)\, w^{n}_{l,m,p} + \lambda^2 S^{n}_{l,m,p} - (1-\lambda\beta)\, w^{n-1}_{l,m,p} \Big)    (3)

where w_{l,m,p} is the discrete acoustic field, K is 6 in free space, 5 at a face, 4 at an edge and 3 at a corner, \lambda = cT/X, \beta is the coefficient for boundary reflection losses, and S^{n}_{l,m,p} is the sum of the six nearest neighbours, w^{n}_{l+1,m,p} + w^{n}_{l-1,m,p} + w^{n}_{l,m+1,p} + w^{n}_{l,m-1,p} + w^{n}_{l,m,p+1} + w^{n}_{l,m,p-1}. The stability condition for the scheme follows from von Neumann analysis [6], such that for a given time step T the grid spacing X must satisfy:

    X \geq \sqrt{3 c^2 T^2}    (4)
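As a quick check on condition (4), and on the domain sizes quoted in the experiments below, take a sample rate of 44.1kHz and an assumed wave speed of c = 343 m/s (the paper does not state the value used):

    T = \tfrac{1}{44100}\,\mathrm{s}, \qquad
    X \geq \sqrt{3c^2T^2} = \sqrt{3}\, cT \approx \frac{1.732 \times 343}{44100} \approx 0.0135\,\mathrm{m}.

Each grid point then occupies X^3 \approx 2.45 \times 10^{-6}\,\mathrm{m}^3, so a 100-million-point grid corresponds to roughly 245 m³ (the 244 m³ domain used below), and a billion-point grid to a few thousand cubic metres.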

The basic scheme can be extended to include the effect of viscosity, which gives a frequency dependent damping [4]:

    \frac{\partial^2 \Psi}{\partial t^2} = c^2 \nabla^2 \Psi + c\alpha \frac{\partial}{\partial t} \nabla^2 \Psi    (5)

Here \alpha is a viscosity coefficient. This leads to an update equation of the form:

    w^{n+1}_{l,m,p} = \frac{1}{1+\lambda\beta} \Big( (2 - K\lambda^2)\, w^{n}_{l,m,p} + \lambda^2 S^{n}_{l,m,p} - (1-\lambda\beta)\, w^{n-1}_{l,m,p} + ck\alpha\, (S^{n}_{l,m,p} - S^{n-1}_{l,m,p}) \Big)    (6)

Note that this update uses the nearest neighbours from two time steps ago, thus requiring the use of three data grids. The basic scheme, which only uses the centre point from two time steps ago, can be implemented using only two grids and a read, then overwrite procedure. These systems are referred to as the "2 Grid" and "3 Grid" schemes throughout this paper.

PARALLEL COMPUTING AND USE OF MULTIPLE GPUS IN CUDA

CUDA is Nvidia's programming architecture for implementing highly-threaded GPU code. In a serial implementation of equation 3, loops would be used to iterate over the computation domain, applying the update equation to each grid point. In CUDA, we issue a large number of kernel threads that implement the SIMD (Single Instruction Multiple Data) operation, and these threads are scheduled to execute using a large number of parallel processing cores.

The program code contains a mixture of host serial C code, and device CUDA code. Whilst the host uses standard CPU memory, the device has multiple memory types. Global memory is the core data store, and is typically in the range of 3 to 6GB per GPU. CUDA threads make use of local, register memory, and also have a small amount of shared memory per thread block. They can also communicate directly with the global and (read-only) constant memory. Each type has different access speeds, with global memory being the slowest, and shared and constant being fast. With a four GPU server, we have four instances of this memory model which are independent. The GPUs are connected in a pair-wise manner over the PCIe bus, as shown in figure 1.

FIGURE 1: PCIe connections for four GPUs (GPU0 to GPU3) in a single compute node, linked pair-wise over PCIe with a connection to the host.

Version four and above of the CUDA architecture contains functionality that allows multiple GPUs that are connected in such a manner to be used concurrently [5]. Peer-to-peer communication allows data transfer between GPUs that bypasses the host altogether (transferring data from device to host, then back to another device is an expensive operation). This can be combined with the use of multiple streams of execution and asynchronous scheduling to achieve scalable speedups when using multiple GPUs.

IMPLEMENTATION OF THREE-DIMENSIONAL FDTD SCHEMES

This section gives a detailed description of the implementation of the basic 2 Grid FDTD scheme with its first-order boundaries. We start with a single GPU version, then extend this to an initial multiple GPU implementation. This is then developed to include the use of asynchronous data transfers of the partition halos.
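As a point of reference for the GPU kernels that follow, a serial CPU version of the interior update in equation (3) might look like the sketch below. It is not taken from the paper; it assumes row-major storage with x varying fastest, the two-grid read-then-overwrite arrangement described above (u holds w at time n-1 on entry and is overwritten with time n+1, u1 holds time n), and free-space points only (K = 6, no boundary losses):

/* Serial reference for the interior update of equation (3), free space only. */
void update_interior_serial(double *u, const double *u1,
                            int Nx, int Ny, int Nz, double l2)
{
    int area = Nx * Ny;                               /* points per Z layer */
    for (int z = 1; z < Nz - 1; z++) {
        for (int y = 1; y < Ny - 1; y++) {
            for (int x = 1; x < Nx - 1; x++) {
                int cp = z * area + y * Nx + x;       /* linear centre position */
                double S = u1[cp-1] + u1[cp+1] + u1[cp-Nx] + u1[cp+Nx]
                         + u1[cp-area] + u1[cp+area]; /* six nearest neighbours */
                /* w^{n+1} = (2 - 6*lambda^2) w^n + lambda^2 S - w^{n-1} */
                u[cp] = (2.0 - 6.0 * l2) * u1[cp] + l2 * S - u[cp];
            }
        }
    }
}

The triple loop makes the data independence explicit: every interior point at time n+1 depends only on values from times n and n-1, which is what the CUDA versions below exploit.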

Single GPU implementation

The CUDA programming model makes use of threads that are grouped first into blocks, and then into a grid. Both of these objects can be one, two, or three-dimensional in shape. Given a three-dimensional data domain, there are many possible approaches to mapping the threading model over the domain. Prior to the Fermi architecture, the standard approach was to utilise shared memory by issuing threads to cover a 2D layer of data. Each thread itself would then iterate over the final dimension, reusing data in shared memory [7]. Post Fermi, the caching system negates the benefits of this approach, and one can simply issue threads that cover the whole data domain [8]. Three-dimensional thread blocks can be used, for example 32 x 4 x 2, and a three-dimensional thread grid placed over the data. The time loop for the simulation contains a single kernel launch, then the input/output is processed, followed by swapping the pointers to the data grids, as shown in listing 1.

for (n = 0; n < NF; n++)
{
    UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d, u1_d);
    // perform I/O
    inout<<<dimGridIO, dimBlockIO>>>(u_d, out_d, ins, n);
    // update pointers
    dummy_ptr = u1_d; u1_d = u_d; u_d = dummy_ptr;
}

LISTING 1: Time loop for single GPU implementation.

The thread kernel itself implements both the interior and boundary update in a single SIMD operation, as shown in listing 2.

1  __global__ void UpDateScheme(real *u, real *u1) {
2      // Get X, Y, Z from 3D thread and block IDs
3      int X = blockIdx.x * Bx + threadIdx.x;
4      int Y = blockIdx.y * By + threadIdx.y;
5      int Z = blockIdx.z * Bz + threadIdx.z + 1;
6
7      // Test that not at halo; Z block excludes Z halo
8      if ((X > 0) && (X < (Nx-1)) && (Y > 0) && (Y < (Ny-1))) {
9          // Calculate linear centre position
10         int cp = Z*area + (Y*Nx + X);
11         int K = (X > 1) + (X < (Nx-2)) + (Y > 1) + (Y < (Ny-2)) + (Z > 1) + (Z < (Nz-2));
12         real cf  = 1.0;
13         real cf2 = 1.0;
14         // set loss coefficients if at a boundary
15         if (K < 6) { cf = cf_d[0].loss1; cf2 = cf_d[0].loss2; }
16         // Get sum of neighbour points
17         real S = u1[cp-1] + u1[cp+1] + u1[cp-Nx] + u1[cp+Nx] + u1[cp-area] + u1[cp+area];
18         // Calculate the update
19         u[cp] = cf * ((2.0 - K*cf_d[0].l2)*u1[cp] + cf_d[0].l2*S - cf2*u[cp]);
20     }
21 }

LISTING 2: Kernel code for single GPU implementation.

The kernel keeps the use of conditional statements to a minimum. A layer of non-updated "ghost" points is used around the data domain, and so line 8 employs a conditional to check for this. A single further conditional is used at line 15, to load the coefficients used at a boundary. The logical expression at line 11 computes the boundary position in an efficient manner, without the need for a lengthy IF-ELSEIF statement.
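The host-side configuration implied by Listing 1 is not shown in the paper; a minimal sketch, assuming the 32 x 4 x 2 thread blocks mentioned above, the real typedef of Listing 2, and a launch grid rounded up to cover the domain (with the Z dimension spanning only the interior layers), might look as follows:

/* Host-side setup for the single-GPU scheme: a sketch, not the authors' code. */
size_t bytes = (size_t)Nx * Ny * Nz * sizeof(real);
real *u_d, *u1_d;
cudaMalloc((void**)&u_d,  bytes);      /* grid holding w at times n-1 / n+1 */
cudaMalloc((void**)&u1_d, bytes);      /* grid holding w at time n          */
cudaMemset(u_d,  0, bytes);
cudaMemset(u1_d, 0, bytes);
/* scheme coefficients (loss1, loss2, l2) would be copied into the constant
   memory structure cf_d, e.g. with cudaMemcpyToSymbol, before the time loop */

/* 32 x 4 x 2 threads per block; grid rounded up to cover the domain, with
   the Z dimension spanning only the Nz-2 interior layers                   */
dim3 dimBlockInt(32, 4, 2);
dim3 dimGridInt((Nx + 31) / 32, (Ny + 3) / 4, (Nz - 1) / 2);

With the 960 x 396 x 264 test grid used later, this arrangement would give a 30 x 99 x 131 grid of thread blocks.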

Non-asynchronous implementation using multiple GPUs

In transitioning from a single GPU to the use of four GPUs, the data domain needs to be partitioned. The individual GPUs have discrete memory, and so the 3D data needs to be separated into four segments. A further complication is that the FDTD scheme requires neighbouring points in all dimensions, and so overlap halos will be required. The 3D data itself is decomposed using a row-major alignment for each layer of the Z dimension, with consecutive layers in series. In this format, each layer occupies contiguous memory locations. Thus, the most natural partitioning is across the Z dimension, as shown in figure 2. The overlap halos are individual Z layers, and so can be transferred as a single contiguous block of memory.

FIGURE 2: Data partitioning across the Z dimension using four GPUs, with overlap halos (the Nx x Ny x Nz domain is split across the Nz layers into four segments, each bordered by overlap halo layers).

Whilst the domain partitioning is straightforward, the CUDA code itself requires many extensions compared to the single GPU case. In terms of the pre-time loop setup code, individual commands such as cudaMalloc() become embedded in loops over the four GPUs. A call is made to cudaSetDevice() at each iteration, to perform the operation on individual GPUs. Single pointers to device memory become arrays of pointers, and constant memory has to be allocated on each GPU. The halo offset locations have to be calculated as linear positions across memory, and finally the peer-to-peer access has to be initialised. In a non-asynchronous implementation, the time loop operates as follows:

1. Loop over the GPUs, issuing a kernel launch to compute the data on that GPU.
2. Synchronize all GPUs.
3. Perform peer-to-peer data transfer for overlap halos.
4. Perform input/output.
5. Synchronize and swap data pointers.

Each GPU computes its data simultaneously, but only when all have completed do we then perform the data transfers of the individual overlap halos.
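The paper does not list the non-asynchronous loop itself; the sketch below follows the five steps above. It assumes peer access has already been enabled between neighbouring devices (cudaDeviceEnablePeerAccess at setup), that area_size is the size in bytes of one Z layer (as in Listing 3), and uses hypothetical offset arrays send_hi, send_lo, recv_hi and recv_lo (in grid points) for the boundary layer sent to, and the ghost layer received from, each neighbour:

/* Non-asynchronous four-GPU time loop: a sketch, not the authors' code. */
for (n = 0; n < NF; n++) {
    /* 1. Launch the update kernel on each GPU */
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateScheme<<<dimGridInt, dimBlockInt>>>(u_d[i], u1_d[i]);
    }
    /* 2. Synchronise all GPUs */
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        cudaDeviceSynchronize();
    }
    /* 3. Peer-to-peer exchange of the six overlap halos */
    for (int i = 0; i < num_gpus - 1; i++) {
        cudaMemcpyPeer(u_d[i+1] + recv_lo[i+1], gpu[i+1],
                       u_d[i]   + send_hi[i],   gpu[i],   area_size);
        cudaMemcpyPeer(u_d[i]   + recv_hi[i],   gpu[i],
                       u_d[i+1] + send_lo[i+1], gpu[i+1], area_size);
    }
    /* 4. Perform input/output, then 5. synchronise and swap u_d / u1_d */
}

With four GPUs the exchange step issues 2 x 3 = 6 layer copies, matching the six halo transfers per time step discussed below.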

Implementation using asynchronous data transfers

The above implementation contains an inherent time lag, as the GPUs are idle during the data transfers across the PCIe bus. To eliminate this, we can make use of asynchronous behaviour and streams. The approach used is based on that outlined by Nvidia [9], but is extended here to operate with the large-scale halo layers that occur in the 3D case. An individual Z layer can contain millions of floating-point values, and six layers have to be transferred between GPUs at each time step.

A stream is simply a sequence of CUDA events that occur in series. However, multiple streams can be used so that events can execute in a concurrent and asynchronous manner. As the FDTD scheme is data-independent at each time step, the overlap halo layers can be computed and the data transfers performed at the same time as the larger interior data segments on each GPU. This is accomplished by using one stream of events for the halos and transfers on each GPU, whilst a second stream is used for the interior. The streams are identified in the kernel launches, as shown in the time loop code detailed in listing 3.

for (n = 0; n < NF; n++)
{
    // Compute halo layers, then interior
    p = 0;
    for (i = 0; i < num_gpus; i++) {
        cudaSetDevice(gpu[i]);
        UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
        p++;
        if (i > 0 && i < num_gpus-1) {
            UpDateHalo<<<dimGridHalo, dimBlockHalo, 0, stream_halo[i]>>>(u_d[i], u1_d[i], pos[p]);
            p++;
        }
        cudaStreamQuery(stream_halo[i]);
        UpDateInterior<<<dimGridInt, dimBlockInt, 0, stream_int[i]>>>(u_d[i], u1_d[i], i);
    }
    // Exchange Halos
    cudaMemcpyPeerAsync(u_d[1], gpu[1], &u_d[0][pos[0]], gpu[0], area_size, stream_halo[0]);
    ...
    // perform I/O
    // Synchronise and update pointers
}

LISTING 3: Time loop for asynchronous four GPU implementation.

Initially, we iterate over the GPUs and launch the kernels required for the halos using stream_halo. Note that GPUs 0 and 3 have a single halo, whilst GPUs 1 and 2 contain two halos. Then the main interior data kernels are launched, using stream_int. The data transfer events are then pushed into stream_halo, which will execute when the halos have been computed. In this manner, the data transfers proceed at the same time as the interior computation is being performed. The GPUs are then synchronized before swapping the data pointers.

EXPERIMENTAL TESTING

Initial testing is performed using data grids containing 100 million points each. At double precision this requires 0.8GB of data per grid (1.6GB for the whole 2 Grid simulation), and so allows a comparison to be made between the single GPU and four GPU implementations using Tesla C2050 GPUs that have 3GB of global memory. Using a sample rate of 44.1kHz, the domain size is 244 m³.

Computation times

The 2 Grid scheme is used to compare computation times for the single GPU, basic (non-asynchronous) four GPU, and asynchronous four GPU implementations. The simulations are computed for 4,410 samples in each case, at 44.1kHz, and for both single and double precision floating-point accuracy. The data grids are of size Nx: 960 points, Ny: 396 points, and Nz: 264 points. Table 1 shows the resulting times.

TABLE 1: Computation times and speedups for double (DP) and single (SP) precision.

Setup              DP Time (sec)   Speedup   SP Time (sec)   Speedup
Single GPU               -            -            -            -
Basic four GPU          59.4          -            -          x2.5
Async four GPU          48.1          -            -          x3.0

The basic four GPU implementation only achieves a speedup of x2.5, whilst the asynchronous version gets to x3. The grid sizes for these initial tests contain a very large Z layer (960 x 396 = 380,160 points). As six overlap halos of this size have to be transferred between GPUs at each time step, this is still a limiting factor.
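To put those transfers in perspective, the following sketch evaluates the per-step halo traffic for this grid (assuming six halo layers of 960 x 396 double-precision values per step, as stated above):

/* Rough halo-traffic estimate for the 960 x 396 Z layer at double precision. */
size_t layer_bytes = (size_t)960 * 396 * sizeof(double);  /* ~3.0 MB per halo  */
size_t step_bytes  = 6 * layer_bytes;                     /* ~18.3 MB per step */
double total_GB    = 44100.0 * step_bytes / 1e9;          /* ~805 GB moved over the
                                                             44,100 steps of one
                                                             second of output   */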

To test the effect of the Z layer size, the double precision simulation is performed for decreasing sizes whilst keeping the same overall domain size of 244 m³, ranging from 380,160 points down to 76,032 points. Figure 3 shows the effect in terms of the dimensions of the space.

FIGURE 3: Variations in Z layer sizes for a domain of 244 m³ (left: Nx = 960, Ny = 396, Z layer size 380,160 points; right: Nx = 192, Ny = 396, Z layer size 76,032 points).

Table 2 shows the timing results. As the Z layer size decreases we get closer to the x4 scalable speedup.

TABLE 2: Effect of variation in the size of the Z layer on computation time.

Z layer size (points)   Time (sec)   Speedup over single GPU
380,160                     -                  -
  ...                       -                  -
76,032                      -                x3.52

Floating-point precision

Single precision floating-point variables require 32 bits of memory compared to double precision which requires 64 bits. So, using single precision we can effectively double the size of the computation domain using the same amount of memory. There is also an additional benefit, as GPUs offer greater peak performance at single precision. However, testing on the 100 million point domain reveals stability issues when running at, or very close to, the Courant limit for the scheme. Figure 4 shows the outputs for a 40,000 time step simulation at 44.1kHz using a DC-blocked audio input and grid spacing set at the Courant limit. The single precision output (blue) is stable initially, but shows phase and amplitude differences compared to the double precision (red). After 30,000 samples, the single precision output begins to diverge, and finally becomes unstable after 40,000 samples. Backing away from the Courant limit by around 0.05% ensures stability in single precision, at the cost of introducing greater dispersion.

FIGURE 4: Double (red) vs single (blue) precision at the Courant limit over 40,000 time steps (normalised level against time in samples).
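The 0.05% retreat from the Courant limit described above can be applied directly to the grid spacing when the simulation is configured; a sketch (assuming c = 343 m/s, and not necessarily the mechanism the authors used) is:

/* Back the grid spacing away from the Courant limit by ~0.05% for single
   precision stability: a sketch only. Requires <math.h>.                  */
double c = 343.0;                      /* assumed wave speed in air (m/s)      */
double T = 1.0 / 44100.0;              /* time step at 44.1 kHz                */
double X = sqrt(3.0 * c*c * T*T);      /* grid spacing at the Courant limit    */
X *= 1.0005;                           /* +0.05%: stable in single precision,
                                          at the cost of extra dispersion      */
double lambda = c * T / X;             /* Courant number lambda = cT/X         */
double l2 = lambda * lambda;           /* coefficient lambda^2 used in (3)     */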

LARGE-SCALE ACOUSTIC SIMULATIONS

Having detailed the efficiency of the asynchronous four GPU implementation, we can now consider the use of maximum available memory to perform large-scale simulations. Nvidia's Tesla GPUs come with various amounts of global memory, and so table 3 shows the maximum simulation sizes for various configurations, at a sample rate of 44.1kHz. Note that the GPUs have less available memory than is actually labelled; for example a 3GB C2050 has a useable global memory of around 2.8GB.

TABLE 3: Maximum simulation sizes in points per grid (millions) and cubic metres, at 44.1kHz.

GPU         2 Grid SP    m³      2 Grid DP    m³      3 Grid SP    m³      3 Grid DP    m³
3GB             -         -          -         -          -         -          -         -
5GB            595        -          -         -          -         -          -         -
6GB            722        -          -         -          -         -          -         -
4 x 3GB      1,409      3,350        -         -          -         -          -         -
4 x 5GB      2,380      5,844      1,189     2,918      1,582       -          -         -
4 x 6GB      2,889      7,096      1,444     3,546      1,920       -          -         -

Whilst the table shows the maximum sizes for both the 2 Grid and 3 Grid schemes, in practice this has to be reduced to allow for storage of audio output arrays and, in the four GPU case, overlap halos of variable size. Four Tesla C2050 GPUs are used for the testing here, each of which has 3GB of global memory. Thus for the basic 2 Grid scheme at single precision we can compute simulations using 1.4 billion grid points, and a resulting simulation size of 3,350 m³. For the 3 Grid scheme including viscosity, the grids contain just under a billion points. Table 4 shows the computation times for maximum memory simulations, running for 44,100 samples at 44.1kHz.

TABLE 4: Maximum memory computation times for 44,100 samples at 44.1kHz.

Simulation                       Size (m³)    Time (min)
2 Grid DP (double precision)         -             -
2 Grid SP (single precision)       3,350           -
3 Grid DP (double precision)         -             -
3 Grid SP (single precision)         -             -
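The sizes in Table 3 follow from a simple capacity calculation. The sketch below reproduces the four-card 3GB, 2 Grid, single-precision case used for the timings above, assuming roughly 2.8GB of useable memory per card and the ~0.0135 m grid spacing at 44.1kHz:

/* Capacity estimate behind Table 3: four 3GB cards, 2 Grid scheme, single
   precision. A sketch; the 2.8GB useable figure is taken from the text.    */
double usable = 4.0 * 2.8e9;                      /* bytes across four GPUs     */
double points = usable / (2.0 * sizeof(float));   /* two grids of floats, about
                                                     1.4e9 points in total      */
double X      = 0.0135;                           /* grid spacing at 44.1 kHz (m) */
double volume = points * X * X * X;               /* corresponding volume in m^3,
                                                     several thousand cubic metres */

In practice, as noted above, some of this memory must be reserved for the audio output arrays and the overlap halos, so the achievable domain sits slightly below this estimate.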

CONCLUSIONS

The use of asynchronous data transfers and concurrent execution allows multiple GPUs to be used effectively to achieve near-scaleable speedups in three-dimensional FDTD schemes, typically ranging from x3 to x3.5 when using four GPUs, depending on the size of the overlap halos. By using all available memory on a four GPU compute node, we can perform virtual acoustic simulations using billions of grid points. At audio rates such as 44.1kHz, this allows the modelling of large rooms and halls, of several thousand cubic metres.

Stability becomes an issue when running at single precision to maximise memory usage. Computing schemes at the Courant limit using single precision can lead to instability over time, although they may appear stable initially. Backing away from the Courant limit with a small increase in the grid spacing resolves this behaviour.

Computation times for large-scale maximum memory simulations are around forty to fifty minutes per second at 44.1kHz, using four Tesla C2050 GPUs. Initial testing on the latest Kepler architecture GPUs shows a near two-fold speedup over the Fermi Tesla GPUs used here, and so should bring computation times down to under half an hour.

ACKNOWLEDGEMENTS

This work is supported by the European Research Council, under Grant StG NESS.

REFERENCES

[1] N. Röber, U. Kaminski, and M. Masuch, "Ray acoustics using computer graphics technology", in Proc. of the 10th Int. Conf. on Digital Audio Effects (DAFx-07), Bordeaux, France (2007).
[2] E. Lehmann and A. Johansson, "Diffuse reverberation model for efficient image-source simulation of room impulse responses", IEEE Transactions on Audio, Speech and Language Processing, 18(6) (2010).
[3] L. Savioja, D. Manocha, and M. Lin, "Use of GPUs in room acoustic modeling and auralization", in Proc. Int. Symposium on Room Acoustics, Melbourne, Australia (2010).
[4] C. Webb and S. Bilbao, "Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity", in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Prague, Czech Republic (2011).
[5] Nvidia, "CUDA C programming guide", CUDA toolkit documentation. [Online] [Cited: 8th Jan 2013.] http://docs.nvidia.com/cuda/ (2012).
[6] J. Strikwerda, Finite Difference Schemes and Partial Differential Equations (Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, California) (1989).
[7] P. Micikevicius, "3D finite difference computation on GPUs using CUDA", in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, New York, NY, USA (2009).
[8] C. Webb and S. Bilbao, "Virtual room acoustics: A comparison of techniques for computing 3D FDTD schemes using CUDA", in Proc. 130th Convention of the Audio Engineering Society (AES), London, UK (2011).
[9] P. Micikevicius, "Multi-GPU Programming", Nvidia CUDA webinars. [Online] [Cited: 6th Jan 2013.] http://developer.download.nvidia.com/cuda/training/ (2011).
