Tridiagonal Solvers on the GPU and Applications to Fluid Simulation
Nikolai Sakharnykh, NVIDIA, nsakharnykh@nvidia.com
Agenda
- Introduction and Problem Statement
- Governing Equations
- ADI Numerical Method
- GPU Implementation and Optimizations
- Results and Future Work
Introduction
- Turbulence simulation
  - Direct Numerical Simulation: all scales of turbulence, expensive
  - Large-Eddy Simulation
  - Reynolds-Averaged Navier-Stokes
- Research at the Computer Science department of Moscow State University: Paskonov V.M., Berezin S.B.
Problem Statement
- Viscous incompressible fluid in a 3D domain
- Initial and boundary conditions
- Euler coordinates: velocity and temperature
Definitions
- Density: $\rho = \text{const}$
- Velocity: $\mathbf{u} = (u, v, w)$
- Temperature: $T$
- Pressure: $p$
- State equation: $p = \rho R T = R T$, where $R$ is the gas constant for air
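Governing Equations
- Continuity equation: $\operatorname{div}\,\mathbf{u} = 0$
- Navier-Stokes equations, dimensionless form: $\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\,\mathbf{u} = -\nabla T + \frac{1}{Re}\,\Delta\mathbf{u}$
- $Re$: Reynolds number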
Reynolds number
- Similarity parameter: the ratio of inertia forces to viscosity forces
- 3D channel: $Re = \frac{\rho' V' L'}{\mu'}$
  - $V'$: mean velocity
  - $L'$: length of pipe
  - $\mu'$: dynamic viscosity
- High Re: turbulent flow
- Low Re: laminar flow
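Governing Equations
- Energy equation, dimensionless form: [equation for the time derivative of T not recoverable from the extracted text]
- Symbols used: Re, Pr (Prandtl number), the heat capacity ratio, the dissipative function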
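Numerical Method
- Alternating Direction Implicit (ADI)
- The full time step, with derivatives in x, y and z, is split into three fractional steps X, Y, Z; each step keeps the time derivative and the derivatives in one direction only (X: x, Y: y, Z: z)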
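ADI Heat Conduction
- 3 fractional steps: X, Y, Z
- Implicit finite-difference scheme, X step: $\frac{T^{n+1/3}_{i,j,k} - T^{n}_{i,j,k}}{\Delta t} = \frac{T^{n+1/3}_{i+1,j,k} - 2\,T^{n+1/3}_{i,j,k} + T^{n+1/3}_{i-1,j,k}}{\Delta x^2}$
- Rearranged with $q = \Delta t / \Delta x^2$, every grid line gives a tridiagonal system: $-q\,T^{n+1/3}_{i-1,j,k} + (1 + 2q)\,T^{n+1/3}_{i,j,k} - q\,T^{n+1/3}_{i+1,j,k} = T^{n}_{i,j,k}$
- The Y and Z steps have the same form in j and k, advancing to the n+2/3 and n+1 layers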
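For one grid line, the X step of the heat-conduction scheme amounts to filling the three diagonals and the right-hand side of a tridiagonal system. The following is a minimal host-side sketch, not the presentation's code: it assumes q = dt / dx^2, fixed (Dirichlet) values at both ends of the line, and the hypothetical names `TridiagonalSystem` and `build_heat_x_step`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coefficients of lower[i]*T[i-1] + diag[i]*T[i] + upper[i]*T[i+1] = rhs[i].
struct TridiagonalSystem {
    std::vector<double> lower, diag, upper, rhs;
};

// Builds the implicit X-step system for one grid line.
// t_old: values on the previous time layer along the line.
// q:     dt / dx^2 (an assumption of this sketch).
// The first and last rows keep the old values (assumed Dirichlet boundaries).
TridiagonalSystem build_heat_x_step(const std::vector<double>& t_old, double q)
{
    const std::size_t n = t_old.size();
    assert(n >= 2);

    TridiagonalSystem s{std::vector<double>(n, 0.0),   // lower
                        std::vector<double>(n, 1.0),   // diag
                        std::vector<double>(n, 0.0),   // upper
                        t_old};                        // rhs
    for (std::size_t i = 1; i + 1 < n; ++i) {
        s.lower[i] = -q;
        s.diag[i]  = 1.0 + 2.0 * q;
        s.upper[i] = -q;
    }
    return s;
}
```

A uniform temperature line is a quick sanity check: each interior row then reduces to (1 + 2q - 2q) T = T, so the field is left unchanged by the step.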
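ADI Navier-Stokes
- Equation for X velocity: $\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + w\frac{\partial u}{\partial z} = -\frac{\partial T}{\partial x} + \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}\right)$
- X: $\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = -\frac{\partial T}{\partial x} + \frac{1}{Re}\frac{\partial^2 u}{\partial x^2}$
- Y: $\frac{\partial u}{\partial t} + v\frac{\partial u}{\partial y} = \frac{1}{Re}\frac{\partial^2 u}{\partial y^2}$
- Z: $\frac{\partial u}{\partial t} + w\frac{\partial u}{\partial z} = \frac{1}{Re}\frac{\partial^2 u}{\partial z^2}$
- Non-linear PDEs need iterations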
ADI Time Step
- Time layers: (n-1) time step, (n) time step, (n+1) time step
- Within one time step: splitting by X, splitting by Y, splitting by Z, updating non-linear parameters
- Global iterations
ADI Fractional Time Step: Linear PDEs
- Previous layer: N time layer
- Fields: u (x-velocity), v (y-velocity), w (z-velocity), T (temperature)
- Sweep
- Next layer: N + 1 time layer
- Solves many tridiagonal systems independently
ADI Fractional Time Step: Non-Linear PDEs
- Previous layer: N time layer
- Intermediate N + ½ time layer: update, local iterations
- Fields: u (x-velocity), v (y-velocity), w (z-velocity), T (temperature)
- Sweep
- Next layer: N + 1 time layer
- Solves many tridiagonal systems independently
Main Stages of the Algorithm
- Solve a lot of independent tridiagonal systems
  - Computationally intensive
  - Easy to parallelize
- Subtasks
  - Evaluate dissipation term
  - Update non-linear parameters
Tridiagonal Solvers Overview
- Simplified Gauss elimination
  - Also known as the Thomas algorithm, or Sweep
  - The fastest serial approach
- Cyclic Reduction methods
  - Attend Yao Zhang's talk "Fast Tridiagonal Solvers" afterwards!
Sweep algorithm
- Memory requirements: one additional array of size N
- Forward elimination step
- Backward substitution step
- Complexity: O(N)
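The two steps listed above can be sketched for a single system as follows. This is a generic serial Thomas algorithm under stated assumptions, not the presentation's kernel: the name `sweep_solve` and the four-array interface are hypothetical, and the matrix is assumed to need no pivoting (for example, diagonally dominant).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i], i = 0..n-1.
// lower[0] and upper[n-1] are never read into the solution.
// Assumes the pivots stay non-zero (no pivoting is performed).
std::vector<double> sweep_solve(const std::vector<double>& lower,
                                const std::vector<double>& diag,
                                const std::vector<double>& upper,
                                const std::vector<double>& rhs)
{
    const std::size_t n = diag.size();
    assert(n >= 1 && lower.size() == n && upper.size() == n && rhs.size() == n);

    std::vector<double> alpha(n);  // the one additional array of size N
    std::vector<double> x(n);      // modified right-hand side, then the solution

    // Forward elimination: remove the lower diagonal row by row.
    double denom = diag[0];
    alpha[0] = upper[0] / denom;
    x[0] = rhs[0] / denom;
    for (std::size_t i = 1; i < n; ++i) {
        denom = diag[i] - lower[i] * alpha[i - 1];
        alpha[i] = upper[i] / denom;
        x[i] = (rhs[i] - lower[i] * x[i - 1]) / denom;
    }

    // Backward substitution: x[n-1] is final, walk back to x[0].
    for (std::size_t i = n - 1; i-- > 0; ) {
        x[i] -= alpha[i] * x[i + 1];
    }
    return x;
}
```

Each loop touches every unknown once, which gives the O(N) cost; the loop-carried dependence on the previous element is what makes a single system inherently serial.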
GPU Implementation
- All data arrays are stored on GPU
  - Several 3D time-layers
  - overall GB for 999 grid in DP
- Main kernels
  - Sweep
  - Dissipative function evaluation
  - Non-linear update
Sweep on the GPU
- One thread solves one system
- N^2 systems on each fractional step: splitting by X, splitting by Y, splitting by Z
- Each thread operates with a 1D slice in the corresponding direction
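The mapping from a thread to its 1D slice can be expressed as a base index plus a stride. The sketch below is host-side C++ for illustration only; it assumes a linear layout with x varying fastest, idx = x + nx * (y + ny * z), and the names `Dir`, `LineAccess` and `line_access` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

enum class Dir { X, Y, Z };

struct LineAccess {
    std::size_t base;    // index of the first element of the 1D slice
    std::size_t stride;  // distance between consecutive elements of the slice
    std::size_t count;   // number of unknowns in the system
};

// Assumed layout: idx = x + nx * (y + ny * z).
// (t0, t1) identify the system: (y, z) for X, (x, z) for Y, (x, y) for Z.
LineAccess line_access(Dir dir, std::size_t t0, std::size_t t1,
                       std::size_t nx, std::size_t ny, std::size_t nz)
{
    switch (dir) {
    case Dir::X:
        assert(t0 < ny && t1 < nz);
        return {nx * (t0 + ny * t1), 1, nx};
    case Dir::Y:
        assert(t0 < nx && t1 < nz);
        return {t0 + nx * ny * t1, nx, ny};
    case Dir::Z:
        assert(t0 < nx && t1 < ny);
        return {t0 + nx * t1, nx * ny, nz};
    }
    return {0, 0, 0};  // not reached for valid Dir values
}
```

With this layout, a thread walks its system by starting at `base` and adding `stride` at every step of the forward pass, then subtracting it on the way back.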
Sweep performance
[Bar chart: time steps/sec for Sweep X, Sweep Y, Sweep Z in float and double on NVIDIA Tesla C1060]
- X splitting is much slower than Y/Z ones
Sweep: going into details
- Memory bound: need to optimize access to the memory
- Sweep X: uncoalesced
- Sweep Y: coalesced
- Sweep Z: coalesced
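Why only the X sweep is uncoalesced follows from how far apart neighbouring threads read at the same step of the sweep. A small sketch, assuming the layout idx = x + nx * (y + ny * z) and that neighbouring threads differ in y for X sweeps and in x for Y and Z sweeps; `neighbor_thread_gap` is a hypothetical helper, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

enum class Dir { X, Y, Z };

// Distance, in elements, between the addresses touched by thread t and
// thread t + 1 at the same step of the sweep.
// Assumed layout: idx = x + nx * (y + ny * z).
//   X sweep: neighbouring threads differ in y, so they are nx elements apart.
//   Y and Z sweeps: neighbouring threads differ in x, so they are adjacent.
std::size_t neighbor_thread_gap(Dir dir, std::size_t nx)
{
    assert(nx >= 1);
    return dir == Dir::X ? nx : 1;
}
```

A gap of 1 lets the hardware combine the loads of neighbouring threads into a few wide memory transactions; a gap of nx spreads them over many separate transactions.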
Sweep optimization
- Solution for X-splitting: reorder data arrays and run Y-splitting
- Needs a few additional 3D matrix transposes
[Bar chart: time steps/sec, original vs optimized, float and double]
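The reordering can be pictured as swapping the x and y axes of each 3D array, so that lines which ran along x become lines along y. The following host-side sketch shows only the index arithmetic; it assumes the layout idx = x + nx * (y + ny * z), and `transpose_xy` is a hypothetical name. A GPU version would typically stage tiles through shared memory, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Swaps the x and y axes of a 3D array.
// Input layout:  src[x + nx * (y + ny * z)]
// Output layout: dst[y + ny * (x + nx * z)], i.e. dimensions (ny, nx, nz).
std::vector<double> transpose_xy(const std::vector<double>& src,
                                 std::size_t nx, std::size_t ny, std::size_t nz)
{
    assert(src.size() == nx * ny * nz);
    std::vector<double> dst(src.size());
    for (std::size_t z = 0; z < nz; ++z) {
        for (std::size_t y = 0; y < ny; ++y) {
            for (std::size_t x = 0; x < nx; ++x) {
                dst[y + ny * (x + nx * z)] = src[x + nx * (y + ny * z)];
            }
        }
    }
    return dst;
}
```

After the transpose, an original X line has stride ny in the new array, so the Y-splitting access pattern applies to it; transposing again with the dimensions swapped restores the original ordering.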
Code analysis
- GPU version is based on the CPU code

    // boundary conditions
    switch (dir)
    {
        case X:
        case X_as_Y: bc_x0( ); break;
        case Y: bc_y0( ); break;
        case Z: bc_z0( ); break;
    }
    a[1] = - c1 / c2;
    u_next[base_idx] = f_i / c2;

    // forward trace of sweep
    int idx = base_idx;
    int idx_prev;
    for (int k = 1; k < n; k++)
    {
        idx_prev = idx;
        idx += p.stride;      // step to the next element of the 1D slice

        // coefficients of the k-th row of the tridiagonal system
        double c = v_temp[idx];
        c1 = p.m_c13 * c - p.h;
        c2 = p.m_c2;
        c3 = - p.m_c13 * c - p.h;

        // elimination step
        double q = (c3 * a[k] + c2);
        double t = 1 / q;
        a[k+1] = - c1 * t;
        u_next[idx] = (f[idx] - c3 * u_next[idx_prev]) * t;
    }
Performance Comparison
- Test data
  - Grid size of 8/9
  - 8 non-linear iterations (2 inner x 4 outer)
- Hardware
  - NVIDIA Tesla C1060
  - Intel Core 2 Quad (4 threads)
  - Intel Core i7 Nehalem (8 threads)
Performance 8 - float
[Bar chart: time steps/sec for Dissipation, Sweep, NonLinear, Total on NVIDIA Tesla C1060, Intel Core i7 Nehalem 2.93GHz (4 cores), Intel Core 2 Quad 2.4GHz (4 cores)]
Performance 8 - double
[Bar chart: time steps/sec for Dissipation, Sweep, NonLinear, Total on NVIDIA Tesla C1060, Intel Core i7 Nehalem 2.93GHz (4 cores), Intel Core 2 Quad 2.4GHz (4 cores)]
Performance 9 - float
[Bar chart: time steps/sec for Dissipation, Sweep, NonLinear, Total on NVIDIA Tesla C1060, Intel Core i7 Nehalem 2.93GHz (4 cores), Intel Core 2 Quad 2.4GHz (4 cores)]
Performance 9 - double
[Bar chart: time steps/sec for Dissipation, Sweep, NonLinear, Total on NVIDIA Tesla C1060, Intel Core i7 Nehalem 2.93GHz (4 cores), Intel Core 2 Quad 2.4GHz (4 cores)]
GPU performance SP/DP
[Bar chart: time steps/sec for dissipation, sweep, nonlinear, total in float and double]
- In double precision the GPU is only [factor missing] slower than in single precision
Visual results
- Boundary conditions: no-slip, free
- Constant flow at start: u = 1, v = w = 0
- No-slip on sides: u = v = w = 0
- Free at far end: free condition on u, v, w (right-hand side 0)
Visual Results
[Figure: X-slice of u, v, w, T; parameters as extracted: = 0,9 t = 6 Re = 000]
Future Work
- Effective multi-GPU usage
- Distributed memory systems
- Performance improvements
- High resolution grids, high Reynolds numbers
Conclusion
- High performance and efficiency of GPUs in complex 3D fluid simulation
- CUDA is an easy-to-use tool for GPU compute programming
- GPU enables new possibilities for researching
Questions?
Thank you!
Keywords: ADI, Tridiagonal Solvers, DNS, Turbulence
E-mail: nsakharnykh@nvidia.com
References
- Paskonov V.M., Berezin S.B., Korukhova E.S. (2007), "A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids", Moscow University Computational Mathematics and Cybernetics
- ADI method: Douglas Jr., Jim (1962), "Alternating direction methods for three space variables", Numerische Mathematik 4: 41-63
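Dissipative Function
[Equation: the dissipative function written out as a sum of products of first spatial derivatives of u, v, w in x, y, z; the individual terms are not recoverable from the extracted text]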