Parallelization of a 3-D SPH code for massive HPC simulations

G. Oger a, D. Guibert b, M. de Leffe b, J.-G. Piccinali c
a. LHEEA, Ecole Centrale de Nantes
b. HydrOcean
c. CSCS - Swiss National Supercomputing Centre

Résumé : The Smoothed Particle Hydrodynamics (SPH) method is a particle method that has grown considerably over the last two decades. Although initially developed for astrophysical applications, this numerical method is nowadays widely applied to fluid mechanics, structural mechanics and various branches of physics. The SPH-flow code is developed jointly by the LHEEA laboratory of Ecole Centrale de Nantes and the HydrOcean company. This tool is mainly dedicated to the modelling of free surface flows, in a context of fast dynamics and in the presence of solids with complex geometries interacting with the fluid. In this field, the main advantage of the method lies in its ability to simulate disconnections and reconnections of the free surface (breaking waves, free surface jets, etc.) without having to capture the surface. As for most other particle methods, this method is demanding in terms of computational resources, and its parallelization is unavoidable for massive 3-D applications in order to keep reasonable restitution times. The resolutions adopted in our simulations are of the order of several hundred million particles, implying the use of several thousand networked processors, and hence an efficient parallelization. This paper presents the parallelization strategy retained in our SPH code, based on a domain decomposition and relying on the MPI standard (distributed memory). The parallel performances are presented in terms of speedup and efficiency, on cases involving up to 3 billion particles and 32,768 networked processors.

Abstract : Smoothed Particle Hydrodynamics (SPH) is a particle method that has experienced a large growth over the last two decades. While it was initially developed for astrophysics, today this method is applied to fluid, structure and other complex multiphysics simulations. The SPH-flow code, developed jointly by Ecole Centrale de Nantes and HydrOcean, is mainly designed to model complex free surface problems in a context of highly dynamic phenomena, together with the presence of complex solid geometries interacting with the fluid. In this field, the main advantage of the SPH method relies on its ability to simulate disconnections and reconnections of the free surface without having to capture it. As for other particle methods, this method is demanding in terms of computational resources, and its parallelization is compulsory for large 3-D applications in order to maintain reasonable restitution times. The order of magnitude of the resolution involved in our simulations is several hundred million particles, implying the need for thousands of CPU cores. This paper introduces a parallelization strategy based on a domain decomposition within a purely MPI-based distributed memory framework. The results are discussed through a scalability study involving up to 32,768 cores with 3 billion particles.

Keywords : SPH ; HPC ; MPI
1 Introduction

The objective of the present work is to improve the performance of the HPC code SPH-flow, developed jointly by Ecole Centrale de Nantes and HydrOcean, to make it efficient in research as well as in industrial contexts. Various problems can be solved using this 2-D/3-D parallel SPH code, such as multifluid flows [4], Fluid-Structure Interaction (FSI) [3], and viscous flows [5]. The main difficulties of the parallelization of an SPH model arise from the interpolation process, which is based on a kernel function and uses possibly variable compact supports. In Section 2, the SPH method is described, together with its kernel-based interpolation feature. In Section 3, the main aspects of the parallelization performed are introduced. In Section 4, the parallel performances are presented and discussed. Finally, industrial test cases involving massively parallel SPH simulations are presented in Section 5.

2 SPH method

The equations to be solved in our field of application are the Euler equations, as classically used to model non-viscous fluid flows. One of the main SPH features consists in considering any fluid flow as compressible, resulting in the use of an equation of state to close the system. SPH uses a set of interpolating points (particles) which are initially distributed in the fluid medium. The spatial derivatives present in the Euler equations are then simply interpolated by introducing a kernel function W that is convoluted with the discrete values of the field, as in the following example for the pressure gradient:

\[
\langle \nabla P(\vec{x}) \rangle = \sum_{i=1}^{N} P(\vec{x}_i)\, \nabla W(\vec{x} - \vec{x}_i, R)\, \omega_i, \qquad (1)
\]

where i refers to the particles in the vicinity of location \(\vec{x}\) and located within the radius R of the compact support of W, and \(\omega_i\) is the volume of particle i. Note that this kernel function W implies the same algorithmic needs as the Van der Waals function used in molecular dynamics, requiring the creation of a neighbor list and the computation of particle-to-particle interactions. The main difference resides in the fact that W acts as an actual interpolation function, similarly to the test functions used in Finite Element methods for instance. In their discrete form, the Euler equations finally become

\[
\frac{d\vec{x}_i}{dt} = \vec{v}_i, \qquad (2)
\]
\[
\frac{d\rho_i}{dt} = \sum_{j=1}^{N} m_j \left( \vec{v}_i - \vec{v}_j \right) \cdot \nabla W(\vec{r}_i - \vec{r}_j, R), \qquad (3)
\]
\[
\frac{d\vec{v}_i}{dt} = -\sum_{j=1}^{N} m_j \left( \frac{P_i}{\rho_i^2} + \frac{P_j}{\rho_j^2} \right) \nabla W(\vec{r}_i - \vec{r}_j, R) + \vec{g}. \qquad (4)
\]

Note that the above scheme corresponds to one of the various SPH formulations available in the literature. As stated in equation (2), each particle moves according to the computed fluid velocity, emphasizing the Lagrangian feature of SPH. Moreover, each of the above derivatives is advanced in time using an explicit time integration scheme such as Runge-Kutta, avoiding the need for any linear system resolution. These explicit and Lagrangian features of SPH are the main properties that impact the parallelization of this method. Furthermore, in practice at least N = 20 and N = 70 neighbors are needed in 2-D and 3-D respectively for the interpolation of each particle, so that the main computational cost lies in the computation of the flux terms of equations (3) and (4).
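To make the cost structure of equations (3) and (4) more concrete, the following C++ sketch shows what the per-particle neighbor loops typically look like, assuming a precomputed neighbor list and a cubic-spline kernel; the names (Particle, cubicSplineGradW, computeRates) are illustrative placeholders and are not taken from SPH-flow.

```cpp
// Minimal sketch of the per-particle interaction loops behind equations (2)-(4),
// assuming a precomputed neighbor list and a cubic-spline kernel.
// The types and function names are illustrative, not SPH-flow's actual API.
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

struct Particle {
    std::array<double, 3> x{}, v{};          // position and velocity
    double rho = 1000.0, p = 0.0, m = 1.0;   // density, pressure, mass
    std::vector<int> neighbors;              // indices of particles within the kernel support
};

// Gradient of the 3-D cubic-spline kernel of smoothing length h (support radius R = 2h).
std::array<double, 3> cubicSplineGradW(const std::array<double, 3>& rij, double h) {
    const double r = std::sqrt(rij[0] * rij[0] + rij[1] * rij[1] + rij[2] * rij[2]);
    if (r < 1e-12) return {0.0, 0.0, 0.0};
    const double q = r / h, sigma = 1.0 / (kPi * h * h * h);
    double dWdq = 0.0;
    if (q < 1.0)      dWdq = sigma * (-3.0 * q + 2.25 * q * q);
    else if (q < 2.0) dWdq = -0.75 * sigma * (2.0 - q) * (2.0 - q);
    const double c = dWdq / (h * r);         // dW/dq times the gradient of q
    return {c * rij[0], c * rij[1], c * rij[2]};
}

// Accumulate drho/dt and dv/dt for every particle, as in equations (3) and (4).
void computeRates(const std::vector<Particle>& parts, double h,
                  std::vector<double>& drho, std::vector<std::array<double, 3>>& dvel) {
    const std::array<double, 3> g{0.0, 0.0, -9.81};
    drho.assign(parts.size(), 0.0);
    dvel.assign(parts.size(), g);            // gravity term of equation (4)
    for (std::size_t i = 0; i < parts.size(); ++i) {
        for (int j : parts[i].neighbors) {
            std::array<double, 3> rij, vij;
            for (int d = 0; d < 3; ++d) {
                rij[d] = parts[i].x[d] - parts[j].x[d];
                vij[d] = parts[i].v[d] - parts[j].v[d];
            }
            const auto gradW = cubicSplineGradW(rij, h);
            const double pterm = parts[i].p / (parts[i].rho * parts[i].rho)
                               + parts[j].p / (parts[j].rho * parts[j].rho);
            for (int d = 0; d < 3; ++d) {
                drho[i]    += parts[j].m * vij[d] * gradW[d];   // equation (3)
                dvel[i][d] -= parts[j].m * pterm * gradW[d];    // equation (4)
            }
        }
    }
}
```

In practice these neighbor loops dominate the run-time, which is why the parallel strategy of the next section focuses on hiding the communications behind them.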
3 Parallel flux computations and time advance

The parallelization performed here is based on a domain decomposition strategy, which consists in splitting the whole particle domain into sub-domains and in attributing each sub-domain to a distinct process. The whole domain is therefore retrieved through point-to-point MPI communications between processes. In [6], we presented parallel results of computations on up to about 2,000 processors. The results were encouraging but exhibited a breakdown, revealing that the communications were not fully overlapped by the computation of the flux terms and/or the time advance. In order to overcome this problem, an overlapped zone is identified. For a given process of interest, this zone is made of three particle groups: the outer particles, which are received from the neighbor processes; the inner particles, which are sent to the neighbor processes; and the local particles, which are neither outer nor inner particles. Local particles are usually more numerous than inner particles. The strategy retained here thus consists in overlapping the communication times of the inner particles with the computation times of the local particle flux loops. The main difficulty then resides in the creation of such a Local-Inner-Outer (LIO) zone, which is not straightforward due to the need for an easy identification of the outer particles in the context of a variable kernel radii discretization. To avoid huge communications of particles that do not interact with inner particles, the shape of the overlapped zone must be built piecewise. Finally, the time integration of ODEs (3) and (4) can be summarized as in equation (5):

\[
\phi_i^{n+1} = \phi_i^n + \delta t \sum_j F_{ij}, \qquad \forall i \in \text{particles}, \qquad (5)
\]

where \(\phi_i^n\) denotes the state variables of particle i at time \(t^n\), and \(F_{ij}\) the flux between particles i and j. The parallel scheme is evaluated as described in Algorithm 1. This scheme ensures that a high parallelization efficiency can be reached, provided that the computational cost of the local particle interaction loops is greater than the communication time of the inner particles.

Algorithm 1 LIO scheme with communication overlapping.
Require: the state variables \(\phi^n\) on the local, inner and outer particles should first be correctly addressed for all processes.
1: post the MPI reception of the newly updated state variables \(\phi^{n+1}\) of the outer particles
2: for all interactions of the inner particles with their local, inner and outer neighbor particles do
3:   compute the flux terms \(F_{ij}\)
4: end for
5: for all inner particles do
6:   update \(\phi^{n+1}\) as in equation (5)
7: end for
8: MPI send of the newly computed state variables of the inner particles to the neighbor processes
9: for all interactions of the local particles with their local neighbor particles do
10:   compute the flux terms \(F_{ij}\)
11: end for
12: for all local particles do
13:   update \(\phi^{n+1}\) as in equation (5)
14: end for
15: wait for the communications of the outer particle state variables to complete
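As an illustration, Algorithm 1 maps naturally onto non-blocking MPI point-to-point calls. The following C++ sketch only shows the ordering of the MPI calls with respect to the two interaction loops; the function lioTimeStep, the packed buffers and the two callbacks are hypothetical placeholders and do not reproduce SPH-flow's actual data structures.

```cpp
// Sketch of Algorithm 1 with non-blocking MPI calls: the exchange of inner/outer
// particle states is overlapped with the interaction loop over local particles.
// Buffer layout, neighbor ranks and the two flux/update routines are hypothetical.
#include <mpi.h>
#include <functional>
#include <vector>

void lioTimeStep(const std::vector<int>& neighborRanks,
                 std::vector<std::vector<double>>& sendBuf,          // packed phi^{n+1} of inner particles, filled by innerFluxesAndUpdate()
                 std::vector<std::vector<double>>& recvBuf,          // packed phi^{n+1} of outer particles, one buffer per neighbor process
                 const std::function<void()>& innerFluxesAndUpdate,  // steps 2-7 of Algorithm 1
                 const std::function<void()>& localFluxesAndUpdate)  // steps 9-14 of Algorithm 1
{
    const int n = static_cast<int>(neighborRanks.size());
    std::vector<MPI_Request> requests(2 * n);

    // Step 1: post the receptions of the outer-particle state variables first.
    for (int k = 0; k < n; ++k) {
        MPI_Irecv(recvBuf[k].data(), static_cast<int>(recvBuf[k].size()), MPI_DOUBLE,
                  neighborRanks[k], 0, MPI_COMM_WORLD, &requests[k]);
    }

    // Steps 2-7: fluxes and time advance of the inner particles, which interact
    // with local, inner and outer neighbors.
    innerFluxesAndUpdate();

    // Step 8: send the freshly updated inner-particle states to the neighbor processes.
    for (int k = 0; k < n; ++k) {
        MPI_Isend(sendBuf[k].data(), static_cast<int>(sendBuf[k].size()), MPI_DOUBLE,
                  neighborRanks[k], 0, MPI_COMM_WORLD, &requests[n + k]);
    }

    // Steps 9-14: local particles only interact with local neighbors; this larger
    // loop is what hides the communication time of the inner/outer exchange.
    localFluxesAndUpdate();

    // Step 15: make sure all exchanges have completed before the next time step.
    MPI_Waitall(2 * n, requests.data(), MPI_STATUSES_IGNORE);
}
```

Posting the receptions before any computation and delaying the final wait until after the local loop is what allows the network transfers to progress while the bulk of the flux terms is being evaluated.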
4 Scalability study in massive HPC context

To study the parallel performances of the two approaches combined, two distinct studies can be performed. The first one considers a fixed problem size while the number of processes increases. Ideally, the reduction factor of the CPU time on p processes should be 1/p; we generally introduce the speedup, which is the inverse of this factor. Such a study is called a strong scalability study, and is presented in Subsection 4.1. Another study considers a fixed problem size per process. In this case, the communications increase as the number of processes increases. This study, named a weak scalability study, enables the identification of MPI communication bottlenecks. The results are exposed in Subsection 4.2. The studies presented hereafter have been performed on the facilities provided by CSCS. The code has been executed on Monte Rosa, fitted with two AMD Interlagos 16-core 64-bit CPUs and 32 GB of memory per compute node, and high performance networking with a Gemini 3-D torus interconnect. It features a total of 1,496 nodes, corresponding to 47,872 cores, with a theoretical peak performance of 402 TFlops.

4.1 Strong scalability

Figure 1 shows the results obtained for a total of 10⁷ particles from 8 to 256 cores, for 10⁸ particles from 256 to 16,384 cores and for 10⁹ particles from 4,096 to 32,768 cores.

Figure 1 Speedup versus number of processors obtained on a dam break test case with 10⁷, 10⁸ and 10⁹ particles, compared with the ideal linear scalability.

Figure 1 shows that the code scales linearly, but with a slope that remains lower than the ideal speedup. Due to the nature of our implementation, the boundaries are taken into account with some specific treatments, which implies that some extra interactions must be computed. These treatments suffer from some defects in terms of parallelization. However, since the proportion of near-boundary particles decreases as the total number of particles increases, this defect has a lower relative importance for very large simulations. For 10⁸ particles, the speedup tends to saturate from 16,384 processors. For the simulation involving 1 billion particles, a linear speedup is obtained, even when 32,768 cores are used.

4.2 Weak scalability

The weak scalability study measures how the time to solution varies with the number of processors for a fixed problem size per processor. Note that a fixed number of numerical time steps is applied for each case.

Figure 2 Efficiency versus number of processors on the dam break test case with 10⁵ particles per process.

Monte Rosa is fitted with 32 cores per node. As a result, in order to avoid bus bandwidth bottlenecks, the weak scalability study is started from a reference of 32 processors. Figure 2 shows that the efficiency obtained is close to 95%. There is no performance decrease, showing that the communications are fully overlapped by the computations.
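For reference, the two metrics plotted in figures 1 and 2 can be summarized by the following minimal C++ sketch; the core counts and wall-clock times used here are purely hypothetical placeholders, not measurements from this study.

```cpp
// Minimal sketch of the two scalability metrics, assuming wall-clock times measured
// for a fixed number of time steps. All values below are hypothetical placeholders.
#include <cstdio>

int main() {
    // Strong scalability (fixed total problem size): the speedup is normalized by the
    // reference run, so that the ideal value on p cores is p.
    const int pRef = 8;                   // reference core count of the strong-scaling series
    const double tStrongRef = 1200.0;     // wall-clock time on pRef cores [s]
    const int p = 256;
    const double tStrong = 45.0;          // wall-clock time on p cores [s]
    const double speedup = pRef * tStrongRef / tStrong;

    // Weak scalability (fixed problem size per core): the efficiency is the ratio of the
    // reference time to the time on p cores, ideally equal to 1.
    const double tWeakRef = 300.0;        // time of the reference run [s]
    const double tWeak = 315.0;           // time on p cores with the same load per core [s]
    const double efficiency = tWeakRef / tWeak;

    std::printf("speedup = %.0f (ideal %d), weak-scaling efficiency = %.2f\n",
                speedup, p, efficiency);
    return 0;
}
```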
5 Industrial test cases

Two distinct applications of the SPH-flow code are presented in this part, taking advantage of the improvements described above. These 3-D applications require a large number of points and therefore an efficient HPC code.

5.1 Sphere impact

This test case consists of the 3-D impact of a sphere at the free surface. The sphere, of radius R = 1 m, moves at a constant imposed vertical velocity U = 1 m/s. The slamming coefficient is defined as

\[
C_S = \frac{F_Z}{\frac{1}{2}\rho U^2 \pi R^2}.
\]

A convergence study is performed here for three different resolutions. The particle sizes in the impact area for the different discretizations are 2·10⁻² m, 10⁻² m and 5·10⁻³ m.

Δx (m)     N particles    N CPU    CPU time (h)
2·10⁻²     10⁶            32       3.2
10⁻²       4.4·10⁶        64       15.1
5·10⁻³     24.1·10⁶       256      45

Table 1 CPU data of the sphere impact.

A snapshot of the free surface deformations is given in figure 3, and the CPU data are summarized in table 1. Figure 3 shows that the results obtained converge towards the solutions given by Baldwin & Steves [1] and Battistin & Iafrati [2].

Figure 3 Left : snapshot of the sphere impact simulation involving 25 million particles. Right : convergence study of the slamming coefficient C_S versus h(t)/R, compared with Baldwin & Steves [1] and Battistin & Iafrati [2].

5.2 Lifeboat water entry

This case consists in the free fall of a lifeboat into calm water. A correct assessment of the loads and slamming pressures on a lifeboat during its water entry is essential for both structural and human safety. The present simulation has been performed for three resolutions: 1, 10 and 100 million particles. CPU data are summarized in table 2.

N particles    N CPU    CPU time (h)
10⁶            64       3
10·10⁶         512      5
100·10⁶        2048     30

Table 2 CPU data of the lifeboat free fall.
Figure 4 shows the comparison between the three resolutions. As expected, the free surface jets created by the lifeboat water entry are captured more and more accurately as the resolution increases.

Figure 4 Lifeboat impact with 1, 10 and 100 million particles.

6 Conclusion

A strategy to carry out 2-D and 3-D massively parallel SPH simulations has been presented. This has been achieved by introducing a domain decomposition framework with effective non-blocking communications. Very good scalability has been obtained for billion-particle problems on several thousands of processors. In order to further improve the parallel performances, particular attention should be paid to the implementation. All duplicated instructions between processes should be avoided, especially in the boundary handling. A future way to reduce the simulation run-times would be to combine our approach with GPGPU approaches, in order to handle the flux computations on graphics processors.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement #225967 NextMuSE. This work was supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID s29, and the authors would like to thank the CSCS staff, especially Jerome Soumagne and John Biddiscombe. HydrOcean and ECN acknowledge the computing resources offered by CCIPL, IFREMER and CRIHAN; their help was very much appreciated.

References

[1] L. Baldwin, H.X. Steves, Vertical water entry of spheres, NSWC/WOL/TR 75-49, White Oak Laboratory, Silver Spring, MD, USA, 1975
[2] D. Battistin, A. Iafrati, Water impact of 2D and axisymmetric bodies of arbitrary section, INSEAN Technical Report No. 6/01, Roma, 2001
[3] G. Fourey, D. Le Touzé, B. Alessandrini, Three-dimensional validation of an SPH-FEM coupling method, 6th SPHERIC Workshop Proceedings, 2011
[4] P.-M. Guilcher, G. Oger, L. Brosset, E. Jacquin, N. Grenier, D. Le Touzé, Simulations of liquid impacts with a two-phase parallel SPH model, Proceedings of ISOPE, 2010
[5] M. de Leffe, D. Le Touzé, B. Alessandrini, A modified no-slip condition in weakly-compressible SPH, 6th SPHERIC Workshop Proceedings, 2011
[6] P. Maruzewski, G. Oger, D. Le Touzé, J. Biddiscombe, High performance computing 3D SPH model: Sphere impacting the free-surface of water, 3rd International SPHERIC Workshop, 2008