Technical white paper

Scalability of ANSYS 16 applications and hardware selection
On multi-core and floating point accelerator processor systems

Table of contents
Abstract
Test configuration details
Message Passing Interface
ANSYS CFD test results
  ANSYS Fluent
    Standard benchmarks on a single node
    Multi-process performance relative to single-process performance
    Standard benchmarks going from one node to thirty two nodes
    Processors with higher clock speed and fewer cores
    Fluent and GPUs
  ANSYS CFX
    Small and medium standard benchmarks on a single node
    Multi-process performance relative to single-process performance
    Standard benchmarks going from one node to sixteen nodes
  Summary of ANSYS CFD test results
  Solution Reference Architectures for ANSYS CFD
ANSYS Mechanical test results
  With GPU support
  Xeon Phi
  Without GPU support
  Faster processors, fewer cores. What does that buy you?
  Summary of ANSYS Mechanical test results
  Solution Reference Architectures for ANSYS Mechanical
Conclusion
Abstract

For many years, advances in computer design have followed Moore's Law, which states that the number of transistors on a single chip increases at a fixed rate. During these years, increasing Central Processing Unit (CPU) computing power involved adding more transistors to a single CPU. This created faster and smaller chips, but with more complicated architectures. Recently, this trend has changed. Along with using floating-point accelerator units, the current trend is to increase computing power by adding CPUs to a single chip (creating a multi-core chip).

Although multi-core chips increase computing power, they do not necessarily result in immediate performance improvements. The performance and scalability of multi-core chips depend on the application that you are running. Multi-core chips can improve performance, but in doing so they increase demands on other subsystems. The memory, I/O, and networking subsystems must be able to handle these demands. In addition to using a floating point processor and fast memory, an application's utilization of these subsystems can maximize performance. Performance and scalability depend on the application design.

This paper serves three main purposes. First, it looks at ANSYS Inc. 16.0 applications and how they run on Intel-based HP ProLiant Gen9 servers. Second, it gives users insight on how to run these applications so as to get maximum performance. Third, it can help IT managers get the correct and optimal hardware for their users running ANSYS applications.

Test configuration details

In this paper, we discuss performance test results for the following ANSYS applications:
- ANSYS Fluent release 16.0
- ANSYS CFX release 16.0
- ANSYS Mechanical (MAPDL) release 16.0

The configurations for benchmark testing:
- For all non-GPU testing, we used a cluster of HP ProLiant BL460c Gen9 Servers and a cluster of XL230a Gen9 Servers connected by FDR InfiniBand. Table 1 shows the configuration for each server.
- For the GPU testing, we used HP ProLiant XL250a Gen9 Servers that were configured as shown in Table 1.

Table 1. Server configuration details.
Subsystem | HP ProLiant BL460c Gen9 server configuration | HP ProLiant XL250a Gen9 server configuration
Operating system | Red Hat Enterprise Linux Release 6.5 | Red Hat Enterprise Linux Release 6.5
Message Passing Interface (MPI) | Platform MPI version 9.1 or Intel MPI version 5.0 (depending on application) | Platform MPI version 9.1 or Intel MPI version 5.0 (depending on application)
Processors | Two Intel Xeon E5-2680 v3, E5-2697 v3, or E5-2698 v3 processors | Two Intel Xeon E5-2690 v3 processors
Memory | 128 GB of 2133 MHz DIMMs | 128 GB of 2133 MHz DIMMs
Hard drives | Two 600 GB 10K SAS drives | Four 300 GB 15K SAS drives
Graphical Processing Units (GPUs) | Not applicable | Two NVIDIA Kepler K80 GPUs or Intel 7120P Phi

Figure 1 depicts the HP ProLiant BL460c Gen9 Server that was used for testing.
Figure 1. HP ProLiant Blade enclosure (C7000) with a cluster of BL460c Gen9 blades.

For servers configured with the Intel E5-26xx v3 Haswell processors, it is important to have memory DIMMs connected to all four channels. Failure to do so will degrade performance. Although each processor can use another processor's memory, doing so increases memory latency and degrades performance.

The ProLiant XL250a server can be configured with Graphical Processing Units (GPUs) through 2 dedicated internal PCIe Gen3 connections. In the case of High Performance Computing (HPC), these GPUs can be used as floating point accelerators to measurably speed up floating point applications, provided those applications are programmed to use them. Figure 2 shows the Apollo 6000, which can be populated with these XL250a Gen9 servers.

Figure 2. Apollo 6000 with either XL230a or XL250a Gen9 servers

Message Passing Interface

ANSYS applications include an independently developed MPI. When reviewing the test results, notice that some tests do not use all the cores on each particular type of processor. When not using all the cores, there are several ways to distribute the processes over both processor sockets. We used the default method, round robin. The round robin method alternates the placement of processes on the processors. For example, when running eight processes on a node with two eight-core sockets, MPI alternates the placement of processes between each processor, resulting in four processes on each processor. This is the best way to run ANSYS applications. Although MPI can be configured to use other placement methods, we do not recommend doing so.

ANSYS CFD test results

The following benchmark scenarios were used when testing ANSYS Fluent and ANSYS CFX. Descriptions of the benchmarks can be found at the following URL.
- Small, medium, and large standard benchmarks on a single node for Fluent, and small and medium standard benchmarks for CFX on a single node.
- Multi-process performance relative to single-process performance.
- Standard benchmarks going from one node to thirty two nodes for Fluent, and one to sixteen nodes for CFX.
- Running Fluent with GPU enablement on a single node and multinode.
- Comparing the scalability of machines with 12, 14, and 16 core processors.

ANSYS Fluent

Our testing shows that Fluent version 16.0 scales well both within a single node and as nodes are added to a cluster. Also, GPU acceleration can boost performance where applicable.

Standard benchmarks on a single node

Figure 3 shows the geometric mean when running the Fluent small, medium, and large standard benchmarks on a single node. The results are in solver ratings. Solver ratings are a measure of the amount of work that can be done in a single day, so a larger solver rating indicates better performance. The solver rating is calculated as follows:

Total number of seconds in a day (86400) divided by the time of the solve step in the job

For example, if the solve time in a particular job run were 100 seconds, the solver rating for that job would be 86400/100, which is a solver rating of 864.

As Figure 3 illustrates, there is good scaling from one to 32 processes on a single node when running these benchmarks. However, notice that the performance benefit lessens as you get close to 32 processes.

Figure 3. Geometric mean of standard benchmarks (chart: solver rating versus number of processes, 1 to 32 processes)

Multi-process performance relative to single-process performance

Figure 4 illustrates multi-process performance relative to single-process performance. Notice that there is a steady increase as you increase the number of processes up to 30-32 processes. Not all applications exhibit this behavior. Some applications might scale well up until 10 to 12 processes on a node. It would not be beneficial to run those applications with more than 10 or 12 processes on one node. Fluent makes productive use of all 32 cores, although as stated earlier the benefits of running with more cores in a node are starting to tail off.

Also related to performance are processor and memory clock speeds. With applications such as Fluent, the faster the processor and memory clock speeds, the better the performance. With the machines used in these results, the processor clock was 2.3 GHz and the memory speed was 2133 MHz.
Processors with a faster clock and fewer cores may be more beneficial. More about this later.
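The solver rating definition and the geometric mean used in the charts can be written down in a few lines. The following is a minimal sketch in Python; the function names are hypothetical and chosen only for illustration, and the input values for the geometric mean are made up, not measured results.

```python
import math

SECONDS_PER_DAY = 86400  # 24 * 60 * 60


def solver_rating(solve_time_seconds):
    """Solver rating: how many solves of this duration fit into one day."""
    if solve_time_seconds <= 0:
        raise ValueError("solve time must be positive")
    return SECONDS_PER_DAY / solve_time_seconds


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Worked example from the text: a 100 second solve gives a rating of 864.
print(solver_rating(100))

# Illustrative (made-up) solve times for three benchmarks, in seconds.
ratings = [solver_rating(t) for t in (50.0, 200.0, 800.0)]
print(round(geometric_mean(ratings), 1))
```

The geometric mean is used rather than the arithmetic mean so that one very fast or very slow benchmark does not dominate the combined rating.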
Figure 4. Speedup of Fluent in a single ProLiant XL230a Gen9 node (chart: speedup versus number of processes, 1 to 32 processes)

Standard benchmarks going from one node to thirty two nodes

Figure 5 shows the speedup for the geometric mean of the Fluent small, medium, and large standard benchmarks going from one node to thirty two nodes. As you can see, the speedup is very good when compared to other applications, considering the number of processes involved (1024) and the size of the benchmarks included in the geometric mean, some of which are not very large.

What is critical here to scalability from the hardware side is the network. With parallel process applications, there are substantial amounts of communication going on between the processes in a node and also with processes spread out over multiple nodes in the network. Because of this, a network with low latency and high bandwidth is required for performance. For these tests, a network consisting of FDR InfiniBand interconnects and switches was used. At this time, there is no network that has lower latency and higher bandwidth than FDR InfiniBand!
Figure 5. Speedup of Fluent going from one to thirty two nodes (chart: speedup at 1, 2, 4, 8, 16, and 32 nodes)

The next chart, Figure 6, is the speedup of the Fluent very large benchmarks (combustor 71M and F1 racecar 140M), shown from four to thirty two nodes. What is interesting is that the scalability at larger numbers of nodes, say 8 to 16 or 16 to 32, is more pronounced than in Figure 5. This is because larger benchmarks have more work that needs to be done on them, and this work can be parallelized more efficiently at a higher level than with the other benchmarks. As you might notice, this is perfect scaling!

Figure 6. Speedup of Fluent very large benchmarks (chart: speedup versus number of nodes, up to 32 nodes)
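To make "perfect scaling" concrete, speedup and parallel efficiency can be computed from solver ratings relative to a baseline node count. The sketch below uses Python with hypothetical function names, and the ratings are invented purely to illustrate the arithmetic; they are not the measured values behind Figure 5 or Figure 6.

```python
def speedup(baseline_rating, rating):
    """Speedup relative to the baseline run (ratio of solver ratings)."""
    return rating / baseline_rating


def parallel_efficiency(baseline_nodes, baseline_rating, nodes, rating):
    """Speedup divided by the increase in node count; 1.0 is perfect scaling."""
    return speedup(baseline_rating, rating) / (nodes / baseline_nodes)


# Hypothetical solver ratings keyed by node count (illustration only).
ratings = {4: 100.0, 8: 200.0, 16: 400.0, 32: 800.0}

for nodes, rating in ratings.items():
    s = speedup(ratings[4], rating)
    e = parallel_efficiency(4, ratings[4], nodes, rating)
    print(f"{nodes:>2} nodes: speedup {s:.1f}x, efficiency {e:.0%}")
```

With these illustrative numbers every doubling of nodes doubles the rating, so the efficiency stays at 100%. An efficiency that falls as nodes are added indicates that communication cost is growing relative to compute work.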
Processors with higher clock speed and fewer cores

Earlier it was mentioned that processors with a higher clock speed but a few less cores may be the right type of system for Fluent. Figure 7 shows a chart that is an example of this. Here we have a node by node comparison of clusters with three variations of the Intel E5-26xx v3 processor. The E5-2698 v3 is the 16 core chip that was used in the benchmarking results for the previous charts. Its clock speed is 2.3 GHz. We also show the E5-2697 v3, a 14 core 2.6 GHz processor, and the E5-2680 v3, a 12 core 2.5 GHz chip. As you can see, there isn't a lot of difference when comparing the nodes of these variations. The nodes with the E5-2680 v3 perform the worst, but keep in mind these nodes only have 24 cores per node versus 28 for the E5-2697 v3 nodes and 32 for the E5-2698 v3 nodes.

Figure 7. Node by node comparison of systems with different types of E5-26xx v3 processors (chart: solver rating at 1, 2, 4, 8, and 16 nodes for the E5-2698 v3, E5-2697 v3, and E5-2680 v3)

Fluent and GPUs

Fluent version 16 continues with the GPU capability that was in version 15. Below are the results from a 1.2 million cell pipe benchmark run on two XL250a Gen9 servers with Intel E5-2690 v3 (2.6 GHz) processors, 128 gigabytes of memory, and 2 NVIDIA K80s. Each machine had two 12 core processors. Each K80 has two GPUs in it, so below we are looking at 2 and 4 functional GPUs, or 2 and 4 cores out of our ANSYS HPC Pack license. One K80 helps up to between 8 and 14 cores per node, depending on whether you are running on a single node or two nodes. Two K80s will show benefit up to 24 processes on either one node or two nodes, although in the two node case at 24 processes, running with no GPU performs almost as well as running with 2 K80s.

Figure 8. Fluent with and without GPU enablement, 1.2M pipe benchmark (chart: solver rating versus processes per node on one and two nodes; series: no GPU, 1 K80/node, 2 K80/node)
Below are some results from a larger 9.6 million cell benchmark running on two XL250a nodes of the configuration shown with the previous chart. This benchmark was too large to run on a single K80 within one node, but with 2 K80s we see a good performance boost from them. On two nodes we do see a performance boost at four and eight processes per node with one K80 per node, and with two we see the best performance from two nodes when running with eight processes per node. The 48 process no-GPU run has been included for comparison. As we can see, running with 16 processes and four K80s outperforms it. Also note that the GPU run will only use 2 HPC Pack licenses, whereas the no-GPU run will take three.

Figure 9. Fluent with and without GPU enablement on a larger benchmark, 9.6M pipe benchmark (chart: solver rating versus processes per node; series: no GPU, 1 K80/node, 2 K80/node)

ANSYS CFX

Our testing shows that CFX is highly scalable.

Small and medium standard benchmarks on a single node

Figure 10 illustrates the small and medium standard benchmarks on a single node.
Figure 10. Geometric mean of the small and medium benchmarks in a single node (chart: solver rating versus number of processes, 1 to 24 processes on one node)

Multi-process performance relative to single-process performance

With CFX, the geometric mean of the small and medium benchmarks (Pump, LeMans, and Airfoil 10M) exhibits a speedup of almost 18 times going from one to 24 processes on a BL460c Gen9 with two 12 core 2.5 GHz Intel E5-2680 v3 processors, as shown in the relative performance chart in Figure 11.

Note that the amount of memory on these machines is 128 GB. How much memory you require depends on the type and size of cases you run. For ANSYS CFD applications it is typical to have four to eight gigabytes of memory available per core. Since there are 24 cores in each of these machines, a minimum of 96 GB and a maximum of 192 GB would be recommended. 128 GB falls within this range. There are two reasons for this recommendation. With too little memory, you cannot run a job large enough to take advantage of the processing power of the processors. With large jobs and too few cores, you aren't taking advantage of the scalability of the application and the machine to give you a better time to solution.

Also notice that this machine has far fewer cores than the 32 core machine shown in the earlier Fluent results, and the scaling here continues all the way to the maximum number of cores on the system. Fluent would behave in a similar way on this machine. Conversely, the scaling of CFX on the 32 core machine would be analogous to Fluent.

Figure 11. Speedup of CFX in a single ProLiant BL460c Gen9 node
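The memory guideline of four to eight gigabytes per core translates directly into a recommended range per node. The short Python sketch below reproduces the 96 GB to 192 GB arithmetic for a 24 core node; the function name and its default arguments are hypothetical and only encode the rule of thumb stated in this paper.

```python
def recommended_memory_gb(cores_per_node, gb_per_core_min=4, gb_per_core_max=8):
    """Return the (minimum, maximum) recommended node memory in GB."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    return cores_per_node * gb_per_core_min, cores_per_node * gb_per_core_max


# Two 12 core processors per node, as in the CFX single node tests.
low, high = recommended_memory_gb(24)
print(f"24 cores: {low} GB to {high} GB")

# The 128 GB installed in the test machines sits inside this range.
print(low <= 128 <= high)
```

The same function applied to a 32 core node gives 128 GB to 256 GB, which is why memory sizing should be revisited whenever the core count per node changes.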
Standard benchmarks going from one node to sixteen nodes

When running the small, medium, and large CFX benchmarks (Pump, LeMans, Airfoil 10M, and Airfoil 50M) on more than one compute node, one can see a good speedup, too. The performance increase runs over ten times for sixteen nodes, which shows the benefit of going parallel over multiple nodes with CFX.

Figure 12. Speedup of CFX from one to sixteen nodes (chart: speedup at 1, 2, 4, 8, and 16 nodes)

With Figure 13 we see the scalability of the large CFX benchmarks (perf_Airfoil 50M and 100M). Like Fluent, larger benchmarks scale better on a node by node basis than the smaller ones. This shows almost perfect scaling!
Figure 13. Speedup of the large CFX benchmarks (chart: speedup versus number of nodes, up to 16 nodes)

Summary of ANSYS CFD test results

Our testing shows that Fluent and CFX are highly scalable. What makes Fluent and CFX so scalable compared to other applications? First, credit must be given to the developers at ANSYS, who have worked over the years to make Fluent and CFX high performing, scalable applications. Other reasons involve the characteristics of the applications. Like many CFD applications, Fluent and CFX do not perform a lot of file system I/O, so they are not dependent on the speed of the file system, which can slow down a high performance computational application. In addition, Fluent and CFX do not over-tax the bandwidth of the memory system on the nodes when used with a proper memory to core configuration.

For Fluent and CFX running on single servers, another consideration is the latency and bandwidth of the node memory. In a single node, the data transmitted between a core running a Fluent or CFX process and memory can affect performance. These memory parameters also affect the communication time among Fluent or CFX processes. Since Fluent and CFX parallelism is a multi-process form based on a message passing protocol, from time to time various processes in a parallel Fluent or CFX job will have to communicate with one another. In a single node, this communication is passed from a process, which is running on a core, through memory to another process, which is running on another core. So memory performance affects this communication time, and using 2133 MHz memory versus the previous maximum of 1866 MHz is a benefit here.

For Fluent and CFX running across multiple servers, in a multi-node job, node-to-node communication time becomes important. In clusters of ProLiant BL460c Gen9, XL230a Gen9, and XL250a Gen9 servers (used in our testing), a high speed FDR InfiniBand inter-node interconnect was used to facilitate communication between nodes. This interconnect has low latency and high bandwidth characteristics, which make it ideal for this task. Larger jobs will scale better with higher numbers of nodes with these applications.
The more work you give them, the more parallelism you can get out of them, provided you have the hardware and licenses!

Solution Reference Architectures for ANSYS CFD

From our test results, it can be concluded that the best hardware configuration for a given application will depend on a number of factors. Nonetheless, some general recommendations and guidance can be made for suitable compute clusters running ANSYS CFD, as shown in the three SRAs in Figures 14, 15, and 16.

Figure 14. ANSYS CFD (Fluent and CFX): starter cluster kit
Server Options:
- 1 XL1x0r head node
- 2-4 ProLiant Xeon nodes, each using 2 processors, in an Apollo 2x00 chassis
- 24 to 28 cores per compute node, E5-2697 v3 14 core 2.6 GHz processors recommended
- 3 to 6 SAS drives (RAID 0)
Options: 2 NVIDIA K40 (Fluent)
Total Memory for the Cluster:
- Compute nodes: 4 to 8 GB/core
- Head node: 32 GB or more depending on role
Cluster Interconnect: Integrated Gigabit, 10 Gigabit Ethernet, or QDR InfiniBand (recommended for jobs using more than four nodes)
Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012
Workloads: Suited for ANSYS CFD models up to ~23M cells (Fluent), and depending on mesh ~6M to ~23M nodes (CFX).
Pictured: HP ProLiant XL170r or XL190r Gen9 nodes in a 2U Apollo 2000 chassis

The SRA shown in Figure 14 is our Starter Cluster Kit for CFD. This showcases the Apollo 2000 chassis, which can house up to four XL170r or XL190r ProLiant servers. There are some common things to keep in mind when looking at this SRA and the ones that follow: the number of nodes, the number of cores per compute node, the amount of memory per core, and the ideally suited workloads shown in the SRA. In the case of this SRA, the workloads listed are for the maximum configuration shown, which is four nodes, each with 24 to 28 cores per compute node and 8 GB per core. If the workloads you run are, say, half the size, then perhaps you want two nodes, or four nodes with 4 GB per core, or some mixture of lower core count processors and memory. If your jobs are larger, read on!

Figure 15. ANSYS CFD (Fluent and CFX): midsize cluster
Server Options:
- 1 DL380 head node
- 4-32 ProLiant XL230a or XL250a Xeon nodes (Apollo 6000), each using 2 processors, in an a6000 chassis (up to 10 nodes per chassis)
- E5-2698 v3 2.3 GHz 16 core processors recommended
- 2 to 4 1TB 15K SAS drives per compute node
- Up to 2 NVIDIA K80 per XL250a node (Fluent)
Options: Configure the DL380 node with up to 24 internal SAS drives, with extra memory/storage for very large jobs.
Total Memory for the Cluster:
- Compute nodes: 4 to 8 GB/core
- Head node: 32 GB or more depending on role
Cluster Interconnect: FDR InfiniBand 2:1
Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012
Workloads: Suited for ~4 simultaneous ANSYS CFD models up to ~5M cells (Fluent), and depending on mesh ~1 to ~5 nodes (CFX). Or, run ~2 to ~3 simultaneous ANSYS CFD models on the scale of ~5M cells (Fluent), ~1 to ~5 M nodes (CFX), again depending on mesh.
Pictured: Apollo 6000

In Figure 15 we are looking at our Midsize Cluster for CFD. The same considerations about the number of compute nodes, the number of cores per compute node, the amount of memory per core, and how these relate to the ideal workloads apply as with the previous SRA. With this one, however, we show the Apollo 6000 chassis with XL250a servers. One thing to note here is that FDR InfiniBand is specified for the interconnect. With four nodes and up, you want the high-speed InfiniBand interconnect!

Figure 16. ANSYS CFD (Fluent/CFX): Large Scale-Out Cluster

Server Options:
- 1 DL380 head node
- ProLiant BL460c nodes, each using 2 processors
- E5-2697 v3 2.6 GHz 14 core processors recommended
- Two 1.2TB 15K SAS drives per compute node
Options:
- WS460c Gen9 workstation blade with NVIDIA Quadro K6000 graphics card for pre/post processing using remote visualization
- Configure the head node with extra memory/storage for very large jobs
Rack: 42U
Total Memory for the Cluster:
- Compute nodes: 4 to 8 GB/core
- Head node: 32 GB or more depending on role
Cluster Interconnect: FDR InfiniBand 2:1
Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012
Workloads: Suited for ~4 simultaneous ANSYS CFD models greater than 5M cells (Fluent), and greater than 1 or 5M nodes (CFX) depending on mesh.
Or running greater than 3 simultaneous ANSYS CFD models on the scale of ~5M cells (Fluent), ~1 to ~5M nodes (CFX), depending on mesh.

With Figure 16 we have our large scale-out cluster. Here we have BL460c blade servers. One specific thing to point out about this SRA is that you can replace one of your server blades with a WS460c workstation blade that can have a K6000 graphics card, or a K2 for combined computation and remote graphics!
ANSYS Mechanical test results

We tested ANSYS Mechanical with GPU support and without GPU support, using benchmark scenarios which can be found at the following URL.

With GPU support

We ran tests with ANSYS Mechanical version 16.0 in distributed (DMP) mode with the following mainstream solvers, which support GPU acceleration:
- Direct Sparse
- Preconditioned Conjugate Gradient (N/A with the Intel Xeon Phi)
- Jacobi Conjugate Gradient (N/A with the Intel Xeon Phi)

NOTE: ANSYS Mechanical DMP mode gained Xeon Phi support in version 16 for the Direct Sparse solver with symmetrical matrices.

There are several ways to utilize GPUs for jobs running on one node or on a cluster of nodes:
- One ANSYS Mechanical job running on one node with one or more GPUs: If only one job is running on one node, that job can use one or more GPUs. In DMP mode, a job is segmented into chunks. These chunks run as processes on different cores. Each process computes in parallel with the other processes. When utilizing GPUs, each process runs until it can offload part of the computation to the GPU. At this point, the process sends work to the GPU, waits for the results, and continues.
- Multiple ANSYS Mechanical jobs on one node with multiple GPUs: To use additional GPUs on a node, you can run more than one ANSYS Mechanical job on the node; each job uses a different GPU.
- One ANSYS Mechanical job on multiple nodes, each with a GPU: A single ANSYS Mechanical job can use more than one GPU by running the job on multiple nodes, with one GPU on each node it is running on. If you run a 24-process parallel job over four nodes, you could use four GPUs with the job, as long as you have at least one GPU on each node.

Note the following restrictions when using ANSYS Mechanical with GPUs:
- These GPU-enabled solvers support many ANSYS Mechanical datasets, but there may be times when the GPU cannot be used for certain types of analysis.
- If the data for the job is too big for the onboard GPU memory with the PCG solver, ANSYS Mechanical will not use the GPU; instead, it will run in normal compute mode.
Figure 17 shows that, on average, you get a substantial benefit from the NVIDIA K80, with each K80 having 2 GPUs on it. Note that as the number of processes goes up, the benefit of the GPUs relative to using no GPUs becomes smaller. Also, these are systems with 24 cores per node, but only a maximum of 16 processes per node is shown. This is because when using GPUs there is not much of a performance benefit when running with more than 32 processes over two nodes. Also observe that results for 24 processes and 28 processes over two nodes are included too. The performance with 24 processes and two K80s per node exceeds the performance with 32 processes with no GPU or one K80 per node. It really isn't that far off from 32 processes with two K80s per node.

From a licensing standpoint, using 24 cores and 8 GPU units (4 K80s) requires only two HPC Pack licenses, whereas with 32 processes and even one K80 you would need three HPC Packs to run. If all you have is one HPC Pack license, you can see that running with six processes and one K80 would outperform not using GPUs at all. In these ways the GPUs can help you maximize your performance per cost of licenses.
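The licensing arithmetic above can be sketched as a small calculation. The capacity rule used here (8, 32, and 128 parallel units for one, two, and three HPC Packs, with each GPU counted like one core) is an assumption inferred from the examples in this paper and should not be read as a statement of ANSYS licensing terms. The function names are hypothetical.

```python
def hpc_pack_capacity(packs):
    """Parallel units enabled by `packs` HPC Packs (assumed rule: 2 * 4**packs)."""
    if packs < 1:
        raise ValueError("need at least one HPC Pack")
    return 2 * 4 ** packs


def packs_needed(processes, gpus=0):
    """Smallest number of packs covering processes plus GPUs (assumed 1 unit per GPU)."""
    units = processes + gpus
    packs = 1
    while hpc_pack_capacity(packs) < units:
        packs += 1
    return packs


# Cases discussed in the text; each K80 card contributes two GPUs.
print(packs_needed(24, gpus=8))  # 24 cores + 4 K80s
print(packs_needed(32, gpus=2))  # 32 processes + 1 K80
print(packs_needed(6, gpus=2))   # 6 processes + 1 K80
```

Under this assumed rule, crossing a capacity boundary by even one unit requires the next pack, which is why 32 processes plus a single K80 lands on three packs while 24 processes plus four K80s fits in two.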
Figure 17. Comparison of geometric mean of standard benchmark results with and without K80 GPU acceleration (chart: solver rating versus processes per node, from 1 process on one node to 32 processes on two nodes; series: without GPU, one K80/node, two K80/node)

Xeon Phi

New in ANSYS Mechanical version 16 is support for the Xeon Phi in the multi-process parallel product. Figure 18 shows us the performance of running with one and two 7120 Phi cards, with the results being the geometric mean of a select group of the ANSYS standard benchmarks. The Phi only works on jobs in which the sparse solver is used and where the solution matrix is symmetrical. Also, one Phi will only work up to 8 processes and two will go up to 16 processes. The chart shows us that, where applicable, the Phi does give us an advantage, scaling well up to six to eight processes for one Phi and up to 14 when using two Phis, which outperforms running with 24 processes without Phi acceleration. As with the GPUs, the Xeon Phi can help you make optimal use of a single HPC Pack license when running with six processes and one or two Phis, as compared to running with eight processes without acceleration.

Figure 18. Comparison of geometric mean of standard benchmark results with and without Xeon Phi acceleration (chart: solver rating versus number of processes, 1 to 24 processes; series: no Phi, one Phi, two Phi)

Without GPU support

Figure 19 shows ANSYS Mechanical runs within a node. Specifically, we can surmise from Figure 19 that Mechanical speeds up over 12 times. We also see that the application has an increase in performance as you increase the number of parallel processes, although as we get above 16 to 18 processes the relative improvement lessens. Still, this shows that ANSYS Mechanical scales.

Figure 19. Geometric mean of all the standard benchmarks ran in a node.
When running on two nodes, ANSYS Mechanical will scale to all the cores on both nodes, although again it will lose efficiency as you increase the number of processes up to 48. With four nodes and beyond, however, look at Figure 20. The results shown here are from running with various process counts on one to eight nodes. It is a bit of an eye chart, but the point of this chart is to show that ANSYS Mechanical will scale up to 8 nodes, and that at eight nodes optimal performance is reached when running 16 processes per node, even though each node had 24 cores and this cluster of eight enables up to 192 processes in total. At node counts lower than eight, the application will scale somewhat to all the processors in a node, as with the single node results previously shown, but as you get above 16 to 18 processes per node the efficiency in performance improvement drops off as it did on the single node. Again, at eight nodes the efficiency in improving performance drops off above 16 processes per node, as we can see.

Figure 20. Geometric mean of all the standard benchmarks ran from one to eight nodes with varying processor counts (chart: solver rating versus process and node count, from 1 process on one node to 192 processes on eight nodes)

To clean this up a bit, look at Figure 21. This shows multi-node scaling from one to eight nodes using a maximum of 16 processes per node. One other thing you might notice here is that 128 processes use three HPC Packs, so if you have three HPC Packs and eight nodes this is the way you would run your job, provided it is big enough to use all eight nodes. Along the same line of reasoning, if you had two HPC Packs you would run 32 processes over two nodes, even if your servers have two 16 core processors each and can run 32 processes on one node.
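The multi-node observations above can be condensed into a rule of thumb for choosing the process count of a non-GPU ANSYS Mechanical DMP job. The Python sketch below is one possible encoding, not a definitive sizing method: the function name is hypothetical, the cap of 16 processes per node comes from the results in this paper, and treating three nodes like four or more is an assumption, since three-node runs are not reported.

```python
def mechanical_processes_per_node(nodes, cores_per_node, cap=16):
    """Suggested processes per node for a non-GPU Mechanical DMP job.

    One or two nodes: use every core. More nodes: cap at `cap` per node
    (assumption: three nodes are treated like four or more).
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be at least 1")
    if nodes <= 2:
        return cores_per_node
    return min(cores_per_node, cap)


# 24 core nodes, as used in the Mechanical tests.
for nodes in (1, 2, 4, 8):
    per_node = mechanical_processes_per_node(nodes, 24)
    print(f"{nodes} node(s): {per_node} per node, {nodes * per_node} total")
```

For eight 24 core nodes this gives 16 processes per node and 128 processes in total, matching the largest configuration shown in Figure 21. License limits may justify a lower count than this function returns.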
Figure 21. Geometric mean of all the standard benchmarks ran from one to eight nodes with a maximum of 16 processes per node (chart: solver rating versus process and node count, from 1 process on one node to 128 processes on eight nodes)

Faster processors, fewer cores. What does that buy you?

The answer to this question is: it depends. Look at Figure 22. Here is a comparison of the performance of various processor types within a node, with results from running ANSYS Mechanical benchmarks. The highest performing processor on a core by core or process basis is the Intel E5-2667 v3, which has eight cores and is clocked at 3.2 GHz. The processor we have been looking at in the previous charts is the Intel E5-2680 v3, which has 12 cores and is clocked at 2.5 GHz. The other two shown are the E5-2690 v3, with 12 cores clocked at 2.6 GHz, and the E5-2695 v3, with 14 cores clocked at 2.3 GHz. Compared to the other processor models, the fast 3.2 GHz processor saves you four to six cores. If you only have two HPC Packs and a small number of nodes, and never intend to expand, perhaps this is the processor you would want. Note that you will see similar behavior with the CFD applications. However, if you have a medium to large cluster of machines, you will want processors with more cores, which will give you better overall performance on a node by node basis.

Figure 22. Comparison of various processor types with ANSYS Mechanical.

Summary of ANSYS Mechanical test results

The conclusion about how to run ANSYS Mechanical in DMP mode on a cluster of ProLiant Gen9 servers, without GPU acceleration, is as follows: to run a single job on one or two nodes, you would use all the cores in the node; however, if you were limited by licenses or were running on four nodes or more, you would limit the number of processes to 16 per node. When using
GPU acceleration, you want to run with a maximum of 16 processes and two K80s per node. However, if you are limited by HPC Pack licenses, you might want to spread this out to 14 processes and one K80, which has two GPUs in it, or 14 processes and two K80s, which have four GPUs. Where applicable, the Xeon 7120P Phi can help make optimal use of an HPC Pack license as well. One of the most effective ways to take advantage of the GPUs is to run multiple jobs on a node, each using up to six processes and one K80. Also, with the XL190r server model, the K80 is not available, but the K40 is. Roughly speaking, each GPU on a K80 is equivalent to a K40, so you would adjust your processes versus GPU combinations accordingly.

With respect to I/O, since this application makes heavy use of the file system, you want the application to use local storage attached to each individual node. You can run it from a shared file system and it will work fine, but having the file I/O go over a network file system (NFS) or even a fast shared file system will perform worse than using local storage.

Solution Reference Architectures for ANSYS Mechanical

From our test results, it can be concluded that the best hardware configuration for a given application will depend on a number of factors. Nonetheless, some general recommendations and guidance can be made for suitable compute clusters running ANSYS Mechanical, as shown in the two SRAs in Figures 22 and 23.

Figure 22. Starter Cluster Kit for Mechanical

Server Options:
- 1 XL1x0r head node
- 2-4 ProLiant Xeon nodes, each using 2 processors, in an Apollo 2x00 chassis
- Up to 24 cores per compute node, E5-2690 v3 12 core 2.6 GHz processors recommended
- 3 to 6 480GB SSD drives (RAID 0)
Options: 2 NVIDIA K40 (XL190r)
Total Memory for the Cluster:
- Compute nodes: 4 to 8 GB/core
- Head node: 32 GB or more depending on role
Cluster Interconnect: Integrated Gigabit, 10 Gigabit Ethernet, or QDR InfiniBand (recommended for jobs using more than 2 nodes).
Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012
Workloads: Suited for Mechanical up to ~7M or ~48M DOF depending on solver used
Pictured: HP ProLiant XL170r or XL190r Gen9 nodes in a 2U Apollo 2000 chassis

The SRA shown in Figure 22 is our Starter Cluster Kit for ANSYS Mechanical. It is similar to our CFD starter kit, except we show the option of using 480GB SSDs. ANSYS Mechanical performance can be highly sensitive to file system performance, and parallel RAID 0 Solid State Drives (SSDs) can dramatically outperform standard hard drives.
Figure 23. Fat node Cluster for Mechanical.

Server Options:
- 1 DL380 head node
- 4-8 ProLiant DL380 Xeon server nodes, each using 2 processors (24 cores), with an NVIDIA K80 and 2 to 24 internal 600GB SAS 15K drives or 800GB SAS SSDs in striped RAID per compute node, plus a 6x2TB SAS RAID disk array on the head node
- Or 4-8 XL250a Xeon server nodes (Apollo 6000), each using 2 processors (24 cores), with up to 2 NVIDIA K80 or Intel 7120P Phi and 2 internal SAS 15K drives or 800GB SAS SSDs per compute node (suitable for nonlinear jobs >= 2M DOF)
- E5-2680 v3 2.5 GHz or E5-2690 v3 2.6 GHz 12 core processors recommended
Rack: 42U
Total Memory for Cluster:
- 8 GB/core on head node
- 4 to 8 GB/core on each remaining compute node
Cluster Interconnect: FDR InfiniBand
Operating Environment: 64-bit Linux or Microsoft HPC Server 2008
Workloads: RAM configurations will handle up to ~8 simultaneously running ANSYS mega-models of ~45-18M or ~45M DOF and up, depending on solver used.
Pictured: Apollo 6000, DL380p

Our last SRA is the Fat node Cluster for ANSYS Mechanical. The reason it is called the Fat node cluster is that it uses the DL380 Gen9 server, in which you can put up to 24 disks, using SSDs for file system performance if you want. We also show the option of the XL250a for GPU support. However, you can now get the DL380 server with one K80. Also, ANSYS Mechanical can be even more sensitive to the type of interconnect than the CFD codes are, so here we have FDR InfiniBand recommended for configurations of four nodes and larger. You would get some benefit at two nodes as well.

Conclusion

HP designed the hardware configurations used in the analysis for this paper for HPC. The servers are configured using Intel high performing E5-2600 v3 processors, fast memory DIMMs, and high performance disk drives. Other HP two-processor server models with similar processors, memory, network, and disk subsystems will perform similarly.

This summary of ANSYS applications on HP ProLiant servers using Intel Xeon E5-26xx v3 12 and 16 core processors shows that now, as in the past, as the number of cores on the processors increases, application performance improves.
The performance of memory and network components has improved to maximize the performance of these processors; however, there are still considerations to be taken into account when running ANSYS CFD applications in parallel.

Fluent and CFX are both highly scalable, both within a node and over many nodes in an HPC cluster. With both applications, the recommendation when running multi-node parallel is to fill up the nodes with processes, up to the maximum number of cores in the case of the machines tested. Of course, with these as with all applications, you need to match the level of parallelism to the requirements of the dataset. If you try to parallelize a job that does not have enough work, there will be too much inter-process communication compared to the amount of compute work in an application process, and the advantage from spreading out the compute work over a number of cores will be negated. In version 16 there is GPU support for the Fluent application. Where applicable there is a benefit, at least with small parallel process counts.

With the new version of ANSYS Mechanical, we see that it is highly scalable. So it is recommended to fill up the nodes with processes when running on a node or two, as shown in this paper with the 24 core systems. However, when running on a larger number of nodes, running with the nodes fully populated may not buy you anything. Also, faster processors with smaller core counts will get better performance on a core by core basis, but on a node by node basis having more cores, up to a point,
is beneficial. When running with GPUs, a maximum of 16 processes per node using two K80s, or four GPUs per node, would be ideal. GPUs would also be very effective when running more than one job per node, running two to three jobs of six or seven processes with a GPU for each job. Also, if you are limited in the number of licenses you have, using GPUs is a great way to optimize your performance. For that matter, where applicable, the Xeon 7120P Phi would be as well.