Hierarchically Parallel FE Software for Assembly Structures : FrontISTR - Parallel Performance Evaluation and Its Industrial Applications


1 CO-DESIGN 2012, October 23-25, 2012, Peking University, Beijing. Hierarchically Parallel FE Software for Assembly Structures: FrontISTR - Parallel Performance Evaluation and Its Industrial Applications. Hiroshi Okuda, The University of Tokyo [email protected]


2 Outline
- Background: Towards peta/exascale computing
  Necessary advances in programming models and software
- FrontISTR: as an HPC tool for industry
- Large-grain Parallelism: Assembly structures under hierarchical gridding
- Small-grain Parallelism: Blocking with padding for multicore CPU (and GPUs)
- Summary


3 Necessary advances in programming models and software
- Fast SpMV for unstructured grids
- A programming model nested two levels deep (at most), i.e. message passing and loop decomposition
- Keep consistency and stability
- The B/F required by the program should be consistent with the H/W
- Automatic generation of compiler directives, which can take dependencies into account, after trial runs
- Middleware (a mid-level interface) bridging the application and the BLAS-based numerical libraries
- Uncertainty or risk in the physical model
- Hierarchical consistency between the H/W configuration and the physical modeling, particularly in engineering fields


4 FrontISTR built on HEC-MW
FrontISTR: Nonlinear analysis functions
- Hyper-elasticity / Thermal-elastic-plastic / Visco-elastic / Creep, Combined hardening rule
- Total/Updated Lagrangian
- Finite slip contact, Friction
HEC-MW: Advanced features of parallel FEM
- Hierarchical mesh refinement
- Assembly structure
- Up to O(10^5) nodes
- Portability: from MPP to PC, CAE cloud
Nonlinear structural analysis functions have been deployed on a parallel FEM basis: HEC-MW.


5 Structural Analysis Functions Supported in FrontISTR

Function      Supported contents
Static        Material: Elastic / Hyper-elasticity / Thermal-Elastic-Plastic / Visco-Elastic / Creep, Combined hardening rule
              Geometry: Total Lagrangian / Updated Lagrangian
              Boundary: Augmented Lagrangian / Lagrangian multiplier method, Finite slip contact, Friction
Dynamic       Linear/Nonlinear, Explicit/Implicit
Eigenvalue    Lanczos method (considering differential stiffness)
Heat          Steady / Non-steady (implicit), Nonlinear


6 FrontISTR

7 Thermal-Elastic-Plastic Analysis of Welding Residual Stress
Joint research with IHI
- Heat source transfer along a welding line
- Residual stress induced by plastic deformation
Figure label: Temperature

8 Cupping press simulation / Elasto-plasticity and friction on contact faces
A punch is pressed into a blank, which is placed between a die and a blank holder. The blank is formed into a cylindrical shape as the punch advances.


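10 Friction of power transmission belt
Joint research with Mitsuboshi Belt
- Only half of the belt is modeled in the width direction, owing to symmetry
- Need for large-scale analysis
Figure labels: belt (V belt), pulley surface (rigid body), driving pulley (Active), driven pulley (Passive), rotation, load torque, shaft load, rubber, cord, canvas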

11 Structural integrity analysis of electrical devices
Courtesy of 先端力学シミュレーション研究所

12 Contact force analysis of brake disk


14 Large-grain Parallelism: Parallelization based on domain decomposition
Mesh partitioning / domain decomposition
Figure: four partitions, each with its own Local Data, FEM Code and Solver Subsystem; the Solver Subsystems are connected by MPI.


15 Data structure for assembly structures with parallel and hierarchical gridding
A) Partitioning (MPI ranks)
B) Hierarchical level
C) Assembly model
Figure: Assembly_1 and Assembly_2 (C) at Level_1, coupled by MPC, are refined (B) to Level_2; the assemblies are partitioned (A) at each level.


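16 Iterative solvers with MPC* preconditioning

Solver       CG itrs.   CPU (sec)   CPU/CG itr. (sec)
MPCCG        14,203     3,
Penalty+CG   171,354    40,

Algorithm (MPCCG):
\begin{align*}
& r_0 = T^{T} f - T^{T} K T u_0, \qquad p_0 = r_0 \\
& \text{For } k = 0, 1, \ldots \\
& \qquad \alpha_k = \frac{(r_k, r_k)}{(p_k,\; T^{T} K T p_k)} \\
& \qquad u_{k+1} = u_k + \alpha_k p_k \\
& \qquad r_{k+1} = r_k - \alpha_k T^{T} K T p_k \\
& \qquad \text{Check convergence} \\
& \qquad \beta_k = \frac{(r_{k+1}, r_{k+1})}{(r_k, r_k)} \\
& \qquad p_{k+1} = r_{k+1} + \beta_k p_k \\
& \text{end} \\
& u = T u
\end{align*}

*) MPC: Multi-point constraint. T: Sparse MPC matrix.
Figure label: Mises stress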
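The MPCCG iteration on this slide can be sketched in a few lines of plain Python. This is only an illustrative reading of the algorithm, not FrontISTR's implementation: the function name `mpc_cg`, the dense list-of-lists storage, the zero initial guess and the tolerance are all assumptions made for the example. The operator T^T K T is applied as three successive products and is never formed explicitly.

```python
def matvec(A, x):
    """Dense matrix-vector product for list-of-lists matrices."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def mpc_cg(K, T, f, tol=1e-10, max_iter=200):
    """CG on the reduced system (T^T K T) v = T^T f, then u = T v.

    K: stiffness matrix, T: MPC matrix mapping independent DOFs to all DOFs.
    Hypothetical helper; starts from v = 0 (an assumption of this sketch).
    """
    Tt = transpose(T)

    def reduced_op(p):
        # p -> T^T K T p without assembling the reduced matrix
        return matvec(Tt, matvec(K, matvec(T, p)))

    v = [0.0] * len(Tt)
    r = matvec(Tt, f)          # r = T^T f - T^T K T v with v = 0
    p = r[:]
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:    # check convergence
            break
        q = reduced_op(p)
        alpha = rr / dot(p, q)
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return matvec(T, v)        # recover the full displacement vector


# Three DOFs with the constraint u[2] = u[1]; T maps 2 independent DOFs to 3.
K = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
T = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
u = mpc_cg(K, T, [1.0, 2.0, 3.0])
```

For this small example the reduced system is [[4, 1], [1, 7]] v = [1, 5], so the constrained solution is u = (2/27, 19/27, 19/27), and the constraint u[2] = u[1] holds by construction rather than through a penalty term.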


17 Assembled Structure: Piping composed of many parts
- 5 pipes & 32 bolts
- 10 mm, 2nd-order tet-mesh
- 3,093,453 elements
- 5,433,029 nodes
- Num. of MPC: 70,166
A piping system composed of many parts is easily handled.
Figure labels: fixed, Mises stress

18 Strong Scaling with Refiner - Static linear analysis of machine part - 2nd-order tetra element - PCG (eps=10^-6) FX10@UT: SPARC64 IXfx (1.848 GHz), 1 CPU (16 cores)/node

19 Performance of 1 node is a crucial factor Access of innermost loop Simulator 2 Intra-node parallel - Remove dependency by ordering - OpenMP and/or vectorization

20 Blocking with padding for multicore CPU and GPUs
SpMV in iterative solvers is crucial for unstructured grids.
For improving the B/F ratio: Blocking, Padding, ELL+CSR.
- Blocked CSR
- HYB format = ELL + CSR
Figure: Breakdown of CPU time in CG operations (SpMV, AXPY & AYPX, DOT)


21 Flop/Byte
SpMV with CSR: Flop/Byte = 1/{6*(1+m/nnz)} = 0.08~
SpMV with BCSR: Flop/Byte = 1/{4*(1+fill/nnz) + 2/c + 2m/nnz*(2+1/r)} = 0.18~
nnz: number of non-zero components; m: number of columns; r, c: block size; fill: number of zeros for blocking
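The two Flop/Byte expressions on this slide are easy to evaluate for trial values. The sketch below is illustrative only: the function names are hypothetical, the term `2m/nnz*(2+1/r)` is read as `(2m/nnz) * (2 + 1/r)`, and the ratios passed in are made-up sample inputs rather than values from the talk.

```python
def csr_flop_per_byte(m_over_nnz):
    """Flop/Byte of SpMV with CSR: 1 / {6 * (1 + m/nnz)}."""
    return 1.0 / (6.0 * (1.0 + m_over_nnz))

def bcsr_flop_per_byte(m_over_nnz, fill_over_nnz, r, c):
    """Flop/Byte of SpMV with BCSR (r x c blocks):
    1 / {4 * (1 + fill/nnz) + 2/c + (2m/nnz) * (2 + 1/r)}.
    """
    denom = (4.0 * (1.0 + fill_over_nnz)
             + 2.0 / c
             + 2.0 * m_over_nnz * (2.0 + 1.0 / r))
    return 1.0 / denom


# Sample inputs (assumptions for illustration, not measured matrices).
csr_sparse_rows = csr_flop_per_byte(1.0)    # one non-zero per column
csr_dense_rows = csr_flop_per_byte(0.0)     # limit of many non-zeros per column
bcsr_ideal_3x3 = bcsr_flop_per_byte(0.0, 0.0, 3, 3)  # 3x3 blocks, no fill
```

With these inputs CSR lies between 1/12 and 1/6, while an ideal 3x3 blocking without fill reaches 1/(4 + 2/3), about 0.21. Blocking raises the ratio because one column index is shared by a whole block, but every padded zero counted in `fill` pushes it back down.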

22 Acceleration of SpMV Computation
- Parallelization: Rows are distributed among threads.
- Load balancing: Reallocate rows to balance loads.
- Blocking: Matrix format is crucial.
  CSR: Compressed Sparse Row, Flops/Byte = 0.08~
  BCSR: Blocked CSR, Flops/Byte = 0.18~
Figure labels: value, colindx, rowptr; Thread 1, Thread 2, Thread 3; Balanced; A B C D E
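A minimal sketch of the first two points, row distribution and load balancing, for a matrix stored in the CSR arrays named on the slide (`value`, `colindx`, `rowptr`). The helper names are hypothetical, the balancing rule (contiguous row ranges with roughly equal non-zero counts) is one simple choice and not necessarily the one used in FrontISTR, and the "threads" are processed one after another here just to show the partitioning.

```python
def csr_spmv_rows(value, colindx, rowptr, x, row_begin, row_end):
    """y[i] = sum_k value[k] * x[colindx[k]] for rows in [row_begin, row_end)."""
    y = []
    for i in range(row_begin, row_end):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += value[k] * x[colindx[k]]
        y.append(s)
    return y

def balanced_row_ranges(rowptr, n_threads):
    """Split rows into contiguous ranges holding roughly equal numbers of non-zeros."""
    n_rows = len(rowptr) - 1
    nnz = rowptr[-1]
    bounds = [0]
    for t in range(1, n_threads):
        target = nnz * t / n_threads
        i = bounds[-1]
        while i < n_rows and rowptr[i] < target:
            i += 1
        bounds.append(i)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


# Unbalanced 5x5 example: row 0 holds 4 non-zeros, rows 1-4 hold one each.
value = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0]
colindx = [0, 1, 2, 3, 1, 2, 3, 4]
rowptr = [0, 4, 5, 6, 7, 8]
x = [1.0] * 5

ranges = balanced_row_ranges(rowptr, 2)
y = []
for begin, end in ranges:      # each range would go to one thread
    y += csr_spmv_rows(value, colindx, rowptr, x, begin, end)
```

Here the two ranges are rows [0, 1) and [1, 5), each with 4 non-zeros, whereas an even split by row count would leave one thread with more work than the other.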

23 Performance Test (1/3) Load Balancing on Nehalem (Core i7 975)
SpMV x10,000 on unbalanced matrices from the library [2]
Left: without load balancing. Right: with load balancing.
[2] T. A. Davis. University of Florida sparse matrix collection, 1997.

24 Performance Test (2/3) Parallelization and Matrix Format
Performance of SpMV on Nehalem (Core i7 975), CSR / BCSR format
Figure axes: performance [MFLOPS] vs. matrix #

25 Performance Test (3/3) Overall CG Solver
CG solver's performance relative to CSR single thread on Nehalem (Core i7 975)
Figure axes: performance over CSR single thread vs. matrix #

26 Performance Model (1/2)
The K computer's roofline model, based on Williams's model [1]. Sustained performance can be predicted with respect to the application's Flop/Byte ratio.
Figure labels: 8 / 128, estimated To-peak 6.25%, Sustained
[1] S. Williams. Auto-tuning Performance on Multicore Computers. Univ. of California, 2008.

27 Performance Model (2/2)
SpMV with CSR: Flop/Byte = 0.08~0.16, Byte/Flop = 6.25~12.5
SpMV with BCSR: Flop/Byte = 0.18~0.21, Byte/Flop = 4.76~5.56

Machine   Node performance   BW (catalog)   BW (STREAM)   B/F
K         128 Gflops         64 GB/s        46.6 GB/s     0.36
FX        Gflops             85 GB/s        64 GB/s       0.27

To-peak estimated from the B/F of FISTR:
K:  SpMV with CSR 2.9~5.8 %, SpMV with BCSR 4.9~7.6 %
FX: SpMV with CSR 2.2~4.3 %, SpMV with BCSR 3.7~5.7 %

Measured performance by profiler on FX: %
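The to-peak estimates follow from a roofline-style bound: attainable performance is the smaller of the peak and memory bandwidth times Flop/Byte. The sketch below, with hypothetical function names, applies that bound to the K row of the table (128 Gflops peak, 46.6 GB/s STREAM bandwidth) and the CSR range of 0.08~0.16 Flop/Byte. Using the STREAM bandwidth rather than the catalog value is an assumption of this example.

```python
def roofline_gflops(peak_gflops, bw_gbytes_per_s, flop_per_byte):
    """Attainable Gflops: min(peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbytes_per_s * flop_per_byte)

def to_peak_percent(peak_gflops, bw_gbytes_per_s, flop_per_byte):
    """Attainable performance as a percentage of peak."""
    attainable = roofline_gflops(peak_gflops, bw_gbytes_per_s, flop_per_byte)
    return 100.0 * attainable / peak_gflops


# K row of the table: 128 Gflops per node, 46.6 GB/s measured by STREAM.
K_PEAK_GFLOPS = 128.0
K_STREAM_GBS = 46.6

csr_low = to_peak_percent(K_PEAK_GFLOPS, K_STREAM_GBS, 0.08)
csr_high = to_peak_percent(K_PEAK_GFLOPS, K_STREAM_GBS, 0.16)
print(f"CSR on K: {csr_low:.1f} ~ {csr_high:.1f} % of peak")
```

Under these assumptions the bound gives 2.9 ~ 5.8 % of peak for CSR on K, matching the figure listed on the slide. SpMV sits far to the left of the roofline ridge, so raising Flop/Byte through blocking helps more than adding arithmetic throughput.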

28 Scalability up to O(10^5) cores under memory wall
Hierarchical approaches:
- Memory: register, cache, main memory
- Granularity: threads among cores, message passing among nodes
- Algorithm (information transfer): near field, far field
  ex) iterative solvers with multi-grid preconditioning, FETI with balancing preconditioning, fast multipole method, homogenization method, zooming methods, Saint-Venant theorem
If algorithms are made multi-level, an iterative process is generally introduced. A magic number, which controls the convergence and/or the accuracy, would appear there. With non-linearity, different solutions may be found at each hierarchical level.

29 Summary
- A parallel structural analysis system for tackling an assembled structure as a whole is being developed.
- A hierarchical gridding strategy is used for increasing the problem size and enhancing the accuracy.
- For intra-node parallelism, multithreading that takes blocking/padding into account has been explored on multicore CPUs and GPUs (currently on a small number of nodes).
Future work
- Intra-node ILU preconditioning
- Hiding data translation time
- Performance model

30 FrontISTR Mesh size = 0.1 mm
