CO-DESIGN 2012, October 23-25, 2012, Peking University, Beijing
Hierarchically Parallel FE Software for Assembly Structures: FrontISTR - Parallel Performance Evaluation and Its Industrial Applications
Hiroshi Okuda, The University of Tokyo, okuda@k.u-tokyo.ac.jp
[Figure: project timeline 1998-2002, 2002-2004, 2005-2007]
Outline
- Background: towards peta/exascale computing; necessary advances in programming models and software
- FrontISTR as an HPC tool for industry
- Large-grain parallelism: assembly structures under hierarchical gridding
- Small-grain parallelism: blocking with padding for multicore CPUs (and GPUs)
- Summary
Necessary advances in programming models and software
- Fast SpMV for unstructured grids
- At most two nested levels in the programming model, i.e. message passing and loop decomposition (a minimal sketch follows after this list)
- Keeping consistency and stability: the B/F ratio required by the program consistent with the hardware
- Automatic generation of compiler directives that take data dependencies into account, based on trial runs
- Middleware (mid-level interface) bridging the application and BLAS-based numerical libraries
- Handling uncertainty or risk in the physical model
- Hierarchical consistency between the hardware configuration and the physical modeling, particularly in engineering fields
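To make the "at most two nested levels" concrete, here is a minimal hybrid MPI + OpenMP sketch: message passing between ranks, loop decomposition inside each rank. The vector size and the dot-product workload are made-up illustrative choices, not FrontISTR code.

```c
/* Two-level programming model: MPI between processes (large grain) and
 * OpenMP loop decomposition inside each process (small grain).
 * Illustrative only. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank owns a local slice of the global vectors */
    const int nlocal = 1000000;
    double *x = malloc(nlocal * sizeof *x);
    double *y = malloc(nlocal * sizeof *y);
    for (int i = 0; i < nlocal; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* small-grain level: OpenMP loop decomposition within the rank */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < nlocal; ++i)
        local += x[i] * y[i];

    /* large-grain level: message passing combines the per-rank results */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dot product = %g (%d ranks x %d threads)\n",
               global, size, omp_get_max_threads());

    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```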
FrontISTR built on HEC-MW
FrontISTR: nonlinear analysis functions
- Hyper-elasticity / thermal elastic-plastic / visco-elastic / creep, combined hardening rule
- Total/updated Lagrangian
- Finite slip contact, friction
HEC-MW: advanced features of parallel FEM
- Hierarchical mesh refinement
- Assembly structures
- Up to O(10^5) nodes
- Portability: from MPP to PC and CAE cloud
Nonlinear structural analysis functions have been deployed on a parallel FEM basis: HEC-MW.
Structural Analysis Functions Supported in FrontISTR
- Static: material: elastic / hyper-elasticity / thermal elastic-plastic / visco-elastic / creep, combined hardening rule; geometry: total Lagrangian / updated Lagrangian; boundary: augmented Lagrangian / Lagrange multiplier method, finite slip contact, friction
- Dynamic: linear/nonlinear, explicit/implicit
- Eigenvalue: Lanczos method (considering differential stiffness)
- Heat: steady / non-steady (implicit), nonlinear
FrontISTR
Thermal-Elastic-Plastic Analysis of Welding Residual Stress (joint research with IHI)
- Heat source moving along the welding line
- Residual stress induced by plastic deformation
[Figure: temperature distribution]
Cupping press simulation: elasto-plasticity and friction on contact faces
A punch is driven into a blank, which is placed between a die and a blank holder. The blank is formed into a cylindrical shape as the punch advances.
Friction of power transmission belt (joint research with Mitsuboshi Belting)
[Figure labels: only half of the belt width is modeled due to symmetry; V-belt; pulley surface (rigid body); load torque; driving (active) pulley; rotation; rubber; need for large-scale analysis; driven (passive) pulley; cord; canvas; axial load]
Structural integrity analysis of electrical devices
Data provided courtesy of Advanced Simulation Technology of Mechanics R&D, Inc. (先端力学シミュレーション研究所)
Contact force analysis of brake disc
Large-grain Parallelism: parallelization based on domain decomposition (a minimal halo-exchange sketch follows below)
[Figure: mesh partitioning / domain decomposition; each subdomain holds Local Data handled by its own FEM code and solver subsystem, with MPI communication between subdomains]
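A minimal sketch of the communication pattern behind the figure, assuming a simple 1-D chain of subdomains: each rank owns its local data and exchanges halo (ghost) values with its neighbours via MPI before applying its local operator. This is illustrative only and does not reproduce HEC-MW's actual interfaces.

```c
/* Domain-decomposition halo exchange: each MPI rank owns a 1-D slice of a
 * field plus one ghost cell per side, filled from the neighbouring ranks.
 * Illustrative only; HEC-MW's real halo tables cover general unstructured
 * meshes. */
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 8   /* interior cells per rank (made-up size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* u[0] and u[NLOCAL+1] are ghost cells filled by the neighbours */
    double u[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; ++i) u[i] = (double)rank;
    u[0] = u[NLOCAL + 1] = 0.0;

    /* halo exchange: send my first/last interior value, receive ghosts */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* with the ghosts in place, each rank can apply its local operator
     * (stencil, element-wise matrix-vector product, ...) independently */
    printf("rank %d: left ghost = %g, right ghost = %g\n",
           rank, u[0], u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}
```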
Data structure for assembly structures with parallel and hierarchical gridding
A) Partitioning (MPI ranks), B) Hierarchical refinement level, C) Assembly model
[Figure: Assembly_1 and Assembly_2 at Level_1 and Level_2; parts are coupled by MPC, refined hierarchically, and partitioned into MPI ranks]
Iterative solvers with MPC* preconditioning (*multi-point constraint; T: sparse MPC matrix)

Method       | CG itrs. | CPU (sec) | CPU / CG itr. (sec)
MPCCG        | 14,203   | 3,456     | 2.43x10^-1
Penalty + CG | 171,354  | 40,841    | 2.38x10^-1

[Figure: Mises stress contour]

Algorithm (MPCCG): solve T^T K T u~ = T^T f with u = T u~, by CG:
  r_0 = T^T f - T^T K T u~_0,  p_0 = r_0
  for k = 0, 1, ...
    alpha_k = (r_k, r_k) / (p_k, T^T K T p_k)
    u~_{k+1} = u~_k + alpha_k p_k
    r_{k+1} = r_k - alpha_k T^T K T p_k
    check convergence
    beta_k = (r_{k+1}, r_{k+1}) / (r_k, r_k)
    p_{k+1} = r_{k+1} + beta_k p_k
  end
  u = T u~
(A compact C sketch of this loop follows below.)
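A compact C sketch of the MPCCG loop above. To keep it self-contained it uses small dense stand-ins for K and T; the 4x4 stiffness, the tie constraint u3 = u2, and the load vector are made-up data. FrontISTR applies T^T K T through sparse kernels instead.

```c
/* MPC-CG idea: plain CG applied to the reduced operator T^T K T.
 * Dense matrices keep the example self-contained; data are illustrative. */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define N 4   /* full dofs */
#define M 3   /* reduced dofs after eliminating the tied dof */

/* y = A*x for a dense rows-by-cols matrix stored row-major */
static void matvec(int rows, int cols, const double *A, const double *x, double *y) {
    for (int i = 0; i < rows; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < cols; ++j) y[i] += A[i*cols + j] * x[j];
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* q = T^T K T p, applied as three matrix-vector products */
static void apply_reduced(const double *K, const double *T, const double *p, double *q) {
    double tp[N], ktp[N];
    matvec(N, M, T, p, tp);    /* tp  = T p   */
    matvec(N, N, K, tp, ktp);  /* ktp = K T p */
    for (int j = 0; j < M; ++j) {
        q[j] = 0.0;
        for (int i = 0; i < N; ++i) q[j] += T[i*M + j] * ktp[i];
    }
}

int main(void) {
    /* SPD stiffness-like matrix K (illustrative) */
    double K[N*N] = { 4,-1, 0, 0,  -1, 4,-1, 0,  0,-1, 4,-1,  0, 0,-1, 4 };
    /* MPC: u3 tied to u2, so u = T u~ with T mapping 3 dofs to 4 */
    double T[N*M] = { 1,0,0,  0,1,0,  0,0,1,  0,0,1 };
    double f[N]   = { 1, 0, 0, 1 };

    double b[M], u[M] = {0}, r[M], p[M], q[M];
    for (int j = 0; j < M; ++j) {                     /* b = T^T f */
        b[j] = 0.0;
        for (int i = 0; i < N; ++i) b[j] += T[i*M + j] * f[i];
    }
    memcpy(r, b, sizeof r);                           /* r0 = b (u0 = 0) */
    memcpy(p, r, sizeof p);
    double rr = dot(M, r, r);

    for (int k = 0; k < 100 && sqrt(rr) > 1e-12; ++k) {
        apply_reduced(K, T, p, q);
        double alpha = rr / dot(M, p, q);
        for (int i = 0; i < M; ++i) { u[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
        double rr_new = dot(M, r, r);
        double beta = rr_new / rr;
        rr = rr_new;
        for (int i = 0; i < M; ++i) p[i] = r[i] + beta*p[i];
    }

    double u_full[N];                                 /* recover u = T u~ */
    matvec(N, M, T, u, u_full);
    for (int i = 0; i < N; ++i) printf("u[%d] = %g\n", i, u_full[i]);
    return 0;
}
```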
FrontISTR assembled structure: piping composed of many parts
- 5 pipes and 32 bolts
- 10 mm second-order tetrahedral mesh: 3,093,453 elements, 5,433,029 nodes
- Number of MPCs: 70,166
A piping system composed of many parts is easily handled.
[Figure: Mises stress contour; fixed boundary]
Strong scaling with the refiner
- Static linear analysis of a machine part
- Second-order tetrahedral elements
- PCG (eps = 10^-6)
- FX10 at the University of Tokyo: SPARC64 IXfx (1.848 GHz), 1 CPU (16 cores) per node
Performance of a single node is a crucial factor
- Memory access in the innermost loop (experience on the Earth Simulator 2)
- Intra-node parallelism: remove dependencies by reordering; OpenMP and/or vectorization (a minimal coloring sketch follows below)
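One common way to "remove dependencies by ordering" is multicolor ordering: unknowns are grouped into colors so that no two unknowns of the same color are coupled, and each color is then swept in parallel. The slide does not say which reordering FrontISTR uses, so treat this as a generic sketch; the 6-node ring graph and the 4/-1 operator are made-up data.

```c
/* Greedy multicolor ordering followed by a color-by-color Gauss-Seidel
 * sweep: same-colored nodes never depend on each other, so the inner loop
 * can run under OpenMP without data races. Compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 6
#define MAXCOL 8

int main(void) {
    /* symmetric adjacency of a 6-node ring in CSR form (illustrative) */
    int rowptr[N+1] = {0, 2, 4, 6, 8, 10, 12};
    int colidx[12]  = {1,5, 0,2, 1,3, 2,4, 3,5, 4,0};

    /* greedy coloring: smallest color not used by an already-colored neighbor */
    int color[N], ncolor = 0;
    for (int i = 0; i < N; ++i) color[i] = -1;
    for (int i = 0; i < N; ++i) {
        int used[MAXCOL] = {0};
        for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
            if (color[colidx[k]] >= 0) used[color[colidx[k]]] = 1;
        int c = 0;
        while (used[c]) ++c;
        color[i] = c;
        if (c + 1 > ncolor) ncolor = c + 1;
    }

    /* one Gauss-Seidel sweep for (4*I - Adj) x = b, swept color by color */
    double x[N] = {0}, b[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;

    for (int c = 0; c < ncolor; ++c) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            if (color[i] != c) continue;
            double s = b[i];
            for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
                s += x[colidx[k]];   /* off-diagonal entries are -1 */
            x[i] = s / 4.0;          /* diagonal entry is 4         */
        }
    }

    for (int i = 0; i < N; ++i)
        printf("x[%d] = %g (color %d)\n", i, x[i], color[i]);
    return 0;
}
```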
Blocking with padding for multicore CPUs and GPUs
- SpMV in iterative solvers is crucial for unstructured grids
- For improving the B/F ratio: blocking, padding, ELL+CSR (a data-layout sketch follows below)
- Formats: blocked CSR (BCSR); HYB format = ELL + CSR
[Figure: breakdown of CPU time in CG: SpMV, AXPY & AYPX, DOT]
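A sketch of the HYB (ELL + CSR) layout named on the slide: the first few nonzeros of each row go into a padded, column-major ELL part with regular access, and the remaining entries of long rows spill into a small CSR part. The 4x4 matrix and the ELL width of 2 are made-up illustrative choices, not FrontISTR's actual data structures.

```c
/* Hybrid (HYB = ELL + CSR) sparse layout and its SpMV.
 * ELL part: column-major, padded with column index -1 and value 0. */
#include <stdio.h>

#define N 4
#define ELL_WIDTH 2

/* The layout below encodes the matrix
 *   [4 0 1 0]
 *   [0 3 0 0]
 *   [0 0 5 2]
 *   [2 7 0 6]   (row 3 has 3 nonzeros: 2 in ELL, 1 spilled to CSR) */
static const double ell_val[ELL_WIDTH * N] = {
    /* slot 0 */ 4.0, 3.0, 5.0, 2.0,
    /* slot 1 */ 1.0, 0.0, 2.0, 6.0
};
static const int ell_col[ELL_WIDTH * N] = {
    0, 1, 2, 0,
    2, -1, 3, 3
};

/* CSR spill part: entries of rows longer than ELL_WIDTH (here only row 3) */
static const int    csr_rowptr[N + 1] = {0, 0, 0, 0, 1};
static const int    csr_col[1]        = {1};
static const double csr_val[1]        = {7.0};

static void spmv_hyb(const double *x, double *y) {
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int k = 0; k < ELL_WIDTH; ++k) {             /* regular ELL part */
            int c = ell_col[k * N + i];
            if (c >= 0) s += ell_val[k * N + i] * x[c];
        }
        for (int k = csr_rowptr[i]; k < csr_rowptr[i + 1]; ++k)  /* spill */
            s += csr_val[k] * x[csr_col[k]];
        y[i] = s;
    }
}

int main(void) {
    double x[N] = {1, 1, 1, 1}, y[N];
    spmv_hyb(x, y);
    for (int i = 0; i < N; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```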
Flop/Byte
- SpMV with CSR: Flop/Byte = 1/{6*(1+m/nnz)} = 0.08~0.16
- SpMV with BCSR: Flop/Byte = 1/{4*(1+fill/nnz) + 2/c + 2*(m/nnz)*(2+1/r)} = 0.18~0.21
nnz: number of non-zero components; m: number of columns; r, c: block sizes; fill: number of zeros added for blocking
(A worked instance of the CSR bound follows below.)
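One byte accounting that reproduces the CSR figure, assuming 8-byte values, 4-byte indices, a square matrix, and the source vector x held in cache; this accounting is my reading, not something stated on the slide:

```latex
\frac{F}{B}
  = \frac{2\,nnz}{\underbrace{12\,nnz}_{\text{value}+\text{index}}
                   + \underbrace{12\,m}_{\text{rowptr}+\text{write of }y}}
  = \frac{1}{6\,(1 + m/nnz)}
```

With many nonzeros per row (m/nnz -> 0) this tends to 1/6 = 0.16; with only one nonzero per row (m/nnz = 1) it drops to 1/12 = 0.08, which brackets the quoted range 0.08~0.16.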
Acceleration of SpMV Computation
- Parallelization: rows are distributed among threads
- Load balancing: reallocate rows to balance the load (an OpenMP sketch follows below)
- Blocking: the matrix format is crucial
  - CSR (Compressed Sparse Row): Flop/Byte = 0.08~0.16
  - BCSR (Blocked CSR): Flop/Byte = 0.18~0.21
[Figure: CSR arrays (value, colindx, rowptr) and an example matrix, with rows assigned to threads 1-3 before and after load balancing]
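A sketch of the row distribution with load balancing described above, assuming the balancing criterion is an equal number of nonzeros per thread (the slide does not state the exact criterion). The 6x6 matrix is made-up data. Compile with -fopenmp.

```c
/* Thread-parallel CSR SpMV: rows are split so that each thread gets roughly
 * the same number of nonzeros rather than the same number of rows. */
#include <stdio.h>
#include <omp.h>

#define N   6
#define NNZ 15
#define MAXTHREADS 8

/* CSR storage: rows 0-2 carry 4 nonzeros each, rows 3-5 only 1 each */
static const int    rowptr[N + 1] = {0, 4, 8, 12, 13, 14, 15};
static const int    colidx[NNZ]   = {0,1,2,3, 0,1,2,4, 1,2,3,5, 3, 4, 5};
static const double val[NNZ]      = {4,1,1,1, 1,4,1,1, 1,4,1,1, 2, 2, 2};

/* rows row_start[t] .. row_start[t+1]-1 belong to thread t; split points are
 * chosen so each thread's nonzero count is close to NNZ/nthreads */
static void partition_by_nnz(int nthreads, int *row_start) {
    row_start[0] = 0;
    for (int t = 1; t < nthreads; ++t) {
        long target = (long)rowptr[N] * t / nthreads;
        int r = row_start[t - 1];
        while (r < N && rowptr[r] < target) ++r;
        row_start[t] = r;
    }
    row_start[nthreads] = N;
}

/* assumes the requested number of OpenMP threads is actually granted */
static void spmv_csr(const double *x, double *y, const int *row_start, int nthreads) {
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (int i = row_start[t]; i < row_start[t + 1]; ++i) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                s += val[k] * x[colidx[k]];
            y[i] = s;
        }
    }
}

int main(void) {
    int nthreads = 2;                 /* keep <= MAXTHREADS */
    int row_start[MAXTHREADS + 1];
    double x[N] = {1, 1, 1, 1, 1, 1}, y[N];

    /* nnz-based split gives 8 vs 7 nonzeros; an equal-rows split would give
     * 12 vs 3 for this matrix */
    partition_by_nnz(nthreads, row_start);
    spmv_csr(x, y, row_start, nthreads);

    for (int t = 0; t < nthreads; ++t)
        printf("thread %d: rows %d..%d, %d nonzeros\n", t,
               row_start[t], row_start[t + 1] - 1,
               rowptr[row_start[t + 1]] - rowptr[row_start[t]]);
    for (int i = 0; i < N; ++i)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```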
Performance Test (1/3): Load balancing on Nehalem (Core i7 975)
SpMV of unbalanced matrices from the University of Florida Sparse Matrix Collection [2], with and without load balancing.
[Figure: SpMV performance comparison]
[2] T. A. Davis. University of Florida Sparse Matrix Collection, 1997.
Performance Test (2/3): Parallelization and matrix format
Performance of SpMV on Nehalem (Core i7 975) with CSR and BCSR formats.
[Figure: performance (MFLOPS) per matrix]
Performance Test (3/3): Overall CG solver
CG solver performance relative to single-threaded CSR on Nehalem (Core i7 975).
[Figure: performance over CSR single thread, per matrix]
Performance Model (1/2)
The K computer's roofline model, based on Williams' model [1]. Sustained performance can be predicted from an application's Flop/Byte ratio.
Estimated sustained performance: 8 out of 128 Gflops, i.e. 6.25% of peak.
[1] S. Williams. Auto-tuning Performance on Multicore Computers. PhD thesis, University of California, Berkeley, 2008.
Performance Model (2/2)
- SpMV with CSR (Flop/Byte = 0.08~0.16): Byte/Flop = 6.25~12.5
- SpMV with BCSR (Flop/Byte = 0.18~0.21): Byte/Flop = 4.76~5.56

Machine | Node performance | BW (catalog) | BW (STREAM) | B/F
K       | 128 Gflops       | 64 GB/s      | 46.6 GB/s   | 0.36
FX10    | 236.5 Gflops     | 85 GB/s      | 64 GB/s     | 0.27

To-peak ratio from the B/F of FrontISTR:
- SpMV with CSR: 2.9~5.8%; SpMV with BCSR: 4.9~7.6%
Measured by profiler on FX10: 4.2%
- SpMV with CSR: 2.2~4.3%; SpMV with BCSR: 3.7~5.7%
(A worked roofline estimate follows below.)
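A worked instance of the roofline estimate behind these figures; associating each quoted range with a particular bandwidth number is my reading of the slide rather than something it states explicitly:

```latex
P_{\mathrm{attainable}} = \min\!\left(P_{\mathrm{peak}},\; BW \cdot \tfrac{F}{B}\right),
\qquad
\text{FX10, STREAM: } 64\ \tfrac{\mathrm{GB}}{\mathrm{s}} \times (0.08\!\sim\!0.16)\ \tfrac{\mathrm{flop}}{\mathrm{byte}}
 = 5.1\!\sim\!10.2\ \mathrm{Gflops} = 2.2\!\sim\!4.3\%\ \text{of } 236.5\ \mathrm{Gflops}
```

This reproduces the CSR range 2.2~4.3% quoted above; the same calculation with the catalog bandwidth of 85 GB/s gives 2.9~5.8%.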
Scalability up to O(10^5) cores under the memory wall
Hierarchical approaches:
- Memory: register, cache, main memory
- Granularity: threads among cores, message passing among nodes
- Algorithm (information transfer): near field vs. far field, e.g. iterative solvers with multigrid preconditioning, FETI with balancing preconditioning, the fast multipole method, the homogenization method, zooming methods, Saint-Venant's theorem
When algorithms are made multi-level, an iterative process is generally introduced, and a "magic number" that controls convergence and/or accuracy tends to appear. With nonlinearity, different solutions may be found at each hierarchical level.
Summary
- A parallel structural analysis system that tackles assembly structures as a whole is being developed.
- A hierarchical gridding strategy is used to increase the problem size and enhance the accuracy.
- For intra-node performance, multithreading with blocking/padding has been explored on multicore CPUs and GPUs (currently on a small number of nodes).
Future work
- Intra-node ILU preconditioning
- Hiding data transfer time
- Performance model
FrontISTR
[Figure: mesh, mesh size = 0.1 mm]