CO-DESIGN 2012, October 23-25, 2012, Peking University, Beijing
Hierarchically Parallel FE Software for Assembly Structures: FrontISTR - Parallel Performance Evaluation and Its Industrial Applications
Hiroshi Okuda, The University of Tokyo, okuda@k.u-tokyo.ac.jp
[Figure: project timeline 1998-2002, 2002-2004, 2005-2007]
Outline
- Background: towards peta/exascale computing; necessary advances in programming models and software
- FrontISTR as an HPC tool for industry
- Large-grain parallelism: assembly structures under hierarchical gridding
- Small-grain parallelism: blocking with padding for multicore CPUs (and GPUs)
- Summary
Necessary advances in programming models and software
- Fast SpMV for unstructured grids
- At most two nested levels in the programming model, i.e. message passing and loop decomposition (a minimal sketch follows after this list)
- Keeping consistency and stability: the B/F ratio required by the program consistent with the hardware
- Automatic generation of compiler directives that take data dependencies into account, based on trial runs
- Middleware (mid-level interface) bridging the application and BLAS-based numerical libraries
- Handling uncertainty or risk in the physical model
- Hierarchical consistency between the hardware configuration and the physical modeling, particularly in engineering fields
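To make the "at most two nested levels" concrete, here is a minimal hybrid MPI + OpenMP sketch: message passing between ranks, loop decomposition inside each rank. The vector size and the dot-product workload are made-up illustrative choices, not FrontISTR code.

```c
/* Two-level programming model: MPI between processes (large grain) and
 * OpenMP loop decomposition inside each process (small grain).
 * Illustrative only. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank owns a local slice of the global vectors */
    const int nlocal = 1000000;
    double *x = malloc(nlocal * sizeof *x);
    double *y = malloc(nlocal * sizeof *y);
    for (int i = 0; i < nlocal; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* small-grain level: OpenMP loop decomposition within the rank */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < nlocal; ++i)
        local += x[i] * y[i];

    /* large-grain level: message passing combines the per-rank results */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dot product = %g (%d ranks x %d threads)\n",
               global, size, omp_get_max_threads());

    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```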
FrontISTR built on HEC-MW
FrontISTR: nonlinear analysis functions
- Hyper-elasticity / thermal elastic-plastic / visco-elastic / creep, combined hardening rule
- Total/updated Lagrangian
- Finite slip contact, friction
HEC-MW: advanced features of parallel FEM
- Hierarchical mesh refinement
- Assembly structures
- Up to O(10^5) nodes
- Portability: from MPP to PC and CAE cloud
Nonlinear structural analysis functions have been deployed on a parallel FEM basis: HEC-MW.
Structural Analysis Functions Supported in FrontISTR
- Static: material: elastic / hyper-elasticity / thermal elastic-plastic / visco-elastic / creep, combined hardening rule; geometry: total Lagrangian / updated Lagrangian; boundary: augmented Lagrangian / Lagrange multiplier method, finite slip contact, friction
- Dynamic: linear/nonlinear, explicit/implicit
- Eigenvalue: Lanczos method (considering differential stiffness)
- Heat: steady / non-steady (implicit), nonlinear
FrontISTR
Thermal-Elastic-Plastic Analysis of Welding Residual Stress (joint research with IHI)
- Heat source moving along the welding line
- Residual stress induced by plastic deformation
[Figure: temperature distribution]
Cupping press simulation: elasto-plasticity and friction on contact faces
A punch is driven into a blank, which is placed between a die and a blank holder. The blank is formed into a cylindrical shape as the punch advances.
Friction of power transmission belt (joint research with Mitsuboshi Belting)
[Figure labels: only half of the belt width is modeled due to symmetry; V-belt; pulley surface (rigid body); load torque; driving (active) pulley; rotation; rubber; need for large-scale analysis; driven (passive) pulley; cord; canvas; axial load]
Structural integrity analysis of electrical devices
Data provided courtesy of Advanced Simulation Technology of Mechanics R&D, Inc. (先端力学シミュレーション研究所)
Contact force analysis of brake disc
Large-grain Parallelism: parallelization based on domain decomposition (a minimal halo-exchange sketch follows below)
[Figure: mesh partitioning / domain decomposition; each subdomain holds Local Data handled by its own FEM code and solver subsystem, with MPI communication between subdomains]
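A minimal sketch of the communication pattern behind the figure, assuming a simple 1-D chain of subdomains: each rank owns its local data and exchanges halo (ghost) values with its neighbours via MPI before applying its local operator. This is illustrative only and does not reproduce HEC-MW's actual interfaces.

```c
/* Domain-decomposition halo exchange: each MPI rank owns a 1-D slice of a
 * field plus one ghost cell per side, filled from the neighbouring ranks.
 * Illustrative only; HEC-MW's real halo tables cover general unstructured
 * meshes. */
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 8   /* interior cells per rank (made-up size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* u[0] and u[NLOCAL+1] are ghost cells filled by the neighbours */
    double u[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; ++i) u[i] = (double)rank;
    u[0] = u[NLOCAL + 1] = 0.0;

    /* halo exchange: send my first/last interior value, receive ghosts */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* with the ghosts in place, each rank can apply its local operator
     * (stencil, element-wise matrix-vector product, ...) independently */
    printf("rank %d: left ghost = %g, right ghost = %g\n",
           rank, u[0], u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}
```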
Data structure for assembly structures with parallel and hierarchical gridding
A) Partitioning (MPI ranks), B) Hierarchical refinement level, C) Assembly model
[Figure: Assembly_1 and Assembly_2 at Level_1 and Level_2; parts are coupled by MPC, refined hierarchically, and partitioned into MPI ranks]
Iterative solvers with MPC* preconditioning (*multi-point constraint; T: sparse MPC matrix)

Method       | CG itrs. | CPU (sec) | CPU / CG itr. (sec)
MPCCG        | 14,203   | 3,456     | 2.43x10^-1
Penalty + CG | 171,354  | 40,841    | 2.38x10^-1

[Figure: Mises stress contour]

Algorithm (MPCCG): solve T^T K T u~ = T^T f with u = T u~, by CG:
  r_0 = T^T f - T^T K T u~_0,  p_0 = r_0
  for k = 0, 1, ...
    alpha_k = (r_k, r_k) / (p_k, T^T K T p_k)
    u~_{k+1} = u~_k + alpha_k p_k
    r_{k+1} = r_k - alpha_k T^T K T p_k
    check convergence
    beta_k = (r_{k+1}, r_{k+1}) / (r_k, r_k)
    p_{k+1} = r_{k+1} + beta_k p_k
  end
  u = T u~
(A compact C sketch of this loop follows below.)
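A compact C sketch of the MPCCG loop above. To keep it self-contained it uses small dense stand-ins for K and T; the 4x4 stiffness, the tie constraint u3 = u2, and the load vector are made-up data. FrontISTR applies T^T K T through sparse kernels instead.

```c
/* MPC-CG idea: plain CG applied to the reduced operator T^T K T.
 * Dense matrices keep the example self-contained; data are illustrative. */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define N 4   /* full dofs */
#define M 3   /* reduced dofs after eliminating the tied dof */

/* y = A*x for a dense rows-by-cols matrix stored row-major */
static void matvec(int rows, int cols, const double *A, const double *x, double *y) {
    for (int i = 0; i < rows; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < cols; ++j) y[i] += A[i*cols + j] * x[j];
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* q = T^T K T p, applied as three matrix-vector products */
static void apply_reduced(const double *K, const double *T, const double *p, double *q) {
    double tp[N], ktp[N];
    matvec(N, M, T, p, tp);    /* tp  = T p   */
    matvec(N, N, K, tp, ktp);  /* ktp = K T p */
    for (int j = 0; j < M; ++j) {
        q[j] = 0.0;
        for (int i = 0; i < N; ++i) q[j] += T[i*M + j] * ktp[i];
    }
}

int main(void) {
    /* SPD stiffness-like matrix K (illustrative) */
    double K[N*N] = { 4,-1, 0, 0,  -1, 4,-1, 0,  0,-1, 4,-1,  0, 0,-1, 4 };
    /* MPC: u3 tied to u2, so u = T u~ with T mapping 3 dofs to 4 */
    double T[N*M] = { 1,0,0,  0,1,0,  0,0,1,  0,0,1 };
    double f[N]   = { 1, 0, 0, 1 };

    double b[M], u[M] = {0}, r[M], p[M], q[M];
    for (int j = 0; j < M; ++j) {                     /* b = T^T f */
        b[j] = 0.0;
        for (int i = 0; i < N; ++i) b[j] += T[i*M + j] * f[i];
    }
    memcpy(r, b, sizeof r);                           /* r0 = b (u0 = 0) */
    memcpy(p, r, sizeof p);
    double rr = dot(M, r, r);

    for (int k = 0; k < 100 && sqrt(rr) > 1e-12; ++k) {
        apply_reduced(K, T, p, q);
        double alpha = rr / dot(M, p, q);
        for (int i = 0; i < M; ++i) { u[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
        double rr_new = dot(M, r, r);
        double beta = rr_new / rr;
        rr = rr_new;
        for (int i = 0; i < M; ++i) p[i] = r[i] + beta*p[i];
    }

    double u_full[N];                                 /* recover u = T u~ */
    matvec(N, M, T, u, u_full);
    for (int i = 0; i < N; ++i) printf("u[%d] = %g\n", i, u_full[i]);
    return 0;
}
```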
FrontISTR assembled structure: piping composed of many parts
- 5 pipes and 32 bolts
- 10 mm second-order tetrahedral mesh: 3,093,453 elements, 5,433,029 nodes
- Number of MPCs: 70,166
A piping system composed of many parts is easily handled.
[Figure: Mises stress contour; fixed boundary]
Strong scaling with the refiner
- Static linear analysis of a machine part
- Second-order tetrahedral elements
- PCG (eps = 10^-6)
- FX10 at the University of Tokyo: SPARC64 IXfx (1.848 GHz), 1 CPU (16 cores) per node
Performance of a single node is a crucial factor
- Memory access in the innermost loop (experience on the Earth Simulator 2)
- Intra-node parallelism: remove dependencies by reordering; OpenMP and/or vectorization (a minimal coloring sketch follows below)
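One common way to "remove dependencies by ordering" is multicolor ordering: unknowns are grouped into colors so that no two unknowns of the same color are coupled, and each color is then swept in parallel. The slide does not say which reordering FrontISTR uses, so treat this as a generic sketch; the 6-node ring graph and the 4/-1 operator are made-up data.

```c
/* Greedy multicolor ordering followed by a color-by-color Gauss-Seidel
 * sweep: same-colored nodes never depend on each other, so the inner loop
 * can run under OpenMP without data races. Compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 6
#define MAXCOL 8

int main(void) {
    /* symmetric adjacency of a 6-node ring in CSR form (illustrative) */
    int rowptr[N+1] = {0, 2, 4, 6, 8, 10, 12};
    int colidx[12]  = {1,5, 0,2, 1,3, 2,4, 3,5, 4,0};

    /* greedy coloring: smallest color not used by an already-colored neighbor */
    int color[N], ncolor = 0;
    for (int i = 0; i < N; ++i) color[i] = -1;
    for (int i = 0; i < N; ++i) {
        int used[MAXCOL] = {0};
        for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
            if (color[colidx[k]] >= 0) used[color[colidx[k]]] = 1;
        int c = 0;
        while (used[c]) ++c;
        color[i] = c;
        if (c + 1 > ncolor) ncolor = c + 1;
    }

    /* one Gauss-Seidel sweep for (4*I - Adj) x = b, swept color by color */
    double x[N] = {0}, b[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;

    for (int c = 0; c < ncolor; ++c) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            if (color[i] != c) continue;
            double s = b[i];
            for (int k = rowptr[i]; k < rowptr[i+1]; ++k)
                s += x[colidx[k]];   /* off-diagonal entries are -1 */
            x[i] = s / 4.0;          /* diagonal entry is 4         */
        }
    }

    for (int i = 0; i < N; ++i)
        printf("x[%d] = %g (color %d)\n", i, x[i], color[i]);
    return 0;
}
```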
Blocking with padding for multicore CPUs and GPUs
- SpMV in iterative solvers is crucial for unstructured grids
- For improving the B/F ratio: blocking, padding, ELL+CSR (a data-layout sketch follows below)
- Formats: blocked CSR (BCSR); HYB format = ELL + CSR
[Figure: breakdown of CPU time in CG: SpMV, AXPY & AYPX, DOT]
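A sketch of the HYB (ELL + CSR) layout named on the slide: the first few nonzeros of each row go into a padded, column-major ELL part with regular access, and the remaining entries of long rows spill into a small CSR part. The 4x4 matrix and the ELL width of 2 are made-up illustrative choices, not FrontISTR's actual data structures.

```c
/* Hybrid (HYB = ELL + CSR) sparse layout and its SpMV.
 * ELL part: column-major, padded with column index -1 and value 0. */
#include <stdio.h>

#define N 4
#define ELL_WIDTH 2

/* The layout below encodes the matrix
 *   [4 0 1 0]
 *   [0 3 0 0]
 *   [0 0 5 2]
 *   [2 7 0 6]   (row 3 has 3 nonzeros: 2 in ELL, 1 spilled to CSR) */
static const double ell_val[ELL_WIDTH * N] = {
    /* slot 0 */ 4.0, 3.0, 5.0, 2.0,
    /* slot 1 */ 1.0, 0.0, 2.0, 6.0
};
static const int ell_col[ELL_WIDTH * N] = {
    0, 1, 2, 0,
    2, -1, 3, 3
};

/* CSR spill part: entries of rows longer than ELL_WIDTH (here only row 3) */
static const int    csr_rowptr[N + 1] = {0, 0, 0, 0, 1};
static const int    csr_col[1]        = {1};
static const double csr_val[1]        = {7.0};

static void spmv_hyb(const double *x, double *y) {
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int k = 0; k < ELL_WIDTH; ++k) {             /* regular ELL part */
            int c = ell_col[k * N + i];
            if (c >= 0) s += ell_val[k * N + i] * x[c];
        }
        for (int k = csr_rowptr[i]; k < csr_rowptr[i + 1]; ++k)  /* spill */
            s += csr_val[k] * x[csr_col[k]];
        y[i] = s;
    }
}

int main(void) {
    double x[N] = {1, 1, 1, 1}, y[N];
    spmv_hyb(x, y);
    for (int i = 0; i < N; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```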
Flop/Byte
- SpMV with CSR: Flop/Byte = 1/{6*(1+m/nnz)} = 0.08~0.16
- SpMV with BCSR: Flop/Byte = 1/{4*(1+fill/nnz) + 2/c + 2*(m/nnz)*(2+1/r)} = 0.18~0.21
nnz: number of non-zero components; m: number of columns; r, c: block sizes; fill: number of zeros added for blocking
(A worked instance of the CSR bound follows below.)
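One byte accounting that reproduces the CSR figure, assuming 8-byte values, 4-byte indices, a square matrix, and the source vector x held in cache; this accounting is my reading, not something stated on the slide:

```latex
\frac{F}{B}
  = \frac{2\,nnz}{\underbrace{12\,nnz}_{\text{value}+\text{index}}
                   + \underbrace{12\,m}_{\text{rowptr}+\text{write of }y}}
  = \frac{1}{6\,(1 + m/nnz)}
```

With many nonzeros per row (m/nnz -> 0) this tends to 1/6 = 0.16; with only one nonzero per row (m/nnz = 1) it drops to 1/12 = 0.08, which brackets the quoted range 0.08~0.16.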
Acceleration of SpMV Computation
- Parallelization: rows are distributed among threads
- Load balancing: reallocate rows to balance the load (an OpenMP sketch follows below)
- Blocking: the matrix format is crucial
  - CSR (Compressed Sparse Row): Flop/Byte = 0.08~0.16
  - BCSR (Blocked CSR): Flop/Byte = 0.18~0.21
[Figure: CSR arrays (value, colindx, rowptr) and an example matrix, with rows assigned to threads 1-3 before and after load balancing]
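A sketch of the row distribution with load balancing described above, assuming the balancing criterion is an equal number of nonzeros per thread (the slide does not state the exact criterion). The 6x6 matrix is made-up data. Compile with -fopenmp.

```c
/* Thread-parallel CSR SpMV: rows are split so that each thread gets roughly
 * the same number of nonzeros rather than the same number of rows. */
#include <stdio.h>
#include <omp.h>

#define N   6
#define NNZ 15
#define MAXTHREADS 8

/* CSR storage: rows 0-2 carry 4 nonzeros each, rows 3-5 only 1 each */
static const int    rowptr[N + 1] = {0, 4, 8, 12, 13, 14, 15};
static const int    colidx[NNZ]   = {0,1,2,3, 0,1,2,4, 1,2,3,5, 3, 4, 5};
static const double val[NNZ]      = {4,1,1,1, 1,4,1,1, 1,4,1,1, 2, 2, 2};

/* rows row_start[t] .. row_start[t+1]-1 belong to thread t; split points are
 * chosen so each thread's nonzero count is close to NNZ/nthreads */
static void partition_by_nnz(int nthreads, int *row_start) {
    row_start[0] = 0;
    for (int t = 1; t < nthreads; ++t) {
        long target = (long)rowptr[N] * t / nthreads;
        int r = row_start[t - 1];
        while (r < N && rowptr[r] < target) ++r;
        row_start[t] = r;
    }
    row_start[nthreads] = N;
}

/* assumes the requested number of OpenMP threads is actually granted */
static void spmv_csr(const double *x, double *y, const int *row_start, int nthreads) {
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (int i = row_start[t]; i < row_start[t + 1]; ++i) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                s += val[k] * x[colidx[k]];
            y[i] = s;
        }
    }
}

int main(void) {
    int nthreads = 2;                 /* keep <= MAXTHREADS */
    int row_start[MAXTHREADS + 1];
    double x[N] = {1, 1, 1, 1, 1, 1}, y[N];

    /* nnz-based split gives 8 vs 7 nonzeros; an equal-rows split would give
     * 12 vs 3 for this matrix */
    partition_by_nnz(nthreads, row_start);
    spmv_csr(x, y, row_start, nthreads);

    for (int t = 0; t < nthreads; ++t)
        printf("thread %d: rows %d..%d, %d nonzeros\n", t,
               row_start[t], row_start[t + 1] - 1,
               rowptr[row_start[t + 1]] - rowptr[row_start[t]]);
    for (int i = 0; i < N; ++i)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```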
Performance Test (1/3): Load balancing on Nehalem (Core i7 975)
SpMV of unbalanced matrices from the University of Florida Sparse Matrix Collection [2], with and without load balancing.
[Figure: SpMV performance comparison]
[2] T. A. Davis. University of Florida Sparse Matrix Collection, 1997.
Performance Test (2/3): Parallelization and matrix format
Performance of SpMV on Nehalem (Core i7 975) with CSR and BCSR formats.
[Figure: performance (MFLOPS) per matrix]
Performance Test (3/3): Overall CG solver
CG solver performance relative to single-threaded CSR on Nehalem (Core i7 975).
[Figure: performance over CSR single thread, per matrix]
Performance Model (1/2)
The K computer's roofline model, based on Williams' model [1]. Sustained performance can be predicted from an application's Flop/Byte ratio.
Estimated sustained performance: 8 out of 128 Gflops, i.e. 6.25% of peak.
[1] S. Williams. Auto-tuning Performance on Multicore Computers. PhD thesis, University of California, Berkeley, 2008.
Performance Model (2/2)
- SpMV with CSR (Flop/Byte = 0.08~0.16): Byte/Flop = 6.25~12.5
- SpMV with BCSR (Flop/Byte = 0.18~0.21): Byte/Flop = 4.76~5.56

Machine | Node performance | BW (catalog) | BW (STREAM) | B/F
K       | 128 Gflops       | 64 GB/s      | 46.6 GB/s   | 0.36
FX10    | 236.5 Gflops     | 85 GB/s      | 64 GB/s     | 0.27

To-peak ratio from the B/F of FrontISTR:
- SpMV with CSR: 2.9~5.8%; SpMV with BCSR: 4.9~7.6%
Measured by profiler on FX10: 4.2%
- SpMV with CSR: 2.2~4.3%; SpMV with BCSR: 3.7~5.7%
(A worked roofline estimate follows below.)
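A worked instance of the roofline estimate behind these figures; associating each quoted range with a particular bandwidth number is my reading of the slide rather than something it states explicitly:

```latex
P_{\mathrm{attainable}} = \min\!\left(P_{\mathrm{peak}},\; BW \cdot \tfrac{F}{B}\right),
\qquad
\text{FX10, STREAM: } 64\ \tfrac{\mathrm{GB}}{\mathrm{s}} \times (0.08\!\sim\!0.16)\ \tfrac{\mathrm{flop}}{\mathrm{byte}}
 = 5.1\!\sim\!10.2\ \mathrm{Gflops} = 2.2\!\sim\!4.3\%\ \text{of } 236.5\ \mathrm{Gflops}
```

This reproduces the CSR range 2.2~4.3% quoted above; the same calculation with the catalog bandwidth of 85 GB/s gives 2.9~5.8%.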
Scalability up to O(10^5) cores under the memory wall
Hierarchical approaches:
- Memory: register, cache, main memory
- Granularity: threads among cores, message passing among nodes
- Algorithm (information transfer): near field vs. far field, e.g. iterative solvers with multigrid preconditioning, FETI with balancing preconditioning, the fast multipole method, the homogenization method, zooming methods, Saint-Venant's theorem
When algorithms are made multi-level, an iterative process is generally introduced, and a "magic number" that controls convergence and/or accuracy tends to appear. With nonlinearity, different solutions may be found at each hierarchical level.
Summary
- A parallel structural analysis system that tackles assembly structures as a whole is being developed.
- A hierarchical gridding strategy is used to increase the problem size and enhance the accuracy.
- For intra-node performance, multithreading with blocking/padding has been explored on multicore CPUs and GPUs (currently on a small number of nodes).
Future work
- Intra-node ILU preconditioning
- Hiding data transfer time
- Performance model
FrontISTR
[Figure: mesh, mesh size = 0.1 mm]