Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures

Similar documents

Automation of Aircraft Pre-Design with Chameleon

The DLR Satellite Station in Inuvik a Milestone for the Canadian-German Cooperation in Earth Observation

How To Visualize At The Dlr

Andreas Schreiber, Michael Meinel, Tobias Schlauch

Equipping Sparse Solvers for Exascale A Survey of the DFG Project ESSEX

RepoGuard Validation Framework for Version Control Systems

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

Advances and Work in Progress in Aerospace Predesign Data Exchange, Validation and Software Integration at the German Aerospace Center

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Large-Scale Reservoir Simulation and Big Data Visualization

Andreas Schreiber

Hierarchically Parallel FE Software for Assembly Structures : FrontISTR - Parallel Performance Evaluation and Its Industrial Applications

Wind-Tunnel Simulation using TAU on a PC-Cluster: Resources and Performance Stefan Melber-Wilkending / DLR Braunschweig

AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications

How To Make A Video From A Computer To A Phone Or Tablet Or Ipad Or Ipa Or Ipam Or Ipan Or Ipar Or Ipra Or Ipro Or Ipor Or Ipo Or Ipom Or Ipr Or Ipon

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Institute s brochure. Microstructure Analysis, Metallography and Mechanical Testing of Materials. Institute of Materials Research

Reduced echelon form: Add the following conditions to conditions 1, 2, and 3 above:

YALES2 porting on the Xeon- Phi Early results

AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS

Accelerating CFD using OpenFOAM with GPUs

HPC Deployment of OpenFOAM in an Industrial Setting

Modeling of Pre-Tactical Airline Decision Processes to enable Performance Based Airport Management

Mixing Python and Java How Python and Java can communicate and work together

HSL and its out-of-core solver

High-fidelity electromagnetic modeling of large multi-scale naval structures

HPC enabling of OpenFOAM R for CFD applications

Numerical simulation of maneuvering combat aircraft

Turbomachinery CFD on many-core platforms experiences and strategies

Domain Decomposition Methods. Partial Differential Equations

C3.8 CRM wing/body Case

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Press release from Siemens, Barco, Inform, ATRiCS, DLR and Stuttgart Airport

Fast Multipole Method for particle interactions: an open source parallel library component

Iterative Solvers for Linear Systems

Multigrid preconditioning for nonlinear (degenerate) parabolic equations with application to monument degradation

New Space Capabilities for Maritime Surveillance

Simulation of Fluid-Structure Interactions in Aeronautical Applications

Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks

PROJECT REPORT. Using Parallella for Numerical Linear Algebra. Marcus Naslund, Jacob Mattsson Project in Computational Science January 2015

A Massively Parallel Finite Element Framework with Application to Incompressible Flows

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Distributed block independent set algorithms and parallel multilevel ILU preconditioners

GPGPU accelerated Computational Fluid Dynamics

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

Matrix Multiplication

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

6. Cholesky factorization

OVERVIEW AND PERFORMANCE ANALYSIS OF THE EPETRA/OSKI MATRIX CLASS IN TRILINOS

Multicore Parallel Computing with OpenMP

Masterthesis. Analysis of Software-Engineering-Processes. von. Clemens Teichmann. - Matriculation number.: IN-Master -

LINEAR REDUCED ORDER MODELLING FOR GUST RESPONSE ANALYSIS USING THE DLR TAU CODE

GPGPU acceleration in OpenFOAM

Computational Modeling of Wind Turbines in OpenFOAM

Solving Systems of Linear Equations Using Matrices

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

Numerix CrossAsset XL and Windows HPC Server 2008 R2

COMPUTATIONAL ASPECTS AND RECENT IMPROVEMENTS IN THE OPEN-SOURCE MULTIBODY ANALYSIS SOFTWARE MBDYN

Notes on Cholesky Factorization

General Framework for an Iterative Solution of Ax b. Jacobi s Method

Fast Iterative Solvers for Integral Equation Based Techniques in Electromagnetics

High Performance Computing in CST STUDIO SUITE

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

MATLAB and Big Data: Illustrative Example

benchmarking Amazon EC2 for high-performance scientific computing

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

Mesh Generation and Load Balancing

2013 Code_Saturne User Group Meeting. EDF R&D Chatou, France. 9 th April 2013

Current Status and Challenges in CFD at the DLR Institute of Aerodynamics and Flow Technology

TIME-ACCURATE SIMULATION OF THE FLOW AROUND THE COMPLETE BO105 WIND TUNNEL MODEL

Recent Advances in HPC for Structural Mechanics Simulations

by the matrix A results in a vector which is a reflection of the given

Acceleration of a CFD Code with a GPU

OpenMP and Performance

The Basics of FEA Procedure

SOLVING LINEAR SYSTEMS

Calculation of Eigenmodes in Superconducting Cavities

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms

Parallel Computing for Option Pricing Based on the Backward Stochastic Differential Equation

1 Determinants and the Solvability of Linear Systems

Question 2: How do you solve a matrix equation using the matrix inverse?

Transcription:

Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures Dr.-Ing. Achim Basermann, Dr. Hans-Peter Kersken, Melven Zöllner** German Aerospace Center (DLR) Simulation- and Software Technology Distributed Systems and Component Software Porz-Wahnheide, Linder Höhe, D-51147 Cologne, Germany achim.basermann@dlr.de **also RWTH Aachen University Folie 1

Survey Internal and external flow computations at DLR Storage Schemes for sparse matrices The Distributed Schur Complement method (DSC) Experiments with TRACE and TAU matrices Conclusions Folie 2

Parallel Simulation System TRACE TRACE: Turbo-machinery Research Aerodynamic Computational Environment Developed by the Institute for Propulsion Technology of the German Aerospace Center (DLR-AT) Calculates internal turbo-machinery flows Finite volume method with block-structured grids The linearized TRACE modules require the parallel, iterative solution with preconditioning of large, sparse, non-symmetric real or complex systems of linear equations Folie 3

Preconditioners for TAU: Background TAU: developed for the aerodynamic design of aircrafts by the DLR Institute of Aerodynamics and Flow Technology Unstructured RANS solver (Reynolds-averaged Navier-Stokes), exploits finite volumes Requires the parallel, iterative solution with preconditioning of large, sparse, real, non-symmetric systems of linear equations Solvers used: geometric Multigrid, AMG preconditoned GMRes Here: experiments with DSC methods Folie 4

Storage Schemes for Sparse Matrices Compressed Row Storage (CSR) and Block Compressed Row Storage (BCSR) Non-zero values, row-wise: Matrix: Column indices, row-wise: Row pointers: TRACE and TAU apply BCSR with 5x5 blocks. Avantage: less indirect addressing Disadvantage: A few zeros are stored. Folie 5

DSC Method (1) Distributed matrix, 2 processors Folie 6

DSC Method (2) DSC Algorithm BiCGstab or GMRes iteration for the local interface rows (unknowns) Schematic view on each processor Folie 7

DSC Method (3) Preconditioning within the DSC algorithm Folie 8

DSC Method: Strong Scaling (CSR, real) (Dual-processor nodes; Quad-Core Intel Harpertown; 2.83 GHz) 1000 Time in seconds 100 10 1 8 16 32 64 128 Processor cores TAU matrix: n=541,980; nz=170,610,950; threshold = 10-3 ; rel. residual < 10-5 Folie 9

DSC Method: CSR versus BCSR Format (real) (2x Intel XEON E5520 with 4 cores each, 2.26 GHz ) Results on 8 cores 120 100 121 102 TAU matrix: n=541,980; nz=170,610,950; threshold = 10-3 ; rel. residual < 10-7 Time in seconds 80 60 40 20 0 29 20 dsc2011, bs=1 dsc2011, bs=5 total ilut construction solver iteration 19 9 Folie 10

DSC Method: Strong Scaling (CSR, complex) (Dual-processor nodes; Quad-Core Intel Harpertown; 2.83 GHz) 10000 10000 Time in seconds 1000 100 10 Time in seconds 1000 1 4 8 16 32 64 128 192 100 32 64 128 192 Processor cores Processor cores TRACE matrix THD (n=378,400; nz=45,456,500; threshold = 10-3 ; rel. residual < 10-5 ) TRACE matrix UHBR (n=4,497,520; nz=552,324,700; threshold = 10-3 ; rel. residual < 10-10 ) Folie 11

DSC Solver: CSR versus BCSR Format (2x Intel XEON E5520 with 4 cores each, 2.26 GHz ) lineartrace matrix (8 processes, dim = 56,240, nnz = 2.6 Mio) 5 Time in seconds 4,5 4 3,5 3 2,5 2 1,5 real real blocked (bs=5) complex complex blocked (bs=5) 1 0,5 0 # 76 # 40 # 32 # 29 total ilut construction solver iteration (#number of iterations) Folie 12

lineartrace Performance: Internal versus DSC Solver dsc2011 solver for lineartrace (8 processes, test case "THD stator": dim = 0.8 Mio, nnz = 90 Mio) 70 60 Time in seconds 50 40 30 20 # 140 iterations trace (setup matrix etc) solver iteration prec. preparation (ilut) 10 # 57 iterations 0 internal solver ( gmres100, ssor(0.7,3) ) dsc2011 ( fgmres40, dsc gmres 5, ilut(0.01,1) ) Folie 13

Conclusions Complex computations distinctly faster than real ones BCSR format application significantly outperforms CSR format application for TRACE and TAU problems. DSC method distinctly faster than block-local methods DSC method very suitable for TRACE and TAU problems Folie 14

Questions? Folie 15

DLR German Aerospace Center Research Institution Space Agency Project Management Agency Folie 16

Locations and employees Germany: 6,900 employees across 33 research institutes and facilities at 15 sites. Hamburg Bremen- Neustrelitz Trauen Berlin- Braunschweig Offices in Brussels, Paris and Washington. Koeln Bonn Goettingen Lampoldshausen Stuttgart Oberpfaffenhofen Weilheim Folie 17

DSC Method: Effect of the Interface Iteration (real) (2x Intel XEON E5520 with 4 cores each, 2.26 GHz ) 35 Results on 8 cores TAU matrix: n=541,980; nz=170,610,950; threshold = 10-3 ; rel. residual < 10-7 Solver iteration time in seconds 30 25 20 15 10 5 0 interf-bicgstab, bs=1 interf-gmres, bs=1 interf-bicgstab, bs=5 interf-gmres, bs=5 0 1 2 3 4 5 6 7 8 9 10 interface iterations Folie 18