AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS
M. Storti, L. Dalcín, R. Paz
Centro Internacional de Métodos Numéricos en Ingeniería - CIMEC
INTEC (CONICET-UNL), Santa Fe, Argentina
<mstorti@intec.unl.edu.ar>, http://www.cimec.org.ar
November 2, 2003
Centro Internacional de Métodos Computacionales en Ingeniería, CIMEC
Parallel solution of large linear systems

- Direct solvers are highly coupled and don't parallelize well (high communication times). They also require too much memory, and asymptotically they demand more CPU time than iterative methods, even in sequential mode. Their advantage is that the computational cost does not depend on the condition number.
- Iteration on the global system of equations is highly uncoupled (low communication times) but has low convergence rates, especially for ill-conditioned systems.
- Substructuring or Domain Decomposition methods are somewhat a mixture of both: the problem is solved on each subdomain with a direct method, and we iterate on the interface values in order to enforce the equilibrium equations there.
Global iteration methods

$$\begin{pmatrix} A_{11} & 0 & A_{1I} \\ 0 & A_{22} & A_{2I} \\ A_{I1} & A_{I2} & A_{II} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_I \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_I \end{pmatrix}$$

Computing matrix-vector products involves:
- computing the diagonal terms $A_{11} x_1$ and $A_{22} x_2$ in each processor, and
- communicating part of the non-diagonal contributions.
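The split between local and communicated work in this matvec can be sketched as below. This is a minimal serial NumPy illustration with made-up blocks (the names A11, A1I, ... follow the slides), not PETSc-FEM code; in a real parallel run the coupling products would involve MPI communication.

```python
import numpy as np

# Hypothetical 2-subdomain block system (made-up data for illustration).
rng = np.random.default_rng(0)
n1, n2, nI = 4, 4, 3
A11 = np.eye(n1) * 4 + rng.random((n1, n1))
A22 = np.eye(n2) * 4 + rng.random((n2, n2))
A1I = rng.random((n1, nI)); A2I = rng.random((n2, nI))
AI1 = rng.random((nI, n1)); AI2 = rng.random((nI, n2))
AII = np.eye(nI) * 4 + rng.random((nI, nI))

def global_matvec(x1, x2, xI):
    # Local part (no communication): diagonal-block products,
    # each computed in its own processor.
    y1 = A11 @ x1
    y2 = A22 @ x2
    yI = AII @ xI
    # Coupling part: these products mix subdomain and interface data
    # and would require communication between processors.
    y1 += A1I @ xI
    y2 += A2I @ xI
    yI += AI1 @ x1 + AI2 @ x2
    return y1, y2, yI
```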
Domain Decomposition Methods (DDM/SCMI)

Eliminate $x_1$ and $x_2$ to arrive at the condensed equation

$$\tilde A\, x_I = \left(A_{II} - A_{I1} A_{11}^{-1} A_{1I} - A_{I2} A_{22}^{-1} A_{2I}\right) x_I = b_I - A_{I1} A_{11}^{-1} b_1 - A_{I2} A_{22}^{-1} b_2 = \tilde b.$$

Evaluation of $y_I = \tilde A\, x_I$ implies:
- solving the local equilibrium equations in each processor for $x_j$: $A_{jj} x_j = -A_{jI} x_I$,
- summing the interface and local contributions: $y_I = A_{II} x_I + A_{I1} x_1 + A_{I2} x_2$.

This method will be referred to later as DDM/SCMI (Domain Decomposition Method / Schur complement matrix iteration).
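The two-step evaluation of $y_I = \tilde A\, x_I$ can be sketched as follows. This is a minimal serial NumPy example with made-up blocks, not the actual PETSc-FEM implementation; in practice each local solve runs in its own processor with a stored LU factorization of $A_{jj}$.

```python
import numpy as np

# Made-up 2-subdomain blocks (names follow the slides).
rng = np.random.default_rng(1)
n1, n2, nI = 5, 5, 3
A11 = np.eye(n1) * 5 + rng.random((n1, n1))
A22 = np.eye(n2) * 5 + rng.random((n2, n2))
A1I = rng.random((n1, nI)); AI1 = rng.random((nI, n1))
A2I = rng.random((n2, nI)); AI2 = rng.random((nI, n2))
AII = np.eye(nI) * 5 + rng.random((nI, nI))

def schur_matvec(xI):
    """Apply the Schur complement S = A_II - sum_j A_Ij A_jj^{-1} A_jI."""
    # 1) Local equilibrium solves in each subdomain: A_jj x_j = -A_jI x_I.
    x1 = np.linalg.solve(A11, -A1I @ xI)
    x2 = np.linalg.solve(A22, -A2I @ xI)
    # 2) Sum the interface and local contributions.
    return AII @ xI + AI1 @ x1 + AI2 @ x2
```

A Krylov solver (CG/GMRES) then only needs this matvec; the Schur complement is never assembled explicitly.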
DDM/SCMI vs. global iteration

Iteration on the Schur complement matrix (condensed system) is equivalent to iterating on the subspace where the local nodes (internal to each subdomain) are in equilibrium. The rate of convergence (per iteration) is improved because: (a) the condition number of the Schur complement matrix is lower, and (b) the dimension of the iteration space is lower (non-stationary methods like CG/GMRES tend to accelerate as the iteration proceeds). However, this is somewhat offset by the factorization and back-substitution times. As the iteration count is lower and the iteration space is significantly smaller, the RAM requirement for the Krylov space is significantly lower, but this is somewhat offset by the RAM needed for the factorized internal matrices LU(A_jj).
DDM/SCMI vs. global iteration (cont'd)

- Better conditioning of the Schur complement matrix prevents CG/GMRES convergence breakdown.
- As GMRES CPU time is quadratic in the iteration count (orthogonalization stage), and global iteration usually requires more iterations, Schur complement matrix iteration is comparatively better for lower tolerances.
- Global iteration is usually easier to load-balance, since it is easier to predict the computation time according to the number of d.o.f.'s in the subdomain.
Schur complement matrix iteration vs. global iteration (cont'd)

[Figure: schematic CPU time and RAM vs. log(tolerance) for global iteration, a direct solver, and DDM/SCMI.]
Subpartitioning

For large problems, the factorized part of the A_jj matrix may exceed the RAM of the processor, so we can further subpartition the domain in each processor into smaller sub-subdomains. In fact, the best efficiency is achieved with relatively small subdomains of 2,000-4,000 d.o.f.'s per subdomain.

[Figure: a domain partitioned among processors P0-P3, each further split into sub-subdomains.]
Example: Navier-Stokes cubic cavity, Re=1000

625,000-tetra mesh, rtol=10^-4, monolithic NS [Tezduyar et al., TET (SUPG+PSPG) algorithm, CMAME, vol. 95, pp. 221-242 (1992)].

[Figure: residual R vs. iteration count (0-350, log scale from 10^-7 to 10^-2); DDM/SCMI converges much faster than global iteration.]
Example: NS cubic cavity, Re=1000 (cont'd)

- Of course, each iteration of DDM/SCMI takes more time, but on average we have CPU TIME(DDM/SCMI) = 17.7 secs vs. CPU TIME(global iteration) = 63.8 secs.
- Residuals are computed on the interface for the Schur complement iteration, and globally for global iteration. But a vector iterate for the SCM is equivalent to a global vector with null residual on the internal nodes, so the two residuals are comparable.
- SCMI requires much less communication.
Example: NS cubic cavity, Re=1000 (cont'd)

For a lower tolerance (rtol=10^-8), global iteration breaks down, while DDM/SCMI continues to converge.

[Figure: residual R vs. iteration count (0-1000, log scale from 10^-11 to 10^-2); global iteration stagnates while DDM/SCMI keeps converging.]
Example: NS cubic cavity, Re=1000 (cont'd)

- All results were run on the Beowulf cluster Geronimo at CIMEC (see http://www.cimec.org.ar/geronimo): 8 x P4 1.7 GHz (512 MB RIMM (Rambus)).
- The NS code is PETSc-FEM (a GPL FEM multi-physics code developed at CIMEC; see/download at http://www.cimec.org.ar/petscfem).
- With DDM/SCMI we can run a 2.2 Mtetra mesh with monolithic NS (cubic cavity test) at 252 secs/iter (linear solution only, to rtol=10^-4). Global iteration crashes (out of memory!) at iteration 122, having converged by a factor of 5x10^-3.
- DDM/SCMI is especially suited to stabilized methods, since they are in general worse conditioned.
Fractional step solver

- The predictor and projection steps have κ = O(1) and are better solved by global iteration (even Richardson). The Poisson step can be solved with DDM/SCMI. [Some of the following observations apply in general to symmetric, positive definite operators.]
- Better conditioning (than the monolithic system) makes global iteration more appropriate.
- As Conjugate Gradient can be used in place of GMRES, CPU time vs. iteration count is no longer quadratic.
- The factorized internal subdomain matrices can be stored and not refactorized, so we can use larger internal subdomains (but this requires more RAM).
- We can solve a 4.2 Mtetra mesh in 58 secs (Poisson solution time only; rtol=10^-4, 130 iters).
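The "factorize once, back-substitute many times" idea for the constant Poisson matrix can be sketched with a Cholesky factorization (the Poisson operator is SPD). The block below is a toy serial example: a random SPD matrix stands in for an internal block A_jj, and a real code would use a sparse factorization and proper triangular solves.

```python
import numpy as np

# Toy stand-in for a subdomain's internal Poisson block A_jj (SPD).
rng = np.random.default_rng(2)
n = 50
B = rng.random((n, n))
Ajj = B @ B.T + n * np.eye(n)

# Factorize once (the expensive part); the factor is reused at every
# time step / CG iteration since the Poisson matrix does not change.
L = np.linalg.cholesky(Ajj)          # Ajj = L @ L.T

def local_solve(rhs):
    # Each application of the Schur-complement operator only needs two
    # cheap substitutions with the stored factor (sketched here with
    # dense solves; a real code uses triangular solvers).
    y = np.linalg.solve(L, rhs)      # forward substitution
    return np.linalg.solve(L.T, y)   # backward substitution
```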
Schur CM preconditioning - NN

In order to further improve SCMI, several preconditioners have been developed over the years. When solving Ãx = b, a preconditioner should solve Ãw = y for w in terms of y approximately. For the Laplace equation this problem is equivalent to applying a concentrated heat flux y (like a Dirac δ) at the interface and solving for the corresponding temperature field; its trace on the interface is w. Neumann-Neumann preconditioning amounts to splitting the heat flux one-half to each side of the interface (y/2 for each subdomain) and averaging the resulting traces: w = (w_left + w_right)/2.

[Figure: the flux y split as y/2 to the left and right subdomains; the preconditioned trace is w = (w_left + w_right)/2.]
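The splitting-and-averaging recipe can be sketched algebraically, with local Schur complements S1 and S2 standing in for the two subdomain (Neumann) subproblems. The data below is made up for illustration; in a real code each local solve is a Neumann problem on one subdomain.

```python
import numpy as np

# Made-up SPD stand-ins for the two local Schur complements.
rng = np.random.default_rng(3)
nI = 6
B1 = rng.random((nI, nI)); S1 = B1 @ B1.T + nI * np.eye(nI)
B2 = rng.random((nI, nI)); S2 = B2 @ B2.T + nI * np.eye(nI)

def nn_apply(y):
    """Neumann-Neumann preconditioner: w ~ S^{-1} y."""
    w_left  = np.linalg.solve(S1, y / 2)   # left subdomain gets half the flux
    w_right = np.linalg.solve(S2, y / 2)   # right subdomain gets the other half
    return (w_left + w_right) / 2          # average the two interface traces
```

In the symmetric case S1 = S2 this application is exact (it equals (S1 + S2)^{-1} y), which is why NN works best when equal splitting is right, as the next slide discusses.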
Flux splitting

Neumann-Neumann preconditioning works well whenever equal splitting is right: right subdomain equal to the left subdomain, symmetric operator... In that case the eigenfunctions of the Steklov operator (Schur complement matrix) are symmetric.
Flux splitting (advective case)

In the presence of advection, the splitting is biased towards the downwind subdomain. The eigenfunctions are no longer symmetric.
Interface strip preconditioner

For high-frequency modes (high eigenvalues) the eigenfunctions concentrate near the interface. ISP: solve the problem in a narrow strip around the interface.

[Figure: the flux y applied at the interface; solving only on the strip yields a trace w_strip close to the true trace w.]
[Figure: strongly advective case (Pe=50), log-log plot.]
Condition number, 50x50 mesh, Pe=50

u     cond(S)   cond(P_NN^-1 S)   cond(P_IS^-1 S)
0     41.00     1.00              4.92
1     40.86     1.02              4.88
10    23.81     3.44              2.92
50    5.62      64.20             1.08

Table 1: Condition number of the Steklov operator and several preconditioners for a mesh of 50 x 50 elements.
Condition number, 100x100 mesh, Pe=50

u     cond(S)   cond(P_NN^-1 S)   cond(P_IS^-1 S)
0     88.50     1.00              4.92
1     81.80     1.02              4.88
10    47.63     3.44              2.92
50    11.23     64.20             1.08

Table 2: Condition number of the Steklov operator and several preconditioners for a mesh of 100 x 100 elements.
[Figure: residual vs. iteration number (0-35, log scale from 10^-7 to 10^1) for no preconditioner, Neumann-Neumann, and ISP with n = 1 to 5 strip layers.]
ISP features

- Better than NN for the strongly advective case (NN can improve the splitting, but here the splitting should depend on the wavelength).
- No floating subdomains.
- Cost depends on the strip width. Good conditioning can be obtained with thin strips (comparable to NN).
- The solution of the strip problem should be done iteratively. Two possibilities: the strip matrix communicates between processors (weakly coupled) or doesn't communicate (diagonally dominant).

[Figure: the strip band b spanning neighboring subdomains.]
ISP implementation

- The ISP matrix is global (it has elements in all processors), so it can't be inverted with a direct solver.
- The iterative solution can't use a CG/GMRES solver (CG/GMRES can't be nested inside CG/GMRES).
- Use either a preconditioned Richardson iteration, or disconnect (i.e. modify) the ISP matrix in some way in order to get a more disconnected matrix and use a direct method.
- Only the preconditioned Richardson iteration is currently implemented in PETSc-FEM.
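A preconditioned Richardson iteration of the kind mentioned above can be sketched as follows, here with a simple diagonal (Jacobi) preconditioner and a random SPD, diagonally dominant matrix standing in for the strip matrix. The preconditioner choice is illustrative, not necessarily the one used in PETSc-FEM.

```python
import numpy as np

# Made-up SPD, diagonally dominant stand-in for the strip matrix M.
rng = np.random.default_rng(4)
n = 8
B = rng.random((n, n))
M = B @ B.T + 5 * n * np.eye(n)

def richardson(y, iters=100):
    """Solve M w = y by Richardson iteration with a diagonal preconditioner."""
    d = np.diag(M)                 # Jacobi preconditioner P = diag(M)
    w = np.zeros_like(y)
    for _ in range(iters):
        w += (y - M @ w) / d       # w <- w + P^{-1} (y - M w)
    return w
```

Being a stationary (fixed-point) iteration, Richardson can safely be nested inside an outer CG/GMRES loop, which is exactly why it is chosen here; it converges whenever the preconditioned iteration matrix I - P^{-1}M has spectral radius below 1 (guaranteed here by diagonal dominance).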
Cubic cavity

- Use an N^3 regular hexahedral mesh or a 5N^3 tetra mesh.
- Very dense connectivity (high LU bandwidth, strongly 3D).
- Covers many real flow features (separation, boundary layers...).
- For N=30 (135 Ktet), using 128 subdomains: 50% savings in CPU time, 25% savings in RAM.

Iteration counts for the N=30 cubic cavity (stationary):

nlay   iters
0      340
1      148
2      38
Conclusions

- Domain Decomposition + iteration on the Schur complement matrix is an efficient algorithm for solving large linear systems, in parallel or sequentially. It is especially suited for ill-conditioned problems.
- DDM/SCMI examples were run with PETSc-FEM (a parallel, general-purpose, multi-physics finite element code developed at CIMEC; it can be downloaded at http://www.cimec.org.ar/petscfem).
- The interface strip preconditioner improves convergence, especially for advection-dominated problems or floating subdomains.