Parallel implementation of the deflated GMRES in the PETSc package




Parallel implementation of the deflated GMRES in the PETSc package
D. NUENTSA WAKAM 1, joint work with Jocelyne ERHEL 1 and William D. GROPP 2
1 INRIA Rennes, 2 NCSA, University of Illinois
Third Workshop of the INRIA-Illinois Joint Laboratory for Petascale Computation, NCSA, November 22-24, 2010
Work done while visiting the Joint Lab @ NCSA (10/26-11/24)

Outline: 1. GMRES  2. Deflated GMRES  3. Practical implementation in PETSc  4. Early results

GMRES
General-purpose linear solver: the core computation of many scientific simulations is to solve Ax = b. We restrict ourselves to A ∈ R^(n x n) nonsymmetric and x, b ∈ R^n.
The GMRES method [Saad and Schultz, 1986]:
- From x_0, iteratively build a sequence of approximate solutions x_1, ..., x_k, ...
- At step k, x_k ∈ x_0 + K_k such that r_k = b − A x_k ⟂ A K_k, where K_k is a Krylov subspace.
- From the Arnoldi process: A V_k = V_{k+1} H_k and x_k = x_0 + V_k y_k.
- From the minimal residual condition: y_k solves min_y ‖β e_1 − H_k y‖.
- The solution is found in at most n iterations.
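For reference, the least-squares problem defining y_k follows directly from the Arnoldi relation; this is standard GMRES algebra, not specific to this talk, written here in LaTeX with v_1 = r_0/β:

    \min_{y}\;\| b - A(x_0 + V_k y) \|_2
      = \min_{y}\;\| r_0 - A V_k y \|_2
      = \min_{y}\;\| V_{k+1}(\beta e_1 - H_k y) \|_2
      = \min_{y}\;\| \beta e_1 - H_k y \|_2,
    \qquad \beta = \| r_0 \|_2,

where the last equality uses the orthonormal columns of V_{k+1}.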

Restarted GMRES(m)
Computational and memory requirements grow with k. Restarted GMRES(m): at step m, set x_0 ← x_m and restart; V_m and H_m are discarded. It is difficult to predict the convergence behaviour, and the iterative process can stall as well!
An example. Matrix: CASE_04; field: fluid dynamics; origin: 2D linear cascade turbine; source: FLUOREM Matrix Collection; n = 7,980, nz = 430,909.
[Plot: relative residual norm of GMRES(32) versus iterations on CASE_04.]

Deflated GMRES and eigenvalues
The rate of convergence depends on the spectral distribution of A. Removing or deflating eigenvalues (usually the smallest ones) improves the convergence rate. Deflation occurs automatically in the full version of GMRES when the subspace is large enough. In the restarted version, deflating an eigenvalue amounts to adding the corresponding eigenvector [Morgan, SIAM J. Sci. Comput., 2002].
In this work, deflation is done through a preconditioner built from the invariant subspace associated with the selected eigenvalues [Erhel et al., JCAM, 1996; Burrage et al., NLAA, 1998]:
U = [u_1 ... u_k],  T = U^T A U,  M^{-1} = I_n + U(|λ_n| T^{-1} − I_k) U^T,
where λ_n denotes the largest eigenvalue of A in modulus.
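A quick check of why this preconditioner deflates, under the idealized assumption that U is an exact orthonormal invariant basis (A U = U T, U^T U = I_k), which the cited papers only require approximately:

    M^{-1} U = U + U\bigl(|\lambda_n| T^{-1} - I_k\bigr) U^T U = |\lambda_n|\, U T^{-1},
    \qquad
    A M^{-1} U = |\lambda_n|\, (A U)\, T^{-1} = |\lambda_n|\, U T T^{-1} = |\lambda_n|\, U,

while M^{-1} w = w whenever U^T w = 0. The selected eigenvalues of A M^{-1} are thus moved to |λ_n|, and the rest of the spectrum is left essentially untouched.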

DGMRES with right preconditioning
Solve A M^{-1} (M x) = b for x.
Algorithm, input (m, k) [Erhel et al., JCAM, 1996]. Set x_0; M = I; U = [].
1. Arnoldi process on A M^{-1} to get V_m and H_m.
2. x_m ← x_0 + V_m y_m, with y_m the solution of min_y ‖β e_1 − H_m y‖.
3. If no convergence:
   3.1 Find k Schur vectors S_k of H_m.
   3.2 Orthogonalize X = V_m S_k against U.
   3.3 Augment U with X and compute T = U^T A U.
   3.4 Set M^{-1} ← I_n + U(|λ_n| T^{-1} − I_r) U^T, where r is the current number of columns of U.
   3.5 Set x_0 ← x_m and restart at step 1.
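As an illustration of step 3.4, here is a minimal dense sketch (in C) of applying this preconditioner to a vector. It is not the PETSc implementation, which works with distributed objects; it assumes U and T are stored column-major, uses LAPACK's dgesv_ for the small k x k solve, and the function and parameter names are only illustrative.

#include <stdlib.h>
#include <string.h>

/* LAPACK solver for the small k x k system T z = w (Fortran calling convention) */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

/* Compute Mv = M^{-1} v = v + U (|lambda_n| T^{-1} - I_k) U^T v,
 * with U of size n x k and T = U^T A U of size k x k, both column-major. */
void apply_deflation_prec(int n, int k, double abs_lambda_n,
                          const double *U, const double *T,
                          const double *v, double *Mv)
{
  double *w   = malloc(k * sizeof *w);        /* w = U^T v                  */
  double *z   = malloc(k * sizeof *z);        /* z = T^{-1} w               */
  double *Tlu = malloc(k * k * sizeof *Tlu);  /* copy: dgesv_ overwrites it */
  int   *ipiv = malloc(k * sizeof *ipiv), one = 1, info;

  for (int j = 0; j < k; ++j) {               /* w = U^T v                  */
    w[j] = 0.0;
    for (int i = 0; i < n; ++i) w[j] += U[i + (size_t)j * n] * v[i];
  }
  memcpy(z, w, k * sizeof *z);
  memcpy(Tlu, T, k * k * sizeof *Tlu);
  dgesv_(&k, &one, Tlu, &k, ipiv, z, &k, &info);   /* z = T^{-1} U^T v      */

  memcpy(Mv, v, n * sizeof *Mv);                   /* Mv = v                */
  for (int j = 0; j < k; ++j) {                    /* Mv += U (|l_n| z - w) */
    double c = abs_lambda_n * z[j] - w[j];
    for (int i = 0; i < n; ++i) Mv[i] += U[i + (size_t)j * n] * c;
  }
  free(w); free(z); free(Tlu); free(ipiv);
}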

Main observation
Deflation induces extra cost to compute and apply the preconditioner, so it should be applied only if necessary, for instance to prevent stagnation or an insufficient reduction of the residual norm: detect stagnation at the end of a cycle and switch to deflation.
How to detect stagnation in GMRES(m)?
Strictly: rate of convergence ‖r_m‖/‖r_0‖ > τ, with 0 < τ ≤ 1.
Practically, a more realistic experimental test is needed: GMRES(m) is declared to have stagnated if, at the average rate of progress over the last restart cycle of m steps, the residual norm tolerance cannot be met in some large multiple of the remaining number of steps allowed (the number of steps permitted is bounded by itmax) [Sosonkina et al., NLAA, 1998].
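To make the criterion concrete (my rephrasing, consistent with the test used on the next slide): if the average reduction factor per step over the last cycle is ρ = (‖r_m‖/‖r_0‖)^{1/m}, then meeting the tolerance at that rate would take about

    j \;=\; \frac{\log(\mathrm{tol}/\|r_m\|)}{\log \rho}
      \;=\; m\,\frac{\log(\mathrm{tol}/\|r_m\|)}{\log(\|r_m\|/\|r_0\|)}

further steps, and deflation is switched on when this estimate exceeds smv times the remaining iteration budget itmax − it.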

Practical implementation
Outline of adaptive DGMRES(m, k, kmax): choose x_0, tol, m, itmax, k. Set B ← A M_1^{-1}, where M_1 is any external preconditioner. r_0 = b − B x_0; M = I; it = 0; at most kmax eigenvalues are deflated in total.
while (‖r_0‖ > tol)
    Run a cycle of at most m iterations with a minimization step to get x_m and r_m; it += m.
    If (‖r_m‖ > tol and it < itmax) then
        test = m · log(tol/‖r_m‖) / log(‖r_m‖/‖r_0‖)
        If (test > smv · (itmax − it)) then
            Estimate the k smallest eigenvalues of B M^{-1} and compute the data for the deflation preconditioner M^{-1}.
        End If
    End If
    x_0 = x_m, r_0 = r_m
end while
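A compact C sketch of this switching test (illustrative only; the function and variable names are mine, not the PETSc code):

#include <math.h>

/* Decide whether to switch on deflation at the end of a restart cycle.
 * rnorm0 and rnorm are the residual norms at the start and end of the
 * cycle of m steps; it is the iteration count so far, itmax the budget,
 * smv the relaxation parameter of the adaptive test.                    */
int switch_to_deflation(int m, double rnorm0, double rnorm,
                        double tol, int it, int itmax, double smv)
{
  if (rnorm <= tol || it >= itmax) return 0;  /* converged, or budget exhausted     */
  if (rnorm >= rnorm0) return 1;              /* no progress over the cycle: deflate
                                                 (my choice for this corner case)   */
  double test = m * log(tol / rnorm) / log(rnorm / rnorm0);
  return test > smv * (double)(itmax - it);
}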

DGMRES in PETSc: a new KSP type, KSPDGMRES
[Diagram: PETSc levels of abstraction; application codes sit on top of PC (preconditioners), KSP (Krylov subspace methods), matrices, vectors and index sets, which rely on external packages (BLAS, LAPACK, ScaLAPACK, BLACS, MPI); DGMRES is added alongside the other KSP implementations.] MPI calls are encapsulated in PETSc routines.
Usage in PETSc:
- In your executable, register the new KSP dynamically: KSPRegisterDynamic(KSPDGMRES, "path-to-libdgmres.a", "KSPCreate_DGMRES", KSPCreate_DGMRES);
- Use DGMRES just as any other KSP: KSPSetType(ksp, KSPDGMRES); or -ksp_type dgmres
- -ksp_dgmres_eigen <k>: number of eigenvalues to deflate
- -ksp_dgmres_max_eigen <kmax>: maximum number of eigenvalues to deflate
- -ksp_dgmres_smv <smv>: relaxation parameter in the adaptive test
- -ksp_gmres_restart <m>, -ksp_max_it <itmax>, -ksp_rtol <rtol>, ...
- Any option from GMRES holds; any preconditioner is available through -pc_type <pc>.
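For completeness, a minimal sketch of selecting DGMRES in a current PETSc release, where KSPDGMRES ships with the library and no dynamic registration is needed; the 1-D Laplacian is only there to make the example self-contained and runnable, and is not one of the test cases of the talk:

static char help[] = "Solve a 1-D Laplacian with DGMRES.\n";
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PC       pc;
  PetscInt i, n = 100, Istart, Iend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, help));
  PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 1, NULL, &A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {                 /* tridiagonal Laplacian */
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPDGMRES));            /* deflated GMRES                */
  PetscCall(KSPGMRESSetRestart(ksp, 32));           /* restart parameter m           */
  PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, 500));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCBJACOBI));              /* block Jacobi, as in the tests */
  /* further run-time options, e.g. -ksp_dgmres_eigen 2 -ksp_dgmres_max_eigen 20     */
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(MatDestroy(&A));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(PetscFinalize());
  return 0;
}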

Environment for tests
Platform: cluster Parapluie at Grid'5000; HP ProLiant DL165 SMP nodes, two CPUs per node, 12 AMD cores per CPU at 1.7 GHz; InfiniBand DDR network.
Test examples:
- PETSc example Ex3: basic 2D Poisson problem on a regular mesh of the unit square; 2000x2000 mesh in these tests; N = 4,000,000, NZ = 35,936,032. Easy to solve; given here just for large-scale experimentation.
- FLUOREM example CASE_017: parametrized Navier-Stokes equations; finite-volume discretization of the steady equations; nonsymmetric Jacobian matrix; N = 381,689, NZ = 37,464,962. Strongly non-elliptic operator; needs a robust solver.

Policy of tests and options
- Run the iterative method until ‖b − Ax‖/‖b‖ < rtol or the maximum number of iterations max_it (iterations, not matrix-vector products) is reached.
- A domain decomposition preconditioner is used (Additive Schwarz or block Jacobi); data are assigned to processes in chunks of contiguous rows.
- For each test case and each number of subdomains D:
  GMRES(m): restarted GMRES with cycles of m iterations and right preconditioning;
  DGMRES(m, k, max): deflated GMRES, k eigenvalues extracted at each restart, max eigenvalues extracted in total;
  DGMRES_A(m, k, max): DGMRES with adaptive deflation.
Options for Ex3: rtol = 1e-12; restart m = 12; max_it = 500; right preconditioning; PC = block Jacobi; subdomains D in [96 192 384 512 1024]; subdomain solver ILU(0) (Hypre-Euclid); -ksp_dgmres_eigen 2.
Options for CASE_017: rtol = 1e-08; restart m = 64; max_it = 1500; right preconditioning; PC = ASM with overlap 1; subdomains D in [16 32 64]; subdomain solver LU (MUMPS); -ksp_dgmres_eigen 5.
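For illustration, the Ex3 configuration corresponds roughly to an option line such as the following (option spellings follow current PETSc and the executable name is only indicative; the exact 2010 invocation may have differed slightly):

mpiexec -n 512 ./ex3 -ksp_type dgmres -ksp_rtol 1e-12 -ksp_gmres_restart 12 \
        -ksp_max_it 500 -ksp_pc_side right -pc_type bjacobi \
        -sub_pc_type hypre -sub_pc_hypre_type euclid -ksp_dgmres_eigen 2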

Ex3: Poisson problem, 2000x2000 mesh, 36M entries. Numerical convergence, 192 to 1024 subdomains.
[Plots: relative residual norm versus iterations for GMRES(12), DGMRES(12;2) and adaptive DGMRES_A(12;2) with D = 192, 512 and 1024 subdomains, and the number of matrix-vector products for D = 96 to 1024.]

EX3: overhead due to the deflation.
[Plots: MPI messages and reductions (x 10^6) for GMRES(12), DGMRES(12,2) and DGMRES_A(12,2) on 384 to 1024 processors, and the share of messages issued during the deflation phase.]
Insignificant overhead of messages in the deflation phase. More investigation on real test cases for the flops and CPU time.

CASE_017: 3D CFD case, 37.5M entries. Numerical convergence, [8 32 64] subdomains.
[Plots: relative residual norm versus iterations for GMRES(64), DGMRES(64;5) and adaptive DGMRES_A(64;5) with D = 8, 32 and 64 subdomains, and the number of matrix-vector products on 8, 32 and 64 processors.]

CASE_017: 3D CFD case, 37.5M entries. Flops, MPI messages and CPU time; Linux cluster at Grid'5000.
[Plots: floating-point operations (x 10^11), MPI messages and reductions (x 10^6), total elapsed time and time spent in deflation for GMRES(64), DGMRES(64,5) and DGMRES_A(64,5) on 8, 32 and 64 processors.]

Real contribution of the adaptive deflation
The fact is: sometimes an increase of the restart parameter in GMRES(m) may prevent stagnation, but it is difficult to know by how much it should be increased, and more difficult to know whether the method will converge at all after the increase. Deflated GMRES with adaptive deflation provides more robustness with any restart parameter.
[Plots: CASE_017 with D = 64 subdomains; relative residual norm versus iterations for GMRES(48), GMRES(64), GMRES(96), and for DGMRES_A(48;5), DGMRES_A(64;5), DGMRES_A(96;5), DGMRES_A(128;5).]

Conclusion
The deflated GMRES should be tested on more real applications and platforms. Nevertheless, the code already exists as a KSP module... for real arithmetic.

THANKS... QUESTIONS???