Parallel implementation of the deflated GMRES in the PETSc package
WAKAM (1), joint work with Jocelyne ERHEL (1) and William D. GROPP (2)
(1) INRIA Rennes, (2) NCSA-University of Illinois
Third Workshop of the INRIA-Illinois Joint Laboratory for Petascale Computation, NCSA, November 22-24, 2010
Work done while visiting the Jointlab @NCSA (10/26-11/24)
Outline
1. GMRES
2. Deflated GMRES
3. Practical implementation
4. Early results
GMRES

General purpose linear system:
- The core computation of many scientific simulations is to solve $Ax = b$.
- We restrict to $A \in \mathbb{R}^{n \times n}$ nonsymmetric, $x, b \in \mathbb{R}^n$.

The GMRES method [Saad and Schultz, 86]:
- From $x_0$, iteratively build a sequence of approximate solutions $x_1, \dots, x_k, \dots$
- At step $k$, $x_k \in x_0 + K_k$ such that $r_k = b - A x_k \perp A K_k$, where $K_k$ is a Krylov subspace.
- From the Arnoldi process: $A V_k = V_{k+1} H_k$ and $x_k = x_0 + V_k y_k$.
- From the minimal residual condition: $y_k$ solves $\min_y \|\beta e_1 - H_k y\|$.
- The solution is found in at most $n$ iterations.
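The Arnoldi relation $A V_k = V_{k+1} H_k$ can be checked numerically on a small case. The sketch below is only an illustration in plain Python, not the PETSc code: the function names (`arnoldi`, `matvec`, `dot`), the 4x4 matrix and the right-hand side are all made up for the example, $x_0 = 0$ is assumed so that $r_0 = b$, and no breakdown handling is included.

```python
import math

def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def arnoldi(A, r0, k):
    """Run k Arnoldi steps (modified Gram-Schmidt) starting from r0.

    Returns V (k+1 orthonormal vectors) and H, the (k+1) x k upper
    Hessenberg matrix stored as a list of rows, with A V_k = V_{k+1} H_k.
    Assumes no breakdown, i.e. H[j+1][j] != 0 at every step.
    """
    beta = math.sqrt(dot(r0, r0))
    V = [[x / beta for x in r0]]
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):
        w = matvec(A, V[j])
        for i in range(j + 1):
            H[i][j] = dot(V[i], w)
            w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        V.append([wi / H[j + 1][j] for wi in w])
    return V, H

# Small nonsymmetric example (values chosen only for illustration).
A = [[4.0, 1.0, 0.0, 0.0],
     [2.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 2.0],
     [1.0, 0.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0, 4.0]
k = 3
V, H = arnoldi(A, b, k)

# Largest entry of A V_k - V_{k+1} H_k; expected to be at rounding level.
defect = 0.0
for j in range(k):
    Avj = matvec(A, V[j])
    VHj = [sum(V[i][row] * H[i][j] for i in range(k + 1))
           for row in range(len(b))]
    defect = max(defect, max(abs(p - q) for p, q in zip(Avj, VHj)))
```

Storing $H$ as a full $(k+1) \times k$ array keeps the check readable; an actual solver would update a factorization of $H$ column by column instead.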
GMRES(m)

Computational and memory requirements grow with $k$.

Restarted GMRES(m):
- At step $m$, set $x_0 \leftarrow x_m$ and restart.
- $V_m$ and $H_m$ are discarded, so the convergence behaviour is difficult to predict.
- The iterative process can also stall.

An example:
- Matrix: CASE_04
- Field: fluid dynamics
- Origin: 2D linear cascade turbine
- Source: FLUOREM Matrix Collection

[Figure: relative residual norm versus iterations for GMRES(32) on CASE_04 (n = 7,980, nz = 430,909).]
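The restart mechanism can be sketched in a few lines. This is a minimal plain-Python illustration, assuming a dense matrix stored as a list of rows and a zero initial guess; the names `gmres_cycle` and `gmres_restarted` and the test matrix are hypothetical, and the least-squares problem is solved with Givens rotations, which is one common choice rather than something the slides specify.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def gmres_cycle(A, b, x0, m):
    """One GMRES cycle of at most m steps from x0; returns (x_m, ||r_m||)."""
    r0 = [bi - ci for bi, ci in zip(b, matvec(A, x0))]
    beta = norm(r0)
    if beta == 0.0:
        return x0, 0.0
    V = [[x / beta for x in r0]]
    R = []              # columns of the triangularized Hessenberg matrix
    cs, sn = [], []     # Givens rotations
    g = [beta]          # rotated right-hand side beta * e_1
    for j in range(m):
        w = matvec(A, V[j])
        h = []
        for vi in V:    # modified Gram-Schmidt
            hij = sum(a * c for a, c in zip(vi, w))
            w = [a - hij * c for a, c in zip(w, vi)]
            h.append(hij)
        hnext = norm(w)
        for i in range(j):  # apply the previous rotations to the new column
            h[i], h[i + 1] = (cs[i] * h[i] + sn[i] * h[i + 1],
                              -sn[i] * h[i] + cs[i] * h[i + 1])
        d = math.hypot(h[j], hnext)
        cs.append(h[j] / d)
        sn.append(hnext / d)
        h[j] = d
        R.append(h)
        g.append(-sn[j] * g[j])
        g[j] = cs[j] * g[j]
        if hnext < 1e-14 * beta:    # Krylov subspace became invariant
            break
        V.append([a / hnext for a in w])
    k = len(R)
    y = [0.0] * k
    for i in reversed(range(k)):    # back substitution on the triangular system
        s = g[i] - sum(R[c][i] * y[c] for c in range(i + 1, k))
        y[i] = s / R[i][i]
    x = list(x0)
    for c in range(k):
        x = [xi + y[c] * vi for xi, vi in zip(x, V[c])]
    return x, abs(g[k])

def gmres_restarted(A, b, m, tol, itmax):
    """GMRES(m): restart every m steps, keeping only the current iterate."""
    x = [0.0] * len(b)
    history = [norm(b)]
    it = 0
    while history[-1] > tol and it < itmax:
        x, res = gmres_cycle(A, b, x, m)    # V_m and H_m are discarded here
        history.append(res)
        it += m
    return x, history

# Illustrative 4x4 nonsymmetric system (made-up values).
A = [[4.0, 1.0, 0.0, 0.0],
     [2.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 2.0],
     [1.0, 0.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0, 4.0]
x2, hist2 = gmres_restarted(A, b, 2, 1e-10, 200)   # GMRES(2)
x4, hist4 = gmres_restarted(A, b, 4, 1e-10, 4)     # one full cycle, m = n
```

The list `history` holds one residual norm per restart, which is the quantity a stagnation check would look at; only the iterate `x` survives from one cycle to the next.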
GMRES and eigenvalues

- The rate of convergence depends on the spectral distribution of $A$.
- Removing or deflating eigenvalues (usually the smallest ones) improves the convergence rate.
- Deflation occurs automatically in the full version of GMRES when the subspace is large enough.
- In the restarted version, deflating an eigenvalue amounts to adding the corresponding eigenvector to the subspace [Morgan, J. Sci. Stat. Comput. 02].

In this work:
- Deflation occurs by using a preconditioner corresponding to the invariant subspace associated with the selected eigenvalues [Erhel et al, JCAM, 1996; Burrage et al, NLAA, 1998]:
  $U = [u_1, \dots, u_k]$, $T = U^T A U$, $M^{-1} = I_n + U\left(|\lambda_n| T^{-1} - I_k\right) U^T$.
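A short worked check shows why this preconditioner deflates the selected eigenvalues. It assumes, beyond what the slide states, that $U$ has orthonormal columns ($U^T U = I_k$) and spans an exactly invariant subspace, so that $AU = UT$; in practice $U$ is only an approximation of such a basis.

```latex
\begin{align*}
M^{-1} U &= U + U\left(|\lambda_n| T^{-1} - I_k\right) U^T U
          = |\lambda_n|\, U T^{-1}, \\
A M^{-1} U &= |\lambda_n|\, A U T^{-1}
            = |\lambda_n|\, U T T^{-1}
            = |\lambda_n|\, U .
\end{align*}
```

Under these assumptions the eigenvalues of $A$ attached to $U$ are all moved to $|\lambda_n|$, while any vector $w$ with $U^T w = 0$ satisfies $M^{-1} w = w$.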
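Deflated GMRES with right preconditioning

Solve $A M^{-1} (M x) = b$ for $x$.

Algorithm DGMRES(m, k) [Erhel et al, JCAM, 1996]
Input: $m$, $k$. Set $x_0$; $M = I$; $U = []$.
1. Run the Arnoldi process on $A M^{-1}$ to get $V_m$ and $H_m$.
2. Set $x_m \leftarrow x_0 + M^{-1} V_m y_m$, where $y_m$ solves $\min_y \|\beta e_1 - H_m y\|$.
3. If there is no convergence:
   1. Find $k$ Schur vectors $S_k$ of $H_m$.
   2. Orthogonalize $X = V_m S_k$ against $U$.
   3. Augment $U$ with $X$ and compute $T = U^T A U$.
   4. Set $M^{-1} \leftarrow I_n + U\left(|\lambda_n| T^{-1} - I_r\right) U^T$, where $r$ is the current number of columns of $U$.
   5. Set $x_0 \leftarrow x_m$ and restart at step 1.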
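Applying $M^{-1}$ to a vector only needs products with $U$ and $U^T$ and one small solve with $T$. The sketch below illustrates this in plain Python; it is not the PETSc implementation. The names `solve` and `apply_deflation` are hypothetical, $U$ is assumed to have orthonormal columns, and the 3x3 upper triangular matrix is a made-up example in which the span of the first two unit vectors is invariant and the largest eigenvalue is 4.

```python
def solve(T, c):
    """Solve T z = c by Gaussian elimination with partial pivoting (small dense T)."""
    k = len(c)
    M = [list(T[i]) + [c[i]] for i in range(k)]     # augmented matrix
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    z = [0.0] * k
    for i in reversed(range(k)):
        z[i] = (M[i][k] - sum(M[i][j] * z[j] for j in range(i + 1, k))) / M[i][i]
    return z

def apply_deflation(U, T, lam_n, v):
    """Return M^{-1} v = v + U (|lam_n| T^{-1} - I) U^T v; U is a list of columns."""
    c = [sum(ui * vi for ui, vi in zip(u, v)) for u in U]    # c = U^T v
    z = solve(T, c)                                          # z = T^{-1} c
    coef = [abs(lam_n) * zi - ci for zi, ci in zip(z, c)]
    out = list(v)
    for u, a in zip(U, coef):
        out = [oi + a * ui for oi, ui in zip(out, u)]
    return out

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Illustrative data: eigenvalues 0.5, 1 and 4; deflate the two smallest.
A = [[0.5, 1.0, 2.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 4.0]]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
T = [[0.5, 1.0], [0.0, 1.0]]          # T = U^T A U for this example
lam_n = 4.0

images = [matvec(A, apply_deflation(U, T, lam_n, u)) for u in U]
unchanged = apply_deflation(U, T, lam_n, [0.0, 0.0, 1.0])
```

In this example `images` holds $A M^{-1} u$ for each column $u$ of $U$, which should equal $4u$ up to rounding. A real code would factor $T$ once per update of $U$ instead of solving from scratch at every application.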
Main observation

- DGMRES induces extra cost to compute and apply the preconditioner.
- Deflation should be applied only if necessary, for instance to prevent stagnation or insufficient reduction in the residual norm: detect stagnation at the end of a cycle and switch to deflation.

How to detect stagnation in GMRES(m)?
- Strictly: rate of convergence $\|r_m\| / \|r_0\| > \tau$, $0 < \tau \le 1$.
- Practically: a more realistic experimental test is needed. GMRES(m) is declared to have stagnated if, at the average rate of progress over the last restart cycle of $m$ steps, the residual norm tolerance cannot be met in some large multiple of the remaining number of steps allowed (the number of steps permitted is bounded by itmax) [Sosonkina et al., NLAA98].
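The practical criterion can be turned into a formula. The derivation below is a sketch that assumes the residual keeps decreasing geometrically at the average per-step rate $\rho < 1$ observed over the last cycle; the symbols $\rho$ and $j$ are introduced here only for the derivation.

```latex
\begin{align*}
\rho &= \left(\frac{\|r_m\|}{\|r_0\|}\right)^{1/m}, \\
\|r_m\|\,\rho^{\,j} \le \mathrm{tol}
  &\iff j \ge \frac{\log\left(\mathrm{tol}/\|r_m\|\right)}{\log \rho}
   = \frac{m\,\log\left(\mathrm{tol}/\|r_m\|\right)}{\log\left(\|r_m\|/\|r_0\|\right)} .
\end{align*}
```

Stagnation is then declared when this estimated number of further steps $j$ exceeds a chosen multiple of the remaining budget itmax - it.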
Practical implementation

Outline of DGMRES(m, l, rmax):

  Choose $x_0$, tol, $m$, itmax, $k$
  Set $B \leftarrow A M^{-1}$, where $M^{-1}$ is any external preconditioner (distinct from the deflation preconditioner used below)
  $r_0 = b - B x_0$; $M = I$; $k = 0$; it $= 0$; $l = 0$
  while ($\|r_0\| >$ tol)
    Run a cycle of at most $m$ GMRES iterations, with a minimization step, to get $x_m$ and $r_m$
    it += $m$
    if ($\|r_m\| >$ tol and it < itmax) then
      test $= m \log(\mathrm{tol}/\|r_m\|) / \log(\|r_m\|/\|r_0\|)$
      if (test > smv * (itmax - it)) then
        Estimate the $k$ smallest eigenvalues of $B M^{-1}$ and compute the data for the preconditioner $M^{-1}$ associated with the deflation
      end if
    end if
    $x_0 = x_m$, $r_0 = r_m$
  end while
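The adaptive test in the outline fits in one small function. The sketch below is an illustration in Python, not the PETSc routine; the function name `should_deflate` and its argument names are hypothetical, and treating a cycle with no residual reduction as stagnation is an assumption added here to avoid dividing by a non-negative logarithm.

```python
import math

def should_deflate(r0_norm, rm_norm, tol, m, it, itmax, smv):
    """Return True when the last cycle's rate of progress looks too slow.

    r0_norm, rm_norm: residual norms at the start and end of the last cycle.
    it: iterations used so far; itmax: iteration budget; smv: relaxation factor.
    """
    if rm_norm <= tol or it >= itmax:
        return False    # converged or out of iterations: nothing to decide
    if rm_norm >= r0_norm:
        return True     # no reduction at all (assumption: counts as stagnation)
    # Estimated number of further steps needed at the current average rate.
    test = m * math.log(tol / rm_norm) / math.log(rm_norm / r0_norm)
    return test > smv * (itmax - it)

# Slow cycle: only a 10% reduction in 32 steps, 468 steps left.
slow = should_deflate(1.0, 0.9, 1e-8, 32, 32, 500, 1.0)
# Fast cycle: three orders of magnitude gained in 32 steps.
fast = should_deflate(1.0, 1e-3, 1e-8, 32, 32, 500, 1.0)
```

With these made-up numbers the slow cycle would need several thousand further steps at the same rate, far more than the 468 remaining, whereas the fast cycle needs only a few dozen. A larger `smv` makes the switch to deflation less likely.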
New KSP type: DGMRES

[Figure: PETSc library organization by level of abstraction: application codes; KSP (Krylov subspace methods: FGMRES, LGMRES, DGMRES, other KSP...) and PC (preconditioners); matrices, vectors, index sets; external packages; BLAS, LAPACK, ScaLAPACK, BLACS, MPI.]

MPI calls are encapsulated in PETSc routines.

Usage in PETSc

In your executable, dynamically register the new KSP:
    KSPRegisterDynamic(KSPDGMRES, Path-to-libdgmres.a, KSPCreate_DGMRES, KSPCreate_DGMRES);

Use DGMRES just as any other KSP, with the following options:
    KSPSetType(ksp, KSPDGMRES); or -ksp_type dgmres
    -ksp_dgmres_eigen <k>: number of eigenvalues to deflate
    -ksp_dgmres_max_eigen <kmax>: maximum number of eigenvalues to deflate
    -ksp_dgmres_smv <smv>: relaxation parameter in the adaptive strategy
    -ksp_gmres_restart <m>, -ksp_max_it <itmax>, -ksp_rtol <rtol>, ...

Any option from GMRES holds; any preconditioner is available through -pc_type <pc>.
Environment for tests

Platform of tests:
- Cluster (Parapluie) @ GRID 5000
- SMP nodes, HP Proliant DL165
- Dual CPU/node; 12 AMD cores/CPU @ 1.7 GHz
- Infiniband DDR network

Test examples:

PETSc example (Ex3)
- Basic Poisson problem
- 2D regular mesh on the unit square
- 2000x2000 mesh in these tests
- N = 4,000,000; NZ = 35,936,032
- Easy to solve; given here just for large-scale experimentation

FLUOREM example (CASE_017)
- Parametrized Navier-Stokes equation
- Finite volume discretization of the steady equation
- Nonsymmetric Jacobian matrix
- N = 381,689; NZ = 37,464,962
- Strongly non-elliptic operator; needs a robust solver
Policy of tests and options

Policy of tests:
- Run the iterative method until $\|b - Ax\| / \|b\| <$ <rtol> or the maximum number of iterations <max it> (iterations, not matrix-vector products) is reached.
- A domain decomposition preconditioner is used (additive Schwarz, block Jacobi); data are assigned to processes in chunks of contiguous rows.
- For each test case and each number of subdomains #D:
  - GMRES(m): restarted GMRES; cycle of m iterations; right preconditioning.
  - DGMRES(m, k, max): deflated GMRES; k eigenvalues extracted at each restart; max eigenvalues extracted in total.
  - DGMRES_A(m, k, max): DGMRES with the adaptive strategy.

Options:

  Option                 Ex3                        CASE_017
  ksp <rtol>             1e-12                      1e-08
  gmres restart <m>      12                         64
  gmres <max it>         500                        1500
  Preconditioning side   right                      right
  PC                     block Jacobi               ASM, <overlap> 1
  Subdomains <D>         [96 192 384 512 1024]      [16 32 64]
  Sub solver             ILU(0) (Hypre-Euclid)      LU (MUMPS)
  dgmres eigen <k>       2                          5
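Assigning data "in chunks of contiguous rows" can be pictured with a balanced split of the row indices. The sketch below is a generic illustration; the function name `row_range` is hypothetical, and the rule that the first ranks receive one extra row when the size does not divide evenly is an assumption for the example, not a statement about the partition actually used in these tests.

```python
def row_range(n, nprocs, rank):
    """Half-open range [start, end) of contiguous rows owned by `rank`.

    The first (n % nprocs) ranks get one extra row so sizes differ by at most 1.
    """
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, start + count

# CASE_017 has N = 381,689 rows; split over 64 processes.
ranges = [row_range(381689, 64, r) for r in range(64)]
```

Each process then works on its own block of rows, which is also the natural definition of a subdomain for block Jacobi or additive Schwarz without overlap.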
Ex3: Poisson problem; 2000x2000 mesh; 36M entries

Numerical convergence, 192-1024 subdomains

[Figure, four panels. Panels 1-3: relative residual norm versus iterations on EX3_2000x2000. D=192: GMRES(12), DGMRES_A(12;2;8), DGMRES(12;2;23). D=512: GMRES(12), DGMRES_A(12;2;17), DGMRES(12;2;31). D=1024: GMRES(12), DGMRES_A(12;2;78), DGMRES(12;2;91). Panel 4: number of matrix-vector products versus number of subdomains (96, 192, 384, 512, 768, 1024) for GMRES(12), DGMRES(12, 2) and DGMRES_A(12, 2).]
Ex3: Overhead due to the deflation

MPI messages on Ex3

[Figure, two panels, versus number of processors (384, 512, 768, 1024). Panel 1: MPI messages and reductions (x 10^6) for GMRES(12), DGMRES(12, 2) and DGMRES_A(12, 2). Panel 2: messages during deflation (x 10^6), total in DGMRES versus during deflation.]

- Insignificant overhead of messages in the deflation phase.
- More investigation is needed on real test cases for the Flops and CPU time.
CASE 17; 3D CFD case; 37.5M entries

Numerical convergence; [8 32 64] subdomains

[Figure, four panels. Panels 1-3: relative residual norm versus iterations on CASE_17. D=8: GMRES(64), DGMRES_A(64;5;5), DGMRES(64;5;61). D=32: GMRES(64), DGMRES_A(64;5;54), DGMRES(64;5;86). D=64: GMRES(64), DGMRES_A(64;5;10), DGMRES(64;5;43). Panel 4: number of matrix-vector products versus number of processors (8, 32, 64) for GMRES(64), DGMRES(64, 5) and DGMRES_A(64, 5).]
CASE 17; 3D CFD case; 37.5M entries

Flops, MPI messages and CPU time; Linux cluster @GRID 5000

[Figure, four panels, versus number of processors (8, 32, 64). Panel 1: floating point operations (Flops x 10^11) for GMRES(64), DGMRES(64, 5) and DGMRES_A(64, 5). Panel 2: total elapsed time (sec) for DGMRES and DGMRES_A. Panel 3: MPI messages and reductions (x 10^6) for GMRES(64), DGMRES(64, 5) and DGMRES_A(64, 5). Panel 4: overhead time in deflation (sec), total versus in deflation.]
Real contribution of the adaptive strategy

The fact is:
- Sometimes, an increase of the restart parameter in GMRES may prevent stagnation.
- It is difficult to know by how much it should be increased, and even more difficult to know whether the method will converge after the increase.
- The deflated GMRES with the adaptive strategy provides more robustness with any restart parameter.

[Figure, two panels: relative residual norm versus iterations on CASE_17, D=64. Panel 1: GMRES(48), GMRES(64), GMRES(96). Panel 2: DGMRES_A(48;5;49), DGMRES_A(64;5;10), DGMRES_A(96;5;5), DGMRES_A(128;5;0).]
Conclusion

- DGMRES should be tested on more real applications and platforms.
- Nevertheless, the code exists as a KSP module... for real arithmetic.
THANKS... QUESTIONS???