Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs


Vincent Heuveline, Dimitar Lukarski, Nico Trost, Jan-Philipp Weiss

No. 2011-09

Preprint Series of the Engineering Mathematics and Computing Lab (EMCL)

KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Preprint Series of the Engineering Mathematics and Computing Lab (EMCL), ISSN 2191-0693, No. 2011-09

Impressum

Karlsruhe Institute of Technology (KIT)
Engineering Mathematics and Computing Lab (EMCL)
Fritz-Erler-Str. 23, 76133 Karlsruhe, Germany

KIT - University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

Published on the Internet under the following Creative Commons License: http://creativecommons.org/licenses/by-nc-nd/3.0/de

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs

Vincent Heuveline 1, Dimitar Lukarski 1,2, Nico Trost 1 and Jan-Philipp Weiss 1,2

1 Engineering Mathematics and Computing Lab (EMCL)
2 SRG New Frontiers in High Performance Computing
Karlsruhe Institute of Technology, Germany
{vincent.heuveline, dimitar.lukarski, jan-philipp.weiss}@kit.edu, nico.trost@student.kit.edu

Abstract. Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes in order to properly resolve all problem-inherent specifics and to maintain a moderate number of unknowns. However, flexibility in the meshes goes along with severe drawbacks with respect to parallel execution, especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid that has high flexibility and adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauß-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. In combination with matrix-based multigrid methods on unstructured meshes, our smoothers provide powerful solvers that are applicable across a wide range of parallel computing platforms and almost arbitrary geometries. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.

Keywords: Parallel smoothers, unstructured meshes, matrix-based multigrid, multi-coloring, power(q)-pattern method, FSAI, multi-core, GPUs

Corresponding author: dimitar.lukarski@kit.edu

1 Introduction

The need for high accuracy and short simulation times relies both on efficient numerical schemes and appropriate scalable parallel implementations. The latter point is crucial in the context of fine-grained parallelism in the prospect of the manycore era. In particular, graphics processing units (GPUs) open new potentials in terms of computing power and internal bandwidth. These architectures have a significant impact on the design and implementation of parallel algorithms in numerical simulation. The algorithms need to be laid out such that thousands of fine-grained threads can run in parallel in order to feed hundreds of computing units. Portability of solver concepts, codes and performance across several platforms is a further major issue in the era of highly capable multicore and accelerator platforms.

Multigrid methods rely on the decomposition of the error into low- and high-frequency contributions. The so-called smoothers damp out the high frequency contributions of the error at a given level. Adequate prolongation and restriction operators allow to address the considered problem on a hierarchy of discretizations and by this means cover the full spectral range of error contributions on the basis of these smoothers alone (see [20] and references therein for further details). For full flexibility of the solvers in the context of complex geometries, stencil-based multigrid methods need to be replaced by more flexible concepts. We use the approach of matrix-based multigrid where all operations, i.e. smoothers, grid transfers and residual computation, are represented by sparse matrix-vector multiplications (SpMV). This approach is shown to work well on modern multicore platforms and GPUs [3]. Moreover, the restriction to basic algorithmic building blocks is the key technique for building portable solvers. The major challenge with respect to parallelism is related to the definition and implementation of adequate parallel smoothers.

Flexible multigrid methods on emerging multicore technologies are the subject of recent research. Parallel multigrid smoothers and preconditioners on regularly structured tensor-product meshes (with a fixed number of neighbors but arbitrary displacements) are considered in [7]. The author discusses parallel implementation aspects for several platforms and shows the integration of accelerators into a finite element software package. In [6], geometric multigrid on unstructured meshes for higher order finite element methods is investigated. A parallel Jacobi smoother and grid transfer operators are assembled into sparse matrix representation, leading to efficient solvers on multicore CPUs and GPUs. In [5] performance results for multigrid solvers based on the sparse approximate inverse (SPAI) technique are presented. An alternative approach without the need for mesh hierarchies are algebraic multigrid methods (AMG) [10]. An implementation and performance results of an AMG on GPUs are discussed in [8].

In this work we propose a new parallel smoother based on the power(q)-pattern enhanced multi-colored ILU(p) factorization [14]. We compare it to multi-colored splitting-type smoothers and FSAI smoothers. These approaches provide efficient smoothers with scalable parallelism across multicore CPUs and GPUs. Various hardware platforms can be easily used in the context of these solvers by

means of the portable implementation based on the lmpLAtoolbox in the parallel finite element software package HiFlow3 [2]. The capability of the proposed approach is demonstrated in a comprehensive performance analysis.

This paper is organized as follows. In Section 2 we describe the context of matrix-based multigrid methods. Section 3 describes the parallel concepts for efficient and scalable smoothers. Implementation aspects and the approach taken in the HiFlow3 finite element method (FEM) package are presented in Section 4. The numerical test problem is the Poisson problem on an L-shaped domain. The impact of the choice and the configuration of the smoothers on the MG convergence is investigated in Section 5. A performance analysis with respect to solver times and parallel speedups on multicore CPUs and GPUs is the central theme in Section 6. After an outlook on future work in Section 7 we conclude in Section 8.

2 Matrix-based Multigrid Methods

Multigrid (MG) methods are usually used to solve large sparse linear systems of equations arising from finite element discretizations (or related techniques) of partial differential equations (PDEs), typically of elliptic type. Due to its ability to achieve asymptotically optimal complexity, MG has been proven to be one of the most efficient solvers for this type of problem. In contrast to Krylov subspace solvers, the number of iterations needed by the MG method in order to achieve a prescribed accuracy does not depend on the number of unknowns N or the grid spacing h. Hence, the costs on finer grids only increase linearly in N due to the complexity of the applied operators. In this sense, MG is superior to other iterative methods (see e.g. [20, 9]).

The main idea of MG methods is based on coarse grid corrections and the smoothing properties of classical iterative schemes. High frequency error components can be eliminated efficiently by using elementary iterative solvers such as Jacobi or Gauß-Seidel. On the other hand, smooth error components cannot be reduced efficiently on fine grids. This effect can be mitigated by performing a sequence of coarsening steps and transferring the smooth errors by restricting the defect to coarser mesh levels. On each coarser mesh, pre-smoothing is applied to reduce error components inherited by the grid transfer. Moreover, on this coarser level the smooth components can be damped much faster. When the coarsest level is reached, the so-called coarse grid problem has to be solved exactly or by an iterative method with sufficient accuracy. By applying the prolongation operator to the corrected defect the next finer level can be reached again. Any remaining high frequency error components can be eliminated by post-smoothing. This recursive execution of inter-grid transfer operators and smoothers is called the MG cycle. Details on the basic properties of the MG method as well as error analysis and programming remarks can be found in [20, 9, 18, 4] and in references provided therein. Aspects of parallel MG are e.g. discussed in [16].

The MG cycle is an iterative and recursive method, shown in Algorithm 1. Here, L_h u_h = f_h is the discrete version of the underlying PDE on the refinement

level h with mesh size h. On each level, pre- and post-smoothing is applied in the form u_h^(n) = RELAX^ν(ũ_h^(n), L_h, f_h). The parameter ν denotes the number of smoothing iterations. Grid transfer operators are denoted by I_h^H (restriction) and I_H^h (prolongation), where h represents the fine grid size and H the coarse grid size. The transferred residual r_H is the input to the MG iteration on the coarser level, where the initial value for the error is taken to be zero. The parameter γ specifies the number of two-grid cycle iterations on each level and thus specifies how often the coarsest level is visited. For γ = 1 we obtain so-called V-cycles, whereas γ = 2 results in W-cycles. The obtained solution after each cycle serves as the input for the successive cycle.

Algorithm 1: Multigrid cycle: u_h^(n) = MG(u_h^(n-1), L_h, f_h, ν1, ν2, γ)
(1) ū_h^(n) := RELAX^ν1(u_h^(n-1), L_h, f_h)        pre-smoothing
(2) r_h^(n) := f_h − L_h ū_h^(n)                    compute residual
(3) r_H^(n) := I_h^H r_h^(n)                        restriction
(4) if (H == h_0) then solve L_H e_H^(n) = r_H^(n)  exact solution on the coarse grid
(5) else e_H^(n) = MG(0, L_H, r_H^(n), ν1, ν2, γ)   recursion
    end if
(6) e_h^(n) := I_H^h e_H^(n)                        prolongation
(7) ũ_h^(n) := ū_h^(n) + e_h^(n)                    correction
(8) u_h^(n) = RELAX^ν2(ũ_h^(n), L_h, f_h)           post-smoothing

In this work, we consider matrix-based geometric MG methods [19]. In contrast to stencil-based geometric MG methods, the differential operators and grid transfer operators are not expressed by fixed stencils on equidistant grids but have the full flexibility of sparse matrix representations. On the one hand, this approach gives us flexibility with respect to complex geometries, non-uniform grids resulting from local mesh refinements, and space-dependent coefficients in the underlying PDE. On the other hand, the solvers can be built upon standard building blocks of numerical libraries.

The convergence and robustness of MG methods strongly depend on the applied smoothers. While classical MG performs very well with additive smoothers such as Gauß-Seidel or Jacobi, strong anisotropies require more advanced smoothers in order to achieve full MG efficiency. In the latter case, incomplete LU decompositions have demonstrated convincing results, e.g. for 2D cases in [21].
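To make the control flow of Algorithm 1 concrete, the following minimal C++ sketch implements the recursive cycle on CSR data. It is not the HiFlow3 implementation: the CSRMatrix and Level types are illustrative stand-ins for the lmpLAtoolbox classes, and a damped Jacobi sweep serves as a placeholder for the RELAX step, which in our solvers is one of the parallel smoothers of Section 3.

#include <cstddef>
#include <vector>

// Minimal CSR matrix; stands in for the lmpLAtoolbox matrix classes.
struct CSRMatrix {
    std::size_t n = 0;             // number of rows
    std::vector<std::size_t> ptr;  // row pointers, size n+1
    std::vector<std::size_t> col;  // column indices
    std::vector<double> val;       // nonzero values
};

// y = A x: the SpMV building block all MG operations reduce to.
std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];
    return y;
}

// Placeholder smoother: nu Jacobi sweeps x <- x + D^{-1}(b - A x).
void relax(const CSRMatrix& A, std::vector<double>& x,
           const std::vector<double>& b, int nu) {
    for (int s = 0; s < nu; ++s) {
        std::vector<double> r = spmv(A, x);
        for (std::size_t i = 0; i < A.n; ++i) {
            double d = 1.0;
            for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
                if (A.col[k] == i) d = A.val[k];  // diagonal entry
            x[i] += (b[i] - r[i]) / d;
        }
    }
}

// One grid level: operator L_h, restriction I_h^H, prolongation I_H^h.
struct Level {
    CSRMatrix A;  // discrete operator on this level
    CSRMatrix R;  // restriction to the next coarser level
    CSRMatrix P;  // prolongation from the next coarser level
};

// Recursive MG cycle following Algorithm 1; gamma = 1 gives a V-cycle,
// gamma = 2 a W-cycle. levels[0] is the finest grid, levels.back() the
// coarsest, on which we simply smooth until the problem is solved well.
void mg_cycle(const std::vector<Level>& levels, std::size_t l,
              std::vector<double>& u, const std::vector<double>& f,
              int nu1, int nu2, int gamma) {
    const Level& L = levels[l];
    if (l + 1 == levels.size()) { relax(L.A, u, f, 100); return; }  // coarse solve

    relax(L.A, u, f, nu1);                                // (1) pre-smoothing
    std::vector<double> r = spmv(L.A, u);                 // (2) residual
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = f[i] - r[i];
    std::vector<double> rH = spmv(L.R, r);                // (3) restriction
    std::vector<double> eH(rH.size(), 0.0);               // zero initial error
    for (int c = 0; c < gamma; ++c)                       // (4)/(5) recursion
        mg_cycle(levels, l + 1, eH, rH, nu1, nu2, gamma);
    std::vector<double> e = spmv(L.P, eH);                // (6) prolongation
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];  // (7) correction
    relax(L.A, u, f, nu2);                                // (8) post-smoothing
}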

3 Parallel Smoothers for Matrix-Based Multigrid

Stencil-based geometric MG methods can be efficiently performed in parallel by domain sub-structuring and halo exchange for the stencils [16]. In contrast, the parallel implementation of a matrix-based MG requires more work. While the grid transfer operators are explicit updates with sufficient locality, parallel smoothers rely on implicit steps where typically triangular systems need to be solved. And since we are working on unstructured meshes, straightforward parallelization strategies like wave-front or red-black coloring cannot be applied. Therefore, the focus of our work is directed to fine-grained parallelism for the smoothing step. We consider additive and multiplicative matrix splittings used for designing efficient parallel smoothers under the following two restrictions: first, the proposed smoothers should rely on a high degree of parallelism and should show significant speedups on multicore CPUs and GPUs. Second, the parallel versions of the smoothers should maintain smoothing properties comparable to their sequential versions. The considered smoothers are based on the block-wise decomposition into small sub-matrices. In the additive and multiplicative splittings, typically a large number of forward and backward sweeps in triangular solvers needs to be processed.

For the description of the proposed parallel smoothers we point out the link between smoothers and preconditioning techniques. Iterative solvers can generally be written in the fixed point form x_{k+1} = G x_k + f, where the linear system Ax = b is transformed by the additive splitting A = M + N, taking G = −M^{-1}N = M^{-1}(M − A) = I − M^{-1}A and f = M^{-1}b. This version can be reformulated as a preconditioned defect correction scheme given by

x_{k+1} = x_k + M^{-1}(b − A x_k).   (1)

In this context, we apply preconditioners M as smoothers for the linear system Ax = b. Additive preconditioners are standard splitting schemes, typically based on the block-wise decomposition A = D + L + R, where L is a strictly lower triangular matrix, R is a strictly upper triangular matrix, and D is the matrix containing the diagonal blocks of A. We choose M = D (Jacobi), M = D + L (Gauß-Seidel) or M = (D + L) D^{-1} (D + R) (symmetric Gauß-Seidel). For multiplicative splittings we choose M = LU in (1), where L is a lower triangular and U is an upper triangular matrix. For incomplete LU (ILU) factorizations with or without fill-ins we decompose the system matrix A into the product A = LU + R with a remainder matrix R that absorbs unwanted fill-in elements. Typically, the diagonal entries of L are taken to be one, and both matrices L and U are stored in the same data structure (omitting the ones). The quality of the approximation in this case depends on the number of fill-in elements.

The third class of considered parallel smoothers are the approximate inverse preconditioners. Here, we are focusing on the factorized sparse approximate inverse (FSAI) algorithms [17] that compute a direct approximation of A^{-1}. These schemes are based on the minimization of the Frobenius norm ||I − GA||_F, where one looks for a symmetric preconditioner of the form G := G_L^T G_L. In other words, one directly builds an approximation of the Cholesky decomposition based on a given sparse matrix structure. FSAI(1) uses the sparsity pattern of A; FSAI(q), q ≥ 2, uses the sparsity pattern of A^q, respectively.
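To illustrate the defect correction form (1), the following C++ sketch performs one smoothing step with the Gauß-Seidel choice M = D + L on a CSR matrix, i.e., one residual computation followed by a forward substitution. Names and data layout are our own simplifications, not the lmpLAtoolbox API; the multi-coloring described next is what makes the triangular solve parallel.

#include <cstddef>
#include <vector>

struct CSR {  // simplified CSR storage
    std::size_t n;
    std::vector<std::size_t> ptr, col;
    std::vector<double> val;
};

// One Gauß-Seidel smoothing step x <- x + M^{-1}(b - A x) with M = D + L.
void gs_smooth(const CSR& A, std::vector<double>& x, const std::vector<double>& b) {
    std::vector<double> r(A.n);                      // r = b - A x
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        r[i] = b[i] - s;
    }
    std::vector<double> z(A.n, 0.0);                 // solve (D + L) z = r
    for (std::size_t i = 0; i < A.n; ++i) {          // forward substitution
        double s = r[i], d = 1.0;
        for (std::size_t k = A.ptr[i]; k < A.ptr[i + 1]; ++k) {
            if (A.col[k] < i)       s -= A.val[k] * z[A.col[k]];
            else if (A.col[k] == i) d = A.val[k];    // diagonal entry
        }
        z[i] = s / d;
    }
    for (std::size_t i = 0; i < A.n; ++i) x[i] += z[i];  // defect correction
}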

In order to harness parallelism within each block of the decomposition we apply multi-coloring techniques [14]. For splitting-based methods (like Gauß-Seidel and SOR) and ILU(0) decompositions without fill-ins, the original sparsity pattern of the system matrix is preserved in the additive or multiplicative decompositions. Here, only subsets of the original sparsity pattern are populated and no additional matrix elements are inserted. Before applying the smoother, the matrix is re-organized such that the diagonal blocks in the block decomposition are themselves diagonal. Then, inversion of the diagonal blocks is just an easy vector operation [13].

Furthermore, we allow fill-ins (i.e. additional matrix elements) in the ILU(p) factorization for achieving a higher level of coupling with increased efficiency. The power(q)-pattern method is applied to ILU(p) factorizations with fill-ins [14]. This method is based on an incomplete factorization of the system matrix A subject to a predetermined non-zero pattern derived from a multi-coloring analysis of the matrix power A^q and its associated multi-coloring permutation π. It has been proven in [14] that the obtained sparsity pattern is a superset of the modified ILU(p) factorization applied to πAπ^{-1}. As a result, for q = p + 1 this modified ILU(p,q) scheme applied to the multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of the triangular ILU(p,q) sweeps and hence to a parallel and efficient smoother. The degree of parallelism can be increased by taking q < p + 1 at the expense of some fill-ins in the diagonal blocks. In this scenario (e.g. for the ILU(1,1) smoother in our experiments) we use a drop-off technique that erases fill-ins in the diagonal blocks. These techniques have already been successfully tested in the context of parallel preconditioners [14]. The major advantage is that multi-coloring can be applied before performing the ILU(p) factorization with additional fill-ins, where fill-ins only occur outside the blocks on the diagonal. By precomputing the superset of the data distribution pattern we also eliminate costly insertion of new matrix elements into dynamic data structures and obtain compact loops [14].
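A minimal sketch of the coloring step is given below: a greedy algorithm assigns to each unknown the smallest color not used by any of its matrix neighbors, and a stable sort of the indices by color yields the permutation that groups the unknowns block-wise. For the power(q)-pattern method the same coloring is applied to the sparsity pattern of A^q instead of A; the function names here are hypothetical, not the interface of [14].

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy multi-coloring of the adjacency graph of a sparse matrix:
// rows i and j get different colors whenever a_ij != 0. After grouping
// the unknowns by color, the diagonal blocks of the permuted matrix are
// themselves diagonal, so each color can be processed in parallel.
std::vector<int> multicolor(std::size_t n,
                            const std::vector<std::size_t>& ptr,
                            const std::vector<std::size_t>& col) {
    std::vector<int> color(n, -1);
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<bool> used;  // colors already taken by neighbors of i
        for (std::size_t k = ptr[i]; k < ptr[i + 1]; ++k) {
            const int cj = color[col[k]];
            if (cj >= 0) {
                if (used.size() <= static_cast<std::size_t>(cj))
                    used.resize(cj + 1, false);
                used[cj] = true;
            }
        }
        int c = 0;  // smallest free color
        while (c < static_cast<int>(used.size()) && used[c]) ++c;
        color[i] = c;
    }
    return color;
}

// Permutation that groups unknowns color by color (stable within a color).
std::vector<std::size_t> coloring_permutation(const std::vector<int>& color) {
    std::vector<std::size_t> perm(color.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return color[a] < color[b]; });
    return perm;
}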

4 Implementation Aspects

Flexibility of solution techniques and software is a decisive factor in the current computing landscape. We have incorporated the described multigrid solvers and parallel smoothers into the multi-platform and multi-purpose parallel finite element software package HiFlow3 [11, 2]. With the concept of object-oriented programming in C++, HiFlow3 provides a flexible, generic and modular framework for building efficient parallel solvers and preconditioners for PDEs of various types. HiFlow3 tackles productivity and performance issues by its conceptual approach. It is structured in several modules. Its linear algebra operations are based on two communication and computation layers, the LAtoolbox for inter-node operations and the lmpLAtoolbox for intra-node operations in a parallel system. The lmpLAtoolbox has backends for multiple platforms with highly optimized implementations hiding all hardware details from the user while maintaining optimal usage of resources. Numerical solvers in HiFlow3 are built on the basis of unified interfaces to the different hardware backends and parallel programming approaches, where only a single code base is required for all platforms. By this means, HiFlow3 is portable across diverse computing systems including multicore CPUs, CUDA-enabled GPUs and OpenCL-capable platforms. The modular approach also allows an easy extension to emerging systems with heterogeneous configuration. The main modules of HiFlow3 are described in the following: Mesh, DoF/FEM and Linear Algebra. See Figure 1 for an abstraction.

Fig. 1. Modular configuration of the parallel FEM package HiFlow3.

Mesh - This module provides a set of classes and functions for handling general types of meshes. This library deals with complex unstructured distributed meshes with different types of cells via a uniform interface. The data structures provided can be used to represent meshes of arbitrary topological and geometrical dimension, and even with different cell types in one and the same mesh. It is possible to create hierarchies of meshes through refinement and also to coarsen cells given such a hierarchy.

DoF/FEM (Degrees of Freedom / Finite Element Methods) - The treatment of the FEM and the DoF is highly interdependent, and therefore the major idea of this module is to capture both sub-modules and unify them within a single scope. The DoF sub-module deals with the numbering of all DoF resulting from the finite element ansatz on each mesh cell. It can handle discontinuous and continuous finite element ansatz functions. This data is passed to the FEM sub-module. This part of the library handles the task of representing the chosen finite element ansatz on the reference cell and transforming it into the physical space depending on its geometry.

Linear Algebra - This module handles the basic linear algebra operations and offers (non-)linear solvers and preconditioners. It is implemented as a two-level library: the global level is an MPI layer which handles the distribution of data among the nodes and performs cross-node computations. The local level (local multi-platform LAtoolbox) takes care of the on-node routines, offering a unified interface to several platforms by providing different platform backends, e.g. for multicore CPUs (OpenMP, MKL) and GPUs (CUDA, OpenCL). In this work we are only focusing on single-node parallelism. Therefore, we do not consider the global MPI level. Based on the matrix and vector classes, HiFlow3 offers several Krylov subspace solvers and local additive and multiplicative preconditioners. Further information about this library, its solvers and preconditioners can be found in [14, 12, 13, 15].
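The following fragment sketches the idea behind the unified interface; the class and method names are invented for illustration and do not reproduce the actual HiFlow3 API. Solver code is written once against abstract local matrix and vector types, while each backend supplies the platform-specific kernels.

#include <cstddef>

// Hypothetical unified interface in the spirit of the lmpLAtoolbox.
class LocalVector {
public:
    virtual ~LocalVector() = default;
    virtual std::size_t size() const = 0;
    virtual void axpy(double alpha, const LocalVector& x) = 0;  // this += alpha * x
    virtual double dot(const LocalVector& x) const = 0;
};

class LocalMatrix {
public:
    virtual ~LocalMatrix() = default;
    virtual void apply(const LocalVector& x, LocalVector& y) const = 0;  // y = A x
};

// An OpenMP backend derives from these classes and implements the kernels
// with parallel loops; a CUDA backend holds device memory and launches
// kernels instead. Solvers such as the MG cycle only see the base classes,
// which is why a single code base runs on all platforms.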

Implementation of Matrix-based Multigrid in HiFlow3

The proposed multigrid solvers and smoothers have been implemented in the context of HiFlow3. Starting from an initial mesh, the HiFlow3 mesh module refines the grid in order to reach the requested number of cells according to the demands for accuracy. The current version supports quadrilateral elements as well as triangles for two-dimensional problems. In the three-dimensional case tetrahedrons and hexahedrons are used. For our two-dimensional test case, an unstructured mixed mesh containing both types of elements with local refinements has been used. On each refinement level, the corresponding stiffness matrix and the right-hand side are assembled for the given equation. Our implementation performs a bi-linear interpolation to ascend to a finer level and a full weighting restriction to descend to a coarser level, respectively. With the relation I_h^{2h} = 2^{-dim} (I_{2h}^h)^T between the inter-grid transfer operators for the standard coarsening procedure, it is sufficient to assemble the prolongation matrix only.

Due to the modular approach of the lmpLAtoolbox, the matrix-based MG solver is based on a single source code for all different platforms (CPUs, GPUs, etc.). The grid transformations and the defect computation are based on matrix-vector and vector-vector routines which can be efficiently performed in parallel on the considered devices. Moreover, a similar approach is taken for the smoothing steps based on the preconditioned defect correction (1). In this context, our smoothers make use of parallel preconditioners of additive and multiplicative type. See also [14] for a detailed description of the parallel concepts for preconditioners, the power(q)-pattern enhanced multi-colored ILU(p) schemes, and a comprehensive performance analysis with respect to solver acceleration and parallel speedups on multicore CPUs and GPUs.
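Since restriction is, up to scaling, the transpose of the assembled prolongation, it can be applied as a scatter over the prolongation rows without storing a second matrix. A sketch under the relation I_h^{2h} = 2^{-dim} (I_{2h}^h)^T stated above (CSR layout and names are our own, not the HiFlow3 code):

#include <cstddef>
#include <vector>

// Apply the full weighting restriction r_H = 2^{-dim} P^T r_h, where P is
// the assembled prolongation in CSR format (n_fine rows, n_coarse columns).
std::vector<double> restrict_residual(std::size_t n_fine, std::size_t n_coarse,
                                      const std::vector<std::size_t>& ptr,
                                      const std::vector<std::size_t>& col,
                                      const std::vector<double>& val,
                                      const std::vector<double>& r_fine,
                                      int dim) {
    const double scale = 1.0 / static_cast<double>(1 << dim);  // 2^{-dim}
    std::vector<double> r_coarse(n_coarse, 0.0);
    for (std::size_t i = 0; i < n_fine; ++i)            // row i of P
        for (std::size_t k = ptr[i]; k < ptr[i + 1]; ++k)
            r_coarse[col[k]] += scale * val[k] * r_fine[i];  // scatter for P^T
    return r_coarse;
}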

5 Numerical Experiments and Smoother Efficiency

As a numerical test problem we solve the Poisson problem with homogeneous Dirichlet boundary conditions

−Δu = f in Ω,   u = 0 on ∂Ω   (2)

in the two-dimensional L-shaped domain Ω := (0,1)² \ (0,0.5)² with a reentrant corner. For our test case we choose the right-hand side f = 8π² cos(2πx) cos(2πy). We solve (2) by means of bilinear Q1 elements on quadrilaterals and linear P1 elements on triangles [11, 2]. A uniform discretization of the L-shaped domain is depicted in Figure 2 (left). The numerical solution of (2) with boundary layers is detailed in Figure 2 (right). Due to the steep gradients at the reentrant corner we are using a locally refined mesh in order to obtain a proper problem resolution.

Fig. 2. Discretization of the locally refined L-shaped domain (left) and discrete solution of the Poisson problem (2) (right).

Figure 3 (left) shows a zoom-in into a uniformly refined coarse mesh. In the mesh in Figure 3 (right) we are using additional triangular and quadrilateral elements for local refinement, avoiding hanging nodes and deformed elements [1]. Our coarsest mesh in the multigrid hierarchy is a locally refined mesh based on triangular and quadrilateral cells with Q1 and P1 elements summing up to 3,266 degrees of freedom (DoF). By applying six global refinement steps to this coarse mesh we obtain the hierarchy of nested grids where the finest mesh has 3,2,425 DoF. Details of the mesh hierarchy are listed in Table 1. None of our grids is a uniform grid with equidistant mesh spacings. For the inter-grid operations we are using full weighting restrictions and bi-linear interpolations.
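As a side note on the choice of f: the right-hand side is exactly the image under the negative Laplacian of the smooth function u*(x, y) = cos(2πx) cos(2πy), since

−Δu* = −∂²u*/∂x² − ∂²u*/∂y² = (4π² + 4π²) cos(2πx) cos(2πy) = 8π² cos(2πx) cos(2πy) = f.

Note that u* does not satisfy the homogeneous Dirichlet data, so the exact solution of (2) differs from u* and exhibits the reduced regularity at the reentrant corner that motivates the local refinement.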

Fig. 3. Zoom-in to a uniform mesh with 3,2 DoF (left) and a locally refined mesh with 3,266 DoF (right) of the L-shaped domain around the reentrant corner; our coarsest MG mesh with level 1.

Table 1. Characteristics of the six refinement levels of the MG hierarchy: for each level, the number of cells, the number of DoF, and the maximal and minimal mesh width h at the reentrant corner (0.5, 0.5).

For the motivation of our locally refined grids we consider the following example. The gradient of the solution on a fine uniform mesh with 3,49,825 DoF shows a low resolution, as can be seen in Figure 4 (left). The gradient of the solution on the locally refined mesh with 3,2,425 DoF (our finest grid in the multigrid hierarchy) is approximated much more accurately, as depicted in Figure 4 (right). Clearly, with the same number of unknowns we can solve the problem with higher accuracy only on locally refined meshes. There is a significant improvement of approximation quality with less than 2% additional grid points.

Fig. 4. Zoom-in plots of the gradient of the solution of (2) on a uniform mesh with 3,49,825 DoF (left) and on a locally refined mesh with 3,2,425 DoF (right); our finest MG mesh with level 6.

In this work we study the behavior of parallel smoothers based on additive splittings (Jacobi, multi-colored Gauß-Seidel and multi-colored symmetric Gauß-Seidel),

incomplete factorizations (multi-colored ILU(0) and power(q)-pattern enhanced multi-colored ILU(p) with fill-ins) and FSAI smoothers. We investigate their properties with respect to high-frequency error reduction and the convergence behavior of the V- and W-cycle parallel MG. In particular, we consider the multigrid behavior on different test platforms. Our numerical experiments are performed on a hybrid platform, a dual-socket Intel Xeon (E5450) quad-core system (with eight cores in total) that is accelerated by an NVIDIA Tesla S1070 GPU system with four GPUs attached pairwise by PCIe to one socket each. The memory capacity of a single CPU and GPU device is 16 GB and 4 GB, respectively. We are only using double precision for the computations.

We perform several tests with different configurations for the pre- and post-smoothing steps. We determine the number of cycles required to achieve a relative residual less than 10^-6. In Table 2 the iteration counts are shown for the V-cycle based MG. Note that the iteration counts in this table do not reflect the total amount of work. Some smoothing steps, e.g. ILU(p), need more work than others. From a theoretical point of view, a Gauß-Seidel smoothing step is half as costly as a symmetric Gauß-Seidel step; ILU(0) is cheaper than ILU(1,1), which is cheaper than ILU(1,2). For the MG solver, ν1 + ν2 is the number of smoothing steps on each level and should be kept low in order to reduce the total amount of work (and hence the solver time). For an estimation of the corresponding work load, computing times are included for the MG solver on the GPU platform.

For Jacobi smoothers there is no convergence of the MG solver within 99 iterations. Multi-colored symmetric Gauß-Seidel (SGS) is better than multi-colored Gauß-Seidel (GS) in terms of iteration count. Multi-colored ILU smoothers are better than the additive splitting-type smoothers (GS, SGS). The quality of the ILU(p) schemes in terms of iteration counts gets better with increasing p. The drop-off technique for ILU(1,1) provides only little improvement compared to ILU(0). Iteration counts for the FSAI smoothers are in the same range as for the ILU smoothers. The best candidates with respect to reduced iteration counts are ILU(1,2) and FSAI(2). Minima with respect to the iteration count can be found for configurations where ν1 + ν2 = 3 or 4. For larger values there are no more significant improvements. The run time results show that the benefits of the SGS smoother in terms of smoothing properties and reduced iteration count are eaten up in terms of run time due to the additional overhead in each smoothing step and an additional V-cycle. For the ILU smoothers the additional work complexity still yields improvements in run time. The best performance is obtained for the ILU(1,2) smoother. The drop-off technique for ILU(1,1) causes some loss of performance. The FSAI smoothers give no particular improvements in run time compared to the other smoothers.

Table 3 shows iteration counts and run times for the W-cycle based MG solver. The iteration counts are reduced compared to the V-cycle based MG solver. However, the run times show that the W-cycle based MG solver is by a factor of 2 to 3 slower than the V-cycle based MG solver. The W-cycle based MG solver is slower because more smoothing steps are performed on coarser grids.
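Expressed in code, the cycle counts reported in the tables correspond to a driver loop of the following shape, reusing the types from the sketch in Section 2; the 10^-6 threshold is the criterion stated above, while the iteration cap of 99 mirrors the cut-off used for the Jacobi runs:

#include <cmath>
#include <cstddef>
#include <vector>

// Iterate MG cycles until the relative residual ||f - A u|| / ||f||
// drops below 1e-6; returns the number of cycles, or -1 on failure.
int solve_with_mg(const std::vector<Level>& levels, std::vector<double>& u,
                  const std::vector<double>& f, int nu1, int nu2, int gamma,
                  int max_cycles = 99) {
    auto norm2 = [](const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return std::sqrt(s);
    };
    const double nf = norm2(f);
    for (int it = 1; it <= max_cycles; ++it) {
        mg_cycle(levels, 0, u, f, nu1, nu2, gamma);      // one V- or W-cycle
        std::vector<double> r = spmv(levels[0].A, u);    // r = f - A u
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = f[i] - r[i];
        if (norm2(r) / nf < 1e-6) return it;             // converged
    }
    return -1;  // no convergence within max_cycles
}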

Table 2. Number of MG V-cycles for different smoothers (GS, SGS, ILU(0), ILU(1,1), ILU(1,2), FSAI(1), FSAI(2), FSAI(3)) and total run times of the V-cycle based MG solver on the GPU for different smoother configurations (ν1, ν2), ν1, ν2 ∈ {0, ..., 4}; ν1 is the number of pre-smoothing steps, ν2 is the post-smoothing step count. Jacobi does not converge within 99 cycles for any configuration.

This involves a larger number of calls to SpMV routines for small matrices with significant overhead (in particular kernel call overheads on the GPU). The best results for the W-cycle based MG are obtained for the ILU(0) smoother, with Gauß-Seidel following next. Except for the FSAI smoothers, all other smoothers are on the same performance level.

Table 3. Number of multigrid W-cycles for different smoothers (GS, SGS, ILU(0), ILU(1,1), ILU(1,2), FSAI(1), FSAI(2), FSAI(3)) and total run times of the W-cycle based MG solver on the GPU for different smoother configurations (ν1, ν2), ν1, ν2 ∈ {0, ..., 4}; ν1 is the number of pre-smoothing steps, ν2 is the post-smoothing step count.

In Table 4 the run times and iteration counts for the best V-cycle and W-cycle based smoother configurations for the parallel MG solvers on the GPU are summarized. The first column in this table specifies the optimal values for the parameters ν1 and ν2 that correspond to the number of pre- and post-smoothing steps on each refinement level. The third column gives the number of necessary MG cycle iterations.

Table 4. Minimal run times and corresponding iteration counts for the V-cycle and W-cycle based parallel MG solvers on the GPU for various smoothers (GS, SGS, ILU(0), ILU(1,1), ILU(1,2), FSAI(1), FSAI(2), FSAI(3)) and optimal configurations of the corresponding number of pre- and post-smoothing steps (ν1, ν2).

Compared to the results of the Krylov subspace solvers, MG performance is superior. In Table 5 the run times are listed for the preconditioned conjugate gradient (CG) method on the CPU and GPU test platforms. The considered smoothers are used as preconditioners in this context. The number of CG iterations is given in the first row. We find that all preconditioners significantly reduce the number of iterations. Run times are presented for the sequential CPU version, the eight-core OpenMP parallel version, and the GPU version. We see that the MG solver on the GPU is faster by a factor of up to 6 than the preconditioned CG solver on the GPU. The best preconditioner is ILU(1,2) on the GPU.

Table 5. Run times in seconds and iteration counts for the preconditioned CG solver for various preconditioners (None, Jacobi, SGS, ILU(0), ILU(1,1), ILU(1,2), FSAI(1), FSAI(2), FSAI(3)) on the CPU (sequential and eight-core OpenMP parallel) and on the GPU.

In the following we consider the smoothing properties in more detail. In Figure 5 the initial error with its highly oscillatory parts is shown for a random initial guess. In the following figures we present the reduction of the error (compared to the final MG solution) after one and three smoothing steps with the corresponding smoother. For this experiment we choose the locally refined mesh of level 2 with 12,795 DoF (the second coarsest mesh in our MG hierarchy). We see that the effect of Jacobi smoothing as shown in Figure 6 is worse than that of Gauß-Seidel smoothing shown in Figure 7, which itself is worse than the symmetric Gauß-Seidel smoother presented in Figure 8. Even smoother results are observed for ILU(0), ILU(1,1) and ILU(1,2), as shown in Figures 9, 10

17 5 and. Wit te FSAI smooters, some iger order oscillations are still observed after te initial smooting step as can bee seen in Figures 2, 3 and 4. Error of initial guess.5 - Fig. 5. Initial error for te MG iteration wit random initial guess. Error after Jacobi iteration Error after 3 Jacobi iterations Fig. 6. Damped error after (left) and 3 (rigt) Jacobi smooting steps. Error after GS iteration Error after 3 GS iterations Fig. 7. Damped error after (left) and 3 (rigt) Gauß-Seidel smooting steps.

Fig. 8. Damped error after 1 (left) and 3 (right) symmetric Gauß-Seidel smoothing steps.

Fig. 9. Damped error after 1 (left) and 3 (right) ILU(0) smoothing steps.

Fig. 10. Damped error after 1 (left) and 3 (right) ILU(1,1) smoothing steps.

6 Performance Analysis on Multicore CPUs and GPUs

In this section we conduct a performance analysis with respect to the different test platforms. We assess the corresponding solver times and the parallel speedups. In Figure 15 we compare run times of the V-cycle (left) and W-cycle (right) MG solver with various smoothers for the Poisson problem on the L-shaped domain. It shows the results for the sequential version and the OpenMP parallel version on eight cores of the CPU as well as for the GPU version. For this test problem, there are no significant differences in performance for the tested smoothers on a specific platform.

Fig. 11. Damped error after 1 (left) and 3 (right) ILU(1,2) smoothing steps.

Fig. 12. Damped error after 1 (left) and 3 (right) FSAI(1) smoothing steps.

Fig. 13. Damped error after 1 (left) and 3 (right) FSAI(2) smoothing steps.

Fig. 14. Damped error after 1 (left) and 3 (right) FSAI(3) smoothing steps.

The multiplicative ILU-type smoothers are slightly faster than the additive splitting-type smoothers. The ILU-based smoothers are more efficient in terms of smoothing properties and reduced iteration counts, but they are more expensive to execute. In total, both effects interact such that the total execution time is only slightly better. The best performance results are obtained for the ILU(1,2) smoother based MG solver on the GPU. In this 2D Poisson test problem with unstructured grids based on Q1 and P1 elements, the resulting stiffness matrix has only six colors in the multi-coloring decomposition. By increasing the sparsity pattern with respect to the structure of A², used for building the ILU(1,2) decomposition, we obtain 16 colors. In this case the triangular sweeps can still be performed with a high degree of parallelism. However, on the CPU the FSAI algorithms perform better due to the utilization of cache effects. The FSAI algorithms rely on parallel matrix-vector multiplications only. This is in contrast to the multi-coloring technique, which re-orders the matrix by grouping the unknowns. This distribution performs better on the GPU due to the lack of bank conflicts in the sparse matrix-vector multiplications. On the other hand, the FSAI algorithms perform better on the cache-based architecture due to the large number of elements that can benefit from data reuse.

Fig. 15. Run times of the V-cycle (left) and W-cycle (right) MG solver with various smoothers for the Poisson problem on the L-shaped domain: sequential version and OpenMP parallel version on eight cores of the CPU, and GPU version.

The parallel speedups of the V-cycle and W-cycle based MG solvers are detailed in Figure 16. The OpenMP parallel speedup is slightly above two. In a sequential run on a single CPU core, less than one third of the eight-core peak bandwidth can be utilized (see measurements in [13]). Therefore, the speedup of the eight-core OpenMP parallel version is technically limited by a factor of less than three on this particular test platform. These performance expectations are reflected by the measurements reported here. The GPU version is by a factor of two to three faster than the OpenMP parallel version. These factors are in good conformance with experience for GPU acceleration of bandwidth-bound kernels.
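For completeness, the shape of one parallel forward sweep after multi-coloring: the unknowns of each color form a contiguous block whose rows are mutually independent, so the inner loop parallelizes with OpenMP on the CPU (or as one kernel launch per color on the GPU, which is where the per-call overhead on coarse grids enters). This is a simplified sketch, assuming a unit-diagonal lower triangular factor permuted into color blocks; names are our own.

#include <cstddef>
#include <vector>

// Forward sweep of a multi-colored triangular solve in CSR format.
// offsets[c]..offsets[c+1] is the index range of color c; all entries
// of a row reference unknowns of earlier colors only, so every row of
// a color block can be processed concurrently.
void colored_forward_sweep(const std::vector<std::size_t>& ptr,
                           const std::vector<std::size_t>& col,
                           const std::vector<double>& val,
                           const std::vector<std::size_t>& offsets,
                           std::size_t num_colors,
                           std::vector<double>& z) {  // in: rhs, out: solution
    for (std::size_t c = 0; c < num_colors; ++c) {    // colors in order
        const long long lo = static_cast<long long>(offsets[c]);
        const long long hi = static_cast<long long>(offsets[c + 1]);
        #pragma omp parallel for
        for (long long i = lo; i < hi; ++i) {         // independent rows
            double s = z[i];
            for (std::size_t k = ptr[i]; k < ptr[i + 1]; ++k)
                s -= val[k] * z[col[k]];              // earlier colors only
            z[i] = s;                                 // unit diagonal: no division
        }
    }
}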

For the FSAI smoothers, the speedup on the GPU falls a little bit behind, since FSAI is better suited for CPU architectures. The presented parallel speedup results demonstrate the scalability of our parallel approach for efficient multigrid smoothers. Parallel multigrid solvers are typically affected by bad communication/computation ratios and load imbalance on coarser grids. As we are working on shared memory type devices, these influences do not occur. In contrast, however, our matrix-based multigrid solver depends by construction on the calls to SpMV kernels. On coarser grids, these kernels have a significant call overhead at the expense of parallel efficiency. This has a direct impact on the speedups of the W-cycle based MG solvers on the GPU, as can be seen in Figure 16 (right).

Fig. 16. Parallel speedup of the V-cycle (left) and W-cycle (right) based multigrid solvers for various smoothers.

Similar results are obtained for the performance numbers and speedups of the preconditioned CG solver. In the left part of Figure 17 run time results are listed for various preconditioners. The right part details the corresponding speedups for the OpenMP parallel version and the GPU implementation. Speedups for the CG solver are larger than those for the MG solver. Although both solvers consist of the same building blocks, CG is more efficient since it fully relies on the finest grid of the MG hierarchy with huge sparse matrices. In contrast, the MG solver is doing work on coarser grids where call overheads for SpMV kernels have a significant influence on the results.

7 Outlook on Future Work

In our future work the proposed multigrid solvers will be extended with respect to higher order finite element methods. This is basically a question of inter-grid transfer operators, since HiFlow3 allows finite elements of arbitrary degree. Further performance results will be included for our OpenCL backends. More importantly, our solvers, smoothers and preconditioners will be extended for parallelism on distributed memory systems. The LAtoolbox in HiFlow3 is built on an MPI-communication and computation layer.

Fig. 17. Run times and parallel speedups for the preconditioned CG solvers on the CPU (sequential and eight-core OpenMP parallel version) and on the GPU.

The major work has to be done on the algorithmic side. Intensive research needs to be invested into hybrid parallelization where MPI-based parallelism is coupled with node-level parallelism (OpenMP, CUDA). MPI-parallel versions of efficient preconditioners and smoothers on unstructured meshes are the subject of current research. Moreover, the hierarchical memory sub-systems and hybrid configurations of modern multi- and manycore systems require new hardware-aware algorithmic approaches. Due to the limited efficiency of SpMV kernels on GPUs with respect to coarse grids, U-cycles for the MG solvers should be investigated that stop on finer grids.

8 Conclusion

Matrix-based multigrid solvers on unstructured meshes are efficient numerical schemes for solving complex and highly relevant problems. The paradigm shift towards manycore devices brings up new challenges in two dimensions: first, the algorithms need to express fine-grained parallelism and need to be designed with respect to scalability to hundreds and thousands of cores. Secondly, software solutions and implementations need to be designed to be flexible and portable in order to exploit the potential of various platforms in a unified approach. The proposed techniques for parallel smoothers and preconditioners provide both efficient and scalable parallel schemes. We have demonstrated how sophisticated mathematical techniques can be extended with respect to scalable parallelism and how these techniques can harness the computing power of modern manycore platforms. In particular, with the formulation in terms of SpMV kernels the capabilities of GPUs can be exploited. With the described solvers, solution times for realistic problems can be kept at a moderate level. And with the concept of our generic and portable software package, the numerical solvers can be used on a variety of platforms on a single code base. The users are freed from specific hardware knowledge and platform-oriented optimizations. We have reported speedups and we have demonstrated that even complex algorithms can be successfully ported to GPUs with additional performance gains. The proposed ILU(1,2) smoother and the FSAI(2) smoother show convincing performance and scalability results

for parallel matrix-based MG. The proposed V-cycle based MG solvers with parallel smoothers on the GPU are by a factor of 6 faster than the preconditioned CG solvers on the GPU. But due to the absence of coarse grid operations, the CG solvers have slightly better scalability properties on GPUs.

Acknowledgements

The Shared Research Group 16-1 received financial support by the Concept for the Future of Karlsruhe Institute of Technology (KIT) in the framework of the German Excellence Initiative and its collaboration partner Hewlett-Packard. The Engineering Mathematics and Computing Lab (EMCL) at KIT has been awarded NVIDIA CUDA Center of Research for its research and efforts in GPU computing. The authors acknowledge the support by means of hardware donations from the company NVIDIA in that framework. We would also like to express our gratitude to Felix Rien and Niels Weg, who have contributed to the implementation of our software.

References

1. Bank, R.E., Sherman, A.H., Weiser, A.: Some refinement algorithms and data structures for regular local mesh refinement (1983)
2. Anzt, H., Augustin, W., Baumann, M., Bockelmann, H., Gengenbach, T., Hahn, T., Heuveline, V., Ketelaer, E., Lukarski, D., Otzen, A., Ritterbusch, S., Rocker, B., Ronnas, S., Schick, M., Subramanian, C., Weiss, J.P., Wilhelm, F.: HiFlow3 - a flexible and hardware-aware parallel finite element package. Tech. Rep. 2010-06, EMCL, KIT (2010)
3. Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: SC '09: Proc. of the Conf. on High Perf. Computing Networking, Storage and Analysis. ACM, New York (2009)
4. Demmel, J.W.: Applied numerical linear algebra. SIAM, Philadelphia (1997)
5. Geveler, M., Ribbrock, D., Göddeke, D., Zajac, P., Turek, S.: Efficient finite element geometric multigrid solvers for unstructured grids on GPUs. In: P. Iványi, B.H. Topping (eds.) Second Int. Conf. on Parallel, Distributed, Grid and Cloud Computing for Engineering (2011)
6. Geveler, M., Ribbrock, D., Göddeke, D., Zajac, P., Turek, S.: Towards a complete FEM-based simulation toolkit on GPUs: Geometric multigrid solvers. In: 23rd Int. Conf. on Parallel Computational Fluid Dynamics (ParCFD 2011) (2011)
7. Göddeke, D.: Fast and accurate finite-element multigrid solvers for PDE simulations on GPU clusters. Ph.D. thesis, Technische Universität Dortmund (2010)
8. Haase, G., Liebmann, M., Douglas, C., Plank, G.: A parallel algebraic multigrid solver on graphics processing units. In: W. Zhang, et al. (eds.) HPCA 2010, LNCS vol. 5938. Springer, Heidelberg (2010)
9. Hackbusch, W.: Multi-grid methods and applications, 2nd printing. Springer Series in Computational Mathematics, vol. 4. Springer, Berlin (2003)
10. Henson, V.E., Yang, U.M.: BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Appl. Numer. Math. 41(1) (2002)

11. Heuveline, V., et al.: HiFlow3 - parallel finite element software. URL http://www.hiflow3.org
12. Heuveline, V., Lukarski, D., Subramanian, C., Weiss, J.P.: Parallel preconditioning and modular finite element solvers on hybrid CPU-GPU systems. In: P. Iványi, B.H. Topping (eds.) Proceedings of the 2nd Int. Conf. on Parallel, Distributed, Grid and Cloud Computing for Engineering, Paper 36. Civil-Comp Press, Stirlingshire, UK (2011)
13. Heuveline, V., Lukarski, D., Weiss, J.P.: Scalable multi-coloring preconditioning for multi-core CPUs and GPUs. In: UCHPC, Euro-Par 2010 Parallel Processing Workshops, LNCS vol. 6586 (2011)
14. Heuveline, V., Lukarski, D., Weiss, J.P.: Enhanced parallel ILU(p)-based preconditioners for multi-core CPUs and GPUs - the power(q)-pattern method. Tech. Rep. 2011-08, EMCL, KIT (2011)
15. Heuveline, V., Subramanian, C., Lukarski, D., Weiss, J.P.: A multi-platform linear algebra toolbox for finite element solvers on heterogeneous clusters. In: PPAAC, IEEE Cluster 2010 Workshops (2010)
16. Hülsemann, F., Kowarschik, M., Mohr, M., Rüde, U.: Parallel geometric multigrid. In: Numerical Solution of Partial Differential Equations on Parallel Computers, LNCSE vol. 51, chapter 5 (2005)
17. Kolotilina, L., Yeremin, A.: Factorized sparse approximate inverse preconditionings, I: Theory. SIAM J. Matrix Anal. Appl. (1993)
18. Saad, Y.: Iterative methods for sparse linear systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2003)
19. Shapira, Y.: Matrix-based multigrid. Theory and applications. Numerical Methods and Algorithms. Springer (2003)
20. Trottenberg, U.: Multigrid. Academic Press, Amsterdam (2001)
21. Wittum, G.: On the robustness of ILU smoothing. SIAM Journal on Scientific and Statistical Computing 10:699-717 (1989)

Preprint Series of the Engineering Mathematics and Computing Lab: recent issues

No. 2011-08 Vincent Heuveline, Dimitar Lukarski, Jan-Philipp Weiss: Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs - The Power(q)-pattern Method
No. 2011-07 Thomas Gengenbach, Vincent Heuveline, Rolf Mayer, Mathias J. Krause, Simon Zimny: A Preprocessing Approach for Innovative Patient-specific Intranasal Flow Simulations
No. 2011-06 Hartwig Anzt, Maribel Castillo, Juan C. Fernández, Vincent Heuveline, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí: Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Graphics Processors
No. 2011-05 Hartwig Anzt, Maribel Castillo, José I. Aliaga, Juan C. Fernández, Vincent Heuveline, Rafael Mayo, Enrique S. Quintana-Ortí: Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms
No. 2011-04 Vincent Heuveline, Michael Schick: A local time dependent Generalized Polynomial Chaos method for Stochastic Dynamical Systems
No. 2011-03 Vincent Heuveline, Michael Schick: Towards a hybrid numerical method using Generalized Polynomial Chaos for Stochastic Differential Equations
No. 2011-02 Panagiotis Adamidis, Vincent Heuveline, Florian Wilhelm: A High-Efficient Scalable Solver for the Global Ocean/Sea-Ice Model MPIOM
No. 2011-01 Hartwig Anzt, Maribel Castillo, Juan C. Fernández, Vincent Heuveline, Rafael Mayo, Enrique S. Quintana-Ortí, Björn Rocker: Power Consumption of Mixed Precision in the Iterative Solution of Sparse Linear Systems
No. 2010-07 Werner Augustin, Vincent Heuveline, Jan-Philipp Weiss: Convey HC-1 Hybrid Core Computer - The Potential of FPGAs in Numerical Simulation
No. 2010-06 Hartwig Anzt, Werner Augustin, Martin Baumann, Hendryk Bockelmann, Thomas Gengenbach, Tobias Hahn, Vincent Heuveline, Eva Ketelaer, Dimitar Lukarski, Andrea Otzen, Sebastian Ritterbusch, Björn Rocker, Staffan Ronnås, Michael Schick, Chandramowli Subramanian, Jan-Philipp Weiss, Florian Wilhelm: HiFlow3 - A Flexible and Hardware-Aware Parallel Finite Element Package
No. 2010-05 Martin Baumann, Vincent Heuveline: Evaluation of Different Strategies for Goal Oriented Adaptivity in CFD - Part I: The Stationary Case
No. 2010-04 Hartwig Anzt, Tobias Hahn, Vincent Heuveline, Björn Rocker: GPU Accelerated Scientific Computing: Evaluation of the NVIDIA Fermi Architecture; Elementary Kernels and Linear Solvers
No. 2010-03 Hartwig Anzt, Vincent Heuveline, Björn Rocker: Energy Efficiency of Mixed Precision Iterative Refinement Methods using Hybrid Hardware Platforms: An Evaluation of different Solver and Hardware Configurations
No. 2010-02 Hartwig Anzt, Vincent Heuveline, Björn Rocker: Mixed Precision Error Correction Methods for Linear Systems: Convergence Analysis based on Krylov Subspace Methods

The responsibility for the contents of the working papers rests with the authors, not the Institute. Since working papers are of a preliminary nature, it may be useful to contact the authors of a particular working paper about results or caveats before referring to, or quoting, a paper. Any comments on working papers should be sent directly to the authors.


More information

Computer Science and Engineering, UCSD October 7, 1999 Goldreic-Levin Teorem Autor: Bellare Te Goldreic-Levin Teorem 1 Te problem We æx a an integer n for te lengt of te strings involved. If a is an n-bit

More information

Improved dynamic programs for some batcing problems involving te maximum lateness criterion A P M Wagelmans Econometric Institute Erasmus University Rotterdam PO Box 1738, 3000 DR Rotterdam Te Neterlands

More information

Schedulability Analysis under Graph Routing in WirelessHART Networks

Schedulability Analysis under Graph Routing in WirelessHART Networks Scedulability Analysis under Grap Routing in WirelessHART Networks Abusayeed Saifulla, Dolvara Gunatilaka, Paras Tiwari, Mo Sa, Cenyang Lu, Bo Li Cengjie Wu, and Yixin Cen Department of Computer Science,

More information

Distances in random graphs with infinite mean degrees

Distances in random graphs with infinite mean degrees Distances in random graps wit infinite mean degrees Henri van den Esker, Remco van der Hofstad, Gerard Hoogiemstra and Dmitri Znamenski April 26, 2005 Abstract We study random graps wit an i.i.d. degree

More information

Training Robust Support Vector Regression via D. C. Program

Training Robust Support Vector Regression via D. C. Program Journal of Information & Computational Science 7: 12 (2010) 2385 2394 Available at ttp://www.joics.com Training Robust Support Vector Regression via D. C. Program Kuaini Wang, Ping Zong, Yaoong Zao College

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

ON LOCAL LIKELIHOOD DENSITY ESTIMATION WHEN THE BANDWIDTH IS LARGE

ON LOCAL LIKELIHOOD DENSITY ESTIMATION WHEN THE BANDWIDTH IS LARGE ON LOCAL LIKELIHOOD DENSITY ESTIMATION WHEN THE BANDWIDTH IS LARGE Byeong U. Park 1 and Young Kyung Lee 2 Department of Statistics, Seoul National University, Seoul, Korea Tae Yoon Kim 3 and Ceolyong Park

More information

M(0) = 1 M(1) = 2 M(h) = M(h 1) + M(h 2) + 1 (h > 1)

M(0) = 1 M(1) = 2 M(h) = M(h 1) + M(h 2) + 1 (h > 1) Insertion and Deletion in VL Trees Submitted in Partial Fulfillment of te Requirements for Dr. Eric Kaltofen s 66621: nalysis of lgoritms by Robert McCloskey December 14, 1984 1 ackground ccording to Knut

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

To motivate the notion of a variogram for a covariance stationary process, { Ys ( ): s R}

To motivate the notion of a variogram for a covariance stationary process, { Ys ( ): s R} 4. Variograms Te covariogram and its normalized form, te correlogram, are by far te most intuitive metods for summarizing te structure of spatial dependencies in a covariance stationary process. However,

More information

A perturbed two-level preconditioner for the solution of three-dimensional heterogeneous Helmholtz problems with applications to geophysics

A perturbed two-level preconditioner for the solution of three-dimensional heterogeneous Helmholtz problems with applications to geophysics Institut National Polytecnique de Toulouse(INP Toulouse) Matématiques, Informatiques et Télécommunication Xavier Pinel mardi18mai2010 A perturbed two-level preconditioner for te solution of tree-dimensional

More information

An inquiry into the multiplier process in IS-LM model

An inquiry into the multiplier process in IS-LM model An inquiry into te multiplier process in IS-LM model Autor: Li ziran Address: Li ziran, Room 409, Building 38#, Peing University, Beijing 00.87,PRC. Pone: (86) 00-62763074 Internet Address: jefferson@water.pu.edu.cn

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications

AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications Dr. Bjoern Landmann Dr. Kerstin Wieczorek Stefan Bachschuster 18.03.2015 FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching

More information

In other words the graph of the polynomial should pass through the points

In other words the graph of the polynomial should pass through the points Capter 3 Interpolation Interpolation is te problem of fitting a smoot curve troug a given set of points, generally as te grap of a function. It is useful at least in data analysis (interpolation is a form

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

How To Ensure That An Eac Edge Program Is Successful

How To Ensure That An Eac Edge Program Is Successful Introduction Te Economic Diversification and Growt Enterprises Act became effective on 1 January 1995. Te creation of tis Act was to encourage new businesses to start or expand in Newfoundland and Labrador.

More information

High-fidelity electromagnetic modeling of large multi-scale naval structures

High-fidelity electromagnetic modeling of large multi-scale naval structures High-fidelity electromagnetic modeling of large multi-scale naval structures F. Vipiana, M. A. Francavilla, S. Arianos, and G. Vecchi (LACE), and Politecnico di Torino 1 Outline ISMB and Antenna/EMC Lab

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

Large-Scale Reservoir Simulation and Big Data Visualization

Large-Scale Reservoir Simulation and Big Data Visualization Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)

More information

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE 1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

More information

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,

More information

Strategic trading in a dynamic noisy market. Dimitri Vayanos

Strategic trading in a dynamic noisy market. Dimitri Vayanos LSE Researc Online Article (refereed) Strategic trading in a dynamic noisy market Dimitri Vayanos LSE as developed LSE Researc Online so tat users may access researc output of te Scool. Copyrigt and Moral

More information

A PARALLEL GEOMETRIC MULTIGRID METHOD FOR FINITE ELEMENTS ON OCTREE MESHES

A PARALLEL GEOMETRIC MULTIGRID METHOD FOR FINITE ELEMENTS ON OCTREE MESHES A PARALLEL GEOMETRIC MULTIGRID METHOD FOR FINITE ELEMENTS ON OCTREE MESHES RAHUL S. SAMPATH AND GEORGE BIROS Abstract. In this article, we present a parallel geometric multigrid algorithm for solving elliptic

More information

Tangent Lines and Rates of Change

Tangent Lines and Rates of Change Tangent Lines and Rates of Cange 9-2-2005 Given a function y = f(x), ow do you find te slope of te tangent line to te grap at te point P(a, f(a))? (I m tinking of te tangent line as a line tat just skims

More information

SWITCH T F T F SELECT. (b) local schedule of two branches. (a) if-then-else construct A & B MUX. one iteration cycle

SWITCH T F T F SELECT. (b) local schedule of two branches. (a) if-then-else construct A & B MUX. one iteration cycle 768 IEEE RANSACIONS ON COMPUERS, VOL. 46, NO. 7, JULY 997 Compile-ime Sceduling of Dynamic Constructs in Dataæow Program Graps Soonoi Ha, Member, IEEE and Edward A. Lee, Fellow, IEEE Abstract Sceduling

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

A system to monitor the quality of automated coding of textual answers to open questions

A system to monitor the quality of automated coding of textual answers to open questions Researc in Official Statistics Number 2/2001 A system to monitor te quality of automated coding of textual answers to open questions Stefania Maccia * and Marcello D Orazio ** Italian National Statistical

More information

Optimizing Desktop Virtualization Solutions with the Cisco UCS Storage Accelerator

Optimizing Desktop Virtualization Solutions with the Cisco UCS Storage Accelerator Optimizing Desktop Virtualization Solutions wit te Cisco UCS Accelerator Solution Brief February 2013 Higligts Delivers linear virtual desktop storage scalability wit consistent, predictable performance

More information

Catalogue no. 12-001-XIE. Survey Methodology. December 2004

Catalogue no. 12-001-XIE. Survey Methodology. December 2004 Catalogue no. 1-001-XIE Survey Metodology December 004 How to obtain more information Specific inquiries about tis product and related statistics or services sould be directed to: Business Survey Metods

More information

2 Limits and Derivatives

2 Limits and Derivatives 2 Limits and Derivatives 2.7 Tangent Lines, Velocity, and Derivatives A tangent line to a circle is a line tat intersects te circle at exactly one point. We would like to take tis idea of tangent line

More information

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Massively Parallel Multilevel Finite

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

- 1 - Handout #22 May 23, 2012 Huffman Encoding and Data Compression. CS106B Spring 2012. Handout by Julie Zelenski with minor edits by Keith Schwarz

- 1 - Handout #22 May 23, 2012 Huffman Encoding and Data Compression. CS106B Spring 2012. Handout by Julie Zelenski with minor edits by Keith Schwarz CS106B Spring 01 Handout # May 3, 01 Huffman Encoding and Data Compression Handout by Julie Zelenski wit minor edits by Keit Scwarz In te early 1980s, personal computers ad ard disks tat were no larger

More information

Overview of Component Search System SPARS-J

Overview of Component Search System SPARS-J Overview of omponent Searc System Tetsuo Yamamoto*,Makoto Matsusita**, Katsuro Inoue** *Japan Science and Tecnology gency **Osaka University ac part nalysis part xperiment onclusion and Future work Motivation

More information

Fast Multipole Method for particle interactions: an open source parallel library component

Fast Multipole Method for particle interactions: an open source parallel library component Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,

More information

Projective Geometry. Projective Geometry

Projective Geometry. Projective Geometry Euclidean versus Euclidean geometry describes sapes as tey are Properties of objects tat are uncanged by rigid motions» Lengts» Angles» Parallelism Projective geometry describes objects as tey appear Lengts,

More information

HPC enabling of OpenFOAM R for CFD applications

HPC enabling of OpenFOAM R for CFD applications HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,

More information

h Understanding the safe operating principles and h Gaining maximum benefit and efficiency from your h Evaluating your testing system's performance

h Understanding the safe operating principles and h Gaining maximum benefit and efficiency from your h Evaluating your testing system's performance EXTRA TM Instron Services Revolve Around You It is everyting you expect from a global organization Te global training centers offer a complete educational service for users of advanced materials testing

More information

Unemployment insurance/severance payments and informality in developing countries

Unemployment insurance/severance payments and informality in developing countries Unemployment insurance/severance payments and informality in developing countries David Bardey y and Fernando Jaramillo z First version: September 2011. Tis version: November 2011. Abstract We analyze

More information

College Planning Using Cash Value Life Insurance

College Planning Using Cash Value Life Insurance College Planning Using Cas Value Life Insurance CAUTION: Te advisor is urged to be extremely cautious of anoter college funding veicle wic provides a guaranteed return of premium immediately if funded

More information

Tis Problem and Retail Inventory Management

Tis Problem and Retail Inventory Management Optimizing Inventory Replenisment of Retail Fasion Products Marsall Fiser Kumar Rajaram Anant Raman Te Warton Scool, University of Pennsylvania, 3620 Locust Walk, 3207 SH-DH, Piladelpia, Pennsylvania 19104-6366

More information

Multigrid preconditioning for nonlinear (degenerate) parabolic equations with application to monument degradation

Multigrid preconditioning for nonlinear (degenerate) parabolic equations with application to monument degradation Multigrid preconditioning for nonlinear (degenerate) parabolic equations with application to monument degradation M. Donatelli 1 M. Semplice S. Serra-Capizzano 1 1 Department of Science and High Technology

More information

The modelling of business rules for dashboard reporting using mutual information

The modelling of business rules for dashboard reporting using mutual information 8 t World IMACS / MODSIM Congress, Cairns, Australia 3-7 July 2009 ttp://mssanz.org.au/modsim09 Te modelling of business rules for dasboard reporting using mutual information Gregory Calbert Command, Control,

More information

Multivariate time series analysis: Some essential notions

Multivariate time series analysis: Some essential notions Capter 2 Multivariate time series analysis: Some essential notions An overview of a modeling and learning framework for multivariate time series was presented in Capter 1. In tis capter, some notions on

More information

Theoretical calculation of the heat capacity

Theoretical calculation of the heat capacity eoretical calculation of te eat capacity Principle of equipartition of energy Heat capacity of ideal and real gases Heat capacity of solids: Dulong-Petit, Einstein, Debye models Heat capacity of metals

More information

Channel Allocation in Non-Cooperative Multi-Radio Multi-Channel Wireless Networks

Channel Allocation in Non-Cooperative Multi-Radio Multi-Channel Wireless Networks Cannel Allocation in Non-Cooperative Multi-Radio Multi-Cannel Wireless Networks Dejun Yang, Xi Fang, Guoliang Xue Arizona State University Abstract Wile tremendous efforts ave been made on cannel allocation

More information

1. Case description. Best practice description

1. Case description. Best practice description 1. Case description Best practice description Tis case sows ow a large multinational went troug a bottom up organisational cange to become a knowledge-based company. A small community on knowledge Management

More information

A unified Energy Footprint for Simulation Software

A unified Energy Footprint for Simulation Software A unified Energy Footprint for Simulation Software H. Anzt, A. Beglarian, S. Chilingaryan, A. Ferrone, V. Heuveline, A. Kopmann No. 212-4 Preprint Series of the Engineering Mathematics and Computing Lab

More information

Wireless Interconnect for Board and Chip Level,

Wireless Interconnect for Board and Chip Level, 213 IEEE. Reprinted, wit permission, from Gerard P. Fettweis, Najeeb ul Hassan, Lukas Landau, and Erik Fiscer, Wireless Interconnect for Board and Cip Level, in Proceedings of te Design Automation and

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

More information

Determine the perimeter of a triangle using algebra Find the area of a triangle using the formula

Determine the perimeter of a triangle using algebra Find the area of a triangle using the formula Student Name: Date: Contact Person Name: Pone Number: Lesson 0 Perimeter, Area, and Similarity of Triangles Objectives Determine te perimeter of a triangle using algebra Find te area of a triangle using

More information

Dynamically Scalable Architectures for E-Commerce

Dynamically Scalable Architectures for E-Commerce MKWI 2010 E-Commerce und E-Business 1289 Dynamically Scalable Arcitectures for E-Commerce A Strategy for Partial Integration of Cloud Resources in an E-Commerce System Georg Lackermair 1,2, Susanne Straringer

More information

ACT Math Facts & Formulas

ACT Math Facts & Formulas Numbers, Sequences, Factors Integers:..., -3, -2, -1, 0, 1, 2, 3,... Rationals: fractions, tat is, anyting expressable as a ratio of integers Reals: integers plus rationals plus special numbers suc as

More information

Staffing and routing in a two-tier call centre. Sameer Hasija*, Edieal J. Pinker and Robert A. Shumsky

Staffing and routing in a two-tier call centre. Sameer Hasija*, Edieal J. Pinker and Robert A. Shumsky 8 Int. J. Operational Researc, Vol. 1, Nos. 1/, 005 Staffing and routing in a two-tier call centre Sameer Hasija*, Edieal J. Pinker and Robert A. Sumsky Simon Scool, University of Rocester, Rocester 1467,

More information

High Performance Matrix Inversion with Several GPUs

High Performance Matrix Inversion with Several GPUs High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República

More information

Operation go-live! Mastering the people side of operational readiness

Operation go-live! Mastering the people side of operational readiness ! I 2 London 2012 te ultimate Up to 30% of te value of a capital programme can be destroyed due to operational readiness failures. 1 In te complex interplay between tecnology, infrastructure and process,

More information

Haptic Manipulation of Virtual Materials for Medical Application

Haptic Manipulation of Virtual Materials for Medical Application Haptic Manipulation of Virtual Materials for Medical Application HIDETOSHI WAKAMATSU, SATORU HONMA Graduate Scool of Healt Care Sciences Tokyo Medical and Dental University, JAPAN wakamatsu.bse@tmd.ac.jp

More information

Best practices for efficient HPC performance with large models

Best practices for efficient HPC performance with large models Best practices for efficient HPC performance with large models Dr. Hößl Bernhard, CADFEM (Austria) GmbH PRACE Autumn School 2013 - Industry Oriented HPC Simulations, September 21-27, University of Ljubljana,

More information

Free Shipping and Repeat Buying on the Internet: Theory and Evidence

Free Shipping and Repeat Buying on the Internet: Theory and Evidence Free Sipping and Repeat Buying on te Internet: eory and Evidence Yingui Yang, Skander Essegaier and David R. Bell 1 June 13, 2005 1 Graduate Scool of Management, University of California at Davis (yiyang@ucdavis.edu)

More information

OpenFOAM Optimization Tools

OpenFOAM Optimization Tools OpenFOAM Optimization Tools Henrik Rusche and Aleks Jemcov h.rusche@wikki-gmbh.de and a.jemcov@wikki.co.uk Wikki, Germany and United Kingdom OpenFOAM Optimization Tools p. 1 Agenda Objective Review optimisation

More information

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises Floodless in SEATTLE: A Scalable Eternet Arcitecture for Large Enterprises Cangoon Kim Mattew Caesar Jennifer Rexford Princeton University Princeton University Princeton University Abstract IP networks

More information

Orchestrating Bulk Data Transfers across Geo-Distributed Datacenters

Orchestrating Bulk Data Transfers across Geo-Distributed Datacenters Tis article as been accepted for publication in a future issue of tis journal, but as not been fully edited Content may cange prior to final publication Citation information: DOI 101109/TCC20152389842,

More information

Code Verification and Numerical Accuracy Assessment for Finite Volume CFD Codes. Subrahmanya P. Veluri

Code Verification and Numerical Accuracy Assessment for Finite Volume CFD Codes. Subrahmanya P. Veluri Code Verification and Numerical Accuracy Assessment for Finite Volume CFD Codes Subramanya P. Veluri Dissertation submitted to te faculty of te Virginia Polytecnic Institute and State University in partial

More information

Artificial Neural Networks for Time Series Prediction - a novel Approach to Inventory Management using Asymmetric Cost Functions

Artificial Neural Networks for Time Series Prediction - a novel Approach to Inventory Management using Asymmetric Cost Functions Artificial Neural Networks for Time Series Prediction - a novel Approac to Inventory Management using Asymmetric Cost Functions Sven F. Crone University of Hamburg, Institute of Information Systems crone@econ.uni-amburg.de

More information

FORCED AND NATURAL CONVECTION HEAT TRANSFER IN A LID-DRIVEN CAVITY

FORCED AND NATURAL CONVECTION HEAT TRANSFER IN A LID-DRIVEN CAVITY FORCED AND NATURAL CONVECTION HEAT TRANSFER IN A LID-DRIVEN CAVITY Guillermo E. Ovando Cacón UDIM, Instituto Tecnológico de Veracruz, Calzada Miguel Angel de Quevedo 2779, CP. 9860, Veracruz, Veracruz,

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms P. E. Vincent! Department of Aeronautics Imperial College London! 25 th March 2014 Overview Motivation Flux Reconstruction Many-Core

More information

CUDA for Real Time Multigrid Finite Element Simulation of

CUDA for Real Time Multigrid Finite Element Simulation of CUDA for Real Time Multigrid Finite Element Simulation of SoftTissue Deformations Christian Dick Computer Graphics and Visualization Group Technische Universität München, Germany Motivation Real time physics

More information