Utilizing the quadruple-precision floating-point arithmetic operation for the Krylov Subspace Methods


Hidehiko Hasegawa

Abstract. Some large linear systems are difficult to solve by the Krylov subspace methods because their mathematical properties are unfavorable for such iterative methods. If more accurate calculation were available on computers, such problems could be solved easily. In general, however, multiple precision arithmetic operations are very costly on many computing environments, requiring more memory space and much more computation time. We show that quadruple-precision floating-point arithmetic is cost-effective for some classes of problems on new high-performance computers. This approach can achieve good performance without the difficulty of parallelizing and vectorizing some preconditioners.

1 Introduction

The Krylov subspace methods, including the Conjugate Gradient (CG) method [6], the BiConjugate Gradient (BiCG) method [3], and other methods [2], are widely used for solving large linear systems. These methods consist of a combination of matrix-vector operations Ax, A^T x, K^-1 r, K^-T r, an inner product x^T x, and other vector operations [1], where A is the coefficient matrix and K is a preconditioner. As an effective preconditioner K, an incomplete LU factorization of A [7][8] is often used; however, it includes a serial processing part that is not suitable for parallel and vector processing [1]. For many classes of problems the Krylov subspace methods show good convergence, but for some classes they may stagnate or fail to converge [1]. The following facts about the Krylov subspace methods for linear systems are known:

* The convergence is improved by more accurate computation [4].
* A mathematically good preconditioner, such as an incomplete factorization, includes a serial processing part, so it is difficult to obtain good performance on vector and parallel computers [1].
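The first fact, that convergence improves with more accurate computation, stems from rounding error in the inner products and recurrences. A small illustration of the effect (this is our own demo, not the paper's code, using Python's stdlib decimal module at 34 significant digits as a stand-in for quadruple precision):

```python
from decimal import Decimal, getcontext

# Emulate quadruple-like precision (roughly 34 significant digits,
# comparable to IEEE binary128) with the stdlib decimal module.
getcontext().prec = 34

# A dot-product-like sum with heavy cancellation: in double precision
# the small term is absorbed and then cancelled away entirely.
terms = [1.0e16, 1.0, -1.0e16]

double_sum = 0.0
for t in terms:
    double_sum += t          # 1.0e16 + 1.0 rounds back to 1.0e16

quad_like_sum = Decimal(0)
for t in terms:
    quad_like_sum += Decimal(t)  # every digit is retained

print(double_sum)     # 0.0 -> the contribution of 1.0 is lost
print(quad_like_sum)  # 1   -> recovered with extra precision
```

The same absorption happens, step by step, inside the residual recurrences of an iterative solver, which is why higher precision can rescue stagnating iterations.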
In general, multiple precision floating-point arithmetic operations are believed to be costly. However, if quadruple-precision floating-point arithmetic, one of the multiple precision arithmetic operations, has reasonable computation cost and good convergence, we may obtain a cost-effective solver based on the Krylov subspace methods. We consider the following:

School of Library, Information and Media Studies, University of Tsukuba, Tsukuba 305-8550, Japan (hasegawa@slis.tsukuba.ac.jp)

* Quadruple-precision floating-point arithmetic should give better convergence than double-precision arithmetic with an incomplete LU factorization (ILU) preconditioner; however, its computation cost may be expensive.
* Without a preconditioner, the Krylov subspace methods are easy to parallelize and vectorize.

In this paper, we examine the convergence and the computation time of double-precision floating-point arithmetic with the ILU(0) preconditioner, and of quadruple-precision floating-point arithmetic without any preconditioner, on several computing environments.

2 Numerical Comparison

The purpose of this paper is to compare quadruple-precision and double-precision floating-point arithmetic. However, the convergence and the performance strongly depend on the problem to be solved: the mathematical properties of the problem affect its convergence, and the structure of the problem, such as its sparsity, affects its performance. We solve the system by the BiCG method [1], which includes all of the basic matrix and vector operations of the Krylov subspace methods: Ax, A^T x, K^-1 r, K^-T r, x^T x, etc.

2.1 Model Problem

We use a Toeplitz matrix A of order N = 200 with a parameter γ. It is known that such a system is difficult to solve by the Krylov subspace methods when the parameter is large [5]. The important point is whether the behavior differs between quadruple-precision and double-precision floating-point arithmetic.

Conditions:

             ( 2  1             )
             ( 0  2  1          )
        A := ( γ  0  2  1       )
             (    γ  0  2  .    )
             (       .  .  .  . )

Parameter: γ = 1.3, 1.7, 2.1 and 2.5. Right-hand side vector: b = (1, 1, ..., 1)^T, and starting vector: x_0 = 0. Stopping criterion: ε = 10^-12, and maximum iteration count: 2000 steps.
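The model problem and the BiCG iteration can be sketched as follows (a minimal pure-Python double-precision sketch, not the author's Fortran code; the small N used here and the helper names are our own choices, made only to keep the demo fast):

```python
from math import sqrt

def matvec(gamma, x):
    """y = A x for the Toeplitz matrix: 2 on the diagonal, 1 on the first
    superdiagonal, 0 on the first subdiagonal, gamma on the second."""
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] += x[i + 1]
    for i in range(2, n):
        y[i] += gamma * x[i - 2]
    return y

def matvec_T(gamma, x):
    """y = A^T x (BiCG also needs products with the transpose)."""
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(1, n):
        y[i] += x[i - 1]
    for i in range(n - 2):
        y[i] += gamma * x[i + 2]
    return y

def bicg(gamma, b, tol=1e-12, maxiter=2000):
    """Unpreconditioned BiCG; returns (x, iterations, residual norm)."""
    n = len(b)
    x = [0.0] * n                      # starting vector x_0 = 0
    r = b[:]                           # r_0 = b - A x_0 = b
    rs = r[:]                          # shadow residual r*_0 = r_0
    p, ps = r[:], rs[:]
    rho = sum(a * c for a, c in zip(rs, r))
    bnorm = sqrt(sum(v * v for v in b))
    rnorm = bnorm
    for it in range(1, maxiter + 1):
        ap = matvec(gamma, p)
        alpha = rho / sum(a * c for a, c in zip(ps, ap))
        atps = matvec_T(gamma, ps)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs = [ri - alpha * ai for ri, ai in zip(rs, atps)]
        rnorm = sqrt(sum(v * v for v in r))
        if rnorm <= tol * bnorm:
            return x, it, rnorm
        rho_new = sum(a * c for a, c in zip(rs, r))
        beta = rho_new / rho
        rho = rho_new
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        ps = [ri + beta * pi for ri, pi in zip(rs, ps)]
    return x, maxiter, rnorm

N, gamma = 50, 1.3
b = [1.0] * N
x, iters, res = bicg(gamma, b)
print(iters, res)
```

Note that every kernel here (matvec, AXPY-style updates, dot products) is element-wise parallelizable, which is the point made above about the unpreconditioned method.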

2.2 Convergence

We examine the convergence under the following conditions (1-3) on a Sun Enterprise 3000 workstation:

1. Double-precision floating-point arithmetic without a preconditioner.
2. Double-precision floating-point arithmetic with the ILU(0) preconditioner.
3. Quadruple-precision floating-point arithmetic without a preconditioner (the compiler's auto-quadruple option is used, without any tuning).

Table 1: Iteration counts, computation time (seconds) and true residual norm on a Sun Enterprise 3000 workstation.

                               γ=1.3        γ=1.7        γ=2.1        γ=2.5
1. Double       Iter.          188          No           No           No
                Time           1.7×10^-1    --           --           --
                True residual  3.1×10^-13   --           --           --
2. Double       Iter.          40           75           148          179
   + ILU(0)     Time           6.8×10^-2    8.5×10^-2    1.1×10^-2    1.4×10^-2
                True residual  1.7×10^-13   5.6×10^-13   8.3×10^-10   5.2×10^-2
3. Quadruple    Iter.          122          152          255          446
                Time           2.0×10^-2    5.3×10^-2    5.5×10^-3    1.9×10^-5
                True residual  6.6×10^-13   6.4×10^-13   3.6×10^-13   5.4×10^-13

This comparison is shown in Table 1. We obtained the following results:

* The residual norm does not converge with double-precision arithmetic without a preconditioner for γ = 1.7, 2.1 and 2.5.
* We cannot obtain an approximate solution with double-precision arithmetic with the ILU(0) preconditioner for γ = 2.1 and 2.5.
* For all parameters, we obtain the solution only when quadruple-precision arithmetic is used.
* A lot of computation time is required for quadruple-precision arithmetic on the Sun Enterprise 3000.

2.3 Computation time

The computation time per iteration depends on the hardware, and the weight of each basic operation depends on the machine type. For example, since the ILU(0) preconditioner is not vectorized on a vector computer, its computation time is longer than that of the other basic operations,

and its weight is much higher. We investigate the computation time under the following conditions (1-2) on five computing environments. Here, the order of the Toeplitz matrix is 10000. In the tables below, "ratio" means the computation time per iteration of quadruple-precision arithmetic without a preconditioner divided by that of double-precision arithmetic with the ILU(0) preconditioner. The first two rows of each table are the matrix-vector products Ax and A^T x; PSOLVE and PSOLVET mean K^-1 r and K^-T r, respectively; DAXPY, DDOT and DNORM2 cover x^T x and the other main vector operations.

1. Double-precision floating-point arithmetic with the ILU(0) preconditioner.
2. Quadruple-precision floating-point arithmetic without a preconditioner (the compiler's auto-quadruple option is used, without any tuning).

Table 2: Results on a workstation: Sun Enterprise 3000 (time per iteration).

                  Double + ILU(0)         Quadruple            ratio
Ax                4.69×10^-3 s (17.5%)    0.149 s (24.7%)      31.7
A^T x             4.89×10^-3 s (18.2%)    0.159 s (26.4%)      32.5
DAXPY, DDOT,      7.47×10^-3 s (27.8%)    0.290 s (48.1%)      38.8
DNORM2
PSOLVE            4.87×10^-3 s (18.1%)    --                   --
PSOLVET           4.92×10^-3 s (18.3%)    --                   --
Total             2.68×10^-2 s            0.602 s              22.4

Table 3: Results on a vector computer: Fujitsu VPP800 (time per iteration).

                  Double + ILU(0)         Quadruple            ratio
Ax                3.92×10^-5 s (1.06%)    1.20×10^-3 s (9.52%) 30.6
A^T x             3.93×10^-5 s (1.07%)    1.20×10^-3 s (9.52%) 30.5
DAXPY, DDOT,      1.21×10^-3 s (32.9%)    1.01×10^-2 s (80.1%) 8.3
DNORM2
PSOLVE            1.35×10^-3 s (36.7%)    --                   --
PSOLVET           1.03×10^-3 s (28.0%)    --                   --
Total             3.67×10^-3 s            1.26×10^-2 s         3.4

Table 4: Results on an SMP: Hitachi SR8000 (time per iteration).

                  Double + ILU(0)         Quadruple            ratio
Ax                1.01×10^-4 s (3.5%)     6.12×10^-4 s (11.3%) 6.0
A^T x             9.96×10^-5 s (3.4%)     6.01×10^-4 s (11.1%) 6.0
DAXPY, DDOT,      7.81×10^-4 s (27.2%)    4.19×10^-3 s (77.5%) 5.3
DNORM2
PSOLVE            7.38×10^-4 s (25.7%)    --                   --
PSOLVET           1.15×10^-3 s (40.9%)    --                   --
Total             2.87×10^-3 s            5.40×10^-3 s         1.8

Table 5: Results on a mainframe: Hitachi MP5000 (time per iteration).

                  Double + ILU(0)         Quadruple            ratio
Ax                1.35×10^-3 s (16.9%)    4.68×10^-3 s (27.3%) 3.4
A^T x             1.34×10^-3 s (16.8%)    4.72×10^-3 s (27.6%) 3.5
DAXPY, DDOT,      2.44×10^-3 s (30.6%)    7.19×10^-3 s (42.0%) 2.9
DNORM2
PSOLVE            1.46×10^-3 s (18.3%)    --                   --
PSOLVET           1.35×10^-3 s (16.9%)    --                   --
Total             7.96×10^-3 s            1.71×10^-2 s         2.1
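The per-iteration totals in Tables 2-5 imply a simple break-even rule: quadruple precision without a preconditioner pays off whenever its iteration count falls below the double+ILU(0) count divided by the time ratio. A small sketch of this arithmetic (machine names and per-iteration totals transcribed from the tables above):

```python
# Per-iteration totals (seconds): (double + ILU(0), quadruple),
# transcribed from Tables 2-5.
totals = {
    "Sun Enterprise 3000": (2.68e-2, 0.602),
    "Fujitsu VPP800":      (3.67e-3, 1.26e-2),
    "Hitachi SR8000":      (2.87e-3, 5.40e-3),
    "Hitachi MP5000":      (7.96e-3, 1.71e-2),
}
for machine, (t_double_ilu, t_quad) in totals.items():
    ratio = t_quad / t_double_ilu
    # Quadruple precision without a preconditioner wins in total time
    # whenever its iteration count is below 1/ratio of the
    # double + ILU(0) iteration count.
    print(f"{machine}: ratio {ratio:.1f}, "
          f"break-even iteration fraction {1.0 / ratio:.2f}")
```

On the SR8000 and MP5000, for example, the break-even fraction is roughly one half, which is exactly the condition discussed in the text below.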

Next, we examine the speed-up of quadruple-precision arithmetic without the ILU(0) preconditioner. We show the results on a Fujitsu VPP700 vector parallel computer used purely as a parallel computer, i.e., without its vector facility.

Table 6: Speed-up at quadruple precision: Fujitsu VPP700 (N = 10000).

                  1PE             2PE             4PE             8PE
Ax                317.4 s (1.0)   158.4 s (2.0)   79.3 s (4.0)    39.8 s (7.9)
A^T x             268.6 s (1.0)   136.1 s (1.9)   68.3 s (3.9)    33.9 s (7.9)
DAXPY, DDOT,      839.4 s (1.0)   416.9 s (2.0)   211.4 s (3.9)   109.1 s (7.6)
DNORM2
Total             1425.4 s (1.0)  711.4 s (2.0)   359 s (3.9)     182.8 s (7.8)

From Tables 2-6 we obtained the following results:

* On the ordinary workstation Sun Enterprise 3000, quadruple-precision arithmetic is almost thirty times slower than double-precision arithmetic; on the classic mainframe Hitachi MP5000, however, the ratio is about four.
* On the vector computer Fujitsu VPP800, the vector processing facility works well for double-precision arithmetic: the ratio for the matrix-vector operations is about thirty and the ratio for the other vector operations is about eight, but the ILU(0) preconditioning is a very costly operation on this machine.
* On the SMP Hitachi SR8000, quadruple-precision arithmetic is only about six times slower than double-precision arithmetic for all types of basic operations, and the ILU(0) preconditioning is also costly.
* The ratio per iteration between double-precision arithmetic with the ILU(0) preconditioner and quadruple-precision arithmetic without any preconditioner is about two on the SMP Hitachi SR8000 and the mainframe Hitachi MP5000, and about three on the newer vector computer Fujitsu VPP800.
* The speed-up of quadruple-precision arithmetic without any preconditioner on the Fujitsu VPP700 is ideal.
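The serial bottleneck of ILU(0) noted in these results comes from its triangular solves, whose loop-carried dependence resists vectorization. A minimal sketch of the contrast (our own illustration with a hypothetical unit lower bidiagonal factor, not the paper's ILU(0) code):

```python
# Forward substitution y = L^{-1} r for a unit lower bidiagonal factor,
# a stand-in for the triangular solves inside ILU(0) preconditioning.
# Each y[i] needs y[i-1], so the loop cannot be split across processors
# or vector pipes: this is the serial processing part discussed above.
def forward_subst(sub, r):
    """L has ones on the diagonal and `sub` on the first subdiagonal."""
    y = [0.0] * len(r)
    for i in range(len(r)):
        y[i] = r[i] - (sub[i - 1] * y[i - 1] if i > 0 else 0.0)
    return y

# By contrast, an AXPY-style update has no dependence between elements,
# so every element can be computed in parallel.
def axpy(alpha, x, y):
    return [alpha * xi + yi for xi, yi in zip(x, y)]

print(forward_subst([0.5, 0.5], [1.0, 1.0, 1.0]))
```

This is why the unpreconditioned quadruple-precision iteration, built only from dependence-free kernels, parallelizes ideally on the VPP700.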
On the Hitachi MP5000 and SR8000, if the iteration count of quadruple-precision arithmetic without any preconditioner is half that of double-precision arithmetic with the ILU(0) preconditioner, then the total computation time

becomes the same. It is also possible to tune the code for both quadruple-precision and double-precision arithmetic. Almost the same conclusion holds for the results on the Fujitsu VPP800. On the Fujitsu VPP700 used as a simple parallel computer (without its vector facility), quadruple-precision arithmetic without any preconditioner achieves an ideal speed-up. This is the greatest and simplest result, because parallelizing a mathematically powerful preconditioner is a very difficult problem, whereas more accurate calculation may work well in place of the preconditioner. In addition, for quadruple-precision arithmetic the memory size is doubled, while the computation time may be more than seven times longer. This suits RISC-based computers, especially parallel computers, because of the small amount of data to be transferred per computation. These quadruple-precision results were measured by only changing a Fortran compiler option, without any tuning; it should be possible to reduce the computation time of quadruple-precision arithmetic by the following improvements:

* Use quadruple-precision arithmetic together with a mild (not so robust) preconditioner, i.e., one that is parallelizable and vectorizable.
* Tune the code.

A minor weak point is that the memory required for quadruple-precision arithmetic is twice that of double-precision arithmetic. However, many high-performance computers have enough memory.

3 Conclusion

Quadruple-precision floating-point arithmetic is believed to be costly. But this is not true when some large linear systems of equations are solved by the Krylov subspace methods.
On new high-performance computers, quadruple-precision floating-point arithmetic may become cost-effective, because its convergence is accelerated by the more accurate computation and it can be parallelized scalably without any difficulty. In this paper, however, we could not show that the total computation time of quadruple-precision arithmetic is actually shorter than that of double-precision arithmetic with the ILU(0) preconditioner for solving a large linear system by the Krylov subspace methods. Also, the problem tested in this paper is not a practical one. We found that quadruple-precision arithmetic gives good convergence, and that it can bring better convergence results than double-precision arithmetic. Moreover, the computation time of quadruple-precision arithmetic is only about twice that of double-precision arithmetic when the vector facility or multiple CPUs are used. On the other hand, the ILU(0) preconditioner is not useful for reducing the computation time on vector and parallel computers because

the preconditioning includes a serial processing part. We expect that the computation time would be shorter if we employed a mild preconditioner, one that can be vectorized and parallelized. We propose utilizing quadruple-precision floating-point arithmetic with some mild preconditioner, on new high-performance computing environments, to get good performance for those classes of problems that are difficult to solve by double-precision arithmetic with the ILU(0) preconditioner. Finally, we believe that quadruple-precision floating-point arithmetic should be utilized on new high-performance computing environments, and that their high performance should be spent on the quality of computation.

References

[1] Barrett, R., Berry, M., Chan, T., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.
[2] Bruaset, A. M., A Survey of Preconditioned Iterative Methods, Frontiers in Applied Mathematics 17, Longman Scientific and Technical, London, 1995.
[3] Fletcher, R., Conjugate gradient methods for indefinite systems, in Numerical Analysis Dundee 1975, ed. by Watson, G., Lecture Notes in Mathematics, 506 (1976), Springer-Verlag, pp. 73-89.
[4] Greenbaum, A., Iterative Methods for Solving Linear Systems, SIAM, Philadelphia, 1997.
[5] Gutknecht, M. H., Variants of BiCGSTAB for Matrices with Complex Spectrum, SIAM J. Sci. Comput., 14 (1993), pp. 1020-1033.
[6] Hestenes, M. R. and Stiefel, E., Methods of Conjugate Gradients for Solving Linear Systems, J. Res. Nat. Bur. Standards, 49 (1952), pp. 409-435.
[7] Meijerink, J. A. and van der Vorst, H. A., An Iterative Solution Method for Linear Systems of which the Coefficient Matrix is a Symmetric M-matrix, Mathematics of Computation, 31 (1977), pp. 148-162.
[8] Saad, Y., Iterative Methods for Sparse Linear Systems, PWS, Boston, 1996.