Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations

by Tzu-Yi Chen

B.S. (Massachusetts Institute of Technology) 1995
B.S. (Massachusetts Institute of Technology) 1995
M.S. (University of California, Berkeley) 1998

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:
Professor James W. Demmel, Chair
Professor Gregory Fenves
Professor Jonathan Shewchuk

Fall 2001
The dissertation of Tzu-Yi Chen is approved:

Chair                                              Date

                                                   Date

                                                   Date

University of California at Berkeley
Fall 2001
Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations Copyright 2001 by Tzu-Yi Chen
Abstract

Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations

by Tzu-Yi Chen

Doctor of Philosophy in Computer Science

University of California at Berkeley

Professor James W. Demmel, Chair

Informally, given a problem to solve and a method for solving it, a preconditioner transforms the problem into one with more desirable properties for the solver. The solver may take less time to find the solution to the new problem, it may compute a more accurate solution, or both. The preconditioned system is solved and the solution is transformed back into the solution of the original problem. In this dissertation we look at the role of preconditioners in finding the eigenvalues of sparse matrices and in solving sparse systems of linear equations. A sparse matrix is one with so many zero entries that either only the nonzero elements and their locations in the matrix are stored, or the matrix is not given explicitly and one can only get the results of multiplying the matrix (and sometimes its transpose) by arbitrary vectors. The eigenvalues of a matrix A are the λ such that Ax = λx, where x is referred to as the (right) eigenvector corresponding to λ. Numerical algorithms that compute the eigenvalues of a nonsymmetric matrix A typically have backward errors proportional to the norm of A, so it can be useful to precondition an n × n matrix A in such a way that its norm is reduced and its eigenvalues are preserved. We focus on balancing A, in other words finding a diagonal matrix D such that for 1 ≤ i ≤ n the norm of row i and column i of DAD^{-1} are the same. Interestingly, there are many relationships between balancing in certain vector norms and minimizing varied matrix norms. For example, in [143] Osborne shows balancing a matrix in the 2-norm also minimizes the Frobenius norm of DAD^{-1} over all D up to scalar multiples. We summarize results known about balancing in other
norms before defining balancing in a weighted norm and proving that this minimizes the 2-norm for nonnegative, irreducible A. We use our results on balancing in a weighted norm to justify a set of novel Krylov-based balancing algorithms which approximate weighted balancing and which never explicitly access individual entries of A. By using only matrix-vector (Ax), and sometimes matrix-transpose-vector (A^T x), multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A^T x) is available. We then show that for matrices from our test suite, these Krylov-based balancing algorithms do, in fact, often improve the accuracy to which eigenvalues are computed by dense or sparse eigensolvers. For our test matrices, Krylov-based balancing improved the accuracy of eigenvalues computed by sparse eigensolvers by up to 10 decimal places. In addition, Krylov-based balancing can also improve the condition number of eigenvalues, hence giving better computed error bounds. For solving sparse systems of linear equations the problem is to find a vector x such that Ax = b, where A is a square nonsingular matrix and b is some given vector. Algorithms for finding x can be classified as either direct or iterative: direct methods typically compute the LU factorization of A and solve for x through two triangular solves; iterative methods such as conjugate gradient iteratively improve on an initial guess to x. Though direct methods are considered robust, they can require large amounts of memory if the L and U factors have many more nonzero elements than the matrix A. On the other hand, though iterative methods require less space, they are also less robust than direct methods and their behavior is not as well understood. Fortunately, preconditioning can help with some of these issues. For example, preconditioners can be used to reduce the number of nonzero elements in the L and U factors of A, or to improve the likelihood of an iterative method converging quickly to the actual solution vector. We begin by discussing preconditioners for direct solvers, starting with several algorithms for reordering the rows and columns of A prior to factoring it. We present data comparing the results of decomposing matrices with a nonsymmetric permutation to results from using a symmetric permutation. For one matrix the size of the largest block found with a nonsymmetric permutation is a tenth of the size of the largest block found with a symmetric permutation, which can greatly reduce the subsequent factorization time. We also note that using a stability ordering in concert with a column approximate minimum
degree ordering can lead to L and U factors with significantly more or fewer nonzero elements than those computed after using the sparsity ordering alone. Focussing on a specific algorithm for reordering A to reduce fill, we then describe our design and implementation of a threaded column approximate minimum degree algorithm. Though we worked hard to avoid the effects of many known parallel pitfalls, our final implementation never achieved a speedup of more than 3 on 8 processors of an SGI Power Challenge machine, and more typically there was virtually no speedup. By analyzing the performance of our code in detail, we provide a better understanding of the difficulties of efficiently implementing algorithms with fine-grained parallelism even in a shared memory environment. Finally we turn to incomplete LU (ILU) factorizations, a family of preconditioners often used with iterative solvers. We propose a modification to a standard ILU scheme and show that it makes better use of the memory the user has available, leading to a greater likelihood of convergence for preconditioned GMRES(50), the iterative solver used in our studies. By looking at data gathered from tens of thousands of test runs combining matrices with different ILU algorithms, parameter settings, scaling algorithms, and ordering algorithms, we draw some conclusions about the effects of different ordering algorithms on the convergence of ILU-preconditioned GMRES(50). We find, for example, that both ordering for stability and partial pivoting are necessary for achieving the best convergence results.

Professor James W. Demmel
Dissertation Committee Chair
i Contents List of Figures List of Tables iii iv 1 Introduction 1 1.1 Sparse systems.................................. 2 1.1.1 Storage of sparse matrices........................ 3 1.1.2 Sparse matrix algorithms........................ 4 1.2 Roles of preconditioning............................. 5 1.3 Notation and Definitions............................. 6 1.3.1 Matrix notation and definitions..................... 6 1.3.2 Graph representations of matrices................... 7 1.3.3 Relationships between A, DG(A), and BG(A)............ 8 1.4 Test Matrices................................... 9 1.5 Contributions................................... 10 2 Preconditioning sparse matrices for computing eigenvalues 12 2.1 Decomposing the matrix............................. 13 2.1.1 The Parlett-Reinsch Algorithm..................... 14 2.1.2 The Strongly Connected Components Algorithm........... 15 2.1.3 Comparisons............................... 16 2.2 Balancing..................................... 18 2.2.1 Theory.................................. 19 2.2.2 Parlett-Reinsch balancing algorithm.................. 23 2.2.3 Krylov balancing algorithms....................... 25 2.3 Results....................................... 32 2.3.1 Balancing and Dense Eigensolvers................... 33 2.3.2 Balancing and Sparse Eigensolvers................... 37 2.4 Conclusions.................................... 39 3 Preconditioning sparse linear systems of equations 41 3.1 Decomposing the matrix............................. 44 3.2 Ordering for sparsity............................... 45 3.2.1 Background................................ 48
ii 3.2.2 Approximate column minimum degree code for symmetric multiprocessors................................... 54 3.3 Ordering for stability............................... 66 3.3.1 History.................................. 67 3.3.2 Observations............................... 68 3.3.3 Relationship to other orderings..................... 69 3.4 ILU preconditioners............................... 69 3.4.1 History of IC and ILU preconditioners................. 72 3.4.2 Experimental setup............................ 83 3.4.3 The ILUTP Push algorithm....................... 87 3.4.4 Effects of orderings............................ 97 3.4.5 Summary of experiments........................ 108 3.5 Conclusion.................................... 109 4 Conclusion 112 Bibliography 114 A Test matrices for chapter 2 131 B Test matrices for chapter 3 133
iii List of Figures 1.1 Example of a matrix stored in column compressed format.......... 3 2.1 Example of Parlett-Reinsch decomposition................... 15 2.2 Example of strongly connected components decomposition.......... 16 2.3 Pseudocode for the iterative balancing algorithm................ 24 2.4 Pseudocode for KrylovAz............................ 28 2.5 Pseudocode for KrylovAz if A not given explicitly.............. 29 2.6 Pseudocode for KrylovAtz........................... 30 2.7 Accuracy of the eigenvalues of qh768 computed with and without direct balancing........................................ 34 2.8 Accuracy of the eigenvalues of tols2000 computed with and without direct balancing...................................... 35 2.9 Accuracy of the eigenvalues of qh768 computed with and without Krylovbased balancing.................................. 36 2.10 Accuracy of the eigenvalues of tols2000 computed with and without Krylovbased balancing.................................. 36 2.11 Relative accuracy of the largest and smallest eigenvalues of qh768 computed with Krylov-based balancing........................... 38 2.12 Relative accuracy of the largest and smallest eigenvalues of tol2000 computed with Krylov-based balancing........................... 38 3.1 Pseudocode for parallel approximate minimum degree algorithm....... 55 3.2 Pseudocode for ILUTP.............................. 89 3.3 Number of nonzeros in each row of the incomplete factors of shyy41 and vavasis1....................................... 91 3.4 Pseudocode for ILUTP Push........................... 92 3.5 Amount of fill in complete LU factors...................... 96
iv List of Tables 2.1 Effects of different symmetric decomposition algorithms........... 17 2.2 Summary of known results on matrix norm minimization via diagonal scaling 23 2.3 Summary of known results on balancing matrices............... 24 2.4 Effect of Krylov balancing algorithms on matrix norms............ 31 3.1 Decompositions with scc vs. dmperm. Part 1.................. 46 3.2 Decompositions with scc vs. dmperm. Part 2.................. 47 3.3 Number of iterations taken by threaded column approximate minimum degree code with different parameter settings................... 60 3.4 Breakdown of time taken by threaded column approximate minimum degree algorithm...................................... 65 3.5 nnz(l + U) for different orderings. Part I.................... 70 3.6 nnz(l + U) for different orderings. Part II................... 71 3.7 Summary of packages including IC or ILU algorithms............. 82 3.8 Number of systems that converge with ILUTP and varied amounts of fill.. 88 3.9 Number of systems that converge with ILUTP and space used by factors.. 90 3.10 Number of systems that converge with ILUTP vs. ILUTP Push....... 94 3.11 Number of matrices that converge with ILUTP vs. ILUTP Push with high fill and various pivtol............................... 95 3.12 Number of systems converging with ILUTP Push and space used by factors. 95 3.13 Number of systems that converge with different orderings for various levels of ILU(k)...................................... 100 3.14 Number of systems that converge with different orderings and ILUTP Push with varied amounts of fill............................ 101 3.15 Number of systems that converge with ILU(k) and ILUTP Push with nnz(ˆl+ Û) = nnz(a).................................... 102 3.16 Effects of pivtol on convergence of ILUTP Push with different sparsity orderings......................................... 102 3.17 Number of systems that converge with ILU(k) with MC64 and different sparsity orderings.................................... 104 3.18 Number of systems that converge with ILUTP Push and MC64, but with different sparsity orderings............................ 104
3.19 Comparing ILU(k) and ILUTP Push with MC64 and fixed parameter values, but different sparsity orderings.......................... 105 3.20 Effects of different pivtol values and sparsity orderings on ILUTP Push with MC64........................................ 106 3.21 Number of systems converging with ILU(k), MC64 with scaling, and different sparsity orderings................................. 107 3.22 Number of systems converging with ILUTP Push, MC64 with scaling, and different sparsity orderings............................ 107 3.23 Difference between ILU(k) and ILUTP Push with MC64 and scaling, but different sparsity orderings............................ 108 3.24 Effects of pivtol on convergence of ILUTP Push with MC64 and scaling, but different sparsity orderings............................ 108 v
Acknowledgements

For helping with research, I should first thank my advisor Jim Demmel, my committee members Jonathan Shewchuk and Greg Fenves, and my qualifying examination chair Kathy Yelick. I would also like to thank Sivan Toledo and John Gilbert for having me spend a summer at Xerox PARC; and Esmond Ng for having me spend two summers at NERSC. Other people I have had useful discussions with include: Beresford Parlett (on balancing), David Hysom (on ILU preconditioners), Brent Chun and Fred Wong (on the innards of the Berkeley NOW and Millennium), and Henry Cohn (on a variety of math topics). Of course, I also need to thank the agencies whose grants funded me. This research was supported in part by an NSF graduate fellowship and in part by LLNL Memorandum Agreement No. B504962 under the Department of Energy under DOE Contract No. W-7405-ENG-48, and the National Science Foundation under NSF Cooperative Agreement No. ACI-9619020, and DOE subcontract to Argonne, no. 951322401. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred. Some of my richest experiences at Berkeley had nothing to do with research. Linh, Thanh Thao, Herman, Rahel and Leya, Peter, Tsedenia, Kedest, Salma, Saana, Fatima, and many others: thank you for giving me a clearer sense of myself and a more complete picture of our world. And, of course, my thanks to my family and Chris.
1 Chapter 1 Introduction As processor speeds and storage capacities increase, people expect computers both to solve existing problems more quickly and to solve larger and more complex problems. In the field of linear algebra, the latter corresponds to solving systems with large, potentially ill-conditioned matrices. The storage requirements for large matrices can sometimes be reduced if they are sparse, ie. if many of the matrix entries are zero. Furthermore, the time and memory needed to solve a large system can sometimes be reduced by preconditioning the system prior to solving it. Informally, to precondition a system prior to computing a solution is to transform it into one with more desirable properties. The solution to the altered system is computed, and transformed into the solution of the original problem. The advantages of solving the modified system can include more accurate results, decreased running time, reduced memory requirements, or some combination of these. What makes a preconditioner desirable can depend on both the problem and the solution method. In this dissertation we look at preconditioners for two classes of linear algebra problems: eigenproblems and linear systems. In chapter 2 we discuss preconditioning for sparse eigenproblems, considering both how to permute a matrix to decompose it, and how to balance a matrix to improve the accuracy of its computed eigenvalues. In chapter 3 we turn to preconditioning for linear systems. We discuss heuristics for permuting the rows and columns to achieve goals such as decomposing the matrix, reducing the number of nonzeros in the factors, and stabilizing the matrix. We then look at incomplete LU factorizations, a class of preconditioners for iterative solvers. Before turning to preconditioners for specific problems, we first give brief overviews
of storage and algorithmic issues concerning sparse matrices, and of the roles of preconditioners for direct and iterative solvers. We then define some of the matrix notation and graph representations used throughout this report, and end with a summary of our contributions.

1.1 Sparse systems

As noted, we are interested primarily in sparse matrices, which can be thought of as n by m matrices with enough zero elements to make storing only the nonzero elements, and not all nm entries, worthwhile. Sparse algorithms which ignore the zero elements can sometimes be faster than their dense counterparts which operate on all entries. For example, consider an n by n diagonal matrix. Clearly storing only the n nonzero diagonal entries is cheaper than storing all n^2 matrix elements. Clearly algorithms which ignore the zero off-diagonal elements can be far more efficient than those which do not. For example, if computing the diagonal matrix times a vector, the standard dense algorithm computes n dot products of length n, whereas a sparse algorithm operating only on the nonzero elements requires n scalar multiplications.

In practice, sparse matrices arise in many application areas. For example, when simulating the effects of applying heat to a plate or the flow of air around an airplane wing, the first step is often to model the plate or the wing by putting a mesh on it. This mesh can be seen as an undirected graph with v vertices and e edges. If we translate this graph into a matrix, a process better described in section 1.3.2, the matrix is a v × v matrix with 2e nonzeros. Since two vertices are connected by an edge only if they are near each other in the physical object, the number of edges is much smaller than v^2/2. If we model a 2D square plate by putting an n × n mesh on it, the matrix will have v = n^2 rows and columns, and only 5n^2 − 4n, rather than n^4, nonzero entries.

As the matrix-vector multiplication example shows, we need both data structures for storing sparse matrices and algorithms that take advantage of the sparse storage formats. Because sparse matrices are less structured than dense matrices, creating either of the two can be challenging.
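To make the operation-count comparison concrete, the following C sketch contrasts a dense matrix-vector product with one that exploits the diagonal structure; the function names are purely illustrative and not taken from any package.

```c
#include <stddef.h>

/* Dense y = A*x for an n-by-n matrix stored row-major: n dot products of
 * length n, or O(n^2) work, even if most entries of A are zero. */
void dense_matvec(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += A[i*n + j] * x[j];
        y[i] = s;
    }
}

/* Sparse y = D*x for a diagonal matrix stored as the vector d of its n
 * nonzero diagonal entries: n scalar multiplications and O(n) storage. */
void diag_matvec(size_t n, const double *d, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = d[i] * x[i];
}
```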
1.1.1 Storage of sparse matrices

Traditionally dense matrices have been stored as a two-dimensional array in either column-major or row-major order, though more recent work suggests performance advantages to using recursive layouts [4, 67, 94]. For sparse matrices, on the other hand, significantly more storage methods are used. The range of possibilities comes from the fact that matrices from different applications have different nonzero structures (e.g., that the diagonal of the matrix may be nonzero, or that the matrix has a narrow band), and that different matrix representations allow for efficient implementation of different operations.

Column-compressed format is a very popular sparse matrix representation that some large matrix repositories, including the Harwell-Boeing collection [55] and the University of Florida sparse matrix collection [47], use. As the small example in figure 1.1 shows, the column-compressed format stores a sparse matrix with real entries in three arrays: nzval, rowind, and colptr. The nzval array has nnz elements, where nnz is the number of nonzeros in the matrix. The elements in nzval are the values of the nonzero elements, stored by column, so that the elements in column 1 are listed first, then those in column 2, and so on. The integer array rowind also has nnz entries and rowind[i] is the row index of the entry whose value is stored in nzval[i]. The integer array colptr has n + 1 entries, where n is the number of columns in the matrix, and colptr[i] is the location in nzval and rowind where the first element in column i can be found. Equivalently, colptr[i] is the total number of nonzeros in the first i columns. The first entry, colptr[0], has value 0, and the last entry, colptr[n], has value nnz. Although in the example the elements in each column are sorted by increasing row index, this is not a requirement of the format.

    A = [ 1  0  6  0          colptr = [ 0 2 5 7 8 ]
          0  3  0  0          rowind = [ 0 2 1 2 3 0 3 2 ]
          2  4  0  8          nzval  = [ 1 2 3 4 5 6 7 8 ]
          0  5  7  0 ]

Figure 1.1: This figure shows how a small sparse matrix is stored in the compressed column format (also known as the Harwell-Boeing format).
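The column-compressed arrays support a matrix-vector product that touches only the stored nonzeros. The C sketch below (the function name and the use of 0-based int indices are our own choices, not those of any particular library) multiplies a matrix in this format by a vector, and includes the arrays of figure 1.1 as sample data.

```c
#include <stddef.h>

/* y = A*x for an m-by-n matrix in column-compressed (Harwell-Boeing)
 * format with 0-based indices: the nonzeros of column j occupy positions
 * colptr[j] .. colptr[j+1]-1 of nzval and rowind. */
void csc_matvec(size_t m, size_t n,
                const int *colptr, const int *rowind,
                const double *nzval, const double *x, double *y) {
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (int k = colptr[j]; k < colptr[j + 1]; k++)
            y[rowind[k]] += nzval[k] * x[j];   /* only nnz multiply-adds */
}

/* The 4-by-4 matrix of figure 1.1. */
static const int    fig_colptr[5] = {0, 2, 5, 7, 8};
static const int    fig_rowind[8] = {0, 2, 1, 2, 3, 0, 3, 2};
static const double fig_nzval[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
```

Note that A^T x is just as easy in this format, since each stored column of A contributes one dot product with x; this is one reason solvers needing both Ax and A^T x can work from a single copy of the matrix.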
Row-compressed format, which we use in the work on preconditioning linear systems described in section 3.4, is the row-based analogue of column-compressed format: matrix entries are stored by row instead of by column. The row-compressed format uses rowptr and colind arrays in place of the colptr and rowind arrays. We describe less common storage formats as necessary throughout the report. Books such as [14] and [148] describe some of the many other sparse matrix storage representations people use. In principle, one could devise arbitrary hybrids of these to accommodate particular applications, as done in [107, 109, 175].

1.1.2 Sparse matrix algorithms

Just as we have an understanding of good storage methods for dense matrices, we also know how to exploit the memory hierarchy to write efficient dense linear algebra code. The goal is to limit the amount of data movement between levels of the memory hierarchy. The trick is to block the matrix, which divides it into smaller non-overlapping submatrices, and then to operate on the individual blocks. The operations on these smaller blocks should all fit into the lowest level (the one with the most storage) of the memory hierarchy. Since there are typically several levels in the memory hierarchy, the submatrices themselves may be again divided into smaller subblocks which are also operated on one at a time. Typically the memory needed to store the largest blocks is on the order of the size of the first level cache and the number of elements in the smallest blocks is on the order of the number of floating point registers. Blocking can be very effective for dense matrix computations because they typically access matrix and vector elements in regular patterns.

Unfortunately, sparse matrices are not as structured as dense matrices and in general cannot be easily blocked into small dense subblocks. Memory references tend to be irregular, which makes exploiting temporal or spatial locality difficult. Of course, if a user knows his or her application generates sparse matrices with small dense subblocks, performance can be improved by using algorithms that can exploit this feature. Other work looks at padding sparse matrices by storing some zero elements in order to create dense blocks [107, 108, 109, 175]. Nevertheless, overall, achieving high performance on sparse matrix computations remains a complex open problem.

Although algorithms operating on matrices stored in sparse matrix representations can be difficult to code efficiently, some algorithms may be easier to implement on sparse matrices. For example, graph algorithms translate nicely to sparse matrices stored
5 in row compressed format, which is essentially the same as the standard adjacency graph representation of a matrix. 1.2 Roles of preconditioning As noted, preconditioning a system alters it so that the changed system is somehow better. The answer to the improved system is computed, and from it the answer to the original system is derived. What makes the preconditioned system better depends largely on how the algorithm then used to solve the preconditioned system works. For example, consider direct versus iterative methods, a categorization we use to describe algorithms throughout this report. Informally, a direct method is an algorithm that is usually run for a fixed number of steps, at the end of which it almost always returns an answer that is sufficiently close to the exact solution that it is often considered exact, modulo roundoff error. An iterative method, on the other hand, begins with an initial guess to the solution and iteratively tries to improve it. The algorithm stops either when the approximation is deemed sufficiently close to the exact solution, or when some large number of iterations has been run and the user suspects the algorithm has stagnated and so a good approximate solution may never be reached. We note that some methods (for example, the conjugate gradient solver for linear systems [100]) span direct and iterative methods in the sense that they compute the exact answer in n steps in exact arithmetic, but in practice are used as iterative methods either because they often compute a reasonable solution in far fewer than n steps or because the iterations are expensive and n is large. The motive for preconditioning differs for iterative and direct methods. Since a direct method gives the answer after running for a fixed number of steps, a useful preconditioner might turn the system into one where each step takes less time, or one for which the solver computes a more accurate answer. For an iterative method, on the other hand, an effective preconditioner might create a system for which the iterative solver converges when it did not for the original problem, a system where the number of iterations needed for convergence is reduced, or one where each iteration can be computed more efficiently. However, accuracy of the solution and the speed with which that solution is computed remain paramount. Because of the latter, preconditioners are judged not only by how much they improve the performance of the solver, but also by other measures such as the cost of computing and applying that preconditioner. Since the preconditioner must be
first computed prior to solving the modified system, it should be relatively inexpensive to compute. Note that if many similar systems are to be solved and the same preconditioner is used for all of them, the cost of computing the preconditioner can potentially be amortized. After solving the preconditioned system the effects of the preconditioner must be undone to recover the solution to the original problem; this step should also be inexpensive. Furthermore if the preconditioner will be applied in every iteration of the algorithm, as with the incomplete LU preconditioner discussed in section 3.4, the application of the preconditioner should be inexpensive. The tradeoff between the time needed to compute and apply the preconditioner, and the time saved and accuracy gained with a high quality preconditioner is an issue throughout this report.

1.3 Notation and Definitions

The discussions in this report move between matrix and graph terminology. To smooth the transitions between the two, in this section we summarize our matrix notation, define the graphs associated with a matrix, and discuss relationships between matrix and graph terminology.

1.3.1 Matrix notation and definitions

We generally use capital letters for representing matrices, lower case letters for vectors, and Greek letters for constants. A few letters are reserved for special matrices: A, which refers to the n × n, possibly nonsymmetric, matrix being preconditioned; B, which is A after preconditioning; P and Q, which are permutation matrices; and D, which is a diagonal matrix. The vector e is the vector whose entries are all 1. The number of nonzero elements in a matrix M is denoted by nnz(M). In the context of solving a system of linear equations, we use L and U to denote the complete LU factors of A, so A = LU if no pivoting is used. With row pivoting, we have PA = LU; with column pivoting we have AP = LU. We use L̂ and Û to denote incomplete factors of A, so L̂Û ≈ A. The number of nonzeros in the incomplete factorization is nnz(L̂ + Û), and we will typically denote the number of nonzeros in a matrix A by nnz(A), though the (A) may be omitted if the context clearly specifies the matrix.
We frequently use Matlab notation when referring to elements in vectors and matrices. For example, we use colons to indicate a sequence of indices, so A(i, :) is row i of A, and A(3:5, :) is the submatrix consisting of the third, fourth, and fifth rows of A. For more information on Matlab notation, see [133]. If a permutation P is applied symmetrically, it maps A to PAP^T. If A is a nonnegative matrix, it has no negative entries. This may also be written as A ≥ 0. If A is real and symmetric, A = A^T. If A is complex and Hermitian, A = A^H. If A is structurally symmetric, A and A^T have nonzeros in the same locations, though the values may differ. Finally, |A| is shorthand for the matrix whose entries are |A(i, j)|.

We use norms to measure the size of vectors and matrices, where the norm of x is written as ‖x‖. If x is a length n vector, some of the vector norms we use are defined as follows:

    1-norm:          ‖x‖_1 ≡ |x_1| + |x_2| + ... + |x_n|
    2-norm:          ‖x‖_2 ≡ (|x_1|^2 + |x_2|^2 + ... + |x_n|^2)^{1/2}
    ∞-norm:          ‖x‖_∞ ≡ max_i {|x_i|}

If A is an n × n matrix, some of the matrix norms we use are defined as follows:

    1-norm:          ‖A‖_1 ≡ max_j { Σ_i |A(i, j)| }
    2-norm:          ‖A‖_2 ≡ max { λ^{1/2} : λ is an eigenvalue of A^H A }
    ∞-norm:          ‖A‖_∞ ≡ max_i { Σ_j |A(i, j)| }
    Frobenius norm:  ‖A‖_F ≡ ( Σ_{i,j} |A(i, j)|^2 )^{1/2}

For more on norms, look in linear algebra books such as [52, 82, 102]. The condition number of a matrix A with respect to a particular problem is a measure of how sensitive the solution to that problem is to perturbations in A. The condition number of A with respect to matrix inversion is defined as κ(A) = ‖A‖ ‖A^{-1}‖ if A is nonsingular and κ(A) = ∞ if A is singular. Again, for more information consult books on linear algebra such as [52, 82, 102]. Other less frequently used terms will be defined when they are first used.

1.3.2 Graph representations of matrices

We now describe two ways of representing matrices by a graph, both of which are often referred to in the literature. There are the directed graph and the bipartite graph
representations; given a matrix A we refer to the first graph as DG(A), and the latter as BG(A). Note the latter is sometimes called the dependency graph of A (e.g., in [131]). When discussing graphs we use common graph terminology. For example, let a graph G be defined by a set of vertices V and a set of directed edges from one vertex to another E. Then, a path from vertex s to t is a list of vertices (s, v_1, v_2, ..., v_k, t) such that there are directed edges in E from s to v_1, from v_i to v_{i+1} for all 1 ≤ i < k, and from v_k to t. We also refer later on to the subgraph induced by a set of vertices, and mention graph algorithms such as depth first search. The terminology and algorithms can be found in standard algorithm textbooks such as [1, 44].

Directed graph representation

An n by n unsymmetric matrix A can be represented by a directed graph DG(A) = (V, E), where |V| = n, and the directed edge (i, j) ∈ E if and only if A(i, j) ≠ 0. If we need a weighted graph DG_w(A), the weight on edge (i, j) is the value of A(i, j). If A is an n by m matrix where n ≠ m, then |V| = max(n, m) and either some of the nodes will have no incoming edges or some will have no outgoing edges, depending on whether A has more rows or more columns. Thus DG(A) is the same as DG(Ā) where Ā is obtained by extending A with enough zero rows or columns to make it square.

Bipartite graph representation

Alternatively, an n by n unsymmetric matrix A can be represented by a bipartite graph BG(A) = (R, C, E). In this representation |R| = |C| = n, and the undirected edge (r_i, c_j) exists if and only if A(i, j) ≠ 0. If a weighted graph BG_w(A) is needed, the weight on the edge (r_i, c_j) is the value of A(i, j). If A is not square, the only change is that |R| ≠ |C|.

1.3.3 Relationships between A, DG(A), and BG(A)

We now point out some obvious, and some perhaps not so obvious, relationships between a matrix and its associated graphs. For example, note that entries on the diagonal of A correspond to self-edges in DG(A). Now consider the case where A is a structurally symmetric matrix. This means DG(A) has the attractive quality of being representable by an undirected graph since an
edge (i, j) in DG(A) implies the existence of the edge (j, i). Of course, if we need edge weights, A needs to be symmetric, not just structurally symmetric, for DG(A) to be representable by an undirected graph. In BG(A) = (R, C, E), symmetry in A is reflected in the fact that if edge (r_i, c_j) ∈ E, then (r_j, c_i) ∈ E.

Although the nodes in a graph are typically not thought of as ordered in any way, the rows and columns of a matrix are always ordered (i.e., we refer to the first row or third column of a matrix). If we think of the nodes in DG(A) as numbered so that the node corresponding to the first row and column of A is node 1, a reordering of the nodes corresponds to permuting the rows and columns of A symmetrically (i.e., PAP^T). This makes DG(A) useful for algorithms that permute the rows and columns of a matrix symmetrically, such as some of those described in section 3.2.1. Renumbering the nodes of DG(A) is no longer appropriate if different permutations are applied to the rows and columns of A (i.e., PAQ^T), so the bipartite representation BG(A), which allows the nodes of R and C to be numbered separately, is often used instead in this situation.

Finally, recall that a square matrix A is irreducible if there is no permutation matrix P such that

    PAP^T = [ X  Y
              0  Z ],

where X and Z are square. If such a P does exist, A is reducible. If A is irreducible, DG(A) is strongly connected, which means there is a directed path from any vertex to any other (e.g., [52, lemma 6.6]).
10 The matrices used in chapter 2 are taken from a collection of non-hermitian eigenvalue problems [10]. These matrices are all specifically from eigenproblems. As the table in appendix A shows, the matrices used are of modest size and come from a range of applications. The matrices used in chapter 3 come from a variety of collections, with the majority of them available from either the Matrix Market [15] or the University of Florida sparse matrix collection [47]. Previous work on direct and iterative solvers also analyzes results of experiments conducted on a variety of matrices so to make comparisons between our results and those from previous work, we tried to ensure some overlap between our test suite and those from assorted other papers. Furthermore, since the size of sparse linear systems that computers can handle continually increases, we also tried to add new, larger matrices. The table in appendix B provides more details about the matrices chosen. 1.5 Contributions The contributions of this work are as follows. In chapter 2 we consider the problem of computing the eigenvalues of a sparse matrix, that is finding λ such that Ax = λx. We first notice that decomposing the matrix A into irreducible components can significantly reduce the expected time complexity of finding the eigenvalues of A. This scheme can be arbitrarily better than the conventional scheme used for dense matrices, described in [147] and used in packages such as Lapack [5] and Eispack [169]. We then define the concept of balancing in a weighted norm, show how to balance non-negative, irreducible matrices in the weighted norm, and prove this balancing minimizes the 2-norm of such matrices. The idea of a balancing in a weighted norm is used to justify a novel set of Krylov-based algorithms which balance a matrix without accessing its individual entries. By using only matrix-vector (Ax), and sometimes matrix-transpose-vector (A T x), multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A T x) is available. Finally we show that for matrices from our test suite Krylov-based balancing algorithms can improve the accuracy to which eigenvalues are computed by as much as 10 decimal places for sparse eigensolvers. Furthermore, Krylov-based balancing can also improve the condition number of the eigenvalues, which improves computed error bounds. In chapter 3 we turn to solving sparse linear systems, where the problem is to
11 find x such that Ax = b. Direct solvers first factor the matrix A, so we begin by discussing various reasons for reordering the rows and columns of A prior to factoring it. We present data showing the advantages of decomposing the matrices in our test suite through a nonsymmetric permutation rather than the symmetric strongly connected components decomposition used when computing eigenvalues. For one matrix in our test suite the size of the largest block found with a nonsymmetric permutation is a tenth of the size of that found with a symmetric permutation, which can greatly reduce the subsequent factorization time. We also note in section 3.3.2 that using a stability ordering in concert with a column approximate minimum degree ordering can lead to fill in the LU factors that differs significantly from that of using the sparsity ordering alone. On our test matrices the difference between the two could be up to a factor of 2. Focussing on one specific algorithm for reordering A, we next describe our design and implementation of a threaded column approximate minimum degree algorithm. Even after the extensive analysis and code modifications we describe, our final implementation never achieved a speedup of more than 3 on 8 processors of an SGI Power Challenge machine, and more typically there was virtually no speedup. This work, done jointly with Sivan Toledo and John Gilbert [37], gives us a better understanding of the difficulties of efficiently implementing algorithms with fine-grained parallelism even in a shared memory environment. Finally we turn to incomplete LU (ILU) factorizations, a family of preconditioners often used with iterative solvers. We propose a modification to a standard ILU scheme and show that it makes better use of the memory the user has available, leading to a greater likelihood of convergence for preconditioned GMRES(50), the iterative solver used in our studies. By looking at data gathered from tens of thousands of test runs combining matrices with different ILU algorithms, parameter settings, scaling algorithms, and ordering algorithms, we draw some conclusions about the effects of different ordering algorithm on the convergence of ILU-preconditioned GMRES(50). We find, for example, that both ordering for stability and partial pivoting are necessary for achieving the best convergence results.
Chapter 2

Preconditioning sparse matrices for computing eigenvalues

Given a matrix A, we say that λ is an eigenvalue of A with corresponding eigenvector x if Ax = λx and x ≠ 0. Given A, eigensolvers try to find some or all of its eigenvalues and eigenvectors. The eigenvalues of a matrix are maintained under similarity transforms, which means the eigenvalues of B = SAS^{-1}, where S is nonsingular, are the same as those of A. The eigenvectors, on the other hand, are transformed by S so if x is an eigenvector of A, Sx is an eigenvector of B. Preconditioning in this context means choosing S so that the eigenvalues of SAS^{-1} can be computed more quickly or more accurately than those of the untransformed A.

In this chapter we explore two methods for choosing S. In section 2.1 we constrain S to be a permutation matrix and show how to decompose A into a set of smaller systems. In section 2.2 we then constrain S to be a diagonal matrix and look at algorithms for scaling the entries of A to improve the accuracy with which its eigenvalues can be computed. These techniques can sometimes be combined, and in section 2.3 we show the effects of preconditioning on the accuracy of computed eigenvalues.

Our contributions are first to notice that decomposing a matrix A by using a strongly connected components algorithm can significantly reduce the expected time complexity of finding the eigenvalues of A. This scheme can be arbitrarily better than the conventional scheme used for dense matrices, described in [147] and used in packages such as Lapack [5] and Eispack [169].
We then switch gears and define weighted balancing, which we show minimizes the 2-norm of a nonnegative matrix. We further describe novel Krylov-based balancing algorithms which approximate weighted balancing and which operate on a matrix A without explicitly accessing its entries. By using only matrix-vector, and sometimes matrix-transpose-vector, multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A^T x) is available. Finally we show that for matrices from our test suite, these Krylov-based balancing algorithms do, in fact, often improve the accuracy to which eigenvalues are computed by dense or sparse eigensolvers by as much as 5 decimal places in our examples with dense eigensolvers and as much as 10 decimal places for sparse eigensolvers. Furthermore, Krylov-based balancing can also improve the condition number of the eigenvalues, which improves computed error bounds. Portions of this work were published in [33, 35, 36].

2.1 Decomposing the matrix

We first consider choosing a similarity transform P, where P is a permutation matrix. The goal is to find P such that:

    PAP^T = [ X  Y
              0  Z ],

where X and Z are square. Since the eigenvalues of A are the eigenvalues of X together with those of Z, this decomposition reduces the problem of finding the eigenvalues of A to the smaller, and therefore simpler, eigenproblems for X and Z. Recall from section 1.3.3 that if such a P exists, A is reducible, otherwise A is irreducible. P is applied symmetrically, so the transformation permutes the rows and columns of A in the same way. Looking at DG(A) and DG(PAP^T) (defined in section 1.3.2) we see the two are isomorphic, and only the numbering of the nodes is changed.

In this section we review an algorithm for finding P described by Parlett and Reinsch in [147] and used by codes in several popular linear algebra libraries. We then describe a permutation which can do significantly better and end with a comparison of the two algorithms on our test matrices.
2.1.1 The Parlett-Reinsch Algorithm

In [147] Parlett and Reinsch describe a two-step algorithm for transforming a matrix prior to computing its eigenvalues. In the first step, the matrix is permuted to separate out rows and columns which isolate eigenvalues. In the second, the remaining rows and columns are scaled. This algorithm is implemented in linear algebra packages such as EISPACK (under the name balanc), LAPACK (under the name gebal¹), and MATLAB (under the name balance).

In the permutation phase, the algorithm first searches for a row with zeros on all n − 1 off-diagonal entries. If a row with this structure exists, it is permuted to the bottom of the matrix by swapping two rows and swapping the same two columns. The algorithm then iterates on the first n − 1 rows and columns of the permuted matrix. When no more rows isolating eigenvalues on the diagonal are found, the process repeats for columns. Afterwards the algorithm has found a permutation matrix P such that

    P^T AP = [ T_1  X  Y
               0    C  Z
               0    0  T_2 ]        (2.1)

where T_1 and T_2 are upper triangular. The eigenvalues of T_1 and T_2 are their diagonal entries. Even though the square submatrix C may not be irreducible, this decomposition is deemed sufficiently good.

In graph theoretic terms, this algorithm begins by looking for a sink node s of DG(A). If one exists, it is given the number n, meaning in the permuted matrix it corresponds to the last row and column. The algorithm then looks for a sink node in the subgraph induced by the set of all nodes except s. If a sink node exists in this subgraph, it is numbered n − 1. When there are no sink nodes found in the last subgraph the algorithm then continues by looking for source nodes, which are numbered in increasing order, starting with 1. Nodes not identified as source or sink nodes end up numbered after all the source nodes and before all the sink nodes. See figure 2.1 for a small example.

¹In LAPACK there is also an additional character at the beginning of the subroutine name which specifies the data type (e.g. dgebal balances matrices whose entries are in double precision) [5].
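The permutation phase just described can be sketched in C as follows, operating for clarity on a dense 0/1 copy of the nonzero pattern; this is only an illustration of the idea, with hypothetical names, and not the bookkeeping actually used by balanc or gebal.

```c
#include <stddef.h>

/* Swap rows a,b and columns a,b of the 0/1 pattern nz (nz[i*n+j] != 0 iff
 * A(i,j) != 0), and record the swap in perm. */
static void swap_rc(size_t n, char *nz, size_t *perm, size_t a, size_t b) {
    for (size_t j = 0; j < n; j++) { char t = nz[a*n+j]; nz[a*n+j] = nz[b*n+j]; nz[b*n+j] = t; }
    for (size_t i = 0; i < n; i++) { char t = nz[i*n+a]; nz[i*n+a] = nz[i*n+b]; nz[i*n+b] = t; }
    size_t t = perm[a]; perm[a] = perm[b]; perm[b] = t;
}

/* Permutation phase of the Parlett-Reinsch algorithm: rows with no
 * off-diagonal nonzeros in the active window (sink nodes of DG(A)) are
 * pushed to the bottom, then columns with no off-diagonal nonzeros
 * (source nodes) are pushed to the top.  perm[k] is the original index of
 * the row/column placed in position k. */
void parlett_reinsch_permute(size_t n, char *nz, size_t *perm) {
    size_t lo = 0, hi = n;              /* active window: rows/columns lo..hi-1 */
    for (size_t k = 0; k < n; k++) perm[k] = k;

    for (int found = 1; found && hi > lo; ) {         /* phase 1: isolate rows */
        found = 0;
        for (size_t i = lo; i < hi; i++) {
            size_t deg = 0;
            for (size_t j = lo; j < hi; j++)
                if (j != i && nz[i*n + j]) deg++;
            if (deg == 0) { swap_rc(n, nz, perm, i, hi - 1); hi--; found = 1; break; }
        }
    }
    for (int found = 1; found && hi > lo; ) {         /* phase 2: isolate columns */
        found = 0;
        for (size_t j = lo; j < hi; j++) {
            size_t deg = 0;
            for (size_t i = lo; i < hi; i++)
                if (i != j && nz[i*n + j]) deg++;
            if (deg == 0) { swap_rc(n, nz, perm, j, lo); lo++; found = 1; break; }
        }
    }
}
```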
Figure 2.1: This figure shows how the Parlett-Reinsch algorithm would decompose a small graph. In (a) a sink node is located and numbered last; in (b) there are no sink nodes, so a source node is located and numbered 1; in (c) there are neither sink nor source nodes so the remaining nodes are considered a group and numbered consecutively. Gray elements were eliminated in a previous step and are not considered.

2.1.2 The Strongly Connected Components Algorithm

By locating individual rows and columns which isolate eigenvalues, the permutation phase of the Parlett-Reinsch algorithm finds a permutation matrix P which makes PAP^T as upper triangular as possible. In other words, it decomposes A into 1 × 1 blocks and one large block consisting of all the remaining rows and columns, as shown in equation 2.1. We suggest choosing P to make PAP^T as block upper triangular as possible, which decomposes A into a set of diagonal blocks of size n̂_1, n̂_2, ..., n̂_k, where k depends on the structure of the matrix. This minimizes the size of the largest diagonal block, which is significant since a dense eigensolver on the decomposed matrix runs in time O(Σ_{i=1}^k n̂_i^3).

In graph theoretic terms, making A as block upper triangular as possible corresponds to finding the strongly connected components of a directed graph whose adjacency matrix has the same structure as A and then sorting the components using a topological sort [44, section 23.5]. See figure 2.2 for a small example. Tarjan noted that finding the strongly connected components of a directed graph can be done using two depth first searches [173]; a description of his algorithm can be found in [1, section 5.5]. Descriptions of implementations are in [59] and [151]. We point out that Tarjan's algorithm is particularly well suited to this application because it outputs the nodes one strongly connected component at a time, with the components already topologically sorted.

Permuting A to block upper triangular form means the indices of several diagonal blocks may need to be stored so that the eigenproblems corresponding to these blocks
can be identified and solved. However, we believe the potential benefits of this algorithm over the Parlett-Reinsch algorithm compensate for the additional complexity. Because the eigenvalues of the original matrix A are the same as the union of the eigenvalues of the individual diagonal blocks in P^T AP, the running time of any eigensolver run on the blocks of the balanced matrix depends strongly on the size of the largest block. The size of the largest diagonal block found using this permutation can be significantly smaller than that found by the Parlett-Reinsch algorithm, and it is never larger.

Figure 2.2: This figure shows how the strongly connected components algorithm would decompose a small graph. All the strongly connected components are located, then a topological sort is done on the components, and the nodes are numbered so that if component i comes after component j in the topological sort, all elements of component i have numbers greater than those of the elements of component j.

2.1.3 Comparisons

The benefits of using the strongly connected components algorithm instead of the Parlett-Reinsch algorithm are two-fold. First, as described in the next paragraph, computing the strongly connected components is likely to take less time for sparse matrices stored in compressed format (row or column). Second, the size of the largest diagonal block can be much smaller with the strongly connected components algorithm, which makes computing the eigenvalues of A less expensive.

The Parlett-Reinsch permutation algorithm permutes rows one at a time, which requires O(nnz) time for each row if the matrix is in compressed column format. For a permuted upper triangular matrix, the permutation phase would take O(nnz · n) time, which is much more than the O(n + nnz) time taken by Tarjan's strongly connected components algorithm. In short, it is more efficient to run an algorithm that needs only two sweeps through the entire data structure to identify all the blocks, rather than one which repeatedly looks for a single 1 by 1 block in each sweep.
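The two-sweep approach can be sketched as follows in C, using a forward depth first search on DG(A) followed by a backward one on DG(A^T), both on row-compressed patterns. This is an illustrative sketch only (function names are ours, and recursion is used for brevity where a production code would use an explicit stack); the components come out already topologically sorted, so listing the vertices of component 0 first, then component 1, and so on, gives the block upper triangular PAP^T.

```c
#include <stddef.h>

/* First pass: depth first search on DG(A), recording vertices in postorder. */
static void dfs1(size_t v, const size_t *rowptr, const size_t *colind,
                 char *seen, size_t *order, size_t *pos) {
    seen[v] = 1;
    for (size_t k = rowptr[v]; k < rowptr[v + 1]; k++)
        if (!seen[colind[k]])
            dfs1(colind[k], rowptr, colind, seen, order, pos);
    order[(*pos)++] = v;
}

/* Second pass: depth first search on DG(A^T), labeling one component. */
static void dfs2(size_t v, const size_t *trowptr, const size_t *tcolind,
                 size_t *comp, size_t c) {
    comp[v] = c;
    for (size_t k = trowptr[v]; k < trowptr[v + 1]; k++)
        if (comp[tcolind[k]] == (size_t)-1)
            dfs2(tcolind[k], trowptr, tcolind, comp, c);
}

/* Strong components of DG(A); A's pattern is given in row-compressed form
 * (rowptr/colind) and so is the pattern of A^T (trowptr/tcolind).  comp[v]
 * receives the component number of vertex v, and the return value is the
 * number of diagonal blocks.  seen and order are length-n workspaces. */
size_t strong_components(size_t n,
                         const size_t *rowptr,  const size_t *colind,
                         const size_t *trowptr, const size_t *tcolind,
                         char *seen, size_t *order, size_t *comp) {
    size_t pos = 0, ncomp = 0;
    for (size_t v = 0; v < n; v++) { seen[v] = 0; comp[v] = (size_t)-1; }
    for (size_t v = 0; v < n; v++)
        if (!seen[v]) dfs1(v, rowptr, colind, seen, order, &pos);
    for (size_t i = n; i-- > 0; )                 /* reverse postorder */
        if (comp[order[i]] == (size_t)-1)
            dfs2(order[i], trowptr, tcolind, comp, ncomp++);
    return ncomp;
}
```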
                             strongly conn. comp.          Parlett-Reinsch
    name         n      # scc  max(block)  Σ(blocks^3)   max(block)  Σ(blocks^3)
    tols2000     2000   1529     90        7.33e5          854       6.23e8
    t240          240    121     90        7.29e5          150       3.38e6
    ecsiemensa    177      6    167        4.66e6          177       5.54e6
    ecsiemensb    177     26    151        3.44e6          153       3.58e6
    qh1484       1484      3   1470        3.18e9         1484       3.27e9
    qh882         882      1    882        6.86e8          882       6.86e8
    mhd4800a     4800      8   4793        1.10e11        4793       1.10e11
    qh768         768      1    768        4.53e8          768       4.53e8
    mvmpde        900      1    900        7.29e8          900       7.29e8
    mhd3200a     3200      8   3193        3.26e10        3193       3.26e10
    mhd1280a     1280     15   1266        2.03e9         1266       2.03e9
    mhd416a       416      8    409        6.84e7          409       6.84e7
    qc2534       2534      1   2534        1.63e10        2534       1.63e10
    qc324         324      1    324        3.40e7          324       3.40e7

Table 2.1: This table shows the number of strongly connected components (# scc) found in matrices in the test suite as well as the size of the largest component (max(block)). The size of the submatrix C, as defined in Equation 2.1, from the Parlett-Reinsch algorithm is given for comparison. As a measure of how long an O(n^3) algorithm would take to find the eigenvalues of the decomposed matrix, we give the sum of the block sizes cubed (Σ(blocks^3)). The matrices are sorted so that the matrices helped most by finding strongly connected components are listed first: the largest block is reduced significantly for tols2000 and t240, reduced somewhat for ecsiemensa, ecsiemensb, and qh1484, and reduced not at all for the other matrices.

Furthermore, the largest block found via strongly connected components can be much smaller than that found by the Parlett-Reinsch algorithm. Table 2.1 shows the number of strongly connected components found for each of the test matrices, together with the size of the largest block found by each permutation algorithm. As a measure of how long an O(n^3) algorithm would take to compute the eigenvalues of the decomposed matrices, we also give the sum of the diagonal block sizes cubed. Improvements, measured by this sum of cubes, range from 1 (no improvement) up to nearly 10^3. For information about these matrices, including the application areas from which they come, see appendix A.
2.2 Balancing

In the previous section we studied similarity transforms determined solely by the nonzero structure of A. We now restrict ourselves to similarity transforms that are diagonal scaling matrices and hence depend on and affect the nonzero values in A.

Numerical algorithms that compute the eigenvalues of a nonsymmetric matrix A typically have backward errors of the magnitude of ε‖A‖, where ε is the machine precision. Prior to computing the eigenvalues, applying a simple and accurate similarity transform DAD^{-1}, which reduces either the norm of A or the condition numbers of some subset of A's eigenvalues, can be advantageous. For example, consider the matrix:

    A = [ 1       0       10^{-4}
          1       1       10^{-2}
          10^4    10^2    1       ].

Choosing D = diag(100, 1, .01) gives:

    B = DAD^{-1} = [ 1        0   1
                     10^{-2}  1   1
                     1        1   1 ].

Whereas ‖A‖_F, the Frobenius norm of A, is approximately 10^4, ‖B‖_F is approximately 2.6. Furthermore, the condition numbers of the eigenvalues of B are all approximately 1, whereas those of the eigenvalues of A range in magnitude from 10^1 to 10^3. Therefore, one expects to compute the eigenvalues of B more accurately than those of A. Notice B is balanced in the ∞-norm: a matrix is balanced in the α-norm if for any i, the α-norm of row i is the same as the α-norm of column i. Osborne [143] showed balancing an irreducible matrix in the 2-norm is equivalent to minimizing its Frobenius norm; balancing a matrix in an arbitrary norm may not have such a simple effect on a matrix norm.

Previous work studies the theory behind using diagonal scaling to balance matrices and to minimize matrix norms, as well as practical issues associated with implementing balancing algorithms. In this section we summarize and extend the theory of balancing before describing a family of balancing algorithms our theory suggests.
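The following small C program checks the arithmetic of the example above: it forms B = DAD^{-1} entrywise with D = diag(100, 1, .01) and prints the two Frobenius norms (roughly 10^4 for A and 2.6 for B). The matrix entries are exactly those displayed above; the program itself is only an illustrative check.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double A[3][3] = {{1.0,  0.0,  1e-4},
                      {1.0,  1.0,  1e-2},
                      {1e4,  1e2,  1.0 }};
    double d[3] = {100.0, 1.0, 0.01};
    double normA = 0.0, normB = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double b = d[i] * A[i][j] / d[j];   /* B(i,j) = d_i A(i,j) / d_j */
            normA += A[i][j] * A[i][j];
            normB += b * b;
        }
    printf("||A||_F = %.3g   ||B||_F = %.3g\n", sqrt(normA), sqrt(normB));
    return 0;
}
```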
2.2.1 Theory

Before summarizing previous work on the theory of balancing and norm minimization, we note a few assumptions. First, we consider primarily irreducible matrices. Recalling the relationship between irreducible matrices and graphs as described in section 1.3.3, if a matrix is reducible, we can compute the strongly connected components of its graph using the algorithm described in section 2.1.2, and then consider the irreducible diagonal blocks individually. Furthermore, note that if A is reducible with block structure [X Y; 0 Z], the block Y can be scaled arbitrarily close to 0, without affecting the values in X and Z. In other words, if we partition the scaling matrix so D = [D_X 0; 0 D_Z], the scaled matrix is:

    DAD^{-1} = [ D_X X D_X^{-1}    D_X Y D_Z^{-1}
                 0                 D_Z Z D_Z^{-1} ].        (2.2)

If D_Z is scaled by a large constant, the only change in DAD^{-1} is a decrease in the elements of D_X Y D_Z^{-1}. By increasing the constant, Y can be scaled arbitrarily close to 0.

Furthermore, we typically do not consider the diagonal elements of the matrix. Not only are diagonal elements unaffected by balancing, but in most norms a matrix that is balanced when its diagonal elements are zero remains balanced when the diagonal elements are made non-zero. The converse is not true; if the diagonal entries are sufficiently large relative to the off-diagonal entries, the matrix will be nearly balanced regardless of the exact off-diagonal entries. Therefore balancing matrices with a zero diagonal is the more important case.

History

There are many interesting questions regarding the use of diagonal scale factors to balance a matrix A in some vector norm, or to minimize some matrix norm of A. Questions pursued include whether exact balancing in a given norm is achievable, if the balancing matrix D is unique, and whether balancing in a given vector norm also minimizes some matrix norm of A. In this section we summarize a few theoretical results from the literature on diagonal scaling. Tables 2.2 and 2.3 summarize the known results on balancing and norm minimization, and [33] contains a more complete overview.

Osborne [143] was the first to study balancing, showing that a matrix balanced in the 2-norm has minimal Frobenius norm. The iterative algorithm suggested by Osborne is
used in the Parlett-Reinsch algorithm [147], although the code provided in [147] balances in the 1-norm, which is cheaper to compute than the 2-norm. For balancing in the 1-norm, Hartfiel [96] proved that if A is irreducible, a diagonal balancing matrix exists and is unique up to scalar multiples. Eaves et al. [62] then showed that balancing a non-negative matrix in the 1-norm minimizes the sum of its elements. If the iterative algorithm of [143, 147] is used to balance a matrix in the 1-norm, Grad [85] showed the algorithm would find the balancing scale factors, provided the diagonal scale factors were not limited to powers of the machine base (as they often are in actual code, since this eliminates roundoff error). Finally, in [113] the authors show that a matrix can be balanced in the 1-norm to within any prescribed accuracy in polynomial time.

For balancing in the ∞-norm, the balancing matrix is not necessarily unique [33]. By defining balancing in the ∞-norm more strictly so that the values of more than the largest element in each row and column are considered, Schneider and Schneider show in [166] that the balancing matrix can be made unique up to scalar multiples. Graph algorithms for balancing in the ∞-norm are studied in [33] and [166].

Moving away from balancing, Ström [172] considered using diagonal scaling solely to minimize various matrix norms, disregarding the question of whether the matrix is also balanced in some vector norm. He proved lower bounds on the norm achievable for several matrix norms, and in some cases showed how to attain the lower bounds. Table 2.2 summarizes some of his results.
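For concreteness, the Osborne/Parlett-Reinsch iteration discussed above can be sketched in C as below (figure 2.3 gives the pseudocode used later in this chapter). The sketch balances a dense matrix in the 1-norm and is purely illustrative: the function name and stopping test are ours, and unlike gebal it does not restrict the scale factors to powers of the machine radix, so it does introduce rounding error.

```c
#include <math.h>
#include <stddef.h>

/* One-norm balancing by repeated sweeps: for each index i, equalize the
 * off-diagonal row and column norms by scaling row i by f and column i by
 * 1/f.  On exit A (row-major, n-by-n) holds D*A*D^{-1} and d holds the
 * diagonal of D. */
void balance_1norm(size_t n, double *A, double *d, double tol, int max_sweeps) {
    for (size_t i = 0; i < n; i++) d[i] = 1.0;
    for (int sweep = 0; sweep < max_sweeps; sweep++) {
        int converged = 1;
        for (size_t i = 0; i < n; i++) {
            double r = 0.0, c = 0.0;             /* off-diagonal row/column 1-norms */
            for (size_t j = 0; j < n; j++) {
                if (j == i) continue;
                r += fabs(A[i*n + j]);
                c += fabs(A[j*n + i]);
            }
            if (r == 0.0 || c == 0.0) continue;  /* isolated row or column */
            double f = sqrt(c / r);              /* makes new row and column norms equal */
            if (f > 1.0 + tol || f < 1.0 / (1.0 + tol)) converged = 0;
            for (size_t j = 0; j < n; j++) {
                A[i*n + j] *= f;                 /* scale row i */
                A[j*n + i] /= f;                 /* scale column i */
            }
            d[i] *= f;
        }
        if (converged) break;
    }
}
```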
We now define weighted balancing as follows: an irreducible, non-negative matrix A is balanced in the weighted sense if A(i, :) z = z^T A(:, i) for all i = 1 ... n, where z is the eigenvector corresponding to the Perron root of A, i.e., the eigenvalue ρ(A).

Theorem 1 Let α = ρ(A), where A is an n × n irreducible, non-negative matrix. Let x and y be corresponding positive right and left Perron vectors, i.e. Ax = αx and y^T A = αy^T, where x > 0 and y > 0. Let D = diag(√(y_1/x_1), √(y_2/x_2), ..., √(y_n/x_n)), z = Dx = [√(x_1 y_1), √(x_2 y_2), ..., √(x_n y_n)]^T, and B = DAD^{-1}. Then the following are true.

1. ρ(B) = ρ(A) = α.

2. The left and right eigenvectors of B corresponding to the eigenvalue α are identical and equal to z; this means the eigenvalue α has minimal condition number.

3. B is balanced in the weighted sense.

Proof:

1. By Perron-Frobenius theory, A has a positive real eigenvalue α = ρ(A) whose corresponding right and left eigenvectors x and y are positive. Therefore D is finite and non-singular, and B has the same eigenvalues as A, since the two are similar.

2. Since Ax = αx, D^{-1}BDx = αx, and BDx = αDx. Hence Dx is the right eigenvector of B corresponding to α. Similarly, y^T D^{-1} is the left eigenvector corresponding to α. For componentwise equality of the left and right eigenvectors, choose D = diag(√(y_1/x_1), √(y_2/x_2), ..., √(y_n/x_n)). Both eigenvectors then equal z. The formula for the condition number of the eigenvalue is ||Dx|| ||D^{-1}y|| / |(D^{-1}y)^T (Dx)|. Since Dx = D^{-1}y, the condition number equals 1 and is minimized.

3. Since Bz = αz and z^T B = αz^T, B(i, :) z = αz_i = z^T B(:, i).

Corollary 2 D is unique up to scalar multiples.

Proof: Because A is irreducible and non-negative, by Perron-Frobenius theory the right and left Perron vectors are unique up to multiplication by a scalar (e.g. [102, Theorem 8.4.4]).

We have defined weighted balancing and shown how to find a scaling matrix that balances in a weighted sense. Next we show weighted balancing recovers symmetry whenever possible and achieves the minimum 2-norm.
Proposition 3 Assume A is irreducible and non-negative. Let A = D_1 S D_2, where D_1 and D_2 are non-singular, diagonal matrices. If S = S^T, then B = B^T (where B is defined in Theorem 1).

Proof: Weighted balancing uses the right and left Perron vectors x and y of A's largest eigenvalue α, which are unique up to multiplication by a scalar by corollary 2. Since A = D_1 S D_2, A and A^T are diagonally similar, so c D_2^{-1} D_1 y = x, where c is some positive scalar. Therefore, the scaling matrix D defined in Theorem 1 is:

D = diag( √(D_2(1)/(c D_1(1))), √(D_2(2)/(c D_1(2))), ..., √(D_2(n)/(c D_1(n))) )

We can take out the c so that D = c D_2^{1/2} D_1^{-1/2} for a (new) positive scalar c. This means:

B = DAD^{-1} = c D_2^{1/2} D_1^{-1/2} D_1 S D_2 (1/c) D_2^{-1/2} D_1^{1/2} = D_1^{1/2} D_2^{1/2} S D_2^{1/2} D_1^{1/2}

Since S is symmetric, clearly B is symmetric.

Next we prove a lower bound on the 2-norm of B, where B is defined in Theorem 1. A trivial lower bound on the 2-norm of B is (1/n) ρ(|A|). This bound holds regardless of whether or not A is non-negative, since for any B = DAD^{-1},

||B||_2 ≥ (1/n) ||B||_1 ≥ (1/n) ρ(|B|) = (1/n) ρ(|A|).

However, if A is non-negative, a stronger result can be shown.

Theorem 4 If A is non-negative and irreducible, ||B||_2 = ρ(A) (where B is defined in Theorem 1). Furthermore ||B||_2 ≥ ρ(A) = ρ(B) for any D.

Proof: By definition, ||B||_2^2 = ρ(BB^T). Since Bz = αz and B^T z = αz, BB^T z = α^2 z. Since B ≥ 0, BB^T ≥ 0. In addition, α^2 = ρ(BB^T) (e.g. [180, Theorem 2.2]). Therefore, ||B||_2 = α. Furthermore ||B||_2 ≥ ρ(A) for any D because ||B||_2 ≥ ρ(B) (e.g. [102, Theorem 5.6.9]) and B and A have the same eigenvalues.

Reducing the norm of a matrix is a common goal of balancing, and we have shown weighted balancing achieves this goal.
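The following MATLAB fragment is a minimal sketch of Theorem 1 and Theorem 4, not code from the dissertation: it builds an arbitrary positive (hence irreducible) test matrix, forms the weighted-balancing scaling D from the computed left and right Perron vectors, and checks that the 2-norm of B = DAD^{-1} matches ρ(A).

% Weighted balancing from left and right Perron vectors (Theorem 1).
n = 50;
A = rand(n);                                 % positive, hence irreducible, test matrix
[V, Lam]  = eig(A);                          % right eigenvectors
[W, LamT] = eig(A');                         % left eigenvectors, as right eigenvectors of A'
[alpha, k] = max(abs(diag(Lam)));            % Perron root alpha = rho(A)
[~, kt]    = max(abs(diag(LamT)));
x = abs(V(:, k));   y = abs(W(:, kt));       % right and left Perron vectors
D = diag(sqrt(y ./ x));                      % scaling matrix of Theorem 1
B = D * A / D;                               % B = D*A*inv(D)
fprintf('rho(A) = %.4g, ||B||_2 = %.4g, ||A||_2 = %.4g\n', alpha, norm(B), norm(A));

Theorem 4 predicts that the printed ||B||_2 equals ρ(A), which is never larger than ||A||_2.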
Summary

We summarize the theoretical results in two tables. Table 2.2 summarizes results on using diagonal scaling matrices to minimize assorted matrix norms. Table 2.3 summarizes results on using diagonal matrices to balance a matrix in assorted vector norms.

Minimizing the Norm of a Matrix

norm | Is D unique? | requires A ≥ 0? | how to find scaling matrix? | lower bound | exactly attainable?
1    | Y [172]      | N               | use Perron vector [172]     | ρ(|A^T|) [172] | if irreducible [172]
2    | ?            | N               | solve GEVP [28]             | ρ(A) [172]     | ?
2    | Y (Th. 1)    | Y               | use l,r Perron vectors (Th. 1) | ρ(A) [172]  | if irreducible (Th. 1)
∞    | Y [172]      | N               | use Perron vector [172]     | ρ(|A|) [172]   | if irreducible [172]
F    | Y [143]      | N               | iterative algorithm [143]   | Σ_i |λ_i|^2 [172] | iff normalizable by diag. [172]; if irreducible, lower bound attainable [172]

Table 2.2: This table summarizes known results on minimizing a matrix norm by diagonal scaling. A question mark in the table means the answer is still unknown. By unique we mean D is unique up to scalar multiples, and when we say requires A ≥ 0 we refer to the proofs cited for that norm. Also, when [172] says A is normalizable by a diagonal matrix the author means there exists Q such that Q^H Q = I and Q^H A Q = Λ, where Λ = diag(λ_1, λ_2, ..., λ_n).

2.2.2 Parlett-Reinsch balancing algorithm

The balancing phase of the Parlett-Reinsch algorithm operates on the C matrix, defined in equation 2.1, using the iterative procedure described in [143, 147]. The iterative algorithm, whose pseudocode is given in figure 2.3, looks for a diagonal matrix D such that B = DCD^{-1} is nearly balanced in the 1-norm. The algorithm iterates over the rows and columns of C, for each row/column pair finding a scale factor which balances that row/column pair. The appropriate entry of D is updated and that row/column is scaled. The algorithm terminates when significant progress in balancing the matrix cannot be made by updating any element of D.

In practice the entries of D are all powers of the machine base, so that scaling A can be done without introducing roundoff error. This changes line 5 in the pseudocode to:

5  f ← power of 2 nearest ( ||B(:, i)|| / ||B(i, :)|| )^{1/2}
Balancing a Matrix

norm | Is D unique? | relation to matrix norm | req. A ≥ 0? | how to find scaling matrix?
1    | Y [96]       | minimizes Σ_{i,j} A(i, j) [62] | N | find vector to minimize function [62]
2    | Y [143]      | minimizes F-norm [143]         | N | iterative algorithm [143, 147]
∞    | N [33]       | (*) [33]                       | N | cycle-based algorithm [33, 166]; iterative algorithm [33, 147]
w    | Y (Cor. 2)   | minimizes 2-norm (Th. 4)       | Y | use l,r Perron vectors (Th. 1)

(*) Knowing that the largest entry in the matrix is no greater than ρ gives:
||A_bal||_1 ≤ ρ (maximum nnz in a column)
||A_bal||_∞ ≤ ρ (maximum nnz in a row)
||A_bal||_F^2 ≤ ρ^2 n^2

Table 2.3: This table summarizes known results on balancing matrices in the 1, 2, ∞, and weighted norms. By unique, we mean D is unique up to scalar multiples. By requires A ≥ 0, we refer to the requirements of the proofs cited for that norm. However, note that for the 1-norm both [62] and [96] assume A ≥ 0 since they deal with A and not |A|; nevertheless their results are easily adapted for all A since the 1-norm only cares about absolute values.

(B, D) = Balance(A)
1   D ← I
2   B ← A
3   repeat
4     for i = 1 to n
5       f ← ( ||B(:, i)|| / ||B(i, :)|| )^{1/2}
6       B(i, :) ← B(i, :) · f
7       B(:, i) ← B(:, i) / f
8       D(i, i) ← D(i, i) · f
9     endfor
10  until entries of D do not change much in an iteration

Figure 2.3: Pseudocode for the iterative balancing algorithm described in [143, 147]. As opposed to the pseudocode, the actual code also contains checking for overflow and underflow.
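The following MATLAB function is a minimal sketch of the algorithm in figure 2.3, not the actual LAPACK or spbalance code: it balances in the 1-norm, ignores the diagonal as discussed in section 2.2.1, uses a simple "no noticeable change" stopping test, and omits the restriction of the scale factors to powers of the machine base. The function name and the 1% change threshold are our choices.

% Iterative balancing in the 1-norm (sketch of figure 2.3).
function [B, D] = balance_iterative(A)
n = size(A, 1);
D = ones(n, 1);                              % diagonal of the scaling matrix
B = A;
converged = false;
while ~converged
    converged = true;
    for i = 1:n
        c = norm(B(:, i), 1) - abs(B(i, i)); % off-diagonal 1-norm of column i
        r = norm(B(i, :), 1) - abs(B(i, i)); % off-diagonal 1-norm of row i
        if c > 0 && r > 0
            f = sqrt(c / r);
            if f > 1.01 || f < 0.99          % only scale if D changes noticeably
                B(i, :) = B(i, :) * f;
                B(:, i) = B(:, i) / f;
                D(i)    = D(i) * f;
                converged = false;
            end
        end
    end
end
end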
In [33, 35] we describe spbalance, code for balancing sparse matrices in compressed column format. In spbalance the permutation phase is done using the strongly connected components algorithm and the balancing phase uses an implementation of the iterative algorithm that is efficient for matrices stored in compressed column format. Our experiments show that on our sparse test matrices spbalance is up to 400 times faster than the dense code in LAPACK [5], which is based on the Parlett-Reinsch algorithm.

2.2.3 Krylov balancing algorithms

Because traditional balancing algorithms such as the Parlett-Reinsch algorithm calculate exact row and column norms, they require that the elements of A be given explicitly. In this section we consider the case where elements are not given explicitly and, instead, we can only access A through matrix-vector multiplications. Balancing algorithms which use only matrix-vector multiplications can be useful as preconditioners for eigensolvers which similarly use only matrix-vector multiplication to access the matrix (see [11, 119] for surveys on such solvers). These balancing algorithms may also be faster than the traditional balancing algorithms if the sparse matrix-vector multiplications are optimized, perhaps using techniques described in [107, 175].

In this section we explain the theory behind, and give the pseudocode for, Krylov-based balancing algorithms which use matrix-vector multiplications (Ax), and sometimes also matrix-transpose-vector multiplications (A^T x), to access the matrix. The algorithm using only Ax to access A is also called the one-sided algorithm. The algorithms using both Ax and A^T x multiplications are also called two-sided algorithms.

Theory

Both the one-sided and two-sided algorithms depend on approximating the Perron vectors of |A|. The one-sided algorithm approximates the right Perron vector of |A| and the two-sided algorithm approximates both the right and left Perron vectors of |A|. Since the Perron vector of |A| is the eigenvector corresponding to the largest eigenvalue of |A|, it can be computed using the power method on |A|, which requires the ability to compute |A|z for arbitrary z. Clearly if A is known to be non-negative, |A|z = Az and there is no need for approximations. In this section we show the row norms of A can be approximated by Az, where z is a vector of random ±1's; this choice of z optimizes the approximation to the 2-norm of the rows of A. Everything generalizes to apply to A^T z
and the 1-norms of the columns of A, which can be approximated by A^T z.

Given some vector x (which could be a row of A), let X = Σ_i x_i z_i, where the z_i are independent and identically distributed (i.i.d.) random variables such that E(z_i) = 0 and V(z_i) = 1. Lemmas 5 and 6 state useful facts which are easily derivable using basic statistics (see, for example, [51, Section 4.1, 4.3]).

Lemma 5 E(X) = 0.

Proof: E(X) = Σ_i E(z_i x_i) = Σ_i 0 = 0.

Lemma 6 V(X) = E(X^2) = ||x||_2^2.

Proof: V(X) = Σ_i V(z_i x_i) = Σ_i x_i^2 V(z_i) = Σ_i x_i^2 = ||x||_2^2.

Since X^2 naturally approximates the square of the 2-norm, we want to choose the probability distribution of the z_i so that the variance of X^2 is minimized. Theorem 8 proves that choosing z_i to equal 1 or -1 with probability .5 is the best probability distribution; the following lemma is useful in the proof of Theorem 8.

Lemma 7 If z equals 1 or -1 with equal probability .5, E(z^4) = 1. This is minimal under the constraints that E(z) = 0 and E(z^2) = 1.

Proof: V(z^2) = E(z^4) - (E(z^2))^2 = E(z^4) - 1. V(z^2) ≥ 0 by definition, therefore E(z^4) ≥ 1. If z = ±1 with equal probability, E(z^4) = 1, achieving the lower bound.

Theorem 8 Let X = Σ_i x_i z_i, where x is given and the z_i are i.i.d. random variables such that E(z_i) = 0 and V(z_i) = 1. If z_i equals 1 or -1 with probability .5, V(X^2) is minimized and equals 2( ||x||_2^4 - ||x||_4^4 ).

Proof: Since V(X^2) = E(X^4) - E(X^2)^2 = E(X^4) - ||x||_2^4, minimizing V(X^2) requires choosing the probability distribution of z to minimize E(X^4).

E(X^4) = E( (Σ_i z_i x_i)^4 )
       = E( Σ_i z_i^4 x_i^4 + 4 Σ_{i≠j} z_i^3 z_j x_i^3 x_j + 3 Σ_{i≠j} z_i^2 z_j^2 x_i^2 x_j^2
            + 6 Σ_{ijk distinct} z_i^2 z_j z_k x_i^2 x_j x_k + Σ_{ijkl distinct} z_i z_j z_k z_l x_i x_j x_k x_l )
Because the z_i are independent and E(z_i) = 0, most of the above sums equal zero, leaving:

E(X^4) = E( Σ_i z_i^4 x_i^4 + 3 Σ_{i≠j} z_i^2 z_j^2 x_i^2 x_j^2 )
       = Σ_i x_i^4 E(z_i^4) + 3 Σ_{i≠j} x_i^2 x_j^2 E(z_i^2) E(z_j^2)
       = Σ_i x_i^4 E(z_i^4) + 3 Σ_{i≠j} x_i^2 x_j^2

From Lemma 7, E(z_i^4) is minimized when z equals 1 or -1 with probability .5, in which case E(z_i^4) = 1. Therefore the minimum value of E(X^4) is Σ_i x_i^4 + 3 Σ_{i≠j} x_i^2 x_j^2 = ||x||_4^4 + 3( ||x||_2^4 - ||x||_4^4 ) = 3 ||x||_2^4 - 2 ||x||_4^4. Since V(X^2) = E(X^4) - ||x||_2^4, the minimum value of V(X^2) is 2 ||x||_2^4 - 2 ||x||_4^4. If z equals 1 or -1 with equal probability, V(X^2) is minimized, which optimizes our approximation.

The following corollary notes the approximation does not estimate the 2-norm of all vectors equally well.

Corollary 9 Let E(X^2) = ||x||_2^2 = 1. V(X^2) is minimized, and equals 0, when x is a vector of all zeros except for one element of magnitude 1. V(X^2) is maximized, and equals 2(1 - 1/n), when x is a vector whose elements all have magnitude 1/√n.

Proof: Let y = [x_1^2, x_2^2, ..., x_n^2]^T, so V(X^2) = 2( ||x||_2^4 - ||x||_4^4 ) = 2( ||y||_1^2 - ||y||_2^2 ). From basic vector norm properties we know ||y||_1/√n ≤ ||y||_2 ≤ ||y||_1. The condition ||x||_2^2 = 1 means ||y||_1 = 1, so 1/√n ≤ ||y||_2 ≤ 1. The lower bound is achieved when all entries of y have magnitude 1/n; the upper bound is achieved when all entries of y are 0, except for one with magnitude 1. Therefore V(X^2) is minimized, and equals 0, when all the entries of x are 0, except for one entry with magnitude 1. V(X^2) is maximized, and equals 2(1 - 1/n), when all the entries of x have magnitude 1/√n.

This analysis shows that |(Az)_i| is likely to have a magnitude close enough to (|A|e)_i for our purposes.
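A small MATLAB experiment, not from the dissertation, illustrates the analysis above: for a random ±1 vector z, the magnitude of (Az)_i estimates the 2-norm of row i, which is then used in place of the exact row norms. The test matrix and its size are arbitrary choices.

% One matrix-vector product estimates all row norms at once.
n = 1000;
A = randn(n);                        % arbitrary dense test matrix
z = sign(rand(n, 1) - 0.5);          % independent +/-1 entries, probability .5 each
p = A * z;                           % single matrix-vector product
rownorms = sqrt(sum(A.^2, 2));       % exact 2-norms of the rows, for comparison
fprintf('median of |p_i| / ||A(i,:)||_2 = %.3f\n', median(abs(p) ./ rownorms));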
Algorithms

We now describe the three Krylov balancing algorithms.

One-Sided Algorithm: To justify the one-sided algorithm, note that a lower bound on ||DAD^{-1}||_∞ is ρ(|A|) [172]. If A is irreducible and x is the right Perron vector of |A|, D = diag(1/x(1), 1/x(2), ..., 1/x(n)) achieves the lower bound, since the right Perron vector of |DAD^{-1}| is e, so ||DAD^{-1}||_∞ = ||D|A|D^{-1} e||_∞ = ρ(|A|).

From section 2.2.3, we know how to approximate the power method (described in, for example, [52, 82]) on |A| using matrix-vector multiplications with A. Therefore, pseudocode for the one-sided algorithm is as given in figure 2.4.

KrylovAz(A, t)
1   for j ← 1 to t
2     z ← n × 1 vector of random ±1's
3     Compute p ← Az
4     for i = 1 to n
5       if (p(i) = 0)
6         D(i) ← 1
7       else D(i) ← 1/|p(i)|
8     endfor
9     A ← DAD^{-1}
10  endfor

Figure 2.4: Pseudocode for the KrylovAz Krylov-based balancing algorithm.

Although more iterations (i.e. t > 1) do not exactly correspond to more iterations of the power method, experiments show more iterations can sometimes improve the solution. Experiments described in [33, 35] suggest a default value of t = 5 works well in practice.

The pseudocode in figure 2.4 assumes A is available explicitly, so the command A ← DAD^{-1} can be executed. If instead of A we are given a function A(x) which computes Ax, the pseudocode in figure 2.5 shows the new code. In this case the scaling matrix D is returned, rather than the balanced matrix.

Two-Sided Algorithm: With the two-sided algorithm we can compute both Az and A^T z. Because this provides more information about A, we expect the two-sided algorithm to
KrylovAz(A, t)
1   D ← I
2   for j ← 1 to t
3     z ← n × 1 vector of random ±1's
4     z ← D^{-1} z
5     p ← A(z)
6     p ← Dp
7     for i = 1 to n
8       if (p(i) = 0)
9         D(i) ← D(i)
10      else D(i) ← D(i)/|p(i)|
11    endfor
12  endfor

Figure 2.5: Pseudocode for the KrylovAz Krylov-based balancing algorithm if A is not accessed explicitly.

outperform the one-sided algorithm. We motivate the two-sided algorithm KrylovAtz in two ways: first by comparison to the direct iterative algorithm described in [143, 147], and then by using Perron-Frobenius theory and weighted balancing.

The standard iterative algorithm computes r, the norm of row i, and c, the norm of column i. Elements of row i are then scaled by √(c/r) and elements of column i by √(r/c). Instead of exactly computing individual row and column norms, KrylovAtz approximates all the row and column norms by choosing a random ±1 vector z and computing Az and A^T z. Rather than balancing one row and one column in each iteration, KrylovAtz uses the approximate row and column norms to scale the entire matrix. The difference between the standard iterative algorithm and KrylovAtz is similar to the difference between the Jacobi and Gauss-Seidel iterative methods for solving systems of linear equations (see, for example, [14]).

Another motivation for KrylovAtz uses Perron-Frobenius theory. Section 2.2.1 defines weighted balancing and shows that to compute a weighted balancing of a non-negative, irreducible matrix A, the scaling matrix D should equal
diag(√(y_1/x_1), √(y_2/x_2), ..., √(y_n/x_n)), where [y_1 y_2 ... y_n]^T is the left Perron vector of A and [x_1 x_2 ... x_n]^T is the right Perron vector of A. From Section 2.2.3 we know multiplying A by a random vector of ±1's approximates one step of the power method on |A| with a starting vector of all 1's; multiplying A^T by a random vector approximates one step of the power method on |A^T|. Therefore one iteration of KrylovAtz calculates an approximation of the scaling needed for a weighted balancing of |A|.

Either motivation leads to the two-sided algorithm with the pseudocode in figure 2.6. Again, if A is not given explicitly, we can use black box functions A(x) and A^T(x) as shown in the pseudocode for KrylovAz.

KrylovAtz(A, t)
1   for j ← 1 to t
2     z ← n × 1 vector of random ±1's
3     Compute p ← Az
4     Compute r ← A^T z
5     for i ← 1 to n
6       if (p(i) = 0) or (r(i) = 0)
7         D(i) ← 1
8       else D(i) ← ( |r(i)| / |p(i)| )^{1/2}
9     endfor
10    A ← DAD^{-1}
11  endfor

Figure 2.6: Pseudocode for the KrylovAtz Krylov-based balancing algorithm.

Two-Sided Algorithm with a Cutoff Value: Experiments show that on some matrices the addition of a cutoff value to KrylovAtz further reduces the norm of the matrices returned. To add a cutoff to KrylovAtz, set D(i) to 1 in line 7 of the pseudocode for KrylovAtz when |Az(i)| is less than some cutoff value, and not only when p(i) = Az(i) or r(i) = A^T z(i) equals 0. Our implementation of KrylovCutoff uses a scaled cutoff value, i.e. the cutoff used by the algorithms is the input value given by the user multiplied by the norm of A.
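The following MATLAB function is a minimal sketch combining figure 2.6 with the cutoff modification just described; the function name, the accumulation of D across iterations (as in figure 2.5), the application of the cutoff to both p and r, and the use of the 1-norm of A for the scaled cutoff are our choices here, not necessarily those of the released code.

% Two-sided Krylov balancing with a cutoff (sketch of KrylovCutoff).
function D = krylov_cutoff(A, t, cutoff)
n = size(A, 1);
D = ones(n, 1);                          % current diagonal scaling, D(i) = D(i,i)
scaled_cutoff = cutoff * norm(A, 1);     % cutoff scaled by a norm of A
for j = 1:t
    z = sign(rand(n, 1) - 0.5);          % random +/-1 vector
    p = D .* (A  * (z ./ D));            % approximates row norms of B = D*A*inv(D)
    r = (A' * (D .* z)) ./ D;            % approximates column norms of B
    for i = 1:n
        if abs(p(i)) > scaled_cutoff && abs(r(i)) > scaled_cutoff
            D(i) = D(i) * sqrt(abs(r(i)) / abs(p(i)));
        end                              % otherwise leave D(i) unchanged
    end
end
end

A typical use would be D = krylov_cutoff(A, 5, 1e-8), after which the balanced matrix is diag(D)*A*diag(1./D), or the scaling can be applied implicitly inside an eigensolver.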
Any matrix norm can be used, so if A is not given explicitly, ||A|| can be approximated by multiplying A with a vector of random ±1's and taking the largest component of the absolute value of the resulting vector.

The two parameters chosen by the user are the number of iterations and, for KrylovCutoff, the cutoff value. Although currently we know neither the right stopping criterion nor how to choose the best cutoff value, experiments described in [33, 35] suggest the default values of 5 iterations and a cutoff value of 10^{-8}. Table 2.4 summarizes the results of running all three Krylov algorithms on the matrices in our test suite with the default parameter settings. Tests were done in MATLAB and the Frobenius norm was used in all cases.

matrix     | original norm | (norm with balancing)/(original norm)
           |               | spbalance | KrylovAz | KrylovAtz | KrylovCutoff
tols2000   | 5.4e+07 | 1.1e-03 | 1.2e-01 | 4.2e-03 | 3.5e-03
qh1484     | 4.7e+16 | 1.1e-04 | 1.3e-02 | 1.7e-03 | 1.1e-04
qh882      | 2.3e+13 | 1.4e-06 | 8.3e-03 | 2.0e-02 | 1.6e-06
mhd4800a   | 4.1e+05 | 2.4e-03 | 7.6e-01 | 1.7e-02 | 5.6e-03
qh768      | 2.5e+13 | 1.4e-06 | 1.2e-02 | 8.4e-03 | 1.6e-06
elman      | 1.4e+02 | 1.0e+00 | 6.2e+00 | 2.7e+00 | 2.1e+00
mhd3200a   | 1.8e+05 | 3.7e-03 | 2.3e+00 | 2.6e-02 | 1.6e-02
ecsiemensb | 5.4e-04 | 1.0e+00 | 4.1e+01 | 1.0e+00 | 1.0e+00
mhd1280a   | 1.3e+05 | 4.8e-03 | 2.8e+00 | 5.7e-02 | 4.5e-02
ecsiemensa | 5.4e-04 | 1.0e+00 | 3.5e+03 | 1.0e+00 | 1.0e+00
t240       | 3.7e+05 | 6.5e-03 | 2.4e-01 | 6.2e-02 | 9.7e-02
mhd416a    | 2.9e+03 | 2.8e-02 | 1.5e+01 | 2.4e-01 | 2.9e-01
qc2534     | 4.0e+01 | 1.0e+00 | 3.5e+00 | 1.0e+00 | 1.0e+00
qc324      | 5.6e+00 | 1.0e+00 | 3.4e+00 | 1.0e+00 | 1.0e+00

Table 2.4: This table shows the ratio of norms of matrices with and without balancing. All Krylov algorithms were run with the default values of 5 iterations and a cutoff value of 10^{-8}.

For the matrices in our test suite, Table 2.4 shows that using the default of 5 iterations and a cutoff of 10^{-8} gives excellent results with KrylovCutoff. Of the 14 matrices in our test suite, there are 5 whose norms were not improved by spbalance. For 4 of these 5 matrices, KrylovCutoff with the default cutoff also did not affect the norm of the matrix (see Table 2.4). On the fifth, the elman matrix, using the default cutoff and number of iterations increased the norm by a factor of about 2, which is not much. Of the remaining 9 matrices, the norm of the matrix returned by KrylovCutoff with
default parameter settings is still typically within an order of magnitude of the norm of the matrix returned by spbalance. KrylovAz slightly reduces the norm of six of the test matrices. KrylovAtz performs similarly to KrylovCutoff, except on the qh768 and qh882 matrices where KrylovCutoff does four orders of magnitude better.

2.3 Results

The quality of a preconditioning algorithm can be measured both by how efficiently the preconditioner S itself can be computed, and by the extent to which it improves the performance of an eigensolver applied to SAS^{-1}. The latter can be considered improved if it runs faster, if the backward error bounds are improved, or if the eigenvalues computed are more accurate. In previous sections we discussed decomposing a matrix to improve the running time of the eigensolver and using diagonal scaling to improve the backward error bounds. We now turn to showing that our preconditioners improve the accuracy of the computed eigenvalues.²

To study the accuracy of computed eigenvalues, we need to compare them to the true eigenvalues. Since the true eigenvalues are not known, we estimate them by computing the eigenvalues using our most accurate algorithm: first finding the strongly connected components, then using the double precision codes for balancing and computing the eigenvalues of each diagonal block. By then computing the relative error bound associated with computing these eigenvalues and observing that they were very small (less than 10^{-10} for both the qh768 and tols2000 matrices studied later in this section), we knew the estimated eigenvalues were close to the true eigenvalues.

The following defines the relative error bound we used. Let λ_i be an eigenvalue of matrix A, A(λ_i) be the strongly connected component of A containing λ_i, and cond_num(λ_i) be the condition number of λ_i as an eigenvalue of A(λ_i). The following is a relative error bound on λ_i:

error_bound(λ_i) = ε · cond_num(λ_i) · ||A(λ_i)|| / |λ_i|    (2.3)

Whether ε is the single or double precision machine epsilon depends on the precision used to compute λ_i.

² Note that balancing is also used in other situations such as for adjusting social accounting matrices [167] and for finding ε-decompositions [68].
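The bound (2.3) can be evaluated directly in MATLAB. The fragment below is a minimal sketch, assuming As holds the strongly connected block containing the eigenvalues of interest; condeig supplies the eigenvalue condition numbers, eps is the double-precision machine epsilon, and the 2-norm is used for ||A(λ_i)||.

% Componentwise evaluation of the relative error bound (2.3).
[V, Lam, s] = condeig(As);                        % s(k) = condition number of the k-th eigenvalue
lambda = diag(Lam);
error_bound = eps * s .* norm(As) ./ abs(lambda); % one bound per computed eigenvalue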
33 These true eigenvalues were used to measure the accuracy of eigenvalues computed by less precise means in order to gauge the effects of balancing. Specifically, we compared them to eigenvalues computed in single precision, both with and without balancing, using both dense (geevx in LAPACK [5]) and sparse (eigs in MATLAB [170]) eigensolvers. These tests were done on a Sun workstation, using MATLAB, with MATLAB s mexfile interfaces for calling the C routines geevx, gebal, and spbalance. 2.3.1 Balancing and Dense Eigensolvers In this section we compare eigenvalues computed in single precision by geevx after balancing against the true eigenvalues and the worst case eigenvalues, the latter computed in single precision without balancing. By the norm of an eigenvalue we mean the norm of the matrix block containing that eigenvalue. An unbalanced, unpermuted matrix may still have natural blocks; geevx will identify this upper block triangular structure and decompose the eigenproblem accordingly. Error bounds for the case when the matrix is not balanced should be closer to the true error if natural block structure is taken into consideration when computing condition numbers and norms. Our experiments show balancing by gebal and spbalance reduces norms and condition numbers, frequently leading to not only more accurate eigenvalues, but also more accurate error bounds (i.e., closer to the actual error). Due to the probabilistic nature of the Krylov balancing algorithms, the results with Krylov balancing were less predictable. Direct Balancing Algorithms We found balancing using spbalance often reduced eigenvalue condition numbers and norms, leading to both more accurate error bounds and more accurately computed eigenvalues. These effects were seen on several test matrices; we present here results on the qh768 and tols2000 matrices to show in detail some of the types of effects of balancing on accuracy. Figures 2.7 and 2.8 each show two graphs, both plotted on a logarithmic scale. The graph on the left plots the relative error in eigenvalues calculated with and without balancing. Each cross shows the relative error of an eigenvalue computed without balanc-
ing (the horizontal axis) plotted against the error of that same eigenvalue computed with spbalance as a preconditioner (the vertical axis). Crosses below the dotted diagonal line represent eigenvalues calculated more accurately with balancing. The further a cross lies from the line, the greater the effect of balancing on accuracy. The graph on the right plots the ratio of the actual relative error to the error bound for eigenvalues computed after applying the original dense balancing algorithm, after the sparse algorithm spbalance, and without any balancing. Values greater than 1 reflect the fact that our error bounds are not exact since our true eigenvalues are not precise. Similarly, if the lines were horizontal with value 1, this would mean the error bounds are exact; in practice this is unlikely for reasons including errors in our true eigenvalues and poor condition numbers. Note that the order of the eigenvalues may differ between the lines plotted since the eigenvalues in each are sorted to make the graphs easy to read.

[Two graphs: left, "qh768: error without balancing vs. error with spbalance" (relative error with spbalance against relative error without balancing); right, "qh768: comparison of eigenvalue accuracy to error bound" (eigenvalue accuracy / error bound over the eigenvalues, for spbalance, gebal, and nobal).]

Figure 2.7: Plots examining the accuracy of eigenvalues computed with and without direct balancing for the qh768 matrix.

Figure 2.7 plots results for the qh768 matrix. All but one of the crosses in the left hand graph are below the diagonal, showing balancing improves the relative accuracy to which the eigenvalues are computed. Furthermore the right hand graph shows balancing improves the error bounds by 8 to 14 orders of magnitude. Figure 2.8 shows balancing has an even greater effect on the tols2000 matrix, improving both the error bound and the accuracy. Because the tols2000 matrix has several strongly connected components of size 2, spbalance is preferable to gebal.
[Two graphs: left, "tols2000: error without balancing vs. error with spbalance" (relative error with spbalance against relative error without balancing); right, "tols2000: comparison of eigenvalue accuracy to error bound" (eigenvalue accuracy / error bound over the eigenvalues, for spbalance, gebal, and nobal).]

Figure 2.8: Plots examining the accuracy of eigenvalues computed with and without direct balancing for the tols2000 matrix.

Clearly for both the qh768 and tols2000 matrices, balancing improves the accuracy of the computed eigenvalues and the quality of the error bounds.

Krylov Balancing Algorithms

Because the Krylov algorithms are probabilistic, they are more difficult to test than the direct balancing algorithms. For these experiments we ran each Krylov algorithm at least three times on each test matrix, then took the balanced matrix with the smallest Frobenius norm and computed its eigenvalues. We use the default Krylov balancing parameter settings of 5 iterations and a cutoff of 10^{-8}.

The graphs in figures 2.9 and 2.10 plot the same things as those in figures 2.7 and 2.8, but using the Krylov balancing algorithms. The graphs in figure 2.9 show KrylovAtz and KrylovCutoff improve the error bounds for the qh768 matrix. More importantly, with KrylovCutoff almost all the eigenvalues are computed more accurately than without any balancing. Although table 2.4 shows that in double precision KrylovAz reduces the norm of qh768, unfortunately KrylovAz does not significantly improve the error bounds. In figure 2.10 we see balancing the tols2000 matrix using KrylovCutoff also leads to more accurate calculation of almost all the eigenvalues. However, the effect of Krylov balancing on error bounds is unpredictable.

Our results show the Krylov algorithms can improve the accuracy of computed
[Two graphs: left, "qh768: error without balancing vs. error with KrylovCutoff" (relative error with KrylovCutoff against relative error without balancing); right, "qh768: comparison of eigenvalue accuracy to error bound" (eigenvalue accuracy / error bound over the eigenvalues, for spbalance, KrylovCutoff, KrylovAtz, KrylovAz, and nobal).]

Figure 2.9: Plots examining the accuracy of eigenvalues computed with the different Krylov-based balancing algorithms for the qh768 matrix. There are two solid lines in the right hand graph; the higher one is for the results with spbalance and the lower one is for the results without balancing.

[Two graphs: left, "tols2000: error without balancing vs. error with KrylovCutoff" (relative error with KrylovCutoff against relative error without balancing); right, "tols2000: comparison of eigenvalue accuracy to error bound" (eigenvalue accuracy / error bound over the eigenvalues, for spbalance, KrylovCutoff, KrylovAtz, KrylovAz, and nobal).]

Figure 2.10: Plots examining the accuracy of eigenvalues computed with the different Krylov-based balancing algorithms for the tols2000 matrix. There are two solid lines in the right hand graph; the higher one is for the results with spbalance and the lower one is for the results without balancing.

eigenvalues for some matrices. As expected, KrylovAz does not perform as well as KrylovAtz and KrylovCutoff. Since balancing means setting row and column norms equal to one another, KrylovAz is at a disadvantage since it cannot use matrix-transpose-vector multiplications to gather information about column norms. We also note that
strongly connected component structure cannot be exploited with these Krylov algorithms since the matrix is not accessed directly.

2.3.2 Balancing and Sparse Eigensolvers

The previous section looked at the accuracy of eigenvalues computed by the direct eigensolver geevx with Krylov balancing as a preconditioner. In practice, if a sparse matrix is given explicitly and a user is willing to run an O(n^3) dense eigensolver, spbalance is probably the better choice of balancing algorithm. A potentially more practical use of the Krylov balancing algorithms is as preconditioners for sparse eigensolvers which also access the matrix only through matrix-vector multiplications.

In this section we compare the accuracy of eigenvalues computed using a sparse eigensolver with and without Krylov-based balancing. All computations were done in double precision so that we could use the eigs function in MATLAB. The eigs function has a number of user-specified parameters. We specify that the largest 10 and the smallest 10 eigenvalues in magnitude should be computed, and use the default values for all other parameters. For the Krylov algorithms we again use the default values of 5 iterations and a cutoff value of 10^{-8}.

Figures 2.11 and 2.12 plot the relative accuracy of the 10 largest and smallest eigenvalues when calculated without balancing, with each of the three varieties of Krylov balancing, and with spbalance. In each plot the eigenvalues were sorted to make the graphs easier to read. Of the two solid black lines, the upper is typically the result without balancing and the lower is with spbalance.

In figure 2.11, with the qh768 matrix, we see little difference between algorithms for the smallest eigenvalues, though KrylovAtz seems to do best. On the largest eigenvalues KrylovCutoff does almost as well as spbalance, with both KrylovAz and KrylovAtz slightly outperforming the results without balancing.

In figure 2.12, with the tols2000 matrix, all three Krylov balancing algorithms do significantly better than the case without balancing on the large eigenvalues. On the small eigenvalues KrylovAz does poorly and KrylovCutoff does well; however the difference between balancing with spbalance, balancing with KrylovAtz, and not balancing is at most about 2 digits.

These graphs show that on some matrices a Krylov-based balancing algorithm is a
[Two graphs: "qh768: comparison of relative accuracy for smallest 10 eigenvalues" and "qh768: comparison of relative accuracy for largest 10 eigenvalues" (relative accuracy over the eigenvalues, for no bal, KrylovAz, KrylovAtz, KrylovCutoff, and spbalance).]

Figure 2.11: Plots comparing the relative accuracy of the largest and smallest (in magnitude) eigenvalues computed with the different Krylov-based balancing algorithms for the qh768 matrix. There are two solid lines in each graph; the one that is typically higher is for the results without balancing and the one that is typically lower is for the results with spbalance.

[Two graphs: "tols2000: comparison of relative accuracy for smallest 10 eigenvalues" and "tols2000: comparison of relative accuracy for largest 10 eigenvalues" (relative accuracy over the eigenvalues, for no bal, KrylovAz, KrylovAtz, KrylovCutoff, and spbalance).]

Figure 2.12: Plots comparing the relative accuracy of the largest and smallest (in magnitude) eigenvalues computed with the different Krylov-based balancing algorithms for the tols2000 matrix. There are two solid lines in each graph; the one that is typically higher is for the results without balancing and the one that is typically lower is for the results with spbalance.

good preconditioner for a sparse eigensolver. Unfortunately, the magnitude of the potential gain in accuracy on specific eigenvalues varies and is difficult to predict. However, of all the
Krylov-based algorithms, our experiments show KrylovCutoff with the default values of 5 iterations and 10^{-8} for the cutoff generally performs best.

2.4 Conclusions

In this chapter we described two classes of techniques for preconditioning sparse matrices prior to computing their eigenvalues. We showed how to use algorithms for computing the strongly connected components of a graph in order to find an improved decomposition of a matrix prior to computing eigenvalues. We briefly described spbalance, code which combines the strongly connected components algorithm with a standard iterative balancing phase.

We also described a family of new probabilistic balancing algorithms which use only matrix-vector and matrix-transpose-vector multiplications to access the matrix, and so can be used when a matrix is not available explicitly. Of the three Krylov algorithms, KrylovAz uses only matrix-vector multiplications, while KrylovAtz and KrylovCutoff use both matrix-vector and matrix-transpose-vector multiplications. We explain why KrylovAz and KrylovAtz should help reduce the norm of the matrix, using ideas both from Perron-Frobenius theory and the direct iterative method. The addition of a cutoff value to KrylovAtz gives us our final algorithm, KrylovCutoff. With the cutoff value set appropriately, KrylovCutoff reduced the norm of all the matrices in our test suite to within a factor of 2 of the norm of the matrix after balancing using spbalance. With the default values of 5 iterations and 10^{-8} for the cutoff, KrylovCutoff reduced the norm of all the matrices to within an order of magnitude of the norm with spbalance, and typically reduced the norm to within a factor of 2.5.

Since our focus is on preconditioning for eigensolvers, we ended with tests showing how our algorithms affect the accuracy of eigenvalues computed by both dense and sparse eigensolvers. The direct balancing algorithms consistently reduced the norms of matrices and the condition numbers of eigenvalues, leading to more accurate eigenvalues and better error bounds. The Krylov algorithms are probabilistic and therefore more difficult to test. Nevertheless, our experiments showed that they could be useful preconditioners, though the exact nature of the improvement on different matrices is difficult to predict.

Many questions remain unanswered, including several related to our Krylov-balancing schemes:
Why does adding a cutoff value lead to the much better performance of KrylovCutoff on some matrices? Can we determine the best cutoff value for a given matrix without using trial and error?

Is there a way to predict the benefits in eigenvalue accuracy of the Krylov algorithms on different matrices?

There also remain interesting questions regarding general balancing theory, for example:

If we use the iterative algorithm described in figure 2.3 to balance a matrix in the ∞-norm, what is the rate of convergence? In [33] we looked at the convergence rate for simple matrices, but the question for general matrices is still open.

The code for spbalance (in C, with a MATLAB interface), and the three Krylov-based algorithms (in MATLAB), is available at http://www.cs.berkeley.edu/~tzuyi/balancing.
Chapter 3

Preconditioning sparse linear systems of equations

In this chapter we turn to preconditioning sparse linear systems. The problem is to find a vector x such that Ax = b, where we assume A is a sparse matrix and b is an arbitrary dense vector. Whereas in the previous chapter we preconditioned using similarity transformations mapping A to SAS^{-1}, here we use equivalence transformations, which map A to EAF^{-1}. If Ax = b, then (EAF^{-1})(Fx) = Eb. Note that because the solution is not preserved, solving for x such that Fx = y should be inexpensive if the overall solver is to be efficient.

As with eigenproblems discussed in the previous chapter, both direct and iterative methods are used to solve sparse linear systems. However, the goals of preconditioning differ somewhat depending on whether an iterative or direct solver is used.

Direct solvers compute the LU factorization of A and then solve two triangular systems to find x. If the rows and columns of A are not permuted, A = LU, where L is lower triangular and U is upper triangular. Now x can be found by finding y such that Ly = b and x such that Ux = y. However, in general, the rows and/or columns are permuted for sparsity or stability, turning A into the product of possibly permuted lower and upper triangular matrices. The solution x is then found through the solution of two, possibly permuted, triangular solves. One set of preconditioners for direct methods is based on finding good permutations; for example, the sparsity orderings discussed in section 3.2 find permutations which reduce the number of nonzeros in the L and U factors.
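In MATLAB, for instance, a permuted sparse direct solve can be written as the following minimal sketch, assuming a sparse nonsingular A and a right-hand side b are available; with four outputs, lu chooses both row and column permutations, so P*A*Q = L*U.

% Direct solve of A*x = b with row and column permutations.
[L, U, P, Q] = lu(A);          % sparse LU factorization, P*A*Q = L*U
y = L \ (P * b);               % forward substitution on the permuted right-hand side
x = Q * (U \ y);               % back substitution, then undo the column permutation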
Iterative solvers, on the other hand, iteratively improve an initial guess at the solution, stopping either when the solution is sufficiently accurate or when a user-supplied upper bound on the number of iterations to run has been reached. There are a wide variety of solvers, many described in books such as [14, 86, 95, 158], each generating the successive guesses to x differently. Choosing a good preconditioner is very important as it can mean the difference between converging in 10 iterations, converging in 1000 iterations, or not converging at all.

The question, then, is how to choose a preconditioner for an iterative solver. To answer this, we need to first look at how to judge a preconditioned iterative solver. The most important criterion is whether the solver converges to the solution, in the sense that if the solver never converges or converges to the wrong vector, the user will not have the solution they want. However, if a user has a stricter criterion, for example needing a solution within a given time period, a solver that finds the solution after that time will be as useless as one that never finds the answer. Of course, in general the preconditioner should be relatively inexpensive to compute and apply, and the computer resources the preconditioner uses should be predictable. Ideally, when comparing preconditioners, the user should have a sense of how much memory a given preconditioner will use and the extent to which it is likely to improve the convergence of the iterative solver.

In this chapter we begin by describing popular preconditioning techniques that reorder the rows and/or the columns of A. These methods were developed for and are commonly used in direct solvers, but have sometimes proven useful for iterative methods as well. Ordering can decompose the matrix, potentially reducing the problem to a collection of smaller ones (section 3.1). Ordering can also reduce fill in the factors (section 3.2), where fill is basically the additional storage needed for L and U over the storage needed for A. Finally we discuss methods for ordering A to increase stability, which can reduce the need for pivoting (section 3.3).

After describing preconditioners that can be used for both direct and iterative methods, in section 3.4 we turn to preconditioners used solely for iterative solvers. We focus on incomplete Cholesky (IC) and incomplete LU (ILU) factorizations, which form a popular class of preconditioners. All compute an approximate LU factorization of A, L̂Û ≈ A, and use L̂Û as the preconditioner. Since complete and incomplete factorizations are closely related, preconditioning methods developed for complete factors can also prove useful for incomplete methods, and we discuss how the techniques covered in sections 3.2
43 and 3.3 affect ILU preconditioners. In section 3.4 we summarize the history of incomplete factorizations and describe a modified value-based ILU algorithm with more predictable memory usage. The contributions of this chapter are in several areas. We describe several different reasons for reordering the rows and columns of A prior to factoring it. We present data showing the effects of ordering matrices with a nonsymmetric permutation and decomposition rather than the symmetric strongly connected components decomposition described previously in section 2.1.2. For one matrix the size of the largest block found with a nonsymmetric permutation is a tenth of the size of the largest block found with a symmetric permutation. We also note in section 3.3.2 that using a stability ordering in concert with a column approximate minimum degree ordering can lead to a fill in the LU factors that is very different from that of using the sparsity ordering alone. On our test matrices the difference could be up to a factor of 2. Focussing on one specific algorithm for reordering A, we next describe our design and implementation of a threaded column approximate minimum degree algorithm. Though we analyzed the performance of the algorithm in detail, our final implementation never achieved a speedup of more than 3 on 8 processors of an SGI Power Challenge machine, and more typically there was virtually no speedup. This work, done jointly with Sivan Toledo and John Gilbert [37], gives us a better understanding of the difficulties of efficiently implementing algorithms with fine-grained parallelism even in a shared memory environment. Finally we turn to incomplete LU factorizations, a family of preconditioners often used with iterative solvers. We propose a modification to a standard ILU scheme and show that it uses the memory the user makes available more efficiently, leading to a greater likelihood of convergence in the preconditioned iterative solver. We also study the effects of different ordering algorithms on the convergence of ILU preconditioned GMRES(50) finding, for example, that both ordering for stability and partial pivoting are necessary for best performance. Our conclusions are based on data gathered from tens of thousands of test runs combining matrices with different ILU algorithms, parameter settings, scaling algorithms, and ordering algorithms.
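As a concrete point of reference for the kind of experiment described above, the following MATLAB fragment runs GMRES(50) preconditioned by an incomplete LU factorization. This is only a sketch using MATLAB's built-in Crout ILU, not the modified value-based ILU studied in this chapter; the drop tolerance and iteration limits are arbitrary illustrative settings.

% GMRES(50) preconditioned by an incomplete LU factorization.
setup = struct('type', 'crout', 'droptol', 1e-3);   % threshold-based incomplete factorization
[Li, Ui] = ilu(A, setup);                           % Li*Ui is an approximation of A
[x, flag, relres, iter] = gmres(A, b, 50, 1e-8, 20, Li, Ui);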
3.1 Decomposing the matrix

A well-known technique for reducing the running time of an algorithm on a large matrix is to first decompose the matrix into smaller blocks in such a way that the overall solution can be constructed from the solutions to these smaller problems. As described in section 2.1.2, for preconditioning sparse eigenproblems the best decomposition corresponds to performing a topological sort on the strongly connected components of DG(A). For preconditioning sparse linear systems we can do better since we are no longer restricted to symmetric permutations. The decomposition used when the rows and columns can be permuted independently permutes A into block upper triangular form. Of course, if A is structurally symmetric and we want to preserve its symmetry, we should still use the strongly connected components ordering.

In [61], Dulmage and Mendelsohn studied the problem of choosing permutation matrices R and C such that RUC, where U is not necessarily square, is as block upper triangular as possible. In their terminology, the matrices R and C give the canonical decomposition of U. Though their algorithm is defined on general rectangular matrices, for our purposes in the following description we consider only square, nonsingular matrices since otherwise the system described by the matrix is not solvable.

The two-phase algorithm for computing the canonical decomposition of a matrix A described in [61] first finds a possibly nonsymmetric permutation which maximizes the number of nonzeros on the diagonal of A. If A is nonsingular, the diagonal will be completely nonzero. The second phase then permutes the square submatrix into block upper triangular form using a symmetric permutation. If U is square and already has a nonzero diagonal, the algorithm is the same as the strongly connected components algorithm described in section 2.1.2 [151].

This decomposition is also mentioned in concert with the strong Hall property on bipartite graphs in [80]. Given a matrix U, we say BG(U), as defined in section 1.3.2, has the strong Hall property if every set of k column vertices is adjacent to at least k + 1 row vertices, for 1 ≤ k < n. In [80] Gilbert and Ng state the known fact that the canonical decomposition of a matrix whose bipartite graph does not have the strong Hall property will have more than one diagonal block.

To solve Ax = b for x, where A is in block upper triangular form, we solve the diagonal blocks one at a time, with a block form of back substitution between each block solve.
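The MATLAB fragment below is a minimal sketch, not code from the dissertation, of this block back substitution: dmperm computes the canonical (Dulmage-Mendelsohn) permutations of a sparse nonsingular A, and the diagonal blocks are then solved from last to first. The variable names and the dense right-hand side b are illustrative assumptions.

% Solve A*x = b via the block upper triangular form of A.
[p, q, r, s] = dmperm(A);              % A(p,q) is block upper triangular
C = A(p, q);
c = b(p);  c = c(:);
n  = size(A, 1);
nb = length(r) - 1;                    % number of diagonal blocks
w  = zeros(n, 1);
for k = nb:-1:1                        % block back substitution, last block first
    I = r(k):r(k+1)-1;                 % rows of block k
    J = s(k):s(k+1)-1;                 % columns of block k (same range for square A)
    rhs = c(I) - C(I, s(k+1):end) * w(s(k+1):end);
    w(J) = C(I, J) \ rhs;              % factor and solve only this diagonal block
end
x = zeros(n, 1);  x(q) = w;            % undo the column permutation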
This can save both space and time: since the blocks are solved individually, we need to allocate only as much memory as needed to factor the largest block, and the total time also will likely depend largely on the time needed for the largest block.

An extreme example demonstrating the importance of recognizing block triangular structure is a lower triangular matrix. Permuted into block upper triangular form, this matrix becomes an upper triangular matrix, and back substitution alone, without any factorization step, would find the solution vector. Without first looking to decompose the problem using unsymmetric permutations, the problem is viewed as one large block, and if the largest entry of the matrix is in the lower left corner in the permuted matrix, the complete LU factors (with pivoting) will be dense. Matrices from real applications also provide dramatic examples. The canonical decomposition of the tols4000 matrix, with dimension 4000, consists of 3983 diagonal blocks, the largest having dimension 18. The canonical decomposition of the lhr71 matrix, with dimension 70304, consists of 7066 diagonal blocks, the largest with dimension 7663. Tables 3.1 and 3.2 show the decomposition information on all of our test matrices, together with information on the decomposition found by the strongly connected components algorithm described in section 2.1.2.

3.2 Ordering for sparsity

When ordering to decompose a matrix A, we know the ordering that finds the canonical decomposition is optimal in the sense that it permutes A into the block upper triangular form with the maximum number of diagonal blocks. Furthermore, we have an inexpensive algorithm for computing the decomposition. For most preconditioning techniques we are less fortunate. For example, consider ordering the rows and columns of A to minimize nnz(L+U). Unfortunately, computing the permutation matrix P that minimizes the fill in the factors of PAP^T is prohibitively expensive (i.e., the decision version is NP-complete [70]). Similarly, we believe the unsymmetric version, which looks to minimize the fill in the factors of PAQ^T, is also NP-complete, although this appears to be an open question. In both cases, lacking an efficient algorithm, we turn to heuristics which may not find the optimal ordering, but which hopefully find something close.

Because the amount of fill can vary dramatically depending on the ordering of the rows and columns in the matrix, using a good fill-reducing ordering is important.
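A quick MATLAB experiment, with an arbitrary sparse test matrix A, shows the kind of difference a fill-reducing ordering can make; colamd is a column approximate minimum degree ordering, and the factorizations below use partial pivoting with and without the reordered columns.

% Compare fill in the LU factors with and without a fill-reducing column ordering.
qcol = colamd(A);                          % fill-reducing column permutation
[L1, U1, P1] = lu(A(:, qcol));             % partial pivoting, reordered columns
[L2, U2, P2] = lu(A);                      % partial pivoting, original column order
fprintf('nnz(L+U) with colamd: %d, without: %d\n', ...
        nnz(L1) + nnz(U1), nnz(L2) + nnz(U2));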
name | n | scc: # blocks | scc: max block size | dmperm: # blocks | dmperm: max block size
NASASRB | 54870 | 1 | 54870 | 1 | 54870
add32 | 4960 | 1 | 4960 | 1 | 4960
af23560 | 23560 | 1 | 23560 | 1 | 23560
appu | 14000 | 1 | 14000 | 1 | 14000
av41092 | 41092 | 4 | 41086 | 4 | 41086
bbmat | 38744 | 1 | 38744 | 1 | 38744
bramley1 | 17933 | 1 | 17933 | 1 | 17933
bramley2 | 17933 | 1 | 17933 | 1 | 17933
circuit 1 | 2624 | 6 | 2614 | 60 | 2560
circuit 2 | 4510 | 3231 | 1280 | 3249 | 1262
circuit 3 | 12127 | 4399 | 7729 | 4521 | 7607
circuit 4 | 80209 | 27923 | 52287 | 28005 | 52005
cry10000 | 10000 | 1 | 10000 | 1 | 10000
dw8192 | 8192 | 1 | 8192 | 1 | 8192
ecl 32 | 51993 | 9653 | 42341 | 9653 | 42341
ex11 | 16614 | 1 | 16614 | 1 | 16614
extr1 | 2837 | 1 | 2837 | 425 | 2413
fs 541 2 | 541 | 2 | 540 | 2 | 540
garon2 | 13535 | 1 | 13535 | 1 | 13535
gemat11 | 4929 | 2 | 4928 | 352 | 4578
goodwin | 7320 | 2 | 7319 | 2 | 7319
gre 1107 | 1107 | 1 | 1107 | 1 | 1107
hydr1 | 5308 | 1 | 5308 | 974 | 2370
inaccura | 16146 | 1 | 16146 | 1 | 16146
jpwh 991 | 991 | 146 | 846 | 146 | 846
lhr01 | 1477 | 21 | 1457 | 298 | 1171
lhr04 | 4101 | 31 | 4071 | 439 | 3658
lhr71 | 70304 | 121 | 70184 | 7066 | 7663
lns 3937 | 3937 | 290 | 3648 | 351 | 3558
lnsp3937 | 3937 | 290 | 3648 | 351 | 3558
mahindas | 1258 | 1 | 1258 | 670 | 589
mcfe | 765 | 5 | 697 | 5 | 697
mchln85ks17 | 84180 | - | - | - | -
memplus | 17758 | 1 | 17758 | 1 | 17758
mhd4800a | 4800 | 8 | 4793 | 8 | 4793

Table 3.1: This table, together with table 3.2, shows the number of diagonal blocks and the size of the largest diagonal block when the matrices are reordered using the canonical decomposition (dmperm) and the strongly connected components algorithm (scc).
name | n | scc: # blocks | scc: max block size | dmperm: # blocks | dmperm: max block size
olm5000 | 5000 | 1 | 5000 | 1 | 5000
onetone1 | 36057 | 203 | 35653 | 3843 | 32211
onetone2 | 36057 | 203 | 35653 | 3843 | 32211
orani678 | 2529 | 1 | 2529 | 700 | 1830
orsreg 1 | 2205 | 1 | 2205 | 1 | 2205
pores 2 | 1224 | 1 | 1224 | 1 | 1224
radfr1 | 1048 | 2 | 1047 | 98 | 951
raefsky3 | 21200 | 1 | 21200 | 1 | 21200
raefsky4 | 19779 | 1 | 19779 | 1 | 19779
rdist1 | 4134 | 2 | 4133 | 199 | 3936
rdist2 | 3198 | 2 | 3197 | 199 | 3000
rdist3a | 2398 | 2 | 2397 | 99 | 2300
rma10 | 46835 | 1 | 46835 | 1 | 46835
rw5151 | 5151 | 1 | 5151 | 6 | 5146
saylr4 | 3564 | 1 | 3564 | 1 | 3564
sherman3 | 5005 | 2111 | 2830 | 2111 | 2830
sherman4 | 1104 | 559 | 546 | 559 | 546
sherman5 | 3312 | 1675 | 1638 | 1675 | 1638
shyy161 | 76480 | 321 | 25440 | 25761 | 25440
shyy41 | 4720 | 81 | 1560 | 1641 | 1560
tols4000 | 4000 | 3129 | 90 | 3983 | 18
twotone | 120750 | 5 | 120746 | 13131 | 105740
utm5940 | 5940 | 147 | 5794 | 147 | 5794
vavasis1 | 4408 | 4 | 4402 | 4 | 4402
vavasis2 | 11924 | 5 | 11916 | 5 | 11916
vavasis3 | 41092 | 4 | 41086 | 4 | 41086
venkat01 | 62424 | 1 | 62424 | 1 | 62424
wang3 | 26064 | 1 | 26064 | 1 | 26064
wang4 | 26068 | 1 | 26068 | 1 | 26068
west2021 | 2021 | 1 | 2021 | 522 | 1500

Table 3.2: This table, together with table 3.1, shows the number of diagonal blocks and the size of the largest diagonal block when the matrices are reordered using the canonical decomposition (dmperm) and the strongly connected components algorithm (scc).
If A is structurally symmetric, users often choose to permute the rows and columns in the same way so as to preserve symmetry. If A is nonsymmetric, the rows and columns can be permuted independently. In addition, if nonsymmetric pivoting is used during the factorization, it will alter any symmetry in the reordered matrix. However, similar algorithms are often used for both symmetric and nonsymmetric A: if A is not symmetric, a symmetric permutation can be found for A + A^T or A^T A (assuming no numerical cancellation, so that A + A^T has the same nonzero structure as |A| + |A^T|), and then applied only to the columns (or rows) of A. The justification for using the symmetric permutation of A^T A is as follows: the fill in an LL^T factorization of P A^T A P^T is an upper bound on the fill in an LU factorization of PA, if the diagonal elements of A are nonzero [74]. Furthermore, if A has the strong Hall property, the bound is tight in that every predicted fill element can be nonzero for some numeric values of A [80]. Hence ordering A to reduce fill in P A^T A P^T will lower the bound on the fill in the LU factors of PA.

In this section we first review the history of fill-reducing orderings in section 3.2.1. In section 3.2.2 we describe our experience implementing a particular fill-reducing heuristic, approximate column minimum degree, on symmetric multiprocessors.

3.2.1 Background

As previously noted, computing the fill-minimizing ordering is prohibitively expensive, so various heuristics for finding fill-reducing orderings are used instead. In this section we give an overview of some of the most popular heuristics.

We divide the heuristics into two broad classes. The first class orders matrices by simulating the symbolic Gaussian elimination process. In each step of the elimination, these methods decide which row/column to factor next by choosing one whose elimination creates little fill. Algorithms in this class include minimum degree, first described in [174], and the many variants later proposed (see sections 3.2.1 and 3.2.1). The second class of heuristics tries to order the matrix so the permuted matrix has a structure known to generate little fill with Gaussian elimination. Algorithms in this class include the Cuthill-McKee [45] and reverse Cuthill-McKee [75] algorithms, which order a matrix to reduce its bandwidth (see section 3.2.1). The class also includes divide-and-conquer methods such as nested dissection [71], which partition the entire matrix, order the separator nodes last, and then recurse on the partitions (see section 3.2.1).
The two types of heuristics can be combined, and in section 3.2.1 we describe some hybrid orderings.

Bandwidth Reducing Orderings

There are matrix structures which are known to have limited fill. For example, if the matrix A is a band matrix with lower bandwidth b_l and upper bandwidth b_u, the lower and upper bandwidths of L + U, if computed without pivoting, are also b_l and b_u. Even if row partial pivoting is used, the lower bandwidth of L + U is still b_l, though the upper bandwidth may grow up to b_l + b_u [52]. The ordering algorithms described in this section try to limit the fill by permuting A to be a band matrix of, hopefully, low bandwidth. Since the number of nonzeros in a band matrix is clearly bounded by (b_l + b_u + 1)n, a small bandwidth limits the number of nonzeros in the factors and hence the fill.

The Cuthill-McKee ordering algorithm [45] does a breadth first search on DG(A), numbering the nodes in the order in which they are seen. The algorithm was first developed for symmetric matrices, though it can be applied to nonsymmetric matrices by, say, using the graph DG(|A| + |A^T|). It was later noticed that ordering the matrix with the reverse of the order returned by the Cuthill-McKee algorithm often gave orderings that led to reduced fill in the factors [75]. This heuristic is called the reverse Cuthill-McKee ordering. In [127] the authors compare the two orderings in terms of the number of nonzeros and number of operations needed to factor the ordered matrix using a method which does not exploit zeros within the band. They prove that for these methods the reverse Cuthill-McKee algorithm is always at least as good as the Cuthill-McKee algorithm and give a condition under which reverse Cuthill-McKee is strictly better than regular Cuthill-McKee.

When implementing these methods a few choices must be made. For example, the choice of a starting node for the breadth first search can affect the quality of the ordering: in [77] they note that the ordering seems to be better when the first node is on the periphery of the matrix, and hence they provide an algorithm for locating two nodes that are endpoints of a pseudo-diameter. In [77] the breadth first search is run at least twice, the second time starting with the last node located in the first run.
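As a small MATLAB illustration of a bandwidth-reducing ordering, the fragment below applies symrcm, which computes a reverse Cuthill-McKee ordering (using the structure of A + A' when A is nonsymmetric), to an arbitrary sparse test matrix A and reports the bandwidths before and after.

% Reverse Cuthill-McKee reordering and its effect on bandwidth.
prcm = symrcm(A);                      % reverse Cuthill-McKee permutation
B = A(prcm, prcm);                     % symmetrically permuted matrix
[bl0, bu0] = bandwidth(A);             % lower/upper bandwidth before reordering
[bl1, bu1] = bandwidth(B);             % lower/upper bandwidth after reordering
fprintf('bandwidth (lower/upper): %d/%d before, %d/%d after\n', bl0, bu0, bl1, bu1);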
50 Minimum Degree The original minimum degree algorithm for symmetric matrices, proposed by Tinney and Walker [174], operates on DG(A) = (V, E), the directed graph representation of A defined in section 1.3.2. At each step of the minimum degree algorithm the vertex v of smallest degree is chosen (ties can be broken arbitrarily) and ordered next. Edges are added to DG so that the neighbors of v now form a clique, and the process is repeated on the updated graph DG = (V \v, E ). The edges added in each step represent the fill that is created when we eliminate the row/column of A corresponding to the chosen vertex. Since the development of the original symmetric minimum degree algorithm in 1967, numerous changes have been suggested (see [73] for a summary). These changes mainly seek to speed up implementations of the minimum degree algorithm, though some may also improve the quality of the ordering. A few of the suggestions made are the quotient-graph representation [72], approximate degree updates [79], supernodes [76], and multiple eliminations [126]. The clique cover representation of the graph represents the graph as a set of cliques such that the union of all the cliques includes every edge in the graph. When a vertex is eliminated, the cliques that it belongs to are merged into one larger clique. The clique-cover representation is compact and eliminations can be implemented efficiently, but computing the degree of a node is more difficult. To compensate, implementations which use a clique-cover representation also use approximate degree updates. In other words, at any given step, the degree associated with each remaining, unordered vertex may not be its actual degree but simply some approximation. Many degree approximations are possible, and many of these approximations are summarized in [48]. Supernodes, described in [6, 73], are defined as groups of nodes with the same neighbors and so which can be represented by one representative node. When that node is eliminated, all the nodes in the supernode are eliminated. This both reduces the work done in the algorithm as only one vertex needs to have its degree updated at each step, and also can improve the quality of the ordering since these similar nodes are automatically ordered consecutively. Multiple elimination, described in [73, 126], uses the observation that the elimination of a node only affects the degrees of adjacent nodes. Therefore, instead of eliminating a single node in each step, multiple elimination chooses an independent set of nodes, all
51 with minimum degree, and eliminates them all prior to updating any node degrees. Though this is not an exact implementation of the minimum degree algorithm, the approach usually produces orderings that are just as good. Furthermore, it amortizes the cost of updating the degrees over the elimination of several nodes. Although the minimum degree algorithm was originally designed for symmetric matrices, routines for ordering nonsymmetric or even symmetric indefinite matrices exist [2, 79, 48], though they all simply compute a symmetric ordering of A T A (without explicitly forming the product), and apply the permutation found only to the columns of A. Ordering to reduce fill for general matrices is more difficult since pivoting for stability may be required in the subsequent factorization, and pivoting complicates predicting and minimizing fill. Since the fill in the LL T factorization of P A T AP T is an upper bound on the fill in an LU factorization of P A [74], reducing the former reduces the bound on the latter. In practice column minimum degree algorithms do not usually compute A T A. Instead, they use the nonzero structure of A to construct an initial clique cover of A T A directly. Essentially, the nonzero structure of each row of A is interpreted as a bitmap that specifies membership of columns in one clique. Furthermore, rather than initializing the degree of each vertex to its actual degree in A T A, again approximations are used. There has also been some work on an analogue to the minimum degree algorithm for nonsymmetric matrices. These algorithms work on the bipartite graph representation of the matrix, and instead of choosing nodes to eliminate, in each step they choose an edge (i, j) with minimal N(i) N(j). Edges are added so N(i) and N(j) form a bipartite clique, the nodes i and j are deleted, and the algorithm iterates. This scheme was first proposed in [145]; [3] contains results showing this ordering is better than others. Other orderings based on symbolic factorization The minimum degree algorithm just described is far from the only ordering algorithm which repeatedly chooses a node to order next, simulates the elimination of that node by forming a clique of its neighbors, then iterates on the remaining graph. The difference between the algorithms described here and the minimum degree algorithm lies in how they choose the next node (or supernode) to eliminate. The minimum fill algorithm chooses the node whose elimination generates the least fill, and so prefers nodes whose neighbors already almost form a clique. This heuristic is
52 proposed in the original minimum degree paper [174] as Scheme 3, but is rejected because of the additional computation cost of identifying the minimum fill node. However, more recent studies of this algorithm note that although more computation is needed, the ordering generated typically does lead to lower fill in the factors [153, 154]. We also point out that eliminating supernodes in the minimum degree algorithm has some of the same effects as the minimum fill algorithm since once the first node in the supernode is eliminated, eliminating the rest generates no fill. In [21, 22] Betancourt applies the minimum degree framework to finding good orderings for problems where partial refactorization will be needed. Partial refactorization refers to the situation where after the initial factoring of A, some of the elements in A are changed and one wants to go from the initial LU factors to those of the new matrix without refactoring. To decrease the number of rows and columns that need to be refactored, one wants triangular factors with short paths. To this end Betancourt suggests a minimum depth ordering which chooses to eliminate the node of minimum depth in the graph of dependencies for the triangular solve with the factors. Since the goal is to reduce the length of the critical path in the triangular solves, the minimum depth ordering is not a sparsity ordering, and, in fact, the minimum depth algorithm gives a poor fill-reducing ordering because it pays no attention to the amount of fill it generates in the factors. However, minimum depth as a tie-breaker for nodes of the same degree in a minimum degree algorithm can achieve a useful compromise [22]. The minimum depth ordering heuristic comes from work on matrices from power systems. Other orderings motivated by the same application have a similar feel. Examples include heuristics which use other functions based on the path lengths in the triangular factors [83]. Although not so useful alone, some of these other orderings can also be useful as tie-breakers for minimum degree algorithms [83]. Partitioning methods Whereas minimum degree and similar heuristics repeatedly choose a locally optimal node to order next, partition based methods such as nested dissection take a global view. First, recall that a node bisector of a graph G = (V, E) is a set of nodes S V which defines two other sets of nodes P 1 and P 2 such that every node in V is in exactly one
53 of S, P 1, or P 2. Furthermore, the number of nodes in P 1 and P 2 is approximately equal and any path in G from a vertex in P 1 or P 2 to a vertex in the other goes through a vertex in S. Given an undirected graph G, which could be DG(A + A T ) for a given matrix, nested dissection finds a node bisector of G, then orders the nodes in the bisector last and recursively orders the two halves of the original graph. Nested dissection was proposed for regular meshes in [71] and generalized for planar and almost-planar graphs in [125]. Note that each step of this algorithm looks to put a matrix into bordered block diagonal form, which is known to be a good form for limiting fill in the factors since fill can happen only in the diagonal blocks or along the band. A package which includes routines for sequentially computing partition based fillreducing orderings is METIS [115]. Parallel routines can be found in [116]. Since nested dissection is a divide-and-conquer algorithm and therefore clearly parallelizable, partition based methods are attractive for parallel solvers. However, in part because basic questions such as how best to choose a separator remain unanswered, partition based methods are still not as widely used as minimum degree algorithms for sequential codes. Hybrid Methods Hybrid methods are a recent family of heuristics which combine elements of both partition based and minimum degree algorithms. Their advantage comes from the fact that they combine the parallelizability of pure partition methods with the better understanding we have of the effect of minimum degree orderings on fill patterns. The main observations behind the hybrid methods described here are that a recursive algorithm such as nested dissection should be stopped before each partition has too few nodes in it, and that both the nodes in the partitions and the nodes in the separators need to then be ordered amongst themselves. A discussion of various combinations of algorithms for partitioning the graph, for ordering the separator, and for ordering the pieces, can be found in [7]. Descriptions of the BEND code, which implements a hybrid ordering heuristic, and a discussion of its performance including practical observations and recommendations can be found in [97, 98].
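The nested dissection recursion described above can be sketched as follows (Python; find_bisector and the small cutoff are hypothetical placeholders standing in for a real graph partitioner, such as one of the METIS routines, and for the switch to a local ordering on small pieces). Separator nodes are ordered last, after the two halves are ordered recursively.

    def nested_dissection(nodes, adj, find_bisector, small=64):
        """Return an elimination order for the subgraph induced by `nodes`.

        adj           : dict mapping node -> set of neighboring nodes
        find_bisector : callable returning (P1, P2, S) for a node set,
                        where S separates P1 from P2 (assumed supplied,
                        e.g. by a graph partitioning package)
        small         : below this size, stop recursing and order the
                        piece directly (minimum degree would be typical)
        """
        if len(nodes) <= small:
            return list(nodes)              # placeholder local ordering
        P1, P2, S = find_bisector(nodes, adj)
        order = []
        order += nested_dissection(P1, adj, find_bisector, small)
        order += nested_dissection(P2, adj, find_bisector, small)
        order += list(S)                    # separator nodes come last
        return order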
54 3.2.2 Approximate column minimum degree code for symmetric multiprocessors In this section we describe our attempts to write efficient threaded code implementing the approximate column minimum degree algorithm, and what we discovered in the process. In particular we began with the column approximate degree algorithm described in [48] and implemented in the sequential colamd code, then tried to parallelize it using threads. As discussed in section 3.2.1, minimum degree heuristics form a popular class of fill-reducing algorithms. We knew in advance that minimum degree algorithms were considered difficult to parallelize because the obvious parallelism is fine grained, leading to a poor communication to computation ratio. However, we hoped to gain more insight into exactly why the algorithm is difficult to parallelize. We chose to work on an implementation for shared-memory multiprocessors (SMPs), in part because we hoped we could replace the sequential ordering phase of an existing threaded sparse LU-factorization code [123]. Section 3.2.2 describes our parallel algorithm. Because we found that our algorithm does not always speed up significantly with more processors, section 3.2.2 explores possible causes of parallel inefficiency. Section 3.2.2 summarizes the results of our experiments on large matrices from the University of Florida sparse matrix collection [47]. In section 3.2.2 we end with some thoughts about parallelizing minimum degree type algorithms. We used several machines for testing our code. Our experiments were mainly performed on an SGI Power Challenge machine, though we also ran some tests on SGI Origin 2000 and Origin 200 machines, as well as on a Sun Enterprise 5000. The parallel algorithm Our main strategy for parallelizing the algorithm is to use several processors to perform multiple eliminations in a single iteration. Each node elimination is performed sequentially by one processor, but different processors can eliminate different nodes in parallel. Each iteration begins with a sequential phase in which a single processor finds a set of independent nodes with small degrees to eliminate and creates a work queue of eliminations for this iteration. After this initial phase, during which all other processors are idle, the elimination phase begins and all processors repeatedly take a node from the work queue and eliminate it, continuing until the queue is empty. The pseudocode in figure 3.1 summarizes
the algorithm.

 1  preprocessing
 2  calculate approximate initial degrees
 3  repeat {
 4      proc 0: find a set of nodes S to eliminate
 5      do in parallel {
 6          repeat {
 7              choose an uneliminated node v in S
 8              eliminate v by merging cliques and updating clique lists of adjacent nodes
 9              calculate degrees of nodes adjacent to v
10          } until all nodes in S are eliminated
11      }
12  } until all nodes ordered

Figure 3.1: Pseudocode for the parallel approximate minimum degree algorithm.

The preprocessing step (line 1) identifies identical cliques in DG(A^T A) (i.e., identical rows of A) and identical nodes (i.e., identical columns of A). We use a hash table to quickly identify identical rows and columns; identical rows are treated as a single row since they all represent the same clique, and identical columns are grouped as supernodes. This step can be performed in parallel: the code assigns a set of rows and columns to each processor for insertion into the hash table, and then assigns a set of hash-table buckets to each processor for further processing. The calculation of approximate initial degrees is also done in parallel using a similar load-balancing mechanism. The selection of an independent set S of nodes with small degrees (line 4) is controlled by three parameters: k, which is an upper bound on the number of nodes in S; tol, which bounds the largest degree a node in S can have relative to the node of smallest degree; and numsearch, which limits the amount of work done in this phase. The parameter values can be specified at either compile time or run time. The algorithm for choosing the independent set works as follows: initially all nodes are eligible. The code repeatedly selects
56 an eligible node with the smallest degree, marks its neighbors as ineligible, and continues by selecting another eligible node. This process terminates when one of the following conditions occurs: (a) enough nodes have been found, (b) the degree of the selected node is too high, or (c) too many nodes have been touched (selected or marked). Setting the independent-set parameters requires considering various trade-offs. The number of nodes to be selected, k, is chosen to provide enough work in the queue to ensure reasonable load balancing. Since the elimination of two nodes can differ significantly in cost, in general we set k to be larger than the number of processors, so that if one processor is busy with a single expensive elimination, another processor which finishes an elimination quickly can work on another node. The trade-off is that attempting to find a large set S generally requires more work and may produce an inferior ordering (in particular if nodes with high degrees have to be chosen). Another parameter determines the maximum allowed deviation tol of the degree of a selected node from the minimum degree in that step. Giving tol a large value is inconsistent with the basic idea of the minimum degree algorithm and may lead to an inferior ordering and significant fill in the LU factorization. On the other hand, a small value for tol may prevent the algorithm from finding enough nodes to be eliminated in every iteration, and hence lead to poor load balancing. Finally, the limit numsearch on the number of nodes that are touched is used to prevent the algorithm from performing too much work in this serial phase of the algorithm. Our implementation uses three primary data structures: a node array, where each node points to an array of cliques containing that node; a clique array, where each clique points to a singly-linked list of the nodes it contains; and a degree array, which contains linked lists of nodes which share the same approximate degree. Our code serializes accesses to these data structures by using mutual exclusion variables (mutexes). We protect each node, clique, and list of nodes with the same degree by a mutex, hence allowing nodes to be eliminated in parallel without corrupting the data structures. Analysis Having described our implementation, we now turn to analyzing the performance of this parallel code. Generally speaking, we found that our algorithm rarely speeds up significantly as the number of processors increases from 1 to 8. Although we made several small changes to the algorithm and our implementation to try to eliminate sources of inef-
ficiency, we were unable to code a version that achieved significant speedup. Therefore, the analysis in this section tries mainly to explain the likely sources of inefficiency. Analyzing the algorithm suggests several possible explanations for the poor parallel performance:

1. Amdahl's law: there is too much serial work.
2. Lack of parallelism: we are unable to find enough low-degree nodes to eliminate in parallel in each iteration.
3. Load imbalance: some nodes require more work to eliminate than others, which might lead to load imbalance.
4. Contention: processors block waiting to acquire mutual exclusion variables.
5. Expensive synchronization primitives: barriers and mutexes are slow.
6. Cache misses: the code does not attempt to reuse data in a processor's cache. A processor may update a node which is eliminated by another processor in the following iteration. The data structure of the node is first brought to the first processor's cache, only to be used once and then transported to the second processor's cache.

In the remainder of this section we analyze each possibility, suggesting causes and solutions. However, we first discuss an issue that complicates all the analysis, namely, the sensitivity of the algorithm to its scheduling.

Scheduling Sensitivity

To measure the parallel efficiency of our code on any given matrix, the algorithm must perform the same amount of work regardless of the number of processors. Unfortunately this is not always the case because of the code's sensitivity to thread scheduling. Not only do runs on the same input with different numbers of processors perform differing amounts of work, but multiple runs with the same number of processors can perform differing amounts of work. In short, the algorithm is nondeterministic. In contrast, the total amounts of work done by parallel algorithms for matrix multiplication and Cholesky factorization are insensitive to scheduling. The sensitivity of our code to scheduling is due to the fact that a node's approximate degree depends on the order in which adjacent nodes are eliminated and their degree updates applied. For example, if a node v is adjacent to several nodes being eliminated in
58 the same iteration, the (approximate) degree of v at the end of the iteration depends on the order of the degree updates from the eliminated nodes. This further determines whether or not v is chosen for elimination in the next iteration and hence affects the overall ordering computed. To make our code deterministic we would need to fix an order in which each processor would update the shared data structures in each iteration. However, while perhaps making it easier to analyze, this in effect forces sequential behavior on a section with high potential for parallelism by eliminating the code s ability to let processors with less work finish updating the shared data structures while those with more work continue processing. We assess the efficiency of our algorithm mostly by analyzing speedups, even though speedups do not accurately measure efficiency when runs with different numbers of processors perform different computations. We limit the analysis, however, to matrices in which different runs produce roughly the same computation and perform roughly the same amount of work. We estimate the similarity of two computations by comparing the number of iterations of the outer repeat loop performed by the algorithm. On most matrices, runs with different numbers of processors differ in the number of iterations by only 1 to 2%, which implies that these runs perform roughly the same amount of work. This allows us to use speedup as an overall measure of parallel efficiency. On some matrices, however, runs with different numbers of processors and even multiple runs with the same number of processors perform totally different computations. For example, ordering the matrix bbmat with one processor required 1037 iterations, but three separate runs with two processors required 512, 744, and 496 iterations. For such matrices, we cannot evaluate parallel efficiency simply by comparing the running time on multiple processors to the running time on one processor. Different runs with different numbers of processors may perform different amounts of work. To reduce the effect of scheduling on our analysis, we limit ourselves to matrices whose ordering is relatively insensitive to the scheduling of the algorithm, as determined by the number of iterations. Amdahl s Law Amdahl s law states that the fraction of sequential work in an algorithm determines an upper bound on its parallel speedup. For our code, the main sequential work is the finding of the independent set S in each iteration. If the work in this phase is a significant fraction of the total work, speedup is an impossibility.
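As a worked example of this bound (with illustrative serial fractions rather than measured ones): if a fraction s of the work is sequential, Amdahl's law limits the speedup on p processors to 1/(s + (1 - s)/p).

    def amdahl_bound(s, p):
        """Upper bound on speedup with serial fraction s on p processors."""
        return 1.0 / (s + (1.0 - s) / p)

    for s in (0.05, 0.10, 0.15):
        print(s, [round(amdahl_bound(s, p), 2) for p in (2, 4, 8)])
    # s = 0.10, for instance, still allows a speedup of about 1.8 on 2
    # processors and 3.1 on 4, so the serial phase alone does not explain
    # the absence of speedup on small processor counts.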
59 The current code spends about 1 5% of the total time on one processor in finding the set S if k = 1, and about 7 15% of the time on one processor in finding S if k = 128. While this percentage will not enable the code to scale well to large numbers of processors, it should not prevent the code from speeding up on 2 or 4 processors. We are not aware of efficient parallel algorithms to find such an independent set. While there are parallel algorithms for finding maximal independent sets [128], we require an algorithm for selecting a relatively small independent set consisting only of nodes with small degrees. During the design of the algorithm, we also experimented with different amounts of separation between the nodes in S. Our current implementation allows S to contain nodes that are within distance 2 in G, but not nodes within distance 1 (i.e., neighbors). We also tried restricting S to nodes that are at least distance 3 or 4 apart. The advantage of such schemes is that they allow processors to perform eliminations with less locking of data structures. If nodes in S are at least distance 4 apart, no locking of cliques or nodes is necessary. If nodes are distance 3 apart, some locking is required but less than in the current distance-2 algorithm. Regardless of the distance, locking elements of the degree list is required. Our experiments showed that finding a p-node set S, where p is the number of processors, whose elements are at least distance 4 apart took 60% to 90% of the total time (90% on the memplus matrix). Finding nodes of distance at least 3 apart took up to 75% of the time (also on the memplus matrix). Selecting S takes much longer when the nodes must be distance 3 or 4 apart because a large number of nodes are marked as ineligible for every selected node. Therefore, we decided to abandon this approach. For the remainder of the paper we assume the nodes in S need only be at least distance 2 apart. As explained above, this reduces the running time of the sequential phase that selects S to a small fraction of the total running time: 1 2% in the one processor case, typically 3 17% in the eight processor case though 30% for one matrix. We then turned to looking for other causes for the lack of parallel speedup. Lack of Parallelism In Amdahl s law, the bound is not determined solely by the fraction of sequential work, it is actually determined by the amount of time that processors stay idle. By a lack of parallelism we refer to the case where our algorithm cannot find enough nodes to eliminate in some iterations, hence leaving some processors idle.
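For reference, the serial selection loop whose output size is at issue here (described alongside figure 3.1 above) can be sketched as follows in Python. This is a schematic of the greedy selection with the k, tol, and numsearch parameters, not the threaded C implementation; in particular, tol is treated here as an additive bound on how far a selected node's degree may exceed the current minimum, which is one possible reading of the tolerance.

    def select_independent_set(degree, adj, k, tol, numsearch):
        """Greedily pick an independent set of low-degree nodes.

        degree    : dict node -> current (approximate) degree
        adj       : dict node -> set of adjacent nodes
        k         : maximum number of nodes to select
        tol       : how much larger than the minimum degree a selected
                    node's degree may be (treated additively here)
        numsearch : bound on the number of nodes touched (selected or
                    marked ineligible)
        """
        S = []
        ineligible = set()
        touched = 0
        # consider nodes from smallest to largest current degree
        candidates = sorted(degree, key=degree.get)
        min_deg = degree[candidates[0]] if candidates else 0
        for v in candidates:
            if len(S) >= k or touched >= numsearch:
                break              # (a) enough nodes, or (c) too much work
            if degree[v] > min_deg + tol:
                break              # (b) remaining degrees are too high
            if v in ineligible:
                continue
            S.append(v)
            touched += 1
            for w in adj[v]:       # neighbors of a selected node may not join S
                if w not in ineligible:
                    ineligible.add(w)
                    touched += 1
        return S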
                    # of iter with tol = 4           # of iter with tol = 1500
    name          p=1     p=2     p=4     p=8       p=1     p=2     p=4     p=8
    goodwin      1303    1012     923     827      1303     923     663     551
    memplus      9113    5095    3063    1899      9113    4878    2775    1690
    orani678      924     630     513     462       924     610     472     378
    shyy161     37778   19458   10026    5201     37778   19079    9521    4803
    wang4        6998    3832    2227    1351      6998    3528    1799     936

Table 3.3: This table shows the number of iterations taken by the algorithm as p, the number of processors, is varied from 1 to 8. The maximum number of nodes found in each iteration is p, the number of nodes searched is set to 4p, and the tolerance is set to 4 and 1500 on the left and right sides, respectively.

The question is then whether the algorithm finds enough nodes to eliminate in each iteration. We found that when we allow many nodes to be touched (i.e., numsearch is large) and nodes with high degrees to be selected (i.e., tol is large), the algorithm can find large independent sets S in almost all the iterations. With no limits on these parameters, using a target size of k = 1 for S on the matrix av41092 requires 14416 iterations, but using a target size of k = 128 requires only 124 iterations. This indicates that the algorithm indeed found a 128-node independent set in most iterations. The results are similar on several other matrices. On some matrices, however, the reduction in the number of iterations is not so dramatic, implying that on many iterations the algorithm was only able to find much smaller independent sets. For example, on the matrix raefsky4, the number of iterations only dropped from 2997 to 50, a factor of 60, when we increased the target size k from 1 to 128. As expected, when we set tight tolerances on the maximum degree and the number of nodes searched, the algorithm is often unable to find enough nodes to eliminate. Table 3.3 shows the number of iterations when we allow only nodes whose degree is at most 4 (left side) or at most 1500 (right side) above the minimum, and touch only 4p nodes, where p is the number of processors. The table clearly shows that a tight tolerance increases the number of iterations, which implies that during many iterations some processors had no nodes to eliminate. To summarize, it is usually possible to find large sets S, but at the cost of more computation in the sequential phase of finding the independent set and of a potentially inferior ordering. Once again we find a tradeoff between parallelism and quality of ordering.
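One rough way to read these iteration counts is as a measure of how much parallelism the selection phase exposes: since roughly the same number of nodes is eliminated in every run, the ratio of the k = 1 iteration count to the iteration count with target size p approximates the average number of nodes eliminated per iteration. A small computation over the p = 1 and p = 8 (tol = 1500) columns of table 3.3:

    # iterations with k = 1 (p = 1 column) and with target size 8
    # (p = 8, tol = 1500 column), taken from table 3.3
    iters = {
        "goodwin":  (1303, 551),
        "memplus":  (9113, 1690),
        "orani678": (924, 378),
        "shyy161":  (37778, 4803),
        "wang4":    (6998, 936),
    }
    for name, (i1, i8) in iters.items():
        print(f"{name:9s} average nodes eliminated per iteration: {i1 / i8:.1f}")
    # shyy161 and wang4 come close to the ideal of 8 nodes per iteration,
    # while goodwin and orani678 average well under 3, so on many of their
    # iterations some of the 8 processors had nothing to eliminate.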
61 Load Imbalance Even if we found enough nodes in each iteration for every processor to eliminate at least one node, there is still the possibility of processors being idle because of load imbalance from several reasons. First, after all the nodes are removed from the work queue S and until all these nodes are eliminated, the algorithm cannot proceed to the next iteration. As processors complete their last elimination for the iteration during this period and find the work queue idle, they stop performing useful work and block until the end of the iteration. If S contains many nodes, say 100 times the number of processors, we do not expect these idle times at the end of the iteration to slow down the algorithm considerably. But if S contains few nodes, especially if S contains fewer nodes than processors, then the idle times are expected to be significant. Another source of imbalance is related to the cost of each elimination. Specifically, some nodes require more work to eliminate than others. If each processor eliminates only one or a few nodes in each iteration, or if the differences in the cost of elimination between nodes are large, then this source of imbalance may have a significant impact on the overall performance of the algorithm. As explained above, we address this issue by selecting large sets of nodes to eliminate in every iteration. For many matrices, this leads to good load balancing. For example, on one matrix (av41092, k = 128), processors spend only about 2% of the running time waiting at the end of iterations when two processors are used, and about 15% of the time when 8 processors are used. These numbers are highly matrix dependent; on another matrix (memplus, k = 128), when 8 processors are used they spend up to 37% of the running time waiting at the end of iterations. An extreme case is the pre2 matrix for which a single elimination took about 30% of the running time on one processor (about 300 out of 1000 seconds). To summarize, the relative cost of different eliminations is highly matrix dependent. On some matrices the differences do not cause significant load imbalance, whereas on other matrices this source of imbalance essentially prevents the code from running efficiently in parallel. Contention As noted in the discussion regarding Amdahl s law, choosing nodes that are only distance 2 apart requires more synchronization. Since our code normally blocks when it tries to acquire a lock and the lock is held by another processor, this is another time
62 when processors could be idle. We measure contention for locks using a nonblocking call to try to acquire the lock. If the lock is not available, the code increments a counter that counts contention and then blocks waiting for the lock. We selectively measured contention in different phases of the algorithm and for different data structures in order to better assess the cost incurred by contention for locks. We initially found the amount of contention for both the degree list and node array surprisingly high. For the memplus and shyy161 matrices, 3 4% of the attempts to acquire degree-list locks block due to contention with 4 processors, and only slightly less with 2 and 8 processors. To overcome this significant contention, we separated the updating of the data structures into a local and global phase. In the local phase all changes are marked in a local private array with no locking; in the global phase all local arrays are used to update the global data structures. This improvement typically reduces contention to between 0.1 0.5% of all attempts to acquire locks, although at the cost of more memory. High Overheads for Parallel Programming Primitives The code uses two barriers in each iteration. One guarantees that the node degrees are updated prior to selecting a new set of nodes to eliminate, the other guarantees the independent set is found before processors attempt to eliminate nodes. The code also locks data structure to maintain them correctly. Clearly barriers, locks, and unlocks must be implemented efficiently. We first coded the algorithm using Sun threads and POSIX threads primitives, including mutex variables and condition variables (using the latter only to implement global barriers). We found the cost of these primitives high on both Sun machines running Solaris and SGI machines running Irix. We suspect that the implementors of these primitives used system calls to avoid busy waiting. Our first barrier code was a code published by Sun [132], which uses both mutex variables and condition variables to implement a barrier. The locking in our code is fairly fine grained. The code acquires and releases locks frequently in some phases of the algorithm and spends little time inside critical sections. Contention is infrequent, which makes spinning on locks a reasonable strategy since busy waiting is normally absent or short, and making numerous system calls slows down our code and is unlikely to improve the utilization of the system as a whole. Therefore, we replaced the implementation of the mutex primitives and the barriers
63 with one that uses atomic memory operations that are provided by the SGI C compiler (the operations we use are fetch-and-subtract, lock-test-and-set, and lock-release). This version of the code only works on SGI machines. Our measurements show this implementation of the primitives is fast. On a single processor, a code that calls the synchronization primitives and one that does not call them run at the same speed. This implies the primitives are essentially free when there is no waiting, at least at the rate that our code uses these primitives. The barriers that use the atomic memory operations are about 20 times faster than the barriers that use the POSIX primitives. We also remark that the number of barriers is reduced when the algorithm finds larger independent sets since the number of iterations is reduced. This, however, has other implications for the code as explained previously, so fast barriers are still useful. Furthermore, the number of mutex lock operations cannot be reduced in the same way. Cache Misses Finally, we considered the fact that on SMPs it is significantly faster for a processor to fetch data from its own cache rather than to get data from main memory or from the caches of other processors. In other words, exploiting locality is important. Our algorithm does not attempt to have processors reuse data already in their caches. Ideally, after a processor touches data for various cliques and nodes in eliminating a node, the same processor would next eliminate a nearby node and hopefully reuse some of the data already in the cache from the previous elimination. However, we do not know how to achieve this goal within the framework of our algorithm, nor do we know how to otherwise enhance data reuse. We suspect that cache misses are a significant source of inefficiency. On some matrices, despite not detecting significant contention, load imbalance, sequential bottlenecks, or other parallel overheads, the algorithm still does not speed up with more processors. We conclude that the cost of memory accesses increased with more processors due to cache misses. In particular, we have observed that the running time of the preprocessing step in which identical rows and columns are identified actually increases with the number of processors. During this phase, the amount of work is independent of the number of processors and there is no locking (except for a few barriers). In theory this means this portion of the code should benefit from additional processors. In practice this portion of the code slows down, leading us to believe that cache misses are a significant cost.
64 We note that cache misses are not only caused by data used first on one processor and then on another, or by data that is evicted due to cache conflicts. On a multiprocessor, maintaining the coherency of caches also causes cache misses. In machines with weak coherency, such as the machines we use, frequent synchronization can lead to cache misses on data that is not actually communicated between processors. Clearly the use of hardware performance monitors would have made our argument more compelling and future work along these lines will take advantage of available monitors. Summary of Experimental Results We now summarize our experimental results. The results reported in table 3.4 are meant to show as much speedup as we can achieve. There is no upper limit on the number of nodes touched nor any limit on the degrees of nodes included in the independent sets. This maximizes potential parallelism, though at the cost of an inferior ordering. Furthermore, because the parallel preprocessing step was found to slow down as processors were added, for these results we did the preprocessing step serially. For each matrix, we report two sequential running times: one with a k = 1, and the other with k = 128. The first case implements a conventional sequential approximate minimum degree algorithm (without multiple eliminations). The second case performs a computation similar to that performed in parallel runs, hence allowing us to compare running times. We also report runs with 2, 4, and 8 processors, all using the same target size k of 128 for S. (We thus expect the load balance to degrade with the number of processors; but increasing k to maintain load balance would prevent us from comparing running times because the computations would be different.) The runs that we report were performed on a 12-processor SGI Power Challenge with 196Mhz R10000/R10010 processors, 32Kbytes primary data caches, 1MBytes secondary unified caches, and 1024Mbytes of 4-way interleaved main memory. The machine was lightly loaded but not idle when we ran the experiments; p CPU s were always available during runs with p threads. The next section includes a high level summary of these results. Summary In short, the fact that we achieved hardly any speedup on most of our test matrices suggests column minimum degree algorithms are indeed difficult to parallelize. Because
    matrix      p     k    # its        T     T_pre     T_is
    av41092     1     1    14416    30.10      2.72     0.22
                1   128      124    29.67      2.73     0.37
                2   128      123    18.81      2.79     0.37
                4   128      124    12.96      2.75     0.40
                8   128      123    10.44      2.75     0.41
    memplus     1     1     8186     9.55      0.61     0.12
                1   128      259    12.06      0.55     1.61
                2   128      260     9.62      0.56     1.23
                4   128      258     8.58      0.56     1.44
                8   128      232     7.58      0.56     1.29
    rim         1     1     3947    10.19      1.10     0.12
                1   128      623    13.08      1.11     2.75
                2   128      404     9.38      1.12     2.31
                4   128      395     9.41      1.12     2.64
                8   128      528    11.51      1.12     3.53
    raefsky4    1     1     2997     3.96      0.88     0.03
                1   128       50     2.70      0.93     0.24
                2   128       51     2.42      0.88     0.30
                4   128       51     2.31      0.88     0.33
                8   128       51     2.50      0.88     0.37
    shyy161     1     1    37207     7.01      1.89     0.22
                1   128      285     7.24      1.89     0.43
                2   128      284     7.81      1.85     0.48
                4   128      285     7.18      1.91     0.50
                8   128      283     7.37      1.91     0.58
    onetone1    1     1    21356     6.40      1.31     0.09
                1   128      178     6.51      1.33     0.25
                2   128      181     6.29      1.34     0.31
    twotone     1     1    53658    60.91      5.06     1.51
                1   128      431    63.61      4.95     2.94
                2   128      432    62.97      5.05     2.94

Table 3.4: This table shows the number of iterations, the total time T, the preprocessing time T_pre, and the time to select the independent set T_is, as a function of the number p of processors and the target size k of the independent sets.
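The speedups implicit in table 3.4 can be read off by comparing each parallel time to the one-processor time with the same target size k = 128; a small computation over numbers taken from the table:

    # total times T from table 3.4 for k = 128, indexed by processor count
    T = {
        "av41092":  {1: 29.67, 2: 18.81, 4: 12.96, 8: 10.44},
        "memplus":  {1: 12.06, 2: 9.62,  4: 8.58,  8: 7.58},
        "raefsky4": {1: 2.70,  2: 2.42,  4: 2.31,  8: 2.50},
        "shyy161":  {1: 7.24,  2: 7.81,  4: 7.18,  8: 7.37},
    }
    for name, times in T.items():
        speedups = {p: round(times[1] / t, 2) for p, t in times.items() if p > 1}
        print(name, speedups)
    # av41092 reaches a speedup of about 2.8 on 8 processors, while shyy161
    # never improves on the single-processor time.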
66 we knew that the fine-grained parallelism in the minimum degree algorithm would make it difficult to parallelize efficiently on a distributed-memory parallel computer due to the communication latency, we decided to concentrate on shared memory machines, hoping that faster interprocessor communication would enable us to obtain significant speedups. The problem was more difficult than we expected. Still, the code sometimes speeds up significantly with more processors, and it rarely slows down. For example, on one matrix (av41092) the code speeds up by almost a factor of 3 with 8 processors. Therefore, our current suggestion is to use the parallel algorithm when several processors are available and the user is seeking to reduce the solution time. On the other hand, on a heavily loaded system in which the overall utilization is important (and not only speed), running a sequential code makes more sense. There are still unresolved issues concerning the current code. We have not made substantial tests comparing the quality of the orderings, especially with relaxed independent set parameters, with the quality of the orderings produced by sequential codes. We have made only a few tests that indicated that the quality of the orderings is reasonable and not particularly sensitive to these parameters, but this must be checked thoroughly before the algorithm is put into use. While it is possible that this algorithm can be parallelized in a way that speeds up significantly with more processors, our experience leads us to believe that exploiting the fine-grained parallelism inherent in minimum degree algorithms will be difficult. 3.3 Ordering for stability The algorithms described so far for finding orderings to decompose the system or to reduce fill in the LU factors depend only on the nonzero structure of A, and not on the numerical values of its elements. However, recall that finding fill-reducing orderings for general matrices is complicated by the possible need for pivoting for stability in computing the factors. Recently people have looked at reducing the need for pivoting through the use of a preconditioner which maximizes the diagonal entries of A, as discussed in section 3.3.1. With the observation that pivoting is done to put large entries on the diagonal during the factorization, algorithms that permute for stability look to put large entries on the diagonal prior to the factorization. Of course, this is not an exact substitute for partial pivoting due to the fact that values in A change during the factorization.
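The idea of permuting large entries onto the diagonal can be cast as a weighted bipartite matching, as the history below explains in more detail. The following dense Python sketch uses scipy.optimize.linear_sum_assignment and is only an analogue of what a code such as MC64 does directly on the sparse structure; the -1e30 penalty for structural zeros is an illustrative device, and maximizing the product of |a_ij| over the matched entries is handled by maximizing the sum of log|a_ij|.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def max_product_diagonal_perm(A):
        """Row permutation placing large-magnitude entries on the diagonal.

        Dense illustration of the idea only: MC64 solves a weighted
        bipartite matching problem of this kind on the sparse structure
        with specialized algorithms.
        """
        absA = np.abs(np.asarray(A, dtype=float))
        # maximizing the product of |a_ij| over a matching is the same as
        # maximizing the sum of log|a_ij|; structural zeros get a huge
        # negative score so they are never chosen
        score = np.full(absA.shape, -1e30)
        np.log(absA, out=score, where=absA > 0)
        row_ind, col_ind = linear_sum_assignment(score, maximize=True)
        perm = np.empty(len(row_ind), dtype=int)
        perm[col_ind] = row_ind    # row row_ind[i] moves to position col_ind[i]
        return perm

    A = np.array([[0.0, 5.0, 1.0],
                  [2.0, 0.1, 0.0],
                  [0.0, 0.0, 3.0]])
    p = max_product_diagonal_perm(A)
    print(A[p, :])    # the permuted matrix has 2, 5, 3 on its diagonal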
An interesting observation is that MC64 [56, 57], described below, can reduce fill in the factors when used in conjunction with sparsity orderings. In this section we review work on stability orderings and describe some of our observations.

3.3.1 History

We first discuss the general case of permuting to create a nonzero diagonal. Notice that permuting to maximize the product of the diagonal elements is a special case. Permuting to put nonzero elements on the diagonal of A can be done by finding a maximum matching in BG(A), the bipartite graph representation of A defined in section 1.3.2. If A is nonsingular, there will be n edges in the maximum matching, which we then call a perfect matching. (Footnote 1: Counting the number of perfect matchings is equivalent to computing the permanent of a matrix with only 0 and 1 entries and is known to be #P-hard [179].) The rows and columns of A are then permuted so that the entries corresponding to the edges in the matching are on the diagonal. A description of algorithms for finding matchings can be found in [146]. The asymptotically fastest known algorithm is due to Hopcroft and Karp and runs in time O(√n (n + nnz)) [101]. However, the MC21 subroutine in the Harwell Subroutine Library [104] uses a different algorithm which finds a maximum matching in worst-case time O(n · nnz), but which in practice they find runs in time closer to O(n + nnz) [57]. In general permuting for stability is used as a substitute for pivoting, so orderings which increase the values along the diagonal are preferred. Algorithms that seek to maximize the diagonal entries want a maximum matching in the weighted bipartite graph representation of A. However, the matching used depends on what measure of the diagonal the algorithm looks to maximize, and there is no theory governing which measure is best. If we seek to maximize the absolute value of the product of the diagonal elements, this can be written as the standard weighted maximum bipartite matching problem (discussed in [146]) and solved in O(n · nnz · log n) time [56]. In [141] the authors consider using such a permutation prior to computing a complete LU factorization, also noting that the permuted matrix can then be scaled so the diagonal entries have absolute value 1 and the off-diagonal entries are no larger than 1. In [56, 57], Duff and Koster discuss algorithms for maximizing various measures of the diagonal. The possibilities include maximizing the absolute value of the product of the diagonal entries and maximizing the absolute value of the smallest element on the
68 diagonal. They released their implementations of assorted algorithms as MC64, also part of the Harwell Subroutine Library [104]. In [122] the authors use MC64 to maximize the product of the diagonal entries and to scale the diagonal entries to 1 and the off-diagonal entries to values 1 as a substitute for partial pivoting in a complete LU factorization. This tradeoff is particularly attractive for their parallel LU code, since pivoting requires communication and so is expensive. The previous algorithms try to maximize the diagonal of the matrix. Alternatively, we could try to maximize the block diagonal of a matrix, which contains the case of maximizing the diagonal as the special 1 by 1 block case. PABLO and TPABLO are two heuristics for increasing the block diagonal of a matrix. PABLO (PArameterized BLock Ordering) is a heuristic for permuting the matrix so its diagonal consists of dense blocks [142]. TPABLO is the threshold version which looks to create dense diagonal blocks containing large values [16]. However, both have been tested only in the context of block iterative solvers, and not as preconditioners for direct methods. 3.3.2 Observations Since partial pivoting is generally less expensive in sequential code than in parallel code, there seems to be less incentive for using MC64 as a preconditioner for single-processor LU. However, our experiments confirm the observation in [57] that maximizing the diagonal prior to ordering with colamd, instead of using colamd alone, can reduce fill even if the factorization uses partial pivoting for stability. For 27 of the 65 matrices in our test suite, using both MC64 with the option that maximizes the product of the diagonal and colamd to order the matrix led to less fill than using colamd alone. For 2 of the matrices there was insufficient memory for computing the LU factors, for 5 of the matrices the addition of MC64 made no difference in the fill, and for the rest colamd alone performed best. For some matrices the addition of MC64 reduced fill substantially: for the NASASRB matrix colamd ordering alone led to nnz(l+u) = 64795613 whereas MC64 applied before colamd led to nnz(l+u) = 56881063. If the matrices are ordered with MC64 followed by colamd applied symmetrically, prior to computing the LU factorization without pivoting, then for 31 of the matrices the fill is reduced. It is interesting to observe the differences. However, because no single option is
69 significantly better all the time, we see no reason not to continue using simply colamd for sparsity. Tables 3.5 and 3.6 give the results for all the test matrices. 3.3.3 Relationship to other orderings Notice that these algorithms for maximizing the diagonal can be combined with those described in section 3.1 for decomposing the matrix into block upper triangular form. Not only does the first step of the Dulmage-Mendelsohn decomposition compute a matching, but the diagonal blocks found in the second step are independent of the exact values on the diagonal, as long as the diagonal is nonzero [150, 151]. In other words, we can permute A to maximize the diagonal using, for example, MC64 code, and then run a strongly connected components algorithm on the permuted matrix to get a block upper triangular matrix with a large diagonal. Similarly, we could first run the full Dulmage-Mendelsohn decomposition algorithm and then run MC64 on each diagonal block. Assuming no ties in the values of elements, either order will give us the same diagonal since in both cases the diagonal elements can only be chosen from those in the diagonal blocks (hence the others are classified as inadmissible in [61] 2 ). The second option, finding the complete block upper triangular decomposition and then maximizing the diagonals for each block, is asymptotically faster. Finding the decomposition by running MC21 and then Tarjan s strongly connected components code should take O(n + nnz) time, and MC64 (running time O(n nnz log n)) on the diagonal blocks will be faster than running it on the entire matrix. 3.4 ILU preconditioners Having described preconditioners which can be used for both direct and iterative solvers, we now move to a family of preconditioners used solely for iterative solvers. Incomplete Cholesky (IC) for symmetric positive definite (spd) matrices and Incomplete LU (ILU) for general matrices form a popular class of preconditioners for iterative methods. These preconditioners are formed by computing triangular factors ˆL and Û such that ˆLÛ A (assuming no pivoting is used). They can be implemented by making simple modifications to code for complete factorizations, and most come with parameters which 2 Note Theorem 8 of [61] erroneously uses the word admissible instead of inadmissible in its statement. This is pointed out in Tutte s review [177].
70 nnz(l + U) mc64+colamd, name n nnz colamd mc64+colamd no pivoting NASASRB 54870 2677324 64795613 56881063 49527718 add32 4960 23884 39360 39360 50338 af23560 23560 484256 12133890 12060267 12085818 appu 14000 1853104 - - - av41092 41092 1683902 43636648 44602429 44832344 bbmat 38744 1771722 47380829 48402240 47070983 bramley1 17933 1021849 9329147 12248025 8860956 bramley2 17933 1021849 9285181 12183920 8833616 circuit 1 2624 35823 775756 1098018 775919 circuit 2 4510 21199 60473 75847 60285 circuit 3 12127 48137 384331 853743 199971 circuit 4 80209 307604 37976487 47911063 75036297 cry10000 10000 49699 674511 682304 721046 dw8192 8192 41746 761704 763142 830102 ecl 32 51993 380415 75414623 75653180 74520290 ex11 16614 1096948 19606287 18857254 14591104 extr1 2837 11407 37401 36009 38587 fs 541 2 541 4285 18676 17592 17083 garon2 13535 390607 5799777 5786872 3903193 gemat11 4929 33185 87687 92628 87096 goodwin 7320 324784 5312250 4449822 4385738 gre 1107 1107 5664 123999 140152 141125 hydr1 5308 23752 77450 78944 78141 inaccura 16146 1015156 9145171 8331268 6893278 jpwh 991 991 6027 135433 146447 142254 lhr01 1477 18592 67901 72456 74011 lhr04 4101 82682 343308 350678 351189 lhr71 70304 1528092 7312879 7610914 7568726 lns 3937 3937 25407 444392 427721 427475 lnsp3937 3937 25407 433876 419202 421071 mahindas 1258 7682 35119 37474 46671 mcfe 765 24382 60869 60824 60833 mchln85ks17 84180 7179192 - - - memplus 17758 126150 4335427 4394954 4433486 mhd4800a 4800 102252 465212 452907 442644 olm5000 5000 19996 28862 28866 25056 onetone1 36057 341088 4881390 4724465 4556026 onetone2 36057 227628 1156229 1140793 1184146 orani678 2529 90158 191666 188267 195176 orsreg 1 2205 14133 351886 348254 349747 pores 2 1224 9613 56130 56825 68179 Table 3.5: Together with table 3.6, this shows the number of nonzeros in the LU factors with different orderings: colamd with partial pivoting, mc64 and colamd with partial pivoting, and finally mc64 and colamd but without pivoting in the factorization.
71 nnz(l + U) mc64+colamd, name n nnz colamd mc64+colamd no pivoting radfr1 1048 13299 36050 34674 35233 raefsky3 21200 1488768 21083344 18419696 17124432 raefsky4 19779 1328611 19155550 18974701 18802580 rdist1 4134 94408 296028 314346 288855 rdist2 3198 56934 166952 171321 165963 rdist3a 2398 61896 196692 191155 189940 rma10 46835 2374001 22863581 19273693 16314169 rw5151 5151 20199 397392 423057 340999 saylr4 3564 22316 543910 531000 545710 sherman3 5005 20033 370187 368172 376306 sherman4 1104 3786 25038 25039 25960 sherman5 3312 20793 189292 190727 190536 shyy161 76480 329762 6033444 6060328 6746933 shyy41 4720 20042 176516 176673 197899 tols4000 4000 8784 25296 25296 25296 twotone 120750 1224224 18689608 18318330 19112060 utm5940 5940 83842 1007976 1015620 988454 vavasis1 4408 95752 1539858 1547599 1573926 vavasis2 11924 306842 6191717 6235021 6258072 vavasis3 41092 1683902 43636648 44589603 44792131 venkat01 62424 1717792 20999704 20999704 20999704 wang3 26064 177168 24446941 24445882 24677548 wang4 26068 177196 26230605 26220584 26511316 west2021 2021 7353 21555 20724 23182 Table 3.6: Together with table 3.5 this table shows the number of nonzeros in the LU factors with different orderings: colamd with partial pivoting, mc64 and colamd with partial pivoting, and finally mc64 and colamd but without pivoting in the factorization.
72 can be tuned depending on the accuracy of the approximation desired. The downside is that there are a large number of ILU algorithms to choose from and little consensus as to which is best or how to choose parameter settings. The variety of choices available can be baffling for a user seeking simply to solve a system of equations from some application area. We hope to provide users who believe ILU preconditioned iterative solvers may be appropriate for their system with some insight into what ILU algorithm to use, what parameter settings to try first, and what combination of the ordering strategies described earlier in this chapter to choose. In section 3.4.5 we give our recommendations. Aside from summarizing some of the extensive literature on IC and ILU preconditioners, in this section we also discuss results from our tests of different ILU heuristics. We observe that matrices from real applications often require high-fill ILU preconditioners for the subsequent iterative solver to converge, which leads us to suggest a modified ILUTP algorithm whose benefits we demonstrate. Testing ordering strategies in combination with different ILU heuristics shows that while maximizing the diagonal is beneficial, it does not fully compensate for pivoting. Although we focus on value-based ILU heuristics, we include comparisons to levelbased heuristics, describing also the results of tests using high-fill level-based ILU preconditioners. 3.4.1 History of IC and ILU preconditioners The first incomplete factorizations were Cholesky factorizations developed for M- matrices such as those arising from a 5-point stencil on a square grid. Recall that A is an M-matrix if A(i, i) > 0 for all i, A(i, j) 0 for all i j, and A 1 > 0. On M-matrices it was shown that certain incomplete factorizations could always be computed [134]. Furthermore, studies such as [93] found these preconditioners improved the convergence behavior of the conjugate gradient method introduced in [100]. The first incomplete factorization methods were mostly level-based methods, and so their decision of which non-zero elements to set to zero during the factorization was based on the elements location in the factors rather than their numerical value. Because IC with conjugate gradients (a combination sometimes referred to as ICCG in the literature [93, 130, 134, 149]) worked so well on M-matrices, people tried extending the method to more general matrices.
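As a concrete illustration of how such preconditioners are used (a sketch only: scipy.sparse.linalg.spilu wraps SuperLU's threshold-based incomplete LU, and the test matrix and parameter values below are illustrative rather than those studied in this chapter), an incomplete factorization can be wrapped as a preconditioner for an iterative solver such as GMRES:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # a small sparse test problem (2-D Laplacian-like matrix); illustrative only
    n = 50
    A = sp.diags([-1.0, -1.0, 4.0, -1.0, -1.0], [-n, -1, 0, 1, n],
                 shape=(n * n, n * n), format="csc")
    b = np.ones(n * n)

    # incomplete LU with a drop tolerance and a cap on fill, in the spirit of
    # the value-based heuristics described later in this section
    ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
    M = spla.LinearOperator(A.shape, matvec=ilu.solve)

    x, info = spla.gmres(A, b, M=M)
    print("converged" if info == 0 else f"gmres returned info={info}",
          "residual:", np.linalg.norm(b - A @ x))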
73 Unfortunately there exist symmetric positive definite (spd) matrices for which some IC factorizations can not be computed due to the creation of non-positive diagonal elements during the factorization [117]. This inspired numerous papers describing better heuristics (e.g. [130, 137, 117, 112, 124]). An important development was the introduction of value-based methods, which decide whether or not to keep elements depending on their value. Generalizing to ILU factorizations of nonsymmetric matrices adds another layer of complexity. We summarize work on level-based methods in section 3.4.1. We then review valuebased algorithms in section 3.4.1. Both types of dropping algorithms can be augmented with techniques that alter values of ˆL and Û to improve the stability of the factorization. We describe an assortment of stabilization heuristics in section 3.4.1. Though we use only basic value-based and level-based ILU preconditioners in our studies, we briefly describe work on other styles of preconditioners in sections 3.4.1, 3.4.1, and 3.4.1. The number of different heuristics, the assortment of options, and the quantity of parameters form one of the primary barriers preventing widespread use of incomplete LU preconditioners. The variety means users cannot easily tell what type of ILU preconditioner to use, thus increasing the attractiveness of direct solvers which may take longer and require more memory, but are more reliable. In section 3.4.1 we describe a few comparison studies and summarize their conclusions. The literature contains several surveys covering both sequential and parallel preconditioners for iterative methods. These include a book on iterative methods which includes a section on ILU preconditioners [158], as well as survey papers such as [162, 30, 60]. Level-based dropping ILU algorithms which use level-based dropping decide whether to keep a nonzero element based solely on its location. In broad terms, given an n n matrix A and a set of indices P, go through the steps of a complete factorization, but only update an entry (i, j) in ˆL or Û if (i, j) P (see, for example, [30]). The question, then, is how to choose P. More fill can mean more accurate incomplete factorizations, but it also means the factorization takes more time to compute and more memory to store. If A is a symmetric positive definite (spd) matrix and P includes all possible index pairs, the algorithm computes a complete Cholesky factorization. However, although a
complete Cholesky factorization is guaranteed to exist for an spd matrix, an incomplete Cholesky factorization is not guaranteed to exist for all choices of P since the diagonal elements may become negative. What is known is that IC factorizations can be computed for any P in the cases where A is an M-matrix [134] and where A is an H-matrix [130]. Recall that A is an H-matrix if the matrix whose elements have the same magnitude as those of A, but all of whose diagonal entries are positive and whose off-diagonal entries are negative, is an M-matrix (see, for example, [103] for more details on M-matrices and H-matrices). For ILU factorizations this is less of an issue, as the diagonal elements have to approach zero for the factorization to break down. As described above, level-based heuristics decide whether to keep an element based on its location in the factors. In addition, these methods are parameterized by a level k, which measures how close to complete the factorization is. If k = 0 we get the ILU(0) preconditioner, where (i, j) ∈ P if and only if A(i, j) ≠ 0. In other words, L̂ + Û has the same nonzero pattern as A. In [93] Gustafsson started with the ILU(0) heuristic and used it to define higher level preconditioners. For the ILU(1) factorization, the incomplete factors are allowed nonzeros both where A is nonzero and also where L̂_0 Û_0 − A is nonzero. The definition can be applied repeatedly to get higher level algorithms. This definition could be used for, say, finite difference meshes where added levels corresponded to added diagonals in the incomplete factors, and so the pattern could be easily determined. For more general matrices Watts' formulation for level-based heuristics is used [181]. He suggested letting the diagonal of a matrix be of level 0, letting the off-diagonal positions where A is nonzero be of level 1, and letting all other positions be of level ∞. As the incomplete factorization is computed, let the level of each updated element (i, j) be the minimum of its current level and one plus the sum of the levels of the two positions (i, l) and (l, j) used to update it (elements are updated by equations of the form A(i, j) = A(i, j) − A(i, l)A(l, j)). For an incomplete factorization of level k, keep only elements in positions of level k + 1 or less. A graph interpretation of Watts' algorithm is described in [41, 49, 50]: ILU(k) keeps an element if there is a path of length at most k + 1 between its two endpoints in the directed graph representation of A. Originally only fill levels of 0 or 1 were studied because they seemed adequate for the problems of interest and because computing the fill pattern for higher levels of fill was expensive. However, recent studies suggest higher levels of fill (up to k = 10 or 12)
75 can be beneficial for problems with several hundred thousand unknowns [106]. Hysom and Pothen s code interlaces the computation of the fill pattern with the computation of the numerical values in each position, so that for each row first its structure is computed and then its values [106]. This formulation allows them to make decisions for each row based on the numerical values in it, as with the value-based algorithms described in section 3.4.1. However, it makes the storage impossible to predict in advance, and so their code allocates memory dynamically: whenever they run out of space for the factors, the double the memory allocated and copy the already computed rows of the incomplete factorization into the newly allocated space. Value-based dropping Although Watts extended the definition of level-based ILU to apply to general matrices, studies such as [41] found level-based methods inadequate for many matrices, in particular for matrices which were indefinite or which were not diagonally dominant. Since for some of the matrices on which Watts ILU heuristic did work well, the higher an element s level, the smaller its absolute value [181], researchers suggested value-based ILU heuristics which dropped elements based on their numerical value (e.g. [186]). These value-based ILU methods generally lead to more accurate factorizations than level-based methods with the same amount of fill and are also, in general, more robust than level-based methods [41]. A basic row-based value-based ILU algorithm works as follows: after factoring row i, so the final values of both ˆL(i, :) and Û(i, :) are known, set the values of entries in row i of ˆL and Û with value less than some threshold tolerance equal to zero. A slight modification to this basic algorithm (used in [157, 158]) drops in two phases: first row i of ˆL is computed and small values in that row are dropped, then the remaining values in ˆL are used to compute values of Û(i, :) and small elements in Û are dropped. In the remaining discussion of ILU heuristics, unless specified otherwise, we assume heuristics are row-based. Column-based versions are easily derived. Although value-based methods all drop elements that are small, different methods use different measures to determine whether or not to drop an element, most requiring a user-supplied parameter α. The heuristics in [9, 63, 137] drop an element in position (i, j) if A(i, j) < α A(i, i)a(j, j). In [157, 158] an element in row i is dropped if its value is
less than α (Σ_j |A(i, j)|) / nnz(A(i, :)). In [144] an element in row i is dropped if its value is less than α max_j |A(i, j)|. In [50] an element (i, j) is dropped if its value is less than α min(‖A(i, :)‖, ‖A(j, :)‖), though notice that in the paper D'Azevedo et al. combine this with level-based dropping.

As specified so far, the storage requirements of value-based ILU are more difficult to predict than those of level-based techniques: until the elements are calculated during the factorization, their values, and hence whether they will be dropped, cannot be known. To make the storage requirements predictable, value-based ILU heuristics generally limit the number of nonzeros kept in ˆL and Û by keeping only the largest elements whose values are greater than the drop threshold. Methods for restricting the storage can be distinguished by how they allocate the number of nonzeros to each row of ˆL and Û. An equal number of elements could be kept in each row, the number could take into consideration the number of nonzeros in the same row of A, or a more complex metric could be used. As noted in [9], in the most general case a user could simply specify a function which determines the number of entries to keep in each row of ˆL and each row of Û.

The methods described in [112] and [124] do not use a drop tolerance, instead just allowing the user to specify the number of nonzeros to keep in each row. In [112] Jones and Plassmann suggest keeping the same number of elements in each row of ˆL and Û as there are nonzeros in the corresponding row of A. Notice this method does not require the user to specify any parameters. Lin and Moré build on [112] by suggesting an algorithm which keeps β extra elements in each row, where β is a user-supplied constant [124]. Suarjana and Law propose an incomplete Cholesky factorization which takes a user-supplied upper bound on the total number of nonzeros allowed in the incomplete factors and divides the nonzeros in one of three ways: equally among all the columns, proportionally to the number of nonzeros in the same column of A, or proportionally to the number of nonzeros in the same column of L [171]. The latter requires an initial symbolic factorization to determine the number of nonzeros in L, but this is not too expensive since they focus on symmetric positive definite matrices. However, we note the algorithm they describe is quite expensive, keeping in effect two incomplete factorizations, the second of which is used to update successive columns in the factorization.

Axelsson and Munksgaard also propose several IC techniques which bound the storage by some user-supplied limit [9]. One method takes a drop tolerance and computes a value-based factorization using that drop tolerance until the available space has been used,
after which a no-fill factorization (i.e., a level-based factorization with level equal to 0) is computed for the remaining rows. Because important information is often in the last, dense portion of the factorization, Axelsson and Munksgaard suggest modifying this technique to take into account an ideal curve for the amount of storage to be used as a function of the row/column; as a compromise, the number of nonzeros allowed in the row being factored is set to a percentage of the total amount of space remaining.

ILUT [157], perhaps the most popular value-based ILU heuristic, requires the user to supply both a drop tolerance and a fill parameter k. Values that are too small are dropped; furthermore only the largest k off-diagonal elements in each row of Û and ˆL are kept. So that triangular solves with the factors are always possible, the diagonal elements are always kept.

Just as pivoting is often needed for stability of complete LU factorizations, pivoting can also be useful in incomplete factorizations. In fact, pivoting can be more important for incomplete factorizations. Munksgaard notes in [137] that although pivoting is not needed when computing the complete Cholesky factorization of an spd matrix, some pivoting may help with incomplete Cholesky factorizations of spd matrices. For ILU factorizations Saad describes ILUTP [158], which is ILUT with partial pivoting. In ILUTP the user supplies a pivot tolerance, with a value between 0 and 1. An off-diagonal element is only permuted to the diagonal if it is at least as large as the current diagonal element divided by the pivot tolerance. A tolerance of 1 is equivalent to using partial pivoting; a value of 0 ensures no pivoting. Much of the work we describe in later sections is based on Saad's ILUTP algorithm, and so also uses a drop tolerance, a pivot tolerance, and a bound on the number of nonzeros.

Stabilizing techniques

Whereas the value-based and level-based dropping techniques just described mostly determine which elements to keep (the exception being pivoting, which is not directly related to dropping elements), the stabilizing techniques discussed in this section alter the values of certain entries to hopefully avoid problems in either computing or applying the preconditioner. These techniques can be applied to both level-based and value-based dropping algorithms. Because IC factorizations can break down if a diagonal element is not positive, and
ILU factorizations can break down if a diagonal element is too close to zero, stabilization techniques typically increase the diagonal of the matrix.

One set of stabilization techniques modifies the diagonal by an amount dependent on the value of dropped elements. In [181] Watts suggests adding a fraction of the value of each dropped element to the diagonal of that row; he notes performance does not seem very sensitive to the fraction used. A heuristic called MILU, for modified ILU, subtracts the value of any dropped elements in a row i from the diagonal element in row i of Û. This algorithm is described for IC factorizations in [93], and for ILU factorizations in [158, Section 10.3.5]. Another variant, by Jennings and Malik, adds s|r_ij| to the diagonal element in row i and |r_ij|/s to the diagonal element in row j if element r_ij is dropped [110]. Jennings and Malik suggest s = a_jj/a_ii.

Alternatively, the diagonal can be increased by some amount that is independent of the values dropped. Kershaw suggests replacing small pivots by a value dependent on the largest entries in that row of Û and that column of ˆL [117, 118]. Others have suggested shifting the matrix and computing a factorization of A + αI instead of A [130, 137], though in [135] Meijerink and van der Vorst suggest the local change of modifying a single diagonal element should be better than a global shift of the entire matrix.

Complete factors of incomplete matrices

Incomplete factorization techniques take the original A and compute approximate factors ˆL and Û such that ˆLÛ ≈ A. An alternative method which also computes approximations to L and U begins by finding Â ≈ A and then uses the exact factors of Â to precondition the original system. One advantage of this method is that Â can be chosen so its factorization can be computed cheaply. For example, Vaidya described two preconditioners for symmetric, diagonally dominant M-matrices in [178]. In the first, Â is chosen to be the matrix whose graph is a maximum spanning tree of the graph of A, which means Â can be ordered so its complete factors have the same sparsity pattern as Â itself. The second preconditioner uses an augmented maximum spanning tree, so the factors may be denser than Â, but not by much.

Aside from the ease of computing the preconditioner, another advantage of Vaidya's preconditioners is that bounds can be computed for the condition number of the preconditioned
matrix, which in turn bounds the number of iterations needed for preconditioned conjugate gradient to converge. In fact, generalizations of his theory (e.g. [26, 25, 32]) have been used to analyze multilevel methods [20, 87, 88, 152] and incomplete Cholesky factorizations [20, 92].

Multilevel techniques

Mimicking the multigrid algorithm (an overview can be found in [29]), several proposed ILU algorithms locate a set of independent nodes, whose rows are then factored cheaply. The Schur complement of the system is formed, and another set of independent nodes is found in it, those rows are factored, and so on. An early example of this style of algorithm is ILUM [156], which chooses an independent set of nodes in the directed graph representation of A and orders them first in the matrix. This (1, 1) subblock is diagonal and so can be factored for free; the (2, 2) subblock is updated, using dropping to keep it sparse, and the process is repeated until the Schur complement is deemed small enough to factor. This gives the final preconditioner. A block version, BILUM, is presented in [163]. The authors of BILUM also study how to choose the independent set in [164, 165]. Work on similar methods by other researchers includes [13, 27, 90]. There have also been recent attempts to generalize the multilevel framework. Work along this line includes [185] and [161], the latter describing a package called ARMS. ARMS presents a generalized formulation of recursive, multilevel solvers which both covers earlier techniques such as ILUM and BILUM, and allows for other methods.

Other preconditioners

Preconditioners based on an approximate LU factorization of A are only one of a wide variety of preconditioners. One alternative class is based on sparse approximate inverses. These are explicit instead of implicit preconditioners, which means they compute an approximation to the inverse of A, as opposed to incomplete factorizations, which compute an approximation to A itself. Both the computation and application of sparse approximate inverse preconditioners are inherently parallel; the former can be done by approximately solving n linear systems, and the latter is simply a matrix-vector multiplication. This makes them
attractive for many situations; unfortunately they are thought to be less robust than ILU preconditioners [38]. Work on such methods includes that in [39, 42, 84, 91]. Connections between specific approximate inverse and ILU heuristics have also been noted in [23, 24].

Comparison Studies

Because the ILU algorithms described in the previous sections are simply heuristics, they are evaluated through comparison with other heuristics. Papers presenting new ILU algorithms frequently describe results comparing the new algorithm to some earlier one. However, a few larger scale comparison studies have been done, some looking at ILU preconditioned iterative solvers on a wide variety of matrices [41, 81], and others focussing on more specific matrices from a single application area [31].

In [41] Chow and Saad begin with 36 matrices which cannot be solved using GMRES(50) [160] preconditioned by ILU(0) in 500 iterations. They then test a variety of level-based and value-based ILU heuristics on these problems, finding that ILU(2) is sufficient for solving all but one of the problems and that Saad's ILUTP algorithm can still break down even with the relatively large lfil value of 30. They then observe that I − A(ˆLÛ)^-1 often reveals more about an incomplete factorization than A − ˆLÛ, and further point out several statistics which can reveal common problems with ILU factors: a condition estimator, the size of the smallest pivot element, and the size of the inverse of the largest element in ˆL and Û. They also note that, in their experience, if all the statistics are small and the preconditioned iterative solver still does not converge, the problem is likely due to inaccuracy from dropping. Ultimately in [41] they find no technique that works on all matrices, leading them to suggest that general ILU codes should provide many options and parameters for the user to set, using her knowledge of the particular system.

In [81] Gilbert and Toledo also comment on the lack of robustness in ILU preconditioned iterative solvers. More specifically, the authors of [81] focus on whether ILU preconditioned TFQMR (a solver introduced in [69]) is competitive with the direct solver SuperLU [53]. On 24 matrices taken largely from those used in [122] they test value-based ILU preconditioners with various drop tolerances, partial pivoting in some cases, a heuristic for permuting to put nonzeros on the diagonal prior to computing the incomplete factorization, and a heuristic that tries to keep nonzeros on the diagonal when partial pivoting is used.
They find that ILU preconditioners are not yet robust enough to compete with direct solvers such as SuperLU, although there are cases where ILU can be much faster. Their observations include: minimum degree ordering is better than a bandwidth ordering such as reverse Cuthill-McKee; with similar amounts of fill, drop tolerance methods are better than those which keep a fixed number of elements in each column of ˆL and Û; the best combinations of preconditioner and iterative solver typically require only 8–16 iterations of TFQMR to converge; and pivoting is more important for incomplete than for complete factorization. Another study comparing an IC preconditioned iterative solver to a direct solver had similar results: when there was enough space to run complete Cholesky, it was almost always faster than the preconditioned iterative solver [139].

In [31] the authors look at ILU preconditioners for matrices from CFD problems. Looking at both level-based and value-based ILU algorithms, they note that the preconditioners generally had similar performance with similar amounts of fill. They also find block versions of ILU algorithms are generally better than point versions.

Packages

We conclude this section with a list, in table 3.7, of software packages which include the capability to solve a system of linear equations using an IC or ILU preconditioned iterative solver. The parallel codes are all written using MPI and hence can be used on either shared memory or distributed memory machines. In addition to these publicly available codes there are also commercial packages sold by companies. These include Diffpack [140] and the Elegant Mathematics packages [65]. We also point out that there are other iterative solver packages which do not include ILU preconditioners and hence are not included in the table. These include ITPACK [89], PIM [46], and Templates [14]. For more detail on these packages, including more on the IC and ILU algorithms provided, see http://www.cs.berkeley.edu/~tzuyi/ilu/ilu_codes.html. The web site also provides brief summaries of packages in development.
Code                 Scope          ILU Algorithms           Includes solvers
Serial platforms
BPKIT [40]           unsym          ILU (level, value)       yes (a)
LASPack [168]        sym, sym-pat   IC, ILU (level, value)   yes
PETSc [12]           sym, unsym     IC, ILU (level, value)   yes
Sparskit [155]       unsym          ILU (level, value)       yes
SPOOLES [8]          sym, sym-pat   IC, ILU (value)          yes
Parallel (MPI) codes
AZTEC [176]          unsym          IC, ILU (level, value)   yes
BlockSolve95 [111]   sym-pat        IC, ILU (level)          yes
ParPRE (b) [64]      unsym          ILU (level, value)       no
PETSc (c) [12]       sym-pat        IC, ILU (level)          yes
PSPARSLIB [159]      unsym          ILU (level, value)       yes
(a) not the focus of the package; only FGMRES is included
(b) uses PETSc data structures
(c) calls routines in BlockSolve95

Table 3.7: A summary of publicly available software packages that can solve systems of linear equations using IC or ILU preconditioned iterative solvers. The Scope column refers to the structure of the matrix given as input, whereas the ILU Algorithms column details whether value-based and/or level-based IC and ILU algorithms are provided. Sym-pat means the matrix is structurally, but not necessarily numerically, symmetric.
3.4.2 Experimental setup

Although incomplete preconditioners can be useful in certain situations, for example if one lacks the memory to compute a complete factorization, their unpredictability is a barrier to their usage. Even if a particular preconditioner uses almost no memory, if the subsequent solver does not converge, the savings in memory mean little. As noted previously, one of the problems with ILU preconditioners is the number of parameters whose values must be set by the user, often with little guidance. One of our hopes was to be able to provide suggestions for default parameter values for our algorithms, and so we anticipated needing to run hundreds of sequential tests with different parameter settings on assorted matrices. To simplify this task we set up a test harness which allowed us to easily set which ordering heuristics were used, as well as to generate scripts for running the same code with a variety of parameter settings. This is similar to work being done at Boeing, where a tool for executing and analyzing runs is used to gather information about the effects of different parameter settings on the behavior of ILU preconditioned iterative solvers on specific matrices [120, 121].

All our runs had the same structure. After the initial matrix was read into our sparse matrix data structure, we could then equilibrate it, order and scale for stability, order for sparsity, compute an incomplete factorization, and run the iterative solver. Only the last two steps were used in every run; the first three could be turned on or off depending on the experiment being run (a skeleton of this structure is sketched at the end of this overview). In this section we briefly describe the algorithms used in each step.

Doing comparisons of multiple algorithms with an assortment of parameter settings over several dozen matrices is an embarrassingly parallel task. To run them all we used rexec [43] on the Berkeley Millennium [136], a cluster of approximately a hundred 2- and 4-way SMPs running Linux. Since the Millennium is a shared resource, we rarely had the machine to ourselves, and so we decided to focus on convergence results rather than on timing results. Most of our code is written in C, although we link to a few Fortran codes written by other people. We use the GNU cc and f77 compilers. We also link to the optimized BLAS [99].
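To make the structure of these runs concrete, the following is a minimal sketch of how such a test driver might be organized. The stage routines here are hypothetical placeholder stubs, not the harness actually used; only the overall control flow, with the first three stages optional and the last two always performed, reflects the description above.

/* Minimal sketch of the run structure described above.  The stage
 * routines are hypothetical placeholders that only print, standing in
 * for the real equilibration, ordering, factorization, and solver codes. */
#include <stdio.h>

typedef struct {
    int equilibrate;        /* scale rows and columns of A            */
    int stability_order;    /* order and scale for stability (MC64)   */
    int sparsity_order;     /* order for sparsity (colamd or rcm)     */
} options_t;

static void equilibrate_stage(void)     { printf("equilibrate rows and columns\n"); }
static void stability_order_stage(void) { printf("order and scale for stability\n"); }
static void sparsity_order_stage(void)  { printf("order for sparsity\n"); }
static void ilu_factor_stage(void)      { printf("compute incomplete factorization\n"); }
static void gmres50_stage(void)         { printf("run GMRES(50) with the ILU preconditioner\n"); }

static void run_one_test(options_t opt)
{
    /* The first three stages can be turned on or off per experiment;
     * the last two are performed in every run. */
    if (opt.equilibrate)     equilibrate_stage();
    if (opt.stability_order) stability_order_stage();
    if (opt.sparsity_order)  sparsity_order_stage();
    ilu_factor_stage();
    gmres50_stage();
}

int main(void)
{
    options_t opt = { 1, 1, 1 };   /* e.g., equilibrate, then MC64, then colamd */
    run_one_test(opt);
    return 0;
}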
Equilibration

In [183] Wilkinson notes that, in his opinion, pivoting during the factorization is only reasonable if the rows and columns of A all have comparable norms. This can be accomplished initially through equilibration, which computes diagonal scaling matrices R and C such that the max-norm of each row and column of B = RAC is 1. In practice the R and C computed only make the max-norm of each row and column of B close to 1. Although the equilibrated matrix B is not guaranteed to have a smaller condition number than A, in practice it often does. In our experiments we always scaled the matrix A so that the maximum entry in A was 1. However, this scaling was done sometimes through equilibration and sometimes through the scaling in MC64, described next.

Ordering for stability and sparsity

In our tests we experimented with ordering the matrices for both stability and sparsity, using a few of the algorithms described in sections 3.2.1 and 3.3.1. For stability we used the MC64 option that maximizes the absolute value of the product of the diagonal elements, since experiments in [17, 57] suggest this option works best. If scaling was used, we applied scale factors that set the diagonal values to have magnitude 1 and the off-diagonals to values of magnitude at most 1. For sparsity we tried both the colamd algorithm [48] and an implementation of reverse Cuthill-McKee (rcm). Results with sparsity orderings are compared against those using simply the natural order of the matrix, as some researchers note that the matrices in sparse matrix collections sometimes already have significant natural structure [182].

Our rcm implementation uses the ordering generated by a second run of the breadth-first search, where the start node is the last node found in the first run of the algorithm. This makes it more likely that the start node is on the periphery of the graph. Furthermore, the unsearched neighbors of a node are placed on the main queue in order of increasing degree. For orderings by rcm we run the algorithm on A + A^T and apply the permutation symmetrically to the rows and columns of A.

When both MC64 and colamd are used, we always apply the permutation found by colamd symmetrically to an A that has already been permuted by the MC64 ordering. Applying the colamd permutation symmetrically
maintains the large diagonal found by MC64. When colamd is the only permutation algorithm used, we run tests applying it both symmetrically and only to the rows of A. Because we compute the incomplete factors by rows, as described in section 3.4.2, we pivot columns for stability, and so permute the rows for sparsity.

The ILU algorithms

We tested both level-based and value-based ILU algorithms. Though we focussed on the latter, we were also interested in testing the trade-off between tunability and simplicity, with tunability associated more with value-based methods and simplicity with level-based methods. Value-based methods frequently require the user to specify values for a number of parameters. While parameter tuning allows for greater flexibility of value-based ILU algorithms, a large number of parameters without some guidelines as to how best to set them can overwhelm a user. Level-based methods, on the other hand, traditionally take a single parameter specifying the level, thus making them seemingly easier to use. As noted in section 3.4.1, drop tolerances and pivoting can be added to level-based methods. However, we chose not to test codes with this added capability, as we were primarily interested in seeing whether specifying one parameter for a level-based method could lead to performance comparable to that of a tuned multi-parameter value-based method.

The ILU algorithms we test all have the same basic structure. The codes are based on row-wise, up-looking complete factorizations: in each iteration a row of the matrix is loaded, updated with the appropriate preceding already-factored rows, the factorization of the current row is completed, and the factors are stored into some sparse matrix data structure. For Saad's ILUTP algorithm [158], the row is updated by preceding rows only if the contributions are likely to be large. The incomplete factorization of the row drops all elements smaller than some norm of the row of A times a drop tolerance, and only the lfil largest elements of that row of Û and ˆL are stored. Columnwise pivoting is done only if the pivoted element is larger than the diagonal element divided by a pivot tolerance pivtol. We tested drop tolerance values of 0, 0.001, 0.01, and 0.1, and pivot tolerances of 0, 0.1, and 1.

For the ILU(k) algorithms we used sequential code provided by Hysom [105]. In his code the symbolic phase, which determines the structure of each row by doing a symbolic
update and seeing what level each new fill element would have, is interleaved with the numeric phase in which the values of elements in low-level positions are filled in on a row-by-row basis. Because this makes the amount of memory needed more difficult to predetermine, the code allocates memory dynamically: if the factors run out of space, twice as much memory is allocated and the rows calculated so far are copied into the new space. Because of this strategy, there are cases where tests run out of memory even though it is possible the incomplete factors could have fit in memory.

To simplify comparing the amounts of fill for different matrices and ILU algorithms, we typically normalize by nnz(A). By doing this we can measure storage requirements in terms of how much the user has to pay in addition to the storage already needed for A. Whereas in a complete factorization A can be overwritten by its factors, for an incomplete factorization A must be kept for computing matrix-vector multiplications during the iterative solve.

The iterative solver

As others have done in their studies of preconditioners, we choose a specific iterative solver and use it for all our tests of different ILU heuristics. Of the several iterative solvers we could have chosen (see [14] for an overview of iterative methods), we decided to use restarted GMRES [160]. GMRES, which stands for generalized minimal residual, is an iterative method for general unsymmetric systems. It constructs a sequence of orthogonal vectors and finds the approximate solution vector in that space which minimizes the residual. In exact arithmetic GMRES will find the exact answer in no more than n steps, but in practice we want neither to take n iterations when n can easily be in the hundreds of thousands, nor to use the amount of memory required. Therefore a more popular method is restarted GMRES(m), which runs m iterations of GMRES before restarting the entire algorithm with the current approximate solution as the new initial guess. The memory requirements are now less, although we must now choose the restart value m. Unfortunately, knowing when to restart is not obvious. Too large a value for m and the storage becomes prohibitive; too small and systems may fail to converge. We choose m = 50, a value also used in other ILU studies, including [31, 41, 114, 184]. We stop either when 500 iterations have been taken or when the relative residual has been reduced by
1 × 10^-8. Both values are also used in studies such as [41].

The matrices

Our set of 65 test matrices overlaps those used as test cases in various other papers studying ILU algorithms, including [17, 41, 157]. The test cases in [157] are on matrices that we consider easy for value-based ILU preconditioning. For example, in our tests, both the pores2 and sherman3 matrices converge for values of lfil greater than or equal to 5. Our test suite also includes several matrices that are large relative to the test cases in [41]. Only one of their matrices has over 1,000,000 nonzeros, whereas 16 of our matrices do; 2 of their matrices have dimension over 10,000, whereas 28 of ours do. Similarly, on the whole, our test suite includes matrices that are larger than those in [17]. In general the larger size of the matrices in our test suite as compared to those in previous studies reflects the fact that as the size of computer memory increases over time, the size of the systems being solved on computers increases similarly. However, because of the size of the largest matrices in our test suite, in some of our experiments test cases on certain matrices run out of memory. When presenting our results we note when this happens. See Appendix B for detailed information about the individual matrices in our test suite, including dimension, number of nonzeros, largest component, and application area.

3.4.3 The ILUTP Push algorithm

Experiments with Saad's ILUTP algorithm on the matrices in our test suite show that achieving convergence often requires a very high value for the lfil parameter. However, whereas the ILUTP algorithm assumes lfil nonzeros are kept in each row, with higher values of lfil the nonzeros are not evenly distributed across the rows. We suggest a modified version of ILUTP, which we call ILUTP Push. ILUTP Push gives the user a better sense of how much memory is actually used, and provides for more efficient use of the space a user is willing to allocate for the incomplete factors. Briefly, experiments show that if a user specifies ILU parameters corresponding to a fixed upper bound on the memory, ILUTP Push generally leads to convergence if ILUTP does, and frequently also when ILUTP does not. The motivation is described more thoroughly in
section 3.4.3. The algorithm is presented in section 3.4.3 and comparisons of this code to ILUTP are presented in section 3.4.3.

We assume the user's first concern is that the iterative method converges. Clearly a preconditioner that uses very little space, yet which leads to stagnation in the preconditioned iterative solver, is useless. In this section we also assume that the user has an upper bound on the amount of memory he is willing to allocate for the preconditioner. In general we assume this bound is at least nnz(A), and less than the space needed for the complete factors. Note the space needed for the complete factors can be bounded using the techniques in [78].

Motivation

First recall that the ILUTP [158] algorithm takes three parameters: lfil, droptol, and pivtol. The lfil value is the maximum number of nonzeros kept in each row of ˆL and Û. The droptol value determines which elements are small and should be dropped. The pivtol value specifies how much larger than the diagonal element an off-diagonal element must be to justify pivoting. Figure 3.2 gives pseudocode for this algorithm.

Papers which test these and related algorithms seem to use lfil values between 5 and 30, suggesting an lfil value of 30 is very large [17, 41]. Only more recently have researchers tried larger values of fill, as in [31]. Our experience suggests factorizations with high amounts of fill are often necessary for subsequent preconditioned iterative solvers to converge. The tests summarized in table 3.8 show that values of lfil much higher than 25 were sometimes required for convergence. Notice that we do not subdivide the convergences, so both a matrix that converges for only one combination of the pivtol ∈ {0, 0.1, 1} and droptol ∈ {0, 0.001, 0.01, 0.1} parameters and one that converges for all of them are counted just once.

value of lfil              0    5    10   25   50   75   100   150   200
# matrices that converge   12   26   32   37   44   46   47    49    51

Table 3.8: In this table we show the number of matrices, out of 65, that converge with ILUTP and a specified lfil value for at least one setting of the droptol and pivtol parameters. We used MC64 ordering and scaling for stability, and colamd ordering for sparsity.
(ˆL, Û, P) = ILUTP(A, lfil, droptol, pivtol)
1   for i ← 1 to n
2       copy A(i, :) into work vector w
3       for j ← 1 to i − 1                                % dropping in ˆL
4           if w(j) ≠ 0 and |w(j)/Û(j, j)| > droptol
5               then w(j : n) = w(j : n) − (w(j)/Û(j, j)) Û(j, j : n)
6               else w(j) = 0.0
7       ˆL(i, 1 : i − 1) = largest lfil elements of w(1 : i − 1)
8       for j ← i to n                                    % dropping in Û
9           if |w(j)| ≤ droptol · ‖A(i, :)‖
10              then w(j) = 0.0
11      Û(i, i) = w(i)
12      Û(i, i + 1 : n) = largest lfil − 1 elements of w(i + 1 : n)
13      if max(|Û(i, i + 1 : n)|) > |Û(i, i)|/pivtol       % pivot if necessary
14          then pivot by swapping the max and diagonal entries
15               update ˆL, Û
16               update P

Figure 3.2: Pseudocode for the ILUTP algorithm.
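Two steps of figure 3.2 are easy to get wrong in an implementation: keeping only the largest lfil entries of the work row (lines 7 and 12) and the pivoting test (line 13). The following is a small, self-contained C sketch of just those two steps on a dense work row; it is illustrative only, not the code used in our experiments, and a production version would use a partial sort rather than a full qsort. The pivot test is written in the equivalent multiplicative form so that pivtol = 0 (never pivot) requires no division.

/* Sketch of two inner steps of the ILUTP pseudocode in figure 3.2:
 * zeroing all but the lfil largest-magnitude entries of a work row,
 * and the column-pivoting test.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);                /* sort descending */
}

/* Zero all but the lfil largest-magnitude entries of w[0..m-1]. */
static void keep_largest(double *w, int m, int lfil)
{
    if (lfil <= 0) { for (int j = 0; j < m; j++) w[j] = 0.0; return; }
    if (lfil >= m) return;

    double *mag = malloc((size_t)m * sizeof *mag);
    for (int j = 0; j < m; j++) mag[j] = fabs(w[j]);
    qsort(mag, (size_t)m, sizeof *mag, cmp_desc);
    double cutoff = mag[lfil - 1];           /* lfil-th largest magnitude */

    int n_greater = 0;
    for (int j = 0; j < m; j++) if (fabs(w[j]) > cutoff) n_greater++;
    int ties_to_keep = lfil - n_greater;     /* how many entries equal to the cutoff survive */

    for (int j = 0; j < m; j++) {
        double a = fabs(w[j]);
        if (a > cutoff) continue;
        if (a == cutoff && ties_to_keep > 0) { ties_to_keep--; continue; }
        w[j] = 0.0;
    }
    free(mag);
}

/* Pivot test of line 13, written multiplicatively: pivot when
 * |max off-diagonal| * pivtol > |diagonal|.  With pivtol = 1 this is
 * partial pivoting; with pivtol = 0 it never pivots. */
static int should_pivot(double diag, double max_offdiag, double pivtol)
{
    return fabs(max_offdiag) * pivtol > fabs(diag);
}

int main(void)
{
    double w[] = { 0.3, -4.0, 0.05, 2.0, -0.7, 1.0 };
    keep_largest(w, 6, 3);
    for (int j = 0; j < 6; j++) printf("%g ", w[j]);   /* prints: 0 -4 0 2 0 1 */
    printf("\npivot with pivtol = 0.1? %d\n", should_pivot(0.3, -4.0, 0.1));
    return 0;
}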
Unfortunately, to guarantee that you will not run out of memory when using such a high lfil value, one either has to call the program with room for 2 · n · lfil doubles allocated, or to allocate memory dynamically. If the code is in Fortran 77, as in Sparskit [155], only the former can be done, and table 3.9 shows that even with an lfil value of 5, only half of the factorizations use even half of the total space allocated.

lfil   # converge   # where nnz(ˆL + Û) ≥ k · 2 · lfil · n
                     k = .1   k = .25   k = .5   k = .75
0      12           12       12        12       12
5      20           20       19        10       6
10     27           27       23        13       8
25     32           30       24        16       4
50     36           31       24        16       4
75     40           32       28        16       5
100    45           36       30        16       7
150    46           36       26        16       5
200    48           37       25        16       3

Table 3.9: In this table we show the number of matrices which converged with ILUTP for a given lfil value, a droptol of 0.0, and a pivtol of 1.0. We also show, for several fractions k, how many of those factorizations actually use at least that fraction of the maximum space allowed for the incomplete factors. As in table 3.8, the matrices are ordered and scaled using MC64 and colamd. Because the experiments here were conducted with droptol and pivtol fixed, the numbers differ from those in table 3.8.

Furthermore, even if the ILUTP code were altered to use dynamic memory allocation, the user would still have little sense in advance of how much memory will actually be used. The user would have an upper bound on the memory usage, but no sense of how gross an overestimate that bound might be. Graphs such as those in figure 3.3 suggest the ILUTP heuristic of specifying a constant bound on the number of nonzeros to be kept in each row of ˆL and Û does not accurately reflect how nonzeros are distributed in the factors. As figure 3.3 shows, for the shyy41 and vavasis1 matrices, even with a droptol of 0.0, many rows do not even have that many nonzeros to keep. Rather, one should keep more nonzeros in later rows to reflect their increasing density. This can be seen in the graphs in figure 3.3, where the factors are shown to go from having fewer than 10 nonzeros in each row (seen by the jaggedness of the line for the earlier rows) to having mostly full rows by the end (as the line approaches a straight line with value 10).
Figure 3.3: These plots look at the number of nonzeros kept in each row of ˆL + Û using ILUTP with an lfil of 10, a pivtol of 0.0, and a droptol of 0 on two of the smaller matrices from our test suite. The two plots on the left show the counts for ˆL and Û of the shyy41 matrix; the two on the right show the same for the vavasis1 matrix. In these runs nnz(ˆL) = 20889 and nnz(Û) = 33771 for shyy41, and nnz(ˆL) = 24875 and nnz(Û) = 39148 for vavasis1, against an allowance of lfil · n = 47200 and 44080 nonzeros per factor, respectively.

The ILUTP Push Algorithm

The ILUTP Push algorithm exploits the observation that the nonzeros in the incomplete factors computed using ILUTP with high values of lfil are not distributed evenly across the rows. Rather, the incomplete factors ˆL and Û computed with high values of lfil mimic the complete factors L and U in that the rows towards the bottom of the matrix are generally denser. The pseudocode in figure 3.4 shows how a modified ILUTP algorithm, which we call ILUTP Push, can exploit this observation. This pseudocode is meant to give an idea of how the ILUTP Push algorithm works, so some details of our implementation are omitted.

Comparing the pseudocode for ILUTP Push to that in figure 3.2 for ILUTP, we see the meanings of droptol and pivtol are the same. The change is in how the algorithms decide how many nonzeros to keep in each row of ˆL and Û: in ILUTP, lfil is the upper bound on the number of nonzeros kept in each row of ˆL and Û; in ILUTP Push, lfil nnz is an upper bound on the total number of nonzeros in ˆL + Û divided by the number of nonzeros in A, so nnz(ˆL + Û) ≤ lfil nnz · nnz(A), but the number of nonzeros kept in each row is determined based on the number kept in previous rows.
(ˆL, Û, P) = ILUTP Push(A, lfil nnz, droptol, pivtol)
1   space left = lfil nnz · nnz(A)
2   for i ← 1 to n
3       copy A(i, :) into work vector w
4       space row = space left/(n − i + 1)                % determine space for this row
5       lfil = space row/2                                % space for this row of ˆL
6       for j ← 1 to i − 1                                % dropping in ˆL
7           if w(j) ≠ 0 and |w(j)/Û(j, j)| > droptol
8               then w(j : n) = w(j : n) − (w(j)/Û(j, j)) Û(j, j : n)
9               else w(j) = 0.0
10      ˆL(i, 1 : i − 1) = largest lfil elements of w(1 : i − 1)
11      for j ← i to n                                    % dropping in Û
12          if |w(j)| ≤ droptol · ‖A(i, :)‖
13              then w(j) = 0.0
14      Û(i, i) = w(i)
15      lfil = space row − nnz(ˆL(i, :))                  % space for this row of Û
16      Û(i, i + 1 : n) = largest lfil − 1 elements of w(i + 1 : n)
17      if max(|Û(i, i + 1 : n)|) > |Û(i, i)|/pivtol      % pivot if necessary
18          then pivot by swapping the max and diagonal entries
19               update ˆL, Û
20               update P
21      space left = space left − nnz(ˆL(i, :)) − nnz(Û(i, :))

Figure 3.4: Pseudocode for our ILUTP Push algorithm.
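The essential difference from figure 3.2 is the space accounting in lines 1, 4–5, 15, and 21. The following C sketch isolates just that accounting, with the factorization itself replaced by a synthetic per-row demand; the dimension, nonzero count, and demand formula are made up for illustration. It shows how budget left unused by early, sparse rows is pushed toward the later, denser rows.

/* Sketch of the ILUTP Push space accounting (lines 1, 4-5, 15, and 21 of
 * figure 3.4).  The factorization is replaced by a synthetic per-row
 * "demand", so this only illustrates how unused allowance from early
 * rows accumulates for later rows. */
#include <stdio.h>

int main(void)
{
    int n = 10;                    /* hypothetical dimension                  */
    int nnzA = 40;                 /* hypothetical nnz(A)                     */
    int lfil_nnz = 2;              /* bound: nnz(L+U) <= lfil_nnz * nnz(A)    */
    int space_left = lfil_nnz * nnzA;

    for (int i = 0; i < n; i++) {
        int space_row = space_left / (n - i);   /* line 4: allowance for this row    */
        int lfil_L = space_row / 2;             /* line 5: half of it for row i of L */
        /* Synthetic demand: early rows want few entries, later rows many. */
        int want_L = 1 + i, want_U = 2 + 2 * i;
        int kept_L = want_L < lfil_L ? want_L : lfil_L;
        int lfil_U = space_row - kept_L;        /* line 15: rest of the allowance for U */
        int kept_U = want_U < lfil_U ? want_U : lfil_U;
        space_left -= kept_L + kept_U;          /* line 21 */
        printf("row %2d: allowance %3d, kept %2d in L and %2d in U, %3d left\n",
               i + 1, space_row, kept_L, kept_U, space_left);
    }
    return 0;
}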
In their most general form, ILU algorithms could take as input a function which bounds the number of nonzeros kept in each row. Standard ILUTP uses a constant function with value lfil. ILUTP Push, on the other hand, uses a non-decreasing function which uses information about the number of nonzeros kept so far. In general, one can use any of a number of functions. For example, we tried a variation in which the number of nonzeros allowed in ˆL and Û for any given row was the same, so line 15 in the pseudocode was eliminated. Another variant we tried determined the amount of space allowed for a row of ˆL or Û by taking the total amount of space left and dividing it by the number of rows remaining in ˆL plus the number of rows remaining in Û. In other words, we eliminated line 4 and changed lines 5 and 15 in the pseudocode as follows:

5   lfil = space left/(2(n − i + 1))
...
15  lfil = (space left − nnz(ˆL(i, :)))/(2(n − i + 1) − 1)

However, we found that none of the variants we tried performed significantly better than ILUTP Push.

In [9] Axelsson and Munksgaard describe a method for computing a modified incomplete Cholesky factorization which is similar to ours, though designed for spd matrices. Their M31D algorithm also takes an upper bound on the number of nonzeros allowed in the factors and, in factoring the ith row, allows it 1/(n − i + 1) of the remaining space available. However, they run without a drop tolerance, and avoid the need for pivoting by modifying the values of entries in the row being factored. In [171] Suarjana and Law describe an incomplete Cholesky factorization in which the number of nonzeros in a row of the incomplete factors is proportional to the number of nonzeros in that same row of the complete factors.

Experimental results

To compare the ILUTP Push algorithm to the ILUTP algorithm, we compare instances where lfil and lfil nnz are set to impose similar upper bounds on the number of nonzeros in ˆL + Û. Differences between the bounds are due to the fact that both lfil and
lfil nnz need to be integers. We look both at the number of matrices that converge and at how well the upper bound matches the memory used.

lfil nnz      0    1    2    3    4    5
ILUTP         12   24   31   40   44   44
ILUTP Push    12   29   42   46   49   50

Table 3.10: This table shows the number of matrices that achieve convergence with ILUTP or ILUTP Push preconditioned GMRES(50) for at least one combination of the droptol and pivtol parameters and the given value of lfil nnz. The matrices are ordered and scaled using MC64, then the colamd permutation is used to order the rows and columns symmetrically. Notice ILUTP Push always converges at least as often as ILUTP for the same upper bound on the memory.

Note that the relationship between lfil nnz and lfil can vary dramatically depending on the matrix. For example, an lfil nnz value of 2 corresponds to lfil = 2 on the tols4000 matrix and to lfil = 132 on the appu matrix. This suggests specifying lfil, thus fixing the number of nonzeros per row of ˆL and Û, is a poor way to specify the amount of memory one is willing to allocate to the preconditioner. For the complete factors of A, computed using SuperLU with colamd ordering, nnz(L + U)/nnz(A) ranges from 1.44 for the olm5000 matrix to 198.0 for the ecl32 matrix, with the appu and mchln85ks17 matrices being too large to compute their complete factorization.

Table 3.10 shows the number of matrices that converge with ILUTP and ILUTP Push for different levels of fill. Clearly, with the same bound on the fill, significantly more matrices converge with ILUTP Push. In fact, though we do not show the full results here, for any specific combination of lfil, droptol, and pivtol, at least as many test cases converge with ILUTP Push as with ILUTP. Discounting the case where lfil nnz = 0, so only the diagonal is kept, for all but one combination of parameters more cases converge with ILUTP Push than with ILUTP. However, note that, as discussed below, the superior performance is due to the fact that ILUTP Push uses more of the space allocated.

We also observe that for the large lfil nnz values of 4 and 5, the value of pivtol has very little effect on the number of matrices that converge with ILUTP as the preconditioner (see table 3.11). For ILUTP Push, on the other hand, the number of matrices increases with the pivot tolerance. This reflects the fact that the incomplete factors computed using ILUTP Push can more closely resemble a complete factorization.

Other interesting observations include the fact that the behavior seems to change
with respect to the effects of the droptol and pivtol values once lfil nnz is at least 3.

lfil nnz        4                    5
pivtol          0     0.1    1.0     0     0.1    1.0
ILUTP           37    37     37      38    39     40
ILUTP Push      41    42     44      41    45     44

Table 3.11: This table shows the effect of pivoting on the number of matrices that converge using ILUTP and ILUTP Push preconditioning with lfil nnz values of 4 and 5. A matrix is considered to have converged if it converges for at least one setting of droptol.

For both ILUTP Push and ILUTP, with lfil nnz ≤ 2, a pivtol value of 0.1 outperforms both the case where the value is 0 and the case where it is 1. However, for lfil nnz ≥ 3, partial pivoting with pivtol = 1.0 becomes superior. Also, for lfil nnz ≥ 3, with ILUTP a drop tolerance of 0.0 becomes superior to a drop tolerance of 0.001, whereas for ILUTP Push the nonzero value is always superior.

Notice that the fill is specified differently than in table 3.8. In table 3.11 the fill parameter specified is the lfil nnz value given to ILUTP Push. For ILUTP we use lfil = (lfil nnz · nnz(A))/(2n), which ensures the upper bound on the number of nonzeros in ˆL + Û is lfil nnz times the number of nonzeros in A (a small sketch of this conversion appears after table 3.12). Again, to be counted as having converged, a system need only have converged for at least one combination of the pivtol and droptol parameters.

              ILUTP                                               ILUTP Push
lfil nnz   # conv   # where nnz(ˆL + Û) ≥ k · lfil nnz · nnz(A)   # conv   # where nnz(ˆL + Û) ≥ k · lfil nnz · nnz(A)
                    k = .1   k = .25   k = .5   k = .75                    k = .1   k = .25   k = .5   k = .75
0          12       12       12        12       12                12       12       12        12       12
1          18       18       18        12       3                 19       19       19        19       18
2          23       23       23        13       5                 31       31       31        30       28
3          32       32       32        20       11                37       37       37        34       32
4          37       37       33        23       10                40       40       39        37       33
5          41       40       35        24       7                 42       42       39        37       30

Table 3.12: In this table we show the number of matrices which converged with ILUTP and with ILUTP Push for a given lfil nnz value, a droptol of 0.0, and a pivtol of 1.0. We also show, for several fractions k, how many of those factorizations actually use at least that fraction of the maximum space allowed. Again, the matrices are ordered and scaled using MC64 and colamd.
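For reference, the conversion just described, from an ILUTP Push fill bound lfil nnz to a per-row ILUTP bound lfil imposing roughly the same cap on nnz(ˆL + Û), is small enough to state as code. The truncating integer division below is an assumption on our part; the text only requires that both values be integers, which is why the two caps agree only approximately.

/* Sketch of the conversion from an ILUTP Push bound (lfil_nnz) to an
 * ILUTP per-row bound (lfil), so that 2*n*lfil is roughly
 * lfil_nnz*nnz(A).  The rounding (truncation here) is an assumption. */
#include <stdio.h>

static int equivalent_lfil(int lfil_nnz, int nnzA, int n)
{
    return (lfil_nnz * nnzA) / (2 * n);
}

int main(void)
{
    int n = 5000, nnzA = 60000;        /* hypothetical matrix sizes */
    for (int lfil_nnz = 1; lfil_nnz <= 5; lfil_nnz++)
        printf("lfil_nnz = %d  ->  lfil = %d\n",
               lfil_nnz, equivalent_lfil(lfil_nnz, nnzA, n));
    return 0;
}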
Table 3.12 shows that for a specified bound on fill, the ILUTP Push algorithm uses more of the space. Although this means the ILUTP Push algorithm uses more memory, it also means that the user can specify an honest upper bound for the memory available for the preconditioner, rather than artificially inflating it, knowing that the modified algorithm is more likely to use a significant fraction of that space. Also for comparison, we note that the number of nonzeros in the complete factors of our matrices is typically more than 5 times the number of nonzeros in A. In short, this means ILU preconditioned GMRES(50) can certainly be competitive in terms of space. In figure 3.5 we show how many nonzeros are in the complete factors of our matrices.

Figure 3.5: This graph shows the number of nonzeros in the complete factors of the matrices in our test suite. In each collection of three bars, the bar on the left is for the number of matrices whose fill is within the specified range with colamd and partial pivoting; the bar in the middle is with MC64 and colamd applied symmetrically with partial pivoting in the factorization; the bar on the right is with MC64 and colamd applied symmetrically but without pivoting in the factorization. For two of the matrices (appu and mchln95ks17) there was never enough space to compute the complete factorizations.

Because ILUTP Push converges for more matrices than ILUTP with the same bound on fill, we use it in the tests described in the rest of this chapter.
3.4.4 Effects of orderings

In this section we look at the effects of orderings on ILU preconditioned iterative solvers. In addition to summarizing the previous work on this topic, we also report on our studies showing the effects of sparsity and stability orderings on the convergence rate of GMRES(50) preconditioned with level- and value-based ILU algorithms. We begin by looking at the effect of orderings only for sparsity, then study the combination of sparsity and stability orderings. Our findings confirm observations made by other researchers, for example that the original ordering of a matrix is often reasonable, that minimum degree orderings work well with value-based ILU algorithms, and that pivoting is important even if the matrix is first permuted for stability. Other observations we make are, for example, that full partial pivoting is typically not desirable, and that for small fill levels (e.g., lfil nnz of 0 or 1) rcm is generally better than either no ordering or colamd ordering. Furthermore, we find equilibrating A before any of the orderings is not as effective as skipping the equilibration but then applying the scaling suggested by MC64 with job 5 after the orderings.

Previous Work

We divide the previous work on ordering matrices prior to computing an incomplete factorization into two broad groups. The first group introduces new orderings expected to perform well with incomplete factorizations. The second group studies how orderings designed for complete factorizations (i.e., direct solvers) affect ILU preconditioned iterative solvers. We review some examples of work in the first group here, but reserve discussion of work in the second category for sections 3.4.4 and 3.4.4.

The minimum discarded fill (MDF) algorithm discussed in [49, 50] uses the minimum degree ordering framework described in section 3.2.1 to find an ordering suitable for incomplete factorizations. In an incomplete factorization, the elimination of an element leads to some set of elements in the factors being dropped, or set to zero. If F_k is the matrix of elements dropped by the elimination of element k, then in each step the MDF algorithm chooses to eliminate the element k such that ‖F_k‖_F is minimized. In [49] their experiments show the MDF ordering typically improves the convergence rate of ILU-preconditioned conjugate gradients for finite element problems on unstructured grids. However, because the ordering is so expensive to compute relative to RCM, they note the usefulness of MDF
is likely greatest when used for problems where the cost of computing the ordering can be amortized over several solves with matrices of the same structure. Experiments in [50] with an approximate MDF ordering show that it can be quite effective for matrices from Navier-Stokes problems.

In [18] the authors introduce a multicoloring ordering for ILU(0) and ILU(1) factorizations of matrices representing regular grids. Although a rectangular grid can be colored using only two colors in a checkerboard pattern (i.e., the red-black ordering), using more colors can allow for faster convergence. They use 8 or 16 colors and show that if the nodes of any given color are ordered sequentially, the diagonal blocks of both the matrix and its incomplete factors are diagonal. Though the multicolorings were designed to balance the amount of parallelism against the rate of convergence, the authors find they can also outperform the natural ordering sequentially, for example improving the convergence rate of solving Poisson's equation using preconditioned CG.

The effects of ordering for sparsity

We begin by looking at the effects of the fill-reducing orderings colamd and rcm, without considering stability orderings such as MC64. In section 3.2 we described sparsity orderings that can be used for either complete or incomplete factorizations. Although orderings used for complete factorizations can also be used prior to computing incomplete factorizations, the metric for judging the ordering changes. Whereas for complete factorizations the best ordering is that which minimizes nnz(L + U), for incomplete factorizations the goal is to find the ordering which minimizes the number of iterations needed for the subsequent preconditioned iterative solver to converge. In other words, minimizing fill is not necessarily the right strategy; rather, you want to ensure the elements kept in ˆL and Û are as significant as possible.

In this section we discuss the effects of sparsity orderings on the convergence of iterative solvers preconditioned using incomplete factorizations. After describing previous studies of the effects of orderings in section 3.4.4, we describe some of our tests in section 3.4.4. Our studies confirm generally held beliefs about orderings for incomplete factorizations: fill-reducing orderings work best with value-based techniques and bandwidth-reducing orderings work best with level-based techniques [138]. We also confirm the observation in [182] that there is significant structure in how the matrices are stored in sparse matrix collections,
so using the natural ordering is often reasonable.

Previous Work

Many previous studies have looked at the effects of ordering a matrix for sparsity in its factors on the convergence rate of IC and ILU preconditioned iterative solvers. However, it is difficult to draw solid conclusions from this body of work, in part because the effects of an ordering can vary with the matrix test suite and with the IC or ILU algorithm used to compute the incomplete factors. Nevertheless, some general themes appear, including discussions of the tradeoff between orderings which enhance parallelizability (e.g., nested dissection or multicolor orderings) and the convergence rate of the subsequent solver [18, 54, 58].

One much studied ordering is the red-black (RB) ordering. In [58] Duff and Meurant find it leads to slower convergence of level-based incomplete Cholesky preconditioned conjugate gradient (ICCG) on matrices from 5-point discretization schemes. The slower convergence was also noted for general symmetric positive definite matrices when RB ordering was used for ICCG with a low level of fill [66, 149]. However, for more general matrices and other preconditioners the effects of RB coloring are less clear. In [157] Saad finds RB ordering can improve the convergence of restarted GMRES for nonsymmetric, non-diagonally dominant matrices if enough accuracy is used in the incomplete factorization. In [19] Benzi, Szyld, and van Duin also find that RB orderings are not necessarily bad, this time for structurally symmetric matrices from PDEs discretized on structured grids using finite differences.

The RB ordering is often compared to other orderings. So, for example, in [58] Duff and Meurant make the general observation that local ordering techniques such as reverse Cuthill-McKee (RCM) tend to improve convergence, whereas many orderings which are well-suited for parallelism, including nested dissection (ND) and red-black orderings, lead to slower convergence. In [19] Benzi, Szyld, and van Duin compare the natural ordering against RB, RCM, and ND orderings for ILU(0) and ILU(1) preconditioners. They conclude that in general ND is a poor ordering; that RB and RCM are better than ND and the natural ordering for symmetric problems; and that RB is better than the natural ordering for highly non-symmetric matrices, though neither is as good as RCM. Furthermore, for symmetric matrices the results depend on the amount of fill in the factorizations and on whether an IC or symmetric ILUT algorithm is used.

Minimum degree orderings have also been tested. One of the tests described in [81] looks at the behavior of RCM and MMD (an approximate minimum degree ordering
described in [79]) orderings for value-based ILU preconditioned TFQMR. In [81] Gilbert and Toledo find that pivoting with MMD is generally faster than pivoting with RCM. In general MMD tends to lead to sparser factors, and more tests converged using MMD than RCM.

Our Studies

In our studies we first equilibrate the matrices, then use either the natural ordering, rcm applied symmetrically to the rows and columns, or colamd also applied symmetrically. We tried colamd applied nonsymmetrically to only the rows of A, but found this generally performed worse than colamd applied symmetrically. This is expected since colamd applied nonsymmetrically assumes partial pivoting will be used, and in most of our tests we use less than full partial pivoting.

Looking first at a level-based ILU heuristic, we find that increasing the level beyond 3 does not cause significantly more matrices to converge and that there is enough structure in the original matrices for the natural ordering to outperform both rcm and colamd orderings. This structure in the original matrices has been noted elsewhere. For example, in [58, 19] they find the natural ordering is generally at least as good as rcm and minimum degree orderings for matrices that are nearly symmetric, though for nonsymmetric matrices [19] finds the natural ordering is almost never the best. Also, in [58, 66, 129] they find the natural ordering is better than multicoloring orderings such as the red-black ordering.

level           0    1       2        3        4        5
natural         14   18(a)   21(a,b)  23(a,b)  23(a,b)  24(a,b)
rcm             14   15(a)   19(a)    22(a)    21(a,c)  22(a,c)
colamd (sym)    12   14      16       17       17       19
(a) appu out of memory
(b) memplus out of memory
(c) mchln85ks17 out of memory

Table 3.13: The number of matrices that achieve convergence with different levels of ILU(k) and different sparsity orderings.

Turning to ILUTP Push, the value-based ILU heuristic we test, table 3.14 shows all the orderings perform similarly for lower amounts of fill, though both rcm and colamd outperform the natural ordering for lfil nnz equal to 3, 4, or 5. Again, the results show what happens when colamd is applied symmetrically. Although applying colamd only to the columns is not as catastrophic for this value-based ILU algorithm, since pivtol can be set
high enough to eliminate zero diagonals, the results are still worse than those with colamd applied symmetrically.

lfil nnz        0    1    2    3    4    5
natural         11   25   30   33   35   39
rcm             11   24   30   36   42   42
colamd (sym)    11   24   31   38   40   40

Table 3.14: The number of matrices that achieve convergence with ILUTP Push, different lfil nnz values, and different orderings. The matrices are equilibrated, then symmetrically ordered for sparsity. A test case is considered to have converged if it converges for at least one combination of droptol and pivtol values.

Notice that the only entries in tables 3.13 and 3.14 that can be compared directly are ILU(k) with k = 0 and ILUTP Push with lfil nnz = 1. In both of those cases the bound on the space used for the preconditioner is nnz(A), and for ILU(0) this bound is tight. Other entries in tables 3.13 and 3.14 cannot be compared easily because, in general, the number of nonzeros in an ILU(k) factorization is difficult to predict. For example, with colamd ordering applied symmetrically and k = 1, nnz(ˆL + Û) varies from 1.1 nnz(A) for the add32 matrix to 6.9 nnz(A) for the memplus matrix.

As the columns for k = 0 in table 3.13 and for lfil nnz = 1 in table 3.14 show, ILUTP Push generally does much better. However, this comparison is not truly fair since ILUTP Push can pivot and drop small elements. Table 3.15 shows that even if ILUTP Push is restricted to lfil nnz = 1, droptol = 0.0, and pivtol = 0.0, ILUTP Push still outperforms ILU(0), though the difference is lessened. In short, the freedom to choose which nnz(A) nonzeros to keep, instead of keeping exactly the nonzeros in the locations that are nonzero in A, gains only a little. The rest of the gain comes from the other two parameters. For example, the last two columns in table 3.15 show that with pivoting, far more cases converge. Our results show that fewer matrices converge with droptol > 0 than with droptol = 0, so the difference between the values in table 3.15 and those in the column of table 3.14 corresponding to lfil nnz = 1 comes from the fact that some matrices converge only for certain pivtol values.

We now take a closer look at the effects of the pivtol parameter on ILUTP Push with higher fill levels. Looking at table 3.16, which separates out the results from table 3.14 by pivtol value, we see that while some pivoting is clearly necessary for optimal performance,
full partial pivoting (i.e., pivtol = 1.0) is not desirable.

                ILU(0)   ILUTP Push (lfil nnz, droptol, pivtol)
                         (1, 0.0, 0.0)   (1, 0.0, 0.1)   (1, 0.0, 1.0)
natural         14       17              22              20
rcm             14       14              20              21
colamd (sym)    12       14              21              18

Table 3.15: Comparison of the number of matrices that achieve convergence with ILU(0) as opposed to with ILUTP Push, with parameter setting lfil nnz = 1, droptol = 0.0, and assorted pivtol. In all cases an upper bound on nnz(ˆL + Û) is nnz(A).

We suspect this is because of the structure in the original matrix, which is partially preserved by leaving the diagonal elements on the diagonal. Hence, while pivoting is useful when used to correct rows where the diagonal element is significantly smaller than off-diagonal entries, the loss of structure is not worthwhile when the diagonal element is only slightly smaller than an off-diagonal entry. This is perhaps related to why, even with full partial pivoting, applying colamd symmetrically is more successful than applying it to only the rows. In other words, there is some structure in the original matrix, and so leaving the diagonal alone is generally the best strategy.

Comparing the numbers in table 3.16 and table 3.14, we see that while pivtol = 0.1 has the best performance, there are still matrices which converge only with other values of pivtol. In short, while 0.1 may be an adequate default value, it will not work on all systems. Notice that the largest entry of each triple in table 3.16 is not the same as the corresponding entry in table 3.14 because there are systems which converge only with certain settings of pivtol; in other words, a larger pivtol value is not always better.

lfil nnz        0         1          2          3          4          5
natural         8/11/11   17/23/23   20/28/28   22/31/29   22/35/31   25/39/33
rcm             8/11/11   14/21/23   19/27/26   20/32/31   22/36/33   23/38/35
colamd (sym)    8/11/11   14/23/21   18/30/26   21/34/33   22/36/33   22/39/34

Table 3.16: The number of matrices that achieve convergence with ILUTP Push, different lfil nnz values, and different orderings. Unlike table 3.14, the results are separated by pivtol value, so each entry contains three numbers showing the number of matrices which converged with pivtol = 0.0, 0.1, and 1.0, respectively. A test case is considered to have converged if it converges for at least one value of droptol.
In contrast to pivtol, where a non-extremal value worked best, a droptol value of 0.0 is almost always better than the larger values we tried, and was never worse by more than one case.

The effects of ordering for stability

We now add ordering for stability to the list of preprocessing steps applied prior to computing the ILU factorization. We again describe previous work on the effects of ordering for stability, then summarize the results of our studies.

Previous Work

Since ordering a matrix for stability is a more recent idea than ordering a matrix for sparsity, the effect of stability orderings on ILU-preconditioned iterative solvers is not as well studied as the effect of sparsity orderings. In general, previous studies of stability orderings, such as those described in section 3.3, find that they are useful but cannot fully substitute for pivoting.

In [34] we combined the variant of MC64 that maximizes the product of the diagonal elements (and scales them to have absolute value 1.0) with ILUTP-preconditioned GMRES(50). We found that while maximizing the diagonal improved the robustness of the preconditioned iterative solver, so that more cases converged, some form of partial pivoting was still necessary for a subset of the matrices. The more complete study by Benzi et al. in [17] tests several algorithms for maximizing the diagonal of a matrix as a first step in an ILU-preconditioned BiCGSTAB solver. For the ILU algorithm they try value-based ILU with and without pivoting, as well as level-based ILU with levels 0 and 1. They find that maximizing the product of the diagonal elements improves the robustness of value-based ILU preconditioners, but also find that pivoting does not improve the result. Pivoting is again found to be necessary in [81], which primarily tests a value-based ILU-preconditioned TFQMR solver (though they include a few tests of ILU-preconditioned QMR and IC-preconditioned CG). However, their tests use only a basic stability ordering which simply puts nonzeros on the diagonal without consideration of value. Furthermore, they compare only full partial pivoting and no pivoting, that is, a pivtol of 1.0 or 0.0 but nothing in between.

Regarding TPABLO, a heuristic for permuting a matrix so that its diagonal consists of dense blocks with large entries, in [16] they find that it can improve the convergence rate of assorted non-pivoting ILU-preconditioned Krylov solvers when used for nonsymmetric
problems from discretizing PDEs on a 2D domain, further noting that it may improve the stability of incomplete factorizations. PABLO, which does not take into consideration the values of the entries being grouped into diagonal blocks, can also be effective at improving convergence [142].

Our studies

The experiments described here use the same setup as those described in section 3.4.4, except that we permute the matrices for stability prior to permuting them for sparsity. The stability ordering is computed using MC64 with job 5, which maximizes the product of the diagonal elements. This option also provides scaling factors which, if applied, make the absolute value of the maximized diagonal elements equal to 1.0 and the off-diagonal elements no larger. However, in this section we look only at the effects of the ordering, postponing the discussion of the effects of using the MC64 scaling until the next section. If the colamd or rcm sparsity orderings are used, we apply them symmetrically, since any other application would alter the entries on the diagonal. We refer to the case where no sparsity ordering is used as the natural ordering. As in section 3.4.4, the first two tables in this section present the number of matrices that converge for various levels of fill for level-based (table 3.17) and value-based (table 3.18) ILU-preconditioned GMRES(50).

  level       0    1    2       3       4       5
  natural    19   23   30      30      31      35
  rcm        20   26   30(a)   33(a)   36(a)   37(a)
  colamd     17   26   29      31      33      38

Table 3.17: The number of matrices for which GMRES(50), with MC64 and different sparsity orderings and different levels of ILU(k) as a preconditioner, converges.

  lfil nnz    0    1    2    3    4    5
  natural    14   24   32   34   35   38
  rcm        14   27   37   39   42   45
  colamd     14   27   34   41   47   48

Table 3.18: The number of matrices for which GMRES(50), with MC64 and different sparsity orderings and ILUTP Push with different levels of fill as a preconditioner, converges. A test case is considered to have converged if it converges for at least one combination of droptol and pivtol values.
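The preprocessing used here, a stability ordering followed by a sparsity ordering applied symmetrically, can be approximated with standard tools as in the sketch below. Two substitutions are assumptions on our part: scipy's maximum_bipartite_matching only produces a zero-free diagonal, a simplified stand-in for MC64 job 5 (which additionally maximizes the product of the diagonal entries), and reverse Cuthill-McKee stands in for the symmetrically applied rcm or colamd orderings; A is assumed to be a structurally nonsingular scipy.sparse matrix.

    import scipy.sparse as sp
    from scipy.sparse.csgraph import maximum_bipartite_matching, reverse_cuthill_mckee

    def stability_then_sparsity(A):
        """Permute for a nonzero diagonal, then symmetrically for sparsity (a sketch)."""
        A = sp.csr_matrix(A)
        # Step 1: row permutation placing a nonzero in every diagonal position
        # (a perfect matching exists because A is assumed structurally nonsingular).
        row_of_col = maximum_bipartite_matching(A, perm_type='row')
        A = A[row_of_col, :]
        # Step 2: a symmetric sparsity ordering computed on the pattern of A + A^T,
        # applied to both rows and columns so the stabilized diagonal stays in place.
        pattern = (abs(A) + abs(A).T).tocsr()
        p = reverse_cuthill_mckee(pattern, symmetric_mode=True)
        return A[p, :][:, p]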
Comparing the level-based ILU(k) results in table 3.17 to those obtained without the stability ordering in table 3.13, we notice that the stability ordering is a clear improvement: for any level k and any sparsity ordering, anywhere from 5 to 19 more matrices converge with the MC64 ordering. Furthermore, with the MC64 ordering the number of cases which converge increases as the level increases; the stagnation at levels higher than 3 is no longer present.

Turning to the value-based ILUTP Push, we see that it still outperforms ILU(k). Furthermore, the results with MC64 are better than those without for the rcm and colamd orderings, but merely comparable for the natural ordering. The latter is again attributable to the structure in the original matrix, which we disturb by reordering with MC64. This is also reflected in the fact that the natural ordering is now noticeably worse than the rcm and colamd orderings. The improvement with MC64 is smaller for ILUTP Push than for ILU(k) because the ILU(k) results are all without pivoting, so MC64 captures some of the difference that pivoting can make.

In table 3.19 we again compare the ILU(k) and ILUTP Push results when the amount of fill in the factors is bounded by nnz(A). As opposed to table 3.15, which reported the same comparison but without permuting for stability, we now see that pivoting does not gain much, and that ILUTP Push(1, 0.0, 0.0) is now an improvement over ILU(0) for all three orderings, with the largest difference for colamd.

                          ILUTP Push
                ILU(0)   (1, 0.0, 0.0)   (1, 0.0, 0.1)   (1, 0.0, 1.0)
  natural         19          21              21              21
  rcm             20          22              24              22
  colamd (sym)    17          22              23              20

Table 3.19: Comparison of the number of matrices that achieve convergence with ILU(0) as opposed to with ILUTP Push, with parameter setting lfil nnz = 1, droptol = 0.0, and assorted pivtol. MC64 and assorted sparsity orderings are used. In all cases an upper bound on nnz(L̂ + Û) is nnz(A).

Since ordering for stability has been suggested as a substitute for partial pivoting, and table 3.19 shows it to be quite effective for small levels of fill, in table 3.20 we break down the results from table 3.18 by pivtol value to see the effect of MC64 on higher-fill ILUTP Push preconditioners. We see from table 3.20 that full partial pivoting (i.e., pivtol = 1.0) is not desirable;
a moderate value of pivtol = 0.1 is advantageous for the natural and rcm orderings, though no pivoting at all seems better for the colamd ordering. However, the data again indicates that there is no one setting of all three parameters that works well for all matrices. For example, table 3.18 shows 48 matrices converge for lfil nnz = 5 and some combination of droptol and pivtol values, whereas table 3.20 shows that for lfil nnz = 5 and any specific pivtol value at most 43 cases converge. Notice that the largest entry of each triple in table 3.20 is not the same as the corresponding entry in table 3.18 because there are systems which converge only with certain settings of pivtol; in other words, a larger pivtol value is not always better.

  lfil nnz            0          1          2          3          4          5
  natural        14/14/14   21/23/23   26/29/28   31/32/29   32/34/29   36/36/32
  rcm            14/14/14   22/26/22   32/33/29   35/35/33   38/36/34   38/40/37
  colamd (sym)   14/14/14   24/25/22   32/32/28   38/35/32   41/39/39   42/42/40

Table 3.20: The number of matrices that achieve convergence with ILUTP Push, different lfil nnz values, MC64, and different orderings. Unlike table 3.18, the results are separated by pivtol value, so each entry contains three numbers showing the number of matrices which converged with pivtol = 0.0, 0.1, and 1.0, respectively. A test case is considered to have converged if it converges for at least one value of droptol.

Other observations about ordering for stability are that the colamd ordering with no pivoting seems best for ILUTP Push, whereas the rcm ordering seems slightly better for ILU(k). A droptol value of 0.0 in ILUTP Push works best.

Scaling studies

As previously noted, MC64 with job option 5 not only permutes the matrix to maximize the product of its diagonal elements, but also provides row and column scaling factors which, when applied, set the magnitude of the diagonal elements to 1.0 and the off-diagonal elements to values no greater than 1.0. In the previous section we applied only the MC64 permutation; in this section we combine the permutation with the scaling. Because the effect of MC64 scaling is similar to equilibration in that it makes the norms of the rows and columns more similar (though in the infinity-norm instead of the 1-norm), the tests in this section do not equilibrate the matrix. Instead, we first permute and scale with MC64, and then apply a sparsity ordering before computing the ILU factorization. The tables here correspond to those in the preceding sections; the only difference is that equilibration is not used, and MC64 scaling is.
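To make the norm comparison concrete, the sketch below performs sweeps of infinity-norm equilibration, which drives the largest entry of every row and column toward 1. It only illustrates the scaling effect described above; it is not the MC64 algorithm, which derives its scale factors from the optimal matching and makes the matched diagonal entries exactly 1.

    import numpy as np
    import scipy.sparse as sp

    def inf_norm_equilibrate(A, sweeps=2):
        """Scale rows, then columns, by their largest magnitudes (a sketch)."""
        A = sp.csr_matrix(A, dtype=float, copy=True)
        for _ in range(sweeps):
            r = abs(A).max(axis=1).toarray().ravel()          # row infinity-norms
            A = sp.diags(1.0 / np.where(r > 0, r, 1.0)) @ A   # rows now have max 1
            c = abs(A).max(axis=0).toarray().ravel()          # column infinity-norms
            A = A @ sp.diags(1.0 / np.where(c > 0, c, 1.0))   # columns now have max 1
        return A.tocsr()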
Table 3.21 shows that for the level-based ILU algorithm, adding MC64 scaling does not significantly improve the performance with any ordering over the performance without scaling, as shown in table 3.17.

  level       0    1       2         3         4         5
  natural    21   24(a)   29(a,b)   31(a,b)   32(a,b)   33(a,b)
  rcm        20   27      28        32        35        37
  colamd     16   25      29        31        33        36

  (a) appu out of memory
  (b) memplus out of memory

Table 3.21: This table shows the number of matrices that achieve convergence with different levels of ILU(k) and different orderings. The matrices are permuted and scaled for stability, then permuted for sparsity.

Looking now at ILUTP Push and comparing the results in table 3.22 to those in table 3.18, we see that MC64 scaling significantly improves the results for almost every combination of the parameter values. In this respect our results agree with those in [17], where they find that adding the scaling to an ILUT-preconditioned solver increases the number of cases where the solver converges.

  lfil nnz    0    1    2    3    4    5
  natural    14   24   33   38   40   42
  rcm        14   34   44   47   50   51
  colamd     12   29   42   46   49   50

Table 3.22: This table shows the number of matrices that achieve convergence with ILUTP Push, different lfil nnz values, and different orderings. The matrices are permuted so their diagonal is maximized, scaled so the diagonal elements are 1, and finally symmetrically ordered for sparsity. A test case is considered to have converged if it converges for at least one combination of droptol and pivtol values.

We next look at how the value of the pivtol parameter affects the results. In the comparison of the level-based and value-based algorithms with the same upper bound on fill (table 3.23), we see that ILUTP Push still outperforms ILU(k), even without pivoting. Furthermore, using a pivtol value of 0.1 is still clearly better than either 0.0 or 1.0. Finally, breaking down the results of table 3.22 by pivtol value (table 3.24), we see that pivtol = 0.1 is still generally best for the natural and rcm orderings. However, for the
colamd ordering, sometimes full partial pivoting is better. Notice that using no pivoting at all seems undesirable. However, again note that the largest entry of each triple in table 3.24 is not the same as the corresponding entry in table 3.22 because there are systems which converge only with certain settings of pivtol; in other words, a larger pivtol value is not always better.

                          ILUTP Push
                ILU(0)   (1, 0.0, 0.0)   (1, 0.0, 0.1)   (1, 0.0, 1.0)
  natural         21          21              23              21
  rcm             20          26              27              27
  colamd (sym)    16          22              25              19

Table 3.23: This table compares the number of matrices that achieve convergence with ILU(0) as opposed to with ILUTP Push, with parameter setting lfil nnz = 1, droptol = 0.0, and assorted pivtol. In all cases an upper bound on nnz(L̂ + Û) is nnz(A).

  lfil nnz            0          1          2          3          4          5
  natural        14/14/14   21/24/24   28/31/29   32/38/32   34/39/36   37/41/38
  rcm            14/14/14   27/27/28   35/35/35   37/40/39   40/43/42   42/45/44
  colamd (sym)   12/12/12   25/27/22   35/37/36   38/38/42   41/42/44   41/45/44

Table 3.24: This table shows the number of matrices that achieve convergence with ILUTP Push, different lfil nnz values, and different orderings. Unlike table 3.22, the results are separated by pivtol value, so each entry contains three numbers showing the number of matrices which converged with pivtol = 0.0, 0.1, and 1.0, respectively. A test case is considered to have converged if it converges for at least one value of droptol.

3.4.5 Summary of experiments

In the preceding sections we presented the results of assorted experiments testing the effects of orderings and scaling on the convergence of ILU-preconditioned GMRES(50). One of our main observations, also made in [81], is that the data is difficult to analyze because it depends on so many variables. Nevertheless, some patterns can be noted, so in this section we summarize our observations on orderings, on parameter settings, and on level-based and value-based ILU heuristics.

First, on the whole, for the level-based ILU(k), the natural ordering was best if MC64 was not used. However, with MC64, rcm was the best ordering. For the value-based ILUTP Push, both the rcm and colamd orderings worked well when either only a sparsity
ordering was used, or when both the MC64 ordering and scaling were used. If the stability ordering without the scaling was used, colamd was the clear winner. We further observe that for both level-based and value-based methods the colamd ordering should always be applied symmetrically, and that ordering for stability using MC64 is a clear win, although scaling using the MC64 scale factors is an improvement only for ILUTP Push. For ILUTP Push, a droptol of 0.0 seems best, while a pivtol of 0.1 outperformed both 0.0 and 1.0. Our explanation for the latter is that the structure of the matrix before factoring has some desirable properties which should not be lightly altered.

Comparing level-based and value-based ILU heuristics, we find that value-based methods generally seem better. Level-based methods are appealing for reasons including the fact that the structure of the factors can be computed using a graph algorithm, and the fact that they are simpler to use in the sense that the user need specify only one parameter instead of the several often used by value-based methods. Unfortunately, the drawbacks of level-based methods include the fact that the correspondence between the level and the number of nonzeros in the factors is difficult to predict. This means, for example, that a user with a bound on the amount of memory he is willing to allocate for the preconditioner will still not have a good idea of what value to use for the level parameter. In addition, our tests find that when nnz(L̂ + Û) is bounded by nnz(A), ILUTP Push outperforms the level-based ILU(k), regardless of the ordering, the scaling, and the pivtol value used by ILUTP Push. Although adding dropping and pivoting to a level-based method may improve robustness, we are not convinced the added complexity is worthwhile, given that it eliminates many of the other advantages of level-based methods. Nevertheless, if we had to recommend a preconditioner to try first with GMRES(50), we would suggest using ILUTP Push with MC64 job setting 5 with scaling, colamd applied symmetrically, a drop tolerance of 0.0, a pivot tolerance of 0.1 or 1.0, and the largest lfil nnz you can afford.
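For readers who want to try a roughly analogous setup with off-the-shelf software, the sketch below uses scipy's SuperLU-based incomplete factorization together with restarted GMRES. It is only a stand-in, not ILUTP Push: drop_tol plays the role of droptol, fill_factor loosely bounds fill the way lfil nnz does, and diag_pivot_thresh is SuperLU's analogue of a pivot tolerance; there is no MC64 step because scipy does not provide one, and the right-hand side and parameter values in the example are illustrative.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def first_try_solve(A, b, fill_factor=5, drop_tol=0.0, pivot_thresh=0.1):
        """ILU-preconditioned GMRES(50), in the spirit of the recommendation above."""
        A = sp.csc_matrix(A)
        ilu = spla.spilu(A, drop_tol=drop_tol, fill_factor=fill_factor,
                         diag_pivot_thresh=pivot_thresh)
        M = spla.LinearOperator(A.shape, matvec=ilu.solve)  # M approximates A^-1
        x, info = spla.gmres(A, b, M=M, restart=50, maxiter=500)
        return x, info  # info == 0 means GMRES(50) reported convergence

    # Illustrative use on a small random system (not one of the test matrices):
    A = sp.random(200, 200, density=0.02, random_state=0) + 4.0 * sp.eye(200)
    b = np.ones(200)
    x, info = first_try_solve(A, b)
    print("converged" if info == 0 else "did not converge")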
3.5 Conclusion

In this chapter we looked at a variety of methods for preconditioning a sparse system of linear equations. We began by considering different algorithms for ordering a sparse matrix to decompose the problem, to reduce fill in the factors, and to increase the stability of the factorization. After reviewing the existing work on these orderings, we made a few observations on their effects on the 65 matrices in our test suite. For fill-reducing orderings, we described our experiences designing and implementing a shared-memory minimum degree ordering algorithm.

We then turned to preconditioners based on incomplete factorizations. We again summarized the existing work, then described a slight modification to the popular ILUTP algorithm, which we call the ILUTP Push algorithm. Given the same bound on memory, the ILUTP Push algorithm generally uses more of the available memory than the ILUTP algorithm when run with the same parameters, hence leading it to converge significantly more often. This makes ILUTP Push attractive when a user knows how much memory he can use for the preconditioner and is willing to use as much of it as necessary to make convergence of the iterative solver more likely.

Taking our ILUTP Push algorithm, we combine it with the orderings described earlier and investigate how different orderings affect the number of systems which converge. We obtain the best results by combining the MC64 ordering, the MC64 scaling, and the rcm ordering. With those orderings, 34 of our 65 test cases converge if nnz(L̂ + Û) ≤ nnz(A), and 51 converge if nnz(L̂ + Û) ≤ 5 nnz(A), where a test case is said to converge if it converges for at least one combination of the droptol and pivtol parameter values. We conduct the same experiments using a level-based ILU algorithm and find that combining the rcm ordering and the MC64 ordering, but not using the MC64 scaling, is best. However, when we compare the results of the level-based and value-based heuristics, we note that, in general, value-based methods seem superior.

Preconditioners span such a wide variety of techniques, and even the family of ILU preconditioners is so broad, that, not surprisingly, many open questions remain. We mention here a few unresolved, or untouched, issues which are closely related to the contents of this document. A few of the open questions regarding orderings are:

- Can we resolve whether the minimum-fill nonsymmetric ordering problem is NP-complete?
- Can we design a single ordering algorithm which combines the desired effects of MC64 and colamd, in the sense that it both increases stability and reduces fill in the factors?

And a few of the open questions about ILU preconditioners are:
- We only did direct comparisons of level-based and value-based ILU algorithms for the cases where nnz(L̂) + nnz(Û) ≤ nnz(A). How do the two algorithms compare at higher levels of fill?
- There are advantages to both level-based and value-based ILU algorithms. Is there some way to combine the two and take advantage of each one's strengths?
- What happens to the comparisons of different orderings and different algorithms when you consider running times in addition to the basic question of whether or not the system converges?
- Can one design a set of recommendations regarding what preconditioners to use for users with specific systems they want to solve?
Chapter 4

Conclusion

In this dissertation we looked at an assortment of preconditioning techniques for two significant problems in linear algebra: finding eigenvalues of large sparse matrices and solving large sparse systems of linear equations.

For finding eigenvalues our main results are in new techniques for balancing, though we also make an observation about decomposing the matrix into a set of smaller problems. We summarize the history of balancing, covering both theoretical results about uniqueness and practical results about the convergence of specific algorithms. We then describe our own results, defining balancing in a weighted norm and showing that it minimizes the 2-norm of nonnegative, irreducible matrices. The theory of weighted balancing is used to justify a novel set of Krylov-based balancing algorithms which can be used to approximately balance a matrix without accessing explicit matrix elements, instead using only matrix-vector and matrix-transpose-vector multiplications. These algorithms are shown to improve the accuracy to which eigenvalues of matrices from practical applications are computed, by both dense and sparse eigensolvers. Open questions remain: for example, the convergence rate of iterative balancing in the infinity-norm, and the theoretical justification for the effects of the cutoff parameter used by one variant of our Krylov-balancing algorithms.

For solving large sparse systems of linear equations we first remark on the effects of different algorithms for ordering A prior to computing its LU factorization. We note that using a nonsymmetric decomposition instead of a symmetric one can significantly reduce the sizes of the diagonal blocks. We further note that MC64, an ordering algorithm designed to stabilize the matrix, can improve the colamd sparsity ordering: in other words, the fill with MC64 followed by colamd can be less than the fill with colamd alone. We also look
at algorithms for computing incomplete LU factorizations. We suggest a modification to a popular value-based ILU scheme and show that it uses memory more effectively, hence improving the chances that preconditioned GMRES(50) will converge. Regarding the effects of ordering algorithms on the performance of ILU-preconditioned GMRES(50), we find that both MC64 and partial pivoting are necessary for the best performance.

There are many remaining open questions, with the experimental and the theoretical intertwined: more thorough analysis of more experiments may suggest things that can be proven about the behavior of preconditioners; better theory may suggest improved heuristics. For balancing, some open questions include the convergence rate of iterative balancing in the infinity-norm and the theory justifying the use of the cutoff parameter in Krylov-Cutoff. For sparse linear systems, questions include whether it is possible to design a single ordering algorithm that combines the desired behavior of MC64 and colamd (i.e., improves both stability and sparsity), and whether we can improve our recommendations regarding which ILU preconditioners to use by exploiting more information about individual problems.
114 Bibliography [1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, Reading, MA, 1974. [2] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17(4):886 905, October 1996. [3] P. R. Amestoy, X. S. Li, and E. G. Ng. Symmetric minimum priority orderings for sparse unsymmetric factorization. Talk given at Seventh SIAM Conference on Applied Linear Algebra, October 2000. [4] B. S. Andersen, F. Gustavson, A. Karaivanov, J. Wasniewski, and P. Y. Yalamov. LAWRA linear algebra with recursive algorithms. In Proceedings of the Conference on Parallel Processing and Applied Mathematics, September 1999. Also UNIC technical report #UNIC-99-01. [5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK users guide Third Edition. SIAM, 1999. [6] C. Ashcraft. Compressed graphs and the minimum degree algorithm. SIAM J. Sci. Comput., 16(6):1404 1411, November 1995. [7] C. Ashcraft and J. W. H. Liu. Robust ordering of sparse matrices using multisection. SIAM J. Matrix Anal. Appl., 19(3):816 832, July 1998. [8] C. Ashcraft, D. Pierce, D. K. Wah, and J. Wu. The reference manual for SPOOLES, release 2.2: an object oriented software library for solving sparse linear systems of equations, January 1999. http://www.netlib.org/linalg/spooles/spooles.2.2.html.
115 [9] O. Axelsson and N. Munksgaard. Analysis of incomplete factorizations with fixed storage allocation. In D. Evans, editor, Preconditioning Methods Theory and Applications, pages 219 241. Gordon and Breach, 1983. [10] Z. Bai, D. Day, J. Demmel, and J. Dongarra. A test matrix collection for non- Hermitian eigenvalue problems. Technical Report CS-97-355, University of Tennessee, March 1997. [11] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: a Practical Guide. SIAM, 2000. [12] S. Balay, W. Gropp, L. C. McInnes, and B. Smith. PETSc users manual. Argonne National Laboratory. http://www-fp.mcs.anl.gov/petsc/. [13] R. E. Bank and C. Wagner. Multilevel ILU decomposition. Numer. Math., 82(4):543 576, 1999. [14] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics, 1994. Code for algorithms at http://www.netlib.org/templates/. [15] R. Barrett, R. Boisvert, J. J. Dongarra, R. Lipman, B. Miller, R. Pozo, and K. Remington. Matrix Market. http://math.nist.gov/matrixmarket. [16] M. Benzi, H. Choi, and D. B. Szyld. Threshold ordering for preconditioning nonsymmetric problems. In Proceedings of the Workshop on Scientific Computing, pages 159 165, March 1997. [17] M. Benzi, J. C. Haws, and M. Tuma. Preconditioning highly indefinite and nonsymmetric matrices. Technical Report LA-UR-99-4857, Los Alamos National Laboratory, 1999. [18] M. Benzi, W. D. Joubert, and G. Mateescu. Numerical experiments with parallel orderings for ILU preconditioners. Electronic Transactions on Numerical Analysis, pages 88 114, 1999. Was LANL Tech Report LA-UR-98-4316.
116 [19] M. Benzi, D. B. Szyld, and A. van Duin. Orderings for incomplete factorization preconditioning of nonsymmetric problems. SIAM J. Sci. Comput., 20(5):1652 1670, 1999. [20] M. Bern, J. R. Gilbert, B. Hendrickson, N. Nguyen, and S. Toledo. Support-graph preconditioners. Submitted to SIAM J. Matrix Anal. Appl. [21] R. Betancourt. Efficient parallel processing technique for inverting matrices with random sparsity. IEEE Proceedings E (Computers and digital techniques), 133(4):235 240, July 1986. [22] R. Betancourt. An efficient heuristic ordering algorithm for partial matrix refactorization. IEEE Trans. on Power Systems, 3(3):1181 1187, August 1988. [23] M. Bollhöfer and Y. Saad. ILUs and factorized approximate inverses are strongly related. Part I: overview of results. Technical Report UMSI 2000-39, University of Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, 2000. [24] M. Bollhöfer and Y. Saad. ILUs and factorized approximate inverses are strongly related. Part II: applications to stabilization. Technical Report UMSI 2000-70, University of Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, 2000. [25] E. Boman, D. Chen, B. Hendrickson, and S. Toledo. Maximum-weight-basis preconditioners. Submitted to J. Numerical Linear Algebra. [26] E. Boman and B. Hendrickson. Support theory for preconditioning. Submitted to SIAM J. Matrix Anal. and Appl. [27] E. F. F. Botta and F. W. Wubs. Matrix renumbering ILU: an effective algebraic multilevel ILU-preconditioner for sparse matrices. SIAM J. Matrix Anal. Appl., 20:1007 1026, 1999. [28] S. Boyd, L. El Fhaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Number 15 in SIAM studies in applied mathematics. SIAM, Philadelphia, PA, 1994. [29] W. L. Briggs. A Multigrid Tutorial. SIAM, Philadelphia, PA, 1987.
117 [30] T. F. Chan and H. A. van der Vorst. Approximate and incomplete factorizations. In D. E. Keyes, A. Samed, and V. Venkatakrishnan, editors, Parallel numerical algorithms, volume 4 of ICASE/LaRC Interdisciplinary Series in Science and Engineering, pages 167 202. Kluwer Academic, Dordecht, 1997. [31] A. Chapman, Y. Saad, and L. Wigton. High-order ILU preconditioners for CFD problems. Int. J. Numer. Meth. Fluids, 33:767 788, 2000. [32] D. Chen. Analysis, implementation, and evaluation of Vaidya s preconditioners. Master s thesis, Tel-Aviv University, February 2001. [33] T.-Y. Chen. Balancing sparse matrices for computing eigenvalues. Master s thesis, University of California at Berkeley, May 1998. [34] T.-Y. Chen. Heuristics for serial and parallel incomplete LU preconditioners, May 1999. University of California at Berkeley qualifying exam report. [35] T.-Y. Chen and J. Demmel. Balancing sparse matrices for computing eigenvalues. Lin. Alg. Appl., 309:261 287, April 2000. [36] T.-Y. Chen and J. Demmel. Balancing sparse matrices for computing eigenvalues. In Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, editors, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, pages 152 157. SIAM, 2000. [37] T.-Y. Chen, J. Gilbert, and S. Toledo. Toward an efficient column minimum degree code for symmetric multiprocessors. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, March 1999. [38] E. Chow. Personal communication, 1999. [39] E. Chow. A scalable parallel computation of sparse approximate inverse. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, 1999. [40] E. Chow and M. A. Heroux. BPKIT: Block preconditioning toolkit, September 1996. http://www.cs.umn.edu/ chow/bpkit.html.
118 [41] E. Chow and Y. Saad. Experimental study of ILU preconditioners for indefinite matrices. J. Comp. and Appl. Math., 86:387 414, 1997. [42] E. Chow and Y. Saad. Parallel approximate inverse preconditioners. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997. [43] B. N. Chun and D. E. Culler. REXEC: A decentralized, secure, remote execution environment for clusters. In 4th Workshop on Communication, Architecture, and Applications for Network-based Parallel Computing, Toulouse, France, January 2000. [44] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990. [45] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 24th national conference of the ACM, pages 157 172, 1969. [46] R. D. da Cunha and T. Hopkins. PIM 2.2: The parallel iterative methods package for systems of linear equations. http://www.mat.ufrgs.br/ rudnei/pim/pim-i.html. [47] T. Davis. University of Florida sparse matrix collection. NA Digest, v.92, n.42, Oct. 16, 1994 and NA Digest, v.96, n.28, Jul. 23, 1996, and NA Digest, v.97, n.23, Jun. 7, 1997. available at http://www.cise.ufl.edu/ davis/sparse/. [48] T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. A column approximate minimum degree ordering algorithm. Technical Report TR-00-005, Department of Computer and Information Science and Engineering, University of Florida, October 2000. [49] E. F. D Azevedo, P. A. Forsyth, and W.-P. Tang. Ordering methods for preconditioned conjugate gradient methods applied to unstructured grid problems. SIAM J. Matrix Anal. Appl., 13(3):944 961, July 1992. [50] E. F. D Azevedo, P. A. Forsyth, and W.-P. Tang. Towards a cost-effective ILU preconditioner with high level fill. BIT, 32:442 463, 1992. [51] M. H. DeGroot. Probability and Statistics. Addison-Wesley Publishing Company, Reading, MA, 1975. [52] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadephia, PA, 1997.
119 [53] J. W. Demmel, J. R. Gilbert, and X. S. Li. SuperLU user s guide, September 1999. http://www.nersc.gov/ xiaoye/superlu/. [54] S. Doi and T. Washio. Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations. Parallel Computing, 25:1995 2014, 1999. [55] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users guide for the Harwell-Boeing sparse matrix collection, (release i) edition, October 1992. http://math.nist.gov/matrixmarket/collections/hb.html. [56] I. S. Duff and J. Koster. The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl., 20(4):889 901, 1999. [57] I. S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl., 22(4):973 996, 2001. [58] I. S. Duff and G. A. Meurant. The effect of ordering on preconditioned conjugate gradients. BIT, 29:635 657, 1989. [59] I. S. Duff and J. K. Reid. An implementation of Tarjan s algorithm for the block triangularization of a matrix. ACM Trans. on Math. Softw., 4(2):137 147, June 1978. [60] I. S. Duff and H. A. van der Vorst. Preconditioning and parallel preconditioning. Technical Report TR/PA/98/23, CERFACS, 1998. Also RAL-TR-1998-052 from Rutherford Appleton Laboratory. [61] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canad. J. Math., 10:517 534, 1958. [62] B. C. Eaves, A. J. Hoffman, U. G. Rothblum, and H. Schneider. Line-sum-symmetric scalings of square nonnegative matrices. In Mathematical Programming Study 25, pages 124 141, 1985. [63] V. Eijkhout. Overview of iterative linear system solver packages. Lapack working note 141, July 1998.
120 [64] V. Eijkhout and T. Chan. ParPre: A Parallel Preconditioners Package reference manual for version 2.0.21, revision 1, April 1998. http://www.cs.utk.edu/ eijkhout/parpre.html. [65] Robust, efficient linear solvers. http://www.elegant-math.com/brochure.htm. [66] H. C. Elman and E. Agrón. Ordering techniques for the preconditioned conjugate gradient method on parallel computers. Computer Physics Communications, 53:253 269, May 1989. [67] E. Elmroth and F. G. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development, 44(4):605 624, July 2000. [68] J. D. Finney and B. S. Heck. Matrix scaling for large-scale system decomposition. Automatica, 1996. [69] R. W. Freund. A transpose-free quasi-minimal residual algorithm for non-hermitian linear systems. SIAM J. Sci. Comput., 15(2):470 482, March 1993. [70] M. R. Garey and D. S. Johnson. Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman, New York, 1991. [71] A. George. Nested dissection of a regular finite element mesh. SIAM J. Num. Anal., 10(2):345 363, April 1973. [72] A. George and J. W. H. Liu. A fast implementation of the minimum degree algorithm using quotient graphs. ACM Transactions on Mathematical Software, 6(3):337 358, September 1980. [73] A. George and J. W. H. Liu. The evolution of the minimum degree ordering algorithm. SIAM Review, 31(1):1 19, March 1989. [74] A. George and E. Ng. An implementation of Gaussian Elimination with partial pivoting for sparse systems. SIAM J. Sci. Stat. Comput., 6(2):390 409, 1985. [75] J. A. George. Computer implementation of the finite element method. Technical Report STAN-CS-71-208, Stanford University, 1971.
121 [76] J. A. George and D. R. McIntyre. On the application of the minimum degree algorithm to finite element systems. SIAM J. Num. Anal., 15:90 111, 1978. [77] N. E. Gibbs, W. G. Poole-Jr., and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM J. Num. Anal., 13(2):236 250, April 1976. [78] J. R. Gilbert, X. S. Li, E. G. Ng, and B. W. Peyton. Computing row and column counts for sparse QR and LU factorization. Technical Report LBNL-47372, Lawrence Berkeley National Laboratory, January 2001. Submitted to BIT. [79] J. R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in Matlab: design and implementation. SIAM J. Matrix Anal. Appl., 13(1):333 356, January 1992. [80] J. R. Gilbert and E. G. Ng. Predicting structure in nonsymmetric sparse matrix factorizations. In A. George, J. R. Gilbert, and J. W. H. Liu, editors, Graph theory and sparse matrix computation, volume 56 of The IMA volumes in mathematics and its applications, pages 107 140. Springer-Verlag, 1993. [81] J. R. Gilbert and S. Toledo. An assessment of incomplete-lu preconditioners for nonsymmetric linear systems. Technical report, Xerox Palo Alto Research Center, 3333 Coyote Hill Rd., Palo Alto, CA 94304, 1997. [82] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. [83] A. Gomez and L. G. Franquelo. An efficient ordering algorithm to improve sparse vector methods. IEEE Trans. on Power Systems, 3(4):1538 1544, November 1988. [84] N. I. M. Gould and J. A. Scott. On approximate-inverse preconditioners. Technical Report RAL-TR-95-026, Rutherford Appleton Laboratory, 1995. [85] J. Grad. Matrix balancing. The Computer Journal, 14(3):280 284, 1971. [86] A. Greenbaum. Iterative methods for solving linear systems, volume 17 of Frontiers in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
122 [87] K. D. Gremban. Combinatorial preconditioners for sparse, symmetric, diagonally dominant linear systems. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, October 1996. Technical report CMU-CS-96-123. [88] K. D. Gremban, G. L. Miller, and M. Zagha. Performance evaluation of a new parallel preconditioner. In Proceedings of 9th International Parallel Processing Symposium, pages 65 69. IEEE, 1995. [89] R. Grimes, D. Kincaid, and D. Young. ITPACK 2.0 user s guide, 1979. http://www.ma.utexas.edu/cna/itpack/. [90] L. Grosz. Preconditioning by incomplete block elimination. Numer. Linear Algebra Appl., 7:527 541, 2000. [91] M. Grote and T. Huckle. Effective parallel preconditioning with sparse approximate inverses. In Proceedings of the 7th SIAM conference on parallel processing for scientific computing, pages 466 471, San Francisco, CA, February 1995. [92] S. Guattery. Graph embedding techniques for bounding condition numbers of incomplete factor preconditioners. Technical report, ICASE, NASA Langley Research Center, 1997. [93] I. Gustafsson. A class of first order factorization methods. BIT, 18:142 156, 1978. [94] F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM Journal of Research and Development, 41(6):737 755, 1997. [95] W. Hackbusch. Iterative solution of large sparse systems of equations, volume 95 of Applied Mathematical Sciences. Springer-Verlag, New York, 1994. [96] D. J. Hartfiel. Concerning diagonal similarity of irreducible matrices. Proceedings of the American Mathematical Society, 30(3):419 425, November 1971. [97] B. Hendrickson and E. Rothberg. Effective sparse matrix ordering: just around the BEND. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
123 [98] B. Hendrickson and E. Rothberg. Improving the run time and quality of nested dissection ordering. SIAM J. Sci. Comput., 20(2):468 489, 1998. [99] G. Henry. ASCI Red Pentium II BLAS 1.2F. http://www.cs.utk.edu/ ghenry/distrib/archive.htm#blas, 2000. [100] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards, 49:409 435, 1952. [101] J. E. Hopcroft and R. M. Karp. An n 5/2 algorithm for maximum matchings in bipartite graphs. SIAM J. Comput., 2(4):225 231, December 1973. [102] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [103] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991. [104] Harwell Subroutine Library. http://www.cse.clrc.ac.uk/activity/hsl. [105] D. Hysom. Personal communication, May 2001. [106] D. Hysom and A. Pothen. Efficient parallel computation of ILU(k) preconditioners. Technical Report 2000-210120, NASA/CR, 2000. Also ICASE report number 2000-23. [107] E.-J. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California at Berkeley, May 2000. [108] E.-J. Im and K. Yelick. Optimizing sparse matrix vector multiplication on SMP. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, 1999. [109] E.-J. Im and K. Yelick. Optimizing sparse matrix-vector multiplication for register reuse. In International Conference on Computational Science, May 2001. [110] A. Jennings and G. M. Malik. Partial elimination. J. Inst. Math. Appl., 20:307 316, 1977. [111] M. T. Jones and P. E. Plassmann. BlockSolve95 users manual: scalable library software for the parallel solution of sparse linear systems. Technical Report ANL-95/48, Argonne National Laboratory, Argonne, IL, December 1995. ftp://info.mcs.anl.gov/pub/blocksolve95.
124 [112] M. T. Jones and P. E. Plassmann. An improved incomplete Cholesky factorization. ACM Trans. on Math. Softw., 21(1):5 17, March 1995. [113] B. Kalantari, L. Khachiyan, and A. Shokoufandeh. On the complexity of matrix balancing. SIAM J. Matrix Anal. Appl., 18(2):450 463, April 1997. [114] G. Karypis and V. Kumar. Parallel threshold-based ILU factorization. Technical Report #96-061, University of Minnesota, 1996. [115] G. Karypis and V. Kumar. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, Version 4.0. University of Minnesota, Department of Computer Science / Army HPC Research Center, September 1998. [116] G. Karypis, K. Schloegel, and V. Kumar. ParMETIS parallel graph partitioning and sparse matrix ordering library, Version 2.0. University of Minnesota, Department of Computer Science / Army HPC Research Center, September 1998. [117] D. S. Kershaw. The incomplete Cholesky-conjugate gradient method for the interactive solution of systems of linear equations. J. Comp. Phys., 26(1):43 65, 1978. [118] D. S. Kershaw. On the problem of unstable pivots in the incomplete LU-conjugate gradient method. J. Comp. Phys., 38(1):114 123, 1980. [119] R. B. Lehoucq and J. A. Scott. An evaluation of software for computing eigenvalues of sparse nonsymmetric matrices. Technical Report MCS-P547-1195, Argonne National Laboratory, 1996. [120] J. G. Lewis. Designing preconditioners with optimization tools. Talk at Preconditioning 2001, April 2001. [121] J. G. Lewis and Y.-J. J. Wu. Cruising (approximately) at 41,000 feet Iterative solvers at Boeing. Slides from SIAMLA 2000 talk, October 2000. [122] X. S. Li and J. W. Demmel. Making sparse Gaussian Elimination scalable by static pivoting. In Proceedings of Supercomputing 98, 1998.
125 [123] X. S. Li, J. W. Demmel, and J. Gilbert. An asynchronous parallel supernodal algorithm for sparse gaussian elimination. SIAM J. Matrix Anal. Appl., 20(4):915 952, 1999. [124] C.-J. Lin and J. J. Moré. Incomplete Cholesky factorizations with limited memory. Technical Report MCS-P682-0897, Argonne National Laboratory, August 1997. [125] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM J. Num. Anal., 16(2):346 358, April 1979. [126] J. W. H. Liu. Modification of the minimum-degree algorithm by multiple elimination. ACM Transactions on Mathematical Software, 11(2):141 153, June 1985. [127] W.-H. Liu and A. H. Sherman. Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices. SIAM J. Num. Anal., 13(2):198 213, April 1976. [128] M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput., 15(4):1036 1053, November 1986. [129] S. Ma and Y. Saad. Distributed ILU(0) and SOR preconditioners for unstructured sparse linear systems. Technical Report 94-027, Army High Performance Computing Research Center, University of Minnesota, Minneapolis, MN, 1994. [130] T. A. Manteuffel. An incomplete factorization technique for positive definite linear systems. Mathematics of Computation, 34:473 497, April 1980. [131] G. Manzini. On the ordering of sparse linear systems. Theoretical Computer Science, 156:301 313, 1996. [132] R. Marejka. A Barrier for Threads. SunOpsis, 4(1), January March 1995. [133] MathWorks. Matlab on-line documentation. http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.shtml. [134] J. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31:148 162, January 1977.
126 [135] J. Meijerink and H. A. van der Vorst. Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they occur in practical problems. J. Comp. Phys., 44(1):134 155, 1981. [136] UC Berkeley Millennium Project. http://www.millennium.berkeley.edu/. [137] N. Munksgaard. Solving sparse symmetric sets of linear equations by preconditioned conjugate gradients. ACM Trans. on Math. Softw., 6:206 219, 1980. [138] E. G. Ng. Personal communication, 1999. [139] E. G. Ng, B. W. Peyton, and P. Raghavan. A blocked incomplete Cholesky preconditioner for hierarchical-memory computers. In D. R. Kincaid et al., editor, Iterative Methods in Scientific Computation II, pages 1 11. IMACS, 1999. [140] Numerical Objects. Diffpack Kernel and Toolboxes Documentation Version 3.5.00. http://nobjects.com/diffpack/. [141] M. Olschowka and A. Neumaier. A new pivoting strategy for Gaussian elimination. Lin. Alg. Appl., 240:131 151, 1996. [142] J. O Neil and D. B. Szyld. A block ordering method for sparse matrices. SIAM J. Sci. Stat. Comput., 11:811 823, 1990. [143] E. E. Osborne. On pre-conditioning of matrices. J. of the ACM, 7:338 345, 1960. [144] Tz. Ostrosky, A. Sameh, and V. Sarin. A parallel sparse linear system preconditioner: the balance scheme. In Proceedings of the Fourth International Conference on Information Systems, Analysis and Synthesis. Orlando, FL, pages 367 372, July 1998. [145] G. Pagallo and C. Maulino. A bipartite quotient graph model for unsymmetric matrices. In Numerical Methods, volume 1005 of Lecture Notes in Mathematics. Springer- Verlag, 1983. [146] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity, chapter 11, pages 247 255. Dover Publications, inc., 1998. [147] B. Parlett and C. Reinsch. Balancing a matrix for calculation of eigenvalues and eigenvectors. Numer. Math., 13:293 304, 1969.
127 [148] S. Pissanetzky. Sparse Matrix Technology. Academic Press Inc, London, 1984. [149] E. L. Poole and J. M. Ortega. Multicolor ICCG methods for vector computers. SIAM J. Num. Anal., 24(6):1394 1418, December 1987. [150] A. Pothen. Sparse null bases and marriage theorems. PhD thesis, Cornell University, Department of Mathematics, 1985. Also TR85-676. [151] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Trans. on Math. Softw., 16(4):303 324, December 1990. [152] J. H. Reif. Efficient approximate solution of sparse linear systems. Computers & Mathematics with Applications, 36(9):37 58, November 1998. [153] E. Rothberg. Ordering sparse matrices using approximate minimum local fill, April 1996. Silicon Graphics manuscript. [154] E. Rothberg and S. C. Eisenstat. Node selection strategies for bottom-up sparse matrix ordering. SIAM J. Matrix Anal. Appl., 19(3):682 695, 1998. [155] Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. http://www.cs.umn.edu/research/arpa/sparskit/sparskit.html. [156] Y. Saad. ILUM: A parallel multi-elimination ILU preconditioner for general sparse matrices. Technical Report 92/241, University of Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, 1992. [157] Y. Saad. ILUT: A dual threshold incomplete ILU factorization. Numer. Linear Algebra Appl., 4:387 402, 1994. [158] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS publishing company, 1996. [159] Y. Saad, G.-C. Lo, and S. Kuznetsov. PSPARSLIB users manual: A portable library of parallel sparse iterative solvers, 1998. http://www.cs.umn.edu/research/arpa/p sparslib/psp-abs.html. [160] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856 869, July 1986.
128 [161] Y. Saad and B. Suchomel. ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Technical Report UMSI 99/107, University of Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, June 1999. [162] Y. Saad and H. A. van der Vorst. Iterative solution of linear systems in the 20th century. Technical Report UMSI 99/152, University of Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN, September 1999. [163] Y. Saad and J. Zhang. BILUM: block versions of multi-elimination and multi-level ILU preconditioner for general sparse linear systems. SIAM J. Sci. Comput., 20(6):2103 2121, 1999. [164] Y. Saad and J. Zhang. BILUTM: A domain-based multi-level block ILUT preconditioner for general sparse matrices. SIAM J. Matrix Anal. Appl., 21(1):279 299, 1999. [165] Y. Saad and J. Zhang. Diagonal threshold techniques in robust multi-level ILU preconditioners for general sparse linear systems. Numerical linear algebra with applications, 6(4):257 280, 1999. [166] H. Schneider and M. H. Schneider. Max-balancing weighted directed graphs and matrix scaling. Mathematics of Operations Research, 16(1):208 222, February 1991. [167] M. H. Schneider and S. A. Zenios. A comparative study of algorithms for matrix balancing. Operations Research, 38(3):439 455, May-June 1990. [168] T. Skalicky. LASPack reference manual, 1996. http://www.tudresden.de/mwism/skalicky/laspack/laspack.html. [169] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler. Matrix Eigensystem Routines, EISPACK Guide, volume 6 of Lecture Notes in Computer Science. Springer Verlag, New York, 1976. [170] D. C. Sorensen. Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In Matlab 5.0 distribution, October 1995. [171] M. Suarjana and K. H. Law. A robust incomplete factorization based on value and space constraints. Int. J. Numer. Meth. Engng., 38:1703 1719, 1995.
129 [172] T. Ström. Minimization of norms and logarithmic norms by diagonal similarities. Computing, 10:1 7, 1972. [173] R. Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146 160, June 1972. [174] W. F. Tinney and J. W. Walker. Direct solution of sparse network equations by optimally ordered triangular factorization. Proc. IEEE, 55:1801 1809, November 1967. [175] S. Toledo. Improving the memory system performance of sparse matrix vector multiplication. IBM J. Research and Development, 41(6):711 725, November 1997. [176] R. S. Tuminaro, M. Heroux, S. A. Hutchinson, and J. N. Shadid. Official Aztec User s Guide: Version 2.1. Sandia National Laboratories, December 1999. http://www.cs.sandia.gov/crf/aztec1.html. [177] W. T. Tutte. Review of coverings of bipartite graphs by dulmage and mendelsohn. In MathSciNet Mathematical Reviews on the Web at http://www.ams.org/mathscinet. [178] P. Vaidya. Invited talk. In Workshop on graph theory and sparse matrix computation, University of Minnesota, October 1991. Institute for Mathematics and its Applications. [179] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189 201, 1979. [180] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1962. [181] J. W. Watts-III. A conjugate gradient truncated direct method for the iterative solution of the reservoir simulation pressure equation. Society of Petroleum Engineer Journal, 21:345 353, 1981. [182] J. B. White-III and P. Sadayappan. On improving the performance of sparse matrixvector multiplication. In Proceedings of the Fourth International Conference on High- Performance Computing, 1997. [183] J. H. Wilkinson. Error analysis of direct methods of matrix inversion. J. of the ACM, 8(3):281 330, 1961.
130 [184] J. Zhang. Sparse approximate inverse and multilevel block ILU preconditioning techniques for general sparse matrices. Appl. Num. Math., 35(1):67 86, September 2000. [185] J. Zhang. A class of multilevel recursive incomplete LU preconditioning techniques. Korean J. Comput. Appl. Math., 8(2):213 234, 2001. [186] Z. Zlatev. Use of iterative refinement in the solution of sparse linear systems. SIAM J. Num. Anal., 19(2):381 399, April 1982.
Appendix A

Test matrices for chapter 2

The following table summarizes the 14 matrices in our test suite for chapter 2. They are all from the collection described in [10], which also contains more extensive descriptions of all the matrices.
  name         n      density   application area
  tols2000     2000   1.3e-3    stability analysis of airplane
  t240         240    3.9e-2    stability analysis of airplane
  ecsiemensa   177    2.9e-2    circuit simulation
  ecsiemensb   177    2.6e-2    circuit simulation
  qh1484       1484   2.8e-3    power systems simulation
  qh882        882    4.3e-3    power systems simulation
  mhd4800a     4800   4.4e-3    plasma physics
  qh768        768    5.0e-3    power systems simulation
  mvmpde       900    5.4e-3    CFD
  mhd3200a     3200   6.6e-3    plasma physics
  mhd1280a     1280   2.9e-2    plasma physics
  mhd416a      416    5.0e-2    plasma physics
  qc2534       2534   7.2e-2    quantum chemistry
  qc324        324    2.5e-1    quantum chemistry
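The n, nnz, and density columns in these appendices can be recomputed directly from the Matrix Market files with a few lines of scipy; the file name below is a placeholder, and for files stored in symmetric format mmread expands both triangles, so the reported nnz may differ from the collection's entry count.

    import scipy.io
    import scipy.sparse as sp

    def matrix_summary(path):
        """Return (n, nnz, density) for a square matrix in Matrix Market format."""
        A = sp.csr_matrix(scipy.io.mmread(path))  # mmread returns a coo_matrix
        n = A.shape[0]
        return n, A.nnz, A.nnz / float(n * n)

    # Example with a placeholder file name; for qc324 the table above lists
    # n = 324 and density 2.5e-1.
    # print(matrix_summary("qc324.mtx"))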
Appendix B

Test matrices for chapter 3

The following table summarizes the matrices in our test suite for the experiments conducted in chapter 3. Most are available in sparse matrix collections such as [15] and [47]. The circuit 1, circuit 2, circuit 3, and circuit 4 matrices were downloaded from the website of Wim Bomhof at http://www.math.ruu.nl/people/bomhof/.
  name          n        nnz       density   application area
  NASASRB       54870    2677324   8.9e-4    structural engineering
  add32         4960     23884     9.7e-4    circuit simulation
  af23560       23560    484256    8.7e-4    fluid flow / CFD
  appu          14000    1853104   9.4e-3    random sparse matrix
  av41092       41092    1683902   1.0e-3    PDE
  bbmat         38744    1771722   5.2e-3    fluid flow / CFD
  bramley1      17933    1021849   3.2e-3    fluid flow / CFD
  bramley2      17933    1021849   3.2e-3    fluid flow / CFD
  circuit 1     2624     35823     5.2e-3    circuit simulation
  circuit 2     4510     21199     1.0e-3    circuit simulation
  circuit 3     12127    48137     3.3e-4    circuit simulation
  circuit 4     80209    307604    4.8e-5    circuit simulation
  cry10000      10000    49699     5.0e-4    crystal growth
  dw8192        8192     41746     6.2e-4    dielectric waveguide
  ecl 32        51993    380415    1.4e-4    device simulation
  ex11          16614    1096948   4.0e-3    fluid flow / CFD
  extr1         2837     11407     1.4e-3    chemical engineering
  fs 541 2      541      4285      1.5e-2    ODE
  garon2        13535    390607    2.1e-3    fluid flow / CFD
  gemat11       4929     33185     1.4e-3    power flow modelling
  goodwin       7320     324784    6.1e-3    fluid mechanics
  gre 1107      1107     5664      4.6e-3    circuit simulation
  hydr1         5308     23752     8.4e-4    chemical engineering
  inaccura      16146    1015156   3.9e-3    structural engineering
  jpwh 991      991      6027      6.1e-3    circuit physics
  lhr01         1477     18592     8.5e-3    chemical engineering
  lhr04         4101     82682     4.9e-3    chemical engineering
  lhr71         70304    1528092   3.1e-4    chemical engineering
  lns 3937      3937     25407     1.6e-3    fluid flow / CFD
  lnsp3937      3937     25407     1.6e-3    fluid flow / CFD
  mahindas      1258     7682      4.8e-3    economics
  mcfe          765      24382     4.2e-2    astrophysics
  mchln85ks17   84180    7179192   1.0e-3    tire design
  memplus       17758    126150    4.0e-4    circuit simulation
  mhd4800a      4800     102252    4.4e-3    plasma physics
  olm5000       5000     19996     8.0e-4    hydrodynamics
  onetone1      36057    341088    2.6e-4    circuit simulation
  onetone2      36057    227628    1.8e-4    circuit simulation
  orani678      2529     90158     1.4e-2    economics
  orsreg 1      2205     14133     2.9e-3    oil reservoir modelling
  pores 2       1224     9613      6.4e-3    oil reservoir modelling
  radfr1        1048     13299     1.2e-2    chemical engineering
  raefsky3      21200    1488768   3.3e-3    fluid flow / CFD
  raefsky4      19779    1328611   3.4e-3    structural engineering
  rdist1        4134     94408     5.5e-3    chemical engineering
  rdist2        3198     56934     5.6e-3    chemical engineering
  rdist3a       2398     61896     1.1e-2    chemical engineering
  rma10         46835    2374001   1.1e-3    fluid flow / CFD
  rw5151        5151     20199     7.6e-4    Markov chain modelling
  saylr4        3564     22316     1.8e-3    oil reservoir modelling
  sherman3      5005     20033     8.0e-4    oil reservoir modelling
  sherman4      1104     3786      3.1e-3    oil reservoir modelling
  sherman5      3312     20793     1.9e-3    oil reservoir modelling
  shyy161       76480    329762    5.6e-5    fluid flow / CFD
  shyy41        4720     20042     9.0e-4    fluid flow / CFD
  tols4000      4000     8784      5.5e-4    aeroelasticity
  twotone       120750   1224224   8.4e-5    circuit simulation
  utm5940       5940     83842     2.4e-3    plasma physics
  vavasis1      4408     95752     4.9e-3    PDE
  vavasis2      11924    306842    2.2e-3    PDE
  vavasis3      41092    1683902   1.0e-3    PDE
  venkat01      62424    1717792   4.4e-4    fluid flow / CFD
  wang3         26064    177168    2.6e-4    device simulation
  wang4         26068    177196    2.6e-4    device simulation
  west2021      2021     7353      1.8e-3    chemical engineering