Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations

by

Tzu-Yi Chen

B.S. (Massachusetts Institute of Technology) 1995
B.S. (Massachusetts Institute of Technology) 1995
M.S. (University of California, Berkeley) 1998

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:
Professor James W. Demmel, Chair
Professor Gregory Fenves
Professor Jonathan Shewchuk

Fall 2001

The dissertation of Tzu-Yi Chen is approved:

    Chair                                           Date

                                                    Date

                                                    Date

University of California at Berkeley

Fall 2001

Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations

Copyright 2001
by
Tzu-Yi Chen

Abstract

Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations

by

Tzu-Yi Chen

Doctor of Philosophy in Computer Science

University of California at Berkeley

Professor James W. Demmel, Chair

Informally, given a problem to solve and a method for solving it, a preconditioner transforms the problem into one with more desirable properties for the solver. The solver may take less time to find the solution to the new problem, it may compute a more accurate solution, or both. The preconditioned system is solved and the solution is transformed back into the solution of the original problem. In this dissertation we look at the role of preconditioners in finding the eigenvalues of sparse matrices and in solving sparse systems of linear equations. A sparse matrix is one with so many zero entries that either only the nonzero elements and their locations in the matrix are stored, or the matrix is not given explicitly and one can only get the results of multiplying the matrix (and sometimes its transpose) by arbitrary vectors.

The eigenvalues of a matrix A are the λ such that Ax = λx, where x is referred to as the (right) eigenvector corresponding to λ. Numerical algorithms that compute the eigenvalues of a nonsymmetric matrix A typically have backward errors proportional to the norm of A, so it can be useful to precondition an n × n matrix A in such a way that its norm is reduced and its eigenvalues are preserved. We focus on balancing A, in other words finding a diagonal matrix D such that for 1 ≤ i ≤ n the norm of row i and column i of DAD^{-1} are the same. Interestingly, there are many relationships between balancing in certain vector norms and minimizing various matrix norms. For example, in [143] Osborne shows that balancing a matrix in the 2-norm also minimizes the Frobenius norm of DAD^{-1} over all D, up to scalar multiples.

We summarize results known about balancing in other norms before defining balancing in a weighted norm and proving that this minimizes the 2-norm for nonnegative, irreducible A. We use our results on balancing in a weighted norm to justify a set of novel Krylov-based balancing algorithms which approximate weighted balancing and which never explicitly access individual entries of A. By using only matrix-vector (Ax), and sometimes matrix-transpose-vector (A^T x), multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A^T x) is available. We then show that for matrices from our test suite, these Krylov-based balancing algorithms do, in fact, often improve the accuracy to which eigenvalues are computed by dense or sparse eigensolvers. For our test matrices, Krylov-based balancing improved the accuracy of eigenvalues computed by sparse eigensolvers by up to 10 decimal places. In addition, Krylov-based balancing can also improve the condition number of eigenvalues, hence giving better computed error bounds.

For solving sparse systems of linear equations the problem is to find a vector x such that Ax = b, where A is a square nonsingular matrix and b is some given vector. Algorithms for finding x can be classified as either direct or iterative: direct methods typically compute the LU factorization of A and solve for x through two triangular solves; iterative methods such as conjugate gradient iteratively improve on an initial guess to x. Though direct methods are considered robust, they can require large amounts of memory if the L and U factors have many more nonzero elements than the matrix A. On the other hand, though iterative methods require less space, they are also less robust than direct methods and their behavior is not as well understood. Fortunately, preconditioning can help with some of these issues. For example, preconditioners can be used to reduce the number of nonzero elements in the L and U factors of A, or to improve the likelihood of an iterative method converging quickly to the actual solution vector.

We begin by discussing preconditioners for direct solvers, starting with several algorithms for reordering the rows and columns of A prior to factoring it. We present data comparing the results of decomposing matrices with a nonsymmetric permutation to results from using a symmetric permutation. For one matrix the size of the largest block found with a nonsymmetric permutation is a tenth of the size of the largest block found with a symmetric permutation, which can greatly reduce the subsequent factorization time.

We also note that using a stability ordering in concert with a column approximate minimum degree ordering can lead to L and U factors with significantly more or fewer nonzero elements than those computed after using the sparsity ordering alone. Focusing on a specific algorithm for reordering A to reduce fill, we then describe our design and implementation of a threaded column approximate minimum degree algorithm. Though we worked hard to avoid the effects of many known parallel pitfalls, our final implementation never achieved a speedup of more than 3 on 8 processors of an SGI Power Challenge machine, and more typically there was virtually no speedup. By analyzing the performance of our code in detail, we provide a better understanding of the difficulties of efficiently implementing algorithms with fine-grained parallelism even in a shared memory environment.

Finally we turn to incomplete LU (ILU) factorizations, a family of preconditioners often used with iterative solvers. We propose a modification to a standard ILU scheme and show that it makes better use of the memory the user has available, leading to a greater likelihood of convergence for preconditioned GMRES(50), the iterative solver used in our studies. By looking at data gathered from tens of thousands of test runs combining matrices with different ILU algorithms, parameter settings, scaling algorithms, and ordering algorithms, we draw some conclusions about the effects of different ordering algorithms on the convergence of ILU-preconditioned GMRES(50). We find, for example, that both ordering for stability and partial pivoting are necessary for achieving the best convergence results.

Professor James W. Demmel
Dissertation Committee Chair

Contents

List of Figures
List of Tables

1 Introduction
    1.1 Sparse systems
        Storage of sparse matrices
        Sparse matrix algorithms
    1.2 Roles of preconditioning
    1.3 Notation and Definitions
        Matrix notation and definitions
        Graph representations of matrices
        Relationships between A, DG(A), and BG(A)
    1.4 Test Matrices
    1.5 Contributions

2 Preconditioning sparse matrices for computing eigenvalues
    2.1 Decomposing the matrix
        The Parlett-Reinsch Algorithm
        The Strongly Connected Components Algorithm
        Comparisons
    2.2 Balancing
        Theory
        Parlett-Reinsch balancing algorithm
        Krylov balancing algorithms
    2.3 Results
        Balancing and Dense Eigensolvers
        Balancing and Sparse Eigensolvers
    2.4 Conclusions

3 Preconditioning sparse linear systems of equations
    3.1 Decomposing the matrix
    3.2 Ordering for sparsity
        Background
        Approximate column minimum degree code for symmetric multiprocessors
    3.3 Ordering for stability
        History
        Observations
        Relationship to other orderings
    3.4 ILU preconditioners
        History of IC and ILU preconditioners
        Experimental setup
        The ILUTP Push algorithm
        Effects of orderings
        Summary of experiments
    3.5 Conclusion

4 Conclusion

Bibliography

A Test matrices for chapter 2
B Test matrices for chapter 3

List of Figures

1.1 Example of a matrix stored in column compressed format
    Example of Parlett-Reinsch decomposition
    Example of strongly connected components decomposition
    Pseudocode for the iterative balancing algorithm
    Pseudocode for KrylovAz
    Pseudocode for KrylovAz if A not given explicitly
    Pseudocode for KrylovAtz
    Accuracy of the eigenvalues of qh768 computed with and without direct balancing
    Accuracy of the eigenvalues of tols2000 computed with and without direct balancing
    Accuracy of the eigenvalues of qh768 computed with and without Krylov-based balancing
    Accuracy of the eigenvalues of tols2000 computed with and without Krylov-based balancing
    Relative accuracy of the largest and smallest eigenvalues of qh768 computed with Krylov-based balancing
    Relative accuracy of the largest and smallest eigenvalues of tols2000 computed with Krylov-based balancing
    Pseudocode for parallel approximate minimum degree algorithm
    Pseudocode for ILUTP
    Number of nonzeros in each row of the incomplete factors of shyy41 and vavasis
    Pseudocode for ILUTP Push
    Amount of fill in complete LU factors

List of Tables

2.1 Effects of different symmetric decomposition algorithms
    Summary of known results on matrix norm minimization via diagonal scaling
    Summary of known results on balancing matrices
    Effect of Krylov balancing algorithms on matrix norms
    Decompositions with scc vs. dmperm. Part I
    Decompositions with scc vs. dmperm. Part II
    Number of iterations taken by threaded column approximate minimum degree code with different parameter settings
    Breakdown of time taken by threaded column approximate minimum degree algorithm
    nnz(L + U) for different orderings. Part I
    nnz(L + U) for different orderings. Part II
    Summary of packages including IC or ILU algorithms
    Number of systems that converge with ILUTP and varied amounts of fill
    Number of systems that converge with ILUTP and space used by factors
    Number of systems that converge with ILUTP vs. ILUTP Push
    Number of matrices that converge with ILUTP vs. ILUTP Push with high fill and various pivtol
    Number of systems converging with ILUTP Push and space used by factors
    Number of systems that converge with different orderings for various levels of ILU(k)
    Number of systems that converge with different orderings and ILUTP Push with varied amounts of fill
    Number of systems that converge with ILU(k) and ILUTP Push with nnz(L̂ + Û) = nnz(A)
    Effects of pivtol on convergence of ILUTP Push with different sparsity orderings
    Number of systems that converge with ILU(k) with MC64 and different sparsity orderings
    Number of systems that converge with ILUTP Push and MC64, but with different sparsity orderings
3.19 Comparing ILU(k) and ILUTP Push with MC64 and fixed parameter values, but different sparsity orderings
    Effects of different pivtol values and sparsity orderings on ILUTP Push with MC64
    Number of systems converging with ILU(k), MC64 with scaling, and different sparsity orderings
    Number of systems converging with ILUTP Push, MC64 with scaling, and different sparsity orderings
    Difference between ILU(k) and ILUTP Push with MC64 and scaling, but different sparsity orderings
    Effects of pivtol on convergence of ILUTP Push with MC64 and scaling, but different sparsity orderings

Acknowledgements

For helping with research, I should first thank my advisor Jim Demmel, my committee members Jonathan Shewchuk and Greg Fenves, and my qualifying examination chair Kathy Yelick. I would also like to thank Sivan Toledo and John Gilbert for having me spend a summer at Xerox PARC, and Esmond Ng for having me spend two summers at NERSC. Other people I have had useful discussions with include Beresford Parlett (on balancing), David Hysom (on ILU preconditioners), Brent Chun and Fred Wong (on the innards of the Berkeley NOW and Millennium), and Henry Cohn (on a variety of math topics).

Of course, I also need to thank the agencies whose grants funded me. This research was supported in part by an NSF graduate fellowship and in part by LLNL Memorandum Agreement No. B under the Department of Energy under DOE Contract No. W ENG-48, the National Science Foundation under NSF Cooperative Agreement No. ACI, and a DOE subcontract to Argonne. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

Some of my richest experiences at Berkeley had nothing to do with research. Linh, Thanh Thao, Herman, Rahel and Leya, Peter, Tsedenia, Kedest, Salma, Saana, Fatima, and many others: thank you for giving me a clearer sense of myself and a more complete picture of our world. And, of course, my thanks to my family and Chris.

Chapter 1

Introduction

As processor speeds and storage capacities increase, people expect computers both to solve existing problems more quickly and to solve larger and more complex problems. In the field of linear algebra, the latter corresponds to solving systems with large, potentially ill-conditioned matrices. The storage requirements for large matrices can sometimes be reduced if they are sparse, i.e., if many of the matrix entries are zero. Furthermore, the time and memory needed to solve a large system can sometimes be reduced by preconditioning the system prior to solving it. Informally, to precondition a system prior to computing a solution is to transform it into one with more desirable properties. The solution to the altered system is computed, and transformed into the solution of the original problem. The advantages of solving the modified system can include more accurate results, decreased running time, reduced memory requirements, or some combination of these. What makes a preconditioner desirable can depend on both the problem and the solution method.

In this dissertation we look at preconditioners for two classes of linear algebra problems: eigenproblems and linear systems. In chapter 2 we discuss preconditioning for sparse eigenproblems, considering both how to permute a matrix to decompose it, and how to balance a matrix to improve the accuracy of its computed eigenvalues. In chapter 3 we turn to preconditioning for linear systems. We discuss heuristics for permuting the rows and columns to achieve goals such as decomposing the matrix, reducing the number of nonzeros in the factors, and stabilizing the matrix. We then look at incomplete LU factorizations, a class of preconditioners for iterative solvers.

Before turning to preconditioners for specific problems, we first give brief overviews of storage and algorithmic issues concerning sparse matrices, and of the roles of preconditioners for direct and iterative solvers. We then define some of the matrix notation and graph representations used throughout this report, and end with a summary of our contributions.

1.1 Sparse systems

As noted, we are interested primarily in sparse matrices, which can be thought of as n by m matrices with enough zero elements to make storing only the nonzero elements, and not all nm entries, worthwhile. Sparse algorithms which ignore the zero elements can sometimes be faster than their dense counterparts which operate on all entries. For example, consider an n by n diagonal matrix. Clearly storing only the n nonzero diagonal entries is cheaper than storing all n² matrix elements, and algorithms which ignore the zero off-diagonal elements can be far more efficient than those which do not. For example, when computing the product of the diagonal matrix and a vector, the standard dense algorithm computes n dot products of length n, whereas a sparse algorithm operating only on the nonzero elements requires just n scalar multiplications.

In practice, sparse matrices arise in many application areas. For example, when simulating the effects of applying heat to a plate or the flow of air around an airplane wing, the first step is often to model the plate or the wing by putting a mesh on it. This mesh can be seen as an undirected graph with v vertices and e edges. If we translate this graph into a matrix, a process better described in section 1.3.2, the matrix is a v × v matrix with 2e nonzeros. Since two vertices are connected by an edge only if they are near each other in the physical object, the number of edges is much smaller than v²/2. If we model a 2D square plate by putting an n × n mesh on it, the matrix will have v = n² rows and columns, and only 5n² - 4n, rather than n⁴, nonzero entries; a short check of this count appears below.

As the matrix-vector multiplication example shows, we need both data structures for storing sparse matrices and algorithms that take advantage of the sparse storage formats. Because sparse matrices are less structured than dense matrices, creating either of the two can be challenging.
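The nonzero count just quoted corresponds to the familiar 5-point stencil: each of the n² mesh points contributes one diagonal entry plus one entry per neighbor, and points on the boundary are missing some neighbors. The short sketch below (illustrative only, not taken from this dissertation) counts the entries directly and confirms the 5n² - 4n figure.

```python
# Count the nonzeros of the 5-point finite-difference matrix on an n x n mesh:
# one diagonal entry per mesh point plus one entry per existing neighbor.
def five_point_nnz(n):
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1                                     # diagonal entry for point (i, j)
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if 0 <= i + di < n and 0 <= j + dj < n:    # neighbor exists inside the mesh
                    count += 1
    return count

for n in (4, 10, 100):
    assert five_point_nnz(n) == 5 * n * n - 4 * n
```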

Storage of sparse matrices

Traditionally dense matrices have been stored as a two-dimensional array in either column-major or row-major order, though more recent work suggests performance advantages to using recursive layouts [4, 67, 94]. For sparse matrices, on the other hand, significantly more storage methods are used. The range of possibilities comes from the fact that matrices from different applications have different nonzero structures (e.g., that the diagonal of the matrix may be nonzero, or that the matrix has a narrow band), and that different matrix representations allow for efficient implementation of different operations.

Column-compressed format is a very popular sparse matrix representation that some large matrix repositories, including the Harwell-Boeing collection [55] and the University of Florida sparse matrix collection [47], use. As the small example in figure 1.1 shows, the column-compressed format stores a sparse matrix with real entries in three arrays: nzval, rowind, and colptr. The nzval array has nnz elements, where nnz is the number of nonzeros in the matrix. The elements in nzval are the values of the nonzero elements, stored by column, so that the elements in column 1 are listed first, then those in column 2, and so on. The integer array rowind also has nnz entries, and rowind[i] is the row index of the entry whose value is stored in nzval[i]. The integer array colptr has n + 1 entries, where n is the number of columns in the matrix, and colptr[i] is the location in nzval and rowind where the first element in column i can be found. Equivalently, colptr[i] is the total number of nonzeros in the columns preceding column i. The first entry of colptr has value 0, and the last entry has value nnz. Although in the example the elements in each column are sorted by increasing row index, this is not a requirement of the format.

Figure 1.1: This figure shows how a small sparse matrix is stored in the compressed column format (also known as the Harwell-Boeing format).
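The matrix shown in figure 1.1 is not reproduced here, so the sketch below builds the three column-compressed arrays for a hypothetical 3 × 3 matrix and uses them in a matrix-vector product that touches only the stored nonzeros; it illustrates the format described above and is not code from this dissertation.

```python
import numpy as np

# Hypothetical matrix [[1, 0, 4], [0, 3, 0], [2, 0, 5]] with nnz = 5, stored by column.
n      = 3
nzval  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # nonzero values, column by column
rowind = np.array([0, 2, 1, 0, 2])            # row index of each stored value
colptr = np.array([0, 2, 3, 5])               # start of each column; first entry 0, last entry nnz

def csc_matvec(n, colptr, rowind, nzval, x):
    """y = A x using only the nonzeros stored in column-compressed format."""
    y = np.zeros(n)
    for j in range(n):
        for k in range(colptr[j], colptr[j + 1]):  # the nonzeros of column j
            y[rowind[k]] += nzval[k] * x[j]
    return y

print(csc_matvec(n, colptr, rowind, nzval, np.ones(3)))  # [5. 3. 7.]
```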

Row-compressed format, which we use in the work on preconditioning linear systems described in section 3.4, is the row-based analogue of column-compressed format: matrix entries are stored by row instead of by column. The row-compressed format uses rowptr and colind arrays in place of the colptr and rowind arrays. We describe less common storage formats as necessary throughout the report. Books such as [14] and [148] describe some of the many other sparse matrix storage representations people use. In principle, one could devise arbitrary hybrids of these to accommodate particular applications, as done in [107, 109, 175].

Sparse matrix algorithms

Just as we have an understanding of good storage methods for dense matrices, we also know how to exploit the memory hierarchy to write efficient dense linear algebra code. The goal is to limit the amount of data movement between levels of the memory hierarchy. The trick is to block the matrix, which divides it into smaller non-overlapping submatrices, and then to operate on the individual blocks. The operations on these smaller blocks should all fit into the lowest level (the one with the most storage) of the memory hierarchy. Since there are typically several levels in the memory hierarchy, the submatrices themselves may again be divided into smaller subblocks which are also operated on one at a time. Typically the memory needed to store the largest blocks is on the order of the size of the first level cache, and the number of elements in the smallest blocks is on the order of the number of floating point registers. Blocking can be very effective for dense matrix computations because they typically access matrix and vector elements in regular patterns.

Unfortunately, sparse matrices are not as structured as dense matrices and in general cannot be easily blocked into small dense subblocks. Memory references tend to be irregular, which makes exploiting temporal or spatial locality difficult. Of course, if a user knows his or her application generates sparse matrices with small dense subblocks, performance can be improved by using algorithms that can exploit this feature. Other work looks at padding sparse matrices by storing some zero elements in order to create dense blocks [107, 108, 109, 175]. Nevertheless, overall, achieving high performance on sparse matrix computations remains a complex open problem.

Although algorithms operating on matrices stored in sparse matrix representations can be difficult to code efficiently, some algorithms may be easier to implement on sparse matrices. For example, graph algorithms translate nicely to sparse matrices stored in row compressed format, which is essentially the same as the standard adjacency graph representation of a matrix.
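To make the last point concrete, the sketch below (a hypothetical 4-node example, not taken from this dissertation) runs a breadth-first search directly on the rowptr and colind arrays of a row-compressed matrix: the column indices stored for row i are exactly the out-neighbors of node i in the directed graph of the matrix.

```python
from collections import deque

# Row-compressed structure of a 4 x 4 matrix whose directed graph has edges
# 0->1, 0->2, 1->3, 2->3, 3->0 (values omitted; only the structure matters here).
rowptr = [0, 2, 3, 4, 5]
colind = [1, 2, 3, 3, 0]

def bfs(rowptr, colind, start):
    """Breadth-first search over the adjacency structure stored in the CSR arrays."""
    n = len(rowptr) - 1
    seen = [False] * n
    seen[start] = True
    order, queue = [], deque([start])
    while queue:
        i = queue.popleft()
        order.append(i)
        for k in range(rowptr[i], rowptr[i + 1]):  # nonzeros of row i = out-edges of node i
            j = colind[k]
            if not seen[j]:
                seen[j] = True
                queue.append(j)
    return order

print(bfs(rowptr, colind, 0))  # [0, 1, 2, 3]
```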

1.2 Roles of preconditioning

As noted, preconditioning a system alters it so that the changed system is somehow better. The answer to the improved system is computed, and from it the answer to the original system is derived. What makes the preconditioned system better depends largely on how the algorithm then used to solve the preconditioned system works.

For example, consider direct versus iterative methods, a categorization we use to describe algorithms throughout this report. Informally, a direct method is an algorithm that is usually run for a fixed number of steps, at the end of which it almost always returns an answer that is sufficiently close to the exact solution that it is often considered exact, modulo roundoff error. An iterative method, on the other hand, begins with an initial guess to the solution and iteratively tries to improve it. The algorithm stops either when the approximation is deemed sufficiently close to the exact solution, or when some large number of iterations has been run and the user suspects the algorithm has stagnated and so a good approximate solution may never be reached. We note that some methods (for example, the conjugate gradient solver for linear systems [100]) span direct and iterative methods in the sense that they compute the exact answer in n steps in exact arithmetic, but in practice are used as iterative methods either because they often compute a reasonable solution in far fewer than n steps or because the iterations are expensive and n is large.

The motive for preconditioning differs for iterative and direct methods. Since a direct method gives the answer after running for a fixed number of steps, a useful preconditioner might turn the system into one where each step takes less time, or one for which the solver computes a more accurate answer. For an iterative method, on the other hand, an effective preconditioner might create a system for which the iterative solver converges when it did not for the original problem, a system where the number of iterations needed for convergence is reduced, or one where each iteration can be computed more efficiently. However, accuracy of the solution and the speed with which that solution is computed remain paramount. Because of the latter, preconditioners are judged not only by how much they improve the performance of the solver, but also by other measures such as the cost of computing and applying that preconditioner.
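The "transform, solve, and transform back" pattern described at the start of this section can be illustrated with the simplest preconditioner of all, a diagonal scaling. The sketch below uses a hypothetical, badly row-scaled 2 × 2 system and is only an illustration of the idea, not a method studied in this dissertation.

```python
import numpy as np

# A system whose rows have wildly different scales.
A = np.array([[1.0e4, 1.0e4],
              [1.0,   2.0]])
b = A @ np.array([1.0, 1.0])

# Precondition with M = diag(A): solve (M^{-1} A) x = M^{-1} b instead of A x = b.
M_inv = np.diag(1.0 / np.diag(A))            # cheap to form and cheap to apply
x = np.linalg.solve(M_inv @ A, M_inv @ b)    # solve the transformed system

assert np.allclose(A @ x, b)                 # x also solves the original system
print(np.linalg.cond(A), np.linalg.cond(M_inv @ A))  # about 2e4 versus about 6
```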

Since the preconditioner must be computed prior to solving the modified system, it should be relatively inexpensive to compute. Note that if many similar systems are to be solved and the same preconditioner is used for all of them, the cost of computing the preconditioner can potentially be amortized. After solving the preconditioned system the effects of the preconditioner must be undone to recover the solution to the original problem; this step should also be inexpensive. Furthermore, if the preconditioner will be applied in every iteration of the algorithm, as with the incomplete LU preconditioner discussed in section 3.4, the application of the preconditioner should be inexpensive. The tradeoff between the time needed to compute and apply the preconditioner, and the time saved and accuracy gained with a high quality preconditioner, is an issue throughout this report.

1.3 Notation and Definitions

The discussions in this report move between matrix and graph terminology. To smooth the transitions between the two, in this section we summarize our matrix notation, define the graphs associated with a matrix, and discuss relationships between matrix and graph terminology.

Matrix notation and definitions

We generally use capital letters for representing matrices, lower case letters for vectors, and Greek letters for constants. A few letters are reserved for special matrices: A, which refers to the n × n, possibly nonsymmetric, matrix being preconditioned; B, which is A after preconditioning; P and Q, which are permutation matrices; and D, which is a diagonal matrix. The vector e is the vector whose entries are all 1. The number of nonzero elements in a matrix M is denoted by nnz(M).

In the context of solving a system of linear equations, we use L and U to denote the complete LU factors of A, so A = LU if no pivoting is used. With row pivoting, we have PA = LU; with column pivoting we have AP = LU. We use L̂ and Û to denote incomplete factors of A, so L̂Û ≈ A. The number of nonzeros in the incomplete factorization is nnz(L̂ + Û), and we will typically denote the number of nonzeros in a matrix A by nnz(A), though the (A) may be omitted if the context clearly specifies the matrix.
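As a quick, purely illustrative check of the pivoting conventions above (the matrix is hypothetical and scipy is assumed to be available), the sketch below computes a row-pivoted factorization and verifies that it satisfies PA = LU with L lower triangular and U upper triangular.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 3.0]])

p, L, U = lu(A)        # scipy returns A = p @ L @ U, so P = p.T gives P A = L U
P = p.T
assert np.allclose(P @ A, L @ U)
assert np.allclose(L, np.tril(L)) and np.allclose(U, np.triu(U))
```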

We frequently use Matlab notation when referring to elements in vectors and matrices. For example, we use colons to indicate a sequence of indices, so A(i, :) is row i of A, and A(3:5, :) is the submatrix consisting of the third, fourth, and fifth rows of A. For more information on Matlab notation, see [133]. If a permutation P is applied symmetrically, it maps A to PAP^T. If A is a nonnegative matrix, it has no negative entries; this may also be written as A ≥ 0. If A is real and symmetric, A = A^T. If A is complex and Hermitian, A = A^H. If A is structurally symmetric, A and A^T have nonzeros in the same locations, though the values may differ. Finally, |A| is shorthand for the matrix whose entries are |A(i, j)|.

We use norms to measure the size of vectors and matrices, where the norm of x is written as ‖x‖. If x is a length n vector, some of the vector norms we use are defined as follows:

    1-norm:        ‖x‖_1 ≡ |x_1| + |x_2| + ... + |x_n|
    2-norm:        ‖x‖_2 ≡ (|x_1|² + |x_2|² + ... + |x_n|²)^{1/2}
    ∞-norm:        ‖x‖_∞ ≡ max_i |x_i|

If A is an n × n matrix, some of the matrix norms we use are defined as follows:

    1-norm:           ‖A‖_1 ≡ max_j Σ_i |A(i, j)|
    2-norm:           ‖A‖_2 ≡ max{λ^{1/2} : λ is an eigenvalue of A^H A}
    ∞-norm:           ‖A‖_∞ ≡ max_i Σ_j |A(i, j)|
    Frobenius norm:   ‖A‖_F ≡ (Σ_{i,j} |A(i, j)|²)^{1/2}

For more on norms, look in linear algebra books such as [52, 82, 102].

The condition number of a matrix A with respect to a particular problem is a measure of how sensitive the solution to that problem is to perturbations in A. The condition number of A with respect to matrix inversion is defined as κ(A) = ‖A‖ ‖A^{-1}‖ if A is nonsingular and κ(A) = ∞ if A is singular. Again, for more information consult books on linear algebra such as [52, 82, 102]. Other less frequently used terms will be defined when they are first used.
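For readers who prefer to see the definitions in action, the short check below (a hypothetical 2 × 2 matrix, using numpy's built-in norms) verifies each formula above as well as the definition of κ(A).

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

assert np.isclose(np.linalg.norm(A, 1),      np.abs(A).sum(axis=0).max())   # max column sum
assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())   # max row sum
assert np.isclose(np.linalg.norm(A, 'fro'),  np.sqrt((np.abs(A) ** 2).sum()))
assert np.isclose(np.linalg.norm(A, 2) ** 2, np.linalg.eigvalsh(A.T @ A).max())

# kappa(A) = ||A|| * ||A^{-1}|| (here in the 2-norm).
assert np.isclose(np.linalg.cond(A, 2),
                  np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2))
```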

Graph representations of matrices

We now describe two ways of representing matrices by a graph, both of which are often referred to in the literature: the directed graph and the bipartite graph representations. Given a matrix A we refer to the first graph as DG(A), and the latter as BG(A). Note the latter is sometimes called the dependency graph of A (e.g., in [131]).

When discussing graphs we use common graph terminology. For example, let a graph G be defined by a set of vertices V and a set E of directed edges from one vertex to another. Then a path from vertex s to t is a list of vertices (s, v_1, v_2, ..., v_k, t) such that there are directed edges in E from s to v_1, from v_i to v_{i+1} for all 1 ≤ i < k, and from v_k to t. We also refer later on to the subgraph induced by a set of vertices, and mention graph algorithms such as depth first search. The terminology and algorithms can be found in standard algorithm textbooks such as [1, 44].

Directed graph representation

An n by n unsymmetric matrix A can be represented by a directed graph DG(A) = (V, E), where |V| = n and the directed edge (i, j) ∈ E if and only if A(i, j) ≠ 0. If we need a weighted graph DG_w(A), the weight on edge (i, j) is the value of A(i, j). If A is an n by m matrix where n ≠ m, then |V| = max(n, m) and either some of the nodes will have no incoming edges or some will have no outgoing edges, depending on whether A has more rows or more columns. Thus DG(A) is the same as DG(Ā), where Ā is obtained by extending A with enough zero rows or columns to make it square.

Bipartite graph representation

Alternatively, an n by n unsymmetric matrix A can be represented by a bipartite graph BG(A) = (R, C, E). In this representation |R| = |C| = n, and the undirected edge (r_i, c_j) exists if and only if A(i, j) ≠ 0. If a weighted graph BG_w(A) is needed, the weight on the edge (r_i, c_j) is the value of A(i, j). If A is not square, the only change is that |R| ≠ |C|.
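As a small illustration of the two representations (the 3 × 3 matrix is hypothetical), the snippet below lists the edges of DG(A) and of BG(A) for the same nonzero pattern.

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0]])

# DG(A): directed edge (i, j) for every nonzero A(i, j); diagonal entries give self-edges.
dg_edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(A))]

# BG(A): undirected edge between row node r_i and column node c_j for the same entries.
bg_edges = [(("r", int(i)), ("c", int(j))) for i, j in zip(*np.nonzero(A))]

print(dg_edges)  # [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]
```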

Relationships between A, DG(A), and BG(A)

We now point out some obvious, and some perhaps not so obvious, relationships between a matrix and its associated graphs. For example, note that entries on the diagonal of A correspond to self-edges in DG(A).

Now consider the case where A is a structurally symmetric matrix. This means DG(A) has the attractive quality of being representable by an undirected graph, since an edge (i, j) in DG(A) implies the existence of the edge (j, i). Of course, if we need edge weights, A needs to be symmetric, not just structurally symmetric, for DG(A) to be representable by an undirected graph. In BG(A) = (R, C, E), symmetry in A is reflected in the fact that if edge (r_i, c_j) ∈ E, then (r_j, c_i) ∈ E.

Although the nodes in a graph are typically not thought of as ordered in any way, the rows and columns of a matrix are always ordered (i.e., we refer to the first row or third column of a matrix). If we think of the nodes in DG(A) as numbered so that the node corresponding to the first row and column of A is node 1, a reordering of the nodes corresponds to permuting the rows and columns of A symmetrically (i.e., PAP^T). This makes DG(A) useful for algorithms that permute the rows and columns of a matrix symmetrically, such as some of those described in section 2.1. Renumbering the nodes of DG(A) is no longer appropriate if different permutations are applied to the rows and columns of A (i.e., PAQ^T), so the bipartite representation BG(A), which allows the nodes of R and C to be numbered separately, is often used instead in this situation.

Finally, recall that a square matrix A is irreducible if there is no permutation matrix P such that

    PAP^T = [ X  Y ]
            [ 0  Z ]

where X and Z are square. If such a P does exist, A is reducible. If A is irreducible, DG(A) is strongly connected, which means there is a directed path from any vertex to any other (e.g., [52, lemma 6.6]).

1.4 Test Matrices

The algorithms we describe in this report are tested on assorted matrices from a variety of applications. Because the performance of an algorithm depends on the matrices it is tested on, in this section we describe how we chose our test matrices. First we chose matrices from a range of application areas. By not biasing our matrices towards any one domain, we hoped our algorithms would be similarly unbiased and would work well for more than very specific types of matrices. Of course, if a user knows his or her matrices all share some common structure, the best algorithms for them will likely take advantage of that structure. Furthermore, we chose matrices that spanned a range of sizes and densities.

The matrices used in chapter 2 are taken from a collection of non-Hermitian eigenvalue problems [10]. These matrices are all specifically from eigenproblems. As the table in appendix A shows, the matrices used are of modest size and come from a range of applications. The matrices used in chapter 3 come from a variety of collections, with the majority of them available from either the Matrix Market [15] or the University of Florida sparse matrix collection [47]. Previous work on direct and iterative solvers also analyzes results of experiments conducted on a variety of matrices, so to make comparisons between our results and those from previous work, we tried to ensure some overlap between our test suite and those from assorted other papers. Furthermore, since the size of sparse linear systems that computers can handle continually increases, we also tried to add new, larger matrices. The table in appendix B provides more details about the matrices chosen.

1.5 Contributions

The contributions of this work are as follows. In chapter 2 we consider the problem of computing the eigenvalues of a sparse matrix, that is, finding λ such that Ax = λx. We first notice that decomposing the matrix A into irreducible components can significantly reduce the expected time complexity of finding the eigenvalues of A. This scheme can be arbitrarily better than the conventional scheme used for dense matrices, described in [147] and used in packages such as Lapack [5] and Eispack [169]. We then define the concept of balancing in a weighted norm, show how to balance nonnegative, irreducible matrices in the weighted norm, and prove this balancing minimizes the 2-norm of such matrices. The idea of balancing in a weighted norm is used to justify a novel set of Krylov-based algorithms which balance a matrix without accessing its individual entries. By using only matrix-vector (Ax), and sometimes matrix-transpose-vector (A^T x), multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A^T x) is available. Finally we show that for matrices from our test suite, Krylov-based balancing algorithms can improve the accuracy to which eigenvalues are computed by as much as 10 decimal places for sparse eigensolvers. Furthermore, Krylov-based balancing can also improve the condition number of the eigenvalues, which improves computed error bounds.

In chapter 3 we turn to solving sparse linear systems, where the problem is to find x such that Ax = b. Direct solvers first factor the matrix A, so we begin by discussing various reasons for reordering the rows and columns of A prior to factoring it. We present data showing the advantages of decomposing the matrices in our test suite through a nonsymmetric permutation rather than the symmetric strongly connected components decomposition used when computing eigenvalues. For one matrix in our test suite the size of the largest block found with a nonsymmetric permutation is a tenth of the size of that found with a symmetric permutation, which can greatly reduce the subsequent factorization time. We also note that using a stability ordering in concert with a column approximate minimum degree ordering can lead to fill in the LU factors that differs significantly from that of using the sparsity ordering alone. On our test matrices the difference between the two could be up to a factor of 2. Focusing on one specific algorithm for reordering A, we next describe our design and implementation of a threaded column approximate minimum degree algorithm. Even after the extensive analysis and code modifications we describe, our final implementation never achieved a speedup of more than 3 on 8 processors of an SGI Power Challenge machine, and more typically there was virtually no speedup. This work, done jointly with Sivan Toledo and John Gilbert [37], gives us a better understanding of the difficulties of efficiently implementing algorithms with fine-grained parallelism even in a shared memory environment.

Finally we turn to incomplete LU (ILU) factorizations, a family of preconditioners often used with iterative solvers. We propose a modification to a standard ILU scheme and show that it makes better use of the memory the user has available, leading to a greater likelihood of convergence for preconditioned GMRES(50), the iterative solver used in our studies. By looking at data gathered from tens of thousands of test runs combining matrices with different ILU algorithms, parameter settings, scaling algorithms, and ordering algorithms, we draw some conclusions about the effects of different ordering algorithms on the convergence of ILU-preconditioned GMRES(50). We find, for example, that both ordering for stability and partial pivoting are necessary for achieving the best convergence results.

Chapter 2

Preconditioning sparse matrices for computing eigenvalues

Given a matrix A, we say that λ is an eigenvalue of A with corresponding eigenvector x if Ax = λx and x ≠ 0. Given A, eigensolvers try to find some or all of its eigenvalues and eigenvectors. The eigenvalues of a matrix are preserved under similarity transforms, which means the eigenvalues of B = SAS^{-1}, where S is nonsingular, are the same as those of A. The eigenvectors, on the other hand, are transformed by S: if x is an eigenvector of A, then Sx is an eigenvector of B, since Ax = λx implies (SAS^{-1})(Sx) = λ(Sx). Preconditioning in this context means choosing S so that the eigenvalues of SAS^{-1} can be computed more quickly or more accurately than those of the untransformed A.

In this chapter we explore two methods for choosing S. In section 2.1 we constrain S to be a permutation matrix and show how to decompose A into a set of smaller systems. In section 2.2 we then constrain S to be a diagonal matrix and look at algorithms for scaling the entries of A to improve the accuracy with which its eigenvalues can be computed. These techniques can sometimes be combined, and in section 2.3 we show the effects of preconditioning on the accuracy of computed eigenvalues.

Our contributions are first to notice that decomposing a matrix A by using a strongly connected components algorithm can significantly reduce the expected time complexity of finding the eigenvalues of A. This scheme can be arbitrarily better than the conventional scheme used for dense matrices, described in [147] and used in packages such as Lapack [5] and Eispack [169].
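To give a feel for what balancing with a diagonal S = D does, the sketch below runs the classical Osborne-style iteration on a dense array: each pass rescales one row/column pair so that the off-diagonal parts of row i and column i of DAD^{-1} have equal 2-norms. This is only an illustrative sketch of the idea behind direct balancing; it is not the weighted or Krylov-based algorithms developed later in this chapter, and it assumes the entries of A are explicitly available.

```python
import numpy as np

def osborne_balance(A, sweeps=10):
    """Return B = D A D^{-1} and the diagonal of D after a few balancing sweeps."""
    n = A.shape[0]
    d = np.ones(n)
    B = A.astype(float).copy()
    for _ in range(sweeps):
        for i in range(n):
            # Off-diagonal norms of column i and row i; the diagonal entry B[i, i]
            # is unaffected by the scaling, so it is left out of the ratio.
            c = np.linalg.norm(np.delete(B[:, i], i))
            r = np.linalg.norm(np.delete(B[i, :], i))
            if c == 0.0 or r == 0.0:
                continue
            f = np.sqrt(c / r)      # after scaling: column norm c/f equals row norm r*f
            d[i] *= f
            B[:, i] /= f
            B[i, :] *= f
    return B, d

A = np.array([[1.0, 1.0e4],
              [1.0e-4, 1.0]])
B, d = osborne_balance(A)
print(np.linalg.norm(A, 2), np.linalg.norm(B, 2))   # the balanced matrix has a much smaller norm
```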

We then switch gears and define weighted balancing, which we show minimizes the 2-norm of a nonnegative matrix. We further describe novel Krylov-based balancing algorithms which approximate weighted balancing and which operate on a matrix A without explicitly accessing its entries. By using only matrix-vector, and sometimes matrix-transpose-vector, multiplications to access A, these new algorithms can be used with eigensolvers that similarly assume only that a subroutine for computing Ax (and possibly A^T x) is available. Finally we show that for matrices from our test suite, these Krylov-based balancing algorithms do, in fact, often improve the accuracy to which eigenvalues are computed by dense or sparse eigensolvers: by as much as 5 decimal places in our examples with dense eigensolvers and as much as 10 decimal places for sparse eigensolvers. Furthermore, Krylov-based balancing can also improve the condition number of the eigenvalues, which improves computed error bounds. Portions of this work were published in [33, 35, 36].

2.1 Decomposing the matrix

We first consider choosing a similarity transform P, where P is a permutation matrix. The goal is to find P such that

    PAP^T = [ X  Y ]
            [ 0  Z ]

where X and Z are square. Since the eigenvalues of A are the eigenvalues of X together with those of Z, this decomposition reduces the problem of finding the eigenvalues of A to the smaller, and therefore simpler, eigenproblems for X and Z. Recall from section 1.3 that if such a P exists, A is reducible; otherwise A is irreducible. P is applied symmetrically, so the transformation permutes the rows and columns of A in the same way. Looking at DG(A) and DG(PAP^T) (defined in section 1.3.2) we see the two are isomorphic, and only the numbering of the nodes is changed.

In this section we review an algorithm for finding P described by Parlett and Reinsch in [147] and used by codes in several popular linear algebra libraries. We then describe a permutation which can do significantly better, and end with a comparison of the two algorithms on our test matrices.

The Parlett-Reinsch Algorithm

In [147] Parlett and Reinsch describe a two-step algorithm for transforming a matrix prior to computing its eigenvalues. In the first step, the matrix is permuted to separate out rows and columns which isolate eigenvalues. In the second, the remaining rows and columns are scaled. This algorithm is implemented in linear algebra packages such as EISPACK (under the name balanc), LAPACK (under the name gebal*), and MATLAB (under the name balance).

In the permutation phase, the algorithm first searches for a row with zeros in all n - 1 off-diagonal entries. If a row with this structure exists, it is permuted to the bottom of the matrix by swapping two rows and swapping the same two columns. The algorithm then iterates on the first n - 1 rows and columns of the permuted matrix. When no more rows isolating eigenvalues on the diagonal are found, the process repeats for columns. Afterwards the algorithm has found a permutation matrix P such that

    P^T A P = [ T1  X  Y  ]
              [ 0   C  Z  ]                                    (2.1)
              [ 0   0  T2 ]

where T1 and T2 are upper triangular. The eigenvalues of T1 and T2 are their diagonal entries. Even though the square submatrix C may not be irreducible, this decomposition is deemed sufficiently good.

In graph theoretic terms, this algorithm begins by looking for a sink node s of DG(A). If one exists, it is given the number n, meaning in the permuted matrix it corresponds to the last row and column. The algorithm then looks for a sink node in the subgraph induced by the set of all nodes except s. If a sink node exists in this subgraph, it is numbered n - 1. When there are no sink nodes found in the last subgraph, the algorithm continues by looking for source nodes, which are numbered in increasing order, starting with 1. Nodes not identified as source or sink nodes end up numbered after all the source nodes and before all the sink nodes. See figure 2.1 for a small example.

* In LAPACK there is also an additional character at the beginning of the subroutine name which specifies the data type (e.g. dgebal balances matrices whose entries are in double precision) [5].
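The permutation phase just described can be sketched in a few lines; the code below is an illustrative reimplementation on a dense array (it is not the EISPACK, LAPACK, or MATLAB code, and the helper names are ours). Rows whose off-diagonal entries within the active window are all zero are pushed to the bottom, then columns with the same property are pushed to the top, leaving the central block C of equation (2.1) in A[lo:hi, lo:hi].

```python
import numpy as np

def parlett_reinsch_permute(A):
    """Symmetric permutation phase: isolate eigenvalues on the diagonal."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)          # tracks the symmetric permutation applied
    lo, hi = 0, n                # active window is A[lo:hi, lo:hi]

    def swap(i, j):              # swap rows i, j and the same two columns
        A[[i, j], :] = A[[j, i], :]
        A[:, [i, j]] = A[:, [j, i]]
        perm[[i, j]] = perm[[j, i]]

    found = True
    while found:                 # rows with zero off-diagonals go to the bottom
        found = False
        for i in range(lo, hi):
            if not np.concatenate((A[i, lo:i], A[i, i + 1:hi])).any():
                hi -= 1
                swap(i, hi)
                found = True
                break
    found = True
    while found:                 # columns with zero off-diagonals go to the top
        found = False
        for j in range(lo, hi):
            if not np.concatenate((A[lo:j, j], A[j + 1:hi, j])).any():
                swap(j, lo)
                lo += 1
                found = True
                break
    return A, perm, lo, hi       # A[lo:hi, lo:hi] is the remaining block C

A = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 4.0],
              [5.0, 0.0, 6.0]])
B, perm, lo, hi = parlett_reinsch_permute(A)   # row 0 isolates the eigenvalue 2
```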

Figure 2.1: This figure shows how the Parlett-Reinsch algorithm would decompose a small graph. In (a) a sink node is located and numbered last; in (b) there are no sink nodes, so a source node is located and numbered 1; in (c) there are neither sink nor source nodes, so the remaining nodes are considered a group and numbered consecutively. Gray elements were eliminated in a previous step and are not considered.

The Strongly Connected Components Algorithm

By locating individual rows and columns which isolate eigenvalues, the permutation phase of the Parlett-Reinsch algorithm finds a permutation matrix P which makes PAP^T as upper triangular as possible. In other words, it decomposes A into 1 × 1 blocks and one large block consisting of all the remaining rows and columns, as shown in equation 2.1. We suggest choosing P to make PAP^T as block upper triangular as possible, which decomposes A into a set of diagonal blocks of size n̂_1, n̂_2, ..., n̂_k, where k depends on the structure of the matrix. This minimizes the size of the largest diagonal block, which is significant since a dense eigensolver run on the decomposed matrix takes time O(Σ_{i=1}^{k} n̂_i³).

In graph theoretic terms, making A as block upper triangular as possible corresponds to finding the strongly connected components of a directed graph whose adjacency matrix has the same structure as A and then sorting the components using a topological sort [44, section 23.5]. See figure 2.2 for a small example. Tarjan noted that finding the strongly connected components of a directed graph can be done using two depth first searches [173]; a description of his algorithm can be found in [1, section 5.5]. Descriptions of implementations are in [59] and [151]. We point out that Tarjan's algorithm is particularly well suited to this application because it outputs the nodes one strongly connected component at a time, with the components already topologically sorted.

Figure 2.2: This figure shows how the strongly connected components algorithm would decompose a small graph. All the strongly connected components are located, then a topological sort is done on the components, and the nodes are numbered so that if component i comes after component j in the topological sort, all elements of component i have numbers greater than those of the elements of component j.

Permuting A to block upper triangular form means the indices of several diagonal blocks may need to be stored so that the eigenproblems corresponding to these blocks can be identified and solved. However, we believe the potential benefits of this algorithm over the Parlett-Reinsch algorithm compensate for the additional complexity. Because the eigenvalues of the original matrix A are the same as the union of the eigenvalues of the individual diagonal blocks in P^T AP, the running time of any eigensolver run on the blocks of the balanced matrix depends strongly on the size of the largest block. The size of the largest diagonal block found using this permutation can be significantly smaller than that found by the Parlett-Reinsch algorithm, and it is never larger.

Comparisons

The benefits of using the strongly connected components algorithm instead of the Parlett-Reinsch algorithm are two-fold. First, as described in the next paragraph, computing the strongly connected components is likely to take less time for sparse matrices stored in compressed format (row or column). Second, the size of the largest diagonal block can be much smaller with the strongly connected components algorithm, which makes computing the eigenvalues of A less expensive.

The Parlett-Reinsch permutation algorithm permutes rows one at a time, which requires O(nnz) time for each row if the matrix is in compressed column format. For a permuted upper triangular matrix, the permutation phase would take O(nnz · n) time, which is much more than the O(n + nnz) time taken by Tarjan's strongly connected components algorithm. In short, it is more efficient to run an algorithm that needs only two sweeps through the entire data structure to identify all the blocks, rather than one which repeatedly looks for a single 1 × 1 block in each sweep.
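The sketch below illustrates the strongly connected components decomposition on a small example (the matrix is hypothetical, scipy is assumed to be available, and this is not the implementation used for the experiments in this chapter): the components of DG(A) are found, topologically sorted, and the nodes renumbered component by component so that PAP^T comes out block upper triangular.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def scc_permutation(A):
    """Return a permutation that puts A into block upper triangular form."""
    n = A.shape[0]
    ncomp, label = connected_components(csr_matrix(A), directed=True, connection='strong')
    # Build the condensation DAG on the components and topologically sort it.
    succ = [set() for _ in range(ncomp)]
    indeg = [0] * ncomp
    for i, j in zip(*np.nonzero(A)):
        ci, cj = label[i], label[j]
        if ci != cj and cj not in succ[ci]:
            succ[ci].add(cj)
            indeg[cj] += 1
    ready = [c for c in range(ncomp) if indeg[c] == 0]
    order = []
    while ready:
        c = ready.pop()
        order.append(c)
        for d in succ[c]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    rank = {c: k for k, c in enumerate(order)}
    # Number the nodes component by component, in topological order of the components.
    return np.array(sorted(range(n), key=lambda i: rank[label[i]]))

A = np.array([[1, 1, 0, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
p = scc_permutation(A)
print(A[np.ix_(p, p)])   # PAP^T: (block) upper triangular, one diagonal block per component
```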


Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively. Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

More information

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013 Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

More information

Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

More information

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,

More information

Linear Algebra: Determinants, Inverses, Rank

Linear Algebra: Determinants, Inverses, Rank D Linear Algebra: Determinants, Inverses, Rank D 1 Appendix D: LINEAR ALGEBRA: DETERMINANTS, INVERSES, RANK TABLE OF CONTENTS Page D.1. Introduction D 3 D.2. Determinants D 3 D.2.1. Some Properties of

More information

Similar matrices and Jordan form

Similar matrices and Jordan form Similar matrices and Jordan form We ve nearly covered the entire heart of linear algebra once we ve finished singular value decompositions we ll have seen all the most central topics. A T A is positive

More information

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

More information

Solving Linear Systems of Equations. Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.

Solving Linear Systems of Equations. Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx. Solving Linear Systems of Equations Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.edu These slides are a supplement to the book Numerical Methods with Matlab:

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

Section 6.1 - Inner Products and Norms

Section 6.1 - Inner Products and Norms Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,

More information

Solving Linear Systems, Continued and The Inverse of a Matrix

Solving Linear Systems, Continued and The Inverse of a Matrix , Continued and The of a Matrix Calculus III Summer 2013, Session II Monday, July 15, 2013 Agenda 1. The rank of a matrix 2. The inverse of a square matrix Gaussian Gaussian solves a linear system by reducing

More information

Linear Algebra and TI 89

Linear Algebra and TI 89 Linear Algebra and TI 89 Abdul Hassen and Jay Schiffman This short manual is a quick guide to the use of TI89 for Linear Algebra. We do this in two sections. In the first section, we will go over the editing

More information

DETERMINANTS IN THE KRONECKER PRODUCT OF MATRICES: THE INCIDENCE MATRIX OF A COMPLETE GRAPH

DETERMINANTS IN THE KRONECKER PRODUCT OF MATRICES: THE INCIDENCE MATRIX OF A COMPLETE GRAPH DETERMINANTS IN THE KRONECKER PRODUCT OF MATRICES: THE INCIDENCE MATRIX OF A COMPLETE GRAPH CHRISTOPHER RH HANUSA AND THOMAS ZASLAVSKY Abstract We investigate the least common multiple of all subdeterminants,

More information

Lecture 5: Singular Value Decomposition SVD (1)

Lecture 5: Singular Value Decomposition SVD (1) EEM3L1: Numerical and Analytical Techniques Lecture 5: Singular Value Decomposition SVD (1) EE3L1, slide 1, Version 4: 25-Sep-02 Motivation for SVD (1) SVD = Singular Value Decomposition Consider the system

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

1 Symmetries of regular polyhedra

1 Symmetries of regular polyhedra 1230, notes 5 1 Symmetries of regular polyhedra Symmetry groups Recall: Group axioms: Suppose that (G, ) is a group and a, b, c are elements of G. Then (i) a b G (ii) (a b) c = a (b c) (iii) There is an

More information

3 Orthogonal Vectors and Matrices

3 Orthogonal Vectors and Matrices 3 Orthogonal Vectors and Matrices The linear algebra portion of this course focuses on three matrix factorizations: QR factorization, singular valued decomposition (SVD), and LU factorization The first

More information

Factorization Theorems

Factorization Theorems Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

More information

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued). MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

More information

Systems of Linear Equations

Systems of Linear Equations Systems of Linear Equations Beifang Chen Systems of linear equations Linear systems A linear equation in variables x, x,, x n is an equation of the form a x + a x + + a n x n = b, where a, a,, a n and

More information

Syntax Description Remarks and examples Also see

Syntax Description Remarks and examples Also see Title stata.com permutation An aside on permutation matrices and vectors Syntax Description Remarks and examples Also see Syntax Permutation matrix Permutation vector Action notation notation permute rows

More information

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i. Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(

More information

Iterative Methods for Solving Linear Systems

Iterative Methods for Solving Linear Systems Chapter 5 Iterative Methods for Solving Linear Systems 5.1 Convergence of Sequences of Vectors and Matrices In Chapter 2 we have discussed some of the main methods for solving systems of linear equations.

More information

1.2 Solving a System of Linear Equations

1.2 Solving a System of Linear Equations 1.. SOLVING A SYSTEM OF LINEAR EQUATIONS 1. Solving a System of Linear Equations 1..1 Simple Systems - Basic De nitions As noticed above, the general form of a linear system of m equations in n variables

More information

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH ZACHARY ABEL 1. Introduction In this survey we discuss properties of the Higman-Sims graph, which has 100 vertices, 1100 edges, and is 22 regular. In fact

More information

Row Echelon Form and Reduced Row Echelon Form

Row Echelon Form and Reduced Row Echelon Form These notes closely follow the presentation of the material given in David C Lay s textbook Linear Algebra and its Applications (3rd edition) These notes are intended primarily for in-class presentation

More information

Linear Programming in Matrix Form

Linear Programming in Matrix Form Linear Programming in Matrix Form Appendix B We first introduce matrix concepts in linear programming by developing a variation of the simplex method called the revised simplex method. This algorithm,

More information

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92. Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

More information

1 Review of Least Squares Solutions to Overdetermined Systems

1 Review of Least Squares Solutions to Overdetermined Systems cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares

More information

LINEAR ALGEBRA. September 23, 2010

LINEAR ALGEBRA. September 23, 2010 LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................

More information

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition K. Osypov* (WesternGeco), D. Nichols (WesternGeco), M. Woodward (WesternGeco) & C.E. Yarman (WesternGeco) SUMMARY Tomographic

More information

Elementary Matrices and The LU Factorization

Elementary Matrices and The LU Factorization lementary Matrices and The LU Factorization Definition: ny matrix obtained by performing a single elementary row operation (RO) on the identity (unit) matrix is called an elementary matrix. There are three

More information

Unit 18 Determinants

Unit 18 Determinants Unit 18 Determinants Every square matrix has a number associated with it, called its determinant. In this section, we determine how to calculate this number, and also look at some of the properties of

More information

Suk-Geun Hwang and Jin-Woo Park

Suk-Geun Hwang and Jin-Woo Park Bull. Korean Math. Soc. 43 (2006), No. 3, pp. 471 478 A NOTE ON PARTIAL SIGN-SOLVABILITY Suk-Geun Hwang and Jin-Woo Park Abstract. In this paper we prove that if Ax = b is a partial signsolvable linear

More information

Continuity of the Perron Root

Continuity of the Perron Root Linear and Multilinear Algebra http://dx.doi.org/10.1080/03081087.2014.934233 ArXiv: 1407.7564 (http://arxiv.org/abs/1407.7564) Continuity of the Perron Root Carl D. Meyer Department of Mathematics, North

More information

ALGEBRAIC EIGENVALUE PROBLEM

ALGEBRAIC EIGENVALUE PROBLEM ALGEBRAIC EIGENVALUE PROBLEM BY J. H. WILKINSON, M.A. (Cantab.), Sc.D. Technische Universes! Dsrmstedt FACHBEREICH (NFORMATiK BIBL1OTHEK Sachgebieto:. Standort: CLARENDON PRESS OXFORD 1965 Contents 1.

More information

Orthogonal Bases and the QR Algorithm

Orthogonal Bases and the QR Algorithm Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries

More information

Solutions to Math 51 First Exam January 29, 2015

Solutions to Math 51 First Exam January 29, 2015 Solutions to Math 5 First Exam January 29, 25. ( points) (a) Complete the following sentence: A set of vectors {v,..., v k } is defined to be linearly dependent if (2 points) there exist c,... c k R, not

More information

A Direct Numerical Method for Observability Analysis

A Direct Numerical Method for Observability Analysis IEEE TRANSACTIONS ON POWER SYSTEMS, VOL 15, NO 2, MAY 2000 625 A Direct Numerical Method for Observability Analysis Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper presents an algebraic method

More information

The Characteristic Polynomial

The Characteristic Polynomial Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

More information

LS.6 Solution Matrices

LS.6 Solution Matrices LS.6 Solution Matrices In the literature, solutions to linear systems often are expressed using square matrices rather than vectors. You need to get used to the terminology. As before, we state the definitions

More information

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH 31 Kragujevac J. Math. 25 (2003) 31 49. SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH Kinkar Ch. Das Department of Mathematics, Indian Institute of Technology, Kharagpur 721302, W.B.,

More information

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The SVD is the most generally applicable of the orthogonal-diagonal-orthogonal type matrix decompositions Every

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

Solution to Homework 2

Solution to Homework 2 Solution to Homework 2 Olena Bormashenko September 23, 2011 Section 1.4: 1(a)(b)(i)(k), 4, 5, 14; Section 1.5: 1(a)(b)(c)(d)(e)(n), 2(a)(c), 13, 16, 17, 18, 27 Section 1.4 1. Compute the following, if

More information

What is Linear Programming?

What is Linear Programming? Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to

More information

Nonlinear Iterative Partial Least Squares Method

Nonlinear Iterative Partial Least Squares Method Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Typical Linear Equation Set and Corresponding Matrices

Typical Linear Equation Set and Corresponding Matrices EWE: Engineering With Excel Larsen Page 1 4. Matrix Operations in Excel. Matrix Manipulations: Vectors, Matrices, and Arrays. How Excel Handles Matrix Math. Basic Matrix Operations. Solving Systems of

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

Abstract: We describe the beautiful LU factorization of a square matrix (or how to write Gaussian elimination in terms of matrix multiplication).

Abstract: We describe the beautiful LU factorization of a square matrix (or how to write Gaussian elimination in terms of matrix multiplication). MAT 2 (Badger, Spring 202) LU Factorization Selected Notes September 2, 202 Abstract: We describe the beautiful LU factorization of a square matrix (or how to write Gaussian elimination in terms of matrix

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

Numerical Matrix Analysis

Numerical Matrix Analysis Numerical Matrix Analysis Lecture Notes #10 Conditioning and / Peter Blomgren, blomgren.peter@gmail.com Department of Mathematics and Statistics Dynamical Systems Group Computational Sciences Research

More information

Lecture 1: Schur s Unitary Triangularization Theorem

Lecture 1: Schur s Unitary Triangularization Theorem Lecture 1: Schur s Unitary Triangularization Theorem This lecture introduces the notion of unitary equivalence and presents Schur s theorem and some of its consequences It roughly corresponds to Sections

More information

The Determinant: a Means to Calculate Volume

The Determinant: a Means to Calculate Volume The Determinant: a Means to Calculate Volume Bo Peng August 20, 2007 Abstract This paper gives a definition of the determinant and lists many of its well-known properties Volumes of parallelepipeds are

More information

Notes on Symmetric Matrices

Notes on Symmetric Matrices CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

More information