Solving linear equations on parallel distributed memory architectures by extrapolation

Christer Andersson

Abstract

Extrapolation methods can be used to accelerate the convergence of vector sequences. It is shown how three different extrapolation algorithms, the minimal polynomial extrapolation (MPE), the reduced rank extrapolation (RRE) and the modified minimal polynomial extrapolation (MMPE), can be used to solve systems of linear equations. The algorithms are derived and their equivalence to different Krylov subspace methods is established. The extrapolation algorithms are preferable on parallel distributed memory architectures since less inter-processor communication is needed. Numerically the extrapolation methods are not as stable as the Krylov subspace methods, since they require the solution of ill-conditioned overdetermined systems. Several techniques for improving convergence and stability are presented; some of these are new to the best of the author's knowledge. The use of regularization methods and a slightly modified stationary method have proved to be especially useful. Error bounds and methods of estimating accuracy are given. Some aspects of implementation are discussed, with emphasis on parallel distributed memory architectures. Implementations of RRE and GMRES are compared on an IBM RS/6000 Power Parallel SP. RRE has some difficulties converging to solutions with errors close to the floating point relative accuracy. For slightly larger tolerances, however, RRE is better than GMRES. RRE seems to be the most useful of the extrapolation algorithms, especially if it is used with the modified stationary method given here.

Contents

1 Introduction
2 Theoretical background
  2.1 Notation and common properties
  2.2 Minimal polynomial extrapolation
  2.3 Reduced rank extrapolation
  2.4 Modified minimal polynomial extrapolation
  2.5 Summary
  2.6 Extrapolation vs. Krylov subspace methods
3 Practical use and implementation
  3.1 Cycling
  3.2 Improving convergence
    3.2.1 Initial iterations
    3.2.2 Discarding generated vectors
    3.2.3 Choosing stationary method
    3.2.4 Using higher precision to solve the overdetermined linear system
    3.2.5 Normalization
  3.3 Regularization
    3.3.1 Solving overdetermined systems
    3.3.2 Truncated SVD regularization
  3.4 Computational complexity
4 Numerical experiments
  4.1 Test problems
  4.2 Choosing extrapolation method
  4.3 Empirical examination of techniques of improving convergence
    4.3.1 Initial iterations
    4.3.2 Discarding generated vectors
    4.3.3 Choosing stationary method
    4.3.4 Using higher precision to solve the overdetermined linear system
    4.3.5 Normalization
    4.3.6 Truncated SVD regularization
  4.4 Extrapolation methods compared to Krylov subspace methods
  4.5 Estimating the accuracy of the extrapolated solution
5 Mathematical results
  5.1 Error bounds
  5.2 The eigenvectors of G
  5.3 A heuristic argument for normalizing the solution of the overdetermined system
6 Parallel implementation
  6.1 Some important concepts
  6.2 Hardware and software
  6.3 Implementation
    6.3.1 Matrix-vector multiplication
    6.3.2 Forming and solving the normal equations
    6.3.3 Computing the residual
  6.4 Timing experiments
    6.4.1 MPI Allreduce
    6.4.2 Extrapolation methods
    6.4.3 Extrapolation methods compared to Krylov subspace methods
7 Conclusion

1 Introduction

Many problems in science and engineering at some point require the solution of a system of linear equations. There are several methods for computing the solution, the most well-known being standard Gaussian elimination. When choosing the algorithm to use one wishes to take advantage of any special properties the system might have. For larger systems this becomes increasingly important. Sparse linear systems constitute an important class of problems. For sparse systems a large number of the elements in the coefficient matrix are zero. Banded systems, where all the non-zero elements are located in a band along the diagonal, are of special interest since they arise from discretizing partial differential equations. Due to fill-in, Gaussian elimination does not preserve sparsity, i.e. the elimination process introduces non-zero elements. Thus sparse systems are often solved by iterative methods requiring only matrix-vector multiplications with the original matrix. Starting from some initial guess, iterative methods generate a sequence of approximations converging to the solution.

Extrapolation methods constitute a class of iterative methods used to accelerate the convergence of vector sequences. Starting from a vector sequence generated by a stationary iterative method, the reduced rank extrapolation (RRE), the minimal polynomial extrapolation (MPE) and the modified minimal polynomial extrapolation (MMPE) construct a better approximation from the information in the vector sequence. It is known that for linearly generated vector sequences the above methods are mathematically equivalent to well-known Krylov subspace methods such as the Arnoldi method, the Lanczos method and GMRES. Numerically the extrapolation methods are not as stable as the latter methods, since they require the solution of ill-conditioned overdetermined linear systems. The stability of this solution, and thus of the extrapolation methods, can be increased by using regularization methods or a slightly modified stationary method. The main reason for preferring extrapolation methods to Krylov subspace methods is their computational advantages on parallel distributed memory computers. They do not require the orthogonalization of vectors needed for the Krylov subspace methods. Thus the need for inter-processor communication is reduced.

The purpose of this Master's project is to investigate different extrapolation methods in terms of numerical stability and accuracy, as well as to compare this class of methods with some well known Krylov subspace methods. In section 2 the theoretical background for the extrapolation methods is briefly reviewed. The main focus is on the derivation of the methods. This section is based on the papers [17], [19], [21] and the book [14]. Practical usage and implementation of extrapolation methods are discussed in section 3. In particular, different approaches to improving stability and convergence are suggested. Some numerical experiments are described in section 4. From these experiments we conclude that the extrapolation methods are as efficient as Krylov subspace methods for small systems. Implementation on parallel distributed memory architectures is discussed in section 6. Parallel implementations of RRE and GMRES are compared for a few test problems. RRE has some difficulties converging to tolerances close to the floating point relative accuracy. For slightly larger tolerances RRE works well, indicating that it can be useful for solving large systems as well.
In terms of speed and parallelization properties RRE shows major advantages compared to GMRES.

2 Theoretical background

2.1 Notation and common properties

In this paper we are interested in solving sparse linear systems

    Ax = b    (2.1)

where A ∈ R^{N×N} is invertible. More specifically, an approximation to the solution will be constructed by accelerating a linear stationary iterative method of the form

    x_{j+1} = G x_j + f    (2.2)

where A = M − N is a splitting such that G = M^{−1}N and M^{−1}b exist (so that f = M^{−1}b). The class of iterative methods described by (2.2) includes the well known Jacobi and Gauss-Seidel methods. To accelerate the convergence of (2.2), G and f do not have to be known explicitly; the knowledge of the

sequence alone is sufficient. We will however assume that all eigenvalues of G are different from 1, so that a unique fixed point of (2.2) exists and that it is the solution to

    (I − G)s = f.    (2.3)

Comparing this to the original system of equations (2.1) we find that s is the solution to the preconditioned system M^{−1}As = (I − G)s = f = M^{−1}b.

If a sequence of k vectors has been generated starting from some initial guess x_0, we want to determine coefficients γ_j so that s can be expressed as a linear combination of the x_j,

    s = Σ_{j=0}^{k} γ_j x_j    (2.4)

where

    Σ_{j=0}^{k} γ_j = 1.    (2.5)

It turns out that if k is chosen to be at least the degree of the minimal polynomial of x_1 − x_0 with respect to G, i.e. the monic polynomial P_{k_0} of least degree k_0 such that P_{k_0}(G)(x_1 − x_0) = 0, then all extrapolation methods herein converge to the fixed point s. We will assume that k = k_0 in sections 2.2, 2.3 and 2.4 and return to the more general case in sections 2.6 and 3.1.

In the following sections we will derive the reduced rank extrapolation (RRE), the minimal polynomial extrapolation (MPE) and the modified minimal polynomial extrapolation (MMPE). For that purpose the following notation will be useful. Define

    u_j = Δx_j = x_{j+1} − x_j    (2.6)
    v_j = Δu_j = u_{j+1} − u_j    (2.7)
    U_n = [u_0 u_1 ... u_n]    (2.8)
    V_n = [v_0 v_1 ... v_n].    (2.9)

The difference between x_j and the solution is denoted by

    ε_j = x_j − s.    (2.10)

From (2.2), (2.6) and (2.7) it follows that

    u_{j+1} = G u_j = G^{j+1} u_0,  j ≥ 0    (2.11)

and

    v_j = (G − I) u_j,  j ≥ 0.    (2.12)

2.2 Minimal polynomial extrapolation

Using (2.10) and (2.5) the solution s to (I − G)x = f can be written

    s = Σ_j γ_j x_j = Σ_j γ_j (s + ε_j) = s + Σ_j γ_j ε_j.

Clearly the γ_j should be chosen such that Σ_{j=0}^{k} γ_j ε_j = 0 holds. The errors ε_j cannot be computed unless s is known. However, we have the following lemma.

Lemma 1. Let c_j be the coefficients of the polynomial P_k(t) = Σ_{j=0}^{k} c_j t^j such that P_k(G)u_0 = 0. Then Σ_{j=0}^{k} c_j ε_j = 0.

Proof. First,

    (G − I)ε_j = (G − I)(x_j − s) = (G x_j − x_j) − (G s − s) = ((x_{j+1} − f) − x_j) − ((s − f) − s) = x_{j+1} − x_j = u_j.

From (2.11) and the definition of the minimal polynomial it follows that

    0 = Σ_{j=0}^{k} c_j G^j u_0 = Σ_{j=0}^{k} c_j u_j = (G − I) Σ_{j=0}^{k} c_j ε_j,

and since (G − I) is of full rank this concludes the proof.

In other words, if the c_j are the coefficients of the minimal polynomial of u_0 with respect to G, then Σ_{j=0}^{k} c_j ε_j = 0. From this lemma it follows that γ_j = c_j satisfies Σ_j γ_j ε_j = 0. This relation is satisfied even if all the c_j are multiplied by a constant. To satisfy the constraint (2.5) we choose

    γ_j = c_j / Σ_{i=0}^{k} c_i,  j = 0, 1, ..., k,

provided that Σ_j c_j ≠ 0. It can be shown that this is always true when (I − G) is non-singular. Using (2.11) the minimal polynomial condition can be written

    P_k(G)u_0 = Σ_{j=0}^{k} c_j G^j u_0 = Σ_{j=0}^{k} c_j u_j = 0.

Since P_k is monic, c_k = 1 and thus

    Σ_{j=0}^{k−1} c_j u_j = −u_k.    (2.13)

Introducing the vector c = [c_0 c_1 ... c_{k−1}]^T this can be expressed as the overdetermined linear system

    U_{k−1} c = −u_k    (2.14)

where U_{k−1} ∈ R^{N×k}. Since k is the degree of the minimal polynomial it follows from the Cayley-Hamilton theorem [3] that k ≤ N. Thus the resulting system of equations is smaller than the original. In practice, for many problems the degree of the minimal polynomial is much less than N. An important question that needs to be addressed if we expect to find the solution to Ax = b is whether or not the system (2.14) is consistent. By definition of the minimal polynomial we know that (2.13) has a unique solution and hence (2.14) is consistent. Summarizing, we have the following algorithm.

Algorithm 1 (MPE).
1. Generate k + 1 vectors x_0, x_1, ..., x_k.
2. Compute U_{k−1} and u_k.
3. Solve U_{k−1} c = −u_k.
4. Set c_k = 1 and γ_j = c_j / Σ_i c_i.
5. s = Σ_j γ_j x_j.

This version of MPE differs somewhat from the algorithm originally presented by Cabay and Jackson [4]. We have chosen the simpler form given by Ford et al. in [21].
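As an illustration, Algorithm 1 can be sketched in a few lines of Python/NumPy. The sketch below is only a serial model of the method: one extra iterate is generated so that u_k can be formed, the overdetermined system is solved in the least-squares sense, and the diagonally dominant random test matrix with a Jacobi splitting is chosen purely as an example.

    import numpy as np

    def mpe(G, f, x0, k):
        # Generate x_0, ..., x_{k+1} with the stationary method x_{j+1} = G x_j + f.
        X = [x0]
        for _ in range(k + 1):
            X.append(G @ X[-1] + f)
        X = np.column_stack(X)                    # N x (k+2)
        U = X[:, 1:] - X[:, :-1]                  # first differences u_0, ..., u_k
        # Solve U_{k-1} c = -u_k in the least-squares sense and set c_k = 1.
        c, *_ = np.linalg.lstsq(U[:, :k], -U[:, k], rcond=None)
        c = np.append(c, 1.0)
        gamma = c / c.sum()                       # gamma_j = c_j / sum_i c_i
        return X[:, :k + 1] @ gamma               # s = sum_j gamma_j x_j

    # Example usage on a Jacobi splitting of a diagonally dominant test matrix.
    rng = np.random.default_rng(0)
    N = 50
    A = rng.standard_normal((N, N)) + N * np.eye(N)
    b = rng.standard_normal(N)
    D_inv = 1.0 / np.diag(A)
    G = np.eye(N) - D_inv[:, None] * A            # G = I - D^{-1} A
    f = D_inv * b                                 # f = D^{-1} b
    s = mpe(G, f, np.zeros(N), k=10)
    print(np.linalg.norm(b - A @ s))              # residual after one extrapolation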

2.3 Reduced rank extrapolation

When determining the γ_j for MPE we solved

    U_k [c_0 ... c_{k−1} c_k]^T = 0    (2.15)

subject to

    c_k = 1.    (2.16)

The c_j were then scaled to obtain γ_j satisfying Σ_j γ_j = 1. Since γ = [γ_0 γ_1 ... γ_k]^T also satisfies U_k γ = 0, we might ask ourselves what would happen if the constraint were changed from c_k = 1 to Σ_j c_j = 1. The resulting system would be

    U_k γ = 0  subject to  Σ_j γ_j = 1.    (2.17)

This approach also yields the solution. Consider the residual

    r = f − (I − G)x = (Gx + f) − x.

Substituting the solution Σ_j γ_j x_j for x yields r = f − Σ_j γ_j (I − G)x_j, and with (2.17) we find

    r = Σ_j γ_j ((G x_j + f) − x_j) = Σ_j γ_j u_j.    (2.18)

Since the residual is zero for the solution, the γ_j can be found by solving (2.17). This extrapolation approach is called reduced rank extrapolation (RRE).

Algorithm 2 (RRE).
1. Generate k + 1 vectors x_0, x_1, ..., x_k.
2. Compute U_k.
3. Solve U_k γ = 0 subject to Σ_j γ_j = 1.
4. s = Σ_j γ_j x_j.

Even though first and second differences are used in the alternative form of RRE, it does not require any additional storage. There is no need to store the generated vectors once U_k has been computed. The memory allocated for x_1, ..., x_k can be used to store v_0, ..., v_{k−1}. The form above is sometimes called Mešina's algorithm [13]; it was chosen to emphasize the similarities with MPE, the only difference being the constraint chosen when the overdetermined system is solved. The form of RRE usually found in the literature is somewhat different, see [21]. It does not involve a constraint and the overdetermined system is formed using second instead of first differences, see Algorithm 3.

Algorithm 3 (Alternative RRE).
1. Generate k + 1 vectors x_0, x_1, ..., x_k.
2. Compute U_k and V_{k−1}.
3. Solve V_{k−1} η = −u_0.
4. s = x_0 + Σ_{j=0}^{k−1} η_j u_j.
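A corresponding Python/NumPy sketch of Algorithm 3; as in the MPE sketch above, one extra iterate is generated so that all needed differences can be formed, and the overdetermined system is solved in the least-squares sense.

    import numpy as np

    def rre_alternative(G, f, x0, k):
        # Generate x_0, ..., x_{k+1} so that u_0, ..., u_k and v_0, ..., v_{k-1} exist.
        X = [x0]
        for _ in range(k + 1):
            X.append(G @ X[-1] + f)
        X = np.column_stack(X)
        U = X[:, 1:] - X[:, :-1]                  # u_j = x_{j+1} - x_j
        V = U[:, 1:] - U[:, :-1]                  # v_j = u_{j+1} - u_j
        eta, *_ = np.linalg.lstsq(V, -U[:, 0], rcond=None)   # V_{k-1} eta = -u_0
        return x0 + U[:, :k] @ eta                # s = x_0 + sum_j eta_j u_j

Only the differences are needed once they have been formed, in line with the storage remark above.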

The derivation of the alternative RRE is straightforward, see for example [21], and will not be given here. However, we show that it is equivalent to Algorithm 2. The expression for the solution in the alternative RRE has the form

    s = x_0 + Σ_{j=0}^{k−1} η_j u_j = (1 − η_0)x_0 + Σ_{j=1}^{k−1} (η_{j−1} − η_j)x_j + η_{k−1}x_k = Σ_{j=0}^{k} γ_j x_j.

Thus it is evident that γ can be expressed using η,

    γ = e_1 + S η,   S ∈ R^{(k+1)×k},    (2.19)

where S is the bidiagonal matrix with −1 on its main diagonal and +1 on its first subdiagonal. Inserting this in (2.17), we obtain

    0 = U_k γ = u_0 + U_k S η = u_0 + V_{k−1} η.    (2.20)

The constraint is satisfied since

    Σ_{j=0}^{k} γ_j = (1 − η_0) + Σ_{j=1}^{k−1} (η_{j−1} − η_j) + η_{k−1} = 1,

so Algorithms 2 and 3 are mathematically equivalent. However, for practical purposes the algorithms show different numerical behaviour. In most cases the alternative RRE seems to be the most efficient. This does not apply to slowly converging sequences. Due to cancellation, significant digits will be lost when computing U_k and even more when computing V_{k−1}. For a slowly converging sequence the coefficient matrix is very inaccurate, which results in inaccurate η_j. For RRE (Algorithm 2) we encounter other difficulties due to the constraint Σ_j γ_j = 1. The constraint does not specify an upper bound for |γ_j|. Often the γ_j are large with alternating signs, which may produce large round-off errors when the solution is computed. |γ_j| seems to be roughly proportional to j. This is consistent with the notion that the last vectors generated should be the most accurate.

2.4 Modified minimal polynomial extrapolation

Modified minimal polynomial extrapolation first appeared in [19], where a general procedure for deriving extrapolation methods using the Shanks-Schmidt transform [15] was given. The starting point is equation (2.13),

    Σ_{j=0}^{k−1} c_j u_j = −u_k.    (2.21)

Instead of solving this system in the least-squares sense, we introduce k linearly independent bounded linear functionals Q_i, i = 1, ..., k. Applying them to (2.21) yields

    Σ_{j=0}^{k−1} c_j Q_i(u_j) = −Q_i(u_k),  i = 1, ..., k,    (2.22)

which is a k × k linear system. Choosing Q_i(y) ≡ (e_i, y), where e_i is the i:th standard basis vector, (2.22) is just the first k equations of (2.21). Since (2.21) is consistent this is enough to find the unique solution to (2.21); in fact, any k equations will do. For k < k_0 equation (2.21) is no longer consistent and the Q_i must be

chosen more carefully. This will be discussed in section 4.2. In practice we represent the functionals with a matrix Q ∈ R^{N×k} and solve the k × k system

    Q^T U_{k−1} c = −Q^T u_k.

For completeness we also formulate the modified minimal polynomial extrapolation algorithm (MMPE).

Algorithm 4 (MMPE).
1. Generate k + 1 vectors x_0, x_1, ..., x_k.
2. Compute U_{k−1} and u_k.
3. Choose k linearly independent bounded linear functionals Q_1, ..., Q_k.
4. Solve Σ_{j=0}^{k−1} c_j Q_i(u_j) = −Q_i(u_k), i = 1, ..., k.
5. Set c_k = 1 and γ_j = c_j / Σ_i c_i.
6. s = Σ_j γ_j x_j.

2.5 Summary

All three methods, MPE, RRE and MMPE, solve the system U_k γ = 0 subject to the constraint Σ_j γ_j = 1. For RRE the system is solved directly, using for example Lagrange relaxation [18]. In section 2.3 we saw that the residual can be written U_k γ, and thus the resulting algorithm minimizes the L_2-norm of the residual. If the last column of U_k is moved to the right-hand side and U_{k−1}c = −u_k is solved instead, we obtain MPE by introducing c_k = 1 and normalizing the c_j. The solution to U_{k−1}c = −u_k can be found by multiplying with U_{k−1}^T from the left and solving the normal equations. If the system is instead multiplied with a matrix Q^T of rank k, Q^T U_{k−1} c = −Q^T u_k, we have MMPE.

All algorithms discussed so far belong to the family of polynomial extrapolation methods. There is another family of methods known as the epsilon algorithms, including the scalar epsilon algorithm (SEA), the vector epsilon algorithm (VEA) and the topological epsilon algorithm (TEA) [21]. They are based on recursive formulas and hence they are more difficult to implement efficiently on parallel architectures. Even on serial computers they have a major drawback, since a sequence of 2k vectors is required to find the solution instead of the k + 1 vectors needed for the polynomial methods. An overview of the epsilon algorithms can be found in [21].

2.6 Extrapolation vs. Krylov subspace methods

Krylov subspace methods (or Krylov methods for short) are examples of iterative projection methods. In general, projection methods seek an approximate solution to Ax = b satisfying

    x_m ∈ x_0 + K_m,    b − A x_m ⊥ L_m,

where K_m and L_m are subspaces of dimension m; see for example the exposition in [14]. Krylov methods are based on projection onto Krylov subspaces,

    K_m(A, r_0) = span{r_0, A r_0, A² r_0, ..., A^{m−1} r_0},

where r_0 is the residual of the initial guess, r_0 = b − A x_0. By choosing L_m differently one obtains different methods. In general there are several mathematically equivalent ways to formulate these methods. The Arnoldi method (or the Full Orthogonalization Method, FOM) is an orthogonal projection method with L_m(A, r_0) = K_m(A, r_0). The approximation x_m computed in iteration m is the

projection of the solution onto K_m. GMRES takes L_m(A, r_0) = A K_m(A, r_0). This results in a method that minimizes the L_2-norm of the residual in every iteration. Further information on Krylov methods can be found in [14].

One important property of the Krylov methods is that in exact arithmetic a solution belonging to R^N is found in no more than N iterations. More specifically, the necessary number of iterations is equal to the degree of the minimal polynomial of r_0 with respect to A. To establish equivalence with the extrapolation methods, the Krylov methods will be applied to the system (I − G)x = f and thus r_0 = u_0. Using the notation introduced for the extrapolation methods we know that the solution will be found in no more than k_0 iterations. This suggests that performing k iterations with some Krylov method is equivalent to extrapolating with k vectors.

Theorem 1. RRE and GMRES are equivalent when applied to the system (I − G)x = f.

Proof. To prove equivalence we will show that the extrapolated solution lies in x_0 + K_k and that its residual is orthogonal to L_k. We have

    K_k(I − G, u_0) = span{u_0, (I − G)u_0, ..., (I − G)^{k−1}u_0} = span{u_0, Gu_0, ..., G^{k−1}u_0} = span{u_0, u_1, ..., u_{k−1}}.    (2.23)

For the alternative RRE, s = x_0 + Σ_{j=0}^{k−1} η_j u_j obviously lies in x_0 + K_k. To show that the residual is orthogonal to L_k we use (2.12),

    L_k = (I − G) span{u_0, u_1, ..., u_{k−1}} = span{v_0, v_1, ..., v_{k−1}}.

For r_k = f − (I − G)s we have

    (V_{k−1})^T (f − (I − G)s) = (V_{k−1})^T (V_{k−1}η + u_0) = V_{k−1}^T V_{k−1}η + V_{k−1}^T u_0 = 0

by equation (2.20). Thus we have shown r_k ⊥ L_k.

Theorem 2. MPE and the Arnoldi method are equivalent when applied to the system (I − G)x = f.

Proof. By using the formulation of MPE by Cabay and Jackson [4] it is easy to show that the extrapolated solution belongs to x_0 + K_k, in the same way as in the proof of Theorem 1. For r_k = f − (I − G)s we have

    (U_{k−1})^T (f − (I − G)s) = (1/Σ_j c_j)(U_{k−1}^T U_{k−1}c + U_{k−1}^T u_k) = 0

by equation (2.14). Thus we have shown r_k ⊥ L_k.

It is also possible to show equivalence between the topological epsilon algorithm (TEA) and the Lanczos method. The theorems above make no assumptions on k and are valid for k ≠ k_0. For RRE (and GMRES) we know that choosing k less than k_0 results in an algorithm minimizing the L_2-norm of the residual. We conclude this section by showing that the error in the solution found by using MPE (or the Arnoldi method) is orthogonal to the k dominant components of the error. From the proof of Lemma 1 we have u_i = (G − I)ε_i. With L_k defined as for the Arnoldi method we find

    L_k = span{u_0, u_1, ..., u_k} = (I − G) span{ε_0, ε_1, ..., ε_k}.

The residual of the extrapolated solution, f − (I − G)s̃, is orthogonal to this subspace. Since (I − G) is non-singular we multiply by (I − G)^{−1} from the left to obtain

    (I − G)^{−1}f − s̃ = s − s̃ ⊥ span{ε_0, ε_1, ..., ε_k}.
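Theorem 1 is easy to check numerically. The following small Python/NumPy experiment (with an arbitrary contractive G chosen only for illustration) verifies that the residual of the alternative-RRE extrapolant is orthogonal to L_k = span{v_0, ..., v_{k−1}} and that the extrapolant coincides with the minimum-residual solution over x_0 + K_k, i.e. the GMRES-type solution.

    import numpy as np

    rng = np.random.default_rng(1)
    N, k = 40, 6
    G = 0.4 * rng.standard_normal((N, N)) / np.sqrt(N)   # spectral radius well below 1
    f = rng.standard_normal(N)
    x0 = np.zeros(N)

    # Vector sequence, first and second differences.
    X = [x0]
    for _ in range(k + 1):
        X.append(G @ X[-1] + f)
    X = np.column_stack(X)
    U = X[:, 1:] - X[:, :-1]
    V = U[:, 1:] - U[:, :-1]

    # Alternative-RRE extrapolant and its residual.
    eta, *_ = np.linalg.lstsq(V, -U[:, 0], rcond=None)
    s_rre = x0 + U[:, :k] @ eta
    r = f - (np.eye(N) - G) @ s_rre

    # Minimum-residual solution over x_0 + span{u_0, ..., u_{k-1}} = x_0 + K_k(I - G, u_0).
    B = (np.eye(N) - G) @ U[:, :k]
    y, *_ = np.linalg.lstsq(B, f - (np.eye(N) - G) @ x0, rcond=None)
    s_gmres_like = x0 + U[:, :k] @ y

    print(np.linalg.norm(V.T @ r))                 # ~ 0: r_k is orthogonal to L_k
    print(np.linalg.norm(s_rre - s_gmres_like))    # ~ 0: RRE matches the GMRES-type solution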

3 Practical use and implementation

The implementation of extrapolation methods consists of two separate parts, the implementation of the sequence generator and the implementation of the extrapolation process. These two parts can be implemented independently of each other and require different computational kernels. Since we are concerned with linear sequence generators given by (2.2), the sequence can be generated using only matrix-vector multiplications and saxpys, z ← αx + y. Both the matrix-vector multiplication and the saxpy operation can be found optimized in numerical libraries like BLAS [7]. For non-linear sequence generators other kernels may be of interest.

In the extrapolation process an overdetermined N × k linear system is formed and solved. We would like to find the solution by solving the normal equations, since they are easy to implement efficiently on both serial and parallel architectures. This requires routines for matrix-matrix multiplication and solution of linear systems, available in the BLAS [7] and LAPACK [1] libraries. There are some numerical difficulties associated with the normal equations that will be discussed in section 3.3.1.

Implementation on serial computers is straightforward and follows the algorithms outlined in the previous section, with some modifications discussed in section 3.1. Some techniques for improving the convergence of the extrapolation methods are given in section 3.2. Regularization methods for improving the stability of the solution of the overdetermined systems are considered in section 3.3. Finally, the arithmetic complexities of the extrapolation and Krylov methods are compared in section 3.4. On parallel computers some of the operations require special treatment, see the discussion in section 6.

3.1 Cycling

Extrapolation methods are used to accelerate the convergence of a stationary iterative method. The objective is to find the solution to (I − G)x = f faster than we would using only the stationary method. Since we are using finite arithmetic we do not expect to compute the exact solution; the objective is to find a solution that is accurate enough. For this purpose it is usually not optimal to choose k equal to k_0 (if we should happen to know it). A process called cycling is used instead.

Algorithm 5 (Cycling).
1. Choose an initial vector x_0.
2. Generate k + 1 vectors starting from x_0.
3. Extrapolate to find an approximate solution s̃.
4. If s̃ is accurate enough, stop. Otherwise set x_0 = s̃ and go to 2.

Cycling does not provide a way of choosing k, but makes it possible to find an accurate solution without knowing k_0 and hopefully without having to generate a large number of vectors. It also has other advantages. The size of the overdetermined linear system increases with k and so does the time required to solve it. If k_0 is large, solving this full system may be more time consuming than solving the original sparse system Ax = b. For a larger k one can in general expect faster convergence. By using cycling we can choose k such that the work and storage needed to solve the overdetermined system do not become the dominating part, and yet at the same time achieve reasonable convergence. For GMRES and the Arnoldi method it is not necessary to know k_0. If suitable algorithmic structures are used we can check whether k_0 = i at each iteration i. We could use a similar approach for extrapolation methods and compute a solution for each new vector generated. This is very time consuming and we usually prefer cycling.
Cycling can also be used with Krylov methods, where it is often called restarting. Numerical experiments indicate that k can be chosen somewhat arbitrarily with good results. For systems with 100 or more unknowns, a value between 10 and 40 will usually do. The condition number of the overdetermined system increases with k, so a higher k does not always imply faster convergence.
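A cycling driver along the lines of Algorithm 5 can be sketched as follows; the stopping test on the residual of the original system and the default limits are only illustrative choices, and extrapolate stands for any of the extrapolation sketches given earlier.

    import numpy as np

    def cycle(extrapolate, G, f, A, b, x0, k, tol=1e-10, max_cycles=100):
        # Restart the extrapolation from the previous extrapolant until the
        # residual of the original system Ax = b is small enough (Algorithm 5).
        x = x0
        for n in range(1, max_cycles + 1):
            x = extrapolate(G, f, x, k)
            res = np.linalg.norm(b - A @ x)
            if res < tol:
                break
        return x, n, res

    # e.g.  x, n_cycles, res = cycle(rre_alternative, G, f, A, b, np.zeros(N), k=20)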

3.2 Improving convergence

There are several ways of improving the convergence of extrapolation methods. Five techniques will be presented. The first three techniques concern the convergence of the stationary method. The fourth is used to obtain a more accurate solution to the system of equations. The last technique attempts to decrease the error after the extrapolated solution has been computed. Numerical experiments with these techniques are presented in section 4.3. In section 4.4 an alternative method of generating the vectors that span a Krylov subspace, which seems to improve the stability of RRE, is discussed.

3.2.1 Initial iterations

The eigenvectors of G are of great importance to the stationary method. To see why, we assume that G is diagonalizable. Then the eigenvectors p_i of G ∈ R^{N×N} form a basis of R^N. The initial error, ε_0, can be written as a linear combination of the eigenvectors,

    ε_0 = Σ_{i=0}^{N−1} a_i p_i,

and thus

    G^n ε_0 = Σ_{i=0}^{N−1} a_i λ_i^n p_i.

If the largest of the |λ_i|, the spectral radius of G, is less than one, the stationary method is convergent; otherwise

    lim_{n→∞} G^n ε_0 ≠ 0.

From

    x_{j+1} = G x_j + f = G(s + ε_j) + f = s + G ε_j

we have ε_j = G ε_{j−1}, and thus the sequence is divergent if any |λ_i| > 1. Extrapolation methods can be used to find the fixed point (or anti-limit) of diverging sequences, but converging sequences are of greater interest.

If some of the eigenvalues are very small, the components of the error along the corresponding eigenvectors vanish after a few iterations. This can be used to improve convergence. If k + 1 + n vectors are generated instead of k + 1, we can extrapolate using x_n, x_{1+n}, ..., x_{k+1+n}, thus eliminating contributions to the error from some eigenvectors. The first n iterations are called initial iterations. In [16] the relation

    ||s_n − s|| = O(|λ_{k+1}|^n)  as n → ∞

is established, where s_n denotes the extrapolated solution when n initial iterations are performed. It is assumed that the eigenvalues of G are ordered according to magnitude so that |λ_1| ≥ |λ_2| ≥ ... ≥ |λ_k| ≥ ..., and that |λ_k| > |λ_{k+1}|. For a converging sequence, initial iterations improve the extrapolated solution asymptotically. Further analysis of the effect of initial iterations can be found in [20], where a thorough theoretical explanation is given.

3.2.2 Discarding generated vectors

Initial iterations can be used to improve convergence when there are eigenvalues close to zero, but they have little effect on eigenvalues with magnitude close to one. Assuming a convergent sequence (all |λ_i| < 1) where some of the eigenvalues are close to 1, we find for the corresponding eigenvectors p_i that

    G^n p_i = λ_i^n p_i ≈ p_i

even for large n. If the other eigenvalues are small we will encounter difficulties due to cancellation (the loss of significant digits when subtracting two almost equal numbers) when computing the first differences. To deal with this problem we must somehow "lower" the magnitude of the larger eigenvalues. This can be done by discarding some of the generated vectors. Instead of generating k + 1 vectors we generate q(k + 1) vectors and extrapolate using every q-th vector, i.e. x_0, x_q, ..., x_{q(k+1)}. It is equivalent to using the stationary method

    x_{j+1} = G^q x_j + Σ_{i=0}^{q−1} G^i f.

Since the eigenvalues of G^q are λ_i^q we have "lowered" the magnitude of the eigenvalues. If we have complex eigenvalues with magnitudes close to one this becomes even more important, because of the oscillations in the vectors generated by the stationary method.

3.2.3 Choosing stationary method

The better convergence the underlying stationary method has, the better the convergence of the extrapolation method will be. In this work we have for simplicity used the Jacobi method, which requires only a matrix-vector multiplication and a vector addition to compute a vector in the sequence. Other methods could of course be used as well. If Jacobi gives unsatisfactory convergence it is natural to consider the Gauss-Seidel method. For the Gauss-Seidel method a triangular system of the same size as A must be solved for each vector in the sequence, which makes it more expensive than the Jacobi method. The question is how many fewer cycles are needed if we use a more efficient stationary method, and whether the time gained there is enough to justify a more expensive sequence generator. The Gauss-Seidel method involves solving a system of linear equations for every generated vector, which makes it difficult to implement on parallel computers. An alternative method that is discussed in [18] is the relaxation process

    x_{j+1} = (1 − ω)x_j + ω(G x_j + f),  0 < ω ≤ 1.

3.2.4 Using higher precision to solve the overdetermined linear system

In all extrapolation methods discussed here the extrapolated solution is constructed as a linear combination of successive stationary iterations. The coefficients in this linear combination are determined by solving an overdetermined linear system. This system is built from first or second differences of the successive approximations. It can thus be expected that, at least when one is close to the solution, this linear system will be ill-conditioned. The accuracy of the solution to a system of linear equations is largely dependent on the condition number of the system (see section 3.3.1). For ill-conditioned systems we might have to use higher precision to obtain the desired accuracy in the solution. In a programming language like C or FORTRAN this is easy; long double is used instead of double (C). If this is not possible (in Matlab for example) an alternative approach must be used. Dekker proposed the following algorithm for splitting a floating point number a into two halves, which can be used to extend the available precision [5].

Algorithm 6 (Dekker splitting).
1. γ = base^⌈digits/2⌉ + 1
2. τ = γ a, δ = τ − a
3. a_u = τ − δ  (calculate upper half)
4. a_l = a − a_u  (calculate lower half)
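In Python the splitting, and the classical way of using the two halves to obtain an exact product a·b = p + e, can be sketched as below; the constant 2^27 + 1 assumes base 2 and 53 significand bits, i.e. IEEE double precision.

    def split(a, splitter=2.0**27 + 1.0):
        # Dekker/Veltkamp splitting: a_u + a_l == a exactly, and each half
        # fits in roughly half of the significand bits.
        tau = splitter * a
        a_u = tau - (tau - a)      # upper half
        a_l = a - a_u              # lower half
        return a_u, a_l

    def two_product(a, b):
        # Exact product a*b = p + e computed from the split halves.
        p = a * b
        a_u, a_l = split(a)
        b_u, b_l = split(b)
        e = ((a_u * b_u - p) + a_u * b_l + a_l * b_u) + a_l * b_l
        return p, e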

The new variables are used to store terms of different size. This reduces the contributions from round-off errors. Dekker's algorithm has been used for numerical testing in Matlab. Before discussing the results we look at an example of how to use the new variables. The example below describes a matrix-vector multiplication, y ← Ax.

    Compute A_u, A_l, x_u and x_l
    y_u = A_u x_u
    y_l = A_u x_l + A_l x_u
    y_err = A_l x_l
    y = y_l + y_err
    y = y + y_u

If the higher precision is to have any impact it must be used in all relevant operations. To solve the overdetermined system using the normal equations and compute the extrapolated solution, this means extended precision has to be used in:

    forming the normal equations
    solving the normal equations
    computing the extrapolated solution
    computing the residual

Extended double precision has been used to solve the overdetermined systems for both MPE and RRE. It does not affect the solution in most cases. It can have some effect when the relative error in the extrapolated solution is close to the floating point relative accuracy of the computer.

3.2.5 Normalization

A necessary condition for finding the solution to (I − G)x = f using extrapolation methods is that Σ_j γ_j = 1, see sections 2.2, 2.3 and 2.4. In finite arithmetic we do not expect to compute the exact γ_j but an approximation γ̃_j = γ_j + e_j. With this in mind it is likely that

    Σ_j γ̃_j = 1 + Σ_j e_j ≡ 1 + E,

where E is small but different from zero. One way of making sure the constraint is satisfied is to divide the γ̃_j by 1 + E. From numerical experiments we know that the magnitude of E can be used to predict the accuracy of the extrapolated solution when we are close to the solution. In section 5.3 a heuristic argument for doing this normalization will be presented. Normalizing does not always have a positive effect. Normalization eliminates one component of the error (equations (5.2) and (5.3)) but introduces another. In section 5.3 it is shown that the error in the extrapolated solution can be eliminated when the residual is small. The norm of the residual can be used to determine when to normalize. Another criterion is to examine E, or perhaps to study E and the norm of the residual at the same time. In practice these three criteria seem to be equally effective.

3.3 Regularization

Some of the overdetermined systems that appear when we use extrapolation methods are examples of ill-posed problems. For ill-posed problems small perturbations in the data may cause large variations in the solution. To avoid this we introduce a well-posed approximation to the original problem, thus increasing the stability of the solution. This is the basic principle of regularization [10].

3.3.1 Solving overdetermined systems

Both RRE and MPE produce overdetermined systems that must be solved in order to find the extrapolated solution. If we have k < k_0 these systems are inconsistent; all equations cannot be satisfied at the same time. Instead the solution is computed in the least-squares sense. To compute this solution we can use a number of algorithms. The simplest of these is to solve the normal equations. From a parallel point of view the normal equations are superior to other algorithms for solving least-squares problems. Unfortunately it is not a very stable algorithm. Golub and Van Loan present an upper bound for the relative error in the solution of Ax = b [8] that grows with the condition number of the system, κ(A):

    ||x − x̃||_∞ / ||x||_∞ ≤ 4 u κ_∞(A)    (3.1)

where u is the floating point relative accuracy. For a square matrix the condition number is the ratio of the largest to the smallest singular value (for a symmetric matrix, the ratio of the largest to the smallest eigenvalue in magnitude). When the normal equations are used to solve the overdetermined system the condition number is squared. Since the systems that appear when extrapolation methods are used often have large condition numbers (sometimes as large as 10^6 – 10^7), we cannot expect the computed γ_j to be very accurate.

QR-factorization [8] is a better algorithm from a numerical point of view. If the overdetermined system is denoted Cx ≈ d, the QR-factorization of C is C = QR, where Q is orthogonal and R is upper triangular. The solution is found by solving the triangular system

    Rx = Q^T d.

More operations and storage are needed if QR-factorization is used, but since only orthogonal transformations are used the condition number of R is the same as the condition number of C, giving a more stable and accurate process. For comparison, the overdetermined systems were solved in Matlab using both QR-factorization and the normal equations. As expected the QR-factorization results in more accurate solutions, but it seems that the normal equations are to be preferred. To see why we recall the cycling algorithm. In every cycle we solve a system of equations. Since QR-factorization is much more time-consuming, it is necessary to reduce the number of needed cycles significantly for the implementation using QR-factorization to be more efficient. Such a reduction in the number of cycles has not yet been seen in our numerical experiments. More importantly, when implementing extrapolation methods in a parallel computer environment we want, for parallelization efficiency reasons, to avoid using a global QR-factorization. Instead we want to form the normal equations C^T C in parallel and then solve this small k × k linear system sequentially on every processor. An alternative to using QR-factorization is to use a regularization method to stabilize the normal equations. To illustrate this concept we use a regularization technique based upon the singular value decomposition.

3.3.2 Truncated SVD regularization

The eigenvalues are an important tool for analysing problems involving square matrices. For non-square matrices we introduce singular values, which can be seen as a generalization of eigenvalues. For symmetric positive semidefinite matrices the eigenvalues and the singular values coincide. The singular values are associated with the matrix decomposition given in Theorem 3. Since this is not a text on numerical linear algebra, only the concepts necessary for understanding regularization will be presented. Further information can be found in [8], from which the following theorem was taken.
Theorem 3 (Singular value decomposition (SVD)). If A is a real m-by-n matrix then there exist orthogonal matrices

    U = [u_1 ... u_m] ∈ R^{m×m}  and  V = [v_1 ... v_n] ∈ R^{n×n}

such that

    U^T A V = diag(σ_1, ..., σ_r) ∈ R^{m×n},  r = min{m, n},

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r ≥ 0.

Proof. See the proof in [8].

With the SVD it is possible to define a pseudo-inverse that can be used to find the solution to linear least-squares problems. Introducing

    Σ = diag(σ_1, ..., σ_r) ∈ R^{m×n},  r = min{m, n},

and

    Σ^+ = diag(σ_1^{−1}, ..., σ_r^{−1}) ∈ R^{n×m},  r = min{m, n},

C can be written

    C = U Σ V^T = Σ_{i=1}^{r} σ_i u_i v_i^T.

Using the orthogonality of U and V and the fact that Σ^+ Σ = I_{r×r}, we define C^+ = V Σ^+ U^T and the solution to Cx ≈ d as

    x = V Σ^+ U^T d = Σ_{i=1}^{r} (u_i^T d / σ_i) v_i.    (3.2)

It can be shown that this is the least-squares solution. The matrix V Σ^+ U^T is often referred to as the Moore-Penrose generalized inverse. An important property of the singular value decomposition is that it can be used to find the closest rank-deficient approximation of a matrix. If we set

    C_p = Σ_{i=1}^{p} σ_i u_i v_i^T,  p < r,

then C_p is the best approximation of rank p to C in the sense that ||C − C_p||_2 is minimized. Furthermore we have ||C − C_p||_2 = σ_{p+1}.

The characterization of ill-posed linear least-squares problems can now be stated (as given in [10]). The problem Cx ≈ d is said to be ill-posed if

  - the singular values of C decay gradually to zero,
  - the ratio between the largest and the smallest nonzero singular value is large.

The singular values of these systems often decrease gradually to zero without any distinct drop in magnitude. Another characteristic is that u_i and v_i have more sign changes in their elements as i increases. Considering the solution to the overdetermined system in (3.2), we notice that the solution is dominated by the terms corresponding to the small singular values. The many sign changes in the v_i will also be seen in the solution. Regularization is used to damp the contribution from the small singular values by truncating formula (3.2):

    x_reg = Σ_{i=1}^{p} (u_i^T d / σ_i) v_i.
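Formula (3.2) and its truncated counterpart can be sketched in a few lines of Python/NumPy; the parameter max_cond, which discards singular values with σ_1/σ_i above a largest permitted condition number, is an illustrative choice anticipating the discussion below.

    import numpy as np

    def tsvd_lstsq(C, d, max_cond=1e8):
        # Least-squares solution of C x ~ d via the SVD, truncating (3.2) after
        # the p singular values that satisfy sigma_1 / sigma_i <= max_cond.
        U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
        keep = sigma > sigma[0] / max_cond
        return Vt[keep].T @ ((U[:, keep].T @ d) / sigma[keep])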

With regularization the contribution to the error due to perturbations in d is reduced, but a new error due to the regularization appears. The original problem has been replaced:

    min_x ||Cx − d||_2  is replaced by  min_x ||C_p x − d||_2.

C has been replaced with the closest approximation of rank p. This is called truncated SVD regularization (TSVD). TSVD can also be motivated by looking at the condition number of the coefficient matrix. From the characterization of an ill-posed least-squares problem we know that the ratio between the largest and the smallest singular value is large. Since the condition number of C can be written σ_1/σ_r, the condition number is large as well. By using regularization the condition number of the coefficient matrix decreases. A different problem is solved, but the contributions from round-off errors are reduced. One difficulty with regularization is to balance the perturbation and regularization errors; whether or not regularization has any positive effect depends solely on how well this is done. To balance the errors an appropriate number of singular values must be discarded. The easiest way of doing this is to introduce a largest permitted condition number, τ. If the quotient σ_1/σ_i is larger than τ, σ_i is discarded. This approach has been used successfully to stabilize the solution of the normal equations and improve the convergence of the extrapolation methods.

3.4 Computational complexity

One way of measuring the efficiency of an algorithm is to count the number of operations required. This gives an estimate of the computing time. Here we will give the complexity for the extrapolation methods as well as some Krylov methods. The complexity of the Krylov methods was taken from [2] and is computed for methods based on modified Gram-Schmidt orthogonalization. For other orthogonalization methods, see [2] or [14]. The complexity in terms of vector operations is given in Table 1. Saxpy denotes the number of vector updates, z ← αx + y, and Matvec the number of matrix-vector multiplications with the original system matrix. All methods involve solving a system of linear equations. For the Krylov methods the coefficient matrix is a Hessenberg matrix, i.e. an upper triangular matrix with one additional subdiagonal. The extrapolation methods require the solution of an overdetermined system. It is assumed that this system is solved using the normal equations. These linear systems are small compared to the original problem; they are of size k × k or (k + 1) × (k + 1).

    Method              Inner products   Saxpy         Matvec   Linsys
    MPE                 (k² + 2k)/2      2k + 3        k        k × k (normal equations of an N × k system)
    RRE                 (k + 1)²/2       2k + 3        k        (k+1) × (k+1) (normal equations of an N × (k+1) system)
    Alternative RRE     (k + 1)²/2       3k + 3        k        k × k (normal equations of an N × k system)
    MMPE                0                2k + 2        k        k × k
    GMRES               (k² + 3k)/2      (k² + 3k)/2   k        Hessenberg
    The Arnoldi method  (k² + 3k)/2      (k² + 3k)/2   k        Hessenberg

    Table 1: Complexity

From Table 1 we see that the complexity is roughly the same for all methods. The only exception is MMPE, which does not require any inner products. The inner products of the other extrapolation methods come from forming the normal equations, and thus they are not needed for MMPE. The difference lies in the number of synchronization points needed for a parallel implementation. A synchronization point is a point in the program where all processors must have completed their tasks before the program can continue. This means that at each synchronization point all

processors must wait for the processor that requires the longest time to complete its task. One of the advantages of extrapolation methods over Krylov methods is that they require fewer synchronization points, as seen in Table 2. Instead of having GMRES's 2k synchronization points per cycle, we only have one (or two when using MMPE). For MPE and RRE one synchronization point is needed to form the normal equations. MMPE requires one synchronization point to compute the solution to the linear system and one to compute the residual. The number of synchronization points necessary for the Krylov methods is computed from an algorithm in [6] originally derived by de Sturler.

    Method                           Synchronization points
    RRE and MPE                      1
    MMPE                             2
    GMRES and the Arnoldi method     2k

    Table 2: Synchronization points

4 Numerical experiments

4.1 Test problems

Numerical experiments are important tools that can be used to validate or disprove assumptions and theories. It is important to choose test problems that reflect the properties of the real problems we wish to solve. Since iterative methods are widely used for the solution of sparse systems, it is appropriate to choose sparse test problems. One way to obtain sparse systems is to use finite differences to approximate differential equations. For most tests we have used the two-dimensional convection-diffusion equation

    ∂²u/∂x² + ∂²u/∂y² + β ∂u/∂x + γ ∂u/∂y = g(x, y)    (4.1)

with Dirichlet boundary conditions. g(x, y) is chosen to obtain a solution that is easy to verify. By varying β and γ the spectral radius of G can be modified, and thus the convergence rate of the stationary iterative method. Unless otherwise stated, whenever test problems are referred to in this section, we mean the two-dimensional convection-diffusion equation.

The second test problem was found in [20] and has been used to produce slowly converging sequences. Here we choose the iteration matrix directly as the tridiagonal Toeplitz matrix

    G = tridiag(c, a, b) ∈ R^{N×N},    (4.2)

i.e. a on the main diagonal, b on the superdiagonal, c on the subdiagonal and zeros elsewhere, and choose an arbitrary f. For this choice of G we can give an analytical expression for the eigenvalues provided that a, b and c are real,

    λ_j = a + 2 √(bc) cos(jπ/(N + 1)),  j = 1, 2, ..., N.

If a large N is chosen, most eigenvalues will lie in the proximity of a + 2√(bc) or a − 2√(bc). For most tests the Jacobi method has been used as the stationary method. For reasons of comparison the Gauss-Seidel method has been used as well (section 4.3.3). The numerical experiments were conducted on a serial computer using Matlab.
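The iteration matrix (4.2) and the eigenvalue formula are easy to reproduce. The sketch below uses the placeholder names a, b and c introduced above for the diagonal, superdiagonal and subdiagonal entries, and checks the formula for one arbitrary real parameter choice.

    import numpy as np

    def test_matrix(a, b, c, N):
        # Tridiagonal Toeplitz iteration matrix of test problem (4.2):
        # a on the diagonal, b on the superdiagonal, c on the subdiagonal.
        return a * np.eye(N) + b * np.eye(N, k=1) + c * np.eye(N, k=-1)

    # Check lambda_j = a + 2*sqrt(b*c)*cos(j*pi/(N+1)) for real parameters.
    N, a, b, c = 50, 0.2, 0.3, 0.1
    G = test_matrix(a, b, c, N)
    j = np.arange(1, N + 1)
    analytic = a + 2.0 * np.sqrt(b * c) * np.cos(j * np.pi / (N + 1))
    computed = np.sort(np.linalg.eigvals(G).real)
    print(np.max(np.abs(np.sort(analytic) - computed)))   # agrees to rounding error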

4.2 Choosing extrapolation method

For k = k_0 all polynomial extrapolation methods can be used to find the solution. If cycling is used for k < k_0, the choice of method becomes more interesting. All three methods have similar complexities and parallel properties, so the choice must be based on numerical properties. For RRE and MPE we must also choose a suitable algorithmic structure. We will start by comparing the different extrapolation methods and then discuss the choice of algorithmic structure. In general the most effective algorithmic structure has been used for comparison.

MMPE seems to be the least effective of the three methods. Perhaps this is a consequence of not using all the information in the generated vectors. To achieve the same convergence as MPE and RRE, MMPE requires a slightly larger k. Using a few initial iterations greatly improves the stability of the method. For diverging sequences MMPE has a tendency to stagnate, i.e., γ_1 is close to one and all the other γ_j are close to zero. For most sequences MPE and RRE seem to be equally efficient during the first few cycles. After that RRE is usually better, but there are exceptions. MPE is recommended for slowly converging sequences. RRE is always more efficient for diverging sequences, for which MPE may fail to converge if k is not large enough. Even for small values of k, RRE accelerates the convergence of divergent stationary methods.

For RRE the different algorithmic structures are mathematically equivalent for k < k_0 and behave alike. This is not true for MMPE. Different choices of Q_j lead to different linear systems and mathematically different algorithmic structures. A few different ways of choosing Q_j have been tried. The simplest way is to choose Q_j so that only k of the equations in U_{k−1}c = −u_k are considered. For parallel implementations it is advantageous to choose k equations next to each other, for example the first k equations. For sequences where information is propagated slowly in the vectors, this choice of Q_j is not good. If the k equations are chosen in a part of the system where the convergence is slow, the coefficient matrix will be nearly singular or singular. It is better to choose the equations distributed equally across the system. Even better results are obtained if the k equations are formed using all the N original equations. One way of doing this is to compute the sum of N/k equations to obtain one new equation. On a serial computer this is a fast operation. The convergence and stability properties are better than if just k equations are selected. It is conceivable to let Q_j vary between cycles to obtain a more adaptive method. An example of such an approach would be to select the equations corresponding to the k largest elements of the residual in the previous cycle. This has not been tried.

The conclusions here are based on experience and numerical testing and cannot easily be proved. In [21] Ford et al. compare RRE to MPE and come to the conclusion that MPE is at least as efficient as RRE. Here we have reached the opposite conclusion: RRE is at least as effective as MPE.

4.3 Empirical examination of techniques of improving convergence

In this section the techniques for improving the convergence discussed in section 3.2 are applied to the test problems in section 4.1. All techniques except the vector-discard technique are applied to the two-dimensional convection-diffusion equation.
The technique of discarding generated vectors is applied to the slowly converging vector sequence (4.2), also given in section 4.1.

4.3.1 Initial iterations

The two-dimensional convection-diffusion equation with β = −2.8211 and γ = 4.0053 is solved on a grid. Figure 1 shows the convergence for RRE with and without initial iterations, with k = 10. Ten of the eigenvalues of G are very small (< 10^{−15}). To find a solution for which the L_2-norm of the residual meets the tolerance, 143 vectors are generated with one initial iteration per cycle and 117 with two initial iterations per cycle, fewer than are needed without initial iterations.

Figure 1: Initial iterations. (L2 norm of the residual against cycle number for RRE with no initial iterations, 1 initial iteration and 2 initial iterations.)

Not only do we benefit from having to generate fewer vectors, there are also fewer extrapolation steps and thus fewer linear systems to solve.

4.3.2 Discarding generated vectors

Figure 2 shows an example of a case where the eigenvalues of the iteration matrix are complex with magnitudes close to one. Test problem (4.2) with 50 unknowns and parameters a = 0.03, b = 0.015 + 0.5i, c = −0.09 − 0.45i is solved using RRE with k = 5, keeping only every q-th vector. The Jacobi method converges slowly due to eigenvalues close to one (the spectral radius is approximately 0.986), and the oscillations in the generated sequence cause some instability in the extrapolation process, which can be seen in the oscillating residual. When every other vector is discarded (q = 2) the convergence and stability are greatly improved. The oscillations can also be damped by choosing a larger k. With k = 10 the two methods converge to the given tolerance in 6 and 14 cycles, respectively.

4.3.3 Choosing stationary method

In Figure 3 both the Jacobi and the Gauss-Seidel method have been applied to a discretization of the two-dimensional convection-diffusion equation with 225 unknowns and β = 1.1899 and γ = 16.5527. RRE has been used with both stationary methods. In this example we do not benefit from using the Gauss-Seidel method. We only gain one cycle, and that is not enough to compensate for the extra time needed to generate the sequence. It is not difficult to find cases where far fewer cycles are needed for the Gauss-Seidel than for the Jacobi method (sometimes only half as many). It seems, however, that it is difficult to find cases where we find the solution faster by using the Gauss-Seidel method, even in the serial case.

Figure 2: Discarding generated vectors. (L2 norm of the residual against cycle number for RRE with q = 2, RRE with q = 1, and the Jacobi method.)

Figure 3: Different stationary methods. (L2 norm of the residual against cycle number for RRE using Jacobi, RRE using Gauss-Seidel, the Jacobi method and the Gauss-Seidel method.)
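For completeness, the sequence generator that combines the two techniques examined above, n initial iterations (section 3.2.1) and keeping only every q-th iterate (section 3.2.2), can be sketched as follows; the interface is only illustrative and produces the columns needed for the extrapolation sketches given earlier.

    import numpy as np

    def generate_sequence(G, f, x0, k, n_init=0, q=1):
        # Perform n_init initial iterations, then keep every q-th iterate,
        # which is equivalent to iterating with G^q; returns the k + 2 kept
        # iterates as columns, ready for forming U_k and V_{k-1}.
        x = x0
        for _ in range(n_init):
            x = G @ x + f
        seq = [x]
        for _ in range(k + 1):
            for _ in range(q):
                x = G @ x + f
            seq.append(x)
        return np.column_stack(seq)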


by the matrix A results in a vector which is a reflection of the given Eigenvalues & Eigenvectors Example Suppose Then So, geometrically, multiplying a vector in by the matrix A results in a vector which is a reflection of the given vector about the y-axis We observe that

More information

1 Example of Time Series Analysis by SSA 1

1 Example of Time Series Analysis by SSA 1 1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales

More information

AN INTRODUCTION TO NUMERICAL METHODS AND ANALYSIS

AN INTRODUCTION TO NUMERICAL METHODS AND ANALYSIS AN INTRODUCTION TO NUMERICAL METHODS AND ANALYSIS Revised Edition James Epperson Mathematical Reviews BICENTENNIAL 0, 1 8 0 7 z ewiley wu 2007 r71 BICENTENNIAL WILEY-INTERSCIENCE A John Wiley & Sons, Inc.,

More information

Data analysis in supersaturated designs

Data analysis in supersaturated designs Statistics & Probability Letters 59 (2002) 35 44 Data analysis in supersaturated designs Runze Li a;b;, Dennis K.J. Lin a;b a Department of Statistics, The Pennsylvania State University, University Park,

More information

ALGEBRAIC EIGENVALUE PROBLEM

ALGEBRAIC EIGENVALUE PROBLEM ALGEBRAIC EIGENVALUE PROBLEM BY J. H. WILKINSON, M.A. (Cantab.), Sc.D. Technische Universes! Dsrmstedt FACHBEREICH (NFORMATiK BIBL1OTHEK Sachgebieto:. Standort: CLARENDON PRESS OXFORD 1965 Contents 1.

More information

3 Orthogonal Vectors and Matrices

3 Orthogonal Vectors and Matrices 3 Orthogonal Vectors and Matrices The linear algebra portion of this course focuses on three matrix factorizations: QR factorization, singular valued decomposition (SVD), and LU factorization The first

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Iterative Methods for Solving Linear Systems

Iterative Methods for Solving Linear Systems Chapter 5 Iterative Methods for Solving Linear Systems 5.1 Convergence of Sequences of Vectors and Matrices In Chapter 2 we have discussed some of the main methods for solving systems of linear equations.

More information

6. Cholesky factorization

6. Cholesky factorization 6. Cholesky factorization EE103 (Fall 2011-12) triangular matrices forward and backward substitution the Cholesky factorization solving Ax = b with A positive definite inverse of a positive definite matrix

More information

Orthogonal Bases and the QR Algorithm

Orthogonal Bases and the QR Algorithm Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries

More information

Lecture 1: Schur s Unitary Triangularization Theorem

Lecture 1: Schur s Unitary Triangularization Theorem Lecture 1: Schur s Unitary Triangularization Theorem This lecture introduces the notion of unitary equivalence and presents Schur s theorem and some of its consequences It roughly corresponds to Sections

More information

α = u v. In other words, Orthogonal Projection

α = u v. In other words, Orthogonal Projection Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

The Characteristic Polynomial

The Characteristic Polynomial Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

More information

160 CHAPTER 4. VECTOR SPACES

160 CHAPTER 4. VECTOR SPACES 160 CHAPTER 4. VECTOR SPACES 4. Rank and Nullity In this section, we look at relationships between the row space, column space, null space of a matrix and its transpose. We will derive fundamental results

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors Chapter 6 Eigenvalues and Eigenvectors 6. Introduction to Eigenvalues Linear equations Ax D b come from steady state problems. Eigenvalues have their greatest importance in dynamic problems. The solution

More information

Chapter 6. Orthogonality

Chapter 6. Orthogonality 6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Recall that two vectors in are perpendicular or orthogonal provided that their dot

Recall that two vectors in are perpendicular or orthogonal provided that their dot Orthogonal Complements and Projections Recall that two vectors in are perpendicular or orthogonal provided that their dot product vanishes That is, if and only if Example 1 The vectors in are orthogonal

More information

Examination paper for TMA4205 Numerical Linear Algebra

Examination paper for TMA4205 Numerical Linear Algebra Department of Mathematical Sciences Examination paper for TMA4205 Numerical Linear Algebra Academic contact during examination: Markus Grasmair Phone: 97580435 Examination date: December 16, 2015 Examination

More information

CONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation

CONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation Chapter 2 CONTROLLABILITY 2 Reachable Set and Controllability Suppose we have a linear system described by the state equation ẋ Ax + Bu (2) x() x Consider the following problem For a given vector x in

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

LINEAR ALGEBRA. September 23, 2010

LINEAR ALGEBRA. September 23, 2010 LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J. Olver 6. Eigenvalues and Singular Values In this section, we collect together the basic facts about eigenvalues and eigenvectors. From a geometrical viewpoint,

More information

26. Determinants I. 1. Prehistory

26. Determinants I. 1. Prehistory 26. Determinants I 26.1 Prehistory 26.2 Definitions 26.3 Uniqueness and other properties 26.4 Existence Both as a careful review of a more pedestrian viewpoint, and as a transition to a coordinate-independent

More information

Section 6.1 - Inner Products and Norms

Section 6.1 - Inner Products and Norms Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,

More information

Inner product. Definition of inner product

Inner product. Definition of inner product Math 20F Linear Algebra Lecture 25 1 Inner product Review: Definition of inner product. Slide 1 Norm and distance. Orthogonal vectors. Orthogonal complement. Orthogonal basis. Definition of inner product

More information

BANACH AND HILBERT SPACE REVIEW

BANACH AND HILBERT SPACE REVIEW BANACH AND HILBET SPACE EVIEW CHISTOPHE HEIL These notes will briefly review some basic concepts related to the theory of Banach and Hilbert spaces. We are not trying to give a complete development, but

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

Linear Algebraic Equations, SVD, and the Pseudo-Inverse

Linear Algebraic Equations, SVD, and the Pseudo-Inverse Linear Algebraic Equations, SVD, and the Pseudo-Inverse Philip N. Sabes October, 21 1 A Little Background 1.1 Singular values and matrix inversion For non-smmetric matrices, the eigenvalues and singular

More information

On closed-form solutions of a resource allocation problem in parallel funding of R&D projects

On closed-form solutions of a resource allocation problem in parallel funding of R&D projects Operations Research Letters 27 (2000) 229 234 www.elsevier.com/locate/dsw On closed-form solutions of a resource allocation problem in parallel funding of R&D proects Ulku Gurler, Mustafa. C. Pnar, Mohamed

More information

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5. PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

More information

University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7. Review University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

1 Sets and Set Notation.

1 Sets and Set Notation. LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix 7. LU factorization EE103 (Fall 2011-12) factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization algorithm effect of rounding error sparse

More information

13 MATH FACTS 101. 2 a = 1. 7. The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

13 MATH FACTS 101. 2 a = 1. 7. The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions. 3 MATH FACTS 0 3 MATH FACTS 3. Vectors 3.. Definition We use the overhead arrow to denote a column vector, i.e., a linear segment with a direction. For example, in three-space, we write a vector in terms

More information

1. Introduction. Consider the computation of an approximate solution of the minimization problem

1. Introduction. Consider the computation of an approximate solution of the minimization problem A NEW TIKHONOV REGULARIZATION METHOD MARTIN FUHRY AND LOTHAR REICHEL Abstract. The numerical solution of linear discrete ill-posed problems typically requires regularization, i.e., replacement of the available

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

More information

NORTHWESTERN UNIVERSITY Department of Electrical Engineering and Computer Science LARGE SCALE UNCONSTRAINED OPTIMIZATION by Jorge Nocedal 1 June 23, 1996 ABSTRACT This paper reviews advances in Newton,

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

Oscillations of the Sending Window in Compound TCP

Oscillations of the Sending Window in Compound TCP Oscillations of the Sending Window in Compound TCP Alberto Blanc 1, Denis Collange 1, and Konstantin Avrachenkov 2 1 Orange Labs, 905 rue Albert Einstein, 06921 Sophia Antipolis, France 2 I.N.R.I.A. 2004

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

More information

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively. Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

More information

Orthogonal Diagonalization of Symmetric Matrices

Orthogonal Diagonalization of Symmetric Matrices MATH10212 Linear Algebra Brief lecture notes 57 Gram Schmidt Process enables us to find an orthogonal basis of a subspace. Let u 1,..., u k be a basis of a subspace V of R n. We begin the process of finding

More information

Solving Linear Systems, Continued and The Inverse of a Matrix

Solving Linear Systems, Continued and The Inverse of a Matrix , Continued and The of a Matrix Calculus III Summer 2013, Session II Monday, July 15, 2013 Agenda 1. The rank of a matrix 2. The inverse of a square matrix Gaussian Gaussian solves a linear system by reducing

More information

4.3 Lagrange Approximation

4.3 Lagrange Approximation 206 CHAP. 4 INTERPOLATION AND POLYNOMIAL APPROXIMATION Lagrange Polynomial Approximation 4.3 Lagrange Approximation Interpolation means to estimate a missing function value by taking a weighted average

More information

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition K. Osypov* (WesternGeco), D. Nichols (WesternGeco), M. Woodward (WesternGeco) & C.E. Yarman (WesternGeco) SUMMARY Tomographic

More information

8 Square matrices continued: Determinants

8 Square matrices continued: Determinants 8 Square matrices continued: Determinants 8. Introduction Determinants give us important information about square matrices, and, as we ll soon see, are essential for the computation of eigenvalues. You

More information

3. INNER PRODUCT SPACES

3. INNER PRODUCT SPACES . INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS KEITH CONRAD 1. Introduction The Fundamental Theorem of Algebra says every nonconstant polynomial with complex coefficients can be factored into linear

More information

The Image Deblurring Problem

The Image Deblurring Problem page 1 Chapter 1 The Image Deblurring Problem You cannot depend on your eyes when your imagination is out of focus. Mark Twain When we use a camera, we want the recorded image to be a faithful representation

More information

MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.

MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V V (x,y) x + y V and scalar

More information

( ) which must be a vector

( ) which must be a vector MATH 37 Linear Transformations from Rn to Rm Dr. Neal, WKU Let T : R n R m be a function which maps vectors from R n to R m. Then T is called a linear transformation if the following two properties are

More information

Numerical Analysis An Introduction

Numerical Analysis An Introduction Walter Gautschi Numerical Analysis An Introduction 1997 Birkhauser Boston Basel Berlin CONTENTS PREFACE xi CHAPTER 0. PROLOGUE 1 0.1. Overview 1 0.2. Numerical analysis software 3 0.3. Textbooks and monographs

More information

NOTES ON LINEAR TRANSFORMATIONS

NOTES ON LINEAR TRANSFORMATIONS NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

Metric Spaces. Chapter 7. 7.1. Metrics

Metric Spaces. Chapter 7. 7.1. Metrics Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

More information

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued). MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

More information

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008 A tutorial on: Iterative methods for Sparse Matrix Problems Yousef Saad University of Minnesota Computer Science and Engineering CRM Montreal - April 30, 2008 Outline Part 1 Sparse matrices and sparsity

More information

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i. Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(

More information

RESULTANT AND DISCRIMINANT OF POLYNOMIALS

RESULTANT AND DISCRIMINANT OF POLYNOMIALS RESULTANT AND DISCRIMINANT OF POLYNOMIALS SVANTE JANSON Abstract. This is a collection of classical results about resultants and discriminants for polynomials, compiled mainly for my own use. All results

More information

1 Review of Least Squares Solutions to Overdetermined Systems

1 Review of Least Squares Solutions to Overdetermined Systems cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares

More information

Mean value theorem, Taylors Theorem, Maxima and Minima.

Mean value theorem, Taylors Theorem, Maxima and Minima. MA 001 Preparatory Mathematics I. Complex numbers as ordered pairs. Argand s diagram. Triangle inequality. De Moivre s Theorem. Algebra: Quadratic equations and express-ions. Permutations and Combinations.

More information

4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION

4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION 4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION STEVEN HEILMAN Contents 1. Review 1 2. Diagonal Matrices 1 3. Eigenvectors and Eigenvalues 2 4. Characteristic Polynomial 4 5. Diagonalizability 6 6. Appendix:

More information

Linear Algebra I. Ronald van Luijk, 2012

Linear Algebra I. Ronald van Luijk, 2012 Linear Algebra I Ronald van Luijk, 2012 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents 1. Vector spaces 3 1.1. Examples 3 1.2. Fields 4 1.3. The field of complex numbers. 6 1.4.

More information

Lecture 5 Least-squares

Lecture 5 Least-squares EE263 Autumn 2007-08 Stephen Boyd Lecture 5 Least-squares least-squares (approximate) solution of overdetermined equations projection and orthogonality principle least-squares estimation BLUE property

More information