A Comparison of 1-D and 2-D Data Mapping for Sparse LU Factorization with Partial Pivoting

Cong Fu, Xiangmin Jiao, Tao Yang
Department of Computer Science, University of California, Santa Barbara, CA 93106
Supported by NSF RIA CCR-9409695 and CDA-9529418, and by a startup fund from the University of California at Santa Barbara.

Abstract. This paper presents a comparative study of two data mapping schemes for parallel sparse LU factorization with partial pivoting on distributed memory machines. Our previous work developed an approach that incorporates static symbolic factorization, nonsymmetric L/U supernode partitioning and graph scheduling for this problem with 1-D column-block mapping. The 2-D mapping is commonly considered more scalable for general matrix computation, but it is difficult to incorporate efficiently into sparse LU because partial pivoting and row interchanges require frequent, synchronized inter-processor communication. We have developed an asynchronous sparse LU algorithm using 2-D block mapping and obtained competitive performance on the Cray-T3D. We report our preliminary studies on the speedups, scalability, communication cost and memory requirements of this algorithm and compare it with the 1-D approach.

1 Introduction

Efficient parallelization of sparse LU factorization with pivoting is important to many scientific applications. Unlike sparse Cholesky factorization, for which the parallelization problem has been relatively well solved [3, 8, 12, 13], sparse LU factorization is much harder to parallelize because of the dynamic nature introduced by pivoting operations. Previous work has addressed parallelization issues using shared memory platforms or restricted pivoting [6, 7, 9]. In [4] we proposed a novel approach that integrates three key strategies in parallelizing this algorithm on distributed memory machines: 1) adopt a static symbolic factorization scheme [7] to eliminate the data structure variation caused by dynamic pivoting; 2) identify data regularity in the sparse structure obtained by the symbolic factorization so that efficient dense operations can be used to perform most of the computation; 3) use graph scheduling techniques and an efficient run-time support called RAPID [3] to exploit irregular parallelism. The preliminary experiments are encouraging and good performance results are obtained with 1-D data mapping [4].

In the literature, 2-D mapping has been shown to be more scalable than 1-D mapping for dense LU factorization and sparse Cholesky [13, 14]. In [10] a sparse solver with element-wise 2-D mapping is presented. For better cache performance, block partitioning is preferred [1]. However, there are several difficulties in applying 2-D block-oriented mapping to sparse LU factorization, even when the static structure is predicted in advance. First of all, pivoting operations and row interchanges require frequent and well-synchronized inter-processor communication when submatrices in the same column block are assigned to different processors. In order to exploit irregular parallelism, we need to deal with irregular and asynchronous communication, which requires delicate message buffer management. Secondly, it is difficult to model the irregular parallelism of sparse LU; using the elimination tree of AA^T is possible but not accurate. Lastly, space complexity is another issue: exploiting irregular parallelism to the maximum degree may require more buffer space.

This paper presents our preliminary work on the design of an asynchronous algorithm for sparse LU with 2-D mapping. Section 2 briefly reviews the static symbolic factorization, sparse matrix partitioning and 1-D algorithms. Section 3 presents the 2-D algorithm. Section 4 discusses the advantages and disadvantages of the 1-D and 2-D codes. Section 5 presents the experimental results. Section 6 concludes the paper.

2 Background and 1-D approaches

The purpose of sparse LU factorization is to find two matrices L and U for a given nonsymmetric sparse matrix A such that PA = LU, where L is a unit lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix containing the pivoting information. Sparse LU is very hard to parallelize because pivot selection and row interchanges dynamically introduce fill-ins and change the L/U data structures. Using the precise pivoting information at each elimination step can certainly optimize data space usage and improve load balance, but it leads to high run-time overhead. In this section we briefly discuss some techniques used in our S parallel sparse LU algorithm [4].

Static symbolic factorization. Static symbolic factorization is proposed in [7] to identify the worst-case nonzero patterns. The basic idea is to statically consider all possible pivoting choices at each step, and then allocate space for all possible nonzeros that could be introduced by any pivoting sequence occurring during the numerical factorization. The static approach avoids data structure expansion during the numerical factorization. Dynamic factorization, which is used in the efficient sequential code SuperLU [11], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU on distributed memory machines; currently the SuperLU group has been working on shared memory parallelizations [11].
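
The worst-case idea can be illustrated with a small sketch. This is only an illustration of the principle, not the data structures of [4] or [7]; the function name and the set-of-column-indices representation below are ours:

    def static_symbolic_fill(pattern):
        """Over-estimate the fill of LU with partial pivoting by considering
        every pivot choice at every step. pattern is a list of Python sets;
        pattern[i] holds the column indices of the nonzeros in row i of A."""
        n = len(pattern)
        for k in range(n):
            # Any row i >= k with a nonzero in column k is a candidate pivot row.
            candidates = [i for i in range(k, n) if k in pattern[i]]
            # Union of the candidate structures to the right of column k.
            union = set()
            for i in candidates:
                union |= {j for j in pattern[i] if j >= k}
            # Whichever row is picked, every other candidate row may be updated
            # by it, so space is reserved for the full union in all of them.
            for i in candidates:
                pattern[i] |= union
        return pattern
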
L/U supernode partitioning. After the nonzero fill-in pattern of the matrix has been predicted, the matrix is further partitioned using a supernode approach to improve cache performance. In [11], a nonsymmetric supernode is defined as a group of consecutive columns in which the corresponding L factor has a dense lower triangular block on the diagonal and the same nonzero pattern below the diagonal. Based on this definition, in each column block the L part contains only dense subrows; we call this partitioning method L supernode partitioning. Here by "subrow" we mean the contiguous part of a row within a supernode. For SuperLU it is difficult to exploit structural regularity in a U factor after L supernode partitioning. In our approach, however, the nonzeros (including overestimated fill-ins) in the U factor can be clustered into dense columns or subcolumns. The U partitioning strategy is as follows: after an L supernode partition has been obtained on a sparse matrix A, i.e., a set of column blocks with possibly different block sizes, the same partitioning is applied to the rows of the matrix to further break each supernode into submatrices.

Now each off-diagonal submatrix in the L part either is a dense block or contains dense blocks. Furthermore, in [4] we have shown that each nonzero submatrix in the U factor of A contains only dense subcolumns. This is the key to maximizing the use of BLAS-3 subroutines in our algorithm. Figure 1(a) illustrates the dense patterns in a partitioned sparse matrix.
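
Since the surviving submatrices are dense blocks (or consist of dense subcolumns), the innermost operation of an update reduces to a dense matrix product. A minimal numpy sketch, with an illustrative function name and fully dense b x b blocks assumed:

    import numpy as np

    def block_update(A_ij, A_ik, A_kj):
        """Inner kernel of Update(k, j): A_ij <- A_ij - A_ik * A_kj.
        With dense blocks this maps directly onto one BLAS-3 (GEMM) call."""
        A_ij -= np.dot(A_ik, A_kj)
        return A_ij
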

Data mapping. For block-oriented matrix computation, 1-D column-block cyclic mapping and 2-D block cyclic mapping are commonly used. In 1-D column-block cyclic mapping, as illustrated in Figure 1(b), column block A_{:,j} is assigned to processor P_{j mod p}, where p is the number of processors. Each column block is called a panel in [12]. A 2-D block cyclic mapping views the processors as a 2-D r x s grid, and a block is the minimum unit of data mapping: a nonzero submatrix block A_{i,j} is assigned to processor P_{i mod r, j mod s}, as illustrated in Figure 1(c). The 2-D mapping is usually considered more scalable than 1-D mapping for Cholesky because it tends to have better computational load balance and lower communication volume. However, the 2-D mapping introduces more overhead for pivoting and row swapping.

Fig. 1. (a) A partitioned sparse matrix. (b) A 1-D cyclic mapping of column blocks (panels). (c) A 2-D cyclic mapping of blocks.

The 1-D program partitioning. Given an n x n matrix A, assume that after the matrix partitioning it has N panels, or N x N blocks. The maximum number of columns in a panel is b. Each panel k is associated with two types of tasks, Factor(k) and Update(k, j) for 1 <= k < j <= N. 1) Task Factor(k) factorizes all the columns in the k-th panel, including finding the pivoting sequence associated with those columns and updating the lower triangular portion of panel k. The pivoting sequence is held until the factorization of the k-th panel is completed; then the pivoting sequence is applied to the rest of the matrix. This is called "delayed pivoting" [2]. The Factor() tasks mainly use BLAS-1 and BLAS-2 subroutines. 2) Task Update(k, j) uses panel k to modify panel j. It includes "row swapping" using the pivoting result derived by Factor(k), "scaling", which uses the factorized A_{k,k} to scale A_{k,j}, and "updating", which uses A_{i,k} and A_{k,j} to modify A_{i,j}. The Update() tasks mainly use BLAS-2 and BLAS-3 subroutines.

1-D approaches. We call the set of updating tasks {Update(k, j) | k < j <= N} the stage-k matrix updating. In [4] we implemented two different parallel methods with 1-D data mapping, as follows.

The compute-ahead algorithm (CA). This algorithm uses column-block cyclic mapping. The pivoting process proceeds in the order of k, i.e., Factor(k+1) is started after Factor(k), but the Update() tasks can be overlapped. The CA schedule tries to execute the pivoting tasks as early as possible since they are normally on the critical path of the computation.
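
The two mappings and the per-stage task set are easy to state directly in code. A short sketch (owner_1d, owner_2d and stage_tasks are illustrative helpers, not the paper's code, and block indices are 0-based here):

    def owner_1d(j, p):
        # 1-D column-block cyclic mapping: panel A[:, j] lives on processor j mod p.
        return j % p

    def owner_2d(i, j, r, s):
        # 2-D block cyclic mapping on an r x s grid:
        # block A[i, j] lives on processor (i mod r, j mod s).
        return (i % r, j % s)

    def stage_tasks(k, N):
        # Stage k under delayed pivoting: Factor(k) first, then the Update(k, j)
        # tasks for j = k+1, ..., N-1, which may be overlapped.
        return [("Factor", k)] + [("Update", k, j) for j in range(k + 1, N)]
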

The 1-D RAPID code. This code uses a DAG to model the irregular parallelism and the RAPID system [3] to schedule the tasks. RAPID then executes the DAG on a distributed memory machine using a low-overhead communication scheme. In [4] we show that the 1-D RAPID code outperforms the CA code because graph scheduling exploits the parallelism more aggressively.

While the above approaches are reasonable as a first step toward efficient parallelization of sparse LU on distributed memory machines, they have the disadvantage that 1-D partitioning exposes limited parallelism. Also, in order to support concurrency among multiple updating stages in CA, multiple buffers are needed to keep pivoting panels on each processor, so the space complexity is high. The current RAPID implementation in [4] also uses extra memory space to support general irregular computations.

3 The 2-D algorithm

In this section we present an asynchronous sparse LU factorization using 2-D mapping. We still employ optimization strategies such as static symbolic factorization, nonsymmetric L/U supernode partitioning and an amalgamation strategy. The computation of a task Factor(k) or Update(k, j) is distributed to a column of nodes in the processor grid. The main challenge is the coordination of pivoting and data swapping across a subset of processors so as to minimize the overhead, exploit the maximum amount of parallelism and maintain low space usage. The ideas of our approach are summarized below. The goal is to eliminate global barriers in the code design, which maximizes concurrency. For simplicity of description, we assume that the processor grid is a square of size sqrt(p) x sqrt(p).

The computation flow is still controlled by the pivoting tasks Factor(k); namely, Factor(k+1) starts after Factor(k). However, the Update() tasks can be overlapped, and the maximum degree of overlapping is sqrt(p). Naturally it would be ideal if the pivoting operations could themselves be conducted in parallel as much as possible, but we have not found a good way to do this without using graph scheduling techniques. Thus a certain amount of parallelism is lost, while the code remains relatively easy to implement.

The pivoting process within Factor(k) is parallelized as follows. Call the processor that owns the block A_{k,k} processor P_{k,k}. For each column m in the k-th panel, each processor in this column finds the local maximum value and the corresponding subrow index independently, and sends the subrow with the local maximum to P_{k,k}. After receiving all the subrows with the local maxima, processor P_{k,k} finds the global maximum and the corresponding subrow index t from the collected subrows, and multicasts subrow t to the processors in the same column for further updating of the lower triangular portion of panel k. It also sends the original m-th subrow in A_{k,k} to the processor owning the global maximum for swapping. Notice that the whole subrow is communicated when each processor reports its local maximum to P_{k,k}; in this way P_{k,k} can swap in subrow t locally without further synchronization. This shortens the waiting time before further updating, at the cost of a little more communication volume. In the algorithm, each processor maintains sqrt(p) buffers to collect the possible selected subrows for pivoting, and each buffer is of size b. The synchronization point is at P_{k,k}, which performs a global maximum selection once each processor has reported its local maximum.
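
The pivot-selection protocol inside Factor(k) can be sketched with mpi4py over one column of the processor grid. This is only an illustration of the message pattern, not the Cray-T3D implementation; the function and argument names are ours, and local_subrows is assumed to map global subrow indices to the dense subrows held by this processor.

    from mpi4py import MPI

    def select_pivot(col_comm, local_subrows, m, diag_rank):
        """One pivot selection for column m of panel k among the processors in
        one grid column; diag_rank is the rank (in col_comm) owning A_{k,k}."""
        # 1. Find the local maximum in column m and keep the whole candidate
        #    subrow, so the owner of A_{k,k} can swap later without another round trip.
        local_best = None
        for idx, subrow in local_subrows.items():
            if local_best is None or abs(subrow[m]) > abs(local_best[1][m]):
                local_best = (idx, subrow)
        candidates = col_comm.gather(local_best, root=diag_rank)

        # 2. The owner of A_{k,k} picks the global maximum among the candidates.
        winner = None
        if col_comm.Get_rank() == diag_rank:
            winner = max((c for c in candidates if c is not None),
                         key=lambda c: abs(c[1][m]))

        # 3. Multicast the winning subrow t down the column so every processor
        #    can update its part of the lower triangle of panel k.
        winner = col_comm.bcast(winner, root=diag_rank)
        return winner  # (global subrow index t, subrow values)
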
The updating of the lower triangular portion of panel k in Factor(k) can be conducted in parallel as soon as the selected subrow t is available locally. After all the columns in panel k are factorized, the pivoting sequence derived by Factor(k) is broadcast to all processors. Since Factor(k) is performed in a column of the processor grid, each processor in this column has the pivoting sequence; thus the pivoting sequence can be broadcast along the rows of the processor grid simultaneously. While the other processors in the grid are conducting swapping operations, processor P_{k,k} broadcasts the factorized A_{k,k} along its own processor row. Similarly, each processor in the same processor column as P_{k,k} broadcasts its updated part of panel k along the row direction of the processor grid. Considering that at most sqrt(p) stages of matrix updating can be overlapped, each processor should have sqrt(p) buffers to receive the results of Factor(), and each buffer has size Nb^2/sqrt(p).

The first thing to be done in Update(k, j) is row swapping. Swapping in different updating stages can be overlapped on different columns of processors. Each processor needs to swap some subrows out while receiving others. There are at most p buffers, and each buffer needs to hold swapping subrows from other processors, with size Nb/sqrt(p). Scaling in Update(k, j) uses A_{k,k} to modify A_{k,j}; it can be done in parallel immediately after row swapping. After scaling, the nonzero blocks of A_{k,j}, k+1 <= j <= N, are broadcast along the column direction of the processor grid. Considering the maximum overlap of updating stages, each processor has sqrt(p) buffers and each buffer is of size Nb^2/sqrt(p). The final updating of the submatrix A_{k+1:N, k+1:N} in Update(k, j), k+1 <= j <= N, can be done in parallel: if the nonzero submatrices A_{i,k} and A_{k,j} are available, they can be used to update the nonzero submatrix A_{i,j}.

4 A comparison with 1-D data mapping

We now study the performance issues of the 2-D algorithm and compare it with the two 1-D algorithms. Let the number of nonzero lower triangular submatrices be X and the number of nonzero upper triangular submatrices be Y. The total number of nonzero blocks in matrix A is Z = X + Y + N. Let n be the dimension of the matrix. Again, for simplicity of description we assume that the processor grid is sqrt(p) x sqrt(p) and that each submatrix is a b x b dense block; in the actual implementation this is not always true. We also assume that Y ~ X ~ Z/2 and X, Y >> N.

Parallelism and load balancing. Since the operating units are submatrices instead of panels, the 2-D algorithm exploits parallelism at the submatrix level. The degree of possible parallelism in the 2-D algorithm is O(Z), while that of the 1-D algorithms is O(N). The finer granularity of the 2-D algorithm gives more flexibility for load balancing; the 1-D approach has limited parallelism, so load balancing is more difficult.

Memory requirement. Blocks are distributed evenly onto processors, but each processor needs extra buffer space to hold messages for asynchronous execution. In the CA algorithm, the extra memory requirement for communication is about Zb^2(1 - 1/p). The 2-D algorithm needs only about 2Nb^2 of buffer space; thus the 2-D algorithm is capable of solving larger problem instances. The implementation of RAPID in [3] also requires about Zb^2(1 - 1/p) space per processor for holding messages.

Communication volume. Assume that at each stage the probability of swapping rows across processors (due to pivoting) is q. In the 1-D approach the total communication volume is about p(X + N)b^2 ~ 0.5pZb^2, while in the 2-D approach it is about (sqrt(p)(Z + 2N) + 2qY)b^2 ~ sqrt(p)Zb^2. Thus the 2-D algorithm has a smaller communication volume than the 1-D algorithm.
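
The buffer-space and communication-volume estimates above are easy to compare numerically. A small helper (a sketch only; the symbol names follow the paper, and the values are counted in matrix entries):

    import math

    def cost_estimates(N, X, Y, b, p, q):
        """N: blocks per dimension, X/Y: nonzero L/U submatrices, b: block size,
        p: number of processors, q: probability of cross-processor row swaps."""
        Z = X + Y + N
        return {
            "comm_1d": p * (X + N) * b**2,                               # ~ 0.5 p Z b^2
            "comm_2d": (math.sqrt(p) * (Z + 2 * N) + 2 * q * Y) * b**2,  # ~ sqrt(p) Z b^2
            "buffers_1d_ca": Z * b**2 * (1 - 1.0 / p),                   # extra space, CA
            "buffers_2d": 2 * N * b**2,                                  # extra space, 2-D
        }
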
The total number of messages. The total number of messages for 1-D is about np, all of which are for broadcasting panels. In the 2-D approach it is about sqrt(p)(2n + Z) + 2qn, and the messages include those for finding pivots, row swapping and submatrix broadcasting.

The above analysis shows that 2-D mapping has better scalability than 1-D mapping, because 2-D mapping has more flexibility for load balancing, a smaller memory requirement and a smaller communication volume. However, 2-D mapping leads to more frequent communication.

5 Comparative experimental studies

Our experiments are conducted on the Cray-T3D. Each processing element of the T3D has 64 megabytes of memory with an 8K cache; the BLAS-3 routine DGEMM can achieve 103 MFLOPS. The test matrices are from the Harwell-Boeing collection, except for the "goodwin" matrix.

Overall performance of the 2-D code. The overall speedups of the 2-D code are shown in Figure 2(a). The results show that the 2-D code scales up to 64 processors for these matrices, and the speedups are reasonable. All speedups are obtained by comparing with the S sequential code, which is on average 1.4 times slower than the SuperLU sequential code for these cases [4]. In Figure 2(b) we compare the average parallel time differences between the 2-D approach and the two 1-D approaches by computing the ratio PT_1D/PT_2D - 1. The 1-D RAPID code achieves the best performance because it uses a sophisticated graph scheduling technique to guide the mapping of panels, which results in better overlapping of communication with computation. The 1-D CA code uses a small amount of scheduling and performs slightly better than the 2-D code on a small number of processors, but on 32 and 64 processors the 2-D code catches up. We explain this below by analyzing the load balance factor.

Fig. 2. (a) Parallel speedups of sparse LU factorization using 2-D block mapping on the Cray-T3D (matrices b33_5600, goodwin, saylr4, orsreg1). (b) Average parallel time difference comparisons (1-D RAPID vs. 2-D and 1-D CA vs. 2-D).

Scalability and load balancing. We measure the scalability of an algorithm by S(2^i) = PT(2^(i-1)) / PT(2^i), where PT(2^i) is the parallel time on 2^i processors. Figure 3(a) shows that the 2-D approach has the best scalability on a large number of processors. For instance, the 2-D algorithm achieves 201.35 MFLOPS on 16 processors for B33 5800, while the RAPID code achieves 302.5 MFLOPS; but on 64 processors the 2-D algorithm achieves 533.68 MFLOPS, better than the RAPID code, which achieves 519.3 MFLOPS. We analyze the load balance factor in Figure 3(b). The load balance factor is defined as work_total / (P * work_max) [13]. Here we only count the work from the updating part because it is the major part of the computation.
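
Both measures are straightforward to compute from the raw timings; a sketch with illustrative names:

    def scalability(pt):
        """S(2**i) = PT(2**(i-1)) / PT(2**i), where pt maps the processor count
        to the measured parallel time; a value of 2 means perfect scaling."""
        return {p: pt[p // 2] / pt[p] for p in sorted(pt) if p // 2 in pt}

    def load_balance_factor(work_per_proc):
        """work_total / (P * work_max) as in [13]; 1.0 is perfectly balanced."""
        total = sum(work_per_proc)
        return total / (len(work_per_proc) * max(work_per_proc))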

The 2-D code has the best load balance, which can make up for the lack of efficient task scheduling; that is the main reason the 2-D code is the most scalable approach.

Fig. 3. Scalability and load balancing comparisons: (a) scalability of the 1-D and 2-D codes; (b) load balance of the 1-D and 2-D codes (2-D cyclic, 1-D RAPID, 1-D CA).

Impact of inter-processor pivoting and swapping. We further examine the impact of the overhead caused by inter-processor pivoting and swapping. Table 1 shows the distribution of the execution times for pivoting and swapping in the 1-D CA code and the 2-D code. The first item, "Factorizing + waiting", is the period when a column of processors (only one processor in the 1-D case) is working on Factor(k) and the other processors have to wait for the factorization results. "Swapping" is the time spent on row interchanges.

Table 1
Execution time distributions for the 1-D CA code and the 2-D code

                         P=4                P=16               P=64
Part                     1-D CA    2-D      1-D CA    2-D      1-D CA    2-D
Factorizing + waiting    18.44%    23.71%   56.25%    39.66%   81.17%    52.80%
Swapping                  4.05%     6.77%    2.03%     8.77%    0.77%     7.88%

The results show that the increase in "Swapping" time is insignificant. However, "Factorizing + waiting" dominates the execution time and affects the parallel time significantly, because it is most likely on the critical path of the computation. On a small number of processors, the 2-D code has a larger overhead percentage for this part; but as the number of processors increases, this part becomes the bottleneck of the 1-D CA code. The reason is as follows. With a small number of processors, the 1-D code has good load balancing. On a large number of processors, the load balancing of the 1-D code deteriorates, while the 2-D approach starts to perform better because the parallelization of the Factor() tasks begins to show its advantage and the load remains well balanced.

Synchronous versus asynchronous 2-D code. Using a global barrier in the 2-D code simplifies the implementation, but it cannot overlap computation among different updating stages. Table 2 lists the average parallel time reduction obtained by our asynchronous design. We obtain an average improvement of 5-26%, which demonstrates the importance of the barrier-free design.

Table 2
Performance improvement of the 2-D asynchronous code over the 2-D synchronous code

          P=2      P=4      P=8       P=16      P=32      P=64
Average   5.27%    7.13%    15.57%    16.95%    25.93%    26.07%

6 Conclusions

In this paper we present an asynchronous sparse LU factorization algorithm using 2-D block mapping and compare its performance with two algorithms using 1-D column-block mapping. The preliminary experimental results show that 2-D mapping gives better scalability and load balance than 1-D column-block mapping, although the 2-D mapping requires more frequent communication. Another important point is that the 2-D code requires less memory and therefore has more potential to solve larger problems. We are currently working on a memory-efficient design of RAPID [5] and testing larger problem instances with the RAPID and 2-D codes.

References

[1] C. Ashcraft, R. Grimes, J. Lewis, B. Peyton, and H. Simon, Progress in Sparse Matrix Methods for Large Sparse Linear Systems on Vector Supercomputers, International Journal of Supercomputer Applications, 1 (1987), pp. 10-30.
[2] J. Demmel, Numerical Linear Algebra on Parallel Processors, Lecture Notes for the NSF-CBMS Regional Conference in the Mathematical Sciences, June 1995.
[3] C. Fu and T. Yang, Run-time Compilation for Parallel Sparse Matrix Computations, in Proceedings of the ACM International Conference on Supercomputing, Philadelphia, May 1996, pp. 237-244.
[4] C. Fu and T. Yang, Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines, in Proceedings of Supercomputing '96, Pittsburgh, Nov. 1996.
[5] C. Fu and T. Yang, Space and Time Efficient Execution of Parallel Irregular Computation, 1997. In preparation.
[6] K. Gallivan, B. Marsolf, and H. Wijshoff, The Parallel Solution of Nonsymmetric Sparse Linear Systems using H* Reordering and an Associated Factorization, in Proceedings of the ACM International Conference on Supercomputing, Manchester, July 1994, pp. 419-430.
[7] A. George and E. Ng, Parallel Sparse Gaussian Elimination with Partial Pivoting, Annals of Operations Research, 22 (1990), pp. 219-240.
[8] A. Gupta and V. Kumar, Optimally Scalable Parallel Sparse Cholesky Factorization, in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, Feb. 1995, pp. 442-447.
[9] S. Hadfield and T. Davis, A Parallel Unsymmetric-pattern Multifrontal Method, Tech. Rep. TR-94-028, CIS Department, University of Florida, Aug. 1994.
[10] S. C. Kothari and S. Mitra, A Scalable 2-D Parallel Sparse Solver, in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, Feb. 1995, pp. 424-429.
[11] X. Li, Sparse Gaussian Elimination on High Performance Computers, PhD thesis, Computer Science Division, UC Berkeley, 1996.
[12] E. Rothberg, Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization, PhD thesis, Department of Computer Science, Stanford University, Dec. 1992.
[13] E. Rothberg and R. Schreiber, Improved Load Distribution in Parallel Sparse Cholesky Factorization, in Proceedings of Supercomputing '94, Nov. 1994, pp. 783-792.
[14] R. Schreiber, Scalability of Sparse Direct Solvers, in Graph Theory and Sparse Matrix Computation (A. George, J. R. Gilbert, and J. W. H. Liu, eds.), vol. 56, Springer-Verlag, New York, 1993, pp. 191-209.