Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms


Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

by

I Shiva Rama Krishna
sivaramakrishna.i@research.iiit.ac.in

International Institute of Information Technology
Hyderabad, INDIA
August 2015

Copyright © I Siva Rama Krishna, 2015. All Rights Reserved.

International Institute of Information Technology Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms" by I Siva Rama Krishna, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                    Adviser: Dr. Kishore Kothapalli

To My Family

Acknowledgments

First, I would like to thank Dr. Kishore Kothapalli for all the support, guidance, suggestions and encouragement. He is a wonderful guide and supported me with his patience and his knowledge. He patiently listened to whatever I said and helped me think in the right direction whenever I was stuck. I am really thankful to him for being available to me even during odd hours. I am grateful to CSTAR for providing enough resources and a nice atmosphere for doing research. I would like to thank my lab mates, Jatin, Kiran, Dip Sankar Banerjee, Anil Kishore, Aman, Shashank, Manoj, Piyush, Ravi Kishore, and Chiranjeevi for their support and encouragement. Special thanks to Anil Kishore for his simplicity and encouragement, and to Kiran for his support, guidance and encouragement. I am really thankful to my friends Nikhil, Vikram, Jatin, and Ruchi for their love and support when I was having a tough time. Special thanks to Harsh for enlightening me about various issues. I will cherish all the memories for the rest of my life. Finally, and most importantly, I would like to thank my parents for their unconditional love and support throughout the thesis.

Abstract

After microprocessor clock speeds levelled off, the high performance computing community started using the GPU (graphics processing unit) for general purpose computing. This is because of its performance per unit cost, performance per unit watt, and the CUDA programming model. So it is not surprising that most of the top supercomputers use GPUs as one of their computing elements [56]. As the GPU's architecture is different, to get good performance using a GPU we need to reinterpret the application in a highly multithreaded manner. GPUs are good at exploiting massive data parallelism and are well suited for applications which have regular memory access patterns. This can be observed in many applications such as scan primitives [57], sort [52], and dense matrix multiplication [15]. However, most of the applications in high performance computing are irregular in nature. GPUs are not well suited for irregular applications such as graph algorithms [16], sparse matrix operations [45], list ranking [27], etc. The GPU is not a standalone device and needs a host device such as a CPU. In a typical GPU application, the CPU sits idle while the GPU is doing the computation. So it is beneficial to include the CPU in the computation. We call this heterogeneous computing. This kind of effort has been made in recent works such as dense linear algebra computations [60], sorting [29], and list ranking [31].

Sparse matrix operations are some of the fundamental problems in parallel computing. They are included in the seven dwarfs of parallel computing identified in the Berkeley report [3]. In this thesis we design heterogeneous algorithms for sparse matrix operations such as sparse matrix - vector multiplication (spmv), sparse matrix - sparse matrix multiplication (spgemm) and sparse matrix - dense matrix multiplication (csrmm). The fundamental problem in heterogeneous computing is partitioning the work among devices, so we explore different work division methodologies. We first designed static load balancing algorithms for these sparse matrix operations. Later we proposed a dynamic load balancing algorithm using a work queue and studied its efficacy using sparse matrix operations. We also propose an analytical model to divide work in the case of band matrix multiplication. We noticed that heterogeneous computing is suitable for irregular applications such as sparse matrix operations. Our static load balancing algorithms for spgemm, csrmm and spmv are 30%, 15% and 20% faster compared to their pure GPU solutions respectively. We also show that for scale-free matrices, in the spmv operation, giving large rows to the CPU and small rows to the GPU is the most suitable work division scheme. We verified the efficacy of our dynamic load balancing algorithm on two different heterogeneous platforms using spgemm and csrmm. We show that the absolute differences of work division percentages and execution times with respect to the static load balancing approach are

under 6% and 10% respectively. Also, in the case of band matrix multiplication, our proposed analytical method is able to predict the best work division percentage with an accuracy of more than 95%.

Contents

1 Introduction
  1.1 Relevance of Parallel Computing
  1.2 GPU Computation and CUDA Model
  1.3 Load Balancing Strategies for Heterogeneous Platforms
    1.3.1 Static Load Balancing
    1.3.2 Dynamic Load Balancing
    1.3.3 Analytical Model
  1.4 Contributions

2 Background
  2.1 Matrix Multiplication Formulations
    2.1.1 The Row-Column Formulation
    2.1.2 The Row-Row Formulation
    2.1.3 The Column-Row Formulation
    2.1.4 The Column-Column Formulation
  2.2 Sparse Matrix Storage Formats
    2.2.1 Compressed Sparse Row (CSR) Format
    2.2.2 Coordinate (COO) Format
  2.3 Sparse Matrix - Matrix Multiplication
    2.3.1 Previous Work
  2.4 Sparse Matrix Vector Multiplication
    2.4.1 Previous Work
  2.5 Platforms
    2.5.1 The Hetero-I
    2.5.2 The Hetero-II
    2.5.3 The Hetero-III
  2.6 Datasets

3 Static Load Balancing Algorithms For Sparse Matrix Kernels
  3.1 Sparse Matrix - Sparse Matrix Multiplication (spgemm)
    3.1.1 Algorithm
    3.1.2 Heuristic I
    3.1.3 Heuristic II
  3.2 Sparse Matrix - Dense Matrix Multiplication (csrmm)
    3.2.1 Algorithm
    3.2.2 Results
  3.3 Sparse Matrix Vector Multiplication (spmv)
    3.3.1 Reordering Rows of A in spmv
    3.3.2 Work Division Schemes
    3.3.3 Experimental Results
    3.3.4 Scale-free Matrices

4 Dynamic Load Balancing Algorithms For Sparse Matrix Kernels
  4.1 Work Queue Model
  4.2 Framework
  4.3 Sparse Matrix - Matrix Multiplication
    4.3.1 Sparse Matrix - Sparse Matrix Multiplication (spgemm) Algorithm
    4.3.2 Sparse Matrix - Dense Matrix Multiplication (csrmm)
    4.3.3 Results
  4.4 Sparse Matrix - Vector Multiplication
    4.4.1 Algorithm
    4.4.2 Results

5 Analytical Model For Band Matrix Multiplication
  5.1 Band Matrix
  5.2 Algorithm
  5.3 Analytical Model
  5.4 Experiments and Results

6 Conclusions and Future work

Bibliography

List of Figures

1.1 Tightly coupled CPU-GPU heterogeneous platform
2.1 Different sparse matrix representations for an example matrix
2.2 The specifications for the different GPUs and CPUs used in our experiments
2.3 List of sparse matrices. The number of columns and rows are equal for all the matrices except for the matrix LP, where the number of columns is equal to 1,092,…
2.4 List of sparse matrices from the SNAP dataset
3.1 spgemm using static load balancing. The red colored rows are processed on the CPU and the blue colored rows are processed on the GPU
3.2 Performance comparison of the heterogeneous method w.r.t. the Row-Row method on the datasets shown in Figure 2.3 and Figure 2.4, on Hetero-II (Section 2.5.2)
3.3 Performance comparison of the two presented heuristics w.r.t. the best heterogeneous timings on the dataset shown in Figure 2.3, on platform Hetero-II (Section 2.5.2). The X-axis represents instances in the dataset and the Y-axis represents performance w.r.t. the best heterogeneous time. The last instance, Average, shows the average value of the series
3.4 Performance comparison of the heterogeneous algorithm w.r.t. the GPU algorithm for csrmm [48] on the dataset shown in Figure 2.3, on Hetero-I (Section 2.5.1)
3.5 The Direct-Division scheme for spmv. The red colored rows of A are processed on the CPU, and the other rows are processed on the GPU. The corresponding portion of the result vector in each iteration is also colored accordingly
3.6 The Large-Rows-GPU scheme for spmv, colored as in Figure 3.5
3.7 The Small-Rows-GPU scheme for spmv, colored as in Figure 3.5
3.8 Performance comparison of execution times of the three work division methods for sparse matrices from two different datasets. The line anchored to the second Y-axis, labeled Max/Min, measures the ratio of the best speed-up to the least speed-up among the three work division methods. The last item on the X-axis refers to the average of the dataset

3.9 Timeline of two iterations of spmv on the matrix FEM/Cantilever from Table 2.3. The labels CPU and GPU indicate computations on the CPU and the GPU respectively. The labels CPU→GPU and GPU→CPU indicate transfer of the partial result vector from the CPU to the GPU and vice-versa
3.10 Applying the Small-Rows-GPU method to scale-free matrices from the datasets of Tables 2.3 and 2.4. The last item on the X-axis refers to the average of the series
3.11 Cache hit ratio on the CPU last level cache for four different scale-free matrices. The X-axis indicates the percentage of the total number of non-zeros that were assigned to the CPU
4.1 Work Queue model
4.2 Absolute difference in the work split percentage with respect to the baseline implementation for spgemm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series
4.3 Absolute difference in the work split percentage with respect to the baseline implementation for csrmm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series
4.4 Absolute difference in the runtime with respect to the baseline implementation for spgemm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series
4.5 Absolute difference in the runtime with respect to the baseline implementation for csrmm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series
4.6 Absolute difference of work division percentage and the overall runtime between the work queue based algorithm and the baseline algorithm for scale-free matrices from Table 2.3 and Table 2.4
5.1 An example illustrating the DIA format representation
5.2 Performance comparison of the best time and the predicted time using the formulae, for various combinations of A_r, A_d, and B_d. A tuple (l, m, n) on the x-axis indicates A_r, A_d, B_d. The last instance, Average, shows the average value of the series

List of Tables

Chapter 1

Introduction

A sparse matrix is a special kind of matrix in which most of the entries are zeros. Sparse matrix operations are some of the fundamental problems in parallel computing and are included in the seven dwarfs of parallel computing identified in the Berkeley report [3]. Because of their importance, these operations are included in various libraries such as Intel MKL [42], Nvidia cusp [51] and cusparse [13]. Sparse matrix operations are also considered important kernels in the class of throughput oriented applications [14]. As many entries are zeros in sparse matrices, separate storage formats are used in which only the non-zero entries are stored. Implementing sparse matrix operations on modern architectures such as GPUs is challenging for various reasons: due to variations in the sparsity nature of the matrices, these operations pose severe load balancing problems amongst threads and have irregular memory access patterns. Most of this work therefore focuses on designing efficient heterogeneous algorithms for sparse matrix operations.

1.1 Relevance of Parallel Computing

Parallel computing has been in practice for the past several decades in supercomputing. But it became mainstream after the slowdown in frequency scaling and memory speeds due to physical constraints. General purpose processors became multi-core and there has been renewed interest in hardware accelerators such as GPUs, FPGAs and so on. In recent years, accelerator-based computing using accelerators such as the IBM Cell SPUs, FPGAs, GPUs, and ASICs has achieved clear performance gains compared to CPUs. Several challenging problems from parallel computing, such as sorting [52], graph traversals [54], and the like, are already known to have highly efficient implementations on a variety of accelerators. A common model in all the above works is to delegate the entire computation to the accelerator.

Graphics processing units (GPUs) are generally used for graphics rendering. They are used as accelerators for graphics computations along with the main processor. GPUs are also used in a wide range of devices including laptops, personal computers, mobile phones, embedded systems and game consoles. Graphics applications are highly parallel in nature, which necessitates the GPU to be a highly parallel architecture. GPUs also have very high bandwidth for efficient transfer of data among cores.

Figure 1.1 Tightly coupled CPU-GPU heterogeneous platform

Because of these properties, GPUs have drawn the attention of the HPC community for general purpose calculations. Among the accelerators, GPUs have occupied a prominent place due to their low cost and high performance-per-watt ratio along with powerful programming models such as CUDA. For instance, for under $2000, one can buy a GPU which has 4 TFLOP computation power and more than 200 GB/s of bandwidth while requiring 225 Watts of power. However, the GPU is not a standalone device and should be attached to a host device like a CPU. In most GPU applications, the CPU passes the code and the data onto the GPU and gets back the results; the CPU sits idle while the GPU is doing the computation. So involving the CPU in the computation makes the resource utilization efficient. We call this heterogeneous computing. Recently Intel crafted a hybrid chip which has an FPGA and a Xeon processor on the same chip [10]. Further, it is anticipated that future generation commodity systems will naturally have a heterogeneous collection of computing devices including CPUs and other accelerators. This warrants the design and development of heterogeneous algorithms that run on a commodity heterogeneous platform. Designing a heterogeneous algorithm poses several challenges such as reducing data transfers among devices, assigning the right kind of work to each device, and load balancing among devices. There have been recent efforts to design and develop heterogeneous algorithms for a variety of problems such as dense linear algebra computations [60], sorting [29], and list ranking [31]. In this thesis, we investigate several load balancing techniques, given in Section 1.3, on sparse matrix operations from Basic Linear Algebra Subprograms (BLAS) Level 2 and Level 3.

1.2 GPU Computation and CUDA Model

The GPU is viewed as a massively multi-threaded architecture which has hundreds of small processing elements known as cores. These cores are grouped in an SIMD fashion into a Streaming Multiprocessor (SM). Using the CUDA API, we can create a large number of threads to execute code on the GPU. All these threads are grouped into blocks, and blocks make up a grid. Each block is further divided into SIMD groups called warps. Each warp consists of up to 32 threads which execute the same instruction at any given time. During execution, blocks are serially scheduled on each SM; if a block finishes its computation, the next block will be scheduled. An SM can schedule multiple blocks at a time depending on the block size, i.e., the number of threads in a block. For instance, the GTX 480 has 16 SMs and an SM can schedule two blocks if the block size is 512, which means each SM can have 32 warps resident at a time. CUDA has zero-overhead scheduling: warps that are stalled on a memory fetch are swapped with another warp, so data fetch latencies are hidden by switching between warps that are resident on the SM.

The GPU has various types of memory. Each SM has an L1 cache or shared memory which is divided among the blocks that are scheduled on that SM; within each block, the shared memory is shared by the threads of that block. Each SM also has 32-bit registers which are divided among the threads scheduled on that SM. In general, all the private variables of a thread are stored in registers. In older GPUs, the L1 cache is used only as a software managed cache, but in more recent GPUs, part of the L1 cache is used as a hardware cache. Newer GPUs also have an L2 cache that services all load, store, and texture read/write requests. The GPU also has off-chip memory known as global memory which is accessible to all the threads. We can efficiently access global memory through two read-only caches known as constant memory and texture memory. The constant memory space is cached on chip and is used to store constant data; a read request costs just one read from the constant cache if there is a cache hit, otherwise it is one memory read from global memory. Texture memory is also cached on chip, similar to constant memory. Specifically, texture caches are designed for graphics applications where memory access patterns have spatial locality. Accessing global memory typically takes far more cycles than accessing shared memory.

Each computation that is to be performed using a GPU is written inside functions known as kernels. Before we launch a kernel, all the data needed by the kernel should be transferred to the GPU. Once the kernel is invoked from the CPU code, the kernel executes on the GPU asynchronously with the CPU code. We do not have control over the scheduling of blocks on SMs; scheduling of blocks is handled by the GPU. We also do not have control over the execution order of threads: all the threads in a grid work independently. We can use barrier synchronization for all the threads within a block. Global synchronization among all the threads is achieved across separate kernel launches.
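As a concrete illustration of this execution model, the following minimal CUDA sketch (a generic vector addition, not a kernel from this thesis) shows how a grid of blocks of threads is configured, how a kernel launch proceeds asynchronously with respect to the CPU, and where explicit synchronization is needed.

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against the last, partial block
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc((void **)&a, bytes);                  // device (global memory) allocations;
    cudaMalloc((void **)&b, bytes);                  // host-to-device transfers are elided here
    cudaMalloc((void **)&c, bytes);

    int threadsPerBlock = 256;                       // a multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n); // asynchronous w.r.t. the CPU
    cudaDeviceSynchronize();                         // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```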

1.3 Load Balancing Strategies for Heterogeneous Platforms

In this section we give a brief explanation of the load balancing strategies used in this thesis.

1.3.1 Static Load Balancing

Static load balancing is a simple technique used in heterogeneous algorithms to partition the work among the devices of a heterogeneous platform. In this technique some t% of the work is given to the CPU and the remaining work to the GPU, where t is fixed before the start of the computation. This has drawbacks: partitioning the work irrespective of the instance leads to load imbalance, because such a partition fails to capture any properties of the instance, such as the sparsity nature in the case of matrices. Also, for any instance it takes considerable time to find the best value of t, as it has to be found through exhaustive experimentation. We therefore also propose a dynamic load balancing technique, explained later.

Heterogeneous computing on commodity platforms has gained large scale research attention in recent years. Heterogeneous algorithms have been designed recently for several challenging problems in parallel computing including graph BFS [62, 48, 65], dense matrix computations [67], sorting [29], and the like. In the above cited works, the entire computation is spread across the computational devices. In some cases a post-processing phase is needed which combines the outputs of the individual computations [31, 48, 29, 46, 60]. This approach of designing heterogeneous algorithms can be called the static work partitioning or static load balancing approach.

1.3.2 Dynamic Load Balancing

Dynamic load balancing is another load balancing strategy in which work is divided among the heterogeneous devices dynamically. Static partitioning of work among the devices of a heterogeneous platform has drawbacks: partitioning irrespective of the input instance does not lead to a well-balanced load, and finding the best partitioning for an instance takes considerable time as one has to search exhaustively. On the other hand, analytical methods are available only for special cases of workloads. Thus, a fundamental problem in heterogeneous computing is to propose generic mechanisms that can help address the issue of load balancing in heterogeneous algorithms designed using the work partitioning model [47]. We therefore propose a lightweight, low overhead, and completely dynamic load balancing framework that addresses the load balancing problem of heterogeneous algorithms. Details of our framework are given in Chapter 4.

1.3.3 Analytical Model

In this strategy, we define a mathematical model using the parameters that affect the execution time. Using this model we find the work division threshold t. It is not always possible to design an analytical model for work division. Devising an analytical model removes the exhaustive search time for

finding the best work division threshold. In Chapter 5, we give details of the analytical model used in this thesis.

1.4 Contributions

We investigate sparse matrix operations using different work division methodologies on different heterogeneous platforms. The contributions of this thesis are as follows.

- We give efficient heterogeneous algorithms for sparse matrix - sparse matrix multiplication (spgemm), sparse matrix - dense matrix multiplication (csrmm) and sparse matrix - vector multiplication (spmv). Our static load balancing algorithms for spgemm, csrmm and spmv are 30%, 15% and 20% faster compared to the corresponding GPU algorithms.

- We propose a dynamic work load balancing framework and use it to solve sparse matrix operations on a heterogeneous platform of CPU + GPU. The absolute differences of the work split percentages and execution times with respect to the static partitioning approach are under 6% and 10% respectively.

- We define an analytical model to identify the correct work division between the CPU and the GPU in the case of band matrix multiplication. We are able to predict the correct work division percentage with an accuracy of more than 95%.

Chapter 2

Background

This chapter gives a brief explanation of the different matrix multiplication formulations, sparse matrix storage formats, sparse matrix operations and previous work on them, and the heterogeneous platforms used in this thesis.

2.1 Matrix Multiplication Formulations

Let A, B and C be three matrices with sizes M × P, P × N and M × N respectively such that C = A × B. There are four different formulations to multiply two matrices: the Row-Row formulation, the Column-Row formulation, the Row-Column formulation and the Column-Column formulation. All four formulations are briefly explained with an example.

2.1.1 The Row-Column Formulation

In the Row-Column formulation, to get one element of C, we multiply a row of A with a column of B, i.e., C(i, j) = A(i, :) · B(:, j) for i = 1, 2, ..., M and j = 1, 2, ..., N. This is the standard matrix multiplication approach. For a given i, j, let I(i, j) denote the set of indices k such that both the elements A(i, k) and B(k, j) are nonzero. Then, C(i, j) = Σ_{k ∈ I(i,j)} A(i, k) · B(k, j). However, to obtain I(i, j), we need to access all the elements in the i-th row of A and the j-th column of B. Therefore, we bring in elements which may not contribute to the output. In the worst case, we would access the entire row i of A and column j of B even when I(i, j) = ∅. Hence, this approach is not suited for sparse matrices in general.

2.1.2 The Row-Row Formulation

In the Row-Row formulation, to compute the i-th row of C, C(i, :), we multiply each element of A(i, :) with the corresponding row of B. We then add all the scaled rows of B to get C(i, :). Thus, C(i, :) = Σ_{j ∈ I_i(A)} A(i, j) · B(j, :), where I_i(A) denotes the column indices of the non-zeros in the i-th row of A. In this formulation, we access only the elements which contribute to the output. The working of the Row-Row formulation is shown below.
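The following is a minimal sketch of the Row-Row formulation in code, using dense storage purely for readability (the actual kernels in this thesis operate on the sparse formats of Section 2.2). Zero entries of A are skipped, so only contributing elements of B are touched.

```cuda
#include <vector>

// Row-Row formulation: row i of C is the sum of the rows of B scaled by the
// entries of row i of A, i.e. C(i, :) = sum over j of A(i, j) * B(j, :).
std::vector<std::vector<double>>
rowRowMultiply(const std::vector<std::vector<double>> &A,
               const std::vector<std::vector<double>> &B) {
    int M = (int)A.size(), P = (int)B.size(), N = (int)B[0].size();
    std::vector<std::vector<double>> C(M, std::vector<double>(N, 0.0));
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < P; ++j) {
            double aij = A[i][j];
            if (aij == 0.0) continue;          // skip non-contributing entries of A
            for (int k = 0; k < N; ++k)
                C[i][k] += aij * B[j][k];      // C(i, :) += A(i, j) * B(j, :)
        }
    return C;
}
```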

For example, for a 4 × 4 matrix A and a 4 × 3 matrix B, each row C(i, :) is obtained by scaling the rows of B with the corresponding non-zero entries of A(i, :) and adding the scaled rows; C is then assembled as (C(1, :); C(2, :); C(3, :); C(4, :)).

2.1.3 The Column-Row Formulation

In the Column-Row formulation, for i = 1, 2, ..., P, we multiply the i-th column of A with the i-th row of B to get a matrix C_i = A(:, i) · B(i, :). The output matrix C is the sum of all such matrices, i.e., C = Σ_{i=1}^{P} C_i. In this formulation also, we access only the elements which contribute to the output. An example follows.

Let C_i denote the matrix obtained by multiplying the i-th column of A with the i-th row of B. For the example matrices above, each C_i is a 4 × 3 matrix and C = C_1 + C_2 + C_3 + C_4.

2.1.4 The Column-Column Formulation

The Column-Column formulation is similar to the Row-Row formulation. Here the elements of a column of B are used to scale the corresponding columns of A, i.e., C(:, j) = Σ_{i ∈ I_j(B)} B(i, j) · A(:, i), where I_j(B) denotes the row indices of the non-zeros in the j-th column of B. In the example, each column C(:, j) for j = 1, 2, 3 is obtained by scaling and adding the corresponding columns of A.

C is then assembled as (C(:, 1), C(:, 2), C(:, 3)).

2.2 Sparse Matrix Storage Formats

In this section we describe the sparse matrix storage formats used in this work.

2.2.1 Compressed Sparse Row (CSR) Format

This is a popular storage format. It stores only the required elements and does not make any assumptions about the sparsity pattern of the matrix. Let A be a sparse matrix with dimensions M × N and nnz non-zeros. In this format we use three arrays, say data, cols and rowPtr, to store the matrix. The data array contains only the nonzero elements of the matrix. The cols array stores the column indices corresponding to the non-zeros in the data array. In rowPtr we store the starting index of each row in the data array: rowPtr[i] and rowPtr[i + 1] indicate the starting and ending positions of the i-th row of A in the data and cols arrays. The sizes of rowPtr, cols and data are M + 1, nnz and nnz respectively. An example of the CSR format is given in Figure 2.1(b).

2.2.2 Coordinate (COO) Format

This is also a popular storage format. Let A be a sparse matrix with dimensions M × N and nnz non-zeros. Similar to the CSR format, it also uses three arrays to store the matrix: data, rowIndex and colIndex. As in the CSR format, the data array stores only the non-zero values of the matrix A, while rowIndex and colIndex store the row and column indices corresponding to the non-zeros in the data array. The three arrays are thus each of size nnz. An example of the COO format is given in Figure 2.1(c).

Figure 2.1 Different sparse matrix representations for an example matrix: (a) the input matrix, (b) CSR format (data, cols, rowptr), and (c) COO format (data, cols, row).
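Since the numeric example of Figure 2.1 is only summarized above, the sketch below shows the CSR and COO arrays for a small assumed 4 × 4 matrix (not the matrix of the original figure), using 0-based indices.

```cuda
#include <vector>

// A small assumed 4x4 matrix (not the original example of Figure 2.1):
//     | 1 0 0 2 |
// A = | 0 3 0 0 |
//     | 0 4 5 0 |
//     | 6 0 0 7 |
// CSR: data/cols hold the non-zero values and their column indices;
// rowPtr[i]..rowPtr[i+1] delimits row i, so rowPtr has size M + 1 = 5.
std::vector<double> data   = {1, 2, 3, 4, 5, 6, 7};
std::vector<int>    cols   = {0, 3, 1, 1, 2, 0, 3};
std::vector<int>    rowPtr = {0, 2, 3, 5, 7};

// COO: one (row, column, value) triple per non-zero, so all three arrays
// have size nnz = 7.
std::vector<double> cooData = {1, 2, 3, 4, 5, 6, 7};
std::vector<int>    cooRow  = {0, 0, 1, 2, 2, 3, 3};
std::vector<int>    cooCol  = {0, 3, 1, 1, 2, 0, 3};
```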

2.3 Sparse Matrix - Matrix Multiplication

Multiplication of two sparse matrices is also known as spgemm. It has applications in various domains such as graph algorithms [4, 5, 6], climate modeling, molecular dynamics, CFD solvers, and the like [7, 8, 9]. Multiplication of a sparse matrix with a dense matrix is known as csrmm. It is an important kernel in linear algebra and is used in iterative algorithms such as the Lanczos method [28] and the Conjugate Gradient method [28]. Implementation of spgemm is challenging: because of the sparsity nature of the matrices, it poses a severe load balancing problem among threads. As the prediction of the output size is difficult, it is also difficult to manage the output memory. Because of the SIMD nature of execution on the GPU, any divergence in the execution path of a warp also causes performance degradation.

2.3.1 Previous Work

Matrix-matrix multiplication is an important problem in computer science and has received a lot of attention in high performance computing. Efficient solutions for dense matrices have been given on different architectures such as the GPU [15] and FPGA [17]. The first important work on spgemm was done by Gustavson et al. [19]. They developed an algorithm for spgemm and presented spgemm in the Row-Row formulation using the CSR format; this algorithm is now used in software packages such as [20] and CSparse [21]. Yuster et al. [26] considered spgemm over a ring and presented algorithms which use fast dense matrix multiplication algorithms. In [22], Park et al. gave an efficient data structure to store a class of sparse matrices in which non-zeros appear adjacently; using this new data structure they presented an algorithm for fast sparse matrix multiplication. Buluc et al. [23] worked extensively on spgemm. They explored scalable parallel algorithms for spgemm on distributed memory systems and gave data structures for hyper-sparse matrices. They also proposed 2D algorithms for spgemm and analyzed the scalability of 1D and 2D algorithms for parallel spgemm on distributed systems, showing that existing 1D algorithms do not scale compared to 2D algorithms. Siegel et al. [24] developed a run-time framework for spgemm on heterogeneous clusters. They introduced a task based programming model in which the multiplication of a block of

matrices represents a task. They also provided a run-time execution model to address load balancing on clusters which consist of CPUs and GPUs. Sulatycke et al. [25] explored the Row-Row and Column-Row formulations of matrix multiplication and also presented cache-optimized algorithms for spgemm on sequential machines. The 2D matrix multiplication algorithms given in [18] are applicable to distributed systems and are not suitable for standalone systems. As the programming model and architecture of the GPU are different from distributed systems, the optimizations and algorithms proposed for distributed systems do not suit GPUs. To the best of our knowledge, no heterogeneous algorithm using a CPU and a GPU has been proposed for spgemm or csrmm. So we give hybrid load balancing algorithms for spgemm and csrmm in the next chapters.

2.4 Sparse Matrix Vector Multiplication

Multiplying a sparse matrix with a vector, usually denoted spmv, is one of the important problems in parallel computing. It has several applications in solving systems of linear equations using iterative methods like the conjugate gradient method and GMRES, in iterative methods for finding eigenvalues and eigenvectors of sparse matrices, and the like [39]. These methods in turn find applications in many areas of Computer Science such as information extraction, image processing, and the like. The importance of spmv can be judged by the fact that most multi-core architectures support an optimized library routine for spmv [51, 42].

The spmv computation offers a lot of data parallelism that can be exploited in a heterogeneous setting too. However, designing a heterogeneous algorithm for spmv requires one to address several challenges. The amount of computation required for a row of a matrix depends on the number of non-zeros in that row. In a general unstructured sparse matrix, the number of non-zeros per row can vary significantly across rows. Thus, it is difficult to apportion a priori the right amount of work, i.e., the number of rows in this case, to the individual devices in a heterogeneous platform.

An additional difficulty with respect to spmv arises from its typical usage, illustrated in Algorithm 1. In Algorithm 1, the function Modify can make suitable modifications to its input as dictated by the application. The spmv kernel is often used in iterative methods, and successive iterations use the vector generated in the previous iteration. If the vector Y is computed in pieces at both the CPU and the GPU, assuming that the function Modify can still be executed on both devices independently, the next iteration at the CPU requires the portion of the Y vector computed at the GPU, and vice-versa. Hence, heterogeneous algorithms for spmv have to take into account the time required to transfer the partial vector from the CPU to the GPU and vice-versa at the end of every iteration. This places a hard synchronization requirement across the devices.

Algorithm 1 A typical usage of spmv
Input: A sparse matrix A, a vector X
while not done do
    Compute Y := A × X;
    Modify(Y);
    X := Y;
end while

2.4.1 Previous Work

Due to the fundamental nature of the spmv operation, there have been many studies on implementing spmv efficiently on various architectures like multi-core CPUs [63], GPUs [40, 30, 32], FPGAs [36, 58] and vector architectures [34, 35]. Much of the above-cited work has focused on obtaining performance improvements of the spmv kernel by designing suitable data structures and identifying low level code optimizations. In [64], Vuduc extensively studied optimizations on spmv and auto-tuning of the spmv kernel for sequential machines. In [63], Williams et al. studied spmv on AMD dual-core, Intel quad-core, STI Cell and Sun Niagara2 processors. They presented optimization strategies for those multi-core environments: low-level code and data structure optimizations that largely address single-core performance, and parallelization optimizations to improve multi-core performance. Bell and Garland [33] proposed spmv kernels for different sparse matrix formats, such as the CSR and ELL matrix representation formats, on the GPU. Choi et al. [37] proposed a blocked ELLPACK data structure to store the sparse matrix for spmv. Monakov et al. [49] implemented blocked spmv on the GPU. In [50], Monakov et al. also proposed a sparse matrix data structure called Sliced ELLPACK, in which a slice of the matrix is a set of adjacent rows that are stored in ELL format. The size of each slice can be different, and each slice is assigned to a block of threads in CUDA. Load balancing of threads is achieved by assigning multiple threads to a row if required.

Another direction that has been pursued recently is to consider special cases of sparse matrices and optimize the performance of spmv for those special cases. Yang et al. [66] proposed optimizations for power law graphs. Their work improves the cache hit ratio of the texture cache of the GPU, leading to an increase in the performance of spmv. Heterogeneous algorithms for spmv have not been reported so far, to the best of our knowledge. Such algorithms have been designed for related computations such as multiplying two sparse matrices [48], dense linear algebra computations [60], and the like.

Matching a workload to a computational device based on the characteristics of the workload is an emerging line of research. In [41], Gharaibeh et al. consider three graph algorithms and suggest that for large, sparse graphs, it is advisable to process vertices of low degree on the GPU and vertices of high degree on the CPU. The authors of [41] also show that such a choice can help improve the hit ratio of

the last level cache on current multi-core architectures. In Chapter 3, we show that similar effects can be seen for sparse matrix computations also.

GPUs:
Device      | Cores | # of SMs | Clock | Global Memory | L2 Cache | Threads per block
GTX 480     |       |          |  GHz  | 1535 MB       | 768 KB   | 1024
Tesla C2050 |       |          |  GHz  | 2687 MB       | 768 KB   | 1024
GT 520      |       |          |  GHz  | 1024 MB       | 64 KB    | 1024

CPUs:
Device     | # of Cores | Clock | L1 Cache | L2 Cache | L3 Cache | # of Threads
i7 980x    |            |  GHz  | 32 KB    | 256 KB   | 12 MB    | 12
i7 920     |            |  GHz  | 32 KB    | 256 KB   | 8 MB     | 8
Core 2 Duo |            |  GHz  | 32 KB    | 3 MB     | -        | 4

Figure 2.2 The specifications for the different GPUs and CPUs used in our experiments.

2.5 Platforms

We evaluate our algorithms on various platforms. Figure 2.2 gives a brief view of the CPUs and GPUs used in our experiments. We use three heterogeneous platforms to evaluate our algorithms; the details of these platforms are given below.

2.5.1 The Hetero-I

The Hetero-I heterogeneous platform is a coupling of two devices, the Intel i7 980x CPU and the Nvidia GTX 480 GPU. The CPU and the GPU are connected via a PCI Express version 2.0 link which supports a data transfer bandwidth of 8 GB/s between the CPU and the GPU.

2.5.2 The Hetero-II

The Hetero-II platform consists of two devices, the Intel i7 920 CPU and the Tesla C2050 GPU. The CPU and the GPU are connected via a PCI Express version 2.0 link which supports a data transfer bandwidth of 8 GB/s between the CPU and the GPU.

2.5.3 The Hetero-III

The Hetero-III platform consists of two devices, the Intel Core 2 Duo CPU and the GT 520 GPU. The CPU and the GPU are connected via a PCI Express version 2.0 link which supports a data transfer bandwidth of 8 GB/s between the CPU and the GPU.
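For reference, the GPU-side entries of Figure 2.2 can also be read off programmatically; the sketch below (not part of the thesis code) queries them through the standard CUDA device-properties API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints, for every visible GPU, the fields listed in Figure 2.2.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        std::printf("%s: %d SMs, %.0f MHz, %zu MB global memory, %d KB L2, "
                    "%d threads per block\n",
                    p.name, p.multiProcessorCount, p.clockRate / 1000.0,
                    p.totalGlobalMem >> 20, p.l2CacheSize >> 10,
                    p.maxThreadsPerBlock);
    }
    return 0;
}
```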

2.6 Datasets

In our experiments we use the dataset proposed by Williams et al. [33], shown in Figure 2.3. It contains 14 matrices from various fields such as circuit simulation, linear programming, and finite element method based modeling, with differing sparsity natures: some matrices have only a few non-zero elements per row, while others, like LP and Webbase, are highly unstructured. We also use another set of sparse matrices, collected from the SNAP sparse matrix collection [2], shown in Figure 2.4.

Matrix          | Rows      | NNZ
Dense           | 2,000     | 4,000,…
Protein         | 36,417    | 4,344,…
FEM/Spheres     | 83,334    | 6,010,…
FEM/Cantilever  | 62,451    | 4,007,…
Wind Tunnel     | 217,918   | 11,634,…
FEM/Harbor      | 46,835    | 2,374,…
QCD             | 49,152    | 1,916,…
FEM/Ship        | 140,874   | 7,813,…
Economics       | 206,500   | 1,273,…
Epidemiology    | 525,825   | 2,100,…
FEM/Accelerator | 121,192   | 2,624,…
Circuit         | 170,…     |
Webbase         | 1,000,005 | 3,105,…
LP              | 4,284     | 11,279,…

Figure 2.3 List of sparse matrices. The number of columns and rows are equal for all the matrices except for the matrix LP, where the number of columns is equal to 1,092,…

Collection                     | Instance     | Rows
Road Networks                  | roadnet-ca   | 1,971,…
Web Graphs                     | web-google   | 916,…
Communication networks         | -enron       | 36,…
Product co-purchasing networks | amazon0312   |
Collaboration networks         | ca-condmat   | 23,…
Internet peer-to-peer networks | p2p-gnutella | 62,…
Social networks                | wiki-vote    | 8,…
Citation networks              | cit-patents  | 3,774,…
Autonomous systems graphs      | as-skitter   | 1,696,…

Figure 2.4 List of sparse matrices from the SNAP dataset

Chapter 3

Static Load Balancing Algorithms For Sparse Matrix Kernels

This chapter explains how we use static load balancing in our heterogeneous algorithms for sparse matrix operations, along with the results.

3.1 Sparse Matrix - Sparse Matrix Multiplication (spgemm)

In this section we discuss the algorithm, the heuristics we use for work division between the CPU and the GPU, and the results.

3.1.1 Algorithm

Let A and B be two sparse matrices with sizes M × P and P × N respectively, and let C = A × B be a matrix of size M × N. Recall from Chapter 2 that we have four different formulations for matrix multiplication. We noticed that the Row-Row formulation of matrix multiplication from [48] performs better than the cusp library spgemm for sparse matrices. We also noticed that the performance of spgemm on the CPU is comparable to the GPU [48]. So we extend the Row-Row algorithm to work as a heterogeneous algorithm. Our heterogeneous algorithm, given in Algorithm 2, uses the Row-Row formulation algorithm from [48] on the GPU and the Intel MKL library spgemm routine on the CPU, as it is efficient and standard. In Algorithm 2, the labels CPU (GPU) and GPU→CPU refer to steps executed on the CPU (resp. GPU) and data transfer from the GPU to the CPU. In our algorithm the input matrices A and B are stored in CSR format and the output matrix C is stored in COO format. We experiment on the platform mentioned in Section 2.5.2.

To divide the work between the CPU and the GPU we choose the static load balancing strategy as follows. We choose a threshold t% and assign the computation corresponding to the first t% of the rows of A (from the top) to the CPU. The remaining computation is performed on the GPU. Let A_CPU and A_GPU be the partial matrices which are processed on the CPU and the GPU respectively; the matrix B is present on both the CPU and the GPU. Let C_CPU and C_GPU be the outputs computed by the CPU and the GPU respectively. After the computation, the GPU transfers the output matrix C_GPU onto the CPU, and the final output is stored on the CPU. The work division and computation are described in Figure 3.1. The challenge in designing efficient hybrid algorithms then lies in finding the right threshold.

Figure 3.1 spgemm using static load balancing. The red colored rows are processed on the CPU and the blue colored rows are processed on the GPU.

Algorithm 2 Heterogeneous Algorithm for spgemm
1: Identify a threshold t at which to split the input matrix. Let r be the row number corresponding to t.
2: A_CPU = A(1..r, :) and A_GPU = A(r+1..M, :); // A_CPU contains the first r rows of A, A_GPU contains the remaining rows
3: CPU :: C_CPU = CPUspgemm(A_CPU, B); // compute C_CPU
4: CPU :: Wait for GPU to finish /* synchronization */
5: GPU :: C_GPU = GPUspgemm(A_GPU, B); // compute C_GPU
6: GPU→CPU :: Transfer C_GPU to the CPU
7: GPU :: Wait for CPU to finish /* synchronization */
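A possible host-side sketch of Algorithm 2 is shown below. The two worker routines are passed in as callables; in the thesis they correspond to the Intel MKL spgemm routine on the CPU and the GPU Row-Row kernel of [48], but the names and signatures here are illustrative placeholders, not the actual library APIs.

```cuda
#include <cuda_runtime.h>
#include <functional>

// Static-split driver corresponding to Algorithm 2: rows [0, r) are processed by the
// CPU worker, rows [r, M) by the GPU worker; the GPU launch is asynchronous, so the
// CPU part runs concurrently with it.
void heterogeneousSpgemm(int M, int r,
                         const std::function<void(int, int)> &cpuSpgemmRows,
                         const std::function<void(int, int, cudaStream_t)> &gpuSpgemmRows,
                         const std::function<void(cudaStream_t)> &copyGpuOutputToHost) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    gpuSpgemmRows(r, M, stream);     // rows r..M-1 on the GPU (asynchronous launch)
    cpuSpgemmRows(0, r);             // rows 0..r-1 on the CPU, overlapped with the GPU

    copyGpuOutputToHost(stream);     // enqueue transfer of C_GPU back to the host
    cudaStreamSynchronize(stream);   // wait for GPU compute and transfer to finish

    cudaStreamDestroy(stream);
}
```

Keeping the worker routines abstract reflects the fact that Algorithm 2 only fixes the split point and the synchronization structure, not the particular spgemm implementations used on each device.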

Figure 3.2 Performance comparison of the heterogeneous method w.r.t. the Row-Row method on the datasets shown in Figure 2.3 and Figure 2.4, on Hetero-II (Section 2.5.2).

A good value of t can be obtained by exhaustive experimentation; we call the corresponding time the best heterogeneous time. The best heterogeneous times on the platform given in Section 2.5.2 are shown in Figure 3.2. We notice that our heterogeneous algorithm has a 1.5x speedup over the GPU implementation on the datasets given in Figure 2.3 and Figure 2.4. However, exhaustive experimentation is not an ideal solution. Hence, we start with identifying heuristics to find a good value for t. We experiment with two different heuristics.

3.1.2 Heuristic I

In our first heuristic, we find the threshold based on the number of multiplications involved in an instance of spgemm when using the Row-Row formulation. For a sparse matrix A, let N_i(A) denote the number of nonzero elements in the i-th row of A, and let I_i(A) denote the indices of the nonzero elements in the i-th row of A. According to the Row-Row formulation, the number of multiplications for processing the i-th row of A in A × B is Σ_{j ∈ I_i(A)} N_j(B). We observed that the average GPU performance over the CPU on the dataset from Figure 2.3 is around 3x [48]. So, we set t to be 25% of the total number of multiplications. We find r, the row number by which t% of the multiplications occur. We then assign rows 1 to r to be processed on the CPU and rows r + 1 to M to be processed on the GPU. The results of this heuristic are presented in Figure 3.3.
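A sketch of how Heuristic I can compute the split row r is given below; the routine name and types are illustrative, and A is assumed to be in CSR form with the per-row non-zero counts of B precomputed.

```cuda
#include <vector>
#include <cstdint>

// Heuristic I: count the Row-Row multiplications contributed by each row of A,
// then find the smallest r such that rows 0..r-1 account for t% of the total.
int heuristicISplit(const std::vector<int> &rowPtrA,
                    const std::vector<int> &colsA,
                    const std::vector<int> &nnzPerRowB,
                    double t /* e.g. 0.25 to give 25% of the work to the CPU */) {
    int M = (int)rowPtrA.size() - 1;
    std::vector<std::uint64_t> multsPerRow(M, 0);
    std::uint64_t total = 0;
    for (int i = 0; i < M; ++i) {
        for (int p = rowPtrA[i]; p < rowPtrA[i + 1]; ++p)
            multsPerRow[i] += nnzPerRowB[colsA[p]];   // sum of N_j(B) over j in I_i(A)
        total += multsPerRow[i];
    }
    // Prefix-sum until the running count crosses t% of the total multiplications.
    std::uint64_t running = 0, target = (std::uint64_t)(t * total);
    for (int r = 0; r < M; ++r) {
        running += multsPerRow[r];
        if (running >= target) return r + 1;          // rows 0..r go to the CPU
    }
    return M;
}
```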

As can be observed, the best heterogeneous run time outperforms the hybrid run time obtained by using the proposed heuristic. Our heuristic considers only the average speedup to arrive at a value of t, and its weakness can be attributed to that. To remedy this situation, we propose a better heuristic that takes the run times of Intel MKL and the GPU Row-Row formulation into account.

3.1.3 Heuristic II

In this heuristic, we delve a bit into each instance. We take the run time of the instance on the CPU and also on the GPU. Let these run times be t_c and t_g respectively. We take the threshold t to be t_g / (t_c + t_g), expressed as a percentage. For example, if the CPU alone takes t_c = 30 ms and the GPU alone takes t_g = 10 ms, then t = 10/(30 + 10) = 25%, so the CPU is assigned a quarter of the multiplication work. As earlier, we find a value of r so that the first r rows account for t% of the multiplication operations. The results of using this heuristic are shown in the last column of Figure 3.3. As can be observed, this heuristic performs better than Heuristic I in general but still cannot meet the performance of the best possible heterogeneous approach.

This clearly shows that finding the best possible threshold value with heuristics is difficult. The difficulty can be partly explained by the fact that spgemm is a highly irregular computation. Moreover, it is difficult to estimate the number of rows that are required to make up a given percentage of the total number of operations; knowing this, one could indeed estimate the size of the output matrix, which is one of the difficulties of the spgemm computation. Further, the highly unstructured sparsity nature of the matrices in the dataset from Figure 2.3 makes the task of estimating the threshold very difficult. It may therefore help if there is prior knowledge of the nature of sparsity of the input matrices. We are able to design an analytical model to find the best threshold value for band matrices, which we explain in Chapter 5.

3.2 Sparse Matrix - Dense Matrix Multiplication (csrmm)

3.2.1 Algorithm

Let A be a sparse matrix in CSR format with size M × P, and let B and C be dense matrices stored in row-major format with sizes P × N and M × N respectively, with C = A × B. As with spgemm, the performance of csrmm on the CPU and the GPU is comparable, and it has been shown that the GPU implementation of csrmm [48] outperforms the cusparse library implementation of csrmm. So we devised a heterogeneous algorithm, given in Algorithm 3, which uses the GPU implementation of csrmm [48] on the GPU and the Intel MKL library [42] csrmm routine on the CPU, as it is efficient and standard. In Algorithm 3, the labels CPU (GPU) and CPU→GPU (GPU→CPU) refer to steps executed on the CPU (resp. GPU) and data transfer from the CPU to the GPU (resp. GPU to the CPU).

Similar to spgemm, we choose a threshold t% and assign the computation corresponding to t% of the rows of A to the CPU. The remaining computation is performed on the GPU. Let A_CPU and A_GPU be the partial matrices which are processed on the CPU and the GPU respectively; the matrix B is present on both the CPU and the GPU.

Figure 3.3 Performance comparison of the two presented heuristics w.r.t. the best heterogeneous timings on the dataset shown in Figure 2.3, on platform Hetero-II (Section 2.5.2). The X-axis represents the instances in the dataset and the Y-axis represents performance w.r.t. the best heterogeneous time. The last instance, Average, shows the average value of the series.

Algorithm 3 Heterogeneous Algorithm for csrmm
1: Identify a threshold t at which to split the input matrix. Let r be the row number corresponding to t.
2: A_CPU = A(1..r, :) and A_GPU = A(r+1..M, :); // A_CPU contains the first r rows of A, A_GPU contains the remaining rows
3: CPU :: C_CPU = CPUcsrmm(A_CPU, B); // compute C_CPU
4: CPU→GPU :: Transfer C_CPU to the GPU
5: CPU :: Wait for GPU to finish /* synchronization */
6: GPU :: C_GPU = GPUcsrmm(A_GPU, B); // compute C_GPU
7: GPU→CPU :: Transfer C_GPU to the CPU
8: GPU :: Wait for CPU to finish /* synchronization */

Figure 3.4 Performance comparison of the heterogeneous algorithm w.r.t. the GPU algorithm for csrmm [48] on the dataset shown in Figure 2.3, on Hetero-I (Section 2.5.1).

Let C_CPU and C_GPU be the outputs computed by the CPU and the GPU respectively. As csrmm is typically used inside iterative algorithms, the GPU transfers the output matrix C_GPU onto the CPU and the CPU transfers C_CPU onto the GPU, so that both the CPU and the GPU have the final output matrix.

3.2.2 Results

We evaluate our algorithm on the platform mentioned in Section 2.5.1 using the dataset given in Figure 2.3. We vary the threshold value t from 0 to 100; the best value of t is obtained by exhaustive experimentation. A performance comparison of our heterogeneous algorithm and the GPU algorithm [48] is shown in Figure 3.4. Our heterogeneous algorithm is up to 15% faster compared to the GPU algorithm. It can be observed that some instances, such as Economics, Epidemiology, Circuit and Webbase, do not perform well with the heterogeneous algorithm because for those matrices the computation time is less than the transfer time.

3.3 Sparse Matrix Vector Multiplication (spmv)

In this section we first give an example which shows that multiplying a sparse matrix with a vector can be done by reordering the rows of the matrix A and the vector X suitably; we make use of such a reordering in our work division algorithms. Later we discuss the work division schemes along with the results.

3.3.1 Reordering Rows of A in spmv

The spmv computation is typically used in iterative methods such as Lanczos iterations, Arnoldi iterations and the like [39]. These methods run the spmv kernel for a large number of iterations. In each iteration the vector is updated according to certain rules and the updated vector is used in the next iteration. The rules in these methods are generally scalar addition/multiplication of a vector and subtraction of vectors. None of these operations is affected by a rearrangement of the input vector and the input matrix. Hence, in some of our work division algorithms, we rely on this fact. This is illustrated below.

Let A be an M × M matrix and X the initial vector of size M × 1. Let A_1 be the matrix obtained after rearranging the rows of A in increasing order of the number of non-zeros per row. Let the vector X also be rearranged accordingly, and call the resulting vector X_1. Let A_2 be the matrix obtained after rearranging the columns of A_1 in the same order. Let R = [r_1, r_2, ..., r_M] be the array of row numbers of A sorted in ascending order of the number of non-zeros per row, where 1 ≤ r_i ≤ M and 1 ≤ i ≤ M. Notice that X_1[k] is equal to X[R[k]]. For a matrix A, let A(i, :) and A(:, i) denote the i-th row and the i-th column of A respectively. Then A_1(k, :) = A(R[k], :) and A_2(:, k) = A_1(:, R[k]).

With this notation, reordering the rows and columns of A in this way does not affect the output of A × X beyond a corresponding reordering. Without any reordering, one iteration computes Y = A × X; with the reordering, one iteration computes A_2 × X_1, whose k-th entry equals Y[R[k]], and pre-multiplying the resulting vectors again by A and A_2 respectively preserves the same correspondence in the next iteration. In other words, the output produced in the i-th iteration by multiplying the preprocessed matrix and vector is identical to a rearrangement of the output produced in the i-th iteration by multiplying the actual matrix and the actual vector. So it is sufficient to rearrange the final output once at the end.
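Because the original numeric example is only summarized above, the following self-contained sketch (with an assumed 4 × 4 matrix, not the original example) checks the claim directly: the product of the reordered matrix and the reordered vector equals the reordered product of the original matrix and vector.

```cuda
#include <vector>
#include <numeric>
#include <algorithm>

int main() {
    std::vector<std::vector<double>> A = {{0, 2, 0, 0},
                                          {1, 0, 3, 4},
                                          {0, 0, 5, 0},
                                          {6, 7, 0, 8}};
    std::vector<double> X = {1, 2, 3, 4};
    int M = (int)A.size();

    // R[k] = index of the row with the k-th smallest number of non-zeros.
    std::vector<int> R(M);
    std::iota(R.begin(), R.end(), 0);
    auto nnz = [&](int i) { return std::count_if(A[i].begin(), A[i].end(),
                                                 [](double v) { return v != 0.0; }); };
    std::stable_sort(R.begin(), R.end(), [&](int a, int b) { return nnz(a) < nnz(b); });

    // A2(k, l) = A(R[k], R[l]) and X1(k) = X(R[k]), as defined in the text.
    std::vector<std::vector<double>> A2(M, std::vector<double>(M));
    std::vector<double> X1(M), Y(M, 0), Y2(M, 0);
    for (int k = 0; k < M; ++k) {
        X1[k] = X[R[k]];
        for (int l = 0; l < M; ++l) A2[k][l] = A[R[k]][R[l]];
    }
    for (int i = 0; i < M; ++i) for (int j = 0; j < M; ++j) Y[i]  += A[i][j]  * X[j];
    for (int i = 0; i < M; ++i) for (int j = 0; j < M; ++j) Y2[i] += A2[i][j] * X1[j];

    // Check: Y2[k] == Y[R[k]] for every k, i.e. the reordered product is the reordered output.
    for (int k = 0; k < M; ++k) if (Y2[k] != Y[R[k]]) return 1;
    return 0;
}
```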

3.3.2 Work Division Schemes

In this section we describe three work division schemes for spmv. Let A be the input matrix of size M × M and X a vector of size M × 1, and let Y = A × X be the result vector of size M × 1. Note that the spmv workload possesses data parallelism and hence it is possible to divide the computation across devices. However, there are two algorithmic challenges that one has to address. Firstly, the computational methodologies of the CPU and the GPU are vastly different. Hence, one has to understand how to match the right kind of work to each device: for instance, should rows with several non-zero elements be processed on the CPU or the GPU? Secondly, the volume of computation involved per row varies according to the number of non-zeros in that row. Hence, one has to identify the right amount of work to give to each device. It has to be noted, however, that one seeks simple yet effective solutions for both of the above challenges. In this section, we present three possible approaches for work division with respect to spmv; later sections address the questions of matching the right work to the right device and the quantum of work. All three work division methods presented in this section are based on the criterion that arriving at the work division should be as simple as sorting the rows of the matrix on the number of non-zeros.

Direct-Division: This scheme involves dividing the matrix A into two matrices A_CPU and A_GPU. Let r be a number between 1 and M. Then, A_CPU consists of rows 1 to r and A_GPU consists of rows r + 1 to M. Choosing r can be done empirically.

Figure 3.5 The Direct-Division scheme for spmv. The red colored rows of A are processed on the CPU, and the other rows are processed on the GPU. The corresponding portion of the result vector in each iteration is also colored accordingly.

Large-Rows-GPU: In this scheme, we first sort the rows of A in increasing order of the number of non-zeros and rearrange the matrix and vector accordingly, as explained in Section 3.3.1. Then, we pick a number r between 1 and M and define the two matrices A_CPU and A_GPU as earlier: A_CPU consists of rows 1 to r of A, and A_GPU consists of rows r + 1 to M.

Figure 3.6 The Large-Rows-GPU scheme for spmv. The red colored rows of A are processed on the CPU, and the other rows are processed on the GPU. The corresponding portion of the result vector in each iteration is also colored accordingly.

Small-Rows-GPU: In this scheme, as in the earlier scheme, we sort the rows of A, this time in descending order of the number of non-zeros per row, and rearrange the matrix and vector accordingly, as explained in Section 3.3.1. Then a number r between 1 and M is chosen. The matrices A_CPU and A_GPU are then defined as the matrices consisting of rows 1 to r and rows r + 1 to M of A respectively.

Figure 3.7 The Small-Rows-GPU scheme for spmv. The red colored rows of A are processed on the CPU, and the other rows are processed on the GPU. The corresponding portion of the result vector in each iteration is also colored accordingly.

Figures 3.5, 3.6 and 3.7 illustrate the above work division schemes. All the above methods can use a common algorithmic model, described below. The main algorithm we use is Algorithm 4, which creates the two matrices A_CPU and A_GPU. For a matrix A of M rows and positive integers 1 ≤ a < b ≤ M, we use A(a..b, :) to denote the sub-matrix obtained by taking rows a to b of A. In Algorithm 4, the labels CPU (GPU) and CPU→GPU (GPU→CPU) refer to steps executed on the CPU (resp. GPU) and data transfer from the CPU to the GPU (resp. GPU to the CPU). On multi-core CPUs, the

spmv library routine from Intel MKL [42] is the best known reported implementation. Hence, we use this routine on A_CPU; let us call the output of this computation Y_CPU. Similarly, on the GPU, we make use of the spmv routine in the cusp library provided by Nvidia [51], which is the best known reported implementation on GPUs for general sparse matrices; let us call this output Y_GPU. Now, Steps 6 and 10 of Algorithm 4 call for the transfer of Y_CPU from the CPU to the GPU and the transfer of Y_GPU from the GPU to the CPU. This step is required so that the next iteration at the CPU and the GPU has access to the entire result vector computed in this iteration.

Algorithm 4 Heterogeneous Algorithm for spmv
1: Identify a row number r at which to split the input matrix;
2: A_CPU = A(1..r, :) and A_GPU = A(r+1..M, :);
3: X_CPU_0 = X_GPU_0 = B;
4: for i = 0 to n iterations do
5:    CPU :: Y_CPU = CPUspmv(A_CPU, X_CPU_i);
6:    CPU→GPU :: Transfer Y_CPU to the GPU
7:    CPU :: Wait for GPU to finish /* synchronization */
8:    GPU :: Append Y_CPU to Y_GPU to get C_i on the GPU side;
9:    GPU :: Y_GPU = GPUspmv(A_GPU, X_GPU_i);
10:   GPU→CPU :: Transfer Y_GPU to the CPU
11:   GPU :: Wait for CPU to finish /* synchronization */
12:   CPU :: Append Y_GPU to Y_CPU to get C_i on the CPU side;
13:   CPU :: X_CPU_{i+1} = C_i;
14:   GPU :: X_GPU_{i+1} = C_i;
15: end for
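A host-side sketch of the per-iteration structure of Algorithm 4 is given below. The CPU and GPU spmv routines (Intel MKL and cusp in the thesis) are abstracted as callables; the names, signatures and double-precision data type are illustrative assumptions, not the actual library interfaces.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <functional>

// h_x / d_x hold the full input vector on host and device; the CPU worker produces
// rows [0, r) of Y into h_y, the GPU worker rows [r, M) into d_y.
void heterogeneousSpmv(int M, int r, int iterations,
                       std::vector<double> &h_x, double *d_x, double *d_y,
                       const std::function<void(const double *, double *)> &cpuSpmv,
                       const std::function<void(const double *, double *, cudaStream_t)> &gpuSpmv) {
    std::vector<double> h_y(M);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    size_t cpuBytes = r * sizeof(double), gpuBytes = (M - r) * sizeof(double);

    for (int it = 0; it < iterations; ++it) {
        gpuSpmv(d_x, d_y, stream);               // GPU works on rows r..M-1 (asynchronous)
        cpuSpmv(h_x.data(), h_y.data());         // CPU works on rows 0..r-1 concurrently

        // Exchange the partial result vectors so that both devices hold the full Y.
        cudaMemcpyAsync(d_y, h_y.data(), cpuBytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(h_y.data() + r, d_y + r, gpuBytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);           // hard per-iteration synchronization point

        // Modify(Y) would be applied on both copies here; then Y becomes the next X.
        h_x = h_y;
        cudaMemcpyAsync(d_x, d_y, M * sizeof(double), cudaMemcpyDeviceToDevice, stream);
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}
```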

3.3.3 Experimental Results

We evaluate our heterogeneous algorithms on the platform described in Section 2.5 using two datasets. One is the standard dataset proposed by Williams et al. [63], shown in Table 2.3. This dataset has been the dataset of choice for several recent works on spmv [50, 49, 48]. We also consider a dataset containing a sample of matrices from the University of Florida sparse matrix collection [38]; these matrices are shown in Table 2.4. The results of the three work division schemes on the dataset from Table 2.3 and on the dataset from Table 2.4 are shown in Figure 3.8(a) and 3.8(b) respectively. The label Pure GPU in Figure 3.8 refers to the spmv implementation from cusp [51]; we show the relative speed-up of the three work division methods with respect to it. It can be noted that all three work division schemes generally outperform a pure GPU implementation. In some cases we notice an improvement of 45% (e.g., the matrix FEM/Harbor in Figure 3.8(a)), and an average improvement of 20%.

Figure 3.8 Performance comparison of execution times of the three work division methods for sparse matrices from two different datasets: (a) dataset from Table 2.3, (b) dataset from Table 2.4. Each panel plots the speedup over the pure GPU implementation for Direct-Division, Small-Rows-GPU and Large-Rows-GPU. The line anchored to the second Y-axis, labeled Max/Min, measures the ratio of the best speed-up to the least speed-up among the three work division methods. The last item on the X-axis refers to the average of the dataset.

It is however interesting to note that no single scheme has a clear advantage across the matrices considered. Some of this can be explained by the following reasoning. The matrices considered in the dataset vary widely in the nature of their sparsity. In some cases, for example, the Epidemiology matrix has mostly rows with 4 non-zero elements each. In this case, a pure GPU implementation can outperform a heterogeneous implementation because of the identical and small workload across GPU threads. In a heterogeneous implementation on this matrix, the time taken to transfer the result vector dominates the compute time required by either device. That is indeed the case in other instances as well, such as internet, p2p-gnutella31, and wiki-vote.

Some matrices such as Protein, FEM/Spheres, and FEM/Harbor from Table 2.3 exhibit a strong degree of locality with respect to the column indices of the non-zero elements. This can also be observed from the row-column plots of the matrices.¹ This means that while all three work division methods presented earlier outperform a non-heterogeneous implementation, there is very little difference between the three heterogeneous algorithms. This phenomenon is illustrated in Figure 3.8 via the Max/Min line anchored to the second Y-axis. For the above matrices, among the three work division methods, the best speed-up achieved with respect to a GPU alone implementation exceeds the least speed-up achieved by under 10%.

Figure 3.9 Timeline of two iterations of spmv on the matrix FEM/Cantilever from Table 2.3. The labels CPU and GPU indicate computations on the CPU and the GPU respectively. The labels CPU→GPU and GPU→CPU indicate transfer of the partial result vector from the CPU to the GPU and vice-versa.

For another set of matrices such as Wind Tunnel, Economics, FEM/Accelerator, and FEM/Cantilever, the number of non-zeros in each row is near-uniform, and also moderate in number. This means that there is not much difference in how the three methods partition the workload. Therefore, we see very little difference in the speed-up achieved by the three methods. It should be noted however that all three methods outperform a non-heterogeneous implementation on these matrices.

¹ Available at HYS.html

However, for matrices such as Circuit and Webbase from Table 2.3, and matrices such as amazon0312, internet, web-google, dblp-2010, p2p-gnutella31, and ca-condmat from Table 2.4, the Small-Rows-GPU method outperforms the other methods considerably. The reason for this is explained in the subsection on scale-free matrices below.

To study the efficiency of our implementation in utilizing the CPU and the GPU, we show the work done by the CPU and the GPU on the FEM/Cantilever matrix from Table 2.3 on a timescale in Figure 3.9. Figure 3.9 also shows the time taken to transfer the result vector in each iteration. It can be noticed from Figure 3.9 that our implementation is able to match the computation and transfer times required by the CPU and the GPU very closely. This also indicates that our implementation suffers from very little idle time either for the CPU or the GPU. It is possible to further reduce the idle time by using a standard double buffering technique at both the CPU and the GPU to overlap the computation with data transfer. However, this means that the calls to the spmv routine in the MKL library and the cusp library have to be issued multiple times on portions of A_CPU and A_GPU respectively. Also, multiple calls have to be issued for initiating the transfer of the partial Y_CPU to the GPU and the partial Y_GPU to the CPU. These additional calls result in an overhead that outweighs the advantages of double buffering.

Scale-free Matrices

A matrix is said to exhibit a scale-free nature if the matrix has several rows with very few non-zero elements per row, and a very few rows with a large number of non-zero elements. Such matrices arise in several practical settings including transportation networks, web search, Internet algorithmics, and the like, as the matrices underlying such computations tend to be scale-free. It is observed that some matrices from Table 2.3 and Table 2.4 have a large majority of rows with a small number of non-zero elements per row, and very few rows with a large number of non-zero elements per row. It is to be noted that the threshold for whether a row has a small number of non-zeros varies with the matrix, from the order of a hundred for the Webbase matrix to less than 50 for the Circuit matrix. In this section, we show that the Small-Rows-GPU scheme of work division is well suited for scale-free matrices, along with evidence for the same.

The algorithm we use is described in Algorithm 4. We identify the matrices A_CPU and A_GPU using an empirical exhaustive search to identify the best possible division amongst the CPU and the GPU. Figure 3.10 shows the result of applying this division scheme on a collection of scale-free matrices taken from the datasets of Table 2.3 and Table 2.4. As can be noticed, the Small-Rows-GPU method outperforms a GPU alone implementation and also the other two methods, except in cases where the transfer time dominates the compute time, as in matrices such as internet and p2p-gnutella31. In the other cases, the average improvement is noted to be 20%. We offer two explanations for this behavior. Firstly, we investigate the cache-hit ratio of the last level cache on the CPU used in our experiments. In Figure 3.11, we compare the cache hit ratio of two schemes, Small-Rows-GPU and Large-Rows-GPU, on four different scale-free matrices used in

our experiments. As can be seen from Figure 3.11, when large rows are processed on the CPU, there is indeed an improvement in the cache-hit ratio. In some instances, e.g., the Webbase matrix, the cache hit ratio for the Small-Rows-GPU method is much better than for the Large-Rows-GPU method. This augurs well for the computation. On the other hand, small rows happen to be a good fit for the GPU. In a GPU workload consisting of rows with a small number of non-zero elements, it is likely that threads in a warp have near-identical work. So, there is less chance of load imbalance across threads in a warp, leading to better performance on a GPU. One can also see increased occupancy, as each thread brings only a few elements into the shared memory and registers.

Figure 3.10 Applying the Small-Rows-GPU method to scale-free matrices from the datasets of Table 2.3 and Table 2.4. The Y-axis shows the speed-up over the pure GPU implementation. The last item on the X-axis refers to the average of the series.

For the above reasons, the Small-Rows-GPU work division scheme is well suited for scale-free matrices. One drawback, however, is that we need to perform an exhaustive search to find the best work division. The next chapter addresses this problem using a work queue framework.

Figure 3.11 Cache hit ratio on the CPU last level cache for the Small-Rows-GPU and Large-Rows-GPU schemes on four different scale-free matrices: (a) Circuit, (b) Webbase, (c) Web-Google, (d) DBLP-2010. The X-axis indicates the percentage of the total number of non-zeros that were assigned to the CPU.

Chapter 4

Dynamic Load Balancing Algorithms For Sparse Matrix Kernels

The challenging problem in any heterogeneous algorithm is to divide the work among the heterogeneous devices. We have to do an exhaustive search to find the best possible work division, which takes a lot of time. Also, a static partitioning that ignores the instance leads to load imbalances. So we devised a work queue framework to address the load balancing problem. In this chapter, we give a brief explanation of our work queue framework. We also discuss how we used this framework to achieve dynamic load balancing in sparse matrix operations, along with results.

4.1 Work Queue Model Framework

In this section, we observe that workloads such as sparse matrix operations possess a few characteristics that make them amenable to dynamic load balancing even in heterogeneous environments. These characteristics are listed below.

Independent work units: The computation can be broken down into independent subproblems called work units. It is not necessary that the work units have identical computational requirements.

Easily describable work units: The independent subproblems are easy and succinct to describe. For instance, a work unit could correspond to processing a contiguous set of elements, say rows in a matrix.

Minimal or no post-processing: The solution to the entire problem should be a (near-)immediate consequence of the solutions to the independent work units. There should be little post-processing involved.

Dynamic load balancing of the above category of workloads can be achieved by having multiple threads of the CPU and the GPU share a queue that contains several work units. The individual threads can access the work queue to fetch the next work unit for which computation is still pending.

Figure 4.1 Work Queue model

The CPU threads and the GPU access the work queue from either end so that, in most cases, there is no need to synchronize accesses to the work queue by the CPU and the GPU. However, the CPU threads have to access the queue in a concurrent fashion. Given the low number of CPU threads that we use, we employ a simple locking mechanism on the CPU front variable. We also perform the following further optimizations for improved performance.

Minimal Synchronization: To minimize the synchronization requirement when accessing the queue, we make the queue double-ended. The CPU and the GPU dequeue work units from either end. The only synchronization required between the CPU and the GPU is for dequeuing the last work unit from the queue. Having a double-ended queue also ensures that it is easy to maintain the state of the queue correctly at all times. In practice, we notice that the synchronization requirement is almost non-existent. Sparse matrix operations have independent work units, so this optimization is possible.

Reducing the Overhead of Queue Operations: Secondly, the overhead of queue operations is kept at a minimum by maintaining the queue logically, and not physically. So, we do not actually fill the queue with work units before the start of the computation. Assuming a total of n work units initially, the front pointer on the CPU side is set at work unit 1, and the front pointer on the GPU side is set at work unit n. The front pointer at each end completely describes the progress of the computation. The computation is said to finish when the front pointer on the GPU side meets, or crosses, the front pointer on the CPU side. Since the work units in sparse matrix operations are succinctly describable, this optimization is possible.

Other Program Optimizations: To keep the overhead of GPU kernel launches low, we launch the GPU kernel only once, irrespective of the number of work units that the GPU works on. The GPU kernel interacts with the host to fetch multiple work units without exiting execution on the device.
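A minimal host-side sketch of this logical, double-ended queue is shown below. The names (WorkQueue, cpu_take, gpu_take) are illustrative and not taken from the thesis implementation; the GPU end is shown here as a host-side proxy, whereas the actual implementation keeps a persistent kernel on the device that fetches work units from the host. The queue is never materialized: the two fronts fully describe its state.

#include <algorithm>
#include <mutex>

// Illustrative sketch of the logical double-ended work queue (hypothetical names).
struct WorkQueue {
    int cpu_front;       // next unprocessed work unit from the CPU end (top)
    int gpu_front;       // one past the last unprocessed unit from the GPU end (bottom)
    std::mutex lock;     // CPU threads contend on their end; also guards the meeting point

    explicit WorkQueue(int num_units) : cpu_front(0), gpu_front(num_units) {}

    // A CPU thread grabs up to cpu_size units; returns false once the fronts meet.
    bool cpu_take(int cpu_size, int &start, int &count) {
        std::lock_guard<std::mutex> g(lock);
        if (cpu_front >= gpu_front) return false;
        start = cpu_front;
        count = std::min(cpu_size, gpu_front - cpu_front);
        cpu_front += count;
        return true;
    }

    // Host-side proxy for the GPU end: grabs up to gpu_size units from the bottom.
    bool gpu_take(int gpu_size, int &start, int &count) {
        std::lock_guard<std::mutex> g(lock);
        if (cpu_front >= gpu_front) return false;
        count = std::min(gpu_size, gpu_front - cpu_front);
        gpu_front -= count;
        start = gpu_front;
        return true;
    }
};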

4.2 Sparse Matrix - Matrix Multiplication

Multiplying a sparse matrix with another sparse/dense matrix is an important workload in parallel computing. These operations, called spgemm and csrmm respectively, have applications to several problems from various domains. For instance, spgemm has applications to problems from engineering such as graph algorithms [4], and numerical applications including climate modeling, molecular dynamics, CFD solvers, and so on. csrmm is widely used in Krylov subspace methods such as the Lanczos method and the conjugate gradient method [28]. Indeed, sparse matrix operations are listed as one of the seven dwarfs in the Berkeley report [3].

Sparse Matrix - Sparse Matrix Multiplication (spgemm)

Let A and B be sparse matrices and C = A · B, where A, B and C are matrices of sizes M × N, N × P and M × P respectively. Some of the recent approaches to provide efficient and scalable algorithms for spgemm include [23, 48]. Often, it is a question of how to arrange the matrices in suitable data structures so that the expensive nature of highly irregular memory access patterns can be partly mitigated. It is shown in [48] that the Row-Row method, described below, is most suited for GPUs and also for CPU+GPU heterogeneous platforms. In the Row-Row method, the i-th row of C, C(i, :), is obtained by multiplying each element in A(i, :) with the corresponding row of B. We then add all the scaled rows of B to get C(i, :). In other words, C(i, :) = Σ_{j ∈ A(i,:)} A(i, j) · B(j, :).

As identified in [23, 48], one of the main challenges in arriving at an efficient algorithm for spgemm is the difficulty of estimating the size of the output for a given input or a subset of the input. This suggests that the volume of computation for a subset of the input can vary substantially. However, when one uses the Row-Row formulation, the computation for each row of the output is entirely independent of the computation for other rows of the output. Further, a work unit in this case can be succinctly described as a contiguous set of output rows. Additionally, there is no post-processing involved. All these characteristics of the spgemm workload indicate that the framework from Section 4.1 is suitable for spgemm. In this section, we show that it is indeed the case.

Algorithm

Our heterogeneous algorithm for spgemm can be described briefly as follows. Let cpusize and gpusize denote the work unit sizes of the CPU and the GPU respectively. Let cpuoffset and gpuoffset be two global variables to track the working units of the CPU and the GPU. cpurows and gpurows denote the number of rows computed in a given call made by the CPU and the GPU respectively. The GPU and the CPU use Algorithm 5 and Algorithm 6 respectively. In our implementation, the function spgemmGPU uses the Row-Row based matrix multiplication from [48], which is also used in our static load balancing algorithm in the previous chapter. On the CPU side, the function spgemmCPU uses the Intel MKL routine [42].
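The Row-Row formulation above can be illustrated with a small CSR-based sketch for one output row. This is an illustrative sketch with hypothetical names and a dense accumulator for simplicity, not the implementation from [48] or the MKL routine.

#include <vector>

// Illustrative sketch: compute one output row C(i,:) = sum over j in A(i,:) of
// A(i,j) * B(j,:) from CSR inputs, using a dense accumulator of length P.
// The non-zeros left in the accumulator form row i of C.
void row_row_one_row(int i,
                     const std::vector<int>& A_rowptr, const std::vector<int>& A_col,
                     const std::vector<double>& A_val,
                     const std::vector<int>& B_rowptr, const std::vector<int>& B_col,
                     const std::vector<double>& B_val,
                     std::vector<double>& acc /* size P, zero-initialized */) {
    for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
        const int j = A_col[p];          // column j of A(i,:) selects row j of B
        const double a_ij = A_val[p];
        for (int q = B_rowptr[j]; q < B_rowptr[j + 1]; ++q)
            acc[B_col[q]] += a_ij * B_val[q];   // scale B(j,:) by A(i,j) and accumulate
    }
}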

As described in Figure 4.1, the CPU processes cpurows of A from the top, whereas the GPU processes gpurows of A from the bottom. Once the CPU (GPU) finishes its computation, it processes the next cpurows (gpurows) from the top (bottom). Every time the CPU processes cpurows of A, cpuoffset is incremented by cpurows. Similarly, gpuoffset is decremented by gpurows. When the two pointers meet, we stop processing.

Algorithm 5 Work queue model on the GPU side
cpuoffset = 0; // initialized globally
gpuoffset = M; // initialized globally
// check that the CPU front and the GPU front have not met
while cpuoffset < gpuoffset do
  if cpuoffset < (gpuoffset − gpusize) then
    gpurows = gpusize;
    gpuoffset = gpuoffset − gpurows;
  else
    gpurows = gpuoffset − cpuoffset;
    gpuoffset = cpuoffset;
  end if
  spgemmGPU(A, B, C, gpuoffset, gpurows); // compute the output of gpurows rows starting from the gpuoffset-th row
  Transfer the partial output asynchronously.
end while

Sparse Matrix - Dense Matrix Multiplication (csrmm)

Let A be a sparse matrix, B be a dense matrix stored in row-major format, and let C = A · B. Similar to spgemm, csrmm also has all the characteristics of the framework described in Section 4.1, so we can use the work queue model for csrmm. The algorithm we follow is similar to the one described above for spgemm. It is shown that the GPU implementation of csrmm from [48] outperforms the cusparse library implementation [13]. Hence, in our implementation, we use the csrmm implementation from [48] on the GPU and the csrmm routine from the Intel MKL library [42] on the CPU. We use algorithms similar to Algorithm 6 and Algorithm 5 on the CPU and the GPU respectively.

Results

To validate our approach, in our experiments we have used the popular dataset of sparse matrices from the work of Williams et al. [63], shown in Table 2.3. The dataset from [63] consists of 14 matrices from a wide range of application areas and has been the choice dataset for works on sparse matrix operations in recent times [48, 61, 49, 50, 66]. In our experiments, the baseline algorithm for spgemm and csrmm is the corresponding static load balancing algorithm used in the previous chapter.

Algorithm 6 Work queue model algorithm on the CPU side
cpuoffset = 0; // initialized globally
gpuoffset = M; // initialized globally
// check that the CPU front and the GPU front have not met
while cpuoffset < gpuoffset do
  if (cpuoffset + cpusize) < gpuoffset then
    cpurows = cpusize;
    cpuoffset1 = cpuoffset;
    cpuoffset = cpuoffset + cpusize;
  else
    cpurows = gpuoffset − cpuoffset;
    cpuoffset1 = cpuoffset;
    cpuoffset = gpuoffset;
  end if
  spgemmCPU(A, B, C, cpuoffset1, cpurows); // compute the output of cpurows rows starting from the cpuoffset1-th row
end while

Figure 4.2 Absolute difference in the work split percentage with respect to the baseline implementation for spgemm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series.

Figure 4.3 Absolute difference in the work split percentage with respect to the baseline implementation for csrmm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series.

Figure 4.4 Absolute difference in the runtime with respect to the baseline implementation for spgemm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series.

Figure 4.5 Absolute difference in the runtime with respect to the baseline implementation for csrmm on the platforms given in Sections 2.5.1 (Hetero-High) and 2.5.3 (Hetero-Low). The last instance, Average, shows the average value of the series.

This baseline algorithm uses an empirical strategy to identify the best possible work distribution for the CPU and the GPU. Recall that cpusize and gpusize are the work unit sizes for the CPU and the GPU respectively. As the CPU and the GPU have different architectures and computational models, cpusize and gpusize are also different. In our experiments, we vary both cpusize and gpusize over a wide range of values to find the best values.

In Figure 4.2 and Figure 4.3, we show the absolute difference in the work split percentage between our algorithm and the baseline algorithm for spgemm and for csrmm respectively. It is important to look at the absolute difference between the two, because it gives a clear idea of the deviation from the best possible work division. As can be seen, the average absolute difference in the work split percentage of our implementation with respect to the baseline implementation is under 6% on both the heterogeneous platforms that we used in our experiments and for both spgemm and csrmm. To measure the time overhead of our framework, we measure the absolute difference in the time taken by our implementation with respect to the baseline algorithm. The results of this comparison on the two heterogeneous platforms given in Section 2.5.1 and Section 2.5.3 are shown in Figure 4.4 for spgemm and in Figure 4.5 for csrmm. As can be seen, the average absolute difference in the runtime of our algorithm with respect to the baseline algorithms is under 10% for both workloads on both platforms. The slight difference in the work split percentage and the runtime of our implementation compared to the baseline implementation can be attributed to the overheads involved in our framework, such as breaking the computation into several work units. On the other hand, the baseline implementation treats its portion of the work as a single work unit.

4.3 Sparse Matrix - Vector Multiplication

Recall that A is an M × M sparse matrix, X and Y are vectors of M elements each, and Y = A · X. Similar to spgemm, spmv also has all the characteristics of the framework described in Section 4.1, so we can use the work queue model for spmv. In this section, we present our algorithm along with results.

Algorithm

Let cpusize and gpusize denote the work unit sizes of the CPU and the GPU respectively. Let cpuoffset and gpuoffset be two global variables to track the working units of the CPU and the GPU. cpurows and gpurows denote the number of rows computed in a given call made by the CPU and the GPU respectively. The CPU and the GPU use Algorithm 7 and Algorithm 8 respectively. In Algorithm 7, the function CPUSPMV() in line 13 calls the best known multi-core CPU routine for spmv. In our implementation, we use the Intel MKL library [42] for this purpose. Similarly, in the GPU Algorithm 8, the function GPUSPMV() in line 11 uses the best known GPU implementation for spmv, which in our case is the spmv library call from the Nvidia cusp library [51].

Algorithm 7 Work queue algorithm on the CPU side
1: cpuoffset = 0; /* initialized globally */
2: gpuoffset = M; /* initialized globally */
   // check that the CPU front and the GPU front have not met
3: while cpuoffset < gpuoffset do
4:   if (cpuoffset + cpusize) < gpuoffset then
5:     cpurows = cpusize;
6:     cpuoffset1 = cpuoffset;
7:     cpuoffset = cpuoffset + cpusize;
8:   else
9:     cpurows = gpuoffset − cpuoffset;
10:    cpuoffset1 = cpuoffset;
11:    cpuoffset = gpuoffset;
12:  end if
13:  CPUSPMV(A, B, C, cpuoffset1, cpurows); // compute the output of cpurows rows starting from the cpuoffset1-th row
14: end while

Algorithm 8 Work queue algorithm on the GPU side
1: cpuoffset = 0; /* initialized globally */
2: gpuoffset = M; /* initialized globally */
   // check that the CPU front and the GPU front have not met
3: while cpuoffset < gpuoffset do
4:   if cpuoffset < (gpuoffset − gpusize) then
5:     gpurows = gpusize;
6:     gpuoffset = gpuoffset − gpurows;
7:   else
8:     gpurows = gpuoffset − cpuoffset;
9:     gpuoffset = cpuoffset;
10:  end if
11:  GPUSPMV(A, B, C, gpuoffset, gpurows); // compute the output of gpurows rows starting from the gpuoffset-th row
12:  Transfer the partial output asynchronously.
13: end while

4.3.2 Results

We evaluate the work queue method on the scale-free matrices from Table 2.3 and Table 2.4 on the platform described earlier. In our experiments, we vary both cpusize and gpusize over a wide range of values to find the best values for these parameters. We limited our experiments to scale-free matrices since it was shown earlier that, for such matrices, the Small-Rows-GPU heterogeneous algorithm outperforms the other algorithms.

In our experiments, we focus on two quantities. One of them is the work division percentage. We compare the work division percentage identified by the work queue model and the baseline algorithm. The baseline algorithm follows the Small-Rows-GPU work division scheme and identifies the work division percentage by an empirical exhaustive search. The results of this experiment on the scale-free matrices from Table 2.3 and Table 2.4 are shown in Figure 4.6(a). We can see that the absolute percentage difference in the work division is under 3% in most instances. This difference can be attributed to a few overheads of the work queue method. In our second experiment, we compare the time taken by the Small-Rows-GPU algorithm using the work queue method with respect to the baseline algorithm. The absolute difference of these times is shown in Figure 4.6(b) for the scale-free matrices from Table 2.3 and Table 2.4. It can be noticed that the absolute difference is under 10% on average.

The above results suggest that using a work queue can result in a workload-aware division of work across the devices in a heterogeneous platform. The utility of the work queue method for spmv can be argued in two ways. If spmv is used iteratively, then the above overheads are applicable only for the first iteration. The rest of the iterations can use the work division percentage identified by the work queue method, since the nature of the spmv computation does not change over iterations. Another possibility is to localize the work division percentage around the one identified by the work queue method. This can help avoid an exhaustive search for the work division percentage. In the above experiments we find the work unit sizes through exhaustive search only. These values depend on the nature of the problem, so one could come up with an analytical model to identify them.

Figure 4.6 Absolute difference of (a) the work division percentage and (b) the overall runtime between the work queue based algorithm and the baseline algorithm for scale-free matrices from Table 2.3 and Table 2.4 on the platform described earlier.

Chapter 5

Analytical Model For Band Matrix Multiplication

Work division among heterogeneous devices is a challenging problem. Recall from Chapter 3 that, in the case of general sparse matrix multiplication, it is difficult to predict the best work division even using heuristics. So, in this chapter, we define an analytical model to find the work division in the case of band matrix multiplication. A band matrix is a subclass of sparse matrices. It is not always possible to design an analytical model for work division; where it is possible, devising one removes the need for an exhaustive search to find the best work division threshold.

5.1 Band Matrix

Band matrices are a special kind of sparse matrices where the nonzero entries appear within a diagonal band. This allows one to store band matrices in a more efficient data structure than the COO or CSR formats. These matrices are therefore stored in a separate format called the diagonal format (DIA) [33]. The diagonal format consists of two arrays: for a matrix A, the data_A array stores the nonzero values, and the offset_A array stores the offset of each diagonal from the main diagonal. The i-th column of data_A holds the i-th diagonal of the matrix, and offset_A[i] indicates the offset of the i-th diagonal. Figure 5.1 illustrates the DIA representation of an example matrix with four diagonals. In our implementation, data_A is stored in column-major order so that the diagonals are placed adjacently from left to right. Entries with the symbol * are stored as 0. Notice that multiplying two band matrices results in another band matrix.

Figure 5.1 An example illustrating the DIA format representation.
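As a small, hypothetical illustration of the DIA layout just described (this is not the matrix of Figure 5.1), consider a 4 × 4 tridiagonal matrix with data_A stored in column-major order and one column per diagonal:

// A hypothetical 4 x 4 tridiagonal matrix (not the matrix of Figure 5.1):
//     [ 1  2  0  0 ]
//     [ 3  4  5  0 ]
// A = [ 0  6  7  8 ]
//     [ 0  0  9 10 ]
// Entry (k, i) of data_A holds A(k, k + offset_A[i]); entries falling outside
// the matrix (marked * in the text) are stored as 0. data_A is column-major.
#include <vector>

const int rows = 4, diagonals = 3;
const std::vector<int> offset_A = { -1, 0, +1 };
// the three diagonals appear one after another, left to right
const std::vector<double> data_A = {
    0, 3, 6, 9,      // offset -1 (sub-diagonal); row 0 has no such entry
    1, 4, 7, 10,     // offset  0 (main diagonal)
    2, 5, 8, 0       // offset +1 (super-diagonal); row 3 has no such entry
};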

5.2 Algorithm

Let A, B, and C be band matrices with C = A · B. Let Adiagonals, Bdiagonals, and Cdiagonals indicate the number of diagonals in A, B, and C respectively. We can see that Cdiagonals = Adiagonals + Bdiagonals − 1. In general, multiplying the i-th diagonal elements of A with the j-th diagonal elements of B contributes output to the diagonal whose offset is offset_A[i] + offset_B[j]. The DIA format allows for more efficient algorithms to multiply two band matrices on the CPU and also on the GPU. The CPU algorithm and the GPU algorithm are presented as Algorithm 9 and Algorithm 10 respectively. Algorithm 9 iterates over the diagonals of A and the diagonals of B. For a given pair of such diagonals, all the applicable multiplications are done in parallel. An example of band matrix multiplication, in which A and B each have two diagonals, is worked out below. In this example, Adiagonals, Bdiagonals, and Cdiagonals are 2, 2, and 3 respectively.

Iteration 1: The first columns of both data_A and data_B are involved in the computation.

Iteration 2: The first column of data_A and the second column of data_B are involved in the computation.

Iteration 3: The second column of data_A and the first column of data_B are involved in the computation.

Iteration 4: The second columns of both data_A and data_B are involved in the computation.

In the GPU algorithm, each block of threads processes BlockSize rows of the A matrix. Every block of threads brings the applicable portion of the B matrix into shared memory. We use variables such as Arow to denote the starting row number of A corresponding to a block on the GPU, and startbrow and endbrow to denote the starting and ending rows of the applicable portion of B. Every block computes a portion of the output and writes it to C. The computation is similar to that of the CPU algorithm, except for the calculation of the indices and offsets used.

Algorithm 9 CPU Algorithm
for i = 1 to Adiagonals do
  for j = 1 to Bdiagonals do
    outdiagoffset = offset_A[i] + offset_B[j]
    outdiagnumber = outdiagoffset − offset_A[0] − offset_B[0]
    {writing output to the diagonal computed above}
    for k = 1 to Crows in parallel do
      data_C(k, outdiagnumber) += data_A(k, i) × data_B(k + i + offset_A[0], j)
    end for
  end for
end for

Algorithm 10 GPU Algorithm
Every BlockSize rows of A is assigned to a block of GPU threads.
for each thread with index tid in the block do
  startbrow = Arow + offset_A[0]
  endbrow = startbrow + Adiagonals − 1 + BlockSize
  Bring rows of B from startbrow to endbrow into shared memory.
  for i = 1 to Adiagonals do
    for j = 1 to Bdiagonals do
      outdiagoffset = offset_A[i] + offset_B[j]
      outdiagnumber = outdiagoffset − offset_A[0] − offset_B[0]
      data_C(tid, outdiagnumber) += data_A(tid, i) × data_B(tid + i + offset_A[0], j)
    end for
  end for
end for
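A sequential C++ sketch of the CPU algorithm (Algorithm 9) over the DIA arrays is given below; the names and 0-based indexing are illustrative, not the thesis code, and the k-loop is the one Algorithm 9 runs in parallel. It assumes the diagonals of A and B are contiguous, as in a band matrix, so that offset_A[i] = offset_A[0] + i, which matches the index k + i + offset_A[0] used in Algorithm 9.

#include <vector>

// Illustrative sequential sketch of Algorithm 9. data_A and data_B are stored
// column-major, one column per diagonal, with out-of-range entries padded with 0.
void band_multiply_dia(int n,                                  // rows of A, B and C
                       const std::vector<double>& data_A, const std::vector<int>& offset_A,
                       const std::vector<double>& data_B, const std::vector<int>& offset_B,
                       std::vector<double>& data_C,      std::vector<int>& offset_C) {
    const int a_diags = static_cast<int>(offset_A.size());
    const int b_diags = static_cast<int>(offset_B.size());
    const int c_diags = a_diags + b_diags - 1;                 // Cdiagonals = Adiagonals + Bdiagonals - 1
    offset_C.resize(c_diags);
    for (int d = 0; d < c_diags; ++d) offset_C[d] = offset_A[0] + offset_B[0] + d;
    data_C.assign(static_cast<size_t>(c_diags) * n, 0.0);

    for (int i = 0; i < a_diags; ++i) {
        for (int j = 0; j < b_diags; ++j) {
            const int out = (offset_A[i] + offset_B[j]) - (offset_A[0] + offset_B[0]);
            for (int k = 0; k < n; ++k) {                      // the parallel loop of Algorithm 9
                const int r = k + offset_A[i];                 // row of B touched by A(k, k + offset_A[i])
                if (r < 0 || r >= n) continue;                 // padded entry of A, contributes nothing
                data_C[out * n + k] += data_A[i * n + k] * data_B[j * n + r];
            }
        }
    }
}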

5.3 Analytical Model

To identify the correct threshold to use in the hybrid approach, we proceed as follows. Let A_r denote the number of rows in the A matrix, and let A_d and B_d denote the number of diagonals in the A matrix and the B matrix respectively. Let R = A_r · A_d · B_d and S = A_r · (A_d + B_d − 1). It can be seen that the time taken by the CPU algorithm (see also Algorithm 9) is proportional to R. If we process t% of the rows on the CPU, then the number of operations performed on the CPU is proportional to t · R / 100. Similarly, the time taken by the GPU is proportional to (100 − t) · R / 100. Let us assume that the final output is made available on the CPU by transferring the output from the GPU to the CPU in time proportional to S. For a few input matrices, we evaluate the performance of the CPU algorithm, the GPU algorithm, and the copy time of the output from the GPU to the CPU. This helps us identify parameters α, β, and γ such that:

CPUtime = α · R
GPUtime = β · R
Copytime = γ · S

In the above, the parameter α is a constant that depends on the CPU, β is a constant that depends on the GPU, and γ depends on the bandwidth of the PCI Express link connecting the CPU and the GPU. The Copytime above refers to the time taken to transfer the GPU part of the output to the CPU. If we use t% as the threshold, then to minimize the hybrid execution time we require:

CPUtime = GPUtime + Copytime, i.e., t · α · R = (100 − t) · (β · R + γ · S).

Solving the above equation for t gives t = 100(βR + γS) / ((α + β)R + γS).
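Once α, β, and γ have been calibrated, the closed-form threshold can be evaluated directly. A small sketch is shown below; the calibration constants in main() are hypothetical values, not measurements from the thesis.

#include <cstdio>

// Predicted CPU share t (in percent) from the analytical model:
// t = 100 * (beta*R + gamma*S) / ((alpha + beta)*R + gamma*S),
// where R = A_r * A_d * B_d and S = A_r * (A_d + B_d - 1).
double predict_threshold(double alpha, double beta, double gamma,
                         long long A_r, long long A_d, long long B_d) {
    const double R = static_cast<double>(A_r) * A_d * B_d;
    const double S = static_cast<double>(A_r) * (A_d + B_d - 1);
    return 100.0 * (beta * R + gamma * S) / ((alpha + beta) * R + gamma * S);
}

int main() {
    // Hypothetical calibration constants (not measured values from the thesis).
    const double alpha = 2.0e-9, beta = 0.4e-9, gamma = 1.5e-9;
    const double t = predict_threshold(alpha, beta, gamma, 1 << 20, 9, 9);
    std::printf("predicted CPU share t = %.2f%%\n", t);
    return 0;
}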

5.4 Experiments and Results

We conduct our experiments on the hybrid platform described earlier. To evaluate our model, we use synthetically generated band matrices. The results are shown in Figure 5.2. Our experimental results on synthetically generated band matrices indicate that the CPU and GPU algorithms presented above for band matrices outperform the corresponding CPU and GPU algorithms for spgemm, as expected. The hybrid approach we study is similar to the hybrid approach for spgemm, where a certain t% of the rows of A are processed on the CPU and the remaining (100 − t)% of the rows of A are processed on the GPU. We use our analytical model to find the value of t in the case of band matrices. To study our methodology, we experimented on a set of synthetic matrices with varying A_r, A_d, and B_d. The synthetic dataset is generated with different combinations of A_r, A_d, and B_d, so as to study the effect of varying one or more values among A_r, A_d, and B_d.

Figure 5.2 Performance comparison of the best time and the time predicted using the formula, for various combinations of A_r, A_d, and B_d. A tuple (l, m, n) on the X-axis indicates A_r, A_d, B_d. The last instance, Average, shows the average value of the series.

The results of the study are shown in Figure 5.2. In Figure 5.2, Best time indicates the execution time at the best threshold value of t, and Predicted time indicates the execution time at the threshold value of t predicted by our analytical model. We can observe that the predicted time is close to the best time.


Parallel & Distributed Optimization. Based on Mark Schmidt s slides Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

Imperial College London

Imperial College London Imperial College London Department of Computing GiMMiK - Generating Bespoke Matrix Multiplication Kernels for Various Hardware Accelerators; Applications in High-Order Computational Fluid Dynamics Author:

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

MapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs. http://mapgraph.io

MapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs. http://mapgraph.io MapGraph A High Level API for Fast Development of High Performance Graphic Analytics on GPUs http://mapgraph.io Zhisong Fu, Michael Personick and Bryan Thompson SYSTAP, LLC Outline Motivations MapGraph

More information

GPGPU acceleration in OpenFOAM

GPGPU acceleration in OpenFOAM Carl-Friedrich Gauß Faculty GPGPU acceleration in OpenFOAM Northern germany OpenFoam User meeting Braunschweig Institute of Technology Thorsten Grahs Institute of Scientific Computing/move-csc 2nd October

More information

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum

More information