A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs


1 A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs
HUANG, He
Department of Computer Science, University of Wyoming
Ping Guo, He Huang, Qichang Chen, Liqiang Wang, En-Jui Lee, and Po Chen. A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs. In the 2011 TeraGrid Conference, Salt Lake City, UT. ACM Press, 2011.

2 Outline
1. Introduction
2. GPU performance model
3. Model-driven partitioning
4. Auto-tuning GPU parameters
5. Experiment
6. Conclusion and future work

3 GPU: an attractive platform for data-intensive computation in science and engineering because of its high computational power.
Sparse matrix-vector multiplication (SpMV): a common operation in solving linear systems.
The matrix may be very large and sparse.
A typical iterative method may require many SpMV operations.
SpMV is often the most time-consuming part of a scientific calculation.

4 Observation: SpMV on GPU
Different SpMV CUDA kernels perform well on different types of matrices:
CSR: good when the number of nonzero elements is large.
ELLPACK: better when the number of nonzero elements per row is small and nearly equal across rows.
Hybrid ELL/COO: better when the matrix has few nonzero elements per row and most rows have nearly the same number of nonzeros, but a few irregular rows have many more.
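To make the format trade-offs concrete, here is a minimal CPU reference sketch in Python that stores one small matrix in CSR and in ELLPACK and multiplies each by a vector. It is not the CUDA kernels discussed in the slides; the function names and the example matrix are illustrative assumptions.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr


def dense_to_ell(dense):
    """Convert to ELLPACK: every row is padded to the longest row's nonzero count."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    return vals, cols, width


def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def spmv_ell(vals, cols, x):
    """y = A @ x for an ELLPACK matrix; padded entries contribute zero."""
    return [sum(v * x[j] for v, j in zip(vrow, crow)) for vrow, crow in zip(vals, cols)]


# Illustrative 4x4 matrix (an assumption, not from the slides).
A = [[4.0, 0.0, 0.0, 1.0],
     [0.0, 3.0, 0.0, 0.0],
     [2.0, 0.0, 5.0, 0.0],
     [0.0, 0.0, 0.0, 6.0]]
x = [1.0, 2.0, 3.0, 4.0]

csr = dense_to_csr(A)
ell_vals, ell_cols, ell_width = dense_to_ell(A)
y_csr = spmv_csr(*csr, x)
y_ell = spmv_ell(ell_vals, ell_cols, x)
```

Here ELLPACK stores 4 x 2 = 8 entries for 6 nonzeros. The padding grows with the longest row, which is why ELLPACK suits matrices whose rows have nearly equal lengths.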

5 Objective
Design an automatic approach to improve SpMV performance based on the previous observations:
partition the matrix;
choose a near-optimal compressed format for each partition;
combine the advantages of the various formats.

6 Framework


8 GPU memory architecture review
Thread: registers and local memory.
Block of threads: shared memory. A block is scheduled onto a streaming multiprocessor (SM) for execution.
All blocks: global memory, constant memory, and texture memory.
Registers and shared memory are very fast but very limited; global memory is abundant.
Our model takes the register and shared memory limits into account.
Source: NVIDIA

9 GPU execution time estimation model (1)
Obtain the hardware information of the GPU, which is determined by its compute capability, e.g. GTX
Compile the kernel using cubin.
The information obtained is used in the following steps.

10 Model (2)
Count the number of active thread blocks (TBs) per streaming multiprocessor.
This count is limited by three factors: active warps, registers, and shared memory.
Compute the number of thread blocks allowed by each of the three factors separately; the minimum of the three is the number of active TBs.
Compute the total number of thread blocks for a given matrix:
CSR: one warp per row
ELL: one thread per row
This information is used in the next step.
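The counting step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the hardware limits passed in the example call are hypothetical values chosen for the arithmetic, and real GPUs also apply register allocation granularity and a cap on blocks per SM, both of which are ignored here.

```python
import math


def active_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                         max_warps_per_sm, regs_per_sm, smem_per_sm, warp_size=32):
    """Active thread blocks per SM: the minimum of the three per-resource limits."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    by_warps = max_warps_per_sm // warps_per_block
    # A resource the kernel does not use imposes no limit of its own.
    by_regs = (regs_per_sm // (regs_per_thread * threads_per_block)
               if regs_per_thread else by_warps)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else by_warps
    return min(by_warps, by_regs, by_smem)


def total_blocks(num_rows, threads_per_block, fmt, warp_size=32):
    """Total thread blocks needed to cover all rows of the matrix."""
    if fmt == "csr":
        rows_per_block = threads_per_block // warp_size  # one warp per row
    elif fmt == "ell":
        rows_per_block = threads_per_block               # one thread per row
    else:
        raise ValueError(f"unknown format: {fmt}")
    return math.ceil(num_rows / rows_per_block)


# Hypothetical kernel and hardware numbers, for illustration only.
active = active_blocks_per_sm(threads_per_block=256, regs_per_thread=32,
                              smem_per_block=4096, max_warps_per_sm=48,
                              regs_per_sm=32768, smem_per_sm=49152)
csr_blocks = total_blocks(100_000, 256, "csr")
ell_blocks = total_blocks(100_000, 256, "ell")
```

With these example numbers the warp limit allows 6 blocks, the register limit 4, and the shared memory limit 12, so registers are the binding constraint.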

11 Model (3): CSR
Choose several sample matrices and measure their execution time.
Principles for choosing sample matrices:
1. The matrix should be large enough to fully utilize all active blocks in all SMs and to resemble a real large matrix.
2. Every row should have the same number of nonzeros.
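A sample matrix that satisfies the second principle could be generated along these lines. The slides do not say how column positions or values are chosen, so the uniform random placement, the function name, and the small example size are all assumptions; real sample matrices would be much larger, per the first principle.

```python
import random


def make_uniform_csr(num_rows, num_cols, nnz_per_row, seed=0):
    """Build CSR arrays for a matrix in which every row has exactly nnz_per_row nonzeros."""
    rng = random.Random(seed)
    values, col_idx, row_ptr = [], [], [0]
    for _ in range(num_rows):
        # Distinct, sorted column positions for this row (assumed uniform at random).
        cols = sorted(rng.sample(range(num_cols), nnz_per_row))
        col_idx.extend(cols)
        values.extend(rng.uniform(-1.0, 1.0) for _ in cols)
        row_ptr.append(len(col_idx))
    return values, col_idx, row_ptr


values, col_idx, row_ptr = make_uniform_csr(num_rows=1000, num_cols=1000, nnz_per_row=8)
row_lengths = {row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)}
```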

12 Model (3): ELLPACK
The execution time depends on the number of nonzeros per row and the number of rows.
Apply two-dimensional interpolation to the measurements of the sample matrices.
Use the fitted surface to estimate execution time:
x axis: log(# nonzeros per row)
y axis: log(# rows)
z axis: log(time)
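One way to realize such a surface is bilinear interpolation over a grid of measurements in log space, sketched below. The slides do not name the interpolation scheme, so bilinear interpolation is an assumption, and the timing table is synthetic (time proportional to nonzeros per row times rows), not measured data.

```python
import math
from bisect import bisect_right


def _bracket(grid, v):
    """Index i of the grid cell [grid[i], grid[i+1]] used for v, clamped to the grid."""
    i = bisect_right(grid, v) - 1
    return min(max(i, 0), len(grid) - 2)


def estimate_ell_time(nnz_per_row, num_rows, nnz_grid, rows_grid, times):
    """Bilinear interpolation of log(time) over (log nnz/row, log rows).

    times[i][j] is the sample time for nnz_grid[i] nonzeros per row and
    rows_grid[j] rows; both grids must be sorted ascending.
    """
    lx, ly = math.log(nnz_per_row), math.log(num_rows)
    gx = [math.log(v) for v in nnz_grid]
    gy = [math.log(v) for v in rows_grid]
    i, j = _bracket(gx, lx), _bracket(gy, ly)
    tx = (lx - gx[i]) / (gx[i + 1] - gx[i])
    ty = (ly - gy[j]) / (gy[j + 1] - gy[j])

    def lz(a, b):
        return math.log(times[a][b])

    log_t = ((1 - tx) * (1 - ty) * lz(i, j) + tx * (1 - ty) * lz(i + 1, j)
             + (1 - tx) * ty * lz(i, j + 1) + tx * ty * lz(i + 1, j + 1))
    return math.exp(log_t)


# Synthetic sample table (not measurements): time = 1e-9 * nnz_per_row * rows.
nnz_grid = [2, 4, 8, 16]
rows_grid = [10_000, 100_000, 1_000_000]
times = [[1e-9 * n * r for r in rows_grid] for n in nnz_grid]

estimate = estimate_ell_time(6, 20_000, nnz_grid, rows_grid, times)
```

Queries outside the sampled grid are extrapolated linearly from the nearest cell, so estimates there should be treated with caution.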

13 Model (4): estimate execution time
Sometimes, appropriately enlarge the number of nonzero elements in a row that has very few nonzeros.
CSR: input the total number of nonzeros to the model.
ELLPACK: input the number of nonzeros per row and the number of rows to the model.
Estimate the execution time.
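The model inputs can be derived from a CSR row pointer array roughly as shown here. Two points are assumptions rather than details from the slides: the "enlarging" of short rows is interpreted as a simple floor (the hypothetical parameter `min_row_nnz`), and the ELLPACK nonzeros-per-row input is taken as the longest row, since ELLPACK pads every row to that length.

```python
def model_inputs(row_ptr, min_row_nnz=1):
    """Derive the inputs for the CSR and ELLPACK time models from CSR row pointers."""
    # Assumed interpretation: rows with very few nonzeros are raised to a floor.
    row_lengths = [max(row_ptr[i + 1] - row_ptr[i], min_row_nnz)
                   for i in range(len(row_ptr) - 1)]
    return {
        "csr_total_nnz": sum(row_lengths),     # CSR model input
        "ell_nnz_per_row": max(row_lengths),   # ELLPACK pads to the longest row
        "ell_num_rows": len(row_lengths),
    }


row_ptr = [0, 2, 3, 5, 6]                 # row lengths 2, 1, 2, 1
plain = model_inputs(row_ptr)
enlarged = model_inputs(row_ptr, min_row_nnz=2)
```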


15 Model-driven matrix partitioning
Use the GPU model to decide how to partition the matrix:
1. Partition the matrix into several blocks.
2. Each block may use a different storage format; the execution time of every block can be estimated by the GPU model.
3. The total estimated execution time is the sum of the per-block estimates.
4. By exploring different partition combinations, we may find a near-optimal one from the model.
5. The best partition found is used for the real execution.
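The five steps above can be sketched as a small search. Everything specific here is an assumption for illustration: the two cost functions are toy placeholders standing in for the GPU model's estimates, the partitions are equal-sized contiguous row blocks, and a fixed per-block overhead (the hypothetical `BLOCK_OVERHEAD`) represents the cost of launching one more kernel. The slides do not specify how candidate partitions are enumerated.

```python
BLOCK_OVERHEAD = 10.0  # hypothetical fixed cost per partition (arbitrary units)


def toy_csr_time(row_lengths):
    """Placeholder CSR estimate: work per nonzero plus a per-row overhead."""
    return 1.0 * sum(row_lengths) + 4.0 * len(row_lengths)


def toy_ell_time(row_lengths):
    """Placeholder ELLPACK estimate: every row costs as much as the longest row."""
    return 0.5 * max(row_lengths) * len(row_lengths)


MODELS = {"csr": toy_csr_time, "ell": toy_ell_time}


def best_format(row_lengths):
    """Step 2: pick the format with the lowest estimated time for one block."""
    return min((model(row_lengths), name) for name, model in MODELS.items())


def plan_partition(row_lengths, candidate_block_counts=(1, 2, 4)):
    """Steps 1-4: try several partitionings and keep the lowest total estimate."""
    n = len(row_lengths)
    best = None
    for k in candidate_block_counts:
        size = -(-n // k)  # ceiling division: rows per block
        blocks = [row_lengths[s:s + size] for s in range(0, n, size)]
        choices = [best_format(b) for b in blocks]
        total = sum(t for t, _ in choices) + BLOCK_OVERHEAD * len(blocks)  # step 3
        plan = [(len(b), fmt) for b, (_, fmt) in zip(blocks, choices)]
        if best is None or total < best[0]:
            best = (total, plan)
    return best


# Four short regular rows followed by one long row and three very short rows.
rows = [2, 2, 2, 2, 64, 1, 1, 1]
total, plan = plan_partition(rows)
```

For this input the search selects two blocks: the regular first half in ELLPACK and the irregular second half in CSR, which beats storing the whole matrix in either single format under the toy costs.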


17 Details: Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs.


19 Integrating model-driven partitioning and auto-tuning
CSR: 222% performance improvement
ELLPACK: 197% performance improvement
HYBRID: 33% performance improvement


21 Conclusion and Future Work
Conclusion:
We design and implement a model-driven partitioning approach that decides how to partition the target sparse matrix and transforms each partition into an appropriate storage format.
The auto-tuning tool automatically adjusts CUDA-specific parameters to improve performance.
We integrate model-driven partitioning and auto-tuning to maximize performance.
Experiments show that this approach substantially improves the performance of SpMV kernels.
Future work:
Refine our performance model to make it more accurate.
Extend the current framework to include more CUDA kernels.

