
SCATTERED DATA VISUALIZATION USING GPU

A Thesis
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

Bo Cai

May, 2015

SCATTERED DATA VISUALIZATION USING GPU

Bo Cai

Thesis

Approved:                                Accepted:

Advisor                                  Dean of the College
Dr. Yingcai Xiao                         Dr. Chand Midha

Committee Member                         Interim Dean of the Graduate School
Dr. Tim O'Neil                           Dr. Rex D. Ramsier

Committee Member                         Date
Dr. Zhong-Hui Duan

Department Chair
Dr. Timothy Norfolk

ABSTRACT

Scattered data visualization is commonly used in engineering applications. We usually employ a two-step approach, data modeling and rendering, to visualize scattered data. Performance and accuracy are two important issues in scattered data modeling and rendering. This project developed a GPU-accelerated scattered data visualization system. Shepard's method was used to interpolate scattered data onto a 3D uniform grid, and the Marching Cubes method was used to render the intermediate grid. Techniques such as Localized Data Modeling, Static Local Block Data Modeling and Dynamic Local Block Data Modeling were tested to measure their performance and accuracy. Experiments were conducted with real-world data on a GPU-accelerated scattered data visualization system. The speed-up observed on a GPU (NVidia GeForce GT 525M) is 12 to 27 times over a CPU (Intel Core i5-2410M 2.30 GHz). Increasing the value of α in Shepard's method can improve accuracy without a performance penalty. Localization can reduce modeling error but incurs a performance penalty. Dynamic Block Localization can increase modeling accuracy significantly, but it has a large speed penalty due to frequent data shifts among GPU memory banks. Static Block Localization, on the other hand, has a smaller performance penalty but also shows a smaller accuracy improvement. The parallel efficiency of the system is low ( to ). Future work includes studying issues related to GPU memory bank conflicts to increase the efficiency, and investigating more GPU-accelerated data interpolation methods for their accuracy and performance.

ACKNOWLEDGEMENTS

I am very thankful to my parents for encouraging and supporting me in pursuing my master's degree, and for making this thesis possible. I would like to acknowledge the professor who inspired me throughout my master's program. Dr. Yingcai Xiao, thank you very much for inspiring me with your guidance and support throughout the program. Dr. Zhong-Hui Duan and Dr. Tim O'Neil, thank you very much for serving on my thesis committee and supporting me in accomplishing this thesis. I would also like to acknowledge the faculty of the Department of Computer Science, Dr. En Cheng, Dr. Chien-Chung Chan, Dr. Kathy Liszka and Dr. Michael L. Collard, for their help during my Master's degree study. Their help has directly or indirectly contributed to the accomplishment of this thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
   1.1 Motivation
   1.2 Survey of Previous Work
   1.3 Outline of the Thesis
II. BACKGROUND
   2.1 Scattered Data Modeling and Visualization
   2.2 Data Interpolation Method
   2.3 Data Visualization Method
III. DESIGN
   3.1 Localized Data Modeling
   3.2 Design of GPU-based Modeling Algorithm
   3.3 Design of GPU-based Visualization Algorithm
IV. IMPLEMENTATION
   4.1 Implementation of Localization Shepard's Method on GPU
   4.2 Implementation of Marching Cubes Algorithm on GPU
V. RESULTS AND ANALYSES
   5.1 Global Method Comparisons between CPU and GPU
   5.2 Accuracy Comparisons between Localized Global Method and Non-localized Global Method
   5.3 Performance Comparison between Static Local Block Data Modeling Method and Dynamic Local Block Data Modeling Method
   5.4 Over All Speed Up and Error Report Analyses
VI. CONCLUSION AND FUTURE WORK
REFERENCES

LIST OF TABLES

Table
5.1 Comparing CPU and GPU for Global Shepard's Method for Various Grid Sizes Runtime
5.2 GPU Global Method Detailed Runtime
5.3 Data Communication Time, Size and Speed for Various Grid Sizes
5.4 Data Communication Size and Speed for Grid Size 64*64*64 to 128*128*128
5.5 Data Communication Size and Speed for Grid Size 80*80*80 to 82*82*82
5.6 Comparing Static Local Block Data Modeling and Dynamic Local Block Data Modeling Method Running Time for Various Grid Sizes

LIST OF FIGURES

Figure
5.1 Graphical Representation of Non-localized Global Method Data Modeling Result
5.2 Graphical Representation of Non-localized Global Method Data Modeling Numerical Error
5.3 Graphical Representation of Non-localized Global Method Data Modeling Relative Error
5.4 Graphical Representation of Localized Global Method Data Modeling Result
5.5 Graphical Representation of Localized Global Method Data Modeling Numerical Error
5.6 Graphical Representation of Localized Global Method Data Modeling Relative Error
5.7 Graphical Representation of Improved Shepard's Method by α = 2 Result
5.8 Graphical Representation of Improved Shepard's Method by α = 2 Numerical Error
5.9 Graphical Representation of Improved Shepard's Method by α = 2 Relative Error
5.10 Graphical Representation of Improved Shepard's Method by α = 10 Result
5.11 Graphical Representation of Improved Shepard's Method by α = 10 Numerical Error
5.12 Graphical Representation of Improved Shepard's Method by α = 10 Relative Error
5.13 Improved Shepard's Method by α = 11 Relative Error
5.14 RMS of Different Data Modeling Methods
5.15 Graph to Compare the Speed Between Static Local Block Data Modeling and Dynamic Local Block Data Modeling Method

CHAPTER I
INTRODUCTION

1.1 Motivation

High performance scattered data visualization is in great demand in many engineering applications. Examples of such applications can be found in environmental studies, oil exploration and mining. Volume visualization of scattered data is difficult due to the limited sampling rate and the scattered nature of the data [1]. Site investigations to acquire scattered data are difficult and costly, so we typically collect sampling points only from suspected areas of concentration. Hence it is difficult to form a 3D grid from scattered sampling points, and traditional grid-based visualization techniques cannot be employed to visualize such data. As a result we usually apply a two-step approach. The first step is to perform modeling on the scattered sample data to form a 3D uniform grid. Each grid node has an interpolated data value. Conventional grid-based visualization techniques are then applied to this intermediate grid in the second step, i.e., the rendering step [2].

Traditional CPU-based computing methods are dominant in the modeling field. Even though the CPU has developed very quickly in the past several decades, it still cannot catch up with modern modeling computational demand [3]. Similarly, interactive visualization has a high computation demand. In our project, we aim to speed up both steps, modeling and rendering, by parallelizing both processes using CUDA parallel processing.

1.2 Survey of Previous Work

CPU-based interpolation methods have been used in the modeling process of scattered data visualization for years. The ideas of the scattered data visualization correctness dilemma and the local constraint are presented in [4]. The advent of GPU CUDA parallel processing has led to research in many areas. Scattered data visualization is one of the most suitable research areas for parallel processing. The advantages of GPU CUDA parallel processing can benefit both the scattered data modeling process and the visualization process. For the GPU-based scattered data modeling part, a GPU-based scattered data modeling system was developed [5], where the author implemented four GPU-based scattered data modeling methods: Shepard's method, the Multiquadric method, the Thin-plate-spline method, and the Volume Spline method. In [13] this GPU-based scattered data modeling system was migrated to various platforms such as a GTX480 GPU, a Tesla C2070 GPGPU, and an Amazon Web Service cloud-based GPGPU instance. For GPU-based scattered data visualization, there is currently no existing research on visualizing scattered data with CUDA GPUs. However, the NVidia CUDA SDK provides GPU-accelerated data expansions for the Marching Cubes algorithm [9]. These tools enabled the development of GPU-based scattered data visualization.

1.3 Outline of the Thesis

This report consists of detailed work explained through various chapters. Chapter I includes information regarding the motivation of the project. Chapter II presents the background on the technology and some of the basic theories of scattered data modeling and rendering. Chapters III and IV discuss the design and implementation of this scattered data visualization system. The design and implementation of GPU-based Shepard's method modeling and GPU-based Marching Cubes algorithm rendering are explained in detail and depicted through diagrams. Chapter V discusses case studies, in which a step-by-step procedure is explained pictorially. Time consumed, memory used, and data communication speeds are presented in order to make comparisons among different cases. Chapter VI summarizes the work and how the system can be utilized. Possible modifications as well as some future work are also explained.

CHAPTER II
BACKGROUND

2.1 Scattered Data Modeling and Visualization

Scattered data is unevenly distributed or randomly spread over the volume of interest. The random distribution of the data makes it hard to visualize, since existing visualization algorithms are based on a 3D grid structure [3]. Scattered data is commonly found in engineering applications, and quick interactive visualization of scattered data is in great demand [2]. The most commonly used approach for scattered data visualization contains two steps [1]. The first step converts the scattered sample data into a 3D uniform grid. Each sample consists of three values for the position and one data value. To form the grid we need to interpolate the data values onto each grid node. After the interpolation we can use grid-based visualization techniques such as Marching Cubes to visualize the grid.

2.2 Data Interpolation Method

Following the two-step approach, the first step of the procedure is modeling the scattered data into a 3D uniform grid. To model the scattered data, we employ commonly used interpolation methods. Interpolation is a method of constructing new data points within the range of a discrete set of known original data points. In the areas of engineering and science, one often has a number of data points obtained through sampling or experimentation. The data represent a function over a limited range of the independent variables [14]. To analyze the data, scientists and engineers usually use mathematical interpolation methods to model the scattered data.

Generally speaking, there are two kinds of mathematical interpolation: global interpolation and local interpolation. In global interpolation, all the sample points are used to determine the value of each new point, while in local interpolation only the nearby points are used. Usually we use global interpolation methods when modeling scattered data to make full use of the original data. In global interpolation, given a set of n sample points {(xi, yi, zi), i = 1, 2, ..., n} with a sample value for each point {vi, i = 1, 2, ..., n}, we construct a function f(x, y, z) that is valid everywhere inside the domain of interest and satisfies the condition f(xi, yi, zi) = vi, i = 1, 2, ..., n. Once the function is found, it can be used to calculate the value at any location in the domain.

For this project we used Shepard's method. The mathematical expression of the method is:

    f(x, y, z) = (Σ wi vi) / (Σ wi),  wi = (h / di)^α,  i = 1, ..., n    (2.1)

where di is the distance between sample point i and the grid node, h is the diagonal length of a grid cell, and α is usually any real number greater than zero. The inverse-distance weighted method is a special case of Shepard's method where α = 1.
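As a concrete illustration, here is a minimal CPU-side sketch of global Shepard interpolation at a single grid node. The `Sample` struct and `shepard` function are illustrative names, not the thesis code; the grid-cell diagonal mentioned above cancels in the ratio of the two sums, so the weights reduce to 1/d^α.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Sample { double x, y, z, v; };

// Global Shepard's method at one grid node (gx, gy, gz):
// f = sum(w_i * v_i) / sum(w_i) with w_i = 1 / d_i^alpha.
// A sample coinciding with the node dominates all other weights, so its
// value is returned directly (the limiting case of the formula).
double shepard(const std::vector<Sample>& samples,
               double gx, double gy, double gz, double alpha) {
    double num = 0.0, den = 0.0;
    for (const Sample& s : samples) {
        double dx = s.x - gx, dy = s.y - gy, dz = s.z - gz;
        double d = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (d < 1e-12) return s.v;  // node coincides with a sample
        double w = 1.0 / std::pow(d, alpha);
        num += w * s.v;
        den += w;
    }
    return num / den;
}
```

With α = 1 this reduces to the inverse-distance weighted method; a localized variant would simply skip samples farther away than a chosen radius.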

2.3 Data Visualization Method

Marching Cubes is a computer graphics and visualization algorithm for extracting a polygonal mesh of an iso-surface from a 3D scalar field (sometimes called voxels) [9]. This is done by creating an index into a pre-calculated array of 256 possible polygon configurations (2^8 = 256) within the cube, by treating each of the 8 scalar values as a bit in an 8-bit integer. If a scalar's value is higher than the iso-value (i.e., it is inside the surface), the appropriate bit is set to one; if it is lower (outside), it is set to zero. The final value after all 8 scalars are checked is the actual index into the polygon indices array. Finally, triangles are generated for each voxel using the Marching Cubes lookup table, so that the vertices are connected correctly [15].
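The index construction described above can be sketched as follows. This is a host-side illustration with a hypothetical function name; the bit convention (inside = bit set) follows the paragraph above.

```cpp
#include <cassert>
#include <cstdint>

// Build the Marching Cubes case index for one voxel: each of the 8
// corner scalars contributes one bit of an 8-bit integer, set when the
// corner is inside the surface (value above the iso-value). The result
// indexes the 256-entry lookup tables.
std::uint8_t cubeIndex(const double corner[8], double isoValue) {
    std::uint8_t index = 0;
    for (int i = 0; i < 8; ++i)
        if (corner[i] > isoValue)
            index |= static_cast<std::uint8_t>(1u << i);
    return index;
}
```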

CHAPTER III
DESIGN

3.1 Localized Data Modeling

To model scattered data, we usually employ interpolation methods. These are methods for constructing new data points within the range of a discrete set of known original data points. The most commonly used mathematical interpolation methods support two kinds of data modeling: globalized data modeling and localized data modeling. Globalized data modeling methods use all sample points to interpolate a grid value. Localized data modeling uses only nearby sample points to interpolate a grid value. In our project, we focus on localized data modeling.

As we previously stated, localized data modeling uses only nearby sample points to interpolate a grid value. How to define "nearby" is an interesting question, so we have included two options: Range-Oriented Localized Data Modeling (ROLDM) and Block-Oriented Localized Data Modeling (BOLDM).

Range-Oriented Localized Data Modeling (ROLDM) is a distance-based localized data modeling method. Each time we interpolate a grid value, we draw a circle (in 2D) or sphere (in 3D) using this grid point as the center and a certain radius. The radius is defined by us. If the radius is large enough to contain all sample points, the result will be the same as the result of globalized data modeling. We only use the sample points inside this circle or sphere to calculate the grid value and ignore any sample points outside it.

Block-Oriented Localized Data Modeling (BOLDM) is designed for the GPU's architecture. We divide the entire data volume into small blocks or volumes which fit into shared memory. We will discuss this later in the design of the GPU-based modeling algorithm. Each grid point has its own block ID, and each grid point's value is interpolated using only the sample points with the same block ID.

3.2 Design of GPU-based Modeling Algorithm

Calculating the data values in parallel is the basic idea behind designing a GPU-based modeling algorithm. As discussed, we designed two types of localized data modeling methods: Range-Oriented Localized Data Modeling (ROLDM) and Block-Oriented Localized Data Modeling (BOLDM). The design of Range-Oriented Localized Data Modeling (ROLDM) can be explained as follows:

1) Define the grid size in each dimension.
2) Allocate one-dimensional arrays for sample points and grid points in the CPU.
3) Read sample points from a text file and write positions and data values into the arrays.
4) Scale the sample point position values by the following steps:

a) Find the maximum and minimum of the x, y and z values of the sample points.
b) Divide each x, y and z value of the sample points by the difference of the maximum and minimum, and multiply by the grid size in that dimension.
c) Write the scaled sample point positions into an array.
5) Allocate one-dimensional arrays for sample points and grid points in the GPU.
6) Calculate the block dimension and grid dimension from the grid size.
7) Copy the scaled sample point positions and values from the CPU to the GPU.
8) Call the kernel function, passing the block dimension, grid dimension, grid array pointer and sample data pointer.
9) Allocate shared memory.
10) Each kernel performs the following steps:
a) Load this kernel's corresponding sample point data from global memory to shared memory.
b) Synchronize all the threads; wait until all the sample data are loaded into shared memory.
c) Calculate this kernel's corresponding grid point index using the block index, block dimension, and thread index.
d) Calculate the distances between this kernel's corresponding grid point and all the sample data points.
e) Ignore the sample data points far away from this kernel's corresponding grid point and record the nearby sample points.

f) Interpolate this kernel's corresponding grid point value using the recorded nearby sample points.
g) Write the interpolated value into the array.
11) Copy back the interpolated grid values from the GPU to the CPU.
12) Free GPU memories.

The design of Block-Oriented Localized Data Modeling (BOLDM) can be explained as follows:

1) Define the grid size in each dimension.
2) Allocate one-dimensional arrays for sample points and grid points in the CPU.
3) Read sample points from a text file and write positions and data values into the arrays.
4) Scale the sample point position values by the following steps:
a) Find the maximum and minimum of the x, y and z values of the sample points.
b) Divide each x, y and z value of the sample points by the difference of the maximum and minimum, and multiply by the grid size in that dimension.
c) Write the scaled sample point positions into an array.
5) Allocate one-dimensional arrays for sample points and grid points in the GPU.
6) Calculate the block dimension and grid dimension from the grid size.
7) Divide the sample data points into blocks using the grid dimension.
8) Copy the scaled sample point positions and values from the CPU to the GPU.
9) Call the kernel function, passing the block dimension, grid dimension, grid array pointer and sample data pointer.
10) Allocate shared memory.

11) Each kernel performs the following steps:
a) Load this kernel's corresponding block of sample point data from global memory to shared memory.
b) Synchronize all the threads; wait until all the sample data are loaded into shared memory.
c) Calculate this kernel's corresponding grid point index using the block index, block dimension, and thread index.
d) Interpolate this kernel's corresponding grid point value using this kernel's corresponding block of sample data.
e) Write the interpolated value into the array.
12) Copy back the interpolated grid values from the GPU to the CPU.
13) Free GPU memories.

3.3 Design of GPU-based Visualization Algorithm

Marching Cubes is a surface reconstruction algorithm [8]. It extracts a geometric iso-surface from the volume of voxels. There are three situations for a vertex of a voxel:

1) If the value of the vertex is less than the iso-value, the vertex is outside of the iso-surface.
2) If the value of the vertex equals the iso-value, the vertex is on the iso-surface.
3) If the value of the vertex is larger than the iso-value, the vertex is inside the iso-surface.

A border voxel has some of its vertices inside the iso-surface and some outside. Ignoring the second situation (where the value of a vertex equals the iso-value), there are 256 possible configurations for each voxel: each voxel has 8 vertices and each vertex has 2 states, either inside or outside the iso-surface, which is why there are 2^8 = 256 configurations. We predefined a triangle mesh approximating the part of the iso-surface for each configuration [9]. We use the edgeTable[256] array to store the 256 possible configurations as a lookup table. For each of the possible vertex states listed in edgeTable[256] there is a specific triangulation; triTable[256] lists all of them in the form of edge triples, giving 256 ways to draw the triangles. In 3D space we enumerate 256 different situations for the Marching Cubes representation; all of these cases can be generalized into 15 unique topological cases [7].

The main idea of the GPU-based Marching Cubes algorithm is that each thread of the GPU computes one voxel of the entire volume. The design of the GPU-based Marching Cubes algorithm can be explained as follows:

1) Initialization.
2) Allocate one-dimensional arrays for the grid data values, the voxel cases of the entire volume, the edge lookup table and the triangle lookup table on the CPU.
3) Read the grid data values which we interpolated by ROLDM or BOLDM and write them into the CPU arrays.
4) Allocate one-dimensional arrays for the grid data values, the voxel cases of the entire volume, the edge lookup table and the triangle lookup table on the GPU.
5) Copy the grid data values array from the CPU to the GPU.

6) Calculate the block dimension and grid dimension from the volume size.
7) Call the kernel function, passing the block dimension, the grid dimension, the iso-value and the grid data values array pointer.
8) Each kernel performs the following steps:
a) Calculate this kernel's corresponding voxel index using the block index, the block dimension, and the thread index.
b) Read the vertex values of this kernel's corresponding voxel from the grid data values array.
c) Compare each vertex value with the iso-value to generate 8 scalar values which indicate the states of the 8 vertices of this voxel. We treat each of the 8 scalar values as a bit in an 8-bit integer, where inside is 1 and outside is 0.
d) Write the result of the comparison, which is an 8-bit integer, into the voxel cases array.
9) Copy back the voxel cases array from the GPU to the CPU.
10) Draw triangles using the voxel cases array, the edge lookup table and the triangle lookup table.

CHAPTER IV
IMPLEMENTATION

4.1 Implementation of Localization Shepard's Method on GPU

Shepard's method is represented as:

    f(x, y, z) = (Σ wi vi) / (Σ wi),  wi = (h / di)^α,  i = 1, ..., n    (2.1)

where di is the distance between sample point i and the grid node, h is the diagonal length of a grid cell, and α is usually any real number greater than zero. The inverse-distance weighted method is a special case of Shepard's method where α = 1.

Each grid point has its own position, represented by x, y and z. Each (x, y, z) triple is mapped to a kernel thread index in order to assign each grid point a kernel thread. Thus, the program can be parallelized so that each thread calculates the data value for one grid point. All the grid point positions and values are stored in a one-dimensional array, since the GPU and the CPU communicate through one-dimensional arrays. The index of the one-dimensional array indicates the position of a grid point:

    Index = z * xDimensionSize * yDimensionSize + y * xDimensionSize + x    (4.1)
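Equation (4.1) and its inverse can be sketched as plain functions. The names are hypothetical; the layout assumes x varies fastest, then y, then z, matching the formula above.

```cpp
#include <cassert>
#include <cstddef>

// Flatten a 3D grid coordinate to the 1D array index used for CPU-GPU
// communication (Equation 4.1): x varies fastest, then y, then z.
std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                      std::size_t xDim, std::size_t yDim) {
    return z * xDim * yDim + y * xDim + x;
}

// Recover (x, y, z) from a flat index, as each thread must do once it
// knows which array element it owns.
void unflatten(std::size_t index, std::size_t xDim, std::size_t yDim,
               std::size_t& x, std::size_t& y, std::size_t& z) {
    x = index % xDim;
    y = (index / xDim) % yDim;
    z = index / (xDim * yDim);
}
```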

Each thread can calculate its corresponding grid point position using threadIdx.x, blockDim.x and blockIdx.x. The equation is defined as:

    Index = threadIdx.x + blockDim.x * blockIdx.x    (4.2)

Since every kernel thread has been assigned to a grid point, all the grid points are interpolated simultaneously using the Shepard's method equation. All sample points are loaded into shared memory when each kernel thread begins Range-Oriented Localized Data Modeling (ROLDM). The distances between the current kernel thread's grid point and all sample points are calculated to determine whether each sample point is nearby or not. The distance equation is defined as:

    Distance = sqrt((x - xi)^2 + (y - yi)^2 + (z - zi)^2)    (4.3)

Pseudo code:

Input: sample points array
Output: 3D uniform grid with a data value on each grid node

Load sample points into shared memory
Synchronize all the threads
Calculate the index using the kernel thread index equation (4.2)
Parse the x, y and z values from the index
Calculate the distance between the current grid point and all sample points
Record nearby sample points whose distance is less than a certain value

Interpolate the current grid point value using the Shepard's method equation; only recorded samples are used to interpolate
Write back the grid point value into an array

4.2 Implementation of Marching Cubes Algorithm on GPU

Three kernel functions are implemented for the GPU-based Marching Cubes algorithm: the classifyVoxel kernel, the compactVoxels kernel and the generateTriangles kernel.

Each kernel thread is assigned a voxel and executes classifyVoxel to determine whether this voxel will be displayed or not, i.e., whether there is an intersection on an edge of this voxel. We compare the iso-value with the value of each vertex to determine whether there is an intersection on an edge. If all of the vertex values of the voxel are less than the iso-value, or all are greater, this voxel will not be displayed. If some of the vertices are less than the iso-value and others are greater, there is an intersection on an edge and this voxel will be displayed. The classifyVoxel kernel function outputs the voxelOccupied array, which indicates whether each voxel is non-empty, i.e., will be displayed. The voxelVertices array records the vertex states in order to tell the generateTriangles kernel function how to display triangles.

We execute compactVoxels right after classifyVoxel to compact the voxelOccupied array and get rid of empty voxels. This allows us to run the complex generateTriangles kernel only on the occupied voxels.

The generateTriangles kernel function runs only on the occupied voxels for high performance. Both of the lookup tables, edgeTable and triTable, are loaded into the GPU texture memory. Each kernel reads its corresponding voxel case from the voxelVertices array. After the voxel cases are loaded, each kernel goes through both of the lookup tables to find how to generate the triangles for this voxel case. Thus the triangles are generated correctly.
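The effect of the compaction step can be illustrated with a sequential CPU analogue. On the GPU this is typically done with a parallel prefix scan; the function below is an illustrative sketch, not the SDK kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential analogue of compactVoxels: given the per-voxel occupancy
// flags produced by classifyVoxel, collect the indices of the occupied
// voxels so that later stages touch only those.
std::vector<std::size_t> compactVoxels(const std::vector<int>& voxelOccupied) {
    std::vector<std::size_t> occupied;
    for (std::size_t i = 0; i < voxelOccupied.size(); ++i)
        if (voxelOccupied[i])
            occupied.push_back(i);
    return occupied;
}
```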

CHAPTER V
RESULTS AND ANALYSES

5.1 Global Method Comparisons between CPU and GPU

The implementation of the presented algorithm has been tested on a Dell computer with an NVidia GeForce GT 525M. The following are its specifications:

CUDA Driver Version / Runtime Version: 5.5 / 5.5
CUDA Capability Major/Minor version number: 2.1
Number of Multiprocessors: 2
Number of CUDA cores per Multiprocessor: 48
Total Number of CUDA Cores: 96
Total amount of global memory: 1024 MBytes
Total amount of shared memory per block: bytes

Various grid sizes have been chosen in order to compare the time consumed and draw conclusions regarding how much the GPU-based program speeds up computation. The code was written using similar logic for both CPU-based sequential programming and GPU-based parallel programming. Table 5.1 shows the average running times in milliseconds (ms) of ten experiments for each grid size.

Table 5.1 Comparing CPU and GPU for Global Shepard's Method for Various Grid Sizes Runtime
[Columns: Grid Size (1*1*1 through 128*128*128), CPU Runtime, GPU Runtime, SpeedUp Factor, Efficiency; the numeric values were not preserved in this transcription.]

The speedup factor is the ratio of the CPU runtime to the GPU runtime. It captures the relative benefit of parallel processing. The speedup factor equation is defined as:

    SpeedUp Factor S(p) = Ts / Tp    (5.1)

where Ts is the sequential (CPU) runtime and Tp is the parallel (GPU) runtime.

Efficiency is the fraction of time for which a processing element is usefully employed in a computation [11]. The efficiency equation is defined as:

    Efficiency E(p) = S(p) / p    (5.2)

where p is the number of processing elements.

The GPU-based program has overhead factors such as process synchronization, memory allocation and data communication. The ratio of these overhead factors becomes smaller as the grid size increases. Thus, the GPU-based program becomes increasingly efficient as the grid size increases.
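Equations (5.1) and (5.2) translate directly into code. The runtimes used below are illustrative numbers, not measurements from the thesis.

```cpp
#include <cassert>

// Speedup (Equation 5.1): ratio of sequential (CPU) runtime Ts to
// parallel (GPU) runtime Tp.
double speedup(double cpuMs, double gpuMs) { return cpuMs / gpuMs; }

// Parallel efficiency (Equation 5.2): speedup divided by the number of
// processing elements p.
double efficiency(double cpuMs, double gpuMs, int cores) {
    return speedup(cpuMs, gpuMs) / cores;
}
```

For example, a hypothetical 2400 ms CPU run against a 100 ms GPU run on the 96-core GT 525M gives a speedup of 24 and an efficiency of 0.25, i.e., each core is usefully busy a quarter of the time.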

The running time of the CPU exceeds that of the GPU when the grid size is larger than 21*21*21. The GPU-based global method shows better results as the grid size increases. When the grid size is smaller than 21*21*21, the CPU-based global method has the advantage, because the GPU-based global method bears the overhead of memory copies and synchronization. The GPU-based global method can take great advantage of parallel processing when the grid size is larger than 21*21*21; the serial compute time is then significantly longer than the GPU communication time. The details of the overhead of GPU-based global method data communication have also been observed. Table 5.2 shows the average running times in milliseconds (ms) of ten experiments for each size.

Table 5.2 GPU Global Method Detailed Runtime
[Columns: Grid Size (1*1*1 through 128*128*128), GPU Runtime, GPU Kernel Compute Runtime, Data Copy Host to Device Time, Data Copy Device to Host Time, Malloc Memory Time, Data Communication Time; the numeric values were not preserved in this transcription.]

Note: Data Communication Time is the sum of Data Copy Host to Device Time, Data Copy Device to Host Time and Malloc Memory Time.

The NVIDIA Visual Profiler is a cross-platform performance profiling tool that provides developers with vital feedback for optimizing CUDA C/C++ applications [10].

We applied the NVIDIA Visual Profiler as a timing test tool. The Data Communication Time of the GPU-based global method is shown in detail in Table 5.2, broken into four parts: GPU Kernel Compute Runtime, Data Copy Host to Device Time, Data Copy Device to Host Time and Malloc Memory Time. The smallest time unit of the NVIDIA Visual Profiler is 0.002 ms, so any time less than or equal to 0.002 ms is shown as 0.002 ms in the table.

The GPU Kernel Compute Runtime is the time of all kernel computations from beginning to end. These measurements are meaningless when the grid size is smaller than 16*16*16, because such runs are too short to monitor with the NVIDIA Visual Profiler. The GPU Kernel Compute Runtime increases by approximately 8 times with each doubling of the grid dimensions as the grid size increases from 32*32*32 to 128*128*128. Data Copy Host to Device Time stays approximately the same, because the same sample data are used in each experiment. The Data Copy Device to Host Time is also meaningless when the grid size is less than 16*16*16, again due to the profiler's smallest time unit. Data Copy Device to Host Time increases with the output data size, which is the grid size. Malloc Memory Time also grows with the amount of memory needed. The speeds of global memory copies were also measured and are shown in Table 5.3.

Table 5.3 Data Communication Time, Size and Speed for Various Grid Sizes

Grid Size      Host-to-Device Size   Device-to-Host Size
1*1*1          2156 KB               4 bytes
2*2*2          2156 KB               32 bytes
4*4*4          2156 KB               256 bytes
8*8*8          2156 KB               2 KB
16*16*16       2156 KB               16 KB
32*32*32       2156 KB               128 KB
64*64*64       2156 KB               1 MB
128*128*128    2156 KB               8 MB

[The copy times and the speed columns (MB/s and GB/s) were not preserved in this transcription.]

The Data Copy Host to Device speeds are all approximately the same, since the hardware does not change. Among the grid sizes in Table 5.3, the largest Data Copy Device to Host speed is observed at 64*64*64.

Table 5.4 Data Communication Size and Speed for Grid Size 64*64*64 to 128*128*128

Grid Size      Device-to-Host Size   Device-to-Host Speed
64*64*64       1 MB                  6.09 GB/s
70*70*70       MB                    5.78 GB/s
80*80*80       MB                    5.65 GB/s
90*90*90       MB                    2.81 GB/s
100*100*100    MB                    2.95 GB/s
110*110*110    MB                    2.82 GB/s
120*120*120    MB                    2.91 GB/s
128*128*128    8 MB                  2.82 GB/s

Table 5.5 Data Communication Size and Speed for Grid Size 80*80*80 to 82*82*82

Grid Size    Device-to-Host Size   Device-to-Host Speed
80*80*80     MB                    5.65 GB/s
81*81*81     MB                    2.65 GB/s
82*82*82     MB                    2.68 GB/s

[Size cells shown without a number were not preserved in this transcription.]

Table 5.4 and Table 5.5 show us that the peak Data Copy Device to Host speed is observed at grid size 80*80*80.
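The speed columns in Tables 5.3 through 5.5 follow from size divided by copy time. A small helper shows the derivation, assuming binary units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes):

```cpp
#include <cassert>
#include <cmath>

// Transfer speed in GB/s from a payload size in MB and a measured copy
// time in milliseconds: speed = bytes / seconds, rescaled to GB.
double transferSpeedGBps(double sizeMB, double timeMs) {
    double bytes = sizeMB * 1024.0 * 1024.0;
    double seconds = timeMs / 1000.0;
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

For instance, an 8 MB device-to-host copy measured at about 2.77 ms would come out near the 2.82 GB/s reported for the 128*128*128 grid.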

5.2 Accuracy Comparisons between Localized Global Method and Non-localized Global Method

The two-step approach to scattered data visualization faces many issues, one of which is accuracy. We employed numerical error analysis to measure the accuracy of the scattered data modeling. In the data modeling step, every grid node value is constructed from the input sample points. These interpolated grid node values are then used to reproduce the data values at the original sample points (by linear interpolation). Analytically, the interpolated grid node values can exactly reproduce the original data values at the sample points; numerically, they cannot, due to numerical errors. Such numerical errors can be calculated by:

ε_i = f(x_i, y_i, z_i) - v_i,   i = 1, ..., n,   (5.4)

where v_i is the scattered data value at sample point (x_i, y_i, z_i) and f(x_i, y_i, z_i) is the value interpolated back from the grid at that point [12]. The relative errors are calculated from the numerical errors and the original data values. Root mean square (RMS) measures the differences between the values predicted by a model or an estimator and the values actually observed. We use the absolute value of the relative error to calculate the RMS:

Relative Error_i = |ε_i / v_i|   (5.5)

The RMS takes the form of a sample standard deviation and indicates the accuracy of the experiment:

RMS = sqrt( (1/n) Σ_{i=1}^{n} (Relative Error_i)^2 )   (5.6)

Figure 5.1 Graphical Representation of Non-Localized Global Method Data Modeling Result
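The error measures above can be sketched in a few lines (a hypothetical illustration; the function and variable names are ours, not the thesis code):

```python
import math

def rms_relative_error(true_values, reproduced_values):
    """RMS of absolute relative errors between the original sample values
    and the values reproduced from the interpolated grid (Eqs. 5.4-5.6)."""
    rel_errors = [abs((f - v) / v)                 # |eps_i / v_i|
                  for v, f in zip(true_values, reproduced_values)]
    return math.sqrt(sum(e * e for e in rel_errors) / len(rel_errors))

# A perfect reproduction gives zero RMS error:
assert rms_relative_error([2.0, 4.0], [2.0, 4.0]) == 0.0
# A uniform 10% error gives an RMS of 0.1:
assert abs(rms_relative_error([10.0, 20.0], [11.0, 22.0]) - 0.1) < 1e-12
```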

Figure 5.2 Graphical Representation of Non-localized Global Method Data Modeling Numerical Error

Figure 5.3 Graphical Representation of Non-localized Global Method Data Modeling Relative Error

We can see that the results of non-localized global method data modeling have low accuracy. The RMS is undesirably large, and we can barely see the trend of the data point values as they increase and decrease. Clearly, the result is not satisfactory. In order to improve accuracy, we employ localized global method data modeling, which uses only nearby sample points to interpolate each grid value. We use 8 by 8 by 8 as the range size. The results of localized global method data modeling are shown in Figures 5.4 through 5.6.

Figure 5.4 Graphical Representation of Localized Global Method Data Modeling Result

Figure 5.5 Graphical Representation of Localized Global Method Data Modeling Numerical Error

Figure 5.6 Graphical Representation of Localized Global Method Data Modeling Relative Error

We can see that the results of localized global method data modeling are much more accurate than those of the non-localized global method; the RMS is reduced substantially. Decreasing the contribution of faraway data points is another approach to improving accuracy. We can control the contribution of distant data points by changing α: a distant data point contributes more to the result when α is small, so we increase the α value to decrease the contribution of faraway data points. By default, α is 1, so we first increase α to 2. The results are shown in Figures 5.7 through 5.9.

Figure 5.7 Graphical Representation of Improved Shepard's Method by α = 2 Result
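In Shepard's method the weight of a sample point falls off as an inverse power of its distance, so raising the exponent α suppresses faraway points. A minimal sketch of the idea (the function name, the radius parameter and the tiny data set are ours; the thesis operates on a 3D grid in CUDA, while this one-dimensional version only illustrates the roles of α and localization):

```python
def shepard(x, samples, alpha=1.0, radius=None):
    """Inverse-distance-weighted value at x from (position, value) samples.
    alpha controls how fast distant points lose influence; radius, if set,
    localizes the interpolation to nearby samples only."""
    num = den = 0.0
    for p, v in samples:
        d = abs(x - p)
        if radius is not None and d > radius:
            continue                  # localized modeling: skip faraway samples
        if d == 0.0:
            return v                  # exactly at a sample point
        w = 1.0 / d ** alpha
        num += w * v
        den += w
    return num / den

data = [(0.0, 0.0), (1.0, 1.0), (10.0, 100.0)]   # one faraway outlier
# With alpha=1 the outlier at x=10 noticeably pulls the value at x=0.5 up;
# with alpha=10 its influence all but vanishes:
assert shepard(0.5, data, alpha=1) > 3.0
assert abs(shepard(0.5, data, alpha=10) - 0.5) < 1e-3
# Localization (radius=2) removes the outlier entirely:
assert shepard(0.5, data, alpha=1, radius=2.0) == 0.5
```

The same mechanism explains the accuracy peak reported below: a larger α keeps the interpolation local, but pushing it too far makes every weight except the nearest neighbor's negligible.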

Figure 5.8 Graphical Representation of Improved Shepard's Method by α = 2 Numerical Error

Figure 5.9 Graphical Representation of Improved Shepard's Method by α = 2 Relative Error

By changing α from 1 to 2, the accuracy improves slightly and the RMS is reduced further. We continue to increase the α value up to the accuracy peak, which is 10 in this case. The results are shown in Figures 5.10 through 5.12.

Figure 5.10 Graphical Representation of Improved Shepard's Method by α = 10 Result

Figure 5.11 Graphical Representation of Improved Shepard's Method by α = 10 Numerical Error

Figure 5.12 Graphical Representation of Improved Shepard's Method by α = 10 Relative Error

The accuracy has improved significantly and the result is now desirable; the RMS is reduced to its lowest value. When the α parameter is increased to 11, however, the accuracy drops sharply, as shown in Figure 5.13.

Figure 5.13 Graphical Representation of Improved Shepard's Method by α = 11 Result

The accuracy drops sharply when α increases to 11 because the contribution of distant points becomes too small, and the RMS increases again. The overall RMS changes are shown in Figure 5.14.

Figure 5.14 RMS of Different Data Modeling Methods

5.3 Performance Comparison between Static Local Block Data Modeling Method and Dynamic Local Block Data Modeling Method

Block-oriented localized data modeling (BOLDM) is designed for the GPU architecture. We divide the entire data volume into small blocks (or sub-volumes) that fit entirely into shared memory. The static local block data modeling method is a pre-defined method: we manually divide the entire data volume into small blocks and assign each small block to a GPU block, so each small block has its own shared memory. The dynamic local block data modeling method also divides the data volume into small blocks, but dynamically: each data point has its own block, namely the group of data points around it, and the blocks are organized per data point according to its position. The dynamic method also uses shared memory to improve performance, but, unlike the static method, it cannot place all of the required sample data points into its own shared memory. Because the shared memory layout is static while the block moves with the data point position, some of the required sample points fall outside the shared memory. Thus, the static local block data modeling method reads all of its required sample data from its own shared memory, whereas the dynamic local block data modeling method reads some of the required sample data from its own shared memory and the rest from global memory. As a result, the performance of the two methods is noticeably different, as Table 5.6 shows.
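The difference between the two methods can be sketched on the host side: with a static partition, a point in the middle of a block finds all its neighbors preloaded, while a dynamic per-point neighborhood can straddle block boundaries and forces reads from global memory. This is a simplified one-dimensional model; the block size, half-width and function names are ours, not the thesis implementation:

```python
BLOCK = 8  # samples per static block (our choice for illustration)

def static_block(i):
    """Static partition: indices preloaded into the block that holds sample i."""
    start = (i // BLOCK) * BLOCK
    return set(range(start, start + BLOCK))

def dynamic_block(i, half_width=4):
    """Dynamic partition: a neighborhood centered on sample i."""
    return set(range(i - half_width, i + half_width))

def global_memory_reads(i):
    """Neighbors of i that fall outside i's preloaded static block and
    would therefore have to be fetched from global memory."""
    return dynamic_block(i) - static_block(i)

# In the middle of a block, the whole neighborhood is in shared memory:
assert global_memory_reads(4) == set()
# Near a block edge, part of the neighborhood spills into global memory:
assert global_memory_reads(7) == {8, 9, 10}
```

The spilled reads are exactly what makes the dynamic method slower: each one pays global-memory latency instead of a shared-memory access.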

Table 5.6 Comparing Static Local Block Data Modeling and Dynamic Local Block Data Modeling Method Running Time for Various Grid Sizes

Grid Size      Static Local Block Method Runtime   Dynamic Local Block Method Runtime
20*20*20
30*30*30
40*40*40
50*50*50
60*60*60
70*70*70
80*80*80
90*90*90
100*100*100
110*110*110
120*120*120

Figure 5.15 Graph to Compare the Speed between Static Local Block Data Modeling and Dynamic Local Block Data Modeling Method

We can see that the dynamic local block data modeling method consumes more time than the static local block data modeling method, because the dynamic method reads from both shared memory and global memory. As the grid size increases, the running time of the dynamic local block data modeling method increases significantly.

5.4 Overall Speedup and Error Report Analyses

Different GPU-based data modeling methods present different error reports and speedup rates. How to balance speedup rate and accuracy is a new issue for GPU-based scattered data visualization. Table 5.7 shows the error reports and speedup factors of the different data modeling methods for grid size 128*128*128.

Table 5.7 Speed Up Rate and Error Report for Various Data Modeling Methods

                                   CPU-based   GPU-based   GPU-based    GPU-based     GPU-based      GPU-based
                                   Global      Global      Global       Global        Static Local   Dynamic Local
                                   Method      Method      Method α=2   Method α=10   Block Method   Block Method
Maximum Absolute Numerical Error
Maximum Absolute Relative Error
RMS Accuracy of Numerical Error
RMS Accuracy of Relative Error
SpeedUp Factor
Efficiency

Table 5.7 shows that we can employ the GPU-based dynamic local block method to increase accuracy significantly at the cost of running time. The key is finding a way to balance accuracy and runtime. Increasing the α coefficient of Shepard's method is a good option for improving result accuracy.
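The speedup and efficiency rows in Table 5.7 are related in the conventional way: speedup is the CPU runtime divided by the GPU runtime, and parallel efficiency divides that speedup by the number of parallel processing units (the GeForce GT 525M used here has 96 CUDA cores). The runtimes below are illustrative only, not the thesis measurements:

```python
def speedup(cpu_time, gpu_time):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_time / gpu_time

def parallel_efficiency(sp, num_cores):
    """Fraction of ideal linear scaling achieved across num_cores units."""
    return sp / num_cores

# Illustrative: a 24x speedup on a 96-core GPU is only 25% efficient,
# which is why a 12x-27x speedup still leaves the system's efficiency low.
s = speedup(48.0, 2.0)
assert s == 24.0
assert parallel_efficiency(s, 96) == 0.25
```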

CHAPTER VI

CONCLUSION AND FUTURE WORK

How to balance performance and accuracy is an important issue in GPU-based scattered data visualization. We have built a GPU-accelerated scattered data visualization system and used it to study various methods of speeding up performance while preserving accuracy. We have experimented with various techniques to improve GPU memory usage and to reduce CPU-GPU data communication. The experiments have shown the following results:

1) GPU-based scattered data modeling demonstrates a speedup of 12 to 27 times over its CPU-based counterpart (on an NVidia GeForce GT 525M GPU against an Intel Core i5-2410M 2.30 GHz CPU).

2) Increasing the value of the α parameter up to a certain value in Shepard's method can improve accuracy without causing a performance penalty (the accuracy of the chemical leakage sample data peaks when the α parameter is 10).

3) Localization can reduce modeling error but causes a performance penalty.

4) Dynamic block localization can increase accuracy significantly, but has a large performance penalty due to frequent data shifts among GPU memory banks.

5) Static block localization has a smaller performance penalty, but also shows a smaller accuracy improvement.

The parallel efficiency of the system is low. To achieve high memory bandwidth for concurrent accesses, issues related to GPU memory bank conflicts need to be addressed in future work. More GPU-accelerated data interpolation methods, such as the volume spline method, the thin-plate-spline method and the multiquadric method, should also be investigated in the future.

REFERENCES

[1] Yingcai Xiao, J. Ziebarth, Physically Based Data Modeling for Sparse Data Volume Visualization, Technical Report, Department of Mathematics and Computer Science, The University of Akron, January.

[2] Yingcai Xiao, J. Ziebarth, FEM-based Scattered Data Modeling and Visualization, Computers and Graphics, Vol. 24, No. 5, 2000.

[3] Yingcai Xiao, C. Woodbury, Constraining Global Interpolation Methods for Sparse Data Volume Visualization, International Journal of Computers and Applications, Vol. 21, No. 2, 1999.

[4] Yingcai Xiao, John P. Ziebarth, Chuck Woodbury, Eric Bayer, Bruce Rundell, Jeroen van der Zijp, The Challenges of Visualizing and Modeling Environmental Data, IEEE Visualization '96 Conference Proceedings, San Francisco, California, October 27 to November 1, 1996.

[5] Saranya S. Vinjarapu, GPU-based Scattered Data Modeling, Master's Thesis in Computer Science, University of Akron.

[6] J. Allard, C. Menier, B. Raffin, et al., Grimage: Markerless 3D Interactions, ACM SIGGRAPH '07, International Conference on Computer Graphics and Interactive Techniques, Emerging Technologies, Article No. 9.

[7] C. Leong, Y. Xing, N. D. Georganas, Tele-Immersive Systems, IEEE International Workshop on Haptic Audio Visual Environments and their Applications, Ottawa, Canada.

[8] W. E. Lorensen, H. E. Cline, Marching Cubes: A High Resolution 3D Surface Construction Algorithm, SIGGRAPH '87 Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, 1987, USA.

[9] Y. Heng, L. Gu, GPU-based Volume Rendering for Medical Image Visualization, Proceedings of the IEEE Engineering in Medicine and Biology 27th Annual Conference, 2005, Shanghai, China.

[10] NVIDIA CUDA C Programming Guide, Version 3.2, NVIDIA Corporation, 2010; H. R. Nagel, GPU Optimized Marching Cubes Algorithm for Handling Very Large, Temporal Datasets, CiteSeerX Scientific Literature Digital Library and Search Engine, 2010.

[11] David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach.

[12] Yingcai Xiao, Jinqiang Tian, Hao Sun, Error Analysis in Sparse Data Volume Visualization, International Conference on Imaging Science, Systems, and Technology, Las Vegas, June 24-27, 2002.

[13] Lu Wang, Scattered-Data Computing on Various Platforms, Master's Thesis in Computer Science, University of Akron.

[14] Interpolation, Wikipedia, The Free Encyclopedia, last modified March 12, 2015; retrieved April 2, 2015.

[15] Marching cubes, Wikipedia, The Free Encyclopedia, last modified February 9, 2015; retrieved April 2, 2015.


More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

How To Create A Surface From Points On A Computer With A Marching Cube

How To Create A Surface From Points On A Computer With A Marching Cube Surface Reconstruction from a Point Cloud with Normals Landon Boyd and Massih Khorvash Department of Computer Science University of British Columbia,2366 Main Mall Vancouver, BC, V6T1Z4, Canada {blandon,khorvash}@cs.ubc.ca

More information

GPU Accelerated Monte Carlo Simulations and Time Series Analysis

GPU Accelerated Monte Carlo Simulations and Time Series Analysis GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis

More information

Hands-on CUDA exercises

Hands-on CUDA exercises Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished

More information

Introduction to Computer Graphics

Introduction to Computer Graphics Introduction to Computer Graphics Torsten Möller TASC 8021 778-782-2215 torsten@sfu.ca www.cs.sfu.ca/~torsten Today What is computer graphics? Contents of this course Syllabus Overview of course topics

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, 5-8 8-4, 8-7 1-6, 4-9

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, 5-8 8-4, 8-7 1-6, 4-9 Glencoe correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 STANDARDS 6-8 Number and Operations (NO) Standard I. Understand numbers, ways of representing numbers, relationships among numbers,

More information

GPU-BASED TUNING OF QUANTUM-INSPIRED GENETIC ALGORITHM FOR A COMBINATORIAL OPTIMIZATION PROBLEM

GPU-BASED TUNING OF QUANTUM-INSPIRED GENETIC ALGORITHM FOR A COMBINATORIAL OPTIMIZATION PROBLEM GPU-BASED TUNING OF QUANTUM-INSPIRED GENETIC ALGORITHM FOR A COMBINATORIAL OPTIMIZATION PROBLEM Robert Nowotniak, Jacek Kucharski Computer Engineering Department The Faculty of Electrical, Electronic,

More information

GPU-Based Network Traffic Monitoring & Analysis Tools

GPU-Based Network Traffic Monitoring & Analysis Tools GPU-Based Network Traffic Monitoring & Analysis Tools Wenji Wu; Phil DeMar wenji@fnal.gov, demar@fnal.gov CHEP 2013 October 17, 2013 Coarse Detailed Background Main uses for network traffic monitoring

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques

More information

Computer Graphics CS 543 Lecture 12 (Part 1) Curves. Prof Emmanuel Agu. Computer Science Dept. Worcester Polytechnic Institute (WPI)

Computer Graphics CS 543 Lecture 12 (Part 1) Curves. Prof Emmanuel Agu. Computer Science Dept. Worcester Polytechnic Institute (WPI) Computer Graphics CS 54 Lecture 1 (Part 1) Curves Prof Emmanuel Agu Computer Science Dept. Worcester Polytechnic Institute (WPI) So Far Dealt with straight lines and flat surfaces Real world objects include

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

September 25, 2007. Maya Gokhale Georgia Institute of Technology

September 25, 2007. Maya Gokhale Georgia Institute of Technology NAND Flash Storage for High Performance Computing Craig Ulmer cdulmer@sandia.gov September 25, 2007 Craig Ulmer Maya Gokhale Greg Diamos Michael Rewak SNL/CA, LLNL Georgia Institute of Technology University

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

GPU Architecture. Michael Doggett ATI

GPU Architecture. Michael Doggett ATI GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super

More information

Constrained Tetrahedral Mesh Generation of Human Organs on Segmented Volume *

Constrained Tetrahedral Mesh Generation of Human Organs on Segmented Volume * Constrained Tetrahedral Mesh Generation of Human Organs on Segmented Volume * Xiaosong Yang 1, Pheng Ann Heng 2, Zesheng Tang 3 1 Department of Computer Science and Technology, Tsinghua University, Beijing

More information

Accelerating Wavelet-Based Video Coding on Graphics Hardware

Accelerating Wavelet-Based Video Coding on Graphics Hardware Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink. Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA. In Proc. 6th International Symposium on Image and Signal Processing

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems Academic Content Standards Grade Eight Ohio Pre-Algebra 2008 STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express large numbers and small

More information

Network Traffic Monitoring and Analysis with GPUs

Network Traffic Monitoring and Analysis with GPUs Network Traffic Monitoring and Analysis with GPUs Wenji Wu, Phil DeMar wenji@fnal.gov, demar@fnal.gov GPU Technology Conference 2013 March 18-21, 2013 SAN JOSE, CALIFORNIA Background Main uses for network

More information

Process Modelling from Insurance Event Log

Process Modelling from Insurance Event Log Process Modelling from Insurance Event Log P.V. Kumaraguru Research scholar, Dr.M.G.R Educational and Research Institute University Chennai- 600 095 India Dr. S.P. Rajagopalan Professor Emeritus, Dr. M.G.R

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Massive Streaming Data Analytics: A Case Study with Clustering Coefficients. David Ediger, Karl Jiang, Jason Riedy and David A.

Massive Streaming Data Analytics: A Case Study with Clustering Coefficients. David Ediger, Karl Jiang, Jason Riedy and David A. Massive Streaming Data Analytics: A Case Study with Clustering Coefficients David Ediger, Karl Jiang, Jason Riedy and David A. Bader Overview Motivation A Framework for Massive Streaming hello Data Analytics

More information

Higher Education Math Placement

Higher Education Math Placement Higher Education Math Placement Placement Assessment Problem Types 1. Whole Numbers, Fractions, and Decimals 1.1 Operations with Whole Numbers Addition with carry Subtraction with borrowing Multiplication

More information

Employing Complex GPU Data Structures for the Interactive Visualization of Adaptive Mesh Refinement Data

Employing Complex GPU Data Structures for the Interactive Visualization of Adaptive Mesh Refinement Data Volume Graphics (2006) T. Möller, R. Machiraju, T. Ertl, M. Chen (Editors) Employing Complex GPU Data Structures for the Interactive Visualization of Adaptive Mesh Refinement Data Joachim E. Vollrath Tobias

More information

A New Approach to Cutting Tetrahedral Meshes

A New Approach to Cutting Tetrahedral Meshes A New Approach to Cutting Tetrahedral Meshes Menion Croll August 9, 2007 1 Introduction Volumetric models provide a realistic representation of three dimensional objects above and beyond what traditional

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

Multi-GPU Load Balancing for In-situ Visualization

Multi-GPU Load Balancing for In-situ Visualization Multi-GPU Load Balancing for In-situ Visualization R. Hagan and Y. Cao Department of Computer Science, Virginia Tech, Blacksburg, VA, USA Abstract Real-time visualization is an important tool for immediately

More information

Biggar High School Mathematics Department. National 5 Learning Intentions & Success Criteria: Assessing My Progress

Biggar High School Mathematics Department. National 5 Learning Intentions & Success Criteria: Assessing My Progress Biggar High School Mathematics Department National 5 Learning Intentions & Success Criteria: Assessing My Progress Expressions & Formulae Topic Learning Intention Success Criteria I understand this Approximation

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information