CONTENTS

I Introduction
II Preparation
    II-A Hardware
    II-B Checking Versions for Hardware and Drivers
III Software
    III-A Installation
    III-B Checking Software Versions
    III-C Running CUDA Examples
IV Creating Your Own CUDA Program
V Integrate CUBLAS to ATLAS
    V-A Sources for Sdot
    V-B Makefiles
    V-C ATLAS Source Modification
    V-D Compilation and Execution
VI User BLAS Implementation on CUDA
    VI-A Directory Structure
    VI-B Makefiles
    VI-C Sources for Sdot
    VI-D ATLAS Source Modification
    VI-E Compilation and Execution
    VI-F Test Runs
References

ATLAS GPU Tutorial

Chia-Tien Dan Lo
Department of Computer Science
University of Texas at San Antonio

Abstract

This paper describes how to run Basic Linear Algebra Subprograms (BLAS) routines on Graphics Processing Units (GPUs) in the Automatically Tuned Linear Algebra Software (ATLAS) project. The BLAS implementations invoked in this tutorial include CUBLAS 2.2, a BLAS implementation in Compute Unified Device Architecture (CUDA), and CUBLASLO 1.0, our BLAS implementation. We integrate CUBLAS and CUBLASLO into ATLAS via a test program, l1blastst.c, defined in the ATLAS source tree. A test run illustrates this integration, and its results are reported.

I. INTRODUCTION

This tutorial uses a Linux box running 64-bit Ubuntu 8.10. It can also be applied to a Windows-based machine with Cygwin installed to simulate the Linux environment. Software used in this tutorial includes GNU Make 3.8.1 and GCC. The GPU installed in the Linux box is an Nvidia e-GeForce 8600 GTS with 512 MB DDR3 and PCIe x16. Other GPUs may also apply. The Linux box also contains a second VGA card, an Nvidia GeForce 7200 GS with 256 MB RAM, which drives the main display. The 7-series Nvidia card is not supported by CUDA. Therefore, the CUDA computation runs on the 8600 GTS, which is not connected to any monitor. This configuration lets CUDA programs run beyond the 5-second limit that the CUDA driver imposes when a GPU is attached to a display.

II. PREPARATION

Before starting the tutorial, make sure the software and hardware are installed and are the correct versions. You may well get stuck at some point during the tutorial if your software or hardware differs from what is used on my machine. If the box does not have ATLAS and LAPACK, follow the ATLAS installation guide with the packages ATLAS and LAPACK [1]. If the installed versions are different, remove them and install the correct versions accordingly.
We will be referring to two directories in the ATLAS source tree: SRCDIR and BLDDIR. In my box, I downloaded and unzipped ATLAS to danlo@etl-corei74b: /atlas/atlas , and made a build directory danlo@etl-corei74b: /atlas/atlas /linux_i7_64_build. SRCDIR refers to the former directory and BLDDIR to the latter throughout this tutorial.

A. Hardware

For GPU computation, of course, a GPU has to be installed on the motherboard. To bypass the 5-second limit, it is strongly recommended that you add an extra GPU card dedicated to the CUDA computation. The motherboard therefore needs at least two PCIe 2.0 slots: one for a normal VGA card that connects to a display, and one for the CUDA-enabled GPU. The motherboard used in the tutorial is an ASUS P6T Deluxe with two PCIe 2.0 slots. At the time of this writing, the ASUS L1N64-SLI WS Dual L(1207FX) provides 4 PCIe x16 slots. On Windows machines, both GPU cards have to be the same brand, i.e., Nvidia, because of driver issues. Linux boxes, however, may take a different VGA card along with a CUDA-enabled GPU card, though this configuration is not validated in this tutorial. It is highly recommended that both cards be Nvidia cards to avoid some hurdles.

Installing the GPU card takes two steps: first, open the computer case and insert the card into the motherboard; second, install the device driver. Make sure the power cord is detached in addition to powering off the machine. There is still power running when the power switch is off.

The latest Nvidia device driver at the time of this writing is cudatoolkit_2.2_linux_64_ubuntu8.10.run from Nvidia. There are a number of Nvidia drivers for a variety of operating systems and machines. Make sure you get the right one for your system. You will have to remove old drivers first. Use the following command to remove old Nvidia drivers.

    sudo apt-get remove nvidia*

Then download the new driver from Nvidia's website and install it using the following command.

    sudo cudadriver_2.2_linux_64_ beta.run

Since you need to close your X session to do this, I recommend you ssh into your box from another machine so that you get a usable terminal to run Nvidia's installer. (It seems to have some problems if you run it from the console; I got flashing text and other nonsense.) Once you've modprobed the new driver, you will need to install the software described in the next section.

B. Checking Versions for Hardware and Drivers

To find out what VGA device hardware is installed in your system, use the following command:

    danlo@etl-corei74b: $ lspci | grep VGA
    02:00.0 VGA compatible controller: nvidia Corporation G72 [GeForce 7300 SE] (re
    03:00.0 VGA compatible controller: nvidia Corporation GeForce 8600 GTS (rev a1)

In my machine, there are two VGA cards: GeForce 7200 and GeForce 8600 GTS. The first one is not supported by CUDA. Only the second one is a CUDA-enabled GPU. Also note the PCI bus information prefixed to each device. When setting up the X server, this information has to be entered manually in /etc/X11/xorg.conf.

An NVIDIA Linux display driver is needed to run CUDA code on a CUDA-enabled NVIDIA GPU. CUDA 2.2 requires a minimum version of the Linux NVIDIA display driver; please see the NVIDIA CUDA Toolkit 2.2 release notes for the exact version and more details.
We can use the following command to verify the version of the running device driver:

    danlo@etl-corei74b:/proc$ less /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module Thu Apr 30 15:48:49
    GCC version: gcc version (Ubuntu ubuntu12)

For information on installing NVIDIA Linux display drivers, please refer to the NVIDIA Accelerated Linux Driver Set README and Installation Guide.

III. SOFTWARE

The software required to run GPU programs includes the CUDA toolkit, SDK, and debugger. At the time of this writing, the latest version is 2.2. The toolkit contains Nvidia's C compiler (nvcc), assembler (ptxas), debugger (cuda-gdb), and other tools. The SDK is composed of a series of sample CUDA programs and a template for creating user programs.

A. Installation

Nvidia has created *.run self-extracting archives to install these tools. In the following steps, we will install the required software and test it by running the examples.

1) Install version 2.2 of the NVIDIA CUDA Toolkit by executing the file NVIDIA_CUDA_Toolkit_2.2-*.run corresponding to your Linux distribution. It is recommended that you run the installer as root and use the default install path (/usr/local). Then add the location of the CUDA binaries (such as nvcc) to your PATH environment variable and the location of the CUDA libraries (such as libcuda.so) to your LD_LIBRARY_PATH environment variable. In the bash shell, one way to do this is to add the following lines to the file .bash_profile in your home directory.

    PATH=$PATH:<CUDA_INSTALL_PATH>/bin
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<CUDA_INSTALL_PATH>/lib
    export PATH
    export LD_LIBRARY_PATH

2) Install version 2.2 of the NVIDIA CUDA SDK by executing the file NVIDIA_CUDA_SDK_2.2-*.run. The installer will prompt you to enter an installation path for the SDK or accept the default. We will refer to the path you choose as SDK_INSTALL_PATH.

3) Build the SDK project examples.

    cd <SDK_INSTALL_PATH>
    make

4) Run the examples:

    cd <SDK_INSTALL_PATH>/bin/linux32/release
    matrixmul

   (or any of the other executables in that directory)

The SDK package consists of a .run file. This is a self-extracting archive that decompresses its contents to a temporary folder and then installs the contents to a path that you specify. The archive is:

    NVIDIA_CUDA_SDK_2.2-*.run : NVIDIA CUDA SDK Installer

B. Checking Software Versions

To confirm that the installation succeeded, or to find out which CUDA version is currently installed, run the following command:

    danlo@etl-corei74b: $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) NVIDIA Corporation
    Built on Thu_Apr 9_05:05:52_PDT_2009
    Cuda compilation tools, release 2.2, V

The command nvcc is CUDA's C compiler. Later on, you will use nvcc to compile your CUDA programs.

C. Running CUDA Examples

Before writing any CUDA program, it is highly recommended to compile and run the sample programs that come with the SDK.
Running these sample programs successfully shows that your system is ready for developing CUDA programs. To build the SDK project examples, follow these steps:

1) Go to <SDK_INSTALL_PATH> (cd <SDK_INSTALL_PATH>)

2) Build:
    release configuration by typing make.
    debug configuration by typing make dbg=1.
    emurelease configuration by typing make emu=1.
    emudebug configuration by typing make emu=1 dbg=1.

Running make at the top level first builds libcutil, a utility library used by the SDK examples (libcutil is simply for convenience; it is not a part of CUDA and is not required for your own CUDA programs). Make then builds each of the projects in the SDK.

NOTES: The release and debug configurations require a CUDA-capable GPU to run properly (see Appendix A.1 of the CUDA Programming Guide [2] for a complete list of CUDA-capable GPUs). The emurelease and emudebug configurations run in device emulation mode, and therefore do not require a CUDA-capable GPU to run properly.

You can build an individual sample by typing make (or make emu=1, etc.) in that sample's project directory. For example:

    cd <SDK_INSTALL_PATH>/projects/matrixmul
    make emu=1

And then execute the sample with:

    <SDK_INSTALL_PATH>/bin/linux32/emurelease/matrixmul

To build just libcutil, type make (or make dbg=1) in the common subdirectory:

    cd <SDK_INSTALL_PATH>/common
    make

Run the examples from the release, debug, emurelease, or emudebug directories located in /bin/linux32/[release|debug|emurelease|emudebug].

IV. CREATING YOUR OWN CUDA PROGRAM

Creating a new CUDA program using the NVIDIA CUDA SDK infrastructure is easy. Nvidia has provided a template project that you can copy and modify to suit your needs. Just follow these steps:

1) Copy the template project

    cd <SDK_INSTALL_PATH>/projects
    cp -r template <myproject>

2) Edit the filenames of the project to suit your needs

    mv template.cu myproject.cu
    mv template_kernel.cu myproject_kernel.cu
    mv template_gold.cpp myproject_gold.cpp

3) Edit the Makefile and source files. Just search and replace all occurrences of template with myproject.

4) Build the project

    make

   You can build a debug version with make dbg=1, an emulation version with make emu=1, and a debug emulation version with make dbg=1 emu=1.

5) Run the program

    ../../bin/linux32/release/myproject

   (It should print Test PASSED)

6) Now modify the code to perform the computation you require. See the CUDA Programming Guide for details of programming in CUDA.

V. INTEGRATE CUBLAS TO ATLAS

CUBLAS is a CUDA implementation of BLAS. To write programs that call CUBLAS, we have to implement a wrapper that initializes the GPU and the CUDA library and makes calls to the CUBLAS library. Remember that BLDDIR refers to the build directory of ATLAS. We will create a directory for the wrapper. In this tutorial, I created one in BLDDIR/bin/DanLo/ATL_CUBLAS. Under this directory, I keep one subdirectory per BLAS library function. Since I use sdot as an example, I created a subdirectory sdot. All the source files related to sdot reside in the sdot directory.

A. Sources for Sdot

The source of sdot.c is listed below for your reference. Before calling a CUBLAS library function, the GPU and the CUBLAS library have to be initialized. The wrapper to be tested is composed of the following steps:

1) Headers. Two headers are needed: cublas.h and cuda_runtime.h. The former declares the CUBLAS library functions, whereas the latter declares the CUDA runtime APIs, whose function names start with cuda*. Note that CUDA driver APIs start with cu*. Lines 1 and 2 include the two headers. Line 3 includes util.h, a header where I keep my own utilities such as defines, wrappers, and the like.

2) Initialize the GPU and the CUBLAS library. There is a cutilDeviceInit() defined in cutil_inline_runtime.h that parses command-line switches and initializes the GPU by calling cudaSetDevice(). For CUBLAS, there are helper functions to ease device initialization (cublasInit()), clean-up after execution (cublasShutdown()), error handling (cublasGetError()), memory management (cublasAlloc(), cublasFree()), and vector and matrix transfer (cublasSetVector(), cublasGetVector(), cublasSetMatrix(), cublasGetMatrix()). We use cublasInit() to initialize the CUBLAS library and bind CUBLAS to the currently attached GPU. A static variable, cublasinitflag, guards against repeated initialization. Lines 11-16 implement the initialization.

3) GPU memory allocation. Before we ask the GPU to compute something, we have to request space to hold the data to be processed. Lines 18-23 allocate memory for two vectors on the GPU device. In this implementation, we take advantage of the CUBLAS helper routine cublasAlloc() to request device memory.

4) Data movement from host to device. Once we have device memory, we have to fill it with the data to be processed. In this implementation (lines 25-29), the CUBLAS helper function cublasSetVector() does the job. The function check_cuda() is a utility in util.h that I wrote to verify that the previous library call was successful. Lines 31-32, commented out, show another way of moving the data using cudaMemcpy().

5) Computation. Line 35 is the statement that actually calls the CUBLAS library routine. In this tutorial, cublasSdot() is called. Since cublasSdot() returns a float [3], a local variable result holds the result. Note that a kernel function in the current CUDA implementation (version 2.2) cannot return a value. CUBLAS library functions that return a value therefore have to ship the result back to the CPU themselves, so for sdot the following step is done inside the CUBLAS library function.

6) Data movement from device to host. Since the cublasSdot() function returns a float, we don't need to move the result back to the CPU. If the result contains more than one value, e.g., a matrix, we have to move the result back to the CPU manually.

7) Housekeeping. After the GPU computation, the allocated memory is freed for the next computation. Here we call cublasFree() to release the space used for the two vectors (lines 38-39). If the memory is not returned, sooner or later the GPU memory will fill up with garbage and further memory requests will no longer be granted. So always keep a good habit of housekeeping.
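To judge whether a wrapper built from these steps returns the right value, it helps to have a plain host-side computation of the same quantity. The following is a minimal sketch in C, not part of ATLAS or CUBLAS: the names ref_sdot and sdot_close are hypothetical, positive strides are assumed, and the relative tolerance is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference for sdot: sums x[i*incx] * y[i*incy]
 * for i = 0 .. n-1. Assumes incx > 0 and incy > 0 (an assumption of this
 * sketch; BLAS itself also allows negative strides). The sum is kept in
 * double to reduce rounding error before narrowing to float. */
float ref_sdot(int n, const float *x, int incx, const float *y, int incy)
{
    double acc = 0.0;
    int i;

    assert(incx > 0 && incy > 0);
    for (i = 0; i < n; i++)
        acc += (double)x[(size_t)i * (size_t)incx] *
               (double)y[(size_t)i * (size_t)incy];
    return (float)acc;
}

/* Hypothetical comparison helper: returns 1 when got agrees with ref
 * to within tol, relative to max(|ref|, 1). */
int sdot_close(float got, float ref, float tol)
{
    float diff  = got > ref ? got - ref : ref - got;
    float mag   = ref < 0.0f ? -ref : ref;
    float scale = mag > 1.0f ? mag : 1.0f;

    return diff <= tol * scale;
}
```

A GPU result would then be checked with something like sdot_close(gpu_result, ref_sdot(n, x, incx, y, incy), 1e-5f). A tolerance is needed because the GPU may add the products in a different order than the host loop does.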

     1  #include "cublas.h"
     2  #include "cuda_runtime.h"
     3  #include "util.h"
     4
     5  static int cublasinitflag = 0;
     6
     7  float ATL_CUBLAS_sdot(int n, const float *x, int incx, const float *y, int incy) {
     8      /* Dan Lo */
     9      float result;
    10      float *x_d, *y_d;
    11      /* Dan Lo: init GPU */
    12      if (!cublasinitflag) {
    13          cublasInit();
    14          printf("cublas Init...\n");
    15          cublasinitflag = 1;
    16      }
    17
    18      /* allocate GPU memory */
    19      cublasStatus stat;
    20      stat = cublasAlloc(n, sizeof(*x_d)*incx, (void**)&x_d);
    21      check_cuda(stat, "alloc");
    22      stat = cublasAlloc(n, sizeof(*y_d)*incy, (void**)&y_d);
    23      check_cuda(stat, "alloc");
    24
    25      /* move data */
    26      stat = cublasSetVector(n, sizeof(*x_d), x, incx, x_d, incx);
    27      check_cuda(stat, "setvector");
    28      stat = cublasSetVector(n, sizeof(*y_d), y, incy, y_d, incy);
    29      check_cuda(stat, "setvector");
    30      /*
    31      cudaMemcpy(x_d, x, sizeof(*x_d)*incx*n, cudaMemcpyHostToDevice);
    32      cudaMemcpy(y_d, y, sizeof(*y_d)*incy*n, cudaMemcpyHostToDevice);
    33      */
    34      /* computation */
    35      result = cublasSdot(n, x_d, incx, y_d, incy); /* strides match the device layout set by cublasSetVector */
    36
    37      /* free GPU memory */
    38      cublasFree(x_d);
    39      cublasFree(y_d);
    40      return result;
    41  }/* ATL_CUBLAS_sdot() */

B. Makefiles

In order to integrate the added code into the ATLAS make system, we have to modify the file BLDDIR/bin/Makefile and add one Makefile per added subdirectory. In the listings below, recipe lines must begin with a tab character. Since we will build the level-1 BLAS tester xsl1blastst, the following modification should be made to the file BLDDIR/bin/Makefile. Because the wrapper calls CUBLAS, the CUBLAS library (-lcublas) is linked alongside the CUDA runtime (-lcudart):

    DanLoStuff:
        (cd DanLo;$(MAKE))

    xsl1blastst : sl1blastst.o sl1lib ststlib DanLoStuff
        $(FLINKER) $(FCLINKFLAGS) -o $@ sl1blastst.o \
            $(TESTlib) $(BLASlib) $(ATLASlib) $(LIBS) \
            -lcublas -lcudart -L/usr/local/cuda/lib -latl_cublas -L./DanLo

The first couple of lines simply tell the make utility to work on the subdirectory DanLo. Therefore, we need to add the file BLDDIR/bin/DanLo/Makefile as follows:

    libatl_cublas.a: MUSTDO
        ar -ru libatl_cublas.a ATL_CUBLAS/sdot/*.o
        ranlib libatl_cublas.a

    MUSTDO:
        (cd ATL_CUBLAS;$(MAKE))

Also add the makefile BLDDIR/bin/DanLo/ATL_CUBLAS/Makefile as follows:

    MUSTDO:
        (cd sdot;$(MAKE))

Finally, add the following makefile, BLDDIR/bin/DanLo/ATL_CUBLAS/sdot/Makefile, to actually compile the sdot sources:

    sdot.o: sdot.c util.h
        gcc -g -c sdot.c -I/usr/local/cuda/include

C. ATLAS Source Modification

Because we use ATLAS's timers to test the sdot implementation, the ATLAS source SRCDIR/bin/l1blastst.c has to be modified. We can simply change the define for trusted_dot to our sdot function. The following shows the modification. The original Mjoin (commented out) is replaced with our sdot implementation ATL_CUBLAS_sdot.

    #define trusted_dot( N, X, ix, Y, iy ) \
       ATL_CUBLAS_sdot (N, X, ix, Y, iy )
    //Mjoin( TP1, dot )( N, X, ix, Y, iy )

D. Compilation and Execution

Once the makefiles are in place, change directory to BLDDIR/bin and type the following command:

    make xsl1blastst

To run the test program, simply type the following command:

    ./xsl1blastst -R dot

More options can be found by typing the following:

    ./xsl1blastst help

VI. USER BLAS IMPLEMENTATION ON CUDA

In this section, I will show you my own implementation of sdot on CUDA. The flow is much the same as for integrating CUBLAS into ATLAS, described in the previous section.
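Both the CUBLAS integration and the user implementation hook into the tester the same way: l1blastst.c reaches its trusted routine through the function-like macro trusted_dot, so redefining that one macro reroutes every call site. The following is a small stand-alone C sketch of that pattern only. The functions reference_dot, replacement_dot, and run_dot_test and the counter replacement_calls are hypothetical stand-ins; none of them is ATLAS code.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many calls were routed to the replacement routine. */
int replacement_calls = 0;

/* Stand-in for the routine the tester originally trusts. */
float reference_dot(int n, const float *x, int ix, const float *y, int iy)
{
    float acc = 0.0f;
    int i;

    assert(ix > 0 && iy > 0);
    for (i = 0; i < n; i++)
        acc += x[(size_t)i * (size_t)ix] * y[(size_t)i * (size_t)iy];
    return acc;
}

/* Stand-in for a wrapper such as ATL_CUBLAS_sdot. A real wrapper would
 * offload the work to the GPU; this one only records the call. */
float replacement_dot(int n, const float *x, int ix, const float *y, int iy)
{
    replacement_calls++;
    return reference_dot(n, x, ix, y, iy);
}

/* The one-line switch: every use of trusted_dot now calls the replacement.
 * The original target is kept as a comment so it is easy to restore. */
#define trusted_dot(N, X, ix, Y, iy) \
    replacement_dot(N, X, ix, Y, iy)
/* reference_dot(N, X, ix, Y, iy) */

/* Stand-in for tester code, which never names the implementation directly. */
float run_dot_test(int n, const float *x, const float *y)
{
    return trusted_dot(n, x, 1, y, 1);
}
```

The point of the design is that the tester body stays untouched: only the macro definition changes when a different implementation is to be timed.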

A. Directory Structure

The following directory structure is built.

    BLDDIR/bin/DanLo/ATL_CUBLASLO/sdot

B. Makefiles

The Makefiles are similar to the ones used in the previous section, except that every occurrence of CUBLAS should be replaced with CUBLASLO. The one in BLDDIR/bin/DanLo/ATL_CUBLASLO/sdot/Makefile is listed as follows (recipe lines must begin with a tab character):

    CUDA_INSTALL_PATH := /usr/local/cuda
    ROOTDIR := /home/danlo/nvidia_cuda_sdk
    COMMONDIR := $(ROOTDIR)/common
    INCLUDE := -I. -I$(CUDA_INSTALL_PATH)/include -I$(COMMONDIR)/inc
    # debug -g
    COMMONFLAG := $(INCLUDE) -DUNIX -g
    # debug -D_DEBUG
    NVCCFLAG := -D_DEBUG
    CUBIN_ARCH_FLAG := -m64
    SMVERSION := -arch sm_11
    # if use driver API -lcuda
    LIB := -lcudart
    OBJS := main.cu.o kernel.cu.o util.c.o

    all: $(OBJS)

    %.cu.o :: %.cu
        nvcc $(NVCCFLAG) -c $< $(SMVERSION) $(INCLUDE) $(COMMONFLAG) -o $@

    %.c.o :: %.c
        gcc -g -c $< -o $@

    clean:
        rm $(OBJS)

It is worth noting that this compilation does not involve linking. Therefore, I don't include the CUDA runtime libraries here. They are specified in the file BLDDIR/bin/Makefile as follows:

    xsl1blastst : sl1blastst.o sl1lib ststlib DanLoStuff
        $(FLINKER) $(FCLINKFLAGS) -o $@ sl1blastst.o \
            $(TESTlib) $(BLASlib) $(ATLASlib) \
            $(LIBS) -lcudart -L/usr/local/cuda/lib \
            -latl_cublaslo -L./DanLo

This is where the CUDA runtime library and my own library libatl_cublaslo.a get linked.

C. Sources for Sdot

The implementation of our sdot() function is composed of two sources: main.cu and kernel.cu. Listed below is kernel.cu, which implements two kernels: dot_product_kernel() and sum_reduction(). The first kernel performs the multiplications, and the second sums the products within each block.

    /* This file implements kernel functions for dot product test.
     *
     * Chia-Tien Dan Lo
     * May 1,
     *
     */

    #ifndef KRNEL_CU
    #define KRNEL_CU

    #include <stdio.h>
    #include <global.h>

    __global__ void dot_product_kernel(vector_t *v1, vector_t *v2, vector_t *p) {
        // find which element to work with from block and thread info
        unsigned int idx = blockIdx.x * BLOCK_X + threadIdx.x;
        /* each thread works on a product */
        p[idx] = v1[idx]*v2[idx];
    }/* dot_product_kernel() */

    /* sum vector and store it to v[vector_size] */
    /* assume vector size is multiple of 2*BLOCK_X */
    /* 5/17/2009 remove multiple of 2*BLOCK_X limit */
    __global__ void sum_reduction(vector_t *v, unsigned int n, vector_t *ps) {
        unsigned int t = threadIdx.x;
        unsigned int idx;
        idx = blockIdx.x*BLOCK_X+t;
        for (unsigned int stride = BLOCK_X/2; stride > 0; stride = stride >> 1 ) {
            if (t<stride) {
                v[idx] += v[idx+stride];
            } /* if */
            __syncthreads();
        }/* for stride */
        /* only this guy will collect partial sum! */
        if (t==0)
            ps[blockIdx.x] = v[idx];
    }/* sum_reduction() */

    #endif
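The reduction in sum_reduction() halves the stride on every pass, so each block folds its products into a single partial sum in log2(BLOCK_X) passes. The following C sketch emulates that scheme serially on the host so the arithmetic can be followed without a GPU. It is an illustration under assumptions, not the kernel itself: BLOCK, block_partial_sum, round_up_to_block, and dot_by_blocks are hypothetical names, and the block size of 8 is chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8 /* hypothetical block size; must be a power of two */

/* Emulates one thread block of the tree reduction. v points at the
 * block's BLOCK elements and is overwritten, as in the kernel. */
float block_partial_sum(float *v)
{
    unsigned int stride, t;

    for (stride = BLOCK / 2; stride > 0; stride >>= 1) {
        /* On the GPU the threads with t < stride do this in parallel and
         * then meet at a barrier. Within a pass the reads (t + stride) and
         * writes (t) touch disjoint elements, so running them one after
         * another gives the same result. */
        for (t = 0; t < stride; t++)
            v[t] += v[t + stride];
    }
    return v[0];
}

/* Rounds n up to the next multiple of BLOCK (valid for powers of two). */
size_t round_up_to_block(size_t n)
{
    return (n + BLOCK - 1) & ~(size_t)(BLOCK - 1);
}

/* Multiplies element-wise into the scratch buffer p, reduces each block,
 * and adds the per-block partial sums on the host. n must be a multiple
 * of BLOCK, and p must hold n floats. */
float dot_by_blocks(const float *a, const float *b, float *p, size_t n)
{
    float result = 0.0f;
    size_t i, blk;

    assert(n % BLOCK == 0);
    for (i = 0; i < n; i++)
        p[i] = a[i] * b[i];
    for (blk = 0; blk < n / BLOCK; blk++)
        result += block_partial_sum(p + blk * BLOCK);
    return result;
}
```

A vector whose length is not a multiple of the block size can be handled by padding both inputs with zeros up to round_up_to_block(n); the extra products are zero and do not change the sum.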

The only function defined in main.cu is ATL_CUBLASLO_sdot(), a wrapper similar to the one implemented to call the CUBLAS function. The wrapper initializes the GPU and the CUDA runtime library, allocates device memory, copies data to the device, calls the kernel functions, moves the result back to the CPU, frees any memory allocated, and finally returns the result to the caller.

    /*
     * Dot product operation in libatl_cublaslo.a.
     *
     *
     * Chia-Tien Dan Lo
     * May 16,
     *
     */

    #include <stdio.h>
    #include <cutil_inline.h>
    //#include "oracle.h"
    #include "util.h"
    #include "global.h"
    #include "kernel.h"

    static int cuinitflag = 0;

    /* dot_product_cu starts here */
    /* we may have to init device later!
    int main(int argc, char **argv) { */
    /* note: incx and incy are not used below; both vectors are treated as contiguous */
    extern "C" vector_t ATL_CUBLASLO_sdot(const int n, const vector_t *v1, const int incx,
                                          const vector_t *v2, const int incy) {
        /* vector_t v1[n], v2[n], p[n+1]={}; */
        int i;
        vector_t *p, *ps;
        vector_t *zero;
        vector_t result=0;
        unsigned int N;
        /* init GPU */
        if (!cuinitflag) {
            cuinitflag = 1;
            /* set highest device for computation */
            cudaSetDevice( cutGetMaxGflopsDeviceId() );
        }
        /* round up N to multiple of BLOCK_X */
        /* BLOCK_X_1 and BLOCK_X_MASK are defined in the project headers (not listed) */
        N = (n+BLOCK_X_1);
        N = N & BLOCK_X_MASK;
        /* allocate memory */
        /*
        v1 = (vector_t*)dan_malloc(sizeof(vector_t)*n);
        v1 = (vector_t*)dan_malloc(sizeof(vector_t)*n);
        */
        p = (vector_t*)dan_calloc(sizeof(vector_t)*(N));
        ps = (vector_t*)dan_calloc(sizeof(vector_t)*(N/BLOCK_X));
        /* used to zero extra GPU memory for both vectors*/
        zero = (vector_t*)dan_calloc(sizeof(vector_t)*(N-n));
        /* p holds N elements, so p[n] exists only when padding was added */
        if ((unsigned int)n < N)
            p[n] = 0;
        // random generate vectors
        /*
        vector_gen(v1, N, MAX_VALUE);
        vector_gen(v2, N, MAX_VALUE);
        */
        // test out
        // vector_out(v, 100);
        /* init device */
        // CUT_DEVICE_INIT(argc, argv);
        /* allocate device memory */
        vector_t *v1d, *v2d, *pd, *psd;
        cutilSafeCall(cudaMalloc((void**)&v1d, N*sizeof(vector_t)));
        cutilSafeCall(cudaMalloc((void**)&v2d, N*sizeof(vector_t)));
        cutilSafeCall(cudaMalloc((void**)&pd, (N)*sizeof(vector_t)));
        cutilSafeCall(cudaMalloc((void**)&psd, (N/BLOCK_X)*sizeof(vector_t)));
        // create and start timer
        /*
        unsigned int timer = 0;
        cutilCheckError( cutCreateTimer( &timer));
        cutilCheckError( cutStartTimer( timer));
        */
        /* move data to device */
        cutilSafeCall(cudaMemcpy(v1d, v1, n*sizeof(vector_t), cudaMemcpyHostToDevice));
        cutilSafeCall(cudaMemcpy(v2d, v2, n*sizeof(vector_t), cudaMemcpyHostToDevice));
        /* zero the accumulate partial sum */
        cutilSafeCall(cudaMemcpy(psd, ps, (N/BLOCK_X)*sizeof(vector_t), cudaMemcpyHostToDevice));
        cutilSafeCall(cudaMemcpy(v1d+n, zero, (N-n)*sizeof(vector_t), cudaMemcpyHostToDevice));
        cutilSafeCall(cudaMemcpy(v2d+n, zero, (N-n)*sizeof(vector_t), cudaMemcpyHostToDevice));
        /* setup execution parameters */
        dim3 block(BLOCK_X, BLOCK_Y);
        dim3 grid(N/block.x, block.y);
        /* invoke kernel */
        dot_product_kernel<<<grid, block>>>(v1d, v2d, pd);
        sum_reduction<<<grid, block>>>(pd, N, psd);
        /* copy result from device to host */
        cutilSafeCall( cudaMemcpy( ps, psd, (N/BLOCK_X)*sizeof(vector_t),
                                   cudaMemcpyDeviceToHost) );
        /* set result */
        for (i=0;i<N/BLOCK_X;i++)
            result += ps[i];

        /* clean memory */
        cudaFree(v1d);
        cudaFree(v2d);
        cudaFree(pd);
        cudaFree(psd);
        free(p);
        free(ps);
        free(zero);
        /* return result */
        return result;
    }/* ATL_CUBLASLO_sdot() */

D. ATLAS Source Modification

Because we use ATLAS's timers to test the sdot implementation, the ATLAS source SRCDIR/bin/l1blastst.c has to be modified. We can simply change the define for trusted_dot to our sdot function. The following shows the modification. The original Mjoin (commented out) is replaced with our sdot implementation ATL_CUBLASLO_sdot.

    #define trusted_dot( N, X, ix, Y, iy ) \
       ATL_CUBLASLO_sdot (N, X, ix, Y, iy )
    //Mjoin( TP1, dot )( N, X, ix, Y, iy )

E. Compilation and Execution

Once the makefiles are in place, change directory to BLDDIR/bin and type the following command:

    make xsl1blastst

To run the test program, simply type the following command:

    ./xsl1blastst -R dot

More options can be found by typing the following:

    ./xsl1blastst help

F. Test Runs

The following test run shows the correctness (PASS entries) and the performance (SpUp column) of our BLAS implementation.

    danlo@etl-corei74b: /atlas/atlas /linux_i7_64_build/bin$ ./xsl1blastst -R dot -N

    DOT
    TST#     N  INCX  INCY    TIME  MFLOP   SpUp   TEST
    ====  ====  ====  ====  ======  =====  =====  =====
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS
                                                   PASS

    10 tests run, 10 passed

REFERENCES

[1] R. C. Whaley, "Atlas installation guide," Technical Report CS-TR, CS at UTSA.
[2] "Nvidia CUDA programming guide version 2.2," develop.html.
[3] "CUDA CUBLAS library v2.2," develop.html, 2008.