Accelerate your Application with CUDA C




2D Minimum Algorithm

Consider applying a 2D window to a 2D array of elements: each output element is the minimum of the input elements within the window. [Figure: a window sliding over the input matrix to produce the output matrix.] For example, with a 2x2 window, output[0][0] = min(input[0][0], input[0][1], input[1][0], input[1][1]). 2D operations like this are found in many fundamental algorithms (interpolation, convolution, filtering), with applications in seismic processing, weather simulation, image processing, and more.

2D Minimum: C Version

#define WIN_SIZE 16
#define N 20000

int main()
{
    int size = N * N * sizeof(int);
    int i, j, x, y, temp;

    // allocate resources
    int (*cell)[N] = (int (*)[N])malloc(size);   // input
    int (*node)[N] = (int (*)[N])malloc(size);   // output

    // initialize data
    initializeArray(cell, N);
    initializeArray(node, N);

    // loop over dataset
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            // loop over window: find min (as on the slides, the window reads
            // up to WIN_SIZE-1 elements past row/column N-1 at the edges,
            // so a padded input is assumed)
            temp = node[i][j];
            for (x = 0; x < WIN_SIZE; x++)
                for (y = 0; y < WIN_SIZE; y++)
                    if (temp > cell[i+x][j+y])
                        temp = cell[i+x][j+y];
            node[i][j] = temp;
        }
    }

    // cleanup
    free(cell);
    free(node);
}
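The deck never shows initializeArray; a minimal stand-in sketch, assuming it only needs to fill the N x N arrays with test values so the example can run end to end (the body here is an assumption, not from the slides):

#include <stdlib.h>

// Hypothetical stand-in for the slides' initializeArray:
// fill an N x N array with arbitrary values.
void initializeArray(int a[][N], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i][j] = rand() % 1000;   // arbitrary test values
}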

2D Minimum: C Version - Baseline Performance

Algorithm    Device       Compute    Speedup
C (serial)   Xeon X5650   96.1 sec   --

CPU + GPU = Big Speedup

Use CUDA to parallelize the compute-intensive functions of your application on the GPU; the rest of the sequential code continues to run on the CPU.

Explicit Data Movement

[Figure: the CPU (host) and its memory, holding in and out, connected over the PCI bus to the GPU (device) and its memory, holding d_in and d_out.] Data must be moved explicitly between host memory and device memory across the PCI bus.
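A minimal sketch of this pattern, using the figure's in/out and d_in/d_out names (the buffer size is illustrative):

// Explicit host<->device data movement, as in the figure.
float in[256], out[256];                  // host buffers
float *d_in, *d_out;                      // device buffers
cudaMalloc((void **)&d_in,  sizeof(in));
cudaMalloc((void **)&d_out, sizeof(out));
cudaMemcpy(d_in, in, sizeof(in), cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that read d_in and write d_out ...
cudaMemcpy(out, d_out, sizeof(out), cudaMemcpyDeviceToHost); // device -> host
cudaFree(d_in);
cudaFree(d_out);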

2D Window: CUDA Main Program

Compared with the C version, whose nested for loops do the work on the host, the CUDA C version adds four steps: allocate device memory, copy data to the device, call the CUDA function (launch parameters plus device pointers), and clean up the device allocations:

int main()
{
    int size = N * N * sizeof(int);

    // allocate resources
    int (*cell)[N] = (int (*)[N])malloc(size);   // input
    int (*node)[N] = (int (*)[N])malloc(size);   // output

    // allocate device memory
    int (*d_cell)[N];  cudaMalloc((void **)&d_cell, size);
    int (*d_node)[N];  cudaMalloc((void **)&d_node, size);

    initializeArray(cell, N);
    initializeArray(node, N);

    // copy data to the device
    cudaMemcpy(d_cell, cell, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_node, node, size, cudaMemcpyHostToDevice);

    // call the CUDA function: launch parameters, then device pointers
    compute_win2d<<<nBlocks, nThreads>>>(d_node, d_cell);

    // copy the result back to the host (required to use it; the slides omit this step)
    cudaMemcpy(node, d_node, size, cudaMemcpyDeviceToHost);

    // cleanup
    free(cell);
    free(node);
    cudaFree(d_cell);
    cudaFree(d_node);
}
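The slides omit error handling; a hedged sketch of a common pattern for catching failures (kernel launches are asynchronous, so errors surface only when checked):

// Not from the slides: check for launch and runtime errors.
compute_win2d<<<nBlocks, nThreads>>>(d_node, d_cell);
cudaError_t err = cudaGetLastError();    // errors from the launch itself
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();       // errors raised while the kernel ran
if (err != cudaSuccess)
    printf("compute_win2d failed: %s\n", cudaGetErrorString(err));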

Parallel Execution Model

A CUDA C function is executed by many parallel threads. Threads are organized as a grid of independent thread blocks. [Figure: a grid consisting of Block 0, Block 1, ..., Block N_B-2, Block N_B-1.]
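The launch parameters nBlocks and nThreads used in main are never defined in the transcription; a minimal sketch for the 2D kernel that follows, assuming 16x16 thread blocks (the block size is a guess; the deck only fixes WIN_SIZE = 16 and N = 20000, which 16 divides evenly):

// Assumed launch configuration for the 2D grid of thread blocks.
dim3 nThreads(16, 16);          // threads per block in x and y
dim3 nBlocks(N / 16, N / 16);   // enough blocks to cover all N x N outputs
compute_win2d<<<nBlocks, nThreads>>>(d_node, d_cell);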

2D Window: CUDA Kernel Function

Two changes turn the serial loop nest into a kernel: add the __global__ keyword, and replace the outer loops over i and j with an index calculation from blockIdx, blockDim, and threadIdx, so that each thread computes one output element:

__global__ void compute_win2d(int knode[][N], int kcell[][N])
{
    // replace the outer loops with a per-thread index calculation
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int temp, x, y;

    if ((idx < N) && (idy < N)) {
        // find minimum in window
        temp = knode[idx][idy];
        for (x = 0; x < WIN_SIZE; x++)
            for (y = 0; y < WIN_SIZE; y++)
                if (temp > kcell[idx+x][idy+y])
                    temp = kcell[idx+x][idy+y];
        knode[idx][idy] = temp;
    }
}

2D Window: Performance

Algorithm    Device       Compute    Speedup
C (serial)   Xeon X5650   96.1 sec   --
CUDA         M2090        8.33 sec   11.5x
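The deck does not say how the compute times were taken; a hedged sketch using CUDA events, one standard way to time kernel execution:

// Illustrative timing with CUDA events (not from the slides).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
compute_win2d<<<nBlocks, nThreads>>>(d_node, d_cell);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel and stop event finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);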

Observation: Data Reuse

Neighboring threads read the same elements from global memory: adjacent WIN_SIZE x WIN_SIZE windows overlap, so each input element is loaded many times. [Figure: the windows of Thread(i) and Thread(i+1) overlapping in global memory.]

Optimization: Use Shared Memory

Each block copies the tile it needs from global memory into shared memory once, and all of its threads then read their window values from fast shared memory. [Figure: a (BLK_SIZE + WSIZE)-square shared-memory tile filled from global memory in four pieces: interior region 1 plus halo regions 2, 3, and 4.]

CUDA C: Optimized Kernel

The optimized kernel proceeds in steps: allocate shared memory for each block, calculate global memory indices, load interior region 1 and halo regions 2-4 into shared memory, wait for all threads to finish writing, find the minimum by reading the window from shared memory and accumulating into a register, and finally write the minimum to global memory:

__global__ void compute_win2d(int knode[][N], int kcell[][N])
{
    // allocate shared memory for each block: the block's tile plus its halo
    __shared__ int smem[BLK_SIZE + WSIZE][BLK_SIZE + WSIZE];

    // calculate global memory indices
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int temp, x, y;

    if ((idx < N) && (idy < N)) {
        // load interior region 1 into shared memory
        smem[threadIdx.x][threadIdx.y] = kcell[idx][idy];
        // load halo region 2: the last WSIZE threads in y also fetch the
        // element WSIZE columns to the right
        if (threadIdx.y >= BLK_SIZE - WSIZE)
            smem[threadIdx.x][threadIdx.y + WSIZE] = kcell[idx][idy + WSIZE];
        // load halo region 3: the last WSIZE threads in x fetch the element
        // WSIZE rows down
        if (threadIdx.x >= BLK_SIZE - WSIZE)
            smem[threadIdx.x + WSIZE][threadIdx.y] = kcell[idx + WSIZE][idy];
        // load halo region 4: the corner threads fetch the diagonal element
        if ((threadIdx.x >= BLK_SIZE - WSIZE) && (threadIdx.y >= BLK_SIZE - WSIZE))
            smem[threadIdx.x + WSIZE][threadIdx.y + WSIZE] =
                kcell[idx + WSIZE][idy + WSIZE];

        // wait for all threads in the block to finish writing to shared memory
        __syncthreads();

        // find minimum in window: read input from shared memory,
        // accumulate into a register
        temp = knode[idx][idy];
        for (x = 0; x < WSIZE; x++)
            for (y = 0; y < WSIZE; y++)
                if (temp > smem[threadIdx.x + x][threadIdx.y + y])
                    temp = smem[threadIdx.x + x][threadIdx.y + y];

        // write minimum to global memory
        knode[idx][idy] = temp;
    }
}
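BLK_SIZE and WSIZE are not defined anywhere in the transcription; a sketch of plausible values and the resulting shared-memory footprint (these are assumptions, not from the deck):

// Assumed constants: WSIZE matches WIN_SIZE = 16; BLK_SIZE = 16 is a guess.
#define WSIZE    16
#define BLK_SIZE 16
// Shared memory per block: (16+16) * (16+16) * sizeof(int) = 4 KB,
// comfortably within the 48 KB available per multiprocessor on an M2090.
dim3 threads(BLK_SIZE, BLK_SIZE);   // blockDim must equal BLK_SIZE for the tiling
dim3 blocks(N / BLK_SIZE, N / BLK_SIZE);
compute_win2d<<<blocks, threads>>>(d_node, d_cell);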

Questions?