Parallel Image Processing with CUDA
A case study with the Canny Edge Detection Filter
Daniel Weingaertner
Informatics Department, Federal University of Paraná - Brazil
Hochschule Regensburg, 02.05.2011
Daniel Weingaertner (DInf-UFPR) FH-Regensburg 1 / 40

Summary
1 Introduction
2 Insight Toolkit (ITK)
3 GPGPU and CUDA
4 Integrating CUDA and ITK
5 Canny Edge Detection
6 Experimental Results
7 Conclusion

Paraná Brazil
Brazil Europe
Paraná
Curitiba
Federal University of Paraná

Informatics Department
Undergraduate:
- Bachelor in Computer Science: 8-semester course, 80 incoming students per year
- Bachelor in Biomedical Informatics: 8-semester course, 30 incoming students per year
Graduate: Master and PhD in Computer Science
- Algorithms, Image Processing, Computer Vision, Artificial Intelligence
- Databases, Scientific Computing and Open Source Software, Computer-Human Interface
- Computer Networks, Embedded Systems

Insight Toolkit (ITK)
- Created in 1999
- Open source
- Multi-platform
- Object-oriented (templates)
- Good documentation and support
Figure: Image Processing Workflow in ITK

ITK - Sample code

#include <cstdlib>
#include "itkImage.h"
#include "itkImageFileReader.h"
#include "itkImageFileWriter.h"
#include "itkCannyEdgeDetectionImageFilter.h"

typedef itk::Image<float, 2> ImageType;
typedef itk::ImageFileReader<ImageType> ReaderType;
typedef itk::ImageFileWriter<ImageType> WriterType;
typedef itk::CannyEdgeDetectionImageFilter<ImageType, ImageType> CannyFilter;

// Arguments: input file, output file, variance, upper threshold, lower threshold
int main(int argc, char* argv[]) {
    if (argc < 6) return EXIT_FAILURE;  // all five arguments are required

    // Read the input image
    ReaderType::Pointer reader = ReaderType::New();
    reader->SetFileName(argv[1]);
    reader->Update();

    // Configure and run the Canny edge detection filter
    CannyFilter::Pointer canny = CannyFilter::New();
    canny->SetInput(reader->GetOutput());
    canny->SetVariance(atof(argv[3]));
    canny->SetUpperThreshold(atoi(argv[4]));
    canny->SetLowerThreshold(atoi(argv[5]));
    canny->Update();

    // Write the edge image
    WriterType::Pointer writer = WriterType::New();
    writer->SetFileName(argv[2]);
    writer->SetInput(canny->GetOutput());
    writer->Update();

    return EXIT_SUCCESS;
}

What is GPGPU Computing?
- The use of the GPU for general-purpose computation.
- CPU and GPU can be used concurrently.
- To the end user, it is simply a way to run applications faster.

What is CUDA?
- CUDA = Compute Unified Device Architecture.
- General-purpose parallel computing architecture.
- Provides libraries, a C language extension and a hardware driver.

Parallel Processing Models

Single-Instruction Multiple-Thread (SIMT) Unit
- Creates, manages, schedules and executes threads in groups of 32 (warps).
- All threads in a warp start at the same program address.
- Each thread is free to branch independently, but divergent paths within a warp are executed serially.

CUDA Architecture Overview

Optimization Strategies for CUDA
The main optimization strategies for CUDA involve:
- Optimized, careful memory access
- Maximization of processor utilization
- Maximization of non-serialized instructions

CUDA - Sample Code

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

// CPU reference: increment every element by 1
void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
}

// GPU kernel: each thread increments one element
__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + 1.f;  // threads past the end do nothing
}

int main(void)
{
    float *a_h, *b_h;  // pointers to host memory
    float *a_d;        // pointer to device memory
    int i, N = 10000;
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    cudaMalloc((void **)&a_d, size);
    for (i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    incrementArrayOnHost(a_h, N);
    int blockSize = 256;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);  // round up
    incrementArrayOnDevice<<<nBlocks, blockSize>>>(a_d, N);
    cudaMemcpy(b_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    for (i = 0; i < N; i++) assert(a_h[i] == b_h[i]);  // host and device results must match
    free(a_h); free(b_h); cudaFree(a_d);
    return 0;
}

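The block count in the sample above is a ceiling division, and the `idx < N` guard handles the surplus threads of the last block. The following host-only C++ sketch (no GPU required) emulates that index arithmetic to show that every element is touched exactly once. The names `blocksFor` and `coverage` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that nBlocks * blockSize >= n (ceiling division).
// Gives the same result as N/blockSize + (N % blockSize == 0 ? 0 : 1).
int blocksFor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Emulates the kernel's index computation on the host: for every
// (block, thread) pair compute idx = block * blockSize + thread and
// count how often each array element would be written.
std::vector<int> coverage(int n, int blockSize) {
    std::vector<int> hits(n, 0);
    const int nBlocks = blocksFor(n, blockSize);
    for (int block = 0; block < nBlocks; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            const int idx = block * blockSize + thread;
            if (idx < n) ++hits[idx];  // same guard as in the kernel
        }
    }
    return hits;
}
```

For N = 10000 and a block size of 256 this yields 40 blocks, i.e. 10240 threads, of which 240 do nothing because of the guard.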
Integrating CUDA Filters into ITK Workflow
The ITK community suggests: Careful!
- Re-implement filters where parallelization provides a significant speedup.
- Consider the entire workflow: copying to/from the GPU is very time-consuming.
- "Premature optimization is the root of all evil!" (Donald Knuth)

CUDA Insight Toolkit (CITK)
Changes to ITK:
- Slight architecture change: CudaImportImageContainer
- Backwards compatible
- Data transfer between HOST and DEVICE only on demand
- Allows for filter chaining inside the DEVICE

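The idea of transferring data only on demand can be illustrated with a container that tracks which copy (HOST or DEVICE) is currently valid. The sketch below is not the CudaImportImageContainer code: `LazyBuffer` and `deviceAddOne` are hypothetical names, and the DEVICE memory is simulated with a second `std::vector` so that the example runs without a GPU. A real container would call `cudaMemcpy` where the sketch copies vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container mirroring data on HOST and (simulated) DEVICE.
// Copies happen only when the requested side is out of date.
class LazyBuffer {
public:
    explicit LazyBuffer(std::size_t n) : host_(n, 0.f), device_(n, 0.f) {}

    // Write access invalidates the other side; read access does not.
    float* hostWrite() { syncToHost(); deviceValid_ = false; return host_.data(); }
    const float* hostRead() { syncToHost(); return host_.data(); }
    float* deviceWrite() { syncToDevice(); hostValid_ = false; return device_.data(); }
    const float* deviceRead() { syncToDevice(); return device_.data(); }

    int transfers() const { return transfers_; }
    std::size_t size() const { return host_.size(); }

private:
    void syncToHost() {
        if (!hostValid_) { host_ = device_; hostValid_ = true; ++transfers_; }
    }
    void syncToDevice() {
        if (!deviceValid_) { device_ = host_; deviceValid_ = true; ++transfers_; }
    }

    std::vector<float> host_, device_;
    bool hostValid_ = true;     // data starts out on the HOST
    bool deviceValid_ = false;
    int transfers_ = 0;         // number of simulated HOST<->DEVICE copies
};

// Simulated "device filter": touches only the device copy.
void deviceAddOne(LazyBuffer& buf) {
    float* d = buf.deviceWrite();
    for (std::size_t i = 0; i < buf.size(); ++i) d[i] += 1.f;
}
```

Chaining two such filters costs one upload before the first filter and one download when the HOST reads the result; no transfer happens between the filters.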
CudaCanny
itkCudaCannyEdgeDetectionImageFilter
Algorithm 1: Canny Edge Detection Filter
  1. Gaussian Smoothing
  2. Gradient Computation
  3. Non-Maximum Suppression
  4. Hysteresis

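The first step of Algorithm 1, Gaussian smoothing, is controlled by the variance passed to the filter. As a rough illustration (not the CITK implementation), the sketch below builds a normalized 1D kernel and smooths one image row in plain C++. Because the 2D Gaussian is separable, rows and columns can be processed with the same kernel. The names `gaussianKernel` and `smoothRow`, the 3-sigma kernel radius and the border clamping are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Builds a normalized 1D Gaussian kernel for a given variance (sigma^2).
// Assumption: the kernel is truncated at a radius of ceil(3 * sigma).
std::vector<float> gaussianKernel(float variance) {
    assert(variance > 0.f);
    const float sigma = std::sqrt(variance);
    const int radius = static_cast<int>(std::ceil(3.f * sigma));
    std::vector<float> k(2 * radius + 1);
    float sum = 0.f;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-(i * i) / (2.f * variance));
        sum += k[i + radius];
    }
    for (float& v : k) v /= sum;  // weights sum to 1
    return k;
}

// Convolves one row with the kernel; pixels outside the row are
// replaced by the nearest border pixel (clamping).
std::vector<float> smoothRow(const std::vector<float>& row,
                             const std::vector<float>& k) {
    const int n = static_cast<int>(row.size());
    const int radius = static_cast<int>(k.size()) / 2;
    std::vector<float> out(n, 0.f);
    for (int x = 0; x < n; ++x) {
        float acc = 0.f;
        for (int i = -radius; i <= radius; ++i) {
            const int xi = std::min(std::max(x + i, 0), n - 1);
            acc += row[xi] * k[i + radius];
        }
        out[x] = acc;
    }
    return out;
}
```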
Gradient Computation with Sobel Filter
itkCudaSobelEdgeDetectionImageFilter
Figure: (a) Sobel X, (b) Sobel Y

L_v = \sqrt{L_x^2 + L_y^2}    (1)

\theta = \arctan\left(\frac{L_y}{L_x}\right)    (2)

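Equations (1) and (2) can be evaluated per pixel once the two Sobel responses are known. The following CPU sketch in C++ is only a reference illustration, not the CudaSobel kernel: the `Gradient` struct and the `sobel` function are hypothetical names, border pixels are simply left at zero, and `std::atan2` is used in place of a plain arctangent of the quotient so that L_x = 0 needs no special case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Gradient {
    std::vector<float> magnitude;  // L_v, equation (1)
    std::vector<float> direction;  // theta in radians, equation (2)
};

// Sobel gradient of a row-major grayscale image.
// Assumption: the one-pixel border is left at zero.
Gradient sobel(const std::vector<float>& img, int width, int height) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    Gradient g;
    g.magnitude.assign(img.size(), 0.f);
    g.direction.assign(img.size(), 0.f);
    auto at = [&](int x, int y) { return img[y * width + x]; };
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // Sobel X: right column minus left column, weights 1-2-1
            const float lx = (at(x + 1, y - 1) + 2.f * at(x + 1, y) + at(x + 1, y + 1))
                           - (at(x - 1, y - 1) + 2.f * at(x - 1, y) + at(x - 1, y + 1));
            // Sobel Y: bottom row minus top row, weights 1-2-1
            const float ly = (at(x - 1, y + 1) + 2.f * at(x, y + 1) + at(x + 1, y + 1))
                           - (at(x - 1, y - 1) + 2.f * at(x, y - 1) + at(x + 1, y - 1));
            g.magnitude[y * width + x] = std::sqrt(lx * lx + ly * ly);
            g.direction[y * width + x] = std::atan2(ly, lx);
        }
    }
    return g;
}
```

On an image with a vertical step edge, the pixels next to the step get a non-zero magnitude and a direction of 0, because the gradient points along the x axis.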
Optimization for Edge Direction Computation
Code Extract from CudaSobel

Hysteresis Operation

Hysteresis Algorithm
Algorithm 2: Hysteresis on CPU
  Transfer the Gradient/NMS images to the GPU
  repeat
    Run the hysteresis kernel on the GPU
  until no pixel changes status
  Return the edge image

Hysteresis Algorithm
Algorithm 3: Hysteresis on GPU
  Load an image region of size 18x18 into shared memory
  modified ← false
  repeat
    modified_region ← false
    Synchronize threads of the same block
    if pixel changes status then
      modified ← true
      modified_region ← true
    end if
    Synchronize threads of the same block
  until modified_region = false
  if modified = true then
    Update modified status on HOST
  end if

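Algorithms 2 and 3 share the same fixed-point idea: keep promoting candidate pixels that touch a confirmed edge pixel until nothing changes. The C++ sketch below is a sequential CPU reference of that idea, not the GPU kernel; it works on the whole image at once instead of 18x18 regions in shared memory. The function names, the three status values and the threshold convention (magnitude >= upper is an edge, magnitude >= lower is a candidate) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kNone = 0, kWeak = 1, kStrong = 2;

// One sweep over the image: a weak pixel with a strong 8-neighbor becomes
// strong. Returns true if any pixel changed status (the "modified" flag).
bool hysteresisSweep(std::vector<std::uint8_t>& status, int width, int height) {
    bool modified = false;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (status[y * width + x] != kWeak) continue;
            bool strongNeighbor = false;
            for (int dy = -1; dy <= 1 && !strongNeighbor; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    if (status[ny * width + nx] == kStrong) { strongNeighbor = true; break; }
                }
            }
            if (strongNeighbor) { status[y * width + x] = kStrong; modified = true; }
        }
    }
    return modified;
}

// Classifies pixels by threshold, then repeats sweeps until no pixel
// changes status. Returns 1 for edge pixels and 0 otherwise.
std::vector<std::uint8_t> hysteresis(const std::vector<float>& magnitude,
                                     int width, int height,
                                     float lower, float upper) {
    assert(magnitude.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> status(magnitude.size(), kNone);
    for (std::size_t i = 0; i < magnitude.size(); ++i) {
        if (magnitude[i] >= upper) status[i] = kStrong;
        else if (magnitude[i] >= lower) status[i] = kWeak;
    }
    while (hysteresisSweep(status, width, height)) {
        // repeat until no pixel changes status
    }
    std::vector<std::uint8_t> edges(status.size(), 0);
    for (std::size_t i = 0; i < status.size(); ++i) edges[i] = (status[i] == kStrong) ? 1 : 0;
    return edges;
}
```

In the GPU version the role of the `while` loop is split in two: the inner repeat of Algorithm 3 iterates inside one region, and the outer repeat of Algorithm 2 relaunches the kernel so that changes can propagate across region borders.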
Methodology
Hardware:
Server:
- CPU: 4x AMD Opteron(tm) Processor 6136, 2.4 GHz, with 8 cores, each with 512 KB cache; 126 GB RAM
- GPU1: NVidia Tesla C2050 with 448 cores at 1.15 GHz and 3 GB RAM
- GPU2: NVidia Tesla C1060 with 240 cores at 1.3 GHz and 4 GB RAM
Desktop:
- CPU: Intel(R) Core(TM)2 Duo E7400, 2.80 GHz, with 3072 KB cache and 2 GB RAM
- GPU: NVidia GeForce 8800 GT with 112 cores at 1.5 GHz and 512 MB RAM

Methodology
Images from the Berkeley Segmentation Dataset

Set   Image resolution            Num. of images
B1    321×481 and 481×321         100
B2    642×962 and 962×642         100
B3    1284×1924 and 1924×1284     100
B4    2568×3848 and 3848×2568     100

Performance Tests [four slides of result plots]

Conclusion
Parallel Programming
- Parallel programming is definitely the way to go.
- Implementing efficient parallel code is demanding.
- The programmer needs to know more details about the hardware, especially the memory architecture.
Canny Filter with CUDA
- We obtained a large speedup on the edge detection filter.
- We also noticed that the existing implementation is not efficient.
- There is still a LOT of work to do if we want to parallelize ITK.

Contact
Thank You!
Daniel Weingaertner
danielw@inf.ufpr.br