Parallel Image Processing with CUDA: A case study with the Canny Edge Detection Filter




Parallel Image Processing with CUDA: A case study with the Canny Edge Detection Filter. Daniel Weingaertner, Informatics Department, Federal University of Paraná, Brazil. Hochschule Regensburg, 02.05.2011. Daniel Weingaertner (DInf-UFPR) FH-Regensburg 1 / 40

Summary
1. Introduction
2. Insight Toolkit (ITK)
3. GPGPU and CUDA
4. Integrating CUDA and ITK
5. Canny Edge Detection
6. Experimental Results
7. Conclusion

Paraná, Brazil

Brazil and Europe

Paraná

Curitiba

Federal University of Paraná

Informatics Department
Undergraduate:
- Bachelor in Computer Science: 8-semester course, 80 incoming students per year
- Bachelor in Biomedical Informatics: 8-semester course, 30 incoming students per year
Graduate: Master and PhD in Computer Science
- Algorithms, Image Processing, Computer Vision, Artificial Intelligence
- Databases, Scientific Computing and Open Source Software, Computer-Human Interface
- Computer Networks, Embedded Systems


Insight Toolkit (ITK)
Created in 1999. Open source, multi-platform, object oriented (templates), with good documentation and support.
Figure: Image Processing Workflow in ITK

ITK - Sample code

#include "itkImage.h"
#include "itkImageFileReader.h"
#include "itkImageFileWriter.h"
#include "itkCannyEdgeDetectionImageFilter.h"
#include <cstdlib>

typedef itk::Image<float, 2> ImageType;
typedef itk::ImageFileReader<ImageType> ReaderType;
typedef itk::ImageFileWriter<ImageType> WriterType;
typedef itk::CannyEdgeDetectionImageFilter<ImageType, ImageType> CannyFilter;

int main(int argc, char *argv[])
{
    ReaderType::Pointer reader = ReaderType::New();
    reader->SetFileName(argv[1]);
    reader->Update();

    CannyFilter::Pointer canny = CannyFilter::New();
    canny->SetInput(reader->GetOutput());
    canny->SetVariance(atof(argv[3]));
    canny->SetUpperThreshold(atoi(argv[4]));
    canny->SetLowerThreshold(atoi(argv[5]));
    canny->Update();

    WriterType::Pointer writer = WriterType::New();
    writer->SetFileName(argv[2]);
    writer->SetInput(canny->GetOutput());
    writer->Update();

    return EXIT_SUCCESS;
}


What is GPGPU Computing?
The use of the GPU for general-purpose computation. CPU and GPU can be used concurrently. To the end user, it's simply a way to run applications faster.

What is CUDA?
CUDA = Compute Unified Device Architecture: a general-purpose parallel computing architecture. It provides libraries, a C language extension and a hardware driver.

Parallel Processing Models

Single-Instruction Multiple-Thread Unit
The SIMT unit creates, manages, schedules and executes threads in groups of 32 (a warp). All threads in a warp start at the same point, but they are free to branch to different code paths independently.
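The warp grouping can be sketched on the host: with the fixed warp size of 32, a flattened thread index maps to a warp and a lane as follows (illustrative helper functions, not CUDA API):

```c
#include <assert.h>

#define WARP_SIZE 32  /* fixed group size on the GPUs discussed here */

/* Warp that a flattened thread index belongs to. */
static int warp_of(int thread_idx) {
    return thread_idx / WARP_SIZE;
}

/* Position (lane) of the thread inside its warp. */
static int lane_of(int thread_idx) {
    return thread_idx % WARP_SIZE;
}
```

Threads 0-31 form warp 0, threads 32-63 form warp 1, and so on. When the threads of one warp branch to different code paths, the hardware serializes the paths, which is why divergence within a warp costs performance.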

CUDA Architecture Overview

Optimization Strategies for CUDA
The main optimization strategies for CUDA involve:
- Optimized/careful memory access
- Maximizing processor utilization
- Maximizing non-serialized instructions
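The first point largely means coalesced access: consecutive threads should touch consecutive addresses so that a warp's loads merge into few memory transactions. A host-side sketch of the two access patterns (hypothetical helper names, for illustration only):

```c
#include <assert.h>

/* Element index touched by thread t under a coalesced layout:
   consecutive threads read consecutive elements, so one warp
   maps to one contiguous memory segment. */
static int coalesced_index(int t) {
    return t;
}

/* Element index under a strided layout with stride s:
   consecutive threads are s elements apart, splitting what could
   be one memory transaction into many. */
static int strided_index(int t, int s) {
    return t * s;
}
```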

CUDA - Sample Code

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + 1.f;
}

int main(void)
{
    float *a_h, *b_h;   /* pointers to host memory */
    float *a_d;         /* pointer to device memory */
    int i, N = 10000;
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    cudaMalloc((void **)&a_d, size);
    for (i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    incrementArrayOnHost(a_h, N);
    int blockSize = 256;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    incrementArrayOnDevice<<<nBlocks, blockSize>>>(a_d, N);
    cudaMemcpy(b_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    free(a_h); free(b_h); cudaFree(a_d);
    return 0;
}


Integrating CUDA Filters into ITK Workflow
The ITK community suggests: careful!
- Re-implement filters where parallelizing provides a significant speedup
- Consider the entire workflow: copying to/from the GPU is very time consuming
- "Premature optimization is the root of all evil!" (Donald Knuth)


CUDA Insight Toolkit (CITK)
Changes to ITK: a slight, backwards-compatible architecture change, the CudaImportImageContainer. Data is transferred between HOST and DEVICE only on demand, which allows for filter chaining inside the DEVICE.
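The on-demand transfer idea can be modelled on the host alone. The sketch below is illustrative only (LazyImage and get_device_buffer are hypothetical names, not the CITK API); a plain buffer stands in for device memory, and a flag records which side holds the current data:

```c
#include <assert.h>
#include <string.h>

/* Host-only model of on-demand HOST<->DEVICE transfer.
   Two buffers stand in for host and device memory; the flag
   records whether the device copy is up to date. */
typedef struct {
    float *host;
    float *device;          /* would be cudaMalloc'd in real code */
    size_t n;
    int device_is_current;
} LazyImage;

/* Return the device buffer, copying from the host only if the
   device copy is stale. `copies` counts actual transfers, so a
   chain of device-side filters triggers at most one upload. */
static float *get_device_buffer(LazyImage *img, int *copies) {
    if (!img->device_is_current) {
        memcpy(img->device, img->host, img->n * sizeof(float)); /* cudaMemcpy H->D */
        img->device_is_current = 1;
        (*copies)++;
    }
    return img->device;
}
```

Calling get_device_buffer twice in a row performs only one copy, which is the property that makes filter chaining inside the device cheap.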


CudaCanny: itkCudaCannyEdgeDetectionImageFilter
Algorithm 1: Canny Edge Detection Filter
1. Gaussian Smoothing
2. Gradient Computation
3. Non-Maximum Suppression
4. Hysteresis

Gradient Computation with Sobel Filter: itkCudaSobelEdgeDetectionImageFilter
(a) Sobel X  (b) Sobel Y

$L_v = \sqrt{L_x^2 + L_y^2}$  (1)

$\theta = \arctan\left(\frac{L_y}{L_x}\right)$  (2)

Optimization for Edge Direction Computation

Code Extract from CudaSobel

Hysteresis Operation

Hysteresis Algorithm
Algorithm 2: Hysteresis on CPU
  Transfer the Gradient/NMS images to the GPU
  repeat
    Run the hysteresis kernel on the GPU
  until no pixel changes status
  Return the edge image
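The fixed-point loop of Algorithm 2 can be sketched on the host. This is illustrative code, not the cudaCanny kernel; the labelling (2 = definite edge above the upper threshold, 1 = candidate between the thresholds, 0 = no edge) is an assumption:

```c
#include <assert.h>

/* One hysteresis pass over a W x H label image: promotes
   candidates (label 1) that touch a definite edge (label 2)
   in their 8-neighbourhood. Returns the number of pixels that
   changed, so the caller loops until no pixel changes status,
   mirroring the repeat/until of Algorithm 2. */
static int hysteresis_pass(unsigned char *lbl, int w, int h) {
    int changed = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (lbl[y * w + x] != 1) continue;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                    if (lbl[ny * w + nx] == 2) {
                        lbl[y * w + x] = 2;  /* promote candidate */
                        changed++;
                        dy = dx = 2;         /* stop scanning neighbours */
                    }
                }
        }
    return changed;
}
```

Driving it with `while (hysteresis_pass(img, w, h) > 0) {}` grows edges along chains of candidates, which is exactly what makes hysteresis iterative and hard to parallelize naively.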

Hysteresis Algorithm
Algorithm 3: Hysteresis on GPU
  Load an image region of size 18×18 into shared memory
  modified ← false
  repeat
    modified_region ← false
    Synchronize the threads of the same multiprocessor
    if a pixel changes status then
      modified ← true
      modified_region ← true
    end if
    Synchronize the threads of the same multiprocessor
  until modified_region = false
  if modified = true then
    Update the modified status on the HOST
  end if
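The 18×18 region corresponds, assuming the common stencil-kernel layout of a 16×16 thread block plus a one-pixel halo on each side (the slide states only the region size, so the block size is an assumption), to the geometry below:

```c
#include <assert.h>

/* Shared-memory tile geometry for a stencil that reads a
   one-pixel neighbourhood: a block_dim x block_dim thread block
   needs a (block_dim + 2*halo)^2 region so border threads can
   see their neighbours. 16x16 threads with halo 1 gives the
   18x18 region of Algorithm 3 (block size assumed, see above). */
static int shared_dim(int block_dim, int halo) {
    return block_dim + 2 * halo;
}

/* Index of thread (tx, ty) in the flattened shared region,
   offset past the halo border. */
static int shared_index(int tx, int ty, int block_dim, int halo) {
    int dim = shared_dim(block_dim, halo);
    return (ty + halo) * dim + (tx + halo);
}
```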


Methodology
Hardware:
Server:
- CPU: 4x AMD Opteron 6136 at 2.4 GHz with 8 cores, each with 512 KB cache, and 126 GB RAM
- GPU1: NVidia Tesla C2050 with 448 cores at 1.15 GHz and 3 GB RAM
- GPU2: NVidia Tesla C1060 with 240 cores at 1.3 GHz and 4 GB RAM
Desktop:
- CPU: Intel Core 2 Duo E7400 at 2.80 GHz with 3072 KB cache and 2 GB RAM
- GPU: NVidia GeForce 8800 GT with 112 cores at 1.5 GHz and 512 MB RAM

Methodology
Images from the Berkeley Segmentation Dataset:

Base  Image resolution           Num. of images
B1    321×481 and 481×321        100
B2    642×962 and 962×642        100
B3    1284×1924 and 1924×1284    100
B4    2568×3848 and 3848×2568    100

Performance Tests
(Benchmark charts not transcribed.)


Conclusion
Parallel Programming:
- Parallel programming is definitely the way to go.
- Implementing efficient parallel code is demanding.
- The programmer should know more details about the hardware, especially the memory architecture.
Canny Filter with CUDA:
- We obtained a great speedup on the edge detection filter.
- We also noticed that the existing implementation is not efficient.
- There is still a LOT of work to do if we want to parallelize ITK.


Contact
Thank You!
Daniel Weingaertner, danielw@inf.ufpr.br