Image Processing on Graphics Processing Unit with CUDA and C++




Matthieu Garrigues <matthieu.garrigues@gmail.com>
ENSTA-ParisTech, June 15, 2011

Outline
1. The Graphics Processing Unit
2. Benchmarks
3. Real-Time Dense Scene Tracking
4. A Small CUDA Image Processing Toolkit
5. Dige

Nvidia CUDA Architecture
(Schema from the CUDA documentation [NVI].)
- GeForce GTX 580: 16 multiprocessors x 32 processors.
- One multiprocessor runs 32 threads simultaneously.
- Threads local to a multiprocessor communicate via fast on-chip shared memory.
- Two-dimensional texture cache.

The CUDA Programming Model
(Schema from the CUDA documentation [NVI].)
- Every thread executes the same program on different data, according to a unique thread id. For example: one thread handles one pixel.
- Grid size and block size are defined by the user to fit the data dimensions.
- One block runs on one multiprocessor.
- Intra-block synchronization and communication happen via shared memory.

Parallelism Hides High Latency
(Schema from the CUDA documentation [NVI].)
- The GPU devotes more transistors to data processing.
- The absence of branch prediction, out-of-order execution, etc. implies a high read-after-write register latency (about 20-25 cycles).
- A multiprocessor must overlap the execution of at least 6 thread blocks to hide this latency.
- At least 30 x 6 x 16 = 2880 threads are needed to properly feed the GeForce GTX 280.

The PCI Bus
- Communication between host and device goes through the PCI bus, which can be a bottleneck: data transfers need to be minimized.
- However: PCI Express 2.0 16x theoretical bandwidth is 8 GB/s, and the next generation, PCI Express 3.0 16x, reaches 16 GB/s.
- The CUDA driver pins memory for direct memory access from the device.

Benchmarking Host-Device Communications

Benchmark:

    image2d<i_char4> ima(1024, 1024);
    host_image2d<i_char4> h_ima(1024, 1024);

    for (unsigned i = 0; i < 100; i++)
      copy(h_ima, ima); // Host to device
    for (unsigned i = 0; i < 100; i++)
      copy(ima, h_ima); // Device to host

Table: Results using PCI-Express 2.0 16x / GeForce GTX 280

                      classical memory    pinned memory
    host to device    1.44 GB/s           5.5 GB/s
    device to host    1.25 GB/s           4.7 GB/s

- About 1 ms to transfer a 512x512 RGBA 8-bit 2D image using pinned memory.
- Transfers and CUDA kernel executions can overlap with CPU computation.

Global Memory Access
(Figure: accesses falling in one 128-byte aligned segment are served by 1 memory transaction; scattered accesses cost up to 16 memory transactions.)
- Theoretical maximum bandwidth: 115 GB/s (GTX 460).
- Accesses inside a multiprocessor coalesce if they fall in the same 128-byte aligned segment.

Shared Memory Access
(Figure: parallel access to distinct shared memory banks versus a 16-way bank conflict, which serializes the access.)
- On-chip memory, 100x faster than global memory if there are no bank conflicts.
- Successive 32-bit words are assigned to successive banks.
- Bank bandwidth: 32 bits per cycle.
- Bank conflicts are serialized.

GPUs Are Not Magic...
- Host-device communications should be kept to a minimum.
- Global memory accesses inside a block should coalesce.
- The instruction flow should not diverge within a block.
- Shared memory should be used with care (bank conflicts).
- Enough thread blocks should run to hide register latency.

Convolution Benchmark: CPU vs GPU
Benchmark: convolution on rows and on columns. Image size: 3072 x 3072 x 32-bit float = 36 MB.
CPU version:
- Intel i5 2500k, 4 cores @ 3.3 GHz (May 2011: 150 euros)
- OpenMP used to feed all the cores
GPU version:
- GeForce GTX 460 (May 2011: 160 euros)
- CUDA and C++, 2D texture cache

Convolution on Rows: CPU vs GPU
(Plot: execution time in ms and speedup versus kernel size, Intel i5 2500k vs GeForce GTX 460.)

Convolution on Columns: CPU vs GPU
(Plot: execution time in ms and speedup versus kernel size, Intel i5 2500k vs GeForce GTX 460.)

Real-Time Dense Scene Tracking
- Track scene points through the frames.
- Estimate the apparent motion of each point.
- In real time.

Real-Time Dense Scene Tracking: Overview
Descriptor matching:
- 128-bit local jet descriptors [Man11]: partial derivatives at several scales, 16 x 8-bit components.
- Search for the closest point according to a distance in descriptor space.
Tracking:
- Move the trackers according to the matches.
- Filter bad matches (occlusions).
- Avoid tracker drift (sub-pixel motion).
- Estimate the speed of the trackers.


First Pass: Matching
(Figure: frame sequence I_0, I_1, I_2, I_3.)
For each pixel p of frame I_{t-1}, find its closest point in frame I_t. We limit the search to a fixed 2D neighborhood N(p):

    match_{t-1}(p) = argmin_{q in N(p)} D(I_{t-1}(p), I_t(q))

Second Pass: Tracking
For each point p of I_{t-1}:
1. Find its counterparts in I_t: C_p = {q in I_t, p = match_{t-1}(q)}
2. Threshold bad matches: F_p = {q in C_p, D(I_{t-1}(p), I_t(q)) < T}
3. Update the tracker's speed and age.
4. If there is no counterpart, the tracker is discarded.

Temporal Filters
Filtering the descriptor: instead of matching descriptors between I_{t-1} and I_t, we use a weighted mean of the tracker's history. Each tracker x computes at time t:

    x_v^t = (1 - alpha) x_v^{t-1} + alpha I_t

where alpha in [0, 1] is the forget factor. A low alpha minimizes tracker drift.

Filtering the tracker speed: likewise, the filtered tracker speed is given by:

    x_s^t = (1 - beta) x_s^{t-1} + beta S_x^t

where beta in [0, 1] is the forget factor and S_x^t is the speed estimate of tracker x at time t.

Results
Figure: motion vector field. Color encodes the speed orientation; luminance is proportional to the speed modulus.
240x180 pixels @ 30 fps (GeForce GTX 280).

CuImg: A Small CUDA Image Processing Toolkit
Goals:
- Write efficient image processing applications with CUDA.
- Reduce the size of the application code... and thus the number of potential bugs.

Image Types
host_image{2,3}d:
- Instances and pixel buffer live in main memory.
- Memory management using reference counting.
image{2,3}d:
- Instances live in main memory, the buffer in device memory.
- Memory management using reference counting.
- Implicitly converts to kernel_image{2,3}d.
kernel_image{2,3}d:
- Instances and pixel buffer live in device memory.
- Argument type of CUDA kernels.
- No memory management needed.

Image Types

    // Host types.
    template <typename T> host_image2d;
    template <typename T> host_image3d;
    // CUDA types.
    template <typename T> image2d;
    template <typename T> image3d;
    // Types used inside CUDA kernels.
    template <typename T> kernel_image2d;
    template <typename T> kernel_image3d;

    {
      image2d<float4> img(100, 100);       // creation.
      image2d<float4> img2 = img;          // light copy.
      image2d<float4> img_h(img.domain()); // same domain.

      // Host-device transfers
      copy(img_h, img); // CPU -> GPU
      copy(img, img_h); // GPU -> CPU

    } // Images are freed automatically here.

    // View a slice as a 2D image:
    image3d<float4> img3d(100, 100, 100);
    image2d<float4, simple_ptr> slice = img3d.slice(42);

Improved Built-In Types
CUDA built-in types:
- {char|uchar|short|ushort|int|uint|long|ulong|float}[1-4] and {double|longlong|ulonglong}[1-2]
- Access to coordinates via bt.x, bt.y, bt.z, bt.w
- No arithmetic operators
CuImg improved built-in types:
- Inherit from the CUDA built-in types
- Provide static type information such as T::size and T::vtype
- Access to coordinates via get<N>(bt)
- Generic arithmetic operators
- No memory or run-time overhead
- Use boost::typeof to compute the return type of generic operators

Improved Built-In Types: Example

CUDA built-ins:

    float4 a = make_float4(1.5f, 1.5f, 1.5f, 1.5f);
    int4 b = make_int4(1, 2, 3, 4);
    float4 c;
    c.x = (a.x + b.x) * 2.5f;
    c.y = (a.y + b.y) * 2.5f;
    c.z = (a.z + b.z) * 2.5f;
    c.w = (a.w + b.w) * 2.5f;

Improved built-ins:

    i_float4 a(1.5f, 1.5f, 1.5f, 1.5f);
    i_int4 b(1, 2, 3, 4);
    i_float4 c = (a + b) * 2.5f;

Inputs
Wrap OpenCV for use with CuImg image types.

Image and video:

    // Load 2D images.
    host_image2d<uchar3> img = load_image("test.jpg");

    // Read a USB camera.
    video_capture cam(0);
    host_image2d<uchar3> cam_img(cam.nrows(), cam.ncols());
    cam >> cam_img; // Get the camera's current frame

    // Read a video.
    video_capture vid("test.avi");
    host_image2d<uchar3> frame(vid.nrows(), vid.ncols());
    vid >> frame; // Get the next video frame

Fast Gaussian Convolutions
- Code generation: heavy use of C++ templates for loop unrolling.
- The Gaussian kernel is known at compile time and injected directly into the PTX assembly.
- Used by the local jet computation.
- Cons: large kernels slow down compilation.

Generated PTX (the immediates are the Gaussian coefficients):

    mov.f32 %f182, 0f3b1138f8; // 0.00221592
    mul.f32 %f183, %f169, %f182;
    mov.f32 %f184, 0f39e4c4b3; // 0.000436341
    mad.f32 %f185, %f184, %f181, %f183;
    mov.f32 %f186, 0f3c0f9782; // 0.00876415
    mad.f32 %f187, %f186, %f157, %f185;
    mov.f32 %f188, 0f3cdd25ab; // 0.0269955
    mad.f32 %f189, %f188, %f145, %f187;
    mov.f32 %f190, 0f3d84a043; // 0.0647588
    mad.f32 %f191, %f190, %f133, %f189;
    mov.f32 %f192, 0f3df7c6fc; // 0.120985
    mad.f32 %f193, %f192, %f121, %f191;
    mov.f32 %f194, 0f3e3441ff; // 0.176033
    mad.f32 %f195, %f194, %f109, %f193;
    mov.f32 %f196, 0f3e4c4220; // 0.199471
    mad.f32 %f197, %f196, %f97, %f195;
    mov.f32 %f198, 0f3e3441ff; // 0.176033
    mad.f32 %f199, %f198, %f85, %f197;
    mov.f32 %f200, 0f3df7c6fc; // 0.120985
    mad.f32 %f201, %f200, %f73, %f199;
    mov.f32 %f202, 0f3d84a043; // 0.0647588
    mad.f32 %f203, %f202, %f61, %f201;
    mov.f32 %f204, 0f3cdd25ab; // 0.0269955
    mad.f32 %f205, %f204, %f49, %f203;
    mov.f32 %f206, 0f3c0f9782; // 0.00876415
    mad.f32 %f207, %f206, %f37, %f205;
    mov.f32 %f208, 0f3b1138f8; // 0.00221592
    mad.f32 %f209, %f208, %f25, %f207;
    mov.f32 %f210, 0f39e4c4b3; // 0.000436341

Pixel-Wise Kernel Generator
Goal: use C++ to generate pixel-wise kernels.

Classic CUDA code:

    // Declaration
    template <typename T, typename U, typename V>
    __global__ void simple_kernel(kernel_image2d<T> a,
                                  kernel_image2d<U> b,
                                  kernel_image2d<V> out)
    {
      point2d<int> p = thread_pos2d();
      if (out.has(p))
        out(p) = a(p) * b(p) / 2.5f;
    }

    // Call
    dim3 dimblock(16, 16);
    dim3 dimgrid(divup(a.ncols(), 16), divup(a.nrows(), 16));
    simple_kernel<<<dimgrid, dimblock>>>(a, b, c);

Using expression templates:

    image2d<i_int4> a(200, 100);
    image2d<i_short4> b(a.domain());
    image2d<i_float4> c(a.domain());

    // Generates and calls simple_kernel.
    c = a * b / 2.5f;

    // BGR <-> RGB conversion
    a = aggregate<int>::run(get_z(a), get_y(a), get_x(a), 1.0f);

Pixel-Wise Kernel Generator: Implementation
- Expression templates build an expression tree: the type of a * b / 2.5f is div<mult<image2d<A>, image2d<B> >, float>.
- The recursive tree evaluation is inlined by the compiler:

    template <typename I, typename E>
    __global__ void pixel_wise_kernel(kernel_image2d<I> out, E e)
    {
      i_int2 p = thread_pos2d();
      if (out.has(p))
        out(p) = e.eval(p); // E holds the expression tree
    }

- operator= handles the kernel call:

    template <typename T>
    template <typename E>
    inline void image2d<T>::operator=(E& expr)
    {
      // [...]
      pixel_wise_kernel<<<dimgrid, dimblock>>>(*this, expr);
    }

Future Work
- Add convolutions to the expression templates.
- Image and video output.
- Documentation and release?

Dige: Fast Prototyping of Graphical Interfaces
Our problem with image processing experiments:
- Visualizing images is not straightforward: mln::display is convenient but not well suited to visualizing images in real time.
- Interacting with our code is hard. For example, tuning thresholds on the fly is hard or impossible without a GUI.
In short:
- Graphical interface creation
- Image visualization
- Event management
- No additional thread
- Non-intrusive!
- Based on the Qt framework

Create a GUI with Only One C++ Statement

    Window("my_window", 400, 400) <<=
      vbox_start
        << (hbox_start
              << ImageView("my_view") / 3
              << Slider("my_slider", 0, 42, 21) / 1
            << hbox_end) / 3
        << Label("Hello LRDE") / 1
      << vbox_end;

Display Your Images with Only One C++ Statement

    dg::Window("my_window", 400, 400) <<= ImageView("my_view");

    // Milena image
    mln::image2d<rgb8> lena = io::ppm::load<rgb8>("lena.ppm");

    ImageView("my_view") <<= dl() - lena - lena + lena;

... But Dige is not magic: it does not know anything about Milena image types. So we need to provide it with a bridge between its internal image type and mln::image2d<rgb8>:

    namespace dg
    {
      image<trait::format::rgb, unsigned char>
      adapt(const mln::image2d<mln::value::rgb8>& i)
      {
        return image<trait::format::rgb, unsigned char>
          (i.ncols() + i.border() * 2,
           i.nrows() + i.border() * 2,
           (unsigned char*) i.buffer());
      }
    }

Event Management
Wait for user interaction:

    wait_event(key_press(key_enter));

Create and watch event queues:

    event_queue q(PushButton("button").click() | key_release(key_x));
    while (true)
      if (!q.empty())
      {
        any_event e = q.pop_front();
        if (e == PushButton("button").click())
          std::cout << "button clicked" << std::endl;
        if (e == key_release(key_x))
          std::cout << "key x released" << std::endl;
      }

Event-based loops:

    for_each_event_until(PushButton("button").click(), key_release(key_enter))
      std::cout << "button clicked" << std::endl;

The Library
- Available at https://gitorious.org/dige
- Still under development: 7k lines of code
- Portable (depends on Qt and Boost)
- Should be released this year (2011)

Thank you for your attention. Any questions?

References
[Man11] Antoine Manzanera. Image representation and processing through multiscale local jet features.
[NVI] NVIDIA. NVIDIA CUDA C Programming Guide.