Image Processing & Video Algorithms with CUDA
1 Image Processing & Video Algorithms with CUDA. Eric Young & Frank Jargstorff. © 2008 NVIDIA Corporation.
2 Introduction. Image processing is a natural fit for data-parallel processing: pixels can be mapped directly to threads, and lots of data is shared between pixels. Advantage of CUDA vs. pixel-shader-based image processing: CUDA supports sharing image data with OpenGL and Direct3D applications.
3 Overview. CUDA for image and video processing: advantages and applications. Video processing with CUDA: CUDA video extensions API; YUV-to-ARGB CUDA kernel. Image processing: design implications; API comparison of CPU, 3D, and CUDA; CUDA for histogram-type algorithms; standard and parallel histogram; CUDA image transpose performance; waveform-monitor-type histogram.
4 Advantages of CUDA. Shared memory (a high-speed on-chip cache). A more flexible programming model: C with extensions vs. HLSL/GLSL. Arbitrary scatter writes: each thread can write more than one pixel. Thread synchronization.
5 Applications. Convolutions, median filter, FFT. Image & video compression: DCT, wavelet, motion estimation. Histograms, noise reduction, image correlation, demosaicing of CCD images (RAW conversion).
6 Shared memory. Shared memory is fast, the same speed as registers; it acts like a user-managed data cache. Limitations: 16KB per multiprocessor, enough to store 64 x 64 pixels at 4 bytes per pixel. Typical operation for each thread block: load an image tile from global memory into shared memory, synchronize threads, let threads operate on the pixels in shared memory in parallel, then write the tile back from shared to global memory. Global memory vs. shared: big potential for significant speedup, depending on how many times the data in shared memory can be reused.
7 Convolution performance. [Chart: G80 separable convolution performance, Mpixels/sec vs. filter size (pixels), CUDA vs. OpenGL.]
8 Separable convolutions. Filter coefficients can be stored in constant memory, and the image tile can be cached in shared memory. Each output pixel must have access to neighboring pixels within a certain radius R, so tiles in shared memory must be expanded with an apron that contains those neighboring pixels. Only the threads inside the apron write results; the remaining threads do nothing.
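The two-pass structure of a separable convolution can be sketched on the CPU; a minimal reference sketch with clamp-to-edge borders (the helper names are hypothetical, not from the SDK):

```c
#include <assert.h>

/* Read pixel (x, y) from a row-major w x h image, clamping to the edge. */
static float sample_clamped(const float *img, int w, int h, int x, int y)
{
    x = x < 0 ? 0 : (x >= w ? w - 1 : x);
    y = y < 0 ? 0 : (y >= h ? h - 1 : y);
    return img[y * w + x];
}

/* Separable convolution with a 1D kernel k of radius r (length 2r+1):
 * horizontal pass src -> tmp, then vertical pass tmp -> dst. */
void convolve_separable(const float *src, float *tmp, float *dst,
                        int w, int h, const float *k, int r)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float s = 0.0f;
            for (int d = -r; d <= r; d++)
                s += k[d + r] * sample_clamped(src, w, h, x + d, y);
            tmp[y * w + x] = s;
        }
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float s = 0.0f;
            for (int d = -r; d <= r; d++)
                s += k[d + r] * sample_clamped(tmp, w, h, x, y + d);
            dst[y * w + x] = s;
        }
}
```

The CUDA version does exactly this per tile, with each pass loading its tile plus apron into shared memory first.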
9 Tile apron. [Diagram: an image divided into tiles; each tile shown with its surrounding apron of neighboring pixels.]
10 Image processing with CUDA. How does image processing map to the GPU? Image tiles map to the grid of thread blocks; the large data and heavy memory bandwidth demands map to device memory; a 2D region of the image maps to shared memory (cached).
11 Define tile sizes.
    #define TILE_W   16
    #define TILE_H   16
    #define R        2                 // filter radius
    #define D        (R*2+1)           // filter diameter
    #define S        (D*D)             // filter size
    #define BLOCK_W  (TILE_W + (2*R))
    #define BLOCK_H  (TILE_H + (2*R))
12 Simple filter example.
    __global__ void d_filter(int *g_idata, int *g_odata,
                             unsigned int width, unsigned int height)
    {
        __shared__ int smem[BLOCK_W*BLOCK_H];
        int x = blockIdx.x*TILE_W + threadIdx.x - R;
        int y = blockIdx.y*TILE_H + threadIdx.y - R;
        // clamp to edge of image
        x = max(0, x);  x = min(x, width-1);
        y = max(y, 0);  y = min(y, height-1);
        unsigned int index  = y*width + x;
        unsigned int bindex = threadIdx.y*blockDim.x + threadIdx.x;
        // each thread copies its pixel of the block to shared memory
        smem[bindex] = g_idata[index];
        __syncthreads();
13 Simple filter example (cont.)
        // only threads inside the apron will write results
        if ((threadIdx.x >= R) && (threadIdx.x < (BLOCK_W-R)) &&
            (threadIdx.y >= R) && (threadIdx.y < (BLOCK_H-R)))
        {
            float sum = 0;
            for (int dy = -R; dy <= R; dy++) {
                for (int dx = -R; dx <= R; dx++) {
                    float i = smem[bindex + (dy*blockDim.x) + dx];
                    sum += i;
                }
            }
            g_odata[index] = sum / S;
        }
    }
14 Sobel edge detect filter. Two 3x3 filters detect horizontal and vertical change in the image, from which we compute the magnitude and direction of edges; both directions can be calculated in one single CUDA kernel.
    C_horizontal = [ -1 0 1 ; -2 0 2 ; -1 0 1 ]
    C_vertical   = [  1 2 1 ;  0 0 0 ; -1 -2 -1 ]
    Magnitude_Sobel = sqrt(G_horizontal^2 + G_vertical^2)
    Direction_Sobel = arctan(G_vertical / G_horizontal)
15-22 Sobel edge detect filter (animation). A 3x3 window of pixels for each thread; the window slides across the image, and at each position the thread evaluates C_horizontal and C_vertical to get G_horizontal and G_vertical, then computes Magnitude_Sobel = sqrt(G_horizontal^2 + G_vertical^2).
23 Fast box filter. Allows a box filter of any width at a constant cost. Rolling box filter: uses a sliding window, costing two adds and a multiply per output pixel (add the new pixel entering the window, subtract the pixel leaving it). An iterated box filter approximates a Gaussian blur. Using pixel shaders it is impossible to implement a rolling box filter, because each thread must write more than one pixel; CUDA allows executing rows/columns in parallel. Uses tex2D to improve read performance and simplify addressing.
24 Fast box filter. A separable, two-pass filter: first a row pass, then a column pass, from the source image (input) to the output result.
25 Fast box filter (row pass, first pixel). Assume radius r: each thread works along a row of pixels and sums (2r+1) pixels, then averages those (2r+1) pixels and writes the result to destination (i, j).
26 Fast box filter (row pass, next pixel). Take the previous sum, subtract the pixel leaving the window at (i-(r+1), j), and add the pixel entering at (i+r, j). Average the (2r+1) pixels and output to (i, j).
27 Fast box filter (finish row pass). Each thread continues to iterate until the entire row of pixels is done, averaging and writing to (i, j) in the destination image. A single thread writes the entire row of pixels. Note: these writes are not coalesced. Solution: use shared memory to cache results per warp, call __syncthreads(), then copy to global memory to achieve coalescing.
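The row pass above can be sketched on the CPU; a minimal sketch for one row with clamp-to-edge borders (the function name is hypothetical):

```c
#include <assert.h>

/* Rolling 1D box filter over one row of length w with radius r: after
 * the first window is summed, each further output costs one add, one
 * subtract and one multiply, independent of r. Borders clamp to edge. */
void box_row(const float *in, float *out, int w, int r)
{
    float scale = 1.0f / (2 * r + 1);
    float sum = 0.0f;

    /* initial window centered on pixel 0 (left side clamped) */
    for (int d = -r; d <= r; d++) {
        int x = d < 0 ? 0 : (d >= w ? w - 1 : d);
        sum += in[x];
    }
    out[0] = sum * scale;

    for (int i = 1; i < w; i++) {
        int enter = i + r;      /* pixel entering the window */
        int leave = i - r - 1;  /* pixel leaving the window  */
        sum += in[enter >= w ? w - 1 : enter];
        sum -= in[leave < 0 ? 0 : leave];
        out[i] = sum * scale;
    }
}
```

In the CUDA version one thread runs this loop per row (or per column in the second pass), with the per-warp shared-memory staging used to make the writes coalesced.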
28 Column filter pass (final). Threads (i, j) read from global memory and sum along the columns of the row-pass image, giving coalesced reads. Compute each pixel sum from the previous pixel's sum (subtract the leaving pixel, add the entering pixel), average the result, and output to (i, j), giving coalesced writes.
29 Video processing with CUDA. The GPU has different engines: the video processor (decodes video bitstreams), CUDA (image and video processing), and the DMA engine (transfers data between host and GPU). CUDA enables developers to access these engines.
30 CUDA video extensions. NVCUVID: the video extension for CUDA. Access to the video decoder core requires VP2 (GPUs newer than G80). Similar to the DXVA API, but platform/OS independent. Interoperates with CUDA (surface exchange) and with OpenGL and DirectX. CUDA SDK sample: cudaVideoDecode.
31 Video processor (VP). The VP is a dedicated video-decode engine on NVIDIA GPUs. Supports MPEG-1, MPEG-2, and H.264. Can operate in parallel with the GPU's DMA engines and 3D graphics engine. Very low power.
32 YUV to RGB conversion. The video processor decodes directly to an NV12 surface (4:2:0) that can be mapped directly to a CUDA surface: the Y samples (bytes) are packed together, followed by interleaved Cb, Cr samples (bytes) sub-sampled 2x2. A CUDA kernel then performs the YUV-to-RGB conversion as a 3x3 matrix transform: [R, G, B]^T = M * [Y', Cb', Cr']^T.
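The NV12 addressing and the matrix transform can be sketched on the CPU. The function name is hypothetical, and the coefficients below are the common BT.601 "video range" constants, assumed rather than taken from the slide:

```c
#include <assert.h>

typedef struct { float r, g, b; } RGBf;

/* Convert pixel (x, y) of a w x h NV12 buffer to RGB: a w*h luma plane
 * is followed by an interleaved Cb/Cr plane sub-sampled 2x2.
 * Coefficients are standard BT.601 video-range values (an assumption). */
RGBf nv12_to_rgb(const unsigned char *nv12, int w, int h, int x, int y)
{
    unsigned char Y = nv12[y * w + x];
    const unsigned char *chroma = nv12 + w * h;   /* Cb/Cr plane        */
    int ci = (y / 2) * w + (x / 2) * 2;           /* interleaved pairs  */
    float cb = (float)chroma[ci]     - 128.0f;
    float cr = (float)chroma[ci + 1] - 128.0f;
    float yv = 1.164f * ((float)Y - 16.0f);

    RGBf o;
    o.r = yv + 1.596f * cr;
    o.g = yv - 0.392f * cb - 0.813f * cr;
    o.b = yv + 2.017f * cb;
    return o;
}
```

The CUDA kernel on the next slide does the same multiply-adds, with the matrix held in constant memory so a hue rotation can be folded into the coefficients.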
33 YUV to RGB CUDA kernel.
    __global__ void YUV2RGB(uint *yuvi, float *R, float *G, float *B)
    {
        float luma, chromaCb, chromaCr;

        // Prepare for hue adjustment (10-bit YUV to RGB)
        luma     = (float)yuvi[0];
        chromaCb = (float)((int)yuvi[1] - 512.0f);
        chromaCr = (float)((int)yuvi[2] - 512.0f);

        // Convert YUV to RGB with hue adjustment
        *R = MUL(luma,     constHueColorSpaceMat[0]) +
             MUL(chromaCb, constHueColorSpaceMat[1]) +
             MUL(chromaCr, constHueColorSpaceMat[2]);
        *G = MUL(luma,     constHueColorSpaceMat[3]) +
             MUL(chromaCb, constHueColorSpaceMat[4]) +
             MUL(chromaCr, constHueColorSpaceMat[5]);
        *B = MUL(luma,     constHueColorSpaceMat[6]) +
             MUL(chromaCb, constHueColorSpaceMat[7]) +
             MUL(chromaCr, constHueColorSpaceMat[8]);
    }
34 NVCUVID API. Five entry points for the decoder object: cuvidCreateDecoder(...), cuvidDestroyDecoder(...), cuvidDecodePicture(...), cuvidMapVideoFrame(...), cuvidUnmapVideoFrame(...). The sample application also uses a helper library for parsing video streams, provided in binary form as part of the SDK.
35 cudaVideoDecode demo.
36 Image processing (contd.) Image processing: CPU vs. 3D APIs vs. CUDA; design implications. CUDA for histogram-type algorithms: standard and parallel histogram; CUDA image transpose performance; waveform-monitor-type histogram.
37 API comparison.
                   CPU code                     3D API (DX/GL)               CUDA
    Image data     Heap allocated               Texture/FB                   CUDA 2D allocate
    Alignment      Matters                      n/a                          Matters
    Cached         Yes                          Yes                          No
    Access (r/w)   Random/random                Random/fixed                 Random/random
    Access order   Matters (general-purpose     Minimized (2D caching        Matters (coalescing;
                   caches)                      schemes)                     CUDA array -> texture HW)
    In-place       Good                         n/a                          Doesn't matter
    Threads        Few per image (programmer's  One per pixel (consequence   One per few pixels
                   decision; typically one per  of using pixel shaders)      (programmer's decision;
                   tile, one tile per core)                                  typically one per input
                                                                             or output pixel)
    Data types     All                          32-bit float (half-float     All (double precision,
                                                maybe)                       native instructions not
                                                                             for all though)
    Storage types  All                          Tex/FB formats               All
38 Histogram. An extremely important algorithm: histogram data is used in a large number of compound algorithms, e.g. color and contrast improvements, tone mapping, color re-quantization/posterize, and device calibration (scopes, see below).
39 Histogram performance. The 3D API is not suited for histogram computation; the CUDA histogram is far faster than previous GPGPU approaches:
              64 bins     256 bins
    CUDA¹     65 MB/s     676 MB/s
    R2VB²     .8 MB/s     4.6 MB/s
    CPU³      86 MB/s     96 MB/s
    ² "Efficient Histogram Generation Using Scattering on GPUs", T. Scheuermann, AMD Inc., I3D 2007. ³ Intel CPU.
40 Histogram algorithm. A histogram is the distribution of intensities/colors in an image. Standard algorithm:
    for all i in [0, max_luminance]: h[i] = 0;
    for all pixels in image: ++h[luminance(pixel)];
How do we parallelize this? (Example use: the Reinhard HDR tone-mapping operator.)
41 Histogram parallelization. Subdivide the for-all-pixels loop: each thread works on a block of pixels (in the extreme, one thread per pixel). This needs ++h[luminance(pixel)] to be atomic (global atomics require compute capability >= 1.1). Breaking up image I into sub-images, I = UNION(A, B): H(UNION(A, B)) = H(A) + H(B), i.e. the histogram of a concatenation is the sum of the histograms.
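The decomposition H(UNION(A, B)) = H(A) + H(B) is what makes per-thread sub-histograms work; a CPU sketch of the idea (hypothetical function name, "threads" simulated by a loop over stripes):

```c
#include <assert.h>

#define NBINS 64
#define NTHREADS 4

/* Parallel-histogram idea: each worker t accumulates a private
 * sub-histogram over its stripe of pixels; the sub-histograms are then
 * consolidated by summing, bin by bin. */
void histogram64(const unsigned char *pix, int n, unsigned int *h)
{
    unsigned int sub[NTHREADS][NBINS] = {{0}};

    /* the "for all pixels" loop, split across NTHREADS workers */
    for (int t = 0; t < NTHREADS; t++)
        for (int i = t; i < n; i += NTHREADS)
            ++sub[t][pix[i] >> 2];          /* 8-bit value -> 64 bins */

    /* reduction: consolidate the sub-histograms */
    for (int b = 0; b < NBINS; b++) {
        h[b] = 0;
        for (int t = 0; t < NTHREADS; t++)
            h[b] += sub[t][b];
    }
}
```

On the GPU the private histograms live in shared memory and the reduction itself runs in parallel, which is exactly the memory-budget trade-off the next slide works through.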
42 Better parallel histogram. Have one histogram per thread (watch the memory consumption!) and consolidate the sub-histograms in parallel (parallel reduction). In CUDA: histograms in shared memory, 64 bins * 256 threads = 16 kByte (8-bit bins). This approach is not feasible for more than 64 bins.
43 >64-bin parallel histogram. Compute capability 1.2 has shared-memory atomic operations. Victor Podlozhnyuk: manual shared-memory per-warp atomics (CUDA SDK histogram256 sample). Have groups of threads work on one sub-histogram, then reduce as before.
44 Real-world problems with histograms. My attempt to implement a waveform monitor for video using CUDA: one histogram per column of the input video frame. To achieve good performance, various memory-access-related issues need to be solved.
45 Accessing 8-bit pixels. The input video produces the Y channel (luma) as planar data in row-major form. Coalescing: 16 threads access 32-bit words in subsequent locations.
46 SMEM exhausted! 16xM thread blocks => 64 columns processed by each block => 64 histograms in smem: 64 * 256 * 4 = 64 kByte. The maximum is 16 kByte!
47 Thread-block dimensions. 16xM TB dimensions are desirable but impossible for the Y-surface read; Nx1 TB dimensions are desirable for the manual atomics in the waveform code. Solution: fast-transpose the input image! Also: the result histograms can then be copied efficiently (shared -> global) horizontally.
48 Image transpose. Problem: writing a fast transpose vs. writing a transpose fast. Naïve implementation:
    __global__ void kernel(char *pi, int si, char *po, int so, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (y < w && x < h)
            OUTPUT_PIXEL(y, x) = INPUT_PIXEL(x, y);
    }
49 Problems with naïve code. Memory reads AND writes are not coalesced, because we read/write bytes, not words. Bandwidth: on the order of a GByte/s (of ~140 max). Idea: we need to read (at least) 16 words in subsequent threads, and write (at least) 16 words in subsequent threads.
50 Improved transpose idea. Subdivide the image into micro-blocks of 4x4 pixels (16 bytes). Use thread blocks of 16x16 threads, with each thread operating on one micro-block. Shared memory for the micro-blocks: 16 x 16 x 4 x 4 = 4 kByte.
51 Basic algorithm. Each thread reads its micro-block into shared memory, transposes its micro-block, then writes its micro-block back into global memory.
52 Reading and writing micro-blocks. Read one row of a micro-block via a single unsigned int rather than 4x unsigned char.
53 16x16 thread blocks. One (16-thread) warp reads one row of micro-blocks (threads t0 ... t15). One 16x16 block of threads deals with a 64x64-pixel region (8-bit luminance pixels).
54 Pseudo code. Assume a single 64x64 image.
    kernel(...)
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        readMicroBlock(image, i, j, shared, i, j);
        transposeMicroBlock(shared, i, j);
        writeMicroBlock(shared, i, j, image, j, i);
    }
Problem: non-coalesced writes!
55 Write coalescing for transpose. [Diagram: with readMicroBlock(image, i, j, shared, i, j) and writeMicroBlock(shared, i, j, image, j, i), the micro-blocks written by consecutive threads t0, t1, ... land in non-adjacent locations of the output image.]
56 Coalesced writes. Simple fix:
    kernel(...)
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        readMicroBlock(image, i, j, shared, i, j);
        transposeMicroBlock(shared, i, j);
        __syncthreads();
        writeMicroBlock(shared, j, i, image, i, j);
    }
We must __syncthreads() because thread T(i,j) now writes data produced by T(j,i).
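The net effect of the micro-block scheme can be sketched on the CPU; a minimal sketch (hypothetical function name, w and h assumed multiples of 4):

```c
#include <assert.h>

/* Transpose a w x h 8-bit image by processing it as 4x4 micro-blocks:
 * element (x, y) of the tile at (bx, by) lands at (y, x) of the tile at
 * (by, bx). Grouping work into 16-byte tiles is what lets the CUDA
 * version batch its global reads and writes into contiguous word runs. */
void transpose_microblocks(const unsigned char *src, unsigned char *dst,
                           int w, int h)
{
    for (int by = 0; by < h; by += 4)
        for (int bx = 0; bx < w; bx += 4)
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    /* dst is h pixels wide (the transposed image) */
                    dst[(bx + x) * h + (by + y)] = src[(by + y) * w + (bx + x)];
}
```

The result is an ordinary transpose; only the traversal order changes, and that order is what the GPU exploits for coalescing.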
57 Transpose performance. [Table: throughput in GB/s for CUDA naïve, CUDA optimized, and IPPI at image sizes 256x256, 512x512, 1024², 2048², and 4096².] GPU: GeForce GTX 280 (GT200). CPU: Intel Core 2 Duo.
58 Summary. Memory access is crucial for CUDA performance. Use shared memory as a user-managed cache. 8-bit images are especially tricky. An extra pass may improve overall performance.
59 Waveform demo.
60 Questions? Eric Young, Frank Jargstorff.
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
QCD as a Video Game?
QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture
Lecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles [email protected] Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
Clustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu [email protected] Bin Zhang [email protected] Meichun Hsu [email protected] ABSTRACT In this paper, we report our research on using GPUs to accelerate
Cloud Gaming & Application Delivery with NVIDIA GRID Technologies. Franck DIARD, Ph.D. GRID Architect, NVIDIA
Cloud Gaming & Application Delivery with NVIDIA GRID Technologies Franck DIARD, Ph.D. GRID Architect, NVIDIA What is GRID? Using efficient GPUS in efficient servers What is Streaming? Transporting pixels
Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze
Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze Whitepaper December 2012 Anita Banerjee Contents Introduction... 3 Sorenson Squeeze... 4 Intel QSV H.264... 5 Power Performance...
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
How To Run A Factory I/O On A Microsoft Gpu 2.5 (Sdk) On A Computer Or Microsoft Powerbook 2.3 (Powerpoint) On An Android Computer Or Macbook 2 (Powerstation) On
User Guide November 19, 2014 Contents 3 Welcome 3 What Is FACTORY I/O 3 How Does It Work 4 I/O Drivers: Connecting To External Technologies 5 System Requirements 6 Run Mode And Edit Mode 7 Controls 8 Cameras
Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group
Shader Model 3.0 Ashu Rege NVIDIA Developer Technology Group Talk Outline Quick Intro GeForce 6 Series (NV4X family) New Vertex Shader Features Vertex Texture Fetch Longer Programs and Dynamic Flow Control
Sélection adaptative de codes polyédriques pour GPU/CPU
Sélection adaptative de codes polyédriques pour GPU/CPU Jean-François DOLLINGER, Vincent LOECHNER, Philippe CLAUSS INRIA - Équipe CAMUS Université de Strasbourg Saint-Hippolyte - Le 6 décembre 2011 1 Sommaire
Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis Martínez, Gerardo Fernández-Escribano, José M. Claver and José Luis Sánchez
Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis artínez, Gerardo Fernández-Escribano, José. Claver and José Luis Sánchez 1. Introduction 2. Technical Background 3. Proposed DVC to H.264/AVC
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
U N C L A S S I F I E D
CUDA and Java GPU Computing in a Cross Platform Application Scot Halverson [email protected] LA-UR-13-20719 Slide 1 What s the goal? Run GPU code alongside Java code Take advantage of high parallelization Utilize
Application. EDIUS and Intel s Sandy Bridge Technology
Application Note How to Turbo charge your workflow with Intel s Sandy Bridge processors and chipsets Alex Kataoka, Product Manager, Editing, Servers & Storage (ESS) August 2011 Would you like to cut the
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Outline. srgb DX9, DX10, XBox 360. Tone Mapping. Motion Blur
Outline srgb DX9, DX10, XBox 360 Tone Mapping Motion Blur srgb Outline srgb & gamma review Alpha Blending: DX9 vs. DX10 & XBox 360 srgb curve: PC vs. XBox 360 srgb Review Terminology: Color textures are
GPU Architecture. Michael Doggett ATI
GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
