Optimization Techniques: Image Convolution. Udeepta D. Bordoloi December 2010

Similar documents
AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

GPU Architecture. Michael Doggett ATI

AMD APP SDK v2.8 FAQ. 1 General Questions

Introduction to GPU Architecture

GPU ACCELERATED DATABASES Database Driven OpenCL Programming. Tim Child 3DMashUp CEO

A Vision for Tomorrow s Hosting Data Center

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

NVIDIA GeForce GTX 580 GPU Datasheet

Introduction to GPGPU. Tiziano Diamanti

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

Radeon HD 2900 and Geometry Generation. Michael Doggett

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

THE FUTURE OF THE APU BRAIDED PARALLELISM Session 2901

Introduction to GPU Programming Languages

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

L20: GPU Architecture and Models

White Paper AMD PROJECT FREESYNC

OpenCL Programming for the CUDA Architecture. Version 2.3

THE AMD MISSION 2 AN INTRODUCTION TO AMD NOVEMBER 2014

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Ensure that the AMD APP SDK Samples package has been installed before proceeding.

Parallel Programming Survey

2

GPUs for Scientific Computing

Next Generation GPU Architecture Code-named Fermi

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

AMD Product and Technology Roadmaps

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Multi-Threading Performance on Commodity Multi-Core Processors

Introduction to OpenCL Programming. Training Guide

AMD Processor Performance. AMD Phenom II Processors Discrete Platform Benchmarks December 2008

White Paper DisplayPort 1.2 Technology AMD FirePro V7900 and V5900 Professional Graphics. Table of Contents

PHYSICAL CORES V. ENHANCED THREADING SOFTWARE: PERFORMANCE EVALUATION WHITEPAPER

Optimizing Code for Accelerators: The Long Road to High Performance

Innovation in Software Quality

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Let s put together a Manual Processor

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Initial Hardware Estimation Guidelines. AgilePoint BPMS v5.0 SP1

GPU Hardware Performance. Fall 2015

MS Exchange Server Acceleration

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Introduction to GPU hardware and to CUDA

GPU Parallel Computing Architecture and CUDA Programming Model

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware

Binary search tree with SIMD bandwidth optimization using SSE

Answering the Requirements of Flash-Based SSDs in the Virtualized Data Center

StorTrends RAID Considerations

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering

QLIKVIEW SERVER MEMORY MANAGEMENT AND CPU UTILIZATION

Delivering Accelerated SQL Server Performance with OCZ s ZD-XL SQL Accelerator

AMD FirePro Professional Graphics for CAD & Engineering and Media & Entertainment

Leveraging Aparapi to Help Improve Financial Java Application Performance

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Optimizing SQL Server AlwaysOn Implementations with OCZ s ZD-XL SQL Accelerator

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Evaluation of CUDA Fortran for the CFD code Strukti

BlackBerry Professional Software For Microsoft Exchange Compatibility Matrix January 30, 2009

GPGPU Computing. Yong Cao

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

HP Workstations graphics card options

The continuum of data management techniques for explicitly managed systems

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Practical Performance Understanding the Performance of Your Application

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Rethinking SIMD Vectorization for In-Memory Databases

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

ME964 High Performance Computing for Engineering Applications

Oracle Solaris Studio Code Analyzer

Introduction to GPU Computing

The Microsoft Windows Hypervisor High Level Architecture

White Paper AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1


Benchmarking Large Scale Cloud Computing in Asia Pacific

An Oracle Benchmarking Study February Oracle Insurance Insbridge Enterprise Rating: Performance Assessment

Autodesk Building Design Suite 2012 Standard Edition System Requirements... 2

NVIDIA GeForce Experience

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Autodesk Revit 2016 Product Line System Requirements and Recommendations

CUDA Basics. Murphy Stein New York University

Autodesk 3ds Max 2010 Boot Camp FAQ

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

Keys to node-level performance analysis and threading in HPC applications

Guidelines for Software Development Efficiency on the TMS320C6000 VelociTI Architecture

Family 10h AMD Phenom II Processor Product Data Sheet

BlackBerry Business Cloud Services. Version: Release Notes

Transcription:

Optimization Techniques: Image Convolution Udeepta D. Bordoloi December 2010

Contents AMD GPU architecture review OpenCL mapping on AMD hardware Convolution Algorithm Optimizations (CPU) Optimizations (GPU) 2 Optimization Techniques Image Convolution December 2010

ATI 5800 Series (Cypress) GPU Architecture Peak values: 2.72 Teraflops Single Precision 544 Gigaflops Double Precision 153.6 GB/s memory bandwidth 20 SIMDS Each SIMD has Local (shared) memory Cached (texture) memory 3 Optimization Techniques Image Convolution December 2010

SIMD Engine Each SIMD: Includes 16 VLIW Thread Processing Units, each with 5 scalar stream processing units + 32KB Local Data Share Has its own control logic and runs from a shared set of threads Has dedicated texture fetch unit w/ 8KB L1 cache 4 Optimization Techniques Image Convolution December 2010

Wavefront All threads in a Wavefront execute the same instruction 16 Thread Processing Units in a SIMD * 4 batches of threads = 64 threads on same instruction (Cypress) What if there is a branch? 1. First, full wavefront executes left branch, threads supposed to go to right branch are masked 2. Next, full wavefront executes right branch, left branch threads are masked OpenCL workgroup = 1 to 4 wavefronts on same SIMD Wavefront size less than 64 is inefficient! 5 Optimization Techniques Image Convolution December 2010

OpenCL View of AMD GPU Constants (cached global) Workgroups Local memory (user cache) Image cache L2 cache Global memory (uncached) 6 Optimization Techniques Image Convolution December 2010

OpenCL TM Memory space on AMD GPU Registers/LDS Private Memory Private Memory Private Memory Private Memory Thread Processor Unit Work Item 1 Work Item M Work Item 1 Work Item M SIMD Compute Unit 1 Compute Unit N Local Data Share Local Memory Local Memory Board Mem/Constant Cache Global / Constant Memory Data Cache Compute Device Board Memory Global Memory Compute Device Memory 7 Optimization Techniques Image Convolution December 2010

Contents AMD GPU architecture review OpenCL mapping on AMD hardware Convolution Algorithm Optimizations (CPU) Optimizations (GPU) 8 Optimization Techniques Image Convolution December 2010

Convolution algorithm Input image 3x3 Filter or mask (weights) Output Image value = Weighted Sum of neighboring Input Image values Output image 9 Optimization Techniques Image Convolution December 2010

Convolution algorithm FOR every pixel: float sum = 0; for (int r = 0; r < nfilterwidth; r++) { for (int c = 0; c < nfilterwidth; c++) { const int idxf = r * nfilterwidth + c; sum += pfilter[idxf]*pinput[idxinputpixel]; } } //for (int r = 0... poutput[ idxoutputpixel ] = sum; For a 3x3 filter: 9+9 reads (from input and filter) for every write (to output) For large filters such as 16x16, 256+256 reads for every write Notice read overlap between neighboring output pixels! 10 Optimization Techniques Image Convolution December 2010

OpenCL Convolution on multi-core CPU CPU implementation: Automatic multi-threading! One CPU-thread per CPU-core Highly efficient implementation Each CPU-thread runs one or more OpenCL work-groups Use large work-groups (max on CPU is 1024) Optimization 1 Unroll loops Pass #defines at run-time (compile option for OpenCL kernels) Use vector types to transparently enable SSE in the backend Can be faster than simple OpenMP multi-threading! Image Convolution Using OpenCL - A Step-by-Step Tutorial 11 Optimization Techniques Image Convolution December 2010

Time (lower is better) OpenCL Convolution on multi-core CPU 250% 200% 150% 100% OMP(4-core) Float4If Def_Float4 50% 0% 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 Filter Width 12 Optimization Techniques Image Convolution December 2010

Contents AMD GPU architecture review OpenCL mapping on AMD hardware Convolution Algorithm Optimizations (CPU) Optimizations (GPU) 13 Optimization Techniques Image Convolution December 2010

Convolution on GPU (naïve implementation) kernel void Convolve( const global float * pinput, global float * pfilter, global float * poutput, const int ninwidth, const int nfilterwidth) All data is in global (uncached) buffers Filter (float * pfilter) is 16x16 Output image (float * poutput) is 4096x4096 Input image (float * pinput) is (4096+15)x(4096+15) Work-group size is 8x8 to correspond to wavefront size of 64 on AMD GPUs Convolution time: 1511 ms on Radeon 5870 14 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 1) Previously, all data was in global (uncached) buffers Did not reuse common data between neighboring pixels Input items fetched per output pixel = 16x16 = 256 Can share input data within each work-group (SIMD) Preload input data into local memory (LDS), and then access it For a work-group of 8x8, if you pre-load input data into LDS Filter is 16x16 Output image (per work-group) is 8x8 = 64 Input image that is loaded onto LDS is (8+15)x(8+15) = 529 Input items fetched per output pixel = 529/64 = 8.3! Convolution time: 359 ms! 15 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 2) You may have deduced by now that if you have a larger work-group, there is more data reuse. Largest work-group size on CPU = 1024 = 32x32 Largest work-group size on GPU = 256 = 16x16 For a work-group of 16x16, if you pre-load input data into LDS Filter is 16x16 Output image (per work-group) is 16x16 = 256 Input image that is loaded onto LDS is (16+15)x(16+15) = 961 Input items fetched per output pixel = 961/256 = 3.7!! Convolution time: 182 ms!! Be aware: Increasing work-group size and increasing LDS memory usage will reduce the number of concurrent wavefronts running on a SIMD, which can lead to slower performance. There is a trade-off that may nullify the advantages, depending on the kernel. 16 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 3) Previously, we used local memory (LDS) You can imagine that to be a user-managed cache What if the developer does not want to manage the cache Use the hardware texture cache that is attached to each SIMD Why use texture cache instead of LDS? Easier and cleaner code Sometimes faster than LDS How to use the cache? OpenCL image buffers = cached OpenCL buffers = uncached For the previous example Workgroup size LDS Texture 8x8 359 ms 346 ms 16x16 182 ms 207 ms 17 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 4) Let us go back and start from the naïve implementation to check other possible optimizations. What about the filter array? (uncached in the naïve kernel) It is usually a small array that remains constant All work-items (threads) in the work-group (SIMD) access the same element of the array at the same instruction Options: Image (Texture) buffer or constant buffer Constant buffer: cached reads as all threads access same element kernel void Convolve(const global float * pinput, constant float * pfilter attribute ((max_constant_size(4096))), global float * poutput, ) Naïve implementation time: 1511 ms constant buffer optimization: 1375 ms 18 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 5) Let us again go back to the naïve implementation to check other possible optimizations. This time, we will try unrolling the inner loop. Unroll by 4 Reduces control flow overhead Fetch 4 floats at a time instead of a single float Since we are accessing uncached data (in the naïve kernel), fetching float4 instead of float will give us faster read performance. In general, accessing 128-bit data (float4) is faster than accessing 32-bit data (float). Naïve implementation time: 1511 ms Unroll-by-4 and float4 input buffer fetch: 401 ms Unroll-by-4 and float4 input + float4 filter fetch: 389 ms 19 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 6) What if we combine optimizations 4 and 5 to the naïve kernel? Mark the filter as constant float* buffer Unroll by 4 and float4 input buffer fetch Unroll-4, float4 input fetch + constant float* filter: 680 ms!! Why did the time increase?! Be aware: Using constant float* increases the ALU usage in the shader as the compiler has to add instructions to extract a 32-bit data from a 128-bit structure. Instead, use a constant float4* buffer Naïve implementation time: 1511 ms Unroll-by-4 and float4 input buffer fetch: 401 ms Unroll-by-4 and float4 input + float4 filter fetch: 389 ms Unroll-4, float4 input fetch + constant float4* filter: 346 ms 20 Optimization Techniques Image Convolution December 2010

Convolution on GPU (All combined) We can now combine the previous optimizations to the caching optimizations (LDS and textures) For a 16x16 work-group, same input and filter sizes as before: Optimization LDS Texture Naïve implementaion 1511 ms 1511 ms Data reuse (#1,2,3) 182 ms 207 ms constant float* filter(#4) 190 ms 160 ms Unroll4, float4 input (#5) 90 ms 130 ms Unroll4, float4 input, float4 filter (#5) 83 ms 127 ms All above, constant float* (#6 bad) 88 ms 158 ms All above, constant float4* (#6 good) 71 ms! 93 ms! 21 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 7) Pass #defines to the kernel at runtime: When an OpenCL application runs, it can Load binary kernels, or, compile kernels from source at runtime When compiling at runtime, runtime parameters (such as filter sizes, work-group sizes etc.) may be available. When possible, pass these values to the OpenCL compiler when compiling the kernel using the -D option. The GPU compiler is able to plug these known parameter values and produce highly optimized code for the GPU 22 Optimization Techniques Image Convolution December 2010

Convolution on GPU (Optimization 7) For a 16x16 work-group, same input and filter sizes as before: At runtime, if we pass the filter-width, work-group size etc values to the kernel compilation: Optimization LDS Texture Naïve implementaion 1511 ms 1511 ms Data reuse (#1,2,3) 69 ms 128 ms constant float* filter(#4) 25 ms 127 ms Unroll4, float4 input (#5) 68 ms 127 ms Unroll4, float4 input, float4 filter (#5) 66 ms 127 ms All above, constant float* (#6 bad) 26 ms 127 ms All above, constant float4* (#6 good) 25 ms!! 63 ms!! 23 Optimization Techniques Image Convolution December 2010

Questions and Answers Visit the OpenCL Zone on developer.amd.com http://developer.amd.com/zones/openclzone/ Tutorials, developer guides, and more OpenCL Programming Webinars page includes: Schedule of upcoming webinars On-demand versions of this and past webinars Slide decks of this and past webinars 24 Optimization Techniques Image Convolution December 2010

Disclaimer and Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, AMD Opteron, Radeon, FirePro, FireStream and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, Windows Vista, and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners. 25 Optimization Techniques Image Convolution December 2010