Texture Cache Approximation on GPUs




Texture Cache Approximation on GPUs
Mark Sutherland, Joshua San Miguel, Natalie Enright Jerger
{suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca

Our Contribution
[Diagram: GPU cores with private caches, an on-chip network with shared caches, and main memory]
Problem: high memory latency requires thread swapping.

Our Contribution
[Diagram: the same memory hierarchy, with the texture cache highlighted]
Avoid main memory accesses by generating approximate values on-chip.

Load Value Approximation
Figure from "Load Value Approximation", Joshua San Miguel, Mario Badr, Natalie Enright Jerger [MICRO 14].

Goal: Use off-the-shelf GPU hardware to implement massively parallel value approximation.

What is a Texture?
Images are composed of polygons, which the GPU fills using shaders. Example: rendering a brick wall. Without a texture, the GPU can fill a polygon only with a solid colour or with a gradient spanning from one vertex to another. A texture lets you draw the whole wall with a single polygon, instead of filling N polygons, one per brick.
[Comparison: single polygon (rectangle), multi-polygon solid, single polygon with texture]
Source: http://upload.wikimedia.org/wikipedia/commons/2/28/-_Brickwall_01_-.jpg (Creative Commons)

Texture Hardware
Each streaming multiprocessor (SM, i.e. core) has a 12 kB dedicated texture cache and four dedicated fetch/interpolation units. Texture unit features: interpolating between data values, wrapping out-of-bounds indexes. Caveat: texture memory is read-only.
Figure source: M. Doggett, "Texture Caches", IEEE Micro, 32(3):136-141, May 2012.
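The two texture-unit features named above (interpolation between data values and wrapping of out-of-bounds indexes) can be sketched as a CPU-side Python model. This is an illustration only, not the CUDA texture API; the function name and signature are hypothetical.

```python
def texture_fetch(texels, x, wrap=True, interpolate=True):
    """CPU-side model of a 1D texture fetch: wrap addressing plus linear filtering."""
    n = len(texels)
    if wrap:
        x = x % n                  # out-of-bounds coordinates wrap around
    lo = int(x) % n                # texel at or below x
    if not interpolate:
        return texels[lo]
    hi = (lo + 1) % n              # neighbouring texel, wrapping at the edge
    frac = x - int(x)              # fractional part is the blend weight
    return (1 - frac) * texels[lo] + frac * texels[hi]

# Fetching between texels 1 and 2 blends their values; coordinate 5 wraps to 1.
print(texture_fetch([0.0, 10.0, 20.0, 30.0], 1.5))  # 15.0
print(texture_fetch([0.0, 10.0, 20.0, 30.0], 5.0))  # 10.0
```

On real hardware both operations happen in the dedicated fetch/interpolation units at no extra instruction cost, which is what makes them attractive for approximation.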

Texture Cache Approximation
1. Analyze the data values read by each GPU thread, and build a training set from repetitive patterns in this data.
2. Before GPU execution, load the texture cache with this training set.
3. Replace memory accesses with approximations derived from the texture cache.

Cache Contents
We want our approximations to be compact, so that they fit in the texture cache, and portable across many value patterns. Each approximation is the sum of the last data value and an approximate delta read from the texture cache:
    Xapx = Xlast + Dapx
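The formula Xapx = Xlast + Dapx can be sketched as a tiny Python model. The table, index function, and all names here are hypothetical stand-ins; on the real hardware the delta table is resident in the texture cache.

```python
def approximate_load(x_last, d_last, delta_table, index_of):
    """Model of Xapx = Xlast + Dapx: the approximate delta comes from a small table."""
    d_apx = delta_table[index_of(d_last)]  # texture-cache lookup on real hardware
    return x_last + d_apx

# Hypothetical 2-entry table: negative last deltas map to index 0, others to index 1.
table = [1, -1]
index_of = lambda d: 0 if d < 0 else 1
print(approximate_load(3, -2, table, index_of))  # 3 + table[0] = 4
```

Keying the table on the last delta rather than on absolute values is what keeps it both compact and portable: the same few entries serve many different value streams.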

Training Set Generation
Example: image access pattern.
1. Upon accessing P4, the thread inspects its LHB and calculates the last delta: Dl = -2.
2. The baseline code then reads P4 from memory (P4 = 4) and calculates the current delta: Dc = 1.
3. Output the pair to the trace: [Dl, Dc] = [-2, 1].
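The three steps above can be sketched in Python. The input values are hypothetical, chosen so that the emitted pair matches the slide's example (P4 = 4, Dl = -2, Dc = 1).

```python
def build_trace(values):
    """Emit [Dl, Dc] pairs from a stream of exact data values read by one thread."""
    trace = []
    lhb = []                            # local history buffer: recent exact values
    for v in values:
        if len(lhb) >= 2:
            d_last = lhb[-1] - lhb[-2]  # step 1: last delta, from the LHB
            d_curr = v - lhb[-1]        # step 2: current delta, after reading v
            trace.append((d_last, d_curr))  # step 3: output the pair to the trace
        lhb.append(v)
    return trace

# Hypothetical pixel values ending in P4 = 4.
print(build_trace([5, 3, 4]))  # [(-2, 1)]
```

Aggregating these (Dl, Dc) pairs over many threads is what exposes the repetitive delta patterns the texture cache is later loaded with.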

Kernel Code Transformation
    foreach (pixel in 4x4 grid) {        // characteristic loop (per thread)
        foreach (neighbour pixel) {
            … = a[neighbor];             // loop body (exact data)
        }
        b[pix] = …;
    }

Kernel Code Transformation
    foreach (pixel in 4x3 grid) {             // characteristic loop (per thread)
        … = a[neighbor];
        updatelhb(…);                         // loop body (exact, with LHB update)
    }
    foreach (pixel in last row) {
        … = LHB[0] + texture[getindex()];     // epilogue loop (with approximation)
        updatelhb(…);
    }
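The transformed structure, exact loop over the first rows followed by an approximating epilogue over the last row, can be modelled on the CPU in Python. The input data and the trivial one-entry delta table are hypothetical; the real code is a CUDA kernel and the table lives in the texture cache.

```python
def transformed_kernel(a, width, rows, delta_table, index_of):
    """Exact loop over the first rows-1 rows, epilogue approximating the last row."""
    out, lhb = [], []
    for p in range((rows - 1) * width):        # characteristic loop: exact loads
        out.append(a[p])
        lhb.append(a[p])                       # updatelhb(...)
    for p in range(width):                     # epilogue loop: no memory reads
        d_last = lhb[-1] - lhb[-2]             # delta of the two newest LHB entries
        v_apx = lhb[-1] + delta_table[index_of(d_last)]  # LHB[0] + texture[getindex()]
        out.append(v_apx)
        lhb.append(v_apx)
    return out

# A 3x4 raster whose values step by 1: the one-entry table {delta: 1} reproduces
# the last row exactly, so the approximate output equals the exact data here.
print(transformed_kernel(list(range(12)), 4, 3, [1], lambda d: 0))
```

The key point the transformation preserves: every load in the characteristic loop stays exact and feeds the LHB, so only the epilogue's loads are replaced.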

Online Approximations
    … = LHB[0] + texture[getindex(…)];
[Diagram: texture cache contents (index → value: 0 → 1, 1 → 4, 2 → 8) and the thread's LHB, ordered by age]
In-core activity for one thread:
1. The thread reaches the texture reference.
2. Calculate the last delta from the LHB: Dlast = -2.
3. Normalize Dlast to a cache index: index = 0.
4. Access texture[0].
5. Return Dapx = 1; the approximate value is LHB[0] + Dapx = 0 + 1.
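The in-core sequence above can be replayed in Python with the slide's values: a newest LHB entry of 0, a last delta of -2, and a delta table whose index 0 holds 1. The normalize function is a hypothetical stand-in for getindex().

```python
def online_approx(lhb, delta_table, normalize):
    """One in-core approximation, mirroring: = LHB[0] + texture[getindex()]."""
    d_last = lhb[0] - lhb[1]     # last delta from the two newest LHB entries
    idx = normalize(d_last)      # getindex(): map the delta to a cache index
    d_apx = delta_table[idx]     # texture[idx] on real hardware
    return lhb[0] + d_apx        # Xapx = Xlast + Dapx

# LHB newest-first [0, 2] gives Dlast = -2, which normalizes to index 0;
# texture[0] = 1, so the approximate value is 0 + 1 = 1.
print(online_approx([0, 2], [1, 4, 8], lambda d: 0 if d < 0 else 1))  # 1
```

Everything in this path is arithmetic plus one texture-cache hit, which is why the approximation costs far less than a main memory access.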

Evaluation
Commodity hardware: NVIDIA GTX 780 GPU (Kepler microarchitecture).
Image blur kernel derived from the San Diego Vision Benchmark Suite [Venkata, IISWC 09].
We use the CUDA Toolkit profiler to measure kernel runtime (cycles) and texture cache hit rate.
Error metric: mean pixel difference [Samadi, MICRO 13].

Kernel Runtime
[Chart: normalized runtime (lower is better), roughly 0.80 to 1.05, for baseline, texture-20%, texture-40%, texture-60%]
Each texture-X% configuration replaces X% of the global memory reads (the baseline has 5). Texture cache hit rate: > 99% in all cases.

Image Comparison
[Side by side: exact vs. approximate output. Error: 0.4%]
Image source: Nature stock footage archive. http://downloadnatureclip.blogspot.ca/p/download-links.html. Accessed: 2015-03-28.

Error Evolution
[Plot: error (mean pixel difference) vs. frame, for frames 1-16]
We reused the same training set (built from frame 1) for all 16 images in a video sequence, and evaluated the error with 40% of loads replaced.

Future Work
Evaluate applications from other domains: machine learning, physics and fluid simulations, data queries.
Can we eliminate the need for training sets?
Improve speedup on floating-point benchmarks.

Conclusion: under-utilized texture hardware on GPUs can accelerate kernel execution through value approximation.

Questions?

Comparison to Reduced Blur
[Side by side: blur with large radius vs. small radius]
Image source: Nature stock footage archive. http://downloadnatureclip.blogspot.ca/p/download-links.html. Accessed: 2015-03-28.

Approximate Value Computing
1. Certain applications are robust to inexact data values: data mining and pattern recognition [Chippa, DAC 13].
2. Where can these values come from? Reduced-voltage DRAM [Liu, ASPLOS 12]; kernels edited to remove synchronization [Samadi, MICRO 13]; load value approximation [San Miguel, MICRO 14].

Why Approximate?
1. The GPU memory architecture has drawbacks similar to a CPU's.
2. Modern GPUs choose to hide latency with thread swapping and context storage. We could easily put these transistors to better use!

Training Set Generation
1. Give every thread its own local history buffer (LHB) in shared memory.
2. Upon every memory access, output the thread block's previous and next deltas to a trace.
This lets us analyze different value patterns and generate approximations that are accurate for many thread blocks.
[Diagram: shared memory holding a private LHB per thread, with older values toward the tail]

Kernel Code Transformation
    foreach (pixel in 4x3 grid) {            // characteristic loop (per thread)
        foreach (neighbour pixel) {
            … = a[np];
            updatelhb(a[np]);                // loop body (exact, with LHB update)
        }
        b[p] = …;
    }
    foreach (pixels left in 4x1 stripe) {
        …                                    // epilogue loop body
    }

Kernel Code Transformation
    foreach (pixel in 4x3 grid) {
        …
        updatelhb(a[np]);                    // loop body (exact, with LHB update)
    }
    foreach (pixels left in 4x1 stripe) {
        foreach (neighbour pixel) {
            … = gettexture(my_tex, getdeltafromlhb(…));
            updatelhb(…);                    // epilogue loop body: loads replaced
        }                                    // with texture references
        b[p] = …;
    }

Processing Continuum
[Diagram: a multi-core CPU (cores, caches, on-chip network, shared caches) alongside an accelerator [1]]
1. Source: www.newsroom.intel.com/community/intel_newsroom

GPU Niche
GPGPU characteristics: massively multi-core; trades latency for throughput; very few optimizations for single-thread performance.
Source: http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf