GP-GPU Programming

Agenda
- Why GPUs?
- Parallel Processing
- CUDA Example
- GPU Memory
- Advanced Techniques
- Issues
- Conclusion

Why GPUs?
- Moore's Law
- Concurrency, not speed
- Hardware support
- Memory bandwidth
- Programmability

Why GPUs? [Image from: CUDA Programming Guide]

Why GPUs?
- Raw computing power
- Cost ($ per FLOP)
- Ubiquitous
- Stable
- Graphics & games
- Practical (space, noise, power)
- Power (performance per watt)

Parallel Processing
- Shared memory: multi-core, multi-socket
- Distributed memory: clusters, LCFs
- Accelerators: GPUs, FPGAs
- Asynchronous operations

Parallel Processing Example: Reduction
- Each processor reduces a portion of the data
- Sync
- Repeat until one value remains

Parallel Processing
How is this done in different parallel architectures? Distributed memory? Shared memory? GPU? FPGA?

Parallel Processing
What is the problem with all of these solutions? Communication and memory.

CUDA (Compute Unified Device Architecture)
- Single Instruction Multiple Data (SIMD): each processor executes the same instruction, at the same time, but operates on different data
- Thousands of concurrent threads
- Hardware support
- Reduced-functionality cores; limited number of functional units
- Quickly evolving

GPU Architecture
- Cores are organized into SMs (streaming multiprocessors)
- Multiple SMs per GPU
- Certain elements are shared on an SM: special function units, double-precision units, register file, caches, texture elements
[Image from http://benchmarkreviews.com/index.php?option=com_content&task=view&id=440&itemid=63&limit=1&limitstart=4]

CUDA Threads & Threadblocks
- Individual execution units are called threads
- Threads are also grouped into warps and half-warps; a warp is 32 threads
- Threads are grouped into thread blocks
- Thread blocks are grouped into a grid
- Functions that execute on the GPU are called kernels
- No communication between thread blocks

CUDA Threads & Threadblocks [Image from: CUDA Programming Guide]

CUDA Threads & Threadblocks [Diagram: threads and thread blocks distributed across SMs (SM1-SM5)]

CUDA First Example
Vector Addition:

    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

CUDA First Example Revisited

    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }
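The slides show only the kernel; for completeness, here is a minimal host-side sketch for launching it (array sizes and initial values are illustrative assumptions, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        int N = 1 << 20;                      // chosen so N is a multiple of 256
        size_t size = N * sizeof(float);
        float *h_A = (float*)malloc(size);
        float *h_B = (float*)malloc(size);
        float *h_C = (float*)malloc(size);
        for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

        // Allocate device memory and copy the inputs over the PCI Express bus
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size);
        cudaMalloc(&d_B, size);
        cudaMalloc(&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // One thread per element, grouped into blocks of 256 threads
        int threadsPerBlock = 256;
        int blocksPerGrid = N / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }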

CUDA Second Example
The previous examples all used 1D thread blocks and a 1D grid; what about 2D data? [Image from: CUDA Programming Guide]

CUDA Second Example: Matrix Addition
Matrices are two-dimensional:

    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }
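Two-dimensional kernels like this are launched with a 2D configuration via dim3; a minimal sketch (the 16x16 block size is an assumption, not from the slides):

    // Hypothetical launch configuration for MatAdd
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);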

GPU Memory
- Probably the biggest difference from traditional CPUs
- High bandwidth, but only if accessed correctly
- Lots of restrictions
- May be one of the biggest obstacles to adoption

GPU Memory
Different regions: registers, global, shared, local, texture, constant

GPU Memory [Image from: http://people.sc.fsu.edu/~jburkardt/latex/fdi_2009/gpu_memory.png]

GPU Memory - Global
- Similar to RAM on the CPU
- Furthest memory from the cores; long memory access latency
- Data transferred to/from the host lives here
- Accesses should be regular (see the sketch below)
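As an illustration of "regular" access, this sketch contrasts a coalesced pattern with a strided one (the kernel names are made up; both assume the caller sizes the arrays to cover every index touched):

    // Coalesced: consecutive threads touch consecutive addresses, so a
    // warp's loads combine into one (or a few) memory transactions
    __global__ void copyCoalesced(const float* in, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses stride elements apart;
    // for stride > 1 a warp's loads splinter into many transactions
    __global__ void copyStrided(const float* in, float* out, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];
    }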

GPU Memory - Global [Table: number of memory transactions for aligned accesses, for each version of CUDA hardware. From the CUDA Programming Guide]

GPU Memory - Global [Table: number of memory transactions for aligned, non-sequential accesses, for each version of CUDA hardware. From the CUDA Programming Guide]

GPU Memory - Global [Table: number of memory transactions for misaligned, non-sequential accesses, for each version of CUDA hardware. From the CUDA Programming Guide]

GPU Memory - Registers
- On-chip
- Fast access, no latency
- Finite size
- Shares space with shared memory
- Mostly managed by the compiler

GPU Memory - Shared
- On-chip (per SM), very fast
- A user-managed cache
- Required below compute capability 1.2; explicit use is still recommended in later versions
- Shares space with the register file
- Perfect for repeatedly accessed data. Example?

GPU Memory - Shared Example: Matrix Multiplication [Image from CUDA Programming Guide]
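The classic shared-memory example is tiled matrix multiplication: each tile of A and B is loaded from global memory once and then reused TILE times from shared memory. A minimal sketch, assuming square N x N row-major matrices with N a multiple of TILE:

    #define TILE 16

    __global__ void MatMul(const float* A, const float* B, float* C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Each thread loads one element of each tile from global memory
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // Each value loaded above is read TILE times from shared memory
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = sum;
    }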

GPU Memory - Local
- A special name for certain values the compiler places in global memory:
  - values too big to fit in the register file
  - arrays whose size cannot be determined at compile time
  - spills when too many registers are used
- Extremely detrimental to performance

GPU Memory - Constant
- Fast, on-chip memory region for read-only values
- Small
- Coalescing is different: all threads must access the same value
- Useful for constants, or small read-only data (sketched below)
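A minimal sketch of constant memory in use (the symbol coeffs and the kernel scale are hypothetical names for illustration):

    // Read-only table, resident in the constant memory region
    __constant__ float coeffs[16];

    __global__ void scale(float* data, int which)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Every thread reads the same constant value, so the access is
        // served as a single broadcast from the constant cache
        data[i] *= coeffs[which];
    }

    // Host side: populate the constant symbol before launching
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));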

GPU Memory - Texture (Briefly)
- Comes from graphics operations
- Reads are cached, but the cache is not coherent with writes
- Spatial locality is important
- Certain hardware-assisted operations (e.g. interpolation)

GPU Memory - Latency
- Global memory accesses take several hundred cycles
- Once a warp makes a request to global memory, it is swapped out and the next warp is loaded in
- By the time all warps have made a request, the data for the first request will have arrived and the first warp can continue

GPU Memory - Latency
- The GPU device must be saturated
- There must be enough work to do to hide the latency

Advanced Techniques
- Atomic operations
- Memory-bound kernels
- Arithmetic intensity
- Redundant calculations
- Memory halos
- Index calculations
- Memory shape

Advanced Techniques - Atomic Operations
- Back to the reduction: every step cuts the data size by half
- Multiple threads write to the same location
- Atomics serialize execution; use with care

    // Thread zero of each block signals that it is done
    unsigned int value = atomicInc(&count, gridDim.x);
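As a fuller illustration, here is a sketch of a block-level reduction that combines a shared-memory tree reduction with a single atomic per block (a common pattern, not code from the original slides; atomicAdd on float requires compute capability 2.0 or later):

    // Sum-reduce n floats into *total; assumes *total starts at 0.0f
    __global__ void reduceSum(const float* in, float* total, int n)
    {
        __shared__ float partial[256];          // assumes 256 threads per block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction: every step cuts the active data size by half
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }

        // One atomic per block, instead of one per element, limits serialization
        if (tid == 0)
            atomicAdd(total, partial[0]);
    }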

Advanced Techniques - Metrics
Useful metrics for determining performance:
- FLOPs per read: how many times is each global memory value used?
- Bus utilization: memory transactions sometimes transmit more bytes than were requested; minimize the number of unused bytes crossing the bus

Advanced Techniques - Index Calculations
Recall the 2D case:

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

- Approximately 15 instructions for each index
- This is calculated for every thread
- For 1M data elements, that is 15M instructions just for indices

Advanced Techniques - Index Calculations
After the first index calculation (15 instructions), calculate another index, relative to the first, with a single instruction:

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i + BLOCK_DIM;

- Now two indices are calculated in 16 instructions
- 15M instructions drops to 8M
- Roughly a 47% reduction
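A hedged sketch of this pattern, where each thread handles two elements one block-width apart (the kernel name scaleTwo and the fixed BLOCK_DIM are illustrative assumptions, not from the original slides):

    #define BLOCK_DIM 256

    // Each block covers 2 * BLOCK_DIM elements; each thread scales two of them
    __global__ void scaleTwo(float* data, float factor)
    {
        // Full index calculation once (~15 instructions)...
        int i = blockIdx.x * (2 * BLOCK_DIM) + threadIdx.x;
        // ...second index derived from the first with a single add
        int j = i + BLOCK_DIM;
        data[i] *= factor;
        data[j] *= factor;
    }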

Issues
- PCI Express bus
- Memory access
- Bus utilization
- Arithmetic intensity
- Optimization is difficult

Conclusion
- Serious performance improvements are possible
- Requires certain data organizations and algorithms
- Fits into other parallel methodologies
- Non-trivial implementation
- Constantly evolving

Questions?