CUDA Basics
Murphy Stein
New York University
Overview
- Device Architecture
- CUDA Programming Model
- Matrix Transpose in CUDA
- Further Reading
What is CUDA?
CUDA stands for Compute Unified Device Architecture. It is two things:
1. A device architecture specification
2. A small extension to C: new syntax, built-in variables, restrictions, and libraries
Device Architecture: Streaming Multiprocessor (SM)
- 1 SM contains 8 scalar cores
- Up to 8 cores can run simultaneously
- Each core executes the identical instruction, or sleeps
- The SM schedules instructions across cores with zero overhead
- Up to 32 threads may be scheduled at a time, called a warp, but at most 24 warps are active in 1 SM
- Thread-level memory sharing is supported via Shared Memory
- Register memory is local to each thread, and is divided amongst all blocks on the SM
[Diagram: an SM with instruction fetch/dispatch, Streaming Cores #1-#8, 16 KB Shared Memory, 8 KB registers, 5-8 KB texture memory cache, and 8 KB constant memory cache]
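A worked consequence of these limits (an added check, not stated on the slide): with at most 24 active warps of 32 threads each, one SM can keep 24 x 32 = 768 threads resident at once, however those threads are partitioned into blocks.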
Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks
[Diagram: the same eight-block kernel grid running on a device that executes two blocks at a time and on a device that executes four blocks at a time]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution using a prioritized scheduling policy
- All threads in a warp execute the same instruction when it is selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
- If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
[Diagram: the multithreaded warp scheduler interleaving instructions from warps 1, 3, and 8 over time]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
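The 13-warp figure can be checked with a quick calculation (an illustrative derivation under the slide's assumptions, not text from the slide): between global memory accesses a warp executes 4 instructions, each taking 4 cycles to dispatch, i.e. 4 x 4 = 16 cycles of useful work; covering a 200-cycle latency therefore requires 200 / 16 = 12.5, rounded up to 13, warps' worth of ready work.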
Device Architecture
[Diagram: the host feeds a Block Execution Manager on the GPU, which distributes blocks across N Streaming Multiprocessors (SMs); all SMs share 64 KB of Constant Memory, Texture Memory, and 768 MB - 4 GB of Global Memory]
C Extension
Consists of:
- New syntax and built-in variables
- Restrictions to ANSI C
- API/Libraries
C Extension: New Syntax
- <<< ... >>>
- __host__, __global__, __device__
- __constant__, __shared__, __device__
- __syncthreads()
C Extension: Built-in Variables
- dim3 gridDim;   Dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim;  Dimensions of the block in threads
- dim3 blockIdx;  Block index within the grid
- dim3 threadIdx; Thread index within the block
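As a minimal sketch of how the new syntax and built-in variables fit together (the kernel name, sizes, and scale factor here are illustrative, not from the slides):

#include <cuda_runtime.h>

// __global__ marks a kernel: callable from the host, executed on the device.
__global__ void scale(float* data, float factor)
{
    // Each thread derives its own global index from the built-in variables.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * factor;
}

int main()
{
    const int N = 256;
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // <<<grid, block>>> launch syntax: 2 blocks of 128 threads each (2 * 128 = N).
    scale<<<2, 128>>>(d_data, 3.0f);

    cudaFree(d_data);
    return 0;
}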
C Extension: Restrictions
New restrictions:
- No recursion in device code
- No function pointers in device code
CUDA API
CUDA Runtime (Host and Device):
- Device memory handling (cudaMalloc, ...)
- Built-in math functions (sin, sqrt, mod, ...)
- Atomic operations (for concurrency)
- Data types (2D textures, dim2, dim3, ...)
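A minimal sketch of the device-memory portion of the runtime API (buffer size and names are assumptions for illustration): allocate device memory, copy data to it, copy it back, and free it.

#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;
    const size_t BYTES = N * sizeof(float);

    // Host buffer, filled with sample values.
    float* h_data = (float*)malloc(BYTES);
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    // Device buffer.
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, BYTES);

    // Host -> device, then device -> host.
    cudaMemcpy(d_data, h_data, BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_data, BYTES, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}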
Compiling a CUDA Program
NVCC splits a C/C++ CUDA application into two parts:
- Host C/C++ code, which is handed to gcc and compiled to CPU instructions
- Device code, which NVCC compiles to virtual PTX code; a PTX-to-target compiler then translates the PTX into GPU instructions
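In practice the whole pipeline is driven by a single nvcc invocation, for example (the source file name is assumed):

nvcc -o transpose transpose.cu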
Matrix Transpose
(M^T)_{i,j} = M_{j,i}
Matrix Transpose
[ A  B ]^T  =  [ A^T  C^T ]
[ C  D ]       [ B^T  D^T ]
Matrix Transpose: First Idea
- Each thread block transposes an equal-sized block of matrix M
- Assume M is square (n x n)
- What is a good block size?
- CUDA places limitations on the number of threads per block: 512 threads per block is the maximum allowed
[Diagram: matrix M of size n x n partitioned into equal square sub-blocks]
Matrix Transpose: First Idea

#include <stdio.h>
#include <stdlib.h>

// Each thread copies one element to its transposed position.
__global__ void transpose(float* in, float* out, unsigned int width)
{
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char** argv)
{
    const int HEIGHT = 1024;
    const int WIDTH  = 1024;
    const int SIZE   = WIDTH * HEIGHT * sizeof(float);

    // 16 x 16 = 256 threads per block, one block per tile of M.
    dim3 bdim(16, 16);
    dim3 gdim(WIDTH / bdim.x, HEIGHT / bdim.y);

    float* M = (float*)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) {
        M[i] = i;
    }

    // Copy the input matrix to the device.
    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    // Allocate the output matrix on the device and launch the kernel.
    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE);
    transpose<<<gdim, bdim>>>(Md, Bd, WIDTH);

    // Copy the transposed result back into M.
    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);
    return 0;
}
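As a usage note (not part of the original listing; purely an illustrative check), the result copied back into M can be verified on the host before returning, since the input was initialized with M[i] = i:

// Element (y, x) of the input held y * WIDTH + x, so after the transpose
// M[x * WIDTH + y] should hold that same value.
for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) {
        if (M[x * WIDTH + y] != (float)(y * WIDTH + x)) {
            printf("mismatch at row %d, col %d\n", y, x);
        }
    }
}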
Further Reading
- Online course: UIUC/NVIDIA programming course by David Kirk and Wen-mei W. Hwu
  http://courses.ece.illinois.edu/ece498/al/syllabus.html
- CUDA@MIT '09
  http://sites.google.com/site/cudaiap2009/materials-1/lectures
- Great memory latency study: "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs" by Vasily Volkov and James Demmel
- Book of advanced examples: GPU Gems 3, edited by Hubert Nguyen
- CUDA SDK: tons of source code examples available for download from NVIDIA's website