GPU Parallel Computing Architecture and CUDA Programming Model



John Nickolls

Outline:
Why GPU Computing?
GPU Computing Architecture
Multithreading and Thread Arrays
Data-Parallel Problem Decomposition
Parallel Memory Sharing
Transparent Scalability
CUDA Programming Model
CUDA: C on the GPU
CUDA Example
Applications
Summary

Parallel Computing on a GPU
The NVIDIA GPU Computing Architecture is a scalable parallel computing platform, available in laptops, desktops, workstations, and servers. 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications. GPU parallel performance is pulled by the insatiable demands of the PC game market: GPU parallelism is doubling every year, and the programming model scales transparently. The GPU is programmable in C with the CUDA tools; the multithreaded SPMD model uses both application data parallelism and thread parallelism. [Products pictured: Tesla D870, Tesla S870, GeForce 8800]

NVIDIA 8-Series GPU Computing
A massively multithreaded parallel computing platform: 12,288 concurrent threads, hardware managed; 128 processor cores at 1.35 GHz = 518 GFLOPS peak. GPU Computing features enable C on the Graphics Processing Unit. [Diagram: host CPU feeding a work distribution unit across the GPU's processors]

SM Multithreaded Multiprocessor
Each SM has 8 processors: 32 GFLOPS peak at 1.35 GHz; IEEE 754 32-bit floating point; 32-bit integer. Scalar ISA: load/store, texture fetch, branch, call, return, and a barrier synchronization instruction. Multithreaded instruction unit: 768 threads, hardware multithreaded, as 24 SIMD warps of 32 threads; independent MIMD thread execution; hardware thread scheduling. 16 KB shared memory: concurrent threads share data with low-latency load/store. [Diagram: SM with multithreaded instruction unit and texture L1 cache]

SM SIMD Multithreaded Execution
Weaving, the original parallel thread technology, is about 10,000 years old. A warp is the set of 32 parallel threads that execute a SIMD instruction. SM hardware implements zero-overhead warp and thread scheduling. Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads. Threads can execute independently: a SIMD warp diverges and converges when its threads branch independently, so efficiency and performance are best when the threads of a warp execute together. SIMD across threads (not just data) gives easy single-thread scalar programming with SIMD efficiency. [Diagram: SM multithreaded instruction scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12, warp 3 instruction 96]

Programmer Partitions Problem with Data-Parallel Decomposition
The CUDA programmer partitions the problem into Grids, one Grid per sequential problem step. The programmer partitions each Grid into result Blocks, computed independently in parallel; a GPU thread array computes each result Block. The programmer then partitions each Block into elements, computed cooperatively in parallel; a GPU thread computes each result element. [Diagram: a sequence of Grid 1 (Step 1) and Grid 2 (Step 2), each a 2D array of Blocks (0,0) through (2,1); Block (1,1) expanded into a 5x3 array of threads]

Cooperative Thread Array (CTA) Implements a CUDA Block
A CTA is an array of concurrent threads that cooperate to compute a result; a CUDA thread block is a CTA. The programmer declares the CTA: its size, 1 to 512 concurrent threads; its shape, 1D, 2D, or 3D; its dimensions in threads. CTA threads execute a thread program and carry thread id numbers (0, 1, 2, ... m). CTA threads share data and synchronize; the thread program uses the thread id to select work and address shared data.

SM Multiprocessor Executes CTAs
CTA threads run concurrently: the SM assigns thread id numbers and manages thread execution. CTA threads share data and results in shared memory and synchronize at the barrier instruction. Per-CTA shared memory keeps data close to the processor and minimizes trips to global memory. CTA threads access global memory: 76 GB/sec GDDR DRAM. [Diagram: SM 0 and SM 1, each with its instruction unit and texture L1 cache, running CTA 0 and CTA 1 with threads t0, t1, t2, ... tm]

Data-Parallel Levels
Thread: computes result elements; thread id number. CTA (Cooperative Thread Array): computes a result Block; 1 to 512 threads per CTA; CTA (Block) id number. Grid of CTAs: computes many result Blocks; 1 to many CTAs per Grid. Sequential Grids: compute sequential problem steps. [Diagram: CTA 0, CTA 1, CTA 2, ... CTA n]

Parallel Memory Sharing
Local memory: per thread. Private to each thread; holds auto variables and register spill. Shared memory: per CTA. Shared by the threads of the CTA; used for inter-thread communication. Global memory: per application. Shared by all threads; used for inter-Grid communication. [Diagram: Grid 0 and Grid 1, sequential Grids in time, sharing global memory]

How to Scale GPU Computing?
GPU parallelism varies widely: from 8 cores to many 100s of cores, and from 100 to many 1000s of threads; GPU parallelism doubles yearly. Graphics performance scales with GPU parallelism through the data-parallel mapping of pixels to threads, with unlimited demand for parallel pixel shader threads and cores. The challenge is to scale computing performance with GPU parallelism: the program must be insensitive to the number of cores, so that one program is written for any number of SM cores and runs on any size GPU without recompiling.

Transparent Scalability
The programmer uses multi-level data-parallel decomposition: decompose the problem into sequential steps (Grids); decompose each Grid into parallel Blocks (CTAs); decompose each Block into parallel elements (threads). GPU hardware distributes CTA work to the available SM cores and balances the CTA workload across any number of them. An SM core executes the CTA program, which computes its Block independently of the others; this enables parallel computing of the Blocks of a Grid, with no communication among Blocks of the same Grid. One program therefore scales across any number of parallel SM cores: the programmer writes one program for all GPU sizes, the program does not know how many cores it uses, and it executes on a GPU with any number of cores.

CUDA Programming Model: Parallel Multithreaded Kernels
Execute the data-parallel portions of an application on the GPU as kernels, which run in parallel on many cooperative threads. An integrated CPU + GPU application is a C program: the problem is partitioned into a sequence of kernels; kernel C code executes on the GPU while serial C code executes on the CPU; kernels execute as blocks of parallel threads. View the GPU as a computing device that acts as a coprocessor to the CPU host, has its own memory, and runs many lightweight threads in parallel.

Single-Program Multiple-Data (SPMD)
A CUDA integrated CPU + GPU application is one C program: serial C code executes on the CPU, while parallel kernel C code executes on GPU thread blocks. [Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args) as Grid 0, then more CPU serial code, then KernelB<<< nBlk, nTid >>>(args) as Grid 1]

CUDA Programming Model: Grids, Blocks, and Threads
Execute a sequence of kernels on the GPU computing device. A kernel executes as a Grid of thread blocks. A thread block is an array of threads that can cooperate: threads within the same block synchronize and share data in shared memory. Thread blocks execute as CTAs on the multithreaded multiprocessor SM cores. [Diagram: the CPU launches Kernel 1 then Kernel 2 in sequence; Grid 1 is a 3x2 array of Blocks (0,0) through (2,1); Block (1,1) of Grid 2 is expanded into a 5x3 array of threads]

CUDA Programming Model: Memory Spaces
Each kernel thread can read: its Thread Id (per thread); its Block Id (per block); Constants (per grid); Texture memory (per grid). Each thread can read and write: Registers (per thread); Local memory (per thread); Shared memory (per block); Global memory (per grid). The host CPU can read and write: Constants (per grid); Texture memory (per grid); Global memory (per grid).

CUDA: C on the GPU
Single-Program Multiple-Data (SPMD) programming model: a C program for one thread of a thread block in a grid. Extend C only where necessary: a simple, explicit language mapping to parallel threads. Declare C kernel functions and variables on the GPU:
    __global__ void KernelFunc(...);
    __device__ int GlobalVar;
    __shared__ int SharedVar;
Call a kernel function as a Grid of 500 blocks of 128 threads:
    KernelFunc<<< 500, 128 >>>(args...);
Explicit GPU memory allocation and CPU-GPU memory transfers: cudaMalloc(), cudaFree(), cudaMemcpy(), cudaMemcpy2D(), ...

CUDA C Example: Add Arrays

C program:
    void addMatrix(float *a, float *b, float *c, int N)
    {
        int i, j, idx;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                idx = i + j*N;
                c[idx] = a[idx] + b[idx];
            }
        }
    }
    void main()
    {
        ...
        addMatrix(a, b, c, N);
    }

CUDA C program:
    __global__ void addMatrixG(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = i + j*N;
        if (i < N && j < N)
            c[idx] = a[idx] + b[idx];
    }
    void main()
    {
        dim3 dimBlock(blocksize, blocksize);
        dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
        addMatrixG<<<dimGrid, dimBlock>>>(a, b, c, N);
    }

CUDA Software Development Kit
CUDA optimized libraries: FFT, BLAS, ... Integrated CPU + GPU C source code feeds the NVIDIA C compiler, which emits NVIDIA assembly for computing (PTX) for the GPU and CPU host code for a standard C compiler; the CUDA driver, debugger, and profiler complete the toolkit.

Compiling CUDA Programs
NVCC splits a C/C++ CUDA application into CPU code and virtual PTX code; a PTX-to-target translator then generates target code for the particular GPU. [Diagram: C/C++ CUDA Application → NVCC → CPU Code + Virtual PTX Code → PTX-to-Target Translator → GPU target code]

GPU Computing Application Areas
Geoscience, chemistry, medicine, modeling, science, biology, finance, image processing.

Summary
NVIDIA GPU Computing Architecture: the computing mode enables parallel C on GPUs. Massively multithreaded, with 1000s of threads, it executes parallel threads and thread arrays; threads cooperate via shared and global memory; and it scales to any number of parallel processor cores. Now on: Tesla C870, D870, S870; GeForce 8800/8600/8500; and Quadro FX 5600/4600.
CUDA programming model: a C program for GPU threads that scales transparently to GPU parallelism, with a compiler, tools, libraries, and driver for GPU Computing. Supports Linux and Windows.
http://www.nvidia.com/tesla
http://developer.nvidia.com/cuda