NVIDIA CUDA Software and GPU Parallel Computing Architecture
David B. Kirk, Chief Scientist

Outline
- Applications of GPU Computing
- CUDA Programming Model Overview
- Programming in CUDA: The Basics
- How to Get Started!
- Exercises / Examples
  - Interleaved with presentation materials
  - Homework for later

Future Science and Engineering Breakthroughs Hinge on Computing
- Computational Geoscience
- Computational Chemistry
- Computational Medicine
- Computational Modeling
- Computational Physics
- Computational Biology
- Computational Finance
- Image Processing

Faster is not just Faster
- 2-3x faster is just faster
  - Do a little more, wait a little less
  - Doesn't change how you work
- 5-10x faster is significant
  - Worth upgrading
  - Worth re-writing (parts of) the application
- 100x+ faster is fundamentally different
  - Worth considering a new platform
  - Worth re-architecting the application
  - Makes new applications possible
  - Drives time to discovery and creates fundamental changes in science

The GPU is a New Computation Engine
[Chart: relative floating-point performance, 2002-2006, climbing through the "Era of Shaders" to roughly 80x with the fully programmable G80]

Closely Coupled CPU-GPU
[Diagram: CPU and GPU executing interleaved functions and library calls (Init, Alloc, Operation 1-3)]
- Integrated programming model
- High-speed data transfer, up to 3.2 GB/s
- Asynchronous operation
- Large GPU memory systems
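
As a rough sketch of this flow in code (names, sizes, and launch configuration below are illustrative, not from the talk), a typical CUDA program allocates GPU memory, copies input across, launches a kernel asynchronously, and copies results back:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void myKernel(float *data) { /* ... device code ... */ }

    int main() {
        const int N = 1024;                                  // illustrative size
        float *h_data = (float*)malloc(N * sizeof(float));   // host (CPU) buffer
        float *d_data;
        cudaMalloc((void**)&d_data, N * sizeof(float));      // Alloc: GPU memory
        cudaMemcpy(d_data, h_data, N * sizeof(float),
                   cudaMemcpyHostToDevice);                  // high-speed transfer in
        myKernel<<<4, 256>>>(d_data);                        // asynchronous launch
        cudaMemcpy(h_data, d_data, N * sizeof(float),
                   cudaMemcpyDeviceToHost);                  // waits, then copies out
        cudaFree(d_data);
        free(h_data);
        return 0;
    }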

Millions of CUDA-enabled GPUs
[Chart: total CUDA-enabled GPUs (millions), 2006-2007, approaching 50 million]
- Dedicated computing in C on the GPU
- Servers through notebook PCs

- GeForce: Entertainment
- Quadro: Design & Creation
- Tesla: High Performance Computing

VMD/NAMD Molecular Dynamics
- 240x speedup
- Computational biology
- http://www.ks.uiuc.edu/research/vmd/projects/ece498/lecture/

EvolvedMachines
- Simulate the brain circuit
- Sensory computing: vision, olfactory
- 130x speedup

Hanweck Associates
- VOLERA: real-time options implied-volatility engine
- Accurate results with SINGLE PRECISION
- Evaluates all U.S. listed equity options in <1 second
- (www.hanweckassoc.com)

LIBOR Application: Mike Giles and Su Xiaoke, Oxford University Computing Laboratory
- LIBOR model with portfolio of swaptions
- 80 initial forward rates and 40 timesteps to maturity
- 80 deltas computed with adjoint approach

                                  No Greeks         Greeks
                                  time    speedup   time    speedup
  Intel Xeon                      18.1s   -         26.9s   -
  ClearSpeed Advance (2x CSX600)  2.9s    6x        6.4s    4x
  NVIDIA 8800 GTX                 0.045s  400x      0.18s   149x

"The performance of the CUDA code on the 8800 GTX is exceptional" - Mike Giles
Source codes and papers available at: http://web.comlab.ox.ac.uk/oucl/work/mike.giles/hpc

Manifold 8 GIS Application
- From the Manifold 8 feature list: "applications fitting CUDA capabilities that might have taken tens of seconds or even minutes can be accomplished in hundredths of seconds... CUDA will clearly emerge to be the future of almost all GIS computing"
- From the user manual: "NVIDIA CUDA could well be the most revolutionary thing to happen in computing since the invention of the microprocessor"

nbody Astrophysics
- Astrophysics research
- 1 GF on a standard PC
- 300+ GF on a GeForce 8800 GTX
- Faster than the GRAPE-6Af custom simulation computer
- http://progrape.jp/cs/
- Video demo

MATLAB: Language of Science
- 17x speedup with MATLAB on CPU+GPU
- http://developer.nvidia.com/object/matlab_cuda.html
- Pseudo-spectral simulation of 2D isotropic turbulence
- http://www.amath.washington.edu/courses/571-winter-2006/matlab/fs_2dturb.m

CUDA Programming Model Overview

GPU Computing
- The GPU is a massively parallel processor
  - NVIDIA G80: 128 processors
  - Supports thousands of active threads (12,288 on G80)
- GPU computing requires a programming model that can efficiently express that kind of parallelism
  - Most importantly, data parallelism
- CUDA implements such a programming model

CUDA Kernels and Threads
- Parallel portions of an application are executed on the device as kernels
  - One kernel is executed at a time
  - Many threads execute each kernel
- Differences between CUDA and CPU threads
  - CUDA threads are extremely lightweight
    - Very little creation overhead
    - Instant switching
  - CUDA uses 1000s of threads to achieve efficiency; multi-core CPUs can use only a few
- Definitions:
  - Device = GPU; Host = CPU
  - Kernel = function that runs on the device

Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
- All threads run the same code
- Each thread has an ID that it uses to compute memory addresses and make control decisions

[Diagram: threads 0-7, each executing:]
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
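
A runnable version of the slide's snippet might look like the following; func is given a placeholder body, and the kernel is assumed to run as a single block so threadIdx.x alone serves as the thread ID:

    __device__ float func(float x) { return 2.0f * x; }   // placeholder computation

    __global__ void myKernel(const float *input, float *output) {
        int threadID = threadIdx.x;        // this thread's ID within the block
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }

    // Launched as one block of 8 threads: myKernel<<<1, 8>>>(d_input, d_output);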

Thread Cooperation
- The missing piece: threads may need to cooperate
- Cooperation is valuable
  - Share results to save computation
  - Synchronization
  - Share memory accesses
    - Drastic bandwidth reduction
- Thread cooperation is a powerful feature of CUDA

Thread Blocks: Scalable Cooperation
- Divide the monolithic thread array into multiple blocks
  - Threads within a block cooperate via shared memory
  - Threads in different blocks cannot cooperate
- Enables programs to transparently scale to any number of processors!

[Diagram: Block 0, Block 1, ..., Block N-1, each an array of threads 0-7 running the same input/func/output snippet]
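
With multiple blocks, the per-thread ID from the previous slide generalizes to a global index computed from the block ID; a minimal sketch (kernel and variable names are illustrative):

    __global__ void myKernel(const float *input, float *output) {
        // global ID = block ID * block size + thread ID within the block
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        float x = input[threadID];
        output[threadID] = 2.0f * x;       // placeholder for func(x)
    }

    // N blocks of 8 threads each: myKernel<<<N, 8>>>(d_input, d_output);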

Transparent Scalability
- Hardware is free to schedule thread blocks on any processor at any time
- A kernel scales across any number of parallel multiprocessors

[Diagram: the same 8-block kernel grid running on a 2-processor device, blocks scheduled two at a time, and on a 4-processor device, blocks scheduled four at a time]

CUDA Programming Model
- A kernel is executed by a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - Sharing data through shared memory
  - Synchronizing their execution
- Threads from different blocks cannot cooperate

[Diagram: the host launches Kernel 1 as Grid 1 (a 3x2 array of blocks) and Kernel 2 as Grid 2; Block (1,1) is expanded into a 5x3 array of threads]
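
In host code the grid and block shapes are given at kernel launch; a hedged sketch matching the pictured 3x2 grid of 5x3-thread blocks (kernel names and the second launch's shape are illustrative):

    #include <cuda_runtime.h>

    __global__ void kernel1(float *data) { /* ... device code ... */ }
    __global__ void kernel2(float *data) { /* ... device code ... */ }

    int main() {
        float *d_data;
        cudaMalloc((void**)&d_data, 90 * sizeof(float));  // 3*2 blocks x 5*3 threads = 90
        dim3 grid1(3, 2), block1(5, 3);        // Grid 1: 3x2 blocks of 5x3 threads
        kernel1<<<grid1, block1>>>(d_data);    // executed as Grid 1
        kernel2<<<dim3(4, 1), dim3(8, 8)>>>(d_data);  // a second launch becomes Grid 2
        cudaFree(d_data);
        return 0;
    }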

G80 Device
- Processors execute computing threads
- Thread Execution Manager issues threads
- 128 processors grouped into 16 multiprocessors (SMs)
- Parallel Data Cache (shared memory) enables thread cooperation

[Diagram: Host -> Input Assembler -> Thread Execution Manager -> 16 multiprocessors, each with processors and a Parallel Data Cache, connected via load/store units to Global Memory]

Thread and Block IDs
- Threads and blocks have IDs
  - Each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multi-dimensional data
  - Image processing
  - Solving PDEs on volumes

[Diagram: Grid 1 as a 3x2 array of blocks; Block (1,1) expanded into a 5x3 array of threads]
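
For image-style data the 2D block and thread IDs combine directly into pixel coordinates; a minimal sketch (kernel name, operation, and layout are illustrative):

    __global__ void brighten(float *image, int width, int height) {
        // 2D pixel coordinates from 2D block and thread IDs
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)        // guard threads that fall off the image
            image[y * width + x] += 0.1f;   // row-major addressing
    }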

Kernel Memory Access
- Registers
- Global Memory (external DRAM)
  - Kernel input and output data reside here
  - Off-chip, large
  - Uncached
- Shared Memory (Parallel Data Cache)
  - Shared among threads in a single block
  - On-chip, small
  - As fast as registers
- The host can read & write global memory but not shared memory

[Diagram: a grid of blocks, each with per-thread registers and per-block shared memory, all connected to global memory, which the host can also access]
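
A hedged sketch of where data lives in a kernel (tile size, names, and the computation are illustrative):

    __global__ void scale(const float *in, float *out) {  // in/out point to global memory
        __shared__ float tile[256];     // per-block shared memory (Parallel Data Cache)
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register
        tile[threadIdx.x] = in[i];      // stage a value on-chip
        out[i] = 2.0f * tile[threadIdx.x];
    }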

Execution Model
- Kernels are launched in grids
  - One kernel executes at a time
- A block executes on one Streaming Multiprocessor (SM)
  - Does not migrate
- Several blocks can reside concurrently on one SM
  - Control limitations (of G8X/G9X GPUs):
    - At most 8 concurrent blocks per SM
    - At most 768 concurrent threads per SM
  - Number is further limited by SM resources
    - Register file is partitioned among all resident threads
    - Shared memory is partitioned among all resident thread blocks
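
As a worked example of these limits (the block sizes are chosen for illustration): with 256-thread blocks an SM can hold 768 / 256 = 3 resident blocks, well under the 8-block cap; with 64-thread blocks the 8-block cap bites first, so only 8 x 64 = 512 of the possible 768 threads are resident.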

CUDA Advantages over Legacy GPGPU
(Legacy GPGPU is programming the GPU through graphics APIs)
- Random-access byte-addressable memory
  - A thread can access any memory location
- Unlimited access to memory
  - A thread can read/write as many locations as needed
- Shared memory (per block) and thread synchronization
  - Threads can cooperatively load data into shared memory (sketched below)
  - Any thread can then access any shared memory location
- Low learning curve
  - Just a few extensions to C
  - No knowledge of graphics is required
  - No graphics API overhead
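
A minimal sketch of the cooperative-load pattern (names are illustrative, and the grid is assumed to exactly cover the input, with blockDim.x == 256):

    __global__ void shiftWithinBlock(const float *in, float *out) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                 // each thread loads one element
        __syncthreads();                           // wait until the whole tile is staged
        int j = (threadIdx.x + 1) % blockDim.x;    // a slot loaded by a neighboring thread
        out[i] = tile[j];                          // any thread can read any staged value
    }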

CUDA Model Summary
- Thousands of lightweight concurrent threads
  - No switching overhead
  - Hide instruction and memory latency
- Shared memory
  - User-managed L1 cache
  - Thread communication / cooperation within blocks
- Random access to global memory
  - Any thread can read/write any location(s)
- Current generation hardware: up to 128 streaming processors

  Memory   Location   Cached   Access       Scope ("Who?")
  Shared   On-chip    N/A      Read/write   All threads in a block
  Global   Off-chip   No       Read/write   All threads + host

Getting Started

Get CUDA
- CUDA Zone: http://nvidia.com/cuda
- Programming Guide and other documentation
- Toolkits and SDKs for:
  - Windows
  - Linux
  - MacOS
- Libraries
- Plugins
- Forums
- Code samples

Come visit the class!
- UIUC ECE498AL: Programming Massively Parallel Processors (http://courses.ece.uiuc.edu/ece498/al/)
- David Kirk (NVIDIA) and Wen-mei Hwu (UIUC), co-instructors
- CUDA programming, GPU computing, lab exercises, and projects
- Lecture slides and voice recordings

Questions?