
GPU Computing with CUDA
Lecture 2 - CUDA Memories
Christopher Cooper, Boston University
August 2011, UTFSM, Valparaíso, Chile

Outline of lecture
- Recap of Lecture 1
- Warp scheduling
- CUDA memory hierarchy
- Introduction to the Finite Difference Method

Recap
GPU
- A real alternative for scientific computing
- Massively parallel commodity hardware: market-driven development makes it cheap!
CUDA
- NVIDIA's technology for programming massively parallel processors for general-purpose work
- C/C++ with extensions

Recap
GPU chip
- Designed for high-throughput tasks
- Many simple cores
- Fermi: 16 SMs, each with 32 SPs (cores)
- Each SM also has shared memory, registers, cache, SFUs (special function units) and control units
- 1 TFLOP peak throughput, 144 GB/s peak bandwidth

Recap
Programming model
- Based on threads
- Thread hierarchy: threads are grouped into thread blocks
- Threads in a thread block can communicate
Challenge: parallel thinking
- Data parallelism
- Load balancing
- Conflicts
- Latency hiding
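
As a quick illustration of this model, here is a minimal sketch (not taken from the original slides) of a kernel in which each thread handles one array element; the kernel name, data and launch configuration are made up for the example.

// Minimal sketch of the thread hierarchy: each thread scales one element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;                          // one thread per element
}

// Possible launch: 256 threads per block, enough blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);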

CUDA - Programming model
Connecting the hardware and the software
- A thread (software) executes on a core/SP (hardware)
- A thread block (software) is assigned to an SM (hardware)

How a Streaming Multiprocessor (SM) works
- CUDA threads are grouped into thread blocks
- All threads of the same thread block are executed on the same SM at the same time
- SMs have shared memory, so threads within a thread block can communicate
- All threads of a thread block must finish execution before there is room to schedule another thread block on that SM

How a Streaming Multiprocessor (SM) works
The hardware schedules thread blocks onto available SMs
- There is no guarantee of the order of execution
- If an SM has enough free resources, the hardware schedules more blocks onto it

Warps
Inside the SM, threads are launched in groups of 32 called warps
- Warps share the control logic (warp scheduler)
- At any time, only one warp is executed per SM
- Threads in a warp execute the same instruction
- Half warps for compute capability 1.x
- Fermi: maximum number of active threads = 1024 (max threads per block) * 8 (max blocks per SM) * 32 (SMs) = 262144

Warp designation
The hardware separates the threads of a block into warps
- All threads in a warp belong to the same thread block
- Threads are placed into warps sequentially, by thread index
- Threads are scheduled in warps
[Figure: a 16x8 thread block (Block N) split into Warp 1 through Warp 4]
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign)

Warp scheduling
The SM implements zero-overhead warp scheduling
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution based on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign)

Warp scheduling
Fermi: dual warp scheduler
- Each SM has two warp schedulers and two instruction dispatch units, so two warps can be issued and executed concurrently

Memory hierarchy
CUDA code runs on both the CPU and the GPU
- One has to keep track of which memory one is operating on (host vs. device)
- Within the GPU itself there are also several distinct memory spaces

Memory hierarchy
Global memory
- Main GPU memory, used to communicate with the host
- Visible to all threads
- Order of GB
- Off chip, slow (~400 cycles)

  __device__ float variable;
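
Global memory is usually managed from the host; a brief sketch of the typical allocate/copy pattern (not from the slides; h_u and d_u are illustrative names):

int   n = 1024;
float h_u[1024];                              // host array (illustrative)
float *d_u;                                   // device (global memory) pointer
cudaMalloc((void**)&d_u, n * sizeof(float));  // allocate in global memory
cudaMemcpy(d_u, h_u, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read and write d_u ...
cudaMemcpy(h_u, d_u, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_u);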

Memory hierarchy
Shared memory
- Per SM
- Visible to the threads of the same thread block
- Order of kB
- On chip, fast (~4 cycles)

  __shared__ float variable;

Registers
- Private to each thread
- On chip, fast

  float variable;

Memory hierarchy
Local memory
- Private to each thread
- Off chip, slow
- Holds register overflows (spills)

  float variable[10];

Constant memory
- Read only
- Off chip, but fast (cached)
- Visible to all threads
- 64 kB, with an 8 kB cache

  __constant__ float variable;
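
To see how these qualifiers fit together, here is a hedged sketch (illustrative names, a fixed block size of 256 threads assumed; c_coeff would be filled from the host with cudaMemcpyToSymbol):

__device__   float d_scale;        // global memory, visible to all threads
__constant__ float c_coeff[16];    // constant memory, read only from kernels

__global__ void example(const float *in, float *out)
{
    __shared__ float tile[256];    // shared memory, one copy per thread block
    float r = in[threadIdx.x];     // kept in a register, private to this thread
    float scratch[10];             // large per-thread arrays may spill to local memory
    for (int k = 0; k < 10; k++)
        scratch[k] = r * c_coeff[0];
    tile[threadIdx.x] = scratch[9] + d_scale;
    __syncthreads();               // now threads in the block can read each other's values
    out[threadIdx.x] = tile[(threadIdx.x + 1) % 256];
}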

Memory hierarchy
Texture memory
- Visible to all threads
- Read only
- Off chip, but fast on a cache hit
- Cache optimized for 2D locality
- Binds to global memory

  texture<type, dim> tex_var;          (initialize)
  cudaChannelFormatDesc(...);          (options)
  cudaBindTexture2D(...);              (bind)
  tex2D(tex_var, x_index, y_index);    (fetch)
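
A hedged sketch of that workflow using the (pre-CUDA 5) texture reference API; names such as tex_u, d_u, width, height and pitch are illustrative:

texture<float, 2> tex_u;   // initialize: texture reference at file scope

__global__ void read_node(float *out, int x_index, int y_index)
{
    out[0] = tex2D(tex_u, x_index, y_index);   // fetch: cached, with 2D locality
}

void bind(float *d_u, size_t width, size_t height, size_t pitch)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();     // options
    cudaBindTexture2D(NULL, tex_u, d_u, desc, width, height, pitch); // bind to global memory
}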

Memory hierarchy
[Figure: overview diagram of the CUDA memory spaces]

Memory hierarchy
The case of Fermi
- An L1 cache was added to each SM
- Shared memory + L1 cache = 64 kB, configurable as either
  - Shared = 48 kB, cache = 16 kB
  - Shared = 16 kB, cache = 48 kB
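
If this split matters for a kernel, it can be requested per function; a brief sketch (my_kernel is a placeholder name):

cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);  // 48 kB shared / 16 kB L1
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);      // 16 kB shared / 48 kB L1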

Memory hierarchy
Use your memory strategically
- Read only: constant memory (fast)
- Read/write with communication within a block: shared memory (fast, allows communication)
- Read/write private to each thread: registers (fast)
- Read only with 2D data locality: texture memory

Resource limits
The number of thread blocks resident on an SM at the same time is limited by
- Shared memory usage
- Register usage
- The hard limit of 8 thread blocks per SM
- The number of threads

Resource limits - Examples
Number of blocks: how big should my blocks be, 8x8, 16x16 or 64x64? (Assume at most 1024 threads and 8 blocks per SM.)
- 8x8: 8*8 = 64 threads per block, so 1024/64 = 16 blocks would be needed to fill the SM. Since an SM can hold at most 8 blocks, only 8*64 = 512 threads will be active at the same time.
- 16x16: 16*16 = 256 threads per block, so 1024/256 = 4 blocks. The SM can take all 4 blocks, so all 1024 threads are active and full capacity is reached, unless another resource limit overrules it.
- 64x64: 64*64 = 4096 threads per block, which does not fit in an SM.
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign)
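
A hedged sketch of launching with the 16x16 block size from this example (the kernel name and the nx, ny mesh dimensions are illustrative):

dim3 block(16, 16);                        // 256 threads per block
dim3 grid((nx + block.x - 1) / block.x,    // enough blocks to cover an nx x ny mesh
          (ny + block.y - 1) / block.y);
update<<<grid, block>>>(d_u_new, d_u_old, nx, ny);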

Resource limits - Examples
Registers: a kernel uses 10 registers per thread. With 16x16 blocks, how many blocks can run on a G80 SM (max 8192 registers per SM, 768 threads per SM)?
- 10*16*16 = 2560 registers per block. The SM can hold 8192/2560 = 3 blocks, so 3*16*16 = 768 threads will run, which is within the limits.
- If we add one more register, usage grows to 11*16*16 = 2816 registers per block. The SM can now hold only 8192/2816 = 2 blocks, so 2*16*16 = 512 threads will run.
- With fewer threads running per SM, it is harder to have enough warps to keep the GPU busy at all times!
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign)
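
Not on the slide, but useful in practice: per-thread register usage can be printed at compile time, and a kernel can declare launch bounds so the compiler limits register use (the kernel name and the bound are illustrative):

// Compile with: nvcc -Xptxas -v kernel.cu   (reports registers per thread)

__global__ void __launch_bounds__(256)      // at most 256 threads per block
update(float *u_new, const float *u_old, int nx, int ny)
{
    /* ... stencil update ... */
}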

Introduction to the Finite Difference Method
- Throughout the course many examples will be taken from the Finite Difference Method
- Also, the next lab involves implementing a stencil code
- Just want to make sure we're all on the same page!

Introduction to the Finite Difference Method
The Finite Difference Method (FDM) is a numerical method for the solution of differential equations
- The domain is discretized with a mesh
- The derivative at a node is approximated by a linear combination of the values at nearby points (including the node itself)
- Example, using a first-order one-sided difference: $\left.\dfrac{dy}{dx}\right|_{x_i} \approx \dfrac{y_i - y_{i-1}}{x_i - x_{i-1}}$

FDM - Accuracy and order
From Taylor's polynomial we get
$f(x_0 + h) = f(x_0) + f'(x_0)\,h + R_1(x)$, so $f'(x_0) = \dfrac{f(x_0 + h) - f(x_0)}{h} - \dfrac{R_1(x)}{h}$
Sources of error
- Roundoff (machine precision)
- Truncation ($R(x)$): gives the order of the method
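
A small numerical check of this (not from the slides) for f(x) = sin(x) at x0 = 1; the error of the one-sided difference shrinks roughly linearly with h, as expected for a first-order method:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x0 = 1.0;
    for (double h = 1e-1; h >= 1e-4; h /= 10.0) {
        double approx = (sin(x0 + h) - sin(x0)) / h;       // one-sided difference
        printf("h = %.0e   error = %.3e\n", h, fabs(approx - cos(x0)));
    }
    return 0;
}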

FDM - Stability
Stability of the algorithm may be conditional. Model problem: $\dfrac{\partial u}{\partial t} = \dfrac{\partial^2 u}{\partial x^2}$

Explicit method (conditionally stable, requires $\frac{k}{h^2} < 0.5$):
$\dfrac{u^{n+1}_j - u^n_j}{k} = \dfrac{u^n_{j+1} - 2u^n_j + u^n_{j-1}}{h^2}$

Implicit method (unconditionally stable):
$\dfrac{u^{n+1}_j - u^n_j}{k} = \dfrac{u^{n+1}_{j+1} - 2u^{n+1}_j + u^{n+1}_{j-1}}{h^2}$

Crank-Nicolson (unconditionally stable):
$\dfrac{u^{n+1}_j - u^n_j}{k} = \dfrac{1}{2}\left(\dfrac{u^n_{j+1} - 2u^n_j + u^n_{j-1}}{h^2} + \dfrac{u^{n+1}_{j+1} - 2u^{n+1}_j + u^{n+1}_{j-1}}{h^2}\right)$

FDM - 2D example
We're going to be working with this example in the labs
Explicit diffusion: $\dfrac{\partial u}{\partial t} = \alpha \nabla^2 u$
$u^{n+1}_{i,j} = u^n_{i,j} + \dfrac{\alpha k}{h^2}\left(u^n_{i,j+1} + u^n_{i,j-1} + u^n_{i+1,j} + u^n_{i-1,j} - 4u^n_{i,j}\right)$
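
A hedged sketch of this update as a serial (CPU) reference, assuming a row-major array of nx x ny nodes with grid spacing h and time step k; the names are illustrative:

void diffusion_step_cpu(float *u_new, const float *u_old,
                        int nx, int ny, float alpha, float k, float h)
{
    for (int i = 1; i < ny - 1; i++) {
        for (int j = 1; j < nx - 1; j++) {          // interior nodes only
            int n = i * nx + j;                     // flatten (i, j) to a 1D index
            u_new[n] = u_old[n] + alpha * k / (h * h) *
                       (u_old[n + 1] + u_old[n - 1] + u_old[n + nx] + u_old[n - nx]
                        - 4.0f * u_old[n]);
        }
    }
}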

FDM on the GPU
Mapping the physical problem to the software and hardware
- Each thread will operate on one node: the node indexing groups threads naturally (thread = node, thread block = tile of nodes)
- $u^{n+1}_{i,j} = u^n_{i,j} + \dfrac{\alpha k}{h^2}\left(u^n_{i,j+1} + u^n_{i,j-1} + u^n_{i+1,j} + u^n_{i-1,j} - 4u^n_{i,j}\right)$
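
Putting the mapping into code: a hedged sketch of a kernel for this update (not the course's reference implementation), with one thread per node and the same indexing as the CPU version above; boundary nodes are left untouched, and k must respect the stability limit of the explicit scheme.

__global__ void diffusion_step(float *u_new, const float *u_old,
                               int nx, int ny, float alpha, float k, float h)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;     // column index of this thread's node
    int i = blockIdx.y * blockDim.y + threadIdx.y;     // row index of this thread's node
    if (i > 0 && i < ny - 1 && j > 0 && j < nx - 1) {  // interior nodes only
        int n = i * nx + j;
        u_new[n] = u_old[n] + alpha * k / (h * h) *
                   (u_old[n + 1] + u_old[n - 1] + u_old[n + nx] + u_old[n - nx]
                    - 4.0f * u_old[n]);
    }
}

// Possible launch, e.g. with 16x16 thread blocks covering the whole mesh:
// dim3 block(16, 16);
// dim3 grid((nx + 15) / 16, (ny + 15) / 16);
// diffusion_step<<<grid, block>>>(d_u_new, d_u_old, nx, ny, alpha, k, h);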