Intro to GPU Computing
Spring 2015, Mark Silberstein, 048661, Technion
Serial vs. parallel program
A serial program executes one instruction at a time; a parallel program executes multiple instructions in parallel.
GPUs are parallel processors
Throughput-oriented hardware. Example problem: I have 100 apples to eat.
1) High performance: finish one apple faster.
2) High throughput: finish all apples faster.
Performance = parallel hardware + scalable parallel program!
Simplified GPU model
(See the introductory paper on my home page: «GPUs: High-performance Accelerators for Parallel Applications».)
GPU 101
[diagram: CPU and GPU, each with its own memory]
GPU is a co-processor
[diagram: the CPU runs the computation; the GPU sits alongside it, and each has its own memory]
Co-processor model
[diagram: the CPU offloads part of its computation to the GPU as a GPU kernel; CPU and GPU each have their own memory]
Simple GPU program
Idea: the same set of operations is applied to different data chunks in parallel.
Implementation: every thread runs the same code on a different data chunk; the GPU concurrently runs many parallel threads.
Vector sum A = A + B
Sequential algorithm: for every element i < A.size: A[i] = A[i] + B[i]
Parallel algorithm: in parallel, for every i < A.size: A[i] = A[i] + B[i]
Execution on GPU: in parallel, for every threadId < total GPU kernel threads: A[threadId] = A[threadId] + B[threadId]
The kernel is instantiated and executed on the GPU in thousands of hardware threads.
GPU programming basics
GPU execution model
- Thread: a sequence of sequentially executed instructions. Each thread has private state: registers, stack.
- Single Instruction Multiple Threads (SIMT) model: the same program, called a kernel, is instantiated and invoked for each thread. Each thread gets a unique ID.
- Many threads (tens of thousands). Threads wait in a hardware queue until resources become available.
- All threads share access to GPU memory.
CPU control of GPU
- GPU* and CPU have different memory spaces.
- The GPU is fully managed by the CPU: it cannot* access CPU RAM, cannot* reallocate its own memory, and cannot* start computations on its own.
- The CPU must allocate memory and transfer the input before kernel invocation.
- The CPU must transfer the output and clean up memory after kernel termination.
Vector sum for A.size = 1024
GPU: A[threadId] = A[threadId] + B[threadId]
CPU:
1. Allocate arrays in GPU memory
2. Copy data CPU -> GPU
3. Invoke the kernel with 1024 threads
4. Wait until it completes and copy data GPU -> CPU
Vector sum kernel code in CUDA

__global__ void sum(int* A, int* B) {                // __global__: CUDA modifier that marks GPU kernels
    int my = blockIdx.x * blockDim.x + threadIdx.x;  // get unique thread ID
    A[my] = A[my] + B[my];                           // fine-grain parallelism: one operation per thread
}
CPU code for GPU management

#define SIZE 1024

__global__ void sum(int* A, int* B) {                // GPU code
    int my = blockIdx.x * blockDim.x + threadIdx.x;
    A[my] = A[my] + B[my];
}

int main(int argc, char** argv) {                    // CPU code
    int* d_a; int* d_b;                              // two sets of pointers: GPU memory vs. CPU memory
    int a[SIZE], b[SIZE];
    size_t size_bytes = SIZE * sizeof(int);

    // GPU memory is allocated by the CPU
    cudaMalloc((void**)&d_a, size_bytes);
    cudaMalloc((void**)&d_b, size_bytes);

    // input is copied from CPU to GPU
    cudaMemcpy(d_a, a, size_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size_bytes, cudaMemcpyHostToDevice);

    // pass pointers to device memory at kernel invocation
    dim3 threads_in_block(512), blocks(2);
    sum<<<blocks, threads_in_block>>>(d_a, d_b);

    // wait until the kernel is over
    cudaDeviceSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(error));
        return 1;
    }

    // copy data back to CPU
    cudaMemcpy(a, d_a, size_bytes, cudaMemcpyDeviceToHost);
    printf("Success");
    return 0;
}
Typical code structure (same code as the previous slide)
GPU programming: the kernel (the __global__ function) implements the parallel computation.
GPU integration: the CPU code handles allocation, data transfers, kernel invocation, synchronization, error checking, and copying results back.
BUT! Vector sum is simple: purely data-parallel. What if we need coordination between tasks?
Inter-thread communication
Programming model usability is determined by the means of inter-thread communication:
- No communication: easy for hardware, hard for software.
- Fine-grained communication: hard to make efficient in hardware, easy for software.
Communications: inner product
[diagram: elementwise products (x) combined by a tree of additions (+)]
Eventually, all-to-all thread communication is required. How can we make it efficient?
Compromise: hierarchy
All threads are split into fixed-size groups: threadblocks. Threads in a threadblock can synchronize and communicate efficiently:
- they share fast memory
- they can use barriers
Execution model: stream of threadblocks
Threadblock scheduling is non-deterministic: you cannot rely on a particular schedule for correctness. There is no inter-threadblock global barrier in the programming model.
Hierarchical dot product
[diagram: the products (x) and partial sums (+) are split between Threadblock 1 and Threadblock 2; each threadblock reduces its own portion]
Slow coordination should be rare
[diagram: fine-grain tasks use efficient communication inside Threadblock 1 and Threadblock 2; the coarse-grain task combines their results with slow coordination]
Dot product

#define TBSIZE 512

__global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout)
{
    __shared__ float l_res[TBSIZE];   // fast local (per-core) memory

    int tid = threadIdx.x;            // two levels of indexing:
    int bid = blockIdx.x;             // thread within block, block within grid
    int offset = bid * TBSIZE + tid;

    l_res[tid] = ga[offset] * gb[offset];
    __syncthreads();                  // fast sync: wait for all products in a threadblock

    // parallel reduction
    for (int i = TBSIZE / 2; i > 0; i /= 2) {
        if (tid < i)
            l_res[tid] = l_res[tid] + l_res[i + tid];
        __syncthreads();              // wait for all partial sums
    }

    if (tid == 0)
        gout[bid] = l_res[0];
}
How to finalize the parallel reduction?
- Option 1: use the CPU
- Option 2: use atomics
- Option 3: use multiple kernel invocations (kernel completion = global barrier)
Dot product with atomics

__global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout)
{
    __shared__ float l_res[TBSIZE];   // local core memory

    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int offset = bid * TBSIZE + tid;

    l_res[tid] = ga[offset] * gb[offset];
    __syncthreads();                  // wait for all products in a threadblock

    // parallel reduction
    for (int i = TBSIZE / 2; i > 0; i /= 2) {
        if (tid < i)
            l_res[tid] = l_res[tid] + l_res[i + tid];
        __syncthreads();              // wait for all partial sums
    }

    // all threadblocks atomically update the global-memory result gout[0]
    if (tid == 0)
        atomicAdd(gout, l_res[0]);
}
GPU hardware in more detail
BEGIN: Terminology break
Flynn's HARDWARE taxonomy
SIMD: lock-step execution. Example: vector operations (SSE).
MIMD. Example: multicore CPUs.
END: Terminology break
GPU hardware characteristics
Massive parallelism: the GPU runs thousands of concurrent execution flows (threads), identified and explicitly programmed by the developer.
GPU hardware parallelism 1. MIMD = multi-core
[diagram: GPU with GPU memory and several cores (multiprocessors, MPs)]
GPU hardware parallelism 2. SIMD = vector
[diagram: each core contains several SIMD vector units]
Fine-grain multithreading 3. Parallelism for latency hiding
[diagram sequence: a core (MP) holds the execution state of several threads T1, T2, T3; when one thread issues a memory read (R 0x01), the core switches to another thread instead of stalling, so reads R 0x01, R 0x04, R 0x08 from different threads are in flight at once]
Putting it all together: 3 levels of hardware parallelism
[diagram: GPU with GPU memory; multiple cores (MPs); each core holds thread state contexts Ctx 1..k and a SIMD vector unit]
What is a GPU thread?
[same diagram: a GPU thread (Thread 1 .. Thread n) corresponds to one execution context (thread state Ctx) on a core]
What is a thread block?
[same diagram: a thread block is a group of thread contexts resident on one core (MP), sharing the scratchpad memory in that core]
What is a GPU kernel?
[diagram: a kernel is a collection of thread blocks; blocks wait in a hardware queue and are dispatched to the cores (MPs) as resources become available]
10,000s of concurrent threads!
NVIDIA K20x GPU: 14 cores × 64 thread contexts × 32-wide SIMD = 28,672 concurrent threads
Using shared memory for efficient matrix product computations (using CUDA samples)
Walk-through matrix product code
More info about CUDA
- NVIDIA programming manual
- UDACITY interactive online course "Intro to Parallel Computing": highly recommended to skim through the first two lectures before the next class. https://www.udacity.com/course/cs344