Intro to GPU Computing
Spring 2015, Mark Silberstein, 048661, Technion
Serial vs. parallel program
A serial program executes one instruction at a time; a parallel program executes multiple instructions in parallel.
GPUs are parallel processors
Throughput-oriented hardware. Example problem: I have 100 apples to eat.
1) High performance: finish one apple faster.
2) High throughput: finish all apples faster.
Performance = parallel hardware + scalable parallel program!
Simplified GPU model
(See the introductory paper on my home page: «GPUs: High-performance Accelerators for Parallel Applications».)
GPU 101
[diagram: CPU and GPU, each with its own memory]
GPU is a co-processor
[diagram: the CPU runs the computation; the GPU sits alongside it, and each has its own memory]
Co-processor model
[diagram: the CPU offloads part of its computation to the GPU as a GPU kernel; CPU and GPU each have their own memory]
Simple GPU program
Idea: the same set of operations is applied to different data chunks in parallel.
Implementation: every thread runs the same code on a different data chunk; the GPU concurrently runs many parallel threads.
Vector sum A = A + B
Sequential algorithm: for every element i < A.size: A[i] = A[i] + B[i]
Parallel algorithm: in parallel, for every i < A.size: A[i] = A[i] + B[i]
Execution on GPU: in parallel, for every threadId < total GPU kernel threads: A[threadId] = A[threadId] + B[threadId]
The kernel is instantiated and executed on the GPU in thousands of hardware threads.
GPU programming basics
GPU execution model
- Thread: a sequence of sequentially executed instructions. Each thread has private state: registers, stack.
- Single Instruction Multiple Threads (SIMT) model: the same program, called a kernel, is instantiated and invoked for each thread. Each thread gets a unique ID.
- Many threads (tens of thousands). Threads wait in a hardware queue until resources become available.
- All threads share access to GPU memory.
CPU control of GPU
- GPU* and CPU have different memory spaces.
- The GPU is fully managed by the CPU: it cannot* access CPU RAM, cannot* reallocate its own memory, and cannot* start computations on its own.
- The CPU must allocate memory and transfer the input before kernel invocation.
- The CPU must transfer the output and clean up memory after kernel termination.
Vector sum for A.size = 1024
GPU: A[threadId] = A[threadId] + B[threadId]
CPU:
1. Allocate arrays in GPU memory
2. Copy data CPU -> GPU
3. Invoke the kernel with 1024 threads
4. Wait until it completes and copy data GPU -> CPU
Vector sum kernel code in CUDA

__global__ void sum(int* A, int* B) {                // __global__: CUDA modifier that marks GPU kernels
    int my = blockIdx.x * blockDim.x + threadIdx.x;  // get unique thread ID
    A[my] = A[my] + B[my];                           // fine-grain parallelism: one operation per thread
}
CPU code for GPU management

#define SIZE 1024

__global__ void sum(int* A, int* B) {                // GPU code
    int my = blockIdx.x * blockDim.x + threadIdx.x;
    A[my] = A[my] + B[my];
}

int main(int argc, char** argv) {                    // CPU code
    int* d_a; int* d_b;                              // two sets of pointers: GPU memory vs. CPU memory
    int a[SIZE], b[SIZE];
    size_t size_bytes = SIZE * sizeof(int);

    // GPU memory is allocated by the CPU
    cudaMalloc((void**)&d_a, size_bytes);
    cudaMalloc((void**)&d_b, size_bytes);

    // input is copied from CPU to GPU
    cudaMemcpy(d_a, a, size_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size_bytes, cudaMemcpyHostToDevice);

    // pass pointers to device memory at kernel invocation
    dim3 threads_in_block(512), blocks(2);
    sum<<<blocks, threads_in_block>>>(d_a, d_b);

    // wait until the kernel is over
    cudaDeviceSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(error));
        return 1;
    }

    // copy data back to CPU
    cudaMemcpy(a, d_a, size_bytes, cudaMemcpyDeviceToHost);
    printf("Success");
    return 0;
}
Typical code structure (same code as the previous slide)
GPU programming: the kernel (the __global__ function) implements the parallel computation.
GPU integration: the CPU code handles allocation, data transfers, kernel invocation, synchronization, error checking, and copying results back.
BUT! Vector sum is simple: purely data-parallel. What if we need coordination between tasks?
Inter-thread communication
Programming model usability is determined by the means of inter-thread communication:
- No communication: easy for hardware, hard for software.
- Fine-grained communication: hard to make efficient in hardware, easy for software.
Communications: inner product
[diagram: elementwise products (x) combined by a tree of additions (+)]
Eventually, all-to-all thread communication is required. How can we make it efficient?
Compromise: hierarchy
All threads are split into fixed-size groups: threadblocks. Threads in a threadblock can synchronize and communicate efficiently:
- they share fast memory
- they can use barriers
Execution model: stream of threadblocks
Threadblock scheduling is non-deterministic: you cannot rely on a particular schedule for correctness. There is no inter-threadblock global barrier in the programming model.
Hierarchical dot product
[diagram: the products (x) and partial sums (+) are split between Threadblock 1 and Threadblock 2; each threadblock reduces its own portion]
Slow coordination should be rare
[diagram: fine-grain tasks use efficient communication inside Threadblock 1 and Threadblock 2; the coarse-grain task combines their results with slow coordination]
Dot product

#define TBSIZE 512

__global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout)
{
    __shared__ float l_res[TBSIZE];   // fast local (per-core) memory

    int tid = threadIdx.x;            // two levels of indexing:
    int bid = blockIdx.x;             // thread within block, block within grid
    int offset = bid * TBSIZE + tid;

    l_res[tid] = ga[offset] * gb[offset];
    __syncthreads();                  // fast sync: wait for all products in a threadblock

    // parallel reduction
    for (int i = TBSIZE / 2; i > 0; i /= 2) {
        if (tid < i)
            l_res[tid] = l_res[tid] + l_res[i + tid];
        __syncthreads();              // wait for all partial sums
    }

    if (tid == 0)
        gout[bid] = l_res[0];
}
How to finalize the parallel reduction?
- Option 1: use the CPU
- Option 2: use atomics
- Option 3: use multiple kernel invocations (kernel completion = global barrier)
Dot product with atomics

__global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout)
{
    __shared__ float l_res[TBSIZE];   // local core memory

    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int offset = bid * TBSIZE + tid;

    l_res[tid] = ga[offset] * gb[offset];
    __syncthreads();                  // wait for all products in a threadblock

    // parallel reduction
    for (int i = TBSIZE / 2; i > 0; i /= 2) {
        if (tid < i)
            l_res[tid] = l_res[tid] + l_res[i + tid];
        __syncthreads();              // wait for all partial sums
    }

    // all threadblocks atomically update the global-memory result gout[0]
    if (tid == 0)
        atomicAdd(gout, l_res[0]);
}
GPU hardware in more detail
BEGIN: Terminology break
Flynn's HARDWARE taxonomy
SIMD: lock-step execution. Example: vector operations (SSE).
MIMD. Example: multicore CPUs.
END: Terminology break
GPU hardware characteristics
Massive parallelism: the GPU runs thousands of concurrent execution flows (threads), identified and explicitly programmed by the developer.
GPU hardware parallelism 1. MIMD = multi-core
[diagram: GPU with GPU memory and several cores (multiprocessors, MPs)]
GPU hardware parallelism 2. SIMD = vector
[diagram: each core contains several SIMD vector units]
Fine-grain multithreading 3. Parallelism for latency hiding
[diagram sequence: a core (MP) holds the execution state of several threads T1, T2, T3; when one thread issues a memory read (R 0x01), the core switches to another thread instead of stalling, so reads R 0x01, R 0x04, R 0x08 from different threads are in flight at once]
Putting it all together: 3 levels of hardware parallelism
[diagram: GPU with GPU memory; multiple cores (MPs); each core holds thread state contexts Ctx 1..k and a SIMD vector unit]
What is a GPU thread?
[same diagram: a GPU thread (Thread 1 .. Thread n) corresponds to one execution context (thread state Ctx) on a core]
What is a thread block?
[same diagram: a thread block is a group of thread contexts resident on one core (MP), sharing the scratchpad memory in that core]
What is a GPU kernel?
[diagram: a kernel is a collection of thread blocks; blocks wait in a hardware queue and are dispatched to the cores (MPs) as resources become available]
10,000s of concurrent threads!
NVIDIA K20x GPU: 14 cores × 64 thread contexts × 32-wide SIMD = 28,672 concurrent threads
Using shared memory for efficient matrix product computations (using CUDA samples)
Walk-through matrix product code
More info about CUDA
- NVIDIA programming manual
- UDACITY interactive online course "Intro to Parallel Computing": highly recommended to skim through the first two lectures before the next class. https://www.udacity.com/course/cs344