Intro to GPU computing. Spring 2015, Mark Silberstein, Technion.
Slide 2: Serial vs. parallel program. A serial program executes one instruction at a time; a parallel program executes multiple instructions in parallel.
Slide 3: GPUs are parallel processors: throughput-oriented hardware. Example problem: I have 100 apples to eat. (1) High performance: finish one apple faster. (2) High throughput: finish all the apples faster. Performance = parallel hardware + a scalable parallel program!
Slide 4: Simplified GPU model (see the introductory paper on my home page: «GPUs: High-performance Accelerators for Parallel Applications»).
Slide 5: GPU 101. [Diagram: a CPU with its own memory and a GPU with its own memory.]
Slides 6-9: GPU is a co-processor. [Diagram build-up: the CPU runs the computation and offloads work to the GPU as a GPU kernel; CPU and GPU each have their own memory.]
Slide 10: Simple GPU program. Idea: the same set of operations is applied to different data chunks in parallel. Implementation: every thread runs the same code on a different data chunk, and the GPU concurrently runs many parallel threads.
Slides 11-14: Vector sum A = A + B.
Sequential algorithm: for every element i < A.size: A[i] = A[i] + B[i].
Parallel algorithm: in parallel, for every i < A.size: A[i] = A[i] + B[i]. The loop body is instantiated and executed on the GPU in thousands of hardware threads.
Execution on GPU: in parallel, for every threadId < total GPU kernel threads: A[threadId] = A[threadId] + B[threadId].
Slide 15: GPU programming basics.
Slide 16: GPU execution model. A thread is a sequence of sequentially-executed instructions; each thread has private state (registers, stack). Single Instruction Multiple Threads (SIMT) model: the same program, called a kernel, is instantiated and invoked for each thread, and each thread gets a unique ID. There are many threads (tens of thousands); threads wait in a hardware queue until resources become available. All threads share access to GPU memory.
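The "unique ID" above is not handed to the kernel directly; a minimal sketch of how a CUDA kernel typically derives it from the built-in block and thread indices (the kernel and buffer names here are illustrative, not from the slides):

```cuda
__global__ void record_my_id(int* out)
{
    // Global thread ID = block index * threads per block + index within the block.
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[id] = id;  // each thread touches only its own element, so there are no races
}
```

Because every thread computes a different id, the same kernel code operates on a different data element in each of its thousands of instances.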
Slide 17: CPU control of the GPU. The GPU* and CPU have different memory spaces, and the GPU is fully managed by the CPU: it cannot* access CPU RAM, cannot* reallocate its own memory, and cannot* start computations on its own. The CPU must allocate memory and transfer the input before kernel invocation, and must transfer the output and clean up memory after kernel termination.
Slide 18: Vector sum for A.size = 1024. GPU: A[threadId] = A[threadId] + B[threadId]. CPU: (1) allocate arrays in GPU memory; (2) copy data CPU -> GPU; (3) invoke the kernel with 1024 threads; (4) wait until it completes and copy data GPU -> CPU.
Slide 19: Vector sum kernel code in CUDA. The __global__ modifier signifies a GPU kernel; each thread obtains its unique ID and performs one operation (fine-grain parallelism: one element per thread).

    __global__ void sum(int* A, int* B) {
        int my = get_hw_thread_id();  // get unique thread ID (pseudocode)
        A[my] = A[my] + B[my];
    }
Slides 20-22: CPU code for GPU management. There are two sets of pointers: one for the CPU arrays and one for GPU memory, which is allocated by the CPU. The input is copied from CPU to GPU, pointers to device memory are passed at kernel invocation, the CPU waits until the kernel is over, and the result is copied back to the CPU. This is the typical code structure: the kernel is the GPU programming part; everything in main() is GPU integration.

    __global__ void sum(int* A, int* B) {
        int my = get_hw_thread_id();  // get unique thread ID (pseudocode)
        A[my] = A[my] + B[my];
    }

    int main(int argc, char** argv) {
        int *d_a, *d_b;          // GPU pointers
        int A[SIZE], B[SIZE];    // CPU arrays
        // GPU memory is allocated by the CPU
        cudaMalloc((void**)&d_a, size_bytes);
        cudaMalloc((void**)&d_b, size_bytes);
        // input copied from CPU to GPU
        cudaMemcpy(d_a, A, size_bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, B, size_bytes, cudaMemcpyHostToDevice);
        // pointers to device memory are passed at kernel invocation
        dim3 threads_in_block(512), blocks(2);
        sum<<<blocks, threads_in_block>>>(d_a, d_b);
        // wait until the kernel is over
        cudaDeviceSynchronize();
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess) {
            fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(error));
            return 1;
        }
        // copy data back to CPU
        cudaMemcpy(A, d_a, size_bytes, cudaMemcpyDeviceToHost);
        printf("Success");
        return 0;
    }
Slide 23: BUT! Vector sum is simple: purely data-parallel. What if we need coordination between tasks?
Slide 24: Inter-thread communication. A programming model's usability is determined by its means of inter-thread communication. No communication: easy for hardware, hard for software. Fine-grained communication: hard to make efficient in hardware, easy for software.
Slides 25-26: Communications: inner product. [Diagram: the pairwise products of the two vectors are summed into a single value.] Eventually all-to-all thread communication is required. How can we make it efficient?
Slide 27: Compromise: hierarchy. All threads are split into fixed-size groups: threadblocks. Threads in a threadblock can synchronize and communicate efficiently: they share fast memory and can use barriers.
Slide 28: Execution model: a stream of threadblocks. Threadblock scheduling is non-deterministic, so you cannot rely on a particular schedule for correctness; there is no inter-threadblock global barrier in the programming model.
Slides 29-30: Hierarchical dot product. [Diagram: the pairwise products are partitioned between Threadblock 1 and Threadblock 2.] Slow coordination should be rare: efficient communication handles the fine-grain task within each threadblock, while slow coordination is reserved for the coarse-grain task across threadblocks.
Slides 31-32: Dot product. The kernel uses fast local (shared) memory, two levels of indexing (block and thread), and fast synchronization within the threadblock.

    __global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout) {
        __shared__ float l_res[tbsize];  // fast local in-core memory
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int offset = bid * tbsize + tid;  // two levels of indexing
        l_res[tid] = ga[offset] * gb[offset];
        __syncthreads();  // fast sync: wait for all products in a threadblock
        // parallel reduction
        for (int i = tbsize / 2; i > 0; i /= 2) {
            if (tid < i) l_res[tid] = l_res[tid] + l_res[i + tid];
            __syncthreads();  // wait for all partial sums
        }
        if (tid == 0) gout[bid] = l_res[0];
    }
Slide 33: How to finalize the parallel reduction. Option 1: use the CPU. Option 2: use atomics. Option 3: use multiple kernel invocations (kernel completion = global barrier).
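A sketch of Option 3, assuming a second helper kernel (vector_sum_kernel, hypothetical, not from the slides) that adds up the per-block partial sums; the point is that kernel completion acts as the global barrier between the two passes:

```cuda
// Pass 1: each threadblock reduces its chunk to one partial sum in d_partial.
int n_blocks = SIZE / tbsize;
vector_dotproduct_kernel<<<n_blocks, tbsize>>>(d_a, d_b, d_partial);
// Kernel completion = global barrier: pass 2 can safely read every partial sum.
// Pass 2: a single threadblock reduces the n_blocks partial sums to one value.
vector_sum_kernel<<<1, n_blocks>>>(d_partial, d_out);
cudaDeviceSynchronize();
```

This works around the lack of an inter-threadblock barrier inside a kernel: synchronization across threadblocks happens only at kernel boundaries.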
Slide 34: Dot product with atomics. All threadblocks atomically update the global-memory result gout[0].

    __global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout) {
        __shared__ float l_res[tbsize];  // fast local in-core memory
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int offset = bid * tbsize + tid;
        l_res[tid] = ga[offset] * gb[offset];
        __syncthreads();  // wait for all products in a threadblock
        // parallel reduction
        for (int i = tbsize / 2; i > 0; i /= 2) {
            if (tid < i) l_res[tid] = l_res[tid] + l_res[i + tid];
            __syncthreads();  // wait for all partial sums
        }
        if (tid == 0) atomicAdd(gout, l_res[0]);
    }
Slide 35: GPU hardware in more detail.
Slide 36: BEGIN: terminology break.
Slide 37: Flynn's HARDWARE taxonomy.
Slide 38: SIMD: lock-step execution. Example: vector operations (SSE).
Slide 39: MIMD. Example: multicores.
Slide 40: END: terminology break.
Slide 41: GPU hardware characteristics. Massive parallelism: the GPU runs thousands of concurrent execution flows (threads), identified and explicitly programmed by the developer.
Slide 42: GPU hardware parallelism, level 1: MIMD = multi-core. [Diagram: a GPU with several cores (multiprocessors, MPs) sharing GPU memory.]
Slide 43: Level 2: SIMD = vector. [Diagram: each core contains SIMD vector units.]
Slides 44-48: Level 3: fine-grain multithreading, i.e. parallelism for latency hiding. [Diagram build-up: a core holds the execution state of several threads T1, T2, T3; as each thread issues a memory read (R 0x01, R 0x04, R 0x08), the core switches to another ready thread while the read is in flight.]
Slide 49: Putting it all together: 3 levels of hardware parallelism. [Diagram: multiple cores (MIMD), each with SIMD vector units and stored thread contexts Ctx 1..Ctx k for fine-grain multithreading.]
Slide 50: What is a GPU thread? [Diagram: a thread is one lane of a SIMD vector together with its stored thread context.]
Slide 51: What is a thread block? [Diagram: a thread block is the set of threads that runs on one core (MP) and shares its scratchpad memory.]
Slide 52: What is a GPU kernel? [Diagram: a kernel is a stream of thread blocks fed to the cores through a hardware queue.]
Slide 53: 10,000s of concurrent threads! NVIDIA K20x GPU: 64 x 14 x 32 = 28,672 concurrent threads.
Slide 54: Using shared memory for efficient matrix product computations (using the CUDA samples).
Slides 55-57: Walk-through of the matrix product code.
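The idea behind the CUDA-samples walk-through can be sketched as a tiled kernel; this is a simplified version (assuming square n x n matrices with n a multiple of the tile size, and illustrative names), not the exact samples code:

```cuda
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    // Tiles of A and B staged in fast shared (scratchpad) memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of each tile; barrier before use.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // keep fast threads from overwriting the tiles early
    }
    C[row * n + col] = acc;
}
```

Each global-memory element is loaded into shared memory once per tile pass and then reused TILE times by the threadblock, which is where the speedup over the naive kernel comes from.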
Slide 58: More info about CUDA. See the NVIDIA programming manual and the UDACITY interactive online course "Intro to parallel computing"; it is highly recommended to skim through the first two lectures before the next class.
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu
More informationL20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationGPGPU Parallel Merge Sort Algorithm
GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led
More informationCSE 6040 Computing for Data Analytics: Methods and Tools
CSE 6040 Computing for Data Analytics: Methods and Tools Lecture 12 Computer Architecture Overview and Why it Matters DA KUANG, POLO CHAU GEORGIA TECH FALL 2014 Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationOperating System: Scheduling
Process Management Operating System: Scheduling OS maintains a data structure for each process called Process Control Block (PCB) Information associated with each PCB: Process state: e.g. ready, or waiting
More informationParallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com
Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007
More informationOperating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest
Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest 1. Introduction Few years ago, parallel computers could
More informationAPPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, typeng2001@yahoo.com
More informationTowards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Please Note IBM s statements regarding its plans, directions, and intent are subject to
More informationReal-time Visual Tracker by Stream Processing
Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol
More informationUnleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
More informationGPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
More informationMulti-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
More informationPerformance Analysis for GPU Accelerated Applications
Center for Information Services and High Performance Computing (ZIH) Performance Analysis for GPU Accelerated Applications Working Together for more Insight Willersbau, Room A218 Tel. +49 351-463 - 39871
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationFirst-class User Level Threads
First-class User Level Threads based on paper: First-Class User Level Threads by Marsh, Scott, LeBlanc, and Markatos research paper, not merely an implementation report User-level Threads Threads managed
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationSGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah *, Jin-Woo Kim *, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung SAIT, SAMSUNG Electronics, Yonsei Univ. *,
More informationFrog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
X. SHI ET AL. 1 Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model Xuanhua Shi 1, Xuan Luo 1, Junling Liang 1, Peng Zhao 1, Sheng Di 2, Bingsheng He 3, and Hai Jin 1 1 Services Computing
More informationCOSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationComputer Systems II. Unix system calls. fork( ) wait( ) exit( ) How To Create New Processes? Creating and Executing Processes
Computer Systems II Creating and Executing Processes 1 Unix system calls fork( ) wait( ) exit( ) 2 How To Create New Processes? Underlying mechanism - A process runs fork to create a child process - Parent
More informationLearning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
More information