1. If we need to use each thread to calculate one output element of a vector addition, what would

Sandra Miller
9 years ago

Quiz Questions: Lecture 2

1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index?
(A) i = threadIdx.x + threadIdx.y;
(B) i = blockIdx.x + threadIdx.x;
(C) i = blockIdx.x*blockDim.x + threadIdx.x;
(D) i = blockIdx.x * threadIdx.x;

2. We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that variable i should be the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?
(A) i = blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i = blockIdx.x*threadIdx.x*2;
(C) i = (blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i = blockIdx.x*blockDim.x*2 + threadIdx.x;

3. A CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 4 thread blocks. Which of the following block configurations would result in the largest number of threads in the SM?
(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block

4. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
(A) 2000
(B) 2024
(C) 2048
(D) 2096
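The arithmetic behind the Lecture 2 questions can be checked on the host. The following is a minimal C sketch, not CUDA device code: the function names are hypothetical, the built-in variables `blockIdx.x`, `blockDim.x`, and `threadIdx.x` are passed in as plain integers, and the SM occupancy helper assumes that only the per-SM thread and block limits stated in question 3 apply (register and shared memory limits are ignored).

```c
#include <assert.h>

/* Data index when each thread handles one element
   (block_idx, block_dim, thread_idx stand in for the CUDA built-ins). */
int one_per_thread_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Index of the first of two adjacent elements handled by one thread. */
int two_per_thread_first_index(int block_idx, int block_dim, int thread_idx) {
    return (block_idx * block_dim + thread_idx) * 2;
}

/* Threads launched when n elements are covered by whole blocks
   (ceiling division on the block count). */
int threads_in_grid(int n, int block_size) {
    int blocks = (n + block_size - 1) / block_size;
    return blocks * block_size;
}

/* Threads resident on one SM, assuming only a thread limit and a block limit. */
int resident_threads(int block_size, int max_threads, int max_blocks) {
    int blocks = max_threads / block_size;
    if (blocks > max_blocks) blocks = max_blocks;
    return blocks * block_size;
}

/* Usage example with the numbers from the questions above. */
void check_lecture2_sketch(void) {
    assert(one_per_thread_index(2, 512, 5) == 1029);
    assert(two_per_thread_first_index(0, 512, 1) == 2);
    assert(threads_in_grid(2000, 512) == 2048);
    assert(resident_threads(128, 1536, 4) == 512);
    assert(resident_threads(256, 1536, 4) == 1024);
    assert(resident_threads(512, 1536, 4) == 1536);
    assert(resident_threads(1024, 1536, 4) == 1024);
}
```

In an actual kernel the same index expression would be followed by a boundary check such as `if (i < n)`, because the grid contains more threads than vector elements.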


5. In the previous question, how many warps do you expect to have divergence due to the boundary check on the vector length?
(A) 1
(B) 2
(C) 3
(D) 6
Answer: (A)

Quiz Questions: Lecture 3

1. For our tiled matrix-matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
(A) 1/8 of the original usage
(B) 1/16 of the original usage
(C) 1/32 of the original usage
(D) 1/64 of the original usage

2. Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) [...]

3. In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) 51200
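The counting in these questions can be reproduced with a short host-side C sketch. The helper names are hypothetical. The sketch assumes that a warp diverges on the boundary check `i < n` only when some of its threads pass the check and others fail it, that a local variable gets one instance per thread, and that a shared memory variable gets one instance per block.

```c
#include <assert.h>

/* Warps that straddle the vector boundary: some threads have i < n, others do not. */
int boundary_divergent_warps(int n, int total_threads, int warp_size) {
    int divergent = 0;
    for (int first = 0; first < total_threads; first += warp_size) {
        int last = first + warp_size - 1;
        if (first < n && last >= n) ++divergent;
    }
    return divergent;
}

/* One instance of a local (automatic) variable per thread. */
long local_variable_instances(long blocks, long threads_per_block) {
    return blocks * threads_per_block;
}

/* One instance of a shared memory variable per thread block. */
long shared_variable_instances(long blocks) {
    return blocks;
}

/* Usage example: 2000 elements covered by 2048 threads, and a
   launch of 1000 blocks with 512 threads each. */
void check_lecture3_counts(void) {
    assert(boundary_divergent_warps(2000, 2048, 32) == 1);
    assert(local_variable_instances(1000, 512) == 512000);
    assert(shared_variable_instances(1000) == 1000);
}
```

Warps that lie entirely beyond the boundary do not count as divergent here, since all of their threads take the same path.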


4. For the simple matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither

5. For the tiled matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither

Quiz Questions: Lecture 4

1. For the simple reduction kernel, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?
(A) 0
(B) 1
(C) 16
(D) 32, All warps will have divergence throughout the execution.

2. For the improved reduction kernel, if the block size is 1024 and the warp size is 32, how many warps will have divergence during the 5th iteration?
(A) 0
(B) 1
(C) 16
(D) 32
Answer: (A). There are 64 consecutive active threads, more than the warp size.
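The divergence counts for the two reduction kernels can be explored with a host-side C sketch. The kernels themselves are not shown in this quiz, so the activity conditions below are assumptions: the simple kernel is taken to activate threads with `t % stride == 0` for stride 1, 2, 4, ..., and the improved kernel to activate threads with `t < stride`, where the stride starts at the block size and halves each iteration (which gives the 64 active threads mentioned in the answer above). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*active_fn)(int t, int stride);

/* Assumed activity test of the simple kernel: every stride-th thread works. */
bool simple_active(int t, int stride) { return t % stride == 0; }

/* Assumed activity test of the improved kernel: the first `stride` threads work. */
bool improved_active(int t, int stride) { return t < stride; }

/* Count warps that contain both active and inactive threads. */
int divergent_warps(int block_size, int warp_size, int stride, active_fn is_active) {
    int divergent = 0;
    for (int w = 0; w < block_size / warp_size; ++w) {
        int active = 0;
        for (int lane = 0; lane < warp_size; ++lane)
            if (is_active(w * warp_size + lane, stride)) ++active;
        if (active > 0 && active < warp_size) ++divergent;
    }
    return divergent;
}

/* Usage example for the 5th iteration with 1024 threads per block:
   stride 16 in the simple kernel, stride 64 in the improved kernel. */
void check_reduction_divergence(void) {
    assert(divergent_warps(1024, 32, 16, simple_active) == 32);
    assert(divergent_warps(1024, 32, 64, improved_active) == 0);
    /* Once the stride drops below the warp size, one warp diverges. */
    assert(divergent_warps(1024, 32, 16, improved_active) == 1);
}
```

The improved scheme keeps active threads contiguous, so whole warps are either fully active or fully idle until fewer than 32 threads remain.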


3. For the work-efficient scan kernel, assume that we have 2048 elements. How many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase?
(A) (2048 - 1)*2
(B) (1024 - 1)*2
(C) 1024*1024
(D) 10*1024
Answer: (A)

4. For the work-inefficient scan kernel based on reduction trees, assume that we have 2048 elements. Which of the following gives the closest approximation of how many add operations will be performed?
(A) (2048 - 1)*2
(B) (1024 - 1)*2
(C) 1024*1024
(D) 10*[...]

5. For the vector addition example where input vectors are read from disk, if the GPU kernel runs at 190 GFLOPS and the PCIe is able to deliver a bandwidth of 6 GB/s, which of the following is the closest approximation of the minimum time it would take to add two 190 mega-element vectors stored in the host memory and get the result back to the host memory?
(A) 190 / 190 ms
(B) 190 / 6 ms
(C) 8 * 190 / 6 ms
(D) 2 * 190 / 6 ms

Quiz Questions: Lecture 5

1. What is the CUDA API call that makes sure that all previous kernel executions and memory copies have been completed?
(A) __syncthreads()
(B) cudaDeviceSynchronize()
(C) cudaStreamSynchronize()
(D) barrier()
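The add counts for the two scan variants can be tallied with a small host-side C sketch. The scan kernels are not reproduced in this quiz, so the loop structures are assumptions: the work-efficient scan is modeled as a reduction tree followed by a reverse tree over a power-of-two input, and the work-inefficient scan as a stride-doubling scan in which every element at or beyond the current stride performs one add per step. Function names are hypothetical.

```c
#include <assert.h>

/* Adds in the reduction tree phase for n elements (n a power of two). */
long reduction_phase_adds(long n) {
    long adds = 0;
    for (long stride = 1; stride < n; stride *= 2)
        adds += n / (2 * stride);
    return adds;
}

/* Adds in the reverse tree phase of the assumed work-efficient scan. */
long reverse_phase_adds(long n) {
    long adds = 0;
    for (long stride = n / 4; stride > 0; stride /= 2)
        adds += n / (2 * stride) - 1;
    return adds;
}

/* Adds in the assumed stride-doubling (work-inefficient) scan. */
long stride_doubling_adds(long n) {
    long adds = 0;
    for (long stride = 1; stride < n; stride *= 2)
        adds += n - stride;
    return adds;
}

/* Usage example for 2048 elements. */
void check_scan_add_counts(void) {
    assert(reduction_phase_adds(2048) == 2047);
    assert(reverse_phase_adds(2048) == 2036);
    /* Both phases together stay within the (2048 - 1)*2 bound. */
    assert(reduction_phase_adds(2048) + reverse_phase_adds(2048) <= 2 * (2048 - 1));
    assert(stride_doubling_adds(2048) == 20481);
}
```

Under these assumptions the work-efficient scan grows linearly with the input size, while the stride-doubling scan grows as n*log2(n).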


2. Which of the following statements is true?
(A) The data transfer between device and host is done by DMA hardware using virtual addresses.
(B) The OS automatically guarantees that any memory being used by a DMA device is not swapped out.
(C) If a swapped page is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.
(D) Pinned memory is allocated with the cudaMalloc() function.

Quiz Questions: Lecture 6

1. For vector addition, if there are 100,000 elements in each vector and we are using 3 compute processes, how many elements are we sending to the last compute process?
(A) 5
(B) 300
(C) 333
(D) [...]

2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of [...] bytes, what is the size of each data element being sent?
(A) 1 byte
(B) 2 bytes
(C) 4 bytes
(D) 8 bytes

3. Which of the following statements is true?
(A) MPI_Send() is blocking by default.
(B) MPI_Recv() is blocking by default.
(C) MPI messages must be at least 128 bytes.
(D) MPI processes can access the same variable through shared memory.
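The data partitioning in Lecture 6, question 1 can be sketched in plain C without calling MPI. The partitioning scheme is not shown in this quiz, so the sketch assumes that every compute process receives the integer quotient of the element count and that the last process also receives the remainder. The function name is hypothetical.

```c
#include <assert.h>

/* Elements sent to compute process `rank` out of `num_procs`,
   assuming the last process absorbs the remainder. */
int chunk_for_rank(int n, int num_procs, int rank) {
    int base = n / num_procs;
    if (rank == num_procs - 1)
        return n - base * (num_procs - 1);
    return base;
}

/* Usage example: 100,000 elements over 3 compute processes. */
void check_partition(void) {
    assert(chunk_for_rank(100000, 3, 0) == 33333);
    assert(chunk_for_rank(100000, 3, 1) == 33333);
    assert(chunk_for_rank(100000, 3, 2) == 33334);
}
```

In an MPI program the returned value would be the count argument of the corresponding MPI_Send call, and the number of bytes transferred would be that count multiplied by the size of the MPI datatype.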


CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

day1.pdf at /home/ytang/slides. Reference solutions coming soon. Online CUDA API documentation: http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @