
Quiz Questions: Lecture 2

1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index?
(A) i = threadIdx.x + threadIdx.y;
(B) i = blockIdx.x + threadIdx.x;
(C) i = blockIdx.x * blockDim.x + threadIdx.x;
(D) i = blockIdx.x * threadIdx.x;

2. We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that variable i should be the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?
(A) i = blockIdx.x * blockDim.x + threadIdx.x + 2;
(B) i = blockIdx.x * threadIdx.x * 2;
(C) i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
(D) i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

3. If a CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 4 thread blocks, which of the following block configurations would result in the most threads in the SM?
(A) 128 threads per block (B) 256 threads per block (C) 512 threads per block (D) 1024 threads per block

4. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
(A) 2000 (B) 2024 (C) 2048 (D) 2096

5. In the previous question, how many warps do you expect to have divergence due to the boundary check on vector length?
(A) 1 (B) 2 (C) 3 (D) 6
Answer: (A)

Quiz Questions: Lecture 3

1. For our tiled matrix-matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
(A) 1/8 of the original usage (B) 1/16 of the original usage (C) 1/32 of the original usage (D) 1/64 of the original usage

2. Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1 (B) 1000 (C) 512 (D) 512000

3. In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1 (B) 1000 (C) 512 (D) 512000

4. For the simple matrix-matrix multiplication (M x N) based on row-major layout, which input matrix will have coalesced accesses?
(A) M (B) N (C) M and N (D) Neither

5. For the tiled matrix-matrix multiplication (M x N) based on row-major layout, which input matrix will have coalesced accesses?
(A) M (B) N (C) M and N (D) Neither

Quiz Questions: Lecture 4

1. For the simple reduction kernel, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?
(A) 0 (B) 1 (C) 16 (D) 32
Answer: (D); all warps will have divergence throughout the execution.

2. For the improved reduction kernel, if the block size is 1024 and the warp size is 32, how many warps will have divergence during the 5th iteration?
(A) 0 (B) 1 (C) 16 (D) 32
Answer: (A); there are 64 consecutive active threads, more than the warp size.

3. For the work-efficient scan kernel, assume that we have 2048 elements. How many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase?
(A) (2048-1)*2 (B) (1024-1)*2 (C) 1024*1024 (D) 10*1024
Answer: (A)

4. For the work-inefficient scan kernel based on reduction trees, assume that we have 2048 elements. Which of the following gives the closest approximation of how many add operations will be performed?
(A) (2048-1)*2 (B) (1024-1)*2 (C) 1024*1024 (D) 10*1024

5. For the vector addition example where input vectors are read from disk, if the GPU kernel runs at 190 GFLOPS and the PCIe bus is able to deliver a bandwidth of 6 GB/s, which of the following is the closest approximation of the minimum time it would take to add two 190-mega-element vectors stored in the host memory and get the result back to the host memory?
(A) 190/190 ms (B) 190/6 ms (C) 8*190/6 ms (D) 2*190/6 ms

Quiz Questions: Lecture 5

1. What is the CUDA API call that makes sure that all previous kernel executions and memory copies have been completed?
(A) __syncthreads() (B) cudaDeviceSynchronize() (C) cudaStreamSynchronize() (D) barrier()

2. Which of the following statements is true?
(A) The data transfer between device and host is done by DMA hardware using virtual addresses.
(B) The OS automatically guarantees that any memory being used by a DMA device is not swapped out.
(C) If a swapped-out page is to be transferred by cudaMemcpy(), it needs to be copied to a pinned memory buffer first before being transferred.
(D) Pinned memory is allocated with the cudaMalloc() function.

Quiz Questions: Lecture 6

1. For vector addition, if there are 100,000 elements in each vector and we are using 3 compute processes, how many elements are we sending to the last compute process?
(A) 5 (B) 300 (C) 333 (D) 334

2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of 40000 bytes, what is the size of each data element being sent?
(A) 1 byte (B) 2 bytes (C) 4 bytes (D) 8 bytes

3. Which of the following statements is true?
(A) MPI_Send() is blocking by default.
(B) MPI_Recv() is blocking by default.
(C) MPI messages must be at least 128 bytes.
(D) MPI processes can access the same variable through shared memory.