Hands-on CUDA exercises




CUDA Exercises
- We have provided skeletons and solutions for 6 hands-on CUDA exercises.
- In each exercise (except for #5), you have to implement the missing portions of the code.
- You are finished when you compile and run the program and get the output "Correct!".
- Solutions are included in the "solution" folder of each exercise.

Compiling the Code: Windows
- Open the <project>.sln file in Microsoft Visual Studio.
- Build the project. There are 4 configuration choices: Release, Debug, EmuRelease, EmuDebug.
- To debug your code, build the EmuDebug configuration:
  - You can set breakpoints inside kernels (__global__ or __device__ functions).
  - You can debug the code as normal, even with printf!
  - Emulation uses one CPU thread per GPU thread.
  - Threads do not actually run in parallel on the GPU.

Compiling the Code: Linux
- nvcc <filename>.cu [-o <executable>]
  Builds release mode.
- nvcc -g <filename>.cu
  Builds debug (device) mode. You can debug host code but not device code (which runs on the GPU).
- nvcc -deviceemu <filename>.cu
  Builds device emulation mode. All code runs on the CPU, but without debug symbols.
- nvcc -deviceemu -g <filename>.cu
  Builds debug device emulation mode. All code runs on the CPU, with debug symbols. Debug using gdb or another Linux debugger.

1: Copying between host and device
- Start from the cudaMallocAndMemcpy template.
- Part 1: Allocate memory for pointers d_a and d_b on the device.
- Part 2: Copy h_a on the host to d_a on the device.
- Part 3: Do a device-to-device copy from d_a to d_b.
- Part 4: Copy d_b on the device back to h_a on the host.
- Part 5: Free d_a and d_b (device memory, released by calls made from host code).
- Bonus: Experiment with cudaMallocHost in place of malloc for allocating h_a.
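The five parts of this exercise could be sketched as follows. This is a minimal illustration, not the provided solution: the helper name roundTripMatches, the int element type, and the final check are assumptions made for the example. Only the pointer names h_a, d_a, and d_b come from the exercise text.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the exercise template): sends n ints
// host -> device -> device -> host and reports whether they arrive intact.
bool roundTripMatches(int n)
{
    const size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    if (h_a == NULL) return false;
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a = NULL, *d_b = NULL;
    // Part 1: allocate device memory for d_a and d_b.
    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess;
    // Part 2: host -> device.
    ok = ok && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    // Wipe the host copy so stale values cannot pass the check below.
    memset(h_a, 0, bytes);
    // Part 3: device -> device.
    ok = ok && cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess;
    // Part 4: device -> host.
    ok = ok && cudaMemcpy(h_a, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h_a[i] == i);

    // Part 5: release device memory (called from host code), then host memory.
    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    return ok;
}
```

For the bonus, h_a would be allocated with cudaMallocHost and released with cudaFreeHost instead of malloc and free.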

2: Launching kernels
- Start from the myFirstKernel template.
- Part 1: Allocate device memory for the result of the kernel using pointer d_a.
- Part 2: Configure and launch the kernel using a 1-D grid of 1-D thread blocks.
- Part 3: Have each thread set an element of d_a as follows:
    idx = blockIdx.x*blockDim.x + threadIdx.x
    d_a[idx] = 1000*blockIdx.x + threadIdx.x
- Part 4: Copy the result in d_a back to the host pointer h_a.
- Part 5: Verify that the result is correct.
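A possible shape for this exercise is sketched below, under stated assumptions: the kernel name follows the template name, while the helper kernelResultIsCorrect, its parameters, and the int element type are hypothetical choices for illustration. The indexing formula is the one given in Part 3.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Part 3: each thread writes one element, encoding its block and thread index.
__global__ void myFirstKernel(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = 1000 * blockIdx.x + threadIdx.x;
}

// Hypothetical helper (an assumption, not part of the template).
bool kernelResultIsCorrect(int numBlocks, int numThreadsPerBlock)
{
    const int n = numBlocks * numThreadsPerBlock;
    const size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a = NULL;
    // Part 1: allocate device memory for the kernel result.
    if (h_a == NULL || cudaMalloc((void **)&d_a, bytes) != cudaSuccess) {
        free(h_a);
        return false;
    }

    // Part 2: 1-D grid of 1-D thread blocks.
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    myFirstKernel<<<dimGrid, dimBlock>>>(d_a);

    // Part 4: copy back; this blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Part 5: verify every element against the formula.
    for (int b = 0; ok && b < numBlocks; ++b)
        for (int t = 0; ok && t < numThreadsPerBlock; ++t)
            ok = (h_a[b * numThreadsPerBlock + t] == 1000 * b + t);

    cudaFree(d_a);
    free(h_a);
    return ok;
}
```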

3: Reverse Array (single block)
- Given an input array {a_0, a_1, ..., a_(n-1)} in pointer d_a, store the reversed array {a_(n-1), a_(n-2), ..., a_0} in pointer d_b.
- Start from the reverseArray_singleblock template.
- Only one thread block is launched, to reverse an array of size N = numThreads = 256 elements.
- Part 1 (of 1): All you have to do is implement the body of the kernel reverseArrayBlock().
- Each thread moves a single element to its reversed position:
  - Read input from the d_a pointer.
  - Store output in the reversed location in the d_b pointer.
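One way the single-block kernel body might look is sketched here. The parameter order (d_b, d_a), the int element type, and the helper singleBlockReverseIsCorrect are assumptions for illustration and may differ from the template.

```cuda
#include <cuda_runtime.h>

// Each thread moves one element from position in to the mirrored position out.
__global__ void reverseArrayBlock(int *d_b, const int *d_a)
{
    int in = threadIdx.x;
    int out = blockDim.x - 1 - in;
    d_b[out] = d_a[in];
}

// Hypothetical helper: reverses 256 ints with one block and checks the result.
bool singleBlockReverseIsCorrect()
{
    const int numThreads = 256;
    const size_t bytes = numThreads * sizeof(int);
    int h_a[numThreads], h_b[numThreads];
    for (int i = 0; i < numThreads; ++i) h_a[i] = i;

    int *d_a = NULL, *d_b = NULL;
    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // A single block of numThreads threads covers the whole array.
        reverseArrayBlock<<<1, numThreads>>>(d_b, d_a);
        ok = cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < numThreads; ++i)
        ok = (h_b[i] == h_a[numThreads - 1 - i]);

    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}
```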

4: Reverse Array (multiblock)
- Given an input array {a_0, a_1, ..., a_(n-1)} in pointer d_a, store the reversed array {a_(n-1), a_(n-2), ..., a_0} in pointer d_b.
- Start from the reverseArray_multiblock template.
- Multiple 256-thread blocks are launched: to reverse an array of size N, launch N/256 blocks.
- Part 1: Compute the number of blocks to launch.
- Part 2: Implement the kernel reverseArrayBlock().
- Note that now you must compute both:
  - The reversed location within the block.
  - The reversed offset to the start of the block.
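The two quantities named in the note can be sketched like this. The sketch assumes N is an exact multiple of the block size, and the parameter order, int element type, and the helper multiBlockReverseIsCorrect are hypothetical.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void reverseArrayBlock(int *d_b, const int *d_a)
{
    int inOffset = blockDim.x * blockIdx.x;
    // Reversed offset to the start of the block: block k writes where
    // block (gridDim.x - 1 - k) read.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in = inOffset + threadIdx.x;
    // Reversed location within the block.
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_b[out] = d_a[in];
}

// Hypothetical helper: n must be a multiple of 256 in this sketch.
bool multiBlockReverseIsCorrect(int n)
{
    const int numThreadsPerBlock = 256;
    const int numBlocks = n / numThreadsPerBlock;  // Part 1
    const size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *h_b = (int *)malloc(bytes);
    int *d_a = NULL, *d_b = NULL;
    bool ok = h_a != NULL && h_b != NULL && numBlocks > 0
           && n % numThreadsPerBlock == 0;
    for (int i = 0; ok && i < n; ++i) h_a[i] = i;

    ok = ok && cudaMalloc((void **)&d_a, bytes) == cudaSuccess
            && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
            && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseArrayBlock<<<numBlocks, numThreadsPerBlock>>>(d_b, d_a);
        ok = cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) ok = (h_b[i] == h_a[n - 1 - i]);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    free(h_b);
    return ok;
}
```

In this version each thread reads a coalesced address but writes to a descending one, which is the access pattern the profiling exercise examines.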

5: Profiling Array Reversal
- Your array reversal has a performance problem.
- Use the CUDA Visual Profiler to run your compiled program:
  - Compile in release mode, run cudaprof, and create a new project.
  - Browse to your executable file in the launch box of the session settings dialog.

5: Profiling Array Reversal (continued)
- Click on the configuration tab.
- Select the check box next to the signal list.
- Click OK, then Start.
- Check if any of these are non-zero:
  - GLD_INCOHERENT
  - GST_INCOHERENT
  - WARP_SERIALIZE
- Take a note of the GPU Time.

6: Optimizing Array Reversal
- Goal: Get rid of incoherent loads/stores and improve performance.
- Use shared memory to reverse each block.
- Part 1: Compute the number of bytes of shared memory (one element per thread).
- Part 2: Implement the kernel.
  - The comments should help.
  - Don't forget to compute the correct block offset!
- Part 3: Profile the working code.
  - Compare the values of GLD/GST_INCOHERENT to the previous run.
  - Compare the GPU Time to the previous run.

Reverse Data in shared memory (diagram: 16 elements, 4 threads per block)
- Input global memory: 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15, with blockOffset = 0, 4, 8, 12. Input addresses are linear and aligned = coalesced.
- Registers (local variables): 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
- Shared memory: 3 2 1 0 | 7 6 5 4 | 11 10 9 8 | 15 14 13 12
- Output global memory: 15 14 13 12 | 11 10 9 8 | 7 6 5 4 | 3 2 1 0. Output addresses are linear and aligned = coalesced.
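The staging scheme in this diagram could be sketched as below: each block reverses its elements in shared memory, then writes them out in ascending address order at the mirrored block offset. The name s_data, the parameter order, the int element type, and the helper sharedMemReverseIsCorrect are assumptions for illustration, not the provided solution.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void reverseArrayBlock(int *d_b, const int *d_a)
{
    // Size is supplied at launch: one element per thread.
    extern __shared__ int s_data[];

    // Linear, aligned read from global memory into the reversed shared slot.
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    s_data[blockDim.x - 1 - threadIdx.x] = d_a[in];

    // Every element of the block must be staged before anyone reads it back.
    __syncthreads();

    // Linear, aligned write to the mirrored block offset.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_b[outOffset + threadIdx.x] = s_data[threadIdx.x];
}

// Hypothetical helper: reverses numBlocks * numThreadsPerBlock ints and checks.
bool sharedMemReverseIsCorrect(int numBlocks, int numThreadsPerBlock)
{
    const int n = numBlocks * numThreadsPerBlock;
    const size_t bytes = n * sizeof(int);
    // Part 1: bytes of shared memory per block.
    const size_t sharedMemSize = numThreadsPerBlock * sizeof(int);

    int *h_a = (int *)malloc(bytes);
    int *h_b = (int *)malloc(bytes);
    int *d_a = NULL, *d_b = NULL;
    bool ok = h_a != NULL && h_b != NULL;
    for (int i = 0; ok && i < n; ++i) h_a[i] = i;

    ok = ok && cudaMalloc((void **)&d_a, bytes) == cudaSuccess
            && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
            && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseArrayBlock<<<numBlocks, numThreadsPerBlock, sharedMemSize>>>(d_b, d_a);
        ok = cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) ok = (h_b[i] == h_a[n - 1 - i]);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    free(h_b);
    return ok;
}
```

Calling sharedMemReverseIsCorrect(4, 4) reproduces the 16-element, 4-thread-per-block case drawn in the diagram.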