CS 179 Lecture 6. Synchronization, Matrix Transpose, Profiling, AWS Cluster

Synchronization. The ideal case for parallelism: no resources shared between threads and no communication between threads. Many algorithms that require just a little bit of resource sharing can still be accelerated by the massive parallelism of the GPU.

Examples needing synchronization: (1) a block loads data into shared memory before processing it; (2) summing a list of numbers.

__syncthreads(). __syncthreads() synchronizes all threads in a block. Useful for the "load data into shared memory" example. There is no grid-wide equivalent of __syncthreads().
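A minimal sketch of this pattern (the kernel name and block size here are just illustrative):

__global__ void reverseTileKernel(const float *in, float *out) {
    // assumes a 1D launch with 128 threads per block; each block reverses its own tile
    __shared__ float tile[128];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];

    // wait until every thread in the block has finished its load
    __syncthreads();

    // only now is it safe to read an element loaded by another thread
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}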

Atomic instructions: motivation. Two threads try to increment the variable x = 42 concurrently; the final value should be 44. A possible execution order:
thread 0 loads x (= 42) into register r0
thread 1 loads x (= 42) into register r1
thread 0 increments r0 to 43
thread 1 increments r1 to 43
thread 0 stores r0 (= 43) into x
thread 1 stores r1 (= 43) into x
Actual final value of x: 43 :(

Atomic instructions. An atomic instruction executes as a single unit and cannot be interrupted.

Atomic instructions on CUDA: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}. Example syntax: atomicAdd(float *address, float val). These work in both global and shared memory!
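For example, a hypothetical histogram kernel where many threads may hit the same bin (names are illustrative; bins is assumed zero-initialized and values assumed in range):

__global__ void histogramKernel(const int *values, int n, int *bins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[values[idx]], 1);   // the read-modify-write happens as one unit
}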

The fun world of parallelism. All of the atomic instructions can be implemented given compare-and-swap: atomicCAS(int *address, int compare, int val). CAS is very powerful; it can also implement locks, lock-free data structures, etc. The Art of Multiprocessor Programming is recommended to learn more.
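As an illustration, here is a sketch of how an atomic float add could be built out of atomicCAS (CUDA already provides atomicAdd for float, so this is purely to show the retry-loop pattern):

__device__ float casAdd(float *address, float val) {
    int *address_as_int = (int *) address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) + val;
        // swap in the new value only if nobody else changed it in the meantime
        old = atomicCAS(address_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);   // otherwise retry with the freshly observed value
    return __int_as_float(old);
}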

Warp-synchronous programming. What if I only need to synchronize between the threads of a single warp? Warps are already synchronized! This can save unneeded __syncthreads() calls, but the code is fragile and can be broken by compiler optimizations.

Warp vote & warp shuffle. Safer warp-synchronous programming (and doesn't use shared memory). Warp vote intrinsics: __all, __any, __ballot.
int x = threadIdx.x;  // goes from 0 to 31
__any(x < 16) == true;
__all(x < 16) == false;
__ballot(x < 16) == (1 << 16) - 1;

Warp shuffle. Reads the value of a register from another thread in the warp: int __shfl(int var, int srcLane, int width=warpSize). Extremely useful for computing the sum of values across a warp (and other reductions; more next time). First available on Kepler (not Fermi; compute capability >= 3.0 only).
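A sketch of a warp-wide sum using shuffle (this uses the related __shfl_down intrinsic; on CUDA 9 and later the same operation is spelled __shfl_down_sync(0xffffffff, ...)):

__device__ int warpSum(int x) {
    // each step folds in the value held by the lane `offset` positions above
    for (int offset = 16; offset > 0; offset /= 2)
        x += __shfl_down(x, offset);
    return x;   // lane 0 now holds the sum over all 32 lanes
}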

(Synchronization) budget advice. Do more cheap things and fewer expensive things! Example: computing the sum of a list of numbers. Naive: each thread atomically adds each of its numbers to an accumulator in global memory.

Sum example. Smarter solution: each thread computes its own partial sum in a register; use warp shuffle to compute the sum over each warp; each warp then does a single atomic add to the accumulator in global memory.
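A minimal sketch of this scheme (the kernel name and grid-stride loop are illustrative; the reduction uses the same __shfl_down intrinsic as the warp-sum sketch above):

__global__ void sumKernel(const float *data, int n, float *accumulator) {
    float local = 0.0f;

    // cheapest: each thread accumulates its own elements in a register
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        local += data[i];

    // cheap: warp shuffle reduction, no shared memory needed
    for (int offset = 16; offset > 0; offset /= 2)
        local += __shfl_down(local, offset);

    // expensive, so done rarely: one atomic per warp instead of one per element
    if (threadIdx.x % 32 == 0)
        atomicAdd(accumulator, local);
}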

Set 2. (1) Questions on latency hiding, thread divergence, coalesced memory access, bank conflicts, and instruction dependencies. (2) Putting it into action: optimizing matrix transpose. You need to comment on all non-coalesced memory accesses and bank conflicts in your code.

Matrix transpose. A great IO problem, because you have a stride-1 access and a stride-n access. Transpose is just a fancy memcpy, so memcpy provides a great performance target.

Matrix Transpose

__global__
void naiveTransposeKernel(const float *input, float *output, int n) {
    // launched with (64, 16) block size and (n / 64, n / 64) grid size
    // each block transposes a 64x64 block
    const int i = threadIdx.x + 64 * blockIdx.x;
    int j = 4 * threadIdx.y + 64 * blockIdx.y;
    const int end_j = j + 4;

    for (; j < end_j; j++) {
        output[j + n * i] = input[i + n * j];
    }
}

Shared memory & matrix transpose. Idea to avoid non-coalesced accesses:
load from global memory with stride 1
store into shared memory with stride x
__syncthreads()
load from shared memory with stride y
store to global memory with stride 1
Choose the values of x and y so that the result is the transpose.

Avoiding bank conflicts. You can choose x and y to avoid bank conflicts. A stride-n access to shared memory avoids bank conflicts iff gcd(n, 32) == 1.
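One common way to get such strides is to pad a shared tile by one column; a minimal sketch of the idea with a 32x32 tile (the block and tile sizes here are illustrative, not the ones used in the set):

__global__ void paddedTileTransposeKernel(const float *input, float *output, int n) {
    // 33 columns instead of 32: a column walk then has stride 33, and
    // gcd(33, 32) == 1, so the 32 threads of a warp hit 32 different banks
    __shared__ float tile[32][33];

    // assumes 32x32 blocks covering an n x n matrix, with n a multiple of 32
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    // coalesced, stride-1 load from global memory into the tile
    tile[threadIdx.y][threadIdx.x] = input[y * n + x];
    __syncthreads();

    // read the tile transposed (a conflict-free column walk) and store coalesced
    int tx = blockIdx.y * 32 + threadIdx.x;
    int ty = blockIdx.x * 32 + threadIdx.y;
    output[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}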

Two versions of the same kernel. You have to write 2 kernels for the set: (1) shmemTransposeKernel, which should have all of the memory access optimizations I just talked about; (2) optimalTransposeKernel, built on top of shmemTransposeKernel, but including any optimization tricks that you want.

Possible optimizations:
reduce & separate instruction dependencies (note that everything depends on the writes)
unroll loops to reduce bounds-checking overhead
try rewriting your code to use 64-bit or 128-bit loads (with float2 or float4; see the sketch after this list)
take a warp-centric approach rather than a block-centric one and use warp shuffle rather than shared memory (this will not be built on top of shmemTransposeKernel). I'll allow you to use this as a guide.
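A sketch of the float4 idea on its own, as a plain copy (assumes the pointers are 16-byte aligned and n is a multiple of 4; names are illustrative):

__global__ void copy4Kernel(const float *input, float *output, int n) {
    const float4 *in4  = reinterpret_cast<const float4 *>(input);
    float4       *out4 = reinterpret_cast<float4 *>(output);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n / 4)
        out4[idx] = in4[idx];   // one 128-bit load and one 128-bit store per thread
}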

Profiling. Profiling = analyzing where a program spends its time. Putting effort into optimizing without profiling is foolish. There is a great visual (GUI) profiler for CUDA called nvvp, but using it is a bit of a pain with a remote GPU. nvprof is the command-line profiler (works on minuteman), so let's check that out!

nvprof demo. If you didn't catch the demo in class (or even if you did), this blog post and the guide will be useful.

Profiler tips. List all profiler events with nvprof --query-events and all metrics with nvprof --query-metrics. There are many useful metrics; some good ones are achieved_occupancy, ipc, shared_replay_overhead, all of the utilizations, and all of the throughputs. Some useful events are global_ld_mem_divergence_replays, global_st_mem_divergence_replays, shared_load_replay, and shared_store_replay.

Profile example (1)

[emartin@minuteman:~/set2]> nvprof --events global_ld_mem_divergence_replays,global_st_mem_divergence_replays --metrics achieved_occupancy ./transpose 4096 naive
==11043== NVPROF is profiling process 11043, command: ./transpose 4096 naive
==11043== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Size 4096 naive GPU: 33.305279 ms
==11043== Profiling application: ./transpose 4096 naive
==11043== Profiling result:
==11043== Event result:

Profile example (2)

Invocations   Event Name                           Min        Max        Avg
Device "GeForce GTX 780 (0)"
  Kernel: naiveTransposeKernel(float const *, float*, int)
          1   global_ld_mem_divergence_replays     0          0          0
          1   global_st_mem_divergence_replays     16252928   16252928   16252928

==11043== Metric result:
Invocations   Metric Name          Metric Description   Min        Max        Avg
Device "GeForce GTX 780 (0)"
  Kernel: naiveTransposeKernel(float const *, float*, int)
          1   achieved_occupancy   Achieved Occupancy   0.862066   0.862066   0.862066

Profiling interpretation. Lots of non-coalesced stores and about 86% achieved occupancy for the naive transpose kernel.

Amazon Cluster. Submit jobs with qsub. You must specify -l gpu=1 for CUDA:
>qsub -l gpu=1 job.sh
Binaries can be run directly if the binary option is specified:
>qsub -l gpu=1 -b y ./program
Scripts and binaries are run in your home directory by default. Use the -cwd option to run in the current folder:
~/set10/files/>qsub -l gpu=1 -cwd job.sh

Amazon Cluster (2). View running jobs with qstat:
>qstat
The stdout and stderr are stored in files after the job completes. These files have the job number appended:
>qsub -l gpu=1 -cwd job.sh
>ls
job.sh.o5 job.sh.e5

Amazon Cluster (3). Login requires SSH keys:
>ssh -i user123.rsa user123@54.163.37.252
You can also use the URL instead of the IP: ec2-54-163-37-252.compute-1.amazonaws.com
Windows users must use PuTTYgen to convert the key to PuTTY format. Keys will be distributed to each haru account.