Presented by Damodaran Ramani


CUDA Tricks Presented by Damodaran Ramani

Synopsis
- Scan algorithm
- Applications
- Specialized libraries
  - CUDPP: CUDA Data Parallel Primitives Library
  - Thrust: a template library for CUDA applications
  - CUDA FFT and BLAS libraries for the GPU

References
- Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan Primitives for GPU Computing.
- Gary J. Katz. Presentation on scan primitives, based on the article Parallel Prefix Sum (Scan) with CUDA by Harris, Sengupta, and Owens (GPU Gems 3, Chapter 39).

Introduction
- GPUs are massively parallel processors.
- The programmable parts of the graphics pipeline operate on primitives (vertices, fragments).
- These primitive programs spawn a thread for each primitive to keep the parallel processors full.

Stream programming model
- Typical uses: particle systems, image processing, grid-based fluid simulations, and dense matrix algebra.
- A fragment program operating on $n$ fragments makes $O(n)$ memory accesses.
- Problems arise when the access requirements are complex (e.g., prefix sum, which needs $O(n^2)$ accesses).

Prefix-Sum Example
in:  3 1 7 0 4 1 6 3
out: 0 3 4 11 11 15 16 22

Trivial Sequential Implementation

    /* Exclusive prefix sum: out[i] holds the sum of in[0..i-1]. */
    void scan(int* in, int* out, int n)
    {
        out[0] = 0;
        for (int i = 1; i < n; i++)
            out[i] = in[i-1] + out[i-1];
    }

Scan: An Efficient Parallel Primitive
- We are interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.
- Why CUDA? General load-store memory architecture, on-chip shared memory, and thread synchronization.

Threads & Blocks
- GeForce 8800 GTX: 16 multiprocessors with 8 processors each.
- CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads.
- Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU.
- Within a thread block, threads can communicate through shared memory and cooperate through synchronization.
- Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks. This makes programming more complex but brings large performance benefits.

The Scan Operator
Definition: The scan operation takes a binary associative operator $\oplus$ with identity $I$ and an array of $n$ elements $[a_0, a_1, \ldots, a_{n-1}]$, and returns the array $[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$.
Types: inclusive, exclusive, forward, backward.
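To make the definition concrete, here is a minimal sequential sketch in C of the exclusive and inclusive forms for an arbitrary associative operator. The names `binop`, `add`, `max_of`, `exclusive_scan`, and `inclusive_scan` are illustrative and do not come from the slides or from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator type: any associative binary operator on int. */
typedef int (*binop)(int, int);

int add(int a, int b) { return a + b; }              /* identity: 0 */
int max_of(int a, int b) { return a > b ? a : b; }   /* identity: INT_MIN */

/* Exclusive scan: out[i] = identity op in[0] op ... op in[i-1]. */
void exclusive_scan(const int *in, int *out, size_t n, binop op, int identity)
{
    assert(op != NULL);
    int acc = identity;
    for (size_t i = 0; i < n; i++) {
        out[i] = acc;            /* write the running value before adding in[i] */
        acc = op(acc, in[i]);
    }
}

/* Inclusive scan: out[i] = in[0] op ... op in[i]. */
void inclusive_scan(const int *in, int *out, size_t n, binop op, int identity)
{
    assert(op != NULL);
    int acc = identity;
    for (size_t i = 0; i < n; i++) {
        acc = op(acc, in[i]);    /* fold in[i] first, then write */
        out[i] = acc;
    }
}
```

With `add` and identity 0, the exclusive form reproduces the prefix-sum example above. The inclusive form gives the same values shifted left by one position, with the total of all inputs as the last element.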

Parallel Scan

    for (d = 1; d <= log2(n); d++)
        for all k in parallel
            if (k >= 2^(d-1))
                x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
            else
                x[out][k] = x[in][k]

The in and out buffers swap roles after each step. Complexity: $O(n \log_2 n)$ additions.
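The double-buffered scheme can be simulated on the CPU to check the indexing. The following is a sketch in plain C, not the CUDA kernel from the referenced article: the inner loop stands in for the "for all k in parallel" step, and the function name `naive_scan` is an assumption made for illustration. As written, the algorithm produces an inclusive scan.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* CPU simulation of the naive parallel scan (inclusive, addition).
   buf[src] plays the role of x[in], buf[dst] the role of x[out]. */
void naive_scan(const int *in, int *result, size_t n)
{
    if (n == 0)
        return;
    int *buf[2];
    buf[0] = malloc(n * sizeof(int));
    buf[1] = malloc(n * sizeof(int));
    assert(buf[0] != NULL && buf[1] != NULL);
    memcpy(buf[0], in, n * sizeof(int));

    int src = 0;
    /* offset takes the values 2^(d-1) for d = 1, 2, ... */
    for (size_t offset = 1; offset < n; offset *= 2) {
        int dst = 1 - src;
        for (size_t k = 0; k < n; k++) {       /* one parallel step */
            if (k >= offset)
                buf[dst][k] = buf[src][k - offset] + buf[src][k];
            else
                buf[dst][k] = buf[src][k];
        }
        src = dst;                             /* swap the buffers */
    }
    memcpy(result, buf[src], n * sizeof(int));
    free(buf[0]);
    free(buf[1]);
}
```

Reading from one buffer and writing to the other is what lets every k run independently within a step: no element is overwritten before another thread has read it.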

A work-efficient parallel scan
- Goal: a parallel scan that is $O(n)$ instead of $O(n \log_2 n)$.
- Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root.
- A binary tree with $n$ leaves has $d = \log_2 n$ levels, and each level $d$ has $2^d$ nodes.
- One add is performed per node, so a single traversal of the tree performs $O(n)$ adds.

O(n) unsegmented scan

Reduce / Up-Sweep

    for (d = 0; d <= log2(n) - 1; d++)
        for all k = 0; k < n-1; k += 2^(d+1) in parallel
            x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep

    x[n-1] = 0
    for (d = log2(n) - 1; d >= 0; d--)
        for all k = 0; k < n-1; k += 2^(d+1) in parallel
            t = x[k + 2^d - 1]
            x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
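The two sweeps can be checked with a sequential simulation. This is a sketch in plain C under the assumption that n is a power of two; the function name `work_efficient_scan` is illustrative, and each inner loop stands in for one parallel step. The variable `stride` corresponds to 2^d in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

/* In-place exclusive prefix sum using an up-sweep followed by a down-sweep.
   Assumes n is a power of two. */
void work_efficient_scan(int *x, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep (reduce): build partial sums at the internal tree nodes. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t k = 0; k < n; k += 2 * stride)
            x[k + 2 * stride - 1] += x[k + stride - 1];

    /* Down-sweep: clear the root, then push prefixes back down the tree. */
    x[n - 1] = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (size_t k = 0; k < n; k += 2 * stride) {
            int t = x[k + stride - 1];                 /* left child */
            x[k + stride - 1] = x[k + 2 * stride - 1]; /* left gets parent's value */
            x[k + 2 * stride - 1] += t;                /* right gets parent + left */
        }
    }
}
```

For the input 3 1 7 0 4 1 6 3, the up-sweep leaves the total 25 in the last element, and the down-sweep then yields 0 3 4 11 11 15 16 22.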

Tree analogy (x_i..x_j denotes the sum of elements i through j)

    After up-sweep:    x0  (x0..x1)  x2        (x0..x3)  x4        (x4..x5)  x6        (x0..x7)
    Clear last:        x0  (x0..x1)  x2        (x0..x3)  x4        (x4..x5)  x6        0
    Down-sweep, d=2:   x0  (x0..x1)  x2        0         x4        (x4..x5)  x6        (x0..x3)
    Down-sweep, d=1:   x0  0         x2        (x0..x1)  x4        (x0..x3)  x6        (x0..x5)
    Down-sweep, d=0:   0   x0        (x0..x1)  (x0..x2)  (x0..x3)  (x0..x4)  (x0..x5)  (x0..x6)

O(n) Segmented Scan Up-Sweep

Down-Sweep

Features of segmented scan
- 3 times slower than unsegmented scan.
- Useful for building a broad variety of applications that are not possible with unsegmented scan.
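The slides do not spell out what a segmented scan computes, so the following sequential reference may help. It is a sketch only: it assumes segments are marked by head flags (a 1 at the first element of each segment), it is not the O(n) parallel up-sweep/down-sweep algorithm, and the name `segmented_inclusive_scan` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for a segmented inclusive sum scan.
   head[i] != 0 marks the first element of a segment; the running sum
   restarts there, so each segment is scanned independently. */
void segmented_inclusive_scan(const int *in, const int *head, int *out, size_t n)
{
    assert(n == 0 || head[0] != 0);   /* the first element must start a segment */
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (head[i])
            acc = 0;                  /* new segment: reset the running sum */
        acc += in[i];
        out[i] = acc;
    }
}
```

For example, the input [1 2 3][4 5] gives [1 3 6][4 9], whereas an unsegmented inclusive scan of the same values would give 1 3 6 10 15.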

Primitives built on scan
- Enumerate: enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]. An exclusive scan of the input vector.
- Distribute (copy): distribute([a b c][d e]) = [a a a][d d]. An inclusive scan of the input vector.
- Split and split-and-segment: split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
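As a rough illustration of the first two primitives, here is a sequential sketch in C. The function names mirror the slide, but the signatures are assumptions: flags are passed as `bool` arrays, and segments for `distribute` are marked with head flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* enumerate: exclusive sum scan of the flags, treating true as 1.
   out[i] is the number of true flags strictly before position i. */
void enumerate(const bool *flags, int *out, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = count;
        if (flags[i])
            count++;
    }
}

/* distribute: copy the first element of each segment across that segment.
   head[i] == true marks the start of a segment. */
void distribute(const char *in, const bool *head, char *out, size_t n)
{
    assert(n == 0 || head[0]);   /* the first element must start a segment */
    char current = 0;
    for (size_t i = 0; i < n; i++) {
        if (head[i])
            current = in[i];     /* remember the segment's head element */
        out[i] = current;
    }
}
```

Applied to the slide's inputs, `enumerate` maps t f f t f t t to 0 1 1 1 2 2 3, and `distribute` maps [a b c][d e] to [a a a][d d].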

Applications
- Quicksort
- Sparse matrix-vector multiply
- Tridiagonal matrix solvers and fluid simulation
- Radix sort
- Stream compaction
- Summed-area tables

Quicksort

Sparse Matrix-Vector Multiplication

Stream Compaction
Definition: Extracts the elements of interest from an array and places them contiguously in a new array.
Uses: collision detection, sparse matrix compression.
Example: A B A D D E C F B -> A B A C B

Stream Compaction

    Input (we want to preserve the gray elements):  A B A D D E C F B
    Set a 1 for each gray input:                    1 1 1 0 0 0 1 0 1
    Scan:                                           0 1 2 3 3 3 3 4 4

Scatter the gray inputs to the output, using the scan result as the scatter address:

    Output:                                         A B A C B
    Address:                                        0 1 2 3 4
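The flag, scan, and scatter steps can be sketched sequentially in C as follows. The function name `compact` and its signature are assumptions for illustration; on the GPU the scan and the scatter would each run in parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stream compaction: copy the elements with keep[i] == true into out,
   preserving their order. Returns the number of elements written. */
size_t compact(const char *in, const bool *keep, char *out, size_t n)
{
    size_t *addr = malloc(n * sizeof *addr);
    assert(addr != NULL || n == 0);

    /* Exclusive scan of the flags gives each kept element its output slot. */
    size_t running = 0;
    for (size_t i = 0; i < n; i++) {
        addr[i] = running;
        running += keep[i] ? 1 : 0;
    }

    /* Scatter: every kept element writes itself to its scanned address. */
    for (size_t i = 0; i < n; i++)
        if (keep[i])
            out[addr[i]] = in[i];

    free(addr);
    return running;
}
```

Because the scatter addresses come from a scan, no two kept elements share a slot, which is what makes the scatter step safe to run with one thread per input element.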

Radix Sort Using Scan

    Input array                        100 111 010 110 011 101 001 000
    b = least significant bit            0   1   0   0   1   1   1   0
    e = 1 for each false sort key        1   0   1   1   0   0   0   1
    f = scan of e                        0   1   1   2   3   3   3   3
    t = index - f + totalFalses          4   4   5   5   5   6   7   8
    d = b ? t : f                        0   4   1   2   5   6   7   3
    Output (scatter input using d)     100 010 110 000 111 011 101 001

    totalFalses = e[n-1] + f[n-1] = 1 + 3 = 4
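The table corresponds to one split pass on the least significant bit. A sequential sketch in C of that pass, and of a full sort built by repeating it for each bit, might look like this. The names `split_by_bit` and `radix_sort` are illustrative and the loops stand in for parallel steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One split pass: keys whose chosen bit is 0 go to the left, keys whose bit
   is 1 go to the right, and the order inside each group is preserved. */
void split_by_bit(const unsigned *in, unsigned *out, size_t n, unsigned bit)
{
    size_t *f = malloc(n * sizeof *f);
    assert(f != NULL || n == 0);

    /* f = exclusive scan of e, where e[i] = 1 when the bit is 0 (false). */
    size_t falses = 0;
    for (size_t i = 0; i < n; i++) {
        f[i] = falses;
        if (((in[i] >> bit) & 1u) == 0)
            falses++;
    }
    size_t total_falses = falses;      /* equals e[n-1] + f[n-1] */

    for (size_t i = 0; i < n; i++) {
        unsigned b = (in[i] >> bit) & 1u;
        size_t t = i - f[i] + total_falses;   /* address for a true key */
        size_t d = b ? t : f[i];              /* final scatter address */
        out[d] = in[i];
    }
    free(f);
}

/* Least-significant-bit-first radix sort: one stable split per bit. */
void radix_sort(unsigned *keys, unsigned *tmp, size_t n, unsigned bits)
{
    for (unsigned bit = 0; bit < bits; bit++) {
        split_by_bit(keys, tmp, n, bit);
        memcpy(keys, tmp, n * sizeof *keys);
    }
}
```

The sort is correct only because each split is stable: keys that agree on the current bit keep the relative order established by the passes on lower bits.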

Specialized Libraries
CUDPP: CUDA Data Parallel Primitives Library
CUDPP is a library of data-parallel algorithm primitives such as parallel prefix sum (scan), parallel sort, and parallel reduction.

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(CUDPPHandle sparseMatrixHandle, void *d_y, const void *d_x)
Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.

    CUDPPScanConfig config;
    config.direction = CUDPP_SCAN_FORWARD;
    config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
    config.op = CUDPP_ADD;
    config.datatype = CUDPP_FLOAT;
    config.maxNumElements = numElements;
    config.maxNumRows = 1;
    config.rowPitch = 0;
    cudppInitializeScan(&config);
    cudppScan(d_odata, d_idata, numElements, &config);

CUFFT
- Fewer than 8192 elements: slower than FFTW.
- More than 8192 elements: 5x speedup over threaded FFTW and 10x over serial FFTW.

CUBLAS
- CUDA-based linear algebra subroutines.
- Saxpy, conjugate gradient, linear solvers.
- 3D reconstruction of planetary nebulae: http://graphics.tu-bs.de/publications/fernandez08techreport.pdf

- The GPU variant is 100 times faster than the CPU version.
- Matrix size is limited by graphics card memory and texture size.
- Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS.

Useful Links
- http://www.science.uwaterloo.ca/~hmerz/cuda_benchfft/
- http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
- http://gpgpu.org/developer/cudpp
- http://gpgpu.org/2009/05/31/thrust