General Purpose Computation on Graphics Processors (GPGPU) Mike Houston, Stanford University

A little about me
http://graphics.stanford.edu/~mhouston
- Education
  - UC San Diego, Computer Science BS
  - Stanford University, Computer Science MS
  - Currently a PhD candidate at Stanford University
- Research
  - Parallel rendering
  - High performance computing
  - Computation on graphics processors (GPGPU)

What can you do on GPUs other than graphics?
- Large matrix/vector operations (BLAS)
- Protein folding (molecular dynamics)
- FFT (SETI, signal processing)
- Ray tracing
- Physics simulation (cloth, fluid, collision)
- Sequence matching (Hidden Markov Models)
- Speech recognition (Hidden Markov Models, neural nets)
- Databases
- Sort/search
- Medical imaging (image segmentation, processing)
- And many, many more
http://www.gpgpu.org

Why use GPUs?
- COTS (commercial off-the-shelf)
  - In every machine
- Performance
  - Intel 3.0 GHz Pentium 4
    - 12 GFLOPs peak (MAD)
    - 5.96 GB/s to main memory
  - ATI Radeon X1800XT
    - 120 GFLOPs peak (fragment engine)
    - 42 GB/s to video memory

Task vs. data parallelism
- Task parallel
  - Independent processes with little communication
  - Easy to use
    - Free on modern operating systems with SMP
- Data parallel
  - Lots of data on which the same computation is being executed
  - No dependencies between data elements in each step of the computation
  - Can saturate many ALUs
  - But often requires redesign of traditional algorithms

CPU vs. GPU
- CPU
  - Really fast caches (great for data reuse)
  - Fine branching granularity
  - Lots of different processes/threads
  - High performance on a single thread of execution
- GPU
  - Lots of math units
  - Fast access to onboard memory
  - Run a program on each fragment/vertex
  - High throughput on parallel tasks
- CPUs are great for task parallelism
- GPUs are great for data parallelism

The Importance of Data Parallelism for GPUs
- GPUs are designed for highly parallel tasks like rendering
- GPUs process independent vertices and fragments
  - Temporary registers are zeroed
  - No shared or static data
  - No read-modify-write buffers
  - In short, no communication between vertices or fragments
- Data-parallel processing
  - GPU architectures are ALU-heavy
    - Multiple vertex & pixel pipelines
    - Lots of compute power
  - GPU memory systems are designed to stream data
    - Linear access patterns can be prefetched
    - Hide memory latency
Courtesy GPGPU.org

GPGPU Terminology

Arithmetic Intensity
- Arithmetic intensity
  - Math operations per word transferred
  - Computation / bandwidth
- Ideal apps to target GPGPU have:
  - Large data sets
  - High parallelism
  - Minimal dependencies between data elements
  - High arithmetic intensity
  - Lots of work to do without CPU intervention
Courtesy GPGPU.org
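As a rough illustration of "math operations per word transferred", the sketch below computes the ratio for two of the BLAS routines discussed later in the talk. The function names are hypothetical, and the operation and word counts are simple textbook estimates (the matrix case assumes every matrix is moved exactly once, which real implementations only approach).

```c
#include <stddef.h>

/* Arithmetic intensity: math operations per word transferred. */
double arithmetic_intensity(double math_ops, double words_transferred)
{
    return math_ops / words_transferred;
}

/* saxpy, y[i] = alpha * x[i] + y[i]:
   2 ops per element (multiply, add) and
   3 words moved per element (read x, read y, write y). */
double saxpy_intensity(size_t n)
{
    return arithmetic_intensity(2.0 * (double)n, 3.0 * (double)n);
}

/* n x n matrix-matrix multiply, ASSUMING ideal reuse:
   2*n^3 ops over 3*n^2 words (read A, read B, write C once each). */
double sgemm_intensity_ideal(size_t n)
{
    double dn = (double)n;
    return arithmetic_intensity(2.0 * dn * dn * dn, 3.0 * dn * dn);
}
```

Under these assumptions saxpy stays at 2/3 of an operation per word regardless of size, while the matrix multiply grows linearly with n, which is one way to see why matrix-matrix work is the more natural GPGPU target.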

Data Streams & Kernels
- Streams
  - Collection of records requiring similar computation
    - Vertex positions, voxels, FEM cells, etc.
  - Provide data parallelism
- Kernels
  - Functions applied to each element in stream
    - Transforms, PDE, ...
  - No dependencies between stream elements
    - Encourage high arithmetic intensity
Courtesy GPGPU.org
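The stream/kernel idea can be sketched on the CPU as a function applied independently to every element of an array. This is only an illustration: `kernel_fn`, `run_kernel`, and `scale_and_offset` are hypothetical names, and the sequential loop stands in for what a GPU would execute in parallel.

```c
#include <stddef.h>

/* A kernel sees one stream element at a time and nothing else. */
typedef float (*kernel_fn)(float element);

/* Apply the kernel to every element of the input stream.
   No iteration reads another iteration's result, so the
   iterations could run in any order or all at once. */
void run_kernel(kernel_fn kernel, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = kernel(in[i]);
}

/* Example kernel (hypothetical): a simple per-element transform. */
float scale_and_offset(float x)
{
    return 2.0f * x + 1.0f;
}
```

Calling `run_kernel(scale_and_offset, in, out, n)` transforms the whole stream; because the kernel has no access to neighbouring elements, the "no dependencies between stream elements" rule is enforced by the function signature itself.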

Scatter vs. Gather
- Gather
  - Indirect read from memory ( x = a[i] )
  - Naturally maps to a texture fetch
  - Used to access data structures and data streams
- Scatter
  - Indirect write to memory ( a[i] = x )
  - Difficult to emulate:
    - Render to vertex array
    - Sorting buffer
  - Needed for building many data structures
  - Usually done on the CPU
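In plain C the two access patterns differ only in which side of the assignment is indexed indirectly. The helper names below are hypothetical and the loops are a CPU sketch, not GPU code.

```c
#include <stddef.h>

/* Gather: indirect read, out[i] = a[idx[i]].
   Each output location is fixed; only the read address varies. */
void gather(float *out, const float *a, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[idx[i]];
}

/* Scatter: indirect write, a[idx[i]] = x[i].
   The write address is computed, which is what a fragment
   processor with a fixed output pixel cannot do directly. */
void scatter(float *a, const float *x, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[idx[i]] = x[i];
}
```

Note that scatter with repeated indices makes the result depend on write order, which is part of why it is harder to support in a parallel pipeline.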

Mapping algorithms to the GPU

Mapping CPU algorithms to the GPU
- Basics
  - Streams/arrays -> textures
  - Parallel loops -> quads
  - Loop body -> vertex + fragment program
  - Output arrays -> render targets
  - Memory read -> texture fetch
  - Memory write -> framebuffer write
- Controlling the parallel loop
  - Rasterization = kernel invocation
  - Texture coordinates = computational domain
  - Vertex coordinates = computational range
Courtesy GPGPU.org
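Storing an array in a texture means folding a 1D index into 2D texel coordinates. A minimal sketch of that address translation follows, assuming a row-major layout; the `Texel` type and function names are hypothetical and not taken from any graphics API.

```c
#include <stddef.h>

/* 2D texel coordinate (hypothetical helper type). */
typedef struct {
    size_t x;
    size_t y;
} Texel;

/* Map a 1D array index to a texel, assuming row-major layout
   in a texture that is `width` texels wide. */
Texel texel_from_index(size_t i, size_t width)
{
    Texel t;
    t.x = i % width;
    t.y = i / width;
    return t;
}

/* Inverse mapping: texel back to the 1D array index. */
size_t index_from_texel(Texel t, size_t width)
{
    return t.y * width + t.x;
}
```

For example, element 13 of a 25-element array stored in a 5-texel-wide texture lands at column 3, row 2.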

Computational Resources
- Programmable parallel processors
  - Vertex & fragment pipelines
- Rasterizer
  - Mostly useful for interpolating values (texture coordinates) and per-vertex constants
- Texture unit
  - Read-only memory interface
- Render to texture
  - Write-only memory interface
Courtesy GPGPU.org

Vertex Processors
- Fully programmable (SIMD / MIMD)
- Processes 4-vectors (RGBA / XYZW)
- Capable of scatter but not gather
  - Can change the location of the current vertex
  - Cannot read info from other vertices
  - Can only read a small constant memory
- Vertex texture fetch
  - Random access memory for vertices
  - Limited gather capabilities
    - Can fetch from texture
    - Cannot fetch from current vertex stream
Courtesy GPGPU.org

Fragment Processors
- Fully programmable (SIMD)
- Processes 4-component vectors (RGBA / XYZW)
- Random access memory read (textures)
- Generally capable of gather but not scatter
  - Indirect memory read (texture fetch), but no indirect memory write
  - Output address fixed to a specific pixel
- Typically more useful than the vertex processor
  - More fragment pipelines than vertex pipelines
  - Direct output (fragment processor is at end of pipeline)
  - Better memory read performance
- For GPGPU, we mainly concentrate on using the fragment processors
  - Most of the flops
  - Highest memory bandwidth
Courtesy GPGPU.org

GPGPU example: adding vectors
- Place arrays into 2D textures
- Convert loop body into a shader
- Loop body = render a quad
  - Needs to cover all the pixels in the output
  - 1:1 mapping between pixels and texels
- Read back framebuffer into result array

Layout of a 25-element array in a 5x5 texture:

     0  1  2  3  4
     5  6  7  8  9
    10 11 12 13 14
    15 16 17 18 19
    20 21 22 23 24

CPU version:

    float a[5*5];
    float b[5*5];
    float c[5*5];
    //initialize vector a
    //initialize vector b
    for(int i=0; i<5*5; i++)
    {
        c[i] = a[i] + b[i];
    }

Fragment program:

    !!ARBfp1.0
    TEMP R0;
    TEMP R1;
    TEX R0, fragment.position, texture[0], 2D;
    TEX R1, fragment.position, texture[1], 2D;
    ADD R0, R0, R1;
    MOV result.color, R0;
    END

How this basically works: adding vectors
1. Bind input textures (vector A, vector B)
2. Bind render targets (vector C)
3. Load shader (the fragment program from the previous slide)
4. Set shader params
5. Render quad
6. Read back buffer: C = A + B
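The sequence above can be mimicked on the CPU to make the data flow concrete. The sketch below is an emulation only: `Texture`, `add_fragment`, `render_quad`, and `add_vectors` are hypothetical names, no graphics API is involved, and the nested loop stands in for the rasterizer invoking the fragment program once per output pixel.

```c
#define W 5
#define H 5

/* Stand-in for a single-channel 2D float texture. */
typedef struct {
    float texel[H][W];
} Texture;

/* "Fragment program": runs once per output pixel, reads the
   matching texel from both inputs (gather), returns the sum. */
static float add_fragment(const Texture *a, const Texture *b, int x, int y)
{
    return a->texel[y][x] + b->texel[y][x];
}

/* "Render quad": cover every pixel of the render target,
   with a 1:1 mapping between pixels and texels. */
void render_quad(const Texture *a, const Texture *b, Texture *target)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            target->texel[y][x] = add_fragment(a, b, x, y);
}

/* Whole pipeline: upload arrays, render, read back. */
void add_vectors(const float *a, const float *b, float *c)
{
    Texture ta, tb, tc;
    for (int i = 0; i < W * H; i++) {      /* place arrays into textures */
        ta.texel[i / W][i % W] = a[i];
        tb.texel[i / W][i % W] = b[i];
    }
    render_quad(&ta, &tb, &tc);            /* kernel invocation */
    for (int i = 0; i < W * H; i++)        /* read back the framebuffer */
        c[i] = tc.texel[i / W][i % W];
}
```

On real hardware the upload and readback steps cross the bus, so they are the parts worth minimizing when several kernels run back to back.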

Rolling your own GPGPU apps
- Lots of information on GPGPU.org
- For those with a strong graphics background:
  - Do all the graphics setup yourself
  - Write your kernels:
    - Use high level languages: Cg, HLSL, ASHLI
    - Or, direct assembly: ARB_fragment_program, ps20, ps2a, ps2b, ps30
- High level languages and systems to make GPGPU easier
  - Brook (http://graphics.stanford.edu/projects/brookgpu/)
  - Sh (http://libsh.org)

BrookGPU
- History
  - Developed at Stanford University
  - Goal: allow non-graphics users to use GPUs for computation
  - Lots of GPGPU apps written in Brook
- Design
  - C based language with streaming extensions
  - Compiles kernels to DX9 and OpenGL shading models
  - Runtimes (DX9/OpenGL) handle all graphics commands
- Performance
  - 80-90% of hand tuned GPU application in many cases

GPGPU and the ATI X1800

GPGPU on the ATI X1800
- IEEE 32-bit floating point
  - Simplifies precision issues in applications
- Long programs
  - We can now handle larger applications
  - 512 static instructions
  - Effectively unlimited dynamic instructions
- Branching and looping
  - No performance cliffs for dynamic branching and looping
  - Fine branch granularity: ~16 fragments
- Faster upload/download
  - 50-100% increase in PCIe bandwidth over last generation

GPGPU on the ATI X1800, cont.
- Advanced memory controller
  - Latency hiding for streaming reads and writes to memory
    - With enough math ops you can hide all memory access!
  - Large bandwidth improvement over previous generation
- Scatter support (a[i] = x)
  - Arbitrary number of float outputs from fragment processors
  - Uncached reads and writes for register spilling
- F-Buffer
  - Support for linearizing datasets
  - Store temporaries in flight

GPGPU on the ATI X1800, cont.
- Flexibility
  - Unlimited texture reads
  - Unlimited dependent texture reads
  - 32 hardware registers per fragment
- 512MB memory support
  - Larger datasets without going to system memory

Performance basics for GPGPU: X1800XT (from GPUBench)
- Compute
  - 83 GFLOPs (MAD)
- Memory
  - 42 GB/s cache bandwidth
  - 21 GB/s streaming bandwidth
  - 4 cycle latency for a float4 fetch (cache hit)
  - 8 cycle latency for a float4 fetch (streaming)
- Branch granularity
  - 16 fragments
- Offload to GPU
  - Download (GPU -> CPU): 900 MB/s
  - Upload (CPU -> GPU): 1.4 GB/s
http://graphics.stanford.edu/projects/gpubench
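These figures can be turned into two back-of-the-envelope estimates: how much arithmetic a kernel needs per streamed value before it stops being bandwidth bound, and how long a transfer across the bus takes. The sketch below assumes 4-byte floats and decimal units (1 GB = 1e9 bytes, 1 MB = 1e6 bytes); the function names are hypothetical.

```c
/* Flops the ALUs can perform in the time it takes to stream
   one 32-bit float from memory (ASSUMES 4 bytes per float). */
double flops_per_float_streamed(double gflops, double gbytes_per_s)
{
    double floats_per_s = gbytes_per_s * 1e9 / 4.0;
    return gflops * 1e9 / floats_per_s;
}

/* Seconds needed to move `bytes` over a link that sustains
   `mbytes_per_s` (ASSUMES 1 MB = 1e6 bytes). */
double transfer_seconds(double bytes, double mbytes_per_s)
{
    return bytes / (mbytes_per_s * 1e6);
}
```

With the measured 83 GFLOPs and 21 GB/s streaming bandwidth, `flops_per_float_streamed(83.0, 21.0)` comes to roughly 16, so under these assumptions a kernel doing fewer than about 16 operations per streamed float is limited by memory rather than by the ALUs.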

A few examples on the ATI X1800XT

BLAS: Basic Linear Algebra Subprograms
- High performance computing
  - The basis for LINPACK benchmarks
  - Heavily used in simulation
- Ubiquitous in many math packages
  - MATLAB
  - LAPACK
- BLAS 1: scalar, vector, vector/vector operations
- BLAS 2: matrix-vector operations
- BLAS 3: matrix-matrix operations

BLAS GPU Performance
- saxpy (BLAS1): single precision vector-scalar product
- sgemv (BLAS2): single precision matrix-vector product
- sgemm (BLAS3): single precision matrix-matrix product
[Bar chart: relative performance (axis 0 to 12) of saxpy, sgemv, and sgemm on a 3.0GHz P4, 2.5GHz G5, Nvidia 7800GTX, and ATI X1800XT]
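For readers unfamiliar with the routine names, plain C reference loops for the first two are sketched below. These are simplified illustrations with hypothetical names (`ref_saxpy`, `ref_sgemv`): the real BLAS interfaces also take strides, transpose flags, and scaling factors, and the row-major layout here is an assumption.

```c
#include <stddef.h>

/* BLAS 1 style saxpy: y = alpha * x + y. */
void ref_saxpy(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* BLAS 2 style matrix-vector product: y = A * x,
   with A stored row-major as rows x cols (ASSUMED layout). */
void ref_sgemv(size_t rows, size_t cols, const float *a,
               const float *x, float *y)
{
    for (size_t r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (size_t c = 0; c < cols; c++)
            sum += a[r * cols + c] * x[c];
        y[r] = sum;
    }
}
```

Both loops touch each input value only once or a few times, which keeps their arithmetic intensity low compared with a matrix-matrix product.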

HMMer: protein sequence matching
- Goal
  - Find matching patterns between protein sequences
  - Relationship between diseases and genetics
  - Genetic relationships between species
- Problem
  - HUGE databases to search against
  - Queries take lots of time to process
    - Researchers start searches and go home for the night
- Core algorithm (hmmsearch)
  - Viterbi algorithm
  - Compare a Hidden Markov Model against a large database of protein sequences
- Paper at IEEE Supercomputing 2005
  - http://graphics.stanford.edu/papers/clawhmmer/
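To show the shape of the computation, here is a generic textbook Viterbi scorer in log space. It is not the hmmsearch implementation, which works on profile HMMs with a specific state structure; the function name, the flat array layout, and the `MAX_STATES` limit are all assumptions made for this sketch.

```c
#include <math.h>

#define MAX_STATES 8

/* Log-score of the most likely state path for an observation
   sequence. All model parameters are log-scores:
     log_init[s]                        start in state s
     log_trans[from * n_states + to]    transition from -> to
     log_emit[s * n_symbols + symbol]   state s emits symbol */
double viterbi_score(int n_states, int n_symbols,
                     const double *log_init,
                     const double *log_trans,
                     const double *log_emit,
                     const int *obs, int n_obs)
{
    double prev[MAX_STATES], cur[MAX_STATES];

    if (n_states < 1 || n_states > MAX_STATES || n_obs < 1)
        return -INFINITY;

    /* First observation: start score plus emission score. */
    for (int s = 0; s < n_states; s++)
        prev[s] = log_init[s] + log_emit[s * n_symbols + obs[0]];

    /* Each later step keeps only the best predecessor per state. */
    for (int t = 1; t < n_obs; t++) {
        for (int to = 0; to < n_states; to++) {
            double best = -INFINITY;
            for (int from = 0; from < n_states; from++) {
                double cand = prev[from] + log_trans[from * n_states + to];
                if (cand > best)
                    best = cand;
            }
            cur[to] = best + log_emit[to * n_symbols + obs[t]];
        }
        for (int s = 0; s < n_states; s++)
            prev[s] = cur[s];
    }

    /* Final answer: best score over all end states. */
    double result = prev[0];
    for (int s = 1; s < n_states; s++)
        if (prev[s] > result)
            result = prev[s];
    return result;
}
```

Scoring one sequence is sequential in the sequence position, but every database sequence is scored independently of the others, which is where the data parallelism in a database search comes from.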

HMMer Performance
[Chart: relative performance, including the X1800XT]

Protein Folding
- GROMACS provides extremely high performance compared to all other programs.
- Lots of algorithmic optimizations:
  - Own software routines to calculate the inverse square root.
  - Inner loops optimized to remove all conditionals.
  - Loops use SSE and 3DNow! multimedia instructions for x86 processors
  - For PowerPC G4 and later processors: Altivec instructions provided
- Normally 3-10 times faster than any other program.
- Core algorithm in Folding@Home
http://www.gromacs.org
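Force kernels in molecular dynamics evaluate 1/sqrt(r^2) for every particle pair, which is why a custom inverse square root pays off. The sketch below is not the GROMACS routine: it swaps in the widely known integer bit-trick for the initial guess and refines it with two Newton-Raphson steps. The function name is hypothetical, and the code assumes 32-bit IEEE floats and a positive, normal input.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, normal x.
   ASSUMES float is a 32-bit IEEE 754 value. */
float approx_invsqrt(float x)
{
    uint32_t bits;
    float y;

    /* Initial guess from the exponent/mantissa bit pattern.
       memcpy avoids undefined behaviour from pointer casts. */
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* Newton-Raphson refinement of f(y) = 1/y^2 - x:
       y_next = y * (1.5 - 0.5 * x * y * y). */
    y = y * (1.5f - 0.5f * x * y * y);
    y = y * (1.5f - 0.5f * x * y * y);
    return y;
}
```

The refinement uses only multiplies and subtracts with no branches, which fits the same "remove all conditionals" style the slide describes for the inner loops.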

GROMACS - Performance
[Bar chart: relative performance (axis 0 to 4) of the force kernel only and of the complete application on a 3.0GHz P4, 2.5GHz G5, NVIDIA 7800GTX, and ATI X1800XT]

GROMACS GPU Implementation
- Written using Brook by non-graphics programmers
  - Offloads force calculation to GPU (~80% of CPU time)
  - Force calculation on X1800XT is ~3.5X a 3.0GHz P4
  - Overall speedup on X1800XT is ~2.5X a 3.0GHz P4
- Not yet optimized for X1800XT
  - Using ps2b kernels, i.e. no looping
  - Not making use of new scatter functionality
- The revenge of Amdahl's law
  - Force calculation no longer the bottleneck (38% of runtime)
  - Need to also accelerate data structure building (neighbor lists)
    - MUCH easier with scatter support
- This looks like a very promising application for GPUs
  - Combine CPU and GPU processing for a folding monster!
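Amdahl's law makes the gap between kernel speedup and application speedup easy to estimate. The sketch below uses a hypothetical function name and the slide's rounded figures as inputs; since those figures are approximate, the result should only be expected to land in the same range as the measured numbers.

```c
/* Amdahl's law: overall speedup when a `fraction` of the original
   runtime is accelerated by `factor` and the rest is unchanged. */
double amdahl_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With 80% of the time offloaded and a 3.5X faster force kernel, `amdahl_speedup(0.8, 3.5)` gives about 2.3X, close to the ~2.5X reported. Even an infinitely fast force kernel would cap the overall gain at 5X under this model, which is why the neighbor-list construction has to move to the GPU as well.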

Making GPGPU easier

What GPGPU needs from vendors
- More information
  - Shader ISA
  - Latency information
  - GPGPU programming guide (floating point)
    - How to order code for ALU efficiency
    - The real cost of all instructions
    - Expected latencies of different types of memory fetches
- Direct access to the hardware
  - GL/DX is not what we want to be using
    - We don't need state tracking
    - Using graphics commands is odd for doing computation
    - The graphics abstractions aren't useful for us
  - Better memory management
- Fast transfer to and from GPU
  - Non-blocking
- Consistent graphics drivers
  - Some optimizations for games hurt GPGPU performance

What GPGPU needs from the community
- Data parallel programming languages
  - Lots of academic research
- GCC for GPUs
- Parallel data structures
- More applications
  - What will make the average user care about GPGPU?
  - What can we make data parallel and run fast?

Thanks
- The BrookGPU team: Ian Buck, Tim Foley, Jeremy Sugerman, Daniel Horn, Kayvon Fatahalian
- GROMACS: Vishal Vaidyanathan, Erich Elsen, Vijay Pande, Eric Darve
- HMMer: Daniel Horn, Erik Lindahl
- Pat Hanrahan
- Everyone at ATI Technologies

Questions?
- I'll also be around after the talk
- Email: mhouston@stanford.edu
- Web: http://graphics.stanford.edu/~mhouston
- For lots of great GPGPU information: GPGPU.org (http://www.gpgpu.org)