CS377P Programming for Performance: Introduction to Accelerators. Sreepathi Pai, UTCS. November 4, 2015.

Outline: 1. Introduction to Accelerators; 2. GPU Architectures; 3. GPU Programming Models.


Accelerators. Single-core processors, then multi-core processors: what if these aren't enough? Enter accelerators, specifically GPUs: what they are, and when you should use them.

Timeline. 1980s: geometry engines. 1990s: consumer GPUs; out-of-order superscalars. 2000s: general-purpose GPUs; multicore CPUs; the Cell BE (PlayStation 3); lots of specialized accelerators in phones.

The Graphics Processing Unit (1980s). The SGI Geometry Engine implemented the geometry pipeline in hardwired logic. Embarrassingly parallel, O(pixels); a large number of logic elements; high memory bandwidth. (Figure from Kaufman et al., 2009.)

GPU 2.0 (circa 2004). Like CPUs, GPUs benefited from Moore's Law, evolving from fixed-function hardwired logic to flexible, programmable ALUs. Around 2004, GPUs were programmable enough to do some non-graphics computations, though severely limited by the graphics programming model (shader programming). In 2006, GPUs became fully programmable: GPGPU, the General-Purpose GPU. NVIDIA released the CUDA language to write non-graphics programs that run on GPUs.

[Figure: peak floating-point throughput (FLOP/s) of CPUs vs. GPUs over time, from the NVIDIA CUDA C Programming Guide]

[Figure: memory bandwidth of CPUs vs. GPUs over time, from the NVIDIA CUDA C Programming Guide]

GPGPU Today. GPUs are widely deployed as accelerators (see the Intel paper on the "10x vs 100x" myth). GPUs have been so successful that other accelerators are dead: the Sony/IBM Cell BE, the ClearSpeed CSX. Kepler K40 GPUs from NVIDIA have a peak performance of 4 TFLOP/s; by comparison, the CM-5, the #1 system in 1993, achieved 60 GFLOP/s (Linpack), and ASCI White (#1 in 2001) achieved 4.9 TFLOP/s (Linpack). (Pictures of Titan and Tianhe-1A from the Top500 website.)

Accelerator Programming Models. CPUs have always depended on co-processors: I/O co-processors to handle slow I/O, math co-processors to speed up computation, H.264 co-processors to play video (phones), DSPs to handle audio (phones). Many have been transparent: drop in the co-processor and everything speeds up. Others used a function-based model: call a function and it is sped up (e.g. "decode video"). The GPU is not a transparent accelerator for general-purpose computations: only graphics code is sped up transparently; everything else must be rewritten to target GPUs.

Using a GPU. You must retarget code for the GPU: rewrite, recompile, translate, etc.

Outline: 1. Introduction to Accelerators; 2. GPU Architectures; 3. GPU Programming Models.

The Two (Three?) Kinds of GPUs. Type 1: discrete GPUs (e.g. NVIDIA): more computational power, more memory bandwidth, separate memory.

The Two (Three?) Kinds of GPUs #2. Type 2: integrated GPUs (e.g. Intel): share memory with the processor, share bandwidth with the processor, consume less power, can participate in cache coherence.

The NVIDIA Kepler. [Figure: Kepler GK110 block diagram, from the NVIDIA Kepler GK110 Whitepaper]

Using a Discrete GPU. You must retarget code for the GPU: rewrite, recompile, translate, etc. The working set must fit in GPU RAM, and you must copy data to/from GPU RAM. "You" here means the programmer, compiler, runtime, OS, etc.; the hardware won't do it for you.
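
To make the data-movement requirement concrete, here is a minimal sketch (not from the slides; the kernel scale and all names are hypothetical) of the allocate/copy-in/launch/copy-out pattern in CUDA:

    // Explicit data movement for a discrete GPU: allocate device memory,
    // copy the working set in, launch the kernel, copy the result back.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main(void) {
        const int N = 1 << 20;
        float *h = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, N * sizeof(float));                            // GPU RAM
        cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);  // copy in
        scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);                  // launch
        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);  // copy out

        printf("h[0] = %f\n", h[0]);  // 2.0
        cudaFree(d);
        free(h);
        return 0;
    }

None of this happens automatically: remove either cudaMemcpy and the program silently computes on stale or uninitialized data.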

NVIDIA Kepler SMX (i.e. CPU core equivalent)

NVIDIA Kepler SMX Details. 2-wide in-order, 4-wide SMT; 2048 threads per core (64 warps); 15 cores. Each thread runs the same code (hence SIMT). 65,536 32-bit registers (256 KB), partitioned among threads (not shared!); a thread can use up to 255 of these. 192 ALUs, 64 double-precision units, 32 load/store units, 32 special function units. 64 KB L1/shared cache; the shared cache is a software-managed cache.

CPU vs GPU

    Parameter               CPU                               GPU
    Clock speed             > 1 GHz                           700 MHz
    RAM                     GB to TB                          12 GB (max)
    Memory B/W              60 GB/s                           > 300 GB/s
    Peak FP                 < 1 TFLOP/s                       > 1 TFLOP/s
    Concurrent threads      O(10)                             O(1000) [O(10000)]
    LLC size                > 100 MB (L3) [eDRAM];            < 2 MB (L2)
                            O(10) MB [traditional]
    Cache size per thread   O(1 MB)                           O(10 bytes)
    Software-managed cache  none                              48 KB/SMX
    Type                    OOO superscalar                   2-way in-order superscalar

Using a GPU. You must retarget code for the GPU: rewrite, recompile, translate, etc. The working set must fit in GPU RAM, and you must copy data to/from GPU RAM. "You" here means the programmer, compiler, runtime, OS, etc.; the hardware won't do it for you. Data accesses should be streaming, or use the scratchpad as a user-managed cache (see the sketch below). Lots of parallelism is preferred (throughput, not latency); SIMD-style parallelism is best suited; high arithmetic intensity (FLOPs/byte) is preferred.
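
As an illustration of the scratchpad style, here is a minimal sketch (not from the slides; assumes 256-thread blocks and a made-up 3-point averaging kernel) that stages data into shared memory once so that global-memory accesses stay streaming:

    // Each block loads its tile (plus a one-element halo on each side) into
    // the software-managed cache; all reuse then comes from shared memory.
    __global__ void stencil3(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];   // assumes blockDim.x == 256
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int l = threadIdx.x + 1;

        if (g < n) tile[l] = in[g];
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();                  // tile fully loaded before any reads

        if (g < n) out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
    }

Each block's interior elements are fetched from DRAM once, in a coalesced streaming pattern, but used three times from the scratchpad.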

Showcase GPU Applications. Image processing, graphics rendering, matrix multiply, FFT. See "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU" by V. W. Lee et al. for more examples and a comparison of CPU and GPU.

Outline: 1. Introduction to Accelerators; 2. GPU Architectures; 3. GPU Programming Models.

Hierarchy of GPU Programming Models

    Model                  GPU                        CPU Equivalent
    Vectorizing compiler   PGI CUDA Fortran           gcc, icc, etc.
    Drop-in libraries      CUBLAS                     ATLAS
    Directive-driven       OpenACC, OpenMP-to-CUDA    OpenMP
    High-level languages   pycuda                     python
    Mid-level languages    OpenCL, CUDA               pthreads + C/C++
    Low-level languages    PTX, Shader                -
    Bare metal             SASS                       assembly/machine code

Drop-in Libraries. Drop-in replacements for popular CPU libraries; examples from NVIDIA: CUBLAS/NVBLAS for BLAS (e.g. ATLAS), CUFFT for FFTW, and MAGMA for LAPACK and BLAS. These libraries may still expect you to manage data transfers manually. Libraries may support multiple accelerators (GPU + CPU + Xeon Phi).
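
For instance, here is a minimal sketch (not from the slides) of a 2x2 SGEMM through the CUBLAS v2 API; the library does the math, but allocating device memory and copying the matrices remains the caller's job:

    // C = A * B via CUBLAS; matrices are column-major, as in regular BLAS.
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cstdio>

    int main(void) {
        const int n = 2;
        float A[] = {1, 3, 2, 4};   // column-major: A = [1 2; 3 4]
        float B[] = {5, 7, 6, 8};
        float C[4] = {0};

        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(A));
        cudaMalloc(&dB, sizeof(B));
        cudaMalloc(&dC, sizeof(C));
        cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);  // manual copy in
        cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dA, n, dB, n, &zero, dC, n);
        cublasDestroy(handle);

        cudaMemcpy(C, dC, sizeof(C), cudaMemcpyDeviceToHost);  // manual copy out
        printf("C(0,0) = %f\n", C[0]);  // 1*5 + 2*7 = 19
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }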

GPU Libraries. NVIDIA Thrust: like the C++ STL, but executes on the GPU. Modern GPU: at first glance, high-performance library routines for sorting, searching, reductions, etc.; a deeper look shows specific hard problems tackled in a different style. NVIDIA CUB: low-level primitives for use in CUDA kernels.
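
A minimal sketch (not from the slides) of Thrust's STL-like flavor, building a vector in GPU memory and reducing it on the GPU:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main(void) {
        thrust::device_vector<float> v(1 << 20, 1.0f);          // lives in GPU RAM
        float sum = thrust::reduce(v.begin(), v.end(), 0.0f);   // runs on the GPU
        printf("sum = %f\n", sum);                               // 1048576.0
        return 0;
    }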

Directive-Driven Programming. OpenACC is a new standard for offloading parallel work to an accelerator. It is currently supported only by the PGI Accelerator compiler; gcc 5.0 support is ongoing. OpenMPC, a research compiler, can compile OpenMP code plus extra directives to CUDA. OpenMP 4.0 also supports offload to accelerators, but not for GPUs yet.

    #include <stdio.h>
    #define N 1000000   /* N is not defined on the slide; any large value works */

    int main(void) {
        double pi = 0.0;
        long i;

        #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }

        printf("pi=%16.15f\n", pi / N);
        return 0;
    }

Python-based Tools (pycuda)

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    multiply_them(drv.Out(dest), drv.In(a), drv.In(b),
                  block=(400, 1, 1), grid=(1, 1))

    print(dest - a * b)   # should print all zeros

OpenCL. A C99-based dialect for programming heterogeneous systems. Originally based on CUDA, though the nomenclature is different. Supported by more than just GPUs: Xeon Phi, FPGAs, CPUs, etc. Source code is (somewhat) portable; performance may not be! Poorly supported by NVIDIA.

CUDA. Compute Unified Device Architecture: the first language to allow general-purpose programming for GPUs (preceded by shader languages). Promoted by NVIDIA for their GPUs; not supported by any other accelerator, though commercial CUDA-to-x86/64 compilers exist. We will focus on CUDA programs.

CUDA Architecture. From 10,000 feet: CUDA is like pthreads. The CUDA language is a C++ dialect; host code (CPU) and GPU code live in the same file, with special language extensions for GPU code. The CUDA Runtime API manages the runtime GPU environment (allocation of memory, data transfers, synchronization with the GPU, etc.) and is usually invoked by host code. The CUDA Driver API is the lower-level API that the CUDA Runtime API is built upon.
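
A minimal sketch (not from the slides) showing these pieces in a single .cu file: __global__ and the <<<...>>> launch are the language extensions, while cudaDeviceSynchronize() is a Runtime API call:

    #include <cstdio>

    __global__ void hello() {        // GPU code, in the same file as host code
        printf("hello from GPU thread %d\n", threadIdx.x);
    }

    int main(void) {
        hello<<<1, 4>>>();           // language extension: launch 1 block of 4 threads
        cudaDeviceSynchronize();     // Runtime API: wait for the GPU to finish
        return 0;
    }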

CUDA Limitations. No standard library for GPU functions. No parallel data structures. No synchronization primitives (mutexes, semaphores, queues, etc.): you can roll your own, as only atomic*() functions are provided. The toolchain is not as mature as the CPU toolchain, which is felt most intensely in performance debugging. It's only been a decade :)
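
For example, a minimal sketch (not from the slides) of rolling your own global reduction out of the atomic*() functions that are provided:

    // Every thread contributes its element to a single total; *total must be
    // zeroed before launch. float atomicAdd() needs compute capability >= 2.0.
    __global__ void sum(const float *x, float *total, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(total, x[i]);
    }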

Conclusions. GPUs are very interesting parallel machines, and they're not going away, though the Xeon Phi might pose a formidable challenge. They're here and now: your laptop probably already contains one, and your phone definitely has one.