GPU Hardware
CS 380P
Paul A. Navrátil, Manager, Scalable Visualization Technologies, Texas Advanced Computing Center
with thanks to Don Fussell for slides 15-28 and Bill Barth for slides 36-55
CPU vs. GPU characteristics

CPU:
Few computation cores
Supports many instruction streams, but keep few for performance
More complex pipeline: out-of-order processing, deep (tens of stages)
Became simpler over time (Pentium 4 was the complexity peak)
Optimized for serial execution; SIMD units less so, but lower penalty for branching than GPU

GPU:
Many computation cores
Few instruction streams
Simple pipeline: in-order processing, shallow (< 10 stages)
Became more complex over time
Optimized for parallel execution
Potentially heavy penalty for branching
Intel Nehalem (Longhorn nodes)
Intel Westmere (Lonestar nodes)
Intel Sandy Bridge (Stampede nodes)
NVIDIA GT200 (Longhorn nodes)
NVIDIA GF100 Fermi (Lonestar nodes)
(Die diagram: two groups of 8 SMs, each with L1 cache, sharing an L2 cache, surrounded by six memory controllers.)
NVIDIA GK110 Kepler (Stampede nodes)
Hardware Comparison (Longhorn- and Lonestar-deployed versions)

                            Nehalem   Westmere   Sandy Bridge   Tesla Quadro   Fermi Tesla   Kepler Tesla
                            E5540     X5680      E5-2680        FX 5800        M2070         K20
Functional Units            4         6          8              30             14            13
Speed (GHz)                 2.53      3.33       2.7            1.30           1.15          0.706
SIMD / SIMT width           4         4          4              8              32            32
Instruction Streams         16        24         32             240            448           2496
Peak Bandwidth
DRAM->Chip (GB/s)           35        35         51.2           102            150           208
A Word about FLOPS
Yesterday's slides calculated Longhorn's GPUs (NVIDIA Quadro FX 5800) at 624 peak GFLOPS, but NVIDIA marketing literature lists peak performance at 936 GFLOPS!?
NVIDIA's number includes the Special Function Unit (SFU) of each SM, which handles unusual and exceptional instructions (transcendentals, trigonometric functions, roots, etc.)
Fermi marketing materials do not include the SFU in FLOPS measurements, making them more comparable to CPU metrics.
The GPU's Origins: Why They Are Made This Way
Zero DP!
GPU Accelerates Rendering
Determining the color to be assigned to each pixel in an image by simulating the transport of light in a synthetic scene.
The Key Efficiency Trick
Transform into perspective space, densely sample, and produce a large number of independent SIMD computations for shading
Shading a Fragment

sampler mysamp;
Texture2D<float3> mytex;
float3 lightdir;

float4 diffuseshader(float3 norm, float2 uv) {
    float3 kd;
    kd = mytex.sample(mysamp, uv);
    kd *= clamp(dot(lightdir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}

compiles to:

sample r0, v4, t0, s0
mul    r3, v0, cb0[0]
madd   r3, v1, cb0[1], r3
madd   r3, v2, cb0[2], r3
clmp   r3, r3, l(0.0), l(1.0)
mul    o0, r0, r3
mul    o1, r1, r3
mul    o2, r2, r3
mov    o3, l(1.0)

Simple Lambertian shading of a texture-mapped fragment.
Sequential code, performed in parallel on many independent fragments.
How many is many? At least hundreds of thousands per frame.
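For readers who prefer CPU-side code, the shader above can be sketched in plain C; the loop at the bottom makes the per-fragment independence explicit. The names (sample_texture, diffuse_shade, shade_all) are hypothetical stand-ins, not part of any graphics API:

```c
#include <stddef.h>

typedef struct { float x, y, z; } float3;

static float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

static float dot3(float3 a, float3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Stand-in for the texture fetch; a real shader samples mytex. */
static float3 sample_texture(float u, float v) {
    (void)u; (void)v;
    float3 white = {1.0f, 1.0f, 1.0f};
    return white;
}

/* Shade one fragment: kd * clamp(dot(lightdir, norm), 0, 1). */
float3 diffuse_shade(float3 lightdir, float3 norm, float u, float v) {
    float3 kd = sample_texture(u, v);
    float d = clampf(dot3(lightdir, norm), 0.0f, 1.0f);
    kd.x *= d; kd.y *= d; kd.z *= d;
    return kd;
}

/* This loop is embarrassingly parallel: no fragment depends on another,
 * which is exactly what the GPU's SIMD cores exploit. */
void shade_all(float3 lightdir, const float3 *norms, const float *uvs,
               float3 *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = diffuse_shade(lightdir, norms[i], uvs[2*i], uvs[2*i + 1]);
}
```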
Work per Fragment

unshaded fragment
  sample r0, v4, t0, s0
  mul    r3, v0, cb0[0]
  madd   r3, v1, cb0[1], r3
  madd   r3, v2, cb0[2], r3
  clmp   r3, r3, l(0.0), l(1.0)
  mul    o0, r0, r3
  mul    o1, r1, r3
  mul    o2, r2, r3
  mov    o3, l(1.0)
shaded fragment

Do a couple hundred thousand of these @ 60 Hz or so
How? We have independent threads to execute, so use multiple cores
What kind of cores?
The CPU Way
(Diagram: Fetch/Decode, Branch Predictor, Instruction Scheduler, Prefetch Unit, Execution Context, Caches; unshaded fragment in, shaded fragment out.)
Big and complex, but fast on a single thread
However, each program is very short, so we do not need this much complexity
Must complete many, many short programs quickly
Simplify and Parallelize
(Diagram: 16 simple cores, each with its own Fetch/Decode unit and Execution Context.)
Don't use a few CPU-style cores
Use simpler ones, and many more of them
Shared Instructions
(Diagram: one Instruction Cache and Fetch/Decode unit shared by 16 execution contexts with Shared Memory.)
Applying the same instructions to different data is the definition of SIMD!
Thus SIMD amortizes instruction handling over multiple ALUs
But What about the Other Processing?
A graphics pipeline does more than shading. Other ops are done in parallel, like transforming vertices. So we need to execute more than one program in the system simultaneously.
If we replicate these SIMD processors, we now have the ability to do different SIMD computations in parallel in different parts of the machine.
In this example, we can have 128 threads in parallel, but only 8 different programs running simultaneously
What about Branches?
GPUs use predication!
(Diagram: 8 SIMD lanes over time; the per-lane condition mask is T F F T F T F F.)

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<unconditional shader code>

All lanes step through both sides of the branch; lanes whose predicate is false are masked off, so a divergent branch costs roughly the sum of both paths.
Efficiency: Dealing with Stalls
A thread is stalled when its next instruction to be executed must await a result from a previous instruction:
Pipeline dependencies
Memory latency
The complex CPU hardware (omitted from these machines) was effective at dealing with stalls. What will we do instead?
Since we expect to have many more threads than processors, we can interleave their execution to keep the hardware busy when a thread stalls.
Multithreading!
Multithreading
(Timeline diagram: four groups of threads, 1-8, 9-16, 17-24, and 25-32, share one core. When the running group stalls, a ready group is swapped in so the core stays busy. Each group becomes ready again while the others run, but waits its turn to be rescheduled, adding extra latency for individual threads.)
Costs of Multithreading
(Diagram: a core with an instruction cache, fetch/decode unit, a storage pool holding four thread contexts, and shared memory.)
Adds latency to individual threads in order to minimize the time to complete all threads
Requires extra context storage; more contexts can mask more latency
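A back-of-envelope model of this trade-off (the function and its compute/stall parameters are illustrative assumptions, not from the slides): if each thread alternates between `compute` busy cycles and `stall` idle cycles, interleaving n contexts keeps the core busy up to the point where compute fully covers the stalls.

```c
/* Fraction of time the core does useful work with n interleaved
 * thread contexts, each computing for `compute` cycles and then
 * stalling for `stall` cycles. Utilization saturates at 1.0 once
 * n * compute >= compute + stall. */
double utilization(int n_contexts, double compute, double stall) {
    double u = n_contexts * compute / (compute + stall);
    return u > 1.0 ? 1.0 : u;
}
```

For example, a thread that computes 10 cycles and then waits 30 cycles on memory keeps a single-context core only 25% busy, while four such contexts can hide the latency entirely.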
Example System
32 cores x 16 ALUs/core = 512 (madd) ALUs @ 1 GHz = 1 teraflop
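The peak-FLOPS arithmetic here, and in the K20 and Radeon examples that follow, uses one pattern, sketched below (a fused multiply-add counts as 2 floating-point ops):

```c
/* Peak GFLOPS = cores * SIMD lanes per core * flops per lane per clock * GHz.
 * With madd units, flops_per_lane is 2. */
double peak_gflops(int cores, int lanes, int flops_per_lane, double ghz) {
    return (double)cores * lanes * flops_per_lane * ghz;
}
```

For the example system: 32 cores x 16 lanes x 2 ops x 1 GHz = 1024 GFLOPS, i.e. about 1 teraflop.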
Real Example: NVIDIA Tesla K20
13 cores ("Streaming Multiprocessors", SMX)
192 SIMD functional units per core ("CUDA cores")
Each FU has 1 fused multiply-add unit (SP and DP)
Peak 2496 SP fused multiply-adds (4992 SP floating-point ops) per clock
4 warp schedulers and 8 instruction dispatch units
Up to 32 threads concurrently executing (called a "warp")
Coarse-grained: up to 64 warps interleaved per core to mask latency to memory
Real Example: AMD Radeon HD 7970
32 Compute Units ("Graphics Core Next" (GCN) processors)
4 Stream Cores per Compute Unit
16-wide SIMD per Stream Core
1 fused multiply-add per lane
Peak 4096 SP ops per clock
2-level multithreading
Fine-grained: 8 threads interleaved into the pipelined GCN cores
Up to 256 concurrent threads per CU (grouped into 64-thread "Wavefronts")
Coarse-grained: groups of about 40 wavefronts interleaved to mask memory latency
Up to 81,920 concurrent work items
Mapping Marketing Terminology to Engineering Details

Engineering term                       x86          NVIDIA                          AMD/ATI
Functional unit ("core")               Core         Streaming Multiprocessor (SM)   SIMD Engine / Processor
SIMD lane                              SIMD lane    CUDA core                       Stream core
Simultaneously processed SIMD width
(concurrent threads per FU
instruction stream)                                 Warp                            Wavefront
Instruction stream                     Thread       Kernel                          Kernel
Memory Architecture

CPU style:
Multiple levels of cache on chip
Takes advantage of temporal and spatial locality to reduce demand on remote, slow DRAM
Caches provide local high bandwidth to cores on chip
~25 GB/sec to main memory

GPU style:
Local execution contexts (64KB) and a similar amount of local memory
Read-only texture cache
Traditionally no cache hierarchy (but see NVIDIA Fermi and Intel MIC)
Much higher bandwidth to main memory: 150-200 GB/sec
Performance Implications of GPU Bandwidth
GPU memory system is designed for throughput:
Wide bus (150-200 GB/sec) and high-bandwidth DRAM organization (GDDR3-5)
Careful scheduling of memory requests to make efficient use of available bandwidth (recent architectures help with this)
An NVIDIA Tesla K20 GPU in Stampede has 13 SMXs with 192 SIMD lanes each and a 0.706 GHz clock. How many peak single-precision FLOPS? 3524 GFLOPS
Memory bandwidth is 208 GB/s. How many FLOPs per byte transferred must be performed for peak efficiency? ~17 FLOPs per byte
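The FLOPs-per-byte question on this slide is a one-line ratio; a small sketch of the arithmetic:

```c
/* Arithmetic intensity required to stay compute-bound rather than
 * memory-bound: peak compute rate divided by memory bandwidth. */
double flops_per_byte(double peak_gflops, double bandwidth_gbs) {
    return peak_gflops / bandwidth_gbs;
}
```

For the K20: 3524 GFLOPS / 208 GB/s is roughly 17 FLOPs per byte; a kernel doing less arithmetic per byte moved is limited by memory bandwidth, not compute.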
Performance Implications of GPU Bandwidth
An AMD Radeon HD 7970 GHz Edition has 32 Compute Units with 64 Stream Cores each and a 0.925 GHz clock. How many peak single-precision FLOPS? 3789 GFLOPS
Memory bandwidth is 264 GB/s. How many FLOPs per byte transferred must be performed for peak efficiency? ~14 FLOPs per byte
AMD's new GCN technology has closed the bandwidth gap
Compute performance will likely continue to outpace memory bandwidth performance
Real Example: Intel MIC Co-processor
Many Integrated Cores: originally Larrabee, now Knights Ferry (dev) and Knights Corner (prod)
32 cores (Knights Ferry); >50 cores (Knights Corner)
Explicit 16-wide vector ISA (16-wide madd unit)
Peak 1024 SP float ops per clock for 32 cores
Each core interleaves four threads of x86 instructions
Additional interleaving under software control
Traditional x86 programming and threading model
http://www.hpcwire.com/hpcwire/2010-08-05/compilers_and_more_knights_ferry_versus_fermi.html
What Is It?
Co-processor:
PCI Express card
Stripped-down Linux operating system
Dense, simplified processor: many power-hungry operations removed, wider vector unit, wider hardware thread count
Lots of names:
Many Integrated Core architecture, aka MIC
Knights Corner (code name)
Intel Xeon Phi Co-processor SE10P (product name)
What Is It?
Leverage x86 architecture (a CPU with many cores):
x86 cores that are simpler, but allow for more compute throughput
Leverage existing x86 programming models
Dedicate much of the silicon to floating-point ops
Cache coherent
Increase floating-point throughput:
Strip expensive features (out-of-order execution, branch prediction)
Widen SIMD registers for more throughput
Fast (GDDR5) memory on card
Intel Xeon Phi Chip
22 nm process
Based on what Intel learned from Larrabee, the SCC, and the TeraFlops Research Chip
MIC Architecture
Many cores on the die
L1 and L2 cache
Bidirectional ring network for L2
Memory and PCIe connection
George Chrysos, Intel, Hot Chips 24 (2012): http://www.slideshare.net/intelxeon/under-the-armor-of-knights-corner-intel-mic-architecture-at-hotchips-2012
Speeds and Feeds
Processor: ~1.1 GHz, 61 cores, 512-bit wide vector unit, 1.074 TF peak DP
Data cache: L1 32KB/core; L2 512KB/core, 30.5 MB/chip
Memory: 8GB GDDR5 DRAM, 5.5 GT/s, 512-bit*
PCIe: 5.0 GT/s, 16-bit
Advantages
Intel's MIC is based on x86 technology:
x86 cores w/ caches and cache coherency
SIMD instruction set
Programming for MIC is similar to programming for CPUs:
Familiar languages: C/C++ and Fortran
Familiar parallel programming models: OpenMP & MPI
MPI on host and on the coprocessor
Any code can run on MIC, not just kernels
Optimizing for MIC is similar to optimizing for CPUs:
Optimize once, run anywhere
Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge.
Stampede Programming Models
Traditional cluster: pure MPI and MPI+X (X: OpenMP, TBB, Cilk+, OpenCL)
Native Phi: use one Phi and run OpenMP or MPI programs directly
MPI tasks on host and Phi: treat the Phi (mostly) like another host; pure MPI and MPI+X
MPI on host, offload to Xeon Phi: targeted offload through OpenMP extensions; automatically offload some library routines with MKL
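A minimal sketch of the targeted-offload style, assuming the Intel compiler's offload pragma syntax; with other compilers the pragma is simply ignored and the loop runs on the host, which is also a valid configuration. The function and variable names are hypothetical:

```c
#include <stddef.h>

/* Host-with-offload sketch: the host owns the data and ships one hot
 * loop to the coprocessor. The #pragma offload line below is Intel's
 * offload syntax for Xeon Phi; non-Intel compilers ignore it and
 * execute the loop on the host instead. */
void scale_add(const float *x, float *y, float a, size_t n) {
#pragma offload target(mic) in(x : length(n)) inout(y : length(n))
    {
#pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}
```

This matches the slide's model: the MPI ranks live on the host, while data-parallel regions like this one are candidates for offload.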
Traditional Cluster
Stampede is 2+ PF of FDR-connected Xeon E5
High bandwidth: 56 Gb/s (sustaining >52 Gb/s)
Low latency: ~1 μs on a leaf switch, ~2.5 μs across the system
Highly scalable for existing MPI codes
IB multicast and collective offloads for improved collective performance
Native Execution
Build for Phi with -mmic
Execute on host, or ssh to mic0 and run on the Phi
Can safely use all 61 cores, but offload programs should stay away from the 61st core, since the offload daemon runs there
Symmetric MPI
Host and Phi can operate symmetrically as MPI targets
High code reuse: MPI and hybrid MPI+X
Be careful to balance workload between big cores and little cores
Be careful to create locality between local host, local Phi, remote hosts, and remote Phis
Take advantage of the topology-aware MPI interface under development in MVAPICH2 (NSF STCI project with OSU, TACC, and SDSC)
Symmetric MPI
Typical: 1-2 GB per task on the host
Targeting 1-10 MPI tasks per Phi on Stampede, with 6+ threads per MPI task
MPI with Offload to Phi
Existing codes using accelerators have already identified regions where offload works well
Porting these to OpenMP offload should be straightforward
Automatic offload where MKL kernel routines can be used: xgemm, etc.
What We at TACC Like about Phi
Intel's MIC is based on x86 technology:
x86 cores w/ caches and cache coherency
SIMD instruction set
Programming for Phi is similar to programming for CPUs:
Familiar languages: C/C++ and Fortran
Familiar parallel programming models: OpenMP & MPI
MPI on host and on the coprocessor
Any code can run on MIC, not just kernels
Optimizing for Phi is similar to optimizing for CPUs:
Optimize once, run anywhere
Our early Phi porting efforts for codes in the field have doubled performance on Sandy Bridge.
Will My Code Run on Xeon Phi?
Yes... but that's the wrong question. The right questions are:
Will your code run *best* on Phi?
Will you get great Phi performance without additional work?
Early Phi Programming Experiences at TACC
Codes port easily: minutes to days, depending mostly on library dependencies
Performance can require real work:
While the software environment continues to evolve, getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect
Scalability is pretty good
Multiple threads per core is really important
Getting your code to vectorize is really important
LBM Example
Lattice Boltzmann Method CFD code (Carlos Rosales, TACC)
MPI code with OpenMP
Finding all the right routines to parallelize is critical
PETSc/MUMPS with AO
Hydrostatic ice sheet modeling
MUMPS solver (called through PETSc)
BLAS calls automatically offloaded behind the scenes
Publishing Results
Published results about Xeon Phi should include:
Silicon stepping (B0 or B1)
Software stack version (Gold)
Intel product SKU (Intel Xeon Phi Co-processor SE10P)
Core count: 61
Clock rate (1.09 GHz or 1.1 GHz)
DRAM rate (5.6 GT/s)