Next Generation GPU Architecture Code-named Fermi

Similar documents

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

NVIDIA GeForce GTX 580 GPU Datasheet

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Introduction to GPU Programming Languages

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Introduction to GPU hardware and to CUDA

GPU Parallel Computing Architecture and CUDA Programming Model

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

GPU Computing - CUDA

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Introduction to GPGPU. Tiziano Diamanti

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

NVIDIA Solves the GPU Computing Puzzle

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

GPU Programming Strategies and Trends in GPU Computing

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Programming GPUs with CUDA

Evaluation of CUDA Fortran for the CFD code Strukti

Guided Performance Analysis with the NVIDIA Visual Profiler

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

L20: GPU Architecture and Models

CUDA programming on NVIDIA GPUs

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Introduction to GPU Computing

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

OpenCL Programming for the CUDA Architecture. Version 2.3

GPUs for Scientific Computing

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2

Binary search tree with SIMD bandwidth optimization using SSE

HPC with Multicore and GPUs

Introduction to GPU Architecture

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

~ Greetings from WSU CAPPLab ~

Stream Processing on GPUs Using Distributed Multimedia Middleware

The Fastest, Most Efficient HPC Architecture Ever Built

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

GPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Computer Graphics Hardware An Overview

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

GPGPU Computing. Yong Cao

Texture Cache Approximation on GPUs

Parallel Programming Survey

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

NVIDIA Tools For Profiling And Monitoring. David Goodwin

GPU Profiling with AMD CodeXL

Introduction to Cloud Computing

Embedded Systems: map to FPGA, GPU, CPU?

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

GeoImaging Accelerator Pansharp Test Results

GPU Performance Analysis and Optimisation

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

GPU Tools Sandra Wienke

FPGA-based Multithreading for In-Memory Hash Joins

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

CUDA Basics. Murphy Stein New York University

QuickSpecs. NVIDIA Quadro M GB Graphics INTRODUCTION. NVIDIA Quadro M GB Graphics. Overview

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Coping with Complexity: CPUs, GPUs and Real-world Applications

Multi-Threading Performance on Commodity Multi-Core Processors

Parallel Firewalls on General-Purpose Graphics Processing Units

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

GPU architecture II: Scheduling the graphics pipeline

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

OpenSPARC T1 Processor

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

gpus1 Ubuntu Available via ssh

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Multi-core and Linux* Kernel

Radeon HD 2900 and Geometry Generation. Michael Doggett

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

QCD as a Video Game?

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

NVIDIA Quadro K5200 Amazing 3D Application Performance and Capability

Intel Pentium 4 Processor on 90nm Technology

APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE

Chapter 2 Parallel Architecture, Software And Performance

Trends in High-Performance Computing for Power Grid Applications

The Future Of Animation Is Games

Case Study on Productivity and Performance of GPGPUs

Transcription:

Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU

Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time = 120 Mpix / sec every pixel is independent tremendous inherent parallelism Create, run, & retire lots of threads very rapidly measured 14.8 Gthread/s on trivial (x[i]+=1) kernel Use multithreading to hide latency 1 stalled thread is OK if 100 are ready to run Much of scientific computing is also a throughput problem!

Multicore CPU: Run ~10 Threads Fast Processor Memory Processor Memory Global Memory Few processors, each supporting 1 2 hardware threads On-chip memory/cache near processors Shared global memory space (external DRAM)

Manycore GPU: Run ~10,000 Threads Fast Processor Memory Processor Memory Processor Memory Global Memory Hundreds of processors, each supporting hundreds of hardware threads On-chip memory/cache near processors Shared global memory space (external DRAM)

Why do CPUs and GPUs differ? Different goals produce different designs GPU assumes work load is highly parallel CPU must be good at everything, parallel or not CPU: minimize latency experienced by 1 thread lots of big on-chip caches extremely sophisticated control GPU: maximize throughput of all threads lots of big ALUs multithreading can hide latency so skip the big caches simpler control, cost amortized over ALUs via SIMD

GPU Computing: The Best of Both Worlds Multicore CPU Heterogeneous Computing

Giga Thread HOST I/F Goals of the Fermi Architecture Improved peak performance Improved throughput efficiency Broader applicability L2 Full integration within modern software development environment

Giga Thread HOST I/F Fermi Focus Areas Expand performance sweet spot of the GPU Caching Concurrent kernels FP64 512 cores GDDR5 memory L2 Bring more users, more applications to the GPU C++ Visual Studio Integration ECC

SM Multiprocessor Architecture Instruction Cache Scheduler Scheduler 32 CUDA cores per SM (512 total) Dispatch Dispatch Register File 8x peak FP64 performance 50% of peak FP32 performance Dual Thread Scheduler 64 KB of RAM for shared memory and L1 cache (configurable) Load/Store Units x 16 Special Func Units x 4 FP32 FP64 INT SFU LD/ST Ops / clk 32 16 32 4 16 Interconnect Network 64K Configurable Cache/Shared Mem Uniform Cache

Fermi Instruction Set Architecture Instruction Cache Scheduler Scheduler Dispatch Dispatch Enables C++ features virtual functions, new/delete, try/catch Register File Unified load/store addressing 64-bit addressing for large problems Optimized for CUDA C, OpenCL & Direct Compute native (x,y)-based LD/ST with format conversion Load/Store Units x 16 Special Func Units x 4 Interconnect Network Enables system call functionality 64K Configurable Cache/Shared Mem Uniform Cache

CUDA Architecture Instruction Cache Scheduler Scheduler New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs Dispatch Dispatch Register File Fused multiply-add (FMA) instruction for both single and double precision Newly designed integer ALU optimized for 64-bit and extended precision operations CUDA Dispatch Port Operand Collector FP Unit INT Unit Result Queue Load/Store Units x 16 Special Func Units x 4 Interconnect Network 64K Configurable Cache/Shared Mem Uniform Cache

IEEE 754-2008 Floating Point IEEE 754-2008 results 64-bit double precision 32-bit single precision full-speed subnormal operands & results NaNs, +/- Infinity IEEE 754-2008 rounding nearest even, zero, +inf, -inf IEEE 754-2008 Fused Multiply-Add (FMA) D = A*B + C; No loss of precision IEEE divide & sqrt use FMA Multiply-Add (MAD): D = A*B + C; A B = Product + C = D (truncate digits) Fused Multiply-Add (FMA): D = A*B + C; A B = (retain all digits) Product + C = D (no loss of precision)

Giga Thread HOST I/F Cached Memory Hierarchy True cache hierarchy + on-chip shared RAM On-chip shared memory: good fit for regular memory access dense linear algebra, image processing, Caches: good fit for irregular or unpredictable memory access ray tracing, sparse matrix multiply, physics Separate L1 Cache for each SM (16/48 KB) Improves bandwidth and reduces latency Unified L2 Cache for all SMs (768 KB) Fast, coherent data sharing across all cores in the GPU L2

Giga Thread HOST I/F Larger, Faster Memory Interface GDDR5 memory interface 2x speed of GDDR3 Up to 1 Terabyte of memory attached to GPU Operate on large data sets L2

ECC Memory Protection All major internal memories protected by ECC Register file L1 cache L2 cache External DRAM protected by ECC GDDR5 as well as SDDR3 memory configurations ECC is a must have for many computing applications Clear customer feedback

Faster Atomic Operations (Read / Modify / Write) Significantly more efficient chip-wide inter-block synchronization Much faster parallel aggregation ray tracing, histogram computation, clustering and pattern recognition, face recognition, speech recognition, BLAS, etc Accelerated by cached memory hierarchy Fermi increases atomic performance by 5x to 20x

GigaThread TM Hardware Thread Scheduler Hierarchically manages tens of thousands of simultaneously active threads 10x faster context switching on Fermi HTS Concurrent kernel execution

Time Concurrent Kernel Execution Kernel 1 Kernel 1 Kernel 2 Kernel 2 Kernel 2 Kernel 3 Ker 4 Kernel 2 nel Kernel 5 Kernel 4 Kernel 3 Kernel 5 Serial Kernel Execution Parallel Kernel Execution

GigaThread Streaming Data Transfer Engine Dual DMA engines Simultaneous CPU GPU and GPU CPU data transfer SDT SDT Fully overlapped with CPU/GPU processing Kernel 0 CPU SDT0 GPU SDT1 Kernel 1 CPU SDT0 GPU SDT1 Kernel 2 CPU SDT0 GPU SDT1 Kernel 3 CPU SDT0 GPU SDT1

G80 GT200 Fermi Transistors 681 million 1.4 billion 3.0 billion CUDA s 128 240 512 Double Precision Floating Point - 30 FMA ops/clock 256 FMA ops/clock Single Precision Floating Point 128 MAD ops/clock 240 MAD ops/clock 512 FMA ops/clock Special Function Units (per SM) 2 2 4 Warp schedulers (per SM) 1 1 2 Shared Memory (per SM) 16 KB 16 KB Configurable 48/16 KB L1 Cache (per SM) - - Configurable 16/48 KB L2 Cache - - 768 KB ECC Memory Support - - Yes Concurrent Kernels - - Up to 16 Load/Store Address Width 32-bit 32-bit 64-bit

CUDA Parallel Computing Architecture GPU Computing Applications C++ tm Direct C OpenCL Compute Fortran Java and Python NVIDIA GPU with the CUDA Parallel Computing Architecture OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.

C with CUDA 3.0 Unified addressing for C and C++ pointers Global, shared, local addresses Enables 3 rd party GPU callable libraries, dynamic linking One 40-bit address space for load/store instructions Compiling for native 64-bit addressing IEEE 754-2008 single & double precision C99 math.h support Concurrent Kernels

NVIDIA Integrated Development Environment Code-named Nexus Industry s 1st IDE for massively parallel applications Accelerate development of CPU + GPU co-processing applications Complete Visual Studio-integrated development environment C, C++, OpenCL, & DirectCompute platforms DirectX and OpenGL graphics APIs

Giga Thread HOST I/F Fermi Summary Third generation CUDA architecture Improved peak performance Improved throughput efficiency Broader applicability L2 Full integration within modern development environment