NVIDIA GPU Architecture for General Purpose Computing. Anthony Lippert, 4/27/09



Outline: Introduction, GPU Hardware, Programming Model, Performance Results, Supercomputing Products, Conclusion

Introduction: GPU = Graphics Processing Unit. Hundreds of cores, programmable, easily installed in most desktops, and priced similarly to a CPU. GPU performance has scaled with Moore's Law better than CPU performance has.

Introduction - Motivation (figure)


GPU Hardware - Multiprocessor Structure: N multiprocessors with M cores each. Execution is SIMD: the cores in a multiprocessor share one instruction unit, so threads that diverge onto different branches may not execute in parallel.
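A minimal sketch of how this SIMD divergence shows up in practice (the kernel and names are illustrative, not from the slides):

```cuda
// Threads in the same warp that take different branches are serialized:
// the hardware runs the if-side and the else-side one after the other,
// masking off the inactive threads each time.
__global__ void diverge(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)          // even and odd lanes of a warp split here
        out[i] = i * 2;      // first pass: even threads active
    else
        out[i] = i * 3;      // second pass: odd threads active
}
```

Both branches still produce correct results; divergence only costs performance, since the two paths no longer run in parallel.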

GPU Hardware - Memory Hierarchy: Processors have 32-bit registers. Multiprocessors have shared memory, a constant cache, and a texture cache. The constant and texture caches are read-only and have faster access than shared memory.
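An illustrative kernel touching the memory spaces named above (registers, shared memory, constant memory); the sizes and names here are made up for the example:

```cuda
__constant__ float coeff[16];          // read-only constant memory, cached

__global__ void smooth(const float *in, float *out)
{
    __shared__ float tile[256];        // shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                   // x lives in a 32-bit register
    tile[threadIdx.x] = x;
    __syncthreads();                   // wait until the whole tile is filled
    out[i] = tile[threadIdx.x] * coeff[0];
}
```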

GPU Hardware - NVIDIA GTX 280 Specifications:
933 GFLOPS peak performance
10 thread processing clusters (TPCs)
3 multiprocessors per TPC
8 cores per multiprocessor (240 cores in total)
16,384 registers per multiprocessor
16 KB shared memory per multiprocessor
64 KB constant cache per multiprocessor
6-8 KB texture cache per multiprocessor
1.3 GHz clock rate
Single- and double-precision floating point
1 GB GDDR3 dedicated memory

GPU Hardware (block diagram): thread scheduler, thread processing clusters, atomic/tex L2, memory

GPU Hardware - Thread Scheduler: Hardware-based; manages the scheduling of threads across the thread processing clusters. Nearly 100% utilization: if a thread is waiting on a memory access, the scheduler can perform a zero-cost, immediate context switch to another thread. Up to 30,720 threads can be resident on the chip (1,024 per multiprocessor across 30 multiprocessors).

GPU Hardware - Thread Processing Cluster (diagram): IU - instruction unit; TF - texture filtering

GPU Hardware - Atomic/Tex L2: A level-2 cache shared by all thread processing clusters. "Atomic" refers to the ability to perform read-modify-write operations on memory: it allows fine-grained access to individual memory locations and supports parallel reductions and parallel data-structure management.
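A sketch of the kind of parallel reduction this atomic support enables, here a histogram (the kernel and names are illustrative):

```cuda
// Many threads may increment the same bin concurrently; atomicAdd makes
// each read-modify-write indivisible, so no updates are lost.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```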


GPU Hardware - GT200 Power Features: Dynamic power management; power consumption tracks utilization. Idle/2D power mode: 25 W. Blu-ray playback mode: 35 W. Full 3D performance mode: 236 W worst case. HybridPower mode: 0 W; on an nForce motherboard, the discrete GPU can be powered off when not in use and work diverted to the motherboard GPU (mGPU).

GPU Hardware (die diagram): 10 thread processing clusters (TPCs), 3 multiprocessors per TPC, 8 cores per multiprocessor; ROPs (raster operation processors, for graphics); 1024 MB frame buffer for displaying images; texture (L2) cache

Programming Model - Past: The GPU was intended for graphics only, not general-purpose computing; the programmer had to recast the computation in a graphics language such as OpenGL. Complicated. Present: NVIDIA developed CUDA, a language for general-purpose GPU computing. Simple.

Programming Model - CUDA: Compute Unified Device Architecture. An extension of the C language, used to control the device. The programmer specifies which functions run on the CPU and which on the GPU; the host code can be C++, but device code may only be C. The programmer also specifies the thread layout.

Programming Model - Thread Layout: Threads are organized into blocks, and blocks are organized into a grid. Each block executes on a single multiprocessor. A warp is the set of threads executed in parallel: 32 threads per warp.
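The layout above is what a kernel uses to find its own element; a hypothetical element-wise add shows the standard index calculation:

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global thread index from the grid/block/thread hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be partially full
        c[i] = a[i] + b[i];
}
```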

Programming Model - Heterogeneous Computing: The GPU and CPU execute different types of code. The CPU runs the main program, sending tasks to the GPU in the form of kernel functions. Multiple kernel functions may be declared and called, but only one kernel may be called at a time.
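A hedged sketch of the host side of this division of labor: the CPU runs main(), allocates device memory, launches a kernel, and waits for it. The kernel and sizes are illustrative, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    // ... copy input from host to d_x with cudaMemcpy ...
    scale<<<n / 256, 256>>>(d_x, n);   // grid of 4 blocks, 256 threads each
    cudaDeviceSynchronize();           // CPU blocks until the kernel finishes
    // ... copy results back to the host with cudaMemcpy ...
    cudaFree(d_x);
    return 0;
}
```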

Programming Model - GPU vs. CPU Code (example from D. Kirk, Parallel Computing: What has changed lately?, Supercomputing 2007)

Performance Results (charts)

Supercomputing Products (overview): Tesla C1060 GPU (933 GFLOPS), nForce motherboard, Tesla S1070 server (4.14 TFLOPS)

Supercomputing Products - Tesla C1060: Similar to the GTX 280, but with no video connections. 933 GFLOPS peak performance, 4 GB GDDR3 dedicated memory, 187.8 W maximum power consumption.

Supercomputing Products - Tesla S1070: A 1U server containing 4 Tesla GPUs (960 cores in total). 4.14 TFLOPS peak performance, 16 GB GDDR3 memory, 408 GB/s bandwidth, 800 W maximum power consumption.

Conclusion: SIMD execution causes problems when threads diverge. GPU computing is a good choice for fine-grained, data-parallel programs with limited communication, and a poor choice for coarse-grained programs with heavy communication. The GPU has become a co-processor to the CPU.

References:
D. Kirk. Parallel Computing: What has changed lately? Supercomputing, 2007.
NVIDIA. NVIDIA GeForce GTX 200 GPU Architectural Overview. May 2008.
NVIDIA. NVIDIA CUDA Programming Guide 2.1. 2008.
nvidia.com