Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano




GPU Architectures: how did we get here? NVIDIA Fermi, 512 Processing Elements (PEs).

What Can It Do? Render triangles: the NVIDIA GTX 480 can render 1.6 billion triangles per second!

General-Purpose Computing. Ref: http://www.nvidia.com/object/tesla_computing_solutions.html

General-Purpose GPUs (GP-GPUs). In 2006, NVIDIA introduced the GeForce 8800 GPU supporting a new programming language: CUDA (Compute Unified Device Architecture). Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas for multiple platforms. Basic idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing. Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU device for execution, as in the sketch below. Note: this lecture presents a simplified version of the NVIDIA CUDA-style model and only considers GPU execution of computational kernels, not graphics (describing graphics processing would probably need another course).
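To make the attached-processor model concrete, here is a minimal CUDA sketch (not from the lecture; the kernel name scale_kernel and the sizes are illustrative): the host allocates device memory, copies data in, launches a data-parallel kernel, and copies the result back.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Data-parallel kernel: one CUDA thread scales one array element.
__global__ void scale_kernel(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

int main(void) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc((void **)&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // host -> device

    // The host issues the kernel to the device: a grid of blocks, 256 threads each.
    scale_kernel<<<(N + 255) / 256, 256>>>(d_x, 2.0f, N);

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host
    printf("x[0] = %f\n", h_x[0]);                         // expect 2.0
    cudaFree(d_x);
    free(h_x);
    return 0;
}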

General-Purpose GPUs (GP-GPUs). Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of general-purpose applications? Basic concepts: a heterogeneous execution model (the CPU is the host, the GPU is the device); a C-like programming language for GPUs; all forms of GPU parallelism unified as the CUDA thread. The programming model is Single Instruction Multiple Thread (SIMT): a massive number of lightweight threads, with the programmer unaware of the number of parallel cores.

Threads, Blocks and Grids. Thread: the lowest level of parallelism offered as a programming primitive; in hardware, threads are executed as threads of SIMD instructions. A thread is associated with each data element. Threads are organized into blocks, and blocks are organized into grids. The GPU hardware handles thread management, not applications or the OS. A sketch of how a kernel sees this hierarchy follows.
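A small sketch (kernel name illustrative): CUDA exposes the hierarchy through built-in variables, where threadIdx locates a thread within its block, blockIdx locates the block within the grid, and blockDim gives the block size.

// Each thread computes its global position from the hierarchy and
// touches exactly one data element, as described above.
__global__ void where_am_i(int *element_of) {
    // threadIdx.x: position within the block; blockIdx.x: block within the grid
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    element_of[global_id] = global_id;   // one data element per thread
}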

Threads, Blocks and Grids (2). [Figure: the grid/block/thread hierarchy. Copyright 2011, Elsevier Inc. All rights reserved.]

Threads, Blocks and Grids (3). Each thread of SIMD instructions calculates 32 elements per instruction, and each thread block is executed by a multithreaded SIMD processor. Example: multiply two vectors of length 8192. Each SIMD instruction executes 32 elements at a time; each thread block contains 16 threads of SIMD instructions; grid size = 16 thread blocks. Total: 16 x 16 x 32 = 8192 elements. (Copyright 2012, Elsevier Inc. All rights reserved.) The corresponding launch configuration is sketched below.
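A hedged sketch of this exact configuration (names illustrative): 16 thread blocks of 512 CUDA threads each, i.e., 16 warps of 32 threads per block, cover all 8192 elements.

__global__ void vec_mul(const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 8191
    c[i] = a[i] * b[i];                             // one element per CUDA thread
}

// Launch: 16 blocks x (16 warps x 32 threads) = 16 x 512 = 8192 threads.
// vec_mul<<<16, 512>>>(d_a, d_b, d_c);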

Scheduling Threads. Two levels of HW schedulers: 1. the Thread Block Scheduler assigns thread blocks to multithreaded SIMD processors; 2. the SIMD Thread Scheduler, within a SIMD processor, decides which threads run each clock cycle. In the Fermi architecture these are: 1. the Giga Thread Scheduler; 2. the Dual Warp Scheduler.

NVIDIA Fermi GPU (2010). Each warp is composed of 32 CUDA threads on Fermi (16 CUDA threads on the earlier Tesla generation). Fermi GPUs have 16 multithreaded SIMD processors, called SMs (Streaming Multiprocessors). Each SM has: two warp schedulers; two sets of 16 CUDA cores, also called SPs; 16 load/store units; and 4 special function units (SFUs) to execute transcendental instructions (such as sine, cosine, reciprocal, and square root). Globally there are 512 CUDA cores (512 SPs).

Floor plan of the Fermi GPU, composed of 16 multithreaded SIMD processors (SMs, Streaming Multiprocessors) and a common L2 cache. The Thread Block Scheduler on the left (Giga Thread) distributes thread blocks to the SMs, each of which has its own dual SIMD thread scheduler (Dual Warp Scheduler).

Fermi Streaming Multiprocessor (SM). Each SM has two warp schedulers, two sets of 16 cores (2 x 16 SPs), 16 load/store units, and 4 SFUs. Each SM has 32K 32-bit registers and 64 KB of shared memory/L1 cache. These figures can be queried at run time, as sketched below.
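A minimal sketch using the CUDA runtime's device query; on a Fermi-class GPU it would report figures matching the slide (warp size 32, 32,768 registers and 48 KB of shared memory per block at the default configuration).

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("SMs:              %d\n", prop.multiProcessorCount);
    printf("warp size:        %d\n", prop.warpSize);
    printf("registers/block:  %d\n", prop.regsPerBlock);
    printf("shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}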

NVIDIA Fermi GPU: Scheduling Threads. Two levels of HW schedulers: 1. the Giga Thread Engine schedules thread blocks to the various SMs; 2. the Dual Warp Scheduler selects two warps of 32 threads of SIMD instructions and issues one instruction from each warp to its execution units. The warps are independent, so there is no need to check for dependencies within the instruction stream. This is analogous to a multithreaded vector processor that can issue vector instructions from two independent threads.

NVIDIA Fermi GPU: Dual Warp Scheduler. [Figure.]

Fermi GPU: Fused Multiply-Add. The Fermi architecture implements the IEEE 754-2008 floating-point standard. Fused Multiply-Add (FMA) performs a multiplication and an addition, a <- a + (b x c), with a single final rounding step (more accurate than performing the two operations separately). Each SP can complete one single-precision FMA per clock (counted as two FP operations); each SM can thus execute up to 32 single-precision (32-bit) or 16 double-precision (64-bit) FMAs per clock.
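A small sketch of the difference (kernel name illustrative): CUDA's fmaf() performs the fused operation with one rounding, while a separate multiply and add round twice. Note that nvcc may contract a + b * c into an FMA on its own unless compiled with -fmad=false.

__global__ void fma_demo(float *out, float a, float b, float c) {
    out[0] = fmaf(b, c, a);  // fused: b*c + a with a single final rounding
    out[1] = a + b * c;      // unfused form: rounds the product, then the sum
}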

Fermi GPU Memory Hierarchy. [Figures.]

Fermi GPU Memory Hierarchy. Each SM has 32,768 32-bit registers, divided into lanes. Each SIMD thread has access to its own registers and not to those of other threads, and each SIMD thread is limited to 64 registers. Each SM has 64 KB of shared memory/L1 cache, configurable as either 48 KB of shared memory among threads (within the same block) plus 16 KB of L1 cache for individual threads, or 16 KB of shared memory and 48 KB of L1 cache; see the sketch below.
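The shared-memory/L1 split is selected per kernel through a runtime hint; a sketch, where my_kernel is an illustrative name:

#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { }  // illustrative kernel

void choose_split(void) {
    // Prefer 48 KB shared memory + 16 KB L1 cache for this kernel:
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // Or prefer 16 KB shared memory + 48 KB L1 cache:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}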

Fermi GPU Memory Hierarchy. Local memory can be used to hold spilled registers when a thread block requires more register storage than is available in the SM's registers. The L2 cache (768 KB, unified among the 16 SMs) services all loads and stores from/to global memory (including copies to/from the CPU host) and manages access to data shared across thread blocks. Global memory is accessible by all threads as well as by the host CPU, at high latency.
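Spilling can be observed at compile time: nvcc's -Xptxas -v flag makes ptxas print per-kernel register counts and spill stores/loads. A hedged sketch of code that is likely to spill under a register cap (names and the exact spill outcome are illustrative and depend on the compiler version):

// __launch_bounds__(1024) asks the compiler to support 1024 threads per
// block, which caps registers per thread; values that no longer fit may
// be spilled to local memory (cached in L1/L2, backed by DRAM).
__global__ void __launch_bounds__(1024) spill_prone(float *out) {
    float tmp[64];                              // large per-thread array
    for (int i = 0; i < 64; i++)
        tmp[i] = i * 0.5f;
    float s = 0.0f;
    for (int i = 0; i < 64; i++)
        s += tmp[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}

// Compile with: nvcc -arch=sm_20 -Xptxas -v spill.cu
// ptxas then reports registers used and any spill stores/loads per kernel.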

NVIDIA GPU vs. Vector Architectures. Similarities: both work well on data-level parallel problems; scatter-gather transfers; mask registers; branch hardware that uses internal masks; large register files. Differences: the GPU has no scalar processor; it uses multithreading to hide memory latency; and it has many functional units, as opposed to the few deeply pipelined units of a vector processor. (Copyright 2012, Elsevier Inc. All rights reserved.)

NVIDIA Fermi GPU. [Figure.]

NVIDIA Kepler GPU. Kepler GK110 architecture: 7.1B transistors; 15 SMX units (2880 cores); >1 TFLOP FP64; 1.5 MB L2 cache; 384-bit GDDR5; PCI Express Gen3.