Compute API Past & Future. Ofer Rosenberg Visual Computing Software



Intro and acknowledgments. Who am I? For the past two years, leading the Intel representation in the OpenCL working group at Khronos; additional background in media, signal processing, etc. http://il.linkedin.com/in/oferrosenberg Acknowledgments: this presentation contains ideas based on talks with many people. Partial list: AMD: Mike Houston, Ben Gaster; Apple: Aaftab Munshi; DICE: Johan Andersson; Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler; and others.

Agenda: The beginning: from shaders to compute. The past/present: the 1st generation of Compute APIs, and the caveats of the 1st generation. The future: the 2nd generation of Compute APIs.

From Shaders to Compute. In the beginning, GPU HW was fixed-function and optimized for graphics. Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008.

From Shaders to Compute. Graphics stages became programmable as GPUs evolved. This led to the traditional GPGPU approach. Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008.

From Shaders to Compute. Traditional GPGPU: write in a graphics language and use the GPU. Highly effective, but: the developer needs to learn another (not intuitive) language, and the developer is limited by what the graphics language can express. Then came G80 & CUDA. Slides from General Purpose Computation on Graphics Processors (GPGPU), Mike Houston, Stanford University Graphics Lab.

The cradle of GPU Compute APIs. GeForce 8800 GTX (G80) was released in Nov. 2006; ATI X1900 (R580) was released in Jan. 2006. CUDA 0.8 was released in Feb. 2007 (the first official beta); CTM was released in Nov. 2006. Slides from GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU, Ian Buck, NVIDIA, SC06, and Close to the Metal, Justin Hensley, AMD, SIGGRAPH 2007.

The 1st generation of Platform Compute APIs. CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL. DirectCompute is a Microsoft standard: released as part of Win7/DX11, a.k.a. Compute Shaders; it only runs under Windows, and only on a GPU device. OpenCL is a cross-OS, cross-vendor standard: managed by a working group in Khronos; Apple is the spec editor & conformance owner; work can be scheduled on both GPUs and CPUs. Timeline: Nov 2006: CTM released; June 2007: CUDA 1.0; Dec 2007: StreamSDK; Aug 2008: CUDA 2.0; Dec 2008: OpenCL 1.0; Oct 2009: DirectX 11; Mar 2010: CUDA 3.0; June 2010: OpenCL 1.1. The 1st generation was developed on GPU HW that was tuned for graphics usages and merely extended for general usage.

The 1st generation of Platform Compute APIs: Execution Model. The execution model was derived directly from shader programming in graphics (fragment processing). Shader programming: initiate one instance of the shader per vertex/pixel. Compute: initiate one instance of the kernel for each point in an N-dimensional grid. This fits the GPU's vision of an array of scalar (or stream) processors. Drawing from the OpenCL 1.1 Specification, Rev. 36.

The 1st generation of Platform Compute APIs: Memory Model. A distributed memory system. Abstraction: the application gets a handle to the memory object/resource. Explicit transactions: an API for synchronization between host & device(s): read/write, map/unmap. (Diagram: the app talks to the OpenCL runtime, which manages separate per-device memories.) Three address spaces: Global, Local (Shared) & Private. Local/Shared memory is the non-trivial memory space.

Disclaimer: the next slides present my own opinions and thoughts on caveats of, and future improvements to, the Platform Compute API.

The 2nd generation of Platform Compute APIs. Recap: the 1st generation (CUDA until 3.0, OpenCL 1.x, DX11 CS) was defined on HW optimized for graphics and extended to general compute. The cheese has moved for GPUs: compute has become an important usage scenario. Advanced graphics: physics, advanced lighting effects, irregular shadow mapping, screen-space rendering. Media: video encoding & processing, image processing, image segmentation, face recognition. Throughput: scientific simulations, finance, oil exploration. Developer feedback on the 1st generation enables creating better HW and APIs. The second generation of Platform Compute APIs: OpenCL Next, DirectX 12? The 2nd generation of Compute APIs will run on HW designed with compute in mind.

Caveats of the 1st generation: Execution Model. Developers' input: most real-world compute usages have fine-grain granularity (the grid is small, 100s of items at best), and real-world kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.):

    __kernel void foo() {
        // code here runs for each point in the grid
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == 0) {
            // this code runs once per work-group
        }
        // code here runs for each point in the grid
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_global_id(0) == 0) {
            // this code runs only once
        }
        // code here runs for each point in the grid
    }

Battlefield 2 execution phase DAG (image courtesy Johan Andersson, DICE). Using fragment processing for these usages results in inefficient use of the machine.

Caveats of the 1st generation: Execution Model. The array-of-scalar/stream-processors model is not optimal for CPUs & GPUs. It works well for large grids (as in traditional graphics), but at finer grain there is a better model. (Diagrams: NVIDIA Fermi, AMD R600, Intel Nehalem.) CPUs and GPUs are better modeled as multi-threaded vector machines.

The 2nd generation of Platform Compute APIs: ideas for a new execution model. Goals: support fine-grain parallelism; support complex application execution graphs; better match HW evolution by targeting multi-threaded vector machines; align with CPU evolution and SoC integration of CPU/GPU. Solution: a tasking system as the execution model foundation. (Diagram: SW threads feed a task pool, whose queues map onto the HW compute units of the device domain.) Tasking system: task queues mapped to independent HW units (~compute cores); device load balancing enabled via work stealing. OpenCL analogy: tasks execute at the work-group level. Relative to an OpenCL task, the new task is more restricted (no preemption) but evolved: a braided task, with sequential parts and fine-grain parallel parts interleaved.

The 2nd generation of Platform Compute APIs: ideas for a new execution model. There are others who think along the same lines. Slides from Leading a New Era of Computing, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day.

Caveats of the 1st generation: Memory Model. Developers' input: a growing number of compute workloads use complex data structures (linked lists, trees, etc.). Performance: the cost of marshaling pointers and reconstructing them on the device is high. Porting complexity: explicit transactions, marshaling, etc. must be added. Supporting a shared/unified address space (in both API & HW) is required. (Diagram: moving from separate per-device memory handles to a shared/unified address space between host & devices.)

The 2nd generation of Platform Compute APIs: ideas for a new memory model. Baseline: memory objects/resources have the same starting address on the host & devices, i.e. a shared address space. Option one, a shared address space with relaxed consistency: extends the existing OpenCL 1.x / DX11 memory model; explicit API calls synchronize between host & device; suitable for disjoint memory architectures (discrete GPUs, for example). Option two, a shared address space with full coherency: a new model in which memory is coherent between host & device; concurrent access uses known language-level mechanisms (atomics, volatile); suitable for shared-memory architectures. (Diagrams: host and device with separate memories, versus host and device sharing coherent memory.)

Some more thoughts for the 2nd generation (and beyond). Promote heterogeneous processing, not GPU-only: where code runs should depend on the problem. A matrix multiply of 16x16 should run on the CPU; a matrix multiply of 1000x1000 should run on the GPU. Where is the decision point? Better to leave it to the runtime (which requires API support). Load balancing is relevant especially on systems where the CPU & GPU are close in compute power. (Graph: execution time versus problem size for CPU and GPU.) One API to rule them all: the Compute API as the underlying infrastructure for running media & graphics; extend the API to contain a flexible pipeline, fixed-function HW, etc. Slide from Parallel Futures of a Game Engine, Johan Andersson, DICE.

References:
GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU, Ian Buck, NVIDIA, SC06. http://gpgpu.org/static/sc2006/workshop/presentations/buck_nvidia_cuda.pdf
GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008. http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
General Purpose Computation on Graphics Processors (GPGPU), Mike Houston, Stanford University Graphics Lab. http://www-graphics.stanford.edu/~mhouston/public_talks/r520-mhouston.pdf
Close to the Metal, Justin Hensley, AMD, SIGGRAPH 2007. http://gpgpu.org/static/s2007/slides/07-ctm-overview.pdf
NVIDIA's Fermi: The First Complete GPU Computing Architecture, Peter N. Glaskowsky. http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf
Leading a New Era of Computing, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day. http://phx.corporate-ir.net/external.file?item=ugfyzw50suq9njk3ntj8q2hpbgrjrd0tmxxuexbltm=&t=1
Parallel Futures of a Game Engine, Johan Andersson, DICE. http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448