Compute API: Past & Future
Ofer Rosenberg, Visual Computing Software
Intro and acknowledgments
Who am I?
- For the past two years, leading the Intel representation in the OpenCL working group @ Khronos
- Additional background in Media, Signal Processing, etc.
- http://il.linkedin.com/in/oferrosenberg
Acknowledgments: this presentation contains ideas based on talks with lots of people (who should be mentioned here). Partial list:
- AMD: Mike Houston, Ben Gaster
- Apple: Aaftab Munshi
- DICE: Johan Andersson
- Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and more
- And others
Agenda
- The beginning: From Shaders to Compute
- The Past/Present: 1st Generation of Compute APIs
- Caveats of the 1st generation
- The Future: 2nd Generation of Compute APIs
From Shaders to Compute
In the beginning, GPU HW was fixed-function and optimized for graphics.
Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008
From Shaders to Compute
Graphics stages became programmable as GPUs evolved. This led to the traditional GPGPU approach.
Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008
From Shaders to Compute
Traditional GPGPU: write in a graphics language and use the GPU. Highly effective, but:
- The developer needs to learn another (non-intuitive) language
- The developer is limited by the graphics language
Then came G80 & CUDA.
Slides from General Purpose Computation on Graphics Processors (GPGPU), Mike Houston, Stanford University Graphics Lab
The cradle of GPU Compute APIs
- GeForce 8800 GTX (G80) was released on Nov. 2006
- ATI X1900 (R580) was released on Jan. 2006
- CUDA 0.8 was released on Feb. 2007 (first official Beta)
- CTM was released on Nov. 2006
Slides from GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU, Ian Buck, NVIDIA, SC06, and Close to the Metal, Justin Hensley, AMD, SIGGRAPH 2007
The 1st generation of Platform Compute API
CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL.
DirectCompute is a Microsoft standard:
- Released as part of Win7/DX11, a.k.a. Compute Shaders
- Only runs under Windows on a GPU device
OpenCL is a cross-OS / cross-vendor standard:
- Managed by a working group in Khronos
- Apple is the spec editor & conformance owner
- Work can be scheduled on both GPUs and CPUs
Timeline: CTM released (Nov 2006), CUDA 1.0 (June 2007), StreamSDK (Dec 2007), CUDA 2.0 (Aug 2008), OpenCL 1.0 (Dec 2008), DirectX 11 (Oct 2009), CUDA 3.0 (Mar 2010), OpenCL 1.1 (June 2010).
The 1st generation was developed on GPU HW tuned for graphics usages, just extended for general usage.
The 1st generation of Platform Compute API: Execution Model
The execution model was driven directly from shader programming in graphics (fragment processing):
- Shader programming: initiate one instance of the shader per vertex/pixel
- Compute: initiate one instance of the kernel for each point in an N-dimensional grid
This fits the GPU's view of itself as an array of scalar (or stream) processors.
Drawing from OpenCL 1.1 Specification, Rev. 36
The 1st generation of Platform Compute API: Memory Model
Distributed memory system:
- Abstraction: the application gets a handle to the memory object / resource
- Explicit transactions: API calls to sync between Host & Device(s): read/write, map/unmap
(Diagram: App → OpenCL RT → Dev1, Dev2)
Three address spaces: Global, Local (Shared) & Private. Local/Shared memory is the non-trivial memory space.
Disclaimer
The next slides provide my opinion and thoughts on caveats of, and future improvements to, the Platform Compute API.
The 2nd generation of Platform Compute API
Recap: the 1st generation (CUDA until 3.0, OpenCL 1.x, DX11 CS) was defined on HW optimized for GFX and extended to general compute.
The cheese has moved for GPUs; compute is becoming an important usage scenario:
- Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space Rendering
- Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition
- Throughput: Scientific Simulations, Finance, Oil Exploration
Developer feedback on the 1st generation enables creating better HW/APIs.
The second generation of Platform Compute API: OpenCL Next, DirectX12?
The 2nd generation of Compute API will run on HW designed with compute in mind.
Caveats of the 1st generation: Execution Model
Developer input:
- Most real-world usages of compute are fine-grain (the grid is small, 100s of items at best)
- Real-world kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.):

    __kernel void foo() {
        // code here runs for each point in the grid
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == 0) {
            // this code runs once per work-group
        }
        // code here runs for each point in the grid
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_global_id(0) == 0) {
            // this code runs only once
        }
        // code here runs for each point in the grid
    }

Battlefield 2 execution-phase DAG (image courtesy Johan Andersson, DICE)
Using fragment processing for these usages results in inefficient use of the machine.
Caveats of the 1st generation: Execution Model
The array-of-scalar/stream-processors model is not optimal for CPUs & GPUs. It works well for large grids (as in traditional graphics), but at finer grain there is a better model.
(Diagrams: NV Fermi, AMD R600, Intel NHM)
CPUs and GPUs are better modeled as multi-threaded vector machines.
The 2nd generation of Platform Compute API: Ideas for a new execution model
Goals:
- Support fine-grain parallelism
- Support complex application execution graphs
- Better match HW evolution: target multi-threaded vector machines
- Align with CPU evolution and SoC integration of CPU/GPU
Solution: a tasking system as the execution model foundation.
(Diagram: SW-thread task pools feeding device domains; each task queue mapped to a HW compute unit)
Tasking system:
- Task queues mapped to independent HW units (~compute cores)
- Device load balancing enabled via stealing
OpenCL analogy: tasks execute at the work-group level. An OpenCL task vs. a CPU task:
- More restricted: no preemption
- Evolved: "braided" tasks (sequential parts & fine-grain parallel parts interleaved)
The 2nd generation of Platform Compute API: Ideas for a new execution model
Others are thinking along the same lines.
Slides from Leading a New Era of Computing, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
Caveats of the 1st generation: Memory Model
Developer input: a growing number of compute workloads use complex data structures (linked lists, trees, etc.)
- Performance: the cost of pointer marshaling & reconstruction on the device is high
- Porting complexity: need to add explicit transactions, marshaling, etc.
Supporting a shared/unified address space (API & HW) is required.
(Diagrams: disjoint model, App → OCL RT → Dev1/Dev2, vs. a shared address space between Host & Devices)
The 2nd generation of Platform Compute API: Ideas for a new memory model
Baseline: memory objects / resources have the same starting address on Host & Devices (a shared address space).
Shared address space with relaxed consistency:
- Extends the existing OCL 1.x / DX11 memory model
- Uses explicit API calls to sync between Host & Device
- Suitable for disjoint memory architectures (discrete GPUs, for example)
Shared address space with full coherency:
- New model: memory is coherent between Host & Device
- Uses known language-level mechanisms for concurrent access: atomics, volatile
- Suitable for shared-memory architectures
(Diagram: separate Host Memory and Device Memory vs. coherent/shared memory)
Some more thoughts for the 2nd generation (and beyond)
Promote heterogeneous processing, not GPU-only. Run code depending on the problem domain:
- A 16x16 matrix multiply should run on the CPU
- A 1000x1000 matrix multiply should run on the GPU
Where's the decision point? Better to leave it to the runtime (requires API support).
Load balancing is relevant especially on systems where the CPU & GPU are close in compute power. (Chart: execution time vs. problem size for CPU and GPU.)
One API to rule them all: the compute API as the underlying infrastructure to run Media & GFX; extend the API to contain a flexible pipeline, fixed-function HW, etc.
Slide from Parallel Futures of a Game Engine, Johan Andersson, DICE
References:
- GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU, Ian Buck, NVIDIA, SC06
  http://gpgpu.org/static/sc2006/workshop/presentations/buck_nvidia_cuda.pdf
- GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008
  http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
- General Purpose Computation on Graphics Processors (GPGPU), Mike Houston, Stanford University Graphics Lab
  http://www-graphics.stanford.edu/~mhouston/public_talks/r520-mhouston.pdf
- Close to the Metal, Justin Hensley, AMD, SIGGRAPH 2007
  http://gpgpu.org/static/s2007/slides/07-ctm-overview.pdf
- NVIDIA's Fermi: The First Complete GPU Computing Architecture, Peter N. Glaskowsky
  http://www.nvidia.com/content/pdf/fermi_white_papers/.glaskowsky_nvidiafermi-TheFirstCompleteGPUComputingArchitecture.pdf
- Leading a New Era of Computing, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
  http://phx.corporate-ir.net/external.file?item=ugfyzw50suq9njk3ntj8q2hpbgrjrd0tmxxuexbltm=&t=1
- Parallel Futures of a Game Engine, Johan Andersson, DICE
  http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448