OpenCL on Intel Iris Graphics

Copyright Khronos Group, 2013 - Page 1 OpenCL on Intel Iris Graphics SIGGRAPH 2013 July 2013 Presenter: Adam Lake Content: Ben Ashbaugh, Arnon Peleg, others

Copyright Khronos Group, 2013 - Page 2 Agenda Intel Iris Graphics Architecture Overview Tools to help get the most from OpenCL on Intel CPU and GPU Additional Resources

Sub Slice 1 Sub Slice 3 Ring Bus / LLC / Memory Sub Slice 0 Sub Slice 2 Intel Iris Graphics Architecture Overview Command Streamer (CS) Vertex Fetch (VF) Video Front End (VFE) Video Quality Engine Multi-Format CODEC Blitter Display Vertex Shader (VS) Hull Shader (HS) L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Tessellator Domain Shader (DS) Thread Dispatch Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Geometry Shader (GS) Stream Out (SOL) Clip/Setup L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Slice 0 Slice 1 3 Copyright Khronos Group, 2013 - Page 3

Sub Slice 1 Sub Slice 3 Ring Bus / LLC / Memory Sub Slice 0 Sub Slice 2 Intel Iris Graphics Architecture Overview Command Streamer (CS) Vertex Fetch (VF) Video Front End (VFE) Global Assets Command Video Quality Streamer Multi-Format Engine CODEC Thread Dispatch Blitter Display Vertex Shader (VS) Hull Shader (HS) L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Tessellator Domain Shader (DS) Thread Dispatch Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Geometry Shader (GS) Stream Out (SOL) Clip/Setup L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Slice 0 Slice 1 4 Copyright Khronos Group, 2013 - Page 4

Sub Slice 1 Sub Slice 3 Ring Bus / LLC / Memory Sub Slice 0 Sub Slice 2 Intel Iris Graphics Architecture Overview Command Streamer (CS) Vertex Fetch (VF) Video Front End (VFE) Video Quality Engine Multi-Format CODEC Blitter Display Vertex Shader (VS) Hull Shader (HS) Tessellator Sub Slice Execution Units L1 IC$ s and Data Port Instruction and Texture Caches 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Domain Shader (DS) Thread Dispatch Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Geometry Shader (GS) Stream Out (SOL) Clip/Setup L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Slice 0 Slice 1 5 Copyright Khronos Group, 2013 - Page 5

Sub Slice 1 Sub Slice 3 Ring Bus / LLC / Memory Sub Slice 0 Sub Slice 2 Intel Iris Graphics Architecture Overview Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Video Front End (VFE) Video Quality Engine L1 IC$ Multi-Format CODEC Blitter 3D Media Data Port Slice DisplayCommon L3 Cache Shared Local Memory Barriers Tex$ L1 IC$ 3D Media Data Port Tex$ Tessellator Domain Shader (DS) Thread Dispatch Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ Geometry Shader (GS) Stream Out (SOL) Clip/Setup L1 IC$ 3D Media Data Port Tex$ L1 IC$ 3D Media Data Port Tex$ Slice 0 Slice 1 6 Copyright Khronos Group, 2013 - Page 6

Intel Iris Graphics Architecture Building Blocks OpenCL* Kernels run on an Execution Unit () Each is a Multi-Threaded SIMD Processor Up to 7 threads per - 128 x 8 x 32-bit registers per thread Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled) - SIMD8, SIMD16, SIMD32 - SIMD8 More Registers - SIMD16 and SIMD32 Better Efficiency Thread 0 Thread 2 Thread 4 Thread 6 Thread ` 1 Thread 3 Thread 5 7 Copyright Khronos Group, 2013 - Page 7

Sub Slice Intel Iris Graphics Architecture Building Blocks OpenCL* Work Groups run on a Sub Slice - 10 s per Sub Slice - Texture (Images) L1 IC$ 3D Media Data Port Tex$ - Data Port (Buffers) - Instruction and Texture Caches OpenCL* Work Groups may run on multiple threads, on multiple s! 8 Copyright Khronos Group, 2013 - Page 8

Intel Iris Graphics Architecture Building Blocks Two Sub Slices make a Slice Shared Resources: Slice Common - L3 Cache + Shared Local Memory Sub Slice L1 IC$ 3D Media Data Port Tex$ - Barriers Intel Iris Graphics has Two Slices How many work items in flight at once? Rasterizer / Depth L3$ Pixel Ops Render$ Depth$ - 2 slides each with 2 subslices = 4 Sub Slices - 4 subslices x 10 s/subslice = 40 s - Up to 40 s x 7 threads/ = 280 threads - Up to 8960 OpenCL* work items in flight! Sub Slice L1 IC$ 3D Media Data Port Tex$ Slice 9 Copyright Khronos Group, 2013 - Page 9

Intel Iris Graphics Cache Hierarchy EDRAM (non-inclusive victim cache) 128MB/package (Intel Iris Pro 5200) images L1 L2 L3 LLC DRAM buffers 256KB/slice 2-8MB/package (shared w/ CPU) 10 Copyright Khronos Group, 2013 - Page 10

Tools and Additional Resources Copyright Khronos Group, 2013 - Page 11

Occupancy Intel VTune Amplifier XE 2013 12 Copyright Khronos Group, 2013 - Page 12

Additional Resources Intel SDK for OpenCL* Applications 2013 - Offline kernel builder, Debugger for CPU in MSVC Intel OpenCL* Optimization Guide Intel VTune Amplifier XE 2013 Intel Graphics Performance Analyzers - OpenCL Tuning Intel Linux Graphics hardware Bspec: https://01.org/linuxgraphics/ SIGGRAPH 2013: - Faster, Better Pixels on the Go and in the Cloud with OpenCL* on Intel Architecture - Arnon Peleg - Optimizing OpenCL* Applications for Intel Iris Graphics - Ben Ashbaugh - Backup of this deck contains more from Ben Ashbaugh s excellent SIGGRAPH 2012 talk on optimizing for Intel Iris Graphics! 13 Copyright Khronos Group, 2013 - Page 13

Questions Copyright Khronos Group, 2013 - Page 14

Copyright Khronos Group, 2013 - Page 15 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel s current plan of record product roadmaps. Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, and Ultrabook are trademarks of Intel Corporation in the United States and other countries

backup Copyright Khronos Group, 2013 - Page 16

Scheduling s for maximum occupancy Copyright Khronos Group, 2013 - Page 17

Occupancy Goal: Use All Execution Unit Resources This is harder than it sounds! Many factors to consider - Launch Enough Work - More threads means better latency coverage to keep an active - One thread sufficient to prevent an from going idle - Too few threads can result in an being stalled 18 Copyright Khronos Group, 2013 - Page 18

Occupancy Goal: Use All Execution Unit Resources - Don t Waste SIMD Lanes - Use an optimal Local Work Size - Good: Query for compiled SIMD size: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE - Occasionally Helpful: Compile for a specific local work size (8, 16, or 32): attribute ((reqd_work_group_size(x, Y, Z))) - Best: Let the driver pick (Local Work Size == NULL) Ideal for kernels with no barriers or shared local memory 19 Copyright Khronos Group, 2013 - Page 19

Occupancy (continued ) Barriers - 16 barriers per sub slice - Can be a limiting factor for very small local work groups Shared Local Memory - 64KB shared local memory per sub slice - Can be a limiting factor for kernels that use lots of shared local memory 20 Copyright Khronos Group, 2013 - Page 20

Optimizing OpenCL* Applications for Intel Iris Graphics Ben Ashbaugh (Intel) Copyright Khronos Group, 2013 - Page 21

Agenda Understanding Occupancy - How Intel Iris Graphics executes OpenCL* Kernels - - - 22 Copyright Khronos Group, 2013 - Page 22

How Intel Iris Graphics Runs OpenCL* Copyright Khronos Group, 2013 - Page 23

How Intel Iris Graphics Runs OpenCL* 1. Divide Into Work Groups 24 Copyright Khronos Group, 2013 - Page 24

How Intel Iris Graphics Runs OpenCL* 2. Divide Each Work Group Into Threads 25 Copyright Khronos Group, 2013 - Page 25

Sub Slice How Intel Iris Graphics Runs OpenCL* L1 IC$ 3D Media Data Port Tex$ 3. Launch Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all threads for the Work Group - Not enough room in any Sub Slice threads must wait 26 Copyright Khronos Group, 2013 - Page 26

How Intel Iris Graphics Runs OpenCL* 1. Divide Into Work Groups 27 Copyright Khronos Group, 2013 - Page 27

How Intel Iris Graphics Runs OpenCL* 2. Divide Each Work Group Into Threads 28 Copyright Khronos Group, 2013 - Page 28

Sub Slice How Intel Iris Graphics Runs OpenCL* L1 IC$ 3D Media Data Port Tex$ 2. Launch Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all threads for the Work Group - Not enough room in any Sub Slice threads must wait 29 Copyright Khronos Group, 2013 - Page 29

Occupancy Goal: Use All Machine Resources This is harder than it sounds! Many factors to consider 1. Launch Enough Work - One thread sufficient to prevent an from going idle - Too few threads can result in an being stalled - More threads better latency coverage keeps an active 30 Copyright Khronos Group, 2013 - Page 30

Occupancy Intel VTune Amplifier XE 2013 31 Copyright Khronos Group, 2013 - Page 31

Agenda - Memory Matters - Host to Device - - 32 Copyright Khronos Group, 2013 - Page 32

Optimizing Host to Device Transfers Host (CPU) and Device (GPU) share the same physical memory For OpenCL* buffers: - No transfer needed (zero copy)! - Allocate system memory aligned to a cache line (64 bytes) - Create buffer with system memory pointer and CL_MEM_USE_HOST_PTR - Use clenqueuemapbuffer() to access data For OpenCL* images: - Currently tiled in device memory transfer required 33 Copyright Khronos Group, 2013 - Page 33

Operating on Buffers as Images Intel Iris Graphics supports cl_khr_image2d_from_buffer - New OpenCL* 1.2 Extension - Treat data as a buffer for some kernels, as an image for others - Some restrictions for zero copy: buffer size, image pitch Buffer Image 0x123 0x456 0x789 34 Copyright Khronos Group, 2013 - Page 34

Interop with Graphics and Media APIs / SDKs Intel Iris Graphics supports many Graphics and Media interop extensions: - cl_khr_dx9_media_sharing (includes DXVA for Intel Media SDK) - cl_khr_d3d10_sharing - cl_khr_d3d11_sharing - cl_khr_gl_sharing - cl_khr_gl_depth_images - cl_khr_gl_event - cl_khr_gl_msaa_sharing Use Graphics API / SDK assets in OpenCL* with no copies! 35 Copyright Khronos Group, 2013 - Page 35

Agenda - Memory Matters - - Device Access - 36 Copyright Khronos Group, 2013 - Page 36

Intel Iris Graphics Cache Hierarchy EDRAM (non-inclusive victim cache) 128MB/package (Intel Iris Pro 5200) images L1 L2 L3 LLC DRAM buffers 256KB/slice 2-8MB/package (shared w/ CPU) 37 Copyright Khronos Group, 2013 - Page 37

global and constant Memory Global Memory Accesses go through the L3 Cache L3 Cache Line is 64 bytes thread accesses to the same L3 Cache Line are collapsed - Order of data within cache line does not matter - Bandwidth determined by number of cache lines accessed - Maximum Bandwidth: 64 bytes / clock / sub slice Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address - Loading more than 4 x 32-bits of data is not beneficial 38 Copyright Khronos Group, 2013 - Page 38

Global and Constant Memory Access Examples 1. x = data[ get_global_id(0) ] - One cache line, full bandwidth 2. x = data[ n get_global_id(0) ] - Reverse order, full bandwidth 3. x = data[ get_global_id(0) + 1 ] - Offset, two cache lines, half bandwidth 4. x = data[ get_global_id(0) * 2 ] - Strided, half bandwidth 5. x = data[ get_global_id(0) * 16 ] - Very strided, worst-case Global ID: Global ID: Global ID: Global ID: Global ID: Cache Line n Cache Line n + 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n - 1 Cache Line n Cache Line n Cache Line n + 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Cache Line n Cache Line n + 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n Cache Line n + 1 Cache Line n + 2... 0 1 2... 39 Copyright Khronos Group, 2013 - Page 39

Copyright Khronos Group, 2013 - Page 40 local Memory Accesses Local Memory Accesses also go through the L3 Cache! Key Difference: Local Memory is Banked - Banked at a DWORD granularity, 16 banks - Bandwidth determined by number of bank conflicts - Maximum Bandwidth: Still 64 bytes / clock / sub slice Supports more access patterns with full bandwidth than Global Memory - No bank conflicts full bandwidth - Reading from the same address in a bank full bandwidth

Local Memory Access Examples 1. x = data[ get_global_id(0) + 1 ] - Unique banks, full bandwidth 2. x = data[ get_global_id(0) & ~1 Bank: ] - Same address read, full bandwidth 3. x = data[ get_global_id(0) * 2 ] - Strided, half bandwidth 4. x = data[ get_global_id(0) * 16 ] - Very strided, worst-case 5. x = data[ get_global_id(0) * 17 ] - Full bandwidth! Bank: Global ID: Global ID: Bank: Global ID: Bank: Global ID: Bank: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0... 0 0 0 0 1 2 3 4 5 6 7 0... 1..................... 0.................. 2 3 0 4 0 5 0 6 Global ID: 0 1 2 3 4 5 6 41 Copyright Khronos Group, 2013 - Page 41

Thread n+1 Thread n-1 Thread n Copyright Khronos Group, 2013 - Page 42 private Memory Compiler can usually allocate Private Memory in the Register File Work Item 0 Work Item 1 private int a[100] - Even if Private Memory is dynamically indexed - Good Performance Fallback: Private Memory allocated in Global Memory - Accesses are very strided Work Item n Work Item 0 Work Item 1 Work Item n Work Item 0 Work Item 1 private int b[100] private int c[200] - Bad Performance Work Item n

Agenda - - - Compute Characteristics - Maximizing GFlops 43 Copyright Khronos Group, 2013 - Page 43

ISA SIMT ISA with Predication and Branching Divergent code executes both branches Reduced SIMD Efficiency this(); if ( x ) else that(); another(); finish(); time Example: x sometimes true SIMD lane time Example: x never true SIMD lane 44 Copyright Khronos Group, 2013 - Page 44

Compute GFlops s have 2 x 4-wide vector ALUs Second ALU has limitations: - Subset of instructions: add, mov, mad, mul, cmp - Instruction must come from another thread - Only float operands! Peak GFlops: #s x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate For Intel Iris Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops! Add Intel Core Host Processor >1TFlop! 45 Copyright Khronos Group, 2013 - Page 45

Maximizing Compute Performance Use mad() / fma(): Either explicitly with built-ins, or via -cl-mad-enable Use floats wherever possible to maximize co-issue Avoid long and size_t data types - Prefer float over int, if possible - Using short data types may improve performance Trade accuracy for speed: native built-ins, -cl-fastrelaxed-math - Often good enough for graphics 46 Copyright Khronos Group, 2013 - Page 46

Summary Maximize Occupancy - Choose a Good Local Work Size - Or, Let the Driver Choose (Local Work Size == NULL) Avoid Host-to-Device Transfers - Create Buffers with CL_MEM_USE_HOST_PTR Access Device Memory Efficiently - Minimize Cache Lines for global Memory - Minimize Bank Conflicts for local Memory Maximize Compute - Avoid Divergent Branches - Use mad / fma and float Data When Possible Copyright Khronos Group, 2013 - Page 48

Copyright Khronos Group, 2013 - Page 49 Questions / Acknowledgements This presentation would not have been possible without material and review comments from many people Thank you! Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier