OpenCL on Intel Iris Graphics


Copyright Khronos Group, 2013 - Page 1 OpenCL on Intel Iris Graphics SIGGRAPH 2013 July 2013 Presenter: Adam Lake Content: Ben Ashbaugh, Arnon Peleg, others

Copyright Khronos Group, 2013 - Page 2 Agenda Intel Iris Graphics Architecture Overview Tools to help get the most from OpenCL on Intel CPU and GPU Additional Resources

Intel Iris Graphics Architecture Overview
[Block diagram: Command Streamer (CS), Vertex Fetch (VF), Vertex Shader (VS), Hull Shader (HS), Tessellator, Domain Shader (DS), Geometry Shader (GS), Stream Out (SOL), Clip/Setup, Video Front End (VFE), Video Quality Engine, Multi-Format CODEC, Thread Dispatch, Blitter, Display; two Slices (Slice 0, Slice 1), each with two Sub Slices (EUs, L1 IC$, 3D / Media, Data Port, Tex$) plus Rasterizer / Depth, L3$, Pixel Ops, Render$, Depth$; connected via Ring Bus / LLC / Memory]
Copyright Khronos Group, 2013 - Page 3

Intel Iris Graphics Architecture Overview (continued)
[Same block diagram, highlighting the Global Assets: Command Streamer, Video Quality Engine, Multi-Format CODEC, Thread Dispatch, Blitter, Display]
Copyright Khronos Group, 2013 - Page 4

Intel Iris Graphics Architecture Overview (continued)
[Same block diagram, highlighting one Sub Slice: Execution Units (EUs) and Data Port, Instruction and Texture Caches]
Copyright Khronos Group, 2013 - Page 5

Intel Iris Graphics Architecture Overview (continued)
[Same block diagram, highlighting the Slice Common: L3 Cache, Shared Local Memory, Barriers]
Copyright Khronos Group, 2013 - Page 6

Intel Iris Graphics Architecture Building Blocks
OpenCL* Kernels run on an Execution Unit (EU)
Each EU is a Multi-Threaded SIMD Processor
- Up to 7 threads per EU
- 128 x 8 x 32-bit registers per thread
Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled)
- SIMD8, SIMD16, SIMD32
- SIMD8: More Registers; SIMD16 and SIMD32: Better Efficiency
[Diagram: one EU with Threads 0-6]
Copyright Khronos Group, 2013 - Page 7

Intel Iris Graphics Architecture Building Blocks
OpenCL* Work Groups run on a Sub Slice
- 10 EUs per Sub Slice
- Texture (Images)
- Data Port (Buffers)
- Instruction and Texture Caches
OpenCL* Work Groups may run on multiple threads, on multiple EUs!
[Diagram: one Sub Slice (EUs, L1 IC$, 3D / Media, Data Port, Tex$)]
Copyright Khronos Group, 2013 - Page 8

Intel Iris Graphics Architecture Building Blocks
Two Sub Slices make a Slice
Shared Resources: Slice Common
- L3 Cache + Shared Local Memory
- Barriers
Intel Iris Graphics has Two Slices
How many work items in flight at once?
- 2 slices, each with 2 sub slices = 4 Sub Slices
- 4 sub slices x 10 EUs/sub slice = 40 EUs
- Up to 40 EUs x 7 threads/EU = 280 threads
- Up to 280 threads x 32 work items/thread = 8960 OpenCL* work items in flight!
[Diagram: one Slice containing two Sub Slices and the Slice Common (Rasterizer / Depth, L3$, Pixel Ops, Render$, Depth$)]
Copyright Khronos Group, 2013 - Page 9

Intel Iris Graphics Cache Hierarchy
[Diagram: images → L1 / L2; buffers → L3 (256KB/slice); then LLC (2-8MB/package, shared w/ CPU), EDRAM non-inclusive victim cache (128MB/package on Intel Iris Pro 5200), and DRAM]
Copyright Khronos Group, 2013 - Page 10

Tools and Additional Resources Copyright Khronos Group, 2013 - Page 11

Occupancy - Intel VTune Amplifier XE 2013
Copyright Khronos Group, 2013 - Page 12

Additional Resources
- Intel SDK for OpenCL* Applications 2013 - Offline kernel builder, Debugger for CPU in MSVC
- Intel OpenCL* Optimization Guide
- Intel VTune Amplifier XE 2013
- Intel Graphics Performance Analyzers - OpenCL Tuning
- Intel Linux Graphics hardware Bspec: https://01.org/linuxgraphics/
- SIGGRAPH 2013:
  - Faster, Better Pixels on the Go and in the Cloud with OpenCL* on Intel Architecture - Arnon Peleg
  - Optimizing OpenCL* Applications for Intel Iris Graphics - Ben Ashbaugh
- Backup of this deck contains more from Ben Ashbaugh's excellent SIGGRAPH 2012 talk on optimizing for Intel Iris Graphics!
Copyright Khronos Group, 2013 - Page 13

Questions Copyright Khronos Group, 2013 - Page 14

Copyright Khronos Group, 2013 - Page 15 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel s current plan of record product roadmaps. Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, and Ultrabook are trademarks of Intel Corporation in the United States and other countries

backup Copyright Khronos Group, 2013 - Page 16

Scheduling EUs for maximum occupancy Copyright Khronos Group, 2013 - Page 17

Occupancy
Goal: Use All Execution Unit Resources
This is harder than it sounds! Many factors to consider
- Launch Enough Work
  - More threads means better latency coverage to keep an EU active
  - One thread sufficient to prevent an EU from going idle
  - Too few threads can result in an EU being stalled
Copyright Khronos Group, 2013 - Page 18

Occupancy
Goal: Use All Execution Unit Resources
- Don't Waste SIMD Lanes
- Use an optimal Local Work Size
  - Good: Query for the compiled SIMD size: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
  - Occasionally Helpful: Compile for a specific local work size (8, 16, or 32): __attribute__((reqd_work_group_size(X, Y, Z)))
  - Best: Let the driver pick (Local Work Size == NULL) - Ideal for kernels with no barriers or shared local memory
Copyright Khronos Group, 2013 - Page 19
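To make the query concrete, here is a minimal host-side C sketch (not from the deck): it reads the compiled SIMD width with clGetKernelWorkGroupInfo and then lets the driver pick the local size by passing NULL. The helper name is hypothetical, and the device, queue, kernel, and global size are assumed to exist already.

    #include <CL/cl.h>
    #include <stdio.h>

    /* Hypothetical helper: query the compiled SIMD width, then let the driver
       pick the local work size (assumes device/queue/kernel already exist). */
    void launch_with_driver_chosen_local_size(cl_device_id device,
                                              cl_command_queue queue,
                                              cl_kernel kernel,
                                              size_t global_size)
    {
        size_t simd_multiple = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(simd_multiple), &simd_multiple, NULL);
        printf("Compiled SIMD size (work-group size multiple): %zu\n", simd_multiple);

        /* Local Work Size == NULL: ideal for kernels with no barriers or SLM. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
    }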

Occupancy (continued)
Barriers
- 16 barriers per sub slice
- Can be a limiting factor for very small local work groups
Shared Local Memory
- 64KB shared local memory per sub slice
- Can be a limiting factor for kernels that use lots of shared local memory
Copyright Khronos Group, 2013 - Page 20
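As an illustration of where these limits bite, here is a hypothetical OpenCL C reduction kernel (not from the deck): its __local scratch array counts against the 64KB of shared local memory per sub slice, and because it uses barrier() each resident work group occupies one of the 16 barrier slots. It assumes a power-of-two local size.

    // Hypothetical reduction kernel illustrating SLM + barrier usage.
    __kernel void partial_sums(__global const float* input,
                               __global float* output,
                               __local float* scratch)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);   // assumed to be a power of two

        scratch[lid] = input[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work group.
        for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            output[get_group_id(0)] = scratch[0];
    }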

Optimizing OpenCL* Applications for Intel Iris Graphics Ben Ashbaugh (Intel) Copyright Khronos Group, 2013 - Page 21

Agenda
Understanding Occupancy
- How Intel Iris Graphics executes OpenCL* Kernels
Copyright Khronos Group, 2013 - Page 22

How Intel Iris Graphics Runs OpenCL* Copyright Khronos Group, 2013 - Page 23

How Intel Iris Graphics Runs OpenCL* 1. Divide Into Work Groups Copyright Khronos Group, 2013 - Page 24

How Intel Iris Graphics Runs OpenCL* 2. Divide Each Work Group Into Threads Copyright Khronos Group, 2013 - Page 25

How Intel Iris Graphics Runs OpenCL*
3. Launch Threads for the Work Group Onto a Sub Slice
- Repeat for each Work Group
- Must have enough room in the Sub Slice for all threads for the Work Group
- Not enough room in any Sub Slice → threads must wait
[Diagram: one Sub Slice (EUs, L1 IC$, 3D / Media, Data Port, Tex$)]
Copyright Khronos Group, 2013 - Page 26

How Intel Iris Graphics Runs OpenCL* 1. Divide Into Work Groups Copyright Khronos Group, 2013 - Page 27

How Intel Iris Graphics Runs OpenCL* 2. Divide Each Work Group Into Threads Copyright Khronos Group, 2013 - Page 28

How Intel Iris Graphics Runs OpenCL*
3. Launch Threads for the Work Group Onto a Sub Slice
- Repeat for each Work Group
- Must have enough room in the Sub Slice for all threads for the Work Group
- Not enough room in any Sub Slice → threads must wait
[Diagram: one Sub Slice (EUs, L1 IC$, 3D / Media, Data Port, Tex$)]
Copyright Khronos Group, 2013 - Page 29

Occupancy
Goal: Use All Machine Resources
This is harder than it sounds! Many factors to consider
1. Launch Enough Work
- One thread sufficient to prevent an EU from going idle
- Too few threads can result in an EU being stalled
- More threads → better latency coverage → keeps an EU active
Copyright Khronos Group, 2013 - Page 30

Occupancy - Intel VTune Amplifier XE 2013
Copyright Khronos Group, 2013 - Page 31

Agenda
Memory Matters
- Host to Device
Copyright Khronos Group, 2013 - Page 32

Optimizing Host to Device Transfers
Host (CPU) and Device (GPU) share the same physical memory
For OpenCL* buffers:
- No transfer needed (zero copy)!
- Allocate system memory aligned to a cache line (64 bytes)
- Create the buffer with the system memory pointer and CL_MEM_USE_HOST_PTR
- Use clEnqueueMapBuffer() to access the data
For OpenCL* images:
- Currently tiled in device memory → transfer required
Copyright Khronos Group, 2013 - Page 33
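A minimal host-side sketch of the zero-copy recipe above, assuming 'context' and 'queue' already exist and 'size' is a multiple of 64 bytes; the helper name is hypothetical.

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Hypothetical helper: create a zero-copy buffer backed by aligned host memory. */
    cl_mem create_zero_copy_buffer(cl_context context, cl_command_queue queue,
                                   size_t size)
    {
        /* System memory aligned to a 64-byte cache line.
           posix_memalign is POSIX; use _aligned_malloc on Windows. */
        void* host_ptr = NULL;
        posix_memalign(&host_ptr, 64, size);

        /* CL_MEM_USE_HOST_PTR: the driver can share this allocation with the GPU. */
        cl_int err;
        cl_mem buffer = clCreateBuffer(context,
                                       CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                       size, host_ptr, &err);

        /* Map (rather than read/write) the buffer whenever the host touches the data. */
        float* mapped = (float*)clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                                   CL_MAP_WRITE, 0, size,
                                                   0, NULL, NULL, &err);
        /* ... initialize 'mapped' here ... */
        clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);
        return buffer;
    }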

Operating on Buffers as Images
Intel Iris Graphics supports cl_khr_image2d_from_buffer
- New OpenCL* 1.2 Extension
- Treat data as a buffer for some kernels, as an image for others
- Some restrictions for zero copy: buffer size, image pitch
[Diagram: the same allocation viewed as a Buffer and as an Image]
Copyright Khronos Group, 2013 - Page 34
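A hedged sketch of how such an image might be created from an existing buffer through the cl_image_desc::buffer field; the helper name, format, and dimensions are illustrative, and the caller is assumed to have checked the device's buffer size and pitch alignment restrictions.

    #include <CL/cl.h>
    #include <string.h>

    /* Hypothetical helper: view an existing cl_mem buffer as a 2D image
       (cl_khr_image2d_from_buffer), with no copy of the underlying storage. */
    cl_mem image_from_buffer(cl_context context, cl_mem buffer,
                             size_t width, size_t height, size_t row_pitch_bytes)
    {
        cl_image_format format = { CL_RGBA, CL_UNORM_INT8 };

        cl_image_desc desc;
        memset(&desc, 0, sizeof(desc));
        desc.image_type      = CL_MEM_OBJECT_IMAGE2D;
        desc.image_width     = width;
        desc.image_height    = height;
        desc.image_row_pitch = row_pitch_bytes;  /* must meet the pitch restriction */
        desc.buffer          = buffer;           /* reuse the buffer's storage */

        cl_int err;
        return clCreateImage(context, CL_MEM_READ_ONLY, &format, &desc, NULL, &err);
    }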

Interop with Graphics and Media APIs / SDKs
Intel Iris Graphics supports many Graphics and Media interop extensions:
- cl_khr_dx9_media_sharing (includes DXVA for Intel Media SDK)
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_gl_sharing
- cl_khr_gl_depth_images
- cl_khr_gl_event
- cl_khr_gl_msaa_sharing
Use Graphics API / SDK assets in OpenCL* with no copies!
Copyright Khronos Group, 2013 - Page 35
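For example, cl_khr_gl_sharing lets a kernel work directly on an OpenGL buffer object. The sketch below is not from the deck; it assumes an OpenCL context created with GL-sharing properties and an existing GL buffer object, queue, and kernel.

    #include <CL/cl.h>
    #include <CL/cl_gl.h>

    /* Hypothetical cl_khr_gl_sharing sketch: run a kernel on a GL buffer object
       with no copy (assumes a GL-sharing context and existing queue/kernel/VBO). */
    void process_gl_buffer(cl_context context, cl_command_queue queue,
                           cl_kernel kernel, unsigned int gl_vbo, size_t num_items)
    {
        cl_int err;
        cl_mem cl_vbo = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, gl_vbo, &err);

        /* OpenGL must be done with the object (e.g. glFinish) before acquiring it. */
        clEnqueueAcquireGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_vbo);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &num_items, NULL, 0, NULL, NULL);

        /* Hand the object back to OpenGL. */
        clEnqueueReleaseGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);
        clReleaseMemObject(cl_vbo);
    }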

Agenda
Memory Matters
- Device Access
Copyright Khronos Group, 2013 - Page 36

Intel Iris Graphics Cache Hierarchy
[Diagram: images → L1 / L2; buffers → L3 (256KB/slice); then LLC (2-8MB/package, shared w/ CPU), EDRAM non-inclusive victim cache (128MB/package on Intel Iris Pro 5200), and DRAM]
Copyright Khronos Group, 2013 - Page 37

global and constant Memory
Global Memory Accesses go through the L3 Cache
L3 Cache Line is 64 bytes
EU thread accesses to the same L3 Cache Line are collapsed
- Order of data within a cache line does not matter
- Bandwidth is determined by the number of cache lines accessed
- Maximum Bandwidth: 64 bytes / clock / sub slice
Good: Load at least 32 bits of data at a time, starting from a 32-bit aligned address
Best: Load 4 x 32 bits of data at a time, starting from a cache-line aligned address
- Loading more than 4 x 32 bits of data is not beneficial
Copyright Khronos Group, 2013 - Page 38
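A small OpenCL C sketch of the "Best" case: each work item loads 4 x 32 bits through a float4 pointer, so consecutive work items walk whole 64-byte cache lines. It assumes the buffers are cache-line aligned and that the global size is expressed in float4 elements.

    // Sketch: 4 x 32-bit loads per work item; a SIMD16 thread then covers
    // four whole 64-byte cache lines.
    __kernel void copy_float4(__global const float4* in, __global float4* out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid];
    }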

Global and Constant Memory Access Examples
1. x = data[ get_global_id(0) ] - One cache line, full bandwidth
2. x = data[ n - get_global_id(0) ] - Reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ] - Offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ] - Strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ] - Very strided, worst case
[Diagram: for each example, how global IDs 0-15 map onto 64-byte cache lines]
Copyright Khronos Group, 2013 - Page 39

local Memory Accesses
Local Memory Accesses also go through the L3 Cache!
Key Difference: Local Memory is Banked
- Banked at a DWORD granularity, 16 banks
- Bandwidth is determined by the number of bank conflicts
- Maximum Bandwidth: Still 64 bytes / clock / sub slice
Supports more access patterns with full bandwidth than Global Memory
- No bank conflicts → full bandwidth
- Reading from the same address in a bank → full bandwidth
Copyright Khronos Group, 2013 - Page 40

Local Memory Access Examples
1. x = data[ get_global_id(0) + 1 ] - Unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ] - Same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ] - Strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ] - Very strided, worst case
5. x = data[ get_global_id(0) * 17 ] - Full bandwidth!
[Diagram: for each example, how global IDs 0-15 map onto the 16 DWORD banks]
Copyright Khronos Group, 2013 - Page 41
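The classic use of the example-5 padding trick is a shared-local-memory transpose tile. This hypothetical kernel (not from the deck) pads each row by one element so that column reads land in different banks; it assumes a 16x16 local size and dimensions that are multiples of 16.

    // 16x16 tile stored with a 17-element row pitch so column accesses hit
    // 16 different banks (the "* 17" pattern from example 5).
    #define TILE 16

    __kernel void transpose_tile(__global const float* in, __global float* out,
                                 int width, int height)
    {
        __local float tile[TILE][TILE + 1];   // +1 pads each row across the banks

        int x  = get_global_id(0),  y  = get_global_id(1);
        int lx = get_local_id(0),   ly = get_local_id(1);

        tile[ly][lx] = in[y * width + x];     // row-wise write: unique banks
        barrier(CLK_LOCAL_MEM_FENCE);

        int tx = get_group_id(1) * TILE + lx; // transposed coordinates
        int ty = get_group_id(0) * TILE + ly;
        out[ty * height + tx] = tile[lx][ly]; // column-wise read: no conflicts
    }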

private Memory
The Compiler can usually allocate Private Memory in the Register File
- Even if Private Memory is dynamically indexed
- Good Performance
Fallback: Private Memory allocated in Global Memory
- Accesses are very strided
- Bad Performance
[Diagram: private int a[100], b[100], c[200] laid out per work item across threads n-1, n, n+1]
Copyright Khronos Group, 2013 - Page 42

Agenda
Compute Characteristics
- Maximizing GFlops
Copyright Khronos Group, 2013 - Page 43

EU ISA
SIMT ISA with Predication and Branching
Divergent code executes both branches → Reduced SIMD Efficiency

    if ( x )
        this();
    else
        that();
    another();
    finish();

[Diagram: SIMD lane occupancy over time when x is sometimes true vs. never true]
Copyright Khronos Group, 2013 - Page 44
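One common way to keep lanes converged when both sides of a branch are cheap is to use select() instead of if/else. This hypothetical kernel shows the idea; the compiler may already predicate branches this simple, so treat it as an illustration rather than a rule.

    // Branchless alternative for a cheap, divergent branch.
    __kernel void threshold(__global const float* in, __global float* out, float limit)
    {
        size_t gid = get_global_id(0);
        float x = in[gid];

        // Divergent form:
        //   if (x > limit) out[gid] = x * 2.0f; else out[gid] = 0.0f;

        // Branchless form: result = (x > limit) ? x * 2.0f : 0.0f
        out[gid] = select(0.0f, x * 2.0f, x > limit);
    }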

Compute GFlops
EUs have 2 x 4-wide vector ALUs
Second ALU has limitations:
- Subset of instructions: add, mov, mad, mul, cmp
- Instruction must come from another thread
- Only float operands!
Peak GFlops: # EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate
For Intel Iris Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
Add the Intel Core Host Processor → >1 TFlop!
Copyright Khronos Group, 2013 - Page 45

Maximizing Compute Performance
Use mad() / fma(): either explicitly with built-ins, or via -cl-mad-enable
Use floats wherever possible to maximize co-issue
Avoid long and size_t data types
- Prefer float over int, if possible
- Using short data types may improve performance
Trade accuracy for speed: native built-ins, -cl-fast-relaxed-math
- Often good enough for graphics
Copyright Khronos Group, 2013 - Page 46
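A short kernel sketch combining these hints: an explicit mad() plus a native_* built-in, with the relevant build options noted in a comment. The kernel name and the normalization math are illustrative only.

    // Could be built with "-cl-mad-enable -cl-fast-relaxed-math" passed to clBuildProgram.
    __kernel void scale_and_normalize(__global const float4* in,
                                      __global float4* out,
                                      float scale, float bias)
    {
        size_t gid = get_global_id(0);
        float4 v = in[gid];

        float4 scaled = mad(v, (float4)(scale), (float4)(bias)); // fused multiply-add
        float inv_len = native_rsqrt(dot(scaled, scaled));       // fast, lower accuracy
        out[gid] = scaled * inv_len;
    }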

Agenda
Summary / Questions
Copyright Khronos Group, 2013 - Page 47

Summary
Maximize Occupancy
- Choose a Good Local Work Size
- Or, Let the Driver Choose (Local Work Size == NULL)
Avoid Host-to-Device Transfers
- Create Buffers with CL_MEM_USE_HOST_PTR
Access Device Memory Efficiently
- Minimize Cache Lines for global Memory
- Minimize Bank Conflicts for local Memory
Maximize Compute
- Avoid Divergent Branches
- Use mad / fma and float Data When Possible
Copyright Khronos Group, 2013 - Page 48

Questions / Acknowledgements
This presentation would not have been possible without material and review comments from many people. Thank you!
Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier
Copyright Khronos Group, 2013 - Page 49