Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture Arnon Peleg (Intel) Ben Ashbaugh (Intel) Dave Helmly (Adobe)
Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel processor numbers are not a measure of performance. 
Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel Processor Numbers

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Iris graphics is available on select systems. Consult your system manufacturer.

Copyright 2013 Intel Corporation. All rights reserved. Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, VTune, and Ultrabook are trademarks of Intel Corporation in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. *Other names and brands may be claimed as the property of others.
Today: Intel and OpenCL* sessions
- 12:15 PM - 1:45 PM: Maximize Application Performance On the Go and in the Cloud with OpenCL* on Intel Architecture (Intel, Adobe Systems Inc.)
- 2:00 PM - 3:00 PM: Faster Video Creation with Higher Productivity Using Intel Developer Tools and OpenCL* (Intel)
- 3:15 PM - 4:15 PM: Journey of Pixels in Adobe Photoshop from CPU, GPU to Cloud on Intel HD Graphics (Intel, Adobe Systems Inc.)
This Session: Agenda
- Introduction to the Intel SDK for OpenCL* Applications. Presenter: Arnon Peleg (Intel), Product Manager, Intel SDK for OpenCL Applications, Intel Corporation
- Optimize OpenCL* applications for Intel Iris Graphics. Presenter: Ben Ashbaugh (Intel), Sr. Graphics Compiler Engineer, Intel Corporation
- Accelerating video production with OpenCL* and Intel Iris Graphics. Presenter: Dave Helmly, Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.
Introduction to the Intel SDK for OpenCL* Applications
Arnon Peleg, Intel Corporation
The Question Is: How to get maximum performance out of the platform?

Better Together
- Use all resources (CPU, Graphics, Media)
- Target the right task on the right device

Multicore CPU
- Task-parallel or irregular workloads are better suited to a CPU
- Examples: complex game & graphics engine graphs, variable bitrate compression

Programmable Graphics
- Highly data-parallel tasks are better suited to Intel Processor Graphics
- Examples: film & image post-processing filters, graphics, video analytics

Be the conductor of our orchestra: use a dynamic set of instruments whose ranges overlap and complement each other. When these instruments come together, your composition has even more power to move your audience.
What is Programmability With OpenCL*?
A standards-based environment to write portable parallel code for a diverse mix of multi-core CPUs, Processor Graphics, and coprocessors.

Unified code base (CPU and Processor Graphics):

    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    } // execute over n work-items

Maximize platform performance with Intel Iris Graphics and Intel HD Graphics for visual computing usages:
- Image Processing
- Video Editing & Playback
- Games
- Perceptual Computing

Standard based by Khronos*, with Intel participation.
The BIG Idea Behind OpenCL*
OpenCL* execution model:
- Define an N-dimensional computation domain
- Choose a target device
- Execute a kernel at each point in the computation domain

Traditional loops:

    void trad_mul(int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

Data Parallel OpenCL:

    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    } // execute over n work-items
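To make the execution model concrete, here is a minimal host-side sketch in plain C that emulates what "execute a kernel at each point in the computation domain" means for a 1-D domain. The names dp_mul_work_item and run_dp_mul_1d are hypothetical and are not part of OpenCL; a real runtime supplies the ID through get_global_id(0) and runs the work-items in parallel, in no guaranteed order.

```c
#include <stddef.h>

/* Body of the dp_mul kernel for a single work-item. The id parameter
 * stands in for get_global_id(0), which a real OpenCL runtime provides. */
static void dp_mul_work_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical stand-in for enqueueing the kernel over a 1-D domain of
 * n work-items. This sketch visits the points sequentially; a device
 * would execute them concurrently. */
static void run_dp_mul_1d(size_t n, const float *a, const float *b, float *c)
{
    for (size_t id = 0; id < n; id++)
        dp_mul_work_item(id, a, b, c);
}
```

The loop from trad_mul has not disappeared; it has moved out of the kernel and into the runtime, which is what lets the device decide how to spread the iterations across its hardware.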
Intel SDK for OpenCL* Applications 2013
A comprehensive software development environment for OpenCL* applications, supporting the OpenCL 1.2 Full Profile on 3rd and 4th Generation Intel Core Processors with the Intel Iris Graphics and Intel HD Graphics product family, Intel Xeon Processors, and Intel Xeon Phi Coprocessors.
FREE download at intel.com/software/opencl
Intel SDK for OpenCL* Applications XE 2013: What is it?
- OpenCL* runtime and compiler for Intel Processors (CPU) and Intel Xeon Phi Coprocessors
- Linux OS support: Red Hat EL 6.x, SUSE SLES 11.x
- OpenCL headers and libs
- Source code samples
- Product documentation: User Guide, Optimization Guide
- Development tools:
  - Offline build and analysis tools
  - Debugger for OpenCL kernels on CPU
  - Profiling with Intel VTune Amplifier XE

Use the Intel SDK for OpenCL* Applications XE 2013 for Intel Xeon Phi support on Linux*.
Intel SDK for OpenCL* Applications 2013: What is it?
- OpenCL* 1.2 support for 3rd and future 4th Generation Intel Core Processors
- Accelerated performance with the Intel Iris Graphics and Intel HD Graphics family
- Enhanced interoperability with graphics and media APIs: DirectX*, OpenGL*, and the Intel Media SDK
- Support for Microsoft Windows 7* and Windows 8* operating systems
- Developer tools to build, debug, and tune OpenCL applications

Use the Intel SDK for OpenCL* Applications 2013 to maximize the performance of Intel Core Processors.
The Intel SDK for OpenCL* Applications 2013 Software Stack
For Intel Core Processors with Intel Iris Graphics and Intel HD Graphics

Applications

Develop: Intel SDK for OpenCL* Applications (SDK components)
- Microsoft* Visual Studio* Integration
- Profiling Tools Integration
- SDK Tools: Kernel Debugger, Kernel Builder
- Developer Environment (libs and headers)
- Online Resources: Code Samples, Optimization Guide, Tech Articles and Videos

Run:
- Intel HD Graphics Driver, with OpenCL* 1.2 support for CPU and the Processor Graphics
- 3rd and 4th Generation Intel Core Processors with Intel Iris Graphics and Intel HD Graphics
Intel SDK for OpenCL* Applications Support Matrix
Know the processors and operating systems you use, and download the SDK you need.

Columns: Visual Computing Domain (Version 2013); Data Center Domain (Version XE 2013)
Supported processors (rows): Intel HD & Iris Graphics; Intel Core Processor (CPU); Intel Xeon Processor; Intel Xeon Phi Coprocessor
Operating systems (rows): Windows 7*; Windows 8*; Red Hat* Linux*; SUSE* Linux*
The Intel SDK for OpenCL* Applications Online Resource
The SDK section of the Intel Developer Zone is a one-stop shop for resources, support, and information for OpenCL* developers.
intel.com/software/opencl
@intelopencl
- Free Downloads
- Code Samples
- Tech Articles
- Case Studies
- Forums and Support
- Beta Programs
Optimize OpenCL* applications for Intel Iris Graphics
Ben Ashbaugh (Intel)

Agenda
- Understanding Occupancy: How Intel Iris Graphics executes OpenCL* Kernels
- Memory Matters: Host to Device, Device Access
- Compute Characteristics: Maximizing GFlops
- Summary / Questions

Agenda: Understanding Occupancy
How Intel Iris Graphics executes OpenCL* Kernels
Intel Iris Graphics Architecture Overview
[Figure: block diagram of Intel Iris Graphics, connected to the Ring Bus / LLC / Memory. Global assets: Command Streamer (CS), Thread Dispatch, Video Front End (VFE), Video Quality Engine, Multi-Format CODEC, Blitter, Display, and the fixed-function pipeline (Vertex Fetch, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, Clip/Setup). Two slices (Slice 0, Slice 1), each with a Slice Common (Rasterizer / Depth, L3 Cache with Shared Local Memory, Pixel Ops, Render and Depth caches) and two Sub Slices (Sub Slice 0 to Sub Slice 3 in total). Each Sub Slice contains Execution Units (EUs), Samplers (3D Sampler, Media Sampler) and a Data Port, and Instruction and Texture Caches (L1 IC$, Tex$).]
Intel Iris Graphics Architecture Building Blocks
OpenCL* Kernels run on an Execution Unit (EU)
- Each EU is a multi-threaded SIMD processor
- Up to 7 threads per EU (Thread 0 through Thread 6)
- 128 x 8 x 32-bit registers per thread
- Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled): SIMD8, SIMD16, SIMD32
- SIMD8: more registers
- SIMD16 and SIMD32: better efficiency
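The "SIMD8: more registers" trade-off follows from the arithmetic on this slide. The sketch below is a simplified model, assuming the per-thread register file (128 x 8 x 32-bit) is divided evenly among the work items that share an EU thread; the function names are hypothetical.

```c
/* Per-thread register file from the slide: 128 registers, each 8 lanes
 * wide, 32 bits (4 bytes) per lane. */
enum { REGS_PER_THREAD = 128, LANES_PER_REG = 8, BYTES_PER_LANE = 4 };

/* Total register bytes available to one EU thread: 128 * 8 * 4 = 4096. */
static unsigned thread_register_bytes(void)
{
    return REGS_PER_THREAD * LANES_PER_REG * BYTES_PER_LANE;
}

/* Assumption: registers are shared evenly by the work items in a thread,
 * so a wider SIMD compile leaves fewer bytes per work item. */
static unsigned register_bytes_per_work_item(unsigned simd_width)
{
    return thread_register_bytes() / simd_width;
}
```

Under this model a work item gets 512 bytes of register space at SIMD8, 256 bytes at SIMD16, and 128 bytes at SIMD32, which is why register-heavy kernels tend to be compiled narrower.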
Intel Iris Graphics Architecture Building Blocks
OpenCL* Work Groups run on a Sub Slice
- 10 EUs per Sub Slice
- Texture Sampler (images)
- Data Port (buffers)
- Instruction and Texture Caches

OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!
Intel Iris Graphics Architecture Building Blocks
Two Sub Slices make a Slice
- Shared resources (Slice Common): L3 Cache + Shared Local Memory, Barriers

Intel Iris Graphics has Two Slices
- 2 x 2 = 4 Sub Slices
- 4 x 10 = 40 EUs
- Up to 40 x 7 = 280 EU threads
- Up to 8960 OpenCL* work items in flight!
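The totals on this slide can be reproduced from the building-block counts. The sketch below assumes the 8960 figure corresponds to every EU thread running a SIMD32 kernel (280 x 32); the struct and function names are hypothetical.

```c
/* Machine shape as described on the slides. */
struct gpu_config {
    unsigned slices;
    unsigned sub_slices_per_slice;
    unsigned eus_per_sub_slice;
    unsigned threads_per_eu;
    unsigned max_simd_width;   /* widest compile: SIMD32 */
};

/* Intel Iris Graphics numbers taken from the slides. */
static const struct gpu_config IRIS = { 2, 2, 10, 7, 32 };

static unsigned total_eus(const struct gpu_config *g)
{
    return g->slices * g->sub_slices_per_slice * g->eus_per_sub_slice;
}

static unsigned total_eu_threads(const struct gpu_config *g)
{
    return total_eus(g) * g->threads_per_eu;
}

/* Upper bound on resident work items, assuming every thread is SIMD32. */
static unsigned max_work_items_in_flight(const struct gpu_config *g)
{
    return total_eu_threads(g) * g->max_simd_width;
}
```

A kernel compiled SIMD8 would cap out at 280 x 8 = 2240 work items in flight under the same model, so the compiled SIMD width matters when judging whether a launch is large enough.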
How Intel Iris Graphics Runs OpenCL*
1. Divide into Work Groups.
2. Divide each Work Group into EU threads.
3. Launch the EU threads for the Work Group onto a Sub Slice. Repeat for each Work Group.
   - There must be enough room in the Sub Slice for all EU threads of the Work Group.
   - If there is not enough room in any Sub Slice, the EU threads must wait.
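The three steps can be sketched as simple arithmetic. This is a simplified 1-D model, not the driver's actual scheduler: it assumes the global size is a multiple of the local size (as OpenCL 1.2 requires), that each EU thread holds one SIMD width of work items, and that a Sub Slice offers 10 EUs x 7 threads = 70 thread slots. All function names are hypothetical.

```c
enum { EU_THREADS_PER_SUB_SLICE = 10 * 7 };  /* 10 EUs x 7 threads */

static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* Step 1: divide the global range into work groups. */
static unsigned work_group_count(unsigned global_size, unsigned local_size)
{
    return global_size / local_size;
}

/* Step 2: divide one work group into EU threads of simd_width work items. */
static unsigned threads_per_work_group(unsigned local_size, unsigned simd_width)
{
    return ceil_div(local_size, simd_width);
}

/* Step 3: a work group can launch only if all of its EU threads fit into
 * a single Sub Slice. Returns 1 if an empty Sub Slice could hold it. */
static int fits_in_sub_slice(unsigned local_size, unsigned simd_width)
{
    return threads_per_work_group(local_size, simd_width) <= EU_THREADS_PER_SUB_SLICE;
}
```

For example, a launch of 4096 work items with a local size of 256 compiled SIMD16 yields 16 work groups of 16 EU threads each, so four such groups can be resident in one Sub Slice before thread slots run out.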
Occupancy
Goal: use all machine resources. This is harder than it sounds, and there are many factors to consider.

1. Launch Enough Work
   - One thread is sufficient to prevent an EU from going idle.
   - Too few EU threads can result in an EU being stalled.
   - More EU threads give better latency coverage, which keeps an EU active.
Occupancy
[Figure: screenshot of Intel VTune Amplifier XE 2013]
Occupancy (continued)
2. Don't Waste SIMD Lanes: use an optimal Local Work Size
   - Good: query for the compiled SIMD size: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
   - Occasionally helpful: compile for a specific local work size (8, 16, or 32): __attribute__((reqd_work_group_size(X, Y, Z)))
   - Best: let the driver pick (Local Work Size == NULL). Ideal for kernels with no barriers or shared local memory.
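To see why the local work size should be a multiple of the compiled SIMD size, the sketch below counts idle lanes under a simplified model in which a work group is split into whole EU threads of simd_width lanes and the last thread runs partially filled. The function names are hypothetical; in a real application simd_width would come from querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```c
/* Number of EU threads needed to cover one work group. */
static unsigned threads_needed(unsigned local_size, unsigned simd_width)
{
    return (local_size + simd_width - 1) / simd_width;
}

/* SIMD lanes that are allocated but carry no work item. */
static unsigned wasted_lanes(unsigned local_size, unsigned simd_width)
{
    return threads_needed(local_size, simd_width) * simd_width - local_size;
}

/* Round a candidate local size up to the next multiple of the SIMD width
 * so that no lanes are wasted. */
static unsigned round_up_to_multiple(unsigned value, unsigned multiple)
{
    return (value + multiple - 1) / multiple * multiple;
}
```

A local size of 20 on a SIMD16 kernel needs two EU threads and leaves 12 of the 32 lanes idle, while a local size of 32 fills both threads completely.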
Occupancy (continued)
More subtle factors:
3. Barriers
   - 16 barriers per sub slice
   - Can be a limiting factor for very small local work groups
4. Shared Local Memory
   - 64KB shared local memory per sub slice
   - Can be a limiting factor for kernels that use lots of shared local memory
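The interaction of these limits can be sketched as a minimum over three budgets. This is a simplified model under stated assumptions: a Sub Slice has 70 EU thread slots (10 EUs x 7 threads), a work group that uses barriers consumes one of the 16 barriers, and shared local memory is carved out of the 64KB per Sub Slice with no rounding. The function names are hypothetical, and threads_per_group must be at least 1.

```c
enum {
    THREADS_PER_SUB_SLICE   = 70,         /* 10 EUs x 7 threads */
    BARRIERS_PER_SUB_SLICE  = 16,
    SLM_BYTES_PER_SUB_SLICE = 64 * 1024
};

static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/* How many work groups can be resident on one Sub Slice at once.
 * Assumes one barrier per barrier-using work group (a modeling assumption). */
static unsigned max_groups_per_sub_slice(unsigned threads_per_group,
                                         int uses_barrier,
                                         unsigned slm_bytes_per_group)
{
    unsigned limit = THREADS_PER_SUB_SLICE / threads_per_group;
    if (uses_barrier)
        limit = min_u(limit, BARRIERS_PER_SUB_SLICE);
    if (slm_bytes_per_group > 0)
        limit = min_u(limit, SLM_BYTES_PER_SUB_SLICE / slm_bytes_per_group);
    return limit;
}
```

With one EU thread per work group and a barrier in the kernel, only 16 groups (16 of 70 thread slots) can be resident, which is the "very small local work groups" case from the slide. A group that needs 32KB of shared local memory is limited to 2 resident groups regardless of how few threads it uses.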
Agenda: Memory Matters
Host to Device
Optimizing Host to Device Transfers
Host (CPU) and Device (GPU) share the same physical memory.

For OpenCL* buffers: no transfer needed (zero copy)!
- Allocate system memory aligned to a cache line (64 bytes)
- Create the buffer with the system memory pointer and CL_MEM_USE_HOST_PTR
- Use clEnqueueMapBuffer() to access data

For OpenCL* images: currently tiled in device memory, so a transfer is required.
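The first step of the buffer recipe, allocating cache-line-aligned system memory, can be sketched in standard C11. The helper names are hypothetical, and rounding the size up to a whole number of cache lines is a requirement of aligned_alloc rather than something the slide states. The resulting pointer is what would then be passed as host_ptr to clCreateBuffer together with CL_MEM_USE_HOST_PTR; that call is omitted here so the sketch does not depend on an OpenCL runtime.

```c
#include <stdint.h>
#include <stdlib.h>

enum { CACHE_LINE_BYTES = 64 };

/* Round a byte count up to a whole number of cache lines. */
static size_t round_up_to_cache_line(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

/* Allocate at least `bytes` bytes of 64-byte-aligned memory.
 * C11 aligned_alloc expects the size to be a multiple of the alignment.
 * Release the memory with free() after the OpenCL buffer is released. */
static void *alloc_cache_line_aligned(size_t bytes)
{
    return aligned_alloc(CACHE_LINE_BYTES, round_up_to_cache_line(bytes));
}

/* Check that a pointer starts on a cache-line boundary. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

On toolchains without aligned_alloc, a platform-specific aligned allocator would play the same role.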
Operating on Buffers as Images
Intel Iris Graphics supports cl_khr_image2d_from_buffer
- New OpenCL* 1.2 extension
- Treat data as a buffer for some kernels, as an image for others
- Some restrictions for zero copy: buffer size, image pitch

[Figure: the same memory viewed as a Buffer and as an Image]
Interop with Graphics and Media APIs / SDKs
Intel Iris Graphics supports many Graphics and Media interop extensions:
- cl_khr_dx9_media_sharing (includes DXVA for Intel Media SDK)
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_gl_sharing
- cl_khr_gl_depth_images
- cl_khr_gl_event
- cl_khr_gl_msaa_sharing

Use Graphics API / SDK assets in OpenCL* with no copies!
Agenda: Memory Matters
Device Access
Intel Iris Graphics Cache Hierarchy
[Figure: memory hierarchy as seen from the EU. Images are read through Sampler L1 and Sampler L2; buffers are accessed through L3.]
- L3: 256KB/slice
- LLC: 2-8MB/package (shared w/ CPU)
- EDRAM (non-inclusive victim cache): 128MB/package (Intel Iris Pro 5200)
- DRAM
global and constant Memory
Global Memory accesses go through the L3 Cache.
- An L3 cache line is 64 bytes
- EU thread accesses to the same L3 cache line are collapsed
- The order of data within a cache line does not matter
- Bandwidth is determined by the number of cache lines accessed
- Maximum bandwidth: 64 bytes / clock / sub slice

Good: load at least 32 bits of data at a time, starting from a 32-bit aligned address.
Best: load 4 x 32 bits of data at a time, starting from a cache line aligned address.
Loading more than 4 x 32 bits of data is not beneficial.
Global and Constant Memory Access Examples
1. x = data[ get_global_id(0) ]: one cache line, full bandwidth
2. x = data[ n - get_global_id(0) ]: reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ]: offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ]: strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ]: very strided, worst case

[Figure: for each example, global IDs 0 to 15 mapped onto cache lines n - 1, n, n + 1, n + 2, ...]
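Since bandwidth is determined by the number of cache lines touched, the five examples can be checked by counting distinct 64-byte lines for one group of 16 work items. The sketch below is a model, not a measurement: it assumes 4-byte float elements, a data pointer aligned to a cache line, global IDs 0 to 15, and n = 15 for the reversed pattern. All names are hypothetical.

```c
#include <stddef.h>

enum { CACHE_LINE_BYTES = 64, WORK_ITEMS = 16 };

/* Maps a global ID to the element index that work item reads. */
typedef size_t (*index_fn)(size_t id);

static size_t idx_linear(size_t id)   { return id; }        /* data[id]      */
static size_t idx_reverse(size_t id)  { return 15 - id; }   /* data[n - id]  */
static size_t idx_offset(size_t id)   { return id + 1; }    /* data[id + 1]  */
static size_t idx_stride2(size_t id)  { return id * 2; }    /* data[id * 2]  */
static size_t idx_stride16(size_t id) { return id * 16; }   /* data[id * 16] */

/* Count the distinct cache lines touched by work items 0..15. */
static unsigned count_cache_lines(index_fn index)
{
    size_t seen[WORK_ITEMS];
    unsigned count = 0;

    for (size_t id = 0; id < WORK_ITEMS; id++) {
        size_t line = index(id) * sizeof(float) / CACHE_LINE_BYTES;
        int found = 0;
        for (unsigned k = 0; k < count; k++)
            if (seen[k] == line)
                found = 1;
        if (!found)
            seen[count++] = line;
    }
    return count;
}
```

Under these assumptions the patterns touch 1, 1, 2, 2, and 16 cache lines respectively, matching the full, full, half, half, and worst-case labels on the slide.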
local Memory Accesses
Local Memory accesses also go through the L3 Cache!
Key difference: Local Memory is banked.
- Banked at a DWORD granularity, 16 banks
- Bandwidth is determined by the number of bank conflicts
- Maximum bandwidth: still 64 bytes / clock / sub slice
- Supports more access patterns with full bandwidth than Global Memory

No bank conflicts: full bandwidth.
Reading from the same address in a bank: full bandwidth.
Local Memory Access Examples
1. x = data[ get_global_id(0) + 1 ]: unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ]: same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ]: strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ]: very strided, worst case
5. x = data[ get_global_id(0) * 17 ]: full bandwidth!

[Figure: for each example, global IDs 0 to 15 mapped onto banks 0 to 15]
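The five local memory examples can be checked with a small bank-conflict model. The sketch assumes 16 banks at DWORD granularity (bank = DWORD index mod 16), 16 work items with IDs 0 to 15, and that reads of the same address within a bank do not conflict, as the previous slide states. The names are hypothetical.

```c
enum { NUM_BANKS = 16, WORK_ITEMS = 16 };

/* Maps a work-item ID to the DWORD index it reads from local memory. */
typedef unsigned (*index_fn)(unsigned id);

static unsigned idx_offset(unsigned id)   { return id + 1; }
static unsigned idx_pairs(unsigned id)    { return id & ~1u; }
static unsigned idx_stride2(unsigned id)  { return id * 2; }
static unsigned idx_stride16(unsigned id) { return id * 16; }
static unsigned idx_stride17(unsigned id) { return id * 17; }

/* Largest number of distinct addresses that land in a single bank.
 * 1 means conflict-free; k means the access is serialized k ways. */
static unsigned worst_bank_conflict(index_fn index)
{
    unsigned worst = 0;

    for (unsigned bank = 0; bank < NUM_BANKS; bank++) {
        unsigned addrs[WORK_ITEMS];
        unsigned distinct = 0;

        for (unsigned id = 0; id < WORK_ITEMS; id++) {
            unsigned addr = index(id);
            int found = 0;
            if (addr % NUM_BANKS != bank)
                continue;
            for (unsigned k = 0; k < distinct; k++)
                if (addrs[k] == addr)
                    found = 1;
            if (!found)
                addrs[distinct++] = addr;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

The stride of 17 is conflict-free because 17 mod 16 is 1, so consecutive work items land in consecutive banks. This is the usual reason for padding a 16-wide local array to 17 columns.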
private Memory
The compiler can usually allocate Private Memory in the Register File, even if the Private Memory is dynamically indexed: good performance.
Fallback: Private Memory is allocated in Global Memory, and accesses are very strided: bad performance.

[Figure: private arrays (private int a[100], private int b[100], private int c[200]) laid out per work item (Work Item 0, Work Item 1, ..., Work Item n) across EU Thread n-1, EU Thread n, and EU Thread n+1]
Agenda: Compute Characteristics
Maximizing GFlops
ISA
SIMT ISA with predication and branching.
Divergent code executes both branches: reduced SIMD efficiency.

    this();
    if ( x )
        that();
    else
        another();
    finish();

[Figure: SIMD lane vs. time for two examples: x sometimes true, and x never true]
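The cost of divergence can be illustrated with a toy model of one SIMD16 EU thread running the snippet above. The unit costs (one time step each for this(), that(), another(), and finish()) are an assumption for illustration only, and the function names are hypothetical. A branch is executed by the thread whenever at least one lane takes it, with the other lanes predicated off.

```c
enum { SIMD_WIDTH = 16 };

/* Time steps the EU thread spends on the snippet, given each lane's x. */
static unsigned thread_steps(const int x[SIMD_WIDTH])
{
    int any_true = 0, any_false = 0;

    for (int lane = 0; lane < SIMD_WIDTH; lane++) {
        if (x[lane])
            any_true = 1;
        else
            any_false = 1;
    }
    /* this() + that() if any lane needs it + another() if any lane
     * needs it + finish() */
    return 1u + (unsigned)any_true + (unsigned)any_false + 1u;
}

/* Fraction of lane-steps doing useful work: every lane needs 3 steps
 * (this, one branch, finish) out of the steps the thread executes. */
static double simd_efficiency(const int x[SIMD_WIDTH])
{
    return 3.0 / (double)thread_steps(x);
}
```

When x is never true the thread takes 3 steps at full efficiency. When x is true for only some lanes it takes 4 steps and efficiency drops to 0.75 in this model, and the gap grows with the length of the branch bodies.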
Compute GFlops
EUs have 2 x 4-wide vector ALUs.
The second ALU has limitations:
- Subset of instructions: add, mov, mad, mul, cmp
- The instruction must come from another EU thread
- Only float operands!

Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate
For Intel Iris Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
4th Generation Intel Core Processor (CPU + Intel Iris Graphics): >1 TFlop!
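The peak figure can be reproduced directly from the formula on the slide. The function name and parameter breakdown below are hypothetical; the values (40 EUs, 2 ALUs of width 4, 2 floating-point operations per mad, 1.3 GHz) are the ones quoted for Intel Iris Pro 5200.

```c
/* Peak single-precision GFlops:
 * #EUs x (ALUs per EU x lanes per ALU) x flops per instruction x clock (GHz).
 * flops_per_instruction is 2 for a mad (one multiply plus one add). */
static double peak_gflops(unsigned eus,
                          unsigned alus_per_eu,
                          unsigned lanes_per_alu,
                          unsigned flops_per_instruction,
                          double clock_ghz)
{
    return (double)eus * alus_per_eu * lanes_per_alu
         * flops_per_instruction * clock_ghz;
}
```

Calling peak_gflops(40, 2, 4, 2, 1.3) gives about 832. Reaching it requires both ALUs to issue a float mad every clock, which is why the second ALU's restrictions matter in practice.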
Maximizing Compute Performance
- Use mad() / fma(): either explicitly with built-ins, or via -cl-mad-enable
- Use floats wherever possible to maximize co-issue
- Avoid long and size_t data types
- Prefer float over int, if possible
- Using short data types may improve performance
- Trade accuracy for speed: native built-ins, -cl-fast-relaxed-math. Often good enough for graphics.
Agenda: Summary / Questions
Summary
- Maximize Occupancy
  - Choose a good Local Work Size
  - Or, let the driver choose (Local Work Size == NULL)
- Avoid Host-to-Device Transfers
  - Create buffers with CL_MEM_USE_HOST_PTR
- Access Device Memory Efficiently
  - Minimize cache lines for global memory
  - Minimize bank conflicts for local memory
- Maximize Compute
  - Avoid divergent branches
  - Use mad / fma and float data when possible
Questions / Acknowledgements
This presentation would not have been possible without material and review comments from many people. Thank you!
Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier
Additional Resources
- Intel SDK for OpenCL* Applications 2013
- Intel OpenCL* Optimization Guide
- Intel VTune Amplifier XE 2013
- Intel Graphics Performance Analyzers
Accelerating video production with OpenCL* and Intel Iris Graphics
Dave Helmly, Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.
Intel is Hiring! We want to work with you!
www.intel.com/jobs/
Head to our booth (201) to hear about our exciting opportunities!
Coming up next
2:00 p.m. - 3:00 p.m.: Faster Content Creation with Higher Productivity using Intel Developer Tools and OpenCL*
Presenters: Arnon Peleg (Intel), Raghu Muthyalampalli (Intel)
For more information: http://software.intel.com/en-us/siggraph2013