Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

Presenters: Arnon Peleg (Intel), Ben Ashbaugh (Intel), Dave Helmly (Adobe)

Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel processor numbers are not a measure of performance. 
Processor numbers differentiate features within each processor family, not across different processor families. Go to: Learn About Intel Processor Numbers. Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps. Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Iris graphics is available on select systems. Consult your system manufacturer. Copyright 2013 Intel Corporation. All rights reserved. Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, VTune, and Ultrabook are trademarks of Intel Corporation in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. *Other names and brands may be claimed as the property of others.

Today: Intel and OpenCL* sessions
- 12:15 PM - 1:45 PM: Maximize Application Performance On the Go and in the Cloud with OpenCL* on Intel Architecture (Intel, Adobe Systems Inc.)
- 2:00 PM - 3:00 PM: Faster Video Creation with Higher Productivity Using Intel Developer Tools and OpenCL* (Intel)
- 3:15 PM - 4:15 PM: Journey of Pixels in Adobe Photoshop from CPU, GPU to Cloud on Intel HD Graphics (Intel, Adobe Systems Inc.)

This Session: Agenda
- Introduction to the Intel SDK for OpenCL* Applications. Presenter: Arnon Peleg (Intel), Product Manager, Intel SDK for OpenCL Applications, Intel Corporation
- Optimize OpenCL* applications for Intel Iris Graphics. Presenter: Ben Ashbaugh (Intel), Sr. Graphics Compiler Engineer, Intel Corporation
- Accelerating video production with OpenCL* and Intel Iris Graphics. Presenter: Dave Helmly, Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.

Introduction to the Intel SDK for OpenCL* Applications. Arnon Peleg, Intel Corporation

The Question Is: How to get maximum performance out of the platform?
Better Together:
- Use all resources (CPU, Graphics, Media)
- Target the right task on the right device
Multicore CPU: task-parallel or irregular workloads are better suited to a CPU, such as complex game and graphics engine graphs and variable bitrate compression.
Programmable Graphics: highly data-parallel tasks are better suited to Intel Processor Graphics, such as film and image post-processing filters, graphics, and video analytics.
Be the conductor of our orchestra: use a dynamic set of instruments whose ranges overlap and complement each other. When these instruments come together, your composition has even more power to move your audience.

What is Programmability With OpenCL*?
A standard-based environment to write portable parallel code for a diverse mix of multi-core CPUs, Processor Graphics, and coprocessors.
Unified code base for CPU and Processor Graphics:
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    } // execute over n work-items
Maximize platform performance with Intel Iris Graphics and Intel HD Graphics for visual computing usages: image processing, video editing and playback, games, perceptual computing.
Standard based by Khronos*, with Intel participation.

The BIG Idea Behind OpenCL*
OpenCL* execution model:
- Define an N-dimensional computation domain
- Choose a target device
- Execute a kernel at each point in the computation domain
Traditional loops:
    void trad_mul(int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Data-parallel OpenCL:
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    } // execute over n work-items
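The relationship between the two forms can be sketched in plain C. This is only an emulation under stated assumptions: `dp_mul_work_item` and `run_over_domain` are hypothetical names, the work-item id is passed in as a parameter instead of coming from get_global_id(0), and the points are visited one after another, whereas a real OpenCL device may run them in parallel and in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the dp_mul kernel for a single work item. The id is an explicit
 * parameter here; in OpenCL C it would come from get_global_id(0). */
void dp_mul_work_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical emulation of a 1-D launch over n work items: the kernel body
 * runs once for every point in the computation domain. */
void run_over_domain(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t id = 0; id < n; id++)
        dp_mul_work_item(id, a, b, c);
}
```

The kernel body contains no loop of its own; the loop lives in the launcher, which is the part the OpenCL runtime replaces.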

Intel SDK for OpenCL* Applications 2013
A comprehensive software development environment for OpenCL* applications, supporting the OpenCL 1.2 Full Profile on 3rd and 4th Generation Intel Core Processors with the Intel Iris Graphics and Intel HD Graphics product family, Intel Xeon Processors, and Intel Xeon Phi Coprocessors.
FREE download at intel.com/software/opencl

Intel SDK for OpenCL* Applications XE 2013: What is it?
- OpenCL* runtime and compiler for Intel Processors (CPU) and Intel Xeon Phi Coprocessors
- Linux OS support: Red Hat EL 6.x, SUSE SLES 11.x
- OpenCL headers and libs
- Source code samples
- Product documentation: User Guide, Optimization Guide
- Development tools: offline build and analysis tools, debugger for OpenCL kernels on CPU, profiling with Intel VTune Amplifier XE
Use the Intel SDK for OpenCL* Applications XE 2013 for Intel Xeon Phi support on Linux*.

Intel SDK for OpenCL* Applications 2013: What is it?
- OpenCL* 1.2 support for 3rd and future 4th Generation Intel Core Processors
- Accelerated performance with the Intel Iris Graphics and Intel HD Graphics family
- Enhanced graphics and media API interoperability: DirectX*, OpenGL*, and the Intel Media SDK
- Support for Microsoft Windows 7* and Windows 8* operating systems
- Developer tools to build, debug, and tune OpenCL applications
Use the Intel SDK for OpenCL* Applications 2013 to maximize performance of Intel Core Processors.

The Intel SDK for OpenCL* Applications 2013 Software Stack
For Intel Core Processors with Intel Iris Graphics and Intel HD Graphics
- Applications
- Develop (SDK components, Intel SDK for OpenCL* Applications): Microsoft* Visual Studio* integration; SDK tools (Kernel Debugger, Kernel Builder); developer environment (libs and headers); profiling tools integration; online resources (code samples, Optimization Guide, tech articles and videos)
- Run: Intel HD Graphics Driver with OpenCL* 1.2 support for the CPU and the Processor Graphics
- Hardware: 3rd and 4th Generation Intel Core Processors with Intel Iris Graphics and Intel HD Graphics

Intel SDK for OpenCL* Applications Support Matrix
Know the processors and operating systems you use, and download the SDK you need.
Columns: Visual Computing Domain (Version 2013); Data Center Domain (Version XE 2013)
Supported processors: Intel HD & Iris Graphics; Intel Core Processor (CPU); Intel Xeon Processor; Intel Xeon Phi Coprocessor
Operating systems: Windows 7*; Windows 8*; Red Hat* Linux*; SUSE* Linux*

The Intel SDK for OpenCL* Applications Online Resource The SDK section of the Intel Developers Zone is a one-stop shop for resources, support and information for OpenCL* developers intel.com/software/opencl @intelopencl Free Downloads Code Samples Tech Articles Case Studies Forums and Support Beta Programs

Optimize OpenCL* applications for Intel Iris Graphics. Ben Ashbaugh (Intel)

Agenda
- Understanding Occupancy: How Intel Iris Graphics executes OpenCL* Kernels
- Memory Matters: Host to Device; Device Access
- Compute Characteristics: Maximizing GFlops
- Summary / Questions

Agenda: Understanding Occupancy. How Intel Iris Graphics executes OpenCL* Kernels

[Figure: Intel Iris Graphics Architecture Overview, a block diagram shown in four builds. Blocks shown: Command Streamer (CS), Vertex Fetch (VF), Vertex Shader (VS), Hull Shader (HS), Tessellator, Domain Shader (DS), Geometry Shader (GS), Stream Out (SOL), Clip/Setup, Video Front End (VFE), Video Quality Engine, Multi-Format CODEC, Blitter, Display, Thread Dispatch, and the Ring Bus / LLC / Memory. Two slices (Slice 0 and Slice 1) hold Sub Slices 0 to 3; each slice shows Rasterizer / Depth, L3$, Pixel Ops, Render$, and Depth$, and each sub slice shows EUs, L1 IC$, 3D Sampler, Media Sampler, Data Port, and Tex$. The builds highlight in turn: Global Assets; Slice Common (L3 cache, shared local memory); and Sub Slice (execution units, samplers and data port, instruction and texture caches).]

Intel Iris Graphics Architecture Building Blocks
OpenCL* Kernels run on an Execution Unit (EU)
- Each EU is a multi-threaded SIMD processor
- Up to 7 threads per EU (Thread 0 through Thread 6)
- 128 x 8 x 32-bit registers per thread
- Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled): SIMD8, SIMD16, SIMD32
- SIMD8: more registers. SIMD16 and SIMD32: better efficiency

Intel Iris Graphics Architecture Building Blocks
OpenCL* Work Groups run on a Sub Slice
- 10 EUs per Sub Slice
- Texture Sampler (images)
- Data Port (buffers)
- Instruction and texture caches
OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!

Intel Iris Graphics Architecture Building Blocks
Two Sub Slices make a Slice
- Shared resources (Slice Common): L3 cache + shared local memory, barriers
Intel Iris Graphics has two Slices
- 2 x 2 = 4 Sub Slices
- 4 x 10 = 40 EUs
- Up to 40 x 7 = 280 EU threads
- Up to 280 x 32 = 8960 OpenCL* work items in flight (at SIMD32)!
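The arithmetic on this slide can be restated as a small C sketch. The constants are the figures quoted on the slide; the function names are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Topology figures quoted on the slide for Intel Iris Graphics. */
enum {
    SLICES               = 2,
    SUB_SLICES_PER_SLICE = 2,
    EUS_PER_SUB_SLICE    = 10,
    THREADS_PER_EU       = 7
};

/* Total EU threads the whole device can hold at once. */
int total_eu_threads(void)
{
    return SLICES * SUB_SLICES_PER_SLICE * EUS_PER_SUB_SLICE * THREADS_PER_EU;
}

/* Upper bound on work items in flight for a compiled SIMD width
 * (8, 16, or 32 work items per EU thread). */
int max_work_items_in_flight(int simd_width)
{
    assert(simd_width == 8 || simd_width == 16 || simd_width == 32);
    return total_eu_threads() * simd_width;
}
```

The 8960 figure is the SIMD32 case; a kernel compiled SIMD8 tops out at a quarter of that.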

How Intel Iris Graphics Runs OpenCL*
1. Divide into Work Groups
2. Divide each Work Group into EU threads
3. Launch the EU threads for the Work Group onto a Sub Slice
   - Repeat for each Work Group
   - There must be enough room in the Sub Slice for all EU threads of the Work Group
   - If there is not enough room in any Sub Slice, the EU threads must wait

Occupancy
Goal: use all machine resources. This is harder than it sounds! Many factors to consider.
1. Launch enough work
   - One thread is sufficient to prevent an EU from going idle
   - Too few EU threads can still result in an EU being stalled
   - More EU threads give better latency coverage, which keeps an EU active

[Figure: Occupancy as shown in Intel VTune Amplifier XE 2013]

Occupancy (continued)
2. Don't waste SIMD lanes: use an optimal Local Work Size
   - Good: query for the compiled SIMD size with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
   - Occasionally helpful: compile for a specific local work size (8, 16, or 32) with __attribute__((reqd_work_group_size(X, Y, Z)))
   - Best: let the driver pick (Local Work Size == NULL); ideal for kernels with no barriers or shared local memory
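Why the local work size matters can be shown with a little arithmetic. The sketch below assumes each EU thread carries exactly `simd_width` work items, so a work group whose size is not a multiple of the SIMD width leaves lanes idle in its last EU thread. The helper names are hypothetical; in a real application `simd_width` would be the value returned for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```c
#include <assert.h>

/* EU threads needed to run one work group: a partially filled
 * thread still occupies a whole EU thread. */
int eu_threads_per_work_group(int local_size, int simd_width)
{
    assert(local_size > 0 && simd_width > 0);
    return (local_size + simd_width - 1) / simd_width;
}

/* SIMD lanes left idle in the last EU thread of each work group. */
int wasted_lanes(int local_size, int simd_width)
{
    return eu_threads_per_work_group(local_size, simd_width) * simd_width
           - local_size;
}
```

For example, a local work size of 40 on a SIMD16 kernel needs three EU threads and leaves 8 lanes unused in every work group, while a local work size of 64 wastes none.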

Occupancy (continued)
More subtle factors:
3. Barriers
   - 16 barriers per sub slice
   - Can be a limiting factor for very small local work groups
4. Shared Local Memory
   - 64KB shared local memory per sub slice
   - Can be a limiting factor for kernels that use lots of shared local memory
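The three per-sub-slice limits mentioned so far (EU threads, barriers, shared local memory) can be combined into a rough estimate of how many work groups fit on one sub slice at once. This is a simplified model, not the driver's actual scheduling policy: it assumes a work group that uses barriers consumes one of the 16 barriers, and that shared local memory is reserved per work group. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Per-sub-slice limits quoted on the slides. */
enum {
    EU_THREADS_PER_SUB_SLICE = 10 * 7,      /* 10 EUs x 7 threads each */
    BARRIERS_PER_SUB_SLICE   = 16,
    SLM_BYTES_PER_SUB_SLICE  = 64 * 1024
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Simplified estimate of the work groups resident on one sub slice.
 * Assumes one barrier per barrier-using work group (an assumption of
 * this sketch, not a documented rule). */
int work_groups_per_sub_slice(int local_size, int simd_width,
                              int uses_barrier, int slm_bytes_per_group)
{
    assert(local_size > 0 && simd_width > 0 && slm_bytes_per_group >= 0);
    int threads_per_group = (local_size + simd_width - 1) / simd_width;
    int limit = EU_THREADS_PER_SUB_SLICE / threads_per_group;
    if (uses_barrier)
        limit = min_int(limit, BARRIERS_PER_SUB_SLICE);
    if (slm_bytes_per_group > 0)
        limit = min_int(limit, SLM_BYTES_PER_SUB_SLICE / slm_bytes_per_group);
    return limit;
}
```

Under this model, a barrier-using kernel with a local work size of 16 at SIMD16 is capped at 16 resident work groups, which fills only 16 of the 70 EU threads. That is the "very small local work groups" case the slide warns about.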

Agenda: Memory Matters. Host to Device

Optimizing Host to Device Transfers
Host (CPU) and Device (GPU) share the same physical memory
For OpenCL* buffers: no transfer needed (zero copy)!
- Allocate system memory aligned to a cache line (64 bytes)
- Create the buffer with the system memory pointer and CL_MEM_USE_HOST_PTR
- Use clEnqueueMapBuffer() to access data
For OpenCL* images: currently tiled in device memory, so a transfer is required
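The first step, allocating cache-line-aligned system memory, can be sketched in standard C11. The helper names are hypothetical, and only the 64-byte alignment comes from the slide; the OpenCL calls themselves are left out so that the sketch builds without an OpenCL SDK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_BYTES 64

/* Round a byte count up to a whole number of cache lines. */
size_t round_up_to_cache_line(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

/* Allocate host memory on a cache-line boundary. C11 aligned_alloc expects
 * the size to be a multiple of the alignment, so the size is rounded up
 * first. Release the memory with free(). */
void *alloc_cache_line_aligned(size_t bytes)
{
    assert(bytes > 0);
    return aligned_alloc(CACHE_LINE_BYTES, round_up_to_cache_line(bytes));
}

/* Nonzero when the pointer sits on a cache-line boundary. */
int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}
```

The returned pointer is what would then be handed to clCreateBuffer together with CL_MEM_USE_HOST_PTR, and the data would be accessed through clEnqueueMapBuffer() as the slide describes.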

Operating on Buffers as Images
- Intel Iris Graphics supports cl_khr_image2d_from_buffer, a new OpenCL* 1.2 extension
- Treat data as a buffer for some kernels, as an image for others
- Some restrictions for zero copy: buffer size, image pitch
[Figure: the same memory (0x123, 0x456, 0x789) viewed as a Buffer and as an Image]

Interop with Graphics and Media APIs / SDKs
Intel Iris Graphics supports many Graphics and Media interop extensions:
- cl_khr_dx9_media_sharing (includes DXVA for Intel Media SDK)
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_gl_sharing
- cl_khr_gl_depth_images
- cl_khr_gl_event
- cl_khr_gl_msaa_sharing
Use Graphics API / SDK assets in OpenCL* with no copies!

Agenda: Memory Matters. Device Access

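Intel Iris Graphics Cache Hierarchy
[Figure: cache hierarchy diagram. Images are read by the EU through Sampler L1 and Sampler L2; buffers are accessed through L3. Levels shown: L3 (256KB/slice), LLC (2-8MB/package, shared w/ CPU), EDRAM (non-inclusive victim cache, 128MB/package on Intel Iris Pro 5200), and DRAM.]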

global and constant Memory
- Global memory accesses go through the L3 cache
- An L3 cache line is 64 bytes
- EU thread accesses to the same L3 cache line are collapsed; the order of data within the cache line does not matter
- Bandwidth is determined by the number of cache lines accessed
- Maximum bandwidth: 64 bytes / clock / sub slice
- Good: load at least 32 bits of data at a time, starting from a 32-bit aligned address
- Best: load 4 x 32 bits of data at a time, starting from a cache line aligned address
- Loading more than 4 x 32 bits of data is not beneficial

Global and Constant Memory Access Examples
1. x = data[ get_global_id(0) ]: one cache line, full bandwidth
2. x = data[ n - get_global_id(0) ]: reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ]: offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ]: strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ]: very strided, worst case
[Figure: for each example, global IDs 0 to 15 mapped onto cache lines n - 1, n, n + 1, ...]
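The five cases above can be checked by counting how many distinct 64-byte cache lines a group of 16 work items touches. The sketch assumes 4-byte elements and a `data` array that starts on a cache-line boundary; `cache_lines_touched` and `strided_pattern` are hypothetical helpers, not part of any OpenCL API.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64
#define SIMD_WIDTH 16

/* Count the distinct cache lines touched when SIMD_WIDTH work items each
 * load one 4-byte element at the given element indices. */
int cache_lines_touched(const long index[SIMD_WIDTH])
{
    long seen[SIMD_WIDTH];
    int count = 0;
    for (int i = 0; i < SIMD_WIDTH; i++) {
        assert(index[i] >= 0);
        long line = index[i] * 4 / CACHE_LINE_BYTES;
        int found = 0;
        for (int j = 0; j < count; j++)
            if (seen[j] == line) { found = 1; break; }
        if (!found)
            seen[count++] = line;
    }
    return count;
}

/* Fill index[] for the access data[id * stride + offset], id = 0..15. */
void strided_pattern(long index[SIMD_WIDTH], long stride, long offset)
{
    for (int id = 0; id < SIMD_WIDTH; id++)
        index[id] = id * stride + offset;
}
```

With these helpers, stride 1 touches one line, an offset of 1 or a stride of 2 touches two lines (half bandwidth), and a stride of 16 touches sixteen lines, one per work item. The reverse-order case (stride -1, offset 15) still touches a single line, because order within a cache line does not matter.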

local Memory Accesses
- Local memory accesses also go through the L3 cache!
- Key difference: local memory is banked, at DWORD granularity, with 16 banks
- Bandwidth is determined by the number of bank conflicts
- Maximum bandwidth: still 64 bytes / clock / sub slice
- Supports more access patterns with full bandwidth than global memory
- No bank conflicts: full bandwidth
- Reading from the same address in a bank: full bandwidth

Local Memory Access Examples
1. x = data[ get_global_id(0) + 1 ]: unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ]: same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ]: strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ]: very strided, worst case
5. x = data[ get_global_id(0) * 17 ]: full bandwidth!
[Figure: for each example, global IDs 0 to 15 mapped onto banks 0 to 15]
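The bank behaviour behind these examples can be modelled in a few lines of C. The sketch assumes DWORD (4-byte) elements, so element index modulo 16 selects the bank, and it treats several reads of the same address as conflict-free, as the previous slide states. `max_bank_conflict` is a hypothetical helper; a result of 1 means full bandwidth, 2 means roughly half, and 16 is the worst case.

```c
#include <assert.h>

#define NUM_BANKS 16
#define SIMD_WIDTH 16

/* Largest number of distinct DWORD addresses that fall into a single bank
 * for one group of SIMD_WIDTH accesses. Repeated reads of the same address
 * count once, because they do not conflict. */
int max_bank_conflict(const long index[SIMD_WIDTH])
{
    int worst = 0;
    for (int bank = 0; bank < NUM_BANKS; bank++) {
        long seen[SIMD_WIDTH];
        int distinct = 0;
        for (int i = 0; i < SIMD_WIDTH; i++) {
            assert(index[i] >= 0);
            if (index[i] % NUM_BANKS != bank)
                continue;
            int found = 0;
            for (int j = 0; j < distinct; j++)
                if (seen[j] == index[i]) { found = 1; break; }
            if (!found)
                seen[distinct++] = index[i];
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

A stride of 17 is conflict-free because 17 mod 16 is 1, so consecutive work items land in consecutive banks; a stride of 16 sends every work item to bank 0. Padding a 16-wide row to 17 elements is the usual way to exploit this.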

private Memory
- The compiler can usually allocate private memory in the register file, even if private memory is dynamically indexed: good performance
- Fallback: private memory is allocated in global memory, where accesses are very strided: bad performance
[Figure: private arrays (private int a[100], private int b[100], private int c[200]) laid out per work item (Work Item 0, Work Item 1, ..., Work Item n) across EU threads n-1, n, and n+1]

Agenda: Compute Characteristics. Maximizing GFlops

ISA
- SIMT ISA with predication and branching
- Divergent code executes both branches: reduced SIMD efficiency
Example code:
    this();
    if ( x )
        that();
    else
        another();
    finish();
[Figure: SIMD lane activity over time for two cases: x sometimes true, and x never true]
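The cost of divergence in the if/else example can be illustrated with a toy model. It assumes an EU thread must make one pass over a branch side whenever at least one of its SIMD lanes takes that side, with the other lanes masked off. `branch_passes` is a hypothetical helper, and real hardware timing depends on more than this count.

```c
#include <assert.h>

/* Number of branch sides one EU thread has to execute for an if/else.
 * lane_takes_if[i] is nonzero when x is true in SIMD lane i.
 * Returns 1 when all lanes agree and 2 when they diverge. */
int branch_passes(const int lane_takes_if[], int simd_width)
{
    assert(simd_width > 0);
    int any_if = 0, any_else = 0;
    for (int i = 0; i < simd_width; i++) {
        if (lane_takes_if[i])
            any_if = 1;
        else
            any_else = 1;
    }
    return any_if + any_else;
}
```

When x is never true, only another() runs; when x is true in even a single lane, both that() and another() run for the whole EU thread.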

Compute GFlops
EUs have 2 x 4-wide vector ALUs
The second ALU has limitations:
- Subset of instructions: add, mov, mad, mul, cmp
- Instruction must come from another EU thread
- Only float operands!
Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate
For Intel Iris Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
4th Generation Intel Core Processor (CPU + Intel Iris Graphics): >1 TFlop!
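The peak formula above translates directly into code. In this sketch `peak_gflops` is a hypothetical helper, the clock is given in GHz, and "MUL + ADD" is counted as 2 floating-point operations per lane per clock.

```c
#include <assert.h>

/* Peak single-precision GFlops:
 * EUs x ALUs per EU x lanes per ALU x flops per lane per clock x clock (GHz). */
double peak_gflops(int eus, int alus_per_eu, int alu_width,
                   int flops_per_lane, double clock_ghz)
{
    assert(eus > 0 && alus_per_eu > 0 && alu_width > 0);
    assert(flops_per_lane > 0 && clock_ghz > 0.0);
    return (double)eus * alus_per_eu * alu_width * flops_per_lane * clock_ghz;
}
```

Calling `peak_gflops(40, 2, 4, 2, 1.3)` reproduces the slide's figure of about 832 GFlops for Intel Iris Pro 5200. Because the second ALU only co-issues float instructions from another EU thread, integer-heavy or low-occupancy kernels should expect closer to half of that peak.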

Maximizing Compute Performance
- Use mad() / fma(): either explicitly with built-ins, or via -cl-mad-enable
- Use floats wherever possible to maximize co-issue
- Avoid long and size_t data types
- Prefer float over int, if possible
- Using short data types may improve performance
- Trade accuracy for speed: native built-ins, -cl-fast-relaxed-math; often good enough for graphics

Agenda: Summary / Questions

Summary
- Maximize occupancy: choose a good Local Work Size, or let the driver choose (Local Work Size == NULL)
- Avoid host-to-device transfers: create buffers with CL_MEM_USE_HOST_PTR
- Access device memory efficiently: minimize cache lines for global memory; minimize bank conflicts for local memory
- Maximize compute: avoid divergent branches; use mad / fma and float data when possible

Questions / Acknowledgements
This presentation would not have been possible without material and review comments from many people. Thank you!
Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier

Additional Resources
- Intel SDK for OpenCL* Applications 2013
- Intel OpenCL* Optimization Guide
- Intel VTune Amplifier XE 2013
- Intel Graphics Performance Analyzers

Accelerating video production with OpenCL* and Intel Iris Graphics. Dave Helmly, Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.

Intel is Hiring! We want to work with you! www.intel.com/jobs/ Head to our booth (201) to hear about our exciting opportunities!

Coming up next: 2:00 p.m. - 3:00 p.m. Faster Content Creation with Higher Productivity using Intel Developer Tools and OpenCL*. Presenters: Arnon Peleg (Intel), Raghu Muthyalampalli (Intel). For more information: http://software.intel.com/en-us/siggraph2013