WebCL for Hardware-Accelerated Web Applications. Won Jeon, Tasneem Brutch, and Simon Gibbs

Similar documents

Enabling OpenCL Acceleration of Web Applications

Parallel Web Programming

Lecture 3. Optimising OpenCL performance

Course materials. In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful:

Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011

Introduction to OpenCL Programming. Training Guide

Crosswalk: build world class hybrid mobile apps

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Press Briefing. GDC, March Neil Trevett Vice President Mobile Ecosystem, NVIDIA President Khronos. Copyright Khronos Group Page 1

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Performance Optimization and Debug Tools for mobile games with PlayCanvas

OpenGL Insights. Edited by. Patrick Cozzi and Christophe Riccio

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

COSCO 2015 Heterogeneous Computing Programming

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March Copyright Khronos Group Page 1

Geo-Scale Data Visualization in a Web Browser. Patrick Cozzi pcozzi@agi.com

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Introduction to WebGL

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

The Most Popular UI/Apps Framework For IVI on Linux

Computer Graphics on Mobile Devices VL SS ECTS

Introduction to GPU Programming Languages

How OpenCL enables easy access to FPGA performance?

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

AMD Accelerated Parallel Processing. OpenCL Programming Guide. November rev2.7

Introduction to GPU hardware and to CUDA

CSC230 Getting Starting in C. Tyler Bletsch

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

Texture Cache Approximation on GPUs

JavaFX Session Agenda

Programming Guide. ATI Stream Computing OpenCL. June rev1.03

OpenGL & Delphi. Max Kleiner. 1/22

Outline. 1.! Development Platforms for Multimedia Programming!

Making the Most of Existing Public Web Development Frameworks WEB04

Web Based 3D Visualization for COMSOL Multiphysics

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

PDC Summer School Introduction to High- Performance Computing: OpenCL Lab

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

gpus1 Ubuntu Available via ssh

Introduction to OpenCV for Tegra. Shalini Gupta, Nvidia

Writing Applications for the GPU Using the RapidMind Development Platform

Image Processing & Video Algorithms with CUDA

3D Graphics and Cameras

OpenGL ES Safety-Critical Profile Philosophy

Introduction to GPGPU. Tiziano Diamanti

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

How To Teach Computer Graphics

GPU Profiling with AMD CodeXL

Vulkan on NVIDIA GPUs. Piers Daniell, Driver Software Engineer, OpenGL and Vulkan

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

OpenCL Programming for the CUDA Architecture. Version 2.3

Debugging with TotalView

USING RUST TO BUILD THE NEXT GENERATION WEB BROWSER

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

GPU ACCELERATED DATABASES Database Driven OpenCL Programming. Tim Child 3DMashUp CEO

Java GPU Computing. Maarten Steur & Arjan Lamers

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

Research and Design of Universal and Open Software Development Platform for Digital Home

GPU Parallel Computing Architecture and CUDA Programming Model

Stream Processing on GPUs Using Distributed Multimedia Middleware

Next Generation GPU Architecture Code-named Fermi

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

L20: GPU Architecture and Models

The OpenCL Specification

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware

GPU Hardware Performance. Fall 2015

Real-time Visual Tracker by Stream Processing

Leveraging Aparapi to Help Improve Financial Java Application Performance

Optimizing AAA Games for Mobile Platforms

Getting Started with CodeXL

GPGPU Parallel Merge Sort Algorithm

Altera SDK for OpenCL

The Future Of Animation Is Games

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Embedded Systems: map to FPGA, GPU, CPU?

APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE

OpenACC Basics Directive-based GPGPU Programming

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

Firefox for Android. Reviewer s Guide. Contact us: press@mozilla.com

Optimizing Application Performance with CUDA Profiling Tools

CUDA Basics. Murphy Stein New York University

HPC with Multicore and GPUs

The OpenGL Framebuffer Object Extension. Simon Green. NVIDIA Corporation

Xeon+FPGA Platform for the Data Center

OpenACC 2.0 and the PGI Accelerator Compilers

GPGPU. General Purpose Computing on. Diese Folien wurden von Mathias Bach und David Rohr erarbeitet

ipad, a revolutionary device - Apple

How To Understand The Power Of Unity 3D (Pro) And The Power Behind It (Pro/Pro)

Finger Paint: Cross-platform Augmented Reality

Trends in HTML5. Matt Spencer UI & Browser Marketing Manager

Transcription:

WebCL for Hardware-Accelerated Web Applications Won Jeon, Tasneem Brutch, and Simon Gibbs

What is WebCL? WebCL is a JavaScript binding to OpenCL. WebCL enables significant acceleration of compute-intensive web applications. WebCL currently being standardized by Khronos. 2

Contents Introduction Motivation and goals OpenCL WebCL Demos Programming WebCL Samsung s Prototype: implementation & performance Standardization Khronos WebCL Working Group WebCL robustness and security Standardization status Conclusion Summary and Resources 3

Motivation New mobile apps with high compute demands: augmented reality video processing computational photography 3D games 4

Motivation GPU offers improved FLOPS and FLOPS/watt as compared to CPU. FLOPS/ watt GPU CPU year 5

Motivation Web-centric platforms 6

OpenCL: Overview Features C-based cross-platform programming interface Kernels: subset of ISO C99 with language extensions Run-time or build-time compilation of kernels Well-defined numerical accuracy (IEEE 754 rounding with specified maximum error) Rich set of built-in functions: cross, dot, sin, cos, pow, log 7

OpenCL: Platform Model A host is connected to one or more OpenCL devices OpenCL device A collection of one or more compute units (~ cores) A compute unit is composed of one or more processing elements (~ threads) Processing elements execute code as SIMD or SPMD 8

OpenCL: Execution Model Kernel Basic unit of executable code, similar to a C function Data-parallel or task-parallel Program Collection of kernels and functions called by kernels Analogous to a dynamic library (run-time linking) Command Queue Applications queue kernels and data transfers Performed in-order or out-of-order Work-item An execution of a kernel by a processing element (~ thread) Work-group Different Types of Parallelism A collection of related work-items that execute on a single compute unit (~ core) 9

Work-group Example # work-items = # pixels # work-groups = # tiles work-group size = tilew * tileh 10

OpenCL: Memory Model Memory types Private memory (blue) Per work-item Local memory (green) At least 32KB split into blocks, each available to any work-item in a given work-group Global/Constant memory (orange) Not synchronized Host memory (red) On the CPU Memory management is explicit Application must move data from host global local and back 11

OpenCL: Kernels OpenCL kernel Defined on an N-dimensional computation domain Execute a kernel at each point of the computation domain Traditional Loop void vectormult( const float* a, const float* b, float* c, const unsigned int count) { for(int i=0; i<count; i++) c[i] = a[i] * b[i]; } Data Parallel OpenCL kernel void vectormult( global const float* a, global const float* b, global float* c) { int id = get_global_id(0); c[id] = a[id] * b[id]; } 12

OpenCL: Execution of OpenCL Program 1. Query host for OpenCL devices. 2. Create a context to associate OpenCL devices. 3. Create programs for execution on one or more associated devices. 4. From the programs, select kernels to execute. Programs Kernels Context Memory Objects Command Queue 5. Create memory objects accessible from the host and/or the device. 6. Copy memory data to the device as needed. 7. Provide kernels to the command queue for execution. Programs Compile Kernel0 Kernel1 Kernel2 Images Buffers Create data & arguments In order & out of order Send for execution 8. Copy results from the device to the host. 13

WebCL Demos

Programming WebCL Initialization Create kernel Run kernel WebCL image object creation From Uint8Array From <img>, <canvas>, or <video> From WebGL vertex buffer From WebGL texture WebGL vertex animation WebGL texture animation 15

WebCL: Initialization (draft) <script> var platforms = WebCL.getPlatforms(); var devices = platforms[0].getdevices(webcl.device_type_gpu); var context = WebCL.createContext({ WebCLDevice: devices[0] } ); var queue = context.createcommandqueue(); </script> 16

WebCL: Create Kernel (draft) <script id="squareprogram" type="x-kernel"> kernel square( global float* input, global float* output, const unsigned int count) { int i = get_global_id(0); if(i < count) output[i] = input[i] * input[i]; } </script> OpenCL C (based on C99) <script> var programsource = getprogramsource("squareprogram"); var program = context.createprogram(programsource); program.build(); var kernel = program.createkernel("square"); </script> // JavaScript function using DOM APIs 17

WebCL: Run Kernel (draft) <script> var inputbuf = context.createbuffer(webcl.mem_read_only, Float32Array.BYTES_PER_ELEMENT * count); var outputbuf = context.createbuffer(webcl.mem_write_only, Float32Array.BYTES_PER_ELEMENT * count); var data = new Float32Array(count); // populate data queue.enqueuewritebuffer(inputbuf, data, true); // last arg indicates API is blocking kernel.setkernelarg(0, inputbuf); kernel.setkernelarg(1, outputbuf); kernel.setkernelarg(2, count, WebCL.KERNEL_ARG_INT); var workgroupsize = kernel.getworkgroupinfo(devices[0], WebCL.KERNEL_WORK_GROUP_SIZE); queue.enqueuendrangekernel(kernel, [count], [workgroupsize]); queue.finish(); queue.enqueuereadbuffer(outputbuf, data, true); </script> // this API blocks // last arg indicates API is blocking 18

WebCL: Image Object Creation (draft) <script> </script> From Uint8Array() var bpp = 4; // bytes per pixel var pixels = new Uint8Array(width * height * bpp); var pitch = width * bpp; var climage = context.createimage(webcl.mem_read_only, {channelorder:webcl.rgba, channeltype:webcl.unorm_int8, size:[width, height], pitch:pitch } ); From <img> or <canvas> or <video> <script> var canvas = document.getelementbyid("acanvas"); var climage = context.createimage(webcl.mem_read_only, canvas); </script> // format, size from element 19

WebCL: Vertex Animation (draft) Web GL Web CL </script> Web CL Web GL Vertex Buffer Initialization <script> Vertex Buffer Update and Draw <script> function DrawLoop() { queue.enqueueacquireglobjects([clvertexbuffer]); } var points = new Float32Array(NPOINTS * 3); var glvertexbuffer = gl.createbuffer(); gl.bindbuffer(gl.array_buffer, glvertexbuffer); gl.bufferdata(gl.array_buffer, points, gl.dynamic_draw); var clvertexbuffer = context.createfromglbuffer(webcl.mem_read_write, glvertexbuffer); kernel.setkernelarg(0, NPOINTS, WebCL.KERNEL_ARG_INT); kernel.setkernelarg(1, clvertexbuffer); queue.enqueuendrangekernel(kernel, [NPOINTS], [workgroupsize]); queue.enqueuereleaseglobjects([clvertexbuffer]); gl.bindbuffer(gl.array_buffer, glvertexbuffer); gl.clear(gl.color_buffer_bit); gl.drawarrays(gl.points, 0, NPOINTS); gl.flush(); </script> 20

WebCL: Texture Animation (draft) Texture Initialization <script> var gltexture = gl.createtexture(); gl.bindtexture(gl.texture_2d, gltexture); gl.teximage2d(gl.texture_2d, 0, gl.rgba, gl.rgba, gl.unsigned_byte, image); var cltexture = context.createfromgltexture2d(webcl.mem_read_write, gl.texture_2d, gltexture); kernel.setkernelarg(0, NWIDTH, WebCL.KERNEL_ARG_INT); kernel.setkernelarg(1, NHEIGHT, WebCL.KERNEL_ARG_INT); kernel.setkernelarg(2, cltexture); </script> Texture Update and Draw <script> function DrawLoop() { queue.enqueueacquireglobjects([cltexture]); } </script> queue.enqueuendrangekernel(kernel, [NWIDTH, NHEIGHT], [tilewidth, tileheight]); queue.enqueuereleaseglobjects([cltexture]); gl.clear(gl.color_buffer_bit); gl.activetexture(gl.texture0); gl.bindtexture(gl.texture_2d, gltexture); gl.flush(); 21

WebCL: Prototype Status Prototype implementations Nokia s prototype: uses Firefox, open sourced May 2011 (LGPL) Samsung s prototype: uses WebKit, open sourced June 2011 (BSD) Motorola s prototype: uses node.js, open sourced April 2012 (BSD) Functionality WebCL contexts Platform queries (clgetplatformids, clgetdeviceids, etc.) Creation of OpenCL Context, Queue, and Buffer objects Kernel building Querying workgroup size (clgetkernelworkgroupinfo) Reading, writing, and copying OpenCL buffers Setting kernel arguments Scheduling a kernel for execution GL interoperability clcreatefromglbuffer clenqueueacquireglobjects clenqueuereleaseglobjects Synchronization Error handling 22

WebCL: Implementation Class Relationships inheritance aggregation association OpenCL UML direction of navigability Relationship Cardinality * many 1 one and only one 0:1 optionally one 23

Samsung s WebCL Prototype: Implementation on WebKit Modified file Source/WebCore/page/DOMWindow.idl List of added files (under Source/WebCore/webcl) WebCL.idl WebCLBuffer.idl WebCLCommandQueue.idl WebCLContext.idl WebCLDevice.idl WebCLDeviceList.idl WebCLEvent.idl WebCLEventList.idl WebCLImage.idl WebCLKernel.idl WebCLKernelList.idl WebCLMem.idl WebCLPlatform.idl WebCLPlatformList.idl WebCLProgram.idl WebCLSampler.idl 24

Samsung s WebCL Prototype: Building IDLStructure.pm CodeGeneratorJS.pm WebCL.idl IDLParser.pm CodeGenerator.pm WebCL.h WebCL.cpp generate-bindings.pl JSDomBinding.h JSWebCL.cpp JSWebCL.h JSWebCL.cpp gcc Compile and link with other object files and libraries (including OpenCL library) User created file Automatically generated file Toolchain 25

Samsung s WebCL Prototype: Status Integration with WebKit r78407 (released on February 2011) r92365 (released on August 2011) r101696 (released on December 2011) Available at http://code.google.com/p/webcl Sobel filter (left: JavaScript, right: WebCL) Performance evaluation Sobel filter N-body simulation Deformable body simulation N-body simulation (left: JavaScript, right: WebCL) Deformation (WebCL) 26

Samsung s WebCL Prototype: Performance Results Tested platform Hardware: MacBook Pro (Intel Core i7@2.66ghz CPU, 8GB memory, Nvidia GeForce GT 330M GPU) Software: Mac OSX 10.6.7, WebKit r78407 Video clips available at http://www.youtube.com/user/samsungsisa Demo Name JavaScript WebCL Speed-up Sobel filter (with 256x256 image) N-body simulation (1024 particles) Deformation (2880 vertices) ~200 ms ~15ms 13x 5-6 fps 75-115 fps 12-23x ~ 1 fps 87-116 fps 87-116x Performance comparison of JavaScript vs. WebCL 27

WebCL Standardization Tasneem Brutch Khronos WebCL Working Group Chair

WebCL Standardization Standardize portable and efficient access to heterogeneous multicore devices from web content. Define the ECMAScript API for interacting with OpenCL or equivalent computing capabilities. Designed for security and robustness. Use of OpenCL Embedded Profile. Interoperability between WebCL and WebGL. WebCL Portability: Proposal by Samsung to simplify initialization and to promote portability. 29

WebCL Robustness and Security Security threats Information leakage. Denial of service from long running kernels. WebCL security requirements Prevent cross-origin information leakage Prevent CORS (Cross Origin Resource Sharing) to restrict media use by non-originating web site. Prevent out of range memory access Prevent out of bounds memory writes. Out of bounds memory reads should not return any uninitialized value. Memory initialization All memory objects shall be initialized at allocation time to prevent information leakage. Prevent denial of service Ability to detect offending kernels. Ability to terminate contexts associated with malicious or erroneous kernels. Undefined OpenCL behavior to be defined by WebCL Prevent security vulnerabilities resulting from undefined behavior. Ensure strong conformance testing. OpenCL robustness and security Discussion of WebCL security and robustness proposals, currently in progress with OpenCL WG. 30

WebCL Standardization Status WebCL API definition in progress: First working draft released to public through public_webcl@khronos.org - April, 2012. Definition of conformance process and conformance test suite for testing WebCL implementation 2H, 2012. Release of WebCL specification draft 2H, 2012. Strong interest in WebCL by Khronos members. Non-Khronos members engaged through public mailing list. 31

Summary and Resources WebCL is a JavaScript binding to OpenCL, allowing web applications to leverage heterogeneous computing resources. WebCL enables significant acceleration of compute-intensive web applications. WebCL Resources Samsung s open source WebCL prototype implementation available at: http://code.google.com/p/webcl Khronos member WebCL distribution list: webcl@khronos.org Public WebCL distribution list: public_webcl@khronos.org WebCL public Wiki: https://www.khronos.org/webcl/wiki/main_page WebCL public working draft: https://cvs.khronos.org/svn/repos/registry/trunk/public/webcl/spec/latest/index.html 32

Thank You!

Appendix

WebCL: Differentiation and Impact Related technologies Google s Renderscript: providing 3D rendering and compute on Android Intel s River Trail: JavaScript extension for ParallelArray objects Comparison to WebCL WebCL is closer to the HW: Fine-grained control over memory use Fine-grained control over thread scheduling Interoperability with WebGL Specialized features: image samplers, synchronization events 35

OpenCL: Sample (1 of 3) Ref: http://developer.apple.com/library/mac/#samplecode/opencl_hello_world_example/listings/hello_c.html 1 2 3 5 kernel square( global float* input, global float* output, 4 const unsigned int count) { int i = get_global_id(0); if(i < count) output[i] = input[i] * input[i]; } Notes 1. A kernel is a special function written using a C-like language. 2. Objects in global memory can be accessed by all work items. 3. Objects in const memory are also accessible by all work items. 4. Value of input, output and count are set via clsetkernelarg. 5. Each work item has a unique id. 36

OpenCL: Sample (2 of 3) int main(int argc, char** argv) { int count = COUNT; float data[ COUNT ]; float results[ COUNT ]; 1 // number of data values // used by input buffer // used by output buffer clgetdeviceids (NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL); context = clcreatecontext (0, 1, &device_id, NULL, NULL, &err); Notes 1. Application may request use of CPU (multicore) or GPU. 2. Builds (compiles and links) the kernel source code associated with the program. 3. Creates buffers that are read and written by kernel code. queue = clcreatecommandqueue (context, device_id, 0, &err); program = clcreateprogramwithsource (context, 1, (const char **) & KernelSource, NULL, &err); 2 3 clbuildprogram (program, 0, NULL, NULL, NULL, NULL); kernel = clcreatekernel (program, "square", &err); input = clcreatebuffer (context, CL_MEM_READ_ONLY, sizeof(data), NULL, NULL); output = clcreatebuffer (context, CL_MEM_WRITE_ONLY, sizeof(results), NULL, NULL); 37

OpenCL: Sample (3 of 3) 1 clenqueuewritebuffer (queue, input, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL); 2 clsetkernelarg (kernel, 0, sizeof(cl_mem), &input); clsetkernelarg (kernel, 1, sizeof(cl_mem), &output); clsetkernelarg (kernel, 2, sizeof(unsigned int), &count); clgetkernelworkgroupinfo (kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL); 3 4 5 global = count; clenqueuendrangekernel (queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL); clfinish (queue); clenqueuereadbuffer (queue, output, CL_TRUE, 0, sizeof(results), results, 0, NULL, NULL ); } clreleasememobject (input); clreleasememobject (output); clreleaseprogram (program); clreleasekernel (kernel); clreleasecommandqueue (queue); clreleasecontext (context); Notes 1. Requests transfer of data from host memory to GPU memory. 2. Set arguments used by the kernel function. 3. Requests execution of work items. 4. Blocks until all commands in queue have completed. 5. Requests transfer of data from GPU memory to host memory. 38

WebCL: Overview JavaScript binding for OpenCL Provides high performance parallel processing on multicore CPU & GPGPU Portable and efficient access to heterogeneous multicore devices Provides a single coherent standard across desktop and mobile devices WebCL HW and SW requirements Need a modified browser with WebCL support Hardware, driver, and runtime support for OpenCL WebCL stays close to the OpenCL standard Preserves developer familiarity and facilitates adoption Allows developers to translate their OpenCL knowledge to the web environment Easier to keep OpenCL and WebCL in sync, as the two evolve Intended to be an interface above OpenCL Facilitates layering of higher level abstractions on top of WebCL API 39

Goal of WebCL Compute intensive web applications High performance Platform independent Standards-compliant Ease of application development 40

WebCL: Memory Object Creation (draft) <script> From WebGL vertex buffer (ref: OpenGL 9.8.2) var meshvertices = new Float32Array(NVERTICES * 3); var meshbuffer = gl.createbuffer(); gl.bindbuffer(gl.array_buffer, meshbuffer); gl.bufferdata(gl.array_buffer, meshvertices, gl.dynamic_draw); var clbuffer = context.createfromglbuffer(webcl.mem_read_write, meshbuffer); </script> From WebGL texture (ref: OpenGL 9.8.3) <script> var texture = gl.createtexture(); gl.bindtexture(gl.texture_2d, texture); gl.teximage2d(gl.texture_2d, 0, gl.rgba, gl.rgba, gl.unsigned_byte, image); var climage = context.createfromgltexture2d(webcl.mem_read_only, gl.texture_2d, 0, texture); </script> From WebGL render buffer (ref: OpenGL 9.8.4) <script> </script> var renderbuffer = gl.createrenderbuffer(); var climage = context.createfromglrenderbuffer(webcl.mem_read_only, renderbuffer); 41