SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March Copyright Khronos Group Page 1
|
|
|
- Virgil Hood
- 9 years ago
- Views:
Transcription
1 SYCL for OpenCL Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014 Copyright Khronos Group Page 1
2 Where is OpenCL today? OpenCL: supported by a very wide range of platforms Huge industry adoption Provides a C-based kernel language NEW: SPIR provides ability to build other languages on top Now, we need to provide languages and libraries Topic for today: C++ OpenCL C kernels Other language kernels Device X Device Y Device Z OpenCL and the OpenCL logo are trademarks of Apple Inc. Copyright Khronos Group Page 2
3 SYCL for OpenCL Pronounced sickle to go with spear (SPIR) Royalty-free, cross-platform C++ programming layer Builds on concepts portability & efficiency of OpenCL Ease of use and flexibility of C++ Single-source C++ development C++ template functions can contain host & device code Construct complex reusable algorithm templates that use OpenCL for acceleration Copyright Khronos Group Page 3
4 SYCL Roadmap Today Releasing a provisional specification to enable feedback Developers can provide input into standardization process Feedback via Khronos forums Next steps Full specification, based on feedback Conformance testsuite to ensure compatibility between implementations Release of implementations Copyright Khronos Group Page 4
5 What we want to achieve We want to enable a C++ on OpenCL ecosystem With C++ libraries supported on OpenCL C++ tools supported on OpenCL Aim to achieve long-term support for OpenCL features with C++ Good performance of C++ software on OpenCL Multiple sources of implementations Enable future innovation Copyright Khronos Group Page 5
6 Where can I get SYCL? Codeplay is working on an implementation It s an open, royalty-free standard Anyone can implement it Copyright Khronos Group Page 6
7 #include <CL/sycl.hpp> Simple example Does everything * expected of an OpenCL program: compilation, startup, shutdown, host fall-back, queue-based parallelism, efficient data movement. * (this sample doesn t catch exceptions) int main () int result; // this is where we will write our result // by sticking all the SYCL work in a } block, we ensure // all SYCL tasks must complete before exiting the block // create a queue to work on cl::sycl::queue myqueue; // wrap our result variable in a buffer cl::sycl::buffer<int> resultbuf (&result, 1); // create some commands for our queue cl::sycl::command_group (myqueue, [&] () // request access to our buffer auto writeresult = resultbuf.access<cl::sycl::access:write_only> (); // enqueue a single, simple task single_task(kernel_lambda<class simple_test>([=] () writeresult [0] = 1234; } }); // end of our commands for this queue } // end scope, so we wait for the queue to complete } printf ( Result = %d\n, result); Copyright Khronos Group Page 7
8 FEATURES OF SYCL Copyright Khronos Group Page 8
9 Default synchronization Uses C++ RAII Simple to use Clear, obvious rules Common in C++ int my_array [20]; cl::sycl::buffer my_buffer (my_array, 20); // creates the buffer // my_array is now taken over by the SYCL system and its contents undefined auto my_access = my_buffer.get_access<cl::sycl::access::read_write, cl::sycl::access::host_buffer> (); /* The host now has access to the buffer via my_access. This is a synchronizing operation - it blocks until access is ready. Access is released when my_access is destroyed */ } // access to my_buffer is now free to other threads/queues } /* my_buffer is destroyed. Waits for all threads/queues to complete work on my_buffer. Then writes any modified contents of my_buffer back to my_array, if necessary. */ Copyright Khronos Group Page 9
10 Task Graphs Input Data Input Data Input Access Output Data Output Data Data buffer<int> my_buffer(data, 10); SYCL separates data storage (buffers) from data access (accessors). This allows easy, safe, efficient scheduling Input / Output access defines scheduling order Kernel Scope Mode: Data or Task Parallel Device (or queue) command_group(myqueue, [&]() auto in_access = my_buffer.access<cl::sycl::access:read_only>(); auto out_access = my_buffer.access<cl::sycl::access:write_only>(); parallel_for_workgroup(nd_range(range(size), range(groupsize)), lambda<class hierarchical>([=](group_id group) parallel_for_workitem(group, [=](tile_id tile) out_access[tile.global()] = in_access[tile.global()] * 2; }); })); }); Output Output Output Data Data Access Copyright Khronos Group Page 10
11 Workgroup Hierarchical Data Parallelism Task (nd-range) Work Work item Workgroup item Work Work item Workgroup item Work Work item Work itemworkgroup Work Work item Work item Work Work item item Work Work item item item Work item Work Work item Work item Work Work item item item item Work Work item Work item Work item item buffer<int> my_buffer(data, 10); command_group(my_queue, [&]() auto in_access = my_buffer.access<cl::sycl::access:read_only>(); auto out_access = my_buffer.access<cl::sycl::access:write_only>(); parallel_for_workgroup(nd_range(range(size), range(groupsize)), lambda<class hierarchical>([=](group_id group) parallel_for_workitem(group, [=](tile_id tile) out_access[tile] = in_access[tile] * 2; }); })); }); Advantages: 1. Easy to understand the concept of work-groups 2. Performance-portable between CPU and GPU 3. No need to think about barriers (automatically deduced) 4. Easier to compose components & algorithms e.g. Kernel fusion Copyright Khronos Group Page 11
12 Single source Developers want to write templates, like: parallel_sort<myclass> (mydata); This requires a single template function that includes both host and device code The host code ensures the right data is in the right place Type-checking (and maybe conversions) required Copyright Khronos Group Page 12
13 Choose your own host compiler Why? Developers use a lot of CPU-compiler-specific features (OS integration, for example) - SYCL supports this The kind of developer that wants to accelerate code with OpenCL will often use CPU-specific optimizations, intrinsic functions etc. - SYCL supports this For example, a developer will think it reasonable to accelerate CPU code with OpenMP and GPU code with OpenCL, but want to share source between the 2. - SYCL supports this OpenCL C supports this approach, but without single source SYCL additionally allows single source Copyright Khronos Group Page 13
14 Choose your own host compiler #include <CL/sycl.hpp> int main() int result; // this is where we will write our result // by sticking all the SYCL work in a } block, we ensure // all SYCL tasks must complete before exiting the block // create a queue to work on cl::sycl::queue myqueue; // wrap our result variable in a buffer cl::sycl::buffer<int> resultbuf(&result, 1); // create some commands for our queue cl::sycl::command_group(myqueue, [&]() // request access to our buffer auto writeresult = resultbuf.access<cl::sycl::access:write_only>(); // enqueue a single, simple task cl::sycl::single_task(kernel_lambda<class simple_test>([=]() writeresult[0] = 1234; } }); // end of our commands for this queue CPU compiler (e.g. gcc, llvm, Intel C/C++, Visual C/C++) Same source code is passed to more than one compiler SYCL device compiler(s) CPU object file The SYCL runtime and header files connect the host code with OpenCL & kernels Device object file e.g. SPIR } // end scope, so we wait for the queue to complete printf( Result = %d\n, result); } SYCL can be implemented multiple ways, including as a single compiler, but the multi-compiler option is a possible implementation of the specification Copyright Khronos Group Page 14
15 Support multiple device compilers #include <CL/sycl.hpp> int main() int result; // this is where we will write our result // by sticking all the SYCL work in a } block, we ensure // all SYCL tasks must complete before exiting the block CPU compiler (e.g. gcc, llvm, Intel C/C++, Visual C/C++) CPU object file // create a queue to work on cl::sycl::queue myqueue; // wrap our result variable in a buffer cl::sycl::buffer<int> resultbuf(&result, 1); // create some commands for our queue cl::sycl::command_group(myqueue, [&]() // request access to our buffer auto writeresult = resultbuf.access<cl::sycl::access:write_only>(); // enqueue a single, simple task cl::sycl::single_task(kernel_lambda<class simple_test>([=]() writeresult[0] = 1234; } }); // end of our commands for this queue } // end scope, so we wait for the queue to complete printf( Result = %d\n, result); } SYCL device compiler SYCL device compiler The SYCL runtime chooses the b binary for the device at runtime SPIR32 SPIR64 Binary format? Multi-device compilers is not required, but it is a possibility Copyright Khronos Group Page 15
16 Can use a common library Can use #ifdefs to implement common libraries differently on different compilers/devices e.g. defining domain-specific maths functions that use OpenCL C features on device and CPU-specific intrinsics on host Or, define your own parallel_for templates that use (for example) OpenMP on host and OpenCL on device The C++ code that calls the library function is shared across platforms, but the library is compiled differently depending on host/device Copyright Khronos Group Page 16
17 Asynchronous operation A command_group is: an atomic operation that includes all memory object creation, copying, mapping, and execution of kernels enqueued, non-blocking scheduling dependencies tracked automatically thread-safe Host thread 1 Host thread 2 Copy A to device Only blocks on return of Copy B to data to host kernel 1 device command_group 1 kernel 2 Buffer B command_group 3 kernel 3 Copy A to host Buffer A command_group 2 Copy B to host Thread 2 waits for completion Thread 1 waits for completion Copyright Khronos Group Page 17
18 Low-latency error-handling We use exception-handling to catch errors We use the standard C++ RAII approach However, some developers require destructors to return immediately, even on error. But the error-causing code may be asynchronously running. So such a developer needs to leave code running and resources released later. We support with storage objects cl::sycl::buffer cl_mem kernel Destroying this buffer will block for kernel to complete cl::sycl::buffer cl_mem kernel Destroying this buffer will release storage_object, but not block. User s storage_object handles resources storage_object Copyright Khronos Group Page 18
19 Relationship to core OpenCL Built on top of OpenCL Runs kernels on OpenCL devices Can construct SYCL objects from OpenCL objects and OpenCL objects obtained from SYCL objects Copyright Khronos Group Page 19
20 OpenCL/OpenGL interop Built directly on top of OpenCL interop extension SYCL uses the same structures, macros, enums etc Lets developers share OpenGL images/textures etc with SYCL as well as OpenCL Only runs on OpenCL devices that support one of the CL/GL interop extensions Users can query a device for extension support first Copyright Khronos Group Page 20
21 The specification An overview of the specification itself Copyright Khronos Group Page 21
22 The specification itself Copyright Khronos Group Page 22
23 Structure (Similar to OpenCL structure) Section 1: Introduction Section 2: SYCL Architecture Very similar to OpenCL architecture Section 3: SYCL Programming Interface This is the C++ interface that works across host and device Section 4: Support of non-core OpenCL features Section 5: Device compiler This is the C++ compiler that compiles kernels Appendix A: Glossary Copyright Khronos Group Page 23
24 Architecture 1 SYCL has queues and command-groups Queues are identical to OpenCL C Command-groups enqueue multiple OpenCL commands to handle data movement, access, synchronization etc SYCL has buffers and images Built on top of OpenCL buffers and images, but abstracts away the different queues, devices, platforms maps, copying etc. Can create SYCL buffers/images from OpenCL buffers/images, or obtain OpenCL buffers/images from SYCL buffers/images (but need to specify context). Copyright Khronos Group Page 24
25 Architecture 2 In SYCL, access to data is defined by accessors Users constructs within command-groups: used to define types of access and create data movement and synchronization commands Error handling Synchronous errors handled by C++ exceptions Asynchronous errors handled via user-supplied error-handler based on C++14 proposal [n3711] Copyright Khronos Group Page 25
26 Kernels can be: Architecture: kernels Single task: A non-data-parallel task Basic data parallel: NDRange with no workgroup Work-group data parallel: NDRange with workgroup Hierarchical data parallel: compiler feature to express workgroups in more template-friendly form Restrictions on language features in kernels, no: function pointers, virtual methods, exceptions, RTTI Vector classes can work efficiently on host & device OpenCL C kernel library available in kernels Copyright Khronos Group Page 26
27 Architecture: advanced features Storage objects used to define complex ownership/synchronization All OpenCL C features supported in kernels (but maybe in a namespace) Including swizzles All host compiler features supported in host code Wrappers for: programs, kernels, samplers, events Allows linking OpenCL C kernels with SYCL kernels Copyright Khronos Group Page 27
28 SYCL Programming Interface Defined as a C++ templated class library Some classes host-only, some (e.g. accessors, vectors) host-and-device Only uses standard C++ features, so code written to this library sharable between host and device. Classes have methods to construct from OpenCL C objects and obtain OpenCL C objects wherever possible Events, buffers, images work across multiple devices and contexts Copyright Khronos Group Page 28
29 SYCL Extensions Defines how OpenCL device extensions (e.g. CL/GL interop) are available within SYCL Availability is based on device support Host can also support extensions Queries are provided to allow users to query for device and host support for extensions OpenCL extensions not in the SYCL spec are still usable within SYCL due to deep OpenCL integration and interop Copyright Khronos Group Page 29
30 SYCL Device Compiler Defines the language features available in kernels Supports restricted standard C++11 feature-set Restricted by capabilities of OpenCL 1.2 devices Would be enhanced for OpenCL 2.0 in the future Defines how OpenCL kernel language features are available within SYCL Users using OpenCL kernel language features need to ensure their code is compilable for host. May need #ifdef Copyright Khronos Group Page 30
31 What now? We are releasing this provisional specification to get feedback from developers So please give feedback! Khronos forums are best place Next steps Full specification, based on feedback Conformance test suite to ensure compatibility between implementations Release of implementations Copyright Khronos Group Page 31
Press Briefing. GDC, March 2014. Neil Trevett Vice President Mobile Ecosystem, NVIDIA President Khronos. Copyright Khronos Group 2014 - Page 1
Copyright Khronos Group 2014 - Page 1 Press Briefing GDC, March 2014 Neil Trevett Vice President Mobile Ecosystem, NVIDIA President Khronos Copyright Khronos Group 2014 - Page 2 Lots of Khronos News at
Course materials. In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful:
Course materials In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful: OpenCL C 1.2 Reference Card OpenCL C++ 1.2 Reference Card These cards will
Cross-Platform GP with Organic Vectory BV Project Services Consultancy Services Expertise Markets 3D Visualization Architecture/Design Computing Embedded Software GIS Finance George van Venrooij Organic
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X
NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X DU-05348-001_v6.5 August 2014 Installation and Verification on Mac OS X TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2. About
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
OpenCL Static C++ Kernel Language Extension
OpenCL Static C++ Kernel Language Extension Document Revision: 04 Advanced Micro Devices Authors: Ofer Rosenberg, Benedict R. Gaster, Bixia Zheng, Irina Lipov December 15, 2011 Contents 1 Overview... 3
Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture
Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts
NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X
NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X DU-05348-001_v5.5 July 2013 Installation and Verification on Mac OS X TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2. About
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
Intel Threading Building Blocks (Intel TBB) 2.2. In-Depth
Intel Threading Building Blocks (Intel TBB) 2.2 In-Depth Contents Intel Threading Building Blocks (Intel TBB) 2.2........... 3 Features................................................ 3 New in This Release...5
AMD APP SDK v2.8 FAQ. 1 General Questions
AMD APP SDK v2.8 FAQ 1 General Questions 1. Do I need to use additional software with the SDK? To run an OpenCL application, you must have an OpenCL runtime on your system. If your system includes a recent
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen
INTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
Parallel Computing. Shared memory parallel programming with OpenMP
Parallel Computing Shared memory parallel programming with OpenMP Thorsten Grahs, 27.04.2015 Table of contents Introduction Directives Scope of data Synchronization 27.04.2015 Thorsten Grahs Parallel Computing
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library
OpenCL for programming shared memory multicore CPUs
Akhtar Ali, Usman Dastgeer and Christoph Kessler. OpenCL on shared memory multicore CPUs. Proc. MULTIPROG-212 Workshop at HiPEAC-212, Paris, Jan. 212. OpenCL for programming shared memory multicore CPUs
Ensure that the AMD APP SDK Samples package has been installed before proceeding.
AMD APP SDK v2.6 Getting Started 1 How to Build a Sample 1.1 On Windows Ensure that the AMD APP SDK Samples package has been installed before proceeding. Building With Visual Studio Solution Files The
Xeon+FPGA Platform for the Data Center
Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH
Equalizer Parallel OpenGL Application Framework Stefan Eilemann, Eyescale Software GmbH Outline Overview High-Performance Visualization Equalizer Competitive Environment Equalizer Features Scalability
C++ INTERVIEW QUESTIONS
C++ INTERVIEW QUESTIONS http://www.tutorialspoint.com/cplusplus/cpp_interview_questions.htm Copyright tutorialspoint.com Dear readers, these C++ Interview Questions have been designed specially to get
Hetero Streams Library 1.0
Release Notes for release of Copyright 2013-2016 Intel Corporation All Rights Reserved US Revision: 1.0 World Wide Web: http://www.intel.com Legal Disclaimer Legal Disclaimer You may not use or facilitate
AMD CodeXL 1.7 GA Release Notes
AMD CodeXL 1.7 GA Release Notes Thank you for using CodeXL. We appreciate any feedback you have! Please use the CodeXL Forum to provide your feedback. You can also check out the Getting Started guide on
A Comparative Study on Vega-HTTP & Popular Open-source Web-servers
A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...
The ROI from Optimizing Software Performance with Intel Parallel Studio XE
The ROI from Optimizing Software Performance with Intel Parallel Studio XE Intel Parallel Studio XE delivers ROI solutions to development organizations. This comprehensive tool offering for the entire
Virtuozzo Virtualization SDK
Virtuozzo Virtualization SDK Programmer's Guide February 18, 2016 Copyright 1999-2016 Parallels IP Holdings GmbH and its affiliates. All rights reserved. Parallels IP Holdings GmbH Vordergasse 59 8200
Introduction to OpenCL Programming. Training Guide
Introduction to OpenCL Programming Training Guide Publication #: 137-41768-10 Rev: A Issue Date: May, 2010 Introduction to OpenCL Programming PID: 137-41768-10 Rev: A May, 2010 2010 Advanced Micro Devices
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Parallel Computing. Parallel shared memory computing with OpenMP
Parallel Computing Parallel shared memory computing with OpenMP Thorsten Grahs, 14.07.2014 Table of contents Introduction Directives Scope of data Synchronization OpenMP vs. MPI OpenMP & MPI 14.07.2014
Multi-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
GPI Global Address Space Programming Interface
GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1 GPI Global address space programming
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
KITES TECHNOLOGY COURSE MODULE (C, C++, DS)
KITES TECHNOLOGY 360 Degree Solution www.kitestechnology.com/academy.php [email protected] [email protected] Contact: - 8961334776 9433759247 9830639522.NET JAVA WEB DESIGN PHP SQL, PL/SQL
Getting Started with CodeXL
AMD Developer Tools Team Advanced Micro Devices, Inc. Table of Contents Introduction... 2 Install CodeXL... 2 Validate CodeXL installation... 3 CodeXL help... 5 Run the Teapot Sample project... 5 Basic
GPU Profiling with AMD CodeXL
GPU Profiling with AMD CodeXL Software Profiling Course Hannes Würfel OUTLINE 1. Motivation 2. GPU Recap 3. OpenCL 4. CodeXL Overview 5. CodeXL Internals 6. CodeXL Profiling 7. CodeXL Debugging 8. Sources
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
Performance Analysis for GPU Accelerated Applications
Center for Information Services and High Performance Computing (ZIH) Performance Analysis for GPU Accelerated Applications Working Together for more Insight Willersbau, Room A218 Tel. +49 351-463 - 39871
The Plan Today... System Calls and API's Basics of OS design Virtual Machines
System Calls + The Plan Today... System Calls and API's Basics of OS design Virtual Machines System Calls System programs interact with the OS (and ultimately hardware) through system calls. Called when
ELEC 377. Operating Systems. Week 1 Class 3
Operating Systems Week 1 Class 3 Last Class! Computer System Structure, Controllers! Interrupts & Traps! I/O structure and device queues.! Storage Structure & Caching! Hardware Protection! Dual Mode Operation
Example of Standard API
16 Example of Standard API System Call Implementation Typically, a number associated with each system call System call interface maintains a table indexed according to these numbers The system call interface
COSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
Debugging with TotalView
Tim Cramer 17.03.2015 IT Center der RWTH Aachen University Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again and...... enrich
An Easier Way for Cross-Platform Data Acquisition Application Development
An Easier Way for Cross-Platform Data Acquisition Application Development For industrial automation and measurement system developers, software technology continues making rapid progress. Software engineers
Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture
Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts
Data-Flow Awareness in Parallel Data Processing
Data-Flow Awareness in Parallel Data Processing D. Bednárek, J. Dokulil *, J. Yaghob, F. Zavoral Charles University Prague, Czech Republic * University of Vienna, Austria 6 th International Symposium on
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
First-class User Level Threads
First-class User Level Threads based on paper: First-Class User Level Threads by Marsh, Scott, LeBlanc, and Markatos research paper, not merely an implementation report User-level Threads Threads managed
Parallel Web Programming
Parallel Web Programming Tobias Groß, Björn Meier Hardware/Software Co-Design, University of Erlangen-Nuremberg May 23, 2013 Outline WebGL OpenGL Rendering Pipeline Shader WebCL Motivation Development
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
The programming language C. sws1 1
The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan
Intel Media SDK Library Distribution and Dispatching Process
Intel Media SDK Library Distribution and Dispatching Process Overview Dispatching Procedure Software Libraries Platform-Specific Libraries Legal Information Overview This document describes the Intel Media
Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011
Mitglied der Helmholtz-Gemeinschaft OpenCL Basics Parallel Computing on GPU and CPU Willi Homberg Agenda Introduction OpenCL architecture Platform model Execution model Memory model Programming model Platform
Have both hardware and software. Want to hide the details from the programmer (user).
Input/Output Devices Chapter 5 of Tanenbaum. Have both hardware and software. Want to hide the details from the programmer (user). Ideally have the same interface to all devices (device independence).
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware 25 August 2014 Copyright 2001 2014 by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All
CS5460: Operating Systems. Lecture: Virtualization 2. Anton Burtsev March, 2013
CS5460: Operating Systems Lecture: Virtualization 2 Anton Burtsev March, 2013 Paravirtualization: Xen Full virtualization Complete illusion of physical hardware Trap _all_ sensitive instructions Virtualized
Data Centric Systems (DCS)
Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems
WebCL for Hardware-Accelerated Web Applications. Won Jeon, Tasneem Brutch, and Simon Gibbs
WebCL for Hardware-Accelerated Web Applications Won Jeon, Tasneem Brutch, and Simon Gibbs What is WebCL? WebCL is a JavaScript binding to OpenCL. WebCL enables significant acceleration of compute-intensive
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
OpenGL ES Safety-Critical Profile Philosophy
OpenGL ES Safety-Critical Profile Philosophy Claude Knaus July 5th, 2004 OpenGL is a registered trademark, and OpenGL ES is a trademark, of Silicon Graphics, Inc. 1 1 Overview The Safety-Critical profile
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
Contents -------- Overview and Product Contents -----------------------------
------------------------------------------------------------------------ Intel(R) Threading Building Blocks - Release Notes Version 2.0 ------------------------------------------------------------------------
Kodo - Cross-platform Network Coding Software Library. Morten V. Pedersen - Aalborg University / Steinwurf ApS [email protected]
Kodo - Cross-platform Network Coding Software Library Morten V. Pedersen - Aalborg University / Steinwurf ApS [email protected] Background Academia Network coding key enabler for efficient user cooperation
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
OpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. [email protected] Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
AMD GPU Architecture. OpenCL Tutorial, PPAM 2009. Dominik Behr September 13th, 2009
AMD GPU Architecture OpenCL Tutorial, PPAM 2009 Dominik Behr September 13th, 2009 Overview AMD GPU architecture How OpenCL maps on GPU and CPU How to optimize for AMD GPUs and CPUs in OpenCL 2 AMD GPU
Freescale Semiconductor, I
nc. Application Note 6/2002 8-Bit Software Development Kit By Jiri Ryba Introduction 8-Bit SDK Overview This application note describes the features and advantages of the 8-bit SDK (software development
A Case Study - Scaling Legacy Code on Next Generation Platforms
Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 00 (2015) 000 000 www.elsevier.com/locate/procedia 24th International Meshing Roundtable (IMR24) A Case Study - Scaling Legacy
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS
NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS DU-05349-001_v6.0 February 2014 Installation and Verification on TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2.
How To Port A Program To Dynamic C (C) (C-Based) (Program) (For A Non Portable Program) (Un Portable) (Permanent) (Non Portable) C-Based (Programs) (Powerpoint)
TN203 Porting a Program to Dynamic C Introduction Dynamic C has a number of improvements and differences compared to many other C compiler systems. This application note gives instructions and suggestions
Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ)
Advanced MPI Hybrid programming, profiling and debugging of MPI applications Hristo Iliev RZ Rechen- und Kommunikationszentrum (RZ) Agenda Halos (ghost cells) Hybrid programming Profiling of MPI applications
Enhanced Project Management for Embedded C/C++ Programming using Software Components
Enhanced Project Management for Embedded C/C++ Programming using Software Components Evgueni Driouk Principal Software Engineer MCU Development Tools 1 Outline Introduction Challenges of embedded software
RISC-V Software Ecosystem. Andrew Waterman UC Berkeley [email protected]!
RISC-V Software Ecosystem Andrew Waterman UC Berkeley [email protected]! 2 Tethered vs. Standalone Systems Tethered systems are those that cannot stand alone - They depend on a host system to
White Paper. Intel Xeon Phi Coprocessor DEVELOPER S QUICK START GUIDE. Version 1.5
White Paper Intel Xeon Phi Coprocessor DEVELOPER S QUICK START GUIDE Version 1.5 Contents Introduction... 4 Goals... 4 This document does:... 4 This document does not:... 4 Terminology... 4 System Configuration...
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
An Implementation Of Multiprocessor Linux
An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than
RTOS Debugger for ecos
RTOS Debugger for ecos TRACE32 Online Help TRACE32 Directory TRACE32 Index TRACE32 Documents... RTOS Debugger... RTOS Debugger for ecos... 1 Overview... 2 Brief Overview of Documents for New Users... 3
Using the CoreSight ITM for debug and testing in RTX applications
Using the CoreSight ITM for debug and testing in RTX applications Outline This document outlines a basic scheme for detecting runtime errors during development of an RTX application and an approach to
Towards OpenMP Support in LLVM
Towards OpenMP Support in LLVM Alexey Bataev, Andrey Bokhanko, James Cownie Intel 1 Agenda What is the OpenMP * language? Who Can Benefit from the OpenMP language? OpenMP Language Support Early / Late
CS222: Systems Programming
CS222: Systems Programming The Basics January 24, 2008 A Designated Center of Academic Excellence in Information Assurance Education by the National Security Agency Agenda Operating System Essentials Windows
Running applications on the Cray XC30 4/12/2015
Running applications on the Cray XC30 4/12/2015 1 Running on compute nodes By default, users do not log in and run applications on the compute nodes directly. Instead they launch jobs on compute nodes
CORBA Programming with TAOX11. The C++11 CORBA Implementation
CORBA Programming with TAOX11 The C++11 CORBA Implementation TAOX11: the CORBA Implementation by Remedy IT TAOX11 simplifies development of CORBA based applications IDL to C++11 language mapping is easy
THE VELOX STACK Patrick Marlier (UniNE)
THE VELOX STACK Patrick Marlier (UniNE) 06.09.20101 THE VELOX STACK / OUTLINE 2 APPLICATIONS Real applications QuakeTM Game server C code with OpenMP Globulation 2 Real-Time Strategy Game C++ code using
Update: About Apple RAID Version 1.5 About this update
apple Update: About Apple RAID Version 1.5 About this update This update describes new features and capabilities of Apple RAID Software version 1.5, which includes improvements that increase the performance
Hands-on CUDA exercises
Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,
Development With ARM DS-5. Mervyn Liu FAE Aug. 2015
Development With ARM DS-5 Mervyn Liu FAE Aug. 2015 1 Support for all Stages of Product Development Single IDE, compiler, debug, trace and performance analysis for all stages in the product development
Lecture 3. Optimising OpenCL performance
Lecture 3 Optimising OpenCL performance Based on material by Benedict Gaster and Lee Howes (AMD), Tim Mattson (Intel) and several others. - Page 1 Agenda Heterogeneous computing and the origins of OpenCL
Glossary of Object Oriented Terms
Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
MPI / ClusterTools Update and Plans
HPC Technical Training Seminar July 7, 2008 October 26, 2007 2 nd HLRS Parallel Tools Workshop Sun HPC ClusterTools 7+: A Binary Distribution of Open MPI MPI / ClusterTools Update and Plans Len Wisniewski
