COSCO 2015 Heterogeneous Computing Programming


Michael Meyer, Shunsuke Ishikuro
Supporters: Kazuaki Sasamoto, Ryunosuke Murakami
July 24th, 2015

Heterogeneous Computing Programming
  1. Overview
  2. Methodology
  3. Example and Evaluation
  4. Conclusion

1. Overview

Heterogeneous Architecture: What is it?
  Heterogeneous: diverse in character or content.
  A heterogeneous architecture combines different types of components in one package. Types of what?
  Different types of cores
  Different types of processing (GPU and CPU)
  Different functions (CPU and memory)
  Different communication media (optical, electrical, RF, or sensors)

Heterogeneous Architecture: Different Types of Cores
  Cell uses its Synergistic Processing Elements (SPEs) for floating-point calculations.
  The Power Processing Element (PPE) is used for all other major functions.

Heterogeneous Architecture: Different Types of Processing (GPU and CPU)

Heterogeneous Architecture: Different Functions (CPU and Memory)

Heterogeneous Architecture: Communication Media (Optical, Electrical, RF, or Sensors)


2. Methodology
  Parallel Computing
    Hardware
    Software
  OpenCL's Approach
    Basic Idea
    Programming Model
    Development Environment

Parallel Computing

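Why Parallel?
  A problem can be divided into small tasks. In the serial case the tasks run one after another; in the parallel case they run at the same time.
  (Figure: serial vs. parallel execution. Source: https://computing.llnl.gov/tutorials/parallel_comp/)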

Hardware: Flynn's Taxonomy
  Source: https://en.wikipedia.org/wiki/flynn%27s_taxonomy

Hardware: Memory
  Shared Memory
  Distributed Memory
  Source: http://daugerresearch.com/vault/parallelparadigm.shtml

Hardware: Accelerator
  A host is combined with one or more accelerators: GPU, DSP, or FPGA.

Software
  (Figure: the same job, a repeated sequence of Task A, Task B, Task C, Task D, and Sum, scheduled in two ways: Sequential and Multi Core.)

Software: 1. Analysis
  (Figure: the sequence Task A, Task B, Task C, Task D, Sum, repeated four times.)

Software: 1. Analysis
  Amdahl's Law
  Source: https://nf.nci.org.au/training/mpiappopt/slides/slides.011.html

Software: 2. Algorithm
  Data Parallelism: every core runs the whole sequence (Task A, Task B, Task C, Task D) on its own portion of the data, and Sum combines the results.
  Task Parallelism: every core runs a single task (one core only Task A, another only Task B, and so on) across the data.

Software: 3. Programming
  OS API: pthread
  Framework: OpenMP, CUDA, OpenCL

OpenCL's Approach

Basic Idea
  One host is connected to one or more OpenCL devices.
  Common API
  Portable
  Optimization

OpenCL C and the OpenCL Runtime
  Host: programmed in C/C++ using the OpenCL Runtime Library.
  OpenCL Device: programmed in the OpenCL C Language.

OpenCL Device
  The host is connected to the OpenCL devices. Each OpenCL device contains compute units, and each compute unit contains processing elements.

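Programming Model
  The host sends work to an OpenCL device through a command queue.
  The work is divided into work-groups (#0, #1, #2, ...), and each work-group consists of work-items (#0, #1, #2, #3, #4, ...).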

Memory Model
  OpenCL Device: global memory and constant memory, visible to all compute units.
  Compute Unit: local memory, shared by the processing elements of that compute unit.
  Processing Element: private memory, one per processing element.

Comparison
  OpenMP vs OpenCL
    OpenMP: multiprocessors
    OpenCL: multiprocessors and accelerators
  CUDA vs OpenCL
    CUDA: NVIDIA GPUs only
    OpenCL: supports AMD, Intel, and NVIDIA GPUs

Comparison
  HSA vs OpenCL
    HSA: framework for hardware vendors
    OpenCL: better development environment and materials

Development Environment
  Intel: Intel multicore processor + Intel OpenCL SDK
  NVIDIA: NVIDIA GPU + CUDA
  Apple: Intel Mac + Xcode
  Altera: Altera PCIe FPGA + Altera SDK for OpenCL

3. Example and Evaluation

Hello World
  See the program list handout.
  hello.cl: kernel code, which runs on the OpenCL device
  hello.cpp: host program, which runs on the host machine

hello.cl (runs on the OpenCL device)

    #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

    /* Write the greeting, one character at a time, into a buffer in global memory. */
    __kernel void hello(__global char* string)
    {
        string[0] = 'H';
        string[1] = 'e';
        string[2] = 'l';
        string[3] = 'l';
        string[4] = 'o';
        string[5] = ',';
        string[6] = ' ';
        string[7] = 'W';
        string[8] = 'o';
        string[9] = 'r';
        string[10] = 'l';
        string[11] = 'd';
        string[12] = '!';
        string[13] = '\0';  /* terminate the string so the host can print it */
    }

hello.cpp (excerpt: loading the kernel source)

    FILE *fp;
    char filename[] = "./hello.cl";
    char *source_str;
    size_t source_size;

    /* Load the source code containing the kernel */
    fp = fopen(filename, "r");

hello.cpp (excerpt: platform, device, and context)

    /* Get platform and device information */
    ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, &ret_num_devices);

    /* Create the OpenCL context */
    context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

  There is no platform dependency in this code.

hello.cpp (excerpt: running the kernel and reading the result)

    /* Execute the OpenCL kernel */
    ret = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

    /* Get the result data from the memory buffer */
    ret = clEnqueueReadBuffer(command_queue, memobj, CL_TRUE, 0, MEM_SIZE * sizeof(char), string, 0, NULL, NULL);

    /* Display the result */
    puts(string);

Hello World: Build and Run
  Build (NVIDIA)
    $ g++ -I/usr/local/cuda/include -o hello hello.cpp -lOpenCL
  Run
    $ ./hello
    Hello, World!

Image Processing: Edge Filter

FFT: Fourier Transformation
  $W = \exp(-2\pi i / n)$

FFT: Inverse Fourier Transformation
  (Figure: inverse transform.)

FFT: Processing Flow
  Main flow: start -> Generate Twiddle Factors -> FFT Core -> Transpose Matrix -> FFT Core -> Filter -> FFT Core (Inverse) -> Transpose Matrix -> FFT Core (Inverse) -> end
  FFT Core: start -> loop while count < log2 N: Butterfly Calc -> Normalize (if inverse) -> end

FFT: Source Code
  See the program list handout.
  fft.cl: kernel code, which runs on the OpenCL device
  fft.cpp: host program, which runs on the host machine

FFT: Evaluation
  Tesla C2050 (NVIDIA): number of work-items and execution time (ms)

  Work-items          1     16     32     64    128    256    512
  membuf_write     0.45   0.36   0.45   0.45   0.45   0.36   0.45
  spinfactor       0.01   0.01   0.01   0.01   0.01   0.01   0.01
  bitreverse       7.51   0.88   0.49   0.36   0.32   0.31   0.34
  butterfly       41.83   3.41   2.16   1.61   1.58   1.58   1.59
  normalize        3.96   0.28   0.16   0.11   0.08   0.08   0.08
  highpassfilter   1.90   0.13   0.08   0.05   0.04   0.04   0.04
  membuf_read      0.52   0.35   0.52   0.52   0.52   0.35   0.52

4. Conclusion
  Heterogeneous computing is one form of parallel computing.
  Parallel computing requires knowledge of both hardware and software characteristics.
  The OpenCL framework supports heterogeneous computing with a portable API.

References
  Fixstars Corporation, 改訂新版 OpenCL 入門 [Introduction to OpenCL, Revised New Edition], Impress Japan, 2012.1
  Khronos OpenCL Working Group, OpenCL 詳説 [OpenCL in Detail], Cutt System, 2011.8
  Blaise Barney, Lawrence Livermore National Laboratory, Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/, 2015.7
  Wikipedia, Flynn's taxonomy, https://en.wikipedia.org/wiki/flynn%27s_taxonomy, 2015.7
  Dauger Research, Parallel Programming Paradigms, http://daugerresearch.com/vault/parallelparadigm.shtml, 2015.7
  NCI National Facility, MPI Applications Course Overview, https://nf.nci.org.au/training/mpiappopt/slides/index.html, 2015.7