COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015
Heterogeneous Computing Programming 1. Overview 2. Methodology 3. Example and Evaluation 4. Conclusion
Heterogeneous Architecture- What is it? Heterogeneous: diverse in character or content; different types in one package. Types of heterogeneity: Different Types of Cores, Different Types of Processing (GPU and CPU), Different Functions (CPU and Memory), Different Communication Mediums (Optical, Electric, RF, or sensors)
Heterogeneous Architecture- Different Types of Cores CELL uses SPEs (Synergistic Processing Elements) for floating-point calculations; the Power Processing Element is used for all other major functions
Heterogeneous Architecture- Different Types of Processing(GPU and CPU)
Heterogeneous Architecture- Different Functions(CPU and Memory)
Heterogeneous Architecture- Communication Mediums(Optical and Electric and RF or sensors)
Heterogeneous Computing Programming 1. Overview 2. Methodology 3. Example and Evaluation 4. Conclusion
2. Methodology Parallel Computing Hardware Software OpenCL's Approach Basic Idea Programming Model Development Environment
Parallel Computing
Why Parallel? Serial vs. Parallel: a problem can be divided into small tasks that run concurrently. https://computing.llnl.gov/tutorials/parallel_comp/
Hardware: Flynn's Taxonomy https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Hardware: Memory Shared Memory Distributed Memory http://daugerresearch.com/vault/parallelparadigm.shtml
Hardware: Accelerator Host GPU DSP FPGA
Software Sequential: Task A, Task B, Task C, Task D, then Sum run one after another; Multi Core: the four tasks run concurrently, then Sum combines their results
Software: 1. Analysis Identify which tasks (Task A through Task D) are independent of each other and which steps (Sum) must wait for their results
Software: 1. Analysis Amdahl's Law https://nf.nci.org.au/training/mpiappopt/slides/slides.011.html
Software: 2. Algorithm Data Parallelism: the same task (e.g. Task A) is applied to different pieces of the data in parallel. Task Parallelism: different tasks (Task A, Task B, Task C, Task D) run in parallel, then Sum combines the results.
Software: 3. Programming OS API pthread Framework OpenMP, CUDA, OpenCL
OpenCL's Approach
Basic Idea
Basic Idea OpenCL Device Host OpenCL Device OpenCL Device
Basic Idea OpenCL Device Host OpenCL Device OpenCL Device Common API Portable Optimization
OpenCL C, OpenCL runtime OpenCL C Language C/C++ OpenCL Runtime Library Host OpenCL Device OpenCL C Language OpenCL Device OpenCL C Language OpenCL Device
OpenCL Device OpenCL Device Compute Unit Processing Element Host
Programming Model OpenCL Device Command Queue Work Group #0 #1 #2 Work item #0 #1 #2 #3 #4
Memory Model OpenCL Device Global Memory Compute Unit Processing Element Local Memory Constant Memory Private Private Private Private Private
Comparison OpenMP vs OpenCL OpenMP: multiprocessors only OpenCL: multiprocessors and accelerators CUDA vs OpenCL CUDA: NVIDIA GPUs only OpenCL: supports AMD, Intel, and NVIDIA GPUs
Comparison HSA vs OpenCL HSA: a framework for hardware vendors OpenCL: better development environment and materials
Development Environment Intel Intel Multicore Processor + Intel OpenCL SDK NVIDIA NVIDIA GPU + CUDA Apple Intel Mac + Xcode Altera Altera PCIe FPGA + Altera SDK For OpenCL
Heterogeneous Computing Programming 1. Overview 2. Methodology 3. Example and Evaluation 4. Conclusion
Hello World See the program list handout hello.cl: Kernel code which works on OpenCL Device hello.cpp: Host program which works on a host machine
hello.cl Run on OpenCL Device

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void hello(__global char* string)
{
    string[0]  = 'H';
    string[1]  = 'e';
    string[2]  = 'l';
    string[3]  = 'l';
    string[4]  = 'o';
    string[5]  = ',';
    string[6]  = ' ';
    string[7]  = 'W';
    string[8]  = 'o';
    string[9]  = 'r';
    string[10] = 'l';
    string[11] = 'd';
    string[12] = '!';
    string[13] = '\0';
}
hello.cpp

FILE *fp;
char filename[] = "./hello.cl";
char *source_str;
size_t source_size;

/* Load the source code containing the kernel */
fp = fopen(filename, "r");

hello.cpp

/* Get platform and device information */
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);

/* Create an OpenCL context */
context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

There is no platform dependency

hello.cpp

/* Execute the OpenCL kernel */
ret = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

/* Get the result from the memory buffer */
ret = clEnqueueReadBuffer(command_queue, memobj, CL_TRUE, 0,
                          MEM_SIZE * sizeof(char), string, 0, NULL, NULL);

/* Display the result */
puts(string);
Hello World: Build and Run Build (NVIDIA) $ g++ -I/usr/local/cuda/include -o hello hello.cpp -lOpenCL Run $ ./hello Hello, World!
Image Processing Edge Filter
FFT: Fourier Transformation Twiddle factor: W = exp(-2πi / n)
FFT: Inverse Fourier Transformation
FFT filter flow: start; Generate Twiddle Factors; FFT Core; Transpose Matrix; FFT Core; Filter; FFT Core (Inverse); Transpose Matrix; FFT Core (Inverse); Normalize if Inverse; end. FFT Core: loop while count < log2(N), performing a Butterfly Calc each pass.
FFT: Source Code See the program list handout fft.cl: Kernel code which works on OpenCL Device fft.cpp: Host program which works on a host machine
FFT: Evaluation Tesla C2050 (NVIDIA) Number of work-items vs. execution time (ms):

kernel            1      16     32     64     128    256    512
membuf_write      0.45   0.36   0.45   0.45   0.45   0.36   0.45
spinfactor        0.01   0.01   0.01   0.01   0.01   0.01   0.01
bitreverse        7.51   0.88   0.49   0.36   0.32   0.31   0.34
butterfly         41.83  3.41   2.16   1.61   1.58   1.58   1.59
normalize         3.96   0.28   0.16   0.11   0.08   0.08   0.08
highpassfilter    1.90   0.13   0.08   0.05   0.04   0.04   0.04
membuf_read       0.52   0.35   0.52   0.52   0.52   0.35   0.52
Conclusion Heterogeneous computing is one form of parallel computing. Parallel computing requires knowledge of both hardware and software characteristics. The OpenCL framework supports heterogeneous computing with a portable API.
References Fixstars Corporation, 改訂新版 OpenCL入門 (Introduction to OpenCL, Revised Edition), Impress Japan, 2012.1 Khronos OpenCL Working Group, OpenCL詳説 (OpenCL in Detail), Cutt System, 2011.8 Blaise Barney, Lawrence Livermore National Laboratory, Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/, 2015.7 Wikipedia, Flynn's taxonomy, https://en.wikipedia.org/wiki/Flynn%27s_taxonomy, 2015.7 Dauger Research, Parallel Programming Paradigms, http://daugerresearch.com/vault/parallelparadigm.shtml, 2015.7 NCI NATIONAL FACILITY, MPI Applications Course Overview, https://nf.nci.org.au/training/mpiappopt/slides/index.html, 2015.7