Cross-Platform GPGPU with OpenCL
George van Venrooij, Organic Vectory B.V.
george.van.venrooij@organicvectory.com

Organic Vectory B.V.
- Services: project services, consultancy services, expertise
- Markets: 3D visualization, architecture/design, computing, embedded software, GIS, finance

What is OpenCL?
As defined by the Khronos Group (www.khronos.org): "OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software."
The Khronos Group also controls many other standards, such as OpenGL (ES), OpenVG, COLLADA, WebGL and many more.

OpenCL vs. CUDA
[Figure: device memory model (source: NVidia tutorial slides)]
Terminology

  OpenCL                    CUDA
  Work-Item                 Thread
  Work-Group                Thread Block
  Global Memory             Global Memory
  Constant Memory           Constant Memory
  Local Memory              Shared Memory
  Private Memory            Local Memory

Qualifiers

  OpenCL                    CUDA
  __kernel                  __global__
  (no qualifier needed)     __device__ (function)
  __constant                __constant__
  __global                  __device__ (variable)
  __local                   __shared__

Indexing

  OpenCL                    CUDA
  get_num_groups()          gridDim
  get_local_size()          blockDim
  get_group_id()            blockIdx
  get_local_id()            threadIdx
  get_global_id()           (calculate manually)
  get_global_size()         (calculate manually)

API Objects

  OpenCL                    CUDA
  cl_device_id              CUdevice
  cl_context                CUcontext
  cl_program                CUmodule
  cl_kernel                 CUfunction
  cl_mem                    CUdeviceptr
  cl_command_queue          CUstream

Kernel Language

  OpenCL: subset of C99 with data-parallel extensions
  CUDA: C for CUDA, a subset of C with data-parallel extensions
  CUDA: C++ features (function templates) enable higher productivity through meta-programming techniques
  OpenCL: requires run-time compilation by the driver
  CUDA: compilation through a separate compiler, NVCC
  Both: no function pointers or recursion

Device Thread Synchronization

  OpenCL                    CUDA
  barrier()                 __syncthreads()
  (no equivalent)           __threadfence()
  mem_fence()               __threadfence_block()
  read_mem_fence()          (no equivalent)
  write_mem_fence()         (no equivalent)
Performance Comparison (1)
Profiler tool from accelereyes.com: CUDA vs. OpenCL, tested on a Tesla C20.

Performance Comparison (2)
[Charts: NVidia GeForce GT 285, CUDA vs. OpenCL — particle simulation and PI approximation; run time plotted against iterations]

Performance Comparison (3)
[Charts: NVidia GeForce GT 285, CUDA vs. OpenCL — random global memory reads and random global memory writes; run time plotted against iterations]

Preliminary Conclusions
- There are cases where CUDA performs better than OpenCL, and cases where OpenCL performs better than CUDA
- OpenCL seems to have slightly higher overhead for kernel launches compared to CUDA on NVidia's platform
- For some cases the differences can be large, but...
- Measuring = knowing!

Back to the Host
Host Synchronization: CUDA Streams
- Streams are a sequence of commands that execute in order
- Streams can contain kernel launches and/or memory transfers
- Host code can wait for stream completion using the cudaStreamSynchronize() call
- Events can be inserted into the stream
- Host code can query event completion or perform a blocking wait for an event
- Useful for synchronization with host code and for timing

Host Synchronization: OpenCL Command Queues
- The default behavior of command queues is similar to CUDA streams
- One big difference: out-of-order execution mode
- clEnqueue...() commands can be given a set of events to wait for
- Each command can itself generate an event

Task & Data Parallelism
- The commands, and the events they must wait for, create a task graph
- Based on the dependencies between commands in the queue, OpenCL can determine which commands are allowed to execute simultaneously
- OpenCL will execute the commands in the queue as it sees fit, respecting the dependencies specified
- It is possible to create multiple queues for a device
- It is possible to have commands in one queue wait for events from a different queue
- The end result is a task-parallel framework supporting data-parallel tasks

Intermediate Conclusions
- The programming methodology for data-parallel applications is virtually identical, i.e.
if you can program in one language/environment, you can program in the other
- CUDA currently offers certain productivity advantages at the kernel level
- NVidia's hardware seems to be more capable on the GPGPU side when compared to ATi's hardware
- OpenCL has the platform advantage in that it presents a unified platform API for ALL computing hardware in your machine
- OpenCL programs can be run on hardware from different vendors
- Your application could be written entirely in OpenCL kernels, requiring only a small framework that fills the command queue

OpenCL Implementations
Available:

  Vendor     Type         Hardware
  Apple      CPU          x86_64 (Intel)
  NVidia     GPU          GeForce 8/9 series and higher
  ATi        GPU          R700/800 series
  AMD        CPU          any x86/x86_64 with SSE3 extensions
  Samsung    CPU          ARM A9
  IBM        Accelerator  Cell BE
  ZiiLabs    CPU          ARM

Announced/upcoming:

  Imagination Technology   PowerVR SGX Series 5
  VIA                      VN 00 Chipset
  S3                       Chrome 5400E Graphics Processor
  Apple                    ARM A4

Portability
- Results of a kernel are guaranteed across platforms; optimal performance is not
- All platforms are required to support data-parallelism, but are not required to support task-parallelism
- OpenCL can be considered a replacement for OpenMP (data-parallel)
- OpenCL can be considered a replacement for threads (task-parallel)
Libraries & Tools for OpenCL
- ATi Stream Profiler (ATi hardware only)
- NVidia Visual Profiler (NVidia hardware only)
- Stream KernelAnalyzer (ATi hardware only)
- NVidia NSight (NVidia hardware only)
- gDEBugger CL (Windows, Mac, Linux, currently in beta)
- libstdcl (wrapper around context/queue management functions)
- GATLAS (matrix multiplication)
- ViennaCL (BLAS level 1 and 2)
- Language bindings for C++, Fortran, Java, Matlab, .NET, Python and Scala are available

Libraries & Tools for CUDA
- CUBLAS (closed-source)
- CUFFT (closed-source)
- CUDPP (data-parallel primitives)
- Thrust (high-level & OpenMP-based algorithms)
- CULAtools (LAPACK)
- NSight debugger
- NVidia Visual Profiler
- Language bindings for Python, Java, .NET, MATLAB, Fortran, Perl, Ruby, Lua (unofficial)

Sneak Preview

Things to Consider
- Platforms / API stability and agility:
  - OpenCL changes more slowly, retains backward compatibility
  - CUDA changes more rapidly, unlocks new hardware features quicker
- Third-party library availability:
  - OpenCL is currently the only choice if you do not want to tie your application to NVidia's hardware
  - OpenCL is about 2 years younger, so fewer and less mature libraries are available
  - CUDA has spawned a host of initiatives, and various libraries are available, especially in the scientific computing domain
- Supporting tools:
  - OpenCL has a fairly young, but decent set of tools
  - NVidia recently launched the NSight debugger, which seems more mature

Questions?

Further Reading
GPGPU in general:
- www.gpgpu.org
OpenCL implementations:
- www.khronos.org/opencl
- http://developer.amd.com/documentation/articles/pages/-and-the-ati-stream-v2.0-beta.aspx
- http://developer.nvidia.com/object/opencl.html
- http://www.alphaworks.ibm.com/tech/opencl
- http://developer.apple.com/mac/library/documentation/performance/conceptual/_macprogguide/whatis/wh /
Comparisons:
- http://www.gpucomputing.net/?q=node/128
Mobile/embedded announcements:
- http://www.imgtec.com/news/release/index.asp?newsid=557
- http://www.ziilabs.com/technology/opencl.aspx
References
- http://blog.accelereyes.com/blog/20/05//nvidia-fermi-cuda-and-opencl/
- http://www.s3graphics.com/en/news/news_detail.aspx?id=44
- http://www.gremedy.com/gdebuggercl.php
- http://browndeertechnology.com/stdcl.html
- http://golem5.org/gatlas/
- http://www.mainconcept.com/products/sdks/hw-acceleration/opencl-h264avc.html
- http://awaregeek.com/news/some-pictures-of-old-computers/