Cross-Platform GPGPU with OpenCL

George van Venrooij
Organic Vectory B.V.

Organic Vectory BV
- Project Services
- Consultancy Services
- Expertise and markets: 3D Visualization, Architecture/Design, Computing, Embedded Software, GIS, Finance

As defined by the Khronos Group:
- OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.
- OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories, from gaming and entertainment to scientific and medical software.

The Khronos Group controls many other standards, such as:
- OpenGL (ES)
- OpenVG
- COLLADA
- WebGL
- and many more

OpenCL vs. CUDA

Device Memory Model
[Figure: device memory model diagram (source: nvidia Tutorial Slides)]

Terminology

| OpenCL          | CUDA            |
|-----------------|-----------------|
| Work-Item       | Thread          |
| Work-Group      | Thread-Block    |
| Global Memory   | Global Memory   |
| Constant Memory | Constant Memory |
| Local Memory    | Shared Memory   |
| Private Memory  | Local Memory    |

Qualifiers

| OpenCL                | CUDA                  |
|-----------------------|-----------------------|
| __kernel              | __global__            |
| (no qualifier needed) | __device__ (function) |
| __constant            | __constant__          |
| __global              | __device__ (variable) |
| __local               | __shared__            |

Indexing

| OpenCL            | CUDA                 |
|-------------------|----------------------|
| get_num_groups()  | gridDim              |
| get_local_size()  | blockDim             |
| get_group_id()    | blockIdx             |
| get_local_id()    | threadIdx            |
| get_global_id()   | (calculate manually) |
| get_global_size() | (calculate manually) |

API Objects

| OpenCL           | CUDA        |
|------------------|-------------|
| cl_device_id     | CUdevice    |
| cl_context       | CUcontext   |
| cl_program       | CUmodule    |
| cl_kernel        | CUfunction  |
| cl_mem           | CUdeviceptr |
| cl_command_queue | CUstream    |

Device Thread Synchronization

| OpenCL            | CUDA                  |
|-------------------|-----------------------|
| barrier()         | __syncthreads()       |
| (no equivalent)   | __threadfence()       |
| mem_fence()       | __threadfence_block() |
| read_mem_fence()  | (no equivalent)       |
| write_mem_fence() | (no equivalent)       |

Kernel Language

OpenCL:
- Subset of C99 with data-parallel extensions
- Requires run-time compilation by the driver
- No function pointers or recursion

CUDA:
- C for CUDA, a subset of C with data-parallel extensions
- C++ features (function templates) enable higher productivity through meta-programming techniques
- Compilation through a separate compiler: NVCC
- No function pointers or recursion
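The two "(calculate manually)" entries in the indexing table can be illustrated with a small sketch. This is a plain-Python emulation, not OpenCL or CUDA code; it assumes a one-dimensional launch with no global offset, and the function names `global_id`, `global_size` and `enumerate_ids` are hypothetical helpers introduced only for this illustration.

```python
# Plain-Python emulation (not real OpenCL or CUDA code) of how a CUDA thread
# derives the values that OpenCL exposes as get_global_id() and
# get_global_size(). Assumes a 1-D launch with no global offset.

def global_id(block_idx, block_dim, thread_idx):
    """Manual CUDA-style calculation: blockIdx * blockDim + threadIdx."""
    return block_idx * block_dim + thread_idx


def global_size(grid_dim, block_dim):
    """Manual CUDA-style calculation: gridDim * blockDim."""
    return grid_dim * block_dim


def enumerate_ids(grid_dim, block_dim):
    """Visit every (block, thread) pair and collect the derived global IDs."""
    return [
        global_id(b, block_dim, t)
        for b in range(grid_dim)
        for t in range(block_dim)
    ]


# Hypothetical launch: 3 thread-blocks (work-groups) of 4 threads (work-items).
ids = enumerate_ids(grid_dim=3, block_dim=4)
print(ids)
```

Each (block, thread) pair maps to a distinct global ID, and together the IDs cover the range from 0 up to the global size, which is what an OpenCL kernel would see through get_global_id(0).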

Performance Comparison (1)
- Profiler tool
- accelereyes.com: OpenCL vs. CUDA, tested on Tesla C20

Performance Comparison (2)
[Charts: OpenCL vs. CUDA on NVidia GeForce GT cards; panels: Particle simulation, PI approximation]

Performance Comparison (3)
[Charts: OpenCL vs. CUDA on NVidia GeForce GT cards; panels: Random global memory reads, Random global memory writes; axis: Iterations]

Preliminary Conclusions
- There are cases where CUDA performs better than OpenCL
- There are cases where OpenCL performs better than CUDA
- OpenCL seems to have slightly higher overhead for kernel launches compared to CUDA on NVidia's platform
- For some cases the differences can be large, but...
- Measuring = knowing!

Back to the Host

Host Synchronization: CUDA Streams
- Streams are a sequence of commands that execute in order
- Streams can contain kernel launches and/or memory transfers
- Host code can wait for stream completion using the cudaStreamSynchronize() call
- Events can be inserted into the stream
- Host code can query event completion or perform a blocking wait for an event
- Useful for synchronization with host code and for timing

Host Synchronization: OpenCL Command Queues
- Default behavior of command queues is similar to CUDA Streams
- One big difference: out-of-order execution mode
- clEnqueue...() commands can be given a set of events to wait for
- Each command itself can generate an event

Task & Data Parallelism
- The commands, and the events they must wait for, create a task graph
- The end result is a task-parallel framework supporting data-parallel tasks
- It is possible to create multiple queues for a device
- It is possible to have commands in one queue wait for events from a different queue
- OpenCL will execute the commands in the queue as it sees fit, respecting the dependencies specified
- Based on the dependencies between commands in the queue, OpenCL can determine which commands are allowed to execute simultaneously
- Your application could be written entirely in OpenCL kernels, requiring only a small framework that fills the command queue

Intermediate Conclusions
- The programming methodology for data-parallel applications is virtually identical, i.e. if you can program in one language/environment, you can program in the other
- CUDA currently offers certain productivity advantages at the kernel level
- NVidia's hardware seems to be more capable on the GPGPU side when compared to ATi's hardware
- OpenCL has the platform advantage in that it presents a unified platform API for ALL computing hardware in your machine
- OpenCL programs can be run on hardware from different vendors

OpenCL Implementations (vendor, type, hardware)

AVAILABLE:
- Apple: x86_64 (Intel)
- nvidia: GeForce 8/9 series and higher
- ATi: R0/800 series
- AMD: any x86/x86_64 with SSE3 extensions
- Samsung: ARM A9
- IBM: ACCELERATOR, CELL BE
- ZiiLabs: ARM

ANNOUNCED/UPCOMING:
- Imagination Technology: PowerVR SG Series 5
- VIA: VN 00 Chipset
- S3: Chrome 5400E Graphics Processor
- Apple: ARM A4

Portability to other platforms
- Results of a kernel are guaranteed across platforms; optimal performance is not
- All platforms are required to support data-parallelism, but are not required to support task-parallelism
- OpenCL can be considered a replacement for OpenMP (data-parallel)
- OpenCL can be considered a replacement for Threads (task-parallel)
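The task-graph idea from the Task & Data Parallelism slide (commands plus the events they wait for) can be sketched as follows. This is a plain-Python emulation and does not call the OpenCL API; the `schedule` function and the command names in `graph` are hypothetical, chosen only to show how dependencies determine which commands may run simultaneously in an out-of-order queue.

```python
# Plain-Python emulation (not the OpenCL API): each command lists the
# commands whose events it waits for, and the scheduler groups commands into
# "waves" that could execute simultaneously once their dependencies are done.

def schedule(commands):
    """commands maps a command name to the list of command names whose
    events it waits for. Returns a list of waves (sorted lists of names)."""
    done = set()
    remaining = dict(commands)
    waves = []
    while remaining:
        # A command is ready when every event in its wait list has completed.
        ready = sorted(
            name for name, waits in remaining.items()
            if all(w in done for w in waits)
        )
        if not ready:
            raise ValueError("circular or unknown event dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical task graph: two uploads feed two kernels, whose results are
# combined by a third kernel and then read back to the host.
graph = {
    "write_a": [],
    "write_b": [],
    "kernel_a": ["write_a"],
    "kernel_b": ["write_b"],
    "combine": ["kernel_a", "kernel_b"],
    "read_result": ["combine"],
}
waves = schedule(graph)
for index, wave in enumerate(waves):
    print(index, wave)
```

With an in-order queue (or a single CUDA stream) the six commands would run strictly one after another; with explicit event dependencies, the two uploads and the two independent kernels are free to overlap.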

Libraries & Tools for OpenCL
- ATi StreamProfiler (ATi hardware only)
- NVidia Visual Profiler (NVidia hardware only)
- Stream KernelAnalyzer (ATi hardware only)
- NVidia NSight (NVidia hardware only)
- gDEBugger CL (Windows, Mac, Linux, currently in beta)
- libstdcl (wrapper around context/queue management functions)
- GATLAS (matrix multiplication)
- ViennaCL (BLAS level 1 and 2)
- Language bindings for C++, Fortran, Java, Matlab, .NET, Python and Scala are available

Libraries & Tools for CUDA
- cuBLAS (closed-source)
- cuFFT (closed-source)
- CUDPP (data-parallel primitives)
- Thrust (high-level CUDA & OpenMP-based algorithms)
- CULATools (LAPACK)
- NSight debugger
- NVidia Visual Profiler
- Language bindings for Python, Java, .NET, MATLAB, Fortran, Perl, Ruby, Lua (unofficial)

Sneak Preview

Things to consider

Platforms
- OpenCL is currently the only choice if you do not want to tie your application to NVidia's hardware

API stability/agility
- OpenCL changes more slowly and retains backward compatibility
- CUDA changes more rapidly and unlocks new hardware features quicker

Third-party library availability
- OpenCL is about 2 years younger, so fewer and less mature libraries are available
- CUDA has spawned a host of initiatives, and various libraries are available, especially in the scientific computing domain

Supporting tools
- OpenCL has a fairly young, but decent set of tools
- NVidia recently launched the NSight debugger, which seems more mature

Questions

Further Reading
- GPGPU General
- OpenCL Implementations
- OpenCL / CUDA Comparisons
- Mobile/Embedded announcements
