1 GPGPU General Purpose Computing on Graphics Processing Units These slides were prepared by Mathias Bach and David Rohr Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
2 Roundup: possibilities to increase computing performance increased clock speed more complex instructions improved instruction throughput (caches, branch prediction, ...) vectorization parallelization Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-2
3 Possibilities and Problems increased clock speed power consumption / cooling limited by state of the art lithography more complex instructions require more transistors / bigger cores negative effect on clock speed caches, pipelining, branch prediction, out of order execution, require many more transistors vectorization / parallelization difficult to program Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-3
4 Possible features for an HPC chip parallelism is obligatory both vectorization and many cores seem reasonable huge vectors are easier to realize than a large number of cores (e.g. only 1 instruction decoder per vector processor) Independent cores can process independent instructions which might be better for some algorithms complex instructions, out of order execution, etc. Hardware requirements are huge not suited for a many core design as the additional hardware is required multiple times clock speed limited anyway, not so relevant in HPC as performance originates from parallelism Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-4
5 Design Guideline for a GPU Use many cores in parallel Each core on its own has SIMD capabilities Keep the cores simple (Rather use many simple cores instead of fewer (faster) complex cores) This means: No out of order execution, etc. Use the highest clock speed possible, but do not focus on frequency Pipelining has no excessive register requirement and is required for a reasonable clock speed, therefore a small pipeline is used Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-5
6 Graphics Processing Unit Architectures Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
7 Today's graphics pipeline Model / View Transformation Per-Vertex Lighting Tessellation Clipping Display Texturing Rasterization Projection Executed per primitive Polygon, Vertex, Pixel Highly parallel Everything but rasterization (and display) is software Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
8 A generic GPU architecture One hardware for all stages Modular setup Streaming Hardware Hardware scheduling Dynamic Register Count Processing Elements contain FPUs [Block diagram: several compute units, each with a control unit, an array of Processing Elements (PE), a register file and a texture cache / local memory, all attached to the GPU memory] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
9 Two application examples Vector Addition Reduction Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
10 Generic architecture: Vector addition [Diagram: execution trace of C = A + B on the generic architecture; each PE loads its elements of A and B from global memory into the register file, adds them, and stores the result to C] The examples use only two compute units with 4 PEs each to make the visualization easier to overview. We will also skip (implicit) global memory ops and register file content in the next examples. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
11 Generic architecture: Reduction [Diagram: execution trace of a sum reduction; each PE first adds two elements of A, then partial sums held in shared memory are combined in successive steps while the remaining PEs execute NOOPs] No syncing between compute units. Second pass (with only one compute unit) to add the results of the compute units. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
12 SIMT Groups of threads executed in lock-step Lock-step makes it similar to SIMD Own Register Set for each Processing Element Vector-Width given by FPUs, not register size Gather/Scatter not restricted (see reduction example) No masking required for conditional execution More flexible register (re)usage [Diagram: RegA = (0,1,2,3), RegB = (4,5,6,7); add(a,b) executed by the four PEs yields RegA = (4,6,8,10)] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
13 HyperThreading Hardware Scheduler Zero-Overhead Thread Switch Schedules thread groups onto processing units Latency Hiding E.g. NVIDIA Tesla: 400 to 500 cycles memory latency 4 cycles per thread group 100 thread groups (320 threads) on processing group to completely hide latency [Diagram: with 1 thread group the unit idles between reads; with 6 thread groups the reads of different groups overlap and the unit stays busy] In the example each thread issues a number of reads as required e.g. in the vector addition example. The read latency is assumed equivalent to executing 8 thread groups. Colors distinguish groups. Thread groups are to be understood as concurrently scheduled to the current processing unit. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
14 Stream Computing Limited execution control Specify number of work-items and thread group size No synchronization between thread groups Allows scaling over devices of different size Relaxed memory consistency Memory only consistent within thread Consistent for thread group at synchronization points Not consistent between thread groups No synchronization possibility anyway Save silicon for FPUs. Globally consistent memory only at end of execution. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
15 Register Files / Local Memory / Caches Register File Dynamic Register Set Size Many threads with low register usage Good to hide memory latencies High throughput Fewer threads with high register usage Only suited for compute-intensive kernels Local Memory Data exchange within thread group Spatial locality cache CPU caches work with temporal locality Reduces memory transaction count for multiple threads reading close addresses 2D / 3D locality requires special memory layout Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
16 Schematic of NVIDIA GT200b chip many core design (30 multiprocessors) 8 ALUs per multiprocessor (vector width: 8) A full-featured coherent read/write cache would be too complex. Instead several small special purpose caches are employed. Future generations have general purpose L2 Cache. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-16
17 NVIDIA Tesla Architecture Close to generic architecture Lockstep size: 16 1 DP FPU per Compute Unit 1 SFU per Compute Unit 3 Compute Units grouped into Thread Processing Cluster [Block diagram: compute units with PEs, DP FPU, SFUs, register file and local memory, grouped under shared control and texture cache; global memory atomics; GPU memory] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
18 ATI Cypress Architecture VLIW PEs 4 SP FPUs 1 Special Function Unit 1 to 2 DP ops per cycle 20 Compute Units 16 Stream Cores each 1600 FPUs total Lockstep size: 64 Global Memory Atomics Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
19 VLIW = Very Long Instruction Word VLIW PE is similar to SIMD core FPUs can execute different ops Data for FPUs within VLIW must be independent The compiler needs to detect this to generate proper VLIW Often results in SIMD style code / using vector types, e.g. float4 [Example: C = A + B on float4 values: (0,1,2,3) + (10,11,12,13) = (10,12,14,16); (4,5,6,7) + (14,15,16,17) = (18,20,22,24); (8,9,10,11) + (18,19,20,21) = (26,28,30,32); (12,13,14,15) + (22,23,24,25) = (34,36,38,40)] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
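Written out in OpenCL C, the SIMD-style code such compilers favor might look like the following minimal sketch (the kernel and parameter names are made up for this illustration):
__kernel void addvector4(__global const float4 *a, __global const float4 *b, __global float4 *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   // one statement expresses 4 independent adds that can fill a VLIW bundle
}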
20 NVIDIA Fermi Architecture PE = CUDA core 2 Cores fused for DP ops 2 Instruction Decoders per Compute Unit Lockstep size: 32 Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
21 NVIDIA Fermi Architecture Large L2 cache Unusual Shared Read-Write No synchronization between Compute Units Global Memory Atomics exist Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
22 GPUs in Comparison [Table comparing NVIDIA Tesla, NVIDIA Fermi and AMD HD5000 by FPU count, SP and DP performance in Gflops, memory bandwidth in GiB/s, local scratch memory in KiB and L2 cache in MiB; the numbers are not preserved here, but Tesla and HD5000 provide texture caches only, no general purpose L2] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten 22
23 SIMD Recall SIMD (Single Instruction Multiple Data) one instruction stream processes multiple data streams in parallel often called vectorization Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-23
24 SIMD vs. SIMT new programming model introduced by NVIDIA SIMT: Single Instruction Multiple Threads resembles programming a vector processor instead of vectors, threads are used BUT: as only 1 instruction decoder is available all threads have to execute the same instruction SIMT is in fact an abstraction for vectorization SIMT code looks like many core code BUT: the vector-like structure of the GPU must be kept in mind to achieve optimal performance Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-24
25 SIMT example how to add 2 vectors in memory the corresponding vectorized code would be: dest = src1 + src2; the SIMT way: each element of the vector in memory is processed by an independent thread each thread is assigned a private variable (called thread_id in this example) determining which element to process SIMT code: dest[thread_id] = src1[thread_id] + src2[thread_id]; dest, src1, and src2 are of course pointers and not vectors a number of threads equal to the vector size is started executing the above instruction in parallel Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-25
26 SIMD vs. SIMT examples masked vector gather as example (Remember VC) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-26
27 SIMD vs. SIMT examples SIMD masked vector gather (vector example) int_v dst; int_m mask; int_m *addr; code: dst(mask) = load_vector(addr); only one instruction executed by one thread on a data vector SIMT masked vector gather int dst; bool mask; int *addr; code: if (mask) dst = addr[thread_id]; multiple instructions executed by the threads in parallel source is a vector in memory, target is a set of registers but no vector register Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-27
28 SIMD vs. SIMT comparison why use SIMT at all SIMT allows if-, else-, while-, and switch-statements etc. as commonly used in scalar code no masks required this makes porting code to SIMT easier especially code that has been developed to run on many core systems (e.g. using OpenMP, Intel TBB) can easily be adapted (see next example) SIMT primary (dis)advantages + easier portability / more opportunities for conditional code - implicit vector nature of the chip is likely not to be dealt with, resulting in poor performance Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-28
29 SIMT threads threads within one multiprocessor usually more threads than ALUs are present on each multiprocessor this assures a good overall utilization (latency hiding: threads waiting for memory accesses to finish are replaced by the scheduler with other threads without any overhead) Thread count per multiprocessor is usually a multiple of the ALU count (only a minimum thread count can be defined) threads of different multiprocessors As only one instruction decoder is present, threads on one particular multiprocessor must execute common instructions threads of different multiprocessors are completely independent Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-29
30 Porting OpenMP Code simple OpenMP code #pragma omp parallel for for (int i = 0;i < max;i++) { //do something } SIMT code int i = thread_id; if (i < max) { //do something } Enough threads are started so that no loop is necessary the check for i < max is needed Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-30
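Written out as a complete kernel, the ported loop might look like the following sketch (the buffer and the loop body are placeholders chosen for this illustration):
__kernel void do_something(__global float *data, int max)
{
    int i = get_global_id(0);      // replaces the loop counter
    if (i < max)                   // guard, since more threads than elements may be started
    {
        data[i] = 2.0f * data[i];  // placeholder for "do something"
    }
}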
31 Languages for GPGPU OpenGL / Direct3D first GPGPU approaches tried to encapsulate general problems in 3D graphics calculations, representing the source data by textures and encoding the result in the graphics rendered by the GPU not used anymore CUDA (Compute Unified Device Architecture) SIMT approach by NVIDIA OpenCL open SIMT approach by the Khronos Group that is platform independent (compare OpenGL) very similar to CUDA (CUDA still has more features) AMD / ATI Stream Based on Brook, recently AMD has been focusing on OpenCL Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-31
32 Languages for GPGPU OpenGL / Direct3D / Stream seem outdated this course will focus primarily on OpenCL OpenCL is favored because it is an open framework more importantly OpenCL is platform independent, not even restricted to GPUs but also available for CPUs (with auto-vectorization support) some notes will be made about CUDA especially where CUDA offers features not available in OpenCL, such as: Full C++ support (CUDA offered limited C++ functionality from the beginning. Full C++ support is available as of version 3.0) this strongly suggests the use of CUDA when porting C++ codes support for DMA transfer using page-locked memory Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-32
33 OpenCL Introduction Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
34 OpenCL Introduction OpenCL distinguishes between two types of functions regular host functions kernels (functions executed on the computing device, marked with the __kernel keyword) in the following host will always refer to the CPU and the main memory device will identify the computing device and its memory, usually the graphics card (also a CPU can be the device when running OpenCL code on a CPU. Then both host and device code execute in different threads on the CPU. The host thread is responsible for administrative tasks while the device threads do all the calculations) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-34
35 OpenCL Kernels / Subroutines Subroutines to initiate a calculation on the computing device a kernel must be called by a host function kernels can call other functions on the device but can obviously never call host functions the kernels are usually stored in plain source code and compiled at runtime (functions called by the kernel must be contained there too), then transferred to the device where they are executed (see example later) several third party libraries simplify this task Compilation OpenCL is platform independent and it is up to the compiler how to treat function calls. Usually calls are simply inlined Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-35
36 OpenCL Devices in OpenCL terminology several compute devices can be attached to a host (e.g. multiple graphics cards) each compute device can possess multiple compute units (e.g. the multiprocessors in the case of NVIDIA) each compute unit consists of multiple processing elements, which are virtual scalar processors each executing one thread Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
37 OpenCL Execution Configuration Kernels are executed the following way n*m kernel instances are created in parallel, which are called work-items (each is assigned a global ID [0,n*m-1]) work-items are grouped in n work-groups work-groups are indexed [0,n-1] each work-item is further identified by a local work-item-id inside its work group [0,m-1] thus the work-item can be uniquely identified using the global id or both the local- and the work-group-id The work-groups are distributed as follows all work-items within one work-group are executed concurrently within one compute unit Different work-groups may be executed simultaneously or sequentially on the same or different compute unit where the execution order is not well defined Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
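Inside a kernel these IDs are available through built-in functions; for a 1-dimensional configuration they relate as in this small sketch (variable names are only illustrative):
size_t gid   = get_global_id(0);   // global work-item ID in [0, n*m-1]
size_t wgid  = get_group_id(0);    // work-group ID in [0, n-1]
size_t lid   = get_local_id(0);    // local ID within the work-group, in [0, m-1]
size_t lsize = get_local_size(0);  // m, the work-group size
// the global ID is composed as: gid == wgid * lsize + lid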
38 More Complex Execution Configuration OpenCL allows the indexes for the work-items and work-groups to be N-dimensional often well suited for some problems, especially image manipulation (recall that GPU originally render images) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-38
39 Command Queues OpenCL kernel calls are assigned to a command queue command queues can also contain memory transfer operations and barriers execution of command queues and the host code is asynchronous barriers can be used to synchronize the host with a queue tasks issued to command queues are executed in order Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-39
40 Realization of Execution Configuration consider n work-groups of m work-items each each compute unit must uphold at least m threads m is limited by the hardware scheduler (on NVIDIA GPUs the limit varies between 256 and 1024) if m is too small the compute unit might not be well utilized multiple work-groups (say k) can then be executed in parallel on the same compute unit (which then executes k*m threads) each work-item has a certain requirement for registers, memory, etc say each work-item requires l registers, then in total m*k*l registers must be available on the compute-unit this further limits the maximal number of threads Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-40
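A hypothetical example with made-up numbers: assume a compute unit provides 16384 registers and a work-group size of m = 256 is chosen. With l = 16 registers per work-item, k = 16384 / (256 * 16) = 4 work-groups (1024 threads) fit on the compute unit; with l = 64 registers only a single work-group fits, which leaves the scheduler far fewer threads for hiding memory latencies.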
41 Platform Independent Realization register limitation and platform independence OpenCL code is platform independent and compiled at runtime apparently this solves the problem with limited registers, because the compiler knows how many work-items to execute and can create code with reduced register requirement (up to a certain limit) no switch or parameter available that controls the register usage of the compiler, everything is decided by the runtime HOWEVER: register restriction leads to intermediate results being stored in memory and thus might result in poor performance Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-41
42 Register / Thread Trade-Off this can be discussed more concretely in the CUDA case, here the compiler is platform dependent and its behavior is well defined more registers result in faster threads more threads lead to a better overall utilization the best parameter has to be determined experimentally Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-42
43 Performance Impact of Register Usage Real World CUDA HPC Application (later in detail) ALICE HLT Online Tracker on GPU Performance for different thread- / register-counts Register and thread count are related as follows [Table/plot relating the register limit and the resulting thread count, with the measured performance; an intermediate thread count is optimal] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-43
44 Summary of register benchmarks Optimal parameter was found experimentally It depends on the hardware Little influence possible in OpenCL code (as it is platform independent) CUDA allows for better utilization (as it is closer to the hardware) OpenCL optimizations possible on compiler side (e.g. just-in-time recompilation, compare to Java) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-44
45 OpenCL kernel sizes Recall that functions are usually inlined Kernel register requirement commonly increases with the amount of kernel source code (the compiler tries its best to eliminate registers but often cannot prove the independence of variables that could share a register) Try to keep kernels small multiple small kernels executed sequentially usually perform better than one big kernel split tasks into steps as small as possible Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-45
46 One more theoretical part: Memory no access to host main memory by device device memory itself divided into: global memory constant memory local memory private memory before the kernel gets executed the relevant data must be transferred from the host to the device after the kernel execution the result is transferred back Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-46
47 Device Memory in Detail global memory global memory is the main device memory (such as main memory for the host) can be written to and read from by all work-items and by the host through special runtime functions global memory may be cached depending on the device capabilities but should be considered slow (Even if it is cached, the cache is usually not as sophisticated as a usual CPU L1 cache. Slow still means transfer rates of more than 150 GB/s (for the newest generation NVIDIA Fermi cards). Random access however should be avoided in any case. Coalescing Rules to achieve optimal performance on NVIDIA cards will be explained later.) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-47
48 Device Memory in Detail constant memory a region of global memory that remains constant during kernel execution often this allows for easier caching constant memory is allocated and written to by the host Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-48
49 Device Memory in Detail local memory special memory that is shared among all work-items in one work-group local memory is generally very fast atomic operations to local memory can be used to synchronize and share data between work-items when global memory is too slow and no cache is available it is a general practice to use local memory as an explicit (non transparent) global memory cache Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-49
50 Device Memory in Detail private memory as the name implies this is private for each work-item private memory is usually a region of global memory each thread requires its own private memory, so when executing n work-groups of m work-items each, n*m*k bytes of global memory are reserved (with k the amount of private memory required by one thread) as global memory is usually big compared to private memory requirements, the available private memory is usually not exceeded if the compiler is short of registers it will swap register content to private memory Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-50
51 OpenCL Memory Summary
         global memory        constant memory      local memory         private memory
host     dynamic allocation,  dynamic allocation,  dynamic allocation,  no allocation,
         read/write           read/write           no access            no access
device   no allocation,       static allocation,   static allocation,   static allocation,
         read/write           read only            read/write           read/write
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-51
52 Correspondence OpenCL / CUDA As already stated OpenCL and CUDA resemble each other. However, terminology differs:
OpenCL                               CUDA
host / compute device / kernel       host / device / kernel
compute unit                         multiprocessor
global memory                        global memory
constant memory                      constant memory
local memory                         shared memory
private memory                       local memory
work-item                            thread
work-group                           thread block
keyword for (sub)kernels: __kernel   __global__ (__device__)
command queue                        stream
Be careful with local memory, as it refers to different memory types Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-52
53 Memory Realization the OpenCL specification does not define memory sizes and type (speed, etc.) we look at it in the case of CUDA (GT200b chip)
memory (OpenCL terminology)   Size                     Remarks
global memory                 1 GB                     not cached, 100 GB/s
constant memory               64 kB                    cached
local memory                  16 kB / multiprocessor   very fast, when used with the correct pattern as fast as registers
private memory                -                        part of global memory, considered slow
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-53
54 Memory Guidelines the following guidelines refer to the GT200b chip (for different chips the optimal memory usage might differ) store constants in constant memory wherever possible to benefit from the cache try not to use too many intermediate variables to save register space, better recalculate values try not to exceed the register limit, swapping registers to private memory is painful avoid private memory where possible use local memory where possible big datasets must be stored in global memory anyway, try to realize a streaming access, follow coalescing rules (see next sheet), and try to access the data only once Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-54
55 NVIDIA Coalescing Rules Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-55
56 Analysing Coalescing Rules example A resembles an aligned vector fetch with a swizzle example B is an unaligned vector fetch both access patterns commonly appear in SIMD applications as for vector-processors random gathers cause problems the vector-processor-like nature of the NVIDIA-GPU reappears Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-56
57 NVIDIA Local Memory Coalescing Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-57
58 Memory Consistency GPU memory consistency differs from what one is used to from CPUs load / store order for global memory is not preserved among different compute-units the correct order can be ensured for threads within one particular compute-unit using synchronization / memory fences global memory coherence is only ensured after a kernel call is finished (when the next kernel starts, memory is consistent) there is no way to circumvent this!!! HOWEVER: different compute units can be synchronized using atomic operations As inter work-group synchronization is very expensive, try to divide the problem into small parts that are handled independently Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-58
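As a small illustration of such atomic operations, a global counter can be used to hand out work to threads across work-groups; a hedged sketch (the counter buffer must be zero-initialized by the host, all names are illustrative):
__kernel void grab_work(__global int *next_item, __global float *items, int item_count)
{
    int i = atomic_inc(next_item);  // atomically obtain the index of the next item to process
    if (i < item_count)
    {
        items[i] *= 2.0f;           // placeholder for the actual work
    }
}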
59 From theory to application tasks required to execute an OpenCL kernel create the OpenCL context o query devices o choose device o etc. load the kernel source code (usually from a file) compile the OpenCL kernel transfer the source data to the device define the execution configuration execute the OpenCL kernel fetch the result from the device uninitialize the OpenCL context third party libraries encapsulate these tasks we will look at the OpenCL runtime functions in detail Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-59
60 OpenCL Runtime OpenCL is plain C currently Upcoming C++ interface for the host C++ for the device might appear in future versions The whole runtime documentation can be found at: www.khronos.org/opencl/ The basic functions to create first examples will be presented in the lecture Some features will just be mentioned, have a look at the documentation to see how to use them!!! Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-60
61 OpenCL Runtime Functions (Context)
//Set OpenCL platform, choose between different implementations / versions
cl_int clGetPlatformIDs (cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
//Get List of Devices available in the current platform
cl_int clGetDeviceIDs (cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices, cl_uint *num_devices)
//Get Information about an OpenCL device
cl_int clGetDeviceInfo (cl_device_id device, cl_device_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)
//Create OpenCL context for a platform / device combination
cl_context clCreateContext (const cl_context_properties *properties, cl_uint num_devices, const cl_device_id *devices, void (*pfn_notify)(const char *errinfo, const void *private_info, size_t cb, void *user_data), void *user_data, cl_int *errcode_ret)
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-61
62 Runtime Functions (Queues / Memory)
//Create a command queue kernels will be assigned to later
cl_command_queue clCreateCommandQueue (cl_context context, cl_device_id device, cl_command_queue_properties properties, cl_int *errcode_ret)
//Allocate memory on the device
cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)
flags regulate read / write access for kernels host memory can be defined as storage, however the OpenCL runtime is allowed to cache host memory in device memory during kernel execution device memory can be allocated as buffer or as image buffers are plain memory segments accessible by pointers images are 2/3-dimensional objects (textures / frame buffers) accessed by special functions, storage format opaque for the user Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-62
63 Runtime Functions (Memory)
//Read memory from device to host
cl_int clEnqueueReadBuffer (cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
//Write to device memory from host
cl_int clEnqueueWriteBuffer ( ) //same parameters
reads / writes can be blocking / non-blocking (Blocking commands are enqueued and the host process waits for the command to finish before it continues. Non-blocking commands do not pause host execution) the event parameters force the operation to start only after specified events occurred on the device events occur for example when kernel executions finish, they are used for synchronization Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-63
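A minimal sketch of chaining two queue operations through an event (variable names are illustrative; the kernel launch function is introduced two slides below):
cl_event kernel_done;
// enqueue the kernel and remember its completion event
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, &kernel_done);
// the read starts only after the kernel has finished; CL_TRUE additionally blocks the host until the data has arrived
clEnqueueReadBuffer(command_queue, result_buffer, CL_TRUE, 0, size, host_ptr, 1, &kernel_done, NULL);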
64 Runtime Functions (Kernel creation)
//Load a program from a string
cl_program clCreateProgramWithSource (cl_context context, cl_uint count, const char **strings, const size_t *lengths, cl_int *errcode_ret)
//Compile the program
cl_int clBuildProgram (cl_program program, cl_uint num_devices, const cl_device_id *device_list, const char *options, void (*pfn_notify)(cl_program, void *user_data), void *user_data)
//Create an executable kernel out of a kernel function in the compiled program
cl_kernel clCreateKernel (cl_program program, const char *kernel_name, cl_int *errcode_ret)
//Define kernel parameters for execution
cl_int clSetKernelArg (cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-64
65 Runtime Functions (Kernel execution)
//Enqueue a kernel for execution
cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue, cl_kernel kernel, cl_uint work_dim, const size_t *global_work_offset, const size_t *global_work_size, const size_t *local_work_size, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
work_dim is the dimensionality of the work-groups global_work_size and local_work_size are the number of work-items globally and in a work-group respectively. these parameters are arrays indexed from 0 to work_dim - 1 to allow multi-dimensional work-groups the local_work_size parameters must evenly divide the global_work_size parameters Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-65
66 Runtime Functions (Kernel execution) examples for kernel execution configurations simple 1-dimensional example: 2 work-groups of 16 work-items each: work_dim = 1, local_work_size = (16), global_work_size = (32) more complex 2-dimensional example: 4*2 work-groups of 8*8 work-items each: work_dim = 2, local_work_size = (8,8), global_work_size = (32,16) [Figure: a 32 x 16 grid of work-items divided into 8 x 8 work-groups] Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-66
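In host code, the second configuration might be set up as in this sketch (queue and kernel variables as in the later example):
// 2-dimensional configuration: 4*2 work-groups of 8*8 work-items each
size_t global_size[2] = { 32, 16 };
size_t local_size[2]  = { 8, 8 };
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);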
67 Memory Access by Kernels Memory objects can be accessed by kernels New keywords for kernel parameters __global //pointer to global memory __constant //pointer to constant memory Assigning a buffer object to a __global variable will result in a pointer to the address read_only write_only //Used for images not buffers, see reference for details Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-67
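A small sketch of how the address space qualifiers appear in a kernel signature, and how local memory is set up from the host (parameter names are illustrative):
__kernel void example(__global float *data,      // backed by a buffer created with clCreateBuffer
                      __constant float *coeffs,  // read-only region of global memory
                      __local float *scratch)    // per-work-group local memory
{
    /* ... */
}
// host side: local memory is not backed by a buffer object, only its size is passed:
// clSetKernelArg(kernel, 2, scratch_bytes, NULL);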
68 OpenCL First Example, Vector Addition addvector.cl :
__kernel void addvector(__global int *src_1, __global int *src_2, __global int *dst, int vector_size)
{
    for (int i = get_global_id(0); i < vector_size; i += get_global_size(0))
    {
        dst[i] = src_1[i] + src_2[i];
    }
}
global id and size can be obtained by get_global_id() / get_global_size() consider how the work is distributed among the threads (consecutive threads access data in adjacent memory addresses, following the coalescing rules) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-68
69 OpenCL First Example, Vector Addition addvector.cpp :
cl_int ocl_error, num_platforms = 1, num_devices = 1, vector_size = 1024;
clGetPlatformIDs(num_platforms, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &ocl_error);
cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &ocl_error);
cl_program program = clCreateProgramWithSource(context, 1, (const char**) &sourcecode, NULL, &ocl_error);
clBuildProgram(program, 1, &device, "-cl-mad-enable", NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "addvector", &ocl_error);
cl_mem vec1 = clCreateBuffer(context, CL_MEM_READ_WRITE, vector_size * sizeof(int), NULL, &ocl_error);
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-69
70 OpenCL First Example, Vector Addition
clEnqueueWriteBuffer(command_queue, vec1, CL_FALSE, 0, vector_size * sizeof(int), host_vector_1, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &vec1);
//... Vector 2, Destination Memory ...
clSetKernelArg(kernel, 3, sizeof(cl_int), &vector_size);
size_t local_size = 8;
size_t global_size = 32;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
clEnqueueReadBuffer(command_queue, vec_result, CL_TRUE, 0, vector_size * sizeof(int), host_vector[2], 0, NULL, NULL);
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-70
71 OpenCL First Example, Vector Addition Vector addition is admittedly a very simple example OpenCL overhead for creating the kernel etc. seems huge Extended / documented source code available on the lecture homepage Can now be easily extended to more complex kernels (will be done in the tutorials) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-71
72 OpenCL Optimizations Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
73 Computing Device Utilization Using OpenCL the primary computing device is no longer the CPU (still the CPU can contribute to the total computing power or the CPU can be the OpenCL computing device on its own) Therefore the main objective is to keep the OpenCL computing device as busy as possible This includes two objectives Firstly: Ensure the device is totally utilized during kernel execution (This includes the prevention of latencies due to memory access as well as the utilization of all threads of the vector-like processor) Secondly: Make sure that there is no delay between kernel executions Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-73
74 Computing Device Utilization We will now discuss some criteria that should be fulfilled to ensure both of the previous requirements Bad device utilization during kernel execution mostly originates from: Memory latencies when the device waits for data from global memory Non-coalesced memory access where multiple memory accesses have to be issued by the threads instead of only a single access work-group serialization: As only one instruction decoder is present, performance decreases when different work-items follow different branches in conditional code Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-74
75 Memory Latencies OpenCL devices have an integrated feature to hide memory latencies The number of threads started greatly exceeds the number of threads that can be executed in parallel For each instruction cycle the scheduler selects, without any overhead, threads that are ready to execute Try to use a large number of parallel threads OpenCL devices do not necessarily have a general purpose cache Random memory access is very expensive, streaming access is even more important than for usual CPUs Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-75
76 Memory Coalescing Streaming access can usually be achieved by following coalescing rules Often data structures must be changed to allow for coalescing Often arrays of structures should be replaced by structures of arrays Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-76
77 Memory Coalescing Data Structures Consider the following examples
struct int4 { int x, y, z, w; };
int4 data[thread_count];
__kernel example1 { data[thread_id].x++; data[thread_id].y--; }
Memory layout: x1 y1 z1 w1 x2 y2 z2 w2 ... Access to data[thread_id].x skips 3 out of 4 memory addresses
int x[thread_count], y[thread_count], z[thread_count], ...;
__kernel example2 { x[thread_id]++; y[thread_id]--; }
Memory layout: x1 x2 x3 x4 y1 y2 y3 y4 ... Access to x[thread_id] affects a continuous memory segment
Example 1 requires 4 times the amount of accesses example 2 needs (for a thread count of 4), the ratio is worse for higher thread counts Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-77
78 Random memory access Many algorithms have random access schemes restricted to a bounded memory segment If this segment fits in local memory it can be cached Random memory access to local memory is almost as fast as sequential access (e.g. except for the possibility of bank conflicts for NVIDIA chips) Caching to local memory can be performed in a coalesced way Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-78
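A hedged sketch of this caching pattern: the work-group first copies its segment to local memory in a coalesced way, synchronizes, and then performs the random accesses on the fast local copy (segment size and access pattern are placeholders chosen for this illustration):
#define SEGMENT_SIZE 256
__kernel void cached_access(__global const float *input, __global float *output)
{
    __local float cache[SEGMENT_SIZE];
    int lid  = get_local_id(0);
    int base = get_group_id(0) * SEGMENT_SIZE;
    // coalesced load of the work-group's segment into local memory
    for (int i = lid; i < SEGMENT_SIZE; i += get_local_size(0))
        cache[i] = input[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);  // make the cached data visible to all work-items of the group
    // random accesses now hit fast local memory (the index computation is only a placeholder)
    int idx = (lid * 37) % SEGMENT_SIZE;
    output[base + lid] = cache[idx];
}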
79 Work-group serialization Consider the following code
int algorithm(data_structure& data, bool mode)
{
    ...
    if (mode) data.x++;
    else data.x--;
}
__kernel void example(data_structure* data)
{
    if (data[thread_id].y > 1) algorithm(data[thread_id], true);
    else algorithm(data[thread_id], false);
}
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-79
80 Work-group serialization Analyzing the example The check for data[thread_id].y > 1 might lead to different results and thus different branches for different threads As only one instruction decoder is present, both branches have to be executed one after another Only if the result is the same for all work-items in a work group, execution is restricted to a single branch This problem is called work-group serialization Except for the different behavior depending on the mode flag both branches involve identical code In the given example it will possibly halve the performance This might become worse with more complex branches conditional execution where branches contain complex code should be avoided Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-80
81 Work-group serialization Improved version of above example
int algorithm(data_structure& data, bool mode)
{
    ...
    if (mode) data.x++;
    else data.x--;
}
__kernel void example(data_structure* data)
{
    algorithm(data[thread_id], data[thread_id].y > 1);
}
This clearly gives the same result, but the outer branch was removed Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-81
82 Work-group serialization Even more improved version of above example
int algorithm(data_structure& data, bool mode)
{
    ...
    data.x += 2 * mode - 1;
}
__kernel void example(data_structure* data)
{
    algorithm(data[thread_id], data[thread_id].y > 1);
}
Often conditions can be exchanged by other statements without branches Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-82
83 Work-group serialization Allowed branches At the end it shall be noted that conditions where different work-groups execute different branches do not influence the performance __kernel void example() { if (work_group_id % 2) algorithm1(); else algorithm2(); } As different work groups do not execute simultaneously on one compute unit, the restriction to one instruction decoder is not relevant Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-83
84 Common Source Common Source refers to a programming paradigm simplifying application development instead of an OpenCL performance optimization If different versions of an algorithm shall be created for OpenCL and another platform, it is desirable to stay with one common source code This is possible by placing the actual algorithm in a function that is included twice, in the OpenCL file and another C++ file Only two different wrapper functions are required and must be maintained Changes to the algorithm itself are done only once Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-84
85 Common Source Example
include.cpp:
int algorithm(data_structure& data) { ... }
opencl.cl:
#include "include.cpp"
__kernel void example(data_structure* data) { algorithm(data[thread_id]); }
other_version.cpp:
#include "include.cpp"
void example(data_structure* data) { for (int i = 0; i < DATA_SIZE; i++) algorithm(data[i]); }
Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-85
86 Real World HPC application Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten
87 ALICE HLT TPC Online Tracker We will now look at a real world HPC application that was ported to GPGPU The ALICE HLT Online Tracker is responsible for real-time track reconstruction for the ALICE experiment at the LHC (CERN) Problems emerging when porting the code and their solutions will be presented Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-87
88 ALICE HLT TPC Online Tracker First question: Which language to choose Some facts: A C++ CPU tracker code already existed It was desired to have a common code base for CPU / GPU The CPU code relies on AliROOT (which relies on ROOT) and therefore C++ was obligatory Using CUDA for the GPU tracker seemed the appropriate solution (OpenCL was also far away from a stable release) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-88
89 Track Reconstruction What is tracking In the LHC particles (protons or heavy ions) are accelerated to previously unreached energies and afterwards collide in the center of the detectors. In the collision (imagine an explosion) lots of daughter particles are produced that shall be analysed Measuring the trajectories of the particles is a crucial part of this analysis How to measure the trajectories It is impossible to measure the particle trajectory directly Instead only some discrete points of the trajectories can be measured, these are called clusters The crucial part now is to regain the trajectories from these clusters Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-89
90 Track Reconstruction [Three panels: the trajectories; the trajectories together with the measured clusters; only the clusters] The third graph shows the input data for the tracking algorithm. It is now a combinatorial challenge to restore the original tracks. Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-90
91 Tracking Algorithm How to extract trajectories from 3-dimensional spacepoints The CPU tracking algorithm was originally designed to run on parallel computers The algorithm first determines some track candidates (tracklets) by fitting straight lines to 3 adjacent clusters These candidates can then be extrapolated, and new clusters close to the extrapolated tracks can be added Extrapolation and fitting for different candidates can be performed in parallel This seems ideal for massively parallel computing, the more tracks the better Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-91
92 How to fit / extrapolate Fitting and extrapolating in a free 3-dimensional space is complicated The following assumption is made to simplify and speed up the tracking As tracks are supposed to originate from the interaction point fitting and extrapolation is done in radial direction only radial space coordinates are discrete with 159 values possible (called rows) angular coordinates in every row are continuous Extrapolation is done in two steps: upwards and downwards (to the next / to the previous row) to extend the initial candidate in both directions In each extrapolation step the continuous angular coordinates for the next / previous row are calculated Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-92
93 Tracking Proton Events Real Proton-Proton collision at ALICE Does not look so complex? Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-93
94 Tracking Heavy-Ion Events Simulated lead-lead collision: Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-94
95 Tracking Heavy-Ion Events Tracks reconstructed in a lead-lead collision Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-95
96 Tracking Heavy-Ion Events To make things even worse: central Pb-Pb Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-96
97 Tracking on GPU Initial GPU tracker performance (first working port, GTX285 vs. Nehalem 3.2 GHz) Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-97
98 Performance analysis GPU tracker is not faster than a state-of-the-art CPU In contrast to the speed-ups presented by NVIDIA this looks frustrating BUT: This is an optimized CPU application and not 10-year-old Fortran code, so no speedup of 10+ should be expected FURTHER: This was the very first try, so let us examine how to improve the tracker Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-98
99 Performance analysis Remember GPU paradigm: Keep all GPU processors active Ensure a high workload during kernel execution Make sure that there is always a kernel in the queue that can be executed We will now apply these criteria to the tracker Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-99
100 Initial GPU Utilization Active GPU Threads for the First Implementation Active GPU Threads: 19% Colors represent Tracklet Constructor Steps: Black: Idle Blue: Track Fit Green: Track Extrapolation Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-100
101 Problem with First Implementation The tracks greatly differ in length Some candidates are immediately dropped Some candidates are extended to up to 100 clusters In the first implementation every track candidate was handled by a different thread When some threads in a work-group have already dropped their track candidate while others have not, they are idling, because only one single instruction decoder is present It is better to stop the tracking in between, and exchange dropped candidates with new ones Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-101
102 Fixing GPU Utilization / Scheduling Idea: Introduce Tracklet Pools The set of 159 rows is divided into n row-blocks Tracklets are fitted and extrapolated for rows in one row-block only Afterwards the extrapolation is interrupted All tracklets that are still active (the track has not ended yet) are stored in the tracklet pool of the next row block. However, if the extrapolation is finished, the track is stored The threads then fetch new tracklets from a tracklet pool This way, after every interrupt, the active tracklets are redistributed among the threads, and no threads have to idle when their track has ended Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-102
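In pseudocode the scheduling idea might be sketched roughly as follows; this is only an illustration of the concept with invented helper functions, not the actual tracker code:
int tracklet_id;
for (int block = 0; block < n_row_blocks; block++)
{
    // each thread repeatedly grabs an active tracklet from the pool of the current row-block
    while ((tracklet_id = fetch_from_pool(block)) >= 0)
    {
        extrapolate_through_rows(tracklet_id, block);  // process only the rows of this block
        if (tracklet_still_active(tracklet_id))
            store_in_pool(tracklet_id, block + 1);     // continue in the next row-block
        else
            store_final_track(tracklet_id);            // the track has ended
    }
}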
103 Improved GPU Utilization Active GPU Threads using Dynamic Scheduling Active GPU Threads: 62% Colors represent Tracklet Constructor Steps: Black: Idle Blue: Track Fit Green: Track Extrapolation Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-103
104 Tracklet Scheduling Summary Tracklet Scheduling Before, many threads were idling after their track had ended, while other threads in the same work-group were still extrapolating other tracks This could be overcome by introducing a tracklet scheduler, which redistributes the tracklets among the threads Tracklet Scheduling Performance Without the scheduling only 19% of the threads were active Occupancy rose to 62% using the scheduler GPU utilization increased by 226% Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-104
105 Initialization / Output Tracker Initialization / Tracklet Output Some steps of the tracking algorithm cannot be easily ported to run on GPU The cluster data is fetched from the network and the final tracks are sent to other nodes in the network The initial data received is converted to a format suitable for the tracking algorithm The conversion process touches every bit only once, rendering a transfer to the GPU non-performant The same holds for the conversion of the tracklet output data Running Initialization, Tracking and Output in a loop iterating over events is not optimal The GPU is idling during initialization and output, wasting a lot of computational performance Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-105
106 CPU / GPU pipelining Asynchronous Calculation The initialization / output of events on the CPU can overlap with the tracking of another event on the GPU There is no need for the transfer of data to the GPU to be governed by the CPU Instead the data should be transferred using Direct Memory Access (DMA) This way 3 tasks can be performed in parallel Pipelining A pipeline was introduced in the tracking algorithm The GPU performs the tracking for event n-2, while event n-1 is transferred to the GPU and event n is initialized by the CPU It is only necessary to ensure that the GPU is always fully utilized, as the GPU now is the main processor for computing Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-106
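A rough conceptual sketch of such a pipeline loop; all helper functions are placeholders invented for this illustration, and buffer management and error handling are omitted:
for (int i = 0; i < n_events + 2; i++)
{
    if (i >= 2)
        enqueue_tracking_on_gpu(i - 2);   // non-blocking: GPU tracks the oldest prepared event
    if (i >= 1 && i - 1 < n_events)
        enqueue_transfer_to_gpu(i - 1);   // non-blocking DMA transfer of the previously initialized event
    if (i < n_events)
        initialize_event_on_cpu(i);       // meanwhile the CPU converts the raw data of the next event
    synchronize_pipeline_stage();         // wait before device buffers are reused in the next iteration
}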
107 Pipelining visualization The diagram below shows the tracking of 12 slices in a pipelined way Only during the initialization of the first slice is the GPU idling Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-107
108 GPU Tracker Speedup The following final speedup was achieved: Volker Lindenstruth ( 15. November 2011 Copyright, Goethe Uni, alle Rechte vorbehalten L02-108
Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0
Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Java GPU Computing. Maarten Steur & Arjan Lamers
Java GPU Computing Maarten Steur & Arjan Lamers Overzicht OpenCL Simpel voorbeeld Casus Tips & tricks Vragen Waarom GPU Computing Afkortingen CPU, GPU, APU Khronos: OpenCL, OpenGL Nvidia: CUDA JogAmp JOCL,
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware 25 August 2014 Copyright 2001 2014 by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol
GPU Hardware Performance. Fall 2015
Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using
Programming Guide. ATI Stream Computing OpenCL. June 2010. rev1.03
Programming Guide ATI Stream Computing OpenCL June 2010 rev1.03 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon, FireStream, FirePro, Catalyst,
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
Introduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
Writing Applications for the GPU Using the RapidMind Development Platform
Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH
Equalizer Parallel OpenGL Application Framework Stefan Eilemann, Eyescale Software GmbH Outline Overview High-Performance Visualization Equalizer Competitive Environment Equalizer Features Scalability
GPI Global Address Space Programming Interface
GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1 GPI Global address space programming
Radeon HD 2900 and Geometry Generation. Michael Doggett
Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
Optimizing Code for Accelerators: The Long Road to High Performance
Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?
COSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
QCD as a Video Game?
QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
AMD Accelerated Parallel Processing. OpenCL Programming Guide. November 2013. rev2.7
AMD Accelerated Parallel Processing OpenCL Programming Guide November 2013 rev2.7 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Accelerated Parallel Processing, the
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
An Implementation Of Multiprocessor Linux
An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
GPU Architecture. Michael Doggett ATI
GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super
PDC Summer School Introduction to High- Performance Computing: OpenCL Lab
PDC Summer School Introduction to High- Performance Computing: OpenCL Lab Instructor: David Black-Schaffer Introduction This lab assignment is designed to give you experience
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008
Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer
Texture Cache Approximation on GPUs
Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, [email protected] 1 Our Contribution GPU Core Cache Cache
SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014. Copyright Khronos Group 2014 - Page 1
SYCL for OpenCL Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014 Copyright Khronos Group 2014 - Page 1 Where is OpenCL today? OpenCL: supported by a very wide range of platforms
NVPRO-PIPELINE A RESEARCH RENDERING PIPELINE MARKUS TAVENRATH [email protected] SENIOR DEVELOPER TECHNOLOGY ENGINEER, NVIDIA
NVPRO-PIPELINE A RESEARCH RENDERING PIPELINE MARKUS TAVENRATH [email protected] SENIOR DEVELOPER TECHNOLOGY ENGINEER, NVIDIA GFLOPS 3500 3000 NVPRO-PIPELINE Peak Double Precision FLOPS GPU perf improved
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,
Chapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, [email protected]
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
1 Storage Devices Summary
Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
CUDA Programming. Week 4. Shared memory and register
CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK
Lecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010
GPU Architecture An OpenCL Programmer s Introduction Lee Howes November 3, 2010 The aim of this webinar To provide a general background to modern GPU architectures To place the AMD GPU designs in context:
A3 Computer Architecture
A3 Computer Architecture Engineering Science 3rd year A3 Lectures Prof David Murray [email protected] www.robots.ox.ac.uk/ dwm/courses/3co Michaelmas 2000 1 / 1 6. Stacks, Subroutines, and Memory
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
WebCL for Hardware-Accelerated Web Applications. Won Jeon, Tasneem Brutch, and Simon Gibbs
WebCL for Hardware-Accelerated Web Applications Won Jeon, Tasneem Brutch, and Simon Gibbs What is WebCL? WebCL is a JavaScript binding to OpenCL. WebCL enables significant acceleration of compute-intensive
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen
