OpenCL for programming shared memory multicore CPUs
Akhtar Ali, Usman Dastgeer and Christoph Kessler. OpenCL on shared memory multicore CPUs. Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012.

OpenCL for programming shared memory multicore CPUs

Akhtar Ali, Usman Dastgeer, and Christoph Kessler
PELAB, Dept. of Computer and Information Science, Linköping University, Sweden

Abstract. Shared memory multicore processor technology is pervasive in mainstream computing. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos group; it focuses on programming a possibly heterogeneous set of processors with many cores, such as CPU cores, GPUs and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiplication, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation. The evaluation includes the effect of compiler optimizations for the different frameworks, OpenCL performance on different vendors' platforms, and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.

1 Introduction

Shared memory multicore processor technology evolved in response to physical limiting factors such as clock speed and heat/power problems with single-processor technology. But this shift of hardware technology from single core to many cores required developers to write programs with parallelism to exploit all available cores. Parallel programming paradigms abstracting thread management activities, such as OpenMP and Intel's TBB, evolved over time to help developers divide a single algorithm's workload over different cores. Because of the specialized architecture of GPUs, which have higher arithmetic intensity and are thus more suitable for data parallel applications, the notion of general purpose GPU computing (GPGPU) emerged at the same time. With the growing focus on parallel computing, an environment of heterogeneous hardware architectures became inevitable, and the need for a programming framework offering a uniform interface to these heterogeneous architectures was realized. The Open Computing Language (OpenCL), a standard API based on the ISO C99 standard, arose from this need.
OpenCL abstracts the underlying hardware and enables developers to write code that is portable across different shared-memory architectures. Individually, these frameworks have been studied extensively, e.g., OpenMP in [2, 12], TBB in [6, 13] and OpenCL in [7, 8]. There is still a lot of work focused only on comparisons of CPU-specific [9, 10] or GPU-specific [1, 4, 5] frameworks, but there is comparatively little work evaluating OpenCL CPU performance against that of state-of-the-art CPU-specific frameworks. Evaluating OpenCL on CPUs is becoming increasingly important as OpenCL implementations for CPUs are available from AMD and Intel. Due to its promise of code portability, OpenCL could provide the programming foundation for modern heterogeneous systems. An important study in this direction is [3], which examines overhead, scalability and usability of five shared memory parallelism frameworks, including OpenCL, on a 2D/3D image registration application. In our work, we choose OpenMP and TBB as popular representatives of CPU-specialized frameworks and compare these with OpenCL on five benchmark applications to provide a wider base for comparison. We do the evaluation in the following aspects:

- OpenCL as an alternative to Intel TBB and OpenMP for programming multicore CPUs: how does it compare in terms of performance, compiler support, programming effort and code size.
- OpenCL-specific issues: the performance implications of different vendors' OpenCL implementations, CPU-specific vs. GPU-specific optimizations for an OpenCL algorithm, and the performance implications of these optimizations.

2 Frameworks

2.1 OpenMP

Open Multi-Processing (OpenMP) provides parallelization facilities in C/C++ and Fortran by using preprocessor directives/pragmas, library routines and environment variables, and is supported by most compilers today. These directives enable generation of multi-threaded code at compile time [1]. Although there exists a commercial implementation of OpenMP on clusters [2], its prime target is shared memory architectures. It provides a very easy to use, high-level application programming interface for multi-threading computationally intensive data and task parallel programs. Most OpenMP implementations use thread pools on top of the fork/join model; the pool exists throughout the execution of the program and thereby avoids the overhead of thread creation and destruction after each parallel region [11]. Its work-sharing constructs such as omp for in C/C++ (and omp do in Fortran) provide data parallelism, while omp sections and the omp task construct in OpenMP 3.0 support task parallelism. Portions of the code that are intended to be multi-threaded must be revised and modified to manage any dependencies before annotating those portions with OpenMP parallelization constructs. The directive-based approach and support for incremental parallelism make this model easy to use for parallelizing new as well as existing applications [12].
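For illustration (this is a minimal sketch and not code from the paper's benchmarks; array names and sizes are arbitrary), a work-sharing loop and two OpenMP 3.0 tasks can be expressed as follows:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // Data parallelism: loop iterations are divided among the threads
    // of the enclosing parallel region (static schedule).
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // Task parallelism: two independent units of work created by one thread,
    // executed by any threads of the team.
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        std::printf("task A on thread %d\n", omp_get_thread_num());
        #pragma omp task
        std::printf("task B on thread %d\n", omp_get_thread_num());
        #pragma omp taskwait
    }
    return 0;
}
```

The serial structure of the loop is preserved; only the pragmas are added, which is what makes the incremental-parallelism style described above possible.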
2.2 Intel TBB

Intel Threading Building Blocks (TBB) is Intel's technology for parallelism, comprising a template library for C++. Users specify logical parallelism and the TBB runtime library maps it onto threads, ensuring efficient usage of the available resources [6]. It relies on generic programming and supports both task parallelism and data parallelism. TBB's idea of parallelism is essentially object-oriented, since the parallelizable code is encapsulated in template classes in a special way and then invoked from the main program. It also provides concurrent containers such as concurrent_queue and concurrent_vector that make it even more suitable for data parallel programming. In contrast to models of parallelism that suggest a static division of work, TBB relies on recursively breaking a problem down to reach the right granularity level of parallel tasks; this technique shows better results than static division in complex situations and fits well with the task-stealing scheduler [6]. OpenMP also allows dynamic scheduling, but in simple cases where a fixed amount of work is known ahead of time, static scheduling is recommended since it incurs less run-time overhead than dynamic scheduling.
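To give a flavour of this programming style (an illustrative sketch, not code from the paper's benchmarks; function and variable names are ours), a data-parallel loop and a reduction over a blocked range can be written with C++ lambdas:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <functional>

// Scale every element in parallel, then return the parallel sum.
double scale_and_sum(std::vector<double>& v, double factor) {
    // parallel_for: TBB recursively splits the range into tasks that its
    // work-stealing scheduler distributes over the available cores.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;
        });

    // parallel_reduce: each subrange computes a partial sum starting from the
    // identity 0.0; partial results are combined with std::plus.
    return tbb::parallel_reduce(tbb::blocked_range<size_t>(0, v.size()), 0.0,
        [&](const tbb::blocked_range<size_t>& r, double acc) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];
            return acc;
        },
        std::plus<double>());
}
```

The same effect can be obtained with explicit functor classes, which is the encapsulation-in-template-classes style referred to in the usability discussion later in the paper.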
2.3 OpenCL

OpenCL is the first royalty-free, open standard for programming modern heterogeneous architectures. An OpenCL program can run on multicore processors and graphics cards, and support is planned for DSP-like accelerators [3]. OpenCL code is just-in-time (JIT) compiled at runtime, which avoids dependencies on the instruction set and supporting libraries. The dynamic compilation enables better utilization of the underlying device's latest software and hardware features, such as the SIMD capability of the hardware [7]. In OpenCL terminology, a program runs on an OpenCL device (CPU, GPU, etc.) that holds compute units (one or more cores), which in turn may include one or more single-instruction multiple-data (SIMD) processing elements. Besides hiding threads, OpenCL goes a step further by abstracting hardware architectures and provides a common parallel programming interface. It creates a programming environment comprising a host CPU and connected OpenCL devices, which may or may not share memory with the host and might have different machine instruction sets [7].

The OpenCL memory model consists of host-side memory and four types of memory on the device side: global, constant, local and private. Every single element of the problem domain is called a work-item, while work-items can be grouped together to form a work-group. Global memory allows reads/writes by all work-items in all work-groups but has high access latency, so its use must be kept minimal. Constant memory is a part of global memory which retains its constant values throughout kernel execution. Local memory can be used to make variables shared within a work-group, as all work-items of the work-group can read and write to it. Private memory is only visible to individual work-items; each can modify or read only its own visible data. GPUs have an on-chip Local Data Share (LDS) and a separate private memory bank per compute unit, which correspond to OpenCL local and private memory, respectively. CPUs, on the other hand, implement private memory as registers/L1 cache, local memory as L2/L3 cache, global memory as main memory, and OpenCL constant memory as main memory/cache, but the exact implementations are architecture and OpenCL implementation dependent. The host program running on the host CPU creates memory objects in global memory using OpenCL APIs and enqueues memory commands operating on these memory objects, which can be synchronized using command-queue barriers or context events [8]. AMD, Intel, IBM, Nvidia and Apple are some well-known vendors who have implemented the OpenCL standard.

[Figure 1: Matrix multiplication performance of OpenMP, TBB and OpenCL, with compiler optimizations disabled (-O0 / -cl-opt-disable) and aggressive (-O3 / -cl-fast-relaxed-math); x-axis: matrix order.]
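As a concrete illustration of these concepts (a simplified sketch, not the paper's code: a single CPU device is assumed and error handling is omitted), a host program JIT-compiles a kernel from source and launches one work-item per element of a global buffer:

```cpp
#include <CL/cl.h>
#include <vector>
#include <cstdio>

// Kernel source, compiled at runtime by the OpenCL implementation of the
// selected device. Each work-item scales the element at its global id.
static const char* kSrc = R"(
__kernel void scale(__global float* data, float factor) {
    size_t i = get_global_id(0);
    data[i] = data[i] * factor;
})";

int main() {
    const size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Memory object in global memory, initialized from host memory.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host.data(), nullptr);

    // Dynamic (JIT) compilation of the kernel for the chosen device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, "-cl-fast-relaxed-math", nullptr, nullptr);
    cl_kernel kern = clCreateKernel(prog, "scale", nullptr);

    float factor = 2.0f;
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kern, 1, sizeof(float), &factor);

    // One work-item per element; the local (work-group) size is left to the
    // runtime by passing NULL.
    clEnqueueNDRangeKernel(queue, kern, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host.data(),
                        0, nullptr, nullptr);
    std::printf("host[0] = %f\n", host[0]);

    clReleaseMemObject(buf); clReleaseKernel(kern); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```

This setup code is the fixed, reusable part of every OpenCL application mentioned in the usability evaluation; only the kernel source and argument handling change per benchmark.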
3 Applications

Five benchmarks, namely matrix multiplication, LU factorization, 2D Gaussian image convolution, Pi value approximation and histogram generation, are chosen for investigating the parallelism frameworks. Matrix computations such as matrix multiplication are widely used in scientific computing [14]. Linear algebra computations are ubiquitous in science and engineering and have applications in, e.g., graphics programming, artificial intelligence and data mining. LU factorization is a technique for solving linear equation systems. The algorithm used derives the lower and upper triangles in the same input matrix by using the row-major order variant of the right-looking algorithm, which accesses consecutive elements of the matrix and thus improves performance due to cache effects.

Convolution and histogram generation are important building blocks of many image processing applications. Convolution is used to apply different kinds of effects such as blur, sharpen or emboss [15]. A Gaussian filter is separable, i.e., symmetric around its center, and can be decomposed into a 1D row vector filter and a 1D column vector filter that are applied to the image in the horizontal and vertical directions, respectively. This has the advantage of doing only n + n multiplications compared to n × n multiplications in the 2D case, where n is the filter size in each dimension. Histograms are used to compare images or to modify their intensity spectrum [9]. For a simple gray-scale image, a histogram generation function maps the frequency of the intensity levels in an image to the gray-level range. Pi approximation computes the area under the curve y = 4/(1 + x²) between 0 and 1. The only input parameter is the precision control value.

[Figure 2: LU decomposition performance of OpenMP, TBB and OpenCL, with compiler optimizations disabled (-O0 / -cl-opt-disable) and aggressive (-O3 / -cl-fast-relaxed-math); x-axis: matrix order.]

[Figure 3: Gaussian convolution performance of OpenMP, TBB and OpenCL, with compiler optimizations disabled (-O0 / -cl-opt-disable) and aggressive (-O3 / -cl-fast-relaxed-math); x-axis: matrix order.]
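To make the n + n versus n × n argument concrete (an illustrative sequential sketch with hypothetical names, not the benchmark code used in the paper; an odd filter length n = 2r + 1 and clamped borders are assumed), a separable convolution is simply a horizontal 1D pass followed by a vertical 1D pass:

```cpp
#include <vector>
#include <cstddef>

using Image = std::vector<float>;   // row-major, width * height floats

static float clampedAt(const Image& img, int w, int h, int x, int y) {
    if (x < 0) x = 0; if (x >= w) x = w - 1;
    if (y < 0) y = 0; if (y >= h) y = h - 1;
    return img[static_cast<size_t>(y) * w + x];
}

// Two 1D passes of a filter of length n cost n + n multiplications per
// output pixel instead of the n * n of a full 2D convolution.
Image separableConvolve(const Image& src, int w, int h,
                        const std::vector<float>& filter1D) {
    const int n = static_cast<int>(filter1D.size());
    const int r = n / 2;                      // filter radius (n assumed odd)
    Image tmp(src.size()), dst(src.size());

    // Horizontal pass: n multiplications per pixel.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int k = -r; k <= r; ++k)
                acc += filter1D[k + r] * clampedAt(src, w, h, x + k, y);
            tmp[static_cast<size_t>(y) * w + x] = acc;
        }

    // Vertical pass over the intermediate image: another n multiplications.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int k = -r; k <= r; ++k)
                acc += filter1D[k + r] * clampedAt(tmp, w, h, x, y + k);
            dst[static_cast<size_t>(y) * w + x] = acc;
        }
    return dst;
}
```

In the parallel versions, the outer pixel loops of each pass are the natural place to apply the frameworks' data-parallel constructs.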
4 Evaluation

Our test applications are parallelized from the sequential C/C++ implementations by applying the standard CPU-specific optimizations available in each framework, in order to experience the programming complexity of these frameworks. For the OpenCL implementations, optimizations such as vectorization and the AoS (Array of Structures) data storage format are applied to improve performance on CPUs. Unless specified otherwise, the experiments are carried out on an Intel Xeon E5345 CPU with 8 cores running at 2.33 GHz. AMD SDK v2.4, supporting OpenCL 1.1, is used together with the Intel C++ compiler.

4.1 Performance and compiler optimizations

In this section, we evaluate the performance of the OpenCL, OpenMP and Intel TBB implementations of each of the five benchmark applications. Furthermore, to investigate the effectiveness of compiler optimizations for the different frameworks, we have tried different compiler optimization switches with the Intel C++ compiler. To show the effect of compilation on the actual performance, we compare execution with disabled compiler optimizations to execution with aggressive compiler optimizations. For executions with disabled compiler optimizations, option -O0 is used during program compilation; for OpenCL dynamic compilation, a comparable effect is achieved by passing the -cl-opt-disable compiler option. For executions with aggressive compiler optimizations, option -O3 is used during program compilation, which consists of option -O2 (which enables speed-specific optimizations, e.g., vectorization) plus memory access and loop optimizations such as loop unrolling, code replication to remove branches, and loop blocking to improve cache usage. For OpenCL dynamic compilation, a similar effect is achieved by passing the -cl-fast-relaxed-math option (used for all applications except the Pi application, as it is sensitive to high precision). In the following, we discuss the effects of compiler optimizations for the different benchmarks.

Figure 1 shows that OpenCL outperforms the other two in matrix multiplication at both disabled and aggressive optimization levels, while OpenMP beats TBB when aggressive compiler optimizations are enabled. Figure 2 shows that OpenMP yields the best performance for all inputs in LU factorization, while OpenCL shows the comparatively slowest results. The rationale behind this could be traced to the kernel algorithm, which sets a workgroup along each row of the input matrix for synchronization purposes using local memory; otherwise, the OpenCL runtime should be allowed to choose the optimal workgroup size. The gap between TBB and OpenMP widens at the aggressive optimization level, which means that OpenMP benefited more from compiler optimization than TBB, as shown in Figure 2. For 2D image convolution, TBB performs comparatively slower, while OpenMP and OpenCL perform equally well with slightly better OpenCL performance at large input sizes, as shown in Figure 3. It also demonstrates that the performance gap among the three frameworks narrows with compiler optimizations. Pi value approximation uses reduction; here TBB shows better performance than OpenMP with no compiler optimizations, while OpenCL shows the best performance, as Figure 4 indicates. The graph clearly demonstrates that an explicitly vectorized OpenCL kernel significantly beats a simple OpenCL kernel and the other models in speedup.
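A sketch of what such explicit vectorization can look like (illustrative only: kernel names, the work distribution and the omitted host-side reduction of partial sums are our assumptions, not the paper's implementation): the scalar kernel accumulates one midpoint rectangle of 4/(1 + x²) per iteration, while the vectorized variant processes four abscissae at a time using the OpenCL float4 type.

```cpp
// OpenCL C source for two illustrative Pi kernels, kept as a C++ string
// constant for JIT compilation. Each work-item integrates 'chunk' rectangles
// over [0, 1] with the midpoint rule and writes its partial result; the final
// reduction of 'partial' is assumed to happen on the host or in a later stage.
static const char* piKernels = R"(
__kernel void pi_simple(__global float* partial, const int n, const int chunk) {
    int gid = get_global_id(0);
    float step = 1.0f / n;
    float sum = 0.0f;
    for (int i = gid * chunk; i < (gid + 1) * chunk && i < n; ++i) {
        float x = (i + 0.5f) * step;
        sum += 4.0f / (1.0f + x * x);
    }
    partial[gid] = sum * step;
}

__kernel void pi_vectorized(__global float* partial, const int n, const int chunk) {
    int gid = get_global_id(0);
    float step = 1.0f / n;
    float4 sum4 = (float4)(0.0f);
    // Four consecutive midpoints per iteration, one per float4 lane.
    // Remainder iterations (when chunk is not a multiple of 4) omitted here.
    for (int i = gid * chunk; i + 3 < min((gid + 1) * chunk, n); i += 4) {
        float4 x = ((float4)((float)i, (float)(i + 1), (float)(i + 2), (float)(i + 3))
                    + 0.5f) * step;
        sum4 += 4.0f / (1.0f + x * x);
    }
    partial[gid] = (sum4.x + sum4.y + sum4.z + sum4.w) * step;
}
)";
```

The vector form exposes SIMD parallelism directly to the OpenCL compiler instead of relying on its auto-vectorizer, which is the effect discussed above and again in the vendor-platform comparison below.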
With the Intel compiler's default optimization level, the gap between the OpenMP and TBB performance narrows, while the OpenCL kernel is still considerably faster. This shows that enabling explicit vectorization in OpenCL enhances speedup on CPUs. Histogram generation is also a reduction application and is run at the default compiler optimization level. Its performance graph is shown in Figure 5, where TBB shows slightly lower performance while OpenMP and OpenCL match well over the whole range of matrix sizes. This performance behavior of the frameworks is similar to the compiler-optimized version of convolution.

[Figure 4: Pi approximation performance of OpenMP, TBB, and simple vs. vectorized OpenCL kernels, with optimizations disabled (-O0 / -cl-opt-disable) and at the default level (-O2 / default OpenCL build options); x-axis: precision controller.]

[Figure 5: Histogram generation performance of the serial version, OpenMP, TBB and OpenCL; x-axis: matrix order (K).]
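Because histogram generation is a reduction, the race on shared bins has to be handled explicitly; one common pattern (a sketch of a possible OpenMP formulation, not necessarily the one used in the paper) is to accumulate into thread-private histograms and merge them at the end:

```cpp
#include <omp.h>
#include <vector>
#include <cstdint>

// Gray-scale histogram as a reduction with thread-private bins.
// 'image' holds 8-bit intensity values; the result has 256 bins.
std::vector<long> histogram(const std::vector<uint8_t>& image) {
    const int kBins = 256;
    std::vector<long> hist(kBins, 0);

    #pragma omp parallel
    {
        std::vector<long> local(kBins, 0);      // private to this thread

        // Each thread counts its share of the pixels without synchronization.
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(image.size()); ++i)
            ++local[image[i]];

        // Merge the private histograms into the shared result.
        #pragma omp critical
        for (int b = 0; b < kBins; ++b)
            hist[b] += local[b];
    }
    return hist;
}
```

The OpenCL and TBB variants follow the same idea: per-work-group or per-task partial histograms followed by a final merge.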
4.2 Scalability Evaluation

OpenCL involves the highest overhead in LU factorization and shows a performance improvement beyond five cores, as shown in Figure 6. TBB incurs comparatively low overhead for a small number of cores but scales almost like OpenCL for a high number of cores. OpenMP beats TBB and OpenCL significantly and scales best for all numbers of cores, incurring zero overhead for a single core. Figure 7 illustrates that TBB and OpenCL incur some overhead on a single core, but all three frameworks demonstrate identical performance when multiple cores are used. OpenMP with statically scheduled loops shows no overhead for one core and scales better for all numbers of cores compared to the dynamically scheduled loops in TBB. The zero overhead of OpenMP execution on a single core suggests that some OpenMP implementations have a special code path for a single thread that effectively adds only a single branch. OpenCL shows some overhead on a single core, but compared to TBB it catches up well with OpenMP when two or more cores are used. These results somewhat conform with [10] but are in contrast with [3], where OpenMP incurs the most overhead and TBB scales best for all numbers of cores. The rationale behind this contrast could be compiler differences and the suitability of different frameworks to different applications. In our case, OpenMP's static scheduling, compared to TBB's dynamic scheduling, was well suited to the convolution and LU decomposition algorithms, since these applications have a fixed number of loop iterations with an equal distribution of work across the loop iterations.

[Figure 6: LU decomposition performance scaling of the serial version, OpenMP, TBB and OpenCL over the number of cores.]
[Figure 7: Image convolution performance scaling of the serial version, OpenMP, TBB and OpenCL over the number of cores.]

4.3 OpenCL Platforms Evaluation

In this section, we evaluate two OpenCL implementations available for multicore CPUs: AMD OpenCL SDK 2.5 and Intel OpenCL SDK LINUX 1.1, on a system with an Intel Xeon E5520 CPU hosting 16 cores, each running at 2.27 GHz, and the gcc compiler. Interestingly, Intel's OpenCL outperformed AMD's in three out of the four tested applications with a clear margin in the performance test on CPUs, as shown in Figure 8. This behavior suggests that Intel enables CPU-specific auto-optimizations, namely auto-vectorization, in its JIT OpenCL C compiler. When the Pi calculation kernel was explicitly vectorized, AMD's running time dropped significantly compared to that of Intel, though Intel still outperforms AMD. As shown in Figure 8, the vectorized kernel increased Intel's speedup by 125% and that of AMD by around 320%. This significant relative difference for AMD shows that the Intel JIT OpenCL C compiler was already exploiting auto-vectorization to some extent on Intel CPUs even when it was not explicitly programmed in the kernel.

4.4 Code and performance portability

Experiments in this section are performed on an Nvidia Tesla M2050 GPU with the Nvidia OpenCL driver, which implements the OpenCL 1.1 specification. To test code portability, we run all our OpenCL implementations on the GPU. In all cases, the code executed on the GPU without requiring any changes. This strengthens the OpenCL claim of code portability across different hardware architectures with correctness of execution. Although the code is portable, performance portability is far from guaranteed, as the optimizations for an OpenCL implementation on CPUs and on GPUs are quite different. We have seen during our experiments that most OpenCL implementations designed specifically for GPUs actually performed poorly on multicore CPUs, sometimes even worse than a sequential implementation.
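In practice, the portability observation boils down to the fact that only the device selection changes when the same OpenCL code moves between a CPU and a GPU; the kernel source and the rest of the host logic stay untouched. A hedged sketch (helper name and structure are ours; error handling is mostly omitted):

```cpp
#include <CL/cl.h>
#include <vector>
#include <cstdio>

// Return the first device of the requested type found on any platform.
// Called with CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU; everything else in
// the application (kernel source, buffers, launches) remains identical.
cl_device_id pickDevice(cl_device_type type) {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_device_id dev;
        if (clGetDeviceIDs(p, type, 1, &dev, nullptr) == CL_SUCCESS) {
            char name[256] = {0};
            clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("using device: %s\n", name);
            return dev;
        }
    }
    return nullptr;   // no device of the requested type found
}
```

Functional portability of this kind does not imply performance portability, as the following comparison shows.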
In the following, we check performance portability by comparing the performance of our OpenCL algorithms that were primarily written for CPU execution with optimized GPU implementations of the same applications on the GPU. The GPU-optimized implementations are taken from the NVIDIA OpenCL SDK code samples. The intent is to see the extent of performance portability when a single OpenCL code base written for CPUs is naively moved to another common architecture (GPUs). Figure 9 shows performance comparisons of OpenCL implementations written for CPU and for GPU execution, both run on the GPU. The speedups are calculated with respect to sequential CPU execution on a single core. The results can be interpreted in two ways. On the one hand, performance portability is so far sub-optimal: architecture-specific optimizations have a significant performance impact, and for histogram generation the difference is up to 15 times, which is quite substantial. On the other hand, not only is the code portable, it also gives some performance improvement when executed on a GPU. For applications such as image convolution, the speedup increased from 5 times to 50 times when running the same implementation on a powerful GPU (see Figure 8 and Figure 9).

[Table 1: Time/effort (T/E) and number of lines of code (LOC) for the given benchmarks for the three frameworks (OpenMP, TBB, OpenCL). MM = matrix multiplication, LU = LU factorization, CONV = image convolution. +++ denotes the least time/effort. For OpenCL, the LOC value is split into a fixed part shared by all applications and an application-specific part.]

[Figure 8: OpenCL vendor platform evaluation on CPUs; speedup of the AMD and Intel OpenCL implementations for convolution, histogram, matrix multiplication, Pi and vectorized Pi.]

[Figure 9: Performance comparison on an NVIDIA GPU; speedup of the CPU-specific versus the GPU-specific algorithm for convolution, histogram and matrix multiplication.]

4.5 Usability Evaluation

Introducing parallelism using OpenMP pragmas was the simplest approach and preserved the serial semantics, compared to the other two frameworks. The only delicate issue during the application of OpenMP constructs, besides making loops independent, was choosing the right variables to be made shared, private or firstprivate, etc. However, OpenMP pragmas spread throughout the serial program affect readability, since they add further hierarchies to the nested loops, thereby making the scope of every loop and construct somewhat more complex.
TBB needs the parallelized portions of a program to be encapsulated in template classes, which required medium effort. It moves parallel code out of the serial structure of the main program, contributing to an object-oriented style and better program readability. The equivalent subtle issue with TBB, as also pointed out in [10], compared to OpenMP variable sharing, was to select the right variables to be included in the parallel_for/parallel_reduce objects' class definitions. Similarly, the effort for identifying the most suitable scheduling strategy out of static, dynamic and guided for OpenMP resembled the effort for experimenting with the default auto-partitioner, the simple-partitioner or the affinity-partitioner and grain sizes in TBB. However, TBB made the main program more manageable by moving looping structures into the template classes, while the use of lambda expressions in the convolution application made the code relatively compact but rendered the looping structure almost like OpenMP.

OpenCL, on the other hand, following the GPU programming fashion, constitutes quite a different style of programming. It involves a fixed amount of code and effort to initialize the environment; this was the baseline and was reused in all applications, so the successive implementations were easier in OpenCL. Since specifying workgroup sizes and using local/private memories in matrix multiplication and LU factorization did not help much or even degraded performance on CPUs, this extra effort can be avoided when programming in OpenCL for CPUs. Introducing explicit vectorization in the kernel, however, was worth the effort. Most of the optimization literature found is GPU-specific and does not significantly improve performance on CPUs, which makes OpenCL programming for CPUs comparatively easier than for GPUs.

A summary in Table 1 shows the time/effort, an empirical representation of the combined time in weeks for learning and mapping each application onto the corresponding framework by the user, and LOC, the number of lines of code for each parallelized application using these models in our implementations. The notation +++ denotes the lowest time/effort. The first value in the LOC column for OpenCL represents the number of lines of code that is fixed for all applications (initializing the OpenCL platform, devices, contexts, queues, etc.), while the second value shows the remaining OpenCL lines of code and is variable for every application.

5 Discussion and Conclusion

We evaluated the applicability of the OpenCL programming model for multicore CPUs on multiple benchmarks in comparison with two state-of-the-art CPU-specialized frameworks: OpenMP and Intel TBB. The overall performance and scalability numbers showed that OpenCL on CPUs yields competitive performance when compared with the CPU-specific frameworks. Enabling compiler optimizations had a significant impact on performance, particularly with OpenMP, since it takes a compiler-directive-based approach to parallelism. The OpenCL compiler has optimization options for tuning performance similar to the other frameworks. Enabling explicit vectorization in OpenCL kernels improved performance, while the use of local/private memory did not yield major performance improvements.
On Intel CPUs, the Intel platform outperformed AMD in four out of five tested applications. This shows that Intel's OpenCL has better implicit optimizations for Intel CPUs. OpenCL code written for CPUs could successfully run on an Nvidia platform with Tesla GPUs, which shows OpenCL code portability and correctness of execution across different platforms and hardware architectures. However, performance portability is still by far sub-optimal. The performance of the CPU-specific code on the GPU reasonably surpassed its CPU performance, but it did not utilize the capacity of highly data parallel GPUs as the GPU-optimized algorithms did. This is mainly due to substantial architectural differences between CPUs and GPUs, which make it difficult to write one algorithm that can execute efficiently on each platform. Furthermore, the implementation experience with all three frameworks is documented in a usability evaluation.

The competitive multicore CPU performance of OpenCL, at the cost of adopting a different programming style, may be worth the effort since OpenCL offers code portability across a number of different architectures. This may have significant implications for future high performance computing as data parallel multicore GPUs enter mainstream computing. Having a single code base that can execute on a variety of hardware architectures, even if it gives sub-optimal performance, can still be desirable due to the cost of maintaining separate code bases. In future work, we aim to test our OpenCL code on AMD CPUs and AMD GPUs to complement our study with a comparison of OpenCL performance on vendor-dependent platforms and of their implicit optimizations for their own hardware architectures.

Acknowledgements: Funded by EU FP7, project PEPPHER, grant #.

References

1. P. Du et al. From CUDA to OpenCL: Towards a Performance-portable Solution for Multiplatform GPU Programming. Technical report, Dept. of Computer Science, UTK.
2. B. Chapman, L. Huang. Enhancing OpenMP and Its Implementation for Programming Multicore Systems. Parallel Computing: Architectures, Algorithms and Applications.
3. R. Membarth et al. Frameworks for Multi-core Architectures: A Comprehensive Evaluation using 2D/3D Image Registration. Architecture of Computing Systems, 24th Int. Conf.
4. K. Komatsu et al. Evaluating Performance and Portability of OpenCL Programs. 5th Intl. Workshop on Automatic Performance Tuning.
5. K. Karimi et al. A Performance Comparison of CUDA and OpenCL. CoRR.
6. J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc.
7. J. Stone et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Co-published by the IEEE CS and the AIP.
8. J. Gervasi et al. The AES implantation based on OpenCL for multi/many core architecture. Intl. Conf. on Computational Science and Its Applications.
9. R. Membarth et al. Comparison of Parallelization Frameworks for Shared Memory Multi-Core Architectures. Proc. of the Embedded World Conference, Nuremberg, Germany.
10. P. Kegel et al. Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multicores. Euro-Par 2009: Parallel Processing, 15th Int. Euro-Par Conference.
11. A. Ali, L. Johnsson, J. Subhlok. Scheduling FFT Computation on SMP and Multicore Systems. 21st Intl. Conf. on Supercomputing (ICS).
12. A. Prabhakar et al. Performance Comparisons of Basic OpenMP Constructs. ISHPC.
13. J. Ma et al. Parallel Floyd-Warshall Algorithm Based on TBB. 2nd IEEE Int. Conf. on Information Management and Engineering.
14. P. Michailidis, K. Margaritis. Performance Models for Matrix Computations on Multicore Processors using OpenMP. 11th Intl. Conf. on Parallel and Distributed Computing, Applications and Technologies.
15. D. Kim et al. Image Processing on Multicore x86 Architectures. IEEE Signal Processing Magazine.