Convey HC-1 Hybrid Core Computer: The Potential of FPGAs in Numerical Simulation

Werner Augustin, Vincent Heuveline, Jan-Philipp Weiss

No. 1-7, Preprint Series of the Engineering Mathematics and Computing Lab (EMCL)

KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
Preprint Series of the Engineering Mathematics and Computing Lab (EMCL), ISSN, No. 1-7

Impressum
Karlsruhe Institute of Technology (KIT)
Engineering Mathematics and Computing Lab (EMCL)
Fritz-Erler-Str. 3, Karlsruhe, Germany

KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Published on the Internet under the following Creative Commons License:
Convey HC-1 Hybrid Core Computer: The Potential of FPGAs in Numerical Simulation

Werner Augustin, Jan-Philipp Weiss
Engineering Mathematics and Computing Lab (EMCL), SRG New Frontiers in High Performance Computing - Exploiting Multicore and Coprocessor Technology, Karlsruhe Institute of Technology, Germany

Vincent Heuveline
Engineering Mathematics and Computing Lab (EMCL), Karlsruhe Institute of Technology, Germany

Abstract: The Convey HC-1 Hybrid Core Computer brings FPGA technologies closer to numerical simulation. It combines two types of processor architectures in a single system: highly capable FPGAs are closely connected to a host CPU, and the accelerator-to-memory bandwidth reaches remarkable values. Reconfigurability by means of pre-defined application-specific instruction sets called personalities has the appeal of a hardware configuration optimized with respect to application characteristics. Moreover, Convey's solution eases the programming effort considerably. In contrast to hardware-centric and time-consuming classical coding of FPGAs, a dual-target compiler interprets pragma-extended C/C++ or Fortran code and produces implementations running on both host and accelerator. In addition, a global view of host and device memory is provided by means of a cache-coherent shared virtual memory space. In this work we analyze Convey's programming paradigm and the associated programming effort, and we present practical results on the HC-1. We consider vectorization strategies for the single and double precision vector personalities and a suite of basic numerical routines. Furthermore, we assess the viability of the Convey HC-1 Hybrid Core Computer for numerical simulation.

Keywords: FPGA, Convey HC-1, reconfigurable architectures, high-performance heterogeneous computing, coherent memory system, performance analysis, BLAS

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) have their main pillar and standing in the domain of embedded computing. Application-specific designs implemented in hardware description languages (HDLs) like VHDL and Verilog [1], [2] make them a perfect fit for specific tasks. From a software-oriented programmer's point of view, FPGA capabilities are hidden behind an alien hardware design development cycle. Although some C-to-HDL tools like ROCCC, Impulse C or Handel-C [3] are available, their viability and translation efficiency for realistic code scenarios still have to be proven. For several years, FPGAs have not been interesting for numerical simulation due to their limited capabilities and the resource requirements of double precision floating point arithmetic. But following Moore's law and with increasing FPGA sizes, more and more area is becoming available for computing. Moreover, further rates of increase are expected to outpace those of common multicore CPUs. For general deployment, and in particular for numerical simulation, FPGAs are attractive from further points of view: run-time configurability is an interesting topic for applications with several phases of communication and computation and might be considered for adaptive numerical methods. In addition, energy efficiency is a great concern in high performance computing, and FPGA technology is a possible solution approach. The main idea of FPGAs is to build one's own parallel fixed-function units according to the special needs of the underlying application. Currently, numerical simulation adopts all kinds of emerging technologies, and in this context a trend towards heterogeneous platforms has become apparent [4].
Systems accelerated by graphics processing units (GPUs) offer unrivaled computing power but often suffer from slow interconnection via PCIe links. The idea of connecting FPGAs closer to CPUs via socket replacements is nothing new (cf. technologies from Nallatech, DRC, XtremeData), but the software concept offered by Convey is revolutionary [5]. A related FPGA platform is Intel's reconfigurable Atom processor, an embedded single-board computer based on the Intel Atom E6x5C processor series, which pairs the CPU with an Altera FPGA in a single package; here, both entities communicate via PCIe x1 links. A former hybrid CPU-FPGA machine was the Cray XD1 [6].

In this work we outline the hardware and software architecture of the Convey HC-1 Hybrid Core Computer. We analyze Convey's programming concept and assess the functionalities and capabilities of Convey's single and double precision vector personalities. Furthermore, we evaluate the viability of the Convey HC-1 Hybrid Core Computer for numerical simulation by means of selected numerical kernels that are well-known building blocks for higher-level numerical schemes, solvers, and applications. Some performance results on the HC-1 can be found in [7]. Our work puts more emphasis on floating point kernels relevant for numerical simulation. Stencil applications on the HC-1 are also considered in [8].

II. HARDWARE CONFIGURATION OF THE CONVEY HC-1

The Convey HC-1 Hybrid Core Computer is an example of a heterogeneous computing platform. By its hybrid setup, specific application needs can either be handled by an x86 dual-core CPU or by the application-adapted FPGAs.
All computational workloads are processed by a 2.13 GHz Intel Xeon 5138 dual-core host CPU and by the application engines (AEs), a set of four Xilinx Virtex-5 LX330 FPGAs. Two further Virtex-5 FPGAs implement the host interface, the application engine hub (AEH), for data transfers and control flow exchange between host and device, and eight more Virtex-5 FPGAs build the accelerator's eight memory controllers. Data transfers are communicated via the CPU's front-side bus (FSB), an aging Intel technology. Across the whole system a cache-coherent shared virtual memory system is provided that allows access to data both in the CPU's memory and in the accelerator device memory. However, the system is effectively a ccNUMA system where proper data placement is performance-critical. In our system the host CPU is equipped with 667 MHz FBDIMMs for the host memory. On the device, 16 DDR2 memory channels feed Convey's special Scatter-Gather DIMMs, providing the device memory with a theoretical peak bandwidth of 80 GB/s. The fundamental advantage of this memory configuration is that non-unit strides have no drawback on the effective bandwidth, and all data can be accessed with 8-byte granularity.

Functional units on the FPGAs are implemented by logic synthesis by means of Convey's personalities [5]. The single precision vector personality provides a load-store vector architecture with 32 function pipes, each containing a vector register file and four fused multiply-add (FMA) vector units for exploiting data parallelism by means of vectorization. Furthermore, out-of-order execution provides a means for instruction-level parallelism. While the clock rate of the FPGAs is undisclosed, the peak rates are expected to be in the range of tens of GFlop/s for single and double precision. The most accented difference of the FPGA accelerator memory subsystem is that there are no caches and no local memory available: the block RAM on the FPGAs is not accessible by the user (unless custom personalities and FPGA designs are created that support this feature). The whole system consumes a few hundred watts (depending on the actual workload). A sketch of the HC-1's hardware configuration is shown in Figure 1.

Fig. 1. Hardware configuration of the Convey HC-1

In comparison to other accelerators, the HC-1 offers lower peak performance and lower bandwidth. In contrast, the fast device memory can be configured to considerably larger capacities and is not limited to a few GB as on GPUs. Furthermore, Convey's technology and its future development will possibly allow fast cluster-like connections between several FPGA-based entities. In the latest Convey product, the HC-1ex, Xilinx Virtex-6 FPGAs and an Intel Xeon quad-core processor provide the same functionality with improved capabilities.

III. CONVEY HC-1'S SOFTWARE ENVIRONMENT

Programming of FPGAs by means of HDL is a time-consuming and non-intuitive effort, and so FPGAs have been out of reach for many domain experts. With Convey's solution, FPGAs are now a viable alternative from a programming point of view. On the Convey HC-1, the CPU's capabilities are extended by application-specific instructions for the FPGAs [5]. However, the asymmetric computing platform is hidden by Convey's unified software interface. A single code base (C, C++ or Fortran) can be enhanced with pragmas to advise the compiler how to treat computations and data. In particular, code regions that should be executed on the FPGAs can be identified by inserting #pragma cny begin_coproc / end_coproc.
Another possibility is to compile whole subroutines for the accelerator, or even to use the compiler's capability to automatically identify parts that shall be offloaded to the FPGAs. Typical code sections are those suitable for vectorization, especially long loops. The compiler then produces a dual-target executable, i.e. the code can be executed both on the CPU host (e.g. if the coprocessor is not available) and on the FPGA accelerator. Hence, a portable solution is created that can run on any x86 system. There are some restrictions on coprocessor code: it is not possible to do I/O, to make system calls, or to call non-coprocessor functions. A compiler report gives details on the vectorization procedure, and specific pragmas give further hints for compiler-based optimizations.

The memories of the host CPU and of the FPGA accelerator are combined into a common address space, and data types are common across both entities. In order to prevent NUMA effects, the placement of data can be controlled by pragmas. Data allocated in the CPU memory can be transferred to the accelerator with #pragma cny migrate_coproc. Dynamic and static memory on the FPGA device can be allocated directly via #pragma cny_cp_malloc and #pragma cny coproc_mem, respectively. In case of memory migration, whole memory pages are transferred; in order to avoid multiple transfers, several data objects should be grouped into larger structs. A sketch of such an offloaded region is given below.
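The following is a minimal sketch of what such an offloaded region might look like. The pragma names (cny begin_coproc / end_coproc and cny migrate_coproc) are the ones quoted in this section; the argument form of the migration pragma and the surrounding routine are our own assumptions for illustration, not Convey documentation.

    /* Sketch only: pragma spellings as quoted in the text, argument forms assumed. */
    void saxpy_offload(long n, float a, float *x, float *y)
    {
        /* advise the runtime to move the pages holding x and y into device memory;
           whole pages are migrated, so related data should be grouped together */
        #pragma cny migrate_coproc(x)
        #pragma cny migrate_coproc(y)

        /* region compiled for both host and coprocessor; it runs on the FPGA
           application engines when the coprocessor is available and falls back
           to the host otherwise (dual-target executable) */
        #pragma cny begin_coproc
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
        #pragma cny end_coproc
    }

The desired personality (e.g. the single or double precision vector personality) would be selected at compile time via compiler flags, as described next.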
The actual configuration of the FPGAs is represented by means of application-specific instruction sets called personalities. These personalities augment the host's x86 instruction set. This feature allows adaptation and optimization of the hardware with respect to the specific needs of the underlying algorithms. The user only has to deal with an integrated instruction set controlled by pragmas and compiler settings. Convey offers a set of pre-defined personalities for single and double precision floating point arithmetic that turn the FPGAs into a soft-core vector processor. Furthermore, personalities for financial analytics and for proteomics are available, and a finite difference personality for stencil computations is currently under development. The requested personality is specified at compile time by setting compiler flags. With Convey's personality development kit, custom personalities can be developed by following the typical FPGA hardware design tool chain (requiring considerable effort and additional knowledge). Convey's Software Performance Analysis Tool (SPAT) gives insight into the system's actual runtime behavior and feedback on possible optimizations. The Convey Math Library (CML) provides tuned basic mathematical kernels. For our experiments we used the Convey Compiler Suite.

IV. THE POTENTIAL OF FPGAS

FPGAs have been considered non-optimal for floating point number crunching. But FPGAs show particular benefits for specific workloads like processing complex mathematical expressions (logs, exponentials, transcendentals), performing bit operations (shifts, manipulations), and performing sort operations (string comparison, pattern recognition). Further benefits can be achieved for variable bit lengths of data types with reduced or increased precision, or for treating non-standard number formats (e.g. decimal representation). The latter points are exploited within Convey's personalities for financial applications and proteomics. Recently, Convey reported a remarkable speedup for the Smith-Waterman algorithm [9].

Pure floating point-based algorithms in numerical simulation are often limited by bandwidth constraints and low arithmetic intensity (the ratio of flops per byte). The theoretical peak bandwidth of 80 GB/s on the Convey FPGA device has a specific appeal in this context. However, memory accesses on the device are not cached. Hence, particular benefits are expected for kernels with limited data reuse like vector updates (SAXPY/DAXPY), scalar products (SDOT/DDOT) and sparse matrix-vector multiplications (SpMV). Convey's special Scatter-Gather DIMMs are well adapted to applications with irregular data access patterns, where CPUs and GPUs typically show tremendous performance breakdowns.
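To make the notion of low arithmetic intensity concrete, a back-of-the-envelope count (ours, not a figure from the measurements) for the double precision vector update y = ax + y (DAXPY) assumes one multiply and one add per element and three 8-byte memory accesses (two loads, one store):

    I_{DAXPY} = \frac{2N\,\text{flops}}{3 \cdot 8N\,\text{bytes}} = \frac{1}{12}\ \text{flop/byte} \approx 0.083\ \text{flop/byte}

At the 80 GB/s peak device bandwidth quoted above, this bounds DAXPY at roughly 6 to 7 GFlop/s no matter how many functional units are available; the dot products and the SpMV considered below are of the same order.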
V. PERFORMANCE EVALUATION

In order to assess the performance potential of Convey's FPGA platform for floating point-based computations in numerical simulation, we analyze some basic numerical kernels and their performance behavior. In particular, we consider library-based kernels provided by the CML and hand-written, optimized kernels. By comparing both results we draw some conclusions on the capability of Convey's compiler. In all cases, vectorization of the code and NUMA-aware placement of data are crucial for performance. Without vectorization there is a dramatic performance loss, since scalar code for the accelerator is executed on the slow application engine hub (AEH) that builds the interface between host and accelerator device. If data is not located in the accelerator memory but is accessed in the host memory over the FSB, bandwidth and hence performance also drop considerably.

For our numerical experiments we consider some basic building blocks for high-level solvers, namely vector updates z = ax + y (SAXPY/DAXPY in single and double precision), scalar products α = x · y (SDOT/DDOT), dense matrix-vector multiplication y = Ax (SGEMV/DGEMV), dense matrix-matrix multiplication C = AB (SGEMM/DGEMM), sparse matrix-vector multiplication (SpMV), and stencil operations.

VI. VECTORIZATION AND OPTIMIZATION OF CODE

In order to exploit the full capabilities of the FPGA accelerator, specific measures are necessary for code creation, for organizing data accesses, and to support the compiler in vectorizing the code. Due to its nature as a low-frequency, highly parallel vector architecture, performance on the Convey HC-1 heavily depends on the ability of the compiler to vectorize the code. One example where this did not work out of the box is the dense matrix-vector multiplication SGEMV. The code snippet in Figure 2 shows a straightforward implementation. Here, the pragma cny no_loop_dep gives the compiler the hint for vectorization that there are no data dependencies in the corresponding arrays.

    void gemv(int length, float A[], float x[], float y[]) {
      for (int i = 0; i < length; i++) {
        float sum = 0.0f;
    #pragma cny no_loop_dep(A, x, y)
        for (int j = 0; j < length; j++)
          sum += A[i*length+j] * x[j];
        y[i] = sum;
      }
    }

Fig. 2. Straightforward implementation of dense matrix-vector multiplication (SGEMV)

Although the compiler claims to vectorize the inner loop, performance is poor, roughly a factor of 7 below the performance of the CML math library version. The coprocessor instruction set supports vector reduction operations, but these seem to have a rather high startup latency. The outer loop is not unrolled; attempts to do that manually improved the performance somewhat, but introduced new performance degradations for certain vector lengths. The solution lies in exploiting Convey's scatter-gather memory, which allows for fast strided memory reads and therefore allows changing the loop ordering (see Figure 3).
This gives considerably better results; performance improvements by loop reordering are detailed in Figure 5. For the reordered loops we consider three different memory allocation scenarios: dynamic memory allocated on the host and migrated with Convey's pragma cny migrate_coproc, dynamic memory allocated on the device, and static memory allocated on the host and migrated to the device with the pragma mentioned above. Performance increases with vector length but shows some oscillations. These results even outperform the CML CBLAS library implementation from Convey (cf. Figure 14).

    void optimized_gemv(int length, float A[], float x[], float y[]) {
      for (int i = 0; i < length; i++)
        y[i] = 0.0f;
      for (int j = 0; j < length; j++) {
    #pragma cny no_loop_dep(A, x, y)
        for (int i = 0; i < length; i++)
          y[i] += A[i*length+j] * x[j];
      }
    }

Fig. 3. Dense matrix-vector multiplication (SGEMV) optimized by loop reordering

Fig. 4. Memory bandwidth of the Convey HC-1 coprocessor for different memory access patterns

Fig. 5. Performance results and optimization for single precision dense matrix-vector multiplication (SGEMV) with and without loop reordering and for different memory allocation schemes; cf. Fig. 2 and Fig. 3

VII. PERFORMANCE RESULTS AND ANALYSIS

A. Device Memory Bandwidth for Different Access Patterns

The performance of numerical kernels is often determined by the achievable memory bandwidth for loading and storing data. For our memory bandwidth measurements we use the following memory access patterns, which are characteristic for diverse kernels:

    Sequential Load (SeLo):           d[i] = s[i]
    Sequential Load Indexed (SeLoI):  d[i] = s[seq[i]]
    Scattered Load Indexed (ScaLoI):  d[i] = s[rnd[i]]
    Sequential Write (SeWr):          d[i] = s[i]
    Sequential Write Indexed (SeWrI): d[seq[i]] = s[i]
    Scattered Write Indexed (ScaWrI): d[rnd[i]] = s[i]

Here, seq[i] = i, i = 1,...,N, is a sequential but indirect addressing, and rnd[i] is an indirect addressing by an arbitrary permutation of [1,...,N]. Results are presented in Figure 4 and should be seen in comparison to results obtained on a 2-way Intel Nehalem system (shown in Figure 6). For the sequential indirect access, Convey's compiler cannot detect possible improvements. For the scattered load and write, Convey's memory configuration gives better values than the Nehalem system. The Convey HC-1 not only has a higher peak memory bandwidth, it really shows the potential of its scatter-gather capability when accessing random locations in memory. Here, traditional cache-based architectures typically perform poorly and GPU systems suffer a breakdown by an order of magnitude.
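For reference, the six patterns correspond to the following plain C loops; this is our own paraphrase of the table above (seq holds the identity index, rnd a random permutation), with the timing and byte-counting harness omitted.

    /* d, s: destination and source arrays; seq[i] = i; rnd: random permutation of 0..n-1 */
    void selo  (long n, double *d, const double *s)                  { for (long i = 0; i < n; i++) d[i]      = s[i];      }
    void seloi (long n, double *d, const double *s, const long *seq) { for (long i = 0; i < n; i++) d[i]      = s[seq[i]]; }
    void scaloi(long n, double *d, const double *s, const long *rnd) { for (long i = 0; i < n; i++) d[i]      = s[rnd[i]]; }
    void sewr  (long n, double *d, const double *s)                  { for (long i = 0; i < n; i++) d[i]      = s[i];      }
    void sewri (long n, double *d, const double *s, const long *seq) { for (long i = 0; i < n; i++) d[seq[i]] = s[i];      }
    void scawri(long n, double *d, const double *s, const long *rnd) { for (long i = 0; i < n; i++) d[rnd[i]] = s[i];      }

Note that the indexed variants additionally stream the index array, so the bytes moved per element differ between the patterns; sequential load and sequential write share the same copy loop in the table and presumably differ only in how the transferred bytes are attributed.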
B. Data Transfers Between Host and Device

Because of the strongly asymmetric NUMA architecture of the HC-1, there are different methods to use main memory. Three of them are used in the following examples:
- dynamically allocate (malloc) and initialize on the host; use migration pragmas,
- dynamically allocate (cny_cp_malloc) and initialize on the device,
- statically allocate and initialize on the host; use migration pragmas.

By initialization we mean the first touch of the data in memory. Because the Convey HC-1 is built on Intel's older approach of connecting memory and processors via the front-side bus (FSB), a major bottleneck is the data connection between host memory and device memory. A sketch of the three schemes is given below.
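The following compact sketch illustrates the three schemes. cny_cp_malloc and the migration pragma are the names used in the text; their prototypes and argument forms are assumed here for illustration only.

    #include <stdlib.h>

    /* prototype assumed; the text names cny_cp_malloc as the device-side allocator */
    extern void *cny_cp_malloc(size_t size);

    #define N (1L << 24)

    static double xs[1 << 20];                     /* scheme 3: static allocation on the host */

    void allocation_schemes(void)
    {
        /* scheme 1: dynamic allocation and first touch on the host, then migration */
        double *x1 = (double *) malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) x1[i] = 0.0;   /* first touch in host memory */
        #pragma cny migrate_coproc(x1)              /* whole pages move over the FSB */

        /* scheme 2: dynamic device allocation with first touch from the coprocessor */
        double *x2 = (double *) cny_cp_malloc(N * sizeof(double));
        #pragma cny begin_coproc
        for (long i = 0; i < N; i++) x2[i] = 0.0;   /* first touch in device memory */
        #pragma cny end_coproc

        /* scheme 3: static host allocation, first touch on the host, then migration */
        for (long i = 0; i < (1L << 20); i++) xs[i] = 0.0;
        #pragma cny migrate_coproc(xs)

        free(x1);   /* releasing x2 is left to the corresponding Convey runtime call */
    }

Note that scheme 2 avoids the FSB entirely for the payload data, which is why first touch on the device can run at device-memory speed while migration is limited by the FSB.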
Figure 7 shows the initialization and migration bandwidth in GB/s on the host, on the device, and between host and device for the SAXPY vector update. Furthermore, Figure 7 depicts the performance of the SAXPY in GFlop/s (the unit on the y-axis has to be chosen correspondingly). For data originally allocated on the host and migrated to the device we observe some oscillations in the performance. While initialization inside the device memory proceeds at close to the device memory bandwidth, the transfer over the FSB achieves only a few hundred MB/s. This impedes fast switching between parts of an algorithm that perform well on the coprocessor and its vector units and other parts that rely on the flexibility of a high-clocked general purpose CPU. Compared to GPUs attached via PCIe, the FSB represents an even narrower bottleneck.

Fig. 6. Memory bandwidth of a 2-way Intel Nehalem system for different memory access patterns

Fig. 7. Measured bandwidth for data initialization and migration for different memory allocation schemes in GB/s, and performance results in GFlop/s for the SAXPY vector update

C. Avoiding Bank Conflicts with 31-31 Interleave

The scatter-gather memory configuration of the Convey HC-1 can be used in two different mapping modes:
- binary interleave: the traditional approach, where parts of the address bitmap are mapped round-robin to different memory banks,
- 31-31 interleave: a modulo-31 mapping of parts of the address bitmap.

Because in the 31-31 interleave mode the memory is divided into 31 groups of 31 banks, memory strides of powers of two and many other strides hit different banks and therefore do not suffer from memory bandwidth degradation. To integrate this prime number scheme into a power-of-two dominated world, one out of 32 groups and every 32nd bank are not used, resulting in a loss of some addressable memory and of approximately 6% of the peak memory bandwidth.

In Figure 8, performance results for the SAXPY vector update are shown for both interleave options; for the SAXPY, binary memory interleave is slightly worse. Performance results for the CML DGEMM routine in Figure 9 show larger variations with 31-31 interleave. On our machine, both the DGEMM and the SGEMM routine reach roughly 90% of the estimated machine peak performance in double and single precision, respectively.

Fig. 8. Performance of SAXPY vector updates with 31-31 and binary interleave

Fig. 9. Performance of the cblas_dgemm matrix-matrix multiplication provided by the CML with 31-31 and binary interleave
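The effect of the prime-modulus mapping can be illustrated with a small bank-index experiment. The mapping below is a deliberate simplification of the scheme described above (a block index taken modulo 31 versus modulo 32), not Convey's actual address bitmap.

    #include <stdio.h>

    /* Simplified bank models: binary interleave spreads blocks over 32 banks
       round-robin, the prime scheme over 31 banks. */
    static int bank_binary(long block) { return (int)(block % 32); }
    static int bank_prime (long block) { return (int)(block % 31); }

    int main(void)
    {
        /* Walk memory with a stride of 32 blocks, as a power-of-two strided
           access (e.g. a column walk through a 32 x N matrix) would do. */
        int hit_binary[32] = {0}, hit_prime[31] = {0};
        for (long k = 0; k < 1024; k++) {
            long block = 32 * k;
            hit_binary[bank_binary(block)]++;
            hit_prime [bank_prime (block)]++;
        }

        int used_b = 0, used_p = 0;
        for (int b = 0; b < 32; b++) used_b += (hit_binary[b] > 0);
        for (int b = 0; b < 31; b++) used_p += (hit_prime [b] > 0);

        /* binary interleave: every access lands in one bank (a bank conflict);
           modulo-31 interleave: the accesses spread over all 31 banks */
        printf("banks touched: binary %d, prime %d\n", used_b, used_p);
        return 0;
    }

Since 32 is congruent to 1 modulo 31, the power-of-two stride cycles through all 31 banks under the prime mapping while it keeps hitting the same bank under binary interleave. The price is that only 31 x 31 = 961 of 1024 bank slots are used, which matches the roughly 6% bandwidth loss mentioned above.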
D. BLAS Operations

The Basic Linear Algebra Subprograms (BLAS) [10] are a collection of, and an interface for, basic numerical linear algebra routines. We use these routines for an assessment of the HC-1 FPGA platform. We compare our own straightforward implementations of BLAS routines with those provided by Convey's Math Library (CML); loop reordering techniques are applied for performance improvements. In the following examples we use the three different memory usage schemes detailed in Section VII-B. In all three cases, initialization and migration costs are not included in our measurements.

Data allocation on the host followed by migration routines or pragmas is not a really controllable and reliable procedure: from time to time considerable drops in performance are observed, and so far we could not identify a reasonable pattern or a satisfactory explanation for these effects. Our measurements are therefore made using separate program calls for each set of parameters. When trying to measure by looping over different vector lengths, allocating and freeing memory on the host and using migration calls in between, the results are even less reliable. We observe that our own implementations are usually faster for short vector lengths, probably due to lower call overhead and less parameter checking; for longer vector lengths the CML library implementations usually give better results. Results for the SAXPY/DAXPY vector updates are depicted in Figures 10 and 11. Performance data for the SDOT/DDOT scalar products are shown in Figures 12 and 13, and for the SGEMV/DGEMV dense matrix-vector multiplications in Figures 14 and 15.

Fig. 10. SAXPY vector update using different implementations and different memory allocation strategies

Fig. 11. DAXPY vector update using different implementations and different memory allocation strategies

E. Sparse Matrix-Vector Multiplication

Many numerical discretization schemes for scientific problems result in sparse system matrices, and typically iterative methods are used for solving these sparse systems. On top of scalar products and vector updates, the efficiency of the sparse matrix-vector multiplication is very important in these scenarios. When using the algorithm for the compressed sparse row (CSR) storage format [11] presented in Figure 16, loops and reduction operations are vectorized by the compiler. However, the performance results are very disappointing, being in the range of a few MFlop/s. Although the memory bandwidth for indexed access as presented in Figure 4 is very good, the relatively short vector lengths and the overhead of the vector reduction in the inner loop seem to slow down the computation (see also the observations in Section VI on loop optimizations). Unfortunately, in this case loop reordering is not that easy because the length of the inner loop depends on the outer loop. A possible solution is to use other sparse matrix representations like the ELL format, as used on GPUs e.g. in [12]; a sketch of this idea is given below.
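To illustrate the suggested alternative, the following is a rough sketch of an ELL(PACK)-style SpMV in the spirit of [12]: all rows are padded to the same number of entries and stored column by column, so that the innermost loop runs over all rows with unit stride and without a reduction. This is our own illustration, not code from the paper, and the cny pragma is used as in the earlier listings; whether the Convey compiler vectorizes the loop as intended would have to be checked with the compiler report.

    /* ELL format: val and col hold nrows*max_nnz entries in column-major order,
       i.e. the j-th entry of row i is stored at index j*nrows + i; padded slots
       carry a zero value and an arbitrary (e.g. zero) column index. */
    void spmv_ell(int nrows, int max_nnz, const float val[], const int col[],
                  const float vin[], float vout[])
    {
        for (int i = 0; i < nrows; i++)
            vout[i] = 0.0f;

        for (int j = 0; j < max_nnz; j++) {
    #pragma cny no_loop_dep(val, col, vin, vout)
            for (int i = 0; i < nrows; i++)   /* long, unit-stride loop over all rows */
                vout[i] += val[(long)j * nrows + i] * vin[col[(long)j * nrows + i]];
        }
    }

The gather vin[col[...]] maps directly onto the scattered-load pattern (ScaLoI) measured in Section VII-A, which is exactly where the HC-1's scatter-gather memory is strong.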
Fig. 12. SDOT scalar product using different implementations and different memory allocation strategies

Fig. 13. DDOT scalar product using different implementations and different memory allocation strategies

Fig. 14. SGEMV matrix-vector multiplication using different implementations and different memory allocation strategies

Fig. 15. DGEMV matrix-vector multiplication using different implementations and different memory allocation strategies

F. Stencil Operations

Stencil kernels are among the most important routines applied in the context of solving partial differential equations (PDEs) on structured grids. They originate from the discretization of differential expressions in PDEs by means of finite element, finite volume or finite difference methods. A stencil is defined on a fixed subset of nearest neighbors whose node values are used for computing weighted sums; the associated weights correspond to the coefficients of the PDE, which are assumed to be constant in our context. In our tests we used a 3-dimensional 7-point stencil for solving the Laplace equation on grids of different sizes; a sketch of such a sweep is given below. The performance results are shown in Figure 17. Our stencil code is close to the example given in the Convey documentation material. The CPU implementation is the one used in [13], not using the in-place optimization presented there but only conventional space-blocking and streaming optimizations. For the conventional CPU one can see a high peak for small grid sizes when the data can be kept in the cache; for larger grid sizes a fairly constant performance with slight increases due to lower loop overhead is observed. The Convey HC-1, on the other hand, shows no cache effects but lower performance on smaller grids. Unfortunately, because of the lack of caching, neighboring values of the stencil have to be reloaded every time they are needed, wasting a large portion of the much higher total memory bandwidth. On the Convey HC-1, the difference between single and double precision stencil performance becomes apparent only for large grid sizes.
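For completeness, a generic constant-coefficient 3-D 7-point sweep of the kind measured here might look as follows. This is a plain Jacobi-style illustration between two grids, not the authors' code and not the Convey documentation example; the cny pragma is used as in the earlier listings.

    /* One sweep of a 7-point stencil on an nx x ny x nz grid with a one-cell halo;
       u is the input grid, unew the output grid, c0/c1 the constant weights. */
    void stencil7(int nx, int ny, int nz, const double *u, double *unew,
                  double c0, double c1)
    {
    #define IDX(i, j, k) ((long)(k) * nx * ny + (long)(j) * nx + (i))
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++) {
    #pragma cny no_loop_dep(u, unew)
                for (int i = 1; i < nx - 1; i++)
                    unew[IDX(i, j, k)] =
                        c0 * u[IDX(i, j, k)] +
                        c1 * (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                              u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                              u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]);
            }
    #undef IDX
    }

Without caches, each of the seven loads per grid point goes to device memory, which is consistent with the reloading of neighboring values observed above.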
    void spmv(int nrows, float val[], int coli[], int rowp[],
              float vin[], float vout[]) {
    #pragma cny no_loop_dep(val, vin, vout)
    #pragma cny no_loop_dep(coli, rowp)
      for (int row = 0; row < nrows; row++) {
        int start = rowp[row];
        int end   = rowp[row+1];
        float sum = 0.0f;
        for (int i = start; i < end; i++) {
          sum += val[i] * vin[coli[i]];
        }
        vout[row] = sum;
      }
    }

Fig. 16. Sparse matrix-vector multiplication (SpMV) routine for the CSR data format

Fig. 17. Performance of a 3-dimensional 7-point Laplace stencil (counted per grid update) on the Convey HC-1 and on a 2-way Intel Nehalem system

VIII. CONCLUSION

Convey's HC-1 Hybrid Core Computer offers seamless integration of a highly capable FPGA platform with an easy coprocessor programming model, a coherent memory space shared by the host and the accelerator, and remarkable bandwidth values on the coprocessor. Moreover, Convey's scatter-gather memory configuration offers advantages for codes with irregular memory access patterns. With Convey's personalities, the actual hardware configuration can be adapted to, and optimized for, specific application needs. With its HC-1 platform, Convey brings FPGAs closer to high performance computing. However, we have not been able to port more complex applications originating in numerical simulation, due to the failure to obtain acceptable speed for the sparse matrix-vector multiplication.

The HC-1 has the potential to be used for general purpose applications. Although the HC-1 falls behind the impressive performance numbers of GPU systems and the latest multicore CPUs, it provides an innovative approach to asymmetric processing, to compiler-based parallelization, and in particular to portable programming solutions. Only a single code base is necessary for x86 and FPGA platforms, which facilitates the maintainability of complex codes. In contrast to GPUs, memory capacity is not limited to a few GB, and FPGAs connected by direct networks come within reach. A great opportunity lies in the possibility to develop custom personalities if time, knowledge and costs permit. Convey's approach represents an emerging technology with some deficiencies but also with a high level of maturity. Major drawbacks arise from the limitations of floating point arithmetic on FPGAs, from the compiler's capabilities for automatic vectorization, and from the use of Intel's obsolete FSB communication infrastructure. In our experience, typical code bases still show room for code and compiler improvements. While major benefits have been reported for specific workloads in bioinformatics, the HC-1 also provides a viable means for floating point-dominated and bandwidth-limited numerical applications. Despite its high acquisition costs, this breakthrough technology deserves further attention.

ACKNOWLEDGEMENTS

The Shared Research Group 16-1 received financial support from the Concept for the Future of Karlsruhe Institute of Technology in the framework of the German Excellence Initiative and from the industrial collaboration partner Hewlett-Packard.

REFERENCES

[1] P. J. Ashenden, The VHDL Cookbook. Dept. Computer Science, Univ. of Adelaide, South Australia. [Online]. Available: Cookbook.pdf
[2] D. Thomas and P. Moorby, The Verilog Hardware Description Language. Springer.
[3] J. M. P. Cardoso, P. C. Diniz, and M. Weinhardt, "Compiling for reconfigurable computing: A survey," ACM Computing Surveys, vol. 42, art. 13, June 2010.
[4] A. Shan, "Heterogeneous processing: A strategy for augmenting Moore's law," Linux Journal.
[5] T. M. Brewer, "Instruction set innovations for the Convey HC-1 computer," IEEE Micro, vol. 30, no. 2, pp. 70-79, 2010.
[6] O. Storaasli and D. Strenski, "Cray XD1 exceeding 100x speedup/FPGA: Timing analysis yields further gains," in Proc. 2009 Cray User Group, Atlanta, GA, 2009.
[7] J. Bakos, "High-performance heterogeneous computing with the Convey HC-1," Computing in Science & Engineering, vol. 12, no. 6, pp. 80-87, 2010.
[8] J. M. Kunkel and P. Nerge, "System performance comparison of stencil operations with the Convey HC-1," Research Group Scientific Computing, University of Hamburg, Tech. Rep.
[9] Convey Computer, informatics web.pdf.
[10] Basic Linear Algebra Subprograms (BLAS).
[11] S. Williams, R. Vuduc, L. Oliker, J. Shalf, K. Yelick, and J. Demmel, "Optimizing sparse matrix-vector multiply on emerging multicore platforms," Parallel Computing (ParCo), vol. 35, no. 3, pp. 178-194, March 2009.
[12] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.
[13] W. Augustin, V. Heuveline, and J.-P. Weiss, "Optimized stencil computation using in-place calculation on modern multicore systems," in Euro-Par 2009 Parallel Processing, LNCS 5704, 2009.