Using Parallella for Numerical Linear Algebra
Marcus Naslund, Jacob Mattsson
Project in Computational Science, January 2015
Project Report, Department of Information Technology
Contents

1 Introduction
2 Background
3 Purpose
4 Hardware and setup
  4.1 The Epiphany chip
5 Multigrid methods
  5.1 Linear systems
  5.2 General properties
  5.3 Computational speed
6 Project workflow
  6.1 Setup and cooling
  6.2 Software
  6.3 Code development
    6.3.1 Dense kernels
    6.3.2 Sparse kernels
7 Problem setup
  7.1 Linear algebra
  7.2 Energy consumption
8 Results
  8.1 Dense linear algebra computations
  8.2 Sparse LA
    8.2.1 Speedup
    8.2.2 Memory allocation
    8.2.3 Memory synchronization
  8.3 Simulated V-cycle
  8.4 Comparisons with MATLAB
  8.5 Power measurements
9 Discussion
10 Conclusions and future research
1 Introduction

In this project we have tested Parallella, a miniature supercomputer, and its suitability for solving scientific computing problems. The project mainly involved a performance test in a realistic problem setting using the Algebraic Multigrid method (AMG). This report also documents some of the features which may make the platform an interesting option for future research.

2 Background

The company Adapteva produces the 16- and 64-core Epiphany MIMD chips, which are the main selling point of the Parallella miniature supercomputer. The board is powered by a dual-core ARM A9 CPU [1] and uses the 16-core Epiphany chip as an accelerator. At the time of writing, the largest supercomputer in the world sports a staggering 3.12 million cores consuming almost 18 MW of power, which means around 5 W per core. In contrast, the Parallella consumes around 5 W for all of its 2+16 cores while delivering a theoretical peak of 32 GFLOPS [2]. There is also a 64+2-core version, albeit delayed and not yet released, with a presumably even better power-per-core ratio.

As the world approaches exascale computing, it must also move towards more energy-efficient computing cores. As of November 2014 the greenest supercomputer ever produced delivers an impressive 5.27 gigaflops per watt (GFLOPS/W), which is the standard metric used for this measure.

Independently of this study, another group in the IT department has built a cluster of Parallella boards, using the distributed shared memory (DSM) computing paradigm. The results from that project show that, with their hardware and system setup, Parallella boards are not optimal for such a system, due to the constant communication between the ARM chips and the Epiphany cores. We have not used their approach, and details of our implementation follow in the next sections.

Parallella's architecture, mainly the memory shared between CPU and accelerator, together with the easy connection of additional Parallella boards, seems to offer good scaling capabilities for numerical linear algebra, especially the Algebraic Multigrid methods, which are further detailed below.

3 Purpose

There are many iterative solvers for very large sparse linear systems Ax = b, such as the Jacobi method, the Conjugate Gradient method, etc. A special class of methods are the so-called Multigrid methods, which have many theoretical advantages but do not always scale well on large parallel systems.

There is a growing need to solve large sparse linear systems since they appear in nearly any scientific computing problem. Direct methods are unfeasible because of their computational complexity; the feasible alternatives are the iterative methods. Their potential drawback is very slow convergence. Therefore we need to accelerate those methods using a technique called preconditioning: the linear system Ax = b is replaced by B^{-1}Ax = B^{-1}b for a certain matrix B^{-1}, called the preconditioner. Note that the transformed system has the same solution as the original one. These improved iterative solvers, including the Multigrid methods, are very communication-heavy and thus become difficult to implement on distributed memory systems.
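As a concrete, deliberately tiny illustration of the preconditioning idea, the following C++ sketch applies a diagonal (Jacobi) preconditioner B = diag(A) inside a preconditioned Richardson iteration, x <- x + B^{-1}(b - Ax). This is only a minimal example of the concept, using an arbitrary 3x3 test system; it is not the AMG preconditioner discussed in the rest of the report.

```cpp
// Minimal sketch of preconditioning: solve Ax = b with the preconditioned
// Richardson iteration x <- x + B^{-1}(b - Ax), where B = diag(A) is the
// Jacobi preconditioner.  The 3x3 system is an arbitrary example; its exact
// solution is x = (1, 1, 1).
#include <cstdio>
#include <vector>

int main() {
    const int n = 3;
    const double A[3][3] = {{4, -1, 0}, {-1, 4, -1}, {0, -1, 4}};
    const std::vector<double> b = {3, 2, 3};
    std::vector<double> x(n, 0.0);

    for (int it = 0; it < 50; ++it) {
        std::vector<double> r(n, 0.0);
        for (int i = 0; i < n; ++i) {              // residual r = b - Ax
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i][j] * x[j];
        }
        for (int i = 0; i < n; ++i) x[i] += r[i] / A[i][i];  // x += B^{-1} r
    }
    std::printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);
}
```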
Figure 1: The Parallella board [2]. (a) Top view of the board. (b) Bottom view of the board.

Hopefully, with the architectural difference of the Parallella compared to conventional large-scale computational systems, the Algebraic Multigrid method (AMG) will scale much better in terms of energy efficiency. Either way, the chosen test method provides much more information on how useful the computer actually is than standard benchmarking tests such as the LINPACK performance tests [3].

4 Hardware and setup

The Parallella board used for this report (the Desktop version) is equipped with an ARM A9 CPU and 1 GB RAM. Its parallel speed comes from its 16-core Epiphany chip. There is also a 64-core version planned but not yet in production. While the computer supports any ARM-compatible operating system, our computer was set up with a variant of Ubuntu called Linaro, which is made especially for ARM-based computers [4]. This is what is recommended by the manufacturer Adapteva, but as the platform becomes more popular, guides for installing other choices of OS are becoming more common.

Moreover, the board comes with the following set of hardware options [2]:

- Gigabit Ethernet
- HDMI port
- 4 general-purpose expansion connectors
- 2x USB 2.0
- MicroSD card, for storing the operating system and software
- FPGA (Field-Programmable Gate Array), basically a programmable integrated circuit

One of the more interesting features is the set of four expansion connectors, which gives the opportunity to increase parallelism even further by connecting several Parallellas together (see Figure 1b). We have been unable to find any reports of large-scale tests of this feature and are thus unable to comment on how well it works.

The credit-card-sized computers, measuring 9 cm x 5.5 cm, are available in three versions:
- Micro-server: The cheapest alternative, with no IO connector ports.
- Desktop: The one we have used (and the most common version).
- Embedded platform: The most expensive version. Same as the Desktop version but with a larger FPGA and twice as many GPIO ports.

All versions feature the same processor, memory size and Epiphany 16-core accelerator, and have a theoretical peak performance of 25 GFLOPS, meaning around 5 GFLOPS/W [5].

4.1 The Epiphany chip

The Epiphany chip is a multicore shared-memory microprocessor consisting of a 2D array of 16 computing nodes. Each node contains an independent superscalar floating-point RISC CPU (RISC: Reduced Instruction Set Computing, a class of processor architectures developed by IBM during the late 1970s), operating at 1 GHz and capable of two floating-point operations or one integer operation per clock cycle. Each node has up to 1 MB of local memory, allowing simultaneous accesses, for example by the instruction fetch engine. Each node can access its own or other nodes' memory through regular load and store instructions. Moreover, the chip comes with the following features:

- 32 GFLOPS peak performance
- 512 GB/s local memory bandwidth
- 2 W maximum chip power consumption
- Fully featured ANSI-C/C++ programmability

With four connectors, several Parallella computers may be connected together in a cluster to achieve even greater parallel performance. While a traditional method like MPI (Message Passing Interface) may be used to communicate between Parallella boards, this approach is a distributed-memory model and thus not compatible with the shared memory model on a single Parallella board. Here, the Open Computing Language OpenCL [6] must instead be used.

Instead of using OpenCL, there exists a Software Development Kit (SDK) specifically for programming the Epiphany chip, called the eSDK. In a recent study, albeit with very limited testing, the eSDK was shown to clearly outperform the OpenCL implementation for Epiphany, called the COPRTHR SDK [7]. However, since we do not have the time or expertise to implement a new AMG solver based on the eSDK, our only option is to use OpenCL.

The OpenCL language is very similar to the C language, with certain extensions. Small sets of instructions called kernels are written in OpenCL (similar to the way one would program for CUDA) and are assigned to the accelerator cores from a host process run on the CPU. OpenCL lets us specify which cores should do what, something that works well with the Multiple Instruction, Multiple Data (MIMD) architecture of the Epiphany chip.

5 Multigrid methods

5.1 Linear systems

Consider a linear system of equations
Ax = b,    (1)

or equivalently b - Ax = 0, where b, x ∈ R^n and A ∈ R^{n×n} is sparse, for very large n (> 200 million). Such systems arise, for example, when solving partial differential equations using the Finite Difference Method (FDM) or the Finite Element Method (FEM). Exact solvers are far too inefficient because of their computational complexity. For the types of problems above one usually relies on iterative methods, which obtain an approximate solution x_k where b - Ax_k ≈ 0. The solution is improved with more iterations, and we of course want to have

lim_{k→∞} x_k = x,

where x is the exact solution of the system. The expression b - Ax_k is called the residual; the residual corresponding to x_k is denoted by r_k. The error of the same solution is x - x_k, which we denote by e_k. The two are related in the following way:

r_k = b - Ax_k = Ax - Ax_k = A(x - x_k) = Ae_k.

The number of iterations for simple iterative methods such as the Jacobi and Gauss-Seidel methods grows too quickly with n, which is briefly illustrated in the next section. But there is an upside to using Jacobi's method. It can be shown [8] that this method reduces the high-frequency errors, i.e. errors corresponding to large eigenvalues of A (|λ| >> 0), very rapidly. The problem with Jacobi's method resides in the fact that low-frequency errors, meaning errors corresponding to small eigenvalues (|λ| ≈ 0), are resolved much more slowly. Exploiting Jacobi's fast convergence for high-frequency errors in a systematic way results in the class of multigrid solvers. The theory behind them follows next.

5.2 General properties

As described earlier, the problems with the Jacobi method are mostly caused by low-frequency errors. However, a low-frequency oscillation on a fine mesh becomes, when translated to a coarser mesh, a high-frequency (and thus easily remedied) error. The algorithm underlying multigrid methods consists of five major steps:

- Pre-smoothing: Reducing errors caused by large eigenvalues of A, for example using the Gauss-Seidel or the Jacobi method.
- Restriction: Moving the residual from a finer grid to a coarser grid.
- Calculation: Solving the error equation with an iterative method.
- Prolongation: Moving the error from a coarser grid to a finer grid.
- Post-smoothing: High-frequency errors can reappear because of errors introduced by the prolongation; these are remedied by again running a few steps of Jacobi's method.

The five steps above are illustrated in Figures 2 and 3, forming a V-cycle.
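To make the five steps concrete, the following is a minimal, self-contained C++ sketch of one V-cycle for the simplest possible setting: the one-dimensional Poisson problem -u'' = f on a uniform grid with weighted-Jacobi smoothing. It is a geometric toy example meant only to illustrate the structure described above (the report itself concerns the algebraic variant, where the hierarchy is built from the matrix); all sizes and parameters are arbitrary choices.

```cpp
// A minimal, self-contained sketch of one multigrid V-cycle for the 1D
// Poisson problem -u'' = f on a uniform grid with weighted-Jacobi smoothing.
// Geometric toy example only: grid sizes, sweep counts and omega are
// arbitrary choices made for illustration.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Residual r = f - A*u for the 1D Laplacian with stencil (-1, 2, -1)/h^2.
static Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (-u[i - 1] + 2.0 * u[i] - u[i + 1]) / (h * h);
    return r;
}

// Pre-/post-smoothing: a few weighted-Jacobi sweeps (omega = 2/3).
static void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const double omega = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec unew = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i) {
            double jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
            unew[i] = (1.0 - omega) * u[i] + omega * jacobi;
        }
        u = unew;
    }
}

// Restriction: full weighting from the fine grid to the next coarser grid.
static Vec restrict_to_coarse(const Vec& fine) {
    std::size_t nc = (fine.size() - 1) / 2 + 1;
    Vec coarse(nc, 0.0);
    for (std::size_t i = 1; i + 1 < nc; ++i)
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1];
    return coarse;
}

// Prolongation: linear interpolation back to the fine grid.
static Vec prolongate(const Vec& coarse, std::size_t nfine) {
    Vec fine(nfine, 0.0);
    for (std::size_t i = 0; i + 1 < coarse.size(); ++i) {
        fine[2 * i]     = coarse[i];
        fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1]);
    }
    fine[nfine - 1] = coarse.back();
    return fine;
}

// One V-cycle: pre-smooth, restrict the residual, solve the coarse error
// equation (recursively), prolongate and add the correction, post-smooth.
static void v_cycle(Vec& u, const Vec& f, double h) {
    smooth(u, f, h, 3);                        // pre-smoothing
    if (u.size() <= 3) {                       // coarsest level: "exact" solve
        smooth(u, f, h, 50);
        return;
    }
    Vec r  = residual(u, f, h);
    Vec rc = restrict_to_coarse(r);            // restriction
    Vec ec(rc.size(), 0.0);
    v_cycle(ec, rc, 2.0 * h);                  // error equation on coarser grid
    Vec e = prolongate(ec, u.size());          // prolongation
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];
    smooth(u, f, h, 3);                        // post-smoothing
}

int main() {
    const std::size_t n = 257;                 // 2^8 + 1 grid points
    const double h = 1.0 / double(n - 1);
    Vec u(n, 0.0), f(n, 1.0);                  // -u'' = 1, u(0) = u(1) = 0
    for (int cycle = 0; cycle < 10; ++cycle) v_cycle(u, f, h);
    double rmax = 0.0;
    for (double ri : residual(u, f, h)) rmax = std::max(rmax, std::abs(ri));
    std::printf("max residual after 10 V-cycles: %g\n", rmax);
}
```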
Figure 2: Moving from a coarse mesh to a fine mesh, and then back again. Compare with Figure 3.

Figure 3: Illustration of one V-cycle in AMG (pre-smoothing steps, restriction, exact solving on the coarsest level, prolongation and post-smoothing steps).

A full computation may involve moving up and down many times before the error is reduced enough on the finest mesh. Often, one step from a finer to a coarser mesh is not enough; instead an entire hierarchy from the finest mesh down to a very coarse mesh is used. The coarsest level might be a very small matrix, even a 2x2 matrix, depending on how large the initial problem is and how many levels we use. Jacobi's method is used as much as possible on one mesh, and the solution is then projected (the restriction step) onto a coarser mesh, where it is easier to solve, before being prolongated back.

There are also several other kinds of cycles in multigrid. The movement from finest to coarsest to finest may involve several smaller cycles between a subset of the levels, giving rise to more complicated patterns like a W-cycle (which, like the V-cycle, takes its name from how it may be pictured, as in Figure 3).

It is possible to impose a hierarchy of finer and coarser meshes as discussed above without knowing anything about the mesh on which the PDE has been discretized, or even the PDE itself. If such information is available, the Geometric Multigrid methods can be used; otherwise, the Algebraic Multigrid method (AMG) uses properties of the matrix A, viewing it as a graph, to construct the hierarchy. This means that less information is available, but the method works like a black box. Thus, if the geometric versions are too complicated to use, AMG may be a good alternative. The difficulty of implementing an AMG solver is very problem-dependent.

The restrictions and prolongations are of linear cost, but it turns out that the overall computational cost also is of linear complexity O(m), and in fact the number of iterations becomes independent of the size of the matrix. These results are not at all trivial and are left to proper books on the subject, such as [9]. However, convergence and efficiency are often problem-dependent, and the method may need to be tuned and adapted to the problem at hand to be useful.
The perfect problem for multigrid solvers, including AMG, is the Laplacian (-∆u = f), where it outperforms all other solvers. Another related method of optimal complexity is the Algebraic Multilevel Iteration method (AMLI) [10], which is not covered here.

5.3 Computational speed

The AMG method is of linear complexity and the number of iterations is, with preconditioning, independent of n. Using AGMG [11], an AMG solver with a MATLAB interface, we ran some tests to compare a standard iterative solver (such as Gauss-Seidel) with AGMG. These tests were run on a laptop running Linux Mint 17 and MATLAB 2014a on an Intel i7 processor. Execution time was measured using the standard tic-toc functions in MATLAB.

The program solves the two-dimensional convection-diffusion problem

-ε∆u + b_1 ∂u/∂x + b_2 ∂u/∂y = f,

where -ε∆u is the diffusive part. The value of ε determines the size of the diagonal elements in the arising matrices; when this value is smaller, the system becomes more difficult to solve. The remaining terms, b_1 ∂u/∂x + b_2 ∂u/∂y, constitute the convective part, which gives rise to the non-symmetric part of the resulting matrix. The discretization method, whether it is FDM, FEM or some other local method, nevertheless yields a large sparse linear system Ax = b.

We compare the performance of AGMG with the direct solver in MATLAB, which means simply calling A \ b in the MATLAB terminal. MATLAB's backslash operator is a complicated routine and actually very good for various (small) cases, as we shall see below.

Table 1 and Table 2 compare times and iteration counts for forward Gauss-Seidel, backward Gauss-Seidel, AGMG and MATLAB's A\b for two problem sizes (no iteration count applies to the direct solver).

Here we see that the preconditioning that takes place within the AMG method clearly does not impose a heavy computational cost, and that the number of iterations does not depend on the matrix size n. Figure 4 shows timings with and without convection, compared with the backslash operator in MATLAB. As for time, we noted the following:

- The presence of the convective part in the equation is what makes the matrix non-symmetric.
- From the numerical tests we observe that the direct solver performs better than AGMG up to a certain problem size for symmetric matrices. When the non-symmetry is introduced, the direct solver performs much worse.
- As Figure 4d shows, the Gauss-Seidel method is much slower than AGMG and the direct solver for any problem size. Note the logarithmic time scale.
Figure 4: Comparison of AGMG with other methods and different problems. (a) Time for AGMG vs. the direct solver (no convection). (b) Time for AGMG vs. the direct solver (convection: horizontal wind). (c) Time for AGMG vs. the direct solver (non-linear convection). (d) Time for AGMG vs. Gauss-Seidel.

6 Project workflow

6.1 Setup and cooling

The Parallella board, in its initial configuration, is delivered without cooling and becomes very warm. In our setup, in a normal room, connected to the internet and running Ubuntu (Linaro), the ARM processor holds at about 80 degrees Celsius when idle, and intensive tasks bring it up to above 85 degrees, which Adapteva describes as dangerous [12]. Thus we were forced to purchase and install a cooling system, consisting of two 12 V fans mounted in a 3D-printed plastic case with the computer in the middle (see Figure 5). With cooling installed, the temperature of the ARM processor drops to around 50 degrees when idle and stays below 55 degrees at full load. Temperatures can be measured using xtemp, bundled with the Parallella software distribution. The Epiphany chip does not seem to suffer from heating problems, although its operation over long periods of time without cooling has not been checked.
6.2 Software

Once cooling had been installed and we had gained some familiarity and experience with the machine, we installed some prerequisites for the Trilinos package [13]. Here we encountered problems with MPI (OpenMPI) and Fortran which we were unable to resolve. However, since we did not need any of the Fortran parts, these were turned off with CMake and only the C and C++ parts were compiled. The compilers installed were gcc and g++.

After everything was in place and running, we installed the relevant parts of the Trilinos package [13], namely AZTEC, Epetra, ML and Teuchos. The Trilinos package is extensively used for various linear algebra operations, and these packages can be used for AMG. The original intention was to use Trilinos together with the deal.II library, which generates matrices, but because of difficulties installing deal.II this plan was scrapped.

With some difficulty we got Trilinos set up and running. There does not, as of now, appear to be any support for OpenCL acceleration, which means it is impossible to use any part of Trilinos with the Epiphany chip. It does, however, work on the CPU, but this is a very inefficient approach since a CPU-only solver on a standard laptop easily outperforms the ARM chip here.

Figure 5: Our Parallella board setup with two cooling fans mounted on a 3D-printed case.

Instead, we turned to another linear algebra package, Paralution 0.8.0 [14]. It is a C++ library containing various sparse iterative solvers, including the algebraic multigrid method, with OpenCL support. Unfortunately, the OpenCL kernels proved not to be compatible with the Epiphany chip. The developers behind Paralution were contacted for support, but were ultimately unable to resolve the issue. It was clear to us that the Epiphany chip only supports single- and not double-precision calculations, but rewriting the source code to accommodate this limitation did not help. The two possible explanations are:

1. The OpenCL kernels contain parts of the OpenCL standard which are not supported by the COPRTHR SDK. A list of which parts are implemented and which are not can be found in [15].
2. The compiled OpenCL kernels are too large to fit in device memory.

We suspected the first one to be the main reason, and this was later confirmed by the Paralution developer team. Among other things, the kernels use inline calls and recursion, which the COPRTHR SDK does not support. The second reason might also contribute, but we have not examined this more closely.

Having failed with both Trilinos and Paralution, we instead turned to an open-source library called ViennaCL [16]. The code was confirmed to work with the Intel OpenCL platform on our own laptops and was thus installed on the Parallella. There, however, it passed all CPU tests but failed all OpenCL tests, meaning neither single- nor double-precision kernels were compatible with the Epiphany chip.

Instead of trying to make external AMG packages fit, it was decided that we would write our own, simpler programs to simulate a V-cycle of AMG. This is described in the following section.
6.3 Code development

Three different OpenCL programs were written:

- Dense matrix-vector multiplication
- Sparse matrix-vector multiplication
- Sparse matrix-matrix multiplication

The first is included mainly for use as a performance benchmark. The sparse kernels are used to simulate a complete V-cycle as it is performed in AMG, without having an actual AMG solver installed. To write the latter from scratch, conforming to the Epiphany OpenCL implementation (the COPRTHR SDK), would simply take too much time. Details and pseudo-code implementations are included in the next two sections. It is well worth noting that the Epiphany chip does not support any double-precision arithmetic. Hence, all calculations that follow are in single precision.

6.3.1 Dense kernels

For dense matrix-vector multiplication we implement the basic naive algorithm, based on the COPRTHR guide example [15]:

Data: n x n matrix A and vector b of length n
for i = 0:(n-1) do
    for j = 0:(n-1) do
        c_i = c_i + A_ij * b_j
    end
end
Algorithm 1: Dense matrix-vector multiplication

Here, a total of n OpenCL processes are distributed across the 16 Epiphany cores, where each process computes one element of the final vector. We also wrote a secondary kernel in which each process calculates one sixteenth of the elements, which means there is exactly one process per core. However, this does not lead to any noticeable improvement in speedup for matrices that fit in memory (n < 2^12).

6.3.2 Sparse kernels

The sparse kernels are the ones mostly used in iterative solution methods, including AMG. When doing sparse calculations, only a few elements are nonzero; for this report, we have considered matrices where the proportion of nonzero elements is less than 2%. Since the zero elements do not contribute anything to the result (anything multiplied by zero is zero), we need only store the nonzero elements, which saves memory and reduces computational cost. We do, however, need to store the locations of the nonzero elements. A straightforward method is to store the coordinates (x, y) of each nonzero element. This is, however, not the best approach: during computation it is unclear how many nonzero elements there are in a particular row, and where in memory they are located. This means cache misses may occur frequently, which induces a heavy computational penalty.
Instead, a more efficient way to store sparse matrices, and the one most commonly used in practice (for example in the NumPy/SciPy libraries for Python [17]), is either Sparse Row-Wise (SRW) or Sparse Column-Wise (SCW) storage. The first will be explained; the second is analogous. There are many possible formats, and the best one is application-dependent.

Consider a (very small) sparse matrix with 9 nonzero elements, such as

A = [ 0 1 5 0 2 ]
    [ 1 0 7 0 0 ]
    [ 0 0 0 0 0 ]
    [ 0 4 0 6 9 ]
    [ 7 0 0 0 0 ]

A coordinate-based storage format would require three arrays of length 9, such as:

x = [1, 2, 4, 0, 2, 1, 3, 4, 0]
y = [0, 0, 0, 1, 1, 3, 3, 3, 4]
v = [1, 5, 2, 1, 7, 4, 6, 9, 7]

Here, the element with value v_i has column index x_i and row index y_i. Notice that the row indices are repeated and that the elements are written in row order; the row information can therefore be compressed into one array of length n + 1. The same matrix would in SRW format be written as:

I = [0, 3, 5, 5, 8, 9]
J = [1, 2, 4, 0, 2, 1, 3, 4, 0]
V = [1, 5, 2, 1, 7, 4, 6, 9, 7]

Here, the nonzero elements are once again stored in V, and the corresponding column indices are stored in the corresponding elements of J. Thus, a nonzero element V_i is in column J_i of the full matrix. The array I, of length n + 1, specifies which elements of J belong to which row of the full matrix: all elements with index i such that I_k <= i < I_{k+1} are in row k.

Obviously, the difference in memory usage compared to a coordinate-based format is very small in this case, but for larger matrices it becomes more noticeable. The SRW format also makes it computationally easy to ascertain how many nonzero elements there are in a row, and whether any rows are empty: this number is simply I_{k+1} - I_k. If I_{k+1} - I_k = 0, the row is empty and that step of the process can simply be skipped. At the same time, the algorithms become slightly more complicated.
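As a small illustration, the sketch below stores the 5x5 example matrix above in SRW form using the three arrays I, J and V, and uses I to count the nonzeros per row. This is a plain C++ illustration of the storage scheme, not the OpenCL kernel code used on the Epiphany.

```cpp
// The 5x5 example matrix from the text stored in SRW (CSR) form.
// I: row pointers (length n+1), J: column indices, V: nonzero values.
#include <cstdio>
#include <vector>

int main() {
    const int n = 5;
    const std::vector<int>   I = {0, 3, 5, 5, 8, 9};
    const std::vector<int>   J = {1, 2, 4, 0, 2, 1, 3, 4, 0};
    const std::vector<float> V = {1, 5, 2, 1, 7, 4, 6, 9, 7};

    for (int row = 0; row < n; ++row) {
        int nnz = I[row + 1] - I[row];          // nonzeros in this row
        if (nnz == 0) {
            std::printf("row %d is empty\n", row);
            continue;
        }
        std::printf("row %d has %d nonzeros:", row, nnz);
        for (int p = I[row]; p < I[row + 1]; ++p)
            std::printf("  A(%d,%d) = %g", row, J[p], V[p]);
        std::printf("\n");
    }
}
```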
Using the SRW format, multiplying a sparse matrix with an arbitrary (dense) vector was done in the following way:

Data: n x n matrix A in SRW format and a dense vector b
for i = 0:(n-1) do
    if row i of A is non-empty then
        for each nonzero element A_ik in row i do
            c_i = c_i + A_ik * b_k
        end
    else
        skip row i
    end
end
Algorithm 2: Sparse matrix-vector multiplication

Note that here only the matrix is considered sparse; the vector b is dense. Extending this code to matrix-matrix multiplication, where both matrices are sparse and in SRW format, we implemented the following algorithm:

Data: n x n matrices A and B in SRW format
for i = 0:(n-1) do
    if row i of A is non-empty then
        for each nonzero element A_ik in row i of A do
            for each nonzero element B_kj in row k of B do
                if column j already exists in row i of C then
                    C_ij = C_ij + A_ik * B_kj
                else
                    store column index j in row i of C;
                    C_ij = A_ik * B_kj
                end
            end
        end
    else
        skip row i
    end
end
Algorithm 3: Sparse matrix-matrix multiplication

What does not appear in the pseudo-code is the problem of pre-allocating memory for C. Because there is no general result for how many nonzeros there may be in C for general sparse matrices A and B, one must make an estimate. We made a rough estimate of the expected number of nonzero elements in C and confirmed the estimate by performing the multiplication in MATLAB. In a more general setting, the kernel must allocate a certain initial size and, during calculation, detect whether the allocated memory is too small and allocate more, which takes extra time. This more complicated (but in practical applications necessary) routine is not included in the present implementation.

The sparse programs are designed in the same way as the optimized dense kernel. This means that instead of having one process for each index i, there are as many processes p as the number of cores used, where each takes care of n/p rows of the matrix.
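A plain C++ version of the sparse matrix-vector scheme might look as follows: the rows are split into p contiguous blocks, one per worker, and each row's result is accumulated locally before being written out once. This is only a sketch of the design described above, with std::thread standing in for the OpenCL processes on the Epiphany cores; it is not the actual kernel code.

```cpp
// Sketch of the row-blocked sparse matrix-vector product c = A*b, with A in
// SRW/CSR format and the rows split over p workers.  std::thread is used here
// only to stand in for the OpenCL processes that run on the Epiphany cores.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct CsrMatrix {
    int n;                         // matrix is n x n
    std::vector<int> I, J;         // row pointers and column indices
    std::vector<float> V;          // nonzero values
};

// Each worker handles the contiguous row block [row_begin, row_end).  The
// result of a row is accumulated locally and written to c exactly once.
static void spmv_rows(const CsrMatrix& A, const std::vector<float>& b,
                      std::vector<float>& c, int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; ++i) {
        float sum = 0.0f;
        for (int p = A.I[i]; p < A.I[i + 1]; ++p)  // empty rows contribute 0
            sum += A.V[p] * b[A.J[p]];
        c[i] = sum;
    }
}

static std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& b, int p) {
    std::vector<float> c(A.n, 0.0f);
    std::vector<std::thread> workers;
    const int rows_per_worker = (A.n + p - 1) / p;   // roughly n/p rows each
    for (int w = 0; w < p; ++w) {
        const int begin = w * rows_per_worker;
        const int end   = std::min(A.n, begin + rows_per_worker);
        if (begin >= end) break;
        workers.emplace_back([&A, &b, &c, begin, end] { spmv_rows(A, b, c, begin, end); });
    }
    for (std::thread& t : workers) t.join();
    return c;
}

int main() {
    // The 5x5 example matrix from the previous subsection, multiplied by a
    // vector of ones using 4 workers.
    const CsrMatrix A{5, {0, 3, 5, 5, 8, 9}, {1, 2, 4, 0, 2, 1, 3, 4, 0},
                      {1, 5, 2, 1, 7, 4, 6, 9, 7}};
    const std::vector<float> b = {1, 1, 1, 1, 1};
    for (float ci : spmv(A, b, 4)) std::printf("%g ", ci);
    std::printf("\n");                               // expected: 8 8 0 19 7
}
```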
In the actual kernels, data is written to the Epiphany core's local memory throughout the row loop, and is only transferred to the global shared memory once the entire row has been computed. This provides a very large speedup due to the limited global memory bandwidth, which is important to take into account for other development on the Parallella. Since each row is handled by only one process, there are no cache problems or memory clashes.

Complete source code for all our tests is released under the MIT software license on GitHub [18].

7 Problem setup

7.1 Linear algebra

With the linear algebra packages detailed in the previous section failing to use the Epiphany chip, another approach was needed to perform (or at least simulate) AMG calculations on the Parallella. We opted for an approach which does not actually calculate a solution, but mimics the type of computations (sparse matrix-vector and sparse matrix-matrix multiplications of varying sizes) that are performed during an AMG V-cycle. In order to obtain meaningful timings, our simulation was constructed as follows:

1. Allocate memory for the matrices.
2. Synchronize memory with the Epiphany chip.
3. Perform the computation.
4. Synchronize the result.

A careful reader might realize that, although we work with a shared memory model, the current OpenCL implementation is not designed for this and the COPRTHR SDK still requires us to synchronize data.

Because the V-cycle works with computations on each level, from the finest mesh down to the coarsest and then back up to the finest again, the computational steps were built up accordingly (where n is the largest matrix size):

1. Sparse matrix-matrix and sparse matrix-vector multiplication, size n x n.
2. Divide n by 2.
3. Repeat steps 1 and 2 until the coarsest level is reached.
4. Perform exact solving.
5. Multiply n by 2.
6. Sparse matrix-matrix and sparse matrix-vector multiplication.
7. Repeat steps 5 and 6 until the finest level n is reached.

The mesh sizes were chosen as powers of 2 to ensure that halving the size is always possible. It also makes it easy to balance the load between the p Epiphany cores, which each compute n/p rows of the matrix. Due to memory restrictions we chose 8192 as our finest mesh; a sketch of the resulting level schedule is given below.
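The sketch below only builds the sequence of matrix sizes visited by one simulated V-cycle; the finest and coarsest sizes, 8192 and 512, are the ones used in our runs, and the work at each level is merely indicated by a print statement.

```cpp
// Sketch of the level schedule for the simulated V-cycle: the finest size is
// halved down to the coarsest level and then doubled back up again.  At each
// level one sparse matrix-vector and one sparse matrix-matrix product is timed.
#include <cstdio>
#include <vector>

int main() {
    const int finest = 8192, coarsest = 512;        // sizes used in our runs
    std::vector<int> levels;
    for (int n = finest; n >= coarsest; n /= 2)     // restriction phase
        levels.push_back(n);
    // (exact solving would take place here, on the coarsest level)
    for (int n = 2 * coarsest; n <= finest; n *= 2) // prolongation phase
        levels.push_back(n);
    for (int n : levels)
        std::printf("level of size %5d: time one SpMV and one SpGEMM here\n", n);
}
```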
The number of matrix-matrix and matrix-vector computations performed over a V-cycle is very problem-dependent, namely on which smoother is used and how aggressively the coarsening is done. However, one way to loosely estimate it is to estimate how much work is done over the simplest possible V-cycle, which performs one matrix-vector multiplication and one matrix-matrix multiplication per level.

Assume we have n levels ordered as {l_1, l_2, ..., l_n}, where l_1 is the coarsest and l_n is the finest. Suppose also that the matrix sizes satisfy N_{k-1} = N_k / 2 and that the proportion of non-zeros in each matrix is some constant α. The total number of non-zero elements on level k is therefore αN_k². The arithmetic work needed to multiply a matrix of size N_k x N_k with a vector can then be estimated as

W_k = αC_1 N_k²,    (2)

and, similarly, multiplying two matrices of the same size, with the same proportion of non-zeros, requires

W_k = αC_2 N_k²,    (3)

where C_1 and C_2 are constants independent of the size of the matrix. With (2) and (3) we can estimate the total arithmetic work W_V for a complete V-cycle:

W_V = αC (N_n² + N_{n-1}² + ... + N_1²)
    = αC (N_n² + (N_n/2)² + ... + (N_n/2^{n-1})²)
    = αC N_n² (1 + 1/4 + ... + (1/4)^{n-1})
    = αC N_n² (1 - (1/4)^n) / (1 - 1/4)
    ≈ (4/3) αC N_n²

for large n, where C = C_1 + C_2. From the formula we see that the total amount of work over a V-cycle is proportional to the number of non-zeros on the finest level, αN_n². Because all the matrices are sparse, α satisfies αN_n² = O(N_n), and from this we deduce that W_V = O(N_n), with a constant of proportionality independent of the matrix size N_n and the proportion of non-zero elements α. The size of this constant depends on factors such as the choice of smoother and how aggressive the coarsening is. The calculation nevertheless has the benefit of showing that multigrid methods have a linear computational cost.

We decided to measure one matrix-vector and one matrix-matrix multiplication on each level, giving a benchmark value which can then be used to estimate the time demands of any given problem, since at least one matrix-matrix and one matrix-vector multiplication will be performed per level. The V-cycle timing was thereafter obtained by summing the execution and memory synchronization times over all levels. The latter is included as it is a requirement for computation on this device.
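A few lines of code confirm the estimate numerically by summing the per-level work αCN_k² over a halving hierarchy and comparing it with (4/3)αCN_n². The values of α and C below are arbitrary illustrative constants; only the finest size matches the one used in our runs.

```cpp
// Numerical check of the work estimate above: summing alpha*C*N_k^2 over the
// levels N_n, N_n/2, ..., and comparing with (4/3)*alpha*C*N_n^2.  alpha and C
// are arbitrary illustrative constants, not measured values.
#include <cstdio>

int main() {
    const double alpha = 0.02, C = 1.0;      // assumed fill fraction and cost constant
    const double n_finest = 8192.0;          // finest level size used in our runs
    double total = 0.0;
    for (double N = n_finest; N >= 2.0; N /= 2.0)
        total += alpha * C * N * N;          // per-level work, cf. eqs. (2) and (3)
    const double estimate = 4.0 / 3.0 * alpha * C * n_finest * n_finest;
    std::printf("summed work: %.1f, 4/3 estimate: %.1f, ratio: %.4f\n",
                total, estimate, total / estimate);
}
```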
7.2 Energy consumption

With the help of the IT project course group working on a separate Parallella project [19], equipment was borrowed to measure power consumption. According to the Adapteva website [2], the computer consumes only 5 W under typical workloads; we wanted to test this claim. The dense matrix-vector kernel was run several times in a loop while power consumption was measured, and the result was compared with the power consumption when idle. The computations included randomizing a matrix and a vector on the CPU, while most of the computation took place on the accelerator. We also multiplied large sparse matrices on the accelerator while the CPU was idle.

The power consumption tests were made using a special chip attached to metal pins available on the board for this purpose, and a standard multimeter measured the voltage across these pins. To convert this into a current, we solved for the current A from the measured voltage V using a linear relation between the two, deduced beforehand by least-squares fitting of measured voltages (V) against known loads (A) through the power supply. Multiplying this current by five volts gives the corresponding wattage.

8 Results

All speedup results below are based on the execution time on the accelerator, as well as the time for synchronizing memory to and from it. The speedup is computed as relative speedup, where the speedup S for n cores with runtimes T is

S(n) = T(1 core) / T(n cores).

Although the memory is shared, it appears (judging by the basic OpenCL examples for Parallella) that synchronization routines must still be called, which slightly hinders performance. This can hopefully be remedied in future versions of the COPRTHR SDK; we have used the currently newest version, 1.6. The Epiphany chip does not support double-precision numbers, so all results below are in single precision. Worth noting is that, due to the lack of time in this project, our code is probably not fully optimized, although we know of no specific remedies that would provide a significant performance improvement.
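All speedups reported below follow this definition. As a small worked example, the snippet computes S(n) = T(1 core)/T(n cores) for the simulated V-cycle timings (calculation plus synchronization) listed later in Table 13; the timings are the measured values, while the function itself is only a trivial helper, not part of the benchmark code.

```cpp
// Relative speedup S(n) = T(1 core) / T(n cores), illustrated with the
// measured V-cycle timings (calculation + synchronization) from Table 13.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::vector<int>    cores   = {1, 2, 4, 8, 16};
    const std::vector<double> seconds = {11.0042, 6.5329, 4.3125, 3.1910, 2.6574};
    for (std::size_t i = 0; i < cores.size(); ++i)
        std::printf("%2d cores: %.4f s, speedup %.4f\n",
                    cores[i], seconds[i], seconds[0] / seconds[i]);
}
```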
Table 3 shows the number of non-zero elements and their proportion of the total number of elements for each matrix size used; the sizes range from 512 up to 8192 and all proportions are below 2%. The second matrix used for matrix-matrix multiplication is of similar size and sparsity. The actual sparsity of the matrices in an AMG V-cycle depends firstly on the problem, but also on the methods used for restriction and prolongation. Our numbers are realistic in the sense that we go to a larger proportion for every restriction (i.e. downwards in the hierarchy) and to a smaller proportion for every prolongation (upwards in the hierarchy). Whether the proportions themselves are realistic is impossible to say since they are, as just stated, problem-dependent.

Size    # Non-zeros    Proportion (%)
512
1024
2048
4096
8192
Table 3: Sizes and fraction of non-zeros used during the tests.

8.1 Dense linear algebra computations

The graph in Figure 6 includes both the memory synchronization with OpenCL and the execution time of the OpenCL kernel. As we can see in Figure 6, memory latency affects the speedup for n = 512 and n = 1024, whereas for n = 2048 we get near-linear speedup (see also Table 6 for the actual numbers). Using an even larger matrix requires a smarter memory model than what we implemented, as the maximum contiguous memory blocks seem to be around 32 MB.

Figure 6: Relative speedup of the multiplication of a dense matrix with a vector, split over 1-16 cores.

Below we include tables showing the different time measurements that were performed. Table 4 and Table 5 show the allocation time and the synchronization time, respectively, for each matrix size. Not surprisingly, there is no difference in allocation time for different numbers of cores; the same goes for memory synchronization, which is good. Between matrix sizes, the timings increase by almost a factor of 4 when the size is doubled. This is expected, as the number of elements also grows by a factor of 4.

Allocation
Size    Time (s)    Ratio
512
1024    0,0581
2048    0,2292      3,9471
Table 4: Allocation time in seconds for dense matrix-vector multiplication.

Synchronization
Size    Time (s)    Ratio
512
1024    0,2144
2048    0,8547      3,9814
Table 5: Memory synchronization time in seconds for dense matrix-vector multiplication.
Calculation
Size\#cores          1          2         4         8         16
512                  0,7488     0,4097    0,2403    0,1559
1024                 2,7793     1,4260    0,7484    0,4110
2048                 10,9235    5,4955    2,7821    1,4295    0,7618
Speedup (n = 2048)   1          1,9877    3,9263              14,3394
Table 6: Computation time in seconds for dense matrix-vector multiplication.

8.2 Sparse LA

All results below are obtained using single-precision values. We have not tested integer values, because of the time restrictions of the project and because this is not common in practical applications. Due to the lack of RAM we were not able to experiment with matrices larger than 8192 x 8192.

8.2.1 Speedup

The results from our experiments show a fairly good speedup, at least for matrix-vector multiplication (see Figure 7). The best speedup is obtained when using as many cores as possible at maximum load, for both types of computations.

Figure 7: Relative speedup, computed from the values in Tables 7 and 8, of (a) sparse matrix-matrix multiplication and (b) sparse matrix-vector multiplication, split over 1-16 cores.

Calculation and synchronization
Size\#cores   1        2        4        8        16
512           0,1356   0,1237   0,1182   0,1152
1024          0,1681   0,1446   0,1325   0,1264
2048          0,2353   0,1864   0,1625   0,1480
4096          0,3670   0,2710   0,2232   0,1986
8192          0,6365   0,4448   0,3459   0,3002   0,2760
Table 7: Actual measured timings in seconds for the sparse matrix-matrix multiplication.
Calculation and synchronization
Size\#cores   1        2        4        8        16
512           0,2074   0,1473   0,1183   0,1005
1024          0,3329   0,2110   0,1509   0,1206
2048          0,5831   0,3387   0,2177   0,1561
4096          1,0839   0,5939   0,3501   0,2280
8192          2,0938   1,1119   0,6183   0,3738   0,2517
Table 8: Actual measured timings in seconds for the sparse matrix-vector multiplication.

8.2.2 Memory allocation

Although memory allocation is not done on the Epiphany chip (only on the CPU), it can be a significant part of the total execution time for any large application. Here we can note two important things. First, when doing for example a matrix-matrix multiplication with full use of the Epiphany chip (16 cores), allocating enough memory takes approximately 40% of the total execution time. Second, as Table 9 and Table 10 show, doubling the size of the matrices also doubles the time for allocating memory, corresponding to the doubling of the number of nonzero elements.

Figure 8: Allocation timings for different numbers of elements. Note that matrix-vector multiplication requires less memory and thus becomes faster.

Allocation: mat-vec multiplication
Size    # allocated    Time (s)    Ratio
512
1024                   0,0067
2048                   0,0128
4096                   0,0258
8192                   0,0507      1,9685
Table 9: Memory allocation timings in seconds for the sparse matrix-vector multiplication.

Allocation: mat-mat multiplication
Size    # allocated    Time (s)    Ratio
512
1024                   0,0249
2048                   0,0499
4096                   0,1016
8192                   0,2113      2,0790
Table 10: Memory allocation timings in seconds for the sparse matrix-matrix multiplication.
8.2.3 Memory synchronization

Synchronizing the memory on the ARM CPU with the Epiphany chip is also time-consuming. Taking the same example as above for memory allocation, synchronization takes slightly less time, but still above 30% of the total time for one computation. Also, as with memory allocation, Table 11 and Table 12 show that it takes twice as much time to synchronize twice as many elements.

Figure 9: Memory synchronization times in seconds for different numbers of elements. Note that matrix-vector multiplication requires less memory and thus becomes faster.

Synchronization: mat-vec multiplication
Size    # synchronized    Time (s)    Ratio
512
1024                      0,0056
2048                      0,0099
4096                      0,0187
8192                      0,0365      1,9545
Table 11: Memory synchronization timings in seconds for the sparse matrix-vector multiplication.

Synchronization: mat-mat multiplication
Size    # synchronized    Time (s)    Ratio
512
1024                      0,0191
2048                      0,0368
4096                      0,0726
8192                      0,1491      2,0560
Table 12: Memory synchronization timings in seconds for the sparse matrix-matrix multiplication.

8.3 Simulated V-cycle

Our simulated V-cycle shows an approximately 4x maximum speedup with 16 cores; see Figure 10 and Table 13.
Figure 10: Speedup of our simulated V-cycle for matrix sizes from n = 8192 down to n = 512.

#Cores            1          2         4         8         16
Calculation       10,6252    6,1530    3,9331    2,8116    2,2630
Synchronization   0,3790     0,3799    0,3795    0,3790    0,3944
Sum               11,0042    6,5329    4,3125    3,1910    2,6574
Speedup           1          1,6844    2,5517    3,4489    4,1410
Table 13: Actual timings (calculation and synchronization, in seconds) for the simulated V-cycle on different numbers of cores.

8.4 Comparisons with MATLAB

In order to have something to compare with, we also ran computations on the very same matrices in MATLAB on a laptop running MATLAB 2013a on Windows 7, using a dual-core Intel Core i5 4500U (2.5 GHz) and 6 GB RAM. While most matrix operations, including matrix-matrix multiplication, are implicitly parallelized in MATLAB, this is not the case if at least one matrix is sparse. Hence we present only serial runtimes and discuss them in the next section. Because of how MATLAB is built, the following calculations are in double precision, whereas the Epiphany accelerator only supports single precision.

Table 14: Sparse matrix-matrix multiplication timings in MATLAB.
Table 15: Sparse matrix-vector multiplication timings in MATLAB.

Setting up a similar V-cycle as proposed in the problem setup section (Section 7.1), using the same matrices as on the Parallella, the simulated V-cycle in MATLAB takes approximately … seconds. The corresponding times on the Parallella are 11,0042 seconds on 1 core and 2,6574 seconds on 16 cores.
8.5 Power measurements

Power consumption was tested in [19]; the results are shown here in Table 16.

Workload   Idle     Compiling   Unit tests   Bucket sort
Power      2.2 W    3 W         3.8 W        2.9 W
Table 16: Energy consumption of the Parallella Embedded Server as measured in [19].

For these tests, compilation took place only on the CPU and the bucket sort only on the accelerator, while the unit tests utilized the CPU, the accelerator and even the network connector. It should be noted that the device used is not identical to ours (those measurements were made on the Embedded platform, whereas we have used the Desktop version). Since the hardware difference is so small, and the CPU and accelerator are identical, we feel confident that the results in Table 16 should reflect the power consumption of all three device types. It is unknown how much the use of the GPIO ports might affect power consumption.

We tried reproducing these tests, and our measurements peaked at 3 W at maximal load and held around 0.25 W when idle. The measured consumption cannot be just the Epiphany chip, since that has a maximum power consumption of 2 W [2], but it also seems too low to be the entire computer. This measurement is doubtful at best.

An important observation was that the accelerator continues to run and consume power even after the kernel has completed execution and no data is being fed to it. The programmer has to manually clear the Epiphany instruction set in order to halt this power consumption. Needless to say, this might have large consequences for power efficiency in large-scale runs if it is forgotten.

9 Discussion

As we have neither used nor closely examined the eSDK made for the Epiphany chip, we choose to comment only on the OpenCL support and the results we have acquired using it. OpenCL support for the Epiphany is, as yet, clearly very limited. The same conclusion is reached in [20], which notes that programs using the eSDK scale much better than the OpenCL implementations do, at the cost of increased development time.

The results show a fairly good speedup using 16 cores for both matrix-vector multiplications: slightly above eight times faster for the sparse case and around 14 times faster for the dense case. Our poor results for matrix-matrix multiplication are most probably due to bad memory management. As already mentioned, our code is not optimized, although we do not know of any specific fixes that would provide significant speedup.

As for the V-cycle, we see at best four times speedup at full load using all computing nodes. At the same time, it is common knowledge that sparse linear algebra operations generally do not scale well on parallel platforms. The experiments performed here do not bring a dramatically different insight, despite the shared memory model, which was presumed to offer better performance. Improving the speedup of matrix-matrix multiplication, if possible, is what would offer the most improvement to the V-cycle simulation results. Note that standard linear algebra routines such as BLAS are not implemented for the Epiphany accelerator; generic BLAS results on the ARM CPU would not be of any interest.
Because the OpenCL standard is not completely implemented, we consider the Epiphany OpenCL implementation (the COPRTHR SDK) not to be quite finished yet. There are certainly many optimizations that could be performed there, under the hood. Another alternative would be to use a different parallel framework, such as a lighter version of MPI, which might suit the Multiple Instruction, Multiple Data (MIMD) architecture of the Epiphany better.

The results regarding memory allocation and synchronization with the Epiphany chip are rather good, showing that doubling the number of elements leads to a doubling in time. The memory latency shows itself when allocating or updating fewer elements, but becomes less noticeable for large datasets.

Even though the comparison of our results with the same computations in MATLAB is very unfavourable (the latter is fully optimized and not run on a comparable computer), we note that the timings in MATLAB grow much faster with problem size than those on the Parallella. This indicates that the Parallella has potential for better scaling.

The 64+2-core Parallella version, not yet available, might provide even better scalability and has a peak performance rating of 85 GFLOPS, as opposed to the 25 GFLOPS [5] of the 16-core version used here. It was scheduled for release earlier, but as of January 2015 it is not yet available. The 1 GB RAM limit could also easily become a deal-breaker, as many numerical linear algebra problems entail much larger datasets.

10 Conclusions and future research

For this project, we have tried to use the Trilinos and Paralution packages for linear algebra. Although they can be installed, various attempts to get them to use the Epiphany accelerator chip have failed, meaning performance is severely limited and not useful. Results from the ViennaCL library were similarly disappointing, and the deal.II package would not install, probably because of ARM incompatibilities.

Instead, we implemented our own dense matrix-vector, sparse matrix-vector and sparse matrix-matrix multiplications in OpenCL. Dense matrix-vector multiplication shows near-linear speedup, whereas our unoptimized sparse routines show only 4-8x speedup. We tried simulating an AMG V-cycle using these routines, but its speedup suffers from the poor performance of the sparse matrix-matrix multiplication. As we were, because of time limitations, unable to fully implement a standard AMG solver utilizing the Epiphany, future research could definitely look into developing such software, utilizing both a single Parallella board and several boards.

We also tried, but failed, to measure the power consumption of the board. Results from [19], if accurate, point to less than 5 W, and even less than 4 W at full load on both CPU and accelerator. A future study should verify this with better measurement instruments.

We know of no study where different low-power supercomputing-on-a-chip architectures, such as the Epiphany, Kalray [21] etc., are compared in terms of computing power, scalability, energy efficiency and simplicity of development. This area of research is moving rapidly, even compared to the usual speed at which computer research moves. Such a study might be of great benefit to the supercomputing community. Since the Epiphany is MIMD, the Parallella computer might also lend itself well (or better) to task-parallel problems as well as data-parallel problems.
While sparse linear algebra computations are, in theory, better suited to shared memory systems, we cannot at this moment recommend extensive linear algebra computations to be performed on the Parallella. We acknowledge that it has potential: with more complete OpenCL support, allowing general linear algebra packages to be installed and run, as well as with the 64+2-core version, its usability would be greatly enhanced.
References

[1] Xilinx, Zynq-7000 All Programmable SoC Overview, document number DS190.
[2] Parallella Reference Manual.
[3] LINPACK benchmark (web reference).
[4] Linaro (web reference).
[5] Parallella-1.x Reference Manual.
[6] OpenCL, the Open Computing Language (web reference).
[7] Trygve Aaberge, Analyzing the Performance of the Epiphany Processor, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, pp. 11.
[8] S. H. Lui, Numerical Analysis of Partial Differential Equations, p. 346.
[9] Yousef Saad, Iterative Methods for Sparse Linear Systems, SIAM.
[10] Johannes Kraus, Svetozar Margenov, Robust Algebraic Multilevel Methods and Algorithms, p. 29 and onward, Walter de Gruyter.
[11] AGMG, an aggregation-based algebraic multigrid solver (web reference).
[12] Adapteva (web reference).
[13] The Trilinos project (web reference).
[14] PARALUTION library (web reference).
[15] Parallella quick start guide, docs/parallella_quick_start_guide.pdf.
[16] ViennaCL library (web reference).
[17] NumPy/SciPy documentation, scipy.sparse.csr_matrix.
[18] Project source code repository on GitHub (web reference).
[19] Christos Sakalis et al., The EVI Distributed Shared Memory System, Department of Information Technology, Uppsala University.
[20] Trygve Aaberge, Analyzing the Performance of the Epiphany Processor, 2014.
[21] Kalray web page.
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Whitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez
Energy efficient computing on Embedded and Mobile devices Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez A brief look at the (outdated) Top500 list Most systems are built
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
Notes on Cholesky Factorization
Notes on Cholesky Factorization Robert A. van de Geijn Department of Computer Science Institute for Computational Engineering and Sciences The University of Texas at Austin Austin, TX 78712 [email protected]
Mesh Generation and Load Balancing
Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks Garron K. Morris Senior Project Thermal Engineer [email protected] Standard Drives Division Bruce W. Weiss Principal
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
CS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and
SIDN Server Measurements
SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources
Building a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
Lecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Power-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
10.2 ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS. The Jacobi Method
578 CHAPTER 1 NUMERICAL METHODS 1. ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS As a numerical technique, Gaussian elimination is rather unusual because it is direct. That is, a solution is obtained after
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,
Chapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
Contributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE Sudha M 1, Harish G M 2, Nandan A 3, Usha J 4 1 Department of MCA, R V College of Engineering, Bangalore : 560059, India [email protected] 2 Department
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
Lattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
TESLA Report 2003-03
TESLA Report 23-3 A multigrid based 3D space-charge routine in the tracking code GPT Gisela Pöplau, Ursula van Rienen, Marieke de Loos and Bas van der Geer Institute of General Electrical Engineering,
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
What Is Specific in Load Testing?
What Is Specific in Load Testing? Testing of multi-user applications under realistic and stress loads is really the only way to ensure appropriate performance and reliability in production. Load testing
:Introducing Star-P. The Open Platform for Parallel Application Development. Yoel Jacobsen E&M Computing LTD [email protected]
:Introducing Star-P The Open Platform for Parallel Application Development Yoel Jacobsen E&M Computing LTD [email protected] The case for VHLLs Functional / applicative / very high-level languages allow
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
