Using Parallella for Numerical Linear Algebra
Marcus Naslund, Jacob Mattsson
Project in Computational Science, January 2015
Project Report, Department of Information Technology
Contents

1 Introduction
2 Background
3 Purpose
4 Hardware and setup
  4.1 The Epiphany chip
5 Multigrid methods
  5.1 Linear systems
  5.2 General properties
  5.3 Computational speed
6 Project workflow
  6.1 Setup and cooling
  6.2 Software
  6.3 Code development
    6.3.1 Dense kernels
    6.3.2 Sparse kernels
7 Problem setup
  7.1 Linear algebra
  7.2 Energy consumption
8 Results
  8.1 Dense linear algebra computations
  8.2 Sparse LA
    8.2.1 Speedup
    8.2.2 Memory allocation
    8.2.3 Memory synchronization
  8.3 Simulated V-cycle
  8.4 Comparisons with MATLAB
  8.5 Power measurements
9 Discussion
10 Conclusions and future research
1 Introduction

In this project we have tested Parallella, a miniature supercomputer, and its suitability for solving scientific computing problems. The project mainly involved a performance test in a realistic problem setting using the Algebraic Multigrid method (AMG). This report also documents some of the features which may make the platform an interesting option for future research.

2 Background

The company Adapteva produces the 16- and 64-core Epiphany MIMD chips, which are the main selling point of the Parallella miniature supercomputer. The board is powered by a dual-core ARM A9 CPU [1] and uses the 16-core Epiphany chip as an accelerator. At the time of writing, the largest supercomputer in the world sports a staggering 3.12 million cores consuming almost 18 MW of power, which means around 5 W per core. In contrast, the Parallella consumes around 5 W for all of its 2+16 cores while delivering a theoretical peak of 32 GFLOPS [2]. There is also a 64+2-core version, albeit delayed and not yet released, with a presumably even better power-per-core ratio.

As the world approaches exascale computing, it must also move towards more energy-efficient computing cores. As of November 2014 the greenest supercomputer ever produced delivers an impressive 5.27 gigaflops per watt (GFLOPS/W), which is the standard metric used for this measure.

Independently of this study, another group in the IT department has built a cluster of Parallella boards, using the distributed shared memory (DSM) computing paradigm. The results from that project show that, with their hardware and system setup, Parallella boards are not optimal for such a system, due to the constant communication between the ARM chips and the Epiphany cores. We have not used their approach, and details of our implementation follow in the next sections.

Parallella's architecture, mainly the memory shared between CPU and accelerator, together with the easy connection of additional Parallella boards, seems to offer good scaling capabilities for numerical linear algebra, especially the Algebraic Multigrid methods, which are further detailed below.

3 Purpose

There are many iterative solvers for very large sparse linear systems Ax = b, such as the Jacobi method, the Conjugate Gradient method, etc. A special class of methods are the so-called Multigrid methods, which have many theoretical advantages but do not always scale well on large parallel systems.

There is a growing need to solve large sparse linear systems since they appear in nearly any scientific computing problem. Direct methods are unfeasible because of their computational complexity; the feasible alternatives are the iterative methods. Their potential drawback is very slow convergence. Therefore we need to accelerate those methods using a technique called preconditioning: the linear system Ax = b is replaced by B^{-1}Ax = B^{-1}b for a certain matrix B^{-1}, called the preconditioner. Note that the transformed system has the same solution as the original one. These improved iterative solvers, including the Multigrid methods, are very communication-heavy and thus become difficult to implement on distributed memory systems.
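As a concrete, deliberately tiny illustration of the preconditioning idea, the following C++ sketch applies a diagonal (Jacobi) preconditioner B = diag(A) inside a preconditioned Richardson iteration, x <- x + B^{-1}(b - Ax). This is only a minimal example of the concept, using an arbitrary 3x3 test system; it is not the AMG preconditioner discussed in the rest of the report.

```cpp
// Minimal sketch of preconditioning: solve Ax = b with the preconditioned
// Richardson iteration x <- x + B^{-1}(b - Ax), where B = diag(A) is the
// Jacobi preconditioner.  The 3x3 system is an arbitrary example; its exact
// solution is x = (1, 1, 1).
#include <cstdio>
#include <vector>

int main() {
    const int n = 3;
    const double A[3][3] = {{4, -1, 0}, {-1, 4, -1}, {0, -1, 4}};
    const std::vector<double> b = {3, 2, 3};
    std::vector<double> x(n, 0.0);

    for (int it = 0; it < 50; ++it) {
        std::vector<double> r(n, 0.0);
        for (int i = 0; i < n; ++i) {              // residual r = b - Ax
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i][j] * x[j];
        }
        for (int i = 0; i < n; ++i) x[i] += r[i] / A[i][i];  // x += B^{-1} r
    }
    std::printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);
}
```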
Figure 1: The Parallella board [2]. (a) Top view of the board. (b) Bottom view of the board.

Hopefully, with the architectural difference of the Parallella compared to conventional large-scale computational systems, the Algebraic Multigrid method (AMG) will scale much better in terms of energy efficiency. Either way, the chosen test method provides much more information on how useful the computer actually is than standard benchmarking tests such as the LINPACK performance tests [3].

4 Hardware and setup

The Parallella board used for this report (the Desktop version) is equipped with an ARM A9 CPU and 1 GB RAM. Its parallel speed comes from its 16-core Epiphany chip. There is also a 64-core version planned but not yet in production. While the computer supports any ARM-compatible operating system, our computer was set up with a variant of Ubuntu called Linaro, which is made especially for ARM-based computers [4]. This is what is recommended by the manufacturer Adapteva, but as the platform becomes more popular, guides for installing other choices of OS are becoming more common.

Moreover, the board comes with the following set of hardware options [2]:

- Gigabit Ethernet
- HDMI port
- 4 general-purpose expansion connectors
- 2x USB 2.0
- MicroSD card, for storing the operating system and software
- FPGA (Field-Programmable Gate Array), basically a programmable integrated circuit

One of the more interesting features is the set of four expansion connectors, which gives the opportunity to increase parallelism even further by connecting several Parallellas together (see Figure 1b). We have been unable to find any reports of large-scale tests of this feature and are thus unable to comment on how well it works.

The credit-card-sized computers, measuring 9 cm x 5.5 cm, are available in three versions:
- Micro-server: The cheapest alternative, with no IO connector ports.
- Desktop: The one we have used (and the most common version).
- Embedded platform: The most expensive version. Same as the Desktop version but with a larger FPGA and twice as many GPIO ports.

All versions feature the same processor, memory size and Epiphany 16-core accelerator, and have a theoretical peak performance of 25 GFLOPS, meaning around 5 GFLOPS/W [5].

4.1 The Epiphany chip

The Epiphany chip is a multicore shared-memory microprocessor consisting of a 2D array of 16 computing nodes. Each node contains an independent superscalar floating-point RISC CPU (RISC: Reduced Instruction Set Computing, a class of processor architectures developed by IBM during the late 1970s), operating at 1 GHz and capable of two floating-point operations or one integer operation per clock cycle. Each node has up to 1 MB of local memory, allowing simultaneous accesses, for example by the instruction fetch engine. Each node can access its own or other nodes' memory through regular load and store instructions. Moreover, the chip comes with the following features:

- 32 GFLOPS peak performance
- 512 GB/s local memory bandwidth
- 2 W maximum chip power consumption
- Fully featured ANSI-C/C++ programmability

With four connectors, several Parallella computers may be connected together in a cluster to achieve even greater parallel performance. While a traditional method like MPI (Message Passing Interface) may be used to communicate between Parallella boards, this approach is a distributed-memory model and thus not compatible with the shared memory model on a single Parallella board. Here, the Open Computing Language OpenCL [6] must instead be used.

Instead of using OpenCL, there exists a Software Development Kit (SDK) specifically for programming the Epiphany chip, called the eSDK. In a recent study, albeit with very limited testing, the eSDK was shown to clearly outperform the OpenCL implementation for Epiphany, called the COPRTHR SDK [7]. However, since we do not have the time or expertise to implement a new AMG solver based on the eSDK, our only option is to use OpenCL.

The OpenCL language is very similar to the C language, with certain extensions. Small sets of instructions called kernels are written in OpenCL (similar to the way one would program for CUDA) and are assigned to the accelerator cores from a host process run on the CPU. OpenCL lets us specify which cores should do what, something that works well with the Multiple Instruction, Multiple Data (MIMD) architecture of the Epiphany chip.

5 Multigrid methods

5.1 Linear systems

Consider a linear system of equations
Ax = b,    (1)

or equivalently b - Ax = 0, where b, x ∈ R^n and A ∈ R^{n×n} is sparse, for very large n (> 200 million). Such systems arise, for example, when solving partial differential equations using the Finite Difference Method (FDM) or the Finite Element Method (FEM). Exact solvers are far too inefficient because of their computational complexity. For the types of problems above one usually relies on iterative methods, which obtain an approximate solution x_k where b - Ax_k ≈ 0. The solution is improved with more iterations, and we of course want to have

lim_{k→∞} x_k = x,

where x is the exact solution of the system. The expression b - Ax_k is called the residual; the residual corresponding to x_k is denoted by r_k. The error of the same solution is x - x_k, which we denote by e_k. The two are related in the following way:

r_k = b - Ax_k = Ax - Ax_k = A(x - x_k) = Ae_k.

The number of iterations for simple iterative methods such as the Jacobi and Gauss-Seidel methods grows too quickly with n, which is briefly illustrated in the next section. But there is an upside to using Jacobi's method. It can be shown [8] that this method reduces the high-frequency errors, i.e. errors corresponding to large eigenvalues of A (|λ| >> 0), very rapidly. The problem with Jacobi's method resides in the fact that low-frequency errors, meaning errors corresponding to small eigenvalues (|λ| ≈ 0), are resolved much more slowly. Exploiting Jacobi's fast convergence for high-frequency errors in a systematic way results in the class of multigrid solvers. The theory behind them follows next.

5.2 General properties

As described earlier, the problems with the Jacobi method are mostly caused by low-frequency errors. However, a low-frequency oscillation on a fine mesh becomes, when translated to a coarser mesh, a high-frequency (and thus easily remedied) error. The algorithm underlying multigrid methods consists of five major steps:

- Pre-smoothing: Reducing errors caused by large eigenvalues of A, for example using the Gauss-Seidel or the Jacobi method.
- Restriction: Moving the residual from a finer grid to a coarser grid.
- Calculation: Solving the error equation with an iterative method.
- Prolongation: Moving the error from a coarser grid to a finer grid.
- Post-smoothing: High-frequency errors can reappear because of errors introduced by the prolongation; these are remedied by again running a few steps of Jacobi's method.

The five steps above are illustrated in Figures 2 and 3, forming a V-cycle.
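To make the five steps concrete, the following is a minimal, self-contained C++ sketch of one V-cycle for the simplest possible setting: the one-dimensional Poisson problem -u'' = f on a uniform grid with weighted-Jacobi smoothing. It is a geometric toy example meant only to illustrate the structure described above (the report itself concerns the algebraic variant, where the hierarchy is built from the matrix); all sizes and parameters are arbitrary choices.

```cpp
// A minimal, self-contained sketch of one multigrid V-cycle for the 1D
// Poisson problem -u'' = f on a uniform grid with weighted-Jacobi smoothing.
// Geometric toy example only: grid sizes, sweep counts and omega are
// arbitrary choices made for illustration.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Residual r = f - A*u for the 1D Laplacian with stencil (-1, 2, -1)/h^2.
static Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (-u[i - 1] + 2.0 * u[i] - u[i + 1]) / (h * h);
    return r;
}

// Pre-/post-smoothing: a few weighted-Jacobi sweeps (omega = 2/3).
static void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const double omega = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec unew = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i) {
            double jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
            unew[i] = (1.0 - omega) * u[i] + omega * jacobi;
        }
        u = unew;
    }
}

// Restriction: full weighting from the fine grid to the next coarser grid.
static Vec restrict_to_coarse(const Vec& fine) {
    std::size_t nc = (fine.size() - 1) / 2 + 1;
    Vec coarse(nc, 0.0);
    for (std::size_t i = 1; i + 1 < nc; ++i)
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1];
    return coarse;
}

// Prolongation: linear interpolation back to the fine grid.
static Vec prolongate(const Vec& coarse, std::size_t nfine) {
    Vec fine(nfine, 0.0);
    for (std::size_t i = 0; i + 1 < coarse.size(); ++i) {
        fine[2 * i]     = coarse[i];
        fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1]);
    }
    fine[nfine - 1] = coarse.back();
    return fine;
}

// One V-cycle: pre-smooth, restrict the residual, solve the coarse error
// equation (recursively), prolongate and add the correction, post-smooth.
static void v_cycle(Vec& u, const Vec& f, double h) {
    smooth(u, f, h, 3);                        // pre-smoothing
    if (u.size() <= 3) {                       // coarsest level: "exact" solve
        smooth(u, f, h, 50);
        return;
    }
    Vec r  = residual(u, f, h);
    Vec rc = restrict_to_coarse(r);            // restriction
    Vec ec(rc.size(), 0.0);
    v_cycle(ec, rc, 2.0 * h);                  // error equation on coarser grid
    Vec e = prolongate(ec, u.size());          // prolongation
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];
    smooth(u, f, h, 3);                        // post-smoothing
}

int main() {
    const std::size_t n = 257;                 // 2^8 + 1 grid points
    const double h = 1.0 / double(n - 1);
    Vec u(n, 0.0), f(n, 1.0);                  // -u'' = 1, u(0) = u(1) = 0
    for (int cycle = 0; cycle < 10; ++cycle) v_cycle(u, f, h);
    double rmax = 0.0;
    for (double ri : residual(u, f, h)) rmax = std::max(rmax, std::abs(ri));
    std::printf("max residual after 10 V-cycles: %g\n", rmax);
}
```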
Figure 2: Moving from a coarse mesh to a fine mesh, and then back again. Compare with Figure 3.

Figure 3: Illustration of one V-cycle in AMG (pre-smoothing steps, restriction, exact solving on the coarsest level, prolongation and post-smoothing steps).

A full computation may involve moving up and down many times before the error is reduced enough on the finest mesh. Often, one step from a finer to a coarser mesh is not enough; instead an entire hierarchy from the finest mesh down to a very coarse mesh is used. The coarsest level might be a very small matrix, even a 2x2 matrix, depending on how large the initial problem is and how many levels we use. Jacobi's method is used as much as possible on one mesh, and the solution is then projected (the restriction step) onto a coarser mesh, where it is easier to solve, before being prolongated back.

There are also several other kinds of cycles in multigrid. The movement from finest to coarsest to finest may involve several smaller cycles between a subset of the levels, giving rise to more complicated patterns like a W-cycle (which, like the V-cycle, takes its name from how it may be pictured, as in Figure 3).

It is possible to impose a hierarchy of finer and coarser meshes as discussed above without knowing anything about the mesh on which the PDE has been discretized, or even the PDE itself. If such information is available, the Geometric Multigrid methods can be used; otherwise, the Algebraic Multigrid method (AMG) uses properties of the matrix A, viewing it as a graph, to construct the hierarchy. This means that less information is available, but the method works like a black box. Thus, if the geometric versions are too complicated to use, AMG may be a good alternative. The difficulty of implementing an AMG solver is very problem-dependent.

The restrictions and prolongations are of linear cost, but it turns out that the overall computational cost also is of linear complexity O(m), and in fact the number of iterations becomes independent of the size of the matrix. These results are not at all trivial and are left to proper books on the subject, such as [9]. However, convergence and efficiency are often problem-dependent, and the method may need to be tuned and adapted to the problem at hand to be useful.
The perfect problem for multigrid solvers, including AMG, is the Laplacian (-∆u = f), where it outperforms all other solvers. Another related method of optimal complexity is the Algebraic Multilevel Iteration method (AMLI) [10], which is not covered here.

5.3 Computational speed

The AMG method is of linear complexity and the number of iterations is, with preconditioning, independent of n. Using AGMG [11], an AMG solver with a MATLAB interface, we ran some tests to compare a standard iterative solver (such as Gauss-Seidel) with AGMG. These tests were run on a laptop running Linux Mint 17 and MATLAB 2014a on an Intel i7 processor. Execution time was measured using the standard tic-toc functions in MATLAB.

The program solves the two-dimensional convection-diffusion problem

-ε∆u + b_1 ∂u/∂x + b_2 ∂u/∂y = f,

where -ε∆u is the diffusive part. The value of ε determines the size of the diagonal elements in the arising matrices; when this value is smaller, the system becomes more difficult to solve. The remaining terms, b_1 ∂u/∂x + b_2 ∂u/∂y, constitute the convective part, which gives rise to the non-symmetric part of the resulting matrix. The discretization method, whether it is FDM, FEM or some other local method, nevertheless yields a large sparse linear system Ax = b.

We compare the performance of AGMG with the direct solver in MATLAB, which means simply calling A \ b in the MATLAB terminal. MATLAB's backslash operator is a complicated routine and actually very good for various (small) cases, as we shall see below.

Table 1 and Table 2 compare times and iteration counts for forward Gauss-Seidel, backward Gauss-Seidel, AGMG and MATLAB's A\b for two problem sizes (no iteration count applies to the direct solver).

Here we see that the preconditioning that takes place within the AMG method clearly does not impose a heavy computational cost, and that the number of iterations does not depend on the matrix size n. Figure 4 shows timings with and without convection, compared with the backslash operator in MATLAB. As for time, we noted the following:

- The presence of the convective part in the equation is what makes the matrix non-symmetric.
- From the numerical tests we observe that the direct solver performs better than AGMG up to a certain problem size for symmetric matrices. When the non-symmetry is introduced, the direct solver performs much worse.
- As Figure 4d shows, the Gauss-Seidel method is much slower than AGMG and the direct solver for any problem size. Note the logarithmic time scale.
Figure 4: Comparison of AGMG with other methods and different problems. (a) Time for AGMG vs. the direct solver (no convection). (b) Time for AGMG vs. the direct solver (convection: horizontal wind). (c) Time for AGMG vs. the direct solver (non-linear convection). (d) Time for AGMG vs. Gauss-Seidel.

6 Project workflow

6.1 Setup and cooling

The Parallella board, in its initial configuration, is delivered without cooling and becomes very warm. In our setup, in a normal room, connected to the internet and running Ubuntu (Linaro), the ARM processor holds at about 80 degrees Celsius when idle, and intensive tasks bring it up to above 85 degrees, which Adapteva describes as dangerous [12]. Thus we were forced to purchase and install a cooling system, consisting of two 12 V fans mounted in a 3D-printed plastic case with the computer in the middle (see Figure 5). With cooling installed, the temperature of the ARM processor drops to around 50 degrees when idle and stays below 55 degrees at full load. Temperatures can be measured using xtemp, bundled with the Parallella software distribution. The Epiphany chip does not seem to suffer from heating problems, although its operation over long periods of time without cooling has not been checked.
6.2 Software

Once cooling had been installed and we had gained some familiarity and experience with the machine, we installed some prerequisites for the Trilinos package [13]. Here we encountered problems with MPI (OpenMPI) and Fortran which we were unable to resolve. However, since we did not need any of the Fortran parts, these were turned off with CMake and only the C and C++ parts were compiled. The compilers installed were gcc and g++.

After everything was in place and running, we installed the relevant parts of the Trilinos package [13], namely AZTEC, Epetra, ML and Teuchos. The Trilinos package is extensively used for various linear algebra operations, and these packages can be used for AMG. The original intention was to use Trilinos together with the deal.II library, which generates matrices, but because of difficulties installing deal.II this plan was scrapped.

With some difficulty we got Trilinos set up and running. There does not, as of now, appear to be any support for OpenCL acceleration, which means it is impossible to use any part of Trilinos with the Epiphany chip. It does, however, work on the CPU, but this is a very inefficient approach since a CPU-only solver on a standard laptop easily outperforms the ARM chip here.

Figure 5: Our Parallella board setup with two cooling fans mounted on a 3D-printed case.

Instead, we turned to another linear algebra package, Paralution 0.8.0 [14]. It is a C++ library containing various sparse iterative solvers, including the algebraic multigrid method, with OpenCL support. Unfortunately, the OpenCL kernels proved not to be compatible with the Epiphany chip. The developers behind Paralution were contacted for support, but were ultimately unable to resolve the issue. It was clear to us that the Epiphany chip only supports single- and not double-precision calculations, but rewriting the source code to accommodate this limitation did not help. The two possible explanations are:

1. The OpenCL kernels contain parts of the OpenCL standard which are not supported by the COPRTHR SDK. A list of which parts are implemented and which are not can be found in [15].
2. The compiled OpenCL kernels are too large to fit in device memory.

We suspected the first one to be the main reason, and this was later confirmed by the Paralution developer team. Among other things, the kernels use inline calls and recursion, which the COPRTHR SDK does not support. The second reason might also contribute, but we have not examined this more closely.

Having failed with both Trilinos and Paralution, we instead turned to an open-source library called ViennaCL [16]. The code was confirmed to work with the Intel OpenCL platform on our own laptops and was thus installed on the Parallella. There, however, it passed all CPU tests but failed all OpenCL tests, meaning neither single- nor double-precision kernels were compatible with the Epiphany chip.

Instead of trying to make external AMG packages fit, it was decided that we would write our own, simpler programs to simulate a V-cycle of AMG. This is described in the following section.
6.3 Code development

Three different OpenCL programs were written:

- Dense matrix-vector multiplication
- Sparse matrix-vector multiplication
- Sparse matrix-matrix multiplication

The first is included mainly for use as a performance benchmark. The sparse kernels are used to simulate a complete V-cycle as it is performed in AMG, without having an actual AMG solver installed. To write the latter from scratch, conforming to the Epiphany OpenCL implementation (the COPRTHR SDK), would simply take too much time. Details and pseudo-code implementations are included in the next two sections. It is well worth noting that the Epiphany chip does not support any double-precision arithmetic. Hence, all calculations that follow are in single precision.

6.3.1 Dense kernels

For dense matrix-vector multiplication we implement the basic naive algorithm, based on the COPRTHR guide example [15]:

Data: n x n matrix A and vector b of length n
for i = 0:(n-1) do
    for j = 0:(n-1) do
        c_i = c_i + A_ij * b_j
    end
end
Algorithm 1: Dense matrix-vector multiplication

Here, a total of n OpenCL processes are distributed across the 16 Epiphany cores, where each process computes one element of the final vector. We also wrote a secondary kernel in which each process calculates one sixteenth of the elements, which means there is exactly one process per core. However, this does not lead to any noticeable improvement in speedup for matrices that fit in memory (n < 2^12).

6.3.2 Sparse kernels

The sparse kernels are the ones mostly used in iterative solution methods, including AMG. When doing sparse calculations, only a few elements are nonzero; for this report, we have considered matrices where the proportion of nonzero elements is less than 2%. Since the zero elements do not contribute anything to the result (anything multiplied by zero is zero), we need only store the nonzero elements, which saves memory and reduces computational cost. We do, however, need to store the locations of the nonzero elements. A straightforward method is to store the coordinates (x, y) of each nonzero element. This is, however, not the best approach: during computation it is unclear how many nonzero elements there are in a particular row, and where in memory they are located. This means cache misses may occur frequently, which induces a heavy computational penalty.
Instead, a more efficient way to store sparse matrices, and the one most commonly used in practice (for example in the NumPy/SciPy libraries for Python [17]), is either Sparse Row-Wise (SRW) or Sparse Column-Wise (SCW) storage. The first will be explained; the second is analogous. There are many possible formats, and the best one is application-dependent.

Consider a (very small) sparse matrix with 9 nonzero elements, such as

A = [ 0 1 5 0 2 ]
    [ 1 0 7 0 0 ]
    [ 0 0 0 0 0 ]
    [ 0 4 0 6 9 ]
    [ 7 0 0 0 0 ]

A coordinate-based storage format would require three arrays of length 9, such as:

x = [1, 2, 4, 0, 2, 1, 3, 4, 0]
y = [0, 0, 0, 1, 1, 3, 3, 3, 4]
v = [1, 5, 2, 1, 7, 4, 6, 9, 7]

Here, the element with value v_i has column index x_i and row index y_i. Notice that the row indices are repeated and that the elements are written in row order; the row information can therefore be compressed into one array of length n + 1. The same matrix would in SRW format be written as:

I = [0, 3, 5, 5, 8, 9]
J = [1, 2, 4, 0, 2, 1, 3, 4, 0]
V = [1, 5, 2, 1, 7, 4, 6, 9, 7]

Here, the nonzero elements are once again stored in V, and the corresponding column indices are stored in the corresponding elements of J. Thus, a nonzero element V_i is in column J_i of the full matrix. The array I, of length n + 1, specifies which elements of J belong to which row of the full matrix: all elements with index i such that I_k <= i < I_{k+1} are in row k.

Obviously, the difference in memory usage compared to a coordinate-based format is very small in this case, but for larger matrices it becomes more noticeable. The SRW format also makes it computationally easy to ascertain how many nonzero elements there are in a row, and whether any rows are empty: this number is simply I_{k+1} - I_k. If I_{k+1} - I_k = 0, the row is empty and that step of the process can simply be skipped. At the same time, the algorithms become slightly more complicated.
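As a small illustration, the sketch below stores the 5x5 example matrix above in SRW form using the three arrays I, J and V, and uses I to count the nonzeros per row. This is a plain C++ illustration of the storage scheme, not the OpenCL kernel code used on the Epiphany.

```cpp
// The 5x5 example matrix from the text stored in SRW (CSR) form.
// I: row pointers (length n+1), J: column indices, V: nonzero values.
#include <cstdio>
#include <vector>

int main() {
    const int n = 5;
    const std::vector<int>   I = {0, 3, 5, 5, 8, 9};
    const std::vector<int>   J = {1, 2, 4, 0, 2, 1, 3, 4, 0};
    const std::vector<float> V = {1, 5, 2, 1, 7, 4, 6, 9, 7};

    for (int row = 0; row < n; ++row) {
        int nnz = I[row + 1] - I[row];          // nonzeros in this row
        if (nnz == 0) {
            std::printf("row %d is empty\n", row);
            continue;
        }
        std::printf("row %d has %d nonzeros:", row, nnz);
        for (int p = I[row]; p < I[row + 1]; ++p)
            std::printf("  A(%d,%d) = %g", row, J[p], V[p]);
        std::printf("\n");
    }
}
```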
Using the SRW format, multiplying a sparse matrix with an arbitrary (dense) vector was done in the following way:

Data: n x n matrix A in SRW format and a dense vector b
for i = 0:(n-1) do
    if row i of A is non-empty then
        for each nonzero element A_ik in row i do
            c_i = c_i + A_ik * b_k
        end
    else
        skip row i
    end
end
Algorithm 2: Sparse matrix-vector multiplication

Note that here only the matrix is considered sparse; the vector b is dense. Extending this code to matrix-matrix multiplication, where both matrices are sparse and in SRW format, we implemented the following algorithm:

Data: n x n matrices A and B in SRW format
for i = 0:(n-1) do
    if row i of A is non-empty then
        for each nonzero element A_ik in row i of A do
            for each nonzero element B_kj in row k of B do
                if column j already exists in row i of C then
                    C_ij = C_ij + A_ik * B_kj
                else
                    store column index j in row i of C;
                    C_ij = A_ik * B_kj
                end
            end
        end
    else
        skip row i
    end
end
Algorithm 3: Sparse matrix-matrix multiplication

What does not appear in the pseudo-code is the problem of pre-allocating memory for C. Because there is no general result for how many nonzeros there may be in C for general sparse matrices A and B, one must make an estimate. We made a rough estimate of the expected number of nonzero elements in C and confirmed the estimate by performing the multiplication in MATLAB. In a more general setting, the kernel must allocate a certain initial size and, during calculation, detect whether the allocated memory is too small and allocate more, which takes extra time. This more complicated (but in practical applications necessary) routine is not included in the present implementation.

The sparse programs are designed in the same way as the optimized dense kernel. This means that instead of having one process for each index i, there are as many processes p as the number of cores used, where each takes care of n/p rows of the matrix.
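A plain C++ version of the sparse matrix-vector scheme might look as follows: the rows are split into p contiguous blocks, one per worker, and each row's result is accumulated locally before being written out once. This is only a sketch of the design described above, with std::thread standing in for the OpenCL processes on the Epiphany cores; it is not the actual kernel code.

```cpp
// Sketch of the row-blocked sparse matrix-vector product c = A*b, with A in
// SRW/CSR format and the rows split over p workers.  std::thread is used here
// only to stand in for the OpenCL processes that run on the Epiphany cores.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct CsrMatrix {
    int n;                         // matrix is n x n
    std::vector<int> I, J;         // row pointers and column indices
    std::vector<float> V;          // nonzero values
};

// Each worker handles the contiguous row block [row_begin, row_end).  The
// result of a row is accumulated locally and written to c exactly once.
static void spmv_rows(const CsrMatrix& A, const std::vector<float>& b,
                      std::vector<float>& c, int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; ++i) {
        float sum = 0.0f;
        for (int p = A.I[i]; p < A.I[i + 1]; ++p)  // empty rows contribute 0
            sum += A.V[p] * b[A.J[p]];
        c[i] = sum;
    }
}

static std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& b, int p) {
    std::vector<float> c(A.n, 0.0f);
    std::vector<std::thread> workers;
    const int rows_per_worker = (A.n + p - 1) / p;   // roughly n/p rows each
    for (int w = 0; w < p; ++w) {
        const int begin = w * rows_per_worker;
        const int end   = std::min(A.n, begin + rows_per_worker);
        if (begin >= end) break;
        workers.emplace_back([&A, &b, &c, begin, end] { spmv_rows(A, b, c, begin, end); });
    }
    for (std::thread& t : workers) t.join();
    return c;
}

int main() {
    // The 5x5 example matrix from the previous subsection, multiplied by a
    // vector of ones using 4 workers.
    const CsrMatrix A{5, {0, 3, 5, 5, 8, 9}, {1, 2, 4, 0, 2, 1, 3, 4, 0},
                      {1, 5, 2, 1, 7, 4, 6, 9, 7}};
    const std::vector<float> b = {1, 1, 1, 1, 1};
    for (float ci : spmv(A, b, 4)) std::printf("%g ", ci);
    std::printf("\n");                               // expected: 8 8 0 19 7
}
```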
In the actual kernels, data is written to the Epiphany core's local memory throughout the row loop, and is only transferred to the global shared memory once the entire row has been computed. This provides a very large speedup due to the limited global memory bandwidth, which is important to take into account for other development on the Parallella. Since each row is handled by only one process, there are no cache problems or memory clashes.

Complete source code for all our tests is released under the MIT software license on GitHub [18].

7 Problem setup

7.1 Linear algebra

With the linear algebra packages detailed in the previous section failing to use the Epiphany chip, another approach was needed to perform (or at least simulate) AMG calculations on the Parallella. We opted for an approach which does not actually calculate a solution, but mimics the type of computations (sparse matrix-vector and sparse matrix-matrix multiplications of varying sizes) that are performed during an AMG V-cycle. In order to obtain meaningful timings, our simulation was constructed as follows:

1. Allocate memory for the matrices.
2. Synchronize memory with the Epiphany chip.
3. Perform the computation.
4. Synchronize the result.

A careful reader might realize that, although we work with a shared memory model, the current OpenCL implementation is not designed for this and the COPRTHR SDK still requires us to synchronize data.

Because the V-cycle works with computations on each level, from the finest mesh down to the coarsest and then back up to the finest again, the computational steps were built up accordingly (where n is the largest matrix size):

1. Sparse matrix-matrix and sparse matrix-vector multiplication, size n x n.
2. Divide n by 2.
3. Repeat steps 1 and 2 until the coarsest level is reached.
4. Perform exact solving.
5. Multiply n by 2.
6. Sparse matrix-matrix and sparse matrix-vector multiplication.
7. Repeat steps 5 and 6 until the finest level n is reached.

The mesh sizes were chosen as powers of 2 to ensure that halving the size is always possible. It also makes it easy to balance the load between the p Epiphany cores, which each compute n/p rows of the matrix. Due to memory restrictions we chose 8192 as our finest mesh; a sketch of the resulting level schedule is given below.
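The sketch below only builds the sequence of matrix sizes visited by one simulated V-cycle; the finest and coarsest sizes, 8192 and 512, are the ones used in our runs, and the work at each level is merely indicated by a print statement.

```cpp
// Sketch of the level schedule for the simulated V-cycle: the finest size is
// halved down to the coarsest level and then doubled back up again.  At each
// level one sparse matrix-vector and one sparse matrix-matrix product is timed.
#include <cstdio>
#include <vector>

int main() {
    const int finest = 8192, coarsest = 512;        // sizes used in our runs
    std::vector<int> levels;
    for (int n = finest; n >= coarsest; n /= 2)     // restriction phase
        levels.push_back(n);
    // (exact solving would take place here, on the coarsest level)
    for (int n = 2 * coarsest; n <= finest; n *= 2) // prolongation phase
        levels.push_back(n);
    for (int n : levels)
        std::printf("level of size %5d: time one SpMV and one SpGEMM here\n", n);
}
```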
The number of matrix-matrix and matrix-vector computations performed over a V-cycle is very problem-dependent, namely on which smoother is used and how aggressively the coarsening is done. However, one way to loosely estimate it is to estimate how much work is done over the simplest possible V-cycle, which performs one matrix-vector multiplication and one matrix-matrix multiplication per level.

Assume we have n levels ordered as {l_1, l_2, ..., l_n}, where l_1 is the coarsest and l_n is the finest. Suppose also that the matrix sizes satisfy N_{k-1} = N_k / 2 and that the proportion of non-zeros in each matrix is some constant α. The total number of non-zero elements on level k is therefore αN_k². The arithmetic work needed to multiply a matrix of size N_k x N_k with a vector can then be estimated as

W_k = αC_1 N_k²,    (2)

and, similarly, multiplying two matrices of the same size, with the same proportion of non-zeros, requires

W_k = αC_2 N_k²,    (3)

where C_1 and C_2 are constants independent of the size of the matrix. With (2) and (3) we can estimate the total arithmetic work W_V for a complete V-cycle:

W_V = αC (N_n² + N_{n-1}² + ... + N_1²)
    = αC (N_n² + (N_n/2)² + ... + (N_n/2^{n-1})²)
    = αC N_n² (1 + 1/4 + ... + (1/4)^{n-1})
    = αC N_n² (1 - (1/4)^n) / (1 - 1/4)
    ≈ (4/3) αC N_n²

for large n, where C = C_1 + C_2. From the formula we see that the total amount of work over a V-cycle is proportional to the number of non-zeros on the finest level, αN_n². Because all the matrices are sparse, α satisfies αN_n² = O(N_n), and from this we deduce that W_V = O(N_n), with a constant of proportionality independent of the matrix size N_n and the proportion of non-zero elements α. The size of this constant depends on factors such as the choice of smoother and how aggressive the coarsening is. The calculation nevertheless has the benefit of showing that multigrid methods have a linear computational cost.

We decided to measure one matrix-vector and one matrix-matrix multiplication on each level, giving a benchmark value which can then be used to estimate the time demands of any given problem, since at least one matrix-matrix and one matrix-vector multiplication will be performed per level. The V-cycle timing was thereafter obtained by summing the execution and memory synchronization times over all levels. The latter is included as it is a requirement for computation on this device.
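A few lines of code confirm the estimate numerically by summing the per-level work αCN_k² over a halving hierarchy and comparing it with (4/3)αCN_n². The values of α and C below are arbitrary illustrative constants; only the finest size matches the one used in our runs.

```cpp
// Numerical check of the work estimate above: summing alpha*C*N_k^2 over the
// levels N_n, N_n/2, ..., and comparing with (4/3)*alpha*C*N_n^2.  alpha and C
// are arbitrary illustrative constants, not measured values.
#include <cstdio>

int main() {
    const double alpha = 0.02, C = 1.0;      // assumed fill fraction and cost constant
    const double n_finest = 8192.0;          // finest level size used in our runs
    double total = 0.0;
    for (double N = n_finest; N >= 2.0; N /= 2.0)
        total += alpha * C * N * N;          // per-level work, cf. eqs. (2) and (3)
    const double estimate = 4.0 / 3.0 * alpha * C * n_finest * n_finest;
    std::printf("summed work: %.1f, 4/3 estimate: %.1f, ratio: %.4f\n",
                total, estimate, total / estimate);
}
```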
7.2 Energy consumption

With the help of the IT project course group working on a separate Parallella project [19], equipment was borrowed to measure power consumption. According to the Adapteva website [2], the computer consumes only 5 W under typical workloads; we wanted to test this claim. The dense matrix-vector kernel was run several times in a loop while power consumption was measured, and the result was compared with the power consumption when idle. The computations included randomizing a matrix and a vector on the CPU, while most of the computation took place on the accelerator. We also multiplied large sparse matrices on the accelerator while the CPU was idle.

The power consumption tests were made using a special chip attached to metal pins available on the board for this purpose, and a standard multimeter measured the voltage across these pins. To convert this into a current, we solved for the current A from the measured voltage V using a linear relation between the two, deduced beforehand by least-squares fitting of measured voltages (V) against known loads (A) through the power supply. Multiplying this current by five volts gives the corresponding wattage.

8 Results

All speedup results below are based on the execution time on the accelerator, as well as the time for synchronizing memory to and from it. The speedup is computed as relative speedup, where the speedup S for n cores with runtimes T is

S(n) = T(1 core) / T(n cores).

Although the memory is shared, it appears (judging by the basic OpenCL examples for Parallella) that synchronization routines must still be called, which slightly hinders performance. This can hopefully be remedied in future versions of the COPRTHR SDK; we have used the currently newest version, 1.6. The Epiphany chip does not support double-precision numbers, so all results below are in single precision. Worth noting is that, due to the lack of time in this project, our code is probably not fully optimized, although we know of no specific remedies that would provide a significant performance improvement.
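All speedups reported below follow this definition. As a small worked example, the snippet computes S(n) = T(1 core)/T(n cores) for the simulated V-cycle timings (calculation plus synchronization) listed later in Table 13; the timings are the measured values, while the function itself is only a trivial helper, not part of the benchmark code.

```cpp
// Relative speedup S(n) = T(1 core) / T(n cores), illustrated with the
// measured V-cycle timings (calculation + synchronization) from Table 13.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::vector<int>    cores   = {1, 2, 4, 8, 16};
    const std::vector<double> seconds = {11.0042, 6.5329, 4.3125, 3.1910, 2.6574};
    for (std::size_t i = 0; i < cores.size(); ++i)
        std::printf("%2d cores: %.4f s, speedup %.4f\n",
                    cores[i], seconds[i], seconds[0] / seconds[i]);
}
```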
Table 3 shows the number of non-zero elements and their proportion of the total number of elements for each matrix size used; the sizes range from 512 up to 8192 and all proportions are below 2%. The second matrix used for matrix-matrix multiplication is of similar size and sparsity. The actual sparsity of the matrices in an AMG V-cycle depends firstly on the problem, but also on the methods used for restriction and prolongation. Our numbers are realistic in the sense that we go to a larger proportion for every restriction (i.e. downwards in the hierarchy) and to a smaller proportion for every prolongation (upwards in the hierarchy). Whether the proportions themselves are realistic is impossible to say since they are, as just stated, problem-dependent.

Size    # Non-zeros    Proportion (%)
512
1024
2048
4096
8192
Table 3: Sizes and fraction of non-zeros used during the tests.

8.1 Dense linear algebra computations

The graph in Figure 6 includes both the memory synchronization with OpenCL and the execution time of the OpenCL kernel. As we can see in Figure 6, memory latency affects the speedup for n = 512 and n = 1024, whereas for n = 2048 we get near-linear speedup (see also Table 6 for the actual numbers). Using an even larger matrix requires a smarter memory model than what we implemented, as the maximum contiguous memory blocks seem to be around 32 MB.

Figure 6: Relative speedup of the multiplication of a dense matrix with a vector, split over 1-16 cores.

Below we include tables showing the different time measurements that were performed. Table 4 and Table 5 show the allocation time and the synchronization time, respectively, for each matrix size. Not surprisingly, there is no difference in allocation time for different numbers of cores; the same goes for memory synchronization, which is good. Between matrix sizes, the timings increase by almost a factor of 4 when the size is doubled. This is expected, as the number of elements also grows by a factor of 4.

Allocation
Size    Time (s)    Ratio
512
1024    0,0581
2048    0,2292      3,9471
Table 4: Allocation time in seconds for dense matrix-vector multiplication.

Synchronization
Size    Time (s)    Ratio
512
1024    0,2144
2048    0,8547      3,9814
Table 5: Memory synchronization time in seconds for dense matrix-vector multiplication.
Calculation
Size\#cores          1          2         4         8         16
512                  0,7488     0,4097    0,2403    0,1559
1024                 2,7793     1,4260    0,7484    0,4110
2048                 10,9235    5,4955    2,7821    1,4295    0,7618
Speedup (n = 2048)   1          1,9877    3,9263              14,3394
Table 6: Computation time in seconds for dense matrix-vector multiplication.

8.2 Sparse LA

All results below are obtained using single-precision values. We have not tested integer values, because of the time restrictions of the project and because this is not common in practical applications. Due to the lack of RAM we were not able to experiment with matrices larger than 8192 x 8192.

8.2.1 Speedup

The results from our experiments show a fairly good speedup, at least for matrix-vector multiplication (see Figure 7). The best speedup is obtained when using as many cores as possible at maximum load, for both types of computations.

Figure 7: Relative speedup, computed from the values in Tables 7 and 8, of (a) sparse matrix-matrix multiplication and (b) sparse matrix-vector multiplication, split over 1-16 cores.

Calculation and synchronization
Size\#cores   1        2        4        8        16
512           0,1356   0,1237   0,1182   0,1152
1024          0,1681   0,1446   0,1325   0,1264
2048          0,2353   0,1864   0,1625   0,1480
4096          0,3670   0,2710   0,2232   0,1986
8192          0,6365   0,4448   0,3459   0,3002   0,2760
Table 7: Actual measured timings in seconds for the sparse matrix-matrix multiplication.
Calculation and synchronization
Size\#cores   1        2        4        8        16
512           0,2074   0,1473   0,1183   0,1005
1024          0,3329   0,2110   0,1509   0,1206
2048          0,5831   0,3387   0,2177   0,1561
4096          1,0839   0,5939   0,3501   0,2280
8192          2,0938   1,1119   0,6183   0,3738   0,2517
Table 8: Actual measured timings in seconds for the sparse matrix-vector multiplication.

8.2.2 Memory allocation

Although memory allocation is not done on the Epiphany chip (only on the CPU), it can be a significant part of the total execution time for any large application. Here we can note two important things. First, when doing for example a matrix-matrix multiplication with full use of the Epiphany chip (16 cores), allocating enough memory takes approximately 40% of the total execution time. Second, as Table 9 and Table 10 show, doubling the size of the matrices also doubles the time for allocating memory, corresponding to the doubling of the number of nonzero elements.

Figure 8: Allocation timings for different numbers of elements. Note that matrix-vector multiplication requires less memory and thus becomes faster.

Allocation: mat-vec multiplication
Size    # allocated    Time (s)    Ratio
512
1024                   0,0067
2048                   0,0128
4096                   0,0258
8192                   0,0507      1,9685
Table 9: Memory allocation timings in seconds for the sparse matrix-vector multiplication.

Allocation: mat-mat multiplication
Size    # allocated    Time (s)    Ratio
512
1024                   0,0249
2048                   0,0499
4096                   0,1016
8192                   0,2113      2,0790
Table 10: Memory allocation timings in seconds for the sparse matrix-matrix multiplication.
8.2.3 Memory synchronization

Synchronizing the memory on the ARM CPU with the Epiphany chip is also time-consuming. Taking the same example as above for memory allocation, synchronization takes slightly less time, but still above 30% of the total time for one computation. Also, as with memory allocation, Table 11 and Table 12 show that it takes twice as much time to synchronize twice as many elements.

Figure 9: Memory synchronization times in seconds for different numbers of elements. Note that matrix-vector multiplication requires less memory and thus becomes faster.

Synchronization: mat-vec multiplication
Size    # synchronized    Time (s)    Ratio
512
1024                      0,0056
2048                      0,0099
4096                      0,0187
8192                      0,0365      1,9545
Table 11: Memory synchronization timings in seconds for the sparse matrix-vector multiplication.

Synchronization: mat-mat multiplication
Size    # synchronized    Time (s)    Ratio
512
1024                      0,0191
2048                      0,0368
4096                      0,0726
8192                      0,1491      2,0560
Table 12: Memory synchronization timings in seconds for the sparse matrix-matrix multiplication.

8.3 Simulated V-cycle

Our simulated V-cycle shows an approximately 4x maximum speedup with 16 cores; see Figure 10 and Table 13.
Figure 10: Speedup of our simulated V-cycle for matrix sizes from n = 8192 down to n = 512.

#Cores            1          2         4         8         16
Calculation       10,6252    6,1530    3,9331    2,8116    2,2630
Synchronization   0,3790     0,3799    0,3795    0,3790    0,3944
Sum               11,0042    6,5329    4,3125    3,1910    2,6574
Speedup           1          1,6844    2,5517    3,4489    4,1410
Table 13: Actual timings (calculation and synchronization, in seconds) for the simulated V-cycle on different numbers of cores.

8.4 Comparisons with MATLAB

In order to have something to compare with, we also ran computations on the very same matrices in MATLAB on a laptop running MATLAB 2013a on Windows 7, using a dual-core Intel Core i5 4500U (2.5 GHz) and 6 GB RAM. While most matrix operations, including matrix-matrix multiplication, are implicitly parallelized in MATLAB, this is not the case if at least one matrix is sparse. Hence we present only serial runtimes and discuss them in the next section. Because of how MATLAB is built, the following calculations are in double precision, whereas the Epiphany accelerator only supports single precision.

Table 14: Sparse matrix-matrix multiplication timings in MATLAB.
Table 15: Sparse matrix-vector multiplication timings in MATLAB.

Setting up a similar V-cycle as proposed in the problem setup section (Section 7.1), using the same matrices as on the Parallella, the simulated V-cycle in MATLAB takes approximately … seconds. The corresponding times on the Parallella are 11,0042 seconds on 1 core and 2,6574 seconds on 16 cores.
8.5 Power measurements

Power consumption was tested in [19]; the results are shown here in Table 16.

Workload   Idle     Compiling   Unit tests   Bucket sort
Power      2.2 W    3 W         3.8 W        2.9 W
Table 16: Energy consumption of the Parallella Embedded Server as measured in [19].

For these tests, compilation took place only on the CPU and the bucket sort only on the accelerator, while the unit tests utilized the CPU, the accelerator and even the network connector. It should be noted that the device used is not identical to ours (those measurements were made on the Embedded platform, whereas we have used the Desktop version). Since the hardware difference is so small, and the CPU and accelerator are identical, we feel confident that the results in Table 16 should reflect the power consumption of all three device types. It is unknown how much the use of the GPIO ports might affect power consumption.

We tried reproducing these tests, and our measurements peaked at 3 W at maximal load and held around 0.25 W when idle. The measured consumption cannot be just the Epiphany chip, since that has a maximum power consumption of 2 W [2], but it also seems too low to be the entire computer. This measurement is doubtful at best.

An important observation was that the accelerator continues to run and consume power even after the kernel has completed execution and no data is being fed to it. The programmer has to manually clear the Epiphany instruction set in order to halt this power consumption. Needless to say, this might have large consequences for power efficiency in large-scale runs if it is forgotten.

9 Discussion

As we have neither used nor closely examined the eSDK made for the Epiphany chip, we choose to comment only on the OpenCL support and the results we have acquired using it. OpenCL support for the Epiphany is, as yet, clearly very limited. The same conclusion is reached in [20], which notes that programs using the eSDK scale much better than the OpenCL implementations do, at the cost of increased development time.

The results show a fairly good speedup using 16 cores for both matrix-vector multiplications: slightly above eight times faster for the sparse case and around 14 times faster for the dense case. Our poor results for matrix-matrix multiplication are most probably due to bad memory management. As already mentioned, our code is not optimized, although we do not know of any specific fixes that would provide significant speedup.

As for the V-cycle, we see at best four times speedup at full load using all computing nodes. At the same time, it is common knowledge that sparse linear algebra operations generally do not scale well on parallel platforms. The experiments performed here do not bring a dramatically different insight, despite the shared memory model, which was presumed to offer better performance. Improving the speedup of matrix-matrix multiplication, if possible, is what would offer the most improvement to the V-cycle simulation results. Note that standard linear algebra routines such as BLAS are not implemented for the Epiphany accelerator; generic BLAS results on the ARM CPU would not be of any interest.
Because the OpenCL standard is not completely implemented, we consider the Epiphany OpenCL implementation (the COPRTHR SDK) not to be quite finished yet. There are certainly many optimizations that could be performed there, under the hood. Another alternative would be to use a different parallel framework, such as a lighter version of MPI, which might suit the Multiple Instruction, Multiple Data (MIMD) architecture of the Epiphany better.

The results regarding memory allocation and synchronization with the Epiphany chip are rather good, showing that doubling the number of elements leads to a doubling in time. The memory latency shows itself when allocating or updating fewer elements, but becomes less noticeable for large datasets.

Even though the comparison of our results with the same computations in MATLAB is very unfavourable (the latter is fully optimized and not run on a comparable computer), we note that the timings in MATLAB grow much faster with problem size than those on the Parallella. This indicates that the Parallella has potential for better scaling.

The 64+2-core Parallella version, not yet available, might provide even better scalability and has a peak performance rating of 85 GFLOPS, as opposed to the 25 GFLOPS [5] of the 16-core version used here. It was scheduled for release earlier, but as of January 2015 it is not yet available. The 1 GB RAM limit could also easily become a deal-breaker, as many numerical linear algebra problems entail much larger datasets.

10 Conclusions and future research

For this project, we have tried to use the Trilinos and Paralution packages for linear algebra. Although they can be installed, various attempts to get them to use the Epiphany accelerator chip have failed, meaning performance is severely limited and not useful. Results from the ViennaCL library were similarly disappointing, and the deal.II package would not install, probably because of ARM incompatibilities.

Instead, we implemented our own dense matrix-vector, sparse matrix-vector and sparse matrix-matrix multiplications in OpenCL. Dense matrix-vector multiplication shows near-linear speedup, whereas our unoptimized sparse routines show only 4-8x speedup. We tried simulating an AMG V-cycle using these routines, but its speedup suffers from the poor performance of the sparse matrix-matrix multiplication. As we were, because of time limitations, unable to fully implement a standard AMG solver utilizing the Epiphany, future research could definitely look into developing such software, utilizing both a single Parallella board and several boards.

We also tried, but failed, to measure the power consumption of the board. Results from [19], if accurate, point to less than 5 W, and even less than 4 W at full load on both CPU and accelerator. A future study should verify this with better measurement instruments.

We know of no study where different low-power supercomputing-on-a-chip architectures, such as the Epiphany, Kalray [21] etc., are compared in terms of computing power, scalability, energy efficiency and simplicity of development. This area of research is moving rapidly, even compared to the usual speed at which computer research moves. Such a study might be of great benefit to the supercomputing community. Since the Epiphany is MIMD, the Parallella computer might also lend itself well (or better) to task-parallel problems as well as data-parallel problems.
While sparse linear algebra computations are, in theory, better suited to shared memory systems, we cannot at this moment recommend extensive linear algebra computations to be performed on the Parallella. We acknowledge that it has potential: with more complete OpenCL support, allowing general linear algebra packages to be installed and run, as well as with the 64+2-core version, its usability would be greatly enhanced.
References

[1] Xilinx, Zynq-7000 All Programmable SoC Overview, document number DS190.
[2] Parallella Reference Manual.
[3] LINPACK benchmark (web reference).
[4] Linaro (web reference).
[5] Parallella-1.x Reference Manual.
[6] OpenCL, the Open Computing Language (web reference).
[7] Trygve Aaberge, Analyzing the Performance of the Epiphany Processor, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, pp. 11.
[8] S. H. Lui, Numerical Analysis of Partial Differential Equations, p. 346.
[9] Yousef Saad, Iterative Methods for Sparse Linear Systems, SIAM.
[10] Johannes Kraus, Svetozar Margenov, Robust Algebraic Multilevel Methods and Algorithms, p. 29 and onward, Walter de Gruyter.
[11] AGMG, an aggregation-based algebraic multigrid solver (web reference).
[12] Adapteva (web reference).
[13] The Trilinos project (web reference).
[14] PARALUTION library (web reference).
[15] Parallella quick start guide, docs/parallella_quick_start_guide.pdf.
[16] ViennaCL library (web reference).
[17] NumPy/SciPy documentation, scipy.sparse.csr_matrix.
[18] Project source code repository on GitHub (web reference).
[19] Christos Sakalis et al., The EVI Distributed Shared Memory System, Department of Information Technology, Uppsala University.
[20] Trygve Aaberge, Analyzing the Performance of the Epiphany Processor, 2014.
[21] Kalray web page.
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Whitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez
Energy efficient computing on Embedded and Mobile devices Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez A brief look at the (outdated) Top500 list Most systems are built
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
Notes on Cholesky Factorization
Notes on Cholesky Factorization Robert A. van de Geijn Department of Computer Science Institute for Computational Engineering and Sciences The University of Texas at Austin Austin, TX 78712 [email protected]
Mesh Generation and Load Balancing
Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks Garron K. Morris Senior Project Thermal Engineer [email protected] Standard Drives Division Bruce W. Weiss Principal
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
CS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and
SIDN Server Measurements
SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources
Building a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
Lecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Power-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
10.2 ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS. The Jacobi Method
578 CHAPTER 1 NUMERICAL METHODS 1. ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS As a numerical technique, Gaussian elimination is rather unusual because it is direct. That is, a solution is obtained after
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,
Chapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
Contributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE Sudha M 1, Harish G M 2, Nandan A 3, Usha J 4 1 Department of MCA, R V College of Engineering, Bangalore : 560059, India [email protected] 2 Department
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
Lattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
TESLA Report 2003-03
TESLA Report 23-3 A multigrid based 3D space-charge routine in the tracking code GPT Gisela Pöplau, Ursula van Rienen, Marieke de Loos and Bas van der Geer Institute of General Electrical Engineering,
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
What Is Specific in Load Testing?
What Is Specific in Load Testing? Testing of multi-user applications under realistic and stress loads is really the only way to ensure appropriate performance and reliability in production. Load testing
:Introducing Star-P. The Open Platform for Parallel Application Development. Yoel Jacobsen E&M Computing LTD [email protected]
:Introducing Star-P The Open Platform for Parallel Application Development Yoel Jacobsen E&M Computing LTD [email protected] The case for VHLLs Functional / applicative / very high-level languages allow
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
