A distributed CPU-GPU sparse direct solver
Piyush Sao 1, Richard Vuduc 1, and Xiaoye Li 2
1 Georgia Institute of Technology, {piyush3,richie}@gatech.edu
2 Lawrence Berkeley National Laboratory, [email protected]

Abstract. This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed-memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach: to find ways to aggregate collections of small BLAS operations into larger ones; to schedule operations to achieve load balance and hide long-latency operations, such as PCIe transfers; and to exploit simultaneously all of a node's available CPU cores and GPUs.

1 Introduction

Given a sparse matrix A, we consider the problem of factoring it into the product A = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix. This problem ("sparse LU") is usually the most expensive step in a sparse direct solver, which appears in a variety of computational science and engineering applications. It typically needs a lot of memory, thereby benefiting from the use of a distributed-memory system. A natural question is, given the increased reliance on some form of GPU-like acceleration for such systems, how to exploit all forms of available parallelism, whether distributed memory, shared memory, or accelerated. The challenge is that sparse LU factorization is, computationally, neither strictly dominated by arithmetic, as high-performance LINPACK is when A is dense, nor strictly dominated by communication, as is often the case with iterative linear solvers.
Thus, it is an open question whether, or by how much, we should expect to speed up sparse LU factorization using distributed CPU+GPU machines [9]. Additionally, indirect irregular memory access, irregular parallelism, and a strong dependence on the input matrix's structure, which is known only at runtime, further complicate its implementation. These complications require carefully designed data structures and dynamic approaches to scheduling and load balancing. Indeed, perhaps due to these myriad issues, there are many studies offering distributed algorithms and hybrid single-node CPU+GPU implementations but, to date, no fully distributed hybrid CPU+GPU sparse direct solver of which we are aware (§2). This paper presents the first such algorithm and implementation that can run scalably on a cluster comprising hybrid CPU+GPU nodes. We extend an existing distributed-memory sparse direct solver, SuperLU_DIST [5], by adding
CPU multithreading and GPU acceleration during the LU factorization step. To effectively exploit intranode CPU and GPU parallelism, we use a variety of techniques (§4). These include aggregating small computations to increase the amount of compute-bound work; asynchronously assigning compute-bound work to the GPU and memory-bound work to the CPU, thereby minimizing CPU-GPU communication and improving system utilization; and careful scheduling to hide various long-latency operations. We evaluate this implementation on test problems derived from applications (§5). We show speedups of over 2× (§5) over a highly scalable MPI-only code, and we explain why our approach does not yield speedups in some cases.

2 Related work

The last five years have seen several research developments on accelerating sparse factorization algorithms using GPUs. Most of these efforts rely on the GPU for solving large dense matrix subproblems, performing all other processing on the host CPU with data transfer as needed. There exist methods for multifrontal Cholesky [4, 12, 9, 6] and, in single precision, left-looking sparse LU [8]. In essence, all of these methods use the GPU as a BLAS accelerator. George et al. go beyond BLAS acceleration in their single-node multifrontal sparse Cholesky algorithm, implemented in WSMP [3]. They examine three compute-intensive kernels associated with each frontal matrix: factoring the diagonal block, triangular solution, and the Schur complement update. These computations are selectively offloaded to the GPU depending on the distribution of the flops, which in turn depends on the input matrix. Their method achieves speedups over a single core. Yeralan et al. developed a sparse multifrontal QR factorization algorithm using one CPU-GPU pair [11]. Since sparse QR has intrinsically higher arithmetic intensity than sparse LU, the pay-off of GPU acceleration should be higher.
Our approach also offloads the most arithmetic-intensive part of the workload to GPUs. However, one distinction of our work is that we aim to exploit the maximum available parallelism of a distributed-memory system: distributed-memory parallelism via MPI, combined with intranode parallelism through multithreading and GPU acceleration. While our implementation is specific to SuperLU_DIST, we believe the techniques discussed in this paper can be extended to other direct solvers.

3 Overview of SuperLU_DIST

Solving a linear system Ax = b using SuperLU_DIST involves a number of steps [10, 7]. However, the most expensive step is numerical factorization, which is the focus of this paper. For the test matrices in our study, numerical factorization accounts for at least 75% of the total solve time, and in fact more often accounts
for 90% or more. Therefore, we focus on just the numerical factorization phase; accelerating the remaining steps is a good avenue for future research.

Algorithm 1 SuperLU_DIST Numerical Factorization
 1: for k = 1, 2, 3, ..., n_s do            ▷ Panel Factorization
 2:   ▷ Column computation of L_{:,k}
 3:   if p_id ∈ P_c(k) then
 4:     compute the block column L_{k:n_s,k}
 5:     (communicate U_{k,k} among P_c(k))
 6:     send L_{k:n_s,k} to required processes in P_r(:)
 7:   else
 8:     receive L_{k:n_s,k} if required
      ▷ Row computation of U_{k,:}
 9:   if p_id ∈ P_r(k) then
10:     wait for U_{k,k}
11:     compute the block row U_{k,k+1:n_s}
12:     send U_{k,k+1:n_s} to required processes in P_c(:)
13:   else
14:     receive U_{k,k+1:n_s} if required
      ▷ Schur Complement Update
15:   if L_{:,k} and U_{k,:} are locally non-empty then
16:     for j = k+1, k+2, ..., n_s do
17:       for i = k+1, k+2, ..., n_s do
18:         if p_id ∈ P_r(i) ∩ P_c(j) then
19:           A_{i,j} ← A_{i,j} − L_{i,k} U_{k,j}

SuperLU_DIST uses a fan-out (right-looking, outer-product) supernodal algorithm. A supernode is a set of consecutive columns of L with a dense triangular block just below the diagonal and the same nonzero structure below that triangular block. To achieve good parallelism and load balance, the MPI processes are assigned to the supernodal blocks in a 2D cyclic layout. Algorithm 1 shows the pseudocode of the factorization algorithm, where n_s is the number of supernodes, p_id is the ID of this process, and P_c(k) and P_r(k) are the groups of processes assigned to the k-th supernodal column and the k-th supernodal row, respectively. Step 1 is the k-th panel factorization, in which the k-th supernodal column of L and the k-th supernodal row of U are computed. Subsequently, each process in P_c(k) and P_r(k) sends its local blocks of the factors to the processes assigned to the same row and column, respectively. Step 2 then updates the trailing submatrix using the k-th supernodal column and row of the LU factors. The block A_{i,j} is updated only if both blocks L_{i,k} and U_{k,j} are nonempty.
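For concreteness, the 2D cyclic assignment and the membership tests p_id ∈ P_r(i) and p_id ∈ P_c(j) of Algorithm 1 can be sketched as follows. The grid layout and names here are illustrative assumptions, not SuperLU_DIST's actual internals:

```c
#include <assert.h>

/* Sketch of the 2D cyclic owner-computes rule of Algorithm 1, with
 * illustrative names: ranks form an nprow-by-npcol grid in row-major
 * order; block row i belongs to process row i mod nprow, and block
 * column j to process column j mod npcol. */
typedef struct { int nprow, npcol; } grid2d;

static int in_Pr(const grid2d *g, int pid, int i)   /* pid in P_r(i)? */
{
    return pid / g->npcol == i % g->nprow;
}

static int in_Pc(const grid2d *g, int pid, int j)   /* pid in P_c(j)? */
{
    return pid % g->npcol == j % g->npcol;
}

/* Line 18 of Algorithm 1: pid updates A(i,j) iff pid lies in both
 * P_r(i) and P_c(j). */
static int updates_block(const grid2d *g, int pid, int i, int j)
{
    return in_Pr(g, pid, i) && in_Pc(g, pid, j);
}
```

Under this layout every block has exactly one owner, which is what makes the owner-computes update at line 18 free of write conflicts.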
A more detailed description appears elsewhere [10].

4 New Intranode Enhancements

Our work enhances the intranode performance and scaling of alg. 1. The panel factorization and row computation phases are primarily concerned with communication. By contrast, the Schur complement update phase (lines 15-19) is the local computation that dominates intranode performance; thus, it is our main target for optimization.

Baseline Schur complement update. The Schur complement update step at iteration k of alg. 1 computes

  A_{k+1:n_s, k+1:n_s} ← A_{k+1:n_s, k+1:n_s} − L_{k+1:n_s, k} U_{k, k+1:n_s}.   (1)

SuperLU_DIST uses an owner-computes strategy, where each process updates the set of blocks {A_{i,j}} that it owns, once it has received the required blocks L_{:,k} and U_{k,:}. Each GEMM subproblem computes one A_{i,j}, corresponding to line 19 of alg. 1. In the baseline SuperLU_DIST implementation, a process updates each of its A_{i,j} blocks in turn, traversing the matrix in a columnwise manner (outermost j-loop, line 16 of alg. 1). The update takes place in three steps: packing the U block, calling BLAS GEMM, and unpacking the result. We refer to the first two steps as the GEMM phase, and the last step as the Scatter phase. Packing allows the computation to use a highly optimized BLAS implementation of GEMM: it converts U_{k,j}, which is stored in a sparse format, into a dense, BLAS-compliant, column-major block Ũ_{k,j}. This packing takes place once for each U_{k,j}. The L_{i,k} operand need not be packed, as it is already stored in column-major form as part of a rectangular supernode. The second step is the BLAS GEMM call, which computes V ← L_{i,k} Ũ_{k,j}, where V is a temporary buffer. The final Scatter step updates A_{i,j} by subtracting V from it. Since only the nonzero rows of L and U are stored, the destination block A_{i,j} usually has more nonzero rows and columns than L_{i,k} and U_{k,j}. Thus, this step must also map the rows and columns of V to the rows and columns of A_{i,j} before the elements of A_{i,j} can be updated, which involves indirect addressing. This final unpacking step is what we refer to as the Scatter phase.

Aggregating small GEMM subproblems.
Relative to the baseline (above), we may increase the arithmetic intensity of the GEMM phase by aggregating small GEMM subproblems into a single, larger GEMM. This aggregated computation then becomes a better target for GPU offload, though it also works well in the multicore CPU-only case. Our approach to aggregation, illustrated in fig. 1, has two aspects. First, we process an entire block column at once: instead of calling GEMM for every block multiply L_{i,k} Ũ_{k,j}, we aggregate the L-blocks in column k into a single GEMM call that effectively computes V ← L_{k+1:n_s,k} Ũ_{k,j}, thereby reusing Ũ_{k,j}. Second, the packed block Ũ_{k,j} may still have only a few nonzero columns. Thus, we group multiple consecutive U-blocks to form a larger block Ũ_{k, j_st:j_end}, where j_st and j_end are the starting and ending block indices. This large block has some minimum number of columns N_b, a tuning parameter. We schedule
the computation of L_{k+1:n_s,k} Ũ_{k, j_st:j_end} on the GPU, using CUDA streams as explained below.

[Fig. 1: Aggregating small GEMM subproblems.]
[Fig. 2: Overlapping GEMM with Scatter.]

Aggregation may increase the memory footprint relative to the baseline. In particular, we may need to store a large U-block, Ũ, and a large intermediate output, V. Our implementation preallocates these buffers, using N_b as a tunable parameter to constrain their sizes.

Pipelined execution. Given aggregated GEMMs, we use a software-pipelining scheduling scheme to overlap copying the GEMM operands to the GPU with execution of both the GEMMs themselves and the CPU Scatter. Our pipelining scheme, illustrated in fig. 2, uses CUDA's streams facility. The scheme divides Ũ into n_s partitions, where n_s here denotes the number of desired CUDA streams, a tuning parameter. To perform this division, the scheme first ensures that each partition has a minimum of N_b columns, and that no partition crosses the boundary of a block column. It then uses a greedy algorithm to ensure that each partition has at most the average number of columns, except for the last partition, which takes all the remaining columns. The pipelining begins with the transfer of L to the GPU. Each CUDA stream then asynchronously initiates the transfer of the i-th partition Ũ_i, a CUDA BLAS GEMM call that computes V_i ← L Ũ_i, and the transfer of V_i back to the host. Once V_i has been copied back, it is scattered as soon as possible. We schedule the GEMM and Scatter of the first block column on the CPU; this minimizes CPU idle time while the CPU waits for the first CUDA stream to finish transferring V_1. Note that CUDA streams mainly facilitate the overlap of CPU work, GPU work, and PCIe transfers; the streams themselves may, but do not necessarily, overlap.

The CUDA streams facility carries a nontrivial setup overhead. Suppose asynchronous CUDA calls take time t_s to initialize, and the effective floating-point throughput of the CPU is F_cpu operations per unit time. Then offloading a GEMM with fewer than t_s · F_cpu floating-point operations would be slower than executing it on the host. Our implementation uses this heuristic to decide whether offloading a particular GEMM phase to the GPU will pay off; otherwise, it executes on the CPU.

OpenMP parallelization of Scatter. We parallelized Scatter using OpenMP. There are a number of ways to assign blocks to threads for scattering. Prior work on SuperLU_DIST used a block-cyclic assignment [10]. However, we discovered by experiment that a particular static assignment can lead to severe load imbalance. In addition, assigning one block to a thread can be inefficient, since many blocks have very little work in them, leading to an overly fine grain size. We address these issues as follows. When there are a sufficient number of block columns, we schedule the Scatter of an entire block column to one thread using OpenMP's guided scheduling option. (We also tried dynamic scheduling, but for our test cases there was no significant difference in performance.)
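As a sketch, with illustrative names and data layouts rather than SuperLU_DIST's actual structures, the indirect-addressing Scatter step and a guided-schedule loop over its block-column tasks might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the Scatter phase: V is the dense GEMM result (ldv-by-ncol,
 * column major), and rowmap/colmap give the indirect addressing from
 * rows/columns of V into the larger destination block A (leading
 * dimension lda, column major). */
static void scatter_block(double *A, int lda,
                          const double *V, int ldv,
                          const int *rowmap, int nrow,
                          const int *colmap, int ncol)
{
    for (int c = 0; c < ncol; ++c) {
        double *Acol = A + (size_t)colmap[c] * lda;  /* indirect column */
        const double *Vcol = V + (size_t)c * ldv;
        for (int r = 0; r < nrow; ++r)
            Acol[rowmap[r]] -= Vcol[r];              /* A(i,j) -= V(r,c) */
    }
}

/* One task per destination block; a guided OpenMP schedule balances the
 * very uneven per-block work. (The pragma is ignored when OpenMP is off,
 * so the loop also runs correctly serially.) */
typedef struct {
    double *A; int lda;
    const double *V; int ldv;
    const int *rowmap; int nrow;
    const int *colmap; int ncol;
} scatter_task;

static void scatter_all(scatter_task *t, int ntasks)
{
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < ntasks; ++i)
        scatter_block(t[i].A, t[i].lda, t[i].V, t[i].ldv,
                      t[i].rowmap, t[i].nrow, t[i].colmap, t[i].ncol);
}
```

Because distinct tasks write disjoint destination blocks, the parallel loop needs no synchronization beyond the worksharing construct itself.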
When there are fewer block columns than threads, we switch from parallelizing across block columns to parallelizing across block rows. We also use OpenMP to parallelize the local work in the lookahead and panel factorization phases; however, doing so does not affect performance much, because these phases are dominated by MPI communication.

5 Experiments and Results

We used two GPU clusters in our evaluation (table 1). We tested our implementations on the input matrices in table 2, which derive from real applications [2]. We evaluated 6 implementation variants, all using double-precision arithmetic, including on the GPU. The baseline is SuperLU_DIST; we modified this baseline to include the BLAS aggregation technique of §4. Since all variants derive from SuperLU_DIST, they all include distributed-memory parallelism via MPI. Their mnemonic names describe what each variant adds to the MPI-enabled baseline.
Parameter                    Jinx cluster               Dirac GPU test bed
# GPUs per node              2                          1
Type of GPU                  Tesla M2090 (Fermi)        Tesla C2050 (Fermi)
GPU double-precision peak    665 GF/s                   515 GF/s
GPU DRAM / bandwidth         6 GB / 177 GB/s            3 GB / 144 GB/s
Host CPU                     Intel Xeon                 Intel Xeon
PCIe / bandwidth             PCIe x16 / 8 GB/s          PCIe x16 / 8 GB/s
Sockets x cores per socket   2 x 6                      2 x 4
CPU double-precision peak    128 GF/s                   76.8 GF/s
L3 cache                     2 x 12 MB                  2 x 8 MB
Memory                       24 GB                      24 GB
Network / bandwidth          InfiniBand / 40 Gbit/s     InfiniBand / 32 Gbit/s

Table 1: Evaluation testbeds for our experiments

Name                 symm   Fill-in ratio   Application
audikw_1             yes                    structural problem
bone010              yes                    model reduction problem
nd24k                yes                    2D/3D problem
RM07R                no                     computational fluid dynamics
dds.quad             no                     cavity matrix
matrix211            no     9.68            nuclear fusion
tdr190k              no     23              accelerator
Ga19As19H42          yes                    quantum chemistry problem
TSOPF_RS_b2383_c1    no     3.44            power network problem
dielfilterv2real     yes                    electromagnetics problem

Table 2: Test problems used for testing the solvers. See the University of Florida Sparse Matrix Collection [2]; some matrices come from NERSC users.

MKL_1 is the baseline, SuperLU_DIST Version 3.3 out of the box. It uses MPI only within a node, with Intel's MKL, a vendor BLAS library, running in single-threaded mode. This is the implementation we hope to improve by exploiting intranode parallelism. Unless otherwise noted, we try all numbers of MPI processes within a node, up to 1 MPI process per physical core, and report the performance of the best configuration.

MKL_p is the same as MKL_1, but with multithreaded MKL. It uses 1 MPI process per socket; within each socket, it uses multithreaded MKL with the number of threads equal to the number of physical cores per socket.

{cublas,scatter} is MKL_p with most GEMM calls replaced by their NVIDIA GPU counterparts, via the CUDA BLAS (or "cublas") library. (Any other BLAS call uses MKL_p.)
Additionally, cublas may execute asynchronously; therefore, there may be an additional performance benefit from partial overlap between cublas and Scatter, as the mnemonic name suggests. Like MKL_1, we try various numbers of MPI processes per node and report results for the best configuration. (When there are more MPI processes than physical GPUs, the cublas calls are automatically multiplexed.)

OpenMP+MKL_1 exploits intranode parallelism explicitly, parallelizing all phases using OpenMP. For phases that use the BLAS, we use explicit OpenMP parallelization with single-threaded MKL. The Scatter and GEMM phases run in sequence, i.e., they do not overlap.

OpenMP+{MKL_p,cublas} shares the work of the GEMM phase between the CPU and GPU, running them concurrently. This tends to reduce the time spent in GEMM compared to the OpenMP+MKL_1 implementation, but may not hide the cost completely.

OpenMP+{MKL_p,cublas,scatter}+pipeline adds pipelining to OpenMP+{MKL_p,cublas}. We use n_s = 16 CUDA streams and N_b = 128.

The first three implementations use implicit parallelism via multithreaded or GPU-accelerated BLAS; the last three involve explicit parallelism. We used X_s = 144 as the maximum supernode size. To profile execution time, we use TAU; to evaluate memory usage, we use the IPM tool [1].

[Fig. 3: Performance of the different implementations on each test problem on the Jinx cluster. Each bar breaks execution time into Scatter, DGEMM, and Other components, and is labeled by its speedup relative to the baseline (MKL_1).]
Overall impact of intranode optimization. Our first analysis answers the question: by how much can explicit intranode optimization techniques improve performance above and beyond having a highly tuned multicore and/or GPU-accelerated BLAS? These experiments use just two nodes of the cluster. The results show best-case improvements of up to 3× using our techniques, and highlight scenarios in which our methods may yield a slowdown.

We show results for the Jinx system in fig. 3. (Dirac results are similar [7], and so are omitted for space.) The figure shows time (y-axis) versus implementation variant (x-axis) for each matrix. Time is normalized to the baseline, with actual baseline execution times in the range of 10 to 1,000 seconds (not shown). Each bar breaks down the execution time into components corresponding to different phases of SuperLU. The GEMM and Scatter phases are as described in §4; the Scatter phase includes any CUDA stream setup and wait time. The Other phase has three major components: MPI_Wait, MPI_Recv, and triangular solve. When phases may overlap, the bar shows only the visible execution time, i.e., the part of the execution time that does not overlap; thus, the total height of the bar is the visible wall-clock time.

Both the MKL_p and {cublas,scatter} variants are slower than, or merely comparable to, MKL_1 in many cases. Though they may improve GEMM, Scatter and Other may slow down, since those phases tend to improve with more MPI processes, and these variants use fewer of them. Thus, relying only on accelerating BLAS calls, whether by multithreading or offload, tends not to yield a significant overall speedup, and can in fact decrease performance. The OpenMP+MKL_1 variant reduces the cost of the Scatter and Other phases compared to MKL_p and {cublas,scatter}. While Other for OpenMP+MKL_1 is better than with MKL_1, Scatter is worse; overall, OpenMP+MKL_1 often matches the baseline MKL_1.
The OpenMP+{MKL_p,cublas} variant reduces the time spent in GEMM compared to OpenMP+MKL_1, but cannot hide the cost of GEMM completely. Our combined OpenMP+{MKL_p,cublas,scatter}+pipeline implementation outperforms MKL_1 on 7 of the 10 test matrices on either platform, yielding speedups of up to 3× (fig. 3, audikw_1). Compared to MKL_1, this variant hides the cost of GEMM very well; however, Scatter still cannot achieve the same parallel efficiency as with MKL_1. The worst case occurs with TSOPF_RS_b2383_c1, which derives from a power network analysis application. On Jinx, it is nearly 2× slower than MKL_1 (fig. 3). However, even with a slowdown, our implementation can reduce the memory requirement of this problem; see below.

Strong scaling. Part of the benefit of intranode parallelism is to enhance strong scaling. We consider this scenario for configurations of up to 8 nodes and 64 cores (Dirac) or 96 cores (Jinx), showing results for Jinx in fig. 4. (Dirac results are better than Jinx's [7].) We present results for just two extremes: matrix nd24k, on which our implementation does well, and TSOPF_RS_b2383_c1, on which it fares somewhat poorly. We focus on three of our implementation variants: the baseline MKL_1, OpenMP+MKL_1, and OpenMP+{MKL_p,cublas,scatter}+pipeline. For the MKL_1 variant, we use 1 MPI process per core; for the other two, we use 1 MPI process per socket and one OpenMP thread per core.
[Fig. 4: Strong scaling on up to 8 nodes (96 cores and 16 GPUs) on Jinx. Panels show Total, GEMM, Scatter, MPI, and Other; the x-axis is (nodes) x (cores per node).]

Figure 4 shows scalability as a log-log plot of time (y-axis) versus configuration, as measured by the total number of cores (x-axis). Each series shows one of the three implementation variants. Each column is a phase, with the leftmost column, Total, showing the scalability of the overall computation, inclusive of all phases. Time is always normalized by the total MKL_1 time on the smallest configuration (1 node, 1 socket), to reveal the relative time spent in each phase. Dashed lines indicate ideal linear speedup for MKL_1; perfect scaling would be parallel to this line, while sublinear scaling has a shallower slope and superlinear scaling a steeper one. On Dirac (not shown), both test matrices exhibit good scaling behavior in nearly all phases; by contrast, Jinx (fig. 4) exhibits sublinear behavior. At 96 cores and 16 GPUs (2 GPUs per node), the three implementations differ by only a little on nd24k. This is due largely to the relatively poor scaling of the Other phase, which eventually becomes the bottleneck for OpenMP+{MKL_p,cublas,scatter}+pipeline. On TSOPF_RS_b2383_c1, the baseline MKL_1 is always fastest on both clusters, when it runs successfully; on Jinx, there was not enough memory per node to accommodate the 48- and 96-MPI-process cases, due to the fundamental memory scaling requirement of SuperLU_DIST (for more analysis, see below). The TSOPF_RS_b2383_c1 case shows superlinear scaling, and the Scatter phase is a major contributor to its cost.
As noted previously, the Scatter phase scales with increasing MPI processes, primarily due to better locality. Overall, OpenMP+MKL_1 shows good strong scaling. By contrast, the scaling of OpenMP+{MKL_p,cublas,scatter}+pipeline can be worse, as observed on Jinx. However, this owes largely to Amdahl's Law effects due to Other; that component is primarily a triangular solve step, which our work has not yet addressed.

[Fig. 5: Effect of intranode threading on memory and time for nd24k, tdr190k, and TSOPF_RS_b2383_c1, under configurations of (MPI processes) x (OMP threads): (a) time in MPI vs. compute; (b) user vs. MPI-runtime memory.]

Time and memory requirements. Sparse direct solvers like SuperLU_DIST may exhibit a time-memory tradeoff. We show an example on three representative problems in fig. 5. The example includes nd24k, which shows common-case behavior, as well as TSOPF_RS_b2383_c1, which was a worst case in execution time for our approach. The experiment tests the OpenMP+MKL_1 variant on one node of Dirac, which has 8 cores per node, under all configurations of (# MPI processes) x (# OpenMP threads) = 8. Matrix TSOPF_RS_b2383_c1 exhibits the time-memory tradeoff. The Scatter phase dominates its execution time, as observed above; since Scatter scales with MPI processes, the all-MPI configuration wins on time. However, memory usage actually increases with the number of MPI processes. Among user-allocated memory, the memory required by the L and U factors remains fairly constant, whereas the buffers used for MPI_Send and MPI_Recv grow; memory allocated by the MPI runtime also grows. Thus, even when our intranode threading approach is slower than the all-MPI case, it can yield a large reduction in the memory requirement.

6 Conclusions and Future Work

The high-level question this paper considers is how to exploit intranode parallelism in emerging CPU+GPU systems for distributed-memory sparse direct solvers. At the outset, one expects a highly tuned multicore and/or GPU BLAS
will yield much of the potential performance benefit. The real question, then, is how much additional performance gain is possible from explicit parallelization. Our results for SuperLU_DIST suggest that on today's systems, there may be up to a factor of 2 more to gain above and beyond BLAS-only parallelization. Other avenues to pursue include alternative accelerator platforms (e.g., Intel Xeon Phi, near-memory processing solutions); accelerating the Scatter phase, which requires extensive data structure changes; deeper architecture-dependent performance analysis; and evaluation of time-energy tradeoffs, which we believe are intrinsically present in the SuperLU_DIST algorithm.

References

1. IPM: Integrated Performance Monitoring. Accessed:
2. Timothy A. Davis and Yifan Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.
3. T. George, V. Saxena, A. Gupta, A. Singh, and A. Choudjury. Multifrontal factorization of sparse SPD matrices on GPUs. In Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage, Alaska, May 2011.
4. G. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs. In Proc. Symposium on Application Accelerators in High Performance Computing (SAAHPC), Urbana-Champaign, IL. NCSA, ncsa.illinois.edu/
5. Xiaoye S. Li and James W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans. Mathematical Software, 29(2), June 2003.
6. R. Lucas, G. Wagenbreth, D. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts. In VECPAR'10: Proc. 9th Intl. Meeting on High Performance Computing for Computational Science, Berkeley, CA, 2010.
7. Piyush Sao, Richard Vuduc, and Xiaoye Li. A distributed CPU-GPU sparse direct solver. Technical report, Georgia Institute of Technology.
8. O. Schenk, M. Christen, and H. Burkhart. Algorithmic performance studies on graphics processing units. J. Parallel and Distributed Computing, 68(10), 2008.
9. R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure. On the limits of GPU acceleration. In Proc. 2nd USENIX Conference on Hot Topics in Parallelism (HotPar'10), Berkeley, CA, 2010.
10. Ichitaro Yamazaki and Xiaoye S. Li. New scheduling strategies and hybrid programming for a parallel right-looking sparse LU factorization algorithm on multicore cluster systems. In 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 2012.
11. S. N. Yeralan, T. Davis, and S. Ranka. Sparse QR factorization on GPU architectures. Technical report, University of Florida, November.
12. C. D. Yu, W. Wang, and D. Pierce. A CPU-GPU hybrid approach for the unsymmetric multifrontal method. Parallel Computing, 37, 2011.
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Recommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
SOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
Operation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
Contributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
Experiences With Mobile Processors for Energy Efficient HPC
Experiences With Mobile Processors for Energy Efficient HPC Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic, Alex Ramirez Barcelona Supercomputing Center Universitat Politècnica
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix
7. LU factorization EE103 (Fall 2011-12) factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization algorithm effect of rounding error sparse
Kashif Iqbal - PhD [email protected]
HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD [email protected] ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
The Methodology of Application Development for Hybrid Architectures
Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department
15-418 Final Project Report. Trading Platform Server
15-418 Final Project Report Yinghao Wang [email protected] May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support
Workshare Process of Thread Programming and MPI Model on Multicore Architecture
Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma
A note on fast approximate minimum degree orderings for symmetric matrices with some dense rows
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. (2009) Published online in Wiley InterScience (www.interscience.wiley.com)..647 A note on fast approximate minimum degree orderings
A quick tutorial on Intel's Xeon Phi Coprocessor
A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be [email protected] Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado
Keys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
An examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
Report: Declarative Machine Learning on MapReduce (SystemML)
Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,
6. Cholesky factorization
6. Cholesky factorization EE103 (Fall 2011-12) triangular matrices forward and backward substitution the Cholesky factorization solving Ax = b with A positive definite inverse of a positive definite matrix
JUROPA Linux Cluster An Overview. 19 May 2014 Ulrich Detert
Mitglied der Helmholtz-Gemeinschaft JUROPA Linux Cluster An Overview 19 May 2014 Ulrich Detert JuRoPA JuRoPA Jülich Research on Petaflop Architectures Bull, Sun, ParTec, Intel, Mellanox, Novell, FZJ JUROPA
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs Rajib Nath Computer Science and Engineering University of California, San Diego [email protected] Stanimire Tomov Electrical Engineering and
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
Ready Time Observations
VMWARE PERFORMANCE STUDY VMware ESX Server 3 Ready Time Observations VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
benchmarking Amazon EC2 for high-performance scientific computing
Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received
7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
Optimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics
General Framework for an Iterative Solution of Ax b. Jacobi s Method
2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
OVERVIEW AND PERFORMANCE ANALYSIS OF THE EPETRA/OSKI MATRIX CLASS IN TRILINOS
CSRI Summer Proceedings 8 OVERVIEW AND PERFORMANCE ANALYSIS OF THE EPETRA/OSKI MATRIX CLASS IN TRILINOS I. KARLIN STUDENT AND J. HU MENTOR Abstract. In this paper, we describe a new matrix class in Epetra
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
Chapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief NVIDIA changed the high performance computing (HPC) landscape by introducing its Fermibased GPUs that delivered
The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters
The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing
Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms
Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,
