Architecture Aware Programming on Multi-Core Systems
Architecture Aware Programming on Multi-Core Systems

M.R. Pimple, Department of Computer Science & Engg., Visvesvaraya National Institute of Technology, Nagpur (India)
S.R. Sathe, Department of Computer Science & Engg., Visvesvaraya National Institute of Technology, Nagpur (India)

Abstract — In order to improve processor performance, the response of the industry has been to increase the number of cores on the die. One salient feature of multi-core architectures is that they have a varying degree of sharing of caches at different levels. With the advent of multi-core architectures, we face a problem that is new to parallel computing, namely, the management of hierarchical caches. Data locality features need to be considered in order to reduce the variance in performance for different data sizes. In this paper, we propose a programming approach for algorithms running on shared memory multi-core systems that uses blocking, a well-known cache optimization technique, coupled with the parallel programming paradigm OpenMP. We have chosen the sizes of the various problems based on architectural parameters of the system such as cache level, cache size and cache line size. We studied the optimization scheme on commonly used linear algebra applications: matrix multiplication (MM), Gauss elimination (GE) and the LU decomposition (LUD) algorithm.

Keywords — multi-core architecture; parallel programming; cache miss; blocking; OpenMP; linear algebra.

I. INTRODUCTION

While microprocessor technology has delivered significant improvements in clock speed over the past decade, it has also exposed a variety of other performance bottlenecks. To alleviate these bottlenecks, microprocessor designers have explored alternate routes to cost-effective performance gains. This has led to the use of multiple cores on a die. The design of contemporary multi-core architectures has progressively diversified from more conventional architectures. An important feature of these new architectures is the integration of a large number of simple cores with a software-managed cache hierarchy and local storage. Offering these new architectures as general-purpose computation platforms creates a number of problems, the most obvious one being programmability. Cache based architectures have been studied thoroughly for years, leading to the development of well-known programming methodologies that allow a programmer to easily optimize code for them. However, multi-core architectures are relatively new, and such general directions for application development do not exist yet.

Multi-core processors have several levels of memory hierarchy. An important question for software developers is how to achieve the best performance when the data is spread across local and global storage. The emergence of cache-based multi-core systems has created a cache-aware programming consensus. Algorithms and applications implicitly assume the existence of a cache. The typical example is linear algebra algorithms. To achieve good performance, it is essential that algorithms be designed to maximize data locality so as to best exploit the hierarchical cache structures. The algorithms must be transformed to exploit the fact that a cache miss moves a whole cache line from main memory. It is also necessary to design algorithms that minimize I/O traffic to slower memories and maximize data locality. As the memory hierarchy gets deeper, it becomes critical to manage the data efficiently.
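(Illustrative aside, not from the original paper.) To make the cache-line argument concrete, the short C sketch below contrasts a column-wise traversal of a row-major array, which for large N touches a new cache line on almost every access, with a row-wise traversal that reuses every fetched line; the array name and dimensions are our own assumptions.

#include <stddef.h>

#define N 4096
static double a[N][N];   /* row-major: a[i][j] and a[i][j+1] share a cache line */

/* Column-wise sum: consecutive accesses are N*sizeof(double) bytes apart,
   so for large N almost every access pulls in a fresh cache line. */
double sum_colwise(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Row-wise sum: consecutive accesses fall in the same cache line, so one
   miss serves several subsequent hits (spatial locality). */
double sum_rowwise(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

Both functions compute the same sum; on cache-based machines the row-wise version is typically several times faster, which is the effect that the blocking schemes of Section IV exploit at the tile level.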
A significant challenge in programming these architectures is to exploit the parallelism available in the architecture and to manage the fast memories so as to maximize performance. In order to avoid the high cost of accessing off-chip memory, algorithms and scheduling policies must be designed to make good use of the shared cache [12]. To improve data access performance, one of the well-known optimization techniques is tiling [3][10]. If this technique is used along with a parallel programming paradigm like OpenMP, considerable performance improvement can be achieved. However, there is no direct support for cache-aware programming in OpenMP for the shared memory environment. Hence, we suggest coupling OpenMP with the tiling technique to obtain the required performance gain.

The rest of the paper is organized as follows. Section II describes the computing problem which we have considered. The work done in the related area is described in Section III. Implementation of the problems is discussed in Section IV. The experimental setup and results are shown in Section V. The performance analysis is carried out in Section VI.

II. COMPUTING PROBLEM

As multi-core systems are becoming a popular and easily available choice, not only in the high performance computing world but also as desktop machines, developers are forced to tailor their algorithms to take advantage of this new platform. As the gap between cache and memory performance continues to grow, so does the importance of effective utilization of the memory hierarchy. This is especially evident in compute intensive algorithms that use very large data sets, such as most linear algebra problems. In the context of high performance computing, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the new architectural features of
these new processors. Matrix factorization plays an important role in a large number of applications. In its most general form, matrix factorization involves expressing a given matrix as a product of two or more matrices with certain properties. A large number of matrix factorization techniques have been proposed and researched in the matrix computation literature to meet the requirements and needs arising from different application domains. Some of the factorization techniques are categorized into separate classes depending on whether the original matrix is dense or sparse. The most commonly used matrix factorization techniques are LU, Cholesky, QR and singular value decomposition (SVD).

The problem of dense matrix multiplication (MM) is a classical benchmark for demonstrating the effectiveness of techniques that aim at improving memory utilization. One approach towards an effective algorithm is to restructure the matrices into a sequence of tiles; the copying operation is then carried out during multiplication. Also, for a system AX = B, there are several different methods to obtain a solution. If a unique solution is known to exist and the coefficient matrix is full, a direct method such as Gauss elimination (GE) is usually selected. The LU decomposition (LUD) algorithm is used as the primary means to characterize the performance of high-end parallel systems and to determine their rank in the Top 500 list [11]. LU factorization, or LU decomposition, is perhaps the most primitive and most popular matrix factorization technique, finding applications in direct solvers of linear systems such as Gauss elimination. LU factorization involves expressing a given matrix as the product of a lower triangular matrix and an upper triangular matrix. Once the factorization is accomplished, simple forward and backward substitution can be applied to solve a linear system. LU factorization also turns out to be extremely useful when computing the inverse or determinant of a matrix, because computing the inverse or the determinant of a lower or an upper triangular matrix is relatively easy.

III. RELATED WORK

Since multi-core architectures are now becoming mainstream, effectively tapping the potential of these multiple units is the major challenge. Performance and power characteristics of scientific algorithms on multi-core architectures have been thoroughly tested by many researchers [7]. Basic linear algebra operations on matrices and vectors serve as building blocks in many algorithms and software packages. Loop tiling is an effective optimization technique to boost the memory performance of a program. Tile size selection using cache organization and data layout, mainly for single core systems, is discussed by Stephanie Coleman and Kathryn S. McKinley [10]. The LU decomposition algorithm decomposes the matrix that describes a linear system into a product of a lower and an upper triangular matrix. Due to its importance in scientific computing, it is a well-studied algorithm, and many variations of it have been proposed, both for uni- and multi-processor systems. The LU algorithm has been implemented using recursive methods [5], and using pipelining and hyperplane solutions [6]. It has also been implemented using blocking algorithms on the Cyclops 64 architecture [8]. Dimitrios S. Nikolopoulos, in his paper [4], implemented a dynamic blocking algorithm. Multi-core architectures with alternative memory subsystems are evolving, and it is becoming essential to find programming and compiling methods that are effective on these platforms.
The issues of the diversity of these platforms, local and shared storage, movement of data between local and global storage, and how to effectively program these architectures are discussed at length by Ioannis E. Venetis and Guang R. Gao [8]. The algorithm is implemented using a block recursive matrix scheme by Alexander Heinecke and Michael Bader [1]. Jay Hoeflinger, Prasad Allavilli, Thomas Jackson and Bob Kuhn have studied scalability issues using OpenMP for CFD applications [9]. OpenMP issues in the development of parallel BLAS and LAPACK libraries have also been studied [2]. However, the issues and challenges related to programming and effective exploitation of shared memory multi-core systems with respect to cache parameters have not been considered.

Multi-core systems have a hierarchical cache structure. Depending upon the architecture, there can be two or three layers, with private and shared caches. When implementing an algorithm on shared memory systems, cache parameters must be considered. The tile size selection for any particular thread running on a core is a function of the size of the L1 cache, which is private to that core, as well as of the L2 cache, which is shared. If cache parameters such as cache level, cache size and cache line size are taken into account, substantial performance improvement can be obtained. In this paper, we present the parallelization of the MM, GE and LUD algorithms on shared memory systems using OpenMP.

IV. IMPLEMENTATION

In this paper we have implemented the parallelization of the most widely used linear algebra algorithms, matrix multiplication, Gauss elimination and LU decomposition, on multi-core systems. Parallelization of these algorithms can also be implemented using the message passing interface (MPI). The pure MPI model assumes that message passing is the correct paradigm to use for all levels of parallelism available in the application, and that the application topology can be mapped efficiently to the hardware topology. However, this may not be true in all cases. For the matrix multiplication problem, the data can be decomposed into domains and these domains can be independently passed to and processed by the various cores. In the case of the LU decomposition or GE problem, however, task dependency prevents the workload from being distributed independently to all processors. Since the distributed processors do not share a common memory subsystem, the computation to communication ratio for this problem is very low. Communication between processors on the same node goes through the MPI software layers, which adds overhead. Hence, the pure MPI implementation approach is useful when domain decomposition can be used, such that the total data space can be separated into fixed regions of data, or domains, to be worked on separately by each processor. For the GE and LUD problems, we used the approach of 1D partitioning of the matrix among the cores and then used the OpenMP paradigm for distributing the work among a number of
threads to be executed on the various cores. The approach of 2D partitioning of data among cores is more suitable for array processors. For a shared memory platform, all the cores on a single die share the same memory subsystem, and there is no direct support for binding threads to cores using OpenMP. So, we restricted our experiments to the 1D partitioning technique and applied parallelization for achieving speedup using OpenMP.

A. Architecture Aware Parallelization

To cope with memory latency, all the data required during any phase of the algorithm are made available in the cache. The data sets are chosen so that they can be accommodated in the cache. Considering the cache hierarchy, the tile size selection depends upon the cache size and the cache line size, in order to eliminate self-interference misses. Depending upon the architecture of the underlying machine, the computation work is then split across the available cores. One-dimensional partitioning of the data is done, so that every core receives a specific number of rows (or columns) and the data fits in the shared cache. The blocking technique is then used, which ensures that the maximum block size is equal to the size of the private cache belonging to the core. Parallel computation is carried out by the individual cores using OpenMP pragmas.

B. Determining Block Size

In order to exploit cache affinity, the block size is chosen such that the data can be accommodated in the cache. The experiments were carried out on square matrices of size N. Let s be the size of each element of the matrix, Cs be the size of the shared cache, and B be the block size.

1. For blocked matrix multiplication, C = A x B, a block of matrix A, a block of matrix B, and one row of matrix C should be accommodated in the cache. The required block size can then be calculated from s(2B^2 + N) <= Cs; for large cache sizes, we get B ~ sqrt(Cs / (2s)). (1)

2. For the GE problem, the size of the input (augmented) matrix is N x (N+1). The required block size can be calculated by requiring a block of B rows of this matrix to fit in the cache, i.e. sB(N+1) <= Cs. So, the optimal block size is B ~ Cs / (s(N+1)). (2)

3. For the LU decomposition algorithm, with the same matrix used for storing the lower and the upper triangular matrices, the optimal block size comes out to be B ~ Cs / (sN). (3)

C. Effect of Cache Line Size

Let the cache line size be Cls. Without loss of generality, we assume that the first element of the input array falls in the first position of a cache line. The number of rows that completely fit in the cache can be calculated as:

Rows = Cs / Cls (4)

For every memory access, an entire cache line is fetched. A block size smaller than the line size will therefore lead to self-interference misses. Also, if the block size is not a multiple of the line size, the system will fetch additional lines, which may in turn lead to capacity misses, as fewer rows can be accommodated in the cache. So, to take advantage of spatial locality, the block sizes chosen were integral multiples of the cache line size Cls. We assumed that every row in the selected tile is aligned on a cache line boundary. After finding the number of rows, the block size can be calculated as B = k x Cls if the number of rows exceeds Cls (where k is an integer), and B = Cls otherwise. Maximum speedup is achieved when B equals, or is an integral multiple of, the number of rows.

The algorithm for block size selection is presented in Fig. 1. Further improvement in performance is achieved by using the technique of register caching for the array elements that are outside the purview of the for loop (like the value a[i][i] shown in Fig. 3). This value is cached in a register, which is then shared by all the threads executing the for loop. The OpenMP implementations of the matrix multiplication and GE problems are given in Fig. 2 and Fig. 3 respectively.

D. LU Decomposition

The main concept is to partition the matrix into smaller blocks with a fixed size.
The diagonal entry in each block is processed by the master thread on a single core. Then, for calculating the entries of the upper triangular matrix, each row is partitioned into a number of groups equal to the number of cores, so that each group is processed by a separate core. Similarly, for calculating the entries of the lower triangular matrix, each column is partitioned into a number of groups equal to the number of cores, so that each group is processed by a separate core. The implementation divides the matrix into fixed-size blocks that fit into the L1 data cache of each core, creating the first level of memory locality. On the shared memory architecture, the whole matrix is assumed to be in the globally accessible memory address space. The algorithm starts by processing the diagonal block on one processor, while all other processors wait at the barrier. When this block finishes, the blocks on the same row are processed by the various cores in parallel. Then the blocks in the same column are processed by the various cores in parallel. In turn, each processor waits at the barrier again for the next diagonal block. The storage space can be reduced further by storing the lower and upper triangular matrices in a single matrix. The diagonal elements of the lower triangular matrix are all 1 and hence need not be stored. However, this method suffers from load imbalance if the number of elements processed in each row or column is not divisible by the number of cores available. Also, the active portion of the matrix shrinks after each iteration, and hence the load allocation after each iteration is not constant.
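(Illustrative aside, not part of the paper's implementation.) With the packed storage just described, where the strictly lower triangle holds L with an implicit unit diagonal and the upper triangle holds U, solving A x = b reduces to one forward and one backward sweep. A minimal sketch, with the function name and C99 array parameters as our own assumptions:

/* Solve A x = b given the in-place LU factors in a[n][n]:
   L is the strictly lower triangle with an implicit unit diagonal,
   U is the upper triangle including the diagonal. */
void lu_solve(int n, double a[n][n], const double b[n], double x[n]) {
    double y[n];                      /* intermediate vector for L y = b */

    /* Forward substitution: the unit diagonal of L means no division here. */
    for (int i = 0; i < n; i++) {
        y[i] = b[i];
        for (int k = 0; k < i; k++)
            y[i] -= a[i][k] * y[k];
    }

    /* Backward substitution: U x = y. */
    for (int i = n - 1; i >= 0; i--) {
        x[i] = y[i];
        for (int k = i + 1; k < n; k++)
            x[i] -= a[i][k] * x[k];
        x[i] /= a[i][i];
    }
}

This is the sense in which the forward and backward substitution mentioned in Section II become simple once the factorization is in place.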
1) LUD computation: Let A be an n x n matrix with rows and columns numbered from 0 to (n-1). The factorization consists of n major steps, each step consisting of one iteration of the outer loop starting at line 3 of Fig. 5. In step k, first the partial column a[k+1..n-1][k] is divided by a[k][k]. Then the outer product a[k+1..n-1][k] x a[k][k+1..n-1] is subtracted from the (n-k-1) x (n-k-1) submatrix a[k+1..n-1][k+1..n-1]. For each iteration of the outer loop, the next nested loop in the algorithm goes from k+1 to n-1.

Procedure BS(CS, CLS, N, B)
Input:   CS  : cache size
         CLS : cache line size
         s   : size of each element of the input
         N   : number of rows of the input matrix
Output:  B   : block size (square)
  Total lines = CS / CLS
  NR = number of rows of the input problem that can be accommodated in the cache
  If (NR > CLS)
      B = k x CLS      // where k is an integer constant
  Else
      B = CLS

Figure 1. Block size selection

void mat_mult()                      /* blocked matrix multiplication */
{
  for (i = 0; i < N; i = i + B) {
    /* read block of a and c; read block of b */
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel for shared(a,b,c,i) private(r,i1,j) schedule(static)
    for (r = i; r < min(i + B, N); r++)
      for (i1 = i; i1 < min(i + B, N); i1++)
        for (j = 0; j < N; j++)
          c[r][i1] += a[r][j] * b[j][i1];
    /* write block of c */
  }
}

Figure 2. Parallel matrix multiplication

A typical computation of the LU factorization procedure in the k-th iteration of the outer loop is shown in Fig. 4. The k-th iteration of the outer loop does not involve any computation on rows 0 to (k-1) or columns 0 to (k-1). Thus, at this stage only the lower right (n-k) x (n-k) submatrix of A is computationally active, and the active part of the matrix shrinks towards the bottom right corner as the computation proceeds. The amount of computation increases from the top left to the bottom right of the matrix, so the amount of work done differs for different elements of the matrix. The work done by the processes assigned to the beginning rows and columns is far less than that of the processes assigned to the later rows and columns. Hence, a static scheme of block partitioning can potentially lead to load imbalance. Secondly, a process working on a block may idle even when there are unfinished tasks associated with that block.

void forwardsubstitution()           /* Gauss elimination loop, matrix size n x (n+1) */
{
  int i, j, k, max, kk;
  float t, x;
  for (i = 0; i < n; ++i) {
    max = i;                         /* partial pivoting: find the pivot row */
    for (j = i + 1; j < n; ++j)
      if (a[j][i] > a[max][i]) max = j;
    for (j = 0; j < n + 1; ++j) {    /* swap rows i and max */
      t = a[max][j]; a[max][j] = a[i][j]; a[i][j] = t;
    }
    for (j = n; j >= i; --j)
      for (k = i + 1; k < n; k = k + B) {
        x = a[i][i];                 /* register caching */
        #pragma omp parallel for shared(i,j,k,x) private(kk) schedule(static)
        for (kk = k; kk < min(k + B, n); ++kk)
          a[kk][j] -= a[kk][i] / x * a[i][j];
      }
  }
}

Figure 3. OpenMP parallelization of the GE loop

This idling can occur if the constraints imposed by the task-dependency graph do not allow the remaining tasks on this process to proceed until one or more tasks mapped onto other processes are completed.

2) LUD OpenMP parallelization: For parallelization of the LU decomposition problem on shared memory, we used the tiling technique with the OpenMP paradigm. The block size B is selected such that the matrix data is accommodated in the shared cache. The actual data block used by each core is smaller than the size of the private cache, so that locality of memory access is maintained for each thread. For the LUD algorithm, due to the task dependency at each iteration level, the computation cannot be started simultaneously on every core. So, the algorithm starts on one core: the diagonal element is processed by the master core.
After the synchronization barrier, the computation of the non-diagonal elements is split over the available cores. After computing a row and a column of the result matrix, the barrier is applied again to synchronize the operations for the next loop. The size of the data computed by each core is determined by the block size. The amount of data dealt with by each core after each iteration is not the same. With static scheduling, the chunk is divided evenly among the available threads and every thread works on the same amount of data. Fig. 5 illustrates the OpenMP parallelization. The size of the input matrix a is N x N.
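(Sketch, ours, not the authors' measurement driver.) The speedups reported in Section V below are presumably obtained by comparing a parallel run against a serial reference; a minimal OpenMP timing harness for such a comparison, with the kernel names assumed, could look as follows:

#include <omp.h>
#include <stdio.h>

/* Assumed kernels: a serial reference and the blocked OpenMP version
   (e.g. the matrix multiplication of Fig. 2). */
extern void mat_mult_serial(void);
extern void mat_mult(void);

int main(void) {
    double t0 = omp_get_wtime();      /* wall-clock timer provided by OpenMP */
    mat_mult_serial();
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    mat_mult();
    double t_parallel = omp_get_wtime() - t0;

    printf("speedup = %.2f\n", t_serial / t_parallel);
    return 0;
}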
Table 1. System configuration

Processor                  Core frequency   L1 data cache   L2 cache          L3 cache
Core 2 Duo E7500           2.93 GHz         32 KB           3072 KB, shared   -
Dual Core E5300            2.60 GHz         32 KB           2048 KB, shared   -
Xeon(R) X5650 (12 cores)   2.67 GHz         32 KB           256 KB            12 MB
Xeon(R) E5630 (16 cores)   2.53 GHz         32 KB           256 KB            12 MB

Figure 4. Processing of blocks of LU decomposition

1.  Lu-Fact(a)
2.  {
3.    for (k = 0; k < N; k++)
4.    {
        #pragma omp single
5.      for (j = k+1; j < N; j++)
6.        a[j][k] = a[j][k] / a[k][k];
7.      #pragma omp parallel for shared(a,k) private(jj) schedule(static)
8.      for (j = k+1; j < N; j = j + B)
9.        for (jj = j; jj < min(j+B, N); jj++)
10.       {  v = a[k][jj];           /* caching the value in a register */
11.          #pragma omp parallel for shared(a,k,jj,v) private(i) schedule(static)
12.          for (i = k+1; i < N; i++)
13.            a[i][jj] = a[i][jj] - (a[i][k] * v);
          }
      }
    }

Figure 5. OpenMP implementation of the LU decomposition algorithm

V. EXPERIMENTAL SETUP & RESULTS

We conducted experiments to test the cache-aware parallelization of the MM, GE and LUD algorithms on Intel dual core, 12 core and 16 core machines. Each processor has hyper-threading technology, so that each processor can simultaneously execute instructions from two threads, which appear to the operating system as two different processors and can run a multiprogrammed workload. The configuration of the systems is given in Table 1. Each processor has a 32 KB L1 data cache. The Intel Xeon processors (12 and 16 cores) have an eight-way set associative 256 KB L2 cache and a 12 MB L3 cache dynamically shared between the threads. The systems run Linux 2.6.x. Blocked LU decomposition was parallelized at two levels using OpenMP. We used relatively large data sets, so that the performance of the codes becomes more bound to the L2 and L3 cache miss latencies. The programs were compiled with the C compiler (gcc 4.3.2). Fig. 6 and Fig. 7 show the speedup achieved when the block sizes are such that the data fits in the L2 cache, for the matrix multiplication and Gauss elimination algorithms respectively. Fig. 8 and Fig. 9 show the results of the LU decomposition algorithm for various matrix sizes on the dual core and 16 core systems respectively.

Figure 6. Matrix multiplication: data fits in L2 cache

Figure 7. Gauss elimination on 12 & 16 core machines, data accommodated in L2 cache

Figure 8. Speedup for LU decomposition on dual core
Figure 9. Speedup for LU decomposition on 16 core machine (matrix size N x N)

VI. PERFORMANCE ANALYSIS

The strategy of parallelization is based on two observations. The first is that the computation to communication ratio should be very high for an implementation on shared memory multi-core systems. The second is that the memory hierarchy is an important parameter to be taken into account in the algorithm design, since it affects the load and store times of the data. Considering this, we implemented the matrix multiplication and Gauss elimination algorithms with a blocking scheme that divides the matrices into relatively small square tiles. The optimal block size is selected for each core such that the tile is accommodated in the private cache of each core, thus avoiding conflict misses. This approach of distributing the data chunks to each core greatly improves the performance. Fig. 6 and Fig. 7 show the performance improvement when the block size is a multiple of the cache line size. Whenever the block size is greater or smaller than the line size, performance suffers. This is due to the overhead of reloading an entire new cache line for the next data chunk. With this strategy, we obtained a speedup of 2.1 on the 12 core machine and a speedup of 2.4 on the 16 core machine. The sub-linear speedups in Fig. 6 and Fig. 7 for the lower block sizes are attributed to blocking overheads.

For the Gauss elimination and LU decomposition problems, the OpenMP pragma splits the data among the available cores. The amount of data dealt with by every core after every iteration is different, which leads to a load imbalance problem. A chunk scheduling scheme demands chunk calculations at every iteration and hence affects performance. Static scheduling, however, ensures an equal load for every thread and hence reduces the load imbalance. For the LU decomposition problem with 1D partitioning of data among the cores, we observed speedups of 1.39 and 2.46 on the two dual core machines and a speedup of 3.63 on the 16 core machine. The maximum speedup is observed when the number of threads is equal to the number of (hardware) threads supported by the architecture. Fig. 9 shows the speedup when 16 threads are running on the 16 core machine. The speedup is directly proportional to the number of threads. The performance degrades when more software threads are in execution than the threads supported by the architecture; so, for 18 threads, the scheduling overhead increases and performance is degraded. However, when the number of threads is more than 8, performance also degrades due to communication overheads. This is because the 16 core Intel Xeon machine comprises two quad cores connected via a QPI link. Fig. 9 shows performance enhancement up to eight threads and degradation in performance when the number of threads is ten. When the computation is split across all the available sixteen threads, speedup is again observed, as the communication overhead is amortized over all cores. Further enhancement in performance is achieved when the method of register caching is used for the loop independent variables in the program.

Many tiling implementations do not take these optimal block size considerations, together with cache attributes, into account. Our implementation, however, considers the cache hierarchy and cache parameters and arrives at an optimal block size. The block size calculations are governed by the architecture of the individual machine and the algorithm under consideration. Once the machine parameters and the input problem size are available, tailoring the algorithm accordingly improves the performance to a great extent.
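(Sketch, ours, not from the paper.) On Linux systems such as those in Table 1, the machine parameters that drive the block size selection of Fig. 1 can be queried at run time instead of being hard-coded; the sysconf names below are glibc extensions and the fallback constants are assumptions:

#include <stddef.h>
#include <unistd.h>

/* Cache line size in bytes, with an assumed 64-byte fallback. */
static long cache_line_size(void) {
    long ls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);   /* glibc extension */
    return ls > 0 ? ls : 64;
}

/* Shared cache size in bytes, with an assumed 2 MB fallback. */
static long shared_cache_size(void) {
    long cs = sysconf(_SC_LEVEL2_CACHE_SIZE);        /* glibc extension */
    return cs > 0 ? cs : 2 * 1024 * 1024;
}

/* Largest square block size B (in elements) such that two B x B tiles fit
   in the shared cache and B is an integral multiple of the cache line size,
   roughly following the working-set argument of Section IV-B. */
int pick_block_size(size_t elem_size) {
    long cs = shared_cache_size();
    long line_elems = cache_line_size() / (long)elem_size;
    if (line_elems < 1) line_elems = 1;
    long b = line_elems;
    while (2 * (b + line_elems) * (b + line_elems) * (long)elem_size <= cs)
        b += line_elems;
    return (int)b;
}

A call such as pick_block_size(sizeof(double)) then replaces a hard-coded B; for a 3 MB shared L2 cache and 64-byte lines it lands near B = 440 for double precision elements.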
Of course, there is a significant amount of overhead in the OpenMP barriers at the end of the loops, which means that load imbalance, and not data locality, is then the limiting problem.

VII. CONCLUSION & FUTURE WORK

We evaluated the performance effects of exploiting architectural parameters of the underlying platform for programming on shared memory multi-core systems. We studied the effect of the private L1 cache, the shared L2 cache and the cache line size on the execution of compute intensive algorithms. Exploiting L1 cache affinity does not affect the performance much, but the effect of exploiting L2 affinity is considerable, due to its sharing among multiple threads and the high reloading cost for larger volumes. If these factors are considered and coupled with a parallel programming paradigm like OpenMP, a performance enhancement is achieved. We conclude that cache affinity awareness in compute intensive algorithms on multi-core systems is absolutely essential and will improve the performance significantly. We plan to extend the optimization techniques for performance enhancement on multi-core systems by considering the blocking technique at the register level and the instruction level. We also plan to investigate and present generic guidelines for compute intensive algorithms on various multi-core architectures.

REFERENCES

[1] Alexander Heinecke and Michael Bader, Towards many-core implementation of LU decomposition using Peano curves, UCHPC-MAW '09, May 18-20, 2009, Ischia, Italy.
[2] C. Addison, Y. Ren, and M. van Waveren, OpenMP issues arising in the development of parallel BLAS and LAPACK libraries, Scientific Programming Journal, vol. 11, IOS Press, November 2003.
[3] Chung-Hsing Hsu and Ulrich Kremer, A quantitative analysis of tile size selection algorithms, The Journal of Supercomputing, vol. 27.
[4] Dimitrios S. Nikolopoulos, Dynamic tiling for effective use of shared caches on multi-threaded processors, International Journal of High
Performance Computing and Networking, vol. 2, no. 1, 2004.
[5] F.G. Gustavson, Recursion leads to automatic variable blocking for dense linear algebra algorithms, IBM Journal of Research and Development, 41(6), Nov. 1997.
[6] H. Jin, M. Frumkin, and J. Yan, The OpenMP implementation of NAS parallel benchmarks and its performance, NAS technical report, NASA Ames Research Center.
[7] Ingyu Lee, Analyzing performance and power of multi-core architecture using multithreaded iterative solver, Journal of Computer Science, 6(4).
[8] Ioannis E. Venetis and Guang R. Gao, Mapping the LU decomposition on a many-core architecture: Challenges and solutions, CF '09, May 18-20, 2009, Ischia, Italy.
[9] Jay Hoeflinger, Prasad Alavilli, Thomas Jackson, and Bob Kuhn, Producing scalable performance with OpenMP: Experiments with two CFD applications, International Journal of Parallel Computing, 27 (2001).
[10] Stephanie Coleman and Kathryn S. McKinley, Tile size selection using cache organization and data layout, Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, California, United States, 1995.
[11] The Top 500 List.
[12] Vahid Kazempour, Alexandra Fedorova, and Pouya Alagheband, Performance implications of cache affinity on multicore processors, Proceedings of the 14th International Euro-Par Conference on Parallel Processing, Spain, 2008.

AUTHORS PROFILE

S.R. Sathe received his M.Tech. in Computer Science from IIT Bombay (India) and his Ph.D. from Nagpur University (India). He is currently working as a Professor in the Department of Computer Science & Engineering at Visvesvaraya National Institute of Technology, Nagpur (India). His research interests include parallel and distributed systems, mobile computing and algorithms.

M.R. Pimple is working as a Programmer in the Department of Computer Science & Engineering at Visvesvaraya National Institute of Technology, Nagpur (India) and is pursuing her M.Tech. in Computer Science. Her areas of interest are computer architecture and parallel processing.