PERFORMANCE COMPARISON OF PARALLEL PROGRAMMING PARADIGMS ON A MULTICORE CLUSTER
Enzo Rucci, Franco Chichizola, Marcelo Naiouf and Armando De Giusti
Institute of Research in Computer Science LIDI (III-LIDI)
National University of La Plata
La Plata, Buenos Aires, Argentina
{erucci,francoch,mnaiouf,degiusti}@lidi.info.unlp.edu.ar

ABSTRACT

Currently, most supercomputers are multicore clusters. This type of architecture is said to be hybrid because it combines distributed memory with shared memory. Traditional parallel programming paradigms (message passing and shared memory) do not adapt naturally to the hardware features offered by these architectures; a parallel paradigm that combines message passing with shared memory is expected to exploit them better. Therefore, in this paper the performance of two parallel programming paradigms (message passing, and the combination of message passing with shared memory) is analyzed for multicore clusters. The case study used is the construction of phylogenetic trees by means of the Neighbor-Joining method. Finally, conclusions and future research lines are presented.

KEY WORDS

Parallel programming, hybrid programming, multicore cluster, performance comparison, Neighbor-Joining method.

1 Introduction

The study of distributed and parallel systems is one of the most active research lines in computer science nowadays [1, 2]. In particular, the use of multiprocessor architectures configured as clusters, multiclusters, grids and clouds, supported by networks with different characteristics and topologies, has become widespread, not only for the development of parallel algorithms but also for the execution of processes that require intensive computation and for the provision of concurrent web services [3, 4, 5, 6].

The energy consumption and heat dissipation problems that appear when the clock speed of a processor is scaled up led to the development of multicore processors. This type of processor integrates two or more computational cores within the same chip. Even though these cores are simpler and slower, combined they enhance the overall performance of the processor while making efficient use of energy [7, 8].

The incorporation of this type of processor into conventional clusters gives birth to an architecture that combines shared and distributed memory, known as a multicore cluster [9, 10]. In this type of architecture, communications between processing units are heterogeneous [11] and can be classified into two groups: inter-node and intra-node. Inter-node communications take place between cores located in different nodes, which communicate by exchanging messages through the interconnection network. Intra-node communications take place between cores within the same node, which communicate through the different memory levels of the node.

Parallel programming paradigms differ in the way tasks communicate and synchronize. Traditionally, there were two main programming paradigms for parallel architectures: shared memory and message passing. In shared memory architectures, such as multicores, the most widely used paradigm is shared memory: tasks communicate and synchronize by reading and writing variables in a shared address space. OpenMP is the most widely used library for shared memory programming [12]. Message passing is the most commonly chosen paradigm for distributed architectures, such as traditional clusters.
In this paradigm, each task has its own address space, and tasks communicate and synchronize by exchanging messages. MPI is the most widely used library for programming under this paradigm [13].

Multicore clusters are hybrid architectures that combine distributed memory with shared memory, so neither of the previous paradigms naturally adapts to all the hardware features of these architectures. For this reason, the scientific community has great interest in analyzing hybrid parallel programming paradigms that allow communication both through message passing and through shared memory. A parallel paradigm that uses shared memory for intra-node communications and message passing for inter-node communications would be expected to better leverage the features offered by multicore clusters [14]; the sketch below illustrates the general structure of such a program.

In this paper, the performance of two parallel algorithms designed for the same application but using different programming paradigms is compared on a multicore cluster. The application selected as case study is the construction of phylogenetic trees by means of the Neighbor-Joining method, chosen because of its computational complexity, O(n^3).
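As a minimal illustration of the hybrid paradigm (a generic skeleton, not the implementation evaluated in this paper), the following C program launches one MPI process per node and an OpenMP thread team inside each process, so that inter-node communication uses messages while intra-node work shares memory. It can be compiled with, e.g., mpicc -fopenmp.

/* Minimal hybrid MPI+OpenMP sketch: one MPI process per node,
 * several OpenMP threads per process. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);              /* inter-node level: message passing */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel                 /* intra-node level: shared memory */
    {
        printf("process %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}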
The rest of the paper is organized as follows: in Section 2, related works are described. In Section 3, the Neighbor-Joining method is explained, together with the sequential and parallel algorithms used. In Section 4, the experimental work carried out is described, whereas in Section 5, the results obtained are presented and analyzed. Finally, in Section 6, the conclusions and future lines of work are presented.

2 Related Works

There are numerous works that analyze and compare parallel programming paradigms for multicore clusters. To mention but a few, [15], [16] and [17] compare message passing with combinations of message passing and shared memory. Results vary depending on the characteristics of the problem solved, the algorithms used and the features of the hardware architecture used as support, which makes research in this area all the more significant.

3 Neighbor-Joining Method

Studies carried out on the molecular mechanisms of organisms suggest that all organisms on the planet have a common predecessor. Any set of species would therefore be related, and this relationship is called a phylogeny. In general, it can be represented by means of a phylogenetic tree; the task of phylogenetics is to infer this tree from observations of the existing organisms [18].

The Neighbor-Joining method (NJ) is based on distance matrices and is widely used by biotechnologists and molecular biologists due to its efficiency and its temporal complexity order. In recent years, it has become very popular through its use in the ClustalW algorithm [19], one of the most widely used tools for multiple sequence alignment. The NJ algorithm was originally developed by Saitou and Nei in 1987 [20]. One year later, Studier and Keppler [21] revised the algorithm and incorporated an improvement that reduces its temporal complexity from O(n^5) to O(n^3).

The Neighbor-Joining method starts with a star-shaped tree. In each step, the pair of nodes that are closest to each other is selected and connected through a new, inner node. Then, the distances from this new node to the rest of the nodes in the tree are calculated. The algorithm ends when only two nodes remain unconnected [18]. Algorithm 1 details the pseudo-code of the Neighbor-Joining method, and Figure 1 shows an example of the building process of a phylogenetic tree using the Neighbor-Joining method for a 4x4 distance matrix.

Algorithm 1 Pseudo-code of the Neighbor-Joining method.
Initialization: L = S.
Iteration:
  Pick the pair i, j for which the normalized distance D_{i,j} is minimum, where
      D_{i,j} = d_{i,j} - (r_i + r_j)
  and r_i is the divergence of node i:
      r_i = (1 / (|L| - 2)) * sum_{k in L} d_{i,k}
  Define a new node k and assign
      d_{k,s} = (1/2) (d_{i,s} + d_{j,s} - d_{i,j})  for all s in L.
  Add k to S with distance edges
      d_{i,k} = (1/2) (d_{i,j} + r_i - r_j),   d_{j,k} = d_{i,j} - d_{i,k},
  connecting k to i and j, respectively.
  Remove i and j from L and add k.
Termination: when L is formed by two leaves i and j, add the pending edge between i and j with distance d_{i,j}.

3.1 Sequential Neighbor-Joining Algorithm

The algorithm starts with an N x N matrix of distances between pairs of sequences, denoted d, N being the number of sequences. Since this is a symmetric matrix, there is no need to store it in full: either the lower or the upper triangular matrix can be stored (in this case, the former is chosen). In each iteration of the main loop, the pair i, j for which D_{i,j} is minimal has to be found. The normalized distance matrix D is not stored; instead, the value of each position is computed on the fly in each iteration, as the sketch below illustrates.
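The following C sketch (a reconstruction under assumptions, not the authors' published code) makes this on-the-fly evaluation concrete: only the lower triangle of d is stored, r is the divergence vector, and active plays the role of the flag array for L described in the next paragraph.

/* Illustrative sketch: find the pair (i, j) minimizing
 * D[i][j] = d[i][j] - (r[i] + r[j]) without ever storing D.
 * d is the lower-triangular distance matrix, r the divergence
 * vector, active the flags marking the nodes still in L. */
#include <float.h>

void find_min_pair(double **d, const double *r, const int *active,
                   int n, int *best_i, int *best_j) {
    double best = DBL_MAX;
    for (int i = 1; i < n; i++) {
        if (!active[i]) continue;
        for (int j = 0; j < i; j++) {            /* lower triangle: j < i */
            if (!active[j]) continue;
            double D = d[i][j] - (r[i] + r[j]);  /* computed on the fly */
            if (D < best) {
                best = D;
                *best_i = i;
                *best_j = j;
            }
        }
    }
}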
Node divergences are computed into a one-dimensional array before the main loop starts, and are updated in each iteration of the loop rather than being recalculated every time they are required. For the list of active nodes L, a one-dimensional array of flags is used; the flags indicate which nodes have already been joined and which have not. Algorithm 2 details the pseudo-code of the sequential algorithm.

Algorithm 2 Pseudo-code of the sequential algorithm.
1. L = S.
2. foreach d_{i,j} in d do
3.   Update r_i.
4.   Update r_j.
5. end foreach
6. for h in 1 to N-2 do
7.   foreach d_{i,j} in d do
8.     Calculate D_{i,j}.
9.     Calculate minimum D_{i,j}.
10.  end foreach
11.  Create node k connecting nodes i and j.
12.  S = S + {k}.
13.  Calculate d_{i,k} and d_{j,k}.
14.  L = L - {i,j} + {k}.
15.  foreach s in L do
16.    Calculate d_{k,s}.
17.    Update r_k.
18.    Update r_s.
19.  end foreach
20. end for
21. Group both remaining nodes in L.

Figure 1. Building process of a phylogenetic tree using the Neighbor-Joining method for a 4x4 distance matrix.

3.2 Parallel Neighbor-Joining Algorithm

The algorithm was parallelized using two parallel programming paradigms: message passing, and the combination of message passing with shared memory. Before going into an in-depth review of the solutions developed, certain aspects of the Neighbor-Joining algorithm should be analyzed, since they help explain why an efficient parallel solution is difficult to obtain.

First, it should be noted that in each iteration of the main loop a new node is added to the distance matrix, while the two nodes that originated it are removed. This means that, in a parallel solution, the work carried out by each task in each iteration decreases as the iterations of the main loop progress. Also, in distributed environments, the distances of each new node must be distributed among all the processes forming the solution; a round-robin ownership rule such as the one sketched below is one simple way to do this.
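As a minimal sketch of such a scheme (an assumption made here for illustration; Section 3.2.1 only states that new nodes are assigned in circular order), the owner of each newly created node can be chosen by a modulo rule:

/* Illustrative round-robin (circular) assignment of newly created
 * nodes to the P processes. Numbering new nodes consecutively as
 * N, N+1, ... is an assumption made for this sketch. */
int owner_of_node(int node_id, int num_procs) {
    return node_id % num_procs;   /* circular order over the processes */
}

With this rule, every process can locally compute who owns any node without extra communication, which keeps the distribution of new nodes cheap.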
The search for the pair of nodes whose distance is minimal is the most expensive part of the main loop from a computational standpoint. Taking into account that the distance matrix is triangular, distributing the search by assigning the same number of rows to each task could result in idle time, since not all rows have the same number of cells. Since idle time negatively affects the performance of an algorithm, it must be removed or, if this is not possible, minimized. For this reason, the workload distribution strategy must be chosen so as to be as equitable as possible for all tasks.

3.2.1 Message Passing as Parallel Programming Paradigm

This solution uses a master-slave model of P processes as its parallelization strategy. The distances of each newly created node are assigned to one of the processes following a circular order (the process to which a node is assigned is its owner). Algorithm 3 details the pseudo-code of the parallel algorithm.

Algorithm 3 Pseudo-code of the parallel Neighbor-Joining algorithm.
1. The master process divides the distance matrix d into P portions and distributes P-1 of them among the slaves. Each process keeps approximately N(N-1)/(2P) elements of the distance matrix d.
2. Each process calculates the partial divergences of the nodes it holds, broadcasts them to the other processes, and updates its divergence vector with the partial divergences received from the other processes.
3. for h in 1 to N-2 do
4.   Each process calculates its local minimum D_{i,j}.
5.   The master process collects all local minimums, calculates the global minimum D_{i,j} and broadcasts it to the other processes.
6.   The master process creates a new node k and adds it to S.
7.   The process that owns node k calculates d_{i,k} and d_{j,k}. Each remaining process accumulates in a data structure the distances to the pair of nodes i, j that it holds, and then sends it to the owner of node k.
8.   The owner of node k calculates the distances from it to the rest of the nodes and updates the divergence vector.
9.   The owner of node k broadcasts the updated divergence vector to the other processes.
10.  Each process removes nodes i and j from its own L and adds k.
11. end for

3.2.2 Combination of Message Passing and Shared Memory as Parallel Programming Paradigm

This solution is based on the one described in Section 3.2.1, but unlike it, each process spawns T threads when computation begins. The iterations of the different process loops are then distributed among the threads that have been generated; a sketch of the resulting two-level minimum search follows.
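The sketch below (a reconstruction under assumptions, not the authors' code) shows how steps 4-5 of Algorithm 3 can be expressed under the hybrid paradigm: OpenMP threads share each process's rows dynamically, and a single MPI collective combines the local minimums. Note that it replaces the explicit master gather-and-broadcast of step 5 with MPI_Allreduce and MPI_MINLOC, which is semantically equivalent here; the pair (i, j) is packed into one int as i*n + j.

/* Two-level minimum search illustrating steps 4-5 of Algorithm 3
 * under the hybrid paradigm. Assumptions: each process holds the
 * rows listed in my_rows; n is the current node count. */
#include <mpi.h>
#include <omp.h>
#include <float.h>

void hybrid_min_step(double **d, const double *r, const int *active,
                     const int *my_rows, int my_row_count, int n) {
    struct dloc { double val; int loc; } local_min, global_min;
    double best = DBL_MAX;
    int best_pair = -1;

    /* Intra-node level: threads share this process's rows dynamically,
     * balancing the uneven row lengths of the triangular matrix. */
    #pragma omp parallel
    {
        double t_best = DBL_MAX;
        int t_pair = -1;
        #pragma omp for schedule(dynamic)
        for (int ri = 0; ri < my_row_count; ri++) {
            int i = my_rows[ri];
            if (!active[i]) continue;
            for (int j = 0; j < i; j++) {
                if (!active[j]) continue;
                double D = d[i][j] - (r[i] + r[j]);
                if (D < t_best) { t_best = D; t_pair = i * n + j; }
            }
        }
        #pragma omp critical
        if (t_best < best) { best = t_best; best_pair = t_pair; }
    }

    /* Inter-node level: one collective combines the local minimums. */
    local_min.val = best;
    local_min.loc = best_pair;
    MPI_Allreduce(&local_min, &global_min, 1, MPI_DOUBLE_INT,
                  MPI_MINLOC, MPI_COMM_WORLD);
    /* global_min.loc now encodes the globally closest pair (i, j). */
}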
4 Experimental Work

4.1 Architecture Used

Tests were carried out on a Blade multicore cluster with four blades, each with two quad-core Intel Xeon processors [22, 23]. Each blade has 10 GB of RAM (shared between both processors) and 2 x 6 MB of L2 cache, with each 6 MB block shared by a pair of cores. The operating system is GNU/Linux Fedora 12 (64 bits).

4.2 Algorithms Used

The algorithms used in this work were developed in the C language (gcc compiler version 4.4.2), with the OpenMPI library (mpicc compiler version 1.4.3) for message passing and OpenMP for thread management. The algorithms are detailed below:

MP: this algorithm is based on the solution described in Section 3.2.1, where P is the number of cores used.

HY: this algorithm is based on the solution described in Section 3.2.2, where P is the number of blades used and T is the number of cores in each blade.

4.3 Tests Carried Out

Based on the features of the architecture, both algorithms were tested using all the cores of different numbers of nodes: two, three and four; this means that P = {16, 24, 32} for MP. In the case of HY, one process per node was used; this means that P = {2, 3, 4} and T = 8. Various problem sizes were used: N = {4000, 6000, 8000, 10000, 12000, 14000, 16000}. Each particular test was run five times, and the average execution time was calculated for each of them.

5 Results

To assess the behavior of the algorithms developed when scaling the problem and/or the architecture, the speedup and efficiency of the tests carried out are analyzed [1, 3, 24]. The speedup metric is used to analyze the algorithm's performance on the parallel architecture, as indicated in Equation (1):

    Speedup = SequentialTime / ParallelTime    (1)

To assess how good the obtained speedup is, the efficiency metric is calculated. Equation (2) indicates how to calculate this metric, where p is the total number of cores used:

    Efficiency = Speedup / p    (2)

Figure 2 shows the efficiency achieved by the algorithms MP and HY when using two, three and four blades of the architecture for different problem sizes (N). For readability, only the results for N = {4000, 8000, 12000, 16000} are shown.

Figure 2. Efficiency achieved by algorithms MP and HY when using two, three and four blades of the architecture for different problem sizes (N).

This chart shows that both algorithms increase their efficiency as the size of the problem increases and that, as is to be expected in most parallel systems, the efficiency decreases when the total number of nodes used increases. The efficiency levels obtained with both algorithms are low due to the number of communication and synchronization operations carried out and the idle time that processes and threads may incur. Despite this, it can be seen that the best efficiency levels are obtained by HY.
The superiority of HY over MP can be analyzed in detail in Figure 3, which shows the percentage relative difference between the efficiencies of both algorithms (prd), calculated by means of Equation (3):

    prd = ((efficiency(HY) - efficiency(MP)) / efficiency(MP)) * 100    (3)
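As a hypothetical worked example of Equations (1)-(3) (the numbers below are illustrative only, not measured values from this study): if the sequential run takes 1000 s and MP takes 125 s on p = 32 cores, then

    Speedup(MP) = 1000 / 125 = 8
    Efficiency(MP) = 8 / 32 = 0.25

and if HY reaches an efficiency of 0.40 on the same 32 cores, then

    prd = ((0.40 - 0.25) / 0.25) * 100 = 60%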
Figure 3. prd with two, three and four blades of the architecture for different problem sizes (N).

The chart in Figure 3 shows that HY is better than MP in all cases, achieving improvement percentages higher than 250%. It can also be seen that the improvement increases with the number of nodes and decreases as the size of the problem increases.

The difference in favor of HY is due to several factors. First, HY reduces latency and maximizes the bandwidth of the interconnection network: by using a single, multi-threaded process per node instead of multiple processes, it groups all the task messages corresponding to a node into a single, larger message. It also removes competition for the network at the node level. Finally, since the distance matrix d is divided into fewer portions and the work assigned to each portion is distributed dynamically among the threads of each process, HY achieves a more balanced work distribution than the fully static strategy used by MP.

6 Conclusions and Future Works

In this paper, the performance of two parallel programming paradigms (message passing and hybrid) was compared on current cluster architectures, taking as case study the construction of phylogenetic trees by means of the Neighbor-Joining method. The algorithms were tested using various problem and architecture sizes. The results obtained show that the hybrid parallelization better leverages the hardware features offered by the supporting architecture, which in turn yields better performance.

Future lines of work include the development and optimization of hybrid solutions for other types of applications and their comparison with solutions based only on message passing or shared memory.

References

[1] A. Grama, G. Karypis, V. Kumar, and A. Gupta, An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2nd ed. Addison Wesley.
[2] M. Ben-Ari, Principles of Concurrent and Distributed Programming, 2nd ed. Addison Wesley.
[3] Z. Juhász, P. Kacsuk, and D. Kranzlmüller, Eds., Distributed and Parallel Systems: Cluster and Grid Computing. Springer Science + Business Media Inc.
[4] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White, The Sourcebook of Parallel Computing. Morgan Kaufmann.
[5] M. D. Stefano, Distributed Data Management for Grid Computing. John Wiley & Sons Inc.
[6] M. Miller, Web-Based Applications That Change the Way You Work and Collaborate Online. Que.
[7] AMD. (2009) Evolución de la tecnología de múltiple núcleo. [Online]. Available: amd.com/es-es/amd-multi-core/resources/Technology-Evolution
[8] T. Burger. Intel Multi-Core Processors: Quick Reference Guide.
[9] L. Chai, Q. Gao, and D. K. Panda, "Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system," in Proceedings of the IEEE International Symposium on Cluster Computing and the Grid.
[10] M. McCool, "Scalable programming models for massively multicore processors," in Proceedings of the IEEE.
[11] C. Zhang, X. Yuan, and A. Srinivasan, "Processor affinity and MPI performance on SMP-CMP clusters," in IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW).
[12] (2012) OpenMP.org. [Online].
[13] (2012) The Message Passing Interface (MPI) standard. [Online].
[14] N. Drosinos and N. Koziris, "Performance comparison of pure MPI vs hybrid MPI-OpenMP parallelization models on SMP clusters," in 18th International Parallel and Distributed Processing Symposium (IPDPS 2004).
[15] E. Rucci, A. De Giusti, F. Chichizola, M. Naiouf, and L. De Giusti, "DNA sequence alignment: hybrid parallel programming on a multicore cluster," in Recent Advances in Computers, Communications, Applied Social Science and Mathematics, N. Mastorakis, V. Mladenov, B. Lepadatescu, H. R. Karimi, and C. G. Helmis, Eds., vol. 1, no. 1. WSEAS Press, September 2011.
[16] F. Leibovich, L. De Giusti, and M. Naiouf, "Parallel algorithms on clusters of multicores: Comparing message passing vs hybrid programming," in Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011).
[17] X. Wu and V. Taylor, "Performance characteristics of hybrid MPI/OpenMP implementations of NAS Parallel Benchmarks SP and BT on large-scale multicore clusters," The Computer Journal, vol. 55.
[18] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 7th ed. Cambridge University Press.
[19] ClustalW and ClustalX Multiple Sequence Alignment. [Online].
[20] N. Saitou and M. Nei, "The neighbor-joining method: a new method for reconstructing phylogenetic trees," Mol. Biol. Evol., vol. 4, 1987.
[21] J. A. Studier and K. J. Keppler, "A note on the neighbor-joining algorithm of Saitou and Nei," Mol. Biol. Evol., vol. 5, 1988.
[22] HP. HP BladeSystem. [Online].
[23] HP. HP BladeSystem c-Class architecture. [Online].
[24] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd ed. Prentice Hall, 2005.