# A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Save this PDF as:

Size: px
Start display at page:

Download "A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster"

## Transcription

1 Acta Technica Jaurinensis Vol. 3. No A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906 Egyetem tér 1. Phone: , fax: Abstract: There are several iterative models for solving general, large linear equations. In this paper a parallel algorithm with slow convergence speed for studying the speed-up effect of the parallel algorithms has been presented. The difference between the ring and hierarchical topology considering running speed and efficiency has also been explored. Detailed numerical test results of the algorithm including the speedup of parallel execution are shown. Keywords: parallel programming, linear equation, cluster computing 1. The minimal residual algorithm The main goal of this article is to study the speed up effect on a parallel computer using a parallel algorithm for solving a full rank, general, symmetric, positive definite not sparse (but dense) linear equation with high condition number 10 ( n = 1000, cond = 10 ; n = 5000, n = 15000, 16 cond = cond = 4 10 ; n = 10000, ), where cond( A) = A A 1 1 cond = 7 10 ; and Euclidean norm was used. The condition number shows the difficulty of the linear equation. Higher condition number means the complexity of the problem, in other words the equation gets more and more difficult with numerical methods. The solving algorithms of the general Ax = b equation are well known [1] [] [3]. In the case of large systems direct algorithms are inefficient. Only iterative methods can be used that can produce results with the desired precision [4] [10], otherwise the floating point arithmetic causes several rounding errors. The base of the presented numerical algorithm for the solution of linear systems of equations is a generalisation of the classical one-step iterative algorithm (such as the gradient method). Generalisation will not improve the convergence speed of the algorithm but it highly improves parallel execution. In a sequential case the base algorithm [5] [6] has slow convergence speed. The minimal residual algorithm is a widely known algorithm, but the suggested versions of Algorithm1 and Algorithm have been created by the authors. The results of the parallel realisation of the algorithms and the measured data are the results of the present research. 65

2 Vol. 3. No Acta Technica Jaurinensis The suggested algorithms give a good opportunity to study the effects of parallel processing. If a cluster or a multiprocessor computer is used, one can expect considerable speed-up effect.. Methods for parallelisation on homogeneous and heterogeneous systems The aim of parallel processing is to break a large problem down to several smaller components or calculations that can be solved parallelly with different processors at the same time. The most efficient tools for scientific computations would be massively parallel computers, with a large shared memory, but this hardware is expensive and unattainable for the research team. Other solutions can be distributed systems and cluster computing. A small cluster of 16 PCs with normal network connection and an interconnected cluster machine with 88 processors (HP BladeSystem C3000) were used. On the cluster computing model a message passing software (MPI) was used to solve the tasks..1. Ring and hierarchical topologies In parallel solutions there are two bottlenecks for optimisation: the communication and the calculations. These aspects have been studied with two topologies. For linear equations based on numerical models the ring model is often used [8]. In this case every node has a connection with the two neighbouring nodes, or other nodes. This solution works efficiently on homogeneous systems. On this model the heartbeat algorithm is useful (see Figure 1 a). First it starts an initialization procedure, then a loop starts. In the first phase of the loop a data sending and receiving mechanism process is accomplished (synchronization). This is the data exchange period between each computation node. After that, every node runs the computation algorithm. This is the cpu period of the work. The loop runs until the stopping criteria. In this model every node has an equivalent role. This model needs a homogeneous network and the same type of processor because each synchronization step made by the slowest node. Fast nodes need to wait for them. Figure 1. Ring (a) and hierarchical (b) topology In other cases hierarchical topologies are useful [9]. The master-worker model is based on a distributed and large, heterogeneous cluster (see Fig 1b.). The master node controls 66

3 Acta Technica Jaurinensis Vol. 3. No the running processes, assigns problems to workers and manages the partial solutions. The role of the worker nodes is to solve smaller parts of the problem. This model works well with asynchronous methods, too because worker nodes have not connected to each other. Every worker node can reach the maximum performance of the processors... Algorithm 1 For a ring topology the following algorithm has been used: I. Let P be the number of nodes, let eps be the tolerated error value. II. Let A be the matrix to be solved and let b be the solution vector. III. For every node IV. For every node p P do in parallel: generate x1 random vector. p P do in parallel: Operation() while a result arrives or converges. V. The result of solution is x1 on master node. The algorithm uses the Operation() function on every node: 1. do. let x be a new random vector 3. let r1 = Ax1 b and r = Ax b, where r 1 r 0 ( 1, ) 4. let 1 : r r r c = r r Remarks: 5. let x 1 : = c 1 x 1 + ( 1 c 1 ) x 6. let r 1 : = c 1 r 1 + ( 1 c 1 ) r 7. let x 1 : = x1, and r 1 : = r if 1 < min or iterationn um > n r then send x1 to the next node and wait for a new x vector 9. while < eps r return x 1, the solution with desired precision. From the vector exchange it is expected that the given result is better, or when the algorithm reaches a local minimum value this 1 x solution is sent to another node, which continues the computation with a new random number coming from another node (see Figure ). 67

4 Vol. 3. No Acta Technica Jaurinensis In the implementation of the algorithm the Mersenne-twister pseudorandom number generator has been applied [7] and the independence of iteration sequences is based on the independent clock of the computing nodes. Two error measure methods have been used in the algorithm. The first was the general residual error: err r = r1 = Ax b. When the exact solution of the test case is known (x ), there is a chance to compute the absolute error value: err x = x x, where x is the approximate solution vector. Figure. The convergence of Algorithm1 (n = 100) In the case of large-sized, badly conditioned linear equations these error values are 16 relatively high numbers (with cond = 10, err = 1000 means a close solution as shown in Figure 4, in detail that means x x r n is x i x i and the error for every i= 1 i i member of the solution vector is approximately x x 10 ). If a problem was solved where the correct solution had already been known, and the residual and absolute error were compared it has to be noted that the absolute error of the solution is always better than the residual. The efficiency of the algorithm depends on load balancing: the operation can be repeated several times with slow convergence speed or the result vector can be exchanged between nodes to give extra speedup. In this case and referring to Amdahl's law the ring model has a theoretical maximum number of nodes. If more nodes are used and the best solution is sent to the next node, the larger ring will increase running time as the solution waves slowly on to the other nodes. 8 68

5 Acta Technica Jaurinensis Vol. 3. No Figure 3. The convergence of Algorithm1 (n = 500) This Algorithm1 works well on a homogeneous cluster (see Figure3). But if there is a slower or a loaded computer on the ring the send-receive method will be slow and several traffic jams are expected and running times also grow. Algorithm1 has been revised and a new, hierarchical model has been composed..3. Algorithm I. Let P be the number of nodes, let eps be the tolerated error value. II. Let A be the matrix to be solved and let b be the solution vector. III. The master node generates a random vector x and sends for every worker node. IV. For every node p P do in parallel: generate a random vector x 1 V. On master node do Control() while < eps VI. For every worker node p P x 1 do in parallel: Operation(x1) while a vector arrives VII. The result of solution is x1 on master node. On the worker nodes the Operation() function uses a residual approach like Algorithm1. The difference is that the operation function sends and receives data from the master node only. 69

6 Vol. 3. No Acta Technica Jaurinensis Figure 4. Residual and absolute error of the solution vector (n = 15000, P = 88 cpus, log-log scale) The Control function of the master node distributes and collects data from every worker node. On the master node the problem is not solved, but the result is presented here. The master node controls the data exchanges, and presents the best approximate result vector for every node (see Figure 4). This model is flexible because the number of worker nodes number can grow dynamically [8]. This model can be used on heterogeneous clusters, too, because the worker nodes are independent and communicate only with the master node. Every node works on the master s best solution. Figure 5. Convergence speed between topologies (n = 500, P = 16 cpus); 70

7 Acta Technica Jaurinensis Vol. 3. No The difference between the ring and hierarchical algorithm is that the hierarchical algorithm results in smaller computational time and better convergence. As it can be seen in Figure 5 the ring solution (solid line) has a minor gradient whereas the hierarchical solution (dotted line) has a steeper gradient. This is because at the ring model the corrective effect of a new solution reaches the previous node in P 1 steps, while in the hierarchical model the corrective effect is achieved in one step. Moreover, on the ring model we can only take the result of one or two neighbours into consideration. Figure 6. The results of working nodes (dotted) and the minimal residual error (solid line) (n = 10000, P = 88, log-log scale) The hierarchical model has better convergence features. At every iteration loop the master node sends the best solution vector for the workers. This adds some genetic features to the algorithm [5]. If we examine the details on Figure 3 and Figure 4 steps in the curve can be seen. It has to be noted that in a P-processor master-worker model only P-1 processors solve the linear equation. Let us focus now on this hierarchical solution. If the convergence curves are observed it can be noticed that all of the solutions are similar, and the final solution has always the same order of error in every case. The differences between the exact values of the errors are unfortunately caused by the pseudo-random number generator. In this algorithm only an approximate solution is achieved, not the exact vector. If the problem is examined from another point of view and the execution time of the algorithm is recorded, the results shown in Figure 7 are achieved. If more computing nodes are used the running time is expected to decrease. 71

8 Vol. 3. No Acta Technica Jaurinensis Figure 7. Time of execution and processor numbers (n = 5000); In these cases only parallel execution provides results in an acceptable time. At some points larger than linear relative speed-up can be achieved, as it is shown in Fig 8. But when the number of processors grows, the effective speedup and efficiency decreases as the worker nodes report their own solutions and the load of the master node grows. More precise load balancing can be used, but the size of the problem and the communication delays prevent further advances. For better results another method has to be used. Figure 8. Algortihm relative speedup (n = 500) 7

9 Acta Technica Jaurinensis Vol. 3. No Results Above a new type of algorithm for the solution of a linear equation system on heterogeneous clusters has been presented. The algorithm is based on a residual minimisation technique with master-worker solutions. The algorithm has some genetic features because the new, better vectors are made from a group of good vectors as seen in Figure 8 for extra speed-up. Computer tests have proved the theoretical results; parallel implementation is much better than the sequential one. We get a considerable speed-up effect using a parallel computer. The goal of creating these algorithms has been basic research but the solution of bad condition linear equations is a useful method for several practical and realistic problems. Optimization, finite element methods, control or simulation problems are often based on large dense linear equations. We have to note that only the simplest algorithm has been tested. The test with more effective algorithms will be the subject of a forthcoming work. 4. References [1] Louis A. Hageman, Davis M. Joung: Applied Iterative Methods, Computer Science and Applied Mathematics, Academic Press, (1981). [] P. G. Ciarlet: Introduction à l analyse numérique matricielle et à l optimisation, MASSON, Paris, (198). [3] G. Golub, A. Greenbaum, M. Luskin, eds., Recent Advances is Iterative Methods, The IMA Volumes in Math. and its Applications Vol.60., Springer Verlag, (1994). [4] G. Molnárka, N. Varjasi: Parallel algorithm for solution of general linear systems of equations, Informatika a felsőoktatásban 005, Debrecen ISBN , pp.176. [5] G. Molnárka: A scalable parallel algorithm for solving general linear system equations, 77th GAMM annual meeting 006, Berlin (006) pp.441. [6] N. Varjasi, Parallel Algorithm for linear equations with different network topologies, Proceedings of International e-conference on Computer Science (IeCCS) 006 in Lecture Series on Computer and Computational Sciences, (007) pp , Brill Academic Publishers, ISBN [7] M. Matsumoto and T. Nishimura, Mersenne Twister: A 63-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Modeling and Computer Simulation Vol. 8, No. 1, January (1998) pp [8] A. Basermann, B. Reichel, C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, in Parallel Computing 3. (1997) pp [9] E. J. H. Yero, M. A. A Henriques, Speedup and scalability analysis of Master- Slave applications on large heterogeneous clusters, in Journal of Parallel and Distributed Computing 67. (007) pp [10] I. S. Duff, H. A. van der Vorst, Developments and trends in the parallel solution of linear systems, in Parallel Computing 5. (1999) pp

10 Vol. 3. No Acta Technica Jaurinensis 74

### Load Balancing using Potential Functions for Hierarchical Topologies

Acta Technica Jaurinensis Vol. 4. No. 4. 2012 Load Balancing using Potential Functions for Hierarchical Topologies Molnárka Győző, Varjasi Norbert Széchenyi István University Dept. of Machine Design and

### Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

### P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

### A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

### Windows Server Performance Monitoring

Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

### Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

### APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

### Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

### Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

### High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

### Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

### General Framework for an Iterative Solution of Ax b. Jacobi s Method

2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,

### Load Balancing and Termination Detection

Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

### Scalability and Classifications

Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

### Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

### Experiments on the local load balancing algorithms; part 1

Experiments on the local load balancing algorithms; part 1 Ştefan Măruşter Institute e-austria Timisoara West University of Timişoara, Romania maruster@info.uvt.ro Abstract. In this paper the influence

### Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.

### Distributed Operating Systems Introduction

Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz ens@ia.pw.edu.pl, akozakie@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of

### Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

### Building an Inexpensive Parallel Computer

Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

### Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

### PARALLEL ALGORITHMS FOR PREDICTIVE MODELLING

PARALLEL ALGORITHMS FOR PREDICTIVE MODELLING MARKUS HEGLAND Abstract. Parallel computing enables the analysis of very large data sets using large collections of flexible models with many variables. The

### Overview of High Performance Computing

Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://geco.mines.edu/workshop 1 This tutorial will cover all three time slots. In the first session we will discuss the

### Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

### Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

### Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State

### Operation Count; Numerical Linear Algebra

10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

### Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions

### ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

### Fast Multipole Method for particle interactions: an open source parallel library component

Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,

### Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

### A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster

, pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing

### RAID: Redundant Arrays of Inexpensive Disks this discussion is based on the paper: on Management of Data (Chicago, IL), pp.109--116, 1988.

: Redundant Arrays of Inexpensive Disks this discussion is based on the paper:» A Case for Redundant Arrays of Inexpensive Disks (),» David A Patterson, Garth Gibson, and Randy H Katz,» In Proceedings

### A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS

Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.) - Tehnologii informationale, Editura

### A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com

### SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

### Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

### A numerically adaptive implementation of the simplex method

A numerically adaptive implementation of the simplex method József Smidla, Péter Tar, István Maros Department of Computer Science and Systems Technology University of Pannonia 17th of December 2014. 1

### LaPIe: Collective Communications adapted to Grid Environments

LaPIe: Collective Communications adapted to Grid Environments Luiz Angelo Barchet-Estefanel Thesis Supervisor: M Denis TRYSTRAM Co-Supervisor: M Grégory MOUNIE ID-IMAG Laboratory Grenoble - France LaPIe:

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### CHAPTER 1 INTRODUCTION

1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

### Parallelism and Cloud Computing

Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

Adaptive Stable Additive Methods for Linear Algebraic Calculations József Smidla, Péter Tar, István Maros University of Pannonia Veszprém, Hungary 4 th of July 204. / 2 József Smidla, Péter Tar, István

### Grid Computing Approach for Dynamic Load Balancing

International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-1 E-ISSN: 2347-2693 Grid Computing Approach for Dynamic Load Balancing Kapil B. Morey 1*, Sachin B. Jadhav

### TESLA Report 2003-03

TESLA Report 23-3 A multigrid based 3D space-charge routine in the tracking code GPT Gisela Pöplau, Ursula van Rienen, Marieke de Loos and Bas van der Geer Institute of General Electrical Engineering,

### A simplified implementation of the least squares solution for pairwise comparisons matrices

A simplified implementation of the least squares solution for pairwise comparisons matrices Marcin Anholcer Poznań University of Economics Al. Niepodleg lości 10, 61-875 Poznań, Poland V. Babiy McMaster

### Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

### Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

### AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS

AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS by M. Storti, L. Dalcín, R. Paz Centro Internacional de Métodos Numéricos en Ingeniería - CIMEC INTEC, (CONICET-UNL), Santa Fe, Argentina

### Performance of Scientific Processing in Networks of Workstations: Matrix Multiplication Example

Performance of Scientific Processing in Networks of Workstations: Matrix Multiplication Example Fernando G. Tinetti Centro de Técnicas Analógico-Digitales (CeTAD) 1 Laboratorio de Investigación y Desarrollo

### Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

### Load Balancing Between Heterogenous Computing Clusters

Load Balancing Between Heterogenous Computing Clusters Siu-Cheung Chau Dept. of Physics and Computing, Wilfrid Laurier University, Waterloo, Ontario, Canada, N2L 3C5 e-mail: schau@wlu.ca Ada Wai-Chee Fu

### BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

### Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

### Measuring Parallel Processor Performance

ARTICLES Measuring Parallel Processor Performance Alan H. Karp and Horace P. Flatt Many metrics are used for measuring the performance of a parallel algorithm running on a parallel processor. This article

### Load Balancing and Termination Detection

Chapter 7 slides7-1 Load Balancing and Termination Detection slides7-2 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination

### Parallel Algorithm for Dense Matrix Multiplication

Parallel Algorithm for Dense Matrix Multiplication CSE633 Parallel Algorithms Fall 2012 Ortega, Patricia Outline Problem definition Assumptions Implementation Test Results Future work Conclusions Problem

Load Balancing Techniques 1 Lecture Outline Following Topics will be discussed Static Load Balancing Dynamic Load Balancing Mapping for load balancing Minimizing Interaction 2 1 Load Balancing Techniques

### Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

### Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

Appl. Comput. Math. 2 (2003), no. 1, pp. 57-64 NEUROMATHEMATICS: DEVELOPMENT TENDENCIES GALUSHKIN A.I., KOROBKOVA. S.V., KAZANTSEV P.A. Abstract. This article is the summary of a set of Russian scientists

### 2: Computer Performance

2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12

### MPI Hands-On List of the exercises

MPI Hands-On List of the exercises 1 MPI Hands-On Exercise 1: MPI Environment.... 2 2 MPI Hands-On Exercise 2: Ping-pong...3 3 MPI Hands-On Exercise 3: Collective communications and reductions... 5 4 MPI

### Properties of Stabilizing Computations

Theory and Applications of Mathematics & Computer Science 5 (1) (2015) 71 93 Properties of Stabilizing Computations Mark Burgin a a University of California, Los Angeles 405 Hilgard Ave. Los Angeles, CA

### Computer Architecture TDTS10

why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

Advanced Computer Architecture Institute for Multimedia and Software Engineering Conduction of Exercises: Institute for Multimedia eda and Software Engineering g BB 315c, Tel: 379-1174 E-mail: marius.rosu@uni-due.de

### Interconnection Networks

Advanced Computer Architecture (0630561) Lecture 15 Interconnection Networks Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interconnection Networks: Multiprocessors INs can be classified based on: 1. Mode

### GPGPU accelerated Computational Fluid Dynamics

t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

### Introduction to Cloud Computing

Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

### Computing the Determinants in the Multiprocessor Computer

ISSN: 9-8 (An ISO 9: Certified Organization) Vol., Issue 8, August Computing the Determinants in the Multiprocessor Computer Salman H. Abbas Associate Professor, Department of Mathematics, College of Science,

### Performance of networks containing both MaxNet and SumNet links

Performance of networks containing both MaxNet and SumNet links Lachlan L. H. Andrew and Bartek P. Wydrowski Abstract Both MaxNet and SumNet are distributed congestion control architectures suitable for

### A Direct Numerical Method for Observability Analysis

IEEE TRANSACTIONS ON POWER SYSTEMS, VOL 15, NO 2, MAY 2000 625 A Direct Numerical Method for Observability Analysis Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper presents an algebraic method

### Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Zhao Wenbin 1, Zhao Zhengxu 2 1 School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu

### AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION

AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION Jörg Frochte *, Christof Kaufmann, Patrick Bouillon Dept. of Electrical Engineering and Computer Science Bochum University of Applied Science 42579

### Various Schemes of Load Balancing in Distributed Systems- A Review

741 Various Schemes of Load Balancing in Distributed Systems- A Review Monika Kushwaha Pranveer Singh Institute of Technology Kanpur, U.P. (208020) U.P.T.U., Lucknow Saurabh Gupta Pranveer Singh Institute

### A Survey on Load Balancing and Scheduling in Cloud Computing

IJIRST International Journal for Innovative Research in Science & Technology Volume 1 Issue 7 December 2014 ISSN (online): 2349-6010 A Survey on Load Balancing and Scheduling in Cloud Computing Niraj Patel

### 10.3 POWER METHOD FOR APPROXIMATING EIGENVALUES

58 CHAPTER NUMERICAL METHODS. POWER METHOD FOR APPROXIMATING EIGENVALUES In Chapter 7 you saw that the eigenvalues of an n n matrix A are obtained by solving its characteristic equation n c nn c nn...

### The Methodology of Application Development for Hybrid Architectures

Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department

### Systolic Computing. Fundamentals

Systolic Computing Fundamentals Motivations for Systolic Processing PARALLEL ALGORITHMS WHICH MODEL OF COMPUTATION IS THE BETTER TO USE? HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM? HOW

### Analysis of an Artificial Hormone System (Extended abstract)

c 2013. This is the author s version of the work. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purpose or for creating

### Hybrid Evolution of Heterogeneous Neural Networks

Hybrid Evolution of Heterogeneous Neural Networks 01001110 01100101 01110101 01110010 01101111 01101110 01101111 01110110 01100001 00100000 01110011 01101011 01110101 01110000 01101001 01101110 01100001

### An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems

An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems Ardhendu Mandal and Subhas Chandra Pal Department of Computer Science and Application, University

### Feedback guided load balancing in a distributed memory environment

Feedback guided load balancing in a distributed memory environment Constantinos Christofi August 18, 2011 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2011 Abstract

### Workshare Process of Thread Programming and MPI Model on Multicore Architecture

Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma

### MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### Asynchronous Computations

Asynchronous Computations Asynchronous Computations Computations in which individual processes operate without needing to synchronize with other processes. Synchronizing processes is an expensive operation

### Solution of Linear Systems

Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

### Using Cloud Computing for Solving Constraint Programming Problems

Using Cloud Computing for Solving Constraint Programming Problems Mohamed Rezgui, Jean-Charles Régin, and Arnaud Malapert Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, 06900 Sophia Antipolis, France

### The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation

The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation Mark Baker, Garry Smith and Ahmad Hasaan SSE, University of Reading Paravirtualization A full assessment of paravirtualization