Static Load Balancing of Parallel PDE Solver for Distributed Computing Environment
|
|
|
- Roland Mills
- 10 years ago
- Views:
Transcription
1 Static Load Balancing of Parallel PDE Solver for Distributed Computing Environment Shuichi Ichikawa and Shinji Yamashita Department of Knowledge-based Information Engineering, Toyohashi University of Technology Tempaku, Toyohashi, Aichi , JAPAN Abstract This paper describes a static load balancing scheme for partial differential equation solvers in a distributed computing environment Though there has been much research on static load balancing for uniform processors, a distributed computing environment is a computationally more difficult target because it usually consists of a variety of processors Our method considers both computing and communication time to minimize the total execution time with automatic data partitioning and processor allocation This problem is formulated as a combinatorial optimization and solved by the branch-and-bound method for up to processors This paper also presents approximation algorithms that give good allocation and partitioning in practical time The quality of the approximation is quantitively evaluated in comparison to the optimal solution or theoretical lower bounds Our method is general and applicable to a wide variety of parallel processing applications 1 Introduction NSL [1] [2] is a numerical simulation language system that automatically generates parallel simulation programs for multicomputers from a high level description of PDE (Partial Differential Equations) NSL adopts an explicit FDM (Finite Difference Method) based on a boundary-fitted coordinate system and multi-block method Though there are many parallel processing systems for PDE (eg //ELLPACK [3], DISTRAN [4], DEQ- SOL [5]), only sub-optimal static load balancing has been implemented, because the combinatorial optimization problem involved is computationally difficult to solve [6] [7] As NSL adopts both a boundary-fitted coordinate system and multi-block method, the optimization problem can be simplified and solved by the branch-and-bound method in practical time with offthe-shelf computers [8] [9] This method takes both computation and communication time into consideration, and finds the most adequate number of processors for execution, thus avoiding the use of excessive processors that results in unnecessary delay of execution In previous research [8] [9], the target computer system was assumed to be a parallel computer consisting of uniform processing elements This study deals with a more general parallel processing environment, which consists of non-uniform processing elements This situation is very common in distributed processing environments However, the optimization problem becomes far more difficult to solve, because non-uniformity of processors means greater numbers of free variables 2 Model of Computation and Communication 21 Parallel Execution Model In NSL, a physical domain with a complex shape can be described by a set of mutually connected blocks (multi-block method) As each block can have curvilinear borders, NSL maps a physical block of various forms to a rectangular computational block (boundary-fitted coordinate system) Figure 1 shows a physical domain of 3 blocks, each of which is mapped to the corresponding computational domain The dotted line in the figure indicates the connected border of a block The solid line is the real border of the domain, where the corresponding boundary condition is imposed These features enable users to set precise boundary conditions while gaining additional performance, because a rectangular computational domain implies
2 H i B i Bs 0 Bs 1 Bs 2 Bs W i Physical Computational Figure 2: Partitioning of a Block Figure 1: Physical Domain and Computational Domain a regular array of data and computations that are favorable to vector/parallel operations NSL generates parallel explicit FDM program for multicomputers Each block is a 2-dimensional array of grid points, on which difference equations are calculated Each calculation of the difference equations can be executed in parallel due to the nature of explicit FDM, followed by communications to exchange data across the connected borders The parallel PDE solver in NSL invokes this set of operations iteratively The purpose of this study is to minimize the total execution time of this parallel PDE solver by distributing mutually connected rectangular blocks among nonuniform processors, considering both computation and communication time to avoid the use of excessive processors 22 Framework of Static Load Balancing Let the number of blocks be m, and the number of processors be n The relationship m n is assumed in this paper (as in previous studies [8] [9]), because the boundary-fitted coordinate system generally helps to keep m small and we are usually interested in bigger n when discussing static load balancing Load balancing under general conditions is left for future study Under this assumption, we can formulate the problem in two stages First, one must find the best allocation of n distinguishable processors to m distinguishable blocks to minimize the execution time Thus, the best combination of processors must be chosen to minimize the total execution time, leaving some of them unused if necessary The calculation time can be made shorter by using more processors, but the communication time would be longer if more processors are used In other words, the best balance must be found between calculation and communication This is particularly important in a distributed computing environment, because communication time tends to be dominant This problem can be formulated as a combinatorial optimization problem that is difficult to compute [10] Section 4 describes how this problem can be solved To determine the best processor allocation, we have to know the best partitioning of a block to a given set of processors Here, we decided that each processor should deal with only one subblock, which is a rectangular fragment of block Figure 2 shows the block B i split into four subblocks Bs 0,, Bs 3 This method is simple and straightforward, and is intuitively expected to suppress the emergence of communication to the possible extent Section 3 describes how a block can be partitioned to the allocated processors according to this rule A more elaborate partitioning scheme can improve the estimated execution time despite increasing communication, but such a method can cause congestion of the network, resulting in unexpected and unjustifiable delay of execution Therefore, our conservative decision would be regarded as appropriate in most circumstances One may consider a more aggressive method in the case of clusters tightly connected by a fast and dedicated network hardware, although this is strongly dependent on the particular implementation 23 Evaluation Function The evaluation function of this optimization problem is the execution time of each repetition of PDE solver This execution time T is given by T = max T i (i =0, 1,, n 1), (1) i T i = Ta i + Tc i (2) Here, T i is the execution time on the i-th processor (P i ) As each processor only deals with one subblock, let the subblock of P i be noted as Bs i Ta i is the calculation time of Bs i, and Tc i is the corresponding communication time Based on experimental results
3 3 Partitioning Bs i w i h i Figure 3: Subblock with the NSL system, Ta i is modeled as a linear function of the number of grid points in Bs i, which is noted as Sa i Also, Tc i is a linear function of the number of grid points that participate in communication, which is noted as Sc i Such linear models have been widely used in past studies, because linear models usually fit well to measurement results However, we have to note that past studies focused on parallel computers, which usually have powerful inter-connects between processors In distributed computing environments, networks tend to be relatively weak and communication time can be far from linear in some circumstances We adopt a simple linear model in this paper, but a more detailed analysis will be required in future work Figure 3 shows a subblock which has all its borders connected This is the worst case for communication, so we show the upper bound of communication quantity here for simplicity Then, Ta i and Tc i are given by Sa i = h i w i, (3) Ta i = Cta i Sa i + Dta i, (4) Sc i = 2(h i + w i +2), (5) Tc i = Ctc i Sc i + Cn i Dtc i (6) Cta i is the time to calculate a difference equation for one grid point on P i Clearly Cta i is dependent on the performance of the processor P i Dta i is also a machine-dependent constant, which represents the overhead of the calculation The communication parameters Ctc i and Dtc i are also machine-dependent constants Cn i is the number of communication links, which is 8 in Figure 3, and 3 for Bs 0 and 2 for Bs 2 in Figure 2 This section describes the methods to partition a block for a given set of non-uniform processors For an example, see Figure 2 This partitioning is a kind of combinatorial optimization with geometrical constraints All heights and widths of subblocks must be integers (integer constraints), and the original block must be reconstructed from its rectangular subblocks like a jigsaw puzzle (geometrical constraints) Under these constraints, we have to find the best partitioning to minimize T described in Equation (1) This optimization problem is so difficult that two heuristic algorithms are presented and quantitively evaluated against a theoretical lower bound 31 Heuristic Algorithm The most intuitive approach to partitioning would be divide-and-conquer First, processors are separated into two groups Then the block is cut straight into two subblocks in accordance with the total performance of each group Cutting along the shorter edge, the cross-section and consequent communication would be minimal This process is recursively applied until each subblock corresponds to one processor This simple technique, which is generally called recursive bisection (RB), has been widely used in scientific computations [11] [12] Plain RB is not always good but many of its derivatives has been developed and examined [12] [13] 311 Type 1 The first algorithm (type 1) is a plain recursive bisection method We can think of (2 n 2) ways to split n processors into two groups at each level of RB Intuitively, it seems good to split processor performance into halves, but this grouping itself is NP-hard [10] and still does not guarantee the best result because this is a combinatorial optimization In this algorithm, the numbers of processors are split into halves at each level of RB Then the block is split into two subblocks according to the ratio of accumulated processor performance in each group This grouping still involves ( n n/2 ) ways at each level, but we examine every possible grouping recursively and take the best result as the result of this approximation algorithm 312 Type 2 The type 1 algorithm takes too much time and gives poor performance when many processors are available
4 type1 type2 Figure 4: Partitioning Table 1: Common Parameters Parameter Value Dta i 100 Ctc i 02 Dtc i 01 1 Bs k Ti Bs i > k T > T j Bs j Bs k Bs i Ti = Tj Bs j Table 2: Variety of Processors Cta i #proc Figure 5: Local load balancing One reason for this is that it makes the level of recursion so high that load balancing is too localized to get a good result There are many ways to overcome this problem, including recursive p-section (p >2) Here, we examine a simple variant of the type 1 algorithm The second algorithm (type 2) is a modified recursive bisection Only at the first level, n processors are equally split into n groups to accelerate the partitioning The block is stripped into parallel n rectangles in accordance with the sum of processor performance The type 1 algorithm is applied from the second level All possible groupings are recursively examined as in the type 1 algorithm Figure 4 displays how a block is partitioned for 10 non-uniform processors by the type 1 and the type 2 algorithm The type of line shows the level of recursion in partitioning 313 Local load balancing Both type 1 and type 2 algorithms concern only the balance of calculation In other words, they try to make the calculation time equal by managing the number of grid points for each processor The result is that the faster processor tends to be a bottle-neck, because the faster processor takes more grid points, which usually means more communication time Local load balancing is effective in such cases After calculating the temporal partitioning by approximation algorithm, some of the grid points are transfered to level the loads between adjoining subblocks In Fig- ure 5, the load of Bs i can be transfered to Bs k or Bs j As T j is smaller than T k, some of the grid points of Bs i are transfered to Bs j This process is iteratively applied until no improvement is derived 32 Estimation of Lower Bound Though the optimal solution is the best measure for quantitative evaluation of heuristics, this partitioning problem seems too difficult to solve Therefore, we adopt a lower bound of T, which is derived from the following relaxed problem minimize subject to T = max i T i H j W j = i h iw i h i,w i R + Let H j and W j be the height and the width of the block B j, and h i and w i be the height and the width of a subblock Bs i which belongs to B j R + means positive real numbers; that is, the integer constraints are removed here Also, the equation H j W j = i h iw i, which only requires the sum of the area of subblocks to be equal to the area of the original block, means that the geometrical constraint is removed This relaxed problem is a kind of non-linear programming, but can be easily solved 33 Evaluation of Partitioning Algorithms Figure 6 shows the quality of approximation algorithms, which is normalized by the lower bound de-
5 T / T_lower type1 type1 + local type2 type2 + local (a) uniform processors T / T_lower type1 type1 + local type2 type2 + local 2e+5 2e+6 2e+7 size (b) non-uniform processors B 0 B 1 B 2 B 3 B B B 2 B (a) approx1 (b) approx Figure 6: Evaluation of Partitioning Algorithms B B B 1 1 B 2 2 rived from the relaxed problem Note that all these results depend on simulation parameters, particularly on the ratio of calculation time to communication time Here, we used the parameters shown in Table 1 Calculation latency and communication parameters are set equal for all processors here Type 1 and Type 2 are the previously mentioned approximation algorithms Additional local load balancing is applied in +local cases Figure 6(a) shows the results with various numbers of processors where Cta i = 001 for all processors The geometry of the block is set to W = 500 and H = 400 On the right side of this graph, communication becomes dominant over calculation Regardless of the number of processors, type 2 shows a better performance than type 1 With type 2, local load balancing shows no effect, which it is natural because all processors have the same Cta i in this graph Figure 6(b) displays the results with ten nonuniform processors, which are shown in Table 2 A block of a variety of grid points with a fixed aspect ratio (W : H = 5 : 4) is partitioned for these processors Calculation is dominant on the right side of the graph, and communication is superior on the left side As the approximation algorithms only consider calculation time in partitioning, the error is bigger on the left side Type 1 and type 2 show similar results, but local load balancing works well for both algorithms, regardless of the ratio between calculation and communication 4 Processor Allocation This section outlines the method to find the best allocation of n distinguishable processors to m distinguishable blocks so as to minimize the execution time For partitioning, the type 2 algorithm described in Section 3 is used with local load balancing B 2 B B 0 B 3 (c) approx3 3 4 Figure 7: Approximation Algorithms 41 Branch-and-Bound Basically, every combination of processors must be examined to solve this kind of combinatorial optimization problem, but this is a difficult computation problem, as mentioned in Section 22 Therefore, some technique is required to reduce the search space In this paper, a branch-and-bound method is adopted Let T be the execution time for the temporal feasible allocation Given a set of processors for a subblock Bs i, the lower bound of T i is easily derived by the relaxed problem mentioned in Section 32 If this lower bound is bigger than T, the current allocation has no possibility of providing a more feasible solution Therefore, the partitioning process for this allocation can be omitted and any groupings which include this allocation can be pruned to reduce search space 42 Approximation Algorithms A good approximation algorithm is important for practical use of the branch-and-bound method, because it gives the initial temporal feasible solution The better the algorithm, the earlier the optimization process will finish because more search space is pruned Figure 7 shows the three approximation algorithms examined in this paper The following is a rough description of these algorithms 1 The first algorithm (approx1) initially sorts blocks according to the number of grid points Processors are 1 The details are omitted due to lack of space
6 Table 3: Simulation Parameters Parameter Value Parameter Value Cta Dta i 100 Cta Ctc i 02 Cta Dtc i 01 Cta also sorted by the order of performance This algorithm then allocates processors in cyclic manner The second (approx2) is the same as approx1 in sorting It then allocates faster processors to bigger blocks in accordance with the share of total grid points and total performance The third (approx3) is a variant of approx2 In approx2, the share is calculated first and used throughout the allocation process On the other hand, approx3 updates the shares of blocks and the share of processors whenever a processor is allocated This makes the allocation more precise accuracy accuracy 140 approx1 135 approx1 + local 130 approx2 approx2 + local 125 approx3 120 approx3 + local best effort (a) m = approx1 135 approx1 + local 130 approx2 approx2 + local 125 approx3 120 approx3 + local best effort (b) m = 8 43 Iterative Improvement The approximated allocation can sometimes be improved by exchanging and transferring processors between blocks This improvement is done as follows First, find the block B i which includes the slowest subblock Second, find the block B j which includes the fastest subblock Then, try to improve T by exchanging or transferring processors between B i and B j This process is iteratively applied as long as T is improved 44 Evaluation The approximation algorithms are quantitively evaluated against the optimal grouping derived by the branch-and-bound method The results of numerical simulations are shown in Figure 8 Approx1, approx2, and approx3 are the approximation algorithms described in Section 42 Local means the iterative local improvement described in Section 43 Best effort in the figure means the best result taken from approx1+local, approx2+local, and approx3+local Simulation parameters are shown in Table 3 The number of processors are multiples of 4, and the processors of Cta 0,, Cta 3 are equally included The calculation latency parameters Dta i are all set to the same value, as are the communication parameters Ctc i and Dtc i The height and the width of blocks are generated randomly as multiples of 10 between 100 and 1000 The result is the average of 20 trials Figure 8: Evaluation of Grouping Algorithms The error is slightly bigger when the a priori assumption m n is not satisfied well (n 8 for m =4 and n 12 for m = 8) However, when the condition m n holds, the error is less than 5% for this parameter set by approx2+local and approx3+local The average execution time of the approximation algorithm and the optimization algorithm is shown in Figure 9 All measurements are done on an Intel Pentium-II 400 MHz processor with 256 MB memory, FreeBSD 31R, and GNU C compiler (ver 27) The approx in Figure 9 means the sum of the execution time of approx1+local, approx2+local, and approx3+local Approximation algorithms are reasonably fast, and the combinatorial optimization ( optimize ) is pragmatic for processors 5 Conclusion and Future Work Static load balancing in a distributed computing environment is very challenging, because it involves many free variables and a quite vast search space This study demonstrates that optimization problems can be solved for up to processors in practical time by off-the-shelf computers Our methods are based on a general combinatorial optimization technique, and hence applicable to a wide variety of applications if
7 time [sec] 1e+05 1e+04 1e+03 1e+02 1e+01 1e+00 1e-01 1e-02 1e-03 1e Figure 9: Execution Time approx(m=4) optimize(m=4) approx(m=8) optimize(m=8) calculation and communication time can be properly modeled More study is required for partitioning An exhaustive search to get the optimal partitioning is desirable, because the best partitioning would be a better measure than a lower bound to evaluate approximation algorithms Also, simpler and more powerful approximation algorithms should be crafted Currently, the grouping phase takes too much time because the branch-and-bound method is still not effective enough More precise lower bounds and better approximation is required to drastically reduce search space Acknowledgments This work was partially supported by a Grant-in- Aid from the Ministry of Education, Science, Sports and Culture References [1] T Kawai, S Ichikawa, and T Shimada, NSL: High-Level Language for Parallel Numerical Simulation, Proc IASTED Int l Conf Modeling and Simulation (MS 99), Acta Press, 1999, pp [2] T Kawai, S Ichikawa, and T Shimada, NSL: High Level Language for Parallel Numerical Simulation, Trans Information Processing Society of Japan, vol 38, no 5, May 1997, pp (in Japanese) [3] E N Houstis and J R Rice, Parallel ELL- PACK: A Development and Problem Solving Environment for High Performance Computing Machines, Programming Environments for High- Level Scientific Problem Solving, P W Gaffney and E N Houstis, Eds, Elsevier Science Publishers B V, 1992, pp [4] K Suzuki et al, DISTRAN System for Parallel Computers, Joint Symp on Parallel Processing (JSPP 91), Information Processing Society of Japan, 1991, pp (in Japanese) [5] T Ohkouchi, C Konno, and M Igai, High Level Numerical Simulation Language DEQSOL for Parallel Computers, Trans Information Processing Society of Japan, vol 35, no 6, June 1994, pp (in Japanese) [6] N Chrisochoides, E Houstis, and J Rice, Mapping Algorithms and Software Environment for Data Parallel PDE Iterative Solvers, Special Issue of the J Parallel and Distributed Computing on Data-Parallel Algorithms and Programming, vol 21, no 1, 1994, pp [7] N Chrisochoides, N Mansour, and G Fox, Comparison of Optimization Heuristics for the Data Distribution Problem, J Concurrency Practice and Experience, vol 9, no 5, 1997, pp [8] S Ichikawa, T Kawai, and T Shimada, Mathematical Programming Approach for Static Load Balancing of Parallel PDE Solver, Proc 16th IASTED Int l Conf Applied Informatics (AI 98), Acta Press, 1998, pp [9] S Ichikawa, T Kawai, and T Shimada, Static Load Balancing for Parallel Numerical Simulation by Combinatorial Optimization, Trans Information Processing Society of Japan, vol 39, no 6, June 1998, pp (in Japanese) [10] M R Garey and D S Johnson, Computers and Intractability, Freeman, 1979 [11] G C Fox, A Graphical Approach to Load Balancing and Sparse Matrix Vector Multiplication on the Hypercube, Numerical Algorithms for Modern Parallel Computer Architectures, M Schultz, Ed, Springer-Verlag, 1988, pp [12] G C Fox, R D Williams, and P C Messina, Parallel Computing Works!, Morgan Kaufmann, 1994, chapter 11 [13] H D Simon and S Teng, How Good is Recursive Bisection?, SIAM J Sci Comput, vol 18, no 5, Sept 1997, pp
Load Balancing. Load Balancing 1 / 24
Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden [email protected],
Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations
Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
HPC enabling of OpenFOAM R for CFD applications
HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
FPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
HYBRID GENETIC ALGORITHMS FOR SCHEDULING ADVERTISEMENTS ON A WEB PAGE
HYBRID GENETIC ALGORITHMS FOR SCHEDULING ADVERTISEMENTS ON A WEB PAGE Subodha Kumar University of Washington [email protected] Varghese S. Jacob University of Texas at Dallas [email protected]
Experiments on the local load balancing algorithms; part 1
Experiments on the local load balancing algorithms; part 1 Ştefan Măruşter Institute e-austria Timisoara West University of Timişoara, Romania [email protected] Abstract. In this paper the influence
Adaptive Linear Programming Decoding
Adaptive Linear Programming Decoding Mohammad H. Taghavi and Paul H. Siegel ECE Department, University of California, San Diego Email: (mtaghavi, psiegel)@ucsd.edu ISIT 2006, Seattle, USA, July 9 14, 2006
The Problem of Scheduling Technicians and Interventions in a Telecommunications Company
The Problem of Scheduling Technicians and Interventions in a Telecommunications Company Sérgio Garcia Panzo Dongala November 2008 Abstract In 2007 the challenge organized by the French Society of Operational
Fast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
Expanding the CASEsim Framework to Facilitate Load Balancing of Social Network Simulations
Expanding the CASEsim Framework to Facilitate Load Balancing of Social Network Simulations Amara Keller, Martin Kelly, Aaron Todd 4 June 2010 Abstract This research has two components, both involving the
A Performance Comparison of Five Algorithms for Graph Isomorphism
A Performance Comparison of Five Algorithms for Graph Isomorphism P. Foggia, C.Sansone, M. Vento Dipartimento di Informatica e Sistemistica Via Claudio, 21 - I 80125 - Napoli, Italy {foggiapa, carlosan,
Throughput constraint for Synchronous Data Flow Graphs
Throughput constraint for Synchronous Data Flow Graphs *Alessio Bonfietti Michele Lombardi Michela Milano Luca Benini!"#$%&'()*+,-)./&0&20304(5 60,7&-8990,.+:&;/&."!?@A>&"'&=,0B+C. !"#$%&'()* Resource
A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters
A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters Abhijit A. Rajguru, S.S. Apte Abstract - A distributed system can be viewed as a collection
AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION
AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION Jörg Frochte *, Christof Kaufmann, Patrick Bouillon Dept. of Electrical Engineering and Computer Science Bochum University of Applied Science 42579
Load Balancing Techniques
Load Balancing Techniques 1 Lecture Outline Following Topics will be discussed Static Load Balancing Dynamic Load Balancing Mapping for load balancing Minimizing Interaction 2 1 Load Balancing Techniques
A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster
Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906
IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS
Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE
Voronoi Treemaps in D3
Voronoi Treemaps in D3 Peter Henry University of Washington [email protected] Paul Vines University of Washington [email protected] ABSTRACT Voronoi treemaps are an alternative to traditional rectangular
A Mathematical Programming Solution to the Mars Express Memory Dumping Problem
A Mathematical Programming Solution to the Mars Express Memory Dumping Problem Giovanni Righini and Emanuele Tresoldi Dipartimento di Tecnologie dell Informazione Università degli Studi di Milano Via Bramante
A Note on Maximum Independent Sets in Rectangle Intersection Graphs
A Note on Maximum Independent Sets in Rectangle Intersection Graphs Timothy M. Chan School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada [email protected] September 12,
NEUROMATHEMATICS: DEVELOPMENT TENDENCIES. 1. Which tasks are adequate of neurocomputers?
Appl. Comput. Math. 2 (2003), no. 1, pp. 57-64 NEUROMATHEMATICS: DEVELOPMENT TENDENCIES GALUSHKIN A.I., KOROBKOVA. S.V., KAZANTSEV P.A. Abstract. This article is the summary of a set of Russian scientists
Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs
Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs Yong Zhang 1.2, Francis Y.L. Chin 2, and Hing-Fung Ting 2 1 College of Mathematics and Computer Science, Hebei University,
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3
A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, [email protected]
Learning in Abstract Memory Schemes for Dynamic Optimization
Fourth International Conference on Natural Computation Learning in Abstract Memory Schemes for Dynamic Optimization Hendrik Richter HTWK Leipzig, Fachbereich Elektrotechnik und Informationstechnik, Institut
Solution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)
Partitioning and Divide and Conquer Strategies
and Divide and Conquer Strategies Lecture 4 and Strategies Strategies Data partitioning aka domain decomposition Functional decomposition Lecture 4 and Strategies Quiz 4.1 For nuclear reactor simulation,
Resource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems
An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems Ardhendu Mandal and Subhas Chandra Pal Department of Computer Science and Application, University
Nan Kong, Andrew J. Schaefer. Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA
A Factor 1 2 Approximation Algorithm for Two-Stage Stochastic Matching Problems Nan Kong, Andrew J. Schaefer Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA Abstract We introduce
Virtual Machine Allocation in Cloud Computing for Minimizing Total Execution Time on Each Machine
Virtual Machine Allocation in Cloud Computing for Minimizing Total Execution Time on Each Machine Quyet Thang NGUYEN Nguyen QUANG-HUNG Nguyen HUYNH TUONG Van Hoai TRAN Nam THOAI Faculty of Computer Science
Identification of Energy Distribution for Crash Deformational Processes of Road Vehicles
Acta Polytechnica Hungarica Vol. 4, No., 007 Identification of Energy Distribution for Crash Deformational Processes of Road Vehicles István Harmati, Péter Várlaki Department of Chassis and Lightweight
How High a Degree is High Enough for High Order Finite Elements?
This space is reserved for the Procedia header, do not use it How High a Degree is High Enough for High Order Finite Elements? William F. National Institute of Standards and Technology, Gaithersburg, Maryland,
Optimal Scheduling for Dependent Details Processing Using MS Excel Solver
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 8, No 2 Sofia 2008 Optimal Scheduling for Dependent Details Processing Using MS Excel Solver Daniela Borissova Institute of
EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada [email protected]
MATHEMATICAL ENGINEERING TECHNICAL REPORTS. The Best-fit Heuristic for the Rectangular Strip Packing Problem: An Efficient Implementation
MATHEMATICAL ENGINEERING TECHNICAL REPORTS The Best-fit Heuristic for the Rectangular Strip Packing Problem: An Efficient Implementation Shinji IMAHORI, Mutsunori YAGIURA METR 2007 53 September 2007 DEPARTMENT
Offline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: [email protected] 2 IBM India Research Lab, New Delhi. email: [email protected]
High Performance Computing. Course Notes 2007-2008. HPC Fundamentals
High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
24. The Branch and Bound Method
24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha
How To Get A Computer Science Degree At Appalachian State
118 Master of Science in Computer Science Department of Computer Science College of Arts and Sciences James T. Wilkes, Chair and Professor Ph.D., Duke University [email protected] http://www.cs.appstate.edu/
ARTICLE IN PRESS. European Journal of Operational Research xxx (2004) xxx xxx. Discrete Optimization. Nan Kong, Andrew J.
A factor 1 European Journal of Operational Research xxx (00) xxx xxx Discrete Optimization approximation algorithm for two-stage stochastic matching problems Nan Kong, Andrew J. Schaefer * Department of
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:
Sparse Matrix Decomposition with Optimal Load Balancing
Sparse Matrix Decomposition with Optimal Load Balancing Ali Pınar and Cevdet Aykanat Computer Engineering Department, Bilkent University TR06533 Bilkent, Ankara, Turkey apinar/aykanat @cs.bilkent.edu.tr
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Massively Parallel Multilevel Finite
Chapter 111. Texas Essential Knowledge and Skills for Mathematics. Subchapter B. Middle School
Middle School 111.B. Chapter 111. Texas Essential Knowledge and Skills for Mathematics Subchapter B. Middle School Statutory Authority: The provisions of this Subchapter B issued under the Texas Education
Distributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
Statistical Forecasting of High-Way Traffic Jam at a Bottleneck
Metodološki zvezki, Vol. 9, No. 1, 2012, 81-93 Statistical Forecasting of High-Way Traffic Jam at a Bottleneck Igor Grabec and Franc Švegl 1 Abstract Maintenance works on high-ways usually require installation
New Hash Function Construction for Textual and Geometric Data Retrieval
Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan
Load Balancing Mechanisms in Data Center Networks
Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider
Load-Balancing Spatially Located Computations using Rectangular Partitions
Load-Balancing Spatially Located Computations using Rectangular Partitions Erik Saule a,, Erdeniz Ö. Başa,b, Ümit V. Çatalyüreka,c a Department of Biomedical Informatics. The Ohio State University b Department
Cloud Storage and Online Bin Packing
Cloud Storage and Online Bin Packing Doina Bein, Wolfgang Bein, and Swathi Venigella Abstract We study the problem of allocating memory of servers in a data center based on online requests for storage.
Low-resolution Character Recognition by Video-based Super-resolution
2009 10th International Conference on Document Analysis and Recognition Low-resolution Character Recognition by Video-based Super-resolution Ataru Ohkura 1, Daisuke Deguchi 1, Tomokazu Takahashi 2, Ichiro
The assignment of chunk size according to the target data characteristics in deduplication backup system
The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,
High-fidelity electromagnetic modeling of large multi-scale naval structures
High-fidelity electromagnetic modeling of large multi-scale naval structures F. Vipiana, M. A. Francavilla, S. Arianos, and G. Vecchi (LACE), and Politecnico di Torino 1 Outline ISMB and Antenna/EMC Lab
Multiobjective Multicast Routing Algorithm
Multiobjective Multicast Routing Algorithm Jorge Crichigno, Benjamín Barán P. O. Box 9 - National University of Asunción Asunción Paraguay. Tel/Fax: (+9-) 89 {jcrichigno, bbaran}@cnc.una.py http://www.una.py
Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations
C3P 913 June 1990 Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams Concurrent Supercomputing Facility California Institute of Technology Pasadena, California
Static Environment Recognition Using Omni-camera from a Moving Vehicle
Static Environment Recognition Using Omni-camera from a Moving Vehicle Teruko Yata, Chuck Thorpe Frank Dellaert The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 USA College of Computing
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Small Maximal Independent Sets and Faster Exact Graph Coloring
Small Maximal Independent Sets and Faster Exact Graph Coloring David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science The Exact Graph Coloring Problem: Given an undirected
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor [email protected] ABSTRACT The grid
An Optimization Approach for Cooperative Communication in Ad Hoc Networks
An Optimization Approach for Cooperative Communication in Ad Hoc Networks Carlos A.S. Oliveira and Panos M. Pardalos University of Florida Abstract. Mobile ad hoc networks (MANETs) are a useful organizational
LOAD BALANCING TECHNIQUES
LOAD BALANCING TECHNIQUES Two imporatnt characteristics of distributed systems are resource multiplicity and system transparency. In a distributed system we have a number of resources interconnected by
A Tool for Generating Partition Schedules of Multiprocessor Systems
A Tool for Generating Partition Schedules of Multiprocessor Systems Hans-Joachim Goltz and Norbert Pieth Fraunhofer FIRST, Berlin, Germany {hans-joachim.goltz,nobert.pieth}@first.fraunhofer.de Abstract.
Algorithmic Mechanism Design for Load Balancing in Distributed Systems
In Proc. of the 4th IEEE International Conference on Cluster Computing (CLUSTER 2002), September 24 26, 2002, Chicago, Illinois, USA, IEEE Computer Society Press, pp. 445 450. Algorithmic Mechanism Design
Fairness in Routing and Load Balancing
Fairness in Routing and Load Balancing Jon Kleinberg Yuval Rabani Éva Tardos Abstract We consider the issue of network routing subject to explicit fairness conditions. The optimization of fairness criteria
Provisioning algorithm for minimum throughput assurance service in VPNs using nonlinear programming
Provisioning algorithm for minimum throughput assurance service in VPNs using nonlinear programming Masayoshi Shimamura (masayo-s@isnaistjp) Guraduate School of Information Science, Nara Institute of Science
Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows
TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents
Fast Optimal Load Balancing Algorithms for 1D Partitioning
Fast Optimal Load Balancing Algorithms for D Partitioning Ali Pınar Department of Computer Science University of Illinois at Urbana-Champaign [email protected] Cevdet Aykanat Computer Engineering Department
Analysis of Computer Algorithms. Algorithm. Algorithm, Data Structure, Program
Analysis of Computer Algorithms Hiroaki Kobayashi Input Algorithm Output 12/13/02 Algorithm Theory 1 Algorithm, Data Structure, Program Algorithm Well-defined, a finite step-by-step computational procedure
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.
SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA [email protected], [email protected],
GameTime: A Toolkit for Timing Analysis of Software
GameTime: A Toolkit for Timing Analysis of Software Sanjit A. Seshia and Jonathan Kotker EECS Department, UC Berkeley {sseshia,jamhoot}@eecs.berkeley.edu Abstract. Timing analysis is a key step in the
School Timetabling in Theory and Practice
School Timetabling in Theory and Practice Irving van Heuven van Staereling VU University, Amsterdam Faculty of Sciences December 24, 2012 Preface At almost every secondary school and university, some
A Network Flow Approach in Cloud Computing
1 A Network Flow Approach in Cloud Computing Soheil Feizi, Amy Zhang, Muriel Médard RLE at MIT Abstract In this paper, by using network flow principles, we propose algorithms to address various challenges
System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints
Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints Olivier Beaumont,, Paul Renaud-Goud Inria & University of Bordeaux Bordeaux, France 9th Scheduling for Large Scale Systems
Common Core Unit Summary Grades 6 to 8
Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations
