Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications
Rupak Biswas
MRJ Technology Solutions, NASA Ames Research Center, Moffett Field, CA 94035, USA
rbis@nas.nasa.gov

Sajal K. Das, Daniel Harvey
Dept. of Computer Sciences, University of North Texas, Denton, TX 76203, USA
{das,harvey}

Leonid Oliker
RIACS, NASA Ames Research Center, Moffett Field, CA 94035, USA

Abstract

The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves superb load balance at the cost of minimal extra overhead.

1. Introduction

The success of parallel computing in solving real-life, computation-intensive problems relies on their efficient mapping and execution on commercially available multiprocessor architectures. When the algorithms and data structures corresponding to these problems are dynamic in nature (i.e., their computational workloads grow or shrink at runtime) or are intrinsically unstructured, mapping them onto distributed-memory parallel machines with dynamic load balancing offers considerable challenges.
Dynamic load balancing aims to balance processor workloads at runtime while minimizing inter-processor communication. With the proliferation of parallel computing, dynamic load balancing has become extremely important in several applications like scientific computing, task scheduling, sparse matrix computations, parallel discrete event simulation, and data mining.

The ability to dynamically adapt an unstructured mesh is a powerful tool for efficiently solving computational problems with evolving physical features. Standard fixed-mesh numerical methods can be made more cost-effective by locally refining and coarsening the mesh to capture these phenomena of interest. Unfortunately, an efficient parallelization of adaptive methods is rather difficult, primarily due to the load imbalance created by the dynamically-changing nonuniform grids. Nonetheless, it is believed that unstructured adaptive-grid techniques will constitute a significant fraction of future high-performance supercomputing.

As an example, if a full-scale problem in computational fluid dynamics were to be solved efficiently in parallel, dynamic mesh adaptation would cause load imbalance among processors. This, in turn, would require large amounts of data movement at runtime. It is therefore imperative to have an efficient dynamic load balancing mechanism as part of the solution procedure. However, since the computational mesh will be frequently adapted for unsteady flows, the runtime load also has to be balanced at each step. In other words, the dynamic load balancing procedure itself must not pose a major overhead. This motivates our work.

We have developed a novel method, called PLUM [7], that dynamically balances processor workloads with a global view when performing adaptive numerical calculations in a parallel message-passing environment.
Examining the performance of PLUM for an actual workload, which simulates an acoustic wind-tunnel experiment of a helicopter rotor blade, on three different parallel machines demonstrates that it can be successfully ported without any code modifications.

We propose another new approach to dynamic load balancing for unstructured grid applications based on defining a robust communication pattern (logical or physical) among processors, called symmetric broadcast networks (SBN) [3]. It is adaptive and decentralized in nature, and
can be ported to any topological architecture through efficient embedding techniques. Portability results for an adaptive unsteady grid workload, generated by propagating a simulated shock wave through a tube, show that our approach reduces the redistribution cost at the expense of a minimal extra communication overhead. In many mesh adaptation applications in which the data redistribution cost dominates the processing and communication cost, this is an acceptable trade-off.

2. Architecture-Independent Load Balancer

PLUM is an automatic and portable load balancing environment, specifically created to handle adaptive unstructured grid applications. It differs from most other load balancers in that it dynamically balances processor workloads with a global view [1, 7]. In this paper, we examine its architecture-independent feature by comparing results for a test case running on an SP2, Origin2000, and T3E.

PLUM consists of a partitioner and a remapper that load balance and redistribute the computational mesh when necessary. After an initial partitioning and mapping of the unstructured mesh, a solver executes several iterations of the application. A mesh adaptation procedure is invoked when the mesh needs to be refined or coarsened. PLUM then gains control to determine if the workload among the processors has become unbalanced due to the mesh adaptation, and to take appropriate action. If load balancing is required, the adapted mesh is repartitioned and reassigned among the processors so that the cost of data movement is minimized. If the estimated remapping cost is lower than the expected computational gain to be achieved, the grid is remapped among the processors before solver execution is resumed. Extensive details about PLUM are given in [7]. For completeness, we enumerate some of its salient features.

Reusing the initial dual graph: PLUM repeatedly utilizes the dual of the initial mesh for the purposes of load balancing.
This keeps the complexity of the partitioning and reassignment phases constant during the course of an adaptive computation. New computational grids obtained by adaptation are translated by changing the weights of the vertices and edges of the dual graph.

Parallel mesh repartitioning: PLUM can use any general-purpose partitioner that balances the computational loads and minimizes the runtime interprocessor communication. Several excellent parallel partitioners are now available [4, 5, 10]; however, the results presented in this paper use PMeTiS [6].

Processor remapping and data movement: To map new partitions to processors while minimizing the cost of redistribution, PLUM first constructs a similarity matrix. This matrix indicates how the vertex weights of the new subdomains are distributed over the processors. Three general metrics, TotalV, MaxV, and MaxSR, are used to model the remapping cost on most multiprocessor systems [8]. Both optimal and heuristic algorithms for minimizing these metrics are available within PLUM.

Cost model metrics: PLUM performs data remapping in a bulk synchronous fashion to amortize message start-up costs and obtain good cache performance. The procedure is similar to the superstep model of BSP [9]. The expected redistribution cost for a given architecture can be expressed as a linear function of MaxSR. The machine-dependent parameters are determined empirically.

Helicopter Rotor Test Case

The computational mesh used to evaluate the PLUM load balancer is the one used to simulate an acoustics wind-tunnel experiment of a UH-1H helicopter rotor blade [7]. A cut-out view of the initial tetrahedral mesh is shown in Fig. 1. Three refinement strategies, called Real_1, Real_2, and Real_3, are studied, each subdividing varying fractions of the domain based on an error indicator calculated from the flow solution. This increased the number of mesh elements from 60,968 to 82,489, 201,780, and 321,841, respectively.

Figure 1.
Cut-out view of the initial mesh for the helicopter rotor blade experiment.

Experimental Results

The three left plots in Fig. 2 illustrate parallel speedup for the three edge-marking strategies on an SP2, Origin2000, and T3E. Two sets of results are presented for each machine: one when data remapping is performed after mesh refinement, and the other when remapping is done before refinement. The speedup numbers are almost identical on all three machines. The Real_3 case shows the best speedup values because it is the most computation intensive. Remapping data before refinement has the largest relative effect for
Real_1, because it has the smallest refinement region and it returns the biggest benefit by predictively load balancing the refined mesh. The best results are for Real_3 with remapping before refinement, showing an efficiency greater than 87% on 32 processors.

Figure 2. Refinement speedup (left) and remapping time (right) within PLUM on an SP2, Origin2000, and T3E, when data is redistributed after or before mesh refinement.

The three right plots in Fig. 2 show the corresponding remapping times. In almost every case, a significant reduction in remapping time is observed when the adapted mesh is load balanced by performing data movement prior to refinement. This is because the mesh grows in size only after the data has been redistributed. In general, the remapping times also decrease as the number of processors is increased. This is because even though the total volume of data movement increases with the number of processors, there are actually more processors to share the work.

Perhaps the most remarkable feature of these results is the peculiar behavior of the T3E when P ≥ 64. When using up to 32 processors, the remapping performance of the T3E is very similar to that of the other two machines. However, for P = 64 and 128, the remapping overhead begins to increase and violates our cost model. The runtime difference when data is remapped before and after refinement is dramatically diminished; in fact, all the remapping times begin to converge to a single value! This indicates that the remapping time is also affected by the interprocessor communication pattern.
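The cost-model metrics and the remap-or-not decision described in this section can be illustrated with a small sketch. All function names, and the simplified gain model in `should_remap`, are our own invention for illustration; the metric definitions follow our reading of the text, and PLUM's actual implementation differs.

```python
# Sketch of PLUM-style cost-model metrics and the remap decision.
# The similarity-matrix reading and the gain model are our assumptions.

def remap_metrics(sim):
    """sim[p][q] = weight of data now on processor p that falls in the new
    subdomain assigned to processor q (diagonal = data that stays put).
    Returns (TotalV, MaxV, MaxSR)."""
    P = len(sim)
    sent = [sum(row) - row[p] for p, row in enumerate(sim)]
    recv = [sum(sim[p][q] for p in range(P)) - sim[q][q] for q in range(P)]
    total_v = sum(sent)                              # total weight moved
    max_v = max(max(sent), max(recv))                # worst one-way traffic
    max_sr = max(s + r for s, r in zip(sent, recv))  # worst send + receive
    return total_v, max_v, max_sr

def redistribution_cost(max_sr, alpha, beta):
    """Predicted remapping time as the linear model alpha + beta * MaxSR,
    with alpha and beta fitted empirically per machine."""
    return alpha + beta * max_sr

def should_remap(cur_imbalance, new_imbalance, solver_time, max_sr, alpha, beta):
    """Remap only if the solver time saved by rebalancing exceeds the
    predicted cost of moving the data (a deliberately simplified gain model)."""
    expected_gain = solver_time * (cur_imbalance - new_imbalance)
    return expected_gain > redistribution_cost(max_sr, alpha, beta)
```

Because the cost model is linear in MaxSR, fitting just two machine-dependent parameters per platform is enough to predict whether a remap will pay for itself; the T3E anomaly above corresponds to the case where this prediction breaks down.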
One solution would be to take advantage of the T3E's ability to efficiently perform one-sided communication.

3. Topology-Independent Load Balancer

In this section, we describe a dynamic load balancer based on a symmetric broadcast network (SBN), which takes into account the global view of system loads among the processors. The SBN is a robust, topology-independent communication pattern (logical or physical) among P processors in a multicomputer system [3]. An SBN of dimension d ≥ 0, denoted as SBN(d), is a (d + 1)-stage interconnection network with P = 2^d processors in each stage. It is constructed recursively as follows. A single node forms the basis network SBN(0). For d > 0, an SBN(d) is obtained from a pair of SBN(d - 1)s by adding a communication stage in the front with the following additional interprocessor connections: (i) node i in stage 0 is made adjacent to node j = (i + P/2) mod P of stage 1, and (ii) node j in stage 1 is made adjacent to the node in stage 2 which was the stage 0 successor of node i in SBN(d - 1).

The proposed SBN-based load balancer takes into account a global view of the system, which makes it effective for adaptive grid applications. Prior work [2] has demonstrated the viability of this SBN approach. It processes two types of messages: (i) load balance messages when a load imbalance is detected, and (ii) job distribution messages to reallocate jobs. We give a brief description of the various parameters and policies involved in our implementation.

Weighted queue length: This is to take into account all the system variables, like computation, communication, and redistribution costs, that affect the processing of a local queue. Note that no redistribution cost is incurred if the data set is available on the local processor. Similarly, the communication cost is zero if the data sets of all adjacent vertices are stored locally.
Prioritized vertex selection: When selecting vertices to be processed, the SBN load balancer takes advantage of the underlying structure of the adaptive grid and defers local execution of boundary vertices as long as possible because they may be migrated for more efficient execution. Thus, vertices with no communication and redistribution costs are executed first.

Differential edge cut: This is the total change in the communication and redistribution costs if a vertex were moved from one processor to another. The vertex with the smallest differential edge cut is chosen for migration. This policy strives to maintain or improve the cut size during the execution of the load balancing algorithm.

Data redistribution policy: Data redistribution is performed in a lazy manner, i.e., the non-local data set for a vertex is not moved to a processor until it is about to be executed. Furthermore, the data sets of all adjacent vertices are also migrated at that time. This policy greatly reduces the redistribution and communication costs by avoiding multiple migrations of data sets.

Unsteady Simulation Test Case

To evaluate the SBN framework, a computational grid is used to simulate an unsteady environment where the adapted region is strongly time-dependent. This experiment is performed by propagating a simulated shock wave through the initial grid shown at the top of Fig. 3. The test case is generated by refining all elements within a cylindrical volume moving left to right across the domain with constant velocity, while coarsening previously-refined elements in its wake. Performance is measured at nine successive adaptation levels during which the mesh increases from 50,000 to 1,833,730 elements.

Figure 3. Initial and adapted meshes (after levels 1 and 5) for the simulated unsteady experiment.

Experimental Results

Table 1 presents performance results on an SP2, averaged over the nine levels of adaptation. In addition to achieving excellent load balance, the redistribution cost (expressed as MaxSR, the maximum number of vertices moved in and out of any processor) is significantly reduced.

Table 1. Grid adaptation results on an SP2 using the SBN-based load balancer.

     P    Edge cut (before)   Edge cut (after)    MaxSR
     -         2.88%               5.51%          80,037
     -         7.27%              10.76%          76,665
     -        12.71%              16.35%          53,745
    16        19.40%              23.87%          46,825
    32        24.42%              30.41%          28,031

(Average load imbalance factor: 1.02. The processor counts for the first three rows are illegible in the source.)

The results demonstrate that the cost of vertex migration is significantly greater than the cost of actually balancing the system load. An extrapolation of the results using exponential curve-fitting indicates that normal speedup will not scale for P > 128.

Table 2. Communication overhead of the SBN load balancer. (The per-processor message volumes are largely illegible in the source; the recoverable figures, expressed as percentages of the available bandwidth, are shown.)

    Balancing bandwidth:   0.00%   0.00%    0.01%    0.02%    0.12%
    Migration bandwidth:   3.67%   7.44%   23.79%   28.53%   35.83%
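The recursive SBN(d) construction defined at the start of this section can be transcribed almost directly into code. The sketch below is our own (a logical network only, with invented helper names, not the paper's implementation): it builds the stage-to-stage successor sets from rules (i) and (ii), and checks the defining property that a broadcast launched at any stage-0 node reaches all P = 2^d processors.

```python
def sbn(d):
    """Successor sets of SBN(d): succ[s][node] is the set of nodes in
    stage s+1 adjacent to `node` in stage s, with P = 2**d nodes per stage."""
    if d == 0:
        return []                      # SBN(0): a single node, no edges
    prev, half, P = sbn(d - 1), 2 ** (d - 1), 2 ** d
    succ = [{} for _ in range(d)]
    # Two copies of SBN(d-1) occupy stages 1..d (node ids offset by `half`).
    for s in range(d - 1):
        for node, targets in prev[s].items():
            for off in (0, half):
                succ[s + 1].setdefault(node + off, set()).update(
                    t + off for t in targets)
    for i in range(P):
        j = (i + P // 2) % P
        # Rule (i): stage-0 node i is adjacent to stage-1 node j.
        succ[0].setdefault(i, set()).add(j)
        # Rule (ii): j is adjacent in stage 2 to the stage-0 successors
        # of node i within i's own SBN(d-1) copy.
        if d >= 2:
            off = 0 if i < half else half
            for t in prev[0][i % half]:
                succ[1].setdefault(j, set()).add(t + off)
    return succ

def broadcast_cover(succ, root):
    """Set of nodes reached by a broadcast started at `root` in stage 0."""
    frontier, covered = {root}, {root}
    for stage in succ:
        frontier = set().union(*(stage.get(x, set()) for x in frontier))
        covered |= frontier
    return covered
```

For example, in SBN(3) a broadcast from node 0 proceeds 0 → 4 → {2, 6} → {1, 3, 5, 7}, covering all eight processors in d + 1 = 4 stages; this global reach is what gives the balancer its system-wide view of loads.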
However, the edge cut percentages are somewhat higher, indicating that the SBN strategy reduces the redistribution cost at the expense of a slightly higher communication cost. Table 2 gives the number of bytes that were transferred between processors during the load balancing and the job distribution phases. The number of bytes transferred is also expressed as a percentage of the available bandwidth. (A wide-node SP2 has a message bandwidth of 36 megabytes/second and a message latency of 40 microseconds.)

Table 3 shows the percentage of time spent in the SBN load balancer compared to the execution time required to process the mesh adaptation application. The three columns correspond to three types of load balancing activities: (i) the time needed to handle balance related messages, (ii) the time needed to migrate messages from one processor to another, and (iii) the time needed to select the next vertex to be processed. The results show that processing related to the selection of vertices is the most expensive phase of the SBN load balancer. However, the total time required to load balance is still relatively small compared to the time spent processing the mesh.

We were able to directly port the SBN methodology from an SP2 to an Origin2000 without any code modifications. This demonstrates the architecture independence of our load balancer. The load imbalance factors were almost identical on the two machines. The edge cut values were consistently larger on the Origin2000, but the MaxSR values were larger only when using fewer than 16 processors. Some of the differences in performance results on these
two machines are due to additional refinements that were implemented prior to running the experiments on the Origin2000. First, the algorithm for distributing jobs was modified to guarantee that the processor initiating load balancing always had sufficient workload. This reduces the total number of balancing related messages but increases the number of vertices migrated, especially for small numbers of processors. Second, data sets corresponding to groups of vertices were moved at a time. This bulk migration strategy reduces the volume of migration messages by more than 30% in our experiments.

Table 3. Overhead of the SBN load balancer (percentages of execution time spent handling balance messages, migrating data, and selecting vertices; the individual entries are illegible in the source).

4. PLUM vs. SBN

Let us highlight some of the differences between the two dynamic load balancers described in this paper. Processing is temporarily halted under PLUM while the load is being balanced. The SBN approach, on the other hand, allows processing to continue asynchronously. This feature allows for the possibility of utilizing latency-tolerant techniques to hide the communication overhead of redistribution.

Under PLUM, suspension of processing and subsequent repartitioning does not guarantee an improvement in the quality of load balance. In contrast, the SBN approach always results in improved load balance.

PLUM redistributes all necessary data to the appropriate processors immediately before processing. SBN, however, migrates data to a processor only when it is ready to process it, thus reducing the redistribution and communication overhead.

PLUM, unlike SBN, performs remapping predictively. This results in a significant reduction of data movement during redistribution and improves the overall efficiency of the refinement procedure.

Load balancing under PLUM occurs before the solver phase of the computation, whereas SBN balances the load during the solver execution.
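The differential edge cut policy that drives SBN's migration choice (Section 3) can be illustrated with a small sketch. This is a deliberately simplified model of our own, counting only cut edges and using invented names; the balancer described in the text also folds redistribution costs into the differential.

```python
# Simplified illustration of a differential-edge-cut migration choice.

def differential_edge_cut(adj, owner, v, dst):
    """Change in the number of cut edges incident on vertex v if v were
    moved from its current owner to processor dst (negative = improvement)."""
    src = owner[v]
    before = sum(1 for u in adj[v] if owner[u] != src)
    after = sum(1 for u in adj[v] if owner[u] != dst)
    return after - before

def pick_vertex(adj, owner, candidates, dst):
    """Among candidate vertices, choose the one whose migration to dst has
    the smallest differential edge cut, per the policy described above."""
    return min(candidates,
               key=lambda v: differential_edge_cut(adj, owner, v, dst))
```

On a four-vertex chain 0-1-2-3 split as {0, 1} on processor 0 and {2, 3} on processor 1, moving vertex 1 to processor 1 leaves the cut unchanged (differential edge cut 0), while moving vertex 0 would add a cut edge (+1), so vertex 1 is preferred for migration.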
We therefore cannot directly compare PLUM and SBN-based load balancing, since their relative performance is solver dependent.

5. Summary

We have demonstrated the portability of our novel approaches to solve the load balancing problem for adaptive unstructured grids on three state-of-the-art commercial parallel machines. The experiments were conducted using actual workloads of both steady and unsteady grids obtained from real-life applications. We are currently examining the portability of our software on other platforms (such as shared-memory environments or networks of workstations) as well as employing various parallel programming models.

Acknowledgements

This work is supported by NASA under Contract Numbers NAS (with MRJ Technology Solutions) and NAS (with Universities Space Research Association), and by Texas Advanced Research Program Grant Number TARP-97-003594-013.

References

[1] R. Biswas and L. Oliker. Experiments with repartitioning and load balancing adaptive meshes. Technical Report NAS-97-021, NASA Ames Research Center.
[2] S. Das, D. Harvey, and R. Biswas. Parallel processing of adaptive meshes with load balancing. In 27th Intl. Conf. on Parallel Processing.
[3] S. Das and S. Prasad. Implementing task ready queues in a multiprocessing environment. In Intl. Conf. on Parallel Computing.
[4] J. Flaherty, R. Loy, C. Ozturan, M. Shephard, B. Szymanski, J. Teresco, and L. Ziantz. Parallel structures and dynamic load balancing for adaptive finite element computation. Applied Numerical Mathematics, 26.
[5] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report, University of Minnesota.
[6] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Technical Report, University of Minnesota.
[7] L. Oliker and R. Biswas. PLUM: Parallel load balancing for adaptive unstructured meshes. Journal of Parallel and Distributed Computing, 52.
[8] L. Oliker, R. Biswas, and H. Gabow. Performance analysis and portability of the PLUM load balancing system. In Euro-Par '98 Parallel Processing, Springer-Verlag LNCS 1470, 1998.
[9] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103-111, 1990.
[10] C. Walshaw, M. Cross, and M. Everett. Parallel dynamic graph partitioning for adaptive unstructured meshes. Journal of Parallel and Distributed Computing, 47, 1997.
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
LOAD BALANCING TECHNIQUES
LOAD BALANCING TECHNIQUES Two imporatnt characteristics of distributed systems are resource multiplicity and system transparency. In a distributed system we have a number of resources interconnected by
An Effective Dynamic Load Balancing Algorithm for Grid System
An Effective Dynamic Load Balancing Algorithm for Grid System Prakash Kumar #1, Pradeep Kumar #2, Vikas Kumar *3 1,2 Department of CSE, NIET, MTU University, Noida, India 3 Linux Administrator, Eurus Internetworks
Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC
Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC Outline Dynamic Load Balancing framework in Charm++ Measurement Based Load Balancing Examples: Hybrid Load Balancers Topology-aware
benchmarking Amazon EC2 for high-performance scientific computing
Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received
System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
DYNAMIC GRAPH ANALYSIS FOR LOAD BALANCING APPLICATIONS
DYNAMIC GRAPH ANALYSIS FOR LOAD BALANCING APPLICATIONS DYNAMIC GRAPH ANALYSIS FOR LOAD BALANCING APPLICATIONS by Belal Ahmad Ibraheem Nwiran Dr. Ali Shatnawi Thesis submitted in partial fulfillment of
A Refinement-tree Based Partitioning Method for Dynamic Load Balancing with Adaptively Refined Grids
A Refinement-tree Based Partitioning Method for Dynamic Load Balancing with Adaptively Refined Grids William F. Mitchell Mathematical and Computational Sciences Division National nstitute of Standards
ME6130 An introduction to CFD 1-1
ME6130 An introduction to CFD 1-1 What is CFD? Computational fluid dynamics (CFD) is the science of predicting fluid flow, heat and mass transfer, chemical reactions, and related phenomena by solving numerically
P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE
1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France
Chapter 18: Database System Architectures. Centralized Systems
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
SIMULATION OF LOAD BALANCING ALGORITHMS: A Comparative Study
SIMULATION OF LOAD BALANCING ALGORITHMS: A Comparative Study Milan E. Soklic Abstract This article introduces a new load balancing algorithm, called diffusive load balancing, and compares its performance
Load balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
A Comparison of Dynamic Load Balancing Algorithms
A Comparison of Dynamic Load Balancing Algorithms Toufik Taibi 1, Abdelouahab Abid 2 and Engku Fariez Engku Azahan 2 1 College of Information Technology, United Arab Emirates University, P.O. Box 17555,
A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms
A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms Thierry Garcia and David Semé LaRIA Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf 80000 Amiens, France, E-mail:
A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters
A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters Abhijit A. Rajguru, S.S. Apte Abstract - A distributed system can be viewed as a collection
A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems
A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems RUPAM MUKHOPADHYAY, DIBYAJYOTI GHOSH AND NANDINI MUKHERJEE Department of Computer
MOSIX: High performance Linux farm
MOSIX: High performance Linux farm Paolo Mastroserio [[email protected]] Francesco Maria Taurino [[email protected]] Gennaro Tortone [[email protected]] Napoli Index overview on Linux farm farm
Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures
Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
Load Balancing Strategies for Parallel SAMR Algorithms
Proposal for a Summer Undergraduate Research Fellowship 2005 Computer science / Applied and Computational Mathematics Load Balancing Strategies for Parallel SAMR Algorithms Randolf Rotta Institut für Informatik,
Partitioning and Dynamic Load Balancing for Petascale Applications
Partitioning and Dynamic Load Balancing for Petascale Applications Karen Devine, Sandia National Laboratories Erik Boman, Sandia National Laboratories Umit Çatalyürek, Ohio State University Lee Ann Riesen,
Symmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
Dynamic Mapping and Load Balancing on Scalable Interconnection Networks Alan Heirich, California Institute of Technology Center for Advanced Computing Research The problems of mapping and load balancing
Aerodynamic Department Institute of Aviation. Adam Dziubiński CFD group FLUENT
Adam Dziubiński CFD group IoA FLUENT Content Fluent CFD software 1. Short description of main features of Fluent 2. Examples of usage in CESAR Analysis of flow around an airfoil with a flap: VZLU + ILL4xx
Performance Evaluation of Improved Web Search Algorithms
Performance Evaluation of Improved Web Search Algorithms Esteban Feuerstein 1, Veronica Gil-Costa 2, Michel Mizrahi 1, Mauricio Marin 3 1 Universidad de Buenos Aires, Buenos Aires, Argentina 2 Universidad
Feedback guided load balancing in a distributed memory environment
Feedback guided load balancing in a distributed memory environment Constantinos Christofi August 18, 2011 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2011 Abstract
Multiphase Flow - Appendices
Discovery Laboratory Multiphase Flow - Appendices 1. Creating a Mesh 1.1. What is a geometry? The geometry used in a CFD simulation defines the problem domain and boundaries; it is the area (2D) or volume
Mesh Partitioning for Parallel Computational Fluid Dynamics Applications on a Grid
Mesh Partitioning for Parallel Computational Fluid Dynamics Applications on a Grid Youssef Mesri * Hugues Digonnet ** Hervé Guillard * * Smash project, Inria-Sophia Antipolis, BP 93, 06902 Sophia-Antipolis
OpenMosix Presented by Dr. Moshe Bar and MAASK [01]
OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng
Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption
THE DESIGN OF AN EFFICIENT LOAD BALANCING ALGORITHM EMPLOYING BLOCK DESIGN. Ilyong Chung and Yongeun Bae. 1. Introduction
J. Appl. Math. & Computing Vol. 14(2004), No. 1-2, pp. 343-351 THE DESIGN OF AN EFFICIENT LOAD BALANCING ALGORITHM EMPLOYING BLOCK DESIGN Ilyong Chung and Yongeun Bae Abstract. In order to maintain load
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor [email protected] ABSTRACT The grid
Explicit Spatial Scattering for Load Balancing in Conservatively Synchronized Parallel Discrete-Event Simulations
Explicit Spatial ing for Load Balancing in Conservatively Synchronized Parallel Discrete-Event Simulations Sunil Thulasidasan Shiva Prasad Kasiviswanathan Stephan Eidenbenz Phillip Romero Los Alamos National
3 Extending the Refinement Calculus
Building BSP Programs Using the Refinement Calculus D.B. Skillicorn? Department of Computing and Information Science Queen s University, Kingston, Canada [email protected] Abstract. We extend the
Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012
Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite Peter Strazdins (Research School of Computer Science),
