A PGAS-based implementation for the unstructured CFD solver TAU
Christian Simmendinger, T-Systems Solution for Research, Pfaffenwaldring, Stuttgart, Germany
Jens Jägersküpper, German Aerospace Center (DLR), Institute of Aerodynamics and Flow Technology, Braunschweig, Germany
Carsten Lojewski, Fraunhofer ITWM, Fraunhofer-Platz, Kaiserslautern, Germany
Rui Machado, Fraunhofer ITWM, Fraunhofer-Platz, Kaiserslautern, Germany

ABSTRACT
Whereas most applications in the realm of the partitioned global address space make use of PGAS languages, we here demonstrate an implementation on top of a PGAS API. In order to improve the scalability of the unstructured CFD solver TAU we have implemented an asynchronous communication strategy on top of the PGAS API of GPI. We have replaced the bulk-synchronous two-sided MPI exchange with an asynchronous, RDMA-driven, one-sided communication pattern. We have also developed an asynchronous shared memory strategy for the TAU solver. We demonstrate that the corresponding implementation not only scales one order of magnitude higher than the original MPI implementation, but that it also outperforms the hybrid OpenMP/MPI programming model.

Keywords
Computational Fluid Dynamics, Communication libraries, Software for communication optimization, PGAS, GPI

1. INTRODUCTION
Over the last decade PGAS languages have emerged as an alternative programming model to MPI and as promising candidates for better programmability and better efficiency [12]. However, PGAS languages like Co-Array Fortran (CAF) [2] or Unified Parallel C (UPC) [12] require the re-implementation of large parts of the implemented algorithm. Additionally, in order to deliver superior scalability to MPI, languages like CAF or UPC require a specific data layout, suitable for data-parallel access. PGAS languages also require an excellent compiler framework which is able to, e.g., automatically schedule and trigger the overlap of communication and computation. In the problem domain of CFD on unstructured grids, where data access is predominantly indirect and graph partitioning is a science in itself [9], a compiler-driven overlap of communication and computation poses quite a challenge. While waiting for a solution from the compiler folks we have opted for a more direct approach in the meantime.

In this work we present a parallel implementation of the CFD solver TAU [11] on top of a PGAS API. Similar to MPI, this PGAS API comes in the form of a communication library. Our approach hence does not require a complete rewrite of applications. However, in order to achieve better scalability than MPI, the implementation requires a rethinking and re-formulation of the communication strategy. We have developed an asynchronous shared memory implementation for the TAU solver which scales very well and readily overlaps communication with computation. In addition we have replaced the bulk-synchronous two-sided MPI exchange with an asynchronous, one-sided communication pattern which makes use of the PGAS space of the GPI API [8]. We demonstrate that the corresponding implementation not only scales one order of magnitude higher than the original MPI implementation, but that it also outperforms the hybrid OpenMP/MPI programming model. For a production-sized mesh of 40 million points we estimate that with this implementation the TAU solver will scale to about 28,800 x86 cores, or correspondingly to 37 double-precision Tflop/sec of sustained application performance.
2. THE TAU SOLVER
In this work we closely examine the strong scalability of the unstructured CFD solver TAU, which is developed by the German Aerospace Center (DLR) [6] and represents one of the main building blocks for flight physics within the European aerospace eco-system.
The TAU code is a finite-volume RANS solver which works on unstructured grids. The governing equations are solved on a dual grid, which, together with an edge-based data structure, allows the code to run on any type of cells. The code is composed of independent modules: grid partitioner, preprocessing module, flow solver and grid adaptation module. The grid partitioner and the preprocessing are decoupled from the solver in order to allow grid partitioning and the calculation of metrics on a different platform from the one used by the solver.

The flow solver is a three-dimensional finite volume scheme for solving the unsteady Reynolds-averaged Navier-Stokes equations. In this work we have used the multi-step Runge-Kutta scheme of TAU. The inviscid fluxes are calculated employing a central method with scalar dissipation. The viscous fluxes are discretized using central differences. We have used a 1-equation turbulence model (Spalart/Allmaras). In order to accelerate the convergence to steady state, local time stepping and a multigrid technique based on agglomeration of the dual-grid volumes are employed.

2.1 Subdomains in TAU: colors
In order to achieve good node-local performance on cache architectures, TAU performs a partitioning of every MPI domain (which in turn is obtained by means of a domain decomposition) into small subdomains (colors) [1]. These subdomains typically contain of the order of 100 mesh points. With a memory footprint of 1 KByte per mesh point, a color typically fits in the L2 cache. The points within these subdomains, as well as the connecting faces, are renumbered in a preprocessing step in order to maximize spatial data locality. Temporal data locality is achieved by consecutively iterating over the colors until the entire MPI domain has been processed (a minimal sketch of such a color-wise face loop is given at the end of this section). Due to this coloring, TAU typically runs at around 12-25% of the double-precision peak performance of a standard dual-socket x86 node, which is quite remarkable for the problem domain of CFD.

However, the performance of TAU depends on the used solver type, the structure of the mesh and the specific use case. In Table 1 we consider three different cases on a single node: M6, F6 and Eurolift. The smallest use case, M6, with 100,000 points has a total memory footprint of 100 MB. With this memory footprint the solver does not fit into the 12 MB L3 cache of the used 2.93 GHz dual-socket 6-core Intel Westmere node. Nevertheless, M6 shows the best performance of the three considered use cases. The F6 use case with 2 million points represents a mesh size per node which is common for production runs. This use case delivers good performance as well. Lastly, for the Eurolift case with 13.6 million points the performance degrades substantially, even though 13.15% of peak performance is still a good value for an unstructured CFD solver. However, we note that, mainly due to the correspondingly long run times, such large use cases are rarely used on a single node.

Table 1: Node-local performance of TAU (the per-node Gflop/sec values were not preserved in this transcription)
  Use case  | Mesh points  | Gflop/sec/node (DP) | % of peak (DP)
  M6        | 100,000      | (missing)           | (missing)
  F6        | 2 million    | (missing)           | (missing)
  Eurolift  | 13.6 million | (missing)           | 13.15
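The color-wise face loop referenced above can be sketched as follows. This is a minimal illustration of the technique, not TAU's actual data structures or kernels; the struct layout and the flux/residual arrays are assumptions made for the sketch.

```c
/* Illustrative sketch of a color-wise face loop over the dual mesh.
 * Points and faces have been renumbered per color in a preprocessing
 * step, so the working set of one color (~100 points, ~1 KByte per
 * point) stays resident in the L2 cache. */
typedef struct {
    int face_begin;             /* first face of this color (contiguous range) */
    int face_end;               /* one past the last face of this color        */
} Color;

void face_loop(const Color *colors, int ncolors,
               const int (*face_points)[2],   /* [face] -> two attached points */
               const double *flux, double *residual)
{
    for (int c = 0; c < ncolors; ++c) {            /* temporal locality          */
        for (int f = colors[c].face_begin; f < colors[c].face_end; ++f) {
            int p0 = face_points[f][0];            /* renumbered point indices:  */
            int p1 = face_points[f][1];            /* spatial locality per color */
            residual[p0] += flux[f];               /* scatter the face flux to   */
            residual[p1] -= flux[f];               /* the two attached points    */
        }
    }
}
```

Because the faces of one color touch only a small, contiguously renumbered set of points, the inner loop works almost entirely out of cache, and iterating color by color provides the temporal locality described above.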
3. THE GPI API
Figure 1: GPI modules. The Passive RDMA, Common Queues, Collectives, Atomics and Internal modules sit on top of the RDMA fabric (one or more RDMA NICs, with thread safety and thread fairness), complemented by a secondary internal network (TCP) for recovery tasks.

GPI (Global address space Programming Interface) is a PGAS API for C/C++ and Fortran applications. It focuses on asynchronous, one-sided communication as provided by modern, RDMA-enabled interconnects such as InfiniBand. The GPI API has two main objectives. The first is maximum communication performance, achieved by exploiting the network interconnect directly and by minimizing the communication overhead through a full overlap of communication and computation (zero-copy). The second objective is to provide a simple API for the development of parallel applications based on more asynchronous algorithms, together with an easy-to-learn programming model. GPI is strictly a PGAS API with some similarities to (Open)Shmem. In contrast to the PGAS languages CAF or UPC, the GPI API hence neither requires extensive modifications of the source code nor an efficient underlying compiler framework. What is required, however, is a thorough rethinking of the communication patterns of the application and a reformulation of bulk-synchronous data exchange towards an asynchronous model. The Global Arrays framework is not really applicable to unstructured meshes, so we have not considered it here.

3.1 Functionality of GPI
Figure 1 depicts the organization of the GPI modules. The depicted modules interact directly with the interconnect (one or more RDMA NICs) and expose the whole GPI functionality. The GPI functionality is thus organized in five modules: passive communication (Passive RDMA), global communication through queues (Common Queues), collective operations (Collectives), atomic counters (Atomics), and utilities and internal functionality (Internal).

Although GPI focuses on one-sided communication, the passive communication has a send/receive semantic, that is, there is always a sender node with a matching receiver node. The difference - and hence the term passive - is that the receiver node does not specify the sender node for the receive operation and expects a message from any node. Moreover, the receiver waits for an incoming message in a passive form, not consuming CPU cycles. On the sender side - the more active side of the communication, as it specifies the receiver - the call is nevertheless non-blocking. The sender node sends the message and can continue its work, while the message gets processed by the network card.
Global communication provides asynchronous, one-sided communication primitives on the partitioned global address space. Two operations exist to read and write from global memory, independent of whether the target is a local or a remote location. One important point is that each communication request is queued and these operations are non-blocking, allowing the program to continue its execution and hence make better use of CPU cycles while the interconnect independently processes the queues. There are several queues for requests, which can be used, for example, for different kinds of communication requests. If the application needs to make sure the data was transferred (read or write), it calls a wait operation on one queue, which blocks until the transfer is finished, asserting that the data is usable.

In addition to the communication routines, GPI provides global atomic counters, i.e. integral types that can be manipulated through atomic functions. These atomic functions are guaranteed to execute from start to end without fear of preemption causing corruption. There are two basic operations on atomic counters: fetch-and-add and fetch-compare-and-swap. The counters can be used as global shared variables to synchronize nodes or events. Finally, the internal module includes utility functionality such as getting the rank of a node or the total number of nodes started by the application. Besides these API primitives, it implements internal functions such as the start-up of an application.

3.2 Programming and execution model
The GPI programming model implements an SPMD-like model where the same program is started on the nodes of a cluster system and where the application distinguishes nodes through the concept of ranks. The application is started by the GPI daemon. The GPI daemon is a program that runs on all nodes of a cluster, waiting for requests to start an application. The GPI daemon interacts with the batch system to know which nodes are assigned to the parallel job and starts the application on those nodes. The GPI daemon also performs checks on the status of the cluster nodes, such as InfiniBand connectivity.

To start a GPI application, the user must define the size of the GPI global memory partition. If there are enough resources available and the user has the required permissions, the same size will be reserved (pinned) for the GPI application on all nodes. The global memory is afterwards available to the application throughout its lifetime, and all threads on all nodes have direct access to this space via communication primitives. The communication primitives therefore act directly on the GPI global memory, without any further memory allocation or memory pinning. The communication primitives accept a tuple (offset, node) to steer the communication and control the data locality exactly. The application should place its data in the GPI global memory space, making it globally available to all threads. Having the data readily available, applications should focus on algorithms that are more asynchronous, possibly leaving behind the implicit synchronization imposed by a two-sided communication model.
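To make this concrete, the following sketch illustrates the call sequence described in Sections 3.1 and 3.2: a one-sided write addressed by an (offset, node) tuple into the pinned global memory, a wait on the corresponding request queue, and a global atomic counter used as a completion flag. The function names and signatures are placeholders invented for this sketch and are not the actual GPI API.

```c
/* Hedged sketch of the communication pattern described above.  The
 * pgas_* prototypes are illustrative placeholders mirroring the described
 * functionality (one-sided write to an (offset, node) tuple, a wait per
 * request queue, global fetch-and-add); they are NOT the real GPI API. */
#include <stddef.h>
#include <stdint.h>

extern void     pgas_write(int queue, size_t local_offset, int remote_node,
                           size_t remote_offset, size_t nbytes);
extern void     pgas_wait(int queue);      /* blocks until the queue has drained */
extern uint64_t pgas_fetch_add(int node, size_t counter_offset, uint64_t inc);

/* Send one frame to a neighbour: post the one-sided RDMA write (the call
 * only queues the request and returns immediately), do useful work in the
 * meantime, wait on the queue before the local buffer may be reused, and
 * bump a global counter on the receiver as a completion flag. */
void send_frame(int neighbour, size_t local_off, size_t remote_off, size_t nbytes)
{
    pgas_write(/*queue=*/0, local_off, neighbour, remote_off, nbytes);

    /* ... overlap: process work that does not depend on this transfer ... */

    pgas_wait(0);
    (void)pgas_fetch_add(neighbour, /*counter_offset=*/0, 1);
}
```

The wait guarantees completion of the transfer from the sender's point of view; as Section 4.3 explains, the receiver in TAU nevertheless needs its own mechanism (rotating frames, stage counters and checksums) to decide when an incoming halo frame is valid.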
4. APPROACH AND IMPLEMENTATION
The parallelization of the flow solver is based on a domain decomposition of the computational grid. In the work presented here we have used the CHACO toolkit [7]. The CHACO toolkit manages to deliver well-balanced partitions even for a very large number of nodes. We have used the multi-level Kernighan-Lin [5] methods of CHACO.

4.1 The asynchronous shared memory implementation
As briefly explained in the introduction, TAU makes use of a pool of colors (subdomains in an MPI domain). The implemented asynchronous shared memory parallelization uses the available pool of colors as a work pool. A thread pool then loops over all colors until the entire MPI domain is complete [4]. Unfortunately, this methodology is not as trivial as it seems: about 50% of the solver loops run over the faces of the dual mesh (the edges of the original mesh) and update the attached mesh points. Updates to points attached to cross-faces (faces whose attached mesh points belong to different colors) can then lead to race conditions if the attached points are updated by more than a single thread at the same time.

We have solved this issue by introducing the concepts of mutual exclusion and mutual completion of colors. In this concept a thread needs to mutually exclude all neighbouring colors from being simultaneously processed. To that end, all threads loop over the entire set of colors and check their status. If a color needs to be processed, the respective thread attempts to lock the color. If successful, the thread checks the neighbouring color locks. If no neighbouring color is locked, the thread computes the locked color and subsequently releases the lock. The lock guarantees that no neighbouring color can be processed during that time frame, since all threads evaluate the locks of the respective neighbouring colors prior to processing [4].
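A minimal sketch of this mutual-exclusion scheme is given below, assuming a pthreads-based thread pool. The data layout and helper names are illustrative rather than TAU's actual code, and the simple done flag stands in for the stage counters of the mutual-completion concept introduced next.

```c
/* Sketch: a thread may only compute a color if it holds that color's lock
 * and none of the neighbouring colors (sharing cross-faces) is currently
 * locked by another thread. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    const int *neighbours;    /* indices of colors sharing cross-faces */
    int        n_neighbours;
    bool       done;          /* processed in the current color loop?  */
} Color;

static bool neighbours_unlocked(Color *colors, const Color *c)
{
    for (int i = 0; i < c->n_neighbours; ++i) {
        Color *nb = &colors[c->neighbours[i]];
        if (pthread_mutex_trylock(&nb->lock) != 0)
            return false;                 /* a neighbour is being processed */
        pthread_mutex_unlock(&nb->lock);
    }
    return true;
}

/* Executed concurrently by every thread of the pool. */
void process_colors(Color *colors, int ncolors, void (*compute)(int color_id))
{
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (int c = 0; c < ncolors; ++c) {
            if (colors[c].done)
                continue;
            work_left = true;
            if (pthread_mutex_trylock(&colors[c].lock) != 0)
                continue;                 /* another thread holds this color */
            if (neighbours_unlocked(colors, &colors[c])) {
                compute(c);               /* race-free: neighbours excluded  */
                colors[c].done = true;
            }
            pthread_mutex_unlock(&colors[c].lock);
        }
    }
}
```

Since a thread always locks its own color before inspecting the neighbour locks, two adjacent colors can never be computed at the same time; in the worst case both threads back off and retry on the next sweep over the color pool.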
While the concept of mutual exclusion guarantees a race-condition-free implementation, the concept of mutual completion provides the basis for an asynchronous operation. We have resolved the Amdahl trap of the OpenMP fork-join model by complementing the above concept of mutual exclusion with a concept of mutual completion. Whether or not a color can be processed now depends not only on the locks of the neighbouring colors, but also on its local state as well as on the state of the neighbouring colors. The threads carry a thread-local stage counter which is incremented whenever all colors have been processed in one of the color loops of the code. The colors in turn carry a stage counter which is increased every time a color is processed. All loops in TAU have been reimplemented as loops over colors. This holds true not only for the face loops (which run over the faces of the dual mesh), but also for the point loops. A color in a point loop which follows a face loop can hence only be processed if the color itself still needs to be processed and, additionally, all colors neighbouring it have already been updated in the previous face loop. In this implementation there is neither a global synchronization point nor an aggregation of load imbalances at a synchronization point. Instead we always have local dependencies on the states and locks of neighbouring colors. We are confident that this concept will scale to a very high core number. Whether it actually scales to, e.g., the 50 cores of the Intel MIC architecture remains to be seen. This data-flow model is somewhat similar to the data-flow model of StarSs [10]; however, in our case the local data dependencies (neighbouring colors) are bound to the data structures and set up during an initialization phase. In contrast to StarSs, a dedicated runtime environment is not required. Also, StarSs does not feature mutual exclusion and hence would not be applicable here. The threads in this implementation are started up once per iteration - there is a single OpenMP directive in the code.

4.2 Overlap of communication and computation
The asynchronous shared memory implementation of TAU has been extended with an overlap of communication and computation. The first thread which reaches the communication phase locks all colors which are attached to the MPI halo (see Fig. 7), performs the halo exchange and subsequently unlocks the colors attached to the MPI halo. All remaining threads compute (non-locked) non-halo colors until the halo colors are unlocked again by the first thread. Since the lock/release mechanisms for the colors were already in place, the overlap of communication and computation was a logical extension of the principle of mutual exclusion of colors. In the benchmarks section we compare the performance of the hybrid OpenMP/MPI implementation against the hybrid OpenMP/GPI implementation. We emphasize that the only difference between these two implementations is the communication pattern of the next-neighbour halo exchange. In the OpenMP/GPI implementation the communication pattern is based on asynchronous one-sided RDMA writes, whereas the OpenMP/MPI version is implemented as state-of-the-art MPI_Irecv, MPI_Isend over all neighbours with a subsequent single MPI_Waitall.
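A hedged sketch of this overlap scheme follows: the first thread through the gate becomes the communication thread for the current exchange, locks the halo colors, performs the exchange and unlocks them again, while every other thread keeps processing interior colors. The helper functions are assumed placeholders, and the halo exchange itself can be either the two-sided MPI variant or the one-sided GPI variant.

```c
/* Simplified sketch of the communication/computation overlap of Sec. 4.2. */
#include <pthread.h>
#include <stdbool.h>

extern void lock_color(int color);
extern void unlock_color(int color);
extern void process_interior_colors(void);   /* colors not attached to the halo */
extern void halo_exchange(void);             /* MPI_Isend/Irecv or GPI variant  */

static pthread_mutex_t comm_gate = PTHREAD_MUTEX_INITIALIZER;
static bool exchange_done = false;           /* reset before every halo exchange */

void communication_phase(const int *halo_colors, int n_halo)
{
    if (pthread_mutex_trylock(&comm_gate) == 0) {
        if (!exchange_done) {
            /* this thread reached the phase first: it performs the exchange */
            for (int i = 0; i < n_halo; ++i) lock_color(halo_colors[i]);
            halo_exchange();                 /* overlapped with the work below */
            for (int i = 0; i < n_halo; ++i) unlock_color(halo_colors[i]);
            exchange_done = true;
        }
        pthread_mutex_unlock(&comm_gate);
    } else {
        /* all other threads keep their cores busy with interior colors      */
        process_interior_colors();
    }
}
```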
4.3 GPI implementation
In order to maintain the asynchronous methodology also on a global (inter-node) level, the GPI implementation targets the removal of the implicit synchronization caused by the MPI handshake of the two-sided MPI communication. In the GPI implementation of TAU all sends are hence one-sided asynchronous communication calls and the sender returns immediately.

Since the sender does not know anything about the receiver state, we have implemented a stage counter for the communication: in TAU all communication is performed in SPMD fashion, and all processes on all nodes run through the same halo exchange routines in exactly the same sequence. By providing rotating receiver and sender frames and incrementing stage counters for each send and the corresponding receive, we can guarantee a correct execution of the solver. A negative side effect of this asynchronous methodology is that we cannot write directly via RDMA into a remote receiver halo, since the sender does not know the state of the remote receiver. While assembling the message from the unstructured mesh, we compute a checksum and write this checksum into the header of our message (together with the stage counter, sender id, etc.). The receiver side evaluates the stage counter and sender id. If these match, the receiving thread assembles the receive halo of the unstructured mesh and, while doing this, computes the checksum. If the computed checksum and the header checksum coincide and all outstanding receives are complete, the receiver returns.

We note that assembling a complete receive halo at the earliest possible stage is of crucial importance for the scalability of the solver. The reason for this is that, for very small local mesh sizes, most colors will have at least one point in the halo. Threads which compute non-halo colors then quickly run out of work and are stalled until the communication thread (or threads) unlocks the halo colors. For this reason we have separated the colors into send and receive halo colors in the preprocessing step. In the GPI implementation the last of the sender threads immediately unlocks the send halo as soon as the sender frames have been assembled and sent. The sender threads then join the thread pool, which in turn can then process colors from the send halo in addition to non-halo colors. Sends are usually multithreaded, where the number of threads can be dynamic (e.g. depending on the current level of the multigrid). The receiving threads are assigned simultaneously and can in principle also be multithreaded. For the 6 cores/socket benchmarks, however, we have assigned three sender threads from the thread pool and a single receiver thread.
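The message framing used on top of the one-sided writes can be sketched as follows. The header fields and the simple additive checksum are assumptions made for this sketch; the paper does not specify TAU's actual header layout or checksum algorithm.

```c
/* Sketch of the header/checksum validation described in Section 4.3: the
 * sender writes header plus payload into a rotating frame on the remote
 * side; the receiver accepts the frame only once stage counter, sender id
 * and checksum are consistent. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    uint32_t stage;       /* stage counter of this halo exchange          */
    uint32_t sender;      /* rank of the sending process                  */
    uint32_t nbytes;      /* payload size in bytes                        */
    uint32_t checksum;    /* computed while assembling the send frame     */
} HaloHeader;

static uint32_t checksum32(const uint8_t *buf, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i)        /* simple additive checksum, a    */
        s += buf[i];                      /* placeholder for whatever TAU   */
    return s;                             /* actually uses                  */
}

/* Receiver side: poll the frame until header and payload are consistent;
 * only then is the receive halo of the unstructured mesh assembled. */
bool try_receive(const volatile HaloHeader *hdr, const uint8_t *payload,
                 uint32_t expected_stage, uint32_t expected_sender)
{
    if (hdr->stage != expected_stage || hdr->sender != expected_sender)
        return false;                     /* frame not (fully) written yet  */
    if (checksum32(payload, hdr->nbytes) != hdr->checksum)
        return false;                     /* payload still in flight        */
    /* ... assemble the receive halo from the payload here ...             */
    return true;
}
```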
4.4 Modifications in GPI
Even though more than 95% of the communication in TAU is next-neighbour communication, the TAU solver requires a collective reduction operation for monitoring and for the global residual (called once per iteration). The collectives module of GPI implements a global barrier and the Allreduce operation, as these are among the most frequently used collective operations. Although GPI aims at more asynchronous implementations and is diverging from bulk-synchronous processing, some global synchronization points might still be required, hence the need for a scalable barrier implementation. The same justifies the implementation of the Allreduce collective operation, which supports typical reduce operations such as minimum, maximum and sum of standard types. The Allreduce collective operation has been developed in the context of this work.

5. BENCHMARKS
The Allreduce micro-benchmarks were produced with Mellanox ConnectX IB DDR adapters and a central, fully non-blocking fat-tree SUN Magnum switch. All other benchmarks were performed with the Mellanox MT26428 InfiniBand adapter (ConnectX IB QDR, PCIe 2.0). The actual TAU application benchmarks were produced on the C²A²S²E system in Braunschweig [3], which has recently been upgraded to dual-socket 2.93 GHz 6-core Intel Westmere blades. The nodes feature the aforementioned Mellanox MT26428 ConnectX adapter, and the central InfiniBand switch is a QDR switch from Voltaire with a blocking factor of two.

Figure 2: Bandwidth (MB/sec) over message size (bytes) for GPI, the InfiniBand performance tools (IB tools) and MPI_Isend.
Figure 3: Allreduce operation on a single double: time (us) over the number of cores for GPI and MVAPICH 1.1.

5.1 GPI micro-benchmarks
In this section we present some results of micro-benchmarks obtained with GPI, compared with the InfiniBand performance tools and with MPI. Figure 2 presents the bandwidth obtained with GPI, the InfiniBand performance tools (IB tools) and the MPI_Isend operation. The GPI performance follows the maximum performance obtained with the low-level InfiniBand performance tools and is very close to the maximum bandwidth already at 4 KB message size. The MPI_Isend operation also reaches the maximum bandwidth, but only at larger message sizes. Although not noticeable in Figure 2, for small message sizes (up to 512 bytes) GPI always reaches a better bandwidth, by over a factor of 2, when compared with the MPI_Isend operation. These results show that GPI directly exploits the maximum performance (wire speed) made available by the interconnect (InfiniBand).

Figure 3 depicts the time needed (in microseconds) by the Allreduce operation and its scalability as we increase the number of nodes. The GPI operation is compared with the MVAPICH 1.1 MPI implementation. The GPI Allreduce operation scales very well up to the maximum number of cores, whereas the MPI version starts to lose scalability beyond 64 cores. The results demonstrate that the GPI operation yields a better scaling behaviour with an increasing number of cores.

Figure 4: F6 use case.

5.2 TAU benchmarks
All presented results have been produced in the same run with an identical node set. In order to eliminate system jitter, which can arise due to the blocking factor of two, we have taken the minimum time for a single iteration out of 1000 iterations. For a 4W multigrid cycle and our settings, this single iteration requires 140 next-neighbour halo exchanges with local neighbours and a single Allreduce, which in turn implies that the TAU solver features close to perfect weak scalability.

Figures 5 and 6 sum up the main results of this work. The three different lines represent the original MPI implementation, the OpenMP/MPI implementation (run on a per-socket basis) and the OpenMP/GPI version, also run on a per-socket basis.
We have also evaluated both OpenMP/MPI and OpenMP/GPI with the node as the basis for the shared memory implementation. This version, however, suffers severely from cc-NUMA issues and we have refrained from using it. Since we expect future hardware developments to increase the number of cores per die rather than the number of sockets per node, we currently consider this a minor issue.

Figure 5: F6, single grid, 2 million points.
Figure 6: F6, 4W multigrid, 2 million points.

The depicted speedup is relative to the MPI results for 10 nodes (120 cores). Up to this point OpenMP/MPI as well as OpenMP/GPI scale almost linearly. At 240 cores (i.e. 20 nodes), the TAU code has a local mesh size of around 100,000 points. Since both the node-local mesh size and the used solver type coincide with the M6 use case, we estimate the performance at this point to be about 20 nodes times the M6 single-node rate from Table 1, i.e. roughly 319 Gflop/sec.

In order to test the strong scalability of our implementation, we have chosen a very small benchmark mesh with just 2 million mesh points. With close to linear weak scaling, the strong-scaling result implies that for a typical production mesh with 40 million mesh points, TAU should scale to around 28,800 x86 cores (40/2 x 1440 cores), or correspondingly to 37 double-precision Tflop/sec of sustained application performance.

From this point on the observed curves are the aggregation of four effects: the increase of the workload due to the replication of the send and receive halos, the node-local load imbalance, the global load imbalance, and, last but not least, cache effects. We will briefly examine these four effects in a bit more detail.

Increased workload: Whenever we split a domain, we introduce additional faces and points (see Fig. 7). In our case, due to the employed multigrid, this effect can become quite dramatic: for 1440 cores we have 240 domains (1 domain per 6-core Westmere socket). This corresponds to 2 million points / 240 = 8333 points per domain, which still seems reasonable. However, due to the agglomeration of the multigrid there are only about 60 points in a coarse-level domain - and an additional 120 points in the receive halo of this domain. The amount of work on the coarse multigrid levels thus increases substantially with an increasing socket count.

Figure 7: Splitting a domain: the introduced gray points and the correspondingly attached faces imply additional work for the solver. The gray points correspond to the receive halo, whereas the points connected to them correspond to the send halo.
Node-local load imbalance: While the asynchronous shared memory implementation is able to handle load imbalance extremely well (compared to, e.g., a loop parallelization with OpenMP), there are still some limits: in the above case of 60 inner points per domain and 120 points in the receive halo, all inner points will probably belong to the send halo. There is thus no possible overlap of communication and computation. All threads are stalled until the send frames have been assembled, and are stalled again until the receive halo has been built.

Cache effects: The TAU solver already runs at good scalar efficiency at 240 cores with a node-local mesh size of 100,000 points. However, this mesh still does not fit into the cache entirely. There is a pronounced superlinear speedup at 480 cores in the curve, where large parts of the solver start to operate entirely in the L3 cache.

Global load imbalance: While the CHACO toolkit does an excellent job in providing a load-balanced mesh, small load imbalances show up on the coarser levels of the multigrid. In the asynchronous GPI implementation we can hide most of that imbalance: if a specific process is slower in a computational part A between two halo exchanges, it might be faster in a following part B. An asynchronous implementation is able to hide this load imbalance, provided the total runtime of A+B is equal to the corresponding runtimes A+B on the other processes. We emphasize that hiding this global load imbalance is only possible with an implementation which can both overlap the halo exchange between parts A and B with computation and which also provides an asynchronous halo exchange. The hidden load imbalance shows up in the relative smoothness of the speedup curve in the OpenMP/GPI results.

6. CONCLUSION AND FUTURE WORK
We have presented a highly efficient and highly scalable implementation of a CFD solver on top of a PGAS API. The combination of an asynchronous shared memory implementation and global asynchronous communication leads to excellent scalability - even for a 4W multigrid with 2 million mesh points / 1440 cores = 1388 mesh points per core. We were able to use key elements of the highly efficient scalar implementation (the colors) and reshape this concept towards an equally efficient but also highly scalable shared memory implementation. We have extended the node-local asynchronous methodology towards a global asynchronous operation and were thus able to hide most of the global load imbalance arising from the domain decomposition. The GPI implementation features a dynamic assignment of the number of send and receive threads. We believe that this will be of strong relevance for future multi- or many-core architectures like the Intel MIC, since it dramatically reduces the polling overhead for outstanding receives (on the fine multigrid levels) and allows a multithreaded, simultaneous one-sided send to all neighbours of the domain decomposition (on the coarse multigrid levels).

6.1 Future work
In a partitioned global address space every thread can access the entire global memory of the application at any given point in time. We believe that this programming model not only provides opportunities for better programmability, but that it also has a great potential to deliver higher scalability than is available in the scope of message passing.
For the future we envision a new standard for a PGAS API which should enable application programmers to reshape and reimplement their bulk-synchronous communication patterns towards an asynchronous model with a very high degree of overlap of communication and computation.

7. ACKNOWLEDGMENTS
This work has been funded partly by the German Federal Ministry of Education and Research within the national research project HI-CFD.

8. REFERENCES
[1] T. Alrutz, C. Simmendinger, and T. Gerhold. Efficiency enhancement of an unstructured CFD-code on distributed computing systems. In Parallel Computational Fluid Dynamics 2009.
[2] C. Coarfa, Y. Dotsenko, J. Eckhardt, and J. Mellor-Crummey. Co-array Fortran performance and potential: An NPB experimental study. In Languages and Compilers for Parallel Computing.
[3] DLR Institute of Aerodynamics and Flow Technology. Center of Computer Applications in Aerospace Science and Engineering (C²A²S²E).
[4] J. Jägersküpper and C. Simmendinger. A novel shared-memory thread-pool implementation for hybrid parallel CFD solvers. In Euro-Par 2011, to appear.
[5] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49, 1970.
[6] N. Kroll, T. Gerhold, S. Melber, R. Heinrich, T. Schwarz, and B. Schöning. Parallel large scale computations for aerodynamic aircraft design with the German CFD system MEGAFLOW. In Proceedings of Parallel CFD 2001.
[7] R. Leland and B. Hendrickson. A multilevel algorithm for partitioning graphs. In Proc. Supercomputing '95.
[8] C. Lojewski and R. Machado. The Fraunhofer virtual machine: a communication library and runtime system based on the RDMA model. Computer Science - Research and Development, 23(3-4).
[9] J. Paz. Evaluation of parallel domain decomposition algorithms. In 1st National Computer Science Encounter, Workshop of Distributed and Parallel Systems. Citeseer.
[10] J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta. Hierarchical task based programming with StarSs. International Journal of High Performance Computing Applications, 23.
[11] D. Schwamborn, T. Gerhold, and R. Heinrich. The DLR TAU-code: Recent applications in research and industry. In European Conference on Computational Fluid Dynamics, ECCOMAS CFD, 2006.
[12] K. Yelick. Beyond UPC. In Proceedings of the Third Conference on Partitioned Global Address Spaces, 2009.