Combining Scalability and Efficiency for SPMD Applications on Multicore Clusters*
Ronal Muresano, Dolores Rexachs and Emilio Luque
Computer Architecture and Operating System Department (CAOS), Universitat Autònoma de Barcelona, Barcelona, SPAIN

Abstract - A major challenge in parallel computing is improving the performance of MPI applications, many of which do not scale when the problem size is fixed and the number of cores is increased. This scalability problem worsens on the hierarchical communication architecture of multicore clusters. This work presents a novel method for SPMD (Single Program Multiple Data) applications, based on finding the maximum strong-scalability point while keeping efficiency above a defined threshold. The method integrates four phases: a characterization, a tile distribution model, a mapping strategy, and a scheduling policy, and targets SPMD applications designed with MPI libraries and high communication volumes. We tested the methodology with different SPMD scientific applications and observed that the maximum speedup and scalability were located close to the values calculated with our model.

Keywords: multicore, SPMD, performance, scalability

1. Introduction

Scientific applications are being developed with ever more complexity and accuracy, and this precision demands high computational resources to execute quickly and efficiently. The current trend in high performance computing (HPC) is clusters composed of multicore nodes, as evidenced in the Top500 list (a ranking of the parallel machines used for HPC). The integration of these nodes in HPC has enabled more parallelism within nodes. However, this parallelism must deal with issues such as the number of cores per chip, shared caches, bus interconnection, memory bandwidth, etc. [1].
These issues become more important when managing application scalability and efficiency. Moreover, the hierarchical communication architecture of multicore clusters creates a heterogeneous environment that affects performance metrics such as efficiency, speedup and application scalability, because the different speeds and bandwidths of each communication path (Fig. 1) can degrade application performance [2]. Despite these communication issues, and in order to benefit from the computational capacity of multicore clusters, we focus on improving application performance in these environments.

This work centers on calculating the maximum number of cores that maintains strong application scalability while keeping efficiency above a defined threshold. The objective of strong scalability is to keep the problem size constant while the number of processors increases [3]. To achieve this goal, we must consider the parallel programming paradigm for which the application was designed, e.g., master/worker, pipeline, SPMD, etc. Each of these paradigms has a different communication pattern, which can affect application performance. Here we consider parallel applications designed with the message passing interface (MPI) for communication and SPMD as the parallel paradigm. SPMD was selected because of its behavior: the same program executes in all processes, but on different sets of tiles. These tiles have to exchange their information in each iteration, and this exchange can become a serious issue in a multicore environment. Figure 1 shows an SPMD application executing on a multicore cluster. The tiles are computed in a similar time due to the homogeneity of the cores.

(* This research has been supported by the MEC-MICINN Spain under contract TIN. Contact author: R. Muresano, rmuresano@caos.uab.es. This paper is addressed to the PDPTA conference.)
However, the communications are performed over different links in order to maintain the communication pattern, and the paths can differ by up to one and a half orders of magnitude in latency for the same communication volume. These differences translate into inefficiencies that decrease performance and prevent linear strong scalability. To solve these inefficiencies, we have developed a method that manages the communication latencies using characteristics of each SPMD application (e.g., the computation and communication ratio of a tile) and allows us to determine a relationship between scalability and efficiency. To achieve this relationship, our methodology is organized in four phases: characterization, tile distribution model, mapping strategy, and scheduling policy, which allow us to
Fig. 1: Issues of SPMD applications on a multicore cluster.

distribute the tiles across the environment. The methodology groups the SPMD tiles into units called SuperTiles (ST), assigning each one to a core. The tiles of an ST are of two types: internal tiles, whose communications occur within the same core, and edge tiles, whose communications are performed with tiles allocated on other cores. This division allows us to apply an overlapping method that executes the internal tiles while the edge tiles are communicating, and lets our method find the ideal number of cores that achieves the maximum strong scalability with a defined efficiency.

This paper is structured as follows: related work is described in Section 2. Section 3 exposes the issues of SPMD applications on a multicore architecture. A description of the methodology is presented in Section 4. Section 5 describes efficiency and scalability for SPMD applications. Section 6 illustrates the performance evaluation. Finally, conclusions are given in Section 7.

2. Related Works

Different works have developed methodologies focused on improving performance metrics on multicore environments. Mercier et al. [4] designed a method to efficiently place MPI processes on multicore machines, establishing a placement policy that improves application efficiency. However, that work does not address scalability, which is very important when we wish to execute both fast and efficiently. Liebrock [5] defines a methodology for deriving a performance model for SPMD hybrid parallel applications, focused on improving three specific properties: adaptability, scalability and fidelity, using mapping, scheduling and synchronization-overhead strategies designed for hybrid message-passing and distributed-memory applications.
In contrast, our work evaluates pure MPI applications; similarly, we develop a methodology centered on mapping and scheduling strategies, but we also target efficient execution. Moreover, there are works centered on studying and improving efficiency [6] or enhancing speedup on multicore clusters [7] separately. By contrast, we developed a methodology centered on mapping and scheduling strategies that seeks to improve both speedup and efficiency on these clusters [8]. In that previous work, we developed the methodology phases using characterization, mapping and scheduling strategies. However, those phases only allowed us to find the number of tiles that obtains the maximum speedup of the SPMD application for a desired efficiency. The present work instead searches for a combination of strong scalability and efficiency, in which we can predict the number of cores that maintains the relationship between both metrics.

We focus on mapping and scheduling strategies to achieve our objective. Some works have developed mapping strategies for SPMD applications centered on improving application efficiency [7]. Another technique was designed by Brehm et al. [9], whose main objective was to map the application using its own characteristics. Similarly, our proposed mapping maintains efficiency using the characteristics of both the machine and the application, but we add an affinity process that minimizes the communication effects of the multicore environment. There are also scheduling strategies for SPMD applications [10][11] based on finding the minimum execution time, which is part of our objective.
Nevertheless, we analyzed and evaluated the model defined by Panshenskov et al. [12] and adopted some of its characteristics, such as dividing tiles into blocks, asynchronous communications, and computation/communication overlapping, with the aim of minimizing communication overhead and improving the efficiency of the SPMD application.

3. SPMD applications on multicore

In this study, the SPMD applications used must have the following characteristics: static, the parallel application defines the communication process and it is maintained throughout the execution; local, the application has no collective communications; 2D grid structure; and regular, the communications are repeated for several iterations. Some benchmarks have these characteristics, e.g., the CG and BT algorithms of the NAS parallel benchmarks [13], and there are also real applications such as heat transfer simulation, the Laplace equation, fluid-dynamics applications like the MP-Labs suite [14], finite-difference applications, etc. The communication pattern can vary according to the objective of the SPMD application; however, these patterns
Fig. 2: SPMD application on multicore cluster.
Fig. 4: Phases for efficient execution of SPMD applications.

are defined at the beginning of the application and are kept until the application ends. Figure 2 shows an example of an SPMD application on a multicore cluster, illustrating the idle time generated by slower communication links (e.g., cores 5-8 of node 1 communicating with cores 1-4 of node 2 through the inter-node link). These inter-node communications have the largest delay and can greatly influence efficiency and scalability. However, these idle times allow us to establish strategies to organize how SPMD tiles are distributed on the multicore cluster so as to manage the communication inefficiencies. The communication time can vary by one and a half orders of magnitude depending on the path over which the communication is performed. These variations are a limiting factor for improving application performance, because the latency of the slowest link determines when an iteration has been completed (Fig. 2). These inefficiencies have to be managed if we wish to execute the SPMD application fast, efficiently and scalably.

To manage these communication issues, we take the problem size of the SPMD application, which is composed of a number of tiles, and create the SuperTile (ST). Finding the optimal ST size is formulated as an analytical problem, in which the ratio between the computation and communication of a tile must be found, with the objective of establishing the relationship between strong scalability and efficiency. The ST is calculated with the aim of obtaining the maximum strong-scalability point while the efficiency is maintained over a defined threshold. Figure 3 shows an example of the overlapping process and the ST creation. An ST is a group of tiles of the global problem, whose size is defined by M x M.

Fig. 3: SuperTile (ST) creation for improving the efficiency.
An ST is composed of a set of K x K tiles, where K is the square root of the number of tiles assigned to each core, chosen with the aim of maintaining an ideal relationship between efficiency, strong scalability and speedup. As mentioned before, the ST is composed of two types of tiles: internal and edge. This division supports an overlapping strategy that minimizes the communication effects on the parallel execution time.

4. Methodology definition

The methodology is focused on managing the different communication latencies and bandwidths present on multicore clusters, with the objective of improving both efficiency and application scalability. It proceeds through four phases: a characterization, a tile distribution model, a mapping strategy and a scheduling policy (Fig. 4). These phases allow us to handle the latencies and imbalances created by the different communication paths. First, the methodology analyzes the application and the environment in the characterization phase, obtaining the application parameters and the computation/communication ratio that will be used to calculate the number of tiles of the ST and the ideal number of cores. The next step is the tile distribution model, which determines the number of tiles that have to be assigned to each core, and also the number of cores necessary to maintain both the strong-scalability and efficiency conditions. Next, the mapping phase allocates the sets of tiles (STs) among the cores calculated with the model defined in the tile distribution phase. Finally, the scheduling phase has two functions: assigning tile priorities and controlling the overlapping process. Once the methodology is applied, we evaluate the performance results obtained.
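To make the internal/edge split concrete, the following sketch (our illustration, not the paper's implementation) counts the two tile types of a K x K SuperTile; note that the edge count works out to 4(K - 1), which is the quantity the tile distribution model uses for the edge computation.

```python
def split_supertile(K):
    """Count internal and edge tiles of a K x K SuperTile.

    Internal tiles communicate only inside the core that owns the ST;
    edge tiles exchange data with tiles placed on other cores.
    """
    if K < 2:
        raise ValueError("a SuperTile needs K >= 2")
    internal = (K - 2) ** 2          # inner (K-2) x (K-2) block
    edge = K * K - internal          # boundary ring, equal to 4*(K-1)
    return internal, edge

# A 5 x 5 SuperTile has 9 internal tiles and 16 edge tiles.
print(split_supertile(5))            # -> (9, 16)
```

As K grows, the internal fraction dominates, which is what makes overlapping edge communication behind internal computation effective.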
4.1 Characterization phase

The objective of this phase is to gather the necessary parameters of the SPMD application and the environment. The characterization parameters are classified in three groups: application parameters, parallel environment characteristics, and the defined efficiency. Together, these parameters capture the relationship between the machine and the application.
The application parameters provide information about application characteristics such as problem size, number of tiles, number of iterations, communication volume and computation time of a tile. These parameters also allow us to determine the communication pattern of a tile and the distribution scheme of the SPMD application. The parallel environment parameters enable us to determine the communication and computation times of a tile within the hierarchical communication architecture, from which we calculate a ratio between them. This ratio is defined as λ(p)(w), where p identifies the path over which a tile communicates with a neighboring tile, e.g., through link A, B or C (Fig. 1), and w describes the direction of the communication, e.g., up, right, left or down in a four-communication pattern. The ratio is calculated with equation 1, where Comt(p)(w) is the time to communicate a tile over path p and Cpt is the computation time of a tile.

λ(p)(w) = Comt(p)(w) / Cpt    (1)

Finally, once all parameters are obtained through the characterization phase, we include the efficiency value in the model and evaluate the execution time. The efficiency value is defined by the variable effic.

4.2 Tile distribution model phase

The main objective of this model is to determine the number of cores Ncores that maintains the relationship between the maximum strong scalability and the desired efficiency. Equation 2 calculates this number of cores; it depends on the problem size, represented by M², divided by the optimal number of tiles per core, K².

Ncores = M² / K²    (2)

Knowing the value of K, we can estimate the execution time of the SPMD application.
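As a minimal sketch (with hypothetical measured times, not values from the paper), equations 1 and 2 translate directly into code:

```python
def lambda_ratio(comt_pw, cpt):
    """Equation 1: ratio of the time to communicate a tile over path p
    in direction w to the time to compute one tile."""
    return comt_pw / cpt

def n_cores(M, K):
    """Equation 2: cores required when each core receives one K x K
    SuperTile of an M x M problem."""
    return (M * M) / (K * K)

# Hypothetical characterization: a tile computes in 2 us and the slowest
# path needs 50 us to move it, so lambda = 25 for that path/direction.
print(lambda_ratio(50.0, 2.0))       # -> 25.0
# A 1024 x 1024 problem split into 128 x 128 SuperTiles needs 64 cores.
print(n_cores(1024, 128))            # -> 64.0
```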
For example, equation 3 represents the behavior of an SPMD application using the overlapping strategy: first the edge tile computation (EdgeComp(i)) is performed, and then the maximum of the internal tile computation (IntComp(i)) and the edge tile communication (Edgecomm(i)) is added. This process is repeated for a set of iterations (iter), where n is the iteration index. It models the communication exchange of the SPMD application, and the estimate is possible because of the deterministic behavior of these applications.

Texec(i) = Σ_{n=1}^{iter} [ EdgeComp(i) + max(IntComp(i), Edgecomm(i)) ]    (3)

EdgeComp(i) = 4 (K - 1) Cpt    (4)

IntComp(i) = (K - 2)² Cpt    (5)

Edgecomm(i) = K Max(Comt(p)(w))    (6)

The edge communication (Equ. 6) is computed for the worst communication case; that is, we use the slowest communication time to estimate the number of tiles necessary for maintaining the efficiency. To do this, we first calculate the λ(p)(w) ratio (Equ. 1) explained before. The next step is to determine the ideal value of K. We start from the overlapping strategy, in which the internal tile computation (IntComp(i)) and the edge tile communication (Edgecomm(i)) are overlapped. Equation 7 represents the ideal overlapping that allows us to obtain the maximum speedup while the efficiency effic is maintained over a defined threshold: we start from an initial condition where the edge communication time is greater than or equal to the internal computation time divided by the efficiency. This division represents the maximum inefficiency allowed.

K Max(Comt(p)(w)) >= ((K - 2)² Cpt) / effic    (7)

However, equation 7 is subject to the constraint defined in equation 8: Edgecomm(i) may exceed IntComp(i) scaled by the defined efficiency (Equ. 7), but Edgecomm(i) must not exceed IntComp(i) itself without any efficiency scaling. To calculate the optimal value of K, we determine the λ(p)(w) (Equ.
1) value and solve for Comt, which can be expressed as λ(p)(w) multiplied by the computation time Cpt of a tile. This lets us express both the internal computation and the edge communication as functions of Cpt. Substituting into equation 7 yields equation 9.

K Max(Comt(p)(w)) <= (K - 2)² Cpt    (8)

effic K Cpt Max(λ(p)(w)) = (K - 2)² Cpt    (9)

To find the value of K, we set equation 9 to zero and obtain a quadratic equation (Equ. 10). Its two solutions are then substituted into equations 7 and 8 in order to validate that the K value satisfies the defined constraints.

K² - (4 + effic Max(λ(p)(w))) K + 4 = 0    (10)

The next step is to calculate the ideal number of cores (Equ. 2) needed to achieve strong scalability with the desired efficiency. We start from the initial consideration that one ST is assigned to each core and use equation 2 to calculate the ideal number of cores that achieves the stated objective. This number of cores determines the inflection point up to which the application has strong scalability. Finally, we can determine
the theoretical behavior of the SPMD application for a number of cores lower than the calculated optimum, and predict its behavior. Equation 11 calculates the new value of K for a given number of cores, so that the execution time can be determined with equation 3.

K = sqrt(M² / Ncores)    (11)

4.3 Mapping strategy phase

A set of difficulties arises when we allocate SPMD tiles onto distinct cores and these cores have to communicate through different links. The main objective of this phase is therefore to design a strategy for allocating one ST to each core that minimizes the communication effects. The ST assignments apply a core affinity that allocates the sets of tiles according to the policy of minimizing communication latencies [4]. This core affinity identifies where the processes have to be allocated and how the STs can be assigned to each core. The next step in this phase is to create a logical process distribution that allows the application to identify its neighbor communications. This is done using a Cartesian topology that gives each process two coordinates in the grid distribution; these coordinates identify the core on which each process is allocated. We can also coordinate the communication order in order to minimize link saturation. The last step is to create the STs with the values obtained from the model.

4.4 Scheduling policy phase

The scheduling phase is divided into two main parts: the first develops an execution priority that determines how the tiles are executed inside each core; the second applies an overlapping strategy between internal computation and edge communication tiles. Execution priorities are assigned per tile, and the highest priorities are given to tiles whose communications travel through slower paths.
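The overall effect of this two-part scheduling can be sketched as a small simulation (our illustration, with sleeps standing in for tile computation and MPI exchange): edge tiles are computed first, their exchange is started asynchronously, and internal tiles are computed while it completes.

```python
import threading
import time

def run_iteration(edge_comp_s, int_comp_s, edge_comm_s):
    """One iteration under the overlapping policy: compute edge tiles
    (priority 1), start their asynchronous exchange, compute internal
    tiles (priority 2) while it runs, then wait for the exchange."""
    start = time.perf_counter()
    time.sleep(edge_comp_s)                                  # edge tiles
    comm = threading.Thread(target=time.sleep, args=(edge_comm_s,))
    comm.start()                                             # async exchange
    time.sleep(int_comp_s)                                   # internal tiles
    comm.join()                                              # iteration barrier
    return time.perf_counter() - start

# Elapsed time approaches edge_comp + max(int_comp, edge_comm), the form of
# equation 3 of the model, rather than the sum of all three terms.
elapsed = run_iteration(0.02, 0.10, 0.05)
```

The communication here hides completely behind the internal computation, so the iteration costs about 0.12 s instead of the 0.17 s a sequential schedule would need.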
The policies are as follows: tiles with external communications are given priority 1. These edge tiles are stored in buffers so that they are executed first; the buffers are updated every iteration. The second assignment covers the internal tiles, which are overlapped with the edge communications and given priority 2. The overlapping process uses two threads: one performs the internal computation and the other manages the asynchronous communications. This enables the internal computation and the edge communication to proceed together (Fig. 5).

5. Combining scalability and efficiency

Our methodology attempts to find the number of cores that achieves the maximum strong scalability with a defined efficiency. There are two distinct definitions of scalability in HPC. The first is weak scalability, in which both the problem size and the number of processing elements grow; the main goal is constant time-to-solution for larger problems, with the computational load per processor staying constant [3]. The second is strong scalability, in which the problem size is fixed and the number of processing elements is increased; the goal is to minimize time-to-solution, and scalability means that speedup is roughly proportional to the number of processing elements. Under these two definitions, our methodology searches for a combination of strong scalability and efficiency: the analytical model must determine the number of cores that yields the ideal relationship between speedup and the defined efficiency. This number of cores can be calculated with the model, and it determines the maximum capacity growth of the system. We can also determine the theoretical behavior of the application (Equ. 2).
This equation allows us to find the K value to be assigned to each core. The model finds a single ideal value that maintains the ideal overlapping; however, we can also calculate values for other numbers of cores in order to evaluate the performance.

Fig. 5: Scheduling policy.
Fig. 6: Combining scalability and efficiency of SPMD applications (cores, K, edge computation, internal computation, edge communication and execution time, up to 256 cores).
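As an illustrative sketch, the quadratic of equation 10 can be solved numerically; the larger root is the meaningful SuperTile side, since the smaller one falls below the K >= 2 required for internal tiles to exist.

```python
import math

def ideal_K(lambda_max, effic):
    """Solve equation 10, K^2 - (4 + effic*lambda_max)*K + 4 = 0,
    for the SuperTile side K; returns the larger (meaningful) root."""
    b = 4 + effic * lambda_max
    return (b + math.sqrt(b * b - 16)) / 2

def ideal_cores(M, K):
    """Equation 2, rounded to a whole number of cores."""
    return round((M / K) ** 2)

# Reproducing the values of the theoretical example in Section 5.1:
# lambda_max = 100, effic = 0.95, M = 1585 give K ~ 99 and ~256 cores.
K = ideal_K(100, 0.95)
print(round(K), ideal_cores(1585, round(K)))   # -> 99 256
```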
5.1 A theoretical example

This numerical example illustrates how the efficiency and strong-scalability concepts combine. Suppose the following application characteristics: a problem size of M = 1585, a defined efficiency (effic) of 95%, a four-communication pattern, three different communication links (e.g., cache, main memory and network), and a set of nodes with a double quad-core architecture. We first determine λ(p)(w) using equation 1 and take the maximum value obtained. We assume that Cpt = 1 time unit and that the maximum communication time over the slowest path is Comt = 100 time units. We then apply our analytical model, using equation 10 to determine the ideal ST and equation 2 to calculate the ideal number of cores, which represents the maximum combination of strong scalability and efficiency. The ideal value of K obtained is around 99 and the ideal number of cores is 256 for this example (Equ. 2). Once K and Ncores are calculated, we determine the efficiency and speedup for this ideal number of cores. The serial execution time is the global problem size multiplied by the computation time of one tile; for one iteration of this problem size, this is M² · Cpt ≈ 2.51 × 10⁶ time units. Figure 6 shows the results for different numbers of cores, visualizing the efficiency and speedup curves for this example. At the calculated ideal number of cores the efficiency is around the defined value, and the speedup up to that point grows roughly linearly. This point is the maximum strong application scalability under a desired efficiency.
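The curves of Fig. 6 can be reproduced from equations 3-6 and 11; the following sketch uses the example's assumptions (Cpt = 1, slowest-path Comt = 100, M = 1585, one iteration).

```python
import math

Cpt, Comt, M = 1.0, 100.0, 1585

def iteration_time(K):
    """Equations 4-6 combined as in equation 3, for one iteration
    of one core's K x K SuperTile."""
    edge_comp = 4 * (K - 1) * Cpt
    int_comp = (K - 2) ** 2 * Cpt
    edge_comm = K * Comt
    return edge_comp + max(int_comp, edge_comm)

serial = M * M * Cpt                     # one tile at a time on one core
for ncores in (64, 256, 1024):
    K = M / math.sqrt(ncores)            # equation 11
    speedup = serial / iteration_time(K)
    print(ncores, round(speedup), round(speedup / ncores, 2))
```

At 64 cores computation dominates and the efficiency is ~1.0; at the 256-core ideal point it is ~0.95, matching the defined threshold; at 1024 cores communication dominates and it drops to ~0.48, reproducing the inflection described above.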
Beyond this ideal point, the speedup still increases but not proportionally to the number of cores, and the efficiency begins to decrease considerably due to communication-bound behavior. Before the ideal point, efficiency and speedup are close to their maximum values (Fig. 6).

6. Performance Evaluation

The experiments to test our methodology were conducted on two multicore clusters. The first is a DELL cluster with 8 nodes, each with 2 quad-core Intel Xeon E5430 processors at 2.66 GHz, 6 MB of L2 cache shared by each pair of cores, and 12 GB of RAM. The second is an IBM cluster with 32 nodes, each with 2 dual-core Intel Xeon 5150 processors, 4 MB of cache shared by 2 cores, 12 GB of RAM, and a Gigabit Ethernet network. Both clusters run OpenMPI. To validate the results of this article, we chose two applications: heat transfer and a fluid-dynamics application (LL-2D-STD-MPI) from the MP-Labs suite.

6.1 Efficiency and scalability evaluation

The main objective of this evaluation is to demonstrate the improvement obtained by applying our methodology. Table 1 shows the characterization values, the computation time (Cpt) and the slowest tile communication time Comt(p)(w), the problem size and desired efficiency, together with the theoretical values of the number of cores, edge and internal computation, edge communication, and number of tiles.

Fig. 7: Efficiency of the heat transfer application on the DELL cluster.

For the performance analysis, we executed the SPMD applications and compared the theoretical values, the application without our methodology, and the application with it. Figure 7 shows the efficiency behavior of the heat transfer application executed with 100 iterations; it illustrates a considerable improvement in efficiency, around 42%, when executing with the number of cores determined by our model.
We can also observe that the application using our methodology behaves similarly to the analytical model, with an error rate of around 5% when the number of cores is below the maximum obtained from our model (Fig. 7). Figure 8 shows how the speedup increases as cores are added, but it no longer scales linearly past the maximum number of cores determined by our model. The calculated ideal point meets the maximum strong scalability while the efficiency stays over the defined threshold.

Fig. 8: Speedup of the heat transfer application on the DELL cluster.

The LL-2D-STD-MPI application is composed of three main parts: prestep, poststep, and the main module, where the communication and computation are performed. We applied our methodology and the tile characterization to the last module, because the other two only compute and have no communication; this application was also tested with 100 iterations. Figure 9 shows the improvement in efficiency between the original version and the version using our methodology, with an error rate of around 4%
7 Table 1: Tile distribution model evaluation examples App. Cpt Comt pw M effic K IntComp EdgeComp EdgeComm Ncore Cluster Heat Tran 0.021µsec 58, 8µsec 9500x % ,18E-01Sec 1,99E-04Sec 1,40E-01Sec 16 DELL LL-2D-STD 0.24µsec 60, 7µsec 2000x % 249 1,458E-02Sec 2,34E-04Sec 1,48E-02Sec 64 IBM Fig. 9: Efficiency of LL-2D-STD-MPI app on IBM Cluster efficiency over a defined threshold. To achieve this, we have proposed an appropriate manner to manage the inefficiencies generated by communications links presented on multicore clusters, as was described. In addition, with our method we can observe how the SPMD applications with some specific characteristics can behave with a specific problem size while is incremented the number of cores. This is the main purpose of finding the maximum point that allows the SPMD application to scale linearly. Future works are focused on working with heterogeneous computation on multicore environment with the aim of executing the SPMD applications efficiently in an communication and computation heterogeneus environments. References Fig. 10: Speedup of LL-2D-STD-MPI app on IBM Cluster with the analytical model. Similarly to the heat transfer application, the model gives us the ideal ST and number or core that maintain the relationship between efficiency and scalability (Table 1). Finally, we observe the behavior of speedup and the strong scalability in this application in figure 10, where we can observe the linear speedup until the number of cores is below to the theoretical value. This allows us to conclude that our methodology can determine the maximum strong scalability combining the maximum speedup and the efficiency over a defined threshold. The objective to test the application in two clusters is due to check the functionality of our method in different multicore architectures. These two examples show an approximation of our method and how the maximum speedup is reached with the efficiency defined in the model. 7. 
Conclusion and Future Work

This work has addressed how to combine efficiency and strong scalability in parallel SPMD applications. We presented a novel methodology based on four phases: a characterization, a tile distribution model, a mapping strategy, and a scheduling policy. These phases allowed us to determine, through an analytical model, the optimal size of the Supertile and the number of cores needed to accomplish the stated objective. The model focuses on managing the hierarchical communication architecture present in multicore clusters. Our experiments have demonstrated that this optimal size achieves the conditions of maximum speedup and efficiency.

References

[1] I. M. Nielsen and C. L. Janssen, "Multicore challenges and benefits for high performance scientific computing," Scientific Programming, vol. 16.
[2] M. McCool, "Scalable programming models for massively multicore processors," Proc. of the IEEE, vol. 96, no. 5.
[3] L. Peng, M. Kunaseth, H. Dursun, K.-i. Nomura, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable hierarchical parallelization framework for molecular dynamics simulation on multicore clusters," Proc. of the Int. Conf. on Parallel and Distributed Processing Techniques and Applications, USA.
[4] G. Mercier and J. Clet-Ortega, "Towards an efficient process placement policy for MPI applications in multicore environments," EuroPVM/MPI 2009.
[5] L. M. Liebrock and S. P. Goudy, "Methodology for modelling SPMD hybrid parallel computation," Concurrency and Computation: Practice and Experience, vol. 20, no. 8.
[6] G. Cong and D. A. Bader, "Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors," The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA 2007).
[7] K. Vikram and V. Vasudevan, "Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures," IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 9, p. 1010.
[8] R. Muresano, D. Rexachs, and E. Luque, "Methodology for efficient execution of SPMD applications on multicore clusters," 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID), IEEE Computer Society, 2010.
[9] J. Brehm, P. H. Worley, and M. Madhukar, "Performance modeling for SPMD message-passing programs," Concurrency: Practice and Experience, vol. 10, no. 5.
[10] O. Beaumont, A. Legrand, and Y. Robert, "Optimal algorithms for scheduling divisible workloads on heterogeneous systems," 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), p. 98.
[11] J. B. Weissman and X. Zhao, "Scheduling parallel applications in distributed networks," Cluster Computing, vol. 1.
[12] M. Panshenskov and A. Vakhitov, "Adaptive scheduling of parallel computations for SPMD tasks," ICCSA 2007.
[13] R. Van der Wijngaart and H. Jin, "NAS parallel benchmarks, multi-zone versions," NASA Advanced Supercomputing Division, Ames Research Center, USA, Tech. Rep.
[14] T. Lee and C.-L. Lin, "A stable discretization of the lattice Boltzmann equation for simulation of incompressible two-phase flows at high density ratio," J. Comput. Phys., vol. 206, June 2005.
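The stated objective, finding the largest core count whose parallel efficiency stays above a defined threshold, can be sketched as follows. This is a hypothetical illustration under a deliberately simple cost model (compute time divided evenly among cores plus a fixed per-core communication term), not the paper's actual tile distribution model; the function names and parameters are assumptions for the sketch.

```python
def efficiency(p, t_comp_total, t_comm_per_core):
    """Parallel efficiency for p cores under a simple, assumed cost model:
    each core computes t_comp_total / p and pays a fixed, non-overlapped
    communication cost t_comm_per_core."""
    t_parallel = t_comp_total / p + t_comm_per_core
    return t_comp_total / (p * t_parallel)


def max_cores_above_threshold(t_comp_total, t_comm_per_core, threshold, p_max):
    """Largest p in [1, p_max] whose efficiency meets the threshold.
    Under this model efficiency decreases monotonically with p, so a
    linear scan that remembers the last qualifying p is sufficient."""
    best = 1
    for p in range(1, p_max + 1):
        if efficiency(p, t_comp_total, t_comm_per_core) >= threshold:
            best = p
    return best


if __name__ == "__main__":
    # Example: 100 s of sequential compute, 0.5 s communication per core,
    # efficiency kept above 85% on a machine with up to 256 cores.
    p = max_cores_above_threshold(100.0, 0.5, 0.85, 256)
    print(p, round(efficiency(p, 100.0, 0.5), 3))  # → 35 0.851
```

A more faithful model would replace the fixed communication term with the Supertile-dependent communication volume that the paper's characterization phase measures, but the search structure (maximize cores subject to an efficiency bound) is the same.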