Combining Scalability and Efficiency for SPMD Applications on Multicore Clusters*


Ronal Muresano, Dolores Rexachs and Emilio Luque
Computer Architecture and Operating System Department (CAOS)
Universitat Autònoma de Barcelona, Barcelona, SPAIN

Abstract

A major challenge that parallel computing seeks to overcome is improving the performance of MPI applications. However, some of these applications do not scale when the problem size is fixed and the number of cores is increased. This scalability problem worsens on the hierarchical communication architecture found in multicore clusters. Therefore, this work presents a novel method developed for SPMD (Single Program Multiple Data) applications, based on finding the maximum strong-scalability point while the efficiency is maintained over a defined threshold. The method integrates four phases: a characterization, a tile distribution model, a mapping strategy, and a scheduling policy, and it targets SPMD applications designed with MPI libraries and high communication volumes. Our methodology has been tested with different SPMD scientific applications, and we observed that the maximum speedup and scalability were located close to the values calculated with our model.

Keywords: multicore, SPMD, performance, scalability

1. Introduction

Nowadays, scientific applications are developed with more complexity and accuracy, and this precision demands high computational resources in order to execute quickly and efficiently. The current trend in high performance computing (HPC) is toward clusters composed of multicore nodes, as can be seen in the top500 list (a ranking of the parallel machines used for HPC). The integration of these nodes in HPC has brought more parallelism within nodes. However, this parallelism has to deal with issues such as the number of cores per chip, shared caches, bus interconnection, memory bandwidth, etc. [1], which become important when managing application scalability and efficiency. Moreover, the hierarchical communication architecture of multicore clusters creates a heterogeneous environment that affects performance metrics such as efficiency, speedup and application scalability, because of the different speeds and bandwidths of each communication path (Fig. 1); this can degrade application performance [2]. Despite these communication issues, and in order to benefit from the computational capacities of multicore clusters, we focus on improving application performance in these environments.

This work focuses on calculating the maximum number of cores that maintains strong application scalability while the efficiency stays over a defined threshold. The objective of strong scalability is to keep the problem size constant while the number of processors increases [3]. To achieve this goal, we have to consider the parallel programming paradigm with which the application has been designed, e.g., master/worker, pipeline, SPMD, etc. Each of these paradigms has a different communication pattern, which can affect application performance. In this sense, we consider parallel applications designed with the message passing interface (MPI) for communication and SPMD as the parallel paradigm.

(* This research has been supported by the MEC-MICINN Spain under contract TIN. Contact author: R. Muresano, rmuresano@caos.uab.es. This paper is addressed to the PDPTA conference.)
The SPMD paradigm was selected because of its behavior: the same program is executed in every process, but each process works on a different set of tiles. These tiles have to exchange their information in each iteration, and this exchange can become a serious issue on a multicore environment. Figure 1 shows an SPMD application executing on a multicore cluster. The tiles are computed in a similar time due to the homogeneity of the cores. However, the communications are performed over different links in order to maintain the communication pattern, and the paths can differ by up to one and a half orders of magnitude in latency for the same communication volume. These differences translate into inefficiencies that decrease performance and prevent linear strong scalability. To solve them, we have developed a method that manages the communication latencies using characteristics of each SPMD application (e.g. the computation and communication ratio of a tile) and allows us to determine a relationship between scalability and efficiency. To achieve this performance relationship, our methodology is organized in four phases, characterization, tile distribution model, mapping strategy, and scheduling policy, which allow us to distribute the tiles over the environment.

Fig. 1: Issues of SPMD applications on a multicore cluster.

In this sense, the methodology groups the SPMD tiles into blocks called Supertiles (ST), assigning each of them to one core. The tiles of an ST belong to one of two types: internal tiles, whose communications are performed within the same core, and edge tiles, whose communications are performed with tiles allocated on other cores. This division allows us to apply an overlapping method, in which the internal tiles are computed while the edge tiles are communicating, and it allows our method to find the ideal number of cores that achieves the maximum strong scalability with a defined efficiency.

This paper is structured as follows: related work is described in section 2. Section 3 exposes the issues of SPMD applications on a multicore architecture. A description of the methodology is presented in section 4. Section 5 describes efficiency and scalability for SPMD applications. Next, section 6 illustrates the performance evaluation. Finally, conclusions are given in section 7.

2. Related Works

Different works have developed methodologies focused on improving performance metrics on multicore environments. Mercier et al. [4] designed a method to efficiently place MPI processes on multicore machines, establishing a placement policy that improves application efficiency. However, that work does not address scalability, which is very important when we wish to execute quickly and efficiently. On the other hand, Liebrock [5] defines a methodology for deriving a performance model for SPMD hybrid parallel applications, focused on improving three specific properties, adaptability, scalability and fidelity, using mapping, scheduling and synchronization-overhead strategies designed for hybrid message-passing and distributed-memory applications. In contrast, our work evaluates pure MPI applications; like that work we develop a methodology centered on mapping and scheduling strategies, but we also target efficient execution. Moreover, there are works centered on studying and improving efficiency [6] or enhancing speedup on multicore clusters [7] separately. By contrast, our methodology, also built on mapping and scheduling strategies, seeks to improve both speedup and efficiency on these clusters [8]. In that previous work we developed the methodology phases using the characterization, mapping and scheduling strategies; however, those phases only allow us to find the number of tiles that yields the maximum speedup of the SPMD application for a defined efficiency. The present work instead searches for a combination of strong scalability and efficiency, in which we can predict the number of cores that maintains the relationship between both metrics.

We rely on mapping and scheduling strategies to achieve our objective. In this sense, some works have developed mapping strategies for SPMD applications centered on improving application efficiency [7]. Another technique was designed by Brehm et al. [9], in which the main objective was to map the application using its own characteristics.
Similarly, our proposed mapping maintains the efficiency using the characteristics of the machine and the application, but we add an affinity process that allows us to minimize the communication effects of the multicore environment. There are also scheduling strategies for SPMD applications [10] [11] based on finding the minimum execution time, which is part of our objective. Furthermore, we analyzed and evaluated the model defined by Panshenskov et al. [12] and adopted some of its characteristics, tiles divided into blocks, asynchronous communications, and computation and communication overlapping, with the aim of minimizing the communication overhead and improving the efficiency of the SPMD application.

3. SPMD applications on multicore

In this study, the SPMD applications used have to satisfy the following characteristics: static, where the parallel application defines the communication process once and maintains it during the whole execution; local, where the application does not have collective communications; 2D grid structure; and regular, because the communications are repeated over several iterations. There are benchmarks with these characteristics, such as the CG and BT algorithms of the NAS parallel benchmarks [13], as well as real applications such as heat transfer simulation, the Laplace equation, fluid dynamics applications like the MP-Labs suite [14], finite difference applications, etc. The communication pattern can vary according to the objective of the SPMD application; however, these patterns are defined at the beginning of the application and are kept until the application ends.

Fig. 2: SPMD application on a multicore cluster.

Figure 2 shows an example of an SPMD application on a multicore cluster, illustrating the idle time generated by the slower communication links (e.g. cores 5-8 of node 1 communicating with cores 1-4 of node 2 through the inter-node link). These inter-node communications have the largest delay and can strongly affect efficiency and scalability. However, these idle times allow us to establish strategies for organizing how the SPMD tiles are distributed over the multicore cluster, with the aim of managing these communication inefficiencies. The communication time can vary by one and a half orders of magnitude depending on the path over which the communication is performed. These variations are a limiting factor for improving application performance, because the latency of the slowest link determines when an iteration has been completed (Fig. 2). These inefficiencies have to be managed if we wish to execute the SPMD application quickly, efficiently and scalably.

To manage these communication issues, we take the problem size of the SPMD application, which is composed of a number of tiles, and we create the Supertile (ST). Finding the optimal ST size is formulated as an analytical problem, in which the ratio between the computation and communication times of a tile has to be found, with the objective of establishing the relationship between strong scalability and efficiency. The ST is calculated with the aim of obtaining the maximum strong-scalability point while the efficiency is maintained over a defined threshold.

Fig. 3: Supertile (ST) creation for improving the efficiency.

Figure 3 shows an example of the overlapping process and of the ST creation. An ST is a group of tiles of the global problem, whose size is defined by M x M. Each ST is composed of a set of K x K tiles, where K is the square root of the number of tiles that have to be assigned to each core, chosen so as to maintain an ideal relationship between efficiency, strong scalability and speedup. As mentioned before, an ST is composed of two types of tiles, internal and edge, a division made with the objective of creating an overlapping strategy that minimizes the communication effects on the parallel execution time.
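As a concrete illustration of this decomposition (a minimal sketch of our own, not code from the paper), the split of a K x K Supertile into the 4(K-1) edge tiles and (K-2)^2 internal tiles used later by the model (equations 4 and 5) can be enumerated directly:

    #include <stdio.h>

    /* Classify the tiles of a K x K Supertile: edge tiles sit on the
       border and communicate with neighboring cores; internal tiles
       only touch tiles inside the same core. */
    static void classify_supertile(int K)
    {
        int internal = 0, edge = 0;
        for (int i = 0; i < K; i++)
            for (int j = 0; j < K; j++)
                if (i == 0 || j == 0 || i == K - 1 || j == K - 1)
                    edge++;       /* border of the ST */
                else
                    internal++;   /* interior of the ST */

        /* Matches the closed forms used by the model:
           edge = 4*(K-1), internal = (K-2)^2. */
        printf("K=%d: edge=%d (4(K-1)=%d), internal=%d ((K-2)^2=%d)\n",
               K, edge, 4 * (K - 1), internal, (K - 2) * (K - 2));
    }

    int main(void) { classify_supertile(4); return 0; }

For K = 4 this prints 12 edge tiles and 4 internal tiles; the larger K is, the more internal computation is available to hide the edge communication behind.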
4. Methodology definition

Fig. 4: Phases for the efficient execution of SPMD applications.

This methodology focuses on managing the different communication latencies and bandwidths present on multicore clusters, with the objective of improving both efficiency and application scalability. The process comprises four phases: a characterization, a tile distribution model, a mapping strategy and a scheduling policy (Fig. 4). These phases allow us to handle the latencies and the imbalances created by the different communication paths. Our methodology first performs an application and environment analysis in the characterization phase, with the aim of obtaining the application parameters and the computation-communication ratio that will be used to calculate the number of tiles of the ST and the ideal number of cores. The next step is to compute the tile distribution, which determines the number of tiles that have to be assigned to each core in order to achieve our objective; here we also calculate the number of cores necessary to maintain both the strong scalability and the efficiency conditions. Next, the mapping phase allocates the sets of tiles (ST) among the cores according to the model defined in the tile distribution phase. Finally, the scheduling phase has two functions: to assign tile priorities and to control the overlapping process. Once the methodology has been applied, we evaluate the performance results obtained.

4.1 Characterization phase

The objective of this phase is to gather the necessary parameters of the SPMD application and of the environment. These characterization parameters are classified in three groups: the application parameters, the parallel environment characteristics, and the defined efficiency. Together, these parameters capture the closest relationship between the machine and the application.
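One possible way to obtain the environment timings used by this phase (a sketch under our own assumptions; the paper does not prescribe a measurement procedure, and compute_one_tile and the payload size are hypothetical) is to time the tile kernel and a ping-pong over the path of interest; the ratio of the two timings gives the λ of equation 1 below:

    #include <mpi.h>

    #define TILE_BYTES 8          /* hypothetical per-tile payload */
    #define REPS 1000

    extern void compute_one_tile(void);   /* application kernel (assumed) */

    /* Measure Cpt (computation time of one tile) and Comt over the path
       between this rank and `peer` (ping-pong), averaged over REPS. */
    void characterize(int rank, int peer, double *cpt, double *comt)
    {
        char buf[TILE_BYTES];
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++)
            compute_one_tile();
        *cpt = (MPI_Wtime() - t0) / REPS;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank < peer) {
                MPI_Send(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        /* half the round-trip time = one-way tile communication time;
           lambda for this path (Equ. 1 below) is then *comt / *cpt */
        *comt = (MPI_Wtime() - t0) / (2.0 * REPS);
    }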

The application parameters provide the necessary information about the application characteristics, such as problem size, number of tiles, number of iterations, communication volume and computation time of a tile, etc. These parameters also allow us to determine the communication pattern of a tile and the distribution scheme of the SPMD application. The parallel environment parameters enable us to determine the communication and computation times of a tile within the hierarchical communication architecture. These values allow us to calculate a ratio between them. This ratio is defined as λ_{(p)(w)}, where p determines the path over which a tile has to communicate with a neighboring tile, e.g. through link A, B, or C (Fig. 1), and w describes the direction of the communication, e.g. up, right, left or down in a four-communication pattern. The ratio is calculated with equation 1, where Comt_{(p)(w)} is the time needed to communicate a tile over path p, and Cpt is the computation time of a tile.

λ_{(p)(w)} = Comt_{(p)(w)} / Cpt    (1)

Finally, once all parameters have been found through the characterization phase, we include the efficiency value in the model and evaluate the execution time. The efficiency value is defined by the variable effic and is included in the model.

4.2 Tile distribution model phase

The main objective of this model is to determine the number of cores Ncores that allows us to maintain the relationship between the maximum strong scalability and the desired efficiency. In this sense, equation 2 calculates the number of cores: the problem size, represented by M², divided by the optimal number of tiles per core, K².

Ncores = M² / K²    (2)

Knowing the value of K, we can estimate the execution time of the SPMD application. Equation 3 represents the behavior of an SPMD application using the overlapping strategy: first the edge tile computation (EdgeComp_{(i)}) is performed, and then the maximum of the internal tile computation (IntComp_{(i)}) and the edge tile communication (EdgeComm_{(i)}) is added. This process is repeated for a set of iterations (iter), where n is the index of an iteration. It models the communication exchange of the SPMD application and can be computed thanks to the deterministic behavior of these applications.

T_{ex_i} = Σ_{n=1}^{iter} ( EdgeComp_{(i)} + Max( IntComp_{(i)}, EdgeComm_{(i)} ) )    (3)

EdgeComp_{(i)} = 4 (K − 1) Cpt    (4)

IntComp_{(i)} = (K − 2)² Cpt    (5)

EdgeComm_{(i)} = K Max( Comt_{(p)(w)} )    (6)

The edge communication (Equ. 6) has to be evaluated for the worst communication case: we use the slowest communication time to estimate the number of tiles necessary to maintain the efficiency. To do this, we calculate the ratio λ_{(p)(w)} (Equ. 1) explained before. Then, the first step is to determine the ideal value of K. We start from the overlapping strategy, in which the internal tile computation (IntComp_{(i)}) and the edge tile communication (EdgeComm_{(i)}) are overlapped. Equation 7 represents the ideal overlapping that allows us to obtain the maximum speedup while the efficiency effic is maintained over a defined threshold. Therefore, we start from an initial condition where the edge communication time is greater than the internal computation time divided by the efficiency; this division represents the maximum inefficiency allowed.
K Max( Comt_{(p)(w)} ) ≥ ((K − 2)² Cpt) / effic    (7)

However, equation 7 has to respect a constraint, defined in equation 8: EdgeComm_{(i)} may exceed IntComp_{(i)} scaled by the defined efficiency (Equ. 7), but EdgeComm_{(i)} must remain smaller than IntComp_{(i)} without any efficiency scaling.

K Max( Comt_{(p)(w)} ) ≤ (K − 2)² Cpt    (8)

To calculate the optimal value of K, we determine λ_{(p)(w)} (Equ. 1) and express the communication time Comt as λ_{(p)(w)} multiplied by the computation time Cpt of a tile. This lets both the internal computation and the edge communication be written as functions of Cpt. Substituting into equation 7, taken as an equality, gives equation 9.

effic K Cpt Max( λ_{(p)(w)} ) = (K − 2)² Cpt    (9)

To find the value of K, we set equation 9 to zero and obtain a quadratic equation (Equ. 10). Its two solutions have to be substituted into equations 7 and 8, with the aim of validating that the K value satisfies the defined constraints.

K² − (4 + effic Max( λ_{(p)(w)} )) K + 4 = 0    (10)

The next step is to calculate the ideal number of cores (Equ. 2) needed to reach the maximum strong scalability with the desired efficiency. To do this, we start from the initial consideration that one ST is assigned to each core, and we apply equation 2, which calculates the ideal number of cores that achieves the stated objective. This number of cores determines the inflection point up to which the application preserves strong scalability. Finally, we can determine the theoretical behavior of the SPMD application for a number of cores lower than the calculated optimum and predict its behavior. Equation 11 calculates the new value of K for a given number of cores, with the objective of determining the execution time with equation 3.

K = sqrt( M² / Ncores )    (11)
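To make the model concrete, the following C sketch (our own illustration; the function names and example values are not from the paper) solves equation 10 for K and applies equations 2 and 11:

    #include <math.h>
    #include <stdio.h>

    /* Equ. 10: K^2 - (4 + effic * lambda_max) * K + 4 = 0.
       lambda_max is the worst-case ratio Comt/Cpt from Equ. 1. */
    static double ideal_K(double effic, double lambda_max)
    {
        double b = 4.0 + effic * lambda_max;
        double disc = b * b - 16.0;   /* b^2 - 4ac with a = 1, c = 4 */
        if (disc < 0.0)
            return -1.0;              /* no real root: model not applicable */
        /* The larger root is the meaningful ST size; the smaller one is
           below the 3x3 minimum an ST needs to have internal tiles. */
        return (b + sqrt(disc)) / 2.0;
    }

    /* Equ. 2: ideal number of cores for an M x M tile problem. */
    static double ideal_ncores(double M, double K) { return (M * M) / (K * K); }

    /* Equ. 11: ST size when running on a given (sub-optimal) core count. */
    static double K_for_ncores(double M, double n) { return sqrt(M * M / n); }

    int main(void)
    {
        double K = ideal_K(0.90, 50.0);   /* illustrative effic and lambda */
        printf("K ~ %.2f, Ncores ~ %.1f (M = 1000)\n",
               K, ideal_ncores(1000.0, K));
        printf("K on 64 cores: %.1f\n", K_for_ncores(1000.0, 64.0));
        return 0;
    }

The roots returned this way should still be checked against the constraints of equations 7 and 8, as prescribed above.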

4.3 Mapping strategy phase

A set of difficulties arises when we allocate SPMD tiles onto distinct cores and these cores have to communicate through different links. Accordingly, the main objective of this phase is to design a strategy for allocating the ST onto each core so as to minimize the communication effects. The ST assignments are made by applying a core affinity, which allocates the sets of tiles according to the policy of minimizing communication latencies [4]. This core affinity identifies where the processes have to be allocated and how each ST is assigned to a core. The next step in this phase is to create a logical process distribution that allows the application to identify its neighbor communications. This is done using a Cartesian topology of the processes, which gives each process two coordinates in the grid distribution. These two coordinates identify the core on which each process has to be allocated. We can also coordinate the communication order with the objective of minimizing link saturation. The last step is to create the ST with the values obtained from the model.

4.4 Scheduling policy phase

The scheduling phase is divided into two main parts: the first develops an execution priority that determines how the tiles have to be executed inside a core, and the second applies an overlapping strategy between internal computation and edge communication tiles. Execution priorities are assigned per tile, and the highest priorities are given to tiles that communicate through the slower paths. The assignments follow these policies: tiles with external communications receive priority 1; these edge tiles are saved in buffers so that they are executed first, and the buffers are updated every iteration. The second assignment is for the internal tiles, which are overlapped with the edge communications and receive priority 2. The overlapping process uses two threads: one performs the internal computation and the other manages the asynchronous communications. These communications enable the internal computation and the edge communication to proceed together (Fig. 5).

Fig. 5: Scheduling policy.
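As an illustration of how these two phases translate into MPI (our sketch, not the paper's code: the paper overlaps communication with a helper thread, whereas here nonblocking calls achieve the same overlap in a single thread; the kernel functions are hypothetical):

    #include <mpi.h>

    /* Hypothetical application kernels standing in for the tile work. */
    void compute_edge_tiles(void);      /* priority 1: tiles on the ST border */
    void compute_internal_tiles(void);  /* priority 2: tiles inside the ST    */

    /* Mapping phase (Sec. 4.3): a Cartesian topology gives every process
       two grid coordinates; reorder = 1 lets the MPI library apply an
       affinity that favors neighbor locality. */
    MPI_Comm make_grid(int rows, int cols)
    {
        int dims[2] = { rows, cols }, periods[2] = { 0, 0 };
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        return cart;
    }

    /* Scheduling phase (Sec. 4.4): compute edge tiles first, start the
       asynchronous edge exchange, overlap it with internal computation.
       Neighbors on the grid boundary come back as MPI_PROC_NULL, for
       which the sends/receives are legal no-ops. */
    void iterate(MPI_Comm cart, double *send, double *recv, int K)
    {
        int up, down, left, right;
        MPI_Request req[8];

        MPI_Cart_shift(cart, 0, 1, &up, &down);     /* four-neighbor pattern */
        MPI_Cart_shift(cart, 1, 1, &left, &right);

        compute_edge_tiles();                       /* priority 1 */

        MPI_Irecv(recv,         K, MPI_DOUBLE, up,    0, cart, &req[0]);
        MPI_Irecv(recv + K,     K, MPI_DOUBLE, down,  0, cart, &req[1]);
        MPI_Irecv(recv + 2 * K, K, MPI_DOUBLE, left,  0, cart, &req[2]);
        MPI_Irecv(recv + 3 * K, K, MPI_DOUBLE, right, 0, cart, &req[3]);
        MPI_Isend(send,         K, MPI_DOUBLE, up,    0, cart, &req[4]);
        MPI_Isend(send + K,     K, MPI_DOUBLE, down,  0, cart, &req[5]);
        MPI_Isend(send + 2 * K, K, MPI_DOUBLE, left,  0, cart, &req[6]);
        MPI_Isend(send + 3 * K, K, MPI_DOUBLE, right, 0, cart, &req[7]);

        compute_internal_tiles();                   /* priority 2, overlapped */

        MPI_Waitall(8, req, MPI_STATUSES_IGNORE);   /* iteration complete */
    }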
5. Combining scalability and efficiency

Our methodology attempts to find the number of cores that achieves the maximum strong scalability with a defined efficiency. However, there are two distinct definitions of scalability in HPC. The first is weak scalability, in which the problem size and the number of processing elements grow together; the main goal is to achieve a constant time-to-solution for larger problems, with the computational load per processor staying constant [3]. The second is strong scalability, in which the problem size is fixed and the number of processing elements is increased; the goal here is to minimize the time-to-solution, so that scalability means the speedup is roughly proportional to the number of processing elements. Under these two definitions, our methodology searches for a combination of strong scalability and efficiency.

This combination means that our analytical model has to determine the number of cores that yields the ideal relationship between speedup and the defined efficiency. This number of cores can be calculated with the model, and it determines the maximum capacity growth of the system. We can also determine the theoretical behavior of the application (Equ. 2); this equation gives the K value that has to be assigned to each core. The model finds only one ideal value that maintains the ideal overlapping. However, we can calculate values for other numbers of cores with the aim of evaluating the performance.

Fig. 6: Combining scalability and efficiency of SPMD applications (columns: Cores, K, Edge Cp, Int Cp, Edge Comm, Exec T).

5.1 A theoretical example

This numerical example illustrates how we can combine the efficiency and strong-scalability concepts. Suppose the following application characteristics: a problem size of M = 1585, a defined efficiency (effic) of 95%, a four-communication pattern, three different communication links (e.g. cache, main memory and network) and a set of nodes with a double quad-core architecture. First we determine λ_{(p)(w)} using equation 1 and take the maximum value obtained. We assume that Cpt = 1 time unit and that the maximum communication time over the slowest communication path is Comt = 100 time units. We then apply our analytical model, using equation 10 to determine the ideal ST and equation 2 to calculate the ideal number of cores, which represents the maximum combination of strong scalability and efficiency. The ideal value of K obtained is around 99, and the ideal number of cores is 256 for this example (Equ. 2). Once K and Ncores are calculated, we determine the efficiency and speedup for this ideal number of cores. To do so, we calculate the serial execution time as the global problem size multiplied by the computation time of one tile, Cpt. For one iteration, the serial time for this specific problem size is 2,512,225 time units (1585² × 1).

Figure 6 shows the results for different numbers of cores, visualizing the efficiency and speedup curves for this example. At the calculated ideal number of cores, the efficiency is around the defined optimal value and the speedup grows roughly linearly up to this point. This point is the maximum strong scalability of the application under the desired efficiency. After this ideal point, the speedup still increases, but not proportionally to the number of cores, and the efficiency begins to decrease considerably due to the communication-bound behavior. Conversely, before the ideal point both efficiency and speedup stay around their maximum values (Fig. 6).
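These figures can be reproduced directly from the model (a sketch with our own variable names, for one iteration, with speedup computed as serial time over the modeled parallel time of equations 3-6):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double M = 1585.0, Cpt = 1.0, Commt = 100.0, effic = 0.95;
        const double lambda = Commt / Cpt;                  /* Equ. 1 */

        /* Equ. 10: K^2 - (4 + effic*lambda)*K + 4 = 0, larger root. */
        double b = 4.0 + effic * lambda;
        double K = (b + sqrt(b * b - 16.0)) / 2.0;          /* ~98.96 */
        double ncores = (M * M) / (K * K);                  /* Equ. 2, ~256 */

        /* One iteration of the overlapped execution model (Equ. 3-6). */
        double edge_comp = 4.0 * (K - 1.0) * Cpt;           /* Equ. 4 */
        double int_comp  = (K - 2.0) * (K - 2.0) * Cpt;     /* Equ. 5 */
        double edge_comm = K * Commt;                       /* Equ. 6 */
        double t_par = edge_comp + fmax(int_comp, edge_comm);
        double t_ser = M * M * Cpt;                         /* serial time */

        double speedup = t_ser / t_par;
        printf("K ~ %.0f, Ncores ~ %.0f\n", K, ncores);
        printf("speedup ~ %.1f, efficiency ~ %.2f\n",
               speedup, speedup / ncores);
        return 0;
    }

With the example's inputs this yields a speedup of roughly 244 on about 256 cores, i.e. an efficiency of roughly 0.95, consistent with the defined threshold.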
6. Performance Evaluation

The experiments to test our methodology were conducted on two multicore clusters. The first is a DELL cluster with 8 nodes, each with 2 quad-core Intel Xeon E5430 processors at 2.66 GHz, 6 MB of L2 cache shared by each pair of cores, and 12 GB of RAM. The second is an IBM cluster with 32 nodes, each with 2 dual-core Intel Xeon 5150 processors, 4 MB of cache shared by each pair of cores, 12 GB of RAM, and a Gigabit Ethernet network. Both clusters run Open MPI. To validate the results of this article, we chose two applications: heat transfer, and a fluid dynamics application (LL-2D-STD-MPI) integrated in the MP-Labs suite.

6.1 Efficiency and scalability evaluation

The main objective of this evaluation is to demonstrate the improvement obtained by applying our methodology. Table 1 shows the characterization values of the computation time (Cpt) and the slowest communication time Comt_{(p)(w)} of a tile, the problem size and the desired efficiency, and also the theoretical values of the number of cores, the edge and internal computation, the edge communication and the number of tiles. For our performance analysis, we executed the SPMD applications comparing three cases: the theoretical values, the application without our methodology, and the application with it. In this sense, figure 7 shows the efficiency behavior of the heat transfer application executed with 100 iterations.

Fig. 7: Efficiency of the heat transfer application on the DELL cluster.

Figure 7 illustrates a considerable efficiency improvement, of around 42%, when we execute with the number of cores determined by our model. We can also observe that the application using our methodology behaves similarly to the analytical model, with an error rate of around 5% when the number of cores is below the maximum obtained from our model (Fig. 7). Figure 8 shows how the speedup increases as more cores are added, but this speedup does not scale linearly beyond the maximum number of cores determined by our model. The calculated ideal point meets the maximum strong scalability while the efficiency remains over the defined threshold.

Fig. 8: Speedup of the heat transfer application on the DELL cluster.

On the other hand, the LL-2D-STD-MPI application consists of three main parts: prestep, poststep, and the main module where the communication and the computation are performed. We apply our methodology and the tile characterization process to the last module, because the other two only compute and do not perform any communication; this application was also tested with 100 iterations. Figure 9 shows the improvement in efficiency between the original version and the version with our methodology applied, with an error rate of 4% with respect to the analytical model.

Table 1: Tile distribution model evaluation examples

App.       | Cpt      | Comt_{(p)(w)} | M         | effic | K    | IntComp     | EdgeComp   | EdgeComm   | Ncore | Cluster
Heat Tran  | 0.021 µs | 58.8 µs       | 9500x9500 | - %   | 2375 | 1.18E-01 s  | 1.99E-04 s | 1.40E-01 s | 16    | DELL
LL-2D-STD  | 0.24 µs  | 60.7 µs       | 2000x2000 | - %   | 249  | 1.458E-02 s | 2.34E-04 s | 1.48E-02 s | 64    | IBM

Fig. 9: Efficiency of the LL-2D-STD-MPI application on the IBM cluster.
Fig. 10: Speedup of the LL-2D-STD-MPI application on the IBM cluster.

Similarly to the heat transfer application, the model gives us the ideal ST and the number of cores that maintain the relationship between efficiency and scalability (Table 1). Finally, figure 10 shows the behavior of the speedup and the strong scalability of this application: the speedup is linear while the number of cores is below the theoretical value. This allows us to conclude that our methodology can determine the maximum strong scalability, combining the maximum speedup with an efficiency over a defined threshold. The purpose of testing the applications on two clusters was to check the functionality of our method on different multicore architectures. These two examples show the accuracy of our method and how the maximum speedup is reached with the efficiency defined in the model.

7. Conclusion and Future Work

This work addresses how efficiency and strong scalability can be combined in parallel applications. It presented a novel methodology based on a characterization, a tile distribution model, a mapping strategy and a scheduling policy. These phases allowed us to find, through an analytical model, the optimal size of the Supertile and the number of cores needed to accomplish the stated objective. The model focuses on managing the hierarchical communication architecture present on multicore clusters. The experiments demonstrated that this optimal size can achieve the conditions of maximum speedup and efficiency over a defined threshold. To achieve this, we have proposed an appropriate way to manage the inefficiencies generated by the communication links present on multicore clusters, as described. In addition, with our method we can observe how SPMD applications with the specified characteristics behave for a given problem size as the number of cores is increased. This is the main purpose of finding the maximum point up to which the SPMD application scales linearly. Future work will focus on heterogeneous computation on multicore environments, with the aim of executing SPMD applications efficiently in environments that are heterogeneous in both communication and computation.

References

[1] I. M. Nielsen and C. L. Janssen, "Multicore challenges and benefits for high performance scientific computing," Scientific Programming, vol. 16.
[2] M. McCool, "Scalable programming models for massively multicore processors," Proc. of the IEEE, vol. 96, no. 5.
[3] L. Peng, M. Kunaseth, H. Dursun, K. Nomura, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable hierarchical parallelization framework for molecular dynamics simulation on multicore clusters," Proc. of the Int. Conf. on Parallel and Distributed Processing Techniques and Applications, USA.
[4] G. Mercier and J. Clet-Ortega, "Towards an efficient process placement policy for MPI applications in multicore environments," EuroPVM/MPI 2009.
[5] L. M. Liebrock and S. P. Goudy, "Methodology for modelling SPMD hybrid parallel computation," Concurrency and Computation: Practice and Experience, vol. 20, no. 8.
[6] G. Cong and D. A. Bader, "Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors," The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07).
[7] K. Vikram and V. Vasudevan, "Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures," IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 9, p. 1010.
[8] R. Muresano, D. Rexachs, and E. Luque, "Methodology for efficient execution of SPMD applications on multicore clusters," 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID), IEEE Computer Society, 2010.
[9] J. Brehm, P. H. Worley, and M. Madhukar, "Performance modeling for SPMD message-passing programs," Concurrency: Practice and Experience, vol. 10, no. 5.
[10] O. Beaumont, A. Legrand, and Y. Robert, "Optimal algorithms for scheduling divisible workloads on heterogeneous systems," 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), p. 98.
[11] J. B. Weissman and X. Zhao, "Scheduling parallel applications in distributed networks," Cluster Computing, vol. 1.
[12] M. Panshenskov and A. Vakhitov, "Adaptive scheduling of parallel computations for SPMD tasks," ICCSA 2007.
[13] R. Van der Wijngaart and H. Jin, "NAS parallel benchmarks, multi-zone versions," NASA Advanced Supercomputing Division, Ames Research Center, USA, Tech. Rep.
[14] T. Lee and C.-L. Lin, "A stable discretization of the lattice Boltzmann equation for simulation of incompressible two-phase flows at high density ratio," J. Comput. Phys., vol. 206, June 2005.


Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Energy Constrained Resource Scheduling for Cloud Environment

Energy Constrained Resource Scheduling for Cloud Environment Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering

More information

Using Cloud Computing for Solving Constraint Programming Problems

Using Cloud Computing for Solving Constraint Programming Problems Using Cloud Computing for Solving Constraint Programming Problems Mohamed Rezgui, Jean-Charles Régin, and Arnaud Malapert Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, 06900 Sophia Antipolis, France

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application 2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

An Architecture Model of Sensor Information System Based on Cloud Computing

An Architecture Model of Sensor Information System Based on Cloud Computing An Architecture Model of Sensor Information System Based on Cloud Computing Pengfei You, Yuxing Peng National Key Laboratory for Parallel and Distributed Processing, School of Computer Science, National

More information

Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems

Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems Nagaveni V # Dr. G T Raju * # Department of Computer Science and

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems

An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems Ardhendu Mandal and Subhas Chandra Pal Department of Computer Science and Application, University

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Dynamic resource management for energy saving in the cloud computing environment

Dynamic resource management for energy saving in the cloud computing environment Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan

More information

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE 1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

More information

DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH

DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor pneelakantan@rediffmail.com ABSTRACT The grid

More information

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING José Daniel García Sánchez ARCOS Group University Carlos III of Madrid Contents 2 The ARCOS Group. Expand motivation. Expand

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

More information

This is an author-deposited version published in : http://oatao.univ-toulouse.fr/ Eprints ID : 12902

This is an author-deposited version published in : http://oatao.univ-toulouse.fr/ Eprints ID : 12902 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

ENERGY EFFICIENT VIRTUAL MACHINE ASSIGNMENT BASED ON ENERGY CONSUMPTION AND RESOURCE UTILIZATION IN CLOUD NETWORK

ENERGY EFFICIENT VIRTUAL MACHINE ASSIGNMENT BASED ON ENERGY CONSUMPTION AND RESOURCE UTILIZATION IN CLOUD NETWORK International Journal of Computer Engineering & Technology (IJCET) Volume 7, Issue 1, Jan-Feb 2016, pp. 45-53, Article ID: IJCET_07_01_006 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=7&itype=1

More information

Towards a Load Balancing in a Three-level Cloud Computing Network

Towards a Load Balancing in a Three-level Cloud Computing Network Towards a Load Balancing in a Three-level Cloud Computing Network Shu-Ching Wang, Kuo-Qin Yan * (Corresponding author), Wen-Pin Liao and Shun-Sheng Wang Chaoyang University of Technology Taiwan, R.O.C.

More information

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Efficient DNS based Load Balancing for Bursty Web Application Traffic

Efficient DNS based Load Balancing for Bursty Web Application Traffic ISSN Volume 1, No.1, September October 2012 International Journal of Science the and Internet. Applied However, Information this trend leads Technology to sudden burst of Available Online at http://warse.org/pdfs/ijmcis01112012.pdf

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information