Combining Scalability and Efficiency for SPMD Applications on Multicore Clusters*


Ronal Muresano, Dolores Rexachs and Emilio Luque
Computer Architecture and Operating System Department (CAOS)
Universitat Autònoma de Barcelona, Barcelona, SPAIN

Abstract

A major challenge that parallel computing seeks to overcome is improving the performance of MPI applications. However, some of these applications do not scale when the problem size is fixed and the number of cores is increased. This scalability problem worsens on the hierarchical communication architecture found in multicore clusters. Therefore, this work presents a novel method developed for SPMD (Single Program Multiple Data) applications, based on finding the maximum strong-scalability point while the efficiency is maintained over a defined threshold. The method integrates four phases: a characterization, a tile distribution model, a mapping strategy, and a scheduling policy, and it targets SPMD applications designed with MPI libraries and high communication volumes. Our methodology has been tested with different SPMD scientific applications, and we observed that the maximum speedup and scalability were located close to the values calculated with our model.

Keywords: multicore, SPMD, performance, scalability

1. Introduction

Nowadays, scientific applications are developed with more complexity and accuracy, and this precision demands high computational resources in order to execute quickly and efficiently. The current trend in high performance computing (HPC) is toward clusters composed of multicore nodes, as can be seen in the top500 list (a ranking of the parallel machines used for HPC). The integration of these nodes in HPC has brought more parallelism within nodes. However, this parallelism has to deal with issues such as the number of cores per chip, shared caches, bus interconnection, memory bandwidth, etc. [1], which become important when managing application scalability and efficiency. Moreover, the hierarchical communication architecture of multicore clusters creates a heterogeneous environment that affects performance metrics such as efficiency, speedup and application scalability, because of the different speeds and bandwidths of each communication path (Fig. 1); this can degrade application performance [2]. Despite these communication issues, and in order to benefit from the computational capacities of multicore clusters, we focus on improving application performance in these environments.

This work focuses on calculating the maximum number of cores that maintains strong application scalability while the efficiency stays over a defined threshold. The objective of strong scalability is to keep the problem size constant while the number of processors increases [3]. To achieve this goal, we have to consider the parallel programming paradigm with which the application has been designed, e.g., master/worker, pipeline, SPMD, etc. Each of these paradigms has a different communication pattern, which can affect application performance. In this sense, we consider parallel applications designed with the message passing interface (MPI) for communication and SPMD as the parallel paradigm.

(* This research has been supported by the MEC-MICINN Spain under contract TIN. Contact author: R. Muresano, rmuresano@caos.uab.es. This paper is addressed to the PDPTA conference.)
The SPMD paradigm was selected because of its behavior: the same program is executed in every process, but each process works on a different set of tiles. These tiles have to exchange their information in each iteration, and this exchange can become a serious issue on a multicore environment. Figure 1 shows an SPMD application executing on a multicore cluster. The tiles are computed in a similar time due to the homogeneity of the cores. However, the communications are performed over different links in order to maintain the communication pattern, and the paths can differ by up to one and a half orders of magnitude in latency for the same communication volume. These differences translate into inefficiencies that decrease performance and prevent linear strong scalability. To solve them, we have developed a method that manages the communication latencies using characteristics of each SPMD application (e.g. the computation and communication ratio of a tile) and allows us to determine a relationship between scalability and efficiency. To achieve this performance relationship, our methodology is organized in four phases, characterization, tile distribution model, mapping strategy, and scheduling policy, which allow us to distribute the tiles over the environment.

Fig. 1: Issues of SPMD applications on a multicore cluster.

In this sense, the methodology groups the SPMD tiles into blocks called Supertiles (ST), assigning each of them to one core. The tiles of an ST belong to one of two types: internal tiles, whose communications are performed within the same core, and edge tiles, whose communications are performed with tiles allocated on other cores. This division allows us to apply an overlapping method, in which the internal tiles are computed while the edge tiles are communicating, and it allows our method to find the ideal number of cores that achieves the maximum strong scalability with a defined efficiency.

This paper is structured as follows: related work is described in section 2. Section 3 exposes the issues of SPMD applications on a multicore architecture. A description of the methodology is presented in section 4. Section 5 describes efficiency and scalability for SPMD applications. Next, section 6 illustrates the performance evaluation. Finally, conclusions are given in section 7.

2. Related Works

Different works have developed methodologies focused on improving performance metrics on multicore environments. Mercier et al. [4] designed a method to efficiently place MPI processes on multicore machines, establishing a placement policy that improves application efficiency. However, that work does not address scalability, which is very important when we wish to execute quickly and efficiently. On the other hand, Liebrock [5] defines a methodology for deriving a performance model for SPMD hybrid parallel applications, focused on improving three specific properties, adaptability, scalability and fidelity, using mapping, scheduling and synchronization-overhead strategies designed for hybrid message-passing and distributed-memory applications. In contrast, our work evaluates pure MPI applications; like that work we develop a methodology centered on mapping and scheduling strategies, but we also target efficient execution. Moreover, there are works centered on studying and improving efficiency [6] or enhancing speedup on multicore clusters [7] separately. By contrast, our methodology, also built on mapping and scheduling strategies, seeks to improve both speedup and efficiency on these clusters [8]. In that previous work we developed the methodology phases using the characterization, mapping and scheduling strategies; however, those phases only allow us to find the number of tiles that yields the maximum speedup of the SPMD application for a defined efficiency. The present work instead searches for a combination of strong scalability and efficiency, in which we can predict the number of cores that maintains the relationship between both metrics.

We rely on mapping and scheduling strategies to achieve our objective. In this sense, some works have developed mapping strategies for SPMD applications centered on improving application efficiency [7]. Another technique was designed by Brehm et al. [9], in which the main objective was to map the application using its own characteristics.
Similarly, our proposed mapping maintains the efficiency using the characteristics of the machine and the application, but we add an affinity process that allows us to minimize the communication effects of the multicore environment. There are also scheduling strategies for SPMD applications [10] [11] based on finding the minimum execution time, which is part of our objective. Furthermore, we analyzed and evaluated the model defined by Panshenskov et al. [12] and adopted some of its characteristics, tiles divided into blocks, asynchronous communications, and computation and communication overlapping, with the aim of minimizing the communication overhead and improving the efficiency of the SPMD application.

3. SPMD applications on multicore

In this study, the SPMD applications used have to satisfy the following characteristics: static, where the parallel application defines the communication process once and maintains it during the whole execution; local, where the application does not have collective communications; 2D grid structure; and regular, because the communications are repeated over several iterations. There are benchmarks with these characteristics, such as the CG and BT algorithms of the NAS parallel benchmarks [13], as well as real applications such as heat transfer simulation, the Laplace equation, fluid dynamics applications like the MP-Labs suite [14], finite difference applications, etc. The communication pattern can vary according to the objective of the SPMD application; however, these patterns are defined at the beginning of the application and are kept until the application ends.

Fig. 2: SPMD application on a multicore cluster.

Figure 2 shows an example of an SPMD application on a multicore cluster, illustrating the idle time generated by the slower communication links (e.g. cores 5-8 of node 1 communicating with cores 1-4 of node 2 through the inter-node link). These inter-node communications have the largest delay and can strongly affect efficiency and scalability. However, these idle times allow us to establish strategies for organizing how the SPMD tiles are distributed over the multicore cluster, with the aim of managing these communication inefficiencies. The communication time can vary by one and a half orders of magnitude depending on the path over which the communication is performed. These variations are a limiting factor for improving application performance, because the latency of the slowest link determines when an iteration has been completed (Fig. 2). These inefficiencies have to be managed if we wish to execute the SPMD application quickly, efficiently and scalably.

To manage these communication issues, we take the problem size of the SPMD application, which is composed of a number of tiles, and we create the Supertile (ST). Finding the optimal ST size is formulated as an analytical problem, in which the ratio between the computation and communication times of a tile has to be found, with the objective of establishing the relationship between strong scalability and efficiency. The ST is calculated with the aim of obtaining the maximum strong-scalability point while the efficiency is maintained over a defined threshold.

Fig. 3: Supertile (ST) creation for improving the efficiency.

Figure 3 shows an example of the overlapping process and of the ST creation. An ST is a group of tiles of the global problem, whose size is defined by M x M. Each ST is composed of a set of K x K tiles, where K is the square root of the number of tiles that have to be assigned to each core, chosen so as to maintain an ideal relationship between efficiency, strong scalability and speedup. As mentioned before, an ST is composed of two types of tiles, internal and edge, a division made with the objective of creating an overlapping strategy that minimizes the communication effects on the parallel execution time.
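As a concrete illustration of this decomposition (a minimal sketch of our own, not code from the paper), the split of a K x K Supertile into the 4(K-1) edge tiles and (K-2)^2 internal tiles used later by the model (equations 4 and 5) can be enumerated directly:

    #include <stdio.h>

    /* Classify the tiles of a K x K Supertile: edge tiles sit on the
       border and communicate with neighboring cores; internal tiles
       only touch tiles inside the same core. */
    static void classify_supertile(int K)
    {
        int internal = 0, edge = 0;
        for (int i = 0; i < K; i++)
            for (int j = 0; j < K; j++)
                if (i == 0 || j == 0 || i == K - 1 || j == K - 1)
                    edge++;       /* border of the ST */
                else
                    internal++;   /* interior of the ST */

        /* Matches the closed forms used by the model:
           edge = 4*(K-1), internal = (K-2)^2. */
        printf("K=%d: edge=%d (4(K-1)=%d), internal=%d ((K-2)^2=%d)\n",
               K, edge, 4 * (K - 1), internal, (K - 2) * (K - 2));
    }

    int main(void) { classify_supertile(4); return 0; }

For K = 4 this prints 12 edge tiles and 4 internal tiles; the larger K is, the more internal computation is available to hide the edge communication behind.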
4. Methodology definition

Fig. 4: Phases for the efficient execution of SPMD applications.

This methodology focuses on managing the different communication latencies and bandwidths present on multicore clusters, with the objective of improving both efficiency and application scalability. The process comprises four phases: a characterization, a tile distribution model, a mapping strategy and a scheduling policy (Fig. 4). These phases allow us to handle the latencies and the imbalances created by the different communication paths. Our methodology first performs an application and environment analysis in the characterization phase, with the aim of obtaining the application parameters and the computation-communication ratio that will be used to calculate the number of tiles of the ST and the ideal number of cores. The next step is to compute the tile distribution, which determines the number of tiles that have to be assigned to each core in order to achieve our objective; here we also calculate the number of cores necessary to maintain both the strong scalability and the efficiency conditions. Next, the mapping phase allocates the sets of tiles (ST) among the cores according to the model defined in the tile distribution phase. Finally, the scheduling phase has two functions: to assign tile priorities and to control the overlapping process. Once the methodology has been applied, we evaluate the performance results obtained.

4.1 Characterization phase

The objective of this phase is to gather the necessary parameters of the SPMD application and of the environment. These characterization parameters are classified in three groups: the application parameters, the parallel environment characteristics, and the defined efficiency. Together, these parameters capture the closest relationship between the machine and the application.
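One possible way to obtain the environment timings used by this phase (a sketch under our own assumptions; the paper does not prescribe a measurement procedure, and compute_one_tile and the payload size are hypothetical) is to time the tile kernel and a ping-pong over the path of interest; the ratio of the two timings gives the λ of equation 1 below:

    #include <mpi.h>

    #define TILE_BYTES 8          /* hypothetical per-tile payload */
    #define REPS 1000

    extern void compute_one_tile(void);   /* application kernel (assumed) */

    /* Measure Cpt (computation time of one tile) and Comt over the path
       between this rank and `peer` (ping-pong), averaged over REPS. */
    void characterize(int rank, int peer, double *cpt, double *comt)
    {
        char buf[TILE_BYTES];
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++)
            compute_one_tile();
        *cpt = (MPI_Wtime() - t0) / REPS;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank < peer) {
                MPI_Send(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, TILE_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        /* half the round-trip time = one-way tile communication time;
           lambda for this path (Equ. 1 below) is then *comt / *cpt */
        *comt = (MPI_Wtime() - t0) / (2.0 * REPS);
    }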

The application parameters provide the necessary information about the application characteristics, such as problem size, number of tiles, number of iterations, communication volume and computation time of a tile, etc. These parameters also allow us to determine the communication pattern of a tile and the distribution scheme of the SPMD application. The parallel environment parameters enable us to determine the communication and computation times of a tile within the hierarchical communication architecture. These values allow us to calculate a ratio between them. This ratio is defined as λ_{(p)(w)}, where p determines the path over which a tile has to communicate with a neighboring tile, e.g. through link A, B, or C (Fig. 1), and w describes the direction of the communication, e.g. up, right, left or down in a four-communication pattern. The ratio is calculated with equation 1, where Comt_{(p)(w)} is the time needed to communicate a tile over path p, and Cpt is the computation time of a tile.

λ_{(p)(w)} = Comt_{(p)(w)} / Cpt    (1)

Finally, once all parameters have been found through the characterization phase, we include the efficiency value in the model and evaluate the execution time. The efficiency value is defined by the variable effic and is included in the model.

4.2 Tile distribution model phase

The main objective of this model is to determine the number of cores Ncores that allows us to maintain the relationship between the maximum strong scalability and the desired efficiency. In this sense, equation 2 calculates the number of cores: the problem size, represented by M², divided by the optimal number of tiles per core, K².

Ncores = M² / K²    (2)

Knowing the value of K, we can estimate the execution time of the SPMD application. Equation 3 represents the behavior of an SPMD application using the overlapping strategy: first the edge tile computation (EdgeComp_{(i)}) is performed, and then the maximum of the internal tile computation (IntComp_{(i)}) and the edge tile communication (EdgeComm_{(i)}) is added. This process is repeated for a set of iterations (iter), where n is the index of an iteration. It models the communication exchange of the SPMD application and can be computed thanks to the deterministic behavior of these applications.

T_{ex_i} = Σ_{n=1}^{iter} ( EdgeComp_{(i)} + Max( IntComp_{(i)}, EdgeComm_{(i)} ) )    (3)

EdgeComp_{(i)} = 4 (K − 1) Cpt    (4)

IntComp_{(i)} = (K − 2)² Cpt    (5)

EdgeComm_{(i)} = K Max( Comt_{(p)(w)} )    (6)

The edge communication (Equ. 6) has to be evaluated for the worst communication case: we use the slowest communication time to estimate the number of tiles necessary to maintain the efficiency. To do this, we calculate the ratio λ_{(p)(w)} (Equ. 1) explained before. Then, the first step is to determine the ideal value of K. We start from the overlapping strategy, in which the internal tile computation (IntComp_{(i)}) and the edge tile communication (EdgeComm_{(i)}) are overlapped. Equation 7 represents the ideal overlapping that allows us to obtain the maximum speedup while the efficiency effic is maintained over a defined threshold. Therefore, we start from an initial condition where the edge communication time is greater than the internal computation time divided by the efficiency; this division represents the maximum inefficiency allowed.
K Max( Comt_{(p)(w)} ) ≥ ((K − 2)² Cpt) / effic    (7)

However, equation 7 has to respect a constraint, defined in equation 8: EdgeComm_{(i)} may exceed IntComp_{(i)} scaled by the defined efficiency (Equ. 7), but EdgeComm_{(i)} must remain smaller than IntComp_{(i)} without any efficiency scaling.

K Max( Comt_{(p)(w)} ) ≤ (K − 2)² Cpt    (8)

To calculate the optimal value of K, we determine λ_{(p)(w)} (Equ. 1) and express the communication time Comt as λ_{(p)(w)} multiplied by the computation time Cpt of a tile. This lets both the internal computation and the edge communication be written as functions of Cpt. Substituting into equation 7, taken as an equality, gives equation 9.

effic K Cpt Max( λ_{(p)(w)} ) = (K − 2)² Cpt    (9)

To find the value of K, we set equation 9 to zero and obtain a quadratic equation (Equ. 10). Its two solutions have to be substituted into equations 7 and 8, with the aim of validating that the K value satisfies the defined constraints.

K² − (4 + effic Max( λ_{(p)(w)} )) K + 4 = 0    (10)

The next step is to calculate the ideal number of cores (Equ. 2) needed to reach the maximum strong scalability with the desired efficiency. To do this, we start from the initial consideration that one ST is assigned to each core, and we apply equation 2, which calculates the ideal number of cores that achieves the stated objective. This number of cores determines the inflection point up to which the application preserves strong scalability. Finally, we can determine the theoretical behavior of the SPMD application for a number of cores lower than the calculated optimum and predict its behavior. Equation 11 calculates the new value of K for a given number of cores, with the objective of determining the execution time with equation 3.

K = sqrt( M² / Ncores )    (11)
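To make the model concrete, the following C sketch (our own illustration; the function names and example values are not from the paper) solves equation 10 for K and applies equations 2 and 11:

    #include <math.h>
    #include <stdio.h>

    /* Equ. 10: K^2 - (4 + effic * lambda_max) * K + 4 = 0.
       lambda_max is the worst-case ratio Comt/Cpt from Equ. 1. */
    static double ideal_K(double effic, double lambda_max)
    {
        double b = 4.0 + effic * lambda_max;
        double disc = b * b - 16.0;   /* b^2 - 4ac with a = 1, c = 4 */
        if (disc < 0.0)
            return -1.0;              /* no real root: model not applicable */
        /* The larger root is the meaningful ST size; the smaller one is
           below the 3x3 minimum an ST needs to have internal tiles. */
        return (b + sqrt(disc)) / 2.0;
    }

    /* Equ. 2: ideal number of cores for an M x M tile problem. */
    static double ideal_ncores(double M, double K) { return (M * M) / (K * K); }

    /* Equ. 11: ST size when running on a given (sub-optimal) core count. */
    static double K_for_ncores(double M, double n) { return sqrt(M * M / n); }

    int main(void)
    {
        double K = ideal_K(0.90, 50.0);   /* illustrative effic and lambda */
        printf("K ~ %.2f, Ncores ~ %.1f (M = 1000)\n",
               K, ideal_ncores(1000.0, K));
        printf("K on 64 cores: %.1f\n", K_for_ncores(1000.0, 64.0));
        return 0;
    }

The roots returned this way should still be checked against the constraints of equations 7 and 8, as prescribed above.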

4.3 Mapping strategy phase

A set of difficulties arises when we allocate SPMD tiles onto distinct cores and these cores have to communicate through different links. Accordingly, the main objective of this phase is to design a strategy for allocating the ST onto each core so as to minimize the communication effects. The ST assignments are made by applying a core affinity, which allocates the sets of tiles according to the policy of minimizing communication latencies [4]. This core affinity identifies where the processes have to be allocated and how each ST is assigned to a core. The next step in this phase is to create a logical process distribution that allows the application to identify its neighbor communications. This is done using a Cartesian topology of the processes, which gives each process two coordinates in the grid distribution. These two coordinates identify the core on which each process has to be allocated. We can also coordinate the communication order with the objective of minimizing link saturation. The last step is to create the ST with the values obtained from the model.

4.4 Scheduling policy phase

The scheduling phase is divided into two main parts: the first develops an execution priority that determines how the tiles have to be executed inside a core, and the second applies an overlapping strategy between internal computation and edge communication tiles. Execution priorities are assigned per tile, and the highest priorities are given to tiles that communicate through the slower paths. The assignments follow these policies: tiles with external communications receive priority 1; these edge tiles are saved in buffers so that they are executed first, and the buffers are updated every iteration. The second assignment is for the internal tiles, which are overlapped with the edge communications and receive priority 2. The overlapping process uses two threads: one performs the internal computation and the other manages the asynchronous communications. These communications enable the internal computation and the edge communication to proceed together (Fig. 5).

Fig. 5: Scheduling policy.
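As an illustration of how these two phases translate into MPI (our sketch, not the paper's code: the paper overlaps communication with a helper thread, whereas here nonblocking calls achieve the same overlap in a single thread; the kernel functions are hypothetical):

    #include <mpi.h>

    /* Hypothetical application kernels standing in for the tile work. */
    void compute_edge_tiles(void);      /* priority 1: tiles on the ST border */
    void compute_internal_tiles(void);  /* priority 2: tiles inside the ST    */

    /* Mapping phase (Sec. 4.3): a Cartesian topology gives every process
       two grid coordinates; reorder = 1 lets the MPI library apply an
       affinity that favors neighbor locality. */
    MPI_Comm make_grid(int rows, int cols)
    {
        int dims[2] = { rows, cols }, periods[2] = { 0, 0 };
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        return cart;
    }

    /* Scheduling phase (Sec. 4.4): compute edge tiles first, start the
       asynchronous edge exchange, overlap it with internal computation.
       Neighbors on the grid boundary come back as MPI_PROC_NULL, for
       which the sends/receives are legal no-ops. */
    void iterate(MPI_Comm cart, double *send, double *recv, int K)
    {
        int up, down, left, right;
        MPI_Request req[8];

        MPI_Cart_shift(cart, 0, 1, &up, &down);     /* four-neighbor pattern */
        MPI_Cart_shift(cart, 1, 1, &left, &right);

        compute_edge_tiles();                       /* priority 1 */

        MPI_Irecv(recv,         K, MPI_DOUBLE, up,    0, cart, &req[0]);
        MPI_Irecv(recv + K,     K, MPI_DOUBLE, down,  0, cart, &req[1]);
        MPI_Irecv(recv + 2 * K, K, MPI_DOUBLE, left,  0, cart, &req[2]);
        MPI_Irecv(recv + 3 * K, K, MPI_DOUBLE, right, 0, cart, &req[3]);
        MPI_Isend(send,         K, MPI_DOUBLE, up,    0, cart, &req[4]);
        MPI_Isend(send + K,     K, MPI_DOUBLE, down,  0, cart, &req[5]);
        MPI_Isend(send + 2 * K, K, MPI_DOUBLE, left,  0, cart, &req[6]);
        MPI_Isend(send + 3 * K, K, MPI_DOUBLE, right, 0, cart, &req[7]);

        compute_internal_tiles();                   /* priority 2, overlapped */

        MPI_Waitall(8, req, MPI_STATUSES_IGNORE);   /* iteration complete */
    }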
5. Combining scalability and efficiency

Our methodology attempts to find the number of cores that achieves the maximum strong scalability with a defined efficiency. However, there are two distinct definitions of scalability in HPC. The first is weak scalability, in which the problem size and the number of processing elements grow together; the main goal is to achieve a constant time-to-solution for larger problems, with the computational load per processor staying constant [3]. The second is strong scalability, in which the problem size is fixed and the number of processing elements is increased; the goal here is to minimize the time-to-solution, so that scalability means the speedup is roughly proportional to the number of processing elements. Under these two definitions, our methodology searches for a combination of strong scalability and efficiency.

This combination means that our analytical model has to determine the number of cores that yields the ideal relationship between speedup and the defined efficiency. This number of cores can be calculated with the model, and it determines the maximum capacity growth of the system. We can also determine the theoretical behavior of the application (Equ. 2); this equation gives the K value that has to be assigned to each core. The model finds only one ideal value that maintains the ideal overlapping. However, we can calculate values for other numbers of cores with the aim of evaluating the performance.

Fig. 6: Combining scalability and efficiency of SPMD applications (columns: Cores, K, Edge Cp, Int Cp, Edge Comm, Exec T).

5.1 A theoretical example

This numerical example illustrates how we can combine the efficiency and strong-scalability concepts. Suppose the following application characteristics: a problem size of M = 1585, a defined efficiency (effic) of 95%, a four-communication pattern, three different communication links (e.g. cache, main memory and network) and a set of nodes with a double quad-core architecture. First we determine λ_{(p)(w)} using equation 1 and take the maximum value obtained. We assume that Cpt = 1 time unit and that the maximum communication time over the slowest communication path is Comt = 100 time units. We then apply our analytical model, using equation 10 to determine the ideal ST and equation 2 to calculate the ideal number of cores, which represents the maximum combination of strong scalability and efficiency. The ideal value of K obtained is around 99, and the ideal number of cores is 256 for this example (Equ. 2). Once K and Ncores are calculated, we determine the efficiency and speedup for this ideal number of cores. To do so, we calculate the serial execution time as the global problem size multiplied by the computation time of one tile, Cpt. For one iteration, the serial time for this specific problem size is 2,512,225 time units (1585² × 1).

Figure 6 shows the results for different numbers of cores, visualizing the efficiency and speedup curves for this example. At the calculated ideal number of cores, the efficiency is around the defined optimal value and the speedup grows roughly linearly up to this point. This point is the maximum strong scalability of the application under the desired efficiency. After this ideal point, the speedup still increases, but not proportionally to the number of cores, and the efficiency begins to decrease considerably due to the communication-bound behavior. Conversely, before the ideal point both efficiency and speedup stay around their maximum values (Fig. 6).
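These figures can be reproduced directly from the model (a sketch with our own variable names, for one iteration, with speedup computed as serial time over the modeled parallel time of equations 3-6):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double M = 1585.0, Cpt = 1.0, Commt = 100.0, effic = 0.95;
        const double lambda = Commt / Cpt;                  /* Equ. 1 */

        /* Equ. 10: K^2 - (4 + effic*lambda)*K + 4 = 0, larger root. */
        double b = 4.0 + effic * lambda;
        double K = (b + sqrt(b * b - 16.0)) / 2.0;          /* ~98.96 */
        double ncores = (M * M) / (K * K);                  /* Equ. 2, ~256 */

        /* One iteration of the overlapped execution model (Equ. 3-6). */
        double edge_comp = 4.0 * (K - 1.0) * Cpt;           /* Equ. 4 */
        double int_comp  = (K - 2.0) * (K - 2.0) * Cpt;     /* Equ. 5 */
        double edge_comm = K * Commt;                       /* Equ. 6 */
        double t_par = edge_comp + fmax(int_comp, edge_comm);
        double t_ser = M * M * Cpt;                         /* serial time */

        double speedup = t_ser / t_par;
        printf("K ~ %.0f, Ncores ~ %.0f\n", K, ncores);
        printf("speedup ~ %.1f, efficiency ~ %.2f\n",
               speedup, speedup / ncores);
        return 0;
    }

With the example's inputs this yields a speedup of roughly 244 on about 256 cores, i.e. an efficiency of roughly 0.95, consistent with the defined threshold.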
6. Performance Evaluation

The experiments to test our methodology were conducted on two multicore clusters. The first is a DELL cluster with 8 nodes, each with 2 quad-core Intel Xeon E5430 processors at 2.66 GHz, 6 MB of L2 cache shared by each pair of cores, and 12 GB of RAM. The second is an IBM cluster with 32 nodes, each with 2 dual-core Intel Xeon 5150 processors, 4 MB of cache shared by each pair of cores, 12 GB of RAM, and a Gigabit Ethernet network. Both clusters run Open MPI. To validate the results of this article, we chose two applications: heat transfer, and a fluid dynamics application (LL-2D-STD-MPI) integrated in the MP-Labs suite.

6.1 Efficiency and scalability evaluation

The main objective of this evaluation is to demonstrate the improvement obtained by applying our methodology. Table 1 shows the characterization values of the computation time (Cpt) and the slowest communication time Comt_{(p)(w)} of a tile, the problem size and the desired efficiency, and also the theoretical values of the number of cores, the edge and internal computation, the edge communication and the number of tiles. For our performance analysis, we executed the SPMD applications comparing three cases: the theoretical values, the application without our methodology, and the application with it. In this sense, figure 7 shows the efficiency behavior of the heat transfer application executed with 100 iterations.

Fig. 7: Efficiency of the heat transfer application on the DELL cluster.

Figure 7 illustrates a considerable efficiency improvement, of around 42%, when we execute with the number of cores determined by our model. We can also observe that the application using our methodology behaves similarly to the analytical model, with an error rate of around 5% when the number of cores is below the maximum obtained from our model (Fig. 7). Figure 8 shows how the speedup increases as more cores are added, but this speedup does not scale linearly beyond the maximum number of cores determined by our model. The calculated ideal point meets the maximum strong scalability while the efficiency remains over the defined threshold.

Fig. 8: Speedup of the heat transfer application on the DELL cluster.

On the other hand, the LL-2D-STD-MPI application consists of three main parts: prestep, poststep, and the main module where the communication and the computation are performed. We apply our methodology and the tile characterization process to the last module, because the other two only compute and do not perform any communication; this application was also tested with 100 iterations. Figure 9 shows the improvement in efficiency between the original version and the version with our methodology applied, with an error rate of 4% with respect to the analytical model.

Table 1: Tile distribution model evaluation examples

App.       | Cpt      | Comt_{(p)(w)} | M         | effic | K    | IntComp     | EdgeComp   | EdgeComm   | Ncore | Cluster
Heat Tran  | 0.021 µs | 58.8 µs       | 9500x9500 | - %   | 2375 | 1.18E-01 s  | 1.99E-04 s | 1.40E-01 s | 16    | DELL
LL-2D-STD  | 0.24 µs  | 60.7 µs       | 2000x2000 | - %   | 249  | 1.458E-02 s | 2.34E-04 s | 1.48E-02 s | 64    | IBM

Fig. 9: Efficiency of the LL-2D-STD-MPI application on the IBM cluster.
Fig. 10: Speedup of the LL-2D-STD-MPI application on the IBM cluster.

Similarly to the heat transfer application, the model gives us the ideal ST and the number of cores that maintain the relationship between efficiency and scalability (Table 1). Finally, figure 10 shows the behavior of the speedup and the strong scalability of this application: the speedup is linear while the number of cores is below the theoretical value. This allows us to conclude that our methodology can determine the maximum strong scalability, combining the maximum speedup with an efficiency over a defined threshold. The purpose of testing the applications on two clusters was to check the functionality of our method on different multicore architectures. These two examples show the accuracy of our method and how the maximum speedup is reached with the efficiency defined in the model.

7. Conclusion and Future Work

This work addresses how efficiency and strong scalability can be combined in parallel applications. It presented a novel methodology based on a characterization, a tile distribution model, a mapping strategy and a scheduling policy. These phases allowed us to find, through an analytical model, the optimal size of the Supertile and the number of cores needed to accomplish the stated objective. The model focuses on managing the hierarchical communication architecture present on multicore clusters. The experiments demonstrated that this optimal size can achieve the conditions of maximum speedup and efficiency over a defined threshold. To achieve this, we have proposed an appropriate way to manage the inefficiencies generated by the communication links present on multicore clusters, as described. In addition, with our method we can observe how SPMD applications with the specified characteristics behave for a given problem size as the number of cores is increased. This is the main purpose of finding the maximum point up to which the SPMD application scales linearly. Future work will focus on heterogeneous computation on multicore environments, with the aim of executing SPMD applications efficiently in environments that are heterogeneous in both communication and computation.

References

[1] I. M. Nielsen and C. L. Janssen, "Multicore challenges and benefits for high performance scientific computing," Scientific Programming, vol. 16.
[2] M. McCool, "Scalable programming models for massively multicore processors," Proc. of the IEEE, vol. 96, no. 5.
[3] L. Peng, M. Kunaseth, H. Dursun, K. Nomura, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable hierarchical parallelization framework for molecular dynamics simulation on multicore clusters," Proc. of the Int. Conf. on Parallel and Distributed Processing Techniques and Applications, USA.
[4] G. Mercier and J. Clet-Ortega, "Towards an efficient process placement policy for MPI applications in multicore environments," EuroPVM/MPI 2009.
[5] L. M. Liebrock and S. P. Goudy, "Methodology for modelling SPMD hybrid parallel computation," Concurrency and Computation: Practice and Experience, vol. 20, no. 8.
[6] G. Cong and D. A. Bader, "Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors," The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07).
[7] K. Vikram and V. Vasudevan, "Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures," IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 9, p. 1010.
[8] R. Muresano, D. Rexachs, and E. Luque, "Methodology for efficient execution of SPMD applications on multicore clusters," 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID), IEEE Computer Society, 2010.
[9] J. Brehm, P. H. Worley, and M. Madhukar, "Performance modeling for SPMD message-passing programs," Concurrency: Practice and Experience, vol. 10, no. 5.
[10] O. Beaumont, A. Legrand, and Y. Robert, "Optimal algorithms for scheduling divisible workloads on heterogeneous systems," 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), p. 98.
[11] J. B. Weissman and X. Zhao, "Scheduling parallel applications in distributed networks," Cluster Computing, vol. 1.
[12] M. Panshenskov and A. Vakhitov, "Adaptive scheduling of parallel computations for SPMD tasks," ICCSA 2007.
[13] R. Van der Wijngaart and H. Jin, "NAS parallel benchmarks, multi-zone versions," NASA Advanced Supercomputing Division, Ames Research Center, USA, Tech. Rep.
[14] T. Lee and C.-L. Lin, "A stable discretization of the lattice Boltzmann equation for simulation of incompressible two-phase flows at high density ratio," J. Comput. Phys., vol. 206, June 2005.


Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Energy Constrained Resource Scheduling for Cloud Environment

Energy Constrained Resource Scheduling for Cloud Environment Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering

More information

Using Cloud Computing for Solving Constraint Programming Problems

Using Cloud Computing for Solving Constraint Programming Problems Using Cloud Computing for Solving Constraint Programming Problems Mohamed Rezgui, Jean-Charles Régin, and Arnaud Malapert Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, 06900 Sophia Antipolis, France

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application 2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

An Architecture Model of Sensor Information System Based on Cloud Computing

An Architecture Model of Sensor Information System Based on Cloud Computing An Architecture Model of Sensor Information System Based on Cloud Computing Pengfei You, Yuxing Peng National Key Laboratory for Parallel and Distributed Processing, School of Computer Science, National

More information

Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems

Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems Nagaveni V # Dr. G T Raju * # Department of Computer Science and

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems

An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems Ardhendu Mandal and Subhas Chandra Pal Department of Computer Science and Application, University

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Dynamic resource management for energy saving in the cloud computing environment

Dynamic resource management for energy saving in the cloud computing environment Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan

More information

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE 1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

More information

DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH

DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor pneelakantan@rediffmail.com ABSTRACT The grid

More information

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING José Daniel García Sánchez ARCOS Group University Carlos III of Madrid Contents 2 The ARCOS Group. Expand motivation. Expand

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

More information

This is an author-deposited version published in : http://oatao.univ-toulouse.fr/ Eprints ID : 12902

This is an author-deposited version published in : http://oatao.univ-toulouse.fr/ Eprints ID : 12902 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

ENERGY EFFICIENT VIRTUAL MACHINE ASSIGNMENT BASED ON ENERGY CONSUMPTION AND RESOURCE UTILIZATION IN CLOUD NETWORK

ENERGY EFFICIENT VIRTUAL MACHINE ASSIGNMENT BASED ON ENERGY CONSUMPTION AND RESOURCE UTILIZATION IN CLOUD NETWORK International Journal of Computer Engineering & Technology (IJCET) Volume 7, Issue 1, Jan-Feb 2016, pp. 45-53, Article ID: IJCET_07_01_006 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=7&itype=1

More information

Towards a Load Balancing in a Three-level Cloud Computing Network

Towards a Load Balancing in a Three-level Cloud Computing Network Towards a Load Balancing in a Three-level Cloud Computing Network Shu-Ching Wang, Kuo-Qin Yan * (Corresponding author), Wen-Pin Liao and Shun-Sheng Wang Chaoyang University of Technology Taiwan, R.O.C.

More information

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Efficient DNS based Load Balancing for Bursty Web Application Traffic

Efficient DNS based Load Balancing for Bursty Web Application Traffic ISSN Volume 1, No.1, September October 2012 International Journal of Science the and Internet. Applied However, Information this trend leads Technology to sudden burst of Available Online at http://warse.org/pdfs/ijmcis01112012.pdf

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information