Computer Technology and Application 4 (2013) 543-547

The Methodology of Application Development for Hybrid Architectures

Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok
Department of Computer Systems Engineering and Informatics, Saint-Petersburg State Electrotechnical University LETI, Saint-Petersburg 197376, Russia

Received: September 12, 2013 / Accepted: September 30, 2013 / Published: October 25, 2013.

Abstract: This paper provides an overview of the main recommendations and approaches of a methodology for developing parallel computation applications for hybrid structures. The methodology was developed within the master's thesis project "Optimization of complex tasks computation on hybrid distributed computational structures" accomplished by Orekhov, whose main research objective was to determine patterns in the behavior of scaling efficiency and of other parameters that define the performance of different algorithm implementations executed on hybrid distributed computational structures. The major outcomes and dependencies obtained within the project were formed into a methodology which covers the problems of applications based on parallel computations and describes their development process in detail, offering easy ways of avoiding potentially crucial problems. The paper is backed by real-life examples, such as clustering algorithms, instead of artificial benchmarks.

Key words: Hybrid architectures, parallel computing, simulation modeling.

1. Introduction

The research consisted of two main phases. The preliminary phase included serial execution of a number of algorithms and parallelization of these algorithms on an elementary hybrid system (one CPU and one OpenCL-enabled GPU). This phase aimed at researching ways of efficiently parallelizing algorithm implementations using the GPGPU paradigm and OpenCL.
During the preliminary phase the main monitored parameter was the speedup compared with serial execution. The main phase focused on the scaling efficiency parameter, and the algorithms from the previous phase were executed on the hybrid cluster of the Peterhof Telecommunication Center. During this phase numerous experiments were conducted and approaches were tested. Such dependencies as the dependence of scaling efficiency on task difficulty and input size were gathered and analyzed, and this in-depth analysis formed the methodology described in this paper.

This paper consists of three major sections: Section 2 covers the recommendations concerning preliminary algorithm analysis and simulation based on the simulation performance modeling approach; Section 3 describes the approach to determining optimal scaling efficiency; Section 4 covers the recommendations on the choice of the target computational system.

2. Preliminary Algorithm Analysis and Task Simulation

The following two sections (2 and 3) cover the problem of optimizing algorithm implementations in order to achieve high performance on a given system with certain constraints. Gebali defines it as a route from Parallel Computer Space to Algorithm Space [1]. Before the actual development of an application based on parallel computing, developers should go through several steps, which are illustrated in Fig. 1.

Corresponding author: Vladimir Orekhov, M.Sc., research fields: software engineering. E-mail: orekhov.volodya@gmail.com.
The steps shown in Fig. 1 are based on simulation performance modeling and will be discussed in more detail in the following subsections.

2.1 Execution Scheme Design

The execution scheme design step includes the design of a general execution scheme which, at a high level of abstraction, shows the main elements: blocks of code executed in parallel, blocks of code executed in series, and data transfers between the host and the computational devices. An example of an execution scheme for the K-means clustering algorithm is represented in Fig. 2.

The process of execution scheme design is iterative; in each iteration, developers should try to maximize two main parameters: the parallelization coefficient and the number of stand-alone computations (meaning the amount of computation during which the host-to-device channel is not used to transfer data). Although these two parameters seem to have the same meaning, it is very important to understand the difference between them. The parallelization coefficient shows the overall share of computations which are executed in parallel:

α = Cp / C

where Cp is the amount of computations which are executed in parallel and C is the overall amount of computations (both parallel and serial).

Fig. 1 The scheme of preliminary algorithm analysis and task simulation.
Fig. 2 Sample execution scheme for the K-means clustering algorithm.

The number of stand-alone computations, however, is a more specific parameter: it is the number of computations during the execution of which the host-to-device channel is not used to transfer data. This parameter takes the data transfer costs into consideration, and maximizing it means maximizing the share of the task executed solely on the computational devices.
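As an illustration, the parallelization coefficient can be estimated directly from an execution scheme annotated with operation counts. The following sketch is illustrative only; the block names and operation counts are assumptions, not measurements from the research:

```python
def parallelization_coefficient(blocks):
    """blocks: list of (op_count, runs_in_parallel) pairs taken from
    an annotated execution scheme; returns Cp / C."""
    total = sum(ops for ops, _ in blocks)
    parallel = sum(ops for ops, is_parallel in blocks if is_parallel)
    return parallel / total

# Hypothetical K-means-like scheme: the distance computation runs in
# parallel, while centroid recomputation and the convergence check
# run in series on the host.
scheme = [(1_000_000, True),   # point-to-centroid distances
          (16_384, False),     # centroid recomputation
          (1, False)]          # convergence check
alpha = parallelization_coefficient(scheme)
```

Annotating the scheme in this form also makes the iterative redesign loop concrete: after each redesign the coefficient can be recomputed and compared with the previous iteration.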
2.2 Task Simulation

The task simulation step includes the design, on the basis of the execution scheme, of a task which will simulate the behavior of the target task. Since at this step we focus on time complexity, the actual implementation of the algorithm's steps is not important; what is important is to simulate the same amount of computations via something basic like simple arithmetic operations. In the simulated task, all the computations should be replaced with an equivalent number of simple operations. This allows building a model with the same computational difficulty as the target task while notably cutting development time.

2.3 Executing the Simulated Task

During the execution step, the simulated task should
be executed on the target computational system. After that, it is time to analyze the execution results. The results can include such base parameters as:
- execution time;
- execution speed: although execution time is usually measured in seconds [1], if the number of iterations is not constant (as it is for clustering algorithms), the speed needs to be measured in iterations per second.

These base parameters show the performance of a particular task executed on a particular computational system. At the same time, knowing the simple fact that X executes faster than Y is not enough; a more detailed comparison of X and Y is needed to organize a proper analysis. Derivative parameters can be used for that purpose, in particular the speedup factor, which will be called scaling efficiency in this article:

S(N) = T(1) / T(N) = V(N) / V(1)

where T(N) is the execution time on N computational devices and V(N) is the execution speed on N computational devices. Comparing this coefficient with N shows the proximity to the ideal scaling efficiency.

One of the algorithms examined during the research was the K-means clustering algorithm, an iterative algorithm: in each iteration the nodes act entirely independently of each other, but after each iteration there is a significant communication overhead [2]. For the research purposes the communication overhead was minimized by maximizing the amount of work to do in each iteration (the number of clusters was increased up to 16,384). The scaling efficiency chart for the K-means clustering algorithm illustrates the meaning of the scaling efficiency parameter and is represented in Fig. 3. In Fig. 3, the real scaling repeats the speedup curve of Amdahl's law [3].

If the results are satisfactory, the execution scheme is proven to be appropriate and the application should be developed based on it. If the results are not satisfactory, either the execution scheme needs to be redesigned or the target computational system has to be changed.
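The simulated task of Section 2.2 and the derivative parameters above can be sketched together as follows. This is a minimal sketch under stated assumptions: the kernel body, the timing values and the device counts are hypothetical and merely resemble the shape of Fig. 3:

```python
def simulated_kernel(op_count):
    """Stand-in for a real computational block (Section 2.2): performs
    op_count simple arithmetic operations instead of the algorithm's
    actual work, preserving its computational difficulty."""
    acc = 0.0
    for i in range(op_count):
        acc += i * 0.5  # one representative simple operation
    return acc

def speedup(t1, tn):
    """Speedup factor S(N) = T(1) / T(N)."""
    return t1 / tn

def scaling_efficiency(t1, tn, n):
    """Proximity to ideal scaling on n devices: S(N) / n."""
    return speedup(t1, tn) / n

# Hypothetical timings resembling Fig. 3: near-ideal scaling on
# 2 GPUs, roughly half of the ideal on 8 GPUs.
timings = {1: 100.0, 2: 51.0, 8: 25.0}
eff_2 = scaling_efficiency(timings[1], timings[2], 2)  # ~0.98
eff_8 = scaling_efficiency(timings[1], timings[8], 8)  # 0.5
```

In practice the timings dictionary would be filled by executing the simulated kernel on 1, 2, ..., N devices of the target system rather than hard-coded.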
Quite frequently it is impossible to execute the simulated task on the target computational system, either because the system does not exist yet (for example, all parts of the hybrid cluster have been bought but are not yet set to work) or because it is not available for simulation purposes. In that case the simulated task may be executed on the closest available computational system (here "closest" refers to the structure and size of the system). Martinez suggests that in this case a combination of analytical modeling, simulation and empirical experimentation is required [4]. In terms of this paper, analytical modeling is related to the execution scheme design step, while simulation is related to the task simulation and simulated task execution steps. Empirical experimentation covers the whole iterative process of preliminary algorithm analysis and task simulation described in Fig. 1.

3. Determination of Scaling Efficiency

When we speak about effective scaling in terms of hybrid clusters, one of the main trends is the following: scaling can be efficient up to some critical number of computational devices, after which it becomes inefficient. Gebali mentions [1] that for large numbers of computational devices the speedup is approximated by

S ≈ 1 / (1 - f)

where f is the fraction of the computations that can be executed in parallel. The critical number of computational devices is a crucially important constant, as it describes the relation between a specific task (usually an implementation of some algorithm) and a specific computational system. However, it can take much effort to detect this number, and in most cases doing so is unnecessary. All we need to know is:
- the upper bound for the number of computational devices which are supposed to be used for this task;
- whether this upper bound is smaller than the critical number, i.e., whether we lose scaling efficiency before reaching it.
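Both the analytical bound and the empirical check can be sketched in a few lines. This is an illustrative sketch only: the parallel fraction, the efficiency bound and the timing values are assumptions, and the 20% bottom bound merely echoes the example figure used later in this section:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup with parallel fraction f on n devices."""
    return 1.0 / ((1.0 - f) + f / n)

def critical_device_count(f, min_efficiency=0.2):
    """Smallest device count at which the efficiency S(n)/n falls
    below the chosen bottom bound."""
    n = 1
    while amdahl_speedup(f, n) / n >= min_efficiency:
        n += 1
    return n

def scaling_breaks_before(timings, upper_bound):
    """Empirical check on simulated-task timings: returns the first
    tested device count within the upper bound that ran no faster
    than a smaller count, or None if scaling never degrades."""
    counts = sorted(n for n in timings if n <= upper_bound)
    for prev, cur in zip(counts, counts[1:]):
        if timings[cur] >= timings[prev]:
            return cur
    return None

# Hypothetical measurements reproducing Case 2 below: 32 devices are
# slower than 16, so scaling up to a planned bound of 40 devices is
# not effective and the scheme (or the system) must be changed.
measured = {8: 40.0, 16: 26.0, 32: 31.0}
problem_at = scaling_breaks_before(measured, upper_bound=40)  # -> 32
```

Note that for f = 0.95 the analytical speedup is capped near 1 / (1 - 0.95) = 20 no matter how many devices are added, which is why the critical number exists at all.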
Fig. 3 Scaling efficiency analysis: ideal scaling vs. real scaling for 16,384 clusters, plotted against GPU quantity.

It is important to know these parameters as early as possible to avoid the problems of inefficient scaling. Sometimes the upper bound of scaling is known at the very beginning of the development process. Then the goal is to determine whether the scaling is effective for this number of computational devices, for a particular algorithm implementation and on a particular computational system. In these terms, if scaling for an exact upper bound of computational devices is not effective (see Case 2 described below), either the execution scheme should be redesigned or the target computational system should be changed; this approach fits exactly into the scheme illustrated in Fig. 1.

A deeper understanding of scaling efficiency can be obtained by simulating and testing tasks on the target systems and building scaling efficiency charts as shown in Fig. 3. Such a chart basically shows how much of the hardware potential is used when scaling to more computational devices. For instance, Fig. 3 shows a scaling efficiency of nearly 100% on 2 GPUs and of only about 50% on 8 GPUs. It is important to define the bottom bound of scaling efficiency, below which we do not want to continue the scaling process (it may be, for instance, 20%).

Several real-life cases will help to better understand the subject:

Case 1 describes the necessity of knowing the upper bound of scaling before the beginning of the development process. The application was simulated, developed and tested for an upper bound of 20 computational devices. After a while, the target hybrid cluster grows and receives 10 extra computational devices, and the application is then executed on a total of 30 computational devices.
In the worst-case scenario, it will run slower on 30 devices than on 20, which is, obviously, an extremely inefficient use of computational capabilities.

Case 2 describes how the scaling efficiency problem can be detected and eliminated at an early stage. The upper bound is known to be 40 devices, but the results of an iterative execution have clearly shown that the task mapped on 16 devices executes faster than the task mapped on 32 devices. Therefore, this combination of an algorithm implementation and a target computational system is not able to provide effective scalability up to 40 devices.

4. Optimal Target System Choice

The following section covers the inverse of the problem discussed in Sections 2 and 3: in this case the actual algorithm is given and the task is to find an appropriate
computational system for this algorithm. An important question during the parallel application development process is the choice of a target computational system (meaning the hardware platform). The first thing to think about is whether the application can be scaled efficiently. If it can, the target computational platform may be a hybrid computational system containing several computational devices. Both scientists and developers often set quite ambitious goals implying efficient scaling on up to hundreds of computational devices; in this case, poor scalability is fatal. At the same time, if the goal is not that ambitious (for instance, achieving a 40x speedup), it is good to remember the following: inefficient scaling does not mean that the task cannot be effectively executed in parallel. If the task cannot be scaled efficiently, it can always be executed on one computational device; in 2013 practically every PC is a simple hybrid system, with the CPU representing the host device and the GPU representing the computational device, and can cope with this task.

When we speak about parallel computations on hybrid systems, it is important to choose between the OpenCL and NVidia CUDA technologies; this choice heavily affects the target computational system choice. OpenCL is a free standard supported by many hardware developers, including AMD, Intel and NVidia. The basic idea behind OpenCL is to create a unified framework for application development on hybrid structures: OpenCL hides all the implementation details and represents complex hybrid systems as a single system with a number of computational devices [5, 6]. CUDA is focused solely on massively parallel computing on NVidia GPUs; since it targets such a narrow area, CUDA naturally shows better performance on NVidia GPUs compared with OpenCL.
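This choice can be summarized as a simple decision rule. The sketch below is a hypothetical illustration; the device-inventory format and the function are assumptions introduced here, not part of either API:

```python
def choose_paradigm(devices):
    """devices: list of (vendor, kind) tuples describing the target
    system's computational devices (illustrative inventory format).
    CUDA is only an option when every device is an NVidia GPU, where
    it is usually faster; OpenCL covers practically every hybrid
    system."""
    if devices and all(vendor == "NVidia" and kind == "GPU"
                       for vendor, kind in devices):
        return "CUDA or OpenCL"
    return "OpenCL"

# A mixed-vendor hybrid system leaves OpenCL as the only option.
mixed_system = [("NVidia", "GPU"), ("AMD", "GPU"), ("Intel", "CPU")]
paradigm = choose_paradigm(mixed_system)  # -> "OpenCL"
```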
If NVidia CUDA is chosen as the dominant technology, there is no option other than a GPU cluster containing numerous NVidia GPUs. If we choose OpenCL, practically every hybrid system can be a target system.

5. Conclusions

The described methodology covers the development process of a parallel application, including preliminary algorithm analysis, task simulation, scalability questions, and the choice of the dominant technology and target system. The methodology is based on in-depth studies of algorithm implementation for hybrid distributed structures, one of the central aspects of which was scaling efficiency. All the main principles are illustrated by real-life examples.

Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research, project no. 13-07-00747-a, and St. Petersburg State University, project no. 9.38.674.2013.

References
[1] F. Gebali, Algorithms and Parallel Computing, Wiley, Hoboken, NJ, 2011.
[2] N. Matloff, Programming on Parallel Machines: GPU, Multicore, Clusters and More, University of California, CA, 2012.
[3] G.M. Amdahl, Validity of the single-processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings 30 (1967) 483-485.
[4] D.R. Martinez, V. Blanco, M. Boullon, J.C. Cabaleiro, T.F. Pena, Analytical performance models of parallel programs in clusters, Parallel Computing: Architectures, Algorithms and Applications 15 (2008) 99-106.
[5] J. Kowalik, T. Puzniakowski, Using OpenCL: Programming Massively Parallel Computers, IOS Press, Amsterdam, Netherlands, 2012.
[6] A. Munshi, The OpenCL Specification, Ver. 1.2, Khronos OpenCL Working Group, 2012.