Computer Technology and Application 4 (2013) 543-547

The Methodology of Application Development for Hybrid Architectures

Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok
Department of Computer Systems Engineering and Informatics, Saint-Petersburg State Electrotechnical University LETI, Saint-Petersburg 197376, Russia

Received: September 12, 2013 / Accepted: September 30, 2013 / Published: October 25, 2013.

Corresponding author: Vladimir Orekhov, M.Sc., research field: software engineering. E-mail: orekhov.volodya@gmail.com.

Abstract: This paper provides an overview of the main recommendations and approaches of a methodology for developing parallel computation applications for hybrid structures. The methodology was developed within the master's thesis project "Optimization of complex tasks computation on hybrid distributed computational structures" accomplished by Orekhov, whose main research objective was to determine patterns in the behavior of scaling efficiency and other parameters that define the performance of different algorithm implementations executed on hybrid distributed computational structures. The major outcomes and dependencies obtained within the project were formed into a methodology which covers the problems of applications based on parallel computations, describes their development process in detail, and offers straightforward ways of avoiding potentially crucial problems. The paper is backed by real-life examples, such as clustering algorithms, instead of artificial benchmarks.

Key words: Hybrid architectures, parallel computing, simulation modeling.

1. Introduction

The research consisted of two main phases. The preliminary phase included serial execution of a number of algorithms and parallelization of these algorithms on an elementary hybrid system (one CPU and one OpenCL-enabled GPU). This phase aimed at researching ways to efficiently parallelize algorithm implementations using the GPGPU paradigm and OpenCL. During the preliminary phase, the main monitored parameter was the speedup compared to serial execution. The main phase focused on the scaling efficiency parameter: the algorithms from the previous phase were executed on the hybrid cluster of the Peterhof Telecommunication Center. During this phase, numerous experiments were conducted and approaches were tested. Dependencies such as that of scaling efficiency on task difficulty and input size were gathered and analyzed, and this in-depth analysis formed the methodology described in this paper.

This paper consists of three major sections: Section 2 covers recommendations concerning preliminary algorithm analysis and simulation based on the simulation performance modeling approach; Section 3 describes the approach to determining scaling efficiency; Section 4 covers recommendations on the choice of the target computational system.

2. Preliminary Algorithm Analysis and Task Simulation

The following two sections (2 and 3) cover the problem of optimizing algorithm implementations in order to achieve high performance on a given system with certain constraints. Gebali defines it as a route from "Parallel Computer Space" to "Algorithm Space" [1]. Before the actual development of an application based on parallel computing, developers should go through several steps, which are illustrated in Fig. 1.

The steps shown in Fig. 1 are based on simulation performance modeling and will be discussed in more detail in the following subsections.

Fig. 1 The scheme of preliminary algorithm analysis and task simulation.

2.1 Execution Scheme Design

The execution scheme design step includes the design of a general execution scheme which, on a high level of abstraction, shows the main elements: blocks of code executed in parallel, blocks of code executed in series, and data transfers between the host and the computational devices. An example of an execution scheme for the K-means clustering algorithm is represented in Fig. 2.

Fig. 2 Sample execution scheme for the K-means clustering algorithm.

The process of execution scheme design is iterative; in each iteration, developers should try to maximize two main parameters: the parallelization coefficient and the number of stand-alone computations (the amount of computations during which the channel between the host and the computational devices is not used to transfer data). Although these two parameters seem to have the same meaning, it is very important to understand the difference between them. The parallelization coefficient shows the overall share of computations which are executed in parallel:

f = Cp / C

where Cp is the amount of computations executed in parallel, and C is the overall amount of computations (both parallel and serial). The number of stand-alone computations, however, is a more specific parameter: it is the number of computations during the execution of which the host-to-device channel is not used to transfer data. This parameter takes data transfer costs into consideration, and maximizing it means maximizing the share of the task executed solely on the computational devices.

2.2 Task Simulation

The task simulation step includes the design of a task, on the basis of the designed execution scheme, which will simulate the behavior of the target task. Since at this step we focus on time complexity, the actual implementation of the algorithm's steps is not important; what is important is to simulate the same amount of computations via something basic like simple arithmetic operations. In this simulated task, all the computations should be replaced with an equivalent number of simple operations. This allows achieving a model with the same computational difficulty as the target task, notably cutting the development time. A minimal sketch of such a simulated task is shown below.
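The following C sketch illustrates the idea under stated assumptions: each block of the execution scheme is replaced by a calibrated number of simple arithmetic operations, so only the computational volume of the target task is reproduced. The function name simulate_block and the operation counts are hypothetical placeholders, not values from the original study.

#include <stdio.h>

/* Stand-in for one block of the execution scheme: performs `ops`
   simple arithmetic operations instead of the real algorithm step,
   reproducing only its computational volume. */
static double simulate_block(long ops)
{
    volatile double acc = 1.0;      /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < ops; i++)
        acc = acc * 1.000001 + 0.5; /* one multiply and one add per iteration */
    return acc;
}

int main(void)
{
    /* Hypothetical operation counts estimated from the execution scheme. */
    const long serial_ops   = 2000000L;  /* block executed in series on the host  */
    const long parallel_ops = 80000000L; /* block to be executed on the device(s) */

    simulate_block(serial_ops);   /* simulates the serial part   */
    simulate_block(parallel_ops); /* simulates the parallel part */

    printf("parallelization coefficient f = %.3f\n",
           (double)parallel_ops / (double)(serial_ops + parallel_ops));
    return 0;
}

In the actual simulation, the parallel block would be mapped onto the computational devices (e.g., as an OpenCL kernel whose work items perform the same calibrated operation count), while the serial block stays on the host.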

2.3 Executing the Simulated Task

During the execution step, the simulated task should be executed on the target computational system. After that, it is time to analyze the execution results. The results can include such base parameters as:
- execution time;
- execution speed: although execution time is usually measured in seconds [1], if the number of iterations is not constant (as it is for clustering algorithms), speed needs to be measured in iterations per second.

These base parameters show the performance of a particular task executed on a particular computational system. At the same time, knowing the simple fact that X executes faster than Y is not enough: we need a more detailed comparison of X and Y to organize a proper analysis. Derivative parameters can be used for that purpose, primarily the speedup factor, which will be called scaling efficiency in this article:

S(N) = T(1) / T(N) = V(N) / V(1)

where T(N) is the execution time on N computational devices, and V(N) is the execution speed on N computational devices. The coefficient E(N) = S(N) / N shows the proximity to the ideal scaling efficiency.

One of the algorithms examined during the research was the K-means clustering algorithm, which is iterative: in each iteration, the nodes act entirely independently of each other, but after each iteration there is a significant communication overhead [2]. For the research purposes, the communication overhead was minimized by maximizing the amount of work done in each iteration (the number of clusters was increased up to 16,384). The scaling efficiency chart for the K-means clustering algorithm, represented in Fig. 3, illustrates the meaning of the scaling efficiency parameter. In Fig. 3, the real scaling repeats the speedup curve of Amdahl's law [3].

If the results are satisfactory, the execution scheme is proven to be appropriate, and the application should be developed based on it. If the results are not satisfactory, either the execution scheme needs to be redesigned or the target computational system has to be changed.

Quite frequently, it is impossible to execute the simulated task on the target computational system, because it either does not exist yet (for example, all parts for the hybrid cluster are bought but not yet set to work) or is not available for simulation purposes. In that case, the simulated task may be executed on the closest available computational system (where "the closest" refers to the structure and size of the system). Martinez suggests that in this case a combination of analytical modeling, simulation and empirical experimentation is required [4]. In terms of this paper, analytical modeling is related to the execution scheme design step; simulation is related to the task simulation and execution of the simulated task steps; empirical experimentation covers the whole iterative process of preliminary algorithm analysis and task simulation described in Fig. 1.

3. Determination of Scaling Efficiency

When we speak about effective scaling in terms of hybrid clusters, one of the main trends is the following: scaling can be efficient up to some critical number of computational devices, and then it becomes inefficient. Gebali mentions [1] that for a large number of computational devices, the speedup is approximated by

S(N) ≈ 1 / (1 - f)

where f is the parallelization coefficient from Section 2.1. This critical number of computational devices is a crucially important constant, as it describes the relation between a specific task (usually an implementation of some algorithm) and a specific computational system. However, it can take much effort to detect this number, and in most cases doing so is unnecessary.
All we need to know is:
- the upper bound on the number of computational devices that are supposed to be used for the task;
- whether this upper bound is smaller than the critical number, i.e., whether we lose scaling efficiency before reaching it.

A minimal way to check this from measured execution times is sketched below.
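As a concrete illustration of the derivative parameters from Section 2.3, the following C sketch computes the speedup S(N) = T(1)/T(N) and the proximity coefficient E(N) = S(N)/N from a table of measured execution times and flags the first device count at which E(N) falls below a chosen bottom bound. The timing values are invented for illustration; only the formulas come from the paper.

#include <stdio.h>

#define NUM_RUNS 6

int main(void)
{
    /* Hypothetical measurements: execution time T(N) for N devices.
       Real values would come from executing the simulated task. */
    const int    devices[NUM_RUNS] = { 1,    2,    4,    8,    16,   32   };
    const double time_s[NUM_RUNS]  = { 96.0, 49.0, 26.5, 17.8, 15.9, 15.5 };

    const double bottom_bound = 0.20; /* stop scaling below 20% efficiency */

    printf("  N    T(N)    S(N)=T(1)/T(N)   E(N)=S(N)/N\n");
    for (int i = 0; i < NUM_RUNS; i++) {
        double speedup    = time_s[0] / time_s[i];
        double proximity  = speedup / devices[i];
        printf("%3d  %6.1f  %14.2f  %12.2f%s\n",
               devices[i], time_s[i], speedup, proximity,
               proximity < bottom_bound ? "  <- below bottom bound" : "");
    }
    return 0;
}

Run against real measurements, the same loop answers both questions from the list above: the largest flagged-free N is where efficient scaling ends.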

Fig. 3 Scaling efficiency analysis: ideal scaling vs. real scaling for 16,384 clusters, plotted against GPU quantity.

It is important to know these parameters as early as possible, to avoid the problems of inefficient scaling. Sometimes the upper bound of scaling is known at the very beginning of the development process. Then the goal is to determine whether the scaling is effective for this number of computational devices, for a particular algorithm's implementation, on a particular computational system. In these terms, if scaling up to the exact upper bound of computational devices is not effective (see Case 2 described below), either the execution scheme should be redesigned or the target computational system should be changed; this approach fits exactly into the scheme illustrated in Fig. 1.

A deeper understanding of scaling efficiency can be obtained by simulating and testing tasks on the target systems and building scaling efficiency charts as shown in Fig. 3. Such a chart basically shows how much of the hardware potential is used when scaling to more computational devices. For instance, Fig. 3 shows a scaling efficiency of nearly 100% on 2 GPUs and of only about 50% on 8 GPUs. It is important to set the bottom bound of scaling efficiency, below which we do not want to continue the scaling process (it may be, for instance, 20%).

Several real-life cases will help to better understand the subject.

Case 1 describes the necessity of knowing the upper bound of scaling before the beginning of the development process. An application was simulated, developed and tested for an upper bound of 20 computational devices. After a while, the target hybrid cluster grows and receives 10 extra computational devices, and the application is then executed on a total of 30 computational devices. In the worst-case scenario, it will run slower on 30 devices than on 20 devices, which is, obviously, an extremely inefficient use of computational capabilities.

Case 2 describes how the scaling efficiency problem can be detected and eliminated at an early stage. The upper bound is known to be 40 devices, but the results of an iterative execution clearly show that the task mapped on 16 devices executes faster than the task mapped on 32 devices. Therefore, this combination of an algorithm's implementation and a target computational system is not able to provide effective scalability up to 40 devices. A rough analytical way to anticipate such behavior ahead of time is sketched below.
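One rough analytical complement to these experiments is to combine Amdahl's law [3] with the bottom bound on efficiency: under the (strong) assumption that the task follows S(N) = 1 / ((1 - f) + f/N) for an estimated parallelization coefficient f, the C sketch below scans device counts up to the planned upper bound and reports the last N whose predicted E(N) = S(N)/N stays above the bound. The value of f is assumed, and the model ignores communication overhead, so it gives an optimistic estimate rather than a substitute for the simulated-task experiments.

#include <stdio.h>

/* Amdahl's law: predicted speedup on n devices for a task whose
   parallelization coefficient (parallel fraction) is f. */
static double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / (double)n);
}

int main(void)
{
    const double f            = 0.80; /* assumed parallelization coefficient   */
    const double bottom_bound = 0.20; /* minimal acceptable scaling efficiency */
    const int    upper_bound  = 40;   /* planned maximum number of devices     */

    int last_efficient_n = 1;
    for (int n = 1; n <= upper_bound; n++) {
        double proximity = amdahl_speedup(f, n) / (double)n; /* E(N) = S(N)/N */
        if (proximity >= bottom_bound)
            last_efficient_n = n;
    }

    printf("With f = %.2f, scaling stays above %.0f%% efficiency up to N = %d "
           "of the planned %d devices (asymptotic speedup limit: %.1f).\n",
           f, bottom_bound * 100.0, last_efficient_n, upper_bound,
           1.0 / (1.0 - f));
    return 0;
}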

4. Optimal Target System Choice

This section covers the backward problem compared to the one discussed in Sections 2 and 3: here, the actual algorithm is given, and the task is to find an appropriate computational system for it.

An important question during the parallel application development process is the choice of a target computational system (meaning the hardware platform). The first thing to think about is whether the application can be scaled efficiently. If it can, the target computational platform may be a hybrid computational system containing several computational devices. Both scientists and developers often set quite ambitious goals implying efficient scaling up to hundreds of computational devices; in this case, poor scalability is fatal. At the same time, if the goal is not that ambitious (for instance, achieving a 40x speedup), it is good to remember the following: inefficient scaling does not mean that the task cannot be effectively executed in parallel. If the task cannot be scaled efficiently, it can always be executed on one computational device. In 2013, practically every PC is a simple hybrid system, with the CPU acting as the host device and the GPU as the computational device, and can cope with this task.

When we speak about parallel computations on hybrid systems, it is important to choose between the OpenCL and NVidia CUDA technologies; this choice heavily affects the choice of the target computational system. Here, one of the paradigms should be chosen. OpenCL is a free standard supported by many hardware vendors, including AMD, Intel and NVidia. The basic idea behind OpenCL is to create a unified framework for application development on hybrid structures: OpenCL hides the implementation details and represents a complex hybrid system as a single system with a number of computational devices [5, 6]. CUDA focuses solely on massively parallel computing on NVidia GPUs; having chosen such a narrow area, CUDA naturally shows better performance on NVidia GPUs than OpenCL does. If NVidia CUDA is used as the dominant technology, there is no option other than a GPU cluster containing numerous NVidia GPUs. If OpenCL is chosen, practically every hybrid system can be a target system. The sketch below shows how OpenCL exposes such a system as a list of platforms and computational devices.
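To illustrate the OpenCL platform model described above, the following C sketch enumerates the platforms and computational devices that an OpenCL runtime exposes on a given machine, using the standard Khronos host API [6]. It is a minimal sketch: error handling is reduced to bailing out, and the output format is our own.

#include <stdio.h>
#include <CL/cl.h>   /* Khronos OpenCL host API header; link with -lOpenCL */

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;

    /* Ask the runtime for the available OpenCL platforms. */
    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS
        || num_platforms == 0) {
        fprintf(stderr, "no OpenCL platforms found\n");
        return 1;
    }
    if (num_platforms > 8) num_platforms = 8;   /* we fetched at most 8 */

    for (cl_uint p = 0; p < num_platforms; p++) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        /* Enumerate every computational device (CPU, GPU, ...) of the platform. */
        cl_device_id devices[16];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           16, devices, &num_devices) != CL_SUCCESS)
            continue;
        if (num_devices > 16) num_devices = 16;

        for (cl_uint d = 0; d < num_devices; d++) {
            char dev_name[256];
            cl_uint units = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dev_name), dev_name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            printf("  Device %u: %s (%u compute units)\n", d, dev_name, units);
        }
    }
    return 0;
}

Compiled with, e.g., cc list_devices.c -lOpenCL, this prints every device the runtime can target regardless of vendor, which is precisely the portability argument made above.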
5. Conclusions

The described methodology covers the development process of a parallel application, including preliminary algorithm analysis, task simulation, scalability questions, and the choice of the dominant technology and the target system. The methodology is based on in-depth studies of algorithm implementation for hybrid distributed structures, one of the central aspects of which was scaling efficiency. All the main principles are covered by real-life examples.

Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research, project no. 13-07-00747-a, and St. Petersburg State University, project no. 9.38.674.2013.

References
[1] F. Gebali, Algorithms and Parallel Computing, Wiley, Hoboken, NJ, 2011.
[2] N. Matloff, Programming on Parallel Machines: GPU, Multicore, Clusters and More, University of California, CA, 2012.
[3] G.M. Amdahl, Validity of the single-processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings 30 (1967) 483-485.
[4] D.R. Martinez, V. Blanco, M. Boullon, J.C. Cabaleiro, T.F. Pena, Analytical performance models of parallel programs in clusters, Parallel Computing: Architectures, Algorithms and Applications 15 (2008) 99-106.
[5] J. Kowalik, T. Puzniakowski, Using OpenCL: Programming Massively Parallel Computers, IOS Press, Amsterdam, Netherlands, 2012.
[6] A. Munshi, The OpenCL Specification, Ver. 1.2, Khronos OpenCL Working Group, 2012.