A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS




SUDHAKARAN.G, APCF, AERO, VSSC, ISRO, 914712564742, g_suhakaran@vssc.gov.in
THOMAS.C.BABU, APCF, AERO, VSSC, ISRO, 914712565833, thomas_babu@vssc.gov.in
ASHOK.V, ADCG, AERO, VSSC, ISRO, 914712565887, v_ashok@vssc.gov.in

ABSTRACT
In this paper we describe a GPU-based supercomputer, SAGA (Supercomputer for Aerospace with GPU Architecture), developed at VSSC, and the challenges involved in developing a Computational Fluid Dynamics code, PARAS-3D, which runs on SAGA. This GPU facility, together with PARAS-3D, has helped solve CFD problems in a very cost-effective manner with a considerable reduction in solution time. The SAGA supercomputer is extensively used for the aerodynamic design and analysis of launch vehicles.

Categories and Subject Descriptors
[GPU Facility]: NVIDIA Tesla C2070 and M2090 GPUs; [CFD Application]: PARAS-3D, CUDA, MPI, SIMD architecture

General Terms
SAGA supercomputer; NVIDIA GPUs and GPU architecture; Linux operating system; resource manager and job scheduler; PARAS-3D CFD code; CUDA programming

Keywords
SAGA, GPU, SIMD, CFD, PARAS-3D, GPGPU, CUDA

1. INTRODUCTION
A GPU-based supercomputer, SAGA (Supercomputer for Aerospace with GPU Architecture), has been developed at VSSC using Intel Xeon processors and NVIDIA GPUs. This supercomputer has a theoretical peak performance of 448 TFLOPS (DP). A GPU version of the Linpack benchmark code is used to evaluate sustained performance, and full benchmarking of the facility is in progress. We will be submitting the benchmark results to include SAGA in top500.org during the update in June 2012. A photograph of the SAGA supercomputer is shown in Figure 1. A short introduction to SAGA can be found in Reference [1]. A GPU-based Computational Fluid Dynamics code, PARAS-3D, has also been developed at VSSC. PARAS-3D, the major application software running on SAGA, is written for GPUs using the CUDA programming model.
This paper describes the development of the SAGA supercomputer and the challenges involved in developing a GPU-based application. The SAGA supercomputer together with the PARAS-3D GPU application is extensively used for the aerodynamic design and analysis of launch vehicles. Section 2 describes the development of the SAGA supercomputer, and Section 3 describes the development of the GPU application PARAS-3D and presents some of the analyses carried out on SAGA, followed by the conclusion and references.

2. SAGA SUPERCOMPUTER

A GPU-based supercomputer, SAGA, has been developed at VSSC using Intel processors and NVIDIA GPUs. The supercomputer consists of 736 Intel Xeon processors and 736 NVIDIA Tesla GPUs. The individual nodes are configured with two CPUs and two GPUs to maintain a 1:1 ratio. The nodes are diskless machines using a compressed RAM file system. A pair of brain servers is used for network booting of the nodes, job scheduling, and power and resource management. Another five pairs of redundant servers provide the Network File System (NFS). The Linux operating system for SAGA is configured using open-source components. The application software PARAS-3D, the job scheduler, and the resource and power manager were developed at VSSC. Cluster and infrastructure design, the electrical and communication networks, and the design of the precision air-conditioning systems were also done in-house. SAGA has a theoretical peak performance of 448 TFLOPS (DP). The following sub-sections give the details of each of these activities.

2.1 Operating System
SAGA uses an in-house configured Linux operating system. The 64-bit Linux operating system (OS) for SAGA is configured using LFS (Linux From Scratch) with support for GPUs and InfiniBand. The servers and front-end systems use this OS with NFS support. A tiny 64-bit compute-node Linux OS was also developed for the nodes and is stored on the brain server. The nodes are diskless machines, which are network-booted from the brain server with the tiny OS. The Linux kernel is recompiled and updated whenever stable kernels become available.

2.2 Resource Manager and Job Scheduler
An automated resource manager and job scheduler was developed for SAGA to efficiently manage the operation of the entire supercomputer. The job scheduler queues the jobs and executes them when a sufficient number of nodes is available.
The resource manager monitors the status of the UPS systems, the room temperature, the state of the nodes, etc. It also switches off the nodes when there is no job in the queue awaiting execution and switches them on as per demand or when new jobs arrive in the queue. In this way, SAGA minimizes electrical power consumption.

2.3 Graphics Processing Units (GPUs)
SAGA uses two types of NVIDIA Tesla GPUs, namely the C2070 and the M2090. The C2070 is a first-generation double-precision GPU with 448 cores, capable of delivering a double-precision (DP) floating-point performance of 515 GFLOPS per GPU. Each M2090 GPU has 512 cores and a double-precision (DP) floating-point performance of 665 GFLOPS. There are 436 C2070 GPUs and 300 M2090 GPUs in SAGA, giving a total performance of 414 TFLOPS, in addition to the CPU power of about 34 TFLOPS. The features of the C2070 and M2090 are summarized in Table 1.

Table 1. Features of C2070 and M2090 GPUs

  Feature                                   | NVIDIA Tesla C2070 (Fermi)    | NVIDIA Tesla M2090
  ------------------------------------------|-------------------------------|-------------------------------
  Number of cores                           | 448                           | 512
  Built-in memory                           | 6 GB                          | 6 GB
  Double-precision floating-point perf.     | 515 GFLOPS                    | 665 GFLOPS
  Single-precision floating-point perf.     | 1030 GFLOPS                   | 1331 GFLOPS
  Power consumption                         | 190 W (typical), 225 W (max)  | 190 W (typical), 225 W (max)

More about NVIDIA GPUs can be found in [2].

2.4 Network and Topology
SAGA has three types of network interconnect, namely InfiniBand, Gigabit Ethernet and an IPMI network. A 40 Gbps QDR InfiniBand network is used for inter-process communication between the nodes. It is configured in fully non-blocking mode using 44 switches, each having 36 QDR ports; 28 switches are connected in layer 1 and 16 switches in layer 2. Gigabit Ethernet is used for network booting of the nodes and for the user interface. The third network, a 10 Mbps IPMI network, is used for platform management such as resetting the nodes and interfacing with the hardware. The SAGA network layout is shown in Figure 2.
2.5 Storage and Brain Servers
SAGA has a pair of brain servers used for network booting of the nodes and for system management. The servers are configured using DRBD and Heartbeat for fail-safe operation. Queuing and scheduling of jobs, along with node, power and resource management, are done by these servers. The brain servers also monitor the status of the UPS, room temperature, etc., and switch off the facility in the event of a prolonged power failure or an air-conditioning system failure. SAGA also has five pairs of NFS servers, likewise configured using DRBD and Heartbeat for fail-safe operation. The NFS servers provide the storage file system for all users, including system file storage. The primary servers provide the file system under normal operation. In the event of the failure of a primary server, the corresponding secondary server changes to primary and provides the service without any intervention. If a secondary server fails, the primary continues the service. The failed server can be rectified and connected back, upon which it resumes its function automatically.

2.6 Linpack Benchmarking
A GPU version of HPL is used for benchmarking our machines. We obtained 58% of peak performance on a machine having two C2070 GPUs and two quad-core Xeon processors. As described in later sections, the Linpack code has to be tuned based on the CUDA programming guidelines to obtain maximum performance. We are working on this and expect that full benchmarking of SAGA can be completed by the middle of May 2012. We are trying to include SAGA in Top500.org [3] during the website update in May-June 2012.

3. APPLICATION SOFTWARE: PARAS-3D
A Cartesian-grid-based Computational Fluid Dynamics code, PARAS-3D, was developed for SAGA. The PARAS-3D code was written for GPUs using the CUDA programming model provided by NVIDIA. PARAS-3D has about 2.5 lakh (250,000) lines of C code and is one of the most complex applications running on GPUs. The code is extensively used by ISRO and other aerospace organizations in the country. The advantages of PARAS-3D include fully automatic grid generation, the ability to handle complex geometries, an interface for CAD geometries, adaptive grid refinement, etc. The opening window of PARAS-3D is shown in Figure 3. More about PARAS can be found in [4]. The GPU version of PARAS-3D was developed from its parallel multithreaded MPI version. Parallelisation of the Navier-Stokes code on clusters of machines [5] started at VSSC in the 1990s using PC clusters. Subsequently we moved to a cluster of DEC Alpha machines and then to AMD clusters. The present GPU facility is based on Xeon processors and NVIDIA GPUs, with PARAS ported to a CPU-GPU hybrid computing environment.

3.1 CUDA Development Tools [6]
NVIDIA provides a programming environment known as CUDA, which is specialized for their GPUs. OpenCL could also be used, but we prefer CUDA since our GPUs are manufactured by NVIDIA. CUDA provides the ability to use high-level languages such as C to develop applications that can take advantage of the high level of performance and scalability that the GPU architecture offers. PARAS-3D is written using the CUDA programming tools, which are available on the NVIDIA website [6]. A number of CUDA programming guides, such as CUDA Getting Started for Linux, the NVIDIA CUDA C Programming Guide [7], the CUDA C Best Practices Guide, and CUDA for Developers [8], are also available on the NVIDIA website.
CUDA manuals and binaries can also be downloaded from the NVIDIA website [9].

3.2 Challenges in Developing a GPU Application
The challenges involved in developing a good GPU application are discussed in this section. GPUs inherit their architecture from traditional graphics processors, which are SIMD (Single Instruction, Multiple Data) processors employing data parallelism. Accordingly, to extract good performance from GPUs, the algorithm must be designed in a data-parallel fashion. The underlying application must be rewritten to exploit this data-parallel behaviour of GPUs, assigning the serial portions of the code to the CPUs. Memory management was found to be very important for getting good performance from GPUs. The copy processes between the memories of the CPU and the GPU have to be optimized for better performance. It should be noted that GPUs have only a limited set of registers and cache, and the application must be able to work within these limits. Moreover, the number of local variables should be optimized so that they stay within the cache, to the extent possible. The programmer should follow the CUDA guidelines to obtain maximum performance from GPUs. Threads should be run in groups that are multiples of 32 (the warp size) for best performance. Each processing unit on the GPU contains local memory that improves data manipulation and reduces fetch time. PARAS-3D, with its adaptive Cartesian grids and oct-tree data structure, is not inherently data parallel. With suitable algorithms, it was made more data parallel by arranging cells into different groups having varying levels of data parallelism. In this process, the most complex sets of computations are assigned to the CPUs. The update operations between the CPU and the GPU were programmed as a single update to minimize the copy processes between the memories of the CPU and the GPU. Other aspects of GPU programming include the removal of recursion and function calls from the code, as CUDA does not support these features.
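Two of the guidelines above (block sizes that are multiples of the 32-thread warp, and consolidating host-device transfers into single large copies) can be sketched as follows. This is an illustrative fragment only, not code from PARAS-3D: the kernel, array names and problem size are hypothetical, and the per-cell update is a placeholder.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical per-cell update; PARAS-3D's real kernels are far more complex. */
__global__ void update_cells(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * in[i];
}

int main(void)
{
    const int n = 1 << 20;                 /* hypothetical cell count */
    const size_t bytes = n * sizeof(float);
    float *h_buf = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    /* One consolidated host-to-device copy rather than many small ones:
       PCIe transfers dominate the runtime when done piecemeal. */
    cudaMemcpy(d_in, h_buf, bytes, cudaMemcpyHostToDevice);

    /* Block size chosen as a multiple of the 32-thread warp. The launch is
       asynchronous: the CPU is free to do other work until the copy below
       forces synchronization. */
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    update_cells<<<blocks, threads>>>(d_in, d_out, n);

    /* Single device-to-host copy back */
    cudaMemcpy(h_buf, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_buf);
    return 0;
}
```

Compiling requires the CUDA toolkit (e.g. `nvcc`) and running requires an NVIDIA GPU.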
At the start of application execution, CUDA-compiled code runs like any other application, with its primary execution happening on the CPU. When a kernel call is made, the application continues executing non-kernel functions on the CPU while the kernel executes on the GPU. In this manner we obtain parallel processing between the CPU and the GPU.

3.3 GPU Version of PARAS-3D
The PARAS-3D code was written based on the CUDA programming tools and the guidelines given in Section 3.2. A single code is used for both CPUs and GPUs; it is capable of identifying the CPUs and GPUs in a machine and the number of cores in each CPU and GPU. Users need not be aware of the number of CPUs and GPUs in the machines where they run PARAS: PARAS identifies the CPU and GPU cores in the machine and configures itself automatically. Presently, users have the freedom to choose the number of machines on which to run their application, based on the number of grid cells and previous run history.

3.4 Performance Improvements
With the above modifications, PARAS gives very good performance on GPU systems. The software uses three technologies for high-performance computing, namely distributed computing, shared-memory computing and GPU accelerators. The speed-up obtained for CFD problems on a single GPU node consisting of two quad-core Xeon processors and two GPUs is 4.5 to 7 times that of a single CPU node having two quad-core Xeon processors. The speed-up depends on the complexity of the geometry, the level of grid adaptation and the size of the problem under consideration. Figure 4 shows the speed-up obtained for PARAS when 46, 36 and 63 million cells are used. With suitable tuning of the parameters, we could achieve up to 90% efficiency for PARAS when 40 nodes are used. In general, about 1 million cells per GPU gives very good performance (>90%) up to 40 nodes.

3.5 Potential Users
PARAS-3D is extensively used by scientists and engineers of ISRO and other government organizations such as DRDL, ADA and ADE.
We have also distributed PARAS to some research institutions in the country, such as IISc and IIST. Presently, the PARAS license is limited to government organizations.

3.6 Real-World Applications
PARAS-3D is extensively used for the aerodynamic design and analysis of launch vehicles in ISRO and for aircraft design in ADA. PARAS has its own pre-processor to generate the geometry and grids of a problem under consideration. It also has the capability of importing geometry generated by other CAD software. A typical CFD problem for a launch vehicle with 93 million grid cells is shown in Figure 5. PARAS is capable of using fine grids near the body under consideration and coarse grids away from the body, as shown in the figure. Figure 6 shows the pressure distribution obtained using PARAS after 50,000 iterations. Some of the published CFD simulations carried out using PARAS-3D can be found in [10, 11, 12]. The list of publications in national journals/seminars/workshops is not included in this paper.

Figure 2. SAGA Network Layout

4. CONCLUSION
In this paper we have provided the details of a GPU-based supercomputer, SAGA, developed by VSSC. The cluster and infrastructure design was carried out at VSSC. The operating system, job scheduler, and automated resource and power manager were also developed in-house. A GPU-based application, PARAS-3D, was also developed at VSSC. The challenges involved in developing a GPU-based application are discussed in this paper. With suitable algorithms and tuning of parameters, PARAS gave up to 90% efficiency on 40 nodes. In general, about 1 million cells per GPU gives very good performance (>90%) up to 40 nodes.

5. ACKNOWLEDGMENTS
We would like to express our sincere thanks to the Chairman, ISRO, for providing the necessary approval and funds for building a GPU-based supercomputing facility at VSSC. We also express our sincere thanks to the Director, VSSC, for the technical and logistic support for building the facility at VSSC. We would like to thank DD, AERO, who took the initiative for establishing a GPU facility at VSSC and provided the necessary support for building it. We extend our thanks to the construction and maintenance wing of VSSC, which supported the design and establishment of the facility. We would like to thank all our engineers who helped in developing the facility and the PARAS-3D CFD code. We have used a good number of open-source software components to build the SAGA supercomputer and the GPU version of PARAS-3D; the open-source developer community is gratefully acknowledged.
Finally, we would like to thank our valued users, without whom the need for the facility and the PARAS code would not have arisen at all.

Figure 3. Opening Window of PARAS-3D
Figure 4. Speed-up obtained for PARAS
Figure 1. SAGA Supercomputer

Figure 6. Pressure Contours for a Typical Problem solved using PARAS-3D
Figure 5. Geometry and Grids for a Typical Problem

6. REFERENCES
[1] The SAGA Supercomputer, Technology Review India, Vol. 3, No. 6, June 2011.
[2] Tesla Product Literature, http://www.nvidia.com/object/tesla_product_literature.html
[3] http://top500.org
[4] PARAS-3D User's Manual, Ver. 4.1.0, VSSC/ARD/GN/01/2011, December 2011.
[5] Ashok.V and Thomas C. Babu, Parallelisation of Navier-Stokes code on a cluster of workstations, Lecture Notes in Computer Science 1745, Springer Verlag, pp. 349-353, 1999.
[6] CUDA Toolkit 4.1, http://developer.nvidia.com/cuda-toolkit
[7] NVIDIA CUDA C Programming Guide, Ver. 4.1, Nov 2011.
[8] CUDA for Developers, NVIDIA, http://www.nvidia.com/object/cuda_home.html#
[9] Download CUDA manuals and binaries, http://www.nvidia.com/object/cuda_get.html
[10] Gnanasekhar.S, Dipankar Das, Ashok.V and Lazar T. Chitilappilly, Effect of connected pipe test conditions on scramjet engine modules for flight testing, Int. J. of Aerospace Innovations, Vol. 1, No. 4, Dec. 2009.
[11] K. Manokaran, G. Vidya and V. K. Goyal, CFD simulation of flow field over a large protuberance on a flat plate at high supersonic Mach number, AIAA-2003-1253.
[12] Navin Kumar Kessop, Dipankar Das, K. J. Devasia and P. Jeyajothiraj, Numerical simulation of single and twin jet impingement on a typical jet deflector, 11th Asian Symposium on Visualization, Niigata, Japan, 2011.