High Performance LINPACK and GADGET 3 Investigation

Konstantinos Mouzakitis

August 21, 2014

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2014
Abstract

This project investigates the energy efficiency and performance of heterogeneous clusters. In particular, two applications are used to investigate their performance and power consumption as various input parameters and system parameters are changed. These two applications are High Performance LINPACK (HPL) and GADGET, chosen from this year's Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC) in Leipzig.
Contents

1 Introduction 1
  1.1 Report organisation 2
2 Student Cluster Competition 3
  2.1 Rules 4
  2.2 System Configuration 5
  2.3 Applications 8
  2.4 Obstacles 8
3 Highest LINPACK Award 10
  3.1 Background theory 10
    3.1.1 Algorithm 10
  3.2 Achieving the award 11
    3.2.1 Increase the Performance 12
    3.2.2 Decrease the Power consumption 16
    3.2.3 Competition results 21
4 GADGET Investigation 23
  4.1 Background theory 23
    4.1.1 Algorithm 23
    4.1.2 Technical information 25
  4.2 Performance investigation 26
  4.3 Competition Results 31
  4.4 Power results 31
5 Conclusions 34
  5.1 Future work 34
A HPL scripts 36
  A.1 Running script 36
  A.2 Affinity script 37
B System configuration scripts 38
  B.1 CPU clock frequency script 38
  B.2 Hyper-threading script 40
  B.3 GPU ECC script 40
  B.4 GPU Persistence mode script 41
  B.5 GPU clocks frequency script 41
List of Tables

3.1 Performance for different block sizes and corresponding problem sizes. 12
3.2 Performance for different problem sizes (block size constant). 13
3.3 Performance for different GPU_DGEMM_SPLIT values (problem and block sizes constant). 15
3.4 Power consumption of different hardware parts, when idle and on full load. 17
3.5 Power consumption and Performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values. 17
3.6 Power consumption and Performance for different number of cores per GPU. 19
3.7 Results from numerous tests, carried out during the SCC. 21
4.1 GADGET timing for different number of MPI tasks. 27
4.2 GADGET timing for different MPI libraries. 28
4.3 GADGET timing for different PMGRID dimension and different number of Multiple Domains. 29
4.4 GADGET timing for different compiler flags. 30
4.5 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values. 32
List of Figures

2.1 Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack. 6
2.2 Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible. 6
2.3 Server and Manifold modules of the liquid cooling system. 7
2.4 Heat exchanger sketch. 7
3.1 Performance for different block sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2.8 GHz. 13
3.2 Performance for different problem sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2 GHz. 14
3.3 Performance difference for P and Q value exchange. 15
3.4 Performance results for different GPU_DGEMM_SPLIT values. 16
3.5 Power consumption and Performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values. 18
3.6 Power consumption and Performance for Hyper-Threading (H/T) on and off. 19
3.7 Power consumption and Performance for different number of cores per GPU. 20
3.8 Power consumption according to the competition committee. 22
4.1 Peano-Hilbert curve creation. 25
4.2 Subdomains from Peano-Hilbert curve. 25
4.3 GADGET timing for different number of MPI tasks. 27
4.4 GADGET timing for different MPI libraries. 28
4.5 GADGET timing for different PMGRID dimension and different number of Multiple Domains. 29
4.6 GADGET timing for different compiler flags. 30
4.7 GADGET simulation from ISC. 31
4.8 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values. 32
Acknowledgements

This project would not have been possible without the kind contribution of others. I would like to thank my supervisor James Perry for his constant guidance and the time he devoted to me. I am also grateful to all the people from Boston Ltd, and especially to David Power, Michael Holiday and Jon Howard, for the help and support they provided for the ISC'14 Student Cluster Challenge. Finally, I would like to thank my parents Michael and Markella for their unconditional love and support, without which I would not have been able to do this Master of Science.
Chapter 1

Introduction

Energy efficiency, or power efficiency, is one of the main issues in modern society. Much research has been devoted to power efficiency, trying to reduce the power consumption of various products. The term "green product" is very common these days and is advertised in every sector. Computer science, and especially High Performance Computing (HPC), is no exception. The idea of green HPC has been gaining traction over the past decade, and power efficiency in supercomputers is of great importance as we move towards the exascale era. With a standard technology progression over the next decade, experts estimate that an exascale supercomputer could be constructed with power requirements in the 200 megawatt range, resulting in an estimated cost of $200-$300 million per year [2]. To keep the operating cost of the first exascale system reasonable, DARPA, after research, set an exascale power limit of 20 MW, which translates into 50 GFlops/Watt. This ratio describes a system approximately 11 times more power efficient than the current top system in the Green500 list [3]. The design of upcoming supercomputers will be driven by the power consumption of their components, as well as of their software. New technologies and hardware parts must be used, with accelerators and liquid cooling being the most popular ones, in conjunction with software techniques, in order to reach the desired Flops/Watt ratio. Some of these parts and techniques are tested every year during the Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC). The current dissertation is written by a member of the EPCC team, which participated in the SCC of ISC 2014 and won one of the competition's awards.
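As a quick check of the arithmetic behind DARPA's target mentioned above, an exaflop machine operating within a 20 MW power cap must deliver 50 GFlops per Watt:

```python
# Check: 1 EFlop/s within DARPA's 20 MW cap implies 50 GFlop/s per Watt.
exaflop = 1e18           # 1 EFlop/s expressed in Flop/s
power_cap_watts = 20e6   # 20 MW expressed in Watts

gflops_per_watt = exaflop / power_cap_watts / 1e9
print(gflops_per_watt)  # 50.0
```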
The report will describe the methodology and strategies followed in order to achieve the winning results, as well as an investigation of the performance and power consumption of one of the competition's applications. Further details about the SCC rules and applications are given in Chapter 2.
1.1 Report organisation

This report consists of 5 chapters:

Chapter 1 is a quick introduction to prepare the reader for the contents of the dissertation.
Chapter 2 presents the rules of the Student Cluster Competition, the configuration of team EPCC's supercomputer, and the challenges faced throughout the preparation and the actual competition.
Chapter 3 describes all the testing that was carried out in order to achieve the winning results and the award in the SCC.
Chapter 4 analyses the performance, as well as the power consumption, of GADGET, one of the applications used in the SCC. Various parameters are changed and tested for energy efficiency.
Chapter 5 summarises the results and the experience gained.
Chapter 2

Student Cluster Competition

The writer of this report was one of the four members of EPCC's team, which represented the University of Edinburgh in this year's International Supercomputing Conference SCC. The conference and the competition took place in Leipzig, Germany and lasted 5 days, from June 21 to June 25. The first two and a half days were assigned to system setup, during which the team's cluster was assembled and tested to make sure everything was working correctly. During the last three days the actual competition was carried out, with each team running various applications and benchmarks and competing for the best performance. The other 3 team members were Manos Farsarakis, George Iniatis and Chenhui Quan, while the team leader was Xu Guo. This year there were a total of 11 teams, representing universities from around the world, shown in the following list:

Centre for HPC (CHPC), South Africa
Ulsan National Institute of Science and Technology (UNIST), South Korea
Massachusetts Institute of Technology (MIT), Bentley University, Northeastern University (NEU), United States
EPCC at The University of Edinburgh (EPCC), United Kingdom
Chemnitz University of Technology, Germany
University of Hamburg, Germany
University of Sao Paulo (USP), Brazil
University of Colorado at Boulder, United States
University of Science and Technology of China (USTC), China
Shanghai Jiao Tong University (SJTU), China
Tsinghua University, China
In addition to the University of Edinburgh, the EPCC team cooperated with Boston Limited, which provided all the hardware as well as technical support, a task assigned to a Boston employee, Michael Holiday. Boston is an HPC company which, in collaboration with Supermicro, helps its customers customise and create their ideal solution in order to solve their challenges [4]. Boston was very helpful, as they were always available to answer our emails and calls and to solve any technical problems we were facing. In addition, they gave us full control of our system, both software-wise and hardware-wise, which helped us achieve the best possible results and develop our software and hardware skills.

2.1 Rules

The first and most important rule of the whole competition is the power limit. Each team must run all the given applications within a power cap of 3000 Watts, or approximately 13 Amperes, on one power circuit. One Power Distribution Unit (PDU) was given to each team, which monitored the consumption of each cluster. The SCC supervisors, as well as all the participating teams, were able to watch the current power consumption of all teams through the Ethernet interface of the PDU. If any team exceeded the 3 kW limit, a supervisor would come and warn the members of that team, while asking them to rerun any application/benchmark that might have run above the power limit. If the power consumption reached a level way above the 3000 Watt limit, the circuit breaker would trip and all power to the systems would be cut off, leading to valuable time loss. [5] Another significant rule of the competition is that no one is allowed to physically touch the system after the submission of the first results, at the end of the first day. If there is a need to touch the equipment, an official SCC supervisor needs to be called in order to judge the situation. In addition, the system must be active all the time and reboots are prohibited.
This means that changes to the BIOS (EFI), as well as hibernation or suspend modes, are not allowed either. The only exception to this rule is the case of an unsafe situation, in which anyone can power down the system and an SCC supervisor must be called immediately. [5] The input files for each day's applications, along with any special instructions for running them, were given to each team at the beginning of the day on a USB flash drive. Teams could only use those datasets, run the corresponding applications and save the output files on another USB drive, which was handed to the competition's supervisors at the end of each day. [5] In addition, the cluster could not be accessed outside the official hours of the competition, which means that the teams could not prepare for any upcoming application/benchmark beforehand. At the end of the competition all the results were taken into consideration by the competition's committee and five awards were given in total:

Highest LINPACK: The highest score received for the LINPACK benchmark
under the power budget.
Fan Favorite: Given to the team which receives the most votes from ISC participants during the SCC.
1st, 2nd and 3rd Place Overall Winners: The scoring for the overall winners was calculated using the scores from HPCC (10%), the chosen applications (80%), and an interview by the SCC board (10%). [5]

2.2 System Configuration

The team's cluster, named "Antikythera" after the first analog computer, consisted of 1 head node and 3 compute nodes. In fact, all 4 of them were used for computation, but the head node was called "head" because of some additional configuration and monitoring processes it ran. It must be noted that the initial configuration included a fifth node, which was later removed because of power limitations, but it was brought to the competition in case it was needed. All of the nodes were Supermicro server nodes, containing:

2 x Intel Xeon E5-2680V2 CPUs
  10 cores (20 threads)
  2.8 GHz clock frequency
  25 MB Intel Smart Cache
  64-bit instruction set
2 x NVIDIA Tesla K40 GPUs
  12 GB GDDR5 memory
  288 GB/sec memory bandwidth
  2880 CUDA cores
  1.43 (4.29) TFlops peak double (single) precision floating point performance
8 x 8 GB DDR3 Registered ECC Memory
1 x Intel 510 Series SSD (the head node contained 4 SSDs)

For the interconnect we used a 12-port Mellanox 40/56 Gb/s InfiniBand switch, and for Internet access a 48-port 10 GbE Ethernet switch. The component of the cluster that made the difference was the cooling technology. Instead of conventional fan cooling, liquid cooling was used, provided by CoolIT. More specifically, CoolIT's Rack Direct Contact Liquid Cooling (DCLC) AHx20 model was used, a rack-based liquid cooling solution that enables high-performance, high-density clusters anywhere, without the requirement for facility
Figure 2.1: Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack.

liquid to be plumbed to the rack. The AHx configuration consists of a liquid cooling network mounted directly onto the Intel Xeon CPUs and the NVIDIA K40 GPUs. This system allows the heat output of both the processors and the GPU accelerators to be absorbed directly into circulating liquid, which then efficiently transports the heat to a liquid-to-air heat exchanger mounted on top of the rack. This stand-alone rack solution is modular and compatible with any rack-computing set-up, enabling ultra-high density clusters to be deployed quickly and easily. The modules of the AHx20 system, as well as its heat exchanger, can be seen in Figures 2.3 and 2.4, respectively.

Figure 2.2: Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible.

Apart from its modularity, the greatest gain of liquid cooling was the reduction of the cluster's total power consumption, as a large number of power-consuming fans could be removed from the system. More specifically, there were initially 10 fans inside each node, each consuming a maximum of 25 Watts, whereas after the installation of the CoolIT AHx20 the 10 fans were replaced with 3 power-efficient ones,
Figure 2.3: Server and Manifold modules of the liquid cooling system.

Figure 2.4: Heat exchanger sketch.

each one consuming a maximum of 5 Watts. This power reduction was significant because it allowed the addition of a whole extra node to the system, without which the winning results would not have been achieved. As for the software stack, all nodes had to be configured from scratch. First we installed the CentOS Linux distribution and, after setting up a shared folder using the Network File System (NFS), we installed all the required compilers and libraries in it, so that they would be accessible from all the nodes. Some of them are shown in the following list:

GNU and PGI compilers
MPI libraries (OpenMPI, MVAPICH2)
NVIDIA drivers and CUDA 5.5 toolkit
Intel compilers, MKL library and Intel MPI
GNU Scientific Library (GSL)
FFTW2 library
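As a concrete illustration of the shared-software setup described above, the head node can export the shared folder and the compute nodes can mount it at boot. The fragments below are a sketch only; the path, hostname and subnet are hypothetical, not the actual cluster configuration:

```
# /etc/exports on the head node (path and subnet are placeholders)
/opt/shared  10.0.0.0/24(rw,no_root_squash,sync)

# /etc/fstab entry on each compute node
head:/opt/shared  /opt/shared  nfs  defaults  0 0

# /etc/profile.d/shared.sh, sourced by every shell, makes the shared
# compilers and libraries visible (OpenMPI shown as an example)
export PATH=/opt/shared/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/shared/openmpi/lib:$LD_LIBRARY_PATH
```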
2.3 Applications

During the competition, a total of 6 applications/benchmarks had to be run, 5 of which were known from earlier in the year, whereas the sixth was announced during the competition. The 5 known applications were: High Performance LINPACK (HPL), GADGET, HPC Challenge (HPCC), Quantum ESPRESSO and OpenFOAM, while the secret one was HPCG. In previous years' competitions there was also a seventh secret application announced during the competition, but this was not the case this year. Instead of an additional application/benchmark, the SCC committee organised a challenge. The goal of this challenge was to run one of the known applications, Quantum ESPRESSO in particular, within a time limit of 20 minutes, and the winner would be the team that managed to do so with the lowest peak power consumption. An additional rule of this challenge was that physical alterations to the systems were allowed. To take advantage of this, our team first disconnected all GPUs from the system, in order to avoid their idle power draw, as the provided version of Quantum ESPRESSO ran only on CPUs. Removing the GPUs from the system was not trivial because of the liquid cooling: the liquid tubes ran from the heat exchanger to the CPU, then to the GPU, and then back out of the node, which made complete removal of the GPUs impossible. To overcome this, we simply unplugged the PCI Express connector and all the additional power cables from each GPU and left it in the node. Moreover, we added the fifth, spare node, whose CPUs would help us complete the application run in under 20 minutes, as the rules required.

2.4 Obstacles

Numerous issues were faced, mainly throughout the preparation period, but during the actual competition as well. The most important ones are presented in the following list.
Power measurement tool
One of the most significant problems we faced while preparing for the competition was technical issues with our power measurement tool. At times its software would not work as expected, or would even crash. Even when it was working, not all the nodes could be connected to the tool, because the fuse could not handle that much power. With no other choice, we measured
the power consumed by one node and extrapolated the total system consumption. It was only in the last week before the competition that bigger fuses were installed and we managed to take the desired measurements.

Cluster management tools
Choosing the appropriate cluster management tool was another time-consuming issue. In the beginning we used Bright Cluster Manager, but it was too restrictive for our purposes, so we gave up on it and used Cobbler and Puppet instead. While Cobbler worked fine, we had issues with a few Puppet manifests, which would not cooperate with the Network File System. After this problem was overcome, we faced no other difficulties with the management tools.

Node crashes
During benchmarking we crashed our cluster numerous times. The main reason was insufficient GPU memory when running High Performance LINPACK with large problem sizes. The problem was that we could not physically reboot the system, as we were working remotely, and if the crashes occurred at a weekend we had to wait until Monday for someone to reboot the system. This problem was solved after installing IPMI, which enabled us to control the cluster remotely.

Space insufficiency
After a large number of application runs, many files had been created, from initial conditions to output files, leaving insufficient space on the cluster. For this reason, instead of the 2 SSDs initially in the head node, we finally used 4.

Compilers and Libraries
The main software issue during the benchmarking phase was the installation of all the compilers and libraries. GNU compilers are installed by default in CentOS, but the PGI compilers are not free, so we contacted the appropriate people in order to get the desired licences. The Intel compilers and libraries, on the other hand, were provided by Boston, but they consumed a lot of disk space, so they had to be moved to another directory.
In addition, installing all the required libraries with all 3 compilers (GNU, PGI, Intel) was a time-consuming task, mainly because of compatibility issues.

Network problem during the SCC
The only problem we faced during the competition itself was a network failure, which cut internet access for some teams, including ours. The main issue was that we could not watch the power consumption of our system, leading to "blind" runs. Thankfully, the committee took care of the problem and everything continued normally, with a small time extension given to the affected teams.
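The remote power control that solved the node-crash problem described above is typically done with ipmitool against each node's baseboard management controller. The sketch below only builds the command line; the BMC hostname, user and password are placeholders, not the actual cluster credentials:

```python
# Build the ipmitool command line used to power-cycle a crashed node
# remotely via its BMC. Hostname and credentials below are placeholders.
def ipmi_power_cmd(bmc_host, action):
    # action: "status", "on", "off" or "cycle"
    return (f"ipmitool -I lanplus -H {bmc_host} "
            f"-U admin -P secret chassis power {action}")

print(ipmi_power_cmd("node2-bmc", "cycle"))
```

Running the printed command from any machine with network access to the BMC power-cycles the node even when its operating system is unresponsive.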
Chapter 3

Highest LINPACK Award

The EPCC team achieved great LINPACK results in this year's SCC at ISC and won the Highest LINPACK award. The team was the first in the history of the competition to break the 10 TFlop/s boundary under the 3 kW limit, achieving 10.14 TFlop/s. The previous record was approximately 9.2 TFlop/s, achieved in the last Asia Student Supercomputer Challenge (ASC), while the second best result in the competition we attended was 9.4 TFlop/s. Many hours were spent testing numerous system and benchmark parameters in order to attain the winning result. The details of those tests, as well as some LINPACK background theory, are described in the current chapter.

3.1 Background theory

High Performance LINPACK (HPL) is a benchmark that solves a dense linear system in double precision floating point arithmetic, and it is the version used in the competition. In general, the LINPACK benchmark is used to measure the performance of any HPC system and classify it in the TOP500 list. It must be noted that no single number can reflect the overall performance of a system, and neither can the LINPACK results; but since solving a dense system of linear equations is very regular, the performance numbers extracted from the LINPACK benchmark give a good approximation of a system's peak performance. [6]

3.1.1 Algorithm

The algorithm used in HPL performs LU factorisation with partial pivoting, featuring multiple look-ahead depths. In particular, the operation count of the algorithm is 2/3 n^3 + O(n^2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method", or algorithms which compute a solution in a precision lower than full precision and refine the solution using an iterative approach. The data is distributed onto a two-dimensional P-by-Q grid of processes according to a
block-cyclic scheme, to ensure a balanced workload and increase the scalability of the algorithm. The N-by-N matrix is first logically partitioned into NB-by-NB blocks, which are cyclically distributed onto the P-by-Q process grid. This is done in both dimensions of the matrix. [6] [7]

3.2 Achieving the award

The HPL version provided was an executable optimised by NVIDIA, designed to run on both Intel CPUs and NVIDIA GPUs. Along with the executable there was an input parameter file, called HPL.dat, which included all the variables to be tuned in order to achieve the best possible performance. The 4 main parameters we experimented with were:

N: the problem size to be run,
NB: the block size, which determines the data distribution,
P: the number of rows in the process grid,
Q: the number of columns in the process grid.

In addition, 2 scripts were used: one included the commands to load the required libraries, by setting the appropriate environment variables, and the command to run the benchmark, while the other was used to set 3 parameters:

the number of threads that will be used per GPU (CPU_CORES_PER_GPU),
the percentage of work to be done on the GPUs (GPU_DGEMM_SPLIT),
the GPU and CPU affinity for each local MPI rank.

The first two of these parameters played a significant role in achieving the desired results, and many tests were run in order to find their best values. The 2 scripts are shown in Appendix A. In order to run the given HPL executable, the following libraries must be installed and loaded, which is done in the script shown in Appendix A.1:

a version of the OpenMPI library, in both PATH and LD_LIBRARY_PATH,
the CUDA 5.5 toolkit, in both PATH and LD_LIBRARY_PATH,
the Intel runtime library, in LD_LIBRARY_PATH,
the Intel MKL library, in LD_LIBRARY_PATH,
the libdgemm dynamic library, in LD_LIBRARY_PATH; this library was located in the HPL root directory.

The whole preparation process was divided into two parts.
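Both parts revolved around editing the HPL.dat input file, which holds one value per line followed by a short description. The abbreviated fragment below follows the standard HPL.dat layout and is filled in, for illustration only, with parameter values discussed later in this chapter:

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
161280       Ns
1            # of NBs
896          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
2            Qs
```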
The first step was to find the input parameter values that would give the highest performance, without worrying
about the power consumption. At the end of this first part, a baseline for the best achievable results would have been established, and the next step would investigate ways to reduce the peak power consumption. The second part focused on experimenting not only with HPL's parameters, but with some system parameters as well (CPU clock speed, GPU clock speed, Hyper-Threading), in order to bring the power consumption under the 3 kW limit.

3.2.1 Increase the Performance

The first 2 variables whose optimal values needed to be found were the problem size (N) and the block size (NB). These two parameters depend on each other because, in order to achieve better load balance between processes, the problem size must be a multiple of the block size. Furthermore, the chosen values must ensure that the number of blocks in each matrix dimension is exactly divisible by the corresponding process grid dimension (P and Q), so that each process is assigned whole data blocks. A README file stated that the most common NB values for a system with 2 GPUs per node are 768, 896 and 1024. The performance for these values, along with some more values close to them, was measured using 2 problem sizes, 153600 and 161280, picked to be multiples of the NB values under test. The detailed results are shown in Table 3.1. All the NB tests were executed at the default CPU clock frequency (2.8 GHz).

N        NB     Performance (GFlop/s)
153600   256    6.373E+03
153600   512    8.933E+03
153600   768    9.846E+03
153600   1024   9.971E+03
153600   1280   9.768E+03
153600   1536   9.585E+03
161280   840    9.815E+03
161280   896    1.004E+04
161280   960    1.000E+04

Table 3.1: Performance for different block sizes and corresponding problem sizes.

The red colour in Table 3.1 marks values that resulted in a failed correctness test for the HPL benchmark. The results are also presented in Figure 3.1, where the red line parts likewise represent the failed test values.
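A common way to pick a starting N is to size the matrix (8 bytes per double precision element) against the available memory and round down to a multiple of NB. The sketch below is illustrative only: the 4 x 64 GB figure is the cluster's total host memory, and the 0.76 usable fraction is an assumption chosen for illustration, not a value from the HPL documentation:

```python
import math

# Pick the largest HPL problem size N such that the N x N matrix of
# doubles fits within a target fraction of memory and N is a multiple
# of the block size NB. The usable fraction is an illustrative guess.
def largest_n(mem_bytes, nb, usable_fraction):
    doubles = usable_fraction * mem_bytes / 8   # doubles that fit in budget
    n_raw = math.isqrt(int(doubles))            # N*N <= doubles  =>  N <= sqrt
    return (n_raw // nb) * nb                   # round down to multiple of NB

# 4 nodes x 64 GB host RAM, NB = 896
print(largest_n(4 * 64 * 2**30, 896, 0.76))  # 161280
```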
As we can see in both Table 3.1 and Figure 3.1, the best block size is 896, which was used for the following tests as well as for the final runs in the competition. After the optimal NB value was found, the next step was to identify the best multiple of it to use as the problem size. In order to maximise performance, most of the available system
Figure 3.1: Performance for different block sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2.8 GHz.

memory should be used. In our case, the problem size must be such that everything fits in GPU memory. This was a challenging step, because overfilling the GPU with data caused the corresponding node to crash, requiring a reboot. The test results are shown in Table 3.2 and in Figure 3.2.

N        Performance (GFlop/s)
157696   9.827E+03
159488   9.772E+03
160384   9.805E+03
161280   9.850E+03
162176   9.694E+03
162206   9.580E+03
164864   0.000E+00
168960   0.000E+00

Table 3.2: Performance for different problem sizes (block size constant).

The tests start with an N value that uses significantly less than all the GPU memory and slowly increase N until the GPU memory is just filled, but not overfilled. The nvidia-smi tool was used to monitor the GPU memory usage while HPL was running. The zero values in Table 3.2 mean that the memory was overfilled and the system crashed for the corresponding problem size. These values are not clearly visible in Figure 3.2 for scaling reasons. It must be noted here that the performance values of Table
Figure 3.2: Performance for different problem sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2 GHz.

3.2 do not match those in Table 3.1, because they were obtained using a lower CPU clock frequency (2 GHz). The best performance is achieved with a problem size equal to 161280. After determining the optimal problem and block sizes, the grid dimensions P and Q were studied. P x Q should equal the total number of GPUs. Note that when P=1, only a problem that fits in GPU memory can be run (N needs to be reduced). Typically the best result is achieved for a process grid which is close to square, meaning that P and Q should be almost equal. The EPCC team's system included 8 GPUs in total, so there were two possible combinations for P and Q:

P = 2 and Q = 4
P = 4 and Q = 2

The case of P=1 and Q=8 (or vice versa) was not tested, because the problem size would have to be reduced, which would not be beneficial in our case. The results are shown in Figure 3.3, where we can see that the process grid dimensions play a significant role in achieving the highest performance. We chose to continue with P=4 and Q=2. The last parameter investigated in order to further increase the performance was the percentage of work done on the GPUs, determined by the variable GPU_DGEMM_SPLIT. The value proposed by the README file was 0.97, which means that 97% of the computational work is done on the GPUs and the remaining 3% on the CPUs. After some testing it was found that the proposed value was not always the optimal one. Two sets of tests were carried out in order to
Figure 3.3: Performance difference for P and Q value exchange.

test that: one with the CPU clocks running at 2 GHz, and a second one with the default frequency of 2.8 GHz. The results are shown in Table 3.3.

GPU_DGEMM_SPLIT (%)   Performance (GFlop/s)
                      CPU clock 2 GHz   CPU clock 2.8 GHz
95                    9.544E+03         1.032E+04
96                    9.532E+03         1.020E+04
97                    9.900E+03         1.013E+04
98                    9.847E+03         1.006E+04
99                    9.788E+03         9.943E+03
99.5                  9.702E+03         9.891E+03
100                   9.673E+03         9.830E+03

Table 3.3: Performance for different GPU_DGEMM_SPLIT values (problem and block sizes constant).

We can see that the results show a different trend for different CPU clock frequencies. In the first series of tests, with the lower clock speed, the best performance is achieved for a GPU_DGEMM_SPLIT equal to 97%, whereas in the series with the higher clock frequency the best performance is achieved for a lower GPU_DGEMM_SPLIT percentage (95%). This happens because CPUs at a higher clock frequency can execute calculations faster, so assigning them more computational work is beneficial: it helps the GPUs complete the job and leads to better performance. This is presented more clearly in Figure 3.4.
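As a sanity check on results like these, the operation count from Section 3.1.1 allows the wall-clock time of a run to be estimated. The sketch below uses the final N of 161280 and the winning 10.14 TFlop/s figure, assuming the rate is sustained over the whole run:

```python
# Estimate HPL wall-clock time from the 2/3*n^3 + O(n^2) operation count.
# Assumption: the quoted GFlop/s rate is sustained for the entire run.
def hpl_runtime_seconds(n, rate_flops):
    flops = (2.0 / 3.0) * n**3   # dominant term of the operation count
    return flops / rate_flops

# Winning run: N = 161280 at 10.14 TFlop/s
t = hpl_runtime_seconds(161280, 10.14e12)
print(round(t))  # roughly four and a half minutes
```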
Figure 3.4: Performance results for different GPU_DGEMM_SPLIT values.

After examining all the results so far, the highest performance is 1.032E+04 GFlop/s, achieved with the following parameters:

Problem size (N) = 161280
Block size (NB) = 896
Process grid dimensions: P = 4 and Q = 2
GPU workload percentage (GPU_DGEMM_SPLIT) = 0.95

3.2.2 Decrease the Power consumption

The purpose of the competition was not only to achieve the highest HPL performance, but also to do it under the power limit of 3000 Watts. For the peak performance reached so far, the power consumption of the whole system was 3292 Watts. In order to find the appropriate system variables for increasing the cluster's energy efficiency, the power consumption of some individual hardware parts was measured. These measurements are shown in Table 3.4. The consumption of the Ethernet and InfiniBand switches is constant, and there is nothing to be done to decrease it. The hardware parameters and parts experimented with were the clock frequencies of each CPU and GPU, hyper-threading (on or off), the number of water-cooling fans located on CoolIT's heat exchanger and the number of conventional cooling fans located inside each node. In addition, the number of CPU cores used by each GPU played a significant role in reducing the power consumption: the fewer cores each GPU uses, the less power is consumed, but the performance decreases as well, so a balanced value had to be found.

Hardware part               Idle power (Watt)   Full load power (Watt)
Ethernet switch             10                  10
Mellanox SX6012 IB switch   126                 152
NVIDIA K40 GPU              20                  235
Intel Xeon E5-2680 v2 CPU   40                  115
CoolIT Rack DCLC AHx        90                  90
Table 3.4: Power consumption of different hardware parts, when idle and on full load.

Moreover, with the changes in the CPU clock frequencies, the value of GPU_DGEMM_SPLIT had to be investigated once more, in order to achieve higher performance and lower power consumption. One important detail is that the system's power consumption was not allowed to exceed the 3 kW limit at any time, meaning that the peak power consumption of each HPL run had to be measured, and not the average power consumption over the whole run. All the preparation measurements were taken with a power measurement tool provided by Boston.

The first step towards reducing the system's consumption was experimentation with lower CPU clock frequencies. The software used to manipulate the clock rates was cpufrequtils, and in order for the changes to take place on all nodes, a script was created, which is shown in Appendix B.1. The power consumption, as well as the performance, for 3 different CPU clock speeds and different GPU_DGEMM_SPLIT values was measured; the results are shown in Table 3.5 and in Figure 3.5.

GPU_DGEMM_SPLIT (%)   Power consumption (Watt)
                      CPU clock 1.5 GHz   1.7 GHz   2 GHz
97                    3037                3021      3053
98                    3042                3013      3066
99                    2969                2968      3027
100                   2895                2941      3001

GPU_DGEMM_SPLIT (%)   Performance (GFlop/s)
                      CPU clock 1.5 GHz   1.7 GHz     2 GHz
97                    9.406E+03           9.685E+03   9.900E+03
98                    9.466E+03           9.667E+03   9.847E+03
99                    9.514E+03           9.651E+03   9.788E+03
100                   9.477E+03           9.608E+03   9.673E+03
Table 3.5: Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values.
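The tradeoff captured in Table 3.5 can be summarised programmatically; the following sketch (values transcribed from the table) picks the highest-performing configuration whose peak power stays under the 3000 W cap:

```python
# (CPU freq GHz, GPU_DGEMM_SPLIT %) -> (performance GFlop/s, peak power W),
# transcribed from Table 3.5
results = {
    (1.5, 97): (9406, 3037), (1.5, 98): (9466, 3042),
    (1.5, 99): (9514, 2969), (1.5, 100): (9477, 2895),
    (1.7, 97): (9685, 3021), (1.7, 98): (9667, 3013),
    (1.7, 99): (9651, 2968), (1.7, 100): (9608, 2941),
    (2.0, 97): (9900, 3053), (2.0, 98): (9847, 3066),
    (2.0, 99): (9788, 3027), (2.0, 100): (9673, 3001),
}

# keep only configurations whose peak power respects the competition limit
feasible = {k: v for k, v in results.items() if v[1] <= 3000}

# among those, pick the best performance
best = max(feasible, key=lambda k: feasible[k][0])
print(best, feasible[best])
```

Of the measurements in the table, the 1.7 GHz clock with a 99% split gives the best performance that respects the limit, illustrating why neither the highest clock nor the highest split is automatically the right choice.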
As we can see in Figure 3.5, a higher CPU clock frequency means higher performance, but in general it also leads to higher power consumption. Furthermore, the power consumption decreases as the GPU_DGEMM_SPLIT percentage increases.

Figure 3.5: Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values.

This happens because the main cause of power increases and reductions is the CPU. The GPUs, on full load, consume a constant amount of approximately 235 Watts, which does not change for small differences in the GPU_DGEMM_SPLIT value. On the contrary, a difference in the amount of work to be done on the CPUs does affect their power consumption, and leads to the observed power reduction as the split percentage rises.

Another parameter to be tested was the Hyper-Threading feature. Initially, we thought that we would not use it, because it would consume more power without giving any performance improvement. The results shown in Figure 3.6 do not agree with these initial thoughts (they were extracted using a small number of cores per GPU). Hyper-threading may have caused the system to consume more power, as it enables the "software" threads, but on the other hand it helped reach a higher performance. One possible reason that hyper-threading increased the performance is that it helped eliminate some dependencies in the HPL code. It must be noted that hyper-threading was turned on and off using a script, which is presented in Appendix B.2, and not through the system's BIOS settings.

In order to further reduce the power consumption, reducing the number of cores used by each GPU was investigated. The results can be seen in Table 3.6 and in Figure 3.7.
As we can see in Figure 3.7, the performance increases as the number of CPU cores/threads per GPU increases, which is expected, as more computing power is used for the calculations. On the other hand, the power consumption of the system also increases, because more CPU cores/threads are on full load.

Figure 3.6: Power consumption and performance with Hyper-Threading (H/T) on and off.

Cores per GPU   Performance (GFlop/s)   Power (Watt)
5               9.430E+03               2993
6               9.946E+03               3063
7               9.964E+03               3120
8               1.020E+04               3146
9               1.019E+04               3167
10              1.028E+04               3292
Table 3.6: Power consumption and performance for different numbers of cores per GPU.

Figure 3.7: Power consumption and performance for different numbers of cores per GPU.

Each measurement in Table 3.6 and Figure 3.7 was taken with different, but not too distant, values of the GPU_DGEMM_SPLIT percentage and CPU clock frequency. All the results presented so far were measured during the preparation period, before the actual competition. The best performance achieved up to this stage, while staying under the power limit, was 9.923E+03 GFlop/s, reached with the following parameters:

CPU clock frequency: 2.8 GHz
GPU clock frequencies: defaults
GPU_DGEMM_SPLIT: 98%
Cores per GPU: 6
Hyper-Threading: ON

COMMENTS ON GPUs

During both the preparation phase and the actual competition, numerous GPU parameters were investigated as well. In order to change the values of these parameters, the nvidia-smi interface, provided by NVIDIA, was used. Firstly, the idle power of each GPU needed to be reduced, because it was approximately 60 Watts. After a lot of research and testing, we found that changing the persistence mode of each GPU from 0 to 1 not only decreases its idle power consumption down to 19 Watts, but also gives a performance boost of approximately 150 GFlop/s. Furthermore, it was noticed that the maximum available memory was smaller on some GPUs, which was caused by the ECC (Error-Correcting Code) mode, which can detect and correct the most common kinds of internal data corruption. In order to make all the GPUs use the maximum memory allowed, the ECC mode was turned off.

After all the power and memory issues were dealt with, the experimentation with the GPU clock frequencies (graphics and memory clocks) started. The supported frequencies for the memory clock are 324 MHz, which only supports a 324 MHz graphics clock, and 3004 MHz, which supports a graphics clock of 666 MHz, 745 MHz, 810 MHz or 875 MHz. The default values are 3004 MHz for the memory clock and 745 MHz for the graphics clock. After a lot of testing, the default clocks turned out to work better than all the other combinations. The lower clock speeds caused the performance to drop significantly, while the highest clocks caused the power consumption to reach high peaks without giving any performance boost. The scripts for setting and changing the ECC mode, the persistence mode and the GPU clock frequencies are shown in Appendices B.3, B.4 and B.5, respectively.

3.2.3 Competition results

Before the official start of the competition, we had one day after the system set-up for testing, a big part of which was dedicated to HPL. We found that the committee's Power Distribution Unit (PDU) was more tolerant than the one we had been using before, as it measured the system's power consumption every 5 seconds, a fact that gave us some room for increasing the performance. In order to further decrease the power consumption, we also decided to unplug some of the heat exchanger's fans. The heat exchanger has a total of 20 fans (5 groups of 4), as shown in Figure 2.4, and we ended up disconnecting 8 of them (2 groups of 4), which left us with 12, a sufficient number for our system. A large selection of the competition results is shown in Table 3.7.

CPU freq (GHz)   GPU_DGEMM_SPLIT (%)   Cores/GPU   GFlop/s     Power (W)   GFlops/Watt
2.5              97.5                  8           9.994E+03   2900        3.446
2.5              97.5                  10          1.003E+04   2965        3.383
2.6              97.5                  10          1.003E+04   2965        3.383
2.6              97.0                  10          1.009E+04   3002        3.361
2.6              97.0                  10          1.009E+04   2984        3.381
2.6              97.0                  10          1.013E+04   3013        3.362
2.6              96.5                  10          1.013E+04   2952        3.432
2.6              96.5                  10          1.014E+04   3008        3.371
2.7              97.0                  10          1.010E+04   3029        3.334
2.7              96.5                  10          1.014E+04   2998        3.382  (*)
2.8              98.0                  6           9.951E+03   2940        3.385
2.8              98.0                  7           9.999E+03   3040        3.289
2.8              98.0                  7           9.999E+03   3010        3.322
2.8              98.0                  7           1.002E+04   3040        3.296
2.8              97.0                  7           1.012E+04   3080        3.286
2.8              97.0                  10          1.011E+04   3015        3.353
Table 3.7: Results from the numerous tests carried out during the SCC. The line marked with (*) is the winning result, which was submitted to the SCC committee.
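The efficiency column of Table 3.7 is simply the measured performance divided by the peak power draw; applied to the winning run, it reproduces the figure used for the Green500 comparison:

```python
# winning run from Table 3.7: 1.014E+04 GFlop/s at a 2998 W peak
perf_gflops = 1.014e4
power_watts = 2998

efficiency = perf_gflops / power_watts  # GFlop/s per Watt (Green500-style metric)
print(round(efficiency, 3))
```

The same one-line calculation fills in every row of the last column, which is why rows with slightly lower performance but much lower power (e.g. the 2.5 GHz runs) can still rank highly on efficiency.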
This result came from the first test completed at the very beginning of the competition, and it was achieved after leaving the system to cool down to 24 degrees Celsius for at least half an hour. Furthermore, the GFlops/Watt ratio is presented in the last column of Table 3.7 for reference. It must be noted that our system would have been ranked 5th on the Green500 list, according to the 3.382 GFlops/Watt ratio of the winning result.

In addition to the above results, Figure 3.8 presents the power consumption of the EPCC cluster on Monday, when HPL (the 4 high spikes) and HPCC (the 4 lower spikes) were run. The power consumption of the winning HPL run is represented by the fourth high spike.

Figure 3.8: Power consumption according to the competition committee.
Chapter 4

GADGET Investigation

In addition to HPL, the author of this dissertation was responsible for another of the competition's applications, GADGET. A large number of tests were carried out in order to increase the performance of GADGET, and an attempt was made to investigate its power behaviour when changing various parameters. The GADGET version used was GADGET 3, which is not an official release and has little documentation.

4.1 Background theory

GADGET is an open source application for cosmological N-body 1 and Smoothed-particle hydrodynamics 2 (SPH) simulations, run on distributed-memory architecture systems. It can run on anything from an individual PC to a large-scale cluster. GADGET can be used to simulate and study a variety of astrophysical problems, ranging from colliding and merging galaxies to the formation of large-scale structures in the Universe. It is tunable through an input file, and its simulations can include either dark matter only, or dark matter with gas. By enabling some additional gas processes, such as radiative cooling and heating, GADGET can be used to study the dynamics of the plasma existing between galaxies, and star formation. [8]

4.1.1 Algorithm

An important aspect of GADGET's algorithm and implementation is that both dark matter and gas are represented as particles. The main and most computationally intensive part of the algorithm is the calculation of the gravitational forces (N-body simulation), but besides that, GADGET uses hydrodynamical methods as well, in order to simulate fluid flows. The MPI library is used for exploiting parallelism, with a domain decomposition discussed further in this section, along with the force algorithms.

1 Simulation of a dynamical system of particles, which act under the influence of physical forces, such as gravity.
2 Computational method used for simulating fluid flows.

Gravitational field methods

There are two commonly used types of methods for computing the gravitational forces between particles [9]:

Particle-Mesh (PM) methods are the fastest for computing the gravitational field, as they are based on the Fast Fourier Transform (FFT). The drawback of PM is that it is not efficient when computing forces for particles that are in adjacent, dense grid cells.

Hierarchical tree algorithms are more efficient for close, high-density cells, but they are slower than PM for distant cells with low density contrast.

The hierarchical tree algorithm is the basic choice of GADGET. It organises distant particles into larger groups, allowing their gravity to be accounted for by means of a single force. The forces are then calculated, firstly, by creating a tree using recursive space decomposition, and then by traversing that tree and calculating partial forces between its nodes. GADGET can use a method called TreePM for computing the gravitational field, which is a hybrid of the above two and uses each one wherever it is more efficient. When TreePM is toggled off, GADGET uses the hierarchical tree algorithm alone.

Hydrodynamical methods

There are two well-known classes of methods for computing the hydrodynamical field [9]:

Eulerian methods, which are based on space discretisation and represent fluid variables on a mesh.

Lagrangian methods, which are based on mass discretisation and use fluid particles to model the flow.

GADGET uses SPH for the hydrodynamical computations, which is a Lagrangian method, because gas is represented as particles.

Domain decomposition

In order to define the domain decomposition across processing units, GADGET uses the Peano-Hilbert curve.
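Before turning to the details of the decomposition, the hierarchical tree idea described above can be made concrete with a minimal 2D Barnes-Hut sketch. This is only illustrative: GADGET's actual tree is a 3D oct-tree with many refinements, and all names below are my own. A quadtree cell stores the total mass and centre of mass of the particles inside it, and during traversal a cell is treated as a single pseudo-particle whenever it looks "small" from the target particle (opening angle theta).

```python
import math

class Cell:
    """Quadtree cell holding total mass and centre-of-mass data."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell centre and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0   # mass-weighted coordinate sums
        self.children = None      # four sub-cells once subdivided
        self.particle = None      # the single particle while a leaf

    def insert(self, p):          # p = (x, y, mass)
        x, y, m = p
        self.mass += m
        self.mx += m * x
        self.my += m * y
        if self.children is None and self.particle is None:
            self.particle = p     # empty leaf: store the particle here
            return
        if self.children is None: # occupied leaf: subdivide and push down
            old, self.particle = self.particle, None
            h = self.half / 2
            self.children = [Cell(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]
            self._child_for(old).insert(old)
        self._child_for(p).insert(p)

    def _child_for(self, p):
        return self.children[(2 if p[0] >= self.cx else 0)
                             + (1 if p[1] >= self.cy else 0)]

def tree_force(cell, p, theta=0.5, eps=1e-9):
    """Acceleration on p (G = 1): open a cell only if it looks large from p."""
    if cell.mass == 0.0 or cell.particle is p:
        return 0.0, 0.0
    dx = cell.mx / cell.mass - p[0]
    dy = cell.my / cell.mass - p[1]
    r = math.hypot(dx, dy) + eps
    if cell.children is None or (2 * cell.half) / r < theta:
        a = cell.mass / r**3      # treat the whole cell as one pseudo-particle
        return a * dx, a * dy
    ax = ay = 0.0
    for ch in cell.children:
        fx, fy = tree_force(ch, p, theta, eps)
        ax, ay = ax + fx, ay + fy
    return ax, ay

def direct_force(particles, p, eps=1e-9):
    """O(N^2) reference: direct summation over all other particles."""
    ax = ay = 0.0
    for q in particles:
        if q is p:
            continue
        dx, dy = q[0] - p[0], q[1] - p[1]
        r = math.hypot(dx, dy) + eps
        ax += q[2] * dx / r**3
        ay += q[2] * dy / r**3
    return ax, ay

particles = [(0.1, 0.2, 1.0), (0.7, 0.8, 2.0), (0.3, 0.9, 1.5), (0.6, 0.1, 0.5)]
root = Cell(0.5, 0.5, 0.5)        # unit box centred at (0.5, 0.5)
for p in particles:
    root.insert(p)
```

With theta = 0 the tree is fully opened and the result coincides with direct summation; larger theta trades accuracy for the O(N log N) cost that makes tree codes attractive. Returning to the text: distributing this tree work across MPI ranks is exactly what GADGET's Peano-Hilbert domain decomposition, described next, is for.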
Firstly, it creates the curve, as shown in Figure 4.1, which maps 3D (or 2D) space onto a 1D line, and then it splits the curve into pieces that define the sub-domains, as presented in Figure 4.2.
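The splitting step can be illustrated with the standard 2D Hilbert index (GADGET itself uses a 3D key; this 2D version, with hypothetical names, is only for illustration). Cells with nearby indices are nearby in space, so cutting the sorted index range into equal pieces yields compact sub-domains:

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position d along the 2D Hilbert curve (standard bit-twiddling form)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve has the right orientation
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# order the 16 cells of a 4x4 grid along the curve, then cut into 4 sub-domains
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, *c))
domains = [cells[i:i + 4] for i in range(0, 16, 4)]
```

Because consecutive curve positions are always grid neighbours, each of the resulting pieces is spatially contiguous, which is what keeps inter-process communication local in the real decomposition.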
Figure 4.1: Peano-Hilbert curve creation

Figure 4.2: Sub-domains from the Peano-Hilbert curve

4.1.2 Technical information

Cosmological simulations need some initial conditions (ICs) in order to begin, which are generated by a separate (parallel) program, called N-GenIC, and are given as input to GADGET. Both GADGET and N-GenIC need an MPI library in order to compile and run. The results of experimenting with different MPI libraries will be presented later in this chapter. In addition, two other libraries are also required:

GNU Scientific Library (GSL)
FFTW2 (Fastest Fourier Transform in the West)

GADGET is tunable in two ways. The first one is via a configuration file, at compile time, which can enable the TreePM algorithm and the periodic boundary conditions, set the precision to single or double, disable gravity (only for pure hydrodynamic problems), and set the particle ID variables to be of type long (in case the number of particles exceeds 2 billion). The described parameters are only a part of everything that is included in the configuration file.
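As an illustration, the compile-time options just described correspond to switches like the following (option names as in the public GADGET-2 documentation; the GADGET 3 configuration file is similar but undocumented, so treat this fragment as indicative only):

```text
PERIODIC          # periodic boundary conditions
PMGRID=1024       # enables TreePM with a 1024^3 Fourier mesh
DOUBLEPRECISION   # store particle data in double precision
#NOGRAVITY        # would disable gravity (pure hydrodynamic problems only)
#LONGIDS          # 64-bit particle IDs, needed above ~2 billion particles
```

Changing any of these requires recompiling the code, which is why they are separate from the runtime parameter file described next.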
The second way to tune GADGET is via a parameter file, at runtime. Some of the settable variables included in this file are the output directory location, the names of the output files, the CPU-time limit for the current run, the number of time-steps to simulate, the communication buffer size, as well as the initial and minimum allowed gas temperatures. The code comes with three problem sizes and their corresponding ICs, configuration and parameter files:

Small: 2x128^3 = 4,194,304 particles
Medium: 2x512^3 = 268,435,456 particles
Large: 2x2048^3 = 17,179,869,184 particles

It must be noted that the Large problem size could not be run on the team's cluster, because of insufficient memory. The output of GADGET is a folder with numerous files, the most important of which include information about the global energy statistics of the simulation (energy.txt), various aspects of the performance of the gravitational force computation for each timestep (timings.txt) and a list of all the timesteps (info.txt). There are also restart files, used in case of a paused simulation, and snapshot files, used for the visualisation of the simulation. The most important output file is the one that keeps track of the cumulative CPU consumption of various parts of the code and the time taken for each timestep. [11]

4.2 Performance investigation

Increasing the performance of GADGET was not a straightforward thing to do, because GADGET is a scientific application and not a benchmark. The Small test case was suited to the initial testing, as it had a low completion time. The first step was to investigate how it performs as the number of MPI tasks used for the job increases. The main reason for this step was to see whether GADGET scales well across multiple nodes, and to get a first estimation of the final results. The results are presented in Table 4.1; the MVAPICH MPI library, compiled with the PGI compiler, was used for this experiment.
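Since no serial (or single-task) baseline exists, relative speedup can still be computed against the smallest run; a small sketch using the timings reported in Table 4.1, with the 2-task run standing in for a baseline:

```python
# GADGET wall-clock times for 3 time-steps (Table 4.1): MPI tasks -> seconds
timings = {2: 73.95, 4: 40.69, 8: 22.24, 16: 6.96, 32: 3.66, 64: 1.97}

base_tasks = min(timings)       # the 2-task run is the reference
base_time = timings[base_tasks]

for tasks, t in sorted(timings.items()):
    speedup = base_time / t     # relative to the 2-task run
    ideal = tasks / base_tasks
    print(f"{tasks:3d} tasks: speedup {speedup:6.2f} (ideal {ideal:5.1f})")
```

At 64 tasks the relative speedup exceeds the ideal value of 32, which is consistent with the jump caused by the process-binding flags added at 16 tasks rather than with genuine superlinear scaling.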
It must be noted that all the following GADGET timings are taken for the execution of 3 time-steps.

Nodes used   MPI tasks   Time (sec)
1            2           73.95
1            4           40.69
1            8           22.24
1            16          6.96
2            32          3.66
4            64          1.97
Table 4.1: GADGET timings for different numbers of MPI tasks.

Figure 4.3: GADGET timings for different numbers of MPI tasks.

As we can see, GADGET scales really well, as the run time is almost halved when the number of MPI tasks is doubled. The reason we see a big drop in time when the MPI tasks go from 8 to 16 is that, at that point, process binding flags were added to the mpirun command. The flags used are -bind-to socket -map-by hwthread, which bind processes to a socket and then map them by hardware thread, in order to group sequential ranks together [10]. The results are also presented in Figure 4.3, which is a timing plot. A corresponding scaling plot could not be created, because GADGET did not run sequentially, nor with one MPI task, so there was no baseline by which to divide the other results.

The next step in GADGET's performance optimisation was to test which MPI library works best with the application. The tested libraries were:

MVAPICH, compiled with the PGI compiler
MVAPICH, compiled with the Intel compiler
Intel MPI, using the Intel compiler
OpenMPI, compiled with the Intel compiler

The results of the library testing are shown in Table 4.2 and in Figure 4.4. The results of the different MPI libraries differ only slightly from each other, but in bigger test cases these differences can play a significant role.
MPI tasks   Time (sec)
            MVAPICH-PGI   MVAPICH-Intel   Intel MPI   OpenMPI-Intel
16          6.96          7.18            6.60        6.98
32          3.66          3.75            3.73        3.75
64          1.97          2.03            2.33        2.06
Table 4.2: GADGET timings for different MPI libraries.

Figure 4.4: GADGET timings for different MPI libraries.

We can see that for the smaller number of MPI tasks (16), Intel MPI performs better than every other library, but for 64 MPI processes MVAPICH compiled with the PGI compiler performs best. A possible explanation is that with 64 MPI processes all 4 nodes are utilised, and the MVAPICH library is optimised for InfiniBand inter-node communication. For this reason, MVAPICH compiled with the PGI compiler was chosen for all further testing.

All the results presented so far were extracted from the Small test case of GADGET. In order to be better prepared, the optimisation process continued with the Medium test case, because it was a more plausible candidate for the competition. The next step involved experimenting with two of GADGET's input parameters: PMGRID, which enables the TreePM method and sets the size of the mesh to be used [11], and MULTIPLEDOMAINS, a variable introduced in GADGET 3 that enables better scaling of the tree algorithm. For best results, the PMGRID variable should be a power of 2, so that the FFTs are faster. The results are shown in Table 4.3 and in Figure 4.5.

PMGRID size   Time (sec)        Multiple domains   Time (sec)
768           183.5             4                  113.41
1024          112.39            8                  112.39
2048          122.61            16                 118.93
Table 4.3: GADGET timings for different PMGRID dimensions (with MULTIPLEDOMAINS = 8) and for different numbers of multiple domains (with PMGRID = 1024).

Figure 4.5: GADGET timings for different PMGRID dimensions and different numbers of multiple domains.

We can see that the PMGRID parameter plays a significant role in performance, as a difference of more than a minute is observed for a small change of the mesh dimension. On the other hand, the number of multiple domains has a smaller impact on the performance, but the gain is still very important, because even 1 second can make the difference in the competition.

The last idea for increasing the performance was to find the best compiler flags for the code. 9 flag combinations were tested in total, and the results are shown in Table 4.4 and Figure 4.6.

Flags ID   Compiler flags                                                                                     Time (sec)
1          -fastsse                                                                                           116.67
2          -O4 -Mipa=fast,inline -Munroll                                                                     112.46
3          -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre  116.44
4          -fastsse -Mipa=fast,inline                                                                         116.77
5          -fastsse -Mipa=fast,inline -Mvect=sse,short                                                        115
6          -fastsse -Mipa=fast,inline -Munroll                                                                119.02
7          -fastsse -O4 -Mipa=fast,inline,libc,libopt,libinline -Mvect=simd -Mcache_align -Munroll            112.37
8          -fastsse -O4 -Mipa=fast,inline -Mvect=sse -Munroll=c:1 -Mcache_align -g                            112.35
9          -fastsse -O4 -Mipa=fast,inline,libc,libopt,libinline -Mvect=simd -Mcache_align -Munroll -g         112.33
Table 4.4: GADGET timings for different compiler flags.

Figure 4.6: GADGET timings for different compiler flags.

The best flag combination is that with ID 9, which: [12]

Incorporates optimisation options that enable the use of vector streaming SIMD (SSE) instructions (-fastsse)
Invokes inter-procedural analysis (IPA) (-Mipa=fast)
Enables automatic inlining with IPA (-Mipa=inline)
Performs IPA optimisations on libraries (-Mipa=libc,libopt)
Performs IPA inlining from libraries (-Mipa=libinline)
Automatically generates packed SSE (Streaming SIMD Extensions) and prefetch instructions when vectorisable loops are encountered (-Mvect=simd)
Expands the contents of a loop and reduces the number of times the loop is executed (-Munroll)
Aligns data along cache line boundaries (-Mcache_align)

4.3 Competition Results

In the competition we were asked to run GADGET, with a given set of Initial Conditions and a given Configuration file, for 1702 timesteps. The total time taken was 978.06 seconds. In addition, we were given a code that produces an output image of the simulation. This visualisation is shown in Figure 4.7.

Figure 4.7: GADGET simulation from ISC.

4.4 Power results

The initial topic of this dissertation was an in-depth investigation of GADGET's power consumption and, if possible, the alteration of code fragments, such as the data decomposition algorithm, in order to increase the energy efficiency of the application. The lack of time, as well as a technical problem with the cluster's power measurement tool, prevented the author from achieving those initial goals. Nevertheless, some tests were carried out which give an initial impression of GADGET's power consumption, depending on the two parameters already mentioned, PMGRID and MULTIPLEDOMAINS.
It must be noted that the power measurement tool only came back up 3 days before the dissertation deadline, and the author of this report was not the only one waiting to run tests. In addition, the liquid cooling system was removed from the cluster after the end of the competition, and conventional air cooling was used again.

MULTIPLEDOMAINS   PMGRID=768              PMGRID=1024             PMGRID=2048
                  Power (W)   Time (s)    Power (W)   Time (s)    Power (W)   Time (s)
2                 1351        158.85      1363        108.65      1379        124.5
4                 1346        155.11      1334        106.69      1381        117.89
6                 1327        154.83      1379        106.31      1374        117.76
8                 1368        156.88      1323        105.65      1359        122.95
10                1334        170.09      1340        107.15      1359        125.65
12                1380        169.95      1355        107.41      1332        122.51
14                1351        172.72      1337        108.49      1347        125.57
16                1348        169.54      1332        107.65      1357        124.86
Table 4.5: GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIPLEDOMAINS values.

Figure 4.8: GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIPLEDOMAINS values.

The power consumption, as well as the corresponding completion time, for different values of the PMGRID and MULTIPLEDOMAINS variables was measured, and the results are shown in Table 4.5 and in Figure 4.8. The power consumption shown represents the peak power consumption of each test. In addition, all the results were produced using the Medium GADGET test case, because the Small one would not utilise the whole cluster.
As we can see in Table 4.5 and in Figure 4.8, the most power-efficient values are 1024 for PMGRID and 8 for MULTIPLEDOMAINS. Furthermore, these values are also the ones that give the best performance, a fact that makes them beneficial from every aspect. One possible reason for this is that small mesh sizes (PMGRID values) lead to insufficient space for the calculations and the FFTs, while larger mesh dimensions may cause frequent cache misses.
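Multiplying peak power by runtime gives a rough energy-to-solution proxy (assuming the draw is roughly flat over the run, which the peak-only measurements cannot confirm); applying it to Table 4.5 singles out the same configuration:

```python
# Table 4.5: (PMGRID, MULTIPLEDOMAINS) -> (peak power W, time s)
runs = {
    (768, 2): (1351, 158.85),  (1024, 2): (1363, 108.65),  (2048, 2): (1379, 124.50),
    (768, 4): (1346, 155.11),  (1024, 4): (1334, 106.69),  (2048, 4): (1381, 117.89),
    (768, 6): (1327, 154.83),  (1024, 6): (1379, 106.31),  (2048, 6): (1374, 117.76),
    (768, 8): (1368, 156.88),  (1024, 8): (1323, 105.65),  (2048, 8): (1359, 122.95),
    (768, 10): (1334, 170.09), (1024, 10): (1340, 107.15), (2048, 10): (1359, 125.65),
    (768, 12): (1380, 169.95), (1024, 12): (1355, 107.41), (2048, 12): (1332, 122.51),
    (768, 14): (1351, 172.72), (1024, 14): (1337, 108.49), (2048, 14): (1347, 125.57),
    (768, 16): (1348, 169.54), (1024, 16): (1332, 107.65), (2048, 16): (1357, 124.86),
}

energy = {cfg: p * t for cfg, (p, t) in runs.items()}  # joules (approximate)
best = min(energy, key=energy.get)
print(best, round(energy[best] / 1000, 1), "kJ")
```

Because the peak power varies only by a few percent across the table while the runtimes vary by over 60%, the runtime dominates the estimate, which is why the fastest configuration is also the most energy-efficient one here.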
Chapter 5

Conclusions

Optimising applications not only for performance but also for energy efficiency is not a trivial process, yet it is one of the most significant ones, especially now that High Performance Computing is approaching the exascale era. Numerous application parameters, as well as system parameters, must be altered and tested in order to find the most beneficial combination of lower power consumption and higher performance at the same time. This was demonstrated by the investigation of perhaps the most important benchmark in HPC, High Performance LINPACK, as well as of the cosmological simulation application GADGET.

Being part of the team that achieved the Highest LINPACK award in the 2014 Student Cluster Competition, during the International Supercomputing Conference, was a lifetime experience. During the ISC exhibition we had the chance to meet important people from different companies and learn a lot about HPC innovations and future technologies. In addition, meeting other students is always interesting, especially when they come from different parts of the world. Apart from the actual event, though, participating in the cluster competition was incredibly educational. It has been a learning experience, well worth all the hard work, and one to recommend to any future students.

5.1 Future work

Future work and further investigation on GADGET may include:

Experimentation with more input parameters, for both performance and power consumption. In addition, code fragments could be altered in order to achieve better energy efficiency.

Porting the computationally intensive parts of GADGET to GPUs, or to accelerators in general. Finding a way to port the computation of the gravitational field, which is the most "expensive" part of GADGET, onto accelerators would significantly improve its performance. This is an expected feature of the upcoming fourth version of GADGET.

Linking the FFTW interface of Intel's MKL library with GADGET may be beneficial for its performance, as well as for its power consumption. Attempts at this extension were made throughout the preparation phase of the competition, but after many unsuccessful tries the task was abandoned.
Appendix A

HPL scripts

A.1 Running script

#!/bin/bash

# location of HPL
HPL_DIR=`pwd`

cp HPL.dat.4nodes HPL.dat

export PATH=/shared/openmpi-1.8.1/openmpi-1.8.1-gcc490/bin:$PATH
export PATH=/shared/cuda-5.5/bin:$PATH
export LD_LIBRARY_PATH=/shared/apps/hpl/hpl-2.1_cuda55_gcc_ompi_mkl:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/shared/openmpi-1.8.1/openmpi-1.8.1-gcc490/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/shared/cuda-5.5/lib:/shared/cuda-5.5/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/intel/clusterstudio/mkl/lib/intel64:/home/intel/clusterstudio/lib/intel64:$LD_LIBRARY_PATH

mpirun --allow-run-as-root -x LD_LIBRARY_PATH -hostfile mfile4nodes -n 8 --map-by ppr:2:node $HPL_DIR/run_linpack_2_gpu_per_node
A.2 Affinity script

#!/bin/bash

# location of HPL
HPL_DIR=`pwd`

# Number of CPU cores
CPU_CORES_PER_GPU=10

export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
export LD_LIBRARY_PATH=$HPL_DIR:$LD_LIBRARY_PATH
export CUDA_DEVICE_MAX_CONNECTIONS=20
export TRSM_CUTOFF=16000
export GPU_DGEMM_SPLIT=0.965

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

case ${lrank} in
[0])
  # set GPU affinity of local rank 0
  export CUDA_VISIBLE_DEVICES=0
  # set CPU affinity of local rank 0
  numactl --cpunodebind=0 $HPL_DIR/xhpl
  ;;
[1])
  # set GPU affinity of local rank 1
  export CUDA_VISIBLE_DEVICES=1
  # set CPU affinity of local rank 1
  numactl --cpunodebind=1 $HPL_DIR/xhpl
  ;;
esac
Appendix B

System configuration scripts

B.1 CPU clock frequency script

#!/bin/bash

echo "Setting clocks to $1 $2"

if [ "$(hostname)" != "epcc-head" ]; then
  echo "Must be run from epcc-head!!!"
  exit
fi

if [[ "$1" != "" && "$2" != "" ]]; then
  for cpuid in {0..39}
  do
    cpufreq-set -c $cpuid -g ondemand
    cpufreq-set -c $cpuid -d $1GHz
    cpufreq-set -c $cpuid -u $2GHz
  done
  ssh epcc2 "for cpuid in {0..39}; do cpufreq-set -c \$cpuid -g ondemand; cpufreq-set -c \$cpuid -d \"$1\"GHz; cpufreq-set -c \$cpuid -u \"$2\"GHz; done"
  ssh epcc3 "for cpuid in {0..39}; do cpufreq-set -c \$cpuid -g ondemand; cpufreq-set -c \$cpuid -d \"$1\"GHz; cpufreq-set -c \$cpuid -u \"$2\"GHz; done"
  ssh epcc4 "for cpuid in {0..39}; do cpufreq-set -c \$cpuid -g ondemand; cpufreq-set -c \$cpuid -d \"$1\"GHz; cpufreq-set -c \$cpuid -u \"$2\"GHz; done"
  # ssh epcc5 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -g ondemand; cpufreq-set -c \$cpuid -d \"$1\"GHz; cpufreq-set -c \$cpuid -u \"$2\"GHz; done"
elif [[ "$1" == "perf" && "$2" == "" ]]; then
  for cpuid in {0..19}
  do
    cpufreq-set -c $cpuid -g performance
  done
  ssh epcc2 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -g performance; done"
  ssh epcc3 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -g performance; done"
  ssh epcc4 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -g performance; done"
  ssh epcc5 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -g performance; done"
elif [[ "$1" != "" && "$2" == "" ]]; then
  for cpuid in {0..19}
  do
    cpufreq-set -c $cpuid -f $1GHz
  done
  ssh epcc2 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -f \"$1\"GHz; done"
  ssh epcc3 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -f \"$1\"GHz; done"
  ssh epcc4 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -f \"$1\"GHz; done"
  ssh epcc5 "for cpuid in {0..19}; do cpufreq-set -c \$cpuid -f \"$1\"GHz; done"
else
  echo "You must provide either 1 freq for a fixed clock OR 2 freqs for low and high clocks OR \"perf\" for the performance governor."
fi
B.2 Hyper-threading script

#!/bin/bash

if [ "$1" != "1" ] && [ "$1" != "0" ]; then
    echo "Argument must be 1 for enabling or 0 for disabling hyper-threading"
    echo "Exiting..."
    exit
fi

for cpuid in {20..29}
do
    temp=$(cat /sys/devices/system/node/node0/cpu$cpuid/online)
    if [ "$temp" != "$1" ]; then
        echo "$1" > /sys/devices/system/node/node0/cpu$cpuid/online
    fi
    echo "cpu$cpuid: $(cat /sys/devices/system/node/node0/cpu$cpuid/online)"
done

for cpuid in {30..39}
do
    temp=$(cat /sys/devices/system/node/node1/cpu$cpuid/online)
    if [ "$temp" != "$1" ]; then
        echo "$1" > /sys/devices/system/node/node1/cpu$cpuid/online
    fi
    echo "cpu$cpuid: $(cat /sys/devices/system/node/node1/cpu$cpuid/online)"
done

B.3 GPU ECC script

#!/bin/bash

if [ "$(hostname)" != "epcc-head" ]; then
    echo "Must be run from epcc-head!!!"
    exit
fi

if [ "$1" == "0" ] || [ "$1" == "1" ]; then
    nvidia-smi -e $1
    ssh epcc2 nvidia-smi -e $1
    ssh epcc3 nvidia-smi -e $1
    ssh epcc4 nvidia-smi -e $1
    echo READY!
else
    echo "Please enter 0 for off or 1 for on."
    exit
fi

B.4 GPU Persistence mode script

#!/bin/bash

if [ "$(hostname)" != "epcc-head" ]; then
    echo "Must be run from epcc-head!!!"
    exit
fi

if [ "$1" == "0" ] || [ "$1" == "1" ]; then
    nvidia-smi -pm $1
    ssh epcc2 nvidia-smi -pm $1
    ssh epcc3 nvidia-smi -pm $1
    ssh epcc4 nvidia-smi -pm $1
    ssh epcc5 nvidia-smi -pm $1
    echo READY!
else
    echo "Please enter 0 for off or 1 for on."
    exit
fi

B.5 GPU clocks frequency script

#!/bin/bash

if [ "$(hostname)" != "epcc-head" ]; then
    echo "Must be run from epcc-head!!!"
    exit
fi

if [ "$1" == "def" ]; then
    nvidia-smi -rac
    ssh epcc2 nvidia-smi -rac
    ssh epcc3 nvidia-smi -rac
    ssh epcc4 nvidia-smi -rac
    ssh epcc5 nvidia-smi -rac
    echo Default clocks ready
elif [ "$1" == "high" ]; then
    nvidia-smi -ac 3004,875
    ssh epcc2 nvidia-smi -ac 3004,875
    ssh epcc3 nvidia-smi -ac 3004,875
    ssh epcc4 nvidia-smi -ac 3004,875
    ssh epcc5 nvidia-smi -ac 3004,875
elif [ "$1" == "low" ]; then
    nvidia-smi -ac 324,324
    ssh epcc2 nvidia-smi -ac 324,324
    ssh epcc3 nvidia-smi -ac 324,324
    ssh epcc4 nvidia-smi -ac 324,324
    ssh epcc5 nvidia-smi -ac 324,324
elif [[ ( "$1" == "3004" && ( "$2" == "875" || "$2" == "810" || "$2" == "745" || "$2" == "666" ) ) || ( "$1" == "324" && "$2" == "324" ) ]]; then
    nvidia-smi -ac $1,$2
    ssh epcc2 nvidia-smi -ac $1,$2
    ssh epcc3 nvidia-smi -ac $1,$2
    ssh epcc4 nvidia-smi -ac $1,$2
    ssh epcc5 nvidia-smi -ac $1,$2
    echo User clocks ready
else
    echo "You must provide the desired clock freqs or \"def\" for default clocks or \"high\" for highest clocks or \"low\" for lowest clocks."
    nvidia-smi -q -d SUPPORTED_CLOCKS
    exit
fi
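The final elif of the GPU clock script accepts only the memory,graphics frequency pairs listed in its condition: 3004 MHz memory combined with one of four graphics clocks, or the 324,324 idle pair. A standalone sketch of that validation, with a hypothetical `valid_pair` helper introduced only for illustration:

```shell
# Hypothetical helper reproducing the accepted (memory, graphics) MHz
# pairs from the GPU clock script above.
valid_pair() {
    if [[ ( "$1" == "3004" && ( "$2" == "875" || "$2" == "810" || \
            "$2" == "745" || "$2" == "666" ) ) || \
          ( "$1" == "324" && "$2" == "324" ) ]]; then
        echo "valid"
    else
        echo "invalid"
    fi
}

valid_pair 3004 875   # -> valid
valid_pair 3004 700   # -> invalid (not a supported graphics clock)
valid_pair 324 324    # -> valid
```

Restricting the arguments this way keeps nvidia-smi -ac from being invoked with combinations the cards do not support; the else branch prints the supported pairs via nvidia-smi -q -d SUPPORTED_CLOCKS instead.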