High Performance LINPACK and GADGET 3 investigation


High Performance LINPACK and GADGET 3 investigation

Konstantinos Mouzakitis

August 21, 2014

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2014

Abstract

This project investigates the energy efficiency and performance of heterogeneous clusters. In particular, two applications are used to study how their performance and power consumption change as various input and system parameters are varied. These two applications are High Performance LINPACK (HPL) and GADGET, and they were chosen from this year's Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC) in Leipzig.

Contents

1 Introduction
  1.1 Report organisation
2 Student Cluster Competition
  2.1 Rules
  2.2 System Configuration
  2.3 Applications
  2.4 Obstacles
3 Highest LINPACK Award
  3.1 Background theory
    3.1.1 Algorithm
  3.2 Achieving the award
    3.2.1 Increase the Performance
    3.2.2 Decrease the Power consumption
    3.2.3 Competition results
4 GADGET Investigation
  4.1 Background theory
    4.1.1 Algorithm
    4.1.2 Technical information
  4.2 Performance investigation
  Competition Results
  Power results
5 Conclusions
  Future work
A HPL scripts
  A.1 Running script
  A.2 Affinity script
B System configuration scripts
  B.1 CPU clock frequency script
  B.2 Hyper-threading script
  B.3 GPU ECC script
  B.4 GPU Persistence mode script
  B.5 GPU clocks frequency script

List of Tables

3.1 Performance for different block sizes and corresponding problem sizes
3.2 Performance for different problem sizes (block size constant)
3.3 Performance for different GPU_DGEMM_SPLIT values (problem and block sizes constant)
3.4 Power consumption of different hardware parts, when idle and on full load
3.5 Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values
3.6 Power consumption and performance for different numbers of cores per GPU
3.7 Results from numerous tests carried out during the SCC
4.1 GADGET timing for different numbers of MPI tasks
4.2 GADGET timing for different MPI libraries
4.3 GADGET timing for different PMGRID dimensions and different numbers of Multiple Domains
4.4 GADGET timing for different compiler flags
4.5 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values

List of Figures

2.1 Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack
2.2 Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible
2.3 Server and Manifold modules of the liquid cooling system
2.4 Heat exchanger sketch
3.1 Performance for different block sizes
3.2 Performance for different problem sizes
3.3 Performance difference for P and Q value exchange
3.4 Performance results for different GPU_DGEMM_SPLIT values
3.5 Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values
3.6 Power consumption and performance for Hyper-Threading (H/T) on and off
3.7 Power consumption and performance for different numbers of cores per GPU
3.8 Power consumption according to the competition committee
4.1 Peano-Hilbert curve creation
4.2 Subdomains from the Peano-Hilbert curve
4.3 GADGET timing for different numbers of MPI tasks
4.4 GADGET timing for different MPI libraries
4.5 GADGET timing for different PMGRID dimensions and different numbers of Multiple Domains
4.6 GADGET timing for different compiler flags
4.7 GADGET simulation from ISC
4.8 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values

Acknowledgements

This project would not have been possible without the kind contribution of others. I would like to thank my supervisor James Perry for his constant guidance and the time he devoted to me. I am also grateful to all the people from Boston Ltd, and especially to David Power, Michael Holiday and Jon Howard, for the help and support they provided for the ISC'14 Student Cluster Challenge. Finally, I would like to thank my parents Michael and Markella for their unconditional love and support, without which I would not have been able to do this Master of Science.

Chapter 1
Introduction

Energy efficiency, or power efficiency, is one of the main issues in modern society. A great deal of research has been devoted to reducing the power consumption of all kinds of products, and the term "green product" is now advertised in every sector. Computer science, and especially High Performance Computing (HPC), is no exception. The idea of green HPC has been gaining traction over the past decade, and power efficiency in supercomputers is of great importance as we move towards the exascale era. With a standard technology progression over the next decade, experts estimate that an exascale supercomputer could be constructed with power requirements in the 200 megawatt range, resulting in an estimated cost of $200-$300 million per year [2]. To make the operating cost of the first exascale system more reasonable, DARPA, after research, proposed an exascale power limit of 20 MW, which translates into 50 GFlops/Watt (an exaflop, 10^18 floating point operations per second, divided by 2x10^7 Watts gives 5x10^10 Flops/Watt). This ratio describes a system approximately 11 times more power efficient than the current top system in the Green500 list [3].

The design of upcoming supercomputers will be driven by the power consumption of their components and of their software. New technologies and hardware parts must be adopted, with accelerators and liquid cooling being the most popular ones, in conjunction with software techniques, in order to reach the desired Flops/Watt ratio. Some of these parts and techniques are tested every year during the Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC).

The current dissertation is written by a member of the EPCC team, which participated in the SCC of ISC 2014 and won one of the competition's awards. The report describes the methodology and strategies followed in order to achieve the winning results, as well as an investigation of the performance and power consumption of one of the competition's applications. Further details about the SCC rules and applications are given in Chapter 2.

1.1 Report organisation

This report consists of 5 chapters:

- Chapter 1 is a quick introduction to prepare the reader for the contents of the dissertation.
- Chapter 2 presents the rules of the Student Cluster Competition, the configuration of team EPCC's supercomputer and the challenges faced throughout the preparation and the actual competition.
- Chapter 3 describes all the testing that was carried out in order to achieve the winning results and get the award in the SCC.
- Chapter 4 analyses the performance, as well as the power consumption, of GADGET, one of the applications used in the SCC. Various parameters are changed and tested for energy efficiency.
- Chapter 5 summarises the results and the experience gained.

Chapter 2
Student Cluster Competition

The writer of this report was one of the four members of EPCC's team, which represented the University of Edinburgh in this year's International Supercomputing Conference SCC. The conference and the competition took place in Leipzig, Germany and lasted 5 days, starting on June 21 and finishing on June 25. The first two and a half days were assigned to system setup, during which the team's cluster was assembled and tested to make sure that everything was working correctly. During the last three days the actual competition was carried out, with each team running various applications and benchmarks and competing for the best performance. The other 3 team members were Manos Farsarakis, George Iniatis and Chenhui Quan, while the team leader was Xu Guo.

This year there were a total of 11 teams, representing universities from around the world, shown in the following list:

- Centre for HPC (CHPC), South Africa
- Ulsan National Institute of Science and Technology (UNIST), South Korea
- Massachusetts Institute of Technology (MIT), Bentley University, Northeastern University (NEU), United States
- EPCC at The University of Edinburgh (EPCC), United Kingdom
- Chemnitz University of Technology, Germany
- University of Hamburg, Germany
- University of Sao Paulo (USP), Brazil
- University of Colorado at Boulder, United States
- University of Science and Technology of China (USTC), China
- Shanghai Jiao Tong University (SJTU), China
- Tsinghua University, China

In addition to the University of Edinburgh, the EPCC team cooperated with Boston Limited, which provided all the hardware as well as technical support, a task assigned to a Boston employee, Michael Holiday. Boston is an HPC company which, in collaboration with Supermicro, helps its customers customise and create their ideal solution in order to solve their challenges [4]. Boston was very helpful, as they were always available to answer our emails and calls and solve any technical problems we were facing. In addition, they gave us full control of our system, both software-wise and hardware-wise, which helped us achieve the best possible results and develop our software and hardware skills.

2.1 Rules

The first and most important rule of the whole competition is the power limit. Each team must run all the given applications within a power cap of 3000 Watts, or approximately 13 Amperes, on one power circuit. One Power Distribution Unit (PDU) was given to each team, which monitored the consumption of each cluster. The SCC supervisors, as well as all the participating teams, were able to watch the current power consumption of all teams through the Ethernet interface of the PDU. If any team exceeded the 3 kW limit, a supervisor would come and warn the members of that team, asking them to rerun any application/benchmark that might have run above the power limit. If the power consumption reached a level well above the 3000 Watt limit, the circuit breaker would trip and all power to the systems would be cut off, leading to valuable time loss. [5]

Another significant rule of the competition is that no one is allowed to physically touch the system after the submission of the first results, at the end of the first day. If there is a need to touch the equipment, an official SCC supervisor needs to be called in order to judge the situation. In addition, the system must be active all the time and reboots are prohibited. This means that changes to the BIOS (EFI), as well as hibernation or suspension modes, are not allowed either. The only exception to this rule is the case of an unsafe situation, in which anyone can power down the system and an SCC supervisor must be called immediately. [5]

The input files of each day's applications, along with any special instructions for running them, were given to each team at the beginning of the day, using a USB flash drive. Teams should only use those datasets, run the corresponding applications and save the output files on another USB drive, which was given to the competition's supervisors at the end of each day. [5] In addition, the cluster could not be accessed outside the official hours of the competition, which means that the teams could not prepare for any upcoming application/benchmark beforehand.

At the end of the competition all the results were taken into consideration by the competition's committee and five awards were given in total:

- Highest LINPACK: The highest score received for the LINPACK benchmark under the power budget.

- Fan Favorite: Given to the team which receives the most votes from ISC participants during the SCC.
- 1st, 2nd and 3rd Place Overall Winners: The scoring for the overall winners was calculated using the scores from HPCC (10%), the chosen applications (80%), and an interview by the SCC board (10%). [5]

2.2 System Configuration

The team's cluster, which was named "Antikythera" after the first analog computer, consisted of 1 head node and 3 compute nodes. In fact, all 4 of them were used for computation, but the head node was named "head" because of the additional configuration and monitoring processes it ran. It must be noted that the initial configuration included a fifth node, which was later removed because of power limitations, but was brought to the competition in case it would be needed. All of the nodes were Supermicro server nodes, containing:

- 2 x Intel Xeon E5-2680V2 CPUs (10 cores / 20 threads each, 2.8 GHz clock frequency, 25 MB Intel Smart Cache, 64-bit instruction set)
- 2 x NVIDIA Tesla K40 GPUs (12 GB GDDR5 memory, 288 GB/sec memory bandwidth, 2880 CUDA cores, 1.43 (4.29) TFlops peak double (single) precision floating point performance)
- 8 x 8 GB DDR3 Registered ECC memory
- 1 x Intel 510 Series SSD (the head node contained 4 SSDs)

For the interconnect we used a 12-port Mellanox 40/56 Gb/s InfiniBand switch, and for Internet access we used a 48-port 10 GbE Ethernet switch.

The component of the cluster which made the difference was the cooling technology. Instead of conventional fan cooling, liquid cooling was used, provided by CoolIT. More specifically, CoolIT's Rack Direct Contact Liquid Cooling (DCLC) AHx20 model was used, which is a rack-based liquid cooling solution that enables high-performance and high-density clusters anywhere, without the requirement for facility liquid to be plumbed to the rack.

Figure 2.1: Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack.

Figure 2.2: Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible.

The unique AHx configuration consists of a liquid cooling network that is mounted directly onto the Intel Xeon CPUs and the NVIDIA K40 GPUs. This system allows both the processor and GPU accelerator heat output to be directly absorbed into circulating liquid, which then efficiently transports the heat to a liquid-to-air heat exchanger mounted on the top of the rack. This stand-alone rack solution is modular and compatible with any rack-computing set-up, enabling ultra-high density clusters to be deployed quickly and easily. The modules of the AHx20 system, as well as its heat exchanger, can be seen in Figures 2.3 and 2.4, respectively.

Apart from its modularity, the greatest gain of liquid cooling was the reduction of the total power consumption of the cluster, as a large number of power-consuming fans could be removed from the system. More specifically, there were initially 10 fans inside each node, each consuming a maximum of 25 Watts, whereas after the installation of the CoolIT AHx20 the 10 fans were replaced with 3 power-efficient ones, each consuming a maximum of 5 Watts.

Figure 2.3: Server and Manifold modules of the liquid cooling system.

Figure 2.4: Heat exchanger sketch.

The importance of this power reduction was significant, because it allowed the addition of one whole node to the system, without which the winning results would not have been achieved.

As for the software stack, all nodes had to be configured from scratch. Firstly, we installed the CentOS Linux distribution and, after setting up a shared folder using the Network File System (NFS), we installed all the required compilers and libraries in it, so that they would be accessible from all the nodes (an example of such an NFS set-up is sketched after the following list). Some of them are shown below:

- GNU and PGI compilers
- MPI libraries (OpenMPI, MVAPICH2)
- NVIDIA drivers and CUDA 5.5 toolkit
- Intel compilers, MKL Library and Intel MPI
- GNU Scientific Library (GSL)
- FFTW2 Library
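As an illustration of this kind of set-up, the sketch below exports a shared software directory from the head node and mounts it on a compute node. The host names, the /shared path and the service command are assumptions for illustration, not the actual configuration used on Antikythera.

    # On the head node: export the shared software directory over NFS
    echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
    exportfs -ra                 # reload the export table
    service nfs start            # start the NFS server (CentOS 6 style; newer releases use systemctl)

    # On each compute node: mount the share at boot
    echo "head:/shared  /shared  nfs  defaults  0 0" >> /etc/fstab
    mount /shared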

2.3 Applications

During the competition a total of 6 applications/benchmarks had to be run, 5 of which were known in advance, whereas the sixth one was announced during the competition. The 5 known applications were:

- High Performance LINPACK (HPL),
- GADGET,
- HPC Challenge (HPCC),
- Quantum ESPRESSO,
- OpenFOAM,

while the secret one was HPCG. In previous years' competitions there was a seventh secret application announced during the competition, but this was not the case this year. Instead of an additional application/benchmark, the SCC committee organised a challenge. The goal of this challenge was to run one of the known applications, Quantum ESPRESSO in particular, within a time limit of 20 minutes, and the winner would be the team that managed to do so with the lowest peak power consumption. An additional rule for this challenge was that physical alterations to the systems were allowed. In order to achieve that, our team firstly disconnected all GPUs from the whole system, to avoid their idle power draw, as the provided version of Quantum ESPRESSO ran only on CPUs. The procedure of removing the GPUs from the system was not trivial, because of the liquid cooling system. The liquid tubes came from the heat exchanger into the CPU, then into the GPU and then back out of the node, which made the complete removal of the GPUs impossible. In order to overcome this issue, we simply unplugged the PCI Express connector and all the additional power cables from each GPU and left it in the node. Moreover, we added the fifth spare node, whose CPUs would help us complete the application run in under 20 minutes, as the rules stated.

2.4 Obstacles

Numerous issues were faced, mainly throughout the preparation period, but during the actual competition as well. The most important ones are presented in the following list.

Power measurement tool
One of the most significant problems that we faced while preparing for the competition was the technical issues of our power measurement tool. There were times when its software would not work as expected, or would even crash. Even when we had it working, not all the nodes could be connected to the tool, because the fuse could not handle so much power. Having this as the only choice, we measured the power consumed by one node and made an assumption about the total system consumption.

It was only in the last week before the competition that bigger fuses were used and we managed to take the desired measurements.

Cluster management tools
The choice of an appropriate cluster management tool was another time-consuming issue. In the beginning we started using Bright Cluster Manager, but it was too restrictive for our purposes, so we gave up on it and used Cobbler and Puppet instead. While Cobbler worked fine, we had some issues with a few Puppet manifests, which did not cooperate with the Network File System. After this problem was overcome, we did not face any other difficulties regarding the management tools.

Node crashes
During benchmarking we made our cluster crash numerous times. The main reason was insufficient GPU memory while running High Performance LINPACK with large problem sizes. The problem was that we were not able to physically reboot the system, as we were working remotely, and if the crashes occurred on a weekend we had to wait until Monday for someone to reboot the system. This problem was solved after setting up IPMI, which enabled us to control the cluster remotely; a sketch of such a remote power-cycle command is shown at the end of this section.

Space insufficiency
After a large number of application runs, a lot of files were created, from initial conditions to output files, leaving insufficient space on the cluster. For this reason, instead of the 2 SSDs that were initially in the head node, we finally used 4.

Compilers and Libraries
The main software issue during the benchmarking phase was the installation of all the compilers and libraries. GNU compilers are installed by default in CentOS, but PGI compilers are not free, and we had to contact the appropriate people in order to get the desired licences. On the other hand, Intel compilers and libraries were provided by Boston, but they consumed a lot of disk space, so they had to be moved to another directory. In addition, installing all the required libraries with all 3 compilers (GNU, PGI, Intel) was a time-consuming task, mainly because of compatibility issues.

Network problem during SCC
The only problem we faced during the competition was a network failure, which restricted internet access for some teams, and our team was in that group. The main issue was that we could not watch the power consumption of our system, leading to "blind" tests. Thankfully, the committee took care of the problem and everything continued normally, with a small time extension given to the teams affected by the problem.
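To illustrate the kind of out-of-band control IPMI gave us, the commands below query and power-cycle a crashed node through its baseboard management controller. The BMC host name, user and password are placeholders, not the actual credentials or scripts used on the cluster.

    # Check whether the node is powered on (queried via the BMC, so it works even when the OS is hung)
    ipmitool -I lanplus -H node2-bmc -U admin -P secret chassis power status

    # Hard power-cycle the node remotely
    ipmitool -I lanplus -H node2-bmc -U admin -P secret chassis power cycle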

Chapter 3
Highest LINPACK Award

The EPCC team achieved great LINPACK results in this year's SCC at ISC and won the Highest LINPACK award. The team was the first in the history of the competition to break the 10 TFlop/s boundary under the 3 kW power limit. The previous record was approximately 9.2 TFlop/s, achieved in the last Asia Student Supercomputing Challenge (ASC), while the second best result in the competition we attended was 9.4 TFlop/s. A lot of hours were spent testing numerous system and benchmark parameters in order to attain the winning result. The details of those tests, as well as some LINPACK background theory, are described in the current chapter.

3.1 Background theory

High Performance LINPACK (HPL) is a benchmark that solves a dense linear system in double precision floating point arithmetic, and it is the version used in the competition. In general, the LINPACK benchmark is used to measure the performance of any HPC system and classify it in the TOP500 list. It must be noted that no single number can reflect the overall performance of a system, and neither can the LINPACK results, but since the solution of a dense system of linear equations is very regular, the performance numbers extracted from the LINPACK benchmark give a good approximation of a system's peak performance. [6]

3.1.1 Algorithm

The algorithm used in HPL is LU factorisation with partial pivoting, featuring multiple look-ahead depths. The operation count for the algorithm is (2/3)N^3 + O(N^2) double precision floating point operations, where N is the order of the matrix. For example, with N = 100,000 the benchmark performs roughly (2/3)x10^15 floating point operations, so a system sustaining 10 TFlop/s would complete the solve in about 67 seconds. This operation count excludes the use of a fast matrix multiply algorithm like "Strassen's Method", or of algorithms which compute a solution in lower than full precision and refine it iteratively. The data is distributed onto a two-dimensional P-by-Q grid of processes according to a block-cyclic scheme, to ensure a balanced workload and increase the scalability of the algorithm. The N-by-N matrix is first logically partitioned into NB-by-NB blocks, which are cyclically distributed onto the P-by-Q process grid in both dimensions of the matrix. [6] [7]

3.2 Achieving the award

The HPL version that was given to us was an executable optimised by NVIDIA, designed to run on both Intel CPUs and NVIDIA GPUs. Along with the executable, there was an input parameter file, called HPL.dat, which included all the variables to be tuned in order to achieve the best possible performance (an annotated excerpt of such a file is sketched below). The 4 main parameters we experimented with were:

- N: the problem size to be run,
- NB: the block size, which determines the data distribution,
- P: the number of rows in the process grid,
- Q: the number of columns in the process grid.

In addition, 2 scripts were used. One included the commands to load the required libraries, by setting the appropriate environment variables, and the command to run the benchmark, while the other was used to set 3 parameters:

- the number of threads to be used per GPU (CPU_CORES_PER_GPU),
- the percentage of work to be done on the GPUs (GPU_DGEMM_SPLIT),
- the GPU and CPU affinity for each local MPI rank.

The first two of these parameters played a significant role in achieving the desired results and a lot of tests were run in order to find their best values. The 2 scripts are shown in Appendix A. In order to run the given HPL executable, the following libraries must be installed and loaded, which is done in the script shown in Appendix A.1:

- a version of the OpenMPI library, in both PATH and LD_LIBRARY_PATH,
- the CUDA 5.5 toolkit, in both PATH and LD_LIBRARY_PATH,
- the Intel library, in LD_LIBRARY_PATH,
- the Intel MKL library, in LD_LIBRARY_PATH,
- the libdgemm dynamic library, in LD_LIBRARY_PATH (this library was located in the HPL root directory).

The whole preparation process was divided into two parts. The first step was to find the input parameter values that would give the highest performance, without worrying about the power consumption.
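For reference, the sketch below shows how the four tuned parameters map onto lines of a standard HPL.dat file. It follows the generic HPL input format; the problem size and the file as a whole are illustrative placeholders, not the settings used for the winning run.

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout, 7=stderr, file)
    1            # of problems sizes (N)
    100352       Ns            <-- problem size N (placeholder; a multiple of NB)
    1            # of NBs
    896          NBs           <-- block size NB
    0            PMAP process mapping (0=Row-, 1=Column-major)
    1            # of process grids (P x Q)
    4            Ps            <-- process grid rows P
    2            Qs            <-- process grid columns Q
    16.0         threshold
    (remaining algorithmic tuning lines omitted)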

After the end of this first part, a baseline for the best achievable results would have been established, and the next step would investigate ways to reduce the peak power consumption. The second part focused on experimenting not only with HPL's parameters, but with some system parameters as well (CPU clock speed, GPU clock speed, Hyper-Threading), in order to bring the power consumption under the 3 kW limit.

3.2.1 Increase the Performance

The first 2 variables whose optimal values needed to be found were the problem size (N) and the block size (NB). These two parameters depend on each other because, in order to achieve a better load balance between processes, the problem size must be a multiple of the block size. Furthermore, the chosen values must ensure that the number of blocks in each matrix dimension is exactly divisible by the corresponding process grid dimension (P or Q), so that each process is assigned whole data blocks. A README file stated that the most common NB values for a system with 2 GPUs per node are 768 and 896, among others. The performance for these values, along with some more values close to them, was computed using 2 problem sizes, each chosen to be a multiple of the NB values under test. The detailed results of the tests are shown in Table 3.1. All the NB tests were executed with the default CPU clock frequency (2.8 GHz).

Table 3.1: Performance for different block sizes and corresponding problem sizes (columns: N, NB, performance in GFlop/s).

The red colour in Table 3.1 means that the corresponding values resulted in a failed correctness test for the HPL benchmark. The results are also presented in Figure 3.1, where the red line parts likewise represent the failed test values. As we can see in both Table 3.1 and Figure 3.1, the best block size is 896, which was used for the following tests, as well as for the final runs in the competition.

After the optimal NB value was found, the next step was to identify the best multiple of it to use as the problem size. In order to maximise the performance, most of the available system memory should be used.

Figure 3.1: Performance for different block sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2.8 GHz.

In our case, the problem size must be such that everything fits in GPU memory. This was a challenging step, because overfilling the GPUs with data caused the corresponding node to crash, and a reboot was needed. The test results are shown in Table 3.2 and in Figure 3.2.

Table 3.2: Performance for different problem sizes (block size constant).

The tests start with an N value that uses significantly less than all the GPU memory and slowly increase N until the GPU memory is just filled, but not overfilled. The nvidia-smi tool was used to monitor the GPU memory usage while HPL was running, as sketched below. The zero values in Table 3.2 mean that the memory was overfilled and the system crashed for the corresponding problem size. These values are not clearly visible in Figure 3.2 for scaling reasons. It must be noted here that the performance values of Table 3.2 do not match those in Table 3.1, because they were extracted using a lower CPU clock frequency (2 GHz).
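A minimal way to watch the GPU memory filling up during a run is to poll nvidia-smi in a loop; the 5-second interval here is an arbitrary choice for illustration, not a value prescribed by the benchmark.

    # Print per-GPU memory usage every 5 seconds while HPL is running
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5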

Figure 3.2: Performance for different problem sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2 GHz.

The best performance was achieved with the largest problem size that still fitted in GPU memory.

After determining the optimal problem and block sizes, the grid dimensions P and Q were studied. P x Q should equal the total number of GPUs. Note that when P=1 the benchmark can only run a problem that fits in GPU memory, so N needs to be reduced. Typically the best result is achieved for a process grid which is close to square, meaning that P and Q should be almost equal. The EPCC team's system included 8 GPUs in total, so there were two possible combinations for P and Q:

- P = 2 and Q = 4
- P = 4 and Q = 2

The case of P=1 and Q=8 (or vice versa) was not tested, because the problem size would need to be reduced, which would not be beneficial in our case. The results are shown in Figure 3.3, where we can see that the process grid dimensions play a significant role in achieving the highest performance. We chose to continue with P=4 and Q=2.

The last parameter investigated in order to further increase the performance was the percentage of the work to be done on the GPUs, which is determined by the variable GPU_DGEMM_SPLIT. The value proposed by the README file was 0.97, which means that 97% of the computational work is done on the GPUs and the remaining 3% is executed on the CPUs. After some testing it was found that the proposed value was not always the optimal one. Two sets of tests were carried out in order to examine this: one with the CPU clocks running at 2 GHz and a second one with the default frequency of 2.8 GHz. The results are shown in Table 3.3.

Figure 3.3: Performance difference for P and Q value exchange.

Table 3.3: Performance for different GPU_DGEMM_SPLIT values, at CPU clock frequencies of 2 GHz and 2.8 GHz (problem and block sizes constant).

We can see that the results follow a different trend for different CPU clock frequencies. In the first series of tests, with the lower clock speed, the best performance is achieved for a GPU_DGEMM_SPLIT equal to 97%, whereas in the series with the higher clock frequency the best performance is achieved for a lower GPU_DGEMM_SPLIT percentage (95%). The reason this happens is that CPUs with a higher clock frequency can execute calculations faster, so assigning more computational work to them is beneficial: it helps the GPUs complete the job and leads to better performance. This is better presented in Figure 3.4.

Figure 3.4: Performance results for different GPU_DGEMM_SPLIT values.

After examining all the results so far, the highest performance is 1.032E+04 GFlop/s, and the parameters for achieving it are the following:

- Problem size (N): the largest value that fits in GPU memory (see Table 3.2)
- Block size (NB) = 896
- Process grid dimensions: P = 4 and Q = 2
- GPU workload percentage (GPU_DGEMM_SPLIT) = 95% (the best value found at the default 2.8 GHz clock)

3.2.2 Decrease the Power consumption

The purpose of the competition was not only to achieve the highest HPL performance, but also to do it under the power limit of 3000 Watts. For the peak performance reached so far, the power consumption of the whole system was 3292 Watts. In order to find the appropriate system variables to increase the cluster's energy efficiency, the power consumption of some individual hardware parts was measured. These measurements are shown in Table 3.4. The consumption of the Ethernet and InfiniBand switches is constant and there is nothing to be done to decrease it. The hardware parameters and parts experimented with were the clock frequencies of the CPUs and GPUs, Hyper-Threading (on or off), the number of water cooling fans located on CoolIT's heat exchanger, and the number of conventional cooling fans located inside each node. In addition, the number of CPU cores used by each GPU played a significant role in reducing the power consumption, because the fewer cores each GPU uses, the less power is consumed; however, the performance decreases as well, so a balanced value had to be found.

Table 3.4: Power consumption of different hardware parts when idle and on full load (Ethernet switch, Mellanox SX6012 InfiniBand switch, NVIDIA K40 GPU, Intel Xeon E5-2680V2 CPU, CoolIT Rack DCLC AHx20).

Moreover, with the changes in the CPU clock frequencies, the value of GPU_DGEMM_SPLIT had to be investigated once more, in order to achieve higher performance and lower power consumption. One important detail is that the system was not allowed to exceed the 3 kW limit at any time, meaning that the peak power consumption of each HPL run had to be measured, and not the average power consumption of the whole run. All the preparation measurements were taken with a power measurement tool provided by Boston.

The first step towards the reduction of the system's consumption was the experimentation with lower CPU clock frequencies. The software used to manipulate the clock rates was cpufrequtils and, in order for the changes to take effect on all nodes, a script was created, which is shown in Appendix B.1 (a simplified sketch of such a script is given below). The power consumption, as well as the performance, for 3 different CPU clock speeds and different GPU_DGEMM_SPLIT values was measured, and the results are shown in Table 3.5 and in Figure 3.5.

Table 3.5: Power consumption and performance for different CPU clock frequencies (1.5, 1.7 and 2 GHz) and different GPU_DGEMM_SPLIT values.

As we can see in Figure 3.5, a higher CPU clock frequency means higher performance, but in general it also leads to higher power consumption. Furthermore, the power consumption decreases as the GPU_DGEMM_SPLIT percentage increases. This happens because the main source of the power variations is the CPU. The GPUs, on full load, consume a roughly constant amount of approximately 235 Watts each, which does not change for small differences in the GPU_DGEMM_SPLIT value. On the contrary, a difference in the amount of work to be done on the CPUs affects their power consumption and leads to the observed power reduction as the DGEMM percentage rises.
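A simplified sketch of a cluster-wide frequency script is shown below; the node names, the logical CPU count and the use of ssh are assumptions for illustration, not the exact contents of the Appendix B.1 script.

    #!/bin/bash
    # Set every logical CPU on every node to the requested frequency (default 2 GHz).
    FREQ=${1:-2.0GHz}
    for node in node1 node2 node3 node4; do
        ssh "$node" "for cpu in \$(seq 0 39); do          # 2 x 10 cores x 2 threads = 40 logical CPUs
            sudo cpufreq-set -c \$cpu -g userspace        # governor that accepts a fixed frequency
            sudo cpufreq-set -c \$cpu -f $FREQ
        done"
    done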

Figure 3.5: Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values.

Another parameter to be tested was the Hyper-Threading feature. Initially we thought that we would not use it, expecting it to consume more power without any performance improvement. The results shown in Figure 3.6 do not agree with those initial thoughts (they were extracted using a small number of cores per GPU). Hyper-Threading may have caused the system to consume more power, as it enables the additional "software" threads, but on the other hand it helped reach a higher performance. One possible reason that Hyper-Threading increased the performance is that it helped eliminate some dependencies in the HPL code. It must be noted that Hyper-Threading was turned on and off using a script, which is presented in Appendix B.2, and not through the system's BIOS settings; a sketch of this approach is shown below.

In order to further reduce the power consumption, reducing the number of cores used by each GPU was investigated. The results can be seen in Table 3.6 and in Figure 3.7. As we can see in Figure 3.7, the performance increases as the number of CPU cores/threads per GPU increases, which is expected, as more computing power is used for the calculations. On the other hand, the power consumption of the system also increases, because there are more CPU cores/threads on full load.
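The following sketch illustrates one way to disable Hyper-Threading at runtime, by off-lining the second logical CPU of each core through sysfs; it is an assumption about the approach, not a copy of the Appendix B.2 script (re-enabling works by writing 1 back to the same files).

    #!/bin/bash
    # Take the sibling hardware threads offline, leaving one logical CPU per physical core.
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        siblings=$(cat "$cpu/topology/thread_siblings_list")   # e.g. "0,20"
        first=${siblings%%[,-]*}                               # first logical CPU of this core
        id=${cpu##*cpu}
        if [ "$id" != "$first" ]; then
            echo 0 | sudo tee "$cpu/online" > /dev/null        # offline the extra thread
        fi
    done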

Figure 3.6: Power consumption and performance for Hyper-Threading (H/T) on and off.

Table 3.6: Power consumption and performance for different numbers of cores per GPU.

Each measurement in Table 3.6 and Figure 3.7 is taken with different, but not too distant, values of the DGEMM percentage and CPU clock frequency. All the results presented so far were measured during the preparation period, before the actual competition. The best performance achieved up to this stage that was under the power limit was 9.923E+03 GFlop/s, obtained with the following parameters:

- CPU clock frequency: 2.8 GHz
- GPU clock frequencies: defaults
- GPU_DGEMM_SPLIT: 98%
- Cores per GPU: 6
- Hyper-Threading: ON

Figure 3.7: Power consumption and performance for different numbers of cores per GPU.

Comments on GPUs

During both the preparation phase and the actual competition, numerous GPU parameters were investigated as well. In order to change the values of these parameters, the nvidia-smi interface provided by NVIDIA was used. Firstly, the idle power of each GPU needed to be reduced, because it was approximately 60 Watts. After a lot of research and testing, we found that changing the persistence mode of each GPU from 0 to 1 not only decreases its idle power consumption down to 19 Watts, but also gives a performance boost of approximately 150 GFlop/s. Furthermore, it was noticed that the maximum memory available was smaller on some GPUs, which was caused by the ECC (Error-Correcting Code) mode, which can detect and correct the most common kinds of internal data corruption. In order to make all the GPUs use the maximum memory allowed, the ECC mode was turned off.

After the power and memory issues were dealt with, the experimentation with GPU clock frequencies (graphics and memory clocks) started. The supported frequencies for the memory clock are 324 MHz, which only supports a 324 MHz graphics clock, and 3004 MHz, which supports a graphics clock frequency of 666 MHz, 745 MHz, 810 MHz or 875 MHz. The default clock values are 3004 MHz for the memory and 745 MHz for the graphics. After a lot of testing, the default clocks turned out to work better than all the other combinations: the lower clock speeds caused the performance to drop significantly, while the highest clocks caused the power consumption to reach high peaks without giving any performance boost. The scripts for setting and changing the ECC mode, the persistence mode and the GPU clock frequencies are shown in Appendices B.3, B.4 and B.5, respectively.
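The three nvidia-smi settings described above can be applied with the commands sketched here; this is an illustrative summary, not the contents of the Appendix B.3-B.5 scripts, and it assumes all GPUs in a node are targeted at once.

    # Persistence mode on: keeps the driver loaded and lowers idle power
    sudo nvidia-smi -pm 1

    # ECC off: frees the memory reserved for error-correction data
    # (takes effect only after the GPU or node is reset)
    sudo nvidia-smi -e 0

    # Pin the application clocks to the default memory,graphics pair (MHz)
    sudo nvidia-smi -ac 3004,745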

28 performance to drop significantly, while the highest clocks, caused the power consumption to reach high peaks, without giving any performance boost. The scripts for setting and changing ECC mode, persistence mode and the GPU clock frequencies are shown in Appendices B.3, B.4 and B.5, respectively Competition results Before the official start of the competition, we had one day, after the system set-up, for testing, a big part of which was dedicated to HPL. We found out that the committee s Power Distribution Unit (PDU) was more tolerant, than the one we were using before, as it was measuring the system s power consumption every 5 seconds, a fact that gave us some room for increasing the performance. In order to further decrease the power consumption, we decided to unplug some of the heat-exchanger s fans, as well. The heat-exchanger has a total of 20 fans (5 groups of 4 fans), as shown in Figure 2.4, and we ended up disconnecting 8 of them (2 groups of 4), which left us only 12, a sufficient number for our system, A large number of the competition results is shown in Table 3.7. CPU freq (GHz) DGEM (%) Cores/GPU Gflop/s Power (W) GFlops/Watt 2.5 GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E Table 3.7: Results from numerous tests, carried out during the SCC. The highlighted line in Table 3.7 is showing the winning result, which was submitted to the SCC committee. This result was extracted from the first test, completed at the very beginning of the competition, and it was achieved after leaving the system to cool for at least half an hour, down to 24 Celsius degrees. Furthermore, the GFlops/Watt ratio is presented in the last column of Table 3.7, for reference. It must be noted that our 21

In addition to the above results, Figure 3.8 presents the power consumption of the EPCC cluster on the Monday, when HPL (the 4 high spikes) and HPCC (the 4 lower spikes) were run. The power consumption of the winning HPL run is represented by the fourth high spike.

Figure 3.8: Power consumption according to the competition committee.

Chapter 4
GADGET Investigation

In addition to HPL, the writer of the current dissertation was responsible for another competition application, GADGET. A large number of tests were carried out in order to increase the performance of GADGET, and an attempt was made to investigate its power behaviour when changing various parameters. The GADGET code version used was GADGET 3, which is not an official release and has little documentation.

4.1 Background theory

GADGET is an open source application for cosmological N-body and Smoothed-Particle Hydrodynamics (SPH) simulations, run on distributed memory architecture systems. (An N-body simulation models a dynamical system of particles acting under the influence of physical forces such as gravity; SPH is a computational method used for simulating fluid flows.) It can be run on anything from an individual PC to a large-scale cluster. GADGET can be used to simulate and study a variety of astrophysical problems, ranging from colliding and merging galaxies to the formation of large-scale structures in the Universe. It is tunable through an input file, and its simulations can include either dark matter only, or dark matter with gas. By enabling some additional gas processes, such as radiative cooling and heating, GADGET can be used to study the dynamics of the plasma existing between galaxies, and star formation. [8]

4.1.1 Algorithm

An important aspect of GADGET's algorithm and implementation is that both dark matter and gas are represented as particles. The main and most computationally intensive part of the algorithm is the calculation of the gravitational forces (the N-body simulation), but besides that GADGET also uses hydrodynamical methods in order to simulate fluid flows.

The MPI library is used to exploit parallelism, with a domain decomposition discussed further in this section, along with the force algorithms.

Gravitational field methods

There are two commonly used types of methods for computing the gravitational forces between particles [9]:

- Particle-Mesh (PM) methods are the fastest ones for computing the gravitational field, as they are based on the Fast Fourier Transform (FFT). The defect of PM is that it is not efficient when computing forces for particles that are in adjacent, dense grid cells.
- Hierarchical tree algorithms are more efficient for close, high-density cells, but they are slower than PM for distant cells with low density contrast.

The hierarchical tree algorithm is the basic choice of GADGET. It organises distant particles into larger groups, allowing their gravity to be accounted for by means of a single force. The forces are then calculated, firstly, by creating a tree using recursive space decomposition, and then by traversing that tree and calculating partial forces between its nodes. GADGET can use a method called TreePM for computing the gravitational field, which is a hybrid of the above two and uses each one where it is more efficient. When TreePM is toggled off, GADGET uses the hierarchical tree algorithm alone.

Hydrodynamical methods

There are two well-known classes of methods for computing the hydrodynamical field [9]:

- Eulerian methods, which are based on space discretisation and represent fluid variables on a mesh.
- Lagrangian methods, which are based on mass discretisation and use fluid particles to model the flow.

GADGET uses SPH for the hydrodynamical computations, which is a Lagrangian method, because gas is represented as particles.

Domain decomposition

In order to define the domain decomposition across processing units, GADGET uses the Peano-Hilbert curve. Firstly, it creates the curve, as shown in Figure 4.1, which maps 3D or 2D space onto 1D space, and then it splits the curve into pieces that define the sub-domains, as presented in Figure 4.2.

Figure 4.1: Peano-Hilbert curve creation.

Figure 4.2: Subdomains from the Peano-Hilbert curve.

4.1.2 Technical information

Cosmological simulations need some initial conditions (ICs) in order to begin, which are generated by a given (parallel) program, called N-GenIC, and are given as input to GADGET. Both GADGET and N-GenIC need an MPI library in order to compile and run. The results of experimenting with different MPI libraries will be presented later in this chapter. In addition, two other libraries are also required:

- GNU Scientific Library (GSL)
- FFTW2 (Fastest Fourier Transform in the West)

GADGET is tunable in two ways. The first one is via a configuration file, at compilation time, which can enable the TreePM algorithm and the periodic boundary conditions, set the precision to single or double, disable the gravity part (only for pure hydrodynamical problems), and set the particle ID variables to be of type long (in case the number of particles is greater than 2 billion). The parameters described are only a part of everything that is included in the configuration file; an illustrative excerpt is sketched below.
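A minimal sketch of such compile-time options is shown here, using the option names from the public GADGET-2 documentation; the exact names and values in the GADGET 3 tree used for the competition may differ, and the PMGRID value is a placeholder.

    # Illustrative compile-time configuration excerpt (not the competition settings)
    PERIODIC                 # periodic boundary conditions
    PMGRID=512               # enable the TreePM scheme with a 512^3 Fourier mesh
    DOUBLEPRECISION          # store particle data in double precision
    LONGIDS                  # 64-bit particle IDs, needed above ~2 billion particles
    #NOGRAVITY               # uncomment for pure hydrodynamical problems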

The second way to tune GADGET is via a parameter file, at runtime. Some of the variables that can be set in this file are the output directory location, the names of the output files, the CPU-time limit for the current run, the number of time-steps to simulate, the communication buffer size, as well as the initial and minimum allowed gas temperatures. The code comes with three problem sizes and their corresponding ICs, configuration and parameter files:

- Small: 2x128^3 = 4,194,304 particles
- Medium: 2x512^3 = 268,435,456 particles
- Large: the largest configuration provided

It must be noted that the Large problem size could not be run on the team's cluster because of insufficient memory.

The output of GADGET is a folder with numerous files, the most important of which include information about the global energy statistics of the simulation (energy.txt), various aspects of the performance of the gravitational force computation for each timestep (timings.txt) and a list of all the timesteps (info.txt). There are also the restart files, which are used in case of a paused simulation, and some snapshot files, which are used for the visualisation of the simulation. The most important output file is the one that keeps track of the cumulative CPU consumption of the various parts of the code and the time taken for each timestep. [11]

4.2 Performance investigation

Increasing the performance of GADGET was not a straightforward thing to do, because GADGET is a scientific application and not a benchmark. The Small test case was suited to the initial testing, as it had a low completion time. The first step was to investigate how it performs when the number of MPI tasks used for the job increases. The main reason for this step was to see whether GADGET scales well across multiple nodes, and to get a first estimation of the final results. The results are presented in Table 4.1; the MVAPICH MPI library, compiled with the PGI compiler, is used for this experiment. It must be noted that all the following GADGET timings are taken for the execution of 3 time-steps.

As we can see, GADGET scales really well, as the output time is almost halved when the number of MPI tasks is doubled. The reason we see a big drop in time when the MPI tasks go from 8 to 16 is that, at that point, process binding flags were added to the mpirun command (a sketch of the resulting run command is shown below). The flags used are -bind-to socket -map-by hwthread, which bind processes to a socket and then map them by hardware thread, in order to group sequential ranks together [10]. The results are also presented in Figure 4.3, which is a timing plot. A corresponding scaling plot could be created from the same data, since each run executes the same problem with a doubled number of MPI tasks.
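For reference, a run of the Small test case with these binding flags might look like the following sketch; the task count, executable name and parameter file name are placeholders, only the binding flags are taken from the text above.

    # Launch GADGET 3 on 16 MPI tasks with socket binding and hardware-thread mapping
    mpirun -np 16 -bind-to socket -map-by hwthread ./Gadget3 gadget_small.param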


Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical

More information

Resource Scheduling Best Practice in Hybrid Clusters

Resource Scheduling Best Practice in Hybrid Clusters Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services ALPS Supercomputing System A Scalable Supercomputer with Flexible Services 1 Abstract Supercomputing is moving from the realm of abstract to mainstream with more and more applications and research being

More information

ECDF Infrastructure Refresh - Requirements Consultation Document

ECDF Infrastructure Refresh - Requirements Consultation Document Edinburgh Compute & Data Facility - December 2014 ECDF Infrastructure Refresh - Requirements Consultation Document Introduction In order to sustain the University s central research data and computing

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

PRIMERGY server-based High Performance Computing solutions

PRIMERGY server-based High Performance Computing solutions PRIMERGY server-based High Performance Computing solutions PreSales - May 2010 - HPC Revenue OS & Processor Type Increasing standardization with shift in HPC to x86 with 70% in 2008.. HPC revenue by operating

More information

Scaling from Workstation to Cluster for Compute-Intensive Applications

Scaling from Workstation to Cluster for Compute-Intensive Applications Cluster Transition Guide: Scaling from Workstation to Cluster for Compute-Intensive Applications IN THIS GUIDE: The Why: Proven Performance Gains On Cluster Vs. Workstation The What: Recommended Reference

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Linux Cluster Computing An Administrator s Perspective

Linux Cluster Computing An Administrator s Perspective Linux Cluster Computing An Administrator s Perspective Robert Whitinger Traques LLC and High Performance Computing Center East Tennessee State University : http://lxer.com/pub/self2015_clusters.pdf 2015-Jun-14

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This

More information

Hadoop on the Gordon Data Intensive Cluster

Hadoop on the Gordon Data Intensive Cluster Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

HUAWEI Tecal E6000 Blade Server

HUAWEI Tecal E6000 Blade Server HUAWEI Tecal E6000 Blade Server Professional Trusted Future-oriented HUAWEI TECHNOLOGIES CO., LTD. The HUAWEI Tecal E6000 is a new-generation server platform that guarantees comprehensive and powerful

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing.

A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing. Appro HyperBlade A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing. Appro HyperBlade clusters are flexible, modular scalable offering a high-density

More information

HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server

HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server HUAWEI FusionServer X6800 Data Center Server Data Center Cloud Internet App Big Data HPC As the IT infrastructure changes with

More information

Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015

Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015 Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015 Xian Shi 1 bio I am a second-year Ph.D. student from Combustion Analysis/Modeling Lab,

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability

More information

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC2013 - Denver

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC2013 - Denver 1 The PHI solution Fujitsu Industry Ready Intel XEON-PHI based solution SC2013 - Denver Industrial Application Challenges Most of existing scientific and technical applications Are written for legacy execution

More information

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist The Top Six Advantages of CUDA-Ready Clusters Ian Lumb Bright Evangelist GTC Express Webinar January 21, 2015 We scientists are time-constrained, said Dr. Yamanaka. Our priority is our research, not managing

More information

IT@Intel. Comparing Multi-Core Processors for Server Virtualization

IT@Intel. Comparing Multi-Core Processors for Server Virtualization White Paper Intel Information Technology Computer Manufacturing Server Virtualization Comparing Multi-Core Processors for Server Virtualization Intel IT tested servers based on select Intel multi-core

More information

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Installation Guide (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Tel: +44 (0) 141 3322681 Fax: +44 (0) 141 3326792 www.mve.com Table of Contents 1.

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

Improved LS-DYNA Performance on Sun Servers

Improved LS-DYNA Performance on Sun Servers 8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms

More information

A-CLASS The rack-level supercomputer platform with hot-water cooling

A-CLASS The rack-level supercomputer platform with hot-water cooling A-CLASS The rack-level supercomputer platform with hot-water cooling INTRODUCTORY PRESENTATION JUNE 2014 Rev 1 ENG COMPUTE PRODUCT SEGMENTATION 3 rd party board T-MINI P (PRODUCTION): Minicluster/WS systems

More information

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING GPU COMPUTING VISUALISATION XENON Accelerating Exploration Mineral, oil and gas exploration is an expensive and challenging

More information

AirWave 7.7. Server Sizing Guide

AirWave 7.7. Server Sizing Guide AirWave 7.7 Server Sizing Guide Copyright 2013 Aruba Networks, Inc. Aruba Networks trademarks include, Aruba Networks, Aruba Wireless Networks, the registered Aruba the Mobile Edge Company logo, Aruba

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Using the Windows Cluster

Using the Windows Cluster Using the Windows Cluster Christian Terboven terboven@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University Windows HPC 2008 (II) September 17, RWTH Aachen Agenda o Windows Cluster

More information

ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling. January 2009 ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

More information

Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

More information

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers WHITE PAPER FUJITSU PRIMERGY AND PRIMEPOWER SERVERS Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers CHALLENGE Replace a Fujitsu PRIMEPOWER 2500 partition with a lower cost solution that

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Lecture 1: the anatomy of a supercomputer

Lecture 1: the anatomy of a supercomputer Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949

More information

SR-IOV: Performance Benefits for Virtualized Interconnects!

SR-IOV: Performance Benefits for Virtualized Interconnects! SR-IOV: Performance Benefits for Virtualized Interconnects! Glenn K. Lockwood! Mahidhar Tatineni! Rick Wagner!! July 15, XSEDE14, Atlanta! Background! High Performance Computing (HPC) reaching beyond traditional

More information

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...

More information

June, 2009. Supermicro ICR Recipe For 1U Twin Department Cluster. Version 1.4 6/25/2009

June, 2009. Supermicro ICR Recipe For 1U Twin Department Cluster. Version 1.4 6/25/2009 Supermicro ICR Recipe For 1U Twin Department Cluster with ClusterVision ClusterVisionOS Version 1.4 6/25/2009 1 Table of Contents 1. System Configuration... 3 Bill Of Materials (Hardware)... 3 Bill Of

More information

Picking the right number of targets per server for BeeGFS. Jan Heichler March 2015 v1.2

Picking the right number of targets per server for BeeGFS. Jan Heichler March 2015 v1.2 Picking the right number of targets per server for BeeGFS Jan Heichler March 2015 v1.2 Evaluating the MetaData Performance of BeeGFS 2 Abstract In this paper we will show the performance of two different

More information

SAS Business Analytics. Base SAS for SAS 9.2

SAS Business Analytics. Base SAS for SAS 9.2 Performance & Scalability of SAS Business Analytics on an NEC Express5800/A1080a (Intel Xeon 7500 series-based Platform) using Red Hat Enterprise Linux 5 SAS Business Analytics Base SAS for SAS 9.2 Red

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

More information

NVIDIA GPUs in the Cloud

NVIDIA GPUs in the Cloud NVIDIA GPUs in the Cloud 4 EVOLVING CLOUD REQUIREMENTS On premises Off premises Hybrid Cloud Connecting clouds New workloads Components to disrupt 5 GLOBAL CLOUD PLATFORM Unified architecture enabled by

More information

Big Data Performance Growth on the Rise

Big Data Performance Growth on the Rise Impact of Big Data growth On Transparent Computing Michael A. Greene Intel Vice President, Software and Services Group, General Manager, System Technologies and Optimization 1 Transparent Computing (TC)

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Performance analysis of parallel applications on modern multithreaded processor architectures

Performance analysis of parallel applications on modern multithreaded processor architectures Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance analysis of parallel applications on modern multithreaded processor architectures Maciej Cytowski* a, Maciej

More information

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended

More information

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009 ECLIPSE Best Practices Performance, Productivity, Efficiency March 29 ECLIPSE Performance, Productivity, Efficiency The following research was performed under the HPC Advisory Council activities HPC Advisory

More information

Case study: End-to-end data centre infrastructure management

Case study: End-to-end data centre infrastructure management Case study: End-to-end data centre infrastructure management Situation: A leading public sector organisation suspected that their air conditioning units were not cooling the data centre efficiently. Consequently,

More information

High-Performance Computing Clusters

High-Performance Computing Clusters High-Performance Computing Clusters 7401 Round Pond Road North Syracuse, NY 13212 Ph: 800.227.3432 Fx: 315.433.0945 www.nexlink.com What Is a Cluster? There are several types of clusters and the only constant

More information

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance

More information

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training

More information