High Performance LINPACK and GADGET 3 investigation


High Performance LINPACK and GADGET 3 investigation

Konstantinos Mouzakitis

August 21, 2014

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2014

Abstract

This project investigates the energy efficiency and performance of heterogeneous clusters. In particular, two applications are used to study how their performance and power consumption change as various input and system parameters are varied. These two applications are High Performance LINPACK (HPL) and GADGET, and they were chosen from this year's Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC) in Leipzig.

Contents

1 Introduction
  1.1 Report organisation
2 Student Cluster Competition
  2.1 Rules
  2.2 System Configuration
  2.3 Applications
  2.4 Obstacles
3 Highest LINPACK Award
  3.1 Background theory
    3.1.1 Algorithm
  3.2 Achieving the award
    3.2.1 Increase the Performance
    3.2.2 Decrease the Power consumption
    3.2.3 Competition results
4 GADGET Investigation
  4.1 Background theory
    4.1.1 Algorithm
    4.1.2 Technical information
  4.2 Performance investigation
  Competition Results
  Power results
5 Conclusions
  Future work
A HPL scripts
  A.1 Running script
  A.2 Affinity script
B System configuration scripts
  B.1 CPU clock frequency script
  B.2 Hyper-threading script
  B.3 GPU ECC script
  B.4 GPU Persistence mode script
  B.5 GPU clocks frequency script

List of Tables

3.1 Performance for different block sizes and corresponding problem sizes
3.2 Performance for different problem sizes (block size constant)
3.3 Performance for different GPU_DGEMM_SPLIT values (problem and block sizes constant)
3.4 Power consumption of different hardware parts, when idle and on full load
3.5 Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values
3.6 Power consumption and performance for different numbers of cores per GPU
3.7 Results from numerous tests carried out during the SCC
4.1 GADGET timing for different numbers of MPI tasks
4.2 GADGET timing for different MPI libraries
4.3 GADGET timing for different PMGRID dimensions and different numbers of Multiple Domains
4.4 GADGET timing for different compiler flags
4.5 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values

List of Figures

2.1 Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack
2.2 Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible
2.3 Server and Manifold modules of the liquid cooling system
2.4 Heat exchanger sketch
3.1 Performance for different block sizes
3.2 Performance for different problem sizes
3.3 Performance difference for P and Q value exchange
3.4 Performance results for different GPU_DGEMM_SPLIT values
3.5 Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values
3.6 Power consumption and performance for Hyper-Threading (H/T) on and off
3.7 Power consumption and performance for different numbers of cores per GPU
3.8 Power consumption according to the competition committee
4.1 Peano-Hilbert curve creation
4.2 Subdomains from the Peano-Hilbert curve
4.3 GADGET timing for different numbers of MPI tasks
4.4 GADGET timing for different MPI libraries
4.5 GADGET timing for different PMGRID dimensions and different numbers of Multiple Domains
4.6 GADGET timing for different compiler flags
4.7 GADGET simulation from ISC
4.8 GADGET power consumption, along with corresponding timings, for different PMGRID and MULTIDOMAINS values

Acknowledgements

This project would not have been possible without the kind contribution of others. I would like to thank my supervisor James Perry for his constant guidance and the time he devoted to me. I am also grateful to all the people from Boston Ltd, and especially to David Power, Michael Holiday and Jon Howard, for the help and support they provided for the ISC'14 Student Cluster Challenge. Finally, I would like to thank my parents Michael and Markella for their unconditional love and support, without which I would not have been able to do this Master of Science.

Chapter 1
Introduction

Energy efficiency, or power efficiency, is one of the main issues in modern society. A great deal of research has been devoted to reducing the power consumption of all kinds of products, and the term "green product" is now advertised in every sector. Computer science, and especially High Performance Computing (HPC), is no exception. The idea of green HPC has been gaining traction over the past decade, and power efficiency in supercomputers is of great importance as we move towards the exascale era. With a standard technology progression over the next decade, experts estimate that an exascale supercomputer could be constructed with power requirements in the 200 megawatt range, resulting in an estimated cost of $200-$300 million per year [2]. To make the operating cost of the first exascale system more reasonable, DARPA, after research, proposed an exascale power limit of 20 MW, which translates into 50 GFlops/Watt (an exaflop, 10^18 floating point operations per second, divided by 2x10^7 Watts gives 5x10^10 Flops/Watt). This ratio describes a system approximately 11 times more power efficient than the current top system in the Green500 list [3].

The design of upcoming supercomputers will be driven by the power consumption of their components and of their software. New technologies and hardware parts must be adopted, with accelerators and liquid cooling being the most popular ones, in conjunction with software techniques, in order to reach the desired Flops/Watt ratio. Some of these parts and techniques are tested every year during the Student Cluster Competition (SCC), held during the International Supercomputing Conference (ISC).

The current dissertation is written by a member of the EPCC team, which participated in the SCC of ISC 2014 and won one of the competition's awards. The report describes the methodology and strategies followed in order to achieve the winning results, as well as an investigation of the performance and power consumption of one of the competition's applications. Further details about the SCC rules and applications are given in Chapter 2.

1.1 Report organisation

This report consists of 5 chapters:

- Chapter 1 is a quick introduction to prepare the reader for the contents of the dissertation.
- Chapter 2 presents the rules of the Student Cluster Competition, the configuration of team EPCC's supercomputer and the challenges faced throughout the preparation and the actual competition.
- Chapter 3 describes all the testing that was carried out in order to achieve the winning results and get the award in the SCC.
- Chapter 4 analyses the performance, as well as the power consumption, of GADGET, one of the applications used in the SCC. Various parameters are changed and tested for energy efficiency.
- Chapter 5 summarises the results and the experience gained.

Chapter 2
Student Cluster Competition

The writer of this report was one of the four members of EPCC's team, which represented the University of Edinburgh in this year's International Supercomputing Conference SCC. The conference and the competition took place in Leipzig, Germany and lasted 5 days, starting on June 21 and finishing on June 25. The first two and a half days were assigned to system setup, during which the team's cluster was assembled and tested to make sure that everything was working correctly. During the last three days the actual competition was carried out, with each team running various applications and benchmarks and competing for the best performance. The other 3 team members were Manos Farsarakis, George Iniatis and Chenhui Quan, while the team leader was Xu Guo.

This year there were a total of 11 teams, representing universities from around the world, shown in the following list:

- Centre for HPC (CHPC), South Africa
- Ulsan National Institute of Science and Technology (UNIST), South Korea
- Massachusetts Institute of Technology (MIT), Bentley University, Northeastern University (NEU), United States
- EPCC at The University of Edinburgh (EPCC), United Kingdom
- Chemnitz University of Technology, Germany
- University of Hamburg, Germany
- University of Sao Paulo (USP), Brazil
- University of Colorado at Boulder, United States
- University of Science and Technology of China (USTC), China
- Shanghai Jiao Tong University (SJTU), China
- Tsinghua University, China

In addition to the University of Edinburgh, the EPCC team cooperated with Boston Limited, which provided all the hardware as well as technical support, a task assigned to a Boston employee, Michael Holiday. Boston is an HPC company which, in collaboration with Supermicro, helps its customers customise and create their ideal solution in order to solve their challenges [4]. Boston was very helpful, as they were always available to answer our emails and calls and solve any technical problems we were facing. In addition, they gave us full control of our system, both software-wise and hardware-wise, which helped us achieve the best possible results and develop our software and hardware skills.

2.1 Rules

The first and most important rule of the whole competition is the power limit. Each team must run all the given applications within a power cap of 3000 Watts, or approximately 13 Amperes, on one power circuit. One Power Distribution Unit (PDU) was given to each team, which monitored the consumption of each cluster. The SCC supervisors, as well as all the participating teams, were able to watch the current power consumption of all teams through the Ethernet interface of the PDU. If any team exceeded the 3 kW limit, a supervisor would come and warn the members of that team, asking them to rerun any application/benchmark that might have run above the power limit. If the power consumption reached a level well above the 3000 Watt limit, the circuit breaker would trip and all power to the systems would be cut off, leading to valuable time loss. [5]

Another significant rule of the competition is that no one is allowed to physically touch the system after the submission of the first results, at the end of the first day. If there is a need to touch the equipment, an official SCC supervisor needs to be called in order to judge the situation. In addition, the system must be active all the time and reboots are prohibited. This means that changes to the BIOS (EFI), as well as hibernation or suspension modes, are not allowed either. The only exception to this rule is the case of an unsafe situation, in which anyone can power down the system and an SCC supervisor must be called immediately. [5]

The input files of each day's applications, along with any special instructions for running them, were given to each team at the beginning of the day, using a USB flash drive. Teams should only use those datasets, run the corresponding applications and save the output files on another USB drive, which was given to the competition's supervisors at the end of each day. [5] In addition, the cluster could not be accessed outside the official hours of the competition, which means that the teams could not prepare for any upcoming application/benchmark beforehand.

At the end of the competition all the results were taken into consideration by the competition's committee and five awards were given in total:

- Highest LINPACK: The highest score received for the LINPACK benchmark under the power budget.

- Fan Favorite: Given to the team which receives the most votes from ISC participants during the SCC.
- 1st, 2nd and 3rd Place Overall Winners: The scoring for the overall winners was calculated using the scores from HPCC (10%), the chosen applications (80%), and an interview by the SCC board (10%). [5]

2.2 System Configuration

The team's cluster, which was named "Antikythera" after the first analog computer, consisted of 1 head node and 3 compute nodes. In fact, all 4 of them were used for computation, but the head node was named "head" because of the additional configuration and monitoring processes it ran. It must be noted that the initial configuration included a fifth node, which was later removed because of power limitations, but was brought to the competition in case it would be needed. All of the nodes were Supermicro server nodes, containing:

- 2 x Intel Xeon E5-2680V2 CPUs (10 cores / 20 threads each, 2.8 GHz clock frequency, 25 MB Intel Smart Cache, 64-bit instruction set)
- 2 x NVIDIA Tesla K40 GPUs (12 GB GDDR5 memory, 288 GB/sec memory bandwidth, 2880 CUDA cores, 1.43 (4.29) TFlops peak double (single) precision floating point performance)
- 8 x 8 GB DDR3 Registered ECC memory
- 1 x Intel 510 Series SSD (the head node contained 4 SSDs)

For the interconnect we used a 12-port Mellanox 40/56 Gb/s InfiniBand switch, and for Internet access we used a 48-port 10 GbE Ethernet switch.

The component of the cluster which made the difference was the cooling technology. Instead of conventional fan cooling, liquid cooling was used, provided by CoolIT. More specifically, CoolIT's Rack Direct Contact Liquid Cooling (DCLC) AHx20 model was used, which is a rack-based liquid cooling solution that enables high-performance and high-density clusters anywhere, without the requirement for facility liquid to be plumbed to the rack.

Figure 2.1: Antikythera (front view). The heat exchanger of the liquid cooling system can be seen on top of the rack.

Figure 2.2: Antikythera (back view). All the connections can be seen in this picture. The tubes of the liquid cooling system are clearly visible.

The unique AHx configuration consists of a liquid cooling network that is mounted directly onto the Intel Xeon CPUs and the NVIDIA K40 GPUs. This system allows both the processor and GPU accelerator heat output to be directly absorbed into circulating liquid, which then efficiently transports the heat to a liquid-to-air heat exchanger mounted on the top of the rack. This stand-alone rack solution is modular and compatible with any rack-computing set-up, enabling ultra-high density clusters to be deployed quickly and easily. The modules of the AHx20 system, as well as its heat exchanger, can be seen in Figures 2.3 and 2.4, respectively.

Apart from its modularity, the greatest gain of liquid cooling was the reduction of the total power consumption of the cluster, as a large number of power-consuming fans could be removed from the system. More specifically, there were initially 10 fans inside each node, each consuming a maximum of 25 Watts, whereas after the installation of the CoolIT AHx20 the 10 fans were replaced with 3 power-efficient ones, each consuming a maximum of 5 Watts.

Figure 2.3: Server and Manifold modules of the liquid cooling system.

Figure 2.4: Heat exchanger sketch.

The importance of this power reduction was significant, because it allowed the addition of one whole node to the system, without which the winning results would not have been achieved.

As for the software stack, all nodes had to be configured from scratch. Firstly, we installed the CentOS Linux distribution and, after setting up a shared folder using the Network File System (NFS), we installed all the required compilers and libraries in it, so that they would be accessible from all the nodes (an example of such an NFS set-up is sketched after the following list). Some of them are shown below:

- GNU and PGI compilers
- MPI libraries (OpenMPI, MVAPICH2)
- NVIDIA drivers and CUDA 5.5 toolkit
- Intel compilers, MKL Library and Intel MPI
- GNU Scientific Library (GSL)
- FFTW2 Library
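As an illustration of this kind of set-up, the sketch below exports a shared software directory from the head node and mounts it on a compute node. The host names, the /shared path and the service command are assumptions for illustration, not the actual configuration used on Antikythera.

    # On the head node: export the shared software directory over NFS
    echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
    exportfs -ra                 # reload the export table
    service nfs start            # start the NFS server (CentOS 6 style; newer releases use systemctl)

    # On each compute node: mount the share at boot
    echo "head:/shared  /shared  nfs  defaults  0 0" >> /etc/fstab
    mount /shared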

2.3 Applications

During the competition a total of 6 applications/benchmarks had to be run, 5 of which were known in advance, whereas the sixth one was announced during the competition. The 5 known applications were:

- High Performance LINPACK (HPL),
- GADGET,
- HPC Challenge (HPCC),
- Quantum ESPRESSO,
- OpenFOAM,

while the secret one was HPCG. In previous years' competitions there was a seventh secret application announced during the competition, but this was not the case this year. Instead of an additional application/benchmark, the SCC committee organised a challenge. The goal of this challenge was to run one of the known applications, Quantum ESPRESSO in particular, within a time limit of 20 minutes, and the winner would be the team that managed to do so with the lowest peak power consumption. An additional rule for this challenge was that physical alterations to the systems were allowed. In order to achieve that, our team firstly disconnected all GPUs from the whole system, to avoid their idle power draw, as the provided version of Quantum ESPRESSO ran only on CPUs. The procedure of removing the GPUs from the system was not trivial, because of the liquid cooling system. The liquid tubes came from the heat exchanger into the CPU, then into the GPU and then back out of the node, which made the complete removal of the GPUs impossible. In order to overcome this issue, we simply unplugged the PCI Express connector and all the additional power cables from each GPU and left it in the node. Moreover, we added the fifth spare node, whose CPUs would help us complete the application run in under 20 minutes, as the rules stated.

2.4 Obstacles

Numerous issues were faced, mainly throughout the preparation period, but during the actual competition as well. The most important ones are presented in the following list.

Power measurement tool
One of the most significant problems that we faced while preparing for the competition was the technical issues of our power measurement tool. There were times when its software would not work as expected, or would even crash. Even when we had it working, not all the nodes could be connected to the tool, because the fuse could not handle so much power. Having this as the only choice, we measured the power consumed by one node and made an assumption about the total system consumption.

It was only in the last week before the competition that bigger fuses were used and we managed to take the desired measurements.

Cluster management tools
The choice of an appropriate cluster management tool was another time-consuming issue. In the beginning we started using Bright Cluster Manager, but it was too restrictive for our purposes, so we gave up on it and used Cobbler and Puppet instead. While Cobbler worked fine, we had some issues with a few Puppet manifests, which did not cooperate with the Network File System. After this problem was overcome, we did not face any other difficulties regarding the management tools.

Node crashes
During benchmarking we made our cluster crash numerous times. The main reason was insufficient GPU memory while running High Performance LINPACK with large problem sizes. The problem was that we were not able to physically reboot the system, as we were working remotely, and if the crashes occurred on a weekend we had to wait until Monday for someone to reboot the system. This problem was solved after setting up IPMI, which enabled us to control the cluster remotely; a sketch of such a remote power-cycle command is shown at the end of this section.

Space insufficiency
After a large number of application runs, a lot of files were created, from initial conditions to output files, leaving insufficient space on the cluster. For this reason, instead of the 2 SSDs that were initially in the head node, we finally used 4.

Compilers and Libraries
The main software issue during the benchmarking phase was the installation of all the compilers and libraries. GNU compilers are installed by default in CentOS, but PGI compilers are not free, and we had to contact the appropriate people in order to get the desired licences. On the other hand, Intel compilers and libraries were provided by Boston, but they consumed a lot of disk space, so they had to be moved to another directory. In addition, installing all the required libraries with all 3 compilers (GNU, PGI, Intel) was a time-consuming task, mainly because of compatibility issues.

Network problem during SCC
The only problem we faced during the competition was a network failure, which restricted internet access for some teams, and our team was in that group. The main issue was that we could not watch the power consumption of our system, leading to "blind" tests. Thankfully, the committee took care of the problem and everything continued normally, with a small time extension given to the teams affected by the problem.
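To illustrate the kind of out-of-band control IPMI gave us, the commands below query and power-cycle a crashed node through its baseboard management controller. The BMC host name, user and password are placeholders, not the actual credentials or scripts used on the cluster.

    # Check whether the node is powered on (queried via the BMC, so it works even when the OS is hung)
    ipmitool -I lanplus -H node2-bmc -U admin -P secret chassis power status

    # Hard power-cycle the node remotely
    ipmitool -I lanplus -H node2-bmc -U admin -P secret chassis power cycle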

Chapter 3
Highest LINPACK Award

The EPCC team achieved great LINPACK results in this year's SCC at ISC and won the Highest LINPACK award. The team was the first in the history of the competition to break the 10 TFlop/s boundary under the 3 kW power limit. The previous record was approximately 9.2 TFlop/s, achieved in the last Asia Student Supercomputing Challenge (ASC), while the second best result in the competition we attended was 9.4 TFlop/s. A lot of hours were spent testing numerous system and benchmark parameters in order to attain the winning result. The details of those tests, as well as some LINPACK background theory, are described in the current chapter.

3.1 Background theory

High Performance LINPACK (HPL) is a benchmark that solves a dense linear system in double precision floating point arithmetic, and it is the version used in the competition. In general, the LINPACK benchmark is used to measure the performance of any HPC system and classify it in the TOP500 list. It must be noted that no single number can reflect the overall performance of a system, and neither can the LINPACK results, but since the solution of a dense system of linear equations is very regular, the performance numbers extracted from the LINPACK benchmark give a good approximation of a system's peak performance. [6]

3.1.1 Algorithm

The algorithm used in HPL is LU factorisation with partial pivoting, featuring multiple look-ahead depths. The operation count for the algorithm is (2/3)N^3 + O(N^2) double precision floating point operations, where N is the order of the matrix. For example, with N = 100,000 the benchmark performs roughly (2/3)x10^15 floating point operations, so a system sustaining 10 TFlop/s would complete the solve in about 67 seconds. This operation count excludes the use of a fast matrix multiply algorithm like "Strassen's Method", or of algorithms which compute a solution in lower than full precision and refine it iteratively. The data is distributed onto a two-dimensional P-by-Q grid of processes according to a block-cyclic scheme, to ensure a balanced workload and increase the scalability of the algorithm. The N-by-N matrix is first logically partitioned into NB-by-NB blocks, which are cyclically distributed onto the P-by-Q process grid in both dimensions of the matrix. [6] [7]

3.2 Achieving the award

The HPL version that was given to us was an executable optimised by NVIDIA, designed to run on both Intel CPUs and NVIDIA GPUs. Along with the executable, there was an input parameter file, called HPL.dat, which included all the variables to be tuned in order to achieve the best possible performance (an annotated excerpt of such a file is sketched below). The 4 main parameters we experimented with were:

- N: the problem size to be run,
- NB: the block size, which determines the data distribution,
- P: the number of rows in the process grid,
- Q: the number of columns in the process grid.

In addition, 2 scripts were used. One included the commands to load the required libraries, by setting the appropriate environment variables, and the command to run the benchmark, while the other was used to set 3 parameters:

- the number of threads to be used per GPU (CPU_CORES_PER_GPU),
- the percentage of work to be done on the GPUs (GPU_DGEMM_SPLIT),
- the GPU and CPU affinity for each local MPI rank.

The first two of these parameters played a significant role in achieving the desired results and a lot of tests were run in order to find their best values. The 2 scripts are shown in Appendix A. In order to run the given HPL executable, the following libraries must be installed and loaded, which is done in the script shown in Appendix A.1:

- a version of the OpenMPI library, in both PATH and LD_LIBRARY_PATH,
- the CUDA 5.5 toolkit, in both PATH and LD_LIBRARY_PATH,
- the Intel library, in LD_LIBRARY_PATH,
- the Intel MKL library, in LD_LIBRARY_PATH,
- the libdgemm dynamic library, in LD_LIBRARY_PATH (this library was located in the HPL root directory).

The whole preparation process was divided into two parts. The first step was to find the input parameter values that would give the highest performance, without worrying about the power consumption.
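For reference, the sketch below shows how the four tuned parameters map onto lines of a standard HPL.dat file. It follows the generic HPL input format; the problem size and the file as a whole are illustrative placeholders, not the settings used for the winning run.

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout, 7=stderr, file)
    1            # of problems sizes (N)
    100352       Ns            <-- problem size N (placeholder; a multiple of NB)
    1            # of NBs
    896          NBs           <-- block size NB
    0            PMAP process mapping (0=Row-, 1=Column-major)
    1            # of process grids (P x Q)
    4            Ps            <-- process grid rows P
    2            Qs            <-- process grid columns Q
    16.0         threshold
    (remaining algorithmic tuning lines omitted)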

After the end of this first part, a baseline for the best achievable results would have been established, and the next step would investigate ways to reduce the peak power consumption. The second part focused on experimenting not only with HPL's parameters, but with some system parameters as well (CPU clock speed, GPU clock speed, Hyper-Threading), in order to bring the power consumption under the 3 kW limit.

3.2.1 Increase the Performance

The first 2 variables whose optimal values needed to be found were the problem size (N) and the block size (NB). These two parameters depend on each other because, in order to achieve a better load balance between processes, the problem size must be a multiple of the block size. Furthermore, the chosen values must ensure that the number of blocks in each matrix dimension is exactly divisible by the corresponding process grid dimension (P or Q), so that each process is assigned whole data blocks. A README file stated that the most common NB values for a system with 2 GPUs per node are 768 and 896, among others. The performance for these values, along with some more values close to them, was computed using 2 problem sizes, each chosen to be a multiple of the NB values under test. The detailed results of the tests are shown in Table 3.1. All the NB tests were executed with the default CPU clock frequency (2.8 GHz).

Table 3.1: Performance for different block sizes and corresponding problem sizes (columns: N, NB, performance in GFlop/s).

The red colour in Table 3.1 means that the corresponding values resulted in a failed correctness test for the HPL benchmark. The results are also presented in Figure 3.1, where the red line parts likewise represent the failed test values. As we can see in both Table 3.1 and Figure 3.1, the best block size is 896, which was used for the following tests, as well as for the final runs in the competition.

After the optimal NB value was found, the next step was to identify the best multiple of it to use as the problem size. In order to maximise the performance, most of the available system memory should be used.

Figure 3.1: Performance for different block sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2.8 GHz.

In our case, the problem size must be such that everything fits in GPU memory. This was a challenging step, because overfilling the GPUs with data caused the corresponding node to crash, and a reboot was needed. The test results are shown in Table 3.2 and in Figure 3.2.

Table 3.2: Performance for different problem sizes (block size constant).

The tests start with an N value that uses significantly less than all the GPU memory and slowly increase N until the GPU memory is just filled, but not overfilled. The nvidia-smi tool was used to monitor the GPU memory usage while HPL was running, as sketched below. The zero values in Table 3.2 mean that the memory was overfilled and the system crashed for the corresponding problem size. These values are not clearly visible in Figure 3.2 for scaling reasons. It must be noted here that the performance values of Table 3.2 do not match those in Table 3.1, because they were extracted using a lower CPU clock frequency (2 GHz).
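A minimal way to watch the GPU memory filling up during a run is to poll nvidia-smi in a loop; the 5-second interval here is an arbitrary choice for illustration, not a value prescribed by the benchmark.

    # Print per-GPU memory usage every 5 seconds while HPL is running
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5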

Figure 3.2: Performance for different problem sizes. For these tests GPU_DGEMM_SPLIT was equal to 0.97 and the CPU clock frequency equal to 2 GHz.

The best performance was achieved with the largest problem size that still fitted in GPU memory.

After determining the optimal problem and block sizes, the grid dimensions P and Q were studied. P x Q should equal the total number of GPUs. Note that when P=1 the benchmark can only run a problem that fits in GPU memory, so N needs to be reduced. Typically the best result is achieved for a process grid which is close to square, meaning that P and Q should be almost equal. The EPCC team's system included 8 GPUs in total, so there were two possible combinations for P and Q:

- P = 2 and Q = 4
- P = 4 and Q = 2

The case of P=1 and Q=8 (or vice versa) was not tested, because the problem size would need to be reduced, which would not be beneficial in our case. The results are shown in Figure 3.3, where we can see that the process grid dimensions play a significant role in achieving the highest performance. We chose to continue with P=4 and Q=2.

The last parameter investigated in order to further increase the performance was the percentage of the work to be done on the GPUs, which is determined by the variable GPU_DGEMM_SPLIT. The value proposed by the README file was 0.97, which means that 97% of the computational work is done on the GPUs and the remaining 3% is executed on the CPUs. After some testing it was found that the proposed value was not always the optimal one. Two sets of tests were carried out in order to examine this: one with the CPU clocks running at 2 GHz and a second one with the default frequency of 2.8 GHz. The results are shown in Table 3.3.

Figure 3.3: Performance difference for P and Q value exchange.

Table 3.3: Performance for different GPU_DGEMM_SPLIT values, at CPU clock frequencies of 2 GHz and 2.8 GHz (problem and block sizes constant).

We can see that the results follow a different trend for different CPU clock frequencies. In the first series of tests, with the lower clock speed, the best performance is achieved for a GPU_DGEMM_SPLIT equal to 97%, whereas in the series with the higher clock frequency the best performance is achieved for a lower GPU_DGEMM_SPLIT percentage (95%). The reason this happens is that CPUs with a higher clock frequency can execute calculations faster, so assigning more computational work to them is beneficial: it helps the GPUs complete the job and leads to better performance. This is better presented in Figure 3.4.

Figure 3.4: Performance results for different GPU_DGEMM_SPLIT values.

After examining all the results so far, the highest performance is 1.032E+04 GFlop/s, and the parameters for achieving it are the following:

- Problem size (N): the largest value that fits in GPU memory (see Table 3.2)
- Block size (NB) = 896
- Process grid dimensions: P = 4 and Q = 2
- GPU workload percentage (GPU_DGEMM_SPLIT) = 95% (the best value found at the default 2.8 GHz clock)

3.2.2 Decrease the Power consumption

The purpose of the competition was not only to achieve the highest HPL performance, but also to do it under the power limit of 3000 Watts. For the peak performance reached so far, the power consumption of the whole system was 3292 Watts. In order to find the appropriate system variables to increase the cluster's energy efficiency, the power consumption of some individual hardware parts was measured. These measurements are shown in Table 3.4. The consumption of the Ethernet and InfiniBand switches is constant and there is nothing to be done to decrease it. The hardware parameters and parts experimented with were the clock frequencies of the CPUs and GPUs, Hyper-Threading (on or off), the number of water cooling fans located on CoolIT's heat exchanger, and the number of conventional cooling fans located inside each node. In addition, the number of CPU cores used by each GPU played a significant role in reducing the power consumption, because the fewer cores each GPU uses, the less power is consumed; however, the performance decreases as well, so a balanced value had to be found.

Table 3.4: Power consumption of different hardware parts when idle and on full load (Ethernet switch, Mellanox SX6012 InfiniBand switch, NVIDIA K40 GPU, Intel Xeon E5-2680V2 CPU, CoolIT Rack DCLC AHx20).

Moreover, with the changes in the CPU clock frequencies, the value of GPU_DGEMM_SPLIT had to be investigated once more, in order to achieve higher performance and lower power consumption. One important detail is that the system was not allowed to exceed the 3 kW limit at any time, meaning that the peak power consumption of each HPL run had to be measured, and not the average power consumption of the whole run. All the preparation measurements were taken with a power measurement tool provided by Boston.

The first step towards the reduction of the system's consumption was the experimentation with lower CPU clock frequencies. The software used to manipulate the clock rates was cpufrequtils and, in order for the changes to take effect on all nodes, a script was created, which is shown in Appendix B.1 (a simplified sketch of such a script is given below). The power consumption, as well as the performance, for 3 different CPU clock speeds and different GPU_DGEMM_SPLIT values was measured, and the results are shown in Table 3.5 and in Figure 3.5.

Table 3.5: Power consumption and performance for different CPU clock frequencies (1.5, 1.7 and 2 GHz) and different GPU_DGEMM_SPLIT values.

As we can see in Figure 3.5, a higher CPU clock frequency means higher performance, but in general it also leads to higher power consumption. Furthermore, the power consumption decreases as the GPU_DGEMM_SPLIT percentage increases. This happens because the main source of the power variations is the CPU. The GPUs, on full load, consume a roughly constant amount of approximately 235 Watts each, which does not change for small differences in the GPU_DGEMM_SPLIT value. On the contrary, a difference in the amount of work to be done on the CPUs affects their power consumption and leads to the observed power reduction as the DGEMM percentage rises.
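A simplified sketch of a cluster-wide frequency script is shown below; the node names, the logical CPU count and the use of ssh are assumptions for illustration, not the exact contents of the Appendix B.1 script.

    #!/bin/bash
    # Set every logical CPU on every node to the requested frequency (default 2 GHz).
    FREQ=${1:-2.0GHz}
    for node in node1 node2 node3 node4; do
        ssh "$node" "for cpu in \$(seq 0 39); do          # 2 x 10 cores x 2 threads = 40 logical CPUs
            sudo cpufreq-set -c \$cpu -g userspace        # governor that accepts a fixed frequency
            sudo cpufreq-set -c \$cpu -f $FREQ
        done"
    done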

Figure 3.5: Power consumption and performance for different CPU clock frequencies and different GPU_DGEMM_SPLIT values.

Another parameter to be tested was the Hyper-Threading feature. Initially we thought that we would not use it, expecting it to consume more power without any performance improvement. The results shown in Figure 3.6 do not agree with those initial thoughts (they were extracted using a small number of cores per GPU). Hyper-Threading may have caused the system to consume more power, as it enables the additional "software" threads, but on the other hand it helped reach a higher performance. One possible reason that Hyper-Threading increased the performance is that it helped eliminate some dependencies in the HPL code. It must be noted that Hyper-Threading was turned on and off using a script, which is presented in Appendix B.2, and not through the system's BIOS settings; a sketch of this approach is shown below.

In order to further reduce the power consumption, reducing the number of cores used by each GPU was investigated. The results can be seen in Table 3.6 and in Figure 3.7. As we can see in Figure 3.7, the performance increases as the number of CPU cores/threads per GPU increases, which is expected, as more computing power is used for the calculations. On the other hand, the power consumption of the system also increases, because there are more CPU cores/threads on full load.
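The following sketch illustrates one way to disable Hyper-Threading at runtime, by off-lining the second logical CPU of each core through sysfs; it is an assumption about the approach, not a copy of the Appendix B.2 script (re-enabling works by writing 1 back to the same files).

    #!/bin/bash
    # Take the sibling hardware threads offline, leaving one logical CPU per physical core.
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        siblings=$(cat "$cpu/topology/thread_siblings_list")   # e.g. "0,20"
        first=${siblings%%[,-]*}                               # first logical CPU of this core
        id=${cpu##*cpu}
        if [ "$id" != "$first" ]; then
            echo 0 | sudo tee "$cpu/online" > /dev/null        # offline the extra thread
        fi
    done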

Figure 3.6: Power consumption and performance for Hyper-Threading (H/T) on and off.

Table 3.6: Power consumption and performance for different numbers of cores per GPU.

Each measurement in Table 3.6 and Figure 3.7 is taken with different, but not too distant, values of the DGEMM percentage and CPU clock frequency. All the results presented so far were measured during the preparation period, before the actual competition. The best performance achieved up to this stage that was under the power limit was 9.923E+03 GFlop/s, obtained with the following parameters:

- CPU clock frequency: 2.8 GHz
- GPU clock frequencies: defaults
- GPU_DGEMM_SPLIT: 98%
- Cores per GPU: 6
- Hyper-Threading: ON

Figure 3.7: Power consumption and performance for different numbers of cores per GPU.

Comments on GPUs

During both the preparation phase and the actual competition, numerous GPU parameters were investigated as well. In order to change the values of these parameters, the nvidia-smi interface provided by NVIDIA was used. Firstly, the idle power of each GPU needed to be reduced, because it was approximately 60 Watts. After a lot of research and testing, we found that changing the persistence mode of each GPU from 0 to 1 not only decreases its idle power consumption down to 19 Watts, but also gives a performance boost of approximately 150 GFlop/s. Furthermore, it was noticed that the maximum memory available was smaller on some GPUs, which was caused by the ECC (Error-Correcting Code) mode, which can detect and correct the most common kinds of internal data corruption. In order to make all the GPUs use the maximum memory allowed, the ECC mode was turned off.

After the power and memory issues were dealt with, the experimentation with GPU clock frequencies (graphics and memory clocks) started. The supported frequencies for the memory clock are 324 MHz, which only supports a 324 MHz graphics clock, and 3004 MHz, which supports a graphics clock frequency of 666 MHz, 745 MHz, 810 MHz or 875 MHz. The default clock values are 3004 MHz for the memory and 745 MHz for the graphics. After a lot of testing, the default clocks turned out to work better than all the other combinations: the lower clock speeds caused the performance to drop significantly, while the highest clocks caused the power consumption to reach high peaks without giving any performance boost. The scripts for setting and changing the ECC mode, the persistence mode and the GPU clock frequencies are shown in Appendices B.3, B.4 and B.5, respectively.
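The three nvidia-smi settings described above can be applied with the commands sketched here; this is an illustrative summary, not the contents of the Appendix B.3-B.5 scripts, and it assumes all GPUs in a node are targeted at once.

    # Persistence mode on: keeps the driver loaded and lowers idle power
    sudo nvidia-smi -pm 1

    # ECC off: frees the memory reserved for error-correction data
    # (takes effect only after the GPU or node is reset)
    sudo nvidia-smi -e 0

    # Pin the application clocks to the default memory,graphics pair (MHz)
    sudo nvidia-smi -ac 3004,745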

28 performance to drop significantly, while the highest clocks, caused the power consumption to reach high peaks, without giving any performance boost. The scripts for setting and changing ECC mode, persistence mode and the GPU clock frequencies are shown in Appendices B.3, B.4 and B.5, respectively Competition results Before the official start of the competition, we had one day, after the system set-up, for testing, a big part of which was dedicated to HPL. We found out that the committee s Power Distribution Unit (PDU) was more tolerant, than the one we were using before, as it was measuring the system s power consumption every 5 seconds, a fact that gave us some room for increasing the performance. In order to further decrease the power consumption, we decided to unplug some of the heat-exchanger s fans, as well. The heat-exchanger has a total of 20 fans (5 groups of 4 fans), as shown in Figure 2.4, and we ended up disconnecting 8 of them (2 groups of 4), which left us only 12, a sufficient number for our system, A large number of the competition results is shown in Table 3.7. CPU freq (GHz) DGEM (%) Cores/GPU Gflop/s Power (W) GFlops/Watt 2.5 GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E GHz E Table 3.7: Results from numerous tests, carried out during the SCC. The highlighted line in Table 3.7 is showing the winning result, which was submitted to the SCC committee. This result was extracted from the first test, completed at the very beginning of the competition, and it was achieved after leaving the system to cool for at least half an hour, down to 24 Celsius degrees. Furthermore, the GFlops/Watt ratio is presented in the last column of Table 3.7, for reference. It must be noted that our 21

In addition to the above results, Figure 3.8 presents the power consumption of the EPCC cluster on the Monday, when HPL (the 4 high spikes) and HPCC (the 4 lower spikes) were run. The power consumption of the winning HPL run is represented by the fourth high spike.

Figure 3.8: Power consumption according to the competition committee.

Chapter 4
GADGET Investigation

In addition to HPL, the writer of the current dissertation was responsible for another competition application, GADGET. A large number of tests were carried out in order to increase the performance of GADGET, and an attempt was made to investigate its power behaviour when changing various parameters. The GADGET code version used was GADGET 3, which is not an official release and has little documentation.

4.1 Background theory

GADGET is an open source application for cosmological N-body and Smoothed-Particle Hydrodynamics (SPH) simulations, run on distributed memory architecture systems. (An N-body simulation models a dynamical system of particles acting under the influence of physical forces such as gravity; SPH is a computational method used for simulating fluid flows.) It can be run on anything from an individual PC to a large-scale cluster. GADGET can be used to simulate and study a variety of astrophysical problems, ranging from colliding and merging galaxies to the formation of large-scale structures in the Universe. It is tunable through an input file, and its simulations can include either dark matter only, or dark matter with gas. By enabling some additional gas processes, such as radiative cooling and heating, GADGET can be used to study the dynamics of the plasma existing between galaxies, and star formation. [8]

4.1.1 Algorithm

An important aspect of GADGET's algorithm and implementation is that both dark matter and gas are represented as particles. The main and most computationally intensive part of the algorithm is the calculation of the gravitational forces (the N-body simulation), but besides that GADGET also uses hydrodynamical methods in order to simulate fluid flows.

The MPI library is used to exploit parallelism, with a domain decomposition discussed further in this section, along with the force algorithms.

Gravitational field methods

There are two commonly used types of methods for computing the gravitational forces between particles [9]:

- Particle-Mesh (PM) methods are the fastest ones for computing the gravitational field, as they are based on the Fast Fourier Transform (FFT). The defect of PM is that it is not efficient when computing forces for particles that are in adjacent, dense grid cells.
- Hierarchical tree algorithms are more efficient for close, high-density cells, but they are slower than PM for distant cells with low density contrast.

The hierarchical tree algorithm is the basic choice of GADGET. It organises distant particles into larger groups, allowing their gravity to be accounted for by means of a single force. The forces are then calculated, firstly, by creating a tree using recursive space decomposition, and then by traversing that tree and calculating partial forces between its nodes. GADGET can use a method called TreePM for computing the gravitational field, which is a hybrid of the above two and uses each one where it is more efficient. When TreePM is toggled off, GADGET uses the hierarchical tree algorithm alone.

Hydrodynamical methods

There are two well-known classes of methods for computing the hydrodynamical field [9]:

- Eulerian methods, which are based on space discretisation and represent fluid variables on a mesh.
- Lagrangian methods, which are based on mass discretisation and use fluid particles to model the flow.

GADGET uses SPH for the hydrodynamical computations, which is a Lagrangian method, because gas is represented as particles.

Domain decomposition

In order to define the domain decomposition across processing units, GADGET uses the Peano-Hilbert curve. Firstly, it creates the curve, as shown in Figure 4.1, which maps 3D or 2D space onto 1D space, and then it splits the curve into pieces that define the sub-domains, as presented in Figure 4.2.

Figure 4.1: Peano-Hilbert curve creation.

Figure 4.2: Subdomains from the Peano-Hilbert curve.

4.1.2 Technical information

Cosmological simulations need some initial conditions (ICs) in order to begin, which are generated by a given (parallel) program, called N-GenIC, and are given as input to GADGET. Both GADGET and N-GenIC need an MPI library in order to compile and run. The results of experimenting with different MPI libraries will be presented later in this chapter. In addition, two other libraries are also required:

- GNU Scientific Library (GSL)
- FFTW2 (Fastest Fourier Transform in the West)

GADGET is tunable in two ways. The first one is via a configuration file, at compilation time, which can enable the TreePM algorithm and the periodic boundary conditions, set the precision to single or double, disable the gravity part (only for pure hydrodynamical problems), and set the particle ID variables to be of type long (in case the number of particles is greater than 2 billion). The parameters described are only a part of everything that is included in the configuration file; an illustrative excerpt is sketched below.
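A minimal sketch of such compile-time options is shown here, using the option names from the public GADGET-2 documentation; the exact names and values in the GADGET 3 tree used for the competition may differ, and the PMGRID value is a placeholder.

    # Illustrative compile-time configuration excerpt (not the competition settings)
    PERIODIC                 # periodic boundary conditions
    PMGRID=512               # enable the TreePM scheme with a 512^3 Fourier mesh
    DOUBLEPRECISION          # store particle data in double precision
    LONGIDS                  # 64-bit particle IDs, needed above ~2 billion particles
    #NOGRAVITY               # uncomment for pure hydrodynamical problems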

The second way to tune GADGET is via a parameter file, at runtime. Some of the variables that can be set in this file are the output directory location, the names of the output files, the CPU-time limit for the current run, the number of time-steps to simulate, the communication buffer size, as well as the initial and minimum allowed gas temperatures. The code comes with three problem sizes and their corresponding ICs, configuration and parameter files:

- Small: 2x128^3 = 4,194,304 particles
- Medium: 2x512^3 = 268,435,456 particles
- Large: the largest configuration provided

It must be noted that the Large problem size could not be run on the team's cluster because of insufficient memory.

The output of GADGET is a folder with numerous files, the most important of which include information about the global energy statistics of the simulation (energy.txt), various aspects of the performance of the gravitational force computation for each timestep (timings.txt) and a list of all the timesteps (info.txt). There are also the restart files, which are used in case of a paused simulation, and some snapshot files, which are used for the visualisation of the simulation. The most important output file is the one that keeps track of the cumulative CPU consumption of the various parts of the code and the time taken for each timestep. [11]

4.2 Performance investigation

Increasing the performance of GADGET was not a straightforward thing to do, because GADGET is a scientific application and not a benchmark. The Small test case was suited to the initial testing, as it had a low completion time. The first step was to investigate how it performs when the number of MPI tasks used for the job increases. The main reason for this step was to see whether GADGET scales well across multiple nodes, and to get a first estimation of the final results. The results are presented in Table 4.1; the MVAPICH MPI library, compiled with the PGI compiler, is used for this experiment. It must be noted that all the following GADGET timings are taken for the execution of 3 time-steps.

As we can see, GADGET scales really well, as the output time is almost halved when the number of MPI tasks is doubled. The reason we see a big drop in time when the MPI tasks go from 8 to 16 is that, at that point, process binding flags were added to the mpirun command (a sketch of the resulting run command is shown below). The flags used are -bind-to socket -map-by hwthread, which bind processes to a socket and then map them by hardware thread, in order to group sequential ranks together [10]. The results are also presented in Figure 4.3, which is a timing plot. A corresponding scaling plot could be created from the same data, since each run executes the same problem with a doubled number of MPI tasks.
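For reference, a run of the Small test case with these binding flags might look like the following sketch; the task count, executable name and parameter file name are placeholders, only the binding flags are taken from the text above.

    # Launch GADGET 3 on 16 MPI tasks with socket binding and hardware-thread mapping
    mpirun -np 16 -bind-to socket -map-by hwthread ./Gadget3 gadget_small.param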


Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical

More information

Resource Scheduling Best Practice in Hybrid Clusters

Resource Scheduling Best Practice in Hybrid Clusters Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services ALPS Supercomputing System A Scalable Supercomputer with Flexible Services 1 Abstract Supercomputing is moving from the realm of abstract to mainstream with more and more applications and research being

More information

ECDF Infrastructure Refresh - Requirements Consultation Document

ECDF Infrastructure Refresh - Requirements Consultation Document Edinburgh Compute & Data Facility - December 2014 ECDF Infrastructure Refresh - Requirements Consultation Document Introduction In order to sustain the University s central research data and computing

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

PRIMERGY server-based High Performance Computing solutions

PRIMERGY server-based High Performance Computing solutions PRIMERGY server-based High Performance Computing solutions PreSales - May 2010 - HPC Revenue OS & Processor Type Increasing standardization with shift in HPC to x86 with 70% in 2008.. HPC revenue by operating

More information

Scaling from Workstation to Cluster for Compute-Intensive Applications

Scaling from Workstation to Cluster for Compute-Intensive Applications Cluster Transition Guide: Scaling from Workstation to Cluster for Compute-Intensive Applications IN THIS GUIDE: The Why: Proven Performance Gains On Cluster Vs. Workstation The What: Recommended Reference

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Linux Cluster Computing An Administrator s Perspective

Linux Cluster Computing An Administrator s Perspective Linux Cluster Computing An Administrator s Perspective Robert Whitinger Traques LLC and High Performance Computing Center East Tennessee State University : http://lxer.com/pub/self2015_clusters.pdf 2015-Jun-14

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This

More information

Hadoop on the Gordon Data Intensive Cluster

Hadoop on the Gordon Data Intensive Cluster Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

HUAWEI Tecal E6000 Blade Server

HUAWEI Tecal E6000 Blade Server HUAWEI Tecal E6000 Blade Server Professional Trusted Future-oriented HUAWEI TECHNOLOGIES CO., LTD. The HUAWEI Tecal E6000 is a new-generation server platform that guarantees comprehensive and powerful

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing.

A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing. Appro HyperBlade A Smart Investment for Flexible, Modular and Scalable Blade Architecture Designed for High-Performance Computing. Appro HyperBlade clusters are flexible, modular scalable offering a high-density

More information

HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server

HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server HUAWEI TECHNOLOGIES CO., LTD. HUAWEI FusionServer X6800 Data Center Server HUAWEI FusionServer X6800 Data Center Server Data Center Cloud Internet App Big Data HPC As the IT infrastructure changes with

More information

Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015

Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015 Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015 Xian Shi 1 bio I am a second-year Ph.D. student from Combustion Analysis/Modeling Lab,

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability

More information

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC2013 - Denver

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC2013 - Denver 1 The PHI solution Fujitsu Industry Ready Intel XEON-PHI based solution SC2013 - Denver Industrial Application Challenges Most of existing scientific and technical applications Are written for legacy execution

More information

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist The Top Six Advantages of CUDA-Ready Clusters Ian Lumb Bright Evangelist GTC Express Webinar January 21, 2015 We scientists are time-constrained, said Dr. Yamanaka. Our priority is our research, not managing

More information

IT@Intel. Comparing Multi-Core Processors for Server Virtualization

IT@Intel. Comparing Multi-Core Processors for Server Virtualization White Paper Intel Information Technology Computer Manufacturing Server Virtualization Comparing Multi-Core Processors for Server Virtualization Intel IT tested servers based on select Intel multi-core

More information

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Installation Guide (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Tel: +44 (0) 141 3322681 Fax: +44 (0) 141 3326792 www.mve.com Table of Contents 1.

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

Improved LS-DYNA Performance on Sun Servers

Improved LS-DYNA Performance on Sun Servers 8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms

More information

A-CLASS The rack-level supercomputer platform with hot-water cooling

A-CLASS The rack-level supercomputer platform with hot-water cooling A-CLASS The rack-level supercomputer platform with hot-water cooling INTRODUCTORY PRESENTATION JUNE 2014 Rev 1 ENG COMPUTE PRODUCT SEGMENTATION 3 rd party board T-MINI P (PRODUCTION): Minicluster/WS systems

More information

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING GPU COMPUTING VISUALISATION XENON Accelerating Exploration Mineral, oil and gas exploration is an expensive and challenging

More information

AirWave 7.7. Server Sizing Guide

AirWave 7.7. Server Sizing Guide AirWave 7.7 Server Sizing Guide Copyright 2013 Aruba Networks, Inc. Aruba Networks trademarks include, Aruba Networks, Aruba Wireless Networks, the registered Aruba the Mobile Edge Company logo, Aruba

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Using the Windows Cluster

Using the Windows Cluster Using the Windows Cluster Christian Terboven terboven@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University Windows HPC 2008 (II) September 17, RWTH Aachen Agenda o Windows Cluster

More information

ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling. January 2009 ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

More information

Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

More information

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers WHITE PAPER FUJITSU PRIMERGY AND PRIMEPOWER SERVERS Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers CHALLENGE Replace a Fujitsu PRIMEPOWER 2500 partition with a lower cost solution that

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Lecture 1: the anatomy of a supercomputer

Lecture 1: the anatomy of a supercomputer Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949

More information

SR-IOV: Performance Benefits for Virtualized Interconnects!

SR-IOV: Performance Benefits for Virtualized Interconnects! SR-IOV: Performance Benefits for Virtualized Interconnects! Glenn K. Lockwood! Mahidhar Tatineni! Rick Wagner!! July 15, XSEDE14, Atlanta! Background! High Performance Computing (HPC) reaching beyond traditional

More information

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...

More information

June, 2009. Supermicro ICR Recipe For 1U Twin Department Cluster. Version 1.4 6/25/2009

June, 2009. Supermicro ICR Recipe For 1U Twin Department Cluster. Version 1.4 6/25/2009 Supermicro ICR Recipe For 1U Twin Department Cluster with ClusterVision ClusterVisionOS Version 1.4 6/25/2009 1 Table of Contents 1. System Configuration... 3 Bill Of Materials (Hardware)... 3 Bill Of

More information

Picking the right number of targets per server for BeeGFS. Jan Heichler March 2015 v1.2

Picking the right number of targets per server for BeeGFS. Jan Heichler March 2015 v1.2 Picking the right number of targets per server for BeeGFS Jan Heichler March 2015 v1.2 Evaluating the MetaData Performance of BeeGFS 2 Abstract In this paper we will show the performance of two different

More information

SAS Business Analytics. Base SAS for SAS 9.2

SAS Business Analytics. Base SAS for SAS 9.2 Performance & Scalability of SAS Business Analytics on an NEC Express5800/A1080a (Intel Xeon 7500 series-based Platform) using Red Hat Enterprise Linux 5 SAS Business Analytics Base SAS for SAS 9.2 Red

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

More information

NVIDIA GPUs in the Cloud

NVIDIA GPUs in the Cloud NVIDIA GPUs in the Cloud 4 EVOLVING CLOUD REQUIREMENTS On premises Off premises Hybrid Cloud Connecting clouds New workloads Components to disrupt 5 GLOBAL CLOUD PLATFORM Unified architecture enabled by

More information

Big Data Performance Growth on the Rise

Big Data Performance Growth on the Rise Impact of Big Data growth On Transparent Computing Michael A. Greene Intel Vice President, Software and Services Group, General Manager, System Technologies and Optimization 1 Transparent Computing (TC)

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Performance analysis of parallel applications on modern multithreaded processor architectures

Performance analysis of parallel applications on modern multithreaded processor architectures Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance analysis of parallel applications on modern multithreaded processor architectures Maciej Cytowski* a, Maciej

More information

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended

More information

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009 ECLIPSE Best Practices Performance, Productivity, Efficiency March 29 ECLIPSE Performance, Productivity, Efficiency The following research was performed under the HPC Advisory Council activities HPC Advisory

More information

Case study: End-to-end data centre infrastructure management

Case study: End-to-end data centre infrastructure management Case study: End-to-end data centre infrastructure management Situation: A leading public sector organisation suspected that their air conditioning units were not cooling the data centre efficiently. Consequently,

More information

High-Performance Computing Clusters

High-Performance Computing Clusters High-Performance Computing Clusters 7401 Round Pond Road North Syracuse, NY 13212 Ph: 800.227.3432 Fx: 315.433.0945 www.nexlink.com What Is a Cluster? There are several types of clusters and the only constant

More information

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance

More information

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training

More information