HPC Applications Scalability
Gilad@hpcadvisorycouncil.com
Applications Best Practices
LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
D. E. Shaw Research Desmond
NWChem
MPQC - Massively Parallel Quantum Chemistry Program
ANSYS FLUENT and CFX
CD-adapco STAR-CCM+
CD-adapco STAR-CD
MM5 - The Fifth-Generation Mesoscale Model
NAMD
LSTC LS-DYNA
Schlumberger ECLIPSE
CPMD - Car-Parrinello Molecular Dynamics
WRF - Weather Research and Forecast Model
Dacapo - total energy program based on density functional theory
Lattice QCD
OpenFOAM
GAMESS
HOMME
OpenAtom
POPperf
Applications Best Practices
Performance evaluation
Profiling - network, I/O
Recipes (installation, debug, command lines, etc.)
MPI libraries
Power management
Optimizations at small scale
Optimizations at scale
HPC Applications Small Scale
ECLIPSE Performance Results - Interconnect
InfiniBand enables the highest scalability
Performance accelerates with cluster size
Performance over GigE and 10GigE does not scale - a slowdown occurs beyond 8 nodes
Figure: Schlumberger ECLIPSE (FOURMILL), elapsed time (seconds) vs. number of nodes (4-24), single job per cluster size; lower is better; GigE, 10GigE, InfiniBand
MM5 Performance Results - Interconnect
Input data: T3A - 36 km resolution, 112x136 grid size, 33 vertical levels, 81-second time step, 3-hour forecast
InfiniBand DDR delivers higher performance at every cluster size
Up to 46% versus GigE and 3% versus 10GigE
Figure: MM5 benchmark results (T3A), Mflop/s vs. number of nodes (4-22); higher is better; GigE, 10GigE, InfiniBand DDR; Platform MPI
MPQC Performance Results - Interconnect
Input dataset: MP2 calculations of the uracil dimer binding energy using the aug-cc-pvtz basis set
InfiniBand enables higher scalability
Performance accelerates with cluster size
Outperforms GigE by up to 124% and 10GigE by up to 38%
Figure: MPQC benchmark result (aug-cc-pvtz), elapsed time (seconds) vs. number of nodes (8, 16, 24); lower is better; GigE, 10GigE, InfiniBand DDR
NAMD Performance Results - Interconnect
ApoA1 case - the benchmark comprises 92K atoms of lipid, protein, and water
Models a bloodstream lipoprotein particle; one of the most widely used datasets for benchmarking NAMD
InfiniBand 20Gb/s outperforms GigE and 10GigE at every cluster size
InfiniBand provides up to 79% higher performance vs. GigE and 49% vs. 10GigE
Figure: NAMD (ApoA1), ns/day vs. number of nodes (4-24); higher is better; GigE, 10GigE, InfiniBand; Open MPI
CPMD Performance - MPI
Platform MPI shows better scalability than Open MPI
Figure: CPMD (Wat32 inp-1 and inp-2), elapsed time (seconds) vs. number of nodes (4, 8, 16); lower is better; Open MPI vs. Platform MPI; results are based on InfiniBand
MM5 Performance - MPI
Platform MPI demonstrates up to 25% higher performance than MVAPICH
The Platform MPI advantage increases with cluster size
Figure: MM5 benchmark results (T3A), Mflop/s vs. number of nodes (4-22), single job on each node; higher is better; MVAPICH vs. Platform MPI
NAMD Performance - MPI
Platform MPI and Open MPI provide the same level of performance
Platform MPI performs better at cluster sizes below 20 nodes; Open MPI becomes better at 24 nodes
Configurations larger than 24 nodes were not tested
Figure: NAMD (ApoA1), ns/day vs. number of nodes (4-24); higher is better; Open MPI vs. Platform MPI; results are based on InfiniBand
STAR-CD Performance - MPI
Test case: single job over the entire system, input dataset A-Class
HP-MPI has slightly better performance with CPU affinity enabled
Figure: STAR-CD benchmark results (A-Class), total elapsed time (s) vs. number of nodes (4-24); lower is better; Platform MPI vs. HP-MPI
CPMD MPI Profiling
MPI_Alltoall is the key collective function in CPMD
The number of MPI_Alltoall messages increases dramatically with cluster size
Figure: MPI function distribution (C12 inp-1), number of messages (millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Alltoall, MPI_Bcast, MPI_Isend, MPI_Recv, MPI_Send); 4, 8, and 16 nodes
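The dramatic growth can be illustrated with a back-of-the-envelope count: in a naive pairwise MPI_Alltoall, every rank exchanges a message with every other rank, so the message count grows roughly quadratically with rank count. A minimal sketch, assuming 8 cores (one MPI rank per core) per node as on the test cluster:

```shell
# Naive pairwise MPI_Alltoall: each of N ranks sends one message to the
# other N-1 ranks, so messages per call grow as N*(N-1).
for nodes in 4 8 16; do
  ranks=$(( nodes * 8 ))             # 8 cores (MPI ranks) per node assumed
  msgs=$(( ranks * (ranks - 1) ))
  echo "$nodes nodes: $ranks ranks, $msgs messages per MPI_Alltoall"
done
```

Doubling the node count from 8 to 16 roughly quadruples the per-call message count, which matches the profile's steep rise.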
ECLIPSE - MPI Profiling
The majority of MPI messages are large
Demonstrating the need for the highest throughput
Figure: ECLIPSE MPI profiling, percentage of messages vs. number of nodes (4-24); MPI_Isend < 128 bytes, MPI_Recv < 128 bytes, MPI_Isend < 256 KB, MPI_Recv < 256 KB
Abaqus/Standard - MPI Profiling
MPI_Test, MPI_Recv, and MPI_Barrier/MPI_Bcast show the highest communication overhead
Figure: Abaqus/Standard MPI profiling (S2A), MPI runtime (s) per MPI function (MPI_Allreduce, MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Send, MPI_Ssend, MPI_Test, MPI_Waitall); 16, 32, and 64 cores
OpenFOAM MPI Profiling
The mostly used MPI functions are MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Recv
The number of MPI function calls increases with cluster size
ECLIPSE - Interconnect Usage
Total server throughput increases rapidly with cluster size
Figure: data sent per node (MB/s) over time for 4, 8, 16, and 24 nodes
NWChem MPI Profiling - Messages Transferred
Most data-related MPI messages are within 8KB-256KB in size
The number of messages increases with cluster size
Shows the need for the highest throughput to ensure the highest system utilization
Figure: NWChem benchmark profiling (Siosi7), number of messages (thousands) per message size (<128B, <1KB, <8KB, <256KB, <1MB, >1MB); 8, 12, and 16 nodes
STAR-CD MPI Profiling - Data Transferred
Most point-to-point MPI messages are within 1KB to 8KB in size
The number of messages increases with cluster size
Figure: STAR-CD MPI profiling (MPI_Sendrecv), number of messages (millions) per message size ([-128B>, [128B-1K>, [1K-8K>, [8K-256K>, [256K-1M>); 4, 8, 12, 16, 20, and 22 nodes
FLUENT 12 MPI Profiling - Messages Transferred
Most data-related MPI messages are within 256B-1KB in size
Typical MPI synchronization messages are smaller than 64B
The number of messages increases with cluster size
Figure: FLUENT 12 benchmark profiling result (Truck_poly_14M), number of messages (thousands) per message size ([..64B] through [4M..2GB]); 4, 8, and 16 nodes
ECLIPSE Performance - Productivity
InfiniBand increases productivity by allowing multiple jobs to run simultaneously
Providing the required productivity for reservoir simulations
Three cases are presented:
Single job over the entire system
Four jobs, each on two cores per CPU per server
Eight jobs, each on one CPU core per server
Eight jobs per node increase productivity by up to 142%
Figure: Schlumberger ECLIPSE (FOURMILL), number of jobs vs. number of nodes (8-24); higher is better; 1, 4, and 8 jobs per node; InfiniBand
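The "eight jobs, one core each per server" scenario above could be launched as concurrent per-core-pinned runs. A hypothetical sketch only - the binary name, input files, and host file are placeholders, and the exact launch mechanics depend on the ECLIPSE installation:

```shell
# Hypothetical: 8 simultaneous jobs across the same 24 nodes, each job's
# process pinned to a different core on every server via numactl.
for core in $(seq 0 7); do
  mpirun -np 24 --hostfile hosts24 \
         numactl --physcpubind=$core ./eclipse case_$core.data &
done
wait   # all 8 jobs run concurrently; total jobs/day is the productivity metric
```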
MPQC Performance - Productivity
InfiniBand increases productivity by allowing multiple jobs to run simultaneously
Providing the required productivity for MPQC computation
Two cases are presented:
Single job over the entire system
Two jobs, each on four cores per server
Two jobs per node increase productivity by up to 35%
Figure: MPQC benchmark result (aug-cc-pvdz), jobs/day vs. number of nodes (8-24); higher is better; 1 job vs. 2 jobs; InfiniBand
LS-DYNA Performance - Productivity
Figure: jobs per day vs. number of nodes (4 to 24 nodes, 32 to 192 cores) for two models, 3 Vehicle Collision and Neon Refined Revised; higher is better; 1, 2, 4, and 8 parallel jobs
FLUENT Performance - Productivity
Test cases: a single job over the entire system versus 2 jobs, each on four cores per server
Running multiple jobs simultaneously improves FLUENT productivity
Up to 9% more jobs per day for Eddy_417K
Up to 3% more jobs per day for Aircraft_2M
Up to 3% more jobs per day for Truck_14M
The larger the number of elements, the higher the node count required for increased productivity
The CPU is the bottleneck at larger numbers of servers
Figure: FLUENT 12.0 productivity result (2 jobs in parallel vs. 1 job), productivity gain vs. number of nodes (4-24); higher is better; Eddy_417K, Aircraft_2M, Truck_14M; InfiniBand DDR
MM5 Performance with Power Management
Test scenario: 24 servers, 4 processes per node, 2 processes per CPU (socket)
Similar performance with power management enabled or disabled
Only 1.4% performance degradation
Figure: MM5 benchmark results (T3A), Gflop/s with power management enabled vs. disabled; higher is better; InfiniBand DDR
MM5 Benchmark Power Cost Savings
Power management saves $673/year for the 24-node cluster
As cluster size increases, bigger savings are expected
Figure: MM5 benchmark (T3A) power cost comparison ($/year, ~7% difference), power management enabled vs. disabled, 24-node cluster; InfiniBand DDR
$/year = total power consumption per year (kWh) * $0.2
For more information - http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
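The slide's cost formula is easy to sanity-check with integer arithmetic. The 48,000 kWh/year consumption below is a hypothetical figure for illustration (the actual cluster consumption is not stated on the slide), chosen only to show how a ~7% reduction lands near the reported savings:

```shell
# $/year = total power consumption per year (kWh) * $0.2/kWh, per the slide.
kwh_year=48000                    # hypothetical annual consumption, 24 nodes
cost=$(( kwh_year * 2 / 10 ))     # annual cost without power management
saving=$(( cost * 7 / 100 ))      # a ~7% reduction, as shown on the chart
echo "cost: \$$cost/year, saving at 7%: \$$saving/year"
```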
STAR-CD Benchmark Power Consumption
Power management reduces total system power consumption by 2%
Figure: STAR-CD benchmark results (A-Class), power consumption (Watt), power management enabled vs. disabled; lower is better; InfiniBand DDR
STAR-CD Performance with Power Management
Test scenario: 24 servers, 4 cores per processor
Nearly identical performance with power management enabled or disabled
Figure: STAR-CD benchmark results (A-Class), total elapsed time (s), power management enabled vs. disabled; lower is better; InfiniBand DDR
MPQC Performance with CPU Frequency Scaling
CPU frequency scaling governors tested:
Userspace - reduces the CPU frequency to 1.9GHz
Performance - set for maximum performance (CPU frequency of 2.6GHz)
OnDemand - maximum performance per application activity
Userspace increases job run time since the CPU frequency is reduced
Performance and OnDemand deliver similar performance, due to the application's high resource demands
Figure: MPQC performance, elapsed time (s) for the aug-cc-pvdz and aug-cc-pvtz benchmark datasets on a 24-node cluster; Userspace (1.9GHz), Performance, OnDemand; InfiniBand DDR
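On Linux these governors are selected through the standard cpufreq sysfs interface. A sketch of how the three test configurations might be set up (requires root; the 1900000 kHz value is the slide's 1.9GHz userspace setting):

```shell
# Select a cpufreq governor on every core via the Linux sysfs interface.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  echo ondemand > $cpu/cpufreq/scaling_governor     # or: performance
done

# The userspace governor additionally needs an explicit target frequency.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  echo userspace > $cpu/cpufreq/scaling_governor
  echo 1900000   > $cpu/cpufreq/scaling_setspeed    # 1.9GHz, value in kHz
done
```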
NWChem Benchmark Results
Input dataset - Siosi7
Open MPI and HP-MPI provide similar performance and scalability
The Intel MKL library enables better performance than GCC
Figure: NWChem benchmark result (Siosi7), performance (jobs/day) vs. number of nodes (8, 12, 16); higher is better; HP-MPI + GCC, HP-MPI + Intel MKL, Open MPI + GCC, Open MPI + Intel MKL; InfiniBand QDR
NWChem Benchmark Results
Input dataset - H2O7
AMD ACML provides higher performance and scalability than the default BLAS library
Figure: NWChem benchmark result (H2O7 MP2), wall time (seconds) vs. number of nodes (4-24); lower is better; MVAPICH, HP-MPI, and Open MPI, each with and without ACML; InfiniBand DDR
Tuning MPI for Performance Improvements
MPI performance tuning:
Enabling CPU affinity: -mca mpi_paffinity_alone 1
Increasing the eager limit over InfiniBand to 32K: -mca btl_openib_eager_limit 32767
Performance increase of up to 10%
Figure: NAMD (ApoA1), ns/day vs. number of nodes (4-24); higher is better; Open MPI default vs. Open MPI tuned
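On the command line, the two Open MPI (1.x-era) parameters above are passed to mpirun as shown below. The binary, input file, process count, and host file names are placeholders:

```shell
# Open MPI runtime tuning from the slide:
#   mpi_paffinity_alone 1  - bind each rank to a core (CPU affinity)
#   btl_openib_eager_limit - raise the InfiniBand eager-send threshold
mpirun -np 192 --hostfile hosts24 \
       -mca mpi_paffinity_alone 1 \
       -mca btl_openib_eager_limit 32767 \
       ./namd2 apoa1.namd
```

Raising the eager limit lets medium-sized messages avoid the rendezvous protocol at the cost of more pre-registered buffer memory, which is why it helps a latency-sensitive workload like NAMD.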
HPC Applications Large Scale
Acknowledgments
The following research was performed under the HPC Advisory Council activities
Participating members: AMD, Dell, Jülich, Mellanox, NCAR, ParTec, Sun
Compute resources: HPC Advisory Council Cluster Center, Jülich Supercomputing Centre
For more info please refer to www.hpcadvisorycouncil.com
HOMME
High-Order Methods Modeling Environment (HOMME)
A framework for creating a high-performance, scalable global atmospheric model
Configurable for shallow water or the dry/moist primitive equations
Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)
HOMME supports execution on parallel computers using MPI, OpenMP, or a combination of the two
Developed by the Scientific Computing Section at the National Center for Atmospheric Research (NCAR)
POPperf
POP (Parallel Ocean Program) is an ocean circulation model
Simulations of the global ocean
Ocean-ice coupled simulations
Developed at Los Alamos National Lab
POPperf is a modified version of POP 2.0 that improves POP scalability at large processor counts:
A rewrite of the conjugate gradient solver to use a 1D data structure
The addition of a space-filling curve partitioning technique
Low-memory binary parallel I/O functionality
Developed by NCAR and freely available to the community
Objectives and Environments
The presented research was done to provide best practices:
HOMME and POPperf scalability and optimizations
Understanding communication patterns
HPC Advisory Council system: Dell PowerEdge SC 1435 24-node cluster, Quad-Core AMD Opteron 2382 ("Shanghai") CPUs, Mellanox InfiniBand ConnectX HCAs and switches
Jülich Supercomputing Centre - JuRoPA: Sun Blade servers with Intel Nehalem processors, Mellanox 40Gb/s InfiniBand HCAs and switches, ParTec ParaStation Cluster Operation Software and MPI
ParTec MPI Parameters
PSP_ONDEMAND
Each MPI connection needs a certain amount of memory by default (0.5MB)
PSP_ONDEMAND disabled: MPI connections are established when the application starts, avoiding the overhead of setting up connections dynamically
PSP_ONDEMAND enabled: MPI connections are established only when needed, reducing unnecessary message checking across a large number of connections and reducing the memory footprint
PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE
Define the send/receive queue sizes (the default is 16); changing them can reduce the memory footprint
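As environment variables, the ParaStation MPI settings described above might be set as follows. The queue-size values and the launch line are illustrative placeholders, not recommendations:

```shell
# ParaStation MPI tuning via environment variables (illustrative values).
export PSP_ONDEMAND=1             # establish MPI connections only when needed
export PSP_OPENIB_SENDQ_SIZE=4    # shrink the send queue from the default 16
export PSP_OPENIB_RECVQ_SIZE=4    # shrink the receive queue to cut memory use
mpiexec -np 3072 ./homme < namelist.nl
```

At large rank counts, on-demand connections plus smaller queues keep per-connection memory from dominating the node's RAM, which is the trade-off the slide describes.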
Lessons Learned for Large Scale
Each application needs customized MPI tuning
Dynamic MPI connection establishment should be enabled if:
Simultaneous connection setup is not a must
Some MPI functions need to check incoming messages from any source
Dynamic MPI connection establishment should be disabled if:
MPI_Alltoall is used in the application
The number of calls checking MPI_ANY_SOURCE is small
There is enough memory in the system
At different scales, different parameters may be used
PSP_ONDEMAND shouldn't be enabled on a very small-scale cluster
Different MPI libraries have different characteristics
Parameters chosen for one MPI may not fit other MPIs
HOMME Performance at Scale
HOMME Performance at Higher Scale
HOMME Scalability
POPperf Performance at Scale
POPperf Scalability
Summary
HOMME and POPperf performance depends on a low-latency network
InfiniBand enables both HOMME and POPperf to scale
Tested over 13824 cores; the same scalability is expected at even higher core counts
Optimized MPI settings can dramatically improve application performance
Understanding the application communication pattern
MPI parameter tuning
Acknowledgements
Thanks to all the participating HPC Advisory Council members
Ben Mayer and John Dennis from NCAR, who provided the code and benchmark instructions
Bastian Tweddell, Norbert Eicker, and Thomas Lippert, who made the JuRoPA system available for benchmarking
Axel Koehler from Sun, and Jens Hauke and Hugo R. Falter from ParTec, who helped review the report
Thank You
HPC Advisory Council
www.hpcadvisorycouncil.com
info@hpcadvisorycouncil.com