HPC Applications Scalability. Gilad@hpcadvisorycouncil.com




Applications Best Practices
- LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
- D. E. Shaw Research Desmond
- NWChem
- MPQC - Massively Parallel Quantum Chemistry Program
- ANSYS FLUENT and CFX
- CD-adapco STAR-CCM+
- CD-adapco STAR-CD
- MM5 - The Fifth-Generation Mesoscale Model
- NAMD
- LSTC LS-DYNA
- Schlumberger ECLIPSE
- CPMD - Car-Parrinello Molecular Dynamics
- WRF - Weather Research and Forecast Model
- Dacapo - total-energy program based on density functional theory
- Lattice QCD
- OpenFOAM
- GAMESS
- HOMME
- OpenAtom
- POPperf

Applications Best Practices
- Performance evaluation
- Profiling - network, I/O
- Recipes (installation, debug, command lines, etc.)
- MPI libraries
- Power management
- Optimizations at small scale
- Optimizations at scale

HPC Applications Small Scale

ECLIPSE Performance Results - Interconnect
- InfiniBand enables the highest scalability - performance accelerates with cluster size
- Performance over GigE and 10GigE does not scale - a slowdown occurs beyond 8 nodes
[Figure: Schlumberger ECLIPSE (FOURMILL) elapsed time (seconds) vs. number of nodes (4-24), GigE vs. 10GigE vs. InfiniBand; lower is better; single job per cluster size]

MM5 Performance Results - Interconnect
- Input data: T3A - resolution 36 km, grid size 112x136, 33 vertical levels, 81-second time step, 3-hour forecast
- InfiniBand DDR delivers higher performance at any cluster size - up to 46% versus GigE and 3% versus 10GigE
[Figure: MM5 T3A benchmark, Mflop/s vs. number of nodes (4-22), GigE vs. 10GigE vs. InfiniBand DDR; higher is better; Platform MPI]

MPQC Performance Results - Interconnect
- Input dataset: MP2 calculations of the uracil dimer binding energy using the aug-cc-pvtz basis set
- InfiniBand enables higher scalability - performance accelerates with cluster size
- Outperforms GigE by up to 124% and 10GigE by up to 38%
[Figure: MPQC (aug-cc-pvtz) elapsed time (seconds) vs. number of nodes (8, 16, 24), GigE vs. 10GigE vs. InfiniBand DDR; lower is better]

NAMD Performance Results - Interconnect
- ApoA1 benchmark - comprises 92K atoms of lipid, protein, and water; models a bloodstream lipoprotein particle; one of the most-used datasets for benchmarking NAMD
- InfiniBand 20Gb/s outperforms GigE and 10GigE at every cluster size - up to 79% higher performance vs. GigE and 49% vs. 10GigE
[Figure: NAMD (ApoA1) ns/day vs. number of nodes (4-24), GigE vs. 10GigE vs. InfiniBand; higher is better; Open MPI]

CPMD Performance - MPI
- Platform MPI shows better scalability than Open MPI
[Figure: CPMD (Wat32 inp-1 and inp-2) elapsed time (seconds) vs. number of nodes (4, 8, 16), Open MPI vs. Platform MPI; lower is better; results over InfiniBand]

MM5 Performance - MPI
- Platform MPI demonstrates up to 25% higher performance than MVAPICH
- The Platform MPI advantage increases with cluster size
[Figure: MM5 T3A benchmark, Mflop/s vs. number of nodes (4-22), MVAPICH vs. Platform MPI; higher is better; single job on each node]

NAMD Performance - MPI
- Platform MPI and Open MPI provide the same overall level of performance
- Platform MPI performs better at cluster sizes below 20 nodes; Open MPI becomes better at 24 nodes
- Configurations beyond 24 nodes were not tested
[Figure: NAMD (ApoA1) ns/day vs. number of nodes (4-24), Open MPI vs. Platform MPI; higher is better; results over InfiniBand]

STAR-CD Performance - MPI
- Test case: a single job over the entire system, input dataset A-Class
- HP-MPI has slightly better performance with CPU affinity enabled
[Figure: STAR-CD (A-Class) total elapsed time (s) vs. number of nodes (4-24), Platform MPI vs. HP-MPI; lower is better]

CPMD MPI Profiling
- MPI_Alltoall is the key collective function in CPMD
- The number of Alltoall messages increases dramatically with cluster size
[Figure: MPI function distribution (C12 inp-1) - number of messages (millions) for MPI_Allgather, MPI_Allreduce, MPI_Alltoall, MPI_Bcast, MPI_Isend, MPI_Recv, MPI_Send at 4, 8, and 16 nodes]
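The rapid growth is easy to see from first principles: in a naive alltoall every rank exchanges a message with every other rank, so the wire-level message count grows quadratically with the rank count. A short sketch (the 8-ranks-per-node figure and the single-call assumption are illustrative, not taken from the profile):

```python
# Sketch: why MPI_Alltoall message counts explode with cluster size.
# Assumes a naive pairwise alltoall (P ranks -> P*(P-1) messages per call)
# and 8 ranks per node; node counts mirror the profiled runs (4, 8, 16).

def alltoall_messages(nodes, ranks_per_node=8, calls=1):
    """Messages on the wire for `calls` naive alltoall operations."""
    p = nodes * ranks_per_node
    return calls * p * (p - 1)

for nodes in (4, 8, 16):
    print(nodes, alltoall_messages(nodes))
# Doubling the node count roughly quadruples the message count.
```

This quadratic growth is why the profile shows MPI_Alltoall dominating the message distribution as the cluster grows.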

ECLIPSE - MPI Profiling
- The majority of MPI messages are large, demonstrating the need for the highest throughput
[Figure: ECLIPSE MPI profiling - percentage of messages vs. number of nodes (4-24), for MPI_Isend/MPI_Recv messages below 128 bytes and below 256 KB]

Abaqus/Standard - MPI Profiling
- MPI_Test, MPI_Recv, and MPI_Barrier/MPI_Bcast show the highest communication overhead
[Figure: Abaqus/Standard MPI profiling (S2A) - MPI runtime (s) per function (MPI_Allreduce, MPI_Alltoall, MPI_Barrier, MPI_Bcast, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Send, MPI_Ssend, MPI_Test, MPI_Waitall) at 16, 32, and 64 cores]

OpenFOAM MPI Profiling - MPI Functions
- Most-used MPI functions: MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Recv
- The number of MPI function calls increases with cluster size

ECLIPSE - Interconnect Usage
- Total server throughput increases rapidly with cluster size
[Figure: data sent (MB/s) per node over time for 4-, 8-, 16-, and 24-node runs]

NWChem MPI Profiling - Messages Transferred
- Most data-related MPI messages are 8KB-256KB in size
- The number of messages increases with cluster size, showing the need for the highest throughput to ensure the highest system utilization
[Figure: NWChem (Siosi7) profiling - number of messages (thousands) per message-size bin (<128B to >1MB) at 8, 12, and 16 nodes]

STAR-CD MPI Profiling - Data Transferred
- Most point-to-point MPI messages are 1KB to 8KB in size
- The number of messages increases with cluster size
[Figure: STAR-CD MPI_Sendrecv profiling - number of messages (millions) per message-size bin (<128B to 1MB) at 4 to 22 nodes]

FLUENT 12 MPI Profiling - Messages Transferred
- Most data-related MPI messages are 256B-1KB in size
- Typical MPI synchronization messages are smaller than 64B
- The number of messages increases with cluster size
[Figure: FLUENT 12 (Truck_poly_14M) profiling - number of messages (thousands) per message-size bin (64B up to 2GB) at 4, 8, and 16 nodes]

ECLIPSE Performance - Productivity
- InfiniBand increases productivity by allowing multiple jobs to run simultaneously, providing the productivity required for reservoir simulations
- Three cases are presented: a single job over the entire system; four jobs, each on two cores per CPU per server; eight jobs, each on one CPU core per server
- Eight jobs per node increase productivity by up to 142%
[Figure: Schlumberger ECLIPSE (FOURMILL) number of jobs vs. number of nodes (8-24) for 1, 4, and 8 jobs per node; higher is better; InfiniBand]
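Productivity on these slides means throughput in jobs per day, so a packing of several smaller jobs can beat one large job even though each small job runs longer. A sketch of the arithmetic (the run times below are hypothetical, chosen only to illustrate the trade-off):

```python
# Jobs/day arithmetic behind the productivity slides (times are hypothetical).

SECONDS_PER_DAY = 86400

def jobs_per_day(jobs_in_parallel, seconds_per_job):
    """Throughput when `jobs_in_parallel` identical jobs run simultaneously,
    each finishing in `seconds_per_job`."""
    return jobs_in_parallel * SECONDS_PER_DAY / seconds_per_job

# One job spanning the whole node set vs. eight jobs sharing it.
# Each packed job is slower, but total throughput can still be far higher.
single = jobs_per_day(1, 1000)
packed = jobs_per_day(8, 2000)
print(single, packed, packed / single - 1)
```

The gain depends entirely on how much each packed job slows down; the slides show this slowdown is small when the interconnect is not the bottleneck.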

MPQC Performance - Productivity
- InfiniBand increases productivity by allowing multiple jobs to run simultaneously, providing the productivity required for MPQC computation
- Two cases are presented: a single job over the entire system; two jobs, each on four cores per server
- Two jobs per node increase productivity by up to 35%
[Figure: MPQC (aug-cc-pvdz) jobs/day vs. number of nodes (8-24), 1 job vs. 2 jobs; higher is better; InfiniBand]

LS-DYNA Performance - Productivity
[Figure: LS-DYNA 3 Vehicle Collision and Neon Refined Revised - jobs per day vs. number of nodes (4-24 nodes, 32-192 cores) for 1, 2, 4, and 8 parallel jobs; higher is better]

FLUENT Performance - Productivity
- Test cases: a single job over the entire system versus two jobs, each running on four cores per server
- Running multiple jobs simultaneously improves FLUENT productivity: up to 9% more jobs per day for Eddy_417K, up to 3% for Aircraft_2M, and up to 3% for Truck_14M
- The larger the number of elements, the higher the node count required for a productivity gain - the CPU becomes the bottleneck at larger server counts
[Figure: FLUENT 12 productivity gain (2 jobs in parallel vs. 1 job) vs. number of nodes (4-24) for Eddy_417K, Aircraft_2M, and Truck_14M; higher is better; InfiniBand DDR]

MM5 Performance with Power Management
- Test scenario: 24 servers, 4 processes per node, 2 processes per CPU (socket)
- Similar performance with power management enabled or disabled - only 1.4% performance degradation
[Figure: MM5 T3A benchmark, Gflop/s with power management enabled vs. disabled; higher is better; InfiniBand DDR]

MM5 Benchmark - Power Cost Savings
- Power management saves $673/year (roughly 7%) for the 24-node cluster
- As cluster size increases, bigger savings are expected
- $/year = total power consumption per year (kWh) * $0.20
- For more information: http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
[Figure: MM5 T3A power cost comparison ($/year), power management enabled vs. disabled, 24-node cluster; InfiniBand DDR]
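The slide's cost formula is simple enough to sketch directly. Only the formula (kWh per year times a dollar rate) comes from the slide; the wattage figures below are illustrative assumptions, not measured values:

```python
# Sketch of the slide's power-cost formula: $/year = kWh/year * rate.
# The cluster power draws used in the example are hypothetical; only the
# formula and the per-kWh rate structure come from the slide.

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_power_cost(avg_watts, dollars_per_kwh=0.20):
    """Yearly energy cost for a system drawing avg_watts continuously."""
    kwh_per_year = avg_watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * dollars_per_kwh

# e.g. a hypothetical 24-node cluster drawing 5.5 kW without power
# management vs. 5.1 kW with it:
saving = annual_power_cost(5500) - annual_power_cost(5100)
print(round(saving))  # annual saving in dollars for this assumed draw
```

A few hundred watts of sustained savings across a modest cluster is enough to produce the hundreds of dollars per year the slide reports.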

STAR-CD Benchmark - Power Consumption
- Power management reduces total system power consumption by 2%
[Figure: STAR-CD (A-Class) power consumption (watts), power management enabled vs. disabled; lower is better; InfiniBand DDR]

STAR-CD Performance with Power Management
- Test scenario: 24 servers, 4 cores per processor
- Nearly identical performance with power management enabled or disabled
[Figure: STAR-CD (A-Class) total elapsed time (s), power management enabled vs. disabled; lower is better; InfiniBand DDR]

MPQC Performance with CPU Frequency Scaling
- CPU frequency scaling governors tested:
  - Userspace - reduces the CPU frequency to 1.9GHz
  - Performance - sets maximum performance (CPU frequency of 2.6GHz)
  - OnDemand - maximum performance per application activity
- Userspace increases job run time since the CPU frequency is reduced
- Performance and OnDemand deliver similar performance, due to the application's high resource demands
[Figure: MPQC elapsed time (s) for the aug-cc-pvdz and aug-cc-pvtz datasets on the 24-node cluster, Userspace (1.9GHz) vs. Performance vs. OnDemand; InfiniBand DDR]
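On a Linux cluster these governors are switched through the kernel's cpufreq interface. A sketch of a per-node fragment, assuming the standard cpufreq sysfs layout (paths and the 1.9GHz value are illustrative; run as root on each node):

```shell
# Select a cpufreq governor on every core of a node (requires root).
# "performance" pins the maximum frequency; "ondemand" scales with load;
# "userspace" allows pinning an explicit frequency, as in the 1.9GHz run.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo performance > "$cpu/cpufreq/scaling_governor"
done

# With the userspace governor, a specific frequency (in kHz) can be pinned:
# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo 1900000  > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
```

This is a configuration fragment rather than a portable script; the available governors depend on the kernel and the cpufreq driver in use.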

NWChem Benchmark Results - Input Dataset Siosi7
- Open MPI and HP-MPI provide similar performance and scalability
- The Intel MKL library enables better performance than GCC's default BLAS
[Figure: NWChem (Siosi7) performance (jobs/day) vs. number of nodes (8-16) for HP-MPI + GCC, HP-MPI + Intel MKL, Open MPI + GCC, Open MPI + Intel MKL; higher is better; InfiniBand QDR]

NWChem Benchmark Results - Input Dataset H2O7
- AMD ACML provides higher performance and scalability than the default BLAS library
[Figure: NWChem (H2O7 MP2) wall time (seconds) vs. number of nodes (4-24) for MVAPICH, HP-MPI, and Open MPI, each with and without ACML; lower is better; InfiniBand DDR]

Tuning MPI for Performance Improvements
- MPI performance tuning:
  - Enabling CPU affinity: -mca mpi_paffinity_alone 1
  - Increasing the eager limit over InfiniBand to 32K: -mca btl_openib_eager_limit 32767
- Performance increase of up to 10%
[Figure: NAMD (ApoA1) ns/day vs. number of nodes (4-24), Open MPI default vs. Open MPI tuned; higher is better]
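Both parameters are passed to Open MPI at launch time. A sketch of the command line (Open MPI 1.x-era option names, as used on these slides; the executable, input file, host file, and rank count are placeholders):

```shell
# Launch NAMD under Open MPI with the two tunings from the slide.
# namd2, apoa1.namd, hosts, and -np 192 are placeholders for illustration.
mpirun -np 192 --hostfile hosts \
    --mca mpi_paffinity_alone 1 \
    --mca btl_openib_eager_limit 32767 \
    namd2 apoa1.namd
```

The same parameters can also be set in the MCA parameter file or via environment variables instead of the command line.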

HPC Applications Large Scale

Acknowledgments
- The following research was performed under the HPC Advisory Council activities
- Participating members: AMD, Dell, Jülich, Mellanox, NCAR, ParTec, Sun
- Compute resources: HPC Advisory Council Cluster Center, Jülich Supercomputing Centre
- For more info please refer to www.hpcadvisorycouncil.com

HOMME
- High-Order Methods Modeling Environment (HOMME) - a framework for creating a high-performance, scalable global atmospheric model
- Configurable for shallow water or the dry/moist primitive equations
- Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)
- Supports execution on parallel computers using MPI, OpenMP, or a combination of the two
- Developed by the Scientific Computing Section at the National Center for Atmospheric Research (NCAR)

POPperf
- POP (Parallel Ocean Program) is an ocean circulation model for simulations of the global ocean and ocean-ice coupled simulations; developed at Los Alamos National Lab
- POPperf is a modified version of POP 2.0 that improves POP scalability at large processor counts:
  - Re-writing of the conjugate gradient solver to use a 1D data structure
  - The addition of a space-filling-curve partitioning technique
  - Low-memory binary parallel I/O functionality
- Developed by NCAR and freely available to the community

Objectives and Environments
- The presented research was done to provide best practices:
  - HOMME and POPperf scalability and optimizations
  - Understanding communication patterns
- HPC Advisory Council system: Dell PowerEdge SC 1435 24-node cluster, Quad-Core AMD Opteron 2382 ("Shanghai") CPUs, Mellanox InfiniBand ConnectX HCAs and switches
- Jülich Supercomputing Centre - JuRoPA: Sun Blade servers with Intel Nehalem processors, Mellanox 40Gb/s InfiniBand HCAs and switches, ParTec ParaStation cluster operation software and MPI

ParTec MPI Parameters
- PSP_ONDEMAND - each MPI connection needs a certain amount of memory by default (0.5MB)
  - PSP_ONDEMAND disabled: MPI connections are established when the application starts, reducing the overhead of establishing connections dynamically
  - PSP_ONDEMAND enabled: MPI connections are established on demand, reducing unnecessary message checking across a large number of connections and reducing the memory footprint
- PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE - define the default send/receive queue sizes (default is 16); changing them can reduce the memory footprint
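In ParaStation MPI these are plain environment variables, typically set in the job script before launch. A sketch of such a fragment (the queue-size values shown are illustrative examples, not recommendations from the slides):

```shell
# Illustrative ParaStation MPI settings (values are examples only).
# Establish connections on demand to cut memory use at large rank counts:
export PSP_ONDEMAND=1

# Shrink the per-connection send/receive queues from the default of 16
# to further reduce the memory footprint:
export PSP_OPENIB_SENDQ_SIZE=4
export PSP_OPENIB_RECVQ_SIZE=4
```

As the next slide notes, the right values depend on the application's communication pattern and the scale of the run.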

Lessons Learned for Large Scale
- Each application needs customized MPI tuning
- Dynamic MPI connection establishment should be enabled if:
  - Simultaneous connection setup is not a must
  - Some MPI functions need to check incoming messages from any source
- Dynamic MPI connection establishment should be disabled if:
  - MPI_Alltoall is used in the application
  - The number of calls checking MPI_ANY_SOURCE is small
  - There is enough memory in the system
- At different scales, different parameters may be used - PSP_ONDEMAND shouldn't be enabled on a very small cluster
- Different MPI libraries have different characteristics - parameters chosen for one MPI may not fit other MPIs

HOMME Performance at Scale

HOMME Performance at Higher Scale

HOMME Scalability

POPperf Performance at Scale

POPperf Scalability

Summary
- HOMME and POPperf performance depends on a low-latency network
- InfiniBand enables both HOMME and POPperf to scale - tested over 13824 cores, with the same scalability expected at even higher core counts
- Optimized MPI settings can dramatically improve application performance: understand the application's communication pattern, then tune MPI parameters accordingly

Acknowledgements
- Thanks to all the participating HPC Advisory Council members
- Ben Mayer and John Dennis from NCAR, who provided the code and benchmark instructions
- Bastian Tweddell, Norbert Eicker, and Thomas Lippert, who made the JuRoPA system available for benchmarking
- Axel Koehler from Sun, and Jens Hauke and Hugo R. Falter from ParTec, who helped review the report

Thank You
HPC Advisory Council
www.hpcadvisorycouncil.com
info@hpcadvisorycouncil.com