Performance Evaluation, Scalability Analysis, and Optimization Tuning of HyperWorks Solvers on a Modern HPC Compute Cluster

1 Performance Evaluation, Scalability Analysis, and Optimization Tuning of HyperWorks Solvers on a Modern HPC Compute Cluster Pak Lui pak@hpcadvisorycouncil.com May 7, 2015

2 Agenda Introduction to HPC Advisory Council Altair RADIOSS Benchmark Configuration Performance Benchmark Testing and Results MPI Profiling Summary Altair OptiStruct Benchmark Configuration Performance Benchmark Testing and Results MPI Profiling Summary Q&A / For More Information

3 The HPC Advisory Council World-wide HPC organization (400+ members) Bridges the gap between HPC usage and its full potential Provides best practices and a support/development center Explores future technologies and future developments Working Groups HPC Cloud, HPC Scale, HPC GPU, HPC Storage Leading edge solutions and technology demonstrations

4 Copyright 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. HPC Advisory Council Members

5 HPC Council Board
Chairman: Gilad Shainer - gilad@hpcadvisorycouncil.com
HPC Scale SIG Chair: Richard Graham - richard@hpcadvisorycouncil.com
Media Relations and Events Director: Brian Sparks - brian@hpcadvisorycouncil.com
HPC Cloud SIG Chair: William Lu - william@hpcadvisorycouncil.com
China Events Manager: Blade Meng - blade@hpcadvisorycouncil.com
HPC GPU SIG Chair: Sadaf Alam - sadaf@hpcadvisorycouncil.com
Director of the HPC Advisory Council, Asia: Tong Liu - tong@hpcadvisorycouncil.com
India Outreach: Goldi Misra - goldi@hpcadvisorycouncil.com
HPC Works SIG Chair and Cluster Center Manager: Pak Lui - pak@hpcadvisorycouncil.com
Director of the Switzerland Center of Excellence and HPC Storage SIG Chair: Hussein Harake - hussein@hpcadvisorycouncil.com
Director of Educational Outreach: Scot Schultz - scot@hpcadvisorycouncil.com
Workshop Program Director: Eric Lantz - eric@hpcadvisorycouncil.com
Programming Advisor: Tarick Bedeir - Tarick@hpcadvisorycouncil.com
Research Steering Committee Director: Cydney Stevens - cydney@hpcadvisorycouncil.com

6 HPC Advisory Council HPC Center
InfiniBand-based Lustre storage
Clusters: Juniper (640 cores), Heimdall (280 cores), Plutus (192 cores), Janus (456 cores), Athena (80 cores), Vesta (704 cores), Thor (896 cores), Mala (16 GPUs)

7 Special Interest Subgroups Missions
HPC Scale: To explore the usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services.
HPC Cloud: To explore the usage of HPC components as part of the creation of external/public/internal/private cloud computing environments.
HPC Works: To provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines.
HPC Storage: To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC Storage subgroup is to explore Lustre-based solutions, and to expose more users to the potential of Lustre over high-speed networks.
HPC GPU: To explore usage models of GPU components as part of next generation compute environments and potential optimizations for GPU-based computing.
HPC FSI: To explore the usage of high-performance computing solutions for low latency trading, more productive simulations (such as Monte Carlo) and overall more efficient financial services.

8 HPC Advisory Council
HPC Advisory Council (HPCAC): 400+ members
Application best practices, case studies (over 150)
Benchmarking center with remote access for users
World-wide workshops
Value add for your customers to stay up to date and in tune with the HPC market
2015 Workshops: USA (Stanford University) February 2015; Switzerland March 2015; Brazil August 2015; Spain September 2015; China (HPC China) October 2015
For more information

9 2015 ISC High Performance Conference Student Cluster Competition
University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the 2015 ISC High Performance Conference show floor
The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community

10 RADIOSS Performance Study Research performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource - HPC Advisory Council Cluster Center Objectives Give overview of RADIOSS performance Compare different MPI libraries, network interconnects and others Understand RADIOSS communication patterns Provide best practices to increase RADIOSS productivity

11 About RADIOSS Compute-intensive simulation software for Manufacturing For 20+ years an established standard for automotive crash and impact Differentiated by its high scalability, quality and robustness Supports multiphysics simulation and advanced materials Used across all industries to improve safety and manufacturability Companies use RADIOSS to simulate real-world scenarios (crash tests, climate effects, etc.) to test the performance of a product

12 Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) Thor cluster
Dual-socket 14-core Intel 2.60 GHz CPUs (Turbo on, max perf set in BIOS)
OS: RHEL 6.5, MLNX_OFED_LINUX InfiniBand SW stack
Memory: 64GB per node, DDR4 2133MHz DIMMs
Hard drives: 1TB 7,200 RPM SATA 2.5-inch
Mellanox Switch-IB EDR 100Gb/s InfiniBand switch
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
Mellanox SwitchX FDR 56Gb/s InfiniBand VPI switch
MPI: Intel MPI 5.0.2, Open MPI 1.8.4, Mellanox HPC-X v1.2.0
Application: Altair RADIOSS 13.0
Benchmark dataset: Neon benchmark, 1 million elements (8ms, Double Precision), unless otherwise stated

13 About Intel Cluster Ready
Intel Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
Simplifies selection, deployment, and operation of a cluster
A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
Focus on your work productivity and spend less management time on the cluster
Select Intel Cluster Ready, where the cluster is delivered ready to run: hardware and software are integrated and configured together, and applications are registered, validating execution on the Intel Cluster Ready architecture
Includes the Intel Cluster Checker tool to verify functionality and periodically check cluster health
RADIOSS is Intel Cluster Ready

14 PowerEdge R730
Massive flexibility for data-intensive operations: performance and efficiency
Intelligent hardware-driven systems management with extensive power management features
Innovative tools including automation for parts replacement and lifecycle manageability
Broad choice of networking technologies from GbE to IB
Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
Benefits: designed for performance workloads from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments; high-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities: flexible compute platform with dense storage capacity; 2S/2U server, 6 PCIe slots; large memory footprint (up to 768GB / 24 DIMMs); high I/O performance and optional storage configurations
HDD options: 12 x 3.5-inch or 24 x 2.5-inch HDDs, plus 2 x 2.5-inch HDDs in the rear of the server; up to 26 HDDs with 2 hot-plug drives in the rear for boot or scratch

15 RADIOSS Performance Interconnect (MPP)
EDR InfiniBand provides better scalability performance than Ethernet
70 times better performance than 1GbE at 16 nodes / 448 cores
4.8x better performance than 10GbE at 16 nodes / 448 cores
Ethernet solutions do not scale beyond 4 nodes with pure MPI
70x 4.8x Higher is better Intel MPI 28 Processes/Node

16 RADIOSS Performance Interconnect (MPP) EDR InfiniBand provides better scalability performance EDR InfiniBand improves over QDR IB by 28% at 16 nodes / 448 cores Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes 28% 25% Higher is better Intel MPI 28 Processes/Node

17 RADIOSS Performance CPU Cores
Running more cores per node generally improves overall performance
Seen improvement of 18% from 20 to 28 cores per node at 8 nodes
Guideline: the optimal workload distribution is ~4000 elements per MPI process
For this test case of 1 million elements, the optimal core count is ~256 cores
4000 elements per process provides sufficient workload for each process
Hybrid MPP (HMPP) provides a way to achieve additional scalability on more CPUs
18% 6% Higher is better Intel MPI
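The sizing guideline above can be turned into a quick calculation. A minimal shell sketch, using the numbers from this test case; the rounding to whole nodes at 28 cores each is our own addition:

```shell
#!/bin/sh
# Sizing rule from the slide: ~4000 elements per MPI process.
ELEMENTS=1000000          # Neon benchmark: 1 million elements
ELEMENTS_PER_PROC=4000    # recommended workload per process

# Ideal number of MPI processes for this model
PROCS=$((ELEMENTS / ELEMENTS_PER_PROC))
echo "suggested MPI processes: $PROCS"    # 250, i.e. ~256 cores

# With 28 cores per node (dual 14-core sockets), round up to whole nodes
NODES=$(( (PROCS + 27) / 28 ))
echo "suggested nodes at 28 PPN: $NODES"
```

For this model the arithmetic lands at 250 processes, which is why the slide quotes ~256 cores as the sweet spot.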

18 RADIOSS Performance Simulation Time
Increasing simulation time increases the run time at a faster rate
Extending an 8ms simulation to 80ms can result in a much longer runtime: a 10x longer simulation run can result in a 14x increase in runtime
Contacts usually become more severe in the middle of the run, which adds complexity and CPU utilization, so the cost per cycle increases
13x 14x 14x Higher is better Intel MPI

19 RADIOSS Profiling % Time Spent on MPI
RADIOSS uses point-to-point communications for most data transfers
The most time-consuming MPI calls are MPI_Recv (55%), MPI_Waitany (23%), and MPI_Allreduce (13%)
MPP Mode 28 Processes/Node

20 RADIOSS Performance Intel MPI Tuning (MPP)
Tuning the Intel MPI collective algorithms can improve performance
The MPI profile shows about 20% of runtime is spent in MPI_Allreduce communications
The default MPI_Allreduce algorithm in Intel MPI is Recursive Doubling
The default algorithm is the best among all tested for MPP
Higher is better Intel MPI 28 Processes/Node
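A comparison like the one on this slide can be scripted by sweeping Intel MPI's `I_MPI_ADJUST_ALLREDUCE` selector. A sketch, not the study's actual harness; `run_radioss.sh` is a hypothetical placeholder for the real benchmark launch, so the mpirun line is left commented out:

```shell
#!/bin/sh
# Sweep Intel MPI's MPI_Allreduce algorithm selector and label each run.
# "run_radioss.sh" is a hypothetical placeholder for the benchmark launch.
for ALG in 1 2 3 4 5 6 7 8; do
    export I_MPI_ADJUST_ALLREDUCE="$ALG"
    echo "testing I_MPI_ADJUST_ALLREDUCE=$ALG"
    # mpirun -ppn 28 -np 448 ./run_radioss.sh   # uncomment on the cluster
done
```

Timing each iteration (e.g. with `time`) and keeping the fastest value is how a per-application collective tuning like this is normally found.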

21 RADIOSS Performance MPI Libraries (MPP)
Intel MPI and Open MPI perform similarly
The MPI profile shows ~20% of MPI time is spent in MPI_Allreduce communications
MPI collective operations (such as MPI_Allreduce) can potentially be optimized
Support for Open MPI is new in RADIOSS
HPC-X is a tuned MPI distribution based on the latest Open MPI
Higher is better 28 Processes/Node

22 RADIOSS Hybrid MPP Parallelization
Highly parallel code with multi-level parallelization: domain decomposition (MPI parallelization) and multithreading (OpenMP)
Enhanced performance: best scalability in the marketplace, high efficiency on large HPC clusters
Unique, proven method for rich scalability over thousands of cores for FEA
Flexibility: easy tuning of MPI & OpenMP
Robustness: parallel arithmetic allows perfect repeatability in parallel

23 RADIOSS Performance Hybrid MPP version
Enabling Hybrid MPP mode unlocks RADIOSS scalability
At larger scale, productivity improves as more threads are involved
As more threads are involved, the amount of communication among processes is reduced
At 32 nodes / 896 cores, the best configuration is 1 process per socket spawning 14 threads each
28 threads / 1 PPN is not advised because it breaks data locality across the CPU sockets
The following environment settings and tuned flags are used for Intel MPI:
I_MPI_PIN_DOMAIN=auto, I_MPI_ADJUST_ALLREDUCE=5, I_MPI_ADJUST_BCAST=1, KMP_AFFINITY=compact, KMP_STACKSIZE=400m, ulimit -s unlimited
32% 70% 3.7x Intel MPI EDR InfiniBand Higher is better
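The settings above can be sketched as a launch script. The environment variables are the ones listed on the slide; the `OMP_NUM_THREADS=14` export and the `radioss_run` binary name are our assumptions (the slide gives the thread count but not the solver invocation), so the mpirun command is only echoed:

```shell
#!/bin/sh
# Hybrid MPP layout from the slide:
# 32 nodes x 2 ranks (1 per socket) x 14 OpenMP threads = 896 cores.
export I_MPI_PIN_DOMAIN=auto        # pin each rank to its own CPU domain
export I_MPI_ADJUST_ALLREDUCE=5     # MPI_Allreduce: binomial gather+scatter
export I_MPI_ADJUST_BCAST=1         # MPI_Bcast algorithm selection
export KMP_AFFINITY=compact         # keep OpenMP threads on adjacent cores
export KMP_STACKSIZE=400m           # per-thread stack size
export OMP_NUM_THREADS=14           # assumed: 14 threads per rank
ulimit -s unlimited 2>/dev/null || true   # unlimited process stack

# "radioss_run" is a hypothetical placeholder for the solver invocation.
echo mpirun -ppn 2 -np 64 ./radioss_run
```

With 2 ranks per node across 32 nodes, `-np 64` follows directly from the layout the slide describes.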

24 RADIOSS Profiling MPI Comm. Time
In MPP mode, most communication time is in point-to-point and non-blocking calls: MPI_Recv, MPI_Waitany, and MPI_Allreduce are used most of the time
For HMPP, communication behavior changes: a higher percentage of time is spent in MPI_Waitany, MPI_Allreduce, and MPI_Recv
MPP, 28PPN HMPP, 2PPN / 14 Threads Intel MPI At 32 Nodes

25 RADIOSS Profiling MPI Message Sizes
The most time-consuming MPI communications for MPP are:
MPI_Recv: messages concentrated at 640B, 1KB, 320B, 1280B
MPI_Waitany: messages at 48B, 8B, 384B
MPI_Allreduce: most messages appear at 80B
MPP, 28PPN HMPP, 2PPN / 14 Threads Pure MPP 28 Processes/Node

26 RADIOSS Performance Intel MPI Tuning (DP)
For Hybrid MPP DP, tuning MPI_Allreduce shows more gain than in MPP
For the DAPL provider, algorithm #5 (Binomial gather+scatter) improved performance by 27% over the default
For the OFA provider, the tuned MPI_Allreduce algorithm improves performance by 44% over the default
Both OFA and DAPL improved by tuning I_MPI_ADJUST_ALLREDUCE=5
Flags for OFA: I_MPI_OFA_USE_XRC=1. For DAPL: the ofa-v2-mlx5_0-1u provider
27% 44% Higher is better Intel MPI 2 PPN / 14 OpenMP
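The provider-specific flags above can be combined into one environment sketch. The `I_MPI_FABRICS` values are our assumption about how the OFA and DAPL providers were selected (the slide lists only the per-provider flags), and the launch line is a hypothetical placeholder:

```shell
#!/bin/sh
# Fabric/provider tuning from the slide; both providers benefit from
# the same MPI_Allreduce algorithm selection.
export I_MPI_ADJUST_ALLREDUCE=5

# Option A: OFA provider with XRC transport enabled
export I_MPI_FABRICS=shm:ofa     # assumed fabric selection
export I_MPI_OFA_USE_XRC=1

# Option B: DAPL provider (comment out A and uncomment to switch)
# export I_MPI_FABRICS=shm:dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u

echo mpirun -ppn 2 -np 64 ./radioss_run   # hypothetical placeholder launch
```

Switching between the two options while keeping `I_MPI_ADJUST_ALLREDUCE=5` fixed reproduces the comparison the slide reports.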

27 RADIOSS Performance Interconnect (HMPP)
EDR InfiniBand provides better scalability performance than Ethernet
214% better performance than 1GbE at 16 nodes
104% better performance than 10GbE at 16 nodes
InfiniBand typically outperforms other interconnects in collective operations
214% 104% Higher is better Intel MPI 2 PPN / 14 OpenMP

28 RADIOSS Performance Interconnect (HMPP) EDR InfiniBand provides better scalability performance than FDR IB EDR IB outperforms FDR IB by 27% at 32 nodes Improvement for EDR InfiniBand occurs at high node count 27% Higher is better Intel MPI 2 PPN / 14 OpenMP

29 RADIOSS Performance Floating Point Precision
Single-precision jobs run faster than double precision: SP provides a 47% speedup over DP
Similar scalability is seen for the double-precision tests
47% Higher is better Intel MPI 2 PPN / 14 OpenMP

30 RADIOSS Performance CPU Frequency
Increasing CPU core frequency enables higher job efficiency
18% performance jump from 2.3GHz to 2.6GHz (a 13% increase in clock speed)
29% performance jump from 2.0GHz to 2.6GHz (a 30% increase in clock speed)
The performance gain can exceed the increase in CPU frequency
CPU-bound applications see a higher benefit from CPUs with higher frequencies
29% 18% Higher is better Intel MPI 2 PPN / 14 OpenMP
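The clock-speed ratios quoted on this slide are easy to verify. A small shell check comparing the raw frequency increase against the observed speedups:

```shell
#!/bin/sh
# Compare raw clock-speed increases with the observed speedups on the slide.
gain1=$(awk 'BEGIN { printf "%.0f", (2.6 / 2.3 - 1) * 100 }')
echo "2.3 -> 2.6 GHz: ${gain1}% clock increase (observed speedup: 18%)"

gain2=$(awk 'BEGIN { printf "%.0f", (2.6 / 2.0 - 1) * 100 }')
echo "2.0 -> 2.6 GHz: ${gain2}% clock increase (observed speedup: 29%)"
```

The first step (18% gain on a 13% clock increase) is where the gain exceeds the frequency ratio; the second (29% on 30%) tracks it almost exactly, consistent with a largely CPU-bound workload.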

31 RADIOSS Performance System Generations
The Intel E5-2680v3 (Haswell) cluster outperforms prior generations
Performs faster by 100% vs Jupiter, and by 238% vs Janus, at 16 nodes
System components used:
Thor: 2-socket Intel Haswell, 2133MHz DIMMs, EDR IB, v13.0
Jupiter: 2-socket Intel Sandy Bridge, 1600MHz DIMMs, FDR IB, v12.0
Janus: 2-socket Intel Westmere, 1333MHz DIMMs, QDR IB
238% 100% Single Precision

32 RADIOSS Profiling Memory Required Differences in memory consumption between MPP and HMPP MPP: Memory required to run the workload is ~5GB per node HMPP: Approximately 400MB per node is needed (as there are 2 PPN) Considered as a small workload but good enough to observe application behavior MPP, 28PPN HMPP, 2PPN / 14 Threads At 32 Nodes 28 CPU Cores/Node

33 RADIOSS Summary
RADIOSS is designed to perform at large scale in HPC environments
Shows excellent scalability over 896 cores / 32 nodes and beyond with Hybrid MPP
The Hybrid MPP version enhances RADIOSS scalability: 2 MPI processes per node (1 per socket), 14 threads each
Additional CPU cores generally accelerate time to solution
The Intel E5-2680v3 (Haswell) cluster outperforms prior generations: faster by 100% vs Sandy Bridge, by 238% vs Westmere at 16 nodes
Network and MPI tuning:
EDR InfiniBand outperforms Ethernet-based interconnects in scalability
EDR InfiniBand delivers higher scalability performance than FDR and QDR IB
Tuning environment parameters is important to maximize performance
Tuning MPI collective ops helps RADIOSS achieve even better scalability

34 OptiStruct by Altair
Altair OptiStruct is an industry-proven, modern structural analysis solver
Solves linear and non-linear structural problems under static and dynamic loadings
Market-leading solution for structural design and optimization
Helps designers and engineers analyze and optimize structures
Optimizes for strength, durability and NVH (Noise, Vibration, Harshness) characteristics
Helps rapidly develop innovative, lightweight and structurally efficient designs
Based on finite-element and multi-body dynamics technology

35 Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) Thor cluster
Dual-socket 14-core Intel 2.60 GHz CPUs (Turbo on, max perf set in BIOS)
OS: RHEL 6.5, MLNX_OFED_LINUX InfiniBand SW stack
Memory: 64GB per node, DDR4 2133MHz DIMMs
Hard drives: 1TB 7,200 RPM SATA 2.5-inch
Mellanox Switch-IB EDR 100Gb/s InfiniBand switch
Mellanox SwitchX FDR 56Gb/s InfiniBand VPI switch
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
MPI: Intel MPI
Application: Altair OptiStruct 13.0
Benchmark dataset: Engine Assembly

36 Model Engine Block
Grids (Structural): Elements: Local Coordinate Systems: 32
Degrees of Freedom: 4,721,581
Non-zero Stiffness Terms: ,216
Pretension sections: 10
Nonlinear run, small deformation: 3 subcases (reduced to 1 subcase), 27 iterations in total
Memory requirement: 96 GB per node for a 1-MPI-process job (in-core); 16GB per node for an 8-MPI-process job; 8GB per node for a 24-MPI-process job

37 PowerEdge R730
Massive flexibility for data-intensive operations: performance and efficiency
Intelligent hardware-driven systems management with extensive power management features
Innovative tools including automation for parts replacement and lifecycle manageability
Broad choice of networking technologies from GbE to IB
Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
Benefits: designed for performance workloads from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments; high-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities: flexible compute platform with dense storage capacity; 2S/2U server, 6 PCIe slots; large memory footprint (up to 768GB / 24 DIMMs); high I/O performance and optional storage configurations
HDD options: 12 x 3.5-inch or 24 x 2.5-inch HDDs, plus 2 x 2.5-inch HDDs in the rear of the server; up to 26 HDDs with 2 hot-plug drives in the rear for boot or scratch

38 OptiStruct Performance CPU Cores
Running more cores per node generally improves overall performance
The -nproc parameter specifies the number of threads spawned per MPI process
Guideline: 6 threads per MPI process yields the best performance (with either 2 or 4 PPN); this configuration performed best among all tested
Higher is better

39 OptiStruct Performance Interconnect
EDR InfiniBand provides superior scalability performance over Ethernet
11 times better performance than 1GbE at 24 nodes
90% better performance than 10GbE at 24 nodes
Ethernet solutions do not scale beyond 4 nodes
11x 90% Higher is better 2 PPN / 6 Threads

40 OptiStruct Profiling Number of MPI Calls
For 1GbE, communication time is mostly spent on point-to-point transfers; MPI_Iprobe and MPI_Test are the tests for non-blocking transfers. Overall runtime is significantly longer compared to faster interconnects
For 10GbE, communication time is consumed by data transfer, and the amount of time for non-blocking transfers is still significant. Overall runtime is reduced compared to 1GbE; while time for data transfer is reduced, collective operations take a higher share of the overall time
For InfiniBand, time consumed by MPI_Allreduce is more significant compared to data transfer, and overall runtime is reduced significantly compared to Ethernet
1GbE 10GbE EDR IB

42 OptiStruct Performance Interconnect
EDR IB delivers superior scalability performance over previous InfiniBand generations
EDR InfiniBand improves over FDR IB by 40% at 24 nodes
EDR InfiniBand outperforms FDR InfiniBand by 9% at 16 nodes
The new EDR IB architecture supersedes the previous FDR IB generation in scalability
9% 40% Higher is better 4 PPN / 6 Threads

43 OptiStruct Performance Processes Per Node
OptiStruct reduces communication by deploying a hybrid MPI mode
Each hybrid MPI process can spawn threads, which helps reduce communication on the network
Enabling more MPI processes per node helps unlock additional performance
The following environment settings and tuned flags are used:
I_MPI_PIN_DOMAIN=auto, I_MPI_ADJUST_ALLREDUCE=2, I_MPI_ADJUST_BCAST=1, I_MPI_ADJUST_REDUCE=2, ulimit -s unlimited
10% 4% Higher is better
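The settings listed here can be sketched as a launch script. The environment variables and the `-nproc` thread count come from these slides; the `optistruct_run` binary name and the 24-node rank count are our assumptions, so the launch is only echoed:

```shell
#!/bin/sh
# OptiStruct hybrid launch sketch using the settings from this slide.
export I_MPI_PIN_DOMAIN=auto
export I_MPI_ADJUST_ALLREDUCE=2   # Rabenseifner's algorithm
export I_MPI_ADJUST_BCAST=1
export I_MPI_ADJUST_REDUCE=2
ulimit -s unlimited 2>/dev/null || true

# Assumed layout: 24 nodes x 4 ranks/node, 6 threads (-nproc) per rank.
# "optistruct_run" is a hypothetical placeholder for the real invocation.
echo mpirun -ppn 4 -np 96 ./optistruct_run -nproc 6
```

The 4 PPN / 6 threads combination matches the per-node layout the benchmark charts use.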

44 OptiStruct Performance IMPI Tuning
Tuning the Intel MPI collective algorithms can improve performance
The MPI profile shows ~30% of runtime is spent in MPI_Allreduce communications
The default algorithm in Intel MPI is Recursive Doubling (I_MPI_ADJUST_ALLREDUCE=1)
Rabenseifner's algorithm for Allreduce appears to be the best on 24 nodes
Higher is better Intel MPI 4 PPN / 6 Threads

45 OptiStruct Profiling MPI Message Sizes
The most time-consuming MPI communications are:
MPI_Allreduce: messages concentrated at 8B
MPI_Iprobe and MPI_Test generate a high volume of calls testing for message completion
2 PPN / 6 Threads

46 OptiStruct Performance CPU Frequency
An increase in CPU clock speed allows higher job efficiency
Up to 11% higher productivity by increasing clock speed from 2300MHz to 2600MHz
Turbo Mode boosts job efficiency more than the increase in clock speed alone
Up to a 31% performance jump by enabling Turbo Mode at 2600MHz
The gain from Turbo Mode depends on environmental factors, e.g. temperature
11% 4% 8% 10% 17% 31% Higher is better 4 PPN / 6 Threads

47 OptiStruct Profiling Disk I/O
OptiStruct makes use of distributed I/O on the local scratch disks of compute nodes
Heavy disk I/O takes place throughout the run on each compute node
The high I/O usage also causes system memory to be utilized for I/O caching
Disk I/O is distributed across all compute nodes, providing higher I/O performance
The workload completes faster as more nodes take part in the distributed I/O
Higher is better 4 PPN / 6 Threads

48 OptiStruct Profiling MPI Message Sizes
The majority of data transfer takes place between rank 0 and the rest of the ranks
Non-blocking communication is used for these data transfers to hide network latency
The collective operations are much smaller in size
16 Nodes 32 Nodes 2 PPN / 6 Threads

49 OptiStruct Summary
OptiStruct is designed to perform structural analysis at large scale, using a hybrid MPI mode to perform at scale
EDR InfiniBand outperforms Ethernet in scalability: 11 times better performance than 1GbE and 90% better than 10GbE at 24 nodes
EDR InfiniBand improves over FDR IB by 40% at 24 nodes
Each hybrid MPI process can spawn threads, which helps reduce communication on the network; enabling more MPI processes per node helps unlock additional performance
The Hybrid MPP version enhances OptiStruct scalability
Profiling and tuning: CPU, I/O, network
MPI_Allreduce accounts for ~30% of runtime at scale; tuning MPI_Allreduce should allow better performance at high core counts
Guideline: 6 threads per MPI process yields the best performance
Turbo Mode boosts job efficiency more than the increase in clock speed alone
OptiStruct makes use of distributed I/O on the local scratch disks of compute nodes; heavy disk I/O takes place throughout the run on each compute node

50 Thank You Questions? Pak Lui
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.


More information

SGI HPC Systems Help Fuel Manufacturing Rebirth

SGI HPC Systems Help Fuel Manufacturing Rebirth SGI HPC Systems Help Fuel Manufacturing Rebirth Created by T A B L E O F C O N T E N T S 1.0 Introduction 1 2.0 Ongoing Challenges 1 3.0 Meeting the Challenge 2 4.0 SGI Solution Environment and CAE Applications

More information

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014 Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet Anand Rangaswamy September 2014 Storage Developer Conference Mellanox Overview Ticker: MLNX Leading provider of high-throughput,

More information

Data Sheet FUJITSU Server PRIMERGY CX420 S1 Out-of-the-box Dual Node Cluster Server

Data Sheet FUJITSU Server PRIMERGY CX420 S1 Out-of-the-box Dual Node Cluster Server Data Sheet FUJITSU Server PRIMERGY CX420 S1 Out-of-the-box Dual Node Cluster Server Data Sheet FUJITSU Server PRIMERGY CX420 S1 Out-of-the-box Dual Node Cluster Server High availability for lower expertise

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Can High-Performance Interconnects Benefit Memcached and Hadoop? Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA WHITE PAPER April 2014 Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA Executive Summary...1 Background...2 File Systems Architecture...2 Network Architecture...3 IBM BigInsights...5

More information

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical

More information

Pedraforca: ARM + GPU prototype

Pedraforca: ARM + GPU prototype www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of

More information

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) ( TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

SMB Direct for SQL Server and Private Cloud

SMB Direct for SQL Server and Private Cloud SMB Direct for SQL Server and Private Cloud Increased Performance, Higher Scalability and Extreme Resiliency June, 2014 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server

More information

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services ALPS Supercomputing System A Scalable Supercomputer with Flexible Services 1 Abstract Supercomputing is moving from the realm of abstract to mainstream with more and more applications and research being

More information

Introduction. Need for ever-increasing storage scalability. Arista and Panasas provide a unique Cloud Storage solution

Introduction. Need for ever-increasing storage scalability. Arista and Panasas provide a unique Cloud Storage solution Arista 10 Gigabit Ethernet Switch Lab-Tested with Panasas ActiveStor Parallel Storage System Delivers Best Results for High-Performance and Low Latency for Scale-Out Cloud Storage Applications Introduction

More information

SAS Business Analytics. Base SAS for SAS 9.2

SAS Business Analytics. Base SAS for SAS 9.2 Performance & Scalability of SAS Business Analytics on an NEC Express5800/A1080a (Intel Xeon 7500 series-based Platform) using Red Hat Enterprise Linux 5 SAS Business Analytics Base SAS for SAS 9.2 Red

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This

More information

White Paper Solarflare High-Performance Computing (HPC) Applications

White Paper Solarflare High-Performance Computing (HPC) Applications Solarflare High-Performance Computing (HPC) Applications 10G Ethernet: Now Ready for Low-Latency HPC Applications Solarflare extends the benefits of its low-latency, high-bandwidth 10GbE server adapters

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

HP reference configuration for entry-level SAS Grid Manager solutions

HP reference configuration for entry-level SAS Grid Manager solutions HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers Information Technology Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers Effective for FY2016 Purpose This document summarizes High Performance Computing

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

Data Sheet FUJITSU Server PRIMERGY CX272 S1 Dual socket server node for PRIMERGY CX420 cluster server

Data Sheet FUJITSU Server PRIMERGY CX272 S1 Dual socket server node for PRIMERGY CX420 cluster server Data Sheet FUJITSU Server PRIMERGY CX272 S1 Dual socket node for PRIMERGY CX420 cluster Data Sheet FUJITSU Server PRIMERGY CX272 S1 Dual socket node for PRIMERGY CX420 cluster Strong Performance and Cluster

More information

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads 89 Fifth Avenue, 7th Floor New York, NY 10003 www.theedison.com @EdisonGroupInc 212.367.7400 IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads A Competitive Test and Evaluation Report

More information

SUN ORACLE EXADATA STORAGE SERVER

SUN ORACLE EXADATA STORAGE SERVER SUN ORACLE EXADATA STORAGE SERVER KEY FEATURES AND BENEFITS FEATURES 12 x 3.5 inch SAS or SATA disks 384 GB of Exadata Smart Flash Cache 2 Intel 2.53 Ghz quad-core processors 24 GB memory Dual InfiniBand

More information

Microsoft Private Cloud Fast Track Reference Architecture

Microsoft Private Cloud Fast Track Reference Architecture Microsoft Private Cloud Fast Track Reference Architecture Microsoft Private Cloud Fast Track is a reference architecture designed to help build private clouds by combining Microsoft software with NEC s

More information

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes Anthony Kenisky, VP of North America Sales About Appro Over 20 Years of Experience 1991 2000 OEM Server Manufacturer 2001-2007

More information

Crossing the Performance Chasm with OpenPOWER

Crossing the Performance Chasm with OpenPOWER Crossing the Performance Chasm with OpenPOWER Dr. Srini Chari Cabot Partners/IBM chari@cabotpartners.com #OpenPOWERSummit Join the conversation at #OpenPOWERSummit 1 Disclosure Copyright 215. Cabot Partners

More information

Hyperscale. The new frontier for HPC. Philippe Trautmann. HPC/POD Sales Manager EMEA March 13th, 2011

Hyperscale. The new frontier for HPC. Philippe Trautmann. HPC/POD Sales Manager EMEA March 13th, 2011 Hyperscale The new frontier for HPC Philippe Trautmann HPC/POD Sales Manager EMEA March 13th, 2011 Hyperscale the new frontier for HPC New HPC customer requirements demand a shift in technology and market

More information

Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820

Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820 Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820 This white paper discusses the SQL server workload consolidation capabilities of Dell PowerEdge R820 using Virtualization.

More information

Cisco for SAP HANA Scale-Out Solution on Cisco UCS with NetApp Storage

Cisco for SAP HANA Scale-Out Solution on Cisco UCS with NetApp Storage Cisco for SAP HANA Scale-Out Solution Solution Brief December 2014 With Intelligent Intel Xeon Processors Highlights Scale SAP HANA on Demand Scale-out capabilities, combined with high-performance NetApp

More information

Data Center Storage Solutions

Data Center Storage Solutions Data Center Storage Solutions Enterprise software, appliance and hardware solutions you can trust When it comes to storage, most enterprises seek the same things: predictable performance, trusted reliability

More information

Hadoop on the Gordon Data Intensive Cluster

Hadoop on the Gordon Data Intensive Cluster Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,

More information

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers White Paper rev. 2015-11-27 2015 FlashGrid Inc. 1 www.flashgrid.io Abstract Oracle Real Application Clusters (RAC)

More information

HPC Update: Engagement Model

HPC Update: Engagement Model HPC Update: Engagement Model MIKE VILDIBILL Director, Strategic Engagements Sun Microsystems mikev@sun.com Our Strategy Building a Comprehensive HPC Portfolio that Delivers Differentiated Customer Value

More information

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

More information

Achieving a High Performance OLTP Database using SQL Server and Dell PowerEdge R720 with Internal PCIe SSD Storage

Achieving a High Performance OLTP Database using SQL Server and Dell PowerEdge R720 with Internal PCIe SSD Storage Achieving a High Performance OLTP Database using SQL Server and Dell PowerEdge R720 with This Dell Technical White Paper discusses the OLTP performance benefit achieved on a SQL Server database using a

More information

FUJITSU Enterprise Product & Solution Facts

FUJITSU Enterprise Product & Solution Facts FUJITSU Enterprise Product & Solution Facts shaping tomorrow with you Business-Centric Data Center The way ICT delivers value is fundamentally changing. Mobile, Big Data, cloud and social media are driving

More information

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended

More information

PRIMERGY server-based High Performance Computing solutions

PRIMERGY server-based High Performance Computing solutions PRIMERGY server-based High Performance Computing solutions PreSales - May 2010 - HPC Revenue OS & Processor Type Increasing standardization with shift in HPC to x86 with 70% in 2008.. HPC revenue by operating

More information

Intel Solid-State Drives Increase Productivity of Product Design and Simulation

Intel Solid-State Drives Increase Productivity of Product Design and Simulation WHITE PAPER Intel Solid-State Drives Increase Productivity of Product Design and Simulation Intel Solid-State Drives Increase Productivity of Product Design and Simulation A study of how Intel Solid-State

More information

High Throughput File Servers with SMB Direct, Using the 3 Flavors of RDMA network adapters

High Throughput File Servers with SMB Direct, Using the 3 Flavors of RDMA network adapters High Throughput File Servers with SMB Direct, Using the 3 Flavors of network adapters Jose Barreto Principal Program Manager Microsoft Corporation Abstract In Windows Server 2012, we introduce the SMB

More information

HP ProLiant DL580 Gen8 and HP LE PCIe Workload WHITE PAPER Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

HP ProLiant DL580 Gen8 and HP LE PCIe Workload WHITE PAPER Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture WHITE PAPER HP ProLiant DL580 Gen8 and HP LE PCIe Workload WHITE PAPER Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture Based on Microsoft SQL Server 2014 Data Warehouse

More information

HP recommended configuration for Microsoft Exchange Server 2010: HP LeftHand P4000 SAN

HP recommended configuration for Microsoft Exchange Server 2010: HP LeftHand P4000 SAN HP recommended configuration for Microsoft Exchange Server 2010: HP LeftHand P4000 SAN Table of contents Executive summary... 2 Introduction... 2 Solution criteria... 3 Hyper-V guest machine configurations...

More information

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays Executive Summary Microsoft SQL has evolved beyond serving simple workgroups to a platform delivering sophisticated

More information

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document

More information

4 th Workshop on Big Data Benchmarking

4 th Workshop on Big Data Benchmarking 4 th Workshop on Big Data Benchmarking MPP SQL Engines: architectural choices and their implications on benchmarking 09 Oct 2013 Agenda: Big Data Landscape Market Requirements Benchmark Parameters Benchmark

More information

BRIDGING EMC ISILON NAS ON IP TO INFINIBAND NETWORKS WITH MELLANOX SWITCHX

BRIDGING EMC ISILON NAS ON IP TO INFINIBAND NETWORKS WITH MELLANOX SWITCHX White Paper BRIDGING EMC ISILON NAS ON IP TO INFINIBAND NETWORKS WITH Abstract This white paper explains how to configure a Mellanox SwitchX Series switch to bridge the external network of an EMC Isilon

More information

Advancing Applications Performance With InfiniBand

Advancing Applications Performance With InfiniBand Advancing Applications Performance With InfiniBand Pak Lui, Application Performance Manager September 12, 2013 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server and

More information

Cloud Computing through Virtualization and HPC technologies

Cloud Computing through Virtualization and HPC technologies Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC

More information

præsentation oktober 2011

præsentation oktober 2011 Johnny Olesen System X presale præsentation oktober 2011 2010 IBM Corporation 2 Hvem er jeg Dagens agenda Server overview System Director 3 4 Portfolio-wide Innovation with IBM System x and BladeCenter

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

HPC Growing Pains. Lessons learned from building a Top500 supercomputer

HPC Growing Pains. Lessons learned from building a Top500 supercomputer HPC Growing Pains Lessons learned from building a Top500 supercomputer John L. Wofford Center for Computational Biology & Bioinformatics Columbia University I. What is C2B2? Outline Lessons learned from

More information

Self service for software development tools

Self service for software development tools Self service for software development tools Michal Husejko, behalf of colleagues in CERN IT/PES CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Self service for software development tools

More information

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband A P P R O I N T E R N A T I O N A L I N C Steve Lyness Vice President, HPC Solutions Engineering slyness@appro.com Company Overview

More information

InfiniBand Update Addressing new I/O challenges in HPC, Cloud, and Web 2.0 infrastructures. Brian Sparks IBTA Marketing Working Group Co-Chair

InfiniBand Update Addressing new I/O challenges in HPC, Cloud, and Web 2.0 infrastructures. Brian Sparks IBTA Marketing Working Group Co-Chair InfiniBand Update Addressing new I/O challenges in HPC, Cloud, and Web 2.0 infrastructures Brian Sparks IBTA Marketing Working Group Co-Chair Page 1 IBTA & OFA Update IBTA today has over 50 members; OFA

More information

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

More information

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009

More information

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Clusters: Mainstream Technology for CAE

Clusters: Mainstream Technology for CAE Clusters: Mainstream Technology for CAE Alanna Dwyer HPC Division, HP Linux and Clusters Sparked a Revolution in High Performance Computing! Supercomputing performance now affordable and accessible Linux

More information

RoCE vs. iwarp Competitive Analysis

RoCE vs. iwarp Competitive Analysis WHITE PAPER August 21 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...4 Summary...

More information

Choosing the Best Network Interface Card Mellanox ConnectX -3 Pro EN vs. Intel X520

Choosing the Best Network Interface Card Mellanox ConnectX -3 Pro EN vs. Intel X520 COMPETITIVE BRIEF August 2014 Choosing the Best Network Interface Card Mellanox ConnectX -3 Pro EN vs. Intel X520 Introduction: How to Choose a Network Interface Card...1 Comparison: Mellanox ConnectX

More information

Deep Learning GPU-Based Hardware Platform

Deep Learning GPU-Based Hardware Platform Deep Learning GPU-Based Hardware Platform Hardware and Software Criteria and Selection Mourad Bouache Yahoo! Performance Engineering Group Sunnyvale, CA +1.408.784.1446 bouache@yahoo-inc.com John Glover

More information

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre University of Cambridge, UIS, HPC Service Authors: Wojciech Turek, Paul Calleja, John Taylor

More information

APACHE HADOOP PLATFORM HARDWARE INFRASTRUCTURE SOLUTIONS

APACHE HADOOP PLATFORM HARDWARE INFRASTRUCTURE SOLUTIONS APACHE HADOOP PLATFORM BIG DATA HARDWARE INFRASTRUCTURE SOLUTIONS 1 BIG DATA. BIG CHALLENGES. BIG OPPORTUNITY. How do you manage the VOLUME, VELOCITY & VARIABILITY of complex data streams in order to find

More information

Kriterien für ein PetaFlop System

Kriterien für ein PetaFlop System Kriterien für ein PetaFlop System Rainer Keller, HLRS :: :: :: Context: Organizational HLRS is one of the three national supercomputing centers in Germany. The national supercomputing centers are working

More information

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation

More information

Dell Reference Configuration for Hortonworks Data Platform

Dell Reference Configuration for Hortonworks Data Platform Dell Reference Configuration for Hortonworks Data Platform A Quick Reference Configuration Guide Armando Acosta Hadoop Product Manager Dell Revolutionary Cloud and Big Data Group Kris Applegate Solution

More information

Building All-Flash Software Defined Storages for Datacenters. Ji Hyuck Yun (dr.jhyun@sk.com) Storage Tech. Lab SK Telecom

Building All-Flash Software Defined Storages for Datacenters. Ji Hyuck Yun (dr.jhyun@sk.com) Storage Tech. Lab SK Telecom Building All-Flash Software Defined Storages for Datacenters Ji Hyuck Yun (dr.jhyun@sk.com) Storage Tech. Lab SK Telecom Introduction R&D Motivation Synergy between SK Telecom and SK Hynix Service & Solution

More information

PowerEdge Servers and Solutions for Business Applications

PowerEdge Servers and Solutions for Business Applications PowerEdge s and Solutions for Business s PowerEdge servers and solutions for business At Dell, we listen to you every day, and what you have been telling us is that your infrastructures are being pushed

More information

The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5.

The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5. Performance benefit of MAX5 for databases The MAX5 Advantage: Clients Benefit running Microsoft SQL Server Data Warehouse (Workloads) on IBM BladeCenter HX5 with IBM MAX5 Vinay Kulkarni Kent Swalin IBM

More information

Current Status of FEFS for the K computer

Current Status of FEFS for the K computer Current Status of FEFS for the K computer Shinji Sumimoto Fujitsu Limited Apr.24 2012 LUG2012@Austin Outline RIKEN and Fujitsu are jointly developing the K computer * Development continues with system

More information

Dell s SAP HANA Appliance

Dell s SAP HANA Appliance Dell s SAP HANA Appliance SAP HANA is the next generation of SAP in-memory computing technology. Dell and SAP have partnered to deliver an SAP HANA appliance that provides multipurpose, data source-agnostic,

More information

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011 Oracle Database Reliability, Performance and scalability on Intel platforms Mitch Shults, Intel Corporation October 2011 1 Intel Processor E7-8800/4800/2800 Product Families Up to 10 s and 20 Threads 30MB

More information