LS-DYNA Scalability on Cray Supercomputers
Ting-Ting Zhu, Cray Inc.
Jason Wang, Livermore Software Technology Corp.
WP-LS-DYNA-12213 www.cray.com

Table of Contents

Abstract
Introduction
Scalability Analysis of Pure MPP LS-DYNA on a Cray XC30 System
  Software and hardware configurations
  Pure MPP LS-DYNA scalability
  Scalability of element processing and contact
  MPI profile of pure MPP LS-DYNA
Scalability Analysis of Hybrid LS-DYNA on a Cray XC30 System
  Advantage of Hybrid LS-DYNA over pure MPP LS-DYNA
  Effect of the number of OpenMP threads on Hybrid LS-DYNA performance
Effect of Intel Hyper-Threading Technology and Linux HugePages on LS-DYNA Performance
  Effect of Intel Hyper-Threading Technology on LS-DYNA performance
  Effect of Linux HugePages on LS-DYNA performance
Conclusion
Acknowledgement
References

Abstract

For the automotive industry, car crash analysis by finite elements is crucial to shortening the design cycle and reducing costs. To increase analysis accuracy and improve finite element technique, smaller cells in finite element meshes are used to better represent car geometry. The use of finer meshes, coupled with the need for fast turnaround, has put increased demand on the scalability of finite element analysis. In this paper, we will use the car2car model to measure the scalability of LS-DYNA on the Cray XC30 supercomputer, an Intel Xeon processor-based system using the Aries network. Specifically, we will discuss:

- Scalability of different functions in LS-DYNA at high core counts
- The MPI communication pattern of LS-DYNA
- The performance difference between using 1 thread per core and 2 threads per core
- The performance impact of using large Linux HugePages

Introduction

Since May 2013, Cray and Livermore Software Technology Corporation (LSTC) have been working jointly with a customer on some very large (25 million to 100 million element) explicit LS-DYNA analysis problems to evaluate LS-DYNA scalability on the Cray XC30 system. As of December 2013, such a problem takes nearly a month to complete on a Linux cluster at the customer site, due to the large problem size and the limited scalability of MPP LS-DYNA. To shorten their design cycle, the customer needs better hardware and software performance. In the last decade, the number of cores per processor has doubled every couple of years, but the central processing unit (CPU) clock rate has not been getting much faster. To reduce simulation turnaround time, we need highly scalable supercomputers and improved LS-DYNA parallel performance. Due to the load imbalance inherent in the LS-DYNA contact algorithm, MPP LS-DYNA scalability has been limited to 1,000 to 2,000 cores, depending on the size of the problem.
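The scaling limit described above can be sketched with Amdahl's law: if some fraction of the work is effectively serialized (for example, by load imbalance in the contact algorithm), speedup is capped no matter how many cores are added. A minimal illustration, with a purely hypothetical serial fraction:

```python
def amdahl_speedup(n_cores: int, serial_fraction: float) -> float:
    """Ideal speedup over one core when a fixed fraction of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even a tiny serialized share flattens the curve long before 4,096 cores.
# The 0.001 serial fraction here is illustrative, not a measured value.
for cores in (256, 512, 1024, 2048, 4096):
    print(cores, round(amdahl_speedup(cores, 0.001), 1))
```

Note how the incremental gain from each doubling of cores shrinks: the curve approaches the hard ceiling of 1/serial_fraction, which is the qualitative shape of the MPP scaling results below.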
In order to improve LS-DYNA scalability, we need to use HYBRID LS-DYNA, which combines distributed memory parallelism using MPI with shared memory parallelism using OpenMP. The advantage of HYBRID LS-DYNA at low core counts has been demonstrated by Kondo and Makino [1], and by Meng and Wang [2]. In this paper, we will use the car2car model from topcrunch.org to understand the scalability of MPP and HYBRID LS-DYNA at high core counts on the Cray XC30 system.

Scalability Analysis of Pure MPP LS-DYNA on a Cray XC30 System

First, we use the car2car model with 2.4 million elements to test pure MPP LS-DYNA parallel performance at high core counts of 256 to 4,096 on the XC30 system.

Software and hardware configurations

The following hardware and software configurations are used for this evaluation:

Table 1: Summary of Hardware and Software Configurations
System Type:            Cray XC30
Processor Architecture: Intel Xeon processor E5 family (2.6 GHz)
Sockets:                2 sockets per node
Cores:                  8 cores per socket (16 cores per node)
Network:                Aries interconnect with Dragonfly topology
Memory Size:            32 GB per node
OS:                     Cray CLE 5.1
LS-DYNA Version:        R61_84148
Compiler Version:       Intel 13.1.3.192
MPI Version:            Cray MPICH 6.0.2

Pure MPP LS-DYNA scalability

Figure 1 shows the scalability of pure MPP LS-DYNA in car2car at core counts from 256 to 4,096 on an XC30 system. Pure MPP LS-DYNA scales reasonably well up to 2,048 cores, but beyond that the scaling levels off. In the next section we will look at LS-DYNA functionality profiles to understand the limiting factor in pure MPP LS-DYNA scaling.

Figure 1: Pure MPP LS-DYNA scalability in car2car on a Cray XC30 system
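One way to make "scaling levels off" concrete is parallel efficiency relative to the smallest run: the ratio of ideal time (perfect scaling from the baseline) to measured time. A small sketch with illustrative wall-clock times, not the measured car2car numbers, shaped like the curve in Figure 1:

```python
def relative_efficiency(base_cores: int, base_time: float,
                        cores: int, time: float) -> float:
    """Parallel efficiency versus a smaller baseline run:
    1.0 means perfect scaling, lower means scaling losses."""
    ideal_time = base_time * base_cores / cores
    return ideal_time / time

# Hypothetical timings (seconds) with a plateau at high core counts:
t = {256: 6400.0, 512: 3400.0, 1024: 1900.0, 2048: 1200.0, 4096: 1100.0}
for cores in (512, 1024, 2048, 4096):
    print(cores, round(relative_efficiency(256, t[256], cores, t[cores]), 2))
```

With numbers of this shape, efficiency falls steeply past 2,048 cores, which is the behavior the functionality profiles below set out to explain.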

Scalability of element processing and contact

LS-DYNA reports detailed timing information for several major functions at the end of its log file. For car crash simulations, element processing, the contact algorithm, and rigid bodies consume more than 85 percent of total wall clock time at high core counts. Since there is no MPI barrier call between the contact algorithm and rigid bodies, we measure the combined time spent on both (referred to as contact+rigid bodies in this paper). Now let's look at how element processing and contact+rigid bodies of pure MPP LS-DYNA perform at high core counts. Figure 2 shows that element processing scales very well up to 4,096 cores, while contact+rigid bodies scales poorly at core counts of 256 to 2,048 and hardly scales at all after 2,048 cores. This result indicates that contact+rigid bodies scalability is the factor limiting pure MPP LS-DYNA scaling beyond 2,048 cores. Next, we will use a Cray MPI profiling tool to show the lack of scalability in contact+rigid bodies.

Figure 2: Scalability of element processing and contact+rigid bodies in car2car on a Cray XC30 system

MPI profile of pure MPP LS-DYNA

The Cray MPI profiler reports the total elapsed time, computational time, and MPI time, as well as MPI communication patterns. For collective calls, it further breaks MPI time down into MPI synchronization time (time a process spends synchronizing with other processes) and MPI communication time (time a process spends communicating with other processes). MPI synchronization time is proportional to the amount of load imbalance in a program, while MPI communication time is inversely proportional to the speed of the interconnect. For point-to-point communications, such as MPI_Send and MPI_Recv, synchronization time and communication time cannot be measured separately; rather, the reported MPI time for a point-to-point call is the sum of its synchronization and communication time.
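The distinction between synchronization and communication time can be illustrated with a toy model: at a barrier-like collective, every rank waits for the slowest one, so each rank's wait is its gap to the maximum compute time. A sketch (the per-rank timings are invented, not profiler output):

```python
def sync_times(compute_times: list[float]) -> list[float]:
    """Per-rank wait time at a barrier-like collective: the slowest
    rank sets the pace; everyone else spends the gap synchronizing."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

balanced   = [10.0, 10.0, 10.0, 10.0]
imbalanced = [ 7.0,  9.0, 10.0, 14.0]   # e.g. uneven contact work per rank

print(sum(sync_times(balanced)))     # perfectly balanced: no waiting
print(sum(sync_times(imbalanced)))   # same total work, pure imbalance cost
```

Both cases carry the same total compute work (40.0), yet the imbalanced one accumulates 16.0 rank-seconds of synchronization, which is exactly the kind of time the profiler attributes to load imbalance rather than to the interconnect.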

Figure 3 shows the scalability of the MPI time of pure MPP LS-DYNA in car2car on the XC30 system. The total MPI time drops from 2,390 seconds at a core count of 256 to 1,478 seconds at 2,048 cores, but rises to 1,570 seconds at 4,096 cores. The percentage of total elapsed time spent in MPI increases from 36 percent at 256 cores to 74 percent at 4,096 cores. Additionally, MPI synchronization time is about 65 percent of total MPI time. This result further indicates that load imbalance in the contact algorithm of pure MPP LS-DYNA is the main cause of poor scalability at high core counts.

Figure 3: Scalability of MPI time in car2car on a Cray XC30 system

From the Cray profiler reports, we find that 80 percent of the MPI time in pure MPP LS-DYNA is spent in MPI_Bcast, MPI_Allreduce, and MPI_Recv. Figure 4 depicts the MPI time, synchronization time, and communication time of these three MPI calls. It reveals that MPI_Bcast communication time is less than 2.5 percent of MPI_Bcast synchronization time, and that overall MPI_Bcast time decreases as core count increases. MPI_Allreduce communication time is less than 5 percent of MPI_Allreduce synchronization time, and overall MPI_Allreduce time increases with core count. MPI_Recv time (about 26 to 38 percent of total MPI time) decreases as core count increases in the range of 256 to 1,024 cores, but increases with core count in the range of 1,024 to 4,096 cores. As mentioned earlier, MPI_Recv synchronization and communication time cannot be measured separately; based on the amount of synchronization time in MPI_Bcast and MPI_Allreduce, we believe that the larger portion of the measured MPI_Recv time is probably spent on synchronization. This result indicates that most of the MPI time is spent on MPI synchronization, meaning that load imbalance in LS-DYNA is what prevents it from scaling beyond 2,048 cores. It also means that, without the load imbalance issue, the speed of MPI communication on an XC30 system would be fast enough for pure MPP LS-DYNA to potentially scale to 4,096 cores and beyond.
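From the reported MPI times and MPI-time percentages, the elapsed and compute times can be backed out, which makes the trend concrete: compute time keeps shrinking with core count while MPI time does not. A small calculation (assuming the quoted percentages pair with the listed core counts in order; only three of the five measured points are shown):

```python
# (cores, total MPI time in seconds, MPI share of total elapsed time)
reported = [(256, 2390.0, 0.36), (2048, 1478.0, 0.66), (4096, 1570.0, 0.74)]

for cores, mpi_t, mpi_frac in reported:
    elapsed = mpi_t / mpi_frac   # back out total elapsed time from the share
    compute = elapsed - mpi_t    # what remains is computation
    print(f"{cores:5d} cores: elapsed ~{elapsed:6.0f} s, compute ~{compute:6.0f} s")
```

Computation itself scales well from 256 to 2,048 cores and barely improves after that, while MPI time stays roughly flat, so its share of the total inevitably grows toward the 74 percent figure.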

Figure 4: MPI communication patterns of pure MPP LS-DYNA in car2car on a Cray XC30 system

One effective way to reduce the impact of load imbalance on LS-DYNA parallel performance is to use HYBRID LS-DYNA, which combines MPI processes with OpenMP threads. With HYBRID LS-DYNA, we can push LS-DYNA scaling beyond 2,048 cores. In the next section, we will discuss HYBRID LS-DYNA parallel performance in car2car, as well as its comparison with pure MPP LS-DYNA on an XC30 system.

Scalability Analysis of Hybrid LS-DYNA on a Cray XC30 System

Advantage of Hybrid LS-DYNA over pure MPP LS-DYNA

The shared memory parallelism of explicit LS-DYNA scales well up to 8 threads. In the first evaluation, we run HYBRID LS-DYNA using 4 OpenMP threads per MPI rank at core counts from 256 to 4,096. Figure 5 shows that HYBRID LS-DYNA is 23 to 60 percent faster than pure MPP LS-DYNA at the same core counts.

Figure 5: Performance comparison between pure MPP and HYBRID LS-DYNA in car2car on a Cray XC30 system
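A hybrid run splits a fixed core budget between MPI ranks and OpenMP threads; fewer ranks means fewer domain boundaries for the imbalance-prone contact work, at the cost of relying on threading within each rank. A small, purely illustrative helper enumerating the rank and thread combinations used in the thread-count study below:

```python
def decompositions(total_cores: int,
                   threads_options: tuple[int, ...] = (2, 4, 8)) -> list[tuple[int, int]]:
    """All (MPI ranks, OpenMP threads) splits of a fixed core budget,
    for the thread counts under study."""
    return [(total_cores // t, t) for t in threads_options
            if total_cores % t == 0]

print(decompositions(256))   # the three combinations tried at 256 cores
```

For 256 cores this yields 128 ranks with 2 threads, 64 ranks with 4 threads, and 32 ranks with 8 threads: the same total parallelism, but very different MPI process counts.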

Figure 6 and Figure 7 show the performance comparisons of element processing and contact+rigid bodies between pure MPP and HYBRID LS-DYNA on an XC30 system. Element processing in HYBRID LS-DYNA is 35 to 58 percent faster than in pure MPP LS-DYNA. Contact+rigid bodies in HYBRID LS-DYNA is 2 percent slower than in MPP LS-DYNA at 256 cores, becomes 1 percent faster at 512 cores, and reaches 43 percent faster at 4,096 cores. So for core counts of 256 to 512, HYBRID LS-DYNA gains its performance advantage over MPP LS-DYNA mainly from element processing; for core counts of 1,024 to 4,096, it gains from both element processing and contact+rigid bodies.

Figure 6: Element processing performance comparison between MPP and HYBRID LS-DYNA

Figure 7: Contact+rigid bodies performance comparison between MPP and HYBRID LS-DYNA

In the next section, we will explore the effect of the number of OpenMP threads on HYBRID LS-DYNA performance at high core counts.

Effect of the number of OpenMP threads on Hybrid LS-DYNA performance

In this study, we keep the total number of cores, which equals MPI ranks × OpenMP threads, the same while varying the number of MPI ranks and OpenMP threads. For example, for 256 cores we use three different combinations: 128 MPI ranks × 2 OpenMP threads, 64 MPI ranks × 4 OpenMP threads, and 32 MPI ranks × 8 OpenMP threads. Figure 8 depicts the performance differences among the three thread counts. At 256 cores, using 2 threads is 1.5 percent faster than using 4 threads and 7 percent faster than using 8 threads. At 2,048 cores, using 8 threads is about 3 percent faster than using 4 threads and 9 percent faster than using 2 threads. At the other core counts, using 4 threads is 5 to 23 percent faster than using 2 threads and 5 to 8 percent faster than using 8 threads. So we recommend the use of 4 OpenMP threads for HYBRID LS-DYNA at high core counts.

Figure 8: Hybrid LS-DYNA performance comparisons among different OpenMP thread counts

Effect of Intel Hyper-Threading Technology and Linux HugePages on LS-DYNA Performance

Effect of Intel Hyper-Threading Technology on LS-DYNA performance

Intel Hyper-Threading Technology (Intel HT Technology) is enabled in the Cray XC30 system's BIOS. We run pure MPP LS-DYNA with 2 threads per core in car2car at 256 to 2,048 cores. Using 2 threads per core is about 5 to 29 percent slower than using only 1 thread per core at the same core counts.

Effect of Linux HugePages on LS-DYNA performance

Cray XC30 software allows the use of different sizes of Linux HugePages. The default Linux page size is 4 kilobytes. We test pure MPP LS-DYNA in car2car using 2-megabyte Linux HugePages at core counts of 256 to 4,096, but no noticeable performance advantage is observed. This is because LS-DYNA is a cache-friendly code.

Conclusion

Pure MPP LS-DYNA scaling stops at 2,048 cores on the Cray XC30 supercomputer; the limiting factor is load imbalance in the LS-DYNA contact algorithm. HYBRID LS-DYNA extends LS-DYNA scaling to 4,096 cores, and perhaps beyond, on the XC30 system. In general, using 4 OpenMP threads yields the best, or close to the best, performance. Using 2 threads per core for LS-DYNA is not recommended on the Cray XC30 system, and using large Linux HugePages has no effect on LS-DYNA performance.

Acknowledgement

The authors would like to thank Cray Inc. for their support of this work.

References

[1] Kenshiro Kondo and Mitsuhiro Makino, "High Performance of Car Crash Simulation by LS-DYNA Hybrid Parallel Version on Fujitsu FX1," 11th International LS-DYNA Users Conference.
[2] Nick Meng, Jason Wang, et al., "New Features in LS-DYNA HYBRID Version," 11th International LS-DYNA Users Conference.

WP-LS-DYNA-12213 www.cray.com