LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling January 2009

Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The participating members would like to thank LSTC for their support and guidelines The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidelines For more info please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com 2

LS-DYNA LS-DYNA A general purpose structural and fluid analysis simulation software package capable of simulating complex real world problems Developed by the Livermore Software Technology Corporation (LSTC) LS-DYNA used by Automobile Aerospace Construction Military Manufacturing Bioengineering 3

LS-DYNA LS-DYNA SMP (Shared Memory Processing) Optimize the power of multiple CPUs within single machine LS-DYNA MPP (Massively Parallel Processing) The MPP version of LS-DYNA allows to run LS-DYNA solver over High-performance computing cluster Uses message passing (MPI) to obtain parallelism Many companies are switching from SMP to MPP For cost-effective scaling and performance 4

Objectives The presented research was done to provide best practices LS-DYNA performance benchmarking Interconnect performance comparisons Ways to increase LS-DYNA productivity Understanding LS-DYNA communication pattern MPI libraries comparisons Power-aware consideration 5

Test Cluster Configuration Dell PowerEdge SC 1435 24-node cluster Quad-Core AMD Opteron Model 2358 processors ( Barcelona ) Mellanox InfiniBand ConnectX DDR HCAs Mellanox InfiniBand DDR Switch Memory: 16GB memory, DDR2 667MHz per node OS: RHEL5U2, OFED 1.3 InfiniBand SW stack MPI: HP MPI 2.2.7, Platform MPI 5.6.5 Application: LS-DYNA MPP971 Benchmark Workload Three Vehicle Collision Test simulation Neon-Refined Revised Crash Test simulation 6

Mellanox InfiniBand Solutions Industry Standard Hardware, software, cabling, management Design for clustering and storage interconnect Performance 40Gb/s node-to-node 120Gb/s switch-to-switch 1us application latency Most aggressive roadmap in the industry Reliable with congestion management Efficient RDMA and Transport Offload Kernel bypass CPU focuses on application processing Scalable for Petascale computing & beyond End-to-end quality of service Virtualization acceleration I/O consolidation Including storage The InfiniBand Performance Gap is Increasing 60Gb/s 20Gb/s 120Gb/s 40Gb/s 240Gb/s (12X) 80Gb/s (4X) Ethernet Fibre Channel InfiniBand Delivers the Lowest Latency 7

Quad-Core AMD Opteron Processor Performance Quad-Core Enhanced CPU IPC Dual Channel Reg DDR2 4x 512K L2 cache 2MB L3 Cache 8 GB/S 8 GB/S Direct Connect Architecture HyperTransport technology Up to 24 GB/s Floating Point 8 GB/S 8 GB/S PCI-E PCI-E 128-bit FPU per core Bridge Bridge Bridge Bridge 4 FLOPS/clk peak per core 8 GB/S Memory 1GB Page Support USB USB I/O I/O Hub Hub DDR-2 667 MHz PCI PCI Scalability 48-bit Physical Addressing Compatibility Same power/thermal envelopes as Second-Generation AMD Opteron processor 8 November5, 2007 8

Dell PowerEdge Servers helping Simplify IT System Structure and Sizing Guidelines 24-node cluster build with Dell PowerEdge SC 1435 Servers Servers optimized for High Performance Computing environments Building Block Foundations for best price/performance and performance/watt Dell HPC Solutions Scalable Architectures for High Performance and Productivity Dell's comprehensive HPC services help manage the lifecycle requirements. Integrated, Tested and Validated Architectures Workload Modeling Optimized System Size, Configuration and Workloads Test-bed Benchmarks ISV Applications Characterization Best Practices & Usage Analysis 9

LS-DYNA Performance Results - Interconnect InfiniBand high speed interconnect enables highest scalability Performance gain with cluster size Performance over GigE is not scaling Slowdown occurs as number of processors increases beyond 16 nodes LS-DYNA - 3 Vehicle Collision LS-DYNA - Neon Refined Revised 8000 7000 6000 5000 4000 3000 2000 1000 0 700 600 500 400 300 200 100 0 10 Elapsed time (Seconds) Elapsed time (Seconds) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) InfiniBand GigE InfiniBand GigE Lower is better

LS-DYNA Performance Results - Interconnect 120% 100% 80% 60% 40% 20% 0% LS-DYNA - 3 Vehicle Collision (InfiniBand vs GigE) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) 140% 120% 100% 80% 60% 40% 20% 0% LS-DYNA - Neon Refined Revised (InfiniBand vs GigE) Performance Advantage Performance Advantage 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) InfiniBand outperforms GigE by up to 132% As node number increases, bigger advantage is expected 11

LS-DYNA Performance Results CPU Affinity 7000 6000 5000 4000 3000 2000 1000 0 LS-DYNA - 3 Vehicle Collision (CPU Affinity vs Non-Affinity) CPU affinity accelerates performance up to 10% Saves up to 177 seconds per simulation 12 Elapsed time (Seconds) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) CPU Affinity Without Affinity Lower is better

LS-DYNA Performance Results - Productivity InfiniBand increases productivity by allowing multiple jobs to run simultaneously Providing required productivity for virtual vehicle design Three cases are presented Single job over the entire systems (with CPU affinity) Two jobs, each on a single CPU per server (job placement, CPU affinity) Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity) Four jobs per day increases productivity by 97% for Neon Refined Revised, 57% for 3 Car collision case Increased number of parallel processes (jobs) increases the load on the interconnect High speed and low latency interconnect solution is required for gaining high productivity LS-DYNA - 3 Vehicle Collision LS-DYNA - Neon Refined Revised Jobs per Day 90 80 70 60 50 40 30 20 10 0 Jobs per Day 1000 900 800 700 600 500 400 300 200 100 0 Higher is better 1 Job 2 Parallel Jobs 4 Parallel Jobs 1 Job 2 Parallel Jobs 4 Parallel Jobs 13

LS-DYNA Profiling Data Transferred 1E+12 LS-DYNA MPI Profiliing (3 Vehicle Collision) 1E+11 Total Size (MB) 1E+10 1E+09 100000000 10000000 [0..64B] [64..256B] [256B..1KB] [1..4KB] [4..16KB] [16..64KB] [64..256KB] [256KB..1M] [1..4M] [4M..infinity] Message Size 4nodes 8nodes 12nodes 16nodes 20nodes 24nodes Majority of data transfer is done via 256B-4KB message size 14

LS-DYNA Profiling Message Distribution LS-DYNA MPI Profiliing 1000000000 (3 Vehicle Collision) 100000000 Number of Messages 10000000 1000000 100000 10000 1000 100 10 1 [0..64B] [64..256B] [256B..1KB] [1..4KB] [4..16KB] [16..64KB] [64..256KB] [256KB..1M] [1..4M] [4M..infinity] Message Size 4nodes 8nodes 12nodes 16nodes 20nodes 24nodes Majority of the messages are in the range of 2B-4KB 2B-256B for synchronization, 256B-4KB for data communications 15

LS-DYNA Profiling Message Distribution 70% LS-DYNA MPI Profiliing (3 Vehicle Collision) 60% % of total messages 50% 40% 30% 20% 10% 0% [0..64] [65..256] [257..1024] [1025..4096] [4097..16384] Message Size 4nodes 8nodes 12nodes 16nodes 20nodes 24nodes As number of nodes scales, percentage of small messages increases percentage of 256-1KB messages is relatively consistent with cluster size Actual number increases with cluster size, 16

LS-DYNA Profiling MPI Collectives Two key MPI collective functions in LS-DYNA MPI_AllReduce MPI_Bcast Account for the majority of MPI communication overhead % of Total Overhead (ms) 70% 60% 50% 40% 30% 20% 10% 0% MPI Collectives MPI_AllReduce MPI_Bcast 17

MPI Collective Benchmarking MPI collective performance comparison Two frequently called collection operations in LS-DYNA were benchmarked MPI_Allreduce MPI_Bcast Platform MPI shows better latency for AllReduce operation MPI_AllReduce MPI_Bcast 120 30 100 25 Latency(usec) 80 60 40 20 Latency(usec) 20 15 10 5 0 0 4 8 16 32 64 128 256 512 0 0 1 2 4 8 16 32 64 128 256 512 Message Size Message Size HP-MPI Platform MPI HP-MPI Platform MPI 18

LS-DYNA with Different MPI Libraries LS-DYNA performance Comparison Each MPI library shows different benefits for latency and collectives As such, HP-MPI and Platform MPI shows comparable performance 7000 6000 5000 4000 3000 2000 1000 LS-DYNA - 3 Vehicle Collision 550 500 450 400 350 300 250 200 150 100 LS-DYNA - Neon Refined Revised 19 Elapsed time (Seconds) Elapsed time (Seconds) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) Platform MPI HP-MPI Platform MPI HP-MPI Lower is better

LS-DYNA Profiling Summary - Interconnect LS-DYNA was profiled to determine networking dependency Majority of data transferred between compute nodes Done with 256B-4KB message size, data transferred increases with cluster size Most used message sizes <64B messages mainly synchronizations 64B-4KB mainly compute related Message size distribution Percentage of smaller messages (<64B) increases with cluster size Mainly due to the needed synchronization Percentage of mid-size messages (64B-4KB) is kept the same with cluster size Compute transactions increases with cluster size Percentage of very large messages decreases with cluster size Mainly used for problem data distribution at the simulation initialization phase LS-DYNA interconnect sensitivity points Interconnect latency and throughput for 64B-4KB message range Collectives operations performance, mainly MPI_Allreduce 20

Test Cluster Configuration System Upgrade The following results were achieved after system upgrade (changes are in green) Dell PowerEdge SC 1435 24-node cluster Quad-Core AMD Opteron Model 2382 processors ( Shanghai ) (vs Barcelona in previous configuration) Mellanox InfiniBand ConnectX DDR HCAs Mellanox InfiniBand DDR Switch Memory: 16GB memory, DDR2 800MHz per node (vs 667MHz in previous configuration) OS: RHEL5U2, OFED 1.3 InfiniBand SW stack MPI: HP MPI 2.2.7, Platform MPI 5.6.5 Application: LS-DYNA MPP971 Benchmark Workload Three-Car Crash Test simulation Neon-Refined Revised Crash Test simulation 21

Quad-Core AMD Opteron Processor Performance Quad-Core Enhanced CPU IPC Dual Channel Reg DDR2 4x 512K L2 cache 6MB L3 Cache 8 GB/S 8 GB/S Direct Connect Architecture HyperTransport technology Up to 24 GB/s peak per processor Floating Point 8 GB/S 8 GB/S PCI-E PCI-E 128-bit FPU per core Bridge Bridge Bridge Bridge 4 FLOPS/clk peak per core 8 GB/S Integrated Memory Controller Up to 12.8 GB/s USB USB I/O I/O Hub Hub DDR2-800 MHz or DDR2-667 MHz PCI PCI Scalability 48-bit Physical Addressing Compatibility Same power/thermal envelopes as 2 nd / 3 rd generation AMD Opteron processor 22 November5, 2007 22

Performance Improvement Upgraded AMD CPU and DDR-2 Memory LS-DYNA run time decreased by more than 20% Leveraging InfiniBand 20Gb/s for higher scalability LS-DYNA - 3 Vehicle Collision LS-DYNA - Neon Refined Revised 7000 6000 5000 4000 3000 2000 1000 0 600 500 400 300 200 100 0 23 Elapsed time (Seconds) Elapsed time (Seconds) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) 6 (48 Cores) 10 (80 Cores) 14 (112 Cores) 18 (144 Cores) 22 (176 Cores) Barcelona Shanghai Barcelona Shanghai Lower is better

Maximize LS-DYNA Productivity Scalable latency of InfiniBand and latest Shanghai processor deliver scalable LS-DYNA performance LS-DYNA - 3 Vehicle Collision LS-DYNA - Neon Refined Revised Jobs per Day 120 100 80 60 40 20 0 Jobs per Day 1400 1200 1000 800 600 400 200 0 1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs 1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs Higher is better 24

LS-DYNA with Shanghai Processors Shanghai processors provides higher performance compared to Barcelona % of more jobs per day 30% 25% 20% 15% 10% 5% 0% LS-DYNA - 3 Vehicle Collision (Shanghai vs Barcelona) 1 Job 2 Parallel Jobs 4 Parallel Jobs 25

LS-DYNA Performance Results - Interconnect InfiniBand 20Gb/s vs 10GigE vs GigE InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases Reducing run time by up to 60% versus 10GigE and 61% vs GigE Performance loss shown beyond 16 nodes with 10GigE and GigE InfiniBand 20Gb/s maintain scalability with cluster size LS-DYNA - Neon Refined Revised (HP-MPI) LS-DYNA - 3 Vehicle Collision (HP-MPI) Elapsed time (Seconds) 600 500 400 300 200 100 0 Elapsed time (Seconds) 6000 5000 4000 3000 2000 1000 0 Lower is better GigE 10GigE InfiniBand GigE 10GigE InfiniBand 26

Power Consumption Comparison Wh per Job 4500 4000 3500 3000 2500 2000 1500 1000 500 0 Power Consumption (InfiniBand vs 10GigE vs GigE) 50% 3 Vehicle Collision Neon Refined Revised GigE 10GigE InfiniBand 62% InfiniBand also enables power efficient simulations Reducing power/job by up to 62%! 24-node comparison 27

Conclusions LS-DYNA is widely used to simulate many real-world problems Automotive crash-testing and finite-element simulations Developed by Livermore Software Technology Corporation (LSTC) LS-DYNA performance and productivity relies on Scalable HPC systems and interconnect solutions Low latency and high throughput interconnect technology NUMA aware application for fast access to local memory Reasonable job distribution can dramatically improve productivity Increasing number of jobs per day while maintaining fast run time Interconnect comparison shows InfiniBand delivers superior performance and productivity in every cluster size Scalability requires low latency and zero scalable latency Lowest power consumption was achieved with InfiniBand Saving in system power, cooling and real-estate 28

Thank You HPC Advisory Council HPC@mellanox.com All trademarks are property of their respective owners. All information is provided As-Is without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein 29 29