LS-DYNA Performance Benchmarks and Profiling - January 2009

Note
- The following research was performed under the HPC Advisory Council activities
  - Participating members: AMD, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- The participating members would like to thank LSTC for their support and guidelines
- The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidelines
- For more information, please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com

LS-DYNA
- A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
- Developed by the Livermore Software Technology Corporation (LSTC)
- LS-DYNA is used in the automotive, aerospace, construction, military, manufacturing, and bioengineering industries

LS-DYNA
- LS-DYNA SMP (Shared Memory Processing)
  - Exploits the power of multiple CPUs within a single machine
- LS-DYNA MPP (Massively Parallel Processing)
  - The MPP version runs the LS-DYNA solver across a high-performance computing cluster
  - Uses message passing (MPI) to obtain parallelism
- Many companies are switching from SMP to MPP for cost-effective scaling and performance
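
The deck does not show LS-DYNA MPP internals; purely as an illustration of the pattern described above, here is a minimal, hypothetical C/MPI sketch of a domain-decomposed solver: each rank updates its own slice of the model, exchanges boundary ("halo") data with its neighbors, and joins a global MPI_Allreduce every cycle - the same collective the profiling slides later identify as dominant. All names and sizes below are illustrative, not LSTC code.

```c
/* Minimal sketch (not LSTC code) of the MPP message-passing pattern:
 * halo exchange between neighboring subdomains plus a global reduction. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1024   /* elements owned by this rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u = calloc(LOCAL_N + 2, sizeof(double));   /* +2 halo cells */
    for (int i = 1; i <= LOCAL_N; ++i)
        u[i] = (double)(rank * LOCAL_N + i);            /* dummy initial state */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; ++step) {
        /* Exchange halo cells with neighboring subdomains */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local explicit update on the owned elements */
        for (int i = 1; i <= LOCAL_N; ++i)
            u[i] += 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);

        /* Global check each cycle: the kind of MPI_Allreduce that dominates
         * LS-DYNA's collective overhead in the profiles later in this deck */
        double local_max = 0.0, global_max;
        for (int i = 1; i <= LOCAL_N; ++i)
            if (u[i] > local_max) local_max = u[i];
        MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("done on %d ranks\n", size);
    free(u);
    MPI_Finalize();
    return 0;
}
```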

Objectives
The presented research was done to provide best practices for:
- LS-DYNA performance benchmarking
- Interconnect performance comparisons
- Ways to increase LS-DYNA productivity
- Understanding the LS-DYNA communication pattern
- MPI library comparisons
- Power-aware considerations

Test Cluster Configuration
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron Model 2358 processors ("Barcelona")
- Mellanox InfiniBand ConnectX DDR HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 667MHz per node
- OS: RHEL5U2, OFED 1.3 InfiniBand SW stack
- MPI: HP MPI 2.2.7, Platform MPI 5.6.5
- Application: LS-DYNA MPP971
- Benchmark workloads:
  - Three Vehicle Collision test simulation
  - Neon Refined Revised crash test simulation

Mellanox InfiniBand Solutions
- Industry standard
  - Hardware, software, cabling, management
  - Designed for clustering and storage interconnect
- Performance
  - 40Gb/s node-to-node
  - 120Gb/s switch-to-switch
  - 1us application latency
  - Most aggressive roadmap in the industry
- Reliable with congestion management
- Efficient
  - RDMA and transport offload
  - Kernel bypass; CPU focuses on application processing
- Scalable for Petascale computing and beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Chart: "The InfiniBand Performance Gap is Increasing" - bandwidth roadmap for InfiniBand (20/40/60/80/120/240 Gb/s, 4X and 12X links) vs. Ethernet and Fibre Channel; InfiniBand delivers the lowest latency]

Quad-Core AMD Opteron Processor ("Barcelona")
- Performance
  - Quad-core, enhanced CPU IPC
  - Dual-channel registered DDR2
  - 4x 512KB L2 cache, 2MB L3 cache
- Direct Connect Architecture
  - HyperTransport technology, up to 24 GB/s
- Floating point
  - 128-bit FPU per core
  - 4 FLOPS/clk peak per core
- Memory
  - 1GB page support
  - DDR2-667 MHz
- Scalability
  - 48-bit physical addressing
- Compatibility
  - Same power/thermal envelopes as second-generation AMD Opteron processors
[Diagram: Direct Connect Architecture - processors linked by 8 GB/s HyperTransport links to PCI-E bridges and the I/O hub]

Dell PowerEdge Servers Helping Simplify IT
- System structure and sizing guidelines
  - 24-node cluster built with Dell PowerEdge SC 1435 servers
  - Servers optimized for High Performance Computing environments
  - Building-block foundations for best price/performance and performance/watt
- Dell HPC solutions
  - Scalable architectures for high performance and productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, tested and validated architectures
- Workload modeling
  - Optimized system size, configuration and workloads
  - Test-bed benchmarks
  - ISV application characterization
  - Best practices and usage analysis

LS-DYNA Performance Results - Interconnect
- The InfiniBand high-speed interconnect enables the highest scalability
  - Performance gain grows with cluster size
- Performance over GigE does not scale
  - A slowdown occurs as the number of processors increases beyond 16 nodes
[Charts: LS-DYNA 3 Vehicle Collision and Neon Refined Revised - elapsed time (seconds) vs. cluster size (6 to 22 nodes / 48 to 176 cores), InfiniBand vs. GigE; lower is better]

LS-DYNA Performance Results - Interconnect
- InfiniBand outperforms GigE by up to 132%
- As the node count increases, a bigger advantage is expected
[Charts: Performance advantage of InfiniBand vs. GigE for 3 Vehicle Collision and Neon Refined Revised, 6 to 22 nodes (48 to 176 cores)]
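
The deck does not spell out how the advantage percentage is computed; it is presumably the relative reduction in elapsed time. A reconstruction, with purely hypothetical numbers:

```latex
% Assumed definition of the advantage metric (not stated in the deck):
\[
\text{advantage} = \left(\frac{T_{\text{GigE}}}{T_{\text{InfiniBand}}} - 1\right) \times 100\%
\]
% Hypothetical example: T_GigE = 1160 s and T_InfiniBand = 500 s
% give (1160/500 - 1) x 100% = 132%.
```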

LS-DYNA Performance Results - CPU Affinity
- CPU affinity accelerates performance by up to 10%
- Saves up to 177 seconds per simulation
[Chart: LS-DYNA 3 Vehicle Collision - elapsed time (seconds), CPU affinity vs. no affinity, 6 to 22 nodes (48 to 176 cores); lower is better]
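
The deck does not show how affinity was applied; in practice it is usually set through the MPI launcher's binding options rather than in application code. Below is a minimal, hypothetical C sketch of the underlying Linux mechanism, assuming a simple rank-to-core round-robin mapping (an assumption, not the report's exact method).

```c
/* Minimal sketch: pin each MPI rank to one core so the scheduler cannot
 * migrate it between cores/sockets. Real runs typically use the MPI
 * launcher's binding options instead of calling the kernel API directly. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int core = rank % (int)ncores;   /* assumes ranks are packed per node */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
    else
        printf("rank %d pinned to core %d\n", rank, core);

    /* ... solver work would run here, now bound to a single core ... */

    MPI_Finalize();
    return 0;
}
```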

LS-DYNA Performance Results - Productivity
- InfiniBand increases productivity by allowing multiple jobs to run simultaneously
  - Providing the productivity required for virtual vehicle design
- Three cases are presented
  - A single job over the entire system (with CPU affinity)
  - Two jobs, each on a single CPU per server (job placement, CPU affinity)
  - Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity)
- Running four jobs in parallel increases jobs-per-day productivity by 97% for Neon Refined Revised and 57% for the 3 Vehicle Collision case
- An increased number of parallel processes (jobs) increases the load on the interconnect
  - A high-speed, low-latency interconnect is required to gain high productivity
[Charts: Jobs per day for 3 Vehicle Collision and Neon Refined Revised - 1 job vs. 2 parallel jobs vs. 4 parallel jobs; higher is better]
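
The deck does not state the formula behind the jobs-per-day charts; the usual throughput definition, reconstructed here with hypothetical numbers, is:

```latex
% Assumed throughput metric (reconstruction, not stated in the deck):
\[
\text{jobs per day} = N_{\text{parallel jobs}} \times \frac{86{,}400\ \text{s}}{T_{\text{job}}\ [\text{s}]}
\]
% Hypothetical example: four parallel jobs of 1,800 s each
% give 4 x (86,400 / 1,800) = 192 jobs per day.
```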

LS-DYNA Profiling - Data Transferred
- The majority of data transfer is done with 256B-4KB message sizes
[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision - total data transferred (MB, log scale) per message-size bin from [0..64B] to [4MB..infinity], for 4 to 24 nodes]

LS-DYNA Profiling - Message Distribution
- The majority of messages are in the range of 2B-4KB
  - 2B-256B messages are mainly for synchronization; 256B-4KB messages carry data communications
[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision - number of messages (log scale) per message-size bin from [0..64B] to [4MB..infinity], for 4 to 24 nodes]
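
The deck does not name the MPI profiling tool used. Histograms like the two above are commonly gathered through the standard PMPI interposition interface; a minimal, illustrative C sketch follows (it wraps MPI_Send only, whereas a real profiler would also wrap nonblocking and collective calls). The bin edges mirror the chart's; everything else is an assumption.

```c
/* Illustrative PMPI interposer: count messages and bytes per size bin.
 * Build as a library and link it ahead of the MPI library.
 * (MPI-3 uses "const void *buf"; older MPIs declare plain "void *buf".) */
#include <mpi.h>
#include <stdio.h>

#define NBINS 10
static long long msg_count[NBINS];
static long long msg_bytes[NBINS];

/* Bin edges match the chart: 64B, 256B, 1KB, 4KB, ..., 4MB, then infinity */
static int bin_of(long long bytes)
{
    long long edge = 64;
    for (int b = 0; b < NBINS - 1; ++b, edge *= 4)
        if (bytes <= edge) return b;
    return NBINS - 1;
}

/* Intercept MPI_Send; the real call is forwarded through PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    int b = bin_of((long long)count * size);
    msg_count[b]++;
    msg_bytes[b] += (long long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

/* Dump the per-rank histogram when the application shuts down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int b = 0; b < NBINS; ++b)
        printf("rank %d bin %d: %lld msgs, %lld bytes\n",
               rank, b, msg_count[b], msg_bytes[b]);
    return PMPI_Finalize();
}
```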

LS-DYNA Profiling - Message Distribution
- As the number of nodes scales, the percentage of small messages increases
- The percentage of 256B-1KB messages is relatively consistent across cluster sizes, although their actual number increases with cluster size
[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision - percentage of total messages per size bin ([0..64], [65..256], [257..1024], [1025..4096], [4097..16384] bytes), for 4 to 24 nodes]

LS-DYNA Profiling - MPI Collectives
- Two key MPI collective functions in LS-DYNA: MPI_Allreduce and MPI_Bcast
- Together they account for the majority of MPI communication overhead
[Chart: Percentage of total MPI overhead (ms) attributed to MPI_Allreduce and MPI_Bcast]

MPI Collective Benchmarking
- MPI collective performance comparison
  - The two collective operations most frequently called in LS-DYNA were benchmarked: MPI_Allreduce and MPI_Bcast
- Platform MPI shows better latency for the Allreduce operation
[Charts: MPI_Allreduce and MPI_Bcast latency (usec) vs. message size (0-512 bytes), HP-MPI vs. Platform MPI]
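
The deck does not identify the benchmark suite behind these latency curves (they resemble IMB/OSU-style tests). The following is a minimal, illustrative C sketch of how MPI_Allreduce latency is typically measured across message sizes; iteration counts and sizes are arbitrary assumptions.

```c
/* Illustrative Allreduce latency micro-benchmark (IMB/OSU-style loop). */
#include <mpi.h>
#include <stdio.h>

#define ITERS  1000
#define MAXLEN 512          /* number of doubles; swept like the charts */

int main(int argc, char **argv)
{
    int rank;
    double in[MAXLEN], out[MAXLEN];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < MAXLEN; ++i) in[i] = (double)i;

    for (int len = 1; len <= MAXLEN; len *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
        double t0 = MPI_Wtime();
        for (int it = 0; it < ITERS; ++it)
            MPI_Allreduce(in, out, len, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double usec = (MPI_Wtime() - t0) / ITERS * 1e6;

        /* Report the slowest rank: collective latency is bounded by it */
        double max_usec;
        MPI_Reduce(&usec, &max_usec, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%4d doubles: %8.2f usec per MPI_Allreduce\n", len, max_usec);
    }
    MPI_Finalize();
    return 0;
}
```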

LS-DYNA with Different MPI Libraries
- LS-DYNA performance comparison
  - Each MPI library shows different benefits for latency and collectives
  - As a result, HP-MPI and Platform MPI show comparable overall LS-DYNA performance
[Charts: LS-DYNA 3 Vehicle Collision and Neon Refined Revised - elapsed time (seconds), Platform MPI vs. HP-MPI, 6 to 22 nodes (48 to 176 cores); lower is better]

LS-DYNA Profiling Summary - Interconnect
- LS-DYNA was profiled to determine its networking dependency
- Majority of data transferred between compute nodes
  - Uses 256B-4KB message sizes; the amount of data transferred increases with cluster size
- Most-used message sizes
  - Messages under 64B are mainly synchronization
  - 64B-4KB messages are mainly compute-related
- Message size distribution
  - The percentage of small messages (<64B) increases with cluster size, mainly due to the required synchronization
  - The percentage of mid-size messages (64B-4KB) stays the same with cluster size, while the number of compute transactions increases with cluster size
  - The percentage of very large messages decreases with cluster size; these are mainly used for distributing problem data during the simulation initialization phase
- LS-DYNA interconnect sensitivity points
  - Interconnect latency and throughput for the 64B-4KB message range
  - Collective operation performance, mainly MPI_Allreduce

Test Cluster Configuration - System Upgrade
The following results were achieved after a system upgrade (changes from the previous configuration are noted in parentheses):
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron Model 2382 processors ("Shanghai") (vs. Barcelona in the previous configuration)
- Mellanox InfiniBand ConnectX DDR HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 800MHz per node (vs. 667MHz in the previous configuration)
- OS: RHEL5U2, OFED 1.3 InfiniBand SW stack
- MPI: HP MPI 2.2.7, Platform MPI 5.6.5
- Application: LS-DYNA MPP971
- Benchmark workloads:
  - Three Vehicle Collision test simulation
  - Neon Refined Revised crash test simulation

Quad-Core AMD Opteron Processor ("Shanghai")
- Performance
  - Quad-core, enhanced CPU IPC
  - Dual-channel registered DDR2
  - 4x 512KB L2 cache, 6MB L3 cache
- Direct Connect Architecture
  - HyperTransport technology, up to 24 GB/s peak per processor
- Floating point
  - 128-bit FPU per core
  - 4 FLOPS/clk peak per core
- Integrated memory controller
  - Up to 12.8 GB/s
  - DDR2-800 MHz or DDR2-667 MHz
- Scalability
  - 48-bit physical addressing
- Compatibility
  - Same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Diagram: Direct Connect Architecture - processors linked by 8 GB/s HyperTransport links to PCI-E bridges and the I/O hub]

Performance Improvement - Upgraded AMD CPU and DDR2 Memory
- LS-DYNA run time decreased by more than 20%
- Leveraging InfiniBand 20Gb/s for higher scalability
[Charts: LS-DYNA 3 Vehicle Collision and Neon Refined Revised - elapsed time (seconds), Barcelona vs. Shanghai, 6 to 22 nodes (48 to 176 cores); lower is better]

Maximize LS-DYNA Productivity
- The scalable latency of InfiniBand and the latest Shanghai processors deliver scalable LS-DYNA performance
[Charts: Jobs per day for 3 Vehicle Collision and Neon Refined Revised - 1, 2, 4 and 8 parallel jobs; higher is better]

LS-DYNA with Shanghai Processors
- Shanghai processors provide higher performance than Barcelona
[Chart: Percentage of additional jobs per day, Shanghai vs. Barcelona, 3 Vehicle Collision - 1, 2 and 4 parallel jobs]

LS-DYNA Performance Results - Interconnect: InfiniBand 20Gb/s vs. 10GigE vs. GigE
- InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases
  - Reduces run time by up to 60% versus 10GigE and 61% versus GigE
- Performance loss appears beyond 16 nodes with 10GigE and GigE
- InfiniBand 20Gb/s maintains scalability with cluster size
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision (HP-MPI) - elapsed time (seconds), GigE vs. 10GigE vs. InfiniBand; lower is better]

Power Consumption Comparison
- InfiniBand also enables power-efficient simulations
- Reduces power per job by up to 62%
- 24-node comparison
[Chart: Power consumption (Wh per job) for 3 Vehicle Collision and Neon Refined Revised - GigE vs. 10GigE vs. InfiniBand; annotated savings of 50% and 62%]
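
The deck does not define the Wh-per-job metric; a reasonable reconstruction, with purely hypothetical numbers, is:

```latex
% Assumed definition of the energy-per-job metric (not stated in the deck):
\[
\text{Wh per job} = P_{\text{cluster}}\,[\text{W}] \times \frac{T_{\text{job}}\,[\text{s}]}{3600}
\]
% Hypothetical example: a 24-node cluster drawing 7,200 W that finishes a
% job in 1,800 s consumes 7,200 x 0.5 = 3,600 Wh for that job.
```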

Conclusions
- LS-DYNA is widely used to simulate many real-world problems
  - Automotive crash testing and finite-element simulations
  - Developed by Livermore Software Technology Corporation (LSTC)
- LS-DYNA performance and productivity rely on
  - Scalable HPC systems and interconnect solutions
  - Low-latency, high-throughput interconnect technology
  - NUMA-aware application design for fast access to local memory
- Reasonable job distribution can dramatically improve productivity
  - Increasing the number of jobs per day while maintaining fast run times
- The interconnect comparison shows InfiniBand delivers superior performance and productivity at every cluster size
  - Scalability requires latency that is both low and stays flat ("scalable latency") as the cluster grows
- The lowest power consumption was achieved with InfiniBand
  - Saving system power, cooling and real estate

Thank You
HPC Advisory Council, HPC@mellanox.com

All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.