LS-DYNA Performance Benchmark and Profiling. February 2014

Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: Intel, Dell, Mellanox, LSTC
  - Compute resource: HPC Advisory Council Cluster Center
- The following was done to provide best practices:
  - LS-DYNA performance overview
  - Understanding LS-DYNA communication patterns
  - Ways to increase LS-DYNA productivity
  - MPI libraries comparisons
- For more info please refer to http://www.dell.com, http://www.intel.com, http://www.mellanox.com, http://www.lstc.com

LS-DYNA
- A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
- Developed by the Livermore Software Technology Corporation (LSTC)
- LS-DYNA is used by the automotive, aerospace, construction, military, manufacturing, and bioengineering industries

Objectives
- The presented research was done to provide best practices:
  - LS-DYNA performance benchmarking
  - MPI library performance comparison
  - Interconnect performance comparison
  - CPU comparison
  - Compiler comparison
- The presented results will demonstrate:
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Test Cluster Configuration
- Dell PowerEdge R720xd/R720 32-node (640-core) cluster
  - Dual-socket 10-core Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz (Turbo Mode enabled)
  - Memory: 64GB DDR3 1600 MHz Dual Rank memory modules (Static Max Perf set in BIOS)
  - Hard drives: 24x 250GB 7.2K RPM 2.5" SATA drives in RAID 0
  - OS: RHEL 6.2, MLNX_OFED 2.1-1.0.0 InfiniBand SW stack
  - Intel Cluster Ready certified cluster
- Mellanox Connect-IB FDR InfiniBand and ConnectX-3 Ethernet adapters
- Mellanox SwitchX 6036 VPI InfiniBand and Ethernet switches
- MPI: Intel MPI 4.1, Platform MPI 9.1, Open MPI 1.6.5 w/ FCA 2.5 & MXM 2.1
- Application: LS-DYNA
  - mpp971_s_r3.2.1_intel_linux86-64 (for TopCrunch)
  - ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2
- Benchmark datasets: 3 Vehicle Collision, Neon refined revised

About Intel Cluster Ready
- Intel Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
  - Simplifies selection, deployment, and operation of a cluster
  - A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
  - Focus on your work productivity, spend less management time on the cluster
- Select Intel Cluster Ready
  - Where the cluster is delivered ready to run
  - Hardware and software are integrated and configured together
  - Applications are registered, validating execution on the Intel Cluster Ready architecture
  - Includes the Intel Cluster Checker tool, to verify functionality and periodically check cluster health

PowerEdge R720xd
Massive flexibility for data-intensive operations
- Performance and efficiency
  - Intelligent hardware-driven systems management with extensive power management features
  - Innovative tools including automation for parts replacement and lifecycle manageability
  - Broad choice of networking technologies from GigE to IB
  - Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
- Benefits
  - Designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
  - High-performance scale-out compute and low-cost dense storage in one package
- Hardware capabilities
  - Flexible compute platform with dense storage capacity
  - 2S/2U server, 6 PCIe slots
  - Large memory footprint (up to 768GB / 24 DIMMs)
  - High I/O performance and optional storage configurations
  - HDD options: 12 x 3.5" or 24 x 2.5", plus 2 x 2.5" HDDs in the rear of the server
  - Up to 26 HDDs, with 2 hot-plug drives in the rear of the server for boot or scratch

LS-DYNA Performance - System Generations
- Intel E5-2600 v2 series (Ivy Bridge) outperforms prior generations
  - Up to 20% higher performance than the Intel Xeon E5-2680 (Sandy Bridge) cluster
  - Up to 70% higher performance than the Intel Xeon X5670 (Westmere) cluster
  - Up to 167% higher performance than the Intel Xeon X5570 (Nehalem) cluster
- System components used:
  - Ivy Bridge: 2-socket 10-core E5-2680 v2 @ 2.8GHz, 1600MHz DIMMs, Connect-IB FDR
  - Sandy Bridge: 2-socket 8-core E5-2680 @ 2.7GHz, 1600MHz DIMMs, ConnectX-3 FDR
  - Westmere: 2-socket 6-core X5670 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR
  - Nehalem: 2-socket 4-core X5570 @ 2.93GHz, 1333MHz DIMMs, ConnectX-2 QDR
(Chart: higher is better; FDR InfiniBand)

LS-DYNA Performance - Interconnects
- FDR InfiniBand delivers superior scalability in application performance
  - Provides over 7 times higher performance than 1GbE (707%)
  - Almost 5 times faster than 10GbE and 40GbE (492% and 474%)
  - 1GbE stops scaling beyond 4 nodes, and 10GbE stops scaling beyond 8 nodes
  - Only FDR InfiniBand demonstrates continuous performance gains at scale
(Chart: higher is better; Intel E5-2680 v2)

LS-DYNA Performance - Open MPI Tuning
- FCA and MXM enhance LS-DYNA performance at scale for Open MPI
  - FCA offloads MPI collective operations to hardware, while MXM provides memory enhancements to parallel communication libraries
  - FCA and MXM provide a speedup of 18% over the untuned baseline run at 32 nodes
- MCA parameters for enabling FCA and MXM (see the example command line below):
  - For enabling MXM: -mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 -x MXM_RDMA_PORTS=mlx5_0:1
  - For enabling FCA: -mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0
(Chart: higher is better; FDR InfiniBand)
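
As a rough illustration, the parameters above could be combined on a single Open MPI 1.6.5 launch line roughly as follows. The hostfile name, rank count (640 ranks = 32 nodes x 20 cores), and input deck placeholder are assumptions for illustration, not the exact benchmark scripts:

    # Open MPI launch with MXM point-to-point acceleration and FCA collective offload enabled
    mpirun -np 640 -hostfile hosts -bind-to-core \
        -mca mtl mxm -mca pml cm -mca mtl_mxm_np 0 \
        -x MXM_TLS=ud,shm,self -x MXM_SHM_RNDV_THRESH=32768 -x MXM_RDMA_PORTS=mlx5_0:1 \
        -mca coll_fca_enable 1 -mca coll_fca_np 0 -x fca_ib_dev_name=mlx5_0 \
        ./ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2 i=<input.k>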

LS-DYNA Performance - TopCrunch
- The HPC Advisory Council system performs better than the previous best published results
  - TopCrunch (www.topcrunch.com) publishes LS-DYNA performance results
  - HPCAC achieved better performance on a per-node basis
  - 9% to 27% higher performance than the best published results on TopCrunch (Feb 2014)
- Comparing to all platforms on TopCrunch, the HPC Advisory Council results are the world's best for systems of 2 to 32 nodes
  - Achieving higher performance than systems with larger node counts
(Chart: per-node-count gains of 9% to 27% over previous best published results; higher is better; FDR InfiniBand)

LS-DYNA Performance - Memory Module
- Dual-rank memory modules provide better speedup for LS-DYNA
  - Using Dual Rank 1600MHz DIMMs is 8.9% faster than Single Rank 1866MHz DIMMs
  - Memory rank has more impact on performance than memory speed
- System components used:
  - DDR3-1866MHz PC14900 CL13 Single Rank Memory Module
  - DDR3-1600MHz PC12800 CL11 Dual Rank Memory Module
(Chart: higher is better; single node)

LS-DYNA Performance - MPI Libraries
- Platform MPI performs better than Open MPI and Intel MPI
  - Up to 10% better than Open MPI and 30% better than Intel MPI
- Tuning parameters used (see the example settings below):
  - Open MPI: -bind-to-core, FCA, MXM, KNEM
  - Platform MPI: -cpu_bind, XRC
  - Intel MPI: I_MPI_FABRICS shm:ofa, I_MPI_PIN on, I_MPI_ADJUST_BCAST 1, I_MPI_DAPL_SCALABLE_PROGRESS 1
(Chart: higher is better; FDR InfiniBand)
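
As a sketch of how the Intel MPI settings above might be applied, the values can be exported as environment variables before launching. The launcher invocation, hostfile, rank count, and input deck placeholder are illustrative assumptions rather than the benchmark's actual scripts:

    # Intel MPI 4.1: select the shm:ofa fabric, pin ranks, and apply the listed tuning knobs
    export I_MPI_FABRICS=shm:ofa
    export I_MPI_PIN=on
    export I_MPI_ADJUST_BCAST=1
    export I_MPI_DAPL_SCALABLE_PROGRESS=1
    mpiexec.hydra -n 640 -f hosts ./ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2 i=<input.k>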

LS-DYNA Performance - System Utilization
- Maximizing system productivity by running 2 jobs in parallel
  - Up to 77% increase in system utilization at 32 nodes
  - Run the jobs in parallel by splitting the system resources in half (a sketch of one possible launch scheme follows below)
- System configurations used:
  - Single job: use all cores and both IB ports to run a single job
  - 2 jobs in parallel: the cores of 1 CPU and 1 IB port for one job, the rest for the other
(Chart: higher is better; FDR InfiniBand)
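
One way such a two-job split could be expressed, assuming Open MPI with MXM (as on the earlier tuning slide), numactl for socket binding, and one Connect-IB port per job. The flags, port numbering, hostfile, and input deck placeholders are illustrative assumptions, not the actual benchmark scripts:

    # Job A: 10 ranks per node on socket 0, first Connect-IB port
    mpirun -np 320 -npernode 10 -hostfile hosts \
        -mca mtl mxm -mca pml cm -x MXM_RDMA_PORTS=mlx5_0:1 \
        numactl --cpunodebind=0 --membind=0 \
        ./ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2 i=<job_a.k> &
    # Job B: 10 ranks per node on socket 1, second Connect-IB port
    mpirun -np 320 -npernode 10 -hostfile hosts \
        -mca mtl mxm -mca pml cm -x MXM_RDMA_PORTS=mlx5_0:2 \
        numactl --cpunodebind=1 --membind=1 \
        ./ls-dyna_mpp_s_r7_0_0_79069_x64_ifort120_sse2 i=<job_b.k> &
    wait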

LS-DYNA Profiling - MPI/User Time Ratio
- Computation time is dominant compared to MPI communication time
  - The MPI communication ratio increases as the cluster scales
- Both computation time and communication time decline as the cluster scales
  - The InfiniBand infrastructure allows spreading the work without adding overhead
  - Computation time drops faster compared to communication time
- Compute bound: tuning for computation performance could yield better results
(Chart: FDR InfiniBand)

LS-DYNA Profiling - MPI Calls
- MPI_Wait, MPI_Send and MPI_Irecv are the most frequently used MPI calls
  - MPI_Wait (31%), MPI_Send (21%), MPI_Irecv (18%), MPI_Recv (14%), MPI_Isend (12%)
- The majority of LS-DYNA's MPI calls are point-to-point data transfers
  - Both blocking and non-blocking point-to-point transfers are seen
- LS-DYNA makes extensive use of the MPI API; over 23 MPI routines are used

LS-DYNA Profiling - Time Spent by MPI Calls
- The majority of MPI time is spent in MPI_Recv and MPI collective operations
  - MPI_Recv (36%), MPI_Allreduce (27%), MPI_Bcast (24%)
- MPI communication time drops gradually as the cluster scales
  - Due to the faster total runtime: as more CPUs work on completing the job faster, the communication time of each MPI call is reduced

LS-DYNA Profiling - MPI Message Sizes
- Similar communication characteristics are seen on both input datasets
  - Both exhibit similar communication patterns
(Charts: Neon refined revised, 32 nodes; 3 Vehicle Collision, 32 nodes)

LS-DYNA Profiling - MPI Message Sizes
- Most MPI messages are small to medium in size
  - Most message sizes are between 0 and 64B
- For the most time-consuming MPI calls:
  - MPI_Recv: most messages are under 4KB
  - MPI_Bcast: the majority are less than 16B, but larger messages exist
  - MPI_Allreduce: most messages are less than 256B
(Chart: 3 Vehicle Collision, 32 nodes)

LS-DYNA Profiling - MPI Data Transfer
- As the cluster grows, substantially less data is transferred between MPI processes
  - Drops from ~10GB per rank at 2 nodes to ~4GB per rank at 32 nodes
- Rank 0 shows higher transfer volumes than the rest of the MPI ranks
  - Rank 0 is responsible for file I/O and uses MPI to communicate with the rest of the ranks

LS-DYNA Profiling - Aggregated Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- Large data transfers take place in LS-DYNA
  - Around 2TB of data is exchanged between the nodes at 32 nodes
(Chart: FDR InfiniBand)

LS-DYNA Profiling - Memory Usage by Node
- A uniform amount of memory is consumed for running LS-DYNA
  - About 38GB of data is used per node
  - Each node runs 20 ranks, thus about 2GB per rank is needed
  - The same trend continues from 2 to 32 nodes
(Charts: 3 Vehicle Collision, 2 nodes and 32 nodes)

LS-DYNA Summary
- Performance
  - Intel Xeon E5-2680 v2 (Ivy Bridge) and FDR InfiniBand enable LS-DYNA to scale
    - Up to 20% over Sandy Bridge, up to 70% over Westmere, 167% over Nehalem
  - FDR InfiniBand delivers superior scalability in application performance
    - Over 7 times higher performance than 1GbE and over 5 times higher than 10/40GbE at 32 nodes
  - Dual-rank memory modules provide better speedup
    - Using Dual Rank 1600MHz DIMMs is ~9% faster than Single Rank 1866MHz DIMMs
  - The HPC Advisory Council results beat the best published results on TopCrunch
    - 9% to 27% higher performance than the best published results on TopCrunch (Feb 2014)
  - Tuning FCA and MXM enhances LS-DYNA performance at scale for Open MPI
    - Provides a speedup of 18% over the untuned baseline run at 32 nodes
  - As the CPU/MPI time ratio shows, significantly more computation than communication is taking place
- Profiling
  - The majority of MPI calls are (blocking and non-blocking) point-to-point communications
  - The majority of MPI time is spent in MPI_Recv and MPI collective operations

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as-is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.