FLOW-3D Performance Benchmark and Profiling. September 2012


Note
- The following research was performed under the HPC Advisory Council activities
- Participating vendors: FLOW-3D, Dell, Intel, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
- For more info please refer to
  http://www.dell.com/hpc, http://www.flow3d.com, http://www.intel.com, http://www.mellanox.com

FLOW-3D
- FLOW-3D is a powerful and highly accurate CFD software package
  - Provides engineers valuable insight into many physical flow processes
- FLOW-3D is the ideal computational fluid dynamics software
  - To use in the design phase as well as in improving production processes
  - Provides special capabilities for accurately predicting free-surface flows
- FLOW-3D is a standalone, all-inclusive CFD package
  - Includes an integrated GUI that ties components from problem setup to post-processing

Objectives
- The following was done to provide best practices
  - FLOW-3D performance benchmarking
  - Interconnect performance comparisons
  - Understanding FLOW-3D communication patterns
  - Ways to increase FLOW-3D productivity
  - MPI library comparisons
- The presented results will demonstrate
  - The scalability of the compute environment
  - The capability of FLOW-3D to achieve scalable productivity
  - Considerations for performance optimizations

Test Cluster Configuration
- Dell PowerEdge R720xd 16-node (256-core) Jupiter cluster
  - Dual-socket eight-core Intel E5-2680 @ 2.70 GHz CPUs (Static Max Perf in BIOS)
  - Memory: 64GB, DDR3 1600 MHz
  - OS: RHEL 6.2, OFED 1.5.3 InfiniBand SW stack
  - Hard drives: 24x 250GB 7.2K RPM SATA 2.5" on RAID 0
- Intel Cluster Ready certified cluster
- Mellanox ConnectX-3 FDR InfiniBand VPI adapters
- Mellanox SwitchX SX6036 InfiniBand switch
- MPI (vendor provided): Intel MPI 3.2.0.011
- Application: FLOW-3D/MP 4.2
- Benchmarks:
  - Lid Driven Cavity Flow
  - P2 Engine Block

About Intel Cluster Ready
- Intel Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
  - Simplifies selection, deployment, and operation of a cluster
  - A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
  - Focus on your work productivity, spend less management time on the cluster
- Select Intel Cluster Ready
  - Where the cluster is delivered ready to run
  - Hardware and software are integrated and configured together
  - Applications are registered, validating execution on the Intel Cluster Ready architecture
  - Includes the Intel Cluster Checker tool, to verify functionality and periodically check cluster health

PowerEdge R720xd
Massive flexibility for data-intensive operations
- Performance and efficiency
  - Intelligent hardware-driven systems management with extensive power management features
  - Innovative tools including automation for parts replacement and lifecycle manageability
  - Broad choice of networking technologies from 1GbE to InfiniBand
  - Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
- Benefits
  - Designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
  - High-performance scale-out compute and low-cost dense storage in one package
- Hardware capabilities
  - Flexible compute platform with dense storage capacity
  - 2S/2U server, 6 PCIe slots
  - Large memory footprint (up to 768GB / 24 DIMMs)
  - High I/O performance and optional storage configurations
  - HDD options: 12 x 3.5" or 24 x 2.5", plus 2 x 2.5" HDDs in rear of server
  - Up to 26 HDDs, with 2 hot-plug drives in rear of server for boot or scratch

FLOW-3D Performance - Network
- Input dataset: Lid Driven Cavity
- InfiniBand FDR provides better scalability and performance than Ethernet
  - Up to 580% better performance than 1GbE at 4 nodes
  - Up to 26% better performance than 10GbE at 12 nodes
  - Up to 25% better performance than 40GbE at 12 nodes
(Chart: higher is better; 16 MPI processes per node)

FLOW-3D Performance - Processors
- Intel E5-2680 (Sandy Bridge) cluster outperforms prior generations
  - Performs 70% better than the X5670 cluster at 16 nodes
- System components used:
  - Jupiter: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR InfiniBand, 24 disks
  - Janus: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR InfiniBand, 1 disk
(Chart: higher is better; 16 MPI processes per node)

FLOW-3D Performance - File System
- Storing data files on a local file system or tmpfs improves performance
- Scalability is limited by NFS when running at scale, beyond 8 nodes
  - The NFS used in this case runs over the 1GbE network
(Chart: higher is better; InfiniBand FDR)

FLOW-3D Performance - MPI vs Hybrid
- FLOW-3D/MP supports an OpenMP/MPI hybrid mode
  - Up to 39% higher performance than pure MPI at 16 nodes (Lid Driven Cavity)
  - Up to 436% higher performance than pure MPI at 16 nodes (P2 Engine Block)
- The hybrid version enables higher scalability than the pure MPI version
  - The hybrid version delivers better scalability after 4 nodes
  - MPI processes spawn OpenMP threads for computation on the CPU cores
  - This streamlines and reduces communication endpoints to improve scalability, as sketched below
(Chart: higher is better; InfiniBand FDR)
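Below is a minimal sketch, in C, of the hybrid pattern described above: one MPI rank per socket or node spawns OpenMP threads for the compute loop, so only the ranks appear as MPI communication endpoints. This is not FLOW-3D source code; the loop body and sizes are placeholders.

```c
/* Minimal hybrid MPI+OpenMP sketch; the work loop is a placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request threaded MPI: a few ranks per node, OpenMP threads inside each */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Computation is spread over OpenMP threads inside each MPI rank,
     * so only nranks processes take part in MPI communication */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; i++)
        local_sum += (double)i / nranks;   /* placeholder work */

    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%e\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```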

FLOW-3D Performance - Hybrid (Socket/Node)
- 2 PPN with 8 threads each can provide better performance than 1 PPN with 16 threads
  - Pinning each MPI process to a socket causes its threads to spawn within that same socket
  - Achieved with -genv I_MPI_PIN_DOMAIN socket specified in the runhyd_par script
- The default for FLOW-3D/MP hybrid is to run 1 PPN with 16 threads
  - With -genv I_MPI_PIN_DOMAIN node specified in the runhyd_par script
  - The flag is changed to socket to allow spawning of threads within a socket for the 2 PPN / 8-thread case (see the sketch below)
(Chart: higher is better; InfiniBand FDR)
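The hypothetical check program below (plain C, assuming a Linux cluster, since sched_getcpu() is glibc-specific) can confirm where each rank's OpenMP threads land. With an Intel MPI launch along the lines of `mpirun -ppn 2 -genv I_MPI_PIN_DOMAIN socket -genv OMP_NUM_THREADS 8 ./check_pinning` (an example invocation, not the FLOW-3D run script), each rank's 8 threads should report CPUs belonging to a single socket.

```c
/* Hypothetical pinning check, not part of FLOW-3D. Linux-specific. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each OpenMP thread reports its logical CPU; with I_MPI_PIN_DOMAIN=socket
     * and 2 PPN x 8 threads, a rank's threads should stay on one socket. */
    #pragma omp parallel
    {
        printf("rank %d thread %d running on cpu %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```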

FLOW-3D Profiling - Runtime Breakdown
- The overall runtime decreases as more nodes take part in the MPI job
  - More compute nodes reduce the runtime by spreading out the workload
- Computation time drops while the communication time stays flat
  - As the cluster scales, MPI time stays roughly at the same level
(Chart: 16 MPI processes per node)

FLOW-3D Profiling - Number of MPI Calls
- The most used MPI calls are MPI_Isend and MPI_Irecv
  - MPI_Isend (31%), MPI_Irecv (31%), MPI_Waitall (17%), MPI_Allreduce (12%)
- The number of calls scales proportionally with the number of MPI processes (see the sketch below)
(Chart: 16 MPI processes per node)
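A schematic sketch of this call mix is shown below in plain MPI C (illustrative only, not FLOW-3D code): each rank posts non-blocking halo exchanges with its neighbours via MPI_Isend/MPI_Irecv, waits on them with MPI_Waitall, then performs a small MPI_Allreduce. The 1-D periodic decomposition and buffer sizes are assumptions made for the example.

```c
/* Illustrative halo-exchange + reduction pattern matching the profiled call mix. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024  /* halo size per face, placeholder */

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int left  = (rank - 1 + nranks) % nranks;   /* 1-D periodic neighbours */
    int right = (rank + 1) % nranks;

    double *send_l = calloc(N, sizeof(double)), *send_r = calloc(N, sizeof(double));
    double *recv_l = calloc(N, sizeof(double)), *recv_r = calloc(N, sizeof(double));
    MPI_Request req[4];

    /* Post receives and sends for both halos, then wait on all of them:
     * the MPI_Isend / MPI_Irecv / MPI_Waitall triple seen in the profile */
    MPI_Irecv(recv_l, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_r, N, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_r, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_l, N, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Small global reduction (a few bytes), e.g. a residual or time-step size */
    double local_res = 1.0, global_res;
    MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    free(send_l); free(send_r); free(recv_l); free(recv_r);
    MPI_Finalize();
    return 0;
}
```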

FLOW-3D Profiling - MPI Communication Time
- The MPI functions consuming the most time are:
  - MPI_Allreduce (34%)
  - MPI_Waitall (30%)
  - MPI_Bcast (25%)
- MPI_Allreduce time shrinks as the cluster size grows, while MPI_Bcast and MPI_Waitall time grows

FLOW-3D Profiling - MPI Message Sizes
- A wide range of message sizes is seen:
  - MPI_Allreduce: concentration between 4B and 16B
  - MPI_Waitall: around 4MB message sizes
  - MPI_Bcast: around 1MB
- Input dataset: Lid Driven Cavity Flow
(Chart: 16 MPI processes per node)

FLOW-3D Profiling - File I/O
- File I/O access occurs during certain periods of the MPI solver run
  - Large spikes appear when writing the restart and spatial data
- Files are directed to local storage instead of NFS to avoid an I/O bottleneck (see the sketch below)
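A hedged sketch of the idea in C: each MPI rank writes its restart-style output to a node-local path (here /tmp, an assumed scratch location) rather than to an NFS-mounted run directory, avoiding the write spikes over the 1GbE NFS link. The file name and data layout are placeholders, not the FLOW-3D output format.

```c
/* Sketch: write per-rank output to node-local scratch instead of NFS. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char path[256];
    /* Node-local scratch (assumed /tmp) avoids the 1GbE NFS bottleneck
     * during restart/spatial-data write spikes */
    snprintf(path, sizeof(path), "/tmp/restart_rank%04d.dat", rank);

    FILE *fp = fopen(path, "wb");
    if (fp) {
        double state[1024] = {0};            /* placeholder solver state */
        fwrite(state, sizeof(double), 1024, fp);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}
```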

FLOW-3D Summary
- Scalability
  - The FLOW-3D/MP hybrid version enables higher scalability than the pure MPI version
  - The hybrid version can deliver good scalability to 16 nodes
- Performance
  - Intel Xeon E5-2600 series and InfiniBand FDR enable FLOW-3D to scale to 16 nodes
  - The E5-2680 cluster outperforms the X5670 cluster by 70% at 16 nodes
  - The hybrid mode allows FLOW-3D to scale to 16 nodes, performing up to 39% better at 16 nodes
- Network
  - InfiniBand FDR, with its 56Gbps data rate, allows the best scalability and performance
  - Outperforms 1GbE by 580% at 4 nodes
  - Outperforms 10GbE by 26% at 12 nodes
  - Outperforms 40GbE by 25% at 12 nodes
- Profiling
  - The overall runtime decreases as more nodes take part in the MPI job
  - More compute nodes reduce the runtime by spreading out the workload
  - MPI communication time is spent mostly in MPI_Allreduce, at 34% of overall MPI time
  - Large concentration of small messages, typical for latency-sensitive HPC applications

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.