ANSYS FLUENT Performance Benchmark and Profiling. May 2009


Note
- The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, ANSYS, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
- The participating members would like to thank ANSYS for their support and guidance
- For more information, please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com, www.ansys.com

ANSYS FLUENT
- Computational Fluid Dynamics (CFD) is a computational technology
  - Enables the study of the dynamics of things that flow
  - Generates numerical solutions to a system of partial differential equations that describe fluid flow
  - Enables better understanding of qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design
- CFD brings together a number of different disciplines
  - Fluid dynamics, the mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science
- ANSYS FLUENT is a leading CFD application from ANSYS
  - Widely used in almost every industry sector and manufactured product
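For many FLUENT cases, the system of partial differential equations mentioned above is some form of the Navier-Stokes equations. As a purely illustrative example (not the specific model set used in these benchmarks), the incompressible form is:

$$\nabla \cdot \mathbf{u} = 0, \qquad \rho\left(\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}\right) = -\nabla p + \mu\,\nabla^{2}\mathbf{u} + \mathbf{f}$$

where u is the velocity field, p the pressure, ρ the density, μ the dynamic viscosity, and f any body force. FLUENT discretizes such equations over a mesh (417 thousand to 14 million elements in the datasets used below) and iterates toward a numerical solution, with the mesh partitioned across the cluster via MPI.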

Objectives
- The presented research was done to provide best practices for:
  - ANSYS FLUENT performance benchmarking
  - Interconnect performance comparisons
  - Performance enhancements of the latest FLUENT release
  - Ways to increase FLUENT productivity
  - Understanding FLUENT communication patterns

Test Cluster Configuration
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron 2382 ("Shanghai") CPUs
- Mellanox InfiniBand ConnectX 20Gb/s (DDR) HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 800MHz per node
- OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
- MPI: HP-MPI 2.3
- Application: FLUENT 6.3.37, FLUENT 12.0
- Benchmark workload: New FLUENT Benchmark Suite

Mellanox InfiniBand Solutions
- Industry standard
  - Hardware, software, cabling, management
  - Designed for clustering and storage interconnect
- Performance
  - 40Gb/s node-to-node
  - 120Gb/s switch-to-switch
  - 1us application latency
  - Most aggressive roadmap in the industry
- Reliable, with congestion management
- Efficient RDMA and transport offload
  - Kernel bypass; CPU focuses on application processing
- Scalable for Petascale computing and beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Chart: "The InfiniBand Performance Gap is Increasing" — InfiniBand bandwidth roadmap (4X and 12X links) vs. Fibre Channel and Ethernet; InfiniBand delivers the lowest latency]

Quad-Core AMD Opteron Processor
- Performance
  - Quad-core with enhanced CPU IPC
  - 4x 512KB L2 cache, 6MB L3 cache
- Direct Connect Architecture
  - HyperTransport technology: up to 24 GB/s peak per processor
- Floating point
  - 128-bit FPU per core, 4 FLOPS/clk peak per core
- Integrated memory controller
  - Up to 12.8 GB/s; DDR2-800 MHz or DDR2-667 MHz
- Scalability
  - 48-bit physical addressing
- Compatibility
  - Same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Diagram: Direct Connect Architecture — processors connected by 8 GB/s HyperTransport links to PCI-E bridges, an I/O hub (USB, PCI), and dual-channel registered DDR2 memory]
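For a sense of scale, the peak floating-point throughput implied by these figures is cores × FLOPS/clk × clock. Assuming the 2.6 GHz clock of the Opteron 2382 used in the test cluster (the clock speed is an assumption here, not stated on the slide):

$$4\;\text{cores} \times 4\;\tfrac{\text{FLOPS}}{\text{clk}} \times 2.6\;\text{GHz} \approx 41.6\;\text{GFLOPS per processor}$$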

Dell PowerEdge Servers Helping Simplify IT
- System structure and sizing guidelines
  - 24-node cluster built with Dell PowerEdge SC 1435 servers
  - Servers optimized for High Performance Computing environments
  - Building-block foundations for best price/performance and performance/watt
- Dell HPC solutions
  - Scalable architectures for high performance and productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, tested and validated architectures
- Workload modeling
  - Optimized system size, configuration and workloads
  - Test-bed benchmarks
  - ISV applications characterization
  - Best practices and usage analysis

FLUENT Benchmark Results - Input Dataset EDDY_417K
- Reacting flow with the eddy dissipation model
- FLUENT 12 provides better performance and scalability
- Utilizing InfiniBand DDR delivers the highest performance and scalability
[Chart: FLUENT Benchmark Result (Eddy_417K) — Rating vs. number of nodes (1 to 24), higher is better; FLUENT 6.3.37 vs FLUENT 12, both over InfiniBand DDR]
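For context, the FLUENT benchmark "Rating" shown in these charts is conventionally the number of benchmark jobs that could be completed in a 24-hour day, i.e. 86,400 divided by the wall-clock time of one run. A minimal sketch of that conversion (the timings below are illustrative placeholders, not measurements from this study):

```python
# Convert FLUENT benchmark wall-clock times into the "Rating" metric
# (jobs per day). The timings are made-up placeholders, not results
# from the benchmarks in this report.
SECONDS_PER_DAY = 86400.0

def rating(wall_clock_seconds: float) -> float:
    """Number of benchmark jobs that fit into a 24-hour day."""
    return SECONDS_PER_DAY / wall_clock_seconds

def gain(rating_new: float, rating_old: float) -> float:
    """Relative improvement, e.g. 0.17 means 17% higher rating."""
    return rating_new / rating_old - 1.0

if __name__ == "__main__":
    t_old, t_new = 300.0, 250.0          # hypothetical per-job run times (s)
    r_old, r_new = rating(t_old), rating(t_new)
    print(f"Rating old: {r_old:.0f}, new: {r_new:.0f}, gain: {gain(r_new, r_old):.0%}")
```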

FLUENT Benchmark Results - Input Dataset Aircraft_2M
- External flow over an aircraft wing
- FLUENT 12 provides a performance and scalability increase
  - Up to 17% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Aircraft_2M) — Rating vs. number of nodes (1 to 24), higher is better; FLUENT 6.3.37 vs FLUENT 12, both over InfiniBand DDR]

FLUENT Benchmark Results - Input Dataset Truck_14M
- External flow over a truck body
- FLUENT 12 delivers higher performance and scalability for any cluster size
  - Up to 8% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_14M) — Rating vs. number of nodes (1 to 24), higher is better; FLUENT 6.3.37 vs FLUENT 12, both over InfiniBand DDR]

FLUENT Benchmark Results - Input Dataset Truck_Poly_14M
- External flow over a truck body with a polyhedral mesh
- FLUENT 12 delivers higher performance and scalability for any cluster size
  - Up to 67% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_poly_14M) — Rating vs. number of nodes (1 to 24), higher is better; FLUENT 6.3.37 vs FLUENT 12, both over InfiniBand DDR]

FLUENT 12 Benchmark Results - Interconnect, Input Dataset EDDY_417K (417 thousand elements)
- Reacting flow with the eddy dissipation model
- InfiniBand DDR delivers higher performance and scalability for any cluster size
  - Up to 192% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Eddy_417K) — Rating vs. number of nodes (1 to 24), higher is better; InfiniBand DDR vs Ethernet]

FLUENT 12 Benchmark Results - Interconnect, Input Dataset Aircraft_2M (2 million elements)
- External flow over an aircraft wing
- InfiniBand DDR delivers higher performance and scalability for any cluster size
  - Up to 99% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Aircraft_2M) — Rating vs. number of nodes (1 to 24), higher is better; InfiniBand DDR vs Ethernet]

FLUENT 12 Benchmark Results - Interconnect, Input Dataset Truck_14M (14 million elements)
- External flow over a truck body
- InfiniBand DDR delivers higher performance and scalability
  - Up to 36% higher performance versus Ethernet (GigE)
- For bigger cases (more elements), the CPU is the bottleneck at larger node counts
  - More server nodes (or cores) are required for increased parallelism and interconnect dependency
[Chart: FLUENT 12.0 Benchmark Result (Truck_14M) — Rating vs. number of nodes (1 to 24), higher is better; InfiniBand DDR vs Ethernet]

Enhancing FLUENT Productivity
- Test cases
  - A single job over the entire system
  - Two jobs, each running on four cores per server
- Running multiple jobs simultaneously improves FLUENT productivity (see the sketch below)
  - Up to 90% more jobs per day for Eddy_417K
  - Up to 30% more jobs per day for Aircraft_2M
  - Up to 30% more jobs per day for Truck_14M
- The bigger the number of elements, the higher the node count required for increased productivity
  - The CPU is the bottleneck for larger numbers of servers
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs 1 job) — productivity gain (%) vs. number of nodes (4 to 24) over InfiniBand DDR, for Eddy_417K, Aircraft_2M and Truck_14M; higher is better]
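A minimal sketch of the jobs-per-day comparison behind this slide, assuming the two half-node jobs run concurrently and each is timed separately (the run times are hypothetical placeholders, not measurements from this report):

```python
# Compare daily throughput of one job spanning all cores of each node
# versus two jobs that each use half the cores per node.
SECONDS_PER_DAY = 86400.0

def jobs_per_day(seconds_per_job: float, concurrent_jobs: int = 1) -> float:
    """Total jobs finished per day when `concurrent_jobs` run side by side."""
    return concurrent_jobs * SECONDS_PER_DAY / seconds_per_job

single_job = jobs_per_day(1200.0)                     # 1 job using 8 cores/node
two_jobs = jobs_per_day(1800.0, concurrent_jobs=2)    # 2 jobs, 4 cores/node each
print(f"Productivity gain from 2 jobs in parallel: {two_jobs / single_job - 1.0:.0%}")
```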

Power Cost Savings with Different Interconnects
- InfiniBand saves up to $8,000 in power to finish the same number of FLUENT jobs compared to GigE
  - Yearly basis for a 24-node cluster
- As cluster size increases, more power can be saved
[Chart: Power Cost Savings (InfiniBand vs GigE) — power savings per year ($) for Eddy_417K, Aircraft_2M, Truck_14M]
- Power cost assumed at $0.20 per kWh; for more information see http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
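A rough sketch of how such a figure can be derived, using the $0.20/kWh rate cited on the slide; the per-node wattage and yearly run hours below are illustrative assumptions, not values from the study:

```python
# Estimate yearly power-cost savings when a faster interconnect finishes
# the same yearly workload in fewer compute hours. All inputs below are
# illustrative assumptions, not data from this report.
NODES = 24
WATTS_PER_NODE = 300.0          # assumed average draw per server
PRICE_PER_KWH = 0.20            # $/kWh, as cited on the slide

def yearly_power_cost(cluster_hours_per_year: float) -> float:
    kwh = NODES * WATTS_PER_NODE * cluster_hours_per_year / 1000.0
    return kwh * PRICE_PER_KWH

hours_gige = 8000.0             # hypothetical hours to finish the yearly job mix
hours_ib = 5500.0               # same job mix finished faster over InfiniBand
savings = yearly_power_cost(hours_gige) - yearly_power_cost(hours_ib)
print(f"Estimated yearly power-cost savings: ${savings:,.0f}")
```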

Power Cost Savings with FLUENT Upgrade
- FLUENT 12 saves up to ~$6,000 in power to finish the same number of FLUENT jobs compared to FLUENT 6.3.37
  - Yearly basis for a 24-node cluster
- As cluster size increases, more power can be saved
[Chart: Power Cost Savings (FLUENT 12 vs FLUENT 6.3.37) — power savings per year ($) for Eddy_417K, Aircraft_2M, Truck_14M]
- Power cost assumed at $0.20 per kWh; for more information see http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf

FLUENT Productivity Results Summary
- FLUENT 12 shows tremendous performance improvement over version 6.3.37
  - Optimizations made for higher performance and scalability
  - Optimizations included for AMD Opteron processor technology
  - AMD contributed to the development and QA stages of FLUENT 12
  - Performance results on AMD technology established the baseline performance
- InfiniBand enables higher performance and scalability than Ethernet
  - The performance advantage extends as cluster size increases
- Efficient job placement can increase FLUENT productivity significantly
- The interconnect comparison shows InfiniBand delivers superior performance at every cluster size
  - Low-latency InfiniBand enables unparalleled scalability
- InfiniBand enables up to $8,000/year power savings compared to GigE
- FLUENT 12 reduces yearly power costs by up to $6,000 compared to FLUENT 6.3.37

FLUENT 6.3.37 Profiling - MPI Functions
- Most frequently used MPI functions: MPI_Send, MPI_Recv, MPI_Reduce, and MPI_Bcast
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, MPI_Irecv) for 4, 8, and 16 nodes]

FLUENT 6.3.37 Profiling - Timing
- MPI_Recv shows the highest communication overhead
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s) per MPI function for 4, 8, and 16 nodes]

FLUENT 6.3.37 Profiling - Messages Transferred
- Most data-related MPI messages are within 256B-1KB in size
- Typical MPI synchronization messages are smaller than 64B
- The number of messages increases with cluster size
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands) per message-size bucket, from [0..64B] up to [4M..2GB], for 4, 8, and 16 nodes]
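As an illustration of how a message-size histogram like this can be produced from a raw MPI trace, here is a small sketch using the same bucket boundaries shown in the charts; the sample message sizes are hypothetical, and a real run would read them from a profiler log:

```python
# Bucket MPI message sizes into the histogram bins used in the
# profiling charts. Sample sizes below are invented placeholders.
from bisect import bisect_right
from collections import Counter

BOUNDS = [64, 256, 1024, 4096, 16384, 65536, 262144, 1 << 20, 4 << 20]
LABELS = ["[0..64B]", "[64..256B]", "[256B..1KB]", "[1..4KB]", "[4..16KB]",
          "[16..64KB]", "[64..256KB]", "[256KB..1M]", "[1..4M]", "[4M..2GB]"]

def bucket(size_bytes: int) -> str:
    """Map a message size in bytes to its histogram bucket label."""
    return LABELS[bisect_right(BOUNDS, size_bytes)]

sample_sizes = [48, 700, 512, 3000, 80, 900, 2 << 20]   # hypothetical messages
histogram = Counter(bucket(s) for s in sample_sizes)
for label in LABELS:
    print(f"{label:>12}: {histogram.get(label, 0)}")
```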

FLUENT 12 Profiling - MPI Functions
- Most frequently used MPI functions: MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, MPI_Irecv) for 4, 8, and 16 nodes]

FLUENT 12 Profiling - Timing
- MPI_Recv and MPI_Allreduce show the highest communication overhead
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s) per MPI function for 4, 8, and 16 nodes]

FLUENT 12 Profiling - Messages Transferred
- Most data-related MPI messages are within 256B-1KB in size
- Typical MPI synchronization messages are smaller than 64B
- The number of messages increases with cluster size
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands) per message-size bucket, from [0..64B] up to [4M..2GB], for 4, 8, and 16 nodes]

FLUENT Profiling - Messages Transferred
- FLUENT 12 reduces the total number of messages
- Further optimization can be made to take greater advantage of high-speed, low-latency interconnects
[Chart: FLUENT Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands) per message-size bucket; FLUENT 6.3.37 vs FLUENT 12]

FLUENT Profiling Summary
- FLUENT 12 and FLUENT 6.3.37 were profiled to identify their communication patterns
- Frequently used message sizes
  - 256B-1KB messages for data-related communications
  - <64B messages for synchronization
  - The number of messages increases with cluster size
- MPI functions
  - FLUENT 12 introduced MPI collective functions; MPI_Allreduce helps improve communication efficiency
- Interconnect effects on FLUENT performance
  - Both interconnect latency (MPI_Allreduce) and throughput (MPI_Recv) strongly influence FLUENT performance (see the model sketched below)
  - Further optimization can be made to take greater advantage of high-speed networks
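One common way to reason about the latency vs. throughput sensitivity noted above is a simple per-message cost model (an illustrative model, not one used in the original study), in which sending a message of m bytes costs

$$T(m) = \alpha + \frac{m}{\beta}$$

where α is the per-message latency and β the link bandwidth. Small, frequent messages (such as the synchronization traffic under 64B and small MPI_Allreduce operations) are dominated by the α term, while the 256B-1KB and larger MPI_Recv traffic becomes increasingly sensitive to β, so a lower-latency, higher-bandwidth interconnect helps in both regimes.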

Thank You - HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.