Scalability of ANSYS Applications on Multi-core and Floating-point Accelerator Processor Systems from Hewlett-Packard


Scalability of ANSYS Applications on Multi-core and Floating-point Accelerator Processor Systems from Hewlett-Packard. Don Mize, Technical Consultant, donald.mize@hp.com

Foreword
In this presentation we look at the SL230 Gen8 and SL250 Gen8 servers from Hewlett-Packard. Both servers contain two eight-core Intel Sandy Bridge processors, and the SL250 also supports NVIDIA GPUs for floating-point acceleration. The ANSYS applications tested were the CFD applications Fluent and CFX, along with the ANSYS Mechanical structural application. Note that at the time of testing only ANSYS Mechanical had broad GPU support, so it is the only application used in the comparisons with floating-point acceleration. The systems were running Red Hat Enterprise Linux release 6.2.

Detailed benchmark data
Contacts: Dave Field, ISS HPC Domain Engineering Manager, Dave.Field@hp.com; Don Mize, Application Engineer, Donald.Mize@hp.com; Jean-Luc Assor, Worldwide Segment Manager CAE & EDA, Jean-Luc.Assor@hp.com

Architecture

ProLiant Gen8 servers used for testing

SL230s Gen8:
- Processor: Intel Xeon E5-2600, 4/6/8 cores
- Memory: (16) DDR3, up to 1600MHz (512GB max)
- Storage: HP Smart Array B320i; 2 LFF HDD or 4 SFF HDD, plus a 2 hot-plug HDD option
- Networking: 2x 1GbE + FlexibleLOM
- Management: HP iLO Management Engine
- GPU support: none

SL250s Gen8:
- Processor: Intel Xeon E5-2600, 4/6/8 cores
- Memory: (16) DDR3, up to 1600MHz (256GB max)
- Storage: HP Smart Array B320i; 4 SFF hot-plug HDD, 4 SFF internal drive option
- Networking: 2x 1GbE + FlexibleLOM
- Management: HP iLO Management Engine
- GPU support: latest NVIDIA GPGPUs (M2075, M2070Q, M2090)

ANSYS CFD

Compute node used for the benchmark: SL230s Gen8; 16 Intel E5 cores @ 2.6GHz, 1333MHz memory, 64GB of memory, turbo off, InfiniBand FDR.

FLUENT

FLUENT

CFX

CFX

Observations with ANSYS CFD applications
The results used were from benchmarks that ran in physical memory; the system did not have to page. As long as the job was large enough, it scaled well and ran efficiently when fully loading the nodes with processes, for the following reasons:
- These applications are very well tailored for multiprocess parallelism using the Message Passing Interface (MPI).
- They are not high-bandwidth or I/O-heavy applications, so they scale up to the maximum number of cores per node.
A possible SRA (Solution Reference Architecture) for these applications is shown on the next two slides.
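The "large enough job" condition above can be sketched as a cells-per-rank check. The 50,000 cells-per-process floor used here is an illustrative assumption, not an ANSYS-published figure; real scaling limits depend on the solver, mesh, and interconnect.

```python
# Sketch: check whether a CFD job keeps enough work per MPI rank to scale.
# MIN_CELLS_PER_PROCESS is an assumed rule-of-thumb threshold for illustration.
MIN_CELLS_PER_PROCESS = 50_000

def scales_well(total_cells: int, cores: int) -> bool:
    """Return True if each MPI rank holds enough cells to stay efficient."""
    return total_cells // cores >= MIN_CELLS_PER_PROCESS

# A 50M-cell model on 8 fully loaded 16-core nodes (128 ranks):
print(scales_well(50_000_000, 128))  # each rank holds ~390k cells -> True
```

By this rough measure, the 50M-cell class of jobs mentioned in the SRA slides has ample work per core even on a fully loaded 32-node cluster.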

ANSYS CFD (Fluent/CFX): Entry to Midsize Cluster
Server options:
- 8-32 ProLiant SL230s Xeon nodes, each using 2 processors (16 cores per compute node)
- Two 300GB SAS drives per compute node
- Option: configure a head node with extra memory/storage for very large jobs, e.g. the partitioning step in CFX.
Total memory for the cluster:
- Compute nodes: 4 to 8 GB/core
- Optional head node: up to 8 GB/core
Cluster interconnect: integrated Gigabit Ethernet, or FDR InfiniBand 2:1 (recommended for jobs using 4 nodes and above)
Storage: an optional DL380p head node with up to 16 internal SAS drives
Operating environment: 64-bit Linux or Microsoft HPC Server 2008
Workloads: ideally suited for 2 simultaneous ANSYS CFD models of up to 500M cells (Fluent) or, depending on mesh, 100 to 500M nodes (CFX); or 11 to 15 simultaneous models on the scale of 50M cells (Fluent) or 10 to 50M nodes (CFX), again depending on mesh.
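The memory sizing above (4 to 8 GB/core on 16-core nodes) can be tallied with a short helper; the node counts come from the configuration, and the rest is straightforward arithmetic.

```python
def cluster_memory_gb(nodes: int, cores_per_node: int, gb_per_core: int) -> int:
    """Total compute-node memory for the cluster, in GB."""
    return nodes * cores_per_node * gb_per_core

# Entry configuration: 8 nodes at the low end, 32 nodes at the high end.
low = cluster_memory_gb(8, 16, 4)    # 512 GB total
high = cluster_memory_gb(32, 16, 8)  # 4096 GB total
print(low, high)
```

So the entry-to-midsize configuration spans roughly half a terabyte to 4 TB of aggregate compute-node memory, before any head-node memory is added.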

ANSYS CFD (Fluent/CFX): Large Scale-Out Cluster
Server options:
- 32-64 ProLiant BL460c nodes, each using 2 processors (16 cores per compute node)
- Two 300GB SAS drives per compute node
- 16 BL460c blades per c-Class c7000 enclosure
- Option: configure a head node with extra memory/storage for very large jobs, e.g. the partitioning step in CFX.
Total memory for the cluster:
- Compute nodes: 4 to 8 GB/core
- Optional head node: up to 8 GB/core
Cluster interconnect: integrated Gigabit Ethernet, or FDR InfiniBand 2:1 (recommended for jobs using 4 nodes and above)
Storage:
- Optional SB40c direct-attached storage blade on the head node (up to 6 SFF SAS drives)
- Optional HP P2000 G3 Storage Array System
Operating environment: 64-bit Linux or Microsoft HPC Server 2008
Workloads: ideally suited for 2 simultaneous ANSYS CFD models greater than 500M cells (Fluent), or greater than 100 to 500M nodes (CFX) depending on mesh; or more than 15 simultaneous models on the scale of 50M cells (Fluent) or 10 to 50M nodes (CFX), depending on mesh.

ANSYS MECHANICAL

Compute nodes used for the benchmark:
- SL230s Gen8: 16 Intel E5 cores @ 2.6GHz, 1333MHz memory, 64GB of memory, turbo off, InfiniBand FDR
- SL250s Gen8: 16 Intel E5 cores @ 2.6GHz, 1600MHz memory, 64GB of memory, turbo off, up to three M2090 GPUs for acceleration, InfiniBand FDR

Comparison of geometric means of select benchmarks with and without GPU acceleration (bigger is better). [Chart: solver ratings (0-700) versus process count (1p through 16p), for runs with no GPU and with 1, 2, or 3 GPUs. System: SL250 Gen8, 2.6GHz, 1600MHz memory, 64GB of memory, up to three M2090 GPUs, turbo off.]
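The chart reports geometric means of solver ratings, which is the appropriate average for ratio-like benchmark scores. A minimal sketch of the calculation follows; the rating values below are made up for illustration and are not the measured SL250 data.

```python
from statistics import geometric_mean  # Python 3.8+

# Hypothetical solver ratings (higher is better) for three benchmarks,
# not the measured data from the chart.
ratings_cpu_only = [210.0, 340.0, 180.0]
ratings_one_gpu = [290.0, 460.0, 250.0]

# Overall GPU speedup as a ratio of geometric means.
speedup = geometric_mean(ratings_one_gpu) / geometric_mean(ratings_cpu_only)
print(f"GPU speedup on geometric mean: {speedup:.2f}x")
```

Using the geometric mean keeps one unusually fast or slow benchmark from dominating the summary, which an arithmetic mean of ratings would not.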

Observations with ANSYS Mechanical
On all nodes, enough memory was available to run the benchmarks in core. The application ran more efficiently when not using all the cores in a node, for the following reasons:
- This is a high-bandwidth application. It stresses the memory subsystem, especially when many processes are running, with possible data contention and communication between processes.
- Each process does its own file I/O. Many processes stress the file systems on the nodes running the job. RAID 0 striped file systems were used for scratch I/O.
- Processor clock speed increases did not help with multiprocess runs.
A possible SRA (Solution Reference Architecture) for this application is shown on the next slide.

ANSYS Mechanical (Structural Analysis): Fat Node Cluster
Server options:
- 4-8 ProLiant DL380p Xeon server nodes, each using 2 processors (16 cores) and 2 to 16 internal 600GB SAS 15K drives or 800GB SAS SSDs, striped RAID 0, per compute node, plus a 6x 2TB SAS RAID 0 disk array on the head node
- Optional SL250s Xeon server nodes, each using 2 processors (16 cores), 3 NVIDIA Tesla M2090 6GB GPUs (one per ANSYS job), and 2 internal 300GB SAS 15K drives or 800GB SAS SSDs per compute node (suitable for nonlinear jobs >= 2M DOF)
- Optional blade workstation with HP RGS for pre/post processing
- Optional head node with extra memory/storage for very large jobs
Total memory for the cluster:
- Head node: 8GB/core, or up to 128GB total
- Each remaining compute node: 4 to 8 GB/core (64 or 128GB per node)
Cluster interconnect: FDR InfiniBand 2:1
Storage: optional Lustre/DDN cluster file system
Operating environment: 64-bit Linux or Microsoft HPC Server 2008
Workloads: 256-1024GB RAM configurations will handle up to 6 simultaneously running ANSYS mega-models of 45-180M DOFs.

Conclusion
This summary of ANSYS server applications on HP ProLiant servers using Intel E5-26xx Sandy Bridge processors shows that now, as in the past, as the number of cores per processor increases, so does application performance. The performance of memory and network components has improved to get the most out of these processors. However, there are still considerations when running ANSYS applications in parallel. Fluent and CFX are both highly scalable, provided there is enough work for parallelization. With ANSYS Mechanical the situation is different because of the application's demands on the memory and filesystem components; on the other hand, ANSYS Mechanical has GPU support, which can increase the speed of the application in certain situations.

Conclusion
The hardware configurations used in the analysis for this paper were designed by HP for HPC. The servers are configured with high-performing Intel Sandy Bridge processors, fast memory DIMMs, and high-performance disk drives. Other HP two-processor server models with similar processors, memory, networks, and disk subsystems, namely the ProLiant BL and DL models, will perform in a similar way. However, there are variations among these machines that might make them favor one HPC application over another. To find out more, please contact your HP sales representative!

ANSYS Performance management in Production with HP CMU

CMU: What Can You Profile?

Use Case #1: Too Much Memory! We attempt to run a large 16-process job on one node. The job takes 538 seconds to finish, which we believe is too long. We use CMU to check CPU, memory, and disk usage and discover the job is using swap. We decide to run the job over two nodes to spread out the memory footprint, and now the job finishes in 328 seconds. We verify with CMU that there is no swapping.
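The timings in this use case translate directly into a speedup and a parallel-efficiency figure: 538 s on one node versus 328 s on two.

```python
def speedup_and_efficiency(t_old: float, t_new: float, node_factor: int):
    """Speedup from a runtime improvement, and efficiency relative to
    ideal (linear) scaling over node_factor times the nodes."""
    s = t_old / t_new
    return s, s / node_factor

s, eff = speedup_and_efficiency(538, 328, 2)
print(f"{s:.2f}x speedup, {eff:.0%} of linear scaling")  # ~1.64x, ~82%
```

An 82% efficiency from doubling the node count is respectable here, and the real win is qualitative: the job stopped paging to swap once its footprint was split across two nodes' memory.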

Use Case #1 : Too Much Memory!

Use Case #1 : Resolved!

Single Node Details

GPU Load

Use Case #2: Too Many Processes! An ANSYS job on a two-node cluster is not performing as expected. Using CMU to look at CPU and memory usage on both nodes, we notice that one node is unusually loaded: a possible job placement or memory issue. The hardware is OK and memory is identical on the two nodes, so we check job placement and discover that node one was being packed with processes before node two was used. We configure ANSYS to do round-robin job placement, and performance improves. The load now looks fine on both nodes, as verified by CMU.
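The difference between packed and round-robin placement can be sketched as two ways of building an MPI rank-to-host list; the hostnames here are placeholders, and this is an illustration of the placement policies, not the ANSYS configuration syntax.

```python
def packed(nodes, ranks_per_node, total_ranks):
    """Fill each node completely before moving to the next."""
    order = [n for n in nodes for _ in range(ranks_per_node)]
    return order[:total_ranks]

def round_robin(nodes, total_ranks):
    """Alternate ranks across nodes so the load stays balanced."""
    return [nodes[i % len(nodes)] for i in range(total_ranks)]

nodes = ["node1", "node2"]  # placeholder hostnames
print(packed(nodes, 16, 4))   # ['node1', 'node1', 'node1', 'node1']
print(round_robin(nodes, 4))  # ['node1', 'node2', 'node1', 'node2']
```

With packed placement, a 4-rank job lands entirely on node one while node two idles, which is exactly the imbalance CMU exposed; round-robin spreads the same ranks evenly.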

Use Case #2 : Too Many Processes!

Use Case #2 : Resolved!

ANSYS Fluent on 16 Nodes - CMU View of CPU Usage

Colplot of a 2-node Fluent job. Using the collectl recording function of CMU, the output files can be read into a browser and these charts can be generated. This shows the results from a 2-node Fluent job. Each column corresponds to a node; in this instance, three metrics were plotted: CPU utilization, InfiniBand interconnect activity, and memory utilization.

Use Case #3: Process Affinity. The picture on the left shows the CPU load on the cluster with Fluent's process-affinity feature disabled. The picture on the right shows the CPU load with process affinity enabled.

Thank you