Scalability of ANSYS Applications on Multi-core and Floating Point Accelerator Processor Systems from Hewlett-Packard Don Mize Technical Consultant donald.mize@hp.com
Foreword This presentation looks at the SL230 Gen8 and SL250 Gen8 servers from Hewlett-Packard. Both servers contain two eight-core Intel Sandy Bridge processors, and the SL250 also supports NVIDIA GPUs for floating point acceleration. Three ANSYS applications were tested: the CFD applications Fluent and CFX, along with the ANSYS Mechanical structural application. Note that at the time of testing only ANSYS Mechanical had broad GPU support, so it is the only one used in the comparisons with floating point acceleration. The systems used were running Red Hat Enterprise Linux release 6.2.
Detailed benchmark data Contacts: Dave Field, ISS HPC Domain Engineering Manager, Dave.Field@hp.com; Don Mize, Application Engineer, Donald.Mize@hp.com; Jean-Luc Assor, Worldwide Segment Manager CAE & EDA, Jean-Luc.Assor@hp.com
Architecture
ProLiant Gen8 servers used for testing
SL230s Gen8: Processor: Intel Xeon E5-2600, 4/6/8 cores. Memory: (16) DDR3, up to 1600MHz (512GB max). Storage: HP Smart Array B320i; 2 LFF HDD or 4 SFF HDD, plus 2 hot-plug HDD option. Networking: 2x 1GbE + FlexibleLOM. Management: HP iLO Management Engine. GPU support: none.
SL250s Gen8: Processor: Intel Xeon E5-2600, 4/6/8 cores. Memory: (16) DDR3, up to 1600MHz (256GB max). Storage: HP Smart Array B320i; 4 SFF hot-plug HDD, 4 SFF internal drive option. Networking: 2x 1GbE + FlexibleLOM. Management: HP iLO Management Engine. GPU support: latest NVIDIA GPGPUs (M2075, M2070Q, M2090).
ANSYS CFD
Compute node used for the benchmark SL230s Gen8; 16 x Intel E5 cores @2.6GHz, 1333MHz memory, 64GB of memory, turbo off, InfiniBand FDR
FLUENT
FLUENT
CFX
CFX
Observations with ANSYS CFD applications Results used were from benchmarks that ran in physical memory; the systems didn't have to page. As long as the job was large enough, it scaled well and ran efficiently when fully loading the nodes with processes, for the following reasons: These applications are very well tailored for multiprocess parallelism using a Message Passing Interface (MPI). They aren't high-bandwidth or I/O-heavy applications, so they scale up to the maximum number of cores per node. A possible SRA (Solution Reference Architecture) for these applications is shown on the next two slides.
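A quick way to sanity-check MPI scaling like this is to compute speedup and parallel efficiency from wall-clock timings. A minimal sketch; the timing numbers below are illustrative placeholders, not measurements from the benchmarks above:

```python
# Estimate speedup and parallel efficiency from wall-clock timings.
# The timings here are illustrative placeholders, not measured data.
timings = {16: 1000.0, 32: 520.0, 64: 280.0, 128: 160.0}  # cores -> seconds

base_cores = min(timings)          # smallest run is the scaling baseline
base_time = timings[base_cores]

for cores in sorted(timings):
    speedup = base_time / timings[cores]   # relative to the baseline run
    ideal = cores / base_cores             # linear-scaling reference
    efficiency = speedup / ideal           # 1.0 = perfect scaling
    print(f"{cores:4d} cores  speedup {speedup:5.2f}  efficiency {efficiency:5.1%}")
```

An efficiency that stays near 1.0 as cores are added is what "scales well when fully loading the nodes" means in practice.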
ANSYS CFD (Fluent/CFX): Entry-Midsize Cluster Server options: 8-32 ProLiant SL230s Xeon nodes, each using 2 processors (16 cores per compute node); two 300GB SAS drives per compute node. Option: configure a head node with extra memory/storage for very large jobs, e.g. the partitioning step in CFX. Total memory for the cluster: compute nodes 4 to 8 GB/core; optional head node up to 8GB/core. Cluster interconnect: integrated Gigabit Ethernet, or FDR InfiniBand 2:1 (recommended for jobs using 4 nodes and above). Storage: an optional DL380p head node with up to 16 internal SAS drives. Operating environment: 64-bit Linux or Microsoft HPC Server 2008. Workloads: ideally suited for 2 simultaneous ANSYS CFD models up to 500M cells (Fluent) or, depending on mesh, 100 to 500M nodes (CFX); or 11 to 15 simultaneous ANSYS CFD models on the scale of 50M cells (Fluent) or 10 to 50M nodes (CFX), again depending on mesh.
ANSYS CFD (Fluent/CFX): Large Scale-Out Cluster Server options: 32-64 ProLiant BL460c nodes, each using 2 processors (16 cores per compute node); two 300GB SAS drives per compute node; 16 BL460c blades per c-Class c7000 enclosure. Option: configure a head node with extra memory/storage for very large jobs, e.g. the partitioning step in CFX. Total memory for the cluster: compute nodes 4 to 8 GB/core; optional head node up to 8 GB/core. Cluster interconnect: integrated Gigabit Ethernet, or FDR InfiniBand 2:1 (recommended for jobs using 4 nodes and above). Storage: optional direct-attached SB40c storage blade on the head node (up to 6 SFF SAS drives); optional HP P2000 G3 Storage Array System. Operating environment: 64-bit Linux or Microsoft HPC Server 2008. Workloads: ideally suited for 2 simultaneous ANSYS CFD models greater than 500M cells (Fluent), or greater than 100 to 500M nodes (CFX) depending on mesh; or more than 15 simultaneous ANSYS CFD models on the scale of 50M cells (Fluent) or 10 to 50M nodes (CFX), depending on mesh.
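The GB-per-core guidelines in the two SRA slides above can be turned into a rough sizing check. A sketch only: the per-cell memory cost is an assumed ballpark figure, not an ANSYS-published number, and real requirements vary with solver settings and mesh:

```python
import math

# Rough cluster-memory sizing for a CFD model.
# BYTES_PER_CELL is an assumed ballpark, not an ANSYS-published figure.
BYTES_PER_CELL = 1024

def min_nodes(cells, mem_per_node_gb=128, headroom=2.0):
    """Nodes needed to hold the model in memory with a safety factor."""
    needed_gb = cells * BYTES_PER_CELL * headroom / 2**30
    return max(1, math.ceil(needed_gb / mem_per_node_gb))

# A 500M-cell model on 128GB nodes:
print(min_nodes(500_000_000), "nodes")
```

The point of the headroom factor is the same as the optional fat head node: steps like CFX partitioning can need far more memory than the steady-state solve.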
ANSYS MECHANICAL
Compute nodes used for the benchmark SL230s Gen8; 16 x Intel E5 cores @2.6GHz, 1333MHz memory, 64GB of memory, turbo off, InfiniBand FDR SL250s Gen8; 16 x Intel E5 cores @2.6GHz, 1600MHz memory, 64GB of memory, turbo off, up to three M2090 GPUs for acceleration, InfiniBand FDR
Figure: comparison of geometric means of solver ratings for select benchmarks with and without GPU acceleration (bigger is better). SL250 Gen8, 2.6GHz, 1600MHz memory, 64GB of memory, turbo off; ratings plotted for 1p through 16p processes without a GPU and for one, two, and three M2090 GPUs.
Observations with ANSYS Mechanical On all nodes, enough memory was installed to run the benchmarks in-core. The application ran more efficiently when not using all the cores in the node, for the following reasons: This is a high-bandwidth application. It stresses the memory subsystem, especially when many processes are running, with possible data contention and communication between processes. Each separate process does its own file I/O, so many processes stress the file systems on the various nodes running a particular job. RAID 0 striped file systems were used for scratch I/O. Processor clock speed increases didn't help with multiprocess runs. A possible SRA (Solution Reference Architecture) for this application is shown on the next slide.
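The bandwidth observation can be made concrete with a toy roofline-style model: once the aggregate demand of the processes exceeds what the node's memory subsystem can sustain, adding processes stops helping. The bandwidth figures below are assumptions chosen for illustration, not measurements of these servers:

```python
# Toy model: per-node throughput when processes share memory bandwidth.
# NODE_BW and PER_PROC_DEMAND are illustrative assumptions, not measurements.
NODE_BW = 50.0          # GB/s the memory subsystem can sustain (assumed)
PER_PROC_DEMAND = 8.0   # GB/s one solver process wants (assumed)

def throughput(procs):
    """Useful work grows with process count until bandwidth saturates."""
    return min(procs * PER_PROC_DEMAND, NODE_BW)

for p in (1, 2, 4, 6, 8, 16):
    print(f"{p:2d} procs -> {throughput(p):5.1f} GB/s of a {NODE_BW} GB/s ceiling")
```

In this toy model the plateau arrives well before all 16 cores are busy, which mirrors the observation that ANSYS Mechanical ran more efficiently without fully loading the node, and why clock-speed increases alone didn't help.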
ANSYS Mechanical (Structural Analysis): Fat Node Cluster Server options: 4-8 ProLiant DL380p Xeon server nodes, each using 2 processors (16 cores) and 2 to 16 internal 600GB SAS 15K drives or 800GB SAS SSDs, striped RAID 0, per compute node, plus a 6x 2TB SAS RAID 0 disk array on the head node. Optional SL250s Xeon server nodes, each using 2 processors (16 cores), 3 NVIDIA Tesla M2090 6GB GPUs (one per ANSYS job), and 2 internal 300GB SAS 15K drives or 800GB SAS SSDs per compute node (suitable for nonlinear jobs >= 2M DOF). Optional blade workstation with HP RGS for pre/post processing. Optional head node with extra memory/storage for very large jobs. Total memory for the cluster: 8GB/core, or up to 128GB total, on the head node; 4 to 8 GB/core (64 or 128GB/node) on each remaining compute node. Cluster interconnect: FDR InfiniBand 2:1. Storage: optional Lustre/DDN cluster file system. Operating environment: 64-bit Linux or Microsoft HPC Server 2008. Workloads: 256-1024GB RAM configurations will handle up to 6 simultaneously running ANSYS mega-models of 45-180M DOFs.
Conclusion This summary of ANSYS server applications on HP ProLiant servers using Intel E5-26XX Sandy Bridge processors shows that now, as in the past, as the number of cores on the processors increases, so does application performance. The performance of memory and network components has improved to maximize the performance of these processors. However, there are still considerations to be taken into account when running ANSYS applications in parallel. Fluent and CFX are both highly scalable, provided there is enough work for parallelization. With ANSYS Mechanical the situation is different because of the application's demands on the memory and filesystem components; ANSYS Mechanical also has GPU support, which can increase the speed of the application in certain situations.
Conclusion The hardware configurations used in the analysis for this paper were designed by HP for HPC. The servers are configured with high-performing Intel Sandy Bridge processors, fast memory DIMMs, and high-performance disk drives. Other HP two-processor server models with similar processors, memory, networks, and disk subsystems, such as the ProLiant BL and DL models, will perform in a similar way. However, there are variations in these machines that might make them favor one HPC application over another. To find out more, please contact your HP sales representative!
ANSYS Performance management in Production with HP CMU
CMU: What Can You Profile?
Use Case #1: Too Much Memory! We attempt to run a large 16-process job on one node. The job takes 538 seconds to finish, which we believe is too long. We use CMU to check CPU, memory, and disk usage and discover the job is using swap. We decide to run the job over two nodes to spread out the memory footprint, and now the job finishes in 328 seconds. We verify with CMU that there is no swapping.
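Outside of CMU, the same swap check can be scripted on any Linux node by parsing /proc/meminfo. A minimal sketch, shown here against a sample string so it is self-contained; on a real node you would read the file instead:

```python
# Detect swap usage by parsing /proc/meminfo-style text (Linux format).
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from the SwapTotal and SwapFree lines."""
    vals = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            vals[key] = int(rest.split()[0])  # values are reported in kB
    return vals["SwapTotal"] - vals["SwapFree"]

# On a real node, pass open("/proc/meminfo").read() instead of this sample.
sample = "SwapTotal:     8388604 kB\nSwapFree:      6291452 kB\n"
used = swap_used_kb(sample)
print(f"swap in use: {used} kB")  # nonzero while a job is paging
```

A nonzero and growing value during a solve is the same symptom CMU surfaced in this use case.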
Use Case #1: Too Much Memory!
Use Case #1: Resolved!
Single Node Details
GPU Load
Use Case #2: Too Many Processes! An ANSYS job on a two-node cluster is not performing as expected. Using CMU to look at CPU and memory usage on both nodes, we notice that one node is unusually loaded, suggesting a job placement or memory issue. The hardware is OK and memory is identical on the two nodes, so we check job placement and discover that node one is being packed with processes before node two is used. We configure ANSYS to do round-robin job placement, and performance improves. The load now looks fine on both nodes, as verified by CMU.
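The fix in this use case amounts to interleaving ranks across hosts instead of filling one host first. A sketch of the two placement policies; the host names are hypothetical:

```python
# Compare "packed" vs "round-robin" placement of ranks on two nodes.
# Host names below are hypothetical.
def packed(ranks, hosts, slots_per_host):
    """Fill the first host completely before spilling to the next."""
    return [hosts[min(r // slots_per_host, len(hosts) - 1)]
            for r in range(ranks)]

def round_robin(ranks, hosts):
    """Alternate hosts rank by rank, balancing load from the start."""
    return [hosts[r % len(hosts)] for r in range(ranks)]

hosts = ["node1", "node2"]
print(packed(8, hosts, 16))   # all 8 ranks land on node1
print(round_robin(8, hosts))  # 4 ranks on each node
```

With packed placement, an 8-process job on 16-slot nodes loads node one entirely and leaves node two idle, which is exactly the imbalance CMU revealed here.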
Use Case #2: Too Many Processes!
Use Case #2: Resolved!
ANSYS Fluent on 16 Nodes - CMU View of CPU Usage
Colplot of a 2-node Fluent job Using the collectl recording function of CMU, the output files can be read into a browser and these charts can be generated. This shows the results from a 2-node Fluent job. Each column corresponds to a node; in this instance, three metrics were plotted: CPU utilization, InfiniBand interconnect activity, and memory utilization.
Use Case #3: Process Affinity The picture on the left shows the CPU load on the cluster with Fluent's process affinity feature disabled; the picture on the right shows the CPU load with process affinity enabled.
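On Linux, the kind of core pinning that Fluent's affinity feature performs can be reproduced for any process through the standard sched_setaffinity interface, which Python exposes as os.sched_setaffinity. A minimal Linux-only sketch; the choice of core here is arbitrary for illustration:

```python
import os

# Pin the current process to a single core using the Linux
# sched_setaffinity interface (Linux-only API).
available = os.sched_getaffinity(0)        # cores we may currently run on
print("allowed cores before:", sorted(available))

target = {min(available)}                  # pin to the lowest-numbered core
os.sched_setaffinity(0, target)
print("allowed cores after: ", sorted(os.sched_getaffinity(0)))

os.sched_setaffinity(0, available)         # undo the pin for this demo
```

Pinning each solver process to its own core prevents the OS scheduler from migrating processes between cores mid-run, which is what produces the steadier load picture on the right.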
Thank you