High Performance Computing for Mechanical Simulations using ANSYS. Jeff Beisheim, ANSYS, Inc.


HPC Defined
High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing. It is a hardware and software initiative!

Need for Speed
Assemblies, CAD to mesh, capture fidelity
Impact product design
Enable large models
Allow parametric studies
Modal, nonlinear, multiphysics, dynamics

A History of HPC Performance
1980s: Vector processing on mainframes
1990: Shared Memory Multiprocessing (SMP) available
1994: Iterative PCG solver introduced for large analyses
1999-2000: 64-bit large memory addressing
2004: 1st company to solve 100M structural DOF
2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
2010: GPU acceleration (single GPU; SMP)
2012: GPU acceleration (multiple GPUs; DMP)

HPC Revolution
Recent advancements have revolutionized the computational speed available on the desktop:
Multicore processors (every core is really an independent processor)
Large amounts of RAM and SSDs
GPUs

Parallel Processing Hardware
Two types of memory systems:
Shared memory parallel (SMP): single box, workstation/server
Distributed memory parallel (DMP): multiple boxes, cluster

Parallel Processing Software
Two types of parallel processing for Mechanical APDL (launch examples follow below):
Shared memory parallel (-np > 1): first available in v4.3; can only be used on a single machine
Distributed memory parallel (-dis -np > 1): first available in v6.0 with the DDS solver; can be used on a single machine or a cluster
GPU acceleration (-acc): first available in v13.0 using NVIDIA GPUs; supports using either a single GPU or multiple GPUs; can be used on a single machine or a cluster
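As an illustration of how these options are typically combined, the batch command lines below are a minimal sketch; the launcher name (ansys145 here, for release 14.5) and the input/output file names are assumptions, and the exact flags should be checked against the documentation for your release.

    # Shared memory parallel on 4 cores (assumed launcher name and job files)
    ansys145 -b -np 4 -i input.dat -o output.out
    # Distributed memory parallel on 8 cores
    ansys145 -b -dis -np 8 -i input.dat -o output.out
    # Distributed memory parallel on 8 cores plus NVIDIA GPU acceleration
    ansys145 -b -dis -np 8 -acc nvidia -i input.dat -o output.out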

Distributed ANSYS Design Requirements
No limitation in simulation capability: must support all features; continually working to add more functionality with each release
Reproducible and consistent results: same answers achieved using 1 core or 100 cores; same quality checks and testing are done as with the SMP version; uses the same code base as the SMP version of ANSYS
Support for all major platforms: the most widely used processors, operating systems, and interconnects; supports the same platforms that the SMP version supports; uses the latest versions of MPI software, which support the latest interconnects

Distributed ANSYS Design
Distributed steps (-dis -np N):
At the start of the first load step, decompose the FEA model into N pieces (domains)
Each domain goes to a different core to be solved
The solution is not independent! Lots of communication is required to achieve the solution, and lots of synchronization is required to keep all processes together
Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*)
Results are automatically combined at the end of the solution, which facilitates postprocessing in /POST1, /POST26, or Workbench (see the sketch below)
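Because the combined results file can be read just like one from an SMP run, a minimal /POST1 sketch looks like the following; it assumes a structural job whose distributed result files were combined automatically at the end of solution.

    /POST1              ! general postprocessor reads the combined results file
    SET,LAST            ! load the last results set
    PRNSOL,U,SUM        ! list nodal displacement magnitudes
    FINISH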

Distributed ANSYS Capabilities
A wide variety of features and analysis capabilities are supported:
Static linear or nonlinear analyses
Buckling analyses
Modal analyses
Harmonic response analyses using the FULL method
Transient response analyses using the FULL method
Single-field structural and thermal analyses
Low-frequency electromagnetic analyses
High-frequency electromagnetic analyses
Coupled-field analyses
All widely used element types and materials
Superelements (use pass)
NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
Multiframe restarts
Cyclic symmetry analyses
User Programmable Features (UPFs)

Distributed ANSYS Equation Solvers
Sparse direct solver (default): supports SMP, DMP, and GPU acceleration; can handle all analysis types and options; foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers
PCG iterative solver: supports SMP, DMP, and GPU acceleration; symmetric, real-valued matrices only (i.e., static/full transient); foundation for the PCG Lanczos eigensolver
JCG/ICCG iterative solvers: support SMP only
A solver-selection sketch follows this list.
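A minimal sketch of selecting these solvers in a static analysis; the EQSLV labels shown are standard Mechanical APDL ones, but defaults and optional fields vary by release, so verify against your documentation.

    /SOLU
    ANTYPE,STATIC
    EQSLV,SPARSE         ! sparse direct solver (default; SMP, DMP, GPU)
    ! EQSLV,PCG,1.0E-8   ! or: PCG iterative solver with a 1.0E-8 tolerance
    ! EQSLV,JCG          ! or: JCG iterative solver (SMP only)
    SOLVE
    FINISH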

Distributed ANSYS Eigensolvers
Block Lanczos eigensolver (including QRdamp): supports SMP and GPU acceleration
PCG Lanczos eigensolver: supports SMP, DMP, and GPU acceleration; great for large models (>5 MDOF) with relatively few modes (<50)
Supernode eigensolver: supports SMP only; optimal choice when requesting hundreds or thousands of modes
Subspace eigensolver: supports SMP, DMP, and GPU acceleration; currently only supports buckling analyses (beta for modal in R14.5)
Unsymmetric/Damped eigensolvers: support SMP, DMP, and GPU acceleration
An eigensolver-selection sketch follows this list.
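A minimal modal-analysis sketch showing how the eigensolver is chosen with MODOPT; the mode counts are placeholders, and the method labels should be checked against your release.

    /SOLU
    ANTYPE,MODAL
    MODOPT,LANB,10       ! Block Lanczos, first 10 modes
    ! MODOPT,LANPCG,10   ! or: PCG Lanczos, good for large models with few modes
    ! MODOPT,SNODE,200   ! or: Supernode, good when requesting many modes
    ! MODOPT,QRDAMP,10   ! or: QR damped eigensolver
    MXPAND,10            ! expand the requested modes
    SOLVE
    FINISH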

Distributed ANSYS Benefits
Better architecture: more computations performed in parallel, so faster solution time
Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!); can be used for jobs running on 1000+ CPU cores
Can take advantage of resources on multiple machines: memory usage and bandwidth scale; disk (I/O) usage scales
A whole new class of problems can be solved!

Distributed ANSYS Performance
Need fast interconnects to feed fast processors. Two main characteristics for each interconnect: latency and bandwidth. Distributed ANSYS is highly bandwidth bound.

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
    Release: 14.5       Build: UP20120802     Platform: LINUX x64
    Date Run: 08/09/2012   Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
    Total number of cores available     : 32
    Number of physical cores available  : 32
    Number of cores requested           : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI
    Core   Machine Name   Working Directory
    ----------------------------------------------------
     0     hpclnxsmc00    /data1/ansyswork
     1     hpclnxsmc00    /data1/ansyswork
     2     hpclnxsmc01    /data1/ansyswork
     3     hpclnxsmc01    /data1/ansyswork
    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds
    Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)

Distributed ANSYS Performance
Need fast interconnects to feed fast processors.
[Chart: interconnect performance, rating in runs/day vs. core count (8, 16, 32, 64, 128 cores) for Gigabit Ethernet vs. DDR InfiniBand. Turbine model, 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster with 8 cores per node.]

Distributed ANSYS Performance
Need fast hard drives to feed fast processors; check the bandwidth specs.
ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O.
Distributed ANSYS can be highly I/O latency bound: the seek time to read/write each set of files causes overhead.
Consider SSDs: high bandwidth and extremely low seek times.
Consider RAID configurations: RAID 0 for speed; RAID 1 or 5 for redundancy; RAID 10 for speed and redundancy.

Distributed ANSYS Performance
Need fast hard drives to feed fast processors.
[Chart: hard drive performance, rating in runs/day for HDD vs. SSD at 1, 2, 4, and 8 cores. 8 million DOF linear static analysis, sparse solver (DMP), Dell T5500 workstation: 12 Intel Xeon X5675 cores, 48 GB RAM, single 7,200 rpm HDD, single SSD, Windows 7.]
Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, p. 37, 2010.

Distributed ANSYS Performance
Avoid waiting for I/O to complete! Check to see if the job is I/O bound or compute bound by comparing the CPU and Elapsed times in the output file:

    Total CPU time for main thread             :      167.8 seconds
    ...
    Elapsed Time (sec) =       388.000      Date = 08/21/2012

When the Elapsed time is much greater than the main-thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration. In the excerpt above, the elapsed time (388 s) is more than twice the main-thread CPU time (167.8 s), so that run is I/O bound.
When the Elapsed time is close to the main-thread CPU time, the job is compute bound: consider moving the simulation to a machine with faster processors, using Distributed ANSYS (DMP) instead of SMP, or running on more cores or possibly using GPU(s).
A quick way to pull these two timing lines from the output file is sketched below.
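One way to make the comparison, sketched here with a generic grep against an assumed output file name (solve.out); the search strings come from the output excerpts shown above.

    grep "Total CPU time for main thread" solve.out
    grep "Elapsed Time" solve.out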

Distributed ANSYS Performance (results per ANSYS release: 11.0, 12.0, 12.1, 13.0 SP2, 14.0)
Thermal analysis, full model, 3 MDOF: 4 hours on 8 cores for 11.0 through 13.0 SP2; 1 hour on 8 cores + 1 GPU and 0.8 hour on 32 cores with 14.0.
Thermomechanical simulation, full model, 7.8 MDOF: ~5.5 days (8 cores, 163 iterations), 34.3 hours (20 cores, 164 iterations), 12.5 hours (64 cores, 195 iterations), 9.9 hours (64 cores, 195 iterations), 7.5 hours (128 cores, 195 iterations).
Interpolation of boundary conditions: 37 hours (16 load steps) for 11.0 through 12.1; 0.2 hour with the improved algorithm in 13.0 SP2 and 14.0.
Submodel, creep strain analysis, 5.5 MDOF: ~5.5 days (18 cores, 492 iterations), 38.5 hours (16 cores, 492 iterations), 8.5 hours (76 cores, 492 iterations), 6.1 hours (128 cores, 488 iterations), 5.9 hours (64 cores + 8 GPUs, 498 iterations), 4.2 hours (256 cores, 498 iterations).
Total time: 2 weeks (11.0), 5 days (12.0), 2 days (12.1), 1 day (13.0 SP2), 0.5 day (14.0).
All runs with the sparse solver. Hardware for 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node. Hardware for 12.1 and 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node. ANSYS 12.0 to 14.0 runs used a DDR InfiniBand interconnect. ANSYS 14.0 creep runs used NROPT,CRPL and DDOPT,METIS. Results courtesy of MicroConsult Engineering, GmbH.

Distributed ANSYS Performance
Minimum time to solution is more important than scaling.
[Chart: solution scalability, speedup vs. number of cores (8 to 64). Turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster with 8 cores per node.]

Distributed ANSYS Performance
Minimum time to solution is more important than scaling.
[Chart: solution elapsed time (seconds) vs. number of cores (8 to 64), with annotated times of 11 hrs 48 mins, 1 hr 20 mins, and 30 mins. Turbine model, 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster with 8 cores per node.]

GPU Accelerator Capability
Graphics processing units (GPUs), widely used for gaming and graphics rendering, have recently been made available as general-purpose accelerators, with support for double-precision computations and performance exceeding the latest multicore CPUs.
So how can ANSYS make use of this new technology to reduce the overall time to solution?

GPU Accelerator Capability
Accelerates the sparse direct solver (SMP & DMP): the GPU is used to factor many dense frontal matrices. The decision on when to send data to the GPU is made automatically: if a frontal matrix is too small, there is too much overhead and it stays on the CPU; if it is too large, it exceeds GPU memory and is only partially accelerated.
Accelerates the PCG/JCG iterative solvers (SMP & DMP): the GPU is only used for the sparse matrix-vector multiply (SpMV kernel). The decision on when to send data to the GPU is made automatically: if the model is too small, there is too much overhead and it stays on the CPU; if it is too large, it exceeds GPU memory and is only partially accelerated.

GPU Accelerator Capability
Supported hardware: currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards; next-generation NVIDIA Tesla cards (Kepler) should work with R14.5.
Installing a GPU requires a larger power supply (a single card needs ~250 W) and an open 2x form factor PCIe x16 2.0 (or 3.0) slot.
Supported platforms: Windows and Linux 64-bit platforms only; does not include the Linux Itanium (IA-64) platform.

GPU Accelerator Capability
Targeted hardware (power, memory, memory bandwidth, peak SP/DP speed in GFlops):
NVIDIA Tesla C2075: 225 W, 6 GB, 144 GB/s, 1030/515
NVIDIA Tesla M2090: 250 W, 6 GB, 177.4 GB/s, 1331/665
NVIDIA Quadro 6000: 225 W, 6 GB, 144 GB/s, 1030/515
NVIDIA Quadro K5000: 122 W, 4 GB, 173 GB/s, 2290/95
NVIDIA Tesla K10: 250 W, 8 GB, 320 GB/s, 4577/190
NVIDIA Tesla K20: 250 W, 6 to 24 GB, 288 GB/s, 5184/1728
(The NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.)

GPU Accelerator Capability
GPUs can offer significantly faster time to solution.
[Chart: GPU performance, relative speedup for a 6.5 million DOF linear static analysis with the sparse solver (DMP): 2 cores (no GPU) baseline, 8 cores (no GPU) 2.6x, 8 cores + 1 GPU 3.8x. Hardware: 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Windows 7.]

GPU Accelerator Capability
GPUs can offer significantly faster time to solution.
[Chart: GPU performance, relative speedup for an 11.8 million DOF linear static analysis with the PCG solver (DMP): 2 cores (no GPU) baseline, 8 cores + 1 GPU 2.7x, 16 cores + 4 GPUs 5.2x. Hardware: 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Windows 7.]

GPU Accelerator Capability
Supports the majority of ANSYS users: covers both the sparse direct and PCG iterative solvers, with only a few minor limitations.
Ease of use: requires at least one supported GPU card to be installed and at least one HPC Pack license; no rebuild and no additional installation steps.
Performance: ~10-25% reduction in time to solution when using 8 CPU cores; should never slow down your simulation!

Design Optimization
How will you use all of this computing power? Higher fidelity, full assemblies, more nonlinear analyses, design optimization studies.

HPC Licensing
ANSYS HPC Packs enable high-fidelity insight. Each simulation consumes one or more packs, and the parallel capability enabled increases quickly with added packs. A single solution covers all physics and any level of fidelity, with flexibility as your HPC resources grow: reallocate packs as resources allow.
Parallel enabled per simulation:
1 pack: 8 cores + 1 GPU
2 packs: 32 cores + 4 GPUs
3 packs: 128 cores + 16 GPUs
4 packs: 512 cores + 64 GPUs
5 packs: 2048 cores + 256 GPUs

HPC Parametric Pack Licensing
Scalable, like ANSYS HPC Packs. Enhances the customer's ability to include many design points as part of a single study, ensuring sound product decision-making. It amplifies the complete workflow: design points can include execution of multiple products (pre, solve, HPC, post). Packaged to encourage adoption of the path to robust design!
Number of simultaneous design points enabled:
1 license: 4 design points
2 licenses: 8 design points
3 licenses: 16 design points
4 licenses: 32 design points
5 licenses: 64 design points

HPC Revolution
The right combination of algorithms and hardware leads to maximum efficiency: HDDs vs. SSDs, SMP vs. DMP, GPUs, clusters, interconnects.

HPC Revolution
Every computer today is a parallel computer. Every simulation in ANSYS can benefit from parallel processing.