High Performance Computing in CST STUDIO SUITE
Felix Wolfheimer
GPU Computing Performance

[Chart: speedup of the solver loop vs. number of GPUs (Tesla K40, 0 to 4), comparing CST STUDIO SUITE 2013 and CST STUDIO SUITE 2014]

GPU computing performance has been improved for CST STUDIO SUITE 2014, as CPU and GPU resources are now used in parallel.

Benchmark performed on a system equipped with dual Xeon E5-2630 v2 (Ivy Bridge EP) processors and four Tesla K40 cards; the model has 80 million mesh cells.

Promo offer for EUC participants: 25% discount for K40 cards.
Typical GPU System Configurations

Entry level: workstation with one GPU card. Available "off the shelf". Good acceleration for smaller models. Limited model size (depends on available GPU memory and features used).

Professional level: workstation or server with multiple internal or external GPU cards. Many configurations available. Good acceleration for medium-size and large models. Limited model size (depends on available GPU memory and features used).

Enterprise level: cluster system with high-speed interconnect. High flexibility: can handle extremely large models using MPI Computing and also many parallel simulation tasks using Distributed Computing (DC). Administrative overhead and higher price.

CST engineers are available to discuss with you which configuration makes sense for your applications and usage scenario.
MPI Computing: Area of Application

MPI Computing is a way to handle very large models efficiently. Some application examples for MPI Computing: electrically very large structures (e.g. RCS calculation, lightning strike) and extremely complex structures (e.g. SI simulation for a full package).
MPI Computing: Working Principle

The CST STUDIO SUITE frontend connects to the MPI client nodes, which communicate over a high-speed, low-latency interconnection network (optional). MPI Computing is based on a domain decomposition of the simulation domain: each cluster node works on its part of the domain, and the subdomain boundaries of the decomposition are shown in the mesh view. Automatic load balancing ensures an equal distribution of the workload. It works cross-platform on Windows and Linux systems.
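To give a feeling for the domain-decomposition idea, here is a minimal sketch in Python/mpi4py of how each rank could own one slab of a mesh and exchange boundary (ghost) data with its neighbours. This is purely illustrative and is not the CST STUDIO SUITE implementation; the cell count and slab layout are assumptions.

```python
# Conceptual sketch of 1D domain decomposition with halo exchange (mpi4py).
# NOT the CST STUDIO SUITE implementation; illustration only.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_total = 1_000_000            # assumed total cell count along the split axis
n_local = n_total // size      # each rank owns an equal slab (ideal load balancing)
field = np.zeros(n_local + 2)  # +2 ghost cells at the subdomain boundaries

# Neighbouring subdomains (MPI.PROC_NULL turns the exchange into a no-op at the edges).
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange ghost cells with both neighbours, as would happen every time step.
comm.Sendrecv(sendbuf=field[1:2],   dest=left,  recvbuf=field[-1:], source=right)
comm.Sendrecv(sendbuf=field[-2:-1], dest=right, recvbuf=field[0:1], source=left)
```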
MPI Matrix Computation

The performance of the matrix computation step has been improved significantly in the new version of CST STUDIO SUITE. Matrix computation is single-threaded in the MPI case up to version 2013; version 2014 uses all available cores on all cluster nodes.

Performance results (for two cluster nodes):*

Model | Matrix Comp. Time/s (2013) | Matrix Comp. Time/s (2014) | Speedup (Matrix Comp.)** | Speedup (Total Sim.)**
340M cells | 10,301 | 1,217 | 8.46 | 2.63
47M cells | 12,921 | 4,018 | 3.22 | 1.85

* System configuration: compute nodes are equipped with dual eight-core Xeon E5-2650 processors, 4x K20 GPUs, and InfiniBand FDR interconnect.
** Speedup between versions 2013 and 2014 of CST STUDIO SUITE.
MPI Calculation Example

2 GHz blade antenna positioned on an aircraft. At 2 GHz the aircraft measures 17.4 x 4.5 x 16.2 m, i.e. 116 x 30 x 108 λ or 375,840 λ³, discretized with 660 million mesh cells. Hardware: a 4-node MPI cluster with 4 Tesla K20 GPUs per node, i.e. 16 GPUs in total with 6 GB RAM each, used at 60% memory load; total memory requirement < 100 GB. Broadband calculation time: approximately 4 hours.
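For reference, the electrical size quoted above follows directly from the dimensions and the frequency on the slide; the short script below is just a back-of-the-envelope check, not part of the workflow.

```python
# Rough check of the model's electrical size at 2 GHz (dimensions taken from the slide).
c0 = 299_792_458.0          # speed of light in vacuum, m/s
f  = 2e9                    # frequency, Hz
wavelength = c0 / f         # ~0.15 m

dims_m = (17.4, 4.5, 16.2)                      # aircraft bounding box in metres
dims_lambda = [d / wavelength for d in dims_m]  # ~116 x 30 x 108 wavelengths

volume_lambda3 = dims_lambda[0] * dims_lambda[1] * dims_lambda[2]
print(dims_lambda, volume_lambda3)              # ~375,840 cubic wavelengths
```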
Sub-Volume Monitors

Sub-volume monitors allow field data to be recorded only in a region of interest, reducing the amount of stored data. This is especially important for large models with hundreds of millions of mesh cells. Field data is only stored in the sub-volume defined by the box.
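As a rough illustration of the data reduction, the saving scales with the fraction of mesh cells inside the monitor box. The cell counts and bytes-per-cell figure below are assumptions for illustration, not CST numbers.

```python
# Rough estimate of storage saved by a sub-volume monitor (illustrative numbers only).
total_cells      = 300_000_000   # assumed full-model mesh cell count
subvolume_cells  = 5_000_000     # assumed cells inside the monitor box
bytes_per_cell   = 6 * 4         # assumed: 3 E + 3 H field components, single precision

full_gb = total_cells     * bytes_per_cell / 1e9
sub_gb  = subvolume_cells * bytes_per_cell / 1e9
print(f"full volume: {full_gb:.1f} GB per snapshot, sub-volume: {sub_gb:.2f} GB")
```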
Distributed Computing

The CST STUDIO SUITE frontend connects to the DC Main Controller, which distributes jobs to the DC Solver Servers. Jobs can be: port excitations*, frequency points*, parameter variations, or optimization iterations.

* 2 in parallel included with the standard license
Example: the model has 16 ports, but only 8 ports need to be computed when symmetry conditions are defined. The 8 simulation runs are distributed to different solver servers with GPU acceleration.
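A purely conceptual sketch of distributing such independent runs is given below. It does not reflect the CST Distributed Computing protocol; the run_excitation() helper and the server list are placeholders for illustration only.

```python
# Conceptual sketch: distribute 8 independent port excitations across solver servers.
# NOT the CST DC protocol; run_excitation() and the server names are placeholders.
from concurrent.futures import ThreadPoolExecutor

servers = ["server-01", "server-02", "server-03", "server-04"]  # assumed DC solver servers
ports   = list(range(1, 9))                                     # 8 excitations to compute

def run_excitation(port: int, server: str) -> str:
    # Placeholder: in reality the DC Main Controller dispatches the job and the
    # solver server runs it (optionally GPU-accelerated) and returns the results.
    return f"port {port} solved on {server}"

with ThreadPoolExecutor(max_workers=len(servers)) as pool:
    futures = [pool.submit(run_excitation, p, servers[i % len(servers)])
               for i, p in enumerate(ports)]
    for f in futures:
        print(f.result())
```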
DC Simulation Time Improvement

[Chart: speedup of total simulation time vs. number of DC solver servers (1, 2, 4, 8), for CPU only and for 1 GPU (Tesla 20) per node]

Benchmark: dual Intel Xeon X5675 CPUs (3.06 GHz), fastest memory configuration, 1 Tesla 20 GPU per node, 1 Gb Ethernet interconnect, 40 million mesh cells.
DC Main Controller

The DC Main Controller gives you a complete overview of what is happening on your cluster: job status and machine status. Essential resources (RAM usage and disk space) are monitored as well in the 2014 version.
GPU Assignment

Users who have smaller jobs can start multiple solver servers and assign each GPU to a separate server. This allows for a more efficient use of multi-GPU hardware.
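One generic way to pin one worker process per GPU is the CUDA_VISIBLE_DEVICES environment variable. The sketch below uses it for illustration only; it does not describe how the CST solver server performs the assignment internally, and the worker command is a placeholder.

```python
# Illustration of launching one worker per GPU via CUDA_VISIBLE_DEVICES.
# Generic technique, not the CST solver server's internal mechanism.
import os
import subprocess

num_gpus = 4  # assumed number of GPUs in the machine
procs = []
for gpu_id in range(num_gpus):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)   # this worker only sees one GPU
    procs.append(subprocess.Popen(
        ["echo", f"solver server bound to GPU {gpu_id}"],  # placeholder command
        env=env))

for p in procs:
    p.wait()
```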
Supported Acceleration Methods

[Table: for each solver of CST STUDIO SUITE, support for Multithreading, GPU Computing, Distributed Computing, and MPI Computing; one table footnote reads "on one GPU card"]

Most other solvers support Multithreading and Distributed Computing for parameter sweeps and optimization.
Choose the Right Acceleration Method

Solver | Model Size | Number of Simulations | Acceleration Technique
Transient | below memory limit of GPU hardware | low | GPU Computing
Transient | below memory limit of GPU hardware | medium/high | GPU Computing on a DC cluster (Distributed Excitations)
Transient | above memory limit of GPU hardware | - | MPI or combined MPI+GPU Computing
Frequency Domain | can be handled by a single machine | medium/high | Distributed Computing (Distributed Frequency Points)
Integral Equation | can't be handled by a single machine | - | MPI Computing
Integral Equation | can be handled by a single machine | medium/high | Distributed Computing (Distributed Frequency Points)
Parameter Sweep/Optimization | n/a | medium/high | Distributed Computing
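Purely as an illustration, the decision logic of this table could be written down as a small helper function. The function and its string arguments are a paraphrase of the table, not part of CST STUDIO SUITE; "model_fits" stands for "below the GPU memory limit" for the transient solver and "can be handled by a single machine" for the frequency-domain and integral-equation solvers.

```python
# Paraphrase of the decision table above as a helper function (illustrative only).
def acceleration_technique(solver: str, model_fits: bool, num_simulations: str) -> str:
    if solver == "Transient":
        if not model_fits:                      # above the GPU memory limit
            return "MPI or combined MPI+GPU Computing"
        if num_simulations == "low":
            return "GPU Computing"
        return "GPU Computing on a DC cluster (Distributed Excitations)"
    if solver == "Frequency Domain":
        return "Distributed Computing (Distributed Frequency Points)"
    if solver == "Integral Equation":
        if not model_fits:                      # cannot be handled by a single machine
            return "MPI Computing"
        return "Distributed Computing (Distributed Frequency Points)"
    if solver == "Parameter Sweep/Optimization":
        return "Distributed Computing"
    return "see the table above"

print(acceleration_technique("Transient", True, "medium/high"))
```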
HPC in the Cloud

CST is working together with HPC hardware and service providers to enable easy access to large computing power for challenging simulations which cannot be run on in-house hardware. Users rent a CST license for the resources they need and pay the HPC system provider for the required hardware. More information on the currently supported providers hosting CST STUDIO SUITE can be found in the HPC section of our website: https://www.cst.com/products/hpc/cloud-computing
HPC Hardware Design Process

A general hardware recommendation is available on our website, which helps you configure standard systems (e.g. workstations) for CST STUDIO SUITE. For HPC systems (multi-GPU systems, clusters), our hardware experts are available to guide you through the whole process of system design and benchmarking, to ensure that your new system is compatible with CST STUDIO SUITE and delivers the expected performance.

HPC System Design Process:
Personal contact with CST engineers to design the solution.
Benchmarking of the designed computing solution in the hardware test center of the preferred vendor.
Purchase of the machine if it fulfills your expectations.