LS-DYNA: CAE Simulation Software on Linux Clusters

IBM Deep Computing Group
LS-DYNA: CAE Simulation Software on Linux Clusters
Guangye Li (guangye@us.ibm.com)
IBM Deep Computing Team
June 2003

Topics
- Introduction to LS-DYNA
- LS-DYNA applications
- Two versions of LS-DYNA: SMP and MPP
- An example
- Performance of LS-DYNA on clusters:
  - Performance improvement with faster processors
  - Interconnect options: Gigabit Ethernet or Myrinet
  - One or two processes per node
  - Comparison of LAM/MPI and MPICH performance
  - Speedup from compiler options
  - Speedup from a faster 533 MHz front-side bus
- Chrysler experience

Introduction to LS-DYNA
- A general-purpose transient dynamic finite element program capable of simulating complex real-world problems
- Software vendor: Livermore Software Technology Corp. (LSTC)
- The largest application in CAE, with a large customer base

LS-DYNA applications include:
- Occupant safety
- Metal forming
- Metal cutting
- Biomedical
- Blast loading
- Fluid-structure interaction
- Earthquake engineering

Two parallel versions of LS-DYNA
- SMP (OpenMP), for shared-memory multiprocessors (see the sketch after this list)
  - Parallelized from a serial code
  - Scalable up to 16 CPUs
- MPP (distributed-memory version)
  - Uses the domain decomposition technique
  - Uses MPI for communication between subdomains (processors)
  - Scalable to more than 100 CPUs
  - Suitable for both shared-memory multiprocessors and clusters
- MPP-DYNA on clusters dramatically reduces turnaround time and simulation cost
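To make the contrast concrete, the SMP programming model can be illustrated with a minimal C sketch. This is illustrative only: LS-DYNA's actual source is proprietary Fortran, and the loop body here is a hypothetical stand-in for a real element-update kernel.

    #include <omp.h>

    /* SMP model: a single process whose element loop is split across
     * threads on one shared-memory machine; no message passing. */
    void update_elements(double *force, const double *stress, int n_elem)
    {
        #pragma omp parallel for
        for (int i = 0; i < n_elem; i++)
            force[i] += stress[i];   /* placeholder element update */
    }

Because the threads share the arrays directly, this model cannot span more than one machine, which is why SMP scaling stops at the size of one shared-memory box; the MPP version avoids that limit by exchanging messages between subdomains instead.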

Comparison of SMP and MPP

[Chart: elapsed time (sec), 0 to 35,000, of SMP vs. MPP LS-DYNA on 1, 2, 4, 8, 16, and 32 CPUs of a 1.3 GHz IBM p690; refined Neon model, 535k elements; November 2002.]

An Example: The Neon Model
- Frontal crash with an initial speed of 31.5 miles/hour
- Model size:
  - Number of shell elements: 269,249
  - Number of nodal points: 285,832
- Simulation length: 150 ms; vehicle bounce-back observed at 70 ms
- Model created by the National Crash Analysis Center (NCAC) at George Washington University
  - One of the few publicly available models for vehicle crash analysis
  - Based on the 1996 Plymouth Neon

[Slide: photo of the 1996 Plymouth Neon]

[Slide: the Neon model]

[Slide: the Neon mesh]

Domain decomposition
- The whole mesh is decomposed into NCPU subdomains, each with about the same number of elements
- Each cut link corresponds to communication between two subdomains; the decomposition should minimize the number of cut links
- Each CPU processes the elements in its own subdomain
- CPUs exchange boundary data using message passing (MPI), as sketched below
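The boundary exchange can be sketched in C with MPI. This is a minimal sketch under simplifying assumptions, not LS-DYNA's internals: it assumes a 1-D chain of subdomains with one ghost value per side, whereas a real crash decomposition has an irregular graph of neighbors, but the pattern (post receives, send your own boundary, wait) is the same.

    #include <mpi.h>

    /* u[0] and u[n_local+1] are ghost cells; u[1..n_local] is owned data. */
    void exchange_halo(double *u, int n_local, int rank, int nprocs, MPI_Comm comm)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request req[4];

        MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(&u[n_local + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(&u[n_local],     1, MPI_DOUBLE, right, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }

Non-blocking sends and receives let every rank exchange simultaneously without deadlock; the volume of this boundary traffic is what makes the interconnect comparisons later in this deck matter.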

[Slide: simulation results]

Performance Improvement with Faster Processors

[Chart: elapsed time (sec), 0 to 25,000, on 2 to 64 CPUs for 2.4 GHz vs. 2.8 GHz Xeon nodes; V960 r1488 LS-DYNA, 2 CPUs per node, Gigabit Ethernet, LAM/MPI; refined Neon model, 535k elements; January-March 2003.]

Configuring Each Node with One Processor

[Chart: elapsed time (sec), 0 to 16,000, on 2 to 64 CPUs, comparing 2 CPUs per node against 1 CPU per node; V960 r1488 LS-DYNA, 2.8 GHz x335, Gigabit Ethernet, LAM/MPI; front-crash model, 430k elements; March 2003.]

Interconnect Effect on Performance

[Chart: elapsed time (sec), 0 to 25,000, on 2 to 32 CPUs for Fast Ethernet, Gigabit Ethernet, and Myrinet; 2.2 GHz IntelliStation cluster, MPI LS-DYNA; refined Neon model, 535k elements; June 2002.]

Interconnect Performance Compared

[Chart: parallel speedup, 0 to 30, on 2 to 32 CPUs for x335 + Fast Ethernet, x335 + Gigabit Ethernet, x335 + Myrinet, and p655 + SP Switch2; V960 LS-DYNA; refined Neon model, 535k elements; January 2003.]
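For reference, the parallel speedup plotted here is the usual ratio of one-CPU to p-CPU elapsed time:

    S(p) = T_1 / T_p        (ideal scaling: S(p) = p)

So, for a hypothetical example, a 32-CPU run that achieves a speedup of 25 is running at a parallel efficiency of E = S/p = 25/32, or about 78%.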

Comparison of LAM/MPI and MPICH Performance

[Chart: elapsed time (sec), 0 to 3,500, at 16 and 32 CPUs for MPICH 1.2.4 vs. LAM/MPI 6.5.6; 2.8 GHz x335 (Xeon) cluster, Gigabit Ethernet; LS-DYNA, refined Neon model, 535k elements; March 2003.]

Speedup from Compiler Options

    Intel compiler option    Elapsed time (sec)
    SSE                      20,781
    No SSE                   25,110

V960 r1106 MPP-DYNA, LAM/MPI 6.5.2, 2.2 GHz IntelliStation nodes, 12-processor runs; February 2002.
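From the table, the speedup from enabling SSE works out to

    25110 / 20781 ≈ 1.21

that is, vectorization makes this 12-processor run about 21% faster (a 17% reduction in elapsed time).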

Speedup from Faster 533 MHz Front-Side Bus

    Model size (elements)    Speedup, 400 MHz to 533 MHz front-side bus
    12,000                   1.10
    32,000                   1.08
    155,000                  1.20
    430,000                  1.18

V960 r1488 LS-DYNA, LAM/MPI, 2.8 GHz x335 nodes, 2-processor runs; March 2003.

Performance Improvement with Version 970

[Chart: elapsed time (sec), 0 to 20,000, on 2 to 32 CPUs for version 960 r1488 vs. version 970 r3535, with annotated speedups for version 970 ranging from 1.10 to 1.20; 2.8 GHz x335 (Xeon) cluster, Gigabit Ethernet, LAM/MPI, MPP-DYNA; refined Neon model, 535k elements; March 2003.]

Chrysler experience
- Customer requirements:
  - Reduced turnaround time
  - Price/performance
  - Good accuracy, i.e., the numerical results should match those from the current 64-bit machines
- A team effort: Chrysler, LSTC, IBM, Intel
- Eventually all 22 QA models passed the accuracy requirements, and Chrysler bought 108 Xeon-based IBM Linux cluster nodes for car crash simulation

Chrysler is happy with the IBM Linux cluster solution

"Without parallel processing, we never would have achieved 5-star (NCAP) and Good (IIHS) ratings on our new Chrysler Sebring and Dodge Stratus within the current product development time."
-- Subhas Shetty, Chrysler

Summary
- MPI-based MPP-DYNA has better scalability
- Linux clusters reduce the turnaround time for car crash simulation
- Linux clusters reduce the simulation cost
- The accuracy is satisfactory
- Users today can customize their systems to pick the features that serve them best:
  - Processors
  - Operating system
  - Interconnect