LS-DYNA: CAE Simulation Software on Linux Clusters

Guangye Li (guangye@us.ibm.com)
IBM Deep Computing Team, IBM Deep Computing Group
June 2003

Topics
- Introduction to LS-DYNA
- LS-DYNA applications
- Two versions of LS-DYNA: SMP and MPP
- An example
- Performance of LS-DYNA on clusters
- Performance improvement with faster processors
- Interconnect options: Gigabit Ethernet or Myrinet
- One- or two-processor nodes
- Comparison of LAM/MPI and MPICH performance
- Speedup from compiler options
- Speedup from the faster 533 MHz front-side bus
- Chrysler experience

LS-DYNA
- A general-purpose transient dynamic finite element program capable of simulating complex real-world problems
- Software vendor: Livermore Software Technology Corp. (LSTC)
- Largest application in CAE
- Large customer base

LS-DYNA applications include:
- Occupant safety
- Metal forming
- Metal cutting
- Biomedical
- Blast loading
- Fluid-structure interaction
- Earthquake engineering

Two parallel versions of LS-DYNA
- SMP (OpenMP), for shared-memory multiprocessors
  - Parallelized from a serial code
  - Scalable up to 16 CPUs
- MPP (distributed-memory version)
  - Uses the domain decomposition technique
  - Uses MPI for communication between subdomains (processors)
  - Scalable to more than 100 CPUs
  - Suitable for both shared-memory multiprocessors and clusters
- MPP-DYNA on clusters has dramatically reduced turnaround time and simulation cost

Comparison of SMP and MPP
[Figure: elapsed time (sec) versus number of CPUs (1 to 32) for the SMP and MPP versions. LS-DYNA, refined Neon model (535k elements), 1.3 GHz IBM p690, November 2002.]

An Example: The Neon Model
- Frontal crash with an initial speed of 31.5 miles/hour
- Model size
  - Number of shell elements: 269,249
  - Number of nodal points: 285,832
- Simulation length: 150 ms
  - Vehicle bounce-back observed at 70 ms
- Model created by the National Crash Analysis Center (NCAC) at George Washington University
  - One of the few publicly available models for vehicle crash analysis
  - Based on the 1996 Plymouth Neon

[Figure: 1996 Plymouth Neon]

[Figure: The model]

[Figure: The mesh]

Domain decomposition
- The whole mesh is decomposed into NCPU subdomains.
  - Each subdomain has about the same number of elements.
  - Each link cut corresponds to communication between two nodes.
  - The decomposition should minimize the number of link cuts.
- Each CPU processes the elements in its own subdomain.
- CPUs exchange boundary data using message passing (MPI).
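The trade-off described on this slide (balanced subdomains, few link cuts) can be illustrated with a toy example. The sketch below is not LS-DYNA's decomposition algorithm; it assumes a regular 2D grid of elements and a simple block partition, and the function names `decompose` and `count_link_cuts` are hypothetical. It compares two ways of splitting the same 16 x 16 grid across 4 CPUs.

```python
from collections import Counter


def decompose(nx, ny, px, py):
    """Assign each element (i, j) of an nx-by-ny grid to one of px*py subdomains.

    Illustrative block partition only (an assumption for this sketch).
    """
    owner = {}
    for i in range(nx):
        for j in range(ny):
            owner[(i, j)] = (i * px // nx) * py + (j * py // ny)
    return owner


def count_link_cuts(owner, nx, ny):
    """Count links between neighbouring elements that lie in different subdomains."""
    cuts = 0
    for (i, j), dom in owner.items():
        # Look only at the right and upper neighbours so each link is counted once.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < nx and nj < ny and owner[(ni, nj)] != dom:
                cuts += 1
    return cuts


nx, ny = 16, 16
strips = decompose(nx, ny, 4, 1)   # four vertical strips
blocks = decompose(nx, ny, 2, 2)   # 2 x 2 square blocks

for name, owner in (("strips", strips), ("blocks", blocks)):
    sizes = sorted(Counter(owner.values()).values())
    print(name, "elements per subdomain:", sizes,
          "link cuts:", count_link_cuts(owner, nx, ny))
```

Both partitions give every CPU 64 elements, but the square blocks cut fewer links than the strips, so less boundary data would have to be exchanged per time step.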

[Figure: Simulation results]

Performance Improvement with Faster Processors
[Figure: elapsed time (sec) versus number of CPUs (2 to 64) for 2.4 GHz and 2.8 GHz Xeon processors. V960 r1488 LS-DYNA, 2 CPUs per node, Gigabit Ethernet, LAM/MPI, refined Neon model (535k elements), January to March 2003.]

Configuring Each Node with One Processor
[Figure: elapsed time (sec) versus number of CPUs (2 to 64), comparing 2 CPUs per node with 1 CPU per node. V960 r1488 LS-DYNA, x335 2.8 GHz, Gigabit Ethernet, LAM/MPI, front crash model (430k elements), March 2003.]

Interconnect Effect on Performance
[Figure: elapsed time (sec) versus number of CPUs (2 to 32) for Fast Ethernet, Gigabit Ethernet, and Myrinet. MPI LS-DYNA, 2.2 GHz IntelliStation cluster, refined Neon model (535k elements), June 2002.]

Interconnect Performance Compared
[Figure: parallel speedup versus number of CPUs (2 to 32) for x335 + Fast Ethernet, x335 + Gigabit Ethernet, x335 + Myrinet, and p655 + SP Switch2. V960 LS-DYNA, refined Neon model (535k elements), January 2003.]
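Parallel speedup of the kind plotted on this slide is commonly derived from elapsed times as the ratio of a baseline run to a larger run. The sketch below shows that arithmetic under stated assumptions: the baseline is taken to be the smallest CPU count measured, the function name `scaling_table` is hypothetical, and the timings are made-up placeholders, not values read from the chart.

```python
def scaling_table(elapsed):
    """Return (cpus, speedup, efficiency) rows relative to the smallest run.

    elapsed maps CPU count -> elapsed time in seconds.
    speedup    = T(baseline) / T(p)
    efficiency = speedup * baseline_cpus / p   (1.0 means ideal scaling)
    """
    base_cpus = min(elapsed)
    base_time = elapsed[base_cpus]
    rows = []
    for cpus in sorted(elapsed):
        speedup = base_time / elapsed[cpus]
        efficiency = speedup * base_cpus / cpus
        rows.append((cpus, speedup, efficiency))
    return rows


# Hypothetical elapsed times in seconds, for illustration only.
example_times = {2: 1000.0, 4: 520.0, 8: 280.0}

for cpus, speedup, efficiency in scaling_table(example_times):
    print(f"{cpus:3d} CPUs  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

A slower interconnect shows up in such a table as efficiency falling off more quickly as the CPU count grows.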

Comparison of LAM/MPI and MPICH Performance
[Figure: elapsed time (sec) at 16 and 32 CPUs for MPICH-1.2.4 and LAM/MPI-6.5.6. LS-DYNA, 2.8 GHz x335 (Xeon) cluster, Gigabit Ethernet, refined Neon model (535k elements), March 2003.]

Speedup from Compiler Options

Intel compiler option    Elapsed time (sec)
SSE                      20781
No_SSE                   25110

V960 r1106 MPP-DYNA, LAM/MPI 6.5.2, 2.2 GHz IntelliStation nodes, 12-processor runs, February 2002.
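The two elapsed times on this slide can be turned into a speedup factor and a percentage of time saved. The short calculation below uses only the slide's two numbers; the variable names are illustrative.

```python
# Elapsed times (sec) from the slide: 12-processor MPP-DYNA runs.
sse_time = 20781.0      # built with the SSE compiler option
no_sse_time = 25110.0   # built without SSE

speedup = no_sse_time / sse_time          # how many times faster the SSE build is
time_saved = 1.0 - sse_time / no_sse_time  # fraction of elapsed time removed

print(f"Speedup from SSE: {speedup:.2f}x")   # prints "Speedup from SSE: 1.21x"
print(f"Time saved: {time_saved:.1%}")       # prints "Time saved: 17.2%"
```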

Speedup from Faster 533 MHz Front-Side Bus

Model size (elements)    Speedup: 400 MHz to 533 MHz front-side bus
12000                    1.10
32000                    1.08
155000                   1.20
430000                   1.18

V960 r1488 LS-DYNA, LAM/MPI, 2.8 GHz x335 node, 2-processor runs, March 2003.

Performance Improvement with Version 970
[Figure: elapsed time (sec) versus number of CPUs (2 to 32) for version 960 r1488 and version 970 r3535. MPP-DYNA, 2.8 GHz x335 (Xeon) cluster, Gigabit Ethernet, LAM/MPI, refined Neon model (535k elements), March 2003. Values annotated on the chart: 1.20, 1.15, 1.14, 1.10.]

Chrysler experience
- Customer requirements
  - Reduced turnaround time
  - Price/performance
  - Good accuracy, i.e., the numerical results should match those from the current 64-bit machines
- A team effort
  - Chrysler
  - LSTC
  - IBM
  - Intel
- Eventually all 22 QA models passed the accuracy requirements, and Chrysler bought 108 Xeon-based IBM Linux cluster nodes for car crash simulation

Chrysler is happy with the IBM Linux cluster solution
"Without parallel processing, we never would have achieved 5* (NCAP) and good (IIHS) on our new Chrysler Sebring and Dodge Stratus within the current product development time."
-- Subhas Shetty, Chrysler

Summary
- MPI-based MPP-DYNA has better scalability than the SMP version
- Linux clusters reduced the turnaround time for car crash simulation
- Linux clusters reduced the simulation cost
- The accuracy is satisfactory
- Users today can customize their systems to pick the features that serve them best
  - Processors
  - Operating system
  - Interconnect