ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH-PERFORMANCE COMPUTING




Vladimir Belsky, Director of Solver Development*
Luis Crivelli, Director of Solver Development*
Matt Dunbar, Chief Architect*
Mikhail Belyi, Development Group Manager*
Michael Wood, Developer*
Cristian Ianculescu, Developer*
Mintae Kim, Developer*
Andrzej Bajer, Developer*
*Dassault Systèmes Simulia Corp.
Geraud Krawezik, Developer, Acceleware, Canada

ABSTRACT

In the last decade, significant R&D resources have been invested to deliver commercially available technologies that meet current and future mechanical engineering industry requirements, both in terms of mechanics and performance. While significant focus has been given to developing robust nonlinear finite element analysis technology, there has also been continued investment in advancements for linear dynamic analyses. The research and development efforts have focused on combining advanced linear and nonlinear technology to provide accurate, yet fast modelling of noise and vibration engineering problems. This effort has enabled high-fidelity models to

run in a reasonable time, which is vital for virtual prototyping within shortened product design cycles. While model sizes (degrees of freedom) have grown significantly during this period, the complexity of the models has also increased, which has led to a larger number of total iterations within nonlinear implicit analyses and to a large number of eigenmodes within linear dynamic simulations. An innovative approach has been developed to leverage high-performance computing (HPC) resources to yield reasonable turn-around times for such analyses by taking advantage of massive parallelism without sacrificing any mechanical formulation quality. The accessibility and affordability of HPC hardware in the past few years has changed the landscape of commercial finite element analysis software usage and applications. This change has come in response to an expressed desire from engineers and designers to run their existing simulations faster, or in many cases to run more realistic jobs. Due to their computational cost and the lack of high-performance commercial software, such "high-end" simulations were until recently thought to be available only to academic institutions or government research laboratories, which typically developed their own HPC applications. Today, with the advent of affordable multi-core SMP workstations and of compute clusters with multi-core nodes, high-speed interconnects, and GPGPU accelerators, HPC is sought after by many engineers for routine FEA. This presents a challenge for commercial FEA software vendors, which have to adapt their decades-old legacy code to take advantage of state-of-the-art HPC platforms. Given this background, this paper focuses on how recent developments in HPC have affected the performance of linear dynamic and implicit nonlinear analyses. Two main HPC developments are studied.
First, we look into the performance and scalability of the commercially available Abaqus AMS eigenvalue solver, and of the entire frequency response simulation, running on multi-core SMP workstations. Advances in the AMS eigenvalue solution procedure and in linear dynamic capabilities make realistic simulation suitable for a wide range of vehicle-level noise and vibration studies. Next, we discuss the progress made in a relatively new but very active area of high-performance commercial FE software development: taking advantage of GPGPU accelerators. Efficient adoption of GPGPUs in such products is a very challenging task which requires significant re-architecture of the existing code. We describe our experience in integrating GPGPU acceleration into complex commercial engineering software; in particular, we discuss the trade-offs we had to make and the benefits we obtained from this technology.

KEYWORDS: HPC, Parallel Computing, Cluster Computing, Equation Solver, Nonlinear Implicit FEA, GPGPU, Modal Linear Dynamics, AMS, Automated Multilevel Substructuring, Abaqus

1: AMS (Automatic Multilevel Substructuring) Eigensolver

As model meshes become more refined and accurate, the complexity of the models increases and the size of finite element models grows, all while the demand for faster job turn-around time continues to be strong. The role of a mode-based approach in linear dynamic analyses becomes crucial, given that the direct approach, based on the solution of a system of equations on the physical domain for each excitation frequency, becomes much more expensive as the size of finite element models grows. The most time-consuming task in mode-based linear dynamic analyses is the solution of a large eigenvalue extraction problem to create the modal basis. The most advanced eigenvalue extraction technology suitable to handle today's needs in automotive noise and vibration (N&V) simulation is AMLS. Beginning in 2006, SIMULIA began to offer a version of AMLS, marketed as Abaqus/AMS. The performance of the AMS eigensolver, therefore, becomes crucially important to reduce overall analysis runtime in large-scale N&V simulations. Over the past three years (2007-2010), the Abaqus AMS eigensolver has evolved from an original serial implementation, designed for computers with a single processor and limited memory and able to solve problems with a couple of million equations, to a modern implementation designed for computers with multi-core processors and a large amount of memory, capable of solving larger problems with tens of millions of equations. Beginning with the Abaqus 6.10 Enhanced Functionality release, the AMS eigensolver can run in parallel on shared-memory computers with multiple processors. Since that release, the parallel performance of AMS has been improved substantially. To demonstrate the AMS eigensolver performance on HPC hardware, two automotive industrial models were chosen to run on a machine with four six-core Intel Xeon Nehalem processors and 128 GB physical memory.
The first model, referred to as Model 1, is an automotive vehicle body model with 14.1 million degrees of freedom. This model has an acoustic cavity for coupled structural-acoustic frequency response analysis; the modal basis consists of 5190 structural modes and 266 acoustic modes below the maximum frequency of 600 Hz. The selective recovery capability for the structural domain, which recovers user-requested output variables at a user-defined node set, and the full recovery capability for the acoustic domain, which recovers user-requested output variables at all nodes of the model, are used in this simulation. The second model, Model 2, is a powertrain model with 11.2 million degrees of freedom. The modal basis includes 377 modes below 2500 Hz, and the selective recovery capability is used.

The pre-release version of Abaqus 6.11 was used to obtain the performance data for both models. Table 1 demonstrates the parallel performance of the AMS eigensolver for Model 1. In the table, FREQ indicates the whole frequency extraction procedure, which includes the AMS eigensolver and the non-scalable, non-solver parts of the code, while AMS indicates the AMS eigensolver itself. The AMS eigensolver takes only 25 minutes to solve the eigenproblem on 16 cores, while it takes about 4 hours on a single core. Non-scalable parts become dominant as the number of cores increases. Figure 1 shows the scalability of the AMS eigensolver based on the data in Table 1. Due to the good parallel speedup of AMS, the frequency extraction procedure FREQ shows an overall speedup of about 5.

Table 1. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1

  Number of Cores   FREQ (6.11) Wall Clock Time (h:mm)   AMS (6.11) Wall Clock Time (h:mm)
  1                 4:32                                 4:01
  4                 1:38                                 1:07
  8                 1:09                                 0:39
  16                0:56                                 0:25

Figure 1. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 1. Speedup factors on 1, 4, 8, and 16 cores: AMS (6.11): 1.00, 3.58, 6.21, 9.61; FREQ (6.11): 1.00, 2.77, 3.93, 4.89.

Table 2 and Figure 2 show the parallel performance and scalability of the frequency extraction procedure (FREQ) and the AMS eigensolver (AMS) for Model 2. Due to the good scalability of the AMS eigensolver, the frequency extraction procedure takes only 36 minutes for this large model, which significantly reduces the overall job turn-around time.

Table 2. Performance of the AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2

  Number of Cores   FREQ (6.11) Wall Clock Time (h:mm)   AMS (6.11) Wall Clock Time (h:mm)
  1                 2:57                                 2:33
  4                 1:03                                 0:39
  8                 0:45                                 0:21
  16                0:36                                 0:13
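The speedup factors plotted in Figures 1 and 2 follow directly from the wall-clock times in Tables 1 and 2. As a quick check, a short script of ours (not part of Abaqus) reproduces them from the h:mm entries:

```python
# Parallel speedup computed from the wall-clock times reported in Table 1.
# Times are "h:mm" strings as printed in the table.

def minutes(hmm):
    """Convert an 'h:mm' wall-clock string to minutes."""
    h, m = hmm.split(":")
    return int(h) * 60 + int(m)

def speedups(times_by_cores):
    """Speedup relative to the single-core run, rounded to 2 decimals."""
    base = minutes(times_by_cores[1])
    return {c: round(base / minutes(t), 2) for c, t in times_by_cores.items()}

# Model 1, AMS eigensolver column of Table 1
ams_model1 = {1: "4:01", 4: "1:07", 8: "0:39", 16: "0:25"}
print(speedups(ams_model1))  # close to the 1.00, 3.58, 6.21, 9.61 curve of Figure 1
```

The small deviations from the figure values (e.g. 9.64 vs. 9.61 on 16 cores) come from the one-minute resolution of the tabulated times.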

Figure 2. Scalability of the Abaqus 6.11 AMS eigensolver (AMS) and frequency extraction procedure (FREQ) for Model 2. Speedup factors on 1, 4, 8, and 16 cores: AMS (6.11): 1.00, 3.87, 7.45, 11.70; FREQ (6.11): 1.00, 2.79, 3.96, 4.86.

2: Mode-based Frequency Response Analysis

Mode-based frequency response analysis is the commonly accepted method used by N&V engineers for simulation of noise and vibrations in vehicles and other structures. To reduce the cost of the analysis, the system of equations is solved in a modal subspace. The projection of the finite element system to the modal subspace requires the eigenvalue extraction analysis, which in Abaqus is typically performed using the AMS eigensolver described in the previous section. The projected system of equations in the modal subspace takes the following form:

    [ K − ω²M     −(ωC + D) ] [ Re(Q(ω)) ]   [ Re(F(ω)) ]
    [ ωC + D       K − ω²M  ] [ Im(Q(ω)) ] = [ Im(F(ω)) ]      (1)

Here:
  K - system stiffness matrix;
  M - mass matrix;
  C - viscous damping matrix;
  D - structural damping matrix;
  ω - excitation frequency;
  Q - generalized displacement;
  F - force vector;
  Re() - real part of a complex quantity;
  Im() - imaginary part of a complex quantity.

The size of the modal system (1) is twice the number of modes. If the frequency response is performed in the mid-frequency range, there are often more than 10,000 modes in a complex structure. If only diagonal damping is applied, the mode-based analysis is quite inexpensive because the system of equations (1) becomes decoupled and every equation is solved separately. However, in the mid-frequency range modal damping alone is not sufficient, and material damping (e.g., dashpot elements and material structural damping) must be applied to obtain accurate results. The material damping causes the projected damping operators C and/or D in equation (1) to be fully populated. Thus, a system of linear equations whose size is two times the number of modes (2N) must be solved in the modal subspace at every frequency point. With a few hundred to a thousand frequency points and more than 10,000 modes, this becomes a rather expensive analysis.

Figure 3. The structure of the left-hand side operator for the mode-based frequency response analysis

In the typical case, when the stiffness matrix is symmetric and constant with respect to excitation frequency, the stiffness and mass operators are reduced to diagonal matrices in the modal subspace. The structure of the system of modal equations (1) in this case is presented in Figure 3. The diagonal blocks are diagonal matrices (corresponding to a linear combination of the projected mass and stiffness operators), while the off-diagonal blocks are fully populated (corresponding to the projected structural and viscous damping operators). Traditionally, this system of equations of size 2N is solved at every frequency. Instead, we first take advantage of the diagonal structure of part of the operator and reduce the size of the system by half. Using this reduction we end up with a fully populated system of equations of size N. For the details and derivation of the reduction algorithm we refer to [1].
The reduction phase is dominated by the matrix-matrix multiplication operations, and takes more time than the subsequent solution of the reduced system. Thus, to obtain an efficient parallel algorithm, we need to parallelize both algorithms: the matrix-matrix multiplication and the factorization of the dense system of equations.
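The reduction can be sketched concretely. Writing the block system as [A −B; B A][x; y] = [f; g] with A diagonal and B dense (the structure of Figure 3), eliminating x gives the dense size-N system (A + B·A⁻¹·B)y = g − B·A⁻¹·f, whose formation is exactly the matrix-matrix work mentioned above. The following pure-Python illustration uses a hypothetical 2-mode system with hand-picked numbers; it is our sketch of the standard block elimination, not the production algorithm of [1]:

```python
# Sketch of the 2N -> N reduction for a block system [A -B; B A][x; y] = [f; g]
# with diagonal A (projected K - w^2 M) and dense B (projected w*C + D).
# Hypothetical 2-mode example; values are illustrative only.

a = [4.0, 9.0]                    # diagonal of A
B = [[1.0, 2.0], [3.0, 1.0]]      # dense damping block
f = [1.0, 0.0]
g = [0.0, 1.0]

# Reduced dense operator S = A + B A^-1 B (the matrix-matrix product that
# dominates the reduction phase) and right-hand side r = g - B A^-1 f.
S = [[(a[i] if i == j else 0.0)
      + sum(B[i][k] * B[k][j] / a[k] for k in range(2)) for j in range(2)]
     for i in range(2)]
r = [g[i] - sum(B[i][k] * f[k] / a[k] for k in range(2)) for i in range(2)]

# Solve the 2x2 reduced system by Cramer's rule, then recover x = A^-1(f + B y).
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
y = [(r[0] * S[1][1] - r[1] * S[0][1]) / det,
     (S[0][0] * r[1] - S[1][0] * r[0]) / det]
x = [(f[i] + sum(B[i][k] * y[k] for k in range(2))) / a[i] for i in range(2)]

# Check the residual of the original 2N x 2N block system.
res1 = [a[i] * x[i] - sum(B[i][k] * y[k] for k in range(2)) - f[i] for i in range(2)]
res2 = [sum(B[i][k] * x[k] for k in range(2)) + a[i] * y[i] - g[i] for i in range(2)]
print(max(map(abs, res1 + res2)))  # ~0: the reduced solve reproduces the full system
```

Because A is diagonal, applying A⁻¹ is trivial, so the expensive steps are forming B·A⁻¹·B and factorizing the resulting dense matrix, matching the cost breakdown described above.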

The parallel algorithm for mode-based frequency response analysis is implemented on shared-memory machines. The computationally expensive ingredients of this algorithm, the matrix-matrix products and the dense linear solves, have been parallelized using a task-based approach. This implementation ensures that the memory consumption remains constant regardless of the number of processors used, while achieving almost linear parallel scaling up to the number of general-purpose computational cores of modern hardware. To demonstrate the effectiveness of this algorithm we present an example of a typical N&V analysis of the structural vibration of a car body. The stiffness matrix is symmetric and the model includes some structural damping, so the projected system looks like the one illustrated in Figure 3. Over 10,000 modes were extracted using the Abaqus/AMS eigensolver, and the analysis is performed at 500 frequency points. The presented results were obtained on a machine with four six-core Intel Xeon Nehalem processors and 128 GB physical memory. Table 3 and Figure 4 show the performance and scalability of the modal frequency response solver. An excellent parallel speed-up of 20.63 on 24 cores reduces the wall-clock analysis time from almost 22 hours to about 1 hour. This drastically reduces turn-around time and enables N&V engineers to analyse several design changes during one business day.

Table 3. Analysis time and scalability of the mode-based frequency response solver

  Number of Cores   Wall Clock Time (h:mm)   Parallel Speed-Up
  1                 21:54                    1.00
  2                 11:35                    1.89
  4                 5:48                     3.78
  8                 2:57                     7.44
  16                1:56                     14.24
  24                1:04                     20.63
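The task-based idea can be illustrated with a minimal sketch of ours (not SIMULIA's implementation): a dense matrix product is split into independent row tasks handed to a fixed worker pool, and the result matrix is allocated once up front, so adding workers adds tasks in flight rather than extra matrix storage. In pure Python the threads are serialized by the interpreter lock, so this shows the structure of the approach rather than its speedup:

```python
# Task-based parallel dense matrix product: each task computes a disjoint
# set of output rows into a single preallocated result, so memory use does
# not grow with the number of workers.
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(A, B, rows):
    """Compute the given rows of A @ B (pure-Python dense product)."""
    n = len(B[0])
    return [(i, [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(n)])
            for i in rows]

def task_parallel_matmul(A, B, workers=4):
    m = len(A)
    C = [None] * m                                           # allocated once
    chunks = [range(s, m, workers) for s in range(workers)]  # round-robin row tasks
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(lambda r: matmul_rows(A, B, r), chunks):
            for i, row in part:
                C[i] = row
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(task_parallel_matmul(A, B, workers=2))  # → [[19, 22], [43, 50]]
```

In a compiled implementation the same decomposition lets each core run a dense kernel on its own row block, which is where the near-linear scaling reported below comes from.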

Figure 4. Analysis time of the mode-based frequency response solver on a shared-memory machine with 24 cores. Parallel speed-up on 2, 4, 8, 16, and 24 cores: 1.89, 3.78, 7.44, 14.24, 20.63.

Figure 5 demonstrates the parallel efficiency of the modal frequency response solver. The efficiency is defined as the parallel speed-up divided by the number of cores, times 100%; a parallel efficiency of 100% would thus indicate optimal speed-up. The presented results demonstrate very good efficiency of the modal frequency response solver, about 95% on 2, 4, and 8 cores; on 16 cores the efficiency is just below 90%, and on 24 cores it is about 86%.

Figure 5. Parallel efficiency of the mode-based frequency response solver on a shared-memory machine with 24 cores (efficiency in %, plotted for 2, 4, 8, 16, and 24 cores).
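The efficiency values follow directly from the Table 3 speed-up measurements by the definition just given (speed-up divided by cores, times 100%); a one-line check:

```python
# Parallel efficiency of the modal frequency response solver, computed from
# the Table 3 speed-up measurements as speedup / cores * 100%.
speedup = {2: 1.89, 4: 3.78, 8: 7.44, 16: 14.24, 24: 20.63}
efficiency = {c: round(100.0 * s / c, 1) for c, s in speedup.items()}
print(efficiency)  # → {2: 94.5, 4: 94.5, 8: 93.0, 16: 89.0, 24: 86.0}
```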

3: Acceleration of the direct sparse solver using GPGPUs

GPGPUs offer exceptional floating point operation speed: on recent hardware, theoretical double precision throughput reaches 500 GFlops. Of course, to realize this peak an algorithm must be embarrassingly parallel, since the tremendous processing speed is largely due to the massive parallelism of the GPGPU hardware. One of the challenges to exploiting the power of GPGPUs in general-purpose FEA codes is that it requires re-writing the code in a new language and adapting the algorithm to utilize the GPGPU hardware as fully as possible. Currently there are two GPGPU hardware vendors, each with its own preferred coding language. To maximize the benefit of GPGPU performance while minimizing development effort, we chose to apply this technology to the most floating point intensive portion of any implicit FEA program: the linear equation solver. With minimal changes to our existing solver, we created an interface for the factorization of individual supernodes in our direct sparse solver. We turned to Acceleware Corporation for the implementation of the GPGPU portion of the project. Their experience with GPGPU acceleration of scientific algorithms was helpful in getting our first implementation up and running quickly. In its current form, our GPGPU-accelerated direct solver can greatly reduce the time spent in the solver phase of an FEA analysis for a variety of large models. We have learned that a number of factors must be considered when trying to determine the level of benefit to expect when adding GPGPU compute capability to reduce analysis time. Abaqus provides an out-of-core solver; however, when enough memory is available, the factorization and the subsequent backward pass remain in-core and deliver optimal performance.
Once the problem size exceeds the system memory, I/O costs become significant and reduce the overall benefit of GPGPU acceleration. Another factor is the size of the FEA model. The most important measure of size in this case is not the number of degrees of freedom (DoF) in the model, but the number of floating point operations required for factorization. Thus, a 5 million DoF solid element model may be more computationally intensive than a 10 million DoF shell element model. The target we set for performance gain was an overall speedup of 2x in analysis wall clock time for our benchmark automotive powertrain model, compared to the performance of a 4-core parallel run. The actual results are shown in Figure 6, identified by the number of floating point operations in the solver for this model (1.0E+13). This chart is arranged to show how the amount of work in the solver correlates with the performance improvement when

using a GPGPU for compute acceleration. The effectiveness of GPGPU acceleration increases with problem size, up to the point where the factorization no longer fits in core or an individual supernode does not fit in the GPGPU memory.

Figure 6. Effect of GPGPU acceleration on the performance of 4-core parallel runs. GPGPU speedup (4-core time / (4-core + GPU) time) is plotted for models with solver operation counts ranging from 4.34E+11 to 1.08E+14 floating point operations.

Today it is common for high-performance workstations or compute cluster nodes to have 8 cores. For comparison, the chart of Figure 7 shows 8-core + GPGPU vs. 8-core runs for some of the larger test cases. Here the addition of GPGPU acceleration is again beneficial, but not to the same degree. Increasing the number of cores increases the number of branches in the supernode tree that are solved concurrently. When more than one branch has a supernode eligible for processing on the GPGPU, there is contention for the GPGPU resource. This results in either a delay (waiting for the GPGPU to become available) or processing the supernode on the slower CPU resources.
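The per-supernode offload decision described in this section can be caricatured by a simple heuristic. This is our sketch only: the function names, thresholds, and memory model are illustrative, not Abaqus internals. A supernode's dense front goes to the GPGPU only when it fits in device memory and carries enough floating point work to amortize the host-device transfer; otherwise it stays on the CPU, which is also the fallback when the device is busy.

```python
# Hypothetical dispatch heuristic for supernode factorization (illustrative;
# not the actual Abaqus/Acceleware scheduling logic).

def factor_flops(n):
    """Approximate flop count for dense Cholesky of an n x n frontal matrix."""
    return n ** 3 / 3.0

def choose_device(n, gpu_mem_bytes, min_gpu_flops=1e9):
    """Return 'gpu' or 'cpu' for a supernode with an n x n dense front."""
    bytes_needed = 8 * n * n          # double precision storage for the front
    if bytes_needed > gpu_mem_bytes:  # front must fit in GPGPU memory
        return "cpu"
    if factor_flops(n) < min_gpu_flops:
        return "cpu"                  # too little work to amortize the transfer
    return "gpu"

print(choose_device(200, 4 * 2**30))   # small front -> 'cpu'
print(choose_device(4000, 4 * 2**30))  # large front -> 'gpu'
```

Under such a rule the GPGPU benefit naturally grows with the solver's total flop count, consistent with the trend in Figures 6 and 7, until fronts exceed device memory.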

Figure 7. Effect of GPGPU acceleration on the performance of 8-core parallel runs. Speedup (8-core time / (8-core + GPU) time) is shown for test cases with 5.8E+12, 1.0E+13, 2.6E+13, and 1.1E+14 solver floating point operations.

Future developments to further leverage GPGPU acceleration of our direct sparse solver will target deployment on multiple nodes of a compute cluster. Going forward, we hope to find applications for GPGPU compute acceleration outside of our direct sparse solver.

REFERENCES

1. Bajer, A., "Performance Improvement Algorithm for Mode-Based Frequency Response Analysis," SAE Paper No. 2009-01-2223, 2009.