Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks

Garron K. Morris, Senior Project Thermal Engineer, Standard Drives Division (gkmorris@ra.rockwell.com)
Bruce W. Weiss, Principal Engineer (bwweiss@ra.rockwell.com)

Copyright 2011 Rockwell Automation, Inc. All rights reserved.
Outline

- Introduction to Rockwell Automation
- Need for High-Performance Computing with Icepak
- High-Performance Computer Specifications
- Overview of Benchmark Problems
- Benchmark Metrics
- Benchmark Results: Small Blower
- Benchmark Results: Scalable Motor Drive
- Icepak High-Performance Computing Configuration and Use at Rockwell Automation
- High-Performance Computing Benefits
- Conclusions
Rockwell Automation At A Glance

- Fiscal 2010 sales: approximately $4.9 billion
- Employees: about 19,000
- Serving customers in 80+ countries; emerging markets account for over 20% of total sales
- Leading global provider of industrial power, control, and information solutions
- World headquarters: Milwaukee, Wisconsin, USA
- Trading symbol: ROK
Need for High-Performance Computing

Customer demands for higher current outputs and smaller footprints present a challenge for the thermal management of cabinet-size, scalable motor drives. Highly detailed component models are required to predict the temperatures that are used to establish motor drive rating and reliability.

[Figure: Model Detail versus Model Scale. Detail ranges from a detailed IGBT + heat sink and an MRF blower model at the component level, through an MRF blower with simplified heat sink and simplified IGBT heat sources at the motor drive level, to the desired capability at the scalable motor drive level.]

As model size scales, Icepak models must be defeatured to stay within computing hardware capabilities. As the model scales:
- Information can be lost.
- Confidence in the model can decrease.
- Engineers are unable to model multi-cabinet motor drives due to computer resource limits.

Challenge: micro-level detail at macro-level scale and beyond.
High-Performance Computing Systems

Three computers are used for benchmarking. The benchmarking results established guidelines for Icepak HPC simulations at Rockwell Automation.

| | Milwaukee Institute Linux | RA Windows Workstation | RA Linux Compute Node |
|---|---|---|---|
| Location | Milwaukee Institute | Rockwell Automation | Rockwell Automation |
| Manufacturer/Model | Dell | Dell Precision T7500 | Dell PowerEdge R710 |
| Microprocessor | Dual Quad-Core Intel X5550 | Dual Quad-Core Intel X5550 | Dual Six-Core Intel X5690 |
| Clock Speed | 2.66 GHz | 2.66 GHz | 3.46 GHz |
| RAM | 24 GB, 1333 MHz | 48 GB, 1333 MHz | 96 GB, 1333 MHz |
| Operating System | 64-bit Linux | 64-bit Windows 7 | 64-bit Linux |
| Hyper-threading | On | Off | Off |
| Number of Nodes | 13 | 1 | 1 |
Small Benchmark Problem: MRF Blower

Detailed blower model using the Moving Reference Frame (MRF) technique.

| Parameter | Value |
|---|---|
| Elements | 629,364 |
| Solution | Flow |
| Turbulence Model | Realizable two-equation |
| Discretization Scheme | All first order |
| Under-Relaxation | Pressure: 0.3; Momentum: 0.7; Viscosity: 1.0; Turbulent KE: 0.5; Dissipation: 0.5 |
| Stabilization | Pressure: BCGSTAB |
| Precision | Single |
| Partitioning Method | Principal Axes |
| Icepak Version | 13.0.2 |
Large Benchmark Model: Scalable Motor Drive

Scalable motor drive with one, two, and three identical modules, including air flow interaction between cabinets.

| Parameter | Value |
|---|---|
| Elements | 6.2, 11.7, 17.1 million |
| Solution | Flow |
| Turbulence Model | Realizable two-equation |
| Discretization Scheme | All first order |
| Under-Relaxation | Pressure: 0.3; Momentum: 0.7; Viscosity: 1.0; Temperature: 1.0; Turbulent KE: 0.5; Dissipation: 0.5 |
| Stabilization | None |
| Precision | Single |
| Partitioning Method | Principal Axes |
| Icepak Version | 13.0.2 |

Each motor drive contains (1) MRF blower and (3) detailed heat sinks with detailed IGBTs, plus circuit boards, bus bars, resistors, capacitors, etc.

[Figure: Single (6.2 million elements), Double (11.7 million elements), and Triple (17.1 million elements) motor drive configurations.]
Benchmark Metrics

Speed Up:

$$S_{N,500} = \frac{t_{1,\,500\text{ iterations}}}{t_{N,\,500\text{ iterations}}} = \frac{\text{elapsed time for 500 iterations using one core}^{*}}{\text{elapsed time for 500 iterations using }N\text{ cores}^{*}}$$

* Taken from the elapsed time value found in the uns_out file for each solution.

Amdahl's Law: If P is the fraction of a program that can benefit from parallelization and N is the number of cores, then according to Amdahl's Law the calculated Speed Up is:

$$S_{N}^{\text{Amdahl}} = \frac{1}{(1 - P) + P/N}$$

Assuming a theoretical maximum parallelization of P = 100%, Amdahl's Law reduces to:

$$S_{N}^{\text{Amdahl}} = N$$
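Both metrics are simple enough to script. A minimal Python sketch, assuming hypothetical elapsed times in minutes (placeholders, not the benchmark's measured values) read manually from the uns_out files:

```python
# Minimal sketch: compute Solver Speed Up and the Amdahl's Law prediction
# from elapsed times. The timing values below are hypothetical placeholders.

def speed_up(t_one_core: float, t_n_cores: float) -> float:
    """S_N = elapsed time on one core / elapsed time on N cores."""
    return t_one_core / t_n_cores

def amdahl(n_cores: int, p: float = 1.0) -> float:
    """Amdahl's Law: S_N = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n_cores)

# Hypothetical elapsed times [min] for 500 iterations at 1, 2, 4, 6, 8 cores.
elapsed = {1: 96.0, 2: 52.0, 4: 30.0, 6: 24.0, 8: 22.0}
for n, t in elapsed.items():
    print(f"{n} cores: S = {speed_up(elapsed[1], t):.2f} "
          f"(Amdahl, P = 100%: {amdahl(n):.0f})")
```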
SMALL BENCHMARK PROBLEM: MRF BLOWER
MRF Blower Benchmark (629,364 elements)

[Figure: Elapsed time to convergence [min] versus number of parallel cores (0-24) for the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), the RA Windows Workstation (dual 4-core, 2.66 GHz, 48 GB), and the RA Linux Compute Node (dual 6-core, 3.46 GHz, 96 GB).]

- Elapsed time decreases significantly as the number of cores increases, up to 6 cores; more than 6 cores has a negligible effect.
- Comparing the Linux machines, the reduction in elapsed time (32% average) scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
MRF Blower Benchmark (629,364 elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines closely follow Amdahl's Law up to 4 cores.
- The RA Windows Workstation tops out at 6 cores.
MRF Blower Benchmark (629,364 elements)

[Figure: Elapsed time to convergence [min] versus number of cores (0-128) on the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), comparing Parallel Solution and Network Parallel Solution, with 1, 2, and 3 HPC Pack boundaries marked.]

- A 16-core Parallel Solution performs 20% faster than a 16-core Network Parallel Solution.
- Network Parallel Solution performance decreases above 32 cores due to communication overhead.
- A solution using 2 HPC Packs (32 cores) reduces solution time by 33% versus 1 HPC Pack (8 cores); the pack-to-core mapping is sketched below.
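The HPC Pack boundaries above follow the ANSYS licensing scheme in which each additional pack quadruples the enabled core count (8, 32, 128, ...). A minimal sketch of that mapping, with the exact licensing terms treated as an assumption to verify against current ANSYS rules:

```python
# Cores enabled by n ANSYS HPC Packs: 8 * 4**(n - 1), i.e. 8, 32, 128, ...
# (Mapping as reflected in the benchmark axes; verify against current
# ANSYS licensing terms.)

def hpc_pack_cores(n_packs: int) -> int:
    return 8 * 4 ** (n_packs - 1)

def packs_needed(cores: int) -> int:
    """Smallest number of HPC Packs that enables at least `cores` cores."""
    n = 1
    while hpc_pack_cores(n) < cores:
        n += 1
    return n

for n in (1, 2, 3):
    print(f"{n} HPC Pack(s): {hpc_pack_cores(n)} cores")
print(f"64 cores require {packs_needed(64)} HPC Packs")
```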
LARGE BENCHMARK PROBLEM: SCALABLE MOTOR DRIVE
Single Motor Drive (6.2 million elements)

[Figure: Elapsed time at 500 iterations [min] versus number of parallel cores (0-24) for all three machines.]

- Both dual 4-core, 2.66 GHz machines show nearly identical performance from 2 to 6 cores; the additional memory on the 2.66 GHz CPUs has no effect.
- Comparing the Linux machines, the reduction in elapsed time (24% average) roughly scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
- Both Rockwell Automation machines show no or negative performance gains above 6 cores.
Single Motor Drive (6.2 million elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines follow Amdahl's Law up to 2 cores.
- The RA Linux Compute Node starts to deviate from Amdahl's Law at 3 cores.
Simplified Single Motor Drive (3.2 million elements)

[Figure: Elapsed time to convergence [min] versus number of cores (0-128) on the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), comparing Parallel Solution and Network Parallel Solution, with 1, 2, and 3 HPC Pack boundaries marked.]

- The Network Parallel Solution performs the fastest, peaking at 64 cores.
- Network Parallel Solution performance decreases above 64 cores due to communication overhead.
- A solution using 2 HPC Packs (32 cores) reduces solution time by 59% versus 1 HPC Pack (8 cores).
Double Motor Drive (11.7 million elements)

[Figure: Elapsed time at 500 iterations [min] versus number of parallel cores (0-24) for all three machines.]

- Doubling memory on the 4-core, 2.66 GHz machine shows ~8% better performance from 1 to 4 cores.
- Comparing the Linux machines, the reduction in elapsed time (21% average) almost scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
- Both Rockwell Automation machines show little or negative performance gains above 6 cores.
Double Motor Drive (11.7 million elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines deviate more from Amdahl's Law than with the Single Motor Drive model.
- The RA Linux Compute Node starts to deviate from Amdahl's Law at 2 cores.
Motor Drive Model Scaling

[Figure: Elapsed time at 500 iterations [min] versus number of elements (3.3, 6.2, 11.7, and 17.3 million) for the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB) and the RA Linux Compute Node (dual 6-core, 3.46 GHz, 96 GB) at several core counts.]

- The triple model could not be run on the Windows Workstation.
- Elapsed time increases linearly with the number of elements (< 12 million); a scaling estimate based on this observation is sketched below.
- The slope decreases as cores increase from 2 to 6.
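Because elapsed time scales roughly linearly with element count below ~12 million elements, the solve time for a new model can be estimated from a few measured runs. A minimal Python sketch, assuming hypothetical timings rather than the benchmark's measured values:

```python
# Estimate 500-iteration solve time from element count, assuming the
# linear scaling observed below ~12 million elements. The (elements,
# minutes) pairs below are hypothetical placeholders.

def fit_linear(points):
    """Least-squares fit t = a * elements + b through measured runs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

runs = [(3.3e6, 250.0), (6.2e6, 470.0), (11.7e6, 900.0)]
a, b = fit_linear(runs)
print(f"Estimated time for a 9.0M-element model: {a * 9.0e6 + b:.0f} min")
```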
HPC Computer Implementation

- The Linux Compute Node implementation at Rockwell Automation has been in place since May 2011.
- We moved from a "dedicated workstation + laptop" model to a "laptop + shared computing node" model.
- RA Linux Compute Node: dual 6-core, 3.46 GHz, 96 GB; licensing: 2 HPC Packs (8 core) and 4 HPC (1 core).
- Based on our internal cost models for four Icepak users, the laptop with shared computing node approach saves 43% per user.
HPC Use Rules at Rockwell Automation

Based on the benchmarks, the optimal number of parallel cores to minimize solution time is:

| Benchmark Model | Model Size | RA Linux Compute Node | RA Windows Workstation |
|---|---|---|---|
| MRF Blower | 0.6 million | 8 cores | 8 cores |
| Single Motor Drive | 6.2 million | 6 cores | 6 cores |
| Double Motor Drive | 11.7 million | 6 cores | 8 cores |

Using the data above, we have created a rule that the maximum allowable number of cores for any parallel solution in Icepak is six. This allows better utilization of the dual six-core Linux Compute Node and the HPC Packs. A sketch of how such a cap might be enforced follows below.
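A minimal sketch of enforcing the six-core rule in a job-submission helper; the launcher script named here is a hypothetical placeholder, not Icepak's actual batch interface:

```python
# Clamp the requested core count to the site rule (max 6 cores per
# Icepak parallel solution) before launching a job. The launcher script
# below is a hypothetical placeholder, not Icepak's real batch syntax.
import subprocess

MAX_CORES = 6  # site rule derived from the benchmarks above

def launch_icepak_job(project: str, requested_cores: int) -> None:
    cores = min(requested_cores, MAX_CORES)
    if cores != requested_cores:
        print(f"Capping {requested_cores} -> {cores} cores per site rule")
    # Hypothetical launcher; replace with the real solver invocation.
    subprocess.run(["./run_icepak_batch.sh", project, str(cores)], check=True)

launch_icepak_job("double_motor_drive", requested_cores=12)
```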
HPC Benefit at Rockwell Automation

With High-Performance Computing and Icepak, there is no need to defeature as model scale increases:
- No loss of information
- No loss of confidence in the model
- No compromises

We can now economically simulate scalable motor drives and push the limits of power density.

[Figure: Model Detail versus Model Scale, now spanning the detailed IGBT + heat sink and MRF blower model at the component level through the motor drive and scalable motor drive levels without defeaturing.]
Conclusions

- More parallel cores do not guarantee a faster solution.
- Amdahl's Law holds only up to two cores, independent of model size; as model size increases, the deviation from Amdahl's Law increases.
- The reduction in elapsed time is proportional to the increase in CPU speed.
- For large models near 10 million elements, doubling memory for the same CPU reduced solve time by ~8%.
- Elapsed time scales linearly with model size up to 12 million elements.
- A Parallel Solution with 6 to 8 cores is optimal for the Windows Workstation and Linux Compute Nodes used at Rockwell Automation, for models up to 12 million elements.
Acknowledgements

The authors gratefully acknowledge:
- Jonathon Gladieux and Jay Bayne at the Milwaukee Institute for providing 8 months of unfettered access to their high-performance computing system to perform benchmarks with Icepak.
- Mark Jenich at ANSYS for generously providing Icepak and multiple HPC Pack licenses to the Milwaukee Institute.
- Richard Lukaszewski and Sujeet Chand at Rockwell Automation for supporting this work.
- Jerry Rabe at Rockwell Automation for setting up and configuring the Linux Compute Node.
APPENDIX
MRF Blower Benchmark Metrics (629,364 elements)

[Figure: Comparison of Solver Speed Up metrics in three panels (Milwaukee Institute Linux, dual 4-core, 2.66 GHz, 24 GB; RA Windows Workstation, dual 4-core, 2.66 GHz, 48 GB; RA Linux Compute Node, dual 6-core, 3.46 GHz, 96 GB). Each panel plots Solver Speed Up based on 500 iterations, Solver Speed Up based on the converged solution, and Amdahl's Law (theoretical linear speed up) against the number of parallel cores (0-24).]

$$S_{N,500} = \frac{\text{elapsed time for 500 iterations using one core}}{\text{elapsed time for 500 iterations using }N\text{ cores}}$$

- There is less than 8% difference between the converged and 500-iteration Solver Speed Up metrics.
- We will use Solver Speed Up at 500 iterations for all Large Benchmark problems.