Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks

Garron K. Morris, Senior Project Thermal Engineer, Standard Drives Division (gkmorris@ra.rockwell.com)
Bruce W. Weiss, Principal Engineer (bwweiss@ra.rockwell.com)

Copyright 2011 Rockwell Automation, Inc. All rights reserved.
Outline

- Introduction to Rockwell Automation
- Need for High-Performance Computing with Icepak
- High-Performance Computer Specifications
- Overview of Benchmark Problems
- Benchmark Metrics
- Benchmark Results: Small Blower
- Benchmark Results: Scalable Motor Drive
- Icepak High-Performance Computing Configuration and Use at Rockwell Automation
- High-Performance Computing Benefits
- Conclusions
Rockwell Automation At A Glance

- Fiscal 2010 sales: approximately $4.9 billion
- Employees: about 19,000
- Serving customers in 80+ countries; emerging markets account for over 20% of total sales
- Leading global provider of industrial power, control, and information solutions
- World headquarters: Milwaukee, Wisconsin, USA
- Trading symbol: ROK
Need for High-Performance Computing

Customer demands for higher current outputs and smaller footprints present a challenge for the thermal management of cabinet-size, scalable motor drives. Highly detailed component models are required to predict the temperatures that are used to establish motor drive rating and reliability.

[Figure: Model Detail versus Model Scale. Detail ranges from a detailed IGBT + heat sink and an MRF blower model at the component level, through an MRF blower with simplified heat sink and simplified IGBT heat sources at the motor drive level, to the desired capability at the scalable motor drive level.]

As model size scales, Icepak models must be defeatured to stay within computing hardware capabilities. As the model scales:
- Information can be lost.
- Confidence in the model can decrease.
- Engineers are unable to model multi-cabinet motor drives due to computer resource limits.

Challenge: micro-level detail at macro-level scale and beyond.
High-Performance Computing Systems

Three computers are used for benchmarking. The benchmarking results established guidelines for Icepak HPC simulations at Rockwell Automation.

| | Milwaukee Institute Linux | RA Windows Workstation | RA Linux Compute Node |
|---|---|---|---|
| Location | Milwaukee Institute | Rockwell Automation | Rockwell Automation |
| Manufacturer/Model | Dell | Dell Precision T7500 | Dell PowerEdge R710 |
| Microprocessor | Dual Quad-Core Intel X5550 | Dual Quad-Core Intel X5550 | Dual Six-Core Intel X5690 |
| Clock Speed | 2.66 GHz | 2.66 GHz | 3.46 GHz |
| RAM | 24 GB, 1333 MHz | 48 GB, 1333 MHz | 96 GB, 1333 MHz |
| Operating System | 64-bit Linux | 64-bit Windows 7 | 64-bit Linux |
| Hyper-threading | On | Off | Off |
| Number of Nodes | 13 | 1 | 1 |
Small Benchmark Problem: MRF Blower

Detailed blower model using the Moving Reference Frame (MRF) technique.

| Parameter | Value |
|---|---|
| Elements | 629,364 |
| Solution | Flow |
| Turbulence Model | Realizable two-equation |
| Discretization Scheme | All first order |
| Under-Relaxation | Pressure: 0.3; Momentum: 0.7; Viscosity: 1.0; Turbulent KE: 0.5; Dissipation: 0.5 |
| Stabilization | Pressure: BCGSTAB |
| Precision | Single |
| Partitioning Method | Principal Axes |
| Icepak Version | 13.0.2 |
Large Benchmark Model: Scalable Motor Drive

Scalable motor drive with one, two, and three identical modules, including air flow interaction between cabinets.

| Parameter | Value |
|---|---|
| Elements | 6.2, 11.7, 17.1 million |
| Solution | Flow |
| Turbulence Model | Realizable two-equation |
| Discretization Scheme | All first order |
| Under-Relaxation | Pressure: 0.3; Momentum: 0.7; Viscosity: 1.0; Temperature: 1.0; Turbulent KE: 0.5; Dissipation: 0.5 |
| Stabilization | None |
| Precision | Single |
| Partitioning Method | Principal Axes |
| Icepak Version | 13.0.2 |

Each motor drive contains (1) MRF blower and (3) detailed heat sinks with detailed IGBTs, plus circuit boards, bus bars, resistors, capacitors, etc.

[Figure: Single (6.2 million elements), Double (11.7 million elements), and Triple (17.1 million elements) motor drive configurations.]
Benchmark Metrics

Speed Up:

$$S_{N,500} = \frac{t_{1,\,500\text{ iterations}}}{t_{N,\,500\text{ iterations}}} = \frac{\text{elapsed time for 500 iterations using one core}^{*}}{\text{elapsed time for 500 iterations using }N\text{ cores}^{*}}$$

* Taken from the elapsed time value found in the uns_out file for each solution.

Amdahl's Law: If P is the fraction of a program that can benefit from parallelization and N is the number of cores, then according to Amdahl's Law the calculated Speed Up is:

$$S_{N}^{\text{Amdahl}} = \frac{1}{(1 - P) + P/N}$$

Assuming a theoretical maximum parallelization of P = 100%, Amdahl's Law reduces to:

$$S_{N}^{\text{Amdahl}} = N$$
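Both metrics are simple enough to script. A minimal Python sketch, assuming hypothetical elapsed times in minutes (placeholders, not the benchmark's measured values) read manually from the uns_out files:

```python
# Minimal sketch: compute Solver Speed Up and the Amdahl's Law prediction
# from elapsed times. The timing values below are hypothetical placeholders.

def speed_up(t_one_core: float, t_n_cores: float) -> float:
    """S_N = elapsed time on one core / elapsed time on N cores."""
    return t_one_core / t_n_cores

def amdahl(n_cores: int, p: float = 1.0) -> float:
    """Amdahl's Law: S_N = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n_cores)

# Hypothetical elapsed times [min] for 500 iterations at 1, 2, 4, 6, 8 cores.
elapsed = {1: 96.0, 2: 52.0, 4: 30.0, 6: 24.0, 8: 22.0}
for n, t in elapsed.items():
    print(f"{n} cores: S = {speed_up(elapsed[1], t):.2f} "
          f"(Amdahl, P = 100%: {amdahl(n):.0f})")
```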
SMALL BENCHMARK PROBLEM: MRF BLOWER
MRF Blower Benchmark (629,364 elements)

[Figure: Elapsed time to convergence [min] versus number of parallel cores (0-24) for the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), the RA Windows Workstation (dual 4-core, 2.66 GHz, 48 GB), and the RA Linux Compute Node (dual 6-core, 3.46 GHz, 96 GB).]

- Elapsed time decreases significantly as the number of cores increases, up to 6 cores; more than 6 cores has a negligible effect.
- Comparing the Linux machines, the reduction in elapsed time (32% average) scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
MRF Blower Benchmark (629,364 elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines closely follow Amdahl's Law up to 4 cores.
- The RA Windows Workstation tops out at 6 cores.
MRF Blower Benchmark (629,364 elements)

[Figure: Elapsed time to convergence [min] versus number of cores (0-128) on the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), comparing Parallel Solution and Network Parallel Solution, with 1, 2, and 3 HPC Pack boundaries marked.]

- A 16-core Parallel Solution performs 20% faster than a 16-core Network Parallel Solution.
- Network Parallel Solution performance decreases above 32 cores due to communication overhead.
- A solution using 2 HPC Packs (32 cores) reduces solution time by 33% versus 1 HPC Pack (8 cores); the pack-to-core mapping is sketched below.
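The HPC Pack boundaries above follow the ANSYS licensing scheme in which each additional pack quadruples the enabled core count (8, 32, 128, ...). A minimal sketch of that mapping, with the exact licensing terms treated as an assumption to verify against current ANSYS rules:

```python
# Cores enabled by n ANSYS HPC Packs: 8 * 4**(n - 1), i.e. 8, 32, 128, ...
# (Mapping as reflected in the benchmark axes; verify against current
# ANSYS licensing terms.)

def hpc_pack_cores(n_packs: int) -> int:
    return 8 * 4 ** (n_packs - 1)

def packs_needed(cores: int) -> int:
    """Smallest number of HPC Packs that enables at least `cores` cores."""
    n = 1
    while hpc_pack_cores(n) < cores:
        n += 1
    return n

for n in (1, 2, 3):
    print(f"{n} HPC Pack(s): {hpc_pack_cores(n)} cores")
print(f"64 cores require {packs_needed(64)} HPC Packs")
```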
LARGE BENCHMARK PROBLEM: SCALABLE MOTOR DRIVE
Single Motor Drive (6.2 million elements)

[Figure: Elapsed time at 500 iterations [min] versus number of parallel cores (0-24) for all three machines.]

- Both dual 4-core, 2.66 GHz machines show nearly identical performance from 2 to 6 cores; the additional memory on the 2.66 GHz CPUs has no effect.
- Comparing the Linux machines, the reduction in elapsed time (24% average) roughly scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
- Both Rockwell Automation machines show no or negative performance gains above 6 cores.
Single Motor Drive (6.2 million elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines follow Amdahl's Law up to 2 cores.
- The RA Linux Compute Node starts to deviate from Amdahl's Law at 3 cores.
Simplified Single Motor Drive (3.2 million elements)

[Figure: Elapsed time to convergence [min] versus number of cores (0-128) on the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB), comparing Parallel Solution and Network Parallel Solution, with 1, 2, and 3 HPC Pack boundaries marked.]

- The Network Parallel Solution performs the fastest, peaking at 64 cores.
- Network Parallel Solution performance decreases above 64 cores due to communication overhead.
- A solution using 2 HPC Packs (32 cores) reduces solution time by 59% versus 1 HPC Pack (8 cores).
Double Motor Drive (11.7 million elements)

[Figure: Elapsed time at 500 iterations [min] versus number of parallel cores (0-24) for all three machines.]

- Doubling memory on the 4-core, 2.66 GHz machine shows ~8% better performance from 1 to 4 cores.
- Comparing the Linux machines, the reduction in elapsed time (21% average) almost scales with CPU speed: (3.46 - 2.66)/2.66 ≈ 30%.
- Both Rockwell Automation machines show little or negative performance gains above 6 cores.
Double Motor Drive (11.7 million elements)

[Figure: Solver Speed Up versus number of parallel cores (0-24) for all three machines, with Amdahl's Law (theoretical linear speed up) for reference.]

- All machines deviate more from Amdahl's Law than with the Single Motor Drive model.
- The RA Linux Compute Node starts to deviate from Amdahl's Law at 2 cores.
Motor Drive Model Scaling

[Figure: Elapsed time at 500 iterations [min] versus number of elements (3.3, 6.2, 11.7, and 17.3 million) for the Milwaukee Institute Linux cluster (dual 4-core, 2.66 GHz, 24 GB) and the RA Linux Compute Node (dual 6-core, 3.46 GHz, 96 GB) at several core counts.]

- The triple model could not be run on the Windows Workstation.
- Elapsed time increases linearly with the number of elements (< 12 million); a scaling estimate based on this observation is sketched below.
- The slope decreases as cores increase from 2 to 6.
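Because elapsed time scales roughly linearly with element count below ~12 million elements, the solve time for a new model can be estimated from a few measured runs. A minimal Python sketch, assuming hypothetical timings rather than the benchmark's measured values:

```python
# Estimate 500-iteration solve time from element count, assuming the
# linear scaling observed below ~12 million elements. The (elements,
# minutes) pairs below are hypothetical placeholders.

def fit_linear(points):
    """Least-squares fit t = a * elements + b through measured runs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

runs = [(3.3e6, 250.0), (6.2e6, 470.0), (11.7e6, 900.0)]
a, b = fit_linear(runs)
print(f"Estimated time for a 9.0M-element model: {a * 9.0e6 + b:.0f} min")
```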
HPC Computer Implementation

- The Linux Compute Node implementation at Rockwell Automation has been in place since May 2011.
- We moved from a "dedicated workstation + laptop" model to a "laptop + shared computing node" model.
- RA Linux Compute Node: dual 6-core, 3.46 GHz, 96 GB; licensing: 2 HPC Packs (8 core) and 4 HPC (1 core).
- Based on our internal cost models for four Icepak users, the laptop with shared computing node approach saves 43% per user.
HPC Use Rules at Rockwell Automation

Based on the benchmarks, the optimal number of parallel cores to minimize solution time is:

| Benchmark Model | Model Size | RA Linux Compute Node | RA Windows Workstation |
|---|---|---|---|
| MRF Blower | 0.6 million | 8 cores | 8 cores |
| Single Motor Drive | 6.2 million | 6 cores | 6 cores |
| Double Motor Drive | 11.7 million | 6 cores | 8 cores |

Using the data above, we have created a rule that the maximum allowable number of cores for any parallel solution in Icepak is six. This allows better utilization of the dual six-core Linux Compute Node and the HPC Packs. A sketch of how such a cap might be enforced follows below.
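A minimal sketch of enforcing the six-core rule in a job-submission helper; the launcher script named here is a hypothetical placeholder, not Icepak's actual batch interface:

```python
# Clamp the requested core count to the site rule (max 6 cores per
# Icepak parallel solution) before launching a job. The launcher script
# below is a hypothetical placeholder, not Icepak's real batch syntax.
import subprocess

MAX_CORES = 6  # site rule derived from the benchmarks above

def launch_icepak_job(project: str, requested_cores: int) -> None:
    cores = min(requested_cores, MAX_CORES)
    if cores != requested_cores:
        print(f"Capping {requested_cores} -> {cores} cores per site rule")
    # Hypothetical launcher; replace with the real solver invocation.
    subprocess.run(["./run_icepak_batch.sh", project, str(cores)], check=True)

launch_icepak_job("double_motor_drive", requested_cores=12)
```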
HPC Benefit at Rockwell Automation

With High-Performance Computing and Icepak, there is no need to defeature as model scale increases:
- No loss of information
- No loss of confidence in the model
- No compromises

We can now economically simulate scalable motor drives and push the limits of power density.

[Figure: Model Detail versus Model Scale, now spanning the detailed IGBT + heat sink and MRF blower model at the component level through the motor drive and scalable motor drive levels without defeaturing.]
Conclusions

- More parallel cores do not guarantee a faster solution.
- Amdahl's Law holds only up to two cores, independent of model size; as model size increases, the deviation from Amdahl's Law increases.
- The reduction in elapsed time is proportional to the increase in CPU speed.
- For large models near 10 million elements, doubling memory for the same CPU reduced solve time by ~8%.
- Elapsed time scales linearly with model size up to 12 million elements.
- A Parallel Solution with 6 to 8 cores is optimal for the Windows Workstation and Linux Compute Nodes used at Rockwell Automation, for models up to 12 million elements.
Acknowledgements

The authors gratefully acknowledge:
- Jonathon Gladieux and Jay Bayne at the Milwaukee Institute for providing 8 months of unfettered access to their high-performance computing system to perform benchmarks with Icepak.
- Mark Jenich at ANSYS for generously providing Icepak and multiple HPC Pack licenses to the Milwaukee Institute.
- Richard Lukaszewski and Sujeet Chand at Rockwell Automation for supporting this work.
- Jerry Rabe at Rockwell Automation for setting up and configuring the Linux Compute Node.
APPENDIX
MRF Blower Benchmark Metrics (629,364 elements)

[Figure: Comparison of Solver Speed Up metrics in three panels (Milwaukee Institute Linux, dual 4-core, 2.66 GHz, 24 GB; RA Windows Workstation, dual 4-core, 2.66 GHz, 48 GB; RA Linux Compute Node, dual 6-core, 3.46 GHz, 96 GB). Each panel plots Solver Speed Up based on 500 iterations, Solver Speed Up based on the converged solution, and Amdahl's Law (theoretical linear speed up) against the number of parallel cores (0-24).]

$$S_{N,500} = \frac{\text{elapsed time for 500 iterations using one core}}{\text{elapsed time for 500 iterations using }N\text{ cores}}$$

- There is less than 8% difference between the converged and 500-iteration Solver Speed Up metrics.
- We will use Solver Speed Up at 500 iterations for all Large Benchmark problems.