High Performance Computing (HPC) CAEA e-Learning Series Jonathan G. Dudley, Ph.D. 06/09/2015 © 2015 CAE Associates
Agenda: Introduction; HPC Background (Why HPC, SMP vs. DMP, Licensing); HPC Terminology; Types of HPC: HPC Cluster & Workstation; HPC Hardware Components: CPU vs. Cores, GPU vs. Phi, HDD vs. SSD, Interconnects; GPU Acceleration
CAE Associates Inc.: an engineering consulting firm in Middlebury, CT specializing in FEA and CFD analysis. ANSYS Channel Partner since 1985, providing sales of ANSYS products, training, and technical support.
e-Learning Webinar Series: This presentation is part of a series of e-learning webinars offered by CAE Associates. You can view many of our previous e-learning sessions either on our website or on the CAE Associates YouTube channel. If you are a New Jersey or New York resident, you can earn continuing education credit for attending the full webinar and completing a survey which will be emailed to you after the presentation.
CAEA Resource Library: Our Resource Library contains over 250 items, including consulting case studies, conference and seminar presentations, software demonstrations, and useful macros and scripts. The content is searchable and you can download copies of the material to review at your convenience.
CAEA Engineering Advantage Blog: Our Engineering Advantage Blog offers weekly insights from our experienced technical staff.
CAEA ANSYS Training: Classes can be held at our Training Center at CAE Associates or on-site at your location. CAE Associates is offering on-line training classes in 2015! Registration is available on our website.
Agenda: Introduction; HPC Background (Why HPC, Licensing, SMP vs. DMP); HPC Terminology; Types of HPC: HPC Cluster & Workstation; HPC Hardware Components: CPU vs. Cores, GPU vs. Phi, HDD vs. SSD, Interconnects; GPU Acceleration
Why High Performance Computing (HPC)? Remove computing limitations from engineers in all phases of design, analysis, and testing. Impact product design: faster simulation, more efficient parametric studies. Larger models: more accuracy (turbulence modeling, particle tracking) and more refined models. Design optimization: more runs for a fixed hardware configuration.
Why HPC? Using today's multicore computers is key for companies to remain competitive. The ANSYS HPC product suite allows scalability to whatever computational level is required, from single-user or small user group options at entry level up to virtually unlimited parallel capacity or large user group options at enterprise level. Reduce turnaround time, examine more design variants faster, and simulate larger or more complex models.
4 Main Product Licenses:
HPC (per-process parallel licensing).
HPC Pack: HPC product rewarding volume parallel processing for high-fidelity simulations. Each simulation consumes one or more Packs, and the parallel capability enabled increases quickly with added Packs (1 Pack enables 8 cores; 2 Packs, 32; 3 Packs, 128; 4 Packs, 512; 5 Packs, 2048; 6 Packs, 8192; 7 Packs, 32768).
HPC Workgroup: HPC product rewarding volume parallel processing for increased simulation throughput, shared among engineers throughout a single location or the world. 16 to 32768 parallel cores shared across any number of simulations on a single server.
HPC Parametric Pack: enables simultaneous execution of multiple design points while consuming just one set of licenses.
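Reading the chart values above, each added Pack quadruples the enabled core count, starting from 8 cores for a single Pack. A minimal sketch of that pattern in C; the function and program are our own illustration, not part of any ANSYS tool:

```c
#include <stdio.h>

/* Enabled cores per simulation as a function of HPC Packs, matching the
 * values on this slide: 8, 32, 128, ..., 32768.
 * Each additional Pack multiplies the enabled core count by four. */
int enabled_cores(int packs)
{
    int cores = 8;                 /* 1 Pack enables 8 cores */
    for (int i = 1; i < packs; i++)
        cores *= 4;                /* each extra Pack: x4 */
    return cores;
}

int main(void)
{
    for (int p = 1; p <= 7; p++)
        printf("%d Pack(s): %d cores\n", p, enabled_cores(p));
    return 0;
}
```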
Poll #01
Shared and Distributed Memory. Shared memory (single machine): Shared Memory Parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable; OpenMP is the industry standard. Distributed memory: Distributed Memory Parallel (DMP) processing assumes that the physical memory for each process is separate from all other processes; it requires message passing software to communicate between cores, and MPI is the industry standard.
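To make the SMP side concrete, here is a minimal OpenMP sketch in C. This is an illustration of the shared-memory model named above, not ANSYS code: all threads on one machine work on the same globally addressable array.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];   /* one array in a single, shared address space */

    /* SMP: OpenMP splits the loop iterations across the cores of one
     * machine; every thread reads and writes the same global array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    printf("max threads: %d, x[N-1] = %.1f\n",
           omp_get_max_threads(), x[N - 1]);
    return 0;
}
```

Built with, for example, gcc -fopenmp; the thread count is typically controlled through the OMP_NUM_THREADS environment variable.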
Distributed ANSYS Architecture. Domain decomposition approach: break the problem into N pieces, solve the global problem independently within each domain, and communicate information across the boundaries as necessary. The entire SOLVE phase is parallel. The Sparse, PCG & LANPCG solvers all support distributed solution. Benefits: DMP runs on a single node or a cluster, while SMP is for a single node only; more computations are performed in parallel, with faster solution time; better speed-ups than SMP, with > 4x speed-up achievable on 8 cores (try getting that with SMP!); can be used for jobs running on hundreds of cores; can take advantage of resources on multiple machines; memory usage and bandwidth scale; disk (I/O) usage scales (i.e., parallel I/O).
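To illustrate the DMP/domain-decomposition idea (this is a generic MPI example, not the internals of Distributed ANSYS), the following C sketch gives each rank its own private piece of a 1D domain and exchanges only the boundary values with its neighbors, which corresponds to the "communicate across the boundaries" step described above.

```c
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1000   /* cells owned by each rank (each sub-domain) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* DMP: each process holds its sub-domain in its own private memory. */
    double u[LOCAL_N + 2];                    /* +2 ghost cells at the ends */
    for (int i = 0; i < LOCAL_N + 2; i++)
        u[i] = rank;                          /* dummy data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange boundary values with the neighboring sub-domains. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d sees left ghost %.0f, right ghost %.0f\n",
           rank, u[0], u[LOCAL_N + 1]);
    MPI_Finalize();
    return 0;
}
```

Such a program is built with mpicc and launched with mpirun -np <ranks>; the ranks can sit on one node or be spread across a cluster, which is the flexibility that DMP provides.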
ANSYS Mechanical Scaling: 6M degrees of freedom; plasticity, contact, bolt pretension; 4 load steps; ANSYS v15.
Parallel Settings, ANSYS APDL: SMP with GPU acceleration settings; DMP for multiple-core or multiple-node processing. For GPU acceleration using DMP, on the Customization/Preferences tab, under Additional Parameters, add the command line argument -acc nvidia.
Parallel Settings, ANSYS CFX/Fluent: CFX parallel settings options; Fluent multiple-core processing and GPU acceleration options; Fluent parallel settings options.
2 Common Types of HPC.
HPC Cluster: communication via a series of switches and interconnects (Infiniband, Gigabit Ethernet at 1 Gb/s or 10 Gb/s, fiber); scalable, with a DOE supercomputer reaching 1.6M cores.
HPC Workstation: a single desktop, so communication stays within the machine; more than 2 cores, commonly 8 or more. Current quad-socket builds: Xeon E5-4600 with up to 48 cores and up to 1 TB of DDR3 1866 MHz RAM.
Poll #02
PC Components
Central Processing Unit and Cores.
Intel Xeon E5 Processor Series: 4-18 cores per CPU; frequency 1.8-3.5 GHz; L3 cache up to 2.5 MB/core; bus 6.4-9 GT/s QPI; quad-socket motherboards available.
Intel Xeon E7 Processor Series: 4-18 cores per CPU; frequency 1.9-3.2 GHz; L3 cache up to 2.5 MB/core; bus 6.4-9 GT/s QPI.
RAM: DDR4 supports 2-4k MT/s (10^6 transfers/s); DDR3 supports 0.8-2k MT/s.
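As a rough guide to what those MT/s figures mean for solver throughput, peak memory bandwidth is approximately transfer rate x 8 bytes per 64-bit transfer x number of channels. A small sketch of that arithmetic; the transfer rates and the four-channel assumption are illustrative examples, not the specification of any particular CPU:

```c
#include <stdio.h>

/* Rough theoretical peak memory bandwidth:
 * transfers/s x 8 bytes per 64-bit transfer x number of channels.
 * The rates below are example values, not specs for a specific CPU. */
int main(void)
{
    double mts[]       = {1866e6, 2400e6};        /* DDR3-1866, DDR4-2400 */
    const char *name[] = {"DDR3-1866", "DDR4-2400"};
    int channels = 4;                             /* assumed channels per socket */

    for (int i = 0; i < 2; i++) {
        double gbs = mts[i] * 8.0 * channels / 1e9;
        printf("%s, %d channels: ~%.0f GB/s peak\n", name[i], channels, gbs);
    }
    return 0;
}
```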
Graphical and Co-Processing Units.
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications. Supported in Mechanical and Fluent, on 64-bit Windows or Linux x64 only. Supported cards: Tesla K10 and K20 series; Quadro 6000; Quadro K5000 and K6000.
Co-processing uses a computer processor (PCI card) to supplement the functions of the primary processor, e.g. floating-point arithmetic and signal processing. Supported cards: Xeon Phi 3000, 5000, 7000 series (ANSYS Mechanical only).
Improved Parallel Performance & Scaling: ANSYS Fluent.
GPU Acceleration.
ANSYS Mechanical: only for models with solid elements and > 500k DOF; DMP is preferred; for DOF > 5M, add another card or use a single card with 12 GB (K40, K6000); with the PCG/JCG solver, turn MSAVE off; models with lower Lev_Diff are better suited.
ANSYS Fluent: cases with higher AMG solver workload are ideal for GPU acceleration, and coupled problems benefit from GPUs; the whole problem must fit on the GPU (1e6 cells require ~4 GB of GPU RAM); better performance with lower CPU core counts, about 3 to 4 CPU cores per GPU.
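A quick sizing check based on the rule of thumb above (roughly 4 GB of GPU memory per million cells, with the whole problem needing to fit on the card). The case size, card size, and program are illustrative examples only:

```c
#include <stdio.h>

/* Illustrative sizing check using the rule of thumb on this slide:
 * a Fluent case needs roughly 4 GB of GPU memory per million cells,
 * and the entire problem must fit on the card. All numbers are examples. */
int main(void)
{
    double cells        = 2.5e6;  /* example case size */
    double gb_per_mcell = 4.0;    /* ~4 GB per 1e6 cells (slide rule of thumb) */
    double gpu_mem_gb   = 12.0;   /* e.g. a 12 GB card such as a K40 */

    double needed_gb = (cells / 1e6) * gb_per_mcell;
    printf("Estimated GPU memory needed: %.1f GB\n", needed_gb);
    if (needed_gb <= gpu_mem_gb)
        printf("Fits on a %.0f GB card.\n", gpu_mem_gb);
    else
        printf("Does not fit on a %.0f GB card; add another card.\n", gpu_mem_gb);
    return 0;
}
```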
GPU/Co-Processing Licensing. Licensing options: HPC Packs for quick scale-up; HPC Workgroup for flexibility. GPUs are treated the same as CPU cores in the licensing model, and as you scale up, the license cost per core decreases.
Poll #03
Hard Disks: SAS & SATA.
Conventional SATA: 7200 RPM and 10k RPM drives; ideal for volume storage; cheapest.
Serial Attached SCSI (SAS): 15k RPM drives (RAID 0); ideal scratch space drives.
Solid State Drives (SSD): fastest read/write operations; lower power, cooler, quieter; no mechanical parts; ideal for the OS drive; cost per GB is highest.
Interconnects.
Internal: controlled by the motherboard; Intel QuickPath Interconnect (QPI); PCIe 2.0 x8 = 32 Gb/s, PCIe 3.0 x8 = 63 Gb/s, PCIe 4.0 x8 = 125 Gb/s.
External (CAT5e, fiber, Infiniband cabling): Gigabit Ethernet (1 Gb/s, 10 Gb/s); Infiniband (40+ Gb/s, 56 Gb/s); Fibre Channel (16 Gb/s); Ethernet RDMA (40 Gb/s).
Mechanical/APDL requires at least a 10 Gb/s interconnect to scale past one node; prefer Infiniband QDR/FDR for large clusters.
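The PCIe figures above follow from the per-lane transfer rate, the lane count, and the link encoding overhead. A small sketch of that arithmetic; the generation rates and encoding factors are standard PCIe values rather than anything taken from this slide:

```c
#include <stdio.h>

/* Effective PCIe bandwidth = per-lane rate (GT/s) x lanes x encoding efficiency.
 * PCIe 2.0 uses 8b/10b encoding (80% efficient);
 * PCIe 3.0 and 4.0 use 128b/130b (~98.5% efficient). */
int main(void)
{
    struct { const char *gen; double gts; double eff; } pcie[] = {
        {"PCIe 2.0",  5.0,   8.0 /  10.0},
        {"PCIe 3.0",  8.0, 128.0 / 130.0},
        {"PCIe 4.0", 16.0, 128.0 / 130.0},
    };
    int lanes = 8;

    for (int i = 0; i < 3; i++) {
        double gbps = pcie[i].gts * lanes * pcie[i].eff;   /* Gb/s */
        printf("%s x%d: ~%.0f Gb/s (~%.1f GB/s)\n",
               pcie[i].gen, lanes, gbps, gbps / 8.0);
    }
    return 0;
}
```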
Basic Guidelines. Faster cores = faster solution. Faster RAM = faster solution, but be aware of memory bandwidth. Faster hard drives = faster solution, especially for intensive I/O: RAID 0 for multiple disks, SSD or SAS 15k drives, parallel file systems. Allow 4 GB of RAM per core for ANSYS CFD. Hyper-threading: off. Turbo Boost: only for low core counts. Faster is better and more is better, but you must balance budget against performance.
Poll #04
HPC Revolution. Every computer today is a parallel computer. Every simulation in ANSYS can benefit from parallel processing.
Questions? © 2015 CAE Associates