OpenFOAM: Computational Fluid Dynamics. Gauss-Seidel iteration: (L + D) * x_new = b - U * x_old




What's unique about my tuning work: The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a free, open-source CFD software package with a large user base across most areas of engineering and science, in both commercial and academic organizations. Application domain: Computational Fluid Dynamics. Execution mode: Symmetric MPI on a cluster. The following tools were used in the analysis and optimization cycles: Intel VTune Analyzer, Intel Trace Analyzer and Collector, the idb debugger, and Intel compiler parallelization reports.

Insights
What I learned:
- With effective load distribution, Xeon and Xeon Phi running in symmetric mode give better performance. The balance must account for the difference in CPU frequency between Xeon and Xeon Phi and for the total number of cores on each.
- Reducing neighbour communication between Xeon and Xeon Phi in the unstructured mesh gives better performance.
- The I/O penalty is high on Xeon Phi compared to Xeon.
- System tuning (e.g. 2 MB huge pages) and compiler flags give significant improvements.
What I recommend: Effective use of tools to diagnose performance problems.
Most effective optimization: Vectorization intrinsics
- Changes to get efficient vectorization and avoid compiler overhead due to VGATHER and VSCATTER instructions
- Unrolling
- Prefetching
- Explored a decomposition algorithm change for weighted decomposition support
Surprises: Oversubscription works better in native Xeon Phi mode when I/O is turned off.

Performance: Motorbike case
Compelling performance:
- Haswell optimized: 164 s
- Haswell + Xeon Phi optimized: 119 s (164/119 = 1.38x)
Runtime (seconds) of the Motorbike case, 4.2M workload (lower is better).
Competitive performance:
- There are no published GPU numbers for the OpenFOAM benchmark application.
Results:
- 1.38x speedup from adding Xeon Phi, relative to the optimized Xeon result.

Issues with the GaussSeidelSmoother loop
- The outer loop is not parallelizable because of dependencies.
- Low trip count of the inner loop (~4 on average), which is not good for vectorization.
- Indirect referencing: inefficient use of cache lines, and scatter/gather overhead.
- Overhead of a scalar DIV operation inside the loop.
- Unaligned data structures throughout the application.
A simplified sketch of the loop structure is shown below.
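For reference, here is a minimal sketch of the smoother's per-cell loop structure, paraphrased from OpenFOAM's GaussSeidelSmoother::smooth (variable names follow the lduMatrix addressing; this is an illustration, not the full implementation):

// For each cell: gather the upper-triangular contributions (first loop),
// divide by the diagonal, then scatter the lower-triangular contributions
// (second loop). uPtr[] holds the neighbour-cell indices (indirect addressing).
for (label celli = 0; celli < nCells; celli++)
{
    const label fStart = ownStartPtr[celli];      // first face of this row
    const label fEnd   = ownStartPtr[celli + 1];  // one past the last face

    scalar psii = bPrimePtr[celli];

    // 1st loop: reduction with an indirect gather; trip count is ~4 on average
    for (label facei = fStart; facei < fEnd; facei++)
    {
        psii -= upperPtr[facei] * psiPtr[uPtr[facei]];
    }

    psii /= diagPtr[celli];   // scalar DIV inside the outer loop

    // 2nd loop: indirect scatter; creates the cross-cell dependency that
    // prevents parallelizing the outer loop
    for (label facei = fStart; facei < fEnd; facei++)
    {
        bPrimePtr[uPtr[facei]] -= lowerPtr[facei] * psii;
    }

    psiPtr[celli] = psii;
}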

Identify vector assembly overhead
1st loop:
for (facei = fStart; facei < fEnd; facei++) psii -= upperPtr[facei]*psiPtr[uPtr[facei]];
Generated assembly (excerpt):
vgatherdpdq (%r14,%zmm14,8), %k4, %zmm15
vgatherdpdq (%r13,%zmm11,8), %k1, %zmm16
vgatherdpsl (%r13,%zmm10,4), %k3, %zmm14
vgatherdpsl (%r12,%zmm0,4), %k2, %zmm11
... (multiple gathers, due to the indirect reference, unrolling, and the peel and remainder loops)
Observations:
1. The loop overhead is high because the trip count is small in most cases.
2. The compiler-generated reduction operation appears costly in the assembly.
3. By default the loop is vectorized and unrolled by 4; the peel and remainder loops introduce overhead since the trip count is < 8 for 99% of the time.
4. VTune profiling and the ICC compiler vec-report were used to identify the bottleneck.
These issues suggest that manual vectorization using intrinsics might be beneficial.
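The slides do not show the intrinsics that were finally used, and the original target was the first-generation Xeon Phi (IMCI instruction set). Purely as an illustration of the idea (replacing the compiler's unrolled gather plus peel/remainder code with a single masked gather and horizontal reduction per short row), here is a minimal AVX-512 style sketch; the helper name rowReduce and the assumption that a row has at most 8 faces are mine, not from the original work:

#include <immintrin.h>

// Hedged sketch: handle one short row (trip count < 8) with a single
// masked gather and horizontal reduction, instead of the compiler's
// unrolled-by-4 loop plus peel/remainder code.
static inline double rowReduce(const double* upperPtr,
                               const double* psiPtr,
                               const int*    uPtr,      // 32-bit neighbour indices
                               int fStart, int fEnd)
{
    const int rowLen = fEnd - fStart;                     // typically ~4, assumed <= 8
    const __mmask8 m = (__mmask8)((1u << rowLen) - 1u);   // active lanes only

    // Gather the neighbour indices and psi values for this row under mask.
    __m256i idx = _mm256_maskz_loadu_epi32(m, uPtr + fStart);
    __m512d psi = _mm512_mask_i32gather_pd(_mm512_setzero_pd(), m, idx, psiPtr, 8);
    __m512d upr = _mm512_maskz_loadu_pd(m, upperPtr + fStart);

    // Multiply and reduce horizontally; inactive lanes stay zero.
    return _mm512_reduce_add_pd(_mm512_mul_pd(upr, psi));
}

The caller would subtract the returned sum from psii, matching the reduction in the original loop.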

Time comparison of 1st and 2nd loop: Baseline vs Intrinsic

First loop                    Time (s)   Remarks
Baseline                      486        Vectorized by default
Intrinsic (vectorization)     186        Speedup 2.6x

Second loop                   Time (s)   Remarks
Baseline                      221        Not vectorized by default
Baseline + #pragma ivdep      412        Enabling vectorization degrades performance
Intrinsic (vectorization)     128        Speedup 1.7x

VTune profiling and the ICC compiler vec-report/opt-report were used for detailed analysis.
Note: Times taken from VTune by aggregating assembly.

Prefetching using intrinsics: VTune snapshot (baseline vs. optimized screenshots in the original slide).
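The slide itself only shows VTune screenshots. As a hedged illustration of the technique, the coefficient and index arrays for a row a few iterations ahead can be prefetched with the standard _mm_prefetch intrinsic; the distance PREFETCH_AHEAD and the loop shape here are assumptions, not taken from the original code:

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 / _MM_HINT_T1

// Hedged sketch: software-prefetch data a few rows ahead so that the
// indirect accesses in the current row are more likely to hit the cache.
const int PREFETCH_AHEAD = 4;   // assumed distance, tuned empirically

for (label celli = 0; celli < nCells; celli++)
{
    const label pCell = celli + PREFETCH_AHEAD;
    if (pCell < nCells)
    {
        const label pf = ownStartPtr[pCell];
        // Prefetch the face coefficients and the neighbour index list
        _mm_prefetch((const char*)(upperPtr + pf), _MM_HINT_T0);
        _mm_prefetch((const char*)(uPtr + pf), _MM_HINT_T0);
        // Prefetch the indirectly addressed psi value for the first face
        _mm_prefetch((const char*)(psiPtr + uPtr[pf]), _MM_HINT_T1);
    }

    // ... existing smoother body for cell celli ...
}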

Huge pages improvement via the libhugetlbfs library
Huge memory pages (2 MB) are often necessary for memory allocations on the coprocessor. With 2 MB pages, TLB misses and page faults may be reduced, and the allocation cost is lower.
Without huge pages vs. with huge pages: difference analysis using VTune.
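The slide does not show how huge pages were enabled. libhugetlbfs is typically used by preloading the library so that heap allocations are backed by 2 MB pages (e.g. LD_PRELOAD=libhugetlbfs.so with HUGETLB_MORECORE=yes). As a separate, self-contained sketch of the same effect, a buffer can also be placed on 2 MB pages directly with mmap and MAP_HUGETLB on Linux; this is an illustration, not the approach used in the original work:

#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

// Hedged sketch: allocate a buffer backed by 2 MB huge pages on Linux.
// Requires huge pages to be reserved, e.g. via /proc/sys/vm/nr_hugepages.
int main()
{
    const size_t size = 64 * 2 * 1024 * 1024;   // 128 MB, a multiple of 2 MB

    void* buf = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (buf == MAP_FAILED)
    {
        perror("mmap(MAP_HUGETLB)");   // fails if no huge pages are reserved
        return EXIT_FAILURE;
    }

    // ... use buf for the large solver arrays ...

    munmap(buf, size);
    return 0;
}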

Performance optimizations on Xeon Phi (native mode)
[Chart: Xeon Phi optimization break-up as % of the total improvement (341 s).]
Optimization gains depend on the order in which they are applied.

Thank You!

Functions improved (VTune counters)

Function                                              Baseline L1 Misses   Baseline L1 Hit Ratio   Optimized L1 Misses   Optimized L1 Hit Ratio
GaussSeidelSmoother::smooth                           6194650000           0.963597                3486700000            0.97056
lduMatrix::Amul                                       2143750000           0.958503                1422050000            0.979037
lduMatrix::sumA                                       566300000            0.958976                476700000             0.984022
lduMatrix::sumMagOffDiag                              266000000            0.967213                301350000             0.980977
GAMGSolver::agglomerateMatrix                         280000000            0.937792                110950000             0.98937
fvc::surfaceSum                                       154000000            0.966514                70350000              0.975185
processorFvPatchField::updateInterfaceMatrix          616000000            0.92781                 91700000              0.978203
processorGAMGInterfaceField::updateInterfaceMatrix    700000000            0.952449                143850000             0.977784

The L1 hit ratio has improved throughout, and L1 misses have decreased.