
YALES2 porting on the Xeon Phi: Early results
Othman Bouizi, Ghislain Lartigue
Innovation and Pathfinding Architecture Group in Europe, Exascale Lab, Paris
CRIHAN half-day on intensive computing (Demi-journée calcul intensif), June 16, 2015

Outline
- Presentation of the Exascale Lab Paris and related work with partners
- Presentation of YALES2
- Status of the code on the Intel architectures
- Motivations of the study
- Presentation of the results on Sandy Bridge & KNC
- Future work

The Exascale Lab Paris
Research areas:
- Scientific HPC applications: molecular dynamics, CFD, geophysics, climate science, fusion
- Performance analysis tools: static analysis, dynamic analysis, performance prediction, optimization projection

A few words on Xeon & Xeon Phi

                        XEON (HSW)    XEON PHI (KNC)     XEON PHI (KNL)
  Cores                 18+           up to 61           60+
  Freq.                 2-4 GHz       up to 1.1 GHz      TBA
  Mem.                  up to 6 TB    up to 16 GB        up to 384 GB
  Mem. bandwidth        55 GB/s       180 GB/s           400+ GB/s
  Cache L2              256 kB        512 kB             512 kB
  Cache L3              30 MB         none               up to 16 GB integrated on-package memory
  Flops (double)        662 GFlops    1 TFlops           3+ TFlops
  Instruction set       AVX2          KNC instructions   MIC-AVX512
  Flop/Byte (double)    12            5.5                (7.5)
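The Flop/Byte row is the machine balance, i.e. peak compute rate divided by peak memory bandwidth. As a worked check against the numbers above:

\[
\frac{662\ \mathrm{GFlop/s}}{55\ \mathrm{GB/s}} \approx 12, \qquad
\frac{1\,000\ \mathrm{GFlop/s}}{180\ \mathrm{GB/s}} \approx 5.5, \qquad
\frac{3\,000\ \mathrm{GFlop/s}}{400\ \mathrm{GB/s}} \approx 7.5 \ \ \mathrm{Flop/Byte}.
\]

A code whose loops perform far fewer flops per byte than the machine balance, as is the case for YALES2 below, is bandwidth bound rather than compute bound.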

The YALES2 solver
YALES2 is an unstructured low-Mach number code for the DNS and LES of reacting two-phase flows in complex geometries. It solves the unsteady 3D Navier-Stokes equations on massively parallel machines.
It is used by more than 160 people in labs and in industry:
- SUCCESS scientific group (http://success.coria-cfd.fr): CORIA, I3M, LEGI, EM2C, IMFT, CERFACS, IFP-EN, LMA
- Other labs: ULB, MONS, LOMC
- Collaboration with the INTEL/CEA/GENCI/UVSQ Exascale Lab
- Industry: SAFRAN, RHODIA (SOLVAY), AREVA
Awards:
- 2011 IBM faculty award
- 3rd place at the Bull-Joseph Fourier prize in 2009
- Principal investigator of 2 PRACE proposals
www.coria-cfd.fr

YALES2 code overview
- Mesh management: 1D, 2D, 3D; partitioning; load balancing; dynamic load balancing; refinement
- IOs: Gambit/Fluent, partitioned HDF5, Ensight, AVBP
- Complex chemistry: tabulated chemistry, complex chemistry, stiff integrators
- Solvers & numerics: particles, level sets, 4th-order FV schemes
- Linear solvers: deflated PCG, BICGSTAB2, residual recycling
- Data analysis: probes, postprocessed variables, high-order filters, FFT, POD, DMD, statistics
2 main maintainers: V. Moureau, G. Lartigue
300,000 lines of object-oriented F90, Git version management, portable on all the major platforms
www.coria-cfd.fr

Turbine blade: Re = 150,000, 18 billion cells, 8,192 cores

Status of the code on the Intel architectures
Performance within one node of SNB (2x8 cores @ 2.6 GHz) & KNC (61 cores @ 1.1 GHz).
Dataset: preccinsta, 1.7M cells, 2 GB memory footprint, 100 iterations.
[Charts: memory bandwidth on SNB -O3 (total/read/write, GB/s); KNC vs SNB relative times, base SNB -O3; instruction distribution on SNB; time in Gcycles.]
- No gain with vectorization: the code is limited by the bandwidth and has a lot of indirections.
- KNC is ~2.5x slower than SNB! But the amount of work on each core of the KNC is lower than on SNB (relevance of the dataset?).
- On KNC, the front end decodes 1 instruction every 2 cycles.
- There are a lot of arithmetic instructions but very few are vectorized; GPR instructions are ~50% of operations, so the maximum gain from vectorization is ~50%.
- Peak bandwidth is high. The code is bandwidth sensitive.
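The ~50% ceiling is an Amdahl-style bound: if the fraction of instructions that vectorization can touch is f ≈ 0.5 (the rest being scalar GPR operations), then even with an ideal vector speedup v on that fraction,

\[
S = \frac{1}{(1-f) + f/v} \;\xrightarrow[v \to \infty]{}\; \frac{1}{1-f} = 2,
\]

i.e. at most a 2x speedup, or about half of the time removed.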

Wrap up
- Both the -O3 vectorized and non-vectorized builds have the same performance: autovectorization generates few vector instructions.
- In the temporal loop, most of the time is spent in the Poisson solver.
How to execute the Poisson solver faster?
- Vectorizing the code can be a solution
- Changing the algorithm
- Reorganizing the data

Poisson solver overview
The low-Mach formulation of the Navier-Stokes equations yields a Poisson equation on the pressure. A centered finite-volume scheme transforms the Laplacian differential operator into a symmetric matrix, and the differential equation into a linear-algebra problem.
Each element of A represents the contribution of a pair of nodes of the mesh. The connectivity of the mesh is such that each node has ~20 neighbours. This sparse system is solved with CG-based iterative methods.
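In symbols (a standard form of this pressure-projection step; the slide does not spell it out):

\[
\nabla \cdot \!\left( \frac{\nabla p}{\rho} \right) = \frac{\nabla \cdot \mathbf{u}^{*}}{\Delta t}
\quad \xrightarrow{\ \text{FV discretization}\ } \quad A\,x = b,
\]

where A is the sparse symmetric matrix assembled from the pair (edge) contributions, x the vector of nodal pressures, and b the discrete divergence of the predicted velocity.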

Focus on the Poisson solver loop
[Figure: mesh sample with labelled nodes, and its connectivity description as a list of pairs.]
Pairs (node 1, node 2): (1,2) (1,3) (1,4) (1,45) (1,121) (1,265) (2,3) (2,4) (2,15) (2,258) (2,275) (3,23) (3,45) (3,59) (3,70) (3,71) (3,275)

! for each cell group
do i = 1, size_outer
   elt        => elts(i)
   pair2node1 => elt%ind1%val
   pair2node2 => elt%ind2%val
   sym_op     => elt%sym_op%val
   data       => elt%data%val
   laplacian  => elt%laplacian%val
   laplacian(1:el_grp%nnode) = 0.0
   ! for each pair
   do ip = 1, el_grp%npair
      ino1 = pair2node1(ip)
      ino2 = pair2node2(ip)
      coeff = sym_op(ip)*(data(ino2)-data(ino1))
      laplacian(ino1) = laplacian(ino1) + coeff
      laplacian(ino2) = laplacian(ino2) - coeff
   end do
end do

Not vectorizable due to dependencies: successive pairs can share a node, so the scattered updates to laplacian carry a loop dependence.

Improvements of the loop: focus on data
v.1 (original pair list): not vectorizable.
Pairs (node 1, node 2): (1,2) (1,3) (1,4) (1,45) (1,121) (1,265) (2,3) (2,4) (2,15) (2,258) (2,275) (3,23) (3,45) (3,59) (3,70) (3,71) (3,275)
Remove the data dependencies by regrouping the pairs per node:
v.2 (neighbour lists): node 1 -> 2, 3, 4, 45, 121, 265; node 2 -> 3, 4, 15, 258, 275; node 3 -> 23, 45, 59, 70, 71, 275. Still not vectorizable (indirections).
Gather the groups of the same size to change the direction of the vectorization and reduce the effect of the tail loop:
v.3 (equal-size groups): vectorizable, at the cost of 2x the operations of v.1, since each pair is now visited from both of its nodes. A sketch of this layout follows below.
[Table: the reordered node/neighbour lists after gathering equal-size groups; each edge appears in both nodes' lists, e.g. node 8 lists 161 and node 161 lists 8.]
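A minimal self-contained sketch of the v.3 idea (my reconstruction, not the YALES2 source; the names neigh, w and phi are illustrative). Nodes of equal degree form one group; the neighbour lists are walked column by column and the inner loop runs across the nodes of the group, so each iteration updates only its own laplacian(k) and there is no scatter conflict. Every edge is visited from both endpoints, which is where the 2x operation count comes from.

! Tiny mesh: a tetrahedron, so all 4 nodes have degree 3 and form
! a single equal-degree group.
program laplacian_v3_sketch
   implicit none
   integer, parameter :: n = 4, deg = 3
   integer :: neigh(n, deg)       ! neigh(k,j): j-th neighbour of node k
   real    :: w(n, deg)           ! edge weights, duplicated per endpoint
   real    :: phi(n), laplacian(n)
   integer :: j, k

   neigh(1,:) = [2, 3, 4];  neigh(2,:) = [1, 3, 4]
   neigh(3,:) = [1, 2, 4];  neigh(4,:) = [1, 2, 3]
   w   = 1.0
   phi = [1.0, 2.0, 3.0, 4.0]
   laplacian = 0.0

   do j = 1, deg                  ! walk the neighbour lists column by column
      !$omp simd                  ! safe: iteration k only writes laplacian(k)
      do k = 1, n
         laplacian(k) = laplacian(k) + w(k,j) * (phi(neigh(k,j)) - phi(k))
      end do
   end do

   print *, laplacian             ! prints 6.0 2.0 -2.0 -6.0
end program laplacian_v3_sketch

In the real code the gather phi(neigh(k,j)) is still an indirection, which is why (next slides) the compiler only vectorized this form in C with #pragma simd.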

Behavior of the loops on Xeon SNB
[Chart: cycles per loop vs dataset size in kB (10 to 1,000,000), with the L2, L3 and DRAM regimes marked, for the five loop versions.]
- v.2 & v.3 in F90 & C are not vectorized by the compiler.
- The original version is faster when the data comes from DRAM because it has the minimal number of memory accesses.
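A rough traffic estimate makes the DRAM result plausible (my arithmetic, assuming 8-byte reals and 4-byte indices; YALES2 may well use other widths):

\[
\text{v.1: } \underbrace{2 \times 4}_{\text{indices}} + \underbrace{8}_{\text{coefficient}} = 16\ \text{B streamed per pair}, \qquad
\text{v.3: } 2 \times (4 + 8) = 24\ \text{B per pair},
\]

plus twice the gathers and accumulations, since v.3 visits each pair from both endpoints. Once the dataset no longer fits in cache, this extra traffic costs more than the vector units can win back.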

Behavior of the loops on Xeon Phi KNC
[Charts: cycles per loop vs dataset size in kB with the L2 and DRAM regimes marked; cycles per instruction (CPI) and instruction count for each loop version.]
- v.2 & v.3 in F90 are not vectorized by the compiler; v.3 in C has #pragma simd.
- The v.3 F90 version is faster, perhaps because of the overlap of the computation and the software prefetcher.
- Cycles per instruction is between 2 & 3. Even if the fully vectorized loop (v.3 C) executes fewer instructions, they take more time to be executed.
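A CPI of 2-3 is close to the hardware floor: the KNC core is in-order and its front end can issue an instruction from a given hardware thread at most every other cycle, so with a single thread per core

\[
\mathrm{CPI}_{\min} = 2,
\]

and any gather latency or cache miss pushes the measured value above that. Running at least 2 hardware threads per core is the usual way to get back to one issue per cycle on KNC.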

Behavior of the loops on Xeon Phi KNC: bandwidth & cycles
[Chart: cycles per loop and measured bandwidth (GB/s) for each loop version, dataset in DRAM.]
- When the measured bandwidth increases, the cycles per loop decrease.
- The non-vectorized versions are faster than the vectorized version, possibly a latency vs prefetching effect: KNC is in-order and decodes 1 instruction every 2 cycles for 1 thread.

Future work
Improve the memory access with:
- prefetching (only if not already at the maximum bandwidth); see the sketch below
- reducing the data dispersion in the cache lines by testing 2 strategies:
  - touching the fewest cache lines, so that the prefetcher is not stressed
  - touching as many cache lines as possible at the beginning of the loop, so that the arrays are brought into the L2 cache
- reorganizing the code to compute everything on one group of nodes (very hard)
Focus on the vectorization of loop v.2
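For the prefetching item, a hedged sketch using the Intel compiler's prefetch directive on the original pair loop (the distance of 16 iterations is a placeholder to tune; the directive helps the streamed arrays, while the gathered accesses through pair2node1/2 would need explicit prefetch calls):

! Intel-specific directive: request software prefetch of the streamed
! operands 16 iterations ahead (placeholder distance, to be tuned on KNC).
!DIR$ PREFETCH sym_op:1:16, pair2node1:1:16, pair2node2:1:16
do ip = 1, el_grp%npair
   ino1 = pair2node1(ip)
   ino2 = pair2node2(ip)
   coeff = sym_op(ip)*(data(ino2)-data(ino1))
   laplacian(ino1) = laplacian(ino1) + coeff
   laplacian(ino2) = laplacian(ino2) - coeff
end do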

Thank you