Keys to node-level performance analysis and threading in HPC applications


Keys to node-level performance analysis and threading in HPC applications
Thomas GUILLET (Intel; Exascale Computing Research)
IFERC seminar, 18 March 2015

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Application performance: a multiscale problem

[Scale diagram: Microarch - Core - Socket - Node - Cluster; 1-4+, 10-100, 100-10000+]

- Multicore: vector ISA, cores, cache hierarchies, ...
- Manycore: new vector ISAs, MPI+OMP?, memory/core?
- The optimization space is getting larger.

Goal of this presentation: provide keys to application performance and threading analysis, based on characterization and projection experience with full applications.

Node-level performance

From the choice of algorithm or scheme, through the source code implementation and binary code, to the actual execution:
- Programmer: data access patterns
- Compiler: vectorization, code generation
- Architecture: cache behavior, execution pathologies
Optimizations target memory bandwidth/data reuse and vectorization/code quality.

Two main performance factors (at first order):
- Memory (DRAM) bandwidth demand
- Computation: Flops (but sometimes also non-flop instructions), use of execution units

Key questions:
- What are the requirements of my algorithm, in terms of compute vs. memory transfers?
- What performance can I expect?
- Where am I with respect to ideal performance? How can I get closer to ideal?

Flops, bytes & arithmetic intensity

Arithmetic intensity = Flop/byte: a measure of the compute vs. ideal data transfer balance for a particular kernel.

DAXPY (Triad):

    do i=1,n
      y(i) = y(i) + a*x(i)
    end do

Read x: 8N bytes; read y: 8N bytes; write y: 8N bytes; compute y: 2N Flops.
Flop/byte = 2/24 = 0.083

3D Stencil (Gauss-Seidel):

    do k=1,n
      do j=1,n
        do i=1,n
          x(i,j,k) = ONE_SIXTH * ( &
            x(i+1,j,k) + x(i-1,j,k) + &
            x(i,j+1,k) + x(i,j-1,k) + &
            x(i,j,k+1) + x(i,j,k-1))
        end do
      end do
    end do

Read x: 8N^3 bytes; write new x: 8N^3 bytes; compute update: 6N^3 Flops.
Flop/byte = 6/16 = 0.375

Source code level analysis: count floating point operations, and count bytes (arrays) read and written, assuming perfect reuse (infinite cache), i.e. the ideal case.
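The counting rule above can be captured in a small helper. This is a hypothetical sketch in C (the function names are mine); the byte and Flop counts for the two kernels come from the slide, assuming double precision and perfect cache reuse:

```c
#include <assert.h>

/* Arithmetic intensity in Flop/byte from source-level counts,
   assuming perfect reuse: each array element crosses DRAM once. */
double arithmetic_intensity(double flops_per_elem, double bytes_per_elem) {
    return flops_per_elem / bytes_per_elem;
}

/* DAXPY/Triad: read x, read y, write y = 3 * 8 bytes; 2 Flops (mul + add). */
double triad_ai(void) { return arithmetic_intensity(2.0, 24.0); }

/* 6-point 3D stencil: read x, write x = 16 bytes/cell; 6 Flops/cell. */
double stencil_ai(void) { return arithmetic_intensity(6.0, 16.0); }
```

Both kernels land well below 1 Flop/byte, which is why they are typically bandwidth bound on modern nodes.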

Compute vs. bandwidth analysis

References:
- Quantitative System Performance, D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik
- Williams et al., http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf

[Roofline plot: log GFLOP/s (performance) vs. log Flop/byte (arithmetic intensity); ideal execution follows the bandwidth slope up to the compute-bound plateau, with the actual execution point below and to the left of the ideal one.]

Actual vs. ideal execution:
- Efficiency (% of peak) depends on the microarchitecture
- Finite cache size will reduce the achieved Flop/byte
- Optimization directions: vectorization and code generation (compute axis); data reuse and cache optimizations (bandwidth axis)

Measuring data for the actual execution:
- GFlop/s derived from code performance: GFlop/s = Gcells/s x Flops/cell
- DRAM bandwidth: Flop/byte = (GFlop/s) / (GB/s)

Tools: Intel VTune Amplifier XE (https://software.intel.com/en-us/intel-vtuneamplifier-xe); open source tools, e.g. https://code.google.com/p/likwid/ (requires root access or a special kernel module).
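In the roofline model, attainable performance is the minimum of the compute peak and the memory bandwidth times the arithmetic intensity. A minimal sketch (the function name and the machine numbers used in the test are illustrative assumptions, not measurements from the talk):

```c
#include <assert.h>

/* Roofline model: performance is capped either by peak compute or by
   DRAM bandwidth times arithmetic intensity, whichever is lower. */
double roofline_gflops(double peak_gflops, double bw_gbs, double flop_per_byte) {
    double bw_bound = bw_gbs * flop_per_byte;   /* bandwidth ceiling */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

With hypothetical numbers such as a 332.8 GFLOP/s peak and 80 GB/s of DRAM bandwidth, a 0.083 Flop/byte triad sits far down the bandwidth slope, while a kernel at 10 Flop/byte hits the compute plateau.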

Illustration: GYSELA kernels on Xeon

2 sockets, Xeon E5-2670 (Sandy Bridge, 2.6 GHz). This kernel is bandwidth bound when vectorized, but compute bound when not vectorized!

Illustration: GYSELA kernels on Xeon Phi

Xeon Phi 7120 (16 GB GDDR, 61 cores, 1.2 GHz). Efficiency drops for complex loop bodies; smaller caches incur more memory traffic.

Node-level characterization: wrap-up

Simple compute vs. bandwidth characterization ("roofline"):
- Helps determine maximum performance expectations
- Allows identification of optimization directions

It can be complemented by quick analysis tricks:
- Measure the time on 1 full node (available bandwidth BW_1), and write: T_1full = T_compute + T_bw
- Measure the time on 2 half-filled nodes (available bandwidth BW_2 > BW_1), and write: T_2half = T_compute + T_bw * (BW_1 / BW_2)
- Solve for T_compute and T_bw to estimate the "memory-boundedness" of the application on this architecture
- Also useful for quick projections across similar architectures

General trends on Xeon Phi:
- Smaller caches incur more memory demand
- In-order cores and a complex vector ISA: the compiler and code generation matter

So far, we assumed good parallelism (no threading or MPI issues).
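The two-measurement trick above is a 2x2 linear system. A minimal sketch of the solve (hypothetical helper, assuming the compute part is unchanged between the two runs and the bandwidth part scales inversely with available bandwidth):

```c
#include <assert.h>

/* Decompose wall time into compute and bandwidth parts from two runs:
     T1 = Tc + Tbw                  (1 full node, bandwidth BW1)
     T2 = Tc + Tbw * (BW1 / BW2)    (2 half-filled nodes, BW2 > BW1)
   Solving the 2x2 system gives Tc and Tbw. */
void decompose_time(double t1, double t2, double bw1, double bw2,
                    double *t_compute, double *t_bw) {
    double r = bw1 / bw2;               /* < 1 since BW2 > BW1 */
    *t_bw = (t1 - t2) / (1.0 - r);
    *t_compute = t1 - *t_bw;
}
```

For example, if a run takes 100 s on one full node and 80 s on two half-filled nodes with twice the bandwidth per process, the model attributes 40 s to bandwidth and 60 s to compute.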

Shared memory: to thread or not to thread?

Why is threading interesting in applications?
- Allows "larger" MPI ranks (for domain decomposition) for the same problem, which may improve the surface/volume ratio
- Amortizes the memory footprint of the MPI runtime
- Allows dynamic load balancing for imbalanced applications

What could possibly go wrong?
- Amdahl's law strikes back: on computation, getting good coverage is hard
- On communications, MPI+X is not intrinsically "better" than MPI (4x1 vs. 1x4)

Illustration: CFD application

Configurations with {#ranks} x {#threads} = 24 cores.

[Charts vs. OMP threads/rank (1, 2, 4, 6, 12): temporal loop wall time [s] (measured vs. Amdahl projection), memory footprint per core [MB], and application instructions per core.]

Illustration: CFD application

Configurations with {#ranks} x {#threads} = 24 cores. Wall time [s] on the master thread, broken down into time spent inside OpenMP parallel regions, serial time, and MPI time.

[Chart over configurations 24x1, 12x2, 6x4, 4x6, 2x12.]

- Wall time spent in the MPI library grows with the number of threads
- Non-threaded computation wall time grows ("Amdahl's law on threads")

Can threading help with imbalance? [synthetic data for illustration]

Imbalance time = max - mean.

[Charts: per-core compute time vs. core id for two cases.]

- Small-scale 50% imbalance: shared-memory dynamic load balancing may be effective against imbalance
- Large-scale 50% imbalance: shared-memory dynamic load balancing alone is ineffective against imbalance
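The imbalance metric used on this slide (maximum minus mean of per-core times) can be sketched as follows (hypothetical helper name):

```c
#include <assert.h>

/* Load-imbalance time as defined on the slide:
   max minus mean of the per-core (or per-rank) compute times. */
double imbalance_time(const double *t, int n) {
    double max = t[0], sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (t[i] > max) max = t[i];
        sum += t[i];
    }
    return max - sum / n;
}
```

This is the extra wall time every core waits beyond a perfectly balanced distribution of the same total work.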

Threading and imbalance

Highly imbalanced adaptive mesh refinement code. Wall time [s] on the master thread of rank 0, broken down into OMP, serial, and MPI time.

[Chart over configurations 24x1, 12x2, 8x3, 6x4, 4x6, 2x12.]

- OMP computation scales less than ideally
- Threading helps reduce extreme MPI imbalance
- But Amdahl's law still takes over at high thread counts

OpenMP: things to watch for in applications

Code coverage (a.k.a. Amdahl's law):
- Extensive coverage is critical for scalability
- Can be very tedious, or impossible, to achieve for flat-profile applications
- Coarse (loop-level) threading helps, but reimplementing MPI doesn't

Granularity:
- Important metric: the average wall time of OpenMP regions
- Compare it to the OpenMP barrier/synchronization time

Both points grow in importance on Xeon Phi:
- Lots of threads: coverage grows in importance
- Limited memory per core: short loops

VTune profiling can help diagnose both issues.
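The coarse, loop-level threading and granularity points above can be illustrated with a minimal sketch (hypothetical kernel; the idea is that one parallel region around the outer loop keeps each region's wall time large relative to the fork/join and barrier cost):

```c
#include <assert.h>

/* Coarse loop-level threading: a single parallel region over the outer
   loop, so each thread gets whole rows of work rather than tiny chunks.
   The pragma is ignored if OpenMP is disabled, so this also builds serially. */
void scale_rows(double *a, int rows, int cols, double factor) {
    #pragma omp parallel for
    for (int r = 0; r < rows; r++)        /* one coarse chunk per row */
        for (int c = 0; c < cols; c++)
            a[r * cols + c] *= factor;
}
```

Threading the inner loop instead would multiply the number of region entries by `rows`, shrinking each region's wall time toward the barrier cost, which is exactly the granularity trap described above.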

Wrap-up

Careful performance analysis is essential to guide code optimizations:
- Set pragmatic performance targets
- Collect data on application behavior

A simple compute vs. bandwidth model can provide:
- A robust first-order characterization
- Insights into specific or second-order effects

Threading can help address some strong-scaling issues:
- Amortize halo overheads, level out imbalance
- No magic: obtaining good coverage is hard work

Threading is an important adjustment variable for:
- Heterogeneous computing resources (e.g. symmetric mode)
- Available memory per core