Performance Analysis and Optimization Tool




Performance Analysis and Optimization Tool
Andres S. CHARIF-RUBIAL, andres.charif@uvsq.fr
Performance Analysis Team, University of Versailles
http://www.maqao.org

Introduction Performance Analysis
- Develop performance analysis and optimization tools: MAQAO Framework and Toolsuite
- Establish partnerships
- Optimize industrial applications

Introduction Performance Analysis
- Understand the performance of an application
  - How well does it behave on a given machine?
  - What are the issues?
- Generally a multifaceted problem
  - Maximizing the number of views = better understanding
- Use techniques and tools to understand the issues
- Once understood: optimize the application

Introduction Compilation chain
- The compiler remains your best friend
- Be sure to select proper flags (e.g., -xavx)
- Pragmas: unrolling, vector alignment
- O2 vs. O3
- Vectorisation/optimisation report

Introduction MAQAO Tool
- Open source (LGPL 3.0)
  - Currently a binary release
- Available for x86-64 and Xeon Phi
  - Looking forward to porting MAQAO to BlueGene

Introduction MAQAO Tool
- Easy install
  - Packaging: ONE (static) standalone binary
  - Easy to embed
- Audience
  - User/Tool developer: analysis and optimisation tool
  - Performance tool developer: framework services
    - TAU: tau_rewrite (MIL)
    - ScoreP: on-going effort (MIL)

Introduction MAQAO Tool

Introduction MAQAO Tool
- Scripting language
  - Lua language: simplicity and productivity
  - Fast prototyping
  - MAQAO Lua API: access to services

Introduction MAQAO Tool
- Built on top of the Framework
- Loop-centric approach
- Produce reports
  - We deal with low level details
  - You get high level reports

Introduction MAQAO Tool
- A lot of tools! Which one to use? When?
- Our approach/experience: decision tree
- Currently working on HPC
  - Multi-node > Node > Socket > Core
  - Classify IO/Memory/MPI/OpenMP/Application
- PAMDA methodology to be published: 7th Parallel Tools Workshop, https://tools.zih.tu-dresden.de/2013/

Outline
- Introduction
- Pinpointing hotspots
  - Functions, loops
  - MPI characterization
- Code quality analysis

Pinpointing hotspots Measurement methodology
- MAQAO Profiling
- Instrumentation
  - Through binary rewriting
  - High overhead / more precision
- Sampling
  - Hardware counters through the perf_event_open system call
  - Very low overhead / fewer details
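To make the precision/overhead trade-off concrete, here is a minimal sketch that replays a synthetic timeline in two ways: recording every event (instrumentation) and looking at the timeline only at a fixed period (sampling). The timeline, the function names and the helpers `instrument` and `sample` are all hypothetical; this is not MAQAO code and it does not use perf_event_open.

```python
from collections import Counter

# Synthetic execution timeline: (function name, duration in microseconds).
# Names and durations are made up for illustration.
TIMELINE = [("solve", 700), ("exchange", 250), ("solve", 800),
            ("log", 3), ("exchange", 247)]


def instrument(timeline):
    """Exact time per function, as if every entry/exit were recorded."""
    exact = Counter()
    for name, duration in timeline:
        exact[name] += duration
    return dict(exact)


def sample(timeline, period):
    """Estimate time per function by observing the timeline every `period` us."""
    hits = Counter()
    next_sample = 0  # timestamp of the next sample
    start = 0        # start timestamp of the current event
    for name, duration in timeline:
        end = start + duration
        while next_sample < end:
            hits[name] += 1
            next_sample += period
        start = end
    # Each hit is assumed to stand for one full period of execution.
    return {name: count * period for name, count in hits.items()}


exact = instrument(TIMELINE)
estimate = sample(TIMELINE, period=100)
```

In this toy run the sampled estimate is close for the long functions, while the 3 us `log` call falls between two samples and disappears from the profile. That is the "fewer details" side of the trade-off; the instrumented version sees it, but has to do work on every single event.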

Pinpointing hotspots Parallelism level
- SPMD: program level
- SIMD: instruction level
- By default MAQAO only considers system processes and threads

Pinpointing hotspots Display
- Display functions and their exclusive time
  - Associated callchains and their contribution
- Loops
  - Innermost loops can then be analyzed by the code quality analyzer module (CQA)
- Command line and GUI (HTML) outputs
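Exclusive time is the time spent in a function itself, without the time spent in the functions it calls. A minimal sketch of that relationship, assuming a simple call tree where each function has a single caller; the function names, the timings and the `exclusive_times` helper are hypothetical and not part of MAQAO:

```python
def exclusive_times(inclusive, children):
    """Exclusive time = inclusive time minus inclusive time of direct callees."""
    return {
        name: total - sum(inclusive[callee] for callee in children.get(name, []))
        for name, total in inclusive.items()
    }


# Inclusive times in seconds (made-up values) and the call tree.
inclusive = {"main": 10.0, "solve": 7.0, "exchange": 2.0, "kernel": 6.0}
children = {"main": ["solve", "exchange"], "solve": ["kernel"]}

exclusive = exclusive_times(inclusive, children)
# Hotspots are the functions with the largest exclusive time.
hotspots = sorted(exclusive, key=exclusive.get, reverse=True)
```

Sorting by exclusive rather than inclusive time is what keeps `main` from always showing up as the top hotspot.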

Pinpointing hotspots GUI snapshot (1/4)

Pinpointing hotspots GUI snapshot (2/4)

Pinpointing hotspots GUI snapshot (3/4)

Pinpointing hotspots GUI snapshot (4/4)


Pinpointing hotspots MPI characterization: Introduction
- Our methodology
  - Coarse grain: overview, global trends/patterns
  - Fine grain: filtering precise issues
- Tracing issues
  - Scalability
  - Memory size: can we reduce it?
  - Trace size: can we reduce it?
  - I/O wall: remove it?

Pinpointing hotspots MPI characterization: overview
- Scalable coarse grain analysis
- [Diagram labels: Web Visualizer, MPI APPLICATION, MAQAO, profile.js]

Pinpointing hotspots MPI characterization: key concepts
- Online profiling
- Aggregated metrics
- No traces
- No IOs (only one result file)
- Reduced memory footprint
- Scalable on 100+ procs
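The idea of aggregating online instead of tracing can be sketched as follows: each intercepted call updates a small per-primitive record (hits, time, size) and is then forgotten, so memory grows with the number of distinct primitives and not with the number of calls. The `Stat` and `OnlineProfile` names are hypothetical and the recorded values are made up; this is an illustration of the principle, not MAQAO's implementation.

```python
from dataclasses import dataclass


@dataclass
class Stat:
    hits: int = 0      # number of calls
    time: float = 0.0  # accumulated time in seconds
    size: int = 0      # accumulated message size in bytes


class OnlineProfile:
    """Keeps one aggregated record per MPI primitive; no event is stored."""

    def __init__(self):
        self.stats = {}

    def record(self, primitive, seconds, nbytes):
        stat = self.stats.setdefault(primitive, Stat())
        stat.hits += 1
        stat.time += seconds
        stat.size += nbytes


profile = OnlineProfile()
# Made-up events, as a wrapper around each MPI call might report them.
for event in [("MPI_Send", 0.5, 1024), ("MPI_Recv", 0.25, 1024),
              ("MPI_Send", 0.25, 2048)]:
    profile.record(*event)
```

After any number of calls the profile still holds one entry per primitive, which is why a single small result file is enough at the end of the run.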

Pinpointing hotspots MPI characterization: visualization
- Web based visualizer
- Only requires a web browser

Pinpointing hotspots MPI characterization: analyses (1/5)
- High level profile of MPI primitives (hits, time, size)

Pinpointing hotspots MPI characterization: analyses (2/5) Function scattering over time

Pinpointing hotspots MPI characterization: analyses (3/5) Probability densities

Pinpointing hotspots MPI characterization: analyses (4/5) Topology view (1/2)
- [Figure labels: selector, force-driven topology view, information panel, color scale, 3D topology]

Pinpointing hotspots MPI characterization: analyses (4/5) Topology view (2/2)
- Click on a node in the force layout to display its information
- Hover over a node to see its MPI rank
- Hover over an edge to see its value

Pinpointing hotspots MPI characterization: analyses (5/5) Communication matrix (1/3)

Pinpointing hotspots MPI characterization: analyses (5/5) Communication matrix (2/3) Zoom

Pinpointing hotspots MPI characterization: analyses (5/5) Communication matrix (3/3)
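The slides only show the communication matrix as screenshots. As a rough sketch of what such a matrix can contain, the code below accumulates point-to-point traffic into a rank-by-rank table. That a cell holds the bytes sent from the row rank to the column rank is an assumption, as are the `communication_matrix` helper and the sample messages.

```python
def communication_matrix(n_ranks, messages):
    """Accumulate (source, destination, bytes) messages into an n x n table."""
    matrix = [[0] * n_ranks for _ in range(n_ranks)]
    for src, dst, nbytes in messages:
        matrix[src][dst] += nbytes
    return matrix


# Made-up traffic between 3 ranks: a ring exchange plus one extra message.
messages = [(0, 1, 100), (1, 2, 100), (2, 0, 100), (0, 1, 50)]
matrix = communication_matrix(3, messages)
```

Like the per-primitive profile, this table is an aggregate: its size depends on the number of ranks, not on the number of messages.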

Pinpointing hotspots MPI characterization: next steps
- Fine grained analyses should be:
  - Investigated using (MPI) tracing tools
  - Filtered on the specific processes/events of interest detected thanks to this tool


Static performance modeling Introduction
- Main performance issues:
  - Core level
  - Multicore interactions
  - Communications
- Most of the time the core level is forgotten

Static performance modeling Goals
- Static performance model
  - Targets innermost loops
  - Source loop vs. assembly loop
  - Takes into account the processor (micro)architecture
- Assess code quality
  - Estimate performance
  - Degree of vectorization
  - Impact on the micro-architecture
- [Figure: source loop L255@file.c and ASM loops 1 to 5]

Static performance modeling Model
- Simulates the target (micro)architecture
  - Instruction descriptions (latency, uops dispatch...)
  - Machine model
- For a given binary and micro-architecture, provides:
  - Quality metrics (how well the binary is fitted to the micro-architecture)
  - Static performance (lower bounds on cycles)
  - Hints and workarounds to improve static performance
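A static lower bound on cycles can be illustrated with a toy machine model: given how many uops one loop iteration dispatches to each execution port, the iteration cannot finish faster than the busiest port allows, nor faster than the front end can issue all uops. The port names, the issue width and the `static_cycle_lower_bound` helper are hypothetical, and dependency latencies are ignored; MAQAO's actual model is more detailed than this sketch.

```python
import math


def static_cycle_lower_bound(uops_per_port, issue_width):
    """Lower bound on cycles per iteration for a toy out-of-order core.

    uops_per_port: uops dispatched to each port in one loop iteration.
    issue_width:   maximum uops the front end can issue per cycle.
    """
    total_uops = sum(uops_per_port.values())
    front_end_bound = math.ceil(total_uops / issue_width)
    # Each port is assumed to execute at most one uop per cycle.
    port_bound = max(uops_per_port.values())
    return max(front_end_bound, port_bound)


# Made-up loop body: port P1 is the bottleneck with 3 uops per iteration.
loop = {"P0": 2, "P1": 3, "P2": 1, "P3": 1, "P4": 1}
bound = static_cycle_lower_bound(loop, issue_width=4)
```

The result is only a bound: cache misses or long dependency chains can make the real loop slower, but no amount of tuning at run time can make this binary loop faster than the bound on that machine model.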

Static performance modeling Metrics
- Vectorization (ratio and speedup)
  - Makes it possible to predict the vectorization speedup (if vectorization is possible) and to increase the vectorization ratio if it is worthwhile
- High latency instructions (division/square root)
  - Makes it possible to use less precise but faster instructions like RCP (1/x) and RSQRT (1/sqrt(x))
- Unrolling (unroll factor detection)
  - Makes it possible to statically predict performance for different unroll factors (find main loops)
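As an illustration of the first metric, a vectorization ratio can be computed as the share of vectorizable instructions that are actually vectorized. The definition used here, the instruction list and the `vectorization_ratio` helper are assumptions made for this sketch; the exact formula used by MAQAO may differ.

```python
def vectorization_ratio(instructions):
    """Percentage of vectorizable instructions that are vectorized.

    instructions: list of (mnemonic, vectorizable, vectorized) tuples.
    Returns None when the loop has no vectorizable instruction.
    """
    vectorizable = [ins for ins in instructions if ins[1]]
    if not vectorizable:
        return None
    vectorized = [ins for ins in vectorizable if ins[2]]
    return 100.0 * len(vectorized) / len(vectorizable)


# Made-up loop body: two packed (vector) and two scalar floating-point
# instructions, plus loop control that is not vectorizable at all.
loop_body = [
    ("vmulpd", True, True),
    ("vaddpd", True, True),
    ("mulsd", True, False),
    ("addsd", True, False),
    ("add", False, False),
    ("jne", False, False),
]
ratio = vectorization_ratio(loop_body)
```

A ratio well below 100% on a hot innermost loop suggests that part of the arithmetic is still scalar, which is the kind of case where checking the compiler's vectorization report is worthwhile.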

Static performance modeling Output example (1/2)

Static performance modeling Output example (2/2)

MAQAO Tool Thanks for your attention! Questions?