Understanding applications using the BSC performance tools

Judit Gimenez (judit@bsc.es), German Llort (german.llort@bsc.es)

Humans are visual creatures. Films or books? Two hours vs. days (or months). Memorizing a deck of playing cards: each card is translated to an image (person, action, location). Our brain loves pattern recognition. What do you see in the pictures? PROCESS, STORE, IDENTIFY.

Our tools: since 1991, based on traces, open source (http://www.bsc.es/paraver).
Core tools: Paraver (paramedir), offline trace analysis; Dimemas, message-passing simulator; Extrae, instrumentation.
Focus: detail, variability, flexibility; behavioral structure vs. syntactic structure; intelligence (Performance Analytics).

Paraver (VIRTUAL INSTITUTE - HIGH PRODUCTIVITY SUPERCOMPUTING)

Paraver: performance data browser. Raw data; trace visualization/analysis plus trace manipulation. Timelines: goal = flexibility, no semantics, programmable. 2D/3D tables (statistics). Comparative analyses: multiple traces, synchronized scales.

Timelines. Each window displays one view: a piecewise-constant function of time, s(t) = S_i for t in [t_i, t_{i+1}).
Types of functions:
Categorical (state, user function, outlined routine): S_i in {0, ..., n}, n in N.
Logical (in a specific user function, in an MPI call, in a long MPI call): S_i in {0, 1}.
Numerical (IPC, L2 miss ratio, duration of an MPI call, duration of a computation burst): S_i in R.
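The piecewise-constant semantics above can be sketched as a tiny evaluator. `Timeline`, `state` and `in_mpi` are hypothetical names for illustration only, not part of Paraver:

```python
import bisect

class Timeline:
    """Piecewise-constant function of time: s(t) = S_i for t in [t_i, t_{i+1})."""
    def __init__(self, times, values):
        # times[i] is the start of interval i; values[i] is its semantic value S_i
        self.times = times
        self.values = values

    def __call__(self, t):
        # Find the interval containing t and return its value
        i = bisect.bisect_right(self.times, t) - 1
        return self.values[i]

# Categorical view: the state of one process over time
state = Timeline([0.0, 1.5, 3.0], ["Running", "MPI_Wait", "Running"])

# Logical view derived from the categorical one: "in an MPI call?"
in_mpi = lambda t: 1 if state(t).startswith("MPI_") else 0
```

Derived views like `in_mpi` mirror how Paraver builds logical timelines from categorical ones without attaching fixed semantics to the raw data.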

Tables: profiles, histograms, correlations. From timelines to tables: MPI calls profile, MPI calls histogram, useful-duration histogram.
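Going from timelines to tables amounts to aggregating interval durations per (process, value) cell. A minimal sketch, using an illustrative record format rather than Paraver's actual trace format:

```python
from collections import defaultdict

# Each record: (process, value, t_begin, t_end) -- one interval of a timeline
records = [
    (0, "MPI_Send", 0.0, 0.2), (0, "MPI_Send", 1.0, 1.1),
    (1, "MPI_Recv", 0.0, 0.5), (1, "MPI_Send", 2.0, 2.3),
]

def profile(records):
    """Sum time per (process, value): each sum is one cell of the 2D statistics table."""
    table = defaultdict(float)
    for proc, value, t0, t1 in records:
        table[(proc, value)] += t1 - t0
    return dict(table)

prof = profile(records)   # e.g. process 0 spent 0.3 in MPI_Send
```

Replacing the sum with a count gives a call-count profile, and bucketing the durations gives the histogram view.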

Analyzing variability through histograms and timelines. SPECFEM3D: useful duration, IPC, instructions, L2 miss ratio.

Analyzing variability through histograms and timelines. By the way: six months later... Useful duration, IPC, instructions, L2 miss ratio.

From tables to timelines. Where in the timeline do the values in certain table columns appear? E.g., do you want to see the time distribution of a given routine?

Variability is everywhere. CESM: 16 processes, 2 simulated days. The histogram of useful computation duration shows high variability. How is it distributed? Dynamic imbalance, in space and time: day and night. Season?

Trace manipulation: data handling/summarization capability.
Filtering: keeps a subset of the records in the original trace, selected by duration, type, value, ... The filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data are kept).
Cutting: keeps all records in a given time interval, or only some processes.
Software counters: summarized values computed from those in the original trace, emitted as new event types (#MPI calls, total hardware counts, ...).
Example (WRF-NMM, Peninsula 4 km, 128 procs): original trace (MPI + HWC) 570 s / 2.2 GB; with software counters 570 s / 5 MB; cut 4.6 s / 36.5 MB.
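Filtering and cutting can be sketched as simple record-level passes over a trace. The tuple layout below is hypothetical, not the actual Paraver trace format:

```python
# Record: (process, type, value, t_begin, t_end)
trace = [
    (0, "state", "useful",   0.0, 0.4),
    (0, "mpi",   "MPI_Send", 0.4, 0.41),
    (1, "state", "useful",   0.0, 2.0),
    (1, "mpi",   "MPI_Recv", 2.0, 2.9),
]

def filter_by_duration(trace, min_dur):
    """Filtering: keep only records at least min_dur long."""
    return [r for r in trace if r[4] - r[3] >= min_dur]

def cut(trace, t0, t1, procs=None):
    """Cutting: keep records inside [t0, t1], optionally for a subset of processes."""
    return [r for r in trace
            if r[3] >= t0 and r[4] <= t1 and (procs is None or r[0] in procs)]

small = filter_by_duration(trace, 0.5)     # drops the two short records
window = cut(trace, 0.0, 1.0, procs={0})   # process 0, first second only
```

Because both passes emit the same record format they consume, their output can be analysed with the same configurations, which is the point made above.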

Dimemas

Efficiency

Dimemas: coarse-grain, trace-driven simulation. The simulation is a highly non-linear model (MPI protocols, resource contention). Parametric sweeps on abstract architectures and on application computational regions.
What-if analysis: ideal machine (instantaneous network); estimating the impact of ports to MPI+OpenMP/CUDA/...; should I use asynchronous communications? Are all parts of an app equally sensitive to the network? MPI sanity check: modeling nominal CPU performance.
[Figure: abstract architecture (CPUs with local memories, connected through latency L and bandwidth B); impact of BW (L=8; B=0) for NMM and ARW at 128, 256 and 512 tasks.]
Paraver-Dimemas tandem: analysis and prediction; what-if from a selected time window (4 ms); detailed feedback on the simulation (trace).
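The parametric sweeps rest on a first-order point-to-point cost model, roughly T = L + size/BW. A minimal sketch of such a sweep, deliberately ignoring the MPI protocols and resource contention that the real simulator models:

```python
def transfer_time(size_bytes, latency_s, bw_bytes_s):
    """First-order message cost: latency plus serialization time."""
    return latency_s + size_bytes / bw_bytes_s

# Sweep bandwidth at fixed latency to see when a message stops being
# bandwidth-bound (illustrative numbers, not from the slides)
msg = 1_000_000  # 1 MB message
for bw_mb in (10, 100, 1000):
    t = transfer_time(msg, 25e-6, bw_mb * 1e6)
    print(f"BW={bw_mb} MB/s -> {t * 1e3:.3f} ms per message")
```

Setting latency and serialization time to zero gives the "instantaneous network" used for the ideal-machine what-if below.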

Network sensitivity. MPIRE, 32 tasks, no network contention: L = 1000 µs, BW = 100 MB/s; L = 25 µs, BW = 100 MB/s; L = 25 µs, BW = 10 MB/s. All windows use the same scale.

Ideal machine. The impossible machine: BW = infinite, L = 0. It actually describes/characterizes intrinsic application behavior: load balance problems? Dependence problems? GADGET @ Nehalem cluster, 256 processes: Allgather + Sendrecv, Allreduce, Alltoall, Sendrecv, Waitall. Real run vs. ideal network.

Models

Why scaling? Parallel efficiency = LB * Ser * Trf. CG-POP mpi2s1d, 180x120: good scalability!! Should we be happy?
[Figures: speed-up and efficiency factors vs. number of processes (0-400): parallel efficiency, LB, ulb, transfer; and Efficiency = Parallel eff * instr. eff * IPC eff.]
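The multiplicative model above can be sketched numerically from per-process useful computation times. This is a simplified reconstruction that folds serialization and transfer into one communication factor; the function name and data are illustrative:

```python
def efficiency_factors(useful, runtime):
    """Split parallel efficiency into load balance * communication efficiency,
    where communication efficiency stands in for Ser * Trf combined."""
    avg = sum(useful) / len(useful)
    mx = max(useful)
    lb = avg / mx        # load balance: how even the useful work is
    comm = mx / runtime  # fraction of the run the busiest process computes
    return lb, comm, lb * comm   # lb * comm = parallel efficiency

# Four processes with slightly uneven useful time, 12 s total runtime
lb, comm, par = efficiency_factors([8.0, 9.0, 10.0, 9.0], runtime=12.0)
```

Multiplying the parallel efficiency by instruction and IPC scaling factors, as in the slide, then yields the full efficiency model used to decide whether "good scalability" is actually good.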

Clustering

Using clustering to identify structure: completed instructions vs. IPC. J. Gonzalez et al., "Automatic Detection of Parallel Applications Computation Phases" (IPDPS 2009).
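The cited work groups computation bursts by density in the (instructions, IPC) plane. A minimal DBSCAN-style grouping in pure Python, on synthetic burst data, just to illustrate the idea (the real tool is far more elaborate):

```python
def dbscan(points, eps, min_pts):
    """Minimal density-based clustering: returns a cluster id per point, -1 = noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if (q[0] - points[i][0]) ** 2 + (q[1] - points[i][1]) ** 2 <= eps ** 2]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:        # not dense enough: noise (may be claimed later)
            labels[i] = -1
            continue
        labels[i] = cid              # start a new cluster and expand it
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:      # noise point reachable from a core: border point
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # core point: keep expanding
                queue.extend(nb_j)
        cid += 1
    return labels

# Bursts as (instructions in millions, IPC): two dense groups plus one outlier
bursts = [(100, 1.0), (101, 1.1), (99, 0.9), (500, 2.0), (502, 2.1), (503, 1.9), (300, 0.1)]
labels = dbscan(bursts, eps=5.0, min_pts=2)
```

Each resulting cluster corresponds to one computation phase, which can then be painted back onto the timeline to reveal the application's structure.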

Performance @ serial computation bursts: SPECFEM3D, WRF (128 cores), GROMACS. Observed patterns: asynchronous SPMD; balanced #instr with variability in IPC; SPMD with repeated substructure; coupled imbalance; MPMD structure with different coupled-imbalance trends.

Integrating models and analytics. What if, for PEPC, we increase the IPC of Cluster 1? 13% gain. We balance Clusters 1 & 2? 19% gain.

Tracking: scalability through clustering. OpenMX (strong scaling from 64 to 512 tasks): 64, 128, 192, 256, 384, 512.

Folding

Folding: detailed metrics evolution. Performance of a sequential region = 2000 MIPS. Is it good enough? Is it easy to improve?
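Folding overlays coarse samples from many instances of the same region onto one normalized time axis, turning sparse samples into a detailed evolution. A minimal sketch with synthetic data (not Extrae's sample format):

```python
def fold(instances, samples):
    """Map (time, value) samples from many instances of a region onto [0, 1).
    instances: list of (t_begin, t_end); samples: list of (t, value)."""
    folded = []
    for t, v in samples:
        for t0, t1 in instances:
            if t0 <= t < t1:
                folded.append(((t - t0) / (t1 - t0), v))  # normalize within instance
                break
    return sorted(folded)

# Two instances of the same routine; sparse MIPS samples taken across both
instances = [(0.0, 10.0), (20.0, 30.0)]
samples = [(2.0, 1800), (7.0, 2200), (23.0, 1900), (28.0, 2100)]
folded = fold(instances, samples)   # all samples now share one normalized axis
```

With enough instances, the folded samples densely cover [0, 1) and can be smoothed into the instantaneous metric curves shown on the next slide.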

Folding: instantaneous CPI stack. MRGENESIS: a trivial fix (loop interchange), but was it easy to locate? Next step? Availability of CPI-stack models for production processors? Provided by manufacturers?

Blind optimization: from folded samples of a few levels to the timeline structure of relevant routines; recommendations without access to the source code. Harald Servat et al., "Bio-inspired call-stack reconstruction for performance analysis", submitted, 2015.

Methodology

Performance analysis tools objective: help generate hypotheses, and help validate hypotheses, both qualitatively and quantitatively.

First steps.
Parallel efficiency: percentage of time invested in computation. Identify the sources of inefficiency: load balance, communication/synchronization.
Serial efficiency: how far from peak performance? IPC, correlated with other counters.
Scalability: code replication? Total #instructions.
Behavioral structure? Variability?
Paraver tutorial: Introduction to Paraver and Dimemas methodology.

BSC Tools web site: www.bsc.es/paraver. Downloads: sources/binaries for Linux/Windows/Mac. Documentation: training guides, tutorial slides.
Getting started: start wxparaver, open Help > Tutorials and follow the instructions; follow the training guides. Paraver introduction (MPI): navigation and basic understanding of Paraver operation.