Load Imbalance Analysis



Similar documents
Optimization tools. 1) Improving Overall I/O

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

End-user Tools for Application Performance Analysis Using Hardware Counters

ICP Data Validation and Aggregation Module Training document. HHC Data Validation and Aggregation Module Training Document

Performance Monitoring of Parallel Scientific Applications

Also on the Performance tab, you will find a button labeled Resource Monitor. You can invoke Resource Monitor for additional analysis of the system.

Test Run Analysis Interpretation (AI) Made Easy with OpenLoad

Performance Analysis and Optimization Tool

NetBeans Profiler is an

Linux tools for debugging and profiling MPI codes

InfoScale Storage & Media Server Workloads

- An Essential Building Block for Stable and Reliable Compute Clusters

Windows 2003 Performance Monitor. System Monitor. Adding a counter

Running a Workflow on a PowerCenter Grid

DNS (Domain Name System) is the system & protocol that translates domain names to IP addresses.

Parallel Computing. Benson Muite. benson.

The Complete Performance Solution for Microsoft SQL Server

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Azure VM Performance Considerations Running SQL Server

Uncovering degraded application performance with LWM 2. Aamer Shah, Chih-Song Kuo, Lucas Theisen, Felix Wolf November 17, 2014

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

So in order to grab all the visitors requests we add to our workbench a non-test-element of the proxy type.

MCTS Guide to Microsoft Windows 7. Chapter 10 Performance Tuning

Analytics for Performance Optimization of BPMN2.0 Business Processes

CMS Manual. Digital Video Network Surveillance System. Unisight Digital Technologies, Inc.

Debugging with TotalView

Tools for Performance Debugging HPC Applications. David Skinner

Applications. Network Application Performance Analysis. Laboratory. Objective. Overview

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

NVIDIA Tools For Profiling And Monitoring. David Goodwin

How To Visualize Performance Data In A Computer Program

Running applications on the Cray XC30 4/12/2015

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

MyOra 3.0. User Guide. SQL Tool for Oracle. Jayam Systems, LLC

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

User Reports. Time on System. Session Count. Detailed Reports. Summary Reports. Individual Gantt Charts

ABAP SQL Monitor Implementation Guide and Best Practices

Parallel Scalable Algorithms- Performance Parameters

ANCS+ 8.0 Remote Training: ANCS+ 8.0, Import/Export

Apple Configurator Settings for Deploying ios Devices

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005

vrealize Operations Manager User Guide

Tool - 1: Health Center

Overlapping Data Transfer With Application Execution on Clusters

Tech Tip: Understanding Server Memory Counters

ZOINED RETAIL ANALYTICS. User Guide

A1 Customer Relationship Management (CRM) & Sales Force Management Guide

DataPA OpenAnalytics End User Training

Operations Guide for the HMC and Managed Systems Version 7 Release 3. ESCALA Power6 REFERENCE 86 A1 85FF 00

DROOMS DATA ROOM USER GUIDE.

MPI and Hybrid Programming Models. William Gropp

Main Points. Scheduling policy: what to do next, when there are multiple threads ready to run. Definitions. Uniprocessor policies

Mobile Technique and Features

MAQAO Performance Analysis and Optimization Tool

MarkLogic Server. Monitoring MarkLogic Guide. MarkLogic 8 February, Copyright 2015 MarkLogic Corporation. All rights reserved.

Symmetric Multiprocessing

VMware vrealize Operations for Horizon Administration

MyOra 3.5. User Guide. SQL Tool for Oracle. Kris Murthy

Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

ArcGIS for Server Performance and Scalability: Testing Methodologies. Andrew Sakowicz, Frank Pizzi,

System Copy GT Manual 1.8 Last update: 2015/07/13 Basis Technologies

Infor LN Service User Guide for Service Scheduler Workbench

Optimizing Application Performance with CUDA Profiling Tools

Introducing the Microsoft IIS deployment guide

SQL Server 2012 Optimization, Performance Tuning and Troubleshooting

NETWORK PRINT MONITOR User Guide

Default Thresholds. Performance Advisor. Adaikkappan Arumugam, Nagendra Krishnappa

Quick Start Guide.

FIGURE Selecting properties for the event log.

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes

Chapter 2: Getting Started

Energy Dashboards And Report Tool

The Hadoop Distributed File System

QT9 Quality Management Software

Universal Simple Control, USC-1

Bottleneck Detection in Parallel File Systems with Trace-Based Performance Monitoring

This presentation covers virtual application shared services supplied with IBM Workload Deployer version 3.1.

Solving Performance Problems In SQL Server by Michal Tinthofer

Job Scheduling Explained More than you ever want to know about how jobs get scheduled on WestGrid systems...

Database Studio is the new tool to administrate SAP MaxDB database instances as of version 7.5.

Deep Dive: Maximizing EC2 & EBS Performance

One of the database administrators

PORTAL ADMINISTRATION

sql server best practice

Monitoring Replication

VMS A1 Client Software. User Manual (V2.0)

ScrumDesk Quick Start

This exhibit describes how to upload project information from Estimator (PC) to Trns.port PES (server). Figure 1 summarizes this process.

Project Management within ManagePro

Using the Local Document Organizer in ProjectWise

Database Performance Report Using PivotalVRP

Load Testing Hyperion Applications Using Oracle Load Testing 9.1

QAD Business Intelligence Dashboards Demonstration Guide. May 2015 BI 3.11

Transcription:

With CrayPat

Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization (Collective communication and barriers) Imbalance time = Average time Minimum time Identifies computational code regions and synchronization calls that could benefit most from load balance optimization Estimates how much overall program time could be saved if corresponding section of code had a perfect balance Represents upper bound on potential savings Assumes other processes are waiting, not doing useful work while slowest member finishes Time% Time Imb Imb Calls Group Time Time% Function PE=HIDE 1000% 20643909 - - - - 11490 Total - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 988% 20395989 - - - - 2190 USER - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 911% 18797060 0115535 07% 20 jacobi 77% 1597866 0006647 05% 10 initmt ============================================================= 12% 0239306 - - - - 8710 MPI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 07% 0148981 0094595 444% 1590 MPI_Waitall 04% 0085824 0023669 247% 3180 MPI_Isend

Load Imbalance Analysis Imbalance time percentage represents the percentage of resources available for parallelism that is wasted Imbalance% = 100 X Corresponds to percentage of time that rest of team is not engaged in useful work on the given function Perfectly balanced code segment has imbalance of zero percentage Serial code segment has imbalance of 100 percent Imbalance time Max Time N X N - 1 Time% Time Imb Imb Calls Group Time Time% Function PE=HIDE 1000% 20643909 - - - - 11490 Total - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 988% 20395989 - - - - 2190 USER - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 911% 18797060 0115535 07% 20 jacobi 77% 1597866 0006647 05% 10 initmt ============================================================= 12% 0239306 - - - - 8710 MPI - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 07% 0148981 0094595 444% 1590 MPI_Waitall 04% 0085824 0023669 247% 3180 MPI_Isend

Load Imbalance Analysis MPI Sync time measures load imbalance in programs instrumented to trace MPI functions to determine if MPI ranks arrive at collectives together Separates potential load imbalance from data transfer Sync times reported by default if MPI functions traced If desired, PAT_RT_MPI_SYNC=0 deactivates this feature Only reported for tracing experiments Time% Time Imb Imb Calls Group Time Time% Function PE=HIDE 1000% 20643909 - - - - 11490 Total - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 00% 0008614 - - - - 590 MPI_SYNC - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 00% 0006696 0006627 990% 20 MPI_Barrier(sync) 00% 0001802 0001399 776% 550 MPI_Allreduce(sync) 00% 0000061 0000052 863% 10 MPI_Init(sync) 00% 0000056 0000051 917% 10 MPI_Finalize(sync) ===================================================================

Causes and hints What is causing the load imbalance? Need profiler reports like CrayPAT gives for the where Need application expertise for the why Computation Is decomposition appropriate? Would reordering ranks help? Communication Is decomposition appropriate? Would reordering ranks help? Are receives pre-posted? Any All-to-1 communication? I/O synchronous single-writer I/O will cause significant load imbalance already with a couple of MPI tasks (More on IO tomorrow)

Rank placement The default ordering can be changed using the following environment variable: export MPICH_RANK_REORDER_METHOD=N These are the different values (N) that you can set it to: N=0: Round-robin placement Sequential ranks are placed on the next node in the list 0, 1, 2, 3, 0, 1, 2, 3 (8 tasks on 4 nodes, 2 tasks per node) N=1: (DEFAULT) SMP-style- (block-) placement 0, 0, 1, 1, 2, 2, 3, 3 (8 tasks on 4 nodes, 2 tasks per node) N=2: Folded rank placement 0, 1, 2, 3, 3, 2, 1, 0 (8 tasks on 4 nodes, 2 tasks per node) N=3: Custom ordering The ordering is specified in a file named MPICH_RANK_ORDER

Rank placement with CrayPat When is rank placement a priori useful? Point-to-point communication consumes a significant fraction of program time and a load imbalance detected Also shown to help for collectives (alltoall) on subcommunicators Spread out I/O servers across nodes CrayPat can provide the following feedback ================ Observations and suggestions ======================== MPI Grid Detection: There appears to be point- to- point MPI communication in a 4 X 2 X 8 grid pattern The execution time spent in MPI functions might be reduced with a rank order that maximizes communication between ranks on the same node The effect of several rank orders is estimated below A file named MPICH_RANK_ORDERGrid was generated along with this report and contains usage instructions and the Hilbert rank order from the following table Rank On- Node On- Node MPICH_RANK_REORDER_METHOD Order Bytes/PE Bytes/PE% of Total Bytes/PE Hilbert 5533e+10 Fold 4907e+10 SMP 4883e+10 RoundRobin 3740e+10 9066% 3 8042% 2 8002% 1 6128% 0 # The 'Custom' rank order in this file targets nodes with multi- core # processors, based on Sent Msg Total Bytes collected for: # # Program: /lus/nid00030/heidi/sweep3d/mod/sweep3dmpi # Ap2 File: sweep3dmpi+pat+27054-89tap2 # Number PEs: 48 # Max PEs/Node: 4 # # To use this file, make a copy named MPICH_RANK_ORDER, and set the # environment variable MPICH_RANK_REORDER_METHOD to 3 prior to # executing the program # # The following table lists rank order alternatives and the grid_order # command- line options that can be used to generate a new order 0,532,64,564,32,572,96,540,8,596,72,524,40,604,24,588 104,556,16,628,80,636,56,620,48,516,112,580,88,548,120,612 1,403,65,435,33,411,97,443,9,467,25,499,105,507,41,475 73,395,81,427,57,459,17,419,113,491,49,387,89,451,121,483

Hybrid MPI + OpenMP? OpenMP may help Able to spread workload with less overhead Large amount of work to go from all-mpi to (better performing) hybrid - must accept challenge to hybridize large amount of code When does it pay to add OpenMP to my MPI code? Add OpenMP when code is network bound Adding OpenMP to memory bound codes may aggravate memory bandwidth issues, but you have more control when optimizing for cache Look at collective time, excluding sync time: this goes up as network becomes a problem Look at point-to-point wait times: if these go up, network may be a problem If an all-to-all communication pattern becomes a bottleneck, hybridization often overcomes this Hybridization can be used to avoid replicated data

Cray Apprentice 2 Cray Apprentice2 is a post-processing performance data visualization tool Takes *ap2 files as input Main features are Call graph profile Communication statistics Time-line view for Communication and IO Activity view Pair-wise communication statistics Text reports Source code mapping Cray Apprentice 2 helps identify: Load imbalance Excessive communication Network contention Excessive serialization I/O Problems > module load perftools > app2 my_programap2 & 10

Cray Apprentice 2 11

Call Tree View Width ó inclusive time Height ó exclusive time Load balance overview: Height ó Max time Middle bar ó Average time Lower bar ó Min time Yellow represents imbalance time DUH Button: Provides hints for performance tuning Filtered nodes or sub tree Function List Zoom 12

Call Tree View Function List Right mouse click: View menu: eg, Filter Right mouse click: Node menu eg, hide/unhide children Sort options % Time, Time, Imbalance % Imbalance time Function List off 13

Apprentice 2 Call Tree View of Sampled Data 14

Load Balance View (from Call Tree) Min, Avg, and Max Values -1, +1 Std Dev marks 15

Time Line View Full trace (sequence of events) enabled by setting PAT_RT_SUMMARY=0 Helpful to see communication bottlenecks Use it only for small experiments! 16

Time Line View (Zoom) User Functions, MPI & SHMEM Line I/O Line 17

Time Line View (Fine Grain Zoom) 18