Performance Tools. Tulin Kaman. tkaman@ams.sunysb.edu. Department of Applied Mathematics and Statistics

Performance Tools
Tulin Kaman
Department of Applied Mathematics and Statistics
Stony Brook/BNL New York Center for Computational Science
tkaman@ams.sunysb.edu
Aug 24, 2012

Performance Tools
Community tools:
- GNU Profiler: tool provided with the GNU compiler
- Tuning and Analysis Utilities (TAU)
- PAPI (Performance Application Programming Interface)
High Performance Computing Toolkit (HPCT) for IBM Blue Gene:
- Message Passing Interface (MPI) Profiler and Tracer tool
- Xprofiler for CPU profiling
- Hardware Performance Monitoring (HPM) library
- Modular I/O (MIO) library

Tuning and Analysis Utilities - TAU
- Performance evaluation tool: a profiling and tracing toolkit for performance analysis of parallel programs written in C, C++, Fortran, Java and Python
- Support for multiple parallel programming paradigms: MPI, multi-threading, hybrid (MPI + threads)
- Access to hardware counters using PAPI
- Memory profiling

Memory events TAU can evaluate:
- How much heap memory is currently used? (memory utilization options)
- How much can a program grow before it runs out of free memory on the heap? (memory headroom evaluation options)
- Memory leaks in C/C++ programs: TAU tracks malloc calls through the execution and issues a user event when the program fails to free the allocated memory.

Dynamic instrumentation through library preloading
Enabled by command-line options to tau_exec:
tau_exec [options] [--] <exe> <exe options>
Options:
-io      track I/O
-memory  track memory
-cuda    track GPU events via CUDA
-opencl  track GPU events via OpenCL
Examples:
$ tau_exec -memory ./a.out
$ mpirun -np 4 tau_exec -memory ./a.out

Memory Utilization Option: TAU_TRACK_MEMORY()
- Call it in your code.
- Once called, an interrupt is generated every 10 seconds (the default).
- TAU_SET_INTERRUPT_INTERVAL(value) changes the interrupt interval.
- Tracking is explicitly enabled or disabled by calling TAU_ENABLE_TRACKING_MEMORY() / TAU_DISABLE_TRACKING_MEMORY().

Memory Utilization Option: TAU_TRACK_MEMORY_HERE()
- Triggers memory tracking at a given execution point.
Example: /gpfs/home1/tulin/TAUP/tau-2.18/examples/memory/simple.cpp

Memory Utilization Option: -PROFILEMEMORY
Tracks heap memory utilization at each function entry.
Each configuration creates a unique Makefile:
./configure -arch=bgp -pdt=/gpfs/home1/tulin/taup/pdtoolkit-3.14 -pdt_c++=xlc -mpi -PROFILEMEMORY
Set environment variables:
export PATH=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/bin:$PATH
Use TAU_MAKEFILE:
export TAU_MAKEFILE=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/lib/Makefile.tau-memory-mpi-pdt
Compile your code using the TAU compiler scripts: tau_cc.sh, tau_cxx.sh, tau_f90.sh
Run to generate profile files.

Memory Headroom Option: TAU_TRACK_MEMORY_HEADROOM()
Examines how much a program can grow before it runs out of free memory on the heap.
- Call it in your code.
- Vary the size of the callstack by setting the environment variable TAU_CALLSTACK_DEPTH (default is 2).
- Once called, an interrupt is generated every 10 seconds (the default); TAU_SET_INTERRUPT_INTERVAL(value) changes the interrupt interval.
- Tracking is explicitly enabled or disabled by calling TAU_ENABLE_TRACKING_MEMORY_HEADROOM() / TAU_DISABLE_TRACKING_MEMORY_HEADROOM().

Memory Headroom Option: TAU_TRACK_MEMORY_HEADROOM_HERE()
- Evaluates how much memory it can allocate at a given execution point and associates it with the callstack.
Example: /gpfs/home1/tulin/TAUP/tau-2.18/examples/headroom/here/simple.cpp

Memory Headroom Option: -PROFILEHEADROOM
Tracks free memory (headroom) at each function entry.
Each configuration creates a unique Makefile:
./configure -arch=bgp -pdt=/gpfs/home1/tulin/taup/pdtoolkit-3.14 -pdt_c++=xlc -mpi -PROFILEHEADROOM
Set environment variables:
export PATH=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/bin:$PATH
Use TAU_MAKEFILE:
export TAU_MAKEFILE=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/lib/Makefile.tau-headroom-mpi-pdt
Compile your code using the TAU compiler scripts: tau_cc.sh, tau_cxx.sh, tau_f90.sh
Run to generate profile files.

How to detect memory leaks?
(The slide shows a profile screenshot: two memory leaks are reported, with values in the range 5 < value <= 15.)

Shared Memory Programming: OpenMP
OpenMP = Open Multi-Processing
- The OpenMP Application Program Interface (API) for writing shared memory parallel programs.
- Supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures.
- Consists of compiler directives, runtime library routines, and environment variables.
- An OpenMP program is portable: compilers have OpenMP support, and it requires little programming effort.

Example: PI by numerical integration
(The slides compute pi = 3.14159265358979323846264338327950288419716... via numerical integration; the source listing appears only as a screenshot.)
To run with four threads, set the environment variable OMP_NUM_THREADS=4.

omp_get_wtime(): portable wall-clock timing routine
- Returns a double-precision floating point value equal to the number of elapsed seconds.
Fortran: DOUBLE PRECISION FUNCTION OMP_GET_WTIME()
C/C++:   #include <omp.h>
         double omp_get_wtime(void);

Thread = 1: omp_time = 41.176492 sec
Thread = 2: omp_time = 20.588791 sec
Thread = 4: omp_time = 10.295631 sec
(near-ideal speedup: the runtime halves each time the thread count doubles)

Thread = 1: omp_time = 82.352963 sec
Thread = 2: omp_time = 61.765259 sec
Thread = 4: omp_time = 51.472104 sec
(poor scaling: doubling the thread count recovers only part of the runtime)

OpenMP with TAU
Each configuration creates a unique Makefile:
./configure -arch=bgp -pdt=/gpfs/home1/tulin/taup/pdtoolkit-3.14 -pdt_c++=xlc -openmp
Set environment variables:
export PATH=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/bin:$PATH
Use TAU_MAKEFILE:
export TAU_MAKEFILE=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/lib/Makefile.tau-pdt-openmp
Compile your code using the TAU compiler scripts: tau_cc.sh, tau_cxx.sh, tau_f90.sh
Run to generate profile files.

OpenMP + MPI with TAU
Each configuration creates a unique Makefile:
./configure -arch=bgp -pdt=/gpfs/home1/tulin/taup/pdtoolkit-3.14 -pdt_c++=xlc -mpi -openmp
Set environment variables:
export PATH=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/bin:$PATH
Use TAU_MAKEFILE:
export TAU_MAKEFILE=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/lib/Makefile.tau-mpi-pdt-openmp
Compile your code using the TAU compiler scripts: tau_cc.sh, tau_cxx.sh, tau_f90.sh
Run to generate profile files.

Pthread with TAU
Each configuration creates a unique Makefile:
./configure -arch=bgp -pdt=/gpfs/home1/tulin/taup/pdtoolkit-3.14 -pdt_c++=xlc -pthread
Set environment variables:
export PATH=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/bin:$PATH
Use TAU_MAKEFILE:
export TAU_MAKEFILE=/gpfs/home1/tulin/TAUP/tau-2.18/bgp/lib/Makefile.tau-pthread-pdt
Compile your code using the TAU compiler scripts: tau_cc.sh, tau_cxx.sh, tau_f90.sh
Run to generate profile files.

Stony Brook Center for Computational Science
Tutorial videos and presentations: http://www.stonybrook.edu/sbccs/tutorials.shtml
Tuning and Analysis Utilities (TAU): http://www.cs.uoregon.edu/research/tau/home.php