Sequential Performance Analysis with Callgrind and KCachegrind
2nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008
Josef Weidendorfer
Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik, Technische Universität München
Outline
- Background
- Callgrind & KCachegrind
  - Measurement
  - Visualization
- Practical
  - Getting started
  - Example: Matrix Multiplication

Weidendorfer: Callgrind / KCachegrind
Background
- Sequential performance bottlenecks (directly related to parallel performance!)
  - Logical errors (unneeded function calls)
  - Bad algorithm for the given problem
  - Bad exploitation of available resources
- How to improve performance
  - Use tuned libraries when available
  - Check for the obstacles above: use performance analysis tools
Performance Analysis Tools
- Count occurrences of events
  - Resource exploitation is related to events
  - Software-related: function call, OS scheduling, ...
  - Hardware-related: FLOP executed, memory access, cache miss, ...
- Relate events to source code
  - Find code regions where most time is spent
  - Check for improvement after changes
- Profiling data: sum of events happening at the same code position
  - Inclusive vs. exclusive cost
How to Measure Events
- Software-related: instrumentation (= insertion of measurement code) into OS / application, manually or automatically, at source or binary level
- Hardware-related: read hardware performance counters
  - Exact counting: gives exact event counts for code ranges, but needs instrumentation and has a fixed (high) influence on measurement results
  - Statistical (sampling): event distribution over code approximated by checking only every N-th event; the hardware notifies only about every N-th event, so the influence is tunable via N
- Architecture simulation
  - Events generated by a (simplified) hardware model
  - No influence on the measured program: allows for sophisticated online processing
Architectural Performance Problem Today: Main Memory Access
- Latency: ~200 cycles
  - 400 FLOPs wasted for one main memory access
  - Solution: memory controller on chip; exploit fast caches (locality of accesses!); prefetch data (automatically)
- Bandwidth available for one chip: ~3-10 GB/s
  - All cores have to share the bandwidth
  - This can prevent effective prefetching
  - Solution: share data in caches among cores; keep the working set in cache (temporal locality!); use a good data layout (spatial locality!)
Callgrind
Cache Simulation with Call-Graph Relation
Callgrind: Basic Features
- Based on Valgrind
  - Runtime instrumentation infrastructure (no recompilation needed)
  - Dynamic binary translation of user-level processes
  - Linux/AIX on x86, x86-64, PPC32, PPC64
  - Correctness-checking & profiling tools on top
    - memcheck: accessibility/validity of memory accesses
    - helgrind / drd: race detection in multithreaded code
    - cachegrind / callgrind: cache simulation
    - massif: space profiling
  - Open source (GPL), www.valgrind.org
Callgrind: Basic Features
- Part of Valgrind since 3.1; open source, GPL
- Measurement method: profiling via simulation
  - Simple cache model (synthetic events, derived from Cachegrind)
  - Instruments memory accesses to feed the cache simulator
  - Hooks into call/return instructions, thread switches / signal handlers
  - Instruments (conditional) jumps for the CFG inside functions
- Presentation of results: callgrind_annotate / KCachegrind
Callgrind: Pro and Contra
- Usage of Valgrind
  - Driven only by user-level instructions of one process
  - Slowdown (call-graph tracing: 15-20x; with cache simulation: 40-60x)
  - But: fast-forward mode is only 2-3x
  - Allows detailed observation (arbitrary metrics such as reuse distance)
  - Does not need root access / cannot crash the machine
- Cache model is not reality: synchronous two-level inclusive cache hierarchy (size/associativity taken from the real machine)
  - Easy for the user to understand and reconstruct
  - Reproducible results, independent of real machine load
  - Derived optimizations applicable to most architectures
Callgrind: Advanced Features
- Interactive control (backtrace, dump command, ...)
- Fast-forward mode to get to the interesting code phase quickly
- Application control via client requests (start/stop, dump)
- Options for avoidance of function-call cycles
  - Cycles are bad for analysis (inclusive costs not applicable)
  - Add context info into function names (call chain / recursion depth)
- Best-case simulation of a simple L2 stream prefetcher
- Usage of cache lines before eviction
  - Relates to temporal/spatial locality
  - How often used / how many bytes unused
Callgrind: Usage
- valgrind --tool=callgrind [callgrind options] yourprogram args
- Cache simulator: --simulate-cache=yes
- Start in fast-forward mode: --instr-atstart=no
- Switch on event collection: callgrind_control -i on
- Jump tracing in functions (CFG): --collect-jumps=yes
- Separate dumps per thread: --separate-threads=yes
- Current backtrace of threads (interactive): callgrind_control -b
- Spontaneous dump: callgrind_control -d [dump identification]
- Prefetch simulation: --simulate-hwpref=yes
- Cache-line usage: --cacheuse=yes
KCachegrind
Graphical Browser for Profile Visualization
KCachegrind: Features
- Open source, GPL; kcachegrind.sf.net; included with KDE3 & KDE4
- Visualization of
  - Call relationships of functions (callers, callees, call graph)
  - Exclusive/inclusive cost metrics of functions
  - Grouping according to ELF object / source file / C++ class
  - Source/assembly annotation: costs + CFG
- Arbitrary event counts + specification of derived events
- Callgrind support (file format, events of the cache model)
KCachegrind: Usage
- kcachegrind callgrind.out.<pid>
- Left: dockables
  - List of function groups, grouped according to library (ELF object), source file, or class (C++)
  - List of functions with inclusive and exclusive costs
- Right: visualization panes
KCachegrind: Visualization Panes for the Selected Function
- List of event types
- List of callers/callees
- Treemap visualization
- Call graph
- Source annotation
- Assembly annotation
Work in Progress
- Callgrind
  - Multicore cache simulation (detection of data sharing, not load balancing)
  - Command-line tool for merging measurements & results
  - Relating events to data structures
- KCachegrind
  - Pure Qt4 version (e.g. Windows)
  - Cross-system support
Outline
- Background
- Callgrind & KCachegrind
  - Measurement
  - Visualization
- Practical
  - Getting started
  - Example: Matrix Multiplication
Practical: Getting Started
- Login to the cluster (or use Knoppix locally):
  ssh rzvmpi13@frontend.dgrid.hlrs.de (PW: tws_us4er)
  svn co --username tws_user https://svn.gforge.hlrs.de/svn/tws-examples/kcachegrind
- Set up your environment:
  module load valgrind
  See tws-examples/kcachegrind/readme
- Test: what happens in /bin/ls?
  valgrind --tool=callgrind ls /usr/bin
  kcachegrind
  - Which function takes the most instruction executions? What is its purpose?
  - Where is the main function?
Practical: Detailed Analysis of Matrix Multiplication
- Kernel for C = A * B
  - Side length N: N^3 multiplications + N^3 additions
  - 3 nested loops (i,j,k): best index order?
  - Optimization for large matrices: blocking
- Code: mm/mm.c
  make CFLAGS="-O2 -g"
- Timing of the orderings (sizes 300-800):
  ./mm 300 800
- Cache behavior for a small matrix (fitting into cache):
  valgrind --tool=callgrind --simulate-cache=yes ./mm 300
  - How good is the L1/L2 exploitation of the MM versions?
- Large matrix (mm800/callgrind.out): how does blocking help?
Q & A