The Intel VTune Performance Analyzer




Focusing on VTune for Intel Itanium running Linux* OS

Copyright 2002 Intel Corporation. All rights reserved. VTune and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

VTune Performance Analyzer
Helps to identify and characterize performance issues by:
- Collecting performance data
  - CPU cycles (time)
  - Micro-architectural events of the processor
  - Platform resource utilization
- Organizing and displaying the data
- Identifying performance hotspots
- Suggesting improvements

A Note about VTune and Other Tools
- The most useful feature of VTune is event-based sampling: configuring and monitoring the Itanium architecture performance counters and displaying the event occurrence data against the workload of the system being analyzed.
- Many other tools can do this too:
  - HPCMon: free utility from Intel; includes source code. Ask the presenter for a copy.
  - EMON: batch-like tool used within Intel. It also knows about some non-published monitor events. Available on request (no support) if there is a need (NDA).
  - PFMON from HP: ftp://ftp.hpl.hp.com/pub/linux-ia64/
  - PAPI (PapiRun, PapiProf), Rabbit, HPCToolKit, etc.
  - Look on the Web: there are numerous others.
- The differences are in ease of use, added features, APIs, processor support, OS support, navigation, performance data compatibility, source code support, etc.

VTune Performance Analysis for Linux
- Native: VTune for Linux 3.0
  - Any IA-32 or Itanium system running a recent Linux version
  - Some kernel and glibc dependencies
  - Full Eclipse-based GUI only for IA-32 today, due to Eclipse issues with 64-bit
  - For Itanium and EM64T: command-line version, but with graphical viewers for results
  - Eclipse-based release for 64-bit systems later in 2005 (VTune 3.5)
- Remote Data Collection (RDC) from Windows* OS
  - Allows the full Windows GUI to be used for Linux too
  - Needs VTune 7.2 for Windows
  - RDC driver comes with the VTune package and includes source code

Remote Data Collection
- VTune analyzer for Windows installed on the host system
- Remote sampling data collector installed on the target system
- Host system: Windows* OS on IA-32 or Itanium; controls the target and views the results of data collection
- LAN connection between host and target
- Target system:
  - IA-32 or Itanium processor family
  - Windows or Linux*
  - Intel PXA250 applications processor running Windows CE

Linux Driver Kit
- Required for RDC and VTune for Linux
- Pre-built binaries for many kernels
- Source code SDK in VTune 7.2; also at http://opensource.intel.com
- The driver kit requires the kernel to export sys_call_table; some older kernels have to be rebuilt
- Many OSV kernels have explicit support: SUSE 8.x, 9.0, Red Hat AS 2.1 Update 2
- Support for the 2.6 kernel is available through the latest patch and soon in release 8.0
- Beta program for VTune 8.0 for Linux just started (end of August 2005)

VTune Features
- Sampling of execution addresses: profiling based on processor event counters
- Call graph profiling (instrumented analysis): call tree, number of calls, timing information, obtained by executing instrumented code
- Intel Tuning Assistant: interprets the results (Windows or RDC only)

The Sampling Methodology
- Samples the CPU's execution context: instruction address (module, source line, assembly line), OS process, OS thread ID
- Very easy to use, no special build; source line view requires symbol info (-g compiler option)
- Very low intrusion
- System-wide measurements
- Sample rate set to provide statistically meaningful data; based on CPU clock speed or auto-calibrated
- Measures performance-sensitive CPU events: cycles (time), cache misses, branch mispredictions, bank conflicts
  - On Itanium there are well over 100 such events, many of them with multiple sub-events
  - At most 4 events per run, limited by the number of PMU (performance monitoring unit) registers
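As a rough illustration of how sampled execution contexts turn into a hotspot list, the sketch below counts samples per (module, function) pair and ranks them. The record layout and the sample data are hypothetical and are not VTune's actual data format.

```python
from collections import Counter

def hotspots(samples, top=3):
    """Rank (module, function) pairs by the number of samples that landed in them."""
    counts = Counter((s["module"], s["function"]) for s in samples)
    total = sum(counts.values())
    # Each entry: module, function, sample count, percentage of all samples.
    return [(mod, fn, n, 100.0 * n / total)
            for (mod, fn), n in counts.most_common(top)]

# Hypothetical sample records, one per sampling interrupt.
samples = (
    [{"module": "libc.so", "function": "memcpy"}] * 6
    + [{"module": "mysqld", "function": "parse"}] * 3
    + [{"module": "vmlinux", "function": "schedule"}] * 1
)

report = hotspots(samples)
for mod, fn, n, pct in report:
    print(f"{mod:10s} {fn:10s} {n:3d} samples  {pct:5.1f}%")
```

Because the ranking is purely statistical, it is only as trustworthy as the number of samples behind it.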

Sampling Process View: System-wide Data Collection

Sampling Source View: Source Code Annotated with Performance Data

VTL - VTune Native Linux Version: Sampling on Linux
- Test: MySQL 4.0.11, test-select
- memcpy contains 6 of the first 11 top hot-spots

Selective Sampling
- The VTune Pause/Resume API can be used to limit sampling to specific parts of your application
  - #include <vtuneapi.h>
  - Link with vtuneapi.lib
  - Call VTResume() and VTPause() as appropriate
  - Enable the "Start with data collection paused" option in the configuration dialog
- A more sophisticated Config/Start/Stop API is also available (see the online documentation for details)

How Sampling Works: How Event-based Sampling (EBS) Works
- Conceptual diagram: the selected event signal counts down from the sample-after number; on underflow to zero, the internal interrupt controller interrupts the CPU to take a sample
- How do you choose a sample-after number?
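The countdown mechanism can be mimicked in a few lines. This is a simplified software model, not the hardware or driver logic: the event stream and addresses are made up, and a real PMU counter is reloaded by the interrupt handler rather than by a loop.

```python
def simulate_ebs(event_stream, sample_after):
    """Model event-based sampling.

    event_stream yields (instruction_address, events_occurred) pairs.
    A counter starts at sample_after and counts down by the number of
    events; each time it reaches zero or below, a sample is recorded at
    the current address and the counter is reloaded.
    """
    counter = sample_after
    samples = []
    for address, n_events in event_stream:
        counter -= n_events
        while counter <= 0:          # underflow: "interrupt" and take a sample
            samples.append(address)
            counter += sample_after  # reload for the next sampling period
    return samples

# Hypothetical stream: 12 events in total, sampled every 5 events.
stream = [(0x100, 3), (0x104, 4), (0x108, 5)]
taken = simulate_ebs(stream, sample_after=5)
print([hex(a) for a in taken])
```

A smaller sample-after value yields more samples from the same stream, at the cost of more interrupts.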

How Sampling Works: How Many Samples Are Enough?
- One million samples for a five-second run?
  - Do you have enough samples for the result to be statistically significant?
  - How much overhead are you causing?
- What if you only get 100 samples?
- Is your sample-after number 1? Are you getting a good profile?
- About 1,000 samples per second is a good balance between significance and overhead.

How Sampling Works: Objective of 1,000 Samples Per Second
- What is the sample-after value for clockticks?
  - It depends on the CPU clock speed
  - Answer: the CPU clock speed in kHz
  - If the CPU clock speed is 1,400,000,000 Hz, sample after 1,400,000 clockticks
- What is the sample-after value for L2 cache read misses?
  - It depends on how often you miss the L2 cache!
  - Circular definition? Isn't that what you are trying to determine?
  - Make an intelligent guess! Estimate: do misses occur more or less often than clockticks? 10 times? 100 times? 1,000 times?
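The arithmetic behind the 1,000-samples-per-second target is a single division. The sketch below reproduces the clocktick example and shows what an estimate for a rarer event looks like; the "100 times less often" ratio is only a guess for illustration.

```python
def sample_after_value(events_per_second, target_samples_per_second=1000):
    """Sample-after value that yields roughly the target sampling rate."""
    return max(1, round(events_per_second / target_samples_per_second))

# Clockticks on a 1.4 GHz CPU: one event per cycle.
clock_hz = 1_400_000_000
sa_clockticks = sample_after_value(clock_hz)

# Assumed guess: L2 read misses occur about 100 times less often than clockticks.
sa_l2_misses = sample_after_value(clock_hz / 100)

print(sa_clockticks, sa_l2_misses)
```

For clockticks the result equals the clock speed expressed in kHz, which is the rule of thumb stated above.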

How Sampling Works: Calibration
- Calibration sets the sample-after value to get a reasonable number of samples: about 1,000 samples per second per logical CPU
- Requires the workload to be run twice
- Manual calibration:
  - Uncheck "Calibrate Sample After value" (found on the Advanced Activity Configuration dialog)
  - Start with the default value or an estimate
  - Run a test
  - Modify the sample-after value and re-test
  - Try to get about 1,000 samples per second per logical CPU
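One manual calibration step can be written as a proportional adjustment: scale the old sample-after value by the ratio of the observed to the desired sampling rate. This is a sketch of the idea only; how VTune's automatic calibration actually computes the value is not described here, and the run figures are invented.

```python
def recalibrate(old_sample_after, samples_collected, duration_s, logical_cpus,
                target_rate=1000):
    """Return an adjusted sample-after value from the result of a test run.

    The observed rate is expressed in samples per second per logical CPU,
    the same unit as the target.
    """
    observed_rate = samples_collected / (duration_s * logical_cpus)
    return max(1, round(old_sample_after * observed_rate / target_rate))

# Hypothetical test run: 40,000 samples in 10 s on 2 logical CPUs,
# i.e. 2,000 samples/s/CPU, twice the target.
new_sa = recalibrate(1_400_000, samples_collected=40_000,
                     duration_s=10, logical_cpus=2)
print(new_sa)
```

Doubling the sample-after value halves the sampling rate, so the next run should land near the target if the workload behaves the same way.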

VTune Call Graph Feature
- Instrumentation-based technology: the binary is instrumented, which causes some performance degradation
- Identifies function-to-function calling sequences
- Reports statistics for each called function: execution time, blocked time, calling sequences and frequency of occurrence
- Functionally different from gprof: not statistical, but 100% precise call relationship data

VTune Call-graph View (VTL, cgview)

VTL: VTune for Linux Usage Model (1 of 2)
- Single-invocation command line:
    $ vtl activity -c sampling
    $ vtl run
    $ vtl activity -c sampling run
- All VTune Activities and results are stored in a semi-hidden project
- The user configures an Activity and runs it with a single invocation
- The user may have multiple Activities in the project
- Each Activity may have multiple data collectors and multiple application/module profiles

VTL: VTune for Linux Usage Model (2 of 2)
- Results are viewed with a single invocation
  - Some filtering is available, depending on the data
- Results accumulate until deleted by the user
- The user may pack the project and unpack it on a Windows system
  - The user can then take advantage of the VTune GUI on Windows
  - This provides access to capabilities not found on the command line

VTL Command Line Syntax: Some Examples
General status commands:
    vtl query -lc
        Lists all collectors (sampling and callgraph for 2.0)
    vtl help -c sampling
        Lists all events available for EBS (event-based sampling)
Create/run a Sampling activity:
    vtl activity -c sampling -app gzip,"-f big" run
        Creates and runs a single Sampling collector Activity with application "gzip -f big"; default settings (Instructions Retired and Cycles)
    vtl activity -d 20 -c sampling -o "-ec en='L3_READS-ALL-MISS'" -app gzip,"-f big"
        Creates and runs for 20 seconds a single Sampling collector Activity with application "gzip -f big", collecting all L3 cache misses data and instruction
Use option cpu_mask <list> to select a subset of processors

VTL Command Line Syntax (2): More Examples
View sampling results:
    vtl view
    vtl view -gui
        Shows the result of the last activity (defaults)
    vtl view -hf -mn gzip
        Views results for module (application) gzip in hot-spot function mode (most active modules first)
    vtl view -code -mn gzip -fn deflate -sea -poa
        Views results in source code mode for function deflate of module (application) gzip; shows events as a percentage of activity

VTL Command Line Syntax (3): More Examples
Configure and view a Callgraph Activity:
    vtl activity -c callgraph -app gzip,"-f big" -moi gzip run
        Creates and runs a Callgraph Activity with application "gzip -f big"; default settings; module of interest is gzip. If the application is a script, the module of interest can select the binary to be analyzed.
    vtl view
        Shows the just-generated call graph in table format
    vtl view -gui
        Shows the just-generated call graph in GUI format; requires installation of the CGVIEW tool (freely available from Intel)

VTune in Eclipse: Call Graph View

Itanium Performance Monitoring
- The Itanium architecture defines a generic framework for the Performance Monitoring Unit (PMU):
  - Consistent software APIs across processor models
  - Yet a processor model can implement its own PMU extensions
- Generic PMU support:
  - 4 64-bit Performance Monitor Data registers (PMDs), extensible to 256 in total
  - 8 64-bit Performance Monitor Configuration registers (PMCs), extensible to 256 in total
  - A performance monitor = 1 PMC + N PMDs (where N >= 1)
  - 3 additional status/control registers: PSR, DCR, PMV
- Itanium 2 PMU support:
  - Monitors a rich set (140+) of events
  - 16 PMCs, 18 PMDs (4 for event counting, the others for buffering event-specific information)
  - Can pinpoint exactly where a miss event happened in the program

Itanium PMU Events Classification
- Occurrence or architectural events: Level 3 cache misses, L2 bank conflicts, RSE activations
  - Some are exact address events (EAR): additional context information is saved
- Stall events ("bubble events"): stall at EXE stage, stall of L1D pipeline
- Derived or artificial events: Cycles/Instruction, 100 * L3_Misses/L3_References
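Derived events are plain arithmetic over raw counter values. A minimal sketch of the two examples named above, using made-up counts:

```python
def cycles_per_instruction(cpu_cycles, instructions_retired):
    """Derived event: average number of cycles spent per retired instruction."""
    return cpu_cycles / instructions_retired

def l3_miss_percentage(l3_misses, l3_references):
    """Derived event: 100 * L3_Misses / L3_References."""
    return 100.0 * l3_misses / l3_references

# Hypothetical raw counts from one sampling run.
cpi = cycles_per_instruction(cpu_cycles=2_000_000, instructions_retired=1_000_000)
miss_pct = l3_miss_percentage(l3_misses=250, l3_references=1_000)
print(f"CPI = {cpi:.2f}, L3 miss rate = {miss_pct:.1f}%")
```

Since a derived event combines several raw events, its inputs have to be collected within the limit of four events per run or across repeated runs of the same workload.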

EAR Events
- Problem: when a cache miss or branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source. The sampled PC is imprecise.
- Solution provided by the Itanium 2: hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information).
  - The instruction address can be mapped exactly to a machine code instruction of the application
- Most interesting are the DEAR (Data-EAR) events, which monitor long-latency memory instructions
  - Example: DEAR_Latency_GT_64 counts the number of memory operations taking more than 64 cycles, which is certainly not a cache access; helpful, for example, for finding sub-optimal prefetching
  - Intel can provide an unsupported tool ("Rosetta") to find the program variable name of the data being accessed

How to Use VTune for Itanium
1. Find hotspots regarding time (cycles)
   - By sampling the event CPU_CYCLES
   - By call graph
   - Straightforward and all you need in many cases, but it doesn't tell you why
2. Find hotspots regarding expensive occurrence events
   - By sampling for, e.g., L3 cache misses, branch mispredictions, RSE activations
   - Provides hints for code modifications
   - Interpretation can be misleading: e.g., L3 cache misses can be neutral (prefetch) or a hint of expensive events
   - Requires some generic knowledge of the Itanium architecture
3. Stall cycle analysis
   - By sampling for events causing stalls
   - Most sophisticated; requires detailed knowledge of the processor
   - Only available in this form for the Itanium architecture

Introduction to Stall Cycle Analysis
- The main idea: assume the algorithm and platform are perfectly optimized/configured
- Total cycles = cycles to execute instructions + cycles where the processor pipeline is stalled
- Minimize the stall cycles
  - If this value is zero, we have 6 instructions/cycle and thus can't do better
- This is Itanium 2 specific
  - For Itanium (1), the counter structure and names are slightly different
  - It does not work this way for IA-32 because of its more non-deterministic (out-of-order) execution features
- We will come back to this in the Micro-Architectural talk
- Detailed documentation available:
  - Itanium Reference Manual for Software Developers
  - Itanium 2 Reference Manual for Software Optimization
  - Introduction to Micro-architectural Software Optimization
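The decomposition can be turned into a back-of-the-envelope estimate. The sketch below assumes, as a simplification, that the "cycles to execute instructions" term equals the instruction count divided by the peak issue rate of 6 per cycle; the counter values are invented.

```python
PEAK_INSTRUCTIONS_PER_CYCLE = 6  # Itanium 2 peak issue rate quoted above

def stall_estimate(total_cycles, instructions_retired):
    """Split total cycles into ideal execution cycles and the remainder.

    Assumption: with zero stalls the processor would retire 6 instructions
    every cycle, so everything beyond that is counted as stall cycles.
    """
    ideal_cycles = instructions_retired / PEAK_INSTRUCTIONS_PER_CYCLE
    stall_cycles = total_cycles - ideal_cycles
    return ideal_cycles, stall_cycles, stall_cycles / total_cycles

# Hypothetical run: 3e9 cycles for 6e9 retired instructions.
ideal, stalled, stall_fraction = stall_estimate(3_000_000_000, 6_000_000_000)
print(f"ideal={ideal:.3g} stalled={stalled:.3g} fraction={stall_fraction:.1%}")
```

Real stall cycle analysis breaks the stalled portion down further by sampling the individual stall events; this estimate only gives the upper bound on what could be recovered.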

Constraining Performance Monitoring Events on IPF
- Performance monitoring events can be constrained to increment only on a particular:
  - Instruction type (opcode matching)
  - Instruction pointer range (IP matching)
  - Virtual address range (data address matching)
  - Or any combination of the above
- The default is no constraint: collect all events
- These are unique features of the Itanium Processor Family

Opcode Matching: the Matrix Multiply Example

O2
Opcode Match            Default        FP Load        Prefetch
CPU_Cycles              2.2 x 10^10    2.2 x 10^10    2.2 x 10^10
Instructions Retired    6.4 x 10^9     2.1 x 10^9     254
L3 Cache Misses         6.7 x 10^7     6.7 x 10^7     59

O3
Opcode Match            Default        FP Load        Prefetch
CPU_Cycles              3.3 x 10^9     3.3 x 10^9     3.3 x 10^9
Instructions Retired    6.4 x 10^9     2.1 x 10^9     5 x 10^8
L3 Cache Misses         6.7 x 10^7     1 x 10^5       6.7 x 10^7

Opcode matching shows that the L3 misses are fixed by O3.
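The conclusion can be checked numerically from the figures in the table. The sketch below computes, for each optimization level, the cycles per instruction and the share of L3 misses attributed to FP loads and to prefetches; the dictionary layout is only a convenience for this example.

```python
# Counter values as given in the matrix multiply table.
runs = {
    "O2": {"cycles": 2.2e10, "instructions": 6.4e9,
           "l3_misses": {"default": 6.7e7, "fp_load": 6.7e7, "prefetch": 59}},
    "O3": {"cycles": 3.3e9, "instructions": 6.4e9,
           "l3_misses": {"default": 6.7e7, "fp_load": 1e5, "prefetch": 6.7e7}},
}

def summarize(run):
    """Return CPI and the percentage of all L3 misses caused by each opcode class."""
    misses = run["l3_misses"]
    return {
        "cpi": run["cycles"] / run["instructions"],
        "fp_load_share": 100.0 * misses["fp_load"] / misses["default"],
        "prefetch_share": 100.0 * misses["prefetch"] / misses["default"],
    }

summary = {level: summarize(run) for level, run in runs.items()}
speedup = runs["O2"]["cycles"] / runs["O3"]["cycles"]
for level, s in summary.items():
    print(level, {k: round(v, 3) for k, v in s.items()})
print(f"O3 speedup over O2: {speedup:.2f}x")
```

The total number of L3 misses is the same at both levels; what changes is which instructions take them. At O3 nearly all misses are absorbed by prefetches instead of stalling FP loads.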

How Does This Work?
- Instructions are 41-bit fields
  - Each defines a unique instruction and its register usage
  - 3 per 128-bit bundle, plus 5 bits for the template
- Opcode matching can work with classes of instructions by using only a subset of the 41 bits
- It is done with an instruction field, a mask field (defining which bits to ignore), and a template field
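The bundle layout can be made concrete with a few shifts and masks. This sketch follows the IA-64 bundle format, with the template in the low 5 bits followed by three 41-bit instruction slots; the slot values used here are arbitrary placeholders, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    bundle = template & 0x1F
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle

def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and its three instruction slots."""
    template = bundle & 0x1F
    slots = tuple((bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
                  for i in range(3))
    return template, slots

# Placeholder values for illustration only.
bundle = encode_bundle(0x10, (0x0CB00000000, 0x00000000001, 0x1FFFFFFFFFF))
template, slots = decode_bundle(bundle)
print(hex(template), [hex(s) for s in slots], bundle.bit_length())
```

The widths add up exactly: 5 + 3 * 41 = 128 bits.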

Example Masks
- lfetch: template is M; opcode field is 0x0CB00000000; mask field is 0x030FFFFFFFF
- FP loads: template is M; opcode field is 0x0C000000000; mask field is 0x037FFFFFFFF
This is WAY too painful!!
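How such an opcode/mask pair selects a class of instructions can be sketched as a masked comparison. The semantics assumed here come from the wording above, namely that bits set in the mask are ignored. The exact behavior of the hardware opcode match registers may differ, and the test instruction values are invented.

```python
SLOT_MASK = (1 << 41) - 1  # instructions are 41 bits wide

def opcode_matches(instruction, opcode, ignore_mask):
    """True if instruction equals opcode in every bit not set in ignore_mask."""
    care_bits = ~ignore_mask & SLOT_MASK
    return (instruction & care_bits) == (opcode & care_bits)

# (opcode field, mask field) pairs from the example masks.
LFETCH = (0x0CB00000000, 0x030FFFFFFFF)
FP_LOADS = (0x0C000000000, 0x037FFFFFFFF)

# Hypothetical instruction words: same opcode bits, different operand bits.
lfetch_like = 0x0CB00000000 | 0x12345
fp_load_like = 0x0C700000000 | 0xFF

print(opcode_matches(lfetch_like, *LFETCH),
      opcode_matches(fp_load_like, *FP_LOADS),
      opcode_matches(lfetch_like, *FP_LOADS))
```

Under this reading, the low 32 bits of both masks cover operand fields, so one pattern matches every lfetch or every FP load regardless of registers and addresses.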

The Prototype VTune Analyzer