The Intel VTune Performance Analyzer
Focusing on VTune for Intel Itanium running Linux* OS
Copyright 2002 Intel Corporation. All rights reserved.
VTune and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
VTune Performance Analyzer
Helps to identify and characterize performance issues by:
- Collecting performance data
  - CPU cycles (time)
  - Micro-architectural events of the processor
  - Platform resource utilization
- Organizing and displaying the data
- Identifying performance hotspots
- Suggesting improvements
A Note about VTune & Other Tools
The most useful feature of VTune is Event Based Sampling: configuring and monitoring the Itanium architecture performance counters and displaying the event occurrence data against the workload of the system being analyzed.
This can also be done by many other tools:
- HPCMon, EMON
  - Free utility from Intel, includes source code
  - Ask presenter for a copy
  - Batch-like tool used within Intel
  - Also knows about some non-published monitor events
  - Available on request (no support) if there is a need (NDA)
- PFMON from HP: ftp://ftp.hpl.hp.com/pub/linux-ia64/
- PAPI (PapiRun, PapiProf), Rabbit, HPCToolKit, etc.
- Look at the Web: there are many of them
The differences are in ease of use, added features, APIs, processor support, OS support, navigation, performance data compatibility, source code support, etc.
VTune Performance Analysis for Linux
- Native: VTune for Linux 3.0
  - Any IA-32 or Itanium system running a recent Linux version
  - Some kernel and GLIBC dependencies
  - Full Eclipse-based GUI only for IA-32 today, due to Eclipse issues with 64-bit
  - For Itanium & EM64T: command-line version, but with graphical viewers for results
  - Eclipse-based release for 64-bit systems later in 2005 (VTune 3.5)
- Remote Data Collection (RDC) from Windows* OS
  - Allows the full Windows GUI to be used for Linux too
  - Needs VTune 7.2 for Windows
  - RDC driver comes with the VTune package and includes source code
Remote Data Collection
- VTune analyzer for Windows installed on host system
- Remote sampling data collector installed on target system
- Host system
  - Windows* OS, IA-32 or Itanium
  - Controls target
  - View results of data collection
- LAN connection
- Target system
  - IA-32 or Itanium processor family
  - Windows or Linux*
  - Intel PXA250 applications processor running Windows CE
Linux Driver Kit
- Required for RDC and VTune for Linux
- Pre-built binaries for many kernels
- Source code SDK in VTune 7.2
  - Also at http://opensource.intel.com
- Driver kit requires the kernel to export sys_call_table
  - Some older kernels have to be rebuilt
  - Many OSV kernels have explicit support: SUSE 8.x, 9.0; Red Hat AS 2.1 Update 2
- Support for the 2.6 kernel is available with the latest patch and soon in release 8.0
- Beta program for VTune 8.0 for Linux just started (end of August 2005)
VTune Features
- Sampling of execution addresses
  - Profiling based on processor event counters
- Call graph profiling: instrumented analysis
  - Call tree, number of calls, timing information
  - Executing instrumented code
- Intel Tuning Assistant: interprets the results (Windows or RDC only)
The Sampling Methodology
- Samples the CPU's execution context
  - Instruction address (module, source line, assembly line)
  - OS process
  - OS thread ID
- Very easy to use, no special build
  - Source line view requires symbol info (-g compiler option)
- Very low intrusion
- System-wide measurements
- Sample rate set to provide statistically meaningful data
  - Based on CPU clock speed or auto-calibrated
- Measures performance-sensitive CPU events
  - Cycles (time)
  - Cache misses, branch mispredictions, bank conflicts
  - On Itanium there are well over 100 such events, many of them with multiple sub-events
- At most 4 events per run
  - Restricted by the number of PMU (performance monitoring unit) registers
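Because at most 4 events can be collected per run, a longer event list has to be spread over several runs of the same workload. The following is a minimal planning sketch, not part of VTune: the function name `plan_runs` and the placeholder event labels are hypothetical, and it assumes any 4 events can share a run, which real PMU constraints may not allow.

```python
from typing import List

MAX_EVENTS_PER_RUN = 4  # limit stated on the slide (number of PMU counters)


def plan_runs(events: List[str], per_run: int = MAX_EVENTS_PER_RUN) -> List[List[str]]:
    """Split an event list into consecutive groups of at most `per_run` events."""
    if per_run < 1:
        raise ValueError("per_run must be at least 1")
    return [events[i:i + per_run] for i in range(0, len(events), per_run)]


# Hypothetical event labels, used only to show the grouping.
wanted = ["EVENT_A", "EVENT_B", "EVENT_C", "EVENT_D", "EVENT_E", "EVENT_F"]
for run_number, group in enumerate(plan_runs(wanted), start=1):
    print(run_number, group)
```

Six events therefore need two runs of the workload, which only gives comparable data if the workload behaves the same way each time.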
Sampling Process View: System-wide Data Collection
Sampling Source View: Source Code Annotated with Performance Data
VTL - VTune Native Linux Version: Sampling on Linux
- Test: MySQL 4.0.11: test-select
- memcpy contains 6 of the first 11 top hot-spots
Selective Sampling
The VTune Pause/Resume API can be used to limit sampling to specific parts of your app:
- #include <vtuneapi.h>
- Link with vtuneapi.lib
- Call VTResume() and VTPause() as appropriate
- Enable the "Start with data collection paused" option in the configuration dialog
There is also a more sophisticated Config/Start/Stop API available (see online documentation for more details).
How Sampling Works: How Event-based Sampling (EBS) Works
Conceptual diagram:
- Select event signal
- Count down the Sample After number
- Underflow to zero
- Internal interrupt controller
- Interrupt CPU to take sample
How do you choose a Sample After number?
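The countdown mechanism in the diagram can be imitated in software. This is a conceptual sketch only: the hardware counter and interrupt are replaced by a plain loop, and the function name `sample_addresses` is hypothetical. Each element of the input stands for the instruction address at which one occurrence of the selected event happened.

```python
from typing import Iterable, List


def sample_addresses(event_addresses: Iterable[int], sample_after: int) -> List[int]:
    """Return the addresses at which a sample would be taken."""
    if sample_after < 1:
        raise ValueError("sample_after must be at least 1")
    samples = []
    counter = sample_after          # counter is loaded with the Sample After number
    for address in event_addresses:
        counter -= 1                # every event occurrence counts down by one
        if counter == 0:            # reaching zero stands for the interrupt
            samples.append(address) # record the execution context
            counter = sample_after  # reload and continue counting
    return samples


# Ten event occurrences, one sample after every third occurrence.
print(sample_addresses(range(10), 3))
```

A small Sample After number yields many samples and many interrupts; a large one yields few of both.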
How Sampling Works: How Many Samples Are Enough?
- One million samples for a five-second run?
  - Do you have enough samples for it to be statistically significant?
  - How much overhead are you causing?
- What if you only get 100 samples?
- Is your Sample After number 1?
  - Are you getting a good profile?
- About 1,000 samples per second is a good balance between significance and overhead.
How Sampling Works: Objective: 1,000 Samples Per Second
- What is the Sample After value for clockticks?
  - It depends on the CPU clock speed
  - ANSWER: the CPU clock speed in kHz
  - If the CPU clock speed is 1,400,000,000 Hz, sample after 1,400,000 clockticks
- What is the Sample After value for L2 cache read misses?
  - It depends on how often you miss the L2 cache!
  - Circular definition? Isn't that what you are trying to determine?
  - Make an intelligent guess! Estimate!
  - More or less often than the clockticks? 10 times? 100 times? 1000 times?
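The arithmetic behind the 1,000 samples per second objective is a single division. A minimal sketch, with a hypothetical function name, and with the "100 times less often" figure for cache misses being an illustrative guess rather than a measured value:

```python
TARGET_SAMPLES_PER_SECOND = 1000  # objective stated on the slide


def sample_after(events_per_second: int, target: int = TARGET_SAMPLES_PER_SECOND) -> int:
    """Sample After value that yields roughly `target` samples per second."""
    return max(1, events_per_second // target)


clock_hz = 1_400_000_000
print(sample_after(clock_hz))         # clockticks: the clock speed in kHz

# Guess: the event of interest occurs 100 times less often than clockticks.
print(sample_after(clock_hz // 100))
```

For events rarer than the target rate the result is clamped to 1, meaning every occurrence is sampled.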
How Sampling Works: Calibration
- Sets the Sample After value to get a reasonable number of samples
  - ~1,000 samples per second per logical CPU
  - Requires the workload to be run twice
- Manual calibration:
  - Uncheck "Calibrate Sample After value" (found on the Advanced Activity Configuration dialog)
  - Start with the default value or an estimate
  - Run a test
  - Modify the Sample After value and re-test
  - Try to get about 1,000 samples per second per logical CPU
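The manual calibration loop amounts to scaling the Sample After value by the ratio of observed to desired sample rate. The sketch below shows one such adjustment step; the function name and the example numbers are hypothetical, and it assumes the event rate stays roughly constant between runs.

```python
def recalibrate(sample_after: int, samples: int, seconds: float,
                logical_cpus: int, target_rate: float = 1000.0) -> int:
    """Scale Sample After so the next run lands near `target_rate`
    samples per second per logical CPU."""
    observed_rate = samples / (seconds * logical_cpus)
    return max(1, round(sample_after * observed_rate / target_rate))


# Test run: 28,000 samples in 10 s on 2 logical CPUs is 1,400 per second
# per CPU, which is too many, so the Sample After value is raised.
print(recalibrate(1_000_000, samples=28_000, seconds=10, logical_cpus=2))
```

Repeating the step after each test run converges quickly when the workload is repeatable.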
VTune Call Graph Feature
- Instrumented technology
  - Some performance degradation
  - Binary is instrumented
- Identifies function-to-function calling sequences
- Reports statistics for each called function
  - Execution time
  - Blocked time
  - Calling sequences & frequency of occurrence
- Functionally different from gprof
  - Not statistical, but 100% precise call relationship data
VTune Call-graph View (VTL, cgview)
VTL: VTune for Linux Usage Model (1 of 2)
- Single-invocation command line
  $ vtl activity -c sampling
  $ vtl run
  $ vtl activity -c sampling run
- All VTune Activities and results are stored in a semi-hidden project
- User configures an Activity and runs it with a single invocation
- User may have multiple Activities in the project
- Each Activity may have multiple data collectors and multiple application/module profiles
VTL: VTune for Linux Usage Model (2 of 2)
- Results are viewed with a single invocation
  - Some filtering is available depending on the data
- Results accumulate until deleted by the user
- User may pack the project and unpack it on a Windows system
  - User can take advantage of the VTune GUI on Windows
  - Provides access to capabilities not found on the command line
VTL Command Line Syntax: Some Examples
General status commands
- vtl query -lc
  Lists all collectors (sampling and callgraph for 2.0)
- vtl help -c sampling
  Lists all events available for EBS (event-based sampling)
Create/run a Sampling activity
- vtl activity -c sampling -app gzip, -f big run
  Creates and runs a single Sampling collector Activity with application gzip -f big; default settings (Instructions Retired and Cycles)
- vtl activity -d 20 -c sampling -o -ec en=L3_READS-ALL-MISS -app gzip, -f big
  Creates and runs for 20 seconds a single Sampling collector Activity with application gzip -f big, collecting all L3 cache misses (data and instruction)
- Use option cpu_mask <list> to select a subset of processors
VTL Command Line Syntax (2): More Examples
View Sampling results
- vtl view
  vtl view -gui
  Shows the result of the last activity (defaults)
- vtl view -hf -mn gzip
  Views results for module (application) gzip in hot-spot function mode (most active modules first)
- vtl view -code -mn gzip -fn deflate -sea -poa
  Views results in source code mode for function deflate of module (application) gzip; shows events as percentage of activity
VTL Command Line Syntax (3): More Examples
Configuring and viewing a Callgraph Activity
- vtl activity -c callgraph -app gzip, -f big -moi gzip run
  Creates and runs a Callgraph Activity with application gzip -f big; default settings; module of interest gzip. In case the app is a script, the module of interest can select the binary to be analyzed
- vtl view
  Shows the just-generated call graph in table format
- vtl view -gui
  Shows the just-generated call graph in GUI format; requires installation of the CGVIEW tool (freely available from Intel)
VTune in Eclipse: Call Graph View
Itanium Performance Monitoring
The Itanium architecture defines a generic framework for the Performance Monitoring Unit (PMU):
- Consistent software APIs across processor models
- Yet, a processor model can implement its own PMU extension
Generic PMU support:
- 4 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
- 8 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
- A performance monitor = 1 PMC + N PMDs (where N >= 1)
- 3 additional status/control registers: PSR, DCR, PMV
Itanium 2 PMU support:
- Monitors a rich set (140+) of events
- 16 PMCs, 18 PMDs (4 for event counting, others for buffering event-specific info)
- Can pinpoint exactly where a miss event happened in the program
Itanium PMU Events Classification
- Occurrence or architectural events
  - Level 3 cache misses, L2 bank conflicts, RSE activations
  - Some are Exact Address Events (EAR): additional context information is saved
- Stall events (bubble events)
  - Stall at EXE stage, stall of L1D pipeline
- Derived or artificial events
  - Cycles/Instruction, 100 * L3_Misses / L3_References
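Derived events are not counted by the hardware; they are computed from raw counts after collection. A minimal sketch of the two ratios named on the slide, with hypothetical function names and made-up counter values:

```python
def cycles_per_instruction(cpu_cycles: float, instructions_retired: float) -> float:
    """Derived event: Cycles/Instruction (CPI)."""
    return cpu_cycles / instructions_retired


def l3_miss_percent(l3_misses: float, l3_references: float) -> float:
    """Derived event: 100 * L3_Misses / L3_References."""
    return 100.0 * l3_misses / l3_references


# Made-up raw counts, for illustration only.
print(cycles_per_instruction(cpu_cycles=3_000_000, instructions_retired=2_000_000))
print(l3_miss_percent(l3_misses=25, l3_references=100))
```

Because each ratio needs two raw events, both inputs must be collected on the same workload, ideally in the same run.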
EAR Events
- Problem: when a cache miss or branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source. The sampled PC is imprecise.
- Solution provided by the Itanium 2: hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information).
  - The instruction address can be mapped exactly to a machine code instruction of the application
- Most interesting are DEAR (Data-EAR) events, which monitor long-latency memory instructions
  - Sample: DEAR_Latency_GT_64 counts the number of memory operations taking more than 64 cycles, which are certainly not cache accesses; helpful e.g. to find sub-optimal prefetching
- Intel can provide an unsupported tool (Rosetta) to find the program variable name of the data being accessed
How to Use VTune for Itanium
1. Find hot-spots regarding time (cycles)
   - By sampling of event CPU_CYCLES
   - By call graph
   - Straightforward and all you need in many cases, but doesn't tell you why
2. Find hot-spots regarding expensive occurrence events
   - By sampling for e.g. L3 cache misses, branch mispredictions, RSE activations
   - Provides hints for code modifications
   - Interpretation can be misleading: e.g. L3 cache misses can be neutral (prefetch) or a hint for expensive events
   - Requires some generic knowledge about the Itanium architecture
3. Stall cycle analysis
   - By sampling for events causing stalls
   - Most sophisticated and requires detailed knowledge of the processor
   - Only available in this form for the Itanium architecture
Introduction to Stall Cycle Analysis
- The main idea:
  - Assume algorithm and platform are perfectly optimized/configured
  - Total cycles = cycles to execute instructions + cycles where the processor pipeline is stalled
  - Minimize the stall cycles
  - If this value is zero, we have 6 instructions/cycle, and it can't get any better
- This is Itanium 2 specific
  - For Itanium (-1), counter structure and names are slightly different
  - Does not work this way for IA-32 due to more non-deterministic (out-of-order) execution features
- We will come back to this in the Micro-Architectural talk
- Detailed documentation available:
  - Itanium Reference Manual for Software Developers
  - Itanium 2 Reference Manual for Software Optimization
  - Introduction to Micro-architectural Software Optimization
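The cycle accounting above can be written out as a small calculation. This is a sketch under stated assumptions: the function name and the input numbers are hypothetical, and the inputs stand for counts that would come from sampling a total-cycles event and the stall events.

```python
PEAK_IPC = 6  # upper bound for Itanium 2 stated on the slide


def stall_breakdown(total_cycles: float, stall_cycles: float, instructions: float) -> dict:
    """Apply: total cycles = executing cycles + stalled cycles."""
    return {
        "executing_cycles": total_cycles - stall_cycles,
        "stall_fraction": stall_cycles / total_cycles,
        "ipc": instructions / total_cycles,
        # Cycles needed if every cycle retired PEAK_IPC instructions.
        "min_cycles_at_peak": instructions / PEAK_IPC,
    }


# Made-up counts: 40% of all cycles are stalls.
for name, value in stall_breakdown(1000, 400, 1200).items():
    print(name, value)
```

The stall fraction shows how much there is to gain; breaking the stall cycles down by cause shows where to look first.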
Constraining Performance Monitoring Events on IPF
The Performance Monitoring Events can be constrained to increment only on a particular:
- Instruction type (opcode matching)
- Instruction pointer range (IP matching)
- Virtual address range (data address matching)
- Or any combination of the above
The default is no constraint: collect all events.
These are unique features of the Itanium Processor Family.
Opcode Matching: the Matrix Multiply Example

O2
Opcode Match           Default       Fp Load       Prefetch
CPU_Cycles             2.2 x 10^10   2.2 x 10^10   2.2 x 10^10
Instructions Retired   6.4 x 10^9    2.1 x 10^9    254
L3 Cache Misses        6.7 x 10^7    6.7 x 10^7    59

O3
Opcode Match           Default       Fp Load       Prefetch
CPU_Cycles             3.3 x 10^9    3.3 x 10^9    3.3 x 10^9
Instructions Retired   6.4 x 10^9    2.1 x 10^9    5 x 10^8
L3 Cache Misses        6.7 x 10^7    1 x 10^5      6.7 x 10^7

Opcode matching shows L3 misses are fixed by O3
How Does This Work?
- Instructions are 41-bit fields
  - Define a unique instruction and register usage
  - 3 per 128-bit bundle
  - Plus 5 bits for the template
- Opcode matching can work with classes of instructions
  - By using only a subset of the 41 bits
- Done with:
  - An instruction field
  - A mask field (defining which bits to ignore)
  - A template field
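The bundle arithmetic (5 + 3 x 41 = 128 bits) can be checked with a short packing sketch. The slide does not give bit positions; the layout used here (template in the lowest 5 bits, followed by slots 0, 1 and 2) follows the Itanium architecture's bundle format as I understand it and should be read as an assumption. The function names and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slots: Sequence[int]) -> int:
    """Pack a 5-bit template and three 41-bit instruction slots into 128 bits."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template & TEMPLATE_MASK
    for index, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * index)
    return bundle


def unpack_bundle(bundle: int) -> Tuple[int, Tuple[int, ...]]:
    """Split a 128-bit bundle back into its template and three slots."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * index)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots


packed = pack_bundle(0x10, [0x0CB00000000, 0x1, SLOT_MASK])
print(hex(packed))
print(unpack_bundle(packed))
```

Opcode matching then operates on the individual 41-bit slots together with the template that says which execution unit type each slot targets.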
Example Masks
- lfetch
  - Template is M
  - Opcode field is 0x0CB00000000
  - Mask field is 0x030FFFFFFFF
- fploads
  - Template is M
  - Opcode field is 0x0C000000000
  - Mask field is 0x037FFFFFFFF
This is WAY too painful!!
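The way the opcode and mask fields combine can be sketched as a bitwise comparison. The sketch assumes, as the previous slide says, that a set mask bit means "ignore this bit"; the actual register encoding on the processor may differ. The function name `opcode_matches` and the test instruction words are hypothetical; the two opcode/mask pairs are the ones given above.

```python
INSTRUCTION_BITS = 41
INSTRUCTION_MASK = (1 << INSTRUCTION_BITS) - 1

# Opcode and mask fields taken from the slide.
LFETCH = {"opcode": 0x0CB00000000, "mask": 0x030FFFFFFFF}
FPLOAD = {"opcode": 0x0C000000000, "mask": 0x037FFFFFFFF}


def opcode_matches(instruction: int, opcode: int, mask: int) -> bool:
    """True if `instruction` equals `opcode` on every bit the mask does not ignore."""
    compared_bits = ~mask & INSTRUCTION_MASK   # bits that must be equal
    return ((instruction ^ opcode) & compared_bits) == 0


# Hypothetical instruction word: the lfetch pattern with arbitrary operand bits.
word = LFETCH["opcode"] | 0x12345678
print(opcode_matches(word, **LFETCH))   # operand bits are ignored by the mask
print(opcode_matches(word, **FPLOAD))   # differs from fploads on a compared bit
```

Deriving such constants by hand from the instruction encoding tables is exactly the painful part the slide complains about, which is why tool support for named instruction classes matters.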
The Prototype VTune Analyzer