Hardware performance monitoring Zoltán Majó
Question: Did you take any of these lectures?
- Computer Architecture and System Programming
- How to Write Fast Numerical Code
- Design of Parallel and High Performance Computing
Program performance
- Algorithmic complexity is decisive, e.g., O(n) is better than O(n^2)
- Constant factors matter as well, e.g., n^2 operations beat 10000000000000*n operations, and 3*n^3 operations beat 500*n^3 operations
- Constant factors are in many cases hardware-dependent
Today's example: dense matrix multiplication (MMM)
- Complexity: O(n^3)
- Hardware: cache-based architecture
Algorithm: MMM (C = A x B)

for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }
Hardware: cache-based architecture
(Diagram: CPU, Cache, RAM)
- Cache: 35 cycles access latency
- RAM: 200 cycles access latency
MMM: Putting it together
(Diagram: CPU, Cache, RAM, computing C = A x B)

                 A[][]   B[][]
Cache hits         3       0
Total accesses     6       6
MMM: Cache performance
Hit rate:
- Accesses to A[][]: 3/6 = 50%
- Accesses to B[][]: 0/6 = 0%
- All accesses: 25%
Can we do better?
Cache-friendly MMM (C = A x B)

Cache-unfriendly MMM (ijk):
for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

Cache-friendly MMM (ikj):
for (i=0; i<n; i++)
  for (k=0; k<n; k++) {
    r = A[i][k];
    for (j=0; j<n; j++)
      C[i][j] += r*B[k][j];
  }
Cache-friendly MMM
(Diagram: CPU, Cache, RAM)

                 C[][]   B[][]
Cache hits         3       3
Total accesses     6       6
Cache-friendly MMM

Cache-unfriendly MMM (ijk):
- A[][]: 3/6 = 50% hit rate
- B[][]: 0/6 = 0% hit rate
- All accesses: 25% hit rate

Cache-friendly MMM (ikj):
- C[][]: 3/6 = 50% hit rate
- B[][]: 3/6 = 50% hit rate
- All accesses: 50% hit rate

Better performance due to cache-friendliness?
Performance of MMM
(Chart: execution time [s], log scale 0.01 to 10000, vs. matrix size 512-8192; ikj (cache-friendly) outperforms ijk (cache-unfriendly) by up to 20X)
Program performance
MMM: constant factors matter
Understanding constant factors requires access to the algorithm, the implementation, the inputs, and the architecture...
...but often not all of these are available: we may have only the binary file that we want to execute fast
Do we know the architecture?
Cache-based architecture
(Diagram: two processor packages, each with four cores; each core has private L1 and L2 caches, the cores of a package share an L3 cache, and both packages share the RAM)
Microarchitecture of a core
(Source of picture: http://wikipedia.org)
Outline
- Performance: constant factors matter
- Hardware performance counters
  - Simple example: measuring cache misses
  - Advanced uses
- Your project
Hardware performance counters
- Special registers, programmable to monitor a given hardware event (e.g., cache misses)
- Low-level information about hardware-software interaction
- Low overhead due to hardware implementation
- In the past: an undocumented feature; since the Intel Pentium: publicly available description
- Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
Intel PTU
(Screenshot: monitored events and per-function counts)
Source: http://software.intel.com/en-us/articles/intel-performance-tuning-utility/
Debugging tools
- Limited functionality
- No access to raw data
- Do not support all features of processors (example: Intel PTU supports only sampling)
Idea: write your own tool
Programming performance counters
- Model-specific registers
  - Access: RDMSR, WRMSR, and RDPMC instructions
  - Ring 0 instructions (available only in kernel mode)
- perf_events interface
  - Standard Linux interface since Linux 2.6.31
  - UNIX philosophy: performance counters are files
  - Simple API: set up counters with perf_event_open(), then read counters as files
Example: measuring MMM cache misses
Example: monitoring cache misses

int main() {
    int pid = fork();
    if (pid == 0) {
        /* child: run the program to be measured */
        exit(execl("./mmm", "./mmm", NULL));
    } else {
        int status;
        uint64_t value;
        int fd = perf_event_open(...);
        waitpid(pid, &status, 0);
        read(fd, &value, sizeof(uint64_t));
        printf("Cache misses: %" PRIu64 "\n", value);
    }
}
perf_event_open()
Looks simple:

int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
                        pid_t pid, int cpu, int group_fd,
                        unsigned long flags);

...but perf_event_attr has many fields:

struct perf_event_attr {
    u32 type;
    u32 size;
    u64 config;
    union {
        u64 sample_period;
        u64 sample_freq;
    };
    u64 sample_type;
    u64 read_format;
    u64 inherit;
    u64 pinned;
    u64 exclusive;
    u64 exclude_user;
    u64 exclude_kernel;
    u64 exclude_hv;
    u64 exclude_idle;
    u64 mmap;
    /* ... */
};
libpfm
Open-source helper library sitting between the user program and perf_events:
(1) the user program passes an event name to libpfm
(2) libpfm sets up perf_event_attr
(3) the user program calls perf_event_open()
(4) the user program reads the results
Example: measure MMM cache misses
- Determine the microarchitecture: Intel Xeon E5520 = Nehalem microarchitecture
- Look up the event needed (source: Intel Architectures Software Developer's Manual): OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
- Measure cache misses
Performance of MMM
(Chart repeated: execution time [s] vs. matrix size 512-8192; ikj (cache-friendly) is up to 20X faster than ijk (cache-unfriendly))
MMM cache misses
(Chart: # cache misses x 10^6, log scale, vs. matrix size 512-8192; ijk (cache-unfriendly) incurs up to 30X more cache misses than ikj (cache-friendly))
Performance of MMM
30X more cache misses cause a 20X performance degradation
Hardware performance counters confirm the assumption
Outline
- Performance: constant factors matter
- Hardware performance counters
  - Simple example: measuring cache misses
  - Advanced uses
    - Sampling
    - Precise information
- Your project
Sampling
So far: counting mode
- Set up counters
- Execute program
- Read counters
Single-phased program
(Diagram: set up performance counters at program start, read performance counters at program end)
Program with multiple phases
(Diagram: set up performance counters once, then get a sample periodically throughout execution, so each phase is observed)
Sampling frequency
- Low sampling frequency: low overhead, but can fail to record changes in program behavior
- High sampling frequency: high overhead, but accurately follows program behavior
- Adaptive sampling
Precise information
- Normal operation: only event counts (e.g., # of cache misses, # of branch instructions retired, etc.)
- Events with more information in each sample (e.g., register contents, instruction latency)
- Intel PEBS, AMD IBS
Today: data address profiling in NUMA systems
Non-uniform memory architecture
(Diagram: Processor 0 with cores 0-3 and Processor 1 with cores 4-7; each processor has its own memory controller (MC) and DRAM, and the processors are linked by an interconnect (IC); thread T on core 0 accesses data)
- Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
- Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles
Key to good performance: data locality
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO 09], Molka [PACT 09])
Data locality in multithreaded programs
(Chart: remote memory references / total memory references [%], ranging from 0% to 60%, for the NAS Parallel Benchmarks: cg.b, lu.c, ft.b, ep.c, bt.b, sp.b, is.b, mg.c)
Automatic page placement
- Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses
- Data address profiling: for each thread, for each memory instruction, record the data address used
Profile-based page placement
(Diagram: thread T0 on Processor 0, thread T1 on Processor 1, each processor with its own DRAM)
Profile:
- Page P0: accessed 1000 times by T0, so place P0 in Processor 0's DRAM
- Page P1: accessed 3000 times by T1, so place P1 in Processor 1's DRAM
Automatic page placement
Compare: first-touch and profile-based page placement
- Machine: 2-processor 8-core Intel Xeon E5520
- Subset of NAS PB: programs with a high fraction of remote accesses
- 8 threads with fixed thread-to-core mapping
Profile-based page placement
(Chart: performance improvement over first-touch [%], ranging from 0% to 25%, for cg.b, lu.c, bt.b, ft.b, sp.b)
Hardware performance counters
- Useful for program optimizations
- Example seen: data locality optimization in NUMA systems
Question: what about my ACD project?
Your project
- Which event to use? Which measurement mode to use? Is precise information needed?
- Answer: it depends on the optimization you choose
- Example 1: loop unrolling - measure retired branch instructions, back-end stalls, i-fetch misses
- Example 2: code positioning - measure i-fetch misses
References
Processor manufacturer's manuals (Intel):
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Processor manufacturer optimization manuals (Intel):
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Talk to me: zoltan.majo@inf.ethz.ch