Hardware performance monitoring. Zoltán Majó


1 Hardware performance monitoring. Zoltán Majó

2 Question: Did you take any of these lectures? Computer Architecture and System Programming; How to Write Fast Numerical Code; Design of Parallel and High Performance Computing.

3 Program performance. Algorithmic complexity is decisive: e.g., O(n) is better than O(n^2). Constant factors matter as well: e.g., n^2 operations can be better than [...]*n operations, and 3*n^3 operations are better than 500*n^3 operations. Constant factors are in many cases hardware-dependent. Today's example: dense matrix multiplication (MMM). Complexity: O(n^3). Hardware: cache-based architecture.

4 Algorithm: MMM (C = A x B; row i of A and column j of B produce C[i][j])

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

5 Hardware: cache-based architecture. [Diagram: CPU, cache, RAM.] Cache: 35 cycles access latency. RAM: 200 cycles access latency. Double type: [...]. Cache line size: [...].

6 MMM: Putting it together. [Diagram: CPU, cache, and RAM holding C, A, and B.] Cache hits / total accesses: A[][]: ? / 6; B[][]: ? / 6.

7 MMM: Putting it together. [Diagram: CPU, cache, and RAM holding C, A, and B.] Cache hits / total accesses: A[][]: 3 / 6; B[][]: ? / 6.

8 MMM: Putting it together. [Diagram: CPU, cache, and RAM holding C, A, and B.] Cache hits: A[][]: 3; B[][]: ?.

9 MMM: Cache performance. Hit rate: accesses to A[][]: 3/6 = 50%; accesses to B[][]: 0/6 = 0%; all accesses: 3/12 = 25%. Can we do better?
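The hit counts above can be reproduced with a toy model. The sketch below is not from the slides: it assumes that one cache line holds two doubles (the slide's line size did not survive) and that the cache remembers only the most recently touched line. The name `count_hits` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one cache line holds 2 doubles. */
#define DOUBLES_PER_LINE 2

/* Count hits for `count` accesses to a row-major array, starting at element
   index `start` and advancing by `stride` elements per access. The toy cache
   remembers only the line touched by the previous access. */
static size_t count_hits(size_t start, size_t stride, size_t count)
{
    assert(stride > 0);
    size_t hits = 0;
    size_t last_line = (size_t)-1;   /* no line cached yet */
    for (size_t a = 0; a < count; a++) {
        size_t line = (start + a * stride) / DOUBLES_PER_LINE;
        if (line == last_line)
            hits++;
        last_line = line;
    }
    return hits;
}
```

With n = 6, walking a row of A is `count_hits(0, 1, 6)` and walking a column of B is `count_hits(0, 6, 6)`; under the stated assumptions these give 3 and 0 hits, matching the 3/6 and 0/6 on the slide.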

10 Cache-friendly MMM (C = A x B)

Cache-unfriendly MMM (ijk):

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] += sum;
        }

Cache-friendly MMM (ikj):

    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            r = A[i][k];
            for (j = 0; j < n; j++)
                C[i][j] += r * B[k][j];
        }
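To check that the two loop orders compute the same product, both can be written as functions over flat row-major arrays. This is a minimal sketch; the function names and the flat-array layout are choices made here, not part of the slides, and C is assumed to be zero-initialized by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* ijk order: the inner loop walks B column-wise (stride n). */
static void mmm_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
}

/* ikj order: the inner loop walks both B and C row-wise (stride 1). */
static void mmm_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double r = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += r * B[k * n + j];
        }
}
```

Only the order of the two inner loops differs, so the same multiplications and additions are performed; what changes is the memory access pattern.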

11 Cache-friendly MMM. [Diagram: CPU, cache, and RAM holding C and B.] Cache hits / total accesses: C[][]: 3 / 6; B[][]: 3 / 6.

12 Cache-friendly MMM. Cache-unfriendly MMM (ijk): A[][]: 3/6 = 50% hit rate; B[][]: 0/6 = 0% hit rate; all accesses: 25% hit rate. Cache-friendly MMM (ikj): C[][]: 3/6 = 50% hit rate; B[][]: 3/6 = 50% hit rate; all accesses: 50% hit rate. Better performance due to cache-friendliness?
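A rough way to turn these hit rates into an expected cost per access is to weight the two latencies by hit and miss probability. The sketch below assumes the latencies from the hardware slide are 35 cycles for a cache hit and 200 cycles for a RAM access (an interpretation of the transcription), and it ignores everything else a real processor does; `avg_access_cycles` is a hypothetical name.

```c
#include <assert.h>

/* Assumed latencies: 35 cycles for a cache hit, 200 cycles for RAM. */
#define CACHE_LATENCY_CYCLES 35.0
#define RAM_LATENCY_CYCLES   200.0

/* Expected cycles per access: a hit is charged the cache latency,
   a miss is charged the RAM latency. */
static double avg_access_cycles(double hit_rate)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * CACHE_LATENCY_CYCLES
         + (1.0 - hit_rate) * RAM_LATENCY_CYCLES;
}
```

Under these assumptions, `avg_access_cycles(0.25)` is 158.75 cycles for ijk and `avg_access_cycles(0.50)` is 117.5 cycles for ikj. The toy model only shows the direction of the effect, not its real size.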

13 Performance of MMM. [Plot: execution time [s] versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly).]

14 Performance of MMM. [Plot: execution time [s] versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly); gap annotated "[...]X".]

15 Program performance. MMM: constant factors matter. Understanding constant factors requires access to the algorithm, the implementation, the inputs, and the architecture, but often not all of these are available. We may have only the binary file that we want to execute fast. Do we know the architecture?

16 Cache-based architecture. [Diagram: processor packages containing cores; each core has its own L1 and L2 cache; L3 caches sit between the cores and RAM.]

17 Microarchitecture of a core. [Picture.] Source of picture: [...]

18 Outline. Performance: constant factors matter; hardware performance counters; simple example: measuring cache misses; advanced uses; your project.

19 Hardware performance counters. Special registers, programmable to monitor a given hardware event (e.g., cache misses). Low-level information about hardware-software interaction. Low overhead due to hardware implementation. In the past: undocumented feature. Since Intel Pentium: publicly available description. Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst.

20 Intel PTU. [Screenshot showing monitored events and per-function counts.] Source: [...]

21 Debugging tools. Limited functionality: no access to raw data; do not support all features of processors. Example: Intel PTU supports only sampling. Idea: write your own tool.

22 Programming performance counters. Model-specific registers. Access: RDMSR, WRMSR, and RDPMC instructions; ring 0 instructions (available only in kernel mode). perf_events interface: standard Linux interface since Linux [...]. UNIX philosophy: performance counters are files. Simple API: set up counters with perf_event_open(); read counters as files. Example: measuring MMM cache misses.

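23 Example: monitoring cache misses

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        int pid = fork();
        if (pid == 0) {
            /* Child: run the program to be measured. */
            execl("./mmm", "./mmm", (char *)NULL);
            exit(1);  /* reached only if execl() fails */
        } else {
            int status;
            uint64_t value;
            int fd = perf_event_open(...);  /* arguments elided on the slide */
            waitpid(pid, &status, 0);
            read(fd, &value, sizeof(uint64_t));
            printf("Cache misses: %" PRIu64 "\n", value);
        }
    }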

24 perf_event_open(): looks simple

    int sys_perf_event_open(
        struct perf_event_attr *hw_event_uptr,
        pid_t pid, int cpu, int group_fd,
        unsigned long flags);

    struct perf_event_attr {
        u32 type;
        u32 size;
        u64 config;
        union {
            u64 sample_period;
            u64 sample_freq;
        };
        u64 sample_type;
        u64 read_format;
        u64 inherit;
        u64 pinned;
        u64 exclusive;
        u64 exclude_user;
        u64 exclude_kernel;
        u64 exclude_hv;
        u64 exclude_idle;
        u64 mmap;
        ...
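For counting mode, only a few of these fields need to be filled in. The following is a minimal sketch for Linux that counts the generic `PERF_COUNT_HW_CACHE_MISSES` event for the calling thread. It uses that generic event rather than a model-specific one, and the helper names `count_cache_misses` and `touch_buffer` are hypothetical. The call can fail (for example when the kernel restricts unprivileged access), in which case the sketch returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for this system call, so invoke it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical workload: touch one byte in every 64-byte stride of a buffer. */
static volatile unsigned char buffer[1 << 20];
static void touch_buffer(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < sizeof(buffer); i += 64)
        buffer[i]++;
}

/* Run work(arg) and return the user-mode cache misses counted for the
   calling thread, or -1 if the counter could not be set up or read. */
static int64_t count_cache_misses(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: measure the calling thread on any CPU. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t value = 0;
    ssize_t got = read(fd, &value, sizeof(value));
    close(fd);
    return got == (ssize_t)sizeof(value) ? (int64_t)value : -1;
}
```

A call such as `count_cache_misses(touch_buffer, NULL)` brackets the workload with enable and disable, so set-up code is not counted. Measuring a separate child process, as in the fork() example, would pass the child's pid instead of 0.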

25 libpfm: open-source helper library. [Diagram of the flow between user program, libpfm, and perf_events:] (1) event name; (2) set up perf_event_attr; (3) call perf_event_open(); (4) read results.

26 Example: measure MMM cache misses. Determine the microarchitecture: Intel Xeon E5520 has the Nehalem microarchitecture. Look up the event needed. Source: Intel Architectures Software Developer's Manual.

27 Software Developer's Manual

28 Example: measure MMM cache misses. Determine the microarchitecture: Intel Xeon E5520 has the Nehalem microarchitecture. Look up the event needed. Source: Intel Architectures Software Developer's Manual. Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM. Measure cache misses.

29 Performance of MMM. [Plot: execution time [s] versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly); gap annotated "[...]X".]

30 MMM cache misses. [Plot: number of cache misses (in millions) versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly).]

31 MMM cache misses. [Plot: number of cache misses (in millions) versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly); gap annotated "[...]X".]

32 Performance of MMM. 30X more cache misses cause a 20X performance degradation. Hardware performance counters confirm the assumption.

33 Outline. Performance: constant factors matter; hardware performance counters; simple example: measuring cache misses; advanced uses (sampling, precise information); your project.

34 Sampling. So far: counting mode (set up counters, execute program, read counters).

35 Single-phased program. [Timeline: set up performance counters at the start, read performance counters at the end.]

36 Program with multiple phases. [Timeline: set up performance counters, then get a sample.]

37 Program with multiple phases. [Timeline: set up performance counters.]


39 Sampling frequency. Low sampling frequency: low overhead, but can fail to record changes in program behavior. High sampling frequency: high overhead, but accurately follows program behavior. Adaptive sampling.
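The trade-off can be illustrated with a toy model in which a counter is read once every `period` time units and each read yields the average event rate since the previous read. This is a sketch with synthetic numbers, not a model of any real sampling hardware; `max_sampled_rate` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Highest average event rate observed when the counter is read once every
   `period` time units; events[t] is the event count in time unit t.
   A trailing partial window is ignored. */
static unsigned long max_sampled_rate(const unsigned long *events, size_t n,
                                      size_t period)
{
    assert(events != NULL && period > 0);
    unsigned long best = 0;
    for (size_t start = 0; start + period <= n; start += period) {
        unsigned long sum = 0;
        for (size_t t = start; t < start + period; t++)
            sum += events[t];
        unsigned long rate = sum / period;
        if (rate > best)
            best = rate;
    }
    return best;
}
```

For a 16-unit trace with 10 events per unit except a two-unit burst of 90, reading every unit (16 reads) reports a peak of 90, while reading every 8 units (2 reads) reports only 30 and a single read over all 16 units reports 20: fewer reads cost less but smear the short phase.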

40 Precise information. Normal operation: only event counts, e.g., number of cache misses, number of branch instructions retired. Events with more information in each sample, e.g., register contents, instruction latency: Intel PEBS, AMD IBS. Today: data address profiling in NUMA systems.

41 Non-uniform memory architecture. [Diagram: Processor 0 (cores 0 to 3) and Processor 1 (cores 4 to 7); each processor has an MC attached to its own DRAM and an IC connecting it to the other processor.]

42 Non-uniform memory architecture. [Diagram: thread T on core 0 of Processor 0; data in the DRAM attached to Processor 0.] Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles. All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09]).

43 Non-uniform memory architecture. [Diagram: thread T on core 0 of Processor 0; data in local or remote DRAM.] Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles. Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles. Key to good performance: data locality. All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09]).
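As a back-of-the-envelope illustration of why locality matters, the two latencies can be combined into an expected latency for a given fraction of remote accesses. This is a simple weighted average that ignores bandwidth limits and contention; `avg_memory_latency` is a hypothetical name.

```c
#include <assert.h>

/* Latencies from the slide (Intel Xeon 5500 measurements). */
#define LOCAL_LATENCY_CYCLES  190.0
#define REMOTE_LATENCY_CYCLES 310.0

/* Expected latency in cycles when `remote_fraction` of all memory
   accesses go to the other processor's DRAM. */
static double avg_memory_latency(double remote_fraction)
{
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) * LOCAL_LATENCY_CYCLES
         + remote_fraction * REMOTE_LATENCY_CYCLES;
}
```

For example, `avg_memory_latency(0.5)` gives 250 cycles, about 32% above the all-local 190 cycles.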

44 Data locality in multithreaded programs. [Bar chart: remote memory references / total memory references [%], axis from 0% to 60%, for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C.]


46 Automatic page placement. Current OS support for NUMA: first-touch page placement; often a high number of remote accesses. Data address profiling: for each thread, for each memory instruction, record the data address used.

47 Profile-based page placement. [Diagram: Processor 0 and Processor 1, each with its own DRAM; threads T0 and T1; pages P0 and P1.] Profile: P0 accessed 1000 times by T0; P1 accessed 3000 times by T1.
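One simple policy consistent with this picture is to place each page in the DRAM of the processor whose threads access it most. The sketch below assumes the profile has already been aggregated into per-processor access counts for a page; the slides do not spell out the exact policy, and `preferred_processor` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the processor with the most recorded accesses to a
   page; accesses[p] is the count from threads running on processor p.
   Ties go to the lower index. */
static size_t preferred_processor(const unsigned long *accesses,
                                  size_t num_processors)
{
    assert(accesses != NULL && num_processors > 0);
    size_t best = 0;
    for (size_t p = 1; p < num_processors; p++)
        if (accesses[p] > accesses[best])
            best = p;
    return best;
}
```

Assuming T0 runs on Processor 0 and T1 on Processor 1, the profile above becomes `{1000, 0}` for P0 and `{0, 3000}` for P1, so P0 is placed at Processor 0 and P1 at Processor 1.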

48 Automatic page placement. Compare first-touch and profile-based page placement. Machine: 2-processor, 8-core Intel Xeon E5520. Subset of NAS PB: programs with a high fraction of remote accesses. 8 threads with fixed thread-to-core mapping.

49 Profile-based page placement. [Bar chart: performance improvement over first-touch [%], axis from 0% to 25%, for cg.B, lu.C, bt.B, ft.B, sp.B.]


51 Hardware performance counters. Useful for program optimizations. Example seen: data locality optimization in NUMA systems. Question: what about my ACD project?

52 Your project. Which event to use? Which measurement mode to use? Is precise information needed? Answer: it depends on the optimization you choose. Example 1: loop unrolling; measure retired branch instructions, back-end stalls, i-fetch misses. Example 2: code positioning; measure i-fetch misses.

53 References. Processor manufacturers' manuals: Intel. Processor manufacturers' optimization manuals: Intel. Talk to me.