Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

massimo.bernaschi@cnr.it

2 Performance

There are two main measurements of performance. Execution time is what we'll focus on. Throughput is important for operating systems.

- Performance is mainly driven by the algorithm.
- However, the same algorithm may be implemented in different ways, and the choice may have a dramatic impact on performance.
- How do we measure performance?
  - Timers
  - Profilers

3 Many possible definitions of performance

- Every computer vendor will select the one that makes them look good. How do you make sense of conflicting claims?

Q: Why do end users need a new performance metric?
A: End users who rely only on megahertz as an indicator of performance do not have a complete picture of PC processor performance and may pay the price of missed expectations.

4 Basis of Comparison

When comparing systems, we need to fix the workload. Which workload?

Workload                             | Pros                                               | Cons
Actual target workload               | Representative                                     | Very specific; non-portable; difficult to run/measure
Full application benchmarks          | Portable; widely used; realistic                   | Less representative
Small kernel or synthetic benchmarks | Easy to run; useful early in design                | Easy to fool
Microbenchmarks                      | Identify peak capability and potential bottlenecks | Real application performance may be much below peak

5 The components of execution time

- Execution time can be divided into two parts:
  - User time is spent running the application program itself.
  - System time is spent when the application calls operating system code.
- The distinction between user and system time is not always clear, especially across different operating systems.
- The Unix time command shows both:

    salary.125 > time distill 05-examples.ps
    Distilling 05-examples.ps (449,119 bytes)
    10.8 seconds (0:11)
    449,119 bytes PS => 94,999 bytes PDF (21%)
    10.61u 0.98s 0: %

  The fields of the last line are, in order: user time, system time, wall clock time (including other processes), and CPU usage = (User + System) / Total.

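6 Three Components of CPU Performance

$$\text{CPU time}_{X,P} = \text{Instructions executed}_{P} \times \text{CPI}_{X,P} \times \text{Clock cycle time}_{X}$$

CPI stands for cycles per instruction. The subscripts indicate what each term depends on: the machine X and the program P.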

7 Instructions Executed

- Instructions executed:
  - We are not interested in the static instruction count, or how many lines of code are in a program.
  - Instead we care about the dynamic instruction count, or how many instructions are actually executed when the program runs.
- There are three lines of code below, but the number of instructions executed would be XXXX?

             li   $a0, 1000
    Ostrich: sub  $a0, $a0, 1
             bne  $a0, $0, Ostrich

8 CPI

- The average number of clock cycles per instruction, or CPI, is a function of the machine and program.
  - The CPI depends on the actual instructions appearing in the program.
  - It also depends on the CPU implementation.
- Do not confuse architecture with implementation! In an ideal world each instruction would take one cycle, making CPI = 1.
  - The CPI can be >1 due to memory stalls and slow instructions.
  - The CPI can be <1 on machines that execute more than one instruction per cycle (superscalar).

9 Execution time, again

$$\text{CPU time}_{X,P} = \text{Instructions executed}_{P} \times \text{CPI}_{X,P} \times \text{Clock cycle time}_{X}$$

- The easiest way to remember this is to match up the units:

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

- Make things faster by making any component smaller!

[Table: rows Instructions executed, CPI, Clock cycle time; columns Program, Compiler, ISA, Organization, Technology]

- Often it is easy to reduce one component by increasing another.

10 Improving CPI

- Many processor design techniques we'll see improve CPI.
- Often they only improve CPI for certain types of instructions:

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \qquad \text{where } F_i = \frac{I_i}{\text{Instruction count}}$$

- F_i = fraction of instructions of type i

First Law of Performance: Make the common case fast.

11 Time measure and profiling

- There are many different ways to measure time and profile a code.
- A very simple approach:

    #include <sys/time.h>
    /* Declare the two timestamps used by the macros below. */
    #define TIMER_DEF     struct timeval temp_1, temp_2
    #define TIMER_START   gettimeofday(&temp_1, (struct timezone*)0)
    #define TIMER_STOP    gettimeofday(&temp_2, (struct timezone*)0)
    /* Time between TIMER_START and TIMER_STOP, in microseconds. */
    #define TIMER_ELAPSED ((temp_2.tv_sec-temp_1.tv_sec)*1.e6+(temp_2.tv_usec-temp_1.tv_usec))

- It is easy when there are just a few points to be checked.
- Example: matrix-matrix product (mm.c)
- Test with and without optimization (-O3)

12 Time measure and profiling (2)

- Profilers help to collect timings and show the corresponding results.
- Classic (Linux) profiler (does not work on Mac OS X!):
  - gprof
  - requires compilation AND linking with the -pg option:

    cc -o mm -pg mm.c
    ./mm
    gprof mm

- Many modern alternatives:
  - Linux perf tool
  - Intel Architecture Code Analyzer (IACA)
  - Valgrind cachegrind

13 Perf (1)

- Modern processors and operating systems have performance counters, which track performance-affecting events such as the number of context switches or cache misses. Linux provides the perf_event subsystem, which allows reading these counters and using them to hunt down performance problems.
- The perf utility makes it even easier:
  - perf list lists the available performance counters.
  - perf top reports system-wide statistics on performance events.
  - perf stat runs a program and reports the total number of performance events for that program.

14 Perf (2)

- perf record runs a program and records detailed information about performance events.
- perf report allows browsing the information collected by perf record.
- To get summary statistics for a program:

    perf stat -e <counter1> myapp args...

- If you specify instructions and cycles, perf will compute the average number of instructions per cycle (IPC):

    perf stat -e instructions -e cycles myapp args...

- Remember:
  - The higher the IPC, the better.
  - If IPC < 1, you need to work on it.

15 Perf (3)

- If you specify branches and branch-misses, perf will compute the percentage of mispredicted branches:

    perf stat -e branches -e branch-misses myapp args...

- Targets for the percentage of mispredicted branches:
  - The fewer branches are mispredicted, the better.
  - If more than 5% of branches are mispredicted, you need to look at it.
- To compute the percentage of cache misses:

    perf stat -e cache-references -e cache-misses myapp args...

- Targets for the percentage of cache misses:
  - The fewer cache references are missed, the better.
  - If more than 10% of cache references are missed, investigate it.

16 Valgrind (1)

- Valgrind is a framework that allows intercepting the execution of specific classes of instructions (like loads and stores).
- There are a number of tools built on top of Valgrind for research and analysis.
- The cachegrind utility simulates processor caches and provides information on the number of cache misses.
- Unlike the perf utility, cachegrind allows changing the size and other parameters of the simulated caches and analysing how this affects the cache miss rate.

17 Valgrind (2)

- A program myapp can be launched under cachegrind with the command:

    valgrind --tool=cachegrind myapp args

- The cachegrind parameters --I1, --D1, and --LL allow changing the parameters of the simulated caches (size in bytes, associativity, line size in bytes):

    valgrind --tool=cachegrind --D1=65536,8,64 myapp args...

18 3 possible compilers

- GNU Compiler Collection (gcc/g++)
  - Default compiler on Linux
  - Produces very fast code
- Clang C/C++ Compilers (clang/clang++)
  - Default compiler on Mac OS X and FreeBSD
  - Much better error reporting than gcc
  - Usually produces slower code than gcc, but sometimes can be faster
  - Compiler options, pragmas, and other extensions compatible with gcc
- Intel C/C++ Compilers (icc/icpc)
  - Better error reporting than gcc
  - For Intel processors often produces faster code than gcc
  - More or less compatible with gcc

19 Basic optimization options

- -O0: disable optimization (the default); debugging behaves as expected
- -Og: enable only optimizations which do not mess up debugging
- -O2: enable all basic optimizations for performance
- -O3: -O2 plus more aggressive, speculative optimizations
- -Os: optimize for code size
  - For large code bases this can result in faster code than -O3

20 Machine-specific optimization options (gcc and clang)

- -march=<cpu>: use all instructions available on <cpu>. Valid <cpu> values:
  - core2 for Intel Core 2 and Xeon processors of the same generation (Harpertown, Clovertown). For 45 nm Core 2 processors (and Harpertown Xeons) you should also specify -msse4.1, because they additionally support the SSE4.1 instruction set.
  - corei7 for Intel Nehalem and Westmere: first-generation Core-i7 processors and Xeon 5500 and 5600 series.
  - corei7-avx for Intel Sandy Bridge processors (aka second-generation Core-i7).
  - core-avx-i for Intel Ivy Bridge processors (aka third-generation Core-i7).
  - core-avx2 for Intel Haswell processors (aka fourth-generation Core-i7).
  - bdver1 for AMD Bulldozer processors (Opteron 3200, 4200, and 6200 series).
  - bdver2 for AMD Piledriver processors (Trinity APUs, Opteron 3300, 4300, and 6300 series).
  - native for the CPU you compile on (sometimes misdetects the CPU).

21 Inter-Procedural Optimization

- Postpones code generation until the linking stage, when all information about the program is available.
- Different options for different compilers:
  - -ipo for the Intel compiler.
  - -flto for the GNU and Clang compilers.
- These options must be specified during both compilation and linking.

22 Profile-Guided Optimization

- Often there is not enough information at compile time to properly optimize the code. Profile-Guided Optimization (PGO) allows the compiler to collect information from program runs and then use this information to improve optimization quality.
- Using PGO with the GNU compilers:

1. Compile your program with the -fprofile-generate parameter:

    gcc ${CFLAGS} -fprofile-generate -o myapp myapp.c

2. Run the program on representative input:

    ./myapp testdata.input

3. Compile your program again with the -fprofile-use parameter:

    gcc ${CFLAGS} -fprofile-use -o myapp myapp.c
