1 Keys to node-level performance analysis and threading in HPC applications
Thomas GUILLET (Intel; Exascale Computing Research)
IFERC seminar, 18 March 2015

3 Application performance: a multiscale problem
Scales: Microarch, Core, Socket, Node, Cluster
- Multicore: vector ISA, cores, cache hierarchies
- Manycore: new vector ISAs, MPI+OMP?, memory/core?
- The optimization space is getting larger
Goal of this presentation:
- Provide keys to application performance and threading analysis
- Based on characterization and projection experience with full applications

4 Node-level performance
From algorithm to execution:
- Choice of algorithm or scheme -> Source code implementation -> Binary code -> Actual execution
- Programmer: data access patterns
- Compiler: vectorization, code generation
- Architecture: cache behavior, execution pathologies
Two families of optimizations:
- Memory bandwidth / data reuse optimizations
- Vectorization / code quality optimizations
Two main performance factors (to first order):
- Memory (DRAM) bandwidth demand
- Computation: flops (and sometimes non-flop instructions), use of execution units
Key questions:
- What are the requirements of my algorithm, in terms of compute vs. memory transfers?
- What performance can I expect?
- Where am I with respect to ideal performance?
- How can I get closer to ideal?

5 Flops, bytes & arithmetic intensity
Arithmetic intensity = Flop/byte: a measure of the balance between compute and ideal data transfer for a particular kernel.
DAXPY (Triad)
    ! y <- y + a*x, double precision (8 bytes per element)
    do i=1,n
      y(i) = y(i) + a*x(i)
    end do
- Read x: 8N bytes
- Read y: 8N bytes
- Compute y: 2N Flops
- Write y: 8N bytes
- Flop/byte = 2/24 ≈ 0.083
3D Stencil (Gauss-Seidel)
    ! In-place 6-point average; x needs halo cells at indices 0 and n+1
    do k=1,n
      do j=1,n
        do i=1,n
          x(i,j,k) = ONE_SIXTH * ( &
            x(i+1,j,k) + x(i-1,j,k) + &
            x(i,j+1,k) + x(i,j-1,k) + &
            x(i,j,k+1) + x(i,j,k-1))
        end do
      end do
    end do
- Read x: 8N^3 bytes
- Compute update: 6N^3 Flops
- Write new x: 8N^3 bytes
- Flop/byte = 6/16 = 0.375
Source code level analysis:
- Count floating point operations
- Count bytes (arrays) read and written, assuming perfect reuse (infinite cache): this is the ideal case
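The source-level counting above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the helper name `arithmetic_intensity` and its parameters are illustrative, not from the slides, and it assumes double precision (8 bytes per word) and one transfer per array element, i.e. the ideal infinite-cache case.

```python
def arithmetic_intensity(flops_per_elem, words_per_elem, bytes_per_word=8):
    """Ideal flop/byte ratio, assuming each array element is moved exactly once."""
    return flops_per_elem / (words_per_elem * bytes_per_word)

# DAXPY: 2 flops per element; read x, read y, write y -> 3 words per element
daxpy_ai = arithmetic_intensity(flops_per_elem=2, words_per_elem=3)

# Gauss-Seidel stencil: 6 flops per cell; read x, write x -> 2 words per cell
stencil_ai = arithmetic_intensity(flops_per_elem=6, words_per_elem=2)

print(f"DAXPY   flop/byte = {daxpy_ai:.3f}")
print(f"Stencil flop/byte = {stencil_ai:.3f}")
```

Both values are upper bounds on the intensity seen in a real run, since finite caches add memory traffic.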

6 Compute vs. bandwidth analysis
References: Quantitative System Performance, D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik; Williams et al.
[Figure: roofline plot. x-axis: log Flop/byte (arithmetic intensity); y-axis: log GFLOP/s (performance). Marked: compute bound region, ideal execution at the theoretical Flop/byte, actual execution at the actual Flop/byte.]
Actual vs. ideal execution:
- Efficiency (% of peak) depends on the microarchitecture
- Finite cache size reduces the actual flop/byte
- Optimization directions: vectorization and code generation (raise GFLOP/s); data reuse and cache optimizations (bring actual Flop/byte closer to theoretical)
Measuring data for actual execution:
- GFlop/s derived from code performance: GFlop/s = Gcells/s × Flops/cell
- DRAM bandwidth (GB/s): Intel VTune Amplifier XE or open source tools; requires root access or a special kernel module
- Flop/byte = (GFlop/s) / (GB/s)
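The compute vs. bandwidth bound and the measurement formulas above can be sketched as follows. The function names and the machine numbers (300 GFlop/s peak, 80 GB/s DRAM bandwidth, 0.5 Gcells/s, 40 GB/s measured) are hypothetical placeholders chosen for illustration; they are not measurements from the talk.

```python
def roofline_gflops(flop_per_byte, peak_gflops, peak_gbs):
    """Attainable GFlop/s: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, flop_per_byte * peak_gbs)

def measured_point(gcells_per_s, flops_per_cell, dram_gbs):
    """Actual (GFlop/s, flop/byte) from code throughput and measured DRAM bandwidth."""
    gflops = gcells_per_s * flops_per_cell
    return gflops, gflops / dram_gbs

PEAK_GFLOPS = 300.0  # hypothetical node peak
PEAK_GBS = 80.0      # hypothetical sustained DRAM bandwidth

# Ideal stencil intensity (0.375 flop/byte) sits on the bandwidth roof
bound = roofline_gflops(0.375, PEAK_GFLOPS, PEAK_GBS)

# Hypothetical measured run: 0.5 Gcells/s, 6 flops/cell, 40 GB/s of DRAM traffic
gflops, intensity = measured_point(0.5, 6.0, 40.0)

print(f"roofline bound: {bound:.1f} GFlop/s")
print(f"measured: {gflops:.1f} GFlop/s at {intensity:.3f} flop/byte")
```

Comparing the measured point with the bound at the theoretical intensity shows how far the run is from the ideal, and whether the gap is in flop/byte (data reuse) or in GFlop/s (code quality).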

7 Illustration: GYSELA kernels on Xeon
- System: 2 sockets, Xeon E (Sandy Bridge, 2.6 GHz)
- This kernel is bandwidth bound when vectorized, but compute bound when not vectorized!

8 Illustration: GYSELA kernels on Xeon Phi
- System: Xeon Phi 7120 (16 GB GDDR, 61 cores, 1.2 GHz)
- Efficiency drops for complex loop bodies
- Smaller caches incur more memory traffic

9 Node-level characterization: Wrap Up
Simple compute vs. bandwidth characterization («roofline»):
- Helps determine maximum performance expectations
- Helps identify optimization directions
Can be complemented by quick analysis tricks:
- Measure time on 1 full node (available bandwidth $BW_1$), and write: $T_{\text{1 full}} = T_{\text{compute}} + T_{\text{bw}}$
- Measure time on 2 half-filled nodes (available bandwidth $BW_2 > BW_1$), and write: $T_{\text{2 half}} = T_{\text{compute}} + T_{\text{bw}} \, (BW_1 / BW_2)$
- Solve for $T_{\text{compute}}$ and $T_{\text{bw}}$ to estimate the «memory-boundedness» of the application on this architecture
- Also useful for quick projections across similar architectures
General trends on Xeon Phi:
- Smaller caches incur more memory demand
- In-order core, complex vector ISA: compiler and code generation matter
So far, we assumed good parallelism (no threading or MPI issues)
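The two-measurement trick is a 2x2 linear system: subtracting the half-filled equation from the full-node one gives T_bw = (T_full - T_half) / (1 - BW_1/BW_2). A minimal sketch follows; the function name and the timings (10 s and 8 s) are made up, and the bandwidth ratio of 0.5 assumes that a half-filled node gives each rank twice the bandwidth, which has to be checked on the actual machine.

```python
def split_compute_bandwidth(t_full, t_half, bw_ratio):
    """Solve T_full = Tc + Tbw and T_half = Tc + Tbw * bw_ratio for (Tc, Tbw).

    bw_ratio is BW_1 / BW_2, the available bandwidth on a full node divided by
    the available bandwidth on half-filled nodes.
    """
    if not 0.0 < bw_ratio < 1.0:
        raise ValueError("bw_ratio = BW_1/BW_2 must lie strictly between 0 and 1")
    t_bw = (t_full - t_half) / (1.0 - bw_ratio)
    t_compute = t_full - t_bw
    return t_compute, t_bw

# Hypothetical timings: 10 s on one full node, 8 s on two half-filled nodes
t_compute, t_bw = split_compute_bandwidth(t_full=10.0, t_half=8.0, bw_ratio=0.5)
memory_boundedness = t_bw / (t_compute + t_bw)

print(f"T_compute = {t_compute:.1f} s, T_bw = {t_bw:.1f} s")
print(f"memory-boundedness = {memory_boundedness:.0%}")
```

The model treats compute and bandwidth time as additive with no overlap, as the slide does; it is a first-order estimate rather than a precise decomposition.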

10 Shared memory: To thread or not to thread?
Why is threading interesting in applications?
- Allows «larger» MPI ranks (for domain decomposition) for the same problem
- May improve the surface/volume ratio
- Amortizes the memory footprint of the MPI runtime
- Allows dynamic load balancing for imbalanced applications
What could possibly go wrong?
- Amdahl's law strikes back
- On computation: getting good coverage is hard
- On communications
- MPI+X is not intrinsically «better» than MPI: 4x1 vs. 1x4
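The surface/volume argument can be made concrete with a small sketch. It assumes cubic subdomains, a one-cell-deep halo on all six faces, and a hypothetical global problem of 24 x 64^3 cells on 24 cores; none of these numbers come from the talk.

```python
def halo_to_volume(cells_per_rank):
    """Halo/interior ratio of a cubic subdomain with a one-cell halo on 6 faces."""
    side = cells_per_rank ** (1.0 / 3.0)
    return 6.0 * side ** 2 / cells_per_rank  # equals 6 / side

TOTAL_CELLS = 24 * 64 ** 3  # hypothetical global problem size
CORES = 24

for threads in (1, 2, 4, 6, 12):
    ranks = CORES // threads
    ratio = halo_to_volume(TOTAL_CELLS / ranks)
    print(f"{ranks:2d} ranks x {threads:2d} threads: halo/volume = {ratio:.4f}")
```

Fewer, larger ranks exchange proportionally less halo data, which is the benefit threading can buy; whether it pays off depends on the threading overheads discussed on the following slides.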

11 Illustration: CFD application
Configurations with {#ranks} x {#threads} = 24 cores
[Figure: temporal loop wtime [s] (measured vs. Amdahl projection), footprint/core [MB], and app instructions/core, each plotted against OMP threads/rank]
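One way to build an Amdahl-style projection at a fixed core count is sketched below. This is an assumed model, not necessarily the one behind the plotted curve: with ranks x threads held at 24, each rank owns `threads` times more work than in the pure-MPI run; the threaded fraction `f` of that work is divided among the threads, while the remaining fraction runs on the master thread only. MPI time and threading overheads are ignored, and the 100 s baseline and 95% coverage are made-up numbers.

```python
def projected_wtime(t_pure_mpi, threaded_fraction, threads):
    """Projected wall time at fixed core count for a given number of threads/rank.

    t_pure_mpi: wall time of the 1-thread-per-rank configuration.
    threaded_fraction: share of the computation covered by OpenMP regions (0..1).
    """
    serial_fraction = 1.0 - threaded_fraction
    # Threaded work stays constant per core; serial work grows with rank size.
    return t_pure_mpi * (threaded_fraction + serial_fraction * threads)

T_PURE_MPI = 100.0  # hypothetical wall time [s] for 24 ranks x 1 thread
COVERAGE = 0.95     # hypothetical threaded fraction

for threads in (1, 2, 4, 6, 12):
    t = projected_wtime(T_PURE_MPI, COVERAGE, threads)
    print(f"{24 // threads:2d}x{threads:<2d}: projected wtime = {t:.1f} s")
```

Even with 95% coverage, the projected time in this model grows by more than half at 12 threads per rank, which illustrates why coverage matters so much at high thread counts.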

12 Illustration: CFD application
Configurations with {#ranks} x {#threads} = 24 cores
[Figure: CFD app example, wall time [s] on master thread, split into OMP, Serial and MPI, for Ranks x Threads = 24x1, 12x2, 6x4, 4x6, 2x12]
- OMP: wtime spent inside OpenMP parallel regions
- Serial: non-threaded computation wtime («Amdahl's law on threads»)
- MPI: wtime spent in the MPI library grows with the number of threads

13 Can threading help with imbalance? [synthetic data for illustration]
Imbalance time = max - mean
[Figure: two panels of per-core time vs. core id]
- Small-scale 50% imbalance: shared-memory dynamic load balancing may be effective against imbalance
- Large-scale 50% imbalance: shared-memory dynamic load balancing alone is ineffective against imbalance
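The difference between the two cases can be reproduced with synthetic loads. The sketch below assumes ideal dynamic balancing inside each shared-memory group (every core in a group ends up with the group mean) and uses made-up per-core loads of 1.5 and 0.5 on 24 cores with 4 threads per rank.

```python
def imbalance_time(loads):
    """Imbalance time = max - mean of the per-core times."""
    return max(loads) - sum(loads) / len(loads)

def balance_within_groups(loads, threads):
    """Ideal dynamic load balancing inside each group of `threads` cores."""
    balanced = []
    for start in range(0, len(loads), threads):
        group = loads[start:start + threads]
        balanced.extend([sum(group) / len(group)] * len(group))
    return balanced

N_CORES, THREADS = 24, 4

# Small-scale imbalance: heavy and light cores alternate
small_scale = [1.5 if i % 2 == 0 else 0.5 for i in range(N_CORES)]
# Large-scale imbalance: the first half of the machine is heavy
large_scale = [1.5 if i < N_CORES // 2 else 0.5 for i in range(N_CORES)]

for name, loads in (("small-scale", small_scale), ("large-scale", large_scale)):
    before = imbalance_time(loads)
    after = imbalance_time(balance_within_groups(loads, THREADS))
    print(f"{name}: imbalance {before:.2f} -> {after:.2f}")
```

Balancing inside a rank only averages loads that sit in the same shared-memory group, so imbalance at scales larger than a rank has to be addressed at the MPI level.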

14 Threading and imbalance: highly imbalanced adaptive mesh refinement code
[Figure: wall time [s] on master thread, split into OMP, Serial and MPI, for Ranks x Threads = 24x1, 12x2, 8x3, 6x4, 4x6, 2x12]
- OMP computation scales less than ideally
- Threading helps reduce extreme MPI imbalance
- But Amdahl's law still takes over at high thread counts

15 OpenMP: things to watch for in apps
Code coverage (a.k.a. Amdahl's law):
- Extensive coverage is critical for scalability
- Can be very tedious or impossible to achieve for flat-profile applications
- Coarse threading (as opposed to loop-level) helps, but reimplementing MPI doesn't
Granularity:
- Important metric = average wall time of OpenMP regions
- Compare it to the OpenMP barrier/sync time
Both points grow in importance on Xeon Phi:
- Lots of threads: coverage grows in importance
- Limited memory/core: short loops
VTune profiling can help diagnose both issues
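The granularity check amounts to comparing the average region time with the synchronization cost. A minimal sketch, assuming every parallel region pays one fixed barrier/sync cost; the 10 microsecond figure and the region times are hypothetical, and real values should come from profiling.

```python
def region_efficiency(region_time_us, sync_time_us):
    """Share of wall time doing useful work when each region pays one sync cost."""
    return region_time_us / (region_time_us + sync_time_us)

SYNC_US = 10.0  # hypothetical barrier/sync cost per parallel region

for region_us in (10.0, 100.0, 1000.0):
    eff = region_efficiency(region_us, SYNC_US)
    print(f"region {region_us:6.0f} us: useful fraction = {eff:.1%}")
```

Regions whose average time is comparable to the sync cost lose a large share of their time to synchronization, which is why short loops are a concern at high thread counts.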

16 Wrap-up
Careful performance analysis is essential to guide code optimizations:
- Set pragmatic performance targets
- Collect data on application behavior
A simple compute vs. bandwidth model can provide:
- Robust first-order characterization
- Insights into specific or second-order effects
Threading can help address some strong-scaling issues:
- Amortize halo overheads, level out imbalance
- No magic: obtaining good coverage is hard work
Threading is an important adjustment variable for:
- Heterogeneous computing resources (e.g. symmetric mode)
- Available memory/core

