Performance Characteristics of Large SMP Machines
Dirk Schmidl, Dieter an Mey, Matthias S. Müller
schmidl@rz.rwth-aachen.de
Rechen- und Kommunikationszentrum (RZ)
Agenda
- Investigated Hardware
- Kernel Benchmark Results
  - Memory Bandwidth
  - NUMA Distances
  - Synchronization
- Applications
  - NestedCP
  - TrajSearch
- Conclusion
Hardware
- HP ProLiant DL980 G7
  - 8 x Intel Xeon X6550 @ 2 GHz
  - 256 GB main memory
  - internally built from several boards
- SGI Altix UltraViolet
  - 14 x Intel Xeon E7-4870 @ 2.4 GHz
  - about 2 TB main memory
  - 2-socket boards connected with a NUMAlink network
- Bull Coherence Switch system
  - 16 x Intel Xeon X7550 @ 2 GHz
  - 256 GB main memory
  - 4-socket boards externally connected with the Bull Coherence Switch (BCS)
Hardware
- ScaleMP system
  - 64 x Intel Xeon X7550 @ 2 GHz
  - about 4 TB main memory
  - 4-socket boards connected with InfiniBand
  - vSMP Foundation software is used to create a cache-coherent single system
- Intel Xeon Phi
  - 1 Intel Xeon Phi coprocessor @ 1.05 GHz, plugged into a PCIe slot
  - 8 GB main memory
Serial Bandwidth
[Figure: serial write bandwidth in GB/s over memory footprints from 1 B to 4 GB, one panel per system (HP, Altix UV, BCS, ScaleMP, Phi), comparing local accesses with remote accesses over the 1st and 2nd NUMA level; the Phi panel compares standard execution with software prefetching.]
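The benchmark code itself is not shown on the slides; the following is a minimal sketch of how such a serial write-bandwidth curve can be measured (an assumption, not the original kernel). Local versus remote placement can be forced from outside, e.g. by running the binary under numactl with --cpunodebind and --membind.

```c
/* Sketch of a serial write-bandwidth measurement (an assumption, not the
 * original kernel): write a buffer of a given size repeatedly and report GB/s.
 * Local vs. remote placement is controlled from outside, e.g. via
 *   numactl --cpunodebind=0 --membind=<node> ./serial_bw                   */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

static double write_bw(size_t bytes, int repeats)
{
    char *buf = malloc(bytes);
    memset(buf, 1, bytes);                    /* first touch allocates the pages */

    double t0 = omp_get_wtime();
    for (int r = 0; r < repeats; r++)
        memset(buf, r & 0xff, bytes);         /* streaming writes */
    double t1 = omp_get_wtime();

    free(buf);
    return (double)bytes * repeats / (t1 - t0) / 1e9;   /* GB/s */
}

int main(void)
{
    /* footprints from 1 B to 4 GB, as on the x-axis of the figure */
    for (size_t sz = 1; sz <= ((size_t)4 << 30); sz <<= 4)
        printf("%10zu bytes: %6.2f GB/s\n", sz, write_bw(sz, 10));
    return 0;
}
```

In a real measurement the repeat count would be scaled inversely with the buffer size so that even the smallest footprints run long enough to be timed reliably.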
Distance Matrix
- measured bandwidth between all sockets; memory and threads placed with numactl (see the sketch below)
- values normalized to the local socket
- [Table: 16x16 distance matrix for the BCS system; accessing a socket on the same board costs about a factor of 1.3 compared to a local access, a socket on a remote board about a factor of 5.5 to 5.9]
- remote accesses are much more expensive on the BCS machine
- the HP machine internally also has several NUMA levels
- [Table: 8x8 distance matrix for the HP system; the paired socket is almost as fast as the local one, the remaining sockets cost roughly a factor of 1.3 to 1.9]
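The slide states that threads and memory were placed with numactl; the sketch below does the equivalent placement programmatically with libnuma (an assumed reimplementation, not the original measurement) and produces one row of such a distance matrix.

```c
/* One row of a NUMA distance matrix (assumed reimplementation with libnuma
 * instead of the numactl placement used on the slide): run the measuring
 * code on NUMA node 0 and place the buffer on every other node in turn.
 * Build with: gcc -fopenmp distance.c -lnuma                               */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <omp.h>

static double write_bw(int cpu_node, int mem_node, size_t bytes, int reps)
{
    numa_run_on_node(cpu_node);                      /* pin execution to one socket */
    char *buf = numa_alloc_onnode(bytes, mem_node);  /* place memory on another */
    memset(buf, 1, bytes);

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++)
        memset(buf, r & 0xff, bytes);
    double t1 = omp_get_wtime();

    numa_free(buf, bytes);
    return (double)bytes * reps / (t1 - t0) / 1e9;   /* GB/s */
}

int main(void)
{
    if (numa_available() < 0) return 1;              /* no NUMA support on this system */
    const size_t sz = (size_t)256 << 20;             /* 256 MB test buffer */
    double local = write_bw(0, 0, sz, 10);

    for (int node = 0; node <= numa_max_node(); node++) {
        double bw = write_bw(0, node, sz, 10);
        printf("node 0 -> node %d: %6.2f GB/s, relative cost %.1f\n",
               node, bw, local / bw);
    }
    return 0;
}
```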
Parallel Bandwidth
[Figure: read and write bandwidth in GB/s over the number of threads (1 up to 240) for the HP, Altix, BCS, ScaleMP, and Phi systems]
- read and write bandwidth measured on local data
- 16 MB memory footprint per thread
mem_go_around
- investigates the slow-down when remote accesses occur
- every thread initializes local memory and measures the bandwidth
- in step n, thread t uses the memory of thread (t+n)%nthreads
- this increases the number of remote accesses in every step (see the sketch below)
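A minimal sketch of the mem_go_around pattern as described above (an assumed implementation; buffer size and output are illustrative):

```c
/* Sketch of the mem_go_around idea (assumed implementation): every thread
 * first touches its own buffer, and in turn n thread t writes the buffer of
 * thread (t+n)%nthreads, so the share of remote accesses grows with n.     */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BUF ((size_t)64 << 20)                /* 64 MB per thread (illustrative) */

int main(void)
{
    int nthreads = omp_get_max_threads();
    char **bufs = malloc(nthreads * sizeof(char *));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        bufs[t] = malloc(BUF);
        memset(bufs[t], 1, BUF);              /* first touch: pages end up local to t */

        for (int n = 0; n < nthreads; n++) {
            #pragma omp barrier               /* all buffers ready, start turn n together */
            double t0 = omp_get_wtime();
            memset(bufs[(t + n) % nthreads], n, BUF);   /* turn n: use thread (t+n)'s memory */
            double t1 = omp_get_wtime();
            #pragma omp barrier
            #pragma omp master
            printf("turn %d: %.2f GB/s on the master thread\n", n, BUF / (t1 - t0) / 1e9);
        }
    }
    return 0;
}
```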
mem_go_around
[Figure: memory bandwidth in GB/s (log scale, 1 to 512) over the turn number for the HP, Altix, BCS, ScaleMP, and Phi systems]
Synchronization
- overhead in microseconds to acquire a lock
- [Table: lock overhead for the BCS, ScaleMP, Phi, Altix, and HP systems, measured for 1 up to 240 threads]
- synchronization overhead rises with the number of threads
- ScaleMP introduces more overhead for large thread counts
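The lock benchmark itself is not shown; a minimal sketch of how such an overhead number can be obtained with OpenMP locks (an assumption, in the spirit of the EPCC syncbench, not necessarily the benchmark behind the table):

```c
/* Sketch of a lock-overhead measurement (assumed approach): all threads
 * repeatedly acquire and release one shared OpenMP lock; the wall-clock time
 * per acquire/release round is reported in microseconds.                    */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int reps = 100000;
    omp_lock_t lock;
    omp_init_lock(&lock);

    double t0 = omp_get_wtime();
    #pragma omp parallel
    for (int i = 0; i < reps; i++) {
        omp_set_lock(&lock);                  /* contended acquisition */
        omp_unset_lock(&lock);
    }
    double t1 = omp_get_wtime();

    printf("%d threads: %.2f us per lock/unlock round\n",
           omp_get_max_threads(), (t1 - t0) / reps * 1e6);

    omp_destroy_lock(&lock);
    return 0;
}
```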
NestedCP: Parallel Critical Point Extraction
- Virtual Reality Group of RWTH Aachen University: analysis of large-scale flow simulations
- feature extraction from raw data
- interactive analysis in a virtual environment (e.g. a CAVE)
- critical point: a point in the vector field with zero velocity
(Image: Andreas Gerndt, Virtual Reality Center, RWTH Aachen)
NestedCP
- parallelization done with OpenMP tasks
- many independent tasks, only synchronized at the end (see the sketch below)
[Figure: runtime in seconds and speedup over the number of threads (1 up to 240) for the BCS, ScaleMP, Phi, Altix, and HP systems]
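A minimal sketch of the task structure described above (an assumed shape of the code, not the original NestedCP implementation; search_block is a hypothetical stand-in for the real critical-point test):

```c
/* Sketch of the task pattern: the domain is split into independent blocks,
 * each block is searched in its own task, and the tasks only synchronize
 * at the end.                                                              */
#include <stdio.h>
#include <omp.h>

static int search_block(int block_id)
{
    return block_id % 7 == 0;                 /* dummy result for illustration */
}

int main(void)
{
    const int nblocks = 1024;
    int found = 0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; b++) {
            #pragma omp task firstprivate(b)
            {
                if (search_block(b)) {        /* many independent tasks */
                    #pragma omp atomic
                    found++;
                }
            }
        }
        #pragma omp taskwait                  /* single synchronization point at the end */
    }
    printf("blocks containing critical points: %d\n", found);
    return 0;
}
```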
TrajSearch
- direct numerical simulation of a three-dimensional turbulent flow field produces large output arrays
- 16384 processes on a BlueGene, about half a year of computation, produced a 2048³ output grid (32 GB)
- trajectory analysis (TrajSearch) implemented with OpenMP
- was optimized for large NUMA machines
- here the 1024³ grid cell data set was used (~4 GB)
(Institute for Combustion Technology)
TrajSearch
- Optimizations (see the sketch below):
  - reduced number of locks
  - NUMA-aware data initialization
  - data blocked into 8x8x8 blocks to load the nearest data on ScaleMP
  - self-written NUMA-aware scheduler
[Figure: runtime in hours and speedup over the number of threads (8 to 128) for the Altix, BCS, and ScaleMP systems]
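A minimal sketch of the blocked, NUMA-aware initialization listed above (an assumed data layout, not the original TrajSearch code):

```c
/* Sketch of the blocked, NUMA-aware initialization: the grid is stored as
 * contiguous 8x8x8 blocks and each block is first touched by the thread that
 * will later process it, so its pages land on that thread's NUMA node.     */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BS 8                                  /* 8x8x8 blocks, as on the slide */

static double *alloc_blocked_grid(size_t nblocks)
{
    const size_t block_elems = (size_t)BS * BS * BS;
    double *grid = malloc(nblocks * block_elems * sizeof(double));

    /* static schedule: the thread that first touches a block here is the one
       that should later work on it, so the pages stay local (first touch) */
    #pragma omp parallel for schedule(static)
    for (size_t b = 0; b < nblocks; b++)
        memset(grid + b * block_elems, 0, block_elems * sizeof(double));

    return grid;
}

int main(void)
{
    const size_t nblocks = (size_t)64 * 64 * 64;   /* e.g. a 512^3 grid in 8^3 blocks */
    double *grid = alloc_blocked_grid(nblocks);
    printf("initialized %zu blocks with %d threads\n", nblocks, omp_get_max_threads());
    free(grid);
    return 0;
}
```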
Conclusion
- larger systems provide a larger total memory bandwidth
- the overhead for many remote accesses is also higher on larger systems, as seen in the mem_go_around test
- the caching in the vSMP software can hide the remote latency, even when larger arrays are read or written remotely
- synchronization is a problem on all systems and grows with the number of cores
- the Xeon Phi delivers good bandwidth and low synchronization overhead for a large number of threads
- applications can run well on large NUMA machines
Remark: a revised version with newer performance measurements will soon be available on our website under publications: https://sharepoint.campus.rwth-aachen.de/units/rz/hpc/public/default.aspx
Thank you for your attention! Questions?