Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller
schmidl@rz.rwth-aachen.de
Rechen- und Kommunikationszentrum (RZ)
The Question: If we look at the Xeon Phi as a standalone system, can OpenMP parallel programs, optimized for multicore machines, run efficiently on the Intel Xeon Phi Coprocessor without special tuning?
Agenda
- The Intel Xeon Phi Architecture
- Kernel Benchmarks
- NAS Parallel Benchmarks
- Application Tests
- Conclusion and Outlook
The Intel Xeon Phi Architecture
The Intel Xeon Phi Architecture

Intel Xeon Phi
- PCIe extension card
- 60 in-order cores
- 4-way Hyperthreading
- 512-bit vector registers
- 1 GHz clock rate
- ring network

Intel Sandy Bridge System
- 2 x 8 cores
- complex out-of-order cores
- 2 GHz clock rate
- 2-way Hyperthreading
- QPI interconnect

Both architectures have roughly the same price, size and power consumption.
Kernel Benchmarks
Memory Bandwidth
- STREAM triad benchmark (a[i] = b[i] + x * c[i], see the sketch below)
- 2 GB memory footprint
- spread thread pinning
- Intel Compiler 13.0, -mmic flag to cross-compile
- Sandy Bridge System (SNB): 32 GB DDR3 RAM
- Intel Xeon Phi: 8 GB GDDR5 RAM

[Figure: memory bandwidth in GB/s over the number of threads (1 to 256) for the SNB system and the Intel Xeon Phi.]
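As a rough illustration of the measurement, a minimal OpenMP triad sketch follows. It is not the original STREAM code: the array length, the single timed repetition and the first-touch initialization are simplifying assumptions.

```c
/* Minimal sketch of a STREAM-triad-style measurement (a[i] = b[i] + x*c[i]).
   Not the original STREAM benchmark: array size, repetitions and timing are
   simplified; three double arrays of length N give roughly a 2 GB footprint. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 85000000L

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double x = 3.0;

    /* Parallel first-touch initialization so memory pages end up
       distributed across the threads. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + x * c[i];                 /* the triad kernel */
    t = omp_get_wtime() - t;

    /* three arrays of 8 bytes each are touched per iteration */
    printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

For a native run on the coprocessor, such a code would be cross-compiled with the -mmic flag and executed on the card; the spread thread pinning mentioned above can be requested through the affinity environment variables (e.g. KMP_AFFINITY or OMP_PROC_BIND).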
Memory Latency
- single-threaded pointer chasing with random stride (see the sketch below)
- small stride: larger than a cache line (if possible)
- large stride: larger than a memory page (if possible)

[Figure: latency in ns over the memory footprint (1 B to 4 GB) for the small and large stride, on the Intel Xeon Phi (level 1 and level 2 cache regions visible) and on the SNB system (level 1, level 2 and level 3 cache regions visible).]
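The core of such a latency measurement can be sketched as a pointer chase over a shuffled array, as below. The fixed footprint and the fully random order are simplifications; the stride control (cache-line vs. page granularity) described above is omitted.

```c
/* Minimal sketch of a single-threaded pointer-chasing latency measurement.
   Each load depends on the previous one, so the hardware prefetcher cannot
   hide the latency. Footprint and step count are illustrative values. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n     = 1UL << 24;           /* 16M pointers = 128 MB footprint */
    const size_t steps = 1UL << 24;
    void **chain = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));

    /* Build a random permutation (Fisher-Yates shuffle) ... */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    /* ... and link the elements into one random cycle. */
    for (size_t i = 0; i < n; i++)
        chain[perm[i]] = &chain[perm[(i + 1) % n]];

    void **p = &chain[0];
    double t = omp_get_wtime();
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                      /* dependent loads, one per step */
    t = omp_get_wtime() - t;

    /* printing p keeps the compiler from removing the chase */
    printf("average load latency: %.1f ns (%p)\n", t / steps * 1e9, (void *)p);
    free(chain); free(perm);
    return 0;
}
```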
EPCC Microbenchmarks
- kernel benchmark to measure the overhead of OpenMP constructs (measurement principle sketched below)
- extended to measure the overhead of tasking as well

Worksharing overhead per construct in microseconds:

Xeon Phi:
  # Threads   parallel for   barrier   reduction
  2           4.32           1.29      4.29
  16          13.81          5.83      21.61
  30          15.85          8.21      24.80
  240         27.56          13.37     48.86

SNB:
  # Threads   parallel for   barrier   reduction
  2           0.85           0.56      1.65
  16          3.47           2.05      5.83
  32          24.36          31.78     58.90

Tasking overhead per construct in microseconds:

Xeon Phi:
  # Threads   single producer   parallel producer
  2           4.18              0.92
  16          81.18             1.67
  30          165.50            1.78
  240         1355.90           8.39

SNB:
  # Threads   single producer   parallel producer
  2           0.80              0.18
  16          63.25             0.75
  32          146.41            4.11
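The measurement idea behind such microbenchmarks can be sketched as follows: time a fixed amount of work sequentially as a reference, then time the same work wrapped in the construct under test, and attribute the difference to the construct. The delay routine, the repetition count and the missing statistics below are simplifications compared to the real EPCC suite.

```c
/* Minimal sketch of an EPCC-style overhead measurement for "parallel for".
   The real EPCC microbenchmarks calibrate the delay and average over many
   outer repetitions; this version only illustrates the principle. */
#include <omp.h>
#include <stdio.h>

#define REPS 1000
#define WORK 1000

static void delay(int n) {                    /* small, fixed amount of work */
    volatile double a = 0.0;
    for (int i = 0; i < n; i++) a += i * 0.5;
}

int main(void) {
    const int nthreads = omp_get_max_threads();

    /* Reference: the total work of one repetition, executed sequentially. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < nthreads; i++)
            delay(WORK);
    double ref = omp_get_wtime() - t0;

    /* Measurement: one "parallel for" per repetition, one iteration per thread. */
    double t1 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nthreads; i++)
            delay(WORK);
    }
    double par = omp_get_wtime() - t1;

    /* Overhead per construct = measured time minus the ideal time ref/nthreads. */
    printf("parallel for overhead: %.2f us\n",
           (par - ref / nthreads) * 1e6 / REPS);
    return 0;
}
```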
Conjugate Gradient Method
- sparse matrix-vector multiplication in a CG solver
- ~3.5 GB memory footprint
- different parallel versions: dynamic worksharing, precalculated worksharing, task parallel version (the dynamic variant is sketched below)

[Figure: GFLOPS over the number of threads (1 to 256) for the pre-calculated, task and dynamic versions on the Xeon Phi and on the SNB system.]
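A sketch of the sparse matrix-vector product with OpenMP worksharing and a dynamic schedule is given below; the CRS data structure, the chunk size and the function name are illustrative assumptions, not the code used in the talk.

```c
/* Minimal sketch of a CRS sparse matrix-vector product y = A*x with OpenMP
   worksharing and a dynamic schedule. Structure and chunk size are assumptions. */
#include <omp.h>

typedef struct {
    int     nrows;
    int    *row_ptr;   /* nrows+1 entries: start of each row in col/val */
    int    *col;       /* column index of each non-zero */
    double *val;       /* value of each non-zero */
} crs_matrix;

void spmv(const crs_matrix *A, const double *x, double *y) {
    #pragma omp parallel for schedule(dynamic, 64)   /* hand out 64 rows at a time */
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}
```

The precalculated variant would presumably replace the dynamic schedule by per-thread row ranges computed once (balanced by the number of non-zeros), and the task version would create one task per block of rows.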
NAS Parallel Benchmarks
NAS Parallel Benchmarks
- standard benchmarks for parallel computing, problem size C
- good speedup on both systems
- serial performance on the Xeon Phi is low
- overall, the parallel version is slower on the Xeon Phi system

Runtime in seconds:

SNB:
  Benchmark   1 Thread   32 Threads   Speedup
  IS          23.12      1.38         16.75
  EP          186.81     8.11         23.03
  MG          64.04      8.03         7.98
  FT          306.11     19.19        15.95
  BT          1241.63    82.61        15.03
  SP          826.25     137.69       6
  LU          1109.76    62.23        17.83

Intel Xeon Phi:
  Benchmark   1 Thread   240 Threads   Speedup
  IS          192.49     2.46          78.25
  EP          1518.42    13.34         113.82
  MG          498.94     9.63          51.81
  FT          2393.01    53.97         44.34
  BT          9433.52    132.29        71.31
  SP          12264.29   164.59        74.51
  LU          9835.09    163.33        60.22
Application Tests
Application Tests

  Application   Area                                                  Parallelization   Language   Size
  imoose        finite elements package                               worksharing       C++        ~300k lines
  FIRE          image recognition                                     tasks             C++        ~35k lines
  NestedCP      extracting critical points in unsteady flow fields    nested parallel   C++        ~2k lines
  NestedCP      extracting critical points in unsteady flow fields    tasks             C++        ~2k lines
  NINA          Neuromagnetic INverse large-scale problems            worksharing       C          ~2k lines
Application Tests

Runtime in seconds (best runtime with the number of threads used in parentheses):

SNB:
  Application        1 Thread   best (#threads)   Speedup
  imoose             104.68     12.2 (16)         8.58
  FIRE               284.6      16.68 (32)        17.06
  NestedCP Nested    46.99      3.21 (32)         14.62
  NestedCP Tasking   47.34      2.43 (32)         19.47
  NINA               470.06     61.16 (16)        7.68

Intel Xeon Phi:
  Application        1 Thread   best (#threads)   Speedup
  imoose             1243.54    15.59 (240)       79.74
  FIRE               2672.71    38.25 (234)       98.02
  NestedCP Nested    845.14     35.58 (240)       23.76
  NestedCP Tasking   848.34     11.14 (240)       76.16
  NINA               1381.94    27.29 (177)       50.64

- also here the speedup for all codes is good
- the serial runtime is higher on the Xeon Phi
- only NINA is faster on the Xeon Phi
NINA
- software for the solution of Neuromagnetic INverse large-scale problems [2]
- experiment: persons get different stimuli in the form of pictures; the induced magnetic field is measured around the head; NINA reconstructs the activity inside the brain
- kernel portion of 90%: dense matrix-vector multiplications & vector operations (see the sketch below)
- the matrix fits into memory (128 x 512,000)

[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6):2905-2921, 2008.
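A minimal sketch of such a dense matrix-vector product with OpenMP worksharing is given below; the row-major layout and the interface are assumptions for illustration, not the actual NINA code.

```c
/* Minimal sketch of a dense matrix-vector product y = A*x with OpenMP,
   A stored row-major with m rows and n columns. Interface and layout are
   assumptions, not the actual NINA kernel. */
#include <omp.h>

void dense_matvec(int m, int n, const double *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(long)i * n + j] * x[j];   /* dot product of row i with x */
        y[i] = sum;
    }
}
```

With only 128 rows but up to 240 threads, parallelizing over the rows alone cannot keep the coprocessor busy, so the long dimension (512,000 columns) would also have to be split, e.g. by blocking the columns and reducing partial sums.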
Conclusion and Outlook
- The memory system delivers good performance: about 2.5x faster than the SNB system.
- For kernels like the CG, the Xeon Phi also delivers good performance.
- For the NAS Parallel Benchmarks and all applications but the NINA code, the overall runtime was lower on the SNB system.

The Answer: Most of the tested unchanged OpenMP applications did not perform well on the Xeon Phi architecture, because of the relatively slow serial performance.
Questions or Comments?