Assessing the Performance of OpenMP Programs on the Intel Xeon Phi


1 Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller
Rechen- und Kommunikationszentrum (RZ)

2 The Question: If we look at the Xeon Phi as a standalone system: Can OpenMP parallel programs, optimized for multicore machines, run efficiently on the Intel Xeon Phi Coprocessor without special tuning?

3 Agenda
- The Intel Xeon Phi Architecture
- Kernel benchmarks
- NAS Parallel Benchmarks
- Application Tests
- Conclusion and Outlook

4 The Intel Xeon Phi Architecture

5 The Intel Xeon Phi Architecture
Intel Xeon Phi
- PCIe extension card
- 60 In-Order Cores
- 4-way Hyperthreading
- 512-bit vector registers
- 1 GHz clockrate
- ring network
Intel Sandy Bridge System
- 2 times 8 cores
- complex out-of-order cores
- 2 GHz clockrate
- 2-way Hyperthreading
- QPI interconnect
Both architectures have roughly the same price, size and power consumption.

7 Kernel Benchmarks

8 Memory Bandwidth
- Stream triad benchmark (a = b + x * c)
- 2 GB memory footprint
- spread thread pinning
- Intel Compiler, -mmic flag to cross-compile
- Sandy Bridge System (SNB): 32 GB DDR3 RAM
- Intel Xeon Phi: 8 GB GDDR5 RAM
[Figure: bandwidth in GB/s over the number of threads, SNB System vs. Intel Xeon Phi]
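The triad kernel and the way a bandwidth figure is derived from it can be sketched as follows. This is a minimal illustration, not the benchmark code behind the slide: the function names are hypothetical, 8-byte double-precision elements are assumed, and the byte count follows the usual STREAM convention of two loads and one store per element (write-allocate traffic is ignored). Pure Python is far too slow to measure real memory bandwidth, so only the arithmetic carries over.

```python
def triad(a, b, c, x):
    """Stream triad: a[i] = b[i] + x * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + x * c[i]


def triad_bytes(n, bytes_per_element=8):
    """Bytes moved per triad sweep: read b, read c, write a (assumed doubles)."""
    return 3 * n * bytes_per_element


def bandwidth_gb_per_s(n, seconds, bytes_per_element=8):
    """Bandwidth in GB/s for one sweep over n elements that took `seconds`."""
    return triad_bytes(n, bytes_per_element) / seconds / 1e9


if __name__ == "__main__":
    b = [1.0, 2.0, 3.0, 4.0]
    c = [10.0, 20.0, 30.0, 40.0]
    a = [0.0] * len(b)
    triad(a, b, c, 0.5)
    print(a)
    # A sweep over 10**9 elements in one second moves 24 GB.
    print(bandwidth_gb_per_s(10**9, 1.0))
```

If the 2 GB footprint on the slide covers all three arrays in double precision (an assumption), each array holds roughly 2e9 / 24, i.e. about 83 million elements.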

9 Memory Latency
- single threaded pointer chasing with random stride
- small stride = larger than a cache-line (if possible)
- large stride = larger than a memory page (if possible)
[Figures: latency in ns over memory footprint (1 B to 4 GB), small and large stride; Intel Xeon Phi with Level 1 and Level 2 cache, SNB System with Level 1, Level 2 and Level 3 cache]
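The slide does not say how the pointer chain is built. One common way is sketched here, with Sattolo's algorithm swapped in as an assumption: it produces a random permutation that forms a single cycle, so every load depends on the previous one and the whole footprint is visited. All names are hypothetical, the small/large stride constraint from the slide is not modelled, and timings taken in Python reflect interpreter overhead rather than cache or memory latency.

```python
import random
import time


def single_cycle_permutation(n, seed=0):
    """Return nxt so that following i -> nxt[i] visits all n slots in one cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    # Sattolo's algorithm: swap each position with a strictly earlier one.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent hops and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


def ns_per_hop(nxt, steps):
    """Average wall-clock time per hop in nanoseconds (interpreter-dominated)."""
    t0 = time.perf_counter_ns()
    chase(nxt, steps)
    return (time.perf_counter_ns() - t0) / steps


if __name__ == "__main__":
    chain = single_cycle_permutation(1 << 16)
    print(ns_per_hop(chain, len(chain)))
```

In a compiled benchmark the same chain would be stored as pointers or indices in one array, and the footprint would be varied to step through the cache levels.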

10 EPCC Microbenchmarks
- Kernel benchmark to measure the overhead of OpenMP constructs
- extended to measure the overhead of tasking as well
Overhead in microseconds.
Worksharing
Xeon Phi # Threads | Parallel for | barrier | reduction | SNB # Threads | Parallel for | barrier | reduction
… | … | … | … | … | … | …05 | 5.83
30 | 15.85 | 8.21 | 24.80 | 32 | 24.36 | 31.78 | 58.90
240 | 27.56 | 13.37 | 48.… | | | |
Tasking
Xeon Phi # Threads | single producer | parallel producer | SNB # Threads | single producer | parallel producer
… | … | … | … | … | …

11 Conjugate Gradient Method
- Sparse Matrix-Vector Multiplication in a CG solver
- ~3.5 GB memory footprint
- different parallel versions: dynamic worksharing, precalculated worksharing, task parallel version
[Figure: GFLOPS over the number of threads for Xeon Phi and SNB, each with the pre-calc., tasks and dynamic versions]

12 NAS Parallel Benchmarks

13 NAS Parallel Benchmarks
- Standard benchmarks for parallel computing
- problem size C
- good speedup on both systems
- serial runtime on the Xeon Phi is considerably longer than on the SNB system
- overall, the parallel version is slower on the Xeon Phi system
Runtime in seconds.
Benchmark | SNB 1 Thread | SNB 32 Threads | SNB Speedup | Xeon Phi 1 Thread | Xeon Phi 240 Threads | Xeon Phi Speedup
IS | …12 | 1.38 | 16.75 | 192.49 | 2.46 | 78.25
EP | 186.81 | 8.11 | 23.03 | 1518.42 | 13.34 | 113.82
MG | 64.04 | 8.03 | 7.98 | 498.94 | 9.63 | 51.81
FT | 306.11 | 19.19 | 15.95 | 2393.01 | 53.97 | 44.…
BT | … | … | … | … | … | …
SP | … | … | … | … | … | …
LU | … | … | … | … | … | …
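The speedup column is the serial runtime divided by the parallel runtime. A small sketch using the EP row (186.81 s and 8.11 s on SNB, 1518.42 s and 13.34 s on the Xeon Phi) shows how a higher speedup can still mean a longer time to solution. The helper names are hypothetical, and the per-thread efficiency counts hardware threads, not cores.

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial runtime / parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, threads):
    """Speedup per thread (hardware threads, so hyperthreads count as well)."""
    return speedup(t_serial, t_parallel) / threads


# EP row of the NAS table: (1-thread runtime, parallel runtime, threads)
EP = {
    "SNB": (186.81, 8.11, 32),
    "Xeon Phi": (1518.42, 13.34, 240),
}

if __name__ == "__main__":
    for system, (t1, tp, threads) in EP.items():
        print(system,
              round(speedup(t1, tp), 2),
              round(efficiency(t1, tp, threads), 2))
    # Ratio of the best parallel runtimes: values above 1 mean SNB finishes first.
    print(round(EP["Xeon Phi"][1] / EP["SNB"][1], 2))
```

For EP the Xeon Phi reaches a speedup of about 114 against about 23 on SNB, yet its parallel runtime is still roughly 1.6 times longer, because its single-thread run takes about eight times as long.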

14 Application Tests

15 Application Tests
Application | Area | Parallelization | Language | Size
imoose | finite elements package | worksharing | C++ | ~300k lines
FIRE | image recognition | tasks | C++ | ~35k lines
NestedCP | extracting critical points in unsteady flow fields | nested parallel | C++ | ~2k
NestedCP | extracting critical points in unsteady flow fields | tasks | C++ | ~2k
NINA | Neuromagnetic INverse large-scale problems | worksharing | C | ~2k

16 Application Tests
Runtime in seconds.
Application | SNB 1 Thread | SNB best (#threads) | SNB Speedup | Xeon Phi 1 Thread | Xeon Phi best (#threads) | Xeon Phi Speedup
imoose | … | … (16) | … | … | … (240) | …
FIRE | … | … (32) | … | … | … (234) | …
NestedCP Nested | … | … (32) | … | … | …58 (240) | 23.76
NestedCP Tasking | 47.34 | 2.43 (32) | 19.47 | 848.34 | 11.14 (240) | 76.16
NINA | 470.06 | 61.16 (16) | 7.68 | 1381.94 | 27.29 (177) | 50.…
- here, too, the speedup is good for all codes
- the serial runtime is longer on the Xeon Phi
- only NINA is faster on the Xeon Phi

17 NINA
- Software [2] for the solution of Neuromagnetic INverse large-scale problems
- Experiment: persons get different stimuli in the form of pictures
- The induced magnetic field is measured around the head
- NINA reconstructs the activity inside the brain
- Kernel portion of 90%
- Dense matrix-vector multiplications & vector operations
- Matrix fits into memory (128 x 512,000)
Source: [2]
[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6), 2008.

18 Conclusion and Outlook
- The memory system delivers good performance: it is about 2.5 times faster than on the SNB system.
- For kernels like the CG, the Xeon Phi also delivers good performance.
- For the NAS parallel benchmarks and all applications but the NINA code, the overall runtime was faster on the SNB system.
The Answer: Most tested unchanged OpenMP applications did not perform well on the Xeon Phi architecture, because of the relatively slow serial performance.
Questions or Comments?
