Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller
schmidl@rz.rwth-aachen.de
Rechen- und Kommunikationszentrum (RZ)
The Question: If we look at the Xeon Phi as a standalone system, can OpenMP parallel programs, optimized for multicore machines, run efficiently on the Intel Xeon Phi Coprocessor without special tuning?
Agenda
- The Intel Xeon Phi Architecture
- Kernel Benchmarks
- NAS Parallel Benchmarks
- Application Tests
- Conclusion and Outlook
The Intel Xeon Phi Architecture
The Intel Xeon Phi Architecture

Intel Xeon Phi
- PCIe extension card
- 60 in-order cores
- 4-way Hyperthreading
- 512-bit vector registers
- 1 GHz clock rate
- ring network

Intel Sandy Bridge System
- 2 x 8 cores
- complex out-of-order cores
- 2 GHz clock rate
- 2-way Hyperthreading
- QPI interconnect

Both architectures have roughly the same price, size and power consumption.
Kernel Benchmarks
Memory Bandwidth
- STREAM triad benchmark (a[i] = b[i] + x * c[i], see the sketch below)
- 2 GB memory footprint
- spread thread pinning
- Intel Compiler 13.0, -mmic flag to cross-compile
- Sandy Bridge System (SNB): 32 GB DDR3 RAM
- Intel Xeon Phi: 8 GB GDDR5 RAM

[Figure: memory bandwidth in GB/s over the number of threads (1 to 256) for the SNB system and the Intel Xeon Phi.]
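As a rough illustration of the measurement, a minimal OpenMP triad sketch follows. It is not the original STREAM code: the array length, the single timed repetition and the first-touch initialization are simplifying assumptions.

```c
/* Minimal sketch of a STREAM-triad-style measurement (a[i] = b[i] + x*c[i]).
   Not the original STREAM benchmark: array size, repetitions and timing are
   simplified; three double arrays of length N give roughly a 2 GB footprint. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 85000000L

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double x = 3.0;

    /* Parallel first-touch initialization so memory pages end up
       distributed across the threads. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + x * c[i];                 /* the triad kernel */
    t = omp_get_wtime() - t;

    /* three arrays of 8 bytes each are touched per iteration */
    printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

For a native run on the coprocessor, such a code would be cross-compiled with the -mmic flag and executed on the card; the spread thread pinning mentioned above can be requested through the affinity environment variables (e.g. KMP_AFFINITY or OMP_PROC_BIND).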
Memory Latency
- single-threaded pointer chasing with random stride (see the sketch below)
- small stride: larger than a cache line (if possible)
- large stride: larger than a memory page (if possible)

[Figure: latency in ns over the memory footprint (1 B to 4 GB) for the small and large stride, on the Intel Xeon Phi (level 1 and level 2 cache regions visible) and on the SNB system (level 1, level 2 and level 3 cache regions visible).]
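The core of such a latency measurement can be sketched as a pointer chase over a shuffled array, as below. The fixed footprint and the fully random order are simplifications; the stride control (cache-line vs. page granularity) described above is omitted.

```c
/* Minimal sketch of a single-threaded pointer-chasing latency measurement.
   Each load depends on the previous one, so the hardware prefetcher cannot
   hide the latency. Footprint and step count are illustrative values. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n     = 1UL << 24;           /* 16M pointers = 128 MB footprint */
    const size_t steps = 1UL << 24;
    void **chain = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));

    /* Build a random permutation (Fisher-Yates shuffle) ... */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    /* ... and link the elements into one random cycle. */
    for (size_t i = 0; i < n; i++)
        chain[perm[i]] = &chain[perm[(i + 1) % n]];

    void **p = &chain[0];
    double t = omp_get_wtime();
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                      /* dependent loads, one per step */
    t = omp_get_wtime() - t;

    /* printing p keeps the compiler from removing the chase */
    printf("average load latency: %.1f ns (%p)\n", t / steps * 1e9, (void *)p);
    free(chain); free(perm);
    return 0;
}
```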
EPCC Microbenchmarks
- kernel benchmark to measure the overhead of OpenMP constructs (measurement principle sketched below)
- extended to measure the overhead of tasking as well

Worksharing overhead per construct in microseconds:

Xeon Phi:
  # Threads   parallel for   barrier   reduction
  2           4.32           1.29      4.29
  16          13.81          5.83      21.61
  30          15.85          8.21      24.80
  240         27.56          13.37     48.86

SNB:
  # Threads   parallel for   barrier   reduction
  2           0.85           0.56      1.65
  16          3.47           2.05      5.83
  32          24.36          31.78     58.90

Tasking overhead per construct in microseconds:

Xeon Phi:
  # Threads   single producer   parallel producer
  2           4.18              0.92
  16          81.18             1.67
  30          165.50            1.78
  240         1355.90           8.39

SNB:
  # Threads   single producer   parallel producer
  2           0.80              0.18
  16          63.25             0.75
  32          146.41            4.11
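The measurement idea behind such microbenchmarks can be sketched as follows: time a fixed amount of work sequentially as a reference, then time the same work wrapped in the construct under test, and attribute the difference to the construct. The delay routine, the repetition count and the missing statistics below are simplifications compared to the real EPCC suite.

```c
/* Minimal sketch of an EPCC-style overhead measurement for "parallel for".
   The real EPCC microbenchmarks calibrate the delay and average over many
   outer repetitions; this version only illustrates the principle. */
#include <omp.h>
#include <stdio.h>

#define REPS 1000
#define WORK 1000

static void delay(int n) {                    /* small, fixed amount of work */
    volatile double a = 0.0;
    for (int i = 0; i < n; i++) a += i * 0.5;
}

int main(void) {
    const int nthreads = omp_get_max_threads();

    /* Reference: the total work of one repetition, executed sequentially. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < nthreads; i++)
            delay(WORK);
    double ref = omp_get_wtime() - t0;

    /* Measurement: one "parallel for" per repetition, one iteration per thread. */
    double t1 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nthreads; i++)
            delay(WORK);
    }
    double par = omp_get_wtime() - t1;

    /* Overhead per construct = measured time minus the ideal time ref/nthreads. */
    printf("parallel for overhead: %.2f us\n",
           (par - ref / nthreads) * 1e6 / REPS);
    return 0;
}
```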
Conjugate Gradient Method
- sparse matrix-vector multiplication in a CG solver
- ~3.5 GB memory footprint
- different parallel versions: dynamic worksharing, precalculated worksharing, task parallel version (the dynamic variant is sketched below)

[Figure: GFLOPS over the number of threads (1 to 256) for the pre-calculated, task and dynamic versions on the Xeon Phi and on the SNB system.]
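A sketch of the sparse matrix-vector product with OpenMP worksharing and a dynamic schedule is given below; the CRS data structure, the chunk size and the function name are illustrative assumptions, not the code used in the talk.

```c
/* Minimal sketch of a CRS sparse matrix-vector product y = A*x with OpenMP
   worksharing and a dynamic schedule. Structure and chunk size are assumptions. */
#include <omp.h>

typedef struct {
    int     nrows;
    int    *row_ptr;   /* nrows+1 entries: start of each row in col/val */
    int    *col;       /* column index of each non-zero */
    double *val;       /* value of each non-zero */
} crs_matrix;

void spmv(const crs_matrix *A, const double *x, double *y) {
    #pragma omp parallel for schedule(dynamic, 64)   /* hand out 64 rows at a time */
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}
```

The precalculated variant would presumably replace the dynamic schedule by per-thread row ranges computed once (balanced by the number of non-zeros), and the task version would create one task per block of rows.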
NAS Parallel Benchmarks
NAS Parallel Benchmarks
- standard benchmarks for parallel computing, problem size C
- good speedup on both systems
- serial performance on the Xeon Phi is low
- overall, the parallel version is slower on the Xeon Phi system

Runtime in seconds:

SNB:
  Benchmark   1 Thread   32 Threads   Speedup
  IS          23.12      1.38         16.75
  EP          186.81     8.11         23.03
  MG          64.04      8.03         7.98
  FT          306.11     19.19        15.95
  BT          1241.63    82.61        15.03
  SP          826.25     137.69       6
  LU          1109.76    62.23        17.83

Intel Xeon Phi:
  Benchmark   1 Thread   240 Threads   Speedup
  IS          192.49     2.46          78.25
  EP          1518.42    13.34         113.82
  MG          498.94     9.63          51.81
  FT          2393.01    53.97         44.34
  BT          9433.52    132.29        71.31
  SP          12264.29   164.59        74.51
  LU          9835.09    163.33        60.22
Application Tests
Application Tests

  Application   Area                                                  Parallelization   Language   Size
  imoose        finite elements package                               worksharing       C++        ~300k lines
  FIRE          image recognition                                     tasks             C++        ~35k lines
  NestedCP      extracting critical points in unsteady flow fields    nested parallel   C++        ~2k lines
  NestedCP      extracting critical points in unsteady flow fields    tasks             C++        ~2k lines
  NINA          Neuromagnetic INverse large-scale problems            worksharing       C          ~2k lines
Application Tests

Runtime in seconds (best runtime with the number of threads used in parentheses):

SNB:
  Application        1 Thread   best (#threads)   Speedup
  imoose             104.68     12.2 (16)         8.58
  FIRE               284.6      16.68 (32)        17.06
  NestedCP Nested    46.99      3.21 (32)         14.62
  NestedCP Tasking   47.34      2.43 (32)         19.47
  NINA               470.06     61.16 (16)        7.68

Intel Xeon Phi:
  Application        1 Thread   best (#threads)   Speedup
  imoose             1243.54    15.59 (240)       79.74
  FIRE               2672.71    38.25 (234)       98.02
  NestedCP Nested    845.14     35.58 (240)       23.76
  NestedCP Tasking   848.34     11.14 (240)       76.16
  NINA               1381.94    27.29 (177)       50.64

- also here the speedup for all codes is good
- the serial runtime is higher on the Xeon Phi
- only NINA is faster on the Xeon Phi
NINA
- software for the solution of Neuromagnetic INverse large-scale problems [2]
- experiment: persons get different stimuli in the form of pictures; the induced magnetic field is measured around the head; NINA reconstructs the activity inside the brain
- kernel portion of 90%: dense matrix-vector multiplications & vector operations (see the sketch below)
- the matrix fits into memory (128 x 512,000)

[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6):2905-2921, 2008.
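A minimal sketch of such a dense matrix-vector product with OpenMP worksharing is given below; the row-major layout and the interface are assumptions for illustration, not the actual NINA code.

```c
/* Minimal sketch of a dense matrix-vector product y = A*x with OpenMP,
   A stored row-major with m rows and n columns. Interface and layout are
   assumptions, not the actual NINA kernel. */
#include <omp.h>

void dense_matvec(int m, int n, const double *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(long)i * n + j] * x[j];   /* dot product of row i with x */
        y[i] = sum;
    }
}
```

With only 128 rows but up to 240 threads, parallelizing over the rows alone cannot keep the coprocessor busy, so the long dimension (512,000 columns) would also have to be split, e.g. by blocking the columns and reducing partial sums.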
Conclusion and Outlook
- The memory system delivers good performance: about 2.5x faster than the SNB system.
- For kernels like the CG, the Xeon Phi also delivers good performance.
- For the NAS Parallel Benchmarks and all applications but the NINA code, the overall runtime was lower on the SNB system.

The Answer: Most of the tested unchanged OpenMP applications did not perform well on the Xeon Phi architecture, because of the relatively slow serial performance.
Questions or Comments?