Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
6th International Workshop on Parallel Programming Models and Systems Software for High-End Computing, held in conjunction with ICPP
October 1, 2013, Lyon, France
Vienne, Ramachandran, Van der Wijngaart, Koesterke, Sharapov
Increasingly Complex HPC World
- Mainframes: HPC languages (Fortran, then also C), vectorization (no caches!)
- Clusters: MPI
- SMP machines (shared memory): OpenMP threads
- Clusters with multi-core nodes: MPI + OpenMP + vectorization + NUMA
- Clusters with accelerators: fast compute, high memory bandwidth, limited memory; data transfer through the PCIe bus → double buffering
  - GPUs: CUDA/OpenCL (partially mastered)
  - MICs: standard languages, OpenMP
How do we evaluate new technologies? Evaluation of Phis through standard benchmarks.
Programming and optimizing for MIC: How hard can it be? What performance can I expect?
The NAS Parallel Benchmarks (NPB) give very valuable insight.
Programming and Optimizing for the MIC Architecture
- Intel's MIC is based on x86 technology: x86 cores with caches and cache coherency, SIMD instruction set
- Programming for MIC is similar to programming for CPUs
  - Familiar languages: C/C++ and Fortran
  - Familiar parallel programming models: OpenMP & MPI
  - MPI on the host and on the coprocessor
  - Any code can run on MIC, not just kernels (offload sketch below)
- Optimizing for MIC is similar to optimizing for CPUs: make use of existing knowledge!
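As a rough illustration of the offload model mentioned above, a minimal sketch of a loop shipped to the coprocessor with the Intel compiler's offload pragma and parallelized with OpenMP; the array size and function name are made up for the example, and the identical OpenMP loop runs on the host if the offload pragma is ignored.

    #include <omp.h>

    #define N 1000000

    /* Hypothetical example: scale a vector on the coprocessor.
       Built with an offload-capable Intel compiler, e.g. icc -openmp -std=c99. */
    void scale(double *a, double s)
    {
        /* The offload pragma copies the data to the MIC, runs the loop there,
           and copies the result back. */
        #pragma offload target(mic) inout(a : length(N))
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] *= s;
    }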
MIC Architecture
- Many cores on the die, each with L1 and L2 cache; bidirectional ring network; memory and PCIe connection
  [Figure: Phi architecture block diagram]
- Xeon Phi: ~60 cores, a few GB of GDDR5 RAM, 512-bit wide SIMD registers, L1/L2 caches, multiple threads (up to 4) per core
- Xeon Phi in Stampede: SE10P, 61 cores, 8 GB of GDDR5 memory
- MIC is similar to Xeon: x86 cores with caches, cache coherency protocol, supports traditional threads
MIC vs. GPU Differences (each item: MIC vs. GPU)
- Architecture: x86 vs. streaming processors
- HPC programming model: coherent caches, extensions to C/C++/Fortran, OpenCL support vs. shared memory and caches, CUDA/OpenCL
- Threading/MPI: OpenMP and multithreading vs. threads in hardware
- Programming details: MPI on host and/or MIC, offloaded regions vs. MPI on host only, kernels
- Support for any code (serial, scripting, etc.): yes vs. no
- Native mode: any code may be offloaded as a whole to the coprocessor
Adapting Scientific Code to MIC
Today, most scientific code targets clusters:
- Languages: C/C++ and/or Fortran; communication: MPI
- May be thread-based (hybrid code: MPI & OpenMP)
- May use external libraries (MKL, FFTW, etc.)
With MIC on Stampede:
- Languages: C/C++ and/or Fortran; communication: MPI
- May run an MPI task on the MIC, or may offload sections of the code to the MIC
- Will be thread-based (hybrid code: MPI & OpenMP; see the sketch below)
- May use external libraries (MKL) that automatically use the MIC
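A minimal hybrid MPI + OpenMP skeleton of the kind described above; the same source can run as host ranks or as native ranks on the coprocessor (with the Intel compiler, native binaries are built with -mmic). The body is a placeholder, not taken from NPB.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request threaded MPI so OpenMP regions can coexist with MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each rank (host or coprocessor) spawns its own OpenMP threads. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d using %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }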
Stampede Cluster at the Texas Advanced Computing Center
- What do we do with 6800 MIC cards?
- How do we program and optimize for MIC?
NAS Parallel Benchmarks
- Suite of parallel workloads testing the performance of a variety of components
- Computational kernels:
  - IS: integer sorting
  - FT: Fourier transform
  - CG: conjugate gradient
  - MG: multi-grid
- Mini applications:
  - BT & SP: factorization techniques
  - LU: LU decomposition
- Variety of sizes: classes A, B, and C used
- Parallelized with OpenMP & MPI
Evaluation: Timings → Flops, plus Hardware Counters
- Vectorization intensity: ratio of active vector lanes to number of vector instructions, VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED (vector width: 8/16 for DP/SP)
- Cache accesses: DATA_READ_OR_WRITE
- Cache misses: DATA_READ_MISS_OR_WRITE
- TLB misses: DATA_PAGE_WALK, LONG_DATA_PAGE_WALK
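A small sketch of how such derived metrics can be computed once the raw counter values have been collected with a sampling tool; the struct, field names, and helper below are hypothetical, only the two ratios follow the counter definitions above.

    #include <stdio.h>

    /* Hypothetical raw counter totals gathered for one run. */
    struct mic_counters {
        double vpu_elements_active;       /* VPU_ELEMENTS_ACTIVE */
        double vpu_instructions_executed; /* VPU_INSTRUCTIONS_EXECUTED */
        double data_read_or_write;        /* DATA_READ_OR_WRITE */
        double data_read_miss_or_write;   /* DATA_READ_MISS_OR_WRITE */
    };

    static void report(const struct mic_counters *c)
    {
        /* Average number of active vector lanes per vector instruction;
           the ideal value is 8 in double precision, 16 in single precision. */
        double vec_intensity =
            c->vpu_elements_active / c->vpu_instructions_executed;

        /* Fraction of L1 data accesses that miss. */
        double l1_miss_rate =
            c->data_read_miss_or_write / c->data_read_or_write;

        printf("vectorization intensity: %.2f\n", vec_intensity);
        printf("L1 miss rate:            %.2f%%\n", 100.0 * l1_miss_rate);
    }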
Affinity and Scaling: IS
[Figure: IS, class C: millions of keys ranked vs. number of OpenMP threads (up to 244), for balanced, scatter, and compact affinity]
- Balanced affinity provides the best performance
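The three settings compared above are the Intel OpenMP placement policies: compact packs threads onto as few cores as possible, scatter distributes them round-robin across cores, and balanced distributes them across cores while keeping consecutive thread IDs on the same core. The small, hypothetical sketch below only reports the placement that results on a native run; the environment values in the comment are illustrative.

    /* Placement is typically set in the environment before launching, e.g.
       OMP_NUM_THREADS=244 and KMP_AFFINITY=balanced|scatter|compact
       (illustrative values). This sketch reports where each thread landed. */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* sched_getcpu() returns the logical CPU this thread runs on. */
            printf("thread %3d of %3d on logical CPU %3d\n",
                   omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
        }
        return 0;
    }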
Performance: BT
[Figure: BT, three problem sizes (classes A, B, C): total MFLOPS vs. number of OpenMP threads (up to 244)]
- Best number of threads per core determined by trial
- Performance jumps at natural thread-count boundaries and in between
Performance: CG
[Figure: CG, three problem sizes (classes A, B, C): total MFLOPS vs. number of OpenMP threads (up to 244)]
- Best performance at the maximum number of threads
Xeon vs. Phi: 16 vs. 61 cores
- 16 Sandy Bridge cores are up to 2.5x faster
- Best Phi performance: roughly break-even

Class C    SB: time [s]   SB: threads   Phi: time [s]   Phi: threads
BT         69             16            100             162
CG         22             32            57              241
FT         16             32            42              105
IS         1              32            2.4             177
LU         49             32            110             162
MG         7              16            6               172
SP         115            16            135             162
Analysis: Loop Iterations
- Phi needs more loop iterations (concurrency, n) for good performance: either n is a multiple of 61 or 122, or n is much larger than 122
- Technique: loop collapse (sketch below)
  - BT, class A: 2x performance increase
  - BT, other classes: no gain; the larger sizes already provide enough concurrency
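A generic sketch of the loop-collapse technique (not the actual BT loops): OpenMP's collapse clause merges the two outer loops into a single iteration space of nz*ny iterations, giving the scheduler far more work units to spread over the 61-core, up-to-244-thread coprocessor. Bounds and body are placeholders.

    #include <omp.h>

    /* Hypothetical triple loop nest; ny or nz alone may be too small
       to keep ~244 threads busy. */
    void relax(double *u, int nx, int ny, int nz)
    {
        /* collapse(2) turns the k/j nest into one parallel loop of
           (nz-2)*(ny-2) iterations, increasing available concurrency. */
        #pragma omp parallel for collapse(2)
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++)
                    u[(k * ny + j) * nx + i] *= 0.5;
    }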
Analysis: Division and Square Root
- Divisions and square roots are particularly slow on Phi
- Division consumes 25% of the execution time in BT; square root is a similar problem in SP
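A common mitigation, shown here as a generic sketch rather than the benchmark's code, is to hoist a repeated division out of the inner loop by computing the reciprocal once and multiplying.

    /* Before: the compiler may emit a slow division for every iteration. */
    void scale_div(double *q, double rho, int n)
    {
        for (int i = 0; i < n; i++)
            q[i] = q[i] / rho;
    }

    /* After: one division total, cheap multiplications in the loop.
       (Results may differ in the last bit, which is why compilers only do
       this automatically under relaxed floating-point options.) */
    void scale_mul(double *q, double rho, int n)
    {
        const double rinv = 1.0 / rho;
        for (int i = 0; i < n; i++)
            q[i] = q[i] * rinv;
    }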
Analysis: Strided Access
- Strided access in BT (stride = 5)
- Rearranging the loop iterations did not increase speed, because other loops then changed from stride-1 to stride-5 access
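The stride of 5 comes from BT carrying five solution components per grid point; the schematic C sketch below shows the two access patterns (the layout and names are illustrative, not the benchmark source). Sweeping grid points for a fixed component is stride-5, and interchanging loops only moves the stride-5 pattern onto a different array.

    #define NC 5   /* five solution components per grid point, as in BT */

    /* u is laid out as u[point][component] (array-of-structures style). */
    void update_component(double *u, const double *rhs, int npoints, int c)
    {
        /* Fixed component c, sweeping grid points: consecutive iterations
           touch u[c], u[c+5], u[c+10], ... i.e. stride-5 access. */
        for (int p = 0; p < npoints; p++)
            u[p * NC + c] += rhs[p];
    }

    void update_point(double *u, const double *rhs, int p)
    {
        /* Fixed grid point p, sweeping components: stride-1 access, but only
           5 consecutive elements, too short for a 512-bit vector of 8 doubles. */
        for (int c = 0; c < NC; c++)
            u[p * NC + c] += rhs[c];
    }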
Analysis: Strided Access and Vectorization
- Non-unit stride access in LU
- The loop is vectorized by the auto-vectorizer, yet the non-vectorized version (-no-vec) is faster
- Reason: gather/scatter loads and stores, and reduced hardware prefetching
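When vectorization hurts, as in this LU loop, the Intel compiler of that era lets you turn it off either globally with -no-vec or per loop with #pragma novector; a hedged sketch with a placeholder loop:

    /* Placeholder strided loop; with a large stride the vectorizer must use
       gather/scatter, which can be slower than scalar code on the Phi. */
    void strided_update(double *a, const double *b, int n, int stride)
    {
        /* Intel compiler: keep this particular loop scalar even at -O2/-O3.
           (Alternatively, compile the whole file with -no-vec.) */
        #pragma novector
        for (int i = 0; i < n; i++)
            a[i * stride] += b[i * stride];
    }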
Analysis: Gather/Scatter vs. Masked Load/Store
- For low strides (stride-2), access the data through two masked loads/stores instead of a gather/scatter
- A test code with hand-coded intrinsics showed a 1.8x speed-up for the vectorized version
Analysis: Vectorization (General)
- Some benchmarks (particularly MG) showed a good performance gain from vectorization; MG is faster on Phi than on Xeon
- Most benchmarks performed better without vectorization; for these, Phi is slower than Xeon
Analysis: Indirect Memory Access
- Main-memory latency (GDDR5) is higher on Phi than on Xeon (DDR3)
- This decreases performance for CG and IS, which rely on indirect memory access
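The kind of access pattern meant here, sketched generically (CG's core operation is a sparse matrix-vector product; the arrays below are placeholders): the index vector makes every load of x data-dependent, so it cannot be prefetched effectively and each miss pays the full GDDR5 latency.

    /* Generic CSR-style sparse matrix-vector product: y = A*x.
       x[col[k]] is an indirect, essentially random access. */
    void spmv(int nrows, const int *rowstart, const int *col,
              const double *val, const double *x, double *y)
    {
        for (int r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (int k = rowstart[r]; k < rowstart[r + 1]; k++)
                sum += val[k] * x[col[k]];   /* latency-bound gather */
            y[r] = sum;
        }
    }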
Summary
- NPB gives very valuable insight
- For code to perform well on Phi:
  - More concurrency required (more loop iterations)
  - Vectorization is absolutely crucial, but not all vectorized code performs well
  - Stride-1 data access; no indirect data access
  - Manual prefetching may be required
  - Avoid slow operations (div and sqrt)
- Only certain codes will benefit from this generation of Xeon Phi
Thank You! Questions?