OpenMP* 4.0 for HPC in a Nutshell Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group (michael.klemm@intel.com) *Other brands and names are the property of their respective owners.
Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
OpenMP API

De-facto standard; OpenMP 4.0 has been out since July 2013
API for C/C++ and Fortran for shared-memory parallel programming
Based on directives (pragmas in C/C++)
Portable across vendors and platforms
Supports various types of parallelism
Evolution of Hardware (at Intel)

Processor                                                          Cores (up to)  Threads (up to)  SIMD width (bits)
Intel Xeon processor 64-bit                                              1               2               128
Intel Xeon processor 5100 series                                         2               2               128
Intel Xeon processor 5500 series                                         4               8               128
Intel Xeon processor 5600 series                                         6              12               128
Intel Xeon E5-2600 processor, code-named Sandy Bridge EP                 8              16               256
Intel Xeon E5-2600 v2 processor, code-named Ivy Bridge EP               12              24               256
Intel Xeon E5-2600 v3 processor, code-named Haswell EP                  18              36               256
Intel Xeon Phi coprocessor, code-named Knights Corner                   61             244               512
Intel Xeon Phi processor & coprocessor, code-named Knights Landing(1)  >61            >244               512

Product specification for launched and shipped products available on ark.intel.com. (1) Not launched or in planning.
Levels of Parallelism in OpenMP 4.0

Cluster:        group of computers communicating through a fast interconnect
Coprocessors/Accelerators: special compute devices attached to the local node through a special interconnect   -> OpenMP 4.0 for Devices
Node:           group of processors communicating through shared memory
Socket:         group of cores communicating through shared cache
Core:           group of functional units communicating through registers
Hyper-Threads:  group of thread contexts sharing functional units                                              -> OpenMP 4.0 Affinity
Superscalar:    group of instructions sharing functional units
Pipeline:       sequence of instructions sharing functional units
Vector:         single instruction using multiple functional units                                             -> OpenMP 4.0 SIMD
OpenMP Intro in Three Slides (1)

#pragma omp parallel              // fork
{
    #pragma omp for
    for (i = 0; i < n; i++) {     // distribute work
        ...
    }                             // implicit barrier

    #pragma omp for
    for (i = 0; i < N; i++) {     // distribute work
        ...
    }                             // implicit barrier
}                                 // join
OpenMP Intro in Three Slides (2)

double a[n];
double l, s = 0;

#pragma omp parallel for reduction(+:s) private(l) \
        schedule(static,4)
for (i = 0; i < n; i++) {
    l = log(a[i]);
    s += l;
}

// Each thread starts with a private copy s = 0, the iterations are
// distributed in chunks of 4, and the private partial sums are combined
// into the shared s at the end of the region.
OpenMP Intro in Three Slides (3)

#pragma omp parallel              // fork
#pragma omp single
for (e = l->first; e; e = e->next)
    #pragma omp task
    process(e);
                                  // join
OpenMP 4.0 SIMD
In a Time before OpenMP 4.0

Programmers had to rely on auto-vectorization or use vendor-specific extensions:
Programming models (e.g., Intel Cilk Plus)
Compiler pragmas (e.g., #pragma vector)
Low-level constructs (e.g., _mm_add_pd())

#pragma omp parallel for
#pragma vector always
#pragma ivdep
for (int i = 0; i < N; i++) {
    a[i] = b[i] + ...;
}

You need to trust the compiler to do the right thing.
OpenMP SIMD Loop Construct

Vectorize a loop nest:
Cut the loop into chunks that fit a SIMD vector register
No parallelization of the loop body

Syntax (C/C++):
#pragma omp simd [clause[[,] clause] ...]
for-loops

Syntax (Fortran):
!$omp simd [clause[[,] clause] ...]
do-loops
Example

float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // vectorize this loop
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
Data Sharing Clauses

private(var-list): uninitialized vectors for the variables in var-list
    (original x: 42 -> vector lanes: ?, ?, ?, ?)

firstprivate(var-list): initialized vectors for the variables in var-list
    (original x: 42 -> vector lanes: 42, 42, 42, 42)

reduction(op:var-list): create private variables for var-list and apply the reduction operator op at the end of the construct
    (vector lanes 12, 5, 8, 17 -> combined into x: 42)
SIMD Loop Clauses

safelen(length): maximum number of iterations that can run concurrently without breaking a dependence; in practice, the maximum vector length

linear(list[:linear-step]): the variable's value is in a linear relationship with the iteration number: x_i = x_orig + i * linear-step

aligned(list[:alignment]): specifies that the list items have a given alignment (default is the alignment for the architecture)

collapse(n): combine the iteration space of the n associated nested loops

A small sketch combining several of these clauses follows below.
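As a minimal sketch (not from the original slides) of how these clauses combine, the example below sums every other element of an array; the function name, the stride of 2, and the 64-byte alignment are illustrative assumptions.

/* Sketch only: idx advances by 2 per iteration (linear clause), the caller is
 * assumed to pass a 64-byte-aligned array (aligned clause), and the partial
 * sums of the SIMD lanes are combined at the end (reduction clause). */
float strided_sum(const float *restrict a, int n)
{
    float sum = 0.0f;
    int idx = 0;
    #pragma omp simd linear(idx:2) aligned(a:64) reduction(+:sum)
    for (int i = 0; i < n / 2; i++) {
        sum += a[idx];
        idx += 2;
    }
    return sum;
}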
Loop-Carried Dependencies

Dependencies may occur across loop iterations: a loop-carried dependency.
The following code contains such a dependency:

void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
    size_t i;
    for (i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}

Some iterations of the loop have to complete before the next iteration can run.
Simple trick: can you reverse the loop w/o getting wrong results?
Loop-Carried Dependencies

Can we parallelize or vectorize the loop?
Parallelization: no (except for very specific loop schedules)
Vectorization: yes (if the vector length is shorter than any distance of any dependency; here, iteration i depends on iteration i + 17, i.e., the dependence distance is 17)
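As a hedged sketch (not shown on the original slides): because the dependence distance in lcd_ex is 17 iterations, the loop could carry a safelen clause that keeps the number of concurrent iterations below that distance.

#include <stddef.h>

/* Sketch only: safelen(16) asserts that at most 16 iterations may run
 * concurrently, which stays below the dependence distance of 17, so the
 * compiler may vectorize with a vector length of up to 16. */
void lcd_ex_simd(float* a, float* b, size_t n, float c1, float c2) {
    #pragma omp simd safelen(16)
    for (size_t i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}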
SIMD Worksharing Construct

Parallelize and vectorize a loop nest:
Distribute a loop's iteration space across a thread team
Subdivide loop chunks to fit a SIMD vector register

Syntax (C/C++):
#pragma omp for simd [clause[[,] clause] ...]
for-loops

Syntax (Fortran):
!$omp do simd [clause[[,] clause] ...]
do-loops
Example

float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum)   // parallelize across the thread team, vectorize each thread's chunk
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
SIMD Function Vectorization

float min(float a, float b) {
    return a < b ? a : b;
}

float distsq(float x, float y) {
    return (x - y) * (x - y);
}

void example() {
    #pragma omp parallel for simd
    for (i = 0; i < n; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}
SIMD Function Vectorization

Declare one or more functions to be compiled for calls from a SIMD-parallel loop.

Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause] ...]
[#pragma omp declare simd [clause[[,] clause] ...]]
[...]
function-definition-or-declaration

Syntax (Fortran):
!$omp declare simd (proc-name-list)
SIMD Function Vectorization

#pragma omp declare simd
float min(float a, float b) {
    return a < b ? a : b;
}
// compiler-generated vector variant (illustration):
// vec8 min_v(vec8 a, vec8 b) { return a < b ? a : b; }

#pragma omp declare simd
float distsq(float x, float y) {
    return (x - y) * (x - y);
}
// compiler-generated vector variant (illustration):
// vec8 distsq_v(vec8 x, vec8 y) { return (x - y) * (x - y); }

void example() {
    #pragma omp parallel for simd
    for (i = 0; i < n; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}
// the loop body becomes: vd = min_v(distsq_v(va, vb), vc)
SIMD Function Vectorization

simdlen(length): generate a function variant that supports the given vector length
uniform(argument-list): the argument has a constant value between the iterations of a given loop
inbranch: the function is always called from inside an if statement
notinbranch: the function is never called from inside an if statement
linear(argument-list[:linear-step]), aligned(argument-list[:alignment]), reduction(operator:list): same as before

A usage sketch for these clauses follows below.
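A hedged sketch of how these clauses might be used (the function saxpy_elem and its arguments are illustrative, not from the slides): the pointer and scalar stay uniform across SIMD lanes while the index advances linearly, so the generated vector variant can use unit-stride loads and stores.

/* Illustrative only: a and x are identical (uniform) in all SIMD lanes,
 * i increases by 1 per lane (linear), so a[i] becomes a contiguous vector
 * access; notinbranch promises the call is never under a condition. */
#pragma omp declare simd uniform(a, x) linear(i:1) notinbranch
float saxpy_elem(float *a, float x, int i) {
    return x * a[i] + a[i];
}

void saxpy_like(float *a, float x, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        a[i] = saxpy_elem(a, x, i);
    }
}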
SIMD Constructs & Performance

(Chart: relative speed-up, higher is better, comparing ICC auto-vectorization against the ICC SIMD directive for Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise, and SGpp; the labeled speed-ups range from 1.47x to 4.34x.)

M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell. Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.
OpenMP 4.0 for Devices
Device Model

OpenMP 4.0 supports accelerators/coprocessors.
Device model:
One host
Multiple accelerators/coprocessors of the same kind

(Diagram: one host connected to several coprocessors.)
OpenMP 4.0 for Devices - Constructs

Transfer control [and data] from the host to the device.

Syntax (C/C++):
#pragma omp target [data] [clause[[,] clause] ...]
structured-block

Syntax (Fortran):
!$omp target [data] [clause[[,] clause] ...]
structured-block
!$omp end target [data]

Clauses:
device(scalar-integer-expression)
map([alloc | to | from | tofrom]: list)
if(scalar-expr)
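As a minimal, hedged sketch of these clauses in use (the array, scalar, and the size threshold are illustrative assumptions): offload a vector scaling to device 0, map the array in both directions, and fall back to host execution for small problem sizes via the if clause.

/* Sketch only: the if clause lets the runtime skip offloading when the
 * problem is too small to amortize the data transfer. */
void scale(float *a, float s, int n) {
    #pragma omp target device(0) map(tofrom: a[0:n]) if(n > 10000)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] *= s;
        }
    }
}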
Execution Model

The target construct transfers the control flow to the target device:
Transfer of control is sequential and synchronous
The transfer clauses control the direction of data flow
Array notation is used to describe array length

The target data construct creates a scoped device data environment:
Does not include a transfer of control
The transfer clauses control the direction of data flow
The device data environment is valid through the lifetime of the target data region
Use target update to request data transfers from within a target data region
Execution Model

#pragma omp target    \
        map(alloc:...) \
        map(to:...)    \
        map(from:...)
{
    ...
}

Data environment is lexically scoped:
The data environment is destroyed at the closing curly brace
Allocated buffers/data are automatically released

(Diagram, host and device: 1. alloc(...) creates buffers on the device, 2. to(...) copies data from the host to the device, 3. the structured block executes on the device, 4. from(...) copies results back to the host.)
Example

#pragma omp target data device(0) map(alloc:tmp[:n]) map(to:input[:n]) map(from:res)
{                                                      // host
    #pragma omp target device(0)                       // target
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);             // host

    #pragma omp target update device(0) to(input[:n])

    #pragma omp target device(0)                       // target
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < n; i++)
        res += final_computation(input[i], tmp[i], i);
}                                                      // host
teams Construct

Support multi-level parallel devices.

Syntax (C/C++):
#pragma omp teams [clause[[,] clause] ...]
structured-block

Syntax (Fortran):
!$omp teams [clause[[,] clause] ...]
structured-block

Clauses:
num_teams(integer-expression), num_threads(integer-expression)
default(shared | none)
private(list), firstprivate(list)
shared(list), reduction(operator : list)
Offloading SAXPY to a Coprocessor: SAXPY

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target data map(to:x[0:n])
    {
        #pragma omp target map(tofrom:y)
        #pragma omp teams num_teams(num_blocks) num_threads(nthreads)
        // all teams execute the same loop (no work distribution yet)
        for (int i = 0; i < n; i += num_blocks) {
            for (int j = i; j < i + num_blocks; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    free(x); free(y);
    return 0;
}
Offloading SAXPY to a Coprocessor: Coprocessor/Accelerator

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target data map(to:x[0:n])
    {
        #pragma omp target map(tofrom:y)
        #pragma omp teams num_teams(num_blocks) num_threads(bsize)
        // all teams execute the construct below
        #pragma omp distribute            // workshare across teams (w/o barrier)
        for (int i = 0; i < n; i += num_blocks) {
            #pragma omp parallel for      // workshare across threads (w/ barrier)
            for (int j = i; j < i + num_blocks; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    free(x); free(y);
    return 0;
}
Offloading SAXPY to a Coprocessor: Combined Constructs

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target map(to:x[0:n]) map(tofrom:y)
    {
        #pragma omp teams distribute parallel for \
                num_teams(num_blocks) num_threads(bsize)
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}
OpenMP 4.0 Affinity
NUMA is here to Stay

(Almost) all multi-socket compute servers are NUMA systems:
Different access latencies for different memory locations
Different bandwidth observed for different memory locations

Example: Intel Xeon E5-2600 v2 series processor
(Diagram: a two-socket Intel Xeon E5-2600 v2 system.)
Thread Affinity - Why It Matters

(Chart: STREAM Triad bandwidth in GB/sec, higher is better, on an Intel Xeon E5-2697 v2 system for 1 to 24 threads/cores, comparing four configurations: compact/par, scatter/par, compact/seq, scatter/seq.)
Thread Affinity - Processor Binding

The best binding strategy depends on the machine and the application.

Putting threads far apart, i.e., on different packages:
(May) improve the aggregated memory bandwidth
(May) improve the combined cache size
(May) decrease the performance of synchronization constructs

Putting threads close together, i.e., on two adjacent cores that possibly share a cache:
(May) improve the performance of synchronization constructs
(May) decrease the available memory bandwidth and cache size (per thread)
Thread Affinity in OpenMP* 4.0

OpenMP 4.0 introduces the concept of places:
A set of threads running on one or more processors
Can be defined by the user
Pre-defined places are available:
    threads: one place per hyper-thread
    cores:   one place per physical core
    sockets: one place per processor package

and affinity policies:
    spread: spread OpenMP threads evenly among the places
    close:  pack OpenMP threads near the master thread
    master: collocate OpenMP threads with the master thread

and means to control these settings (see the sketch below):
    Environment variables OMP_PLACES and OMP_PROC_BIND
    Clause proc_bind for parallel regions
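A minimal, hedged usage sketch (the place list and thread count are illustrative assumptions, not from the slides): bind the threads of a parallel region with the proc_bind clause, supplying the place list at launch time via OMP_PLACES.

#include <stdio.h>
#include <omp.h>

/* Launch with, e.g.:  OMP_PLACES=cores OMP_NUM_THREADS=4 ./a.out
 * The OMP_PROC_BIND environment variable could be used instead of the
 * proc_bind clause shown here. */
int main(void) {
    #pragma omp parallel proc_bind(spread)
    {
        // Each thread is pinned within its place partition for this region.
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}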
Thread Affinity - Example

Example (Intel Xeon Phi coprocessor): distribute the outer region, keep the inner regions close.

OMP_PLACES=cores(8); OMP_NUM_THREADS=4,4

#pragma omp parallel proc_bind(spread)
#pragma omp parallel proc_bind(close)

(Diagram: 8 places p0..p7; the 4 outer threads are spread evenly across the places, and each inner team of 4 threads is packed close to its outer thread.)
We're Almost Through

This presentation focused on HPC stuff, but OpenMP 4.0 has more to offer!
Improved Fortran 2003 support
User-defined reductions
Task dependencies
Cancellation

We can chat about these features during the Q&A (if you want to).
The Last Slide

OpenMP 4.0 is a major leap for OpenMP:
New kinds of parallelism have been introduced
Support for heterogeneous systems with coprocessor devices

Support in Intel Composer XE 2013 SP1:
SIMD constructs (except combined constructs)
OpenMP for devices (except combined constructs)
OpenMP affinity