The affinity model. Gabriele Fatigati. CINECA Supercomputing group.


The affinity model. Gabriele Fatigati, CINECA Supercomputing group. g.fatigati@cineca.it

Outline
1. NUMA architecture
2. Affinity: what is it?
3. Affinity: how to implement it
4. HWLOC

NUMA architecture. Definition: NUMA (Non-Uniform Memory Access): the memory is divided into local (faster) and remote (slower) areas. Cores 0-5 access the socket 0 memory area faster and the socket 1 memory area more slowly; cores 6-11 access the socket 1 memory area faster and the socket 0 memory area more slowly.

Definition: physical proximity: two cores are physically near when there is a physical interconnection between them (by bus, cache or other). Two threads are near when they are placed on two near cores.

Affinity: what is it? Definition: affinity is a modification of the native OS queue scheduling algorithm (ref: Wikipedia). Execution bind: indicates a preferred core where the process/thread will run. Memory bind: indicates a preferred memory area to which memory pages will be bound (local areas on a NUMA machine). When a process/thread migrates from one core to another, cache and memory locality is lost.

Usually, HPC clusters do not have an HPC-oriented kernel; they run a general purpose kernel like Linux. The kernel moves processes among cores to optimize their usage, assuming that many processes besides the HPC ones are running on the node. In a non-HPC context this is fine, but on an HPC cluster it can limit performance.

On a NUMA architecture the same problem can also arise with remote memory: a migrated process may end up far from the memory it allocated.

numactl: controls the NUMA policy for processes or shared memory; it runs processes with a specific NUMA scheduling or memory placement policy.

--hardware: show an inventory of the available NUMA nodes on the system:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 17837 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32767 MB
node 1 free: 1542 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

--show: show the NUMA policy settings of the current process:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1
nodebind: 0 1
membind: 0 1

This machine has 16 cores arranged into 2 groups (nodes); each group of 8 cores has access to a local 32 GB memory region (64 GB in total).

--cpunodebind=nodes: only execute the process on the CPUs of the given nodes.

numactl --cpunodebind=0 --membind=0,1: run the process on node 0, with memory allocated on nodes 0 and 1.

numactl --physcpubind=2: run the process only on the physical CPU with id 2, and allocate its memory on the same NUMA node where the process is running.

--touch: touch pages to enforce the policy early. The default is not to touch them: the policy is applied only when an application maps and accesses a page. When a process/thread allocates memory, the memory is not actually allocated immediately; the memory pages are allocated on first access (usually during initialization):

int *array = (int*)malloc(100 * sizeof(int)); // not really allocated yet
for (int i = 0; i < 100; i++)
    array[i] = 0; // now it is allocated

Thread affinity and OpenMPI. It is possible to bind processes using the MPI environment; for example, with OpenMPI: -bind-to-core binds processes to cores, -bind-to-socket binds processes to processor sockets. The default is not to bind processes. Binding policies are relative to a node.

Thread affinity and Intel MPI. Intel's OpenMP runtime can bind threads to physical processing units by using the KMP_AFFINITY environment variable: KMP_AFFINITY=mode

Compact mode: KMP_AFFINITY=compact. Specifying compact assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed. If two threads share the same data in a cache, placing them near each other can be an advantage.

Scatter mode: KMP_AFFINITY=scatter. Specifying scatter distributes the threads as evenly as possible across the entire system; scatter is the opposite of compact. Useful when cache contention between threads is high.

Explicit mode. Instead of letting the library detect the hardware topology and automatically assign OpenMP threads to processing elements, the user may explicitly specify the assignment with a list of operating system (OS) processor (proc) IDs. This, however, requires knowledge of which processing elements the OS proc IDs represent.

KMP_AFFINITY="granularity=fine,proclist=[3,0,{1,2},{1,2}],explicit"

Thread affinity and the GCC compiler. With GCC, thread binding is managed through the GOMP_CPU_AFFINITY environment variable (requires OpenMP 3.0 or later):

GOMP_CPU_AFFINITY="0 3 1-2 4-15:2"

This binds the initial thread to CPU 0, the second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth to CPU 6, and so on through CPUs 8, 10, 12 and 14 (CPUs 4-15 with stride 2).

Thread affinity and Linux. sched_setaffinity is a function that sets the CPU affinity of a process, i.e. the set of CPUs it is allowed to run on, through a CPU mask:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask); // set core 2: mask = 0100 (binary)
    sched_setaffinity(getpid(), sizeof(mask), &mask); // process will run on CPU 2
    return 0;
}

OpenMP standard. Starting from version 3.1, OpenMP provides a binding facility through the OMP_PROC_BIND environment variable. If true, the execution environment should not move OpenMP threads between processors; the exact placement is implementation defined. If false, the execution environment may move OpenMP threads between processors.

Thread affinity, suggested option: HWLOC. HWLOC stands for HardWare LOCality, a library that reports detailed information about the machine: NUMA nodes, shared caches, processor sockets, processor cores and processing units. This information is very useful for binding processes/threads. It is portable across many OSes, including Linux, AIX, Darwin, Solaris, Windows and FreeBSD.

PU (Processing Unit). In hwloc, a PU is the smallest processing element that can be represented by an hwloc object: it may be a single-core processor, a core of a multicore processor, or a hardware thread of an SMT processor.

#include <hwloc.h>

hwloc_topology_t topology;
hwloc_obj_t core;
hwloc_cpuset_t set;
int pu_id = 4; // process will be bound to PU 4

hwloc_topology_init(&topology);
hwloc_topology_load(topology);

core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, pu_id);
set = hwloc_bitmap_dup(core->cpuset);
hwloc_bitmap_singlify(set); // keep a single index among those in the set
hwloc_set_cpubind(topology, set, 0);

// some work

hwloc_bitmap_free(set);
hwloc_topology_destroy(topology);

hwloc_set_cpubind. The most important function, used to bind a process/thread to a specific PU. Flags:

HWLOC_CPUBIND_PROCESS: bind all threads of the current (possibly multithreaded) process.
HWLOC_CPUBIND_THREAD: bind the current thread of the current process.
HWLOC_CPUBIND_STRICT: the process will never execute on CPUs other than the designated ones.
HWLOC_CPUBIND_NOMEMBIND: avoid any effect on memory binding.

Memory binding with hwloc:

hwloc_set_membind(hwloc_topology_t topology,
                  hwloc_const_cpuset_t cpuset,
                  hwloc_membind_policy_t policy,
                  int flags)

Set the default memory binding policy of the current process or thread, so that allocations prefer the NUMA node(s) near the specified physical cpuset.

Memory binding trap:

/* imagine two threads: the first one bound on node 0, the second one bound on node 1 */
int *array = (int*)malloc(1000 * sizeof(int));
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    if (tid == 1) {
        for (int i = 0; i < 1000; i++)
            array[i] = 0; // now the pages are allocated
    }
}

The array ends up entirely bound on node 1, since thread 1, which touches it first, is bound there. If thread 0 then wants to use the array, the memory pages are remote, so access is slower.