GPI Global Address Space Programming Interface



GPI Global Address Space Programming Interface
SEPARS Meeting, Stuttgart, December 2nd, 2010
Dr. Mirko Rahn, Fraunhofer ITWM, Competence Center for HPC and Visualization

GPI: Global address space programming interface

- PGAS: partitioned global address space
- read/write global data: one-sided, non-blocking, asynchronous, zero-copy
- send/recv messages, passive mode
- atomic counters
- barrier, allreduce
- queues, ranks
- fault tolerance
- daemon concept, environment checks
- auto detect and auto select network

GPI Example code: Alltoall in C

  #include <GPI.h>

  extern void dump (const char *, const int *);

  int main (int argc, char *argv[])
  {
    startgpi (argc, argv, "", (1UL << 30)); // 1 GiB DMA enabled memory per node

    const int iproc = getrankgpi ();
    const int nproc = getnodecountgpi ();

    int *mem = (int *) getdmamemptrgpi (); // begin of DMA enabled memory
    int *src = mem;                        // offset 0
    int *dst = mem + nproc;                // offset nproc * sizeof(int)

    for (int p = 0; p < nproc; ++p)
      {
        src[p] = iproc * nproc + p;

        const unsigned long locoff = p * sizeof (int);
        const unsigned long remoff = (nproc + iproc) * sizeof (int); // slot iproc in rank p's dst block

        writedmagpi (locoff, remoff, sizeof (int), p, GPIQueue0);
      }

    waitdmagpi (GPIQueue0);
    barriergpi ();

    dump ("src", src);
    dump ("dst", dst);

    shutdowngpi ();
  }

GPI Example code: Alltoall in C, compiled and run on four nodes

  > gcc alltoall.c -I $GPI_HOME/include -L $GPI_HOME/lib64 -lgpi -libverbs
  > cp a.out $GPI_HOME/bin
  > getnode -n 4
  > $GPI_HOME/bin/a.out
  [... collected and sorted output ...]
  src 0:  0  1  2  3    dst 0:  0  4  8 12
  src 1:  4  5  6  7    dst 1:  1  5  9 13
  src 2:  8  9 10 11    dst 2:  2  6 10 14
  src 3: 12 13 14 15    dst 3:  3  7 11 15

GPI Microbenchmark (Xeon 5460, ConnectX, mvapich2 1.2p1) [figure]

GPI Permute: 5 secs for reordering, from latency bound to compute bound again (2x2 Woodcrest, MHGS18-DDR) [figure]

GPI 2D-FFT: Perfect overlap [figure]

GPI BQCD: 30% better than MPI [figure]

GPI instead of MPI, MCTP instead of OpenMP

- Easy to use: single programming model
- Fast: wire speed, zero-copy
- Scalable: perfect overlap, no cycles spent on communication
- Robust: fault tolerance
- Extends to heterogeneous systems: read/write to accelerators like GPUs
- MCTP: full hardware awareness (NUMA barrier)

Coming next:
- collectives offload, relaxed collectives, sub-collectives
- auto thread-awareness for wait (done)
- queue-pair auto-recovery
- reliable HW multicast

GPIspace: Architecture overview [diagram: Coordination, Aggregator, Virtual Memory]

GPIspace: Future programming platform

- Separate coordination and programming (David Gelernter)
- Programming: algorithms in modules, deal with data. Tools: C++, Fortran, ...
- Coordination: load/save data, schedule algorithms to data, take care of system state and fault tolerance (everything automatic). Tools: a (graphical, functional, domain-specific) programming language based on modified Petri nets (type-safe, hierarchical, built-in expressions)
- Automatic parallelism, fault tolerance, throughput optimization
- Graphical programming, explorative development

Backup

MCTP and GPI

  onmousepressed:
    for (tid = 1; tid < mctpgetnumberofactivatedcores (); ++tid)
      mctpsuspend (tid);
    // do visualization

  onmousereleased:
    for (tid = 1; tid < mctpgetnumberofactivatedcores (); ++tid)
      mctpresume (tid);
    // continue with workflow

GPI Microbenchmark (2x2 Woodcrest, MHGS18-DDR) [figure]

MCTP: many core thread package

- platform independent (Linux/Windows, 32/64 bit)
- fast, very thin abstraction layer
- threadpool fabric plus synchronization/communication between pools
- full thread control (init, start, stop, wait, exit)
- full system control w.r.t. HW/core mappings (logical and physical, SSE level, NUMA layouts, cache affinity, ...)
- thread affinity, get/set mask
- atomic operations: fetchadd, cmpswap
- fast lock/unlock
- active/passive synchronization, mixable; NUMA barrier
- aligned thread-safe alloc/free, special pages for global variables
- high resolution timer

MCTP barrier synchronization [figure]

MCTP critical section latency: tile-based rendering (fps)

  MCTP 1.52:  79.02
  TBB 3:      74.46 (-5.7%)
  OMP parfor: 74.70 (-5.5%)