Communication in HPC Clusters: Communication/Computation Overlap in MPI
W. Rehm and T. Höfler, Department of Computer Science, TU Chemnitz
http://www.tu-chemnitz.de/informatik/ra
11.11.2005
Optimize the run time of (highly) parallel applications. Use powerful parallel libraries that support aggressive overlap of communication and computation. A typical library (e.g. ScaLAPACK) is based on MPI/BLACS.
Overlap is enabled by non-blocking MPI operations. Until now, only non-blocking point-to-point operations are available. (Non-blocking) MPI collectives could take advantage of special hardware support of the underlying network.
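For readers who have not used them: the basic non-blocking point-to-point pattern looks roughly as follows. This is a minimal illustrative sketch, not code from the talk; the buffer size, ranks and tag are made up, and it assumes at least two MPI processes.

  program isend_overlap
    use mpi
    implicit none
    integer, parameter :: n = 1024
    double precision   :: buf(n)
    integer            :: rank, req, ierr
    integer            :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    if (rank == 0) then
      buf = 1.0d0
      ! start the send, but do not wait for it yet
      call MPI_ISEND(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, req, ierr)
    else if (rank == 1) then
      ! post the matching receive
      call MPI_IRECV(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, req, ierr)
    end if

    ! ... useful computation that does not touch buf goes here ...

    if (rank <= 1) call MPI_WAIT(req, status, ierr)
    call MPI_FINALIZE(ierr)
  end program isend_overlap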
Our work: tuning collective communication routines (e.g. an optimized MPI_Barrier()), targeting special support of the underlying network (InfiniBand's multicast, atomic operations, RDMA), and implementing non-blocking collectives within the framework of Open MPI (coll component).
An Optimized InfiniBand Barrier
The LogP model abstracts the network to give a rough communication-performance estimate: L is the network latency, o the send/receive overhead on the CPU, g the gap between consecutive messages, and P the number of processors.
(Figure: LogP timing diagram of a sender and a receiver, showing o, g and L at the processor/memory level and on the network.)
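In LogP terms (standard accounting, restated here rather than taken from the slide): a small message costs the sender an overhead o, travels the network for a latency L, and costs the receiver another o, while back-to-back injections from one node are separated by at least the gap g. Hence

  \[ t_\text{one-way} \approx L + 2o, \qquad t_\text{ping-pong} \approx 2(L + 2o) \]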
An Optimized InfiniBand Barrier
But a 1:P / P:1 ping-pong benchmark reveals an implicit parallelism in InfiniBand.
(Figure: measured RTT/P on Xeon/PCI-X InfiniBand hardware versus the number of processors P, compared with the LogP prediction (L + 2o) and the LogfP prediction.)
An Optimized InfiniBand Barrier
Thus the parameter f is introduced: the first f small messages are essentially free (T. Hoefler et al.: LogfP - A Model for small Messages in InfiniBand). For the first f messages the network card behaves as if it had f ports instead of one. All current algorithms have been modeled and optimized for single-port nodes; the dissemination barrier algorithm works best for single-ported nodes (proven with LogP). This led to the idea of an n-way dissemination barrier.
An Optimized InfiniBand Barrier
The dissemination barrier algorithm: in every round each process notifies one peer and waits for one peer, doubling the "information distance" per round (a reference sketch in plain MPI follows below).
(Figure: rounds 0, 1 and 2 of the dissemination pattern.)
Run time: O(log2 P).
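As a reference point, here is a minimal sketch of the classic dissemination barrier on top of ordinary MPI point-to-point calls. This is the textbook algorithm, not the InfiniBand-tuned barrier from the talk; the subroutine name, tag and dummy payload are illustrative.

  subroutine dissemination_barrier(comm)
    use mpi
    implicit none
    integer, intent(in) :: comm
    integer :: rank, nproc, dist, dest, src, ierr
    integer :: sdummy, rdummy
    integer :: status(MPI_STATUS_SIZE)

    call MPI_COMM_RANK(comm, rank, ierr)
    call MPI_COMM_SIZE(comm, nproc, ierr)

    sdummy = 0
    dist   = 1
    do while (dist < nproc)                      ! ceil(log2 P) rounds
      dest = mod(rank + dist, nproc)             ! peer to notify in this round
      src  = mod(rank - dist + nproc, nproc)     ! peer to wait for in this round
      call MPI_SENDRECV(sdummy, 1, MPI_INTEGER, dest, 99, &
                        rdummy, 1, MPI_INTEGER, src,  99, &
                        comm, status, ierr)
      dist = dist * 2
    end do
  end subroutine dissemination_barrier

Each process finishes after ceil(log2 P) rounds, matching the O(log2 P) run time above; the n-way variant on the next slide reduces the number of rounds by notifying several peers per round.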
An Optimized InfiniBand Barrier
The n-way dissemination barrier algorithm: each process notifies n peers per round instead of one.
(Figure: rounds 0 and 1 of the n-way dissemination pattern.)
Run time: O(log_{n+1} P).
T. Hoefler et al.: An Optimal Synchronization Algorithm for Multi-Port Networks
T. Hoefler et al.: Fast Barrier Synchronization for InfiniBand
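The benefit of the n-way variant is easy to quantify (simple counting, consistent with the run times quoted on these slides): each round now carries n messages per process but multiplies the "informed group" by n+1 instead of 2, so

  \[ r_\text{1-way} = \lceil \log_2 P \rceil, \qquad r_\text{n-way} = \lceil \log_{n+1} P \rceil \]

For example, P = 64 needs 6 rounds with the 1-way barrier but only 3 rounds with n = 3, provided the hardware really absorbs the extra messages per round, which is exactly what the LogfP parameter f indicates for InfiniBand.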
Helping to Answer
The overlapping comm/comp programming style (e.g. with threads) is too complex for many users. What is the trade-off between performance and programming effort?
J.B. White: "... programming for overlap may be of little benefit for synchronous applications in general. Many current MPI implementations - all those tested here - do not support the required level of overlap. The additional development effort and code complexity required to program for overlap seems unjustified ..."
We contribute to the answer by implementing non-blocking collectives in the context of a real application (Abinit)!
It's not just Performance - Why Collectives?
Gorlatch, 04: "Send-Receive Considered Harmful" (cf. Dijkstra, 68: "Go To Statement Considered Harmful").
Point-to-point:
  if (rank == 0) then
    call MPI_SEND(...)
  else
    call MPI_RECV(...)
  end if
vs. collective:
  call MPI_GATHER(...)
Compare: math libraries vs. hand-written loops.
Putting Everything Together
Non-blocking collectives? The MPI-2 Journal of Development (JoD) mentions split collectives, for example:
  MPI_Bcast_begin(...)
  MPI_Bcast_end(...)
They allow no nesting with other collectives and are very limited; they are not in the MPI-2 standard (vote: 11 yes, 12 no, 2 abstain).
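For contrast, the interface argued for here is request-based, in the style of the existing non-blocking point-to-point calls. The sketch below uses the signature that MPI-3 later standardized as MPI_IBCAST, which postdates this talk; read it as an illustration of the proposed style, not as the 2005 prototype's exact API.

  program ibcast_overlap
    use mpi
    implicit none
    double precision :: buf(1024)
    integer :: req, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    buf = 0.0d0
    ! start the broadcast from rank 0 and return immediately
    call MPI_IBCAST(buf, 1024, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, req, ierr)
    ! ... computation that does not touch buf goes here ...
    call MPI_WAIT(req, status, ierr)   ! complete the broadcast
    call MPI_FINALIZE(ierr)
  end program ibcast_overlap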
Why Non-blocking Collectives?
Many collectives synchronize unnecessarily and scale with O(log2 P) sends, so the wasted CPU time per collective is roughly log2(P) * 2L.
Typical latencies: Fast Ethernet L = 50-60 µs, Gigabit Ethernet L = 15-20 µs, InfiniBand L = 2-7 µs; 1 µs corresponds to about 4000 FLOPs on a 2 GHz machine.
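A quick back-of-the-envelope number makes the waste concrete (P = 1024 is my choice of example; the latencies and the 4000 FLOPs per µs figure are the ones quoted above):

  \[ t_\text{wasted} \approx 2L \cdot \log_2 P = 2 \cdot 20\,\mu s \cdot 10 = 400\,\mu s \approx 1.6 \cdot 10^6 \ \text{FLOPs} \]

for Gigabit Ethernet; even for InfiniBand with L = 5 µs this is still about 100 µs, i.e. roughly 400,000 FLOPs thrown away per synchronizing collective.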
Pseudocode Overlap
  do ...
    ! send communication buffer
    call MPI_Isend(buffer_comm, ..., req, ...)
    ! do useful work on the other buffer
    call do_work(buffer_work)
    ! finish communication
    call MPI_Wait(req, ...)
    ! swap the buffers
    buffer_tmp  = buffer_comm
    buffer_comm = buffer_work
    buffer_work = buffer_tmp
  enddo
Linpack Problem
Solves Ax = b via the decomposition A = PLU with row-wise partial pivoting; A and b are given. The Gauss algorithm on A costs O(n^3) floating-point operations and is applied to b as well (so there is no need to store P or L); the solution is then obtained by backward substitution with U. Our example assumes a column-wise distribution; a block distribution is more complicated.
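For reference, the update that the pseudocode on the next slides implements is the standard Gaussian elimination step followed by backward substitution (textbook formulas, restated here):

  \[ l_{ik} = \frac{a_{ik}}{a_{kk}}, \qquad a_{ij} \leftarrow a_{ij} - l_{ik}\,a_{kj}, \qquad b_i \leftarrow b_i - l_{ik}\,b_k \qquad (i, j > k) \]

  \[ x_i = \frac{1}{u_{ii}} \Bigl( b_i - \sum_{j=i+1}^{n} u_{ij}\,x_j \Bigr), \qquad i = n, n-1, \dots, 1 \]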
Linpack Problem - traditional
  do k=1,n,1                         ! loop over columns
    if (IHaveColumn(k)) max = findmaxcolumn(k)
    MPI_Bcast(max)                   ! broadcast pivot row index
    exchangerow(max,k)
    do i=k+1,n,1                     ! loop over rows
      if (IHaveColumn(k)) div = A(i,k) / A(k,k)
      MPI_Bcast(div)                 ! broadcast the multiplier
      do j=k+1,n,1                   ! essentially a DAXPY
        A(i,j) = A(i,j) - div * A(k,j)
      enddo
      b(i) = b(i) - div * b(k)
    enddo
  enddo
Linpack Problem - non-blocking collectives
  ! max and div for the very first step are assumed to come
  ! from a short (omitted) prologue
  do k=1,n,1                         ! loop over columns
    if (IHaveColumn(k+1)) nextmax = findmaxcolumn(k+1)
    MPI_Ibcast(nextmax, h1)          ! start pivot broadcast for step k+1
    exchangerow(max,k)
    do i=k+1,n/p,1                   ! loop over local rows
      if (IHaveColumn(k)) nextdiv = A(i+1,k) / A(k,k)
      MPI_Ibcast(nextdiv, h2)        ! start multiplier broadcast for row i+1
      do j=k+1,n,1                   ! essentially a DAXPY
        A(i,j) = A(i,j) - div * A(k,j)
      enddo
      b(i) = b(i) - div * b(k)
      MPI_Wait(h2)                   ! multiplier for the next row has arrived
      div = nextdiv
    enddo
    MPI_Wait(h1)                     ! pivot for the next column has arrived
    max = nextmax
  enddo
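Whether this pipelining pays off is a balance question (a rough estimate of mine, not a measurement from the talk): the broadcast of nextdiv overlaps with one row update of length n - k, and the broadcast of nextmax overlaps with the whole trailing-matrix update of step k. The latency is fully hidden as long as

  \[ t_\text{DAXPY}(k) \approx \frac{2(n-k)}{\text{FLOP rate}} \;\ge\; t_\text{Ibcast}(P) \]

so overlap works well for the large early steps and degrades towards the end of the factorization, where n - k becomes small.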