McMPI: Managed-code MPI library in Pure C#. Dr D Holmes, EPCC, dholmes@epcc.ed.ac.uk




Outline
- Yet another MPI library? Managed-code, C#, Windows
- McMPI: design and implementation details
  - Object-orientation, design patterns, communication performance results
- Threads and the MPI Standard
  - Pre-End-Points proposal ideas

Why Implement MPI Again?
- Parallel program + distributed memory => MPI library
- Most (all?) MPI libraries are written in C
  - The MPI Standard provides C and Fortran bindings
  - C++ can use the C functions
- Other languages can follow the C++ model and use the C functions
- Alternatively, MPI can be implemented natively in that language
  - Removes inter-language function call overheads, but
  - It may not be possible to achieve comparable performance
(A sketch of the inter-language call path follows.)
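To make the inter-language overhead concrete, here is a minimal illustrative sketch (not McMPI code) of how a managed C# program typically reaches a native MPI library via P/Invoke. The DLL name and the trivial usage are assumptions for illustration; the point is that each such call crosses the managed/native boundary, which a pure managed-code implementation avoids.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only: the usual route from managed code to a native MPI
// library is P/Invoke. Every call below crosses the managed/native
// boundary (argument marshalling, possible buffer pinning) - the overhead
// that a pure managed-code MPI implementation removes.
static class NativeMpi
{
    // Assumed DLL name (MS-MPI ships msmpi.dll); signatures simplified.
    [DllImport("msmpi.dll", EntryPoint = "MPI_Init")]
    public static extern int MPI_Init(IntPtr argc, IntPtr argv);

    [DllImport("msmpi.dll", EntryPoint = "MPI_Finalize")]
    public static extern int MPI_Finalize();
}

class InteropExample
{
    static void Main()
    {
        NativeMpi.MPI_Init(IntPtr.Zero, IntPtr.Zero);  // managed -> native transition
        // ... communication via further P/Invoke calls ...
        NativeMpi.MPI_Finalize();                      // managed -> native transition
    }
}
```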

Why Did I Choose C#?
- Experience and knowledge gained from my career in software development
- My impression of the popularity of C# in commercial software development
- My desire to bridge the gap between high-performance programming and high-productivity programming
- One of the UK research councils offered me funding for a PhD that proposed to use C# to implement MPI

C# Myths
- "C# only runs on Windows"
  - Not such a bad thing: 3 of the Top500 machines use Windows
  - Not actually true: Mono works on multiple operating systems
- "C# is a Microsoft language"
  - Not such a bad thing: resources, commitment, support, training
  - Not actually true: C# follows ECMA and ISO standards
- "C# is slow, like Java"
  - Not such a bad thing: expressivity, readability, re-usability
  - Not actually true: although there is no easy way to prove this conclusively
- "C# and its ilk are not things we need to care about"
  - Not such a bad thing: they will survive/thrive, or not, without us
  - Not actually true: popularity trumps utility

McMPI Design & Implementation
Desirable features of code:
- Isolation of concerns -> easier to understand
- Human readability -> easier to maintain
- Compiler readability -> easier to get good performance
Ways to achieve them:
- Object-orientation can help with isolation of concerns; so can modularisation and judiciously reducing LOC per code file
- Design patterns can help with human readability; so can documentation and useful in-code comments
- Choice of language & compiler can help with performance; so can coding style and detailed examination of compiler output
What is the best compromise?

Communication Layer
- Abstract factory design pattern, similar to plug-ins
  - Enables addition of new functionality without re-compilation of the rest of the library
- All communication modules:
  - Implement the same Abstract Device Interface (ADI)
  - Isolate the details of their implementation from other layers
  - Provide the same semantics and capabilities: reliable delivery, ordered delivery, preservation of message boundaries
- Message = fixed-size envelope information + variable-size user data
(See the sketch after this slide.)
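The following is a minimal sketch of the plug-in structure described above: every communication module implements the same abstract device interface, and a factory supplies a concrete module without the other layers knowing its details. The type and member names (ICommunicationDevice, MessageEnvelope, DeviceFactory, and so on) are illustrative assumptions, not the actual McMPI types.

```csharp
using System;

public interface ICommunicationDevice
{
    // Reliable, ordered delivery that preserves message boundaries:
    // a fixed-size envelope plus variable-size user data.
    void Send(int destinationRank, MessageEnvelope envelope, byte[] userData);
    bool TryReceive(out MessageEnvelope envelope, out byte[] userData);
}

public struct MessageEnvelope
{
    public int SourceRank;
    public int Tag;
    public int ContextId;     // identifies the communicator
    public int PayloadBytes;
}

public abstract class DeviceFactory
{
    // Each transport (shared memory, TCP, ...) supplies its own factory,
    // so new modules can be added without re-compiling the other layers.
    public abstract ICommunicationDevice CreateDevice();
}

public sealed class TcpDeviceFactory : DeviceFactory
{
    public override ICommunicationDevice CreateDevice() => new TcpDevice();
}

internal sealed class TcpDevice : ICommunicationDevice
{
    public void Send(int destinationRank, MessageEnvelope envelope, byte[] userData)
    {
        throw new NotImplementedException("Transport details omitted in this sketch.");
    }

    public bool TryReceive(out MessageEnvelope envelope, out byte[] userData)
    {
        throw new NotImplementedException("Transport details omitted in this sketch.");
    }
}
```

Selecting a different factory (for example, a shared-memory device instead of TCP) changes the transport without touching the protocol or interface layers.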

Communication Layer UML

Protocol Layer
- Bridge design pattern
  - Enables addition of new functionality without re-compilation of the rest of the library
- All protocol messages:
  - Inherit from the same base class
  - Isolate the details of their implementation from other layers
  - Modify the state of internal shared data structures independently
- Shared data structures (message queues):
  - Unexpected queue: message envelope at the receiver before the receive is posted
  - Request queue: receive called before the message envelope arrives
  - Matched queue: at the receiver, waiting for the message data to arrive
  - Pending queue: message data waiting at the sender
(See the queue sketch after this slide.)
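A minimal sketch of the message queues listed above and the matching step that moves a request between them. The class and member names are assumptions, and MessageEnvelope is reused from the communication-layer sketch above; this is not the actual McMPI source.

```csharp
using System.Collections.Generic;

public sealed class ReceiveRequest
{
    public int SourceRank;
    public int Tag;
    public byte[] Buffer;      // user receive buffer
    public bool Completed;
}

public sealed class ProtocolState
{
    // Unexpected queue: envelope arrived at the receiver before a receive was posted.
    public readonly List<MessageEnvelope> UnexpectedQueue = new List<MessageEnvelope>();
    // Request queue: receive was called before the matching envelope arrived.
    public readonly List<ReceiveRequest> RequestQueue = new List<ReceiveRequest>();
    // Matched queue: matched at the receiver, waiting for the message data to arrive.
    public readonly List<ReceiveRequest> MatchedQueue = new List<ReceiveRequest>();
    // (The corresponding Pending queue, holding message data at the sender, is omitted here.)

    // Called when the user posts a receive: either match an already-arrived
    // unexpected envelope, or park the request until one arrives.
    public void PostReceive(ReceiveRequest request)
    {
        int index = UnexpectedQueue.FindIndex(e =>
            e.SourceRank == request.SourceRank && e.Tag == request.Tag);
        if (index >= 0)
        {
            UnexpectedQueue.RemoveAt(index);
            MatchedQueue.Add(request);    // now wait for the data transfer to complete
        }
        else
        {
            RequestQueue.Add(request);
        }
    }
}
```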

Protocol Layer UML

Interface Layer
- Simple façade design pattern
  - Translates MPI Standard-like syntax into protocol layer syntax
  - Will become an adapter design pattern, for example when custom data-types are implemented
- Current version of McMPI covers parts of MPI 1 only:
  - Initialisation and finalisation
  - Administration functions, e.g. to get the rank and size of a communicator
  - Point-to-point communication functions: ready, synchronous, standard (not buffered); blocking, non-blocking, persistent
- Previous version had collectives
  - Implemented on top of point-to-point, using hypercube or binary tree algorithms
(A façade sketch follows.)
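A minimal sketch of the façade: MPI-like calls on the interface layer translate into protocol-layer operations. The Communicator type and method names are assumptions (reusing ProtocolState and ReceiveRequest from the sketch above), not the actual McMPI API.

```csharp
public sealed class Communicator
{
    private readonly ProtocolState _protocol;
    private readonly int _rank;
    private readonly int _size;

    internal Communicator(ProtocolState protocol, int rank, int size)
    {
        _protocol = protocol;
        _rank = rank;
        _size = size;
    }

    // Administration functions: rank and size of this communicator.
    public int Rank => _rank;
    public int Size => _size;

    // Blocking standard-mode receive, expressed as a translation into
    // protocol-layer syntax.
    public void Recv(byte[] buffer, int sourceRank, int tag)
    {
        var request = new ReceiveRequest { SourceRank = sourceRank, Tag = tag, Buffer = buffer };
        _protocol.PostReceive(request);
        // A real implementation would now drive the progress engine
        // (poll the communication device, advance the protocol state)
        // until request.Completed becomes true; omitted in this sketch.
    }
}
```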

McMPI Implementation Overview

Performance Results Introduction 1
Shared-memory results, hardware details:
- Number of nodes: 1 (Armari Magnetar server)
- CPUs per node: 2 (Intel Xeon E5420)
- Threads per CPU: 4 (quad-core, no hyper-threading)
- Core clock speed: 2.5 GHz (front-side bus 1333 MHz)
- Level 1 cache: 4x2x32 KB (data & instruction, per core)
- Level 2 cache: 2x6 MB (one per pair of cores)
- Memory per node: 16 GB DDR2 667 MHz
- Network hardware: 2x NIC, Intel 82575EB Gigabit Ethernet
- Operating system: WinXP Pro 64-bit with SP3 (version 5.2.3790)

Performance Results Introduction 2
Distributed-memory results, hardware details:
- Number of nodes: 18 (Dell PowerEdge 2900)
- CPUs per node: 2 (Intel Xeon 5130, family 6, model 15, stepping 6)
- Threads per CPU: 2 (dual-core, no hyper-threading)
- Core clock speed: 2.0 GHz (front-side bus 1333 MHz)
- Level 1 cache: 2x2x32 KB (data & instruction, per core)
- Level 2 cache: 1x4 MB (one per CPU)
- Memory per node: 4 GB DDR2 533 MHz
- Network hardware: 2x NIC, BCM5708C NetXtreme II GigE
- Operating system: Win2008 Server x64, SP2 (version 6.0.6002)

Shared-memory Latency [chart: latency (µs) vs message size, 1 to 32,768 bytes, comparing MPICH2 shared memory, MS-MPI shared memory and McMPI thread-to-thread]

Shared-memory Bandwidth [chart: bandwidth (Mbit/s) vs message size, 4,096 to 1,048,576 bytes, comparing McMPI thread-to-thread, MPICH2 shared-memory and MS-MPI shared-memory]

Distributed-memory Latency [chart: latency (µs) vs message size, 1 to 32,768 bytes, comparing McMPI Eager and MS-MPI]

Distributed-memory Bandwidth [chart: bandwidth (Mbit/s) vs message size, 4,096 to 1,048,576 bytes, comparing McMPI Rendezvous, McMPI Eager and MS-MPI]

Thread-as-rank Threading Level
- McMPI allows MPI_THREAD_AS_RANK as input for the MPI_INIT_THREAD function
- McMPI creates new threads during initialisation
  - Not strictly needed: MPI_INIT_THREAD could instead be called enough times, once per thread
- McMPI uses thread-local storage to store the rank
  - Not strictly needed: each communicator handle could encode the rank instead
- Thread-to-thread message delivery is zero-copy: direct copy from the user send buffer to the user receive buffer
- Any thread can progress MPI messages
(A usage sketch follows.)
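A hedged usage sketch of the thread-as-rank model: several threads within one OS process each become an MPI rank. The McMpi class, InitThread method and ThreadAsRank constant named in the comments are assumptions about what such an API could look like, not the definitive McMPI interface.

```csharp
using System.Threading;

class ThreadAsRankExample
{
    static void Main()
    {
        const int threadsPerProcess = 4;
        var workers = new Thread[threadsPerProcess];

        for (int i = 0; i < threadsPerProcess; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Each thread initialises with the thread-as-rank level and
                // obtains its own rank in MPI_COMM_WORLD (stored per thread,
                // e.g. in thread-local storage or encoded in the handle).
                // Hypothetical calls:
                //   int rank = McMpi.InitThread(ThreadingLevel.ThreadAsRank);
                //   ... point-to-point communication: any rank (thread) may
                //   send to or receive from any other rank; messages between
                //   threads in the same process are delivered by a direct
                //   copy from send buffer to receive buffer ...
                //   McMpi.Finalize();
            });
            workers[i].Start();
        }

        foreach (var t in workers) t.Join();
    }
}
```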

Thread-as-rank MPI Process (diagram created by Gaurav Saxena, MSc, 2013)

Thread-as-rank and the MPI Standard
- Is thread-as-rank compliant with the MPI Standard? Does the MPI Standard allow/support thread-as-rank?
  - Ambiguous/debatable at best
- The MPI Standard assumes MPI process = OS process
  - Calling MPI_INIT or MPI_INIT_THREAD twice in one OS process: erroneous by definition, or does it create two MPI processes?
- The MPI Standard's definition of "thread compliant" prohibits thread-as-rank
  - To maintain a POSIX-process-like interface for an MPI process
  - The end-points proposal violates this principle in exactly the same way
  - Other possible interfaces exist

Thread-as-rank vs End-points
Similarities:
- Multiple threads can communicate reliably without using tags
- Thread rank can be stored in thread-local storage or in handles
- The most common use-case likely requires MPI_THREAD_MULTIPLE
Differences:
- Thread-as-rank is part of initialisation and active until finalisation; end-points are created after initialisation and can be destroyed
- Thread-as-rank has all possible ranks in MPI_COMM_WORLD; end-points has only some ranks in MPI_COMM_WORLD
- Thread-as-rank cannot create ranks but may need to merge ranks; end-points can create ranks and does not need to merge ranks

Thread-as-rank MPI Forum Proposal?
- Short answer: no
- Long answer: not yet, it's complicated
- More likely: suggested amendments to the end-points proposal
  - Thread-as-rank is a special case of end-points: the standard MPI_COMM_WORLD is replaced with an end-points communicator during MPI_INIT_THREAD
  - Thread-safety implications are similar (possibly identical?)
  - Advantages/opportunities are similar: thread-to-thread delivery rather than process-to-process delivery; a work-stealing MPI progress engine or per-thread message queues

Questions?