McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected]

Size: px

Start display at page:

Download "McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected]"

Hilda Doyle
9 years ago
Views:

1 McMPI Managed-code MPI library in Pure C# Dr D Holmes, EPCC [email protected]

2 Outline Yet another MPI library? Managed-code, C#, Windows McMPI, design and implementation details Object-orientation, design patterns, communication performance results Threads and the MPI Standard Pre- End Points proposal ideas

3 Why Implement MPI Again? Parallel program, distributed memory => MPI library Most (all?) MPI libraries written in C MPI Standard provides C and FORTRAN bindings C++ can use the C functions Other languages can follow the C++ model Use the C functions Alternatively, MPI can be implemented in that language Removes inter-language function call overheads but May not be possible to achieve comparable performance

Other languages can follow the C++ model Use the C functions Alternatively, MPI can be implemented in

4 Why Did I Choose C#? Experience and knowledge I gained from my career in software development My impression of the popularity of C# in commercial software development My desire to bridge the gap between high-performance programming and high-productivity programming One of the UK research councils offered me funding for a PhD that proposed to use C# to implement MPI

the popularity of C# in commercial software development My desire to bridge the gap between

5 C# Myths C# only runs on Windows Not such a bad thing 3 of the Top500 machines use Windows Not actually true Mono works on multiple operating systems C# is a Microsoft language Not such a bad thing resources, commitment, support, training Not actually true C# follows ECMA and ISO standards C# is slow like Java Not such a bad thing expressivity, readability, re-usability Not actually true no easy way to prove this conclusively C# and its ilk are not things we need to care about Not such a bad thing they will survive/thrive, or not, without us Not actually true popularity trumps utility

standards C# is slow like Java Not such a bad thing expressivity, readability, re-usability Not actually true no easy way to prove this

6 McMPI Design & Implementation Desirable features of code Isolation of concerns -> easier to understand Human readability -> easier to maintain Compiler readability -> easier to get good performance Object-orientation can help with isolation of concerns So can modularisation and judiciously reducing LOC per code file Design patterns can help with human readability So can documentation and useful in-code comments Choice of language & compiler can help with performance So can coding style and detailed examination of compiler output What is the best compromise?

and judiciously reducing LOC per code file Design patterns can help with human readability So can documentation and useful in-code comments

7 Communication Layer Abstract class factory design pattern Similar to plug-ins Enables addition of new functionality without re-compilation of the rest of the library All communication modules: Implement the same Abstract Device Interface (ADI) Isolate the details of their implementation from other layers Provide the same semantics and capabilities Reliable delivery Ordering of delivery Preservation of message boundaries Message = fixed size envelope information and variable size user data

Isolate the details of their implementation from other layers Provide the same semantics and capabilities Reliable delivery

8 Communication Layer UML

9 Protocol Layer Bridge design pattern Enables addition of new functionality without re-compilation of the rest of the library All protocol messages: Implement inherit from the same base class Isolate the details of their implementation from other layers Modify state of internal shared data structures independently Shared data structures (message queues ) Unexpected queue message envelope at receiver before receive Request queue receive called before message envelope arrival Matched queue at receiver waiting for message data to arrive Pending queue message data waiting at sender

shared data structures independently Shared data structures (message queues ) Unexpected queue message envelope at receiver before receive

10 Protocol Layer UML

11 Interface Layer Simple façade design pattern Translates MPI Standard-like syntax into protocol layer syntax Will become adapter design pattern For example, when custom data-types are implemented Current version of McMPI covers parts of MPI 1 only Initialisation and finalisation Administration functions, e.g. to get rank and size of communicator Point-to-point communication functions ready, synchronous, standard (not buffered) blocking, non-blocking, persistent Previous version had collectives Implemented on top of point-to-point Using hypercube or binary tree algorithms

12 McMPI Implementation Overview

13 Performance Results Introduction 1 Shared-memory results hardware details Number of Nodes: 1 Armari Magnetar server CPUs per Node: 2 Intel Xeon E5420 Threads per CPU: 4 Quad-core, no hyper-threading Core Clock Speed: 2.5GHz Front-side bus 1333MHz Level 1 Cache: 4x2x32KB Data & instruction per core Level 2 Cache: 2x6MB One per pair of cores Memory per Node: 16GB DDR2 667MHz Network Hardware: 2xNIC Intel 82575EB Gigabit Ethernet Operating System: WinXP Pro 64bit with SP3 version

5GHz Front-side bus 1333MHz Level 1 Cache: 4x2x32KB Data & instruction per core Level 2 Cache: 2x6MB One per pair of cores

14 Performance Results Introduction 2 Distributed-memory results hardware details Number of Nodes: 18 Dell PowerEdge 2900 CPUs per Node: 2 Intel Xeon 5130 Fam 6 mod 15 step 6 Threads per CPU: 2 Dual-core, no hyper-threading Core Clock Speed: 2.0GHz Front-side bus 1333MHz Level 1 Cache: 2x2x32KB Data & instruction per core Level 2 Cache: 1x4MB One per CPU Memory per Node: 4GB DDR2 533MHz Network Hardware: 2xNIC BCM5708C NetXtreme II GigE Operating System: Win2008 Server x64, SP2 version

0GHz Front-side bus 1333MHz Level 1 Cache: 2x2x32KB Data & instruction per core Level 2 Cache: 1x4MB One per CPU Memory per

15 Latency (µs) Shared-memory Latency MPICH2 Shared Memory MS-MPI Shared Memory McMPI thread-to-thread ,024 2,048 4,096 8,192 16,384 32,768 Message Size (bytes)

thread-to-thread 3 2 1 0 1 2 4 8 16 32 64 128 256

16 Bandwidth (Mbit/s) Shared-memory Bandwidth 70,000 60,000 50,000 40,000 30,000 20,000 McMPI thread-to-thread MPICH2 shared-memory MS-MPI shared-memory 10, ,096 8,192 16,384 32,768 65, , , ,288 1,048,576 Message Size (bytes)

shared-memory MS-MPI shared-memory 10,000 0 4,096 8,192

17 Latency (µs) Distributed-memory Latency McMPI Eager MS-MPI ,024 2,048 4,096 8,192 16,384 32,768 Message Size (bytes)

150 100 50 0 1 2 4 8 16 32 64 128 256 512 1,024

18 Bandwidth (Mbit/s) Distributed-memory Bandwidth 1, McMPI Rendezvous McMPI Eager MS-MPI 0 4,096 8,192 16,384 32,768 65, , , ,288 1,048,576 Message Size (bytes)

MS-MPI 0 4,096 8,192 16,384 32,768 65,536

19 Thread-as-rank Threading Level McMPI allows MPI_THREAD_AS_RANK as input for the MPI_INIT_THREAD function McMPI creates new threads during initialisation Not needed MPI_INIT_THREAD must be called enough times McMPI uses thread-local storage to store rank Not needed each communicator handle can encode rank Thread-to-thread message delivery is zero-copy Direct copy from user send buffer to user receive buffer Any thread can progress MPI messages

thread-local storage to store rank Not needed each communicator handle can encode rank Thread-to-thread message

20 Thread-as-rank MPI Process Diagram created by Gaurav Saxena MSc, 2013

21 Thread-as-rank MPI Standard Is thread-as-rank compliant with the MPI Standard? Does the MPI Standard allow/support thread-as-rank? Ambiguous/debatable at best The MPI Standard assumes MPI process = OS process Call MPI_INIT or MPI_INIT_THREAD twice in one OS process Erroneous by definition or results in two MPI processes? MPI Standard thread compliant prohibits thread-as-rank To maintain a POSIX-process-like interface for MPI process End-points proposal violates this principle in exactly the same way Other possible interfaces exist

22 Thread-as-rank End-points Similarities Multiple threads can communicate reliably without using tags Thread rank can be stored in thread-local storage or handles Most common use-case likely requires MPI_THREAD_MULTIPLE Differences Thread-as-rank part of initialisation and active until finalisation End-points created after initialisation and can be destroyed Thread-as-rank has all possible ranks in MPI_COMM_WORLD End-points only has some ranks in MPI_COMM_WORLD Thread-as-rank cannot create ranks but may need to merge ranks End-points can create ranks and does not need to merge ranks

23 Thread-as-rank MPI Forum Proposal? Short answer: no Long answer: not yet, it s complicated More likely to be suggested amendments to end-points proposal Thread-as-rank is a special case of end-points Standard MPI_COMM_WORLD replaced with an end-points communicator during MPI_INIT_THREAD Thread-safety implications are similar (possibly identical?) Advantages/opportunities similar Thread-to-thread delivery rather than process-to-process delivery Work-stealing MPI progress engine or per-thread message queues

24 Questions?

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.