McMPI: Managed-code MPI library in pure C#
Dr D Holmes, EPCC, dholmes@epcc.ed.ac.uk
Outline
- Yet another MPI library? Managed code, C#, Windows
- McMPI: design and implementation details; object-orientation, design patterns, communication performance results
- Threads and the MPI Standard: pre-end-points proposal ideas
Why Implement MPI Again?
- Parallel program + distributed memory => MPI library
- Most (all?) MPI libraries are written in C
- The MPI Standard provides C and Fortran bindings
- C++ can use the C functions
- Other languages can follow the C++ model: use the C functions
- Alternatively, MPI can be implemented in the language itself
  - Removes inter-language function-call overheads, but
  - It may not be possible to achieve comparable performance
Why Did I Choose C#?
- Experience and knowledge I gained from my career in software development
- My impression of the popularity of C# in commercial software development
- My desire to bridge the gap between high-performance programming and high-productivity programming
- One of the UK research councils offered me funding for a PhD that proposed to use C# to implement MPI
C# Myths
- "C# only runs on Windows"
  - Not such a bad thing: 3 of the Top500 machines use Windows
  - Not actually true: Mono works on multiple operating systems
- "C# is a Microsoft language"
  - Not such a bad thing: resources, commitment, support, training
  - Not actually true: C# follows ECMA and ISO standards
- "C# is slow like Java"
  - Not such a bad thing: expressivity, readability, re-usability
  - Not actually true: although there is no easy way to prove this conclusively
- "C# and its ilk are not things we need to care about"
  - Not such a bad thing: they will survive/thrive, or not, without us
  - Not actually true: popularity trumps utility
McMPI Design & Implementation
- Desirable features of code:
  - Isolation of concerns -> easier to understand
  - Human readability -> easier to maintain
  - Compiler readability -> easier to get good performance
- Object-orientation can help with isolation of concerns; so can modularisation and judiciously reducing LOC per code file
- Design patterns can help with human readability; so can documentation and useful in-code comments
- Choice of language & compiler can help with performance; so can coding style and detailed examination of compiler output
- What is the best compromise?
Communication Layer
- Abstract factory design pattern (similar to plug-ins)
  - Enables addition of new functionality without re-compilation of the rest of the library
- All communication modules:
  - Implement the same Abstract Device Interface (ADI)
  - Isolate the details of their implementation from other layers
  - Provide the same semantics and capabilities:
    - Reliable delivery
    - Ordered delivery
    - Preservation of message boundaries
- Message = fixed-size envelope information + variable-size user data
Communication Layer UML
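The plug-in structure above can be sketched as follows. This is a minimal illustration of an abstract factory fronting an ADI, using hypothetical names (`ICommunicationModule`, `Envelope`, `CommunicationModuleFactory` are assumptions for this sketch, not McMPI's actual identifiers):

```csharp
using System.Collections.Generic;

// Hypothetical envelope: fixed-size routing/matching information
// that travels ahead of the variable-size user data.
public readonly struct Envelope
{
    public readonly int SourceRank;
    public readonly int Tag;
    public readonly int PayloadLength;
    public Envelope(int source, int tag, int length)
    { SourceRank = source; Tag = tag; PayloadLength = length; }
}

// The Abstract Device Interface (ADI): every transport module promises
// reliable, ordered delivery that preserves message boundaries.
public interface ICommunicationModule
{
    void Send(int destinationRank, Envelope envelope, byte[] payload);
    bool TryReceive(out Envelope envelope, out byte[] payload);
}

// Abstract factory: new transports (shared memory, TCP, ...) register
// themselves, so the layers above never recompile against a concrete type.
public abstract class CommunicationModuleFactory
{
    private static readonly Dictionary<string, CommunicationModuleFactory> Registry
        = new Dictionary<string, CommunicationModuleFactory>();

    public static void Register(string name, CommunicationModuleFactory factory)
        => Registry[name] = factory;

    public static ICommunicationModule Create(string name)
        => Registry[name].CreateModule();

    protected abstract ICommunicationModule CreateModule();
}
```

A new transport only needs to subclass the factory and call `Register`; callers select a module by name at run time, which is the plug-in behaviour the slide describes.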
Protocol Layer
- Bridge design pattern
  - Enables addition of new functionality without re-compilation of the rest of the library
- All protocol messages:
  - Inherit from the same base class
  - Isolate the details of their implementation from other layers
  - Modify the state of internal shared data structures independently
- Shared data structures (message queues):
  - Unexpected queue: message envelope at receiver before receive is posted
  - Request queue: receive called before message envelope arrival
  - Matched queue: at receiver, waiting for message data to arrive
  - Pending queue: message data waiting at sender
Protocol Layer UML
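The interplay between the unexpected, request, and matched queues can be sketched as the standard two-sided matching race. All type and member names here (`MessageQueues`, `PostedReceive`, etc.) are illustrative assumptions, not McMPI's own:

```csharp
using System.Collections.Generic;

public readonly struct Envelope
{
    public readonly int Source, Tag;
    public Envelope(int source, int tag) { Source = source; Tag = tag; }
}

public sealed class PostedReceive
{
    public int Source, Tag;   // what this receive is willing to match
    public Envelope? Matched; // set once an envelope matches
}

public sealed class MessageQueues
{
    private readonly List<Envelope> unexpected = new List<Envelope>();       // envelope before receive
    private readonly List<PostedReceive> requests = new List<PostedReceive>(); // receive before envelope
    private readonly List<PostedReceive> matched = new List<PostedReceive>();  // waiting for message data

    // Communication layer delivers an envelope: match a posted receive,
    // otherwise park the envelope on the unexpected queue.
    public void OnEnvelopeArrived(Envelope env)
    {
        for (int i = 0; i < requests.Count; i++)
        {
            if (requests[i].Source == env.Source && requests[i].Tag == env.Tag)
            {
                requests[i].Matched = env;
                matched.Add(requests[i]);
                requests.RemoveAt(i);
                return;
            }
        }
        unexpected.Add(env);
    }

    // Interface layer posts a receive: match an unexpected envelope,
    // otherwise park the request on the request queue.
    public void OnReceivePosted(PostedReceive recv)
    {
        for (int i = 0; i < unexpected.Count; i++)
        {
            if (unexpected[i].Source == recv.Source && unexpected[i].Tag == recv.Tag)
            {
                recv.Matched = unexpected[i];
                matched.Add(recv);
                unexpected.RemoveAt(i);
                return;
            }
        }
        requests.Add(recv);
    }
}
```

Whichever side arrives first parks its record; the second arrival completes the match and moves the request to the matched queue, where it waits for the message data. The pending queue on the sender side follows the same pattern and is omitted here.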
Interface Layer
- Simple façade design pattern
  - Translates MPI-Standard-like syntax into protocol-layer syntax
  - Will become an adapter design pattern, for example when custom data-types are implemented
- Current version of McMPI covers parts of MPI-1 only:
  - Initialisation and finalisation
  - Administration functions, e.g. to get the rank and size of a communicator
  - Point-to-point communication functions: ready, synchronous, standard (not buffered); blocking, non-blocking, persistent
- Previous version had collectives, implemented on top of point-to-point using hypercube or binary-tree algorithms
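The façade role above can be sketched in a few lines: MPI-Standard-like signatures that do nothing but translate arguments and delegate downwards. The protocol-layer surface shown (`IProtocolEngine`, `ISendOperation`, `Communicator`) is a hypothetical stand-in, not McMPI's real API:

```csharp
// Hypothetical protocol-layer surface assumed for this sketch.
public interface ISendOperation { void WaitForCompletion(); }

public interface IProtocolEngine
{
    ISendOperation StartSend(byte[] data, int destinationRank, int tag);
}

public sealed class Communicator
{
    public IProtocolEngine Protocol { get; }
    public int Rank { get; }
    public int Size { get; }
    public Communicator(IProtocolEngine protocol, int rank, int size)
    { Protocol = protocol; Rank = rank; Size = size; }
}

// The façade: MPI-like syntax translated into protocol-layer syntax.
public static class Mpi
{
    // Non-blocking standard-mode send: pure delegation, no protocol logic here.
    public static ISendOperation Isend(byte[] buffer, int dest, int tag, Communicator comm)
        => comm.Protocol.StartSend(buffer, dest, tag);

    // Blocking send = non-blocking start + wait; eager-vs-rendezvous
    // selection happens below this layer.
    public static void Send(byte[] buffer, int dest, int tag, Communicator comm)
        => Isend(buffer, dest, tag, comm).WaitForCompletion();
}
```

Because the façade holds no protocol logic, replacing it with an adapter (for custom data-types, for instance) leaves the protocol layer untouched.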
McMPI Implementation Overview
Performance Results Introduction 1
Shared-memory results: hardware details
- Number of nodes: 1 (Armari Magnetar server)
- CPUs per node: 2 x Intel Xeon E5420
- Threads per CPU: 4 (quad-core, no hyper-threading)
- Core clock speed: 2.5 GHz (front-side bus 1333 MHz)
- Level 1 cache: 4x2x32 KB (data & instruction, per core)
- Level 2 cache: 2x6 MB (one per pair of cores)
- Memory per node: 16 GB DDR2 667 MHz
- Network hardware: 2x NIC, Intel 82575EB Gigabit Ethernet
- Operating system: WinXP Pro 64-bit with SP3 (version 5.2.3790)
Performance Results Introduction 2
Distributed-memory results: hardware details
- Number of nodes: 18 (Dell PowerEdge 2900)
- CPUs per node: 2 x Intel Xeon 5130 (family 6, model 15, stepping 6)
- Threads per CPU: 2 (dual-core, no hyper-threading)
- Core clock speed: 2.0 GHz (front-side bus 1333 MHz)
- Level 1 cache: 2x2x32 KB (data & instruction, per core)
- Level 2 cache: 1x4 MB (one per CPU)
- Memory per node: 4 GB DDR2 533 MHz
- Network hardware: 2x NIC, BCM5708C NetXtreme II GigE
- Operating system: Win2008 Server x64, SP2 (version 6.0.6002)
Shared-memory Latency
[Chart: latency (µs) vs message size (1 byte to 32,768 bytes) for MPICH2 shared memory, MS-MPI shared memory, and McMPI thread-to-thread]
Shared-memory Bandwidth
[Chart: bandwidth (Mbit/s) vs message size (4,096 bytes to 1,048,576 bytes) for McMPI thread-to-thread, MPICH2 shared memory, and MS-MPI shared memory]
Distributed-memory Latency
[Chart: latency (µs) vs message size (1 byte to 32,768 bytes) for McMPI eager and MS-MPI]
Distributed-memory Bandwidth
[Chart: bandwidth (Mbit/s) vs message size (4,096 bytes to 1,048,576 bytes) for McMPI rendezvous, McMPI eager, and MS-MPI]
Thread-as-rank Threading Level
- McMPI allows MPI_THREAD_AS_RANK as input to the MPI_INIT_THREAD function
- Does McMPI create new threads during initialisation? Not needed: MPI_INIT_THREAD must be called enough times (once by each thread)
- Does McMPI use thread-local storage to store the rank? Not needed: each communicator handle can encode the rank
- Thread-to-thread message delivery is zero-copy: a direct copy from the user send buffer to the user receive buffer
- Any thread can progress MPI messages
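The "handle can encode rank" point above can be sketched as a packed handle, so no thread-local storage is required. The layout (high 32 bits = communicator context id, low 32 bits = rank) and the type name `CommHandle` are assumptions for illustration, not McMPI's actual representation:

```csharp
// Sketch: a communicator handle that carries the calling thread's rank,
// so rank lookup needs neither TLS nor a global table.
public readonly struct CommHandle
{
    private readonly ulong bits;
    private CommHandle(ulong bits) { this.bits = bits; }

    // Pack an assumed context id and this thread's rank into one value.
    public static CommHandle Create(uint contextId, uint rank)
        => new CommHandle(((ulong)contextId << 32) | rank);

    public uint ContextId => (uint)(bits >> 32);
    public uint Rank => (uint)bits;
}
```

Under this scheme, each thread that calls MPI_INIT_THREAD receives its own handle value for MPI_COMM_WORLD with its rank baked in, so every subsequent call can recover the rank from the handle argument alone.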
Thread-as-rank MPI Process
(Diagram created by Gaurav Saxena, MSc, 2013)
Thread-as-rank: MPI Standard
- Is thread-as-rank compliant with the MPI Standard? Does the MPI Standard allow/support thread-as-rank?
  - Ambiguous/debatable at best
- The MPI Standard assumes MPI process = OS process
  - Calling MPI_INIT or MPI_INIT_THREAD twice in one OS process: erroneous by definition, or results in two MPI processes?
- The MPI Standard's definition of "thread compliant" prohibits thread-as-rank
  - To maintain a POSIX-process-like interface for each MPI process
  - The end-points proposal violates this principle in exactly the same way
  - Other possible interfaces exist
Thread-as-rank vs End-points
- Similarities:
  - Multiple threads can communicate reliably without using tags
  - Thread rank can be stored in thread-local storage or in handles
  - The most common use-case likely requires MPI_THREAD_MULTIPLE
- Differences:
  - Thread-as-rank is part of initialisation and active until finalisation; end-points are created after initialisation and can be destroyed
  - Thread-as-rank has all possible ranks in MPI_COMM_WORLD; end-points has only some ranks in MPI_COMM_WORLD
  - Thread-as-rank cannot create ranks but may need to merge ranks; end-points can create ranks and does not need to merge ranks
Thread-as-rank: MPI Forum Proposal?
- Short answer: no
- Long answer: not yet, it's complicated
  - More likely to be suggested amendments to the end-points proposal
- Thread-as-rank is a special case of end-points
  - The standard MPI_COMM_WORLD is replaced with an end-points communicator during MPI_INIT_THREAD
  - Thread-safety implications are similar (possibly identical?)
- Advantages/opportunities are similar
  - Thread-to-thread delivery rather than process-to-process delivery
  - Work-stealing MPI progress engine or per-thread message queues
Questions?