Exploiting Transparent Remote Memory Access for Non-Contiguous- and One-Sided-Communication


Workshop for Communication Architecture in Clusters, IPDPS 2002: Exploiting Transparent Remote Memory Access for Non-Contiguous- and One-Sided-Communication Joachim Worringen, Andreas Gäer, Frank Reker Lehrstuhl für Betriebssysteme, RWTH Aachen

Agenda
- Introduction into SCI via PCI
- MPI via SCI: SCI-MPICH
- Non-Contiguous Datatype Communication
- One-sided Communication
- Conclusion

SCI Principle
Scalable Coherent Interface: protocol for a memory-coupling interconnect, standardized in 1992 (IEEE 1596-1992)
- Not a physical bus, but offers bus-like services
- Transparent operation for source and target
- Operates within a global 64-bit address space
- Good scalability through flexible topologies and small packet sizes
- Optional: efficient, distributed cache coherence

Integrated Systems SCI Implementation
PCI-SCI adapter: 64-bit / 66 MHz PCI-PCI bridge
Key performance values on the IA-32 platform:
- Remote write (word): 1.5 µs
- Remote read (word): 3.2 µs
- Bandwidth, PIO: 170 MB/s (IA-32 PCI is the limit; peak bandwidth at 512-byte block size)
- Bandwidth, DMA: 250 MB/s (DMA engine is the limit)
- Synchronization: ~15 µs
- Remote interrupt: ~30 µs

PCI-SCI Performance on IA-32: Bandwidth
(Chart: bandwidth curves for DMA engine, PCI, and memory.)

PCI-SCI Performance on IA-32: Latency
(Chart: latency over block size; native packet size marked.)

Message Passing over SCI
Similar to local shared memory, but:
- Different performance characteristics: read/write latency; the I/O bus behaves differently than the system bus; granularity and contiguity of accesses matter; load on independent inter-node links (collective operations)
- Different memory consistency model
- Connecting/mapping incurs more overhead
- Resources are limited
- It is still a network: connection monitoring is needed
The naive approach "map everything and do shmem" is neither efficient nor scalable.

Communicating Non-Contiguous Data

MPI Datatypes
- Basic datatypes (char, double, ...)
- Derived datatypes: concatenate elements to communicate vectors; associate elements to communicate structs
- Using offsets, strides and upper/lower bounds, non-contiguous datatypes can be defined
- Example: columns of a matrix (C storage order)
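The matrix-column example maps to a fixed-stride layout. As a minimal plain-C sketch (no MPI calls; the function name and dimensions are illustrative) of what an MPI vector datatype with count=ROWS, blocklength=1, stride=COLS would describe:

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* Gather column `col` of a row-major ROWS x COLS matrix into `out`.
 * Element i of the column lives at offset col + i * COLS -- exactly
 * the <offset, stride, count> description a derived datatype stores. */
static void gather_column(const double *m, int col, double *out) {
    for (int i = 0; i < ROWS; i++)
        out[i] = m[col + i * COLS];
}
```

With MPI, the same layout would be committed once as a datatype and handed to a single send, rather than looped over by the application.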

Sending Non-Contiguous Data: The Generic Way
What is the problem? Many communication devices have only a simple data specification, <data location, data length>, so N send operations for N separate data chunks are inefficient.
Generic solution: the sender packs the chunks into a contiguous buffer, transfers it, and the receiver gathers and unpacks (Pack, Transfer, Gather, Unpack).

Sending Non-Contiguous Data: The SCI Way
Goal: avoid intermediate copy operations!
SCI advantage: high bandwidth even for small data chunks.
Problem: requires a fast, space-efficient, restartable pack/unpack algorithm: direct_pack_ff.
The sender packs and transfers in one step; the receiver gathers and unpacks in one step.
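The saving can be sketched in plain C, with memcpy standing in for the SCI remote write and all names being illustrative (this is not the SCI-MPICH code):

```c
#include <string.h>

/* One chunk of a non-contiguous message: <location, length>. */
struct chunk { const char *loc; size_t len; };

/* Generic way: pack all chunks into a staging buffer, then
 * transfer the buffer in one operation -- a second copy. */
static size_t pack_then_send(const struct chunk *c, int n,
                             char *staging, char *remote) {
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(staging + off, c[i].loc, c[i].len);
        off += c[i].len;
    }
    memcpy(remote, staging, off);   /* the intermediate copy SCI avoids */
    return off;
}

/* SCI way: write each chunk directly into the mapped remote
 * segment -- pack and transfer collapse into one pass. */
static size_t pack_on_the_fly(const struct chunk *c, int n, char *remote) {
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(remote + off, c[i].loc, c[i].len);
        off += c[i].len;
    }
    return off;
}
```

Both paths produce the same remote contents; the direct path touches each byte once instead of twice, which also keeps the staging buffer out of the cache.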

Performance (I) Simple Vector

Performance (II) Complex Type

Performance (III) - Alignment

Reduced Cache-Pollution

One-Sided Communication
SCI is one-sided communication by nature, but:
- Remote read bandwidth is below 10 MB/s
- Only fragments of each process address space are accessible
- These fragments need to be allocated specifically (until now)
A multi-protocol approach is required to handle every kind of memory type and to achieve optimal performance for each type of access.

Different Access Modes
Direct access of remote memory:
- Window is allocated from an SCI shared segment
- Put operations of any size
- Get operations sized up to a threshold
Emulated access of remote memory (access initiated by the origin via messages, executed by the target):
- Windows allocated from private memory
- Get operations beyond the threshold
- Accumulate operations
- Synchronous or asynchronous completion
Different overhead and synchronization semantics.
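The dispatch between the two protocols can be sketched as a simple threshold decision in plain C (the threshold value and all names are illustrative, not taken from SCI-MPICH):

```c
#include <stddef.h>

enum protocol { DIRECT_SCI, EMULATED_MSG };
enum rma_op   { OP_PUT, OP_GET, OP_ACCUMULATE };

#define GET_THRESHOLD 64  /* bytes; illustrative tunable */

/* Pick the access protocol for a window operation: puts go direct
 * at any size; gets go direct only while small enough to hide the
 * poor SCI remote-read bandwidth; accumulates and all accesses to
 * private-memory windows are emulated at the target via messages. */
static enum protocol choose_protocol(enum rma_op op, size_t bytes,
                                     int window_is_shared_segment) {
    if (!window_is_shared_segment) return EMULATED_MSG;
    if (op == OP_PUT) return DIRECT_SCI;
    if (op == OP_GET && bytes <= GET_THRESHOLD) return DIRECT_SCI;
    return EMULATED_MSG;
}
```

The asymmetry mirrors the hardware: remote writes are cheap on SCI, remote reads are not, so only puts stay direct regardless of size.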

Latency of Remote Accesses
(Chart: latency in µs, 0-120, for put/get on shared windows and on private windows with synchronous or asynchronous completion, over 1-124 "double" elements.)

Bandwidth of Remote Accesses
(Chart: bandwidth in MiB/s, logarithmic scale up to 100, for put/get on shared windows and on private windows with synchronous or asynchronous completion, over 1-124 "double" elements.)

Scalability
SCI ringlet bandwidth: 633 MiB/s. Outgoing peak traffic for MPI_Put() per node: 122 MiB/s.
Worst-case scalability:

nodes | offered load [% of ringlet] | per node [MiB/s] | total [MiB/s] | efficiency
  4   |  76.3 %                     | 120.7            | 482.8         | 76.3 %
  5   |  95.3 %                     | 115.8            | 579.0         | 91.5 %
  6   | 114.4 %                     |  97.8            | 586.5         | 92.7 %
  7   | 133.5 %                     |  79.3            | 555.1         | 87.7 %
  8   | 152.5 %                     |  62.8            | 502.2         | 79.3 %
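The table's percentages follow from the two headline numbers; a small plain-C check (assuming a per-node peak of 120.7 MiB/s behind the slide's rounded 122, which the offered-load column implies):

```c
#define RINGLET_BW 633.0   /* MiB/s, shared SCI ringlet         */
#define PEAK_PUT   120.7   /* MiB/s per node (122 rounded above) */

/* Offered load of n nodes, as a percentage of ringlet bandwidth:
 * above 100 % the ring is oversubscribed and per-node throughput drops. */
static double offered_load_pct(int n) {
    return 100.0 * n * PEAK_PUT / RINGLET_BW;
}

/* Efficiency: achieved total traffic relative to ringlet bandwidth. */
static double efficiency_pct(double total_mibs) {
    return 100.0 * total_mibs / RINGLET_BW;
}
```

For 4 nodes the offered load and the efficiency coincide (76.3 %) because the ring is not yet saturated and all offered traffic gets through.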

Conclusions
- Avoiding separate pack/unpack for non-contiguous datatypes: highly efficient communication, reduced cache pollution and working-set size
- Performance of one-sided communication comparable to integrated systems
- Multi-protocol approach ensures unlimited usability with optimal performance for each scenario
- Techniques are directly usable for SCI and local shmem
Limits (and possible solutions):
- Address space of the IA-32 architecture (IA-64 or others)
- Limited address translation resources (better hardware, or a software cache)
- Ringlet scalability (increase link clock)
- Remote write latency (PCI-X at 133 MHz: down to 0.8 µs)
- Remote read bandwidth (no real solution in sight)
- Interrupt latency (interrupt-on-write will halve it)

Thank you!

Spare Slides

Pack on the Fly
Store type information on commit of the datatype: offset, length, stride, repetition count.
- The usual data structure, a tree, is recursive and complex to restart
- The naive data structure, a flat list, is not space-efficient
- A suitable data structure: a linked list of stacks
Bandwidth break-even point between the overhead of intermediate copies and the reduced bandwidth of fine-grained copy operations.
The list is sorted by the size of the stack entries ("leaves") to allow for an optimization: buffering for small and misaligned (parts of) leaves.
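The per-leaf record and the size ordering can be sketched in plain C (structure and field names are illustrative, not SCI-MPICH's internals):

```c
#include <stdlib.h>

/* One "leaf" of a committed datatype: a strided run of equally
 * sized blocks, stored once at commit time. */
struct leaf {
    size_t offset;  /* start within the user buffer           */
    size_t length;  /* bytes per block                        */
    size_t stride;  /* distance between consecutive blocks    */
    int    count;   /* repetition count                       */
};

/* Order leaves by block size, largest first, so the pack routine
 * can stream big blocks directly and buffer only the small or
 * misaligned remainder. */
static int by_length_desc(const void *a, const void *b) {
    const struct leaf *la = a, *lb = b;
    return (la->length < lb->length) - (la->length > lb->length);
}

static void sort_leaves(struct leaf *l, int n) {
    qsort(l, n, sizeof *l, by_length_desc);
}
```

A restartable pack routine walking such a list only needs to remember its current leaf and offset, which is what makes pipelined pack-and-transfer feasible.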

Integration in SCI-MPICH
Building the stack:
  MPIR_build_ff (struct MPIR_DATATYPE *type, int *leaves)
Moving data:
  MPID_SMI_Pack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr, char *outbuf, int dest, int max, int *outlen)
  MPID_SMI_UnPack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr, char *outbuf, int from, int max, int *outlen)

Type Matching
Sorting of basic datatypes (leaves) by size:
- Improves performance by mixing gathering and direct writes
- Changes the order of data in the incoming buffer at the receiver!

Efficiency Comparison: Non-Contiguous Vector
(Chart: relative efficiency 0.0-1.2 over message sizes 8 B to 128 KiB for: ff via SCI, ff via shmem, Score Myrinet, Score shmem, SunFire Gigabit, SunFire shmem, Cray T3E.)

Memory Allocation
MPI_Alloc_mem():
- Below threshold 1: allocate via malloc()
- Below threshold 2: allocate from a shared memory pool
- Beyond threshold 2: create a specific shared memory segment
Attributes:
- Keys private, shared, pinned: type of memory buffer
- Key align with a value: align the memory buffer
MPI_Free_mem(): unpin memory, release the shared memory segment; implicitly invokes remote segment callbacks.
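The two-threshold policy above can be sketched in plain C (the threshold values and all names are illustrative placeholders, not the implementation's):

```c
#include <stddef.h>

enum mem_source { FROM_MALLOC, FROM_SHARED_POOL, NEW_SHARED_SEGMENT };

/* Illustrative thresholds -- the real values are tunables. */
#define THRESHOLD_1 256          /* bytes */
#define THRESHOLD_2 (64 * 1024)  /* bytes */

/* Decide where MPI_Alloc_mem() takes the buffer from: tiny buffers
 * stay in private malloc()ed memory, mid-sized ones come from a
 * pre-created shared pool, and large ones get a dedicated shared
 * segment of their own. */
static enum mem_source alloc_source(size_t bytes) {
    if (bytes < THRESHOLD_1) return FROM_MALLOC;
    if (bytes < THRESHOLD_2) return FROM_SHARED_POOL;
    return NEW_SHARED_SEGMENT;
}
```

The pool tier exists because creating and mapping an SCI segment is expensive and segment/mapping resources are scarce, so small shared allocations should share one segment.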

sparse micro-benchmark
Simulate access to a remote sparse matrix:

MPI_Win_create (..., winsize, ...)
for (increasing values of access_cnt) {
    offset = 0
    stride = access_cnt * sizeof(datatype)
    flush_cache()
    time = MPI_Wtime()
    while (offset + access_size < winsize) {
        MPI_Get/Put (.., partner, offset, access_cnt, ..)
        offset = offset + stride
    }
    MPI_Win_fence(...)
    time = MPI_Wtime() - time
}

Latency of Remote Accesses

Bandwidth of Remote Accesses

Point-to-Point Comparison

Scaling Comparison MPI_Put