InfiniBand Software and Protocols Enable Seamless Off-the-shelf Applications Deployment




December 2007

1.0 Introduction

The InfiniBand architecture defines a high-bandwidth, low-latency clustering interconnect used for high-performance computing (HPC) and enterprise data center (EDC) class applications. It is an industry standard developed by the InfiniBand Trade Association (www.infinibandta.org). Since the release of the specification, the InfiniBand community has been active in developing software for all major OS platforms, including open source Linux software for the InfiniBand architecture. The Linux software community, comprising major suppliers to the HPC and EDC markets, collaborates on joint open source development as part of the OpenFabrics Alliance (www.openfabrics.org, previously called OpenIB.org).

The InfiniBand software stack is designed from the ground up to ease application deployment. IP and TCP socket applications can take advantage of InfiniBand performance without any change to existing applications that run over Ethernet, and the same applies to SCSI, iSCSI and file system applications. Upper layer protocols that reside above the low-level InfiniBand adapter device driver and the device-independent API (also called verbs) provide industry-standard interfaces that enable seamless deployment of off-the-shelf applications.

Some extremely high performance applications require very low latency; such applications typically use the industry-standard Message Passing Interface (MPI) or interface directly with the verbs layer, and must be ported and tuned specifically to InfiniBand semantics. Those applications and protocols are not discussed in this article.

1.1 Software Architecture

Figure 1 below shows the Linux InfiniBand software architecture. The software consists of a set of kernel modules and protocols; there are associated user-mode shared libraries as well, which are not shown in the figure. Applications that operate at the user level remain transparent to the underlying interconnect technology. The focus of this article is what application developers need to know to enable their IP, SCSI, iSCSI, sockets or file system based applications to operate over InfiniBand.

Figure 1: Linux InfiniBand software architecture. Application interfaces (IP, SCSI, sockets, iSCSI, file systems) remain transparent to the underlying interconnect; the upper layer protocols (IPoIB, SRP, SDP, iSER, NFS-RDMA and others) sit above the CM abstraction and the core modules (SMA, PMA, CM, SA client, MAD services, SMI, GSI, QP redirection, verbs), which form the InfiniBand-dependent kernel-level components.

A detailed discussion of the operation of the protocols and the underlying core modules is beyond the scope of this article. However, for the sake of completeness, a brief overview of the kernel-level, InfiniBand-specific modules and protocols is presented below.

The kernel code divides logically into three layers: the HCA driver(s), the core InfiniBand modules, and the upper level protocols. The core InfiniBand modules comprise the kernel-level mid-layer for InfiniBand devices. The mid-layer allows access to multiple HCAs and provides a common set of shared services, including the following:

- User-level Access Modules: The user-level access modules implement the mechanisms necessary to allow access to InfiniBand hardware from user-mode applications.

- Communications Manager (CM): The CM provides the services needed to allow clients to establish connections.

- SA Client: The SA (Subnet Administrator) client provides functions that allow clients to communicate with the subnet administrator. The SA holds important information, such as path records, that is needed to establish connections.

- SMA: The Subnet Management Agent responds to subnet management packets that allow the subnet manager to query and configure the devices on each host.

- PMA: The Performance Management Agent responds to management packets that allow retrieval of the hardware performance counters.

- MAD services: Management Datagram (MAD) services provide a set of interfaces that allow clients to access the special InfiniBand queue pairs (QPs) 0 and 1.

- GSI: The General Services Interface (GSI) allows clients to send and receive management packets on special QP 1. Queue pair (QP) redirection allows an upper level management protocol that would normally share access to special QP 1 to redirect that traffic to a dedicated QP; this is done for upper level management protocols that are bandwidth intensive.

- SMI: The Subnet Management Interface (SMI) allows clients to send and receive packets on special QP 0. It is typically used by the subnet manager.

- Verbs: The mid-layer provides access to the InfiniBand verbs supplied by the HCA driver. The InfiniBand architecture specification defines the verbs; a verb is a semantic description of a function that must be provided. The mid-layer translates these semantic descriptions into a set of Linux kernel application programming interfaces (APIs).

The mid-layer is also responsible for resource tracking, reference counting, and resource cleanup in the event of an abnormal program termination or when a client closes the interface without releasing all of its allocated resources.

The lowest layer of the kernel-level InfiniBand stack consists of the HCA driver(s). Each HCA device requires an HCA-specific driver that registers with the mid-layer and provides the InfiniBand verbs.

The upper level protocols, such as IPoIB, SRP, SDP and iSER, enable standard data networking, storage and file system applications to operate over InfiniBand. Except for IPoIB, which provides a simple encapsulation of TCP/IP data streams over InfiniBand, the other upper level protocols transparently enable higher bandwidth, lower latency, lower CPU utilization and end-to-end services using field-proven RDMA (Remote DMA) and hardware-based transport technologies available with InfiniBand. The following sections discuss those upper level protocols and how existing applications can be quickly enabled to operate over them and over InfiniBand.
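Although verbs-level programming is outside the scope of this article, the verbs layer exposed by the mid-layer is also reachable from user space through the OpenFabrics libibverbs library. The short sketch below is offered only as an illustration, assuming libibverbs is installed (link with -libverbs): it enumerates the HCAs visible on a host and opens a device context, the object from which all other verbs resources (protection domains, completion queues, queue pairs) are derived.

    /* Minimal illustration of the user-level verbs API (libibverbs):
     * list the available InfiniBand devices and open the first one. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        for (int i = 0; i < num; i++)
            printf("HCA %d: %s\n", i, ibv_get_device_name(devs[i]));

        /* A device context is the root object for all further verbs calls. */
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) {
            fprintf(stderr, "failed to open device\n");
            ibv_free_device_list(devs);
            return 1;
        }

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }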

1.2 Supporting IP Applications

The easiest path to evaluating any IP-based application over InfiniBand is to use the upper layer protocol called IP over IB (IPoIB). IPoIB running over high-bandwidth InfiniBand adapters can provide an instant performance boost to any IP-based application. IPoIB supports tunneling of Internet Protocol (IP) packets over InfiniBand hardware; see Figure 2 below. In Linux, the protocol is implemented as a standard Linux network driver, which allows any application or kernel driver that uses standard Linux network services to use the InfiniBand transport without modification. Linux kernel 2.6.11 and above includes support for the IPoIB protocol, the InfiniBand core modules, and an HCA driver for adapters based on Mellanox Technologies HCA silicon.

Figure 2: IPoIB in the Linux kernel. TCP/IP traffic from the Linux network stack is carried either by the IPoIB driver over InfiniBand or by a standard Ethernet NIC.

This method of enabling IP applications over InfiniBand is effective for management, configuration, setup or control-plane data where bandwidth and latency are not critical. Because the application continues to run over the standard TCP/IP networking stack, it is completely unaware of the underlying I/O hardware. However, to attain full performance and take advantage of some of the advanced features of the InfiniBand architecture, application developers may want to use the Sockets Direct Protocol (SDP) and the related sockets-based API.
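To make the transparency concrete, the sketch below is an ordinary AF_INET TCP client with nothing InfiniBand-specific in it; the server address and port are hypothetical. Pointed at an address configured on an IPoIB interface (conventionally named ib0), it runs over InfiniBand unchanged, and the same unmodified binary is also a candidate for the SDP acceleration described in the next section.

    /* A plain TCP client over standard Linux sockets. If the (hypothetical)
     * address 192.168.10.5 belongs to an IPoIB interface such as ib0, the
     * connection is carried over InfiniBand with no change to this code. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5000);                         /* hypothetical port */
        inet_pton(AF_INET, "192.168.10.5", &srv.sin_addr);  /* hypothetical IPoIB address */

        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }

        const char msg[] = "hello over IPoIB\n";
        write(fd, msg, sizeof(msg) - 1);
        close(fd);
        return 0;
    }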

1.3 Supporting Sockets-based Applications

For applications that use TCP sockets, the Sockets Direct Protocol (SDP) delivers a significant boost in performance while reducing CPU utilization and application latency. The SDP driver provides a high-performance interface for standard socket applications: it bypasses the software TCP/IP stack, implements zero copy and asynchronous I/O, and transfers data using efficient RDMA and hardware-based transport mechanisms.

Figure 3: A Linux-based SDP implementation. The SDP sockets preload library directs socket traffic either through the kernel SDP driver over InfiniBand or through the standard TCP/IP stack and an Ethernet NIC.

InfiniBand hardware provides a reliable, hardware-based transport. As such, the TCP protocol becomes redundant and can be bypassed, saving valuable CPU cycles; see Figure 3 above, which depicts a Linux-based implementation of SDP. Zero-copy SDP implementations avoid expensive memory copies, and the use of RDMA helps avoid expensive context switch penalties, improving CPU utilization, throughput and latency.

The SDP protocol is implemented as a separate network address family: TCP/IP provides the AF_INET address family, while SDP provides the AF_SDP (27) address family. To allow standard sockets applications to use SDP without modification, SDP provides a preloaded library that traps the libc socket calls destined for AF_INET and redirects them to AF_SDP. Applications therefore need no changes beyond being run with the preloaded library.
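The snippet below sketches what that redirection amounts to: a socket created in the AF_SDP family behaves like a TCP socket but is served by the SDP driver. The AF_SDP value of 27 comes from the text above and is defined locally because standard libc headers do not provide it; treat the example, and the note that the preload library (commonly shipped as libsdp in OpenFabrics distributions) performs this redirection for unmodified applications, as a hedged illustration rather than a definitive recipe.

    /* Explicitly requesting an SDP socket. Apart from the address family,
     * the call is identical to creating a TCP socket; bind()/connect() and
     * read()/write() then proceed as usual, with addressing details
     * depending on the SDP implementation in use. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>

    #ifndef AF_SDP
    #define AF_SDP 27   /* SDP address family, per the value cited above */
    #endif

    int main(void)
    {
        int fd = socket(AF_SDP, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket(AF_SDP)");   /* fails if SDP support is not loaded */
            return 1;
        }

        /* ... connection setup and data transfer as with any stream socket ... */

        close(fd);
        return 0;
    }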

1.4 Supporting SCSI and iSCSI Protocol-based Applications

The SCSI RDMA Protocol (SRP) was defined by the ANSI T10 committee to provide block storage capabilities for the InfiniBand architecture. SRP tunnels SCSI request packets over InfiniBand hardware using this industry-standard wire protocol, which allows a single host driver to use storage target devices from various storage hardware vendors.

Figure 4: SRP in the Linux storage stack. Linux file systems sit on the SCSI mid-layer, which dispatches either to the SRP initiator over InfiniBand or to a conventional HBA driver (SCSI, FC, SATA) and HBA.

As shown in Figure 4 above, the SRP upper level protocol plugs into Linux through the SCSI mid-layer. Thus, to the upper layer Linux file systems and to user applications that use those file systems, SRP devices appear like any other locally attached storage device, even though they can be physically located anywhere on the fabric. It is worth mentioning that SRP is part of the latest Linux kernel versions.
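Because SRP targets surface through the SCSI mid-layer as ordinary block devices, applications touch them exactly as they would a local disk. The sketch below reads the first sector of such a device with plain POSIX calls; the device name is hypothetical, and nothing in the code is SRP- or InfiniBand-specific, which is precisely the point.

    /* Read the first 512 bytes of a (hypothetical) SRP-attached disk.
     * The code cannot tell whether /dev/sdc is a local SATA drive or an
     * SRP target reached over the InfiniBand fabric. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/sdc", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        unsigned char sector[512];
        ssize_t n = pread(fd, sector, sizeof(sector), 0);
        if (n < 0) { perror("pread"); close(fd); return 1; }

        printf("read %zd bytes from the first sector\n", n);
        close(fd);
        return 0;
    }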

iSER (iSCSI Extensions for RDMA) eliminates the traditional iSCSI and TCP bottlenecks by enabling zero-copy RDMA, offloading CRC calculations in the transport layer to the hardware, and working with message boundaries instead of streams. It leverages iSCSI management and discovery facilities and uses SLP and iSNS for global storage naming. The iSER specification for InfiniBand and other RDMA fabrics is driven within the Internet Engineering Task Force (IETF) (http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-05.txt), and the IBTA has created an annex for support of iSER over InfiniBand.

iSER follows the same method of plugging into Linux through the SCSI mid-layer. However, as seen in Figure 5 below, iSER works over an extra layer of abstraction (the CMA, or Connection Manager Abstraction layer) to enable transparent operation over both InfiniBand and iWARP-based RDMA technologies. Applications that interface with libc at the user level and with the Linux file systems at the kernel level work transparently, unaware of the interconnect technology being used underneath.

Figure 5: iSER in the Linux storage stack. Linux file systems sit on the SCSI mid-layer; iSER runs over the CMA, which operates over an InfiniBand HCA or an iWARP NIC, alongside conventional HBA drivers (SCSI, FC, SATA).

1.5 Supporting NFS-based Applications

Network File System (NFS) over RDMA is a protocol being developed by the Internet Engineering Task Force (IETF). This effort extends NFS to take advantage of the RDMA features of the InfiniBand architecture and other RDMA-enabled fabrics. As shown in Figure 6 below, the NFS-RDMA client plugs into the RPC switch layer and the standard NFS v2/v3 layer in the Linux kernel. The RPC switch directs NFS traffic either through the NFS-RDMA client or through the TCP/IP stack. Like the iSER implementation, the NFS-RDMA client works over the CMA to provide transparent support over InfiniBand and iWARP-based RDMA technologies. File system applications interface with the standard Linux file system layer and are unaware of the underlying interconnect.

An open source project is underway to develop an NFS-RDMA client at http://sourceforge.net/projects/nfs-rdma/. Mellanox provides a Linux NFS-RDMA package (see http://www.mellanox.com/products/nfs_rdma_sdk.php) that is ready for production use and is compliant with the OpenFabrics InfiniBand software shipped in major Linux OS distributions.

Figure 6: NFS-RDMA in the Linux kernel. Linux file systems use the standard NFS v2/v3 layer; the RPC switch directs traffic either through the NFS-RDMA client (over the CMA and an InfiniBand or iWARP NIC) or through the TCP/IP stack and an Ethernet NIC.

1.6 Software Ecosystem

InfiniBand software and protocols are supported and available on major Linux, Windows and virtual machine monitor (hypervisor) platforms. These include Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Microsoft Windows Server and Windows CCS (Compute Cluster Server), and VMware Virtual Infrastructure platforms.

1.7 Conclusion

InfiniBand offers advanced features that are attractive for high performance computing and data center applications that require high performance and no-compromise I/O services in the areas of throughput, latency and end-to-end service level guarantees. Using the protocols discussed in this article, applications that run today over lower performing interconnects can be transparently migrated, without any changes, to run over InfiniBand and take advantage of its advanced features. The InfiniBand RDMA-based software ecosystem is the most advanced in the industry today, allowing for multiple and reliable software sourcing options.

Copyright 2007. Mellanox Technologies. All rights reserved. www.mellanox.com