Introduction to InfiniBand. Hussein N. Harake, Performance U! Winter School




Agenda
- Definition of InfiniBand
- Features
- Hardware
- Bandwidth Facts
- Layers
- OFED Stack
- OpenSM
- Tools and Utilities
- Topologies
- InfiniBand Roadmap

Definition of InfiniBand
- A type of communication link that allows data to flow
- A communication fabric used in the HPC domain
- An open-standard interconnect
- Developed by the InfiniBand Trade Association, http://www.infinibandta.org/

Development
- The software stack is developed by the OpenFabrics community, http://www.openfabrics.org/
- An open-source software stack for HPC networks that require high bandwidth, scalability and low latency
- Distributed as the OpenFabrics Enterprise Distribution (OFED)

InfiniBand Features
- Scalable HPC network
- Low latency (~1 microsecond)
- Low CPU overhead
- QoS
- Failover
- Congestion control

InfiniBand Hardware: Switch

InfiniBand Hardware
- GT/s: gigatransfers per second

InfiniBand Hardware

InfiniBand Bandwidth Facts
- QDR links use 8b/10b encoding: for every 10 bits sent, the link carries 8 bits of data, so a 10 Gb/s lane carries 8 Gb/s of data; this applies to SDR, DDR and QDR
- 4x QDR bandwidth is therefore 32 Gb/s
- FDR links use 64b/66b encoding: every 66 bits sent carry 64 bits of data, so a 14 Gb/s lane (14.0625 Gb/s exactly) carries about 13.64 Gb/s of data
- 4x FDR bandwidth is therefore 54.54 Gb/s (worked out in the sketch below)
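These figures follow from multiplying the per-lane signaling rate by the lane count and the encoding efficiency. The short C sketch below is an illustration added to this transcript, not part of the original slides; the 14.0625 Gb/s FDR lane rate is the exact value behind the rounded 14 Gb/s above.

    /* Effective InfiniBand data rates from signaling rate and line encoding.
     * Illustrative only: lane rates are the per-lane figures quoted above. */
    #include <stdio.h>

    static double effective_gbps(double lane_gbps, int lanes,
                                 double payload_bits, double coded_bits)
    {
        return lane_gbps * lanes * (payload_bits / coded_bits);
    }

    int main(void)
    {
        /* 4x QDR: 4 lanes x 10 Gb/s, 8b/10b encoding -> 32 Gb/s of data */
        printf("4x QDR: %.2f Gb/s\n", effective_gbps(10.0, 4, 8.0, 10.0));
        /* 4x FDR: 4 lanes x 14.0625 Gb/s, 64b/66b encoding -> ~54.55 Gb/s */
        printf("4x FDR: %.2f Gb/s\n", effective_gbps(14.0625, 4, 64.0, 66.0));
        return 0;
    }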

InfiniBand Layers

InfiniBand Layers: Physical Layer
- How bits are placed on the wire
- Signaling protocol
- Link rates: link speed and link width
- Cables: copper and optical
- Connectors, etc.

InfiniBand Layers: Link Layer
- Packets, switching, QoS, flow control and data integrity
- Defines packet operation, format and protocols; packet payload (MTU) up to 4 KB
- Maintains link configuration and routes packets between source and destination within a subnet
- Packets are forwarded using a 16-bit Local ID (LID) assigned by the subnet manager
- VLs (Virtual Lanes): separate logical links that share one physical link; up to 15 data VLs plus 1 management VL per link
- Flow control manages data flow between links (point to point)
- Data integrity: two CRCs per packet, a Variant and an Invariant CRC

InfiniBand Layers: Network Layer
- Handles routing between different subnets
- Not required within a single subnet
- A Global Route Header (GRH) is used to route packets between IB subnets
- Every node must have a GUID (Globally Unique ID)
- 128-bit IPv6-style addresses are used for the source and destination of routed packets

InfiniBand Layers: Transport Layer
- QPs (Queue Pairs) are used to transport data from one point to another (a minimal verbs sketch follows below)
- Asynchronous interface: send queue, receive queue, and a completion queue
- IPoIB runs in two different transport modes (UD datagram mode, connected mode)
- Based on the MTU, the transport layer divides the data into packets and reassembles it using the BTH (Base Transport Header)
- Complete offload and kernel bypass: all of the above runs on the HCA, which enables low latency
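To make the QP concept concrete, here is a minimal libibverbs sketch added for this transcript rather than taken from the slides: it opens the first HCA and creates a completion queue and a reliable-connected queue pair. Memory registration, connection setup and most error handling are left out.

    /* Minimal libibverbs sketch: open the first HCA, create a completion
     * queue and a reliable-connected (RC) queue pair. Not a complete RDMA
     * program: memory registration and connection setup are omitted. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB device found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) return 1;

        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,     /* reliable connected transport */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("created QP number 0x%x\n", qp ? qp->qp_num : 0);

        if (qp) ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }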

InfiniBand Layers: Upper Level Protocols
ULPs fall into different categories:
- Network: IPoIB (IP over IB), SDP (Sockets Direct Protocol), WSD (Winsock Direct, for Windows)
- Storage: NFS over RDMA, SCSI RDMA Protocol (SRP), iSCSI Extensions for RDMA (iSER)
- Computing/Clustering: MPI (Message Passing Interface); a small example follows below
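For the clustering category, applications normally stay at the MPI level and let an IB-aware MPI library carry the messages over RDMA verbs underneath, so the application code is interconnect-agnostic. A minimal illustrative example, added here and not part of the slides:

    /* Minimal MPI example: rank 0 sends a message to rank 1. An IB-aware
     * MPI library moves this traffic over InfiniBand verbs transparently. */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[64] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(buf, "hello over InfiniBand");
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received: %s\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

Launched with two processes (for example mpirun -np 2 and the compiled binary; the exact launcher depends on the MPI distribution), rank 1 prints the message it received from rank 0.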

OFED Stack
- OFED (OpenFabrics Enterprise Distribution) is a complete software stack: drivers, core, protocols, agents, services, management tools, libraries, etc.
- The OFED stack can be obtained from http://www.openfabrics.org/
- Mellanox provides a precompiled version that includes new features, some fixes, and firmware updates for Mellanox products
- For Intel (QLogic) hardware, download the OFED stack from OpenFabrics or use Intel's precompiled version

OpenSM
- OpenSM and the subnet agents are part of the OFED stack
- The Subnet Manager (SM) manages the fabric:
  - Discovers and configures devices
  - Assigns a Local ID (LID) to every port (see the port-query sketch below)
  - Provides routing tables
- Subnet agents: performance management, connection manager, communication management, device management, etc.
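The effect of the SM is easy to observe from a node. The sketch below, added for this transcript, queries port 1 of the first HCA through libibverbs and prints the LID that OpenSM assigned; without a running subnet manager the LID stays 0 and the port does not reach the ACTIVE state.

    /* Print the state and the SM-assigned LID of port 1 on the first HCA. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB device found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_port_attr port;
        if (ctx && ibv_query_port(ctx, 1, &port) == 0)
            printf("port 1: state %d, LID 0x%04x\n",
                   (int)port.state, (unsigned)port.lid);

        if (ctx) ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }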

Tools and Utilities
Tools and utilities are available to manage the fabric, run diagnostics and measure performance (a small verbs example follows the list).
- Node or port information: ibstat, ibstatus, ibv_devinfo
- Performance measurement: ib_write_bw, ib_read_bw, ib_send_lat, ib_read_lat
- Network diagnostic and query tools: ibdiagnet, ibhosts, iblinkinfo.pl, ibswitches
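The node-level tools above sit on top of the verbs library. As a rough illustration of the kind of information ibstat and ibv_devinfo report (this is not their actual source code), the sketch below lists each HCA with its firmware version and port count:

    /* List HCAs with firmware version and number of physical ports. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs) { fprintf(stderr, "no IB devices found\n"); return 1; }

        for (int i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_device_attr attr;
            if (ctx && ibv_query_device(ctx, &attr) == 0)
                printf("%s: fw %s, %d port(s)\n",
                       ibv_get_device_name(devs[i]),
                       attr.fw_ver, (int)attr.phys_port_cnt);
            if (ctx)
                ibv_close_device(ctx);
        }

        ibv_free_device_list(devs);
        return 0;
    }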

InfiniBand Hardware: Connect-IB, PCIe 3.0 x16

Manage InfiniBand

Topologies
Topologies available over IB:
- Fat-tree (scale sketch below)
- Mesh
- 2D Torus
- 3D Torus
- Dragonfly
- Etc.
Fat-tree topology
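For a rough sense of fat-tree scale (an addition to this transcript, using the standard non-blocking folded-Clos formulas rather than figures from the slides): k-port switches support k*k/2 hosts with two switch levels and k*k*k/4 with three.

    /* Host counts of a non-blocking fat-tree built from k-port switches,
     * using the standard folded-Clos formulas (assumption, not from slides). */
    #include <stdio.h>

    int main(void)
    {
        int k = 36;   /* ports per switch, e.g. a 36-port QDR/FDR switch */
        printf("2-level fat-tree: %d hosts\n", k * k / 2);       /*   648 */
        printf("3-level fat-tree: %d hosts\n", k * k * k / 4);   /* 11664 */
        return 0;
    }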

Topologies (continued): Up/Down routing on a broken fat-tree

Topologies

Topologies: Mesh network

Topologies: 2D Torus

Routing Algorithms
- Min Hop: shortest path, but not deadlock free
- UPDN: up/down routing for topologies that are not a pure fat-tree, where a loop in the subnet could otherwise cause deadlock
- Fat-tree: used if the topology is a pure fat-tree
- DOR: Min Hop based and deadlock free for mesh and hypercube topologies
- LASH: deadlock free and uses the shortest path
- Torus-2QoS: DOR based and deadlock free, used for 2D and 3D torus

InfiniBand Roadmap: http://www.infinibandta.org/content/pages.php?pg=technology_overview

HPC Advisory 2013, Lugano

Thank you for your attention.