Advanced Computer Networks. High Performance Networking I

Advanced Computer Networks (263-3501-00): High Performance Networking I. Patrick Stuedi, Spring Semester 2014. Oriana Riva, Department of Computer Science, ETH Zürich

Outline. Last week: Wireless TCP. Today: High Performance Networking, Part I.

Appendix: Mobile IPv6 uses the IPv6 routing header.

Course Overview
- Wireless networking technologies: first half of this course
- Datacenter networking: second half of this course (we are now here)
- Basics are covered in the introductory ETH Operating Systems and Networks course

Overview
High-Performance Computing (HPC) / Supercomputers:
- Systems designed from scratch
- Densely packed, short rack-to-rack cables
- Expensive, built from custom high-end components
- Mostly run a single program at a time, e.g., message passing (MPI) applications
Cloud Computing / Datacenters:
- Often built from commodity off-the-shelf hardware
- May run multiple jobs at the same time
- Often multi-tenant: different jobs running in the datacenter have been developed or deployed by different people
- Often use virtualization: hardware is multiplexed, e.g., multiple virtual machines per host
- Run cloud workloads:
  - Internet-based applications: email, social networks, maps, search
  - Analytics: MapReduce/Hadoop, Pregel, NoSQL, NewSQL, etc.

IBM Blue Gene/P supercomputer

Blue Gene: Cabling

Blue Gene Supercomputer Overview
- Blue Gene is a family of supercomputers from IBM: Blue Gene/L (2004), Blue Gene/P (2006), Blue Gene/Q (2010)
- Blue Gene/L: a 64K-node, highly integrated supercomputer; many of the components (processor, network, router) are on the same chip
- Blue Gene/L was the #1 supercomputer on the Top500 list from November 2004 to June 2008

Blue Gene: Dense Packaging

Design Motivation: Processor Clock Frequency Scaling Ends
Three decades of exponential clock rate (and electrical power!) growth have ended, yet Moore's Law continues in transistor count. What do we do with all those transistors to keep performance increasing to meet demand?
- Industry response: multi-core, i.e., double the number of cores every 18 months instead of the clock frequency (and power)
- But the added transistors can also be used for other functions such as memory/storage controllers, embedded networks, etc.
Source: The Landscape of Computer Architecture, John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007.

Blue Gene/P System-on-a-Chip Compute Node: the network logic is a fraction of the compute ASIC's complexity/area. (Figure: BG/P node card.)

Network Topology: Basics
Topologies can be classified into:
- Direct networks: processing nodes are directly attached to the switching fabric
- Indirect networks: separate processing nodes and switching elements
In direct networks, nodes often have very few ports (2, 3, 4, ...); low-port networks are also called low-radix networks. Elements (e.g., switches) in indirect networks often have higher port counts (16, 32, 64, 128, ...); high-port networks are also called high-radix networks.

Criteria for choosing a particular network topology:
- Path length between two nodes: the more hops, the higher the latency and the more congestion in the network
- Cost
- Bisection bandwidth: the rate at which communication can take place between one half of a cluster and the other; typically this refers to the worst-case bisection
- Path redundancy: multiple paths between source/destination nodes; affects reliability, bandwidth, etc.

Direct Networks: Mesh, Torus, Hypercube
Notation: <k>-ary-<n>-mesh or <k>-ary-<n>-torus
- k: radix, the number of elements in each dimension (a different meaning than in the term "high-radix network")
- n: number of dimensions
- The radix k does not have to be the same in each dimension
Examples: a) 10-ary 1-torus, b) 5-ary 2-torus, c) 3-ary 3-torus

Direct Networks: Mesh, Torus, Hypercube (2)
- Cost effective at scale
- Allows for very dense packaging (a single node card with compute element and switching element)
- Great performance for applications with locality: computation depends on results from neighboring nodes (many MPI applications have this property)
- Simple expansion for future growth: just append nodes along one of the dimensions
- Good path redundancy

Example: Bisection Bandwidth
Bisection bandwidth: the minimal number of links that must be removed to partition the network into two equal halves.
- 4-ary 2-mesh: bisection bandwidth 4
- 2-ary 3-mesh: bisection bandwidth 4
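These counts follow from the standard formula for a k-ary n-mesh with even radix: cutting the network in half across one dimension severs k^(n-1) links (a torus adds wrap-around links, doubling the count). A small sketch that reproduces the two slide examples (helper names are mine, not from the slides):

```c
#include <stdio.h>

/* Bisection width of a k-ary n-mesh (k even): cutting the network in half
 * across one dimension severs one link per node of an (n-1)-dimensional
 * slice, i.e. k^(n-1) links. A torus adds wrap-around links, doubling it. */
static long bisection_mesh(int k, int n) {
    long links = 1;
    for (int i = 0; i < n - 1; i++)
        links *= k;
    return links;
}

static long bisection_torus(int k, int n) {
    return 2 * bisection_mesh(k, n);
}

int main(void) {
    printf("4-ary 2-mesh:  %ld\n", bisection_mesh(4, 2));  /* 4, as on the slide */
    printf("2-ary 3-mesh:  %ld\n", bisection_mesh(2, 3));  /* 4, as on the slide */
    printf("4-ary 2-torus: %ld\n", bisection_torus(4, 2)); /* 8 */
    return 0;
}
```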

Example: IBM Blue Gene 3D Torus Network
- Interconnects the compute nodes; communication backbone for computation
- In Blue Gene/L: 32x32x64 connectivity
- Worst-case diameter: 16+16+32 = 64 hops
- Consecutive packets can follow different routes
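The 64-hop figure is simply the sum of the per-dimension worst cases, radix/2 hops in a torus; a tiny sketch:

```c
#include <stdio.h>

/* Worst-case hop count of a torus: in each dimension the farthest node
 * is half-way around the ring, i.e. radix[i] / 2 hops away. */
static int torus_diameter(const int *radix, int ndims) {
    int hops = 0;
    for (int i = 0; i < ndims; i++)
        hops += radix[i] / 2;
    return hops;
}

int main(void) {
    int bluegene_l[] = { 32, 32, 64 };                   /* 32x32x64 torus */
    printf("%d hops\n", torus_diameter(bluegene_l, 3));  /* 16+16+32 = 64 */
    return 0;
}
```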

High Performance Networking: Layer-2 / Interconnect Technologies
Supercomputer interconnect technologies through the ages:
- Ten years ago (2002): many different interconnect technologies; Myrinet takes about 30%
- In 2010: Gigabit Ethernet takes 50%, InfiniBand 41%
Datacenter interconnects: almost entirely Ethernet.

InfiniBand vs. Ethernet
InfiniBand (IB):
- Low latency: ~1 us between two directly connected boxes
- High bandwidth: data rates of SDR 10 Gbit/s, DDR 20 Gbit/s, QDR 40 Gbit/s
- Supports the RDMA interface (Remote Direct Memory Access): no OS involvement during transmission and reception of packets
Ethernet:
- 10GbE has 5-6 times the latency of InfiniBand
- 40GbE and 100GbE are in the pipeline
Both IB and Ethernet can be operated with a switched fabric topology.

Network Latencies in Datacenters
Factors that contribute to latency in TCP datacenters:
- Delay: the cost of a single traversal of the component
- RTT: the total cost of a round trip traversing 5 switches in each direction
OS overhead per packet exchanged between two hosts attached to the same switch: (2*15)/(2*2.5 + 2*15 + 10) ≈ 66% (!)

Packet Processing Overhead
(Figure: user space, kernel space, DMA-accessible memory area.)
Sending side:
- Data is copied from the application buffer into a socket buffer
- Data is DMA-copied into the NIC buffer
Receiving side:
- Data is DMA-copied from the NIC buffer into a socket buffer
- Data is copied into the application buffer
- The application is scheduled (context switch)
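To connect this to a familiar API: with ordinary sockets these copies are implicit in every send()/recv() call. The minimal sketch below (peer address and port are made up, error handling omitted) marks where the user-to-kernel copy on the sending side happens:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    char payload[4096];
    memset(payload, 'x', sizeof(payload));

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                      /* hypothetical port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);  /* hypothetical peer */
    connect(fd, (struct sockaddr *)&peer, sizeof(peer));

    /* send() returns once the kernel has copied 'payload' from the
     * application buffer into a socket buffer; the DMA into the NIC buffer
     * and the actual transmission happen later, asynchronously. This
     * per-call copy is part of the overhead discussed on this slide. */
    send(fd, payload, sizeof(payload), 0);

    close(fd);
    return 0;
}
```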

Throughput and CPU Load at 1 Gbit/s and 10 Gbit/s
- Throughput is limited because of high CPU load
- The RX side is typically more CPU intensive because it is highly asynchronous

TCP Offloading
What is TCP offloading? Moving IP and TCP processing to the network interface (NIC).
Main justifications for TCP offloading:
- Reduction of host CPU cycles for protocol header processing and checksumming
- Fewer CPU interrupts
- Fewer bytes copied over the memory bus
- Potential to offload expensive features such as encryption

TCP Offload Engines (TOEs)

Problems of TCP Offloading
- Moore's Law worked against smart NICs: CPUs used to be fast enough (now there are many cores, cores don't get faster, and network processing is hard to parallelize)
- TCP/IP headers don't take many CPU cycles
- TOEs impose complex interfaces; the protocol between the TOE and the CPU can be worse than TCP
- Connection management overhead: for short connections it overwhelms any savings

Where TCP Offload Helps
The sweet spot for TCP offload might be applications with:
- Very high bandwidth
- Relatively low end-to-end latency network paths
- Long connection durations
- Relatively few connections
Typical examples: storage-server access, cluster interconnects.

User-Level Networking: Remove the OS from the Data Path
Transport offloading is not enough: system call overhead, context switches, and memory copying remain.
U-Net (von Eicken, Basu, Buch, Vogels, Cornell University, 1995):
- A virtual network interface that allows applications to send and receive messages without operating system intervention
- Moves all buffer management and packet processing to user space (zero-copy)

U-Net Overview
a) Traditional networking architecture: the kernel controls the network; all communication goes via the kernel
b) U-Net architecture: applications access the network directly via a MUX; the kernel is involved only in connection setup

U-Net Building Blocks
- Endpoints: the application's handle into the network
- Buffer areas: hold message data for sending, or buffer space for receiving
- Message queues: hold descriptors pointing into the buffer area
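A rough sketch of how these building blocks fit together as data structures; the struct and field names below are illustrative, not taken from the U-Net paper or any real API:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a descriptor points into the endpoint's buffer area
 * and is what gets pushed onto the send/receive/free queues. */
struct unet_descriptor {
    uint32_t buffer_offset;   /* offset into the buffer area */
    uint32_t length;          /* message length in bytes */
    uint32_t tag;             /* communication-segment tag for demultiplexing */
};

struct unet_queue {
    struct unet_descriptor *slots;
    size_t capacity;
    size_t head, tail;        /* managed by application and NIC, no kernel involved */
};

/* An endpoint: the application's handle into the network. */
struct unet_endpoint {
    void             *buffer_area;   /* pinned memory holding message data */
    size_t            buffer_size;
    struct unet_queue send_queue;    /* descriptors of messages to transmit */
    struct unet_queue recv_queue;    /* descriptors of received messages */
    struct unet_queue free_queue;    /* empty buffers available for reception */
};
```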

U-Net Communication
(Figure: applications with endpoints multiplexed onto the U-Net NI.)
Initialization:
- Create one or more endpoints
- Register communication segments with the endpoint and associate them with a tag
Sending:
- Compose the data in the communication segment
- Push a descriptor for the message onto the send queue
- The NIC transmits the message after marking it with the appropriate message tag
Receiving:
- Incoming messages are demultiplexed based on the message tag
- The NIC places the data into the target buffer of the application
- A message descriptor is pushed onto the receive queue

History of User-Level Networking
- U-Net was one of the first (if not the first) systems to propose OS bypass
- Other early works: SHRIMP: Virtual Memory Mapped Interfaces, IEEE Micro, 1995; Separating Data and Control Transfer in Distributed Operating Systems, Thekkath et al., ASPLOS '94
- The U-Net efforts eventually resulted in the Virtual Interface Architecture (VIA), a specification jointly proposed by Compaq, Intel and Microsoft in 1997
- The VIA architecture has led to the implementation of various high-performance networking stacks: InfiniBand, iWARP, RoCE, commonly referred to as RDMA network stacks (RDMA = Remote Direct Memory Access)

RDMA Architecture
(Figure: the traditional socket stack (socket layer, TCP/UDP, IP, Ethernet NIC driver in the kernel) next to the RDMA verbs user library on top of an RDMA-enabled NIC.)
- The traditional socket interface involves the kernel on every operation
- The RDMA interface involves the kernel only on the control path; on the data path, the RDMA-capable NIC (RNIC) is accessed directly from user space
- A dedicated verbs interface is used for RDMA instead of the traditional socket interface

RDMA Queue Pairs (QPs)
Applications use the 'verbs' interface to:
- Register memory: the operating system makes sure the memory is pinned and accessible by DMA
- Create a queue pair (QP): a send/recv queue pair
- Create a completion queue (CQ): the RNIC puts a new completion-queue element into the CQ after an operation has completed
- Send/receive data: place a work-request element (WQE) into the send or recv queue; the WQE points to a user buffer and defines the type of the operation (e.g., send, recv, ...)
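A minimal sketch of this sequence using the libibverbs ('verbs') API; it assumes an already-opened device context and omits error handling, QP state transitions, and connection setup (e.g., via rdma_cm):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 4096

/* Registers a buffer, creates a CQ and a QP, and posts one receive WQE.
 * 'ctx' is an already-opened device context (e.g., from ibv_open_device). */
struct ibv_qp *setup_qp(struct ibv_context *ctx, char **buf_out,
                        struct ibv_mr **mr_out, struct ibv_cq **cq_out)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register memory: the buffer is pinned and made DMA-accessible. */
    char *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE, IBV_ACCESS_LOCAL_WRITE);

    /* Completion queue: the RNIC deposits a CQE here when a WQE completes. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Queue pair: one send queue and one receive queue. */
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;          /* reliable connected */
    attr.cap.max_send_wr = 16;
    attr.cap.max_recv_wr = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* Post a receive WQE pointing at the registered buffer. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = BUF_SIZE,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_post_recv(qp, &wr, &bad_wr);

    *buf_out = buf; *mr_out = mr; *cq_out = cq;
    return qp;
}
```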

RDMA Queue Pairs (QPs), continued: note that this model (registered memory, send/recv queues, descriptors, completion notification) is much like U-Net.

RDMA Operations
Send/Receive:
- Two-sided operation: data exchange naturally involves both ends of the communication channel
- Each send operation must have a matching receive operation
- The send WR specifies where the data should be taken from
- The receive WR on the remote machine specifies where the inbound data is to be placed
RDMA (Remote Direct Memory Access):
- Two independent operations: RDMA Read and RDMA Write
- Only the application issuing the operation is actively involved in the data transfer
- An RDMA Write specifies not only where the data should be taken from, but also where it is to be placed (remotely)
- An RDMA Read requires a buffer advertisement prior to the data exchange
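To make the contrast concrete, here is a libibverbs sketch of posting the two kinds of work requests. For the RDMA Write, the remote address and rkey must have been advertised by the peer beforehand (e.g., via an earlier send/recv exchange), and the peer's memory region must have been registered with remote-write access:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Two-sided: a plain send. The matching receive WQE on the peer decides
 * where the data ends up. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_SEND;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;     /* generate a CQE on completion */
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* One-sided: an RDMA Write. The issuer names both the local source buffer
 * and the remote destination (address and rkey advertised earlier). */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;  /* where to place the data remotely */
    wr.wr.rdma.rkey = rkey;                /* remote memory region key */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```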

Example: RDMA Send/Recv (1)
- Sender and receiver have created their QPs and CQs
- The sender has registered a buffer for sending
- The receiver has registered a buffer for receiving

Example: RDMA Send/Recv (2)
- The receiver places a WQE into its receive queue
- The sender places a WQE into its send queue

Example: RDMA Send/Recv (3)
- Data is transferred between the hosts
- This involves two DMA transfers, one at the sender and one at the receiver

Example: RDMA Send/Recv (4)
- After the operation has finished, a CQE is placed into the completion queue of the sender
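The sender learns about this completion by draining its completion queue; a minimal polling loop with libibverbs might look like this:

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Busy-poll the CQ until one work completion arrives, then check its status. */
static int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking; returns #CQEs drained */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "work request %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}
```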

RDMA Implementations
InfiniBand:
- Developed by Compaq, HP, IBM, Intel, Microsoft and Sun Microsystems
- Provides RDMA semantics; first spec released in 2000
- Based on a point-to-point switched fabric
- Designed from the ground up (has its own physical layer, switches, NICs, etc.)
iWARP (Internet Wide Area RDMA Protocol):
- RDMA semantics implemented over offloaded TCP/IP
- Requires custom NICs, but uses Ethernet
RoCE:
- RDMA semantics implemented directly over Ethernet
All of these implementations can be programmed through the verbs interface.

Typical CPU loads for three network stack implementations

Performance: Mellanox ConnectX-2, bare-metal and from within a virtual machine. RDMA read latency (one-sided operation): 2-3 us.

References
- High Performance Datacenter Networks: Architecture, Algorithms, and Opportunities, Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2010
- An Overview of the BlueGene/L Supercomputer, The BlueGene/L Team, 2002
- U-Net: A User-Level Network Interface for Parallel and Distributed Computing, SOSP 1995