Advanced Computer Networks (263-3501-00): High Performance Networking I. Patrick Stuedi, Spring Semester 2014. Oriana Riva, Department of Computer Science, ETH Zürich.
Outline. Last week: Wireless TCP. Today: High Performance Networking, Part I.
Appendix: Mobile IPv6 uses the IPv6 routing header.
Course Overview. Wireless networking technologies: first half of this course. Datacenter networking: second half of this course (we are now here). The basics are covered in the introductory ETH Operating Systems and Networks course.
Overview
High-Performance Computing (HPC) / Supercomputers:
- Systems designed from scratch, densely packed, short rack-to-rack cables
- Expensive, built from custom high-end components
- Mostly run a single program at a time, e.g., message-passing (MPI) applications
Cloud Computing:
- Datacenters often built from commodity off-the-shelf hardware
- May run multiple jobs at the same time
- Often multi-tenant: different jobs running in the datacenter have been developed or deployed by different people
- Often use virtualization: hardware is multiplexed, e.g., multiple virtual machines per host
- Run cloud workloads: Internet-based applications (email, social networks, maps, search) and analytics (MapReduce/Hadoop, Pregel, NoSQL, NewSQL, etc.)
IBM Blue Gene/P supercomputer
Blue Gene: Cabling
Blue Gene Supercomputer Overview
- Blue Gene is a family of supercomputers from IBM: BlueGene/L (2004), BlueGene/P (2006), BlueGene/Q (2010)
- Blue Gene/L: 64K-node highly integrated supercomputer; many of the components (processor, network, router) are on the same chip
- Blue Gene/L was the #1 supercomputer as ranked by the Top500 list from November 2004 to June 2008
BlueGene: Dense Packaging
Design Motivation: Processor Clock Frequency Scaling Ends
- Three decades of exponential clock-rate (and electrical power!) growth have ended
- Yet Moore's Law continues in transistor count
- What do we do with all those transistors to keep performance increasing to meet demand?
- Industry response: multi-core, i.e., double the number of cores every 18 months instead of the clock frequency (and power!)
- But the added transistors can also be used for other functions such as memory/storage controllers, embedded networks, etc.
Source: The Landscape of Computer Architecture, John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007
Blue Gene/P system-on-a-chip compute node and BG/P node card: the network logic is only a fraction of the compute ASIC's complexity/area.
Network Topology: Basics
Topologies can be classified into:
- Direct networks: processing nodes are directly attached to the switching fabric
- Indirect networks: separate processing nodes and switching elements
In direct networks, nodes often have very few ports (2, 3, 4, ...); low-port networks are also called low-radix networks.
Elements (e.g., switches) in indirect networks often have higher port counts (16, 32, 64, 128, ...); high-port networks are also called high-radix networks.
Criteria for choosing a particular network topology
- Path length between two nodes: the more hops, the higher the latency and the more congestion in the network
- Cost
- Bisection bandwidth: the rate at which communication can take place between one half of a cluster and the other; typically this refers to the worst-case bisection
- Path redundancy: multiple paths between src/dst nodes; affects reliability, bandwidth, etc.
Direct Networks: Mesh, Torus, Hypercube
Notation: <k>-ary-<n>-mesh or <k>-ary-<n>-torus
- k: radix, the number of elements in each dimension (a different meaning than in the term "high-radix network")
- n: number of dimensions
- The radix k does not have to be the same in each dimension
Examples: a) 10-ary 1-torus, b) 5-ary 2-torus, c) 3-ary 3-torus
Direct Networks: Mesh, Torus, Hypercube (2)
- Cost effective at scale
- Allows for very dense packaging (a single node card with compute element and switching element)
- Great performance for applications with locality: the computation depends on results of computations on neighboring nodes (many MPI applications have this property)
- Simple expansion for future growth: just append nodes along one of the dimensions
- Good path redundancy
Example: Bisection Bandwidth
Bisection bandwidth: the minimal number of links (arcs) that must be removed to partition the network into two equal halves.
- 4-ary 2-mesh: bisection bandwidth 4
- 2-ary 3-mesh: bisection bandwidth 4
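As a small illustration of the arithmetic (a sketch, assuming a uniform and even radix k and a bisection cut perpendicular to one dimension): a k-ary n-mesh then has a bisection of k^(n-1) links, which matches both examples above.

#include <stdio.h>

/* Bisection bandwidth (in links) of a k-ary n-mesh with uniform, even radix k:
 * cutting the network in half perpendicular to one dimension severs exactly
 * one link per node on the cut plane, i.e. k^(n-1) links.
 * (For a k-ary n-torus the wrap-around links double this to 2*k^(n-1).) */
static long bisection_mesh(int k, int n)
{
    long links = 1;
    for (int i = 0; i < n - 1; i++)
        links *= k;
    return links;
}

int main(void)
{
    printf("4-ary 2-mesh: %ld links\n", bisection_mesh(4, 2)); /* 4 */
    printf("2-ary 3-mesh: %ld links\n", bisection_mesh(2, 3)); /* 4 */
    return 0;
}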
Example: IBM Blue Gene 3D Torus Network
- Interconnects the compute nodes; communication backbone for computation
- In BlueGene/L: 32x32x64 connectivity
- Worst-case diameter: 16+16+32 = 64 hops
- Consecutive packets can follow different routes
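A minimal sketch of the hop-count arithmetic, assuming minimal routing on a torus with wrap-around links, where each dimension contributes at most floor(k/2) hops:

#include <stdio.h>

/* Worst-case hop count (diameter) of a torus: with wrap-around links and
 * minimal routing, each dimension contributes at most floor(k_i / 2) hops,
 * since traffic can go either way around the ring. */
static int torus_diameter(const int *dims, int n)
{
    int hops = 0;
    for (int i = 0; i < n; i++)
        hops += dims[i] / 2;
    return hops;
}

int main(void)
{
    int bluegene_l[] = { 32, 32, 64 };           /* BlueGene/L 3D torus */
    printf("worst-case diameter: %d hops\n",
           torus_diameter(bluegene_l, 3));        /* 16 + 16 + 32 = 64 */
    return 0;
}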
High Performance Networking: Layer-2 / Interconnect Technologies
Supercomputer interconnect technologies through the ages:
- Ten years ago (2002): many different interconnect technologies; Myrinet accounts for about 30%
- In 2010: Gigabit Ethernet accounts for 50%, InfiniBand for 41%
Datacenter interconnects: almost entirely Ethernet
Infiniband vs Ethernet
Infiniband (IB):
- Low latency: ~1 us between two directly connected boxes
- High bandwidth: data rates of SDR 10 Gbit/s, DDR 20 Gbit/s, QDR 40 Gbit/s (per 4x link)
- Supports the RDMA interface (Remote Direct Memory Access): no OS involvement during transmission and reception of packets
Ethernet:
- 10GbE has 5-6 times the latency of Infiniband
- 40GbE and 100GbE are in the pipeline
Both IB and Ethernet can be operated with a switched-fabric topology
Network Latencies in Datacenters
Factors that contribute to latency in TCP datacenters:
- Delay: cost of a single traversal of a component
- RTT: total cost of a round trip traversing 5 switches in each direction
- OS overhead per packet exchanged between two hosts attached to the same switch: (2*15) / (2*2.5 + 2*15 + 10) = 66% (!!)
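A small sketch of where the 66% figure comes from, assuming the per-component one-way delays implied by the formula (roughly 15 us for the OS network stack, 2.5 us for the NIC, and 10 us for the switch):

#include <stdio.h>

int main(void)
{
    /* Assumed one-way traversal costs (microseconds), as implied by the
     * formula on the slide; exact figures depend on the measurement. */
    double os_us     = 15.0;  /* kernel network stack, per host */
    double nic_us    = 2.5;   /* NIC / DMA, per host            */
    double switch_us = 10.0;  /* a single switch hop            */

    /* Two hosts attached to the same switch: the packet crosses two OS
     * stacks, two NICs and one switch. */
    double total_us = 2 * os_us + 2 * nic_us + switch_us;   /* 45 us  */
    double os_share = (2 * os_us) / total_us;                /* ~0.66  */

    printf("one-way delay: %.1f us, OS share: %.1f%%\n",
           total_us, 100.0 * os_share);
    return 0;
}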
Packet Processing Overhead
Sending side:
- Data is copied from the application buffer into a socket buffer (user space to kernel space)
- Data is DMA-copied from the socket buffer into the NIC buffer
Receiving side:
- Data is DMA-copied from the NIC buffer into a socket buffer (DMA-accessible memory area)
- Data is copied into the application buffer
- The application is scheduled (context switching)
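To make these costs explicit, here is a plain POSIX-sockets fragment (illustration only; fd is assumed to be a connected TCP socket) with the copies and the context switch called out in comments:

#include <sys/types.h>
#include <sys/socket.h>

/* Traditional socket data path: every send()/recv() crosses the user/kernel
 * boundary and involves at least one data copy on each side, in addition to
 * the DMA transfer to/from the NIC. */
ssize_t echo_once(int fd, char *buf, size_t len)
{
    /* send(): system call; the kernel copies buf into a socket buffer, then
     * the NIC DMAs the socket buffer onto the wire. */
    if (send(fd, buf, len, 0) < 0)
        return -1;

    /* recv(): the NIC DMAs the inbound frame into a kernel socket buffer;
     * the kernel copies it into buf; the blocked application is rescheduled
     * (context switch) before recv() returns. */
    return recv(fd, buf, len, 0);
}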
Throughput and CPU load at 1 Gbit/s and 10 Gbit/s
- Throughput is limited by high CPU load
- The RX side is typically more CPU intensive because it is highly asynchronous
TCP Offloading
What is TCP offloading? Moving IP and TCP processing to the network interface (NIC).
Main justifications for TCP offloading:
- Reduction of host CPU cycles for protocol header processing and checksumming
- Fewer CPU interrupts
- Fewer bytes copied over the memory bus
- Potential to offload expensive features such as encryption
TCP Offload Engines (TOEs)
Problems of TCP offloading
- Moore's Law worked against smart NICs: CPUs used to be fast enough; now there are many cores, cores don't get faster, and network processing is hard to parallelize
- TCP/IP headers don't take many CPU cycles anyway
- TOEs impose complex interfaces: the protocol between the TOE and the CPU can be worse than TCP
- Connection management overhead: for short connections it overwhelms any savings
Where TCP offload helps
The sweet spot for TCP offload might be applications with:
- Very high bandwidth
- Relatively low end-to-end latency network paths
- Long connection durations
- Relatively few connections
Typical examples: storage-server access, cluster interconnects
User-level networking: Remove the OS from the data path
Transport offloading is not enough: system call overhead, context switches and memory copying remain.
U-Net (von Eicken, Basu, Buch, Vogels, Cornell University, 1995):
- A virtual network interface that allows applications to send and receive messages without operating system intervention
- Moves all buffer management and packet processing to user space (zero-copy)
U-Net Overview
a) Traditional networking architecture: the kernel controls the network; all communication goes via the kernel
b) U-Net architecture: applications access the network directly via a MUX; the kernel is involved only in connection setup
U-Net Building Blocks
- Endpoints: the application's handle into the network
- Buffer areas: hold message data for sending, or provide buffer space for receiving
- Message queues: hold descriptors pointing into the buffer area
U-Net communication
Initialization:
- Create one or more endpoints
- Register a communication segment with the endpoint and associate it with a tag
Sending:
- Compose the data in the communication segment
- Push a descriptor for the message onto the send queue
- The NIC transmits the message after marking it with the appropriate message tag
Receiving:
- Incoming messages are de-multiplexed based on the message tag
- Data is placed into the application's target buffer by the NIC
- A message descriptor is pushed onto the receive queue
History of User-Level Networking
- U-Net was one of the first (if not the first) systems to propose OS bypassing
- Other early works: SHRIMP: Virtual Memory Mapped Interfaces, IEEE Micro, 1995; Separating Data and Control Transfer in Distributed Operating Systems, Thekkath et al., ASPLOS '94
- The efforts around U-Net eventually resulted in the Virtual Interface Architecture (VIA) specification, jointly proposed by Compaq, Intel and Microsoft, 1997
- The VIA architecture has led to the implementation of various high-performance networking stacks: Infiniband, iWARP, RoCE, commonly referred to as RDMA network stacks (RDMA = Remote Direct Memory Access)
RDMA Architecture
(Figure: the traditional socket stack (application, socket layer, TCP/UDP, IP, Ethernet NIC driver in the kernel) next to the RDMA stack (application, RDMA verbs user library, kernel module, RDMA-enabled NIC).)
- The traditional socket interface involves the kernel on every data transfer
- The RDMA interface involves the kernel only on the control path; on the data path the RDMA-capable NIC (RNIC) is accessed directly from user space
- A dedicated verbs interface is used for RDMA instead of the traditional socket interface
RDMA Queue Pairs (QPs)
Applications use the 'verbs' interface to:
- Register memory: the operating system makes sure the memory is pinned and accessible by DMA
- Create a queue pair (QP): a send/recv queue
- Create a completion queue (CQ): the RNIC puts a new completion-queue element into the CQ after an operation has completed
- Send/receive data: place a work-request element (WQE) into the send or recv queue; the WQE points to a user buffer and defines the type of the operation (e.g., send, recv, ...)
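As a rough sketch of how this looks with the libibverbs API (resource setup only; error handling and the connection setup via ibv_modify_qp, including the out-of-band exchange of QP numbers, LIDs/GIDs and rkeys, are omitted):

#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

    /* Register memory: the buffer gets pinned and made DMA-accessible;
     * mr->lkey / mr->rkey are later used in work requests. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Completion queue: the RNIC posts one CQE per completed operation. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Queue pair: a send queue and a receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,          /* reliable connected */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    (void)mr; (void)qp;
    return 0;
}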
RDMA Queue Pairs (QPs), continued: this is much like U-Net.
RDMA operations
Send/Receive:
- Two-sided operation: the data exchange naturally involves both ends of the communication channel
- Each send operation must have a matching receive operation
- The send WR specifies where the data should be taken from
- The receive WR on the remote machine specifies where the inbound data is to be placed
RDMA (Remote Direct Memory Access):
- Two independent one-sided operations: RDMA Read and RDMA Write
- Only the application issuing the operation is actively involved in the data transfer
- An RDMA Write specifies not only where the data should be taken from, but also where it is to be placed (remotely)
- An RDMA Read requires a buffer advertisement prior to the data exchange
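For illustration, a one-sided RDMA Write work request with libibverbs might look as follows (a sketch; qp, mr and buf are assumed to come from a setup like the one above, and remote_addr/remote_rkey must have been advertised by the peer beforehand):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA Write: the local buffer is written directly into the
 * remote buffer advertised via (remote_addr, remote_rkey); the remote CPU is
 * not involved in the transfer. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                    uint32_t len, uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 42,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,    /* IBV_WR_RDMA_READ for a read  */
        .send_flags = IBV_SEND_SIGNALED,    /* generate a CQE on completion */
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = remote_rkey,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}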
Example: RDMA Send/Recv (1) Sender and receiver have created their QPs and CQs; the sender has registered a buffer for sending, the receiver a buffer for receiving
Example: RDMA Send/Recv (2) The receiver places a WQE into its receive queue; the sender places a WQE into its send queue
Example: RDMA Send/Recv (3) Data is transferred between the hosts; this involves two DMA transfers, one at the sender and one at the receiver
Example: RDMA Send/Recv (4) After the operation has finished, a CQE is placed into the completion queue of the sender
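A sketch of these four steps with libibverbs, assuming qp, cq and mr from the setup sketch above and a QP that has already been connected:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Two-sided send/recv data path: the receiver pre-posts a receive WQE, the
 * sender posts a send WQE, and each side polls its CQ for the CQE that
 * signals completion. */

void post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(qp, &wr, &bad);            /* step (2), receiver side */
}

void post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);            /* step (2), sender side */
}

void wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)     /* step (4): busy-poll the CQ */
        ;
    /* wc.status == IBV_WC_SUCCESS on success; the NIC has already DMAed
     * the data (step 3) without any kernel involvement. */
}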
RDMA implementations
- Infiniband (Compaq, HP, IBM, Intel, Microsoft and Sun Microsystems): provides RDMA semantics; first spec released in 2000; based on a point-to-point switched fabric; designed from the ground up (has its own physical layer, switches, NICs, etc.)
- iWARP (Internet Wide Area RDMA Protocol): RDMA semantics implemented over offloaded TCP/IP; requires custom NICs, but uses Ethernet
- RoCE: RDMA semantics implemented directly over Ethernet
All of these implementations can be programmed through the verbs interface
Typical CPU loads for three network stack implementations
Performance: Mellanox ConnectX-2, measured bare-metal and from within a virtual machine; RDMA read latency (one-sided operation): 2-3 us
References
- High Performance Datacenter Networks: Architecture, Algorithms, and Opportunities, Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2010
- An Overview of the BlueGene/L Supercomputer, The BlueGene/L Team, 2002
- U-Net: A User-Level Network Interface for Parallel and Distributed Computing, SOSP 1995