Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet
1 Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet A Tutorial at CCGrid 11 by Dhabaleswar K. (DK) Panda The Ohio State University [email protected] Sayantan Sur The Ohio State University [email protected]
2 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 2
3 Current and Next Generation Applications and Computing Systems Diverse Range of Applications Processing and dataset characteristics vary Growth of High Performance Computing Growth in processor performance Chip density doubles every 18 months Growth in commodity networking Increase in speed/features + reducing cost Different Kinds of Systems Clusters, Grid, Cloud, Datacenters,.. 3
4 Cluster Computing Environment Compute cluster Frontend LAN LAN Storage cluster Compute Node Meta-Data Manager Meta Data Compute Node Compute Node L A N I/O Server Node I/O Server Node Data Data Compute Node I/O Server Node Data 4
5 Trends for Computing Clusters in the Top 500 List Nov. 1996: 0/500 (0%); Jun. 1997: 1/500 (0.2%); Nov. 1997: 1/500 (0.2%); Jun. 1998: 1/500 (0.2%); Nov. 1998: 2/500 (0.4%); Jun. 1999: 6/500 (1.2%); Nov. 1999: 7/500 (1.4%); Jun. 2000: 11/500 (2.2%); Nov. 2000: 28/500 (5.6%); Jun. 2001: 33/500 (6.6%); Nov. 2001: 43/500 (8.6%); Jun. 2002: 80/500 (16%); Nov. 2002: 93/500 (18.6%); Jun. 2003: 149/500 (29.8%); Nov. 2003: 208/500 (41.6%); Jun. 2004: 291/500 (58.2%); Nov. 2004: 294/500 (58.8%); Jun. 2005: 304/500 (60.8%); Nov. 2005: 360/500 (72.0%); Jun. 2006: 364/500 (72.8%); Nov. 2006: 361/500 (72.2%); Jun. 2007: 373/500 (74.6%); Nov. 2007: 406/500 (81.2%); Jun. 2008: 400/500 (80.0%); Nov. 2008: 410/500 (82.0%); Jun. 2009: 410/500 (82.0%); Nov. 2009: 417/500 (83.4%); Jun. 2010: 424/500 (84.8%); Nov. 2010: 415/500 (83%); Jun. 2011: To be announced 5
6 Grid Computing Environment Compute cluster Compute cluster Frontend LAN Frontend LAN LAN Storage cluster WAN LAN Storage cluster Compute Node Compute Node Compute Node Compute Node L A N Meta-Data Manager I/O Server Node I/O Server Node I/O Server Node Meta Data Data Data Data Compute Node Compute Node Compute Node Compute Node L A N Meta-Data Manager I/O Server Node I/O Server Node I/O Server Node Meta Data Data Data Data Compute cluster Frontend LAN LAN Storage cluster Compute Node Compute Node Compute Node Compute Node L A N Meta-Data Manager I/O Server Node I/O Server Node I/O Server Node Meta Data Data Data Data 6
7 Multi-Tier Datacenters and Enterprise Computing Enterprise Multi-tier Datacenter Routers/ Servers Application Server Database Server Routers/ Servers Switch Application Server Switch Database Server Switch Routers/ Servers.. Application Server.. Database Server Routers/ Servers Application Server Database Server Tier1 Tier2 Tier3 7
8 Integrated High-End Computing Environments Compute cluster Storage cluster Compute Node Meta-Data Manager Meta Data Frontend LAN Compute Node Compute Node L A N I/O Server Node I/O Server Node Data Data LAN/WAN Compute Node I/O Server Node Data Enterprise Multi-tier Datacenter for Visualization and Mining Routers/ Servers Routers/ Servers Routers/ Servers.. Routers/ Servers Tier1 Switch Application Server Application Server Application Server.. Application Server Tier2 Switch Database Server Database Server Database Server Database Server Tier3 Switch 8
9 Cloud Computing Environments VM VM Physical Machine VM VM Physical Machine Local Storage VM VM Local Storage LAN Virtual FS Meta-Data I/O Server I/O Server I/O Server Meta Data Data Data Data Physical Machine Local Storage VM VM Physical Machine I/O Server Data Local Storage 9
10 Hadoop Architecture Underlying Hadoop Distributed File System (HDFS) Fault-tolerance by replicating data blocks NameNode: stores information on data blocks DataNodes: store blocks and host Map-reduce computation JobTracker: track jobs and detect failure Model scales but high amount of communication during intermediate phases 10
11 Memcached Architecture System Area Network System Area Network Internet Web-frontend Servers Memcached Servers Database Servers Distributed Caching Layer Allows to aggregate spare memory from multiple nodes General purpose Typically used to cache database queries, results of API calls Scalable model, but typical usage very network intensive 11
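The distributed caching model above can be exercised with just a few client calls; the following is a minimal sketch using the libmemcached C client API. The server address (localhost:11211), the key/value strings, and the assumption of a recent libmemcached release are illustrative, not part of the tutorial material.

    /* Minimal libmemcached client sketch (illustrative). Compile with:
     *   gcc memc_demo.c -lmemcached */
    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        memcached_return_t rc;
        memcached_st *memc = memcached_create(NULL);
        memcached_server_st *servers =
            memcached_server_list_append(NULL, "localhost", 11211, &rc);
        memcached_server_push(memc, servers);

        /* Cache a value (e.g., the result of a database query) */
        const char *key = "user:42", *value = "cached-row";
        rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                           (time_t)0, (uint32_t)0);

        /* Later, any web-frontend server can fetch it from the caching layer */
        size_t len; uint32_t flags;
        char *result = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (rc == MEMCACHED_SUCCESS)
            printf("got %.*s\n", (int)len, result);

        free(result);
        memcached_server_list_free(servers);
        memcached_free(memc);
        return 0;
    }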
12 Networking and I/O Requirements Good System Area Networks with excellent performance (low latency, high bandwidth and low CPU utilization) for inter-processor communication (IPC) and I/O Good Storage Area Networks high performance I/O Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity Quality of Service (QoS) for interactive applications RAS (Reliability, Availability, and Serviceability) With low cost 12
13 Major Components in Computing Systems Hardware components: Processing cores and memory subsystem; I/O bus or links; Network adapters/switches Software components: Communication stack Bottlenecks can artificially limit the network performance the user perceives (figure: processing bottlenecks at the cores/memory, I/O interface bottlenecks at the I/O bus, and network bottlenecks at the adapter and switch) 13
14 Processing Bottlenecks in Traditional Protocols Ex: TCP/IP, UDP/IP Generic architecture for all networks Host processor handles almost all aspects of communication: Data buffering (copies on sender and receiver); Data integrity (checksum); Routing aspects (IP routing) Signaling between different layers: Hardware interrupt on packet arrival or transmission; Software signals between different layers to handle protocol processing in different priority levels 14
15 Bottlenecks in Traditional I/O Interfaces and Networks Traditionally relied on bus-based technologies (last mile bottleneck) E.g., PCI, PCI-X One bit per wire Performance increase through: Increasing clock speed; Increasing bus width Not scalable: Cross talk between bits; Skew between wires; Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds PCI: 33MHz/32bit: 1.05Gbps (shared bidirectional) PCI-X 1998 (v1.0): 133MHz/64bit: 8.5Gbps (shared bidirectional) PCI-X 2003 (v2.0): 266MHz/64bit: 17Gbps (shared bidirectional) 15
16 Bottlenecks on Traditional Networks Network speeds saturated at around 1Gbps Features provided were limited Commodity networks were not considered scalable enough for very large-scale systems Ethernet ( ): 10 Mbit/sec; Fast Ethernet (1993 -): 100 Mbit/sec; Gigabit Ethernet (1995 -): 1000 Mbit/sec; ATM (1995 -): 155/622/1024 Mbit/sec; Myrinet (1993 -): 1 Gbit/sec; Fibre Channel (1994 -): 1 Gbit/sec 16
17 Motivation for InfiniBand and High-speed Ethernet Industry Networking Standards InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks InfiniBand aimed at all three bottlenecks (protocol processing, I/O bus, and network speed) Ethernet aimed at directly handling the network speed bottleneck and relying on complementary technologies to alleviate the protocol processing and I/O bus bottlenecks 17
18 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 18
19 IB Trade Association IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun) Goal: To design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies Many other industry members participated in the effort to define the IB architecture specification IB Architecture (Volume 1, Version 1.0) was released to public on Oct 24, 2000 Latest version released January
20 High-speed Ethernet Consortium (10GE/40GE/100GE) 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE WG Energy-efficient and power-conscious protocols On-the-fly link speed reduction for under-utilized links 20
21 Tackling Communication Bottlenecks with IB and HSE Network speed bottlenecks Protocol processing bottlenecks I/O interface bottlenecks 21
22 Network Bottleneck Alleviation: InfiniBand ( Infinite Bandwidth ) and High-speed Ethernet (10/40/100 GE) Bit serial differential signaling Independent pairs of wires to transmit independent data (called a lane) Scalable to any number of lanes Easy to increase clock speed of lanes (since each lane consists only of a pair of wires) Theoretically, no perceived limit on the bandwidth 22
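As a back-of-the-envelope illustration of the lane-based scaling described above, the short C snippet below computes the usable data rate of a link from an assumed per-lane signaling rate, the number of lanes, and the 8b/10b encoding overhead used by SDR/DDR/QDR links; the specific rates are worked examples, not normative values from the tutorial.

    #include <stdio.h>

    /* Usable data rate = lanes * per-lane signaling rate * encoding efficiency.
     * Example: 4X QDR = 4 lanes * 10 Gbps * 8/10 (8b/10b coding) = 32 Gbps. */
    static double link_data_rate_gbps(int lanes, double signal_gbps, double coding)
    {
        return lanes * signal_gbps * coding;
    }

    int main(void)
    {
        printf("4X SDR:  %.0f Gbps\n", link_data_rate_gbps(4, 2.5, 0.8));   /* 8  */
        printf("4X QDR:  %.0f Gbps\n", link_data_rate_gbps(4, 10.0, 0.8));  /* 32 */
        printf("12X QDR: %.0f Gbps\n", link_data_rate_gbps(12, 10.0, 0.8)); /* 96 */
        return 0;
    }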
23 Network Speed Acceleration with IB and HSE Ethernet ( ): 10 Mbit/sec; Fast Ethernet (1993 -): 100 Mbit/sec; Gigabit Ethernet (1995 -): 1000 Mbit/sec; ATM (1995 -): 155/622/1024 Mbit/sec; Myrinet (1993 -): 1 Gbit/sec; Fibre Channel (1994 -): 1 Gbit/sec; InfiniBand (2001 -): 2 Gbit/sec (1X SDR); 10-Gigabit Ethernet (2001 -): 10 Gbit/sec; InfiniBand (2003 -): 8 Gbit/sec (4X SDR); InfiniBand (2005 -): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR); InfiniBand (2007 -): 32 Gbit/sec (4X QDR); 40-Gigabit Ethernet (2010 -): 40 Gbit/sec; InfiniBand (2011 -): 56 Gbit/sec (4X FDR); InfiniBand (2012 -): 100 Gbit/sec (4X EDR) 20 times in the last 9 years 23
24 InfiniBand Link Speed Standardization Roadmap (bandwidth per direction, Gb/s) Per-lane (rounded) rates: 4G-IB-DDR, 8G-IB-QDR, 14G-IB-FDR (14.0625), 26G-IB-EDR Per-link rates by lane count: 1x: 8G-IB-QDR, 14G-IB-FDR, 25G-IB-EDR; 4x: 16G-IB-DDR, 32G-IB-QDR, 56G-IB-FDR, 100G-IB-EDR; 8x: 32G-IB-DDR, 112G-IB-FDR, 200G-IB-EDR; 12x: 48G-IB-DDR, 96G-IB-QDR, 168G-IB-FDR, 300G-IB-EDR HDR and NDR rates for 1x, 4x, 8x and 12x links are on the roadmap NDR = Next Data Rate HDR = High Data Rate EDR = Enhanced Data Rate FDR = Fourteen Data Rate QDR = Quad Data Rate DDR = Double Data Rate SDR = Single Data Rate (not shown) 24
25 Tackling Communication Bottlenecks with IB and HSE Network speed bottlenecks Protocol processing bottlenecks I/O interface bottlenecks 25
26 Capabilities of High-Performance Networks Intelligent Network Interface Cards Support entire protocol processing completely in hardware (hardware protocol offload engines) Provide a rich communication interface to applications User-level communication capability Gets rid of intermediate data buffering requirements No software signaling between communication layers All layers are implemented on a dedicated hardware unit, and not on a shared host CPU 26
27 Previous High-Performance Network Stacks Fast Messages (FM) Developed by UIUC Myricom GM Proprietary protocol stack from Myricom These network stacks set the trend for high-performance communication requirements Hardware offloaded protocol stack Support for fast and secure user-level access to the protocol stack Virtual Interface Architecture (VIA) Standardized by Intel, Compaq, Microsoft Precursor to IB 27
28 IB Hardware Acceleration Some IB models have multiple hardware accelerators E.g., Mellanox IB adapters Protocol Offload Engines Completely implement ISO/OSI layers 2-4 (link layer, network layer and transport layer) in hardware Additional hardware supported features also present RDMA, Multicast, QoS, Fault Tolerance, and many more 28
29 Ethernet Hardware Acceleration Interrupt Coalescing Improves throughput, but degrades latency Jumbo Frames No latency impact; Incompatible with existing switches Hardware Checksum Engines Checksum performed in hardware significantly faster Shown to have minimal benefit independently Segmentation Offload Engines (a.k.a. Virtual MTU) Host processor thinks that the adapter supports large Jumbo frames, but the adapter splits it into regular sized (1500-byte) frames Supported by most HSE products because of its backward compatibility (it is still considered regular Ethernet) Heavily used in the server-on-steroids model High performance servers connected to regular clients 29
30 TOE and iwarp Accelerators TCP Offload Engines (TOE) Hardware Acceleration for the entire TCP/IP stack Initially patented by Tehuti Networks Actually refers to the IC on the network adapter that implements TCP/IP In practice, usually referred to as the entire network adapter Internet Wide-Area RDMA Protocol (iwarp) Standardized by IETF and the RDMA Consortium Support acceleration features (like IB) for Ethernet 30
31 Converged (Enhanced) Ethernet (CEE or CE) Also known as Datacenter Ethernet or Lossless Ethernet Combines a number of optional Ethernet standards into one umbrella as mandatory requirements Sample enhancements include: Priority-based flow-control: Link-level flow control for each Class of Service (CoS) Enhanced Transmission Selection (ETS): Bandwidth assignment to each CoS Datacenter Bridging Exchange Protocol (DCBX): Congestion notification, Priority classes End-to-end Congestion notification: Per flow congestion control to supplement per link flow control 31
32 Tackling Communication Bottlenecks with IB and HSE Network speed bottlenecks Protocol processing bottlenecks I/O interface bottlenecks 32
33 Interplay with I/O Technologies InfiniBand initially intended to replace I/O bus technologies with networking-like technology That is, bit serial differential signaling With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now Both IB and HSE today come as network adapters that plug into existing I/O technologies 33
34 Trends in I/O Interfaces with Servers Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit) PCI: 33MHz/32bit: 1.05Gbps (shared bidirectional) PCI-X: 1998 (v1.0): 133MHz/64bit: 8.5Gbps (shared bidirectional); 2003 (v2.0): 266MHz/64bit: 17Gbps (shared bidirectional) AMD HyperTransport (HT): 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1): 102.4Gbps (v1.0), 179.2Gbps (v2.0), 332.8Gbps (v3.0), 409.6Gbps (v3.1) (32 lanes) PCI-Express (PCIe) by Intel: 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard): Gen1: 4X (8Gbps), 8X (16Gbps), 16X (32Gbps); Gen2: 4X (16Gbps), 8X (32Gbps), 16X (64Gbps); Gen3: 4X (~32Gbps), 8X (~64Gbps), 16X (~128Gbps) Intel QuickPath Interconnect (QPI): Gbps (20 lanes) 34
35 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 35
36 IB, HSE and their Convergence InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Novel Features Subnet Management and Services High-speed Ethernet Family Internet Wide Area RDMA Protocol (iwarp) Alternate vendor-specific protocol stacks InfiniBand/Ethernet Convergence Technologies Virtual Protocol Interconnect (VPI) (InfiniBand) RDMA over Ethernet (RoE) (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) 36
37 Comparing InfiniBand with Traditional Networking Stack HTTP, FTP, MPI, File Systems Application Layer Application Layer MPI, PGAS, File Systems Sockets Interface OpenFabrics Verbs TCP, UDP Transport Layer Transport Layer RC (reliable), UD (unreliable) Routing DNS management tools Network Layer Network Layer Routing Flow-control and Error Detection Link Layer Link Layer Flow-control, Error Detection OpenSM (management tool) Copper, Optical or Wireless Physical Layer Physical Layer Copper or Optical Traditional Ethernet InfiniBand 37
38 IB Overview InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Communication Model Memory registration and protection Channel and memory semantics Novel Features Hardware Protocol Offload Link, network and transport layer features Subnet Management and Services 38
39 Components: Channel Adapters Used by processing and I/O units to connect to fabric Consume & generate IB packets Programmable DMA engines with protection features May have multiple ports Independent buffering channeled through Virtual Lanes (figure: Host Channel Adapter (HCA) with memory, DMA engine, transport, SMA and multiple ports, each port with its own virtual lanes and QPs) 39
40 Components: Switches and Routers Relay packets from a link to another Switches: intra-subnet Routers: inter-subnet (packet relay based on the GRH) May support multicast (figure: switch and router, each with multiple ports and per-port virtual lanes) 40
41 Components: Links & Repeaters Network Links Copper, Optical, Printed Circuit wiring on Back Plane Not directly addressable Traditional adapters built for copper cabling Restricted by cable length (signal integrity) For example, QDR copper cables are restricted to 7m Intel Connects: Optical cables with Copper-to-optical conversion hubs (acquired by Emcore) Up to 100m length 550 picoseconds copper-to-optical conversion latency Available from other vendors (Luxtera) Repeaters (Vol. 2 of InfiniBand specification) (Courtesy Intel) 41
42 IB Overview InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Communication Model Memory registration and protection Channel and memory semantics Novel Features Hardware Protocol Offload Link, network and transport layer features Subnet Management and Services 42
43 IB Communication Model Basic InfiniBand Communication Semantics 43
44 Queue Pair Model QP CQ Each QP has two queues Send Queue (SQ) Receive Queue (RQ) Work requests are queued to the QP (WQEs: Wookies ) QP to be linked to a Complete Queue (CQ) Gives notification of operation completion from QPs Completed WQEs are placed in the CQ with additional information (CQEs: Cookies ) Send Recv WQEs CQEs InfiniBand Device 44
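To make the QP/CQ relationship above concrete, here is a simplified sketch using the OpenFabrics verbs API: it creates one completion queue and a reliable-connection QP whose send and receive queues both report completions to that CQ. Queue depths and SGE limits are illustrative assumptions, and error handling plus the later QP state transitions are omitted.

    #include <infiniband/verbs.h>

    /* Assumes 'ctx' (struct ibv_context*) and 'pd' (struct ibv_pd*) were
     * obtained earlier via ibv_open_device() and ibv_alloc_pd(). */
    struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                struct ibv_cq **cq_out)
    {
        /* One CQ, depth 128, used for both send and receive completions (CQEs) */
        struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,              /* completions of send WQEs land here    */
            .recv_cq = cq,              /* completions of receive WQEs land here */
            .cap = {
                .max_send_wr  = 64,     /* max outstanding WQEs on the SQ */
                .max_recv_wr  = 64,     /* max outstanding WQEs on the RQ */
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = IBV_QPT_RC,      /* reliable connection transport */
        };

        *cq_out = cq;
        return ibv_create_qp(pd, &attr); /* QP starts in the RESET state */
    }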
45 Memory Registration Before we do any communication: All memory used for communication must be registered 1. Registration Request: process sends virtual address and length 2. Kernel handles virtual->physical mapping and pins region into physical memory; a process cannot map memory that it does not own (security!) 3. HCA caches the virtual to physical mapping and issues a handle, which includes an l_key and r_key 4. Handle is returned to application (figure: process, kernel, HCA/RNIC) 45
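The four registration steps above map onto a single verbs call; the sketch below registers a buffer with a protection domain and shows where the resulting l_key and r_key live. The access flags and the assumption that the buffer was allocated earlier are illustrative.

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Register 'len' bytes with protection domain 'pd' so the HCA may access it.
     * The kernel pins the pages; the HCA returns a memory region handle. */
    struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE  |
                                       IBV_ACCESS_REMOTE_READ  |
                                       IBV_ACCESS_REMOTE_WRITE);
        /* mr->lkey is used in local WQEs; mr->rkey must be sent to a peer
         * (e.g., via a send/recv exchange) before it can issue RDMA to us. */
        return mr;
    }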
46 Memory Protection For security, keys are required for all operations that touch buffers To send or receive data the l_key must be provided to the HCA; the HCA verifies access to local memory For RDMA, the initiator must have the r_key for the remote virtual address, possibly exchanged with a send/recv The r_key is not encrypted in IB (figure: process, kernel, HCA/NIC; the r_key is needed for RDMA operations) 46
47 Communication in the Channel Semantics (Send/Receive Model) Processor is involved only to: 1. Post receive WQE 2. Post send WQE 3. Pull out completed CQEs from the CQ Send WQE contains information about the send buffer (multiple non-contiguous segments) Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place them Reliability (hardware ACK) is handled between the InfiniBand devices (figure: QPs, CQs and memory segments on both processors) 47
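A minimal sketch of the two-sided (channel) semantics just described: the receiver pre-posts a receive WQE that says where incoming data may land, the sender posts a send WQE, and both sides later reap CQEs. It assumes a connected RC QP and a buffer registered as in the earlier sketches; work-request IDs are arbitrary.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Receiver side: describe a landing buffer and post it to the receive queue */
    static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                               .lkey = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        return ibv_post_recv(qp, &wr, &bad);
    }

    /* Sender side: post a send WQE; the remote HCA matches the incoming message
     * to a posted receive WQE and a CQE is generated on completion */
    static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_SEND,
                                  .send_flags = IBV_SEND_SIGNALED };
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }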
48 Communication in the Memory Semantics (RDMA Model) Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ No involvement from the target processor Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment) Reliability (hardware ACK) is handled between the InfiniBand devices (figure: QPs, CQs and memory segments on both sides) 48
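A sketch of the one-sided (memory) semantics: the initiator's WQE carries both its local buffer (with the lkey) and the target's virtual address and rkey, so the target CPU never participates. The remote address and rkey are assumed to have been exchanged beforehand, for example with a send/recv.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               void *local_buf, size_t len,
                               uint64_t remote_addr, uint32_t remote_rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)local_buf, .length = len,
                               .lkey = local_mr->lkey };
        struct ibv_send_wr wr = {
            .wr_id      = 3,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,  /* data placed directly in remote memory */
            .send_flags = IBV_SEND_SIGNALED,  /* CQE only on the initiator side        */
        };
        wr.wr.rdma.remote_addr = remote_addr; /* target virtual address */
        wr.wr.rdma.rkey        = remote_rkey; /* key obtained from the target earlier  */

        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }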
49 Communication in the Memory Semantics (Atomics) Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ No involvement from the target processor Send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment) The atomic operation is performed by the target InfiniBand device IB supports compare-and-swap and fetch-and-add atomic operations 49
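A sketch of one of the two atomics: the initiator posts a fetch-and-add on a 64-bit word in remote registered memory, and the original value is returned into the local buffer. As before, the remote address and rkey are assumed to have been exchanged already, and the remote word must be 8-byte aligned.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *local_mr,
                              uint64_t *local_result,
                              uint64_t remote_addr, uint32_t remote_rkey,
                              uint64_t add_value)
    {
        /* Local 64-bit buffer receives the pre-operation value of the remote word */
        struct ibv_sge sge = { .addr = (uintptr_t)local_result,
                               .length = sizeof(uint64_t),
                               .lkey = local_mr->lkey };
        struct ibv_send_wr wr = {
            .wr_id      = 4,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.atomic.remote_addr = remote_addr; /* 8-byte aligned remote word */
        wr.wr.atomic.rkey        = remote_rkey;
        wr.wr.atomic.compare_add = add_value;   /* amount added atomically    */

        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }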
50 IB Overview InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Communication Model Memory registration and protection Channel and memory semantics Novel Features Hardware Protocol Offload Link, network and transport layer features Subnet Management and Services 50
51 Hardware Protocol Offload Complete Hardware Implementations Exist 51
52 Link/Network Layer Capabilities Buffering and Flow Control Virtual Lanes, Service Levels and QoS Switching and Multicast 52
53 Buffering and Flow Control IB provides three levels of communication throttling/control mechanisms Link-level flow control (link layer feature) Message-level flow control (transport layer feature): discussed later Congestion control (part of the link layer features) IB provides an absolute credit-based flow-control Receiver guarantees that enough space is allotted for N blocks of data Occasional update of available credits by the receiver Has no relation to the number of messages, but only to the total amount of data being sent One 1MB message is equivalent to 1024 1KB messages (except for rounding off at message boundaries) 53
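Purely as an illustration of the credit idea above (this is not an IB API; link-level credits are managed entirely in hardware), the toy C fragment below tracks credits in units of data blocks: the sender consumes credits in proportion to bytes sent and stalls when none remain, and the receiver occasionally returns freed credits. The block size is an arbitrary assumption.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BYTES 64               /* credit granularity: one credit = one block */

    struct credit_state { uint32_t credits; };   /* blocks the receiver has free */

    /* Sender: a message consumes credits based on its size, not on message count;
     * a 1MB message costs as many credits as 1024 separate 1KB messages. */
    static bool try_send(struct credit_state *s, uint32_t msg_bytes)
    {
        uint32_t needed = (msg_bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
        if (s->credits < needed)
            return false;                /* stall until the receiver returns credits */
        s->credits -= needed;
        return true;
    }

    /* Receiver: occasionally advertises buffer space it has drained */
    static void return_credits(struct credit_state *s, uint32_t freed_blocks)
    {
        s->credits += freed_blocks;
    }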
54 Virtual Lanes Multiple virtual links within same physical link Between 2 and 16 Separate buffers and flow control Avoids Head-of-Line Blocking VL15: reserved for management Each port supports one or more data VL 54
55 Service Levels and QoS Service Level (SL): Packets may operate at one of 16 different SLs Meaning not defined by IB SL to VL mapping: SL determines which VL on the next link is to be used Each port (switches, routers, end nodes) has a SL to VL mapping table configured by the subnet management Partitions: Fabric administration (through Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows 55
56 Traffic Segregation Benefits InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link, providing the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks (figure: a single InfiniBand fabric carrying server IPC/load-balancing/web-cache/ASP traffic, IP network traffic to routers, switches, VPNs and DSLAMs, and storage area network traffic to RAID, NAS and backup, each on its own virtual lane) (Courtesy: Mellanox Technologies) 56
57 Switching (Layer-2 Routing) and Multicast Each port has one or more associated LIDs (Local Identifiers) Switches look up which port to forward a packet to based on its destination LID (DLID) This information is maintained at the switch For multicast packets, the switch needs to maintain multiple output ports to forward the packet to Packet is replicated to each appropriate output port Ensures at-most once delivery & loop-free forwarding There is an interface for a group management protocol Create, join/leave, prune, delete group 57
58 Switch Complex Basic unit of switching is a crossbar Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars Switches available in the market are typically collections of crossbars within a single cabinet Do not confuse non-blocking switches with crossbars Crossbars provide all-to-all connectivity to all connected nodes For any random node pair selection, all communication is non-blocking Non-blocking switches provide a fat-tree of many crossbars For any random node pair selection, there exists a switch configuration such that communication is non-blocking If the communication pattern changes, the same switch configuration might no longer provide fully non-blocking communication 58
59 IB Switching/Routing: An Example An Example IB Switch Block Diagram (Mellanox 144-Port): spine blocks and leaf blocks Switching: IB supports Virtual Cut Through (VCT) Routing: Unspecified by IB SPEC Up*/Down*, Shift are popular routing engines supported by OFED Someone has to set up the forwarding tables (DLID to out-port) and give every port an LID: the Subnet Manager does this work Different routing algorithms give different paths Fat-Tree is a popular topology for IB clusters Different over-subscription ratios may be used Other topologies are also being used 3D Torus (Sandia Red Sky) and SGI Altix (Hypercube) 59
60 More on Multipathing Similar to basic switching, except sender can utilize multiple LIDs associated to the same destination port Packets sent to one DLID take a fixed path Different packets can be sent using different DLIDs Each DLID can have a different path (switch can be configured differently for each DLID) Can cause out-of-order arrival of packets IB uses a simplistic approach: If packets in one connection arrive out-of-order, they are dropped Easier to use different DLIDs for different connections This is what most high-level libraries using IB do! 60
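The path a connection takes is fixed when its QP is moved to the ready-to-receive state; the sketch below shows the ibv_modify_qp() call where the destination LID is set, which is the point at which a library can choose a different DLID (and therefore a different switch path) for each connection. It assumes the QP has already been transitioned to INIT, and the MTU, PSN and RNR values are illustrative defaults, not prescribed settings.

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Connect an RC QP to a remote QP reachable through 'dlid'; choosing a
     * different dlid per connection selects a different path through the fabric. */
    static int connect_qp_via_dlid(struct ibv_qp *qp, uint16_t dlid,
                                   uint32_t remote_qpn, uint8_t port)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_2048;
        attr.dest_qp_num        = remote_qpn;
        attr.rq_psn             = 0;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = dlid;     /* destination LID = chosen path */
        attr.ah_attr.sl         = 0;
        attr.ah_attr.port_num   = port;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }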
61 IB Multicast Example 61
62 Hardware Protocol Offload Complete Hardware Implementations Exist 62
63 IB Transport Services Reliable Connection: connection oriented, acknowledged, IBA transport Unreliable Connection: connection oriented, unacknowledged, IBA transport Reliable Datagram: connectionless, acknowledged, IBA transport Unreliable Datagram: connectionless, unacknowledged, IBA transport Raw Datagram: connectionless, unacknowledged, raw transport Each transport service can have zero or more QPs associated with it E.g., you can have four QPs based on RC and one QP based on UD 63
64 Reliability Trade-offs in Different Transport Types Scalability (M processes, N nodes): Reliable Connection: M^2 N QPs per HCA; Reliable Datagram: M QPs per HCA; eXtended Reliable Connection: MN QPs per HCA; Unreliable Connection: M^2 N QPs per HCA; Unreliable Datagram: M QPs per HCA; Raw Datagram: 1 QP per HCA Corrupt data detected: yes for all transport types Data delivery guarantee: the reliable types deliver data exactly once; the unreliable types make no guarantees Data order guarantees: per connection for RC and XRC; one source to multiple destinations for RD; unordered, duplicate data detected for UC; none for UD and Raw Data loss detected: yes for the reliable types; no for UD and Raw Error recovery: for the reliable types, errors (retransmissions, alternate path, etc.) are handled by the transport layer and the client is only involved in handling fatal errors (links broken, protection violation, etc.); for UC, packets with errors and sequence errors are reported to the responder; none for UD and Raw 64
65 Transport Layer Capabilities Data Segmentation Transaction Ordering Message-level Flow Control Static Rate Control and Auto-negotiation 65
66 Data Segmentation IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP) Application can hand over a large message Network adapter segments it to MTU sized packets Single notification when the entire message is transmitted or received (not per packet) Reduced host overhead to send/receive messages Depends on the number of messages, not the number of bytes 66
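The host-overhead argument above is easy to quantify: the WQEs and completions the host handles are per message, while the adapter does the per-packet segmentation and reassembly. The tiny sketch below only illustrates that arithmetic; the message size and MTU are assumed values.

    #include <stdio.h>

    int main(void)
    {
        unsigned msg_bytes = 1u << 20;      /* one 1MB application message */
        unsigned mtu       = 2048;          /* assumed IB path MTU         */

        unsigned packets = (msg_bytes + mtu - 1) / mtu;
        /* The adapter segments/reassembles 512 packets; the host posts one send
         * WQE and sees one completion instead of ~512 per-packet events. */
        printf("packets on the wire: %u, host-visible events: 1\n", packets);
        return 0;
    }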
67 Transaction Ordering IB follows a strong transaction ordering for RC Sender network adapter transmits messages in the order in which WQEs were posted Each QP utilizes a single LID All WQEs posted on same QP take the same path All packets are received by the receiver in the same order All receive WQEs are completed in the order in which they were posted 67
68 Message-level Flow-Control Also called End-to-end Flow-control Does not depend on the number of network hops Separate from Link-level Flow-Control Link-level flow-control only relies on the number of bytes being transmitted, not the number of messages Message-level flow-control only relies on the number of messages transferred, not the number of bytes If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs) If the sent messages are larger than the posted receive buffers, flow-control cannot handle it 68
69 Static Rate Control and Auto-Negotiation IB allows link rates to be statically changed On a 4X link, we can set data to be sent at 1X For heterogeneous links, rate can be set to the lowest link rate Useful for low-priority traffic Auto-negotiation also available E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate Only fixed settings available Cannot set rate requirement to 3.16 Gbps, for example 69
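In the verbs API the static rate setting is part of the address handle attributes; the fragment below is only a sketch of where that fixed setting goes, using one of the enumerated ibv_rate values (which is why an arbitrary figure such as 3.16 Gbps cannot be requested). The choice of IBV_RATE_2_5_GBPS for low-priority traffic is an illustrative assumption.

    #include <infiniband/verbs.h>

    /* Limit traffic on this path to a fixed, enumerated injection rate. */
    static void set_low_priority_rate(struct ibv_ah_attr *ah_attr)
    {
        /* Only the rates defined in enum ibv_rate are selectable; intermediate
         * values (e.g., 3.16 Gbps) cannot be requested. */
        ah_attr->static_rate = IBV_RATE_2_5_GBPS;
    }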
70 IB Overview InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Communication Model Memory registration and protection Channel and memory semantics Novel Features Hardware Protocol Offload Link, network and transport layer features Subnet Management and Services 70
71 Concepts in IB Management Agents Processes or hardware units running on each adapter, switch, router (everything on the network) Provide capability to query and set parameters Managers Make high-level decisions and implement it on the network fabric using the agents Messaging schemes Used for interactions between the manager and agents (or between agents) Messages 71
72 Subnet Manager Inactive Link Multicast Setup Inactive Active Links Compute Node Switch Multicast Setup Multicast Join Multicast Join Subnet Manager 72
73 IB, HSE and their Convergence InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Novel Features Subnet Management and Services High-speed Ethernet Family Internet Wide Area RDMA Protocol (iwarp) Alternate vendor-specific protocol stacks InfiniBand/Ethernet Convergence Technologies Virtual Protocol Interconnect (VPI) (InfiniBand) RDMA over Ethernet (RoE) (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) 73
74 HSE Overview High-speed Ethernet Family Internet Wide-Area RDMA Protocol (iwarp) Architecture and Components Features Out-of-order data placement Dynamic and Fine-grained Data Rate control Multipathing using VLANs Existing Implementations of HSE/iWARP Alternate Vendor-specific Stacks MX over Ethernet (for Myricom 10GE adapters) Datagram Bypass Layer (for Myricom 10GE adapters) Solarflare OpenOnload (for Solarflare 10GE adapters) 74
75 IB and HSE RDMA Models: Commonalities and Differences Hardware Acceleration: IB: Supported; iwarp/HSE: Supported RDMA: IB: Supported; iwarp/HSE: Supported Atomic Operations: IB: Supported; iwarp/HSE: Not supported Multicast: IB: Supported; iwarp/HSE: Supported Congestion Control: IB: Supported; iwarp/HSE: Supported Data Placement: IB: Ordered; iwarp/HSE: Out-of-order Data Rate-control: IB: Static and Coarse-grained; iwarp/HSE: Dynamic and Fine-grained QoS: IB: Prioritization; iwarp/HSE: Prioritization and Fixed Bandwidth QoS Multipathing: IB: Using DLIDs; iwarp/HSE: Using VLANs 75
76 iwarp Architecture and Components User Hardware iwarp Offload Engines Application or Library MPA TCP RDDP IP Device Driver Network Adapter (e.g., 10GigE) RDMAP SCTP (Courtesy iwarp Specification) RDMA Protocol (RDMAP) Feature-rich interface Security Management Remote Direct Data Placement (RDDP) Data Placement and Delivery Multi Stream Semantics Connection Management Marker PDU Aligned (MPA) Middle Box Fragmentation Data Integrity (CRC) 76
77 HSE Overview High-speed Ethernet Family Internet Wide-Area RDMA Protocol (iwarp) Architecture and Components Features Out-of-order data placement Dynamic and Fine-grained Data Rate control Existing Implementations of HSE/iWARP Alternate Vendor-specific Stacks MX over Ethernet (for Myricom 10GE adapters) Datagram Bypass Layer (for Myricom 10GE adapters) Solarflare OpenOnload (for Solarflare 10GE adapters) 77
78 Decoupled Data Placement and Data Delivery Place data as it arrives, whether in or out-of-order If data is out-of-order, place it at the appropriate offset Issues from the application s perspective: Second half of the message has been placed does not mean that the first half of the message has arrived as well If one message has been placed, it does not mean that that the previous messages have been placed Issues from protocol stack s perspective The receiver network stack has to understand each frame of data If the frame is unchanged during transmission, this is easy! The MPA protocol layer adds appropriate information at regular intervals to allow the receiver to identify fragmented frames 78
79 Dynamic and Fine-grained Rate Control Part of the Ethernet standard, not iwarp Network vendors use a separate interface to support it Dynamic bandwidth allocation to flows based on interval between two packets in a flow E.g., one stall for every packet sent on a 10 Gbps network refers to a bandwidth allocation of 5 Gbps Complicated because of TCP windowing behavior Important for high-latency/high-bandwidth networks Large windows exposed on the receiver side Receiver overflow controlled through rate control 79
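The pacing arithmetic mentioned above (one stall slot per packet on a 10 Gbps link giving roughly 5 Gbps) can be written down directly; the snippet below only illustrates that relationship and is not a model of any particular adapter's rate-control interface.

    #include <stdio.h>

    int main(void)
    {
        double link_gbps        = 10.0;  /* raw link rate                          */
        double stalls_per_packet = 1.0;  /* idle slots inserted after each packet  */

        /* Each packet occupies one transmission slot; inserting N stall slots per
         * packet scales the effective rate by 1 / (1 + N). */
        double effective = link_gbps / (1.0 + stalls_per_packet);
        printf("effective bandwidth: %.1f Gbps\n", effective); /* prints 5.0 Gbps */
        return 0;
    }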
80 Prioritization and Fixed Bandwidth QoS Can allow for simple prioritization: E.g., connection 1 performs better than connection 2 8 classes provided (a connection can be in any class) Similar to SLs in InfiniBand Two priority classes for high-priority traffic E.g., management traffic or your favorite application Or can allow for specific bandwidth requests: E.g., can request for 3.62 Gbps bandwidth Packet pacing and stalls used to achieve this Query functionality to find out remaining bandwidth 80
81 HSE Overview High-speed Ethernet Family Internet Wide-Area RDMA Protocol (iwarp) Architecture and Components Features Out-of-order data placement Dynamic and Fine-grained Data Rate control Existing Implementations of HSE/iWARP Alternate Vendor-specific Stacks MX over Ethernet (for Myricom 10GE adapters) Datagram Bypass Layer (for Myricom 10GE adapters) Solarflare OpenOnload (for Solarflare 10GE adapters) 81
82 Software iwarp based Compatibility Regular Ethernet Cluster TOE Cluster Wide Area Network iwarp Cluster Regular Ethernet adapters and TOEs are fully compatible Compatibility with iwarp required Software iwarp emulates the functionality of iwarp on the host Fully compatible with hardware iwarp Internally utilizes host TCP/IP stack Distributed Cluster Environment 82
83 Different iwarp Implementations OSU, OSC, IBM OSU, ANL Chelsio, NetEffect (Intel) Application Application Application Application User-level iwarp Sockets TCP Sockets Kernel-level iwarp TCP (Modified with MPA) High Performance Sockets Sockets TCP IP Software iwarp High Performance Sockets Sockets TCP IP IP IP Device Driver Device Driver Device Driver Device Driver Offloaded TCP Offloaded iwarp Offloaded TCP Network Adapter Network Adapter Network Adapter Offloaded IP Offloaded IP Network Adapter Regular Ethernet Adapters TCP Offload Engines iwarp compliant Adapters 83
84 HSE Overview High-speed Ethernet Family Internet Wide-Area RDMA Protocol (iwarp) Architecture and Components Features Out-of-order data placement Dynamic and Fine-grained Data Rate control Multipathing using VLANs Existing Implementations of HSE/iWARP Alternate Vendor-specific Stacks MX over Ethernet (for Myricom 10GE adapters) Datagram Bypass Layer (for Myricom 10GE adapters) Solarflare OpenOnload (for Solarflare 10GE adapters) 84
85 Myrinet Express (MX) Proprietary communication layer developed by Myricom for their Myrinet adapters Third generation communication layer (after FM and GM) Supports Myrinet-2000 and the newer Myri-10G adapters Low-level MPI-like messaging layer Almost one-to-one match with MPI semantics (including connectionless model, implicit memory registration and tag matching) Later versions added some more advanced communication methods such as RDMA to support other programming models such as ARMCI (low-level runtime for the Global Arrays PGAS library) Open-MX New open-source implementation of the MX interface for non- Myrinet adapters from INRIA, France 85
86 Datagram Bypass Layer (DBL) Another proprietary communication layer developed by Myricom Compatible with regular UDP sockets (embraces and extends) Idea is to bypass the kernel stack and give UDP applications direct access to the network adapter High performance and low-jitter Primary motivation: Financial market applications (e.g., stock market) Applications prefer unreliable communication Timeliness is more important than reliability This stack is covered by NDA; more details can be requested from Myricom 86
87 Solarflare Communications: OpenOnload Stack HPC Networking Stack provides many performance benefits, but has limitations for certain types of scenarios, especially where applications tend to fork(), exec() and need asynchronous advancement (per application) Solarflare approach (compared with the typical commodity and typical HPC networking stacks): Network hardware provides user-safe interface to route packets directly to apps based on flow information in headers Protocol processing can happen in both kernel and user space Protocol state shared between app and kernel using shared memory (Courtesy Solarflare Communications) 87
88 IB, HSE and their Convergence InfiniBand Architecture and Basic Hardware Components Communication Model and Semantics Novel Features Subnet Management and Services High-speed Ethernet Family Internet Wide Area RDMA Protocol (iwarp) Alternate vendor-specific protocol stacks InfiniBand/Ethernet Convergence Technologies Virtual Protocol Interconnect (VPI) (InfiniBand) RDMA over Ethernet (RoE) (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) 88
89 Virtual Protocol Interconnect (VPI) Applications Single network firmware to support both IB and Ethernet IB Verbs Sockets Autosensing of layer-2 protocol Can be configured to automatically IB Transport Layer TCP work with either IB or Ethernet networks Hardware IB Network Layer IB Link Layer IP TCP/IP support Ethernet Link Layer Multi-port adapters can use one port on IB and another on Ethernet Multiple use modes: Datacenters with IB inside the cluster and Ethernet outside Clusters with IB network and IB Port Ethernet Port Ethernet management 89
90 (InfiniBand) RDMA over Ethernet (IBoE or RoE) (stack: Application / IB Verbs / IB Transport / IB Network / Ethernet link layer, implemented in hardware) Native convergence of IB network and transport layers with Ethernet link layer IB packets encapsulated in Ethernet frames IB network layer already uses IPv6 frames Pros: Works natively in Ethernet environments (entire Ethernet management ecosystem is available); Has all the benefits of IB verbs Cons: Network bandwidth might be limited to Ethernet switches: 10GE switches available; 40GE yet to arrive; 32 Gbps IB available; Some IB native link-layer features are optional in (regular) Ethernet Approved by OFA board to be included into OFED 90
91 (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) Very similar to IB over Ethernet Often used interchangeably with IBoE Application IB Verbs IB Transport IB Network CE Hardware Can be used to explicitly specify link layer is Converged (Enhanced) Ethernet (CE) Pros: Works natively in Ethernet environments (entire Ethernet management ecosystem is available) Has all the benefits of IB verbs CE is very similar to the link layer of native IB, so there are no missing features Cons: Network bandwidth might be limited to Ethernet switches: 10GE switches available; 40GE yet to arrive; 32 Gbps IB available 91
92 IB and HSE: Feature Comparison (IB / iwarp-HSE / RoE / RoCE) Hardware Acceleration: Yes / Yes / Yes / Yes RDMA: Yes / Yes / Yes / Yes Congestion Control: Yes / Optional / Optional / Yes Multipathing: Yes / Yes / Yes / Yes Atomic Operations: Yes / No / Yes / Yes Multicast: Optional / No / Optional / Optional Data Placement: Ordered / Out-of-order / Ordered / Ordered Prioritization: Optional / Optional / Optional / Yes Fixed BW QoS (ETS): No / Optional / Optional / Yes Ethernet Compatibility: No / Yes / Yes / Yes TCP/IP Compatibility: Yes (using IPoIB) / Yes / Yes (using IPoIB) / Yes (using IPoIB) 92
93 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 93
94 IB Hardware Products Many IB vendors: Mellanox+Voltaire and Qlogic Aligned with many server vendors: Intel, IBM, SUN, Dell And many integrators: Appro, Advanced Clustering, Microway Broadly two kinds of adapters Offloading (Mellanox) and Onloading (Qlogic) Adapters with different interfaces: Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT MemFree Adapter No memory on HCA Uses System memory (through PCIe) Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB) Different speeds SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps) Some 12X SDR adapters exist as well (24 Gbps each way) New ConnectX-2 adapter from Mellanox supports offload for collectives (Barrier, Broadcast, etc.) 94
95 Tyan Thunder S2935 Board (Courtesy Tyan) Similar boards from Supermicro with LOM features are also available 95
96 IB Hardware Products (contd.) Customized adapters to work with IB switches Cray XD1 (formerly by Octigabay), Cray CX1 Switches: 4X SDR and DDR (8-288 ports); 12X SDR (small sizes) 3456-port Magnum switch from SUN used at TACC 72-port nano magnum 36-port Mellanox InfiniScale IV QDR switch silicon in 2008 Up to 648-port QDR switch by Mellanox and SUN Some internal ports are 96 Gbps (12X QDR) IB switch silicon from Qlogic introduced at SC 08 Up to 846-port QDR switch by Qlogic New FDR (56 Gbps) switch silicon (Bridge-X) has been announced by Mellanox in May 11 Switch Routers with Gateways IB-to-FC; IB-to-IP 96
97 10G, 40G and 100G Ethernet Products 10GE adapters: Intel, Myricom, Mellanox (ConnectX) 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel) 40GE adapters: Mellanox ConnectX2-EN 40G 10GE switches Fulcrum Microsystems Low latency switch based on 24-port silicon FM4000 switch with IP routing, and TCP/UDP support Fujitsu, Myricom (512 ports), Force10, Cisco, Arista (formerly Arastra) 40GE and 100GE switches Nortel Networks 10GE downlinks with 40GE and 100GE uplinks Broadcom has announced 40GE switch in early
98 Products Providing IB and HSE Convergence Mellanox ConnectX Adapter Supports IB and HSE convergence Ports can be configured to support IB or HSE Support for VPI and RoCE 8 Gbps (SDR), 16Gbps (DDR) and 32Gbps (QDR) rates available for IB 10GE rate available for RoCE 40GE rate for RoCE is expected to be available in near future 98
99 Software Convergence with OpenFabrics Open source organization (formerly OpenIB) Incorporates both IB and iwarp in a unified manner Support for Linux and Windows Design of complete stack with 'best of breed' components Gen1, Gen2 (current focus) Users can download the entire stack and run it; the latest OFED release is available, and OFED 1.6 is underway 99
100 OpenFabrics Stack with Unified Verbs Interface Verbs Interface (libibverbs) Mellanox (libmthca) QLogic (libipathverbs) IBM (libehca) User Level Chelsio (libcxgb3) Mellanox (ib_mthca) QLogic (ib_ipath) IBM (ib_ehca) Chelsio (ib_cxgb3) Kernel Level Mellanox Adapters QLogic Adapters IBM Adapters Chelsio Adapters 100
101 OpenFabrics on Convergent IB/HSE Verbs Interface (libibverbs) ConnectX (libmlx4) at user level, ConnectX (ib_mlx4) at kernel level, ConnectX Adapters For IBoE and RoCE, the upper-level stacks remain completely unchanged Within the hardware: Transport and network layers remain completely unchanged Both IB and Ethernet (or CEE) link layers are supported on the network adapter Note: The OpenFabrics stack is not valid for the Ethernet path in VPI; that path still uses sockets and TCP/IP 101
102 Kernel bypass Kernel bypass OpenFabrics Software Stack Application Level User APIs Diag Tools Open SM User Level MAD API InfiniBand OpenFabrics User Level Verbs / API iwarp R-NIC User Space IP Based App Access Sockets Based Access SDP Lib UDAPL Various MPIs Block Storage Access Clustered DB Access Access to File Systems SA MAD SMA PMA IPoIB SDP Subnet Administrator Management Datagram Subnet Manager Agent Performance Manager Agent IP over InfiniBand Sockets Direct Protocol Upper Layer Protocol Kernel Space IPoIB SDP SRP iser RDS NFS-RDMA RPC Cluster File Sys SRP iser RDS SCSI RDMA Protocol (Initiator) iscsi RDMA Protocol (Initiator) Reliable Datagram Service Mid-Layer SA Client MAD SMA Connection Manager Connection Manager Abstraction (CMA) Connection Manager UDAPL HCA R-NIC User Direct Access Programming Lib Host Channel Adapter RDMA NIC InfiniBand OpenFabrics Kernel Level Verbs / API iwarp R-NIC Provider Hardware Hardware Specific Driver InfiniBand HCA Hardware Specific Driver iwarp R-NIC Key Common InfiniBand iwarp Apps & Access Methods for using OF Stack 102
103 InfiniBand in the Top500 Percentage share of InfiniBand is steadily increasing 103
104 InfiniBand in the Top500 (Nov. 2010) (pie charts: share of the number of systems and of aggregate performance by interconnect family: Gigabit Ethernet, InfiniBand, Proprietary, Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, Custom) 104
105 InfiniBand System Efficiency in the Top500 List (chart: cluster efficiency comparison of Top500 systems, grouped as IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM-BlueGene and Cray) 105
106 Large-scale InfiniBand Installations 214 IB Clusters (42.8%) in the Nov 10 Top500 list Installations in the Top 30 (13 systems): 120,640 cores (Nebulae) in China (3rd); 73,278 cores (Tsubame-2.0) in Japan (4th); 138,368 cores (Tera-100) in France (6th); 122,400 cores (RoadRunner) at LANL (7th); 81,920 cores (Pleiades) at NASA Ames (11th); 42,440 cores (Red Sky) at Sandia (14th); 62,976 cores (Ranger) at TACC (15th); 35,360 cores (Lomonosov) in Russia (17th); 15,120 cores (Loewe) in Germany (22nd); 26,304 cores (Juropa) in Germany (23rd); 26,232 cores (Tachyonll) in South Korea (24th); 23,040 cores (Jade) at GENCI (27th); 33,120 cores (Mole-8.5) in China (28th) More are getting installed! 106
107 HSE Scientific Computing Installations HSE compute systems with ranking in the Nov 2010 Top500 list 8,856-core installation in Purdue with ConnectX-EN 10GigE (#126) 7,944-core installation in Purdue with 10GigE Chelsio/iWARP (#147) 6,828-core installation in Germany (#166) 6,144-core installation in Germany (#214) 6,144-core installation in Germany (#215) 7,040-core installation at the Amazon EC2 Cluster (#231) 4,000-core installation at HHMI (#349) Other small clusters 640-core installation in University of Heidelberg, Germany 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and Woven Systems switch 256-core installation at Argonne National Lab with Myri-10G Integrated Systems BG/P uses 10GE for I/O (ranks 9, 13, 16, and 34 in the Top 50) 107
108 Other HSE Installations HSE has most of its popularity in enterprise computing and other non-scientific markets including wide-area networking Example Enterprise Computing Domains Enterprise Datacenters (HP, Intel) Animation firms (e.g., Universal Studios ("The Hulk"), 20th Century Fox ("Avatar"), and many new movies using 10GE) Amazon's HPC cloud offering uses 10GE internally Heavily used in financial markets (users are typically undisclosed) Many Network-attached Storage devices come integrated with 10GE network adapters ESnet to install $62M 100GE infrastructure for US DOE 108
109 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 109
110 Modern Interconnects and Protocols Application Application Interface Sockets Verbs Kernel Space TCP/IP TCP/IP SDP RDMA Protocol Implementation Ethernet Driver IPoIB Hardware Offload RDMA User space Network Adapter 1/10 GigE Adapter InfiniBand Adapter 10 GigE Adapter InfiniBand Adapter InfiniBand Adapter Network Switch Ethernet Switch InfiniBand Switch 10 GigE Switch InfiniBand Switch InfiniBand Switch 1/10 GigE IPoIB 10 GigE-TOE SDP IB Verbs 110
111 Case Studies Low-level Network Performance Clusters with Message Passing Interface (MPI) Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) InfiniBand in WAN and Grid-FTP Cloud Computing: Hadoop and Memcached 111
112 Low-level Latency Measurements (plots: small-message and large-message latency vs. message size for VPI-IB, Native IB, VPI-Eth and RoCE) ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches RoCE has a slight overhead compared to native IB because it operates at a slower clock rate (required to support only a 10Gbps link for Ethernet, as compared to a 32Gbps link for IB) 112
113 Low-level Uni-directional Bandwidth Measurements (plot: uni-directional bandwidth (MBps) vs. message size for VPI-IB, Native IB, VPI-Eth and RoCE) ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches 113
114 Case Studies Low-level Network Performance Clusters with Message Passing Interface (MPI) Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) InfiniBand in WAN and Grid-FTP Cloud Computing: Hadoop and Memcached 114
115 MVAPICH/MVAPICH2 Software High Performance MPI Library for IB and HSE MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2) Used by more than 1,550 organizations in 60 countries More than 66,000 downloads from OSU site directly Empowering many TOP500 clusters 11th ranked 81,920-core cluster (Pleiades) at NASA 15th ranked 62,976-core cluster (Ranger) at TACC Available with software stacks of many IB, HSE and server vendors including Open Fabrics Enterprise Distribution (OFED) and Linux Distros (RedHat and SuSE) 115
116 One-way Latency: MPI over IB (plots: small-message and large-message latency vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe2, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR-PCIe) All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch 116
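Latency numbers like the ones above are typically gathered with a ping-pong microbenchmark in the style of the OSU tests; the sketch below shows the basic MPI pattern. The message size, iteration count and absence of a separate warm-up phase are simplifications, so it is an illustration of the method rather than the exact benchmark used for these plots.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 1000, size = 4;          /* 4-byte messages */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(size, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* one-way latency = round-trip time / 2 */
            printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }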
117 Bandwidth: MPI over IB (plots: unidirectional and bidirectional bandwidth (MillionBytes/sec) vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR-PCIe) All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch 117
118 One-way Latency: MPI over iwarp (plot: one-way latency vs. message size for Chelsio (TCP/IP), Chelsio (iwarp), Intel-NetEffect (TCP/IP) and Intel-NetEffect (iwarp)) 2.4 GHz Quad-core Intel (Clovertown) with 10GE (Fulcrum) Switch 118
119 Bandwidth: MPI over iwarp (plots: unidirectional and bidirectional bandwidth (MillionBytes/sec) vs. message size for Chelsio (TCP/IP), Chelsio (iwarp), Intel-NetEffect (TCP/IP) and Intel-NetEffect (iwarp)) 2.33 GHz Quad-core Intel (Clovertown) with 10GE (Fulcrum) Switch 119
120 Convergent Technologies: MPI Latency (plots: small-message and large-message latency vs. message size for Native IB, VPI-IB, VPI-Eth and RoCE) ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches 120
121 Convergent Technologies: MPI Uni- and Bi-directional Bandwidth (plots: uni-directional and bi-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and RoCE) ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches 121
122 Case Studies Low-level Network Performance Clusters with Message Passing Interface (MPI) Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) InfiniBand in WAN and Grid-FTP Cloud Computing: Hadoop and Memcached 122
123 IPoIB vs. SDP Architectural Models (diagram: in the traditional model, a user sockets application goes through the kernel TCP/IP sockets provider, the TCP/IP transport driver and the IPoIB driver to the InfiniBand CA; in the possible SDP model, the sockets API maps onto the Sockets Direct Protocol, which bypasses the kernel TCP/IP stack and uses RDMA semantics directly on the InfiniBand CA) (Source: InfiniBand Trade Association) 123
124 SDP vs. IPoIB (IB QDR) (plots: latency, bandwidth and bidirectional bandwidth vs. message size for IPoIB-RC, IPoIB-UD and SDP) SDP enables high bandwidth (up to 15 Gbps), low latency (6.6 µs) 124
125 Case Studies Low-level Network Performance Clusters with Message Passing Interface (MPI) Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) InfiniBand in WAN and Grid-FTP Cloud Computing: Hadoop and Memcached 125
126 IB on the WAN Option 1: Layer-1 Optical networks IB standard specifies link, network and transport layers Can use any layer-1 (though the standard says copper and optical) Option 2: Link Layer Conversion Techniques InfiniBand-to-Ethernet conversion at the link layer: switches available from multiple companies (e.g., Obsidian) Technically, it s not conversion; it s just tunneling (L2TP) InfiniBand s network layer is IPv6 compliant 126
127 UltraScience Net: Experimental Research Network Testbed Since 2005 (This and the following IB WAN slides are courtesy Dr. Nagi Rao, ORNL) Features: End-to-end guaranteed bandwidth channels Dynamic, in-advance, reservation and provisioning of fractional/full lambdas Secure control-plane for signaling Peering with ESnet, National Science Foundation CHEETAH, and other networks 127
128 IB-WAN Connectivity with Obsidian Switches Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY Idea is to make remote storage appear local IB-WAN switch does frame conversion IB standard allows per-hop credit-based flow control IB-WAN switch uses large internal buffers to allow enough credits to fill the wire (figure: host with HCA and storage system connected through an IB switch and the IB-WAN switch to the wide-area SONET/10GE network) 128
129 InfiniBand Over SONET: Obsidian Longbows RDMA throughput measurements over USN (setup: Linux hosts at ORNL connected via IB 4x to Longbow IB/S units and OC192 circuits through CDCI nodes at ORNL, Chicago (700 miles), Seattle (3300 miles) and Sunnyvale (4300 miles)) IB 4x: 8Gbps (full speed) Host-to-host local switch: 7.5Gbps ORNL loop - 0.2 mile: 7.48Gbps ORNL-Chicago loop - 1400 miles: 7.47Gbps ORNL-Chicago-Seattle loop - 6600 miles: 7.37Gbps ORNL-Chicago-Seattle-Sunnyvale loop - 8600 miles: 7.34Gbps Hosts: Dual-socket Quad-core 2 GHz AMD Opteron, 4GB memory, 8-lane PCI-Express slot, Dual-port Voltaire 4x SDR HCA 129
130 IB over 10GE LAN-PHY and WAN-PHY (setup: same ORNL-Chicago-Seattle-Sunnyvale testbed, with OC192 and 10GE WAN-PHY links through E300 switches at Chicago and Sunnyvale) ORNL loop - 0.2 mile: 7.5Gbps ORNL-Chicago loop - 1400 miles: 7.49Gbps ORNL-Chicago-Seattle loop - 6600 miles: 7.39Gbps ORNL-Chicago-Seattle-Sunnyvale loop - 8600 miles: 7.36Gbps 130
131 MPI over IB-WAN: Obsidian Routers (setup: Cluster A and Cluster B connected over a WAN link through Obsidian WAN routers with variable delay emulating distance) Measurements: MPI bidirectional bandwidth vs. delay/distance, and impact of encryption on message rate (delay 0 ms) Hardware encryption has no impact on performance for a small number of communicating streams S. Narravula, H. Subramoni, P. Lai, R. Noronha and D. K. Panda, Performance of HPC Middleware over InfiniBand WAN, Int'l Conference on Parallel Processing (ICPP '08), September
132 Communication Options in Grid GridFTP High Performance Computing Applications Globus XIO Framework IB Verbs IPoIB RoCE TCP/IP Obsidian Routers 10 GigE Network Multiple options exist to perform data transfer on Grid Globus-XIO framework currently does not support IB natively We create the Globus-XIO ADTS driver and add native IB support to GridFTP 132
133 Globus-XIO Framework with ADTS Driver Globus XIO Interface Globus XIO Driver #1 User Globus XIO Driver #n Globus-XIO ADTS Driver Data Connection Management Persistent Session Management Data Transport Interface Buffer & File Management File System Zero Copy Channel Flow Control Memory Registration Modern WAN Interconnects InfiniBand / RoCE 10GigE/iWARP Network 133
134 Performance of Memory Based Data Transfer (plots: bandwidth vs. network delay for the ADTS and UDT drivers with 2 MB, 8 MB, 32 MB and 64 MB buffers) Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files ADTS based implementation is able to saturate the link bandwidth Best performance for ADTS obtained when performing data transfer with a network buffer of size 32 MB 134
135 Performance of Disk Based Data Transfer (plots: bandwidth vs. network delay in synchronous and asynchronous modes for ADTS with 8/16/32/64 MB buffers and IPoIB with 64 MB buffers) Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files Predictable as well as better performance when Disk-IO threads assist the network thread (Asynchronous Mode) Best performance for ADTS obtained with a circular buffer with individual buffers of size 64 MB 135
136 Application-Level Performance [Chart: bandwidth (MBps) for the CCSM and Ultra-Viz target applications with ADTS and IPoIB] Application performance for the FTP get operation with disk-based transfers Community Climate System Model (CCSM), part of the Earth System Grid project: transfers 160 TB of total data in chunks of 256 MB, network latency 30 ms Ultra-Scale Visualization (Ultra-Viz): transfers files of size 2.6 GB, network latency 80 ms The ADTS driver outperforms the UDT driver over IPoIB by more than 100% H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and the Grid (CCGrid), May 2010
137 Case Studies Low-level Network Performance Clusters with Message Passing Interface (MPI) Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) InfiniBand in WAN and GridFTP Cloud Computing: Hadoop and Memcached 137
138 A New Approach towards OFA in Cloud [Diagram of three software stacks: the current cloud software design (application over sockets over a 1/10 GigE network), the current approach towards OFA in the cloud (application over accelerated sockets with verbs/hardware offload over 10 GigE or InfiniBand), and our approach (application directly over the verbs interface over 10 GigE or InfiniBand)] Sockets were not designed for high performance Stream semantics are often a mismatch for upper layers (Memcached, Hadoop) Zero-copy is not available for non-blocking sockets (Memcached) There is significant consolidation in cloud system software Hadoop and Memcached are the developer-facing APIs, not sockets Improving Hadoop and Memcached will benefit many applications immediately! 138
139 Memcached Design Using Verbs [Diagram: sockets clients and RDMA clients first contact the master thread, which hands them off to sockets worker threads or verbs worker threads; all workers share the Memcached data (memory slabs, items)] Server and client perform a negotiation protocol The master thread assigns each client to the appropriate worker thread Once a client is assigned a verbs worker thread, it communicates directly with, and is bound to, that thread All other Memcached data structures are shared among the RDMA and sockets worker threads 139
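To give a flavor of what a verbs worker thread has to set up per RDMA client, the C sketch below allocates the standard OFA verbs resources (protection domain, completion queue, registered buffer, reliable-connection queue pair) and posts a receive. This is not the OSU Memcached code: queue depths and buffer handling are illustrative assumptions, error handling is trimmed, and connection establishment and QP state transitions (for example via rdma_cm) are omitted, although they must happen before receives can be posted.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct conn_res {
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_qp *qp;
    struct ibv_mr *mr;
    char          *buf;
};

static int setup_conn(struct ibv_context *ctx, struct conn_res *c, size_t len)
{
    c->pd = ibv_alloc_pd(ctx);                        /* protection domain */
    c->cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);   /* completion queue  */

    c->buf = malloc(len);
    /* Register the buffer so the HCA can DMA into and out of it (zero-copy). */
    c->mr = ibv_reg_mr(c->pd, c->buf, len,
                       IBV_ACCESS_LOCAL_WRITE |
                       IBV_ACCESS_REMOTE_READ |
                       IBV_ACCESS_REMOTE_WRITE);

    struct ibv_qp_init_attr attr = {
        .send_cq = c->cq,
        .recv_cq = c->cq,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                        /* reliable connection */
    };
    c->qp = ibv_create_qp(c->pd, &attr);
    return (c->pd && c->cq && c->buf && c->mr && c->qp) ? 0 : -1;
}

/* Post a receive so an incoming request lands directly in the registered
 * buffer; the worker thread later reaps the completion with ibv_poll_cq(). */
static int post_recv(struct conn_res *c, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) c->buf,
        .length = (uint32_t) len,
        .lkey   = c->mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(c->qp, &wr, &bad);
}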
140 Memcached Get Latency [Charts: latency (us) versus message size (up to 4K) for SDP, IPoIB, the OSU design, 1G and 10G-TOE, on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR)] Memcached Get latency for 4 bytes DDR: 6 us; QDR: 5 us For 4K bytes DDR: 20 us; QDR: 12 us Almost a factor-of-four improvement over 10GE (TOE) for 4K We are in the process of evaluating iWARP on 10GE 140
141 Memcached Get TPS [Charts: thousands of transactions per second (TPS) for SDP, IPoIB, 10G-TOE and the OSU design with 8 and 16 clients, on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR)] Memcached Get transactions per second for 4 bytes On IB DDR: about 600K/s for 16 clients On IB QDR: 1.9M/s for 16 clients Almost a factor-of-six improvement over 10GE (TOE) We are in the process of evaluating iWARP on 10GE 141
142 Hadoop: Java Communication Benchmark [Charts: bandwidth (MB/s) versus message size (bytes, up to 4M) for the C and Java versions of the benchmark over 1GE, IPoIB, SDP, 10GE-TOE and SDP-NIO] Sockets-level ping-pong bandwidth test Java performance depends on the use of NIO (allocateDirect) The C and Java versions of the benchmark have similar performance HDFS does not use direct-allocated buffers or NIO on the DataNode S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, Can High-Performance Interconnects Benefit Hadoop Distributed File System?, MASVDC '10, in conjunction with MICRO 2010, Atlanta, GA 142
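The C flavor of this sockets-level ping-pong bandwidth test boils down to a loop like the sketch below. It is illustrative rather than the exact benchmark used in the paper; connection setup is omitted and fd is assumed to be an already-connected, blocking TCP socket.

#include <sys/time.h>
#include <unistd.h>

/* TCP is a byte stream, so both reads and writes may be partial: loop. */
static void read_full(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0) break;
        got += (size_t) n;
    }
}

static void write_full(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n <= 0) break;
        sent += (size_t) n;
    }
}

/* Returns bandwidth in MB/s for 'iters' ping-pongs of 'len'-byte messages. */
double pingpong_bw(int fd, int is_client, char *buf, size_t len, int iters)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        if (is_client) {
            write_full(fd, buf, len);   /* send, then wait for the echo */
            read_full(fd, buf, len);
        } else {
            read_full(fd, buf, len);    /* echo back whatever arrives   */
            write_full(fd, buf, len);
        }
    }
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    return (2.0 * iters * len) / secs / 1e6;   /* bytes moved in both directions */
}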
143 Hadoop: DFS IO Write Performance [Chart: average write throughput (MB/sec) versus file size (GB) on four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] DFS IO, included in Hadoop, measures sequential-access throughput We run two map tasks, each writing to a file of increasing size (1-10 GB) Significant improvement with IPoIB, SDP and 10GigE With SSD, the performance improvement is almost seven- to eight-fold! SSD benefits are not seen without a high-performance interconnect! In line with the comment in the Google keynote about I/O performance 143
144 Hadoop: RandomWriter Performance [Chart: execution time (sec) versus number of DataNodes for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] Each map generates 1 GB of random binary data and writes it to HDFS SSD improves execution time by 50% with 1GigE for two DataNodes For four DataNodes, benefits are observed only with an HPC interconnect IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes 144
145 Hadoop Sort Benchmark [Chart: execution time (sec) versus number of DataNodes for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] Sort is the baseline benchmark for Hadoop The sort phase is I/O bound; the reduce phase is communication bound SSD improves performance by 28% using 1GigE with two DataNodes Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE 145
146 Presentation Overview Introduction Why InfiniBand and High-speed Ethernet? Overview of IB, HSE, their Convergence and Features IB and HSE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A 146
147 Concluding Remarks Presented network architectures & trends for Clusters, Grid, Multi-tier Datacenters and Cloud Computing Systems Presented background and details of IB and HSE Highlighted the main features of IB and HSE and their convergence Gave an overview of IB and HSE hardware/software products Discussed sample performance numbers in designing various high-end systems with IB and HSE IB and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues that need novel solutions 147
148 Funding Acknowledgments [Logos of funding sponsors and equipment sponsors] 148
149 Personnel Acknowledgments Current Students N. Dandapanthula (M.S.) R. Darbha (M.S.) V. Dhanraj (M.S.) J. Huang (Ph.D.) J. Jose (Ph.D.) K. Kandalla (Ph.D.) M. Luo (Ph.D.) V. Meshram (M.S.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekhar (Ph.D.) A. Singh (Ph.D.) H. Subramoni (Ph.D.) Past Students P. Balaji (Ph.D.) D. Buntinas (Ph.D.) S. Bhagvat (M.S.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) S. Kini (M.S.) M. Koop (Ph.D.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) P. Lai (Ph. D.) J. Liu (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) G. Santhanaraman (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Current Research Scientist S. Sur Current Post-Docs H. Wang J. Vienne X. Besseron Past Post-Docs E. Mancini S. Marcarelli H.-W. Jin Current Programmers M. Arnold D. Bureddy J. Perkins 149
150 Web Pointers MVAPICH Web Page