The Coming Decade of Data Center Networking Discontinuities
Renato Recio, IBM Fellow & System Networking CTO

Agenda
  Issues with today's networks
  Discontinuous & disruptive trends
  The coming decade of data center networking discontinuities
    Optimized: flat, converged, scalable fabrics
    Automated: virtual & overlay networks
    Integrated: software defined networks
  Summary
Early Ethernet Campus Evolution
[Diagram: campus network with WAN, Core, Aggregation and Access layers; ~95% of traffic is North-South, ~5% stays within the access layer]
In the beginning, Ethernet was used to interconnect stations (e.g. dumb terminals), initially through repeater & hub topologies, eventually through switched topologies.
Ethernet campuses evolved into a structured network, typically divided into Core, Service (e.g. firewall), Aggregation and Access layers, where:
  Traffic pattern is mostly North-South (directed outside the campus rather than peer-to-peer).
  To avoid spanning tree problems, campus networks are typically divided into layer 2 domains at the access layer.

Ethernet in the Data Center
[Diagram: data center network with WAN, Core, Aggregation and Access layers plus a SAN; < 25% of traffic is North-South, > 75% East-West]
The same structured network topologies were used in the data center, but:
  Traffic pattern is mostly East-West (e.g. application tier to database tier).
  Large layer 2 domains are needed for clustering and Virtual Machine mobility.
  Partly due to Ethernet limitations (e.g. lack of flow control), the data center used additional networks, such as Fibre Channel Storage Area Networks (SANs) and InfiniBand cluster networks.
Ethernet in the Data Center
[Diagram: data center network with WAN, Core, Aggregation and Access layers, a SAN, and emerging technologies: FCoE, OpenFlow, STP, SDN, Overlay, TRILL, ECMP]
Net: today's data center network looks similar to campus networks.
But does it still fit the current requirements of a modern data center?

Traditional Data Center Network Issues
[Diagram: data center with WAN, Core, Aggregation and Access layers, SAN]
Discrete components and piece parts
  Multiple managers and management domains
  Decoupled, box-level point services
Manual & painful
  Dynamic workload management complexity
  Multi-tenancy complications
  SLAs & security are error-prone
Limited scale
  Too many network types, with too many nodes & tiers
  Inefficient switching
  Expensive network resources
Clients are looking for smarter data center infrastructure that solves these issues.

Customers Want Larger Integrated Systems
[Diagram: buying shifts from box buying, to blade and rack, to integrated system buying]
Priority investment trends in the next 18 months:
                               Not an Investment   Somewhat of an          High Investment
                               Area (0-3)          Investment Area (4-7)   Area (8-10)
  Virtual Machine Mobility     8%                  36%                     56%
  Integrated Compute Platform  11%                 44%                     45%
  Integrated Management        9%                  47%                     44%
  Converged Fabrics            13%                 47%                     40%
Clients are seeking solutions to the complexities associated with inefficient networking, server sprawl and manual virtualization management.
An integrated system pre-packages server, storage, network, virtualization and management, and provides an automated, converged & virtualized solution with fast time to value & simple maintenance.
Smarter Data Center Infrastructure
Integrated
  Expandable integrated system
  Simple, consolidated management
  Software-driven network stack
Automated
  Workload-aware networking
  Dynamic provisioning
  Wire-once fabric. Period.
Optimized
  Converged network
  Single, flat fabric
  Grow-as-you-need architecture

What technology trends are we pursuing to tackle these issues?
Discontinuous Technologies
Discontinuity: a) the property of being not mathematically continuous; b) an instance of being not mathematically continuous; especially, a value of an independent variable at which a function is not continuous.

Discontinuous Technologies Examples
  Distributed Overlay Virtual Ethernet (DOVE) networks
  TRILL, with disjoint multi-pathing
  Converged Enhanced Ethernet
  Fibre Channel over Ethernet
  Software Defined Networking
  OpenFlow (more later)
Sustaining vs Disruptive Technologies
Sustaining: innovation that doesn't affect existing markets.
[Chart: capability vs time, with demand bands for most demanding, high-end, mid-range and low-end users]

Sustaining vs Disruptive Technologies
Disruptive: innovation that creates a new market or displaces existing technologies in a market.
[Chart: the same capability-vs-time bands, with a disruptive technology entering from below]
Personal Computing Technology Examples: Sustaining vs Disruptive
[Chart: capability vs time, 1970-2015, tracing calculators (1971 HP 9100, 1974 HP-65), PCs (1977-80 Commodore, Apple II, TRS-80; 1981 ISA PC with DOS & CGA; 1985 IBM AT with disk & 286; 1987-88 386 & graphics; 1990-91 GUI; 1994-96 multimedia & Internet), high-resolution LCDs, home Ethernet & cable/DSL, organizers/PDAs, tablets and smart phones]

Basic Technology Trends: Microprocessor
[Chart: processor performance (SAP & TPC-C relative, 2-socket) vs processor IO performance (GB/s), log scale, 2002-2012]
Observations
  2-socket performance (per VM instance) is growing at ~60%.
  IO performance has lagged, but the gap will temporarily close a bit.
Basic Technology Trends: IO Links
[Chart: IO link performance (GB/s), log scale, 2002-2015, showing microprocessor IO vs PCI-X, PCIe, PCIe Gen3 and PCIe Gen4]
Observations
  PCIe Generation 3 is tracking the traditional microprocessor IO bandwidth trend.
  One PCIe Gen3 slot can support 2x 40GE (4x 40GE, but not at link rate).

Basic Technology Trends: IO & Fabrics
[Chart: uni-directional bandwidth (GBytes/s), log scale, 2000-2020, comparing microprocessor IO, PCIe (x4/x8/x16), Ethernet, InfiniBand 4x and Fibre Channel]
Observations
  Ethernet, with data center enhancements (more next), is a disruptive technology for the SAN & cluster markets (now), and for the PCIe market in the future.
Emerging Disruptive IP/Ethernet Technologies
  Converged stacks: CEE, FCoE, RoCE
  Layer 2 based multi-pathing: TRILL, proprietary fabrics & OpenFlow
  Software Defined Networking stack
Disrupted markets: Fibre Channel SANs, InfiniBand clusters, enterprise networks

Limited Scale Traditional Data Center Networks
1. Multiple fabrics (SAN, LAN, ...) because Ethernet lacks convergence services: lossy, no bandwidth guarantee, no multi-pathing, not VM aware.
2. Inefficient topology, with too many tiers & nodes: per-hop forwarding; wasted paths (due to STP); East-West traffic forced to go North-South (e.g. due to multicast containment or the proprietary VNTag approach).
3. Servers connected directly to the core make cabling a mess.
Optimized - CEE: the Path to Converged Fabrics
[Diagram: separate management, cluster, SAN and LAN fabrics converge onto a single CEE fabric with DCBX]
Multiple fabrics: complex management, inefficient RAS.
Converged fabric: simplified management, improved RAS.

Optimized - Data Center Convergence Enabling Technologies
  Remote Direct Memory Access: RDMA over Ethernet (RoCE)
  Enhanced Transmission Selection: best-available & per-priority bandwidth-guaranteed services
  Priority-Based Flow Control: per-priority pause
  Fibre Channel over Ethernet: FC Forwarders, FC Data Forwarders, FC Snooping Bridges
[Diagrams: ETS example with HPC, storage and LAN traffic sharing a 10G link as offered traffic varies (2G/s, 4G/s, 6G/s at t1, t2, t3); PFC example pausing one of eight priorities between transmit and receive buffers]
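To make the ETS behavior above concrete, here is a minimal sketch of how a converged 10G link might share bandwidth among HPC, storage and LAN traffic classes: each class gets up to its guaranteed share, and unused capacity is lent to classes that still have traffic queued. The shares and offered loads are assumptions for the example, and real 802.1Qaz scheduling is done per priority group in hardware, not with this simplified loop.

```python
# Illustrative ETS-style bandwidth sharing on a 10 Gb/s converged link.
# Shares and offered loads below are assumptions for the example.

LINK_GBPS = 10.0

def ets_allocate(offered, shares, link=LINK_GBPS):
    """Give each class up to its guaranteed share, then lend unused
    bandwidth to classes that still have traffic queued."""
    # Phase 1: guaranteed minimums.
    alloc = {c: min(offered[c], shares[c] * link) for c in offered}
    # Phase 2: redistribute leftover capacity to still-backlogged classes.
    leftover = link - sum(alloc.values())
    backlogged = {c for c in offered if offered[c] > alloc[c]}
    while leftover > 1e-9 and backlogged:
        extra = leftover / len(backlogged)
        for c in list(backlogged):
            grant = min(extra, offered[c] - alloc[c])
            alloc[c] += grant
            leftover -= grant
            if offered[c] - alloc[c] < 1e-9:
                backlogged.discard(c)
    return alloc

# Assumed guarantees: 30% HPC, 30% storage, 40% LAN.
shares = {"hpc": 0.3, "storage": 0.3, "lan": 0.4}
# Offered traffic at three instants (Gb/s), loosely following the slide.
for t, offered in [("t1", {"hpc": 2, "storage": 4, "lan": 6}),
                   ("t2", {"hpc": 0, "storage": 5, "lan": 6}),
                   ("t3", {"hpc": 4, "storage": 4, "lan": 4})]:
    print(t, ets_allocate(offered, shares))
```

At t2, for example, the idle HPC class's guarantee is lent to storage and LAN, so the link stays fully utilized while each busy class still gets at least its configured minimum.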
Optimized - Traditional (Arcane) DCN vs the Mesh Ahead
Traditional multi-tiered tree topologies
  High oversubscription
  Expensive, high-bandwidth uplinks
  Small layer 2 fabric
  Robustness of higher-tier products has been a concern
Mesh, Clos, Jellyfish topologies
  Oversubscription only to the WAN/core
  High cross-sectional bandwidth (cheap TOR bandwidth)
  Layer 2 scaling options (more next)
  Robust, HA topologies

Optimized - Integrated System Fabric Trends
Technology trend: Standalone, to Stacking, to Fabric
Standalone
  Optimal local switching
  Low latency
  Converged
  Full L2/3
Stacking (adds)
  Single stacked switch
  Active-active multilink redundancies
  Single logical switch
Fabric (adds)
  Switch cluster
  Arbitrary topologies (fat tree, Clos, Jellyfish, ...)
  Cross-sectional bandwidth scaling
  Multiple redundant paths
  Scales to many ports
  Unifies physical & virtual
Optimized - Stacking Example: 40GE TOR Switch
[Diagram: 4x G8316 (16 QSFP+ 40 Gig E ports each) as spine, 16x G8264 10GE TOR switches, up to 768 10GbE ports across racks]
  Stacked 40GE TOR switch (e.g. G8316) as spine, pay as you grow
  Layer 2 network state is at rack level
  Adjustable from non-blocking to 3:1 oversubscription (a worked example follows the fabric options below)
  Maximum deployment up to 768 10GbE ports

Optimized - Fabric Technology Options
[Diagrams: Layer 3 fabric with ECMP, TRILL fabric, and OpenFlow switches with OpenFlow controllers]
Layer 3 ECMP
  Established technology; standards based
  Distributed control plane
  Large scalability; HA with fast convergence
  Small layer 2 without a DOVE network
  Many devices to manage
TRILL
  Large layer 2
  Distributed control plane
  Large scalability; HA with fast convergence
  Emerging technology (some proprietary)
  Single disjoint multi-path fabric may need a new RFC
OpenFlow
  Large layer 2
  Distributed control plane
  Large scalability; HA with fast convergence
  Enables network functions delivered as services, e.g. disjoint multi-pathing (more later)
  Emerging technology; client acceptance barrier
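Referring back to the stacking example above, the arithmetic behind "non-blocking to 3:1" is just the ratio of server-facing to uplink bandwidth per TOR. The sketch below assumes a leaf with 48x 10GbE server ports and a variable number of 40GE uplinks; exact per-model port counts are assumptions for illustration.

```python
# Rough leaf/spine oversubscription arithmetic for the stacking example.
# Port counts are assumptions (a TOR with 48x 10GbE server-facing ports
# and 40GE uplinks toward the stacked spine).

def oversubscription(server_ports_10g, uplinks_40g):
    down_gbps = server_ports_10g * 10
    up_gbps = uplinks_40g * 40
    return down_gbps / up_gbps

# 48 server ports with 4 uplinks -> 480G : 160G = 3:1 oversubscribed.
print(oversubscription(48, 4))   # 3.0
# 48 server ports with 12 uplinks -> 480G : 480G = non-blocking.
print(oversubscription(48, 12))  # 1.0
```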
Optimized - Shared vs Disjoint Multi-pathing
[Diagram: two completely separate Ethernet fabrics (fabric 1 cores, fabric 2 cores) vs a single TRILL fabric in which all switches are RBridges, shown shared multi-pathed and disjoint multi-pathed]
Data center storage and cluster fabrics require full path redundancy.
Completely redundant Ethernet fabrics meet this requirement, but come with an administration burden (e.g. 2x SAN configuration & maintenance).
Dual TRILL fabrics are emerging today.
A single TRILL fabric with disjoint multi-pathing would eliminate the administration burden.

Traditional Network
[Diagram: switch hardware (CPU, flash memory, switching ASIC, transceivers) running the management, control and data planes]
Management plane: switch management via Telnet, SSH, SNMP, SYSLOG, NTP, HTTP, FTP/TFTP
Control plane: network topology, L2 switching decisions, routing, ACLs, link management
Data plane: switching, forwarding, link
Network devices are closed systems.
Each network element has its own control and management plane.
OpenFlow Network
[Diagram: the management plane and the control plane (network topology, L2 switching decisions, routing, ACLs, link management) move to a server, which controls the switch's data plane (switching, forwarding, link) over the OpenFlow protocol via a secure channel]
Network devices are open and controlled from a server.
The control plane is extracted out of the network.
APIs are provided to program the network.

Optimized - OpenFlow-Based Fabric Overview
OpenFlow can also be used to create a disjoint multi-pathing fabric.
[Diagram: OpenFlow & FCF controller managing OpenFlow-enabled switches; shared multi-pathed vs disjoint multi-pathed paths]
  Each switch has layer 2 forwarding turned off.
  Each switch connects to the OpenFlow Controller (OFC).
  The OFC discovers switches and switch adjacencies.
  The OFC computes disjoint physical paths and configures switch forwarding tables.
  All ARPs go to the OFC.
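As a minimal sketch of the kind of path computation such a controller could perform, the code below greedily finds two edge-disjoint paths over a discovered adjacency map: find one shortest path, remove its links, find a second. The topology, switch names and the greedy two-pass approach are assumptions for illustration, not the actual controller algorithm or a real OpenFlow API.

```python
# Toy sketch of disjoint-path computation an OpenFlow controller might do.
# Topology and switch names are invented for the example.
from collections import deque

def bfs_path(adjacency, src, dst):
    """Shortest hop-count path from src to dst, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return None

def two_disjoint_paths(adjacency, src, dst):
    """Greedy: find one path, remove its links, find a second path."""
    first = bfs_path(adjacency, src, dst)
    if first is None:
        return None, None
    pruned = {n: set(neigh) for n, neigh in adjacency.items()}
    for a, b in zip(first, first[1:]):
        pruned[a].discard(b)
        pruned[b].discard(a)
    return first, bfs_path(pruned, src, dst)

# Assumed 2-spine / 2-leaf adjacency discovered by the controller.
adjacency = {
    "leaf1": {"spine1", "spine2"},
    "leaf2": {"spine1", "spine2"},
    "spine1": {"leaf1", "leaf2"},
    "spine2": {"leaf1", "leaf2"},
}
p1, p2 = two_disjoint_paths(adjacency, "leaf1", "leaf2")
print(p1, p2)  # e.g. ['leaf1', 'spine1', 'leaf2'] ['leaf1', 'spine2', 'leaf2']
```

The controller would then install forwarding entries so that, for example, storage traffic is pinned to one path and its redundant copy to the other, giving the disjoint multi-pathing described above.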
Optimized - Smart Data Center Networks
1. Converged fabric (LAN, SAN, cluster, ...): lossless; bandwidth allocation; disjoint multi-pathing; VM workload aware.
2. Efficient topology, with few tiers & nodes: efficient forwarding; disjoint multi-path enabled; optimized East-West flows.
3. Servers connect to switches within the rack (eliminates the cabling mess).

Manual & Painful - Virtualization Increased Network Management Complexity
Before virtualization
  Static workloads ran on a bare-metal OS.
  Each workload had network state associated with it.
  The physical network was static & simple (configured once).
Physical network with vswitches (Web, App, Database)
  Server virtualization = dynamic workloads.
  A VM's network state resides in the vswitch/DCN.
  The physical network is dynamic and more complex (VMs come up dynamically & move).
Automated - Hypervisor vswitch Automation
[Diagram: VM migration across hypervisor vswitches; network motion precedes the VM]

Automated - Network State Migration
[Diagram: platform manager coordinating hypervisors and switches]
1. Per-VM switching in hardware
2. Hypervisor vendor agnostic
3. Platform manager integration
4. Standards based (IEEE 802.1Qbg)
5. Network state migrates ahead of the VM
East-West traffic just goes through the first hop.
Automated - Network Virtualization Trends
[Chart: Virtual Machines per 2-socket server, log scale, 2006-2016, growing approximately 10x every 10 years, spanning terminal, email, web, database, groupware, application and infrastructure workloads]
  The number of VMs per socket is rapidly growing (10x every 10 years).
  This increases the amount of VM-to-VM traffic in enterprise data centers (e.g. co-resident Web, Application & Database tiers).
  VM growth increases the network complexity associated with creating/migrating layer 2 (VLANs, ACLs, ...) and layer 3 (e.g. firewall, IPS) attributes.

Automated - Network Virtualization Technology Trend
[Diagram: DOVE overlay spanning hypervisors and service VMs across DC 1 and DC 2]
Layer 2 vswitch features, plus:
1. Layer 3 Distributed Overlay Virtual Ethernet (DOVE)
2. Simple configure-once network (the physical network doesn't have to be configured per VM)
3. Decouples virtual from physical
4. Multi-tenant aware
5. Enables cross-subnet virtual appliance services (e.g. firewall, IPS)
Automated - Overview of DOVE Technology Elements
[Diagram: DOVE switches (DOVEs) forming an overlay network on top of the physical network, managed by a DOVE Controller]
DOVE Controller
  Performs management & a portion of control plane functions across DOVE switches
DOVE Switches (DOVEs)
  Provide a layer 2 over UDP overlay (e.g. based on OTV/VXLAN)
  Perform data and some control plane functions
  Run in the hypervisor vswitch or in gateways
  Provide interfaces for virtual appliances to plug into (analogous to appliance line cards on a modular switch)

Automated - DOVE Technology: VXLAN-Based Encapsulation Example
Original packet:      Inner MAC | Inner IP | Payload
Encapsulated packet:  Outer MAC | Outer IP | UDP | EP Header | Inner MAC | Inner IP | Payload
Encapsulation Protocol (EP) header (e.g. VXLAN based): Version | I | R | R | R | Reserved | Domain ID | Reserved
(VXLAN extension shown in yellow: the version field needed for the IETF version)
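The following is a minimal sketch of packing such a VXLAN-style EP header in front of an inner frame. The field widths follow the 8-byte VXLAN header (an I flag, a 24-bit VNI/domain ID, reserved bits); exactly where the DOVE version bits sit is an assumption here, and the outer MAC/IP/UDP layers are left to the sending stack.

```python
# Minimal sketch: packing a VXLAN-style encapsulation header for a DOVE-like
# overlay. Layout follows the 8-byte VXLAN header (I flag + 24-bit VNI/
# domain ID); the placement of the version bits is an assumption.
import struct

I_FLAG = 0x08  # "valid VNI" bit in the VXLAN flags octet

def pack_ep_header(domain_id, version=0):
    """Build an 8-byte EP header: flags/version octet, 24 reserved bits,
    24-bit domain ID, 8 reserved bits."""
    assert 0 <= domain_id < (1 << 24)
    flags = I_FLAG | (version << 4)   # assumed placement of the version bits
    word1 = flags << 24               # flags octet + 24 reserved bits
    word2 = domain_id << 8            # domain ID + 8 reserved bits
    return struct.pack("!II", word1, word2)

def encapsulate(inner_frame, domain_id):
    """Prepend the EP header; outer MAC/IP/UDP would be added when the
    DOVE switch sends the resulting UDP packet over the physical network."""
    return pack_ep_header(domain_id) + inner_frame

hdr = pack_ep_header(domain_id=0x1234AB)
print(hdr.hex())  # '080000001234ab00'
```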
Automated - Improving Networking Efficiency for Consolidated Servers
[Diagram: a site with Web (HTTP), APP and Database VMs spread across two subnets on a distributed vswitch, with layer-3 virtual appliances (e.g. IPS) between them and an external layer 2 / layer 3 switch]
Hypervisor vswitches enable the addition of virtual appliances (vappliances), which provide secure communication across subnets (e.g. APP to Database tier).
However, all traffic must be sent to an external layer 3 switch, which is inefficient considering VM/socket growth rates and integrated servers.
Solving this issue requires cross-subnet communication in the hypervisor's vswitch.

Automated - 4-Site Multi-Tenant with Overlapping Address Spaces
[Diagram: four sites, each hosting HTTP, APP and Database VMs for two tenants (Pepsi and Coke overlay networks) with overlapping IP addresses, connected over a DOVE network; vswitches and vappliances are not shown]
Multi-tenant, cloud environments require multiple IP address spaces within the same server, within a data center and across data centers (see above).
Layer 3 Distributed Overlay Virtual Ethernet (DOVE) switches enable multi-tenancy all the way into the server/hypervisor, with overlapping IP address spaces for the Virtual Machines.
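To make the overlapping-address idea concrete, here is a toy lookup a DOVE-style switch could perform: the tenant's domain ID plus the inner IP (which may collide across tenants) selects the physical endpoint hosting the destination VM. The domain IDs, addresses and table entries are invented for the example.

```python
# Toy mapping for overlapping tenant address spaces: the (domain, inner IP)
# pair -- not the inner IP alone -- selects the physical tunnel endpoint.
# Domain IDs, addresses and entries are invented for illustration.

FORWARDING = {
    # (domain_id, inner_ip) -> physical host carrying that VM
    (0x0000A1, "10.0.5.7"): "192.0.2.11",   # tenant "Pepsi"
    (0x0000B2, "10.0.5.7"): "192.0.2.42",   # tenant "Coke", same inner IP
}

def outer_destination(domain_id, inner_ip):
    """Resolve an overlay destination to an outer (underlay) IP."""
    try:
        return FORWARDING[(domain_id, inner_ip)]
    except KeyError:
        # A real DOVE switch would ask the DOVE controller for the mapping.
        raise LookupError(f"no mapping for domain {domain_id:#x}, {inner_ip}")

print(outer_destination(0x0000A1, "10.0.5.7"))  # 192.0.2.11
print(outer_destination(0x0000B2, "10.0.5.7"))  # 192.0.2.42
```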
Integrated - Systems Network Element Manager (SNEM)
A comprehensive tool to automate virtualized data center network workflows
  Engineer: physical, firmware, configuration, advanced PD
  Operator: monitoring, initial PD, automation
  Planner (or Engineer): investment = business direction + application requirements + utilization trends

Integrated - SNEM Examples
  Engineer: perform efficient firmware or configuration updates to multiple switches
  Operate: automate VM network resident port profiles and converged fabric Quality of Service
  Plan: performance trend & root-cause analysis, fault management, ...
Integrated - Software Defined Networking Technologies
[Diagram: SDN stack with services (SNEM, path services, multi-tenant services, SAN services) over Network APIs, and an Integrated System Network OS / Network Controller with native switch (L2/3), DOVE and OpenFlow drivers controlling hardware & embedded software, including 5KV virtual switches]
Network functions delivered as services
  Multi-tenant VM security
  Virtualized load balancing
Network APIs provide an abstract interface into the underlying controller
  Distribute, configure & control state between services & controllers
  Provide multiple abstract views
Network Operating System drives a set of devices
  Physical devices (e.g. TOR)
  Virtual devices (e.g. vswitch)

Integrated - Software Defined Networking Value
Network services value:
  Ecosystem for network apps vs today's closed switch model (think smart phone app model)
  Examples: SAN (FC) services; application performance monitor
DOVE network value:
  Virtualized network resource provisioning
  Decouples the virtual network from the physical network
  Simple configure-once network (the physical network doesn't have to be configured per VM)
  Cloud scale (e.g. multi-tenant)
OpenFlow value:
  Decouples the switch's control plane from the data plane
  Data-center-wide physical network control
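To illustrate the driver abstraction this stack implies (one service-facing network API fanning out to native-switch, DOVE and OpenFlow back ends), here is a minimal sketch. The class and method names are invented for illustration and are not IBM's actual API.

```python
# Minimal sketch of a network-OS driver abstraction: one service-facing API,
# multiple back ends (DOVE overlay, OpenFlow). Names are invented, not a
# real product API.
from abc import ABC, abstractmethod

class FabricDriver(ABC):
    @abstractmethod
    def provision_segment(self, tenant: str, segment_id: int) -> None: ...
    @abstractmethod
    def connect_port(self, segment_id: int, port: str) -> None: ...

class DoveDriver(FabricDriver):
    def provision_segment(self, tenant, segment_id):
        print(f"DOVE: create overlay domain {segment_id:#x} for {tenant}")
    def connect_port(self, segment_id, port):
        print(f"DOVE: attach vswitch port {port} to domain {segment_id:#x}")

class OpenFlowDriver(FabricDriver):
    def provision_segment(self, tenant, segment_id):
        print(f"OpenFlow: install isolation flows for segment {segment_id:#x}")
    def connect_port(self, segment_id, port):
        print(f"OpenFlow: program flow entries for port {port}")

class NetworkOS:
    """Service-facing API that fans out to whichever drivers are loaded."""
    def __init__(self, drivers):
        self.drivers = drivers
    def create_tenant_network(self, tenant, segment_id, ports):
        for drv in self.drivers:
            drv.provision_segment(tenant, segment_id)
            for port in ports:
                drv.connect_port(segment_id, port)

NetworkOS([DoveDriver(), OpenFlowDriver()]).create_tenant_network(
    "pepsi", 0xA1, ["vm1-eth0", "vm2-eth0"])
```

Services above the API only see "create a tenant network"; whether that lands as an overlay domain, OpenFlow flow entries, or native switch configuration is the controller's concern, which is the decoupling the value slide describes.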
Integrated - DOVE Technology + Multi-pathing
[Diagram: DOVE gateways over a standards-based multi-pathed physical network, controlled by a Software Defined Networking stack (more next); DOVE controls overlay network forwarding]
The DOVE network simplifies the virtual machine network
  Enables multi-tenancy all the way to the VM
  Enables a single MAC address per physical server (2 for HA)
  Significantly reduces the size of physical network TCAM & ACL tables
  Increases layer 2 scale within the data center and across data centers, by decoupling the VM's layer 2 from the physical network
Qbg automates layer 2 provisioning; DOVE automates layer 3-7 provisioning
Standards-based multi-pathed physical network

Integrated - Integrated System
An integrated system provides fast time to value, is simple to maintain and scales without limits.
[Diagram: racks of compute and storage with vswitches joined by a single scalable interconnect; scale-out elasticity]
High-performance integrated systems
  Optimized: converged, multi-path fabric
  Automated: network virtualization
  Integrated: Software Defined Networking
Thank You
Renato J. Recio, IBM Fellow & Systems Networking CTO
11400 Burnett Road, Austin, TX 78758
512-973-2217
recio@us.ibm.com
Optimized - TRILL-Based Fabric Overview
[Diagram: management console and a TRILL fabric of RBridges]
TRILL provides multi-pathing through the network
  Uses IS-IS to provide the shortest layer 2 path
  Works with arbitrary topologies
  Easy implementation, minimum configuration
  Scalable deployment, from a few to a large number of switches
Switches that support TRILL use the IETF Routing Bridge (RBridge) protocol
  RBridges discover each other and distribute link state
  RBridges use a TRILL header to encapsulate packets (a header sketch follows the group-management slide below)
TRILL = Transparent Interconnection of Lots of Links

Integrated - Firmware & Configuration Group Management
Engineer: perform efficient firmware or configuration updates to multiple switches; single operation on multiple devices
1. Simultaneous management operations on multiple switches
2. Back up, restore, compare & maintain configuration history
3. Firmware upgrades
4. Control operations
5. CLI script execution
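Returning to the TRILL overview above, here is a minimal sketch of packing the 6-byte TRILL header (RFC 6325) that an RBridge prepends to the inner Ethernet frame. The nickname values are invented for the example, and a real ingress RBridge also adds an outer Ethernet header with the TRILL Ethertype (0x22F3) addressed to the next-hop RBridge.

```python
# Sketch of TRILL (RFC 6325) encapsulation: a 6-byte TRILL header with
# version, multi-destination bit, hop count, and egress/ingress RBridge
# nicknames. Nickname values are invented for illustration.
import struct

def trill_header(egress_nick, ingress_nick, hop_count=63, multi_dest=False):
    assert 0 < hop_count < 64
    # Bit layout of the first 16 bits: V(2) | R(2) | M(1) | Op-Length(5) | Hop Count(6)
    first = (0 << 14) | (int(multi_dest) << 11) | (0 << 6) | hop_count
    return struct.pack("!HHH", first, egress_nick, ingress_nick)

def encapsulate(inner_frame, egress_nick, ingress_nick):
    """Prepend the TRILL header to the inner Ethernet frame."""
    return trill_header(egress_nick, ingress_nick) + inner_frame

hdr = trill_header(egress_nick=0x0101, ingress_nick=0x0202)
print(hdr.hex())  # '003f01010202'
```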
Integrated - Converged Network Setup
Operator: monitoring, initial PD, automation
Converged Enhanced Ethernet management
  Priority Flow Control (PFC)
  Bandwidth allocation by Priority Group [a.k.a. Enhanced Transmission Selection (ETS)]

Integrated - Port Profile (VSI Database) Setup
Operator: monitoring, initial PD, automation
Full VM migration with connectivity persistence requires Port Profile (VSI database in 802.1Qbg terms) automation
Qbg setup:
  VLANs for each VSI type
  Access Control Lists
  Send & receive bandwidth
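To show what such a per-VSI-type port profile might carry (VLANs, ACLs, send/receive bandwidth) and how it gets re-applied when a VM attaches or migrates, here is a toy data structure. The field names and values are illustrative and are not the actual 802.1Qbg VDP/VSI schema.

```python
# Toy VSI-database entry for 802.1Qbg-style port-profile automation.
# Field names and values are illustrative, not the actual VDP/VSI schema.
from dataclasses import dataclass, field

@dataclass
class VsiProfile:
    vsi_type: str
    vlans: list[int] = field(default_factory=list)
    acls: list[str] = field(default_factory=list)   # symbolic ACL names
    tx_bw_mbps: int = 0
    rx_bw_mbps: int = 0

# Profiles keyed by VSI type; a platform manager would populate this.
VSI_DB = {
    "web-tier": VsiProfile("web-tier", vlans=[100], acls=["permit-http"],
                           tx_bw_mbps=2000, rx_bw_mbps=2000),
    "db-tier":  VsiProfile("db-tier", vlans=[200], acls=["db-only"],
                           tx_bw_mbps=4000, rx_bw_mbps=4000),
}

def apply_profile(switch_port, vsi_type):
    """On VM attach or migration, the destination switch port pulls the
    profile and programs VLANs, ACLs and bandwidth before the VM sends."""
    profile = VSI_DB[vsi_type]
    print(f"{switch_port}: vlans={profile.vlans} acls={profile.acls} "
          f"bw={profile.tx_bw_mbps}/{profile.rx_bw_mbps} Mb/s")

apply_profile("switch7/port12", "web-tier")
```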
Integrated Performance Management Planner (or Engineer) Investment = business direction + application requirements + utilization trends Real-time performance monitoring of switch statistics Fault management & root-cause analysis Performance data trend charts 53 Example of Qbg Channels and CEE CNA (Qbg Channels for traffic) Switch (CEE for traffic) HPC vnic1 HPC vnic2 HPC vnic3 HPC vnic4 CEE link HPC vnic1 HPC vnic2 HPC vnic3 HPC vnic4 Storage vhba1 Storage vhba2 HPC Traffic Storage Traffic 2G/s Storage vhba1 Storage vhba2 Storage vhba3 Storage vhba3 LAN Traffic 4G/s 5G/s Time 1 Time 2 Time 3 LAN vnic1 LAN vnic2 LAN vnic3 Network vnic1 Network vnic2 Network vnic3 Network vnic4 LAN vnic4 54 27