RouteBricks: A Fast, Software-Based, Distributed IP Router

RouteBricks: A Fast, Software-Based, Distributed IP Router. Brad Karp, UCL Computer Science (with thanks to Katerina Argyraki of EPFL for slides). CS GZ03 / M030, 18th November 2009.

One-Day Room Change! On Monday the 23rd of November, we will meet in Chandler House G10 at the usual time. This room change is for that one day only; we will revert to this room thereafter.

Distributed Systems Context. Many models of consistency: sequential consistency (Ivy), close-to-open consistency (NFS), eventual consistency/conflict resolution despite disconnection (Bayou), atomic appends for multiple writers (GFS). Cluster as platform: Ivy, GFS. Today: distributing an IP router over a cluster of PCs, for capacity and programmability. It is a highly parallel workload: packets are forwarded independently (apart from a flow-ordering constraint), and the work is also distributed within a single PC across multiple multi-core CPUs.

Router = N x R packet processing + switching. Fast = high R and N. Programmable = any processing.

Why programmable routers? New ISP services» intrusion detection, application acceleration. Monitoring» measure link latency, track down traffic. New protocols» IP traceback, Trajectory Sampling, ...

Today: fast OR programmable. Fast hardware routers» processing by specialized hardware» not programmable» aggregate throughput: Tbps. Programmable software routers» processing by general-purpose CPUs» aggregate throughput < 10 Gbps.

RouteBricks. A router out of off-the-shelf PCs» familiar programming environment» large-volume manufacturing (cheap, widely available, growing in CPU and I/O with PCs). Can we build a Tbps router out of PCs?

A hardware router: N linecards. Processing at rate ~R per linecard.

A hardware router: N linecards plus a switch fabric. Processing at rate ~R per linecard; switching at rate N x R by the switch fabric.

RouteBricks: N servers connected by a commodity interconnect. Processing at rate ~R per server; switching at rate ~R per server.

RouteBricks: N servers connected by a commodity interconnect. Per-server processing rate: cR, for a small constant c.

Outline: Interconnect; Server optimizations; Performance; An application; Conclusions.

Requirements. Internal link rates < R. Per-server processing rate: cR. Per-server fanout: constant.

Straw Man: Direct Full Mesh of N servers.

Straw Man: Direct Full Mesh. N external links of capacity R; N^2 internal links of capacity R.

Better: Valiant Load Balancing [Valiant, Brebner, 1981].

Valiant Load Balancing. Per-server processing rate: 3R» processing overhead: 50% (2R without VLB). N^2 links of capacity 2R/N.
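
To make the capacity arithmetic concrete, here is a minimal Python sketch (illustrative function names, only the relations stated on these slides) comparing the straw-man full mesh with VLB:

# Sketch of the capacity arithmetic on these slides: N servers, one
# external port of rate R (Gbps) each. Illustrative only.

def full_mesh(N, R):
    # Straw man: a worst-case traffic matrix can direct all of one port's
    # traffic at a single other port, so every internal link needs rate R,
    # and each server processes its own ingress plus egress traffic (2R).
    return {"internal_links": N * (N - 1), "link_rate": R, "per_server_rate": 2 * R}

def vlb(N, R):
    # Valiant Load Balancing: every packet takes two internal hops
    # (ingress -> random intermediate -> egress), so each of the ~N^2
    # links needs only 2R/N, while a server may touch a packet as
    # ingress, intermediate, and egress node: worst case 3R.
    return {"internal_links": N * (N - 1), "link_rate": 2 * R / N, "per_server_rate": 3 * R}

print(full_mesh(32, 10))   # e.g. N = 32 external ports of 10 Gbps each
print(vlb(32, 10))         # per-link rate drops from 10 Gbps to 0.625 Gbps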

VLB with Uniform Traffic.

Direct Valiant LB: Optimize for Uniform Traffic Load.

Direct Valiant LB. If the traffic matrix is uniform: per-server processing rate 2R.

VLB Summary. Worst-case per-server processing rate: 3R. If the traffic matrix is uniform: 2R.

Per-Server Fanout? Connectivity degree: N per server. What if there are not enough network ports?

Solution #1: Increase Server Capacity. E.g., double the number of external ports per server: this doubles the data rate on internal links and the processing rate per server, and cuts fanout in half.

Solution #2: Add Intermediate Servers. A k-degree, n-stage butterfly ("k-ary n-fly"). Per-server fanout: k. Stages: n = log_k N.
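
A toy Python sketch of the stage count n = log_k N; it only illustrates the relation on this slide and ignores the practical per-server port-budget details behind the 1024-port example that follows:

def stages_needed(N, k):
    # smallest n with k**n >= N, i.e. the slide's n = log_k N rounded up
    n = 1
    while k ** n < N:
        n += 1
    return n

def choose_topology(N, max_fanout):
    # Full mesh if each server can reach all N-1 others directly,
    # otherwise a k-ary n-fly with k = the affordable per-server fanout.
    if max_fanout >= N - 1:
        return "full mesh"
    n = stages_needed(N, max_fanout)
    return f"{max_fanout}-ary {n}-fly"

print(choose_topology(32, 40))   # enough fanout -> full mesh
print(choose_topology(64, 8))    # -> 8-ary 2-fly (8**2 = 64 ports)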

The RouteBricks Interconnect: Combination. Assign the maximum number of external ports per server; use a full mesh if the fanout allows, and extra (intermediate) servers otherwise.

Example. Assuming current servers» 5 NICs, 2 x 10G ports or 8 x 1G ports» 1 external port per server. N = 32 ports: full mesh» 32 servers. N = 1024 external ports: 16-ary 4-fly» 3072 servers (2 extra servers per port).

Recap. Valiant Load Balancing over a full mesh or a k-ary n-fly. Per-server processing rate: 2R to 3R.

Outline: Interconnect; Server optimizations; Performance; An application; Conclusions.

First Try: a Shared-Bus Server (64-byte packets; figure: ports, north bridge, memory, cores running Click)» Cloverton architecture» FSB: 2 x 1.33 GHz» CPUs: 2 x Xeon 2.4 GHz 4-core» NICs: 2 x Intel XFS 2x10Gbps.

First Try: a Shared-Bus Server. Result with 64-byte packets: 820 Mbps.

Problem #1: the Shared Bus. The FSB address bus is saturated; multi-core alone is not enough.

Solution: NUMA architecture (figure: ports, I/O hub, cores, and per-socket memory)» Nehalem architecture» QuickPath interconnect» CPUs: 2 x Xeon 2.4 GHz 4-core» NICs: 2 x Intel XFS 2x10Gbps.

Solution: NUMA architecture. Result: 1.3 Gbps (up from 820 Mbps).

Problem #2: Per-Packet Overhead. Bookkeeping operations» moving packet descriptors between the NIC and memory» updating descriptor rings.

Solution: Batching. Poll-driven batching» poll multiple packets at a time» reduces updates on descriptor rings» Click already supported it. NIC-driven batching» relay multiple packet descriptors at a time» reduces transactions on the PCIe and I/O buses» required changing the NIC driver.
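
As an illustration of why batching cuts the bookkeeping cost, here is a small self-contained Python sketch of a poll-driven receive loop; MockRxRing, poll_once, and BATCH are hypothetical names, not the actual Click or NIC-driver interfaces:

from collections import deque

BATCH = 32   # descriptors pulled per poll

class MockRxRing:
    # Toy stand-in for a NIC receive descriptor ring (illustration only).
    def __init__(self, packets):
        self.descriptors = deque(packets)
        self.tail_updates = 0            # how often the ring tail is written

    def peek(self, n):
        return list(self.descriptors)[:n]

    def advance(self, n):
        for _ in range(n):
            self.descriptors.popleft()
        self.tail_updates += 1           # one write per batch, not per packet

def poll_once(ring, process):
    batch = ring.peek(BATCH)             # grab up to BATCH ready descriptors
    for pkt in batch:
        process(pkt)                     # e.g. hand the packet to the forwarding path
    if batch:
        ring.advance(len(batch))         # single ring update amortised over the batch
    return len(batch)

ring = MockRxRing(range(100))
while poll_once(ring, lambda pkt: None):
    pass
print("ring tail updates:", ring.tail_updates)   # 4 for 100 packets, vs 100 per-packet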

Solution: Batching. Result: 3 Gbps (up from 1.3 Gbps).

Problem #3: Queue Access. Rule 1: one core per port. Rule 2: one core per packet. (Figure: measured throughput of 0.6, 1.2, and 1.7 Gbps under different queue-to-core assignments.) Can we always enforce both rules? What about when receiving on one 10 Gbps port?

Solution: Multi-Queue NICs. Rule 1: one core per port. Rule 2: one core per packet.
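
One natural way to satisfy both rules with multi-queue NICs is to give each port as many queues as there are cores and let core i poll only queue i of every port. The Python sketch below is illustrative (names and queue counts are assumptions, not the RouteBricks configuration):

# Sketch: queue-to-core assignment with multi-queue NICs. Each queue is
# polled by exactly one core (so no locking is needed), and a packet is
# received, processed, and transmitted by that same core.

def build_assignment(num_ports, num_cores):
    # assignment[core] = list of (port, queue) pairs that this core polls
    return {core: [(port, core) for port in range(num_ports)]
            for core in range(num_cores)}

assignment = build_assignment(num_ports=2, num_cores=8)
print(assignment[0])   # core 0 polls queue 0 on every port
print(assignment[7])   # core 7 polls queue 7 on every port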

Single-Server Performance. Result: 9.7 Gbps (after the progression 820 Mbps, 1.3 Gbps, 3 Gbps).

Recap. State-of-the-art hardware» NUMA architecture» multi-queue NICs. Wrote a NIC driver» batching» lock-free queue access. Careful queue-to-core allocation» one core per queue» one core per packet.

Outline: Interconnect; Server optimizations; Performance; An application; Conclusions.

Single-Server Performance (Gbps). Forwarding: 24.6 with a real packet trace, 9.7 with 64-byte packets. IP routing: 24.6 and 6.35. IPsec: 4.45 and 1.4.

Feasible Router Line Rate. Per-server processing rate: 2R to 3R. Real packet mix: R = 8 to 12 Gbps. Small packets: R = 2 to 3 Gbps.
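
These line-rate figures follow from dividing the single-server IP-routing rates on the previous slide by the VLB factor of 2 (uniform traffic) to 3 (worst case); a trivial Python check:

# Feasible external line rate R given a measured per-server rate and the
# VLB per-server factor of 2 (uniform traffic) to 3 (worst case).

def feasible_line_rate(per_server_gbps):
    return per_server_gbps / 3, per_server_gbps / 2   # (worst case, uniform)

print(feasible_line_rate(24.6))   # real packet mix -> about 8 to 12 Gbps
print(feasible_line_rate(6.35))   # 64-byte packets -> about 2 to 3 Gbps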

Bottlenecks. Small packets: the CPU» buses far from saturation. Real packet-size mix: none» limited PCIe lanes» only for the prototype. Expected evolution?» next Nehalem: 4 x 8 = 32 cores» 4 to 8 PCIe 2.0 slots.

Projected Single-Server Performance (Gbps). Forwarding: 70 with a real packet trace, 38.8 with 64-byte packets. IP routing: 70 and 19.9.

Projected Router Line Rate. Real packet mix: R = 23 to 35 Gbps. Small packets: R = 6.5 to 10 Gbps.

RB4 Prototype. N = 4 external ports» 1 server per port» full mesh. Real packet mix: 4 x 8.75 = 35 Gbps» expected R = 8 to 12 Gbps. Small packets: 4 x 3 = 12 Gbps» expected R = 2 to 3 Gbps.

What About Packet Order? TCP cuts its sending rate by half if packets are ever reordered by more than 3 positions, but VLB sprays packets randomly across intermediate nodes! RouteBricks' partial solution:» assign packets of the same flow to the same receive queue» during any 100 ms interval, VLB forwards packets of the same flow to the same next hop. 0.15% of packets are reordered with this mechanism; 5.5% without it.
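
The slide does not spell out the mechanism, but one simple way to keep a flow on one path for 100 ms at a time is to hash the flow identifier together with a coarse 100 ms epoch when choosing the VLB next hop. A hedged Python sketch (intermediate_server and EPOCH_MS are illustrative names, not from the RouteBricks code):

import zlib

EPOCH_MS = 100   # the path choice is only allowed to change every 100 ms

def intermediate_server(flow_5tuple, num_servers, now_ms):
    # Hash (flow, epoch) so all packets of a flow within one 100 ms
    # window are sprayed to the same intermediate server.
    epoch = now_ms // EPOCH_MS
    key = repr((flow_5tuple, epoch)).encode()
    return zlib.crc32(key) % num_servers

flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)   # src, dst, proto, sport, dport
print(intermediate_server(flow, 32, now_ms=1_000_000))
print(intermediate_server(flow, 32, now_ms=1_000_050))   # same epoch -> same server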

Latency. One server: 24 microseconds. Three servers: 66.4 microseconds.

RouteBricks Summary. RouteBricks: a fast software router» a Valiant Load-Balanced cluster of commodity servers. Programmable with Click. Performance:» easily R = 1 Gbps with N in the hundreds» R = 10 Gbps with next-generation servers. A programming model for more complex functionality?