Big, Fast Routers. Router Architecture


Big, Fast Routers
Dave Andersen, CMU CS 15-744

- Data plane: router architecture, how packets get forwarded.
- Control plane: how routing protocols establish routes, etc.

Processing: Fast Path vs. Slow Path
- Optimize for the common case. The BBN router's fast-path code is 85 instructions and fits entirely in the L1 cache.
- Non-common cases are handled on the slow path:
  - Route cache misses
  - Errors (e.g., ICMP time exceeded)
  - IP options
  - Fragmented packets
  - Multicast packets

First Generation Routers: Shared Bus
- The CPU, route table, and off-chip packet buffers share a single bus with the interface cards.
- An interface card DMAs the packet into a buffer, the CPU examines the header, and the output interface DMAs it out.
- Typically <0.5 Gb/s aggregate capacity.
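The fast-path/slow-path split above can be sketched as a simple dispatch check. This is a hedged sketch: the packet header is modeled as a dict, and the field names (`ttl`, `ihl`, `frag_offset`, `more_fragments`, `dst`) are assumptions, not from the lecture.

```python
def choose_path(hdr):
    """Return 'slow' for any non-common case listed above, else 'fast'."""
    if hdr.get("ttl", 64) <= 1:                      # would need ICMP time exceeded
        return "slow"
    if hdr.get("ihl", 5) > 5:                        # IP options present
        return "slow"
    if hdr.get("frag_offset", 0) or hdr.get("more_fragments", False):
        return "slow"                                # fragmented packet
    first_octet = int(hdr.get("dst", "0.0.0.0").split(".")[0])
    if 224 <= first_octet <= 239:                    # IPv4 multicast range
        return "slow"
    return "fast"
```

A route-cache miss would also punt to the slow path; that check is omitted here since it needs the cache itself.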

Second Generation Routers
- Bypasses the memory bus with direct transfers between line cards over the shared bus.
- Moves forwarding decisions local to the card (a forwarding cache on each card) to reduce CPU load; punt to the CPU for slow operations.
- Typically <5 Gb/s aggregate capacity.

Control Plane & Data Plane
- The control plane must remember lots of routing information (BGP tables, etc.).
- The data plane only needs the FIB (Forwarding Information Base): smaller, less information, etc. This simplifies the line cards vs. the network processor.

Bus-based Routers
- Some improvements possible: cache bits of the forwarding table in the line cards and send packets directly over the bus to the outbound line card.
- But the shared bus was the big bottleneck. E.g., a modern PCI bus (PCIe x16) is only 32 Gbit/s in theory, while an almost-modern Cisco (XR 12416) is 320 Gbit/s. Ow! How do we get there?

Third Generation Routers: Crossbar (Switched Backplane)
- Each line card has its own local buffer and forwarding table; the CPU keeps the routing table and pushes periodic control updates to the cards.
- Typically <50 Gb/s aggregate capacity.

Crossbars
- N input ports, N output ports (usually one per line card).
- Every line card has its own forwarding table/classifier/etc., which removes the CPU bottleneck.
- A scheduler decides which input/output ports to connect in a given time slot.
- Crossbar constraint: if input i is connected to output j, no other input may connect to output j and no other output to input i. Scheduling is bipartite matching.

What's so hard here? Back-of-the-envelope numbers:
- Line cards can be 40 Gbit/s today (OC-768), and undoubtedly faster in a few more years, so scale these numbers appropriately!
- To handle minimum-sized packets (~40 B), that is 125 Mpps, or 8 ns per packet.
- This can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM; good lookup algorithms are still needed.
- For every packet you must: do a routing lookup (where to send it), schedule the crossbar, and maybe buffer, apply QoS, or filter by ACLs.
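The per-packet budget above can be checked directly, assuming a 40 Gbit/s (OC-768) line card and 40-byte minimum-sized packets:

```python
# Per-packet time budget for one line card.
line_rate_bps = 40e9        # OC-768, roughly 40 Gbit/s
pkt_bits = 40 * 8           # minimum-sized packet, 40 bytes

pps = line_rate_bps / pkt_bits   # packets per second the card must handle
ns_per_pkt = 1e9 / pps           # lookup + scheduling budget per packet

print(pps, ns_per_pkt)
```

This reproduces the 125 Mpps and 8 ns figures quoted above.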

Routing Lookups
- Routing tables hold 200,000 to 1M entries, and a router must be able to handle the routing-table load five years hence.
- So, how to do it?
  - DRAM (dynamic RAM, ~50 ns latency): cheap, slow.
  - SRAM (static RAM, <5 ns latency): fast, $$.
  - TCAM (ternary content-addressable memory, parallel lookups in hardware): really fast, quite $$, lots of power.

Longest-Prefix Match
- More than one entry can match a destination: 128.2.0.0/16 vs. 128.2.153.0/24.
- Must take the longest (most specific) prefix.
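A minimal longest-prefix match, by linear scan over the two example prefixes above. The interface names are made up for illustration; a real router uses the trie or TCAM structures discussed next, not a scan.

```python
import ipaddress

# Toy forwarding table: prefix -> outgoing interface (names are hypothetical).
TABLE = {
    "128.2.0.0/16": "if1",
    "128.2.153.0/24": "if2",
}

def lpm(dst):
    """Return the port of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in TABLE.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None
```

For a destination like 128.2.153.7, both entries match, and the /24 wins because it is more specific.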

Method 1: Trie
- [Figure: a sample database of prefixes P1-P8 and the binary trie built from them.]

Skip Count vs. Path Compression
- [Figure: the same trie with one-way branches removed, encoded with a skip count ("skip 2") vs. path compression (storing the skipped bits, "11").]
- Removing one-way branches ensures the number of trie nodes is at most twice the number of prefixes.
- Using a skip count requires an exact match at the end and backtracking on failure; path compression is simpler.
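A minimal binary trie in the spirit of the figure, with prefixes written as bit strings (e.g. "111" for 111*). The node layout and names are illustrative, not the slide's exact structure.

```python
class TrieNode:
    def __init__(self):
        self.child = {}    # '0' / '1' -> TrieNode
        self.entry = None  # route stored at this prefix, if any

def insert(root, bits, entry):
    """Add a prefix (bit string) and its route entry to the trie."""
    node = root
    for b in bits:
        node = node.child.setdefault(b, TrieNode())
    node.entry = entry

def lookup(root, bits):
    """Walk the trie, remembering the last (i.e., longest) matching entry."""
    node, best = root, root.entry
    for b in bits:
        node = node.child.get(b)
        if node is None:
            break
        if node.entry is not None:
            best = node.entry
    return best
```

Because the walk remembers the deepest entry seen, the result is exactly the longest matching prefix.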

LPM with PATRICIA Tries
- Traditional method: arrange route entries into a series of bit tests (0 = left child, 1 = right child).
- Worst case: 32 bit tests.
- Problem: memory speed, even with SRAM!
- [Figure: a Patricia tree over a default route, 128.2/16, 128.32/16, 128.32.13/24, and 128.32.15/24, testing bits 1, 16, and 19.]

How can we speed LPM up? Two general approaches:
- Shrink the table so it fits in really fast memory (cache): Degermark et al. (optional reading). A complete prefix tree (every node has 2 or 0 kids) can be compressed well; three stages (match 16 bits, then the next 8, then the last 8) drastically reduce the number of memory lookups.
- A WUSTL algorithm from about the same time: binary search on prefixes.
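The 16/8/8 multi-stage idea can be sketched with nested tables. This is a hedged sketch: dicts stand in for the dense arrays (and the leaf pushing) a real implementation would use, and all names and table contents are hypothetical.

```python
def split_16_8_8(addr):
    """Split a dotted-quad IPv4 address into 16-, 8-, and 8-bit indices."""
    a, b, c, d = (int(x) for x in addr.split("."))
    v = (a << 24) | (b << 16) | (c << 8) | d
    return (v >> 16) & 0xFFFF, (v >> 8) & 0xFF, v & 0xFF

def multibit_lookup(l1, addr):
    """At most three table reads: each entry is a next hop or a deeper table."""
    i1, i2, i3 = split_16_8_8(addr)
    e = l1.get(i1)
    if isinstance(e, dict):
        e = e.get(i2)
    if isinstance(e, dict):
        e = e.get(i3)
    return e

# 128.2.0.0 has top-16 index 0x8002; /24 routes live in a second-level table.
l1 = {0x8002: {1: "if1", 153: "if2"}}
```

A /16 route would sit directly in `l1` as a string, terminating the lookup after one read.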

TCAMs for LPM
- Content-addressable memory (CAM): hardware-based route lookup. Input = tag, output = value; requires an exact match with the tag.
- With a single CAM: multiple cycles (one per prefix length). With multiple CAMs (one per prefix length): searched in parallel.
- Ternary CAM: (0, 1, don't-care) values in the tag match; priority (i.e., longest prefix) by order of entries.
- Very expensive, lots of power, but fast! Some commercial routers use it.

Caching
- Skipping LPMs with caching: packet trains exhibit temporal locality, with many packets to the same destination. Problems?
- Cisco Express Forwarding.
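A software model of the ternary match: each row is a (value, mask) pair where zero mask bits are don't-cares, and rows are ordered longest-prefix-first so the first hit is the longest match. Addresses are 32-bit integers; the table contents and port names are illustrative.

```python
# (value, mask, port) rows, highest priority (longest prefix) first.
TCAM = [
    (0x80029900, 0xFFFFFF00, "if2"),   # 128.2.153.0/24
    (0x80020000, 0xFFFF0000, "if1"),   # 128.2.0.0/16
]

def tcam_lookup(key):
    """Hardware compares every row in parallel; here we scan in priority order."""
    for value, mask, port in TCAM:
        if key & mask == value:
            return port
    return None
```

The priority-by-entry-order rule is why TCAM entries must be kept sorted by prefix length, which complicates route updates.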

Problem 2: Crossbar Scheduling
- Find a bipartite matching in under 8 ns.
- First issue: head-of-line blocking with input queues. If there is only one queue per input, max throughput <= (2 - sqrt(2)) ≈ 58%.
- Solution? Virtual output queueing: in each input line card, one queue per destination card. Requires N queues; more if QoS. This is the way it's done now.

Head-of-Line Blocking
- Problem: the packet at the front of the queue experiences contention for its output queue, blocking all the packets behind it.
- [Figure: a 3x3 switch with inputs 1-3 and outputs 1-3.]
- Maximum throughput in such a switch: 2 - sqrt(2).
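The virtual output queueing structure above can be sketched as one queue per output at each input port. Class and method names are made up for illustration.

```python
from collections import deque

class InputPort:
    """One input line card holding a virtual output queue (VOQ) per output."""
    def __init__(self, n_outputs):
        self.voq = [deque() for _ in range(n_outputs)]

    def enqueue(self, pkt, out):
        self.voq[out].append(pkt)

    def requests(self):
        """Outputs this input will bid for in the next scheduling round."""
        return [o for o, q in enumerate(self.voq) if q]
```

Because the scheduler can bid separately for each non-empty VOQ, a packet waiting for a busy output no longer blocks traffic headed to idle outputs.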

Early Crossbar Scheduling Algorithms
- Wavefront algorithm: slow! (36 cycles.)
- Observation: cells (2,1) and (1,2) don't conflict with each other (11 cycles).
- Problems: fairness, speed, ...
- Do it in groups, with the groups in parallel (5 cycles); can find the optimal group size, etc.

Alternatives to the Wavefront Scheduler: PIM (Parallel Iterative Matching)
- Request: each input sends requests to all outputs for which it has packets.
- Grant: each output selects a requesting input at random and grants it.
- Accept: each input selects one of its received grants.
- Problem: the matching may not be maximal. Solution: run several iterations.
- Problem: the matching may not be fair. Solution: grant/accept in round-robin order instead of at random.
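One PIM request/grant/accept iteration can be sketched directly from the three steps above, with random selection as in the original algorithm. Here `wants[i]` is the set of outputs input `i` has packets for; the function shape is an assumption.

```python
import random

def pim_iteration(wants, n_outputs, rng=random):
    """One PIM iteration; returns a partial matching {input: output}."""
    # Request: every input asks every output it wants.
    requests = {o: [i for i, w in enumerate(wants) if o in w]
                for o in range(n_outputs)}
    # Grant: each requested output picks one requesting input at random.
    grants = {}
    for o, inputs in requests.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(o)
    # Accept: each granted input picks one of its grants at random.
    return {i: rng.choice(outs) for i, outs in grants.items()}
```

Running the iteration repeatedly on the leftover (unmatched) ports is what pushes the matching toward maximal.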

iSLIP: Round-Robin PIM
- Each input maintains a round-robin list of outputs; each output maintains a round-robin list of inputs.
- Request phase: each input sends requests to all desired outputs.
- Grant phase: each output picks the first requesting input in its round-robin sequence; each input then picks the first granting output in its round-robin sequence.
- An output updates its round-robin sequence only if its grant is accepted.
- Good fairness in simulation.

100% Throughput?
- Why does it matter? Guaranteed behavior regardless of load! The same reason we moved away from cache-based router architectures.
- Cool result (Dai & Prabhakar): any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput.
- Speedup: run the internal crossbar links at twice the input and output link speeds.
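The grant-side round-robin pointer, the heart of iSLIP, can be sketched as follows; an input-side accept arbiter would be symmetric. Names are illustrative, not from the paper.

```python
class OutputArbiter:
    """Round-robin grant arbiter for one crossbar output."""
    def __init__(self, n_inputs):
        self.n = n_inputs
        self.ptr = 0

    def grant(self, requesting_inputs):
        """Grant the first requesting input at or after the pointer."""
        for k in range(self.n):
            i = (self.ptr + k) % self.n
            if i in requesting_inputs:
                return i
        return None

    def accepted(self, granted_input):
        # Advance one past the matched input -- but only on accept, which is
        # what desynchronizes the arbiters and gives iSLIP its fairness.
        self.ptr = (granted_input + 1) % self.n
```

Updating the pointer only on accepted grants is the key difference from naive round-robin.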

Filling in the (Painful) Details
Routers do more than just LPM and a crossbar:
- Packet classification (L3 and up!)
- Counters & stats for measurement & debugging
- IPsec and VPNs
- QoS, packet shaping, policing
- IPv6
- Access control and filtering
- IP multicast
- AQM: RED, etc. (maybe)
- Serious QoS: DiffServ (sometimes), IntServ (not)

Other Challenges in Routing
- Fast classification: (src, dst, sport, dport, {other stuff}) -> accept/deny, which class/queue, etc. Even with TCAMs it's hard: efficient use of limited entries, etc.
- Routing correctness: we'll get back to this a bit later.
- Architecture w.r.t. management: also later. How do you manage a collection of 100s of routers?
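First-match classification in the ACL style described above can be sketched as an ordered rule list. The field names and rules are made up for illustration; real classifiers match ranges and prefixes, not just exact values.

```python
# Ordered rules: (fields that must match exactly, action). First hit wins.
RULES = [
    ({"dport": 22, "proto": "tcp"}, "deny"),
    ({"proto": "tcp"}, "accept"),
    ({}, "deny"),                     # empty fields: default rule, matches all
]

def classify(pkt):
    """Return the action of the first rule whose fields all match the packet."""
    for fields, action in RULES:
        if all(pkt.get(k) == v for k, v in fields.items()):
            return action
    return "deny"
```

The first-match semantics is why rule order matters, and why packing many such rules into limited TCAM entries is hard.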

Going Forward
- Today's highest end: multi-rack routers, measured in Tb/s.
- One scenario: a big optical switch connecting multiple electrical switches. Cool design: McKeown's SIGCOMM 2003 paper.
- The BBN MGR used a normal CPU for forwarding; modern routers use several ASICs.