Getting the most TCP/IP from your Embedded Processor



Similar documents
TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

First Workshop on Open Source and Internet Technology for Scientific Environment: with case studies from Environmental Monitoring

Network-Oriented Software Development. Course: CSc4360/CSc6360 Instructor: Dr. Beyah Sessions: M-W, 3:00 4:40pm Lecture 2

Ethernet. Ethernet. Network Devices

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Network Programming TDC 561

Data Communication Networks and Converged Networks

Objectives of Lecture. Network Architecture. Protocols. Contents

10/100/1000Mbps Ethernet MAC with Protocol Acceleration MAC-NET Core with Avalon Interface

IP Network Layer. Datagram ID FLAG Fragment Offset. IP Datagrams. IP Addresses. IP Addresses. CSCE 515: Computer Network Programming TCP/IP

XPS LL Tri-Mode Ethernet MAC Performance with Monta Vista Linux Author: Brian Hill

Transport Layer Protocols

Overview of TCP/IP. TCP/IP and Internet

Chapter 11. User Datagram Protocol (UDP)

Transport Layer. Chapter 3.4. Think about

Frequently Asked Questions

Internet Architecture and Philosophy

Overview. Securing TCP/IP. Introduction to TCP/IP (cont d) Introduction to TCP/IP

Indian Institute of Technology Kharagpur. TCP/IP Part I. Prof Indranil Sengupta Computer Science and Engineering Indian Institute of Technology

UPPER LAYER SWITCHING

PART OF THE PICTURE: The TCP/IP Communications Architecture

10/100/1000 Ethernet MAC with Protocol Acceleration MAC-NET Core

Understanding TCP/IP. Introduction. What is an Architectural Model? APPENDIX

Basic Networking Concepts. 1. Introduction 2. Protocols 3. Protocol Layers 4. Network Interconnection/Internet

Optimizing Network Virtualization in Xen

Using Network Virtualization to Scale Data Centers

Computer Networks/DV2 Lab

PCI Express Overview. And, by the way, they need to do it in less time.

ICOM : Computer Networks Chapter 6: The Transport Layer. By Dr Yi Qian Department of Electronic and Computer Engineering Fall 2006 UPRM

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group

Easy H.264 video streaming with Freescale's i.mx27 and Linux

Overview of Computer Networks

Chapter 2 - The TCP/IP and OSI Networking Models

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade

IP - The Internet Protocol

EITF25 Internet Techniques and Applications L5: Wide Area Networks (WAN) Stefan Höst

Transport and Network Layer

NETWORK LAYER/INTERNET PROTOCOLS

Guide to TCP/IP, Third Edition. Chapter 3: Data Link and Network Layer TCP/IP Protocols

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

White Paper Abstract Disclaimer

Gigabit Ethernet Design

Chapter 9. IP Secure

Computer Networks CS321

GB ethernet UDP interface in FPGA

Chapter 3. TCP/IP Networks. 3.1 Internet Protocol version 4 (IPv4)

Kirchhoff Institute for Physics Heidelberg

How do I get to

Network Models OSI vs. TCP/IP

Gigabit Ethernet MAC. (1000 Mbps Ethernet MAC core with FIFO interface) PRODUCT BRIEF

Introduction To Computer Networking

The proliferation of the raw processing

IP address format: Dotted decimal notation:

How To Make Gigabit Ethernet A Reality

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

CHAPTER 3 STATIC ROUTING

Network Performance Optimisation and Load Balancing. Wulf Thannhaeuser

Optimizing Network Virtualization in Xen

TCP/IP Basis. OSI Model

Chapter 3: Review of Important Networking Concepts. Magda El Zarki Dept. of CS UC Irvine

Internetworking. Problem: There is more than one network (heterogeneity & scale)

Network Performance Evaluation. Throughput. Troy Bennett. Computer Science Department. School of Engineering. California Polytechnic State University

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Network Layer: Network Layer and IP Protocol

D1.2 Network Load Balancing

LogiCORE IP AXI Performance Monitor v2.00.a

How To Test A Microsoft Vxworks Vx Works (Vxworks) And Vxwork (Vkworks) (Powerpc) (Vzworks)

Protocols. Packets. What's in an IP packet

The TCP/IP Reference Model

Course Overview: Learn the essential skills needed to set up, configure, support, and troubleshoot your TCP/IP-based network.

Solution of Exercise Sheet 5

Intel DPDK Boosts Server Appliance Performance White Paper

ACHILLES CERTIFICATION. SIS Module SLS 1508

Network Security TCP/IP Refresher

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features

Performance Evaluation of Linux Bridge

Open Flow Controller and Switch Datasheet

AlliedWare Plus OS How To Use sflow in a Network

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Network Simulation Traffic, Paths and Impairment

Communication Systems Internetworking (Bridges & Co)

Topics. Computer Networks. Let s Get Started! Computer Networks: Our Definition. How are Networks Used by Computers? Computer Network Components

The OSI and TCP/IP Models. Lesson 2

Operating Systems Design 16. Networking: Sockets

Mobile IP Network Layer Lesson 02 TCP/IP Suite and IP Protocol

ESSENTIALS. Understanding Ethernet Switches and Routers. April 2011 VOLUME 3 ISSUE 1 A TECHNICAL SUPPLEMENT TO CONTROL NETWORK

TCP/IP and the Internet

Lecture 8. IP Fundamentals

System-on-a-Chip with Security Modules for Network Home Electric Appliances

VMWARE WHITE PAPER 1

MLPPP Deployment Using the PA-MC-T3-EC and PA-MC-2T3-EC

Introduction to TCP/IP

Lecture Computer Networks

Transcription:

Getting the most TCP/IP from your Embedded Processor

Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 2 Getting the most TCP/IP from your Embedded Processor

Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 3 Getting the most TCP/IP from your Embedded Processor

Why do we care about TCP/IP? TCP/IP is the most popular networking protocol TCP has become the de-facto standard for data communications between computer systems Ethernet is the most popular LAN technology 95%* of all LAN implementations use Ethernet Over 750 million* Ethernet ports deployed worldwide * Source: Instat/MDR 4 Getting the most TCP/IP from your Embedded Processor

TCP/IP Defined TCP/IP is a suite of protocols that allow computer systems to communicate with each other TCP/IP is an open standard Internet Engineering Task Force http://www.ietf.org/ Protocol suite is layered Application: handles details of particular application Transport: provides flow of data Network: handles movement (routing) of packets around the network Data Link: device driver + network interface Application Presentation Session Transport Network Data Link Physical Telnet, FTP, e-mail, etc. TCP, UDP IP Ethernet device driver and interface card 5 Getting the most TCP/IP from your Embedded Processor

TCP/IP Protocols TCP : Transmission Control Protocol [RFC 793] Connection oriented, reliable, byte stream transport UDP : User Datagram Protocol [RFC 768] Simple datagram oriented, unreliable transport IP : Internet Protocol [RFC 791] [RFC 791] Connectionless datagram delivery service TCP and UDP data are transmitted as IP datagrams 6 Getting the most TCP/IP from your Embedded Processor

Protocol Encapsulation Data is sent down protocol stack through each layer until it is finally sent over Ethernet media Each layer prepends a header that helps to provide the service of the each protocol TCP Segment: data that TCP layer sends to IP layer IP datagram: data that IP layer sends to network interface Ethernet Frame: stream of bits sent over Ethernet media 7 Getting the most TCP/IP from your Embedded Processor

Protocol Demultiplexing Ethernet frames received on network interface travel up the protocol stack Headers are removed and parsed by each successive protocol layer Identifiers in the headers determine which next upper layer should receive the data Each TCP Connection can uniquely be identified by 4 pieces of information (a socket): Client IP address Client port number Server IP address Server port number 8 Getting the most TCP/IP from your Embedded Processor

TCP/IP Summary TCP/IP is a suite of protocols that allow computer systems to communicate with each other Protocols are layered (like OSI model) Main Protocols that make up TCP/IP Protocol Suite TCP, UDP, IP 9 Getting the most TCP/IP from your Embedded Processor

Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges Further TCP Acceleration Techniques TCP/IP Customer Ready Demo 10 Getting the most TCP/IP from your Embedded Processor

High Resolution Imaging over TCP Image Protocol Interface to Imaging device GbE PowerPC takes data from Imaging device and transfers to remote Workstation using TCP Workstation stores image data and performs post processing Ethernet interface to Workstation 11 Getting the most TCP/IP from your Embedded Processor

Remote Monitoring & Control Xilinx Enables Remote Monitoring/Control Legacy Implementation TCP/IP Internet Network RS232 Hardware: Microblaze, 10/100 MAC, UART, GPIO Software: XMK, Http server, SNMP agent, HTML pages - Monitor parameters acquired by the controller (volts, amps, usage..) - Configure the controller settings (set alarm conditions,..) - Operate the Controller and the associated Power System A Power Monitoring/Control Unit - An old and outdated System - Monitors and Controls the power system for Telecom equipment - Operated by a simple command line interface over RS232 12 Getting the most TCP/IP from your Embedded Processor

Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 13 Getting the most TCP/IP from your Embedded Processor

System Level Considerations Memory Bandwidth Lots of memory bandwidth is required to support wire-speed TCP/Ethernet plus processor instruction/data fetches with reasonable latency CPU Performance PowerPC in Xilinx FPGA ~300MHz General Rule of thumb 1 MHz of CPU per 1 Mbits of TCP PPC can t touch payload data bytes PPC can not produce or consume TCP payload So compression or any other byte touching operation is out 14 Getting the most TCP/IP from your Embedded Processor

What Limits TCP Performance? Supporting functions are source of most overhead TCP protocol processing is not to blame Research Papers: End-System Optimizations for High-Speed TCP http://www.cs.duke.edu/ari/publications/end-system.pdf TCP/IP at Near-Gigabit Speeds http://www.cs.duke.edu/ari/publications/tcpgig.pdf TCP Performance Re-Visited http://www.cs.duke.edu/~jaidev/papers/ispass03.pdf 15 Getting the most TCP/IP from your Embedded Processor

Classification of TCP/IP Overheads Per-Byte Overhead Data buffer copies Checksum calculation Per-Packet Overhead Interrupt overhead Buffer Management Protocol Processing Per-Connection Overhead More Opportunity for hardware to affect system performance 16 Getting the most TCP/IP from your Embedded Processor

Per-Byte vs. Per-Packet Overheads http://www.cs.duke.edu/ari/publications/tcpgig.pdf 17 Getting the most TCP/IP from your Embedded Processor

Per-Byte Overhead Sockets API uses buffer copying to queue user application data (user-to-kernel space) Current TCP stacks have zero-copy APIs For example, Linux has sendfile( ) Additional buffer copies are also required if Ethernet Interface s DMA engine restricts alignment of data buffers TCP checksum requires 16-bit calculations over header and payload data bytes 18 Getting the most TCP/IP from your Embedded Processor

Per-Packet Overhead CPU Interrupt Buffer Management Process Management RX and TX Interrupts impose high overhead Gigabit Ethernet 1518 byte frame transferred every 12us 3651 CPU cycles between frames (300Mhz CPU) 19 Getting the most TCP/IP from your Embedded Processor

Gigabit System Reference Design (XAPP536) Processor does not touch payload data TCP Zero-Copy software APIs Transmit and Receive Checksum offload in FPGA fabric DMA with Data Realignment Engine Multi Port Memory Controller Efficiently shares DDR SDRAM memory resource between processor and DMA devices Allows point-to-point connection between high bandwidth peripherals and memory no shared bus Getting started with GSRD: www.xilinx.com/gsrd (XAPP536) 20 Getting the most TCP/IP from your Embedded Processor

Gigabit System Reference Design GSRD is an EDK-based reference system tuned for maximum TCP/IP performance GSRD uses custom IP blocks LocalLink Gigabit Ethernet MAC (LLGMAC, XAPP536) Communications DMA Controller (CDMAC, XAPP535) Multi Port Memory Controller (MPMC, XAPP535) TCP transmit rates as high as 780 Mbps Ideal for applications requiring high performance bridging between TCP/IP and user data interfaces 21 Getting the most TCP/IP from your Embedded Processor

Checksum Offload Implementation Processor incurs severe penalty for checksum computation Ethernet packet ingress to memory PPC extraction of header from memory Processing of header with software and data structures from memory Processed packet assembled in memory Egress of packet to the host Functionality can easily be offloaded into FPGA fabric Simple state machine implements and inserts the checksum logic Implemented in ~ 100 LUT s Software stack needs minor corresponding change Many TCP/IP stacks have pre-built support for checksum offload TX LocalLink DataStream TX FIFO MUX GMAC TX DMA Descriptor Checksum Compute Control Insert CSUM FIFO 22 Getting the most TCP/IP from your Embedded Processor

Transmit Performance Comparison With Checksum Offload Mbs 800 700 600 Checksum in SW Checksum in HW GSRD 1.8X 491 785 500 400 2.2X 355 300 200 158 100 0 1500 Byte Packet 9000 Byte Packet - Zero Copy API, 1MB TCP Window - Treck TCP/IP Stack (Jumbo Frames) 23 Getting the most TCP/IP from your Embedded Processor

TCP Termination Challenges Summary Memory Bandwidth is important High bandwidth peripherals such as 1 Gigabit Ethernet Processor(s) instruction/data fetches PPC can t touch payload data bytes TCP Protocol Processing not to blame for low performance Supping functions like buffer copies, checksums, etc. are to blame 24 Getting the most TCP/IP from your Embedded Processor

Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 25 Getting the most TCP/IP from your Embedded Processor

Improving TCP TX Performance Reduce occurrence of per packet Overhead Decrease number of CPU generated packets TCP Segmentation Offload (TSO) CPU generates large TCP frames (64 KB) Hardware segments into smaller frames (1.5 KB) LLGMAC optionally coalesces received ACKs 26 Getting the most TCP/IP from your Embedded Processor

TX Performance Projections Linux Netperf Transmit Performance (Sendfile vs Non-sendfile) 5000 4500 Sendfile Non-sendfile Poly. (Sendfile) Log. (Non-sendfile) 4000 3500 Throughput ( Mbits/sec) 3000 2500 2000 1500 1000 500 0 0 10000 20000 30000 40000 50000 60000 Frame size ( bytes) 27 Getting the most TCP/IP from your Embedded Processor

TSO Algorithm (1) MAC HEADER IP HEADER TCP HEADER TCP PAYLOAD MAC HEADER IP HEADER TCP HEADER TCP PAYLOAD MAC HEADER IP HEADER TCP HEADER TCP PAYLOAD... MAC HEADER IP HEADER TCP HEADER TCP PAYLOAD LLGMAC uses TCP/IP frame header as template LLGMAC generates multiple standard sized frames 28 Getting the most TCP/IP from your Embedded Processor

Template TCP/IP Header MAC Header 0 4 8 16 19 31 0 DESTINATION MAC ADDRESS 1 DESTINATION MAC ADDRESS SOURCE MAC ADDRESS 2 SOURCE MAC ADDRESS 3 ETHERNET-II TYPE VERS HLEN SERVICE TYPE IP Header 4 5 FLAGS TOTAL LENGTH IDENTIFICATION FRAGMENT OFFSET TIME TO LIVE PROTOCOL 6 HEADER CHECKSUM SOURCE IP ADDRESS 7 SOURCE IP ADDRESS DESTINATION IP ADDRESS 8 DESTINATION IP ADDRESS IP OPTIONS (IF ANY) 9 PADDING SOURCE PORT TCP Header 10 DESTINATION PORT SEQUENCE NUMBER 11 SEQUENCE NUMBER ACKNOWLEGEMENT NUMBER 12 ACKNOWLEGEMENT NUMBER HLEN RESERVED CODE BITS 13 WINDOW CHECKSUM 14 URGENT POINTER OPTIONS (IF ANY) 15 PADDING DATA 16...... 29 Getting the most TCP/IP from your Embedded Processor

TSO Algorithm (2) false IP Total Length - change to new IP frame size IP Identification - generate new value IP Checksum - calculate IP header csum TCP Seq Number - increment by bytes sent TCP Checksum - calculate TCP csum TCP Flags - update based on rules i += frame_size; i == CPU_MTU Each generated frame 200 cycles of processing of assembly code 2uS at 100MHz FPGA Implementation FSM controlled data path -OR- Microprocessor in fabric IEEE Paper Deferred Segmentation for Wire-Speed Transmission of Large TCP Frames over Standard GbE networks 30 Getting the most TCP/IP from your Embedded Processor

TCP Acceleration Platform CDMAC LLGMAC RX1 LocalLink TX1 LocalLink RX_DMAC_IF HEADER STRIP TX_DMAC_ IF RX STATUS FIFO RX FIFO Protocol Control Engine Protocol Control Engine HEADER FIFO TX FIFO HEADER STRIP RX_GMAC_IF TX_GMAC_IF LogiCORE 1-Gigabit Ethernet MAC 31 Getting the most TCP/IP from your Embedded Processor

32 Getting the most TCP/IP from your Embedded Processor Appendix

Glossary Of Terms XPS: Xilinx Platform Studio. SW to design with Xilinx processors EDK: Bundling on XPS with Processor IP lwip: Light Weight Internet Protocol TCP/IP Stack IP: Intellectual Property, pre engineered blocks of logic implementing specific functions EMAC: Ethernet MAC GEMAC: Gigabit Ethernet MAC TEMAC: Tri-Mode Ethernet MAC GSRD Gigabit Ethernet Reference Design Protocol: A formal specification of how things should communicate. Think of it as the language used for networks to communicate with each other. OSI: Open Systems Interconnection. The standard model to describe how computer networks should work Transmission Control Protocol/Internet Protocol (TCP/IP) is a group or suite of protocols that is widely used in networking. The Internet uses TCP/IP. TCP Stack: Library of functions that implement various layers of the communication protocol in Software 33 Getting the most TCP/IP from your Embedded Processor

Virtex-4 Tri-Mode EMAC (TEMAC) Virtex4-FX devices include 2 or 4 dedicated TEMACs 10/100/1000 Mb/s with auto-detect Can be used standalone or with embedded PPC405 Seamless connection to MGT TX/RX Statistics gathering Jumbo Frame Support Processor can control the EMAC via Device Control Register* A memory mapped host interface UNH compliance tested Accessible through EDK Version 6.3i SP1 Processor Block Test APU Control Reset & Control DCR Control *DCR is part of IBM s CoreConnect Bus APU ISPLB DSPLB DCR ISOCM Control ISOC M 405 Core DSOCM DSOCM Control EMAC Block DCR I/F PHY I/F PHY I/F EMAC EMAC Client I/F Client I/F Statistics I/F Host Interface Statistics I/F 34 Getting the most TCP/IP from your Embedded Processor

Hard EMAC Frees Up Logic Resources In FPGA Logic Cells x1000 70 60 50 40 30 20 GSRD Design in V-4 V 4 takes 15% less resources compared to Virtex-II Pro +25% +16% +15% +11% max Logic saved by EMACs Logic available in device 10 0 FX12 FX20 FX40 FX60 # of EMACs/device 2 2 4 4 ** Comparison based on 1G PLB EMAC implementation; Higher savings realized when TEMAC features like Tri-speed, RGMII, SGMII, address match, multicast address table lookup are used 35 Getting the most TCP/IP from your Embedded Processor