Network Performance Optimisation and Load Balancing Wulf Thannhaeuser 1
Network Performance Optimisation 2
Network Optimisation: Where?
[Diagram: LHC-B detector read-out and trigger architecture. The Level 0 trigger (fixed latency 4.0 µs) reduces the 40 MHz bunch-crossing rate to 1 MHz, the Level 1 trigger (variable latency < 1 ms) reduces it further to 40 kHz. Front-end electronics and front-end multiplexers (FEM) feed read-out units (RU); data rates drop from 40 TB/s to 1 TB/s to 4 GB/s into the 2-4 GB/s read-out network (RN), which feeds 60 Sub-Farm Controllers (SFC), each with ca. 16 off-the-shelf PCs running the Level 2 & 3 trigger / event filter and writing 20 MB/s to storage.] 3
Network Optimisation: Why? Gigabit (Fast) Ethernet
Ethernet speed: 10 Mb/s
Fast Ethernet speed: 100 Mb/s
Gigabit Ethernet speed: 1000 Mb/s (2000 Mb/s considering full-duplex) 4
Network Optimisation: Why?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 33 MHz × 32 bit = 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size). 5
Network Optimisation: How?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size).
Reduce per-packet overhead: replace TCP with UDP 6
TCP / UDP Comparison
TCP (Transmission Control Protocol):
- connection-oriented protocol
- full-duplex
- messages received in order, no loss or duplication
reliable, but with overheads
UDP (User Datagram Protocol):
- messages called datagrams
- messages may be lost or duplicated
- messages may be received out of order
unreliable, but potentially faster 7
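To make the difference concrete, here is a minimal C sketch (not from the slides) that sends the same buffer once over a TCP stream socket and once as a UDP datagram; the address 192.168.0.2 and port 5001 are placeholders, and error handling is omitted for brevity.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr); /* placeholder host */

    char payload[1024] = {0};

    /* TCP: connection set-up, acknowledgements, retransmission and ordering
     * give reliable delivery, but cost CPU time per packet. */
    int tcp = socket(AF_INET, SOCK_STREAM, 0);
    connect(tcp, (struct sockaddr *)&dst, sizeof(dst));
    send(tcp, payload, sizeof(payload), 0);
    close(tcp);

    /* UDP: a single datagram, no connection state and no retransmission,
     * so less per-packet overhead, but it may be lost, duplicated or reordered. */
    int udp = socket(AF_INET, SOCK_DGRAM, 0);
    sendto(udp, payload, sizeof(payload), 0,
           (struct sockaddr *)&dst, sizeof(dst));
    close(udp);
    return 0;
}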
Network Optimisation: How?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size).
Reduce per-packet overhead: replace TCP with UDP
Reduce the number of packets: Jumbo Frames 8
Jumbo Frames
Normal Ethernet Maximum Transmission Unit (MTU): 1500 bytes
Ethernet with Jumbo Frames: MTU 9000 bytes 9
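As an illustration (not part of the slides), on a Linux host the MTU can be raised to 9000 bytes with the SIOCSIFMTU ioctl, which is what tools like ifconfig/ip use; the interface name "eth0" is a placeholder, root privileges are required, and every NIC and switch on the path must support jumbo frames.

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);       /* any socket works for the ioctl */
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */
    ifr.ifr_mtu = 9000;                            /* jumbo frames instead of the default 1500 */
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
        perror("SIOCSIFMTU");
    close(fd);
    return 0;
}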
Test set-up
Netperf is a benchmark for measuring network performance. The systems tested were 800 MHz and 1800 MHz Pentium PCs using Gigabit Ethernet NICs (optical as well as copper). The network set-up was always a simple point-to-point connection with a crossed twisted-pair or optical cable. Results were not always symmetric: with two PCs of different performance, the benchmark results were usually better if data was sent from the slow PC to the fast PC, i.e. the receiving process is more expensive. 10
Results with the optimisations so far
Frame size tuning: throughput (Mbit/s)
MTU     TCP throughput   UDP send perf.   UDP rcv perf.
1500    597.67           648.53           647.76
9000    704.94           856.52           854.52
11
Network Optimisation: How?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size).
Reduce per-packet overhead: replace TCP with UDP
Reduce the number of packets: Jumbo Frames
Reduce context switches: interrupt coalescence 12
Interrupt Coalescence
[Diagram: packet processing without interrupt coalescence — the NIC raises an interrupt towards the CPU for every packet written to memory; with interrupt coalescence — the NIC collects several packets in memory and interrupts the CPU only once for the whole group.] 13
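On Linux the coalescence delay is typically adjusted with ethtool; the following sketch (an assumption, not taken from the slides) does the same through the ETHTOOL_SCOALESCE ioctl, using the 1024 µs value that appears in the measurements on the next slide. "eth0" is a placeholder interface name.

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;
    struct ethtool_coalesce ec;

    memset(&ifr, 0, sizeof(ifr));
    memset(&ec, 0, sizeof(ec));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */
    ifr.ifr_data = (void *)&ec;

    ec.cmd = ETHTOOL_GCOALESCE;                    /* read the current settings first */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_GCOALESCE");

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = 1024;                   /* wait up to 1024 us for further packets
                                                      before raising an interrupt */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_SCOALESCE");

    close(fd);
    return 0;
}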
Interrupt Coalescence: Results
[Bar chart: interrupt coalescence tuning: throughput (Jumbo Frames deactivated), comparing 32 µs and 1024 µs waited for new packets to arrive before interrupting the CPU; TCP throughput, UDP send and UDP receive performance range from ca. 526 to 666 Mbit/s across the two settings.] 14
Network Optimisation: How?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size).
Reduce per-packet overhead: replace TCP with UDP
Reduce the number of packets: Jumbo Frames
Reduce context switches: interrupt coalescence
Reduce context switches: checksum offloading 15
Checksum Offloading
A checksum is a number calculated from the transmitted data and attached to each TCP/IP packet. Usually the CPU has to recalculate the checksum for each received TCP/IP packet in order to compare it with the checksum carried in the packet and so detect transmission errors. With checksum offloading, the NIC performs this task, so the CPU does not have to calculate the checksum and can do other work in the meantime. 16
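As a hedged illustration (not from the slides): on older Linux kernels, RX checksum offloading can be toggled through the ETHTOOL_SRXCSUM ioctl, the in-code equivalent of "ethtool -K eth0 rx on"; "eth0" is a placeholder, and newer kernels expose the same switch through the feature-flag interface instead.

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;
    struct ethtool_value ev;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */

    ev.cmd = ETHTOOL_SRXCSUM;                      /* let the NIC verify received checksums */
    ev.data = 1;
    ifr.ifr_data = (void *)&ev;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_SRXCSUM");

    close(fd);
    return 0;
}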
Network Optimisation: How?
An average CPU might not be able to process such a huge number of packets per second:
- TCP/IP overhead
- Context switching
- Packet checksums
An average PCI bus is 33 MHz, 32 bits wide. Theory: 1056 Mbit/s. In practice: ca. 850 Mbit/s (PCI overhead, burst size).
Reduce per-packet overhead: replace TCP with UDP
Reduce the number of packets: Jumbo Frames
Reduce context switches: interrupt coalescence
Reduce context switches: checksum offloading
Or buy a faster PC with a better PCI bus 17
Load Balancing 18
Load Balancing: Where?
[Diagram: the same LHC-B trigger/DAQ architecture as before — the 2-4 GB/s read-out network feeds 60 Sub-Farm Controllers (SFC), each distributing events to ca. 16 off-the-shelf PCs running the Level 2 & 3 trigger / event filter, writing 20 MB/s to storage.] 19
Load Balancing with round-robin
[Diagram: the SFC receives events over Gigabit Ethernet and forwards them over Fast Ethernet to farm nodes 1-4 in turn.]
Problem: the SFC doesn't know whether the node it wants to send the event to is ready to process it yet. 20
Load Balancing with control-tokens
[Diagram: farm nodes 1-4 send tokens to the SFC over Fast Ethernet; the SFC forwards each incoming event over Gigabit Ethernet to a node that has sent a token.]
With control tokens, nodes that are ready send a token, and every event is forwarded to the sender of a token. 21
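A simplified C sketch of the token scheme on the SFC side, under the assumption that tokens and events travel as UDP datagrams on a placeholder port 6000: each farm node sends a one-byte token when it is ready, and the SFC forwards the next event only to a node from which it has received a token (unlike round-robin, where events are pushed blindly).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_TOKENS 64

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in sfc;
    memset(&sfc, 0, sizeof(sfc));
    sfc.sin_family = AF_INET;
    sfc.sin_addr.s_addr = htonl(INADDR_ANY);
    sfc.sin_port = htons(6000);                    /* placeholder token/event port */
    bind(sock, (struct sockaddr *)&sfc, sizeof(sfc));

    struct sockaddr_in ready[MAX_TOKENS];          /* queue of nodes that sent a token */
    int head = 0, count = 0;

    char event[8192] = {0};                        /* stand-in for an event from the read-out network */

    for (;;) {
        /* Collect a token: the datagram's sender address identifies a ready node. */
        char token;
        struct sockaddr_in node;
        socklen_t len = sizeof(node);
        if (recvfrom(sock, &token, 1, 0, (struct sockaddr *)&node, &len) == 1
            && count < MAX_TOKENS) {
            ready[(head + count) % MAX_TOKENS] = node;
            count++;
        }

        /* Forward the next event to the oldest ready node; a node only receives
         * an event after announcing that it is ready, unlike plain round-robin. */
        if (count > 0) {
            sendto(sock, event, sizeof(event), 0,
                   (struct sockaddr *)&ready[head], sizeof(ready[head]));
            head = (head + 1) % MAX_TOKENS;
            count--;
        }
    }
}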
LHC Comp. Grid Testbed Structure
100 CPU servers on GE, 300 on FE, 100 disk servers on GE (~50 TB), 10 tape servers on GE
[Diagram: backbone routers connect, via 1 GB, 3 GB and 8 GB lines, 64 + 36 disk servers, 200 + 100 FE CPU servers, 100 GE CPU servers, 10 tape servers and an SFC.]