TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance




TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented by: Thomas Repantis trep@cs.ucr.edu CS260-Seminar in Computer Science, Fall 2004 p.1/35

Overview Goal: execute the TCP/IP processing on a dedicated processor, node, or device (the TCP server), using low-overhead, non-intrusive communication between it and the host(s) running the server application. Three TCP Server architectures: 1. A dedicated network processor on a symmetric multiprocessor (SMP) server. 2. A dedicated node in a cluster-based server built around a memory-mapped communication interconnect such as VIA. 3. An intelligent network interface in a cluster of intelligent devices with a switch-based I/O interconnect such as InfiniBand.

Introduction The network subsystem is nowadays one of the major performance bottlenecks in web servers: every outgoing data byte has to go through the same processing path in the protocol stack down to the network device. Proposed solution, the TCP Server architecture: decouple the TCP/IP protocol stack processing from the server host and execute it on a dedicated processor or node.

Introductory Details The communication between the server host and the TCP server can benefit dramatically from low-overhead, non-intrusive, memory-mapped communication. The network programming interface provided to the server application must use and tolerate asynchronous socket communication to avoid data copying.

Apache Execution Time Breakdown (figure).

Motivation The web server spends only 20% of its execution time in user space. Network processing, which includes TCP send/receive, interrupt processing, bottom-half processing, and IP send/receive, takes about 71% of the total execution time. TCP processing also consumes processor cycles and pollutes the caches and TLB (OS intrusion on the application's execution).

TCP Server Architecture The application host avoids TCP processing by tunneling the socket I/O calls to the TCP server over fast communication channels. Shared memory and memory-mapped communication are used for tunneling.
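The tunneling idea can be sketched as an application thread that enqueues socket operations on a shared channel instead of trapping into the kernel, while a dedicated TCP-server thread drains the queue. This is a minimal illustrative model, not the paper's implementation; all names (`SocketOp`, `TcpServerThread`, the echo "processing") are hypothetical, and a Python queue stands in for the shared-memory socket channel.

```python
import queue
import threading

class SocketOp:
    """One tunneled socket call posted on the socket channel."""
    def __init__(self, kind, payload=None):
        self.kind = kind              # e.g. "send", "recv", "accept"
        self.payload = payload
        self.done = threading.Event()
        self.result = None

class TcpServerThread(threading.Thread):
    """Stands in for the dedicated processor/node doing TCP processing."""
    def __init__(self, channel):
        super().__init__(daemon=True)
        self.channel = channel        # models the shared-memory socket channel

    def run(self):
        while True:
            op = self.channel.get()
            if op is None:            # shutdown marker
                break
            # Real TCP/IP protocol processing would happen here; we just ack.
            op.result = ("ok", op.kind, len(op.payload or b""))
            op.done.set()

channel = queue.Queue()
server = TcpServerThread(channel)
server.start()

op = SocketOp("send", b"GET / HTTP/1.0\r\n\r\n")
channel.put(op)                       # tunnel the call; no kernel trap on the host
op.done.wait()                        # the synchronous variant waits for completion
print(op.result)                      # -> ('ok', 'send', 18)
channel.put(None)
```

The host-side call is just an enqueue plus (optionally) a wait, which is what makes the asynchronous variants in later slides possible.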

Advantages Kernel Bypassing. Asynchronous Socket Calls. No Interrupts. No Data Copying. Process Ahead. Direct Communication with File Server.

Kernel Bypassing Bypassing the host OS kernel: a socket channel is established between the application and the TCP server for each open socket. The socket channel is created by the host OS kernel during the socket call.

Asynchronous Socket Calls Maximum overlap between the TCP processing of the socket call and the application execution. Context switches are avoided whenever possible.
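The overlap can be illustrated with a tiny sketch in which an executor stands in for the TCP server: the application issues a send, keeps computing, and blocks only when it needs the result. `async_send` and `tcp_send` are hypothetical names for illustration, not the paper's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

tcp_server = ThreadPoolExecutor(max_workers=1)   # models the dedicated TCP processor

def tcp_send(data):
    time.sleep(0.01)              # pretend this is TCP/IP protocol processing
    return len(data)

def async_send(data):
    # Returns immediately with a completion handle, so the application
    # overlaps its own work with the TCP processing of the call.
    return tcp_server.submit(tcp_send, data)

handle = async_send(b"response body")
app_work = sum(range(1000))       # application keeps computing meanwhile
sent = handle.result()            # block only when the result is needed
print(app_work, sent)             # -> 499500 13
```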

No Interrupts Since the TCP server executes nothing but TCP processing, interrupts can be replaced with polling. Too high a polling rate would lead to bus congestion, while too low a rate would leave the server unable to handle all events.
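The polling trade-off can be sketched as a loop that drains a NIC event queue with a per-round budget; the budget/frequency choice is where the congestion-versus-backlog tension from the slide shows up. `poll_nic` and the event list are illustrative stand-ins, not a real NIC interface.

```python
import collections

# Models the NIC's event queue that the TCP server polls instead of
# taking interrupts.
events = collections.deque(["pkt1", "pkt2", "pkt3"])

def poll_nic(budget):
    """Drain up to `budget` events in one polling round."""
    handled = []
    while events and len(handled) < budget:
        handled.append(events.popleft())
    return handled

# A larger budget / higher polling frequency risks bus congestion;
# a smaller one risks falling behind the packet arrival rate.
round1 = poll_nic(budget=2)       # -> ['pkt1', 'pkt2']
round2 = poll_nic(budget=2)       # -> ['pkt3']
print(round1, round2)
```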

No Data Copying With asynchronous system calls, the TCP server can avoid the double copy performed by the traditional in-kernel TCP implementation of the send operation. The application must tolerate waiting for the send to complete. For retransmission, the TCP server can read the data again from the application's send buffer.
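A minimal sketch of this copy avoidance: instead of copying the payload into a socket buffer, the sender keeps a reference to the application's buffer and re-reads it on retransmission, which is why the application must not reuse the buffer until completion. The class and field names here are illustrative, not from the paper.

```python
class ZeroCopySend:
    """Models a send that references the application buffer instead of copying it."""
    def __init__(self, app_buffer):
        self.app_buffer = app_buffer     # a reference, not a private copy
        self.completed = False

    def transmit(self):
        return bytes(self.app_buffer)    # read directly from application memory

    def retransmit(self):
        # Re-read the same buffer rather than keeping a kernel-side copy;
        # valid only while the application has not reused the buffer.
        assert not self.completed, "buffer may have been reused"
        return bytes(self.app_buffer)

buf = bytearray(b"hello")
send = ZeroCopySend(buf)
first = send.transmit()
again = send.retransmit()         # retransmission reuses the same buffer
send.completed = True             # only now may the application overwrite buf
print(first == again)             # -> True
```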

Process Ahead The TCP server can execute certain operations ahead of time, before they are actually requested by the host: specifically, the accept and receive system calls.
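The process-ahead idea can be sketched as a server that accepts connections and receives data as network events arrive, parking the results in queues so that a later accept()/recv() from the application completes immediately. All names here are illustrative stand-ins for the paper's Eager Accept / Eager Receive mechanisms.

```python
import collections

class EagerTcpServer:
    """Processes accepts and receives ahead of the application's requests."""
    def __init__(self):
        self.ready_conns = collections.deque()
        self.ready_data = collections.deque()

    def on_network_event(self, kind, value):
        # Runs ahead of the application, as events arrive from the wire.
        if kind == "connect":
            self.ready_conns.append(value)
        elif kind == "data":
            self.ready_data.append(value)

    def accept(self):
        # Completes instantly if a connection was accepted ahead of time.
        return self.ready_conns.popleft() if self.ready_conns else None

    def recv(self):
        return self.ready_data.popleft() if self.ready_data else None

srv = EagerTcpServer()
srv.on_network_event("connect", "conn-1")
srv.on_network_event("data", b"request")
conn = srv.accept()               # already done ahead of time
data = srv.recv()
print(conn, data)                 # -> conn-1 b'request'
```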

Direct Communication with File Server In a multi-tier architecture, a TCP server can be instructed to communicate directly with the file server.

TCP Server in an SMP-based Architecture Dedicating a subset of the processors to in-kernel TCP processing. Network-generated interrupts are routed to the dedicated processors. The communication between the application and the TCP server is through queues in shared memory.

SMP-based Architecture Details Offloading interrupts and receive processing. Offloading TCP send processing.

TCP Server in a Cluster-based Architecture Dedicating a subset of nodes to TCP processing. VIA-based SAN interconnect.

Cluster-based Architecture Operation The TCP server node acts as the network endpoint for the outside world. The network data is transferred between the host node and the TCP server node across the SAN using low-latency memory-mapped communication.

Cluster-based Architecture Details The socket call interface is implemented as a user-level communication library. With this library, a socket call is tunneled across the SAN to the TCP server. Several implementations: 1. Split-TCP (synchronous) 2. AsyncSend 3. Eager Receive 4. Eager Accept 5. Setup With Accept

TCP Server in an Intelligent-NIC-based Architecture A cluster of intelligent devices over a switch-based I/O interconnect (InfiniBand). The devices are considered "intelligent", i.e., each device has a programmable processor and local memory.

Intelligent-NIC-based Architecture Details Each open connection is associated with a memory-mapped channel between the host and the I-NIC. During a message send, the message is transferred directly from user space to a send buffer at the interface. A received message is first buffered at the network interface and then copied directly to user space at the host.

4-way SMP-based Evaluation Dedicating two processors to network processing is always better than dedicating only one. Throughput benefits of up to 25-30%.

4-way SMP-based Evaluation (figure).

4-way SMP-based Evaluation When only one processor is dedicated to network processing, the network processor becomes a bottleneck and, consequently, the application processor suffers from idle time. When two processors handle the network overhead, there is enough network processing capacity and the application processor becomes the bottleneck. The best system would be one in which the division of labor between the network and application processors is more flexible, allowing for some measure of load balancing.

2-node Cluster-based Evaluation for Static Load Asynchronous send operations outperform their synchronous counterparts.

2-node Cluster-based Evaluation for Static Load Smaller gain than that achievable with the SMP-based architecture. 17% is the greatest throughput improvement achievable with this architecture/workload combination.

2-node Cluster-based Evaluation for Static Load In the case of Split-TCP and AsyncSend, the host has idle time available, since it is the network processing at the TCP server that proves to be the bottleneck.

2-node Cluster-based Evaluation for Static and Dynamic Load Split-TCP and AsyncSend systems saturate later than regular TCP.

2-node Cluster-based Evaluation for Static and Dynamic Load At an offered load of about 500 reqs/sec, the host CPU is effectively saturated. 18% is the greatest throughput improvement achievable with this architecture.

2-node Cluster-based Evaluation for Static and Dynamic Load Balanced configurations depend heavily on the particular characteristics of the workload. A dynamic load balancing scheme between the host and TCP server nodes is required for ideal performance on dynamic workloads.

Intelligent-NIC-based Simulation Evaluation For all the simulated processor speeds, the Split-TCP system outperforms all the other implementations. The improvements over a conventional system range from 20% to 45%.

Intelligent-NIC-based Simulation Evaluation The ratio of processing power at the host to that available at the NIC plays an important role in determining server performance. In Split-TCP, the processor on the NIC saturates much earlier than the host processor.

Conclusions about TCP Servers 1/2 Offloading TCP/IP processing is beneficial to overall system performance when the server is overloaded. An SMP-based approach to TCP servers is more efficient than a cluster-based one. The benefits of SMP and cluster-based TCP servers reach 30% in the scenarios we studied. The simulated results show greater gains of up to 45% for a cluster of devices.

Conclusions about TCP Servers 2/2 TCP servers require substantial computing resources for complete offloading. The type of workload plays a significant role in the efficiency of TCP servers. Depending on the application workload, either the host processor or the TCP server can become the bottleneck. Hence, a scheme to balance the load between the host and the TCP server would be beneficial for server performance.

Thank you! Questions/comments?