mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems



mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems
EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm*, Dongsu Han, and KyoungSoo Park
KAIST, *Princeton University

Needs for Handling Many Short Flows

[CDF of flow count vs. flow size (0 to 1M bytes), with 61% and 91% of flows marked]

- Middleboxes: SSL proxies, network caches
- End systems: web servers

* Commercial cellular traffic for 7 days ("Comparison of Caching Strategies in Modern Cellular Backhaul Networks," MobiSys 2013)

Unsatisfactory Performance of Linux TCP

- Large flows: easy to fill up 10 Gbps
- Small flows: hard to fill up 10 Gbps regardless of the number of cores
  - Too many packets: 14.88 Mpps for 64B packets on a 10 Gbps link
  - The kernel is not designed well for multicore systems

[Chart: TCP connection setup performance, connections/sec (x10^5, 0.0 to 2.5) vs. number of CPU cores (1 to 8), showing a performance meltdown as cores are added. Linux 3.10.16, Intel Xeon E5-2690, Intel 10 Gbps NIC]

Kernel Uses the Most CPU Cycles

CPU usage breakdown of a web server (Lighttpd) serving a 64-byte file on Linux 3.10:
- Application: 17%
- TCP/IP: 34%
- Kernel (without TCP/IP): 45%
- Packet I/O: 4%

83% of CPU usage is spent inside the kernel!

Performance bottlenecks: 1. shared resources, 2. broken locality, 3. per-packet processing

Bottlenecks removed by mTCP:
1) Efficient use of CPU cycles for TCP/IP processing: 2.35x more CPU cycles for the application
2) 3x ~ 25x better performance

Inefficiencies in Kernel from Shared FD

1. Shared resources:
- Shared listening queue (protected by a lock)
- Shared file descriptor space (linear search for finding an empty slot)

[Diagram: cores 0-3 with per-core packet queues (Receive-Side Scaling in H/W), all contending on a single locked listening queue and a single file descriptor space]

Inefficiencies in Kernel from Broken Locality

2. Broken locality:
- The core handling the interrupt != the core calling accept()/read()/write()

[Diagram: an interrupt handled on core 0 while accept(), read(), and write() run on a different core; per-core packet queues, Receive-Side Scaling (H/W)]

Inefficiencies in Kernel from Lack of Support for Batching

3. Per-packet, per-system-call processing:
- Inefficient per-system-call processing (accept(), read(), write()): frequent mode switching, cache pollution
- Inefficient per-packet processing: per-packet memory allocation

[Diagram: application thread -> BSD socket / Linux epoll (user level) -> kernel TCP -> packet I/O (kernel level)]

Previous Works on Solving Kernel Complexity

                          | Listening queue | Connection locality | App <-> TCP comm.   | Packet I/O | API
Linux-2.6                 | Shared          | No                  | Per system call     | Per packet | BSD
Linux-3.9 (SO_REUSEPORT)  | Per-core        | No                  | Per system call     | Per packet | BSD
Affinity-Accept           | Per-core        | Yes                 | Per system call     | Per packet | BSD
MegaPipe                  | Per-core        | Yes                 | Batched system call | Per packet | Custom

Still, 78% of CPU cycles are used in the kernel! How much performance improvement can we get if we implement a user-level TCP stack with all of these optimizations?

Clean-slate Design Principles of mTCP

mTCP: a high-performance user-level TCP stack designed for multicore systems, taking a clean-slate approach to divorce the kernel's complexity.

Problems -> our contributions:
1. Shared resources -> each core works independently: no shared resources
2. Broken locality -> resource affinity
3. Lack of support for batching -> batching from packet I/O through flow processing to the user API

Plus easily portable APIs for application compatibility.

Overview of mTCP Architecture

[Diagram: on each core, an application thread (using mtcp socket / mtcp epoll) paired with an mtcp thread, all at user level on top of a user-level packet I/O library (PSIO); only the NIC device driver remains at kernel level]

1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to application
3. mTCP API: easily portable API (BSD-like)

[SIGCOMM '10] PacketShader: A GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html

1. Thread Model: Pairwise, Per-core Threading

[Diagram: per core, an application thread (mtcp socket / mtcp epoll) paired with an mtcp thread over the user-level packet I/O library (PSIO); per-core listening queues, per-core file descriptors, and per-core packet queues fed by symmetric Receive-Side Scaling (H/W); only the device driver sits at kernel level]

From System Call to Context Switching

[Diagram: in Linux TCP, the application thread crosses the user/kernel boundary with a system call into BSD socket / Linux epoll, kernel TCP, and packet I/O; in mTCP, the application thread instead context-switches to the mtcp thread, which runs the TCP stack over the user-level packet I/O library]

A context switch has higher overhead than a system call, so mTCP batches requests and events to amortize the context-switch cost.

2. Batching Process in the mtcp Thread

[Diagram: the application thread's socket API calls (accept(), connect(), write(), close(), epoll_wait()) communicate with the mtcp thread through per-pair job queues: accept, connect, write, close, and event queues. Inside the mtcp thread, an RX manager and payload handler sort incoming SYN, ACK, and data packets into a control list, ACK list, and data list and post to an internal event queue; a TX manager emits SYN, SYN/ACK, ACK, FIN/ACK, and RST packets through the RX/TX queues]

3. mTCP API: Similar to BSD Socket API

Two goals: easy porting + keeping the popular event model
- Ease of porting: just attach mtcp_ to the BSD socket API
  - socket() -> mtcp_socket(), accept() -> mtcp_accept(), etc.
- Event notification: readiness model using epoll()

Porting existing applications mostly takes less than 100 lines of code change:

Application | Description                             | Modified / total lines
Lighttpd    | An event-driven web server              | 65 / 40K
ApacheBench | A web server performance benchmark tool | 29 / 66K
SSLShader   | A GPU-accelerated SSL proxy [NSDI '11]  | 43 / 6,618
WebReplay   | A web log replayer                      | 81 / 3,366

Optimizations for Performance

- Lock-free data structures
- Cache-friendly data structures
- Hugepages for preventing TLB misses
- Efficient TCP timer management
- Priority-based packet queuing
- Lightweight connection setup

Please refer to our paper for details.

mTCP Implementation

- 11,473 lines of C code: packet I/O, TCP flow management, user-level socket API, event system library
- 552 lines to patch the PSIO library to support event-driven packet I/O: ps_select()
- TCP implementation:
  - Follows RFC 793
  - Congestion control algorithm: NewReno
  - Passes correctness and stress tests against the Linux TCP stack

Evaluation

- Scalability with multicore: performance on multiple cores compared with previous solutions
- Performance improvement on ported applications:
  - Web server (Lighttpd): performance under a real workload
  - SSL proxy (SSLShader, NSDI '11): a TCP-bottlenecked application

Multicore Scalability

- 64B ping/pong messages per connection: heavy connection overhead, small packet-processing overhead
- mTCP: 25x Linux, 5x SO_REUSEPORT [LINUX3.9], 3x MegaPipe [OSDI '12]

[Chart: transactions/sec (x10^5, 0 to 15) vs. number of CPU cores (1 to 8) for Linux, REUSEPORT, MegaPipe, and mTCP; only mTCP scales. Linux 3.10.12, Intel Xeon E5-2690, 32GB RAM, Intel 10 Gbps NIC]

Why the others fall short:
- Linux: inefficient small-packet processing in the kernel
- REUSEPORT: shared file descriptor space within a process
- MegaPipe: shared listen socket

* [LINUX3.9] https://lwn.net/articles/542629/
* [OSDI '12] MegaPipe: A New Programming Interface for Scalable Network I/O, Berkeley

Performance Improvement on Ported Applications

Web server (Lighttpd)
- Real traffic workload: static file workload from the SpecWeb2009 suite
- mTCP: 3.2x faster than Linux, 1.5x faster than MegaPipe
- [Chart: throughput (Gbps): Linux 1.24, REUSEPORT 1.79, MegaPipe 2.69, mTCP 4.02]

SSL proxy (SSLShader)
- Performance bottlenecked by TCP
- Cipher suite: 1024-bit RSA, 128-bit AES, HMAC-SHA1; downloading a 1-byte object via HTTPS
- 18% ~ 33% better on SSL handshakes
- [Chart: transactions/sec (x10^3) at 4K/8K/16K concurrent flows; Linux ~26,762-28,208, mTCP ~31,710-37,739]

Conclusion

mTCP: a high-performance user-level TCP stack for multicore systems
- Clean-slate user-level design to overcome inefficiency in the kernel
- Makes full use of extreme parallelism and batch processing:
  - Per-core resource management
  - Lock-free data structures and cache-aware threading
  - Eliminates system call overhead; reduces context switch cost by event batching
- Achieves highly scalable performance:
  - Small message transactions: 3x to 25x better
  - Existing applications: 33% (SSLShader) to 320% (Lighttpd) improvement

Thank You

Source code is available at:
- http://shader.kaist.edu/mtcp/
- https://github.com/eunyoung14/mtcp