Revisiting Software Zero-Copy for Web-caching Applications with Twin Memory Allocation

Similar documents
Revisiting Software Zero-Copy for Web-caching Applications with Twin Memory Allocation

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking

Accelerate In-Line Packet Processing Using Fast Queue

1000Mbps Ethernet Performance Test Report

Optimizing Network Virtualization in Xen

Datacenter Operating Systems

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Optimizing Network Virtualization in Xen

Presentation of Diagnosing performance overheads in the Xen virtual machine environment

High-Density Network Flow Monitoring

D1.2 Network Load Balancing

Network packet capture in Linux kernelspace

Operating Systems Design 16. Networking: Sockets

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Accelerating From Cluster to Cloud: Overview of RDMA on Windows HPC. Wenhao Wu Program Manager Windows HPC team

Big Data Technologies for Ultra-High-Speed Data Transfer and Processing

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

dnstap: high speed DNS logging without packet capture Robert Edmonds Farsight Security, Inc.

Multi-Threading Performance on Commodity Multi-Core Processors

VXLAN Performance Evaluation on VMware vsphere 5.1

Intel DPDK Boosts Server Appliance Performance White Paper

Self-Adapting Load Balancing for DNS

Can High-Performance Interconnects Benefit Memcached and Hadoop?

OpenFlow with Intel Voravit Tanyingyong, Markus Hidell, Peter Sjödin

High-performance vnic framework for hypervisor-based NFV with userspace vswitch Yoshihiro Nakajima, Hitoshi Masutani, Hirokazu Takahashi NTT Labs.

Performance of Software Switching

VMWARE WHITE PAPER 1

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

Determining Overhead, Variance & Isola>on Metrics in Virtualiza>on for IaaS Cloud

RCL: Software Prototype

Client-aware Cloud Storage

HP Mellanox Low Latency Benchmark Report 2012 Benchmark Report

Enabling High performance Big Data platform with RDMA

PCI Express High Speed Networks. Complete Solution for High Speed Networking

Resource Utilization of Middleware Components in Embedded Systems

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

Tempesta FW. Alexander Krizhanovsky NatSys Lab.

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Application Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability

Network Traffic Monitoring and Analysis with GPUs

Chronicle: Capture and Analysis of NFS Workloads at Line Rate

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Tyche: An efficient Ethernet-based protocol for converged networked storage

An Architectural study of Cluster-Based Multi-Tier Data-Centers

Peter Senna Tschudin. Performance Overhead and Comparative Performance of 4 Virtualization Solutions. Version 1.29

HPC performance applications on Virtual Clusters

High-performance vswitch of the user, by the user, for the user

Fido: Fast Inter-Virtual-Machine Communication for Enterprise Appliances

Next Generation Operating Systems

Bottleneck Detection in Parallel File Systems with Trace-Based Performance Monitoring

Isolating the Performance Impacts of Network Interface Cards through Microbenchmarks

The Plan Today... System Calls and API's Basics of OS design Virtual Machines


Cloud Operating Systems for Servers

One Server Per City: C Using TCP for Very Large SIP Servers. Kumiko Ono Henning Schulzrinne {kumiko, hgs}@cs.columbia.edu

Performance and scalability of a large OLTP workload

Virtualization for Cloud Computing

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

Example of Standard API

Managing Storage Space in a Flash and Disk Hybrid Storage System

Increasing Web Server Throughput with Network Interface Data Caching

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

Improving the Database Logging Performance of the Snort Network Intrusion Detection Sensor

Xeon+FPGA Platform for the Data Center

Gigabit Ethernet Design

Performance of optimized software implementation of the iscsi protocol

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

A Deduplication File System & Course Review

InfiniBand Software and Protocols Enable Seamless Off-the-shelf Applications Deployment

10G Ethernet: The Foundation for Low-Latency, Real-Time Financial Services Applications and Other, Future Cloud Applications

Toward Exitless and Efficient Paravirtual I/O

A Packet Forwarding Method for the ISCSI Virtualization Switch

Rackspace Cloud Databases and Container-based Virtualization

Benchmarking Hadoop & HBase on Violin

Microsoft SMB Running Over RDMA in Windows Server 8

Privacy Protection in Virtualized Multi-tenant Cloud: Software and Hardware Approaches

Network Traffic Monitoring & Analysis with GPUs

Full and Para Virtualization

Achieving Performance Isolation with Lightweight Co-Kernels

Comparison of the real-time network performance of RedHawk Linux 6.0 and Red Hat operating systems

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

InterScan Web Security Virtual Appliance

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

Where is the memory going? Memory usage in the 2.6 kernel

Serving dynamic webpages in less than a millisecond

ELI: Bare-Metal Performance for I/O Virtualization

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

1-Gigabit TCP Offload Engine

Introduction to Intel Ethernet Flow Director and Memcached Performance

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

End-System Optimizations for High-Speed TCP

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

DotSlash: Providing Dynamic Scalability to Web Applications with On-demand Distributed Query Result Caching

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

CHAPTER 1 INTRODUCTION

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Chapter 5 Cloud Resource Virtualization

Transcription:

Revisiting Software Zero-Copy for Web-caching Applications with Twin Memory Allocation Xiang Song Jicheng Shi, Haibo Chen and Binyu Zang IPADS of Shanghai Jiao Tong University Fudan University

Network I/O Limitations Network- intensive applica?ons limited by network I/O processing Physical limita?ons Efficiency of network sub- systems Data copying is one of the key limi3ng factors

Data Copy Overhead Tradi?onal network system calls (e.g., sendmsg) have non- trivial overhead Data copying Cache thrashing

The Cost of Data Copy Netperf benchmark using UDP_RR Execu&on Time (cycles) 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 Data Copy Others 32 64 128 256 512 768 1024 1280 Package Size (Bytes) 19.2%

Cache Thrashing Problem Memcached benchmark L2 cache miss Rate 4.58 per thousand cycles (256 Byte) 4.89 per thousand cycles (512 Byte) Top 2 Func&ons 256 Byte 512 Byte copy_user_generic_string 25.4% 28.7% assoc_find 12.8% 10.3%

Challenge in Zero-copy: Data Mutation Applica&on Network Data Network Stack Modify DMA NIC

Limitations of Existing Solutions Sendfile and splice Need file back Fbuf and I/O Lite New API, microkernel oriented On- demand memory mapping and COW mechanism Protec?on granularity (e.g., page size) Alignment requirements

Insight into Network Data Mutation Struct of memcached data typedef struct _stritem { struct _stritem *next; struct _stritem *prev; unsigned short refcount; struct { key nsuffix value} ; } item; Procedure of get request do_item_get( ) { it = assoc_find(key, nkey); if (it!= NULL) it- >refcount++; return it; } process_get_command() { it = do_item_get(key, nkey); output_value_data(it); }

Observation: False Sharing in Protection Metadata co- locates with the network data Modify metadata!= modify network data False protec?ng the metadata when protec?ng the network data

ZCopy Idea: Let applica?ons designate which data should be zero- copied ZCopy system A twin memory allocator kernel subsystem Effec?ve for web caching applica?ons

ZCopy Architecture Applica?on Glibc alloc Data Data ZC_alloc TCP/UDP Stack Data Copy ZCopy Proxy Bypass Data Copy Hardware

Challenge: Small Memory Blocks Minimal memory protec?on Granularity: page size (4 KByte) Alignment: page size Wasteful to allocate one page for small data blocks

ZCopy Memory Allocator Aggrega?ng memory blocks with similar sizes Pageblock - - basic memory unit Write protected a pageblock when it is full of zero- copy data Especially friendly to reusable data E.g., cached key/value pairs in memcached

Challenge: Bypass Data Copy Tradi?onal TCP/UDP network package Normal Package Header Data addr length copy kernel buffer Data User Space Network Data Network Data copy

UDP/TCP Package in ZCopy ZCopy Package Header Fragments addr length addr length reference reference User Space Network Data Network Data

Package Processing in ZCopy Header Data Data Data 1. Copy prior none zero- copy data 3. Handling Zero- copy data (reference) 2. Check user data buffer Zero- copy Data Normal Data 4. Handling none Zero- copy data (copy) Next itera?on 5. Next user data buffer Finish

Prototype Implementation ZCopy is built based on Linux 2.6.38 A twin memory allocator ZC_alloc Changed 20 LOCs of streamflow memory allocator ZCopy proxy 530 LOCs in UDP and TCP packages processing Data protec?on module 200 LOCs user- level library 205 LOCs kernel module

Experimental Setup Experimental environment 2 machine with 1.87Ghz Intel Xeon E7 chips Gigabit Network connec?on Debian GNU/Linux 6.0, Kernel version 2.6.38 Experimental benchmark Memcached Varnish (in paper)

Memcached Setup Memcached caches mul?ple key/value pairs in memory From a long run s perspec?ve, the key/value pairs are not expected to be modified or freed 10 LOCs of modifica?ons Use the memaslap testsuite as client One memcached server using a single CPU core

Memcached UDP Performance Throughput (requests/sec) 160000 140000 120000 100000 80000 60000 40000 20000 0 Throughput of Memcached with UDP ZCopy Overhead 28.7% ZCopy Vanilla Linux 41.1% Network Limita&on 128 256 512 768 1024 Package Size (bytes)

Memcached UDP: Package Processing Insight Package Processing Time (cycles) 5000 4500 4000 3500 3000 2500 2000 1500 1000 500 0 Package Processing Time Vanilla Linux ZCopy 256 512 768 1024 Package Size (Bytes)

Memcached UDP: Cache Misses Miss Frequency (1 miss/k cycles) 7.0 6.0 5.0 4.0 3.0 2.0 1.0 0.0 L2 Cache Miss Rate UDP Linux UDP ZCopy 512 768 1024 Package Size (Bytes)

Memcached TCP Performance Throughput (requests/sec) 140000 120000 100000 80000 60000 40000 20000 0 Throughput of Memcahed with TCP ZCopy Overhead 40.8% 37.9% 256 512 768 1024 2048 Package Size (bytes) ZCopy Vanilla Linux 30.8% Network Limita&on

Future Work Study and evaluate the performance benefit of ZCopy on other network intensive applica?ons Extend ZCopy to efficiently run on mul?core machines Extend ZCopy to 10 Gigabit network

Conclusion This paper presented a new zero- copy system named ZCopy A lightweight sovware zero- copy mechanism based on a twin memory allocator Experiments on an Intel machine show that ZCopy outperforms vanilla Linux

Thanks ZCopy Ques&ons? A lightweight Zero- copy mechanism Institute of Parallel And Distributed Systems http://ipads.se.sjtu.edu.cn/