Chronicle: Capture and Analysis of NFS Workloads at Line Rate
Ardalan Kangarlou, Sandip Shete, and John Strunk
Advanced Technology Group
Motivation
Goal: To gather insights from customer workloads via trace capture and real-time analysis.
- Design: evaluating new algorithms
- Test: designing representative benchmarks
- Support: diagnosing problems and misconfigurations
- Sales: identifying appropriate storage platforms
Use Case 1: Research
An I/O-by-I/O view of workloads is preferred for research in:
- Data caching, prefetching, and tiering techniques
- Data deduplication analysis
- Creation of representative benchmarks for real-world workloads
- Studying data access and growth patterns over time
Use Case 2: Sales
Which platform? Workload Sizing Questionnaire:
- I/O rate?
- Number of clients?
- Random read working set size?
- Ratio of random reads vs. random writes?
Chronicle Framework
Capture and real-time analysis of NFSv3 workloads:
- Rates above 10Gb/s
- Commodity hardware
- Passive network monitoring
- Runs for days to weeks
- Programmer-friendly and extensible API
[Diagram: NFS clients ↔ NFS server, with the Chronicle appliance passively monitoring the traffic]
Background: NFS Traffic
Parsing NFS traffic requires TCP reassembly and Deep Packet Inspection (DPI): the NFS PDU sits inside an RPC Protocol Data Unit (PDU), which is carried over TCP.
[Diagram: Ethernet header | IP header | TCP header | RPC PDU | NFS PDU, shown alongside the OSI reference model (Physical, Link, Network, Transport, Session, Presentation, Application)]
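To make the layering concrete, here is a minimal sketch of the record-marking step that turns a reassembled TCP byte stream into ONC RPC PDUs. The framing rule (a 4-byte marker whose top bit flags the last fragment and whose lower 31 bits give the fragment length) comes from the RPC specification; the function and variable names are illustrative and this is not Chronicle's actual parser.

```cpp
#include <cstdint>
#include <vector>

// Sketch only (not Chronicle's parser): extract one ONC RPC record from a
// reassembled TCP byte stream using RPC record marking. Returns true and
// fills `pdu` when a complete record is available starting at `offset`;
// returns false if more TCP data is needed (fields may span packets).
bool extractRpcRecord(const std::vector<uint8_t>& stream, size_t& offset,
                      std::vector<uint8_t>& pdu) {
    size_t pos = offset;
    std::vector<uint8_t> record;
    bool lastFragment = false;
    while (!lastFragment) {
        if (stream.size() < pos + 4)
            return false;                        // marker not fully received yet
        uint32_t marker = (uint32_t(stream[pos]) << 24) |
                          (uint32_t(stream[pos + 1]) << 16) |
                          (uint32_t(stream[pos + 2]) << 8) |
                          uint32_t(stream[pos + 3]);
        lastFragment = (marker & 0x80000000u) != 0;  // high bit: last fragment
        uint32_t fragLen = marker & 0x7fffffffu;     // low 31 bits: length
        pos += 4;
        if (stream.size() < pos + fragLen)
            return false;                        // fragment spans a future packet
        record.insert(record.end(), stream.begin() + pos,
                      stream.begin() + pos + fragLen);
        pos += fragLen;
    }
    pdu = std::move(record);
    offset = pos;                                // consume the record
    return true;
}

int main() {
    // One record: last-fragment marker 0x80000003 followed by 3 payload bytes.
    std::vector<uint8_t> stream = {0x80, 0x00, 0x00, 0x03, 'n', 'f', 's'};
    std::vector<uint8_t> pdu;
    size_t offset = 0;
    return (extractRpcRecord(stream, offset, pdu) && pdu.size() == 3) ? 0 : 1;
}
```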
Background: NAS Tracing
[Ellard-FAST 03] and [Leung-ATC 08]:
- NFS and CIFS workload analysis based on pcap traces
Driverdump [Anderson-FAST 09]:
- The fastest software-only solution, operating at 1.4Gb/s
- Network driver stores packets directly in the pcap format
Main limitations for our use case:
- Packet capture through the lossy pcap interface
- High storage bandwidth and capacity requirements (trace capture at 10Gb/s for a week requires 750TB of storage!)
- pcap traces require offline parsing of data
- Stateless parsing: inability to parse fields that span packets
Our Approach: Efficient Trace Storage
Instead of storing raw packets (i.e., the pcap format):
- Use DPI to identify fields of interest in packets
- Checksum read and write data for data deduplication analysis
Leverage DataSeries [Anderson-OSR 09] as the trace format:
- Efficient storage of structured, serial data
- Inline compression, non-blocking I/O, and delta encoding
- Extents for storing RPC-, NFS-, and network-level information as well as read/write data checksums
- Simplified RPC extent: record_id, request_ts, reply_ts, operation, client, server, transaction_id (see the sketch below)
With the above techniques, a single standard disk can handle the storage bandwidth requirements for tracing at line rate!
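As a rough illustration of the record schema above, here is a plain C++ struct holding the fields listed for the simplified RPC extent. The field names are taken from the slide; the types, widths, and units are assumptions, and real Chronicle traces store these fields as DataSeries extents rather than raw structs.

```cpp
#include <cstdint>

// Illustrative only: the fields of the simplified RPC extent, modeled as a
// struct. Types and units are assumptions; the actual trace format is
// DataSeries, which adds compression and delta encoding on top.
struct RpcExtentRecord {
    uint64_t record_id;       // monotonically increasing record identifier
    uint64_t request_ts;      // timestamp of the RPC call
    uint64_t reply_ts;        // timestamp of the matching RPC reply
    uint32_t operation;       // NFSv3 procedure (e.g., READ, WRITE, GETATTR)
    uint32_t client;          // client identifier
    uint32_t server;          // server identifier
    uint32_t transaction_id;  // RPC XID used to pair the call with its reply
};

int main() {
    RpcExtentRecord rec{1, 1000, 1500, 6 /* NFSv3 READ */, 42, 7, 0x1234};
    return rec.reply_ts > rec.request_ts ? 0 : 1;   // trivial sanity check
}
```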
Background: Packet Processing
An active area of research in software routing and network security.
Common techniques: partitioning and pipelining work across cores, judicious placement and scheduling of threads, minimizing synchronization overhead, batch processing, recycling allocated memory, zero-copy parsing, and bypassing the kernel.
Some examples:
- RouteBricks [Dobrescu-SOSP 09]: packet forwarding at 10Gb/s, packet routing at 6.4Gb/s, and IPsec at 1.4Gb/s; required tedious manual tuning
- NetSlices [Marian-ANCS 12]: fixed mapping between packets and cores to support 9.7Gb/s routing throughput
- Click [Kohler-TOCS 00, Chen-USENIXATC 01]: kernel-mode (3-4 Mpps per core) and user-mode (490 Kpps)
- netmap [Rizzo-USENIXATC 12, Rizzo-INFOCOM 12]: sends/receives packets at line rate (14.88 Mpps @ 10Gb/s); 20ns/pkt vs. 500-1000ns/pkt for sockets; user-space Click on netmap matched kernel-mode Click's throughput (3.9 Mpps)
Packet Processing Frameworks (cont.)
Major limitations for our use case:
- A network-centric view of packet processing: no DPI, TCP reassembly, or stateful parsing across packets
- Fixed, small per-packet processing cost
- Maintaining low latency is as important as high throughput
- Manual tuning for specific hardware platforms
- Management of shared resources and state (e.g., locks, thread-safe queues)
- Kernel implementations are hard to extend with custom libraries
Main challenge: to extend proven packet processing techniques to the application layer, for a more CPU-intensive use case, and in a programmer-friendly manner!
Our Approach: Packet Processing
Libtask: a user-space actor model library.
Performance:
- Seamless scalability to many cores
- Implicit batching of work to support high throughput
Flexibility and usability:
- A pluggable, pipelined architecture
- Portable software: hardware configuration is hidden from users
- Unburdens application programmers of concurrency bugs
Leverage netmap instead of libpcap for reading packets:
- An efficient framework for bypassing the kernel, based on modified network drivers (see the receive-loop sketch below)
- We extended netmap to support jumbo frames
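For context, a minimal netmap receive loop might look like the sketch below, using the nm_open()/nm_nextpkt() convenience helpers from netmap's netmap_user.h. The interface name is a placeholder and the loop is only illustrative; Chronicle's actual Packet Reader (and its jumbo-frame extension) is described in the paper.

```cpp
// Minimal netmap receive loop (illustrative; not Chronicle's Packet Reader).
// Requires the netmap kernel module and headers; "netmap:eth0" is a placeholder.
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <cstdio>

int main() {
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    if (d == NULL) {
        std::perror("nm_open");
        return 1;
    }
    struct pollfd pfd = {};
    pfd.fd = NETMAP_FD(d);             // file descriptor backing the netmap rings
    pfd.events = POLLIN;
    for (;;) {
        if (poll(&pfd, 1, 1000) < 0)   // wait up to 1s for packets
            break;                     // interrupted or error: stop
        struct nm_pkthdr hdr;
        const u_char *buf;
        while ((buf = nm_nextpkt(d, &hdr)) != NULL) {
            // Hand buf/hdr.len to the parsing pipeline here; for this sketch,
            // just report the packet length.
            std::printf("received %u-byte packet\n", (unsigned)hdr.len);
        }
    }
    nm_close(d);
    return 0;
}
```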
Background: Actor Model Programming
- Actor: a computational agent that processes tasks
- Message: information to be shared with a target actor about a task or tasks
Libtask
- A light-weight actor model library written in C++
- Three constructs: Scheduler, Process, and Message
[Diagram: Processes and their Messages scheduled by a Scheduler running on a core]
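The slide only names the three constructs; Libtask's real API is described in the paper and source. Purely as a sketch of the pattern, the toy code below shows a Process draining a mailbox of Messages under the control of a Scheduler; all method names and types beyond Scheduler/Process/Message are invented for illustration.

```cpp
#include <deque>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Toy actor-model sketch (NOT Libtask's API): a Process owns a mailbox of
// Messages, and a Scheduler repeatedly runs Processes that have pending work.
struct Message {                  // information shared with a target actor
    std::function<void()> work;   // here, simply a closure to execute
};

class Process {                   // an actor: handles its Messages in order
public:
    void enqueue(Message m) { mailbox_.push_back(std::move(m)); }
    bool runOne() {               // handle a single Message, if any
        if (mailbox_.empty()) return false;
        Message m = std::move(mailbox_.front());
        mailbox_.pop_front();
        m.work();
        return true;
    }
private:
    std::deque<Message> mailbox_;
};

class Scheduler {                 // drives a set of Processes on one thread
public:
    void add(std::shared_ptr<Process> p) { processes_.push_back(std::move(p)); }
    void run() {                  // round-robin until all mailboxes drain
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& p : processes_)
                progress = p->runOne() || progress;
        }
    }
private:
    std::vector<std::shared_ptr<Process>> processes_;
};

int main() {
    auto actor = std::make_shared<Process>();
    actor->enqueue({[] { std::cout << "hello from an actor\n"; }});
    Scheduler sched;
    sched.add(actor);
    sched.run();
    return 0;
}
```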
Libtask (cont.)
- Load balancing and seamless scalability
- Two versions of Libtask: NUMA-aware and NUMA-agnostic
[Diagram: core/CPU layout contrasting NUMA-agnostic scheduling across the cores of CPU 1 and CPU 2 with NUMA-aware scheduling]
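One common way to make a scheduler NUMA-aware is to pin its worker thread to a specific core so that its work stays on one socket. The sketch below uses the Linux pthread_setaffinity_np() call purely to illustrate that idea; the core numbering is hypothetical and this is not necessarily how Libtask implements NUMA awareness.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           // pthread_setaffinity_np is a GNU extension
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Illustrative only: pin the calling thread to a given core, e.g. so a
// scheduler's Processes stay on one NUMA node. Not Libtask's actual code.
static bool pinCurrentThreadToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return false;
    }
    return true;
}

int main() {
    // Hypothetical layout: cores 0-7 on CPU 1, cores 8-15 on CPU 2.
    return pinCurrentThreadToCore(0) ? 0 : 1;
}
```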
Chronicle Architecture
[Diagram: n parallel Chronicle pipelines (Chronicle Pipeline 1 through Chronicle Pipeline n); each pipeline begins with a Packet Reader feeding a Network Parser]
Chronicle Pipelines
Trace Capture Pipeline: RPC Parser → NFS Parser → Checksum Module → DataSeries Writer (a wiring sketch follows)
Workload Sizer Pipeline: RPC Parser → NFS Parser → Workload Sizer
The Workload Sizer answers the Workload Sizing Questionnaire:
- I/O rate?
- Number of clients?
- Random read working set size?
- Ratio of random reads vs. random writes?
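The paper defines the actual module interfaces and the messages exchanged between them; the sketch below only illustrates the idea of composing the trace-capture stages into a chain, with each module handling a PDU and forwarding it downstream. The Module/Stage classes and the Pdu type are invented for this example.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative pipeline wiring (not Chronicle's API): each stage handles a
// PDU and forwards it to the next stage, mirroring the trace-capture chain
// RPC Parser -> NFS Parser -> Checksum Module -> DataSeries Writer.
struct Pdu {                         // stand-in for a parsed protocol data unit
    std::vector<uint8_t> bytes;
};

class Module {
public:
    virtual ~Module() = default;
    void setNext(std::shared_ptr<Module> next) { next_ = std::move(next); }
    virtual void handle(Pdu& pdu) = 0;
protected:
    void forward(Pdu& pdu) { if (next_) next_->handle(pdu); }
private:
    std::shared_ptr<Module> next_;
};

class Stage : public Module {        // a named placeholder for a real module
public:
    explicit Stage(std::string what) : what_(std::move(what)) {}
    void handle(Pdu& pdu) override {
        std::cout << what_ << " (" << pdu.bytes.size() << " bytes)\n";
        forward(pdu);                // pass the PDU to the next module
    }
private:
    std::string what_;
};

int main() {
    auto rpc = std::make_shared<Stage>("RPC Parser: parse RPC header");
    auto nfs = std::make_shared<Stage>("NFS Parser: parse NFS operation");
    auto sum = std::make_shared<Stage>("Checksum Module: checksum read/write data");
    auto out = std::make_shared<Stage>("DataSeries Writer: append trace record");
    rpc->setNext(nfs);
    nfs->setNext(sum);
    sum->setNext(out);

    Pdu pdu{std::vector<uint8_t>(128, 0)};   // dummy 128-byte PDU
    rpc->handle(pdu);                        // flows through the whole chain
    return 0;
}
```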
Chronicle Modules: RPC Parser
- Filtering of TCP and RPC traffic
- Detection and parsing of the RPC header
- Matching RPC replies with the corresponding calls (see the sketch below)
- Reassembly of TCP segments
- Construction of RPC PDUs
- Two modes of operation: a slow mode and a fast mode
[Diagram: Ethernet header | IP header | TCP header | RPC header; TCP reassembly yields the RPC Protocol Data Unit (PDU) containing the NFS PDU]
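For the call/reply matching step specifically, a simplified version of the bookkeeping might look like the sketch below: calls are stored in a hash table keyed by connection and RPC transaction ID (XID), and a reply carrying the same key is paired with its call. The class and field names are invented for illustration, and the key construction here is deliberately naive.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Simplified call/reply matching (not Chronicle's RPC parser): remember each
// RPC call by (connection, XID) and pair the reply that carries the same key.
struct PendingCall {
    uint64_t callTimestampNs;   // when the call was observed on the wire
    uint32_t procedure;         // NFSv3 procedure number
};

class RpcMatcher {
public:
    void recordCall(uint64_t connId, uint32_t xid, PendingCall call) {
        pending_[key(connId, xid)] = call;
    }
    // Returns the matching call, if any, and removes it from the table so
    // latency (reply_ts - request_ts) can be computed by the caller.
    std::optional<PendingCall> matchReply(uint64_t connId, uint32_t xid) {
        auto it = pending_.find(key(connId, xid));
        if (it == pending_.end()) return std::nullopt;   // unmatched reply
        PendingCall call = it->second;
        pending_.erase(it);
        return call;
    }
private:
    static uint64_t key(uint64_t connId, uint32_t xid) {
        return (connId << 32) ^ xid;    // naive key; fine for a sketch
    }
    std::unordered_map<uint64_t, PendingCall> pending_;
};

int main() {
    RpcMatcher matcher;
    matcher.recordCall(/*connId=*/1, /*xid=*/0x1234, {1000, 6 /* NFSv3 READ */});
    auto call = matcher.matchReply(1, 0x1234);
    return call ? 0 : 1;   // the reply pairs with its call
}
```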
More Information on Chronicle
Please refer to the paper for more information on the following:
- The functions of each module in the pipeline
- The messages passed between modules
- Chronicle's novel, zero-copy application-layer parsing approach
- A comprehensive comparison with other packet processing frameworks
- Insights from Chronicle that helped our customers
Evaluation Setup
Chronicle server:
- Two Intel Xeon E5-2690 2.90GHz CPUs (8 physical cores / 16 logical cores per CPU)
- 128GB of 1600MHz DDR3 DRAM (64GB per CPU)
- Two dual-port Intel 82599EB 10GbE NICs
- Ten 3TB SATA disks
- Linux kernel 3.2.32
- Cost: $10,000
NFS server: a NetApp FAS6280
Libtask Evaluation: Message Ring Benchmark
- 1000 Processes pass ~100M Messages in a ring
- 100 outstanding Messages at a given time
- Averages of 10 runs
[Chart: Messages/s (×10⁶) vs. number of schedulers (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, Erlang (R15B01), and Go (1.0.2)]
Libtask Evaluation: All-to-All Benchmark
- 100 Processes pass ~100M Messages randomly
- 1000 outstanding Messages at a given time
- Averages of 10 runs
[Chart: Messages/s (×10⁶) vs. number of schedulers (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, Erlang (R15B01), and Go (1.0.2)]
Chronicle Evaluation: Maximum Sustained Throughput
- One client issuing 64KB sequential R/W ops across two 10Gb links using the fio workload generator
- NFS server max: 14Gb/s
- < 2.5GB of RAM usage
- 0.0002% op loss at 32 cores
[Chart: throughput (Gb/s, max/avg/min) vs. number of cores (1-32), with regions marked 1 CPU, 1 CPU + hyperthreading, and 2 CPUs]
Chronicle Evaluation: Maximum Sustained IOPS
- One client issuing 1-byte sequential R/W ops across two 10Gb links using the fio workload generator
- Metadata-intensive workload: 150 KIOPS (3,000 clients)
- NFS server max: 106 KIOPS
- < 100MB of RAM usage
- < 0.0001% op loss for 8-32 cores
[Chart: operations/s (×10³, max/avg/min) vs. number of cores (1-32), with regions marked 1 CPU, 1 CPU + hyperthreading, and 2 CPUs]
Chronicle Evaluation: Packet Loss
Controlled experiment to study the impact of packet loss at 10Gb/s.
[Chart; annotations: 4% loss, 10.1Gb/s]
Chronicle Evaluation: Trace Storage Efficiency
- 7-hour-long trace of a production workload
- 40x reduction in trace size over pcap traces (1.8TB → 44.6GB)!
[Chart: extent size (GB) for pcap (1.8TB), uncompressed DataSeries (321.3GB), and compressed DataSeries (44.6GB), broken down by Total, Network, RPC, NFS, and Checksum extents]
Conclusions
Chronicle is an efficient framework for trace capture and real-time analysis of NFS workloads:
- Operates at 14Gb/s using general-purpose CPUs, disks, and NICs
- Based on actor model programming: seamless scalability, a pluggable, pipelined architecture, and a programmer-friendly API
- Handles CPU-intensive operations like stateful parsing, pattern matching, data checksumming, and compression
- Extensible to support other network storage protocols (e.g., SMB/CIFS, iSCSI, RESTful key-value store protocols)
Questions? Thank You!
Chronicle's source code is available under an academic, non-commercial license: https://github.com/ntap/chronicle
Use Case 3: Data-Driven Management
Libtask Evaluation: Message Ring Benchmark
[Chart: Messages/s (×10⁶) vs. number of threads (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, NUMA-aware Libtask + load, NUMA-agnostic Libtask + load, Erlang (R15B01), and Go (1.0.2)]
Libtask Evaluation: All-to-All Benchmark
[Chart: Messages/s (×10⁶) vs. number of threads (1-32) for NUMA-aware Libtask, NUMA-agnostic Libtask, NUMA-aware Libtask + load, NUMA-agnostic Libtask + load, Erlang (R15B01), and Go (1.0.2)]
Chronicle Evaluation: Maximum Sustained Throughput
[Charts vs. number of cores (1-32): throughput (Gb/s) with the NFS server max marked; normalized CPU usage (%); max memory usage (GB); and loss (%) for NFS calls and NFS replies on a log scale (0.00001-1)]
Chronicle Evaluation: Maximum Sustained IOPS
[Charts vs. number of cores (1-32): operations/s (×10³) with the NFS server max marked; normalized CPU usage (%); max memory usage (GB); and loss (%) for NFS calls and NFS replies on a log scale (0.00001-1)]