Scalable High Resolution Network Monitoring



Scalable High Resolution Network Monitoring
Open Cloud Day, Winterthur, 16th of June, 2016
Georgios Kathareios, Ákos Máté, Mitch Gusat
IBM Research GmbH, Zürich Laboratory
{ios, kos, mig}@zurich.ibm.com

Traditional Network Monitoring

Network monitoring is essential for:
- Performance tuning
- Security
- Reliability and troubleshooting

Traditional monitoring tools:
- SNMP
- sFlow
- NetFlow / IPFIX
- OpenFlow statistics

These are insufficient at high speeds:
- Data is delivered through the control plane (ms-to-s granularity)
- Short-lived, volatile traffic patterns can degrade network performance and stability, and eventually the user experience
- MapReduce congestive episodes last on the order of 100s of µs (the TCP incast problem)

Traditional monitoring is too slow for the speed of modern networks.

zmon: architectural goals

- Goal: implement high-resolution load monitoring for 10/25/40/100 Gbps networks
- What: scalable, global, continuous network monitoring
- Where: in the data plane; the switch CPU is not involved
- How: in a non-intrusive manner, re-using available data-plane sampling and mirroring mechanisms
- Targets: large-scale datacenter and IXP networks

http://www.h2020-endeavour.eu

The Heatmap method - overview

1. Layer-2 load sampling
   - Monitor the switch via a Layer-2 sampler
   - Sample all output queue occupancies
2. Heatmap construction (a sketch follows below)
   - Queue samples are uploaded to a controller
   - The datacenter network topology is assumed to be available
   - Network state snapshots are produced
3. Synchronization (time-coherency)
   - The result is a stream of snapshots over time, i.e. a "video"
   - The frame rate depends on:
     1. the speed of monitoring data collection
     2. the speed of monitoring data processing
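To make the heatmap-construction step concrete, here is a minimal sketch of how timestamped queue-occupancy samples could be binned into time-coherent snapshots. The record format, the frame duration, and all names are assumptions for illustration, not the actual zmon implementation.

```python
# Illustrative sketch only: bin (timestamp, switch, queue, occupancy)
# samples into fixed-duration frames, one heatmap snapshot per frame.
from collections import defaultdict

FRAME_US = 100  # assumed snapshot duration in microseconds

def build_snapshots(samples):
    """samples: iterable of (ts_us, switch_id, queue_id, occupancy_bytes).

    Returns {frame_index: {(switch_id, queue_id): peak occupancy}} so that
    each frame can be rendered as one heatmap over the known topology.
    """
    frames = defaultdict(dict)
    for ts_us, switch, queue, occ in samples:
        frame = ts_us // FRAME_US
        key = (switch, queue)
        # Keep the peak occupancy observed within the frame window.
        frames[frame][key] = max(frames[frame].get(key, 0), occ)
    return frames
```

Each returned frame is one heatmap: a mapping from (switch, queue) to the peak occupancy seen in that window, which keeps the snapshots time-coherent across the whole fabric.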

The Heatmap method - the QCN protocol

Quantized Congestion Notification (QCN):
- A Layer-2 congestion management scheme defined in the IEEE DCB 802.1Qau standard
- Objective: adapt the source injection rate to the available network capacity

Congestion detection at the switch:
1. Sample packets with a probability that depends on the severity of the congestion (output queue size)
2. Calculate a feedback value based on the queue occupancy (see the sketch below)
3. Send a congestion notification message (CNM) to the traffic source

The rate limiter at the traffic source follows two control laws:
- it reduces the TX rate proportionally to the feedback
- it increases the rate autonomously, based on a byte counter and a timer
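As a rough illustration of step 2: the 802.1Qau congestion point derives its feedback from the queue's offset from an equilibrium setpoint plus a weighted queue-growth term. The constants Q_EQ and W below are assumed values chosen for illustration; consult the standard for the normative definition.

```python
# Minimal sketch of QCN congestion-point feedback (IEEE 802.1Qau).
# Q_EQ and W are assumptions; negative feedback indicates congestion.

Q_EQ = 26 * 1500   # assumed equilibrium queue setpoint, in bytes
W = 2.0            # assumed weight of the queue-growth term

def qcn_feedback(q_len, q_old):
    """Feedback Fb for a sampled packet; Fb < 0 means congestion."""
    q_off = q_len - Q_EQ      # how far the queue sits above its setpoint
    q_delta = q_len - q_old   # queue growth since the previous sample
    return -(q_off + W * q_delta)

def maybe_send_cnm(q_len, q_old):
    """Return |Fb| to carry in a CNM, or None if no CNM is needed."""
    fb = qcn_feedback(q_len, q_old)
    return abs(fb) if fb < 0 else None
```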

The Heatmap method - repurposing QCN

- For monitoring, we use only the congestion-detection part of QCN (the rate limiter is not needed)
- As an IEEE standard, it should already be implemented in hardware in most switches (non-intrusive)
- CNMs are created by the switch and gathered at the end nodes of the network (distributed, scalable; a collector sketch follows below)
- The switch CPU is not involved: monitoring stays in the data plane (fast, µs-scale sampling)

Prototype implemented in simulation: detection of congestion trees within tens of µs.

[Figure: heatmap snapshots at T+0.07 ms, T+0.18 ms, and T+1.74 ms]

Anghel, Andreea, Robert Birke, and Mitch Gusat. "Scalable High Resolution Traffic Heatmaps: Coherent Queue Visualization for Datacenters." Traffic Monitoring and Analysis, Springer Berlin Heidelberg, 2014, pp. 26-37.
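Here is a minimal sketch of the end-node collection step described above. The UDP encapsulation, port number, and record layout are all invented for illustration; real CNMs are Ethernet frames defined by 802.1Qau.

```python
# Illustrative end-node CNM collector: receives hypothetically
# UDP-encapsulated congestion records and forwards the congestion-
# indicating ones to a central aggregator for heatmap production.
import socket
import struct

LISTEN_PORT = 9999   # hypothetical port for encapsulated CNM records
RECORD = "!QIIi"     # assumed layout: ts_us, switch_id, queue_id, feedback

def collect(aggregator_addr):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", LISTEN_PORT))
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    size = struct.calcsize(RECORD)
    while True:
        data, _ = sock.recvfrom(2048)
        if len(data) < size:
            continue  # ignore malformed records
        ts_us, switch, queue, fb = struct.unpack_from(RECORD, data)
        if fb < 0:  # only negative feedback indicates congestion
            out.sendto(data[:size], aggregator_addr)
```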

The Heatmap method - challenges in real life

- Trident switches don't support QCN in hardware; where implemented, it is mostly a reduced version in firmware
- Trident-2 and later do implement a proper hardware QCN FSM

More challenges:
- Timestamping of CNMs must take place in the switch
- CNMs are scattered to the network edge: this provides scalability, but makes aggregation harder
- Each end node needs a collector/aggregator for CNMs
- A central collector aggregates all CNMs for heatmap production
- All operations run at line speed, so aggregation and filtering need to keep the same speed

Another sampling approach: Planck

Work performed at Brown and Rice Universities, the IBM Research Austin Laboratory, and Brocade.

Architectural goals:
- Obtain very fast samples across all switches in the network
- Use the samples to infer the global state of the network (flow throughput, flow paths, port congestion state)

System                              Measurement speed (ms)
Hedera (NSDI '10)                   5,000
DevoFlow polling (SIGCOMM '11)      500 - 15,000
Mahout polling (INFOCOM '11)        190
sFlow / OpenSample (ICDCS '14)      100
Helios (SIGCOMM '10)                77.4
Planck (SIGCOMM '14)                < 4.2

Rasley, Jeff, Brent Stephens, Colin Dixon, Eric Rozner, Wes Felter, Kanak Agarwal, John Carter, and Rodrigo Fonseca. "Planck: Millisecond-scale Monitoring and Control for Commodity Networks." ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, ACM, 2014.

Large parts of this and the following slides were contributed by the authors of the paper.

Port mirroring in Planck

- Planck leverages the port-mirroring function of modern switches, which copies all packets (e.g. those going out a port) to a designated mirror port
- All ports are mirrored to a single mirror port: intentional oversubscription
- The drop behavior at the oversubscribed mirror port approximates sampling (illustrated below)
- The result is data-plane sampling

[Figure: production traffic through a switch, with all ports mirrored to one mirror port]
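As a back-of-the-envelope illustration of why oversubscription approximates sampling: when the aggregate mirrored traffic exceeds the mirror port's line rate, the fraction of packets that survive is roughly the ratio of the two. The port counts and load figure below are assumptions, and real switch buffering makes this only an approximation.

```python
# Illustrative only: effective sampling fraction at an oversubscribed
# mirror port, assuming drops fall uniformly across the mirrored traffic.

def effective_sample_rate(mirrored_gbps, mirror_port_gbps):
    if mirrored_gbps <= mirror_port_gbps:
        return 1.0  # no oversubscription: every copy is delivered
    return mirror_port_gbps / mirrored_gbps

# e.g. 48 x 10G ports at 30% average load, mirrored into one 10G port:
rate = effective_sample_rate(48 * 10 * 0.3, 10)
print(f"~{rate:.1%} of packets reach the collector")  # ~6.9%
```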

Planck architecture

- Oversubscribed port-mirroring is used as a primitive
- Collectors receive samples from the mirror ports
- netmap is used for fast packet processing (with an ongoing move to Intel DPDK)
- Flow information is reconstructed across all flows in the network
- Flow throughput is estimated from TCP sequence numbers
- Collectors can interact with an SDN controller to implement various applications

[Figure: Planck collector instance(s) feeding an SDN controller]

Inferring Throughput

- The sampling function is unknown, and inferring throughput from samples typically relies on understanding the sampling function
- Planck therefore estimates throughput directly, limited to packets containing sequence numbers (e.g. TCP)
- Given two samples of the same flow, sample A with sequence number S_A at time t_A and sample B with sequence number S_B at time t_B, where t_A < t_B:

  throughput = (S_B - S_A) / (t_B - t_A)

- An adaptive window-based rate estimator smooths these estimates (a minimal implementation sketch follows)
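Below is a minimal sketch of this estimator, keyed by flow 5-tuple and applying the formula above between consecutive samples. The class and field names are assumptions; sequence-number wraparound handling is included for correctness.

```python
# Illustrative per-flow throughput estimator from sampled TCP sequence
# numbers, computing (S_B - S_A) / (t_B - t_A) between sample pairs.

SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits

class FlowRateEstimator:
    def __init__(self):
        self.last = {}  # flow 5-tuple -> (t_seconds, seq)

    def update(self, flow, t, seq):
        """Return estimated bytes/sec, or None on the first sample."""
        rate = None
        if flow in self.last:
            t0, s0 = self.last[flow]
            if t > t0:
                delta = (seq - s0) % SEQ_MOD  # handles wraparound
                rate = delta / (t - t0)
        self.last[flow] = (t, seq)
        return rate

est = FlowRateEstimator()
flow = ("10.0.0.1", 34567, "10.0.0.2", 80, "tcp")
est.update(flow, 0.000, 1_000)
print(est.update(flow, 0.001, 126_000))  # ~125 MB/s, i.e. a ~1 Gbps flow
```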

Planck challenges

- Mirroring increases the packet drop rate of production traffic, due to the shared buffer
- Mirrored packets hog the shared buffer space, reducing the space available for production traffic

[Figure: switch shared buffer feeding both production ports and the mirror port towards the collector]

Planck challenges (2)

- Planck mirroring is implemented with OpenFlow
- Without OpenFlow, it must rely on the switch's own mirroring implementation, which is not always well implemented
- On Trident switches, dropped mirrored packets also cause the original packets to be dropped:
  - Mirror disabled: both flows at 100% throughput
  - Mirror enabled: both flows at 50% throughput, with a 50% drop rate at the mirror port

Towards Planck 2.0

- Uses the match-and-mirror capability of commodity switches to reduce mirrored traffic
- Match-and-mirror is also used by Everflow, a Microsoft telemetry tool for troubleshooting datacenter network problems
- With match-and-mirror we can address the unfairness of sampling between large and small flows (see the sketch below)
- In addition, overhead can be reduced by packet chunking and aggregation

Zhu, Y., Kang, N., Cao, J., Greenberg, A., Lu, G., Mahajan, R., ... & Zheng, H. "Packet-level Telemetry in Large Datacenter Networks." In Proceedings of the 2015 ACM SIGCOMM Conference, pp. 479-491. ACM, 2015.
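To illustrate the kind of policy match-and-mirror enables, here is a hypothetical model of a mirroring predicate: always mirror flow-boundary packets (SYN/FIN/RST), so small flows are never missed, while hash-sampling the data packets of large flows. In a real switch this policy would be compiled into TCAM/OpenFlow match rules, not run as host code; the flag values are the standard TCP ones, but the sampling ratio is an assumption.

```python
# Hypothetical match-and-mirror policy model, for illustration only.
import zlib

TCP_FIN, TCP_SYN, TCP_RST = 0x01, 0x02, 0x04
SAMPLE_ONE_IN = 64  # assumed sampling ratio for data packets

def should_mirror(flow_5tuple, seq, tcp_flags):
    if tcp_flags & (TCP_SYN | TCP_FIN | TCP_RST):
        return True  # always see flow boundaries: catches small flows
    # Deterministic hash sampling of data packets thins out large flows.
    h = zlib.crc32(repr((flow_5tuple, seq)).encode())
    return h % SAMPLE_ONE_IN == 0
```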

Next steps: Ingestion Pipeline

Monitoring as a Big Data application:

monitoring and sampling data, application traces -> messaging system -> pre-processing -> storage -> machine learning

(A hypothetical realization of the first hop follows below.)
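The slide does not prescribe specific technologies; as one hypothetical realization of the "messaging system" hop, queue samples could be published to an Apache Kafka topic for downstream pre-processing and storage. The broker address, topic name, and record schema below are all assumptions, and the sketch requires the kafka-python package.

```python
# Hypothetical ingestion sketch: publish queue samples to Kafka.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_sample(ts_us, switch_id, queue_id, occupancy):
    # "zmon.queue-samples" is an invented topic name for illustration.
    producer.send("zmon.queue-samples", {
        "ts_us": ts_us,
        "switch": switch_id,
        "queue": queue_id,
        "occupancy": occupancy,
    })
```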

Conclusions

- Legacy monitoring techniques are not adequate for modern networks; we need high-resolution, non-intrusive monitoring in the data plane
- Two approaches for zmon:
  - heatmap creation by repurposing the QCN protocol
  - (ab)use of the traffic-mirroring functionality
- Processing the monitoring data is a Big Data application, and still a challenge