Packet-based Network Traffic Monitoring and Analysis with GPUs

Packet-based Network Traffic Monitoring and Analysis with GPUs
Wenji Wu, Phil DeMar (wenji@fnal.gov, demar@fnal.gov)
GPU Technology Conference 2014, March 24-27, 2014, San Jose, California

Background

Main uses for network traffic monitoring & analysis tools:
- Operations & management
- Capacity planning
- Performance troubleshooting

Levels of network traffic monitoring & analysis:
- Device counter level (SNMP data)
- Traffic flow level (flow data)
- Packet level (the focus of this work): security analysis, application performance analysis, traffic characterization studies

Background (cont.)

Characteristics of packet-based network monitoring & analysis applications:
- Time constraints on packet processing
- Compute- and I/O-throughput-intensive
- High levels of data parallelism:
  - Packet parallelism: each packet can be processed independently
  - Flow parallelism: each flow can be processed independently
- Extremely poor temporal locality for data: typically, data is processed once in sequence and rarely reused

The Problem

Packet-based traffic monitoring & analysis tools face performance & scalability challenges within high-performance networks:
- 40GE/100GE link technologies
- Servers are 10GE-connected by default
- 400GE backbone links & 40GE host connections loom on the horizon
- Millions of packets are generated & transmitted per second

Monitoring & Analysis Tool Platforms (I)

Requirements on a computing platform for high-performance network monitoring & analysis applications:
- High compute power
- Ample memory bandwidth
- Capability of handling the data parallelism inherent in network data
- Easy programmability

Monitoring & Analysis Tool Platforms (II)

Three types of computing platforms: NPU/ASIC, CPU, GPU

Architecture comparison across NPU/ASIC, CPU, and GPU, on four criteria:
- High compute power (varies for CPU)
- High memory bandwidth (varies for CPU)
- Easy programmability
- Data-parallel execution model

Our Solution

Use GPU-based traffic monitoring & analysis tools. Highlights of our work:
- Demonstrated GPUs can significantly accelerate network traffic monitoring & analysis: 11+ million pkts/s without drops (single Nvidia M2070)
- A generic I/O architecture to move network traffic from the wire into the GPU domain
- Implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis: dozens of CUDA kernels, which can be combined in a variety of ways to perform monitoring and analysis tasks

Key Technical Issues

- The GPU's relatively small memory size: the Nvidia M2070 has 6 GB of memory, the K10 has 8 GB. Workarounds:
  - Mapping host memory into the GPU with the zero-copy technique (sketched below)
  - A partial packet capture approach
- Need to capture & move packets from the wire into the GPU domain without packet loss
- Need to design data structures that are efficient for both CPU and GPU
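The zero-copy workaround maps pinned host memory into the GPU's address space so kernels read packet data in place. Below is a minimal sketch using CUDA mapped pinned memory; the chunk size, the one-byte test, and the kernel are illustrative placeholders of ours, not the authors' code:

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Illustrative kernel: count bytes matching a pattern in the mapped buffer.
__global__ void touch_packets(const unsigned char *buf, size_t len,
                              unsigned int *hits) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len && buf[i] == 0x45)          // e.g., an IPv4 version/IHL byte
        atomicAdd(hits, 1u);
}

int main() {
    const size_t CHUNK = 4 << 20;           // 4 MB buffer chunk (placeholder size)

    // Must run before the CUDA context exists so host memory can be mapped.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char *host_buf;
    cudaHostAlloc(&host_buf, CHUNK, cudaHostAllocMapped);
    memset(host_buf, 0, CHUNK);             // a capture engine would fill this

    // Device-side alias of the same physical pages: no cudaMemcpy of packets.
    unsigned char *dev_buf;
    cudaHostGetDevicePointer((void **)&dev_buf, host_buf, 0);

    unsigned int h_hits = 0, *d_hits;
    cudaMalloc(&d_hits, sizeof h_hits);
    cudaMemcpy(d_hits, &h_hits, sizeof h_hits, cudaMemcpyHostToDevice);

    touch_packets<<<(unsigned)((CHUNK + 255) / 256), 256>>>(dev_buf, CHUNK, d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof h_hits, cudaMemcpyDeviceToHost);
    printf("matching bytes: %u\n", h_hits);

    cudaFreeHost(host_buf);
    cudaFree(d_hits);
    return 0;
}
```

The trade-off is that mapped accesses cross PCI-E on every read, which is why it is only a workaround for buffers too large for device memory.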

System Architecture

Four types of logical entities:
1. Traffic Capture
2. Preprocessing
3. Monitoring & Analysis
4. Output Display

[Figure: system architecture. NICs deliver network packets as captured packet chunks into user-space packet buffers; preprocessing hands the data to the monitoring & analysis kernels in the GPU domain, whose results flow to the output display.]

Packet I/O Engine

[Figure: packet I/O engine. Incoming packets arrive at the NIC's receive descriptor ring; descriptor segments are attached to free packet buffer chunks from a pre-allocated pool in OS kernel space, captured chunks move to user space for processing, and processed chunks are recycled back to the pool.]

Goal: to avoid packet capture loss.

Key techniques:
- A novel ring-buffer-pool mechanism (see the sketch below)
- Pre-allocated large packet buffers
- Packet-level batch processing
- Memory-mapping-based zero-copy

Key operations: open, capture, recycle, close
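To make the ring-buffer-pool idea concrete, here is a minimal user-space sketch of a pool of pre-allocated packet-buffer chunks with capture/recycle semantics. The chunk geometry, the single-producer/single-consumer assumption, and all names (RingBufferPool, acquire, publish) are ours; the paper's engine lives partly in the OS kernel and uses memory mapping so that publishing and recycling transfer ownership, not packet data:

```cuda
#include <atomic>
#include <cstdint>
#include <cstdlib>

constexpr size_t CHUNK_BYTES = 2 << 20;   // 2 MB per chunk (placeholder)
constexpr size_t NUM_CHUNKS  = 64;

struct Chunk {
    uint8_t *data;
    size_t   used;      // bytes of packet data currently in the chunk
};

// A fixed pool of pre-allocated chunks: the capture side fills chunks in
// batches, the processing side consumes them and recycles them when done.
struct RingBufferPool {
    Chunk chunks[NUM_CHUNKS];
    std::atomic<size_t> head{0};   // next chunk the producer (capture) fills
    std::atomic<size_t> tail{0};   // next chunk the consumer (processing) reads

    void init() {
        for (auto &c : chunks) { c.data = (uint8_t *)std::malloc(CHUNK_BYTES); c.used = 0; }
    }

    // Producer: claim a free chunk, or nullptr if the pool is full (would drop).
    Chunk *acquire() {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == NUM_CHUNKS) return nullptr;
        return &chunks[h % NUM_CHUNKS];
    }
    void publish() { head.fetch_add(1, std::memory_order_release); }  // chunk is full

    // Consumer: take the oldest filled chunk, or nullptr if none is ready.
    Chunk *capture() {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return nullptr;
        return &chunks[t % NUM_CHUNKS];
    }
    void recycle() { tail.fetch_add(1, std::memory_order_release); }  // chunk reusable
};
```

Because the chunks are allocated once up front, steady-state capture performs no allocation and no per-packet copies, which is what keeps drop rates low at high packet rates.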

GPU-based Network Traffic Monitoring & Analysis Algorithms

A GPU-accelerated library for network traffic capturing, monitoring, and analysis applications:
- Dozens of CUDA kernels
- Can be combined in a variety of ways to perform the intended monitoring & analysis operations

Packet-Filtering Kernel

We use the Berkeley Packet Filter (BPF) as the packet filter. Advanced packet-filtering capabilities at wire speed are necessary so that we analyze only those packets of interest to us.

The kernel is built from a few basic GPU operations, such as sort, prefix-sum, and compact (sketched below):
1. Filtering: evaluate the filter over raw_pkts[] and mark matching packets in filtering_buf[]
2. Scan: prefix-sum filtering_buf[] into scan_buf[] to compute each match's output position
3. Compact: scatter the matching packets into filtered_pkts[]

[Figure: an eight-packet example (p1-p8) in which p1, p3, p4, and p7 pass the filter; the scan over flags [1 0 1 1 0 0 1 0] yields positions [0 1 1 2 3 3 3 4], and compaction places the four packets at filtered_pkts[0..3].]
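A compact sketch of the filter → scan → compact pattern described above, with a trivial one-byte protocol test standing in for a compiled BPF program and Thrust supplying the prefix sum; the fixed-size Pkt record is our simplification:

```cuda
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

struct Pkt { unsigned char proto; /* ... other header fields ... */ };

// Step 1 (Filtering): mark packets that pass the filter. A real system would
// evaluate a compiled BPF program here instead of this one-byte test.
__global__ void filter_kernel(const Pkt *pkts, int n, int *flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (pkts[i].proto == 17);   // e.g., keep UDP
}

// Step 3 (Compact): scatter matching packets to their scanned positions.
__global__ void compact_kernel(const Pkt *pkts, const int *flags,
                               const int *pos, int n, Pkt *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[pos[i]] = pkts[i];
}

// Returns the number of packets kept in d_out.
int filter_packets(const Pkt *d_pkts, int n, Pkt *d_out) {
    int *d_flags, *d_pos;
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMalloc(&d_pos,   n * sizeof(int));

    int threads = 256, blocks = (n + threads - 1) / threads;
    filter_kernel<<<blocks, threads>>>(d_pkts, n, d_flags);

    // Step 2 (Scan): an exclusive prefix sum turns flags into output indices.
    thrust::exclusive_scan(thrust::device, d_flags, d_flags + n, d_pos);

    compact_kernel<<<blocks, threads>>>(d_pkts, d_flags, d_pos, n, d_out);

    // Packets kept = last output position + last flag.
    int last_pos, last_flag;
    cudaMemcpy(&last_pos,  d_pos   + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last_flag, d_flags + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flags); cudaFree(d_pos);
    return last_pos + last_flag;
}
```

On the slide's example, flags [1 0 1 1 0 0 1 0] scan to [0 1 1 2 3 3 3 4], and the function returns 4 + 0... rather, 4 packets (last_pos 4, last_flag 0), matching filtered_pkts[0..3] = p1, p3, p4, p7.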

Sample Use Case 1

Using our GPU-accelerated library, we developed a sample use case to monitor network status (packet parallelism):
- Monitor networks at different levels, from the aggregate of the entire network down to one node
- Monitor network traffic by protocol
- Monitor network traffic information per node: determine who is sending/receiving the most traffic, for both local and remote addresses
- Monitor IP conversations, characterizing by volume or other traits

Sample Use Case 1 Data Structures

Three key data structures were created on the GPU:
- protocol_stat[]: an array that stores protocol statistics for network traffic, with each entry associated with a specific protocol
- ip_snd[] and ip_rcv[]: arrays that store traffic statistics for IP conversations in the send and receive directions, respectively
- ip_table: a hash table that keeps track of the network traffic information of each IP address node

These data structures reference themselves and each other with relative offsets such as array indexes.
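A sketch of what such index-linked structures could look like; every field name and width here is our assumption, chosen only to show how relative offsets (array indexes, -1 for "none") keep the structures valid in both CPU and GPU address spaces:

```cuda
#include <cstdint>

// All cross-references are array indexes rather than pointers, so the same
// buffers can be copied between host and device without fix-ups.

struct ProtocolStat {     // one entry per protocol in protocol_stat[]
    uint64_t packets;
    uint64_t bytes;
};

struct IpConvStat {       // one entry per conversation in ip_snd[] / ip_rcv[]
    uint32_t peer_ip;     // the remote end of the conversation
    uint64_t packets;
    uint64_t bytes;
    int32_t  next;        // index of this node's next conversation, -1 = none
};

struct IpNode {           // one slot per IP address in the ip_table hash table
    uint32_t ip;          // key; 0 assumed as the empty-slot sentinel
    uint64_t pkts_sent, pkts_rcvd;
    int32_t  snd_head;    // index into ip_snd[], -1 = none
    int32_t  rcv_head;    // index into ip_rcv[], -1 = none
};
```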

Sample Use Case 1 Algorithm

1. Call the packet-filtering kernel to filter packets of interest
2. Call the unique-IP-addresses kernel to obtain the IP addresses
3. Build the ip_table with a parallel hashing algorithm (sketched below)
4. Collect traffic statistics for each protocol and each IP node
5. Call the traffic-aggregation kernel to build IP conversations
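Step 3's parallel hashing can be realized with atomic compare-and-swap plus linear probing, one thread per address. This is a minimal sketch under our own assumptions (open addressing, a power-of-two table, IP 0 reserved as the empty sentinel); the authors' algorithm may differ:

```cuda
#define TABLE_SIZE 1024        // power of two (assumed); real tables are larger

struct Slot { unsigned int ip; };   // 0 = empty slot (assumed sentinel)

// Each thread inserts one IP address. atomicCAS claims an empty slot, so
// concurrent inserts of different keys can never overwrite each other.
// The table must be zero-initialized (e.g., cudaMemset) before launch.
__global__ void build_ip_table(const unsigned int *ips, int n, Slot *table) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key = ips[i];
    unsigned int h = (key * 2654435761u) & (TABLE_SIZE - 1);   // Knuth-style hash

    for (int probe = 0; probe < TABLE_SIZE; ++probe) {
        unsigned int slot = (h + probe) & (TABLE_SIZE - 1);    // linear probing
        unsigned int prev = atomicCAS(&table[slot].ip, 0u, key);
        if (prev == 0u || prev == key)   // claimed a free slot, or already present
            return;
    }
    // Table full: a real implementation would resize or report an error.
}
```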

Sample Use Case 2

Using our GPU-accelerated library, we developed a sample use case to monitor bulk data transfers (packet parallelism + flow parallelism):
- Monitor the bulk data transfer rate, from the aggregate of the entire network down to one node
- Analyze data transfer performance bottlenecks: is a transfer sender-limited, network-limited, or receiver-limited? If network-limited, is the cause packet drops or packet reordering?

Sample Use Case 2 Algorithm

Monitoring mode:
1. Call the packet-filtering kernel to filter packets of interest
2. Call the TCP-throughput kernel to calculate data transfer rates (sketched below)

Analyzing mode:
1. Call the packet-filtering kernel to filter packets of interest
2. Classify packets into flows
3. Call the TCP-analysis kernel to analyze performance bottlenecks
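In monitoring mode, the TCP-throughput computation can be as simple as a per-flow atomic byte accumulation over each capture interval. This sketch is under our assumptions: flow_id[] is taken to come from the flow-classification step, and the array layout is ours rather than the authors':

```cuda
// Each thread handles one packet and adds its TCP payload size to the
// counter of the flow it belongs to. flow_bytes[] must be zeroed at the
// start of each capture interval.
__global__ void tcp_throughput_kernel(const int *flow_id,
                                      const unsigned int *payload_bytes,
                                      int n_pkts,
                                      unsigned long long *flow_bytes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pkts)
        atomicAdd(&flow_bytes[flow_id[i]], (unsigned long long)payload_bytes[i]);
}

// Host side: after the kernel, rate per flow = flow_bytes[f] / interval_seconds.
```

Since packets of the same flow may be processed by any thread, atomics keep the counters correct while still exploiting both packet and flow parallelism.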

Prototyped System

[Figure: block diagram of the prototyped system. Two NUMA nodes (P0 with its memory, P1 with its memory) linked by QPI; the Nvidia M2070 GPU and the two 10G NICs attach over PCI-E through the IOHs.]

- Our application is developed on Linux with the CUDA 5.0 programming environment
- The packet I/O engine is implemented on Intel 82599-based 10GigE NICs
- A two-node NUMA system: two 8-core 2.67GHz Intel X5650 processors, two Intel 82599-based 10GigE NICs, and one Nvidia M2070 GPU

Performance Evaluation: Packet I/O Engine

[Figure: packet drop rate vs. number of packets (1,000 to 10,000,000) for one-NIC and two-NIC experiments, comparing PSIOE, netmap, and our GPU-I/O engine at CPU clock rates of 1.6, 2.0, and 2.4 GHz.]

Our packet I/O engine (GPU-I/O) achieves better performance than netmap and PSIOE in avoiding packet capture loss.

Performance Evaluation: GPU-based Packet Filtering Algorithm

[Figure: filtering throughput (million pkts/s, up to ~180) on four data sets, comparing mmap-gpu, single-core CPU (s-cpu) and multi-core CPU (m-cpu) at 1.6, 2.0, and 2.4 GHz, and st-gpu.]

Performance Evaluation: GPU-based Sample Use Case 1

[Figure: throughput (million pkts/s, up to ~100) for the IP, UDP, net 129.15.30, and net 131.225.107 filters, comparing single-core CPU (s-cpu) and multi-core CPU (m-cpu) at 1.6, 2.0, and 2.4 GHz against st-gpu.]

Performance Evaluation: Overall System Performance

[Figure: packet ratio (0.4-1.0) vs. packet size (50-300 bytes) for one-NIC and two-NIC experiments, at clock rates of 1.6, 2.0, and 2.4 GHz and with on-demand frequency scaling.]

In the experiments:
- Generators transmitted packets at wire rate
- Packet sizes were varied across the experiments
- The prototype system ran in full operation with sample use case 1

Key Takeaways

- GPUs accelerate network traffic monitoring & analysis
- A generic I/O architecture to move network traffic from the wire into the GPU domain
- A GPU-accelerated library for network traffic capturing, monitoring, and analysis: dozens of CUDA kernels, which can be combined in various ways to perform monitoring and analysis tasks

Questions? Thank you! Email: wenji@fnal.gov and demar@fnal.gov