Network Traffic Monitoring & Analysis with GPUs




Network Traffic Monitoring & Analysis with GPUs
Wenji Wu, Phil DeMar (wenji@fnal.gov, demar@fnal.gov)
GPU Technology Conference 2013, March 18-21, 2013, San Jose, California

Background
Main uses for network traffic monitoring & analysis tools:
- Operations & management
- Capacity planning
- Performance troubleshooting
Levels of network traffic monitoring & analysis:
- Device counter level (SNMP data)
- Traffic flow level (flow data)
- Packet inspection level (the focus of this work): security analysis, application performance analysis, traffic characterization studies

Background (cont.)
Characteristics of packet-based network monitoring & analysis applications:
- Time constraints on packet processing
- Compute and I/O throughput-intensive
- High levels of data parallelism: each packet can be processed independently
- Extremely poor temporal locality for data: typically, data is processed once in sequence and rarely reused

The Problem
Packet-based traffic monitoring & analysis tools face performance & scalability challenges within high-performance networks.
High-performance networks:
- 40GE/100GE link technologies
- Servers are 10GE-connected by default
- 400GE backbone links & 40GE host connections loom on the horizon
- Millions of packets generated & transmitted per second

Monitoring & Analysis Tool Platforms (I)
Requirements on a computing platform for high-performance network monitoring & analysis applications:
- High compute power
- Ample memory bandwidth
- Capability of handling the data parallelism inherent in network data
- Easy programmability

Monitoring & Analysis Tool Platforms (II)
Three types of computing platforms: NPU/ASIC, CPU, GPU
[Architecture comparison table: rows are high compute power, high memory bandwidth, easy programmability, and data-parallel execution model; columns are NPU/ASIC, CPU, and GPU. Most cell ratings were shown graphically on the slide and are not recoverable from this transcript, apart from two "Varies" entries for compute power and memory bandwidth.]

Our Solution: Use GPU-based Traffic Monitoring & Analysis Tools
Highlights of our work:
- Demonstrated GPUs can significantly accelerate network traffic monitoring & analysis: 11 million+ pkts/s without drops (single Nvidia M2070)
- Designed/implemented a generic I/O architecture to move network traffic from the wire into the GPU domain
- Implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis: dozens of CUDA kernels, which can be combined in a variety of ways to perform monitoring and analysis tasks

Key Technical Issues
- The GPU's relatively small memory size (the Nvidia M2070 has 6 GB of memory). Workarounds: mapping host memory into the GPU with a zero-copy technique (sketched below); a partial packet capture approach
- Need to capture & move packets from the wire into the GPU domain without packet loss
- Need to design data structures that are efficient for both CPU and GPU
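
The zero-copy workaround named above can be illustrated with CUDA's mapped (pinned) host memory. This is a minimal sketch, not the authors' implementation; the buffer size and variable names are assumptions for illustration.

```cuda
// Minimal sketch of the zero-copy workaround: pin a host packet buffer and
// map it into the GPU address space so kernels read packets over PCIe
// without an explicit cudaMemcpy. Buffer size and names are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define PKT_BUF_BYTES (256u * 1024 * 1024)   // illustrative capture-buffer size

int main() {
    unsigned char *h_pkt_buf = nullptr, *d_pkt_buf = nullptr;

    // Enable mapped pinned memory (a no-op on platforms where it is default).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned host memory that is visible to the device.
    cudaHostAlloc((void **)&h_pkt_buf, PKT_BUF_BYTES, cudaHostAllocMapped);

    // Obtain the device-side alias of the same physical pages.
    cudaHostGetDevicePointer((void **)&d_pkt_buf, h_pkt_buf, 0);

    // d_pkt_buf can now be passed to monitoring/analysis kernels; accesses go
    // across PCIe, which is why this is only a partial workaround for the
    // GPU's limited on-board memory.
    printf("host buffer %p mapped to device pointer %p\n",
           (void *)h_pkt_buf, (void *)d_pkt_buf);

    cudaFreeHost(h_pkt_buf);
    return 0;
}
```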

System Architecture
Four types of logical entities: 1. Traffic Capture, 2. Preprocessing, 3. Monitoring & Analysis, 4. Output Display
[Diagram: network packets arrive at the NICs and are captured as packet chunks into packet buffers in user space; preprocessed data is handed to the monitoring & analysis kernels in the GPU domain, whose outputs go to the display stage.]

Packet I/O Engine
Key techniques:
- Pre-allocated large packet buffers
- Packet-level batch processing
- Memory-mapping-based zero-copy
Key operations: Open, Capture, Recycle, Close
[Diagram: incoming packets land in the NIC's receive descriptor ring in the OS kernel; free packet-buffer chunks attached to the descriptor segments are filled, mapped into user space for capture and processing, and then recycled.]
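
The slide names the four engine operations but not their interfaces. Below is a hypothetical host-side sketch of that control flow; all type and function names are assumptions, and the real engine's NIC attachment and memory mapping are replaced by stubs.

```cuda
// Hypothetical control-flow sketch of the chunk-based packet I/O engine.
// The io_* functions are stubs standing in for the real engine, which would
// attach to the NIC, map its receive descriptor ring and pre-allocated
// packet-buffer chunks into user space, and recycle chunks after use.
#include <cstdint>
#include <cstdio>

struct PacketChunk {          // one pre-allocated, memory-mapped buffer chunk
    uint8_t *data;            // raw packet bytes written by the NIC
    int      num_pkts;        // packets captured into this chunk (one batch)
};

struct IoEngine { int fd; };  // placeholder for the engine's real state

IoEngine *io_open(const char *nic)         { (void)nic; return new IoEngine{-1}; }
PacketChunk *io_capture(IoEngine *)        { return nullptr; /* stub: no NIC here */ }
void io_recycle(IoEngine *, PacketChunk *) { /* stub: return chunk to free pool */ }
void io_close(IoEngine *eng)               { delete eng; }

int main() {
    IoEngine *eng = io_open("eth2");               // Open: attach + map buffers
    while (PacketChunk *chunk = io_capture(eng)) { // Capture: one batch per call
        printf("got %d packets\n", chunk->num_pkts);
        // ... hand the chunk to preprocessing / copy into the GPU domain ...
        io_recycle(eng, chunk);                    // Recycle: reuse the chunk
    }
    io_close(eng);                                 // Close: detach and unmap
    return 0;
}
```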

GPU-based Network Traffic Monitoring & Analysis Algorithms
A GPU-accelerated library for network traffic capturing, monitoring, and analysis apps: dozens of CUDA kernels that can be combined in a variety of ways to perform the intended monitoring & analysis operations.

Packet-Filtering Kernel
Advanced packet-filtering capabilities at wire speed are necessary so that we only analyze the packets of interest. We use the Berkeley Packet Filter (BPF) as the packet filter. The kernel is built from a few basic GPU operations, such as sort, prefix-sum, and compact.
[Diagram: raw_pkts[] = {p1...p8} -> 1. Filtering produces filtering_buf[] of 0/1 match flags -> 2. Scan (prefix sum) produces scan_buf[] = {0,1,1,2,3,3,3,4} -> 3. Compact scatters matching packets to filtered_pkts[] = {p1, p3, p4, p7}.]
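
The filter/scan/compact flow above can be sketched in CUDA roughly as follows. This is a minimal illustration, not the talk's implementation: the BPF match is replaced by a trivial placeholder predicate, and Thrust supplies the prefix sum.

```cuda
// Minimal filter/scan/compact sketch (illustrative, not the talk's code).
// The BPF match is replaced by a trivial length test; Thrust does the scan.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

struct Pkt { unsigned int src_ip, dst_ip; unsigned short len; };

// 1. Filtering: each thread flags whether its packet passes the filter.
__global__ void filter_kernel(const Pkt *raw, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (raw[i].len >= 64) ? 1 : 0;   // placeholder predicate
}

// 3. Compact: scatter matching packets to their scanned output positions.
__global__ void compact_kernel(const Pkt *raw, const int *flags,
                               const int *scan, Pkt *filtered, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) filtered[scan[i]] = raw[i];
}

int main() {
    const int n = 8;                       // eight packets, as in the diagram
    thrust::device_vector<Pkt> raw(n), filtered(n);
    thrust::device_vector<int> flags(n), scan(n);

    filter_kernel<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(raw.data()),
        thrust::raw_pointer_cast(flags.data()), n);

    // 2. Scan: an exclusive prefix sum gives each kept packet its output slot.
    thrust::exclusive_scan(flags.begin(), flags.end(), scan.begin());

    compact_kernel<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(raw.data()),
        thrust::raw_pointer_cast(flags.data()),
        thrust::raw_pointer_cast(scan.data()),
        thrust::raw_pointer_cast(filtered.data()), n);

    int kept = int(flags.back()) + int(scan.back());
    printf("kept %d of %d packets\n", kept, n);
    return 0;
}
```

In the real kernel the placeholder predicate would be BPF evaluation of each packet, which is what allows arbitrary filter expressions to run at wire speed.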

Traffic-Aggregation Kernel
Reads an array of n packets at pkts[] and aggregates traffic between the same src & dst IP addresses. Exports a list of entries; each entry records a src & dst IP address pair with associated traffic statistics, such as packets and bytes sent. Used to build IP conversations.
[Diagram: raw_pkts[] keyed by (src IP, dst IP) -> multikey-value sort into sorted_pkts[] -> diff_buf[] marks where the key pair changes -> inclusive scan assigns each group an index -> IP_Traffic[] holds one (src, dst, stats) entry per conversation.]
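
A minimal Thrust sketch of the same sort-then-aggregate idea, assuming each packet's (src, dst) IPv4 pair is packed into one 64-bit key and the packet length is the value. The keys and byte counts below are made-up examples, and reduce_by_key stands in for the diff + inclusive-scan steps of the slide.

```cuda
// Sort-then-aggregate sketch for per-(src,dst) traffic statistics.
// Illustrative only; packs src/dst IPv4 addresses into one 64-bit key.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // Example per-packet data: (src,dst) key and packet length in bytes.
    unsigned long long h_keys[] = {
        ((1ULL << 32) | 10), ((1ULL << 32) | 20), ((2ULL << 32) | 40),
        ((1ULL << 32) | 10), ((1ULL << 32) | 10), ((2ULL << 32) | 40)};
    unsigned int h_bytes[] = {60, 1500, 64, 512, 1500, 60};
    const int n = 6;

    thrust::device_vector<unsigned long long> keys(h_keys, h_keys + n);
    thrust::device_vector<unsigned int> bytes(h_bytes, h_bytes + n);

    // Multikey-value sort: group packets of the same (src,dst) pair together.
    thrust::sort_by_key(keys.begin(), keys.end(), bytes.begin());

    // Aggregate per conversation (replaces the diff + inclusive-scan steps).
    thrust::device_vector<unsigned long long> out_keys(n);
    thrust::device_vector<unsigned int> out_bytes(n);
    auto end = thrust::reduce_by_key(keys.begin(), keys.end(), bytes.begin(),
                                     out_keys.begin(), out_bytes.begin());
    int num_convs = end.first - out_keys.begin();

    for (int i = 0; i < num_convs; ++i)
        printf("conversation %d: key=%llu bytes=%u\n", i,
               (unsigned long long)out_keys[i], (unsigned int)out_bytes[i]);
    return 0;
}
```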

Unique-IP-Addresses Kernel
Reads an array of n packets at pkts[] and outputs a list of the unique src or dst IP addresses seen in the packets.
1. for each i ∈ [0, n-1] in parallel do: IPs[i] ← src or dst addr of pkts[i]
2. perform sort on IPs[] to produce sorted_IPs[]
3. diff_buf[0] ← 1; for each i ∈ [1, n-1] in parallel do: if (sorted_IPs[i] ≠ sorted_IPs[i-1]) diff_buf[i] ← 1 else diff_buf[i] ← 0
4. perform exclusive prefix sum on diff_buf[] to produce scan_buf[]
5. for each i ∈ [0, n-1] in parallel do: if (diff_buf[i] == 1) Output[scan_buf[i]] ← sorted_IPs[i]
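
For comparison, the same result can be obtained with Thrust's sort and unique primitives. This is a sketch with made-up example addresses, not the authors' kernel.

```cuda
// Unique source IP addresses via sort + unique (illustrative sketch).
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

int main() {
    // Example src addresses extracted from captured packets (step 1).
    unsigned int h_ips[] = {0x0A000001, 0x0A000002, 0x0A000001,
                            0xC0A80001, 0x0A000002, 0x0A000001};
    const int n = 6;
    thrust::device_vector<unsigned int> ips(h_ips, h_ips + n);

    // Steps 2-5 of the slide's algorithm collapse into sort + unique:
    // sorting groups duplicates, unique keeps the first of each run.
    thrust::sort(ips.begin(), ips.end());
    auto new_end = thrust::unique(ips.begin(), ips.end());
    int num_unique = new_end - ips.begin();

    printf("%d unique addresses\n", num_unique);
    for (int i = 0; i < num_unique; ++i)
        printf("  0x%08x\n", (unsigned int)ips[i]);
    return 0;
}
```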

A Sample Use Case
Using our GPU-accelerated library, we developed a sample use case to monitor network status:
- Monitor networks at different levels: from an aggregate of the entire network down to one node
- Monitor network traffic by protocol
- Monitor network traffic information per node: determine who is sending/receiving the most traffic, for both local and remote addresses
- Monitor IP conversations: characterizing by volume, or other traits

A Sample Use Case: Data Structures
Three key data structures were created on the GPU:
- protocol_stat[]: an array used to store protocol statistics for network traffic, with each entry associated with a specific protocol.
- ip_snd[] and ip_rcv[]: arrays used to store traffic statistics for IP conversations in the send and receive directions, respectively.
- ip_table: a hash table used to keep track of network traffic information for each IP address node.
These data structures are designed to reference themselves and each other with relative offsets such as array indexes.
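
The slide does not give field-level definitions; the following is a hypothetical sketch of how such index-linked structures could be laid out so the same buffers are meaningful to both CPU and GPU code. All field names are assumptions.

```cuda
// Hypothetical layout of the use case's GPU-resident data structures.
// Entries reference each other by array indexes (relative offsets) rather
// than raw pointers, so the same buffers are meaningful on both CPU and GPU.
#include <cstdint>
#include <cstdio>

struct ProtocolStat {          // one entry per protocol in protocol_stat[]
    uint64_t packets;
    uint64_t bytes;
};

struct ConvStat {              // one entry per IP conversation in ip_snd[]/ip_rcv[]
    uint32_t src_ip;
    uint32_t dst_ip;
    uint64_t packets;
    uint64_t bytes;
};

struct IpNode {                // one slot per IP address in the ip_table hash table
    uint32_t ip;               // key; 0 marks an empty slot (assumption)
    uint64_t pkts_sent, pkts_rcvd;
    uint64_t bytes_sent, bytes_rcvd;
    int32_t  first_snd_conv;   // index into ip_snd[], -1 if none
    int32_t  first_rcv_conv;   // index into ip_rcv[], -1 if none
};

int main() {
    printf("entry sizes: %zu %zu %zu bytes\n",
           sizeof(ProtocolStat), sizeof(ConvStat), sizeof(IpNode));
    return 0;
}
```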

A Sample Use Case: Algorithm
1. Call the Packet-Filtering kernel to filter packets of interest
2. Call the Unique-IP-Addresses kernel to obtain IP addresses
3. Build the ip_table with a parallel hashing algorithm
4. Collect traffic statistics for each protocol and each IP node
5. Call the Traffic-Aggregation kernel to build IP conversations
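
Step 4 could be realized with simple per-protocol atomic accumulation. The talk does not say how the statistics are gathered, so the kernel below is one plausible, illustrative approach; the field layout and names are assumptions.

```cuda
// Illustrative sketch of step 4: per-protocol counters accumulated with
// atomic adds. unsigned long long is used so CUDA's 64-bit atomicAdd applies.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_PROTOS 256   // index protocol stats directly by IP protocol number

struct ProtoStat { unsigned long long packets; unsigned long long bytes; };

__global__ void proto_stats_kernel(const unsigned char *proto,
                                   const unsigned short *len,
                                   ProtoStat *stat, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&stat[proto[i]].packets, 1ULL);
        atomicAdd(&stat[proto[i]].bytes, (unsigned long long)len[i]);
    }
}

int main() {
    const int n = 4;
    unsigned char  h_proto[n] = {6, 17, 6, 1};        // TCP, UDP, TCP, ICMP
    unsigned short h_len[n]   = {1500, 512, 64, 84};

    unsigned char *d_proto; unsigned short *d_len; ProtoStat *d_stat;
    cudaMalloc(&d_proto, n);
    cudaMalloc(&d_len, n * sizeof(unsigned short));
    cudaMalloc(&d_stat, NUM_PROTOS * sizeof(ProtoStat));
    cudaMemset(d_stat, 0, NUM_PROTOS * sizeof(ProtoStat));
    cudaMemcpy(d_proto, h_proto, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_len, h_len, n * sizeof(unsigned short), cudaMemcpyHostToDevice);

    proto_stats_kernel<<<1, 256>>>(d_proto, d_len, d_stat, n);

    ProtoStat h_stat[NUM_PROTOS];
    cudaMemcpy(h_stat, d_stat, sizeof(h_stat), cudaMemcpyDeviceToHost);
    printf("TCP: %llu pkts, %llu bytes\n", h_stat[6].packets, h_stat[6].bytes);

    cudaFree(d_proto); cudaFree(d_len); cudaFree(d_stat);
    return 0;
}
```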

Prototyped System
A two-node NUMA system:
- Two 6-core 2.67GHz Intel X5650 processors
- Two Intel 82599-based 10GigE NICs
- One Nvidia M2070 GPU
[Diagram: two-node NUMA topology; each node has its own memory and IOH, the nodes are linked by QPI, and the 10G NICs and M2070 GPU attach via PCI-E.]
Our application is developed on Linux in the CUDA 4.2 programming environment. The packet I/O engine is implemented on Intel 82599-based 10GigE NICs.

Performance Evaluation: Packet I/O Engine
[Charts: packet capture rate and CPU usage vs. CPU frequency (1.6, 2.0, 2.4 GHz) for GPU-I/O, PacketShader, and Netmap.]
Our packet I/O engine (GPU-I/O) and PacketShader achieve a nearly 100% packet capture rate; Netmap suffers significant packet drops. Our packet I/O engine requires the least CPU usage.

Performance Evaluation: GPU-based Packet Filtering Algorithm
[Chart: filtering throughput of the GPU-based packet-filtering kernel across four experiment data sets; detailed axis and legend labels are not recoverable from this transcript.]

Performance Evaluation: GPU-based Sample Use Case
[Chart: processing performance of the GPU-based sample use case; detailed axis and legend labels are not recoverable from this transcript.]

Performance Evaluation: Overall System Performance
[Charts: packet ratio (0.4-1.0) vs. packet size (50-300 bytes) for one-NIC and two-NIC experiments, at CPU frequency settings of 1.6 GHz, 2.0 GHz, 2.4 GHz, and OnDemand.]
In the experiments:
- Generators transmitted packets at wire rate.
- Packet sizes were varied across the experiments.
- The prototype system ran in full operation with the sample use case.

Conclusion
We demonstrate that GPUs can significantly accelerate network traffic monitoring & analysis in high-performance networks: 11 million+ pkts/s without drops (single Nvidia M2070).
We designed/implemented a generic I/O architecture to move network traffic from the wire into the GPU domain.
We implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis: dozens of CUDA kernels, which can be combined in various ways to perform monitoring and analysis tasks.

Questions? Thank You!
Email: wenji@fnal.gov and demar@fnal.gov