MIDeA: A Multi-Parallel Intrusion Detection Architecture Giorgos Vasiliadis, FORTH-ICS, Greece Michalis Polychronakis, Columbia U., USA Sotiris Ioannidis, FORTH-ICS, Greece CCS 2011, 19 October 2011
Network Intrusion Detection Systems. Typically deployed at ingress/egress points. Inspect all network traffic, look for suspicious activities, and alert on malicious actions. [Diagram: a NIDS on a 10 GbE link between the Internet and the internal network.]
Challenges. Traffic rates are increasing: 10 Gbit/s Ethernet speeds are common in metro/enterprise networks, with up to 40 Gbit/s at the core. At the same time, NIDS need to perform increasingly complex analysis at these higher speeds: deep packet inspection, stateful analysis, 1000s of attack signatures.
Designing a NIDS. Fast: needs to handle many Gbit/s. Scalable: Moore's law no longer translates into faster single cores, so performance must scale with the number of cores. Commodity hardware: cheap and easily programmable.
Today: fast or commodity. Fast hardware NIDS: FPGA/TCAM/ASIC-based; throughput: high. Commodity software NIDS: processing by general-purpose processors; throughput: low.
MIDeA: a NIDS built out of commodity components. Single-box implementation, easy programmability, low price. Can we build a 10 Gbit/s NIDS with commodity hardware?
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Single-threaded performance. Pipeline: NIC → Preprocess → Pattern matching → Output. Vanilla Snort: 0.2 Gbit/s
Problem #1: Scalability. Single-threaded NIDS have limited performance and do not scale with the number of CPU cores.
Multi-threaded performance. Pipeline: NIC → multiple parallel instances of {Preprocess → Pattern matching → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s
Problem #2: How to split traffic from the NIC across the CPU cores. Splitting in software incurs synchronization overheads and cache misses. Solution: Receive-Side Scaling (RSS), which hashes each packet in the NIC so that all packets of a flow land on the same Rx-queue (sketched below).
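A minimal sketch of the idea behind RSS-style flow-directed splitting, assuming a simplified symmetric 5-tuple hash (real NICs use a Toeplitz hash; the struct and function names here are illustrative only):

    // Simplified flow-directed hashing, illustrating how RSS keeps all
    // packets of a connection on the same Rx-queue (and hence CPU core).
    #include <stdint.h>

    struct flow_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    // Symmetric hash: both directions of a connection map to the same queue.
    static unsigned rx_queue_for(const struct flow_tuple *t, unsigned num_queues)
    {
        uint32_t h = (t->src_ip ^ t->dst_ip)
                   ^ ((uint32_t)(t->src_port ^ t->dst_port) << 16)
                   ^ t->proto;
        h ^= h >> 16;          // mix the bits a little
        h *= 0x9e3779b1u;
        return h % num_queues; // each queue is served by one CPU core
    }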
Multi-queue performance. Pipeline: NIC with RSS → multiple parallel instances of {Preprocess → Pattern matching → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s. With multiple Rx-queues: 1.1 Gbit/s
Problem #3: Pattern matching is the bottleneck, accounting for more than 75% of the time spent in the NIC → Preprocess → Pattern matching → Output pipeline. Solution: offload pattern matching to the GPU.
Why GPU? General-purpose computing: flexible and programmable, powerful and ubiquitous, constant innovation. Data-parallel model: more transistors for data processing rather than data caching and flow control.
Offloading pattern matching to the GPU. Pipeline: NIC with RSS → multiple parallel instances of {Preprocess → Pattern matching (GPU) → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s. With multiple Rx-queues: 1.1 Gbit/s. With GPU: 5.2 Gbit/s
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Multiple data transfers. Data path: NIC → (PCIe) → CPU → (PCIe) → GPU. Several data transfers take place between different devices: are the transfers worth the computational gains offered?
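One way to answer this question is to measure how long the extra host-to-device copy actually takes relative to the GPU scan. A minimal CUDA sketch, assuming an illustrative 1 MB batch, that times a pinned-memory transfer with CUDA events:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t buf_size = 1 << 20;          // 1 MB batch (illustrative)
        char *host_buf, *dev_buf;
        cudaHostAlloc((void **)&host_buf, buf_size, cudaHostAllocDefault); // pinned memory
        cudaMalloc((void **)&dev_buf, buf_size);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpyAsync(dev_buf, host_buf, buf_size, cudaMemcpyHostToDevice, 0);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("copied %zu bytes in %.3f ms (%.2f Gbit/s)\n",
               buf_size, ms, (buf_size * 8.0) / (ms * 1e6));
        return 0;
    }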
Capturing packets from the NIC. [Diagram: per-Rx-queue ring buffers in the network interface, memory-mapped from kernel space into user space.] Packets are hashed in the NIC and distributed to different Rx-queues. Memory-mapped ring buffers for each Rx-queue.
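A minimal user-space sketch of consuming packets from one memory-mapped Rx-queue ring. The slot layout and field names are assumptions for illustration, not the actual driver interface used by MIDeA:

    #include <stdint.h>

    // Hypothetical layout of one memory-mapped ring slot.
    struct rx_slot {
        volatile uint32_t ready;   // set by the driver/NIC when a packet is in place
        uint32_t          len;     // packet length in bytes
        uint8_t           data[2048];
    };

    struct rx_ring {
        struct rx_slot *slots;     // mmap()ed from the driver
        uint32_t        num_slots;
        uint32_t        head;      // next slot this core will consume
    };

    // Poll one ring; it is owned exclusively by one CPU core, so no locking is needed.
    // (A real implementation would process or copy the packet before releasing the slot.)
    static int ring_next_packet(struct rx_ring *r, const uint8_t **pkt, uint32_t *len)
    {
        struct rx_slot *s = &r->slots[r->head];
        if (!s->ready)
            return 0;                          // nothing new yet
        *pkt = s->data;
        *len = s->len;
        s->ready = 0;                          // hand the slot back to the driver
        r->head = (r->head + 1) % r->num_slots;
        return 1;
    }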
CPU Processing. Packet capturing is performed by different CPU-cores in parallel (process affinity). Each core normalizes and reassembles captured packets into streams, to remove ambiguities and detect attacks that span multiple packets. Packets of the same connection always end up on the same core: no synchronization, good cache locality. Reassembled packet streams are then transferred to the GPU for pattern matching. How to access the GPU?
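Process affinity here means pinning each capture/reassembly thread to its own core, so a flow's packets stay in that core's cache. A minimal Linux sketch with one thread per Rx-queue (the worker body is a placeholder):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to a specific CPU core.
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = (int)(long)arg;
        pin_to_core(core);
        // capture from Rx-queue `core`, normalize, reassemble into streams,
        // and batch the result for the GPU (see the following slides)
        return 0;
    }

    int main(void)
    {
        enum { NUM_CORES = 8 };
        pthread_t tid[NUM_CORES];
        for (long i = 0; i < NUM_CORES; i++)
            pthread_create(&tid[i], 0, worker, (void *)i);
        for (int i = 0; i < NUM_CORES; i++)
            pthread_join(tid[i], 0);
        return 0;
    }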
Accessing the GPU, Solution #1: Master/Slave model. Worker threads hand their batches to a single thread that performs all GPU transfers and kernel launches. [Diagram: Threads 1-4 reach the GPU through one thread, over PCIe (64 Gbit/s).] Execution flow example: batches are handled one at a time, so transfers to the GPU, GPU execution, and transfers back do not overlap. Achieved throughput: 14.6 Gbit/s.
Accessing the GPU, Solution #2: Shared execution by multiple threads. Each worker thread issues its own transfers and kernel launches. [Diagram: Threads 1-4 each access the GPU directly, over PCIe (64 Gbit/s).] Execution flow example: batches from different threads (P1, P2, P3, ...) are pipelined, so transfers to the GPU, GPU execution, and transfers back overlap. Achieved throughput: 48.1 Gbit/s.
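A minimal sketch of the shared-execution approach with the modern CUDA runtime, where host threads share the device context and each thread uses its own stream so the driver can overlap their copies and kernels. The buffer sizes are illustrative, and scan_kernel is only a stand-in whose body is sketched on the "Pattern Matching on GPU" slide below:

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Stand-in for the pattern-matching kernel (sketched later).
    __global__ void scan_kernel(const uint8_t *buf, size_t len, uint32_t *matches);

    enum { MAX_BATCH = 4 * 1024 * 1024, MAX_MATCHES = 4096 };

    // Called by each CPU worker thread; host_buf must be pinned memory.
    void process_batch_on_gpu(const uint8_t *host_buf, size_t len, uint32_t *host_matches)
    {
        // One stream per thread keeps this thread's work ordered, while the
        // driver is free to overlap it with the other threads' batches.
        static __thread cudaStream_t stream;
        static __thread uint8_t  *dev_buf;
        static __thread uint32_t *dev_matches;
        if (stream == 0) {
            cudaStreamCreate(&stream);
            cudaMalloc((void **)&dev_buf, MAX_BATCH);
            cudaMalloc((void **)&dev_matches, MAX_MATCHES * sizeof(uint32_t));
        }

        cudaMemcpyAsync(dev_buf, host_buf, len, cudaMemcpyHostToDevice, stream);
        scan_kernel<<<64, 256, 0, stream>>>(dev_buf, len, dev_matches);
        cudaMemcpyAsync(host_matches, dev_matches, MAX_MATCHES * sizeof(uint32_t),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);   // results for this batch are now in host_matches
    }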
Transferring to the GPU. [Diagram: each CPU-core pushes several reassembled packets into a buffer, which the GPU then scans in one go.] Small transfers result in degraded PCIe throughput, so each core batches many reassembled packets into a single buffer (sketched below).
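A minimal sketch of per-core batching: reassembled packets are appended, length-prefixed, into a large pinned buffer and the whole buffer is shipped to the GPU once it fills up. The batch size and the flush callback are illustrative assumptions:

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <string.h>

    enum { BATCH_BYTES = 4 * 1024 * 1024 };   // one big PCIe transfer instead of many small ones

    struct batch {
        uint8_t *buf;     // pinned host memory, so the copy to the GPU can be fast and async
        size_t   used;
    };

    void batch_init(struct batch *b)
    {
        cudaHostAlloc((void **)&b->buf, BATCH_BYTES, cudaHostAllocDefault);
        b->used = 0;
    }

    // Append one reassembled packet (length-prefixed); flush the batch when full.
    void batch_push(struct batch *b, const uint8_t *pkt, uint32_t len,
                    void (*flush)(const uint8_t *buf, size_t used))
    {
        if (b->used + sizeof(len) + len > BATCH_BYTES) {
            flush(b->buf, b->used);            // e.g. hand the buffer to the GPU code
            b->used = 0;
        }
        memcpy(b->buf + b->used, &len, sizeof(len));
        memcpy(b->buf + b->used + sizeof(len), pkt, len);
        b->used += sizeof(len) + len;
    }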
Pattern Matching on the GPU. [Diagram: the packet buffer is fanned out to many GPU cores, which produce the matches.] Uniformly, one GPU core is assigned to each reassembled packet stream.
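A minimal sketch of a per-stream scanning kernel, assuming the signatures have been compiled into an Aho-Corasick style DFA stored as a flat state-by-byte transition table, and that the buffer is length-prefixed as in the batching sketch above. The layout and names are assumptions, not MIDeA's exact data structures:

    #include <stdint.h>

    // dfa[state * 256 + byte] holds the next state; the high bit of a state
    // marks that at least one pattern ends at this position (a match).
    __global__ void scan_streams(const uint8_t *buf, const uint32_t *offsets,
                                 int num_streams, const uint32_t *dfa,
                                 uint32_t *match_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_streams)
            return;                                 // one GPU thread per reassembled stream

        // offsets[] is assumed 4-byte aligned; each record is a length prefix
        // followed by the reassembled stream bytes.
        const uint8_t *p   = buf + offsets[i];
        uint32_t       len = *(const uint32_t *)p;
        p += sizeof(uint32_t);

        uint32_t state = 0, matches = 0;
        for (uint32_t j = 0; j < len; j++) {
            state = dfa[(state & 0x7fffffffu) * 256 + p[j]];
            if (state & 0x80000000u)                // match bit set by the DFA compiler
                matches++;
        }
        match_count[i] = matches;
    }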
Pipelining CPU and GPU: double-buffering of packet buffers. Each CPU core collects newly reassembled packets while the GPUs process the previous batch. This effectively hides the GPU communication costs.
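A minimal sketch of the double-buffering loop, assuming pinned buffers and an asynchronous GPU submission path: while the GPU works on one buffer, the CPU core fills the other, then the two are swapped. The helper functions are hypothetical placeholders:

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Hypothetical helpers: fill_batch() appends reassembled packets until the
    // buffer is full (or a timeout expires); gpu_scan_async() issues the copy and
    // kernel on `stream`; report_matches() consumes the previous batch's results.
    extern size_t fill_batch(uint8_t *buf, size_t capacity);
    extern void   gpu_scan_async(const uint8_t *buf, size_t used, cudaStream_t stream);
    extern void   report_matches(void);

    void capture_loop(void)
    {
        enum { BATCH_BYTES = 4 * 1024 * 1024 };
        uint8_t *bufs[2];
        cudaHostAlloc((void **)&bufs[0], BATCH_BYTES, cudaHostAllocDefault);
        cudaHostAlloc((void **)&bufs[1], BATCH_BYTES, cudaHostAllocDefault);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        int cur = 0;
        for (;;) {
            // Fill the current buffer while the GPU is still scanning the other one.
            size_t used = fill_batch(bufs[cur], BATCH_BYTES);

            cudaStreamSynchronize(stream);   // wait for the previous batch to finish
            report_matches();                // handle its results

            gpu_scan_async(bufs[cur], used, stream);  // kick off the new batch
            cur ^= 1;                        // switch to the other buffer
        }
    }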
Recap. NIC: demultiplexes packets at 1-10 Gbit/s across Rx-queues. CPUs: per-flow protocol analysis on the packet streams. GPUs: data-parallel content matching on the reassembled packet streams.
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Setup: Hardware. NUMA architecture with QuickPath Interconnect. [Diagram: two CPUs, each with local memory, two I/O hubs (IOH), one NIC, and two GPUs.]
Model / Specs:
2 x CPU: Intel E5520, 2.27 GHz x 4 cores
2 x GPU: NVIDIA GTX480, 1.4 GHz x 480 cores
1 x NIC: Intel 82599EB, 10 GbE
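On such a NUMA machine, each group of worker threads can be directed to the GPU attached to its own I/O hub so batches do not cross the interconnect. A minimal sketch; the core-to-GPU mapping below is an assumption, since the real topology depends on the motherboard:

    #include <cuda_runtime.h>

    // Assumption: cores 0-3 sit on the NUMA node closer to GPU 0,
    // cores 4-7 on the node closer to GPU 1.
    static void select_gpu_for_core(int core)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        int gpu = (core < 4) ? 0 : 1;
        if (gpu >= num_gpus)
            gpu = 0;
        cudaSetDevice(gpu);   // all subsequent CUDA calls of this thread use this GPU
    }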
GPU Throughput: Pattern Matching Performance, bounded by PCIe capacity. Single-GPU throughput as the number of CPU-cores grows: 1 core: 14.6 Gbit/s; 2 cores: 26.7 Gbit/s; 4 cores: 42.5 Gbit/s; 8 cores: 48.1 Gbit/s. The performance of a single GPU increases as the number of CPU-cores increases.
GPU Throughput: Pattern Matching Performance. Adding a second GPU to the 8-core configuration raises pattern-matching throughput from 48.1 to 70.7 Gbit/s.
Setup: Network. [Diagram: a traffic generator/replayer connected to MIDeA over a 10 GbE link.]
Synthetic traffic: randomly generated traffic, throughput in Gbit/s vs. packet size (200, 800, and 1500 bytes). [Plot comparing MIDeA against Snort running on 8 cores; MIDeA reaches up to 7.2 Gbit/s.]
Real traffic: a replayed trace captured at the gateway of a university campus, throughput in Gbit/s. MIDeA: 5.2 Gbit/s with zero packet loss. Snort (8 cores): 1.1 Gbit/s.
Summary. MIDeA: a multi-parallel network intrusion detection architecture. Single-box implementation based on commodity hardware, costing less than $1500. Operates at 5.2 Gbit/s with zero packet loss, with 70 Gbit/s of raw pattern-matching throughput.
Thank you! gvasil@ics.forth.gr