MIDeA: A Multi-Parallel Intrusion Detection Architecture

Size: px

Start display at page:

Download "MIDeA: A Multi-Parallel Intrusion Detection Architecture"

Lynne Eaton
9 years ago
Views:

1 MIDeA: A Multi-Parallel Intrusion Detection Architecture Giorgos Vasiliadis, FORTH-ICS, Greece Michalis Polychronakis, Columbia U., USA Sotiris Ioannidis, FORTH-ICS, Greece CCS 2011, 19 October 2011

Look for suspicious activities Alert on malicious

2 Network Intrusion Detection Systems Typically deployed at ingress/egress points Inspect all network traffic Look for suspicious activities Alert on malicious actions 10 GbE Internet Internal Network NIDS 2

3 Challenges Traffic rates are increasing 10 Gbit/s Ethernet speeds are common in metro/enterprise networks Up to 40 Gbit/s at the core Keep needing to perform more complex analysis at higher speeds Deep packet inspection Stateful analysis 1000s of attack signatures 3

needing to perform more complex analysis at higher speeds Deep packet

4 Designing NIDS Fast Need to handle many Gbit/s Scalable Moore s law does not hold anymore Commodity hardware Cheap Easily programmable [email protected] 4

5 Today: fast or commodity Fast hardware NIDS FPGA/TCAM/ASIC based Throughput: High Commodity software NIDS Processing by general-purpose processors Throughput: Low 5

6 MIDeA A NIDS out of commodity components Single-box implementation Easy programmability Low price Can we build a 10 Gbit/s NIDS with commodity hardware? [email protected] 6

7 Outline Architecture Implementation Performance Evaluation Conclusions 7

8 Single-threaded performance NIC Preprocess Pattern matching Output Vanilla Snort: 0.2 Gbit/s 8

9 Problem #1: Scalability Single-threaded NIDS have limited performance Do not scale with the number of CPU cores 9

10 Multi-threaded performance Preprocess Pattern matching Output NIC Preprocess Pattern matching Output Preprocess Pattern matching Output Vanilla Snort: 0.2 Gbit/s With multiple CPU-cores: 0.9 Gbit/s 10

Preprocess Pattern matching Output Vanilla Snort: 0.

11 Problem #2: How to split traffic NIC cores Synchronization overheads Cache misses Receive-Side Scaling (RSS) 11

12 Multi-queue performance Preprocess Pattern matching Output RSS NIC Preprocess Pattern matching Output Preprocess Pattern matching Output Vanilla Snort: 0.2 Gbit/s With multiple CPU-cores: 0.9 Gbit/s With multiple Rx-queues: 1.1 Gbit/s 12

Pattern matching Output Vanilla Snort: 0.

13 Problem #3: Pattern matching is the bottleneck > 75% NIC Preprocess Pattern matching Output Offload pattern matching on the GPU NIC Preprocess Pattern matching Output 13

14 Why GPU? General-purpose computing Flexible and programmable Powerful and ubiquitous Constant innovation Data-parallel model More transistors for data processing rather than data caching and flow control 14

15 Offloading pattern matching to the GPU Preprocess Pattern matching Output RSS NIC Preprocess Pattern matching Output Preprocess Pattern matching Output Vanilla Snort: 0.2 Gbit/s With multiple CPU-cores: 0.9 Gbit/s With multiple Rx-queues: 1.1 Gbit/s With GPU: 5.2 Gbit/s 15

matching Output Vanilla Snort: 0.2 Gbit/s With multiple CPU-cores: 0.

16 Outline Architecture Implementation Performance Evaluation Conclusions 16

17 Multiple data transfers NIC PCIe CPU PCIe GPU Several data transfers between different devices Are the data transfers worth the computational gains offered? 17

18 Capturing packets from NIC Ring buffers User space Kernel space Rx Rx Rx Rx Rx Queue Assigned Network Interface Packets are hashed in the NIC and distributed to different Rx-queues Memory-mapped ring buffers for each Rx-queue 18

are hashed in the NIC and distributed to different Rx-queues

19 CPU Processing Packet capturing is performed by different CPU-cores in parallel Process affinity Each core normalizes and reassembles captured packets to streams Remove ambiguities Detect attacks that span multiple packets Packets of the same connection always end up to the same core No synchronization Cache locality Reassembled packet streams are then transferred to the GPU for pattern matching How to access the GPU? 19

packets Packets of the same connection always end up to the same core No synchronization Cache locality

20 Accessing the GPU Solution #1: Master/Slave model Thread 2 Thread 3 Thread 4 Thread 1 PCIe 64 Gbit/s GPU Execution flow example 14.6 Gbit/s Transfer to GPU: P1 P1 GPU execution: P1 P1 Transfer from GPU: P1 P1 [email protected] 20

21 Accessing the GPU Solution #2: Shared execution by multiple threads Thread 1 Thread 2 Thread 3 PCIe 64 Gbit/s GPU Thread 4 Execution flow example 48.1 Gbit/s Transfer to GPU: P1 P2 P3 P1 GPU execution: P1 P2 P3 P1 Transfer from GPU: P1 P2 P3 P1 [email protected] 21

22 Transferring to GPU CPU-core Push Push Scan Push GPU Small transfer results to PCIe throughput degradation Each core batches many reassembled packets into a single buffer [email protected]

23 Pattern Matching on GPU Packet Buffer GPU core GPU core GPU core GPU core GPU core GPU core Matches Uniformly, one GPU core for each reassembled packet stream 23

24 Pipelining CPU and GPU CPU Double-buffering Packet buffers Each CPU core collects new reassembled packets, while the GPUs process the previous batch Effectively hides GPU communication costs 24

25 Recap GPUs: Data-parallel content matching Reassembled packet streams CPUs: Per-flow protocol analysis Packet streams NIC: Demux 1-10Gbps Packets 25

26 Outline Architecture Implementation Performance Evaluation Conclusions 26

27 Setup: Hardware CPU-0 CPU-1 Memory Memory IOH IOH NIC GPU GPU NUMA architecture, QuickPath Interconnect Model Specs 2 x CPU Intel E GHz x 4 cores 2 x GPU NVIDIA GTX GHz x 480 cores 1 x NIC 82599EB 10 GbE [email protected] 27

GPU Throughput Pattern Matching Performance Bounded by PCIe capacity 42.5 48.1 14.6 26.

28 GPU Throughput Pattern Matching Performance Bounded by PCIe capacity #CPU-cores The performance of a single GPU increases, as the number of CPU-cores increases [email protected] 28

GPU Throughput Pattern Matching Performance 70.7 14.6 26.7 42.5 48.

29 GPU Throughput Pattern Matching Performance Adding a second GPU #CPU-cores The performance of a single GPU increases, as the number of CPU-cores increases [email protected] 29

30 Setup: Network 10 GbE Traffic Generator/Replayer MIDeA 30

31 Gbit/s Synthetic traffic MIDeA 7.2 Snort (8x cores) b 800b 1500b Packet size Randomly generated traffic 31

32 Gbit/s Real traffic 5.2 MIDeA Snort (8x cores) Gbit/s with zero packet-loss Replayed trace captured at the gateway of a university campus [email protected] 32

33 Summary MIDeA: A multi-parallel network intrusion detection architecture Single-box implementation Based on commodity hardware Less than $1500 Operate on 5.2 Gbit/s with zero packet loss 70 Gbit/s pattern matching throughput [email protected] 33

34 Thank you! 34

High-Density Network Flow Monitoring

Petr Velan [email protected] High-Density Network Flow Monitoring IM2015 12 May 2015, Ottawa Motivation What is high-density flow monitoring? Monitor high traffic in as little rack units as possible