MIDeA: A Multi-Parallel Intrusion Detection Architecture Giorgos Vasiliadis, FORTH-ICS, Greece Michalis Polychronakis, Columbia U., USA Sotiris Ioannidis, FORTH-ICS, Greece CCS 2011, 19 October 2011
Network Intrusion Detection Systems. Typically deployed at ingress/egress points. Inspect all network traffic, look for suspicious activities, and alert on malicious actions. [Diagram: a NIDS on a 10 GbE link between the Internet and the internal network.]
Challenges. Traffic rates are increasing: 10 Gbit/s Ethernet speeds are common in metro/enterprise networks, with up to 40 Gbit/s at the core. At the same time, NIDS need to perform increasingly complex analysis at these higher speeds: deep packet inspection, stateful analysis, 1000s of attack signatures.
Designing a NIDS. Fast: needs to handle many Gbit/s. Scalable: Moore's law no longer translates into faster single cores, so performance must scale with the number of cores. Commodity hardware: cheap and easily programmable.
Today: fast or commodity. Fast hardware NIDS: FPGA/TCAM/ASIC-based; throughput: high. Commodity software NIDS: processing by general-purpose processors; throughput: low.
MIDeA: a NIDS built out of commodity components. Single-box implementation, easy programmability, low price. Can we build a 10 Gbit/s NIDS with commodity hardware?
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Single-threaded performance. Pipeline: NIC → Preprocess → Pattern matching → Output. Vanilla Snort: 0.2 Gbit/s
Problem #1: Scalability. Single-threaded NIDS have limited performance and do not scale with the number of CPU cores.
Multi-threaded performance. Pipeline: NIC → multiple parallel instances of {Preprocess → Pattern matching → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s
Problem #2: How to split traffic from the NIC across the CPU cores. Splitting in software incurs synchronization overheads and cache misses. Solution: Receive-Side Scaling (RSS), which hashes each packet in the NIC so that all packets of a flow land on the same Rx-queue (sketched below).
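A minimal sketch of the idea behind RSS-style flow-directed splitting, assuming a simplified symmetric 5-tuple hash (real NICs use a Toeplitz hash; the struct and function names here are illustrative only):

    // Simplified flow-directed hashing, illustrating how RSS keeps all
    // packets of a connection on the same Rx-queue (and hence CPU core).
    #include <stdint.h>

    struct flow_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    // Symmetric hash: both directions of a connection map to the same queue.
    static unsigned rx_queue_for(const struct flow_tuple *t, unsigned num_queues)
    {
        uint32_t h = (t->src_ip ^ t->dst_ip)
                   ^ ((uint32_t)(t->src_port ^ t->dst_port) << 16)
                   ^ t->proto;
        h ^= h >> 16;          // mix the bits a little
        h *= 0x9e3779b1u;
        return h % num_queues; // each queue is served by one CPU core
    }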
Multi-queue performance. Pipeline: NIC with RSS → multiple parallel instances of {Preprocess → Pattern matching → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s. With multiple Rx-queues: 1.1 Gbit/s
Problem #3: Pattern matching is the bottleneck, accounting for more than 75% of the time spent in the NIC → Preprocess → Pattern matching → Output pipeline. Solution: offload pattern matching to the GPU.
Why GPU? General-purpose computing: flexible and programmable, powerful and ubiquitous, constant innovation. Data-parallel model: more transistors for data processing rather than data caching and flow control.
Offloading pattern matching to the GPU. Pipeline: NIC with RSS → multiple parallel instances of {Preprocess → Pattern matching (GPU) → Output}. Vanilla Snort: 0.2 Gbit/s. With multiple CPU-cores: 0.9 Gbit/s. With multiple Rx-queues: 1.1 Gbit/s. With GPU: 5.2 Gbit/s
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Multiple data transfers. Data path: NIC → (PCIe) → CPU → (PCIe) → GPU. Several data transfers take place between different devices: are the transfers worth the computational gains offered?
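One way to answer this question is to measure how long the extra host-to-device copy actually takes relative to the GPU scan. A minimal CUDA sketch, assuming an illustrative 1 MB batch, that times a pinned-memory transfer with CUDA events:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t buf_size = 1 << 20;          // 1 MB batch (illustrative)
        char *host_buf, *dev_buf;
        cudaHostAlloc((void **)&host_buf, buf_size, cudaHostAllocDefault); // pinned memory
        cudaMalloc((void **)&dev_buf, buf_size);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpyAsync(dev_buf, host_buf, buf_size, cudaMemcpyHostToDevice, 0);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("copied %zu bytes in %.3f ms (%.2f Gbit/s)\n",
               buf_size, ms, (buf_size * 8.0) / (ms * 1e6));
        return 0;
    }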
Capturing packets from the NIC. [Diagram: per-Rx-queue ring buffers in the network interface, memory-mapped from kernel space into user space.] Packets are hashed in the NIC and distributed to different Rx-queues. Memory-mapped ring buffers for each Rx-queue.
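A minimal user-space sketch of consuming packets from one memory-mapped Rx-queue ring. The slot layout and field names are assumptions for illustration, not the actual driver interface used by MIDeA:

    #include <stdint.h>

    // Hypothetical layout of one memory-mapped ring slot.
    struct rx_slot {
        volatile uint32_t ready;   // set by the driver/NIC when a packet is in place
        uint32_t          len;     // packet length in bytes
        uint8_t           data[2048];
    };

    struct rx_ring {
        struct rx_slot *slots;     // mmap()ed from the driver
        uint32_t        num_slots;
        uint32_t        head;      // next slot this core will consume
    };

    // Poll one ring; it is owned exclusively by one CPU core, so no locking is needed.
    // (A real implementation would process or copy the packet before releasing the slot.)
    static int ring_next_packet(struct rx_ring *r, const uint8_t **pkt, uint32_t *len)
    {
        struct rx_slot *s = &r->slots[r->head];
        if (!s->ready)
            return 0;                          // nothing new yet
        *pkt = s->data;
        *len = s->len;
        s->ready = 0;                          // hand the slot back to the driver
        r->head = (r->head + 1) % r->num_slots;
        return 1;
    }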
CPU Processing. Packet capturing is performed by different CPU-cores in parallel (process affinity). Each core normalizes and reassembles captured packets into streams, to remove ambiguities and detect attacks that span multiple packets. Packets of the same connection always end up on the same core: no synchronization, good cache locality. Reassembled packet streams are then transferred to the GPU for pattern matching. How to access the GPU?
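Process affinity here means pinning each capture/reassembly thread to its own core, so a flow's packets stay in that core's cache. A minimal Linux sketch with one thread per Rx-queue (the worker body is a placeholder):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to a specific CPU core.
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = (int)(long)arg;
        pin_to_core(core);
        // capture from Rx-queue `core`, normalize, reassemble into streams,
        // and batch the result for the GPU (see the following slides)
        return 0;
    }

    int main(void)
    {
        enum { NUM_CORES = 8 };
        pthread_t tid[NUM_CORES];
        for (long i = 0; i < NUM_CORES; i++)
            pthread_create(&tid[i], 0, worker, (void *)i);
        for (int i = 0; i < NUM_CORES; i++)
            pthread_join(tid[i], 0);
        return 0;
    }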
Accessing the GPU, Solution #1: Master/Slave model. Worker threads hand their batches to a single thread that performs all GPU transfers and kernel launches. [Diagram: Threads 1-4 reach the GPU through one thread, over PCIe (64 Gbit/s).] Execution flow example: batches are handled one at a time, so transfers to the GPU, GPU execution, and transfers back do not overlap. Achieved throughput: 14.6 Gbit/s.
Accessing the GPU, Solution #2: Shared execution by multiple threads. Each worker thread issues its own transfers and kernel launches. [Diagram: Threads 1-4 each access the GPU directly, over PCIe (64 Gbit/s).] Execution flow example: batches from different threads (P1, P2, P3, ...) are pipelined, so transfers to the GPU, GPU execution, and transfers back overlap. Achieved throughput: 48.1 Gbit/s.
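A minimal sketch of the shared-execution approach with the modern CUDA runtime, where host threads share the device context and each thread uses its own stream so the driver can overlap their copies and kernels. The buffer sizes are illustrative, and scan_kernel is only a stand-in whose body is sketched on the "Pattern Matching on GPU" slide below:

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Stand-in for the pattern-matching kernel (sketched later).
    __global__ void scan_kernel(const uint8_t *buf, size_t len, uint32_t *matches);

    enum { MAX_BATCH = 4 * 1024 * 1024, MAX_MATCHES = 4096 };

    // Called by each CPU worker thread; host_buf must be pinned memory.
    void process_batch_on_gpu(const uint8_t *host_buf, size_t len, uint32_t *host_matches)
    {
        // One stream per thread keeps this thread's work ordered, while the
        // driver is free to overlap it with the other threads' batches.
        static __thread cudaStream_t stream;
        static __thread uint8_t  *dev_buf;
        static __thread uint32_t *dev_matches;
        if (stream == 0) {
            cudaStreamCreate(&stream);
            cudaMalloc((void **)&dev_buf, MAX_BATCH);
            cudaMalloc((void **)&dev_matches, MAX_MATCHES * sizeof(uint32_t));
        }

        cudaMemcpyAsync(dev_buf, host_buf, len, cudaMemcpyHostToDevice, stream);
        scan_kernel<<<64, 256, 0, stream>>>(dev_buf, len, dev_matches);
        cudaMemcpyAsync(host_matches, dev_matches, MAX_MATCHES * sizeof(uint32_t),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);   // results for this batch are now in host_matches
    }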
Transferring to the GPU. [Diagram: each CPU-core pushes several reassembled packets into a buffer, which the GPU then scans in one go.] Small transfers result in degraded PCIe throughput, so each core batches many reassembled packets into a single buffer (sketched below).
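A minimal sketch of per-core batching: reassembled packets are appended, length-prefixed, into a large pinned buffer and the whole buffer is shipped to the GPU once it fills up. The batch size and the flush callback are illustrative assumptions:

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <string.h>

    enum { BATCH_BYTES = 4 * 1024 * 1024 };   // one big PCIe transfer instead of many small ones

    struct batch {
        uint8_t *buf;     // pinned host memory, so the copy to the GPU can be fast and async
        size_t   used;
    };

    void batch_init(struct batch *b)
    {
        cudaHostAlloc((void **)&b->buf, BATCH_BYTES, cudaHostAllocDefault);
        b->used = 0;
    }

    // Append one reassembled packet (length-prefixed); flush the batch when full.
    void batch_push(struct batch *b, const uint8_t *pkt, uint32_t len,
                    void (*flush)(const uint8_t *buf, size_t used))
    {
        if (b->used + sizeof(len) + len > BATCH_BYTES) {
            flush(b->buf, b->used);            // e.g. hand the buffer to the GPU code
            b->used = 0;
        }
        memcpy(b->buf + b->used, &len, sizeof(len));
        memcpy(b->buf + b->used + sizeof(len), pkt, len);
        b->used += sizeof(len) + len;
    }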
Pattern Matching on the GPU. [Diagram: the packet buffer is fanned out to many GPU cores, which produce the matches.] Uniformly, one GPU core is assigned to each reassembled packet stream.
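A minimal sketch of a per-stream scanning kernel, assuming the signatures have been compiled into an Aho-Corasick style DFA stored as a flat state-by-byte transition table, and that the buffer is length-prefixed as in the batching sketch above. The layout and names are assumptions, not MIDeA's exact data structures:

    #include <stdint.h>

    // dfa[state * 256 + byte] holds the next state; the high bit of a state
    // marks that at least one pattern ends at this position (a match).
    __global__ void scan_streams(const uint8_t *buf, const uint32_t *offsets,
                                 int num_streams, const uint32_t *dfa,
                                 uint32_t *match_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_streams)
            return;                                 // one GPU thread per reassembled stream

        // offsets[] is assumed 4-byte aligned; each record is a length prefix
        // followed by the reassembled stream bytes.
        const uint8_t *p   = buf + offsets[i];
        uint32_t       len = *(const uint32_t *)p;
        p += sizeof(uint32_t);

        uint32_t state = 0, matches = 0;
        for (uint32_t j = 0; j < len; j++) {
            state = dfa[(state & 0x7fffffffu) * 256 + p[j]];
            if (state & 0x80000000u)                // match bit set by the DFA compiler
                matches++;
        }
        match_count[i] = matches;
    }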
Pipelining CPU and GPU: double-buffering of packet buffers. Each CPU core collects newly reassembled packets while the GPUs process the previous batch. This effectively hides the GPU communication costs.
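A minimal sketch of the double-buffering loop, assuming pinned buffers and an asynchronous GPU submission path: while the GPU works on one buffer, the CPU core fills the other, then the two are swapped. The helper functions are hypothetical placeholders:

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Hypothetical helpers: fill_batch() appends reassembled packets until the
    // buffer is full (or a timeout expires); gpu_scan_async() issues the copy and
    // kernel on `stream`; report_matches() consumes the previous batch's results.
    extern size_t fill_batch(uint8_t *buf, size_t capacity);
    extern void   gpu_scan_async(const uint8_t *buf, size_t used, cudaStream_t stream);
    extern void   report_matches(void);

    void capture_loop(void)
    {
        enum { BATCH_BYTES = 4 * 1024 * 1024 };
        uint8_t *bufs[2];
        cudaHostAlloc((void **)&bufs[0], BATCH_BYTES, cudaHostAllocDefault);
        cudaHostAlloc((void **)&bufs[1], BATCH_BYTES, cudaHostAllocDefault);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        int cur = 0;
        for (;;) {
            // Fill the current buffer while the GPU is still scanning the other one.
            size_t used = fill_batch(bufs[cur], BATCH_BYTES);

            cudaStreamSynchronize(stream);   // wait for the previous batch to finish
            report_matches();                // handle its results

            gpu_scan_async(bufs[cur], used, stream);  // kick off the new batch
            cur ^= 1;                        // switch to the other buffer
        }
    }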
Recap. NIC: demultiplexes packets at 1-10 Gbit/s across Rx-queues. CPUs: per-flow protocol analysis on the packet streams. GPUs: data-parallel content matching on the reassembled packet streams.
Outline: Architecture, Implementation, Performance Evaluation, Conclusions
Setup: Hardware. NUMA architecture with QuickPath Interconnect. [Diagram: two CPUs, each with local memory, two I/O hubs (IOH), one NIC, and two GPUs.]
Model / Specs:
2 x CPU: Intel E5520, 2.27 GHz x 4 cores
2 x GPU: NVIDIA GTX480, 1.4 GHz x 480 cores
1 x NIC: Intel 82599EB, 10 GbE
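On such a NUMA machine, each group of worker threads can be directed to the GPU attached to its own I/O hub so batches do not cross the interconnect. A minimal sketch; the core-to-GPU mapping below is an assumption, since the real topology depends on the motherboard:

    #include <cuda_runtime.h>

    // Assumption: cores 0-3 sit on the NUMA node closer to GPU 0,
    // cores 4-7 on the node closer to GPU 1.
    static void select_gpu_for_core(int core)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        int gpu = (core < 4) ? 0 : 1;
        if (gpu >= num_gpus)
            gpu = 0;
        cudaSetDevice(gpu);   // all subsequent CUDA calls of this thread use this GPU
    }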
GPU Throughput: Pattern Matching Performance, bounded by PCIe capacity. Single-GPU throughput as the number of CPU-cores grows: 1 core: 14.6 Gbit/s; 2 cores: 26.7 Gbit/s; 4 cores: 42.5 Gbit/s; 8 cores: 48.1 Gbit/s. The performance of a single GPU increases as the number of CPU-cores increases.
GPU Throughput: Pattern Matching Performance. Adding a second GPU to the 8-core configuration raises pattern-matching throughput from 48.1 to 70.7 Gbit/s.
Setup: Network. [Diagram: a traffic generator/replayer connected to MIDeA over a 10 GbE link.]
Synthetic traffic: randomly generated traffic, throughput in Gbit/s vs. packet size (200, 800, and 1500 bytes). [Plot comparing MIDeA against Snort running on 8 cores; MIDeA reaches up to 7.2 Gbit/s.]
Real traffic: a replayed trace captured at the gateway of a university campus, throughput in Gbit/s. MIDeA: 5.2 Gbit/s with zero packet loss. Snort (8 cores): 1.1 Gbit/s.
Summary. MIDeA: a multi-parallel network intrusion detection architecture. Single-box implementation based on commodity hardware, costing less than $1500. Operates at 5.2 Gbit/s with zero packet loss, with 70 Gbit/s of raw pattern-matching throughput.
Thank you! gvasil@ics.forth.gr