Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis



Similar documents
Analysis of Network Beaconing Activity for Incident Response

Indexing Full Packet Capture Data With Flow

Big Data & Scripting Part II Streaming Algorithms

ER/Studio 8.0 New Features Guide

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Development at the Speed and Scale of Google. Ashish Kumar Engineering Tools

Best Practices for Hadoop Data Analysis with Tableau

NetFlow Tracker Overview. Mike McGrath x ccie CTO mike@crannog-software.com

Multi-dimensional index structures Part I: motivation

LOG INTELLIGENCE FOR SECURITY AND COMPLIANCE

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Oracle Database In-Memory The Next Big Thing

Scalable Prefix Matching for Internet Packet Forwarding

Reference Architecture, Requirements, Gaps, Roles

Safe Harbor Statement

The Complete Performance Solution for Microsoft SQL Server

2009 Oracle Corporation 1

AlienVault. Unified Security Management (USM) 5.x Policy Management Fundamentals

Case Study: Instrumenting a Network for NetFlow Security Visualization Tools

Wireshark Developer and User Conference

Practical Experience with IPFIX Flow Collectors

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

TIBCO Spotfire Metrics Modeler User s Guide. Software Release 6.0 November 2013

Semantic based Web Application Firewall (SWAF V 1.6) Operations and User Manual. Document Version 1.0

Introducing the BIG-IP and SharePoint Portal Server 2003 configuration

Flexible Web Visualization for Alert-Based Network Security Analytics

SQL Server Administrator Introduction - 3 Days Objectives

NetFlow Analytics for Splunk

Search Big Data with MySQL and Sphinx. Mindaugas Žukas

Monitoring Traffic manager

TSM Studio Server User Guide

Kafka & Redis for Big Data Solutions

Introducing the Microsoft IIS deployment guide

Main Memory Data Warehouses

Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK tel:

An apparatus for P2P classification in Netflow traces

DNS (Domain Name System) is the system & protocol that translates domain names to IP addresses.

uncommon thinking ORACLE BUSINESS INTELLIGENCE ENTERPRISE EDITION ONSITE TRAINING OUTLINES

An overview of traffic analysis using NetFlow

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Bloom Filter based Inter-domain Name Resolution: A Feasibility Study

Quanqing XU YuruBackup: A Highly Scalable and Space-Efficient Incremental Backup System in the Cloud

Midterm. Name: Andrew user id:

LOG AND EVENT MANAGEMENT FOR SECURITY AND COMPLIANCE

Graph Database Proof of Concept Report

DEPLOYMENT GUIDE Version 2.1. Deploying F5 with Microsoft SharePoint 2010

Symantec Endpoint Protection Shared Insight Cache User Guide

Design and Implementation of a Storage Repository Using Commonality Factoring. IEEE/NASA MSST2003 April 7-10, 2003 Eric W. Olsen

McAfee Web Gateway 7.4.1

Whitepaper. Innovations in Business Intelligence Database Technology.

Big Data Patterns. Ron Bodkin Founder and President, Think Big

NfSen Plugin Supporting The Virtual Network Monitoring

SAP InfiniteInsight Explorer Analytical Data Management v7.0

Data processing goes big

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

There are numerous ways to access monitors:

White Paper. Intrusion Detection Deploying the Shomiti Century Tap

Index Terms Domain name, Firewall, Packet, Phishing, URL.

Niara Security Intelligence. Overview. Threat Discovery and Incident Investigation Reimagined

Enterprise Organizations Need Contextual- security Analytics Date: October 2014 Author: Jon Oltsik, Senior Principal Analyst

A Quantitative Approach to Security Monitor Deployment

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

Big Data and Scripting map/reduce in Hadoop

How good can databases deal with Netflow data

Limitations of Packet Measurement

How to Design and Create Your Own Custom Ext Rep

SSIS Training: Introduction to SQL Server Integration Services Duration: 3 days

Bloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland

Real vs. Synthetic Web Performance Measurements, a Comparative Study

LOG MANAGEMENT AND SIEM FOR SECURITY AND COMPLIANCE

Prediction of DDoS Attack Scheme

NetFlow Performance Analysis

Product Characteristics Page 2. Management & Administration Page 2. Real-Time Detections & Alerts Page 4. Video Search Page 6

Lavastorm Resolution Center 2.2 Release Frequently Asked Questions

An Introduction to HIPPO V4 - the Operational Monitoring and Profiling Solution for the Informatica PowerCenter platform.

Set Up a VM-Series Firewall on the Citrix SDX Server

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Oracle IVR Integrator

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

Flexible Deterministic Packet Marking: An IP Traceback Scheme Against DDOS Attacks

Network Monitoring with SNMP

Processing and Analyzing Streams. CDRs in Real Time

Using Database Performance Warehouse to Monitor Microsoft SQL Server Report Content

Consuming Real Time Analytics and KPI powered by leveraging SAP Lumira and SAP Smart Business in Fiori SESSION CODE: 0611 Draft!!!

How To Handle Big Data With A Data Scientist

Exam 1 Review Questions

High-Volume Data Warehousing in Centerprise. Product Datasheet

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Cassandra A Decentralized, Structured Storage System

HADOOP PERFORMANCE TUNING

How To Manage Log Management

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Rethinking SIMD Vectorization for In-Memory Databases

Transcription:

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis January 8, 2008 FloCon 2008 Chris Roblee, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344

Overview Introduction to Bloom Filters Overview of CIAC s Bloom Filter-Based indexing System Approach's Applicability for CIAC & other CERTs Performance on Actual Flow Data Applications of Approach in Conjunction With Analytical Tools Facilitating incident detection and analysis with flow visualization tools. 2

A Very Brief Introduction to Bloom Filters 3

Introduction to Bloom Filters High-level Functionality trivial Create bloom filter of fruit types, name: Add: fruit apple create() insert() Bloom Filter fruit Add: lychee insert() Contains element: lychee? query() Answer: Yes Contains element: broccoli? query() Answer: No http://www.eecs.harvard.edu/~michaelm/newwork/postscripts/bloomfiltersurvey.pdf http://en.wikipedia.org/wiki/bloom_filter 4

Introduction to Bloom Filters The Concept Efficient, probabilistic data structure, providing extremely lightweight string lookups, or approximate membership queries. Invented by Burton Bloom in 1970 to optimize spellchecking. Trade-off small probability of false positives for massive gains in space and time efficiency. Popular for various large-scale network applications (e.g., web caches, query routing). References: http://www.eecs.harvard.edu/~michaelm/newwork/postscripts/bloomfiltersurvey.pdf http://en.wikipedia.org/wiki/bloom_filter 5

How Bloom Filters Work 1. Empty bloom filter is a bit array of m 0 - bits. k 2. Introduce k different hash functions, each maps key value to one of m array positions. m bits ELEMENT1 3. Insert element by feeding it to each hash function, to obtain k array positions. Set these bits to 1. k ELEMENT1 4. Query element (check its existence) by re-feeding into each hash function, and checking corresponding bit positions. If all bits are 1, then element is either in the filter or it s a false positive. k ELEMENT2 5. If bit positions of hashes of an element contain a 0, then that element is definitely not in filter (no false negatives). k 6

Introduction to Bloom Filters False Positives Probability of false positive for a populated bloom filter is: Probability of False Positive 0.015 0.014 0.013 0.012 0.011 0.01 0.009 0.008 0.007 0.006 0.005 0.004 0.003 0.002 0.001 0 p(fp ) Probability of False Positive 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 m/n (filter bits/element) k=8 k=6 k=4 k=2 k - number of hash functions used n number of elements inserted m size of bloom filter (bit array) 7

Bloom Filters - Summary Quick test of element membership: 0 likelihood of false negatives Tunable false positive rates Probability of collisions proportional to the number of elements in set & inversely proportional to filter size. Enforce maximum false positive threshold by tuning filter size: Often require as little as one byte per element Functionality Significant space and time advantages over many standard, deterministic indexing structures: Self-balancing trees Tries Hash-Tables Arrays, Linked Lists Query time is O(k), independent of number of items in set. Many open source implementations available. Practicality Inexpensive, easy to deploy and maintain 8

Bloom Filters: Operational Viability for CIAC and the CERT Community 9

CIAC s Flow Collection Review CIAC collects massive volumes of biflow data from 29 sensors across the DOE complex: 300-500 million biflows daily (~4600/s) ~14GB/94GB compressed/uncompressed daily Approximate daily averages by sensor 50,000,000.00 45,000,000.00 40,000,000.00 35,000,000.00 Recordss 30,000,000.00 25,000,000.00 20,000,000.00 Average # records per day Min # records per day Max # records per day 15,000,000.00 10,000,000.00 5,000,000.00 0.00 Sensor 10

CIAC s Flow Collection Review biflow feed: Session summary Fields: Date/Time & Duration Source/Destination IP and Port Protocol Information Bidirectional Byte and Packet Counts Bidirectional Protocol Options Subset of TCP/ICMP flags Example Biflow Record 1171066191.997532,20070210000951.997532,site3,flo30,6,192168081021,192,168,81,21,IT,010000001008,10,0,1,8,US,53,1024,0,0,0.0000,0,0,54,0,1,0,0,0,0,0,0,60,0,60,0,,,14,00,+14,0,0,0,0 11

CIAC Analysis - Legacy Search Methodologies File grep Search sensors and hours for range of interest (e.g., site3, site12, site21 from 10/1/06 through 12/31/06 ). Requires reading/decompressing and combing through GBs of data (from disk) for every day searched. RDBMS - Oracle SQL+ Perl/JDBC Typically limited* to past ~25 days of bi-directional sessions (~15%) AWARE web portal High-level charting and statistics (session counts, etc.) Biflow DB Many mission-critical searches can take several hours or days to complete 12

Current CIAC Analysis Data Flow 13

Watch and Warn Query Needs and Issues Rapidly search all flow data over long periods of time: Analysts typically search on IP address: Watch list (suspicious, known-bad, etc.) Nodes of interest Compromised internal nodes Various time (hours, days, months) and space (single site, all sites) scales. Require quick turnaround (minutes) to respond to site requests: e.g. Have you seen these IPs at my site in the past 3 weeks? IP-based searches often yield relatively small result sets: Interesting IP might only have been seen in 30 site-hours, whereas 21,600 hours (~1 DOE-month) might have been searched. 99.9% wasted duty cycle! Need to reduce the search space (raw flow files) through better cataloging of data as it arrives. 14

Bloomdex: CIAC s Bloom Filter-based Indexing System for Network Flow Analysis 15

Solution: Bloomdex Bloomdex A hybrid hierarchy/file-based Bloom filter system to index CIAC s biflow records. Currently indexed by source or destination IP. Index partitioned by: Site-month (e.g., SITE8 11/2006 ) Site-day (e.g., SITE8 11/5/2006 ) Site-hour (e.g., SITE8 11/5/2006 13:00 ) Uses intuitive directory tree structures and multi-scale bloom filters to accelerate IP-based searches. max(fp rate) 2x10-4 3 bytes of storage per unique IP 16

Blooomdex - CIAC Analysis Data Flow 17

18 Reducing the Biflow Search Space Biflow Flat File Repository Bloom filters 1 23n 2. Get list of sitehour hits, L L{ } 1. Query existence of IPs, I I{ } 3. Only search raw files corresponding to site-hours in L L{ } 4. Obtain biflow results, R R{ } 5. Subsequent Analyses R{ } ~10s of TBs (several years => millions of gzip d flat files) ~200GB (~500k files)

Bloomdex: Performance Profile 19

Bloomdex: Comparative Performance Profiles Typical analyst IP-based queries: IPs searched Date-range searched Site-hours searched Site-hour hits % Sitehour hits Session hits Raw biflow file hits Search time (conventional) Search time (bloomdex) Relative Speedup 8 12/13/06-1/9/07 19,140 466 2.43% 10,594 600 16.29 hours 1.3 hours 12.5 13 10/15/06-1/17/07 65,888 1,166 1.77% 158,345 1,667 57.52 hours 3.45 hours 16.7 13 1/22/07-1/29/07 4,959 31 0.60% 78 39 4.16 hours 5.82 minutes 42.9 4 1/1/07-1/2/07 725 3 0.41% 3 3 21.5 minutes 28 seconds 46.1 9 1/23/07-1/24/07 725 1 0.14% 1 1 41.7 minutes 41 seconds 61 Expect >10x speedup Strong dependency on site-hour hit ratio Future optimizations to search tools could make it even faster 20

Bloomdex: Performance Profile Comparative Performance: Relative Search Speedup 70 60 Speedup Factor her 50 40 30 20 10 0 0.00% 0.50% 1.00% 1.50% 2.00% 2.50% 3.00% Site-hour hits Strong relationship between speedup and site-hour hit ratio Ideal for searches on sparsely-occurring IPs 21

Bloomdex: Performance Profile Bloom filter generation performance: Average site-day filter generation rate: ~ 33/hour = 792/day (current incoming rate: 29/day) Average site-hour filter generation rate: ~ 390/hour = 9360/day (current incoming rate: 696/day) Will scale well to 100+ sites (cheaply) 22

Bloomdex: Status Coverage 2.5 years of biflow records indexed. Storage footprint 3 bytes per unique IP at the site-hour, site-day and site-month levels. Bloom filters currently using ~200GB of shared storage. Exploring additional space and performance-based optimizations Other dimensions (e.g., port, ip-port, srcip-dstip pairs) Counting Bloom filters Different hashing functions Parallelization 23

Bloomdex: Analyst Workflow Integration 24

Analyst Workflow Integration IDS Alerts Alert Files AWARE Portal Biflow DB Stats 10Gb Packet Capture Biflow Aggregator Session Files Bloom filters 1 23n Everest Graph Cache Everest Graph Vis. Statistics / Aggregation Pattern Matching Historical Queries Trending Algorithmic Charting / Histograms Visual Analysis Graph Analysis Human Inspection High Data Volume Low 25

Facilitating Incident Analysis with Bloomdex and Everest Flow Visualization Example Use Scenario: 1. Site reports compromise Supplies 4 suspect IPs to CIAC. 2. CIAC queries biflow data for suspect IPs using Bloomdex query tool: Search all sensors over a sufficient time range (perhaps a full year). Quickly identify several other sites with hosts exhibiting similar behaviors. Analysis set narrowed down to just 1,635 sessions. 26

Analysis Using Bloomdex and Everest (2) 3. Launch Everest graph visualization tool, point to Bloomdex output file containing result set (1,635 biflow records). 4. Issue general query to generate session graph: 27

Analysis Using Bloomdex and Everest (3) 5. Perform drill-down or aggregate analysis Suspicious hosts 28

Analysis Using Bloomdex and Everest (4) 6. Perform in-depth or summary analysis 29

Conclusion The Bloomdex suite enables significantly faster turnaround times on analyst IP-based queries: It does this by drastically narrowing the search space through Bloom filter pre-queries. Facilitates use of other analytic tools, such as Everest. Provides significant space savings. Very straightforward and inexpensive to deploy and maintain. Future: Utilize compressed bitmap indexes as an integrated indexing/retrieval solution. 30

Questions cdr [at] llnl.gov 31