Cisco OpenSOC Hadoop Design with Bro Kurt Grutzmacher (@grutz) Solutions Architect, Cisco Security Services 19 August 2014
What Does OpenSOC Address? SOC Challenges: Volume: from Terabytes to Zettabytes Variety: structured to unstructured Velocity: everything is a sensor, generates logs Veracity: data quality issues Value: What to keep? Purge? Summarize? Cisco Public 2
Introducing OpenSOC Supports Data-Driven Security Model Unified platform for ingest, storage, analytics Provides multiple views/access patterns for data Interactive Analytics and Predictive Modeling Provides contextual real-time alerts Rapid deployment and scoring Cisco Public 3
Emergence of Big Data Systems that scale out, not up Parallel and scalable computation tools Cheap, massively-scalable storage Stream computation + stream analysis Scaling + approximation Cisco Public 4
Technology Behind OpenSOC Telemetry Capture Layer: Data Bus: Stream Processor: Real-Time Index and Search: Long-Term Data Store: Long-Term Packet Store: Visualization Platform: Apache Flume Apache Kafka Apache Storm Elastic Search Apache Hive Apache Hbase Kibana Cisco Public 5
Introducing OpenSOC Intersection of Big Data and Security Analytics Real-Time Alerts Anomaly Detection Data Correlation Hadoop Scalable Compute Multi Petabyte Storage Interactive Query Real-Time Search OpenSOC Rules and Reports Predictive Modeling UI and Applications Elastic Search Unstructured Data Scalable Stream Processing Data Access Control Big Data Platform Cisco Public 6
OpenSOC in a Nutshell Threat Intelligence Feeds Raw Network Stream Applications + Analyst Tools Network Metadata Stream Netflow Syslog Raw Application Logs Parse + Format Enrich Alert Log Mining and Analytics Elastic Search Network Packet Mining and PCAP Reconstruction HBase Big Data Exploration, Predictive Modeling Hive Other Streaming Telemetry Real-Time Index Raw Packet Store Long-Term Store Enrichment Data Cisco Public 7
Overview of the Analytics Pipeline Cisco Public 8
Analytics Pipeline Flume Kafka Storm OpenSOC-Streaming RAW Transform Enrich Alert (Rules-Based) Big Data Stores Elastic Search Real-Time Index and Search HIVE Long-Term Data Store OpenSOC-Aggregation Filter Aggregators HBase Windowed Rollup Store Enriched OpenSOC-ML Router Model 1 Model 2 Scorer HIVE Alerts Store Model n Alerts UI UI UI SOC Alert Consumers UI UI Web Services Secure Gateway Services External Alert Consumers Remedy Ticketing System Cisco Public 9
OpenSOC-Streaming KAFKA TOPIC: RAW [TIMESTAMP] Log message: "This is a test [TIMESTAMP] log message", Log generated message: from "This is a mydomain.com, test [TIMESTAMP] log message", 10.20.30.40:123 Log generated message: -> from "This is a 20.30.40.50:456 mydomain.com, test log message", {TCP} 10.20.30.40:123 generated -> from 20.30.40.50:456 mydomain.com, {TCP} 10.20.30.40:123 -> 20.30.40.50:456 {TCP} OpenSOC-Streaming 1:1 KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} Cisco Public 10
Streaming Topology Message Stream Shuffle Grouping Parsers Enrich Geo Enrich Whois Control Stream ALL Grouping Parser Enrich Enrich KAFKA Topic: Enriched Kafka Spout MESSAGE Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Plugin Geo Plug-In Whois Plug-In SQL Store K-V Store Cisco Public 11
OpenSOC-Aggregation KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} OpenSOC-Aggregation RULES 1:* HBASE yyyy:mm:dd:hh:[bucket_id]: { total-messages:1003 protocol-tcp:305 protocol-udp:201 city-from-phoenix:11 city-from-sanjose:405 country-from-usa:10 country-from-ca:11 ip-from-sketch(a):301 ip-from-sketch(n):55 port-from-sketch(a):44 port-from-sketch(n):12 } OpenSOC-Streaming Cisco Public 12
Aggregation Topology Message Stream Shuffle Grouping Control Stream ALL Grouping Filters Filter Pulse Spout KEY Aggregators Aggregator HBase KEY:CF:COUNT TIMER Kafka Spout MESSAGE Filter Filter TUPLE TUPLE TUPLE TUPLE Aggregator Aggregator HBase KEY:CF:COUNT HBase KEY:CF:COUNT Filter Aggregator HBase KEY:CF:COUNT MAPPER FILTER AGGREGATE Cisco Public 13
OpenSOC-ML KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} KAFKA Topic: ML Alerts OpenSOC-Streaming OpenSOC-ML *:* alert: { } anomaly-type: irregular pattern, models-triggered: [m1,m2,m4] Cisco Public 14
ML Topology Pulse Spout Event ID Online Models Model Event ID TIMER Model Kafka Router Spout MESSAGE Offline Models Scorer Alert KAFKA Topic: Alerts Message Stream ALL Grouping Control Stream ALL Grouping Model Model Eval Algo HBase MR JOB Spark JOB HIVE Long-Term Data Store Cisco Public 15
Packet Metadata Cisco Public 16
OpenSOC - Stitching Things Together Source Systems Data Collection Messaging System Real Time Processing Storage Access Passive Tap Traffic Replicator PCAP Kafka PCAP Topic Storm PCAP Topology Hive Raw Data Analytic Tools R / Python Telemetry Sources Syslog HTTP File System Flume Agent A Agent B DPI Topic A Topic B Topic DPI Topology Deeper Look A Topology B Topology ORC Elastic Search Index HBase Power Pivot Tableau Web Services Search Other Agent N N Topic N Topology PCAP Table PCAP Reconstruction Cisco Public 17
PCAP Topology Real Time Processing Storm Storage Hive Raw Data HDFS Bolt ORC Elastic Search Kafka Spout Parser Bolt ES Bolt Index HBase Bolt HBase PCAP Table Cisco Public 18
DPI Topology & Telemetry Enrichment Real Time Processing Storm Storage Hive HDFS Bolt Raw Data ORC Kafka Spout GEO Enrich CIF Enrich Elastic Search ES Bolt Index Parser Bolt Whois Enrich HBase PCAP Table Cisco Public 19
Enrichments {! msg_key1 : msg value1,! src_ip : 10.20.30.40,! dest_ip : 20.30.40.50,! domain : mydomain.com! }! "geo":[ {"region":"ca",! "postalcode":"95134",! "areacode":"408",! "metrocode":"807",! "longitude":-121.946,! "latitude":37.425,! "locid":4522,! "city":"san Jose",! "country":"us"! }]! "whois":[ {! "OrgId":"CISCOS",! "Parent":"NET-144-0-0-0-0",! "OrgAbuseName":"Cisco Systems Inc",! "RegDate":"1991-01-171991-01-17",! "OrgName":"Cisco Systems",! "Address":"170 West Tasman Drive",! "NetType":"Direct Assignment"! } ],! cif : Yes! RAW Message Parser Bolt GEO Enrich Cache Whois Enrich Cache Threat Intel Enrich Cache Enriched Message MySQL HBase HBase Geo Lite Data Whois Data CIF Data Cisco Public 20
Producing Bro Messages to OpenSOC Bro 2.2/2.3: https://github.com/grutz/bro Writer::LogKafka addition (will not be merged) Not fully baked but spits out a ton of messages Awaiting new Logging process (BIT-1222) Let Bro do its DPD functionality Let the Hadoop cluster do the enrichment Off-loading capture cards help let the CPU better do its chores Tie your Bro capture system to HDFS and extract files to it! Cisco Public 21
What OpenSOC is NOT easy to install and get working quickly Remember Bro 1.x? It s sorta like that right now A good framework but really tough to put together but getting better a killer of your existing SIEM At least not yet that silver bullet which will solve all your security problems More like a rusty nail which you re just about to step on with Java Cisco Public 22
Algorithms Cisco Public 23
Types of Algorithms Offline analyze entire data set at once Generally a good fit for Apache Hadoop/Map Reduce Model compiled via batch, scored via stream processor Online start with an initial state and analyze each piece of data serially one at a time Generally require a chain of Map Reduce jobs Good fit for Apache Spark, Tez, Storm Primarily batch, good for Lambda architectures Online Streaming real-time, in-memory, limited analysis Generally a good fit for Apache Storm, Spark-Streaming Probabilistic, random structures, approximations Cisco Public 24
Online Models Stream Processing Bolt Bolt Stream Bolt Bolt Bolt Bolt Result Batch Workflow Hadoop Cisco Public 25
Examples: Stream Clustering StreamKM++: computes a small weighted sample of the data stream and it uses the k-means++ algorithm as a randomized seeding technique to choose the first values for the clusters. CluStream: micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. ClusTree: It is a parameter free algorithm automatically adapting to the speed of the stream and it is capable of detecting concept drift, novelty, and outliers in the stream. DenStream: uses dense micro-clusters (named core-micro-cluster) to summarize clusters. To maintain and distinguish the potential clusters and outliers, this method presents core-micro-cluster and outlier micro-cluster structures. D-Stream: maps each input data record into a grid and it computes the grid density. The grids are clustered based on the density. This algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. CobWeb: uses a classification tree. Each node in a classification tree represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value distributions of objects classified under the node Cisco Public 26
Example: Stream Classification Hoeffding Tree (VFDT) incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time Half-Space Trees ensemble model that randomly spits data into half spaces. They are created online and detect anomalies by their deviations in placement within the forest relative to other data from the same window Cisco Public 27
Examples: Outlier Detection[10] Median Absolute Deviation: Telemetry is anomalous if the deviation of its latest datapoint with respect to the median is X times larger than the median of deviations Standard Deviation from Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the average. Standard Deviation from Moving Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoints minus the moving average is greater than three standard deviations of the moving average. Cisco Public 28
Examples: Outlier Detection [10] Mean Subtraction Cumulation: Telemetry is anomalous if the value of the next datapoint in the series is farther than three standard deviations out in cumulative terms after subtracting the mean from each data point Least Squares: Telemetry is anomalous if the average of the last three datapoints on a projected least squares model is greater than three sigma Histogram Bins: Telemetry is anomalous if the average of the last three datapoints falls into a histogram bin with less than x Cisco Public 29
Offline Models Stream Processing Stream Bolt Result Batch Workflow Stream Stream Hadoop Batch JOB Model Cisco Public 30
Offline Algorithms Hypothesis Tests Chi2 Test (Goodness of Fit): A feature is anomalous if it the data for the latest micro batch (for the last 10 minutes) comes from a different distribution than the historical distribution for that feature Grubbs Test: telemetry is anomalous if the Z score is greater than the Grubb's score. Kolmogorov-Smirnov Test: check if data distribution for last 10 minutes is different from last hour Simple Outliers test: telemetry is anomalous if the number of outliers for the last 10 minutes is statistically different then the historical number of outliers for that time frame Decision Trees/Random Forests Association Rules (Apriori) BIRCH/DBSCAN Clustering Auto Regressive (AR) Moving Average (MA) Cisco Public 31
Approximation for Streaming Cisco Public 32
Approximation for Streaming Skip Lists Trade memory for lookup time increase Hierarchical layered indexing on top of lists Lookup improvement: O(n) -> O(log n) Best Uses: Search for element in a list Approximate the number of distinct elements in a list Cisco Public 33
Approximation for Streaming HyperLogLog Turns each feature to a hash value Keeps track of the longest run of a feature Approximates the number of occurrences of a feature based on it s longest runs Best Uses: Estimate how rare an occurrence of a feature is Approximate count of distinct objects over time Estimate entropy of a data set Cisco Public 34
Approximation for Streaming Bloom Filters Trade memory for fast set membership check Hash all incoming elements to hash sets Use multiple hash functions and multiple hash sets of varying sizes Best Uses: Check element membership in a set Cisco Public 35
Approximation for Streaming Sketches Top-K: count of top elements in the stream Count-Min: histograms over large feature sets (similar to Bloom Filters with counts) Digest algorithm for computing approximate quantiles on a collection of integers Cisco Public 36
@ProjectOpenSOC https://github.com/opensoc Cisco Public 37
Thank you.