Cisco OpenSOC Hadoop Design with Bro



Similar documents
Combat Sophisticated Threats

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Presented by: Aaron Bossert, Cray Inc. Network Security Analytics, HPC Platforms, Hadoop, and Graphs Oh, My

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

HDP Hadoop From concept to deployment.

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

HADOOP. Revised 10/19/2015

SEIZE THE DATA SEIZE THE DATA. 2015

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

The Internet of Things and Big Data: Intro

Oracle Big Data SQL Technical Update

Hadoop Ecosystem B Y R A H I M A.

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

HiBench Introduction. Carson Wang Software & Services Group

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

What s New in Security Analytics Be the Hunter.. Not the Hunted

Towards Smart and Intelligent SDN Controller

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

The 4 Pillars of Technosoft s Big Data Practice

Openbus Documentation

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

Advanced In-Database Analytics

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Creating Big Data Applications with Spring XD

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

Comprehensive Analytics on the Hortonworks Data Platform

Big Data & Security. Aljosa Pasic 12/02/2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Data Challenges in Telecommunications Networks and a Big Data Solution

Architectures for massive data management

Detect & Investigate Threats. OVERVIEW

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

The basic data mining algorithms introduced may be enhanced in a number of ways.

HDP Enabling the Modern Data Architecture

ANALYTICS CENTER LEARNING PROGRAM

Upcoming Announcements

Time-Series Databases and Machine Learning

Big Data and Data Science: Behind the Buzz Words

Hadoop and Map-Reduce. Swati Gore

CRITEO INTERNSHIP PROGRAM 2015/2016

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Transforming the Telecoms Business using Big Data and Analytics

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Search and Real-Time Analytics on Big Data

Streamdrill: Analyzing Big Data Streams in Realtime

How To Create A Data Visualization With Apache Spark And Zeppelin

BIG DATA What it is and how to use?

Big Data Analytics Nokia

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Real Time Data Processing using Spark Streaming

How Companies are! Using Spark

Streaming items through a cluster with Spark Streaming

PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management

Open source Google-style large scale data analysis with Hadoop

Kafka & Redis for Big Data Solutions

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Innovative, High-Density, Massively Scalable Packet Capture and Cyber Analytics Cluster for Enterprise Customers

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

COMP9321 Web Application Engineering

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Cisco Data Preparation

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

AlienVault Unified Security Management Solution Complete. Simple. Affordable Life Cycle of a log


Case Study: Instrumenting a Network for NetFlow Security Visualization Tools

Unified Big Data Processing with Apache Spark. Matei

Distributed Computing and Big Data: Hadoop and MapReduce

BIG DATA TRENDS AND TECHNOLOGIES

Apache Hadoop FileSystem and its Usage in Facebook

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

A Brief Introduction to Apache Tez

TORNADO Solution for Telecom Vertical

Transcription:

Cisco OpenSOC Hadoop Design with Bro Kurt Grutzmacher (@grutz) Solutions Architect, Cisco Security Services 19 August 2014

What Does OpenSOC Address? SOC Challenges: Volume: from Terabytes to Zettabytes Variety: structured to unstructured Velocity: everything is a sensor, generates logs Veracity: data quality issues Value: What to keep? Purge? Summarize? Cisco Public 2

Introducing OpenSOC Supports Data-Driven Security Model Unified platform for ingest, storage, analytics Provides multiple views/access patterns for data Interactive Analytics and Predictive Modeling Provides contextual real-time alerts Rapid deployment and scoring Cisco Public 3

Emergence of Big Data Systems that scale out, not up Parallel and scalable computation tools Cheap, massively-scalable storage Stream computation + stream analysis Scaling + approximation Cisco Public 4

Technology Behind OpenSOC Telemetry Capture Layer: Data Bus: Stream Processor: Real-Time Index and Search: Long-Term Data Store: Long-Term Packet Store: Visualization Platform: Apache Flume Apache Kafka Apache Storm Elastic Search Apache Hive Apache Hbase Kibana Cisco Public 5

Introducing OpenSOC Intersection of Big Data and Security Analytics Real-Time Alerts Anomaly Detection Data Correlation Hadoop Scalable Compute Multi Petabyte Storage Interactive Query Real-Time Search OpenSOC Rules and Reports Predictive Modeling UI and Applications Elastic Search Unstructured Data Scalable Stream Processing Data Access Control Big Data Platform Cisco Public 6

OpenSOC in a Nutshell Threat Intelligence Feeds Raw Network Stream Applications + Analyst Tools Network Metadata Stream Netflow Syslog Raw Application Logs Parse + Format Enrich Alert Log Mining and Analytics Elastic Search Network Packet Mining and PCAP Reconstruction HBase Big Data Exploration, Predictive Modeling Hive Other Streaming Telemetry Real-Time Index Raw Packet Store Long-Term Store Enrichment Data Cisco Public 7

Overview of the Analytics Pipeline Cisco Public 8

Analytics Pipeline Flume Kafka Storm OpenSOC-Streaming RAW Transform Enrich Alert (Rules-Based) Big Data Stores Elastic Search Real-Time Index and Search HIVE Long-Term Data Store OpenSOC-Aggregation Filter Aggregators HBase Windowed Rollup Store Enriched OpenSOC-ML Router Model 1 Model 2 Scorer HIVE Alerts Store Model n Alerts UI UI UI SOC Alert Consumers UI UI Web Services Secure Gateway Services External Alert Consumers Remedy Ticketing System Cisco Public 9

OpenSOC-Streaming KAFKA TOPIC: RAW [TIMESTAMP] Log message: "This is a test [TIMESTAMP] log message", Log generated message: from "This is a mydomain.com, test [TIMESTAMP] log message", 10.20.30.40:123 Log generated message: -> from "This is a 20.30.40.50:456 mydomain.com, test log message", {TCP} 10.20.30.40:123 generated -> from 20.30.40.50:456 mydomain.com, {TCP} 10.20.30.40:123 -> 20.30.40.50:456 {TCP} OpenSOC-Streaming 1:1 KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} Cisco Public 10

Streaming Topology Message Stream Shuffle Grouping Parsers Enrich Geo Enrich Whois Control Stream ALL Grouping Parser Enrich Enrich KAFKA Topic: Enriched Kafka Spout MESSAGE Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Plugin Geo Plug-In Whois Plug-In SQL Store K-V Store Cisco Public 11

OpenSOC-Aggregation KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} OpenSOC-Aggregation RULES 1:* HBASE yyyy:mm:dd:hh:[bucket_id]: { total-messages:1003 protocol-tcp:305 protocol-udp:201 city-from-phoenix:11 city-from-sanjose:405 country-from-usa:10 country-from-ca:11 ip-from-sketch(a):301 ip-from-sketch(n):55 port-from-sketch(a):44 port-from-sketch(n):12 } OpenSOC-Streaming Cisco Public 12

Aggregation Topology Message Stream Shuffle Grouping Control Stream ALL Grouping Filters Filter Pulse Spout KEY Aggregators Aggregator HBase KEY:CF:COUNT TIMER Kafka Spout MESSAGE Filter Filter TUPLE TUPLE TUPLE TUPLE Aggregator Aggregator HBase KEY:CF:COUNT HBase KEY:CF:COUNT Filter Aggregator HBase KEY:CF:COUNT MAPPER FILTER AGGREGATE Cisco Public 13

OpenSOC-ML KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test log 10.20.30.40, message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test log 10.20.30.40, message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} KAFKA Topic: ML Alerts OpenSOC-Streaming OpenSOC-ML *:* alert: { } anomaly-type: irregular pattern, models-triggered: [m1,m2,m4] Cisco Public 14

ML Topology Pulse Spout Event ID Online Models Model Event ID TIMER Model Kafka Router Spout MESSAGE Offline Models Scorer Alert KAFKA Topic: Alerts Message Stream ALL Grouping Control Stream ALL Grouping Model Model Eval Algo HBase MR JOB Spark JOB HIVE Long-Term Data Store Cisco Public 15

Packet Metadata Cisco Public 16

OpenSOC - Stitching Things Together Source Systems Data Collection Messaging System Real Time Processing Storage Access Passive Tap Traffic Replicator PCAP Kafka PCAP Topic Storm PCAP Topology Hive Raw Data Analytic Tools R / Python Telemetry Sources Syslog HTTP File System Flume Agent A Agent B DPI Topic A Topic B Topic DPI Topology Deeper Look A Topology B Topology ORC Elastic Search Index HBase Power Pivot Tableau Web Services Search Other Agent N N Topic N Topology PCAP Table PCAP Reconstruction Cisco Public 17

PCAP Topology Real Time Processing Storm Storage Hive Raw Data HDFS Bolt ORC Elastic Search Kafka Spout Parser Bolt ES Bolt Index HBase Bolt HBase PCAP Table Cisco Public 18

DPI Topology & Telemetry Enrichment Real Time Processing Storm Storage Hive HDFS Bolt Raw Data ORC Kafka Spout GEO Enrich CIF Enrich Elastic Search ES Bolt Index Parser Bolt Whois Enrich HBase PCAP Table Cisco Public 19

Enrichments {! msg_key1 : msg value1,! src_ip : 10.20.30.40,! dest_ip : 20.30.40.50,! domain : mydomain.com! }! "geo":[ {"region":"ca",! "postalcode":"95134",! "areacode":"408",! "metrocode":"807",! "longitude":-121.946,! "latitude":37.425,! "locid":4522,! "city":"san Jose",! "country":"us"! }]! "whois":[ {! "OrgId":"CISCOS",! "Parent":"NET-144-0-0-0-0",! "OrgAbuseName":"Cisco Systems Inc",! "RegDate":"1991-01-171991-01-17",! "OrgName":"Cisco Systems",! "Address":"170 West Tasman Drive",! "NetType":"Direct Assignment"! } ],! cif : Yes! RAW Message Parser Bolt GEO Enrich Cache Whois Enrich Cache Threat Intel Enrich Cache Enriched Message MySQL HBase HBase Geo Lite Data Whois Data CIF Data Cisco Public 20

Producing Bro Messages to OpenSOC Bro 2.2/2.3: https://github.com/grutz/bro Writer::LogKafka addition (will not be merged) Not fully baked but spits out a ton of messages Awaiting new Logging process (BIT-1222) Let Bro do its DPD functionality Let the Hadoop cluster do the enrichment Off-loading capture cards help let the CPU better do its chores Tie your Bro capture system to HDFS and extract files to it! Cisco Public 21

What OpenSOC is NOT easy to install and get working quickly Remember Bro 1.x? It s sorta like that right now A good framework but really tough to put together but getting better a killer of your existing SIEM At least not yet that silver bullet which will solve all your security problems More like a rusty nail which you re just about to step on with Java Cisco Public 22

Algorithms Cisco Public 23

Types of Algorithms Offline analyze entire data set at once Generally a good fit for Apache Hadoop/Map Reduce Model compiled via batch, scored via stream processor Online start with an initial state and analyze each piece of data serially one at a time Generally require a chain of Map Reduce jobs Good fit for Apache Spark, Tez, Storm Primarily batch, good for Lambda architectures Online Streaming real-time, in-memory, limited analysis Generally a good fit for Apache Storm, Spark-Streaming Probabilistic, random structures, approximations Cisco Public 24

Online Models Stream Processing Bolt Bolt Stream Bolt Bolt Bolt Bolt Result Batch Workflow Hadoop Cisco Public 25

Examples: Stream Clustering StreamKM++: computes a small weighted sample of the data stream and it uses the k-means++ algorithm as a randomized seeding technique to choose the first values for the clusters. CluStream: micro-clusters are temporal extensions of cluster feature vectors. The micro-clusters are stored at snapshots in time following a pyramidal pattern. ClusTree: It is a parameter free algorithm automatically adapting to the speed of the stream and it is capable of detecting concept drift, novelty, and outliers in the stream. DenStream: uses dense micro-clusters (named core-micro-cluster) to summarize clusters. To maintain and distinguish the potential clusters and outliers, this method presents core-micro-cluster and outlier micro-cluster structures. D-Stream: maps each input data record into a grid and it computes the grid density. The grids are clustered based on the density. This algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. CobWeb: uses a classification tree. Each node in a classification tree represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value distributions of objects classified under the node Cisco Public 26

Example: Stream Classification Hoeffding Tree (VFDT) incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time Half-Space Trees ensemble model that randomly spits data into half spaces. They are created online and detect anomalies by their deviations in placement within the forest relative to other data from the same window Cisco Public 27

Examples: Outlier Detection[10] Median Absolute Deviation: Telemetry is anomalous if the deviation of its latest datapoint with respect to the median is X times larger than the median of deviations Standard Deviation from Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the average. Standard Deviation from Moving Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoints minus the moving average is greater than three standard deviations of the moving average. Cisco Public 28

Examples: Outlier Detection [10] Mean Subtraction Cumulation: Telemetry is anomalous if the value of the next datapoint in the series is farther than three standard deviations out in cumulative terms after subtracting the mean from each data point Least Squares: Telemetry is anomalous if the average of the last three datapoints on a projected least squares model is greater than three sigma Histogram Bins: Telemetry is anomalous if the average of the last three datapoints falls into a histogram bin with less than x Cisco Public 29

Offline Models Stream Processing Stream Bolt Result Batch Workflow Stream Stream Hadoop Batch JOB Model Cisco Public 30

Offline Algorithms Hypothesis Tests Chi2 Test (Goodness of Fit): A feature is anomalous if it the data for the latest micro batch (for the last 10 minutes) comes from a different distribution than the historical distribution for that feature Grubbs Test: telemetry is anomalous if the Z score is greater than the Grubb's score. Kolmogorov-Smirnov Test: check if data distribution for last 10 minutes is different from last hour Simple Outliers test: telemetry is anomalous if the number of outliers for the last 10 minutes is statistically different then the historical number of outliers for that time frame Decision Trees/Random Forests Association Rules (Apriori) BIRCH/DBSCAN Clustering Auto Regressive (AR) Moving Average (MA) Cisco Public 31

Approximation for Streaming Cisco Public 32

Approximation for Streaming Skip Lists Trade memory for lookup time increase Hierarchical layered indexing on top of lists Lookup improvement: O(n) -> O(log n) Best Uses: Search for element in a list Approximate the number of distinct elements in a list Cisco Public 33

Approximation for Streaming HyperLogLog Turns each feature to a hash value Keeps track of the longest run of a feature Approximates the number of occurrences of a feature based on it s longest runs Best Uses: Estimate how rare an occurrence of a feature is Approximate count of distinct objects over time Estimate entropy of a data set Cisco Public 34

Approximation for Streaming Bloom Filters Trade memory for fast set membership check Hash all incoming elements to hash sets Use multiple hash functions and multiple hash sets of varying sizes Best Uses: Check element membership in a set Cisco Public 35

Approximation for Streaming Sketches Top-K: count of top elements in the stream Count-Min: histograms over large feature sets (similar to Bloom Filters with counts) Digest algorithm for computing approximate quantiles on a collection of integers Cisco Public 36

@ProjectOpenSOC https://github.com/opensoc Cisco Public 37

Thank you.