Combat Sophisticated Threats



Similar documents
Cisco OpenSOC Hadoop Design with Bro

HDP Hadoop From concept to deployment.

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

Presented by: Aaron Bossert, Cray Inc. Network Security Analytics, HPC Platforms, Hadoop, and Graphs Oh, My

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

What s New in Security Analytics Be the Hunter.. Not the Hunted

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

SEIZE THE DATA SEIZE THE DATA. 2015

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Comprehensive Analytics on the Hortonworks Data Platform

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

ANALYTICS CENTER LEARNING PROGRAM

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

The 4 Pillars of Technosoft s Big Data Practice

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Towards Smart and Intelligent SDN Controller

Internet of Things. Opportunity Challenges Solutions

Upcoming Announcements

locuz.com Big Data Services

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

HADOOP. Revised 10/19/2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Oracle Big Data SQL Technical Update

BIG DATA What it is and how to use?

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

BIG DATA ANALYTICS For REAL TIME SYSTEM

HDP Enabling the Modern Data Architecture

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

Transforming the Telecoms Business using Big Data and Analytics

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Architectures for massive data management

Getting the Most Out of SIEM. Presentation Title. Data in Big Data. Presented By: Dr. Char Sample, CERT

PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

The basic data mining algorithms introduced may be enhanced in a number of ways.

LOG AND EVENT MANAGEMENT FOR SECURITY AND COMPLIANCE

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

Hadoop Ecosystem B Y R A H I M A.

Advanced In-Database Analytics

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Creating Big Data Applications with Spring XD

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

LOG INTELLIGENCE FOR SECURITY AND COMPLIANCE

The session is about to commence. Please switch your phone to silent!

Big Data & Security. Aljosa Pasic 12/02/2015

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Detect & Investigate Threats. OVERVIEW

The Next Big Thing in the Internet of Things: Real-time Big Data Analytics

TORNADO Solution for Telecom Vertical

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Sunnie Chung. Cleveland State University

Big Data and Advanced Analytics Technologies for the Smart Grid

The Internet of Things and Big Data: Intro

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

Streamdrill: Analyzing Big Data Streams in Realtime

Openbus Documentation

SIMPLE MACHINE HEURISTIC INTELLIGENT AGENT FRAMEWORK

TIBCO Cyber Security Platform. Atif Chaughtai

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Time-Series Databases and Machine Learning

Integrating a Big Data Platform into Government:

Advanced SOC Design. Next Generation Security Operations. Shane Harsch Senior Solutions Principal, MBA GCED CISSP RSA

CRITEO INTERNSHIP PROGRAM 2015/2016

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Advanced Big Data Analytics with R and Hadoop

Where is... How do I get to...

Big Data and Market Surveillance. April 28, 2014

Real Time Big Data Processing

Streaming Analytics A Framework for Innovation

Kafka & Redis for Big Data Solutions

Big Data Management and Security

Case Study: Instrumenting a Network for NetFlow Security Visualization Tools

Security Analytics for Smart Grid

WHITE PAPER SPLUNK SOFTWARE AS A SIEM

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data and Industrial Internet

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

HiBench Introduction. Carson Wang Software & Services Group

The Future of Data Management

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

Big Data. Fast Forward. Putting data to productive use

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Big Data Patterns. Ron Bodkin Founder and President, Think Big

Infrastructures for big data

Transcription:

Combat Sophisticated Threats How Big Data and OpenSOC Could Help SESSION ID: SEC-T11 James Sirota Big Data Architect/Data Scientist Cisco Security Solutions Practice @JamesSirota

In the next few minutes Introduction to Data-Driven Security Overview of OpenSOC Overview of the Analytics Pipeline Survey of Algorithms used by OpenSOC Probabilistic Structures for Stream Estimation Q&A

Introduction to Data-Driven Security

Traditional Approach to Security Reactive Security Model [1,2,3,4,5] 80% of spending on perimeter defenses Average 40-50 security tools per organization Collectively generate ~20,000 events per second Tools are highly specialized 80% of alerts require additional follow-up

Traditional Approach to Security Investigation and Incident Response [5,6] 66% take an hour or longer to identify 91% take a day or longer to discover root cause 81% take a day or longer to fix 84% of fixes take a week or longer to validate

Where things break down Challenges with perimeter defense [2] Most organizations support BYOD 47% of organizations use cloud services Global Workplace Challenges with Point Tools [4, 5] Too many tools and manual processes Too many false positive alerts Lack of integrated architecture Not enough data sources

Where things break down Challenges with Adding Data Sources Volume: from Terabytes to Zettabytes Variety: structured to unstructured Velocity: everything is a sensor, generates logs Veracity: data quality issues Value: What to keep? Purge? Summarize? Skills Challenge [4,6,7] Staff spend ~40% less time on tasks than ideal 35% of organizations use contractors

Cisco Approach to Security Data-Driven Security Model Moving from reactive to predictive Moving from point tools to a unified platform Shifting skills from specialists to data scientists Emphasis on quality and not quantity of alerts Emphasis on closed-loop analytics

Overview of OpenSOC

Introducing OpenSOC Supports Data-Driven Security Model Unified platform for ingest, storage, analytics Provides multiple views/access patterns for data Interactive Analytics and Predictive Modeling Provides contextual real-time alerts Rapid deployment and scoring

Emergence of Big Data Systems that scale out, not up Parallel and scalable computation tools Cheap, massively-scalable storage Stream computation + stream analysis Scaling + approximation

Technology Behind OpenSOC Telemetry Capture Layer: Apache Flume Data Bus: Apache Kafka Stream Processor: Apache Storm Real-Time Index and Search: Elastic Search Long-Term Data Store: Apache Hive Long-Term Packet Store: Apache Hbase Visualization Platform: Kibana

Introducing OpenSOC Intersection of Big Data and Security Analytics Real-Time Alerts Anomaly Detection Data Correlation Hadoop Scalable Compute Multi Petabyte Storage Interactive Query Real-Time Search OpenSOC Big Data Platform Rules and Reports Predictive Modeling UI and Applications Unstructured Data Scalable Stream Processing Data Access Control

OpenSOC in a Nutshell Threat Intelligence Feeds Raw Network Stream Network Metadata Stream Netflow Syslog Raw Application Logs Parse + Format Enrich Alert Log Mining and Analytics Elastic Search Applications + Analyst Tools Network Packet Mining and PCAP Reconstructio n HBase Big Data Exploration, Predictive Modeling Hive Other Streaming Telemetry Real-Time Index Raw Packet Store Long-Term Store Enrichment Data

OpenSOC Journey Sept 2014 General Availability May 2014 CR Work off April 2014 Sept 2013 Dec 2013 Hortonworks joins the project March 2014 Platform development finished First beta test at customer site First Prototype

Overview of the Analytics Pipeline

Analytics Pipeline Flume Kafka Storm OpenSOC-Streaming RAW Transform Enrich Alert (Rules-Based) Big Data Stores Elastic Search Real-Time Index and Search HIVE Long-Term Data Store OpenSOC-Aggregation Filter Aggregators HBase Windowed Rollup Store Enriched OpenSOC-ML Router Model 1 Model 2 Scorer HIVE Alerts Store Model n Alerts UI UI UI SOC Alert Consumers UI UI Web Services Secure Gateway Services External Alert Consumers Remedy Ticketing System

OpenSOC-Streaming KAFKA TOPIC: RAW [TIMESTAMP] Log message: "This [TIMESTAMP] is a test log Log message", message: generated "This [TIMESTAMP] is a from test log Log message", message: mydomain.com, generated "This is a from test log message", 10.20.30.40:123 mydomain.com, generated from -> 20.30.40.50:456 10.20.30.40:123 mydomain.com, {TCP} -> 20.30.40.50:456 10.20.30.40:123 {TCP} -> 20.30.40.50:456 {TCP} OpenSOC-Streaming 1:1 KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test 10.20.30.40, log message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test 10.20.30.40, log message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}}

Streaming Topology Message Stream Shuffle Grouping Parsers Enrich Geo Enrich Whois Control Stream ALL Grouping Parser Enrich Enrich KAFKA Topic: Enriched Kafka Spout MESSAGE Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Enrich Enrich KAFKA Topic: Enriched Parser Plugin Geo Plug-In Whois Plug-In SQL Store K-V Store

OpenSOC-Aggregation KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test 10.20.30.40, log message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test 10.20.30.40, log message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} 1:* OpenSOC-Aggregation RULES OpenSOC-Streaming HBASE yyyy:mm:dd:hh:[bucket_id]: { total-messages:1003 protocol-tcp:305 protocol-udp:201 city-from-phoenix:11 city-from-sanjose:405 country-from-usa:10 country-from-ca:11 ip-from-sketch(a):301 ip-from-sketch(n):55 port-from-sketch(a):44 port-from-sketch(n):12 }

Aggregation Topology Message Stream Shuffle Grouping Control Stream ALL Grouping Filters Filter Pulse Spout KEY Aggregators Aggreg ator HBase KEY:CF:COUNT TIMER Kafka Spout MESSAGE Filter Filter TUPLE TUPLE TUPLE TUPLE Aggreg ator Aggreg ator HBase KEY:CF:COUNT HBase KEY:CF:COUNT Filter Aggreg ator HBase KEY:CF:COUNT MAPPER FILTER AGGREGAT E

HBase Aggregation Table Key Total- Messages Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature n 2004-01-12-15-1 2004-01-12-15-2 34 5 0 0 14 4 22 30 6 2 0 11 5 24 2004-01-12-15-3 2004-01-12-15-4 2004-01-12-16-1 33 4 1 1 12 2 20 31 5 0 0 14 4 21 30 2 4 2 10 5 24.

OpenSOC-ML KAFKA TOPIC: ENRICHED { msg_content : This is a test log message, processed: { msg_content : [TIMESTAMP], This is src_ip: a test 10.20.30.40, log message, src_port: 123, dest_ip: 20.30.40.50, dest_port: processed: { msg_content : 456, [TIMESTAMP], domain: This mydomain.com, is src_ip: a test 10.20.30.40, log message, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: processed: 456, [TIMESTAMP], domain: mydomain.com, src_ip: 10.20.30.40, protocol:"tcp" src_port: 123, dest_ip: 20.30.40.50, geo_enrichment:{ dest_port: 456, source:{ domain: mydomain.com, protocol:"tcp" geo_enrichment:{ source:{ postalcode: 95134,areaCode: 408,metroCode: 807,longitude:-121.946, source:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ postalcode: latitude:37.425,locid:4522,city:"san 95134,areaCode: 408,metroCode: Jose",country:"US"} 807,longitude:-121.946, dest:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ latitude:37.425,locid:4522,city:"san postalcode: 95134,areaCode: 408,metroCode: Jose",country:"US"}} 807,longitude:-121.946, whois_enrichment:{ source: { latitude:37.425,locid:4522,city:"san Jose",country:"US"}} whois_enrichment:{ source: OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { source: OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", { Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"} West Tasman Drive", Systems", dest: { "NetType":"Direct "Address":"170 Assignment"} West Tasman Drive", dest: { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", "NetType":"Direct Assignment"} dest: "OrgAbuseName":"Cisco { OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco OrgId":"CISCOS","Parent":"NET-144-0-0-0-0", Systems Inc", Systems", "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco "OrgAbuseName":"Cisco West Tasman Drive", Systems Inc", Systems", "NetType":"Direct "Address":"170 "RegDate":"1991-01-171991-01-17","OrgName":"Cisco Assignment"}}} West Tasman Drive", Systems", "NetType":"Direct "Address":"170 Assignment"}}} West Tasman Drive", "NetType":"Direct Assignment"}}} OpenSOC-Streaming KAFKA OpenSOC-ML *:* alert: { } anomaly-type: irregular pattern, models-triggered: [m1,m2,m4]

ML Topology Pulse Spout Event ID Online Models Model Event ID TIMER Model Kafka Router Spout MESSAGE Offline Models Scorer Alert KAFKA Topic: Alerts Message Stream ALL Grouping Control Stream ALL Grouping Model Model Eval Algo HBase MR JOB Spark JOB HIVE Long-Term Data Store

Survey of Algorithms Used by OpenSOC

Types of Algorithms Offline analyze entire data set at once Generally a good fit for Apache Hadoop/Map Reduce Model compiled via batch, scored via stream processor Online start with an initial state and analyze each piece of data serially one at a time Generally require a chain of Map Reduce jobs Good fit for Apache Spark, Tez, Storm Primarily batch, good for Lambda architectures Online Streaming real-time, in-memory, limited analysis Generally a good fit for Apache Storm, Spark-Streaming Probabilistic, random structures, approximations

Online Models Stream Processing Bolt Bolt Stream Bolt Bolt Bolt Bolt Result Batch Workflow Hadoop

Examples: Stream Clustering StreamKM++: computes a small weighted sample of the data stream and it uses the k- means++ algorithm as a randomized seeding technique to choose the first values for the clusters. CluStream: micro-clusters are temporal extensions of cluster feature vectors. The microclusters are stored at snapshots in time following a pyramidal pattern. ClusTree: It is a parameter free algorithm automatically adapting to the speed of the stream and it is capable of detecting concept drift, novelty, and outliers in the stream. DenStream: uses dense micro-clusters (named core-micro-cluster) to summarize clusters. To maintain and distinguish the potential clusters and outliers, this method presents coremicro-cluster and outlier micro-cluster structures. D-Stream: maps each input data record into a grid and it computes the grid density. The grids are clustered based on the density. This algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. CobWeb: uses a classification tree. Each node in a classification tree represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value distributions of objects classified under the node 28

Example: Stream Classification Hoeffding Tree (VFDT) incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time Half-Space Trees ensemble model that randomly spits data into half spaces. They are created online and detect anomalies by their deviations in placement within the forest relative to other data from the same window 29

Examples: Outlier Detection [10] Median Absolute Deviation: Telemetry is anomalous if the deviation of its latest datapoint with respect to the median is X times larger than the median of deviations Standard Deviation from Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the average. Standard Deviation from Moving Average: Telemetry is anomalous if the absolute value of the average of the latest three datapoints minus the moving average is greater than three standard deviations of the moving average.

Examples: Outlier Detection [10] Mean Subtraction Cumulation: Telemetry is anomalous if the value of the next datapoint in the series is farther than three standard deviations out in cumulative terms after subtracting the mean from each data point Least Squares: Telemetry is anomalous if the average of the last three datapoints on a projected least squares model is greater than three sigma Histogram Bins: Telemetry is anomalous if the average of the last three datapoints falls into a histogram bin with less than x 31

Offline Models Stream Processing Stream Bolt Result Batch Workflow Stream Stream Stream Hadoop Batch JOB Model

Offline Algorithms Hypothesis Tests Chi2 Test (Goodness of Fit): A feature is anomalous if it the data for the latest micro batch (for the last 10 minutes) comes from a different distribution than the historical distribution for that feature Grubbs Test: telemetry is anomalous if the Z score is greater than the Grubb's score. Kolmogorov-Smirnov Test: check if data distribution for last 10 minutes is different from last hour Simple Outliers test: telemetry is anomalous if the number of outliers for the last 10 minutes is statistically different then the historical number of outliers for that time frame Decision Trees/Random Forests Association Rules (Apriori) BIRCH/DBSCAN Clustering Auto Regressive (AR) Moving Average (MA)

Approximation for Streaming

Approximation for Streaming Skip Lists Trade memory for lookup time increase Hierarchical layered indexing on top of lists Lookup improvement: O(n) -> O(log n) Best Uses: Search for element in a list Approximate the number of distinct elements in a list

Approximation for Streaming HyperLogLog Turns each feature to a hash value Keeps track of the longest run of a feature Approximates the number of occurrences of a feature based on it s longest runs Best Uses: Estimate how rare an occurrence of a feature is Approximate count of distinct objects over time Estimate entropy of a data set

Approximation for Streaming Bloom Filters Trade memory for fast set membership check Hash all incoming elements to hash sets Use multiple hash functions and multiple hash sets of varying sizes Best Uses: Check element membership in a set

Approximation for Streaming Sketches Top-K: count of top elements in the stream Count-Min: histograms over large feature sets (similar to Bloom Filters with counts) Digest algorithm for computing approximate quantiles on a collection of integers

Thank You Follow us on Twitter @ProjectOpenSOC MTD Data Science Team James Sirota @JamesSirota Sam Davis @samdavis510 Nate Bitting @nbitting 39

Sources [1] RSA Incident Response Teams are the New (Security) Black https://blogs.rsa.com/incident-response-teams-newsecurity-black/ [2] PWC - The Global State of Information Security Survey 2014 http://download.pwc.com/ie/pubs/2013_key_findings_from_the_global_state_of_information_security_survey_2014.pdf [3] Tripwire 2014 IT SECURITY BUDGET FORECAST ROUNDUP FOR CIOs AND CISOs/CSOs https://www.akatt.com/wp-content/uploads/2014/01/it-security-budget-forecast-roundup-2014-for-cios-and-cso-cisos.pdf [4] NetworkWorld: Real-Time Big Data Security Analytics for Incident Detection http://www.networkworld.com/article/2225959/cisco-subnet/real-time-big-data-security-analytics-for-incidentdetection.html [5] CAS Static Analysis Tool Study http://samate.nist.gov/docs/cas_2011_sa_tool_method.pdf [6] TIBCO Cyber Security Platform http://www.tibco.com/multimedia/wp-cyber-security-platform_tcm8-16777.pdf [7] Reuters http://www.reuters.com/article/2012/06/13/us-media-tech-summit-symantec-idusbre85b1e220120613 [8] Staffing the Informaton Security Organization https://vostrom.com/get/infosec_staffing.pdf [9] SBT Global Survey us.westcon.com/documents/.../stbglobalsurveywpen11nov2013web.pdf [10] Etsy Kale http://codeascraft.com/2013/06/11/introducing-kale/ 40