SGT Technology Innovation Center
Dasvis Project
12 March 2015
2015 SGT Inc.
Rohit Mital, Jay Ellis, Ashton Webster, Grant Orndorff
www.sgt-inc.com
Introduction
- About the SGT Technology Innovation Center
- Genesis of the Dasvis project
Purpose
Project goals:
- Develop a real-time distributed processing framework for big data
- Determine how tools like Dasvis (built upon this framework) fit alongside other tools in the marketplace
- Design and develop a tool suite that complements SGT's Cyber Security capabilities, to help ensure the security of SGT and our infrastructure
Dasvis is designed to be a customizable network monitoring tool:
- Will mirror the capabilities of standard SIEM, network IDS/IPS, and other tools
- Can accept a variety of inputs
Data Exfiltration in the News
Sony Pictures Entertainment (2014)
- Attack on Sony by the Guardians of Peace (with suspected nation-state involvement) in retaliation for the release of the movie The Interview
- Exfiltration of PII of Sony employees and family members, emails, executive salaries, and previously unreleased Sony movies
- Elimination of a wide-scale theatrical movie release
Edward Snowden (2013)
- Former NSA contractor and CIA/DIA employee who released thousands of classified documents about the NSA's global surveillance programs
- Charged by the US DOJ with espionage and theft of government property (charges carrying a potential 30-year sentence); currently living in Russia
WikiLeaks (2006 to present)
- 1.2 million documents published in the first year after the website's launch
- Material provided to WikiLeaks' founder by PFC Manning (currently serving a 35-year prison term) is considered the largest leak of classified information in history, including:
  - 500,000+ US Army reports (Afghan and Iraq War Logs)
  - 250,000+ unredacted US State Department cables
Real-World Applications
- Large-scale data exfiltration from both the government and commercial sectors is becoming all too common
- Loss of sensitive and classified data is occurring at the hands of both corporations and nation-states
- This indicates a need for companies to monitor network and/or user activity to protect against these types of threats
- Tools and frameworks are needed to process the volume of information necessary to thwart these types of attacks
System Architecture
- Cloud-based
- Real-time distributed processing framework
- Developed using standard, open-source tools with an available labor pool to support future maintenance and expansion
- Designed with flexibility and portability in mind
Dasvis Architecture / Tools
Configuration processing:
- Apache 2.4 web server
Capture and processing:
- Packet captures: Pcap4j
- Data transfer: Apache Kafka queue
- Distributed/real-time processing: Apache Storm/Trident
Data storage (NoSQL databases):
- Primary packet store: MongoDB
- Aggregate/time-series DB: CubeDB
Reporting/graphing:
- Apache 2.4 web server
- PHP web framework: Laravel
- Graphing/visualizations: Google Visualizations
Post processing (future):
- Integration with HDFS/Hadoop, with queries using HQL
Apache Kafka Queue
- Kafka is a distributed messaging system used to transfer large amounts of data between processes
- It is a queue, with producers and consumers:
  - Producers push data to a Kafka queue
  - Consumers pull data from a Kafka queue
- Essentially a reliable way to send big data from one place to another in virtually any format
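The producer/consumer pattern above can be sketched with Python's standard-library queue standing in for a Kafka topic. This is a conceptual illustration only, not the real Kafka client API (Dasvis uses the actual Apache Kafka queue); all names and the sample packet are illustrative.

```python
# Conceptual sketch of the producer/consumer pattern behind a Kafka queue.
# A stdlib queue.Queue stands in for a Kafka topic; JSON strings stand in
# for the serialized messages Kafka moves between processes.
import json
import queue

packet_queue = queue.Queue()  # stands in for a Kafka topic

def produce(packets):
    """Producer: push serialized packets onto the queue."""
    for pkt in packets:
        packet_queue.put(json.dumps(pkt))

def consume():
    """Consumer: pull and deserialize packets until the queue is empty."""
    received = []
    while not packet_queue.empty():
        received.append(json.loads(packet_queue.get()))
    return received

produce([{"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 1500}])
print(consume())  # -> [{'src': '10.0.0.1', 'dst': '10.0.0.2', 'bytes': 1500}]
```

In the real system the producer is the packet-capture process and the consumer is the Storm topology, with Kafka providing durability and distribution that a local queue does not.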
MongoDB and CubeDB
- MongoDB is a NoSQL database
  - Has collections (analogous to tables in SQL) that can accept documents of varying structures
  - Documents use JavaScript Object Notation (JSON), a more flexible format than rows in SQL
  - Unlike other databases (e.g. MySQL), it does not require every inserted object to have the exact same structure/schema
- CubeDB is a time-series database that sits on top of MongoDB
  - A time-series database is highly optimized for queries based on time of insertion
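The two properties above can be illustrated in a few lines: a schema-free "collection" that accepts documents of differing shapes, and a time-range query of the kind a time-series layer like CubeDB optimizes. This is a stdlib sketch, not the real MongoDB/CubeDB API; the field names are assumptions.

```python
# Illustrative sketch (not the real MongoDB/CubeDB API): a "collection"
# that, like MongoDB, enforces no fixed schema, plus a time-range query
# like those a time-series database is optimized for.
from datetime import datetime, timedelta

collection = []  # stands in for a MongoDB collection

def insert(doc, ts):
    """Insert a JSON-like document with its insertion timestamp."""
    collection.append({**doc, "ts": ts})

def query_range(start, end):
    """Time-series-style query: documents inserted in [start, end)."""
    return [d for d in collection if start <= d["ts"] < end]

t0 = datetime(2015, 3, 12, 9, 0)
insert({"src": "10.0.0.1", "bytes": 1500}, t0)                                 # one schema
insert({"alert": "port scan", "severity": "high"}, t0 + timedelta(minutes=5))  # another
print(len(query_range(t0, t0 + timedelta(minutes=10))))  # -> 2
```

Note how the two inserted documents share no fields; a relational table would reject this, while a MongoDB collection accepts it.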
Apache Storm/Trident
- Storm lets one process large amounts of data in real time by providing an abstraction for writing distributed processing programs:
  - Spout: a unit that creates a stream of data to be processed
  - Bolt: a unit that accepts a stream of data, performs an operation on it, and optionally passes on more data
  - Topology: a collection of spouts and bolts connected by the streams of data passed between them
- Storm bolts and spouts can be run as multiple tasks (threads), and even on different machines in parallel
- Trident is a further abstraction on top of Storm that handles the creation of spouts and bolts in what it deems the most efficient topology
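The spout/bolt/topology abstraction can be mimicked with plain Python generators: a spout emits a stream, each bolt transforms it, and wiring them together forms a topology. This is illustrative only; real Storm topologies are written against the Storm API, and the packet sizes and 100-byte threshold are arbitrary.

```python
# Minimal sketch of Storm's spout/bolt/topology abstraction using
# Python generators (illustrative only, not the Storm API).

def packet_spout():
    """Spout: creates a stream of tuples to be processed."""
    for size in [60, 1500, 40, 9000]:
        yield {"bytes": size}

def filter_bolt(stream):
    """Bolt: accepts a stream, drops packets below a size threshold,
    and passes the rest on."""
    for pkt in stream:
        if pkt["bytes"] >= 100:
            yield pkt

def count_bolt(stream):
    """Bolt: terminal operation that aggregates the stream."""
    return sum(1 for _ in stream)

# Topology: spout -> filter bolt -> count bolt
result = count_bolt(filter_bolt(packet_spout()))
print(result)  # -> 2
```

In Storm, each of these units would additionally run as multiple parallel tasks, possibly on different machines, with Storm routing the stream between them.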
How it All Fits Together
Dasvis Storm Topologies: Tracking and Comparing
- The Tracking Topology looks at incoming data and aggregates the data we want to track
  - Aggregated data is stored in the time-series database and sent to the Comparing Topology
- The Comparing Topology compares the incoming data to the baseline data to look for anomalies
(Diagram: raw data enters the Tracking Topology, which asks "Do we want to track this data?"; if not, the data is discarded, and if so, the incoming data is aggregated. The aggregated data flows to the Comparing Topology, which compares it to baseline data and emits comparison information.)
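The division of labor between the two topologies can be sketched end to end: a tracking stage that aggregates raw events into per-key counts, and a comparing stage that flags keys whose counts deviate from a baseline. The per-source counting, the 50% deviation tolerance, and all values are illustrative assumptions, not Dasvis's actual logic.

```python
# Sketch of the Tracking/Comparing split (stdlib only, illustrative):
# tracking aggregates raw events; comparing flags deviations from baseline.
from collections import Counter

def track(events):
    """Tracking Topology: aggregate raw events into counts per source."""
    return Counter(e["src"] for e in events)

def compare(aggregates, baseline, tolerance=0.5):
    """Comparing Topology: flag sources whose counts deviate from the
    baseline by more than the tolerance (an assumed 50% here)."""
    anomalies = []
    for src, count in aggregates.items():
        expected = baseline.get(src, 0)
        if expected == 0 or abs(count - expected) / expected > tolerance:
            anomalies.append(src)
    return anomalies

baseline = {"10.0.0.1": 100, "10.0.0.2": 100}
events = [{"src": "10.0.0.1"}] * 90 + [{"src": "10.0.0.2"}] * 400

print(compare(track(events), baseline))  # -> ['10.0.0.2']
```

Here 10.0.0.1's count (90) is within tolerance of its baseline (100), while 10.0.0.2's count (400) is not, so only the latter is reported as an anomaly.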
A Closer Look at the Tracking Topology
Spouts and bolts make for simple programming abstractions:
- Spouts start the data processing
- Bolts are operations on those packets
Data flow through the topology:
- Packet Spout: packet is retrieved from the Kafka queue
- Packet Parse: packet is parsed to JSON
- Packet Match: packet is matched against configurations
- Packet Aggregation: packet is aggregated over time with other packets
- Single Insertion: packet is inserted into MongoDB
- Aggregate Insertion: packet aggregate data is stored in the time-series database
- Aggregate Forward: aggregated packets are sent to the Comparing Topology via the Kafka queue
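The "Packet Match" step above can be sketched as matching a parsed packet against user-supplied tracking configurations. The configuration format (field/value pairs) and all field names are hypothetical, introduced only to illustrate the idea of deciding whether an incoming packet is one we want to track.

```python
# Hypothetical sketch of the "Packet Match" bolt: decide whether a parsed
# packet matches any tracking configuration. The config format and field
# names are illustrative assumptions, not taken from Dasvis.

def matches(packet, config):
    """A packet matches a configuration if every configured field agrees."""
    return all(packet.get(field) == value for field, value in config.items())

configs = [
    {"protocol": "TCP", "dst_port": 22},  # track SSH traffic
    {"protocol": "UDP", "dst_port": 53},  # track DNS traffic
]

packet = {"protocol": "TCP", "src": "10.0.0.1", "dst_port": 22}
tracked = any(matches(packet, c) for c in configs)
print(tracked)  # -> True
```

Packets matching no configuration would be discarded at this point, before the aggregation and insertion steps.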
A Closer Look at the Tracking Topology
- Bolts can run as multiple tasks
- Tasks can be thought of as threads
(Diagram: each bolt in the tracking topology — Packet Parse, Packet Match, Packet Aggregation, Single Insertion, Aggregate Insertion, Aggregate Forward — running as multiple tasks)
A Closer Look at the Tracking Topology
- Bolts can run on multiple nodes in a cluster
- Each bolt can still run as multiple tasks
- This greatly improves performance
(Diagram: the tracking topology's spout and bolts, with their tasks, distributed across five nodes)
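The task model above — one bolt running as several parallel threads — can be sketched with a thread pool applying a "parse" bolt to a batch of records. This illustrates the idea of tasks, not Storm's actual scheduler or its cross-node distribution; the record format is an assumption.

```python
# Sketch of running one bolt as multiple tasks (threads). A thread pool
# applies the parse bolt to each raw record in parallel; illustrative
# only, not Storm's scheduler.
from concurrent.futures import ThreadPoolExecutor

def parse_bolt(raw):
    """One bolt task: turn a raw record into a structured packet."""
    src, size = raw.split(",")
    return {"src": src, "bytes": int(size)}

raw_packets = ["10.0.0.1,1500", "10.0.0.2,60", "10.0.0.3,9000"]

# Run the bolt as 3 parallel tasks; pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parsed = list(pool.map(parse_bolt, raw_packets))

print(parsed[2])  # -> {'src': '10.0.0.3', 'bytes': 9000}
```

Storm extends the same idea across machines: tasks of the same bolt can live on different nodes, with the stream partitioned among them.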
Episodes and Baseline Data
- Baseline data is the data that represents what the incoming data to Dasvis should look like
  - If the incoming data is significantly different from the baseline data, then we have an anomaly
- An Episode is a set of baseline data associated with a set of conditions
  - This allows the user to have different sets of baseline data for different times
(Diagram: episodes of baseline data alongside the normal baseline data)
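Episode selection can be sketched as pairing each set of baseline data with a condition on the current time, falling back to the normal baseline when no episode applies. The specific conditions (weekends, business hours) and baseline values are assumptions for illustration, not Dasvis defaults.

```python
# Illustrative sketch of episodes: each episode pairs a condition with its
# own baseline data; the first matching episode supplies the baseline,
# otherwise the normal baseline is used. All values are assumptions.
from datetime import datetime

episodes = [
    # (condition on the current time, baseline data for that episode)
    (lambda t: t.weekday() >= 5,  {"packets_per_min": 50}),    # weekends
    (lambda t: 9 <= t.hour < 17,  {"packets_per_min": 2000}),  # business hours
]
normal_baseline = {"packets_per_min": 400}

def baseline_for(now):
    """Pick the baseline from the first matching episode, else the normal one."""
    for condition, baseline in episodes:
        if condition(now):
            return baseline
    return normal_baseline

# Thursday, 10:30 -> the business-hours episode applies
print(baseline_for(datetime(2015, 3, 12, 10, 30)))  # -> {'packets_per_min': 2000}
```

With episodes in place, heavy weekday-morning traffic is judged against the business-hours baseline rather than the quiet weekend one, avoiding false anomalies.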
Review of Dasvis-Specific Concepts
- Tracking vs. Comparing Topologies
  - The Tracking Topology records and aggregates the incoming data we want to track
  - The Comparing Topology decides whether there are anomalies in incoming data by comparing it against baseline data
- Baseline Data
  - Past data aggregated by Dasvis that represents the normal distribution of data
- Episode
  - A set of baseline data that is only used at specific times (e.g. only on Mondays, or only during business hours)
Demo
- Mini tutorial
  - Creating a baseline
  - Setting baseline data
- Example scenario and expected output
  - Normal data that matches the baseline well
  - Potentially malicious activity
Summary
Challenges / Issues:
- Need to clarify the current use of open-source tools and the potential costs of deploying Dasvis as a COTS product
Future plans:
- Adding new inputs such as NetFlow, application logs, etc., in addition to packet capture
- Adherence to the NIST Cyber Security Situational Awareness specification
Comments/Questions?
Your Feedback is Appreciated!