Forensic Clusters: Advanced Processing with Open Source Software Jon Stewart Geoff Black
Who We Are
Jon Stewart and Geoff Black: both Lightbox Technologies, both Guidance Software alums. Between us: Mr. EnScript, C++ & Java development, Fortune 100 financial, NCIS (DDK/ManTech), examiner work, Louisiana DOJ, ACEDS, Python, EnScript, JavaScript, C#, Linux, Mac, search, PSTs, NSFs, one pessimist, one optimist. Like many of you, we've dealt with a lot of data.
[Chart: CPU clock speed over time]
[Chart: hard drive capacity over time]
The Problem
CPU clock speed stopped doubling; hard drive capacity kept doubling.
Multicore CPUs to the rescue!
...but they're wasted on single-threaded apps
...and hard drive transfer speeds might be too slow for 24-core machines (depending)
Solution: Distributed Processing
Split the data up, process it in parallel, scale out as needed.
Sounds great!
I know, let's use Oracle!...
Storage quickly becomes the bottleneck $$$$$$$
These guys have a lot of data... how do they do their processing?
What about these guys? http://wiki.apache.org/incubator/accumuloproposal
Apache Hadoop Apache Hadoop is an open source software suite for distributed processing and storage, primarily inspired by Google's in-house systems. http://hadoop.apache.org
The Foundation
Apache Hadoop was created at Yahoo!, based on Google's MapReduce and Google File System papers.
Apache HBase is based on Google BigTable.
Apache Hadoop: Scalable, Fault-tolerant, Open Source Hadoop Distributed File System (HDFS) MapReduce Framework Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/
HDFS
Just like any other file system, but better.
Metadata master server (NameNode): one server acts as the FAT/MFT and knows where files & blocks live on the cluster.
Blocks live on separate machines (DataNodes): automatic replication of blocks (default 3x), all blocks have checksums, rack-aware placement. DataNodes form a p2p network; clients contact them directly to read and write file data.
Not general purpose: files are read-only once written; optimized for streaming reads/writes of large files.
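The checksum idea is simple: HDFS checksums fixed-size chunks of each block (512 bytes by default, using CRC32) and re-verifies them on every read. A toy sketch of that detect-on-read behavior (the function names are ours, not Hadoop's):

```python
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS's default chunk size for checksumming

def chunk_checksums(block: bytes, chunk_size: int = BYTES_PER_CHECKSUM):
    """Compute one CRC32 per fixed-size chunk of a block."""
    return [zlib.crc32(block[i:i + chunk_size])
            for i in range(0, len(block), chunk_size)]

def verify(block: bytes, checksums) -> bool:
    """On read, re-hash each chunk and compare against the stored CRCs."""
    return chunk_checksums(block) == checksums

block = b"x" * 1300                     # a toy "block": three 512-byte chunks
sums = chunk_checksums(block)
corrupted = block[:600] + b"?" + block[601:]
# verify(block, sums) is True; verify(corrupted, sums) is False
```

If a DataNode serves a corrupted chunk, the mismatch is detected and the client can fetch one of the other replicas instead.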
MapReduce
Bring the application to the data: code is sent to every node in the cluster, and local data is then processed as "tasks".
Support for custom file formats.
Failed tasks are restarted elsewhere, or skipped.
One node can process several tasks at once; idle CPUs are bad.
Output is automatically collated.
Parallelism is transparent to the programmer.
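The map/collate/reduce contract can be sketched in-process with the classic word count. This toy version runs everything in one Python process; the shuffle step here is the collation Hadoop performs for you across the network:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for each word in one input record."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """The framework groups all emitted values by key (automatic collation)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: combine the grouped values for each key."""
    return {word: sum(values) for word, values in groups.items()}

lines = ["hadoop brings code to data", "code to data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
# counts["code"] == 2 and counts["data"] == 2
```

In a real job the mappers run on the nodes holding the input blocks, and only the much smaller (word, count) pairs move over the network.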
MapReduce Example
Yahoo! has over 500M unique users/month and billions of transactions per day.
Search Assist (type-ahead) is based on years of log data.
Before Hadoop: 26 days of processing time. After Hadoop: 20 minutes.
HBase
Random access, updates, versioning.

Conventional RDBMS (MySQL):
Limited throughput; not distributed.
Fields always allocated (varchar[255]).
Strict schemas; ACID integrity/transactions.
Support for joins; indices speed things up.

HBase:
High write throughput; distributed automatically.
Null values not stored; a row is a Name:Value list; schemas not imposed.
Rows sorted by key; good at table scans.
No joins, no transactions; fields are versioned.
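The right-hand column's data model (sorted row keys, sparse Name:Value rows, versioned cells) can be illustrated with a toy in-memory table. ToyTable and its method names are ours for illustration, not HBase's API:

```python
import bisect
from collections import defaultdict

class ToyTable:
    """Sketch of HBase's model: row keys kept sorted for range scans; each
    row is a sparse column -> [(timestamp, value), ...] list, so absent
    columns cost nothing and old cell versions are retained."""

    def __init__(self):
        self._keys = []                                   # sorted row keys
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, ts):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)            # keep scan order
        self._rows[row_key][column].append((ts, value))   # versions accumulate

    def get(self, row_key, column):
        versions = self._rows.get(row_key, {}).get(column, [])
        return max(versions)[1] if versions else None     # newest version wins

    def scan(self, start, stop):
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]                          # cheap range scan

t = ToyTable()
t.put("email:001", "meta:from", "alice@example.com", ts=1)
t.put("email:001", "meta:from", "alice@corp.example", ts=2)  # newer version
t.put("email:010", "meta:subject", "hello", ts=1)
```

Because rows are sorted by key, choosing good row keys (e.g. prefixing children with their parent's key) is what makes table scans cheap; that design pressure is the opposite of an RDBMS, where indices do the work.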
HBase Auto-Sharding
[Diagram: a logical table's rows (A through O) are split by key range into regions; regions are distributed across region servers, and the client is routed to the server holding the key it wants.]
Source: http://www.slideshare.net/cloudera/hadoop-world-2011-advanced-hbase-schema-design-lars-george-cloudera
So, let's do forensics on Hadoop!
We'd done some work on proof-of-concept and design.
The Army Intelligence Center of Excellence funded an open source prototype effort, in collaboration with 42six Solutions, Basis Technology, and Dapper Vision.
Open Source Stack for Distributed Forensic Processing
Processing: fsrip, plugins, MRCoffee.
Document analysis: Dapper Vision's Picarus.
Storage & job management: Apache Hadoop + HBase.
Libraries: The Sleuth Kit.
Pipeline: Ingest → Processing → Review
Ingest: File Extraction / Hash Calculation
dd/E01 images → fsrip (built on The Sleuth Kit) → md5sum/sha1sum
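Ingest-time hashing is a single streaming pass over each extracted file. A minimal sketch that computes MD5 and SHA-1 together so the data is read only once (hash_stream is an illustrative name, not fsrip's interface):

```python
import hashlib
import io

def hash_stream(fobj, chunk_size: int = 1 << 20):
    """Read the stream once in chunks, updating both digests as we go."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        md5.update(chunk)
        sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

m, s = hash_stream(io.BytesIO(b"abc"))
# m == "900150983cd24fb0d6963f7d28e17f72"
# s == "a9993e364706816aba3e25717850c26c9cd0d89d"
```

Streaming matters at ingest: evidence files can be larger than RAM, and hashing in chunks keeps memory flat regardless of file size.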
Processing Tasks
Hash set analysis, keyword searching, text extraction, document clustering, face detection, graphics clustering, video analysis...
Not limited: parse *anything* via the plugin interface.
Processing: plugins + MRCoffee; document analysis via Dapper Vision's Picarus.
Processing Plugins
A plugin interface lets community members extend functionality: Python + <other languages>.
Plugins can return data to the system:
PST file -> emails -> HBase rows
Internet history records -> HBase rows
Extracted zip file contents -> HBase rows
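A plugin that expands a container into child rows might look like the sketch below. The zip_plugin name, its signature, and the record fields are hypothetical, not the project's actual plugin API; the shape is what matters: one input file in, a stream of child records out, each destined to become an HBase row.

```python
import io
import zipfile

def zip_plugin(path: str, data: bytes):
    """Hypothetical plugin: expand a zip into one record per archived file,
    keying each child under its parent's path so they sort together."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            yield {
                "row_key": f"{path}/{info.filename}",  # child keyed by parent
                "size": info.file_size,
                "data": zf.read(info.filename),
            }

# Build a small zip in memory and run the plugin over it:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
records = list(zip_plugin("evidence/docs.zip", buf.getvalue()))
```

Because HBase rows sort by key, prefixing children with the parent path keeps an archive and its contents adjacent in table scans.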
[Diagram] Processing flow: file data → plugin → parsed records.
Processing with MRCoffee
Sandboxed virtual processing: creates a supermin VM on the fly with KVM. 100s of KB; no network stack, no gcc, no nothin'.
The VM uses the host kernel, glibc, etc., and includes whatever support files you give it.
Uses a character device for data transfer: inside the VM, the MRCoffee daemon pumps data to your own utility on stdin, then pipes data from its stdout back to the host and into HBase.
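The pump-to-stdin / collect-from-stdout contract can be sketched on the host with an ordinary pipe. Inside MRCoffee the transport is a character device into the VM rather than a pipe, but the data flow your utility sees is the same:

```python
import subprocess
import sys

def run_tool(argv, data: bytes) -> bytes:
    """Feed raw file bytes to an untrusted tool's stdin; capture its stdout."""
    done = subprocess.run(argv, input=data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout

# A trivial stand-in "utility" that just upper-cases whatever it is fed:
out = run_tool([sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"],
               b"hello")
# out == b"HELLO"
```

The utility never opens files or sockets; it only reads stdin and writes stdout, which is exactly what makes it easy to drop into a VM with no network stack.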
Review: Reporting
Review: Image Clustering
Review: Video Keyframe Analysis
Wrap Up
Yes, we can scale forensics.
We can do it without dongles, and without fancy hardware.
We're CPU rich, finally: we can add machines and do more/better analysis.
Not quite ready for primetime... some lab will need to be first.
https://github.com/jonstewart/tp
Jon Stewart: jon@lightboxtechnologies.com
Geoff Black: geoff@lightboxtechnologies.com