SCEADAN Systematic Classification Engine for Advanced Data ANalysis

Similar documents
SCEADAN Systematic Classification Engine for Advanced Data ANalysis

DETECTING THREATENING INSIDERS WITH LIGHTWEIGHT MEDIA FORENSICS

Insider Threat Detection

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

RIA SECURITY TECHNOLOGY

SVM Ensemble Model for Investment Prediction

Firewall Testing Methodology W H I T E P A P E R

Performance Testing Process A Whitepaper

Detecting Threats in Network Security by Analyzing Network Packets using Wireshark

Support Vector Machines for Dynamic Biometric Handwriting Classification

Network Based Intrusion Detection Using Honey pot Deception

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

Clavister InSight TM. Protecting Values

An Oracle Technical White Paper May How to Configure Kaspersky Anti-Virus Software for the Oracle ZFS Storage Appliance

A Review on Network Intrusion Detection System Using Open Source Snort

How To Understand How Weka Works

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Web Document Clustering

Sophistication of attacks will keep improving, especially APT and zero-day exploits

Azure Machine Learning, SQL Data Mining and R

SecureDoc Disk Encryption Cryptographic Engine

Towards facilitating reliable recovery of JPEG pictures? P. De Smet

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Assessment for Master s Degree Program Fall Spring 2011 Computer Science Dept. Texas A&M University - Commerce

E-commerce Transaction Anomaly Classification

PTK Forensics. Dario Forte, Founder and Ceo DFLabs. The Sleuth Kit and Open Source Digital Forensics Conference

An Oracle Technical White Paper January How to Configure the Trend Micro IWSA Virus Scanner for the Oracle ZFS Storage Appliance

Model Selection. Introduction. Model Selection

FEATURE COMPARISON BETWEEN WINDOWS SERVER UPDATE SERVICES AND SHAVLIK HFNETCHKPRO

VX Search File Search Solution. VX Search FILE SEARCH SOLUTION. User Manual. Version 8.2. Jan Flexense Ltd.

Personal Security Practices of the CAO

WildFire Features. Palo Alto Networks. PAN-OS New Features Guide Version 6.1. Copyright Palo Alto Networks

Using Remote Desktop Clients

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Presenting Mongoose A New Approach to Traffic Capture (patent pending) presented by Ron McLeod and Ashraf Abu Sharekh January 2013

1. Classification problems

Automated Content Analysis of Discussion Transcripts

FORSIGS: Forensic Signature Analysis of the Hard Drive for Multimedia File Fingerprints

Anomaly Management using Complex Event Processing. Bastian Hoßbach and Bernhard Seeger

Defending Against Cyber Attacks with SessionLevel Network Security

A Preliminary Performance Comparison of Two Feature Sets for Encrypted Traffic Classification

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

Data Mining for Knowledge Management. Classification

Model Deployment. Dr. Saed Sayad. University of Toronto

A Feature Selection Methodology for Steganalysis

Barracuda Intrusion Detection and Prevention System

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

SiteCelerate white paper

1 Topic. 2 Scilab. 2.1 What is Scilab?

Data Exfiltration and DNS

FEATURE OVERVIEW. FGX Series firewall. Last updated February 2012

Beating the NCAA Football Point Spread

PCI COMPLIANCE REQUIREMENTS COMPLIANCE CALENDAR

IDS Categories. Sensor Types Host-based (HIDS) sensors collect data from hosts for

Network- vs. Host-based Intrusion Detection

Assessment Plan for CS and CIS Degree Programs Computer Science Dept. Texas A&M University - Commerce

VWF. Virtual Wafer Fab

Intrusion Detection Systems

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Intrusion Detection for Mobile Ad Hoc Networks

Learning from Diversity

Data Structure Reverse Engineering

BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation

Fighting Advanced Threats

Content-ID. Content-ID URLS THREATS DATA

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

KEITH LEHNERT AND ERIC FRIEDRICH

WildFire Overview. WildFire Administrator s Guide 1. Copyright Palo Alto Networks

CHAD TILBURY.

Comparison of Firewall, Intrusion Prevention and Antivirus Technologies

Flow-based detection of RDP brute-force attacks

Programming Exercise 3: Multi-class Classification and Neural Networks

CSC574 - Computer and Network Security Module: Intrusion Detection

Semi-Supervised Learning for Blog Classification

Second-generation (GenII) honeypots

Author Gender Identification of English Novels

Intrusion Detection System Based Network Using SNORT Signatures And WINPCAP

The syslog-ng Premium Edition 5F2

II. RELATED WORK. Sentiment Mining

Intrusion Defense Firewall

Secure Shell SSH provides support for secure remote login, secure file transfer, and secure TCP/IP and X11 forwarding. It can automatically encrypt,

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

DKIM Enabled Two Factor Authenticated Secure Mail Client

The Evolution of File Carving [The benefits and problems of forensics recovery]

Active Learning SVM for Blogs recommendation

Sandy. The Malicious Exploit Analysis. Static Analysis and Dynamic exploit analysis. Garage4Hackers

Transcription:

SCEADAN Systematic Classification Engine for Advanced Data ANalysis Nicole Beebe, Ph.D. The University of Texas at San Antonio Dept. of Info. Systems & Cyber Security Simson Garfinkel, Ph.D. The Naval Postgraduate School Dept. of Computer Science

Caveats UTSA funding sources NPS Grant No. N00244-11-1-0011 NPS Grant No. N00244-13-1-0027 UTSA Provost s Summer Research Mentorship Program Disclaimer Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the Naval Postgraduate School

Motivation, Use Cases, Literature Review BACKGROUND

Intro: Data/File Type Determination Challenging on fragments (non-header) and files without accurate signatures/extensions o o o Poor prediction accuracies in large multi-class situations No useful tools available after 10 years of research by field Significant scalability and speed issues

Primary Use Cases Digital forensics Focus investigative efforts (e.g. search hit ranking feature) Triage and disk profiling Fragment identification/isolation, file recovery/reassembly Intrusion detection Overcome signature-based obfuscation techniques Anomaly detection Context triggered indication & warning (e.g. payload vs. protocol), as opposed to traffic-based, bursty anomaly detection Exfiltration, extrusion detection and profiling

Other Possible Use Cases Firewalls Content based blocking Malware Detection: Content based malware, virus detection Analysis: Mapping binary objects Steganalysis Detecting statistically abnormal file types due to content Isolating stegged data within binary objects CAVEAT: Use traditional methods when you have trusted file signatures!

Much research but no tools Literature Review Penrose et al. (2013) Patel and Singh (2013) Alherbawi et al. (2013) Roussev and Quates (2013) Xie et al. (2013) Fitzgerald et al. (2012) Gopal et al. (2011) Axelsson (2010) Conti et al. (2010) Li et al. (2010, 2005) Ahmed et al. (2010, 2009) Cao et al. (2010) Amirani et al. (2012, 2011, 2008) Calhoun and Coles (2008) Moody and Erbacher (2008) Zhang and White (2007) Veenman (2007) Erbacher and Mulholland (2007) Karresand and Shahmehri (2006) Hall and Davis (2006) McDaniel and Heydari (2003) Shannon (2004)

Two Fundamental Approaches Naïve statistical classification SCEADAN employs a statistical approach Machine learning or metrics based approaches Byte frequency distribution (n-gram analysis) Complexity measures (entropy, Kolmogrov complexity, etc.) Specialized, semantic based approaches Utilize knowledge of internal file structures Ex: JPEG sections; ZIP local file headers Look for predictive, string based indicators Ex: >>stream preceding a Deflate compressed stream =.PDF Not signature based in traditional sense

Byte Frequency Distributions n-gram Analysis: Unigram = 1 byte 256 features \x00, \x01 \xff Bi-gram = 2 bytes 65,536 features \x0000, \x0001, \xffff A few non-ngram features (e.g. entropy, kurtosis, ) Images from Amirani et al. (2012)

Including Headers Biases Results Sceadan (2013) Legend No header data in training set Sample includes some header segments All sample observations include headers (unit of analysis = full file) Sceadan (model/tool) does not rely on header segments (but can use them)

Tool Developed SCEADAN

Tool Developed: Sceadan The name Systematic Classification Engine for Advanced Data ANalysis Old English / Proto-Germanic for To Classify Naïve statistical classifier Classifies independent of signatures, extension, file system data Uses N-gram features (concatenated unigram+bigram vectors) Built-in training and prediction modes Leverages LIBLINEAR (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) Uses support vector machines (SVMs) Built-in model ( 50MB), but you can train/use your own

True Positive Prediction Rates in our Experiments Type Ext Rate % Delimited.csv 100 JSON records.json 100 Base64 encoding.b64 100 Base85 encoding.a85 100 Hex encoding.urlenc 100 Postscript.ps 100 Log files.log 99 CSS.css 99 Plain text.text,.txt 98 XML.xml 98 FS-EXT.ext3 97 Java Source Code.java 97 JavaScript code.js 95 Bi-tonal images.tif,.tiff 95 HTML.html 91 GIF.gif 86 MS-XLS.xls 84 MP3.mp3 84 Bitmap.bmp 83 AVI.avi 78 JPG.jpg 76 BZ2.bz2 72 H264.mp4 72 FS-NTFS.ntfs 71 v1.2.1 built-in model performance Type Ext Rate % AAC.m4a 69 MS-DOCX.docx 62 WMV.wmv 59 PDF.pdf 54 MS-DOC.doc 53 MS-XLSX.xlsx 50 FLV.flv,.FLV 44 ZLIB DEFLATE.gz 29 Portable Network Graphic.png 28 FS-FAT.fat 25 MS-PPTX.pptx 21 ZLIB - DEFLATE.zip 20 MS-PPT.ppt 14 ENCRYPTED N/A 13 Average Sceadan prediction accuracy: 71.5%* (NOTE: Later modeling netted 73.5% accuracy) Random chance classification: 1/40 = 2.5% Train/Test: 60%:40% (million fragment sample) * Beebe et al. (2013) Sceadan: Using Concatenated N-Gram Vectors for Improved Data/File Type Classification, IEEE Transactions on Information Forensics and Security, (8:9), pp. 1519-1530.

Sceadan Performance Relative to Others (2013) 100% 90% Higher Accuracy But Few Classes Prediction Accuracy 80% 70% 60% 50% 40% 30% Typical Accuracy Degradation As Multi-Class Challenge Increases Sceadan v1.0 73.5% Accuracy* 40 Classes NOTE: 9 lowest performing types are significantly under-researched classes (e.g. Office2010, FS data) 20% 10% 0% 0 5 10 15 20 25 30 35 40 Number of Classes Predicted by Classifier *Additional model training has improved classifier accuracy from 71.5% to 73.5%

Built-In Model Details (Sceadan v1.2.1) Model file model.ucv-bcv.20130509.c256.s2.e005 MD5 D068034D64329ECCA25DD76F515656A7 Model trained using digitalcorpora.org files Random subset of GOVDOCS1 files (types: see Beebe et al. 2013) FILETYPES1 data set (UTSA open source file collection) 512B training samples (n=1,800 samples of each type) Segmented all files into 512 byte blocks Removed header sector Model parameters & solver function C=256, e=.005, gamma=n/a (linear kernel) L2 regularized L2 loss function, primal solver

Model Building (Training) Motivation It is important to build your own models. Train on different data types High number of classes degrades model performance Experiment with different block sizes, features, etc. Training data must be verified, prepared, cleansed Sceadan optimizes model parameters Runs grid.py Optimizes C (surface smoothing parameter) Optimizes gamma (single training point influence) Sceadan randomly splits sample into train:test sets

Other Capabilities Ability to write LibSVM compliant vectors to output Multi-threaded, in-memory/compiled model Creates confusion matrices automatically Sub-block classification capability Can specify sub-block size within files to classify Can classify individual files, or in directory mode: Block offset (0=full file classified) Predicted Type Input path/filename These samples were named <<true type>>.txt

Miscellaneous Copyright: University of Texas at San Antonio License: GPLv2 Written in C Linux CLI based Code rewritten/improved in 2014 by NPS To obtain: github.com/nbeebe/sceadan

Future Code Development Plans LATEST RESEARCH

Scatter Plot of Unigram Weights (4 Models) for.exe Files Positive weighted features help classify sample as type Unigrams (byte values) Negative weighted features help classify sample as NOT type High absolute value features are good discriminators Not all n-grams are equally discriminatory

Prediction Accuracy as % Unigrams Decline 100% unigrams 69% accuracy 14% unigrams 65% accuracy 100% unigrams 81% accuracy Research findings suggest 15-20% of unigrams/bigrams can be used to obtain similar performance, at Much reduced computational cost. 12% unigrams 75% accuracy

Hierarchical Classification Modeling Advanced, hierarchical classification design Improve accuracy via hierarchical data class classification, followed by data type classification In practice, class classification Conti et al. (2010) Becomes triage mechanism to focus classification efforts Improves prediction accuracy Reduces multi-class size in tough classes Enables better feature selection within classes Reduces problem of over-fitting

Nicole.Beebe@utsa.edu (210) 458-8040 (210) 269-5647 (Cell) COMMENTS / QUESTIONS?