Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum




Administrivia
  Silence your mobile
  Interactive talk
  Please ask questions

Outline
  Introduction
  Big Questions
  What Makes Things Similar?
  Feature Selection
  Feature Extraction
  Comparisons
  Clustering
  Classification

Introduction
  Computer Forensics Research Guru
    md5deep/hashdeep
    fuzzy hashing (ssdeep)
    foremost
  Now with Kyrus Technology
  Previously AFOSI, USNA, DoJ, ManTech

Introduction
  Statistical Similarity
    Using statistics to identify things which are similar
    Science, it works!
    Found in some ediscovery tools now
    Will be in AccessData products soon
  Not the only approach
    Semantic Similarity, et al.
  There have been many developments in Computer Science which we're not using yet
  Most of today's talk is 10-20 years old

Big Questions
  I have a billion documents.
  Which of these documents are similar to each other?
  Which of these documents belong in categories I've created?
    Responsive to a subpoena?
    Related to the Henderson account?
  Current technology: Manual review
    Expensive, time consuming

What Makes Things Similar?
  Depends on
    Which aspects you're comparing
    How you're comparing them

Example
  Both live in Washington DC
  Both like a good hamburger
  Both are dog people
  Conclusion: Similar

  President Obama is much taller
  Presenter does not have gray hair
  Work in different career fields
  Conclusion: Not similar

Feature Selection
  Choose aspects to compare
  Anything can be a feature
    Text
    Pictures
    Metadata
    Language
    Reading level
    Number of words
  Have to be represented mathematically
  Image courtesy of Flickr user doctor_keats and used under a Creative Commons license.

Feature Selection
  Similar inputs should have similar features

N-grams
  N-grams of text
  Computer science term for a phrase of n words
  The quick brown fox jumped over the lazy dog
  2-grams
    the quick
    quick brown
    brown fox
  3-grams
    the quick brown
    quick brown fox
    brown fox jumped
  Photograph from Flickr user regali and used under a Creative Commons license.

N-grams
  Relative position independence
    Handy when a paragraph gets moved
  Not entirely position independent
    Gives some context for any word
    Unlike the bag-of-words model
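A minimal sketch of word n-gram generation for the example sentence. The function name `word_ngrams` and the choice to lowercase and split on whitespace are assumptions made for illustration; the slides do not say how the text is tokenized.

```python
def word_ngrams(text, n):
    """Return the n-word phrases (n-grams) of text, in order of appearance."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick brown fox jumped over the lazy dog"
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(bigrams[:3])   # ['the quick', 'quick brown', 'brown fox']
print(trigrams[:3])  # ['the quick brown', 'quick brown fox', 'brown fox jumped']
```

Because each n-gram only records a word's immediate neighbours, moving a whole paragraph changes just the few n-grams at its edges.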

Feature Extraction
  Getting the features out of the documents
  Counting n-grams
    quick brown: 2
    brown fox: 4
  Want to make features look the same
    Looking for similarity, not identity
    Confusion is a good thing
  Want to minimize number of features
    Makes math easier (and faster)

Feature Extraction
  Throw out stop words
    Common words
    Defined by linguistics for each language
    the, and, but, of, is
    In our case, throw out the 2-grams "the quick" and "over the"
  Stemming words
    Linguistics technique
    Remove endings to create same word
    Jumped, jumps, jumping -> jump

Feature Selection
  The quick brown fox jumped over the lazy dog
  2-grams:
    quick brown
    brown fox
    fox jump
    jump over
    lazi dog
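The extraction steps (drop 2-grams containing a stop word, stem what remains, count) can be sketched as follows. The stop-word list is the one given in the slides. `toy_stem` is a deliberately crude stand-in invented for this example; a real system would presumably use a proper stemmer such as Porter's, which is where a stem like "lazi" comes from.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "but", "of", "is"}  # list taken from the slides


def toy_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"
    return word


def extract_bigrams(text):
    """Count 2-grams of stemmed words, dropping any 2-gram with a stop word."""
    counts = Counter()
    previous = None
    for raw in text.lower().split():
        if raw in STOP_WORDS:
            previous = None  # no 2-gram may start or end on a stop word
            continue
        word = toy_stem(raw)
        if previous is not None:
            counts[previous + " " + word] += 1
        previous = word
    return counts


features = extract_bigrams("The quick brown fox jumped over the lazy dog")
print(sorted(features))
# ['brown fox', 'fox jump', 'jump over', 'lazi dog', 'quick brown']
```

The resulting `Counter` is the document's feature vector: each key is a feature and each value is how often it occurs.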

Distance Measures
  What it sounds like
    How far apart are these data points?
    Alternatively, how similar are they?
  More than one way to measure distance

Distance Measures
  [Map: the Venetian and In-N-Out Burger]

Distance Measures
  Distance: 3 miles
  Straight line or Euclidean distance

Distance Measures
  Distance: 5 miles
  Manhattan distance
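A short sketch of the two distance measures. The coordinates below are hypothetical values chosen for easy arithmetic; they are not the locations shown on the map slides.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Distance when travel is restricted to a grid of streets."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical coordinates in miles.
start = (0.0, 0.0)
end = (3.0, 4.0)
print(euclidean(start, end))  # 5.0
print(manhattan(start, end))  # 7.0
```

The same two points give different answers depending on the measure, and the Manhattan distance is never shorter than the Euclidean one.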

String Distance Measures
  Distance measures for strings:
    Edit distance
    Hamming distance
    Dice's coefficient
    See Wikipedia category: String similarity measures
  And these are just for strings!
    See Wikipedia category: Statistical distance measures
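For reference, the three string measures named above can be sketched in a few lines each. These are textbook formulations rather than anything specific to the talk; computing Dice's coefficient over character 2-grams is one common convention and is an assumption here. Note that Dice's coefficient is a similarity (1.0 means identical sets), not a distance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def dice_coefficient(a, b):
    """Dice's coefficient over the sets of character 2-grams of a and b."""
    pairs_a = {a[i:i + 2] for i in range(len(a) - 1)}
    pairs_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not pairs_a and not pairs_b:
        return 1.0
    return 2 * len(pairs_a & pairs_b) / (len(pairs_a) + len(pairs_b))


print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3
print(dice_coefficient("night", "nacht"))  # 0.25
```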

String Distance Measures
  We want a distance measure for counts of n-grams
    Not just two strings
  Cosine similarity
    Create a vector (arrow) for each set of strings
    Measure the angle between those vectors

Cosine Similarity
  Represent feature counts for each document as a vector
  [Figure: document vectors on axes "quick brown" and "fox jumped"]

Cosine Similarity
  The smaller this angle, the more similar the documents
  [Figure: angle θ between two document vectors; axes "quick brown" and "fox jumped"]

Cosine Similarity
  Extending to three dimensions (or features)
  [Figure: document vectors with a third feature axis added]
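A minimal sketch of cosine similarity over n-gram counts, using the standard formula cos θ = (a · b) / (|a| |b|). The three documents and their counts are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two feature-count vectors (0.0 to 1.0)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[f] * counts_b[f] for f in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature counts for three documents.
doc_a = Counter({"quick brown": 2, "fox jumped": 1})
doc_b = Counter({"quick brown": 4, "fox jumped": 2})  # same mix, twice as long
doc_c = Counter({"fox jumped": 3})

print(cosine_similarity(doc_a, doc_b))  # close to 1.0: the vectors point the same way
print(cosine_similarity(doc_a, doc_c))  # about 0.447: a wider angle
```

Because only the angle matters, a document and a longer document with the same proportions of features score as nearly identical. The function works unchanged for any number of features, since each distinct n-gram is simply one more dimension.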

Cosine Similarity
  Math can handle any number of dimensions/features
    But more features make the math more complicated
  The Curse of Dimensionality
    So many dimensions (features) that comparisons become too time consuming
    Just select the best features
    (Insert mathy stuff here)
  Example: Which is the best feature?
    "advanced persistent threat" vs. "quick brown"

Comparisons
  These documents are similar!

Comparisons
  Can find documents similar to any query
    Document
    Paragraph
  Similar to a kind of fuzzy hashing
    Signature is n-gram counts

Clustering
  Can find clusters of similar documents
  Unsupervised machine learning
    Artificial intelligence
  Start with pile of documents
  Press go
  End up with clusters of similar documents
  Example: Documents A, B, C, D, E, F, and G

Exclusive Clusters
  Each document belongs to at most one cluster
  Not all documents in a cluster are similar to each other
  Some documents are not similar to any others
    Unique documents

Non-Exclusive Clusters
  Each document can belong to any number of clusters
  Every document in a cluster is similar to the others
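The talk does not name a clustering algorithm. As one possible illustration of exclusive clusters, the sketch below assumes a simple threshold rule: two documents are linked if their similarity is at least `threshold`, and each cluster is a connected group of linked documents. The similarity scores and the threshold value are hypothetical.

```python
def exclusive_clusters(documents, similarity, threshold):
    """Group documents into connected components of the 'similar enough' graph."""
    parent = {doc: doc for doc in documents}

    def find(doc):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for doc in documents:
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(clusters.values())


documents = ["A", "B", "C", "D", "E", "F", "G"]
similarity = {  # hypothetical scores; unlisted pairs count as dissimilar
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "C"): 0.2,
    ("D", "E"): 0.85,
    ("F", "G"): 0.1,
}
clusters = exclusive_clusters(documents, similarity, threshold=0.7)
print(clusters)  # [['A', 'B', 'C'], ['D', 'E'], ['F'], ['G']]
```

Here A and C share a cluster even though they are not similar to each other, because both are similar to B. F and G are unique documents that end up alone.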

Classification
  Also known as:
    Predictive Coding
    Assisted Machine Learning
  Choose all documents which belong in my group
  Documents responsive to the subpoena: A, C, D, G
  Documents not responsive: B, E, F

Classification
  User must create a set of training data
    Some documents which are in the group
    Some documents which are not in the group
  Coding documents:
    1. Yes
    2. No
    3. [skip]
    4. Yes
    5. Yes
    6. No
    7. No

Classification
  Artificial intelligence is just math
  There are many algorithms:
    Naïve Bayesian classifier
    K-Nearest Neighbor
    Locality Sensitive Hashing
    Decision Trees
    Neural Networks
    Hidden Markov Models
  See Wikipedia article on Classification (machine learning)

Naïve Bayesian Classifier
  Also used for spam detection
    Also a classification problem
  P(B given A) = P(B) * P(A given B) / P(A)
  Email contains features:
    P(spam given features) = P(spam) * P(features given spam) / P(features)
    P(notspam given features) = P(notspam) * P(features given notspam) / P(features)
  Which probability is greater?
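A minimal sketch of the comparison described above. Two details are assumptions not stated in the slides: the "naïve" step of multiplying per-feature probabilities as if features were independent, and add-one (Laplace) smoothing so an unseen feature does not force a probability of zero. The training emails are invented.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (label, features). Returns the counts used to classify."""
    label_counts = Counter()
    feature_counts = {}
    vocabulary = set()
    for label, features in examples:
        label_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary


def classify(features, model):
    """Return the label with the greater P(label) * P(features given label)."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # P(features) is the same for every label, so it is left out.
        # Logs are summed instead of multiplying many small probabilities.
        score = math.log(count / total)
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            score += math.log((feature_counts[label][feature] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data.
training = [
    ("spam", ["cheap", "pills", "buy"]),
    ("spam", ["buy", "now", "cheap"]),
    ("notspam", ["meeting", "agenda", "monday"]),
    ("notspam", ["project", "meeting", "notes"]),
]
model = train(training)
print(classify(["cheap", "pills"], model))    # spam
print(classify(["meeting", "notes"], model))  # notspam
```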

Decision Tree
  Build a flowchart of questions on the features
  Each question should divide the data equally
  Blackjack example:
    Questions: Is your total < 11? Have pair? Dealer have < 11?
    Outcomes: Split hands, Hit, Stay

Decision Tree
  Quick to classify, but slow to construct
  What questions are best at which point in the tree?
    [Insert mathy stuff here]
  You could make a career out of efficient decision tree generation
    And people do
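The slides leave out the math for choosing questions. One widely used criterion, shown here as an assumption rather than as the speaker's method, is information gain: prefer the question whose yes/no split most reduces the entropy of the labels. The rows, labels and feature names below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits; 0 when every label is the same."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, question):
    """Entropy reduction from splitting rows by a yes/no question."""
    yes = [label for row, label in zip(rows, labels) if question(row)]
    no = [label for row, label in zip(rows, labels) if not question(row)]
    if not yes or not no:
        return 0.0  # the question does not split the data at all
    remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - remainder


# Hypothetical documents described by two feature counts.
rows = [
    {"henderson": 1, "lunch": 0},
    {"henderson": 1, "lunch": 1},
    {"henderson": 0, "lunch": 1},
    {"henderson": 0, "lunch": 0},
]
labels = ["responsive", "responsive", "not", "not"]

gain_henderson = information_gain(rows, labels, lambda r: r["henderson"] > 0)
gain_lunch = information_gain(rows, labels, lambda r: r["lunch"] > 0)
print(gain_henderson, gain_lunch)  # the first is 1 bit, the second is 0
```

Construction is slow because every candidate question has to be scored this way at every node, while classifying a new document only requires answering one question per level.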

Classifier Performance
  Run classifier on training data
  Compare classifier results to known values

     True value   Classifier guess
  1. Yes          YES
  2. No           YES (false positive)
  3. [skipped]    [skipped]
  4. Yes          YES
  5. Yes          NO (false negative)
  6. No           NO
  7. No           YES (false positive)

Classifier Performance
  There are several measures of classifier performance
    Precision and Recall
    Receiver operating characteristic
      Aka ROC curve
    Confusion matrix

Precision and Recall
  Precision measures false positives
    P = TP / (TP + FP)
  Recall measures false negatives
    R = TP / (TP + FN)
  Both are on a scale from zero to one
    One being perfect

Precision and Recall

     True value   Classifier guess
  1. Yes          YES
  2. No           YES (false positive)
  3. [skipped]    [skipped]
  4. Yes          YES
  5. Yes          NO (false negative)
  6. No           NO
  7. No           YES (false positive)

  TP = 2
  FP = 2
  FN = 1
  Precision = TP / (TP + FP) = 2 / (2 + 2) = 0.5
  Recall = TP / (TP + FN) = 2 / (2 + 1) ≈ 0.667
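The same calculation as a short script, using the seven coded documents from the table. Representing the skipped document as `None` is a choice made for this sketch.

```python
# (true value, classifier guess) for documents 1-7; document 3 was skipped.
results = [
    ("yes", "yes"),
    ("no", "yes"),
    None,
    ("yes", "yes"),
    ("yes", "no"),
    ("no", "no"),
    ("no", "yes"),
]

scored = [r for r in results if r is not None]
tp = sum(1 for truth, guess in scored if truth == "yes" and guess == "yes")
fp = sum(1 for truth, guess in scored if truth == "no" and guess == "yes")
fn = sum(1 for truth, guess in scored if truth == "yes" and guess == "no")

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn)        # 2 2 1
print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```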

Classifier Performance
  If you're not happy with the performance, you can:
    Add more training values (easy)
    Change feature selection (moderate)
    Change features (difficult)
    Change algorithms (PITA)

Big Questions
  I have a billion documents.
  Which of these documents are similar to each other?
  Which of these documents belong in categories I've created?
    Responsive to a subpoena?
    Related to the Henderson account?
  New technology:
    Select features
    Let computer do the work

Outline
  Introduction
  Big Questions
  What Makes Things Similar?
  Feature Selection
  Feature Extraction
  Comparisons
  Clustering
  Classification

Questions?
  Jesse Kornblum
  jesse.kornblum@kyrus-tech.com