Statistical Validation and Data Analytics in ediscovery Jesse Kornblum
Administrivia Silence your mobile Interactive talk Please ask questions 2
Outline Introduction Big Questions What Makes Things Similar? Feature Selection Feature Extraction Comparisons Clustering Classification 3
Introduction Computer Forensics Research Guru md5deep/hashdeep fuzzy hashing (ssdeep) foremost Now with Kyrus Technology Previously AFOSI, USNA, DoJ, ManTech 4
Introduction Statistical Similarity Using statistics to identify things which are similar Science, it works! Found in some ediscovery tools now Will be in AccessData products soon Not the only approach Semantic Similarity, et al. There have been many developments in Computer Science which we're not using yet Most of today's talk is 10-20 years old 5
Big Questions I have a billion documents. Which of these documents are similar to each other? Which of these documents belong in categories I've created? Responsive to a subpoena? Related to the Henderson account? Current technology: Manual review Expensive, time consuming 6
What Makes Things Similar? 7
What Makes Things Similar? Depends on: Which aspects you're comparing How you're comparing them 8
Example 9
Example Both live in Washington DC Both like a good hamburger Both are dog people Conclusion: Similar President Obama is much taller Presenter does not have gray hair Work in different career fields Conclusion: Not similar 10
Feature Selection Choose aspects to compare Anything can be a feature Text Pictures Metadata Language Reading level Number of words Have to be represented mathematically Image courtesy of Flickr user doctor_keats and used under a Creative Commons license. 11
Feature Selection Similar inputs should have similar features 12
N-grams N-grams of text Computer science term for a phrase of n words The quick brown fox jumped over the lazy dog 2-grams: the quick, quick brown, brown fox 3-grams: the quick brown, quick brown fox, brown fox jumped Photograph from Flickr user regali and used under a Creative Commons license. 13
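A minimal sketch of word n-gram extraction, assuming words are split on whitespace and lowercased. The function name `word_ngrams` is made up for illustration and does not come from any tool mentioned in the talk.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick brown fox jumped over the lazy dog"
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams[:3])   # ['the quick', 'quick brown', 'brown fox']
print(trigrams[:3])  # ['the quick brown', 'quick brown fox', 'brown fox jumped']
```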
N-grams Relative position independence Handy when a paragraph gets moved Not entirely position independent Gives some context for any word Unlike Bag of words model 14
Feature Extraction Getting the features out of the documents Counting n-grams quick brown: 2 brown fox: 4 Want to make features look the same Looking for similar, not identical Confusion is a good thing Want to minimize number of features Makes math easier (and faster) 15
Feature Extraction Throw out Stop Words Common words Defined by linguistics for each language the, and, but, of, is In our case, throw out the quick and over the Stemming words Linguistics technique Remove endings to create same word Jumped, jumps, jumping -> jump 16
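Counting n-grams can be sketched with the standard library's `collections.Counter`. The sample text and the name `ngram_counts` are invented for illustration. The counts are the document's features.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count how often each word n-gram occurs in text."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


# Made-up sample document, not taken from the slides.
counts = ngram_counts("the quick brown fox saw the quick brown dog")
print(counts["quick brown"])  # 2
print(counts["brown fox"])    # 1
```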
Feature Selection The quick brown fox jumped over the lazy dog 2-grams: quick brown, brown fox, fox jump, jump over, lazi dog 17
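A rough sketch of stop-word removal and stemming ahead of n-gram extraction. Both `STOP_WORDS` and `crude_stem` are simplified stand-ins invented here. A real system would use a per-language stop list and a proper stemmer such as Porter's, which would turn "lazy" into "lazi" as on the slide. This toy stemmer leaves "lazy" alone.

```python
# Illustrative stop list only; real lists are much longer.
STOP_WORDS = {"the", "and", "but", "of", "is"}


def crude_stem(word):
    """Strip a few common English endings (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def extract_features(text, n=2):
    """Drop stop words, stem what is left, then build word n-grams."""
    words = [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


features = extract_features("The quick brown fox jumped over the lazy dog")
print(features)
# ['quick brown', 'brown fox', 'fox jump', 'jump over', 'over lazy', 'lazy dog']
```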
Distance Measures What it sounds like How far apart are these data points? Alternatively, how similar are they? More than one way to measure distance 18
Distance Measures Venetian In n Out Burger 19
Distance Measures Distance: 3 miles Straight line or Euclidean distance 20
Distance Measures Distance: 5 miles Manhattan distance 21
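The two distances can be sketched in a few lines. The coordinates below are made up and are not the real map positions from the slides. They only show that the same two points give different answers under different measures.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Distance when travel is restricted to a street grid."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical coordinates in miles.
start = (0.0, 0.0)
end = (3.0, 4.0)

print(euclidean(start, end))  # 5.0
print(manhattan(start, end))  # 7.0
```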
String Distance Measures Distance measures for strings: Edit distance Hamming distance Dice's coefficient See Wikipedia category: String similarity measures And these are just for strings! See Wikipedia category: Statistical distance measures 22
String Distance Measures We want a distance measure for counts of n-grams Not just two strings Cosine similarity Create a vector (arrow) for each set of strings Measure the angle between those vectors 23
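For reference, here are textbook sketches of the three string measures named above. The function names are chosen for illustration. Dice's coefficient is computed here over sets of character pairs, which is one common variant.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or swaps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def dice(a, b):
    """Dice's coefficient over sets of adjacent character pairs (1.0 = same pairs)."""
    pairs_a = {a[i:i + 2] for i in range(len(a) - 1)}
    pairs_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not pairs_a and not pairs_b:
        return 1.0
    return 2 * len(pairs_a & pairs_b) / (len(pairs_a) + len(pairs_b))


print(hamming("karolin", "kathrin"))       # 3
print(edit_distance("kitten", "sitting"))  # 3
print(dice("night", "nacht"))              # 0.25
```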
Cosine Similarity Represent feature counts for each document as a vector [Figure: vectors plotted on axes "quick brown" and "fox jumped"] 24
Cosine Similarity The smaller this angle θ, the more similar the documents [Figure: angle θ between two vectors on axes "quick brown" and "fox jumped"] 25
Cosine Similarity Extending to three dimensions (or features) [Figure: vectors on axes including "quick brown" and "fox jumped"] 26
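A minimal sketch of cosine similarity over n-gram counts, assuming each document is a `Counter` mapping n-grams to counts. The three documents and their counts are invented. A result of 1.0 means the vectors point the same way (angle of zero), and 0.0 means they share no features.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two feature-count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature counts for three documents.
doc_a = Counter({"quick brown": 2, "fox jumped": 1})
doc_b = Counter({"quick brown": 4, "fox jumped": 2})  # same proportions as doc_a
doc_c = Counter({"lazy dog": 3})                      # no features in common

print(round(cosine_similarity(doc_a, doc_b), 6))  # 1.0
print(cosine_similarity(doc_a, doc_c))            # 0.0
```

Because only the angle matters, a document and a copy twice as long with the same proportions of n-grams score as identical.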
Cosine Similarity Math can handle any number of dimensions/features But more features make the math more complicated The Curse of Dimensionality: So many dimensions (features) that comparisons become too time consuming Just select the best features (Insert mathy stuff here) Example: Which is the best feature? "advanced persistent threat" vs. "quick brown" 27
Comparisons These documents are similar! 28
Comparisons Can find documents similar to any query Document Paragraph Similar to a kind of fuzzy hashing Signature is n-gram counts 29
Clustering Can find clusters of similar documents Unsupervised machine learning Artificial intelligence Start with pile of documents Press go End up with clusters of similar documents Example: Documents A, B, C, D, E, F, and G 30
Exclusive Clusters Each document belongs to at most one cluster Not all documents in a cluster are similar to each other Some documents are not similar to any others Unique documents 31
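One way to get exclusive clusters with these properties is to chain together documents linked by similar pairs. This is only a sketch of that one approach, not necessarily what any product does. The document names match the example, but which pairs count as similar is invented here. In practice `similar` would be something like cosine similarity above a chosen threshold.

```python
def exclusive_clusters(names, similar):
    """Group documents that are linked by a chain of similar pairs."""
    clusters = []
    unassigned = list(names)
    while unassigned:
        seed = unassigned.pop(0)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [d for d in unassigned if similar(current, d)]
            for d in linked:
                unassigned.remove(d)
            cluster.extend(linked)
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical similarity results for documents A through G.
SIMILAR_PAIRS = {("A", "B"), ("B", "C"), ("D", "E"), ("E", "F")}


def similar(a, b):
    return (a, b) in SIMILAR_PAIRS or (b, a) in SIMILAR_PAIRS


clusters = exclusive_clusters("ABCDEFG", similar)
print(clusters)  # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
```

A and C land in the same cluster even though they are not directly similar, and G ends up alone as a unique document.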
Non-Exclusive Clusters Each document can belong to any number of clusters Every document in a cluster is similar to the others 32
Classification Also known as: Predictive Coding Assisted Machine Learning Choose all documents which belong in my group Documents responsive to the subpoena: A, C, D, G Documents not-responsive: B, E, F 33
Classification User must create a set of training data Some documents which are in the group Some documents which are not in the group Coding documents: 1. Yes 2. No 3. [skip] 4. Yes 5. Yes 6. No 7. No 34
Classification Artificial intelligence is just math There are many algorithms: Naïve Bayesian classifier K-Nearest Neighbor Locality Sensitive Hashing Decision Trees Neural Networks Hidden Markov Models See Wikipedia article on Classification (machine learning) 35
Naïve Bayesian Classifier Also used for spam detection Also a classification problem
P(B given A) = (P(B) * P(A given B)) / P(A)
Email contains features:
P(spam given features) = (P(spam) * P(features given spam)) / P(features)
P(notspam given features) = (P(notspam) * P(features given notspam)) / P(features)
Which probability is greater? 36
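A small sketch of the comparison above, with several assumptions labeled. The training emails are made up. Features are treated as independent (the "naïve" part). Add-one smoothing is used so unseen words do not zero out a score. Logarithms are summed instead of multiplying small probabilities. P(features) is left out because it is the same on both sides of the comparison. The names `train` and `classify` are illustrative.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (features, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    feature_counts = {}
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab


def classify(model, features):
    """Return the label with the larger P(label) * P(features given label)."""
    label_counts, feature_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log P(label)
        total = sum(feature_counts[label].values())
        for f in features:
            # Add-one smoothing: an assumption, not something stated on the slide.
            score += math.log((feature_counts[label][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train([
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "notspam"),
    (["project", "meeting", "notes"], "notspam"),
])

print(classify(model, ["free", "money", "now"]))  # spam
print(classify(model, ["meeting", "notes"]))      # notspam
```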
Decision Tree Build a flowchart of questions on the features Each question should divide the data equally Blackjack example: Is your total < 11? Have pair? Dealer have < 11? Split hands Hit Stay 37
Decision Tree Quick to classify, but slow to construct What questions are best at which point in the tree? [Insert mathy stuff here] You could make a career out of efficient decision tree generation And people do 38
Classifier Performance Run classifier on training data Compare classifier results to known values
   True value   Classifier guess
1. Yes          YES
2. No           YES (false positive)
3. [skipped]    [skipped]
4. Yes          YES
5. Yes          NO (false negative)
6. No           NO
7. No           YES (false positive) 39
Classifier Performance There are several measures of classifier performance Precision and Recall Receiver operating characteristic Aka ROC curve Confusion matrix 40
Precision and Recall Precision measures false positives P = TP / (TP + FP) Recall measures false negatives R = TP / (TP + FN) Both are on a scale from zero to one One being perfect 41
Precision and Recall
   True value   Classifier guess
1. Yes          YES
2. No           YES (false positive)
3. [skipped]    [skipped]
4. Yes          YES
5. Yes          NO (false negative)
6. No           NO
7. No           YES (false positive)
TP = 2 FP = 2 FN = 1
Precision = TP / (TP + FP) = 2 / (2 + 2) = 0.5
Recall = TP / (TP + FN) = 2 / (2 + 1) = 0.667 42
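The same arithmetic as a short script, using the seven coded documents from the example. The skipped document is stored as `None` and ignored. The variable names are chosen for illustration.

```python
# True coding and classifier guess for documents 1-7 (None = skipped).
truth   = ["yes", "no", None, "yes", "yes", "no", "no"]
guesses = ["yes", "yes", None, "yes", "no", "no", "yes"]

pairs = [(t, g) for t, g in zip(truth, guesses) if t is not None]

tp = sum(t == "yes" and g == "yes" for t, g in pairs)  # true positives
fp = sum(t == "no" and g == "yes" for t, g in pairs)   # false positives
fn = sum(t == "yes" and g == "no" for t, g in pairs)   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn)           # 2 2 1
print(precision)            # 0.5
print(round(recall, 3))     # 0.667
```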
Classifier Performance If you're not happy with the performance, you can: Add more training values (easy) Change feature selection (moderate) Change features (difficult) Change algorithms (PITA) 43
Big Questions I have a billion documents. Which of these documents are similar to each other? Which of these documents belong in categories I've created? Responsive to a subpoena? Related to the Henderson account? New technology: Select features Let computer do the work 44
Outline Introduction Big Questions What Makes Things Similar? Feature Selection Feature Extraction Comparisons Clustering Classification 45
Questions? Jesse Kornblum jesse.kornblum@kyrus-tech.com 46