Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

1 Statistical Validation and Data Analytics in ediscovery Jesse Kornblum

2 Administrivia Silence your mobile Interactive talk Please ask questions 2

3 Outline Introduction Big Questions What Makes Things Similar? Feature Selection Feature Extraction Comparisons Clustering Classification 3

4 Introduction Computer Forensics Research Guru md5deep/hashdeep fuzzy hashing (ssdeep) foremost Now with Kyrus Technology Previously AFOSI, USNA, DoJ, ManTech 4

5 Introduction Statistical Similarity: using statistics to identify things which are similar. Science, it works! Found in some ediscovery tools now. Will be in AccessData products soon. Not the only approach: Semantic Similarity, et al. There have been many developments in Computer Science which we're not using yet. Most of today's talk is years old. 5

6 Big Questions I have a billion documents. Which of these documents are similar to each other? Which of these documents belong in categories I've created? Responsive to a subpoena? Related to the Henderson account? Current technology: manual review. Expensive, time consuming. 6

7 What Makes Things Similar? 7

8 What Makes Things Similar? Depends on: which aspects you're comparing; how you're comparing them. 8

9 Example 9

10 Example Both live in Washington DC. Both like a good hamburger. Both are dog people. Conclusion: similar. President Obama is much taller. Presenter does not have gray hair. Work in different career fields. Conclusion: not similar. 10

11 Feature Selection Choose aspects to compare. Anything can be a feature: text, pictures, metadata, language, reading level, number of words. Features have to be represented mathematically. Image courtesy of Flickr user doctor_keats and used under a Creative Commons license. 11

12 Feature Selection Similar inputs should have similar features. 12

13 N-grams N-grams of text Computer science term for phrase of n words The quick brown fox jumped over the lazy dog 2-grams the quick quick brown brown fox 3-grams the quick brown quick brown fox brown fox jumped Photograph from Flickr user regali and used under a Creative Commons license. 13
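
The n-gram idea on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `word_ngrams` and the choice to lowercase and split on whitespace are assumptions, not something the talk specifies.

```python
def word_ngrams(text, n):
    """Return the n-word phrases of text, in order of appearance."""
    words = text.lower().split()  # assumption: whitespace tokenization
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick brown fox jumped over the lazy dog"
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams[:3])   # first three 2-grams
print(trigrams[:3])  # first three 3-grams
```

A nine-word sentence yields eight 2-grams and seven 3-grams, since each n-gram is a sliding window over the word list.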

14 N-grams Relative position independence Handy when a paragraph gets moved Not entirely position independent Gives some context for any word Unlike Bag of words model 14

15 Feature Extraction Getting the features out of the documents. Counting n-grams: quick brown: 2, brown fox: 4. Want to make features look the same. Looking for similarity, not identity. Confusion is a good thing. Want to minimize number of features. Makes math easier (and faster). 15

16 Feature Extraction Throw out stop words: common words, defined by linguistics for each language (the, and, but, of, is). In our case, throw out "the quick" and "over the". Stemming words: a linguistics technique. Remove endings to create the same word: jumped, jumps, jumping -> jump. 16

17 The quick brown fox jumped over the lazy dog Feature Selection 2-grams: quick brown brown fox fox jump jump over lazi dog 17
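
A rough sketch of the extraction steps from the last few slides (drop n-grams containing stop words, stem, count) might look like the following. The stop-word set holds only the examples from the slide, and `toy_stem` is a hypothetical, deliberately crude suffix stripper written for this example; a real system would more likely use an established stemmer such as Porter's.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "but", "of", "is"}  # only the slide's examples


def toy_stem(word):
    """Crude suffix stripper for illustration; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"  # mimics the "lazi" form shown on the slide
    return word


def extract_features(text, n=2):
    """Count stemmed word n-grams, skipping any n-gram with a stop word."""
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if any(w in STOP_WORDS for w in gram):
            continue
        counts[" ".join(toy_stem(w) for w in gram)] += 1
    return counts


features = extract_features("The quick brown fox jumped over the lazy dog")
print(list(features))
```

Under these assumptions the sentence reduces to the same five 2-grams listed on the slide, each with a count of one.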

18 Distance Measures What it sounds like: how far apart are these data points? Alternatively, how similar are they? More than one way to measure distance. 18

19 Distance Measures [Map: the Venetian and In-N-Out Burger] 19

20 Distance Measures Distance: 3 miles Straight line or Euclidean distance 20

21 Distance Measures Distance: 5 miles Manhattan distance 21
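
The two measures from the map slides can be compared directly in code. The coordinates below are made up for the example (a 3-4-5 triangle) and do not correspond to the locations or mileages on the slides.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Distance when travel is restricted to axis-aligned 'city blocks'."""
    return sum(abs(a - b) for a, b in zip(p, q))


start, end = (0.0, 0.0), (3.0, 4.0)  # hypothetical points
print(euclidean(start, end))
print(manhattan(start, end))
```

The same pair of points gets a different distance under each measure, which is the point of the slides: "how far apart" depends on how you choose to measure.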

22 String Distance Measures Distance measures for strings: edit distance, Hamming distance, Dice's coefficient. See Wikipedia category: String similarity measures. And these are just for strings! See Wikipedia category: Statistical distance measures. 22
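
For readers who have not met the three string measures named above, here is a minimal sketch of each. It assumes "edit distance" means Levenshtein distance and that Dice's coefficient is taken over sets of character bigrams; both are common conventions, but the talk does not say which variants it has in mind.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def dice(a, b):
    """Dice's coefficient over sets of character bigrams (1.0 = identical sets)."""
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


print(hamming("karolin", "kathrin"))
print(edit_distance("kitten", "sitting"))
print(dice("night", "nacht"))
```

Note that the first two are distances (smaller means more alike) while Dice's coefficient is a similarity (larger means more alike).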

23 String Distance Measures We want a distance measure for counts of n-grams, not just two strings. Cosine similarity: create a vector (arrow) for each set of strings. Measure the angle between those vectors. 23

24 Cosine Similarity Represent feature counts for each document as a vector. [Figure: axes "quick brown" and "fox jumped"] 24

25 Cosine Similarity The smaller this angle θ, the more similar the documents. [Figure: axes "quick brown" and "fox jumped"] 25

26 Cosine Similarity Extending to three dimensions (or features). [Figure: axes include "quick brown" and "fox jumped"] 26
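
The vector picture on these slides corresponds to a short computation: treat each document's n-gram counts as a vector and take the cosine of the angle between two of them. The sketch below stores the counts in `Counter` objects; the three sample documents are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = Counter({"quick brown": 2, "fox jumped": 4})
doc2 = Counter({"quick brown": 1, "fox jumped": 2})  # same proportions as doc1
doc3 = Counter({"quick brown": 3})                   # only one shared feature

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Because only the angle matters, `doc1` and `doc2` come out as maximally similar even though one is twice as "long" as the other. The same function works unchanged for any number of features.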

27 Cosine Similarity Math can handle any number of dimensions/features, but more features make the math more complicated. The Curse of Dimensionality: so many dimensions (features) that comparisons become too time consuming. Just select the best features. (Insert mathy stuff here.) Example: which is the best feature? "advanced persistent threat" vs. "quick brown" 27

28 Comparisons These documents are similar! 28

29 Comparisons Can find documents similar to any query: document, paragraph. Similar to a kind of fuzzy hashing. Signature is n-gram counts. 29

30 Clustering Can find clusters of similar documents Unsupervised machine learning Artificial intelligence Start with pile of documents Press go End up with clusters of similar documents Example: Documents A, B, C, D, E, F, and G 30

31 Exclusive Clusters Each document belongs to at most one cluster. Not all documents in a cluster are similar to each other. Some documents are not similar to any others: unique documents. 31

32 Non-Exclusive Clusters Each document can belong to any number of clusters Every document in a cluster is similar to the others 32
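
One simple way to get exclusive clusters with the properties described on slide 31 is to link any two documents whose similarity exceeds a threshold and then take connected components. This is an assumption made for illustration; the talk does not name a clustering algorithm, and the threshold and sample documents below are made up.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two feature-count dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def exclusive_clusters(docs, threshold):
    """Connected components of the 'similar enough' graph, via union-find."""
    parent = {name: name for name in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if cosine(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in docs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)


docs = {  # hypothetical feature counts
    "A": {"x": 1},
    "B": {"x": 1, "y": 1},
    "C": {"y": 1},
    "D": {"z": 1},
}
clusters = exclusive_clusters(docs, threshold=0.7)
print(clusters)
```

In this toy data A and C share no features, yet both land in one cluster because each is similar to B, while D ends up alone as a unique document. Requiring every pair in a cluster to be similar, as in the non-exclusive case, would need a different grouping rule.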

33 Classification Also known as: Predictive Coding, Assisted Machine Learning. Choose all documents which belong in my group. Documents responsive to the subpoena: A, C, D, G. Documents not responsive: B, E, F. 33

34 Classification User must create a set of training data: some documents which are in the group, some documents which are not in the group. Coding documents: 1. Yes 2. No 3. [skip] 4. Yes 5. Yes 6. No 7. No 34

35 Classification Artificial intelligence is just math There are many algorithms: Naïve Bayesian classifier K-Nearest Neighbor Locality Sensitive Hashing Decision Trees Neural Networks Hidden Markov Models See Wikipedia article on Classification (machine learning) 35

36 Naïve Bayesian Classifier Also used for spam detection, which is also a classification problem. P(B given A) = (P(B) * P(A given B)) / P(A). When a message contains features: P(spam given features) = P(spam) * P(features given spam) / P(features); P(notspam given features) = P(notspam) * P(features given notspam) / P(features). Which probability is greater? 36
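
The comparison on this slide can be turned into a tiny working classifier. The sketch below is one common formulation, not necessarily the one the speaker has in mind: it assumes features are independent given the label (the "naïve" part), uses add-one smoothing so unseen features do not zero out a score, and works in log space. The class name `NaiveBayes` and the training messages are invented.

```python
import math
from collections import Counter


class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()  # number of training documents per label
        self.feature_counts = {}     # label -> Counter of feature occurrences
        self.vocab = set()           # every feature seen in training

    def train(self, features, label):
        self.doc_counts[label] += 1
        self.feature_counts.setdefault(label, Counter()).update(features)
        self.vocab.update(features)

    def log_score(self, features, label):
        # log P(label) + sum of log P(feature given label). P(features) is
        # identical for every label, so it is left out of the comparison.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.feature_counts[label]
        denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
        for f in features:
            score += math.log((counts[f] + 1) / denom)
        return score

    def classify(self, features):
        """Return the label whose score is greater."""
        return max(self.doc_counts, key=lambda lb: self.log_score(features, lb))


nb = NaiveBayes()
nb.train(["cheap", "pills"], "spam")
nb.train(["cheap", "watches"], "spam")
nb.train(["meeting", "agenda"], "notspam")
nb.train(["lunch", "meeting"], "notspam")

print(nb.classify(["cheap", "pills", "now"]))
print(nb.classify(["meeting"]))
```

Dropping the shared denominator is why the slide can simply ask "which probability is greater?" without ever computing P(features).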

37 Decision Tree Build a flowchart of questions on the features. Each question should divide the data equally. Blackjack example: Is your total < 11? Have pair? Dealer have < 11? Split hands, Hit, Stay. 37

38 Decision Tree Quick to classify, but slow to construct. What questions are best at which point in the tree? [Insert mathy stuff here] You could make a career out of efficient decision tree generation. And people do. 38
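
One widely used answer to "which question is best at this point in the tree?" is information gain: prefer the question that most reduces the entropy of the labels. The talk leaves the math out, so treating information gain as the criterion is an assumption here, and the four blackjack hands below are invented toy data rather than real strategy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, question):
    """Entropy reduction from splitting the rows on a yes/no question."""
    yes = [lb for r, lb in zip(rows, labels) if question(r)]
    no = [lb for r, lb in zip(rows, labels) if not question(r)]
    if not yes or not no:
        return 0.0  # the question does not split the data at all
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - weighted


hands = [  # hypothetical training hands
    {"total": 8, "pair": False},
    {"total": 10, "pair": True},
    {"total": 17, "pair": False},
    {"total": 19, "pair": True},
]
actions = ["hit", "hit", "stay", "stay"]

gain_total = information_gain(hands, actions, lambda h: h["total"] < 11)
gain_pair = information_gain(hands, actions, lambda h: h["pair"])
print(gain_total, gain_pair)
```

On this toy data the "total < 11" question separates the actions perfectly while "have pair?" tells you nothing, so a tree builder using this criterion would ask about the total first. Evaluating every candidate question at every node is part of why construction is slow.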

39 Classifier Performance Run classifier on training data. Compare classifier results to known values. True value / classifier guess: 1. Yes / YES; 2. No / YES (false positive); 3. [skipped] / [skipped]; 4. Yes / YES; 5. Yes / NO (false negative); 6. No / NO; 7. No / YES (false positive). 39

40 Classifier Performance There are several measures of classifier performance Precision and Recall Receiver operating characteristic Aka ROC curve Confusion matrix 40

41 Precision and Recall Precision measures false positives: P = TP / (TP + FP). Recall measures false negatives: R = TP / (TP + FN). Both are on a scale from zero to one, one being perfect. 41

42 Precision and Recall True value / classifier guess: 1. Yes / YES; 2. No / YES (false positive); 3. [skipped] / [skipped]; 4. Yes / YES; 5. Yes / NO (false negative); 6. No / NO; 7. No / YES (false positive). TP = 2, FP = 2, FN = 1. Precision = TP / (TP + FP) = 2 / (2 + 2) = 0.5. Recall = TP / (TP + FN) = 2 / (2 + 1) ≈ 0.67. 42
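
The arithmetic on this slide is easy to check in code. The sketch below encodes the six coded documents from the table as (true value, classifier guess) pairs, leaving out the skipped document 3; the function name `precision_recall` is just a label chosen for this example.

```python
def precision_recall(pairs):
    """Return (tp, fp, fn, precision, recall) for (truth, guess) booleans."""
    tp = sum(1 for truth, guess in pairs if truth and guess)
    fp = sum(1 for truth, guess in pairs if not truth and guess)
    fn = sum(1 for truth, guess in pairs if truth and not guess)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


# (true value, classifier guess) for documents 1, 2, 4, 5, 6, 7
results = [
    (True, True),    # 1. Yes / YES
    (False, True),   # 2. No  / YES  (false positive)
    (True, True),    # 4. Yes / YES
    (True, False),   # 5. Yes / NO   (false negative)
    (False, False),  # 6. No  / NO
    (False, True),   # 7. No  / YES  (false positive)
]

tp, fp, fn, precision, recall = precision_recall(results)
print(tp, fp, fn, precision, round(recall, 2))
```

True negatives (document 6) appear in neither formula, which is why precision and recall are usually reported together with something like a confusion matrix.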

43 Classifier Performance If you're not happy with the performance, you can: add more training values (easy), change feature selection (moderate), change features (difficult), change algorithms (PITA). 43

44 Big Questions I have a billion documents. Which of these documents are similar to each other? Which of these documents belong in categories I've created? Responsive to a subpoena? Related to the Henderson account? New technology: select features, let computer do the work. 44

45 Outline Introduction Big Questions What Makes Things Similar? Feature Selection Feature Extraction Comparisons Clustering Classification 45

46 Questions? Jesse Kornblum 46
