A Practitioner's Guide to Statistical Sampling in E-Discovery (October 16, 2012)





Meet the Panelists
- Maura R. Grossman, Counsel at Wachtell, Lipton, Rosen & Katz
- Gordon V. Cormack, Professor at the David R. Cheriton School of Computer Science at the University of Waterloo
- Jim Wagner, Co-founder and CEO of DiscoverReady
- Maureen O'Neill, SVP, Marketplace Leader at DiscoverReady

Agenda
- What is statistical sampling?
- Why should a practitioner use statistical sampling in e-discovery?
- Opportunities to use statistical sampling
- The basics of statistical sampling
- An example of statistical sampling in the e-discovery context
- Key decisions when using statistical sampling
- Take-away recommendations on using statistical sampling

What is Statistical Sampling?
In general, statistical sampling is a method to estimate a characteristic of a large population by examining only a subset of it. In the specific context of e-discovery:
- Estimate: a reasonably precise mathematical measurement.
- Characteristic: the number (or proportion) of items in the document population having a certain property, such as responsiveness or privilege.
- Population: a collection of electronic documents.
- Subset: a small but representative sample of the document population, chosen at random.
Note that the size of the subset (the sample size) determines the precision of the estimate. Sample size depends on the acceptable margin of error, the desired confidence level, and, to a negligible extent, the size of the population (these variables will be discussed later in the webinar).

What is Statistical Sampling?
Judgmental sampling can be very useful in e-discovery, but it is not what we are discussing in today's webinar. Statistical sampling is not the same thing as judgmental sampling.
- Judgmental sampling is not random: it involves selecting items for the subset using some degree of human judgment.
- When sampling is judgmental, inferences cannot be drawn about the population based on an examination of the subset.
- Judgmental sampling is akin to spot-checking.

Why Use Statistical Sampling?
Thoughtful use of statistical sampling can improve the quality, efficiency, and defensibility of e-discovery efforts.
Quality
- By counting or measuring the inputs and outputs of an e-discovery process, we can work to improve the process and make it more accurate.
- For example: if we find that a proposed search term is bringing in too many false positives (i.e., poor "precision"), we can try a different search term and test the results using sampling. If statistical sampling confirms that the new term reduces the number of irrelevant documents (better precision) but is not under-inclusive (recall remains good), we have improved the search process.
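To make precision and recall concrete, here is a minimal Python sketch (not part of the original webinar) that computes both from hypothetical sample counts; the specific numbers are assumptions chosen only for illustration.

    # Hypothetical counts from a reviewed sample of search-term hits and non-hits.
    true_positives = 120   # sampled hits that reviewers judged relevant
    false_positives = 380  # sampled hits that reviewers judged not relevant
    false_negatives = 30   # sampled non-hits that reviewers judged relevant

    precision = true_positives / (true_positives + false_positives)  # share of hits that are relevant
    recall = true_positives / (true_positives + false_negatives)     # share of relevant documents the term finds

    print(f"precision = {precision:.0%}, recall = {recall:.0%}")
    # precision = 24%, recall = 80%  -> many false positives, but few relevant documents missed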

Why Use Statistical Sampling?
Efficiency
- Sampling saves time and money by allowing us to count or measure things more efficiently.
- Statistical sampling is scalable: even with extremely large populations, we can use relatively small samples.
- For example, if we want to estimate the proportion of relevant documents (the "richness" or "prevalence") in a collection of one million documents, an appropriate sample size might be 600 documents. If we want to estimate the proportion of relevant documents in a collection of a billion documents, the appropriate sample size would remain the same.
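The standard sample-size formula shows why the sample size barely changes as the collection grows. The sketch below is our illustration (the function name and defaults are ours, not the webinar's); it uses the normal approximation with the conservative assumption p = 0.5 and a finite-population correction.

    import math

    def sample_size(margin_of_error=0.04, z=1.96, p=0.5, population=None):
        # n0 = z^2 * p(1-p) / e^2, then a finite-population correction when N is known.
        n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
        if population is None:
            return math.ceil(n0)
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    print(sample_size(population=1_000_000))      # 600 documents for a 1,000,000-document collection
    print(sample_size(population=1_000_000_000))  # 601 documents for a 1,000,000,000-document collection

At roughly a +/- 4% margin of error and 95% confidence, about 600 documents suffice whether the collection holds one million or one billion documents.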

Why Use Statistical Sampling?
Although its use in e-discovery is not yet widespread, we are moving in a direction where it soon will be considered best practice.
Defensibility
Courts are increasingly requiring the use of sampling as part of a reasonable discovery process:
- In re Seroquel Prods. Liab. Litig., 244 F.R.D. 650 (M.D. Fla. 2007)
- Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008)
- William A. Gross Constr. Assocs., Inc. v. American Mfrs. Mut. Ins. Co., 256 F.R.D. 134 (S.D.N.Y. 2009)
- Mt. Hawley Ins. Co. v. Felman Prod., Inc., 2010 WL 199055 (S.D. W. Va. May 18, 2010)
- Da Silva Moore v. Publicis Groupe, No. 11 Civ. 1279 (S.D.N.Y. Feb. 24, 2012) (Peck, M.J.), aff'd (S.D.N.Y. Apr. 26, 2012) (Carter, D.J.)
- In re Actos, MDL No. 6:11-md-2299 (W.D. La. Jul. 27, 2012)

Opportunities to Use Statistical Sampling
- Incorporate sampling into early case assessment and strategy development.
- Efficiently home in on the sources and custodians of information likely to be relevant.
- Assess the burdens and costs involved in accessing certain information, such as backup tapes or other offline media.
- Gauge the richness of populations before embarking on review. (Later in the program, we will take a step-by-step walk through this use of sampling, as an example of how sampling is performed.)
- Test the culling of a data set to ensure that your cull is neither over-broad nor too restrictive.

Opportunities to Use Statistical Sampling
- Measure the efficacy of search terms and refine the terms.
- Measure the accuracy of a predictive coding process.
- Test automated methods of screening documents for privilege and confidentiality.
- Sample a document production before it goes out the door, to provide additional assurance that privileged content is not inadvertently included.

Opportunities to Use Statistical Sampling
Support proportionality arguments
- Determine whether the cost of reviewing certain types of ESI is reasonable and proportional.
- Is there bang for the buck in reviewing a particular set of documents, based on how many responsive documents are estimated to be found? For example:
  - Should we spend money on reviewing this custodian's documents if the collection has very low prevalence?
  - Is it worth the expense to continue reviewing more documents from more custodians if that additional effort is not likely to yield significantly more relevant information?

Opportunities to Use Statistical Sampling
Conduct quality control and quality assurance of human review efforts
- Measure the error rate on document review decisions for an overall project, or for particular reviewers.
- When done in real time, as the project progresses, error rate measurement can be part of an effective quality control ("QC") workflow.
- When done at the conclusion of the project, or of a phase of the project, the measurement becomes part of the quality assurance ("QA") and defensibility of the process.

Opportunities to Use Statistical Sampling
An important caveat about taking statistical measurements using human decisions as a reference point or "gold standard":
- Human decisions about documents inevitably involve an element of subjectivity, and even the best decision-makers will make mistakes. (This is true for all decisions, whether in the original review, a QC review, or a sampling review.) Even the gold standard decisions are not going to be 100% consistent or correct.
- This element of human error, or of legitimate differences of opinion, will always introduce some degree of measurement error into any statistical measurement involving human decision-making. Therefore, statistical measurement cannot be more accurate than human judgment permits.

The Basics of Statistical Sampling: Drawing a Random Sample
What is a random sample and how is it generated?
- A random sample is a subset of documents chosen at random from a larger population of interest. Choosing at random means that every document in the collection has an equal chance of being selected in the sample.
- Random sampling can be achieved in a number of different ways:
  - Drawing numbers from a hat
  - Using a computerized random-number generator
  - Choosing documents based on one or more digits of their hash values
  - Many e-discovery tools have built-in random sample generators
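As a small illustration of the "computerized random-number generator" approach, the Python sketch below draws a simple random sample of document IDs. This is our sketch, not the panel's: in practice the IDs would come from the review platform, and here they are just a stand-in range.

    import random

    doc_ids = range(1, 1_000_001)          # hypothetical collection of 1,000,000 document IDs
    rng = random.Random(2012)              # a fixed seed makes the draw reproducible and easy to document
    sample = rng.sample(doc_ids, k=1_000)  # every document has an equal chance of being selected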

The Basics of Statistical Sampling: Drawing a Random Sample
A random sample has been drawn; now what?
- Review the sample and count the number of documents with the characteristic of interest (e.g., responsiveness).
- The proportion of responsive documents in the sample is calculated by dividing the number of responsive documents in the sample by the total number of documents in the sample.
- Because the sample is random, we can extrapolate that the proportion of responsive documents in the population is approximately the same as the proportion in the sample.
- When we say "approximately the same," we mean that there is a margin of error in our estimate; we'll explain margin of error in a moment.

The Basics of Statistical Sampling: Drawing a Random Sample
- Suppose we have a collection of one million documents, and we want to estimate how many of them are relevant for the purposes of building a budget and timeline for review and production.
- Reviewing all one million documents for the purposes of budgeting and project planning is infeasible. So, instead, we take a random sample of 1,000 documents.
- We review the sample and find that 300 documents are relevant. The proportion of relevant documents in the sample is 300/1,000, or 30%.
- Therefore, the proportion of relevant documents in the collection is estimated to be approximately 30% (or 300,000 documents).
- If we repeated this process, we would get a slightly different estimate each time, but in general, each estimate would be close to the actual proportion.

The Basics of Statistical Sampling: Margin of Error / Confidence Interval
The margin of error is a way of expressing a range, above and below the estimate, that is likely to contain the actual value. In our example:
- We determined that the proportion of relevant documents in the sample was 30%, and we extrapolated that the proportion of relevant documents in the collection was approximately, but not exactly, 30%.
- The exact proportion of relevant documents in the collection is unknown, but it is likely to fall within a margin of error of +/- 3%.
- We express this by saying that the proportion of relevant documents in the collection is estimated to be 30%, plus or minus 3%.
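The +/- 3% figure can be reproduced with the usual normal-approximation formula for the margin of error of a proportion, MOE = z * sqrt(p(1-p)/n). The short sketch below is our illustration, not the panel's calculation.

    import math

    n, relevant = 1_000, 300
    p_hat = relevant / n                            # point estimate: 0.30
    z = 1.96                                        # two-sided 95% confidence level
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error of the estimate
    print(f"{p_hat:.0%} +/- {moe:.1%}")             # 30% +/- 2.8%, which the slide rounds to +/- 3%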

The Basics of Statistical Sampling: Margin of Error / Confidence Interval
An alternative way of stating the estimate is to use a confidence interval instead of a margin of error.
- The confidence interval is the range of values that is likely to contain the actual value. In our previous example, we would state that the proportion of relevant documents is likely to fall within a range of 27% to 33%.
- As compared to the margin of error, the confidence interval does not have to be exactly symmetrical around the estimate, and it can therefore be a more precise way of expressing the uncertainty of the estimate.

The Basics of Statistical Sampling: Confidence Level
What do we mean when we say the confidence interval is "likely" to contain the actual value?
- The confidence level is the probability that the confidence interval would contain the actual value if the sampling process were repeated a large number of times. For example:
- If the confidence level is 95%, there is a 95% chance that the actual value is within the confidence interval. (In our previous example, we would say, with 95% confidence, that the proportion of relevant documents falls between 27% and 33%.)
- If the confidence level is 99%, there is a 99% chance that the actual value is within the confidence interval. (In our previous example, we would say, with 99% confidence, that the proportion of relevant documents falls between 26% and 34%.)

The Basics of Statistical Sampling: The Relationship Between Confidence Level, Margin of Error, and Sample Size
The three concepts of confidence level, confidence interval (or margin of error), and sample size are interrelated.
- Generally speaking, if the confidence level remains constant, the margin of error goes down as the sample size goes up.
- Similarly, to increase the confidence level, either the sample size or the margin of error must increase.
- Finally, if you want to decrease your margin of error, either the confidence level must come down or the sample size must go up.
- Bear in mind that these relationships are not proportional: obtaining a smaller confidence interval, or a higher confidence level, may require drawing a much larger sample.
To illustrate this, consider again the previous example, and assume that we find the proportion of relevant documents to be 30% in each sample. The table below shows the relationship between sample size, confidence level, and margin of error.

The Basics of Statistical Sampling: The Relationship Between Confidence Level, Margin of Error, and Sample Size
(Estimated proportion of relevant documents: 30% in each sample)

Sample Size    Margin of Error (95% confidence)    Margin of Error (99% confidence)
4,000          +/- 1.4%                            +/- 1.9%
1,000          +/- 3%                              +/- 4%
500            +/- 4%                              +/- 5.3%
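The margins of error in the table follow from the same normal-approximation formula used above. The sketch below is an illustration we added, using the conventional z-scores 1.96 and 2.576; it is not taken from the webinar.

    import math

    p_hat = 0.30                            # assumed proportion observed in each sample
    z_scores = {"95%": 1.96, "99%": 2.576}  # approximate two-sided z-scores
    for n in (4_000, 1_000, 500):
        row = {level: z * math.sqrt(p_hat * (1 - p_hat) / n) for level, z in z_scores.items()}
        print(n, {level: f"+/-{m:.1%}" for level, m in row.items()})
    # 4000 -> +/-1.4% (95%), +/-1.9% (99%)
    # 1000 -> +/-2.8% (95%), +/-3.7% (99%)   (rounded to 3% and 4% in the table)
    #  500 -> +/-4.0% (95%), +/-5.3% (99%)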

The Basics of Statistical Sampling: The Relationship Between Confidence Level, Margin of Error, and Sample Size
Here is another illustration, one that also demonstrates how taking relatively small samples can help us understand very large populations. We have a collection of one million documents; we assume the proportion of relevant documents is 50%, and we want a 95% confidence level for our estimate. Here are the (pretty good) margins of error we will achieve with some relatively small sample sizes:

Sample Size    Margin of Error
500            4.4%
600            4.0%
700            3.7%

The Basics of Statistical Sampling: The Tradeoffs
- In general, a higher confidence level is better. But that comes at the price of a larger sample size and/or a wider confidence interval (i.e., a higher margin of error).
- Likewise, a smaller margin of error generally is better, because it reflects a more precise estimate. But that requires a lower confidence level (less certainty) and/or a larger sample size (higher cost and/or less efficiency).
- There are calculators available on the Internet that allow you to plug in the variables and compute sample size, margin of error, and confidence level, but be careful when choosing one. Make sure you understand the assumptions each one uses, or, if you don't, get help from someone who does!

The Basics of Statistical Sampling: Standards for E-Discovery?
Is there a minimum acceptable confidence level and/or margin of error when using statistical sampling in e-discovery?
- There are no bright-line rules regarding confidence level or margin of error.
- A 95% confidence level is commonly used in statistical measurement.
- The acceptable margin of error will depend on the consequences of an inaccurate estimate.
- The operative standard is one of reasonableness.

The Basics of Statistical Sampling: Standards for E-Discovery?
Is there a minimum acceptable confidence level and/or margin of error when using statistical sampling in e-discovery?
- Every matter is different, and what is reasonable in one matter may not be in another.
- Factors affecting the reasonableness calculus include:
  - The cost of greater precision in measurement as compared to the amount at stake and the importance of the matter (proportionality)
  - The purpose for which the sampling is being performed
  - The time and resources available for sampling

Examples of Statistical Sampling: Example 1
How many responsive documents are likely to be found in a collection of 1,000,000 documents?
1. Determine the total number of documents in the collection. We'll call that number N (e.g., N = 1,000,000 documents).
2. Choose the desired confidence level (e.g., 95%) and margin of error (e.g., +/- 2%).
3. Determine the appropriate sample size (we'll call that n) using an appropriate calculator (e.g., n = 2,395).
4. Select 2,395 of the documents, at random, from the collection.
5. Review the sample and count the number of responsive documents (say, 700; the remaining 1,695 are not responsive).
6. The proportion of responsive documents in the sample is calculated by dividing the number of responsive documents in the sample (700) by n (2,395): 700/2,395 = 29.2%.
7. The confidence interval is 29.2% (the proportion of responsive documents) plus or minus 2% (the margin of error): 27.2% to 31.2%.
8. To estimate the number of responsive documents in the collection, multiply the two confidence limits by N (27.2% x 1,000,000 = 272,000 responsive documents; 31.2% x 1,000,000 = 312,000 responsive documents).
9. In this example, we can state, with 95% confidence, that we estimate there are between 272,000 and 312,000 responsive documents in the collection.
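Putting Example 1 together, the short sketch below (our illustration, mirroring the slide's rounding) computes the interval and scales it to the collection.

    N, n, responsive = 1_000_000, 2_395, 700
    p_hat = round(responsive / n, 3)      # 700 / 2,395 = 29.2% (rounded, as on the slide)
    moe = 0.02                            # the +/- 2% margin chosen when the sample was sized
    low, high = p_hat - moe, p_hat + moe
    print(f"{low:.1%} to {high:.1%} of the collection")                        # 27.2% to 31.2%
    print(f"{round(low * N):,} to {round(high * N):,} responsive documents")   # 272,000 to 312,000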

Examples of Statistical Sampling: Example 2
How many responsive documents are likely to be found in the document collection?
1. Let's tweak Example 1 slightly, and assume that the number of responsive documents in the reviewed sample is only 3.
2. The proportion of responsive documents in this example is 3/2,395 = 0.13%, and applying the +/- 2% margin of error gives a confidence interval of -1.87% to +2.13% (-18,700 to 21,300 relevant documents).
3. Obviously, it is not possible to have a negative number of relevant documents in the collection. This scenario illustrates the problem faced when sampling populations with a low proportion of relevant documents (i.e., low "prevalence" or low "richness").
4. When sampling such populations, the normal calculation of confidence intervals and margin of error used above is not precise enough. When dealing with a low-prevalence population, a more precise confidence interval can be computed using a binomial confidence interval calculator.

Examples of Statistical Sampling: Example 2
How many responsive documents are likely to be found in the document collection?
5. Here is an example of a binomial confidence interval calculator: http://statpages.org/confint.html
6. If we take our sample of 2,395 documents and re-compute the confidence interval at the 95% confidence level using the binomial confidence interval calculator, then with three responsive documents in the sample, the confidence interval is 0.03% to 0.37% (300 to 3,700 documents).
7. This is an example of how the confidence interval may not be symmetrical around the estimate.
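One standard way to compute an exact (Clopper-Pearson) binomial interval of this kind is via the beta distribution. The sketch below is our illustration, assumes SciPy is available, and is not necessarily the method used by the linked calculator.

    from scipy.stats import beta

    def clopper_pearson(successes, n, confidence=0.95):
        # Exact (Clopper-Pearson) binomial confidence interval for a proportion.
        alpha = 1 - confidence
        lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
        upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
        return lower, upper

    low, high = clopper_pearson(3, 2_395)
    print(f"{low:.2%} to {high:.2%}")   # roughly 0.03% to 0.37%, i.e., about 300 to 3,700 of 1,000,000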

Using Statistical Sampling: Key Decisions a Practitioner Must Make
- What are you trying to measure? What are you trying to accomplish with the measurement?
- Are you sampling for internal purposes (to improve a process) or to defend your process to opposing counsel or the court?
- Do you need to hire an expert in statistics?
- What are your acceptable levels of recall, precision, etc.?
- What is your desired confidence level? How wide a confidence interval will you accept?
- How large a sample size is feasible under the circumstances of your matter, taking into account cost and timing considerations?
- How will you document and report your measurements and methodology?
- What aspects of your statistical sampling efforts are you willing to share with the opposing party? With the court?

Take-Away Recommendations on Using Statistical Sampling
Statistical sampling can be tricky; most lawyers do not have sufficient understanding of the concepts (or the math) to go it alone and get it right. Practitioners should use an expert or statistical tool to:
- Determine whether statistical sampling is an appropriate tool for the matter
- Determine which statistical measures are best for the situation (precision, recall, elusion, accuracy, etc.)
- Determine the desired confidence level and margin of error
- Calculate the proper sample size to achieve this confidence level and margin of error
- Draw a random sample of the appropriate size
- Compute the estimate

Question & Answer Session
To contact the panelists:
- Maura R. Grossman, Counsel at Wachtell, Lipton, Rosen & Katz: mrgrossman@wlrk.com
- Jim Wagner, Co-founder and CEO of DiscoverReady: jim.wagner@discoverready.com
- Gordon V. Cormack, Professor at the David R. Cheriton School of Computer Science at the University of Waterloo: cormack@cormack.uwaterloo.ca
- Maureen O'Neill, SVP, Marketplace Leader at DiscoverReady: maureen.oneill@discoverready.com