The Random Sampling Road to Reasonableness. Reduce Risk and Cost by Employing a Complete and Integrated Validation Process


By: Michael R. Wade, Executive Vice President and Chief Technology Officer, Planet Data
June 2012

Table of Contents

Introduction
A Few Random Thoughts
Why It Works
Confidence Level
Confidence Interval
Conclusion

Planet Data Solutions. All Rights Reserved.

Since 2001 Planet Data has provided high-quality Discovery Management Services and Solutions to its clients. Planet Data today is also recognized for our strides in technology development, including leveraging The Cerulean Engine to create our proprietary processing platform, now known as Exego.

All data and examples presented in this white paper are intended for illustrative purposes only. Sampling rates and methodologies are always dictated by each individual case. We try to provide quality information, but we make no claims, promises or guarantees about the accuracy, completeness, or adequacy of the information contained herein. This white paper does not constitute legal advice in any jurisdiction. No part of this document may be reproduced or distributed without written consent and credit acknowledgement. The views expressed here are our own.

Contact: Laura Marques, Vice President, Marketing and Communications, LMarques@PlanetDS.com

Introduction

This paper was primarily inspired by the upsurge in discussions about predictive coding workflow and how to verify that the process is truly effective. In discussions with clients and other practitioners, it became quite clear that there is a wide range of understanding and opinions about how best to employ Random Sampling techniques for validation purposes.

This paper will address some fundamental aspects of Random Sampling to help users understand the basic concepts and make better use of these techniques. In particular, Random Sampling can be used to reduce cost and risk when it is properly incorporated into your ESI workflow. To achieve these goals, practitioners must have a clear understanding of the impact of the two basic settings, Confidence Level and Confidence Interval.

A Few Random Thoughts

For a lawyer or IT professional, random events are what we often fear most. They are by definition unpredictable, and for the most part uncontrollable. So it seems contradictory when we use the concept of randomness to validate and draw scientifically definable conclusions about the composition of large data sets.

Many people are highly skeptical of Random Sampling results. In many cases this is due to gross misuses of polling information in public settings, as well as a lack of comfort with the mathematics behind Random Sampling. As a result, we are not always comfortable about how to incorporate these techniques into a defensible and reasonable litigation strategy. This paper will address several important concepts that should help overcome this distrust and put us on the path to using Random Sampling techniques in everyday practice.

First, it seems counterintuitive to most of us that looking at randomly selected documents from a large set of documents could tell us very much about the collection.

Second, the application of Confidence Levels and Intervals is not clearly understood, particularly in terms of exactly what is being measured and how to build a sound and defensible document review strategy based upon those values.

[1] This calculation is based upon a 95% Confidence Level, a +/- 5% Confidence Interval, and a population size of one million documents.

A Confidence Level states the percentage of the time (on average) that we expect the observed number from a random sample to fall within the Confidence Interval of the real value. The Confidence Interval defines the margin of error within which we expect the real value to lie relative to the observed value (e.g. +/- 5%).

A common question, for example, is when to use a higher Confidence Level (CL) rather than a smaller Confidence Interval (CI).

NOTE: For the purposes of this article we will be measuring the number of documents that are responsive to some unknown criteria.

Why It Works

One of the most common questions concerning Random Sampling is how a small sample from a very large number of documents can give us information that is largely representative of the entire population. In basic terms, the answer is that we are measuring a very simple property with two possible values (yes/no), and that, when the sampling is properly implemented, the estimates of this property from random samples follow what is known as a normal distribution.

A key requirement for Random Sampling is that every document in the population has an equal chance of being selected. On average, we would expect to see about the same percentage of responsive documents in our random sample as we would in the overall population (within the specified margin of error or CI).

Figure 1: Ten trials of 384 samples at 95% CL and +/- 5% CI
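The sample size of 384 quoted for Figure 1 can be reproduced with the standard sample-size formula for a yes/no proportion, combined with a finite population correction. The paper does not say which formula it used, so the formula, the function name `sample_size`, and the rounding to the nearest whole document are assumptions. The sketch below is offered only because it agrees with the figure quoted above.

```python
from statistics import NormalDist


def sample_size(confidence_level, margin, population, proportion=0.5):
    """Estimate how many documents to sample for a yes/no property.

    Assumed formula (not stated in the paper): the normal-approximation
    sample size for a proportion, with a finite population correction.
    proportion=0.5 is the worst case, used when the true rate is unknown.
    """
    # Two-sided z-score, e.g. about 1.96 for a 95% Confidence Level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population.
    n0 = z * z * proportion * (1 - proportion) / margin ** 2
    # Shrink it slightly because the population is finite.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


# 95% CL, +/- 5% CI, 250,000 documents (the setup quoted for Figure 1).
print(sample_size(0.95, 0.05, 250_000))  # 384
```

Raising the population to one million documents gives the same 384 under this formula, which is one way to see why a small sample can speak for a very large collection.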

Figure 1 illustrates the result of performing ten trials of 384 random samples (95% CL and +/- 5% CI) against a population of 250,000 documents. Of this population, 50% are known (a priori) to be responsive. The aqua blue background shows the margin of error (or Confidence Interval) that you would use to predict the window within which the actual value would fall. As this chart shows, in all ten trials the actual value for the population was within the +/- 5% margin that we specified. If a person were to run more trials, we would expect that over time approximately 5% of the trials would predict a window where the true value would fall outside of the margin of error (CI).

By taking a sample of only 384 documents, we can predict with a 95% CL that the actual percentage of responsive documents in this population falls between 43.96% and 53.96% (using the first sample, which returned 48.96% as the percentage of responsive documents). Since we know that the actual value is 50% in this case, we can see that this prediction holds true.

Just as importantly, the results of the random sample trials follow what is known as a Normal Distribution. You may remember the famous Bell Curve in Figure 2 from high school statistics, and now you will see why that lesson was actually worthwhile!

Figure 2: Normal Distribution Curve

It is important to note that approximately 68% of the measurements (responsive/nonresponsive) from your random sample trials will fall within one standard deviation of the actual value in your overall population. This means that your results tend to fall fairly closely around the actual value.

To illustrate this with real world data, we have run 50,000 random trials against a population of one million documents and plotted a frequency diagram (i.e., a chart showing the percentage of responsive documents measured by each random sample trial and the number of times that each particular measured percentage occurred).

As can be seen in Figure 3, the results are a Normal Distribution or a Bell Curve. This further illustrates how the actual results of Random Sampling match what the theory predicts. In fact, the results fall within the statistical range as specified by the CL and CI.
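The repeated-trial experiments described above can be approximated with a small simulation. This is a minimal sketch on synthetic data, not the author's test harness: the function name `run_trials`, the fixed seed, the choice of 2,000 trials, and the convention that low document IDs are the responsive ones are all assumptions made for illustration.

```python
import random
from math import sqrt


def run_trials(population, responsive_rate, sample_size, trials, seed=0):
    """Return the responsive rate observed in each random sample trial."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    responsive_count = int(population * responsive_rate)
    observed = []
    for _ in range(trials):
        # Every document ID has an equal chance of being selected.
        picks = rng.sample(range(population), sample_size)
        # Synthetic convention: IDs below responsive_count are responsive.
        hits = sum(1 for doc_id in picks if doc_id < responsive_count)
        observed.append(hits / sample_size)
    return observed


true_rate = 0.50
rates = run_trials(250_000, true_rate, 384, trials=2_000)

# Share of trials whose observed rate is within +/- 5% of the true rate.
within_ci = sum(abs(r - true_rate) <= 0.05 for r in rates) / len(rates)

# Standard deviation of the sample proportion, and the share of trials
# that land within one standard deviation of the true rate.
std_dev = sqrt(true_rate * (1 - true_rate) / 384)
within_one_sd = sum(abs(r - true_rate) <= std_dev for r in rates) / len(rates)

print(f"within +/- 5%: {within_ci:.1%}")
print(f"within one standard deviation: {within_one_sd:.1%}")
```

With these settings, roughly 95% of trials would be expected to land inside the +/- 5% window and roughly two thirds within one standard deviation. The exact figures vary with the seed and the number of trials.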

This result is in agreement with what is commonly referred to as the Central Limit Theorem [2].

Figure 3: Frequency distribution of 50,000 Random Sampling trials against one million documents where 20% are responsive, using a 95% CL and +/- 2.5% CI.

Confidence Level (CL)

The Confidence Level is often poorly understood, and because of this, sampling validation decisions are often based upon faulty assumptions. Let's begin by discussing exactly what the CL means and how it impacts our decision making.

A CL of 95% simply states that when a series of random samples (trials) is taken, we expect on average that 95% of those measurements will fall within the CI (e.g. +/- 2.5%) around the actual true value. Or put another way, the actual number will fall within the CI of the measure observed in the random sample trial [3].

This seems pretty simple, but there are some assumptions that many of us are making without even realizing it. For example, if only ONE random sample is taken and it finds that 0% of the documents are responsive, can you be sure that this value falls within the CI of the actual value? As always, it depends upon what "sure" means. In this case, we used a 95% CL, so we can say that 95% of the time the actual value will fall within the CI of the observed measurement.

NOTE: Values that fall outside of the CI are referred to as outliers.

Unfortunately, this also means that there is a 1 in 20 chance (on average) that this measurement is an outlier and that the actual number could be very different from what this particular random sample trial would predict. Depending upon the importance of the data being measured, this may be too high a risk.

To mitigate this risk, you could perform additional random sample trials. For example, what are the odds that two random samples would both produce an outlier? Mathematically, the odds would be 5% x 5%, or 0.25% (using a 95% CL). So by taking two random samples, we have reduced the chance of being misled by an outlier in both samples to 0.25% (which is 1 out of 400 times). Additional sampling would reduce the odds even further.

It is imperative to understand that you will never know when the random sample that you are taking will produce an outlier. You will only know that, if you take multiple samples, the number of responsive documents predicted should fall within +/- the CI percentage of the actual value at the rate specified by the CL. It never means that the first sample taken is a good result, or even the second, only that over time we would expect 95% of the random sample trials to fall within the CI of the actual value.

Lesson learned: perform more than one random sample trial when possible.

Figure 4: 100 trials against one million documents with 95% CL and +/- 5% CI, with 20% of documents actually responsive.
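The outlier arithmetic above can be checked in a few lines. This sketch assumes the samples are independent and that each one has exactly a (1 - CL) chance of being an outlier. The function names are illustrative, and the binomial model for a run of 100 trials, such as the one in Figure 4, is a standard assumption rather than something the paper states.

```python
from math import comb


def prob_all_outliers(confidence_level, samples):
    """Chance that every one of several independent samples is an outlier."""
    return (1 - confidence_level) ** samples


def prob_outlier_count(confidence_level, trials, outliers):
    """Binomial chance of seeing exactly `outliers` outliers in `trials` trials."""
    p = 1 - confidence_level
    return comb(trials, outliers) * p ** outliers * (1 - p) ** (trials - outliers)


# Two samples at a 95% CL: 5% x 5% = 0.25%, or 1 in 400.
print(f"{prob_all_outliers(0.95, 2):.4%}")

# Across 100 trials at a 95% CL, how likely is each count of outliers?
for count in range(3, 8):
    print(count, f"{prob_outlier_count(0.95, 100, count):.1%}")
```

Under this model no single outlier count dominates, so a run of 100 trials can reasonably show a few more or a few fewer than five outliers.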

In Figure 4 we ran 100 trials at a 95% CL and found that five of the results were outliers (exactly as predicted by the math). It is important to note that it would not be uncommon to have only three outliers, or even six, when running this test. Remember, the 95% is a prediction that holds true over a large number of samples and is NOT a guarantee that it will be EXACTLY 95 out of every 100.

If we are concerned and want to reduce the risk of an outlier, we can change the CL, for example to 99%. When you increase the CL percentage, you will notice that the results of the random samples tend to cluster more closely around the actual value. We have reduced the likelihood that any one random sample will be an outlier from 5% to 1% (this is five times better).

Figure 5 illustrates how we ran 100 trials using a 99% CL and +/- 5% CI and saw zero outliers. Notice that while the CI is still +/- 5%, the results from each random sample are more closely grouped around the 20% level (the actual percentage of responsive documents in this test set).

Figure 5: 100 trials against one million documents using 99% CL and +/- 5% CI; 20% were actually responsive.

To achieve this more precise result, we increased our sample size minimally, from 246 documents to 424 documents [4]. In effect, we reduced the likelihood of an outlier result from a random sample by a factor of five.

[4] When the actual proportion of responsive documents is known, the formula used to calculate the sample size will incorporate that proportion. However, when the proportion is unknown, as is normally the case, you must use 50%, which is the worst case scenario. This is why the sample size in Figure 4 was 246 instead of 384 documents: the proportion of 20% was already known. This means that when sampling populations where the proportion is significantly different from 50%, the actual CL and CI are better than what was specified.
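The sample sizes in this passage (246 and 424 with a known 20% proportion, against the worst case of 50%) are consistent with the usual normal-approximation formula for a proportion with a finite population correction. That formula, the helper name `sample_size`, and the rounding to the nearest document are assumptions, since the paper does not state its method.

```python
from statistics import NormalDist


def sample_size(confidence_level, margin, population, proportion=0.5):
    """Assumed formula: normal approximation with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z * z * proportion * (1 - proportion) / margin ** 2
    return round(n0 / (1 + (n0 - 1) / population))


POPULATION = 1_000_000

# Known proportion of 20% responsive, +/- 5% CI.
print(sample_size(0.95, 0.05, POPULATION, proportion=0.20))  # 246
print(sample_size(0.99, 0.05, POPULATION, proportion=0.20))  # 424

# Unknown proportion: fall back to the worst case of 50%.
print(sample_size(0.95, 0.05, POPULATION))  # 384
print(sample_size(0.99, 0.05, POPULATION))  # 663
```

The term `proportion * (1 - proportion)` is largest at 50%, which is why an unknown proportion forces the larger, more conservative sample.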

This was achieved by only slightly increasing the number of documents that we had to review. As before, if we perform more than one random sample against the document population, we can greatly reduce the chance that our predicted value was an outlier (i.e. outside of the CI window from the actual value).

Confidence Interval (CI)

The Confidence Interval (CI), or margin of error, may be the most important concept that has to be understood to successfully employ Random Sampling techniques. This parameter sets the range of possible values within which the actual number of responsive documents is likely to fall. In simpler terms, it lets us know how close we can claim to be to the actual number of responsive documents in the full population!

So if we are sampling against one million documents using a 95% CL and a +/- 5% CI, and our random sample predicts that 20% of those documents are responsive, we can still only say that we are 95% certain that the ACTUAL NUMBER of responsive documents lies between 150,000 and 250,000 documents (a range of 100,000 documents).

Now think about the case where only 1% of the documents are actually responsive. Because the actual value is much smaller than the CI, the measured value and the actual value could differ significantly on a percentage basis. It is possible that your prediction of the actual number of documents could be off by a factor of five (because the CI window is so much larger than the actual value) and the prediction would still fall within the CI.

There are three primary methods for dealing with this issue.

First, you can decrease the CI percentage. As an example, you could reduce it from +/- 5% to +/- 1%. However, this significantly increases the number of documents that will need to be reviewed. For example, with a population size of one million documents, you would have to review 16,317 documents to achieve a 99% CL with a +/- 1% CI vs. 663 documents to achieve a 99% CL with a +/- 5% CI.
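The trade-off between a tighter CI and a larger review set can be made concrete. The sketch below assumes the normal-approximation sample-size formula with a finite population correction (not stated in the paper, though it matches the figures quoted above). `sample_size` and `document_range` are illustrative names.

```python
from statistics import NormalDist


def sample_size(confidence_level, margin, population, proportion=0.5):
    """Assumed formula: normal approximation with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z * z * proportion * (1 - proportion) / margin ** 2
    return round(n0 / (1 + (n0 - 1) / population))


def document_range(observed_rate, margin, population):
    """Translate an observed rate and a +/- margin into a document-count window."""
    low = max(0.0, observed_rate - margin)
    high = min(1.0, observed_rate + margin)
    return round(low * population), round(high * population)


POPULATION = 1_000_000

# Review effort at a 99% CL for a +/- 1% CI versus a +/- 5% CI.
print(sample_size(0.99, 0.01, POPULATION))  # 16317
print(sample_size(0.99, 0.05, POPULATION))  # 663

# Window on the actual count when the sample reports 20% responsive.
print(document_range(0.20, 0.05, POPULATION))  # (150000, 250000)
print(document_range(0.20, 0.01, POPULATION))  # (190000, 210000)

# Low-prevalence case: 1% observed with a +/- 5% CI.
print(document_range(0.01, 0.05, POPULATION))  # (0, 60000)
```

In the low-prevalence case the window runs from zero to six times the observed count, which is the situation the text describes when the CI is much larger than the value being measured.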

This change will reduce the window from a range of 100,000 documents to a range of 20,000 documents. While this is still a large range, particularly when we are looking for documents that have a low frequency of occurrence, it is still a significantly better prediction.

The second method is the use of Judgmental Sampling. For example, if we know that the responsive documents are most likely going to be found within two custodians who have a total of 10,000 documents between them, we can take a sample of just that set of documents. If we use a 99% CL and a +/- 1% CI, we would have to sample 6,239 documents of the 10,000, but the margin of error would only span 200 documents. You can then sample the remaining population (all documents except those from these two custodians) using a higher CI (or even the same) to confirm the assumption that the responsive documents fall predominantly within the two custodians.

Finally, you can use iterative Random Sampling to significantly reduce the overall risk of leaving behind responsive materials or being misled by an outlier. A typical method is to take an initial sample using a fairly low CL (e.g. 95%) and a moderate CI (+/- 5%) to search for responsive documents [5]. Based on the responsive documents actually found, new search criteria are then developed (using any combination of keyword search, metadata filtering, concept search, and document similarity). The responsive documents these criteria return can be removed from the population, and a new round of sampling is performed. If any new responsive documents are found, you repeat the entire process. After multiple rounds are finished and no new responsive documents are found, you can then do an additional round of Random Sampling using tighter statistics (e.g. 99% CL and +/- 2% CI). If at this point more responsive documents are found, you can then repeat the entire process. Because sampling with the looser constraints does not involve looking at a large number of documents, this is still a very efficient process. The final sampling rounds are done at higher CLs and lower CIs to ensure that we were not missing anything in the other sampling rounds.

[5] The CL and CI used in this example are for illustrative purposes only. Every activity in a case must be weighed against the risk associated with getting it wrong.
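The iterative workflow described above can be sketched as a loop. Everything here is a hypothetical illustration on synthetic data: `iterative_sampling`, `is_responsive` (standing in for a reviewer's decision), and `expand_search` (standing in for keyword, metadata, concept, or similarity search) are invented names, and the sample-size formula is the assumed normal approximation with a finite population correction.

```python
import random
from statistics import NormalDist


def sample_size(confidence_level, margin, population, proportion=0.5):
    """Assumed formula: normal approximation with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z * z * proportion * (1 - proportion) / margin ** 2
    return max(1, round(n0 / (1 + (n0 - 1) / population)))


def iterative_sampling(doc_ids, is_responsive, expand_search,
                       loose=(0.95, 0.05), tight=(0.99, 0.02),
                       seed=0, max_rounds=50):
    """Sample, refine the search, remove what was found, and repeat."""
    rng = random.Random(seed)
    remaining = set(doc_ids)
    found = set()
    settings = loose
    for _ in range(max_rounds):
        if not remaining:
            break
        n = min(len(remaining), sample_size(*settings, len(remaining)))
        sample = rng.sample(sorted(remaining), n)
        hits = {doc for doc in sample if is_responsive(doc)}
        if hits:
            # New search criteria built from the hits pull related documents.
            pulled = hits | set(expand_search(hits, remaining))
            found |= pulled
            remaining -= pulled
            settings = loose   # new material found: repeat the entire process
        elif settings == loose:
            settings = tight   # nothing new: confirm with tighter statistics
        else:
            break              # the tight round was also clean: stop
    return found, remaining


# Synthetic data: 20,000 documents in topics of 100; three topics are responsive.
responsive_topics = {5, 17, 42}
all_docs = range(20_000)


def is_responsive(doc):
    return doc // 100 in responsive_topics


def expand_search(hits, remaining):
    topics = {doc // 100 for doc in hits}
    return {doc for doc in remaining if doc // 100 in topics}


found, remaining = iterative_sampling(all_docs, is_responsive, expand_search)
print(len(found), "documents pulled for review;", len(remaining), "left behind")
```

The loop only stops after a round at the tighter settings finds nothing new, mirroring the final confirmation round in the text. In a real matter the stopping rule and the CL and CI values would be set by the risk profile of the case.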

Conclusion

Random Sampling has limited value unless it is incorporated into an overall workflow and strategy. However, by using it to help validate all parts of your ESI and Review processes, you can greatly reduce both costs and risk at the same time. When you understand how the Confidence Level and Confidence Interval affect the outcome of sampling, you will be ready to employ these techniques throughout the entire process.

Random Sampling is not only used to protect against missing data; it can also be used to ensure that you are using efficient processes to find responsive documents from the very beginning of the case. Instead of just testing to see if anything was left behind, use Random Sampling to test how effective the searching methodology is in finding responsive documents. Before sending large numbers of documents for review (or deciding which documents will NOT be reviewed), take samples and form statistically valid opinions about the effectiveness of the techniques employed. For example, if only 5% of the documents being returned by a sample are responsive, we can be fairly certain that the process used to find those documents can be significantly improved.

In conclusion, combining Random Sampling with a strong and repeatable workflow is the key to good results and a defensible process.

Contact: Michael R. Wade, Mike@PlanetDS.com

About the Author: Mr. Wade has led several developmental efforts in the information, knowledge and document management areas. In 1988 Mr. Wade was CTO and principal of Switzerland's Tecomac AG, a company that focused on developing knowledge management solutions for large European corporations and private banks. During his time with Tecomac, he secured several patents for data compression technologies. Mr. Wade became involved in the litigation support industry more than a decade ago as one of the founders and CTO of Cerulean LLC, which was acquired by Planet Data. Mr. Wade has a B.S. in Accounting with a minor in Computer Science from Virginia Tech.