Quality Control for predictive coding in ediscovery
kpmg.com
Advances in technology are changing the way organizations perform ediscovery. Most notably, predictive coding, or technology-assisted review, is becoming more widely accepted as part of the document review process. While it promises to be a powerful tool to reduce ediscovery costs, the strategies, implications, and leading practices for predictive coding are still evolving.

More and more courts are taking up the question of whether the use of predictive coding is acceptable under the rules of civil procedure. The issue is whether the use of technology to replace human review is sufficient to discharge the parties' discovery obligations. Predictive coding has been the subject of recent court decisions, but there has not been a definitive endorsement by a court in a case in which one party objected to the use of predictive coding by the other. [1] It is clear, however, that high standards of quality control during predictive coding will help lower the risk of a dispute over an ediscovery tool.

While the courts' view will continue to evolve over time, predictive coding appears likely to become a standard tool in ediscovery, and litigants should change their approach to quality control in ediscovery as a result. This paper discusses several strategies that companies can use now to improve quality control while using predictive coding for document review.

[Figure: Training Phase: Select Sample, Code, Train, Measure, Improve. Application Phase: Cull, Review, Validate, Produce.]

[1] See Da Silva Moore v. Publicis Groupe & MSL Group, No. 11 Civ. 1279 (ALC) (AJP), 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012) (endorsing predictive coding where both parties agreed to use it but argued over methodology); Kleen Products v. Packaging Corp. of America, No. 10 C 5711, 2012 U.S. Dist. LEXIS 139632 (N.D. Ill. Sept. 28, 2012) (parties ultimately agreed to use Boolean search terms instead of predictive coding); EORHB, Inc. et al. v. HOA Holdings, LLC, C.A. No. 7409-VCL (Del. Ch. Oct. 15, 2012) (court sua sponte ordered parties to use predictive coding).
How technology has changed the ediscovery process

In addition to increasing efficiency and reducing costs, technology has made the ediscovery process more complex, and this complexity increases risk. Statistical sampling has become a vital tool for quality control under predictive coding, and the complexity of the predictive coding work flow has highlighted the need for project management strategies to mitigate risk.

Under the traditional model of document review, documents were examined, coded, and organized by hand. The first electronic tools for document review were designed to recreate this process: electronic documents were divided into assignments in the order in which they were loaded onto the software, and each document was reviewed linearly. Using an electronic document depository with the ability to extract metadata from each document, however, has enabled work flow improvements and allowed for stratification, prioritization, and contextualization. Stratification means that file types like spreadsheets or image files can be handled separately, perhaps even by specialized review teams. Prioritization based on search terms, date ranges, or other metadata can be used to similar effect. Finally, technology permits documents to be contextualized by extracting concepts and presenting the reviewer with clusters of similar documents. These strategies have improved review efficiency and quality.
The need for improved quality control

While technical tools have led to improvements, the increased complexity only emphasizes the importance of quality control during document review, and statistical sampling is an important tool in this regard. Used early in the process, the insight gained from statistical sampling can be used to refine the training of reviewers, adjust the work flow of second-level review, or improve search-term lists and other strategies used to guide reviewers. When predictive coding technology is used, statistical sampling is indispensable.

Predictive coding projects proceed in two distinct phases: the training phase and the application phase. The training phase involves the review of documents in order to train the classifier. The application phase involves the classifier making decisions about the documents not reviewed during the training phase. Each phase requires a different approach to quality control. In addition, underlying both the training and application phases is the issue of work flow complexity, which deals with incremental data loads and with document types that either cannot be classified or that require special treatment.

Training: Predictive coding software must be trained to recognize the same distinctions between responsive and nonresponsive documents that a person reviewing the documents would recognize. The benefit of predictive coding lies in the fact that human decisions are limited to the training set and are then leveraged across the entire body of documents. Thus, each decision on a training document potentially disposes of many documents. The primary quality-control goal during the training phase is to achieve consistent and accurate coding of the training documents, since consistency and accuracy in the training documents will determine the success of the predictive coding project.

Human reviewers tend to be inconsistent, including during the training phase. Their views on the classification of documents evolve, and mistakes happen. With a large set of training documents, the predictive-coding software has the capacity, within limits, to correct mistakes made in a small portion of the documents during training, and the software will still learn the correct coding for the type of document. The impact of inconsistent coding during the training phase thus depends on the absolute number of training documents that are relevant to the specific issue. For this reason, it is important to monitor the prevalence, also known as richness, of responsive documents among nonresponsive documents during the training phase.

Particular care should be taken to conduct effective quality control during the training phase. One approach is a double-blind review of the training set, in which the training documents are reviewed by two independent reviewers or review teams. Documents on which the reviewers disagree are then reviewed by a subject-matter authority, who resolves the disagreement before the documents become the basis for training the classifier. The same effect can be achieved by using the predictive-coding classifier to review the training documents. Predictive coding classifiers abstract from specific training documents to discover patterns and similarities, and the classifier will often suggest that training documents be coded differently. Those documents should be reviewed again, ideally by a subject-matter authority, to resolve the issue. The classifier can then be retrained to avoid the problem in the future. Notably, some predictive coding solutions feature built-in consistency checks designed to eliminate disagreements between the authority and the software.

Application: There are several options for how predictive coding can be applied. In a traditional coding work flow, where each document is reviewed, predictive coding can be used to reveal inconsistencies and function as a powerful quality-control tool. Predictive coding can also be used as the basis of production decisions by separating out documents that were not reviewed but were classified as responsive by the predictive coding technology. In the most popular approach, predictive coding is used to eliminate from further review the documents that were classified as nonresponsive, while documents classified as responsive are then reviewed. Because the majority of documents are typically classified as nonresponsive, this last approach improves efficiency while eliminating the risk of producing documents that were not reviewed by an attorney.
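The training-phase checks described in this section, double-blind review and prevalence monitoring, can be sketched in a few lines of Python. This is a minimal illustration and not a description of any particular predictive coding product: the function names, the document IDs, and the convention that True means "responsive" are all assumptions made for the example.

```python
from typing import Dict, List


def find_disagreements(first_pass: Dict[str, bool],
                       second_pass: Dict[str, bool]) -> List[str]:
    """Return IDs of documents the two independent reviewers coded differently.

    Only documents coded by both reviewers are compared.
    """
    shared_ids = first_pass.keys() & second_pass.keys()
    return sorted(d for d in shared_ids if first_pass[d] != second_pass[d])


def prevalence(codes: Dict[str, bool]) -> float:
    """Share of coded documents marked responsive (also called richness)."""
    if not codes:
        raise ValueError("no coded documents")
    return sum(codes.values()) / len(codes)


# Hypothetical training codes: True = responsive, False = nonresponsive.
reviewer_a = {"DOC-001": True, "DOC-002": False, "DOC-003": True, "DOC-004": False}
reviewer_b = {"DOC-001": True, "DOC-002": True, "DOC-003": True, "DOC-004": False}

# Documents to escalate to the subject-matter authority before training.
to_escalate = find_disagreements(reviewer_a, reviewer_b)
print(to_escalate)             # ['DOC-002']
print(prevalence(reviewer_a))  # 0.5
```

In practice the escalation list would be resolved by the subject-matter authority before the classifier is trained, and prevalence would be tracked as the training set grows.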
[Figure: Running cost (vertical axis) over the review time line (horizontal axis) for three work flows applied to 1,000,000 documents. Linear Review: review ends 5/14/2012 after 106 days at a cost of $1,968,000. Linear Relevance Review: review ends 2/17/2012 after 18 days at a cost of $403,500. Relevance Sampling Review: review ends 2/20/2012 after 21 days at a cost of $371,200.]

Using predictive-coding technology to limit review by attorneys to the documents most likely to be relevant can reduce time and overall ediscovery costs significantly. In this example, document review time was reduced from 106 days to about 20 days, and cost was reduced from nearly $2 million to about $400,000.

Using statistical sampling for quality control

Regardless of the work flow used, quality control over a large number of documents classified through predictive coding remains a challenge. It is similar to quality control for large document reviews performed by humans. There are three basic options for quality control in the application phase: a second review of a subset of the documents, judgmental sampling by nonstatistical methods, or statistical sampling. Both a second-level review and judgmental sampling are important parts of a well-rounded quality control program. Statistical sampling, however, is a much more powerful way to provide insight into the overall population of documents and the quality of coding.

Statistical sampling solutions should be built into any ediscovery software platform. This requirement is especially important for predictive coding applications. KPMG's proprietary enterprise-level ediscovery software, DiscoveryRadar, provides one example.

There are three basic rules to remember about using statistical sampling to validate [2] the results of predictive coding: the sampling population must be defined, the sample size must be calculated correctly, and the samples must be drawn randomly.

[2] "Validate" is used here in a meaning specific to quality control in ediscovery; it does not refer to AICPA standards.
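As a rough illustration of the three rules, the sketch below defines the population as every document outside the training set, computes a sample size, and draws the sample with a seeded random generator. It is a sketch under stated assumptions, not a prescribed method: it uses the common normal approximation to the binomial with a finite population correction, it assumes a worst-case prevalence of 50 percent unless told otherwise, and the function names, document IDs, and seed are hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Optional, Sequence, Set


def sample_size(confidence: float, margin: float, prevalence: float = 0.5,
                population_size: Optional[int] = None) -> int:
    """Sample size for estimating a proportion (normal approximation).

    margin is the half-width of the interval, e.g. 0.05 for +/-5 percent.
    prevalence defaults to 0.5, which gives the largest (most cautious) size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z * z * prevalence * (1 - prevalence) / (margin * margin)
    if population_size is not None:
        # Finite population correction: small populations need fewer samples.
        n = n / (1 + (n - 1) / population_size)
    return math.ceil(n)


def draw_validation_sample(all_ids: Sequence[str], training_ids: Set[str],
                           confidence: float, margin: float,
                           seed: int) -> List[str]:
    """Define the population, size the sample, and draw it at random."""
    # Rule 1: the population is every document that was not used for training.
    population = sorted(set(all_ids) - training_ids)
    # Rule 2: calculate the sample size for that population.
    n = min(len(population),
            sample_size(confidence, margin, population_size=len(population)))
    # Rule 3: draw without replacement; the seed makes the draw reproducible.
    return random.Random(seed).sample(population, n)


# Hypothetical collection of 10,000 documents, 100 of them used for training.
all_ids = [f"DOC-{i:05d}" for i in range(10_000)]
training_ids = set(all_ids[:100])

print(sample_size(0.95, 0.05))  # 385
sample = draw_validation_sample(all_ids, training_ids, 0.95, 0.05, seed=2013)
print(len(sample), sample[:3])
```

Recording the seed and the exact population list alongside the sample is one way to make the draw repeatable if the work flow is later questioned.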
Define the sampling population: Statistical sampling is used to draw an inference about the population from the sample. The population must first be defined correctly so that the results can be interpreted correctly. There are several approaches to defining the population. The first approach is to define the population as all documents that were not part of the training set. This comprehensive approach will yield an inference about the entire document population and will test the overall quality of the predictive-coding work flow. A second approach is to sample only the set of documents that were coded nonresponsive in order to gauge whether, and how many, responsive documents were missed.

Calculate sample size correctly: The most common calculation of sample size is a straightforward binomial formula. The most important factors determining the sample size are the desired confidence level, the error rate (the margin of error, which sets the width of the confidence interval), and prevalence. The confidence level reflects the likelihood that the sample is a true representation of the overall population. For example, a 95 percent confidence level means that if 100 independent samples were randomly selected, about 95 of them would be expected to represent the population accurately (within the error rate). The error rate expresses the range of expected results. For example, with an error rate of +/-5 percent, if the validation sample shows that 10 percent of the documents were classified incorrectly, the actual rate is estimated to lie between 5 percent and 15 percent. Increasing the sample size lowers the error rate. Finally, prevalence represents the expected percentage of responsive documents in the population. As a note of caution, correctly interpreting prevalence and adjusting the sample size accordingly requires an advanced-level understanding of statistics.

Draw samples randomly: Randomizing software can help any user draw sufficiently random results easily. Randomization becomes a challenge, however, when there are changes in the population, such as the addition of new documents. In practice, a validation sample drawn from half the documents at the midpoint of the project cannot be updated simply by sampling the second half of the documents at the completion of the project. Both validation samples may be useful, but neither will permit a statistically valid statement about the entire population, because neither sample will have been randomly selected from the total population.

Getting sampling right is an important part of making the work flow defensible and ensuring quality. The KPMG white paper "The case for statistical sampling in e-discovery" provides an excellent resource on statistical sampling and statistical process control in document review.

Quality control for documents that require individual review

The training and application phases represent the core of the predictive coding work flow. By necessity, predictive coding work flows are complex, as they require the identification and tracking of different categories of documents. In addition to the bulk of the documents for which predictions are generated, there are five categories of documents that will require individual review and should be a special focus of quality control: training documents, validation samples, ambiguous documents, nontext files, and potentially privileged documents.

Training documents: These documents enable the classifier to learn how a reviewer would handle a specific document. They must be reviewed to give the technology the required input and, as discussed, should be a special focus of quality control.

Validation samples: These randomly selected documents must be reviewed by an attorney in order to assess the performance of the predictive-coding classifier. Validation samples are statistical samples, and the rules discussed above apply.

Ambiguous documents: Given the variation in documents, the case strategy used, and the complexity of the subject matter, predictive coding technology may not achieve sufficiently clear results for all documents, leaving these documents to be reviewed by an attorney. Depending on the software used, ambiguous documents may not be explicitly identified.

Nontext documents: Since predictive-coding technology is based on the content of text documents, nontext documents such as image files or poor-quality scans must be reviewed.

Potentially privileged documents: These documents need to be reviewed by an attorney to produce a privilege log and confirm that the information is subject to privilege.
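Once an attorney has reviewed a validation sample, the result has to be read against the confidence level and error rate discussed earlier. The following sketch shows one way to turn the reviewed sample into an observed rate and a range. It assumes the normal approximation to the binomial, which is only one of several possible interval methods; the function name and the counts (40 misclassified documents in a sample of 400) are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def proportion_interval(errors: int, sample_n: int,
                        confidence: float = 0.95) -> Tuple[float, float, float]:
    """Observed error rate and its approximate confidence interval.

    Returns (rate, lower bound, upper bound), with bounds clipped to [0, 1].
    """
    if sample_n <= 0 or not 0 <= errors <= sample_n:
        raise ValueError("errors must be between 0 and sample_n")
    rate = errors / sample_n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(rate * (1 - rate) / sample_n)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Hypothetical validation sample: 400 documents reviewed, 40 found misclassified.
rate, low, high = proportion_interval(errors=40, sample_n=400)
print(f"{rate:.1%} (between {low:.1%} and {high:.1%})")  # 10.0% (between 7.1% and 12.9%)
```

The range here is narrower than +/-5 percent because a planning margin is usually calculated for the worst case of 50 percent prevalence. For very low observed rates the normal approximation becomes unreliable, and an exact or Wilson interval may be a better choice.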
Tracking, checklists, and an enterprise-level approach have emerged as the primary tools and strategies for quality control.

Tracking: Quality control is essential to the integrity of the ediscovery process, and the foundation of quality control rests on the tracking of all data and activity. Ideally, a tracking system should connect all relevant information in an accessible manner, linking documents to the electronic media on which they were collected as well as to specific work flows and to the final disposition of the documents. Tracking technology should produce a record of data collection, processing, review, and production. The goal is not only to provide chain-of-custody documentation, but also to associate each document with all its relevant process-related information. Using technology such as KPMG's Global Evidence Tracking System (GETS) can help to ensure quality control and help minimize the errors that often result from manual data entry.

Checklists: The second principle of quality control in ediscovery is the use of checklists. Ediscovery projects are extremely complex, and the use of predictive coding only adds to the complexity. Simple checklists that are followed consistently can help mitigate the risk of error. [3] Checklists make project delivery documentable and auditable by preserving a record of the tasks performed. Checklists can also be customized for large projects and enterprise solutions. In order to unlock cost savings while minimizing risk, checklists should be living documents that are amended as optimal work flows for a particular client are developed and information is shared.

An enterprise-level approach: An enterprise-level approach can reduce costs and increase efficiency by allowing the ediscovery provider to become familiar with the data sources, share information across different cases, and avoid an ad hoc approach for each stage of the process. The enterprise-level approach can also allow for continual improvement and consistency in work flow. For example, the protection of privileged information is one of the core concerns of ediscovery, as privilege considerations are often subject to interpretation and different approaches. Corporations also tend to work with several law firms, and often with various teams within each firm. Consistency in maintaining claims of privilege among matters and over time is important, as any variation in how that information is handled increases risk. Documents that were produced in one matter may no longer be subject to privilege in other matters.

[3] Effective checklists require some thought and testing. For example, checklists should use natural breaks in the work flow, be simple and logical, fit on one page, and have a clear objective. See Gawande, Atul, The Checklist Manifesto: How to Get Things Right, 2009.

Conclusion

Predictive coding technology demonstrates the general principle that more sophisticated technical tools lead to work flow complexity. While defensibility and disclosure requirements may be top of mind for outside counsel, successfully navigating the many moving parts in predictive coding technology should be the foremost project-management concern.

Predictive coding is a powerful document review tool. Nonetheless, the increased use of technology has also increased the complexity of the ediscovery process, which can result in increased risks. Companies should consider the strategies discussed above for improving quality control during complex ediscovery work flows, particularly for predictive coding.

About the author

Manfred Gabriel is a principal in KPMG's Forensic Technology Services practice, where he focuses on ediscovery. He provides clients with a wide range of services, from enterprise-level ediscovery management to delivery on large, complex ediscovery projects. As a former practicing antitrust attorney, Manfred has successfully assisted clients in responding to large, fast-paced regulatory requests and in litigation, both domestic and international.
Contact us

Kelli Brooks
U.S. Forensic Technology Network Co-Leader
T: 714-934-5435
E: kjbrooks@kpmg.com

Ed Goings
U.S. Forensic Technology Network Co-Leader
T: 312-665-2551
E: egoings@kpmg.com

Manfred Gabriel
Principal
T: 212-954-3656
E: mjgabriel@kpmg.com

kpmg.com

© 2013 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative ("KPMG International"), a Swiss entity. All rights reserved. Printed in the U.S.A. The KPMG name, logo and "cutting through complexity" are registered trademarks or trademarks of KPMG International. NDPPS 141499