Making document review faster, cheaper and more accurate:
How concept searching can change the way your legal teams handle first pass review

kpmg.ca/imed | kpmg.com
Contents

Introduction
Why Concept Searching?
So what exactly is Concept Searching?
How is concept searching not just another kind of keyword searching?
Iterative versus non-iterative
What is the most cautious way of using concept clustering?
What is the most aggressive way of using concept clustering?
What is the best way to use concept clustering?
Using Concept Searching for First Pass Review
Introduction

"A party may satisfy its obligation to preserve, collect, review and produce electronically stored information in good faith by using electronic tools and processes such as data sampling, searching or by using selection criteria to collect potentially relevant electronically stored information." Sedona Canada Principles, Principle 7.

There has been a lot of talk in e-discovery circles these past few years about concept searching and its various manifestations. A technology that began with a few early innovators selling mysterious but fascinating applications to cautious law firms has become a well-accepted part of any full-featured processing and review platform. With concept searching, documents can now be searched, grouped, served up for review and even bulk-coded, not just based on what specific words can be found within them, but according to what they are about. And this technology, which identifies themes and topics within and across documents, can also be used to find, group and tag thousands of documents in the larger population based on how a small subset is coded. This means that a skilled reviewer knowledgeable about a case can code thousands or even tens of thousands of documents for every document actually reviewed. While this may seem impossible, the underlying technology has been rigorously tested, from both a statistical and a substantive legal perspective, and proven to be robust and reliable. But are we making the best use of these amazing new capabilities? We have all seen how an early adopter or power user can impress onlookers with a new tool's capabilities.
But more often than not, for every power user there are dozens or hundreds using the tool who are not using it well; who are perhaps allowing what it does well to blind them to its drawbacks; and who, in the world of e-discovery, are perhaps letting the glamour and gloss of the technology distract them from the need to adhere to the same old practices that are the backbone of any discovery project: thoroughness, quality control, consistency and defensibility. For all their benefits, these new concept-based review tools are just tools. They should always be deployed as part of a well-thought-out process designed by a subject-matter expert. And they will never eliminate the need for a good project manager.

But there is one thing these tools can do that many in the field have been hesitant to acknowledge. They can perform a first-pass responsiveness review more effectively than can a traditional team of reviewing attorneys. And they can do it faster and cheaper. The challenges to such a claim are obvious: first, how can you trust a machine to understand a document and decide what it is about? Then, even if you let a machine decide what is relevant, will you not have to go through every document to make sure the software got it right? In which case, why not just go through the documents the old way?

A leading authority on these matters, The Sedona Conference, has clearly stated that "[a] party may satisfy its obligation to preserve, collect, review and produce electronically stored information in good faith by using electronic tools and processes such as data sampling, searching or by using selection criteria to collect potentially relevant electronically stored information."1 Conor Crowley, an internationally recognized expert on e-discovery matters and a member of The Sedona Conference, has argued that, when done properly, computer-assisted methods are both defensible and effective in a first-pass review: "[T]he use of analytical software for culling and first-pass review should provide accurate, cost-effective relevancy determinations and should be considered no less defensible than the exclusive use of human reviewers."2

We will explain how these concerns about accuracy and defensibility are unwarranted when concept searching and concept-based categorization are deployed as part of a well-designed process. With the right tool and the right procedures, it is possible to get results that are higher-quality, more consistent and more reliable than any linear review could ever achieve, and to get them faster and cheaper. We will look at concept searching in the context of litigation and investigation, and examine how some of its most impressive capabilities can yield significant improvements in both quality and efficiency. We will also offer what, for some at least, may be some needed reassurance that these new tools can really do what many say they can. Finally, we hope to demystify what is still quite a confusing topic, made all the more confusing by the many companies offering one form of concept searching or another, many with their own catch-phrase branding.

Why Concept Searching?

The best way to understand why concept searching was adopted in the legal world is to start with two fundamental problems with the way document reviews were conducted in the past: first, the burden and cost of a traditional "linear review"; second, the inherent limitations of keyword searching.
Imagine a room full of boxes of documents that have to be reviewed. The task could be to make a simple yes/no decision about responsiveness, or it could be to identify issues that a document touches on. Often the documents will not be organized, so reviewers review them in the order they appear in the box, from start to finish (thus the term "linear"). This is time-consuming work; it is tiring, even mind-numbing; and the entire project depends for its accuracy and reliability on all reviewers assessing documents the same way and bringing the same level of attention to the task. These assumptions hardly ever hold.3

With tools that combine an image viewer and a relational database (the first were Summation and Concordance, adopted by law firms in the 1990s), it was possible to view documents on screen and code them much faster than in a paper environment. It was also possible to organize them by objective coding information like date, page count, author and recipient(s). One could also assign batches of documents by custodian or source/folder. But reviews were still essentially linear; everything had to be looked at.

1 The Sedona Canada Principles Addressing Electronic Discovery (January 2008) ("Sedona Canada Principles"), Principle 7. See also The Sedona [U.S.] Principles (Second Edition, 2007), Principle 11.

2 Crowley, Conor R., "Defending the Use of Analytical Software in Civil Discovery," Digital Discovery and E-Evidence, Vol. 10:16, September 16.

3 See Maura R. Grossman and Gordon V. Cormack, "Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?" (June 2011).
Images with bib coding (bibliographic or simple objective coding) were then supplemented with OCR (optical character recognition) technology, which turns text on a page into searchable electronic text. With searching and the grouping of results, one could identify subsets of more-important documents and examine them together. But OCR often yielded poor results (garbled text), causing words to be missed and documents to be overlooked. Even when keyword searches (as well as more sophisticated forms of word-based searching like Boolean, fuzzy, stemming, wildcard and proximity) could be run against good-quality OCR, reviewers still had to look at each document to find out what it was actually about. It was not enough to know that a word or phrase appeared in a document (making it responsive to the search criteria); someone had to decide whether it was actually relevant to the litigation.4

Many of the headaches and complaints from this period can be traced to people taking short-cuts because of the time and cost of linear review, even when aided by imaging, coding, OCR and searchability. All too often, parties would hand over all responsive documents, thus producing not just relevant documents but also a large number of false positives: documents that contained a supposedly relevant word or phrase but which had nothing at all to do with the litigation. Cost and burden drove litigants and counsel to search for alternatives to keyword search-assisted linear review. How could false positives (files that are responsive but irrelevant) be minimized? How could false negatives (relevant documents that are not responsive and therefore not retrieved) be minimized?5

At around this time, studies revealed that even the best use of the best search technology would retrieve large numbers of irrelevant documents while missing large numbers of relevant documents.6 And even if search technology could be improved, what could be done about the quality of human reviewers, who other studies have shown are often quite bad at making responsiveness and relevance decisions?7 At the same time, the per-hour cost of attorneys meant that any document review that uses attorneys is going to be expensive, even if good technologies and processes have minimized the number of documents to be reviewed.8 But why use attorneys? For decades, discovery had involved having the individuals most familiar with the matter go through their files and pull out documents responsive to the document request.

4 A document is "responsive" when it meets certain search criteria or falls within the scope of a document request. But not all responsive documents are relevant. Deciding whether a document is relevant involves a different assessment. For a listing of the Canadian federal and provincial Rules of Civil Procedure dealing with pleadings and relevance, see The Sedona Canada Commentary on Practical Approaches for Cost Containment: Best Practices for Managing the Preservation, Collection, Processing, Review & Analysis of Electronically Stored Information, April 2011, Appendix D.

5 For an overview of some of the techniques that have been used to address these problems over the years, see Review_Technologies.

6 See AutoDocumentReviewReliability.pdf. See also the EDI publications, and more particularly electronicdiscoveryinstitute.com/pubs/toolsfortextcategorization.pdf. When people talk about precision and recall, this is what they are talking about. Precision measures the degree to which a set of results is made up of nothing but what one is looking for: "Of all my results, how many of them are relevant?" A low precision score indicates a lot of false positives.
Recall measures the degree to which all of the relevant documents in the dataset being searched are retrieved by the search: "Are there relevant documents out there that I'm not finding?" A low recall score indicates a high number of false negatives.

7 See the discussion of a 1985 study by Blair and Maron in Herbert L. Roitblat, "Search & Information Retrieval Methods," The Sedona Conference Journal, Fall 2007, at 206: "The attorneys estimated that they had found more than 75% of the relevant documents, but more detailed analysis found that the number was actually only about 20%."

8 See The Sedona Canada Commentary on Practical Approaches for Cost Containment: Best Practices for Managing the Preservation, Collection, Processing, Review & Analysis of Electronically Stored Information, April 2011.
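The precision and recall measures described in note 6 can be made concrete with a few lines of arithmetic. The numbers below are invented for illustration and do not come from any real matter:

```python
# Illustrative numbers only: a hypothetical search returns 80 documents,
# 60 of which are truly relevant, out of 100 relevant documents in the
# whole collection.

def precision(true_positives, retrieved):
    """Of all my results, how many of them are relevant?"""
    return true_positives / retrieved

def recall(true_positives, total_relevant):
    """Of all the relevant documents out there, how many did I find?"""
    return true_positives / total_relevant

p = precision(60, 80)    # 0.75 -- 20 false positives dilute the result set
r = recall(60, 100)      # 0.60 -- 40 relevant documents were never retrieved
print(p, r)
```

In this example the search looks reasonably precise (75%) yet still misses 40% of the relevant documents, which is exactly the combination the studies cited above kept finding.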
The process may have been initiated and supervised by attorneys, but the all-important responsiveness calls were made by non-attorneys. Yet, with the advent of electronic documents, it became the responsibility of lawyers to make these decisions.

Side panel: Synonymy and polysemy

Concept searching is a way of dealing with synonymy and polysemy. Polysemy is the phenomenon where a word can have several meanings. (The word "stock" could refer to investments, cooking, car racing, lineage, cattle or inventory.) Synonymy is when the same meaning can be conveyed by several words or phrases (as in "start," "begin," "initiate," "turn on," "set off," "incite," "instigate," "provoke" and "flip the switch"). Synonymy and polysemy are the reasons why keyword searches will find the words you ask for even when they carry the wrong meaning, and will fail to find the meaning you are looking for, because you cannot possibly include in your search all the words and phrases that might convey the desired meaning. Polysemy leads to false positives; synonymy results in false negatives.

But who decided that these first-pass responsiveness decisions had to be made by lawyers? There is little if any case law on the subject. As cases grew larger, it became necessary (or so it was thought) for the largest firms to hire dozens or even hundreds of contract attorneys. These attorneys, some straight out of law school, would start the review with absolutely no understanding of the case. Even the most professional training protocols could not possibly make these contract attorneys knowledgeable enough to catch important nuances in the documents.
As a result, large-scale attorney reviews have become notorious (at least among those who have performed them) for the amount of shoddy work they involve: documents marked responsive/relevant when they are not, important documents tagged non-responsive, two coders sitting side-by-side (or on opposite sides of the country) making completely different calls on the same document.

In summary, traditional linear document reviews by attorneys rely on imperfect search techniques guaranteed to leave out important documents while forcing reviewers to wade through large numbers of irrelevant documents; they rely on often inexperienced lawyers unfamiliar with the case to perform mind-numbing work at a high level of precision and reliability; and they are expensive precisely because lawyers perform them.

Every one of these deficiencies can now be remedied. Through a carefully designed process, to be discussed in more detail below, it is now possible to save significant amounts of time and money by eliminating first-pass responsiveness review and yet identify a body of responsive documents with much better precision and recall than could ever be achieved through a traditional linear review.9

So what exactly is Concept Searching?

First, some definitions. Concept searching identifies the meanings of words using any of a number of different technologies, including latent semantic indexing ("LSI"); Bayesian statistical inference; support vector machine technology; ontologies, taxonomies and/or thesauri; and language modeling.10 Some tools use two or more of these in combination. Of these technologies, perhaps the most commonly used is LSI.11 Latent semantic indexing means a way of indexing a document based on the meanings that are hidden ("latent") within it.12 Determining how latent semantics can be found in a body of text is something that computer scientists and linguists have been working on since at least the 1970s.

9 For a discussion of precision and recall, see note 6, above.
10 See EDRM Search Guide, Draft v. 1.17, May 7, 2009. Available online as a PDF at wp-content/uploads/downloads/2010/02/edrm-search-guide-v1.17.pdf. EDRM (which stands for "Electronic Discovery Reference Model") is both an organization and an initiative aimed at standardizing terminology and best practices in the e-discovery field.

11 One of the leading document review platforms on the market today, kCura's Relativity, offers a module called Analytics, which is built on an LSI-based concept search engine.

12 A key point to make here is that LSI technology does not build indexes in the normal sense of the word, not like the lists of words used to perform standard keyword searches. Instead, an LSI index is a set of mathematical formulas representing the semantic content of each document.
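Note 12's point, that an LSI index is a set of mathematical formulas rather than a word list, can be sketched in miniature. The three toy documents below are invented for illustration; real engines index millions of documents, but the underlying mechanics (a term-document matrix reduced by singular value decomposition) belong to the same family of techniques:

```python
# A miniature latent semantic indexing sketch on invented toy data.
import numpy as np

docs = [
    "stock bond broker return investment",   # investment document
    "bond broker return portfolio",          # investment document, no "stock"
    "chicken soup stock recipe",             # cooking document, with "stock"
]
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: one row per term, one column per document.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Singular value decomposition; keeping only the top 2 singular values
# projects each document into a reduced "latent" concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:2]) @ Vt[:2]).T   # each document as a 2-d vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 (both about investments) score closer together in
# concept space than documents 0 and 2, even though document 1 never uses
# the word "stock" while document 2 does.
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

This is the behaviour the paper describes: similarity driven by co-occurring vocabulary, not by the presence of any single keyword.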
Concept searching uses mathematics to identify relationships between words based on what is known as their "co-occurrence" within a document. It then groups documents according to these patterns and relationships. Concept searching does not require that any particular word or words appear in a document (the threshold requirement for any traditional keyword search). Even the concept of a "word" no longer really applies; concept search engines work with misspelled words, words in foreign languages, even scientific words not found in a normal dictionary. Most technologies are language-independent in that they do not rely on dictionaries, taxonomies or thesauri and do not need to know the meanings of specific words.

Another mathematical technique used in some concept search engines is Bayesian statistical inference. A leading information management, content management and e-discovery solution, Autonomy, is built on Bayesian technology. Concept search software can even assign a higher score to documents that focus on a particular topic and a lower score to a document that only touches on it in passing.

The basic functionality offered by these tools, then, is concept searching; depending on the tool, this functionality is powered by one or more specific computational technologies.13 The specific technology being deployed can have an important effect on speed, efficiency and results, but this should not obscure the fact that these tools are all trying to provide users in the legal discovery market with the same basic solution. All the terms, phrases and buzzwords so often heard at legal technology conferences are just ways of describing how one form of concept searching or another can be made useful in the legal world. They describe what concept searching can help you do with a body of documents.
For example: it can intelligently categorize them ("intelligent categorization"); it can tell you how you are likely to want to code a set of documents ("predictive coding"); it can suggest the coding you might want to apply ("suggested coding"); it can classify documents by topic ("topic classification"); and it enables prioritized review ("review prioritization"). At a fundamental level, these are all the same thing; they differ in the specific mathematical techniques used or in the added functionality a vendor has built into its specific application. These differences in functionality in turn give rise to the various catch-phrases you hear, together with a little marketing and branding. These vendors are all offering concept searching as a way to make sense of a large body of electronic documents by finding their hidden meanings and then gathering them into groups named for these meanings.14

13 Even the term "concept search" is a simplification; not all of these technologies search for or reveal concepts as we understand that word. This paper uses the term, however, because, for most people, the ultimate point of these technologies is to help make sense of what documents are about, i.e., to reveal the concepts they contain. Related concepts and technologies, including technologies that make LSI possible as well as others that LSI has made possible, include: singular value decomposition, vectorial semantics, term-document matrices, automated document classification, concept clustering, latent semantic mapping, latent semantic structure indexing, probabilistic latent semantic indexing, predictive categorization, and computer-assisted coding. Note that these terms are widely used and are neither associated with nor claimed by any particular vendor or service provider.

14 Vendors describe their concept-based offerings in various ways.
See e-Discovery Institute Survey on Predictive Coding, October 1, 2010, a report based on survey responses from the following companies: Capital Legal Solutions; Catalyst Repository Systems, Inc.; Equivio; FTI Technology; Gallivan Gallivan & O'Melia; Hot Neuron; InterLegis; Kroll Ontrack; Recommind; Valora Technologies, Inc.; and Xerox Litigation Services. The specific phrase "predictive coding" has been in the news lately. Recommind announced by press release on June 8, 2011 that it had obtained a patent on predictive coding. Commentators soon took issue with this claim, arguing that what Recommind described as predictive coding is nothing more than a set of technologies, or a combination of technologies, that had already been developed by others and that have been widely available for many years. For more information, see Jeremy Pickens, "The Recommind Patent and the Need to Better Define Predictive Coding" (June 2011). The patent is U.S. Patent 7,933,859 B1.
How is concept searching not just another kind of keyword searching?

Concept searches recognize conceptual similarities between documents by noticing how words relate to each other, how often they appear together or near each other, how far apart they tend to be and how often they do or do not appear in other documents with similar or different characteristics. Take the example of a set of documents, all of which contain the word "investment" and other investment-related terms. These documents will probably warrant a concept cluster called "investment." But the key is not the specific word "investment"; it is the co-occurrence of "investment" and all the other, conceptually related words. Thus, if there is another document that does not contain "investment" but does contain "bonds," "broker" and "return," it will be included in the same cluster. However, a different document containing "stock" would be less likely to be included in the "investment" cluster if it also includes "soup" and "chicken." (See the discussion of synonymy and polysemy in the side-panel, above.)

How the math actually accomplishes this is difficult to explain and beyond the ken of most of us. Making it all even more complex is that the algorithms are closely guarded as valuable commercial secrets. Yet the formulas underlying these methods have been in the public domain for decades (even centuries, if we consider Bayes's work) and have been tested in the most rigorous peer-review environments.15 These tools work.

Iterative versus non-iterative

The simplest approach to concept searching is non-iterative. You have the software identify clusters of documents based on content, issues or topics. You then move these subsets of documents through standard review workflows. Some could go immediately to senior reviewers, some could be treated as less relevant and others could be designated as irrelevant and set aside, perhaps for a final check.
In other words, you use concept searching simply to get you started with a set of groupings and then perform a traditional review. The tool does nothing more than help you prioritize your otherwise linear review. The value of this simple approach is not to be underestimated; the mental fatigue caused by having to switch back and forth between thematically dissimilar documents can be greatly reduced by giving reviewers groups of documents that deal with the same set of related topics.

15 See, e.g., Berry, Michael W., Dumais, Susan T., and O'Brien, Gavin W., "Using Linear Algebra for Intelligent Information Retrieval" (December 1994), SIAM Review 37:4 (1995); Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, The Sedona Conference, 2007; Graesser, A., and Karnavat, A., "Latent Semantic Analysis Captures Causal, Goal-oriented, and Taxonomic Structures," Proceedings of CogSci 2000; Thomas Hofmann, "Probabilistic Latent Semantic Indexing," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, August 15-19, 1999, 50-57; Dian I. Martin and Michael W. Berry, "Latent Semantic Indexing," in Encyclopedia of Library and Information Sciences, Third Edition, Feb. 2010; Leonhard Hennig, "Topic-based Multi-Document Summarization with Probabilistic Latent Semantic Analysis," Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2009).
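The "stock" example above (investment versus chicken soup) can be sketched as a toy cluster-assignment function. Real concept engines use latent-space mathematics rather than raw vocabulary overlap, and every word list below is invented for illustration, but the co-occurrence intuition is the same: the words around an ambiguous term decide which cluster a document joins.

```python
# Illustrative only: assign documents to named clusters by how much of
# their vocabulary co-occurs with each cluster's characteristic words.

clusters = {
    "investment": {"stock", "bond", "broker", "return", "portfolio", "investment"},
    "cooking": {"stock", "soup", "chicken", "recipe", "simmer"},
}

def best_cluster(document):
    words = set(document.lower().split())
    # Score each cluster by the size of its overlap with the document.
    scores = {name: len(words & vocab) for name, vocab in clusters.items()}
    return max(scores, key=scores.get)

# "stock" alone is ambiguous; its neighbouring words decide the cluster.
print(best_cluster("the broker sold stock for a good return"))  # investment
print(best_cluster("simmer the chicken stock for the soup"))    # cooking
```

Both sentences contain "stock," yet each lands in the cluster suggested by the words that co-occur with it, which is the polysemy problem keyword searching cannot solve on its own.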
"In many settings involving electronically stored information, [the] time and burden of a manual search process for the purpose of finding producible data may not be feasible or justified. In such cases, the use of automated search methods should be viewed as reasonable, valuable and even necessary." Sedona Canada Principles, Comment 7.c.

A more complex approach takes advantage of the learning capabilities of concept searching software. You run a search, review the results (or a sample thereof) and have the software run the search again, but better: better because your review has allowed you to give the software instructions on how to perform a better search. In the most scientific approach, you start with a random sample (of sufficient size) of the entire dataset and have someone who understands the issues in the case review it and make the desired decision.16 The software examines the documents you selected, identifies their shared characteristics, looks for the same characteristics in the remaining dataset, pulls out similar documents and presents them for review. The same reviewer (ideally) goes through these results and essentially corrects the work of the software by confirming which documents are wanted (true positives) and which are not (false positives). These new decisions are then fed to the software and, in a second and third round, it refines its search to find more documents like the true positives and leave out anything like the false positives. This is the approach used in the latest version of KPMG's Discovery Radar (DR 4.1), which uses the support vector machine technology developed by Equivio. Thus, an experienced attorney, performing a focused review of a small subset of documents, can use these tools to extend his or her decisions across the entire population of documents in the case. But can this method really be trusted?
Can it be trusted to perform fine-grained issue coding? Can it be trusted to perform a privilege review? The answer is yes, but only as part of a multi-stage, multi-layered process that incorporates sampling and testing of results according to recognized leading practices. Studies have shown, repeatedly and conclusively, that human beings are not very good at making consistent and accurate judgment calls when reviewing documents. But these tools are not merely better than mediocre reviewers; they have been shown to be exceptionally good, reaching or even exceeding the performance of even top-quality reviewers.17

What is the most cautious way of using concept clustering?

Before we discuss in more detail how these tools can be used, take a moment to think about a key distinction: Will you allow the software to decide what a document is about (and apply coding based on that assessment), or will you merely allow it to suggest what a document appears to be about so that, using your own judgment, you decide what it is about and apply the appropriate coding? Once this distinction is clearly grasped, it is obvious that the word "coding" (as in "predictive coding")

16 A statistically representative sample can be surprisingly small: as low as 0.24% if the population being sampled is large enough. See Doug Stewart, "Application of Simple Random Sampling (SRS) in e-Discovery," April 20, 2011 ("a sample of fewer than 2,400 records from a population of one million can be used to accurately estimate the population as a whole"). The smaller the total population, the larger the sample percentage needs to be.

17 Maura R. Grossman and Gordon V. Cormack, "Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review," XVII RICH. J.L. & TECH.
11 (2011). See also Bruce Hedin et al., "Overview of the TREC 2009 Legal Track," in NIST Special Publication: SP, The Eighteenth Text Retrieval Conference (TREC 2009) Proceedings 16 & tbl. 5 (2009); Douglas W. Oard et al., "Overview of the TREC 2008 Legal Track," in NIST Special Publication: SP, The Seventeenth Text Retrieval Conference (TREC 2008) Proceedings 8 (2008).
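The sampling arithmetic in note 16 can be reproduced with the standard sample-size formula. The note does not state its parameters, so 95% confidence, a 2% margin of error and worst-case 50% prevalence are assumed here; those assumptions reproduce the "fewer than 2,400 records from a population of one million" figure:

```python
# Sample-size arithmetic behind note 16 (assumed parameters: 95%
# confidence, +/-2% margin of error, worst-case 50% prevalence).
import math

def sample_size(population, confidence_z=1.96, margin=0.02, p=0.5):
    # Infinite-population sample size (the classic z^2 * p * (1-p) / e^2).
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: smaller populations need a larger fraction.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(1_000_000))  # a little under 2,400, i.e. about 0.24%
print(sample_size(10_000))     # a much larger *percentage* of a smaller population
```

This also illustrates the note's closing point: the absolute sample size barely grows with the population, so the required sampling percentage shrinks as the dataset gets larger.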
is often misused in this context. Even those who talk about how their software can perform "predictive coding" do not usually recommend that the software be trusted to make coding decisions on its own and without any human involvement. What they are selling is a tool that predicts how an attorney is likely to understand the content of a document ("This memo is about investments, not about cooking."). The determination that it is therefore responsive is an entirely separate mental operation. These software tools, properly understood, do not presume to make that decision. In the end, the vaunted "prediction" is merely a grouping of documents into clusters based on semantic content and the assignment of names to those groupings (such as "investment" or "cooking"), nothing more. True coding comes later.

We can now see that there is a spectrum that extends from (1) trusting the software to perform coding with litigation consequences (e.g., Responsive / Not Responsive) to (2) trusting the software only to create topic-based groups of documents, which you then review in the traditional manner. If you were to adopt the second approach, you would likely be paying a lot of money for very little benefit. The grouping of documents into clusters is useful, but concept search tools allow for much more than just a linear review of subsets of documents; they allow for sophisticated re-groupings of new subsets within initial subsets, provisional coding for work-assignment purposes, and the creation of workflows so that the right people look at the right documents as quickly as possible.

Some might argue that this cautious approach is still the best because it ensures that a human being makes each key decision. But, as indicated above, studies have shown that having an individual look at each document has serious flaws, flaws which only computer technology could reveal.
In light of this fact, such a cautious use of concept searching looks much less appealing; it is still very time-consuming and expensive, and this much caution is likely to be unwarranted given that the software is now very effective.
What is the most aggressive way of using concept clustering?

Moving to the other end of the spectrum is not necessarily a wiser choice. Letting the software do all the work, from indexing right through to coding and production, with human intervention limited to choosing which concepts count as responsive and which do not, requires an immense faith in an algorithm's ability to group documents by issue, do this well (with no false positives or false negatives) and assign production-oriented coding based on these groupings. True, there could be huge cost and time savings and, quite possibly, very good results with respect to what is produced and what is withheld. But you would be entrusting the entire discovery phase of litigation to a software application without any attempt to check its results.

What is the best way to use concept clustering? A balanced approach

The best way to use concept-search tools is to have the technology assist the human and the human assist the technology. Let the tool do what it is good at, up to its maximum capability (in other words, do not be afraid to use its best features against the largest volume of data), but then be sure that you bring to bear a separate set of tools and processes to check its work. To quote Reagan: "Trust, but verify." Concept search technology, when coupled with good visualizations and tagging systems, does nothing more and nothing less than gather together documents that are likely to be highly relevant to your case. It helps you to find the most useful (and most dangerous) documents as quickly as possible. When you have identified these high-value populations, you can pass them quickly to your lead attorneys so that they can make important strategic decisions about the case. You can get a head-start on witness binders and affidavits of documents. It allows for these things without preventing you from reviewing the remaining documents.
A key question that has to be addressed, one that requires a separate step in your "trust but verify" process, is how confident you are that you are not leaving responsive documents out of your production solely on the strength of the software's clustering decisions. You need to take reasonable steps to assure yourself that the body of documents identified by the software as not responsive does not in fact contain responsive documents. This will require a manual review of at least a statistically significant sample of those documents. This phase of computer-assisted review, taking reasonable steps to ensure that no responsive documents are being missed, has been widely discussed. See, e.g., EDRM Search Guide, "Non-Hit Validation," available at resources/guides/edrm-search-guide/validation-of-results; see also Roitblat, op. cit.; The Sedona Principles, Second Edition (2007). On the value and defensibility of using sampling in the review of electronic documents, see org/content/miscfiles/achieving_Quality.pdf; and LegalOverview09.pdf
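The sample-size arithmetic behind this validation step is straightforward. The sketch below is illustrative only: the population size, confidence level and margin of error are assumptions for the example, and the actual parameters should be settled with counsel and, ideally, opposing parties. It computes how many documents from the software's "not responsive" pile would need to be reviewed by hand to estimate, with stated confidence, how many responsive documents remain there:

```python
import math

def sample_size(population: int, confidence_z: float = 1.96,
                margin_of_error: float = 0.02, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion (e.g., the rate of
    responsive documents hiding in the 'not responsive' pile) at a given
    confidence level and margin of error. p=0.5 is the most conservative
    assumption; the finite-population correction shrinks the sample when
    the pile itself is small."""
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite-population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# 95% confidence (z = 1.96), +/-2% margin, over a 500,000-document pile:
print(sample_size(500_000))  # 2390 documents to review by hand
```

Note how weakly the sample grows with the population: the same parameters over a 5,000-document pile still call for roughly 1,600 reviewed documents, which is why sampling pays off most dramatically on large matters.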
The key is knowing how to use these powerful tools and, most importantly, when and how to use them in conjunction with other tools as part of an integrated, defensible process that makes the best use of your human assets and your budget. No software tool will magically solve the problem of volume and cost, but the right combination of tools, processes and human involvement can yield significant savings over traditional approaches to document review.

Using Concept Searching for First Pass Review

If your tool allows for a "teaching" approach to topic-based categorization, you should start by identifying a teaching set: documents that you know are highly representative of the kinds of documents you are interested in. Some projects will lend themselves to a simple, binary, either-or division of the document population: Responsive / Non-responsive. Other projects will require a more complex set of topics or categories; for example, there might be eight or ten issues or topics that count as "Responsive," with everything else being "Non-responsive." In any event, if your tool allows you to submit teaching sets, this is a good place to start.

Even if your tool offers a teaching approach, there might be situations where you want to see what the software throws up when given free rein to assess, for itself, what concept clusters exist in the population. You can then work with these clusters and devise groupings and categories that you would not otherwise have thought of. Alternatively, you may have started with a teaching set or several teaching sets and, having exhausted that approach, now want to find out what else is out there in your document set. You would tell the program to perform a free-form clustering of all of your as-yet-untagged documents.
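Free-form clustering of this kind can be sketched in miniature as grouping documents by the similarity of their term vectors. The toy Python fragment below is a crude bag-of-words illustration; the documents and the similarity threshold are invented, and real review platforms use far more sophisticated engines (such as latent semantic indexing) that match on concepts rather than shared words:

```python
from collections import Counter
import math

def vectorize(text: str) -> Counter:
    """Crude bag-of-words vector; real tools use concept-level features."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: each document joins the first cluster
    whose founding document it resembles; otherwise it founds a new one."""
    clusters = []  # list of (seed_vector, [doc indices])
    for i, doc in enumerate(docs):
        v = vectorize(doc)
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

docs = [
    "chop the vegetables and simmer the soup in a pot",
    "vegetable soup recipe: simmer chopped carrots in a pot",
    "quarterly revenue forecast for the gold futures account",
    "gold futures account statement for the quarter",
]
print(cluster(docs))  # documents 0-1 and 2-3 fall into two clusters
```

The point of the sketch is the workflow, not the algorithm: the software proposes groupings it discovered on its own, and the reviewer then names, splits or merges them.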
You will want to develop an approach that suits the needs of your particular case, but the following general approach would be a good template to follow, at least initially:

1. Work your way through each software-generated cluster, reviewing either all the documents in the cluster (if it is a small enough number) or a sample of the cluster.

2. As you do so, keep in mind that, when you make a determination, you need to lock in that determination, as it were, by tagging the document into one or more folders. [Different tools use different terms for this: folders, briefcases, binders, tags, etc. Here we will use "tag" for the action and "folder" for the collection of documents.]

3. You will be tagging documents according to whether they (a) are responsive or not (a simple binary choice) or (b) relate to one or more issues in the case.

4. Decide which of these approaches you want to use. For a first-pass review, you might be inclined to opt for the binary approach, but clustering is more accurate if the software is allowed to work with narrower, finer-grained distinctions (rather than having to summarize thousands of concepts and issues into an either-or distinction). The multiple-issue approach might make more sense, while also giving you a head start on your second-level review.

5. Whichever approach you take, work through the concept clusters. You will likely see that several have been given labels that do not make much sense. This is why you cannot simply say "Yes, I agree" when looking at a document assigned to a concept cluster; you need to decide for yourself what to call your issue folders and then which folder(s) to put each document into.

6. For each decision you make, consider whether the document is a particularly strong representative of that folder or issue. This is because, in the next round of clustering, you will be asking the software to find more like these.
To do this, you need the best possible set of example documents (some applications refer to them as "seed sets") on which the software will base its assessments.
7. When you have completed your review of the clusters, you should have a strong set of example documents with which to perform the second round of clustering. Different programs call this next phase by different names, but it is generally known as the "teaching" phase, where the machine-learning elements of the software are involved.

8. You now instruct the application, for each folder you have populated (one for each topic or issue, including basic "Responsive" and "Non-responsive"), to find similar documents.

9. When the results come back, you repeat the process: review the documents and decide (1) whether each document is in fact what you are looking for (confirm or reject) and (2) whether it is a particularly strong representative of that set of documents.

10. Remember that a document can be a particularly strong representative of non-responsive documents just as much as of responsive or specific-issue documents. If, for example, you find an email about how to make a good vegetable soup (and this is irrelevant to your case), a concept search based on that document, or even a block of text within that document, can go out and find all kinds of other documents that are likely to be equally non-responsive and irrelevant, even if none of the same words appear in them. The software will pull back documents referring to pots and pans, chopping, chicken, vegetables and other soup-related concepts. But beware, because it may also pull back documents touching on other, tangentially related but relevant topics, such as client entertainment (because some soup-related words will overlap with words like "lunch," "account," "expense"); or precious metals (cooking concepts overlap with "ounces" and "grams"); or climate change (cooking concepts overlap with "heat," "hot," "bake," "water," "warm"). So be careful; think through what the software might be doing in the background.
When you decide to use what is clearly a non-responsive and irrelevant document in your teaching set, you will find similarly non-responsive and irrelevant documents, but you should review these results (or at least a sample of them) to be sure you are not missing anything that was captured through these second-order associations. If you do find something interesting, you can use it as the basis for another round of similar-documents searching, finding unexpected interesting documents in a sea of mostly uninteresting documents.

11. After a number of careful iterations, you should be approaching a point of significantly diminishing marginal returns. The software has done its job, you have sampled and checked its results, and your groupings constitute an accurate, reliable and comprehensive first-pass review. And all of this has required only a fraction of the time and cost of traditional review.

Lawyers at Williams Mullen, an East-coast U.S. law firm, compared standard linear review and computer-assisted review and found that humans doing linear review produce significantly inferior results: "Not only was the [computer-assisted] review ten times faster, it resulted in the correct coding decision 99.8% of the time. In nearly every instance where there was a dispute between the 'read every document' approach of the linear review and our computer-assisted non-linear review, the non-linear review won out." Williams Mullen, EDIG: E-Discovery & Information Governance, May 2011, at 2-3.

According to kCura, the developers of Relativity, a computer-assisted review using the tool's Categorization feature, which runs on an LSI engine, can effectively code 100,000 documents by having a reviewer code only 5,000 documents (5% of the total). The software applies what it has learned from the reviewer's coding to the remaining 95,000 documents.
Relativity's approach suggests three review phases: a first coding phase followed by two QC phases in which the reviewer examines samples of the document groupings. See kCura, Speed Code Workflow (March 7, 2011), available from kCura.
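The "find similar documents" step in the teaching phase (steps 7 to 9 above) can likewise be sketched in miniature. In this toy Python fragment the folder names, documents and cutoff are invented, and the bag-of-words similarity is a crude stand-in for the LSI-style engines real tools use; the shape of the loop, however, mirrors the workflow: tagged examples drive suggestions, anything below the cutoff stays untagged for the next round or for human review:

```python
from collections import Counter
import math

def vec(text: str) -> Counter:
    """Crude bag-of-words vector; real tools match on concepts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar(folders, untagged, cutoff=0.35):
    """Suggest a folder for each untagged document: the folder holding
    its most similar example, provided similarity clears the cutoff."""
    suggestions = {name: [] for name in folders}
    for i, doc in enumerate(untagged):
        dv = vec(doc)
        best, score = None, cutoff
        for name, examples in folders.items():
            for example in examples:
                s = cosine(dv, vec(example))
                if s >= score:
                    best, score = name, s
        if best is not None:
            suggestions[best].append(i)
    return suggestions

folders = {
    "soup (non-responsive)": ["simmer the vegetable soup in a large pot"],
    "gold trades (responsive)": ["confirm the gold futures trade for the client account"],
}
untagged = [
    "add chopped carrots to the soup pot and simmer",
    "the client account shows a new gold futures trade",
    "meeting minutes from the annual retreat",
]
# The soup email and the trade confirmation are each routed to the right
# folder; the retreat minutes clear no cutoff and remain untagged.
print(find_similar(folders, untagged))
```

Each confirmed suggestion can then be promoted into the example set for the next iteration, which is exactly the "trust but verify" loop the text describes.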
This is just a sketch of how you can use concept searching to perform a cost-effective yet defensible first-pass review. How you do it, in detail, will depend on the capabilities of the software you choose, the issues in the case, time and staffing constraints and a range of case-specific considerations.

What should be clear, though, is that the strength and reliability of concept-based search tools now make it possible to dispense with the teams of contract attorneys, or even the three or four associates, you would normally have assigned to review an entire set of documents. For a fraction of the time and cost, those most knowledgeable about the case can use concept search tools to achieve better results than were ever possible using traditional linear review. The underlying technology has been proven to be effective, indeed more effective than traditional human review. The defensibility of concept searching combined with sampling as an approach to legal document review has been confirmed by The Sedona Canada Principles, which have themselves been adopted by legislators and judges across Canada.

In light of the demonstrated effectiveness and reliability of concept searching, it seems clear that an appropriately designed computer-assisted first pass review that combines this technology with rigorous sampling, review and quality-control processes would meet the proportionality requirements set forth in the various Canadian federal and provincial rules of practice. A good computer-assisted review process, when combined with non-waiver and claw-back agreements between parties, will, we believe, become the new best practice for anything but the smallest matters.20 Law firms and their clients should now feel comfortable adopting these methods as part of a comprehensive and well-designed discovery plan.
They can do so, not simply to save on time and expense, but also to secure the very real benefits, in both accuracy and consistency, that this technology provides.

20 On proportionality and the importance of collaboration between parties, particularly concerning claims of privilege and claw-back agreements, see Sedona Canada Principle 9. See also Sedona Canada Working Group, The Sedona Canada Commentary on Practical Approaches for Cost Containment (June 2011), available at documents/SedonaCanadaCostContainment.pdf; Sedona Canada Working Group, The Sedona Canada Commentary on Proportionality in Electronic Disclosure & Discovery (October 2010), available at WG7CommentaryonProportionality-for-public-comment.pdf.
Contacts

To find out more about Information Management and e-discovery and how we can help, please contact:

Dominic Jaar
National Practice Leader, Information Management & e-discovery
T: + (514) C: + (514) E:

David Sharpe
Manager, Information Management & e-discovery
T: + (416) E:

Visit us: Follow us:

The information contained herein is of a general nature and is not intended to address the circumstances of any particular individual or entity. Although we endeavor to provide accurate and timely information, there can be no guarantee that such information is accurate as of the date it is received or that it will continue to be accurate in the future. No one should act on such information without appropriate professional advice after a thorough examination of the particular situation. firms affiliated with KPMG International Cooperative ("KPMG International"), a Swiss entity. All rights reserved. Printed in Canada. The KPMG name, logo and "cutting through complexity" are registered trademarks or trademarks of KPMG International.