Making document review faster, cheaper and more accurate:
How concept searching can change the way your legal teams handle first pass review

kpmg.ca/imed | kpmg.com
Contents

Introduction
Why Concept Searching?
So what exactly is Concept Searching?
How is concept searching not just another kind of keyword searching?
Iterative versus non-iterative
What is the most cautious way of using concept clustering?
What is the most aggressive way of using concept clustering?
What is the best way to use concept clustering?
Using Concept Searching for First Pass Review
Introduction

"A party may satisfy its obligation to preserve, collect, review and produce electronically stored information in good faith by using electronic tools and processes such as data sampling, searching or by using selection criteria to collect potentially relevant electronically stored information." Sedona Canada Principles, Principle 7.

There has been a lot of talk in e-discovery circles these past few years about concept searching and its various manifestations. A technology that began with a few early innovators selling mysterious but fascinating applications to cautious law firms has become a well-accepted part of any full-featured processing and review platform. With concept searching, documents can now be searched, grouped, served up for review and even bulk-coded, not just based on what specific words can be found within them, but according to what they are about. And this technology, which identifies themes and topics within and across documents, can also be used to find, group and tag thousands of documents in the larger population based on how a small subset is coded. This means that a skilled reviewer knowledgeable about a case can code thousands or even tens of thousands of documents for every document actually reviewed. While this may seem impossible, the underlying technology has been rigorously tested, from both a statistical and a substantive legal perspective, and proven to be robust and reliable. But are we making the best use of these amazing new capabilities? We have all seen how an early adopter or power user can impress onlookers with a new tool's capabilities.
But more often than not, for every power user there are dozens or hundreds using the tool who are not using it well; who are perhaps allowing what it does well to blind them to its drawbacks; and who, in the world of e-discovery, are perhaps letting the glamour and gloss of the technology distract them from the need to adhere to the same old practices that are the backbone of any discovery project: thoroughness, quality control, consistency and defensibility. For all their benefits, these new concept-based review tools are just tools. They should always be deployed as part of a well-thought-out process designed by a subject-matter expert. And they will never eliminate the need for a good project manager.

But there is one thing these tools can do that many in the field have been hesitant to acknowledge. They can perform a first-pass responsiveness review more effectively than can a traditional team of reviewing attorneys. And they can do it faster and cheaper. The challenges to such a claim are obvious: first, how can you trust a machine to understand a document and decide what it is about? Then, even if you let a machine decide what is relevant, will you not have to go through every document to make sure the software got it right? In which case, why not just go through the documents the old way?

A leading authority on these matters, The Sedona Conference, has clearly stated that "[a] party may satisfy its obligation to preserve, collect, review and produce electronically stored information in good faith by using electronic tools and processes such as data sampling, searching or by using selection criteria to collect potentially relevant electronically stored information."1 Conor Crowley, an internationally recognized expert on e-discovery matters and a member of The Sedona Conference, has argued that, when done properly, computer-assisted methods are both defensible and effective in a first-pass review: "[T]he use of analytical software for culling and first-pass review should provide accurate, cost-effective relevancy determinations and should be considered no less defensible than the exclusive use of human reviewers."2

We will explain how these concerns about accuracy and defensibility are unwarranted when concept searching and concept-based categorization are deployed as part of a well-designed process. With the right tool and the right procedures, it is possible to get results that are higher-quality, more consistent and more reliable than any linear review could ever achieve, and to get them faster and cheaper. We will look at concept searching in the context of litigation and investigation, and examine how some of its most impressive capabilities can yield significant improvements in both quality and efficiency. We will also offer what, for some at least, may be some needed reassurance that these new tools can really do what many say they can. Finally, we hope to demystify what is still quite a confusing topic, made all the more confusing by the many companies offering one form of concept searching or another, many with their own catch-phrase branding.

Why Concept Searching?

The best way to understand why concept searching was adopted in the legal world is to start with two fundamental problems with the way document reviews were conducted in the past: first, the burden and cost of a traditional "linear review"; second, the inherent limitations of keyword searching.
Imagine a room full of boxes of documents that have to be reviewed. The task could be to make a simple yes/no decision about responsiveness, or it could be to identify issues that a document touches on. Often the documents will not be organized, so reviewers review them in the order they appear in the box, from start to finish (thus the term "linear"). This is time-consuming work; it is tiring, even mind-numbing; and the entire project depends for its accuracy and reliability on all reviewers assessing documents the same way and bringing the same level of attention to the task. These assumptions hardly ever hold.3

With tools that combine an image viewer and a relational database (the first were Summation and Concordance, adopted by law firms in the 1990s), it was possible to view documents on screen and code them much faster than in a paper environment. It was also possible to organize them by objective coding information like date, page count, author and recipient(s). One could also assign batches of documents by custodian or source/folder. But reviews were still essentially linear; everything had to be looked at.

1 The Sedona Canada Principles Addressing Electronic Discovery (January 2008) ("Sedona Canada Principles"), Principle 7. See also The Sedona [U.S.] Principles (Second Edition, 2007), Principle 11.

2 Crowley, Conor R., "Defending the Use of Analytical Software in Civil Discovery," Digital Discovery and E-Evidence, Vol. 10:16, September 16.

3 See Maura R. Grossman and Gordon V. Cormack, "Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?" (June 2011).
Images with bib coding (bibliographic or simple objective coding) were then supplemented with OCR (optical character recognition) technology, which turns text on a page into searchable electronic text. With searching and the grouping of results, one could identify subsets of more-important documents and examine them together. But OCR often yielded poor results (garbled text), causing words to be missed and documents to be overlooked. Even when keyword searches (as well as more sophisticated forms of word-based searching like Boolean, fuzzy, stemming, wildcard and proximity) could be run against good-quality OCR, reviewers still had to look at each document to find out what it was actually about. It was not enough to know that a word or phrase appeared in a document (making it responsive to the search criteria); someone had to decide whether it was actually relevant to the litigation.4

Many of the headaches and complaints from this period can be traced to people taking short-cuts because of the time and cost of linear review, even when aided by imaging, coding, OCR and searchability. All too often, parties would hand over all responsive documents, thus producing not just relevant documents but also a large number of false positives: documents that contained a supposedly relevant word or phrase but which had nothing at all to do with the litigation. Cost and burden drove litigants and counsel to search for alternatives to keyword search-assisted linear review. How could false positives (files that are responsive but irrelevant) be minimized? How could false negatives (relevant documents that are not responsive and therefore not retrieved) be minimized?5

At around this time, studies revealed that even the best use of the best search technology would retrieve large numbers of irrelevant documents while missing large numbers of relevant documents.6 And even if search technology could be improved, what could be done about the quality of human reviewers, who other studies have shown are often quite bad at making responsiveness and relevance decisions?7 At the same time, the per-hour cost of attorneys meant that any document review that uses attorneys is going to be expensive, even if good technologies and processes have minimized the number of documents to be reviewed.8 But why use attorneys? For decades, discovery had involved having the individuals most familiar with the matter go through their files and pull out documents responsive to the document request.

4 A document is "responsive" when it meets certain search criteria or falls within the scope of a document request. But not all responsive documents are relevant. Deciding whether a document is relevant involves a different assessment. For a listing of the Canadian federal and provincial Rules of Civil Procedure dealing with pleadings and relevance, see The Sedona Canada Commentary on Practical Approaches for Cost Containment: Best Practices for Managing the Preservation, Collection, Processing, Review & Analysis of Electronically Stored Information, April 2011, Appendix D.

5 For an overview of some of the techniques that have been used to address these problems over the years, see Review_Technologies.

6 See AutoDocumentReviewReliability.pdf. See also the EDI publications, and more particularly electronicdiscoveryinstitute.com/pubs/toolsfortextcategorization.pdf. When people talk about precision and recall, this is what they are talking about. Precision measures the degree to which a set of results is made up of nothing but what one is looking for: "Of all my results, how many of them are relevant?" A low precision score indicates a lot of false positives.
Recall measures the degree to which all of the relevant documents in the dataset being searched are retrieved by the search: "Are there relevant documents out there that I'm not finding?" A low recall score indicates a high number of false negatives.

7 See the discussion of a 1985 study by Blair and Maron in Herbert L. Roitblat, "Search & Information Retrieval Methods," The Sedona Conference Journal, Fall 2007, at 206: "The attorneys estimated that they had found more than 75% of the relevant documents, but more detailed analysis found that the number was actually only about 20%."

8 See The Sedona Canada Commentary on Practical Approaches for Cost Containment: Best Practices for Managing the Preservation, Collection, Processing, Review & Analysis of Electronically Stored Information, April 2011.
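The precision and recall measures described in note 6 can be made concrete with a few lines of arithmetic. The numbers below are invented for illustration and do not come from any real matter:

```python
# Illustrative numbers only: a hypothetical search returns 80 documents,
# 60 of which are truly relevant, out of 100 relevant documents in the
# whole collection.

def precision(true_positives, retrieved):
    """Of all my results, how many of them are relevant?"""
    return true_positives / retrieved

def recall(true_positives, total_relevant):
    """Of all the relevant documents out there, how many did I find?"""
    return true_positives / total_relevant

p = precision(60, 80)    # 0.75 -- 20 false positives dilute the result set
r = recall(60, 100)      # 0.60 -- 40 relevant documents were never retrieved
print(p, r)
```

In this example the search looks reasonably precise (75%) yet still misses 40% of the relevant documents, which is exactly the combination the studies cited above kept finding.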
The process may have been initiated and supervised by attorneys, but the all-important responsiveness calls were made by non-attorneys. Yet, with the advent of electronic documents, it became the responsibility of lawyers to make these decisions.

Side panel: Synonymy and polysemy

Concept searching is a way of dealing with synonymy and polysemy. Polysemy is the phenomenon where a word can have several meanings. (The word "stock" could refer to investments, cooking, car racing, lineage, cattle or inventory.) Synonymy is when the same meaning can be conveyed by several words or phrases (as in "start," "begin," "initiate," "turn on," "set off," "incite," "instigate," "provoke" and "flip the switch"). Synonymy and polysemy are the reasons why keyword searches will find the words you ask for even when they carry the wrong meaning, and will fail to find the meaning you are looking for, because you cannot possibly include in your search all the words and phrases that might convey the desired meaning. Polysemy leads to false positives; synonymy results in false negatives.

But who decided that these first-pass responsiveness decisions had to be made by lawyers? There is little if any case law on the subject. As cases grew larger, it became necessary (or so it was thought) for the largest firms to hire dozens or even hundreds of contract attorneys. These attorneys, some straight out of law school, would start the review with absolutely no understanding of the case. Even the most professional training protocols could not possibly make these contract attorneys knowledgeable enough to catch important nuances in the documents.
As a result, large-scale attorney reviews have become notorious (at least among those who have performed them) for the amount of shoddy work they involve: documents marked responsive/relevant when they are not, important documents tagged non-responsive, two coders sitting side-by-side (or on opposite sides of the country) making completely different calls on the same document.

In summary, traditional linear document reviews by attorneys rely on imperfect search techniques guaranteed to leave out important documents while forcing reviewers to wade through large numbers of irrelevant documents; they rely on often inexperienced lawyers unfamiliar with the case to perform mind-numbing work at a high level of precision and reliability; and they are expensive precisely because lawyers perform them.

Every one of these deficiencies can now be remedied. Through a carefully designed process, to be discussed in more detail below, it is now possible to save significant amounts of time and money by eliminating first-pass responsiveness review and yet identify a body of responsive documents with much better precision and recall than could ever be achieved through a traditional linear review.9

So what exactly is Concept Searching?

First, some definitions. Concept searching identifies the meanings of words using any of a number of different technologies, including latent semantic indexing ("LSI"); Bayesian statistical inference; support vector machine technology; ontologies, taxonomies and/or thesauri; and language modeling.10 Some tools use two or more of these in combination. Of these technologies, perhaps the most commonly used is LSI.11 Latent semantic indexing means a way of indexing a document based on the meanings that are hidden ("latent") within it.12 Determining how latent semantics can be found in a body of text is something that computer scientists and linguists have been working on since at least the 1970s.

9 For a discussion of precision and recall, see note 6, above.
10 See EDRM Search Guide, Draft v. 1.17, May 7, 2009. Available online as a PDF at wp-content/uploads/downloads/2010/02/edrm-search-guide-v1.17.pdf. EDRM (which stands for "Electronic Discovery Reference Model") is both an organization and an initiative aimed at standardizing terminology and best practices in the e-discovery field.

11 One of the leading document review platforms on the market today, kCura's Relativity, offers a module called Analytics, which is built on an LSI-based concept search engine.

12 A key point to make here is that LSI technology does not build indexes in the normal sense of the word, not like the lists of words used to perform standard keyword searches. Instead, an LSI index is a set of mathematical formulas representing the semantic content of each document.
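Note 12's point, that an LSI index is a set of mathematical formulas rather than a word list, can be sketched in miniature. The three toy documents below are invented for illustration; real engines index millions of documents, but the underlying mechanics (a term-document matrix reduced by singular value decomposition) belong to the same family of techniques:

```python
# A miniature latent semantic indexing sketch on invented toy data.
import numpy as np

docs = [
    "stock bond broker return investment",   # investment document
    "bond broker return portfolio",          # investment document, no "stock"
    "chicken soup stock recipe",             # cooking document, with "stock"
]
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: one row per term, one column per document.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Singular value decomposition; keeping only the top 2 singular values
# projects each document into a reduced "latent" concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:2]) @ Vt[:2]).T   # each document as a 2-d vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 (both about investments) score closer together in
# concept space than documents 0 and 2, even though document 1 never uses
# the word "stock" while document 2 does.
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))
```

This is the behaviour the paper describes: similarity driven by co-occurring vocabulary, not by the presence of any single keyword.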
Concept searching uses mathematics to identify relationships between words based on what is known as their "co-occurrence" within a document. It then groups documents according to these patterns and relationships. Concept searching does not require that any particular word or words appear in a document (the threshold requirement for any traditional keyword search). Even the concept of a "word" no longer really applies; concept search engines work with misspelled words, words in foreign languages, even scientific words not found in a normal dictionary. Most technologies are language-independent in that they do not rely on dictionaries, taxonomies or thesauri and do not need to know the meanings of specific words.

Another mathematical technique used in some concept search engines is Bayesian statistical inference. A leading information management, content management and e-discovery solution, Autonomy, is built on Bayesian technology. Concept search software can even assign a higher score to documents that focus on a particular topic and a lower score to a document that only touches on it in passing.

The basic functionality offered by these tools, then, is concept searching; depending on the tool, this functionality is powered by one or more specific computational technologies.13 The specific technology being deployed can have an important effect on speed, efficiency and results, but this should not obscure the fact that these tools are all trying to provide users in the legal discovery market with the same basic solution. All the terms, phrases and buzzwords so often heard at legal technology conferences are just ways of describing how one form of concept searching or another can be made useful in the legal world. They describe what concept searching can help you do with a body of documents.
For example: it can intelligently categorize them ("intelligent categorization"); it can tell you how you are likely to want to code a set of documents ("predictive coding"); it can suggest the coding you might want to apply ("suggested coding"); it can classify documents by topic ("topic classification"); and it enables prioritized review ("review prioritization"). At a fundamental level, these are all the same thing; they differ in the specific mathematical techniques used or in the added functionality a vendor has built into its specific application. These differences in functionality in turn give rise to the various catch-phrases you hear, together with a little marketing and branding. These vendors are all offering concept searching as a way to make sense of a large body of electronic documents by finding their hidden meanings and then gathering them into groups named for these meanings.14

13 Even the term "concept search" is a simplification; not all of these technologies search for or reveal concepts as we understand that word. This paper uses the term, however, because, for most people, the ultimate point of these technologies is to help make sense of what documents are about, i.e., to reveal the concepts they contain. Related concepts and technologies, including technologies that make LSI possible as well as others that LSI has made possible, include: singular value decomposition, vectorial semantics, term-document matrices, automated document classification, concept clustering, latent semantic mapping, latent semantic structure indexing, probabilistic latent semantic indexing, predictive categorization, and computer-assisted coding. Note that these terms are widely used and are neither associated with nor claimed by any particular vendor or service provider.

14 Vendors describe their concept-based offerings in various ways.
See e-Discovery Institute Survey on Predictive Coding, October 1, 2010, a report based on survey responses from the following companies: Capital Legal Solutions; Catalyst Repository Systems, Inc.; Equivio; FTI Technology; Gallivan Gallivan & O'Melia; Hot Neuron; InterLegis; Kroll Ontrack; Recommind; Valora Technologies, Inc.; and Xerox Litigation Services. The specific phrase "predictive coding" has been in the news lately. Recommind announced by press release on June 8, 2011 that it had obtained a patent on predictive coding. Commentators soon took issue with this claim, arguing that what Recommind described as predictive coding is nothing more than a set of technologies, or a combination of technologies, that had already been developed by others and that have been widely available for many years. For more information, see Jeremy Pickens, "The Recommind Patent and the Need to Better Define Predictive Coding" (June 2011). The patent is U.S. Patent 7,933,859 B1.
How is concept searching not just another kind of keyword searching?

Concept searches recognize conceptual similarities between documents by noticing how words relate to each other, how often they appear together or near each other, how far apart they tend to be and how often they do or do not appear in other documents with similar or different characteristics. Take the example of a set of documents, all of which contain the word "investment" and other investment-related terms. These documents will probably warrant a concept cluster called "investment." But the key is not the specific word "investment"; it is the co-occurrence of "investment" and all the other, conceptually related words. Thus, if there is another document that does not contain "investment" but does contain "bonds," "broker" and "return," it will be included in the same cluster. However, a different document containing "stock" would be less likely to be included in the "investment" cluster if it also includes "soup" and "chicken." (See the discussion of synonymy and polysemy in the side-panel, above.)

How the math actually accomplishes this is difficult to explain and beyond the ken of most of us. Making it all even more complex is that the algorithms are closely guarded as valuable commercial secrets. Yet the formulas underlying these methods have been in the public domain for decades (even centuries, if we consider Bayes's work) and have been tested in the most rigorous peer-review environments.15 These tools work.

Iterative versus non-iterative

The simplest approach to concept searching is non-iterative. You have the software identify clusters of documents based on content, issues or topics. You then move these subsets of documents through standard review workflows. Some could go immediately to senior reviewers, some could be treated as less relevant and others could be designated as irrelevant and set aside, perhaps for a final check.
In other words, you use concept searching simply to get you started with a set of groupings and then perform a traditional review. The tool does nothing more than help you prioritize your otherwise linear review. The value of this simple approach is not to be underestimated; the mental fatigue caused by having to switch back and forth between thematically dissimilar documents can be greatly reduced by giving reviewers groups of documents that deal with the same set of related topics.

15 See, e.g., Berry, Michael W., Dumais, Susan T., and O'Brien, Gavin W., "Using Linear Algebra for Intelligent Information Retrieval" (December 1994), SIAM Review 37:4 (1995); Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, The Sedona Conference, 2007; Graesser, A., and Karnavat, A., "Latent Semantic Analysis Captures Causal, Goal-oriented, and Taxonomic Structures," Proceedings of CogSci 2000; Thomas Hofmann, "Probabilistic Latent Semantic Indexing," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, August 15-19, 1999, 50-57; Dian I. Martin and Michael W. Berry, "Latent Semantic Indexing," in Encyclopedia of Library and Information Sciences, Third Edition, Feb. 2010; Leonhard Hennig, "Topic-based Multi-Document Summarization with Probabilistic Latent Semantic Analysis," Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2009).
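The "stock" example above (investment versus chicken soup) can be sketched as a toy cluster-assignment function. Real concept engines use latent-space mathematics rather than raw vocabulary overlap, and every word list below is invented for illustration, but the co-occurrence intuition is the same: the words around an ambiguous term decide which cluster a document joins.

```python
# Illustrative only: assign documents to named clusters by how much of
# their vocabulary co-occurs with each cluster's characteristic words.

clusters = {
    "investment": {"stock", "bond", "broker", "return", "portfolio", "investment"},
    "cooking": {"stock", "soup", "chicken", "recipe", "simmer"},
}

def best_cluster(document):
    words = set(document.lower().split())
    # Score each cluster by the size of its overlap with the document.
    scores = {name: len(words & vocab) for name, vocab in clusters.items()}
    return max(scores, key=scores.get)

# "stock" alone is ambiguous; its neighbouring words decide the cluster.
print(best_cluster("the broker sold stock for a good return"))  # investment
print(best_cluster("simmer the chicken stock for the soup"))    # cooking
```

Both sentences contain "stock," yet each lands in the cluster suggested by the words that co-occur with it, which is the polysemy problem keyword searching cannot solve on its own.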
"In many settings involving electronically stored information, [the] time and burden of a manual search process for the purpose of finding producible data may not be feasible or justified. In such cases, the use of automated search methods should be viewed as reasonable, valuable and even necessary." Sedona Canada Principles, Comment 7.c.

A more complex approach takes advantage of the learning capabilities of concept searching software. You run a search, review the results (or a sample thereof) and have the software run the search again, but better: better because your review has allowed you to give the software instructions on how to perform a better search. In the most scientific approach, you start with a random sample (of sufficient size) of the entire dataset and have someone who understands the issues in the case review it and make the desired decision.16 The software examines the documents you selected, identifies their shared characteristics, looks for the same characteristics in the remaining dataset, pulls out similar documents and presents them for review. The same reviewer (ideally) goes through these results and essentially corrects the work of the software by confirming which documents are wanted (true positives) and which are not (false positives). These new decisions are then fed to the software and, in a second and third round, it refines its search to find more documents like the true positives and leave out anything like the false positives. This is the approach used in the latest version of KPMG's Discovery Radar (DR 4.1), which uses the support vector machine technology developed by Equivio. Thus, an experienced attorney, performing a focused review of a small subset of documents, can use these tools to extend his or her decisions across the entire population of documents in the case. But can this method really be trusted?
Can it be trusted to perform fine-grained issue coding? Can it be trusted to perform a privilege review? The answer is yes, but only as part of a multi-stage, multi-layered process that incorporates sampling and testing of results according to recognized leading practices. Studies have shown, repeatedly and conclusively, that human beings are not very good at making consistent and accurate judgment calls when reviewing documents. But these tools are not merely better than mediocre reviewers; they have been shown to be exceptionally good, reaching or even exceeding the performance of even top-quality reviewers.17

What is the most cautious way of using concept clustering?

Before we discuss in more detail how these tools can be used, take a moment to think about a key distinction: Will you allow the software to decide what a document is about (and apply coding based on that assessment), or will you merely allow it to suggest what a document appears to be about so that, using your own judgment, you decide what it is about and apply the appropriate coding? Once this distinction is clearly grasped, it is obvious that the word "coding" (as in "predictive coding")

16 A statistically representative sample can be surprisingly small: as low as 0.24% if the population being sampled is large enough. See Doug Stewart, "Application of Simple Random Sampling (SRS) in e-Discovery," April 20, 2011 ("a sample of fewer than 2,400 records from a population of one million can be used to accurately estimate the population as a whole"). The smaller the total population, the larger the sample percentage needs to be.

17 Maura R. Grossman and Gordon V. Cormack, "Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review," XVII RICH. J.L. & TECH.
11 (2011). See also Bruce Hedin et al., "Overview of the TREC 2009 Legal Track," in NIST Special Publication: SP, The Eighteenth Text Retrieval Conference (TREC 2009) Proceedings 16 & tbl. 5 (2009); Douglas W. Oard et al., "Overview of the TREC 2008 Legal Track," in NIST Special Publication: SP, The Seventeenth Text Retrieval Conference (TREC 2008) Proceedings 8 (2008).
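The sampling arithmetic in note 16 can be reproduced with the standard sample-size formula. The note does not state its parameters, so 95% confidence, a 2% margin of error and worst-case 50% prevalence are assumed here; those assumptions reproduce the "fewer than 2,400 records from a population of one million" figure:

```python
# Sample-size arithmetic behind note 16 (assumed parameters: 95%
# confidence, +/-2% margin of error, worst-case 50% prevalence).
import math

def sample_size(population, confidence_z=1.96, margin=0.02, p=0.5):
    # Infinite-population sample size (the classic z^2 * p * (1-p) / e^2).
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: smaller populations need a larger fraction.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(1_000_000))  # a little under 2,400, i.e. about 0.24%
print(sample_size(10_000))     # a much larger *percentage* of a smaller population
```

This also illustrates the note's closing point: the absolute sample size barely grows with the population, so the required sampling percentage shrinks as the dataset gets larger.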
is often misused in this context. Even those who talk about how their software can perform "predictive coding" do not usually recommend that the software be trusted to make coding decisions on its own and without any human involvement. What they are selling is a tool that predicts how an attorney is likely to understand the content of a document ("This memo is about investments, not about cooking."). The determination that it is therefore responsive is an entirely separate mental operation. These software tools, properly understood, do not presume to make that decision. In the end, the vaunted "prediction" is merely a grouping of documents into clusters based on semantic content and the assignment of names to those groupings (such as "investment" or "cooking"), nothing more. True coding comes later.

We can now see that there is a spectrum that extends from (1) trusting the software to perform coding with litigation consequences (e.g., Responsive / Not Responsive) to (2) trusting the software only to create topic-based groups of documents, which you then review in the traditional manner. If you were to adopt the second approach, you would likely be paying a lot of money for very little benefit. The grouping of documents into clusters is useful, but concept search tools allow for much more than just a linear review of subsets of documents; they allow for sophisticated re-groupings of new subsets within initial subsets, provisional coding for work-assignment purposes, and the creation of workflows so that the right people look at the right documents as quickly as possible.

Some might argue that this cautious approach is still the best because it ensures that a human being makes each key decision. But, as indicated above, studies have shown that having an individual look at each document has serious flaws, flaws which only computer technology could reveal.
In light of this fact, such a cautious use of concept searching looks much less appealing; it is still very time-consuming and expensive, and this much caution is likely to be unwarranted given that the software is now very effective.
What is the most aggressive way of using concept clustering?

Moving to the other end of the spectrum is not necessarily a wiser choice. Letting the software do all the work, from indexing right through to coding and production, with human intervention limited to choosing which concepts count as responsive and which do not, requires an immense faith in an algorithm's ability to group documents by issue, do this well (with no false positives or false negatives) and assign production-oriented coding based on these groupings. True, there could be huge cost and time savings and, quite possibly, very good results with respect to what is produced and what is withheld. But you would be entrusting the entire discovery phase of litigation to a software application without any attempt to check its results.

What is the best way to use concept clustering? A balanced approach

The best way to use concept-search tools is to have the technology assist the human and the human assist the technology. Let the tool do what it is good at, up to its maximum capability (in other words, do not be afraid to use its best features against the largest volume of data), but then be sure that you bring to bear a separate set of tools and processes to check its work. To quote Reagan: "Trust, but verify." Concept search technology, when coupled with good visualizations and tagging systems, does nothing more and nothing less than gather together documents that are likely to be highly relevant to your case. It helps you to find the most useful (and most dangerous) documents as quickly as possible. When you have identified these high-value populations, you can pass them quickly to your lead attorneys so that they can make important strategic decisions about the case. You can get a head-start on witness binders and affidavits of documents. It allows for these things without preventing you from reviewing the remaining documents.
A key question that has to be addressed, one that requires a separate step in your "trust but verify" process, is how confident you are that you are not leaving responsive documents out of your production solely on the strength of the software's clustering decisions. You need to take reasonable steps to assure yourself that the body of documents identified by the software as not responsive does not in fact contain responsive documents. This will require a manual review of at least a statistically significant sample of those documents. This phase of computer-assisted review, taking reasonable steps to ensure that no responsive documents are being missed, has been widely discussed. See, e.g., EDRM Search Guide, "Non-Hit Validation," available at resources/guides/edrm-search-guide/validation-of-results; see also Roitblat, op. cit.; The Sedona Principles, Second Edition (2007). On the value and defensibility of using sampling in the review of electronic documents, see org/content/miscfiles/achieving_Quality.pdf; and LegalOverview09.pdf
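The sample-size arithmetic behind this validation step is straightforward. The sketch below is illustrative only: the population size, confidence level and margin of error are assumptions for the example, and the actual parameters should be settled with counsel and, ideally, opposing parties. It computes how many documents from the software's "not responsive" pile would need to be reviewed by hand to estimate, with stated confidence, how many responsive documents remain there:

```python
import math

def sample_size(population: int, confidence_z: float = 1.96,
                margin_of_error: float = 0.02, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion (e.g., the rate of
    responsive documents hiding in the 'not responsive' pile) at a given
    confidence level and margin of error. p=0.5 is the most conservative
    assumption; the finite-population correction shrinks the sample when
    the pile itself is small."""
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite-population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# 95% confidence (z = 1.96), +/-2% margin, over a 500,000-document pile:
print(sample_size(500_000))  # 2390 documents to review by hand
```

Note how weakly the sample grows with the population: the same parameters over a 5,000-document pile still call for roughly 1,600 reviewed documents, which is why sampling pays off most dramatically on large matters.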
The key is knowing how to use these powerful tools and, most importantly, when and how to use them in conjunction with other tools as part of an integrated, defensible process that makes the best use of your human assets and your budget. No software tool will magically solve the problem of volume and cost, but the right combination of tools, processes and human involvement can yield significant savings over traditional approaches to document review.

Using Concept Searching for First Pass Review

If your tool allows for a "teaching" approach to topic-based categorization, you should start by identifying a teaching set: documents that you know are highly representative of the kinds of documents you are interested in. Some projects will lend themselves to a simple, binary, either-or division of the document population: Responsive / Non-responsive. Other projects will require a more complex set of topics or categories; for example, there might be eight or ten issues or topics that count as "Responsive," with everything else being "Non-responsive." In any event, if your tool allows you to submit teaching sets, this is a good place to start.

Even if your tool offers a teaching approach, there might be situations where you want to see what the software throws up when given free rein to assess, for itself, what concept clusters exist in the population. You can then work with these clusters and devise groupings and categories that you would not otherwise have thought of. Alternatively, you may have started with a teaching set or several teaching sets and, having exhausted that approach, now want to find out what else is out there in your document set. You would tell the program to perform a free-form clustering of all of your as-yet-untagged documents.
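Free-form clustering of this kind can be sketched in miniature as grouping documents by the similarity of their term vectors. The toy Python fragment below is a crude bag-of-words illustration; the documents and the similarity threshold are invented, and real review platforms use far more sophisticated engines (such as latent semantic indexing) that match on concepts rather than shared words:

```python
from collections import Counter
import math

def vectorize(text: str) -> Counter:
    """Crude bag-of-words vector; real tools use concept-level features."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: each document joins the first cluster
    whose founding document it resembles; otherwise it founds a new one."""
    clusters = []  # list of (seed_vector, [doc indices])
    for i, doc in enumerate(docs):
        v = vectorize(doc)
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

docs = [
    "chop the vegetables and simmer the soup in a pot",
    "vegetable soup recipe: simmer chopped carrots in a pot",
    "quarterly revenue forecast for the gold futures account",
    "gold futures account statement for the quarter",
]
print(cluster(docs))  # documents 0-1 and 2-3 fall into two clusters
```

The point of the sketch is the workflow, not the algorithm: the software proposes groupings it discovered on its own, and the reviewer then names, splits or merges them.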
You will want to develop an approach that suits the needs of your particular case, but the following general approach would be a good template to follow, at least initially:

1. Work your way through each software-generated cluster, reviewing either all the documents in the cluster (if it is a small enough number) or a sample of the cluster.

2. As you do so, keep in mind that, when you make a determination, you need to lock in that determination, as it were, by tagging the document into one or more folders. [Different tools use different terms for this: folders, briefcases, binders, tags, etc. Here we will use "tag" for the action and "folder" for the collection of documents.]

3. You will be tagging documents according to whether they (a) are responsive or not (a simple binary choice) or (b) relate to one or more issues in the case.

4. Decide which of these approaches you want to use. For a first-pass review, you might be inclined to opt for the binary approach, but clustering is more accurate if the software is allowed to work with narrower, finer-grained distinctions (rather than having to summarize thousands of concepts and issues into an either-or distinction). The multiple-issue approach might make more sense, while also giving you a head start on your second-level review.

5. Whichever approach you take, work through the concept clusters. You will likely see that several have been given labels that do not make much sense. This is why you cannot simply say "Yes, I agree" when looking at a document assigned to a concept cluster; you need to decide for yourself what to call your issue folders and then which folder(s) to put each document into.

6. For each decision you make, consider whether the document is a particularly strong representative of that folder or issue. This is because, in the next round of clustering, you will be asking the software to find more like these.
To do this, you need the best possible set of example documents (some applications refer to them as "seed sets") on which the software will base its assessments.
7. When you have completed your review of the clusters, you should have a strong set of example documents with which to perform the second round of clustering. Different programs call this next phase by different names, but it is generally known as the "teaching" phase, where the machine-learning elements of the software are involved.

8. You now instruct the application, for each folder you have populated (one for each topic or issue, including basic "Responsive" and "Non-responsive"), to find similar documents.

9. When the results come back, you repeat the process: review the documents and decide (1) whether each document is in fact what you are looking for (confirm or reject) and (2) whether it is a particularly strong representative of that set of documents.

10. Remember that a document can be a particularly strong representative of non-responsive documents just as much as of responsive or specific-issue documents. If, for example, you find an email about how to make a good vegetable soup (and this is irrelevant to your case), a concept search based on that document, or even a block of text within that document, can go out and find all kinds of other documents that are likely to be equally non-responsive and irrelevant, even if none of the same words appear in them. The software will pull back documents referring to pots and pans, chopping, chicken, vegetables and other soup-related concepts. But beware, because it may also pull back documents touching on other, tangentially related but relevant topics, such as client entertainment (because some soup-related words will overlap with words like "lunch," "account," "expense"); or precious metals (cooking concepts overlap with "ounces" and "grams"); or climate change (cooking concepts overlap with "heat," "hot," "bake," "water," "warm"). So be careful; think through what the software might be doing in the background.
When you decide to use what is clearly a non-responsive and irrelevant document in your teaching set, you will find similarly non-responsive and irrelevant documents, but you should review these results (or at least a sample of them) to be sure you are not missing anything that was captured through these second-order associations. If you do find something interesting, you can use it as the basis for another round of similar-documents searching, finding unexpected interesting documents in a sea of mostly uninteresting documents.

11. After a number of careful iterations, you should be approaching a point of significantly diminishing marginal returns. The software has done its job, you have sampled and checked its results, and your groupings constitute an accurate, reliable and comprehensive first-pass review. And all of this has required only a fraction of the time and cost of traditional review.

Lawyers at Williams Mullen, an East-coast U.S. law firm, compared standard linear review and computer-assisted review and found that humans doing linear review produce significantly inferior results: "Not only was the [computer-assisted] review ten times faster, it resulted in the correct coding decision 99.8% of the time. In nearly every instance where there was a dispute between the 'read every document' approach of the linear review and our computer-assisted non-linear review, the non-linear review won out." Williams Mullen, EDIG: E-Discovery & Information Governance, May 2011, at 2-3.

According to kCura, the developers of Relativity, a computer-assisted review using the tool's Categorization feature, which runs on an LSI engine, can effectively code 100,000 documents by having a reviewer code only 5,000 documents (5% of the total). The software applies what it has learned from the reviewer's coding to the remaining 95,000 documents.
Relativity's approach suggests three review phases: a first coding phase followed by two QC phases in which the reviewer examines samples of the document groupings. See kCura, Speed Code Workflow (March 7, 2011), available from kCura.
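The "find similar documents" step in the teaching phase (steps 7 to 9 above) can likewise be sketched in miniature. In this toy Python fragment the folder names, documents and cutoff are invented, and the bag-of-words similarity is a crude stand-in for the LSI-style engines real tools use; the shape of the loop, however, mirrors the workflow: tagged examples drive suggestions, anything below the cutoff stays untagged for the next round or for human review:

```python
from collections import Counter
import math

def vec(text: str) -> Counter:
    """Crude bag-of-words vector; real tools match on concepts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar(folders, untagged, cutoff=0.35):
    """Suggest a folder for each untagged document: the folder holding
    its most similar example, provided similarity clears the cutoff."""
    suggestions = {name: [] for name in folders}
    for i, doc in enumerate(untagged):
        dv = vec(doc)
        best, score = None, cutoff
        for name, examples in folders.items():
            for example in examples:
                s = cosine(dv, vec(example))
                if s >= score:
                    best, score = name, s
        if best is not None:
            suggestions[best].append(i)
    return suggestions

folders = {
    "soup (non-responsive)": ["simmer the vegetable soup in a large pot"],
    "gold trades (responsive)": ["confirm the gold futures trade for the client account"],
}
untagged = [
    "add chopped carrots to the soup pot and simmer",
    "the client account shows a new gold futures trade",
    "meeting minutes from the annual retreat",
]
# The soup email and the trade confirmation are each routed to the right
# folder; the retreat minutes clear no cutoff and remain untagged.
print(find_similar(folders, untagged))
```

Each confirmed suggestion can then be promoted into the example set for the next iteration, which is exactly the "trust but verify" loop the text describes.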
This is just a sketch of how you can use concept searching to perform a cost-effective yet defensible first-pass review. How you do it, in detail, will depend on the capabilities of the software you choose, the issues in the case, time and staffing constraints and a range of case-specific considerations.

What should be clear, though, is that the strength and reliability of concept-based search tools now make it possible to dispense with the teams of contract attorneys, or even the three or four associates, you would normally have assigned to review an entire set of documents. For a fraction of the time and cost, those most knowledgeable about the case can use concept search tools to achieve better results than were ever possible using traditional linear review. The underlying technology has been proven to be effective, indeed more effective than traditional human review. The defensibility of concept searching combined with sampling as an approach to legal document review has been confirmed by The Sedona Canada Principles, which have themselves been adopted by legislators and judges across Canada.

In light of the demonstrated effectiveness and reliability of concept searching, it seems clear that an appropriately designed computer-assisted first pass review that combines this technology with rigorous sampling, review and quality-control processes would meet the proportionality requirements set forth in the various Canadian federal and provincial rules of practice. A good computer-assisted review process, when combined with non-waiver and claw-back agreements between parties, will, we believe, become the new best practice for anything but the smallest matters.20 Law firms and their clients should now feel comfortable adopting these methods as part of a comprehensive and well-designed discovery plan.
They can do so, not simply to save on time and expense, but also to secure the very real benefits, in both accuracy and consistency, that this technology provides.

20 On proportionality and the importance of collaboration between parties, particularly concerning claims of privilege and claw-back agreements, see Sedona Canada Principle 9. See also Sedona Canada Working Group, The Sedona Canada Commentary on Practical Approaches for Cost Containment (June 2011), available at documents/SedonaCanadaCostContainment.pdf; Sedona Canada Working Group, The Sedona Canada Commentary on Proportionality in Electronic Disclosure & Discovery (October 2010), available at WG7CommentaryonProportionality-for-public-comment.pdf.
Contacts

To find out more about Information Management and e-discovery and how we can help, please contact:

Dominic Jaar
National Practice Leader, Information Management & e-discovery
T: + (514) C: + (514) E:

David Sharpe
Manager, Information Management & e-discovery
T: + (416) E:

Visit us: Follow us:

The information contained herein is of a general nature and is not intended to address the circumstances of any particular individual or entity. Although we endeavor to provide accurate and timely information, there can be no guarantee that such information is accurate as of the date it is received or that it will continue to be accurate in the future. No one should act on such information without appropriate professional advice after a thorough examination of the particular situation. firms affiliated with KPMG International Cooperative ("KPMG International"), a Swiss entity. All rights reserved. Printed in Canada. The KPMG name, logo and "cutting through complexity" are registered trademarks or trademarks of KPMG International.