Comprehending the Challenges of Technology Assisted Document Review: Predictive Coding in Multi-Language E-Discovery

UBIC North America, Inc.
3 Lagoon Dr., Ste. 180, Redwood City, CA 94065
877-321-8242 / usinfo@ubicna.com / www.ubicna.com
Redwood City | Washington DC | New York | Seoul | Taipei | Tokyo | London
Computers assist humans with every part of our lives today, from transportation to banking to shopping. We even have technology-assisted communication through e-mail and text messages. Computers assist lawyers every day with drafting documents, billing clients and legal research, so there's no reason they can't lend a hand in the tedious and expensive exercise of e-discovery. Today's litigation matters involve such an unwieldy amount of electronically stored information (ESI) that no human could hope to gaze upon every file. That's where technology can assist.

The promise of Technology Assisted Review (TAR) or Computer Assisted Review (CAR) is that lawyers no longer have to personally examine ALL the ESI collected in a matter. That doesn't mean lawyers don't need to look at SOME of the relevant documents; it just means they don't have to waste time looking at ALL of the documents (especially the non-relevant information).

Traditional TAR with Search Terms

We're all familiar with search terms as a form of TAR. Only a computer can search millions of electronic files and precisely pull out the ones containing the magic words or phrases. Any file that contains a search term is set aside while the rest are assumed to be non-relevant. Human reviewers can then look at the files containing search terms and tag or code them as responsive or non-responsive to the matter.

Somewhere in the history of contemporary litigation practice, search terms became the accepted method for filtering electronic files, but they have a mixed record of identifying responsive documents. For example, an e-mail or document must contain the exact search term to be returned as a search result; misspellings and shorthand references are completely ignored. Search terms are also completely powerless in determining the contextual use of a word. If you search for the word "bow",
you will get results regardless of whether the word refers to archery, sailing, music or etiquette. Courts have lambasted the inadequacy and limitations of using search terms as an effective approach to e-discovery.¹ But even if search terms, with all their shortcomings, are used to initially cull down a large collection of ESI, humans must still be employed to determine how the documents aid the strategic goals of the litigation.

Other forms of TAR have sought to fill this need under the rubrics of clustering and concept searches. These approaches use computer algorithms to analyze electronic files and group them together based on the similarity of words and phrases found in the files. In a similar fashion, predictive coding describes a method where a computer predicts that a file will be responsive by comparing it to a set of highly responsive files identified by a lawyer. A successful predictive coding project, therefore, requires an initial training exercise guided by humans.

1 See Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008): "While keyword searches have long been recognized as appropriate and helpful for ESI search and retrieval, there are well-known limitations and risks associated with them, and proper selection and implementation obviously involves technical, if not scientific knowledge." Also see Custom Hardware Eng'g & Consulting v. Dowell, 2012 U.S. Dist. LEXIS 146 (E.D. Mo. Jan. 3, 2012): "While keyword searches have long been recognized as appropriate and helpful for ESI search and retrieval, there are well-known limitations and risks associated with them. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 260 (D. Md. 2008). These limitations and risks exist because '[k]eyword searches identify all documents containing a specified term regardless of context.' As a result, such searches may capture many documents irrelevant to the user's query, but at the same time exclude common or inadvertently misspelled instances of the term. Therefore, keyword searches end up being both over- and under-inclusive in light of the inherent malleability and ambiguity of spoken and written English (as well as other languages)." As a result, the usefulness of keyword searches as a means of discovery is limited by their dependence on matching a specific, sometimes arbitrary choice of language to describe the targeted topic of interest. Citing The Sedona Conference Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery, 8 Sedona Conf. J. 189, 201 (2007).

Training Humans and Computers

Lawyers in large litigation matters are familiar with the process of training a group of other lawyers to recognize responsive documents. Typically, a senior lawyer who is intimately familiar with the matter will already possess a group of "hot" documents that they will show to the other lawyers who are tasked with finding similar documents in a large review database.

This system would work wonderfully if you didn't have to account for human opinion. A group of 10 document reviewers may not always agree on how to code a document. There will, of course, be a small set of documents that all 10 human reviewers agree are relevant. But for the vast majority of documents, there will be gradations of agreement among the human reviewers. If 8 or 9 of the 10 human reviewers agree that a document is responsive, there's a good chance that it is responsive. But if 3 or 4 of the 10 human reviewers declare a document to be responsive, is it actually responsive?
Predictive coding tools work in the same manner, but they take out the elements of human error, opinion and distraction. An experienced lawyer identifies a "seed set" of highly responsive documents, and then a computer utilizes an algorithm to compare those documents to the greater body of collected ESI. The tool assigns a score to each document based on its similarity to the seed set. A high number means there's an excellent chance the document is responsive. A low number means the document can be set aside, since there's a low probability that it is responsive.

The results of a predictive coding exercise can constantly be validated and tweaked by having the senior lawyers view small subsets of the predictively coded documents. If non-responsive documents have been coded as responsive by the tool, the lawyers can correct the error, which further educates the algorithm. In most cases, a predictive coding tool is cheaper, faster and more accurate than manual document review methods. This helps to secure "the just, speedy, and inexpensive determination" of cases as stated in Rule 1 of the Federal Rules of Civil Procedure.²

The Benefits of Predictive Coding

The 2012 RAND study entitled Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery determined that 73% of every dollar spent on e-discovery goes to document review costs.³ That's an astonishing number. Predictive coding tools can lower that number drastically because there will be far less data for humans to review.

Predictive coding tools are also faster. Put simply, a computer can review documents faster than a human. At best, a single human being can look at 80-120 documents per hour. A predictive coding tool can zip through 330,000 documents per hour.⁴
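The seed-set scoring step described above can be sketched in a few lines of code. The snippet below is a deliberately minimal illustration, not any vendor's actual algorithm: it builds raw term-frequency vectors and scores each unreviewed document by its cosine similarity to the centroid of a hypothetical seed set.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Lower-case the text, split on non-letters, count term frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def score_documents(seed_set, documents):
    """Score every document by its similarity to the seed-set centroid."""
    centroid = Counter()
    for doc in seed_set:
        centroid.update(tf_vector(doc))
    return [(doc, cosine(tf_vector(doc), centroid)) for doc in documents]

# Hypothetical seed set coded as responsive by a senior lawyer.
seeds = ["pricing agreement with the distributor",
         "confirm the agreed pricing terms before signing"]
corpus = ["please review the pricing agreement terms",
          "lunch menu for the office party"]
for doc, score in score_documents(seeds, corpus):
    print(f"{score:.2f}  {doc}")
```

Production tools use far more sophisticated weighting and classification than raw term frequencies, but the intuition is the same: the higher the score, the more the document resembles what the lawyer has already marked as responsive.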
Lastly, predictive coding tools require the input of the more experienced lawyers involved in a litigation matter, versus relying on barely trained reviewers who may have only trivial knowledge of the overall strategy.

2 See Peck, Andrew, United States Magistrate Judge for the Southern District of New York, "Search, Forward," Law.com, October 1, 2011 (http://www.law.com/jsp/lawtechnologynews/pubarticleltn.jsp?id=1202516530534): "In my opinion, computer assisted coding should be used in those cases where it will help 'secure the just, speedy, and inexpensive' (Fed. R. Civ. P. 1) determination of cases in our e-discovery world."
3 Nicholas M. Pace & Laura Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery, RAND Institute for Civil Justice, 2012 (http://www.rand.org/pubs/monographs/mg1208.html).
4 As measured by Lit i View, UBIC, 2013.
Barriers to the Widespread Adoption of Predictive Coding

If predictive coding tools promise so many advantages, why do they have so much trouble finding traction among litigators? The paramount reason is the paralyzing fear of producing a privileged or confidential document to an opposing party; that's why lawyers have historically attempted to personally vet every single document before turning it over to the other side. The risk of producing a privileged document to an opponent is a compelling motivator for thoroughness.

Another reason has been the uncertainty involved with "black box" technology that hasn't been officially approved by the bench. Up until three or four years ago, that may have been a significant concern. But in February 2012, Judge Andrew Peck declared in his Da Silva Moore opinion that "Counsel no longer have to worry about being the first or 'guinea pig' for judicial acceptance of computer-assisted review."⁵

5 Da Silva Moore v. Publicis Groupe & MSL Group, No. 11 Civ. 1279 (ALC) (AJP), 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012): "This judicial opinion now recognizes that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases. ... Counsel no longer have to worry about being the first or 'guinea pig' for judicial acceptance of computer-assisted review."
6 Global Aerospace, Inc. v. Landow Aviation, L.P., No. CL 61040 (Va. Cir. Ct. Apr. 23, 2012).
7 Gabriel Technologies Corp. v. Qualcomm Inc., Civ. No. 08cv1992 AJB (MDD), 2013 U.S. Dist. LEXIS 14105 (S.D. Cal. Feb. 1, 2013).
8 EORHB, Inc. v. HOA Holdings, LLC, No. 7409-VCL (Del. Ch. Oct. 15, 2012).
9 "[The Federal Rules of Civil Procedure] govern the procedure in all civil actions and proceedings in the United States ... They should be construed and administered to secure the just, speedy, and inexpensive determination of every action and proceeding." Fed. R. Civ. P. 1.
Other opinions such as Global Aerospace,⁶ Gabriel Technologies,⁷ and EORHB⁸ have championed predictive coding in a similar vein: it can accomplish the goal of securing the speedy and inexpensive determination of litigation.⁹ In all of these cases, the standard is NOT that the predictive coding tool must be perfect, but that its approach be reasonable and defensible.

The Requirements for a Predictive Coding Tool

When considering a predictive coding tool, you should ask whether the tool supports both supervised learning (i.e., the lawyer provides example documents) and active learning (i.e., the computer algorithm chooses documents to be coded by the lawyer). A reasonable predictive coding plan will require several iterative learning cycles, in which the technology identifies responsive files and a lawyer then ranks how well the tool performed. Based on this additional feedback, the technology improves the accuracy of its predictive coding. It is also imperative that the predictive coding tool be able to produce detailed reports and metrics at every stage of the
project. Not only will these reports assist in defending the overall approach, they will also help to validate the results throughout the project's lifecycle. Lastly, you should inquire about all forms of technology assisted review that a vendor offers, since it may not be enough to simply ask for "predictive coding." UBIC's Lit i View, for example, offers a clustering scheme based on predictive coding that helps to prioritize documents for human review. This unique combination of technologies can provide a distinct advantage in certain matters.

The Challenge of Mixed Languages in Predictive Coding

The discussion around predictive coding is geographically centered in the United States, mainly due to the country's open litigation environment. But as the world gets smaller, the e-discovery challenges get bigger, since more matters involve ESI in multiple languages. If all your collected ESI is in English, your predictive coding options are wide open. But only a few vendors can successfully navigate a multi-language e-discovery project.

E-discovery is experiencing an increase in ESI written in the Chinese, Japanese and Korean languages, otherwise known as CJK. Whether the cause is an increase in cross-border litigation or foreign corporations utilizing the U.S. judicial system, the influx of CJK languages in e-discovery presents distinct challenges for vendors and parties. For example, the English language uses spaces to separate words, but the CJK languages do not (the Korean language will sometimes use spaces to separate phrases).
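The word-separation problem can be seen directly in code. The sketch below, using an invented Japanese phrase purely for illustration, shows that whitespace splitting works for English but returns a single unbroken run for Japanese; a common language-agnostic fallback, absent a real morphological analyzer, is to index overlapping character bigrams.

```python
def whitespace_tokens(text):
    """Split on spaces -- works for English, not for CJK."""
    return text.split()

def char_bigrams(text):
    """Overlapping two-character shingles: a crude but common CJK fallback."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

english = "the pricing agreement was signed"
japanese = "価格契約が締結された"  # roughly: "the pricing agreement was concluded"

print(whitespace_tokens(english))    # five word tokens
print(whitespace_tokens(japanese))   # one unbroken token -- no spaces to split on
print(char_bigrams(japanese))        # nine overlapping character bigrams
```

Production CJK search systems typically use dictionary-based morphological analyzers rather than raw bigrams, but bigram indexing remains a standard baseline precisely because it needs no knowledge of word boundaries.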
Beyond natural language processing capabilities, working with CJK languages requires negotiating unfamiliar encoding schemes and character formats. While the English alphabet can be represented by a limited number of digital bytes, CJK languages require a much more complex system involving Unicode and multi-byte characters. If the text-encoding and word-set groupings are not appropriately considered from the beginning, the entire predictive coding exercise will be compromised; it's the age-old adage of "garbage in, garbage out."
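A short, self-contained illustration of the multi-byte issue (the strings here are invented examples): the same character count costs very different byte counts in UTF-8, and decoding CJK bytes with the wrong codec either fails outright or silently produces unreadable mojibake.

```python
english = "contract"
japanese = "契約書"  # "contract" in Japanese -- an illustrative string

# One byte per character for ASCII; three bytes per character here in UTF-8.
print(len(english), len(english.encode("utf-8")))    # 8 8
print(len(japanese), len(japanese.encode("utf-8")))  # 3 9

# Decoding with the wrong codec is exactly how "garbage in" happens.
shift_jis_bytes = japanese.encode("shift_jis")
try:
    shift_jis_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("Shift_JIS bytes are not valid UTF-8")

# Latin-1 never raises -- it just produces mojibake that poisons every
# downstream search and similarity score.
print(shift_jis_bytes.decode("latin-1"))
```

This is why encoding detection has to happen at the very start of processing: once mis-decoded text enters the index, no predictive coding algorithm can recover the original content.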
It's not just language idiosyncrasies that create challenges in predictive coding with CJK. E-mail collected in the U.S. almost universally comes from either a Microsoft Exchange (.PST) or Lotus Notes (.NSF) environment. But e-mail collected in CJK countries could come from numerous other e-mail systems that require a familiarity rarely found among U.S.-based e-discovery vendors. There are even differences in how the Windows operating system and office-suite software are set up to handle CJK input that would be completely foreign to inexperienced vendors and practitioners. All of these considerations have major consequences for the success of a multi-language predictive coding exercise.

The good news is that a computer doesn't make a distinction between languages, so a predictive coding tool will theoretically work regardless of the language found in the ESI. The critical element is engaging a vendor who can effectively guide you through an e-discovery project involving CJK. The vendor must understand both the linguistic and technical challenges involved with multi-language e-discovery, or else you'll end up with a garbled mess of unrecognized files and inaccurate searches.

There's no question that predictive coding will become more prevalent in the next few years of e-discovery. Neither is there a question that e-discovery will involve multiple languages as litigation matters grow and expand into international circles. It is imperative that litigators find vendor partners who can successfully handle these challenges going forward.

STUDIES IN TECHNOLOGY ASSISTED REVIEW

There have been several studies comparing the cost and time savings of a technology assisted document review to a manual review. These studies are regularly relied upon to show the effectiveness of predictive coding approaches in litigation:

David C. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 Communications of the ACM 289 (1985)

Herbert L. Roitblat, Anne Kershaw & Patrick Oot, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, 61 Journal of the American Society for Information Science and Technology 70 (2010)

Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, 17 Richmond Journal of Law & Technology 11 (2011)

Nicholas M. Pace & Laura Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery, RAND Institute for Civil Justice (2012)
TAR Terminology

There are a lot of confusing terms involved with technology assisted review and predictive coding. Here's a short list to help you make sense of it all:

Active Learning: An iterative training regimen in which the training set is repeatedly augmented by additional documents chosen by the computer algorithm and coded by a lawyer.

Algorithm: A formally specified series of computations that, when executed, accomplishes a particular goal. The algorithms used in e-discovery are implemented as computer software.

Boolean Search: A keyword search in which the keywords are combined using operators such as AND, OR, and (BUT) NOT.

Clustering: A process where documents are segregated into categories or groups so that the documents in any group are more similar to one another than to those in other groups.

Coding: The action of labeling a document as relevant or non-relevant.

Concept Search: A method of returning documents beyond a simple keyword or Boolean search through techniques such as stemming and thesaurus expansion.

Culling: The practice of narrowing a large set of ESI into a smaller data set for the purposes of review.

De-duplication: A method of replacing multiple identical copies of a document with a single instance of that document. De-duplication can occur within the data of a single custodian ("vertical" de-duplication) or across all custodians ("horizontal" de-duplication).

Electronically Stored Information (ESI): Used in Federal Rule of Civil Procedure 34(a)(1)(A) to refer to discoverable information "stored in any medium from which the information can be obtained either directly or, if necessary, after translation by the responding party into a reasonably usable form."

Iterative Training: The process of repeatedly augmenting the training set of documents with additional examples of coded documents until the effectiveness of the computer algorithm reaches an acceptable level.
Keyword: A word or search term that is used as part of a query in a keyword search.

Keyword Search: A search in which all documents that contain one or more specific keywords are returned.

Manual Document Review: The practice of having human reviewers individually read and code the documents in a collection of ESI for responsiveness, particular issues, privilege, and/or confidentiality.

Predictive Coding: An industry-specific term generally used to describe a Technology Assisted Review process involving the use of a computer algorithm to distinguish relevant from non-relevant documents, based on a lawyer's coding of a seed set of documents.

Seed Set: The initial set of relevant documents identified by a lawyer that is provided to the learning algorithm in an active learning process.

Supervised Learning: A method in which the computer algorithm infers how to distinguish between relevant and non-relevant documents using a seed set of documents that have been identified by a lawyer.

Technology Assisted Review (TAR): A process for prioritizing or coding a collection of ESI using a computerized system that harnesses human judgments on a smaller set of documents and then extrapolates those judgments to the remaining document collection.

Adapted from The Grossman-Cormack Glossary of Technology Assisted Review, Maura R. Grossman & Gordon V. Cormack, Federal Courts Law Review, Vol. 7, Issue 1, 2013.
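The de-duplication entry above can be made concrete with a short sketch. This is a simplified, hypothetical example (real tools typically hash normalized e-mail metadata plus body text, commonly with MD5 or SHA-1): identical content yields an identical digest, so only one copy per digest is kept, either across all custodians ("horizontal") or within each custodian ("vertical").

```python
import hashlib
from collections import OrderedDict

def digest(text):
    """Content fingerprint; real tools hash normalized text plus metadata."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def dedupe_horizontal(collection):
    """Keep one instance of each document across ALL custodians."""
    seen = OrderedDict()
    for custodian, doc in collection:
        seen.setdefault(digest(doc), (custodian, doc))
    return list(seen.values())

def dedupe_vertical(collection):
    """Keep one instance per custodian; copies held by different custodians survive."""
    seen, kept = set(), []
    for custodian, doc in collection:
        key = (custodian, digest(doc))
        if key not in seen:
            seen.add(key)
            kept.append((custodian, doc))
    return kept

collection = [
    ("alice", "Q3 pricing memo"),
    ("alice", "Q3 pricing memo"),   # exact duplicate, same custodian
    ("bob",   "Q3 pricing memo"),   # same document held by another custodian
    ("bob",   "board minutes"),
]

print(len(dedupe_horizontal(collection)))  # 2 unique documents remain
print(len(dedupe_vertical(collection)))    # 3 -- bob's copy of the memo survives
```

The choice between the two modes is a review-strategy decision: horizontal de-duplication minimizes review volume, while vertical de-duplication preserves the fact of which custodians held a given document.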