ediscovery Institute Survey on Predictive Coding

Released October 1, 2010

The ediscovery Institute is a 501(c)(3) nonprofit research organization dedicated to identifying and promoting cost-effective methods of processing discovery. More information on the work of the Institute is available at www.ediscoveryinstitute.org.

© 2010 ediscovery Institute, all rights reserved.

Foreword: Why A Survey on Predictive Coding?

The largest cost element in the ever-escalating cost of electronic discovery is typically the cost of having teams of lawyers review and select records for production or privilege. Those costs can be especially staggering if the lawyers are reviewing every record that is produced. Predictive coding is a process in which review decisions made while examining sample records are propagated or extended, by the use of various technologies, to records which have not been individually examined. The producing party may use the suggested evaluations to avoid examining all records, or it can lower costs by triaging the documents, assigning the lower-ranking documents to the lowest-cost personnel and letting the more expensive resources focus on the records that are most likely to be relevant. Either way, predictive coding can significantly reduce the largest single element of cost in e-discovery.

The survey was undertaken to collect information on technologies or processes that were being used to accomplish predictive coding and to quantify the savings that they were achieving.

There is growing recognition that the old brute-force linear review process in which each record is examined is not economically feasible. For example:

Principle 6 of the Sedona Principles (Second Edition), June 2007: Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.

Principle 11 of the Sedona Principles (Second Edition), June 2007: A responding party may satisfy its good faith obligation to preserve and produce relevant electronically stored information by using electronic tools and processes, such as data sampling, searching, or the use of selection criteria, to identify data reasonably likely to contain relevant information.

Practice Point 1 from The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery: In many settings involving electronically stored information, reliance solely on a manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary. (Emphasis added.)

Not only is predictive coding less expensive, there is also a growing belief that it is actually superior to linear review in several ways:

Consistency. Human review is not necessarily the gold standard it is sometimes assumed to be. In a study by the ediscovery Institute [1] and earlier studies by the Text Retrieval Conference (TREC), [2] two reviewers or teams of reviewers have examined the same records. In these studies, the second review has identified as responsive just 48.8 to 62.0% of the records identified as responsive by the first review. In other words, linear human review is itself quite fallible.

Transparency. When humans review records there is seldom any documentation of why particular records were deemed responsive or not. By contrast, most predictive coding methodologies build an audit trail of what decisions were made and what rules were applied.

Retroactive Evaluation. Linear review is so expensive that it is rarely feasible to re-examine records that had been reviewed earlier, even though the review team may have gained substantial new insight into the issues of the case in the meantime. Not so with some automated review technologies and processes.

Time. Predictive reviews can greatly speed the time required to produce records, thereby shortening the time required to resolve disputes.

Confidentiality. Individually reviewing each record requires large review teams; this necessarily exposes confidential information to more risks of unwanted disclosure than would predictive reviews that can process the same volumes with far fewer reviewers.

We hope that the results will inform discussions on what types of pre-production review are legally defensible.

The ediscovery Institute
Anne Kershaw, President & Co-founder
Joe Howie, Director, Metrics Development and Communications

[1] Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Herbert L. Roitblat, Anne Kershaw and Patrick Oot, Journal of the American Society for Information Science and Technology, 61(1):70-80, 2010. Two teams of reviewers examined 5,000 documents that had earlier been examined as part of a response to a US Department of Justice request. Team A identified 48.8% of the records identified as responsive by the original reviewers from the sample and Team B identified 53.9%. Two computer-assisted review systems were also used to review the entire original population. System C identified 45.8% of the documents originally identified as responsive; System D identified 52.7%. See also, Automated Document Review Proves Its Reliability, Anne Kershaw, Digital Discovery & e-Evidence Newsletter, Pike & Fischer, November 2005. It describes a study comparing the performance of a human review team to that of an automated document assessment system in evaluating a sample of 43% of a collection of 48,000 documents. Relevant documents were deemed to be those identified by both the system and the humans plus those identified by either the system or the humans where subsequent arbitration decided that they were relevant. The system identified more than 95% of relevant records whereas the people identified 51%.

[2] Overview of the TREC 2009 Legal Track, by Bruce Hedin, Stephen Tomlinson, Jason R. Baron, and Douglas W. Oard, downloaded from http://trec.nist.gov/pubs/trec18/papers/legal09.overview.pdf on September 29, 2010, provides corrected results for a study done in 2008 in which a subset of records that had previously been reviewed in 2006 and 2007 were reviewed again in 2008. According to the Overview, just 62% of documents previously judged relevant were judged relevant again in 2008. The sample consisted of 104 documents that had been previously judged relevant and 120 documents that had previously been deemed not relevant. See discussion at section 4 (Correction to 2008 Assessor Consistency Study) in the Legal09 Overview. See also, interassessor consistency data on TREC 06 Legal Track ad hoc topics, by Dave Lewis, downloaded on Sept. 29, 2010 from http://cio.nist.gov/esd/emaildir/lists/ireval/msg00012.html. In 2006 the TREC Legal Track took a sample of documents on each of 40 topics. That consisted of 25 documents per topic that had been deemed relevant by an assessor, or all of them if there weren't 25 relevant documents, and enough other nonrelevant documents to bring the total to 50. Due to a glitch, one topic only had 49 records. This sample set was then reviewed by another assessor. The first assessor identified 877 of 1999 documents as relevant. The second assessor identified 470 of those 877 documents, or 58.0%, as being relevant.
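The consistency figures cited in the footnotes above are simple overlap rates: of the documents one review marked responsive, what fraction did a second review also mark responsive? A minimal sketch of that calculation, using invented document IDs rather than data from the cited studies:

```python
# Overlap (agreement) rate between two reviews: of the documents the first
# review marked responsive, what fraction did the second review also mark
# responsive? Document IDs below are invented for illustration.

first_review_responsive = {"DOC-001", "DOC-002", "DOC-003", "DOC-004", "DOC-005"}
second_review_responsive = {"DOC-002", "DOC-003", "DOC-005", "DOC-009"}

agreed = first_review_responsive & second_review_responsive
overlap_rate = len(agreed) / len(first_review_responsive)

print(f"Second review confirmed {len(agreed)} of {len(first_review_responsive)} "
      f"documents ({overlap_rate:.1%})")  # -> 3 of 5 documents (60.0%)
```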

Contents

Foreword: Why A Survey on Predictive Coding? ... ii
I. Special Thanks ... 1
II. Background ... 2
III. Overview of Results ... 3
IV. Respondents ... 4
V. Terminology ... 5
VI. Offering & Overall Process ... 6
VII. Identifying Like Records ... 11
VIII. Email Threading ... 12
IX. Paper-Based Records ... 13
X. Savings ... 14
XI. Pricing/Cost ... 15
XII. Incremental Cost of Predictive Coding ... 16
XIII. Sample Sizes ... 17
XIV. Set Up Efforts ... 18
XV. Transparency ... 21
XVI. Privilege ... 23
XVII. Repeatable Results ... 24
XVIII. Elevator Pitch ... 25
XIX. Acceptance/Adoption ... 27
XX. Type Matters, Size Threshold ... 28
XXI. Obstacles to Broader Adoption ... 30
XXII. Languages ... 31
XXIII. Review Platforms ... 32
XXIV. Judicial Review ... 33
XXV. Should Have Asked ... 34
XXVI. Comments ... 35

I. Special Thanks

We believe that the best way to identify and adopt cost-effective ways to process electronic discovery is to have an informed debate on the various options, and we want to thank the companies that provided the responses shown in this report. Many of them provided great insight into how they accomplish predictive coding and what the benefits of this approach are. Kudos to the following companies for stepping up and sharing this valuable information:

Capital Legal Solutions
Catalyst Repository Systems, Inc.
Equivio
FTI Technology
Gallivan Gallivan & O'Melia
Hot Neuron
InterLegis
Kroll Ontrack
Recommind
Valora Technologies, Inc.
Xerox Litigation Services

II. Background

The predictive coding survey is the third in a series of surveys by the ediscovery Institute on technologies or processes that can be used to speed the processing of electronic data while improving the quality of the review. The first showed that proper consolidation of duplicate electronic files could, on average, reduce the volume of records to be reviewed by 38%. [3] The second survey showed that grouping emails in threads or conversations could reduce the effort required to review e-mail by an additional 36% on average. [4]

This survey dealt with predictive coding, which we defined as a combination of technologies and processes in which decisions pertaining to the responsiveness of records gathered or preserved for potential production purposes are made by having reviewers examine a subset of the collection and having the decisions on those documents propagated to the rest of the collection without reviewers examining each record.

In May 2010, invitations to participate in the survey were sent to a number of companies known to be active in the electronic discovery market. Additionally, postings inviting participation were made on a number of forums including the EDDUpdate.com blog, the Lit Support listserv on Yahoo, the ediscovery group on LinkedIn.com and on LegalOnRamp.com. Responses were all received by July 1, 2010.

[3] Report on Kershaw-Howie Survey of E-Discovery Providers Pertaining to Deduping Strategies, available at www.ediscoveryinstitute.org/pubs/dedupe-report.pdf. This study showed that consolidating duplicates across custodians reduced the volume to be reviewed by 38% on average, with many individual respondents reporting project-level reductions in excess of 70%. The savings from across-custodian deduping was almost double the reduction in volume of electronic discovery compared to only consolidating within the records of individual custodians, yet it was performed in only half the cases, raising serious ethical considerations which were explored in Ethics and Ediscovery Review, published in the ACC Docket, Jan/Feb 2010. Reprints are available at http://ediscoveryinstitute.org/pubs/edi-EthicsOfEdiscover.pdf.

[4] Report on Kershaw-Howie Survey of E-Discovery Providers Pertaining to Threading, available at www.ediscoveryinstitute.org/pubs/ediscoveryinstitutethreadingreportfinal_jh.pdf.
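The definition above turns on propagation: reviewers code a sample, and technology extends those calls to the unexamined remainder. The sketch below illustrates that idea in the most generic way, training a simple text classifier on a handful of hypothetical reviewed documents and scoring unreviewed ones; it is an illustration only, not the workflow of any respondent, and it assumes scikit-learn is available.

```python
# Generic illustration of propagating review calls from a coded sample to the
# rest of a collection. Not any respondent's actual product or workflow.
# Assumes scikit-learn is installed; the documents and labels are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviewed_docs = [                       # sample examined by reviewers
    "pricing agreement for the widget contract renewal",
    "lunch plans for friday",
    "draft contract terms and pricing schedule",
    "fantasy football league standings",
]
reviewed_labels = [1, 0, 1, 0]          # 1 = responsive, 0 = not responsive

unreviewed_docs = [                     # remainder of the collection
    "revised pricing exhibit attached to the contract",
    "office holiday party rsvp",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(reviewed_docs)
model = LogisticRegression().fit(X_train, reviewed_labels)

# Graduated relevance scores for documents no reviewer has examined.
scores = model.predict_proba(vectorizer.transform(unreviewed_docs))[:, 1]
for doc, score in sorted(zip(unreviewed_docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```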

III. Overview of Results

This is a summary of the results. Complete responses to the questions are provided later in this report. Some highlights:

Savings. Respondents reported average savings of 45%, with an average maximum observed savings of 71% and an average minimum observed savings of 23%. Individual respondents reported savings as high as 80%, 95%, and even 100%, and minimum savings as low as 0% on individual projects.

Obstacles to Implementation. The respondents felt that the largest obstacle to more widespread use of predictive coding was uncertainty over judicial acceptance of that approach. The next closest obstacle was lack of awareness of options on the part of in-house counsel, followed closely by insensitivity to the cost of inefficiencies by law firms.

General Approach. The respondents varied in their approach to predictive coding. Most respondents used some form of queries combined with document clustering.

Non-Binary Process. In describing their responses, several of the respondents noted that predictive coding is non-binary in nature, i.e. documents are ranked according to how closely they match previously examined records. In other words, there is a continuum, and the review team has to select what the cutoff point is (see the sketch following this list).

Terminology. Almost all of the respondents thought there was a better generic term than predictive coding. Suggestions included: Automated Document Classification, Automatic Categorization, Predictive Categorization, Predictive Ranking, Prognostic Document Profiling, Propagated Coding, Relevance Assessment, Replicated Coding, and Suggested Coding.

Pricing Models. The respondents offered a variety of pricing models, including per GB pre-culling, per GB post-culling, hourly fees, and flat per-case fees.

Sampling. Most respondents used some form of statistical sampling.

Transparency. Most of the respondents provide an audit trail of what decisions were made.

Replicability. Most of the respondents indicated that a second analysis using the audit trail from the first analysis would produce the same results.

Adoption Rate. There were not enough responses in this area to provide metrics on the rate of adoption.

Maturity of Offerings. Predictive coding as an offering is far more recent than deduping or email threading. Many of the respondents have added predictive coding in the last two years.

Email Threading. Most respondents were able to treat emails either individually or grouped in threads.

Paper Records. All the respondents included scanned and OCR'd paper records with electronic records for predictive coding purposes.

Languages. All the respondents could handle English, French, German and Spanish, but a few could not handle Chinese, Japanese, Korean or Arabic.
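As the Non-Binary Process item notes, the output of these tools is a ranking with a team-chosen cutoff rather than a yes/no call. A small illustrative sketch, with hypothetical scores rather than survey data:

```python
# Hypothetical graduated relevance scores (not survey data). The review team
# picks a cutoff: documents at or above it are reviewed first (or exclusively);
# documents below it can be sampled rather than reviewed in full.

scored_docs = {
    "DOC-101": 0.97, "DOC-102": 0.88, "DOC-103": 0.61,
    "DOC-104": 0.34, "DOC-105": 0.08, "DOC-106": 0.02,
}
cutoff = 0.50  # chosen by the trial team based on acceptable risk and cost

ranked = sorted(scored_docs.items(), key=lambda kv: kv[1], reverse=True)
review_queue = [doc for doc, score in ranked if score >= cutoff]
deprioritized = [doc for doc, score in ranked if score < cutoff]

print("Review first:", review_queue)   # ['DOC-101', 'DOC-102', 'DOC-103']
print("Sample/park:", deprioritized)   # ['DOC-104', 'DOC-105', 'DOC-106']
```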

IV. Respondents

The following companies provided responses to the survey.

Capital Legal Solutions, www.capitallegals.com
Contact: Gregory Brooks, VP Information Technology, gbrooks@capitallegals.com
Involvement with predictive coding: Developed own predictive coding software; provide software & hosted review.

Catalyst Repository Systems, Inc., www.catalystsecure.com
Contact: John Tredennick, CEO, jtredennick@catalystsecure.com, 303-824-0840
Involvement with predictive coding: Developed own predictive coding software; developed methodology within.

Equivio, www.equivio.com
Contact: Warwick Sharp, VP Marketing and Business Development, warwick.sharp@equivio.com, 800-851-1965, 206-373-6521
Involvement with predictive coding: Developed own predictive coding software.

FTI Technology, www.ftitechnology.com
Contact: Kate Holmes, Director, Corporate Communications, kate.holmes@fticonsulting.com
Involvement with predictive coding: Developed own predictive coding software; we provide both software and hosted review.

Gallivan Gallivan & O'Melia, www.digitalwarroom.com
Contact: Daniel Gallivan, 206.654.1441
Involvement with predictive coding: Integrated others' predictive coding.

Hot Neuron, LLC, www.cluster-text.com
Contact: Bill Dimm, CEO, clustify@hotneuron.com, 610-581-7702
Involvement with predictive coding: Developed own predictive coding software.

InterLegis, www.interlegis.com
Contact: Kevin Carr, President, kcarr@interlegis.com, 214-468-8800 x205
Involvement with predictive coding: Developed own predictive coding software.

Kroll Ontrack, www.krollontrack.com
Contact: Jamie Ritter, Document Review Manager, jritter@krollontrack.com, 952-906-4857
Involvement with predictive coding: Developed own predictive coding software.

Recommind, www.recommind.com
Contact: Chris Hutcheson, Marketing Director
Involvement with predictive coding: Developed own predictive coding software; provide hosting and software.

Valora Technologies, Inc., www.valoratech.com
Contact: Sandra Serkes, President & CEO, sserkes@valoratech.com, 781.229.2265
Involvement with predictive coding: Combo software provider & services provider.

Xerox Litigation Services, www.xerox-xls.com
Contact: Karen Miller, Director of Marketing, karen.miller@xls.xerox.com, 212.337.5449
Involvement with predictive coding: Developed own predictive coding software.

V. Terminology ediscovery Institute Survey on Predictive Coding The survey asked, If you think there is a better generic term than predictive coding, what would it be? and Why? These were the responses: Company Better Term Why Capital Legal Solutions Catalyst Repository Systems Equivio FTI Technology Gallivan Gallivan & O Melia Hot Neuron, LLC InterLegis Kroll Ontrack Recommind Valora Technologies, Inc. Xerox Litigation Services Prognostic Document Profiling Predictive Ranking Relevance Assessment Suggested coding Predictive Categorization Automatic Categorization No Propogated Coding or Replicated Coding Automated Document Classification The prognostic and iterative content categorization can play a broader part than simply review "call" score evaluation; for example in the document management system's context. More descriptive of the process and result. All systems deal with a rank or likelihood of responsive or not responsive. It is up to the trial team to determine the acceptable risk. The term "coding" suggests that the output is binary (responsive or not). However, one of the important use scenarios is prioritized review, which can only be facilitated by graduated relevance scores. In addition, graduated relevance scores are important in allowing the user to select which documents to review (above a certain cut-off score), based on the mix of risk (recall) and cost (precision) appropriate for the given case and business scenario. We take the approach that this review technology does not completely eliminate human review. "Suggested coding" correctly indicates that human review decisions are preserved and help guide the computer through concept-clustering of documents and the integration of reference documents into the review. Review decisions become more consistent and faster, without relinquishing control over the substantive decisions for each document. "Coding" implies a decision, the machine is suggesting. The coding happens when a person confirms (or refutes) the suggested category I don't know if it is "better," but it better aligns with terminology outside of the legal field. May we suggest "Propagated Coding," rather than Predictive Coding, as "predictive" tends to mean ahead of the current time (like a forecast), whereas "propagated" would indicate taking existing results and carrying them forth across the remainder of the population (at any time, including the present). We believe that a generic term for a new offering in this market should be as transparent and descriptive as possible. Automated Document Classification is our preferred name for this particular technology, because we believe that it more clearly conveys the intended output for the technology namely, a definitively classified set of documents. In our view, the term Predictive Coding is opaque and imprecise. It does not differentiate Automated Document Classification from less robust similarity-detection technologies, like clustering, near de-duplication, and e-mail threading. These other techniques could be used to make predictions regarding relevance for certain groups of documents within a corpus. Unlike Automated Document Classification, though, they would not comprehensively classify a document population such that clear definitive lines could confidently be drawn segregating relevant documents from non-relevant documents. In sum, the term Predictive Coding seems to us to suggest a technology whose end results are imprecise, immeasurable, and unreliable. 
This is not, in our view, an appropriate designation for the emerging body of Automated Document Classification systems. 5

VI. Offering & Overall Process

The survey asked:

Name of PC Offering. What do you call your predictive coding offering?

Time Offered. What year did you first provide predictive coding software or services?

Overall Process. Please describe the overall process involved in your offering: (Example: After records have been collected and placed in a repository, the duplicate records are consolidated. Reviewers perform full text searches and otherwise browse the records of custodians with the most known involvement in the issues. The reviewers identify records known to be responsive and then our system identifies other records that are most like those records based on. We sample non-selected records based on and examine samples of about XX records to determine if there are still sets of relevant records that had not already been selected for production. We repeat iterations until )

The responses were as follows:

Capital Legal Solutions
Offering: Dynamic Content Profiling. Offered since: 2nd quarter 2010. Underlying technology: Capital Legal Solutions' own intellectual property / developed internally.
Overall process: Dynamic Content Profiling will work on any corpus of documents across any language set that is imported into our ezreview repository, pre or post culling for de-duplication, date filtering or keyword searching. Dynamic profiler works on any folder or navigational view in the system. As such, the client can execute across searches, tags, production data, random sample sets or customized queries. In any event, CLS review architects can work with the client to create a powerful strategy whereby they can preview deliberate batches based on any folder technique mentioned prior or through our automated randomizer engine. In our random sampling module, the user can make a decision as to the size of the sample set and the pass or fail threshold levels. Sampled or deliberate batches then receive review decisions by expert or top-level reviewers. Our profiling engine will then scan across the corpus of documents in the entire database and find similar documents based on content and concepts using CLS's customized algorithms. All such documents are pulled and foldered for mass categorization. A random sampling can then be performed against that data set for quality assurance purposes. This process can be repeated until all documents in the corpus are reviewed.

Catalyst Repository Systems
Offering: Predictive Ranking. Offered since: 2008. Underlying technology: Catalyst Repository Systems (Catalyst CR).
Overall process: Catalyst offers Predictive Ranking and statistical analysis based upon initial coding decisions made by counsel during initial document review/sampling. These coding decisions are coupled with weighted key concepts and search terms, and then are applied against the non-reviewed documents, leading to an assigned predictive weighting for responsiveness. The ranks are typically used two ways: 1. Documents with a very low rank, tested and shown to be extremely unlikely to be responsive, are not reviewed and not produced. 2. The remaining documents are typically prioritized and reviewed in priority order, beginning with those most likely to be responsive. This allows for a prioritized review, making the review more efficient and supporting rolling productions. The steps we follow are as follows: Begin with a list of search terms that counsel believes are likely to find responsive documents, and run those searches. Take a random sample of both the hits and non-hits, tagging for responsiveness, and looking for additional words and phrases that are found in responsive documents and false hit terms that are often found in non-responsive documents. Adjust the search terms based on what was learned during the sampling. If there were phrases found that are common false hits, run a Catalyst unique True Hit Finder/False Hit Remover process to tag the true hits and not the false hits. Assign the search terms scores representing likelihood of responsiveness, and run the searches in Power Search, based on subject matter expertise and sampling results. Assign each document a responsiveness rank based on a combination of the search terms that hit and the scores of each search term. Sample additional documents to verify the scoring. Determine the cut-off, and remove the docs that are ranked as nonresponsive to a subcollection where they can be sampled and archived. Review the docs ranked as likely responsive, beginning with the highest ranked documents. The benefits can be magnified when combined with Catalyst's additional features to accelerate the review, such as Equivio Email Thread/Near Dupe analysis, sophisticated handling of multiple languages, clustering, and managed review workflow.

Equivio
Offering: Equivio>Relevance. Offered since: 2009. Underlying technology: We are not at liberty to disclose this information.
Overall process: Equivio>Relevance enables organization of a document collection by relevance. Based on initial input from an attorney knowledgeable of the case, Equivio>Relevance uses statistical and self-learning techniques to calculate graduated relevance scores for each document in the data collection. As an expert-guided system, Equivio>Relevance works as follows: An expert reviews a sample of documents, ranking them as relevant or not. Based on the results, Equivio learns how to score documents for relevance. In an iterative, self-correcting process, Equivio feeds additional samples to the expert. These statistically generated samples allow Equivio>Relevance to progressively improve the accuracy of its relevance scoring. Once the sampling process has optimized, Equivio scores the entire collection, calculating a graduated relevance score for each document. The product includes a statistical model which monitors the software training process, ensuring validation and optimization of the sampling and training effort.

FTI Technology
Offering: Acuity, the name of our all-in-one legal review offering that utilizes "predictive coding" (our preference is "suggested coding"). Offered since: Acuity launched in January 2010. Underlying technology: The underlying software is Attenex Patterns. The Acuity all-in-one offering utilizes an enhanced version of well-known software that includes the suggested coding features.
Overall process: The Acuity process is to review a subset of the documents, which we call the reference set, and have the review team code them as appropriate. This serves two functions - these will suggest coding on uncoded documents, and will continually guide and instruct reviewers. From there, the reference set is uploaded to an enhanced Attenex Document Mapper tool where the coded documents are clustered with other documents of similar content and themes. Based upon the coding of the reference set, the software provides suggestions to the reviewers on how to code the similar documents. Coding decisions are implemented by the reviewers rather than automatically by the software, and the process can be accurately described as machine-assisted document review.

Gallivan Gallivan & O'Melia
Offering: Depends on client -- we are not consistent: have used clustering, auto tagging, grouping. Offered since: rudiments in 2003 (Attenex style); fully (if client requested) since 2008.
Overall process: Collect and process records; extract content and place it in a repository, store references in a database. Consolidate duplicates. Extract text or OCR, compare text content to create a similarity vector, store results. Reviewers perform full text searches and otherwise browse the records of custodians, filtering based on metadata as needed. Similar documents are grouped together. The reviewers identify groups known to be responsive and then we associate other records that are most like those records based on the similarity vector. Reviewer decisions define the actual mark of the documents vs. the mark suggested by our system. As new waves of data arrive, they are placed in groups based on similarity vectors generated for that data.

Hot Neuron, LLC
Offering: Clustify (predictive coding is a subset of its functionality). Offered since: 2008. Underlying technology: Hot Neuron's own proprietary technology.
Overall process: Clustify only does the automatic categorization step of the process, so the details of other steps (de-dupe, searches, etc.) are really up to the user. The user supplies two sets of documents to Clustify: documents that have already been categorized (perhaps as responsive/not-responsive, or whatever categories the user wants to use), and documents that haven't been categorized. Clustify compares the uncategorized documents to the ones that have been categorized, and categorizes them automatically if they are sufficiently similar to any of the categorized documents. The "sufficiently similar" criterion is specified by the user. It could be a minimum conceptual similarity percentage, or a near-dupe percentage. Any uncategorized documents that aren't sufficiently similar to any categorized document for automatic categorization are clustered, labeled with descriptive keywords, and presented to the user for manual categorization. Clustify tells the user how similar an auto-categorized document is to the most similar manually categorized document, so the user can identify the documents most at risk of incorrect categorization (i.e., those with lowest similarity). The process can be iterated in an effort to cover more of the uncategorized documents, but it is only wise to do so if there is a manual review of the documents most at risk for errors. Without such review, it is better to increase coverage by simply setting the similarity requirement lower.

InterLegis
Offering: Discovery360 Predictive Coding. Offered since: 2009. Underlying technology: InterLegis' proprietary technology.
Overall process: Predictive coding is a technology feature within Discovery360 Reviewer. There are two ways to leverage PC within Discovery360. 1. User-Defined: Case administrators define various attributes that define certain issue codes. They can use any attribute in the database, including keywords, concepts, file types, email domains, specific names and more. This process enables users to first "teach" the system, then ask it to find all documents that match their criteria. 2. Automatic: With this feature activated, the system will essentially "watch and learn" what commonalities are found between documents as they are issue coded. And as reviewers work, the system will find and recommend likely candidates for each issue code. Users can then either approve the entire recommended list, edit the criteria, or quickly QC the list to confirm selections. Additionally, case administrators have the ability to ask the system to either code matching documents immediately, or place likely candidates in a holding folder for confirmation. In all cases, documents coded via the PC engine are always designated as such in the database for logging and defensibility purposes.

Kroll Ontrack
Offering: Intelligent Prioritization. Offered since: 2010. Underlying technology: Intelligent Prioritization is a proprietary Kroll Ontrack technology.
Overall process: After documents have been processed and uploaded into Ontrack Inview, the project administrator builds an initial workflow. An early workflow stage is designated for Intelligent Prioritization. Initially, a statistically relevant sample of the uploaded documents is provided to reviewers for standard linear review. The system then assesses the reviewed documents and defines the characteristics of potentially Responsive documents. The system then prioritizes other likely Responsive documents for review. As the review continues, the system's knowledge of Responsive characteristics improves. When new documents are loaded into Ontrack Inview, another statistically significant sample is identified from this new data and that sample of data is prioritized for Responsive review. In addition to the document prioritization identified above, Kroll Ontrack provides additional project analysis that helps determine when a high percentage of potentially Responsive documents have been identified within the data. By analyzing the Responsiveness patterns in the data and comparing them to the entire population of documents, Ontrack Inview can provide statistical details that can be utilized to indicate the completeness of a review.

Recommind
Offering: Axcelerate ediscovery. Offered since: 2006. Underlying technology: Recommind.
Overall process: All software, processes and workflow are the proprietary intellectual property of Recommind and cannot, therefore, be disclosed.

Valora Technologies, Inc.
Offering: We have numerous offerings here: AutoCoding, AutoIssues, AutoPriv, AutoResponsive, AutoND (NearDupe), AutoETG (EmailThreadGroup) and a roll-up of the above: AutoReview. Offered since: Our first predictive/propagated capability was AutoCoding, first offered in 2002. Underlying technology: Valora Technologies, Inc.
Overall process: Valora loads the entire collected population into our system, including any review data already available from previous (typically manual) efforts by reviewers. We build a custom computer-representation of the Document Review Ruleset for each matter. We extract/understand these Rules from three possible places: 1) From a Coding or Review Manual, typically written by the client to train human reviewers. 2) From existing/previously coded data from earlier review efforts. In this case, Valora creates a translation from prior actions taken to the underlying rules that guided those decisions (even if not explicitly stated). 3) From direct conversations with the client, particularly when no existing data or review efforts exist (e.g., starting fresh). Once established, Valora propagates the Document Review Ruleset uniformly across any already-coded documents. The results are reviewed and corrected for precision and recall (accuracy). Once the results meet the desired threshold, the Ruleset is propagated across the entire population.

Xerox Litigation Services
Offering: CategoriX. Offered since: 2009. Underlying technology: Xerox's two research centers, Xerox Research Centre Europe (XRCE) and Xerox Palo Alto Research Center (PARC).
Overall process: CategoriX automatically classifies documents by learning from samples that have been reviewed by knowledgeable case attorneys. CategoriX utilizes attorney-supplied document assessments, together with its own statistical analyses, to create a model that will accurately and consistently generalize the attorneys' assessments across the entire review population. The statistical analysis underlying CategoriX technology is called Probabilistic Latent Semantic Analysis (PLSA). CategoriX leverages PLSA to identify correlations between words and attorney-supplied relevance assessments. This knowledge then informs CategoriX classifications for novel documents going forward. CategoriX performance depends on the quality of the assessments provided by the attorneys in the training samples. For this reason, several iterations of training and intensive quality control are undertaken during the model-building process to ensure the accuracy and consistency of the training input. Precision and recall are monitored throughout the incremental model-building process to ensure that progress is being made toward our client's performance goals. Once CategoriX models are consistently performing at the desired levels, CategoriX is applied to the entire review population. Finally, one last round of attorney-driven QC sample review is undertaken to validate the quality of the final result set. The iterative CategoriX approach has several distinct stages and entails a strong consultative partnership between CategoriX technical experts at XLS and the client's attorneys. Nevertheless, a CategoriX-based review can typically be completed in a very short timeframe, as many of the analyses are aided by computers working 24x365.
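Several of the workflows above (for example, Equivio, Valora and Xerox) describe iterating until precision and recall targets are met on a validation sample. For reference, the sketch below shows how those two measures are computed; the prediction/assessment pairs are hypothetical and do not come from any respondent:

```python
# Precision and recall computed from a validation sample in which each
# document has a model prediction and an attorney's call. Counts are
# hypothetical and illustrate the measures only.

validation = [  # (predicted_relevant, attorney_says_relevant)
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]

true_pos = sum(1 for pred, actual in validation if pred and actual)
false_pos = sum(1 for pred, actual in validation if pred and not actual)
false_neg = sum(1 for pred, actual in validation if not pred and actual)

precision = true_pos / (true_pos + false_pos)  # of the documents retrieved, how many were relevant
recall = true_pos / (true_pos + false_neg)     # of the relevant documents, how many were retrieved

print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # -> 0.67, 0.67
```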

VII. Identifying Like Records

The survey asked: What general approach is used to identify like records? (Select all that apply)

Custom queries
Statistically based clustering, with no terms inferred, e.g., basing a search or clustering on a document that contains Ford and Toyota would not find or associate documents that only contained the words Chevy and Honda
Statistically based clustering with co-occurring words inferred, e.g., basing a search or clustering on a document that contains Ford and Toyota could find or associate documents that only contained the words Chevy and Honda
Taxonomies
Other (please specify)

These were the responses:

Capital Legal Solutions: Yes; Yes. Like records are identifiable in various ways. In addition to the above two, we also identify based on document content.
Catalyst Repository Systems: Yes; Yes.
Equivio: Supervised learning.
FTI Technology: Linguistic statistical analysis assesses similarity in documents.
Gallivan Gallivan & O'Melia: Yes; Yes.
Hot Neuron, LLC: Yes.
InterLegis: Yes. Machine learning based on common threads between documents.
Kroll Ontrack: Classification-based technology that assesses document text to determine related documents.
Recommind: Yes; Yes; Yes.
Valora Technologies: Yes.
Xerox Litigation Services: CategoriX uses Probabilistic Latent Semantic Analysis to identify correlations between words and attorney-supplied category assessments. From these building blocks, CategoriX assembles models capable of assigning relevance probabilities to novel documents that have not been manually reviewed. CategoriX's probability assignments do not depend on the presence of any specific words or phrases in a document. Instead, each document's score is dictated by the probabilities of the specific combination of words comprising it.
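The survey question's Ford/Toyota versus Chevy/Honda example marks the line between literal matching and inference from co-occurring words. The sketch below illustrates that distinction with a three-document toy collection; it is a generic illustration, not a description of any respondent's algorithm:

```python
# Toy contrast between literal term matching and co-occurrence inference,
# mirroring the Ford/Toyota vs. Chevy/Honda example in the survey question.
# Not any vendor's algorithm; the three "documents" are invented.

corpus = [
    "ford and toyota dealer incentives",        # seed document
    "chevy and honda dealer incentives",        # shares no make names with the seed
    "ford toyota chevy honda quarterly sales",  # co-occurrence evidence linking the makes
]
seed_terms = {"ford", "toyota"}
target_doc_terms = set(corpus[1].split())

# Literal matching: only words actually present among the seed terms can match.
print("literal overlap:", seed_terms & target_doc_terms)   # set() -> no association

# Inference: expand the seed terms with words that co-occur with them anywhere
# in the collection, then match again.
expanded_terms = set(seed_terms)
for doc in corpus:
    doc_terms = set(doc.split())
    if doc_terms & seed_terms:
        expanded_terms |= doc_terms
print("expanded overlap:", expanded_terms & target_doc_terms)  # now includes 'chevy' and 'honda'
```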

VIII. Email Threading

Section 3 of the survey asked: Email Threads. Please explain how email threads are handled in conjunction with your offering.

Emails are analyzed individually so that different emails from the same thread can be placed in different groups or clusters
Email threads are identified prior to grouping or clustering so that all emails in a thread or branch of a discussion are placed in the same group or cluster
Other: please explain

The responses were as follows:

Capital Legal Solutions: With our system, there is no boxed-in solution for e-mail thread review. We can and will work with the case team to establish a workflow that will be most efficient per their strategy. If review based on searching is required, for instance, then we can search, get those results, pull in the entire conversation and take that into account. Or if review based on similar or associated documents is the desired first pass, we can do it that way and then account for e-mails in those threads to be automatically categorized. So we allow flexibility here, as different clients work in different ways, but we can find the efficient way per their work methods.
Catalyst Repository Systems: Predictive Ranking is flexible: searching is done by document, but analysis and ranking can be done by (a) individual documents, (b) families of email and related attachments, (c) email threads (optional with Equivio email thread processing).
Equivio: Both options are supported. This is a user-defined parameter.
FTI Technology: We can do both depending upon client preference.
Gallivan Gallivan & O'Melia: Yes.
Hot Neuron, LLC: Clustify allows you to do it either way.
InterLegis: Yes.
Kroll Ontrack: Emails are handled in the Intelligent Prioritization technical solution without additional document type handling. In addition to Intelligent Prioritization, Kroll Ontrack provides email threading technology that analyzes emails and presents them to reviewers grouped by conversation, and identifies the earliest and latest emails in each thread.
Recommind: Yes.
Valora Technologies: We offer both choices as an option to our customers.
Xerox Litigation Services: CategoriX typically operates on individual emails. However, the XLS review platform incorporates email threading technology that could be used to ensure that all members of an email thread would be assigned to the same category, should the client prefer this organization.

IX. Paper-Based Records

Section 3 of the survey asked: Paper-based Records. How are paper-based records treated for predictive coding purposes?

Paper records are scanned and OCR'd and the OCR'd text is included with the ESI for predictive coding
Paper records are scanned and OCR'd and treated as a separate population from ESI for predictive coding
Paper records are not treated with predictive coding
Other (please explain)

The responses were:

Capital Legal Solutions: Yes (paper included with ESI).
Catalyst Repository Systems: Yes (paper included with ESI).
Equivio: Yes (paper included with ESI).
FTI Technology: Yes (paper included with ESI).
Gallivan Gallivan & O'Melia: Yes (paper included with ESI).
Hot Neuron, LLC: It can be any of the above. It's entirely up to the user to decide whether to put OCR'ed text in the same document set as the ESI, or whether to separate them.
InterLegis: Yes (paper included with ESI).
Kroll Ontrack: Yes (paper included with ESI).
Recommind: Yes (paper included with ESI).
Valora Technologies, Inc.: Yes (paper included with ESI). Any ESI documents without text are processed like paper (OCR, etc.).
Xerox Litigation Services: Yes (paper included with ESI).

X. Savings

The survey asked: Cost Savings. As compared to a linear review of the same content after duplicate consolidation, after culling based on domain name analysis of emails (e.g. excluding emails from CNNSports.com) and after email threading, what percentage of time do you estimate is saved by predictive coding when used to select responsive records? On average: %. Most observed: %. Least observed: %.

The responses were (average % savings / most % savings observed / least % savings observed):

Capital Legal Solutions: 40 / 70 / 25
Catalyst Repository Systems: 40 / 60 / 25
Equivio (note 1): 65 / 80 / 50
FTI Technology: 50-60 / 80 / 25
Gallivan Gallivan & O'Melia: 3 / 10 / 0
Hot Neuron, LLC: (no figures provided)
InterLegis: 40 / 80 / 10
Kroll Ontrack: (no figures provided)
Recommind: 40 / 95 / 20
Valora Technologies, Inc. **: 80 / 100 / 25
Xerox Litigation Services: 55 / 77 / 30
Total: 363 / 572 / 185
Average of responses (divide by 9): 45.4% / 71.5% / 23.1%

Green shading with a gold star indicates that the respondent provided names and contact information for a client who substantiated the information provided regarding savings. Two stars indicate two references. Providing references was optional for the respondents.

** Equivio Note 1: These percentage savings refer to cases in which the software was successfully trained and used. The software includes a statistical model which monitors the "success" of training. Occasionally, due to poorly-defined issues, inconsistent tagging by the expert, or exceptionally low richness (less than 1%), the statistical model detects and notifies the user that training is ineffective, and in these cases, the results are not used.

** Valora Note: Valora builds a computer-representation of the Document Review Ruleset for each matter as part of Valora's services. In some cases clients have completely forgone a linear review and used the results of the Ruleset instead.

XI. Pricing/Cost

The survey asked: How do you calculate the prices you charge for PC? (select all that apply)

Per GB, pre-culling
Per GB, post culling
Per GB, post culling and deduping
Per File, pre-culling
Per File, post culling
Per File, post culling and deduping
Hourly consulting fees
Flat Fee per case
Other (please specify below)

The responses were as follows:

Capital Legal Solutions: Yes, Yes, Yes.
Catalyst Repository Systems: Yes.
Equivio: Yes, Yes. Most customers prefer the per-file pricing model.
FTI Technology: Yes, Yes.
Gallivan Gallivan & O'Melia: Yes, Yes.
Hot Neuron: Yes, Yes, Yes, Yes. We also offer perpetual site licenses with no per-GB fee. Note that our per-GB fees are based on the amount of text, not raw data, which we believe is more fair and economically sensible. Whether the user culls/de-dupes first is up to him/her.
InterLegis: Yes. Per GB fee after culling, and includes all software, technologies, and services such as project management and productions.
Kroll Ontrack: Yes. Free introductory offer.
Recommind: Yes, Yes, Yes, Yes. Enterprise license; SaaS (i.e. per month/quarter/year charge for all volume).
Valora Technologies: Yes, Yes, Yes, Yes, Yes. Per page or per paper document.
Xerox Litigation Services: Yes. Similar to our processing and review platform pricing, our models are very flexible. Depending on client needs and the complexity and size of the matter, our pricing models can vary from matter to matter.

XII. Incremental Cost of Predictive Coding

The survey asked: What is the incremental cost of providing predictive coding technology above the basic costs of ingesting and deduping electronic records? (express as a percentage over basic ingesting, deduping and threading)

These were the responses:

Capital Legal Solutions: 20%.
Catalyst Repository Systems: Hourly consulting at $250-$350 per hour.
Equivio: Equivio is a software vendor. Processing and hosting services, as referred to in the question, are provided by our e-discovery partners. As such, we are not in a position to respond to this question.
FTI Technology: Acuity is an all-in-one offering from processing through to production, including legal review. The predictive coding feature is included in the fees, so there is no additional cost - in fact it offers cost savings.
Gallivan Gallivan & O'Melia: Less than 1/10 of 1%. Since we do not charge for processing time, the only "cost" is the extra time required to process the documents. Not all clients want the delay given the perceived small % gain in time.
Hot Neuron, LLC: (no response)
InterLegis: Included in full suite of services.
Kroll Ontrack: This information is proprietary.
Recommind: Question is unclear.
Valora Technologies, Inc.: When Valora performs the ingesting, deduping, etc., there is no incremental cost to perform document tagging of any sort. This includes AutoCoding, AutoReview, etc. When Valora does not perform the preliminary steps, the cost of AutoReview usually runs between 25-50% of typical ESI processing/scanning costs. A better cost comparison is the cost of Predictive/Propagated Coding against the cost of linear review.
Xerox Litigation Services: Because our pricing models are based off of client needs and the complexity and size of the matter, incremental costs can vary from matter to matter.

XIII. Sample Sizes

The survey asked: Sampling Non-selected Records. If you use sampling of non-selected records as a way of validating your approach, what size samples do you use and how is that sample size determined?

The responses were:

Capital Legal Solutions: Using statistical random sampling techniques. Inspection batch sizes can be determined A) by desired % of records; B) by a set number of items; or C) to achieve a degree of accuracy % based on pool size and accuracy level formula.
Catalyst Repository Systems: Generally a statistically valid sample with 95% confidence level is used.
Equivio: Sample size required depends on several variables, including collection richness and size, and the required level of statistical confidence.
FTI Technology: Sample size is different for each case, depending on what we're looking for (nonresponsive versus privileged, as an example).
Gallivan Gallivan & O'Melia: We use accepted statistical methodology (acceptance sampling, statistical sampling) which includes expected responsive rate, confidence level and acceptable error rate.
Hot Neuron, LLC: n/a
InterLegis: (no response)
Kroll Ontrack: Intelligent Prioritization does not utilize sampling of non-selected records as an automated way of validating the technical approach. The system is designed to allow clients to utilize the approach of sampling non-selected documents as a companion validation of the solution if they choose to do so.
Recommind: 10,000.
Valora Technologies, Inc.: Valora samples records using random selection from across the entire population. Sample size determination is a function of the size of the population and the accuracy desired.
Xerox Litigation Services: XLS relies on statistical methods developed by our in-house statistician to calculate sound precision and recall estimates for CategoriX results. Our techniques focus on establishing extremely accurate estimates of the rates of relevance (or yields) for the client's categories in the review population as a whole. We ensure that our yield estimates are reliable by selecting random samples for review that are large enough to produce yield estimates with very narrow margins of error according to standard sample size tables. Once stable yield estimates have been established, they provide a reference point from which recall estimates can be calculated following a) the final assessment of categories to documents by CategoriX and b) the establishment of a precision estimate based on direct sampling from the set of documents classified as relevant by CategoriX. Direct sampling in the non-selected records is undertaken only in circumstances where that represents the most efficient option for establishing recall for the final result set. In those cases, the sample size for non-selected records would be dictated by the desired width of the margins of error for the resulting recall estimate.
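Most of the sampling answers above come down to a confidence level and a margin of error. For reference, the standard sample-size formula for estimating a proportion, n = z^2 p (1 - p) / e^2, is sketched below; this is a general statistical illustration, not any respondent's specific method:

```python
# Standard sample-size calculation for estimating a proportion (for example,
# the responsive rate among non-selected records): n = z^2 * p * (1 - p) / e^2,
# where z is the normal critical value for the confidence level, p the expected
# proportion (0.5 is the conservative worst case) and e the margin of error.
# General illustration only; infinite-population approximation.

import math

def sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> int:
    """Documents to sample for a proportion estimate at the given confidence/margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size())             # 385 documents for 95% confidence, +/- 5%
print(sample_size(margin=0.02))  # 2401 documents for 95% confidence, +/- 2%
```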

XIV. Set Up Efforts

The survey asked: Set-up Effort. What level of effort, in terms of time and level of people involved, is required to set up or start a PC review using your offering? To what extent can efforts expended to start up one review in your system be re-used in other reviews? To what extent can efforts expended to start up one review in your system be re-used as part of an enterprise-wide information management or retrieval system?

The responses were:

Capital Legal Solutions
Set-up effort: Dynamic Content Profiling engine is an offering built into our application. However, the time to set up is dependent upon the data set received, as we have to run several processes before we can activate the various features. History shows that we have already been able to work with clients per their time line.
Re-use in other reviews: Work flows for executing our Content Intelligence process are identifiable and reusable. However, work flows depend on the case team and their needs. We can streamline the path to take depending on the strategy that team decides to take. Our review consultants are pretty methodological when it comes to devising the most desirable, defensible, cost effective review workflow.
Re-use in enterprise system: Not currently planned to deploy as such, but could envision the use of one document corpus' prognostic scores against other matters or cross-matter document profiling.

Catalyst Repository Systems
Set-up effort: No more than at the start of any typical review. Creation of searches, scoring and initial sampling should be done by associates or higher level attorneys familiar with the case.
Re-use in other reviews: Most setup for one can be applied to another case as to review forms, views, folders, subcollections, etc.
Re-use in enterprise system: A default site is created and replicated for an unlimited number of matters.

Equivio
Set-up effort: Installation and set-up of the software takes about 1-2 hours. For each case, the software needs to be trained by an "expert" (an attorney familiar with the case) in order to estimate the relevance of documents in the specific case. This training process typically takes 1.5-2 days.
Re-use in other reviews: The training of Equivio>Relevance is specific per case/issue. If there is overlap in data or issues, the efforts and work product can be reused.
Re-use in enterprise system: As above, the training of Equivio>Relevance is specific per case/issue.

FTI Technology
Set-up effort: Nothing - it's currently part of the Acuity all-in-one service.
Re-use: Because predictive coding comes with Acuity, clients can realize great efficiencies as FTI becomes familiar with