Information Retrieval for E-Discovery Douglas W. Oard


Information Retrieval for E-Discovery
Douglas W. Oard
College of Information Studies and Institute for Advanced Computer Studies
University of Maryland, College Park
Joint work with: Mossaab Bagdouri, Jason Baron, David Doermann, Gordon Cormack, Sergey Golitsynskiy, Maura Grossman, Bruce Hedin, David Kirsch, David Lewis, Falk Scholer, Ian Soboroff, Paul Thompson, Stephen Tomlinson, Jyothi Vinjumur, William Webber
January 20, 2015
University of Florida

Early E-Discovery
Clinton White House search request: Tobacco Policy
Figures shown: 32 million emails; 80,000; 200,000
National Archives hired 25 persons for 6 months

Federal Rules of Civil Procedure
Rule 26(f): At the parties' planning meeting, issues expected to be discussed include:
  Any issues relating to disclosure or discovery of electronically stored information, including the form or forms in which it should be produced
  Any issues relating to preserving discoverable information

Document Review: The Black Box
[Diagram labels: Case Knowledge, Unprocessed Documents, Coded Documents]

Inside Yesterday's Black Box
[Diagram labels: Case Knowledge, Unprocessed Documents, Coded Documents]

Inside Today's Black Box: Keyword Search & Linear Review
[Diagram labels: Case Knowledge, Reasoning, Representation, Interaction, Unprocessed Documents, Coded Documents]

Judge Grimm, writing for the U.S. District Court for the District of Maryland:
"all keyword searches are not created equal; and there is a growing body of literature that highlights the risks associated with conducting an unreliable or inadequate keyword search"
Victor Stanley, Inc. v. Creative Pipe, Inc., ---F.Supp.2d---, 2008 WL 2221841, *3 & n.9 (D. Md. May 29, 2008)

Inside Tomorrow's Black Box: Technology Assisted Review
[Diagram labels: Case Knowledge, Reasoning, Representation, Interaction, Unprocessed Documents, Coded Documents]

A Process View

What Does Better Mean?
[Figure: y-axis, INCREASING SUCCESS (finding relevant documents); x-axis, INCREASING EFFORT (time, resources expended, etc.); curves labeled Baseline Technique and Better Technique; points A, B, C, D.]

Outside Our Experience Base

                     Web Search       E-Discovery
Participants         1                50
Duration             10 seconds       10 weeks
Budget               $0.001           $1,000,000
Goal                 High precision   High recall
Relevance            Personal         Authoritative
Stopping criterion   Satisfaction     Reasonableness

An IR-Centric Process Model
[Diagram labels: Formulation, Production request, Acquisition, Collection, Review for Relevance, Responsive ESI, Review for Privilege, Production, Sensemaking, Insight]

Complaint and Production Request

12. On January 1, 2002, Echinoderm announced record results for the prior year, primarily attributed to strong demand growth in overseas markets, particularly China, for its products. The announcement also touted the fact that Echinoderm was unique among U.S. tobacco companies in that it had seen no decline in domestic sales during the prior three years.

13. Unbeknownst to shareholders at the time of the January 1, 2002 announcement, defendants had failed to disclose the following facts which they knew at the time, or should have known:
  a. The Company's success in overseas markets resulted in large part from bribes paid to foreign government officials to gain access to their respective markets;
  b. The Company knew that this conduct was in violation of the Foreign Corrupt Practices Act and therefore was likely to result in enormous fines and penalties;
  c. The Company intentionally misrepresented that its success in overseas markets was due to superior marketing.
  d. Domestic demand for the Company's products was dependent on pervasive and ubiquitous advertising, including outdoor, transit, point of sale and counter top displays of the Company's products, in key markets. Such advertising violated the marketing and advertising restrictions to which the Company was subject as a party to the Attorneys General Master Settlement Agreement ("MSA").
  e. The Company knew that it could be ordered at any time to cease and desist from advertising practices that were not in compliance with the MSA and that the inability to continue such practices would likely have a material impact on domestic demand for its products.

All documents which describe, refer to, report on, or mention any in-store, on-counter, point of sale, or other retail marketing campaigns for cigarettes.

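Estimating Retrieval Effectiveness
First region: sampling rate = 6/10, so each Rel counts 10/6; 4 of the 6 sampled documents are relevant (67% relevant in this region)
Second region: sampling rate = 3/10, so each Rel counts 10/3; 1 of the 3 sampled documents is relevant (33% relevant in this region)
estRel(S) = \sum_{d \in JudgedRel(S)} \frac{1}{p(d)}, where p(d) is the sampling rate applied to document d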
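
The inverse-probability weighting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the track's evaluation code: the function name estimate_relevant is hypothetical, and the inputs are just the two regions shown on the slide (4 judged-relevant documents sampled at 6/10, 1 judged-relevant document sampled at 3/10).

```python
from fractions import Fraction


def estimate_relevant(judged_relevant_probs):
    """Sum 1/p(d) over the sampled documents that were judged relevant.

    judged_relevant_probs: one sampling probability p(d) per judged-relevant
    document d (hypothetical input format chosen for this sketch).
    """
    return sum(1 / p for p in judged_relevant_probs)


# Slide example: 4 relevant documents sampled at rate 6/10,
# and 1 relevant document sampled at rate 3/10.
probs = [Fraction(6, 10)] * 4 + [Fraction(3, 10)] * 1
est_rel = estimate_relevant(probs)

print(float(est_rel))  # 4 * (10/6) + 1 * (10/3) = 10.0
```

Exact fractions are used only so that the arithmetic matches the slide's 10/6 and 10/3 weights without rounding.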

[Figure: bar chart, 2008 Ad Hoc task. Estimated Relevant (26 topics), labeled 82K; Estimated Highly Relevant (24 topics), labeled 12K. Vertical axis from 0 to 700,000.]

Est. Boolean Recall: All Relevant
[Figure: 2008 Ad Hoc task, 26 topics; vertical axis from 0.0 to 1.0; labeled value 0.33.]

Est. Boolean Recall: Highly Relevant
[Figure: 2008 Ad Hoc task, 24 topics; vertical axis from 0.0 to 1.0; labeled value 0.42.]

Inter-Annotator Agreement
2010 Interactive task, Enron email families, Topic 301

A1      A2      Messages
R       R       72
R       N       32
N       R       27
N       N       510
TOTAL           641

Positive Overlap = 0.550
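
The positive overlap figure can be reproduced from the counts on this slide. The sketch below assumes the usual definition of overlap (documents both assessors marked relevant, divided by documents either assessor marked relevant); the slide does not spell the definition out, but this one yields the reported value. The function name positive_overlap is hypothetical.

```python
def positive_overlap(both_relevant, only_first, only_second):
    """Overlap on the relevant class: |A1 and A2| / |A1 or A2|.

    Documents that both assessors judged non-relevant do not enter the ratio.
    """
    union = both_relevant + only_first + only_second
    if union == 0:
        raise ValueError("no document was judged relevant by either assessor")
    return both_relevant / union


# Counts from the slide, keyed by (A1 judgment, A2 judgment).
counts = {("R", "R"): 72, ("R", "N"): 32, ("N", "R"): 27, ("N", "N"): 510}

overlap = positive_overlap(counts[("R", "R")], counts[("R", "N")], counts[("N", "R")])
print(round(overlap, 3))  # 0.55, matching the slide's 0.550
```

Note that the 510 messages both assessors marked non-relevant are ignored, which is why the overlap is much lower than the raw agreement rate.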

Evaluation Design Interactive Task

TREC Interactive Task
  Create Complaint & Production Requests (Topics)
  Team-TA Interaction & Classifier Training
  Sampling & First-Pass Assessment
  Appeal & Adjudicate
  Analysis & Reporting
[Role labels, in diagram order: Coordinators, TA, Teams, TA, Coordinators, Assessors, TA, Teams, TA, Coordinators, Teams]

UB  Cl  H5  Pitt  AdHoc          N       n       a       r
R   R   R   R     R          5,727      46      46      38
R   R   R   R     N             24       5       5       4
R   R   R   N     R         11,965      98      98      78
R   R   R   N     N            995       9       9       9
R   R   N   R     R            131       5       5       3
R   R   N   R     N              0       0       0       0
R   R   N   N     R          1,547      13      13       2
R   R   N   N     N            220       5       5       2
R   N   R   R     R          1,901      15      15      11
R   N   R   R     N             46       5       5       2
R   N   R   N     R         17,082     145     145     111
R   N   R   N     N         10,291      84      84      61
R   N   N   R     R            176       5       5       1
R   N   N   R     N             19       5       5       2
R   N   N   N     R          7,679      62      61      23
R   N   N   N     N          9,531      77      77      17
N   R   R   R     R          8,068      65      65      49
N   R   R   R     N            101       5       5       2
N   R   R   N     R         73,280     541     540     393
N   R   R   N     N         28,409     235     235     146
N   R   N   R     R          1,185      10      10       4
N   R   N   R     N             37       5       4       3
N   R   N   N     R         23,688     193     193      84
N   R   N   N     N         20,078     171     164      57
N   N   R   R     R          5,321      43      43      33
N   N   R   R     N            371       5       5       2
N   N   R   N     R        151,787     800     795     552
N   N   R   N     N        293,439   1,100   1,095     621
N   N   N   R     R          2,253      18      18       6
N   N   N   R     N            456       5       5       2
N   N   N   N     R        526,099   1,100   1,087     234
N   N   N   N     N      5,708,286   1,625   1,579     111
TOTAL                    6,910,192   6,500   6,421   2,663
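
A table like this supports a stratified estimate of how many relevant documents the collection contains. The sketch below is an illustration under stated assumptions, not the track's actual estimator: it assumes N is the stratum size, n the number of documents sampled, a the number actually assessed, and r the number assessed as relevant (the slide does not define these columns), and it scales each stratum's observed relevant fraction r/a up to the stratum size. The function name and the three-row subset are chosen only for illustration.

```python
def estimate_relevant_in_stratum(stratum_size, assessed, relevant):
    """Scale the observed relevant fraction r/a up to the stratum size N.

    Returns 0.0 when nothing was assessed (for example, an empty stratum).
    """
    if assessed == 0:
        return 0.0
    return stratum_size * relevant / assessed


# Three rows copied from the table: (team pattern, N, n, a, r).
# The meaning of N, n, a, r is an assumption, as noted above.
strata = [
    ("RRRRR", 5727, 46, 46, 38),
    ("RRRRN", 24, 5, 5, 4),
    ("RRNRN", 0, 0, 0, 0),
]

per_stratum = {
    pattern: estimate_relevant_in_stratum(size, assessed, relevant)
    for pattern, size, _sampled, assessed, relevant in strata
}
total_estimate = sum(per_stratum.values())

for pattern, estimate in per_stratum.items():
    print(pattern, round(estimate, 1))
print("subset total:", round(total_estimate, 1))
```

Restricting the sum to strata where a given team's flag is R, and dividing by the sum over all strata, would give an estimate of that team's recall under the same assumptions.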

2009 Results (pre-adjudication)
[Figure: precision (vertical axis, 0.0 to 1.0) against recall (horizontal axis, 0.0 to 1.0) for Topics 201 through 207 (2009).]

2009 Results (post-adjudication)
[Figure: precision (vertical axis, 0.0 to 1.0) against recall (horizontal axis, 0.0 to 1.0) for Topics 201 through 207 (2009).]

2009 Results (pre- to post-adjudication)
[Figure: precision (vertical axis, 0.0 to 1.0) against recall (horizontal axis, 0.0 to 1.0) for Topics 201 through 207 (2009).]
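
For readers less familiar with the axes of these plots, the following sketch computes precision and recall, plus the F1 measure that the later slides use, from raw counts. The counts are invented for illustration and are not taken from any TREC run; the function name is hypothetical.

```python
def precision_recall_f1(true_positives, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts.

    true_positives: relevant documents that the system returned
    retrieved:      all documents that the system returned
    relevant:       all relevant documents in the collection
    """
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical run: 100 documents returned, 80 of them relevant,
# out of 200 relevant documents overall.
p, r, f1 = precision_recall_f1(true_positives=80, retrieved=100, relevant=200)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
# precision=0.80 recall=0.40 F1=0.533
```

F1 is the harmonic mean of the two, so a run with high precision but low recall, as in this example, still scores well below its precision.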

First-Pass Assessor Errors

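Minimizing Downside Risk
[Figure: axis labels F_1 and Annotations; legend Training, Test; marked quantities F_1, \hat{F}_1, τ, θ, with α = 0.05.]

Fixed test set, growing training set
[Figure: axis labels F_1 and Annotations; legend Training, Test; marked quantities F_1, \hat{F}_1, τ, θ.]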

Fixed test set, growing training set
Collection = RCV1, Topic = M132, Freq = 3.33%

Stop Criterion    Success
Desired           95.00%
\hat{F}_1 τ       46.42%
θ τ               91.87%

[Figure: horizontal axis, Training documents; legend Training, Test; τ marked.]

Minimizing Total Cost
[Figure: Training + Test ($$$); labels τ, Test, True F_1, Training.]

Policies
[Figure: Training + Test ($$$) against Training documents; Topic = C18, Frequency = 6.57%.]

E-Discovery Test Collections
IIT CDIP (Scanned)
  7 million documents (no de-dupe performed)
  Uncorrected OCR text
EDRM Enron Version 2 (Email)
  455K messages (after de-dupe)
  Extracted text
Avocado (Email)
  614K messages (after de-dupe)
  Extracted text and metadata (e.g., calendar entries)

For More Information
FnTIR survey: http://ediscovery.umiacs.umd.edu
TREC Legal Track: http://trec-legal.umiacs.umd.edu
Sedona Conference Working Group 1: http://www.thesedonaconference.org
DESI Workshops: http://www.umiacs.umd.edu/~oard/desi6/

This work has been supported in part by the National Science Foundation under grant IIS-1065250. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.