Cohasset Associates, Inc.

Expanding Your Skill Set: How to Apply the Right Search Methods to Your Big Data Problems
Julia L. Brickell
H5 General Counsel
MER Conference
May 18, 2015
H5 POWERING YOUR DISCOVERY GLOBALLY
WWW.H5.COM INFO@H5.COM TEL: 1.866.999.4215

Corporate Data Locations
Internal: enterprise data sources
External: managed
External: cloud
Employee sources (external): Gmail, Google Docs

Identify The End Game
Goals differ
  o Find particular documents
  o Find particular document types
  o Segregate what is needed from what is not
  o Illuminate dark data
  o Defensibly dispose of unneeded information

2015 Managing Electronic Records Conference
Prepare to Use Effective Methods
Methods align regardless of purpose; tools may not
Know what you need
Employ the right expertise to find it
  o The right tools
  o The right methods
Fine tune for diverse sources
Securely dispose, if disposing

Search Superior to Manual Review
Richmond Journal of Law and Technology (2011)
Overall, the myth that exhaustive manual review is the most effective and therefore, the most defensible approach to document review is strongly refuted. Technology-assisted review can (and does) yield more accurate results than exhaustive manual review, with much lower effort.
Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf, p.48

Search Superior to Manual Review
Richmond Journal of Law and Technology (2011)
Of course, not all technology-assisted reviews are created equal. The particular processes found to be superior in this study are both interactive, employing a combination of computer and human input.
Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf, p.48
Search Results Vary Widely
TREC Interactive Task Results 2008-2010
[Figure: plot of Precision (vertical axis, 0.0 to 1.0) against Recall (horizontal axis, 0.0 to 1.0), with a high-recall, high-precision region marked. Series: 2008, 2009, 2010, Keyword Search (Blair & Maron, 1985), Manual Review (Grossman & Cormack, 2011).]
TREC: National Institute of Standards and Technology Text Retrieval Conference Legal Track

Are you meeting your goals?
The metrics: Recall and Precision
Recall
A measure of how complete the results of a retrieval effort are. Recall answers the question: Out of all the documents that a retrieval was charged with finding, what proportion did the retrieval succeed in actually finding?
Precision
A measure of how on target the results of a retrieval effort are. Precision answers the question: Out of all the documents a retrieval identified as responsive, what proportion was actually responsive?

Validating a Retrieval Effort
Scientific Perspective
There are standard information retrieval metrics that answer the key questions about the quality of a document retrieval effort: recall and precision.
There are accepted sampling methodologies for obtaining estimates of recall and precision in a relatively cost-effective way.
Sampling approaches to validation offer considerable flexibility and control over key input parameters, so that, in any given circumstance, if you think through your real information need, you can find a test that strikes the optimal balance between the information gained and the resources required.
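The two definitions above reduce to simple ratios. The following Python sketch shows one way to compute them when the truly responsive set is known; the document IDs and the function name are hypothetical, and in practice the responsive set is estimated by sampling rather than known outright.

```python
def recall_precision(retrieved: set[str], responsive: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for one retrieval effort."""
    true_hits = retrieved & responsive  # responsive documents actually found
    recall = len(true_hits) / len(responsive) if responsive else 0.0
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: four responsive documents, three retrieved.
responsive = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc5"}

recall, precision = recall_precision(retrieved, responsive)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the four responsive documents were found (recall 0.50) and two of the three retrieved documents were responsive (precision 0.67).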
Technology for Retrieval Increasingly Accepted
In the three years since Da Silva Moore, the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.
Rio Tinto PLC v. Vale S.A., Case 1:14-cv-03042-RMB-AJP (March 2, 2015)

Methodology Matters
Know the goal
Learn the data population
  o Design appropriate sampling process
  o Vet the sample documents
  o Use the knowledge to improve the sample
  o Use the goal and knowledge to select a tool
Choose appropriate tools
Use appropriate, iterative methodology
Validate results
  o Design appropriate sampling process
  o Vet the sample documents
  o Estimate recall and precision; iterate process as needed

Statistics Supports Knowledge and Choice
Yield Estimate: estimate of target documents in data set
Data set: 100,000 documents
Sample: 1,000 documents
Target docs in sample: 150 (150/1,000 = 15%)
Hence estimated yield: 15,000 of the 100,000 documents in the data set are target docs
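The yield estimate above can be reproduced in a few lines of Python. The sketch below also attaches a 95% margin of error using the normal approximation for a proportion; that interval is an added assumption, not something shown on the slide, and the function name is hypothetical.

```python
import math


def estimate_yield(sample_size: int, sample_hits: int, population_size: int,
                   z: float = 1.96) -> tuple[float, float, float]:
    """Return (point estimate, low, high) for the number of target documents.

    Assumes a simple random sample and uses the normal approximation
    for a proportion; z=1.96 corresponds to roughly 95% confidence.
    """
    p = sample_hits / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - z * standard_error)
    high = min(1.0, p + z * standard_error)
    return p * population_size, low * population_size, high * population_size


# Slide figures: 150 target documents in a 1,000-document sample of 100,000.
point, low, high = estimate_yield(sample_size=1_000, sample_hits=150,
                                  population_size=100_000)
print(f"estimated yield: {point:,.0f} (about {low:,.0f} to {high:,.0f})")
```

The point estimate matches the slide's 15,000; the width of the interval depends on the sample size, which is one of the input parameters a validation design can adjust.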
Sampling: Quality of Sample Affects Quality of Search Results
[Figure: document sources and a skewed sample drawn from them]
Sample parameters to draw sample vary by situation
  o different departments
  o different dates
  o rolling collection
  o multiple issues

Search is run on an index
Same search queries provide different results depending on the tool
  o Google
  o Exact search
  o Algorithmic search
Not all words are indexed

Token          Locations
action         3:1; 24:10; 45:112
all            3:5; 4; 23
accountants    2:2; 41::33
business       2:3; 4::56
conferences    3:12; 7:1; 88:5; 95:1
date           1:1; 4:1; 5:3; 8:13
dec            1:3; 155:9

Boolean Search: Keywords or Search Strings
  o known
  o adjustable
  o over-inclusive: anchor
  o under-inclusive: add terms
  o targets specific language
[Figure: search terms arranged around a target labeled "Relevant Documents: common cold". Labels shown: smoking, cough!, malaise, sore throat, traffic, congest!, sneez!, runny nose, allergies, flu, fever, loss w/3 appetite, virus, computers.]
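To make the index and Boolean-search slides concrete, here is a minimal Python sketch of a positional index with truncation ("cough!"-style) matching. It is an illustration only, not how any particular tool works: the stop-word list, the sample documents, and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list: these words are never indexed.
STOP_WORDS = {"the", "a", "and", "of", "is"}


def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each indexed token to its (document, word position) locations."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens, start=1):
            if token not in STOP_WORDS:
                index[token].append((doc_id, position))
    return dict(index)


def search_truncated(index: dict[str, list[tuple[int, int]]], stem: str) -> set[int]:
    """Return documents containing any token that starts with `stem`."""
    return {doc_id
            for token, locations in index.items() if token.startswith(stem)
            for doc_id, _ in locations}


docs = {
    1: "Patient reports coughing and a runny nose",
    2: "Traffic congestion delayed the conference",
    3: "Chest congestion and fever; the cough persists",
}
index = build_index(docs)

cough_docs = search_truncated(index, "cough")      # documents 1 and 3
congest_docs = search_truncated(index, "congest")  # documents 2 and 3
print(sorted(cough_docs), sorted(congest_docs), sorted(cough_docs & congest_docs))
```

The truncated term "congest" also pulls in the traffic document, which is the over-inclusiveness the slide describes; requiring a second term (the AND in the last expression) anchors the search to document 3.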
Search Strings: words you can read
TreC09_204_ST_Retention_Deletion BM
enron #w5 [data, documents, e{ }mail{s}, record{s}, evidence{s}, info{rmation}, cop[y, ies], file{s}] #w10 [shred{s, ded, ding}, destroy{s, ed, ing}]

Mathematical Search
Algorithms count and weight words and other tokens: clustering, concept, predictive
  o unknown
  o imbedded
  o hard to adjust
  o over-inclusive
  o under-inclusive
  o groups or ranks based on prevalent language
[Figure: weighted token totals (α, β) computed for Document 1 and Document 2]

Validation of Recall: looking in the discard pile
A look that falls short of validating Recall
Look in the discard pile, but do not tie results back to full-collection prevalence (i.e., do not tie to Recall).
Example: Look in the discard pile, find that 1 out of every 100 documents is actually responsive. That's good, isn't it?
It depends:
  o Good, if full-collection prevalence was 10%.
  o Bad, if full-collection prevalence was 1%.
A look that truly validates Recall
Look in the discard pile, and do tie the results back to full-collection prevalence (i.e., tie to Recall).
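The "it depends" example can be worked through numerically. The Python sketch below ties the discard-pile rate back to full-collection prevalence to obtain recall. The slide does not say how large the discard pile is, so the 90% figure is an assumption chosen for illustration, and the function name is hypothetical.

```python
def recall_from_discard_rate(prevalence: float, discard_rate: float,
                             discard_fraction: float) -> float:
    """Estimate recall from a discard-pile sample.

    prevalence       -- share of the full collection that is responsive
    discard_rate     -- share of the discard pile found to be responsive
    discard_fraction -- share of the full collection sitting in the discard pile
    """
    missed = discard_rate * discard_fraction  # responsive docs left behind
    return max(0.0, 1.0 - missed / prevalence)


DISCARD_FRACTION = 0.90  # assumption: 90% of the collection was discarded

# Same discard-pile finding (1 in 100 responsive), two different prevalences.
recall_if_10_pct = recall_from_discard_rate(0.10, 0.01, DISCARD_FRACTION)
recall_if_1_pct = recall_from_discard_rate(0.01, 0.01, DISCARD_FRACTION)
print(f"prevalence 10%: recall {recall_if_10_pct:.0%}")
print(f"prevalence  1%: recall {recall_if_1_pct:.0%}")
```

Under that assumption the same 1-in-100 finding corresponds to recall of about 91% when prevalence is 10%, but only about 10% when prevalence is 1%.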
Statistics Supports Knowledge and Choice
Sample of results
Tagged Data: 10,000 documents
  o 1,000-doc sample: 700 target docs
  o 10,000 x 70% correct = 7,000 target docs tagged
Not Tagged Data: 90,000 documents
  o 1,000-doc sample: 90 target docs
  o 90,000 x 9% missed = 8,100 target docs missed
46% recall: 7,000/15,100
More target documents missed than tagged.

The Biomet Example
In re: Biomet M2a Magnum Hip Implant Products Liability Litigation (MDL 2391)
Background
Defendant followed a two-stage review process:
  o Keyword culling (+ deduplication); 19.5m -> 3.9m -> 2.5m
  o Predictive coding (+ human review).
Plaintiffs contend that the keyword culling step left a lot of responsive documents behind and that predictive coding should be applied to the full 19.5m collection.
Defendant objects, saying that, while they are willing to entertain additional keywords from plaintiffs and to produce additional nonprivileged docs from the 2.5m culled-in subset, they are not willing to apply predictive coding to the full 19.5m collection.

The Biomet Example
The Judge's Ruling
The Judge ruled in favor of Defendant, based in part on:
  o Rules and standards governing discovery process
  o Proportionality considerations
  o Numbers derived from Defendant's sampling of the collection.
With regard to the numbers, the Judge observed:
  o Sampling of the full collection found that between 1.37% and 2.47% of the full collection was responsive;
  o Sampling of the discard pile left behind by the keyword cull found that between 0.55% and 1.33% of the discard pile was responsive.
Therefore, were Defendant to apply predictive coding to the full collection, a comparatively modest number of [additional responsive] documents would be found (p. 5).
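The sampling arithmetic on this page can be checked with a short Python sketch. It first reproduces the tagged / not-tagged recall estimate, then converts the percentages the Judge cites into document counts and a recall figure. Two inputs are assumptions made for illustration: the discard pile is taken to be the 19.5m collection minus the 3.9m keyword hits, and the second-stage recall is set to 70%. All function and variable names are hypothetical.

```python
def recall(found: float, missed: float) -> float:
    """Recall = target documents found / all target documents."""
    return found / (found + missed)


# Example 1: 1,000-document samples drawn from each set.
tagged_found = 10_000 * 700 / 1_000    # 7,000 target docs tagged
untagged_missed = 90_000 * 90 / 1_000  # 8,100 target docs missed
sample_recall = recall(tagged_found, untagged_missed)
print(f"sample recall: {sample_recall:.0%}")

# Example 2: Biomet keyword cull.
FULL_COLLECTION = 19_500_000
KEYWORD_HITS = 3_900_000
DISCARD_PILE = FULL_COLLECTION - KEYWORD_HITS  # assumption: 15.6m left behind


def count_range(size: int, low_pct: float, high_pct: float) -> tuple[int, int]:
    """Convert a percentage range into a range of document counts."""
    return round(size * low_pct / 100), round(size * high_pct / 100)


full_range = count_range(FULL_COLLECTION, 1.37, 2.47)
discard_range = count_range(DISCARD_PILE, 0.55, 1.33)

# Use the midpoint of each range as the point estimate.
full_mid = sum(full_range) / 2
discard_mid = sum(discard_range) / 2
keyword_recall = recall(full_mid - discard_mid, discard_mid)

STAGE_TWO_RECALL = 0.70  # assumption: recall of the predictive coding stage
overall_recall = keyword_recall * STAGE_TWO_RECALL

print(f"responsive in full collection: {full_range[0]:,} to {full_range[1]:,}")
print(f"responsive in discard pile:    {discard_range[0]:,} to {discard_range[1]:,}")
print(f"keyword-cull recall: {keyword_recall:.1%}; two-stage recall: {overall_recall:.1%}")
```

Multiplying the stage recalls treats the second stage as finding 70% of whatever responsive material survived the keyword cull, which is why losses at the first stage cannot be recovered later.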
The Biomet Example
What the numbers say
Between 267,150 and 481,650 responsive documents reside in the full collection.
Between 85,800 and 207,480 responsive documents reside in the set left behind by the keyword cull.
Taking the midpoints of the two ranges, Recall of the keyword cull is 60.8%;
  o i.e., nearly 40% of the responsive documents are left behind by the keyword cull.
Recall of the entire two-stage process only falls further when the predictive coding stage is taken into account.
  o Assuming predictive coding achieves 70% recall (a generous assumption), overall recall of the two-stage process is 42.6%; over half of the responsive documents are left behind.

Implications?
Are there business, legal or ethical implications arising from the quality of the retrieval?
Likely.

Business, Legal or Ethical Implications?
Search designers rely on keywords created in a conference room based on assumptions about how the business might discuss the targeted content. As a result, a large amount of responsive data is missed. Sampling would have demonstrated the gap.
Search design comprises over-inclusive keywords or technology, resulting in retrieval of a vast, largely off-target data set. Sampling would have demonstrated the overage.
Algorithmic search tool used in an investigation differentiates based on the most prevalent language in documents and ranks very low documents with nuanced language indicating fraud.
Clustering algorithm groups standard form contracts together but misses informal agreements with business partners.
Records management exercise plans disposal of paper copies believed to overlap with scanned electronic copies. Records are required by regulators. Sampling exercise to compare sets is improperly designed.
You need to draw on the right kinds of expertise if you're going to get a sound answer

References
TREC 2008
Overview of the Legal Track, http://trec.nist.gov/pubs/trec17/papers/legal.overview08.pdf
TREC 2009
Overview of the Legal Track, http://trec.nist.gov/pubs/trec18/papers/legal09.overview.pdf
TREC 2010
Overview of the Legal Track, http://trec.nist.gov/pubs/trec19/papers/legal10.overview.pdf
Blair and Maron 1985
Blair, David C., and M. E. Maron. 1985. An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System. Communications of the ACM 28 (3): 289-299
Grossman and Cormack 2011
Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf
In re: Biomet
In re: Biomet M2a Magnum Hip Implant Prods. Liab. Litig., No. 3:12-MD-2391, Order Regarding Discovery of ESI (N.D. Ind. Apr. 18, 2013)

H5. 2014