How to Keep OCR Errors from Spoiling Your ediscovery Party
ACEDS Webinar, May 21st, 2014
© 2002-2013 Nuance Communications, Inc. All rights reserved.
ACEDS Membership Benefits
Training, Resources and Networking for the E-Discovery Community!
Exclusive News and Analysis! Weekly Web Seminars! Podcasts! On-Demand Training! Networking! Resources! Jobs Board & Career Center! bits + bytes Newsletter! CEDS Certification! And Much More!
"ACEDS provides an excellent, much needed forum to train, network and stay current on critical information." Kimarie Stratos, General Counsel, Memorial Health Systems, Ft. Lauderdale
Join Today! aceds.org/join or Call ACEDS Member Services 786-517-2701
Speaker Introduction
Greg Gies, Director, Product Marketing, Imaging, Nuance Communications
Leads go-to-market planning for Nuance's print, capture & PDF solutions.
Agenda
What OCR is.
What OCR errors are and what causes them.
Why ediscovery professionals should care.
How common these problems are.
What can be done to prevent OCR errors.
How to correct OCR errors when they happen.
What is OCR?
Wikipedia: "Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded / computer-readable text."
2009 Computerworld article: "Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text characters into character codes."
I say, OCR is the digital transcription of bitmaps containing machine-printed text into encoded text characters, using a coding scheme such as ASCII, which among other capabilities enables indexing software to decipher textual elements contained within bitmaps. Encoded text is searchable text.
What are OCR errors and what causes them?
Types of OCR errors
Transcription errors: Example: "learning" becomes "leaming". Result: misspelled words. Impact: unsearchable. Proportional fonts are especially problematic.
Formatting errors: Example: "learning" becomes "l e a r n i n g". Result: poor legibility. Impact: unsearchable.
Deleted metadata: Result: data loss. Impact: metadata is unsearchable.
Some image defects and causes
Faulty printing equipment
DEFECT: Toner specks; Vertical lines; Toner smear; Gray background; Page skew; Light/dark print
CAUSE: Defective toner cartridge; Worn rollers; Wrong settings; Worn pickup roller; Low toner or ink cartridge; Clogged nozzles
Paper or form elements; Faulty scanning equipment
DEFECT: Halftones; Vertical/horizontal lines; Noise; Specks; Page skew; Light/dark image
CAUSE: Colored paper; Carbon copies; Shaded and lined forms; Low/high contrast background; Dirty platen; Feeder misadjusted; Worn pickup roller; Low resolution; Misfeed
Examples of image defects
[Figure: sample images showing specks, fuzzy edges, color background, halftone, and gridlines]
Cause & effect
[Figure: scanned sample showing a skewed image, dark text, and toner specks]
OCR results... ;., 1HIt#~!':tl'Ol\<fLo\i;l;i\~: 'do ~""r.bf...,200., by- ~ ODNEY D~"'!\Ob$Y~s.IN(;.. ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) JOE STANDUi' ~Irnown """,,"). _ and SclIa _<01"'"""y 'l>eknowhllereinasutlj.e paiti~s",'......... "
Formatting errors
[Figure: original image compared with its OCR conversion to .docx]
Why should ediscovery professionals care about OCR errors?
PDF Files Created by Scanners Aren't Necessarily Searchable
1. Unlike PDF files that were born digital, a PDF file from a scanner is an image of a paper document.
2. While the text in an image may appear similar to, or the same as, text in a born-digital PDF, it's invisible as far as search algorithms are concerned.
3. Images aren't searchable until processed with OCR.
4. Searchable PDF contains OCR output in its metadata.
Digitized Versus Native Documents
Why special handling is required
Electronic: Originated online; Many native file formats; Encoded text is machine-readable; ediscovery software is designed for these documents
Digitized: Scans of paper originals; TIFF & PDF are common formats; Text is a bitmap, not encoded; OCR makes scans machine-readable
All Searchable PDFs Aren't Equally Findable
All Searchable PDFs Aren't Equally Findable
OCR Word Accuracy Matters
Character vs. word accuracy: 1 character error = 1 word error. 1 char / 10,000 = .0001; 1 word / 1,000 = .001.
Seemingly small differences in OCR error rate lead to very large differences in errors: 98.25% vs. 96.05% over 948 pages is a ~2% delta, or over 12,000 more word errors.
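The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the figures of 600 words per page and 10 characters per word are assumptions chosen for the example, not numbers given in the webinar.

```python
def word_error_rate_from_chars(char_error_rate: float, chars_per_word: int) -> float:
    """Approximate word error rate when each bad character spoils a different word.

    One wrong character makes the whole word unsearchable, so the same error
    weighs more heavily when it is counted per word than per character.
    """
    return min(1.0, char_error_rate * chars_per_word)


def word_errors(pages: int, words_per_page: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given OCR word accuracy."""
    total_words = pages * words_per_page
    return round(total_words * (1.0 - word_accuracy))


def extra_word_errors(pages: int, words_per_page: int,
                      better_accuracy: float, worse_accuracy: float) -> int:
    """How many more word errors the less accurate OCR run produces."""
    return (word_errors(pages, words_per_page, worse_accuracy)
            - word_errors(pages, words_per_page, better_accuracy))


if __name__ == "__main__":
    PAGES = 948            # from the slide
    WORDS_PER_PAGE = 600   # assumption for illustration only
    CHARS_PER_WORD = 10    # assumption for illustration only

    print(word_error_rate_from_chars(1 / 10_000, CHARS_PER_WORD))
    print(extra_word_errors(PAGES, WORDS_PER_PAGE, 0.9825, 0.9605))
```

Under these assumptions the gap between the two accuracy rates comes out above 12,000 word errors, which is in line with the slide; a different words-per-page figure would scale the result proportionally.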
PDF Isn't Just PDF Anymore
Improper handling may lead to data loss
Electronic sticky notes & stamps obscure the text behind them
PDF contains native pages or elements
PDF contains digitized pages or elements
OCR software may flatten the image, i.e., convert PDF to TIF, then OCR
Four Important Ways PDF is Different
1. PDF files can be assembled from multiple files.
2. PDF elements can be rearranged. [Figure: user-copied snip of a scanned document]
3. PDF files can have multiple layers.
4. PDF files can be born digital or created by a scanner.
Why These Differences Matter
When these properties converge with scanned text, ediscovery pitfalls arise that can lead to inadvertent data loss due to processing errors.
Single PDF Page Can Contain Both Born Digital and Scanned Content
[Figure: This PDF contains 2 scanned pages.]
Notes, Text Boxes, Callouts, Stamps, Etc. Can Be Overlaid On Images
How Data Gets Inadvertently Destroyed
OCR may flatten objects overlaid on text, making the text underneath unreadable and metadata unsearchable.
Why Is Data Destruction A Problem?
"Spoliation is the destruction or significant alteration of evidence, or the failure to preserve property for another's use as evidence in pending or reasonably foreseeable litigation."
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
Penalties for Spoliation
$2,750,000: United States v. Philip Morris USA, Inc.
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
What If The Data Was Destroyed Unintentionally? I'm Okay, Right?
"The intent to alter or destroy electronic data is not required for spoliation to occur."
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php
Other Reasons Data Loss is Bad
Spoliation isn't the only concern
Destruction of evidence: affects the case outcome.
Time wasted recreating lost data: reduced profitability.
Client relations: loss of credibility and future business.
How common are OCR errors?
Typical OCR Word Accuracy Rates
Firms With Initiatives To Convert Paper Documents And Process To Digital
[Chart: percentage of respondents (Total) answering Yes, No, or Don't know]
What Percentage Of The Knowledge Workers Have Access To Scanners?
[Chart: percentage of workers with scanner access. Segments, in chart order: Total, Education, Financial, Healthcare, Insurance, Legal. Values, in chart order: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%]
The Use Of Scanning Within Organizations
[Chart: percentage of respondents (Total) reporting that scanning is Increasing, Staying the same, or Decreasing]
Why scanning is increasing
[Chart: percentage of respondents (Total) by reason: Individuals are scanning more documents; More people have been given access to scanning; Mix of more documents being scanned and more people gaining access to scanning]
N = 158 respondents who have seen an increase in scanning
What can be done to prevent OCR errors from occurring?
Image enhancement filters out defects
Despeckle
Smooth Characters
Halftone Removal
Remove Gridlines
Color Background Removal
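To make one of these filters concrete, here is a minimal despeckle sketch in Python. It is not the algorithm used by any particular OCR product; it simply assumes a black-and-white bitmap stored as a list of rows (1 = ink, 0 = background) and erases ink pixels that have too few ink neighbors. The function name and the `min_neighbors` parameter are hypothetical.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink (black), 0 = background (white)


def despeckle(image: Bitmap, min_neighbors: int = 1) -> Bitmap:
    """Return a copy of the bitmap with isolated ink pixels removed.

    An ink pixel is kept only if at least `min_neighbors` of its eight
    surrounding pixels are also ink. Neighbors are counted on the original
    image so that earlier removals do not affect later decisions.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors < min_neighbors:
                cleaned[y][x] = 0  # treat the lone pixel as a speck
    return cleaned


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 1, 0, 1, 0],  # the lone pixel in column 3 is a toner speck
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in despeckle(page):
        print(row)
```

Raising `min_neighbors` removes larger specks but also risks erasing thin strokes, periods, and the dots on letters such as "i", which is one reason real tools tune this step carefully.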
Selectively Processing PDF Files Isn't a Viable Strategy
A client delivers a disk with millions of files. How do you know which PDFs are compound and require special handling? It's kind of like looking for needles in a haystack.
Segregate All PDF Files Before Conversion
After collecting documents, but before processing, review and analysis:
Identify all PDF documents
Move them to a separate directory to be run through OCR
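The segregation step can be sketched with the Python standard library. This is a rough illustration under stated assumptions, not a production workflow: the function and directory names are hypothetical, PDFs are recognized by the "%PDF-" signature near the start of the file rather than by extension, and files are copied rather than moved so that the collected set stays untouched.

```python
import shutil
from pathlib import Path
from typing import List

PDF_SIGNATURE = b"%PDF-"


def is_pdf(path: Path) -> bool:
    """Identify a PDF by its header bytes instead of trusting the extension."""
    try:
        with path.open("rb") as handle:
            # Assumption: the signature appears within the first 1024 bytes.
            return PDF_SIGNATURE in handle.read(1024)
    except OSError:
        return False


def segregate_pdfs(collection_dir: Path, ocr_queue_dir: Path) -> List[Path]:
    """Copy every PDF under collection_dir into ocr_queue_dir.

    The relative folder structure is preserved so that each copy can be
    traced back to where it was collected.
    """
    collection_dir = collection_dir.resolve()
    ocr_queue_dir = ocr_queue_dir.resolve()
    ocr_queue_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() lists the files up front, before any copies are created.
    for path in sorted(collection_dir.rglob("*")):
        if not path.is_file() or ocr_queue_dir in path.parents:
            continue  # skip folders and anything already in the queue
        if is_pdf(path):
            target = ocr_queue_dir / path.relative_to(collection_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also keeps file timestamps
            copied.append(target)
    return copied
```

A call such as `segregate_pdfs(Path("collection"), Path("ocr_queue"))` would then leave a queue folder that can be handed to the OCR tool as a batch.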
Pre-process all PDF files with OCR
Text images are potentially hidden within files.
Newer OCR tools will make image text searchable without disturbing the rest of the data within the document.
This ensures that all text within every PDF document can be searched by the ediscovery system.
Files should also be converted to PDF/A at the same time so they are ready for court filing.
How to correct OCR errors when they happen
Post-processing error-correction options
1. Proofreading: too time-intensive
2. Run an automated spell check
Effective at identifying and correcting spelling errors
Doesn't solve contextual errors
Example: "How is you day?" is spelled correctly but is clearly wrong. If searching for "your", this instance won't be found.
No commercial solutions today solve this problem
It helps to understand this potential problem
Can either manually review and correct, or adjust the search strategy to minimize the impact, e.g., fuzzy search techniques
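As a rough illustration of a fuzzy search technique, the sketch below matches words within a small edit distance of the search term. It is one possible approach, not a description of how any ediscovery product searches; the function names and the `max_distance` parameter are hypothetical.

```python
import re
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_find(term: str, text: str, max_distance: int = 2) -> List[Tuple[int, str]]:
    """Return (offset, word) pairs for words within max_distance edits of term."""
    term = term.lower()
    hits = []
    for match in re.finditer(r"[A-Za-z']+", text):
        if edit_distance(term, match.group().lower()) <= max_distance:
            hits.append((match.start(), match.group()))
    return hits


if __name__ == "__main__":
    # "rn" misread as "m" is a classic transcription error.
    print(fuzzy_find("learning", "Machine leaming improves OCR."))
    # A contextual error that a spell checker would not flag.
    print(fuzzy_find("your", "How is you day?", max_distance=1))
```

A larger `max_distance` recovers more OCR-damaged words but also returns more false positives, so the threshold is a trade-off between recall and review effort.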
Final Thought: Why You Should Care
If a document was important enough to print, then to scan, it is important!
General business / transaction documents: purchase orders, invoices, contracts, employee identification documents
Healthcare organizations / patient medical records: physician's notes, discharge summaries, test results, post-operative reports, etc.
Personal identification documents: driver's licenses, Social Security cards, professional certificates
Public institutions / police records: incident / accident reports, police logs, court records
Insurance & banking: claims documents, medical records, financial records
Schools: inoculation records, transcripts, applications
Q&A
Next Steps
A recording of today's webinar will be available shortly for you to review at your leisure.
Contact Nuance with any questions you may have: 781-565-5000 or imaging@nuance.com
For more information visit: http://www.nuance.com/for-business/by-industry/legal/legal-solution
Try our 30-day free trial of Power PDF at http://www.powerpdf.com
Thank you